Chapter 7.3 · High Availability and Disaster Recovery

Your cluster now autoscales intelligently — but scaling is meaningless if the foundation crumbles. What happens when an entire availability zone fails? When control plane nodes go dark? This chapter is about building foundations that survive earthquakes, keeping copies of what matters, and practicing the unthinkable so it never becomes unmanageable.

Analogy: City Emergency Preparedness

A well-prepared city has redundant infrastructure — multiple power grids and duplicate bridges — so citizens never notice a single failure. This is High Availability. The city stores off-site records — copies of vital documents in a secure vault across the state. This is backup strategy. The mayor's office keeps emergency response plans at different readiness levels — from a skeleton crew to a fully staffed center. These are Disaster Recovery patterns. And the city runs fire drills so everyone knows their role when alarms sound. This is Chaos Engineering: practicing failure to prevent surprise.

Kubernetes High Availability: The Multi-Zone Control Plane

A highly available cluster ensures no single component failure causes an outage. The control plane components — API Server, Scheduler, Controller Manager — run as multiple instances across different failure domains. The API Server is fronted by a load balancer; Scheduler and Controller Manager use leader election to fail over within seconds if the leader dies.
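On a cluster where you can read the kube-system namespace, leader election shows up as Lease objects. The sketch below assumes the usual lease names kube-scheduler and kube-controller-manager; the helper name show_leaders is made up for illustration, and a managed control plane may not expose these leases at all.

```shell
# show_leaders: print which instance currently holds each leader-election lease.
show_leaders() {
  kubectl -n kube-system get lease kube-scheduler kube-controller-manager \
    -o custom-columns=NAME:.metadata.name,HOLDER:.spec.holderIdentity
}
```

Run show_leaders, stop the control plane node named in the HOLDER column, and run it again: the holder should change once the old lease expires.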

etcd Quorum: The Critical Piece

etcd holds the entire cluster state and requires a quorum: more than half of the members must agree on any change. A 3-member cluster tolerates 1 failure; a 5-member cluster tolerates 2. Avoid even member counts, because the extra member raises the quorum without raising fault tolerance: a 2-member cluster cannot survive any failure (losing one member loses quorum), and a 4-member cluster still tolerates only 1.
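The quorum arithmetic can be checked with a few lines of shell. This is only a sketch of the majority rule described above; the function names quorum and fault_tolerance are made up for illustration and are not part of etcd or etcdctl.

```shell
# Majority needed for a cluster of N members: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while the survivors still form a majority.
fault_tolerance() {
  echo $(( $1 - $(quorum "$1") ))
}

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Comparing the rows for 2 and 3 members, and for 4 and 5, shows why odd sizes are the usual choice.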

Visual Description:

graph TD
  subgraph "Availability Zone A"
    CP1[API Server]
    ETCD1[etcd Member 1]
    NODE1[Worker Nodes]
  end
  subgraph "Availability Zone B"
    CP2[API Server]
    ETCD2[etcd Member 2]
    NODE2[Worker Nodes]
  end
  subgraph "Availability Zone C"
    CP3[API Server]
    ETCD3[etcd Member 3]
    NODE3[Worker Nodes]
  end
  LB[Load Balancer] --> CP1
  LB --> CP2
  LB --> CP3
  ETCD1 <-->|raft| ETCD2
  ETCD2 <-->|raft| ETCD3
  ETCD1 <-->|raft| ETCD3
  style LB fill:#ffcc80
  style CP1 fill:#90caf9
  style CP2 fill:#90caf9
  style CP3 fill:#90caf9
  style ETCD1 fill:#ce93d8
  style ETCD2 fill:#ce93d8
  style ETCD3 fill:#ce93d8
  style NODE1 fill:#a5d6a7
  style NODE2 fill:#a5d6a7
  style NODE3 fill:#a5d6a7

Pod Anti-Affinity and Topology Spread Constraints

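Even with a multi-zone cluster, the scheduler may still place all replicas of a workload in one zone. Pod anti-affinity steers replicas apart. The example below uses the soft form (preferredDuringSchedulingIgnoredDuringExecution), which the scheduler honors when it can; the requiredDuringSchedulingIgnoredDuringExecution form makes the rule mandatory: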

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values: [web-server]
        topologyKey: topology.kubernetes.io/zone

The modern alternative is topology spread constraints, which offer finer control:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-server

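With whenUnsatisfiable: DoNotSchedule, the scheduler refuses any placement that would leave one zone with more than one pod above the least-loaded zone; a pod that cannot satisfy the rule stays Pending. The constraint is checked only at scheduling time, so later node loss or scale-down can still unbalance the spread. In the city analogy, this is a fire code requiring emergency vehicles to park in separate garages: if one floods, the others remain operational.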
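The check behind maxSkew can be modelled in a few lines. The sketch below is a simplified model, not the scheduler's code: it ignores node eligibility, taints, and resource limits, and the function name skew_after_placing is invented for illustration.

```shell
# skew_after_placing TARGET COUNT...
# Prints the skew that would result if one more pod landed in the zone at
# position TARGET (1-based), given the current per-zone pod COUNTs:
# (pods in the target zone + 1) - (fewest pods in any zone).
skew_after_placing() {
  local target=$1; shift
  local counts=("$@")
  local min=${counts[0]} c
  for c in "${counts[@]}"; do
    if (( c < min )); then min=$c; fi
  done
  echo $(( counts[target-1] + 1 - min ))
}

# Three zones currently holding 2, 2, and 1 pods, with maxSkew: 1.
max_skew=1
for zone in 1 2 3; do
  s=$(skew_after_placing "$zone" 2 2 1)
  if (( s <= max_skew )); then verdict=allowed; else verdict=rejected; fi
  echo "zone $zone: skew would be $s ($verdict)"
done
```

Only the third zone accepts the next pod, which is how the constraint pulls the distribution back toward even.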

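⚠️ Common Misconception: A regional cluster does not guarantee that your pods are spread across zones. The control plane spans zones, and the scheduler's built-in spreading is only best-effort, so workloads that must survive a zone failure need explicit anti-affinity or topology spread constraints.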

🛑 PAUSE & RECALL — 2 minutes

Why does a 2-node etcd cluster provide no fault tolerance, while a 3-node tolerates 1 failure? (Hint: quorum.)
What does topologyKey: topology.kubernetes.io/zone do?
In the city analogy, what does pod anti-affinity correspond to?

Backup Strategies: The Off-Site Records Vault

etcd Snapshots and Application-Aware Backups

For self-managed clusters, etcdctl snapshot save captures the entire cluster state into a single file. On GKE, Google manages etcd backups automatically.
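A snapshot command for a self-managed cluster might look like the sketch below. The endpoint and certificate paths are the kubeadm defaults and are assumptions here; the wrapper name etcd_snapshot is hypothetical. Adjust all of them for your own control plane.

```shell
# Assumed kubeadm defaults; override through the environment if yours differ.
ETCD_PKI=${ETCD_PKI:-/etc/kubernetes/pki/etcd}
ETCD_ENDPOINT=${ETCD_ENDPOINT:-https://127.0.0.1:2379}

# etcd_snapshot FILE: write a point-in-time snapshot of etcd to FILE.
etcd_snapshot() {
  etcdctl \
    --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_PKI/ca.crt" \
    --cert="$ETCD_PKI/server.crt" \
    --key="$ETCD_PKI/server.key" \
    snapshot save "$1"
}
```

On a control plane node you would call it as etcd_snapshot "/var/backups/etcd-$(date +%Y%m%d-%H%M%S).db" and then copy the file off the node, since a snapshot stored only on the machine it protects is lost with that machine.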

GKE Note: Backup for GKE is Google's managed backup service that captures both Kubernetes resources and PersistentVolume data. Unlike an etcd snapshot (just the resource index), Backup for GKE creates application-aware backups — schedules, retention policies, and cross-region replication included. Restoring creates a complete copy of resources and storage in a target cluster, even in a different region.

Velero for Cross-Platform Flexibility

Velero backs up cluster resources and PVCs to object storage, supporting scheduled backups and cross-cluster restores. It is the tool of choice for multi-cloud environments where Backup for GKE is unavailable.
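As a rough illustration, a daily schedule and a restore with the Velero CLI could be wrapped as shown below. This assumes Velero is already installed in the cluster with an object storage location configured; the wrapper names, the 02:00 schedule, and the 7-day retention are illustrative choices, not recommendations from this chapter.

```shell
# velero_daily_schedule NAME NAMESPACE: back up NAMESPACE every day at 02:00
# and keep each backup for 7 days (168 hours).
velero_daily_schedule() {
  velero schedule create "$1" \
    --schedule="0 2 * * *" \
    --include-namespaces "$2" \
    --ttl 168h0m0s
}

# velero_restore BACKUP: restore everything captured in the named backup.
velero_restore() {
  velero restore create --from-backup "$1"
}
```

For a cross-cluster restore, point kubectl at the target cluster (with Velero configured against the same storage location) before calling velero_restore.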

Follow the 3-2-1 rule: 3 copies of data, on 2 different media, with 1 copy off-site. A backup in the same region as your cluster protects against corruption but not regional outages.

Disaster Recovery Patterns: Choosing Your Response Tier

DR patterns differ in RTO (Recovery Time Objective: how fast to resume) and RPO (Recovery Point Objective: how much data loss is acceptable).

Pilot Light: Minimal Running, Scale Up on DR

Only the database runs in the secondary region; application components are restored from backup when needed. RTO: 30-60 minutes. RPO: hours. Cheapest to run — like a single emergency operator with a radio who can summon the full response.

Warm Standby: Reduced Capacity Running

A scaled-down version of the full application runs continuously, syncing data in real time. On failover, scale to full capacity. RTO: 5-15 minutes. RPO: near-zero. Like a staffed emergency center that can summon all departments within minutes.

Hot Standby: Full Capacity Active-Active

Full-capacity workloads run in multiple regions simultaneously with a global load balancer distributing traffic. Failover is automatic. RTO: seconds. RPO: zero. Like a permanent 24/7 emergency operations center — highest cost, highest readiness.

🤔 TRY BEFORE YOU SEE

A trading platform processes $10M per hour. Requirements: "No more than 30 seconds of lost data; resume within 5 minutes." Budget is secondary.

Which DR pattern? Write your answer before reading on.

Reveal: Hot standby (active-active) is the only option meeting both constraints. The 30-second RPO requires real-time replication; the 5-minute RTO demands instant failover. Warm standby's best-case 5-minute RTO is too risky for a hard requirement.
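The same reasoning can be written as a small decision function. It uses the worst-case RTO figures quoted in this chapter; the RPO cut-offs are illustrative assumptions ("hours" is treated as 3600 seconds and "near-zero" as 60 seconds), and the function name choose_dr_pattern is invented for this sketch.

```shell
# choose_dr_pattern RTO_SECONDS RPO_SECONDS
# Prints the cheapest pattern whose worst-case recovery figures still fit
# inside the stated requirement.
choose_dr_pattern() {
  local rto=$1 rpo=$2
  if (( rto >= 3600 && rpo >= 3600 )); then
    echo "pilot-light"     # RTO 30-60 min, RPO hours
  elif (( rto >= 900 && rpo >= 60 )); then
    echo "warm-standby"    # RTO 5-15 min, RPO near-zero
  else
    echo "hot-standby"     # RTO seconds, RPO zero
  fi
}

# The trading platform: resume within 5 minutes, lose at most 30 seconds.
choose_dr_pattern 300 30
```

Planning against the worst case rather than the best case is the design choice that rules out warm standby for the trading platform.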

Cluster Upgrades and Maintenance

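The Kubernetes version skew policy never allows a kubelet to be newer than the API server, and it limits how far behind a kubelet may fall. Upstream, from Kubernetes 1.28 onward kubelets may be up to three minor versions older than the API server (earlier releases allowed two), so a 1.28 control plane can manage 1.25 through 1.28 kubelets, but not 1.29. Managed platforms may enforce a tighter window, so check your provider's policy before planning an upgrade.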
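A pre-upgrade check along these lines can catch an unsupported pairing early. In this sketch the permitted gap is a parameter, because it depends on the Kubernetes release and on the provider; the function names are hypothetical, and only the minor version is compared (the major version is assumed to be 1).

```shell
# minor VERSION: print the minor number of a version such as 1.28 or v1.28.3.
minor() {
  local v=${1#v}
  v=${v#*.}
  echo "${v%%.*}"
}

# skew_ok CONTROL_PLANE KUBELET MAX_OLDER
# Succeeds when the kubelet is not newer than the control plane and trails it
# by at most MAX_OLDER minor versions.
skew_ok() {
  local gap=$(( $(minor "$1") - $(minor "$2") ))
  (( gap >= 0 && gap <= $3 ))
}

if skew_ok 1.28 1.27 1; then echo "1.27 kubelet: supported"; fi
if ! skew_ok 1.28 1.29 1; then echo "1.29 kubelet: not supported"; fi
```

In a real check, the two versions would come from kubectl version and kubectl get nodes rather than from literals.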

GKE Release Channels

GKE Note: GKE offers three release channels: Rapid (newest versions, for dev/test), Regular (production-validated, the default), and Stable (most battle-tested, for conservative environments). Clusters not on a release channel require manual upgrades.

Surge upgrades create extra nodes during upgrades, migrate pods, then delete old nodes — reducing disruption. Configure maxSurge and maxUnavailable to control the pace. Maintenance windows restrict automatic upgrades to specific times (e.g., Tuesdays and Thursdays 2-6 AM), preventing surprises during peak hours.

graph TB
  subgraph "Upgrade Strategies"
    AUTO[Auto-upgrades<br/>Release Channel]
    MANUAL[Manual upgrades]
    BLUE[Blue-Green clusters]
  end
  subgraph "Surge Upgrade Flow"
    OLD[Old Node v1.27] -->|1| SURGE[New Node v1.28]
    SURGE -->|2| MIGRATE[Pods migrated]
    MIGRATE -->|3| NEW[New Node active]
    MIGRATE -->|4| DRAIN[Old Node deleted]
  end
  style AUTO fill:#90caf9
  style MANUAL fill:#fff9c4
  style BLUE fill:#a5d6a7
  style OLD fill:#ef9a9a
  style NEW fill:#a5d6a7
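On GKE the surge settings are applied per node pool. The sketch below uses the gcloud flags --max-surge-upgrade and --max-unavailable-upgrade, which correspond to the maxSurge and maxUnavailable settings; the wrapper name set_surge and the values 1 and 0 are illustrative assumptions.

```shell
# set_surge CLUSTER REGION POOL: add one surge node at a time and never take
# an existing node out of service before its replacement is ready.
set_surge() {
  gcloud container node-pools update "$3" \
    --cluster="$1" \
    --region="$2" \
    --max-surge-upgrade=1 \
    --max-unavailable-upgrade=0
}
```

A call such as set_surge YOUR_CLUSTER_NAME YOUR_REGION default-pool favors availability over speed; raising the surge value speeds up the upgrade at the cost of more temporary nodes.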

Blue-Green Cluster Upgrades

For critical workloads, maintain two complete clusters. Run production on blue; provision green on the new version; migrate and validate; shift traffic. If anything fails, route back to blue instantly. Safest — and most expensive, since you run double infrastructure during transition.

Chaos Engineering: The Fire Drill

Chaos engineering intentionally introduces failures to validate resilience. The four principles: define steady state, form a hypothesis, inject real-world failure, validate or improve.

Chaos Monkey, the Netflix tool that popularized the practice, randomly terminates instances; kube-monkey applies the same idea to pods. Litmus provides pre-built chaos experiments as Kubernetes CRDs. Chaos Mesh offers a visual platform with pod-kill, network latency, CPU stress, and time skew experiments. A game day is a scheduled event where a team tests scenarios like zone loss: cordoning and draining all nodes in one zone, then verifying that the spreading rules and the HPA let the remaining zones absorb the workload. The output: confidence and documented runbooks for real incidents.
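A zone-loss drill can be approximated with kubectl alone by selecting nodes on the standard zone label. The helper names below are made up for this sketch, and it should only be run against a cluster you are allowed to disrupt.

```shell
ZONE_LABEL=topology.kubernetes.io/zone

# simulate_zone_loss ZONE: drain every node in ZONE. Draining cordons the
# nodes first, so nothing new is scheduled there while pods are evicted.
simulate_zone_loss() {
  kubectl drain -l "$ZONE_LABEL=$1" --ignore-daemonsets --delete-emptydir-data
}

# restore_zone ZONE: make the zone schedulable again after the game day.
restore_zone() {
  kubectl uncordon -l "$ZONE_LABEL=$1"
}
```

During the drill, watch for Pending pods: they show where capacity in the surviving zones, or a strict spreading rule, prevents workloads from moving.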

GKE in Practice: Regional Clusters and Multi-Region DR

GKE's HA story centers on regional clusters — the control plane automatically replicates across three zones, and nodes distribute across those zones. You get zone-level fault tolerance without managing etcd.

For cross-region DR, GKE Multi-Cluster Ingress provides a single global load balancer across clusters in different regions with health-based failover. Config Sync (part of GKE Enterprise / Anthos) keeps configuration consistent across fleet clusters, ensuring your standby always matches the primary.

🛑 PAUSE & RECALL — 3 minutes

What is the key difference between an etcd snapshot and Backup for GKE?
Order the three DR patterns from cheapest to most expensive and slowest to fastest recovery.
What are the three GKE release channels, and which suits a conservative production environment?
In the city analogy, what does chaos engineering correspond to?

Lab: LAB-7.3 — High Availability (75 minutes)

Part 1: Configure Zone Distribution (15 min)

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-spread-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: zone-spread
  template:
    metadata:
      labels:
        app: zone-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: zone-spread
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
EOF

Verify distribution across zones:

kubectl get pods -o wide -l app=zone-spread
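The wide output lists node names rather than zones. To see the zone tally directly, something like the following sketch may help; the helper names pod_zones and count_per_zone are hypothetical, and the nodes are assumed to carry the standard topology.kubernetes.io/zone label.

```shell
# pod_zones SELECTOR: print the zone of the node each matching pod runs on.
pod_zones() {
  kubectl get pods -l "$1" \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' |
  while read -r node; do
    [ -n "$node" ] || continue   # skip pods that are not scheduled yet
    kubectl get node "$node" \
      -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
  done
}

# count_per_zone: turn zone names on stdin into "zone count" lines.
count_per_zone() {
  sort | uniq -c | awk '{print $2, $1}'
}
```

Run pod_zones app=zone-spread | count_per_zone. With six replicas, three zones, and enough capacity in each, you would expect every zone to report 2.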

Part 2: Create Backup for GKE Plan and Test Restore (30 min)

# Create test resources
kubectl create configmap app-config --from-literal=database=postgres
kubectl create secret generic app-secret --from-literal=password=lab-password-123

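# Create backup plan (replace YOUR_PROJECT, YOUR_CLUSTER_NAME, and YOUR_REGION).
# --include-secrets is needed because Secrets are not backed up by default.
gcloud alpha container backup-restore backup-plans create ha-lab-plan \
  --cluster=projects/YOUR_PROJECT/locations/YOUR_REGION/clusters/YOUR_CLUSTER_NAME \
  --location=YOUR_REGION \
  --included-namespaces=default --include-secrets --backup-retain-days=7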

# On-demand backup
gcloud alpha container backup-restore backups create ha-lab-backup \
  --backup-plan=ha-lab-plan --location=YOUR_REGION --wait-for-completion

Key moment — test the restore:

# Delete original resources
kubectl delete configmap/app-config secret/app-secret

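# Create restore plan and execute.
# Depending on your gcloud version, restore-plans create may also require a
# restore scope (such as --all-namespaces) and --namespaced-resource-restore-mode,
# and resource flags may need full projects/.../locations/... paths; run the
# command with --help to confirm before continuing.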
gcloud alpha container backup-restore restore-plans create ha-lab-restore \
  --backup-plan=ha-lab-plan --cluster=YOUR_CLUSTER_NAME --location=YOUR_REGION

gcloud alpha container backup-restore restores create ha-lab-restore-run \
  --restore-plan=ha-lab-restore --backup=ha-lab-backup \
  --location=YOUR_REGION --wait-for-completion

Verify restored data:

kubectl get configmap app-config -o yaml
kubectl get secret app-secret -o yaml

Expected outcome: Both resources restored with original data intact.

Part 3: Configure Maintenance Window (15 min)

gcloud container clusters update YOUR_CLUSTER_NAME \
  --region=YOUR_REGION \
  --maintenance-window-start=2024-01-01T02:00:00Z \
  --maintenance-window-end=2024-01-01T06:00:00Z \
  --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=TU,TH"

Part 4: Simulate Pod Failure and Observe Self-Healing (15 min)

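# Watch pods during deletion
kubectl get pods -l app=zone-spread -w &

# Delete one pod (a label selector alone would delete every replica)
POD=$(kubectl get pods -l app=zone-spread -o jsonpath='{.items[0].metadata.name}')
kubectl delete pod "$POD" --grace-period=0 --force

# Observe: ReplicaSet recreates it within ~10 seconds

# Stop the background watch when you are done
kill %1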

Cleanup

gcloud alpha container backup-restore restores delete ha-lab-restore-run \
  --restore-plan=ha-lab-restore --location=YOUR_REGION
gcloud alpha container backup-restore restore-plans delete ha-lab-restore --location=YOUR_REGION
gcloud alpha container backup-restore backups delete ha-lab-backup \
  --backup-plan=ha-lab-plan --location=YOUR_REGION
gcloud alpha container backup-restore backup-plans delete ha-lab-plan --location=YOUR_REGION
kubectl delete deployment zone-spread-app
kubectl delete configmap app-config
kubectl delete secret app-secret

Chapter Summary

High availability keeps your cluster running through failures; disaster recovery brings it back when everything fails. You learned to spread pods across zones with topology spread constraints, choose DR patterns by RTO/RPO, protect workloads with Backup for GKE, manage upgrades through release channels and maintenance windows, and validate resilience through chaos engineering. The city that prepares for emergencies weathers the storm — and so does the cluster.

📇 KEY CONCEPT CARDS

Q: What is the difference between High Availability and Disaster Recovery?
A: HA keeps a system running through component failures (multi-zone clusters, pod spreading). DR is the plan for recovering when the entire system fails (backups, standby clusters, restoration procedures).

Q: How does etcd quorum work, and why use an odd number of members?
A: etcd requires more than half of members to agree on any write. A 3-node cluster tolerates 1 failure; a 5-node tolerates 2. Even numbers provide no additional fault tolerance — a 4-node cluster still tolerates only 1 failure.

Q: Compare pilot light, warm standby, and hot standby by RTO and cost.
A: Pilot light: RTO 30-60 min, cost $ (minimal resources). Warm standby: RTO 5-15 min, cost $$ (reduced capacity). Hot standby: RTO seconds, cost $$$ (full active-active). Choose based on business RTO/RPO requirements.

Q: What are the three GKE release channels?
A: Rapid (newest, for dev/test), Regular (production-validated, default), Stable (most battle-tested, for conservative production). Not selecting a channel means manual upgrades only.