In the previous chapter, you learned how to observe your cluster — setting up probes, collecting logs, and monitoring metrics. But observation is only half the battle. When your dashboards show CPU spiking during a traffic surge, you don't want to manually run kubectl scale at 3 AM. Kubernetes provides a sophisticated autoscaling system that reacts to demand automatically, proportionally, and efficiently — much like a smart building adjusts its climate.
Analogy: Smart Building Climate Control
Imagine a modern office tower with intelligent climate management. When a conference room fills with people, the building deploys portable heaters or coolers to that specific room. When the sun shifts and one side heats up, individual thermostats adjust. When every floor reaches capacity, the building automatically leases additional floors. And each department has a utility budget they cannot exceed, preventing one team from blasting the AC and driving up costs for everyone.
This is precisely how Kubernetes autoscaling works. The Horizontal Pod Autoscaler (HPA) is like deploying portable heaters — adding or removing pod replicas based on demand. The Vertical Pod Autoscaler (VPA) is like adjusting thermostats — tuning CPU and memory per container. The Cluster Autoscaler is like leasing additional floors — expanding or contracting the node pool. ResourceQuotas are department utility budgets — ensuring no namespace over-consumes. Let's explore each mechanism.
Horizontal Pod Autoscaler (HPA)
The HPA watches pod metrics and adjusts replica count to maintain target utilization. It is the most commonly used autoscaler because it maps directly to the simplest intuition: when load increases, add more copies.
Visual Description: HPA Scaling Decision Flow
The Scaling Formula
The core calculation is simple:
desiredReplicas = currentReplicas * (currentMetricValue / desiredMetricValue)
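As a rough sketch, the formula can be written in shell with integer arithmetic: round up, then clamp to the replica bounds. The function name, argument order, and fallback bounds are illustrative assumptions rather than part of any Kubernetes tooling, and the real controller also applies a tolerance band and the behavior policies before acting.

```shell
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC [MIN] [MAX]
# Hypothetical helper: metrics are whole-number percentages.
desired_replicas() {
  current=$1; metric=$2; target=$3
  min=${4:-1}; max=${5:-1000000}   # fallback bounds are arbitrary placeholders
  # Ceiling of current * metric / target without floating point.
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 10 30 50 2 20    # load dropped: prints 6
desired_replicas 3 400 50 2 20    # raw result is 24, capped at maxReplicas: prints 20
```

The clamp matters in practice: however large the ratio becomes, the result stays between minReplicas and maxReplicas.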
For example: 3 replicas at 90% CPU with a 50% target yields 3 * (90/50) = 5.4, rounded up to 6 replicas. Here is a complete HPA manifest:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
The metrics block defines what HPA watches: here, CPU as a percentage of each pod's request. The behavior block provides critical tuning. scaleUp lets the replica count at most double (a 100% increase) every 60 seconds, while scaleDown conservatively removes at most 10% per minute, and only after a 5-minute stabilization window. This asymmetry is deliberate: react fast to spikes, scale down slowly to avoid thrashing.
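To get a feel for what a percent-based scale-up policy implies, the sketch below counts how many policy periods are needed to grow from one replica count to another. It is a simplified model under stated assumptions: the function name is hypothetical, growth is rounded up each period, and stabilization windows and metric lag are ignored.

```shell
# periods_to_reach START_REPLICAS GOAL_REPLICAS PERCENT_PER_PERIOD
# Hypothetical helper modelling a Percent scale-up policy.
periods_to_reach() {
  replicas=$1; goal=$2; percent=$3
  if [ "$replicas" -le 0 ] || [ "$percent" -le 0 ]; then
    echo "replicas and percent must be positive" >&2
    return 1
  fi
  periods=0
  while [ "$replicas" -lt "$goal" ]; do
    # Grow by PERCENT of the current count, rounding up (an assumption of this model).
    replicas=$(( (replicas * (100 + percent) + 99) / 100 ))
    periods=$(( periods + 1 ))
  done
  echo "$periods"
}

periods_to_reach 2 20 100   # 2 -> 4 -> 8 -> 16 -> 20 (capped): prints 4
```

With periodSeconds: 60, four periods means that going from minReplicas to maxReplicas takes on the order of four minutes at best, which is worth knowing before relying on HPA for sharp spikes.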
Metric Sources
HPA supports four main metric source types (autoscaling/v2 also defines ContainerResource, a per-container variant of Resource metrics):
- Resource metrics: Built-in CPU and memory utilization
- Pods metrics: Custom application metrics (e.g., requests per second)
- Object metrics: Metrics from any Kubernetes object (e.g., ingress request rate)
- External metrics: Metrics from outside the cluster (e.g., Pub/Sub queue depth)
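For instance, an HPA driven by a Pods metric might look like the manifest below, written to a file so it can be reviewed before applying. The file name, the metric name http_requests_per_second, and the target of 100 are hypothetical, and the sketch assumes a metrics adapter already exposes that metric through the custom metrics API.

```shell
# Write a hypothetical HPA manifest that scales on a per-pod custom metric.
cat <<'EOF' > hpa-rps.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-rps-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "100"              # assumed target per pod
EOF

grep -c 'type: Pods' hpa-rps.yaml   # prints 1
```

Once the metric is actually being served, the file could be applied with kubectl apply -f hpa-rps.yaml.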
⚠️ Common Misconception: HPA does not react instantly. The default evaluation interval is 15 seconds, and stabilization windows add further delay. HPA handles gradual changes, not sudden millisecond spikes — over-provision or use cluster-level buffering for those.
🛑 PAUSE & RECALL — 2 minutes
Without looking back, answer these:
- If you have 4 pods each at 80% CPU and your target is 50%, how many replicas will HPA desire? Show the math.
- Why configure scaleDown more conservatively than scaleUp?
- Name the four metric source types HPA can use.
Rate your confidence (0-4).
Vertical Pod Autoscaler (VPA)
While HPA adds or removes pods, VPA adjusts resource requests and limits per container — the smart building's thermostat system.
VPA Operating Modes
| Mode | Behavior | Use Case |
|---|---|---|
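| Off | Records recommendations only; never changes pods | Analyze usage before trusting automation |
| Initial | Applies recommendations only when a pod is created | Set-and-forget for batch workloads |
| Recreate | Applies recommendations at creation and evicts running pods whose requests drift from the recommendation | Continuous right-sizing where restarts are acceptable |
| Auto | Applies recommendations using the updater's preferred mechanism, which has historically been eviction (equivalent to Recreate) | Continuous right-sizing for services |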
Most teams start in Off mode, which only records recommendations. The VPA analyzes historical usage and suggests optimal values:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 50m
        memory: 100Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
View recommendations with kubectl describe vpa web-app-vpa.
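The minAllowed and maxAllowed bounds act as a clamp on whatever the recommender computes. A minimal sketch of that clamping, using CPU millicores and made-up raw recommendations (the function name and the input values are assumptions for illustration):

```shell
# clamp_recommendation RAW MIN_ALLOWED MAX_ALLOWED (all in millicores)
clamp_recommendation() {
  raw=$1; min_allowed=$2; max_allowed=$3
  if [ "$raw" -lt "$min_allowed" ]; then
    echo "$min_allowed"
  elif [ "$raw" -gt "$max_allowed" ]; then
    echo "$max_allowed"
  else
    echo "$raw"
  fi
}

# Bounds from the manifest above: minAllowed cpu 50m, maxAllowed cpu 2 (2000m).
clamp_recommendation 25 50 2000     # prints 50
clamp_recommendation 350 50 2000    # prints 350
clamp_recommendation 3000 50 2000   # prints 2000
```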
VPA and HPA Interaction
⚠️ Common Misconception: You cannot run HPA and VPA together on the same CPU or memory metric because they conflict — HPA uses resource requests to calculate utilization, and VPA changes those requests. The solution: use HPA for scaling out based on a custom metric (requests-per-second), while VPA handles CPU/memory right-sizing.
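The conflict is easy to see with numbers. In the sketch below, actual CPU usage never changes, yet raising the request (as VPA might) halves the utilization HPA sees and flips its decision. The usage and request values are invented for illustration, and the helpers are simplified stand-ins for the real controllers.

```shell
# utilization_percent USAGE_MILLICORES REQUEST_MILLICORES
utilization_percent() { echo $(( $1 * 100 / $2 )); }
# desired_replicas CURRENT_REPLICAS CURRENT_PERCENT TARGET_PERCENT (rounded up)
desired_replicas() { echo $(( ($1 * $2 + $3 - 1) / $3 )); }

usage=180      # millicores actually consumed per pod (assumed)
replicas=3
target=50

before=$(utilization_percent "$usage" 200)   # request of 200m
after=$(utilization_percent "$usage" 400)    # VPA raises the request to 400m

echo "before VPA: ${before}% -> $(desired_replicas "$replicas" "$before" "$target") replicas"
echo "after VPA: ${after}% -> $(desired_replicas "$replicas" "$after" "$target") replicas"
# prints:
# before VPA: 90% -> 6 replicas
# after VPA: 45% -> 3 replicas
```

The workload did not change, but HPA went from wanting to double the replica count to wanting no change, which is why the two controllers should not share a CPU or memory signal.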
Cluster Autoscaler
The Cluster Autoscaler operates at the infrastructure layer, adding or removing nodes based on scheduling pressure. It is the property management team that leases new floors when every floor is full and lets leases expire when floors sit empty.
Visual Description: Three Autoscalers Comparison
The CA watches for Pending pods that cannot be scheduled due to insufficient resources, then provisions new nodes. When a node is underutilized (by default, pod requests below 50% of its allocatable capacity for 10 minutes or more) and its pods can be rescheduled elsewhere, CA drains and removes it. Before scale-down, CA checks PodDisruptionBudgets, respects grace periods, and accounts for multi-zone balance and DaemonSet pods.
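The utilization side of the scale-down decision can be sketched as a simple ratio check. This is a deliberately narrow model: it looks at a single resource, uses a hypothetical function name and example node sizes, and ignores the waiting period, PodDisruptionBudgets, and whether the pods can actually be placed elsewhere.

```shell
# node_verdict REQUESTED_MILLICORES ALLOCATABLE_MILLICORES [THRESHOLD_PERCENT]
node_verdict() {
  requested_m=$1; allocatable_m=$2; threshold=${3:-50}
  utilization=$(( requested_m * 100 / allocatable_m ))
  if [ "$utilization" -lt "$threshold" ]; then
    echo "candidate (${utilization}%)"
  else
    echo "keep (${utilization}%)"
  fi
}

node_verdict 1500 4000   # prints: candidate (37%)
node_verdict 2600 4000   # prints: keep (65%)
```

Note that the ratio is based on pod requests, not live usage, so over-requested pods can keep an otherwise idle node alive.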
Resource Quotas and LimitRanges
ResourceQuotas and LimitRanges are the utility budgets and equipment specifications per floor, preventing the noisy-neighbor problem.
A ResourceQuota caps aggregate namespace consumption:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
A LimitRange sets defaults and boundaries for individual containers:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
Without a LimitRange, containers with no resource specs run as BestEffort QoS — the first evicted under node pressure.
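The QoS rules behind that statement can be sketched for a single-container pod as follows. The function is a hypothetical illustration: it compares quantities as plain strings (so 500m and 0.5 would not match) and does not handle multi-container pods.

```shell
# qos_class CPU_REQUEST MEM_REQUEST CPU_LIMIT MEM_LIMIT ("" means unset)
qos_class() {
  cpu_req=$1; mem_req=$2; cpu_lim=$3; mem_lim=$4
  # Nothing set at all: BestEffort.
  if [ -z "$cpu_req$mem_req$cpu_lim$mem_lim" ]; then
    echo BestEffort
    return
  fi
  # A missing request defaults to the limit when a limit is given.
  cpu_req=${cpu_req:-$cpu_lim}
  mem_req=${mem_req:-$mem_lim}
  # Guaranteed needs both limits set and requests equal to limits.
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class "" "" "" ""               # prints BestEffort
qos_class 100m 128Mi 500m 512Mi     # LimitRange defaults above: prints Burstable
qos_class 500m 512Mi 500m 512Mi     # prints Guaranteed
```

So the LimitRange shown earlier moves unconfigured containers from BestEffort to Burstable, which changes their place in the eviction order.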
Pod Disruption Budgets
Voluntary disruptions — node upgrades, cluster autoscaler scale-downs, manual drains — are a fact of cluster life. A PodDisruptionBudget (PDB) ensures these never compromise availability.
Visual Description: PodDisruptionBudget Protection Flow
Configure a PDB using minAvailable or maxUnavailable:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
⚠️ Common Misconception: PDBs do not protect against involuntary disruptions like node failures or kernel panics. They only guard against voluntary evictions. Always design for unexpected pod loss.
🤔 TRY BEFORE YOU SEE
You have a 5-replica Deployment with a PDB set to minAvailable: 2. The cluster autoscaler wants to scale down a node hosting 3 of these pods.
Predict what happens and why. Write your answer before reading on.
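Reveal: The PDB allows up to 3 evictions (5 - 2 = 3), so the autoscaler can evict all 3 pods on the node (5 → 4 → 3 → 2) and remove it. At that point the budget is exhausted: minAvailable: 2 blocks any further eviction until rescheduled pods become ready elsewhere. Had the node hosted 4 of the pods, the fourth eviction would be blocked, and the autoscaler would have to wait for replacements or choose a different node. This is how PDBs protect availability during voluntary disruptions.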
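The eviction budget in this scenario boils down to healthy pods minus minAvailable, floored at zero. A minimal sketch (the function name is an assumption, and only the minAvailable form with an absolute number is modelled):

```shell
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
allowed_disruptions() {
  healthy=$1; min_available=$2
  allowed=$(( healthy - min_available ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

allowed_disruptions 5 2   # the scenario above: prints 3
allowed_disruptions 2 2   # nothing may be evicted: prints 0
```

The second call is the case to watch for: when the replica count equals minAvailable, every voluntary eviction is refused until more pods are ready.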
GKE in Practice
GKE Note: GKE integrates all three autoscalers natively. HPA works out of the box. VPA is available as a managed add-on. Cluster Autoscaler is built into every node pool with configurable min/max bounds.
Node Auto-Provisioning is a GKE-specific enhancement. Instead of scaling existing pools, it creates new node pools with custom machine types matching pending pod requirements — automatically providing GPUs or specific CPU architectures when needed.
For cost optimization, Committed Use Discounts (CUDs) offer significant price reductions when you commit to baseline compute for 1 or 3 years. Pair CUDs with autoscaling: use committed capacity for predictable baseline workload, and let autoscaler handle bursts on demand. GKE's cost optimization console analyzes usage and suggests right-sizing opportunities.
GKE Note: GKE Autopilot abstracts most autoscaling configuration. It provides automatic scaling behavior and manages node provisioning behind the scenes with pod-based billing rather than provisioned-node billing.
🛑 PAUSE & RECALL — 3 minutes
Close your eyes and picture the smart building:
- Which autoscaler is the portable heater? The thermostat? The building expansion?
- What is the difference between minAvailable and maxUnavailable in a PDB?
- Why use HPA with custom metrics (not CPU) when also using VPA in Auto mode?
Rate your confidence (0-4).
Lab: LAB-7.2 — Autoscaling (60 min)
Prerequisites
- A running Kubernetes cluster (GKE Standard recommended)
- kubectl configured and authenticated
- hey load generator: go install github.com/rakyll/hey@latest
Step 1: Deploy the Application
kubectl create namespace autoscaling-lab
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loadgen-app
  namespace: autoscaling-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: loadgen-app
  template:
    metadata:
      labels:
        app: loadgen-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 128Mi
EOF
kubectl expose deployment loadgen-app --type=LoadBalancer --port=80 -n autoscaling-lab
Step 2: Configure HPA
cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loadgen-hpa
  namespace: autoscaling-lab
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loadgen-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
EOF
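Verify: kubectl get hpa -n autoscaling-lab. Expect a target of about 0%/50% with 2 replicas; the target may read <unknown>/50% for the first minute or so until metrics arrive.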
Step 3: Generate Load and Observe Scaling
export LB_IP=$(kubectl get svc loadgen-app -n autoscaling-lab -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
hey -z 2m -c 50 http://$LB_IP/
In another terminal: kubectl get hpa -n autoscaling-lab -w
Watch CPU rise above 50% and replicas increase. When load stops, observe replicas gradually decrease.
Step 4: Set ResourceQuota and Test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: lab-quota
  namespace: autoscaling-lab
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    pods: "5"
EOF
kubectl scale deployment loadgen-app --replicas=10 -n autoscaling-lab
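Expected: The scale command itself succeeds, but the Deployment stalls at 5 pods. The ReplicaSet cannot create pods beyond the pods: "5" and requests.cpu: "1" limits (5 pods × 200m = 1 CPU); kubectl describe replicaset -n autoscaling-lab shows the "exceeded quota" events. Note that the HPA from Step 2 is still active and may adjust the replica count again.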
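The ceiling of 5 pods can be checked by hand: the quota admits as many pods as the tightest of its three limits allows. A small sketch of that arithmetic, with a hypothetical function name and units of millicores and Mi:

```shell
# max_pods_under_quota CPU_QUOTA_M MEM_QUOTA_MI POD_QUOTA CPU_PER_POD_M MEM_PER_POD_MI
max_pods_under_quota() {
  cpu_quota_m=$1; mem_quota_mi=$2; pod_quota=$3
  cpu_per_pod_m=$4; mem_per_pod_mi=$5
  by_cpu=$(( cpu_quota_m / cpu_per_pod_m ))
  by_mem=$(( mem_quota_mi / mem_per_pod_mi ))
  # The smallest of the three limits wins.
  fit=$pod_quota
  if [ "$by_cpu" -lt "$fit" ]; then fit=$by_cpu; fi
  if [ "$by_mem" -lt "$fit" ]; then fit=$by_mem; fi
  echo "$fit"
}

# Lab values: 1 CPU, 1Gi, 5 pods; each pod requests 200m and 128Mi.
max_pods_under_quota 1000 1024 5 200 128   # prints 5
# If each pod requested 300m instead, CPU would become the binding limit.
max_pods_under_quota 1000 1024 5 300 128   # prints 3
```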
Step 5: Create PodDisruptionBudget and Test
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: loadgen-pdb
  namespace: autoscaling-lab
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: loadgen-app
EOF
Attempt to drain a node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
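Observe that the drain respects the PDB: evictions that would violate minAvailable: 2 are denied, and kubectl drain retries them until replacement pods become ready on other nodes. If only 2 replicas are running at this point, no eviction is allowed at all, and the drain keeps retrying until you interrupt it or scale the Deployment up.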
Clean up: kubectl uncordon <node-name> (the drained node stays cordoned otherwise), then kubectl delete namespace autoscaling-lab
Chapter Summary
Kubernetes autoscaling operates at three distinct layers. HPA adds or removes pod replicas based on demand — like deploying portable heaters. VPA adjusts CPU and memory per container — like tuning thermostats. Cluster Autoscaler expands and contracts the node pool — like leasing or releasing floors. ResourceQuotas enforce namespace-level budgets, and PodDisruptionBudgets protect availability during voluntary disruptions. Together, they create a self-regulating system. On GKE, node auto-provisioning and committed use discounts provide additional cost optimization tools that integrate seamlessly with native Kubernetes autoscaling.
📇 KEY CONCEPT CARDS
- Q: What is the HPA scaling formula?
A: desiredReplicas = currentReplicas * (currentMetricValue / desiredMetricValue), rounded up. Multiply current replicas by the ratio of actual to target utilization.
- Q: Why can't HPA and VPA run together on the same CPU/memory metric?
A: HPA calculates utilization ratios using resource requests, and VPA changes those requests. The solution: scale HPA on custom metrics (requests-per-second) while VPA handles CPU/memory right-sizing.
- Q: What is the difference between ResourceQuota and LimitRange?
A: ResourceQuota limits aggregate resource consumption across a namespace. LimitRange sets default, minimum, and maximum resource values for individual containers.
- Q: What does a PodDisruptionBudget protect against — and not protect against?
A: PDBs protect against voluntary disruptions (node drains, cluster autoscaler scale-down, upgrades) by limiting simultaneous evictions. They do NOT protect against involuntary disruptions like node failures or OOM kills.