Chapter 7.2 · Autoscaling and Resource Management

In the previous chapter, you learned how to observe your cluster — setting up probes, collecting logs, and monitoring metrics. But observation is only half the battle. When your dashboards show CPU spiking during a traffic surge, you don't want to manually run kubectl scale at 3 AM. Kubernetes provides a sophisticated autoscaling system that reacts to demand automatically, proportionally, and efficiently — much like a smart building adjusts its climate.

Analogy: Smart Building Climate Control

Imagine a modern office tower with intelligent climate management. When a conference room fills with people, the building deploys portable heaters or coolers to that specific room. When the sun shifts and one side heats up, individual thermostats adjust. When every floor reaches capacity, the building automatically leases additional floors. And each department has a utility budget they cannot exceed, preventing one team from blasting the AC and driving up costs for everyone.

This is precisely how Kubernetes autoscaling works. The Horizontal Pod Autoscaler (HPA) is like deploying portable heaters — adding or removing pod replicas based on demand. The Vertical Pod Autoscaler (VPA) is like adjusting thermostats — tuning CPU and memory per container. The Cluster Autoscaler is like leasing additional floors — expanding or contracting the node pool. ResourceQuotas are department utility budgets — ensuring no namespace over-consumes. Let's explore each mechanism.

Horizontal Pod Autoscaler (HPA)

The HPA watches pod metrics and adjusts replica count to maintain target utilization. It is the most commonly used autoscaler because it maps directly to the simplest intuition: when load increases, add more copies.

Visual Description: HPA Scaling Decision Flow

graph LR
  A[Metrics Server] -->|CPU/Memory metrics| B[HPA Controller]
  B -->|Compare current vs target| C{Scale needed?}
  C -->|Ratio > 1| D[Increase replicas]
  C -->|Ratio < 1| E[Decrease replicas]
  C -->|Within tolerance| F[No change]
  D --> G[Deployment/ReplicaSet]
  E --> G
  G --> H[Pod replicas]
  style A fill:#90caf9
  style B fill:#ffcc80
  style C fill:#fff9c4
  style D fill:#a5d6a7
  style E fill:#ef9a9a
  style F fill:#e0e0e0
  style G fill:#ce93d8
  style H fill:#a5d6a7

The Scaling Formula

The core calculation is simple:

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
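To check the formula by hand, here is a minimal Python sketch. The function name desired_replicas is hypothetical, and the sketch ignores minReplicas/maxReplicas clamping and missing-metric handling. The tolerance of 0.1 mirrors the documented default of the controller's tolerance setting; treat it as an assumption if your cluster overrides it.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count suggested by the HPA formula, before min/max clamping."""
    ratio = current_metric / target_metric
    # Inside the tolerance band around 1.0 the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally and round up to a whole pod.
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 90, 50))   # 6  (3 * 1.8 = 5.4, rounded up)
print(desired_replicas(10, 52, 50))  # 10 (ratio 1.04 is within tolerance)
print(desired_replicas(10, 25, 50))  # 5  (ratio 0.5 halves the replica count)
```

Rounding up rather than to the nearest integer is what keeps average utilization at or below the target after scaling.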

For example: 3 replicas at 90% CPU with a 50% target yields 3 * (90/50) = 5.4, rounded up to 6 replicas. Here is a complete HPA manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60

The metrics block defines what HPA watches: here, average CPU usage as a percentage of each pod's CPU request. The behavior block provides critical tuning. The scaleUp policy lets the replica count grow by at most 100% (a doubling) per 60-second period, after a 60-second stabilization window. The scaleDown policy removes at most 10% of replicas per minute and only acts on the highest recommendation seen during a 5-minute stabilization window. This asymmetry is deliberate: react fast to spikes, scale down slowly to avoid thrashing.
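The effect of the two Percent policies above can be approximated with a little integer arithmetic. This is a simplified sketch, not the controller's implementation: the helper names are hypothetical, stabilization windows and earlier scale events are ignored, and the rounding direction (up for scale-up, down for scale-down) is an assumption of the sketch.

```python
def scale_up_limit(period_start_replicas, percent):
    """Highest replica count a Percent scaleUp policy permits within one period."""
    grown = period_start_replicas * (100 + percent)
    return -(-grown // 100)  # integer ceiling division

def scale_down_limit(period_start_replicas, percent):
    """Lowest replica count a Percent scaleDown policy permits within one period."""
    return period_start_replicas * (100 - percent) // 100

# Scale-up at 100% per 60 s, capped by maxReplicas: 20 from the manifest.
replicas = 2
for minute in range(1, 4):
    replicas = min(scale_up_limit(replicas, 100), 20)
    print(f"scale-up, minute {minute}: at most {replicas} replicas")

# Scale-down at 10% per 60 s, starting from the maximum.
print(scale_down_limit(20, 10))  # 18
print(scale_down_limit(18, 10))  # 16
```

Under these assumptions the Deployment can go from 2 to 16 replicas in three minutes, while shrinking from 20 back toward 2 takes many more periods.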

Metric Sources

HPA is not limited to CPU. It can draw on four metric source types:

Resource metrics: Built-in CPU and memory utilization
Pods metrics: Custom application metrics (e.g., requests per second)
Object metrics: Metrics from any Kubernetes object (e.g., ingress request rate)
External metrics: Metrics from outside the cluster (e.g., Pub/Sub queue depth)

⚠️ Common Misconception: HPA does not react instantly. The default evaluation interval is 15 seconds, and stabilization windows add further delay. HPA handles gradual changes, not sudden millisecond spikes — over-provision or use cluster-level buffering for those.

🛑 PAUSE & RECALL — 2 minutes

Without looking back, answer these:

If you have 4 pods each at 80% CPU and your target is 50%, how many replicas will HPA desire? Show the math.
Why configure scaleDown more conservatively than scaleUp?
Name the four metric source types HPA can use.

Rate your confidence (0-4).

Vertical Pod Autoscaler (VPA)

While HPA adds or removes pods, VPA adjusts resource requests and limits per container — the smart building's thermostat system.

VPA Operating Modes

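Mode	Behavior	Use Case
Off	Records recommendations only; never changes pods	Analyze usage before trusting automation
Initial	Applies recommendations only at pod creation	Set-and-forget for batch workloads
Recreate	Applies recommendations at pod creation and evicts running pods whose requests differ significantly from the recommendation	Continuous right-sizing when restarts are acceptable
Auto	Documented as currently equivalent to Recreate; may use a less disruptive update mechanism where one is available	Continuous right-sizing for services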

Most teams start in recommendation-only mode (updateMode: "Off"). The VPA analyzes historical usage and suggests optimal values:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 50m
        memory: 100Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi

View recommendations with kubectl describe vpa web-app-vpa.

VPA and HPA Interaction

⚠️ Common Misconception: You cannot run HPA and VPA together on the same CPU or memory metric because they conflict — HPA uses resource requests to calculate utilization, and VPA changes those requests. The solution: use HPA for scaling out based on a custom metric (requests-per-second), while VPA handles CPU/memory right-sizing.
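A small worked example makes the conflict concrete. The numbers are hypothetical: a pod using a steady 400m of CPU, with VPA raising its request from 500m to 1000m. The helper name is an assumption for illustration.

```python
def cpu_utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as HPA computes it: usage relative to the pod's request."""
    return 100 * usage_millicores / request_millicores

usage = 400  # hypothetical steady CPU usage per pod, in millicores

before = cpu_utilization_percent(usage, request_millicores=500)
after = cpu_utilization_percent(usage, request_millicores=1000)  # VPA doubled the request

print(before)  # 80.0 -> above a 50% target, so HPA wants more replicas
print(after)   # 40.0 -> below the target, so HPA now wants fewer replicas
```

Actual load never changed, yet the HPA decision flipped. That is why the two autoscalers should not share a resource metric.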

Cluster Autoscaler

The Cluster Autoscaler operates at the infrastructure layer, adding or removing nodes based on scheduling pressure. It is the property management team that leases new floors when every floor is full and lets leases expire when floors sit empty.

Visual Description: Three Autoscalers Comparison

graph TB
  subgraph "HPA - Portable Heaters"
    H1[High CPU] --> H2[Add/remove pods]
    H2 --> H3[Same pod, more/fewer copies]
  end
  subgraph "VPA - Thermostat Adjustment"
    V1[Wrong resource size] --> V2[Adjust CPU/memory per pod]
    V2 --> V3[Same count, different size]
  end
  subgraph "Cluster Autoscaler - Building Floors"
    C1[Unschedulable pods] --> C2[Add/remove nodes]
    C2 --> C3[Same workload, different infrastructure]
  end
  H3 --> C1
  V3 --> C1
  style H1 fill:#ef9a9a
  style H2 fill:#a5d6a7
  style V1 fill:#ef9a9a
  style V2 fill:#90caf9
  style C1 fill:#ef9a9a
  style C2 fill:#ce93d8

The Cluster Autoscaler (CA) watches for Pending pods that cannot be scheduled due to insufficient resources, then provisions new nodes. When a node is underutilized (by default, when the pod requests on it stay below 50% of its allocatable capacity for 10 minutes) and its pods can be rescheduled elsewhere, CA drains and removes it. Note that this check uses resource requests, not live usage. Before scale-down, CA checks PodDisruptionBudgets, respects grace periods, and accounts for multi-zone balance and DaemonSet pods.
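The scale-down check described above can be sketched as follows. This is a simplified model under stated assumptions: the function names are hypothetical, the 0.5 threshold and 10-minute wait are the defaults quoted in the text, and the real autoscaler also verifies that every pod can be rescheduled and that no PodDisruptionBudget is violated.

```python
def node_utilization(requested, allocatable):
    """Fraction of a node's allocatable capacity claimed by pod requests."""
    return requested / allocatable

def is_scale_down_candidate(cpu_requested_m, cpu_allocatable_m,
                            mem_requested_mi, mem_allocatable_mi,
                            unneeded_minutes, threshold=0.5, unneeded_time=10):
    """True if both CPU and memory requests are low for long enough."""
    # Use the larger of the two fractions so both resources must be under the threshold.
    utilization = max(node_utilization(cpu_requested_m, cpu_allocatable_m),
                      node_utilization(mem_requested_mi, mem_allocatable_mi))
    return utilization < threshold and unneeded_minutes >= unneeded_time

# 1000m of 4000m CPU and 2 GiB of 16 GiB memory requested, idle for 12 minutes.
print(is_scale_down_candidate(1000, 4000, 2048, 16384, unneeded_minutes=12))  # True
# Same node, but CPU requests at 75% of allocatable.
print(is_scale_down_candidate(3000, 4000, 2048, 16384, unneeded_minutes=12))  # False
```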

Resource Quotas and LimitRanges

ResourceQuotas and LimitRanges are the utility budgets and equipment specifications per floor, preventing the noisy-neighbor problem.

A ResourceQuota caps aggregate namespace consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"

A LimitRange sets defaults and boundaries for individual containers:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container

Without a LimitRange, containers with no resource specs run as BestEffort QoS — the first evicted under node pressure.
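To see why the LimitRange matters for eviction order, here is a simplified sketch of QoS classification. The function names and dictionary layout are hypothetical, quantities are compared as plain strings (so "1" and "1000m" would not match), and the defaulting helper only models a container that declares no resources at all.

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        requests = c.get("requests", {})
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for every resource, with request equal to limit.
            if resource not in limits:
                return "Burstable"
            if requests.get(resource, limits[resource]) != limits[resource]:
                return "Burstable"
    return "Guaranteed"

def with_limit_range_defaults(container, default_request, default_limit):
    """Copy a container, filling missing requests/limits from LimitRange defaults."""
    merged = dict(container)
    merged["requests"] = {**default_request, **container.get("requests", {})}
    merged["limits"] = {**default_limit, **container.get("limits", {})}
    return merged

bare = {"name": "app"}
defaulted = with_limit_range_defaults(
    bare,
    default_request={"cpu": "100m", "memory": "128Mi"},
    default_limit={"cpu": "500m", "memory": "512Mi"},
)

print(qos_class([bare]))       # BestEffort
print(qos_class([defaulted]))  # Burstable
```

With the defaults from the LimitRange above, a container that declared nothing moves from BestEffort to Burstable, so it is no longer first in line for eviction.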

Pod Disruption Budgets

Voluntary disruptions — node upgrades, cluster autoscaler scale-downs, manual drains — are a fact of cluster life. A PodDisruptionBudget (PDB) ensures these never compromise availability.

Visual Description: PodDisruptionBudget Protection Flow

sequenceDiagram
  participant CA as Cluster Autoscaler
  participant API as API Server
  participant PDB as PDB Controller
  participant POD as Application Pods
  CA->>API: Request eviction of pod on underutilized node
  API->>PDB: Check if eviction is allowed
  alt Disruption allowed
    PDB->>API: Allow eviction
    API->>POD: Evict pod
    CA->>CA: Proceed with node removal
  else Would violate minAvailable
    PDB->>API: Deny eviction
    CA->>CA: Skip node, try another
  end

Configure a PDB using minAvailable or maxUnavailable:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

⚠️ Common Misconception: PDBs do not protect against involuntary disruptions like node failures or kernel panics. They only guard against voluntary evictions. Always design for unexpected pod loss.

🤔 TRY BEFORE YOU SEE

You have a 5-replica Deployment with a PDB set to minAvailable: 2. The cluster autoscaler wants to scale down a node hosting 3 of these pods.

Predict what happens and why. Write your answer before reading on.

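Reveal: With 5 healthy pods and minAvailable: 2, the PDB allows 3 disruptions, so all 3 pods on the node can be evicted (5 → 4 → 3 → 2) and the node can be removed. At that point the budget is exhausted: minAvailable: 2 blocks any further eviction of these pods until the rescheduled replicas become ready elsewhere. Had the node hosted 4 of the pods, the fourth eviction would have been denied, and the autoscaler would have had to wait or choose a different node. This is how PDBs protect availability during voluntary disruptions.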
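The arithmetic behind the reveal can be written out in a few lines. This sketch uses a hypothetical helper name and handles only integer values of minAvailable and maxUnavailable; percentage values are not modeled.

```python
def disruptions_allowed(healthy_pods, expected_pods, min_available=None, max_unavailable=None):
    """How many more pods may be voluntarily evicted right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected_pods - max_unavailable
    # Never report a negative budget when the app is already below its floor.
    return max(0, healthy_pods - desired_healthy)

print(disruptions_allowed(5, 5, min_available=2))    # 3
print(disruptions_allowed(2, 5, min_available=2))    # 0 (budget exhausted)
print(disruptions_allowed(5, 5, max_unavailable=1))  # 1
```

minAvailable fixes a floor regardless of replica count, while maxUnavailable scales with the expected number of pods, which makes it the friendlier choice for Deployments that are also managed by an HPA.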

GKE in Practice

GKE Note: GKE integrates all three autoscalers natively. HPA works out of the box. VPA is available as a managed add-on. Cluster Autoscaler is built into every node pool with configurable min/max bounds.

Node Auto-Provisioning is a GKE-specific enhancement. Instead of scaling existing pools, it creates new node pools with custom machine types matching pending pod requirements — automatically providing GPUs or specific CPU architectures when needed.

For cost optimization, Committed Use Discounts (CUDs) offer significant price reductions when you commit to baseline compute for 1 or 3 years. Pair CUDs with autoscaling: use committed capacity for the predictable baseline workload, and let the autoscaler handle bursts with on-demand capacity. GKE's cost optimization console analyzes usage and suggests right-sizing opportunities.

GKE Note: GKE Autopilot abstracts most autoscaling configuration. It provides automatic scaling behavior and manages node provisioning behind the scenes with pod-based billing rather than provisioned-node billing.

🛑 PAUSE & RECALL — 3 minutes

Close your eyes and picture the smart building:

Which autoscaler is the portable heater? The thermostat? The building expansion?
What is the difference between minAvailable and maxUnavailable in a PDB?
Why use HPA with custom metrics (not CPU) when also using VPA in Auto mode?

Rate your confidence (0-4).

Lab: LAB-7.2 — Autoscaling (60 min)

Prerequisites

A running Kubernetes cluster (GKE Standard recommended)
kubectl configured and authenticated
hey load generator: go install github.com/rakyll/hey@latest

Step 1: Deploy the Application

kubectl create namespace autoscaling-lab
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loadgen-app
  namespace: autoscaling-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: loadgen-app
  template:
    metadata:
      labels:
        app: loadgen-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
            memory: 128Mi
EOF
kubectl expose deployment loadgen-app --type=LoadBalancer --port=80 -n autoscaling-lab

Step 2: Configure HPA

cat <<EOF | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loadgen-hpa
  namespace: autoscaling-lab
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: loadgen-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
EOF

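Verify: kubectl get hpa -n autoscaling-lab. Expect 0%/50% with 2 replicas. The target may read <unknown>/50% for the first minute or so, until the metrics pipeline has collected its first samples.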

Step 3: Generate Load and Observe Scaling

export LB_IP=$(kubectl get svc loadgen-app -n autoscaling-lab -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
hey -z 2m -c 50 http://$LB_IP/

In another terminal: kubectl get hpa -n autoscaling-lab -w

Watch CPU rise above 50% and replicas increase. When load stops, observe replicas gradually decrease.

Step 4: Set ResourceQuota and Test

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: lab-quota
  namespace: autoscaling-lab
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    pods: "5"
EOF
kubectl scale deployment loadgen-app --replicas=10 -n autoscaling-lab

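Expected: The kubectl scale command itself succeeds, but the Deployment stalls at 5 pods. The ReplicaSet cannot create pods beyond the pods: "5" and requests.cpu: "1" limits, and the quota error appears as FailedCreate events. Inspect them with kubectl describe replicaset -n autoscaling-lab or kubectl get events -n autoscaling-lab. Because the HPA from Step 2 still manages this Deployment, it will eventually reset the replica count to its own recommendation.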
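You can predict where the Deployment stalls from the quota and the per-pod requests. The helper below is a hypothetical sketch that assumes every pod in the namespace has identical requests (200m CPU and 128Mi memory, as in Step 1) and that no other pods consume the quota.

```python
def max_pods_under_quota(pod_cpu_m, pod_mem_mi, quota_cpu_m, quota_mem_mi, quota_pods):
    """Largest pod count that satisfies the CPU, memory, and pod-count quotas at once."""
    return min(quota_cpu_m // pod_cpu_m, quota_mem_mi // pod_mem_mi, quota_pods)

# requests.cpu: "1" is 1000m, requests.memory: 1Gi is 1024Mi, pods: "5".
print(max_pods_under_quota(200, 128, 1000, 1024, 5))  # 5
```

Here CPU and the pod count both cap the Deployment at 5 pods, while memory alone would have allowed 8.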

Step 5: Create PodDisruptionBudget and Test

cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: loadgen-pdb
  namespace: autoscaling-lab
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: loadgen-app
EOF

Attempt to drain a node: kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

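Observe that the drain respects the PDB: each pod is removed through the Eviction API, and any eviction that would drop loadgen-app below minAvailable: 2 is denied and retried. If only 2 replicas are running at this point, every eviction of a loadgen-app pod is refused, and the drain keeps retrying until replacements become ready elsewhere or you cancel it with Ctrl+C.

Clean up: kubectl uncordon <node-name>, then kubectl delete namespace autoscaling-lab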

Chapter Summary

Kubernetes autoscaling operates at three distinct layers. HPA adds or removes pod replicas based on demand — like deploying portable heaters. VPA adjusts CPU and memory per container — like tuning thermostats. Cluster Autoscaler expands and contracts the node pool — like leasing or releasing floors. ResourceQuotas enforce namespace-level budgets, and PodDisruptionBudgets protect availability during voluntary disruptions. Together, they create a self-regulating system. On GKE, node auto-provisioning and committed use discounts provide additional cost optimization tools that integrate seamlessly with native Kubernetes autoscaling.

📇 KEY CONCEPT CARDS

Q: What is the HPA scaling formula?
A: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). Multiply current replicas by the ratio of actual to target metric value, then round up.

Q: Why can't HPA and VPA run together on the same CPU/memory metric?
A: HPA calculates utilization ratios using resource requests, and VPA changes those requests. The solution: scale HPA on custom metrics (requests-per-second) while VPA handles CPU/memory right-sizing.

Q: What is the difference between ResourceQuota and LimitRange?
A: ResourceQuota limits aggregate resource consumption across a namespace. LimitRange sets default, minimum, and maximum resource values for individual containers.

Q: What does a PodDisruptionBudget protect against — and not protect against?
A: PDBs protect against voluntary disruptions (node drains, cluster autoscaler scale-down, upgrades) by limiting simultaneous evictions. They do NOT protect against involuntary disruptions like node failures or OOM kills.