Chapter 3.3 · Scheduling Deep Dive — Kubernetes Zero to Hero

You've deployed Pods with Deployments and watched them spread across nodes. But who decides which node a Pod lands on? Something in the cluster must look at your Pod's requirements and pick the best machine from dozens — or hundreds — of candidates. That something is the Kubernetes Scheduler, and understanding how it thinks separates casual users from cluster administrators.

Analogy: The Housing Assignment Office

Imagine a university campus with dozens of dormitory buildings. Some have quiet study floors, others house athletes near training facilities, some are designated for honors students, and a few contain specialized labs with expensive equipment. Every semester, the Housing Assignment Office matches each student (Pod) to the right building (Node).

The housing officers follow a rigorous process. First, they eliminate buildings that cannot accommodate a student — no vacancies, wrong designation, full capacity. Then, from the remaining options, they rank buildings by preference match — quiet floor requested? Close to science quad needed? Finally, they stamp the assignment and record it in the campus registry.

This is exactly how Kubernetes scheduling works. The scheduler watches for unassigned Pods, filters out unsuitable nodes, scores the remaining candidates, and binds the Pod to the winner.

Visual Description: Housing Assignment Flowchart

graph TD
    A[Unscheduled Pod: New Application] --> B{Filtering Phase}
    B -->|Node A: 4 CPU free| C[Passes Filter]
    B -->|Node B: 1.5 CPU free| D[Filtered Out]
    B -->|Node C: 3 CPU free| E[Passes Filter]
    B -->|Has taint app=database, no toleration| F[Filtered Out]
    C --> G{Scoring Phase}
    E --> G
    G -->|Score: 87, image cached + balanced| H[Best Match]
    G -->|Score: 62| I[Second Choice]
    H --> J[Bind Pod to Node]
    style A fill:#fff9c4
    style D fill:#ef9a9a
    style F fill:#ef9a9a
    style H fill:#a5d6a7
    style J fill:#ce93d8

How the Scheduler Works

The Scheduler runs a continuous loop, watching the API Server for Pods with an empty spec.nodeName. Each unscheduled Pod goes through two phases.

The Filtering Phase

The scheduler eliminates nodes that cannot run the Pod. A node is filtered out if it fails ANY hard constraint:

Resource availability: A Pod requesting 2 CPUs cannot land on a node with only 1.5 CPUs of unreserved allocatable capacity. The scheduler compares requests, not limits: it reserves the guaranteed minimum, not the burst ceiling.
Node selectors: A Pod with nodeSelector: {disktype: ssd} eliminates nodes without that label.
Affinity/anti-affinity rules: Hard constraints in requiredDuringScheduling remove non-matching nodes.
Taints: Nodes with NoSchedule or NoExecute taints that the Pod cannot tolerate are excluded.
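To make the hard constraints concrete, here is a minimal sketch of a Pod that exercises two of them. The Pod name, container name, and image are illustrative assumptions; the disktype: ssd label and the 2-CPU request come from the list above.

```yaml
# Illustrative Pod: "filter-demo" is a placeholder name, not from the chapter.
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo
spec:
  nodeSelector:
    disktype: ssd          # nodes without this label are filtered out
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: "2"         # nodes with less than 2 unreserved CPUs are filtered out
        limits:
          cpu: "4"         # not considered during filtering
```

If no node carries the label and has 2 CPUs unreserved, this Pod would be expected to stay Pending.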

The Scoring Phase

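If no node survives filtering, the Pod stays Pending and the scheduler records a FailedScheduling event. Otherwise, the scheduler ranks the remaining candidates across multiple criteria: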

Resource balance: Nodes whose CPU and memory utilization would stay evenly balanced after placing the Pod score higher.
Image locality: Nodes that already have the container image cached score higher.
Inter-pod affinity: Nodes already running Pods that match a preferred pod affinity rule receive bonus points.
Custom priorities: Cluster administrators can configure additional scoring plugins.

The highest-scoring node wins. The scheduler sends a bind request to the API Server, recording the Pod-to-Node assignment in etcd.
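After a successful bind, the assignment is visible on the Pod object itself. The fragment below sketches roughly what kubectl get pod <pod-name> -o yaml might show; node-1 is a placeholder node name and unrelated fields are omitted.

```yaml
# Excerpt of a scheduled Pod (most fields omitted)
spec:
  nodeName: node-1        # empty before scheduling; set by the bind step
status:
  conditions:
    - type: PodScheduled
      status: "True"
```

A Pod that is still waiting for a node has no spec.nodeName, which is exactly what the scheduler's watch loop looks for.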

GKE Note: On GKE Standard, the default scheduler runs on Google's managed control plane. On GKE Autopilot, Google handles scheduling transparently with additional cost and resource-packing optimizations.

🛑 PAUSE & RECALL — 2 minutes

What are the two scheduling phases, and what happens in each?
Does the scheduler look at resource requests or limits when filtering nodes?
What Pod status do you see if no nodes pass the filtering phase?

Rate your confidence (0-4).

Node Affinity and Anti-Affinity

Node affinity lets Pods express preferences or requirements for node characteristics. It is a more expressive version of nodeSelector, supporting operators like In, NotIn, Exists, and DoesNotExist for matching node labels. There are two critical flavors:

requiredDuringSchedulingIgnoredDuringExecution — a hard constraint. The Pod ONLY schedules on matching nodes; otherwise it stays Pending indefinitely. Like a student who requires a wheelchair-accessible room.

preferredDuringSchedulingIgnoredDuringExecution — a soft preference. The scheduler tries to match but places the Pod elsewhere if necessary. Each preference carries a weight (1-100) added to the node's score. Like a student who prefers a north-facing window but accepts any room.
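The two flavors can be combined in one Pod spec. The following is a sketch rather than a recommended configuration: it reuses the disktype: ssd label from earlier as the hard rule and an example zone as the soft preference, and the weight of 50 is an arbitrary illustrative value.

```yaml
# Pod spec fragment; label keys and values are examples
spec:
  affinity:
    nodeAffinity:
      # Hard constraint: only nodes labeled disktype=ssd are eligible
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: [ssd]
      # Soft preference: among eligible nodes, favor this zone
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: [us-central1-a]
```

The required block affects filtering, while the preferred block only contributes to scoring.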

⚠️ Common Misconception: Affinity rules are not enforced continuously. IgnoredDuringExecution means the rule is ignored once the Pod is running: if node labels change afterwards, Kubernetes does NOT evict the Pod. The rule only affects the initial placement decision.

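Use cases: Target GPU nodes with a label such as accelerator=nvidia-tesla-t4 for ML workloads; enforce compliance-zone placement with requiredDuringScheduling; spread replicas across failure domains with podAntiAffinity, the Pod-to-Pod counterpart that matches on the labels of Pods already running in a domain rather than on node labels.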
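The replica-spreading use case relies on podAntiAffinity, which is evaluated against other Pods' labels. A minimal sketch, assuming the replicas carry an example label app: web-server:

```yaml
# Pod template fragment; the app label is an assumed example
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web-server
          topologyKey: kubernetes.io/hostname   # no two matching Pods on the same node
```

Because this is a required rule, replicas beyond the number of eligible nodes would stay Pending. Swapping the topologyKey for topology.kubernetes.io/zone applies the same rule per zone instead of per node.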

Taints and Tolerations

If node affinity is Pods choosing nodes, taints and tolerations are nodes repelling Pods. This is the most confusing scheduling concept for beginners.

Taints: The "Keep Out" Signs

A taint on a node has a key, value, and effect. The effect determines what happens to Pods without a matching toleration:

NoSchedule: Pods without a toleration cannot be scheduled here. Existing Pods are unaffected.
PreferNoSchedule: The scheduler tries to avoid this node but will use it if necessary.
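NoExecute: The most aggressive effect. Running Pods without a matching toleration are evicted immediately, Pods whose toleration sets tolerationSeconds are evicted after that delay, and new Pods without a toleration cannot be scheduled.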

Apply a taint with kubectl:

kubectl taint nodes node-1 dedicated=gpu:NoSchedule
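Once applied, the taint is stored on the Node object. As a sketch, the relevant part of kubectl get node node-1 -o yaml would look roughly like this (other fields omitted):

```yaml
# Excerpt of the Node object after tainting
spec:
  taints:
    - key: dedicated
      value: gpu
      effect: NoSchedule
```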

Tolerations: The Special Pass

A toleration in the Pod spec says "I can handle this taint." It does NOT attract the Pod to the node — it merely removes the exclusion.

spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

⚠️ Common Misconception: A toleration does NOT force a Pod onto a tainted node. The Pod still needs to pass all other filters and score competitively. To force placement, combine tolerations with node selectors or affinity.
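One way to combine the two mechanisms is sketched below. It assumes the node also carries a label dedicated=gpu; that label is hypothetical and must be added separately, because a taint does not create a label.

```yaml
# Pod spec fragment: the toleration permits, the nodeSelector directs
spec:
  nodeSelector:
    dedicated: gpu           # assumed node label, not created by the taint
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```

Under that assumption, the label could be added with kubectl label nodes node-1 dedicated=gpu.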

Kubernetes also applies built-in taints automatically: node.kubernetes.io/not-ready (NoExecute) evicts Pods from failed nodes, by default after a 300-second grace period because Pods receive an automatic toleration with tolerationSeconds: 300; node.kubernetes.io/unschedulable (NoSchedule) blocks scheduling on cordoned nodes; and resource-pressure taints (node.kubernetes.io/memory-pressure, node.kubernetes.io/disk-pressure) prevent new Pods from landing on stressed nodes.
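For the NoExecute taints, timing is controlled by tolerations on the Pod. In a default cluster configuration, Pods that do not set their own values typically receive tolerations like the following; treat the fragment as a sketch of what kubectl get pod -o yaml may show rather than something you need to write yourself.

```yaml
# Tolerations commonly added to Pods by default
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300     # stay bound for 5 minutes before eviction
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```

Setting a smaller tolerationSeconds on your own Pods makes them fail over from an unhealthy node sooner.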

graph LR
    subgraph "Node: gpu-pool-1"
        N1[Taints: - dedicated=gpu:NoSchedule]
    end
    P1[Pod A: has toleration] -->|Allowed| N1
    P2[Pod B: no toleration] -->|Blocked| N1
    style N1 fill:#ffcc80
    style P1 fill:#a5d6a7
    style P2 fill:#ef9a9a

Pod Topology Spread Constraints

Topology spread constraints distribute related Pods evenly across failure domains defined by node labels. Common topology keys include topology.kubernetes.io/zone for availability zones, topology.kubernetes.io/region for geographic regions, and kubernetes.io/hostname for individual nodes.

The maxSkew parameter controls the maximum difference in pod count between any two domains. With 6 replicas across 3 zones and maxSkew: 1, the scheduler enforces a 2-2-2 distribution — no zone can have more than one extra Pod.

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web-server

With whenUnsatisfiable: DoNotSchedule, Pods that would violate the skew stay Pending until the imbalance resolves, perhaps through cluster autoscaling. With ScheduleAnyway, the scheduler still places the Pod but gives higher scores to nodes that reduce the skew.
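Both modes can appear in the same Pod spec. The sketch below keeps the hard zone constraint from above and adds a soft per-node constraint; pairing them this way is an illustrative choice, not something the chapter prescribes.

```yaml
# Pod template fragment with one hard and one soft constraint
spec:
  topologySpreadConstraints:
    # Hard: zones must stay within one Pod of each other
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: web-server
    # Soft: prefer an even spread across nodes, but never block scheduling
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: web-server
```

A node must satisfy every DoNotSchedule constraint to pass filtering; ScheduleAnyway constraints only influence scoring.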

graph LR
    subgraph "2-2-2: skew 0, valid with maxSkew=1"
        Z1[Zone A: 2]
        Z2[Zone B: 2]
        Z3[Zone C: 2]
    end
    subgraph "3-2-1: skew 2, invalid with maxSkew=1"
        Z4[Zone A: 3]
        Z5[Zone B: 2]
        Z6[Zone C: 1]
    end
    style Z1 fill:#a5d6a7
    style Z2 fill:#a5d6a7
    style Z3 fill:#a5d6a7
    style Z4 fill:#ef9a9a
    style Z6 fill:#ef9a9a

🤔 TRY BEFORE YOU SEE

You have 6 replicas spread across 3 zones with maxSkew: 1 and whenUnsatisfiable: DoNotSchedule. Zone C fails completely. Predict: how many Pods remain running? Where will the scheduler try to place the 2 replacements? What happens with DoNotSchedule?

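Reveal: Four Pods survive (2 in Zone A, 2 in Zone B). Replacement Pods cannot go to C because its nodes are down. As long as Zone C's Node objects still exist, the zone still counts as a domain with zero matching Pods, so adding a Pod to A or B would raise the skew to 3 and violate maxSkew: 1. With DoNotSchedule, the replacements stay Pending until C has schedulable nodes again. If the failed nodes are removed from the cluster entirely, Zone C stops counting as a domain and the replacements can land in A and B as 3-3. With ScheduleAnyway, they would land in A or B right away.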

Resource-Based Scheduling

The scheduler uses resource requests, not limits, for placement decisions. A Pod with requests: {cpu: 100m, memory: 128Mi} and limits: {cpu: 500m, memory: 512Mi} is scheduled based on the 100m and 128Mi figures. The scheduler subtracts the requests of all Pods already assigned to a node from that node's allocatable capacity to determine remaining room.

This means nodes can be overcommitted — total limits can exceed 100% capacity. This enables efficient utilization, but if too many Pods burst simultaneously, the node runs out of resources. Memory exhaustion triggers OOMKills; CPU overcommitment causes throttling. This is why setting accurate requests is both a scheduling necessity and a capacity planning discipline.
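The request and limit figures used above map onto a manifest like the following. The Pod name, container name, and image are placeholders chosen for illustration.

```yaml
# Illustrative Pod: only the resources block matters here
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:            # used by the scheduler for placement
          cpu: 100m
          memory: 128Mi
        limits:              # enforced at runtime, not considered for placement
          cpu: 500m
          memory: 512Mi
```

Ten such Pods on one node reserve only 1 CPU and 1280Mi of memory, yet together they are allowed to burst up to 5 CPUs and 5120Mi. That gap is the overcommitment described above.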

Production clusters need disciplined capacity monitoring. Track the ratio of requested resources to allocatable capacity across all nodes. As a rule of thumb, once average CPU or memory commitment exceeds about 80%, you are in the danger zone: a node failure or traffic spike can leave Pods unschedulable. Keep in mind that metrics-server reports actual usage (as shown by kubectl top), not requests; use it alongside GKE Cloud Monitoring alerts on cluster-level resource commitment.

Check node resource pressure manually:

# View detailed resource allocation for a node
kubectl describe node <node-name>

Look for the Allocated resources section. It shows requests and limits as absolute values and as percentages of allocatable capacity. When requests consistently exceed about 80%, consider scaling your node pool.

🛑 PAUSE & RECALL — 3 minutes

What is the difference between a taint and a toleration? Which goes on the node, which on the Pod?
With maxSkew: 2 across 3 zones and 8 replicas, is a 4-2-2 distribution valid? What is the skew?
Why does the scheduler use requests instead of limits? What would happen if it used limits?

Rate your confidence (0-4).

GKE Scheduling Features

Node Auto-Provisioning

Node auto-provisioning (NAP) automatically creates node pools when pending Pods cannot be scheduled on existing nodes. NAP analyzes the Pod's resource requests, node selectors, affinity rules, and tolerations, then provisions a node pool with an appropriate machine type. When those Pods are deleted and the auto-provisioned nodes sit empty, the cluster autoscaler scales the pool down and NAP deletes it.

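# Cluster-wide resource limits for auto-provisioning (values are illustrative; memory is in GB)
gcloud container clusters create my-cluster \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 100 \
  --min-memory 1 --max-memory 1000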

GKE Note: Node auto-provisioning is a key GKE differentiator. On self-managed Kubernetes you typically create node groups yourself or install and configure an autoscaler. GKE Autopilot handles this entirely transparently.

Node Pool Heterogeneity and gVisor

GKE supports multiple node pools per cluster with different machine types, labels, and taints. A typical setup might include a default-pool with e2-standard-4 for general work, a gpu-pool with a2-highgpu-1g machines tainted for ML jobs, a memory-pool with e2-highmem-8 for caches, and a spot-pool using Spot VMs (the successor to preemptible VMs) for fault-tolerant batch work. Pods target specific pools through node selectors, affinity rules, and tolerations.

GKE also supports gVisor (sandboxed containers) via runtimeClassName: gvisor for additional isolation of untrusted workloads. On GKE Autopilot, workload placement and node scaling happen entirely transparently — you specify resource requirements and constraints, and Google handles the rest.
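A sandboxed Pod differs from a regular one only in its runtimeClassName. The sketch below assumes a GKE Standard cluster that already has a node pool with GKE Sandbox enabled; the Pod name, container name, and image are placeholders.

```yaml
# Illustrative Pod that requests the gVisor runtime
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-demo
spec:
  runtimeClassName: gvisor   # run the containers inside the gVisor sandbox
  containers:
    - name: app
      image: nginx:1.25
```

Without a sandbox-enabled node pool, a Pod like this would be expected to stay Pending because no node can satisfy the runtime class.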

Lab: LAB-3.3 — Advanced Scheduling (60 min)

In this lab, you configure node affinity, apply taints and tolerations, and observe topology spread constraints through Kubernetes events.

Step 1: Inspect Node Labels

kubectl get nodes -L topology.kubernetes.io/zone -L node.kubernetes.io/instance-type

Step 2: Deploy with Node Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: zone-preferred-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zone-app
  template:
    metadata:
      labels:
        app: zone-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values: [us-central1-a]
      containers:
        - name: nginx
          image: nginx:1.25
          resources:
            requests: {cpu: 100m, memory: 128Mi}

Apply and observe placement: kubectl get pods -o wide -l app=zone-app

Step 3: Apply a Taint and Test Tolerations

kubectl taint nodes <node-name> workload=dedicated:NoSchedule

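Deploy a Pod without a toleration and pin it to the tainted node, for example with a nodeSelector on kubernetes.io/hostname. It stays Pending; without the pin, the scheduler would simply place it on another, untainted node. Check events with kubectl describe pod <pod-name>; you will see a FailedScheduling message that mentions the untolerated taint. Then deploy the same Pod with the matching toleration and verify it schedules successfully. The contrast shows that a toleration only removes the taint barrier; it was the nodeSelector, not the toleration, that directed the Pod to the tainted node.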
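One way to run this test is sketched below. Both Pods are pinned to the tainted node so the outcome does not depend on what other nodes are available; the Pod names are placeholders, and the sketch assumes the node's kubernetes.io/hostname label matches its node name, which is common but not guaranteed.

```yaml
# Pod 1: no toleration, expected to stay Pending
apiVersion: v1
kind: Pod
metadata:
  name: no-toleration
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-name>   # replace with the tainted node
  containers:
    - name: nginx
      image: nginx:1.25
---
# Pod 2: tolerates workload=dedicated:NoSchedule, expected to schedule
apiVersion: v1
kind: Pod
metadata:
  name: with-toleration
spec:
  nodeSelector:
    kubernetes.io/hostname: <node-name>   # replace with the tainted node
  tolerations:
    - key: "workload"
      operator: "Equal"
      value: "dedicated"
      effect: "NoSchedule"
  containers:
    - name: nginx
      image: nginx:1.25
```

Compare the two with kubectl get pods -o wide after applying the file.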

Step 4: Observe Scheduler Events

# All scheduling events
kubectl get events --field-selector reason=Scheduled

# Events for a specific pod
kubectl describe pod <pod-name> | grep -A 10 Events

Step 5: Configure Topology Spread

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: ha-web

Deploy 6 replicas and verify the distribution (2-2-2 on a cluster with nodes in three zones): kubectl get pods -l app=ha-web -o wide
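The constraint above is only a fragment of a Pod template. A complete Deployment that uses it might look like the following sketch; the container name, image, and resource requests mirror Step 2 and are otherwise arbitrary, and the name ha-web matches the clean-up command in Step 6.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ha-web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: ha-web
  template:
    metadata:
      labels:
        app: ha-web            # must match the constraint's labelSelector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: ha-web
      containers:
        - name: nginx
          image: nginx:1.25
          resources:
            requests: {cpu: 100m, memory: 128Mi}
```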

Step 6: Clean Up

kubectl taint nodes <node-name> workload=dedicated:NoSchedule-
kubectl delete deployment zone-preferred-app ha-web
# Also remove any standalone test Pods created in Step 3
kubectl delete pod <pod-name>

Chapter Summary

The Kubernetes Scheduler uses a filter-then-score algorithm: filtering eliminates nodes that lack resources or violate hard constraints, while scoring ranks survivors by preference match. Node affinity lets Pods express requirements or preferences for node characteristics. Taints repel unwanted Pods from nodes; tolerations grant Pods permission to schedule on tainted nodes. Topology spread constraints distribute replicas across failure domains using maxSkew to control distribution balance. On GKE, node auto-provisioning automatically creates appropriately-sized node pools for pending workloads.

📇 KEY CONCEPT CARDS

Q: What are the two scheduling phases?
A: Filtering eliminates nodes that cannot run the Pod (resource constraints, taints, affinity violations). Scoring ranks remaining nodes by preference match; the highest-scoring node receives the binding.

Q: What is the difference between requiredDuringScheduling and preferredDuringScheduling?
A: requiredDuringScheduling is a hard constraint — the Pod only schedules on matching nodes, or stays Pending. preferredDuringScheduling is a soft preference with a weight that influences scoring but does not block scheduling.

Q: Does a toleration force a Pod onto a tainted node?
A: No — a toleration only removes the taint's repelling effect. The Pod still needs to pass all other filter checks and score competitively. Combine tolerations with node affinity to force placement.

Q: How does maxSkew work in topology spread constraints?
A: maxSkew sets the maximum allowed difference in pod count between any two topology domains. With DoNotSchedule, violating Pods stay Pending. With ScheduleAnyway, the scheduler accepts the imbalance.