Chapter 8.1 · GKE Cluster Operations — Kubernetes Zero to Hero

Module 8: GKE Administration and Advanced Topics

The capstone module. Everything you need to operate production GKE clusters — from cost optimization to GitOps, multi-cluster management to disaster recovery.

Module 8 of 8 | Difficulty: Advanced to Expert

Welcome to Module 8 — GKE Mastery. You have learned Kubernetes inside and out — pods, networking, storage, security, scaling, observability. But all of that assumed the cluster already existed. This chapter is about building and operating the cluster itself: the property development firm, the utility grid, and the security contractor behind your Kubernetes city.

Analogy: Professional Property Management

Imagine you own rental properties. In Standard mode, you are the self-managing landlord — you buy buildings, hire contractors, screen tenants, and fix plumbing at 2 AM. Full control, full burden. In Autopilot mode, you hire a property management company — they handle maintenance, tenants, and bookkeeping. You give up some profit and day-to-day control, but you sleep through the night. In GKE Enterprise, you run a real estate empire across multiple states and need standardized processes, centralized accounting, and unified contracts from a portfolio management firm.

Kubernetes on GKE mirrors these precisely. Standard gives full control over nodes and networks at the cost of operational burden. Autopilot hands node management to Google so you focus on workloads. GKE Enterprise treats infrastructure as a unified fleet across clusters, regions, and clouds.

8.1.1 GKE Cluster Types: Standard vs. Autopilot

The first decision when creating a GKE cluster is choosing between Standard mode and Autopilot mode.

Standard mode is the full-control Kubernetes experience. You manage node pools, choose machine types, configure autoscaling, decide upgrade timing, and can SSH into worker nodes. Google manages the control plane, but nodes, OS, and local network are yours. You pay for provisioned node capacity — 30% utilization still costs 100% of the machine.

Autopilot mode is Google's opinionated managed Kubernetes. You define workloads; Google provisions nodes, scales, patches, and optimizes them. You cannot choose machine types, SSH into nodes, or run privileged containers. You pay per pod based on requested resources — no idle capacity costs.

Factor	Choose Standard When...	Choose Autopilot When...
Control	Custom node images, kernel tuning needed	Hands-off node management preferred
Cost	Steady workloads you can right-size	Variable workloads; avoid idle capacity
Security	Privileged containers, custom CNI required	Accept Google's restricted security model
Debugging	Need SSH to nodes for deep troubleshooting	App-level debugging via logs is sufficient
DaemonSets	Custom per-node agents required	Built-in agents cover your needs
Team	Dedicated platform engineers available	Small team focused on applications

Many organizations use a mixed-mode strategy: Autopilot for stateless apps, Standard for GPU-intensive ML or workloads needing privileged containers.

⚠️ Common Misconception: "Autopilot is just more expensive Standard." This is false — Autopilot charges per pod request, Standard charges per provisioned node. A Standard cluster with 20% utilization typically costs more than Autopilot for the same workloads.
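The utilization argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the "premium" is an assumed ratio rather than a published price, and it assumes Autopilot pods request roughly what they actually use.

```shell
# Effective cost multiplier on Standard: you pay for 100% of a node
# while using only util% of it.
# $1 = average utilization in percent (1-100)
effective_multiplier() {
  awk -v u="$1" 'BEGIN { printf "%.2f\n", 100 / u }'
}

# Pick the cheaper mode under a simplified model.
# $1 = Standard utilization in percent
# $2 = assumed Autopilot price per requested unit, relative to the
#      Standard price per provisioned unit (hypothetical, e.g. 1.5)
cheaper_mode() {
  awk -v u="$1" -v p="$2" 'BEGIN {
    standard = 100 / u   # Standard cost per unit actually used
    if (p < standard) print "autopilot"; else print "standard"
  }'
}

effective_multiplier 20   # prints 5.00
cheaper_mode 20 1.5       # prints autopilot
cheaper_mode 80 1.5       # prints standard
```

At 20% utilization every used vCPU effectively costs five times its list price, so Autopilot wins unless its per-unit premium exceeds that factor. At 80% utilization the same assumed premium tips the balance back to Standard.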

Visual Description: In Standard mode, you define node pools; Google manages the control plane above and the VPC beneath.

graph TD
    subgraph "Google-Managed Control Plane"
        CP["API Server + etcd + Scheduler"]
    end
    subgraph "Standard Cluster: Your Node Pools"
        subgraph "Pool A: General Purpose"
            N1["e2-standard-4"]
            N2["e2-standard-4"]
        end
        subgraph "Pool B: Spot Instances"
            N3["e2-medium Spot"]
            N4["e2-medium Spot"]
        end
    end
    CP --> N1
    CP --> N2
    CP --> N3
    CP --> N4
    style CP fill:#ffcc80
    style N1 fill:#90caf9
    style N2 fill:#90caf9
    style N3 fill:#a5d6a7
    style N4 fill:#a5d6a7

8.1.2 Node Management: Node Pools, Upgrades, and Cost Optimization

In Standard mode, node pools are groups of identically configured Compute Engine VMs — same machine type, disk, labels, and taints. Different specifications require different pools.

Node pool upgrades apply Kubernetes security patches and version updates. GKE uses surge upgrades: it creates new nodes on the target version, drains old nodes (evicts pods to reschedule), then deletes the old nodes. Control the pace with maxSurge (extra nodes) and maxUnavailable (simultaneously offline nodes). Maintenance windows restrict upgrades to specific time blocks — e.g., Saturday 2 AM to 6 AM.
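To see how maxSurge and maxUnavailable set the pace, here is a rough estimate of how many replacement rounds a pool needs. It is an idealized sketch: the function name is hypothetical, and it assumes nodes are replaced in groups of maxSurge + maxUnavailable with no retries or PodDisruptionBudget delays.

```shell
# Estimate upgrade rounds: ceil(nodes / (maxSurge + maxUnavailable)).
# $1 = node count, $2 = maxSurge, $3 = maxUnavailable
upgrade_rounds() {
  nodes=$1; max_surge=$2; max_unavailable=$3
  group=$((max_surge + max_unavailable))
  if [ "$group" -le 0 ]; then
    echo "maxSurge + maxUnavailable must be at least 1" >&2
    return 1
  fi
  # Integer ceiling division
  echo $(( (nodes + group - 1) / group ))
}

upgrade_rounds 10 1 0   # prints 10 (one node at a time, no capacity loss)
upgrade_rounds 10 2 1   # prints 4  (faster, but one node may be offline)
```

On an existing pool these values are set with the --max-surge-upgrade and --max-unavailable-upgrade flags of gcloud container node-pools update. Setting maxUnavailable to 0 keeps full capacity during the upgrade at the cost of temporarily paying for the surge nodes.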

Machine type selection balances cost and performance. e2-medium or e2-small suit dev/test. n2-standard works for production. Custom machine types define exact vCPU and memory ratios.

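Taints and tolerations isolate workloads. A taint on a node says "do not schedule here unless a pod tolerates this." For example, a taint such as nvidia.com/gpu=present:NoSchedule (which GKE applies to GPU node pools automatically) keeps general workloads off expensive GPU nodes so only GPU workloads land on them. The pod's toleration says "I accept this condition; schedule me." Together they work like a VIP entrance. Note that a toleration only permits scheduling onto a tainted node; it does not steer the pod there, so pair it with a node selector or node affinity when the pod must run on that pool.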
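As an illustration of the taint and toleration pairing, the sketch below renders a Job manifest that tolerates a workload-type=batch:NoSchedule taint and selects Spot nodes by label. The function name, Job name, and image are placeholders chosen for this example, not part of any GKE API.

```shell
# Print a Job manifest that tolerates the batch taint and targets Spot nodes.
render_batch_job() {
  cat <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report            # hypothetical name
spec:
  template:
    spec:
      restartPolicy: OnFailure
      tolerations:
      - key: workload-type         # must match the taint key on the node
        operator: Equal
        value: batch
        effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-spot: "true"   # label GKE sets on Spot nodes
      containers:
      - name: report
        image: busybox             # placeholder image
        command: ["sh", "-c", "echo processing batch; sleep 30"]
EOF
}

# Review the output, then pipe it into: kubectl apply -f -
render_batch_job
```

The toleration lets the pod through the "VIP entrance"; the nodeSelector is what actually sends it to the Spot pool.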

Spot VMs (and the older preemptible VMs) offer 60–91% discounts using excess Compute Engine capacity, but Google can reclaim them at any time with only about 30 seconds' notice. They are ideal for batch jobs, CI/CD runners, and dev environments. Avoid them for stateful or latency-sensitive serving workloads.

🛑 PAUSE & RECALL — 2 minutes

In the property management analogy, what does Standard mode correspond to? What does Autopilot correspond to?
Name two operational tasks Standard requires that Autopilot handles automatically.
When would you use Spot VMs, and what is the trade-off?
What is the relationship between a taint and a toleration?

Rate your confidence (0–4), then continue.

8.1.3 Cluster Networking: VPC-Native and Private Clusters

VPC-native clusters assign each pod an IP directly from a secondary VPC IP range. Pod IPs are routable from anywhere in your VPC — Compute Engine VMs, Cloud SQL, on-premises networks via VPN. Traffic flows directly without overlays. Autopilot only supports VPC-native networking.

Private clusters give worker nodes only private IPs. Nodes have no outbound internet access on their own; they reach the internet only if you configure Cloud NAT. The control plane endpoint can be private, public, or both. When it is public, always pair it with authorized networks, a CIDR allowlist for API access.
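Because private nodes depend on Cloud NAT for outbound traffic, a typical setup adds a Cloud Router and a NAT configuration in the cluster's region. The sketch below only prints the commands for review; the function name and the nat-router and nat-config resource names are placeholders.

```shell
# Print the gcloud commands that give private nodes outbound internet access.
# $1 = VPC network name, $2 = region
nat_plan() {
  network=$1; region=$2
  cat <<EOF
gcloud compute routers create nat-router --network=$network --region=$region
gcloud compute routers nats create nat-config --router=nat-router --region=$region --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges
EOF
}

nat_plan default us-central1
```

Without this, private nodes can still pull images from Google-hosted registries through Private Google Access, but pulls from other public registries may fail.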

IP planning is critical and hard to undo. A GKE cluster needs three non-overlapping ranges: nodes (primary VPC subnet), pods (secondary alias range), and Services (secondary range). For example: 10.0.0.0/24 for nodes, 10.4.0.0/14 for pods, 10.8.0.0/20 for Services. The pod range caps how many nodes and pods the cluster can hold, and the range chosen at creation cannot be resized afterward, so size it for growth.
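The capacity implied by the example ranges can be worked out with shell arithmetic. This sketch assumes the Standard-mode default of 110 max pods per node, for which GKE reserves a /24 of pod addresses per node; the function names are hypothetical.

```shell
# Number of addresses in an IPv4 prefix of length $1.
addresses_in_prefix() {
  echo $(( 1 << (32 - $1) ))
}

# Nodes a pod range can serve when each node takes its own slice.
# $1 = pod range prefix length, $2 = per-node prefix length (assumed /24)
max_nodes_for_pod_range() {
  echo $(( 1 << ($2 - $1) ))
}

addresses_in_prefix 24            # node subnet 10.0.0.0/24   -> 256
addresses_in_prefix 14            # pod range   10.4.0.0/14   -> 262144
addresses_in_prefix 20            # Services    10.8.0.0/20   -> 4096
max_nodes_for_pod_range 14 24     # nodes the pod range allows -> 1024
```

In this example the pod range could serve 1,024 nodes, but the /24 node subnet has only 256 addresses (a few of which Google Cloud reserves), so the node subnet is the tighter limit. Checking all three ranges together avoids discovering the bottleneck in production.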

graph TD
    subgraph "Private Cluster"
        E1["Control Plane<br/>Authorized Networks Only"]
        N1["Node Subnet: 10.0.0.0/24"]
        P1["Pod CIDR: 10.4.0.0/14"]
        S1["Service CIDR: 10.8.0.0/20"]
    end
    NAT["Cloud NAT"]
    FW["Firewall Rules"]
    EXT["Internet"]
    N1 --> NAT --> EXT
    FW --> E1
    style E1 fill:#ef9a9a
    style NAT fill:#ffcc80
    style EXT fill:#a5d6a7

8.1.4 Cluster Security: Workload Identity and Hardening

Workload Identity eliminates long-lived GCP service account keys. Instead of downloading keys and mounting them as Secrets, bind a Kubernetes ServiceAccount to a GCP IAM ServiceAccount. GKE's metadata server intercepts the pod's credential requests and returns short-lived OAuth tokens. Setup requires three steps: (1) enable Workload Identity via --workload-pool=PROJECT_ID.svc.id.goog, (2) grant roles/iam.workloadIdentityUser with the specific namespace and ServiceAccount, (3) annotate the Kubernetes ServiceAccount with the GCP ServiceAccount email. No keys stored. Tokens expire automatically.

Workload Identity Federation extends this to multi-cloud — workloads in EKS or AKS exchange native identity tokens for temporary GCP credentials.

Private endpoint with authorized networks keeps the API server off the public internet. Shielded nodes enable secure boot and integrity monitoring. Binary Authorization enforces deployment-time policies — e.g., requiring signed images or vulnerability scan passes — blocking non-compliant deployments before pods are created.

🛑 PAUSE & RECALL — 2 minutes

What security problem does Workload Identity solve?
Why must the pod CIDR range be sized carefully at cluster creation?
What does Binary Authorization enforce at deployment time?
How do Shielded GKE Nodes protect against boot-level tampering?

Rate your confidence (0–4), then continue.

8.1.5 Cluster Maintenance: Release Channels and Upgrade Strategies

GKE release channels manage Kubernetes versions: Rapid (latest features), Regular (balanced), Stable (maximum stability). Enrolling lets Google manage versions automatically. For production, choose Regular or Stable.

Maintenance windows define when upgrades are permitted; maintenance exclusions define blackout periods. Surge upgrades create new nodes, drain old ones, then delete them, at a pace controlled by maxSurge and maxUnavailable. If a node pool upgrade fails or misbehaves, you can roll that pool back to its previous version. Control plane upgrades are much harder to reverse, so validate new versions in a non-production cluster first. For changes that cannot happen in-place, such as switching machine types, use node pool migration: create a new pool, cordon and drain the old one, then delete it.
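The node pool migration steps can be sketched as a short command plan. The function below only prints the commands so they can be reviewed before running; the function name and pool names are hypothetical, and it assumes CLUSTER_NAME and REGION are exported in the shell that eventually runs them.

```shell
# Print the steps to move workloads from one node pool to another.
# $1 = old pool, $2 = new pool, $3 = machine type for the new pool
migration_plan() {
  old_pool=$1; new_pool=$2; machine_type=$3
  cat <<EOF
gcloud container node-pools create $new_pool --cluster=\$CLUSTER_NAME --region=\$REGION --machine-type=$machine_type
kubectl cordon -l cloud.google.com/gke-nodepool=$old_pool
kubectl drain -l cloud.google.com/gke-nodepool=$old_pool --ignore-daemonsets --delete-emptydir-data
gcloud container node-pools delete $old_pool --cluster=\$CLUSTER_NAME --region=\$REGION
EOF
}

migration_plan default-pool n2-pool n2-standard-4
```

Cordoning first stops new pods from landing on the old nodes, and draining then evicts the existing ones so they reschedule onto the new pool. Confirm workloads are healthy on the new pool before running the final delete.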

8.1.6 GKE Enterprise Features: Fleet-Scale Operations

GKE Autopilot uses per-pod billing: you pay for requested CPU, memory, and storage. Google handles node sizing, bin packing, patching, and upgrades. It enforces security by default: no privileged pods, no host access, and resource requests on every container (Autopilot applies default requests to containers that omit them).

GKE Enterprise (formerly Anthos) is Google's multi-cluster management platform. Registering clusters creates a fleet — a logical group sharing consistent configuration, policy, and identity. Namespace sameness means a ServiceAccount in the production namespace is the same identity across every cluster in the fleet.

Config Sync continuously synchronizes Git repository configurations to every fleet cluster. Define namespaces, RBAC, and quotas in Git once, and every cluster converges to that state. Ad-hoc changes are automatically reverted.
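A minimal sketch of what a Config Sync source definition can look like is shown below, rendered as a RootSync object. The function name, repository URL, branch, and directory are placeholders, and auth: none assumes a publicly readable repository.

```shell
# Print a RootSync manifest that points a cluster at a Git repository.
# $1 = Git repository URL
render_root_sync() {
  repo_url=$1
  cat <<EOF
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    repo: $repo_url
    branch: main              # assumed default branch
    dir: clusters/common      # hypothetical directory in the repo
    auth: none                # public repo; private repos need credentials
EOF
}

# Review the output, then pipe it into: kubectl apply -f -
render_root_sync https://github.com/example-org/fleet-config
```

Once applied, the cluster keeps reconciling against the named directory, which is how ad-hoc changes get reverted.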

Policy Controller enforces OPA Gatekeeper policies fleet-wide — e.g., "all pods must have resource limits" — ensuring compliance through automation, not manual enforcement.
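The "all pods must have resource limits" policy can be expressed as a Gatekeeper constraint. The sketch below assumes the K8sContainerLimits template from the Gatekeeper constraint template library is installed on the cluster; the function name, constraint name, and maximum values are illustrative.

```shell
# Print a constraint requiring CPU and memory limits on pod containers.
# $1 = maximum allowed CPU limit, $2 = maximum allowed memory limit
render_limits_constraint() {
  max_cpu=$1; max_memory=$2
  cat <<EOF
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sContainerLimits
metadata:
  name: containers-must-have-limits   # hypothetical name
spec:
  enforcementAction: dryrun            # report violations only; use deny to block
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    cpu: "$max_cpu"
    memory: "$max_memory"
EOF
}

# Review the output, then pipe it into: kubectl apply -f -
render_limits_constraint 2 4Gi
```

Starting in dryrun mode surfaces existing violations in audit results without breaking deployments, which makes it safer to roll a new policy out across a fleet before switching to deny.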

graph TD
    subgraph "GKE Enterprise Fleet"
        HUB["Fleet Hub"]
        SYNC["Config Sync"]
        POL["Policy Controller"]
    end
    subgraph "Cluster A: us-central1"
        C1["GKE: production ns"]
    end
    subgraph "Cluster B: us-east1"
        C2["GKE: production ns"]
    end
    GIT["Git Repository"]
    HUB --> C1
    HUB --> C2
    GIT --> SYNC --> C1
    SYNC --> C2
    POL --> C1
    POL --> C2
    style HUB fill:#ffcc80
    style SYNC fill:#ce93d8
    style C1 fill:#90caf9
    style C2 fill:#a5d6a7
    style GIT fill:#b39ddb

GKE in Practice

GKE Note: This chapter is entirely GKE-specific. Standard mode, Autopilot, node pools, VPC-native networking, private clusters, Workload Identity, Shielded Nodes, Binary Authorization, release channels, GKE Enterprise, Config Sync, and Policy Controller are all described here as GKE implements them. Other cloud providers offer comparable capabilities (managed node groups, pod-level cloud identity, private API endpoints), but the names, flags, and behavior in this chapter apply only to GKE.

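GKE Note: Enable Workload Identity at Standard cluster creation with --workload-pool=PROJECT_ID.svc.id.goog. You can also enable it later with gcloud container clusters update --workload-pool=PROJECT_ID.svc.id.goog, but existing node pools keep their old metadata behavior until you update each one with --workload-metadata=GKE_METADATA, so enabling it up front is simpler. For production, always use VPC-native (--enable-ip-alias), private nodes (--enable-private-nodes), and the Regular or Stable release channel.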

🤔 TRY BEFORE YOU SEE

Design GKE infrastructure for a SaaS company with: (1) a stateless web API serving customer traffic, (2) a nightly batch job tolerant of interruptions, and (3) a GPU-based ML inference service. They have two platform engineers and a tight budget.

Standard, Autopilot, or mixed strategy? Why?
How many node pools, and what configurations?
Public or private cluster?
Which security features now, which deferred?

Recommended: Mixed strategy. Use Autopilot for the web API (no node management), plus a Standard cluster with three pools: general-purpose (e2-standard), a spot pool (e2-medium, tainted for batch) for 60–91% savings, and a GPU pool with taints for ML. Use a private cluster with authorized networks, Workload Identity, Shielded Nodes, and the Regular release channel. Defer Binary Authorization until the team has capacity.

Lab: LAB-8.1 — GKE Cluster Operations (75 min)

Create a Standard GKE cluster with custom node pools, private cluster settings, Workload Identity, and a maintenance window.

Part 1: Create Cluster with Node Pools (25 min)

export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
export CLUSTER_NAME=standard-ops-lab

Create the Standard cluster (with --region, --num-nodes=1 means one node per zone, typically three nodes in total):

gcloud container clusters create $CLUSTER_NAME \
  --region=$REGION --enable-ip-alias --enable-private-nodes \
  --master-ipv4-cidr=172.16.0.32/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks=$(curl -s ifconfig.me)/32 \
  --workload-pool=$PROJECT_ID.svc.id.goog \
  --enable-shielded-nodes --shielded-secure-boot \
  --shielded-integrity-monitoring --release-channel=regular \
  --num-nodes=1 --machine-type=e2-medium

Create a spot node pool for batch workloads:

gcloud container node-pools create spot-batch-pool \
  --cluster=$CLUSTER_NAME --region=$REGION \
  --machine-type=e2-medium --spot --num-nodes=0 \
  --enable-autoscaling --min-nodes=0 --max-nodes=3 \
  --node-taints=workload-type=batch:NoSchedule

Verify:

gcloud container node-pools list --cluster=$CLUSTER_NAME --region=$REGION

Part 2: Configure Private Access (10 min)

gcloud container clusters update $CLUSTER_NAME --region=$REGION \
  --enable-master-authorized-networks \
  --master-authorized-networks=$(curl -s ifconfig.me)/32,10.0.0.0/8

gcloud container clusters get-credentials $CLUSTER_NAME --region=$REGION
kubectl get nodes -o wide

Part 3: Set Up Workload Identity (25 min)

gcloud iam service-accounts create workload-identity-sa \
  --display-name="Workload Identity Lab SA"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:workload-identity-sa@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

kubectl create namespace workload-id-lab

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: workload-sa
  namespace: workload-id-lab
  annotations:
    iam.gke.io/gcp-service-account: workload-identity-sa@$PROJECT_ID.iam.gserviceaccount.com
EOF

gcloud iam service-accounts add-iam-policy-binding \
  workload-identity-sa@$PROJECT_ID.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:$PROJECT_ID.svc.id.goog[workload-id-lab/workload-sa]"

Deploy a test pod and verify Workload Identity works:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: workload-id-test
  namespace: workload-id-lab
spec:
  serviceAccountName: workload-sa
  containers:
  - name: gcloud
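    # Google-hosted image: private nodes without Cloud NAT may not be able to pull from Docker Hub
    image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim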
    command: ["sleep", "300"]
EOF

kubectl wait --for=condition=Ready pod/workload-id-test -n workload-id-lab --timeout=120s
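kubectl exec -n workload-id-lab workload-id-test -- gcloud auth list

Expected: gcloud reports workload-identity-sa@PROJECT_ID.iam.gserviceaccount.com as the active account even though the pod holds no credential files. Workload Identity handles authentication transparently. To exercise the granted role, list a bucket you own with gcloud storage ls gs://BUCKET_NAME. A bare gcloud storage ls lists the project's buckets, which roles/storage.objectViewer does not permit.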

Part 4: Configure Maintenance Window (15 min)

gcloud container clusters update $CLUSTER_NAME --region=$REGION \
  --maintenance-window-start=2024-01-01T02:00:00Z \
  --maintenance-window-end=2024-01-01T06:00:00Z \
  --maintenance-window-recurrence="FREQ=WEEKLY;BYDAY=SA"

Cleanup

gcloud container clusters delete $CLUSTER_NAME --region=$REGION --quiet
gcloud iam service-accounts delete workload-identity-sa@$PROJECT_ID.iam.gserviceaccount.com --quiet

Chapter Summary

You compared Standard and Autopilot modes using the property management analogy and built a decision matrix. You explored node pools, surge upgrades, taints and tolerations, and spot instances. You examined VPC-native networking, private clusters, and IP allocation planning. You learned Workload Identity for eliminating long-lived keys, Shielded Nodes for boot integrity, and Binary Authorization for deployment-time security. You covered release channels, maintenance windows, and GKE Enterprise's fleet management with Config Sync and Policy Controller.

📇 KEY CONCEPT CARDS

Q: What is the fundamental difference between GKE Standard and Autopilot in management and pricing?
A: Standard: you manage node pools, pay for provisioned capacity regardless of utilization. Autopilot: Google manages nodes, you pay per pod request — no idle capacity charges.

Q: What are the three IP ranges for a VPC-native GKE cluster, and why is the pod CIDR most critical?
A: Node subnet (primary), pod CIDR (secondary alias), Service CIDR (secondary). The pod CIDR determines maximum cluster pod capacity and cannot change after creation.

Q: How does Workload Identity work, and what security problem does it solve?
A: It binds a Kubernetes ServiceAccount to a GCP IAM ServiceAccount. GKE's metadata server returns short-lived OAuth tokens to pods, eliminating the need to store long-lived service account keys as Secrets.

Q: In the property management analogy, what does each GKE model represent?
A: Standard = self-managing landlord. Autopilot = property management company. GKE Enterprise = portfolio management firm for multi-cluster operations with standardized policies and fleet-wide enforcement.