Running applications is only half the battle — you need to see what's happening, scale with demand, and survive failures. This module covers the operational excellence that separates beginners from professionals.
Module 7 of 8 | Difficulty: Advanced
In the previous module, you built a fortress — RBAC, Pod Security Standards, Network Policies, and audit logging. But a fortress without sentries is just a tomb. How do you know your applications are healthy? How do you detect problems before your users do? And when things break, how do you reconstruct what happened? Welcome to observability — understanding your cluster through the signals it emits.
The Three Pillars of Observability
Analogy: Hospital Patient Monitoring
Imagine a patient in a hospital. Three systems track their wellbeing. The bedside monitor shows continuous vital signs — heart rate at 72 BPM, blood pressure at 120/80. These are metrics: numerical, time-bound, telling you what is happening now. The nurse's chart reads: "10:15 AM — administered medication; 11:00 AM — blood pressure dropped to 95/60." These are logs: discrete event records with full context, telling you what specifically happened. When the patient visits X-ray, then lab, then a specialist, the hospital tracks their journey across departments. This is a trace: it follows a single request to show how the pieces connect.
In Kubernetes, the same three pillars apply. Metrics are numerical measurements — CPU, memory, latency. Logs are timestamped records — errors, access logs, audit events. Traces follow requests across services, recording timing at each hop. You need all three: metrics detect anomalies, traces locate them, logs explain them. Metrics alone show latency spiked at 2:47 AM but not why. Logs reveal an error but not the causal chain. Traces show service C is slow but not whether it is CPU-bound.
Visual Description: The three pillars form a unified observability stack. The cluster generates signals — logs from containers, metrics from the metrics pipeline, traces from request instrumentation. These flow into three parallel collection systems that converge in dashboards and alerting.
Kubernetes Metrics Architecture
The Kubernetes metrics pipeline is layered. metrics-server is the in-cluster aggregator — it collects CPU and memory data from kubelets for kubectl top and the Horizontal Pod Autoscaler. It keeps only the latest data point in memory: a snapshot camera, not a video recorder. cAdvisor runs inside each kubelet, collecting container-level statistics exposed via /metrics/cadvisor. kube-state-metrics listens to the API Server and generates metrics about object states — Pods in Pending, Deployments with unmatched replicas. node-exporter runs as a DaemonSet exposing hardware-level metrics — disk space, filesystem usage, network stats.
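To make the metrics-server role concrete, here is a minimal sketch of a HorizontalPodAutoscaler that consumes the CPU data metrics-server provides. The names web-hpa and web, the replica bounds, and the 70% target are illustrative assumptions, not values from this module.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical Deployment; its containers must set CPU requests
  minReplicas: 2               # assumed bounds, tune for your workload
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of requested CPU, averaged across Pods
```

Utilization is measured against each container's CPU request, so the autoscaler cannot compute a percentage for Pods that omit requests.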
GKE Note: On GKE, metrics-server is installed by default, and Google Cloud Monitoring collects cluster metrics automatically. You still want Prometheus for custom application metrics and granular control.
Prometheus and Grafana
Prometheus is the de facto Kubernetes monitoring standard. Its architecture is pull-based: Prometheus periodically scrapes HTTP /metrics endpoints exposed by targets, storing data in its time-series database.
Visual Description: The Prometheus server sits at the center, configured with scrape targets via Kubernetes service discovery. It pulls metrics from application pods, node-exporters, kube-state-metrics, and cAdvisor. Alerts flow to Alertmanager; queries are served to Grafana.
Prometheus uses PromQL to query time-series data. A query like sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) computes the per-second CPU usage rate over the last 5 minutes and sums it per pod, so a result of 0.5 means the pod is using about half a CPU core. Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to email, Slack, or PagerDuty. Grafana queries Prometheus and renders dashboards. The Prometheus Operator simplifies Kubernetes deployment using custom resources like ServiceMonitor for automatic target discovery.
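As a rough sketch of how these pieces fit together under the Prometheus Operator, the manifests below declare a scrape target and an alert built on the PromQL query above. The application name example-app, the port name metrics, the 0.9 core threshold, and the release: monitoring label (which assumes a kube-prometheus-stack release named monitoring with default selectors) are all assumptions for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring        # assumed label the Prometheus instance selects on
spec:
  namespaceSelector:
    matchNames:
      - default                # namespace where the target Service lives
  selector:
    matchLabels:
      app: example-app         # must match the labels on the Service
  endpoints:
    - port: metrics            # name of the Service port, not a number
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-cpu-alert      # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring        # assumed label, as above
spec:
  groups:
    - name: example.rules
      rules:
        - alert: PodHighCpu
          # Fires when a pod uses more than 0.9 CPU cores (assumed threshold)
          expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.9
          for: 10m             # condition must hold for 10 minutes before firing
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} has used more than 0.9 CPU cores for 10 minutes"
```

Once the rule fires, Prometheus hands the alert to Alertmanager, which decides where to route it based on labels such as severity.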
Logging Patterns
Applications write to stdout and stderr; the container runtime captures these to node-local files. From there, you need a strategy to centralize them. The node-level logging agent pattern runs Fluent Bit or Fluentd as a DaemonSet on every node, tailing container log files and forwarding to a central store. This is the most common pattern — efficient, with one agent per node handling all pods. The sidecar log shipper runs a logging agent inside each Pod, reading from a shared volume. Use this when you need per-application log processing or different destinations per app.
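The sidecar pattern can be sketched as a Pod in which the application and a Fluent Bit container share an emptyDir volume. Everything named here is an assumption for illustration: the busybox stand-in application, the /var/log/app path, the image tags, and the stdout output, which a real setup would replace with its log store.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sidecar-fluent-bit     # hypothetical name
data:
  fluent-bit.conf: |
    [INPUT]
        Name  tail
        Path  /var/log/app/*.log
    [OUTPUT]
        Name  stdout
        Match *
---
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar   # hypothetical name
spec:
  volumes:
    - name: app-logs
      emptyDir: {}             # shared scratch space, deleted with the Pod
    - name: fluent-bit-config
      configMap:
        name: sidecar-fluent-bit
  containers:
    - name: app                # stand-in application that writes JSON lines to a file
      image: busybox:1.36
      command:
        - sh
        - -c
        - |
          while true; do
            echo '{"event": "heartbeat"}' >> /var/log/app/app.log
            sleep 5
          done
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-shipper
      image: fluent/fluent-bit:2.2   # assumed tag; pin the version you have validated
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/fluent-bit.conf
          subPath: fluent-bit.conf
```

The trade-off is visible in the manifest: every Pod carries its own agent and configuration, which buys per-application control at the cost of extra containers compared with one DaemonSet agent per node.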
Structured logging, meaning logs written as JSON, is essential for production. Unstructured logs like User john logged in require expensive parsing. Structured logs like {"event": "user_login", "user": "john"} are immediately parseable, filterable, and alertable. Define retention policies aligned with your compliance requirements; one example tiering is 30 days hot, 90 days warm, 1 year cold. Application error logs typically need longer retention than routine access logs.
Container Probes
Probes are Kubernetes' built-in health check mechanism. They are how Kubernetes decides whether your container is alive, ready, or still starting.
Liveness probe: Is this container running, or is it stuck? If it fails, Kubernetes kills and restarts the container. Use it to detect deadlocks, infinite loops, or memory leaks that freeze an application without crashing it.
Readiness probe: Is this container ready to accept traffic? If it fails, Kubernetes removes the Pod from Service endpoints. No traffic is routed to it. When it passes again, the Pod is re-added. Readiness failure does NOT restart the container.
⚠️ Common Misconception: Do not configure readiness probes to check external dependencies like databases. If the database becomes unavailable, every Pod marks itself not ready and your entire service goes offline — even though the application code is fine. Check only whether your application itself is ready to serve.
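A readiness probe that follows this advice might look like the fragment below. The /healthz/ready path and port 8080 are hypothetical; the assumption is that the handler behind them reports only in-process state (configuration loaded, caches warmed) and does not call the database.

```yaml
# Fragment of a container spec, not a complete manifest.
readinessProbe:
  httpGet:
    path: /healthz/ready       # hypothetical endpoint that checks only the app itself
    port: 8080                 # hypothetical application port
  periodSeconds: 5
  failureThreshold: 3          # removed from Service endpoints after 3 consecutive failures
```

With this shape, a database outage surfaces as application errors and alerts rather than as every replica silently dropping out of the Service.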
Startup probe exists for slow-starting applications — Java apps that take 90 seconds to initialize, ML models loading large weights. It disables liveness and readiness checks until it succeeds. Without it, Kubernetes may kill your container before it finishes starting.
| Probe Type | How It Works | Best For |
|---|---|---|
| httpGet | HTTP GET request | Web applications |
| tcpSocket | TCP connection attempt | Databases, non-HTTP services |
| exec | Command inside container | Custom health checks |
| grpc | gRPC health protocol | gRPC services |
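For comparison, the four mechanisms can be sketched side by side in a single Pod. This is an illustration only: the Pod name, the /tmp/healthy marker file, port 9090, and the gRPC image are assumptions, and the gRPC container is expected to implement the standard gRPC health checking service.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-mechanisms       # hypothetical name
spec:
  containers:
    - name: web
      image: nginx:1.25
      livenessProbe:
        httpGet:               # passes on any 2xx or 3xx response
          path: /
          port: 80
    - name: cache
      image: redis:7
      livenessProbe:
        tcpSocket:             # passes if the TCP connection opens
          port: 6379
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "touch /tmp/healthy && sleep 3600"]
      livenessProbe:
        exec:                  # passes if the command exits with status 0
          command: ["cat", "/tmp/healthy"]
    - name: rpc
      image: registry.example.com/grpc-app:1.0   # hypothetical image
      ports:
        - containerPort: 9090
      livenessProbe:
        grpc:                  # calls the gRPC health checking service on this port
          port: 9090
```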
Tune failureThreshold and periodSeconds conservatively. A common starting point: liveness with periodSeconds: 10, failureThreshold: 3 — giving 30 seconds of failed checks before restart.
🛑 PAUSE & RECALL
Without looking back:
- Your liveness probe keeps restarting a container that takes 60 seconds to start. What probe are you missing?
- Should a readiness probe check a database connection? Why or why not?
- Name the three pillars of observability and give one Kubernetes example of each.
Rate your confidence (0-4).
GKE in Practice: Google Cloud Observability
GKE Note: GKE integrates deeply with Google Cloud Operations suite, providing managed observability that reduces operational overhead.
On GKE, the managed path is strongly recommended for production.
Managed Prometheus for GKE (GMP) is a fully managed Prometheus-compatible service. You write PromQL against it as you would self-managed Prometheus, but Google manages the TSDB. Enable it with:
gcloud container clusters update $CLUSTER_NAME \
  --enable-managed-prometheus --region=$REGION
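With managed collection enabled, scrape targets are declared with a PodMonitoring resource rather than a ServiceMonitor. The sketch below assumes a workload labelled app: example-app that exposes a container port named metrics; both names are hypothetical.

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app            # hypothetical name
spec:
  selector:
    matchLabels:
      app: example-app         # must match the labels on the target Pods
  endpoints:
    - port: metrics            # named container port that serves /metrics
      interval: 30s
```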
Cloud Logging automatically collects container stdout/stderr via a Fluent Bit DaemonSet on every node. No configuration needed. Query logs with Logging Query Language: resource.type="k8s_container" resource.labels.cluster_name="prod" severity>=ERROR.
Pre-built GKE dashboards in Cloud Monitoring show cluster health, node utilization, and pod status out of the box. SLO monitoring lets you define objectives like "99.9% of requests under 200ms" and track error budgets — the foundation of Site Reliability Engineering on GKE.
Lab: LAB-7.1 — Monitoring and Logging (60 min)
You will deploy Prometheus and Grafana via Helm, import a dashboard, and configure container probes.
Step 1: Add the Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Step 2: Install the kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.service.type=LoadBalancer
kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus -n monitoring --timeout=120s
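If you prefer to keep chart settings in version control, the same override can be expressed as a values file. This is an equivalent sketch of the --set flag above; the file name values.yaml is an assumption.

```yaml
# values.yaml (hypothetical file name)
grafana:
  service:
    type: LoadBalancer         # same effect as --set grafana.service.type=LoadBalancer
```

Pass it with -f values.yaml in place of the --set flag on the helm install command.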
Step 3: Access Grafana
kubectl get secret --namespace monitoring monitoring-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo
kubectl get svc monitoring-grafana -n monitoring
Log in as admin with the retrieved password. Browse to Dashboards → Kubernetes and explore the cluster dashboard.
Step 4: Deploy an application with probes
Create probe-demo.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 5
            failureThreshold: 10
---
apiVersion: v1
kind: Service
metadata:
  name: probe-demo
spec:
  selector:
    app: probe-demo
  ports:
    - port: 80
      targetPort: 80
Apply and verify:
kubectl apply -f probe-demo.yaml
kubectl describe pod -l app=probe-demo
Look for probe status in the output:
Liveness: http-get http://:80/ delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:80/ delay=3s timeout=1s period=5s #success=1 #failure=2
Startup: http-get http://:80/ delay=2s timeout=1s period=5s #success=1 #failure=10
Step 5: Observe probe failure behavior
Simulate a liveness failure:
kubectl exec -it deploy/probe-demo -- sh -c "rm /usr/share/nginx/html/index.html"
kubectl get pods -l app=probe-demo -w
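The container restarts. With index.html gone, nginx answers requests for / with 403 Forbidden, which counts as a probe failure; after three consecutive failures (about 30 seconds) the kubelet restarts the container, and the fresh container filesystem restores the file. Because kubectl exec deploy/probe-demo targets a single Pod, only one of the two replicas is affected. Check events: kubectl describe pod -l app=probe-demo | grep -A5 Events. You will see Liveness probe failed followed by will be restarted.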
Step 6: View Cloud Logging integration
Navigate to Cloud Console → Logging → Logs Explorer. Query: resource.type="k8s_container" resource.labels.pod_name=~"probe-demo-.*"
Cleanup:
helm uninstall monitoring -n monitoring
kubectl delete -f probe-demo.yaml
kubectl delete namespace monitoring
🤔 TRY BEFORE YOU SEE
You have a Java Spring Boot app that takes 45 seconds to initialize. During startup it is not ready for traffic. You only have a liveness probe with periodSeconds: 5 and failureThreshold: 3. The container keeps getting restarted before it finishes starting.
Write the probe configuration to fix this. What probe type gates the others during startup? What parameters would you tune?
Solution: Add a startup probe to protect the initialization period:
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12  # 10s + 12*5s = 70s max startup
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
🛑 PAUSE & RECALL
Without looking back:
- What is the difference between a DaemonSet log agent and a sidecar log shipper?
- Prometheus uses pull-based collection. Name one advantage over push-based models.
- Your readiness probe is failing. Will Kubernetes restart your container? What happens to traffic?
Rate your confidence (0-4).
Chapter Summary
This chapter established the three pillars of observability — metrics, logs, and traces — and why all three are necessary. You learned the Kubernetes metrics architecture, Prometheus' pull-based collection with PromQL and Alertmanager, and logging patterns including node-level agents versus sidecar shippers and structured JSON logging. Most critically, you learned container probes: liveness restarts stuck containers, readiness controls traffic routing, and startup protects slow-starting applications. Finally, you saw how GKE simplifies observability with managed Prometheus, Cloud Logging, and pre-built dashboards.
📇 KEY CONCEPT CARDS
- Q: What are the three pillars of observability, and what does each answer?
A: Metrics (what's happening over time — CPU, memory), Logs (what happened specifically — errors, events), Traces (how requests flow — latency per service). Metrics detect anomalies, traces locate them, logs explain them.
- Q: What is the difference between a liveness probe and a readiness probe?
A: Liveness detects stuck containers and triggers restart on failure. Readiness determines if a container should receive traffic; on failure it removes the Pod from Service endpoints but does NOT restart the container.
- Q: When do you need a startup probe, and what does it do?
A: For slow-starting applications (e.g., Java taking >30s). It disables liveness and readiness checks until the app finishes starting, preventing premature restarts.
- Q: What is the difference between metrics-server and Prometheus?
A: metrics-server collects the latest resource metrics for kubectl top and HPA, keeping only the most recent data point. Prometheus is a full time-series database for long-term storage, PromQL querying, and alerting.