Kubernetes Zero to Hero, Chapter 7.1

Monitoring and Logging

Chapter 17 of 22
Module 7: Observability, Scaling, and Resilience

Running applications is only half the battle — you need to see what's happening, scale with demand, and survive failures. This module covers the operational excellence that separates beginners from professionals.

Module 7 of 8 | Difficulty: Advanced

In the previous module, you built a fortress — RBAC, Pod Security Standards, Network Policies, and audit logging. But a fortress without sentries is just a tomb. How do you know your applications are healthy? How do you detect problems before your users do? And when things break, how do you reconstruct what happened? Welcome to observability — understanding your cluster through the signals it emits.

The Three Pillars of Observability

Analogy: Hospital Patient Monitoring

Imagine a patient in a hospital. Three systems track their wellbeing. The bedside monitor shows continuous vital signs — heart rate at 72 BPM, blood pressure at 120/80. These are metrics: numerical, time-bound, telling you what is happening now. The nurse's chart reads: "10:15 AM — administered medication; 11:00 AM — blood pressure dropped to 95/60." These are logs: discrete event records with full context, telling you what specifically happened. When the patient visits X-ray, then lab, then a specialist, the hospital tracks their journey across departments. This is a trace: it follows a single request to show how the pieces connect.

In Kubernetes, the same three pillars apply. Metrics are numerical measurements — CPU, memory, latency. Logs are timestamped records — errors, access logs, audit events. Traces follow requests across services, recording timing at each hop. You need all three: metrics detect anomalies, traces locate them, logs explain them. Metrics alone show latency spiked at 2:47 AM but not why. Logs reveal an error but not the causal chain. Traces show service C is slow but not whether it is CPU-bound.

Visual Description: The three pillars form a unified observability stack. The cluster generates signals — logs from containers, metrics from the metrics pipeline, traces from request instrumentation. These flow into three parallel collection systems that converge in dashboards and alerting.

graph TD
    subgraph "Kubernetes Cluster"
        APP[Application Pods]
        K8S[Control Plane]
        NODE[Nodes]
    end
    subgraph "Metrics Pipeline"
        METRICS[Metrics Server / Prometheus]
        TSDB[(Time-Series DB)]
    end
    subgraph "Logging Pipeline"
        AGENT[Log Agent / DaemonSet]
        LOGSTORE[(Log Store)]
    end
    subgraph "Tracing Pipeline"
        OTEL[OpenTelemetry / Collector]
        TRACE[(Trace Backend)]
    end
    subgraph "Unified Interface"
        DASH[Grafana / Cloud Monitoring]
        ALERT[Alertmanager / Alerting]
    end
    APP -->|numerical data| METRICS
    K8S -->|numerical data| METRICS
    NODE -->|numerical data| METRICS
    METRICS --> TSDB
    APP -->|stdout/stderr| AGENT
    K8S -->|audit logs| AGENT
    AGENT --> LOGSTORE
    APP -->|spans| OTEL
    OTEL --> TRACE
    TSDB --> DASH
    LOGSTORE --> DASH
    TRACE --> DASH
    TSDB --> ALERT
    LOGSTORE --> ALERT
    style APP fill:#90caf9
    style METRICS fill:#fff9c4
    style AGENT fill:#fff9c4
    style OTEL fill:#fff9c4
    style DASH fill:#a5d6a7
    style ALERT fill:#ef9a9a

Kubernetes Metrics Architecture

The Kubernetes metrics pipeline is layered. metrics-server is the in-cluster aggregator: it collects CPU and memory data from kubelets for kubectl top and the Horizontal Pod Autoscaler. It keeps only the latest data point in memory, so it is a snapshot camera, not a video recorder. cAdvisor is built into the kubelet on each node and collects container-level statistics, which the kubelet exposes at /metrics/cadvisor. kube-state-metrics watches the API server and generates metrics about object states, such as Pods stuck in Pending or Deployments whose ready replicas do not match the desired count. node-exporter runs as a DaemonSet and exposes node-level hardware and OS metrics such as disk space, filesystem usage, and network statistics.
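The "snapshot camera, not a video recorder" distinction can be made concrete with a toy model. The sketch below is not metrics-server or Prometheus code; LatestOnlyStore and HistoryStore are hypothetical names, and the samples are made up. It only contrasts keeping one data point per Pod with keeping every sample.

```python
from collections import defaultdict


class LatestOnlyStore:
    """Toy snapshot store: one value per pod, overwritten on every update."""

    def __init__(self):
        self._latest = {}

    def record(self, pod, timestamp, cpu_millicores):
        self._latest[pod] = (timestamp, cpu_millicores)

    def current(self, pod):
        return self._latest[pod][1]


class HistoryStore:
    """Toy time-series store: every sample is appended and kept."""

    def __init__(self):
        self._series = defaultdict(list)

    def record(self, pod, timestamp, cpu_millicores):
        self._series[pod].append((timestamp, cpu_millicores))

    def current(self, pod):
        return self._series[pod][-1][1]

    def value_at(self, pod, timestamp):
        """Return the most recent sample at or before `timestamp`, or None."""
        older = [v for t, v in self._series[pod] if t <= timestamp]
        return older[-1] if older else None


snapshot, history = LatestOnlyStore(), HistoryStore()
for ts, cpu in [(0, 120), (15, 480), (30, 150)]:  # made-up samples
    snapshot.record("web-1", ts, cpu)
    history.record("web-1", ts, cpu)

print(snapshot.current("web-1"))      # only the latest value is available
print(history.value_at("web-1", 15))  # the earlier spike is still queryable
```

The snapshot store is enough for "what is CPU right now?", which is all kubectl top and the autoscaler ask. Questions about the past need the second shape of store.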

GKE Note: On GKE, metrics-server is installed by default, and Google Cloud Monitoring collects cluster metrics automatically. You still want Prometheus for custom application metrics and granular control.

Prometheus and Grafana

Prometheus is the de facto Kubernetes monitoring standard. Its architecture is pull-based: Prometheus periodically scrapes HTTP /metrics endpoints exposed by targets, storing data in its time-series database.
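To make the pull model concrete, here is a minimal sketch of what a scrape target serves. It hand-renders one counter in the Prometheus text exposition format using only the Python standard library. The metric name, label values, counts, and port are assumptions for illustration; in practice most applications use an official Prometheus client library instead of rendering the format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real application would update these
# as it handles requests.
REQUEST_COUNTS = {("GET", "200"): 1027, ("POST", "500"): 3}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters whenever a scraper requests /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scraper would receive, without starting the server.
print(render_metrics(REQUEST_COUNTS), end="")
```

The application never pushes anything. It only answers when Prometheus asks, so a target that stops answering is itself a signal.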

Visual Description: The Prometheus server sits at the center, configured with scrape targets via Kubernetes service discovery. It pulls metrics from application pods, node-exporters, kube-state-metrics, and cAdvisor. Alerts flow to Alertmanager; queries are served to Grafana.

graph LR
    subgraph "Prometheus Server"
        PROM[Prometheus]
        TSDB[(TSDB)]
        CONF[Scrape Config]
    end
    subgraph "Scrape Targets"
        APP[App Pods /metrics]
        NODE[node-exporter]
        KSM[kube-state-metrics]
        CADV[cAdvisor via kubelet]
    end
    ALERTM[Alertmanager]
    GRAF[Grafana]
    PROM -->|scrape| APP
    PROM -->|scrape| NODE
    PROM -->|scrape| KSM
    PROM -->|scrape| CADV
    PROM --> TSDB
    CONF --> PROM
    PROM -->|alerts| ALERTM
    TSDB -->|query| GRAF
    style PROM fill:#fff9c4
    style TSDB fill:#ffcc80
    style APP fill:#90caf9
    style NODE fill:#a5d6a7
    style GRAF fill:#ce93d8

Prometheus uses PromQL to query time-series data. A query like sum(rate(container_cpu_usage_seconds_total[5m])) by (pod) computes the per-second rate of each container's CPU counter over the last 5 minutes and sums it per pod, giving average CPU usage in cores. Alertmanager handles alerts from Prometheus: it deduplicates, groups, and routes them to email, Slack, or PagerDuty. Grafana queries Prometheus and renders dashboards. The Prometheus Operator simplifies deployment on Kubernetes with custom resources such as ServiceMonitor, which declare scrape targets for automatic discovery.
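The arithmetic behind sum(rate(...)) by (pod) can be sketched in a few lines. This is a simplified model, not Prometheus' implementation: the real rate() also extrapolates to the edges of the window, which is omitted here. The function names and sample values are made up for illustration.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase of a counter over the sampled window.

    `samples` is a list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (the process restarted).
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # After a reset the counter restarts near zero, so the new value
        # itself is the increase since the reset.
        increase += value if value < previous else value - previous
        previous = value
    return increase / elapsed


def sum_rate_by_pod(series):
    """Mimic sum(rate(...)) by (pod): add up container rates within each pod."""
    totals = defaultdict(float)
    for (pod, _container), samples in series.items():
        rate = simple_rate(samples)
        if rate is not None:
            totals[pod] += rate
    return dict(totals)


# Made-up CPU-seconds counters covering a 5-minute window.
series = {
    ("web-1", "app"): [(0, 100.0), (60, 130.0), (120, 160.0),
                       (180, 190.0), (240, 220.0), (300, 250.0)],
    ("web-1", "sidecar"): [(0, 10.0), (300, 25.0)],
    ("web-2", "app"): [(0, 50.0), (60, 80.0), (120, 5.0), (180, 35.0)],
}
for pod, cores in sorted(sum_rate_by_pod(series).items()):
    print(f"{pod}: {cores:.3f} CPU cores")
```

Because the counter measures CPU-seconds, its per-second rate is a number of cores: 150 CPU-seconds consumed over 300 seconds is half a core.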

Logging Patterns

Applications write to stdout and stderr; the container runtime captures these to node-local files. From there, you need a strategy to centralize them. The node-level logging agent pattern runs Fluent Bit or Fluentd as a DaemonSet on every node, tailing container log files and forwarding to a central store. This is the most common pattern — efficient, with one agent per node handling all pods. The sidecar log shipper runs a logging agent inside each Pod, reading from a shared volume. Use this when you need per-application log processing or different destinations per app.
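As a rough illustration of what a node-level agent does with those files, the sketch below parses one log line and attaches Pod metadata before forwarding. It assumes the CRI log format written by runtimes such as containerd and CRI-O (timestamp, stream, a tag of F for a full line or P for a partial one, then the message). The function names, Pod name, and sample line are hypothetical, and real agents such as Fluent Bit do considerably more.

```python
from typing import NamedTuple, Optional


class CriLogLine(NamedTuple):
    timestamp: str   # RFC 3339 timestamp written by the runtime
    stream: str      # "stdout" or "stderr"
    partial: bool    # True when the runtime split a long line ("P" tag)
    message: str


def parse_cri_line(raw: str) -> Optional[CriLogLine]:
    """Split one CRI-format log line into its four fields."""
    parts = raw.rstrip("\n").split(" ", 3)
    if len(parts) < 3 or parts[1] not in ("stdout", "stderr"):
        return None  # not a CRI-format line
    message = parts[3] if len(parts) == 4 else ""
    return CriLogLine(parts[0], parts[1], parts[2] == "P", message)


def enrich(line: CriLogLine, pod: str, namespace: str) -> dict:
    """Attach the metadata a node agent typically adds before forwarding."""
    return {"time": line.timestamp, "stream": line.stream,
            "log": line.message, "pod": pod, "namespace": namespace}


sample = "2024-05-01T10:15:00.123456789Z stderr F connection refused: db:5432\n"
parsed = parse_cri_line(sample)
print(enrich(parsed, pod="probe-demo-abc", namespace="default"))
```

One agent per node can do this for every container on that node, which is why the DaemonSet pattern is cheaper than a shipper in every Pod.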

Structured logging, meaning logs written as JSON, is essential for production. Unstructured logs like "User john logged in" require expensive parsing. Structured logs like {"event": "user_login", "user": "john"} are immediately parseable, filterable, and alertable. Define retention policies aligned with your compliance requirements, for example 30 days hot, 90 days warm, and 1 year cold. Application error logs usually warrant longer retention than routine access logs.
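A minimal sketch of structured logging with Python's standard logging module follows. The JsonFormatter class, the logger name, and the "fields" convention for extra data are assumptions chosen for this example, not a standard API. In a container you would pass sys.stdout instead of the in-memory buffer used here.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("structured-demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger(buffer)
log.info("user_login", extra={"fields": {"user": "john"}})

entry = json.loads(buffer.getvalue())
print(entry["event"], entry["user"])
```

Because each line is valid JSON, the log store can filter on user or event directly instead of running regular expressions over free text.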

Container Probes

Probes are Kubernetes' built-in health check mechanism. They are how Kubernetes decides whether your container is alive, ready, or still starting.

stateDiagram-v2
    [*] --> Running: Pod created
    Running --> Running: startupProbe active
    Running --> Started: startupProbe succeeds
    Started --> Ready: readinessProbe succeeds<br/>(added to Service endpoints)
    Ready --> Checked: readinessProbe fails<br/>(removed from endpoints)
    Checked --> Ready: readinessProbe succeeds again
    Started --> Restarting: livenessProbe fails
    Ready --> Restarting: livenessProbe fails
    Restarting --> Running: Container restarted

Liveness probe: Is this container running, or is it stuck? If it fails, Kubernetes kills and restarts the container. Use it to detect deadlocks, infinite loops, or memory leaks that freeze an application without crashing it.

Readiness probe: Is this container ready to accept traffic? If it fails, Kubernetes removes the Pod from Service endpoints. No traffic is routed to it. When it passes again, the Pod is re-added. Readiness failure does NOT restart the container.

⚠️ Common Misconception: Do not configure readiness probes to check external dependencies like databases. If the database becomes unavailable, every Pod marks itself not ready and your entire service goes offline — even though the application code is fine. Check only whether your application itself is ready to serve.
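One way to keep the two probes separate on the application side is to give each its own endpoint and let readiness depend only on local state. The sketch below assumes hypothetical /healthz and /readyz paths and a hypothetical STATE dictionary; it is an illustration of the idea, not a prescribed layout. An httpGet probe counts any status from 200 up to but not including 400 as success.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical application state, updated by the application itself.
STATE = {"initialized": False, "shutting_down": False}


def probe_status(path, state):
    """Return the HTTP status code a probe endpoint should answer with."""
    if path == "/healthz":
        # Liveness: the process can still respond at all.
        return 200
    if path == "/readyz":
        # Readiness: only local conditions, no database or downstream calls.
        ready = state["initialized"] and not state["shutting_down"]
        return 200 if ready else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    """Wire probe_status into an HTTP handler for httpGet probes."""

    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()


print(probe_status("/readyz", STATE))  # not ready while still initializing
STATE["initialized"] = True
print(probe_status("/readyz", STATE))  # ready once local startup is done
```

Note that a database outage changes neither answer, so the Pods stay in the Service endpoints and can return a meaningful error to callers.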

Startup probe exists for slow-starting applications — Java apps that take 90 seconds to initialize, ML models loading large weights. It disables liveness and readiness checks until it succeeds. Without it, Kubernetes may kill your container before it finishes starting.

| Probe Type | How It Works             | Best For                     |
|------------|--------------------------|------------------------------|
| httpGet    | HTTP GET request         | Web applications             |
| tcpSocket  | TCP connection attempt   | Databases, non-HTTP services |
| exec       | Command inside container | Custom health checks         |
| grpc       | gRPC health protocol     | gRPC services                |

Tune failureThreshold and periodSeconds conservatively. A common starting point is a liveness probe with periodSeconds: 10 and failureThreshold: 3, which allows roughly 30 seconds of consecutive failed checks before the container is restarted.
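The timing arithmetic is easy to check in code. The sketch below is an approximation that ignores timeoutSeconds and scheduling jitter; the function names and the sequence of probe results are made up for illustration.

```python
def worst_case_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time until a continuously failing probe gives up."""
    return initial_delay_seconds + period_seconds * failure_threshold


def checks_until_restart(results, failure_threshold):
    """Return the 1-based check number that triggers a restart, or None.

    `results` is a sequence of booleans (True = probe succeeded). Only
    consecutive failures count; any success resets the counter.
    """
    consecutive = 0
    for number, ok in enumerate(results, start=1):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= failure_threshold:
            return number
    return None


# Liveness starting point from the text: 10 s period, 3 failures.
print(worst_case_seconds(period_seconds=10, failure_threshold=3))

# A flaky container: two failures, a recovery, then three failures in a row.
flaky = [True, False, False, True, False, False, False]
print(checks_until_restart(flaky, failure_threshold=3))
```

The second function shows why a single slow response does not cause a restart: the failure count starts over after every successful check.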

🛑 PAUSE & RECALL

Without looking back:

  1. Your liveness probe keeps restarting a container that takes 60 seconds to start. What probe are you missing?
  2. Should a readiness probe check a database connection? Why or why not?
  3. Name the three pillars of observability and give one Kubernetes example of each.

Rate your confidence (0-4).

GKE in Practice: Google Cloud Observability

GKE Note: GKE integrates deeply with Google Cloud Observability (formerly the Google Cloud Operations suite), providing managed observability that reduces operational overhead.

On GKE, the managed path is strongly recommended for production.

Google Cloud Managed Service for Prometheus (GMP) is a fully managed, Prometheus-compatible service. You write PromQL against it as you would against self-managed Prometheus, but Google operates the time-series storage. Enable managed collection on an existing cluster with:

gcloud container clusters update $CLUSTER_NAME \
  --enable-managed-prometheus --region=$REGION

Cloud Logging automatically collects container stdout/stderr via a Fluent Bit DaemonSet on every node. No configuration needed. Query logs with Logging Query Language: resource.type="k8s_container" resource.labels.cluster_name="prod" severity>=ERROR.

Pre-built GKE dashboards in Cloud Monitoring show cluster health, node utilization, and pod status out of the box. SLO monitoring lets you define objectives like "99.9% of requests under 200ms" and track error budgets — the foundation of Site Reliability Engineering on GKE.
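The error-budget arithmetic behind an SLO can be sketched as follows. This is a generic calculation, not the Cloud Monitoring API; the function names and traffic numbers are assumptions for illustration.

```python
from fractions import Fraction


def error_budget(slo_percent, total_requests):
    """Number of requests allowed to miss the objective in the window."""
    # Fraction avoids floating-point drift in values such as 99.9.
    allowed_fraction = 1 - Fraction(str(slo_percent)) / 100
    return int(allowed_fraction * total_requests)


def budget_remaining(slo_percent, total_requests, bad_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_percent, total_requests)
    return float(Fraction(budget - bad_requests, budget))


# Made-up traffic for an objective of 99.9% of requests under 200 ms.
print(error_budget(99.9, 1_000_000))
print(budget_remaining(99.9, 1_000_000, bad_requests=250))
```

With one million requests in the window, a 99.9% objective tolerates 1,000 slow or failed requests. After 250 of them, three quarters of the budget remains.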

Lab: LAB-7.1 — Monitoring and Logging (60 min)

You will deploy Prometheus and Grafana via Helm, explore a prebuilt Kubernetes dashboard, and configure container probes.

Step 1: Add the Prometheus Helm repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Step 2: Install the kube-prometheus-stack

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set grafana.service.type=LoadBalancer

kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/name=prometheus -n monitoring --timeout=120s

Step 3: Access Grafana

kubectl get secret --namespace monitoring monitoring-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo
kubectl get svc monitoring-grafana -n monitoring

Log in as admin with the retrieved password. Browse to Dashboards → Kubernetes and explore the cluster dashboard.

Step 4: Deploy an application with probes

Create probe-demo.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 3
          periodSeconds: 5
          failureThreshold: 2
        startupProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 2
          periodSeconds: 5
          failureThreshold: 10
---
apiVersion: v1
kind: Service
metadata:
  name: probe-demo
spec:
  selector:
    app: probe-demo
  ports:
  - port: 80
    targetPort: 80

Apply and verify:

kubectl apply -f probe-demo.yaml
kubectl describe pod -l app=probe-demo

Look for probe status in the output:

Liveness:   http-get http://:80/ delay=5s timeout=1s period=10s #success=1 #failure=3
Readiness:  http-get http://:80/ delay=3s timeout=1s period=5s #success=1 #failure=2
Startup:    http-get http://:80/ delay=2s timeout=1s period=5s #success=1 #failure=10

Step 5: Observe probe failure behavior

Simulate a liveness failure:

kubectl exec -it deploy/probe-demo -- sh -c "rm /usr/share/nginx/html/index.html"
kubectl get pods -l app=probe-demo -w

The container restarts. Check events: kubectl describe pod -l app=probe-demo | grep -A5 Events. You will see Liveness probe failed followed by will be restarted.

Step 6: View Cloud Logging integration

Navigate to Cloud Console → Logging → Logs Explorer. Query: resource.type="k8s_container" resource.labels.pod_name=~"probe-demo-.*"

Cleanup:

helm uninstall monitoring -n monitoring
kubectl delete -f probe-demo.yaml
kubectl delete namespace monitoring

🤔 TRY BEFORE YOU SEE

You have a Java Spring Boot app that takes 45 seconds to initialize. During startup it is not ready for traffic. You only have a liveness probe with periodSeconds: 5 and failureThreshold: 3. The container keeps getting restarted before it finishes starting.

Write the probe configuration to fix this. What probe type gates the others during startup? What parameters would you tune?


Solution: Add a startup probe to protect the initialization period:

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 12  # 10s + 12*5s = 70s max startup
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

🛑 PAUSE & RECALL

Without looking back:

  1. What is the difference between a DaemonSet log agent and a sidecar log shipper?
  2. Prometheus uses pull-based collection. Name one advantage over push-based models.
  3. Your readiness probe is failing. Will Kubernetes restart your container? What happens to traffic?

Rate your confidence (0-4).

Chapter Summary

This chapter established the three pillars of observability — metrics, logs, and traces — and why all three are necessary. You learned the Kubernetes metrics architecture, Prometheus' pull-based collection with PromQL and Alertmanager, and logging patterns including node-level agents versus sidecar shippers and structured JSON logging. Most critically, you learned container probes: liveness restarts stuck containers, readiness controls traffic routing, and startup protects slow-starting applications. Finally, you saw how GKE simplifies observability with managed Prometheus, Cloud Logging, and pre-built dashboards.

📇 KEY CONCEPT CARDS

  1. Q: What are the three pillars of observability, and what does each answer?
    A: Metrics (what's happening over time: CPU, memory), Logs (what happened specifically: errors, events), Traces (how requests flow: latency per service). Metrics detect anomalies, traces locate them, logs explain them.
  2. Q: What is the difference between a liveness probe and a readiness probe?
    A: Liveness detects stuck containers and triggers restart on failure. Readiness determines if a container should receive traffic; on failure it removes the Pod from Service endpoints but does NOT restart the container.
  3. Q: When do you need a startup probe, and what does it do?
    A: For slow-starting applications (e.g., Java taking >30s). It disables liveness and readiness checks until the app finishes starting, preventing premature restarts.
  4. Q: What is the difference between metrics-server and Prometheus?
    A: metrics-server collects the latest resource metrics for kubectl top and HPA, keeping only the most recent data point. Prometheus is a full time-series database for long-term storage, PromQL querying, and alerting.