Probe Configuration Solutions
Step-by-step solutions for Exercise 1 (diagnose and fix a CrashLoopBackOff caused by a broken liveness probe) and Exercise 2 (diagnose a Pod stuck in Pending due to excessive CPU requests). Includes all commands, expected output, and explanations.
Exercise 1: Diagnose and Fix a CrashLoopBackOff from a Broken Liveness Probe
Step 1: Create the Pod
Save the following manifest as broken-probe.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: broken-probe
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /does-not-exist
        port: 80
      periodSeconds: 2
      failureThreshold: 3
Apply it:
kubectl apply -f broken-probe.yaml
Step 2: Observe the Failure
Wait approximately 10–15 seconds, then check the Pod status:
kubectl get pods broken-probe
Expected output:
NAME READY STATUS RESTARTS AGE
broken-probe 1/1 Running 2 (4s ago) 20s
The restart count climbs. After several restarts with increasing back-off delays, the status changes to CrashLoopBackOff:
NAME READY STATUS RESTARTS AGE
broken-probe 0/1 CrashLoopBackOff 4 (12s ago) 45s
Note: The term “CrashLoopBackOff” is slightly misleading here. The container itself is not crashing: nginx starts and runs correctly. The kubelet kills the container because the liveness probe reports failure. From Kubernetes's perspective, the effect is the same: the container is repeatedly terminated and restarted.
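The increasing back-off delays can be made concrete with a little arithmetic. The sketch below assumes the kubelet's documented default behaviour (a 10-second initial delay that doubles after each restart and is capped at 300 seconds). backoff_delay is a hypothetical helper written for illustration, not a kubectl or kubelet command.

```shell
# Sketch of the restart back-off sequence: 10s, doubling, capped at 300s.
# backoff_delay is a hypothetical helper; the constants assume kubelet defaults.
backoff_delay() {
  n=$1          # restart attempt number, starting at 1
  delay=10      # initial delay in seconds
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Apply the five-minute cap.
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

for attempt in 1 2 3 4 5 6 7; do
  echo "restart $attempt: wait $(backoff_delay "$attempt")s"
done
```

This is why the restart count rises quickly at first and then slows: by the sixth restart the wait between attempts has reached the cap.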
Step 3: Diagnose with Events
kubectl describe pod broken-probe
Look at the Events section at the bottom:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 60s default-scheduler Successfully assigned default/broken-probe to ...
Normal Pulled 8s (x4 over 60s) kubelet Container image "nginx:1.25" already present on machine
Normal Created 8s (x4 over 60s) kubelet Created container nginx
Normal Started 8s (x4 over 60s) kubelet Started container nginx
Warning Unhealthy 4s (x10 over 56s) kubelet Liveness probe failed: HTTP probe failed with statuscode: 404
Normal Killing 4s (x3 over 52s) kubelet Container nginx failed liveness probe, will be restarted
Two events reveal the root cause:
- “Liveness probe failed: HTTP probe failed with statuscode: 404”: the path /does-not-exist returns a 404, which is outside the 200–399 success range.
- “Container nginx failed liveness probe, will be restarted”: after failureThreshold: 3 consecutive failures (3 × 2s = 6 seconds), the kubelet kills the container.
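The two rules behind these events, the 200–399 success range and the failureThreshold × periodSeconds arithmetic, can be sketched in a few lines of shell. probe_result and seconds_to_restart are hypothetical helpers for illustration only; the kubelet's real timing also depends on when the first probe fires, so treat the result as approximate.

```shell
# probe_result: classify an HTTP status code the way an httpGet probe does.
# Hypothetical helper; any code from 200 to 399 counts as success.
probe_result() {
  code=$1
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo success
  else
    echo failure
  fi
}

# seconds_to_restart: approximate time until the kubelet kills the container,
# given periodSeconds and failureThreshold. Hypothetical helper.
seconds_to_restart() {
  period=$1
  threshold=$2
  echo $((period * threshold))
}

probe_result 404          # prints: failure
probe_result 200          # prints: success
seconds_to_restart 2 3    # prints: 6
```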
The container’s own logs show no crash or startup failure because nginx itself is functioning correctly. The only error entries are the 404s generated by the probe requests:
kubectl logs broken-probe
...
2026/03/01 10:00:01 [error] 29#29: *1 open() "/usr/share/nginx/html/does-not-exist" failed (2: No such file or directory)
The 404 is nginx’s expected response for a nonexistent file. The problem is the probe configuration, not the application.
Step 4: Fix the Probe
Most Pod fields, including probe settings, cannot be changed on an existing Pod, so delete the Pod and recreate it:
kubectl delete pod broken-probe
Update the YAML — change the probe path from /does-not-exist to /:
apiVersion: v1
kind: Pod
metadata:
  name: broken-probe
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /
        port: 80
      periodSeconds: 2
      failureThreshold: 3
Apply the corrected manifest:
kubectl apply -f broken-probe.yaml
Step 5: Verify the Fix
kubectl get pods broken-probe
Expected output after a few seconds:
NAME READY STATUS RESTARTS AGE
broken-probe 1/1 Running 0 10s
Zero restarts. The Pod stays Running because nginx returns a 200 status code for the / path, and the liveness probe passes consistently.
Verify with kubectl describe pod:
kubectl describe pod broken-probe | grep -A2 Liveness
Liveness: http-get http://:80/ delay=0s timeout=1s period=2s #success=1 #failure=3
The Events section should now list only Normal events, with no Unhealthy warnings.
Cleanup
kubectl delete pod broken-probe
Exercise 2: Diagnose a Pod Stuck in Pending
Step 1: Create the Pod
Save the following manifest as pending-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    resources:
      requests:
        cpu: "100"
The value "100" has no unit suffix, so it means 100 whole CPU cores. Typical cluster nodes have far fewer cores available.
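To see how large this request is, it helps to convert CPU quantities to millicores. The sketch below is a simplified conversion that handles only whole numbers and the m suffix (decimal values such as 0.5 are not covered). cpu_to_millicores is a hypothetical helper, not part of kubectl.

```shell
# cpu_to_millicores: convert a Kubernetes CPU quantity to millicores.
# Hypothetical helper; supports only integers ("100") and the m suffix ("100m").
cpu_to_millicores() {
  q=$1
  case "$q" in
    *m) echo "${q%m}" ;;          # already in millicores: strip the suffix
    *)  echo $((q * 1000)) ;;     # whole cores: 1 core = 1000m
  esac
}

cpu_to_millicores 100     # prints: 100000
cpu_to_millicores 100m    # prints: 100
cpu_to_millicores 500m    # prints: 500
```

The manifest's "100" is therefore a thousand times larger than 100m, which is probably what was intended.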
Apply it:
kubectl apply -f pending-pod.yaml
Step 2: Observe the Pending State
kubectl get pods pending-pod
Expected output:
NAME READY STATUS RESTARTS AGE
pending-pod 0/1 Pending 0 30s
The Pod stays in Pending indefinitely. READY is 0/1 and RESTARTS stays at 0 — no container has ever started.
Step 3: Diagnose from Events
kubectl describe pod pending-pod
The Events section reveals the scheduling failure:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 15s default-scheduler 0/1 nodes are available:
1 Insufficient cpu. preemption:
0/1 nodes are available: 1 No preemption victims found for
incoming pod.
The message is unambiguous: “Insufficient cpu.” The scheduler evaluated every node in the cluster and none had 100 CPUs available.
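The check behind “Insufficient cpu” is essentially a comparison between the Pod's request and the CPU a node has left after the requests of Pods already placed on it. The sketch below illustrates that comparison in millicores. fits_on_node is a hypothetical helper, and the node figures (4000m allocatable, 750m already requested) are made-up values for illustration, not measurements from a real cluster.

```shell
# fits_on_node: compare a CPU request with a node's remaining requestable CPU.
# Hypothetical helper; all arguments are in millicores.
fits_on_node() {
  allocatable=$1          # node allocatable CPU
  already_requested=$2    # sum of CPU requests of Pods already on the node
  request=$3              # CPU request of the incoming Pod
  free=$((allocatable - already_requested))
  if [ "$request" -le "$free" ]; then
    echo "fits"
  else
    echo "Insufficient cpu"
  fi
}

fits_on_node 4000 750 100000   # prints: Insufficient cpu
fits_on_node 4000 750 500      # prints: fits
```

The comparison uses requests, not actual usage, so a node can be idle and still reject a Pod.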
You can also view this event from the namespace-wide event stream:
kubectl get events --sort-by='.lastTimestamp' --field-selector type=Warning
LAST SEEN TYPE REASON OBJECT MESSAGE
15s Warning FailedScheduling pod/pending-pod 0/1 nodes are available: 1 Insufficient cpu...
Step 4: Understand Why There Are No Logs
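Unlike a Pod in CrashLoopBackOff, a Pending Pod has never started a container, so there are no logs to retrieve. The exact response from kubectl logs can vary with the cluster and Pod state, for example: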
kubectl logs pending-pod
Error from server (BadRequest): container "nginx" in pod "pending-pod" is waiting to start: ContainerCreating
For Pending Pods, events are the primary diagnostic tool. The scheduler’s FailedScheduling event explains why placement failed.
Step 5: Verify the Diagnosis and Clean Up
The root cause is confirmed: the Pod requests 100 CPU cores, which exceeds the capacity of every node. In a real scenario, the fix is to reduce the CPU request to a reasonable value (e.g., 100m for 0.1 cores, or 500m for half a core).
kubectl delete pod pending-pod
Key Takeaway
A Pod that stays Pending with a FailedScheduling event has a scheduling problem: it was never assigned to a node, so the container image is never pulled and the container runtime is never invoked. Diagnosis happens through events. Memorize the common FailedScheduling messages:
- Insufficient cpu or Insufficient memory: lower the request or add nodes.
- didn’t match Pod’s node affinity/selector: correct the nodeSelector or node labels.
- had taint {key:NoSchedule}: add a toleration.
- persistentvolumeclaim “x” not bound: provision the PersistentVolume or fix the StorageClass.
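The list above can be turned into a small lookup, sketched here as a shell case statement. suggest_fix is a hypothetical helper; the patterns are simple substring matches on the messages listed above, and real scheduler messages may be worded differently across Kubernetes versions.

```shell
# suggest_fix: map a FailedScheduling message to the remedy from the list above.
# Hypothetical helper; patterns are substring matches and may need adjusting.
suggest_fix() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "lower the request or add nodes" ;;
    *"node affinity/selector"*)
      echo "correct the nodeSelector or node labels" ;;
    *"taint"*)
      echo "add a toleration" ;;
    *"persistentvolumeclaim"*)
      echo "provision the PersistentVolume or fix the StorageClass" ;;
    *)
      echo "unrecognized message: read the full event text" ;;
  esac
}

suggest_fix "0/1 nodes are available: 1 Insufficient cpu."
# prints: lower the request or add nodes
```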