Diagnosing OOM Kills and CFS Throttling
The Symptom
The MongoDB pod restarts every 6-12 hours. Kubernetes events show OOMKilled as the termination reason. The WiredTiger cache is set to 7 GB in a 16 GB container, so on paper the 9 GB of headroom should be sufficient.
The Cause
The WiredTiger cache is 7 GB, but MongoDB’s total memory consumption is higher:
| Component | Memory usage |
|---|---|
| WiredTiger cache | 7.0 GB |
| Connection buffers (200 connections * 1 MB each) | 0.2 GB |
| Aggregation pipeline memory (5 concurrent * 100 MB) | 0.5 GB |
| OS page cache (filesystem metadata, journal) | 2.0 GB |
| MongoDB server process (threads, internal buffers) | 1.5 GB |
| Index build temporary storage (in progress) | 3.0 GB |
| Total | 14.2 GB |
Under normal operations, total memory is 11.2 GB (without an index build). During an index build on a 500 GB collection, temporary storage can consume an additional 2-4 GB. At the 3.0 GB shown in the table, the total is 11.2 + 3.0 = 14.2 GB, which is within the 16 GB limit. If a connection spike (300 instead of 200) coincides with the index build, connection buffers add another 0.1 GB, for 14.2 + 0.1 = 14.3 GB, and the OS page cache grows into whatever headroom remains.
The OOM occurs when all these factors align during peak: 250 connections + index build + 8 concurrent aggregations + WiredTiger cache at maximum dirty threshold.
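The arithmetic above can be sketched as a small budget calculator. This is a rough model only: the per-component figures are the estimates from the table, the class and method names are hypothetical, and real memory use includes overheads (allocator fragmentation, cache overshoot) that are not modeled here.

```java
/** Sums the per-component estimates from the table, in decimal MB (1 GB = 1000 MB). */
final class MemoryBudget {
    // Fixed components, taken from the table.
    static final long WIRED_TIGER_CACHE_MB = 7_000;
    static final long PAGE_CACHE_MB = 2_000;
    static final long SERVER_PROCESS_MB = 1_500;
    // Per-unit costs, taken from the table.
    static final long MB_PER_CONNECTION = 1;
    static final long MB_PER_AGGREGATION = 100;

    static long totalMb(int connections, int aggregations, long indexBuildMb) {
        return WIRED_TIGER_CACHE_MB
                + connections * MB_PER_CONNECTION
                + aggregations * MB_PER_AGGREGATION
                + PAGE_CACHE_MB
                + SERVER_PROCESS_MB
                + indexBuildMb;
    }

    public static void main(String[] args) {
        System.out.println(totalMb(200, 5, 0));     // 11200: normal operation
        System.out.println(totalMb(200, 5, 3_000)); // 14200: with the index build from the table
        System.out.println(totalMb(250, 8, 4_000)); // 15550: peak, index build at the 4 GB upper bound
    }
}
```

With the index build at the upper end of its 2-4 GB range, the peak scenario lands within a few hundred MB of the limit before any unmodeled overhead is counted.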
# Check OOM events
kubectl describe pod mongodb-0 -n database | grep -A5 "Last State"
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Check current container memory usage (a point-in-time snapshot; use metrics for a trend over time)
kubectl top pod mongodb-0 -n database --containers
# NAME CPU MEMORY
# mongodb 3200m 15.2Gi <- approaching 16Gi limit
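Exit code 137 follows the common convention of reporting 128 plus the signal number when a process is killed by a signal. A minimal sketch of that decoding, with a hypothetical helper name:

```java
final class ExitCodes {
    /** Returns the signal number for exit codes above 128, or -1 if the code does not indicate a signal. */
    static int signalFromExitCode(int exitCode) {
        return exitCode > 128 ? exitCode - 128 : -1;
    }

    public static void main(String[] args) {
        System.out.println(signalFromExitCode(137)); // 9, which is SIGKILL
    }
}
```

Signal 9 is SIGKILL, which is what the kernel OOM killer sends, so 137 combined with the OOMKilled reason confirms the container was killed for exceeding its memory limit rather than crashing on its own.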
The Benchmark
| WiredTiger cache | Container limit | OOM frequency | Safe concurrent aggregations |
|---|---|---|---|
| 7 GB | 16 GB | Every 6-12 hours | 3-5 |
| 6 GB | 16 GB | Weekly | 5-8 |
| 5 GB | 16 GB | Never (in 3 months) | 8-12 |
| 7 GB | 20 GB | Never (in 3 months) | 5-8 |
Reducing the cache from 7 GB to 5 GB eliminates OOM kills at the cost of more cache evictions. Increasing the container limit to 20 GB eliminates OOM kills without sacrificing cache.
The Fix
Step 1: Reduce WiredTiger cache to leave more headroom.
# Conservative memory allocation
containers:
  - name: mongodb
    resources:
      limits:
        memory: "16Gi"
    args:
      - "--wiredTigerCacheSizeGB=5"   # Reduced from 7 to 5
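One way to arrive at a cache size is to work backwards from the container limit. The sketch below assumes a 20% safety margin, which is an illustrative choice and not a figure from this chapter, and takes the peak non-cache usage from the component table (250 connections, 8 aggregations, page cache, server process, 3 GB index build). The class and method names are hypothetical.

```java
final class CacheSizing {
    /**
     * Largest cache (in MB) that still leaves the given safety margin
     * once peak non-cache memory is subtracted from the container limit.
     */
    static long maxCacheMb(long limitMb, long peakNonCacheMb, int safetyPercent) {
        if (safetyPercent < 0 || safetyPercent >= 100) {
            throw new IllegalArgumentException("safetyPercent must be in [0, 100)");
        }
        long usableMb = limitMb * (100 - safetyPercent) / 100;
        return Math.max(0L, usableMb - peakNonCacheMb);
    }

    public static void main(String[] args) {
        // Peak non-cache usage from the table: connections + aggregations
        // + page cache + server process + index build.
        long peakNonCacheMb = 250 + 800 + 2_000 + 1_500 + 3_000; // 7550
        System.out.println(maxCacheMb(16_000, peakNonCacheMb, 20)); // 5250
        System.out.println(maxCacheMb(20_000, peakNonCacheMb, 20)); // 8450
    }
}
```

Under that assumed margin, a 16 GB limit supports roughly a 5 GB cache, while a 20 GB limit leaves room for the original 7 GB. Both results line up with the benchmark rows that showed no OOM kills.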
Step 2: Limit connection count to cap connection buffer memory.
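// FAST: Cap connection pool size to prevent memory pressure
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import java.util.concurrent.TimeUnit;
MongoClientSettings settings = MongoClientSettings.builder()
    .applyConnectionString(new ConnectionString(uri))
    .applyToConnectionPoolSettings(builder -> builder
        // Explicit per-client cap (the driver default is 100).
        // Total server connections = maxSize x number of client instances.
        .maxSize(150)
        .minSize(20)
        .maxWaitTime(5, TimeUnit.SECONDS)  // Fail fast instead of queueing indefinitely
    )
    .build();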
Step 3: Detect CFS throttling.
# Check CFS throttling metrics (from Prometheus/cAdvisor)
# container_cpu_cfs_throttled_periods_total: number of periods where throttling occurred
# container_cpu_cfs_throttled_seconds_total: total time spent throttled
# In Grafana, alert when throttle rate exceeds 5%:
# rate(container_cpu_cfs_throttled_periods_total[5m]) /
# rate(container_cpu_cfs_periods_total[5m]) > 0.05
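The same ratio can be computed by hand from two samples of the underlying counters, for example the `nr_periods` and `nr_throttled` values in the container's cgroup `cpu.stat` file. The sketch below is a minimal illustration with hypothetical names; how the samples are collected is left out.

```java
final class ThrottleRate {
    static final double ALERT_THRESHOLD = 0.05; // 5%, matching the alert expression above

    /** Fraction of CFS periods that were throttled between two counter samples. */
    static double throttledFraction(long periodsStart, long periodsEnd,
                                    long throttledStart, long throttledEnd) {
        long periods = periodsEnd - periodsStart;
        if (periods <= 0) {
            return 0.0; // no periods elapsed, or the counters were reset (container restart)
        }
        long throttled = Math.max(0L, throttledEnd - throttledStart);
        return (double) throttled / periods;
    }

    static boolean shouldAlert(double fraction) {
        return fraction > ALERT_THRESHOLD;
    }

    public static void main(String[] args) {
        double fraction = throttledFraction(0, 3_000, 0, 360);
        System.out.println(shouldAlert(fraction)); // true: 360 of 3000 periods is 12%
    }
}
```

Using deltas between samples rather than raw counter values matters because the counters are cumulative since container start.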
Correlate CFS throttling with MongoDB latency spikes:
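// FAST: Log latency spikes that may be CFS-related
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandSucceededEvent;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class ThrottleDetectingCommandListener implements CommandListener {
    // SLF4J is assumed here; substitute the logging API your application uses
    private static final Logger logger =
        LoggerFactory.getLogger(ThrottleDetectingCommandListener.class);
    private static final long SLOW_COMMAND_THRESHOLD_MS = 100;
    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
        if (elapsedMs > SLOW_COMMAND_THRESHOLD_MS) {
            // Correlation: check if this spike aligns with CFS throttling periods
            logger.warn("Slow command: {} took {}ms on {}",
                event.getCommandName(), elapsedMs,
                event.getConnectionDescription().getServerAddress());
        }
    }
}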
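Once slow-command timestamps are logged, the correlation itself can be a simple overlap count. The sketch below assumes you have already extracted two inputs: the timestamps of slow commands and the time windows in which the throttle rate exceeded the alert threshold. Both inputs and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ThrottleCorrelation {
    /** A time window (epoch millis, inclusive) during which throttling exceeded the threshold. */
    static final class Window {
        final long startMs;
        final long endMs;

        Window(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }

        boolean contains(long timestampMs) {
            return timestampMs >= startMs && timestampMs <= endMs;
        }
    }

    /** Counts slow commands whose timestamp falls inside any throttled window. */
    static long countInsideWindows(List<Long> slowCommandTimesMs, List<Window> throttledWindows) {
        return slowCommandTimesMs.stream()
                .filter(t -> throttledWindows.stream().anyMatch(w -> w.contains(t)))
                .count();
    }

    public static void main(String[] args) {
        List<Window> windows = Arrays.asList(new Window(1_000, 2_000), new Window(5_000, 6_000));
        List<Long> slow = Arrays.asList(500L, 1_500L, 2_000L, 3_000L, 5_500L);
        System.out.println(countInsideWindows(slow, windows)); // 3
    }
}
```

If most slow commands fall inside throttled windows, CPU throttling is a likely contributor. If they do not, look at disk I/O or cache eviction instead.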
Step 4: Set CPU and memory requests equal to limits for Guaranteed QoS.
resources:
  requests:
    cpu: "8"          # Must equal limit for Guaranteed QoS
    memory: "16Gi"    # Must equal limit for Guaranteed QoS
  limits:
    cpu: "8"
    memory: "16Gi"
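Guaranteed QoS makes the pod one of the last candidates for eviction under node memory pressure and ensures it receives its full requested CPU share when the node is contended. The CPU limit still applies, so the pod can still be throttled if it tries to use more than 8 CPUs.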
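The classification rule can be sketched as follows. This is a simplification to a single container with hypothetical names. Kubernetes evaluates every container in the pod, and an omitted request defaults to the limit.

```java
final class QosClass {
    /**
     * Classifies a single-container pod. A null argument means the field is not set.
     * CPU is in millicores and memory in bytes, so equal quantities compare equal.
     */
    static String classify(Long cpuRequestMillis, Long cpuLimitMillis,
                           Long memRequestBytes, Long memLimitBytes) {
        if (cpuRequestMillis == null && cpuLimitMillis == null
                && memRequestBytes == null && memLimitBytes == null) {
            return "BestEffort";
        }
        boolean cpuMatches = requestMatchesLimit(cpuRequestMillis, cpuLimitMillis);
        boolean memMatches = requestMatchesLimit(memRequestBytes, memLimitBytes);
        return cpuMatches && memMatches ? "Guaranteed" : "Burstable";
    }

    // A limit must be set; an omitted request is treated as equal to the limit.
    private static boolean requestMatchesLimit(Long request, Long limit) {
        return limit != null && (request == null || request.equals(limit));
    }

    public static void main(String[] args) {
        long sixteenGi = 16L * 1024 * 1024 * 1024;
        System.out.println(classify(8_000L, 8_000L, sixteenGi, sixteenGi)); // Guaranteed
        System.out.println(classify(4_000L, 8_000L, sixteenGi, sixteenGi)); // Burstable
        System.out.println(classify(null, null, null, null));               // BestEffort
    }
}
```

Note that a mismatch in either resource is enough to drop the pod to Burstable, which is why both CPU and memory must be pinned.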
The Proof
After reducing cache to 5 GB and setting Guaranteed QoS:
| Metric | Before | After |
|---|---|---|
| OOM kills per month | 4-8 | 0 |
| CFS throttle rate | 12% | 0.3% |
| WiredTiger cache evictions/s | 200 | 350 (acceptable) |
| p99 read latency | 45ms (with spikes to 500ms) | 22ms |
| Pod restarts per month | 4-8 | 0 (planned only) |
The Trade-off
Reducing the WiredTiger cache from 7 GB to 5 GB increases cache evictions. More queries hit disk instead of cache. The p50 read latency increases from 3ms to 5ms because more reads go to storage. For the telemetry platform, where most queries target recent data (which fits in 5 GB of cache), the impact is minimal. For a workload with a large random-access working set, the cache reduction would be more painful.
Setting CPU requests equal to limits (Guaranteed QoS) means the pod reserves 8 CPUs even when idle. In a shared Kubernetes cluster, this wastes resources. The alternative is Burstable QoS (requests < limits), which allows the pod to burst above its request but risks throttling under contention. For a production database, Guaranteed QoS is worth the resource cost.