Diagnosing OOM Kills and CFS Throttling

The Symptom

The MongoDB pod restarts every 6-12 hours. Kubernetes events show OOMKilled as the termination reason. The WiredTiger cache is set to 7 GB in a 16 GB container. On paper, the remaining 9 GB of headroom should be sufficient.

The Cause

The WiredTiger cache is 7 GB, but MongoDB’s total memory consumption is higher:

Component	Memory usage
WiredTiger cache	7.0 GB
Connection buffers (200 connections * 1 MB each)	0.2 GB
Aggregation pipeline memory (5 concurrent * 100 MB)	0.5 GB
OS page cache (filesystem metadata, journal)	2.0 GB
MongoDB server process (threads, internal buffers)	1.5 GB
Index build temporary storage (in progress)	3.0 GB
Total	14.2 GB

Under normal operations, without an index build, total memory is 11.2 GB. During an index build on a 500 GB collection, temporary storage can consume an additional 2-4 GB of memory. With the 3.0 GB estimate, 11.2 + 3.0 = 14.2 GB, which is still within the 16 GB limit. However, if a spike in connections (300 instead of 200) coincides with the index build, memory reaches 14.2 GB + 0.1 GB = 14.3 GB, and growth in the OS page cache can consume what remains.

The OOM occurs when all these factors align during peak: 250 connections + index build + 8 concurrent aggregations + WiredTiger cache at maximum dirty threshold.
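The arithmetic above can be reproduced with a small calculator. This is a minimal sketch, not a measurement tool: the class and method names (`MemoryBudget`, `budget`, `totalGb`) are made up for illustration, and the per-component figures are the estimates from the table, with 1 MB per connection and 100 MB per aggregation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative memory budget built from the component estimates in the table. */
public class MemoryBudget {

    /** Builds a per-component budget in GB for a given load scenario. */
    public static Map<String, Double> budget(int connections, int aggregations, boolean indexBuild) {
        Map<String, Double> gb = new LinkedHashMap<>();
        gb.put("wiredTigerCache", 7.0);
        gb.put("connectionBuffers", connections * 1.0 / 1000.0);  // 1 MB each; table treats 1000 MB as 1 GB
        gb.put("aggregations", aggregations * 100.0 / 1000.0);    // 100 MB each
        gb.put("osPageCache", 2.0);
        gb.put("serverProcess", 1.5);
        gb.put("indexBuild", indexBuild ? 3.0 : 0.0);             // table estimate; text says 2-4 GB
        return gb;
    }

    /** Sums the per-component estimates. */
    public static double totalGb(Map<String, Double> componentsGb) {
        double total = 0.0;
        for (double value : componentsGb.values()) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        double limitGb = 16.0;
        double normal = totalGb(budget(200, 5, false));
        double withIndexBuild = totalGb(budget(200, 5, true));
        double peak = totalGb(budget(250, 8, true));
        System.out.printf("normal:      %.2f GB (headroom %.2f GB)%n", normal, limitGb - normal);
        System.out.printf("index build: %.2f GB (headroom %.2f GB)%n", withIndexBuild, limitGb - withIndexBuild);
        System.out.printf("peak:        %.2f GB (headroom %.2f GB)%n", peak, limitGb - peak);
    }
}
```

The sum should be read as a rough floor rather than a ceiling: the page cache and index build figures are estimates, and the index build alone can vary by a gigabyte in either direction.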

# Check OOM events
kubectl describe pod mongodb-0 -n database | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

# Check container memory usage over time
kubectl top pod mongodb-0 -n database --containers
# NAME       CPU    MEMORY
# mongodb    3200m  15.2Gi    <- approaching 16Gi limit

The Benchmark

WiredTiger cache	Container limit	OOM frequency	Safe concurrent aggregations
7 GB	16 GB	Every 6-12 hours	3-5
6 GB	16 GB	Weekly	5-8
5 GB	16 GB	Never (in 3 months)	8-12
7 GB	20 GB	Never (in 3 months)	5-8

Reducing the cache from 7 GB to 5 GB eliminated OOM kills over the 3-month observation window, at the cost of more cache evictions. Increasing the container limit to 20 GB did the same without sacrificing cache.

The Fix

Step 1: Reduce WiredTiger cache to leave more headroom.

# Conservative memory allocation
containers:
  - name: mongodb
    resources:
      limits:
        memory: "16Gi"
    args:
      - "--wiredTigerCacheSizeGB=5"  # Reduced from 7 to 5

Step 2: Limit connection count to cap connection buffer memory.

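// FAST: Cap connection pool size to prevent memory pressure
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import java.util.concurrent.TimeUnit;

// `uri` is the MongoDB connection string, defined elsewhere
MongoClientSettings settings = MongoClientSettings.builder()
    .applyConnectionString(new ConnectionString(uri))
    .applyToConnectionPoolSettings(builder -> builder
        .maxSize(150)       // Per-client pool cap; the driver default is 100
        .minSize(20)        // Keep 20 connections open to avoid reconnect churn
        .maxWaitTime(5, TimeUnit.SECONDS)  // Fail fast when the pool is exhausted
    )
    .build();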

Step 3: Detect CFS throttling.

# Check CFS throttling metrics (from Prometheus/cAdvisor)
# container_cpu_cfs_throttled_periods_total: number of periods where throttling occurred
# container_cpu_cfs_throttled_seconds_total: total time spent throttled

# In Grafana, alert when throttle rate exceeds 5%:
# rate(container_cpu_cfs_throttled_periods_total[5m]) /
# rate(container_cpu_cfs_periods_total[5m]) > 0.05
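The same ratio can be computed without Prometheus from two samples of the cgroup's `cpu.stat` file, which exposes the `nr_periods` and `nr_throttled` counters. The following is a rough sketch: the class name `CfsThrottleRatio` and the sample values are invented for illustration, and the file location depends on the cgroup version and container runtime, so the sketch works on the file contents as strings.

```java
import java.util.HashMap;
import java.util.Map;

/** Computes the CFS throttle ratio between two cpu.stat samples. */
public class CfsThrottleRatio {

    /** Parses "key value" lines as found in a cgroup cpu.stat file. */
    public static Map<String, Long> parseCpuStat(String text) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : text.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                try {
                    stats.put(parts[0], Long.parseLong(parts[1]));
                } catch (NumberFormatException ignored) {
                    // Skip lines whose value is not an integer.
                }
            }
        }
        return stats;
    }

    /** Fraction of CFS periods in which the cgroup was throttled between two samples. */
    public static double throttleRatio(Map<String, Long> before, Map<String, Long> after) {
        long periods = after.getOrDefault("nr_periods", 0L) - before.getOrDefault("nr_periods", 0L);
        long throttled = after.getOrDefault("nr_throttled", 0L) - before.getOrDefault("nr_throttled", 0L);
        return periods <= 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken about 5 minutes apart.
        String t0 = "nr_periods 120000\nnr_throttled 9000\nthrottled_usec 450000000\n";
        String t1 = "nr_periods 123000\nnr_throttled 9360\nthrottled_usec 470000000\n";
        double ratio = throttleRatio(parseCpuStat(t0), parseCpuStat(t1));
        System.out.printf("throttle ratio = %.3f, alert = %b%n", ratio, ratio > 0.05);
    }
}
```

Working on counter deltas rather than absolute values mirrors what `rate()` does in the alert expression: the counters only ever increase, so a single sample says nothing about recent throttling.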

Correlate CFS throttling with MongoDB latency spikes:

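// FAST: Log latency spikes that may be CFS-related
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandSucceededEvent;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;          // Assumes SLF4J; adapt to your logging framework
import org.slf4j.LoggerFactory;

public class ThrottleDetectingCommandListener implements CommandListener {

    private static final Logger logger =
        LoggerFactory.getLogger(ThrottleDetectingCommandListener.class);

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        long elapsedMs = event.getElapsedTime(TimeUnit.MILLISECONDS);
        if (elapsedMs > 100) {
            // Correlation: check if this spike aligns with CFS throttling periods
            logger.warn("Slow command: {} took {}ms on {}",
                event.getCommandName(), elapsedMs,
                event.getConnectionDescription().getServerAddress());
        }
    }
}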
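Once slow-command timestamps are in the logs, the correlation itself can be checked offline. The sketch below is one simple way to do it and assumes two inputs you would have to extract yourself: the timestamps of the logged slow commands and the time windows in which the throttle rate was above the alert threshold. `ThrottleCorrelation` and `Window` are names made up for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Estimates how many slow commands coincide with CFS throttling windows. */
public class ThrottleCorrelation {

    /** Half-open window [startMs, endMs) in which throttling exceeded the threshold. */
    public static final class Window {
        final long startMs;
        final long endMs;

        public Window(long startMs, long endMs) {
            this.startMs = startMs;
            this.endMs = endMs;
        }

        boolean contains(long timestampMs) {
            return timestampMs >= startMs && timestampMs < endMs;
        }
    }

    /** Fraction of slow commands whose timestamp falls inside any throttled window. */
    public static double fractionInThrottledWindows(List<Long> slowCommandTimesMs, List<Window> throttled) {
        if (slowCommandTimesMs.isEmpty()) {
            return 0.0;
        }
        long hits = slowCommandTimesMs.stream()
            .filter(t -> throttled.stream().anyMatch(w -> w.contains(t)))
            .count();
        return (double) hits / slowCommandTimesMs.size();
    }

    public static void main(String[] args) {
        // Hypothetical data: four slow commands, two throttled windows.
        List<Long> slow = Arrays.asList(1000L, 2500L, 7000L, 9000L);
        List<Window> windows = Arrays.asList(new Window(2000, 3000), new Window(6500, 7500));
        System.out.printf("slow commands inside throttled windows: %.0f%%%n",
            100 * fractionInThrottledWindows(slow, windows));
    }
}
```

A high fraction suggests the spikes are CPU-related; a low one points back at other causes such as cache evictions or slow storage. Either way it is an indication, not proof.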

Step 4: Set CPU and memory requests equal to limits for Guaranteed QoS.

resources:
  requests:
    cpu: "8"       # Must equal limit for Guaranteed QoS
    memory: "16Gi" # Must equal limit for Guaranteed QoS
  limits:
    cpu: "8"
    memory: "16Gi"

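Guaranteed QoS makes the pod one of the last candidates for eviction under node memory pressure. If the kubelet runs the static CPU Manager policy, a Guaranteed pod with integer CPU requests is also given exclusive cores.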
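The rule Kubernetes applies to assign a QoS class can be sketched for a single-container pod as follows. This is a simplified illustration, not the kubelet's code: `QosClass` and `classify` are hypothetical names, `null` stands for "not set", and quantities are compared as literal strings, so `"8"` and `"8000m"` would wrongly count as different here.

```java
/** Simplified QoS classification for a pod with one container. */
public class QosClass {

    public enum Qos { GUARANTEED, BURSTABLE, BEST_EFFORT }

    /** Classifies a container's resources; a null argument means the field is not set. */
    public static Qos classify(String cpuRequest, String cpuLimit, String memRequest, String memLimit) {
        boolean noneSet = cpuRequest == null && cpuLimit == null
            && memRequest == null && memLimit == null;
        if (noneSet) {
            return Qos.BEST_EFFORT;
        }
        // Kubernetes defaults a missing request to the limit, so a limit alone counts as equal.
        String effectiveCpuRequest = cpuRequest != null ? cpuRequest : cpuLimit;
        String effectiveMemRequest = memRequest != null ? memRequest : memLimit;
        boolean guaranteed = cpuLimit != null && memLimit != null
            && cpuLimit.equals(effectiveCpuRequest)
            && memLimit.equals(effectiveMemRequest);
        return guaranteed ? Qos.GUARANTEED : Qos.BURSTABLE;
    }

    public static void main(String[] args) {
        System.out.println(classify("8", "8", "16Gi", "16Gi"));   // the manifest above
        System.out.println(classify("4", "8", "16Gi", "16Gi"));   // CPU request below limit
        System.out.println(classify(null, null, null, null));     // nothing set
    }
}
```

The point of the sketch is that both resources matter: lowering only the CPU request, as in the second call, is enough to drop the pod out of the Guaranteed class.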

The Proof

After reducing cache to 5 GB and setting Guaranteed QoS:

Metric	Before	After
OOM kills per month	4-8	0
CFS throttle rate	12%	0.3%
WiredTiger cache evictions/s	200	350 (acceptable)
p99 read latency	45ms (with spikes to 500ms)	22ms
Pod restarts per month	4-8	0 (planned only)

The Trade-off

Reducing the WiredTiger cache from 7 GB to 5 GB increases cache evictions, so more reads go to storage instead of cache, and the p50 read latency increases from 3ms to 5ms. For the telemetry platform, where most queries target recent data (which fits in 5 GB of cache), the impact is minimal. For a workload with a large random-access working set, the cache reduction would be more painful.

Setting CPU requests equal to limits (Guaranteed QoS) means the pod reserves 8 CPUs even when idle. In a shared Kubernetes cluster, this wastes resources. The alternative is Burstable QoS (requests < limits), which allows the pod to burst above its request but risks throttling under contention. For a production database, Guaranteed QoS is worth the resource cost.