unbound mongodb at scale

Diagnosing Cache Pressure and Application Thread Eviction

Chapter 38 of 72

The Symptom

The telemetry platform’s p99 write latency shows periodic spikes to 200ms every 60 seconds. The spikes last 2-5 seconds. Between spikes, p99 is a stable 15ms. The pattern is clock-like in its regularity.

The Cause

The 60-second periodicity matches the checkpoint interval. During each checkpoint, WiredTiger writes dirty pages to disk. The I/O burst from checkpointing competes with normal operations. If the dirty data volume is large enough, the eviction system cannot keep up, and application threads are drafted into eviction duty.

Checking the metrics:

// Capture cache metrics over 5 minutes
var start = db.serverStatus().wiredTiger.cache;
sleep(300000);
var end = db.serverStatus().wiredTiger.cache;

print("App thread evictions: " + (end["pages evicted by application threads"] - start["pages evicted by application threads"]));
print("Dirty bytes range: " + start["tracked dirty bytes in the cache"] + " -> " + end["tracked dirty bytes in the cache"]);
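// Include the utilization percentage so the output is directly comparable to the eviction thresholds
print("Cache bytes: " + end["bytes currently in the cache"] + " / " + end["maximum bytes configured"] +
    " (" + (100 * end["bytes currently in the cache"] / end["maximum bytes configured"]).toFixed(1) + "%)");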

Output:

App thread evictions: 2300
Dirty bytes range: 45000000 -> 850000000
Cache bytes: 14800000000 / 15500000000 (95.5%)

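Cache utilization at 95.5% is above eviction_trigger (95% by default), the point at which application threads are drafted into eviction. The 2,300 pages evicted by application threads in 5 minutes average about 7.7 per second (2,300 / 300 s), and each one is eviction work performed by a thread that should have been serving a client operation.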
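
The arithmetic behind those figures can be sketched as a small helper. This is an illustration rather than part of the original diagnostic script: summarizeCacheDelta is a hypothetical name, and the start counter is assumed to be 0 so that the delta equals the 2,300 reported above.

```javascript
// Hypothetical helper: summarizes two snapshots of
// db.serverStatus().wiredTiger.cache taken elapsedSeconds apart.
function summarizeCacheDelta(start, end, elapsedSeconds) {
  const key = "pages evicted by application threads";
  const max = Number(end["maximum bytes configured"]);
  return {
    appEvictionsPerSec: (Number(end[key]) - Number(start[key])) / elapsedSeconds,
    cacheUsedPct: (100 * Number(end["bytes currently in the cache"])) / max,
    dirtyPct: (100 * Number(end["tracked dirty bytes in the cache"])) / max,
  };
}

// Figures from the output above; the start counter of 0 is an assumption.
const summary = summarizeCacheDelta(
  { "pages evicted by application threads": 0 },
  {
    "pages evicted by application threads": 2300,
    "bytes currently in the cache": 14800000000,
    "tracked dirty bytes in the cache": 850000000,
    "maximum bytes configured": 15500000000,
  },
  300
);

console.log("App thread evictions/sec: " + summary.appEvictionsPerSec.toFixed(1)); // 7.7
console.log("Cache used: " + summary.cacheUsedPct.toFixed(1) + "%");               // 95.5%
console.log("Dirty: " + summary.dirtyPct.toFixed(1) + "%");                        // 5.5%
```

The dirty percentage is worth reading alongside the cache percentage: 5.5% is above the default eviction_dirty_target of 5%, so background eviction of dirty pages is already running behind by the time the second sample is taken.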

The Benchmark

// k6 test measuring latency correlation with checkpoints
import http from 'k6/http';
import { Trend } from 'k6/metrics';

const writeLatency = new Trend('write_latency', true);

export const options = {
  scenarios: {
    steady_writes: {
      executor: 'constant-arrival-rate',
      rate: 2000,
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 100,
      maxVUs: 200,
    },
  },
};

export default function() {
  const startTime = Date.now();
  const res = http.post(`${__ENV.BASE_URL}/api/telemetry/ingest`, JSON.stringify({
    sensorId: `sensor-${String(Math.floor(Math.random() * 10000)).padStart(5, '0')}`,
    timestamp: new Date().toISOString(),
    temperature: 20 + Math.random() * 15,
    humidity: 40 + Math.random() * 30,
  }), { headers: { 'Content-Type': 'application/json' } });

  writeLatency.add(Date.now() - startTime);
}

Results with 15.5 GB WiredTiger cache and a working set of 18 GB:

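| Time window | p50 | p95 | p99 | App thread evictions/sec |
|---|---|---|---|---|
| 0-10s (post-checkpoint) | 3ms | 8ms | 18ms | 0 |
| 10-40s (normal) | 3ms | 9ms | 20ms | 0.5 |
| 40-55s (dirty accumulation) | 4ms | 15ms | 55ms | 3.2 |
| 55-65s (checkpoint + eviction) | 8ms | 45ms | 200ms | 12.8 |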
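
The k6 script records raw latencies; grouping them by position in the checkpoint cycle is a separate post-processing step. A minimal sketch of that step follows, under several assumptions: samples have been exported as objects with a timestamp and a latency, the start time of one checkpoint is known, and uniform windows are acceptable (the table above uses uneven ones). The names percentile and latencyByCycleWindow are hypothetical.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Groups samples by their offset into the checkpoint cycle.
// samples: [{ t: epochMs, latencyMs: number }]
// checkpointAt: epochMs at which any one checkpoint started (assumed known)
function latencyByCycleWindow(samples, checkpointAt, cycleMs, windowMs) {
  const buckets = new Map();
  for (const { t, latencyMs } of samples) {
    // Double modulo keeps the offset non-negative for samples before checkpointAt.
    const offset = (((t - checkpointAt) % cycleMs) + cycleMs) % cycleMs;
    const windowStart = Math.floor(offset / windowMs) * windowMs;
    if (!buckets.has(windowStart)) buckets.set(windowStart, []);
    buckets.get(windowStart).push(latencyMs);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([windowStart, values]) => {
      values.sort((a, b) => a - b);
      return {
        windowStartSec: windowStart / 1000,
        count: values.length,
        p50: percentile(values, 50),
        p95: percentile(values, 95),
        p99: percentile(values, 99),
      };
    });
}

// Example: a 60-second cycle split into 10-second windows.
const demo = latencyByCycleWindow(
  [
    { t: 1000, latencyMs: 3 },
    { t: 2000, latencyMs: 5 },
    { t: 61000, latencyMs: 7 },   // wraps around into the 0-10s window
    { t: 55000, latencyMs: 200 }, // lands in the 50-60s window
  ],
  0, 60000, 10000
);
console.log(JSON.stringify(demo));
```

If the spikes are checkpoint-driven, the high percentiles should cluster in the same window of every cycle; spikes spread evenly across windows would point to a different cause.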

The Fix

Two adjustments:

1. Size the cache to fit the working set.

The working set is the data actively accessed by queries. For the telemetry platform, this is the last 24 hours of readings plus all indexes. Calculate it:

// Working set estimation
var readingsLast24h = db.readings.stats().avgObjSize * 
    db.readings.countDocuments({ ts: { $gte: new Date(Date.now() - 86400000) } });
var totalIndexSize = db.readings.stats().totalIndexSize;
print("Working set: " + (readingsLast24h + totalIndexSize) / (1024*1024*1024) + " GB");

If the working set is 18 GB and the cache is 15.5 GB, increase the cache. On a 48 GB server:

# mongod.conf
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 24
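
As a sanity check on that setting, the sketch below compares a working set against a candidate cache size and against MongoDB's default, which is the larger of 50% of (RAM - 1 GB) and 256 MB. The helper names are hypothetical, and treating the default eviction_target of 80% as the "fits comfortably" line is an assumption of this sketch, not a rule from the text.

```javascript
const GIB = 1024 ** 3;
const MIB = 1024 ** 2;

// MongoDB's default WiredTiger cache size: max(50% of (RAM - 1 GB), 256 MB).
function defaultCacheBytes(ramBytes) {
  return Math.max(0.5 * (ramBytes - GIB), 256 * MIB);
}

// Hypothetical check: does the working set stay under the cache fill level
// at which background eviction starts (eviction_target, 80% by default)?
function cacheFit(workingSetBytes, cacheBytes, evictionTargetPct = 80) {
  const utilizationPct = (100 * workingSetBytes) / cacheBytes;
  return { utilizationPct, fits: utilizationPct <= evictionTargetPct };
}

console.log(defaultCacheBytes(48 * GIB) / GIB); // default on a 48 GB host, in GiB
console.log(cacheFit(18 * GIB, 15.5 * GIB));    // working set larger than the cache
console.log(cacheFit(18 * GIB, 24 * GIB));      // working set at 75% of the cache
```

On a 48 GB host the default works out to 23.5 GB, so cacheSizeGB: 24 is only slightly above what MongoDB would choose on its own.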

2. Tune eviction thresholds for write-heavy workloads.

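# mongod.conf - adjusted eviction thresholds
# (place configString in the same engineConfig block as cacheSizeGB; a YAML file cannot repeat the storage: key)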
storage:
  wiredTiger:
    engineConfig:
      configString: "eviction_dirty_target=2,eviction_dirty_trigger=10,eviction=(threads_min=4,threads_max=8)"

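Lowering eviction_dirty_target from 5% to 2% starts background dirty page eviction earlier, so dirty pages are written out steadily between checkpoints instead of in one burst. Lowering eviction_dirty_trigger from its default of 20% to 10% caps how much dirty data can accumulate before application threads are drafted into eviction. Increasing threads_min from 1 to 4 provides more background eviction capacity.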
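
The thresholds are percentages of the configured cache, so their absolute values change with cache size. The sketch below converts them to bytes for the 24 GB cache; dirtyThresholds is a hypothetical helper, and the default values used (5% and 20%) are WiredTiger's documented defaults.

```javascript
const GIB = 1024 ** 3;

// Hypothetical helper: converts dirty-eviction percentages into byte thresholds.
function dirtyThresholds(cacheBytes, dirtyTargetPct, dirtyTriggerPct) {
  return {
    // Background eviction of dirty pages starts above this level.
    backgroundStartsAtBytes: (cacheBytes * dirtyTargetPct) / 100,
    // Application threads are drafted into eviction above this level.
    appThreadsDraftedAtBytes: (cacheBytes * dirtyTriggerPct) / 100,
  };
}

const defaults = dirtyThresholds(24 * GIB, 5, 20);
const tuned = dirtyThresholds(24 * GIB, 2, 10);

for (const [label, t] of [["default", defaults], ["tuned", tuned]]) {
  console.log(
    label + ": background at " + (t.backgroundStartsAtBytes / GIB).toFixed(2) +
    " GiB, app threads at " + (t.appThreadsDraftedAtBytes / GIB).toFixed(2) + " GiB"
  );
}
```

With the tuned values, background eviction begins at roughly 0.48 GiB of dirty data instead of 1.2 GiB, and application threads are drafted at 2.4 GiB instead of 4.8 GiB.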

The Proof

After increasing cache to 24 GB and tuning eviction:

| Metric | Before (15.5 GB cache) | After (24 GB, tuned eviction) |
|---|---|---|
| Cache utilization | 95.5% | 75% |
| App thread evictions/sec (peak) | 12.8 | 0.1 |
| Write p99 during checkpoint | 200ms | 25ms |
| Write p99 between checkpoints | 20ms | 15ms |
| Dirty bytes at checkpoint time | 850 MB | 120 MB |

The Trade-off

Allocating 24 GB to WiredTiger cache leaves 24 GB for the operating system, filesystem cache, connections, and applications. If the server runs other processes (monitoring agents, log collectors), available memory drops further. On containerized deployments, the WiredTiger cache must be sized explicitly to stay within the container’s memory limit (covered in CH22).

Lowering eviction_dirty_target to 2% means background eviction runs more frequently, consuming CPU cycles. On a 4-core server, continuous background eviction may compete with query processing. On a 16-core server with ample CPU, the impact is negligible.