unbound mongodb at scale

Checkpoint Tuning and Write-Ahead Log Sizing

4 min read Chapter 39 of 72

The Symptom

After tuning the WiredTiger cache and eviction thresholds in CH13-S1, the 60-second latency spikes are reduced but not eliminated. The remaining spikes correlate with disk I/O bursts visible in iostat:

Device            r/s      w/s    rMB/s    wMB/s  await  %util
nvme0n1         120.0  2400.0      8.5    280.0    2.1   85.0

During the checkpoint window, write throughput spikes to 280 MB/s and disk utilization hits 85%. The 2,400 write operations per second during checkpoint compete with the 120 read operations.

The Cause

The default checkpoint interval is 60 seconds. At 2,000 writes/sec with 340-byte documents, about 41 MB of raw document data is written per interval (2,000 × 340 bytes × 60 s); with index updates, approximately 80 MB of dirty data accumulates per checkpoint interval. The checkpoint writes all 80 MB in a burst of roughly 2 seconds. On a single NVMe drive with 500 MB/s sequential write throughput, 80 MB would take only 0.16 seconds if it were written sequentially. But the writes are not sequential: they are scattered across multiple B-tree data files and index files, resulting in semi-random I/O patterns that stretch the burst.
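The arithmetic behind these figures can be sketched as a small helper. This is an illustrative estimate only: the function name and the `indexAmplification` multiplier are assumptions (the factor of 2 is chosen so the result lands near the roughly 80 MB quoted above, it is not a measured value), and MB here means 10^6 bytes.

```javascript
// Estimate dirty data per checkpoint interval and the ideal sequential flush time.
// indexAmplification is an assumed multiplier covering index updates, not a measurement.
function estimateCheckpointBurst({ writesPerSec, docBytes, intervalSec, indexAmplification, seqWriteMBps }) {
  const rawMB = (writesPerSec * docBytes * intervalSec) / 1e6; // document bytes only
  const dirtyMB = rawMB * indexAmplification;                  // documents plus index pages
  const idealFlushSec = dirtyMB / seqWriteMBps;                // lower bound: purely sequential I/O
  return { rawMB, dirtyMB, idealFlushSec };
}

const burst = estimateCheckpointBurst({
  writesPerSec: 2000,
  docBytes: 340,
  intervalSec: 60,
  indexAmplification: 2, // assumption, see lead-in
  seqWriteMBps: 500,
});
console.log(burst);
```

The gap between the ideal flush time and the observed checkpoint duration is a rough indicator of how far the real I/O pattern is from sequential.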

The journal (write-ahead log) also contributes. WiredTiger journals every write operation ahead of the checkpoint that will eventually persist it. The journal files grow between checkpoints and are trimmed after a successful checkpoint. The default journalCompressor is snappy, and the default commitIntervalMs is 100 ms; a write that requests j: true forces a journal sync instead of waiting for the next interval.

The Benchmark

Compare checkpoint behavior at different intervals:

// Monitor checkpoint duration and I/O
// Run this in the MongoDB shell during a k6 load test at 2,000 writes/sec

// Take one serverStatus() snapshot so all four values come from the same instant
const txn = db.serverStatus().wiredTiger.transaction;

// Checkpoint metrics
printjson({
  running: txn["transaction checkpoint currently running"],         // 1 while a checkpoint is in progress
  mostRecentMs: txn["transaction checkpoint most recent time (msecs)"],
  maxMs: txn["transaction checkpoint max time (msecs)"],
  totalMs: txn["transaction checkpoint total time (msecs)"],        // cumulative since startup
});
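Because the total-time counter is cumulative, comparing two snapshots gives the checkpoint time spent in a given window. The following is a minimal sketch of that comparison; `checkpointDelta` is a hypothetical helper, the sample values are invented for illustration, and the counter names are taken from the snippet above (they may differ between server versions).

```javascript
// Counter names as used in the serverStatus().wiredTiger.transaction snippet above.
const TOTAL_MS = "transaction checkpoint total time (msecs)";
const RECENT_MS = "transaction checkpoint most recent time (msecs)";
const MAX_MS = "transaction checkpoint max time (msecs)";

// Given two snapshots of the transaction section taken elapsedSec apart,
// report how much of that window was spent checkpointing.
function checkpointDelta(prev, curr, elapsedSec) {
  const spentMs = curr[TOTAL_MS] - prev[TOTAL_MS];
  return {
    checkpointMsInWindow: spentMs,
    fractionOfWindow: spentMs / (elapsedSec * 1000),
    mostRecentMs: curr[RECENT_MS],
    maxMs: curr[MAX_MS],
  };
}

// Hypothetical sample values for illustration only (not measured output).
const before = { [TOTAL_MS]: 90000, [RECENT_MS]: 1750, [MAX_MS]: 2100 };
const after = { [TOTAL_MS]: 91800, [RECENT_MS]: 1800, [MAX_MS]: 2100 };
console.log(checkpointDelta(before, after, 60));
```

In the shell, the two snapshots would each come from `db.serverStatus().wiredTiger.transaction`, taken at the start and end of the sampling window.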

Results at different checkpoint intervals:

Checkpoint interval   Dirty data per checkpoint   Checkpoint duration   Write p99 during checkpoint   Recovery time
30s                   40 MB                       0.8s                  45ms                          30s max
60s (default)         80 MB                       1.8s                  85ms                          60s max
120s                  160 MB                      3.5s                  150ms                         120s max
300s                  400 MB                      8.2s                  280ms                         300s max

The Fix

For the telemetry platform’s write rate, reduce the checkpoint interval to 30 seconds:

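# mongod.conf
# configString is passed through to WiredTiger as-is; confirm it takes effect on your version.
# MongoDB also documents storage.syncPeriodSecs (default 60) as the checkpoint interval setting.
storage:
  wiredTiger:
    engineConfig:
      configString: "checkpoint=(wait=30)"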

This halves the dirty data accumulated per checkpoint, roughly halving the I/O burst and its latency impact. The trade-off is that checkpoints occur twice as often: total checkpoint I/O stays about the same, but it arrives in smaller, less disruptive bursts, at the cost of more frequent per-checkpoint metadata writes.

Tune the journal commit interval to balance durability and throughput:

# mongod.conf
storage:
  journal:
    commitIntervalMs: 100    # Default: 100 ms; writes with j:true force a journal sync

For the telemetry platform where individual readings are not critical (a few lost readings are acceptable), keep the default 100ms. This means up to 100ms of writes can be lost on a crash. The journal flushes every 100ms, grouping all writes in that interval into a single disk sync.
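To put the loss window in concrete terms, the sketch below converts a commit interval into an upper bound on the number of writes at risk. `writesAtRisk` is a hypothetical helper, and it assumes a steady write rate and writes that do not request journal acknowledgement.

```javascript
// Upper bound on writes that could be lost in a crash, assuming a steady
// write rate and writes that do not wait for the journal (no j: true).
function writesAtRisk(writesPerSec, commitIntervalMs) {
  return Math.ceil((writesPerSec * commitIntervalMs) / 1000);
}

console.log(writesAtRisk(2000, 100)); // telemetry platform at the default interval
console.log(writesAtRisk(2000, 10));  // same write rate with a 10 ms interval
```

At 2,000 writes/sec, the default interval bounds the loss at 200 readings, which is the figure to weigh against the cost of more frequent disk syncs.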

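For financial data or audit logs, a lower interval such as commitIntervalMs: 10 narrows the loss window for all writes. Critical writes can also request journal acknowledgement individually, shown here with the Java driver: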

// Critical writes: journal acknowledged
collection.withWriteConcern(WriteConcern.JOURNALED)
    .insertOne(auditDocument);

The Proof

After reducing checkpoint interval to 30 seconds:

Metric                          60s checkpoint   30s checkpoint
Dirty data per checkpoint       80 MB            40 MB
Checkpoint duration             1.8s             0.8s
Write p99 during checkpoint     85ms             45ms
Write p99 between checkpoints   15ms             15ms
Total checkpoint I/O per hour   4.8 GB           4.8 GB
Checkpoints per hour            60               120
Max recovery time               60s              30s

Total checkpoint I/O per hour is the same (4.8 GB). The work is the same; it is just distributed in smaller batches.
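The hourly totals follow directly from the interval and the per-checkpoint volume. A minimal sketch of that calculation, using a hypothetical helper and the figures from the table (GB taken as 1,000 MB):

```javascript
// Checkpoints per hour and total checkpoint I/O per hour for a given interval.
function hourlyCheckpointIO(intervalSec, dirtyMBPerCheckpoint) {
  const checkpointsPerHour = 3600 / intervalSec;
  const totalGB = (checkpointsPerHour * dirtyMBPerCheckpoint) / 1000;
  return { checkpointsPerHour, totalGB };
}

console.log(hourlyCheckpointIO(60, 80)); // 60-second interval, 80 MB per checkpoint
console.log(hourlyCheckpointIO(30, 40)); // 30-second interval, 40 MB per checkpoint
```

Both calls report the same hourly volume because dirty data per checkpoint scales with the interval at a constant write rate.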

The Trade-off

Shorter checkpoint intervals mean faster recovery after an unclean shutdown: MongoDB only needs to replay journal entries since the last checkpoint. At 30 seconds, recovery replays at most 30 seconds of writes. At 300 seconds, it replays 5 minutes.

But shorter intervals increase the metadata overhead. Each checkpoint updates the root page of every B-tree (every collection and index). With 50 collections and 200 indexes, that is 250 root page writes per checkpoint. At 120 checkpoints per hour (30s interval), that is 30,000 root page writes per hour. On SSD, this is negligible. On HDD, the seek overhead accumulates.

Journal sizing also matters. WiredTiger pre-allocates journal files in 100 MB chunks. With 2,000 writes/sec, journal throughput is approximately 2 MB/sec (after snappy compression). Each 100 MB journal file fills in 50 seconds. Journal files older than the last checkpoint are deleted. With a 30-second checkpoint interval, only 1-2 journal files exist at any time (100-200 MB). With a 300-second interval, 6-7 files exist (600-700 MB). On storage-constrained deployments, longer checkpoint intervals consume more journal disk space.
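The journal footprint estimate above can be reproduced with a short calculation. This is a rough sketch under the figures stated in the text (100 MB journal files, about 2 MB/sec of compressed journal throughput); `journalFootprint` is a hypothetical helper, and real file counts also depend on when checkpoints complete.

```javascript
const JOURNAL_FILE_MB = 100; // journal file size stated in the text

// Estimate how many journal files are retained between checkpoints.
function journalFootprint(journalMBps, checkpointIntervalSec) {
  const retainedMB = journalMBps * checkpointIntervalSec;     // journal written since the last checkpoint
  const minFiles = Math.ceil(retainedMB / JOURNAL_FILE_MB);   // files needed to hold that much data
  const maxFiles = minFiles + 1;                              // one more if the range straddles a file boundary
  return {
    secondsPerFile: JOURNAL_FILE_MB / journalMBps,
    minFiles,
    maxFiles,
    maxDiskMB: maxFiles * JOURNAL_FILE_MB,
  };
}

console.log(journalFootprint(2, 30));  // 30-second checkpoint interval
console.log(journalFootprint(2, 300)); // 300-second checkpoint interval
```

For storage-constrained deployments, `maxDiskMB` is the number to budget for on the journal volume.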