Diagnosing Storage Bottlenecks

The Symptom

Write latency increases gradually over 3 weeks. Initially, write p99 is 8ms. After 3 weeks, write p99 is 35ms. No code changes, no traffic increase. WiredTiger cache hit ratio is stable at 95%.

The Cause

The data volume has grown from 400 GB to 600 GB. The WiredTiger cache (5 GB) holds the same percentage of data (1.25% vs 0.83%), but the checkpoint now flushes more dirty data because there are more active pages. The disk I/O during checkpoints has increased proportionally.

# Check disk utilization over time
iostat -xm 5

# Week 1 (400 GB):
# Device    r/s    w/s    rMB/s   wMB/s   await  %util
# nvme0n1   800   5200     12.5    82.0    0.4    45%

# Week 3 (600 GB):
# Device    r/s    w/s    rMB/s   wMB/s   await  %util
# nvme0n1   1200  8100     18.8   128.0    1.2    72%

# Disk utilization increased from 45% to 72%.
# await increased from 0.4ms to 1.2ms.

The disk is not saturated (72% utilization), but await has tripled. This is the early warning: the disk queue is building up as checkpoint writes increase.

The Benchmark

Use WiredTiger’s internal metrics to correlate with iostat:

// WiredTiger block manager statistics
db.serverStatus().wiredTiger["block-manager"]

// Key metrics:
// "bytes read":        12500000000   (12.5 GB total reads)
// "bytes written":     82000000000   (82 GB total writes)
// "mapped bytes read": 0             (not using mmap)
// "bytes read via memory map API":  0
// "bytes read via system call API": 12500000000

// Calculate write rate
// If checked every 60 seconds:
// write_rate_MB_s = (bytes_written_now - bytes_written_60s_ago) / (60 * 1024 * 1024)

Metric	Healthy	Degrading	Bottlenecked
Disk %util	< 50%	50-80%	> 80%
I/O await (NVMe)	< 0.5ms	0.5-2ms	> 2ms
I/O await (SSD)	< 2ms	2-10ms	> 10ms
WiredTiger cache dirty %	< 5%	5-15%	> 15%
Checkpoint duration	< 5s	5-15s	> 15s

The Fix

Step 1: Monitor storage I/O from MongoDB.

// FAST: Export WiredTiger block manager metrics
public void exportStorageMetrics(MongoClient client, MeterRegistry registry) {
    Document status = client.getDatabase("admin")
        .runCommand(new Document("serverStatus", 1));

    Document wt = status.get("wiredTiger", Document.class);
    Document blockMgr = wt.get("block-manager", Document.class);
    Document connection = wt.get("connection", Document.class);
    Document cache = wt.get("cache", Document.class);

    // Block manager I/O
    registry.gauge("wt.block_manager.bytes_read",
        blockMgr.getLong("bytes read"));
    registry.gauge("wt.block_manager.bytes_written",
        blockMgr.getLong("bytes written"));

    // Cache pressure indicators
    registry.gauge("wt.cache.dirty_bytes",
        cache.getLong("tracked dirty bytes in the cache"));
    registry.gauge("wt.cache.bytes_max",
        cache.getLong("maximum bytes configured"));

    // Checkpoint duration
    Document checkpoint = wt.get("transaction", Document.class);
    registry.gauge("wt.checkpoint.duration_ms",
        checkpoint.get("transaction checkpoint most recent time (msecs)", 0L));
}

Step 2: Identify whether the bottleneck is IOPS, throughput, or latency.

# IOPS bottleneck: r/s + w/s approaches disk IOPS limit
# Symptom: high %util, moderate await, I/O count near limit
# Fix: Upgrade to higher IOPS disk or reduce I/O operations

# Throughput bottleneck: rMB/s + wMB/s approaches disk bandwidth limit
# Symptom: high %util, high MB/s, large block sizes
# Fix: Upgrade to higher bandwidth disk or compress data

# Latency bottleneck: await increases while %util is moderate
# Symptom: moderate %util but high await per I/O
# Fix: Disk is degrading, or I/O queue depth is too high
# Reduce I/O concurrency (WiredTiger eviction threads)

Step 3: Reduce I/O pressure with compaction.

Over time, WiredTiger data files develop internal fragmentation. A 600 GB data directory may contain only 450 GB of live data; the rest is dead space from updates and deletes. Compaction reclaims this space and reduces I/O by making data files denser.

// Run compaction on a specific collection (blocking operation)
db.runCommand({ compact: "readings" })

// Check collection storage stats before and after
db.readings.stats()
// Before: { "size": 450 GB, "storageSize": 600 GB }
// After:  { "size": 450 GB, "storageSize": 460 GB }
// Reclaimed 140 GB

Compaction itself is I/O-intensive. Run it during off-peak hours on a secondary, then step down the primary and compact it.

The Proof

After compaction and monitoring:

Metric	Before compaction	After compaction
Storage size on disk	600 GB	460 GB
Disk %util (peak)	72%	48%
I/O await	1.2ms	0.4ms
Write p99	35ms	10ms
Checkpoint duration	12s	5s

The Trade-off

Compaction requires 2x the collection’s storage temporarily (it rewrites data files). A 600 GB collection needs 600 GB of free disk space during compaction. If the disk is 80% full, compaction may fail with insufficient space.

Monitoring storage I/O adds overhead. The serverStatus command with WiredTiger statistics is lightweight (< 1ms), but polling every 10 seconds generates 6 commands per minute per member. For a 3-member replica set, this is 18 commands per minute. Negligible for MongoDB, but the monitoring pipeline (Prometheus, Grafana) must store and process the resulting time series.

The fundamental insight: MongoDB’s write performance degrades gradually as data grows, not suddenly. Monitoring I/O metrics over weeks reveals the trend before it becomes a user-facing problem. Set alerts on disk %util > 60% and I/O await > 1ms (NVMe) or > 5ms (SSD) to catch degradation early.