Diagnosing Storage Bottlenecks
Diagnosing Storage Bottlenecks
The Symptom
Write latency increases gradually over 3 weeks. Initially, write p99 is 8ms. After 3 weeks, write p99 is 35ms. No code changes, no traffic increase. WiredTiger cache hit ratio is stable at 95%.
The Cause
The data volume has grown from 400 GB to 600 GB. The WiredTiger cache (5 GB) holds the same percentage of data (1.25% vs 0.83%), but the checkpoint now flushes more dirty data because there are more active pages. The disk I/O during checkpoints has increased proportionally.
# Check disk utilization over time
iostat -xm 5
# Week 1 (400 GB):
# Device r/s w/s rMB/s wMB/s await %util
# nvme0n1 800 5200 12.5 82.0 0.4 45%
# Week 3 (600 GB):
# Device r/s w/s rMB/s wMB/s await %util
# nvme0n1 1200 8100 18.8 128.0 1.2 72%
# Disk utilization increased from 45% to 72%.
# await increased from 0.4ms to 1.2ms.
The disk is not saturated (72% utilization), but await has tripled. This is the early warning: the disk queue is building up as checkpoint writes increase.
The Benchmark
Use WiredTiger’s internal metrics to correlate with iostat:
// WiredTiger block manager statistics
db.serverStatus().wiredTiger["block-manager"]
// Key metrics:
// "bytes read": 12500000000 (12.5 GB total reads)
// "bytes written": 82000000000 (82 GB total writes)
// "mapped bytes read": 0 (not using mmap)
// "bytes read via memory map API": 0
// "bytes read via system call API": 12500000000
// Calculate write rate
// If checked every 60 seconds:
// write_rate_MB_s = (bytes_written_now - bytes_written_60s_ago) / (60 * 1024 * 1024)
| Metric | Healthy | Degrading | Bottlenecked |
|---|---|---|---|
| Disk %util | < 50% | 50-80% | > 80% |
| I/O await (NVMe) | < 0.5ms | 0.5-2ms | > 2ms |
| I/O await (SSD) | < 2ms | 2-10ms | > 10ms |
| WiredTiger cache dirty % | < 5% | 5-15% | > 15% |
| Checkpoint duration | < 5s | 5-15s | > 15s |
The Fix
Step 1: Monitor storage I/O from MongoDB.
// FAST: Export WiredTiger block manager metrics
public void exportStorageMetrics(MongoClient client, MeterRegistry registry) {
Document status = client.getDatabase("admin")
.runCommand(new Document("serverStatus", 1));
Document wt = status.get("wiredTiger", Document.class);
Document blockMgr = wt.get("block-manager", Document.class);
Document connection = wt.get("connection", Document.class);
Document cache = wt.get("cache", Document.class);
// Block manager I/O
registry.gauge("wt.block_manager.bytes_read",
blockMgr.getLong("bytes read"));
registry.gauge("wt.block_manager.bytes_written",
blockMgr.getLong("bytes written"));
// Cache pressure indicators
registry.gauge("wt.cache.dirty_bytes",
cache.getLong("tracked dirty bytes in the cache"));
registry.gauge("wt.cache.bytes_max",
cache.getLong("maximum bytes configured"));
// Checkpoint duration
Document checkpoint = wt.get("transaction", Document.class);
registry.gauge("wt.checkpoint.duration_ms",
checkpoint.get("transaction checkpoint most recent time (msecs)", 0L));
}
Step 2: Identify whether the bottleneck is IOPS, throughput, or latency.
# IOPS bottleneck: r/s + w/s approaches disk IOPS limit
# Symptom: high %util, moderate await, I/O count near limit
# Fix: Upgrade to higher IOPS disk or reduce I/O operations
# Throughput bottleneck: rMB/s + wMB/s approaches disk bandwidth limit
# Symptom: high %util, high MB/s, large block sizes
# Fix: Upgrade to higher bandwidth disk or compress data
# Latency bottleneck: await increases while %util is moderate
# Symptom: moderate %util but high await per I/O
# Fix: Disk is degrading, or I/O queue depth is too high
# Reduce I/O concurrency (WiredTiger eviction threads)
Step 3: Reduce I/O pressure with compaction.
Over time, WiredTiger data files develop internal fragmentation. A 600 GB data directory may contain only 450 GB of live data; the rest is dead space from updates and deletes. Compaction reclaims this space and reduces I/O by making data files denser.
// Run compaction on a specific collection (blocking operation)
db.runCommand({ compact: "readings" })
// Check collection storage stats before and after
db.readings.stats()
// Before: { "size": 450 GB, "storageSize": 600 GB }
// After: { "size": 450 GB, "storageSize": 460 GB }
// Reclaimed 140 GB
Compaction itself is I/O-intensive. Run it during off-peak hours on a secondary, then step down the primary and compact it.
The Proof
After compaction and monitoring:
| Metric | Before compaction | After compaction |
|---|---|---|
| Storage size on disk | 600 GB | 460 GB |
| Disk %util (peak) | 72% | 48% |
| I/O await | 1.2ms | 0.4ms |
| Write p99 | 35ms | 10ms |
| Checkpoint duration | 12s | 5s |
The Trade-off
Compaction requires 2x the collection’s storage temporarily (it rewrites data files). A 600 GB collection needs 600 GB of free disk space during compaction. If the disk is 80% full, compaction may fail with insufficient space.
Monitoring storage I/O adds overhead. The serverStatus command with WiredTiger statistics is lightweight (< 1ms), but polling every 10 seconds generates 6 commands per minute per member. For a 3-member replica set, this is 18 commands per minute. Negligible for MongoDB, but the monitoring pipeline (Prometheus, Grafana) must store and process the resulting time series.
The fundamental insight: MongoDB’s write performance degrades gradually as data grows, not suddenly. Monitoring I/O metrics over weeks reveals the trend before it becomes a user-facing problem. Set alerts on disk %util > 60% and I/O await > 1ms (NVMe) or > 5ms (SSD) to catch degradation early.