Storage Classes, IOPS, and PersistentVolume Sizing

The Symptom

The MongoDB pod’s write latency spikes to 200ms+ every 30 minutes, then returns to 15ms. The pattern is periodic and predictable. The spikes coincide with WiredTiger checkpoints. The same workload on bare metal with NVMe drives does not show this pattern.

The Cause

The Kubernetes cluster uses AWS gp3 volumes with the default provisioning: 3,000 IOPS baseline and 125 MB/s throughput. During a checkpoint, WiredTiger flushes dirty cache pages to disk. For the telemetry platform with 5 GB of cache and 30% dirty ratio, a checkpoint flushes approximately 1.5 GB of data.

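At 3,000 IOPS with a 16 KB block size, effective write throughput is 3,000 * 16 KB = 48 MB/s. That is well below the 125 MB/s throughput limit, so IOPS is the bottleneck, not throughput. Flushing 1.5 GB at 48 MB/s takes about 31 seconds. During those 31 seconds, normal write I/O competes with checkpoint I/O.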

On bare metal NVMe, the drive provides 500,000+ IOPS. The same checkpoint completes in under 1 second.
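The arithmetic above can be sketched as a small helper. This is a rough estimate only, assuming every write is a full 16 KB block and that the volume delivers the lower of its IOPS-derived rate and its throughput cap. The class and method names are hypothetical.

```java
// Sketch: estimate checkpoint flush time from volume limits.
// Uses decimal units (1 MB = 1,000 KB), matching the figures in the text.
final class CheckpointEstimate {

    /** Effective write rate in MB/s: the lower of the IOPS-derived rate and the throughput cap. */
    static double effectiveMBps(int iops, int blockKB, double throughputCapMBps) {
        double iopsBoundMBps = iops * blockKB / 1000.0; // KB/s to MB/s
        return Math.min(iopsBoundMBps, throughputCapMBps);
    }

    /** Seconds needed to flush dirtyMB of cache at the effective write rate. */
    static double flushSeconds(double dirtyMB, int iops, int blockKB, double throughputCapMBps) {
        return dirtyMB / effectiveMBps(iops, blockKB, throughputCapMBps);
    }

    public static void main(String[] args) {
        double dirtyMB = 1_500; // 5 GB cache at a 30% dirty ratio
        System.out.printf("gp3 default:  %.2f s%n", flushSeconds(dirtyMB, 3_000, 16, 125));
        System.out.printf("gp3 16k IOPS: %.2f s%n", flushSeconds(dirtyMB, 16_000, 16, 250));
    }
}
```

With 16,000 IOPS the IOPS-derived rate (256 MB/s) exceeds the 250 MB/s throughput setting, so throughput becomes the limit instead.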

# Check disk I/O from inside the container
iostat -xm 2

# Output during checkpoint:
# Device    r/s    w/s    rMB/s   wMB/s   await  %util
# sdf       50   3100     0.8     48.0    12.5   98.2

# %util near 100% indicates the volume is saturated
# await at 12.5ms is the average time an I/O request spends queued plus being served

The Benchmark

Storage type	IOPS	Throughput	Checkpoint duration	Write p99 during checkpoint
gp3 (default 3,000 IOPS)	3,000	125 MB/s	31s	200ms
gp3 (16,000 IOPS provisioned)	16,000	250 MB/s	6s	45ms
io2 (50,000 IOPS)	50,000	1,000 MB/s	1.5s	18ms
Local NVMe (i3.xlarge)	200,000+	2,000+ MB/s	0.4s	8ms

The Fix

Step 1: Provision gp3 volumes with higher IOPS.

# StorageClass with provisioned IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mongodb-fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"           # 16,000 provisioned IOPS (max 16,000 for gp3)
  throughput: "250"        # 250 MB/s (max 1,000 for gp3)
  encrypted: "true"
  csi.storage.k8s.io/fstype: xfs   # XFS performs better than ext4 for MongoDB
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

# StatefulSet with storage class
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-fast
      resources:
        requests:
          storage: 500Gi    # Size used for the benchmark; Step 2 covers production sizing

Step 2: Size the PersistentVolume correctly.

Component	Size calculation	Telemetry platform
Data	Current data + 6 months growth	800 GB + 600 GB = 1,400 GB
Oplog	Sized per CH21-S1	270 GB
Journal	Fixed at ~1 GB	1 GB
Index files	~20% of data size	280 GB
Headroom	20% of total for compaction	390 GB
Total		2,341 GB → 2,500 GB

WiredTiger needs temporary disk space during compaction. Without 20% headroom, compaction fails and the database grows without reclaiming space from deleted documents.
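The sizing table can be reproduced with a few lines of arithmetic. The sketch below assumes the same ratios as the table (indexes at 20% of data, headroom at 20% of the subtotal) and rounds the result up to a 500 GB step. The rounding step and all names are illustrative choices, not a fixed rule.

```java
// Sketch: reproduce the PersistentVolume sizing table.
final class PvSizing {

    /** Required size in GB before rounding: data, oplog, journal, indexes, plus headroom. */
    static long requiredGB(long currentDataGB, long growthGB, long oplogGB, long journalGB,
                           double indexRatio, double headroomRatio) {
        long data = currentDataGB + growthGB;
        long index = Math.round(data * indexRatio);
        long subtotal = data + oplogGB + journalGB + index;
        long headroom = Math.round(subtotal * headroomRatio);
        return subtotal + headroom;
    }

    /** Rounds gb up to the next multiple of stepGB. */
    static long roundUp(long gb, long stepGB) {
        return ((gb + stepGB - 1) / stepGB) * stepGB;
    }

    public static void main(String[] args) {
        long required = requiredGB(800, 600, 270, 1, 0.20, 0.20);
        System.out.println("Required GB:    " + required);
        System.out.println("Provisioned GB: " + roundUp(required, 500));
    }
}
```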

Step 3: Use separate volumes for data and journal (if IOPS-constrained).

# Separate volumes for data and journal
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-fast
      resources:
        requests:
          storage: 2500Gi
  - metadata:
      name: journal
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-journal  # io2 class (defined separately) for consistent low-latency journal writes
      resources:
        requests:
          storage: 10Gi

# Mount journal on separate volume
containers:
  - name: mongodb
    volumeMounts:
      - name: data
        mountPath: /data/db
      - name: journal
        mountPath: /data/db/journal

Separating the journal onto a dedicated volume with guaranteed IOPS ensures that journal writes (which are latency-sensitive) do not compete with checkpoint I/O on the data volume.

Step 4: Monitor storage metrics.

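// FAST: Export disk I/O metrics from serverStatus
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.Document;
import org.springframework.stereotype.Component;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

@Component
public class StorageMetricsExporter {

    // Micrometer holds gauge state weakly, so keep strong references here
    private final Map<String, AtomicLong> values = new ConcurrentHashMap<>();

    // Call periodically (for example from a scheduled task) to refresh the gauges
    public void exportMetrics(MongoClient client, MeterRegistry registry) {
        Document serverStatus = client.getDatabase("admin")
            .runCommand(new Document("serverStatus", 1));

        Document wt = serverStatus.get("wiredTiger", Document.class);
        Document blockManager = wt.get("block-manager", Document.class);
        set(registry, "mongodb.storage.bytes_read", blockManager.get("bytes read"));
        set(registry, "mongodb.storage.bytes_written", blockManager.get("bytes written"));

        // Cumulative write latency (microseconds) and operation count since startup.
        // Average latency over an interval = delta(latency) / delta(ops).
        Document opLatencies = serverStatus.get("opLatencies", Document.class);
        Document writes = opLatencies.get("writes", Document.class);
        set(registry, "mongodb.latency.writes.micros", writes.get("latency"));
        set(registry, "mongodb.latency.writes.ops", writes.get("ops"));
    }

    // Registers the gauge on first use, then updates its backing value.
    // serverStatus may return Integer or Long, so convert through Number.
    private void set(MeterRegistry registry, String name, Object value) {
        values.computeIfAbsent(name, n -> registry.gauge(n, new AtomicLong()))
              .set(((Number) value).longValue());
    }
}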
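The opLatencies.writes counters are cumulative totals, so a single sample does not show latency during a checkpoint. A minimal sketch of turning two consecutive samples into an average write latency for the interval follows. The class name and the sample values are hypothetical.

```java
// Sketch: average write latency between two serverStatus samples.
final class WriteLatencyDelta {

    /**
     * Average write latency in milliseconds over the interval between two samples
     * of opLatencies.writes (latency is cumulative microseconds, ops is a cumulative count).
     * Returns 0 when no writes completed or the counters were reset by a restart.
     */
    static double avgWriteMillis(long prevLatencyMicros, long prevOps,
                                 long currLatencyMicros, long currOps) {
        long ops = currOps - prevOps;
        long micros = currLatencyMicros - prevLatencyMicros;
        if (ops <= 0 || micros < 0) {
            return 0.0;
        }
        return micros / (double) ops / 1000.0;
    }

    public static void main(String[] args) {
        // Hypothetical samples: 40 writes took 600,000 microseconds in total
        System.out.println(avgWriteMillis(1_000_000, 100, 1_600_000, 140));
    }
}
```

Sampling at an interval shorter than the checkpoint duration keeps a spike from being averaged away.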

The Proof

After switching to gp3 with 16,000 provisioned IOPS:

Metric	gp3 default (3,000 IOPS)	gp3 provisioned (16,000 IOPS)
Checkpoint duration	31s	6s
Write p99 during checkpoint	200ms	45ms
Write p99 (no checkpoint)	15ms	12ms
Disk utilization during checkpoint	98%	42%
Monthly storage cost (500 GB)	$40 + $0 IOPS	$40 + $65 IOPS = $105

The Trade-off

Provisioned IOPS cost money. On AWS, gp3 IOPS above the 3,000 baseline cost $0.005 per IOPS per month. Provisioning 16,000 IOPS costs (16,000 - 3,000) * $0.005 = $65/month per volume. For a 3-member replica set, this is $195/month for IOPS alone. Throughput provisioned above the 125 MB/s baseline is billed separately.
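The IOPS cost calculation generalizes to any provisioning level and replica set size. The sketch below uses the rate and baseline quoted above. Both are assumptions that vary by region and over time, and the names are hypothetical.

```java
// Sketch: monthly gp3 IOPS surcharge, using the price quoted in the text.
final class Gp3IopsCost {

    static final int BASELINE_IOPS = 3_000;           // included with every gp3 volume
    static final double USD_PER_IOPS_MONTH = 0.005;   // assumed rate; check current pricing

    /** Monthly surcharge in USD for one volume; zero at or below the baseline. */
    static double perVolume(int provisionedIops) {
        return Math.max(0, provisionedIops - BASELINE_IOPS) * USD_PER_IOPS_MONTH;
    }

    /** Monthly surcharge in USD for a replica set with one volume per member. */
    static double perReplicaSet(int provisionedIops, int members) {
        return perVolume(provisionedIops) * members;
    }

    public static void main(String[] args) {
        System.out.printf("Per volume:      $%.2f%n", perVolume(16_000));
        System.out.printf("3-member set:    $%.2f%n", perReplicaSet(16_000, 3));
    }
}
```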

Local NVMe (instance storage on i3/i3en instances) provides the best I/O performance but is ephemeral. If the instance is terminated, the data is lost. This requires careful backup and restore procedures. For MongoDB with replica sets, the data exists on other members, so losing one member’s local storage is recoverable but requires a full initial sync.

XFS outperforms ext4 for MongoDB workloads by 10-15% because XFS handles concurrent I/O better and has more efficient extent-based allocation. MongoDB's production notes also recommend XFS for WiredTiger, so prefer XFS for MongoDB data volumes in Kubernetes.