Storage Classes, IOPS, and PersistentVolume Sizing
The Symptom
The MongoDB pod’s write latency spikes to 200ms+ every 30 minutes, then returns to 15ms. The pattern is periodic and predictable. The spikes coincide with WiredTiger checkpoints. The same workload on bare metal with NVMe drives does not show this pattern.
The Cause
The Kubernetes cluster uses AWS gp3 volumes with the default provisioning: 3,000 IOPS baseline and 125 MB/s throughput. During a checkpoint, WiredTiger flushes dirty cache pages to disk. For the telemetry platform with 5 GB of cache and 30% dirty ratio, a checkpoint flushes approximately 1.5 GB of data.
At 3,000 IOPS with a 16 KB block size: 3,000 * 16 KB = 48 MB/s effective write throughput. That is well below the 125 MB/s throughput limit, so IOPS is the bottleneck, not bandwidth. Flushing 1.5 GB at 48 MB/s takes about 31 seconds. During those 31 seconds, normal write I/O competes with checkpoint I/O.
On bare metal NVMe, the drive provides 500,000+ IOPS. The same checkpoint completes in under 1 second.
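The arithmetic above can be sketched as a small helper. This is a rough estimate under the same assumptions as the text (16 KB writes, decimal units, a steady flush rate); the class and method names are hypothetical and not part of MongoDB or AWS tooling.

```java
/** Back-of-the-envelope checkpoint flush estimate (illustrative only). */
final class CheckpointEstimator {
    private CheckpointEstimator() {}

    /** Effective write rate in MB/s: IOPS * block size, capped by the volume's throughput limit. */
    static double effectiveMBps(int iops, int blockSizeKB, double throughputCapMBps) {
        double iopsBoundMBps = iops * blockSizeKB / 1000.0; // decimal MB, as in the text
        return Math.min(iopsBoundMBps, throughputCapMBps);
    }

    /** Seconds needed to flush the dirty portion of the cache at the effective write rate. */
    static double flushSeconds(double cacheGB, double dirtyRatio,
                               int iops, int blockSizeKB, double throughputCapMBps) {
        double dirtyMB = cacheGB * dirtyRatio * 1000.0;
        return dirtyMB / effectiveMBps(iops, blockSizeKB, throughputCapMBps);
    }
}
```

Calling `CheckpointEstimator.flushSeconds(5, 0.30, 3000, 16, 125)` reproduces the roughly 31-second figure. With 16,000 IOPS and a 250 MB/s cap, the throughput limit becomes the binding constraint and the estimate drops to about 6 seconds.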
# Check disk I/O from inside the container
iostat -xm 2
# Output during checkpoint:
# Device   r/s   w/s    rMB/s   wMB/s   await   %util
# sdf      50    3100   0.8     48.0    12.5    98.2
# %util at 98% means the device is busy almost all of the time (saturated)
# await at 12.5 means the average I/O takes 12.5 ms, including time spent queued
The Benchmark
| Storage type | IOPS | Throughput | Checkpoint duration | Write p99 during checkpoint |
|---|---|---|---|---|
| gp3 (default 3,000 IOPS) | 3,000 | 125 MB/s | 31s | 200ms |
| gp3 (16,000 IOPS provisioned) | 16,000 | 250 MB/s | 6s | 45ms |
| io2 (50,000 IOPS) | 50,000 | 1,000 MB/s | 1.5s | 18ms |
| Local NVMe (i3.xlarge) | 200,000+ | 2,000+ MB/s | 0.4s | 8ms |
The Fix
Step 1: Provision gp3 volumes with higher IOPS.
# StorageClass with provisioned IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mongodb-fast
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"                     # 16,000 provisioned IOPS (max 16,000 for gp3)
  throughput: "250"                 # 250 MB/s (max 1,000 for gp3)
  encrypted: "true"
  csi.storage.k8s.io/fstype: xfs    # XFS performs better than ext4 for MongoDB
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# StatefulSet with storage class
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-fast
      resources:
        requests:
          storage: 500Gi   # Data + oplog + journal + headroom
Step 2: Size the PersistentVolume correctly.
| Component | Size calculation | Telemetry platform |
|---|---|---|
| Data | Current data + 6 months growth | 800 GB + 600 GB = 1,400 GB |
| Oplog | Sized per CH21-S1 | 270 GB |
| Journal | Fixed at ~1 GB | 1 GB |
| Index files | ~20% of data size | 280 GB |
| Headroom | 20% of total for compaction | 390 GB |
| Total | Sum of the rows above | 2,341 GB, rounded up to 2,500 GB |
WiredTiger needs temporary disk space during compaction. Without that headroom (20% in this sizing), compaction can fail, and the database keeps growing without reclaiming space from deleted documents.
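The sizing table can be reproduced with a few lines of arithmetic. The sketch below is illustrative: the class and method names are hypothetical, the ratios are parameters rather than fixed rules, and headroom is taken as 20% of the subtotal before headroom, which is what the 390 GB figure in the table corresponds to.

```java
/** Illustrative PersistentVolume sizing arithmetic; all inputs are in GB. */
final class VolumeSizer {
    private VolumeSizer() {}

    /** Data (current + growth) + oplog + journal + index files (a ratio of the data size). */
    static long subtotalGB(long currentDataGB, long growthGB, long oplogGB,
                           long journalGB, double indexRatio) {
        long data = currentDataGB + growthGB;
        long index = Math.round(data * indexRatio);
        return data + oplogGB + journalGB + index;
    }

    /** Adds compaction headroom as a ratio of the subtotal. */
    static long withHeadroomGB(long subtotalGB, double headroomRatio) {
        return subtotalGB + Math.round(subtotalGB * headroomRatio);
    }

    /** Rounds up to the next multiple of stepGB to get a provisioning size. */
    static long roundUpGB(long gb, long stepGB) {
        return ((gb + stepGB - 1) / stepGB) * stepGB;
    }
}
```

For the telemetry platform, `subtotalGB(800, 600, 270, 1, 0.20)` gives 1,951 GB, `withHeadroomGB(1951, 0.20)` gives 2,341 GB, and `roundUpGB(2341, 500)` gives the 2,500 GB used in the table. The 500 GB rounding step is an assumption chosen to match that result.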
Step 3: Use separate volumes for data and journal (if IOPS-constrained).
# Separate volumes for data and journal
volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-fast
      resources:
        requests:
          storage: 2500Gi
  - metadata:
      name: journal
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: mongodb-journal   # io2 for consistent low-latency journal writes
      resources:
        requests:
          storage: 10Gi
# Mount journal on separate volume
containers:
  - name: mongodb
    volumeMounts:
      - name: data
        mountPath: /data/db
      - name: journal
        mountPath: /data/db/journal
Separating the journal onto a dedicated volume with guaranteed IOPS ensures that journal writes (which are latency-sensitive) do not compete with checkpoint I/O on the data volume.
Step 4: Monitor storage metrics.
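// FAST: Export disk I/O metrics from serverStatus
import com.mongodb.client.MongoClient;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;
import org.bson.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class StorageMetricsExporter {
    private final MongoClient client;
    // Micrometer holds gauge values weakly, so keep strong references to the AtomicLongs.
    private final AtomicLong bytesRead, bytesWritten, writeLatencyMicros, writeOps;
    public StorageMetricsExporter(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.bytesRead = registry.gauge("mongodb.storage.bytes_read", new AtomicLong());
        this.bytesWritten = registry.gauge("mongodb.storage.bytes_written", new AtomicLong());
        this.writeLatencyMicros = registry.gauge("mongodb.latency.writes.micros", new AtomicLong());
        this.writeOps = registry.gauge("mongodb.latency.writes.ops", new AtomicLong());
    }

    // Refresh every 15 s (the interval is an assumption); requires @EnableScheduling.
    @Scheduled(fixedRate = 15_000)
    public void exportMetrics() {
        Document serverStatus = client.getDatabase("admin")
            .runCommand(new Document("serverStatus", 1));
        Document blockManager = serverStatus.get("wiredTiger", Document.class)
            .get("block-manager", Document.class);
        bytesRead.set(asLong(blockManager, "bytes read"));
        bytesWritten.set(asLong(blockManager, "bytes written"));
        // Cumulative write-operation latency (microseconds) and op count from opLatencies
        Document writes = serverStatus.get("opLatencies", Document.class)
            .get("writes", Document.class);
        writeLatencyMicros.set(asLong(writes, "latency"));
        writeOps.set(asLong(writes, "ops"));
    }

    // serverStatus counters can arrive as Int32, Int64, or Double
    private static long asLong(Document doc, String key) {
        return ((Number) doc.get(key)).longValue();
    }
}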
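The `latency` and `ops` fields under `opLatencies.writes` are cumulative since server start, so a single reading says little about a checkpoint window. A minimal sketch of deriving an average write latency from two consecutive samples follows; the class and method names are hypothetical, and this yields a mean, not the p99 reported in the tables.

```java
/** Illustrative helper: average write latency between two opLatencies.writes samples. */
final class OpLatencyDelta {
    private OpLatencyDelta() {}

    /**
     * Returns the mean write latency in milliseconds over the sampling interval.
     * Returns 0 when no writes completed or the counters went backwards (e.g. after a restart).
     */
    static double avgWriteMillis(long prevLatencyMicros, long prevOps,
                                 long curLatencyMicros, long curOps) {
        long ops = curOps - prevOps;
        long micros = curLatencyMicros - prevLatencyMicros;
        if (ops <= 0 || micros < 0) {
            return 0.0;
        }
        return (micros / (double) ops) / 1000.0;
    }
}
```

For example, if cumulative latency rises from 1,000,000 to 1,450,000 microseconds while the op count rises from 100 to 130, the interval average is 15 ms. Sampling at an interval shorter than the checkpoint duration makes the spike visible in the derived series.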
The Proof
After switching to gp3 with 16,000 provisioned IOPS:
| Metric | gp3 default (3,000 IOPS) | gp3 provisioned (16,000 IOPS) |
|---|---|---|
| Checkpoint duration | 31s | 6s |
| Write p99 during checkpoint | 200ms | 45ms |
| Write p99 (no checkpoint) | 15ms | 12ms |
| Disk utilization during checkpoint | 98% | 42% |
| Monthly storage cost (500 GB) | $40 + $0 IOPS | $40 + $65 IOPS = $105 |
The Trade-off
Provisioned IOPS cost money. On AWS, gp3 IOPS above the 3,000 baseline cost $0.005/IOPS/month. Provisioning 16,000 IOPS therefore costs (16,000 - 3,000) * $0.005 = $65/month per volume. For a 3-member replica set, this is $195/month for IOPS alone.
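The cost arithmetic is simple enough to keep next to the capacity plan. The sketch below uses the baseline and per-IOPS price quoted above as constants; actual pricing varies by region and over time, and the class name is hypothetical.

```java
/** Illustrative gp3 provisioned-IOPS cost estimate using the figures quoted in the text. */
final class Gp3IopsCost {
    static final int BASELINE_IOPS = 3_000;          // included with every gp3 volume
    static final double USD_PER_IOPS_MONTH = 0.005;  // price quoted in the text; an assumption for other regions

    private Gp3IopsCost() {}

    /** Monthly cost in USD for IOPS above the baseline, across the given number of volumes. */
    static double monthlyIopsCostUsd(int provisionedIops, int volumes) {
        int billableIops = Math.max(0, provisionedIops - BASELINE_IOPS);
        return billableIops * USD_PER_IOPS_MONTH * volumes;
    }
}
```

`Gp3IopsCost.monthlyIopsCostUsd(16000, 1)` gives about $65 and `monthlyIopsCostUsd(16000, 3)` about $195, matching the figures above. A volume at or below the baseline costs nothing extra for IOPS.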
Local NVMe (instance storage on i3/i3en instances) provides the best I/O performance but is ephemeral. If the instance is terminated, the data is lost. This requires careful backup and restore procedures. For MongoDB with replica sets, the data exists on other members, so losing one member’s local storage is recoverable but requires a full initial sync.
XFS outperforms ext4 for MongoDB workloads by 10-15% because XFS handles concurrent I/O better and has more efficient extent-based allocation. MongoDB's production notes also recommend XFS for the WiredTiger storage engine. Use XFS for MongoDB data volumes in Kubernetes.