Sizing the Oplog for the Telemetry Platform
The Symptom
A secondary went offline for a 45-minute kernel upgrade. By the time it came back, the oplog on the primary had rolled past the secondary's last replicated position. The secondary entered RECOVERING state, too stale to catch up, and had to be rebuilt with an initial sync that took 6 hours.
The Cause
The oplog was left at the default size: 5% of free disk space, which MongoDB caps at 50 GB. The telemetry platform’s write rate filled the 50 GB oplog in 33 minutes, so the 45-minute maintenance window outlasted the 33-minute oplog window.
Oplog consumption is not constant. During batch ingestion (e.g., a sensor that was offline for 2 hours and sends buffered readings), the write rate spikes to 3x the normal rate. The effective oplog window during spikes is only 11 minutes.
The Benchmark
| Oplog size | Normal window (50k ops/s) | Spike window (150k ops/s) | Safe maintenance window |
|---|---|---|---|
| 50 GB | 33 min | 11 min | ~8 min |
| 100 GB | 66 min | 22 min | ~15 min |
| 200 GB | 133 min | 44 min | ~30 min |
| 500 GB | 333 min (5.5 hrs) | 111 min (1.8 hrs) | ~1.3 hrs |
The safe maintenance window is approximately 70% of the spike window to account for bursts within the spike.
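The table rows can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the 500 bytes per operation figure used in the sizing calculation, decimal units (1 GB = 1000 MB), and the 70% safety share described above.

```java
import java.util.Locale;

/** Rough oplog-window arithmetic; every constant here is an assumption taken from this write-up. */
final class OplogWindowEstimate {
    static final double BYTES_PER_OP = 500.0;  // assumed average oplog entry size
    static final double SAFE_FRACTION = 0.70;  // share of the spike window treated as safe

    /** Minutes until an oplog of the given size wraps at a steady write rate. */
    static double windowMinutes(double oplogGb, double opsPerSecond) {
        double mbPerSecond = opsPerSecond * BYTES_PER_OP / 1_000_000.0;
        return oplogGb * 1000.0 / mbPerSecond / 60.0;
    }

    /** Safe maintenance window: a fixed fraction of the window at the spike write rate. */
    static double safeMaintenanceMinutes(double oplogGb, double spikeOpsPerSecond) {
        return SAFE_FRACTION * windowMinutes(oplogGb, spikeOpsPerSecond);
    }

    public static void main(String[] args) {
        for (double gb : new double[] {50, 100, 200, 500}) {
            System.out.printf(Locale.ROOT,
                    "%.0f GB: normal %.1f min, spike %.1f min, safe %.1f min%n",
                    gb,
                    windowMinutes(gb, 50_000),
                    windowMinutes(gb, 150_000),
                    safeMaintenanceMinutes(gb, 150_000));
        }
    }
}
```

Plugging in a measured average entry size instead of the assumed 500 bytes would make the estimate track a real workload more closely.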
The Fix
Step 1: Calculate the required oplog size.
Formula: oplog_size = write_rate_bytes_per_second * desired_window_seconds * safety_factor
For the telemetry platform:
- Peak write rate: 150,000 ops/s * 500 bytes/op = 75 MB/s
- Desired window: 4 hours (covers the longest maintenance procedure)
- Safety factor: 1.5x (for unexpected write spikes)
oplog_size = 75 MB/s * 14,400s * 1.5 = 1,620,000 MB ≈ 1.6 TB
That is impractical. The oplog would consume 1.6 TB on a 2 TB disk, leaving only 400 GB for data.
Recalculate with a 2-hour window and normal write rate:
oplog_size = 25 MB/s * 7,200s * 1.5 = 270,000 MB ≈ 270 GB
270 GB is 13.5% of a 2 TB disk. Acceptable.
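The two calculations above can be checked with a short sketch of the formula. The class and method names are hypothetical, and decimal units are assumed (1 TB = 1,000,000 MB), matching the arithmetic in this step.

```java
import java.util.Locale;

/** Sketch of the sizing formula: oplog_size = write rate * desired window * safety factor. */
final class OplogSizing {
    /** Required oplog size in MB. */
    static double requiredOplogMb(double writeRateMbPerSec, double windowSeconds, double safetyFactor) {
        return writeRateMbPerSec * windowSeconds * safetyFactor;
    }

    /** Share of the disk the oplog would occupy, as a percentage. */
    static double percentOfDisk(double oplogMb, double diskTb) {
        return 100.0 * oplogMb / (diskTb * 1_000_000.0);
    }

    public static void main(String[] args) {
        double peak = requiredOplogMb(75, 4 * 3600, 1.5);    // 4-hour window at the peak rate
        double normal = requiredOplogMb(25, 2 * 3600, 1.5);  // 2-hour window at the normal rate
        System.out.printf(Locale.ROOT, "peak: %.0f MB (%.1f%% of 2 TB)%n", peak, percentOfDisk(peak, 2));
        System.out.printf(Locale.ROOT, "normal: %.0f MB (%.1f%% of 2 TB)%n", normal, percentOfDisk(normal, 2));
    }
}
```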
Step 2: Resize the oplog online (MongoDB 4.0+).
// Resize oplog to 270 GB (specified in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 276480 })
// Verify
rs.printReplicationInfo()
// configured oplog size: 276480MB
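This operation is online and does not require downtime, but it only changes the member it runs on, so repeat it on every member of the replica set. Raising the limit takes effect immediately, and the oplog grows into it as new entries arrive (if there is disk space). Lowering the limit truncates the oldest entries; the freed disk space is not returned until compact is run on local.oplog.rs.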
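The size argument is easy to get wrong because the command takes megabytes while capacity planning is usually done in gigabytes. The helper below is a hypothetical convenience, not part of any driver: it converts with 1 GB = 1024 MB (the conversion behind 276480 above), rejects values under the 990 MB minimum that replSetResizeOplog accepts, and renders the shell command as text.

```java
/** Hypothetical helper that renders the resize command for a given size. */
final class ResizeOplogCommand {
    /** Smallest size, in MB, that replSetResizeOplog accepts. */
    static final long MIN_SIZE_MB = 990;

    /** Converts whole gigabytes to the megabyte value the command expects. */
    static long gbToMb(long gb) {
        return Math.multiplyExact(gb, 1024L);
    }

    /** Renders the shell command for a size in MB, rejecting sizes below the minimum. */
    static String shellCommand(long sizeMb) {
        if (sizeMb < MIN_SIZE_MB) {
            throw new IllegalArgumentException("oplog size must be at least " + MIN_SIZE_MB + " MB");
        }
        return "db.adminCommand({ replSetResizeOplog: 1, size: " + sizeMb + " })";
    }

    public static void main(String[] args) {
        System.out.println(shellCommand(gbToMb(270)));
    }
}
```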
Step 3: Monitor oplog window and alert on shrinkage.
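// Expose the oplog window as a Micrometer gauge (Prometheus scrapes it via the registry)
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.BsonTimestamp;
import org.bson.Document;
import org.springframework.stereotype.Component;

@Component
public class OplogWindowMonitor {
    private final MongoClient client;
    private final Gauge oplogWindow;

    public OplogWindowMonitor(MongoClient client, MeterRegistry registry) {
        this.client = client;
        // The gauge calls measureWindow() each time its value is read
        this.oplogWindow = Gauge.builder("mongodb.oplog.window.seconds",
                this, OplogWindowMonitor::measureWindow)
            .description("Oplog window in seconds")
            .register(registry);
    }

    private double measureWindow() {
        try {
            MongoDatabase local = client.getDatabase("local");
            MongoCollection<Document> oplog = local.getCollection("oplog.rs");
            // Oldest entry: first document in natural (insertion) order
            Document oldest = oplog.find()
                .sort(new Document("$natural", 1))
                .limit(1)
                .first();
            // Newest entry: last document in natural order
            Document newest = oplog.find()
                .sort(new Document("$natural", -1))
                .limit(1)
                .first();
            if (oldest != null && newest != null) {
                BsonTimestamp oldTs = oldest.get("ts", BsonTimestamp.class);
                BsonTimestamp newTs = newest.get("ts", BsonTimestamp.class);
                // getTime() is seconds since the Unix epoch, so the difference is the window in seconds
                return newTs.getTime() - oldTs.getTime();
            }
            return 0; // empty oplog
        } catch (Exception e) {
            return -1; // sentinel: the measurement failed (for example, the node was unreachable)
        }
    }
}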
Alert thresholds:
| Alert | Condition | Action |
|---|---|---|
| Oplog window shrinking | Window < 4 hours | Investigate write rate increase |
| Oplog window critical | Window < 1 hour | Increase oplog size or reduce writes |
| Oplog window danger | Window < 30 minutes | Cancel any planned maintenance |
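The thresholds in the table can be expressed as a small classifier, which is handy for unit-testing alert logic before wiring it into a real alerting system. This is a sketch with hypothetical names; it checks the most severe condition first so that a 20-minute window reports as danger and not merely as shrinking.

```java
import java.time.Duration;

/** Maps an observed oplog window to the alert levels from the table above. */
final class OplogAlerts {
    enum Level { OK, SHRINKING, CRITICAL, DANGER }

    static Level classify(Duration window) {
        if (window.compareTo(Duration.ofMinutes(30)) < 0) {
            return Level.DANGER;     // cancel any planned maintenance
        }
        if (window.compareTo(Duration.ofHours(1)) < 0) {
            return Level.CRITICAL;   // increase oplog size or reduce writes
        }
        if (window.compareTo(Duration.ofHours(4)) < 0) {
            return Level.SHRINKING;  // investigate write rate increase
        }
        return Level.OK;
    }

    public static void main(String[] args) {
        System.out.println(classify(Duration.ofMinutes(45)));
    }
}
```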
Step 4: Batch writes with bulk operations (this raises throughput on the primary but does not shrink oplog volume).
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import java.util.List;
import java.util.stream.Collectors;

// SLOW: Individual updates generate one oplog entry per document
for (Document reading : readings) {
    collection.updateOne(
        Filters.eq("_id", reading.get("_id")),
        Updates.set("processed", true)
    );
}
// 10,000 readings -> 10,000 oplog entries

// FAST: Bulk operations still generate one entry per document in the oplog,
// but they are more efficient on the primary (fewer network round trips)
// The oplog cost is the same, but the throughput is higher
BulkWriteOptions options = new BulkWriteOptions().ordered(false);
List<WriteModel<Document>> writes = readings.stream()
    .map(r -> new UpdateOneModel<Document>(
        Filters.eq("_id", r.get("_id")),
        Updates.set("processed", true)
    ))
    .collect(Collectors.toList());
collection.bulkWrite(writes, options);
The Proof
After resizing the oplog to 270 GB:
| Metric | Before (50 GB) | After (270 GB) |
|---|---|---|
| Normal oplog window | 33 min | 3 hours |
| Spike oplog window | 11 min | 1 hour |
| Initial syncs triggered | 3 per quarter | 0 |
| Maintenance window available | 8 min | 45 min |
| Disk usage for oplog (2 TB disk) | 2.5% | 13.5% |
The Trade-off
A larger oplog consumes disk space that could store data. On a 2 TB disk, 270 GB for the oplog leaves 1.73 TB for data. If the data is growing at 100 GB per month, this reduces the time before the disk fills from 20 months to 17 months.
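The runway figures follow from simple division. The sketch below uses hypothetical names and assumes, as the 20-month figure does, that the disk starts empty and that data grows linearly.

```java
import java.util.Locale;

/** Rough estimate of how long the disk lasts once the oplog reservation is subtracted. */
final class DiskRunway {
    /** Months until the space left after the oplog is filled, assuming linear growth from empty. */
    static double monthsUntilFull(double diskGb, double oplogGb, double growthGbPerMonth) {
        return (diskGb - oplogGb) / growthGbPerMonth;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "no oplog reserved: %.1f months%n", monthsUntilFull(2000, 0, 100));
        System.out.printf(Locale.ROOT, "270 GB oplog: %.1f months%n", monthsUntilFull(2000, 270, 100));
    }
}
```

Data already on the disk shortens both figures by the same amount, so the roughly three-month difference is the part that carries over to a real deployment.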
The oplog also consumes WiredTiger cache. Oplog entries pass through the cache during writes and reads (by secondaries). A very large oplog (> 50% of RAM) can pressure the cache, but in practice, only the recent oplog entries (the “hot” tail) are in cache, and older entries are on disk.
For the telemetry platform, 270 GB of oplog on a 2 TB disk is a reasonable trade-off. The alternative (a 6-hour initial sync every time maintenance exceeds 33 minutes) is far more costly in operational time and reduced redundancy.