Sizing the Oplog for the Telemetry Platform

The Symptom

A secondary went offline for a 45-minute kernel upgrade. When it came back, the oplog had advanced past its last replicated position. The secondary entered RECOVERING state and began an initial sync that took 6 hours.

The Cause

The oplog was left at the default size: 5% of free disk space, capped at 50 GB, which came to 50 GB on a 1 TB disk. The telemetry platform’s normal write rate filled the 50 GB oplog in 33 minutes, so the 45-minute maintenance window outlasted the 33-minute oplog window.

Oplog consumption is not constant. During batch ingestion (e.g., a sensor that was offline for 2 hours and sends buffered readings), the write rate spikes to 3x the normal rate. The effective oplog window during spikes is only 11 minutes.

The Benchmark

Oplog size	Normal window (50k ops/s)	Spike window (150k ops/s)	Safe maintenance window
50 GB	33 min	11 min	~8 min
100 GB	66 min	22 min	~15 min
200 GB	133 min	44 min	~30 min
500 GB	333 min (5.5 hrs)	111 min (1.8 hrs)	~1.3 hrs

The safe maintenance window is approximately 70% of the spike window to account for bursts within the spike.
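The table rows can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, sizes are treated as decimal GB, and the 500 bytes/op average is the figure used in the sizing calculation in Step 1.

```java
/** Back-of-the-envelope oplog window estimate; all inputs are illustrative. */
final class OplogWindowEstimate {

    // Fraction of the spike window treated as safe, per the rule of thumb above.
    static final double SAFE_FRACTION = 0.70;

    /** Minutes until a full oplog of the given size is overwritten at a steady write rate. */
    static double windowMinutes(double oplogGb, double opsPerSecond, double bytesPerOp) {
        double oplogBytes = oplogGb * 1_000_000_000.0;   // decimal GB, as in the table
        double bytesPerSecond = opsPerSecond * bytesPerOp;
        return oplogBytes / bytesPerSecond / 60.0;
    }

    /** Safe maintenance window: a fixed fraction of the spike window. */
    static double safeMaintenanceMinutes(double oplogGb, double spikeOpsPerSecond, double bytesPerOp) {
        return SAFE_FRACTION * windowMinutes(oplogGb, spikeOpsPerSecond, bytesPerOp);
    }

    public static void main(String[] args) {
        // Same oplog sizes and write rates as the benchmark table.
        for (double gb : new double[] {50, 100, 200, 500}) {
            System.out.printf("%.0f GB: normal %.1f min, spike %.1f min, safe %.1f min%n",
                gb,
                windowMinutes(gb, 50_000, 500),
                windowMinutes(gb, 150_000, 500),
                safeMaintenanceMinutes(gb, 150_000, 500));
        }
    }
}
```

Keeping the safe fraction as a named constant makes it easy to tighten if bursts within a spike turn out to be sharper than assumed.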

The Fix

Step 1: Calculate the required oplog size.

Formula: oplog_size = write_rate_bytes_per_second * desired_window_seconds * safety_factor

For the telemetry platform:

Peak write rate: 150,000 ops/s * 500 bytes/op = 75 MB/s
Desired window: 4 hours (covers the longest maintenance procedure)
Safety factor: 1.5x (for unexpected write spikes)

oplog_size = 75 MB/s * 14,400s * 1.5 = 1,620,000 MB ≈ 1.6 TB

That is impractical. The oplog would consume 1.6 TB on a 2 TB disk, leaving only 400 GB for data.

Recalculate with a 2-hour window and normal write rate:

oplog_size = 25 MB/s * 7,200s * 1.5 = 270,000 MB ≈ 270 GB

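270 GB is 13.5% of a 2 TB disk. Acceptable, with one caveat: the calculation uses the normal write rate, so at the 75 MB/s spike rate the same 270 GB covers only 1 hour (270,000 MB / 75 MB/s = 3,600 s).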
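Both calculations follow directly from the formula, so they are easy to script when the inputs change. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB); the class and method names are hypothetical.

```java
/** Applies the sizing formula: write rate * window * safety factor. */
final class OplogSizing {

    /** Required oplog size in MB. */
    static double requiredOplogMb(double writeRateMbPerSec, long windowSeconds, double safetyFactor) {
        return writeRateMbPerSec * windowSeconds * safetyFactor;
    }

    /** Share of the disk the oplog would occupy, as a percentage. */
    static double percentOfDisk(double oplogMb, double diskMb) {
        return 100.0 * oplogMb / diskMb;
    }

    public static void main(String[] args) {
        double diskMb = 2_000_000.0;                            // 2 TB disk, decimal units
        double peak = requiredOplogMb(75.0, 4 * 3600, 1.5);     // spike rate, 4-hour window
        double normal = requiredOplogMb(25.0, 2 * 3600, 1.5);   // normal rate, 2-hour window
        System.out.printf("peak:   %.0f MB (%.1f%% of disk)%n", peak, percentOfDisk(peak, diskMb));
        System.out.printf("normal: %.0f MB (%.1f%% of disk)%n", normal, percentOfDisk(normal, diskMb));
    }
}
```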

Step 2: Resize the oplog online (MongoDB 3.6+, WiredTiger).

// Resize oplog to 270 GB (specified in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 276480 })

// Verify
rs.printReplicationInfo()
// configured oplog size: 276480MB

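This operation is online and does not require downtime. The command changes only the member it runs on, so repeat it on every member of the replica set. A larger limit takes effect immediately (if there is disk space), although the window lengthens only as new entries accumulate; a smaller limit truncates the oldest entries.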
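The size argument is given in megabytes, and 276480 is 270 multiplied by 1024. The helper below is a hypothetical sketch that does the conversion and renders the shell command. It assumes the target is expressed in whole GiB and rejects values below the 990 MB lower bound that replSetResizeOplog accepts.

```java
/** Builds the `size` argument for replSetResizeOplog from a target in GiB. */
final class OplogResizeParam {

    // Smallest value replSetResizeOplog accepts for `size`, in megabytes.
    static final long MIN_OPLOG_MB = 990;

    /** Converts a target size in GiB to the megabyte figure passed as `size`. */
    static long sizeParamMb(long targetGib) {
        long mb = Math.multiplyExact(targetGib, 1024L);
        if (mb < MIN_OPLOG_MB) {
            throw new IllegalArgumentException(
                "oplog size must be at least " + MIN_OPLOG_MB + " MB, got " + mb);
        }
        return mb;
    }

    /** Renders the shell command for the given target size. */
    static String shellCommand(long targetGib) {
        return String.format("db.adminCommand({ replSetResizeOplog: 1, size: %d })",
            sizeParamMb(targetGib));
    }

    public static void main(String[] args) {
        System.out.println(shellCommand(270));
    }
}
```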

Step 3: Monitor oplog window and alert on shrinkage.

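// Expose the oplog window as a Micrometer gauge that a Prometheus registry can scrape
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.bson.BsonTimestamp;
import org.bson.Document;
import org.springframework.stereotype.Component;
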
@Component
public class OplogWindowMonitor {

    private final MongoClient client;
    private final Gauge oplogWindow;

    public OplogWindowMonitor(MongoClient client, MeterRegistry registry) {
        this.client = client;
        this.oplogWindow = Gauge.builder("mongodb.oplog.window.seconds",
                this, OplogWindowMonitor::measureWindow)
            .description("Oplog window in seconds")
            .register(registry);
    }

    private double measureWindow() {
        try {
            MongoDatabase local = client.getDatabase("local");
            MongoCollection<Document> oplog = local.getCollection("oplog.rs");

            // Get oldest entry
            Document oldest = oplog.find()
                .sort(new Document("$natural", 1))
                .limit(1)
                .first();

            // Get newest entry
            Document newest = oplog.find()
                .sort(new Document("$natural", -1))
                .limit(1)
                .first();

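            if (oldest != null && newest != null) {
                BsonTimestamp oldTs = oldest.get("ts", BsonTimestamp.class);
                BsonTimestamp newTs = newest.get("ts", BsonTimestamp.class);
                // getTime() is seconds since the Unix epoch, so the difference is the window in seconds
                return newTs.getTime() - oldTs.getTime();
            }
            // Empty oplog: report a zero-length window
            return 0;
        } catch (Exception e) {
            // -1 marks a failed measurement; alert on it separately from a short window
            return -1;
        }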
    }
}

Alert thresholds:

Alert	Condition	Action
Oplog window shrinking	Window < 2 hours	Investigate write rate increase
Oplog window critical	Window < 1 hour	Increase oplog size or reduce writes
Oplog window danger	Window < 30 minutes	Cancel any planned maintenance
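The thresholds can be evaluated in code as well as in the alerting system, for example to gate a maintenance script. The sketch below is a hypothetical classifier, not part of the monitor above. It takes the three thresholds as constructor arguments so the values from the table can be tuned without touching the logic.

```java
import java.time.Duration;

/** Maps an observed oplog window to the alert levels in the table. */
final class OplogWindowAlerts {

    enum Level { OK, SHRINKING, CRITICAL, DANGER }

    private final Duration shrinking;
    private final Duration critical;
    private final Duration danger;

    OplogWindowAlerts(Duration shrinking, Duration critical, Duration danger) {
        // Thresholds must tighten from "shrinking" down to "danger".
        if (shrinking.compareTo(critical) <= 0 || critical.compareTo(danger) <= 0) {
            throw new IllegalArgumentException("thresholds must satisfy shrinking > critical > danger");
        }
        this.shrinking = shrinking;
        this.critical = critical;
        this.danger = danger;
    }

    /** Returns the most severe level whose threshold the window falls below. */
    Level classify(Duration window) {
        if (window.compareTo(danger) < 0) return Level.DANGER;
        if (window.compareTo(critical) < 0) return Level.CRITICAL;
        if (window.compareTo(shrinking) < 0) return Level.SHRINKING;
        return Level.OK;
    }
}
```

A maintenance script could refuse to take a secondary offline unless `classify` returns `OK` for the current window.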

Step 4: Batch writes to raise throughput (oplog consumption stays the same).

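import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.List;
import java.util.stream.Collectors;

// Assumes an existing MongoCollection<Document> collection and List<Document> readings
// SLOW: Individual updates generate one oplog entry per document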
for (Document reading : readings) {
    collection.updateOne(
        Filters.eq("_id", reading.get("_id")),
        Updates.set("processed", true)
    );
}
// 10,000 readings -> 10,000 oplog entries

// FAST: Bulk operations still generate one entry per document in the oplog,
// but they are more efficient on the primary (fewer network round trips)
// The oplog cost is the same, but the throughput is higher
BulkWriteOptions options = new BulkWriteOptions().ordered(false);
List<WriteModel<Document>> writes = readings.stream()
    .map(r -> new UpdateOneModel<Document>(
        Filters.eq("_id", r.get("_id")),
        Updates.set("processed", true)
    ))
    .collect(Collectors.toList());

collection.bulkWrite(writes, options);

The Proof

After resizing the oplog to 270 GB:

Metric	Before (50 GB)	After (270 GB)
Normal oplog window	33 min	3 hours
Spike oplog window	11 min	1 hour
Initial syncs triggered	3 per quarter	0
Maintenance window available	8 min	45 min
Disk usage for oplog	5%	13.5%

The Trade-off

A larger oplog consumes disk space that could store data. On a 2 TB disk, 270 GB for the oplog leaves 1.73 TB for data. If the data is growing at 100 GB per month, this reduces the time before the disk fills from 20 months to 17 months.
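The disk-runway figures come from a single division. A minimal sketch with hypothetical names, assuming an empty disk at the start and the linear 100 GB per month growth stated above:

```java
/** Estimates how long the disk lasts once space is reserved for the oplog. */
final class DiskRunway {

    /** Months until data fills the space left after reserving the oplog. */
    static double monthsUntilFull(double diskGb, double oplogGb, double growthGbPerMonth) {
        return (diskGb - oplogGb) / growthGbPerMonth;
    }

    public static void main(String[] args) {
        System.out.printf("no oplog reserved: %.1f months%n", monthsUntilFull(2000, 0, 100));
        System.out.printf("270 GB oplog:      %.1f months%n", monthsUntilFull(2000, 270, 100));
    }
}
```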

The oplog also consumes WiredTiger cache. Oplog entries pass through the cache during writes and reads (by secondaries). A very large oplog (> 50% of RAM) can pressure the cache, but in practice, only the recent oplog entries (the “hot” tail) are in cache, and older entries are on disk.

For the telemetry platform, 270 GB of oplog on a 2 TB disk is a reasonable trade-off. The alternative (a 6-hour initial sync every time maintenance exceeds 33 minutes) is far more costly in operational time and reduced redundancy.