Oplog Interaction with Index Builds and Initial Sync
The Symptom
The team creates a new compound index on the 800 GB readings collection. The index build starts on all replica set members simultaneously (the default behavior since MongoDB 4.4). During the build, secondary2 falls steadily behind: its replication lag grows from 2 seconds to 180 seconds over 30 minutes. After 45 minutes, secondary2 enters RECOVERING state because the oplog window was exceeded.
The Cause
Since MongoDB 4.4, index builds run on all data-bearing members simultaneously, but not at the same speed. The primary may complete the index build in 20 minutes while a secondary with slower storage takes 45 minutes. During the build, the secondary’s oplog application can slow down or stall. Writes that arrive in the meantime accumulate in the oplog. If the secondary stays behind for longer than the oplog window, the entries it still needs are overwritten and it cannot catch up.
The compounding factor: the index build competes with replication on the same member. The build is recorded in the oplog (in 4.4+ as startIndexBuild and commitIndexBuild entries), and each member applies it. The data scanning and sorting for the build then consume disk I/O, which slows the secondary’s ability to apply other oplog entries at the same time.
```javascript
// Check index build progress on all members
db.currentOp({ "command.createIndexes": { $exists: true } })

// Output shows:
// {
//   "desc": "IndexBuildsCoordinatorMongod",
//   "command": { "createIndexes": "readings" },
//   "progress": { "done": 450000000, "total": 800000000 },
//   "msg": "Index Build: scanning collection"
// }
```
The Benchmark
| Collection size | Index build time (NVMe) | Index build time (SSD) | Min oplog window needed |
|---|---|---|---|
| 100 GB | 8 minutes | 15 minutes | 25 minutes |
| 500 GB | 40 minutes | 75 minutes | 2 hours |
| 800 GB | 65 minutes | 120 minutes | 3 hours |
| 2 TB | 160 minutes | 300 minutes | 7 hours |
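As a working rule, keep the oplog window at least 2x the expected index build time on the slowest member. Note that the table's last column works out to roughly 1.4x to 1.7x the SSD build time, so treat it as a floor and the 2x rule as the target.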
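The 2x rule can be written as a small helper. The sketch below is one possible form, assuming per-member build estimates are already known. The class name `IndexBuildWindow`, the member names `primary` and `secondary1`, and the assignment of build times to members are illustrative assumptions; the 65 and 120 minute figures are borrowed from the 800 GB row of the table.

```java
import java.util.Map;

/** Hypothetical helper: derives the oplog window target from per-member build estimates. */
final class IndexBuildWindow {
    /** Safety factor from the rule of thumb: window >= 2x the slowest build. */
    static final double SAFETY_FACTOR = 2.0;

    private IndexBuildWindow() {}

    /** Returns the target oplog window in minutes, driven by the slowest member. */
    static double requiredWindowMinutes(Map<String, Double> buildMinutesByMember) {
        double slowest = buildMinutesByMember.values().stream()
                .mapToDouble(Double::doubleValue)
                .max()
                .orElseThrow(() -> new IllegalArgumentException("no members given"));
        return slowest * SAFETY_FACTOR;
    }

    public static void main(String[] args) {
        // Assumed layout: two NVMe members and one SSD member, 800 GB collection.
        Map<String, Double> estimates = Map.of(
                "primary", 65.0,
                "secondary1", 65.0,
                "secondary2", 120.0);
        System.out.println("Target oplog window (min): " + requiredWindowMinutes(estimates));
    }
}
```

The point of taking the maximum rather than the average is that a single slow member is enough to fall off the oplog, so the fastest members do not help.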
The Fix
Step 1: Verify oplog window before starting an index build.
```java
// FAST: Pre-flight check before index creation
public boolean canBuildIndex(long estimatedBuildMinutes) {
    double oplogWindowSeconds = oplogWindowMonitor.measureWindow();
    double oplogWindowMinutes = oplogWindowSeconds / 60.0;

    // Need at least 2x the build time as oplog window
    double requiredWindow = estimatedBuildMinutes * 2.0;

    if (oplogWindowMinutes < requiredWindow) {
        logger.error(
            "Oplog window ({} min) is less than 2x estimated build time ({} min). " +
            "Resize oplog before building index.",
            oplogWindowMinutes, estimatedBuildMinutes);
        return false;
    }

    logger.info("Oplog window ({} min) is sufficient for index build ({} min estimate)",
        oplogWindowMinutes, estimatedBuildMinutes);
    return true;
}
```
Step 2: Build indexes during low-traffic periods.
Lower write rates mean slower oplog consumption. Building the index during a period with 10,000 ops/s instead of 50,000 ops/s gives 5x more oplog runway.
```java
// FAST: Schedule index build during maintenance window
public void buildIndexSafely(MongoCollection<Document> collection,
                             Bson keys, IndexOptions options) {
    // Verify oplog window
    if (!canBuildIndex(90)) { // Estimate 90 minutes for 800 GB
        throw new IllegalStateException("Oplog window insufficient for index build");
    }

    // Verify replication lag is low (threshold in seconds)
    double lag = replicationLagMonitor.measureLag();
    if (lag > 5) {
        throw new IllegalStateException(
            "Replication lag is " + lag + "s. Wait for secondaries to catch up.");
    }

    // Build the index. createIndex blocks until the build completes,
    // so the log line below reports completion, not start.
    String indexName = collection.createIndex(keys, options);
    logger.info("Index build finished: {}", indexName);
}
```
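The "5x more oplog runway" claim in Step 2 follows from simple arithmetic: runway is oplog size divided by the rate at which writes fill it. The sketch below illustrates this under stated assumptions. The class name `OplogRunway`, the 50 GB oplog size, and the 200-byte average entry size are all hypothetical values chosen for illustration, and the model ignores oplog compression and variation in entry size.

```java
/** Hypothetical helper: estimates how long a fixed-size oplog lasts at a given write rate. */
final class OplogRunway {
    private OplogRunway() {}

    /** Hours until the oplog wraps, given its size and the rate at which writes fill it. */
    static double runwayHours(double oplogBytes, double opsPerSecond, double avgEntryBytes) {
        if (opsPerSecond <= 0 || avgEntryBytes <= 0) {
            throw new IllegalArgumentException("rate and entry size must be positive");
        }
        double bytesPerSecond = opsPerSecond * avgEntryBytes;
        return oplogBytes / bytesPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        double oplogBytes = 50e9;  // assumed 50 GB oplog
        double entryBytes = 200;   // assumed average oplog entry size

        double peak = runwayHours(oplogBytes, 50_000, entryBytes);
        double quiet = runwayHours(oplogBytes, 10_000, entryBytes);

        System.out.printf("Runway at 50,000 ops/s: %.2f h%n", peak);
        System.out.printf("Runway at 10,000 ops/s: %.2f h%n", quiet);
        System.out.printf("Ratio: %.1fx%n", quiet / peak);
    }
}
```

Whatever oplog size and entry size are plugged in, the ratio between the two runways depends only on the ratio of the write rates, which is why the 5x figure holds independent of the cluster's actual oplog size.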
Step 3: Handle initial sync oplog requirements.
Initial sync copies the entire dataset from a sync source (primary or secondary) to the new member. During the copy, the source continues to receive writes. These writes are recorded in the oplog. After the data copy completes, the new member applies the oplog entries that accumulated during the copy to catch up to the current state.
If the data copy takes 6 hours and the oplog window is 3 hours, the oplog entries from the first 3 hours of the copy are overwritten before the new member can apply them. The initial sync fails and restarts.
```javascript
// Estimate initial sync time
var dataSize = db.stats().dataSize;   // bytes
var copySpeed = 100 * 1024 * 1024;    // ~100 MB/s typical initial sync speed
var estimatedSeconds = dataSize / copySpeed;
var estimatedHours = estimatedSeconds / 3600;

print("Estimated initial sync time: " + estimatedHours.toFixed(1) + " hours");
print("Required oplog window: " + (estimatedHours * 2).toFixed(1) + " hours");

// For a 2 TB dataset (2 * 1024^4 bytes):
// Estimated initial sync time: 5.8 hours
// Required oplog window: 11.7 hours
```
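The same estimate can be turned into a go/no-go check before adding a member. The following is a minimal sketch under the model described in Step 3 (the sync fails if the copy outlasts the oplog window). The class name `InitialSyncEstimate` and both method names are hypothetical, and the copy speed is an input you would measure rather than a guaranteed figure.

```java
/** Hypothetical helper: checks whether an oplog window covers an initial sync. */
final class InitialSyncEstimate {
    private InitialSyncEstimate() {}

    /** Estimated data-copy duration in hours for the given size and throughput. */
    static double copyHours(long dataBytes, long bytesPerSecond) {
        if (bytesPerSecond <= 0) {
            throw new IllegalArgumentException("copy speed must be positive");
        }
        return (double) dataBytes / bytesPerSecond / 3600.0;
    }

    /** True if the oplog window is at least safetyFactor times the copy duration. */
    static boolean windowCoversSync(double copyHours, double oplogWindowHours,
                                    double safetyFactor) {
        return oplogWindowHours >= copyHours * safetyFactor;
    }

    public static void main(String[] args) {
        long twoTb = 2L * 1024 * 1024 * 1024 * 1024;  // 2 TB in binary units
        long speed = 100L * 1024 * 1024;              // assumed ~100 MB/s copy speed
        double hours = copyHours(twoTb, speed);

        System.out.printf("Estimated copy time: %.1f h%n", hours);
        System.out.println("3 h window sufficient?  " + windowCoversSync(hours, 3.0, 2.0));
        System.out.println("12 h window sufficient? " + windowCoversSync(hours, 12.0, 2.0));
    }
}
```

Passing a safety factor of 1.0 reproduces the bare failure condition from the text (a 6 hour copy against a 3 hour window fails), while 2.0 applies the same margin used for index builds.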
Step 4: Reduce write rate during initial sync if oplog is tight.
```java
// FAST: Throttle ingestion during initial sync
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.stereotype.Component;

@Component
public class AdaptiveIngestionThrottle {
    private final AtomicBoolean syncInProgress = new AtomicBoolean(false);

    public void setSyncInProgress(boolean inProgress) {
        syncInProgress.set(inProgress);
    }

    public int getBatchSize() {
        // Normal: 100 documents per batch
        // During sync: 25 documents per batch (4x slower ingestion)
        return syncInProgress.get() ? 25 : 100;
    }

    public long getBatchDelayMs() {
        // Normal: 0ms between batches
        // During sync: 50ms between batches
        return syncInProgress.get() ? 50 : 0;
    }
}
```
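The throttle only takes effect if the ingestion loop consults it before every batch. The sketch below shows one way a writer might do that. It is an illustration, not the platform's actual ingestion code: `IngestionThrottle` is a hypothetical stand-in interface mirroring the two getters of `AdaptiveIngestionThrottle` so the example compiles without Spring, and `ThrottledWriter` and its `sink` parameter are likewise assumed names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in mirroring AdaptiveIngestionThrottle's two getters. */
interface IngestionThrottle {
    int getBatchSize();
    long getBatchDelayMs();
}

/** Hypothetical writer that re-reads the throttle before every batch. */
final class ThrottledWriter {
    private ThrottledWriter() {}

    /**
     * Sends items to the sink in batches and returns the number of batches sent.
     * The throttle is consulted per batch, so flipping the sync flag mid-stream
     * changes the batch size and delay from the next batch onward.
     */
    static <T> int writeAll(List<T> items, IngestionThrottle throttle,
                            Consumer<List<T>> sink) {
        int batches = 0;
        int index = 0;
        while (index < items.size()) {
            int size = Math.max(1, throttle.getBatchSize());
            int end = Math.min(items.size(), index + size);
            sink.accept(new ArrayList<>(items.subList(index, end)));
            batches++;
            index = end;

            long delayMs = throttle.getBatchDelayMs();
            if (delayMs > 0 && index < items.size()) {
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    // Restore the flag and stop early; remaining items are not sent.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return batches;
    }
}
```

In the real application the sink would be something like a call to `insertMany` on the readings collection. Reading the throttle inside the loop, rather than once up front, is what lets an operator toggle `setSyncInProgress` without restarting the ingestion process.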
The Proof
After implementing pre-flight checks and maintenance window scheduling:
| Scenario | Before | After |
|---|---|---|
| Index build on 800 GB | Secondary entered RECOVERING | Completed in 65 min, 45s max lag |
| Initial sync (2 TB) | Failed twice, succeeded third time | Succeeded first attempt |
| Unplanned initial syncs/quarter | 3 | 0 |
| Time spent on initial sync recovery | 18 hours/quarter | 0 |
The Trade-off
Pre-flight checks prevent index builds when the oplog window is insufficient. This means the team cannot build indexes immediately when they are needed. They must either resize the oplog first (which takes seconds but consumes disk space) or wait for a low-traffic window.
Throttling ingestion during initial sync reduces data freshness. For 6 hours during the sync, the telemetry platform ingests data at 25% of normal rate. Sensor readings queue in the application layer or in Kafka. After the sync completes, the backlog must be processed, which creates a temporary spike.
The fundamental constraint: large datasets and high write rates require large oplogs. A 2 TB dataset with 50,000 writes/second needs a 270+ GB oplog to safely handle maintenance. No configuration setting works around this; it is a capacity planning requirement. Budget 10-15% of disk for the oplog when planning storage for write-heavy MongoDB workloads.
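The sizing arithmetic behind a figure like this can be sketched as write rate times average entry size times the target window. The text does not state which inputs produced the 270 GB number, so the values below (a 125-byte average entry and a 12 hour window) are assumptions chosen only because they land near it. The class name `OplogCapacityPlan` and the 2,700 GB disk size are also hypothetical, and the model ignores compression.

```java
/** Hypothetical helper: sizes an oplog from write rate, entry size, and target window. */
final class OplogCapacityPlan {
    private OplogCapacityPlan() {}

    /** Oplog size in GB (decimal) needed to retain windowHours of writes. */
    static double requiredOplogGb(double writesPerSecond, double avgEntryBytes,
                                  double windowHours) {
        double bytes = writesPerSecond * avgEntryBytes * windowHours * 3600.0;
        return bytes / 1e9;
    }

    /** Fraction of the disk the oplog would occupy (0.10 means 10%). */
    static double oplogShareOfDisk(double oplogGb, double diskGb) {
        if (diskGb <= 0) {
            throw new IllegalArgumentException("disk size must be positive");
        }
        return oplogGb / diskGb;
    }

    public static void main(String[] args) {
        // Assumed inputs: 50,000 writes/s (from the text), 125-byte entries, 12 h window.
        double oplogGb = requiredOplogGb(50_000, 125, 12);
        double share = oplogShareOfDisk(oplogGb, 2_700);  // assumed 2.7 TB disk

        System.out.printf("Required oplog: %.0f GB%n", oplogGb);
        System.out.printf("Share of disk: %.0f%%%n", share * 100);
    }
}
```

Because the result scales linearly with each input, measuring the real average oplog entry size on the cluster matters as much as the write rate: doubling either one doubles the oplog that has to be budgeted.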