Diagnosing Migration-Induced Latency Spikes

The Symptom

The telemetry platform’s read latency shows periodic spikes: p99 jumps from 15ms to 120ms every 15-20 minutes, then returns to baseline. The spikes last 30-90 seconds. They are not correlated with traffic patterns or deployment events.

The Cause

Each spike coincides with a chunk migration. When the balancer moves a chunk from shard 2 to shard 4:

Shard 2 reads the chunk’s documents from WiredTiger, evicting cached query data.
Shard 4 writes the incoming documents, consuming write I/O bandwidth.
After the migration completes, shard 2 performs range deletion (cleanup), which holds locks and generates I/O.

The range deletion phase (step 3) is the worst offender. MongoDB deletes migrated documents in batches, but each batch acquires a lock and generates write I/O that competes with normal operations.

// Identify active migrations
db.adminCommand({ currentOp: true, desc: /moveChunk|migrat/ })

// Check range deletion queue
db.adminCommand({ currentOp: true, desc: "RangeDeleter" })

// Output shows:
// {
//   "desc": "RangeDeleter",
//   "active": true,
//   "ns": "telemetry.readings",
//   "range": { "min": {"sensorId": "sensor-02000"}, "max": {"sensorId": "sensor-03000"} },
//   "numDocsDeleted": 45000,
//   "totalDocs": 180000
// }

Correlate migration timestamps with latency spikes:

// Migration history with timing
use config
db.changelog.find({
  what: { $in: ["moveChunk.start", "moveChunk.commit"] },
  time: { $gte: new Date(Date.now() - 3600000) }  // last hour
}).sort({ time: 1 }).forEach(function(entry) {
  print(entry.time.toISOString() + " " + entry.what + " " + 
        entry.details.from + " -> " + entry.details.to);
});

The Benchmark

Measure the read latency impact during migrations:

// SLOW: Default migration settings, no balancer window
// Migrations happen during peak traffic hours
// k6 results during migration:
//   p50: 8ms -> 25ms (3.1x increase)
//   p99: 15ms -> 120ms (8x increase)
//   Duration: 30-90 seconds per migration

// Monitor per-shard latency during migration
MongoClient client = MongoClients.create(connectionString);
ServerDescription sourceDesc = client.getClusterDescription()
    .getServerDescriptions().stream()
    .filter(s -> s.getAddress().equals(sourceShardAddress))
    .findFirst().orElseThrow();

Metric	No migration	During migration (source)	During migration (target)
Read p50	5ms	18ms	7ms
Read p99	15ms	120ms	22ms
Write p50	8ms	12ms	35ms
WiredTiger cache evictions/s	200	1,400	850
Disk read IOPS	3,000	8,500	3,200

The source shard suffers most on reads (cache eviction), and the target shard suffers on writes (incoming documents and index builds).

The Fix

Step 1: Set a balancer window to avoid peak traffic.

// Run migrations only during off-peak hours (2 AM - 5 AM UTC)
db.settings.updateOne(
  { _id: "balancer" },
  { $set: { 
    activeWindow: { start: "02:00", stop: "05:00" }
  }},
  { upsert: true }
)

Step 2: Tune range deletion to reduce lock contention.

// Increase delay between range deletion batches
// Default: 20ms. Increase to 100ms to reduce I/O pressure.
db.adminCommand({
  setParameter: 1,
  rangeDeleterBatchDelayMS: 100
})

// Reduce batch size for range deletions
// Default: Unlimited. Set to 1000 to limit per-batch I/O.
db.adminCommand({
  setParameter: 1,
  rangeDeleterBatchSize: 1000
})

Step 3: Monitor migrations in the Java application.

// FAST: Connection pool listener that logs migration-related events
public class MigrationAwareCommandListener implements CommandListener {
    private final MeterRegistry registry;

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        if (event.getElapsedTime(TimeUnit.MILLISECONDS) > 50) {
            // Correlate slow commands with migration windows
            registry.counter("mongo.slow_commands",
                "command", event.getCommandName(),
                "server", event.getConnectionDescription()
                    .getServerAddress().toString()
            ).increment();
        }
    }
}

The Proof

After applying balancer window and range deletion tuning:

Metric	Before (default)	After (tuned)
Migration-induced p99 spike	120ms	45ms
Spike frequency during peak	Every 15-20 min	Zero (migrations only 2-5 AM)
Range deletion duration	30s	90s (slower but less impactful)
Cache evictions during migration	1,400/s	800/s

The Trade-off

Restricting the balancer window means that chunk imbalance accumulates during peak hours. If the cluster receives 500 GB of new data during the 21-hour peak window, the balancer has only 3 hours to redistribute. For rapidly growing collections, this may not be enough time to rebalance.

Increasing range deletion batch delay reduces the I/O impact but extends the time that orphaned documents exist on the source shard. During this window, a query that targets the source shard may see stale data if it reads from a secondary that has not yet replicated the deletions. For the telemetry platform, where readings are append-only and queries use readPreference: primary, this is acceptable.