Diagnosing Ticket Starvation and Long-Running Operations
Diagnosing Ticket Starvation and Long-Running Operations
The Symptom
The telemetry dashboard’s p99 latency jumps from 20ms to 3,000ms intermittently. The spikes do not correlate with checkpoint intervals (CH13). They correlate with the analytics team’s ad-hoc query sessions. When an analyst runs an unindexed query against the readings collection, dashboard queries slow down.
The Cause
The analyst’s query performs a collection scan:
// Analyst query: no index, full collection scan
db.readings.find({ temperature: { $gt: 50 } }).sort({ ts: -1 })
There is no index on {temperature: 1}. This query performs a COLLSCAN on 200 million documents, holding a read ticket for 30-60 seconds. While this ticket is held, one fewer ticket is available for dashboard queries. If multiple analysts run similar queries, ticket availability drops rapidly.
// Find long-running operations
db.currentOp({
"active": true,
"secs_running": { "$gt": 5 },
"op": { "$in": ["query", "getmore"] }
}).inprog.forEach(op => {
print(`OpId: ${op.opid}, Running: ${op.secs_running}s, ` +
`Namespace: ${op.ns}, Plan: ${JSON.stringify(op.planSummary)}`);
});
Output:
OpId: 12345, Running: 42s, Namespace: telemetry.readings, Plan: ["COLLSCAN"]
OpId: 12346, Running: 18s, Namespace: telemetry.readings, Plan: ["COLLSCAN"]
Two collection scans running for 42 and 18 seconds, each holding a read ticket.
The Benchmark
Measure the impact of a single long-running operation on short queries:
// k6 test: steady dashboard load with/without concurrent COLLSCAN
import http from 'k6/http';
import { Trend } from 'k6/metrics';
const dashboardLatency = new Trend('dashboard_latency', true);
export const options = {
scenarios: {
dashboard: {
executor: 'constant-arrival-rate',
rate: 500,
timeUnit: '1s',
duration: '3m',
preAllocatedVUs: 50,
maxVUs: 100,
},
},
};
export default function() {
const sensorId = `sensor-${String(Math.floor(Math.random() * 100)).padStart(5, '0')}`;
const res = http.get(`${__ENV.BASE_URL}/api/dashboard/sensor/${sensorId}`);
dashboardLatency.add(res.timings.duration);
}
Results:
| Condition | p50 | p95 | p99 |
|---|---|---|---|
| No concurrent COLLSCAN | 8ms | 15ms | 22ms |
| 1 concurrent COLLSCAN | 8ms | 18ms | 85ms |
| 3 concurrent COLLSCANs | 10ms | 45ms | 450ms |
| 5 concurrent COLLSCANs | 15ms | 180ms | 3,000ms |
Five collection scans (holding 5 read tickets each for 30-60 seconds) degrade dashboard p99 from 22ms to 3,000ms.
The Fix
Three defenses:
Defense 1: Set maxTimeMS on all queries.
// FAST: maxTimeMS prevents runaway queries
collection.find(Filters.eq("sensorId", sensorId))
.sort(Sorts.descending("ts"))
.limit(100)
.maxTime(5, TimeUnit.SECONDS); // Kill if exceeds 5 seconds
// For analytics queries, allow longer but bounded time
analyticsCollection.find(analyticsFilter)
.maxTime(30, TimeUnit.SECONDS);
maxTimeMS tells MongoDB to kill the operation if it exceeds the specified time. This releases the read ticket and prevents unbounded ticket holding.
Defense 2: Route analytics queries to a secondary.
// FAST: Analytics queries use secondary read preference
MongoCollection<Document> analyticsCollection = database
.getCollection("readings")
.withReadPreference(ReadPreference.secondaryPreferred());
// Dashboard queries use primary (default)
MongoCollection<Document> dashboardCollection = database
.getCollection("readings");
Analytics queries on secondaries do not consume tickets on the primary, protecting dashboard query performance. (Read preferences are covered in depth in CH19.)
Defense 3: Ensure indexes exist for common analytics patterns.
// Index for temperature range queries
collection.createIndex(
new Document("temperature", 1).append("ts", -1),
new IndexOptions()
.name("idx_temp_ts")
.background(true) // Build without blocking writes (deprecated in 4.2+, use default)
);
The Proof
After implementing all three defenses:
| Condition | p50 | p95 | p99 |
|---|---|---|---|
| No analytics queries | 8ms | 15ms | 22ms |
| Analytics on secondary | 8ms | 15ms | 25ms |
| Analytics on primary (with maxTimeMS) | 8ms | 18ms | 35ms |
The Trade-off
maxTimeMS kills queries that exceed the limit. The analyst’s query returns an error instead of results. This is intentional: protecting production workloads is more important than completing an ad-hoc query. Provide the analytics team with a dedicated analytics replica that has no ticket contention from production traffic.
Routing analytics to secondaries introduces replication lag. The secondary may be 0-10 seconds behind the primary (typical for a healthy replica set). Analytics queries on the secondary see slightly stale data. For most analytics use cases, seconds-old data is acceptable. For real-time analytics, keep the queries on the primary but ensure they are indexed.