Diagnosing Ticket Starvation and Long-Running Operations

The Symptom

The telemetry dashboard’s p99 latency jumps from 20ms to 3,000ms intermittently. The spikes do not correlate with checkpoint intervals (CH13). They correlate with the analytics team’s ad-hoc query sessions. When an analyst runs an unindexed query against the readings collection, dashboard queries slow down.

The Cause

The analyst’s query performs a collection scan:

// Analyst query: no index, full collection scan
db.readings.find({ temperature: { $gt: 50 } }).sort({ ts: -1 })

There is no index on {temperature: 1}. This query performs a COLLSCAN on 200 million documents, holding a read ticket for 30-60 seconds. While this ticket is held, one fewer ticket is available for dashboard queries. If multiple analysts run similar queries, ticket availability drops rapidly.

// Find long-running operations
db.currentOp({ 
  "active": true,
  "secs_running": { "$gt": 5 },
  "op": { "$in": ["query", "getmore"] }
}).inprog.forEach(op => {
  print(`OpId: ${op.opid}, Running: ${op.secs_running}s, ` +
        `Namespace: ${op.ns}, Plan: ${JSON.stringify(op.planSummary)}`);
});

Output:

OpId: 12345, Running: 42s, Namespace: telemetry.readings, Plan: ["COLLSCAN"]
OpId: 12346, Running: 18s, Namespace: telemetry.readings, Plan: ["COLLSCAN"]

Two collection scans running for 42 and 18 seconds, each holding a read ticket.

The Benchmark

Measure the impact of a single long-running operation on short queries:

// k6 test: steady dashboard load with/without concurrent COLLSCAN
import http from 'k6/http';
import { Trend } from 'k6/metrics';

const dashboardLatency = new Trend('dashboard_latency', true);

export const options = {
  scenarios: {
    dashboard: {
      executor: 'constant-arrival-rate',
      rate: 500,
      timeUnit: '1s',
      duration: '3m',
      preAllocatedVUs: 50,
      maxVUs: 100,
    },
  },
};

export default function() {
  const sensorId = `sensor-${String(Math.floor(Math.random() * 100)).padStart(5, '0')}`;
  const res = http.get(`${__ENV.BASE_URL}/api/dashboard/sensor/${sensorId}`);
  dashboardLatency.add(res.timings.duration);
}

Results:

Condition	p50	p95	p99
No concurrent COLLSCAN	8ms	15ms	22ms
1 concurrent COLLSCAN	8ms	18ms	85ms
3 concurrent COLLSCANs	10ms	45ms	450ms
5 concurrent COLLSCANs	15ms	180ms	3,000ms

Five collection scans (holding 5 read tickets each for 30-60 seconds) degrade dashboard p99 from 22ms to 3,000ms.

The Fix

Three defenses:

Defense 1: Set maxTimeMS on all queries.

// FAST: maxTimeMS prevents runaway queries
collection.find(Filters.eq("sensorId", sensorId))
    .sort(Sorts.descending("ts"))
    .limit(100)
    .maxTime(5, TimeUnit.SECONDS);    // Kill if exceeds 5 seconds

// For analytics queries, allow longer but bounded time
analyticsCollection.find(analyticsFilter)
    .maxTime(30, TimeUnit.SECONDS);

maxTimeMS tells MongoDB to kill the operation if it exceeds the specified time. This releases the read ticket and prevents unbounded ticket holding.

Defense 2: Route analytics queries to a secondary.

// FAST: Analytics queries use secondary read preference
MongoCollection<Document> analyticsCollection = database
    .getCollection("readings")
    .withReadPreference(ReadPreference.secondaryPreferred());

// Dashboard queries use primary (default)
MongoCollection<Document> dashboardCollection = database
    .getCollection("readings");

Analytics queries on secondaries do not consume tickets on the primary, protecting dashboard query performance. (Read preferences are covered in depth in CH19.)

Defense 3: Ensure indexes exist for common analytics patterns.

// Index for temperature range queries
collection.createIndex(
    new Document("temperature", 1).append("ts", -1),
    new IndexOptions()
        .name("idx_temp_ts")
        .background(true)    // Build without blocking writes (deprecated in 4.2+, use default)
);

The Proof

After implementing all three defenses:

Condition	p50	p95	p99
No analytics queries	8ms	15ms	22ms
Analytics on secondary	8ms	15ms	25ms
Analytics on primary (with maxTimeMS)	8ms	18ms	35ms

The Trade-off

maxTimeMS kills queries that exceed the limit. The analyst’s query returns an error instead of results. This is intentional: protecting production workloads is more important than completing an ad-hoc query. Provide the analytics team with a dedicated analytics replica that has no ticket contention from production traffic.

Routing analytics to secondaries introduces replication lag. The secondary may be 0-10 seconds behind the primary (typical for a healthy replica set). Analytics queries on the secondary see slightly stale data. For most analytics use cases, seconds-old data is acceptable. For real-time analytics, keep the queries on the primary but ensure they are indexed.