Percentile Mechanics and the Coordinated Omission Problem
The Symptom
The telemetry platform’s Grafana dashboard shows a single number for the sensor ingestion endpoint: average latency 67ms. The line is steady. Green across the board.
The IoT operations team reports that 3% of sensor readings arrive with stale timestamps, which indicates that the sensor retried after a timeout. The dashboard says 67ms. The sensors say “too slow.” Both are correct.
The Cause
Average latency is an arithmetic mean. When 95% of requests complete in 12ms and 1% take 3,800ms, with the remaining 4% falling in between, the average lands at 67ms, a number that describes neither the fast majority nor the slow tail. The distribution is bimodal because the ingestion endpoint has two code paths.
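// SLOW: Two paths with vastly different latencies
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import java.util.List;
import org.bson.Document;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// TelemetryReading is assumed to be a record with the accessors used below.
@RestController
public class TelemetryController {

    private final MongoCollection<Document> telemetryCollection;
    private final MongoCollection<Document> bucketCollection;

    // Constructor injection so the final fields are initialized.
    // Assumes two MongoCollection beans named after these parameters.
    public TelemetryController(MongoCollection<Document> telemetryCollection,
                               MongoCollection<Document> bucketCollection) {
        this.telemetryCollection = telemetryCollection;
        this.bucketCollection = bucketCollection;
    }

    @PostMapping("/api/telemetry/ingest")
    public ResponseEntity<Void> ingest(@RequestBody TelemetryReading reading) {
        Document doc = new Document()
            .append("sensorId", reading.sensorId())
            .append("ts", reading.timestamp())
            .append("temp", reading.temperature())
            .append("humidity", reading.humidity())
            .append("pressure", reading.pressure());

        // Path A: Bucket exists, $push into array → 8-15ms
        // Path B: New bucket, insert + index update + possible
        //         WiredTiger cache eviction → 400-3800ms
        UpdateResult result = bucketCollection.updateOne(
            Filters.and(
                Filters.eq("sensorId", reading.sensorId()),
                Filters.lt("count", 60)
            ),
            Updates.combine(
                Updates.push("readings", doc),
                Updates.inc("count", 1),
                Updates.min("startTs", reading.timestamp()),
                Updates.max("endTs", reading.timestamp())
            )
        );

        if (result.getMatchedCount() == 0) {
            // No open bucket: create new one (the slow path)
            Document bucket = new Document()
                .append("sensorId", reading.sensorId())
                .append("count", 1)
                .append("startTs", reading.timestamp())
                .append("endTs", reading.timestamp())
                .append("readings", List.of(doc));
            bucketCollection.insertOne(bucket);
        }
        return ResponseEntity.status(201).build();
    }
}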
Path A updates an existing bucket document with $push. The document is likely in the WiredTiger cache. The index entry already exists. Total time: 8-15ms.
Path B creates a new bucket. This triggers a new index entry insertion, potentially a B-tree page split, and if the WiredTiger cache is under pressure, an eviction before the new page can be loaded. Under high write load, connection pool contention adds wait time on top. Total time: 400-3,800ms.
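The 67ms average quoted at the start of this section can be checked with a weighted mean. The sketch below works per 100 requests. The 12ms and 3,800ms figures come from the text; the 440ms value for the remaining 4% is an assumption, chosen only so that the mix reproduces the reported average.

```java
// Weighted mean of a latency mix, computed per 100 requests.
final class MeanVsTail {

    // Sum of (request count * latency) divided by the total request count.
    // Integer division is fine here because the example divides evenly.
    static long meanMillis(long[] counts, long[] latenciesMillis) {
        long weightedSum = 0;
        long total = 0;
        for (int i = 0; i < counts.length; i++) {
            weightedSum += counts[i] * latenciesMillis[i];
            total += counts[i];
        }
        return weightedSum / total;
    }

    public static void main(String[] args) {
        long[] counts = {95, 4, 1};           // requests per 100
        long[] latencies = {12, 440, 3800};   // 440 is an assumed value
        System.out.println("mean = " + meanMillis(counts, latencies) + "ms");
    }
}
```

No request in this mix takes 67ms. The single 3,800ms request contributes 38ms to the mean, more than the 95 fast requests combined.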
The Benchmark
Prometheus histogram buckets capture the distribution:
# Latency distribution buckets for telemetry ingestion
histogram_quantile(0.50, sum(rate(http_server_requests_seconds_bucket{uri="/api/telemetry/ingest"}[5m])) by (le))
# Result: 0.012 (12ms)
histogram_quantile(0.90, sum(rate(http_server_requests_seconds_bucket{uri="/api/telemetry/ingest"}[5m])) by (le))
# Result: 0.085 (85ms)
histogram_quantile(0.95, sum(rate(http_server_requests_seconds_bucket{uri="/api/telemetry/ingest"}[5m])) by (le))
# Result: 0.420 (420ms)
histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{uri="/api/telemetry/ingest"}[5m])) by (le))
# Result: 3.800 (3,800ms)
The jump from p90 (85ms) to p95 (420ms) marks the boundary between cache hits and cache misses. The jump from p95 to p99 (3,800ms) marks the boundary between “cache miss with available connections” and “cache miss plus connection pool exhaustion plus checkpoint stall.”
At 60,000 ingestion requests per minute, p99 = 3,800ms means 600 requests per minute experience near-timeout latency. For sensors with a 5-second retry window, that 600 is the leading indicator of data loss.
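It helps to know how these quantiles are produced. histogram_quantile never sees individual requests; it finds the bucket that contains the requested rank and interpolates linearly inside it. The following is a simplified Java sketch of that estimation, not Prometheus's own code. The bucket counts in main are hypothetical, and invalid quantiles return NaN here for brevity.

```java
// Simplified sketch of quantile estimation from cumulative histogram buckets.
final class HistogramQuantile {

    // upperBounds: ascending "le" boundaries, the last one +Infinity (at least two buckets).
    // cumulativeCounts[i]: number of observations <= upperBounds[i].
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        int last = upperBounds.length - 1;
        long total = cumulativeCounts[last];
        if (total == 0 || q <= 0.0 || q > 1.0) {
            return Double.NaN;
        }
        double rank = q * total;

        // First bucket whose cumulative count reaches the rank.
        int b = 0;
        while (b < last && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == last) {
            // Rank falls in the +Inf bucket: report the highest finite boundary.
            return upperBounds[last - 1];
        }

        // The lowest bucket is assumed to start at zero.
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        long below = (b == 0) ? 0L : cumulativeCounts[b - 1];
        long inBucket = cumulativeCounts[b] - below;

        // Linear interpolation between the bucket's lower and upper boundary.
        return lower + (upperBounds[b] - lower) * (rank - below) / inBucket;
    }

    public static void main(String[] args) {
        double[] bounds = {0.05, 0.1, 0.5, 1.0, 5.0, Double.POSITIVE_INFINITY};
        long[] counts = {880, 905, 955, 970, 1000, 1000};   // hypothetical counts
        System.out.printf("p50 ~ %.3fs%n", estimate(0.50, bounds, counts));
        System.out.printf("p99 ~ %.3fs%n", estimate(0.99, bounds, counts));
    }
}
```

With these coarse boundaries the p50 estimate comes out near 28ms, because all the function knows is that 880 requests finished somewhere between 0 and 50ms. Bucket boundaries set the resolution of every percentile derived from them.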
The Fix
Track percentiles at every layer. Configure Micrometer in the Spring Boot application to emit percentile histograms:
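// FAST: Micrometer configuration for percentile tracking
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import java.time.Duration;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> metricsCustomizer() {
        return registry -> registry.config()
            .meterFilter(new MeterFilter() {
                @Override
                public DistributionStatisticConfig configure(
                        Meter.Id id,
                        DistributionStatisticConfig config) {
                    if (id.getName().startsWith("http.server.requests")) {
                        return DistributionStatisticConfig.builder()
                            .percentiles(0.5, 0.95, 0.99)
                            .percentilesHistogram(true)
                            // Timer SLO boundaries are given in nanoseconds,
                            // so pass toNanos() without converting to seconds.
                            .serviceLevelObjectives(
                                Duration.ofMillis(50).toNanos(),
                                Duration.ofMillis(100).toNanos(),
                                Duration.ofMillis(500).toNanos(),
                                Duration.ofSeconds(1).toNanos(),
                                Duration.ofSeconds(5).toNanos()
                            )
                            .build()
                            .merge(config);
                    }
                    return config;
                }
            });
    }
}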
The SLO buckets (50ms, 100ms, 500ms, 1s, 5s) let Prometheus calculate the percentage of requests meeting each threshold. The Grafana panel that matters: “percentage of ingestion requests completing under 500ms.” When that dips below 99%, something changed.
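The "percentage under 500ms" panel is a ratio of two counters over the same window: the increase in the 500ms bucket divided by the increase in the total request count. A minimal Java sketch of that arithmetic follows. The class name and the request counts are hypothetical, and the 99% target is the one named above.

```java
// Sketch of the SLO ratio behind a "percentage under threshold" panel.
final class SloCompliance {

    // Fraction of requests in one window that finished at or under the threshold.
    static double fractionWithin(long bucketCountDelta, long totalCountDelta) {
        if (totalCountDelta == 0) {
            return 1.0; // no traffic in the window, so nothing missed the threshold
        }
        return (double) bucketCountDelta / totalCountDelta;
    }

    // True when the measured fraction falls below the target.
    static boolean breaches(double fraction, double target) {
        return fraction < target;
    }

    public static void main(String[] args) {
        // Hypothetical one-minute window: 60,000 requests, 57,300 of them under 500ms.
        double fraction = fractionWithin(57_300, 60_000);
        System.out.println("under 500ms: " + fraction);
        System.out.println("breaches 99% target: " + breaches(fraction, 0.99));
    }
}
```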
The Proof
| Metric | Before (avg only) | After (percentiles) |
|---|---|---|
| Dashboard value | 67ms (avg) | p50: 12ms, p95: 420ms, p99: 3,800ms |
| Alerts triggered | None | p99 > 2,000ms fires during write bursts |
| Time to detect | Support ticket (hours) | 15 seconds (Prometheus scrape interval) |
The system’s behavior did not change. The team’s ability to see the problem changed. That is the first step.
The Trade-off
Percentile histograms consume more memory in Prometheus. Each metric with percentile tracking generates approximately 20 time series per endpoint (one per bucket boundary). For a service with 15 endpoints, that is 300 additional time series. At a 15-second scrape interval, Prometheus ingests approximately 1,200 additional samples per minute. The storage cost is negligible compared to the cost of a missed tail latency spike.
Coordinated Omission
Many load test tools share a subtle flaw. When a request takes 5 seconds, the tool’s simulated user waits those 5 seconds before sending the next request. The 5-second response is recorded, but the requests that would have arrived during those 5 seconds are never sent, so their delays are never measured. The load test coordinates with the slow system by backing off when it should be piling on.
Real users do not coordinate. When the activity feed takes 5 seconds to load, the user refreshes the page. The system receives more load precisely when it is struggling.
Gil Tene named this problem “coordinated omission.” It means naive load test results undercount tail latency. k6 mitigates this with the constant-arrival-rate executor, which maintains a fixed request rate regardless of response times. Use it for every test in this book.
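// FAST: constant-arrival-rate prevents coordinated omission
import http from 'k6/http';

export const options = {
  scenarios: {
    ingestion: {
      executor: 'constant-arrival-rate', // not 'per-vu-iterations'
      rate: 1000,                        // iterations started per timeUnit
      timeUnit: '1s',
      duration: '5m',
      preAllocatedVUs: 200,
      maxVUs: 500,
    },
  },
};

// Assumed target: the host, port, and payload values are placeholders.
export default function () {
  const reading = { sensorId: 'sensor-1', timestamp: Date.now(), temperature: 21.5, humidity: 40, pressure: 1013 };
  http.post('http://localhost:8080/api/telemetry/ingest', JSON.stringify(reading), {
    headers: { 'Content-Type': 'application/json' },
  });
}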
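The constant-arrival-rate executor starts iterations at the target rate and allocates additional virtual users, up to maxVUs, when slow responses tie up the existing ones. This models real traffic far more accurately: the arrival rate stays constant regardless of how fast the system responds. If all maxVUs are busy, k6 drops the iteration and counts it in the dropped_iterations metric, so a nonzero value there means the test did not sustain the target rate.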
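To see how much coordinated omission can hide, a toy simulation is enough. The sketch below replays one timeline and measures it two ways: from the moment each request was actually sent (what a closed-loop tool records) and from the moment it was scheduled to be sent (what an arriving client would experience). All numbers are hypothetical: a single serial connection, one request scheduled every 10ms, 1ms service time, and one 5-second stall.

```java
import java.util.Arrays;

// Toy model of coordinated omission: one timeline, two ways of measuring it.
final class CoordinatedOmissionDemo {

    // Nearest-rank percentile over a sorted array; pct is a whole number such as 99.
    static long percentile(long[] sorted, int pct) {
        int rank = (pct * sorted.length + 99) / 100;   // ceil(pct / 100 * n)
        return sorted[Math.max(rank, 1) - 1];
    }

    // Returns two sorted arrays: index 0 = naive latencies, index 1 = corrected latencies.
    static long[][] simulate(int requests, long intervalMs, long serviceMs,
                             int stallIndex, long stallMs) {
        long[] naive = new long[requests];
        long[] corrected = new long[requests];
        long previousCompletion = 0;
        for (int i = 0; i < requests; i++) {
            long intendedStart = i * intervalMs;
            // A closed-loop client cannot send until the previous response is back.
            long actualStart = Math.max(intendedStart, previousCompletion);
            long service = (i == stallIndex) ? stallMs : serviceMs;
            long completion = actualStart + service;
            naive[i] = completion - actualStart;        // measured from the actual send
            corrected[i] = completion - intendedStart;  // measured from the schedule
            previousCompletion = completion;
        }
        Arrays.sort(naive);
        Arrays.sort(corrected);
        return new long[][] {naive, corrected};
    }

    public static void main(String[] args) {
        long[][] result = simulate(2000, 10, 1, 500, 5000);
        System.out.println("naive p99     = " + percentile(result[0], 99) + "ms");
        System.out.println("corrected p99 = " + percentile(result[1], 99) + "ms");
    }
}
```

In this run the naive measurement reports a p99 of 1ms, because the stall appears as a single slow sample out of 2,000. Measured against the schedule, the p99 is 4,820ms: every request that was due during the stall, and during the backlog after it, inherits the delay.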