SLOs, Error Budgets, and Escaping Alert Fatigue
The Symptom
The on-call engineer’s phone buzzed 14 times last night. Twelve of those alerts were “p99 latency > 500ms” on the rider API. Each spike lasted 5-15 seconds, coinciding with deployments rolling through Kubernetes pods. The other two alerts were “error rate > 1%” triggered by a single failed health check that Kubernetes retried successfully.
None of the 14 alerts required action. The engineer checked each one, confirmed it was transient, and went back to sleep. This has happened every night for three weeks.
Last Tuesday, a real incident happened during a surge event. The rider API’s p99 climbed to 1.2 seconds and stayed there for 45 minutes. The on-call engineer’s phone buzzed. They assumed it was another deployment blip and silenced it. The incident was detected 38 minutes later when riders started calling support.
Alert fatigue killed the alert. The signal drowned in noise.
The Cause
Threshold-based alerting fires when a metric crosses a line. “Alert when p99 > 500ms” is a threshold alert. It treats a 5-second spike during a rolling deployment the same as a 45-minute degradation during a surge event. Both cross the threshold. Both fire the same alert. One requires action. The other does not.
The problem is not the threshold value. The problem is that threshold alerts have no concept of duration, magnitude, or user impact. A 5-second spike that affects 10 requests is noise. A 45-minute degradation that affects 50,000 requests is a real incident.
SLO-based alerting with burn rates solves this by asking a different question. Instead of “is the metric above a line?” it asks “at the current rate of failure, will we exhaust our error budget before the end of the SLO window?” A 5-second spike does not consume meaningful error budget. A 45-minute degradation does. The alert fires for the second case, not the first.
The Baseline
SLIs for the Ride-Hailing Platform
SLIs (Service Level Indicators) are the raw measurements:
SLI          | Definition                                            | Measurement
Latency      | Proportion of requests < 500ms                        | http_server_requests_seconds_bucket{le="0.5"} / http_server_requests_seconds_count
Availability | Proportion of non-5xx responses                       | 1 - (5xx count / total count)
Correctness  | Proportion of fare calculations within expected range | fare_calculation_accurate_total / fare_calculation_total
SLOs
SLOs (Service Level Objectives) set targets:
SLO                    | Target         | Error Budget (30 days)
Rider API latency      | 99.9% < 500ms  | 0.1% = 43.2 minutes
Rider API availability | 99.95%         | 0.05% = 21.6 minutes
Fare correctness       | 99.99%         | 0.01% = 4.3 minutes
The error budget is the acceptable amount of failure. A 99.9% latency SLO means 0.1% of requests are allowed to exceed 500ms. Over 30 days at 100 RPS, that is 259,200 slow requests out of 259.2 million. Translated to time, 0.1% of the 43,200 minutes in 30 days is 43.2 minutes of total violation.
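The budget arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the function name error_budget is illustrative, and it assumes steady traffic at the 100 RPS figure used in the text.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440


def error_budget(slo_target, window_days=30, rps=100.0):
    """Return (allowed bad requests, allowed bad minutes) for one SLO window.

    Assumes a constant request rate, which real traffic will not have.
    """
    budget_fraction = 1.0 - slo_target
    total_requests = rps * SECONDS_PER_DAY * window_days
    total_minutes = MINUTES_PER_DAY * window_days
    return budget_fraction * total_requests, budget_fraction * total_minutes


for name, target in [
    ("Rider API latency", 0.999),
    ("Rider API availability", 0.9995),
    ("Fare correctness", 0.9999),
]:
    bad_requests, bad_minutes = error_budget(target)
    print(f"{name}: {bad_requests:,.0f} requests, {bad_minutes:.1f} minutes")
```

For the latency SLO this gives 259,200 requests and 43.2 minutes, matching the table. The request count scales with traffic; the minutes figure does not.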
Current Alerting
Alert           | Fires/Week | Actionable?
p99 > 500ms     | 12         | 2 (17%)
Error rate > 1% | 8          | 1 (13%)
CPU > 80%       | 5          | 0 (0%)
Memory > 70%    | 3          | 0 (0%)
Total           | 28         | 3 (11%)
89% of alerts are noise. The on-call engineer is paged 25 times per week for nothing.
Target: alerts that fire only when user experience is meaningfully degraded. Fewer than 3 false positives per week.
The Fix
Prometheus Recording Rules for SLIs
# SCALED: Prometheus recording rules for SLO tracking
groups:
  - name: slo_recording_rules
    interval: 30s
    rules:
      # Latency SLI: proportion of requests faster than 500ms
      - record: sli:rider_api:latency:success_rate5m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[5m]))
      - record: sli:rider_api:latency:success_rate30m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[30m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[30m]))
      - record: sli:rider_api:latency:success_rate1h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[1h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[1h]))
      - record: sli:rider_api:latency:success_rate6h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api", uri=~"/api/rides/.*", le="0.5"
          }[6h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", uri=~"/api/rides/.*"
          }[6h]))
      # Availability SLI: proportion of non-5xx responses
      - record: sli:rider_api:availability:success_rate5m
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api", uri=~"/api/rides/.*", status=~"5.."
            }[5m]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api", uri=~"/api/rides/.*"
            }[5m]))
          )
These rules pre-compute the latency SLI over 5m, 30m, 1h, and 6h windows and the availability SLI over 5m. They are evaluated every 30 seconds, so alerting rules and dashboards can read the stored series instead of re-running the expensive range queries themselves.
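To make the ratio concrete, here is a rough Python equivalent of what one latency rule computes for a single window. The counter samples are invented for illustration, and the sketch ignores counter resets and the extrapolation that Prometheus rate() performs.

```python
def latency_sli(fast_start, fast_end, total_start, total_end):
    """Share of requests in the window that finished within 500ms.

    fast_*  : cumulative le="0.5" bucket counter at window start and end
    total_* : cumulative request counter at window start and end
    Returns None when the window saw no traffic (Prometheus would yield NaN).
    """
    fast = fast_end - fast_start
    total = total_end - total_start
    if total <= 0:
        return None
    return fast / total


# Hypothetical 5-minute window: 30,000 requests, 29,940 of them under 500ms
sli_5m = latency_sli(fast_start=1_000_000, fast_end=1_029_940,
                     total_start=1_002_000, total_end=1_032_000)
print(f"SLI: {sli_5m:.4f}, error ratio: {1 - sli_5m:.4f}")
```

Because the numerator and denominator use the same window, the per-second division in rate() cancels out and the result is simply fast requests divided by all requests.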
The chart above illustrates how error budgets work in practice. Under normal operation, the budget depletes gradually at roughly 1x burn rate. At day 15, a surge event combined with connection pool exhaustion causes a 14x burn rate — the budget drops from 55% to near zero in just a few days. The horizontal threshold lines show where alerts fire: a 2x burn rate generates a ticket for investigation, while a 14x burn rate pages the on-call engineer immediately. Once the budget is exhausted, all non-essential deploys are frozen until reliability work restores the budget.
Burn Rate Alerting Rules
# SCALED: Burn rate alerting for rider API latency SLO (99.9%)
groups:
  - name: slo_alerts
    rules:
      # Fast burn: 14.4x burn rate over 1 hour, validated against 5 minutes
      # At 14.4x, the 30-day budget would be exhausted in ~50 hours
      # Consuming ~2% of the monthly budget per hour
      # Action: PAGE
      - alert: RiderAPILatencyBudgetFastBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate1h) > (14.4 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: rider-api-latency
          burn_rate: fast
        annotations:
          summary: "Rider API latency SLO fast burn: error budget consumption is critical"
          description: |
            Current error rate: {{ $value | humanizePercentage }}
            SLO target: 99.9%
            Burn rate: >14.4x (budget consumed in ~50 hours at this rate)
      # Slow burn: 1x burn rate over 6 hours, validated against 30 minutes
      # At a sustained 1x, the budget is exhausted exactly at the end of the 30-day window
      # Action: TICKET
      - alert: RiderAPILatencyBudgetSlowBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate6h) > (1 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate30m) > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          slo: rider-api-latency
          burn_rate: slow
        annotations:
          summary: "Rider API latency SLO slow burn: error budget being consumed steadily"
          description: |
            Current error rate: {{ $value | humanizePercentage }}
            SLO target: 99.9%
            Burn rate: >1x (budget on track to exhaust before the window ends)
The fast burn alert checks: “Is the 1-hour error rate 14.4 times the allowed rate, AND is the 5-minute rate also elevated?” Both conditions must be true. A 5-second spike elevates the 5-minute window but not the 1-hour window. It does not fire. A 45-minute degradation elevates both windows. It fires.
The slow burn alert checks: “Is the 6-hour error rate above the allowed rate, AND is the 30-minute rate also above it?” This catches gradual degradations that would exhaust the budget over days or weeks. Take a slow memory leak that adds 10ms of latency per hour until requests start crossing the 500ms threshold. Threshold alerting would not fire until the leak is severe. Burn rate alerting files a ticket when the trend becomes dangerous.
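The two-window check can be sketched in Python to see why a short spike passes one condition but not both. The helper names and the event shapes below are assumptions for illustration (a 5-second spike where every request is slow, and a 45-minute degradation where 5% of requests are slow); the sketch also ignores the for: clause and assumes constant traffic.

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests may be slow


def window_error_ratio(bad_seconds, bad_fraction, window_seconds):
    """Error ratio seen by a window that ends just as the bad period ends.

    Assumes constant traffic and no slow requests outside the bad period.
    """
    overlap = min(bad_seconds, window_seconds)
    return overlap * bad_fraction / window_seconds


def fires(long_ratio, short_ratio, burn_factor):
    """Both windows must exceed burn_factor times the error budget."""
    threshold = burn_factor * ERROR_BUDGET
    return long_ratio > threshold and short_ratio > threshold


# Hypothetical events, not measurements from the platform
spike = dict(bad_seconds=5, bad_fraction=1.0)
degradation = dict(bad_seconds=45 * 60, bad_fraction=0.05)

for name, event in [("spike", spike), ("degradation", degradation)]:
    one_hour = window_error_ratio(window_seconds=3600, **event)
    five_min = window_error_ratio(window_seconds=300, **event)
    print(name, f"1h={one_hour:.4%}", f"5m={five_min:.4%}",
          "fast burn fires:", fires(one_hour, five_min, 14.4))
```

Under these assumptions the spike pushes the 5-minute ratio above 1.44% but leaves the 1-hour ratio near 0.14%, so the condition is false. The degradation holds both windows above 1.44%, so it is true.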
Grafana SLO Dashboard
{
  "panels": [
    {
      "title": "Error Budget Remaining (Latency SLO 99.9%)",
      "type": "gauge",
      "targets": [
        {
          "expr": "1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001)",
          "legendFormat": "Budget Remaining"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              { "color": "red", "value": 0 },
              { "color": "yellow", "value": 0.25 },
              { "color": "green", "value": 0.5 }
            ]
          },
          "unit": "percentunit",
          "max": 1,
          "min": 0
        }
      }
    },
    {
      "title": "Burn Rate (1h window)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "(1 - sli:rider_api:latency:success_rate1h) / 0.001",
          "legendFormat": "Burn Rate"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": {
            "thresholdsStyle": { "mode": "line+area" }
          },
          "thresholds": {
            "steps": [
              { "color": "transparent", "value": 0 },
              { "color": "yellow", "value": 1 },
              { "color": "red", "value": 14.4 }
            ]
          }
        }
      }
    }
  ]
}
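The gauge shows remaining error budget as a percentage: green above 50%, yellow between 25% and 50%, red below 25%. Its query reads sli:rider_api:latency:success_rate30d, a 30-day series that is not among the recording rules listed earlier, so it needs one more rule of the same shape with a [30d] range. The burn rate chart shows the current consumption rate with threshold lines at 1x (exact budget pace) and 14.4x (fast burn page threshold).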
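The two panel expressions are simple enough to check by hand. The Python below mirrors them as a sketch; the function names are illustrative, and the sample SLI values are made up rather than taken from the platform.

```python
def budget_remaining(sli_30d, slo_target=0.999):
    """Fraction of the error budget left, clamped to the gauge's 0..1 range."""
    consumed = (1 - sli_30d) / (1 - slo_target)
    return max(0.0, min(1.0, 1 - consumed))


def burn_rate(sli_window, slo_target=0.999):
    """How many times faster than budget pace errors are arriving."""
    return (1 - sli_window) / (1 - slo_target)


def hours_to_exhaust(rate, window_days=30):
    """Time for a full budget to run out at a constant burn rate."""
    return window_days * 24 / rate


# Hypothetical readings
print(f"30-day SLI 99.97% -> {budget_remaining(0.9997):.0%} of budget left")
print(f"1-hour SLI 98.00% -> burn rate {burn_rate(0.98):.1f}x")
print(f"14.4x sustained   -> budget gone in {hours_to_exhaust(14.4):.0f} hours")
```

A 30-day SLI of 99.97% has used 0.03 of the 0.1 percentage points allowed, which leaves about 70% of the budget. A 30-day SLI below the 99.9% target clamps to zero.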
The Alert That Should Not Have Paged
Tuesday, 3:17 AM. Kubernetes rolls out a new version of the rider API. Rolling deployment: old pods drain connections, new pods warm up. For 12 seconds, 30% of requests hit pods that are starting up. Cold JIT compilation. Cold connection pools. p99 spikes to 1.8 seconds.
Threshold alert: p99 > 500ms fires immediately. On-call is paged.
Burn rate analysis: 12 seconds of elevated latency at 100 RPS affects ~360 requests. Error budget for the month is 259,200 requests. This event consumed 0.14% of the budget. The 1-hour error rate barely moves. The fast-burn alert does not fire. The slow-burn alert does not fire. The on-call engineer sleeps through the deployment.
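The same accounting can be written out in Python. This sketch uses the figures from the text (12 seconds, 30% of requests slow, 100 RPS) and assumes every other request in each window was fast; the variable names are illustrative.

```python
RPS = 100                 # steady request rate assumed in the text
BUDGET_FRACTION = 0.001   # 99.9% latency SLO
MONTHLY_BUDGET = 259_200  # slow requests allowed over 30 days at 100 RPS

# The 3:17 AM rollout: 12 seconds with 30% of requests hitting cold pods
slow_requests = round(12 * RPS * 0.30)
budget_share = slow_requests / MONTHLY_BUDGET


def error_ratio(slow, window_seconds, rps=RPS):
    """Error ratio over a window, assuming every other request was fast."""
    return slow / (window_seconds * rps)


windows = {"5m": 300, "30m": 1800, "1h": 3600, "6h": 21600}
ratios = {name: error_ratio(slow_requests, secs) for name, secs in windows.items()}

# Same conditions as the two alerting rules, without the for: clause
fast_burn = (ratios["1h"] > 14.4 * BUDGET_FRACTION
             and ratios["5m"] > 14.4 * BUDGET_FRACTION)
slow_burn = (ratios["6h"] > 1 * BUDGET_FRACTION
             and ratios["30m"] > 1 * BUDGET_FRACTION)

print(f"slow requests: {slow_requests}, budget used: {budget_share:.2%}")
for name, ratio in ratios.items():
    print(f"{name} error ratio: {ratio:.4%}")
print("fast burn fires:", fast_burn, "| slow burn fires:", slow_burn)
```

The 30-minute ratio does rise above 0.1% here, but the 6-hour ratio stays far below it, so the slow-burn condition is false. Requiring both windows is what keeps the rollout from opening a ticket.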
The Proof
Locust: Simulating an SLO Violation
# SCALED: Locust simulating a sustained SLO violation
import random
import time

from locust import HttpUser, task, between


class SLOViolationUser(HttpUser):
    wait_time = between(0.1, 0.5)
    # Class-level timestamp shared by all simulated users in this process,
    # so the delay starts at the same moment for everyone
    start_time = None

    def on_start(self):
        if SLOViolationUser.start_time is None:
            SLOViolationUser.start_time = time.time()

    @task
    def request_ride(self):
        elapsed = time.time() - SLOViolationUser.start_time
        params = {
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
        }
        # After 5 minutes, add artificial delay to 2% of requests
        # 2% error rate vs 0.1% budget = 20x burn rate
        # That is above the fast-burn threshold (14.4x); the alert fires once
        # both the 5m and 1h windows have exceeded 1.44% for 2 minutes
        if elapsed > 300 and random.random() < 0.02:
            params["simulate_delay_ms"] = 2000
        self.client.get(
            "/api/rides/fare-estimate",
            params=params,
            name="/api/rides/fare-estimate",
        )
Run for 30 minutes with 200 users:
locust -f slo_violation.py --users 200 --spawn-rate 50 --run-time 30m --headless
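Expected timeline, worked out from the alerting rules:
Minutes 0-5: Normal traffic. Burn rate near 0.
Minute 5 onward: 2% of requests exceed 500ms. Burn rate is ~20x (2% error rate vs 0.1% budget). The script keeps injecting the delay until the run ends.
About minute 9: The 5-minute error ratio passes 1.44%, which takes 3.6 minutes of 2% errors (0.0144 / 0.02 × 5 minutes).
About minute 18: The 1-hour error ratio passes 1.44%, assuming Prometheus holds no rider API traffic from before the test, so the ratio at minute t is roughly 0.02 × (t - 5) / t.
About minute 20: Fast-burn alert fires (both windows above the 14.4x threshold, sustained for 2 minutes).
If the 1-hour window already holds a full hour of healthy traffic at the same request rate, it needs about 43 minutes of 2% errors (0.0144 / 0.02 × 60 minutes) to cross, so the run time has to be extended for the alert to fire.
Budget consumed: 25 minutes at a 20x burn rate. Measured against 30 days of steady traffic at the same request rate, that is 20 × 25 / 43,200, roughly 1.2% of the budget, so a gauge backed by a full 30-day history would read about 98.8% remaining.
The threshold alert would have fired and stayed firing for as long as the delay was injected, generating continuous noise. The burn-rate alert fires once, with a meaningful severity and a clear description: “error budget consumption is critical.” One actionable alert vs. continuous noise.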