Burn Rate Alerting and Escaping Alert Fatigue
The Symptom
The on-call rotation is a punishment. Engineers dread their turn. Three engineers have requested transfers out of the team in the last six months, citing “unsustainable on-call burden.” The PagerDuty statistics tell the story:
| Month | Pages | Actionable | False Positive Rate |
| --- | --- | --- | --- |
| January | 127 | 11 | 91.3% |
| February | 143 | 8 | 94.4% |
| March | 118 | 14 | 88.1% |
91% false positive rate. For every real incident, the on-call engineer is woken up 10 times for nothing. The median time to acknowledge an alert: 14 minutes in January, 22 minutes in February, 31 minutes in March. Response time is increasing because trust in the alerting system is collapsing.
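The false positive rates follow directly from the page counts. A minimal sketch of the arithmetic, using the numbers from the table (the function name is illustrative, not part of any tooling):

```python
def false_positive_rate(pages: int, actionable: int) -> float:
    """Fraction of pages that did not correspond to a real incident."""
    return (pages - actionable) / pages


# (pages, actionable) per month, copied from the PagerDuty table.
months = {"January": (127, 11), "February": (143, 8), "March": (118, 14)}

for name, (pages, actionable) in months.items():
    print(f"{name}: {false_positive_rate(pages, actionable):.1%}")

# Aggregate over the quarter rather than averaging the monthly percentages.
total_pages = sum(pages for pages, _ in months.values())
total_actionable = sum(actionable for _, actionable in months.values())
quarter_rate = false_positive_rate(total_pages, total_actionable)
print(f"Quarter: {quarter_rate:.1%}")
```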
The Cause
Every alert in the current system is a threshold alert:
# BOTTLENECK: Threshold alerting rules
groups:
  - name: rider_api_alerts
    rules:
      - alert: RiderAPIHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
              service="rider-api"
            }[5m])) by (le)
          ) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rider API p99 latency > 500ms"
      - alert: RiderAPIHighErrorRate
        expr: |
          sum(rate(http_server_requests_seconds_count{
            service="rider-api", status=~"5.."
          }[5m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api"
          }[5m]))
          > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Rider API error rate > 1%"
These rules fire during:
- Rolling deployments (new pods warming up, 5-15 seconds of elevated latency)
- Garbage collection pauses (G1 mixed collections, 200-500ms pauses every 30 minutes)
- Kubernetes node rebalancing (pod migration, brief connection drops)
- Network blips (cloud provider maintenance, 2-3 second packet loss)
- Single failed health checks (retry succeeds, no user impact)
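Each event crosses the threshold for 1-5 minutes, because the 5-minute rate() window keeps a brief spike in the calculation long after the spike itself has ended. The for: 1m clause is too short to filter these out. Increasing it to for: 10m would filter the noise but also delay the page for a real incident by 10 minutes.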
Threshold alerting cannot distinguish between “brief transient spike” and “sustained degradation.” Both cross the threshold. The only difference is duration and impact.
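To see why a blip of a few seconds can hold a threshold alert open for minutes, here is a rough simulation of a 5-minute trailing error ratio. The traffic shape (100 requests per second, a 15-second blip in which half the requests fail) is invented for illustration, and the rolling window only approximates what rate(...[5m]) computes.

```python
from collections import deque


def rolling_error_ratio(errors_per_sec, requests_per_sec, window_s=300):
    """Yield the error ratio over a trailing window, one value per second."""
    err_window, req_window = deque(), deque()
    err_sum = req_sum = 0
    for errs, reqs in zip(errors_per_sec, requests_per_sec):
        err_window.append(errs)
        req_window.append(reqs)
        err_sum += errs
        req_sum += reqs
        if len(err_window) > window_s:
            # Drop the sample that just fell out of the window.
            err_sum -= err_window.popleft()
            req_sum -= req_window.popleft()
        yield err_sum / req_sum if req_sum else 0.0


# Hypothetical traffic: 20 minutes at 100 req/s, with a 15-second blip
# starting at t=600s during which half of the requests fail.
duration_s = 1200
requests = [100] * duration_s
errors = [50 if 600 <= t < 615 else 0 for t in range(duration_s)]

ratios = list(rolling_error_ratio(errors, requests))
seconds_over_threshold = sum(1 for r in ratios if r > 0.01)
print(f"15s blip kept the 5m error ratio above 1% for {seconds_over_threshold}s")
```

In this made-up scenario the ratio stays above the 1% threshold for roughly five minutes, which comfortably satisfies for: 1m even though users were affected for 15 seconds.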
The Baseline
| Alert Type | Pros | Cons |
| --- | --- | --- |
| Threshold | Simple to write | Cannot distinguish transient from sustained |
| | Easy to understand | No concept of error budget |
| | Fires fast | Fires on every deployment |
| | | False positive rate > 80% |
| Burn Rate | Budget-aware | Requires SLO definition |
| | Duration-sensitive | More complex PromQL |
| | Severity-tiered | Requires recording rules |
| | False positive rate < 5% | |
The Fix
Burn Rate: The Concept
Burn rate measures how fast the error budget is being consumed:
Burn Rate = Actual Error Rate / Allowed Error Rate
For a 99.9% SLO (0.1% allowed error rate):
| Scenario | Error Rate | Burn Rate | Time to Exhaust 30-Day Budget |
| --- | --- | --- | --- |
| Normal operation | 0.02% | 0.2x | 150 days (well within budget) |
| Rolling deployment spike | 0.5% | 5x | 6 days |
| Moderate degradation | 1.0% | 10x | 3 days |
| Severe incident | 1.44% | 14.4x | ~50 hours |
| Total outage | 100% | 1000x | ~43 minutes |
A burn rate of 1x means the budget will be exactly exhausted at the end of the 30-day window. A burn rate of 14.4x means the budget will be gone in roughly 50 hours of sustained failure. The alerting window determines how quickly you detect it.
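The table can be reproduced with two small functions. This is a sketch under the assumptions stated in the text (99.9% SLO, 30-day window, constant error rate, full budget at the start); the function and constant names are illustrative.

```python
SLO_TARGET = 0.999        # 99.9% SLO from the text
WINDOW_HOURS = 30 * 24    # 30-day budget window


def burn_rate(error_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Actual error rate divided by the allowed error rate."""
    return error_rate / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else window_hours / rate


scenarios = {
    "Normal operation": 0.0002,
    "Rolling deployment spike": 0.005,
    "Moderate degradation": 0.01,
    "Severe incident": 0.0144,
    "Total outage": 1.0,
}

for name, error_rate in scenarios.items():
    rate = burn_rate(error_rate)
    hours = hours_to_exhaustion(rate)
    print(f"{name}: {rate:.1f}x, budget gone in {hours:.2f} h ({hours / 24:.2f} days)")
```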
Multi-Window Alerting
The key insight: pair a long window with a short window. The long window ensures the problem is significant (not a blip). The short window ensures the problem is still happening (not already resolved).
| Alert Tier | Burn Rate | Long Window | Short Window | Action | Detects |
| --- | --- | --- | --- | --- | --- |
| Fast burn | 14.4x | 1 hour | 5 minutes | PagerDuty | Severe incidents |
| Slow burn | 1x | 3 days | 6 hours | Slack ticket | Gradual degradation |
Fast burn: “Have we been burning budget at 14.4x for the last hour, AND is it still happening in the last 5 minutes?” This catches severe incidents within minutes while ignoring 12-second deployment blips.
Slow burn: “Have we been burning budget at 1x for the last 3 days, AND is it still happening in the last 6 hours?” This catches slow degradation that threshold alerting would miss entirely. A memory leak that adds 5ms per hour. A slow connection pool exhaustion. A gradual increase in backend latency from a growing table.
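The two-window check and the origin of the 14.4x figure can be sketched in a few lines. The assumption here is that the fast-burn tier is meant to catch roughly 2% of the 30-day budget being spent within one hour; the function names and the sample error rates are hypothetical.

```python
def burn_rate_threshold(budget_fraction, window_hours, slo_period_hours=30 * 24):
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours


def multi_window_fires(long_error_rate, short_error_rate, burn_rate, error_budget=0.001):
    """True only if BOTH windows exceed the burn-rate threshold."""
    threshold = burn_rate * error_budget
    return long_error_rate > threshold and short_error_rate > threshold


fast = burn_rate_threshold(0.02, window_hours=1)    # about 14.4
slow = burn_rate_threshold(0.10, window_hours=72)   # about 1.0

# Deployment blip: the 5m window spikes, the 1h window barely moves.
blip = multi_window_fires(long_error_rate=0.002, short_error_rate=0.02, burn_rate=14.4)

# Sustained incident: both windows are well above 1.44% errors.
incident = multi_window_fires(long_error_rate=0.02, short_error_rate=0.02, burn_rate=14.4)

# Already recovered: the 1h window still remembers it, the 5m window is clean.
recovered = multi_window_fires(long_error_rate=0.02, short_error_rate=0.0001, burn_rate=14.4)

print(f"fast threshold={fast:.1f}x slow threshold={slow:.1f}x")
print(f"blip={blip} incident={incident} recovered={recovered}")
```

Only the sustained incident satisfies both conditions. The blip fails the long-window test, and the recovered incident fails the short-window test.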
Prometheus Alerting Rules
# SCALED: Multi-window, multi-burn-rate alerting
groups:
  - name: slo_burn_rate_alerts
    rules:
      # =====================
      # FAST BURN: PAGE
      # =====================
      # 14.4x burn rate, 1h long window, 5m short window
      # At this rate: ~2% of monthly budget consumed per hour
      - alert: RiderAPILatencyFastBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate1h) > (14.4 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate5m) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          burn_rate: fast  # matched by the Alertmanager routes
          team: rider-platform
          slo: rider-api-latency
        annotations:
          summary: "FAST BURN: Rider API latency budget being consumed at >14.4x"
          runbook: "https://wiki.internal/runbooks/rider-api-latency"
      # =====================
      # SLOW BURN: TICKET
      # =====================
      # 1x burn rate; the 3d/6h windows are approximated here by
      # a 6h long window and a 30m short window
      - alert: RiderAPILatencySlowBurn
        expr: |
          (1 - sli:rider_api:latency:success_rate6h) > (1 * 0.001)
          and
          (1 - sli:rider_api:latency:success_rate30m) > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
          burn_rate: slow  # matched by the Alertmanager routes
          team: rider-platform
          slo: rider-api-latency
        annotations:
          summary: "SLOW BURN: Rider API latency budget consumption trending toward exhaustion"
          runbook: "https://wiki.internal/runbooks/rider-api-latency-slow"
      # =====================
      # AVAILABILITY: FAST BURN
      # =====================
      # 0.0005 is the error budget of a 99.95% availability target
      - alert: RiderAPIAvailabilityFastBurn
        expr: |
          (1 - sli:rider_api:availability:success_rate1h) > (14.4 * 0.0005)
          and
          (1 - sli:rider_api:availability:success_rate5m) > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          burn_rate: fast  # matched by the Alertmanager routes
          team: rider-platform
          slo: rider-api-availability
        annotations:
          summary: "FAST BURN: Rider API availability budget being consumed at >14.4x"
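All three expressions follow one template: the error rate in each window is compared against burn rate times error budget. The sketch below builds that expression from an SLO target, which makes the 0.0005 in the availability rule easier to read as the budget of a 99.95% target (an inference from the rule, not something stated elsewhere in the text). The helper name is hypothetical, and the recording-rule names are assumed to follow the sli:rider_api:<sli>:success_rate<window> pattern used above.

```python
def burn_rate_expr(sli: str, slo_target: float, burn_rate: float,
                   long_window: str, short_window: str) -> str:
    """Build a two-window burn-rate alert expression as a PromQL string."""
    budget = round(1 - slo_target, 10)   # e.g. 0.999 -> 0.001
    threshold = f"({burn_rate:g} * {budget:g})"
    prefix = f"sli:rider_api:{sli}:success_rate"
    return "\n".join([
        f"(1 - {prefix}{long_window}) > {threshold}",
        "and",
        f"(1 - {prefix}{short_window}) > {threshold}",
    ])


latency_fast = burn_rate_expr("latency", 0.999, 14.4, "1h", "5m")
availability_fast = burn_rate_expr("availability", 0.9995, 14.4, "1h", "5m")
latency_slow = burn_rate_expr("latency", 0.999, 1, "6h", "30m")

print(availability_fast)
```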
Grafana Dashboard
{
  "dashboard": {
    "title": "Rider API SLO Dashboard",
    "panels": [
      {
        "title": "Error Budget Remaining",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "clamp_min(1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001), 0)",
            "legendFormat": "Latency Budget"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "orange", "value": 0.1 },
                { "color": "yellow", "value": 0.25 },
                { "color": "green", "value": 0.5 }
              ]
            },
            "unit": "percentunit"
          }
        }
      },
      {
        "title": "Burn Rate Over Time",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 6, "y": 0 },
        "targets": [
          {
            "expr": "(1 - sli:rider_api:latency:success_rate1h) / 0.001",
            "legendFormat": "1h Burn Rate"
          },
          {
            "expr": "(1 - sli:rider_api:latency:success_rate6h) / 0.001",
            "legendFormat": "6h Burn Rate"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": {
              "thresholdsStyle": { "mode": "line" }
            },
            "thresholds": {
              "steps": [
                { "color": "green", "value": 0 },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 14.4 }
              ]
            }
          }
        }
      },
      {
        "title": "Time Until Budget Exhaustion",
        "type": "stat",
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 0 },
        "targets": [
          {
            "expr": "clamp_min(sli:rider_api:latency:success_rate30d - 0.999, 0) / (1 - sli:rider_api:latency:success_rate1h) * 720",
            "legendFormat": "Hours Remaining"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "h",
            "thresholds": {
              "steps": [
                { "color": "red", "value": 0 },
                { "color": "yellow", "value": 168 },
                { "color": "green", "value": 336 }
              ]
            }
          }
        }
      }
    ]
  }
}
Three panels. The gauge shows percentage of budget remaining. The time series shows burn rate with horizontal threshold lines at 1x and 14.4x. The stat panel shows estimated hours until budget exhaustion at the current rate. When the budget is largely intact and the burn rate is well below 1x, the stat panel shows more than 720h (longer than the 30-day window).
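The panel arithmetic is easier to check outside PromQL. The sketch below assumes "time until exhaustion" means remaining budget divided by the current 1-hour burn rate, scaled to the 720-hour window. The function names are illustrative and the sample success rates are made up.

```python
def budget_remaining(success_rate_30d: float, slo_target: float = 0.999) -> float:
    """Fraction of the 30-day error budget still unspent (clamped at 0)."""
    consumed = (1 - success_rate_30d) / (1 - slo_target)
    return max(1 - consumed, 0.0)


def hours_until_exhaustion(success_rate_30d: float, success_rate_1h: float,
                           slo_target: float = 0.999,
                           window_hours: float = 720) -> float:
    """Remaining budget divided by the current burn rate, in hours."""
    burn = (1 - success_rate_1h) / (1 - slo_target)
    if burn <= 0:
        return float("inf")  # nothing is burning right now
    return budget_remaining(success_rate_30d, slo_target) * window_hours / burn


# Half the budget already spent, currently burning at 14.4x.
incident_hours = hours_until_exhaustion(0.9995, 0.9856)

# Untouched budget, burning at 0.2x (normal operation).
quiet_hours = hours_until_exhaustion(1.0, 0.9998)

print(f"incident: {incident_hours:.1f} h left, quiet: {quiet_hours:.0f} h left")
```

With half the budget gone and a 14.4x burn, roughly 25 hours remain. With a full budget and a 0.2x burn, the estimate is far beyond the 720-hour window.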
Alertmanager Routing
# SCALED: Alertmanager routing by burn rate severity
# Requires a burn_rate label (fast or slow) on the alerting rules
route:
  receiver: default-slack
  group_by: ["slo"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Fast burn alerts: page the on-call engineer
    - matchers:
        - severity = critical
        - burn_rate =~ "fast|critical"
      receiver: pagerduty-rider-platform
      group_wait: 10s
      repeat_interval: 1h
      continue: true
    # Also send fast burn to Slack for visibility
    - matchers:
        - severity = critical
      receiver: slack-incidents
      group_wait: 10s
    # Slow burn alerts: create a ticket, notify Slack
    - matchers:
        - severity = warning
        - burn_rate = slow
      receiver: slack-slo-warnings
      group_wait: 5m
      repeat_interval: 24h
receivers:
  - name: default-slack
    slack_configs:
      - channel: "#rider-platform-alerts"
        title: "{{ .GroupLabels.slo }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
  - name: pagerduty-rider-platform
    pagerduty_configs:
      - service_key_file: /etc/alertmanager/pagerduty-key
        severity: critical
        description: "{{ .CommonAnnotations.summary }}"
  - name: slack-incidents
    slack_configs:
      - channel: "#incidents"
        title: "SLO VIOLATION: {{ .GroupLabels.slo }}"
        # Double quotes so that \n becomes a real line break
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}"
        color: danger
  - name: slack-slo-warnings
    slack_configs:
      - channel: "#rider-platform-slo"
        title: "Slow Burn: {{ .GroupLabels.slo }}"
        text: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        color: warning
Fast burn goes to PagerDuty and the #incidents Slack channel. The on-call engineer is paged. Slow burn goes to #rider-platform-slo as a ticket-level notification. Nobody is woken up for a slow burn. The team reviews slow burn alerts during business hours and investigates the trend.
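The routing tree is worth tracing by hand, because the PagerDuty route only matches alerts that carry a burn_rate label. The sketch below is a simplified model of how Alertmanager walks sibling routes: it evaluates them in order, stops at the first match unless continue is set, treats regex matchers as fully anchored, and treats a missing label as an empty string. It is not Alertmanager's actual implementation, and it ignores grouping and timing.

```python
import re

# (matchers, receiver, continue) in the same order as the routing config.
ROUTES = [
    ([("severity", "=", "critical"), ("burn_rate", "=~", "fast|critical")],
     "pagerduty-rider-platform", True),
    ([("severity", "=", "critical")], "slack-incidents", False),
    ([("severity", "=", "warning"), ("burn_rate", "=", "slow")],
     "slack-slo-warnings", False),
]
DEFAULT_RECEIVER = "default-slack"


def matches(labels, matchers):
    """Return True if every matcher accepts the alert's labels."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like ""
        if op == "=" and actual != value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
    return True


def receivers_for(labels):
    """Walk the routes in order, honoring `continue`."""
    selected = []
    for matchers, receiver, keep_going in ROUTES:
        if matches(labels, matchers):
            selected.append(receiver)
            if not keep_going:
                break
    return selected or [DEFAULT_RECEIVER]


print(receivers_for({"severity": "critical", "burn_rate": "fast"}))
print(receivers_for({"severity": "warning", "burn_rate": "slow"}))
print(receivers_for({"severity": "critical"}))  # no burn_rate label
```

The last case is the one to watch: a critical alert without a burn_rate label skips the PagerDuty route and reaches only the #incidents channel, so nobody is paged.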
The Alert That Should Have Paged But Didn’t
Thursday. A new database index is deployed that improves 99% of queries but makes 0.3% of queries 200ms slower due to a different query plan for edge-case zone lookups. The threshold alert does not fire because p99 stays at 460ms (under the 500ms threshold). The 0.3% of slow requests sit above the 99th percentile, in the tail that a p99 threshold never looks at.
Burn rate analysis: 0.3% error rate vs 0.1% budget = 3x burn rate. Not enough for fast burn (14.4x threshold). But the burn is sustained: the 6-hour window shows a consistent burn rate above 1x, and the slow burn alert fires a ticket to #rider-platform-slo.
The team investigates, finds the query plan regression, adds a query hint to force the original plan. Budget consumed: ~20% over 2 days. Threshold alerting would not have noticed until the budget was exhausted and riders started complaining.
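The "~20% over 2 days" figure is burn rate multiplied by the fraction of the window that has elapsed. A minimal sketch, assuming a constant burn rate and the 30-day window used throughout (the function name is illustrative):

```python
def budget_consumed(burn_rate: float, elapsed_hours: float,
                    window_hours: float = 30 * 24) -> float:
    """Fraction of the error budget spent after burning at a constant rate."""
    return burn_rate * elapsed_hours / window_hours


# 0.3% of requests breaching the SLO against a 0.1% budget.
regression_burn = 0.003 / 0.001
consumed = budget_consumed(regression_burn, elapsed_hours=48)
print(f"{regression_burn:.0f}x for 48h consumes {consumed:.0%} of the budget")
```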
The Proof
Locust: Inducing a Slow Burn
# SCALED: Locust inducing a slow burn scenario
import random

from locust import HttpUser, task, between


class SlowBurnUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def fare_estimate(self):
        params = {
            "pickup_lat": 40.7128, "pickup_lng": -74.0060,
            "dropoff_lat": 40.7589, "dropoff_lng": -73.9851
        }
        # Add 50ms delay to 0.5% of requests
        # This pushes them from ~450ms to ~500ms, crossing the SLO threshold
        # 0.5% failure rate vs 0.1% budget = 5x burn rate
        # Slow enough that threshold alerting ignores it
        # Fast enough that slow-burn alert fires within 6 hours
        if random.random() < 0.005:
            params["simulate_delay_ms"] = 50
        self.client.get(
            "/api/rides/fare-estimate",
            params=params,
            name="/api/rides/fare-estimate"
        )
Run for 8 hours with 100 users (add --host with the API's base URL, since the user class does not set one):
locust -f slow_burn.py --users 100 --spawn-rate 20 --run-time 8h --headless
Expected timeline:
| Time | Burn Rate (6h) | Alert Status |
| --- | --- | --- |
| 0-30min | ~5x | No alert (for: 30m not met) |
| 30min-1h | ~5x | No alert (30m window stabilizing) |
| 1h-6h | ~5x | Slow burn pending (for: 30m condition met at ~1h) |
| ~1.5h | ~5x | SLOW BURN ALERT FIRES → Slack ticket |
| 6h-8h | ~5x | Alert continues (repeat_interval: 24h, no re-alert) |
The threshold alert (p99 > 500ms) never fires because p99 stays at 460ms. Only 0.5% of requests cross 500ms. The burn rate alert catches it because 0.5% failure rate against a 0.1% budget is a 5x burn rate, which exceeds the 1x slow-burn threshold.
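A small synthetic sample shows why the two alerts disagree. The latency values below are invented: 9,950 requests spread between 200ms and 460ms, plus 50 requests (0.5%) just past the 500ms SLO threshold. The percentile helper uses the nearest-rank method, which may differ slightly from how histogram_quantile interpolates.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 1) of a list of numbers."""
    ordered = sorted(values)
    rank = max(math.ceil(q * len(ordered)), 1)
    return ordered[rank - 1]


SLO_LATENCY_S = 0.5
ERROR_BUDGET = 0.001  # 99.9% SLO

# Hypothetical sample: fast requests spread over 0.20s to 0.46s,
# plus 0.5% of requests at 0.51s.
fast_requests = [0.20 + 0.26 * i / 9949 for i in range(9950)]
slow_requests = [0.51] * 50
latencies = fast_requests + slow_requests

p99 = percentile(latencies, 0.99)
slow_fraction = sum(1 for s in latencies if s > SLO_LATENCY_S) / len(latencies)
burn = slow_fraction / ERROR_BUDGET

print(f"p99 = {p99 * 1000:.0f} ms (threshold alert fires: {p99 > SLO_LATENCY_S})")
print(f"slow fraction = {slow_fraction:.2%}, burn rate = {burn:.1f}x")
```

The slow requests make up less than 1% of traffic, so they sit entirely above the 99th percentile and p99 never moves past 500ms. The SLI counts every one of them.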
After the run, check the error budget:
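1 - ((1 - sli:rider_api:latency:success_rate30d) / 0.001)
Expected: ~0.94, assuming the budget was full when the run started. At a 5x burn rate, 8 hours consumes 5 × 8 / 720 ≈ 5.6% of the 30-day budget, so the dashboard gauge is still green. At the current rate the team has about 136 hours (a little under 6 days) before the budget is exhausted.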
Before burn rate alerting:
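| Month | Pages | Actionable | False Positive Rate | Median Acknowledge Time |
| --- | --- | --- | --- | --- |
| March | 118 | 14 | 88.1% | 31 minutes |

After burn rate alerting:

| Month | Pages | Actionable | False Positive Rate | Median Acknowledge Time |
| --- | --- | --- | --- | --- |
| April | 9 | 8 | 11.1% | 4 minutes |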
From 118 pages to 9. From 88% false positives to 11%. From 31-minute acknowledge time to 4 minutes. The on-call engineer trusts the alert. When the phone buzzes, they know it matters.