Defining SLOs for the Ride-Hailing Platform
The Symptom
The engineering team defines SLOs for every service. The surge pricing service gets a 99.9% availability SLO. The driver analytics dashboard gets a 99.9% latency SLO. The internal admin API gets a 99.95% availability SLO.
Three months later, the error budget dashboard shows all SLOs at 100% budget remaining. Not because the services are perfect, but because nobody is measuring the SLIs correctly. The surge pricing SLO measures internal health checks, not actual surge calculations. The analytics dashboard SLO counts page loads, including the blank loading state. The admin API has 3 requests per hour, making the SLI statistically meaningless.
The team spent time defining SLOs that measure nothing useful.
The Cause
SLOs fail when they measure the wrong thing. Three common mistakes:
- Measuring infrastructure instead of user experience: “The database is available” is not an SLI. “The user can complete a ride request” is.
- Measuring all traffic equally: Health checks, readiness probes, and admin endpoints should not be in the SLI calculation. They dilute the signal.
- Setting targets without understanding the current baseline: Choosing 99.99% because it sounds good, when the service currently runs at 99.5%.
A meaningful SLI answers: “Did the user get what they wanted, fast enough, and correctly?” For the ride-hailing platform, the user wants:
- To request a ride and get a driver (availability)
- To get a fare estimate in under half a second (latency)
- To be charged the correct amount (correctness)
The Baseline
Current SLI measurements:
| Service | SLI Type | What It Measures | Problem |
|---|---|---|---|
| Rider API | Availability | All HTTP 200s | Includes health checks |
| Surge Pricing | Availability | Internal health endpoint | Not measuring surge calcs |
| Driver Analytics | Latency | Page load (empty state) | Not measuring data load |
| Admin API | Availability | All responses | 3 req/hour, meaningless |
| Fare Service | None | Nothing | Not measured |
Target SLI definitions:
| Service | SLI Type | What It Measures | Excludes |
|---|---|---|---|
| Rider API | Latency | Ride request < 500ms | Health checks, admin |
| Rider API | Availability | Ride request non-5xx | Health checks, admin |
| Fare Service | Correctness | Fare within expected range | Test requests |
| Fare Service | Latency | Fare estimate < 200ms | Test requests |
The Fix
SLI Selection: The Three Proportions
Every SLI is a proportion: good events divided by total events.
| SLI Type | Good Event | Total Event |
|---|---|---|
| Latency | Request completed in < threshold | All requests |
| Availability | Request completed without server error | All requests |
| Correctness | Fare within ±5% of expected value | All fare calculations |
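The three proportions can be sketched in a few lines of Python. This is an illustration only: the `Event` record and its field names are hypothetical, the 500ms threshold and ±5% tolerance are taken from the tables above, and returning `None` for zero traffic is an assumption about how to treat an undefined ratio.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Event:
    """One request as the SLI sees it (hypothetical shape, for illustration)."""
    duration_s: float            # observed latency in seconds
    status: int                  # HTTP status code
    fare: float = 0.0            # fare actually charged
    expected_fare: float = 0.0   # reference fare from an independent calculation


def is_fast(e: Event, threshold_s: float = 0.5) -> bool:
    # Latency: good if the request completed in under the threshold.
    return e.duration_s < threshold_s


def is_available(e: Event) -> bool:
    # Availability: only server errors (5xx) count as bad events.
    return not (500 <= e.status <= 599)


def is_correct(e: Event, tolerance: float = 0.05) -> bool:
    # Correctness: fare within +/-5% of the expected value.
    return abs(e.fare - e.expected_fare) <= tolerance * e.expected_fare


def sli(events: Iterable[Event], is_good: Callable[[Event], bool]) -> Optional[float]:
    """Good events divided by total events; None when there is no traffic."""
    events = list(events)
    if not events:
        return None
    return sum(1 for e in events if is_good(e)) / len(events)


events = [
    Event(0.2, 200, fare=10.0, expected_fare=10.0),
    Event(0.7, 200, fare=10.0, expected_fare=10.0),  # too slow
    Event(0.1, 503, fare=10.0, expected_fare=10.0),  # server error
    Event(0.3, 200, fare=11.0, expected_fare=10.0),  # fare off by 10%
]
latency_sli = sli(events, is_fast)
availability_sli = sli(events, is_available)
correctness_sli = sli(events, is_correct)
```

Each SLI type uses the same division; only the definition of a good event changes.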
Prometheus Recording Rules
# SCALED: Recording rules for ride-hailing SLIs
groups:
  - name: ride_hailing_slis
    interval: 30s
    rules:
      # ============================
      # LATENCY SLI: Rider API
      # ============================
      # Good: requests faster than 500ms
      # Total: all requests (excluding health checks)
      - record: sli:rider_api:latency:good_total5m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[5m]))
      - record: sli:rider_api:latency:total5m
        expr: |
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[5m]))
      - record: sli:rider_api:latency:ratio5m
        expr: |
          sli:rider_api:latency:good_total5m
          /
          sli:rider_api:latency:total5m
      # 30-minute window
      - record: sli:rider_api:latency:ratio30m
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[30m]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[30m]))
      # 1-hour window
      - record: sli:rider_api:latency:ratio1h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[1h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[1h]))
      # 6-hour window
      - record: sli:rider_api:latency:ratio6h
        expr: |
          sum(rate(http_server_requests_seconds_bucket{
            service="rider-api",
            uri=~"/api/rides/.*",
            le="0.5"
          }[6h]))
          /
          sum(rate(http_server_requests_seconds_count{
            service="rider-api",
            uri=~"/api/rides/.*"
          }[6h]))
      # ============================
      # AVAILABILITY SLI: Rider API
      # ============================
      - record: sli:rider_api:availability:ratio5m
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*"
            }[5m]))
          )
      - record: sli:rider_api:availability:ratio1h
        expr: |
          1 - (
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*",
              status=~"5.."
            }[1h]))
            /
            sum(rate(http_server_requests_seconds_count{
              service="rider-api",
              uri=~"/api/rides/.*"
            }[1h]))
          )
      # ============================
      # CORRECTNESS SLI: Fare Service
      # ============================
      - record: sli:fare:correctness:ratio5m
        expr: |
          sum(rate(fare_calculation_accurate_total{
            service="fare-service"
          }[5m]))
          /
          sum(rate(fare_calculation_total{
            service="fare-service"
          }[5m]))
The uri=~"/api/rides/.*" matcher works as an allowlist: only paths under /api/rides/ are counted, so health checks (/health), readiness probes (/ready), and admin endpoints (/admin/*) never enter the calculation. Only rider-facing traffic counts toward the SLO.
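The effect of the filter can be checked offline with a small Python sketch. Prometheus anchors regex matchers at both ends, which `re.fullmatch` mimics here. Apart from `/api/rides/request`, `/health`, and `/ready`, the sample paths are hypothetical.

```python
import re

# Same pattern as the uri=~ matcher in the recording rules.
RIDER_URI = re.compile(r"/api/rides/.*")


def counts_toward_sli(uri: str) -> bool:
    """True if a request to this path would be included in the SLI."""
    # fullmatch reproduces Prometheus's implicit ^...$ anchoring.
    return RIDER_URI.fullmatch(uri) is not None


paths = [
    "/api/rides/request",
    "/api/rides/123/status",        # hypothetical rider-facing path
    "/health",
    "/ready",
    "/admin/users",                 # hypothetical admin path
    "/internal/api/rides/request",  # hypothetical; excluded because of anchoring
]
included = [p for p in paths if counts_toward_sli(p)]
```

Because the pattern is anchored, a path that merely contains `/api/rides/` somewhere in the middle is not counted.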
Why Multiple Windows Matter
A single-window SLI is vulnerable to edge effects. If you compute the ratio over only 5 minutes, a brief spike looks catastrophic. If you compute it only over 6 hours, a brief spike is invisible, but a real degradation also takes hours to surface.
Multiple windows serve different purposes:
| Window | Purpose | Used By |
|---|---|---|
| 5m | Short-term validation (is it still happening right now?) | Fast burn short window |
| 30m | Recent trend confirmation | Slow burn short window |
| 1h | Sustained impact detection | Fast burn long window |
| 6h | Gradual degradation detection | Slow burn long window |
The 5m and 30m windows are short validation windows. They confirm the problem is current, not historical. The 1h and 6h windows are long detection windows. They confirm the problem is significant, not a blip. Alerting rules pair one long window with one short window (CH17-S2).
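The pairing logic can be sketched as follows. This is a minimal illustration, not the alerting rules themselves: the burn-rate threshold of 14.4 for the 1h/5m pair is a commonly cited value and an assumption here, since this section does not state the thresholds.

```python
def burn_rate(sli_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return (1.0 - sli_ratio) / (1.0 - slo_target)


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, threshold: float) -> bool:
    """Fire only when BOTH windows burn faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening.
    """
    return (burn_rate(long_ratio, slo_target) > threshold
            and burn_rate(short_ratio, slo_target) > threshold)


SLO = 0.999
FAST_BURN = 14.4  # assumed threshold for the 1h long / 5m short pair

ongoing = should_alert(long_ratio=0.98, short_ratio=0.97,
                       slo_target=SLO, threshold=FAST_BURN)
recovered = should_alert(long_ratio=0.98, short_ratio=0.9995,
                         slo_target=SLO, threshold=FAST_BURN)
blip = should_alert(long_ratio=0.9995, short_ratio=0.90,
                    slo_target=SLO, threshold=FAST_BURN)
```

Only the ongoing case alerts. The recovered incident fails the short-window check, and the blip fails the long-window check.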
Error Budget Calculation
| SLO Target | Error Budget Rate | 30-Day Budget (time) | 30-Day Budget (requests at 100 RPS) |
|---|---|---|---|
| 99.9% | 0.1% | 43.2 minutes | 259,200 slow/failed requests |
| 99.95% | 0.05% | 21.6 minutes | 129,600 slow/failed requests |
| 99.99% | 0.01% | 4.32 minutes | 25,920 slow/failed requests |
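The arithmetic behind these figures is the budget rate multiplied by the length of the window. A short sketch, where the function name is illustrative and the 30-day window and 100 RPS come from the table:

```python
def error_budget(slo_target: float, window_days: int = 30, rps: float = 100.0):
    """Return (budget_rate, budget_minutes, budget_requests) for the window."""
    budget_rate = 1.0 - slo_target
    window_seconds = window_days * 24 * 60 * 60
    budget_minutes = budget_rate * window_seconds / 60
    budget_requests = budget_rate * window_seconds * rps
    return budget_rate, budget_minutes, budget_requests


for target in (0.999, 0.9995, 0.9999):
    rate, minutes, requests = error_budget(target)
    print(f"{target:.2%}  {rate:.2%}  {minutes:.2f} min  {requests:,.0f} requests")
```

The request budget scales linearly with traffic, so a service running at 10 RPS has one tenth of the request budget shown for the same target.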
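A 30-day error budget query. It assumes a sli:rider_api:latency:ratio30d recording rule, which is not among the rules listed above and would be defined like the shorter windows with a [30d] range:
# SCALED: Error budget remaining for rider API latency SLO
1 - (
  (1 - sli:rider_api:latency:ratio30d)  # actual error rate over 30 days
  /
  0.001                                 # allowed error rate (1 - 0.999)
)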
If the result is 0.75, 75% of the error budget remains. If it drops below 0, the SLO is violated.
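The same formula in Python, as a worked example. The function name and the sample SLI values are illustrative.

```python
def budget_remaining(sli_ratio: float, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the SLO is violated."""
    allowed_error_rate = 1.0 - slo_target
    actual_error_rate = 1.0 - sli_ratio
    return 1.0 - actual_error_rate / allowed_error_rate


# 0.025% of requests were slow against a 0.1% allowance: 75% of the budget is left.
healthy = budget_remaining(sli_ratio=0.99975, slo_target=0.999)

# 0.2% of requests were slow: the budget is overspent by a full budget's worth.
violated = budget_remaining(sli_ratio=0.998, slo_target=0.999)
```

Here `healthy` comes out at about 0.75 and `violated` at about -1.0, matching the interpretation above.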
Vanity Metrics vs Meaningful SLOs
| SLO | Meaningful? | Why? |
|---|---|---|
| Surge pricing 99.9% availability | No | Riders can book without surge pricing |
| Driver analytics 99.9% latency | No | Drivers don't need real-time analytics |
| Rider API 99.9% latency | Yes | Directly affects ride request experience |
| Fare service 99.99% correctness | Yes | Wrong fares lose trust and revenue |
The surge pricing service can be down for 10 minutes and riders still book rides. They just do not see surge pricing. That is a degraded experience, not an outage. The rider API going down for 10 minutes means nobody can request a ride. That is an outage.
Prioritizing SLOs
| Priority | Service | SLO | Engineering Investment |
|---|---|---|---|
| 1 | Rider API | 99.9% latency < 500ms | High (auto-scaling, caching, circuit breakers) |
| 2 | Rider API | 99.95% availability | High (multi-AZ, graceful degradation) |
| 3 | Fare Service | 99.99% correctness | Medium (validation, reconciliation) |
| 4 | Driver API | 99.5% latency < 1s | Low (batch-tolerant users) |
| 5 | Analytics Dashboard | 99% availability | Minimal (internal tool) |
Lower-priority services get looser SLOs and less engineering investment. The analytics dashboard at 99% availability gets 7.2 hours of allowed downtime per month. That is generous enough to deploy during business hours without worrying about SLO violations.
Error Budget as an Engineering Lever
The error budget is not just a measurement. It is a decision-making tool:
| Budget Remaining | Action |
|---|---|
| > 75% | Ship features freely, take risks |
| 50-75% | Normal development, monitor trends |
| 25-50% | Slow feature work, prioritize reliability |
| < 25% | Feature freeze, all hands on reliability |
| 0% (violated) | Postmortem required, mandatory reliability sprint |
When the rider API has 90% budget remaining, the team ships a risky database migration without hesitation. When budget drops to 30%, the team postpones the migration and investigates the burn rate. When budget hits 0%, feature development stops until reliability is restored.
This converts “how reliable should we be?” from a philosophical debate into a data-driven discussion. Product managers see the budget gauge. They understand that shipping a risky feature when the budget is at 15% means accepting the possibility of a feature freeze.
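The policy table can be encoded as a small lookup, for example to annotate the budget gauge. This is a sketch: the table leaves the exact 50% boundary ambiguous, so assigning it to the "Normal development" band is an assumption, and the function name is illustrative.

```python
def budget_action(remaining: float) -> str:
    """Map the remaining error budget (as a fraction) to the policy action."""
    if remaining <= 0.0:
        return "Postmortem required, mandatory reliability sprint"
    if remaining < 0.25:
        return "Feature freeze, all hands on reliability"
    if remaining < 0.50:
        return "Slow feature work, prioritize reliability"
    if remaining <= 0.75:
        # Assumption: exactly 50% and exactly 75% fall in this band.
        return "Normal development, monitor trends"
    return "Ship features freely, take risks"


decisions = {pct: budget_action(pct / 100) for pct in (90, 30, 15, 0)}
```

With the examples from the text, 90% maps to shipping freely, 30% to slowing feature work, 15% to a feature freeze, and 0% to a postmortem.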
The Proof
After defining correct SLIs, validate them against real traffic:
# SCALED: Validate SLI accuracy
# Step 1: Check SLI ratio for the last hour
sli:rider_api:latency:ratio1h
# Expected: 0.997-0.999 for a healthy system
# If you see 1.0, the SLI might not be measuring real traffic
# If you see < 0.99, either the SLO target is too aggressive or the service has issues
# Step 2: Measure how much health check traffic the filter keeps out of the SLI
# This is /health volume relative to the SLI denominator. It is nonzero while
# probes are running; it shows how much dilution the uri filter prevents.
sum(rate(http_server_requests_seconds_count{
  service="rider-api",
  uri="/health"
}[5m]))
/
sum(rate(http_server_requests_seconds_count{
  service="rider-api",
  uri=~"/api/rides/.*"
}[5m]))
If this ratio exceeds 1%, health checks would noticeably dilute the SLI if they were counted, so the filter matters. The filter is working correctly when sli:rider_api:latency:total5m matches the ride request rate alone, meaning health checks contribute 0% to the SLI calculation.
Run Locust for 1 hour and verify the SLI tracks reality:
# SCALED: Locust for SLI validation
from locust import HttpUser, task, between


class SLIValidationUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(10)
    def ride_request(self):
        """Rider-facing traffic: should be in SLI"""
        self.client.post(
            "/api/rides/request",
            json={
                "rider_id": "validation-rider",
                "pickup": {"lat": 40.7128, "lng": -74.0060},
                "dropoff": {"lat": 40.7589, "lng": -73.9851},
            },
            name="/api/rides/request",
        )

    @task(1)
    def health_check(self):
        """Infrastructure traffic: should NOT be in SLI"""
        self.client.get("/health", name="/health")
After 1 hour, compare:
- Total requests to /api/rides/request in Locust: ~36,000
- Total requests derived from sli:rider_api:latency:total5m over 1 hour (it is a per-second rate, so multiply its hourly average by 3,600): ~36,000
- Total requests to /health in Locust: ~3,600
- Health check contribution to SLI: 0%
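The comparison can be scripted once both numbers are in hand. The sketch below assumes you have already read the hourly average of the SLI denominator (a per-second rate) from Prometheus. The function names and the 5% tolerance are illustrative assumptions.

```python
def requests_from_rate(avg_rate_per_s: float, duration_s: float) -> float:
    """Convert an average per-second rate into a request count."""
    return avg_rate_per_s * duration_s


def sli_tracks_load(locust_count: int, avg_rate_per_s: float,
                    duration_s: float = 3600.0, tolerance: float = 0.05) -> bool:
    """True if the SLI denominator agrees with Locust within the tolerance."""
    estimated = requests_from_rate(avg_rate_per_s, duration_s)
    return abs(estimated - locust_count) <= tolerance * locust_count


# About 10 ride requests per second for an hour matches ~36,000 in Locust.
filter_ok = sli_tracks_load(locust_count=36_000, avg_rate_per_s=10.0)

# About 11 per second would mean the ~3,600 health checks leaked into the SLI.
filter_leaking = sli_tracks_load(locust_count=36_000, avg_rate_per_s=11.0)
```

A denominator that overshoots the Locust ride count by roughly the health check volume is the signature of a filter that is not excluding infrastructure traffic.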
The SLI measures what users experience, nothing more.