Cascading Failures, Circuit Breakers, and Bulkheads
The Symptom
Saturday night. 9:14 PM. The surge pricing service starts returning responses in 4 seconds instead of 50ms. The dashboard shows rider API CPU at 12%, memory at 40%, garbage collection normal. Nothing looks wrong except the one metric that matters: ride booking success rate dropped from 99.97% to 23%.
The surge pricing service is slow. The rider API is dead. These two facts are connected by a thread pool and 200 connections that refused to let go.
The Cause
The rider API calls the surge pricing service on every ride request. Under normal conditions, each call takes 50ms. The rider API handles requests on Netty event loop threads and shares a pool of 200 outbound connections across all downstream calls. When the surge pricing service starts responding in 4 seconds, each call holds its connection 80x longer than normal.
At 500 requests per second, the math destroys you:
Normal: 500 RPS × 50ms = 25 concurrent connections (12.5% of pool)
Degraded: 500 RPS × 4000ms = 2000 concurrent connections (1000% of pool)
The connection pool has 200 slots. After 0.4 seconds, every slot holds a connection waiting for surge pricing. New ride requests arrive, find no available connections, and queue. The queue fills. Timeouts fire. But timeouts are set to 30 seconds because someone once saw a legitimate 15-second response during a deployment.
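The arithmetic above is Little's Law: average calls in flight equal arrival rate multiplied by time per call. A minimal sketch of that calculation follows; the class and method names are made up for illustration and are not part of the rider API.

```java
// Sketch: Little's Law applied to the figures quoted above.
// PoolSaturation, inFlight and secondsToFill are hypothetical names.
final class PoolSaturation {

    /** Average number of calls in flight for a request rate and a per-call latency. */
    static double inFlight(double requestsPerSecond, double latencyMillis) {
        return requestsPerSecond * latencyMillis / 1000.0;
    }

    /** Seconds until the pool is fully occupied, assuming no call completes before then. */
    static double secondsToFill(int poolSize, double requestsPerSecond) {
        return poolSize / requestsPerSecond;
    }

    public static void main(String[] args) {
        int poolSize = 200;
        double rps = 500;
        System.out.printf("normal:   %.0f in flight%n", inFlight(rps, 50));
        System.out.printf("degraded: %.0f in flight%n", inFlight(rps, 4000));
        System.out.printf("pool full after %.1f s%n", secondsToFill(poolSize, rps));
    }
}
```

The second helper only holds while calls take longer than the fill time, which is the case here: 4 seconds per call against 0.4 seconds to exhaust the pool.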
The surge pricing service did not crash. It slowed down. And that slowdown propagated upstream through a shared resource pool, killing every feature that shares the rider API: ride booking, fare estimation, driver ETA, trip history. All dead because of one slow dependency.
This is a cascading failure. Service A depends on Service B. Service B degrades. Service A holds resources waiting for B. Service A runs out of resources. Everything that depends on A dies.
Timeline of a cascading failure:
T+0s     Surge pricing response time: 50ms → 4000ms
T+0.4s   Rider API connection pool: 200/200 occupied
T+0.5s   Ride booking requests start queueing
T+2s     Queue depth: 1000 requests
T+5s     Rider API health check times out
T+8s     Load balancer marks rider API unhealthy
T+10s    All rider API pods marked unhealthy
T+12s    0% of ride requests succeed
Three patterns prevent this. Circuit breakers stop calling a failing dependency. Bulkheads isolate failure domains so one slow dependency cannot consume all resources. Retries with backoff and jitter recover gracefully without stampeding.
The Baseline
The rider API before resilience patterns:
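// BOTTLENECK: No circuit breaker, no bulkhead, shared thread pool
import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class RideBookingService {
    private final SurgePricingClient surgePricingClient;
    private final DriverMatchingClient driverMatchingClient;
    private final FareService fareService;

    // Dependencies are injected through the constructor.
    public RideBookingService(SurgePricingClient surgePricingClient,
                              DriverMatchingClient driverMatchingClient,
                              FareService fareService) {
        this.surgePricingClient = surgePricingClient;
        this.driverMatchingClient = driverMatchingClient;
        this.fareService = fareService;
    }

    // Every booking waits on surge pricing before anything else can happen.
    public Mono<RideBooking> bookRide(RideRequest request) {
        return surgePricingClient.getMultiplier(request.getZoneId())
            .flatMap(multiplier ->
                fareService.calculate(request, multiplier))
            .flatMap(fare ->
                driverMatchingClient.findDriver(request, fare))
            .map(driver -> createBooking(request, driver));
    }
}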
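// BOTTLENECK: WebClient with no timeout isolation
import java.math.BigDecimal;
import java.time.Duration;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

@Component
public class SurgePricingClient {
    private final WebClient webClient;

    // The same WebClient (and connection pool) is shared with the other clients.
    public SurgePricingClient(WebClient webClient) {
        this.webClient = webClient;
    }

    public Mono<BigDecimal> getMultiplier(String zoneId) {
        return webClient.get()
            .uri("/api/surge/{zoneId}", zoneId)
            .retrieve()
            .bodyToMono(SurgeResponse.class)
            .map(SurgeResponse::getMultiplier)
            .timeout(Duration.ofSeconds(30)); // 30s timeout, might as well be forever
    }
}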
Every surge pricing call, driver matching call, and fare calculation shares the same WebClient connection pool. When surge pricing hangs, the pool fills, and driver matching calls that would succeed in 20ms cannot even start.
Load test baseline with all services healthy:
Locust: 500 users, 10 RPS per user
Metric        Value
p50 latency   120ms
p95 latency   280ms
p99 latency   410ms
Error rate    0.03%
Throughput    4,980 RPS
Load test with surge pricing at 4-second latency:
Locust: 500 users, 10 RPS per user, surge pricing at 4s
Metric        Value
p50 latency   28,400ms
p95 latency   timeout
p99 latency   timeout
Error rate    77%
Throughput    310 RPS
77% error rate. The surge pricing service is not down. It is slow. And that slowness killed the entire platform.
The Fix
Three layers of defense.
Layer 1: Circuit Breaker. When the surge pricing service fails repeatedly, stop calling it. Return a fallback value. Stop wasting connections on a service that is not responding.
Layer 2: Bulkhead. Cap the surge pricing client at a fixed number of concurrent calls. When those 20 slots fill up, the remaining 180 connections are still available for every other downstream call.
Layer 3: Retry with Backoff. When a call fails, retry with exponential delay and random jitter. Without backoff, 5,000 simultaneous retries kill the recovering service. Without jitter, 5,000 retries with the same delay hit at the same millisecond.
// SCALED: Resilience4j dependencies
// build.gradle.kts
dependencies {
    implementation("io.github.resilience4j:resilience4j-spring-boot3:2.2.0")
    implementation("io.github.resilience4j:resilience4j-reactor:2.2.0")
    implementation("io.github.resilience4j:resilience4j-micrometer:2.2.0")
}
# SCALED: application.yml - Resilience4j configuration
resilience4j:
  circuitbreaker:
    instances:
      surgePricing:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 5
        slidingWindowType: COUNT_BASED
        minimumNumberOfCalls: 10
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
      driverMatching:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 15s
        permittedNumberOfCallsInHalfOpenState: 3
  bulkhead:
    instances:
      surgePricing:
        maxConcurrentCalls: 20
        maxWaitDuration: 500ms
      driverMatching:
        maxConcurrentCalls: 50
        maxWaitDuration: 1s
  retry:
    instances:
      surgePricing:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        enableRandomizedWait: true
        randomizedWaitFactor: 0.5
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
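The circuit breaker evaluates the last 20 calls once at least 10 have been recorded. If 50% or more fail, it opens. For 10 seconds, all calls return the fallback immediately. Then it moves to half-open and allows 5 test calls. If fewer than half of those fail, it closes. Otherwise it opens again for another 10 seconds. Note that a 4-second response counts as a failure only if the client times out first, so the 30-second client timeout has to drop well below 4 seconds (or slowCallDurationThreshold and slowCallRateThreshold have to be configured) before this breaker trips on slowness.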
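To make the closed, open, and half-open transitions concrete, here is a deliberately simplified count-based breaker in plain Java. It is a sketch of the state machine only, not Resilience4j's implementation; the class name `SimpleCircuitBreaker` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified count-based circuit breaker. Illustrative only; not Resilience4j's implementation.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;      // cf. slidingWindowSize
    private final int minimumCalls;    // cf. minimumNumberOfCalls
    private final double failureRate;  // cf. failureRateThreshold, as a fraction (0.5 = 50%)
    private final long openMillis;     // cf. waitDurationInOpenState
    private final int halfOpenCalls;   // cf. permittedNumberOfCallsInHalfOpenState

    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = call succeeded
    private State state = State.CLOSED;
    private long openedAtMillis;
    private int halfOpenPermitsIssued;

    SimpleCircuitBreaker(int windowSize, int minimumCalls, double failureRate,
                         long openMillis, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRate = failureRate;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    /** Returns true if a call may go out at time nowMillis. */
    synchronized boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= openMillis) {
            transitionTo(State.HALF_OPEN, nowMillis); // wait is over: start probing
        }
        if (state == State.OPEN) {
            return false; // caller should use its fallback
        }
        if (state == State.HALF_OPEN) {
            if (halfOpenPermitsIssued >= halfOpenCalls) {
                return false; // only a limited number of probe calls
            }
            halfOpenPermitsIssued++;
        }
        return true;
    }

    /** Records the outcome of a call that was allowed through. */
    synchronized void record(boolean success, long nowMillis) {
        if (state == State.OPEN) {
            return; // result of a call that started before the circuit opened
        }
        boolean halfOpen = state == State.HALF_OPEN;
        outcomes.addLast(success);
        if (outcomes.size() > (halfOpen ? halfOpenCalls : windowSize)) {
            outcomes.removeFirst(); // keep only the most recent calls
        }
        if (outcomes.size() < (halfOpen ? halfOpenCalls : minimumCalls)) {
            return; // not enough data to judge yet
        }
        long failures = outcomes.stream().filter(ok -> !ok).count();
        if ((double) failures / outcomes.size() >= failureRate) {
            transitionTo(State.OPEN, nowMillis);
        } else if (halfOpen) {
            transitionTo(State.CLOSED, nowMillis);
        }
    }

    synchronized State state() {
        return state;
    }

    private void transitionTo(State next, long nowMillis) {
        state = next;
        outcomes.clear();
        halfOpenPermitsIssued = 0;
        if (next == State.OPEN) {
            openedAtMillis = nowMillis;
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(20, 10, 0.5, 10_000, 5);
        for (int i = 0; i < 10; i++) {
            breaker.record(i % 2 == 0, 0); // alternate success and failure
        }
        System.out.println(breaker.state());            // OPEN
        System.out.println(breaker.tryAcquire(5_000));  // false
        System.out.println(breaker.tryAcquire(10_000)); // true
    }
}
```

The constructor arguments in `main` mirror the surgePricing values from the configuration. A production breaker also has to deal with slow calls, ignored exceptions, and metrics, which is why the configuration leans on the library rather than hand-rolled code.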
The bulkhead limits surge pricing to 20 concurrent calls. The remaining capacity serves ride bookings.
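The idea behind a semaphore bulkhead fits in a few lines. The sketch below is a blocking, plain-Java illustration with hypothetical names (`SemaphoreBulkhead`, `execute`); it is not how the reactive Resilience4j operator is implemented, but it shows the same two knobs as the configuration: a cap on concurrent calls and a bounded wait for a free slot.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Minimal semaphore bulkhead. Illustrative only; names are hypothetical.
final class SemaphoreBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    SemaphoreBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true); // fair: longest waiter goes first
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Runs the call if a slot frees up within maxWaitMillis; otherwise returns empty without calling. */
    <T> Optional<T> execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Optional.empty();
        }
        if (!acquired) {
            return Optional.empty(); // bulkhead full: reject instead of queueing indefinitely
        }
        try {
            return Optional.ofNullable(call.get());
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1, 10);
        // The outer call holds the only slot, so the nested call is rejected after 10ms.
        String result = bulkhead
            .execute(() -> bulkhead.execute(() -> "inner").orElse("rejected"))
            .orElse("outer rejected");
        System.out.println(result); // rejected
    }
}
```

The rejection path is the point: a caller that cannot get a slot fails fast and can fall back, instead of holding a connection that other features need.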
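The retry waits about 100ms after the first failure and about 200ms after the second, each randomized by ±50%. Three attempts total. How those failures reach the circuit breaker depends on decorator order: with Resilience4j's default aspect order, retry wraps the circuit breaker, so each failed attempt is recorded individually while the circuit is closed.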
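The retry delays can be computed by hand to check the configuration. The following sketch assumes the jitter is drawn uniformly from the base delay plus or minus the randomization factor; the class and method names are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with randomized jitter. Illustrative only; names are hypothetical.
final class BackoffSchedule {

    /** Base delay before retry number `retry` (1-based): initial * multiplier^(retry - 1). */
    static long baseDelayMillis(long initialMillis, double multiplier, int retry) {
        return Math.round(initialMillis * Math.pow(multiplier, retry - 1));
    }

    /** Spreads the base delay uniformly over [base * (1 - factor), base * (1 + factor)]. */
    static long jitteredDelayMillis(long baseMillis, double factor) {
        double delta = baseMillis * factor;
        double min = baseMillis - delta;
        double max = baseMillis + delta;
        return Math.round(min + ThreadLocalRandom.current().nextDouble() * (max - min));
    }

    public static void main(String[] args) {
        int maxAttempts = 3; // one initial call plus two retries
        for (int retry = 1; retry < maxAttempts; retry++) {
            long base = baseDelayMillis(100, 2.0, retry);
            long actual = jitteredDelayMillis(base, 0.5);
            System.out.printf("retry %d: base %dms, jittered %dms%n", retry, base, actual);
        }
    }
}
```

With these values the first retry lands somewhere between 50ms and 150ms and the second between 100ms and 300ms, so clients that failed at the same instant no longer retry at the same instant.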
The Proof
Load test with surge pricing at 4-second latency, circuit breaker + bulkhead + retry enabled:
# SCALED: Locust test for cascading failure with resilience patterns
from locust import HttpUser, task, between


class RideBookingUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(10)
    def book_ride(self):
        payload = {
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128,
            "pickupLng": -74.0060,
            "dropoffLat": 40.7580,
            "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }
        with self.client.post("/api/rides/book", json=payload,
                              catch_response=True) as response:
            if response.status_code == 200:
                # A degraded booking (cached surge multiplier) still counts as a success.
                response.success()
            elif response.status_code == 503:
                response.failure("Service unavailable")

    @task(3)
    def get_fare_estimate(self):
        self.client.get("/api/fares/estimate?zoneId=manhattan-midtown")

    @task(1)
    def get_trip_history(self):
        self.client.get("/api/trips/history?riderId=rider-1")
Results with resilience patterns:
Locust: 500 users, 10 RPS per user, surge pricing at 4s latency
Metric            Without Resilience   With Resilience
p50 latency       28,400ms             140ms
p95 latency       timeout              310ms
p99 latency       timeout              890ms
Error rate        77%                  0.4%
Throughput        310 RPS              4,850 RPS
Booking success   23%                  99.6%
Circuit state     N/A                  OPEN after 8s
Surge fallback    N/A                  Cached multiplier
The circuit breaker opened 8 seconds after the surge pricing degradation started. During those 8 seconds, 20 connections (the bulkhead limit) were occupied by slow surge pricing calls. The remaining 180 connections served ride bookings and fare estimates at near-normal latency.
After the circuit opened, surge pricing calls returned the cached multiplier in under 1ms. The rider got a ride at the last-known surge price instead of no ride at all.
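The fallback itself is not shown in the listings above, so the following is only a guess at its shape: remember the last multiplier that came back for each zone, and serve it when the live call fails or is rejected. `CachedSurgeFallback`, the `Function`-based remote call, and the default multiplier of 1 are all assumptions made for this sketch.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Last-known-value fallback. Illustrative only; names and the default value are assumptions.
final class CachedSurgeFallback {
    // Assumption: with nothing cached for a zone, fall back to "no surge".
    private static final BigDecimal DEFAULT_MULTIPLIER = BigDecimal.ONE;

    private final Map<String, BigDecimal> lastKnown = new ConcurrentHashMap<>();
    private final Function<String, BigDecimal> remoteCall; // stands in for the real client

    CachedSurgeFallback(Function<String, BigDecimal> remoteCall) {
        this.remoteCall = remoteCall;
    }

    /** Returns the live multiplier when available, otherwise the last value seen for the zone. */
    BigDecimal getMultiplier(String zoneId) {
        try {
            BigDecimal fresh = remoteCall.apply(zoneId);
            lastKnown.put(zoneId, fresh); // refresh the cache on every successful call
            return fresh;
        } catch (RuntimeException e) {
            // Timeout, open circuit, full bulkhead: serve the cached value instead of failing.
            return lastKnown.getOrDefault(zoneId, DEFAULT_MULTIPLIER);
        }
    }

    public static void main(String[] args) {
        boolean[] degraded = {false};
        CachedSurgeFallback surge = new CachedSurgeFallback(zone -> {
            if (degraded[0]) {
                throw new IllegalStateException("surge pricing unavailable");
            }
            return new BigDecimal("1.8");
        });
        System.out.println(surge.getMultiplier("manhattan-midtown")); // 1.8 (live)
        degraded[0] = true;
        System.out.println(surge.getMultiplier("manhattan-midtown")); // 1.8 (cached)
        System.out.println(surge.getMultiplier("brooklyn"));          // 1 (default)
    }
}
```

A real version would probably expire cached values after some period, since a stale surge price is a business decision as much as a technical one.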
Trip history and fare estimates continued unaffected throughout the incident because the bulkhead prevented surge pricing from consuming their connection capacity.
The 0.4% error rate came from the 8-second window before the circuit opened. Requests that were already queued behind the bulkhead’s 20 connections timed out at the 500ms maxWaitDuration. Those are the users who retried and got through on the second attempt.