Circuit Breakers with Resilience4j
The Symptom
The rider API calls the surge pricing service 4,800 times per minute during peak hours. The surge pricing service deploys a bad config change. Response times jump from 50ms to 3 seconds. Within 12 seconds, the rider API’s connection pool is exhausted. Ride bookings fail across all zones, including zones where surge pricing is not active and the multiplier would be 1.0x.
The surge pricing service is not returning errors. It is returning correct responses, slowly. The HTTP status codes are all 200. No alerts fire because the error rate is zero. The p99 latency alert threshold is 2 seconds, and the 15-second Prometheus scrape window smooths the spike below the threshold.
By the time a human notices, 4,200 ride requests have failed.
The Cause
The rider API treats the surge pricing service as a synchronous dependency with unlimited patience. The WebClient timeout is 30 seconds. The connection pool is shared with every other outbound call. When surge pricing takes 3 seconds per call, the pool fills with waiting connections. New calls to driver matching (20ms response time) cannot acquire a connection because 200 connections are occupied waiting for surge pricing responses.
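The exhaustion follows from Little's law: requests in flight equal the arrival rate multiplied by the time each request takes. The sketch below works through the figures quoted above (4,800 calls per minute, 50ms versus 3 seconds, a pool of 200). The class and method names are illustrative and not taken from the rider API.

```java
// Illustrative only: class and method names are not from the rider API codebase.
class PoolMath {

    // Little's law: average requests in flight = arrival rate x time per request.
    static double inFlight(double callsPerSecond, double latencySeconds) {
        return callsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        double rps = 4_800 / 60.0;   // 4,800 calls per minute = 80 per second
        int poolSize = 200;

        double healthy = inFlight(rps, 0.050);  // 50 ms responses
        double degraded = inFlight(rps, 3.0);   // 3 s responses

        System.out.printf("healthy: %.0f in flight, degraded: %.0f in flight, pool: %d%n",
                healthy, degraded, poolSize);
        System.out.println("pool exhausted: " + (degraded > poolSize));
    }
}
```

At 50ms the surge pricing calls hold about 4 connections at a time. At 3 seconds they need 240, which is more than the whole shared pool before any other dependency gets a connection.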
A circuit breaker solves this by tracking the success rate of calls to a dependency. When the failure rate exceeds a threshold, the circuit “opens” and immediately returns a fallback response without making the network call. After a cooling period, it enters “half-open” state and allows a few test calls through. If those succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again.
The state machine has three states. In CLOSED state (green), all requests flow through normally while the breaker tracks failure rates. When failures exceed 50%, the circuit transitions to OPEN (red), where all requests are immediately rejected with a fallback response — no network call is made. After a 10-second timeout, the circuit moves to HALF_OPEN (yellow), allowing a limited number of test calls through. If those test calls succeed, the circuit closes and normal traffic resumes. If they fail, the circuit reopens.
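The three states can be sketched in plain Java. This is a simplified illustration, not Resilience4j's implementation: it counts calls since the last state change instead of keeping a sliding window, and the class name, constructor parameters, and injected clock are all assumptions made for the example.

```java
import java.util.function.LongSupplier;

// Simplified sketch, not Resilience4j's implementation: it counts calls since the
// last state change instead of maintaining a sliding window.
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int minCalls;            // calls needed before the rate is evaluated
    private final double thresholdPercent; // open at or above this failure rate
    private final long openMillis;         // how long to stay OPEN
    private final int halfOpenCalls;       // test calls permitted in HALF_OPEN
    private final LongSupplier clock;      // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private int issued;
    private long openedAt;

    SimpleBreaker(int minCalls, double thresholdPercent, long openMillis,
                  int halfOpenCalls, LongSupplier clock) {
        this.minCalls = minCalls;
        this.thresholdPercent = thresholdPercent;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    // Current state; OPEN becomes HALF_OPEN once the wait has elapsed.
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            transition(State.HALF_OPEN);
        }
        return state;
    }

    // True if the caller may make the network call, false if it should use the fallback.
    boolean tryAcquire() {
        State current = state();
        if (current == State.OPEN) return false;
        if (current == State.HALF_OPEN) return issued++ < halfOpenCalls;
        return true;
    }

    // Report the outcome of a permitted call.
    void record(boolean success) {
        State current = state();
        if (current == State.OPEN) return;
        calls++;
        if (!success) failures++;
        int needed = (current == State.HALF_OPEN) ? halfOpenCalls : minCalls;
        if (calls < needed) return;
        double failureRate = 100.0 * failures / calls;
        if (failureRate >= thresholdPercent) {
            transition(State.OPEN);
        } else if (current == State.HALF_OPEN) {
            transition(State.CLOSED);
        }
    }

    private void transition(State next) {
        state = next;
        calls = 0;
        failures = 0;
        issued = 0;
        if (next == State.OPEN) openedAt = clock.getAsLong();
    }
}
```

A caller would check `tryAcquire()` before each network call, go straight to the fallback when it returns false, and report the outcome with `record(...)` otherwise. Injecting the clock lets the OPEN to HALF_OPEN transition be exercised in a test without sleeping.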
The Baseline
Surge pricing client without circuit breaker:
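    // BOTTLENECK: 30-second timeout, no circuit breaking
    import java.math.BigDecimal;
    import java.time.Duration;
    import org.springframework.stereotype.Component;
    import org.springframework.web.reactive.function.client.WebClient;
    import reactor.core.publisher.Mono;

    @Component
    public class SurgePricingClient {
        private final WebClient webClient;

        public SurgePricingClient(WebClient webClient) {
            this.webClient = webClient;
        }

        public Mono<BigDecimal> getMultiplier(String zoneId) {
            return webClient.get()
                .uri("/api/surge/{zoneId}", zoneId)
                .retrieve()
                .bodyToMono(SurgeResponse.class)
                .map(SurgeResponse::getMultiplier)
                .timeout(Duration.ofSeconds(30))   // A slow call holds its connection for up to 30 seconds
                .onErrorReturn(BigDecimal.ONE);    // Silent fallback, no tracking
        }
    }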
The onErrorReturn fires only when the call errors or the 30-second timeout elapses. Until then, the call occupies a connection. If calls hang for the full timeout at 500 RPS, 500 × 30 = 15,000 connections are needed to hold the backlog. The pool has 200.
The Fix
Resilience4j CircuitBreaker Configuration
    # SCALED: Circuit breaker tuned for ride-hailing
    resilience4j:
      circuitbreaker:
        instances:
          surgePricing:
            slidingWindowType: COUNT_BASED
            slidingWindowSize: 20
            failureRateThreshold: 50
            waitDurationInOpenState: 10s
            permittedNumberOfCallsInHalfOpenState: 5
            minimumNumberOfCalls: 10
            recordExceptions:
              - java.io.IOException
              - java.util.concurrent.TimeoutException
              - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
              - org.springframework.web.reactive.function.client.WebClientResponseException.GatewayTimeout
            ignoreExceptions:
              - org.springframework.web.reactive.function.client.WebClientResponseException.BadRequest
            automaticTransitionFromOpenToHalfOpenEnabled: true
Configuration breakdown:
| Parameter | Value | Why |
| --- | --- | --- |
| slidingWindowSize | 20 | 20 calls = 4 seconds at 5 RPS per zone |
| failureRateThreshold | 50 | Open when 10 or more of the last 20 calls fail |
| waitDurationInOpenState | 10s | Long enough for a transient fix |
| permittedNumberOfCallsInHalfOpenState | 5 | 5 test calls to confirm recovery |
| minimumNumberOfCalls | 10 | Do not evaluate until 10 calls recorded |
| automaticTransitionFromOpenToHalfOpenEnabled | true | Timer-based, not request-triggered |
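Setting minimumNumberOfCalls to 10 prevents false positives during cold starts. With a lower value such as 3, the first 3 calls timing out (during JVM warmup) would open the circuit at a 100% failure rate on a sample of 3. Leaving the parameter unset is also safe here: the library default is 100, which a count-based window caps at slidingWindowSize.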
Circuit Breaker on the Surge Pricing Client
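    // SCALED: Circuit breaker with fallback to cached multiplier
    import java.math.BigDecimal;
    import java.time.Duration;
    import io.github.resilience4j.circuitbreaker.CircuitBreaker;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
    import io.github.resilience4j.reactor.circuitbreaker.operator.CircuitBreakerOperator;
    import io.micrometer.core.instrument.Metrics;
    import org.springframework.data.redis.core.ReactiveRedisTemplate;
    import org.springframework.stereotype.Component;
    import org.springframework.web.reactive.function.client.WebClient;
    import reactor.core.publisher.Mono;

    @Component
    public class SurgePricingClient {
        private static final Duration CALL_TIMEOUT = Duration.ofSeconds(2);
        private static final String CACHE_PREFIX = "surge:last_known:";

        private final WebClient webClient;
        private final ReactiveRedisTemplate<String, String> redis;
        private final CircuitBreakerRegistry circuitBreakerRegistry;

        public SurgePricingClient(WebClient webClient,
                                  ReactiveRedisTemplate<String, String> redis,
                                  CircuitBreakerRegistry circuitBreakerRegistry) {
            this.webClient = webClient;
            this.redis = redis;
            this.circuitBreakerRegistry = circuitBreakerRegistry;
        }

        public Mono<BigDecimal> getMultiplier(String zoneId) {
            CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("surgePricing");
            return webClient.get()
                .uri("/api/surge/{zoneId}", zoneId)
                .retrieve()
                .bodyToMono(SurgeResponse.class)
                .map(SurgeResponse::getMultiplier)
                .timeout(CALL_TIMEOUT)  // Raises TimeoutException after 2 seconds
                .doOnNext(multiplier -> cacheMultiplier(zoneId, multiplier))
                .transformDeferred(CircuitBreakerOperator.of(cb))
                // One handler covers CallNotPermittedException (circuit open),
                // TimeoutException, and every other failure.
                .onErrorResume(ex -> getFallbackMultiplier(zoneId));
        }

        private Mono<BigDecimal> getFallbackMultiplier(String zoneId) {
            return redis.opsForValue()
                .get(CACHE_PREFIX + zoneId)
                .map(BigDecimal::new)
                .doOnNext(m -> recordFallback(zoneId, "cache"))
                // No cache = no surge. Tagging the source here avoids mislabeling
                // a cached 1.0x multiplier as "default".
                .switchIfEmpty(Mono.fromSupplier(() -> {
                    recordFallback(zoneId, "default");
                    return BigDecimal.ONE;
                }));
        }

        private void recordFallback(String zoneId, String source) {
            Metrics.counter("surge.fallback.used", "zone", zoneId, "source", source)
                .increment();
        }

        private void cacheMultiplier(String zoneId, BigDecimal multiplier) {
            // Fire-and-forget write: a Redis failure must not fail the pricing call.
            redis.opsForValue()
                .set(CACHE_PREFIX + zoneId, multiplier.toString(),
                    Duration.ofMinutes(5))
                .subscribe();
        }
    }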
When the circuit is closed, calls go through normally. Each successful response is cached in Redis with a 5-minute TTL. When the circuit opens (50% failure rate over 20 calls), CallNotPermittedException fires immediately. The fallback reads the last cached multiplier from Redis. If Redis has no cached value, it defaults to 1.0x (no surge).
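The fallback rule (last known value while it is fresh, 1.0x otherwise) can be illustrated without Redis. The sketch below uses an in-memory map as a stand-in for the cache. The class name, the injected clock, and the explicit expiry check are assumptions for the example; in the real client, Redis enforces the TTL itself.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// In-memory stand-in for the Redis "last known multiplier" cache. Illustrative names.
class LastKnownMultiplier {
    private static final class Entry {
        final BigDecimal value;
        final long expiresAtMillis;

        Entry(BigDecimal value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;  // injectable time source, in milliseconds

    LastKnownMultiplier(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Called on every successful response from the surge pricing service.
    void remember(String zoneId, BigDecimal multiplier) {
        cache.put(zoneId, new Entry(multiplier, clock.getAsLong() + ttlMillis));
    }

    // Called when the circuit is open or the call failed.
    BigDecimal fallback(String zoneId) {
        Entry entry = cache.get(zoneId);
        if (entry == null || clock.getAsLong() >= entry.expiresAtMillis) {
            return BigDecimal.ONE;  // no fresh value: no surge
        }
        return entry.value;
    }
}
```

The TTL bounds how stale a fallback price can be. Once an outage outlasts the TTL, every zone falls back to 1.0x, so the TTL is effectively the longest period a stale surge price is allowed to stand.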
The rider gets a ride. The fare might be slightly stale. That is better than no ride.
Circuit Breaker on Driver Matching
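    // SCALED: Circuit breaker with Kafka queue fallback
    import java.time.Duration;
    import io.github.resilience4j.circuitbreaker.CircuitBreaker;
    import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
    import io.github.resilience4j.reactor.circuitbreaker.operator.CircuitBreakerOperator;
    import io.micrometer.core.instrument.Metrics;
    import org.springframework.kafka.core.KafkaTemplate;
    import org.springframework.stereotype.Component;
    import org.springframework.web.reactive.function.client.WebClient;
    import reactor.core.publisher.Mono;

    @Component
    public class DriverMatchingClient {
        private final WebClient webClient;
        private final KafkaTemplate<String, MatchRequest> kafkaTemplate;
        private final CircuitBreakerRegistry circuitBreakerRegistry;

        public DriverMatchingClient(WebClient webClient,
                                    KafkaTemplate<String, MatchRequest> kafkaTemplate,
                                    CircuitBreakerRegistry circuitBreakerRegistry) {
            this.webClient = webClient;
            this.kafkaTemplate = kafkaTemplate;
            this.circuitBreakerRegistry = circuitBreakerRegistry;
        }

        public Mono<MatchResult> findDriver(RideRequest request, FareEstimate fare) {
            CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("driverMatching");
            return webClient.post()
                .uri("/api/matching/find")
                .bodyValue(new MatchRequest(request, fare))
                .retrieve()
                .bodyToMono(MatchResult.class)
                .timeout(Duration.ofSeconds(3))
                .transformDeferred(CircuitBreakerOperator.of(cb))
                // Any failure, including an open circuit, queues the request
                .onErrorResume(ex -> queueForRetry(request, fare));
        }

        private Mono<MatchResult> queueForRetry(RideRequest request, FareEstimate fare) {
            // Mono.fromFuture needs the CompletableFuture that send() returns in Spring Kafka 3.x
            return Mono.fromFuture(
                    kafkaTemplate.send("driver-matching-retry",
                        request.getRideId(),
                        new MatchRequest(request, fare)))
                .map(result -> MatchResult.pending(request.getRideId()))
                .doOnNext(r -> Metrics.counter("matching.fallback.queued").increment());
        }
    }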
When driver matching is down, the match request goes to Kafka. A consumer picks it up when the service recovers. The rider sees “Finding your driver…” instead of an error. The match completes asynchronously, and a push notification confirms the assignment.
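The queue-and-drain flow can be sketched with an in-memory queue standing in for the Kafka topic. Everything here is hypothetical scaffolding for illustration (the class, the `Matcher` interface, the string statuses). A real consumer would be a Kafka listener and would also need to handle ordering, duplicates, and expiry of old requests.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch: an in-memory queue stands in for the driver-matching-retry topic.
class RetryQueueSketch {
    // Hypothetical matcher: returns a driver id or throws when matching is unavailable.
    interface Matcher {
        String match(String rideId) throws Exception;
    }

    private final Queue<String> retryQueue = new ArrayDeque<>();
    private final Map<String, String> assignments = new HashMap<>();

    // Request path: try synchronously, queue on failure, report PENDING to the rider.
    String findDriver(String rideId, Matcher matcher) {
        try {
            assignments.put(rideId, matcher.match(rideId));
            return "MATCHED";
        } catch (Exception e) {
            retryQueue.add(rideId);
            return "PENDING";
        }
    }

    // Consumer path: retry each queued ride once; rides that still fail are re-queued.
    int drain(Matcher matcher) {
        int completed = 0;
        int attempts = retryQueue.size();
        for (int i = 0; i < attempts; i++) {
            String rideId = retryQueue.poll();
            try {
                assignments.put(rideId, matcher.match(rideId));
                completed++;
            } catch (Exception e) {
                retryQueue.add(rideId);
            }
        }
        return completed;
    }

    // Returns the assigned driver id, or null while the ride is still pending.
    String driverFor(String rideId) {
        return assignments.get(rideId);
    }
}
```

The request path never surfaces the dependency failure to the rider. The only visible difference is that the assignment arrives later, from the consumer path.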
Prometheus Metrics
Resilience4j exposes circuit breaker metrics automatically via Micrometer:
    # SCALED: Prometheus metrics exposed by Resilience4j
    # resilience4j_circuitbreaker_state{name="surgePricing", state="open"}   1 while the breaker is in that state, otherwise 0
    #   (one series per state: closed, open, half_open; older releases exposed a single gauge with 0=closed, 1=open, 2=half_open)
    # resilience4j_circuitbreaker_calls_seconds_count{name="surgePricing", kind="successful"}
    # resilience4j_circuitbreaker_calls_seconds_count{name="surgePricing", kind="failed"}
    # resilience4j_circuitbreaker_failure_rate{name="surgePricing"}
    # resilience4j_circuitbreaker_not_permitted_calls_total{name="surgePricing"}
Alert on circuit breaker state change (Prometheus rule syntax):

    # SCALED: Alert rule
    - alert: CircuitBreakerOpen
      # state="open" selects the per-state gauge; without it the expression also matches state="closed"
      expr: resilience4j_circuitbreaker_state{name=~"surgePricing|driverMatching", state="open"} == 1
      for: 0s
      labels:
        severity: warning
      annotations:
        summary: "Circuit breaker {{ $labels.name }} is OPEN"
        description: "Dependency {{ $labels.name }} has exceeded failure threshold"
The alert fires on the first rule evaluation after Prometheus scrapes an open circuit. Not after 5 minutes. Not after a human checks a dashboard. The circuit breaker is the alerting mechanism.
The Too-Aggressive Circuit Breaker
First deployment. Monday morning. The rider API starts up. The JVM is cold. The first 5 calls to surge pricing are slow, 800ms and up (JIT compilation, class loading, connection establishment). The circuit breaker configuration:
    # BOTTLENECK: Circuit breaker opens on cold start
    resilience4j:
      circuitbreaker:
        instances:
          surgePricing:
            slidingWindowSize: 5          # Only 5 calls in the window
            failureRateThreshold: 60      # 60% threshold
            waitDurationInOpenState: 10s
            minimumNumberOfCalls: 3       # Evaluate after just 3 calls
Three of the first 5 calls exceed the 2-second timeout (cold JVM). That is a 60% failure rate. The circuit opens. For 10 seconds, all surge pricing calls return the fallback. During those 10 seconds, the JVM warms up. The half-open test calls succeed. The circuit closes. Total disruption: 10 seconds of stale surge pricing on every pod restart, every deployment, every scaling event.
The fix:
    # SCALED: Circuit breaker that survives cold starts
    resilience4j:
      circuitbreaker:
        instances:
          surgePricing:
            slidingWindowSize: 20         # 20-call window
            failureRateThreshold: 50      # 50% threshold
            waitDurationInOpenState: 10s
            minimumNumberOfCalls: 10      # Wait for 10 calls before evaluating
With minimumNumberOfCalls: 10, the circuit breaker does not evaluate the failure rate until 10 calls have been recorded. By call 10, the JVM is warm. The 3 slow calls from startup are outnumbered by 7 fast calls. The failure rate is 30%, below the 50% threshold. The circuit stays closed.
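The two configurations can be compared with a few lines of arithmetic. The helper below is a hypothetical illustration of the rule described here (evaluate only once enough calls are recorded, open at or above the threshold). It is not a Resilience4j API.

```java
// Hypothetical helper illustrating the evaluation rule; not part of Resilience4j.
class FailureRateCheck {

    // True when the breaker would open: enough calls recorded and rate at or above threshold.
    static boolean wouldOpen(int failed, int recorded,
                             int minimumNumberOfCalls, double thresholdPercent) {
        if (recorded < minimumNumberOfCalls) {
            return false;  // not enough data yet, the rate is not evaluated
        }
        return 100.0 * failed / recorded >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Aggressive config: 3 timeouts in the first 5 calls, minimum 3, threshold 60%
        System.out.println("aggressive, after 5 calls: " + wouldOpen(3, 5, 3, 60));

        // Tuned config: the same 3 timeouts, but evaluation waits for 10 calls
        System.out.println("tuned, after 5 calls:  " + wouldOpen(3, 5, 10, 50));
        System.out.println("tuned, after 10 calls: " + wouldOpen(3, 10, 10, 50));
    }
}
```

With the aggressive settings, 3 failures in 5 calls is a 60% rate and meets the threshold. With the tuned settings, the same 3 failures are first ignored (too few calls) and then amount to 30% of 10 calls, so the circuit stays closed.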
The Proof
Locust test: surge pricing returns 503 for 30 seconds starting at T+60s.
    # SCALED: Locust with timed surge pricing failure
    # The 503 window (T+60s to T+90s) is triggered outside this script; this file only drives booking load.
    from locust import HttpUser, task, between


    class SurgePricingFailureTest(HttpUser):
        wait_time = between(0.05, 0.2)

        @task
        def book_ride(self):
            payload = {
                "riderId": f"rider-{self.environment.runner.user_count}",
                "pickupLat": 40.7128, "pickupLng": -74.0060,
                "dropoffLat": 40.7580, "dropoffLng": -73.9855,
                "zoneId": "manhattan-midtown"
            }
            self.client.post("/api/rides/book", json=payload)
Results during the 30-second surge pricing outage:
| Phase | Duration | p99 Latency | Error Rate | Booking Rate |
| --- | --- | --- | --- | --- |
| Before failure | 60s | 410ms | 0.03% | 4,980 RPS |
| Circuit still closed (failing) | 8s | 1,200ms | 2.1% | 4,100 RPS |
| Circuit open | 22s | 95ms | 0.08% | 4,920 RPS |
| Recovery (half-open) | 3s | 180ms | 0.3% | 4,850 RPS |
| After recovery | 60s | 400ms | 0.03% | 4,980 RPS |
During the 8 seconds before the circuit opened, the error rate climbed to 2.1%. Those are the calls that hit the surge pricing service, timed out at 2 seconds, and retried. After the circuit opened, the fallback returned cached multipliers in under 1ms. The p99 dropped to 95ms because the surge pricing network call was eliminated entirely.
The 22-second open period was invisible to riders. Rides booked at the last cached surge price. When the surge pricing service recovered and the circuit moved to half-open, 5 test calls succeeded, the circuit closed, and live surge prices resumed.
Total rider impact: 2.1% error rate for 8 seconds. That is about 690 failed requests out of 32,800 (4,100 RPS × 8 seconds) during the 8-second window. Without the circuit breaker, the 30-second outage would have produced a 77% error rate for 30+ seconds: over 115,000 failed requests.
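The failed-request counts come from multiplying each phase's error rate by its request rate and duration. A small sketch of that arithmetic, using the error rates and request rates reported above (the class and method names are illustrative):

```java
// Illustrative arithmetic only; inputs are the figures from the results above.
class ImpactMath {

    // Failed requests = error rate x request rate x duration, rounded to whole requests.
    static long failedRequests(double errorRatePercent, double requestsPerSecond, double seconds) {
        return Math.round(errorRatePercent / 100.0 * requestsPerSecond * seconds);
    }

    public static void main(String[] args) {
        // With the breaker: 2.1% of 4,100 RPS over the 8 seconds before the circuit opened
        System.out.println("with breaker:    " + failedRequests(2.1, 4_100, 8));

        // Without the breaker: 77% of 4,980 RPS over the 30-second outage
        System.out.println("without breaker: " + failedRequests(77, 4_980, 30));
    }
}
```

The second figure is a lower bound for the no-breaker case, since it assumes the errors stop the moment the outage ends rather than persisting while the connection pool drains.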