surviving the spike

Circuit Breakers with Resilience4j

8 min read Chapter 53 of 66

The Symptom

The rider API calls the surge pricing service 4,800 times per minute during peak hours. The surge pricing service deploys a bad config change. Response times jump from 50ms to 3 seconds. Within 12 seconds, the rider API’s connection pool is exhausted. Ride bookings fail across all zones, including zones where surge pricing is not active and the multiplier would be 1.0x.

The surge pricing service is not returning errors. It is returning correct responses, slowly. The HTTP status codes are all 200. No alerts fire because the error rate is zero. The p99 latency alert threshold is 2 seconds, and the 15-second Prometheus scrape window smooths the spike below the threshold.

By the time a human notices, 4,200 ride requests have failed.

The Cause

The rider API treats the surge pricing service as a synchronous dependency with unlimited patience. The WebClient timeout is 30 seconds. The connection pool is shared with every other outbound call. When surge pricing takes 3 seconds per call, the pool fills with waiting connections. New calls to driver matching (20ms response time) cannot acquire a connection because 200 connections are occupied waiting for surge pricing responses.
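The pool math can be checked with a back-of-the-envelope calculation: connections in flight are roughly the call rate multiplied by how long each call holds a connection. The sketch below is an illustration only. It assumes steady-state traffic, counts only surge pricing calls, and uses a hypothetical class name, PoolDemand; the 4,800 calls per minute, the 50ms and 3-second latencies, and the 200-connection pool come from the text.

```java
// Sketch: steady-state connections in flight = arrival rate x time each call holds a connection.
final class PoolDemand {

    /** Connections held at once, rounded up, for a given call rate and per-call latency. */
    static long connectionsInFlight(long callsPerMinute, long latencyMillis) {
        long millisPerMinute = 60_000L;
        // ceil(callsPerMinute * latencyMillis / 60000) using integer arithmetic
        return (callsPerMinute * latencyMillis + millisPerMinute - 1) / millisPerMinute;
    }

    public static void main(String[] args) {
        long poolSize = 200;          // shared outbound pool size from the text
        long callsPerMinute = 4_800;  // surge pricing call rate from the text

        long healthy = connectionsInFlight(callsPerMinute, 50);     // 50ms responses
        long degraded = connectionsInFlight(callsPerMinute, 3_000); // 3s responses

        System.out.println("healthy=" + healthy + " degraded=" + degraded
            + " poolExhausted=" + (degraded > poolSize));
        // prints: healthy=4 degraded=240 poolExhausted=true
    }
}
```

At 50ms, surge pricing needs about 4 connections. At 3 seconds it needs about 240, which is more than the whole shared pool before any other dependency gets a turn.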

A circuit breaker solves this by tracking the success rate of calls to a dependency. When the failure rate exceeds a threshold, the circuit “opens” and immediately returns a fallback response without making the network call. After a cooling period, it enters “half-open” state and allows a few test calls through. If those succeed, the circuit closes and normal traffic resumes. If they fail, the circuit opens again.

Circuit breaker state machine diagram showing three states: CLOSED (green, normal flow), OPEN (red, fail fast), and HALF_OPEN (yellow, test traffic), with labeled transitions between them

The state machine has three states. In CLOSED state (green), all requests flow through normally while the breaker tracks the failure rate. When the failure rate reaches the threshold (50% here), the circuit transitions to OPEN (red), where every request is rejected immediately and served the fallback response; no network call is made. After a 10-second wait, the circuit moves to HALF_OPEN (yellow) and lets a limited number of test calls through. If those test calls succeed, the circuit closes and normal traffic resumes. If they fail, the circuit reopens.
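The transitions above can be sketched in plain Java. This is a deliberately small, single-threaded illustration and not how Resilience4j is implemented: the class name TinyCircuitBreaker, its constructor parameters, and the injected clock are all assumptions made for the example. It keeps a count-based window, opens when the failure rate reaches the threshold, and uses the clock to decide when OPEN becomes HALF_OPEN.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Illustrative only: a single-threaded, count-based breaker. Not Resilience4j's implementation.
final class TinyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;          // calls remembered while CLOSED
    private final int minimumCalls;        // calls required before the rate is evaluated
    private final double failureThreshold; // percent, e.g. 50.0
    private final long openMillis;         // how long to stay OPEN before testing
    private final int halfOpenCalls;       // test calls evaluated in HALF_OPEN
    private final LongSupplier clock;      // injectable time source, in milliseconds

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    TinyCircuitBreaker(int windowSize, int minimumCalls, double failureThreshold,
                       long openMillis, int halfOpenCalls, LongSupplier clock) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
        this.clock = clock;
    }

    /** Current state; moves OPEN -> HALF_OPEN once the wait has elapsed. */
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
            window.clear();
        }
        return state;
    }

    /** False means: skip the network call and use the fallback. */
    boolean allowCall() {
        return state() != State.OPEN;
    }

    /** Record the outcome of a call that was allowed through. */
    void record(boolean failed) {
        State current = state();
        if (current == State.OPEN) return; // the call was never made
        int limit = current == State.HALF_OPEN ? halfOpenCalls : windowSize;
        int required = current == State.HALF_OPEN ? halfOpenCalls : minimumCalls;
        window.addLast(failed);
        if (window.size() > limit) window.removeFirst();
        if (window.size() < required) return; // not enough data to judge yet
        long failures = window.stream().filter(f -> f).count();
        double rate = 100.0 * failures / window.size();
        if (rate >= failureThreshold) {
            state = State.OPEN;              // trip (or re-trip) the breaker
            openedAt = clock.getAsLong();
            window.clear();
        } else if (current == State.HALF_OPEN) {
            state = State.CLOSED;            // test calls passed, resume normal traffic
            window.clear();
        }
    }

    public static void main(String[] args) {
        long[] now = {0};
        TinyCircuitBreaker cb = new TinyCircuitBreaker(20, 10, 50.0, 10_000, 5, () -> now[0]);
        for (int i = 0; i < 10; i++) cb.record(true);  // 10 failures in a row
        System.out.println(cb.state());                 // OPEN
        now[0] = 10_000;                                // 10 seconds later
        System.out.println(cb.state());                 // HALF_OPEN
        for (int i = 0; i < 5; i++) cb.record(false);  // 5 successful test calls
        System.out.println(cb.state());                 // CLOSED
    }
}
```

The injected clock is a design choice for the sketch: it lets the 10-second wait be simulated without sleeping. A production breaker also has to be thread-safe and cap concurrent test calls in HALF_OPEN, which this sketch does not attempt.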

The Baseline

Surge pricing client without circuit breaker:

// BOTTLENECK: 30-second timeout, no circuit breaking
@Component
public class SurgePricingClient {

    private final WebClient webClient;

    public Mono<BigDecimal> getMultiplier(String zoneId) {
        return webClient.get()
            .uri("/api/surge/{zoneId}", zoneId)
            .retrieve()
            .bodyToMono(SurgeResponse.class)
            .map(SurgeResponse::getMultiplier)
            .timeout(Duration.ofSeconds(30))
            .onErrorReturn(BigDecimal.ONE); // Silent fallback, no tracking
    }
}

For a call that hangs, onErrorReturn only fires once the 30-second timeout expires. Until then, the call holds a connection. At 500 RPS, a 30-second hold means 500 × 30 = 15,000 connections in flight to absorb the backlog. The pool has 200.

The Fix

Resilience4j CircuitBreaker Configuration

# SCALED: Circuit breaker tuned for ride-hailing
resilience4j:
  circuitbreaker:
    instances:
      surgePricing:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 5
        minimumNumberOfCalls: 10
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.reactive.function.client.WebClientResponseException.ServiceUnavailable
          - org.springframework.web.reactive.function.client.WebClientResponseException.GatewayTimeout
        ignoreExceptions:
          - org.springframework.web.reactive.function.client.WebClientResponseException.BadRequest
        automaticTransitionFromOpenToHalfOpenEnabled: true

Configuration breakdown:

Parameter                                      Value   Why
slidingWindowSize                              20      20 calls = 4 seconds at 5 RPS per zone
failureRateThreshold                           50      Open once 10 of 20 calls fail
waitDurationInOpenState                        10s     Long enough for a transient fix
permittedNumberOfCallsInHalfOpenState          5       5 test calls to confirm recovery
minimumNumberOfCalls                           10      Do not evaluate until 10 calls recorded
automaticTransitionFromOpenToHalfOpenEnabled   true    Timer-based, not request-triggered

The minimumNumberOfCalls at 10 prevents false positives during cold starts. With a low value such as 3, the first 3 calls timing out (during JVM warmup) would open the circuit at a 100% failure rate computed from just 3 calls.

Circuit Breaker on the Surge Pricing Client

// SCALED: Circuit breaker with fallback to cached multiplier
@Component
public class SurgePricingClient {

    private final WebClient webClient;
    private final ReactiveRedisTemplate<String, String> redis;
    private final CircuitBreakerRegistry circuitBreakerRegistry;

    private static final Duration CALL_TIMEOUT = Duration.ofSeconds(2);
    private static final String CACHE_PREFIX = "surge:last_known:";

    public Mono<BigDecimal> getMultiplier(String zoneId) {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("surgePricing");

        return webClient.get()
            .uri("/api/surge/{zoneId}", zoneId)
            .retrieve()
            .bodyToMono(SurgeResponse.class)
            .map(SurgeResponse::getMultiplier)
            .timeout(CALL_TIMEOUT)
            .doOnNext(multiplier -> cacheMultiplier(zoneId, multiplier))
            .transformDeferred(CircuitBreakerOperator.of(cb))
            .onErrorResume(CallNotPermittedException.class,
                ex -> getFallbackMultiplier(zoneId))
            .onErrorResume(TimeoutException.class,
                ex -> getFallbackMultiplier(zoneId))
            .onErrorResume(ex -> getFallbackMultiplier(zoneId));
    }

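    private Mono<BigDecimal> getFallbackMultiplier(String zoneId) {
        return redis.opsForValue()
            .get(CACHE_PREFIX + zoneId)
            .map(BigDecimal::new)
            // Tag by where the value came from, not by its value:
            // a cached multiplier of 1 is still a cache hit
            .doOnNext(m -> recordFallback(zoneId, "cache"))
            .switchIfEmpty(Mono.fromSupplier(() -> {
                recordFallback(zoneId, "default"); // No cache = no surge
                return BigDecimal.ONE;
            }));
    }

    private void recordFallback(String zoneId, String source) {
        Metrics.counter("surge.fallback.used", "zone", zoneId, "source", source).increment();
    }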

    private void cacheMultiplier(String zoneId, BigDecimal multiplier) {
        redis.opsForValue()
            .set(CACHE_PREFIX + zoneId, multiplier.toString(),
                Duration.ofMinutes(5))
            .subscribe();
    }
}

When the circuit is closed, calls go through normally. Each successful response is cached in Redis with a 5-minute TTL. When the circuit opens (50% failure rate over 20 calls), CallNotPermittedException fires immediately. The fallback reads the last cached multiplier from Redis. If Redis has no cached value, it defaults to 1.0x (no surge).

The rider gets a ride. The fare might be slightly stale. That is better than no ride.

Circuit Breaker on Driver Matching

// SCALED: Circuit breaker with Kafka queue fallback
@Component
public class DriverMatchingClient {

    private final WebClient webClient;
    private final KafkaTemplate<String, MatchRequest> kafkaTemplate;
    private final CircuitBreakerRegistry circuitBreakerRegistry;

    public Mono<MatchResult> findDriver(RideRequest request, FareEstimate fare) {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("driverMatching");

        return webClient.post()
            .uri("/api/matching/find")
            .bodyValue(new MatchRequest(request, fare))
            .retrieve()
            .bodyToMono(MatchResult.class)
            .timeout(Duration.ofSeconds(3))
            .transformDeferred(CircuitBreakerOperator.of(cb))
            .onErrorResume(ex -> queueForRetry(request, fare));
    }

    private Mono<MatchResult> queueForRetry(RideRequest request, FareEstimate fare) {
        return Mono.fromFuture(
            kafkaTemplate.send("driver-matching-retry",
                request.getRideId(),
                new MatchRequest(request, fare))
        ).map(result -> MatchResult.pending(request.getRideId()))
         .doOnNext(r -> Metrics.counter("matching.fallback.queued").increment());
    }
}

When driver matching is down, the match request goes to Kafka. A consumer picks it up when the service recovers. The rider sees “Finding your driver…” instead of an error. The match completes asynchronously, and a push notification confirms the assignment.

Prometheus Metrics

Resilience4j exposes circuit breaker metrics automatically via Micrometer:

# SCALED: Prometheus metrics exposed by Resilience4j
# resilience4j_circuitbreaker_state{name="surgePricing", state="open"} 1 while the breaker is in that state, 0 otherwise (one series per state)
# resilience4j_circuitbreaker_calls_seconds_count{name="surgePricing", kind="successful"}
# resilience4j_circuitbreaker_calls_seconds_count{name="surgePricing", kind="failed"}
# resilience4j_circuitbreaker_failure_rate{name="surgePricing"}
# resilience4j_circuitbreaker_not_permitted_calls_total{name="surgePricing"}

Grafana alert on circuit breaker state change:

# SCALED: Grafana alert rule
- alert: CircuitBreakerOpen
  expr: resilience4j_circuitbreaker_state{name=~"surgePricing|driverMatching", state="open"} == 1
  for: 0s
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker {{ $labels.name }} is OPEN"
    description: "Dependency {{ $labels.name }} has exceeded failure threshold"

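With for: 0s, the alert fires on the first rule evaluation that sees the circuit open, so detection lags by roughly one scrape interval plus one evaluation interval at worst. Not after 5 minutes. Not after a human checks a dashboard. The circuit breaker is the alerting mechanism.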

The Too-Aggressive Circuit Breaker

First deployment. Monday morning. The rider API starts up. The JVM is cold. The first 5 calls to surge pricing are slow, 800ms or more each (JIT compilation, class loading, connection establishment), and some run past the 2-second call timeout. The circuit breaker configuration:

# BOTTLENECK: Circuit breaker opens on cold start
resilience4j:
  circuitbreaker:
    instances:
      surgePricing:
        slidingWindowSize: 5 # Only 5 calls in the window
        failureRateThreshold: 60 # 60% threshold
        waitDurationInOpenState: 10s
        minimumNumberOfCalls: 3 # Evaluate after just 3 calls

Three of the first 5 calls exceed the 2-second timeout (cold JVM). That is a 60% failure rate. The circuit opens. For 10 seconds, all surge pricing calls return the fallback. During those 10 seconds, the JVM warms up. The half-open test calls succeed. The circuit closes. Total disruption: 10 seconds of stale surge pricing on every pod restart, every deployment, every scaling event.

The fix:

# SCALED: Circuit breaker that survives cold starts
resilience4j:
  circuitbreaker:
    instances:
      surgePricing:
        slidingWindowSize: 20 # 20-call window
        failureRateThreshold: 50 # 50% threshold
        waitDurationInOpenState: 10s
        minimumNumberOfCalls: 10 # Wait for 10 calls before evaluating

With minimumNumberOfCalls: 10, the circuit breaker does not evaluate the failure rate until 10 calls have been recorded. By call 10, the JVM is warm. The 3 slow calls from startup are outnumbered by 7 fast calls. The failure rate is 30%, below the 50% threshold. The circuit stays closed.
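The difference between the two configurations comes down to one check: is there enough data, and if so, has the failure percentage reached the threshold? The sketch below replays the cold-start numbers from the text against both settings. The class and method names are hypothetical, and the check is a simplification of what the library does internally.

```java
// Sketch of the open/stay-closed decision, using the cold-start numbers from the text.
final class FailureRateCheck {

    /** True when enough calls are recorded and the failure percentage reaches the threshold. */
    static boolean shouldOpen(int failures, int recordedCalls,
                              int minimumNumberOfCalls, double failureRateThreshold) {
        if (recordedCalls < minimumNumberOfCalls) {
            return false; // too few calls to judge; the breaker stays closed
        }
        double failureRate = 100.0 * failures / recordedCalls;
        return failureRate >= failureRateThreshold;
    }

    public static void main(String[] args) {
        // Aggressive config: minimumNumberOfCalls 3, threshold 60, 3 of the first 5 calls fail
        System.out.println(shouldOpen(3, 5, 3, 60));   // true  (60% reaches the 60% threshold)

        // Tuned config, same moment: only 5 calls recorded, 10 required
        System.out.println(shouldOpen(3, 5, 10, 50));  // false (not evaluated yet)

        // Tuned config at call 10: 3 slow startup calls, 7 fast ones
        System.out.println(shouldOpen(3, 10, 10, 50)); // false (30% is below 50%)
    }
}
```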

The Proof

Locust test: surge pricing returns 503 for 30 seconds starting at T+60s.

# SCALED: Locust with timed surge pricing failure
# (the 503 window itself is triggered outside this snippet)
from locust import HttpUser, task, between

class SurgePricingFailureTest(HttpUser):
    wait_time = between(0.05, 0.2)

    @task
    def book_ride(self):
        payload = {
            "riderId": f"rider-{self.environment.runner.user_count}",
            "pickupLat": 40.7128, "pickupLng": -74.0060,
            "dropoffLat": 40.7580, "dropoffLng": -73.9855,
            "zoneId": "manhattan-midtown"
        }
        self.client.post("/api/rides/book", json=payload)

Results during the 30-second surge pricing outage:

Phase              Duration   p99 Latency   Error Rate   Booking Rate
Before failure     60s        410ms         0.03%        4,980 RPS
Before opening     8s         1,200ms       2.1%         4,100 RPS
Circuit open       22s        95ms          0.08%        4,920 RPS
Recovery (half)    3s         180ms         0.3%         4,850 RPS
After recovery     60s        400ms         0.03%        4,980 RPS

During the 8 seconds before the circuit opened, the error rate climbed to 2.1%. Those are the calls that hit the surge pricing service, timed out at 2 seconds, and retried. After the circuit opened, the fallback returned cached multipliers in under 1ms. The p99 dropped to 95ms because the surge pricing network call was eliminated entirely.

The 22-second open period was invisible to riders. Rides booked at the last cached surge price. When the surge pricing service recovered and the circuit moved to half-open, 5 test calls succeeded, the circuit closed, and live surge prices resumed.

Total rider impact: 2.1% error rate for 8 seconds, which is about 689 failed requests out of 32,800 during that window (4,100 RPS × 8 s). Without the circuit breaker, the 30-second outage would have produced a 77% error rate for 30+ seconds: over 115,000 failed requests.
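The impact totals are rate × duration × error rate. A small sketch that recomputes them from the figures in the results table and the paragraph above; the class and method names are made up for the illustration, and rounding to whole requests is an assumption.

```java
// Sketch: failed requests = requests per second x seconds x error rate.
final class OutageImpact {

    /** Failed requests for a phase, rounded to the nearest whole request. */
    static long failedRequests(long requestsPerSecond, long seconds, double errorRate) {
        return Math.round(requestsPerSecond * seconds * errorRate);
    }

    public static void main(String[] args) {
        long windowRequests = 4_100L * 8;                        // requests in the 8-second window
        long withBreaker = failedRequests(4_100, 8, 0.021);      // 2.1% errors before the circuit opens
        long withoutBreaker = failedRequests(4_980, 30, 0.77);   // 77% errors for the full 30-second outage

        System.out.println(windowRequests);   // 32800
        System.out.println(withBreaker);      // 689
        System.out.println(withoutBreaker);   // 115038
    }
}
```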