resilience patterns in production

Interaction Effects and Debugging Composed Patterns

Chapter 20 of 40

When five patterns wrap a single call, failures produce cascading effects through the decorator chain. Understanding these interactions is the difference between diagnosing a problem in minutes and spending hours wondering why your circuit breaker never opens.

Interaction: Retry + Circuit Breaker

Expected behavior: Retry catches the exception, waits for backoff, retries. If the circuit breaker opens during the retry sequence, the next retry attempt gets CallNotPermittedException, which should not be retried.

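Bug: The retry configuration lists CallNotPermittedException under retry-exceptions, or configures no exception filtering at all, in which case Resilience4j retries every exception by default. The retry keeps attempting even though the circuit breaker is open. Each attempt is rejected instantly by the circuit breaker (no thread consumption), but the retry still sleeps through its backoff between attempts. Added latency: the sum of the backoff waits, one wait after each failed attempt except the last (max-attempts - 1 waits in total), for no useful work.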

Fix:

resilience4j:
  retry:
    instances:
      fraudDetection:
        ignore-exceptions:
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
        # Never retry when the circuit breaker is open.
        # The breaker is open because the dependency is confirmed broken.
        # Retrying is pointless.
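To put a number on the wasted time, the sketch below sums the backoff waits for a retry sequence in which every attempt is rejected instantly. The class and method names are hypothetical, and the exponential backoff formula (initial wait multiplied by a constant factor after each attempt) is an assumption about the configured interval function, not something read from Resilience4j.

```java
// Hypothetical helper: estimates the time spent sleeping between retry attempts
// when every attempt fails immediately (e.g. rejected by an open circuit breaker).
final class WastedBackoff {

    // maxAttempts counts the initial call, so there are (maxAttempts - 1) waits.
    // Assumes exponential backoff: each wait is the previous wait times `multiplier`.
    static long totalWaitMillis(int maxAttempts, long initialWaitMillis, double multiplier) {
        long total = 0;
        double wait = initialWaitMillis;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            total += Math.round(wait);
            wait *= multiplier;
        }
        return total;
    }

    public static void main(String[] args) {
        // 4 attempts, 500 ms initial wait, doubling: 500 + 1000 + 2000 ms of sleep.
        System.out.println(totalWaitMillis(4, 500, 2.0) + " ms spent waiting"); // 3500 ms spent waiting
    }
}
```

With the exception ignored, the same call fails after a single rejected attempt and goes straight to the fallback.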

Interaction: Bulkhead + TimeLimiter

Expected behavior: Bulkhead rejects immediately when full. TimeLimiter cancels the call after its timeout.

Bug: Bulkhead max-wait-duration is 5 seconds. TimeLimiter timeout-duration is 2 seconds. A call arrives when the bulkhead is full and waits for a permit. After 2 seconds, the TimeLimiter fires and cancels the call. The thread was blocked for 2 seconds waiting for a bulkhead permit, doing no useful work, and the caller sees a timeout instead of a bulkhead rejection even though the call never reached the dependency. The TimeLimiter timeout effectively replaces the bulkhead wait timeout.

Fix: Bulkhead max-wait-duration must be shorter than TimeLimiter timeout-duration. The bulkhead should reject fast so the TimeLimiter budget is available for actual work.

resilience4j:
  bulkhead:
    instances:
      fraudDetection:
        max-wait-duration: 100ms # Short: reject fast
  timelimiter:
    instances:
      fraudDetection:
        timeout-duration: 2s # Budget for actual work, not waiting
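The relationship between the two durations can be checked at startup. The following is a minimal sketch of such a check, assuming the two values have already been read from configuration; the class and method names are hypothetical and are not part of Resilience4j.

```java
import java.time.Duration;

// Hypothetical startup check for the bulkhead / TimeLimiter relationship.
final class TimeBudgetCheck {

    // Worst-case time left for the real call after waiting for a bulkhead permit.
    static Duration workBudget(Duration bulkheadMaxWait, Duration timeLimit) {
        return timeLimit.minus(bulkheadMaxWait);
    }

    // Rejects configurations where waiting for a permit can consume the whole timeout.
    static void validate(Duration bulkheadMaxWait, Duration timeLimit) {
        if (bulkheadMaxWait.compareTo(timeLimit) >= 0) {
            throw new IllegalArgumentException(
                "bulkhead max-wait-duration (" + bulkheadMaxWait
                + ") must be shorter than timelimiter timeout-duration (" + timeLimit + ")");
        }
    }

    public static void main(String[] args) {
        Duration maxWait = Duration.ofMillis(100);
        Duration timeout = Duration.ofSeconds(2);
        validate(maxWait, timeout);
        System.out.println("work budget: " + workBudget(maxWait, timeout).toMillis() + " ms"); // work budget: 1900 ms
    }
}
```

Passing the buggy values (5 seconds and 2 seconds) to validate throws, which turns a silent misconfiguration into a failed deployment.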

The Diagnostic Dashboard

When a dependency degrades, the metrics tell you which pattern detected the problem:

# Which pattern is firing?
# Names follow the Resilience4j Micrometer binders as exported to Prometheus.
# They can differ by version, so confirm them against your own scrape output.
# Bulkhead and rate limiter expose gauges, not call counters, so those two
# checks test for saturation rather than a rejection rate.
rate(resilience4j_circuitbreaker_not_permitted_calls_total[5m]) > 0
  -> Circuit breaker is open. Dependency confirmed broken.

resilience4j_bulkhead_available_concurrent_calls == 0
  -> Bulkhead full. Dependency is slow (consuming all permits).

resilience4j_ratelimiter_available_permissions <= 0
  -> Rate limit exceeded. Traffic spike or dependency rate limit hit.

rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m]) > 0
  -> Retries exhausted. Dependency returning intermittent errors.

rate(resilience4j_timelimiter_calls_total{kind="timeout"}[5m]) > 0
  -> Total time budget exceeded. Dependency too slow for entire chain.

A Grafana dashboard with one panel per pattern, per dependency, gives immediate visibility into which pattern is active and why. When the on-call engineer sees the circuit breaker panel turn red, they know the dependency is broken. When they see the bulkhead panel turn amber, they know the dependency is slow but not completely broken. The pattern that fires first tells you the nature of the degradation.
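The "first pattern to fire" rule can also be written down as a small triage helper. This is only a sketch: the class, the enum, and the idea of feeding it first-fired timestamps taken from alert history are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical triage helper: picks the pattern that fired first and maps it to a likely cause.
final class FirstSignalTriage {

    enum Pattern {
        CIRCUIT_BREAKER("dependency confirmed broken"),
        BULKHEAD("dependency slow, permits exhausted"),
        RATE_LIMITER("traffic spike or rate limit hit"),
        RETRY("intermittent errors, retries exhausted"),
        TIME_LIMITER("dependency too slow for the whole chain");

        final String likelyCause;

        Pattern(String likelyCause) {
            this.likelyCause = likelyCause;
        }
    }

    // firstFiredAtEpochSeconds: when each pattern's alert first became non-zero.
    // Returns null when nothing has fired.
    static Pattern firstToFire(Map<Pattern, Long> firstFiredAtEpochSeconds) {
        Pattern earliest = null;
        long earliestTime = Long.MAX_VALUE;
        for (Map.Entry<Pattern, Long> entry : firstFiredAtEpochSeconds.entrySet()) {
            if (entry.getValue() < earliestTime) {
                earliestTime = entry.getValue();
                earliest = entry.getKey();
            }
        }
        return earliest;
    }

    public static void main(String[] args) {
        Map<Pattern, Long> fired = new LinkedHashMap<>();
        fired.put(Pattern.CIRCUIT_BREAKER, 1_700_000_090L);
        fired.put(Pattern.BULKHEAD, 1_700_000_030L); // bulkhead saturated a minute earlier
        Pattern first = firstToFire(fired);
        System.out.println(first + ": " + first.likelyCause); // BULKHEAD: dependency slow, permits exhausted
    }
}
```

In this example the breaker opened only after the bulkhead had been saturated for a minute, which points to a slowdown that later escalated rather than an outright crash.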

The “Everything Is Red” Scenario

When a dependency crashes completely, multiple patterns fire simultaneously:

  1. HTTP client throws ConnectException (connection refused)
  2. Circuit breaker records failure, failure rate climbs
  3. Once minimum-number-of-calls calls have been recorded and the failure rate reaches failure-rate-threshold, circuit breaker opens
  4. Subsequent calls get CallNotPermittedException
  5. Retry sees CallNotPermittedException, does not retry (if configured correctly)
  6. Fallback fires
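The sequence above can be traced with a toy model. This is not Resilience4j code: ToyBreaker is a hypothetical count-based breaker that never closes again, retries of the initial ConnectException are left out for brevity, and the threshold values are example numbers.

```java
// Toy model of the crash sequence. NOT Resilience4j code; all names are hypothetical.
final class CrashSequenceSketch {

    static final class ToyBreaker {
        private final int minimumNumberOfCalls;
        private final double failureRateThreshold; // percent
        private int calls;
        private int failures;
        private boolean open;

        ToyBreaker(int minimumNumberOfCalls, double failureRateThreshold) {
            this.minimumNumberOfCalls = minimumNumberOfCalls;
            this.failureRateThreshold = failureRateThreshold;
        }

        boolean isOpen() {
            return open;
        }

        // Records one outcome; opens once enough calls exist and the failure rate reaches the threshold.
        void record(boolean failed) {
            calls++;
            if (failed) {
                failures++;
            }
            if (calls >= minimumNumberOfCalls
                    && 100.0 * failures / calls >= failureRateThreshold) {
                open = true;
            }
        }
    }

    // Returns how many calls reached the crashed dependency before the breaker opened.
    static int callsReachingDependency(ToyBreaker breaker, int incomingCalls) {
        int reached = 0;
        for (int i = 0; i < incomingCalls; i++) {
            if (breaker.isOpen()) {
                continue; // steps 4-6: rejected, not retried, fallback answers
            }
            reached++;            // step 1: connection refused
            breaker.record(true); // step 2: failure recorded
        }
        return reached;
    }

    public static void main(String[] args) {
        ToyBreaker breaker = new ToyBreaker(10, 50.0);
        int reached = callsReachingDependency(breaker, 100);
        System.out.println(reached + " calls hit the dependency, "
                + (100 - reached) + " went to the fallback"); // 10 calls hit the dependency, 90 went to the fallback
    }
}
```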

The metrics show: circuit breaker state = OPEN, bulkhead utilization = 0% (no calls reaching the bulkhead), retry successful_with_retry = 0 (no retries attempted). This signature, an OPEN breaker with an idle bulkhead, means the dependency is confirmed unreachable, not just slow. If the breaker keeps returning to OPEN while the bulkhead sits at or near 100% utilization, the calls admitted during the half-open windows are reaching the dependency, but it is still too slow.
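The two signatures in the paragraph above reduce to a small decision function. This is a sketch only: the names are hypothetical, and the 0.05 and 0.95 cut-offs are illustrative assumptions, not recommended thresholds.

```java
// Hypothetical classifier for the breaker / bulkhead metric signatures.
final class BreakerBulkheadSignature {

    enum Diagnosis { UNREACHABLE, REACHABLE_BUT_SLOW, INCONCLUSIVE }

    // breakerOpen: the breaker is OPEN, or keeps returning to OPEN after half-open probes.
    // bulkheadUtilization: in-flight calls divided by max concurrent calls, in [0.0, 1.0].
    static Diagnosis classify(boolean breakerOpen, double bulkheadUtilization) {
        if (!breakerOpen) {
            return Diagnosis.INCONCLUSIVE;
        }
        if (bulkheadUtilization <= 0.05) {
            return Diagnosis.UNREACHABLE; // nothing gets past the breaker
        }
        if (bulkheadUtilization >= 0.95) {
            return Diagnosis.REACHABLE_BUT_SLOW; // probes get through but hold their permits
        }
        return Diagnosis.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 0.0)); // UNREACHABLE
        System.out.println(classify(true, 1.0)); // REACHABLE_BUT_SLOW
    }
}
```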

These metric patterns are the language of resilience debugging. Learn to read them and you can diagnose dependency issues from the dashboard without checking the dependency’s logs.