Resilience Observability
A circuit breaker that opens without anyone noticing is worse than no circuit breaker at all. The system is degraded, serving fallback responses, and the operations team does not know. When someone finally checks, the circuit breaker has been open for hours. Customers have been receiving approximate fraud scores since the morning. The circuit breaker worked perfectly; the observability failed.
The Four Resilience Signals
Every resilience-protected dependency needs four signals:
1. Health state. Is the dependency available? Circuit breaker state (CLOSED, OPEN, HALF_OPEN) is the primary signal. A CLOSED breaker means calls are flowing and the failure rate is below the threshold. An OPEN breaker means the failure rate crossed the threshold and calls are being short-circuited to the fallback. A HALF_OPEN breaker means recovery is being tested with a limited number of trial calls.
2. Latency profile. How fast is the dependency? The p50, p95, and p99 latencies show the current performance. A widening gap between p50 and p99 indicates emerging problems before the circuit breaker opens.
3. Error budget. How close is the dependency to triggering the circuit breaker? If the failure rate threshold is 50% and the current rate is 40%, the breaker is 80% of the way to opening. This is the early warning.
4. Capacity utilization. How much headroom does the resilience pattern have? Bulkhead permit utilization at 90% means the dependency is one traffic spike away from rejections. A rate limiter that hands out permits at close to its configured limit means throttling is imminent.
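Signals 3 and 4 reduce to simple ratios. The following is a minimal sketch in plain Java that works through the numbers quoted above. The class and method names (ResilienceSignals, thresholdProximity, bulkheadUtilization) are hypothetical and not part of Resilience4J; in a real service the inputs would come from the breaker and bulkhead metrics.

```java
/** Hypothetical helper (not a Resilience4J class) that derives early-warning ratios. */
final class ResilienceSignals {

    private ResilienceSignals() {}

    /**
     * Fraction of the failure-rate threshold already consumed.
     * Both arguments are percentages in the range 0-100.
     */
    static double thresholdProximity(double failureRatePercent, double thresholdPercent) {
        if (thresholdPercent <= 0) {
            throw new IllegalArgumentException("thresholdPercent must be positive");
        }
        return failureRatePercent / thresholdPercent;
    }

    /** Fraction of bulkhead permits currently in use, from 0.0 (idle) to 1.0 (saturated). */
    static double bulkheadUtilization(int availablePermits, int maxPermits) {
        if (maxPermits <= 0 || availablePermits < 0 || availablePermits > maxPermits) {
            throw new IllegalArgumentException("invalid permit counts");
        }
        return (maxPermits - availablePermits) / (double) maxPermits;
    }

    public static void main(String[] args) {
        // A 40% failure rate against a 50% threshold
        System.out.println(thresholdProximity(40.0, 50.0));
        // 2 of 20 bulkhead permits still available
        System.out.println(bulkheadUtilization(2, 20));
    }
}
```

A proximity of 0.8 corresponds to the "80% of the way to opening" case, and a utilization of 0.9 to the bulkhead that is one traffic spike away from rejections.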
Resilience4J Micrometer Integration
// PRODUCTION - Automatic metric binding for all resilience patterns
@Configuration
public class ResilienceMetricsConfig {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> resilience4jMetrics(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry,
            RateLimiterRegistry rlRegistry,
            TimeLimiterRegistry tlRegistry) {
        return registry -> {
            // Circuit breaker metrics
            TaggedCircuitBreakerMetrics
                .ofCircuitBreakerRegistry(cbRegistry)
                .bindTo(registry);
            // Retry metrics
            TaggedRetryMetrics
                .ofRetryRegistry(retryRegistry)
                .bindTo(registry);
            // Bulkhead metrics
            TaggedBulkheadMetrics
                .ofBulkheadRegistry(bulkheadRegistry)
                .bindTo(registry);
            // Rate limiter metrics
            TaggedRateLimiterMetrics
                .ofRateLimiterRegistry(rlRegistry)
                .bindTo(registry);
            // Time limiter metrics
            TaggedTimeLimiterMetrics
                .ofTimeLimiterRegistry(tlRegistry)
                .bindTo(registry);
        };
    }
}
With Spring Boot Actuator, the Micrometer Prometheus registry on the classpath, and the prometheus endpoint exposed, these metrics are available at /actuator/prometheus:
# Circuit breaker metrics
resilience4j_circuitbreaker_state{name="fraudDetection"} 0 # 0=CLOSED
resilience4j_circuitbreaker_calls_seconds_count{name="fraudDetection",kind="successful"} 1523
resilience4j_circuitbreaker_calls_seconds_count{name="fraudDetection",kind="failed"} 7
resilience4j_circuitbreaker_failure_rate{name="fraudDetection"} 0.46 # percent (0-100), i.e. 0.46%
# Bulkhead metrics
resilience4j_bulkhead_available_concurrent_calls{name="fraudDetection"} 15
resilience4j_bulkhead_max_allowed_concurrent_calls{name="fraudDetection"} 20
# Retry metrics
resilience4j_retry_calls_total{name="paymentGateway",kind="successful_without_retry"} 980
resilience4j_retry_calls_total{name="paymentGateway",kind="successful_with_retry"} 15
resilience4j_retry_calls_total{name="paymentGateway",kind="failed_with_retry"} 5
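To make the retry counters concrete, the sketch below computes the share of paymentGateway calls that needed at least one retry from the sample values above. The class and method names are hypothetical. The failed_without_retry count is assumed to be 0 because the sample output does not show that series.

```java
/** Hypothetical helper: share of calls that needed at least one retry. */
final class RetryRatio {

    private RetryRatio() {}

    static double retriedFraction(long successfulWithoutRetry,
                                  long successfulWithRetry,
                                  long failedWithRetry,
                                  long failedWithoutRetry) {
        long total = successfulWithoutRetry + successfulWithRetry
                + failedWithRetry + failedWithoutRetry;
        if (total == 0) {
            return 0.0; // no traffic yet, so nothing was retried
        }
        // Calls that retried, whether or not they eventually succeeded
        return (successfulWithRetry + failedWithRetry) / (double) total;
    }

    public static void main(String[] args) {
        // Sample counters: 980 without retry, 15 succeeded after retry, 5 failed after retry
        System.out.println(retriedFraction(980, 15, 5, 0));
    }
}
```

With the sample counters, 20 of 1000 calls retried, which is 2%.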
Grafana Dashboard Design
The resilience dashboard is organized by dependency (one row per dependency) with four panels per row:
Panel 1: Circuit Breaker State (stat panel)
resilience4j_circuitbreaker_state{name="$dependency"}
Value mapping: 0 = CLOSED (green), 1 = OPEN (red), 2 = HALF_OPEN (yellow), -1 = DISABLED (gray), -2 = FORCED_OPEN (purple).
Panel 2: Call Outcomes (stacked time series)
rate(resilience4j_circuitbreaker_calls_seconds_count{name="$dependency"}[5m])
Grouped by kind: successful (green), failed (red), not_permitted (orange). A growing orange area means the circuit breaker is open and rejecting calls. A growing red area means the dependency is returning errors but the breaker has not opened yet (approaching the threshold).
Panel 3: Failure Rate vs Threshold (gauge)
resilience4j_circuitbreaker_failure_rate{name="$dependency"}
Add a threshold line at the configured failure-rate-threshold so the gauge shows how close the dependency is to tripping the breaker.
Panel 4: Bulkhead Utilization (time series)
1 - (
resilience4j_bulkhead_available_concurrent_calls{name="$dependency"}
/
resilience4j_bulkhead_max_allowed_concurrent_calls{name="$dependency"}
)
This expression yields the fraction of bulkhead permits in use (0.0 to 1.0); display it as a percentage. Sustained values above 80% indicate the dependency is slow and approaching saturation.
Alert Rules
# PRODUCTION - Prometheus alert rules for resilience patterns
groups:
  - name: resilience
    rules:
      # Circuit breaker opened
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"
          description: >
            The circuit breaker for {{ $labels.name }} has been open
            for more than 1 minute. Fallback responses are being served.
      # Failure rate approaching threshold
      - alert: FailureRateHigh
        expr: >
          resilience4j_circuitbreaker_failure_rate > 30
          and resilience4j_circuitbreaker_state == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.name }} failure rate at {{ $value }}%"
          description: >
            Failure rate is approaching the circuit breaker threshold.
            The breaker may open soon.
      # Bulkhead near saturation
      - alert: BulkheadNearSaturation
        expr: >
          (1 - resilience4j_bulkhead_available_concurrent_calls
          / resilience4j_bulkhead_max_allowed_concurrent_calls) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Bulkhead {{ $labels.name }} at >90% utilization"
      # Retry rate elevated
      # Both sides are summed by name so the kind label is dropped;
      # otherwise each series would only be divided by itself.
      - alert: RetryRateElevated
        expr: >
          sum by (name) (rate(resilience4j_retry_calls_total{kind=~"successful_with_retry|failed_with_retry"}[5m]))
          / sum by (name) (rate(resilience4j_retry_calls_total[5m])) > 0.1
        for: 5m
        labels:
          severity: info
        annotations:
          summary: ">10% of {{ $labels.name }} calls require retries"
      # DLQ messages accumulating (from Chapter 13)
      - alert: DlqAccumulating
        expr: rate(kafka_dlq_received_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Dead letter queue receiving messages"
          description: >
            Messages are arriving in the DLQ. These represent
            operations that failed all retry attempts.
Distributed Tracing
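Resilience events can be attached to distributed traces as span annotations. The configuration below registers listeners on every circuit breaker added to the registry: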
// PRODUCTION - Custom span annotations for resilience events
@Configuration
public class ResilienceTracingConfig {

    @Bean
    public RegistryEventConsumer<CircuitBreaker> circuitBreakerTracing(
            Tracer tracer) {
        return new RegistryEventConsumer<>() {
            @Override
            public void onEntryAddedEvent(EntryAddedEvent<CircuitBreaker> e) {
                CircuitBreaker cb = e.getAddedEntry();
                cb.getEventPublisher()
                    .onStateTransition(event -> {
                        Span span = tracer.currentSpan();
                        if (span != null) {
                            span.event("circuit_breaker." +
                                event.getStateTransition().name());
                            span.tag("cb.name", cb.getName());
                        }
                    })
                    .onCallNotPermitted(event -> {
                        Span span = tracer.currentSpan();
                        if (span != null) {
                            span.event("circuit_breaker.rejected");
                            span.tag("cb.name", cb.getName());
                        }
                    });
            }

            @Override
            public void onEntryRemovedEvent(
                    EntryRemovedEvent<CircuitBreaker> e) {}

            @Override
            public void onEntryReplacedEvent(
                    EntryReplacedEvent<CircuitBreaker> e) {}
        };
    }
}
When a payment request triggers a circuit breaker rejection, the trace shows:
POST /payments [200, 45ms]
├── fraudDetection [CIRCUIT_BREAKER_REJECTED, 0.1ms]
│     event: circuit_breaker.rejected
│     tag: cb.name=fraudDetection
├── fallback.fraudScore [0.2ms]
├── balanceCheck [15ms]
└── paymentGateway [25ms]
The trace reveals which dependency was rejected, that the fallback was invoked, and that the overall request still succeeded (200 status). Without trace-level resilience visibility, the operations team sees a 200 response and assumes everything is normal. The trace shows the degradation.
The Observability Checklist
For each resilience-protected dependency:
- Circuit breaker state exposed as a metric
- Failure rate exposed with a threshold reference line
- Bulkhead utilization exposed as a percentage
- Retry success/failure rates exposed
- Dashboard panel showing all four signals
- Alert on circuit breaker open (warning)
- Alert on failure rate approaching threshold (warning)
- Alert on bulkhead saturation (warning)
- Alert on sustained retry rate (info)
- Trace annotations for circuit breaker events
- Fallback activation metric
Missing any one of these items creates a blind spot. The circuit breaker opens, the fallback serves stale data, the dashboard shows green, and the team discovers the outage from a customer complaint.
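The fallback activation metric is the one checklist item with no listing earlier in this section. The following is a dependency-free sketch of the idea: wrap the protected call, and count every time the fallback path runs. The class and method names (FallbackCounting, callWithFallback) are hypothetical, and the LongAdder stands in for a real meter; with Micrometer it would typically be a Counter registered with a name tag per dependency.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical wrapper that counts how often the fallback path is taken. */
final class FallbackCounting {

    // Stand-in for a metrics counter; thread-safe under concurrent calls
    private final LongAdder fallbackActivations = new LongAdder();

    /** Runs the primary call; on any runtime failure, counts and runs the fallback. */
    <T> T callWithFallback(Supplier<T> primary, Supplier<T> fallback) {
        try {
            return primary.get();
        } catch (RuntimeException e) {
            fallbackActivations.increment();
            return fallback.get();
        }
    }

    long fallbackActivations() {
        return fallbackActivations.sum();
    }

    public static void main(String[] args) {
        FallbackCounting fraudScore = new FallbackCounting();
        // Simulates a rejected call, e.g. because the breaker is open
        Supplier<Double> rejected = () -> {
            throw new IllegalStateException("circuit open");
        };
        // Approximate score served while the dependency is unavailable
        Supplier<Double> approximate = () -> 0.5;

        Double score = fraudScore.callWithFallback(rejected, approximate);
        System.out.println("score=" + score
                + " fallbacks=" + fraudScore.fallbackActivations());
    }
}
```

A nonzero rate on this counter while the HTTP status stays at 200 is the degradation that a request-level dashboard would otherwise miss.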