Resilience Maturity Model and Adoption Roadmap

Adopting every pattern in this book at once is impractical. The investment required, the operational complexity, and the learning curve make a phased approach necessary. This section provides a framework for assessing where your system is and planning the path forward.

The Five Levels

Level 0: Unprotected. No resilience patterns. HTTP clients run with default timeouts (often infinite). A single dependency failure cascades to the entire system. Failures are detected by user complaints.

Level 1: Basic Protection. HTTP client timeouts configured. Retries on transient errors for idempotent operations. Basic health checks. Failures are detected by error rate alerts. Recovery requires manual intervention (restart pods, clear caches).

Level 2: Isolated Failures. Circuit breakers on critical dependencies. Bulkheads separating failure domains. Fallbacks for each circuit-broken dependency. Failures are contained: a failure in one dependency does not affect calls to the others. Dashboard shows circuit breaker states.

Level 3: Self-Healing. Composed resilience patterns with correct ordering. Chaos experiments validate resilience behavior. SLOs defined and error budgets tracked. Automatic recovery: circuit breakers transition through half-open to closed without intervention. Degraded modes explicitly designed and tested.

Level 4: Adaptive. Parameters adapt to observed behavior (adaptive hedge delay, dynamic circuit breaker thresholds). Load shedding and admission control protect system capacity. Backpressure propagation across service boundaries. Resilience decisions automated based on error budget.
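The automatic recovery that separates Level 3 from Level 1 can be illustrated with a minimal circuit breaker state machine. This is a sketch under simplifying assumptions, not the implementation of any particular library: the class and parameter names are hypothetical, it counts consecutive failures rather than a failure rate over a sliding window, and it is not thread-safe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock  # injectable so tests can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("dependency call rejected")
            # Wait period elapsed: let one trial call through.
            self.state = "HALF_OPEN"
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        # Any success resets the count; a successful trial call closes the circuit.
        self.failures = 0
        self.state = "CLOSED"
```

No operator is involved in the path back to CLOSED: once the wait period passes, the next caller acts as the probe, and a single success restores normal traffic.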

Assessment Checklist

Level 1 Checklist

All HTTP clients have explicit connect and read timeouts
Retry configured for idempotent operations
Health check endpoint exposed
Error rate alert configured
Logging captures dependency failures with context
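The "retry configured for idempotent operations" item is easy to get wrong by retrying everything. The sketch below shows one way to restrict retries by HTTP method. The names call_with_retry and TransientError are hypothetical, and the wrapped operation is assumed to enforce its own connect and read timeouts.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# HTTP methods defined as idempotent by RFC 9110.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, resets)."""


def call_with_retry(method: str, operation: Callable[[], T],
                    max_attempts: int = 3, base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Non-idempotent calls (for example POST) get exactly one attempt.
    attempts = max_attempts if method.upper() in IDEMPOTENT_METHODS else 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

A POST that creates an order is attempted once and its failure surfaces immediately, while a GET for the same order may be attempted up to max_attempts times.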

Level 2 Checklist

Circuit breaker on each external dependency
Fallback strategy defined for each circuit-broken call
Bulkhead isolating each dependency’s thread/connection usage
Circuit breaker state visible on a dashboard
Integration tests verify circuit breaker activation
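For the bulkhead item, a semaphore per dependency is one common way to cap concurrent usage. The following is a minimal sketch with hypothetical names (Bulkhead, BulkheadFullError). It rejects immediately when the limit is reached instead of queueing, which is one design choice among several.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when the dependency already has its maximum concurrent calls."""


class Bulkhead:
    """Hypothetical semaphore bulkhead: create one instance per dependency."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation: Callable[[], T]) -> T:
        # Fail fast instead of queueing, so a slow dependency cannot
        # tie up every worker thread in the service.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no permits available")
        try:
            return operation()
        finally:
            self._permits.release()
```

Because each dependency holds its own permits, exhausting the bulkhead for one dependency leaves the permits of every other dependency untouched. A BulkheadFullError is a natural trigger for the same fallback used when the circuit breaker is open.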

Level 3 Checklist

Decorator ordering configured and verified (Retry wraps the circuit breaker, so the breaker records every attempt)
At least 3 chaos experiments executed with documented findings
SLOs defined for availability and latency
Error budget dashboard with burn rate alerts
Degraded modes documented, implemented, and tested
Graceful shutdown with connection draining implemented
Contract tests verify resilience at service boundaries
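The burn rate behind the "error budget dashboard with burn rate alerts" item is simple arithmetic: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based availability SLO. The function names, and the idea of paging only when both a long and a short window exceed a threshold, are illustrative assumptions rather than something this section prescribes.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total <= 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    return (failed / total) / error_budget


def should_page(long_window: float, short_window: float, threshold: float) -> bool:
    # Require both windows to exceed the threshold: the long window shows the
    # burn is significant, the short window shows it is still happening.
    return long_window >= threshold and short_window >= threshold
```

A burn rate of 1 means the budget runs out exactly at the end of the SLO period. With a 99.9% target, 1% of requests failing corresponds to a burn rate of about 10, so the budget would be gone in roughly a tenth of the period.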

Level 4 Checklist

Load shedding with priority-based admission control
Adaptive circuit breaker or hedge delay parameters
Backpressure propagation across service boundaries
Automated resilience tuning based on error budget
Kafka retry/DLQ topology for asynchronous resilience
Resilience runbook automated for common failure scenarios
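The checklists can be scored mechanically. The sketch below assumes that levels are cumulative, meaning a system is at level N only if every checklist from 1 through N is fully satisfied. The function name and input shape are hypothetical.

```python
from typing import Dict, List


def assess_level(checklists: Dict[int, List[bool]]) -> int:
    """Return the highest level whose checklist, and every lower one, is complete."""
    level = 0
    for candidate in sorted(checklists):
        items = checklists[candidate]
        # Stop at the first gap: a missing level, an empty checklist,
        # or any unchecked item.
        if candidate != level + 1 or not items or not all(items):
            break
        level = candidate
    return level


if __name__ == "__main__":
    answers = {
        1: [True, True, True, True, True],
        2: [True, True, True, True, True],
        3: [True, False, True, False, True, True, False],
        4: [False, False, False, False, True, False],
    }
    print("Current level:", assess_level(answers))
```

Under this scoring, items completed out of order (such as the Kafka retry/DLQ topology in the example input) do not raise the level until the earlier checklists are complete.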

Adoption Roadmap

Phase 1 (Level 0 to 1): Foundations. Configure timeouts. Add retry for transient errors. Deploy health checks. Set up error alerting. Duration: 1-2 sprints.

Phase 2 (Level 1 to 2): Isolation. Add circuit breakers to external dependencies. Implement fallbacks. Configure bulkheads. Build the resilience dashboard. Write integration tests. Duration: 2-4 sprints.

Phase 3 (Level 2 to 3): Validation. Compose patterns with correct ordering. Run chaos experiments. Define SLOs. Design degraded modes. Implement graceful shutdown. Duration: 3-6 sprints.

Phase 4 (Level 3 to 4): Optimization. Add load shedding. Implement adaptive parameters. Propagate backpressure. Automate resilience decisions. Duration: ongoing.
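Phase 4's load shedding with priority-based admission control can be sketched as a controller that rejects low-priority work first as the system fills up. The priority names, the capacity fractions, and the class itself are illustrative assumptions. A production version would also need locking or atomic counters.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0    # hypothetical example: checkout requests
    NORMAL = 1
    BACKGROUND = 2  # hypothetical example: prefetch or analytics calls


class AdmissionController:
    """Hypothetical load shedder: lower-priority work is rejected first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_flight = 0
        # Fraction of capacity each priority may fill before it is shed.
        self.limits = {
            Priority.BACKGROUND: 0.5,
            Priority.NORMAL: 0.8,
            Priority.CRITICAL: 1.0,
        }

    def try_admit(self, priority: Priority) -> bool:
        if self.in_flight >= self.capacity * self.limits[priority]:
            return False  # shed: the caller should fail fast or degrade
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call once for every admitted request when it completes.
        self.in_flight = max(0, self.in_flight - 1)
```

With these fractions, background work is shed once the system is half full, normal work at 80 percent, and critical work only at full capacity, so the last 20 percent of capacity is reserved for critical requests.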

Measuring Improvement

Each level transition produces measurable improvement:

Level 0 to 1: Mean time to detect (MTTD) decreases from hours (user complaints) to minutes (alerts).
Level 1 to 2: Blast radius of a dependency failure decreases from “entire system” to “single feature.”
Level 2 to 3: Mean time to recover (MTTR) decreases from 30+ minutes with manual intervention to 1-5 minutes with automatic recovery, the time it takes the circuit breaker to close.
Level 3 to 4: System maintains SLO under load spikes and multi-dependency failures that previously caused outages.

Track these metrics across each phase. The improvement is not theoretical: each level transition reduces the frequency, duration, and blast radius of production incidents. The investment in resilience patterns pays back in reduced incident burden, higher customer satisfaction, and operational confidence.
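Tracking MTTD and MTTR needs little more than three timestamps per incident. The sketch below assumes a hypothetical Incident record and measures recovery from detection. Some teams measure it from the start of the failure instead, so whichever definition is chosen should stay fixed across phases.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when the failure began
    detected: datetime   # when an alert fired or a user reported it
    recovered: datetime  # when service returned to normal


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recover(incidents: List[Incident]) -> timedelta:
    if not incidents:
        raise ValueError("no incidents to average")
    # Measured here from detection to recovery (an assumption, see above).
    total = sum((i.recovered - i.detected for i in incidents), timedelta())
    return total / len(incidents)
```

Computing both values per phase, over the incidents that occurred during that phase, gives a direct before-and-after comparison for each level transition.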

The final metric is the number of incidents in which a resilience pattern prevented a customer-visible outage. Suppose the fraud detection service goes down and no customer notices: the circuit breaker opened, the fallback activated, the dashboard showed the degradation, and the circuit breaker closed automatically when the service recovered. That is the architecture working as designed.