Resilience Maturity Model and Adoption Roadmap
Adopting every pattern in this book at once is impractical. The investment required, the operational complexity, and the learning curve make a phased approach necessary. This section provides a framework for assessing where your system is and planning the path forward.
The Five Levels
Level 0: Unprotected. No resilience patterns. HTTP clients use default timeouts (often infinite). A single dependency failure cascades to the entire system. Failures are detected through user complaints.
Level 1: Basic Protection. HTTP client timeouts configured. Retry on transient errors. Basic health checks. Failures detected by error rate alerts. Recovery requires manual intervention (restart pods, clear caches).
Level 2: Isolated Failures. Circuit breakers on critical dependencies. Bulkheads separating failure domains. Fallbacks for each circuit-broken dependency. Failures are contained: one dependency failure does not affect others. Dashboard shows circuit breaker states.
Level 3: Self-Healing. Composed resilience patterns with correct ordering. Chaos experiments validate resilience behavior. SLOs defined and error budgets tracked. Automatic recovery: circuit breakers transition through half-open to closed without intervention. Degraded modes explicitly designed and tested.
Level 4: Adaptive. Adaptive parameters based on observed behavior (adaptive hedge delay, dynamic circuit breaker thresholds). Load shedding and admission control protect system capacity. Backpressure propagation across service boundaries. Resilience decisions automated based on error budget.
Assessment Checklist
Level 1 Checklist
- All HTTP clients have explicit connect and read timeouts
- Retry configured for idempotent operations
- Health check endpoint exposed
- Error rate alert configured
- Logging captures dependency failures with context
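The retry item on this checklist can be sketched as a small wrapper. This is a minimal illustration, not a production client: `TransientError` and `call_with_retry` are hypothetical names, and the backoff values are assumptions chosen for the example. The wrapped operation is expected to carry its own explicit connect and read timeouts.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_retry(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with backoff and jitter.

    Only pass idempotent operations: a retried call may execute more than once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff ceiling (base, 2x, 4x, ...) with full jitter.
            ceiling = base_delay * (2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Errors that are not transient (validation failures, authorization errors) propagate immediately, which keeps retries from amplifying load on a dependency that is rejecting the request for a good reason.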
Level 2 Checklist
- Circuit breaker on each external dependency
- Fallback strategy defined for each circuit-broken call
- Bulkhead isolating each dependency’s thread/connection usage
- Circuit breaker state visible on a dashboard
- Integration tests verify circuit breaker activation
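To make the circuit breaker and fallback items concrete, here is a deliberately small state machine. It is a sketch under simplifying assumptions: it counts consecutive failures instead of tracking a failure rate over a sliding window, it is not thread-safe, and the class and parameter names are hypothetical, not taken from any library.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open state machine (illustrative only)."""

    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()  # fail fast while the circuit is open
            self.state = "half_open"  # wait elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self._on_failure()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures in a row, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Injecting `clock` makes the open-to-half-open transition testable without real waiting, which is the kind of integration test the last checklist item asks for.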
Level 3 Checklist
- Decorator ordering configured and verified (retry wraps the circuit breaker)
- At least 3 chaos experiments executed with documented findings
- SLOs defined for availability and latency
- Error budget dashboard with burn rate alerts
- Degraded modes documented, implemented, and tested
- Graceful shutdown with connection draining implemented
- Contract tests verify resilience at service boundaries
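The burn rate behind the error budget alerts can be computed as the observed error rate divided by the error budget (one minus the SLO target). The sketch below assumes that common definition. The function names are hypothetical, and the 14.4 threshold is one commonly cited paging value for a 30-day SLO, used here only as an illustrative default.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being consumed in a window.

    1.0 means the budget lasts exactly the SLO period; higher burns faster.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so no budget consumed
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


def should_page(short_window_rate, long_window_rate, threshold=14.4):
    """Multi-window alert: page only if both windows exceed the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold
```

Requiring both a short and a long window to exceed the threshold is one way to keep a brief spike from paging anyone while still catching sustained budget consumption.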
Level 4 Checklist
- Load shedding with priority-based admission control
- Adaptive circuit breaker or hedge delay parameters
- Backpressure propagation across service boundaries
- Automated resilience tuning based on error budget
- Kafka retry and dead-letter queue (DLQ) topology for asynchronous resilience
- Resilience runbook automated for common failure scenarios
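Priority-based admission control, the first item above, might look like the following sketch. The priority classes, the utilization limits, and the `LoadShedder` name are all assumptions made for illustration. A real implementation would also need thread safety and a better load signal than a simple in-flight count.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0    # hypothetical example: checkout, payment
    NORMAL = 1
    BACKGROUND = 2  # hypothetical example: prefetch, analytics


class LoadShedder:
    """Admit requests by priority as in-flight load approaches capacity."""

    def __init__(self, capacity, limits=None):
        self.capacity = capacity
        self.in_flight = 0
        # Fraction of capacity at which each priority starts being rejected.
        self.limits = limits or {
            Priority.CRITICAL: 1.0,
            Priority.NORMAL: 0.8,
            Priority.BACKGROUND: 0.5,
        }

    def try_admit(self, priority):
        utilization = self.in_flight / self.capacity
        if utilization >= self.limits[priority]:
            return False  # shed this request; the caller should fail fast
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes, successfully or not."""
        self.in_flight = max(0, self.in_flight - 1)
```

With these limits, background work is rejected first, normal traffic next, and critical requests are refused only when the service is at full capacity.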
Adoption Roadmap
Phase 1 (Level 0 to 1): Foundations. Configure timeouts. Add retry for transient errors. Deploy health checks. Set up error alerting. Duration: 1-2 sprints.
Phase 2 (Level 1 to 2): Isolation. Add circuit breakers to external dependencies. Implement fallbacks. Configure bulkheads. Build the resilience dashboard. Write integration tests. Duration: 2-4 sprints.
Phase 3 (Level 2 to 3): Validation. Compose patterns with correct ordering. Run chaos experiments. Define SLOs. Design degraded modes. Implement graceful shutdown. Duration: 3-6 sprints.
Phase 4 (Level 3 to 4): Optimization. Add load shedding. Implement adaptive parameters. Propagate backpressure. Automate resilience decisions. Duration: ongoing.
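One way to connect the checklists to the roadmap is to score them and look up the next phase. The sketch below assumes that levels are cumulative, so a team is at level N only when every checklist up to and including N is complete. That rule, and the function names, are assumptions for illustration.

```python
PHASES = {
    0: "Phase 1: Foundations",
    1: "Phase 2: Isolation",
    2: "Phase 3: Validation",
    3: "Phase 4: Optimization",
}


def assess_level(completed):
    """Return the highest level whose checklist, and every lower one, is done.

    `completed` maps a level number (1-4) to a list of booleans, one per
    checklist item for that level.
    """
    level = 0
    for candidate in (1, 2, 3, 4):
        items = completed.get(candidate, [])
        if not items or not all(items):
            break  # a gap at this level caps the assessment here
        level = candidate
    return level


def next_phase(level):
    """Return the roadmap phase to start from the given level."""
    return PHASES.get(level, "Phase 4: Optimization (ongoing)")
```

Under this rule, a team with circuit breakers everywhere but no configured timeouts is still assessed at Level 0, which points it back to Phase 1.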
Measuring Improvement
Each level transition produces measurable improvement:
- Level 0 to 1: Mean time to detect (MTTD) decreases from hours (user complaints) to minutes (alerts).
- Level 1 to 2: Blast radius of a dependency failure decreases from “entire system” to “single feature.”
- Level 2 to 3: Mean time to recover (MTTR) decreases from 30+ minutes (manual intervention) to 1-5 minutes (automatic recovery, when the circuit breaker closes).
- Level 3 to 4: System maintains SLO under load spikes and multi-dependency failures that previously caused outages.
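The first and third metrics above can be computed from incident records. The sketch below assumes a hypothetical `Incident` record with three timestamps and measures recovery from detection to resolution. Definitions of MTTR vary, so use whichever convention your incident process already follows.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when an alert fired or a user reported it
    resolved: datetime  # when service was restored


def _mean(durations):
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents):
    """Average delay between failure start and detection."""
    return _mean(i.detected - i.started for i in incidents)


def mean_time_to_recover(incidents):
    """Average delay between detection and resolution (assumed convention)."""
    return _mean(i.resolved - i.detected for i in incidents)
```

Computing both values per phase, over the same incident categories, gives a like-for-like comparison as the system moves up the levels.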
Track these metrics across each phase. The improvement is not theoretical: each level transition reduces the frequency, duration, and blast radius of production incidents. The investment in resilience patterns pays back in reduced incident burden, higher customer satisfaction, and operational confidence.
The final metric is the number of incidents in which a resilience pattern prevented a customer-visible outage. Suppose the fraud detection service goes down and no customer notices: the circuit breaker opened, the fallback activated, the dashboard showed the degradation, and the circuit breaker closed automatically once the service recovered. That is the architecture working as designed.