Chaos Engineering
Integration tests (Chapter 14) verify that resilience patterns activate under controlled, predictable failures. Chaos engineering verifies that the entire system behaves correctly under uncontrolled, realistic failures. The difference: an integration test configures WireMock to return a 500. A chaos experiment kills the fraud detection service’s database connection pool while 500 real requests per second are flowing through the system.
Integration tests answer: “Does the circuit breaker open when the fraud service returns errors?” Chaos experiments answer: “When the fraud service degrades under load, does the payment service maintain acceptable response times, and does it recover automatically when the fraud service stabilizes?”
Chaos Monkey for Spring Boot
<!-- PRODUCTION - Maven dependency -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
    <scope>runtime</scope>
</dependency>
# PRODUCTION - Chaos Monkey configuration
# Only active with the 'chaos-monkey' Spring profile
spring:
  profiles:
    active: chaos-monkey
chaos:
  monkey:
    enabled: true
    assaults:
      level: 5                      # Attack every 5th call
      latency-active: true
      latency-range-start: 500
      latency-range-end: 3000       # Add 500ms-3s latency
      exceptions-active: false      # Start with latency only
      kill-application-active: false
    watcher:
      rest-controller: true         # Attack REST controllers
      service: true                 # Attack @Service beans
      repository: false             # Do not attack database calls (yet)
      component: false
Chaos Monkey intercepts Spring bean method calls and injects failures. The level setting controls frequency: level 5 means every 5th call is attacked. The watcher settings control which beans are targeted.
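To get a feel for what these settings mean before running anything, the arithmetic can be sketched in a few lines. This is a rough estimate only. It assumes that each watched call is attacked independently with probability 1/level and that the injected delay is uniform over the configured range. The class and method names are hypothetical and not part of Chaos Monkey.

```java
// Rough impact estimate for the assault settings above.
// Assumption: each watched call is attacked independently with probability
// 1/level, and injected latency is uniform over [rangeStart, rangeEnd].
final class AssaultEstimate {

    /** Probability that a single watched call is attacked. */
    static double attackProbability(int level) {
        return 1.0 / level;
    }

    /** Mean latency added per watched call, in milliseconds. */
    static double meanAddedLatencyMs(int level, long rangeStartMs, long rangeEndMs) {
        double meanDelay = (rangeStartMs + rangeEndMs) / 2.0;
        return attackProbability(level) * meanDelay;
    }

    /** Probability that a request passing through several watched beans is hit at least once. */
    static double requestHitProbability(int level, int watchedCallsPerRequest) {
        return 1.0 - Math.pow(1.0 - attackProbability(level), watchedCallsPerRequest);
    }

    public static void main(String[] args) {
        System.out.printf("per-call attack probability: %.2f%n", attackProbability(5));
        System.out.printf("mean added latency per call: %.0f ms%n",
                meanAddedLatencyMs(5, 500, 3000));
        System.out.printf("request hit probability (3 watched calls): %.3f%n",
                requestHitProbability(5, 3));
    }
}
```

Under these assumptions, a request that crosses one controller and two services is far more likely to be delayed than the per-call rate suggests, which is worth keeping in mind when enabling several watchers at once.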
Designing a Chaos Experiment
Every chaos experiment follows the scientific method:
1. Steady state hypothesis. Define what “normal” looks like in measurable terms.
Under normal operation, the payment service processes 500 requests per second with p99 latency below 200ms, error rate below 0.1%, and fraud check success rate above 99%.
2. Introduce the variable. Inject a specific failure.
Enable latency injection (500ms-3000ms) on the fraud detection client, affecting 20% of calls.
3. Observe the impact. Measure the deviation from steady state.
Observe payment service response time, error rate, circuit breaker state, and fallback activation rate.
4. Verify the hypothesis. Either the system maintained its steady state (resilience patterns worked) or it did not (gaps identified).
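Writing the steady state hypothesis down as data makes step 4 mechanical. The following is a minimal sketch, not a prescribed structure: the class name and fields are hypothetical, and the thresholds in `main` mirror the example hypothesis above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the measurable definition of "normal".
final class SteadyStateHypothesis {
    private final double maxP99Millis;
    private final double maxErrorRate;
    private final double minFraudSuccessRate;

    SteadyStateHypothesis(double maxP99Millis, double maxErrorRate,
                          double minFraudSuccessRate) {
        this.maxP99Millis = maxP99Millis;
        this.maxErrorRate = maxErrorRate;
        this.minFraudSuccessRate = minFraudSuccessRate;
    }

    /** Returns one message per violated threshold; an empty list means the hypothesis held. */
    List<String> violations(double p99Millis, double errorRate, double fraudSuccessRate) {
        List<String> found = new ArrayList<>();
        if (p99Millis >= maxP99Millis) {
            found.add("p99 latency " + p99Millis + "ms is not below " + maxP99Millis + "ms");
        }
        if (errorRate >= maxErrorRate) {
            found.add("error rate " + errorRate + " is not below " + maxErrorRate);
        }
        if (fraudSuccessRate <= minFraudSuccessRate) {
            found.add("fraud check success rate " + fraudSuccessRate
                    + " is not above " + minFraudSuccessRate);
        }
        return found;
    }

    public static void main(String[] args) {
        // p99 below 200ms, error rate below 0.1%, fraud check success above 99%
        SteadyStateHypothesis hypothesis = new SteadyStateHypothesis(200, 0.001, 0.99);
        System.out.println(hypothesis.violations(180, 0.0005, 0.995));
        System.out.println(hypothesis.violations(2100, 0.032, 0.995));
    }
}
```

Each returned message is a candidate entry for the gaps list in the experiment report.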
// PRODUCTION - Chaos experiment: fraud detection latency injection
import static org.assertj.core.api.Assertions.assertThat;

import com.github.tomakehurst.wiremock.client.WireMock;
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.ResponseEntity;

// Assumes ResilienceTestBase provides fraudWireMock(), samplePayment(), and
// percentile(). The 'chaos-monkey' profile must be active for Chaos Monkey to load.
@SpringBootTest(
    webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT,
    properties = {
        // Start disabled so Phase 1 measures a clean baseline
        "chaos.monkey.enabled=false",
        "chaos.monkey.assaults.latency-active=true",
        "chaos.monkey.assaults.latency-range-start=1000",
        "chaos.monkey.assaults.latency-range-end=3000",
        "chaos.monkey.assaults.level=5",
        "chaos.monkey.watcher.service=true"
    })
class FraudLatencyChaosExperiment extends ResilienceTestBase {

    @Autowired
    private TestRestTemplate restTemplate;

    @Autowired
    private ChaosMonkeySettings chaosSettings;

    @Autowired
    private CircuitBreakerRegistry cbRegistry;

    @Test
    void paymentServiceMaintainsResponseTime_underFraudLatency() {
        // Configure fraud service with normal response
        fraudWireMock().register(
            WireMock.post("/fraud/score")
                .willReturn(WireMock.okJson(
                    "{\"score\":0.1,\"decision\":\"PERMIT\"}")));

        // Phase 1: Establish steady state (100 requests, no chaos yet)
        List<Long> steadyStateLatencies = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            long start = System.nanoTime();
            restTemplate.postForEntity("/payments",
                samplePayment(), PaymentResponse.class);
            steadyStateLatencies.add(
                Duration.ofNanos(System.nanoTime() - start).toMillis());
        }
        double steadyP99 = percentile(steadyStateLatencies, 99);

        // Phase 2: Run under chaos (500 requests)
        // Enable Chaos Monkey only now; it injects latency on every 5th service call
        chaosSettings.getChaosMonkeyProperties().setEnabled(true);
        List<Long> chaosLatencies = new ArrayList<>();
        int errors = 0;
        try {
            for (int i = 0; i < 500; i++) {
                long start = System.nanoTime();
                ResponseEntity<PaymentResponse> response =
                    restTemplate.postForEntity("/payments",
                        samplePayment(), PaymentResponse.class);
                chaosLatencies.add(
                    Duration.ofNanos(System.nanoTime() - start).toMillis());
                if (!response.getStatusCode().is2xxSuccessful()) {
                    errors++;
                }
            }
        } finally {
            // Never leave chaos enabled for tests that share this context
            chaosSettings.getChaosMonkeyProperties().setEnabled(false);
        }
        double chaosP99 = percentile(chaosLatencies, 99);

        // Hypothesis: p99 latency should not exceed 3x steady state
        // (TimeLimiter at 2s caps the impact)
        assertThat(chaosP99).isLessThan(steadyP99 * 3);

        // Hypothesis: error rate should remain below 5%
        double errorRate = (double) errors / 500;
        assertThat(errorRate).isLessThan(0.05);

        // Verify circuit breaker activated (recorded slow calls)
        CircuitBreaker cb = cbRegistry.circuitBreaker("fraudDetection");
        assertThat(cb.getMetrics().getNumberOfSlowCalls())
            .isGreaterThan(0);
    }
}
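The test calls `percentile(...)` without showing it; presumably it lives in `ResilienceTestBase`. One possible implementation, assuming the nearest-rank definition is acceptable for these sample sizes, is sketched here. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; the original percentile() is not shown in the chapter.
final class Percentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(List<Long> samples, double p) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        List<Long> sorted = new ArrayList<>(samples); // leave the caller's list untouched
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0); // 1-based rank
        return sorted.get(rank - 1);
    }
}
```

With only 100 baseline samples, the p99 under this definition is the second-largest observation, so a single slow warm-up request can move it noticeably. A larger Phase 1 sample gives a steadier baseline.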
The Recovery Curve
The recovery curve shows four phases:
Steady state (0-5min). Normal operation. Response time flat, error rate near zero.
Chaos injection (5min mark). Latency injection begins. Response time increases as some requests hit injected delays.
Degraded steady state (5-15min). The circuit breaker opens for the fraud detection dependency. Response time stabilizes at a new level (fallback responses are fast). Error rate may spike briefly during the transition, then stabilizes.
Recovery (15min mark). Chaos injection stops. The circuit breaker transitions through half-open to closed, and response time returns to the original steady state. The key metric is recovery time: the time from chaos stop to steady state restoration. A good recovery is under 2 minutes. A poor recovery takes 10+ minutes, typically because the circuit breaker wait duration is too long, retry backoff delays have grown too long, or cached data has expired and must be repopulated.
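Recovery time can be computed from the same latency samples used to draw the curve. The sketch below is one way to do it, under stated assumptions: p99 is sampled at a fixed interval starting at the moment chaos stops, and "restored" means the value stays within a tolerance of the baseline for several consecutive samples. All names and parameters are hypothetical.

```java
// Hypothetical helper for measuring recovery time from sampled p99 values.
final class RecoveryTime {

    /**
     * @param p99AfterStop          p99 samples in ms; index 0 is taken at chaos stop
     * @param sampleIntervalSeconds spacing between samples
     * @param baselineP99           steady-state p99 in ms measured before the experiment
     * @param tolerance             allowed relative deviation, e.g. 0.10 for 10%
     * @param stableSamples         consecutive in-tolerance samples required
     * @return seconds from chaos stop to the first sample of the stable run, or -1 if none
     */
    static long secondsToRecover(double[] p99AfterStop, int sampleIntervalSeconds,
                                 double baselineP99, double tolerance, int stableSamples) {
        double limit = baselineP99 * (1 + tolerance);
        int run = 0; // length of the current in-tolerance streak
        for (int i = 0; i < p99AfterStop.length; i++) {
            run = p99AfterStop[i] <= limit ? run + 1 : 0;
            if (run == stableSamples) {
                int firstStableIndex = i - stableSamples + 1;
                return (long) firstStableIndex * sampleIntervalSeconds;
            }
        }
        return -1; // never stabilized within the observed window
    }
}
```

Requiring a streak rather than a single good sample avoids declaring recovery during the half-open phase, when a few probe calls may succeed before the breaker fully closes.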
Safe Chaos: Blast Radius Control
Chaos experiments in staging or production require safety controls:
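// PRODUCTION - Chaos experiment with abort conditions
// Requires @EnableScheduling, and chaos-monkey-spring-boot on the compile
// classpath (a runtime-scoped dependency is not visible to the compiler).
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ChaosExperimentController {
    private static final Logger log =
        LoggerFactory.getLogger(ChaosExperimentController.class);

    private final ChaosMonkeySettings settings;
    private final MeterRegistry meterRegistry;

    public ChaosExperimentController(ChaosMonkeySettings settings,
                                     MeterRegistry meterRegistry) {
        this.settings = settings;
        this.meterRegistry = meterRegistry;
    }

    @Scheduled(fixedRate = 10_000) // Check every 10 seconds
    public void checkAbortConditions() {
        if (!settings.getChaosMonkeyProperties().isEnabled()) {
            return;
        }
        // Abort if error rate exceeds 10%
        double errorRate = getErrorRate();
        if (errorRate > 0.10) {
            disableChaos("error_rate", "Error rate exceeded 10%: " + errorRate);
            return;
        }
        // Abort if p99 latency exceeds 5 seconds
        double p99 = getP99Latency();
        if (p99 > 5000) {
            disableChaos("p99_latency", "P99 latency exceeded 5s: " + p99 + "ms");
            return;
        }
        // Abort if any circuit breaker has been open for > 5 minutes
        // (indicates resilience patterns are not recovering)
        // Not implemented in this listing.
    }

    private void disableChaos(String reasonCode, String detail) {
        settings.getChaosMonkeyProperties().setEnabled(false);
        log.warn("Chaos experiment aborted: {}", detail);
        // Tag with a fixed code, not the detail string: the detail embeds a
        // measured value and would create a new time series on every abort.
        meterRegistry.counter("chaos.experiment.aborted",
            "reason", reasonCode).increment();
    }

    // getErrorRate() and getP99Latency() read current values from meterRegistry
    // (a ratio and milliseconds, matching the thresholds above); omitted here.
}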
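The third abort condition in the controller is described only in a comment. A minimal sketch of the bookkeeping it needs follows, assuming the scheduled check polls each breaker and passes in whether it is currently open. The class is hypothetical and deliberately independent of Resilience4j; wiring it to `CircuitBreakerRegistry` is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: remembers when each breaker was first seen open.
final class OpenBreakerTracker {
    private final Map<String, Instant> openSince = new ConcurrentHashMap<>();

    /** Call on every poll with the breaker's current state. */
    void record(String breakerName, boolean open, Instant now) {
        if (open) {
            openSince.putIfAbsent(breakerName, now); // keep the earliest timestamp
        } else {
            openSince.remove(breakerName);           // closed or half-open resets the clock
        }
    }

    /** True if any tracked breaker has been continuously open for longer than the limit. */
    boolean anyOpenLongerThan(Duration limit, Instant now) {
        return openSince.values().stream()
                .anyMatch(since -> Duration.between(since, now).compareTo(limit) > 0);
    }
}
```

Because the tracker only sees the state at each poll, its resolution is the polling interval (10 seconds in the controller), which is fine for a 5 minute limit.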
Blast radius is the percentage of traffic or users affected by the experiment. Start with 1% of traffic (using a feature flag or traffic splitting), observe the impact, and increase gradually. Never run a chaos experiment that affects 100% of production traffic without validating at smaller percentages first.
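One simple way to limit an experiment to a fixed share of traffic is to hash a stable request attribute into 100 buckets and include only the lowest ones. The sketch below assumes such an attribute exists (a user ID or request ID); the class name is hypothetical, and a real rollout would more likely sit behind a feature flag service.

```java
// Hypothetical blast radius gate based on deterministic bucketing.
final class BlastRadius {
    private final int percent; // share of traffic exposed to chaos, 0..100

    BlastRadius(int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        this.percent = percent;
    }

    /** The same key always lands in the same bucket, so a user is consistently in or out. */
    boolean includes(String requestKey) {
        int bucket = Math.floorMod(requestKey.hashCode(), 100); // 0..99, never negative
        return bucket < percent;
    }
}
```

Because inclusion is `bucket < percent`, raising the percentage only adds traffic: every key affected at 1% is still affected at 5%, which keeps results comparable as the blast radius grows.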
Duration control. Set a maximum experiment duration. If the experiment runs longer than planned (for example, the engineer forgot to disable it or went to lunch), an automatic timer should disable the chaos injection.
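The duration limit can be expressed as a deadline that the periodic abort check consults alongside its other conditions. This is a minimal sketch with a hypothetical class name; taking the current time as a parameter keeps it easy to test.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical deadline for a single chaos experiment run.
final class ExperimentDeadline {
    private final Instant deadline;

    ExperimentDeadline(Instant startedAt, Duration maxDuration) {
        this.deadline = startedAt.plus(maxDuration);
    }

    /** True once the maximum duration has elapsed; the caller then disables chaos. */
    boolean expired(Instant now) {
        return !now.isBefore(deadline);
    }
}
```

A caller would create the deadline when the experiment is enabled and treat `expired(Instant.now())` as one more abort condition.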
Escalation. The chaos experiment controller should page the on-call engineer when abort conditions fire. The abort condition firing is itself a finding: the resilience patterns did not contain the blast radius as expected.
Experiment Results Analysis
After each chaos experiment, document:
- Hypothesis: What did you expect to happen?
- Actual behavior: What happened?
- Gaps found: Where did the system fail to meet the hypothesis?
- Recovery time: How long from chaos stop to steady state?
- Actions: What changes are needed to close the gaps?
## Experiment: Fraud Detection Latency Injection
Date: [redacted]
Duration: 10 minutes
Blast radius: 20% of fraud detection calls
### Hypothesis
Payment service p99 latency stays below 500ms.
Error rate stays below 1%.
Circuit breaker opens within 30 seconds.
### Actual Behavior
- p99 latency peaked at 2.1s before circuit breaker opened (gap)
- Error rate reached 3.2% during the transition period (gap)
- Circuit breaker opened after 45 seconds (close to target)
- After circuit breaker opened, p99 dropped to 50ms (fallback)
- Recovery after chaos stopped: 65 seconds (acceptable)
### Gaps
1. Slow call threshold was set to 2000ms, allowing 2s requests
through before the breaker triggered on slow call rate.
Reduce to 1000ms.
2. Error rate spike during transition: minimum-number-of-calls
was 20, so the breaker needed 20 recorded calls before it
evaluated the failure rate. Reduce to 10 for fraud detection.
### Actions
- [ ] Reduce slow-call-duration-threshold to 1000ms
- [ ] Reduce minimum-number-of-calls to 10
- [ ] Re-run experiment after changes
Each experiment generates concrete, actionable improvements to the resilience configuration. The cycle is: experiment, find gaps, fix, re-experiment. After several cycles, the system’s resilience behavior under each failure mode is well-characterized and the configuration is tuned to real-world conditions, not theoretical calculations.