Chaos Engineering

Integration tests (Chapter 14) verify that resilience patterns activate under controlled, predictable failures. Chaos engineering verifies that the entire system behaves correctly under uncontrolled, realistic failures. The difference: an integration test configures WireMock to return a 500. A chaos experiment kills the fraud detection service’s database connection pool while 500 real requests per second are flowing through the system.

Integration tests answer: “Does the circuit breaker open when the fraud service returns errors?” Chaos experiments answer: “When the fraud service degrades under load, does the payment service maintain acceptable response times, and does it recover automatically when the fraud service stabilizes?”

Chaos Monkey for Spring Boot

<!-- PRODUCTION - Maven dependency -->
<dependency>
    <groupId>de.codecentric</groupId>
    <artifactId>chaos-monkey-spring-boot</artifactId>
    <version>3.1.0</version>
    <scope>runtime</scope>
</dependency>

# PRODUCTION - Chaos Monkey configuration
# Only active with the 'chaos-monkey' Spring profile
spring:
  profiles:
    active: chaos-monkey

chaos:
  monkey:
    enabled: true
    assaults:
      level: 5 # Attack roughly 1 in 5 calls
      latency-active: true
      latency-range-start: 500
      latency-range-end: 3000 # Add 500ms-3s latency
      exceptions-active: false # Start with latency only
      kill-application-active: false
    watcher:
      rest-controller: true # Attack REST controllers
      service: true # Attack @Service beans
      repository: false # Do not attack database calls (yet)
      component: false

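Chaos Monkey intercepts Spring bean method calls and injects failures. The level setting controls frequency: at level 5, roughly one call in five is attacked (by default the selection is randomized, so this is an average rather than a strict every-fifth-call cadence). The watcher settings control which bean types are targeted.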
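To get a feel for what these settings cost a single watched method, the expected overhead can be worked out by hand. The sketch below is plain arithmetic, not part of the Chaos Monkey API: the class and method names are hypothetical, and it assumes delays are drawn uniformly between latency-range-start and latency-range-end.

```java
/** Back-of-the-envelope estimate of the latency an assault level adds (illustrative only). */
final class AssaultLoadEstimate {

    private AssaultLoadEstimate() {}

    /** Fraction of calls attacked when roughly one call in {@code level} is hit. */
    static double attackedFraction(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        return 1.0 / level;
    }

    /** Mean latency added per call in ms, assuming uniformly distributed delays. */
    static double meanAddedLatencyMs(int level, long rangeStartMs, long rangeEndMs) {
        if (rangeEndMs < rangeStartMs) {
            throw new IllegalArgumentException("range end must not be below range start");
        }
        double meanDelayMs = (rangeStartMs + rangeEndMs) / 2.0;
        return attackedFraction(level) * meanDelayMs;
    }
}
```

With the configuration above, `meanAddedLatencyMs(5, 500, 3000)` works out to about 350 ms of added latency per watched call on average. Because both REST controllers and @Service beans are watched, a request that passes through both layers can be delayed more than once.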

Designing a Chaos Experiment

Every chaos experiment follows the scientific method:

1. Steady state hypothesis. Define what “normal” looks like in measurable terms.

Under normal operation, the payment service processes 500 requests per second with p99 latency below 200ms, error rate below 0.1%, and fraud check success rate above 99%.

2. Introduce the variable. Inject a specific failure.

Enable latency injection (500ms-3000ms) on the fraud detection client, affecting 20% of calls.

3. Observe the impact. Measure the deviation from steady state.

Observe payment service response time, error rate, circuit breaker state, and fallback activation rate.

4. Verify the hypothesis. Either the system maintained its steady state (resilience patterns worked) or it did not (gaps identified).
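The four steps are easier to automate when the steady state is written down as explicit thresholds rather than prose. The following is a minimal sketch under that assumption; `Observation` and `SteadyStateHypothesis` are hypothetical names introduced for illustration, not types from any library.

```java
import java.util.ArrayList;
import java.util.List;

/** Metrics observed over one measurement window (field names are illustrative). */
record Observation(double p99LatencyMs, double errorRate, double fraudCheckSuccessRate) {}

/** A steady-state hypothesis expressed as explicit, checkable thresholds. */
record SteadyStateHypothesis(double maxP99LatencyMs, double maxErrorRate,
                             double minFraudCheckSuccessRate) {

    /** Returns one message per violated threshold; an empty list means the hypothesis held. */
    List<String> violations(Observation o) {
        List<String> found = new ArrayList<>();
        if (o.p99LatencyMs() >= maxP99LatencyMs) {
            found.add("p99 latency " + o.p99LatencyMs() + "ms is not below " + maxP99LatencyMs + "ms");
        }
        if (o.errorRate() >= maxErrorRate) {
            found.add("error rate " + o.errorRate() + " is not below " + maxErrorRate);
        }
        if (o.fraudCheckSuccessRate() <= minFraudCheckSuccessRate) {
            found.add("fraud check success rate " + o.fraudCheckSuccessRate()
                    + " is not above " + minFraudCheckSuccessRate);
        }
        return found;
    }
}
```

The example steady state from step 1 becomes `new SteadyStateHypothesis(200, 0.001, 0.99)`. Calling `violations(...)` on the measurements taken during the experiment then yields the list of gaps directly.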

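// PRODUCTION - Chaos experiment: fraud detection latency injection
import static org.assertj.core.api.Assertions.assertThat;

import com.github.tomakehurst.wiremock.client.WireMock;
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.micrometer.core.instrument.MeterRegistry;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.web.client.TestRestTemplate;
import org.springframework.http.ResponseEntity;

@SpringBootTest(
        webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT,
        properties = {
                "chaos.monkey.enabled=true",
                // Latency assault starts disabled so Phase 1 measures a clean baseline
                "chaos.monkey.assaults.latency-active=false",
                "chaos.monkey.assaults.latency-range-start=1000",
                "chaos.monkey.assaults.latency-range-end=3000",
                "chaos.monkey.assaults.level=5",
                "chaos.monkey.watcher.service=true"
        })
class FraudLatencyChaosExperiment extends ResilienceTestBase {

    @Autowired
    private TestRestTemplate restTemplate;

    @Autowired
    private MeterRegistry meterRegistry;

    @Autowired
    private CircuitBreakerRegistry cbRegistry;

    @Autowired
    private ChaosMonkeySettings chaosSettings;

    @Test
    void paymentServiceMaintainsResponseTime_underFraudLatency() {
        // Configure fraud service with normal response
        fraudWireMock().register(
                WireMock.post("/fraud/score")
                        .willReturn(WireMock.okJson(
                                "{\"score\":0.1,\"decision\":\"PERMIT\"}")));

        // Phase 1: Establish steady state (100 requests, no chaos yet)
        List<Long> steadyStateLatencies = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            long start = System.nanoTime();
            restTemplate.postForEntity("/payments",
                    samplePayment(), PaymentResponse.class);
            steadyStateLatencies.add(
                    Duration.ofNanos(System.nanoTime() - start).toMillis());
        }

        double steadyP99 = percentile(steadyStateLatencies, 99);

        // Phase 2: Run under chaos (500 requests)
        // Switch the latency assault on at runtime; from here on Chaos Monkey
        // delays roughly one in five @Service calls
        chaosSettings.getAssaultProperties().setLatencyActive(true);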
        List<Long> chaosLatencies = new ArrayList<>();
        AtomicInteger errors = new AtomicInteger();

        for (int i = 0; i < 500; i++) {
            long start = System.nanoTime();
            ResponseEntity<PaymentResponse> response =
                    restTemplate.postForEntity("/payments",
                            samplePayment(), PaymentResponse.class);
            chaosLatencies.add(
                    Duration.ofNanos(System.nanoTime() - start).toMillis());

            if (!response.getStatusCode().is2xxSuccessful()) {
                errors.incrementAndGet();
            }
        }

        double chaosP99 = percentile(chaosLatencies, 99);

        // Hypothesis: p99 latency should not exceed 3x steady state
        // (TimeLimiter at 2s caps the impact)
        assertThat(chaosP99).isLessThan(steadyP99 * 3);

        // Hypothesis: error rate should remain below 5%
        double errorRate = (double) errors.get() / 500;
        assertThat(errorRate).isLessThan(0.05);

        // Verify the circuit breaker saw the injected latency (recorded slow calls)
        CircuitBreaker cb = cbRegistry.circuitBreaker("fraudDetection");
        assertThat(cb.getMetrics().getNumberOfSlowCalls())
                .isGreaterThan(0);
    }
}
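The test above calls a `percentile` helper that is not shown in the listing. One possible implementation is sketched here using the nearest-rank method; the class name `LatencyStats` is hypothetical, and the original helper may use a different interpolation rule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Nearest-rank percentile over a list of latency samples (illustrative helper). */
final class LatencyStats {

    private LatencyStats() {}

    /**
     * Returns the smallest sample such that at least p percent of all samples
     * are less than or equal to it.
     */
    static double percentile(List<Long> samples, double p) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        List<Long> sorted = new ArrayList<>(samples); // leave the caller's list untouched
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0); // 1-based rank
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

With only 100 samples, the p99 value is simply the second-largest observation, so a couple of slow requests can move it considerably. Larger sample counts give a steadier baseline for the 3x comparison.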

The Recovery Curve

[Figure: Chaos Experiment Recovery Curve]

The recovery curve shows four phases:

Steady state (0-5min). Normal operation. Response time flat, error rate near zero.

Chaos injection (5min mark). Latency injection begins. Response time increases as some requests hit injected delays.

Degraded steady state (5-15min). The circuit breaker opens for the fraud detection dependency. Response time stabilizes at a new level (fallback responses are fast). Error rate may spike briefly during the transition, then stabilize.

Recovery (15min mark). Chaos injection stops. The circuit breaker transitions through half-open to closed. Response time returns to the original steady state. The recovery time (the time from chaos stop to steady state restoration) is the key metric. A good recovery takes under 2 minutes. A poor recovery takes 10 minutes or more, typically because the circuit breaker's wait duration in the open state is too long, retry backoff intervals have grown too long, or cached data has expired and must be repopulated.
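Recovery time can be read off a chart, but computing it from the recorded samples makes runs comparable. The sketch below assumes p99 readings are collected at regular intervals and defines "recovered" as the first reading after which p99 stays within a tolerance of the baseline. `LatencySample` and `RecoveryTime` are hypothetical names, and the tolerance rule is one reasonable choice among several.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

/** One p99 latency reading taken at a point in time. */
record LatencySample(Instant at, double p99Ms) {}

final class RecoveryTime {

    private RecoveryTime() {}

    /**
     * Time from chaosStop until p99 returns to within tolerance (e.g. 0.1 = 10%)
     * of the baseline and stays there for all remaining samples.
     * Samples must be ordered by time. Empty if the system never recovered.
     */
    static Optional<Duration> measure(List<LatencySample> samples, Instant chaosStop,
                                      double baselineP99Ms, double tolerance) {
        double allowedDeviation = baselineP99Ms * tolerance;
        Instant recoveredAt = null;
        for (LatencySample s : samples) {
            if (s.at().isBefore(chaosStop)) {
                continue; // only readings taken after injection stopped count
            }
            boolean nearBaseline = Math.abs(s.p99Ms() - baselineP99Ms) <= allowedDeviation;
            if (!nearBaseline) {
                recoveredAt = null;      // relapse: discard the earlier candidate
            } else if (recoveredAt == null) {
                recoveredAt = s.at();    // first reading of a stable run
            }
        }
        return Optional.ofNullable(recoveredAt)
                .map(t -> Duration.between(chaosStop, t));
    }
}
```

The deviation is measured in both directions on purpose: while the breaker is open, fallback responses can be faster than normal, and that still counts as degraded rather than recovered.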

Safe Chaos: Blast Radius Control

Chaos experiments in staging or production require safety controls:

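// PRODUCTION - Chaos experiment with abort conditions
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import io.micrometer.core.instrument.MeterRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ChaosExperimentController {

    private static final Logger log =
            LoggerFactory.getLogger(ChaosExperimentController.class);

    private final ChaosMonkeySettings settings;
    private final MeterRegistry meterRegistry;

    public ChaosExperimentController(ChaosMonkeySettings settings,
                                     MeterRegistry meterRegistry) {
        this.settings = settings;
        this.meterRegistry = meterRegistry;
    }

    // Requires @EnableScheduling on a configuration class
    @Scheduled(fixedRate = 10_000) // Check every 10 seconds
    public void checkAbortConditions() {
        if (!settings.getChaosMonkeyProperties().isEnabled()) {
            return;
        }

        // Abort if error rate exceeds 10%
        double errorRate = getErrorRate();
        if (errorRate > 0.10) {
            disableChaos("error_rate", "Error rate exceeded 10%: " + errorRate);
            return;
        }

        // Abort if p99 latency exceeds 5 seconds
        double p99 = getP99Latency();
        if (p99 > 5000) {
            disableChaos("p99_latency", "P99 latency exceeded 5s: " + p99 + "ms");
            return;
        }

        // Abort if any circuit breaker has been open for > 5 minutes
        // (indicates resilience patterns are not recovering); check not shown
    }

    // getErrorRate() and getP99Latency() are omitted from this listing;
    // they read the current values from the metrics backend.

    private void disableChaos(String reasonCode, String detail) {
        settings.getChaosMonkeyProperties().setEnabled(false);
        log.warn("Chaos experiment aborted: {}", detail);
        // Tag with a fixed reason code: the detail string embeds measured
        // values and would create a new time series on every abort
        meterRegistry.counter("chaos.experiment.aborted",
                "reason", reasonCode).increment();
    }
}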
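The third abort condition, a circuit breaker that stays open for more than five minutes, needs a record of when each breaker opened. A minimal sketch of such bookkeeping follows. `OpenBreakerTracker` is a hypothetical class, and the assumption is that its two callbacks are wired to the circuit breaker's state transition events (for example through Resilience4j's event publisher).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Remembers when each circuit breaker left the closed state (illustrative helper). */
final class OpenBreakerTracker {

    private final Map<String, Instant> openedAt = new ConcurrentHashMap<>();

    /** Call when a breaker transitions to OPEN; keeps the earliest timestamp. */
    void onOpened(String breakerName, Instant when) {
        openedAt.putIfAbsent(breakerName, when);
    }

    /** Call when a breaker transitions back to CLOSED. */
    void onClosed(String breakerName) {
        openedAt.remove(breakerName);
    }

    /** True if any tracked breaker has been open for longer than the limit. */
    boolean anyOpenLongerThan(Duration limit, Instant now) {
        return openedAt.values().stream()
                .anyMatch(since -> Duration.between(since, now).compareTo(limit) > 0);
    }
}
```

Because the timestamp is cleared only on CLOSED, a breaker that bounces between open and half-open without ever closing keeps its original timestamp. That is the situation this abort condition is meant to catch. The scheduled check could then call `anyOpenLongerThan(Duration.ofMinutes(5), Instant.now())`.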

Blast radius is the percentage of traffic or users affected by the experiment. Start with 1% of traffic (using a feature flag or traffic splitting), observe the impact, and increase gradually. Never run a chaos experiment that affects 100% of production traffic without validating at smaller percentages first.
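Where no feature-flag or traffic-splitting tool is available, a percentage-based blast radius can be approximated by hashing a stable request attribute into 100 buckets. The sketch below is an illustration under that assumption; `BlastRadius` and the choice of key (for example a customer ID) are hypothetical.

```java
/** Deterministic percentage sampling for limiting an experiment's blast radius. */
final class BlastRadius {

    private BlastRadius() {}

    /**
     * Returns true if the given key falls inside the experiment.
     * The same key always gets the same answer for a given percentage,
     * and raising the percentage only ever adds keys, never removes them.
     */
    static boolean inExperiment(String requestKey, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        int bucket = Math.floorMod(requestKey.hashCode(), 100); // 0..99
        return bucket < percent;
    }
}
```

Keying on a customer ID rather than a per-request random number keeps each customer consistently inside or outside the experiment, which makes the observed impact easier to attribute. `String.hashCode` is not a strong hash, so the actual share of traffic may deviate somewhat from the nominal percentage.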

Duration control. Set a maximum experiment duration. If the experiment runs longer than planned (the engineer forgot to disable it or went to lunch), an automatic timer should disable the chaos injection.
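A maximum duration is simple to enforce if the experiment records its start time. The following sketch shows the idea; `ExperimentDeadline` is a hypothetical type, and the current time is passed in explicitly so the check is easy to test.

```java
import java.time.Duration;
import java.time.Instant;

/** Hard upper bound on how long a chaos experiment may run (illustrative). */
record ExperimentDeadline(Instant startedAt, Duration maxDuration) {

    /** True once the maximum duration has elapsed at the given instant. */
    boolean expiredAt(Instant now) {
        return !now.isBefore(startedAt.plus(maxDuration));
    }
}
```

A periodic abort check can evaluate `deadline.expiredAt(Instant.now())` alongside the error-rate and latency conditions and disable chaos injection when it returns true, whether or not anyone is watching the experiment.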

Escalation. The chaos experiment controller should page the on-call engineer when abort conditions fire. The abort condition firing is itself a finding: the resilience patterns did not contain the blast radius as expected.

Experiment Results Analysis

After each chaos experiment, document:

Hypothesis: What did you expect to happen?
Actual behavior: What happened?
Gaps found: Where did the system fail to meet the hypothesis?
Recovery time: How long from chaos stop to steady state?
Actions: What changes are needed to close the gaps?

## Experiment: Fraud Detection Latency Injection

Date: [redacted]
Duration: 10 minutes
Blast radius: 20% of fraud detection calls

### Hypothesis

Payment service p99 latency stays below 500ms.
Error rate stays below 1%.
Circuit breaker opens within 30 seconds.

### Actual Behavior

- p99 latency peaked at 2.1s before circuit breaker opened (gap)
- Error rate reached 3.2% during the transition period (gap)
- Circuit breaker opened after 45 seconds (15 seconds over target)
- After circuit breaker opened, p99 dropped to 50ms (fallback)
- Recovery after chaos stopped: 65 seconds (acceptable)

### Gaps

1. Slow call threshold was set to 2000ms, allowing 2s requests
   through before the breaker triggered on slow call rate.
   Reduce to 1000ms.
2. Error rate spike during transition: minimum-number-of-calls
   was 20, so 20 calls had to be recorded before the breaker
   evaluated failure and slow call rates.
   Reduce to 10 for fraud detection.

### Actions

- [ ] Reduce slow-call-duration-threshold to 1000ms
- [ ] Reduce minimum-number-of-calls to 10
- [ ] Re-run experiment after changes

Each experiment generates concrete, actionable improvements to the resilience configuration. The cycle is: experiment, find gaps, fix, re-experiment. After several cycles, the system’s resilience behavior under each failure mode is well-characterized and the configuration is tuned to real-world conditions, not theoretical calculations.