surviving the spike

Chaos Toolkit and Steady State Hypotheses

8 min read Chapter 59 of 66

The Symptom

The team runs their first chaos experiment by hand. An engineer SSHs into the staging Kubernetes cluster, deletes the surge pricing pod, watches the Grafana dashboard for 30 seconds, says “looks fine,” and restarts the pod. The “experiment” took 45 seconds. No metrics were recorded. No baseline was established. No one knows if the 2% error spike on the dashboard was from the experiment or from the usual background noise.

Two weeks later, the same failure happens in production. The circuit breaker opens, but the fallback returns stale data for 8 minutes. The engineer who ran the chaos test says “it worked in staging.” It did not work. It was never tested. Someone watched a dashboard for 30 seconds and called it done.

The Cause

Chaos engineering without tooling is a human judgment call disguised as a scientific experiment. The experiment needs:

  1. A quantitative definition of “working” (steady state hypothesis)
  2. Automated measurement before, during, and after the failure injection
  3. A pass/fail verdict based on data, not dashboards
  4. A rollback that fires automatically if the experiment goes wrong

Chaos Toolkit provides all four. It is open source, YAML-driven, and extensible with Python drivers for Kubernetes, Prometheus, HTTP, and process-level operations.
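The four requirements amount to a small control loop. The following is a minimal Python sketch of that loop under stated assumptions, not Chaos Toolkit's actual implementation: `steady_state_holds`, `run_experiment`, the probe callables, and the tolerance values are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Tuple

Probe = Callable[[], float]
Range = Tuple[float, float]


def steady_state_holds(probes: Dict[str, Probe], tolerances: Dict[str, Range]) -> bool:
    """Return True only if every probe value lies inside its tolerance range."""
    for name, probe in probes.items():
        low, high = tolerances[name]
        if not (low <= probe() <= high):
            return False
    return True


def run_experiment(
    probes: Dict[str, Probe],
    tolerances: Dict[str, Range],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Check the baseline, inject the failure, check again, and always roll back."""
    if not steady_state_holds(probes, tolerances):
        return "aborted"  # no valid baseline, so nothing is injected
    try:
        inject()
        verdict = "passed" if steady_state_holds(probes, tolerances) else "failed"
    finally:
        rollback()  # runs whether the second check passed, failed, or raised
    return verdict
```

The verdict comes from comparing numbers against ranges, and the rollback sits in a `finally` block so it cannot be forgotten the way a manual pod restart can.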

The Baseline

Manual chaos testing process:

Step   Action                    Problem
1      SSH into staging          No audit trail
2      kubectl delete pod        No controlled timing
3      Watch Grafana             Subjective, no recording
4      "Looks fine"              No quantitative threshold
5      Recreate pod              Manual, might forget

No recorded metrics. No pass/fail criteria. No way to compare results across experiments or over time.

The Fix

Installation

# SCALED: Chaos Toolkit installation
pip install chaostoolkit \
            chaostoolkit-kubernetes \
            chaostoolkit-prometheus \
            chaostoolkit-reporting

# Verify installation
chaos --version
chaos info extensions

Steady State Hypothesis

The hypothesis defines what “normal” looks like in numbers. Chaos Toolkit evaluates it before the method runs, to confirm there is a valid baseline, and again afterward. If the steady state still holds after the failure injection, the resilience patterns are working.

# SCALED: Steady state hypothesis for the ride-hailing platform
steady-state-hypothesis:
  title: "Ride booking operates within SLO"
  probes:
    # Probe 1: p99 latency under 500ms
    - type: probe
      name: "p99-latency-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book",
                status="200"
              }[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]

    # Probe 2: Error rate under 0.1%
    - type: probe
      name: "error-rate-under-threshold"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book",
              status=~"5.."}[1m]))
            /
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m]))
            * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]

    # Probe 3: Bookings completing (health endpoint)
    - type: probe
      name: "rider-api-healthy"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
        timeout: 5
      tolerance:
        status: 200

    # Probe 4: Locust confirms requests succeeding
    - type: probe
      name: "locust-success-rate"
      provider:
        type: http
        url: "http://locust:8089/stats/requests"
        timeout: 5
      tolerance:
        type: jsonpath
        path: "$.stats[?(@.name=='/api/rides/book')].current_fail_per_sec"
        expect:
          type: range
          range: [0, 5]

Four probes. p99 latency, error rate, health endpoint, and Locust success rate. All four must pass for the steady state to hold. If any probe fails, the experiment reports a steady state violation.

Probes: Prometheus, HTTP, and Locust

Prometheus probe queries PromQL directly. The query_interval function runs the query over the specified time range and checks whether all data points fall within the tolerance range.
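To make the "all data points within tolerance" check concrete, here is a rough Python sketch that applies it to a response shaped like the Prometheus range-query API reply. It is an illustration, not the extension's own code. The function name `points_outside_range` and the sample values are assumptions.

```python
from typing import Any, Dict, List, Tuple


def points_outside_range(
    response: Dict[str, Any], low: float, high: float
) -> List[Tuple[float, float]]:
    """Collect (timestamp, value) samples that fall outside [low, high]."""
    violations = []
    for series in response["data"]["result"]:
        for timestamp, raw_value in series["values"]:
            value = float(raw_value)  # Prometheus encodes sample values as strings
            if not (low <= value <= high):  # NaN fails the comparison and is reported
                violations.append((float(timestamp), value))
    return violations


# Hypothetical p99 latency samples in seconds, shaped like a query_range reply
sample = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {},
                "values": [
                    [1700000000, "0.21"],
                    [1700000015, "0.48"],
                    [1700000030, "2.34"],
                ],
            }
        ],
    },
}

print(points_outside_range(sample, 0.0, 0.5))  # prints [(1700000030.0, 2.34)]
```

Treating NaN as a violation is a deliberate choice in this sketch: `histogram_quantile` returns NaN when the buckets hold no observations, and "no traffic" should not count as "latency within SLO".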

HTTP probe hits an endpoint and checks the status code. Simple but essential. If the health endpoint returns 503, something is fundamentally broken.

Locust probe queries the Locust statistics API. Locust exposes real-time stats at /stats/requests with current request rate, failure rate, and percentile latencies. The probe checks whether the failure rate per second is below threshold.
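The Locust check can be sketched in a few lines of Python. The field names `stats`, `name`, and `current_fail_per_sec` are taken from the JSONPath expression in the probe above. The helper names and the sample payload values are hypothetical.

```python
from typing import Any, Dict, Optional


def failure_rate_for(stats_payload: Dict[str, Any], endpoint: str) -> Optional[float]:
    """Return current_fail_per_sec for the named endpoint, or None if it has no entry."""
    for entry in stats_payload.get("stats", []):
        if entry.get("name") == endpoint:
            return float(entry["current_fail_per_sec"])
    return None


def locust_probe_passes(
    stats_payload: Dict[str, Any], endpoint: str, max_failures_per_sec: float
) -> bool:
    """Pass only if the endpoint is present and its failure rate is within the limit."""
    rate = failure_rate_for(stats_payload, endpoint)
    # A missing entry means Locust recorded no traffic, which this sketch treats as a failure
    return rate is not None and 0 <= rate <= max_failures_per_sec


# Hypothetical payload in the shape the probe's JSONPath expects
payload = {
    "stats": [
        {"name": "/api/rides/book", "current_rps": 850.0, "current_fail_per_sec": 1.2},
        {"name": "Aggregated", "current_rps": 900.0, "current_fail_per_sec": 1.3},
    ]
}

print(locust_probe_passes(payload, "/api/rides/book", 5))
```

Failing on a missing entry guards against a silent pass when the load generator has stopped sending requests to the endpoint under test.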

Actions: Kill Pod and Inject Latency

Kill a Kubernetes pod:

# SCALED: Action to kill surge pricing pod
- type: action
  name: "kill-surge-pricing-pod"
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: "app=surge-pricing"
      ns: "ride-hailing"
      qty: 1
      rand: true
      grace_period: 0
  pauses:
    after: 30 # Wait 30s for system to react

Inject network latency with tc:

# SCALED: Action to inject 500ms latency to PostgreSQL
- type: action
  name: "inject-pg-latency"
  provider:
    type: process
    path: "kubectl"
    arguments:
      - "exec"
      - "-n"
      - "ride-hailing"
      - "deploy/rider-api"
      - "--"
      - "tc"
      - "qdisc"
      - "add"
      - "dev"
      - "eth0"
      - "root"
      - "netem"
      - "delay"
      - "500ms"
      - "50ms" # 50ms jitter
      - "distribution"
      - "normal"
  pauses:
    after: 60 # Let latency soak for 60 seconds

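The tc (traffic control) command adds 500ms of latency with 50ms jitter to all outbound traffic on the rider API pod's eth0 interface, so it slows calls to PostgreSQL, Redis, and every other network dependency, not PostgreSQL alone. The container must include the tc binary (from iproute2) and run with the NET_ADMIN capability, or the command fails. To target only PostgreSQL, use iptables to mark PostgreSQL traffic and apply tc only to marked packets.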

Rollback

# SCALED: Rollback actions
rollbacks:
  # Rollback for killed pod (Kubernetes restarts automatically, but force it)
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"

  # Rollback for injected latency
  - type: action
    name: "remove-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        - "exec"
        - "-n"
        - "ride-hailing"
        - "deploy/rider-api"
        - "--"
        - "tc"
        - "qdisc"
        - "del"
        - "dev"
        - "eth0"
        - "root"

Rollbacks fire in two cases: the experiment completes, or the experiment aborts due to a safety check. Every action must have a corresponding rollback. No rollback means no experiment.
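The "no rollback means no experiment" rule can be enforced mechanically before anything runs. The sketch below is one possible pre-flight lint, assuming the experiment has already been loaded into a dictionary and that the team maintains a mapping from action names to rollback names. The function `actions_without_rollback` and the `pairs` mapping are hypothetical and not part of Chaos Toolkit.

```python
from typing import Any, Dict, List


def actions_without_rollback(
    experiment: Dict[str, Any], pairs: Dict[str, str]
) -> List[str]:
    """Return the names of method actions whose paired rollback is not declared."""
    rollback_names = {r["name"] for r in experiment.get("rollbacks", [])}
    missing = []
    for activity in experiment.get("method", []):
        if activity.get("type") != "action":
            continue  # probes only read state, so they need no rollback
        # An action with no entry in `pairs` is also reported as missing
        if pairs.get(activity["name"]) not in rollback_names:
            missing.append(activity["name"])
    return missing


# Team convention (hypothetical): which rollback undoes which action
pairs = {
    "kill-surge-pricing-pod": "restart-surge-pricing",
    "inject-pg-latency": "remove-pg-latency",
}

experiment = {
    "method": [
        {"type": "action", "name": "kill-surge-pricing-pod"},
        {"type": "action", "name": "inject-pg-latency"},
    ],
    "rollbacks": [{"type": "action", "name": "restart-surge-pricing"}],
}

print(actions_without_rollback(experiment, pairs))  # prints ['inject-pg-latency']
```

Running a check like this in CI would reject an experiment file that injects latency but never removes it.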

Complete Experiment: Kill Surge Pricing

# SCALED: Full experiment - kill surge pricing service
version: 1.0.0
title: "Kill Surge Pricing Service"
description: >
  Verify that the ride booking service continues to operate
  when the surge pricing service is killed. The circuit breaker
  should open, the fallback should return cached multipliers,
  and ride bookings should continue with zero user-facing errors.
tags:
  - "resilience"
  - "circuit-breaker"
  - "surge-pricing"
contributions:
  reliability: "high"
  security: "none"
  scalability: "medium"

# Define what "working" looks like
steady-state-hypothesis:
  title: "Ride bookings continue within SLO"
  probes:
    - type: probe
      name: "p99-latency-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            histogram_quantile(0.99,
              sum(rate(http_server_requests_seconds_bucket{
                uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]

    - type: probe
      name: "error-rate-under-0.1-percent"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          api_url: "http://prometheus:9090"
          query: >
            sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
              uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]

    - type: probe
      name: "rider-api-healthy"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
        timeout: 5
      tolerance:
        status: 200

# What we break
method:
  # Step 1: Confirm the circuit breaker is closed before injection
  - type: probe
    name: "pre-check-circuit-breaker-closed"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        api_url: "http://prometheus:9090"
        query: >
          resilience4j_circuitbreaker_state{name="surgePricing"}
    tolerance:
      type: range
      range: [0, 0] # 0 = CLOSED

  # Step 2: Kill the surge pricing service
  - type: action
    name: "kill-surge-pricing"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=surge-pricing"
        ns: "ride-hailing"
        all: true # Kill all replicas (qty: 1 would terminate only one pod)
        rand: false
        grace_period: 0
    pauses:
      after: 30 # Wait 30 seconds

  # Step 3: Verify circuit breaker opened
  - type: probe
    name: "circuit-breaker-should-be-open"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        api_url: "http://prometheus:9090"
        query: >
          resilience4j_circuitbreaker_state{name="surgePricing"}
    tolerance:
      type: range
      range: [1, 1] # 1 = OPEN

  # Step 4: Verify fallback is serving cached multipliers
  - type: probe
    name: "fallback-serving-cached-data"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        api_url: "http://prometheus:9090"
        query: >
          increase(surge_fallback_used_total[1m])
    tolerance:
      type: range
      range: [1, 100000] # At least 1 fallback call

# How we clean up
rollbacks:
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"

Running the Experiment

# SCALED: Run the experiment with journal output
chaos run chaos/experiments/kill-surge-pricing.yaml \
  --journal-path chaos/results/kill-surge-$(date +%Y%m%d-%H%M%S).json

# Output:
# [INFO] Experiment: Kill Surge Pricing Service
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO]   Probe: p99-latency-under-500ms [PASSED]
# [INFO]   Probe: error-rate-under-0.1-percent [PASSED]
# [INFO]   Probe: rider-api-healthy [PASSED]
# [INFO] Action: kill-surge-pricing
# [INFO]   Pausing after action for 30s
# [INFO] Probe: circuit-breaker-should-be-open [PASSED]
# [INFO] Probe: fallback-serving-cached-data [PASSED]
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO]   Probe: p99-latency-under-500ms [PASSED]
# [INFO]   Probe: error-rate-under-0.1-percent [PASSED]
# [INFO]   Probe: rider-api-healthy [PASSED]
# [INFO] Experiment ended: PASSED
# [INFO] Rollback: restart-surge-pricing

The experiment checks steady state before and after the action. Both checks must pass for the experiment to pass. The journal file records every probe result, timing, and tolerance evaluation.
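Because the journal is plain JSON, results can be compared across runs with a short script. The sketch below summarizes one journal. The key names (`status`, `deviated`, `steady_states`, `before`, `after`, `probes`, `activity`, `tolerance_met`) are assumptions about the journal layout and should be verified against a journal produced by your Chaos Toolkit version. The sample data is made up.

```python
from typing import Any, Dict, List


def summarize_journal(journal: Dict[str, Any]) -> List[str]:
    """Produce one line for the overall verdict and one per steady-state probe."""
    lines = [f"status={journal.get('status')} deviated={journal.get('deviated')}"]
    steady_states = journal.get("steady_states") or {}
    for phase in ("before", "after"):
        state = steady_states.get(phase) or {}  # a phase may be absent if the run aborted
        for run in state.get("probes", []):
            name = run.get("activity", {}).get("name", "?")
            verdict = "PASSED" if run.get("tolerance_met") else "FAILED"
            lines.append(f"{phase}: {name} {verdict}")
    return lines


# Hypothetical journal excerpt; in practice, load it with json.load(open(path))
sample_journal = {
    "status": "failed",
    "deviated": True,
    "steady_states": {
        "before": {
            "probes": [
                {"activity": {"name": "p99-latency-under-500ms"}, "tolerance_met": True}
            ]
        },
        "after": {
            "probes": [
                {"activity": {"name": "p99-latency-under-500ms"}, "tolerance_met": False}
            ]
        },
    },
}

for line in summarize_journal(sample_journal):
    print(line)
```

Storing these summaries alongside the timestamped journal files gives a history of which probes regressed between runs.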

# Generate HTML report from journal
chaos report --export-format=html \
  chaos/results/kill-surge-*.json \
  chaos/results/report.html

The Proof

The experiment framework validates itself. Run the kill-surge-pricing experiment without the circuit breaker enabled:

# BOTTLENECK: Run without resilience patterns
# Set resilience4j.circuitbreaker.instances.surgePricing.enabled=false
chaos run chaos/experiments/kill-surge-pricing.yaml

# Output:
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO]   Probe: p99-latency-under-500ms [PASSED]
# [INFO]   Probe: error-rate-under-0.1-percent [PASSED]
# [INFO] Action: kill-surge-pricing
# [INFO]   Pausing after action for 30s
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO]   Probe: p99-latency-under-500ms [FAILED]
#          Value: 2.34 not in range [0, 0.5]
# [INFO]   Probe: error-rate-under-0.1-percent [FAILED]
#          Value: 34.2 not in range [0, 0.1]
# [INFO] Experiment ended: FAILED (steady state violated)
# [INFO] Rollback: restart-surge-pricing

Without the circuit breaker, the experiment correctly reports failure. p99 at 2.34 seconds. Error rate at 34.2%. The experiment detects the gap.

Re-enable the circuit breaker. Re-run. The experiment passes. The circuit breaker is validated under real load with automated verification. Not a human watching a dashboard. A quantitative pass/fail based on SLO thresholds.