Chaos Toolkit and Steady State Hypotheses
The Symptom
The team runs their first chaos experiment by hand. An engineer SSHs into the staging Kubernetes cluster, deletes the surge pricing pod, watches the Grafana dashboard for 30 seconds, says “looks fine,” and restarts the pod. The “experiment” took 45 seconds. No metrics were recorded. No baseline was established. No one knows if the 2% error spike on the dashboard was from the experiment or from the usual background noise.
Two weeks later, the same failure happens in production. The circuit breaker opens, but the fallback returns stale data for 8 minutes. The engineer who ran the chaos test says “it worked in staging.” It did not work. It was never tested. Someone watched a dashboard for 30 seconds and called it done.
The Cause
Chaos engineering without tooling is a human judgment call disguised as a scientific experiment. The experiment needs:
- A quantitative definition of “working” (steady state hypothesis)
- Automated measurement before, during, and after the failure injection
- A pass/fail verdict based on data, not dashboards
- A rollback that fires automatically if the experiment goes wrong
Chaos Toolkit provides all four. It is open source, driven by declarative experiment files (YAML or JSON), and extensible: HTTP and process providers are built in, and Python extensions add Kubernetes and Prometheus support.
The Baseline
Manual chaos testing process:
Step  Action              Problem
----  ------------------  -------------------------
1     SSH into staging    No audit trail
2     kubectl delete pod  No controlled timing
3     Watch Grafana       Subjective, no recording
4     "Looks fine"        No quantitative threshold
5     Recreate pod        Manual, might forget
No recorded metrics. No pass/fail criteria. No way to compare results across experiments or over time.
The Fix
Installation
# SCALED: Chaos Toolkit installation
pip install chaostoolkit \
    chaostoolkit-kubernetes \
    chaostoolkit-prometheus \
    chaostoolkit-reporting
# Verify installation
chaos --version
chaos info extensions
Steady State Hypothesis
The hypothesis defines what “normal” looks like in numbers. Chaos Toolkit evaluates it before the method runs and again afterwards; if the steady state holds both times, the resilience patterns are working.
# SCALED: Steady state hypothesis for the ride-hailing platform
steady-state-hypothesis:
  title: "Ride booking operates within SLO"
  probes:
    # Probe 1: p99 latency under 500ms
    # The Prometheus URL comes from the experiment's
    # configuration.prometheus_base_url, not from a probe argument.
    - type: probe
      name: "p99-latency-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book",
            status="200"
            }[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    # Probe 2: Error rate under 0.1%
    - type: probe
      name: "error-rate-under-threshold"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book",
            status=~"5.."}[1m]))
            /
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m]))
            * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]
    # Probe 3: Rider API health endpoint responds
    - type: probe
      name: "rider-api-healthy"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
        timeout: 5
      # A bare integer tolerance is compared with the HTTP status code
      tolerance: 200
    # Probe 4: Locust confirms requests succeeding
    - type: probe
      name: "locust-success-rate"
      provider:
        type: http
        url: "http://locust:8089/stats/requests"
        timeout: 5
      tolerance:
        type: jsonpath
        path: "$.stats[?(@.name=='/api/rides/book')].current_fail_per_sec"
        expect:
          type: range
          range: [0, 5]
Four probes. p99 latency, error rate, health endpoint, and Locust success rate. All four must pass for the steady state to hold. If any probe fails, the experiment reports a steady state violation.
Probes: Prometheus, HTTP, and Locust
Prometheus probe queries PromQL directly. The query_interval function runs the query over the specified time range, and the tolerance is then evaluated against what the query returns.
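If "every data point within bounds" is the rule you want enforced, it may be safer to spell it out as a custom tolerance function than to rely on how a range tolerance treats a multi-point result. The sketch below is a minimal version under stated assumptions: the probe output is the standard Prometheus HTTP API JSON for a range query (`data.result[].values` holding `[timestamp, "value"]` pairs), the function name `all_points_within` is hypothetical, and treating an empty or NaN result as a failure is a design choice made here, not documented Chaos Toolkit behavior.

```python
import math
from typing import Any, Dict


def all_points_within(value: Dict[str, Any], lower: float, upper: float) -> bool:
    """Return True only if every sample in a Prometheus range-query
    response lies inside [lower, upper].

    An unsuccessful response, an empty result set, or a NaN sample is
    treated as a violation (an assumption: no data is not proof of health).
    """
    if not isinstance(value, dict) or value.get("status") != "success":
        return False

    series = value.get("data", {}).get("result", [])
    # Each series carries "values": [[unix_ts, "sample-as-string"], ...]
    points = [float(sample) for s in series for _, sample in s.get("values", [])]
    if not points:
        return False

    return all(not math.isnan(p) and lower <= p <= upper for p in points)
```

A function like this could be referenced from a tolerance of `type: probe` with a Python provider, which should receive the probe output as its `value` argument; check that wiring against the Chaos Toolkit version you run.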
HTTP probe hits an endpoint and checks the status code. Simple but essential. If the health endpoint returns 503, something is fundamentally broken.
Locust probe queries the Locust statistics API. Locust exposes real-time stats at /stats/requests with current request rate, failure rate, and percentile latencies. The probe checks whether the failure rate per second is below threshold.
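The same check can be expressed in a few lines of Python, which is handy for testing the threshold logic outside an experiment. This is a sketch, assuming the /stats/requests body contains a `stats` list whose entries carry `name` and `current_fail_per_sec`; the helper names `fail_rate_for` and `locust_within_budget` are hypothetical, and a missing endpoint is counted as a failure on the theory that no traffic means nothing was verified.

```python
from typing import Any, Dict, Optional


def fail_rate_for(stats_body: Dict[str, Any], endpoint: str) -> Optional[float]:
    """Return the current failures per second for one endpoint,
    or None if Locust has no entry for it."""
    for entry in stats_body.get("stats", []):
        if entry.get("name") == endpoint:
            return float(entry.get("current_fail_per_sec", 0.0))
    return None


def locust_within_budget(stats_body: Dict[str, Any], endpoint: str,
                         max_fail_per_sec: float = 5.0) -> bool:
    """True if the endpoint is present and its failure rate is in budget."""
    rate = fail_rate_for(stats_body, endpoint)
    return rate is not None and 0.0 <= rate <= max_fail_per_sec
```

The default budget of 5 failures per second mirrors the range used in the hypothesis above.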
Actions: Kill Pod and Inject Latency
Kill a Kubernetes pod:
# SCALED: Action to kill surge pricing pod
- type: action
  name: "kill-surge-pricing-pod"
  provider:
    type: python
    module: chaosk8s.pod.actions
    func: terminate_pods
    arguments:
      label_selector: "app=surge-pricing"
      ns: "ride-hailing"
      qty: 1
      rand: true
      grace_period: 0
  pauses:
    after: 30  # Wait 30s for system to react
Inject network latency with tc:
# SCALED: Action to inject 500ms latency to PostgreSQL
- type: action
  name: "inject-pg-latency"
  provider:
    type: process
    path: "kubectl"
    arguments:
      - "exec"
      - "-n"
      - "ride-hailing"
      - "deploy/rider-api"
      - "--"
      - "tc"
      - "qdisc"
      - "add"
      - "dev"
      - "eth0"
      - "root"
      - "netem"
      - "delay"
      - "500ms"
      - "50ms"  # 50ms jitter
      - "distribution"
      - "normal"
  pauses:
    after: 60  # Let latency soak for 60 seconds
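The tc (traffic control) command adds 500ms of latency with 50ms jitter to all outbound traffic on eth0 in the rider API pod. This affects PostgreSQL, Redis, and any other network call. It only works if the container image ships tc (iproute2) and the container has the NET_ADMIN capability, and kubectl exec against deploy/rider-api reaches a single pod, not every replica. To target only PostgreSQL, use iptables to mark PostgreSQL traffic and apply tc only to marked packets.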
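A port-scoped variant can be sketched without iptables. Note the swap: instead of packet marks, the commands below use a prio qdisc plus a u32 filter on the destination port, the pattern found in the netem documentation. Everything parameterized here is an assumption: the interface `eth0`, the default PostgreSQL port 5432, and the helper name `pg_latency_commands`, which only builds argument lists and executes nothing.

```python
from typing import List


def pg_latency_commands(dev: str = "eth0", port: int = 5432,
                        delay: str = "500ms", jitter: str = "50ms") -> List[List[str]]:
    """Build tc argument lists that delay only traffic sent to `port`.

    1. Replace the root qdisc with a 3-band prio qdisc (handle 1:).
    2. Attach netem to the third band (1:3).
    3. Steer packets with the matching destination port into that band.
    """
    return [
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:", "prio"],
        ["tc", "qdisc", "add", "dev", dev, "parent", "1:3", "handle", "30:",
         "netem", "delay", delay, jitter, "distribution", "normal"],
        ["tc", "filter", "add", "dev", dev, "protocol", "ip", "parent", "1:0",
         "prio", "3", "u32", "match", "ip", "dport", str(port), "0xffff",
         "flowid", "1:3"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being wired
    # into process-provider actions.
    for cmd in pg_latency_commands():
        print(" ".join(cmd))
```

Each list could become the tail of a `kubectl exec ... --` process action. Two caveats: the prio qdisc's default priomap can also steer some low-priority traffic into the third band, so the isolation is approximate, and deleting the root qdisc in the rollback removes all three pieces at once.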
Rollback
# SCALED: Rollback actions
rollbacks:
  # Rollback for killed pod (Kubernetes restarts automatically, but force it)
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"
  # Rollback for injected latency
  - type: action
    name: "remove-pg-latency"
    provider:
      type: process
      path: "kubectl"
      arguments:
        - "exec"
        - "-n"
        - "ride-hailing"
        - "deploy/rider-api"
        - "--"
        - "tc"
        - "qdisc"
        - "del"
        - "dev"
        - "eth0"
        - "root"
Rollbacks fire in two cases: the experiment completes, or the experiment aborts due to a safety check. Every action must have a corresponding rollback. No rollback means no experiment.
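The "no rollback means no experiment" rule can be enforced before anything runs. The sketch below is a crude pre-flight lint, assuming the experiment has already been parsed into a dict with the `method` and `rollbacks` keys used in this section. The function name `rollback_problems` is hypothetical, and comparing counts is only a heuristic: it cannot prove that a given rollback actually undoes a given action.

```python
from typing import Any, Dict, List


def rollback_problems(experiment: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the lint passed."""
    actions = [a.get("name", "<unnamed>")
               for a in experiment.get("method", [])
               if a.get("type") == "action"]
    rollbacks = [r.get("name", "<unnamed>")
                 for r in experiment.get("rollbacks", [])
                 if r.get("type") == "action"]

    problems: List[str] = []
    if actions and not rollbacks:
        problems.append(
            "method has actions %s but no rollbacks are defined" % actions)
    elif len(rollbacks) < len(actions):
        problems.append(
            "%d action(s) but only %d rollback(s); check each action is undone"
            % (len(actions), len(rollbacks)))
    return problems
```

Run as a CI step, a non-empty result would block the experiment from being merged or scheduled.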
Complete Experiment: Kill Surge Pricing
# SCALED: Full experiment - kill surge pricing service
version: 1.0.0
title: "Kill Surge Pricing Service"
description: >
  Verify that the ride booking service continues to operate
  when the surge pricing service is killed. The circuit breaker
  should open, the fallback should return cached multipliers,
  and ride bookings should continue with zero user-facing errors.
tags:
  - "resilience"
  - "circuit-breaker"
  - "surge-pricing"
contributions:
  reliability: "high"
  security: "none"
  scalability: "medium"
# Where the Prometheus probes send their queries
configuration:
  prometheus_base_url: "http://prometheus:9090"
# Define what "working" looks like
steady-state-hypothesis:
  title: "Ride bookings continue within SLO"
  probes:
    - type: probe
      name: "p99-latency-under-500ms"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            histogram_quantile(0.99,
            sum(rate(http_server_requests_seconds_bucket{
            uri="/api/rides/book", status="200"}[1m])) by (le))
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.5]
    - type: probe
      name: "error-rate-under-0.1-percent"
      provider:
        type: python
        module: chaosprometheus.probes
        func: query_interval
        arguments:
          query: >
            sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book", status=~"5.."}[1m]))
            / sum(rate(http_server_requests_seconds_count{
            uri="/api/rides/book"}[1m])) * 100
          start: "1 minute ago"
          end: "now"
      tolerance:
        type: range
        range: [0, 0.1]
    - type: probe
      name: "rider-api-healthy"
      provider:
        type: http
        url: "http://rider-api:8080/actuator/health"
        timeout: 5
      # A bare integer tolerance is compared with the HTTP status code
      tolerance: 200
# What we break
method:
  # Step 1: Confirm the circuit breaker is closed before injection
  - type: probe
    name: "pre-check-circuit-breaker-closed"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        query: >
          resilience4j_circuitbreaker_state{name="surgePricing"}
    tolerance:
      type: range
      range: [0, 0]  # 0 = CLOSED
  # Step 2: Kill the surge pricing service
  - type: action
    name: "kill-surge-pricing"
    provider:
      type: python
      module: chaosk8s.pod.actions
      func: terminate_pods
      arguments:
        label_selector: "app=surge-pricing"
        ns: "ride-hailing"
        all: true  # Kill every replica, not just one pod
        grace_period: 0
    pauses:
      after: 30  # Wait 30 seconds
  # Step 3: Verify circuit breaker opened
  - type: probe
    name: "circuit-breaker-should-be-open"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        query: >
          resilience4j_circuitbreaker_state{name="surgePricing"}
    tolerance:
      type: range
      range: [1, 1]  # 1 = OPEN
  # Step 4: Verify fallback is serving cached multipliers
  - type: probe
    name: "fallback-serving-cached-data"
    provider:
      type: python
      module: chaosprometheus.probes
      func: query
      arguments:
        query: >
          increase(surge_fallback_used_total[1m])
    tolerance:
      type: range
      range: [1, 100000]  # At least 1 fallback call
# How we clean up
rollbacks:
  - type: action
    name: "restart-surge-pricing"
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: rollout_restart
      arguments:
        name: "surge-pricing"
        ns: "ride-hailing"
Running the Experiment
# SCALED: Run the experiment with journal output
chaos run chaos/experiments/kill-surge-pricing.yaml \
--journal-path chaos/results/kill-surge-$(date +%Y%m%d-%H%M%S).json
# Output:
# [INFO] Experiment: Kill Surge Pricing Service
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO] Probe: p99-latency-under-500ms [PASSED]
# [INFO] Probe: error-rate-under-0.1-percent [PASSED]
# [INFO] Probe: rider-api-healthy [PASSED]
# [INFO] Action: kill-surge-pricing
# [INFO] Pausing after action for 30s
# [INFO] Probe: circuit-breaker-should-be-open [PASSED]
# [INFO] Probe: fallback-serving-cached-data [PASSED]
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO] Probe: p99-latency-under-500ms [PASSED]
# [INFO] Probe: error-rate-under-0.1-percent [PASSED]
# [INFO] Probe: rider-api-healthy [PASSED]
# [INFO] Experiment ended: PASSED
# [INFO] Rollback: restart-surge-pricing
The experiment checks steady state before and after the action. Both checks must pass for the experiment to pass. The journal file records every probe result, timing, and tolerance evaluation.
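Because the journal is plain JSON, results can be compared across runs with a short script. The sketch below condenses one journal into a verdict. It assumes the journal exposes `status`, `deviated`, and a `steady_states` object with `before` and `after` entries that each carry `steady_state_met` and a `probes` list with `tolerance_met`; confirm those key names against a journal produced by your Chaos Toolkit version. The function names are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_journal(journal: Dict[str, Any]) -> Dict[str, Any]:
    """Condense a journal dict into a small, comparable verdict."""
    states = journal.get("steady_states") or {}
    # "after" can be missing or null when the run stops before the method.
    before = states.get("before") or {}
    after = states.get("after") or {}

    failed = [
        p.get("activity", {}).get("name", "<unnamed>")
        for phase in (before, after)
        for p in phase.get("probes", [])
        if not p.get("tolerance_met", False)
    ]
    return {
        "status": journal.get("status"),
        "deviated": bool(journal.get("deviated", False)),
        "steady_before": bool(before.get("steady_state_met", False)),
        "steady_after": bool(after.get("steady_state_met", False)),
        "failed_probes": failed,
    }


def summarize_journal_file(path: str) -> Dict[str, Any]:
    """Load a journal written via --journal-path and summarize it."""
    with open(path, "r", encoding="utf-8") as fh:
        return summarize_journal(json.load(fh))
```

Collecting these summaries per run gives the history that the manual process never had: which probes failed, in which phase, and on which date.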
# Generate HTML report from journal
chaos report --export-format=html \
chaos/results/kill-surge-*.json \
chaos/results/report.html
The Proof
The experiment framework validates itself. Run the kill-surge-pricing experiment without the circuit breaker enabled:
# BOTTLENECK: Run without resilience patterns
# Set resilience4j.circuitbreaker.instances.surgePricing.enabled=false
chaos run chaos/experiments/kill-surge-pricing.yaml
# Output:
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO] Probe: p99-latency-under-500ms [PASSED]
# [INFO] Probe: error-rate-under-0.1-percent [PASSED]
# [INFO] Action: kill-surge-pricing
# [INFO] Pausing after action for 30s
# [INFO] Steady state hypothesis: Ride bookings continue within SLO
# [INFO] Probe: p99-latency-under-500ms [FAILED]
# Value: 2.34 not in range [0, 0.5]
# [INFO] Probe: error-rate-under-0.1-percent [FAILED]
# Value: 34.2 not in range [0, 0.1]
# [INFO] Experiment ended: FAILED (steady state violated)
# [INFO] Rollback: restart-surge-pricing
Without the circuit breaker, the experiment correctly reports failure. p99 at 2.34 seconds. Error rate at 34.2%. The experiment detects the gap.
Re-enable the circuit breaker. Re-run. The experiment passes. The circuit breaker is validated under real load with automated verification. Not a human watching a dashboard. A quantitative pass/fail based on SLO thresholds.