Proving Resilience: How AWS Chaos Engineering Prevents Facebook-Style Outages

These articles are AI-generated summaries. Please check the original sources for full details.

The Uncomfortable Truth About Platform Stability

In October 2021, Facebook suffered an outage of roughly six hours caused by a BGP routing error. It took down the company's services and even disabled office badge readers. The incident exposed a critical flaw: systems built on untested assumptions about network stability and security can fail catastrophically when those assumptions are violated.

Why This Matters

Modern distributed systems often rest on eight dangerous assumptions, such as “the network is reliable” or “transport cost is zero.” These assumptions, known as the Fallacies of Distributed Computing, create blind spots in reliability planning. Ignoring transport cost alone adds up quickly: 100 TB of cross-AZ traffic per month can come to roughly $2,000 in AWS data transfer charges, because traffic between Availability Zones is billed in both directions. Chaos engineering shifts the paradigm from “preventing failure” to “proving resilience” through deliberate, controlled experiments.
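The $2,000 figure can be reproduced with simple arithmetic. This is only a sketch: it assumes a rate of $0.01 per GB charged in each direction for traffic between Availability Zones, and 1 TB = 1,000 GB. Neither value comes from the article, so check current AWS pricing before relying on it.

```latex
\[
100\ \text{TB} \times 1000\ \tfrac{\text{GB}}{\text{TB}}
\times \frac{\$0.01}{\text{GB}} \times 2\ \text{directions}
= \$2{,}000\ \text{per month}
\]
```

Under the same assumptions, keeping that traffic inside a single zone would remove the charge, but it would also remove the zone redundancy that the resilience work depends on, so the cost is something to budget for rather than to eliminate.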

Key Insights

  • “Facebook’s 6-hour outage, 2021”: A BGP routing error exposed systemic vulnerabilities in network and security assumptions.
  • “Transport cost is zero fallacy”: Treating transport as free can lead to roughly $2,000/month in AWS data transfer charges for 100 TB of cross-AZ traffic.
  • “Chaos Mesh used with AWS FIS”: Combines Kubernetes-native chaos tools with AWS’s centralized control plane for resilience testing.
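To make the “network is reliable” assumption testable, a Chaos Mesh experiment can inject latency into a workload. The manifest below is a minimal sketch, not taken from the article: the experiment name, the `default` namespace, the `app: payment-api` label, and the latency values are all illustrative assumptions.

```yaml
# Hypothetical NetworkChaos experiment: add latency to payment-api pods.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-api-latency      # illustrative name
spec:
  action: delay                  # inject network delay
  mode: all                      # apply to every pod matched by the selector
  selector:
    namespaces:
      - default                  # assumed namespace of the workload
    labelSelectors:
      app: payment-api
  delay:
    latency: "200ms"             # assumed value; tune to your latency budget
    jitter: "50ms"
  duration: "5m"                 # experiment ends automatically after 5 minutes
```

Applying it with `kubectl apply -f` while watching error rates and timeouts shows whether callers of the service tolerate a slow network or only a perfect one.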

Working Example

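# Fragment of a pod template (spec.template.spec in a Deployment):
# spread payment-api pods evenly across availability zones.
topologySpreadConstraints:
- maxSkew: 1                                # zones may differ by at most one pod
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule          # leave pods Pending rather than exceed the skew
  labelSelector:
    matchLabels:
      app: payment-api
---
# Separate manifest: PodDisruptionBudget that keeps at least two
# payment-api pods running during voluntary disruptions such as node drains.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api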

Practical Applications

  • Use Case: AWS FIS + Chaos Mesh for testing Kubernetes resilience in production-like environments.
  • Pitfall: Assuming transport cost is zero can lead to unexpected AWS billing spikes from cross-AZ or cross-region data transfer.
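One way the FIS and Chaos Mesh use case might look in practice is a CloudFormation experiment template that asks FIS to create a Chaos Mesh resource in an EKS cluster. This is a hedged sketch: the action ID and parameter names reflect my understanding of the FIS action for injecting Kubernetes custom resources and should be checked against the FIS documentation. The account ID, ARNs, resource names, and namespaces are hypothetical placeholders.

```yaml
# Sketch only: every ARN and name below is a placeholder, not a real value.
Resources:
  PaymentApiPodKill:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Kill one payment-api pod through Chaos Mesh
      RoleArn: arn:aws:iam::111122223333:role/fis-eks-experiment-role  # hypothetical
      StopConditions:
        - Source: none            # in real use, point this at a CloudWatch alarm
      Targets:
        PaymentCluster:
          ResourceType: aws:eks:cluster
          ResourceArns:
            - arn:aws:eks:us-east-1:111122223333:cluster/payments     # hypothetical
          SelectionMode: ALL
      Actions:
        InjectPodKill:
          ActionId: aws:eks:inject-kubernetes-custom-resource
          Parameters:
            maxDuration: PT2M
            kubernetesApiVersion: chaos-mesh.org/v1alpha1
            kubernetesKind: PodChaos
            kubernetesNamespace: chaos-mesh   # assumed namespace for the chaos resource
            # Chaos Mesh spec passed as a JSON string
            kubernetesSpec: '{"action":"pod-kill","mode":"one","selector":{"namespaces":["default"],"labelSelectors":{"app":"payment-api"}}}'
          Targets:
            Cluster: PaymentCluster
      Tags:
        Name: payment-api-pod-kill
```

The design idea is that Chaos Mesh performs the Kubernetes-level fault while FIS supplies the central controls around it, such as IAM permissions, stop conditions, and an experiment history.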
