How Distributed Systems Fail
How Distributed Systems Fail
Your service does not fail because of bugs. Not usually. It fails because another service changed its behavior in a way you did not anticipate, and your service had no plan for that change. The fraud detection service that responded in 50 milliseconds for two years starts responding in 30 seconds because its database ran a checkpoint during peak load. Your payment service, which calls fraud detection on every transaction, now has 200 threads blocked waiting for responses that will not arrive in time. Those 200 threads were your entire Tomcat thread pool. No new HTTP requests can be processed. Your health check endpoint, which shares that thread pool, stops responding. The load balancer marks your instance as unhealthy and routes all traffic to the remaining two instances, which are experiencing the same problem. The entire payment platform is down. Not because of a bug. Because of a slow database checkpoint in a service you do not own.
This is a cascading failure. It is the most common way production distributed systems die, and it is the failure mode that every pattern in this book exists to prevent.
The Transaction Processing Platform
Every example in this book operates within a financial transaction processing platform. Five services, each with a distinct responsibility and a distinct failure profile:
Payment Service accepts transaction requests from clients. It validates the request, calls the fraud detection service, calls the balance service, initiates the payment, dispatches a notification, and writes an audit log entry. It is the orchestrator. When it breaks, users see errors.
Fraud Detection Service scores each transaction against a set of rules and a machine learning model hosted by an external scoring API. Under normal conditions, it responds in 40-80 milliseconds. Under load, the external scoring API degrades, and fraud detection response times climb to seconds. Under failure, the external scoring API is unreachable, and fraud detection must decide: block the transaction or allow it with a flag for manual review.
Balance Service checks and reserves funds. It is backed by a relational database with strong consistency requirements. It cannot serve stale data. When it is down, payments cannot proceed.
Notification Service sends confirmation emails and SMS messages. It is non-critical path. A delayed notification is acceptable. A missing notification is not ideal but does not invalidate the payment.
Audit Log Service writes immutable records of every transaction. Regulatory requirement. Must eventually receive every event, but can tolerate short buffering delays.
This domain is chosen deliberately. The services have different criticality levels, different latency profiles, and different tolerance for degradation. A resilience pattern that makes sense for notification dispatch may be dangerous for balance checks. The domain forces you to think about these differences.
Anatomy of a Cascading Failure
A cascading failure has three phases: origin, amplification, and collapse.
Origin. One component changes its performance profile. Not a crash. Crashes are easy. The component keeps accepting requests but responds slowly. In the transaction platform, the external scoring API starts responding in 10 seconds instead of 100 milliseconds. Fraud detection still works, but each call holds a thread for 100x longer than expected.
Amplification. The caller does not adapt. The payment service continues sending requests to fraud detection at the same rate. Each request now occupies a thread for 10 seconds. If the payment service receives 20 requests per second, and each request now holds a thread for 10 seconds, it needs 200 concurrent threads just for fraud detection calls. A typical Tomcat thread pool has 200 threads total.
Collapse. The thread pool fills. New requests queue. The queue fills. Requests time out at the load balancer or the client. Health checks fail. The load balancer removes the instance. Remaining instances absorb the traffic and collapse faster. The entire service is down.
The diagram shows the full progression from initial slowdown to complete outage. The thread pool visualization on the left makes the mechanics concrete: at T=0, 20 of 200 threads are active, normal utilization. By T=30, 140 threads are stuck waiting for fraud detection responses. By T=60, the pool is full and no request of any kind can be served. The right column shows what the monitoring dashboard displays at each phase: latency spikes, then 503 errors, then upstream cascade.
The Numbers That Kill You
The math of cascading failures is multiplication, not addition.
Consider the payment service under normal conditions:
- Incoming rate: 100 requests per second
- Fraud detection call duration: 50ms
- Concurrent threads needed for fraud: 100 * 0.05 = 5 threads
- Thread pool size: 200
- Utilization: 2.5%
Now fraud detection degrades. Response time increases from 50ms to 5 seconds, a 100x increase:
- Incoming rate: still 100 requests per second
- Fraud detection call duration: 5,000ms
- Concurrent threads needed for fraud: 100 * 5 = 500 threads
- Thread pool size: still 200
- Utilization: 250%. The pool is saturated. Requests queue.
The transition from healthy to dead takes approximately thread_pool_size / (request_rate * new_latency - thread_pool_size) seconds of continued operation. With the numbers above: 200 / (100 * 5 - 200) = 200 / 300 = 0.67 seconds. Under one second from the moment fraud detection degrades to the moment the payment service thread pool is full.
This is why “we’ll notice and fix it” is not a resilience strategy. The failure completes before the first alert fires.
Why Horizontal Scaling Does Not Help
Running three instances of the payment service instead of one does not protect against cascading failures. Each instance has the same thread pool. Each instance calls the same degraded fraud detection service. Each instance fills its thread pool in 0.67 seconds.
Three instances give you three simultaneous failures instead of one. The load balancer sees three unhealthy instances and has nowhere to route traffic. The failure is identical in outcome and only marginally slower in progression.
Horizontal scaling protects against load. It does not protect against dependency degradation. Those require fundamentally different mechanisms: timeouts, circuit breakers, bulkheads, and rate limiters. Mechanisms that limit the impact a single failing dependency can have on the resources available to everything else.
Latency Amplification
Cascading failures are the acute version. Latency amplification is the chronic version.
In the payment flow, a single transaction touches five services sequentially: fraud detection, balance check, payment execution, notification, and audit. If each service adds 10 milliseconds of latency, the total call chain takes 50 milliseconds. Acceptable.
If each service adds 100 milliseconds, the chain takes 500 milliseconds. Noticeable. If each service occasionally hits a p99 tail latency of 500 milliseconds, a request that hits the p99 in two of the five services takes at least 1 second. Hit the p99 in all five and you are at 2.5 seconds.
The probability of a request hitting at least one p99 tail in a chain of N services: 1 - (0.99)^N. For N=5: 4.9%. Nearly 5% of your requests experience at least one tail-latency hit. For N=10: 9.6%. This is why microservice architectures feel slow without deliberate latency management.
Tail latency is not a monitoring curiosity. It is a user experience problem that compounds with every service in the call chain.
// Typical payment flow - sequential calls, no timeout configuration
// INCOMPLETE - every call here needs an explicit timeout
public PaymentResult processPayment(PaymentRequest request) {
FraudScore score = fraudService.checkFraud(request); // What if this takes 30s?
Balance balance = balanceService.check(request.amount()); // Blocked until fraud returns
PaymentConfirmation confirmation = paymentGateway.charge(request);
notificationService.send(request.userId(), confirmation); // Why is this blocking?
auditService.log(request, confirmation); // And this?
return new PaymentResult(confirmation.id(), score, balance);
}
Every call in this method is a network call. None has a timeout. None has a fallback. The method executes sequentially, so the total latency is the sum of all five calls. If any single call hangs, the entire method hangs, and the thread is consumed until the socket times out at the OS level, which is typically 120 seconds.
// EXPLICIT AND INTENTIONAL - every call has a timeout, non-critical calls are async
public PaymentResult processPayment(PaymentRequest request) {
FraudScore score = fraudService.checkFraud(request); // timeout: 2s
Balance balance = balanceService.check(request.amount()); // timeout: 1s
PaymentConfirmation confirmation = paymentGateway.charge(request); // timeout: 5s
CompletableFuture.runAsync(() -> notificationService.send( // fire-and-forget
request.userId(), confirmation));
CompletableFuture.runAsync(() -> auditService.log(request, confirmation)); // fire-and-forget
return new PaymentResult(confirmation.id(), score, balance);
}
The corrected version puts timeouts on every critical call and moves non-critical calls (notification, audit) to asynchronous execution. The maximum latency is now bounded: 2s + 1s + 5s = 8 seconds worst case for the critical path, rather than infinity.
The Failure Taxonomy
Every failure in a distributed system falls into one of four categories. Each category requires a different resilience response.
Crash failure. The dependency is completely unreachable. TCP connection refused. DNS resolution fails. This is the easiest failure to handle: it fails fast. The calling thread gets an exception in milliseconds, not seconds.
Omission failure. The dependency accepts the request but never responds. The request enters a void. No response, no error. The calling thread blocks until the socket timeout, which defaults to minutes. This is the most dangerous failure because it silently consumes threads.
Timing failure. The dependency responds, but too slowly. It still works, but its latency has increased 10x or 100x. This is what causes cascading failures. The dependency is not “down,” so naive health checks pass. But every call consumes a thread for much longer than expected.
Byzantine failure. The dependency responds with incorrect data. It says the balance is sufficient when it is not. It approves a fraudulent transaction. This is the hardest to detect and the hardest to protect against. Resilience patterns do not help here. Data validation and correctness verification do.
This book focuses on omission and timing failures because they are the most common, the most damaging, and the most preventable with the right patterns.
What This Book Builds
Each subsequent chapter introduces one resilience pattern, following the same structure:
- The specific failure mode the pattern addresses
- A from-scratch Java 21 implementation that reveals the internals
- Commentary on what building it from scratch exposed
- The production Resilience4J implementation with full configuration
- A Testcontainers integration test that proves the pattern works
- The Prometheus metric and Grafana panel that makes the pattern observable
The patterns build on each other. Timeouts (Chapter 2) are the foundation. Circuit breakers (Chapter 4) depend on timeouts. Bulkheads (Chapter 6) complement circuit breakers. Pattern composition (Chapter 9) combines them correctly. Testing (Chapter 14) and chaos engineering (Chapter 15) prove they work under real failure conditions.
The through-line system, the financial transaction platform, provides the context for every decision. A circuit breaker on the notification service has different parameters than a circuit breaker on the balance service. A bulkhead for fraud detection has a different size than a bulkhead for audit logging. The domain forces these distinctions, and the distinctions are the entire point.
Resilience is not a feature you add to a service. It is a discipline you practice across every network boundary, every thread pool, and every dependency relationship. The next chapter starts with the non-negotiable foundation: timeouts.