postmortem


Chapter 38 of 38

The Pattern Behind the Patterns: What Every Failure in This Book Has in Common and the Question Every Engineer Should Ask Before the Next Release

Twelve failures. Five decades. Six industries. Seven countries. Hundreds of lives, billions of dollars, and a set of engineering practices that did not exist before these systems broke.

The failures are not the same. A race condition in a radiation therapy machine is not the same kind of problem as a missing unit conversion in a spacecraft navigation system. A feature flag that activates dead code is not the same kind of problem as a regular expression that pins CPU to 100%. But the patterns beneath them are the same, and there are fewer of them than you might expect.

Pattern 1: The Untested Assumption Carried Forward

The Ariane 5 carries software from the Ariane 4. The software assumes horizontal velocity will never exceed a value that the Ariane 4’s flight profile guarantees. The Ariane 5 has a different flight profile. Nobody checks whether the assumption still holds.

The Therac-25 carries a safety case from the Therac-20. The safety case assumes the software is reliable because it operated without incident on the Therac-20. But on the Therac-20, hardware interlocks provided independent safety. The Therac-25 removes the interlocks. Nobody checks whether the safety case still holds without them.

The 2038 problem carries a time representation from the 1970s: a signed 32-bit count of seconds since 1 January 1970, which overflows on 19 January 2038. The representation assumes the software will not still be running in 2038. The assumption is reasonable in the 1970s. The software is then copied into systems that will still be running decades after the overflow date. Nobody checks whether the representation’s bounds match the software’s actual lifespan.
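The bound behind that overflow date can be worked out directly. The sketch below assumes the common case of a signed 32-bit counter of seconds since 1970-01-01 00:00:00 UTC; the function names are illustrative, not taken from any particular system.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def last_representable_instant() -> datetime:
    """Return the last instant a signed 32-bit seconds counter can represent."""
    return EPOCH + timedelta(seconds=INT32_MAX)


def fits_in_int32(moment: datetime) -> bool:
    """Check whether a timezone-aware datetime fits in the 32-bit counter."""
    seconds = (moment - EPOCH) // timedelta(seconds=1)
    return -(2**31) <= seconds <= INT32_MAX


if __name__ == "__main__":
    print(last_representable_instant().isoformat())
```

A check like `fits_in_int32` is only useful if it is run against the dates the system will actually have to handle, such as contract expiries or certificate lifetimes, rather than against today's date.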

The pattern: every system inherits assumptions from its predecessors. When the context changes, the assumptions become invisible, untested preconditions. They do not announce themselves. They appear as constraints that were true when the system was designed and are no longer true when the system is operating. Finding them requires asking a question that most engineering cultures do not ask: what did the previous system assume about its operating environment, and is that assumption still valid here?
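One way to keep an inherited assumption from staying invisible is to turn it into a check that fails loudly. The sketch below is a hypothetical illustration in Python, not the Ariane flight code: it converts a floating-point value to a 16-bit signed integer and refuses input outside the range the conversion assumes. The names `to_int16_checked` and `AssumptionViolated` are invented for this example.

```python
import math

INT16_MIN, INT16_MAX = -(2**15), 2**15 - 1


class AssumptionViolated(Exception):
    """Raised when an inherited operating-range assumption no longer holds."""


def to_int16_checked(value: float, name: str) -> int:
    """Convert a float to a 16-bit signed integer, refusing out-of-range input.

    The range check is the assumption made explicit: the caller is told when
    the environment has moved outside what the representation can hold.
    """
    if math.isnan(value) or not (INT16_MIN <= value <= INT16_MAX):
        raise AssumptionViolated(
            f"{name}={value!r} is outside the 16-bit range "
            f"[{INT16_MIN}, {INT16_MAX}] this conversion assumes"
        )
    return int(value)  # truncates toward zero
```

What the caller does with the exception is a separate design decision. The point of the sketch is only that the violated assumption becomes an observable event in testing rather than an unhandled overflow in operation.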

Pattern 2: The Interface Without a Contract

The Mars Climate Orbiter fails because two teams use different unit systems and the file that connects them carries numbers without unit labels. The interface specification says SI units (newton-seconds). The code produces English units (pound-force seconds). Nobody verifies compliance.

The Ariane 5 SRI sends a diagnostic dump on the data bus. The OBC expects navigation data on the data bus. The bus protocol has no message type field. Nobody tests what happens when the SRI sends non-navigation data.

The Boeing 737 MAX MCAS reads a single angle-of-attack sensor. The system has two sensors. The interface between the sensors and MCAS does not include a cross-check or a disagree indication.

The pattern: system boundaries are where failures cross from one component to another. An interface that is specified in a document but not enforced in code is fiction. An interface that carries data without type information is a latent mismatch. An interface that has been tested only under nominal conditions will surprise you under failure conditions. In these cases the failure is not in any single component. It is in the space between components, where implicit contracts govern behavior and nobody writes them down.
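A contract enforced in code can be as small as a value that cannot exist without its unit. The sketch below is a minimal illustration, assuming a hypothetical record format with `value` and `unit` fields; the `Impulse` type and `parse_record` function are invented for this example and do not describe any mission software. The conversion factor is the standard definition of the pound-force in newtons.

```python
from dataclasses import dataclass

# Newton-seconds per unit; 1 lbf = 4.4482216152605 N by definition.
_TO_NEWTON_SECONDS = {"N*s": 1.0, "lbf*s": 4.4482216152605}


@dataclass(frozen=True)
class Impulse:
    """An impulse value that always travels with its unit label."""

    value: float
    unit: str

    def __post_init__(self) -> None:
        # Reject unknown units at construction, not at first use.
        if self.unit not in _TO_NEWTON_SECONDS:
            raise ValueError(f"unknown impulse unit: {self.unit!r}")

    def to_newton_seconds(self) -> float:
        return self.value * _TO_NEWTON_SECONDS[self.unit]


def parse_record(record: dict) -> Impulse:
    """Parse one record from the interface file; never guess a missing unit."""
    if "unit" not in record:
        raise ValueError("impulse record has no unit label")
    return Impulse(float(record["value"]), record["unit"])
```

The design choice is that a bare number is not a valid message. A producer that omits the unit is rejected at the boundary instead of being silently interpreted in the consumer's unit system.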

Pattern 3: The Silent Monitoring Failure

The Northeast Blackout occurs because the alarm system fails silently. No alarm fires to indicate that the alarm system has stopped working. The operators believe everything is normal because the alarm screen is quiet. Quiet means either “no problems” or “the alarm system is dead.” The two are indistinguishable.

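GitLab’s backup systems fail silently. pg_dump fails because it runs with binaries for the wrong PostgreSQL version. The failure notifications are sent by email, and the emails are rejected before anyone sees them. LVM snapshots run only once a day, and cloud disk snapshots are not enabled for the database servers. No check reports any of this. The team discovers the backup failures only when they need the backups.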

The pattern: monitoring systems are trusted infrastructure. When they fail, they fail silently because nobody monitors the monitor. The absence of information (no alarms, no alerts, no error messages) is interpreted as the presence of normalcy. This interpretation is wrong exactly when it matters most: during a failure. Any monitoring system that does not monitor itself is a single point of failure for situational awareness.
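The usual way to separate "no problems" from "the monitor is dead" is a heartbeat: the monitor must positively report that it is alive, and silence past a deadline is itself an alarm. The sketch below is a minimal illustration; the class name, status strings, and injectable clock are assumptions made for this example.

```python
import time
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Tracks whether a monitoring process has checked in recently."""

    def __init__(
        self,
        max_silence_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_beat: Optional[float] = None

    def beat(self) -> None:
        """Called by the monitor on every healthy cycle."""
        self._last_beat = self._clock()

    def status(self) -> str:
        """Report the monitor's health; a quiet monitor is never 'normal'."""
        if self._last_beat is None:
            return "NEVER_SEEN"
        if self._clock() - self._last_beat > self._max_silence_s:
            return "MONITOR_DEAD"
        return "MONITOR_ALIVE"
```

For this to help, the watchdog has to run somewhere that does not share the monitor's failure modes, such as a different host or process. A watchdog that dies with the thing it watches reproduces the original problem one level up.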

Pattern 4: The Missing Circuit Breaker

Knight Capital’s SMARS has no kill switch. When the trading desk detects abnormal behavior, there is no single action that can halt all order routing. The 45-minute loss accumulation is the cost of the missing emergency stop.

The Flash Crash’s sell algorithm has no price sensitivity check, no spread check, no depth check. It sells at whatever rate the volume dictates. It has no mechanism to detect that it is participating in a feedback loop. The CME circuit breaker, external to the algorithm, is the only brake.

Cloudflare’s WAF rule deployment has no canary mechanism and no CPU budget per regex evaluation. A single bad rule consumes 100% CPU on every server globally. No circuit breaker limits the damage from a pathological rule.

The pattern: systems without emergency stops produce unbounded damage during failures. The circuit breaker is the component that limits the blast radius: the kill switch that stops trading, the timeout that aborts a regex evaluation, the canary deployment that limits a bad rule to 1% of traffic before promoting it to 100%. Systems that omit circuit breakers are systems where the failure duration is determined by the speed of human diagnosis, which is always too slow.
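A circuit breaker of this kind can be sketched in a few lines. The example below assumes a hypothetical order-routing loop with caller-supplied `send` and `loss_of` functions; none of these names come from a real trading system. The breaker opens when cumulative damage exceeds a fixed budget or when an operator calls `trip()`, and it deliberately has no automatic reset.

```python
from typing import Callable, Iterable


class DamageBudgetBreaker:
    """Stops an automated action once cumulative damage exceeds a budget."""

    def __init__(self, budget: float) -> None:
        if budget <= 0:
            raise ValueError("budget must be positive")
        self._budget = budget
        self._spent = 0.0
        self._open = False

    def trip(self) -> None:
        """Manual kill switch: one call halts everything behind the breaker."""
        self._open = True

    def record_damage(self, amount: float) -> None:
        """Add to the running total; negative amounts are ignored."""
        self._spent += max(amount, 0.0)
        if self._spent >= self._budget:
            self._open = True

    def allows_action(self) -> bool:
        return not self._open


def route_orders(
    orders: Iterable[object],
    breaker: DamageBudgetBreaker,
    send: Callable[[object], None],
    loss_of: Callable[[object], float],
) -> int:
    """Send orders until the breaker opens; return how many were sent."""
    sent = 0
    for order in orders:
        if not breaker.allows_action():
            break  # damage is bounded by the budget, not by human reaction time
        send(order)
        breaker.record_damage(loss_of(order))
        sent += 1
    return sent
```

The same shape applies to the other two cases: the budget can be CPU time per regex evaluation or the share of traffic exposed to a new rule. What matters is that the limit is enforced by the system and does not wait for a diagnosis.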

Pattern 5: The Cost-Optimized Safety Layer

The Therac-25 removes hardware interlocks because the software is deemed reliable enough to be the sole safety layer. The cost savings are real. The risk increase is invisible until the race condition kills patients.

The Boeing 737 MAX relies on a single AOA sensor for MCAS because the safety assessment classifies MCAS as a low-risk system. The AOA disagree alert works only on aircraft fitted with an optional AOA indicator. Many airlines, including the operators of both aircraft that crashed, do not buy the option. The cost savings are trivial relative to the aircraft’s price. The risk is catastrophic.

The GitLab backup architecture has five layers, but the resources to verify and maintain those layers are not allocated. The backups exist on paper but not in practice. The cost of periodic restore testing is small. It is not prioritized.

The pattern: safety layers cost money to build, maintain, and verify. When budgets are constrained, safety layers are the first to be reduced because their value is invisible until the failure occurs. A safety layer that is present but not maintained is worse than no safety layer because it creates false confidence. The cost of a safety layer is not the cost of building it. It is the cost of verifying it works, continuously, for the life of the system.
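Continuous verification of a backup layer means actually restoring it. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a production database, so it shows the shape of a restore test and is not a procedure for PostgreSQL. The function names and the row-count comparison are assumptions chosen for brevity; a real check would compare more than counts.

```python
import sqlite3


def dump_database(conn: sqlite3.Connection) -> str:
    """Produce a SQL text backup of the whole database."""
    return "\n".join(conn.iterdump())


def restore_and_verify(backup_sql: str, expected_counts: dict) -> bool:
    """Restore a backup into a scratch database and compare row counts.

    Table names in expected_counts are assumed to be trusted, since they
    are interpolated into the query text.
    """
    scratch = sqlite3.connect(":memory:")
    try:
        scratch.executescript(backup_sql)
        for table, expected in expected_counts.items():
            query = f'SELECT COUNT(*) FROM "{table}"'
            (actual,) = scratch.execute(query).fetchone()
            if actual != expected:
                return False
        return True
    except sqlite3.Error:
        # An empty or corrupt backup fails here, which is the point.
        return False
    finally:
        scratch.close()
```

Run on a schedule with the result wired to an alert, a check like this turns "the backups exist on paper" into a claim that is tested every day instead of on the day of the outage.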

The Twelve Rules

  1. Never rely on software as the sole safety mechanism in a system where software failure can cause physical harm. (Therac-25)

  2. Never reuse software in a new system without re-validating every assumption the software makes about its operating environment. (Ariane 5)

  3. Every monitoring system must monitor itself. (Northeast Blackout)

  4. Every data interface between independently developed components must enforce unit and type consistency at the software level. (Mars Climate Orbiter)

  5. Never reuse a feature flag. When a feature is retired, remove both the flag and the code it activates. (Knight Capital)

  6. Any automated system that adjusts its behavior based on the aggregate behavior of other automated systems must have a circuit breaker. (Flash Crash)

  7. Never evaluate a backtracking regular expression against input whose length or content you do not control. (Cloudflare)

  8. A backup that has never been restored is not a backup. (GitLab)

  9. Software that compensates for a hardware design limitation is safety-critical by definition. (Boeing 737 MAX)

  10. Never assume a fixed-width representation is sufficient because it works today. (2038 Problem)

  11. Do not depend on trivial packages for functionality you can implement in fewer lines than the dependency declaration requires. (Left-Pad)

  12. A library must not have capabilities beyond its stated purpose. (Log4Shell)

These rules are not abstractions. Each one has a failure behind it. Each failure has a chain of events, a mechanism, and a consequence. The rules are the compressed form of the lessons. The chapters are the uncompressed form.

The Question

Every failure in this book was preceded by a period where the system worked correctly. The Therac-25 treated patients safely for months before the race condition was triggered. The Ariane 4 SRI software flew 113 successful missions. Knight Capital’s SMARS processed millions of orders without incident. Log4j’s JNDI lookup feature existed for eight years before it was exploited.

The system working correctly is not evidence that the system is safe. It is evidence that the failure conditions have not yet been met.

The question every engineer should ask before the next release is not “does this work?” That question is answered by testing. The question is:

What am I assuming about the environment this system will operate in, and what happens when that assumption is wrong?

The Ariane 5 engineers assumed horizontal velocity would stay within the Ariane 4’s range. The MCO engineers assumed both teams used SI units. The 737 MAX engineers assumed the AOA sensor would not fail, or that if it did, the pilot could recover manually. The Cloudflare engineers assumed WAF rules would be deployed with sufficient testing. The GitLab engineers assumed the backup systems were functional.

Each assumption was reasonable. Each assumption was wrong. The difference between a functioning system and a failed system is not the quality of the engineering. It is whether the assumptions embedded in the engineering match the conditions the system actually encounters.

This book cannot tell you what assumptions your system makes. Only you know that. But after reading twelve investigations of what happens when assumptions are wrong, you are equipped to ask the question, to find the assumptions, and to decide which ones are worth testing before the system tests them for you.