Reliability Is an Emergent Property, Not a Root Cause

These articles are AI-generated summaries. Please check the original sources for full details.

Looking for Root Causes is a False Path

In a recent podcast, David Blank-Edelman, program lead for Microsoft’s SRE Academy, argued that searching for a single root cause of system failures is counterproductive. He emphasized that reliability is an emergent property of complex systems: it arises from interactions between technical and socio-technical factors and cannot be explained by isolating one failure point.

Why This Matters

Traditional root cause analysis assumes a single failure point, but modern systems are too interconnected for this approach. As Blank-Edelman explains, failures often stem from multiple contributing factors, including human decisions, design trade-offs, and unanticipated interactions. For example, B-17 bomber crashes in WWII were traced not to a single mechanical failure but to poorly designed cockpit switches, a socio-technical oversight. Focusing on one root cause hides these contributing factors, so the same failures recur and opportunities to improve the system as a whole are missed.

Key Insights

  • “Reliability is an emergent property of an architecture and can include any property important to the customer, such as availability or durability.” (David Blank-Edelman, 2025)
  • “Failures have multiple causes, some of which are socio-technological in nature.” (Podcast transcript, 2025)
  • Tooling example: Temporal, a platform for managing distributed workflows, is cited as being used by Stripe and Coinbase.

Practical Applications

  • Use Case: Microsoft’s SRE Academy trains Azure engineers to focus on systemic feedback loops rather than isolated incidents.
  • Pitfall: Prematurely blaming human error or a single component in post-incident reviews can mask deeper systemic issues, leading to recurring outages.
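
One way to act on the pitfall above is to make the review record itself hold several contributing factors instead of a single "root cause" field. The following is a minimal sketch, not a method described in the podcast: the class names, the category labels ("human", "design", and so on), and the example incident are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContributingFactor:
    description: str
    # Hypothetical labels, e.g. "technical", "human", "design", "process".
    category: str


@dataclass
class IncidentReview:
    title: str
    factors: List[ContributingFactor] = field(default_factory=list)

    def add_factor(self, description: str, category: str) -> None:
        """Record one contributing factor; a review may hold many."""
        self.factors.append(ContributingFactor(description, category))

    def categories(self) -> Counter:
        """Count recorded factors per category."""
        return Counter(f.category for f in self.factors)

    def looks_like_single_cause(self) -> bool:
        """True if the review lists fewer than two factors or only one category.

        This is a prompt to keep asking questions, not a verdict.
        """
        return len(self.factors) < 2 or len(self.categories()) < 2


# Hypothetical incident, invented for this example.
review = IncidentReview("Configuration applied to the wrong cluster")
review.add_factor("Operator selected the wrong cluster", "human")
print(review.looks_like_single_cause())

review.add_factor("Staging and production clusters have near-identical names", "design")
print(review.looks_like_single_cause())
```

With only the "human" factor recorded, the check returns True, which signals that the review may have stopped at blaming an individual. Adding the design factor makes it return False, mirroring the article's point that similar-looking controls are a systemic contributor rather than a personal failing.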
