Reliability Is an Emergent Property, Not a Root Cause
Looking for Root Causes Is a False Path
In a recent podcast, David Blank-Edelman, program lead for Microsoft’s SRE Academy, argued that searching for a single root cause of system failures is counterproductive. He emphasized that reliability is an emergent property of complex systems: it arises from interactions among technical and socio-technical factors and cannot be explained by isolating one failure point.
Why This Matters
Traditional root cause analysis assumes a single failure point, but modern systems are too interconnected for that assumption to hold. As Blank-Edelman explains, failures often stem from multiple contributing factors, including human decisions, design trade-offs, and unanticipated interactions. For example, the B-17 bomber crashes in WWII were traced not to a single mechanical failure but to poorly designed cockpit switches, a socio-technical oversight. Focusing on a single root cause hides these systemic issues, so the same failures recur and opportunities for systemic improvement are missed.
Key Insights
- “Reliability is an emergent property of an architecture and can include any property important to the customer, such as availability or durability.” (David Blank-Edelman, 2025)
- “Failures have multiple causes, some of which are socio-technological in nature.” (Podcast transcript, 2025)
- Temporal, used by Stripe and Coinbase, is an example of tooling for managing distributed workflows.
Practical Applications
- Use Case: Microsoft’s SRE Academy trains Azure engineers to focus on systemic feedback loops rather than isolated incidents.
- Pitfall: Prematurely blaming human error or a single component in post-incident reviews can mask deeper systemic issues, leading to recurring outages.
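One way to make the pitfall above harder to fall into is to give post-incident reviews a shape that has no single "root cause" field at all. The following is a minimal sketch under that assumption, not a method described in the podcast: the names `IncidentReview`, `ContributingFactor`, and `FactorKind`, and the example incident, are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class FactorKind(Enum):
    """Hypothetical two-way split mirroring the article's framing."""
    TECHNICAL = "technical"
    SOCIO_TECHNICAL = "socio-technical"


@dataclass(frozen=True)
class ContributingFactor:
    description: str
    kind: FactorKind


@dataclass
class IncidentReview:
    """A review record that holds many contributing factors, never one root cause."""
    title: str
    factors: list[ContributingFactor] = field(default_factory=list)

    def add_factor(self, description: str, kind: FactorKind) -> None:
        self.factors.append(ContributingFactor(description, kind))

    def is_single_cause(self) -> bool:
        # A review with zero or one factor is a prompt to keep digging,
        # not proof that the analysis is wrong.
        return len(self.factors) <= 1

    def factors_by_kind(self) -> dict[FactorKind, int]:
        # Count factors per kind to show whether socio-technical
        # contributors were considered at all.
        return dict(Counter(f.kind for f in self.factors))


# Hypothetical example data, invented for illustration only.
review = IncidentReview("Hypothetical checkout outage")
review.add_factor("Retry storm amplified a slow dependency", FactorKind.TECHNICAL)
review.add_factor("Runbook assumed an older deployment topology", FactorKind.SOCIO_TECHNICAL)
review.add_factor("Alert thresholds were tuned during a low-traffic period", FactorKind.SOCIO_TECHNICAL)

print(review.is_single_cause())
print(review.factors_by_kind())
```

The `is_single_cause` check is deliberately a soft signal: a team might use it to ask whether a review stopped too early, rather than to reject the review outright.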