Operationalizing Runbooks: Moving Beyond Documentation Theater
Your Runbook Is Written. Nobody Runs It.
Jono Herrington argues that runbooks often serve as ornamental “comfort objects” rather than operational truth. When documentation isn’t treated as release-critical, on-call engineers revert to DM threads and tribal knowledge during production incidents.
Why This Matters
Technical reliability is frequently treated as a capture problem (getting knowledge written down) when it is actually an enforcement problem (making sure the written steps are run and kept current). Organizations often hit a “Brent-shaped” bottleneck, where a single person on PTO stalls multiple workstreams because operational memory was never decentralized through verified documentation. The cost of this “narrative management” is a high interest rate on stability, where recurring failures are treated as leadership signals rather than engineering mysteries.
Key Insights
- Concept: Separating ‘Mitigated’ from ‘Closed’ states. ‘Mitigated’ means customer impact is under control; a ticket reaches ‘Closed’ only after the prevention work has been verified and shipped.
- Fact: The ‘Brent-shaped’ week occurs when one person on PTO stalls three lanes of work, revealing that an organization has mistaken URL access for distributed capability (Herrington, 2026).
- Concept: Behavior contracts over documentation theater, for example release checklists whose items cannot be marked done without a link to an updated runbook section.
- Tool: Having a named owner verify runbook steps in staging ensures that operational memory is functional and not just ‘paper compliance’ for auditors.
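The ‘Mitigated’ versus ‘Closed’ split and the checklist contract above can be sketched as a small state model. This is a minimal illustration, not a method taken from the article: the `Incident` class, its field names, and the exact evidence required to close are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"  # customer impact is under control
    CLOSED = "closed"        # prevention work verified and shipped


@dataclass
class Incident:
    # All field names are hypothetical, chosen for this sketch.
    title: str
    state: IncidentState = IncidentState.OPEN
    runbook_link: str = ""            # link to the updated runbook section
    verified_in_staging_by: str = ""  # named owner who ran the steps
    prevention_shipped: bool = False

    def mitigate(self) -> None:
        """Record that impact is controlled; the ticket is not finished."""
        self.state = IncidentState.MITIGATED

    def missing_for_close(self) -> list[str]:
        """List the evidence still missing before the ticket may close."""
        missing = []
        if self.state is not IncidentState.MITIGATED:
            missing.append("incident is not in the mitigated state")
        if not self.runbook_link:
            missing.append("link to updated runbook section")
        if not self.verified_in_staging_by:
            missing.append("named owner who verified the steps in staging")
        if not self.prevention_shipped:
            missing.append("prevention work shipped")
        return missing

    def close(self) -> None:
        """Close the ticket only when every piece of evidence is present."""
        missing = self.missing_for_close()
        if missing:
            raise ValueError("cannot close: " + "; ".join(missing))
        self.state = IncidentState.CLOSED
```

Calling `close()` on a merely mitigated incident raises an error that names what is missing, which is the point of the contract: the checklist item cannot be ticked by assertion alone.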
Practical Applications
- Use Case: Release leads gating a shipment until a runbook diff is provided alongside the code diff, so that documentation matches production reality.
- Pitfall: Relying on ‘heroics’ to unblock releases, which hides missing operational memory and stores risk in individual humans instead of systems.
- Use Case: Tracking the percentage of incidents whose runbook updates are completed before the next release, as a measure of whether reliability work is actually compounding.
- Pitfall: Managing reliability through quarterly dashboards and narratives rather than verifying that teams run operational docs under normal Tuesday load.
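The two use cases above, the diff-based release gate and the runbook-update metric, can be sketched in a few lines. This is an illustration under stated assumptions: the `docs/runbooks/` path, the `IncidentRecord` fields, and the choice to count an update as on time only when it lands strictly before the release date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical location of runbooks inside the repository.
RUNBOOK_PREFIX = "docs/runbooks/"


def release_gate(changed_paths: list[str]) -> bool:
    """Pass only if a change that touches code also touches a runbook."""
    touches_runbook = any(p.startswith(RUNBOOK_PREFIX) for p in changed_paths)
    touches_code = any(not p.startswith(RUNBOOK_PREFIX) for p in changed_paths)
    return touches_runbook or not touches_code


@dataclass
class IncidentRecord:
    # Field names are assumptions made for this sketch.
    incident_id: str
    next_release: date
    runbook_updated: Optional[date] = None  # None: runbook never updated


def runbook_update_rate(records: list[IncidentRecord]) -> float:
    """Percentage of incidents whose runbook update landed before the next release."""
    if not records:
        return 0.0
    on_time = sum(
        1
        for r in records
        if r.runbook_updated is not None and r.runbook_updated < r.next_release
    )
    return 100.0 * on_time / len(records)
```

A release lead could run `release_gate` against the list of changed files for a release candidate, and review `runbook_update_rate` per release rather than per quarter so the number reflects everyday behavior instead of a narrative.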