Operationalizing Runbooks: Moving Beyond Documentation Theater


These articles are AI-generated summaries. Please check the original sources for full details.

Your Runbook Is Written. Nobody Runs It.

Jono Herrington argues that runbooks often serve as ornamental “comfort objects” rather than operational truth. When documentation isn’t treated as release-critical, on-call engineers revert to DM threads and tribal knowledge during production incidents.

Why This Matters

Reliability is frequently treated as a capture problem (getting knowledge written down) when it is actually an enforcement problem (making sure what is written gets run and kept current). Organizations often hit a “Brent-shaped” bottleneck: a single person on PTO stalls multiple workstreams because operational memory was never distributed through verified documentation. This “narrative management” carries a high interest rate on stability, where recurring failures are treated as leadership signals rather than engineering mysteries.

Key Insights

  • Concept: Separating the ‘Mitigated’ and ‘Closed’ states ensures that customer impact is controlled first, and that a ticket is finalized only after prevention work is verified and shipped.
  • Fact: The ‘Brent-shaped’ week occurs when one person on PTO stalls three lanes of work, revealing that an organization has mistaken URL access for distributed capability (Herrington, 2026).
  • Concept: Favor behavior contracts over documentation theater, for example release checklists that cannot be marked done without a link to an updated runbook section.
  • Tool: Having a named owner verify runbook steps in staging ensures that operational memory is functional and not just ‘paper compliance’ for auditors.
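
The ‘Mitigated’ vs ‘Closed’ split and the checklist rule above can be sketched as a small state model. This is an illustrative sketch only, not something taken from the original article: the class name `Incident`, its fields, and the specific closing conditions are all assumptions chosen to mirror the bullets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"  # customer impact is under control
    CLOSED = "closed"        # prevention work verified and shipped


@dataclass
class Incident:
    # All field names are hypothetical, chosen for illustration.
    title: str
    state: State = State.OPEN
    prevention_shipped: bool = False
    runbook_link: Optional[str] = None  # link to the updated runbook section
    verified_by: Optional[str] = None   # named owner who ran the steps in staging

    def mitigate(self) -> None:
        """Record that customer impact is controlled; the ticket stays open."""
        self.state = State.MITIGATED

    def missing_for_close(self) -> List[str]:
        """List every unmet condition that blocks closing the ticket."""
        missing = []
        if self.state is not State.MITIGATED:
            missing.append("incident is not in the mitigated state")
        if not self.prevention_shipped:
            missing.append("prevention work not shipped")
        if not self.runbook_link:
            missing.append("no link to an updated runbook section")
        if not self.verified_by:
            missing.append("no named owner verified the steps in staging")
        return missing

    def close(self) -> None:
        """Close the ticket only when nothing is missing."""
        missing = self.missing_for_close()
        if missing:
            raise ValueError("cannot close: " + "; ".join(missing))
        self.state = State.CLOSED
```

The design choice here is that `close()` refuses rather than warns: the checklist is a contract enforced by the system, not a reminder that someone can skip under pressure.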

Practical Applications

  • Use Case: Release leads gating a shipment until a diff in the runbook is provided alongside the code diff to ensure documentation matches production reality.
  • Pitfall: Relying on ‘heroics’ to unblock releases, which hides missing operational memory and stores risk in individual humans instead of systems.
  • Use Case: Tracking the percentage of incidents whose runbook updates were completed before the next release, to measure whether reliability is actually compounding.
  • Pitfall: Managing reliability through quarterly dashboards and narratives rather than verifying that teams run operational docs under normal Tuesday load.
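
The two use cases above, gating a release on a runbook diff and tracking the runbook update rate, could be approximated as follows. This is a rough sketch under stated assumptions: the `runbooks/` path prefix, the function names, and the `IncidentRecord` shape are hypothetical and do not come from the original article.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed repository layout: runbooks live under this path prefix.
RUNBOOK_PREFIX = "runbooks/"


def release_allowed(changed_files: Iterable[str]) -> bool:
    """Allow a release only if any code change is accompanied by a runbook change."""
    files = list(changed_files)
    touches_runbook = any(f.startswith(RUNBOOK_PREFIX) for f in files)
    touches_code = any(not f.startswith(RUNBOOK_PREFIX) for f in files)
    # Runbook-only changes (or an empty change set) pass; code alone does not.
    return touches_runbook or not touches_code


@dataclass
class IncidentRecord:
    # Hypothetical record shape for illustration.
    incident_id: str
    runbook_updated_before_next_release: bool


def runbook_update_rate(records: List[IncidentRecord]) -> float:
    """Percentage of incidents whose runbook update landed before the next release."""
    if not records:
        return 0.0
    on_time = sum(1 for r in records if r.runbook_updated_before_next_release)
    return 100.0 * on_time / len(records)
```

In practice the list of changed files would come from the version control system or CI pipeline; that integration is left out here because it depends on tooling the article does not specify.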
