
What the Review Missed and What Changed

Chapter 22 of 38

What the Review Missed

Cloudflare published a detailed post-incident blog post that is among the most transparent and technically thorough public retrospectives in the industry. The post identifies the root cause (catastrophic backtracking in a WAF regex), the contributing causes (simultaneous global deployment, absence of regex complexity analysis in the rule pipeline, absence of CPU usage guardrails), and the timeline of detection and recovery.

The technical analysis in the blog post is complete. What it leaves largely unaddressed, perhaps necessarily, is the broader architectural question: why was a backtracking regex engine used for security-critical pattern matching in the first place?

The backtracking regex engine was the standard choice. Most programming languages and frameworks ship with backtracking engines (PCRE, Java’s java.util.regex, Python’s re, JavaScript’s built-in regex). Non-backtracking engines like RE2 require a deliberate choice, a separate library, and often a compromise on regex features. The Cloudflare team used what was available, which is what most teams do. The risk of catastrophic backtracking was known in the abstract but not operationalized: no automated tool in the deployment pipeline checked for patterns susceptible to exponential backtracking.
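To make the abstract risk concrete, here is a minimal sketch using Python’s re, one of the backtracking engines named above. The pattern (a+)+$ is a textbook pathological shape chosen for illustration; it is not the rule from the incident, and the input sizes are arbitrary. On input that almost matches, each extra character roughly doubles the work.

```python
import re
import time

# Illustrative pattern with a nested quantifier (not the Cloudflare rule).
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Seconds spent failing to match n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.match(text)  # always None: the trailing 'b' prevents a match
    elapsed = time.perf_counter() - start
    if result is not None:
        raise AssertionError("unexpected match")
    return elapsed


def timings(sizes=(12, 16, 20)):
    """Map each input size to the time the engine needed to give up."""
    return {n: time_failed_match(n) for n in sizes}


if __name__ == "__main__":
    for n, seconds in timings().items():
        print(f"n={n:2d}  {seconds:.6f} s")
```

The sizes are kept small on purpose. Adding a dozen more characters to the input is enough to turn a fraction of a second into minutes or longer.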

The deeper lesson, which the blog post touches on but does not fully develop, is that configuration changes can be as dangerous as code changes but are often deployed with fewer safeguards. The WAF rule was not a code deployment. It was a configuration change. It went through a lighter review process and a faster deployment path because rules need to change quickly in response to new threats. The speed that makes the WAF effective against attackers also makes it effective against Cloudflare’s own infrastructure when a rule is defective.

What Changed

Regex engine migration. Cloudflare began migrating its WAF from PCRE (a backtracking engine) to a combination of approaches: a Rust-based regex library that uses finite automaton matching for the common case (linear time, no backtracking), with CPU time budgets for any path that requires backtracking features. This migration addressed the root cause: the class of bugs that includes catastrophic backtracking is eliminated by using an engine that does not backtrack.
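The difference between the two matching strategies can be sketched in a few lines. The example below hand-builds a small nondeterministic automaton for the pattern (a|aa)+$, an ambiguous pattern of the kind that makes a backtracking engine explode, and simulates it by tracking the set of live states. This is an illustration of the general technique only; the state numbering and function names are hypothetical, and it does not reflect how any particular library is implemented.

```python
# Hand-built NFA for (a|aa)+$ (hypothetical illustration).
# State 0: start. State 1: just finished an alternative (accepting).
# State 2: consumed the first 'a' of the "aa" alternative.
NFA = {
    (0, "a"): {1, 2},
    (1, "a"): {1, 2},
    (2, "a"): {1},
}
ACCEPTING = {1}


def nfa_match(text):
    """Return (matched, state_visits) using state-set simulation."""
    states = {0}
    state_visits = 0
    for ch in text:
        next_states = set()
        for s in states:
            state_visits += 1
            next_states |= NFA.get((s, ch), set())
        states = next_states
        if not states:  # no live state left: the match has failed
            break
    return bool(states & ACCEPTING), state_visits


if __name__ == "__main__":
    print(nfa_match("aaaa"))
    print(nfa_match("a" * 30 + "b"))
```

Each input character is read once, and the work per character is bounded by the number of states, so the total cost grows linearly with input length regardless of how ambiguous the pattern is.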

The outage also had a broader industry impact: it became one of the most widely cited arguments for using RE2, Rust’s regex crate, or other linear-time regex implementations in any system that evaluates regular expressions against untrusted or uncontrolled input. The recommendation “do not use backtracking regex engines for security pattern matching” existed before the Cloudflare outage. After the outage, it became a rule with a $5-7 billion company’s 27-minute global outage behind it.

Progressive deployment for configuration. Cloudflare implemented staged rollouts for WAF rule changes. New rules are deployed to a subset of traffic or a subset of data centers first. Metrics (CPU usage, request latency, error rate) are monitored during the initial rollout. If metrics deviate from baseline, the deployment is automatically halted and rolled back. Only after the rule has been validated on the canary population is it promoted to global deployment.
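The halt-or-promote decision described above can be sketched as a comparison of canary metrics against a baseline. Everything in this example is an assumption made for illustration: the Metrics fields, the MAX_RATIO thresholds, and the canary_decision function are hypothetical and do not describe Cloudflare’s actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_percent: float
    p99_latency_ms: float
    error_rate: float


# Hypothetical limits: how far each canary metric may rise above baseline.
MAX_RATIO = {"cpu_percent": 1.25, "p99_latency_ms": 1.5, "error_rate": 2.0}


def canary_decision(baseline, canary):
    """Return ('promote', []) or ('rollback', [names of regressed metrics])."""
    regressed = []
    for name, ratio in MAX_RATIO.items():
        if getattr(canary, name) > getattr(baseline, name) * ratio:
            regressed.append(name)
    if regressed:
        return "rollback", regressed
    return "promote", []


if __name__ == "__main__":
    baseline = Metrics(cpu_percent=30.0, p99_latency_ms=40.0, error_rate=0.001)
    bad_canary = Metrics(cpu_percent=98.0, p99_latency_ms=900.0, error_rate=0.2)
    print(canary_decision(baseline, bad_canary))
```

A real gate would also need a minimum observation window and enough canary traffic for the comparison to be meaningful; those details are omitted here.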

This practice, routine for code deployments, was not standard for configuration changes before the Cloudflare outage. The incident established that configuration is code: any change that alters system behavior at runtime must go through the same deployment safeguards as compiled code, including canary testing, monitoring, and automated rollback.

Regex complexity analysis. Cloudflare added automated analysis to its rule pipeline that evaluates regex patterns for backtracking risk before they can be deployed. The analysis detects patterns with overlapping alternatives inside quantified groups, nested quantifiers, and other structures known to produce exponential backtracking. Rules that fail the analysis are rejected and must be rewritten.
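As a rough sketch of what one such check might look like, the function below flags a single risky structure: a group that contains an unbounded quantifier (* or +) and is itself followed by one. It is a deliberately simple heuristic written for illustration, with a hypothetical function name. It does not detect overlapping alternatives or adjacent wildcards, it can produce false positives, and it is not the analysis Cloudflare deployed.

```python
UNBOUNDED = ("*", "+")


def has_nested_quantifier(pattern):
    """Heuristic: True if a group quantified by * or + contains * or +."""
    stack = []        # one flag per open group: saw an unbounded quantifier inside
    in_class = False  # inside a [...] character class, where * and + are literals
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2    # skip the escaped character
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            stack.append(False)
        elif ch == ")":
            inner = stack.pop() if stack else False
            quantified = pattern[i + 1:i + 2] in UNBOUNDED
            if inner and quantified:
                return True
            if stack and (inner or quantified):
                stack[-1] = True  # propagate to the enclosing group
        elif ch in UNBOUNDED and stack:
            stack[-1] = True
        i += 1
    return False


if __name__ == "__main__":
    for p in (r"(a+)+$", r"(a|b)+", r"((a+)b)*", r"[(+]+"):
        print(p, has_nested_quantifier(p))
```

In a pipeline, a check like this would run before deployment and reject the rule outright, which is cheaper than discovering the problem in production.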

This practice is now standard in organizations that use regex for any form of content matching: WAFs, log parsing, input validation. The principle is that a regex pattern is not reviewed only for correctness (does it match what it should match?) but also for safety (can it be made to consume unbounded resources?).

CPU isolation and timeouts. Cloudflare implemented per-request CPU budgets for WAF evaluation. If a single request’s WAF processing exceeds the budget, the evaluation is terminated and the request is either passed through or blocked, depending on whether the WAF is configured to fail open or fail closed. This confines any future backtracking event to the request that triggered it, so one pathological input cannot consume an entire server’s CPU capacity.
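The budget-and-terminate idea can be sketched with a wall-clock timeout around an isolated evaluation. The example below runs each match in a child Python process and kills it when the budget expires. This is an assumption-laden illustration: the function name, exit codes, and budget values are hypothetical, wall-clock time stands in for CPU time, and spawning a process per request is far too slow for a real WAF.

```python
import subprocess
import sys

# Child program: exit code 10 on match, 11 on no match (arbitrary, hypothetical codes).
_CHILD = (
    "import re, sys\n"
    "sys.exit(10 if re.search(sys.argv[1], sys.argv[2]) else 11)\n"
)


def evaluate_rule(pattern, text, budget_seconds=1.0, fail_open=True):
    """Return 'block' or 'pass' without ever running past the budget."""
    on_failure = "pass" if fail_open else "block"
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=budget_seconds,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # Budget exhausted: subprocess.run has already killed the child.
        return on_failure
    if result.returncode == 10:
        return "block"
    if result.returncode == 11:
        return "pass"
    return on_failure  # the child crashed, e.g. on an invalid pattern


if __name__ == "__main__":
    print(evaluate_rule(r"(a+)+$", "a" * 40 + "b"))
    print(evaluate_rule("attack", "an attack string", budget_seconds=5.0))
```

The choice between fail_open=True and fail_open=False is the same trade-off the text describes: failing open preserves availability at the cost of letting an unchecked request through, while failing closed does the opposite.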

The Rule

Never use a backtracking engine to evaluate a regular expression against input whose length or content you do not control. Use a linear-time regex engine, or enforce execution time budgets that terminate evaluation before resources are exhausted. A regex that takes microseconds on expected input can take hours on adversarial input.

This rule comes from the Cloudflare outage, where a single WAF regex deployed simultaneously to every data center consumed 100% CPU on every server, taking millions of websites offline for 27 minutes.