The Cloudflare Global Outage: A Regular Expression, a Backtracking Engine, and 100% CPU Across Every Data Center Simultaneously
The System as Its Engineers Understood It
Cloudflare operates a content delivery network and web security platform that handles a significant percentage of global internet traffic. As of the incident date, Cloudflare’s network operates in 194 cities across 90 countries. Every HTTP request to a Cloudflare-protected website passes through Cloudflare’s edge servers, where it is inspected for malicious content by the Web Application Firewall (WAF).
The WAF operates by matching incoming HTTP request content against a set of rules. Each rule contains one or more patterns, typically expressed as regular expressions, that identify known attack signatures: SQL injection patterns, cross-site scripting payloads, path traversal attempts, and other malicious patterns. When a request matches a rule, the WAF can block it, log it, or challenge the requester.
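The matching loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Cloudflare's implementation: the `Rule` type, the rule identifiers, and the three signatures are hypothetical and far simpler than production WAF rules.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    rule_id: str         # hypothetical identifier
    pattern: re.Pattern  # compiled attack signature
    action: str          # "block", "log", or "challenge"


# Illustrative signatures only; real rules are far more thorough.
RULES = [
    Rule("sqli-001", re.compile(r"(?i)\bunion\s+select\b"), "block"),
    Rule("xss-001", re.compile(r"(?i)<script\b"), "challenge"),
    Rule("traversal-001", re.compile(r"\.\./"), "log"),
]


def evaluate(request_content: str, rules: Sequence[Rule] = RULES) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the request, or None."""
    for rule in rules:
        if rule.pattern.search(request_content):
            return rule
    return None


hit = evaluate("/search?q=1 UNION SELECT password FROM users")
print(hit.action if hit else "allow")  # prints "block"
```

Every request pays the cost of every pattern it is checked against, so the worst-case running time of a single pattern is a property of the whole system, not just of that rule.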
The WAF rules are written by Cloudflare’s security team and deployed to all edge servers globally. A rule change is not a software deployment. Rules are data. They are compiled into a pattern-matching engine and pushed to edge servers through a configuration pipeline. This pipeline is deliberately faster than a code deployment, because rules must change frequently as new attack patterns are discovered. The speed of rule deployment is a security feature: when a new attack vector is identified, a new rule can be deployed globally within minutes.
The regular expression engine used by the WAF is a backtracking engine. This is significant. There are two fundamental approaches to regular expression matching:
A finite automaton engine (DFA or NFA without backtracking) reads each character in the input once and never revisits it. For a given pattern, its matching time grows linearly with the length of the input, whatever the input contains. It cannot express certain advanced features like backreferences, but for attack signature matching, this limitation is rarely relevant.
A backtracking engine processes the input by trying each possible path through the pattern. When a path fails to match, the engine backtracks to the last decision point and tries the next alternative. For most patterns and inputs, this is fast. For certain patterns, typically those with nested or adjacent quantifiers, there are inputs for which the number of backtracking paths grows exponentially with the input length. This is called catastrophic backtracking, and it can cause a single regex match to consume minutes or hours of CPU time on a short input string.
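The difference between the two approaches can be made concrete with a toy example. The sketch below hand-codes two matchers for one textbook pattern, (a+)+b, and counts the work each does on an input that cannot match. The pattern and both functions are illustrative assumptions; neither is the rule from the incident nor the engine the WAF uses.

```python
from typing import Tuple


def backtracking_match(text: str) -> Tuple[bool, int]:
    """Match (a+)+b against all of `text` by naive backtracking; count attempts."""
    steps = 0

    def one_group(pos: int) -> bool:
        nonlocal steps
        # a+ can end anywhere inside the run of 'a' that starts at pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        for stop in range(end, pos, -1):  # greedy: try the longest run first
            steps += 1
            if one_group(stop):           # another iteration of (...)+
                return True
            if text[stop:] == "b":        # or leave the group and match the final b
                return True
        return False

    return one_group(0), steps


def automaton_match(text: str) -> Tuple[bool, int]:
    """Match the same pattern with a three-state automaton; one step per character."""
    # States: 0 = start, 1 = seen one or more 'a', 2 = seen the final 'b'.
    transitions = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}
    state, steps = 0, 0
    for ch in text:
        steps += 1
        state = transitions.get((state, ch))
        if state is None:                 # no valid transition: reject
            return False, steps
    return state == 2, steps


for n in (5, 10, 15):
    failing_input = "a" * n               # no trailing 'b', so no match exists
    print(n, backtracking_match(failing_input)[1], automaton_match(failing_input)[1])
# prints:
# 5 31 5
# 10 1023 10
# 15 32767 15
```

Each additional character doubles the backtracking work, because every way of splitting the run of a's into groups is tried before the engine gives up. The automaton does one step per character regardless.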
Cloudflare’s WAF uses a backtracking regex engine, and the engineering team is aware of the catastrophic backtracking risk that comes with it. The rule deployment pipeline includes testing, but that testing does not automatically detect patterns susceptible to catastrophic backtracking.
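To give a sense of what automated detection could look like, here is a deliberately crude lint that flags two common warning signs in a pattern's source text: a quantified group that itself contains a quantifier, and two or more wildcard repetitions in a row. The function name and both heuristics are assumptions made for illustration. The check produces false positives and false negatives (it ignores escaping and nested groups, for example) and is not a description of any check Cloudflare runs.

```python
import re
from typing import List

# A group containing + or * that is itself followed by + or *, e.g. (a+)+
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*]")
# Two or more wildcard repetitions back to back, e.g. .*.*
ADJACENT_WILDCARDS = re.compile(r"(?:\.[*+]){2,}")


def looks_risky(pattern: str) -> List[str]:
    """Return heuristic warnings for a regex given as source text."""
    findings = []
    if NESTED_QUANTIFIER.search(pattern):
        findings.append("nested quantifier")
    if ADJACENT_WILDCARDS.search(pattern):
        findings.append("adjacent wildcards")
    return findings


for candidate in (r"(a+)+b", r".*.*=x", r"(?i)\bunion\s+select\b"):
    print(candidate, looks_risky(candidate))
```

A textual lint like this is only a first filter. A more reliable gate would also run each candidate pattern against adversarial inputs under a strict step or time budget before it is allowed into the pipeline.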
The Chain
July 2, 2019, 13:42 UTC. A Cloudflare engineer deploys a new WAF rule. The rule is designed to detect a specific attack pattern in HTTP request content. The rule contains a regular expression.
13:42 UTC. The rule is pushed to all Cloudflare edge servers globally. The deployment is simultaneous. There is no staged rollout. There is no canary deployment that activates the rule on a subset of servers first. The rule goes live on every server in every data center at the same time.
13:42 UTC (continued). The regex engine begins evaluating the new rule against incoming HTTP requests. For most requests, the regex evaluates quickly and either matches or does not match. For certain requests whose content triggers exponential backtracking in the regex, the evaluation does not complete in any practical amount of time. The regex engine keeps backtracking and holds one CPU core at 100% for as long as the evaluation runs.
13:42 to 13:45 UTC. Because every edge server runs the same rule against the same distribution of traffic, every data center simultaneously experiences CPU exhaustion. The WAF processing pipeline saturates all available CPU capacity. HTTP request processing stalls. Cloudflare’s network, which normally proxies millions of requests per second, stops responding.
13:45 UTC. Cloudflare’s monitoring systems detect the global CPU spike. The alert reaches the engineering team. Every server in every data center is at 100% CPU. The pattern is unmistakable: something deployed globally is consuming all resources.
13:52 UTC. The engineering team identifies the WAF rule deployment as the cause and begins the process of rolling back the rule. The rollback itself is complicated by the CPU saturation: management plane operations that require CPU time on the edge servers are slow because the CPUs are consumed by the regex backtracking.
14:09 UTC. The rule is disabled globally. Cloudflare’s edge servers begin recovering. Normal request processing resumes. The total outage duration is approximately 27 minutes.
During those 27 minutes, every website, API, and service behind Cloudflare is unreachable. The impact spans millions of websites and hundreds of millions of users.
The diagram shows the deployment topology. Unlike the Knight Capital failure, where the problem was that one server was different, the Cloudflare failure occurred because every server was the same. The same rule, deployed at the same time, evaluating the same distribution of traffic, exhausting CPU in the same way. Global consistency, normally a strength, became the mechanism of simultaneous global failure.
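The safeguard the timeline notes as absent, a staged rollout with a canary, can be sketched as follows. Everything here is hypothetical: the `staged_rollout` function, its callback parameters, the server names, and the 80% CPU threshold are assumptions chosen for illustration, and a real system would let each stage soak under live traffic before checking its health.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    activate: Callable[[str], None],
    cpu_percent: Callable[[str], float],
    rollback: Callable[[str], None],
    cpu_limit: float = 80.0,
) -> bool:
    """Activate a rule stage by stage; undo everything if a stage overheats."""
    activated: List[str] = []
    for stage in stages:
        for server in stage:
            activate(server)
            activated.append(server)
        # A real rollout would wait here before sampling health metrics.
        if any(cpu_percent(server) > cpu_limit for server in stage):
            for server in reversed(activated):
                rollback(server)
            return False
    return True


# Simulated fleet: activating the bad rule pins a server's CPU at 100%.
cpu: Dict[str, float] = {}
stages = [["canary-1"], ["edge-1", "edge-2"], ["edge-3", "edge-4", "edge-5"]]
ok = staged_rollout(
    stages,
    activate=lambda s: cpu.__setitem__(s, 100.0),
    cpu_percent=lambda s: cpu[s],
    rollback=lambda s: cpu.__setitem__(s, 10.0),
)
print(ok, sorted(cpu))  # prints: False ['canary-1']
```

In this simulation the bad rule reaches one server instead of all of them. The sketch does not address the problem the rollback ran into during the incident, which is that undoing a change on a CPU-starved server is itself slow.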