The Mechanism
The Mechanism
The regular expression that caused the outage contained a pattern of the form:
(?:(?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?(googletag\.tele498teleS)
The problem is in the first group: (?:"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\|-|+)+`
This is a non-capturing group with a + quantifier (one or more repetitions). Inside the group is a set of alternatives separated by |. Several of these alternatives can match the same characters. The \d alternative matches digits. The alternatives nan, infinity, true, false, etc. match specific words. The \- and \+ match literal characters.
When the engine encounters an input string that partially matches several alternatives in the group, the + quantifier forces the engine to try every possible combination of alternatives for every possible partitioning of the input. The number of combinations is exponential in the length of the matching prefix.
A simplified demonstration of the backtracking explosion:
# RECONSTRUCTED FROM CLOUDFLARE INCIDENT BLOG POST
# Simplified to illustrate the backtracking mechanism
import re
import time
# A regex with overlapping alternatives inside a + quantifier
# This is a simplified version of the problematic pattern
pattern = re.compile(r'(?:a{1,3})+b')
# Input: many 'a' characters followed by a character that is NOT 'b'
# The engine must try every way to partition the 'a's among
# the repetitions of the group before concluding no match exists
for n in [10, 15, 20, 25]:
input_str = 'a' * n + 'c' # No 'b', so no match possible
start = time.time()
result = pattern.search(input_str)
elapsed = time.time() - start
print(f"n={n}: {elapsed:.4f} seconds (result={result})")
# Output (approximate, varies by machine):
# n=10: 0.0001 seconds
# n=15: 0.0020 seconds
# n=20: 0.0600 seconds
# n=25: 2.0000 seconds
# Each additional 5 characters multiplies time by ~30x
The backtracking occurs because the regex engine must prove that no match exists. When the input ends with a character that cannot satisfy the remainder of the pattern (the character after the + group), the engine backtracks to try different partitionings of the input across the repetitions of the group. For a group with overlapping alternatives, the number of ways to partition N characters is exponential.
In the actual Cloudflare regex, the overlapping alternatives inside the group mean that for certain input strings (particularly those containing sequences of characters that partially match multiple alternatives), the engine enters a backtracking loop that grows exponentially with input length. An HTTP request body containing a few dozen characters of the right composition can pin a CPU core indefinitely.
The fix is conceptually simple but architecturally significant. There are three approaches:
Rewrite the regex to eliminate backtracking. Use atomic groups or possessive quantifiers that prevent backtracking once a match is made. This fixes the immediate pattern but requires every rule writer to be an expert in regex engine internals.
Switch to a non-backtracking engine. A DFA-based or linear-time regex engine (such as RE2, developed by Google specifically to avoid catastrophic backtracking) processes any input in time proportional to the input length, regardless of the pattern. This eliminates the entire class of catastrophic backtracking bugs at the cost of not supporting certain regex features (backreferences, lookaheads in some implementations).
Add a timeout to regex evaluation. If a regex match takes longer than a specified duration, abort it and treat the rule as non-matching for that request. This bounds the damage from any single backtracking event but does not prevent the CPU consumption from occurring. If many requests trigger the backtracking simultaneously, the timeouts themselves may not fire quickly enough.
Cloudflare implemented all three approaches in different timeframes after the incident.