Five regex patterns that will DoS your service via catastrophic backtracking
In 2019, Cloudflare took down a chunk of the internet for 27 minutes with a
regex. The pattern was meant to detect XSS in WAF rules. It contained
.*(?:.*=.*). Against the right input, that pattern blows up to
exponential time.
This is catastrophic backtracking — a class of regex bugs where a seemingly-fine pattern hits an input the engine can’t handle without trying every possible path. Below are the five most common shapes, the fix for each, and how to spot them.
1. Nested quantifiers on overlapping classes
The classic shape:
(a+)+
(.*)*
(\d+)+
For input like aaaaaaaaaaaaaaaaaaaaaa! (22 a’s followed by a non-match),
the engine tries every possible split of the a’s between the outer and
inner groups. That’s 2^22 paths.
Why it happens: the inner + and outer + both consume the same
characters, so the engine can’t decide which one “owns” each character
without trying all combinations.
Fix: collapse to a single quantifier or use atomic groups.
# Before:
(a+)+
# After:
a+
# Or, if you actually need the group:
(?>a+)+
(?>...) is an atomic group (PCRE/Python re2/Java/.NET — not JS;
JS uses (?!a)a+ workarounds or just rewrites the pattern).
2. Optional inside repeating optional
(a?b?)+
([0-9]?[a-z]?)*
Same disease. The engine has 4 ways to match each iteration: ab, a,
b, empty. With repetition, it explores all 4^n combinations on a
near-miss input.
Fix: rewrite to commit. If you mean “any number of a’s and b’s
mixed”, use [ab]*. If you mean “the literal sequence”, spell it out.
# Before:
(a?b?)+
# After:
[ab]+
3. Greedy .* in front of an alternation
.*(foo|bar)
This looks innocent. It usually is — until you give it a long input that
ends with food (just foo plus extra chars). The engine matches .*
greedily to the end, then backtracks character by character looking for
foo or bar.
Fix: anchor or restrict the leading match.
# If the line is the unit:
^[^fb]*(foo|bar)
# If you want everything before:
^.*?(foo|bar)
.*? (lazy) backs off naturally; [^fb]* excludes characters that can’t
be the start of either alternative.
4. Email validation gone too clever
^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}$
That (...)+ on the domain part is the trap. Inputs like
a@aaaaaaaaaaaaaaaaa (no .) blow up the engine.
Fix: don’t write your own email regex. Use a single-pass parser, or the lib your stdlib ships with. If you must regex it, anchor and limit:
^[^@\s]+@[^@\s]+\.[^@\s]+$
This won’t reject every bogus email, but it won’t blow up either, and a cheap regex + an MX check is what you actually want.
5. Exponential alternation overlap
(foo|foo)+
(red|green|red|blue)+
When alternatives overlap in what they match, the engine tries each ordering. On long matching input, this hits exponential paths.
Fix: deduplicate. If alternatives genuinely overlap (e.g., one is a prefix of another), order them with the longest first and consider atomic grouping.
How to know your regex has the disease
Three quick tests:
- Run it against
"a" * 30 + "!"(or analogous near-miss input). If it takes more than ~100ms, it’s likely exponential. - Use
safe-regexorredos-detectoras a CI check. They flag common nested-quantifier shapes statically. - Switch to a non-backtracking engine for untrusted input. Go’s
regexp, Rust’sregexcrate, andre2are all linear-time guaranteed. JavaScript’s V8 engine added some linear-time fast paths in 2023 but doesn’t promise it.
Defense in depth
For any regex that runs against user-supplied input:
- Set a timeout. In Node,
vm.runInNewContext(..., { timeout }). In Python 3.11+,re.match(..., timeout=...). In Go, usecontext.WithTimeout. - Cap input length before matching.
if (len(input) > 4096) reject;— most legitimate inputs are short. - Prefer parsers for structured input. Email, URL, JWT — there’s a proper parser for each. Regex is the wrong tool.
- Switch engines for hot paths.
re2and Go’sregexpgive you the guarantee that no input can blow up the engine; the trade-off is they don’t support backreferences (which is usually fine).
Try it
Paste a suspect pattern into the regex tester, then try inputs of varying lengths. If a small input gets a quick result and a slightly longer one hangs your browser tab, you’ve found a backtracking case.
The patterns in this post are common enough that I’ve seen each one personally take down production. Worth doing the audit.
Related across the network
- hash.tooljo.com — for content checks where regex is the wrong tool. Hashing is the right answer for “did the bytes change?” or “does this match the canonical version?”
- epoch.tooljo.com — for timestamp parsing, use the auto-detector instead of writing a regex; ISO 8601 has too many edge-case forms to match by pattern.
- jwt.tooljo.com — JWT validation is a parser problem, not a regex problem. Don’t be the next CVE.