What is a regular expression?
A regular expression (regex, regexp) is a small language for describing text patterns. You write a pattern; a regex engine matches it against input strings. Regexes are used for searching, validating, extracting, and transforming text.
If you’ve ever used Find/Replace in VS Code, written a grep command,
or validated a form field client-side, you’ve probably touched regex.
In one example
Pattern: \b\d{3}-\d{2}-\d{4}\b
Input: "Call me at 555-12-3456 about the report."
Match: "555-12-3456" (a US SSN-format number)
The pattern is a tiny program: “find a word boundary, then 3 digits, then a dash, then 2 digits, then a dash, then 4 digits, then a word boundary.” The regex engine compiles that to a state machine and runs it against the input.
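As a minimal sketch, here is the same pattern run in JavaScript (the flavor this site's tester uses). The variable names are illustrative only.

```javascript
// The pattern from the example, written as a JavaScript regex literal.
const ssnShape = /\b\d{3}-\d{2}-\d{4}\b/;
const input = "Call me at 555-12-3456 about the report.";

const result = input.match(ssnShape);
const matched = result ? result[0] : null;   // the matched substring, or null
const position = result ? result.index : -1; // offset where the match starts

// The leading \b rejects a run of four digits before the first dash.
const tooManyDigits = ssnShape.test("ID 1555-12-3456");

console.log(matched, position, tooManyDigits);
```

The word boundaries are what keep the pattern from matching inside a longer run of digits.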
What it’s good at
- Validating input format: “is this a valid email shape?” “does this look like a phone number?”
- Searching unstructured text: log lines, source code, config files.
- Extracting fields from semi-structured text: dates, IDs, URLs.
- Bulk renaming and reformatting: shell scripts, IDE find-and-replace.
- Tokenizing: splitting text on patterns more sophisticated than a single character.
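A few of these uses can be sketched in JavaScript. The log line and field names below are invented for illustration; they are not taken from any real system.

```javascript
// Hypothetical log line, made up for this example.
const logLine = "2024-03-15 ERROR user=42 path=/api/items took 120ms";

// Searching: does the line mention the ERROR level as a whole word?
const isError = /\bERROR\b/.test(logLine);

// Extracting: pull an ISO-style date apart with named capture groups.
const dateMatch = logLine.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const { year, month, day } = dateMatch.groups;

// Validating: check that a string has a simple NNN-NNNN shape.
const phoneShapeOk = /^\d{3}-\d{4}$/.test("555-1234");

// Tokenizing: split on commas or semicolons, ignoring surrounding whitespace.
const tokens = "a, b;c ,  d".split(/\s*[,;]\s*/);

console.log(isError, year, month, day, phoneShapeOk, tokens);
```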
What it’s bad at
- Parsing nested structures. HTML, JSON, code — anything with arbitrary nesting requires a parser, not a regex. The classic Stack Overflow answer “You can’t parse HTML with regex” is the canonical statement.
- Anything where context matters. If matching a word depends on what surrounded it three sentences ago, regex isn’t the right tool.
- Complex grammars. Programming languages have proper grammars that parser generators handle; regex chokes on them.
A useful heuristic: if the input has matched delimiters (parens, brackets, tags) at arbitrary depth, you need a parser, not regex.
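To illustrate the heuristic, the sketch below contrasts what a JavaScript regex can do with parentheses (find one innermost pair) with what takes a counter (checking balance at any depth). The function name isBalanced is hypothetical, and a real parser would track more than a single depth number.

```javascript
// A regex can find an innermost parenthesized group: no parens inside it.
const innermost = /\([^()]*\)/;
const inner = innermost.exec("f(a, g(b, c))")[0];

// Checking balance at arbitrary depth needs state, here a depth counter.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") {
      depth += 1;
    } else if (ch === ")") {
      depth -= 1;
      if (depth < 0) return false; // closed more than was opened
    }
  }
  return depth === 0; // every open paren was closed
}

console.log(inner, isBalanced("f(a, g(b, c))"), isBalanced("f(a, g(b, c)"));
```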
Where it came from
Regex traces back to Stephen Kleene’s 1956 paper on regular sets in
formal language theory. Practical implementations started with Ken
Thompson, who built a pattern engine into the QED editor in 1968 and
later carried it into ed and grep, the basis for regex support in
Unix tools.
The “regex” you write in 2026 is several generations descended from that original. It includes features such as backreferences and (in some flavors) recursion that go beyond what is formally “regular” in the language-theory sense, plus conveniences like lookaround that the original theory never had, but the name stuck.
Flavors
There’s no single regex spec. Each language and tool implements its own flavor, with variations in:
- Which metacharacters need escaping
- Whether ^ / $ match line boundaries by default
- Support for lookaround, named groups, recursion
- Unicode handling
- Greedy / lazy / atomic / possessive quantifiers
The major flavors:
| Flavor | Used in | Notable features |
|---|---|---|
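| POSIX BRE | grep and sed by default (without -E) | Most basic; (, ), {, } are literal characters, and grouping and intervals are written \(, \), \{, \} |
| POSIX ERE | grep -E, sed -E, awk | Adds +, ?, and alternation (\|); parens and braces work as operators without escapes |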
| PCRE | grep -P, PHP, many editors | The “kitchen sink” — recursion, atomic groups, conditionals |
| JavaScript | browsers, Node | ECMAScript spec; close to PCRE but missing atomic groups and recursion |
| Python re | Python | PCRE-ish; named groups, but no recursion |
| Java | JDK | PCRE-style with explicit anchor variants (\A, \Z) |
| Go regexp | Go | RE2 engine: guaranteed linear time, no backreferences |
| .NET | C#, F# | Variable-length lookbehind, balancing groups |
The tester on this site uses JavaScript flavor. See Regex flavors compared for a detailed table.
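One rough way to see flavor differences from inside JavaScript is to try compiling a pattern and catch the SyntaxError. The helper name compiles and the sample patterns are assumptions made for this sketch; it only tells you what the engine running the code accepts, not what other flavors do.

```javascript
// Returns true if this JavaScript engine accepts the pattern source.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so surface it
  }
}

const support = {
  namedGroup: compiles("(?<year>\\d{4})"), // supported in modern JavaScript
  lookahead: compiles("a(?=b)"),
  atomicGroup: compiles("(?>a+)b"),        // PCRE syntax, rejected by JavaScript
  recursion: compiles("\\((?R)?\\)"),      // PCRE syntax, rejected by JavaScript
};

console.log(support);
```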
A vocabulary primer
You’ll hear these terms:
- Pattern: the regex source string itself.
- Match: a substring the pattern successfully matches.
- Capture group: a parenthesized sub-pattern whose match is “remembered” for backreference or extraction. (\d+)-(\d+) has two capture groups.
- Backreference: \1 (inside a pattern) or $1 (in a replacement string), referring to the first capture group’s match.
- Anchor: a position-only pattern such as ^ (start), $ (end), or \b (word boundary). Anchors don’t consume characters.
- Lookaround: (?=...), (?!...), and their lookbehind forms (?<=...) and (?<!...). They match a position contingent on what’s around it, but don’t consume characters.
- Flag: a modifier that changes how the engine interprets the pattern, such as g (global), i (ignore case), or m (multiline).
- Greedy / lazy: quantifier preference. Greedy + matches as much as possible; lazy +? matches as little as possible.
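The terms above can be seen side by side in a short JavaScript sketch. The sample strings are made up for illustration.

```javascript
// Capture groups: two parenthesized sub-patterns, read back by index.
const range = "pages 10-25".match(/(\d+)-(\d+)/);
const from = range[1];
const to = range[2];

// Backreference in a pattern: \1 must repeat whatever group 1 matched.
const hasDoubledWord = /\b(\w+) \1\b/.test("it was was late");

// Backreference in a replacement string: $1 and $2 swap the two numbers.
const swapped = "10-25".replace(/(\d+)-(\d+)/, "$2-$1");

// Anchors: ^ and $ pin the pattern to the start and end of the string.
const allDigits = /^\d+$/.test("12345");

// Lookahead: match digits only when "px" follows, without consuming "px".
const pixels = "width: 40px".match(/\d+(?=px)/)[0];

// Flags: i ignores case, g collects every match.
const hits = "Cat cat CAT".match(/cat/gi);

// Greedy vs lazy on the same input.
const greedy = "aaa".match(/a+/)[0];
const lazy = "aaa".match(/a+?/)[0];

console.log(from, to, hasDoubledWord, swapped, allDigits, pixels, hits, greedy, lazy);
```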
Common gotchas
- Forgetting to escape metacharacters. . matches any character (except a newline by default); to match a literal dot, write \..
- Greedy by default. <.*> on <a><b> matches the whole <a><b>, not just <a>. Use <.*?> (lazy) or a character class <[^>]*>.
- No g flag, only first match. In JavaScript, "aaa".match(/a/) returns one match; with the g flag it returns all three.
- Anchors and multiline. ^ matches start-of-string, not start-of-line, unless the m flag is set.
- . doesn’t cross newlines unless you set the s (dotall) flag.
- Catastrophic backtracking. Patterns like (a+)+b can take exponential time on input that doesn’t match. Watch for nested quantifiers.
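Each gotcha can be reproduced in a few lines of JavaScript. This is a sketch with made-up inputs; the last part only shows that the nested quantifier is unnecessary for short inputs and does not measure timing.

```javascript
// Unescaped dot: "1x5" slips through unless the dot is escaped.
const loose = /^1.5$/.test("1x5");
const strict = /^1\.5$/.test("1x5");

// Greedy vs lazy vs character class on "<a><b>".
const greedyTag = "<a><b>".match(/<.*>/)[0];
const lazyTag = "<a><b>".match(/<.*?>/)[0];
const classTag = "<a><b>".match(/<[^>]*>/)[0];

// Without g, match() stops at the first hit.
const first = "aaa".match(/a/);
const all = "aaa".match(/a/g);

// ^ only matches after a newline when the m flag is set.
const text = "one\ntwo";
const withoutM = /^two/.test(text);
const withM = /^two/m.test(text);

// . only matches a newline when the s flag is set.
const withoutS = /one.two/.test(text);
const withS = /one.two/s.test(text);

// (a+)+b and a+b accept the same strings; the flat form avoids nested quantifiers.
const samples = ["ab", "aaab", "b", "aaa"];
const sameResults = samples.every((s) => /(a+)+b/.test(s) === /a+b/.test(s));

console.log(loose, strict, greedyTag, lazyTag, classTag, first.length, all.length);
console.log(withoutM, withM, withoutS, withS, sameResults);
```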
When to reach for something else
If you find yourself writing a regex that:
- Has more than three nested groups
- Puts lookarounds inside lookarounds
- Runs longer than ~100 characters
- Tries to validate “real” emails (RFC 5321/5322 is genuinely too complex for regex)
…consider whether a parser, a dedicated library, or a different
approach is cleaner. Email validation in particular: the canonical
“email regex” is several thousand characters and still doesn’t cover
every legal address. Most production code accepts anything with an @
and lets the SMTP server confirm.
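A minimal sketch of that loose check in JavaScript follows. The function name looksLikeEmail is hypothetical, and the rule it encodes (no whitespace, at least one character on each side of an @) is an assumption about what "anything with an @" means in practice; it does not validate addresses against the RFCs.

```javascript
// Loose shape check only: non-whitespace text, an @, more non-whitespace text.
// Whether the address actually exists is left to a delivery attempt.
function looksLikeEmail(value) {
  return /^\S+@\S+$/.test(value);
}

console.log(looksLikeEmail("user@example.com")); // shape is acceptable
console.log(looksLikeEmail("no-at-sign"));       // rejected: no @
```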
Tools on this site
- Live tester — paste a pattern, see matches highlighted
- Find and replace — apply with capture-group substitution
- Cheat sheet — every token in one page
- Regex flavors — when JS regex isn’t enough
Related across the network
- guid.tooljo.com/validate — when a regex tells you “looks like a UUID,” confirm with a real validator.
- json.tooljo.com — for structured data, parse rather than regex-extract.
- regex.tooljo.com/blog/catastrophic-backtracking-patterns — patterns that turn O(n) into O(2^n) and have taken down production.