What is a regular expression?

Updated April 26, 2026 · 6 min read · by Samuel Yoo

A regular expression (regex, regexp) is a small language for describing text patterns. You write a pattern; a regex engine matches it against input strings. It is used for searching, validating, extracting, and transforming text.

If you’ve ever used Find/Replace in VS Code, written a grep command, or validated a form field client-side, you’ve probably touched regex.

In one example

Pattern:  \b\d{3}-\d{2}-\d{4}\b
Input:    "Call me at 555-12-3456 about the report."
Match:    "555-12-3456" (a US SSN-format number)

The pattern is a tiny program: “find a word boundary, then 3 digits, then a dash, then 2 digits, then a dash, then 4 digits, then a word boundary.” The regex engine compiles that into an internal matcher (a state machine, or a backtracking program in most modern flavors) and runs it against the input.
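
As a minimal sketch, here is the same pattern run in JavaScript (the flavor the tester on this site uses). The variable names are illustrative only.

```javascript
// Same pattern as above, written as a JavaScript regex literal.
const ssnShape = /\b\d{3}-\d{2}-\d{4}\b/;
const input = "Call me at 555-12-3456 about the report.";

// match() without the g flag returns the first match (or null).
const result = input.match(ssnShape);
const matchedText = result ? result[0] : null; // the matched substring
const matchedAt = result ? result.index : -1;  // offset where the match starts

console.log(matchedText, matchedAt);
```

The result also carries the offset of the match, which is what a tester uses to highlight it in the input.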

What it’s good at

Validating input format: “is this a valid email shape?” “does this look like a phone number?”
Searching unstructured text: log lines, source code, config files.
Extracting fields from semi-structured text: dates, IDs, URLs.
Bulk renaming and reformatting: shell scripts, IDE find-and-replace.
Tokenizing: splitting text on patterns more sophisticated than a single character.
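
A short sketch of two of these uses, extraction and tokenizing, in JavaScript. The log line and field names are made up for illustration.

```javascript
// Hypothetical log line, invented for this example.
const logLine = "2026-04-26 12:30:05 WARN user=4821 GET https://example.com/report";

// Extract an ISO-style date using named capture groups.
const dateMatch = logLine.match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
const { year, month, day } = dateMatch.groups;

// Extract the numeric ID that follows "user=".
const userId = logLine.match(/user=(\d+)/)[1];

// Tokenize: split on commas or semicolons, ignoring surrounding whitespace.
const tokens = "alpha, beta;gamma ;  delta".split(/\s*[,;]\s*/);

console.log(year, month, day, userId, tokens);
```

Real code would check that match() did not return null before reading groups; the sketch skips that because the input is fixed.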

What it’s bad at

Parsing nested structures. HTML, JSON, code — anything with arbitrary nesting requires a parser, not a regex. The classic Stack Overflow answer “You can’t parse HTML with regex” is the canonical statement.
Anything where context matters. If matching a word depends on what surrounded it three sentences ago, regex isn’t the right tool.
Complex grammars. Programming languages have proper grammars that parser generators handle; regex chokes on them.

A useful heuristic: if the input has matched delimiters (parens, brackets, tags) at arbitrary depth, you need a parser, not regex.
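
To make the heuristic concrete, here is a rough JavaScript sketch. The regex can find a group with nothing nested inside it, but checking balance at arbitrary depth needs a counter (or a parser). isBalanced is a hypothetical helper, not a library function.

```javascript
// Matches a parenthesized group only when nothing is nested inside it.
const innermost = /\([^()]*\)/;

// Hypothetical helper: tracks nesting depth, which a plain regex cannot do.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") {
      depth += 1;
    } else if (ch === ")") {
      depth -= 1;
      if (depth < 0) return false; // closing paren with no opener
    }
  }
  return depth === 0;
}

const nested = "f(a, g(b, h(c)))";
const innerMatch = nested.match(innermost)[0]; // only the deepest group
console.log(innerMatch, isBalanced(nested), isBalanced("f(a, g(b)"));
```

The regex sees only the deepest group; it has no way to pair each opener with its closer once groups nest.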

Where it came from

Regex traces back to Stephen Kleene’s 1956 paper on regular sets in formal language theory. The implementations we use today descend from Ken Thompson’s work: he built regex search into the QED editor and described the algorithm in a 1968 paper, then carried it into ed and, in the early 1970s, grep. That pattern engine shaped the regex support in the Unix tools that followed.

The “regex” you write in 2026 is several generations descended from that original. It includes features (lookaround, backreferences, recursion in some flavors) that go beyond the formal operators, and backreferences and recursion let a pattern match languages that aren’t “regular” in the language-theory sense, but the name stuck.

Flavors

There’s no single regex spec. Each language and tool implements its own flavor, with variations in:

Which metacharacters need escaping
Whether ^/$ match line boundaries by default
Support for lookaround, named groups, recursion
Unicode handling
Greedy / lazy / atomic / possessive quantifiers

The major flavors:

Flavor	Used in	Notable features
POSIX BRE	`grep` and `sed` by default (without `-E`)	Most basic; grouping and intervals are written `\(`, `\)`, `\{`, `\}`, and the unescaped characters are literals
POSIX ERE	`grep -E`, `sed -E`, `awk`	Adds `+`, `?`, `|`; unescaped parens and braces are operators
PCRE	`grep -P`, PHP, many editors	The “kitchen sink” — recursion, atomic groups, conditionals
JavaScript	browsers, Node	ECMAScript spec; close to PCRE but missing atomic groups and recursion
Python `re`	Python	PCRE-ish; named groups, but no recursion
Java	JDK	PCRE-style with explicit anchor variants (`\A`, `\Z`)
Go `regexp`	Go	RE2 engine — guaranteed linear time, no backreferences
.NET	C#, F#	Variable-length lookbehind, balancing groups

The tester on this site uses JavaScript flavor. See Regex flavors compared for a detailed table.
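
One way to see flavor differences from inside JavaScript is to try compiling syntax borrowed from another flavor. This is a sketch only; compiles is a hypothetical helper name, and the results depend on the engine version you run it on.

```javascript
// Hypothetical helper: returns true if this engine accepts the pattern syntax.
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected
  }
}

const support = {
  namedGroup: compiles("(?<year>\\d{4})"), // supported in modern JavaScript
  lookbehind: compiles("(?<=\\$)\\d+"),    // supported in recent engines
  atomicGroup: compiles("(?>a+)b"),        // PCRE syntax
  recursion: compiles("\\((?R)?\\)"),      // PCRE syntax
};

console.log(support);
```

On current JavaScript engines the two PCRE-only constructs fail to compile, which matches the table above.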

A vocabulary primer

You’ll hear these terms:

Pattern — the regex source string itself.
Match — a substring the pattern successfully matches.
Capture group — a parenthesized sub-pattern whose match is “remembered” for backreference or extraction. (\d+)-(\d+) has two capture groups.
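Backreference — a reference to a capture group’s match: typically \1 inside the pattern and $1 (or \1 in some tools) in a replacement string, both meaning the first group.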
Anchor — a position-only pattern: ^ (start), $ (end), \b (word boundary). Anchors don’t consume characters.
Lookaround — (?=...) / (?!...) etc. Match a position contingent on what’s around it, but don’t consume.
Flag — a modifier that changes how the engine interprets the pattern: g (global), i (ignore case), m (multiline).
Greedy / lazy — quantifier preference: greedy + matches as much as possible; lazy +? matches as little as possible.

Common gotchas

Forgetting to escape metacharacters. . matches any character; to match a literal dot, write \..
Greedy by default. <.*> on <a><b> matches the whole <a><b>, not just <a>. Use <.*?> (lazy) or a character class <[^>]*>.
No g flag, only first match. In JavaScript, "aaa".match(/a/) returns one match; with the g flag it returns all three.
Anchors and multiline. ^ matches start-of-string, not start-of-line, unless the m flag is set.
. doesn’t cross newlines. By default . matches any character except a line terminator; set the s (dotall) flag to let it match newlines too.
Catastrophic backtracking. Patterns like (a+)+b can take exponential time on input that doesn’t match. Watch for nested quantifiers.
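
Each gotcha above can be reproduced in a few lines of JavaScript. This is a sketch with made-up inputs; the last item only shows that the nested-quantifier pattern and a flat rewrite accept the same strings, and makes no timing claim.

```javascript
// 1. An unescaped dot matches any character.
const looseDot = /^1.5$/.test("125");    // "." is allowed to match "2"
const literalDot = /^1\.5$/.test("125"); // "\." only matches a real dot

// 2. Greedy vs lazy vs negated character class.
const tags = "<a><b>";
const greedyTag = tags.match(/<.*>/)[0];
const lazyTag = tags.match(/<.*?>/)[0];
const classTag = tags.match(/<[^>]*>/)[0];

// 3. Without g, match() stops at the first match.
const firstOnly = "aaa".match(/a/).length;
const allMatches = "aaa".match(/a/g).length;

// 4. ^ means start-of-string unless the m flag is set.
const lines = "first\nsecond";
const noMultiline = /^second/.test(lines);
const multiline = /^second/m.test(lines);

// 5. Dot stops at newlines unless the s flag is set.
const noDotAll = /first.second/.test(lines);
const dotAll = /first.second/s.test(lines);

// 6. A nested quantifier and a flat rewrite that accepts the same inputs.
const risky = /(a+)+b/;
const flat = /a+b/;
const sameAnswer = ["aab", "b", "aaac"].every((s) => risky.test(s) === flat.test(s));

console.log({ looseDot, literalDot, greedyTag, lazyTag, classTag, firstOnly, allMatches });
console.log({ noMultiline, multiline, noDotAll, dotAll, sameAnswer });
```

Flattening a nested quantifier like this removes the ambiguity that drives the backtracking, at the cost of losing the inner capture group.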

When to reach for something else

If you find yourself writing a regex with:

More than three nested groups
Multiple lookarounds inside lookarounds
A pattern longer than ~100 characters
A regex to validate “real” emails (RFC 5321/5322 is genuinely too complex for regex)

…consider whether a parser, a dedicated library, or a different approach is cleaner. Email validation in particular: the best-known RFC-based “email regex” runs to several thousand characters and still doesn’t cover every legal address. A common production approach is to accept anything with an @ and let the SMTP server confirm by actually delivering mail to the address.
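
A deliberately loose shape check along those lines might look like the sketch below. looksLikeEmail is a hypothetical helper, and the exact rule (one @, no whitespace, something on each side) is an assumption for illustration, not RFC validation.

```javascript
// Hypothetical helper: accepts "something@something" with no whitespace
// and exactly one "@". Real confirmation comes from delivering mail.
function looksLikeEmail(value) {
  return /^[^\s@]+@[^\s@]+$/.test(value);
}

console.log(looksLikeEmail("sam@example.com")); // plausible shape
console.log(looksLikeEmail("no-at-sign"));      // no "@" at all
console.log(looksLikeEmail("a@b@c"));           // more than one "@"
```

Keeping the pattern this small is the point: it rejects obvious typos and leaves the question of whether the mailbox exists to the mail system.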

Tools on this site

Live tester — paste a pattern, see matches highlighted
Find and replace — apply with capture-group substitution
Cheat sheet — every token in one page
Regex flavors — when JS regex isn’t enough

Related across the network

guid.tooljo.com/validate — when a regex tells you “looks like a UUID,” confirm with a real validator.
json.tooljo.com — for structured data, parse rather than regex-extract.
regex.tooljo.com/blog/catastrophic-backtracking-patterns — patterns that turn O(n) into O(2^n) and have taken down production.