1.2 Introduction to Regexes and Pattern Matching
A regular expression is a
string containing a combination of normal characters and special
metacharacters or metasequences. The normal characters match
themselves.
Metacharacters
and metasequences are characters or sequences
of characters that represent ideas such as quantity, locations, or
types of characters. The list in Section 1.2.1 shows the most common
metacharacters and metasequences in the regular expression world.
Later sections list the availability of and syntax for supported
metacharacters for particular implementations of regular expressions.
Pattern matching consists of finding a section of text
that is described (matched) by a regular expression. The underlying
code that searchs the text is the regular expression
engine.
You can guess the
results of most matches by keeping two rules in mind:
The earliest (leftmost) match wins Regular expressions are applied to the input starting at the first
character and proceeding toward the last. As soon as the regular
expression engine finds a match, it returns. (See MRE 148-149,
177-179.) Standard quantifiers are greedy Quantifiers
specify how many times something can be repeated. The standard
quantifiers attempt to match as many times as possible. They settle
for less than the maximum only if this is necessary for the success
of the match. The process of giving up characters and trying
less-greedy matches is called backtracking. (See MRE 151-153.)
Regular expression engines have subtle differences based on their
type. There are two classes of
engines: Deterministic Finite
Automaton (DFA) and Nondeterministic Finite
Automaton (NFA). DFAs are faster but lack many of the features of an
NFA, such as capturing, lookaround, and non-greedy quantifiers. In
the NFA world there are two types: Traditional and POSIX.
- DFA engines
-
DFAs compare each character of the input string to the regular
expression, keeping track of all matches in progress. Since each
character is examined at most once, the DFA engine is the fastest.
One additional rule to remember with DFAs is that the alternation
metasequence is greedy. When more than one option in an alternation
(foo|foobar) matches, the longest one is selected.
So, rule #1 can be amended to read "the longest
leftmost match wins." (See MRE 155-156.)
- Traditional NFA engines
-
Traditional NFA engines compare each element of the regex to the
input string, keeping track of positions where it chose between two
options in the regex. If an option fails, the engine backtracks to
the most recently saved position. For standard quantifiers, the
engine chooses the greedy option of matching more text; however, if
that option leads to the failure of the match, the engine returns to
a saved position and tries a less greedy path. The traditional NFA
engine uses ordered alternation, where each option in the alternation
is tried sequentially. A longer match may be ignored if an earlier
option leads to a successful match. So, rule #1 can be amended to
read "the first leftmost match after greedy
quantifiers have had their fill." (See MRE 153-154.)
- POSIX NFA engines
-
POSIX NFA Engines work similarly to Traditional NFAs with one
exception: a POSIX engine always picks the longest of the leftmost
matches. For example, the alternation cat|category
would match the full word
"category" whenever possible, even
if the first alternative ("cat")
matched and appeared earlier in the alternation. (See MRE 153-154.)
1.2.1 Regex Metacharacters, Modes, and Constructs
The
metacharacters
and
metasequences
shown here represent most available types of regular expression
constructs and their most common syntax. However, syntax and
availability vary by implementation.
1.2.1.1 Character representations
Many implementations provide shortcuts to represent some characters
that may be difficult to input. (See MRE 114-117.)
- Character shorthands
-
Most implementations have specific shorthands for the
alert, backspace,
escape character, form feed,
newline, carriage return,
horizontal tab, and vertical
tab characters. For example, \n is often
a shorthand for the newline character, which is usually LF (012
octal) but can sometimes be CR (15 octal) depending on the operating
system. Confusingly, many implementations use \b
to mean both backspace and word boundary (between
a "word" character and a non-word
character). For these implementations, \b means
backspace in a character class (a set of possible
characters to match in the string) and word boundary elsewhere.
- Octal escape: \num
-
Represents a character corresponding to a two- or three- octal digit
number. For example, \015\012 matches an ASCII
CR/LF sequence.
- Hex and Unicode escapes: \x num, \x{ num}, \u num, \U num
-
Represents a character corresponding to a hexadecimal number.
Four-digit and larger hex numbers can represent the range of Unicode
characters. For example, \x0D\x0A matches an ASCII
CR/LF sequence.
- Control characters: \c char
-
Corresponds to
ASCII control characters encoded with
values less than 32. To be safe, always use an uppercase
char—some implementations do not
handle lowercase representations. For example, \cH
matches Control-H, an ASCII backspace character.
1.2.1.2 Character classes and class-like constructs
Character classes are ways to define or specify a set of
characters. A character class matches one character in the input
string that is within the defined set. (See MRE 117-127.)
- Normal classes: [...] and [^...]
-
Character classes, [...], and
negated character classes,
[^...], allow you to list the characters that
you do or do not want to match. A character class always matches one
character. The - (dash) indicates a range of
characters. To include the dash in the list of characters, list it
first or escape it. For example, [a-z] matches any
lowercase ASCII letter.
- Almost any character: dot (.)
-
Usually matches any character except a newline. The match mode can
often be changed so that dot also matches newlines.
- Class shorthands: \w, \d, \s, \W, \D, \S
-
Commonly provided shorthands for digit, word character, and space
character classes. A word character is often
all ASCII alphanumeric characters plus the underscore. However, the
list of alphanumerics can include additional locale or Unicode
alphanumerics, depending on the implementation. For example,
\d matches a single digit character and is usually
equivalent to [0-9].
- POSIX character class: [ :alnum:]
-
POSIX defines several character classes that can be used only within
regular expression character classes (see Table 1-1). For example, [:lower:],
when written as [[:lower:]], is equivalent to
[a-z] in the ASCII locale.
- Unicode properties, scripts, and blocks: \p{ prop} , \P{ prop}
-
The Unicode standard defines classes of characters that have a
particular property, belong to a script, or exist within a block.
Properties are characteristics such as being a
letter or a number (see Table 1-2).
Scripts are systems of writing, such as Hebrew,
Latin, or Han. Blocks are ranges of characters
on the Unicode character map. Some implementations require that
Unicode properties be prefixed with Is or
In. For example, \p{Ll} matches
lowercase letters in any Unicode supported language, such as
a or a.
- Unicode combining character sequence: \X
-
Matches a Unicode base character followed by any number of Unicode
combining characters. This is a shorthand for
\P{M}\p{M}. For example, \X
matches è as well as the two characters
e'.
Table 1-1. POSIX character classes|
alnum
|
Letters and digits.
|
alpha
|
Letters.
|
blank
|
Space or tab only.
|
cntrl
|
Control characters.
|
digit
|
Decimal digits.
|
graph
|
Printing characters, excluding space.
|
lower
|
Lowercase letters.
|
print
|
Printing characters, including space.
|
punct
|
Printing characters, excluding letters and digits.
|
space
|
Whitespace.
|
upper
|
Uppercase letters.
|
xdigit
|
Hexadecimal digits.
|
Table 1-2. Standard Unicode properties (continued)|
\p{L}
|
Letters.
|
\p{Ll}
|
Lowercase letters.
|
\p{Lm}
|
Modifier letters.
|
\p{Lo}
|
Letters, other. These have no case and are not considered modifiers.
|
\p{Lt}
|
Titlecase letters.
|
\p{Lu}
|
Uppercase letters.
|
\p{C}
|
Control codes and characters not in other categories.
|
\p{Cc}
|
ASCII and Latin-1 control characters.
|
\p{Cf}
|
Non-visible formatting characters.
|
\p{Cn}
|
Unassigned code points.
|
\p{Co}
|
Private use, such as company logos.
|
\p{Cs}
|
Surrogates.
|
\p{M}
|
Marks meant to combine with base characters, such as accent marks.
|
\p{Mc}
|
Modification characters that take up their own space. Examples
include "vowel signs."
|
\p{Me}
|
Marks that enclose other characters, such as circles, squares, and
diamonds.
|
\p{Mn}
|
Characters that modify other characters, such as accents and umlauts.
|
\p{N}
|
Numeric characters.
|
\p{Nd}
|
Decimal digits in various scripts.
|
\p{Nl}
|
Letters that are numbers, such as Roman numerals.
|
\p{No}
|
Superscripts, symbols, or non-digit characters representing numbers.
|
\p{P}
|
Punctuation.
|
\p{Pc}
|
Connecting punctuation, such as an underscore.
|
\p{Pd}
|
Dashes and hyphens.
|
\p{Pe}
|
Closing punctuation complementing \p{Ps}.
|
\p{Pi}
|
Initial punctuation, such as opening quotes.
|
\p{Pf}
|
Final punctuation, such as closing quotes.
|
\p{Po}
|
Other punctuation marks.
|
\p{Ps}
|
Opening punctuation, such as opening parentheses.
|
\p{S}
|
Symbols.
|
\p{Sc}
|
Currency.
|
\p{Sk}
|
Combining characters represented as individual characters.
|
\p{Sm}
|
Math symbols.
|
\p{So}
|
Other symbols.
|
\p{Z}
|
Separating characters with no visual representation.
|
\p{Zl}
|
Line separators.
|
\p{Zp}
|
Paragraph separators.
|
\p{Zs}
|
Space characters.
|
1.2.1.3 Anchors and zero-width assertions
Anchors
and "zero-width assertions" match
positions in the input string. (See MRE 127-133.)
- Start of line/string: ^, \A
-
Matches at the beginning of the text being searched. In multiline
mode, ^ matches after any newline. Some
implementations support \A, which only matches at
the beginning of the text.
- End of line/string: $, \Z, \z
-
$ matches at the end of a string. Some
implementations also allow $ to match before a
string-ending newline. If modified by multiline mode,
$ matches before any newline as well. When
supported, \Z matches the end of string or before
a string-ending newline, regardless of match mode. Some
implementations also provide \z, which only
matches the end of the string, regardless of newlines.
- Start of match: \G
-
In iterative matching,
\G matches the position where the previous match
ended. Often, this spot is reset to the beginning of a string on a
failed match.
- Word boundary: \b, \B,
\<, \>
-
Word boundary metacharacters match a
location where a word character is next to a non-word character.
\b often specifies a word boundary location, and
\B often specifies a not-word-boundary location.
Some implementations provide separate metasequences for start- and
end-of-word boundaries, often \< and
\>.
- Lookahead: (?=...), (?!...)
- Lookbehind: (?<=...), (?<!...)
-
Lookaround
constructs match a location in the text where
the subpattern would match
(lookahead), would
not match (negative lookahead), would have finished matching
(lookbehind), or
would not have finished matching (negative lookbehind). For example,
foo(?=bar) matches foo in
foobar but not food.
Implementations often limit lookbehind constructs to subpatterns with
a predetermined length.
1.2.1.4 Comments and mode modifiers
Mode modifiers are a way to change how
the regular expression engine interprets a regular expression. (See
MRE 109-112, 133-135.)
- Multiline mode: m
-
Changes the behavior of ^ and $
to match next to newlines within the input string.
- Single-line mode: s
-
Changes the behavior of . (dot) to match all
characters, including newlines, within the input string.
- Case-insensitive mode: i
-
Treat as identical letters that differ only in case.
- Free-spacing mode: x
-
Allows for whitespace and comments within a regular expression. The
whitespace and comments (starting with # and
extending to the end of the line) are ignored by the regular
expression engine.
- Mode modifiers: (?i) , (?-i) , (? mod:...)
-
Usually, mode modifiers may be set within a regular expression with
(?mod)
to turn modes on for the rest of the current subexpression;
(?-mod)
to turn modes off for the rest of the current subexpression; and
(?mod:...)
to turn modes on or off between the colon and the closing
parentheses. For example, "use
(?i:perl)" matches "use
perl", "use Perl",
"use PeRl", etc.
- Comments: (?#...) and #
-
In free-spacing mode, # indicates that the rest of
the line is a comment. When supported, the comment span
(?#...) can be embedded anywhere in a regular
expression, regardless of mode. For example, .{0,80}(?#Field
limit is 80 chars) allows you to make
notes about why you wrote .{0,80}.
- Literal-text span: \Q...\E
-
Escapes metacharacters between \Q and
\E. For example, \Q(.*)\E is
the same as \(\.\*\).
1.2.1.5 Grouping, capturing, conditionals, and control
This section covers the syntax for grouping
subpatterns, capturing submatches,
conditional submatches, and quantifying the number of times a
subpattern matches. (See MRE 135-140.)
- Capturing and grouping parentheses: (...) and \1, \2, ...
-
Parentheses perform two functions:
grouping and capturing. Text matched by the subpattern within
parentheses is captured for later use. Capturing parentheses are
numbered by counting their opening parentheses from the left. If
backreferences are available, the submatch can be referred to later
in the same match with \1, \2,
etc. The captured text is made available after a match by
implementation-specific methods. For example,
\b(\w+)\b\s+\1\b matches duplicate words, such as
the the.
- Grouping-only parentheses: (?:...)
-
Groups a subexpression, possibly for alternation or quantifiers, but
does not capture the submatch. This is useful for efficiency and
reusability. For example, (?:foobar) matches
foobar, but does not save the match to a capture
group.
- Named capture: (?< name>...)
-
Performs capturing and grouping, with captured text later referenced
by name. For example,
Subject:(?<subject>.*) captures the text
following Subject: to a capture group that can be
referenced by the name subject.
- Atomic grouping: (?>...)
-
Text matched within the group is never backtracked into, even if this
leads to a match failure. For example,
(?>[ab]*)\w\w matches aabbcc
but not aabbaa.
- Alternation: ...|...
-
Allows several subexpressions to be tested.
Alternation's low precedence sometimes causes
subexpressions to be longer than intended, so use parentheses to
specifically group what you want alternated. For example,
\b(foo|bar)\b matches either of the words
foo or bar.
- Conditional: (? if then | else)
-
The if is implementation dependent, but
generally is a reference to a captured subexpression or a lookaround.
The then and
else parts are both regular expression
patterns. If the if part is true, the
then is applied. Otherwise,
else is applied. For example,
(<)?foo(?(1)>|bar) matches
<foo> and foobar.
- Greedy quantifiers: * , + , ? , { num,num }
-
The greedy quantifiers determine how many times a construct may be
applied. They attempt to match as many times as possible, but will
backtrack and give up matches if necessary for the success of the
overall match. For example, (ab)+ matches all of
ababababab.
- Lazy quantifiers: *? , +? , ?? , { num,num }?
-
Lazy quantifiers control how many times a construct may be applied.
However, unlike greedy quantifiers, they attempt to match as few
times as possible. For example, (an)+? matches
only an of banana.
- Possessive Quantifiers: *+ , ++ , ?+ , { num,num }+
-
Possessive quantifiers are like greedy quantifiers, except that they
"lock in" their match, disallowing
later backtracking to break up the sub-match. For example,
(ab)++ab will not match
ababababab.
1.2.2 Unicode Support
Unicode
is a character set that gives unique numbers to the characters in all
the world's languages. Because of the large number
of possible characters, Unicode requires more than one byte to
represent a character. Some regular expression implementations will
not understand Unicode characters, because they expect one-byte ASCII
characters. Basic support for Unicode characters starts with being
able to match a literal string of Unicode characters. Advanced
support includes character classes and other constructs that contain
characters from all Unicode-supported languages. For example,
\w might match è as
well as e.
|