1.3 Perl 5.8
Perl provides a rich set of
regular-expression operators, constructs, and features, with more
being added in each new release. Perl uses a Traditional NFA match
engine. For an explanation of the rules behind an NFA engine, see
Section 1.2.
This reference covers Perl Version 5.8. Unicode features were
introduced in 5.6, but did not stabilize until 5.8. Most other
features work in Versions 5.004 and later.
1.3.1 Supported Metacharacters
Perl
supports the metacharacters and metasequences listed in Table 1-3 through Table 1-7. For
expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-3. Character representations
\a | Alert (bell).
\b | Backspace; supported only in character class.
\e | ESC character, x1B.
\n | Newline; x0A on Unix and Windows, x0D on Mac OS 9.
\r | Carriage return; x0D on Unix and Windows, x0A on Mac OS 9.
\f | Form feed, x0C.
\t | Horizontal tab, x09.
\octal | Character specified by a two- or three-digit octal code.
\xhex | Character specified by a one- or two-digit hexadecimal code.
\x{hex} | Character specified by any hexadecimal code.
\cchar | Named control character.
\N{name} | A named character specified in the Unicode standard or listed in PATH_TO_PERLLIB/unicode/Names.txt. Requires use charnames ':full'.
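As a rough illustration of Table 1-3, the sketch below checks that several escapes denote the same character. The variable names are illustrative only, and the example assumes a Perl recent enough to know the Unicode name WHITE SMILING FACE.

```perl
use strict;
use warnings;
use charnames ':full';   # required for \N{name} on Perl 5.8

# Each flag is 1 if the escape matches the expected character, else 0.
my $tab_hex   = ("\t"       =~ /\A\x09\z/)                   ? 1 : 0;
my $tab_octal = ("\t"       =~ /\A\011\z/)                   ? 1 : 0;
my $tab_ctrl  = ("\t"       =~ /\A\cI\z/)                    ? 1 : 0;
my $escape    = ("\x1B"     =~ /\A\e\z/)                     ? 1 : 0;
my $backspace = ("\x08"     =~ /\A[\b]\z/)                   ? 1 : 0;  # \b in a class
my $smiley    = ("\x{263A}" =~ /\A\N{WHITE SMILING FACE}\z/) ? 1 : 0;

print "hex=$tab_hex octal=$tab_octal ctrl=$tab_ctrl ",
      "esc=$escape bs=$backspace named=$smiley\n";
```

Outside a character class, \b is the word-boundary anchor, which is why the backspace test wraps it in brackets.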
Table 1-4. Character classes and class-like constructs
[...] | A single character listed or contained in a listed range.
[^...] | A single character not listed and not contained within a listed range.
[:class:] | POSIX-style character class valid only within a regex character class.
. | Any character except newline (unless single-line mode, /s).
\C | One byte; however, this may corrupt a Unicode character stream.
\X | Base character followed by any number of Unicode combining characters.
\w | Word character, \p{IsWord}.
\W | Non-word character, \P{IsWord}.
\d | Digit character, \p{IsDigit}.
\D | Non-digit character, \P{IsDigit}.
\s | Whitespace character, \p{IsSpace}.
\S | Non-whitespace character, \P{IsSpace}.
\p{prop} | Character contained by given Unicode property, script, or block.
\P{prop} | Character not contained by given Unicode property, script, or block.
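The following sketch exercises a few of the constructs in Table 1-4 on a made-up input line. The sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $line = "id=42 name=Ada_L status=ok";

# \d+ collects runs of digits; \w+ collects runs of word characters.
my @numbers = $line =~ /(\d+)/g;
my @values  = $line =~ /=(\w+)/g;

# A POSIX class is valid only inside a bracketed character class.
my @upper = $line =~ /([[:upper:]])/g;

# A negated class matches any single character that is not listed.
(my $compact = $line) =~ s/[^\w=]+/,/g;

# \p{prop} matches by Unicode property; \p{Lu} is an uppercase letter.
my $starts_upper = ("Ada" =~ /\A\p{Lu}/) ? 1 : 0;

print "numbers: @numbers\n";
print "values:  @values\n";
print "upper:   @upper\n";
print "compact: $compact\n";
print "starts_upper: $starts_upper\n";
```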
Table 1-5. Anchors and zero-width tests
^ | Start of string, or after any newline in multiline match mode, /m.
\A | Start of search string, in all match modes.
$ | End of search string or before a string-ending newline, or before any newline in multiline match mode, /m.
\Z | End of string or before a string-ending newline, in any match mode.
\z | End of string, in any match mode.
\G | Beginning of current search.
\b | Word boundary.
\B | Not-word-boundary.
(?=...) | Positive lookahead.
(?!...) | Negative lookahead.
(?<=...) | Positive lookbehind; fixed-length only.
(?<!...) | Negative lookbehind; fixed-length only.
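A small sketch of how several anchors in Table 1-5 behave, using invented sample strings. It contrasts ^ with and without /m, \Z with \z, and shows one lookbehind and one negative lookahead.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# \Z tolerates a string-ending newline; \z does not.
my $big_z   = ($text =~ /line\Z/) ? 1 : 0;
my $small_z = ($text =~ /line\z/) ? 1 : 0;

# Lookbehind: digits that directly follow a dollar sign.
my @amounts = 'cost: $42, tax: $3' =~ /(?<=\$)(\d+)/g;

# Negative lookahead: words starting with "foo" not followed by "bar".
my @not_bar = "foobar foobaz" =~ /(foo(?!bar)\w+)/g;

print "plain: @plain\nmulti: @multi\n";
print "big_z=$big_z small_z=$small_z\n";
print "amounts: @amounts\nnot_bar: @not_bar\n";
```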
Table 1-6. Comments and mode modifiers
/i | Case-insensitive matching.
/m | ^ and $ match next to embedded \n.
/s | Dot (.) matches newline.
/x | Ignore whitespace and allow comments (#) in pattern.
/o | Compile pattern only once.
(?mode) | Turn listed modes (xsmi) on for the rest of the subexpression.
(?-mode) | Turn listed modes (xsmi) off for the rest of the subexpression.
(?mode:...) | Turn listed modes (xsmi) on within parentheses.
(?-mode:...) | Turn listed modes (xsmi) off within parentheses.
(?#...) | Treat substring as a comment.
#... | Treat rest of line as a comment in /x mode.
\u | Force next character to uppercase.
\l | Force next character to lowercase.
\U | Force all following characters to uppercase.
\L | Force all following characters to lowercase.
\Q | Quote all following regex metacharacters.
\E | End a span started with \U, \L, or \Q.
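The sketch below tries out a few entries from Table 1-6: inline mode modifiers, /x comments, \Q...\E quoting, and the case-forcing escapes. All strings and variable names are invented for the example.

```perl
use strict;
use warnings;

# (?i) applies from that point on; (?i:...) is confined to its group.
my $whole = ("PERL rocks" =~ /(?i)perl rocks/)  ? 1 : 0;
my $group = ("PERL ROCKS" =~ /(?i:perl) rocks/) ? 1 : 0;  # " rocks" stays case-sensitive

# /x allows whitespace and comments inside the pattern.
my ($h, $m) = "12:45" =~ /
    (\d\d)   # hours
    :
    (\d\d)   # minutes
/x;

# \Q...\E quotes metacharacters in interpolated text.
my $literal  = "a.b";
my $quoted   = ("axb" =~ /\A\Q$literal\E\z/) ? 1 : 0;  # dot is literal
my $unquoted = ("axb" =~ /\A$literal\z/)     ? 1 : 0;  # dot is a wildcard

# \u affects one character; \U affects everything up to \E.
(my $title = "ada lovelace") =~ s/(\w+)/\u$1/g;
(my $shout = "ada lovelace") =~ s/(\w+)/\U$1\E!/;

print "whole=$whole group=$group h=$h m=$m\n";
print "quoted=$quoted unquoted=$unquoted\n";
print "$title\n$shout\n";
```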
Table 1-7. Grouping, capturing, conditional, and control
(...) | Group subpattern and capture submatch into \1, \2, ... and $1, $2, ....
\n | Contains text matched by the nth capture group.
(?:...) | Groups subpattern, but does not capture submatch.
(?>...) | Disallow backtracking for text matched by subpattern.
...|... | Try subpatterns in alternation.
* | Match 0 or more times.
+ | Match 1 or more times.
? | Match 1 or 0 times.
{n} | Match exactly n times.
{n,} | Match at least n times.
{x,y} | Match at least x times but no more than y times.
*? | Match 0 or more times, but as few times as possible.
+? | Match 1 or more times, but as few times as possible.
?? | Match 0 or 1 time, but as few times as possible.
{n,}? | Match at least n times, but as few times as possible.
{x,y}? | Match at least x times, no more than y times, but as few times as possible.
(?(COND)...|...) | Match with if-then-else pattern, where COND is either an integer referring to a backreference or a lookaround assertion.
(?(COND)...) | Match with if-then pattern.
(?{CODE}) | Execute embedded Perl code.
(??{CODE}) | Match regex from embedded Perl code.
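To make some of the entries in Table 1-7 concrete, here is a hedged sketch comparing greedy and lazy quantifiers, a backreference, atomic grouping, and an if-then conditional. The inputs are invented for illustration.

```perl
use strict;
use warnings;

my $html = "<b>one</b> and <b>two</b>";

# Greedy .* runs to the last </b>; lazy .*? stops at the first.
my ($greedy) = $html =~ m{<b>(.*)</b>};
my ($lazy)   = $html =~ m{<b>(.*?)</b>};

# {x,y} takes as many as allowed; {x,y}? takes as few as allowed.
my ($most)  = "aaaa" =~ /\A(a{2,3})/;
my ($least) = "aaaa" =~ /\A(a{2,3}?)/;

# \1 must repeat the text captured by the first group.
my ($doubled) = "this is is a test" =~ /\b(\w+) \1\b/;

# (?>...) refuses to give back what it matched, so the final "a" fails.
my $plain  = ("aaa" =~ /a+a/)     ? 1 : 0;
my $atomic = ("aaa" =~ /(?>a+)a/) ? 1 : 0;

# (?(1)...) requires a closing paren only if group 1 matched an opening one.
my $cond     = qr/\A(\()?\d+(?(1)\))\z/;
my @balanced = grep { $_ =~ $cond } ("42", "(42)", "(42", "42)");

print "greedy: $greedy\nlazy: $lazy\n";
print "most=$most least=$least doubled=$doubled\n";
print "plain=$plain atomic=$atomic\n";
print "balanced: @balanced\n";
```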
1.3.2 Regular Expression Operators
Perl provides the built-in regular expression operators
qr//, m//, and
s///, as well as the split
function. Each operator accepts a regular expression pattern string
that is run through string and variable interpolation and then
compiled.
Regular expressions are often delimited with the forward slash, but
you can pick any non-alphanumeric, non-whitespace character. Here are
some examples:
qr#...# m!...! m{...}
s|...|...| s[...][...] s<...>/.../
A match delimited by slashes (/.../)
doesn't require a leading m:
/.../ #same as m/.../
Using the single quote as a delimiter suppresses
interpolation of variables and the constructs
\N{name},
\u, \l, \U,
\L, \Q, \E.
Normally these are interpolated before being passed to the regular
expression engine.
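A brief sketch of the delimiter rules described above, using made-up values. It shows braces avoiding escaped slashes, and the difference between ordinary and single-quote delimiters when the pattern contains a variable.

```perl
use strict;
use warnings;

my $word = "cat";

# Braces as delimiters avoid escaping the slashes in a path.
my $is_path = ("/usr/local/bin" =~ m{^/usr/local/}) ? 1 : 0;

# Ordinary delimiters: $word is interpolated, so the pattern is "cat".
my $interpolated = ("cat" =~ m/$word/) ? 1 : 0;

# Single-quote delimiters: $word is not interpolated. The engine sees
# a $ anchor followed by the letters "word", which cannot match "cat".
my $raw = ("cat" =~ m'$word') ? 1 : 0;

print "is_path=$is_path interpolated=$interpolated raw=$raw\n";
```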
qr/PATTERN/ismxo
Quote and compile PATTERN as a regular
expression. The returned value may be used in a later pattern match
or substitution. This saves time if the regular expression is going
to be repeatedly interpolated. The match modes (or lack thereof),
/ismxo, are locked in.
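As a sketch of how a qr// value might be reused, the example below compiles a case-insensitive pattern once and then applies it on its own and inside a larger pattern. The names and sample strings are illustrative.

```perl
use strict;
use warnings;

# Compile once, with /i locked in.
my $word = qr/perl/i;

my @lines = ("Perl 5.8", "python", "I like PERL");
my @hits  = grep { $_ =~ $word } @lines;

# Interpolated into a larger pattern, the locked-in /i covers only the
# qr// part; the rest of the pattern keeps its own modes.
my $versioned = ("PERL 5.8" =~ /\A$word \d\.\d\z/) ? 1 : 0;
my $strict_v  = ("PERL V5"  =~ /\A$word v\d\z/)    ? 1 : 0;  # "v" is case-sensitive

print scalar(@hits), " hits\n";
print "versioned=$versioned strict_v=$strict_v\n";
```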
m/PATTERN/imsxocg
Match PATTERN against input string. In
list context, returns a list of substrings matched by capturing
parentheses, or else (1) for a successful match or
( ) for a failed match. In scalar context, returns
1 for success or "" for
failure. /imsxo are optional mode modifiers.
/cg are optional match modifiers.
/g in scalar context causes the match to start
from the end of the previous match. In list context, a
/g match returns all matches or all captured
substrings from all matches. A failed /g match
will reset the match start to the beginning of the string unless the
match is in combined /cg mode.
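The context rules above can be sketched as follows. The input strings and the tiny tokenizer are assumptions made for illustration, not a recommended lexer design.

```perl
use strict;
use warnings;

my $input = "a=1, b=2, c=3";

# List context without /g: the captures of the first match.
my ($key, $value) = $input =~ /(\w)=(\d)/;

# List context with /g: all captures from all matches.
my @pairs = $input =~ /(\w)=(\d)/g;

# Scalar context with /g: each match resumes where the last one ended.
my @keys;
while ($input =~ /(\w)=/g) {
    push @keys, $1;
}

# /c keeps the position after a failed /g match, so another pattern
# can be tried at the same offset with \G.
my $src = "12ab";
my @tokens;
pos($src) = 0;
while (pos($src) < length $src) {
    if    ($src =~ /\G(\d+)/gc)    { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([a-z]+)/gc) { push @tokens, "WORD:$1" }
    else                           { last }
}

print "first: $key=$value\n";
print "pairs: @pairs\n";
print "keys: @keys\n";
print "tokens: @tokens\n";
```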
s/PATTERN/REPLACEMENT/egimosx
Match PATTERN in the input string and
replace the match text with REPLACEMENT,
returning the number of successes. /imosx are
optional mode modifiers. /g substitutes all
occurrences of PATTERN. Each
/e causes an evaluation of
REPLACEMENT as Perl code.
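A minimal sketch of s/// with the /g and /e modifiers, on an invented sentence. It shows the returned count and a replacement computed by Perl code.

```perl
use strict;
use warnings;

my $text = "2 apples and 3 pears";

# s///g returns the number of substitutions it made.
my $copy  = $text;
my $count = ($copy =~ s/(\d+)/<$1>/g);

# With /e, the replacement is evaluated as Perl code.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

print "count=$count copy=$copy\n";
print "doubled=$doubled\n";
```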
split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split
Return a list of substrings surrounding matches of
PATTERN in
EXPR. If LIMIT is given and positive,
the list contains at most LIMIT substrings, and the last one
holds the unsplit remainder of EXPR. The pattern argument is a
match operator, so use m if you want alternate
delimiters (e.g., split
m{PATTERN}).
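The forms of split listed above might be used as in the following sketch. The sample record and path are made up.

```perl
use strict;
use warnings;

my $record = "alice:x:1000:1000:Alice Example";

# No LIMIT: split at every match.
my @all = split /:/, $record;

# LIMIT of 3: at most three fields; the last keeps the remainder.
my @first = split /:/, $record, 3;

# Alternate delimiters need an explicit m.
my @parts = split m{/}, "usr/local/bin";

# Bare split works on $_ and splits on whitespace, ignoring
# leading whitespace.
my @words;
for ("  the quick  fox ") {
    @words = split;
}

print scalar(@all), " fields\n";
print "last of first: $first[2]\n";
print "parts: @parts\n";
print "words: @words\n";
```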
The match permits the same modifiers as m{}. Table 1-8 lists the after-match variables.
Table 1-8. After-match variables
$1, $2, ... | Captured submatches.
@- | $-[0] is the offset of the start of the match; $-[n] is the offset of the start of $n.
@+ | $+[0] is the offset of the end of the match; $+[n] is the offset of the end of $n.
$+ | Last parenthesized match.
$` | Text before match. Causes all regular expressions to be slower. Same as substr($input, 0, $-[0]).
$& | Text of match. Causes all regular expressions to be slower. Same as substr($input, $-[0], $+[0] - $-[0]).
$' | Text after match. Causes all regular expressions to be slower. Same as substr($input, $+[0]).
$^N | Text of most recently closed capturing parentheses.
$* | If true, /m is assumed for all matches without a /s.
$^R | The result value of the most recently executed code construct within a pattern match.
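The substr equivalents in Table 1-8 can be tried out as below. This sketch uses @- and @+ rather than the slower variables for the text before, within, and after the match; the input string is invented.

```perl
use strict;
use warnings;

my $input = "say key=value now";
my ($whole, $group2, $before, $after, $last_paren);

if ($input =~ /(\w+)=(\w+)/) {
    # @- and @+ hold the start and end offsets of the match and groups.
    $whole      = substr($input, $-[0], $+[0] - $-[0]);  # text of match
    $group2     = substr($input, $-[2], $+[2] - $-[2]);  # same text as $2
    $before     = substr($input, 0, $-[0]);              # text before match
    $after      = substr($input, $+[0]);                 # text after match
    $last_paren = $+;                                    # last parenthesized match
}

print "whole=[$whole] group2=[$group2]\n";
print "before=[$before] after=[$after] last=[$last_paren]\n";
```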
1.3.3 Unicode Support
Perl
provides built-in support for Unicode 3.2, including full support in
the \w, \d,
\s, and \b metasequences.
The following constructs respect the current locale if
use locale is in effect:
case-insensitive (i) mode, \L,
\l, \U, \u,
\w, and \W.
Perl supports the standard Unicode properties (see Table 1-3) as well as Perl-specific composite
properties (see Table 1-9). Scripts and
properties may have an Is prefix but do not
require it. Blocks require an In prefix only if
the block name conflicts with a script name.
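A short sketch of property matching, assuming a test string built from escapes so that the source file encoding does not matter. It counts matches rather than printing the non-ASCII characters.

```perl
use strict;
use warnings;

# GREEK CAPITAL LETTER ALPHA, GREEK SMALL LETTER BETA, "c", "1"
my $text = "\x{0391}\x{03B2}c1";

my @upper    = $text =~ /(\p{Lu})/g;       # uppercase letters
my @letters  = $text =~ /(\p{L})/g;        # letters of any kind
my @greek    = $text =~ /(\p{Greek})/g;    # script name, no prefix
my @is_greek = $text =~ /(\p{IsGreek})/g;  # same script, with Is prefix
my @digits   = $text =~ /(\p{IsDigit})/g;  # composite property

print "upper=",    scalar(@upper),
      " letters=", scalar(@letters),
      " greek=",   scalar(@greek),
      " digits=@digits\n";
```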
Table 1-9. Composite Unicode properties
IsASCII | [\x00-\x7f]
IsAlnum | [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]
IsAlpha | [\p{Ll}\p{Lu}\p{Lt}\p{Lo}]
IsCntrl | \p{C}
IsDigit | \p{Nd}
IsGraph | [^\p{C}\p{Space}]
IsLower | \p{Ll}
IsPrint | \P{C}
IsPunct | \p{P}
IsSpace | [\t\n\f\r\p{Z}]
IsUpper | [\p{Lu}\p{Lt}]
IsWord | [_\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]
IsXDigit | [0-9a-fA-F]
1.3.4 Examples
Example 1-1. Simple match
# Match Spider-Man, Spiderman, SPIDER-MAN, etc.
my $dailybugle = "Spider-Man Menaces City!";
if ($dailybugle =~ m/spider[- ]?man/i) { do_something( ); }
Example 1-2. Match, capture group, and qr
# Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
my $date = "12/30/1969";
my $regex = qr!(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)!;
if ($date =~ m/$regex/) {
    # The format is MM/DD/YYYY, so $1 is the month and $2 is the day.
    print "Month= ", $1, "\n",
          "Day=   ", $2, "\n",
          "Year=  ", $3, "\n";
}
Example 1-3. Simple substitution
# Convert <br> to <br /> for XHTML compliance
my $text = "Hello World! <br>";
$text =~ s#<br>#<br />#ig;
Example 1-4. Harder substitution
# urlify - turn URLs into HTML links
$text = "Check the website, http://www.oreilly.com/catalog/repr.";
$text =~
    s{
      \b                          # start at word boundary
      (                           # capture to $1
        (https?|telnet|gopher|file|wais|ftp) :
                                  # resource and colon
        [\w/#~:.?+=&%@!\-] +?     # one or more valid
                                  # characters
                                  # but take as little as
                                  # possible
      )
      (?=                         # lookahead
        [.:?\-] *                 # for possible punctuation
        (?: [^\w/#~:.?+=&%@!\-]   # invalid character
          | $ )                   # or end of string
      )
    }{<a href="$1">$1</a>}igox;
1.3.5 Other Resources
Programming Perl, by Larry Wall, Tom
Christiansen, and Jon Orwant (O'Reilly), is the
standard Perl reference. Mastering Regular Expressions, Second Edition,
by Jeffrey E. F. Friedl (O'Reilly), covers the
details of Perl regular expressions on pages 283-364. perlre is the perldoc documentation provided
with most Perl distributions.