1.3 Perl 5.8

Perl provides a rich set of regular-expression operators, constructs, and features, with more being added in each new release. Perl uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.

This reference covers Perl Version 5.8. Unicode features were introduced in 5.6, but did not stabilize until 5.8. Most other features work in Versions 5.004 and later.

1.3.1 Supported Metacharacters

Perl supports the metacharacters and metasequences listed in Table 1-3 through Table 1-7. For expanded definitions of each metacharacter, see Section 1.2.1.

Table 1-3. Character representations

Sequence    Meaning
\a          Alert (bell).
\b          Backspace; supported only in character class.
\e          ESC character, x1B.
\n          Newline; x0A on Unix and Windows, x0D on Mac OS 9.
\r          Carriage return; x0D on Unix and Windows, x0A on Mac OS 9.
\f          Form feed, x0C.
\t          Horizontal tab, x09.
\octal      Character specified by a two- or three-digit octal code.
\xhex       Character specified by a one- or two-digit hexadecimal code.
\x{hex}     Character specified by any hexadecimal code.
\cchar      Named control character.
\N{name}    A named character specified in the Unicode standard or listed in PATH_TO_PERLLIB/unicode/Names.txt. Requires use charnames ':full'.
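
As a rough illustration of Table 1-3, the sketch below checks several escapes against characters built with chr. The variable names are made up for this example, and the smiley code point is just a convenient character above x{FF}.

```perl
use strict;
use warnings;
use charnames ':full';    # needed on Perl 5.8 for \N{name}

# Each check pairs a known character with the escape that should match it.
my $tab   = ("\t"        =~ /^\x09$/)    ? 1 : 0;  # \xhex: tab is x09
my $esc   = (chr(27)     =~ /^\e$/)      ? 1 : 0;  # \e is ESC, x1B
my $octal = (chr(27)     =~ /^\033$/)    ? 1 : 0;  # \octal: 033 is also ESC
my $ctrl  = ("\t"        =~ /^\cI$/)     ? 1 : 0;  # \cchar: Control-I is tab
my $bs    = (chr(8)      =~ /^[\b]$/)    ? 1 : 0;  # \b is backspace only inside [...]
my $wide  = (chr(0x263A) =~ /^\x{263A}$/) ? 1 : 0; # \x{hex}: any code point
my $named = (chr(0x263A) =~ /^\N{WHITE SMILING FACE}$/) ? 1 : 0;  # \N{name}

print "tab=$tab esc=$esc octal=$octal ctrl=$ctrl bs=$bs wide=$wide named=$named\n";
```

Outside a character class, \b keeps its word-boundary meaning (see Table 1-5), which is why the backspace check wraps it in brackets.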

Table 1-4. Character classes and class-like constructs

Class       Meaning
[...]       A single character listed or contained in a listed range.
[^...]      A single character not listed and not contained within a listed range.
[:class:]   POSIX-style character class valid only within a regex character class.
.           Any character except newline (unless single-line mode, /s).
\C          One byte; however, this may corrupt a Unicode character stream.
\X          Base character followed by any number of Unicode combining characters.
\w          Word character, \p{IsWord}.
\W          Non-word character, \P{IsWord}.
\d          Digit character, \p{IsDigit}.
\D          Non-digit character, \P{IsDigit}.
\s          Whitespace character, \p{IsSpace}.
\S          Non-whitespace character, \P{IsSpace}.
\p{prop}    Character contained by given Unicode property, script, or block.
\P{prop}    Character not contained by given Unicode property, script, or block.
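
The following is a minimal sketch of several entries in Table 1-4. The sample strings and variable names are illustrative assumptions. The Greek letters are written with \x{hex} escapes so that the strings are treated as Unicode text.

```perl
use strict;
use warnings;

# Dot skips newline unless /s is in effect.
my $dot_default = ("a\nb" =~ /a.b/)  ? 1 : 0;
my $dot_single  = ("a\nb" =~ /a.b/s) ? 1 : 0;

# Listed and negated character classes.
my @vowels     = "regex" =~ /[aeiou]/g;
my @consonants = "regex" =~ /[^aeiou]/g;

# A POSIX class is valid only inside a bracketed class.
my ($port) = "port 8080" =~ /([[:digit:]]+)/;

# \w and \p{prop} on characters above x{FF} (Greek alpha, beta, capital delta).
my $word_ok  = ("\x{3B1}\x{3B2}" =~ /^\w+$/)   ? 1 : 0;
my $upper_ok = ("\x{394}"        =~ /^\p{Lu}$/) ? 1 : 0;

# \X treats a base character plus its combining mark as one unit.
my $cluster_ok = ("e\x{301}" =~ /^\X$/) ? 1 : 0;

print "vowels=@vowels consonants=@consonants port=$port\n";
```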

Table 1-5. Anchors and zero-width tests

Sequence    Meaning
^           Start of string, or after any newline in multiline match mode, /m.
\A          Start of search string, in all match modes.
$           End of search string or before a string-ending newline, or before any newline in multiline match mode, /m.
\Z          End of string or before a string-ending newline, in any match mode.
\z          End of string, in any match mode.
\G          Beginning of current search.
\b          Word boundary.
\B          Not-word-boundary.
(?=...)     Positive lookahead.
(?!...)     Negative lookahead.
(?<=...)    Positive lookbehind; fixed-length only.
(?<!...)    Negative lookbehind; fixed-length only.
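
To make the differences among the anchors in Table 1-5 concrete, here is a small sketch. The sample text and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $text = "first\nsecond\n";

# ^ matches after each newline under /m; \A matches only at the very start.
my @line_starts  = $text =~ /^(\w+)/mg;
my @string_start = $text =~ /\A(\w+)/mg;

# $ and \Z tolerate a string-ending newline; \z does not.
my $dollar  = ($text =~ /second$/)  ? 1 : 0;
my $big_z   = ($text =~ /second\Z/) ? 1 : 0;
my $small_z = ($text =~ /second\z/) ? 1 : 0;

# Lookarounds test the surrounding text without consuming it.
my ($price)  = "cost: 100USD" =~ /(\d+)(?=USD)/;
my ($amount) = "total \$250"  =~ /(?<=\$)(\d+)/;

# \b keeps "cat" from matching inside "concat".
my @cats = "cat concat cat" =~ /\bcat\b/g;
my $whole_words = scalar @cats;

print "lines=@line_starts price=$price amount=$amount cats=$whole_words\n";
```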

Table 1-6. Comments and mode modifiers

Modifier       Meaning
/i             Case-insensitive matching.
/m             ^ and $ match next to embedded \n.
/s             Dot (.) matches newline.
/x             Ignore whitespace and allow comments (#) in pattern.
/o             Compile pattern only once.
(?mode)        Turn listed modes (xsmi) on for the rest of the subexpression.
(?-mode)       Turn listed modes (xsmi) off for the rest of the subexpression.
(?mode:...)    Turn listed modes (xsmi) on within parentheses.
(?-mode:...)   Turn listed modes (xsmi) off within parentheses.
(?#...)        Treat substring as a comment.
#...           Treat rest of line as a comment in /x mode.
\u             Force next character to uppercase.
\l             Force next character to lowercase.
\U             Force all following characters to uppercase.
\L             Force all following characters to lowercase.
\Q             Quote all following regex metacharacters.
\E             End a span started with \U, \L, or \Q.
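
A short sketch of some modifiers from Table 1-6 follows. The patterns, strings, and variable names are illustrative assumptions rather than canonical usage.

```perl
use strict;
use warnings;

# /x allows whitespace and comments inside the pattern.
my $version = qr/
    (\d+)    # major number
    \.       # literal dot
    (\d+)    # minor number
/x;
my ($major, $minor) = "perl 5.8" =~ $version;

# (?i:...) applies case-insensitivity only inside the group.
my $scoped      = ("Perl REGEX" =~ /Perl (?i:regex)/) ? 1 : 0;
my $scoped_miss = ("perl REGEX" =~ /Perl (?i:regex)/) ? 1 : 0;

# (?i) applies from that point to the end of the enclosing subexpression.
my $rest = ("say PERL" =~ /say (?i)perl/) ? 1 : 0;

# \Q...\E quotes metacharacters in an interpolated variable.
my $literal     = "a.b*c";
my $quoted_hit  = ("a.b*c"  =~ /^\Q$literal\E$/) ? 1 : 0;
my $quoted_miss = ("aXbbbc" =~ /^\Q$literal\E$/) ? 1 : 0;
my $unquoted    = ("aXbbbc" =~ /^$literal$/)     ? 1 : 0;

# \u and \L are string-interpolation constructs, usable in any quoted string.
my $name  = "sMITH";
my $fixed = "\u\L$name\E";

print "major=$major minor=$minor fixed=$fixed\n";
```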

Table 1-7. Grouping, capturing, conditional, and control

Sequence            Meaning
(...)               Group subpattern and capture submatch into \1, \2,... and $1, $2,....
\n                  Contains text matched by the nth capture group.
(?:...)             Groups subpattern, but does not capture submatch.
(?>...)             Disallow backtracking for text matched by subpattern.
...|...             Try subpatterns in alternation.
*                   Match 0 or more times.
+                   Match 1 or more times.
?                   Match 1 or 0 times.
{n}                 Match exactly n times.
{n,}                Match at least n times.
{x,y}               Match at least x times but no more than y times.
*?                  Match 0 or more times, but as few times as possible.
+?                  Match 1 or more times, but as few times as possible.
??                  Match 0 or 1 time, but as few times as possible.
{n,}?               Match at least n times, but as few times as possible.
{x,y}?              Match at least x times, no more than y times, but as few times as possible.
(?(COND)...|...)    Match with if-then-else pattern where COND is either an integer referring to a backreference or a lookaround assertion.
(?(COND)...)        Match with if-then pattern.
(?{CODE})           Execute embedded Perl code.
(??{CODE})          Match regex from embedded Perl code.
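
The sketch below exercises a few constructs from Table 1-7: greedy versus minimal quantifiers, a backreference, an atomic group, and a conditional. The sample strings and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy .+ runs to the last ">"; minimal .+? stops at the first one.
my ($greedy) = $html =~ /<(.+)>/;
my ($lazy)   = $html =~ /<(.+?)>/;

# \1 must repeat whatever group 1 captured (a doubled word here).
my ($dup) = "this is is a test" =~ /\b(\w+) \1\b/;

# (?>a+) keeps every "a" it matched, so the following "ab" can never match.
my $atomic = ("aaab" =~ /(?>a+)ab/) ? 1 : 0;
my $normal = ("aaab" =~ /a+ab/)     ? 1 : 0;

# (?(1)\)) requires a closing paren only if group 1 matched an opening one.
my $cond   = qr/^(\()?\d+(?(1)\))$/;
my $both   = ("(42)" =~ $cond) ? 1 : 0;
my $bare   = ("42"   =~ $cond) ? 1 : 0;
my $broken = ("(42"  =~ $cond) ? 1 : 0;

# {x,y} bounds the repeat count.
my $month_ok  = ("2024-1"   =~ /^\d{4}-\d{1,2}$/) ? 1 : 0;
my $month_bad = ("2024-123" =~ /^\d{4}-\d{1,2}$/) ? 1 : 0;

print "lazy=$lazy dup=$dup atomic=$atomic normal=$normal\n";
```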

1.3.2 Regular Expression Operators

Perl provides the built-in regular expression operators qr//, m//, and s///, as well as the split function. Each operator accepts a regular expression pattern string that is run through string and variable interpolation and then compiled.

Regular expressions are often delimited with the forward slash, but you can pick any non-alphanumeric, non-whitespace character. Here are some examples:

qr#...#       m!...!        m{...}
s|...|...|    s[...][...]   s<...>/.../

A match delimited by slashes (/.../) doesn't require a leading m:

/.../      #same as m/.../

Using the single quote as a delimiter suppresses interpolation of variables and the constructs \N{name}, \u, \l, \U, \L, \Q, \E. Normally these are interpolated before being passed to the regular expression engine.
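
As a minimal sketch of how delimiter choice affects a pattern, the example below compares slash, brace, and single-quote delimiters. The path and variable names are hypothetical.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# Alternate delimiters avoid escaping every slash in the pattern.
my $with_slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;
my $with_braces  = ($path =~ m{^/usr/local/})   ? 1 : 0;

# Copy the string, then substitute on the copy using ! as the delimiter.
(my $moved = $path) =~ s!/usr/local!/opt!;

# With normal delimiters $name is interpolated before the regex is compiled.
my $name   = "perl";
my $interp = ("perl" =~ m/^$name$/) ? 1 : 0;

# With single quotes nothing is interpolated: the engine sees ^ $ n a m e $,
# and literal text can never follow the end-of-string anchor.
my $no_interp = ("perl" =~ m'^$name$') ? 1 : 0;

print "moved=$moved interp=$interp no_interp=$no_interp\n";
```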

qr// (Quote Regex)

qr/PATTERN/ismxo

Quote and compile PATTERN as a regular expression. The returned value may be used in a later pattern match or substitution. This saves time if the regular expression is going to be repeatedly interpolated. The match modes (or lack thereof), /ismxo, are locked in.
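
One way to see qr// in use is the sketch below, which builds a larger pattern from a compiled piece and shows that a mode given to qr// stays attached to it. The names and strings are assumptions for the example.

```perl
use strict;
use warnings;

# Compile a fragment once, then interpolate it into a larger pattern.
my $word = qr/(\w+)/;
my $pair = qr/$word\s*=\s*$word/;
my ($key, $value) = "color = red" =~ $pair;

# The /i given to qr// applies to that fragment wherever it is interpolated.
my $ci = qr/perl/i;
my $inner_insensitive = ("I like PERL" =~ /like $ci/) ? 1 : 0;

# The enclosing pattern has no /i, so "like" itself is still case-sensitive.
my $outer_sensitive = ("I LIKE PERL" =~ /like $ci/) ? 1 : 0;

print "key=$key value=$value\n";
```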

m// (Matching)

m/PATTERN/imsxocg

Match PATTERN against input string. In list context, returns a list of the substrings matched by capturing parentheses; if the pattern has no capturing parentheses, returns (1) for a successful match. A failed match returns the empty list ( ). In scalar context, returns 1 for success or "" for failure. /imsxo are optional mode modifiers. /cg are optional match modifiers. /g in scalar context causes the match to start from the end of the previous match. In list context, a /g match returns all matches or all captured substrings from all matches. A failed /g match will reset the match start to the beginning of the string unless the match is in combined /cg mode.
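
The context rules for m// can be sketched as follows. The sample string and variable names are assumptions. pos() is the built-in that reports where the next /g match on a string will start.

```perl
use strict;
use warnings;

my $log = "a=1 b=22 c=333";

# List context without /g: the captures of the first match.
my ($key, $val) = $log =~ /(\w)=(\d+)/;

# List context with /g: captures from every match, or whole matches
# when the pattern has no capturing parentheses.
my @pairs = $log =~ /(\w)=(\d+)/g;
my @nums  = $log =~ /\d+/g;

# Scalar context with /g: each call resumes where the last match ended.
my @ends;
while ($log =~ /\d+/g) {
    push @ends, pos($log);
}

# A failed /g match resets the position, unless /c is also given.
my $hit    = $log =~ /b=/g;     # succeeds and leaves the position after "b="
my $miss_c = $log =~ /zzz/gc;   # fails, position kept
my $kept   = pos($log);
my $miss   = $log =~ /zzz/g;    # fails, position reset
my $reset  = defined(pos($log)) ? 0 : 1;

print "pairs=@pairs nums=@nums ends=@ends kept=$kept reset=$reset\n";
```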

s/// (Substitution)

s/PATTERN/REPLACEMENT/egimosx

Match PATTERN in the input string and replace the match text with REPLACEMENT, returning the number of successes. /imosx are optional mode modifiers. /g substitutes all occurrences of PATTERN. Each /e causes an evaluation of REPLACEMENT as Perl code.
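
A brief sketch of s/// follows, showing /g, /e, and the return value. The text and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "2 apples and 3 pears";

# /e evaluates the replacement as Perl code; /g applies it to every match.
# The parenthesized copy leaves $text itself untouched.
(my $doubled = $text) =~ s/(\d+)/$1 * 2/ge;

# The return value is the number of substitutions made.
my $count = ($text =~ s/\d+/N/g);

# When nothing matches, the return value is false and the string is unchanged.
my $none = ($text =~ s/zzz//);

print "doubled=$doubled\n";
print "text=$text count=$count\n";
```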

split

split /PATTERN/, EXPR, LIMIT
split /PATTERN/, EXPR
split /PATTERN/
split

Return a list of substrings surrounding matches of PATTERN in EXPR. If LIMIT is given and positive, the list contains at most LIMIT substrings, so EXPR is split at no more than the first LIMIT - 1 matches. If EXPR is omitted, $_ is split; if PATTERN is also omitted, $_ is split on whitespace. The pattern argument is a match operator, so use m if you want alternate delimiters (e.g., split m{PATTERN}). The match permits the same modifiers as m{}. Table 1-8 lists the after-match variables.
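
The forms of split listed above can be sketched like this. The input strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $csv = "a,b,c,d";

# Without LIMIT, every match of the pattern separates two fields.
my @all = split /,/, $csv;

# With LIMIT 2, at most two fields come back; the rest stays unsplit.
my @two = split /,/, $csv, 2;

# Use m with alternate delimiters when the pattern contains slashes.
my @dirs = split m{\s*/\s*}, "usr / local / bin";

# Capturing parentheses in the pattern return the separators as well.
my @with_sep = split /(,)/, "a,b";

# Bare split works on $_ and splits on whitespace, skipping leading blanks.
my $line = "  leading and trailing  ";
my @words;
for ($line) {
    @words = split;
}

print "two=[", join("|", @two), "] words=[", join("|", @words), "]\n";
```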

Table 1-8. After-match variables

Variable       Meaning
$1, $2, ...    Captured submatches.
@-             $-[0] offset of start of match. $-[n] offset of start of $n.
@+             $+[0] offset of end of match. $+[n] offset of end of $n.
$+             Last parenthesized match.
$`             Text before match. Causes all regular expressions to be slower. Same as substr($input, 0, $-[0]).
$&             Text of match. Causes all regular expressions to be slower. Same as substr($input, $-[0], $+[0] - $-[0]).
$'             Text after match. Causes all regular expressions to be slower. Same as substr($input, $+[0]).
$^N            Text of most recently closed capturing parentheses.
$*             If true, /m is assumed for all matches without a /s.
$^R            The result value of the most recently executed code construct within a pattern match.
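
The substr equivalents given in Table 1-8 can be tried directly. The sketch below uses the @- and @+ offset arrays in place of the slower pre-match, match, and post-match variables. The input string and variable names are assumptions.

```perl
use strict;
use warnings;

my $input = "id: 12345;";
my ($start, $end, $before, $match, $after);

if ($input =~ /(\d+)/) {
    ($start, $end) = ($-[0], $+[0]);
    $before = substr($input, 0, $-[0]);                # text before the match
    $match  = substr($input, $-[0], $+[0] - $-[0]);    # text of the match
    $after  = substr($input, $+[0]);                   # text after the match
}

# $+ is the highest-numbered group that matched; $^N is the group whose
# closing parenthesis was reached most recently (the outer group here).
my ($last_paren, $last_closed);
if ("ab" =~ /((a)(b))/) {
    ($last_paren, $last_closed) = ($+, $^N);
}

print "start=$start end=$end before=[$before] match=[$match] after=[$after]\n";
print "last_paren=$last_paren last_closed=$last_closed\n";
```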

1.3.3 Unicode Support

Perl provides built-in support for Unicode 3.2, including full support in the \w, \d, \s, and \b metasequences.

The following constructs respect the current locale if use locale is defined: case-insensitive (i) mode, \L, \l, \U, \u, \w, and \W.

Perl supports the standard Unicode properties (see Table 1-3) as well as Perl-specific composite properties (see Table 1-9). Scripts and properties may have an Is prefix but do not require it. Blocks require an In prefix only if the block name conflicts with a script name.

Table 1-9. Composite Unicode properties

Property    Equivalent
IsASCII     [\x00-\x7f]
IsAlnum     [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]
IsAlpha     [\p{Ll}\p{Lu}\p{Lt}\p{Lo}]
IsCntrl     \p{C}
IsDigit     \p{Nd}
IsGraph     [^\p{C}\p{Space}]
IsLower     \p{Ll}
IsPrint     \P{C}
IsPunct     \p{P}
IsSpace     [\t\n\f\r\p{Z}]
IsUpper     [\p{Lu}\p{Lt}]
IsWord      [_\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]
IsXDigit    [0-9a-fA-F]
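
As a rough sketch of the property syntax, the example below matches by script, by general category, and by the composite IsDigit property. The code points (Greek letters and Arabic-Indic digits) and the variable names are assumptions chosen for illustration. Exact property definitions have shifted somewhat across Perl releases, so results for edge-case characters may differ from Table 1-9.

```perl
use strict;
use warnings;

# Greek alpha, beta, gamma surrounded by ASCII text.
my $mixed = "abc \x{3B1}\x{3B2}\x{3B3} 123";

# \p{Greek} matches by script name.
my ($greek) = $mixed =~ /(\p{Greek}+)/;
my $greek_len = length $greek;

# Arabic-Indic digits one, two, three.
my $arabic = "\x{661}\x{662}\x{663}";
my $is_digit   = ($arabic =~ /^\p{IsDigit}+$/) ? 1 : 0;  # composite property
my $nd         = ($arabic =~ /^\p{Nd}+$/)      ? 1 : 0;  # its listed equivalent
my $ascii_only = ($arabic =~ /^[0-9]+$/)       ? 1 : 0;  # an ASCII range misses them

# \p{Lu} picks out uppercase letters; \P{L} is the negated form of \p{L}.
my @caps        = "Perl And Unicode" =~ /(\p{Lu})/g;
my @non_letters = "a1b2" =~ /(\P{L})/g;

print "greek_len=$greek_len caps=@caps non_letters=@non_letters\n";
```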

1.3.4 Examples

Example 1-1. Simple match

# Match Spider-Man, Spiderman, SPIDER-MAN, etc.
my $dailybugle = "Spider-Man Menaces City!";
if ($dailybugle =~ m/spider[- ]?man/i) { do_something(); }

Example 1-2. Match, capture group, and qr

# Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
my $date  = "12/30/1969";
my $regex = qr!(\d\d)[-/](\d\d)[-/](\d\d(?:\d\d)?)!;
if ($date =~ m/$regex/) {
  print "Month=", $1, "\n",
        "Day=  ", $2, "\n",
        "Year= ", $3, "\n";
}

Example 1-3. Simple substitution

# Convert <br> to <br /> for XHTML compliance
my $text = "Hello World! <br>";
$text =~ s#<br>#<br />#ig;

Example 1-4. Harder substitution

# urlify - turn URLs into HTML links
$text = "Check the website, http://www.oreilly.com/catalog/repr.";
$text =~ 
    s{
      \b                         # start at word boundary
      (                          # capture to $1
       (https?|telnet|gopher|file|wais|ftp) : 
                                 # resource and colon
       [\w/#~:.?+=&%@!\-] +?     # one or more valid
                                 # characters 
                                 # but take as little as
                                 # possible
      )
      (?=                        # lookahead   
        [.:?\-] *                #  for possible punctuation
        (?: [^\w/#~:.?+=&%@!\-]  #  invalid character
          | $ )                  #  or end of string
      )
     }{<a href="$1">$1</a>}igox;

1.3.5 Other Resources

Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant (O'Reilly), is the standard Perl reference.
Mastering Regular Expressions, Second Edition, by Jeffrey E. F. Friedl (O'Reilly), covers the details of Perl regular expressions on pages 283-364.
perlre is the perldoc documentation provided with most Perl distributions.
