Metacharacters

Regular expressions rarely look for exact literal matches (such as m/Saruman/); more often they look for fuzzy matches, as in our earlier Baggins code snippet. For example, suppose we possess a secret file called listOfPowers.txt. This was discovered within the bowels of Orthanc by Gandalf before he left for the Blessed Realm. It consists of the following names:

Saruman
Aragorn, son of Arathorn
Frodo Baggins
Mithrandir
Sauron
Bombadill
Durin's Bane
Smaug
Elrond
Galadriel
Witch-King
Celeborn
Radagast
Dain Ironfoot
Denethor

We'd like to find all the Powers known to have existed within Middle-Earth in the Third Age, whose names begin with an "S" and end with an "n." Example C-1 is the program we use.

Example C-1. Using regex metacharacters (findTheBaddies.pl)

#!perl
  
while(<>){
   print if /S.*n/;
}

Notice the use of two special metacharacters, the . (dot) and the * (Kleene Star), within our main match expression. Before we explain how these are being used, let's see what we get:

$ perl findTheBaddies.pl listOfPowers.txt
  
Saruman
Sauron

The results seem appropriate, but notice how we failed to pick up Smaug or anything else that nearly matches. So how are these two metacharacters combining? Before we answer this question, let's examine all of Perl's main regex metacharacters in Table C-3.

Table C-3. Perl's main regular expression metacharacters

Metacharacter

Description

\...

The backslash giveth and the backslash taketh away. If the next character is special (for example, a $, *, or even another \), the backslash character takes away its specialness and makes it just another character (so \\ means "match a single backslash"). If the backslashed character is ordinary (e.g., a straightforward "n", "b", or "w" keyboard letter), \ usually gives it special meaning (\n for a newline character).

...|...

This is used for alternation, matching either one expression or the other, as in:

m/Merry|Pippin/ for a match that contains either "Merry" or "Pippin."

(...)

This has two concurrent meanings:

It can group various matches, usually in combination with the alternation just shown, as in m/(Sam|Frodo|Gollum) bore the ring in Mordor/.

At the same time, it will store or return whatever is found within the brackets, usually into backreference variables, so we can make use of this information elsewhere in the program (we'll say more about this later).

[...]

The character class brackets allow you to provide a range of match characters, so [abc]fg can match afg, bfg or cfg. You can also use a character class range, so that [0-9] is equivalent to [0123456789].

[^...]

A slight variation on [...]. If the first character encountered within a character class is a ^ (caret), it negates the whole thing. So [^abc]fg will match dfg, efg, and every character in the known Unicode universe preceding the fg string, except an "a" or a "b" or a "c."

*

The Kleene Star — match the preceding item zero or more times, up to infinity. See the Kleene Star and the other regex multipliers at work in Figure C-4.

+

Match the preceding item one or more times.

?

Match the preceding item zero times or once only.

{Exact Count}

Match the preceding item an exact number of times. For instance, a{4} means "Find exactly four "a" characters within the pattern, so they look like aaaa."

{Min,}

(Note the comma.) Find at least the specified number of the previous item, up to infinity. For example, a{3,} greedily matches aaa, aaaa, aaaaa, and so on. (We'll say more about "greediness" shortly.) Incidentally {0,} is exactly equivalent to *, the Kleene Star shorthand version, and {1,} is exactly equivalent to +.

{Min,Max}

Match the preceding item at least Min times and at most Max times. For instance, a{4,5} fails to match a, aa, aaa, but will completely match aaaa and aaaaa. Under greedy conditions, it will match the first five characters of aaaaaa, aaaaaaa, and so on. The {0,1} construct is exactly equivalent to the ? metacharacter above, as you can see in Figure C-4.

^

Anchors the beginning of a string, and sometimes follows the \n newline character depending on the /m match suffix discussed later. This means that m/^Angmar/ will match "Angmar was of old the realm of the Witch-King" but fails to match "The Witch-King of Angmar."

$

Anchors the end of a string before any \n newline, if there is one. It can occasionally precede other embedded \n newlines, depending on suffixes (described later). This means that m/Minas Morgul$/ will match "Dreadful was the vale of Minas Morgul" but refuses to match "Minas Morgul was once the fair moonlit valley of Minas Ithil."

.

The dot character matches any character except the \n newline, although this behavior can be modified slightly with the /s suffix, as we'll see later.

Figure C-4. Variable numeric character requirements
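As a quick check on the quantifier and character class rows of Table C-3, here is a minimal sketch. The test strings and variable names are invented for illustration and do not come from the original examples.

```perl
#!perl
use strict;
use warnings;

# For each invented test string, record what a{4,5} captures
# (undef if the pattern cannot match at all).
my %grabbed;
for my $s ('aaa', 'aaaa', 'aaaaaaa') {
    $grabbed{$s} = $s =~ /(a{4,5})/ ? $1 : undef;
}

# ? makes the preceding item optional, so both spellings match,
# but a doubled "u" does not.
my @spellings = grep { /^colou?r$/ } ('color', 'colour', 'colouur');

# A negated character class: anything except a, b or c before "fg".
my @negated = grep { /^[^abc]fg$/ } ('afg', 'dfg', 'zfg');

for my $s (sort keys %grabbed) {
    printf "%-8s => %s\n", $s,
        defined $grabbed{$s} ? $grabbed{$s} : '(no match)';
}
print "colou?r  : @spellings\n";
print "[^abc]fg : @negated\n";
```

The capture shows the greedy behavior described for {Min,Max}: given seven "a" characters, a{4,5} takes five and leaves the rest.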

To answer our original question, we can now see how the /S.*n/ match worked its magic on the listOfPowers.txt file:

It looked for a capital "S." (No prizes so far.)
The . dot character then meant it looked for any character, except \n newlines, and the * meant it looked for zero or more of them.
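The regex then looked for an "n" to terminate the match, which ensured that Smaug fell at the last hurdle because it didn't comply with this condition.
Only Saruman and Sauron matched all three of these requirements in full, with aruma and auro matching the .* multiplier. (Strictly speaking, the pattern is unanchored, so it accepts any line containing an "S" followed somewhere later by an "n"; on this particular file, that picks out exactly the names we wanted.)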
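The walk-through above can be tried out directly. The sketch below runs /S.*n/ over a few of the names, and compares it with an anchored variant, /^S.*n$/, that really does insist on a leading "S" and a trailing "n." The entry "Samwise Gamgee, gardener" is a hypothetical addition of ours, not part of listOfPowers.txt; it is there only to show where the two patterns part company.

```perl
#!perl
use strict;
use warnings;

# A few names from listOfPowers.txt, plus one invented entry (the last).
my @powers = (
    'Saruman',
    'Aragorn, son of Arathorn',
    'Frodo Baggins',
    'Sauron',
    'Smaug',
    'Elrond',
    'Samwise Gamgee, gardener',    # hypothetical, for illustration only
);

# Unanchored: a capital "S" followed, anywhere later, by an "n".
my @loose = grep { /S.*n/ } @powers;

# Anchored: the whole name must start with "S" and finish with "n".
my @strict = grep { /^S.*n$/ } @powers;

print "loose : ", join(' | ', @loose),  "\n";
print "strict: ", join(' | ', @strict), "\n";
```

On the original file both patterns give the same answer, so the simpler one does the job; the anchors only matter once the data gets messier.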

In the following sections, we'll examine how we can further refine such munge requirements to search for other fuzzy data, while keeping track of what falls within our fuzzy requirements and what falls outside them.

Character Class Shortcuts

If you've ever used character class ranges with an older version of grep, you may have used a command like the following one to find words of at least one character in length:

$ grep '[a-zA-Z0-9][a-zA-Z0-9]*' myFile.txt

This seems reasonable enough, and the ranges are nice because they've cut out the typing in of many alphabetical characters. But this is still too much work for a Perlhead; a similar match in Perl would involve just three keystrokes:

\w+

There are many other such regex shortcuts in Perl for other character class ranges. To illustrate these, including \w, we'll first detail some double-quotish characters which are recognized within Perl regexes, in Table C-4. Table C-5 will then display some of the best-known character class shortcuts. (Fortunately, many of these have now made their way into more modern versions of grep and egrep, too.)

Table C-4. Escaped characters

Escape

Description

\0

Null character

\a

Alarm (often producing an OS bell ring)

\e

Escape character

\f

Form feed

\n

Newline

\r

Return

\t

Tab

\cX

Control character, where Control-C is \cC

\N{NAME}

Named character, such as \N{greek:Alpha}

\x{abcd}

Hexadecimal character, where \x{263a} is a smiley face

The . (dot) character is normally used to represent any character, except \n newline. However, it has no such special meaning within character classes. Therefore [.]+ literally means one or more . dot characters, such as full-stops, periods, or decimal points.
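To see a few of these escapes and the literal dot at work, here is a small sketch. The variable names and test strings are our own; each flag is 1 if the match succeeds and 0 otherwise.

```perl
#!perl
use strict;
use warnings;

# Escapes from Table C-4 used inside patterns.
my $ctrl_c_ok = chr(3)      =~ /^\cC$/       ? 1 : 0;  # Control-C is character 3
my $tab_ok    = "a\tb"      =~ /^a\tb$/      ? 1 : 0;  # tab between a and b
my $smiley_ok = chr(0x263a) =~ /^\x{263a}$/  ? 1 : 0;  # hexadecimal code point

# Inside a character class, the dot is just a dot.
my $dots_match_class = '...' =~ /^[.]+$/ ? 1 : 0;  # literal dots only
my $abc_match_class  = 'abc' =~ /^[.]+$/ ? 1 : 0;  # letters are not dots
my $abc_match_dot    = 'abc' =~ /^.+$/   ? 1 : 0;  # bare dot matches anything

print "ctrl-c: $ctrl_c_ok, tab: $tab_ok, smiley: $smiley_ok\n";
print "[.]+ on '...': $dots_match_class\n";
print "[.]+ on 'abc': $abc_match_class\n";
print ".+   on 'abc': $abc_match_dot\n";
```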

Table C-5. Character class shortcuts

Symbol

Description

Fully expanded version

\d

Any digit.

[0-9]

\D

Anything except a digit.

[^0-9]

\s

Whitespace, including spaces, tabs, newlines (line feeds), carriage returns, and form feeds.

[ \t\n\r\f] (Note that the first character in this range is a single ordinary spacebar character.)

\S

Non-whitespace.

(You have to be careful when using shortcuts such as \s and \S. They can easily look like each other within large code blocks, or even within small ones.)

[^ \t\n\r\f]

\w

A word character: any alphanumeric character or the underscore (typically found in file names).

[a-zA-Z0-9_] (Note that this also depends upon your locale settings — for example, ö in a German locale is matched by \w; see perldoc perllocale for more details.)

\W

Non-word character.

(The ends and beginnings of strings, as marked by the ^ and $ string anchors, are often honorary \W characters for the devilish purposes of regexes.)

[^a-zA-Z0-9_]
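The shortcuts in Table C-5 combine naturally when picking fields out of a line. The following sketch assumes an invented, whitespace-separated sample line (a process name, a numeric ID, and a status); none of these values come from the original text.

```perl
#!perl
use strict;
use warnings;

# Invented sample: name, two spaces, numeric id, a tab, then a status.
my $line = "ora_pmon  4021\tRunning";

# \w+ takes the name (underscore included), \d+ the digits,
# \s+ any run of whitespace, and \S+ the final non-whitespace field.
my ($name, $pid, $state) = $line =~ /^(\w+)\s+(\d+)\s+(\S+)$/;

# \D picks out everything that is not a digit.
my @non_digits = 'a1b22c' =~ /\D/g;

print "name=$name pid=$pid state=$state\n";
print "non-digits: @non_digits\n";
```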

Boundaries

In addition to the ^ (caret) and $ (dollar) string anchors, there are several other special boundary assertions commonly used in Perl regexes. These are described in Table C-6.

Table C-6. Positions and boundaries

Symbol

Description

\b

This matches any boundary between a \w word character, and a \W non-word character, in either the \w\W or \W\w order. It is a zero-width assertion and can be seen matching various word boundaries in Figure C-5. For the purposes of \b boundary matches, the ^ and $ anchors count as honorary \W non-word characters.

\B

Simply the opposite of \b. This is the boundary between either a \w\w or \W\W pairing.

\A

This is like a strict ^. It matches at the beginning of the string. We'll see later in this appendix how ^ can also match just after embedded \n characters, if it is used with the /m match suffix. However, \A only matches right at the start of the string, come what may.

\z

Again, this is like a super-strict $. The \z symbol only matches at the end of a string, with or without \n newlines, and with or without the /m match suffix (described later).

\Z

This usually means the same as $ — that is, it comes either before the \n newline at the end of a string, (if there is one) or right at the end (if there isn't). With the /m match suffix, the $ character can then come before \n characters embedded within the string, whereas \Z cannot.

Beware of punctuation within words such as "Let's", as in Figure C-5. Remember that \w is both for alphanumerics and the underscore, but it never covers punctuation marks, such as apostrophes. Sometimes it's better to use matches, such as the following, to pick up words containing punctuation:

m/\s+\S+\s+/

Figure C-5. Word boundaries

This means: "Some spaces, followed by some nonspaces, followed by some more spaces." The sequential non-space characters can be a word containing apostrophes. Notice how it may also be difficult, at first glance, to pick out the difference between the \s and the \S shortcuts.
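The boundary rules in Table C-6 are easier to trust once you have watched them split a string. This sketch uses two invented strings of our own. It first shows how \b breaks "Let's" at the apostrophe while \S+ keeps it whole, and then compares ^, $, \A, \Z, and \z on a two-line string, with and without the /m suffix.

```perl
#!perl
use strict;
use warnings;

my $phrase = "Let's ride to Isengard";

# \b sees the apostrophe as a non-word character, so "Let's" splits in two.
my @by_word_boundary = $phrase =~ /\b(\w+)\b/g;

# \S+ simply takes runs of non-whitespace, apostrophes and all.
my @by_non_space = $phrase =~ /(\S+)/g;

# Two lines, each ending in a newline.
my $text = "Angmar\nMordor\n";

my %anchor = (
    '/^Mordor/'    => ($text =~ /^Mordor/    ? 1 : 0),  # ^ only at string start
    '/^Mordor/m'   => ($text =~ /^Mordor/m   ? 1 : 0),  # /m: also after \n
    '/\AMordor/m'  => ($text =~ /\AMordor/m  ? 1 : 0),  # \A ignores /m
    '/Mordor$/'    => ($text =~ /Mordor$/    ? 1 : 0),  # before the final \n
    '/Mordor\Z/'   => ($text =~ /Mordor\Z/   ? 1 : 0),  # same as $ here
    '/Mordor\z/'   => ($text =~ /Mordor\z/   ? 1 : 0),  # very end only
    '/Angmar$/m'   => ($text =~ /Angmar$/m   ? 1 : 0),  # /m: before embedded \n
    '/Angmar\Z/m'  => ($text =~ /Angmar\Z/m  ? 1 : 0),  # \Z ignores /m
);

print "\\b    : ", join(' | ', @by_word_boundary), "\n";
print "\\S+   : ", join(' | ', @by_non_space), "\n";
printf "%-14s %d\n", $_, $anchor{$_} for sort keys %anchor;
```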

Greediness

If the first great principle of Perl regexes is the left-most match wins, the second great principle is that by default any match will try to take as much text as it can. More specifically, the mass character quantifiers will always try to grab the maximal possible match. These quantifiers include:

* (or {0,})

+ (or {1,})

? (or {0,1})

{Min,}

{Exact Count} (see Table C-7 for the special case created by this quantifier)

{Min,Max}

Particularly when they're used in combination with the . (dot) character, they will always try to eat as much as they can, unless we tell them otherwise. For instance, let's take a typical line out of an /etc/passwd file:

andyd:fruitbat:/home/andyd:dba,apache,users:/bin/ksh

You might expect the substitution s/.*:/jkstill:/ to produce the following:

jkstill:fruitbat:/home/andyd:dba,apache,users:/bin/ksh

You might have thought the .* would only take the andyd, and allow the colon character to match the first colon. But this doesn't happen. Instead, the .* will try and grab as much as it possibly can get away with. Remember that the . dot character can match anything, except \n newlines, and that includes colons. What the preceding substitution actually produces is:

jkstill:/bin/ksh

This may go against common sense, but it is the result of default greedy behavior. Perl regexes operate greedily via the NFA mechanism of backtracking and by saving success states. This is illustrated in Figure C-6, which we've broken down into seven steps:

The first step establishes that there is at least one possible solution involving a trailing colon. The regex saves this state and will only come back to it later if it's forced to by the turn of events.
Being greedy, the regex decides to march on and go for another bridge over the river into the enemy's territory. It assigns the colon it has just found to being part of the .* match and moves on until it can (if it's lucky) find a second save state and another colon.
Continuing the greedy pattern, the regex has another go to see if it can feed yet more bridge-head territory into the .* multiplier. It finds a third save state.
Once again, the regex moves on to greedily acquire a fourth save state. This is the last one it will successfully find, but it has yet to learn this.
The regex goes for glory and attempts to acquire a fifth save state, but crashes and burns instead, running out of text and failing to find a fifth colon to complete its target match. It has gone a colon too far.
Using the NFA architecture, the regex can now backtrack to the latest save state.
It hands this save state result onto the rest of the substitution program, which will then go on to complete the operation by replacing andyd:fruitbat:/home/andyd:dba,apache,users: with jkstill:. Mission accomplished.

Figure C-6. Greedy matching, save states and backtracking

When you're munging large quantities of data, be sure to take this greedy behavior into account when crafting your regular expressions.
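If you would like to confirm the greedy result for yourself, the following minimal sketch captures exactly what .* swallowed before the final colon, and then performs the substitution on a copy of the line. The variable names are ours.

```perl
#!perl
use strict;
use warnings;

my $line = 'andyd:fruitbat:/home/andyd:dba,apache,users:/bin/ksh';

# Capture whatever .* ends up holding once the trailing colon has matched.
my ($swallowed) = $line =~ /(.*):/;

# Run the substitution on a copy, leaving $line untouched.
(my $greedy = $line) =~ s/.*:/jkstill:/;

print "swallowed: $swallowed\n";
print "result   : $greedy\n";
```

The captured text runs all the way up to the last colon on the line, which is why only /bin/ksh survives the substitution.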

Now that we know how the NFA works on greediness, we can think about the regex pathways that will take up the least amount of work. However, sometimes we would rather avoid this maximally greedy behavior; perhaps we want just the bare minimum. In the case under consideration, all we really wanted to do was to replace andyd: with jkstill:. So how do we do this? With the multiply useful ? (question mark) character, summarized in Figure C-7. What we have in the top portion of Figure C-7 is a maximally greedy regex, which eats as much as it can, while still ultimately producing the match. In the bottom half, the regex has been limited by the shackles of the extra question mark suffix. It is now a minimalist regex, and will match as little as it can to find a successful match. It may be less than happy about this, but what can it do?

Figure C-7. The question mark and its effect on greediness

In our earlier example, if we use a substitution regex of s/.*?:/jkstill:/, we now get the result of:

jkstill:fruitbat:/home/andyd:dba,apache,users:/bin/ksh
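Here is a short sketch of the minimal version in action. It also shows one common alternative, which is our suggestion rather than something from the text above: a negated character class, [^:]*, which cannot run past the first colon no matter how greedy it is. Variable names are invented.

```perl
#!perl
use strict;
use warnings;

my $line = 'andyd:fruitbat:/home/andyd:dba,apache,users:/bin/ksh';

# Minimal match: .*? stops at the first colon it can.
(my $lazy = $line) =~ s/.*?:/jkstill:/;

# Alternative (our assumption of what you might prefer): a negated class
# can never swallow a colon, so greediness does no harm.
(my $negated = $line) =~ s/^[^:]*:/jkstill:/;

print "lazy   : $lazy\n";
print "negated: $negated\n";
```

Both forms replace only the leading andyd: field. The negated class states the intent ("everything up to the first colon") a little more explicitly.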

Incidentally, this greediness-restraining use of the ? (question mark) suffix is in addition to the other main use of ?, as a quantifier in its own right equivalent to {0,1}. All the multiple quantifiers can be similarly restrained, as in Table C-7.

Table C-7. Minimizing greediness

Syntax

Description

{Min,Max}?

Will match at least Min of the preceding character, and up to Max in order to make the match work, but will try to only match Min if it can get away with it.

{Min,}?

Will match at least Min of the preceding character, and up to infinity of them in order to make the match work, but again will try to only match Min.

{Exact Count}?

Although minimization is logically available here, this match always has to get an Exact Count anyway, regardless of whether it's greedy or otherwise. There may be processing implications by making this non-greedy, but these constantly vary depending on whatever else you're doing.

*?

From zero to infinity of the preceding character, and as close as possible to zero, to make the match work.

+?

From one to infinity of the preceding characters, and as close as possible to one, to make the match work.

??

This match will try to find zero to one of the preceding character, but will prefer to find zero characters, if that will make the match work.
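The entries in Table C-7 can be compared side by side with a few captures. This is only a sketch on invented strings; the last line shows a minimal quantifier being forced to take more than its minimum "in order to make the match work."

```perl
#!perl
use strict;
use warnings;

my $run = 'aaaaaa';

my ($greedy_range) = $run =~ /(a{2,4})/;    # takes as many as allowed
my ($lazy_range)   = $run =~ /(a{2,4}?)/;   # settles for Min
my ($lazy_plus)    = $run =~ /(a+?)/;       # settles for one
my ($lazy_star)    = $run =~ /(a*?)/;       # settles for none at all

# Here the minimal quantifier must stretch to four before "b" can match.
my ($forced) = 'aaaab' =~ /(a{2,4}?)b/;

print "a{2,4}   : '$greedy_range'\n";
print "a{2,4}?  : '$lazy_range'\n";
print "a+?      : '$lazy_plus'\n";
print "a*?      : '$lazy_star'\n";
print "a{2,4}?b : '$forced'\n";
```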

Interpolated Strings

Variables found within Perl regexes behave similarly to interpolated strings within print statements. This is because of two levels of parsing:

The first parse interpolates, or expands, any possible variables.
The second parse works out the actual regular expression, and how to process it.

For instance:

$original_Gandalf = "Olorin";
  
$wizard_String = "Olorin or Mithrandir? ";
  
if ($wizard_String =~ m/$original_Gandalf/)
{
   print "Wizard found! <|:-))))"
}

This interpolates the $original_Gandalf variable inside the match, which expands to Olorin and then processes the regex on the $wizard_String input data to see if it contains Olorin. The resultant output is:

Wizard found! <|:-))))

However, although regexes can generally be treated in the same way double-quoted interpolated strings are treated, this varies slightly with the special use of metacharacters. For instance, Example C-2 will fail.

Example C-2. Checking interpolation in regexes (almostInterpolated.pl)

#!perl -w
  
use strict;
  
my $regex_pattern = "[*Casablanca";
  
my $input_film = "[*Casablanca";
  
if ($input_film =~ m/$regex_pattern/)
{
   print "Is the Maltese Falcon just as good?\n";
}

If we run almostInterpolated.pl, we get a rude awakening:

$ perl almostInterpolated.pl
Unmatched [ before HERE mark in regex m/[ << HERE *Casablanca/ at almostInterpolated.pl line 10.

This is because [ is a special regex metacharacter for character classes, as described in Table C-3, which needs a matching ] dancing partner. Because we're looking for the [ square bracket opener as an actual literal within the string, we need to backslash it to escape its special meaning. Fortunately, we can avoid pasting backslashes everywhere into our pattern. We can use the quotemeta( ) built-in function instead. What this does is return the input string value with all nonalphanumeric characters, except the underscore, backslashed for our convenience:

my $regex_pattern = quotemeta('[*Casablanca');
  
my $input_film = "[*Casablanca";
  
if ($input_film =~ m/$regex_pattern/)
{
   print "Is the Maltese Falcon just as good?\n"; 
}

Now we get:

$ perl almostInterpolated.pl
  
Is the Maltese Falcon just as good?
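To see what quotemeta( ) actually hands back, the following sketch prints the escaped strings. It also uses the \Q...\E escape, which applies the same quoting inline within a pattern. The second sample string, film_noir.txt, is invented to show that the underscore is left alone while the dot is escaped.

```perl
#!perl
use strict;
use warnings;

my $raw = '[*Casablanca';

# quotemeta() backslashes the bracket and the star, but not the letters.
my $escaped = quotemeta($raw);

# Invented second sample: the underscore survives, the dot is escaped.
my $with_underscore = quotemeta('film_noir.txt');

# \Q...\E applies the same quoting inside the pattern itself.
my $found = '[*Casablanca' =~ /\Q$raw\E/ ? 1 : 0;

print "escaped        : $escaped\n";
print "with underscore: $with_underscore\n";
print "found          : $found\n";
```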

Scalar or List Context Results

A match in a scalar setting will generally produce either a 1 for true (if it finds a match) or an empty string "" for false (if it fails to find the required match):

# Scalar context on the LHS, left hand side.
  
$my_string = "Galadriel and Celeborn";
  
# Note below how the =~ symbol takes precedence over the = symbol.
# What happens in the following, is that the $my_string =~ m/Galad/
# operation takes place, and then the $result = (match operation)
# comes second.
  
$result = $my_string =~ m/Galad/;
  
print "Expecting 1: result: >", $result, "<\n";
  
$result = $my_string =~ /Legolas and Gimli/;
  
print "Expecting Empty String: result: >", $result, "<\n";

Fingers crossed, we get the results we're after:

Expecting 1: result: >1<
Expecting Empty String: result: ><

Excellent. This behavior of returning 1 or "" changes if Perl detects that a list is required on the left-hand side of the assignment (i.e., the match is in list context rather than scalar context). In this case, if anything within a match is marked for storage with parentheses, these values are copied across into the list elements on the left-hand side. If no valid match is found, these elements are left undefined, which print as empty strings:

# Array context on the LHS
  
$my_string = "Galadriel and Celeborn";
  
# Once again, the =~ operation takes precedence over the = operation,
# and the wantarray(  ) function detects that a list is required on the
# left-hand side.
  
($queen, $king) = $my_string =~ m/(Galad\w+)\s+\w+\s+(\w+)/;
  
# Valid results expected
  
print "Value Expected, Queen: >", $queen, "<\n";
print "Value Expected, King: >", $king, "<\n";
  
($queen, $king) = $my_string =~ m/(Legolas\w+)\s+\w+\s+(\w+)/;
  
print "Empty String Expected, Queen: >", $queen, "<\n";
print "Empty String Expected, King: >", $king, "<\n";

When executed, this provides:

Value Expected, Queen: >Galadriel<
Value Expected, King: >Celeborn<
Empty String Expected, Queen: ><
Empty String Expected, King: ><

This is a bit fiddly, but if you work through a few examples of your own, it should begin to make sense.
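As one such worked example, the sketch below collects list-context results into arrays so that their sizes can be inspected. It also shows two idioms that follow from the same rules, offered here as suggestions of ours: guarding a list assignment with if, and counting matches with the /g suffix. Variable names are invented.

```perl
#!perl
use strict;
use warnings;

my $pair = "Galadriel and Celeborn";

# A successful match in list context returns the captured values.
my @hits = $pair =~ /(Galad\w+)\s+\w+\s+(\w+)/;

# A failed match in list context returns an empty list.
my @misses = $pair =~ /(Legolas\w+)\s+\w+\s+(\w+)/;

# With no parentheses at all, a successful match returns the list (1).
my @no_parens = $pair =~ /Galad/;

# The list assignment itself is true only if something was returned,
# so it can guard the use of the captured values.
my $guarded = 'no match';
if (my ($queen, $king) = $pair =~ /(Galad\w+)\s+\w+\s+(\w+)/) {
    $guarded = "$queen + $king";
}

# Counting matches: assign the /g list to an empty list in scalar context.
my $word_count = () = $pair =~ /\w+/g;

print "hits      : @hits (", scalar(@hits), " elements)\n";
print "misses    : ", scalar(@misses), " elements\n";
print "no parens : @no_parens\n";
print "guarded   : $guarded\n";
print "word count: $word_count\n";
```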

Alternation and Memory

We promised earlier, when we were discussing list contexts and the internal use of the wantarray( ) function, that we'd cover backreferences. So what's the mechanism behind backreference memory storage?

Capturing backreferences

As we explained earlier, backreferences are made possible by the architecture of the NFA engine, which always leaves a ball of string back into the labyrinth. Think of the bracketing as paired knots in the string, which tell the regular expression what to retrieve. We can see this in action in Figure C-8, where we're using backreferences to store the noted values in special built-in variables, rather than returning them as part of a list. Note also the use of the /i regex suffix in Figure C-8, which ignores the alphabetic case of the target string under scrutiny.

Figure C-8. Capturing backreferences and ignoring case

Note the following:

These special built-in variables start from $1, and move up to $n, depending on how many bracketed elements you have (which always start from the left).
This is why we are prohibited from starting the name of a normal Perl scalar variable with a digit. Such names are reserved for built-in regex backreferences.
A value, like $1, will continue to exist within your program until another regular expression is executed that successfully matches. (Such values are dynamically scoped until the end of the innermost block, until the end of the current file, until the eval statement, or until the next successful match, whichever comes first.)
You can nest your brackets as much as you dare.

Let's run through Example C-3, with a range from $1 to $12.

Example C-3. Capturing multi-bracketed values (roundDozen.pl)

#!perl -w
  
# Start with a large match, involving twelve captures
  
$_ = "abcdefghijklmnopqrstuvwxyz";
  
#  a b c d e f g h i j k l m n o p q r s t u v w x y z
  
m/(.(.(.(.(.(.(.(.(.(.(.(.).).).).).).).).).).).)/;
# 1 2 3 4 5 6 7 8 9 t e w w e t 9 8 7 6 5 4 3 2 1  backreferences
  
# t = ten, e = eleven, w = twelve
  
print '$1  :', $1,  "\n";
print '$2  :', $2,  "\n";
print '$3  :', $3,  "\n";
print '$4  :', $4,  "\n";
print '$5  :', $5,  "\n";
print '$6  :', $6,  "\n";
print '$7  :', $7,  "\n";
print '$8  :', $8,  "\n";
print '$9  :', $9,  "\n";
print '$10 :', $10, "\n";
print '$11 :', $11, "\n";
print '$12 :', $12, "\n";
  
# Now let's go for a small match, which only fills
# up $1, $2 and $3
  
$_ = "1234567890";
  
#  1 2 3 4 5 6 7 8 9 0
  
m/(.(.(.).).)/;
# 1 2 3 3 2 1 backreferences
  
print '$1  :', $1,  "\n";
print '$2  :', $2,  "\n";
print '$3  :', $3,  "\n";
print '$4  :', $4,  "\n";
print '$5  :', $5,  "\n";
print '$6  :', $6,  "\n";
print '$7  :', $7,  "\n";
print '$8  :', $8,  "\n";
print '$9  :', $9,  "\n";
print '$10 :', $10, "\n";
print '$11 :', $11, "\n";
print '$12 :', $12, "\n";

Running this script produces the following results:

$ perl roundDozen.pl
$1  :abcdefghijklmnopqrstuvw
$2  :bcdefghijklmnopqrstuv
$3  :cdefghijklmnopqrstu
$4  :defghijklmnopqrst
$5  :efghijklmnopqrs
$6  :fghijklmnopqr
$7  :ghijklmnopq
$8  :hijklmnop
$9  :ijklmno
$10 :jklmn
$11 :klm
$12 :l
$1  :12345
$2  :234
$3  :3
Use of uninitialized value in print at roundDozen.pl line 32.
$4  :
Use of uninitialized value in print at roundDozen.pl line 33.
$5  :
Use of uninitialized value in print at roundDozen.pl line 34.
$6  :
Use of uninitialized value in print at roundDozen.pl line 35.
$7  :
Use of uninitialized value in print at roundDozen.pl line 36.
$8  :
Use of uninitialized value in print at roundDozen.pl line 37.
$9  :
Use of uninitialized value in print at roundDozen.pl line 38.
$10 :
Use of uninitialized value in print at roundDozen.pl line 39.
$11 :
Use of uninitialized value in print at roundDozen.pl line 40.
$12 :

Note the following:

On the first set of printouts, we got $1 to $12 printed out neatly, following the left-to-right bracketing rule.
However, on the second print run, after the second regular expression, the values of $1, $2, and $3 printed out OK, but $4 to $12 are now completely undefined.
You may have expected $4 to $12 to remain the same as they were after the first regex, but to keep a logically consistent picture, the entire board is swept clean if a successful match is found. As soon as you run another matching regex, the whole $1 to $n shooting match begins again, all the way up to infinity.
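The flip side of this rule is that a failed match sweeps nothing clean: $1 and friends keep whatever the last successful match left in them. The sketch below demonstrates this with invented strings, and then shows one defensive habit (our suggestion, with a hypothetical helper named father_of): only read $1 when the match has just succeeded.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: return the captured name, or undef if no match.
sub father_of {
    my ($text) = @_;
    return $text =~ /\w+ son of (\w+)/ ? $1 : undef;
}

# A successful match fills $1 and $2.
my $ok_first = 'Balin son of Fundin' =~ /(\w+) son of (\w+)/;
my $seen_first = "$1 / $2";

# This match fails, so $1 and $2 are left exactly as they were.
my $ok_second = 'Gimli the dwarf' =~ /(\w+) son of (\w+)/;
my $seen_after_failure = "$1 / $2";

print "after success: $seen_first\n";
print "after failure: $seen_after_failure\n";

# The helper only trusts $1 when its own match succeeded.
for my $text ('Gimli son of Gloin', 'Gimli the dwarf') {
    my $father = father_of($text);
    print "$text => ", defined $father ? $father : '(no match)', "\n";
}
```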

You can also use backreferences within the actual matches. The rule is that if these are used on the left side of the substitution or within an ordinary match, you must use the \1 style notation (instead of $1). On the other hand, on the right-hand side of the substitution you can use the straight $1 notation. For instance, you might be trying to replace all double-word typos in a piece of text with equivalent single words:

#!perl -w
  
# Our input string has two double-word typos,
# "work work", and "was was".  We'd like to remove both of them.
  
$_ = "Ludwig von Mises greatest work work was Human Action, " .
     "and F.A. Hayek's greatest work was was the Road to Serfdom.";
  
# On the left side of the substitution, to pick up
# the double-word, we have to use \1 in the match, 
# and on the right side substitution we use $1 to replace 
# both instances of the same word with a single string value.
  
s#\b(\w+)\b\s+\1\b#$1#g; # Substitute double-word typos
  
print;

Note the following:

We've used the # character to delimit the substitution, to prevent eye-strain among all those shooting-star slashes.
We've also used the global suffix, g, which we'll talk about shortly, to ensure that we substitute the first match found, work work, and the second one too, was was.
The use of the \b word boundary ensures that we're only picking up real individual words, and avoiding phrase combinations such as:
```
the theocracy
lathe the
bathe their
```

Our solution code produces the following output text:

Ludwig von Mises greatest work was Human Action, and F.A. Hayek's 
greatest work was the Road to Serfdom.
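As a small extension of the same idea, the sketch below wraps the substitution in a hypothetical dedupe( ) subroutine of our own and adds the /i suffix, so that a capitalized repeat such as "The the" is caught as well. The test sentences are invented; the second one strings together the near-misses listed above to confirm that the \b boundaries leave them alone.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper: collapse immediately repeated words into one.
# /g handles every repeat on the line; /i ignores alphabetic case,
# which also applies to the \1 backreference.
sub dedupe {
    my ($text) = @_;
    $text =~ s#\b(\w+)\s+\1\b#$1#gi;
    return $text;
}

my $fixed = dedupe('The the ring was was lost');
my $left_alone = dedupe('lathe the bathe their the theocracy');

print "$fixed\n";
print "$left_alone\n";
```

The replacement keeps the first of the two words, so the original capitalization at the start of the sentence is preserved.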