Regular Expression Concepts

Regular Expression Concepts

As we mentioned earlier, regular expressions are patterns of literals and metacharacters that match target combinations of characters embedded within input data:

Literal character: A plain honest-to-goodness character, which mostly means no harm to anyone and which goes about under the motto, "What you see is what you get". So when you see the letter n by itself, without a mischievous backslash nearby, it means "please match the letter n at this position, and nothing else".
Metacharacter: A nonalphabetic keyboard character, such as ^, *, $, and so on, which either has a special meaning or can give special meaning to surrounding characters. For instance, the \ metacharacter backslash gives a special meaning to the letter n, making it into the newline character, \n.

Matching, Substitution, and Translation

There are three main types of regular expression. All three work only on scalars, usually strings:

m// is for match

At the basic level, m// simply tells you whether the required regular expression is matched, or exists, within the input data. Because matching is so ubiquitous within Perl, just the simple use of // will indicate to Perl that you're performing a match. If you change the delimiters, however, you do need to explicitly use the m prefix, as in m%my match%. (We'll say more about delimiters shortly.)

s/// is for substitute

If you want to replace the located matches with something else, you call up the substitution operator. You always need to use the s prefix.

tr/// is for translate

Although St. Peter at the Perly gates would fail to recognize tr/// as a definitive regular expression operator, it's so close in form and function that we can usually get away with fudging the issue. The translate operator takes a range of characters on its left side, and replaces them with another range of specified characters on its right side. Its typical use is to capitalize a passage of text, or shift some number ranges. To keep old sed users happy, tr/// also possesses a synonym, the y/// operator, which behaves in an identical fashion. A typical translation program to uppercase every line of an input file would look like this:

while(<>){
   tr/a-z/A-Z/;
   print;
}

Regular expression input

A special double character is used to indicate the scalar value the regex should work on. This is the =~ pattern binding operator. Although this looks a lot like an = assignment operator, try to think of it being more like the word "contains":^[3]

^[3] The =~ operator originates from the ~ and !~ regex operators in awk.

print "I have found a match" if $target_string =~ /corleone/;

This translates into English, as follows:

Print out the phrase, `I have found a match' if the variable $target_string contains the matching word `corleone'.

As with awk, the good angel of =~ has a naughty devil partner-in-crime for negative assertions, !~:

print "I have failed to find a match" if $target_string !~ /michael/;

This translates to:

Print out the phrase `I have failed to find a match' if $target_string fails to contain the word `michael'.

You can use any nonalphanumeric or non-space character as a delimiter within Perl regexes. This is particularly useful when you're matching strings that contain / Unix slash characters, which you need to otherwise escape with a \ backslash character. A typically required match pattern string might be "/etc/passwd". When using the standard match syntax, this would become /\/etc\/passwd/, a process known as toothpicking. You can avoid this unsightly use by changing the delimiter character directly following from the now compulsory match function character, m, thus m#/etc/passwd#. No more fangs!

You can also use four sets of brackets, m<...>, m{...}, m(...), and m[...]. The first two are usually preferred because their bracket characters are less frequently used within regexes. You can also mix and match brackets for substitutions, s{...}<...>, though you may still prefer s(...)(...) or even s#...#...#. As a rule of thumb, use delimiters that aren't going to appear in your regex to keep everything clean. For example s<><> is a good one if you're not dealing with XML or HTML.

The implicit use of $_

As with many other places in Perl, if no scalar variable is supplied to our regular expression via the =~ or !~ constructs, it's assumed that the scalar value under consideration is the $_ default value. The good angel of =~ is also assumed:

$_ = "Who is greater, Von Mises or Hayek?";
print if /greater/;

The print statement here could be fully expanded to:

print $_ if $_ =~ m/greater/;

Either of the two preceding print statements would translate to:

Print the full contents of the $_ variable, if $_ contains a match for the word `greater'.

The Implicit Left-to-Right Assumption

Some rules are so implicit in Perl (and in regular expressions in general) that it's hard to spot them as assumptions as opposed to incontrovertible facts of life. The important one to watch out for is that regular expressions work in a left-to-right fashion. This may seem obvious to native English speakers, but if you're a fluent writer of right-to-left Arabic or Hebrew script, or you're trying to match right-to-left Unicode data, the importance of this assumption becomes more significant. If there are two or more matches on a single line, it is the left-most one that is matched first. Without the special /g global suffix, which we'll talk about later, it is only this first match that is either then validated, recorded, or substituted. (As with all computer languages that ultimately originate from the English language, Perl goes left-to-right, following the left-to-right convention of Latin, which followed the left-to-right tradition of fourth century BC classical Greek.)

Let's run our first regular expression to take a look at this concept. Try to keep an open mind on the following syntactical details, just for the moment; we promise to get to the meaty details of regular bracketing, curly bracketing, list context, and so on, a bit later. All we need to know for now is that we're looking for a seven-character phrase, in the supplied text, starting with the word "Bag," and we're going to store this in the $seven_letter_Bag_phrase variable. As we'll also cover later, the .{4} notation picks up any four characters, except \n newlines:

#!/usr/bin/perl
  
use strict;
  
my $party_text = 
   "When Mr. Bilbo Baggins of Bag End announced that " .
   "he would shortly be celebrating his eleventy-first " .
   "birthday with a party of special magnificence, there " .
   "was much talk and excitement in Hobbiton.";
  
my ($seven_letter_Bag_phrase) = ($party_text =~ m/(Bag.{4})/);

So what got stored inside the $seven_letter_Bag_phrase variable?:

print "Seven Letter Bag Phrase: >", $seven_letter_Bag_phrase, "<\n";

This provides an output of:

Seven Letter Bag Phrase: >Baggins<

This may be a surprise. Our clever plan was to get you to guess "Bag End," as this may seem at first glance the slightly more obvious match within the supplied $party_text variable. However using our left-to-right rule, the first match found was in fact "Baggins." As soon as we'd matched this, it was game over. To get the actual word "Bag," followed by a space and then a three-letter word, we'd have to tune our regular expression accordingly. For instance, we could re-tune our original code like this, to make sure there is an \s for space character after the word "Bag":

my ($seven_letter_Bag_phrase) = ($party_text =~ m/(Bag\s.{3})/);
  
print "Seven Letter Bag Phrase: >", $seven_letter_Bag_phrase, "<\n";

This produces:

Seven Letter Bag Phrase: >Bag End<

(We'll also explain more about \s later.)

Planning regular expressions is like planning killer attack moves in chess. It's easy to go forward with all bishops blazing, but we've got to leave ourselves covered at the back. Let's summarize the rules we followed here:

In the first case, we were after "Bag End," so we played a quick Bag.{4} move to go and grab it.
We then got punished for our hastiness, because this matched the more left "Baggins," too.
By replaying the move with Bag\s.{3}, we got the desired result.
Our regex opponent had to concede to us the "Bag End" phrase we were after originally. Fantastic!

Regular expressions can also be likened to the perfect jury. They must always get the guilty party (or match), and must always release the innocent bystander (or fail to match an unwanted pattern). Of course in the real world, perfect juries are uncommon, and we therefore tend to err on the side of letting the odd guilty person go (or miss the odd match), in order to make sure innocent bystanders (or false matches) are never wrongly convicted. Getting regular expressions to match exact requirements can be equally troublesome. It is only through fine-tuning and the constant honing of regex common law that we're able to achieve ultimate grand regex mastery.

Regular Expression Architectures

There are two major regular expression engine types.

The DFA (Deterministic Finite Automaton): With a name taking us back to Kleene's 1951 paper on neural nerve nets, the DFA engine powers many regex tools, including most versions of Alfred Aho's egrep, and awk, as well as lex and flex. Basically, while the DFA filters the input text, it simultaneously holds every single possible combination of text the regex could be searching for. Think of a police cell filling with the usual suspects, until the guilty party is recognized or the match found. The DFA engine therefore provides fast, consistent matches.
The NFA (Nondeterministic Finite Automaton): The alternative backtracker NFA engine drives Perl, sed, vi, and most versions of grep. It is controlled much more by the actual regex. It works by bumping and grinding through the input text one character at a time. If it goes up a blind alley, it backtracks to the last position that still makes sense (the saved state), and then begins working through the regex again, bumping and grinding once more through the text. It's a bit like a mad genius film editor, checking out a film sequence one frame at a time, and then cutting backwards and forwards until the Holy Grail's final cut is discovered. The NFA approach is illustrated in Figure C-2, where the Witch King of Angmar has crafted a regex to try to find Baggins.

Figure C-2. Bump 'n' grind backtrack matching

Although the NFA is logically slower than the DFA engine at finding matches, it has two major advantages which often overcome this speed gap:

Its backtracking architecture allows the NFA to save and store marked snippets of information as it works through the text. Think of Theseus trying to locate the Cretan Minotaur in the labyrinth at Knossos. He could use Ariadne's ball of silken thread to retrace his way out from blind alleys and dead ends, until in the end he found and slaughtered the Minotaur (or got his match). He could then get out of the maze, once again using the thread, bringing with him his life and his sword. The DFA would approach the situation differently. If there were 99 blind alleys, it would send in 100 gladiators, only one of which would find and kill the Minotaur (or get the match). Then all 100 gladiators would stay where they were (99 stuck up blind alleys, 1 in the central chamber), all incapable of going anywhere except forwards into the nearest wall. The task would be achieved, and the Minotaur would be dead, but no gladiators would be able to re-emerge into the light. This is illustrated in Figure C-3.
The other disadvantage of the DFA engine is that its swordsmen slaves would never benefit from a map. Their orders are always to just keep piling into the Labyrinth, like the Roman soldiers in The Life of Brian, until every possible cubby-hole, including the central chamber, is covered. In other words whatever regex you provide, as long as it's logically similar to another regex looking for the same match, it makes no speed difference to the match. The same 100 gladiators will always end up in the same 100 locations, one of which will happen to be the final match in the central chamber. NFA regexes, on the other hand, are very different. If you become skilled at regexes (or skilled at solving mazes), you can begin directing the route of Theseus beforehand by drawing him a map (or a better regex). And the better your map-drawing or predictive skills become, the fewer blind alleys Theseus will hit, the less backtracking he'll have to do, and the quicker he'll find the Minotaur. Because you can craft your regex in this way to speed up the game, the NFA appeals to code crafters and Perl hackers everywhere.

Figure C-3. DFA and NFA engines compared