
5.5. Quantifiers

Unless you say otherwise, each item in a regular expression matches just once. With a pattern like /nop/, each of those characters must match, each right after the other. Words like "panoply" or "xenophobia" are fine, because where the match occurs doesn't matter.

If you wanted to match both "xenophobia" and "Snoopy", you couldn't use the /nop/ pattern, since that requires just one "o" between the "n" and the "p", and Snoopy has two. This is where quantifiers come in handy: they say how many times something may match, instead of the default of matching just once. Quantifiers in a regular expression are like loops in a program; in fact, if you think of a regex as a program, then they are loops. Some loops are exact, like "repeat this match five times only" ({5}). Others give both lower and upper bounds on the match count, like "repeat this match at least twice but no more than four times" ({2,4}). Others have no closed upper bound at all, like "match this at least twice, but as many times as you'd like" ({2,}).
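The quantifier forms just described can be tried out with a minimal sketch like the following. The word list and the %matched hash are illustrative assumptions, not part of the original text; "snoooop" and "np" are made-up strings chosen to exercise the bounds.

```perl
use strict;
use warnings;

# Hypothetical test words: one "o", two "o"s, four "o"s, and none.
my @words = ("xenophobia", "Snoopy", "snoooop", "np");

my %matched;
for my $word (@words) {
    $matched{$word} = {
        plain       => ($word =~ /nop/      ? 1 : 0),  # exactly one "o" (the default)
        plus        => ($word =~ /no+p/     ? 1 : 0),  # one or more "o"s
        exactly2    => ($word =~ /no{2}p/   ? 1 : 0),  # exactly two "o"s
        one_or_two  => ($word =~ /no{1,2}p/ ? 1 : 0),  # at least one, at most two
        two_or_more => ($word =~ /no{2,}p/  ? 1 : 0),  # at least two, no upper bound
    };
    printf "%-12s nop=%d no+p=%d no{2}p=%d no{1,2}p=%d no{2,}p=%d\n",
        $word, @{ $matched{$word} }{qw(plain plus exactly2 one_or_two two_or_more)};
}
```

Of these patterns, /no+p/ and /no{1,2}p/ accept both "xenophobia" and "Snoopy"; which one is the better fit depends on whether longer runs of "o" should also be allowed.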

Table 5-12 shows the quantifiers that Perl recognizes in a pattern.

Table 5-12. Regex Quantifiers Compared

Maximal      Minimal       Allowed Range
{MIN,MAX}    {MIN,MAX}?    Must occur at least MIN times but no more than MAX times
{MIN,}       {MIN,}?       Must occur at least MIN times
{COUNT}      {COUNT}?      Must match exactly COUNT times
*            *?            0 or more times (same as {0,})
+            +?            1 or more times (same as {1,})
?            ??            0 or 1 time (same as {0,1})

Something with a * or a ? doesn't actually have to match. That's because they can match 0 times and still be considered a success. A + may often be a better fit, since it has to be there at least once.
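A small sketch of that difference, using a made-up string that contains no digits at all. The variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "no digits here";

# \d* may match zero digits, so it succeeds immediately with an empty match.
my ($star) = $text =~ /(\d*)/;

# \d+ needs at least one digit, so the match fails and $plus stays undefined.
my ($plus) = $text =~ /(\d+)/;

print defined $star ? "star matched '$star'\n" : "star failed\n";   # star matched ''
print defined $plus ? "plus matched '$plus'\n" : "plus failed\n";   # plus failed
```

If the goal is to test whether any digits are present, the + form is the one that answers that question.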

Don't be confused by the use of "exactly" in the previous table. It refers only to the repeat count, not the overall string. For example, $n =~ /\d{3}/ doesn't say "is this string exactly three digits long?" It asks whether there's any point within $n at which three digits occur in a row. Strings like "101 Morris Street" test true, but so do strings like "95472" or "1-800-555-1212". All contain three digits at one or more points, which is all you asked about. See Section 5.6, "Positions" for how to use positional assertions (as in /^\d{3}$/) to nail this down.
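The contrast between the unanchored and anchored forms can be checked with a short sketch. The %loose and %anchored hashes are illustrative names, and "101" is an extra input added here so that one string passes the anchored test.

```perl
use strict;
use warnings;

my @inputs = ("101 Morris Street", "95472", "1-800-555-1212", "101");

my (%loose, %anchored);
for my $s (@inputs) {
    # True if three digits in a row occur anywhere in the string.
    $loose{$s} = $s =~ /\d{3}/ ? 1 : 0;

    # True only if the string consists of three digits
    # (note that $ also tolerates a single trailing newline).
    $anchored{$s} = $s =~ /^\d{3}$/ ? 1 : 0;

    printf "%-20s loose=%d anchored=%d\n", $s, $loose{$s}, $anchored{$s};
}
```

All four inputs pass the unanchored test, but only "101" passes the anchored one.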

Given the opportunity to match something a variable number of times, maximal quantifiers will elect to maximize the repeat count. So when we say "as many times as you'd like", the greedy quantifier interprets this to mean "as many times as you can possibly get away with", constrained only by the requirement that this not cause specifications later in the match to fail. If a pattern contains two open-ended quantifiers, then obviously both cannot consume the entire string: characters used by one part of the match are no longer available to a later part. Each quantifier is greedy at the expense of those that follow it, reading the pattern left to right.
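As a rough illustration of one quantifier being greedy at the expense of the next, consider two open-ended quantifiers side by side. The strings and variable names below are assumptions made up for this sketch.

```perl
use strict;
use warnings;

# Both groups want as many digits as possible, but the first one is served
# first and gives back only the single digit the second group requires.
my ($first, $second) = "123456" =~ /(\d+)(\d+)/;
print "first=$first second=$second\n";      # first=12345 second=6

# The same effect with .* in front: it swallows every digit except the one
# that \d+ must have in order to match at all.
my ($prefix, $digits) = "abc123" =~ /(.*)(\d+)/;
print "prefix=$prefix digits=$digits\n";    # prefix=abc12 digits=3
```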

That's the traditional behavior of quantifiers in regular expressions. However, Perl permits you to reform the behavior of its quantifiers: by placing a ? after that quantifier, you change it from maximal to minimal. That doesn't mean that a minimal quantifier will always match the smallest number of repetitions allowed by its range, any more than a maximal quantifier must always match the greatest number allowed in its range. The overall match must still succeed, and the minimal match will take as much as it needs to succeed, and no more. (Minimal quantifiers value contentment over greed.)
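A brief sketch of both halves of that claim, with made-up strings: a minimal quantifier takes only what it needs, but what it needs can be the top of its range, and a maximal quantifier can end up at the bottom of its range.

```perl
use strict;
use warnings;

# The minimal \d+? settles for one digit, leaving the rest to the greedy \d+.
my ($min_part, $rest) = "123456" =~ /(\d+?)(\d+)/;
print "min_part=$min_part rest=$rest\n";    # min_part=1 rest=23456

# Minimal, yet it matches the maximum of its range: the "b" that must follow
# is only reachable after all three "a"s have been consumed.
my ($lazy) = "aaab" =~ /(a{1,3}?)b/;
print "lazy=$lazy\n";                       # lazy=aaa

# Maximal, yet it matches the minimum of its range: there is only one "a".
my ($greedy) = "ab" =~ /(a{1,3})b/;
print "greedy=$greedy\n";                   # greedy=a
```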

For example, in the match:

"exasperate" =~ /e(.*)e/    #  $1 now "xasperat"
the .* matches "xasperat", the longest possible string for it to match. (It also stores that value in $1, as described in Section 5.7, "Capturing and Clustering" later in the chapter.) Although a shorter match was available, a greedy match doesn't care. Given two choices at the same starting point, it always returns the longer of the two.

Contrast that with this:

"exasperate" =~ /e(.*?)e/   #  $1 now "xasp"
Here, the minimal matching version, .*?, is used. Adding the ? to * makes *? take on the opposite behavior: now given two choices at the same starting point, it always returns the shorter of the two.

Although you could read *? as saying to match zero or more of something but preferring zero, that doesn't mean it will always match zero characters. If it did so here, for example, and left $1 set to "", then the second "e" wouldn't be found, since it doesn't immediately follow the first one.

You might also wonder why, in minimally matching /e(.*?)e/, Perl didn't stick "rat" into $1. After all, "rat" also falls between two e's, and is shorter than "xasp". In Perl, the minimal/maximal choice applies only when selecting the shortest or longest from among several matches that all have the same starting point. If two possible matches exist, but these start at different offsets in the string, then their lengths don't matter--nor does it matter whether you've used a minimal quantifier or a maximal one. The earliest of several valid matches always wins out over all latecomers. It's only when multiple possible matches start at the same point that you use minimal or maximal matching to break the tie. If the starting points differ, there's no tie to break. Perl's matching is normally leftmost longest; with minimal matching, it becomes leftmost shortest. But the "leftmost" part never varies and is the dominant criterion.[7]

[7] Not all regex engines work this way. Some believe in overall greed, in which the longest match always wins, even if it shows up later. Perl isn't that way. You might say that eagerness holds priority over greed (or thrift). For a more formal discussion of this principle and many others, see Section 5.9.4, "The Little Engine That /Could(n't)?/".
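The leftmost rule can be observed directly with Perl's @- and @+ arrays, which hold the start and end offsets of the last successful match. This is only a sketch; the variable names are assumptions introduced here.

```perl
use strict;
use warnings;

my $word = "exasperate";

# Minimal match: starts at offset 0, so "rat" (which would need a start
# at offset 5) is never considered.
my ($min_cap, $min_start, $min_end);
if ($word =~ /e(.*?)e/) {
    ($min_cap, $min_start, $min_end) = ($1, $-[0], $+[0]);
}

# Maximal match: same starting offset, but it runs to the last "e".
my ($max_cap, $max_start, $max_end);
if ($word =~ /e(.*)e/) {
    ($max_cap, $max_start, $max_end) = ($1, $-[0], $+[0]);
}

print "minimal: '$min_cap' spans $min_start..$min_end\n";   # minimal: 'xasp' spans 0..6
print "maximal: '$max_cap' spans $max_start..$max_end\n";   # maximal: 'xasperat' spans 0..10
```

Both matches begin at offset 0; the quantifier's flavor changes only where the match ends.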

There are two ways to defeat the leftward leanings of the pattern matcher. First, you can use an earlier greedy quantifier (typically .*) to try to slurp earlier parts of the string. In searching for a match for a greedy quantifier, Perl tries for the longest match first, which effectively searches the rest of the string right-to-left:

"exasperate" =~ /.*e(.*?)e/   #  $1 now "rat"
But be careful with that, since the overall match now includes the entire string up to that point.
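One place this caveat tends to bite is substitution, where the whole matched span is replaced. The following sketch, with illustrative variable names, compares the two patterns.

```perl
use strict;
use warnings;

my $word = "exasperate";

# With the leading .*, the overall match covers the string from offset 0
# through the final "e", even though $1 holds only "rat".
my ($captured, $whole);
if ($word =~ /.*e(.*?)e/) {
    $captured = $1;
    $whole    = substr($word, $-[0], $+[0] - $-[0]);
}
print "captured='$captured' whole='$whole'\n";   # captured='rat' whole='exasperate'

# In a substitution, everything the pattern matched is replaced.
(my $with_slurp = $word) =~ s/.*e(.*?)e/[$1]/;
(my $without    = $word) =~ s/e(.*?)e/[$1]/;
print "$with_slurp\n";   # [rat]
print "$without\n";      # [xasp]rate
```

If the text before the target needs to survive, one option is to capture it as well, as in /(.*)e(.*?)e/, and put it back in the replacement.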

The second way to defeat leftmostness is to use positional assertions, discussed in the next section.




Copyright © 2002 O'Reilly & Associates. All rights reserved.