4.6 Grammars and Rules

Perl 6 "regular expressions" are so far beyond the formal definition of regular expressions that we decided it was time for a more meaningful name.^[15] We now call them "rules." Perl 6 rules bring the full power of recursive descent parsing to the core of Perl, but are comfortably useful even if you don't know anything about recursive descent parsing. A grammar is a collection of rules, in the same way that a class is a collection of methods.

^[15] Regular expressions describe regular languages, and consist of three primitives and a limited set of operations (three or so, depending on the formulation). So, even Perl 5 "regular expressions" weren't formal regular expressions.

4.6.1 Basic Rule Syntax

A rule is just a pattern for matching text. Rules can match right where they're defined, or they can be stored up to match later. Rules can be named or anonymous. They may be defined with variations on the familiar /.../ syntax, or using subroutine-like syntax with the keyword rule. Table 4-2 shows the basic syntax for defining rules.

Table 4-2. Rules

Syntax               Meaning
m/.../               Match a pattern (immediate execution).
s/.../.../           Perform a substitution (immediate execution).
rx/.../              Define an anonymous rule (deferred execution).
/.../                Immediately match or define an anonymous rule, depending on the context.
rule { ... }         Define an anonymous rule.
rule name { ... }    Define a named rule.

The ?...? and (...) delimiters are no longer valid replacements for /.../, but you can still use other standard quoting characters as replacement delimiters.

The unary context-forcing operators +, ?, and ~ interact with the bare /.../: +/.../ immediately matches and returns a count of matches, ?/.../ immediately matches and returns a boolean value of success or failure, and ~/.../ immediately matches and returns the matched string. These results are simply the ordinary behavior of /.../ in numeric, boolean, and string contexts. The bare /.../ also matches immediately in void context, or when it's an argument of the smart match operator (~~). In all other contexts, it constructs an anonymous rule.
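As a minimal sketch of the forms in Table 4-2, the following fragment stores a rule for later use and also matches one immediately. The variable names, the rule name robot, and the sample string are illustrative assumptions, and the last test assumes that the smart match operator accepts a rule held in a scalar:

```perl6
my $text = "marvin the paranoid android";   # sample string (assumed)

my $stored = rx/ marvin /;       # deferred: an anonymous rule kept in a scalar
my $also   = rule { marvin };    # the same anonymous rule in subroutine-like syntax
rule robot { marvin }            # a named rule, callable elsewhere as <robot>

if $text ~~ m/ marvin / {        # immediate match
    print "matched immediately";
}
if $text ~~ $stored {            # deferred rule, matched later (assumed usage)
    print "matched the stored rule";
}
```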

4.6.2 Building Blocks

Every pattern is built out of a series of metacharacters, metasymbols, bracketing symbols, escape sequences, and assertions of various types. These are the basic vocabulary of pattern matching. The most basic set of metacharacters is shown in Table 4-3.

Table 4-3. Metacharacters

Symbol   Meaning
.        Match any single character, including a newline.
^        Match the beginning of a string.
$        Match the end of a string.
^^       Match the beginning of a line.
$$       Match the end of a line.
|        Separate alternate patterns.
\        Escape a metacharacter to get a literal character, or escape a literal character to get a metacharacter.
#        Mark a comment (to the end of the line).
:=       Bind the result of a match to a hypothetical variable.
(...)    Group patterns and capture the result.
[...]    Group patterns without capturing.
{...}    Execute a closure (Perl 6 code) within a rule.
<...>    Assertion delimiters.

By default, rules ignore literal whitespace within the pattern. You can put the # comment marker at the end of any line. Just make sure you don't comment out the symbol that terminates the rule. Closures within bare {...} are always a successful zero-width match, unless they explicitly call the fail function. Assertions, marked with <...> delimiters, handle a variety of constructs, including character classes and user-defined quantifiers. The built-in quantifiers are shown in Table 4-4.

Table 4-4. Quantifiers

Maximal   Minimal    Meaning
*         *?         Match 0 or more times.
+         +?         Match 1 or more times.
?         ??         Match 0 or 1 times.
<n>       <n>?       Match exactly n times.
<n..m>    <n..m>?    Match at least n and no more than m times.
<n...>    <n...>?    Match at least n times.

n..m is the range quantifier, so it uses the range operator "..". n... is shorthand for n..Inf and matches as many times as possible.
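The following sketch combines a few of the metacharacters from Table 4-3 with the quantifiers from Table 4-4. The $line variable, its contents, and the patterns themselves are illustrative assumptions, written only with the syntax described above:

```perl6
my $line = "<42> 12345";        # sample input (assumed)

$line ~~ m/ \d<5> /;            # exactly five digits
$line ~~ m/ \w<2..8> /;         # at least two and no more than eight word characters
$line ~~ m/ \d<3...> /;         # three or more digits
$line ~~ m/ \< (.*?) \> /;      # minimal match: capture up to the first closing bracket
$line ~~ m/ ^ [ \< | \d ] /;    # string starts with a bracket or a digit, without capturing
```

Because whitespace inside the pattern is ignored by default, the spacing here is purely for readability.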

Table 4-5 shows the escape sequences for special characters. In every escape sequence that takes brackets, (...), {...}, and <...> can be used in place of [...]. An ordinary variable now interpolates as a literal string, so \Q is rarely needed.

Table 4-5. Escape sequences

Escape     Meaning
\0[...]    Match a character given in octal (brackets optional).
\b         Match a word boundary.
\B         Match when not on a word boundary.
\c[...]    Match a named character or control character.
\C[...]    Match any character except the bracketed named or control character.
\d         Match a digit.
\D         Match a non-digit.
\e         Match an escape character.
\E         Match anything but an escape character.
\f         Match the form feed character.
\F         Match anything but a form feed.
\n         Match a newline.
\N         Match anything but a newline.
\h         Match horizontal whitespace.
\H         Match anything but horizontal whitespace.
\L[...]    Everything within the brackets is lowercase.
\r         Match a return.
\R         Match anything but a return.
\s         Match any whitespace character.
\S         Match anything but whitespace.
\t         Match a tab.
\T         Match anything but a tab.
\U[...]    Everything within the brackets is uppercase.
\v         Match vertical whitespace.
\V         Match anything but vertical whitespace.
\w         Match a word character (Unicode alphanumeric plus "_").
\W         Match anything but a word character.
\x[...]    Match a character given in hexadecimal (brackets optional).
\X[...]    Match anything but the character given in hexadecimal (brackets optional).
\Q[...]    All metacharacters within the brackets match as literal characters.

4.6.3 Modifiers

Modifiers alter the meaning of the pattern syntax. The standard position for modifiers is at the beginning of the rule, right after the m, s, or rx, or after the name in a named rule. Modifiers cannot attach to the outside of a bare /.../. For example:

m:i/marvin/ # case insensitive
rule names :i { marvin | ford | arthur }

The single-character modifiers can be grouped, but the others must be separated by a colon:

m:iwe/ zaphod / # Ok
m:ignorecase:words:each/ zaphod / # Ok
m:ignorecasewordseach / zaphod / # Not Ok

Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:

m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive

Really, it's only the repetition modifiers that can't be lexically scoped, because they alter the return value of the entire rule. Table 4-6 shows the current list of modifiers.

Table 4-6. Modifiers

Short   Long            Meaning
:i      :ignorecase     Case-insensitive match.
:I                      Case-sensitive match (on by default).
:c      :cont           Continue where the previous match on the string left off.
:w      :words          Literal whitespace in the pattern matches as \s+ or \s*.
:W                      Turn off intelligent whitespace matching (return to default).
        :Nx/:x(N)       Match the pattern N times.
        :Nth/:nth(N)    Match the Nth occurrence of a pattern.
        :once           Match the pattern only once.
:e      :each           Match the pattern as many times as possible, but only possibilities that don't overlap.
        :any            Match every possible occurrence of a pattern, even overlapping possibilities.
        :u0             . is a byte.
        :u1             . is a Unicode codepoint.
        :u2             . is a Unicode grapheme.
        :u3             . is language dependent.
        :p5             The pattern uses Perl 5 regex syntax.

:w makes patterns sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit \s+ when it separates two identifiers and \s* everywhere else.

The :Nth modifier also has the alternate forms :Nst, :Nnd, and :Nrd for cases where it's more natural to write :1st, :2nd, :3rd than it is to write :1th, :2th, :3th. Either way is valid, so pick the one that's most comfortable for you.
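As a brief sketch of the repetition modifiers from Table 4-6, the lines below show the paired spellings side by side. The sample string and the $text variable are assumptions for illustration:

```perl6
my $text = "zaphod zaphod zaphod";   # sample string (assumed)

$text ~~ m:2nd/ zaphod /;       # match the second occurrence
$text ~~ m:nth(2)/ zaphod /;    # the same, using the :nth(N) form
$text ~~ m:3x/ zaphod /;        # match the pattern three times
$text ~~ m:x(3)/ zaphod /;      # the same, using the :x(N) form
$text ~~ m:i:each/ ZAPHOD /;    # every non-overlapping match, ignoring case
```

The parenthesized forms are presumably the ones to reach for when the count is held in a variable rather than written out.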

There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That's why the "beginning of string" (^) and "end of string" ($) metacharacters now have "beginning of line" (^^) and "end of line" ($$) counterparts.

4.6.4 Assertions

Assertions hold many different constructs with many different purposes. In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Table 4-7 shows the syntax for assertions.

Table 4-7. Assertions

Syntax        Meaning
<...>         Generic assertion delimiter.
<name>        Match a named rule or character class.
<!...>        Negate any assertion.
<[...]>       Match an enumerated character class.
<-...>        Complement a character class (named or enumerated).
<"...">       Match a literal string (interpolated at match time).
<'...'>       Match a literal string (not interpolated).
<(...)>       Boolean assertion. Execute a closure and match if it returns a true result.
<$scalar>     Match an anonymous rule.
<@array>      Match a series of anonymous rules as alternates.
<%hash>       Match a key from the hash, then its value (which is an anonymous rule).
<&sub( )>     Match an anonymous rule returned by a sub.
<{ code }>    Match an anonymous rule returned by a closure.
<.>           Match any logical grapheme, including combining character sequences.

<(...)> is similar to {...}, in that it allows you to include straight Perl code within a rule. The difference is that <(...)> evaluates the return value of the closure in boolean context. The match succeeds if the return is true and fails if the return is false.

A bare scalar within a pattern interpolates as a literal string, an array matches as a series of alternate literal strings, and by default a hash matches a word (\w+) and tries to find that word as one of its keys.^[16] You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules.^[17]

^[16] The effect is much as if it matched the keys as a series of alternates, but you're guaranteed to match the longest possible key, instead of just the first one it hits in random order.

^[17] This is the old Perl 5 behavior of a variable interpolating as a regex, but with a kick.
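The difference between a bare variable and a variable inside assertion delimiters can be sketched as follows. All variable names and sample values here are assumptions made for illustration:

```perl6
my $text = "zaphod and ford";       # sample string (assumed)
my $name = "zaphod";                # a plain string
my $word = rx/ \w+ /;               # an anonymous rule stored in a scalar
my @crew = ("ford", "arthur");      # a list of plain strings

$text ~~ m/ $name /;                # bare scalar: matches the literal string "zaphod"
$text ~~ m/ @crew /;                # bare array: matches "ford" or "arthur" literally
$text ~~ m/ <$word> /;              # in assertion delimiters: matches as a rule
$text ~~ m/ <[aeiou]> <-[aeiou]> /; # a vowel, then any character that is not a vowel
```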

4.6.5 Built-in Rules

A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but Table 4-8 shows a few you're likely to see.

Table 4-8. Built-in rules

Rule              Meaning
<alpha>           Match a Unicode alphabetic character.
<digit>           Match a digit.
<sp>              Match a single space character (the same as \s).
<ws>              Match any whitespace (the same as \s+).
<null>            Match the null string.
<prior>           Match the same thing as the previous match.
<before ...>      Lookahead. Assert that you're before a pattern.
<after ...>       Lookbehind. Assert that you're after a pattern.
<prop ...>        Match any character with the named property.
<replace(...)>    Replace everything matched so far in the rule or subrule with the given string (under consideration).

The null pattern // is no longer valid syntax for rules. The built-in rule <null> matches a zero-width string (so it's always true) and <prior> matches whatever the most recent successful rule matched.
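A short sketch of a few built-in rules from Table 4-8 in use follows. The sample string and the patterns are assumptions, and since the list of built-ins was not yet final, treat this as illustrative rather than definitive:

```perl6
my $text = "Zaphod Beeblebrox";     # sample string (assumed)

$text ~~ m/ <alpha>+ <before \s+ Beeblebrox> /;   # letters, only if "Beeblebrox" follows
$text ~~ m/ <after Zaphod \s> <alpha>+ /;         # letters, only if "Zaphod " precedes
$text ~~ m/ <alpha>+ [ <digit>+ | <null> ] /;     # letters, then digits or nothing at all
```

The last line shows where <null> earns its keep: it stands in for the empty alternative that the old // pattern can no longer express.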

4.6.6 Backtracking Control

Backtracking is triggered whenever part of the pattern fails to match. You can also explicitly trigger backtracking by calling the fail function within a closure. Table 4-9 shows some metacharacters and built-in rules relevant to backtracking.

Table 4-9. Backtracking controls

Operator    Meaning
:           Don't retry the previous atom, fail to the next earlier atom.
::          Don't backtrack over this point, fail out of the closest enclosing group ((...), [...], or the rule delimiters).
:::         Don't backtrack over this point, fail out of the current rule or subrule.
<commit>    Don't backtrack over this point, fail out of the entire match (even from within a subrule).
<cut>       Like <commit>, but also cuts the string matched. The point of the <cut> becomes the new beginning of the string.

4.6.7 Hypothetical Variables

Hypothetical variables are a powerful way of building up data structures from within a match. An ordinary capture with (...) stores the result of the capture in $1, $2, etc. The values stored in these variables will be kept if the match is successful, but thrown away if the match fails (hence the term "hypothetical"). The numbered capture variables are accessible outside the match, but only within the immediate surrounding lexical scope:

"Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;

print $1; # prints Zaphod

You can also capture into any user-defined variable with the binding operator :=. These variables must already be defined in the lexical scope surrounding the rule:

my $person;
"Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
print $person; # prints Zaphod

Repeated matches can be captured into an array:

my @words;
"feefifofum" ~~ / @words := (f<-[f]>+)* /;
# @words now holds ("fee", "fi", "fo", "fum")

Pairs of repeated matches can be captured into a hash:

my %customers;
$records ~~ m:w/ %customers := [ <id> = <name> \n]* /;

If you don't need the captured value outside the rule, use a $? variable instead. These are lexically scoped to the rule:

"Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;

A match of a named rule stores the result in a $? variable with the same name as the rule. These variables are also accessible only within the rule:

"Zaphod saw Zaphod" ~~ m:w/ <name> \w+ $?name /;

4.6.8 Grammars

Sometimes one rule isn't enough and you need a whole collection of rules, especially for complex text parsing. Rules live in a grammar, just as methods live in a class. In fact, grammars are classes; they're just classes that inherit from the universal Rule class. This means that grammars can inherit from other grammars, and that they define a namespace for their rules.

grammar Hitchhikers {
    rule name :w {
        Zaphod :: [Beeblebrox]?
        | Ford :: [Prefect]?
        | Arthur :: [Dent]?
    }

    rule id :w { \d<10> }
}

Any rule in the current grammar or in one of its parents can be called directly, but rules from other grammars need to have their package specified:

if $newsrelease ~~ /(<Hitchhikers.name>)/ {   # parentheses capture the match into $1
    send_alert($1);
}
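Grammar inheritance might look like the sketch below. The grammar name Galaxy and the rule name passenger are hypothetical, and the use of the is keyword to name a parent grammar is an assumption borrowed from the way classes declare a parent:

```perl6
grammar Galaxy is Hitchhikers {
    # <id> and <name> are inherited from Hitchhikers,
    # so they can be called without a package prefix
    rule passenger :w { <id> <name> }
}
```

Code outside either grammar would then presumably refer to the new rule as <Galaxy.passenger>, following the package-qualified form shown above.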