4.6 Grammars and Rules
Perl 6 "regular expressions" are so far beyond the formal definition of regular expressions that we decided it was time for a more meaningful name.[15] We now call them "rules." Perl 6 rules bring the full power of recursive descent parsing to the core of Perl, but are comfortably useful even if you don't know anything about recursive descent parsing. A grammar is a collection of rules, in the same way that a class is a collection of methods.
4.6.1 Basic Rule Syntax
A rule is just a pattern for matching text. Rules can match right where they're defined, or they can be stored up to match later. Rules can be named or anonymous. They may be defined with variations on the familiar /.../ syntax, or using subroutine-like syntax with the keyword rule. Table 4-2 shows the basic syntax for defining rules.
?...? and (...) are no longer valid replacements for the /.../ delimiters, but you can use other standard quoting characters as replacement delimiters.
The unary context forcing operators, +, ?, and ~, interact with the bare /.../. +/.../ immediately matches and returns a count of matches. ?/.../ immediately matches and returns a boolean value of success or failure. ~/.../ immediately matches and returns the matched string value. The results are the ordinary behavior of /.../ in numeric, boolean, and string contexts. The bare /.../ also matches immediately in void context, or when it's an argument of the smart match operator (~~). In all other contexts, it constructs an anonymous rule.
4.6.2 Building Blocks
Every pattern is built out of a series of metacharacters, metasymbols, bracketing symbols, escape sequences, and assertions of various types. These are the basic vocabulary of pattern matching. The most basic set of metacharacters is shown in Table 4-3.
By default, rules ignore literal whitespace within the pattern. You can put the # comment marker at the end of any line. Just make sure you don't comment out the symbol that terminates the rule. Closures within bare {...} are always a successful zero-width match, unless they explicitly call the fail function. Assertions, marked with <...> delimiters, handle a variety of constructs, including character classes and user-defined quantifiers. The built-in quantifiers are shown in Table 4-4.
n..m is the range quantifier, so it uses the range operator "..". n... is shorthand for n..Inf and matches as many times as possible. Table 4-5 shows the escape sequences for special characters. With all the escape sequences that use brackets, (...), {...}, and <...> work in place of [...]. An ordinary variable now interpolates as a literal string, so \Q is rarely needed.
4.6.3 Modifiers
Modifiers alter the meaning of the pattern syntax. The standard position for modifiers is at the beginning of the rule, right after the m, s, or rx, or after the name in a named rule. Modifiers cannot attach to the outside of a bare /.../. For example:

    m:i/marvin/                               # case insensitive
    rule names :i { marvin | ford | arthur }

The single-character modifiers can be grouped, but the others must be separated by a colon:

    m:iwe/ zaphod /                           # Ok
    m:ignorecase:words:each/ zaphod /         # Ok
    m:ignorecasewordseach / zaphod /          # Not Ok

Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:

    m/:w I saw [:i zaphod] /    # only 'zaphod' is case insensitive

Really, it's only the repetition modifiers that can't be lexically scoped, because they alter the return value of the entire rule. Table 4-6 shows the current list of modifiers.
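Readers coming from conventional regex engines may find it helpful to compare lexically scoped modifiers with inline flag groups. The sketch below is an analogy written with Python's re module, not Perl 6 syntax: the constant SAW_ZAPHOD and the helper saw_zaphod are hypothetical names, and \s+ only approximates what :w does with literal whitespace.

```python
import re

# Rough Python analogue of: m/:w I saw [:i zaphod] /
# (?i:...) limits case insensitivity to the group, much as [:i ...] does.
# \s+ stands in for the whitespace handling of :w (an approximation).
SAW_ZAPHOD = re.compile(r"I\s+saw\s+(?i:zaphod)")

def saw_zaphod(text):
    """Return True if the phrase occurs, with only 'zaphod' case insensitive."""
    return SAW_ZAPHOD.search(text) is not None
```

With this sketch, saw_zaphod("I saw ZAPHOD") succeeds, while saw_zaphod("i saw zaphod") fails because the leading "I" sits outside the case-insensitive group.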
:w makes patterns sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit \s+ when it separates two identifiers and \s* everywhere else.
The :Nth modifier also has the alternate forms :Nst, :Nnd, and :Nrd for cases where it's more natural to write :1st, :2nd, :3rd than it is to write :1th, :2th, :3th. Either way is valid, so pick the one that's most comfortable for you.
There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That's why the "beginning of string" and "end of string" metasymbols now have "beginning of line" and "end of line" counterparts.
4.6.4 Assertions
Assertions hold many different constructs with many different purposes. In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Table 4-7 shows the syntax for assertions.
<(...)> is similar to {...}, in that it allows you to include straight Perl code within a rule. The difference is that <(...)> evaluates the return value of the closure in boolean context. The match succeeds if the return is true and fails if the return is false. A bare scalar within a pattern interpolates as a literal string, an array matches as a series of alternate literal strings, and by default a hash matches a word (\w+) and tries to find that word as one of its keys.[16] You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules.[17]
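The three default interpolation behaviors can be sketched outside Perl. The following Python functions are an illustrative analogy only: the names match_scalar, match_array, and match_hash are hypothetical, and the hash version simply anchors at the start of the text rather than at an arbitrary position in a larger pattern.

```python
import re

def match_scalar(value, text):
    """A bare scalar matches as a literal string: metacharacters are escaped."""
    return re.search(re.escape(value), text)

def match_array(values, text):
    """An array matches as a series of alternatives, each a literal string."""
    alternation = "|".join(re.escape(v) for v in values)
    return re.search(alternation, text)

def match_hash(table, text):
    """A hash matches a word and succeeds only if that word is one of its keys."""
    m = re.match(r"\w+", text)
    if m and m.group(0) in table:
        return m.group(0)
    return None
```

For example, match_scalar("a.b", "axb") finds nothing because the dot is treated literally, and match_hash({"zaphod": 1}, "ford rules") fails because "ford" is not a key.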
4.6.5 Built-in Rules
A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but Table 4-8 shows a few you're likely to see.
The null pattern // is no longer valid syntax for rules. The built-in rule <null> matches a zero-width string (so it's always true) and <prior> matches whatever the most recent successful rule matched.
4.6.6 Backtracking Control
Backtracking is triggered whenever part of the pattern fails to match. You can also explicitly trigger backtracking by calling the fail function within a closure. Table 4-9 shows some metacharacters and built-in rules relevant to backtracking.
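Backtracking in its basic form works much as it does in other regex engines. As a small illustration using Python's re module (an analogy, not Perl 6 syntax, with a pattern made up for this example), a greedy quantifier first takes everything it can and then gives characters back when the rest of the pattern fails to match:

```python
import re

# a+ first grabs "aaa"; the literal "ab" that follows then fails at "b",
# so the engine backtracks and a+ gives back one "a".
m = re.match(r"(a+)(ab)", "aaab")
first, second = m.group(1), m.group(2)
print(first, second)  # prints: aa ab
```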
4.6.7 Hypothetical Variables
Hypothetical variables are a powerful way of building up data structures from within a match. An ordinary capture with (...) stores the result of the capture in $1, $2, etc. The values stored in these variables will be kept if the match is successful, but thrown away if the match fails (hence the term "hypothetical"). The numbered capture variables are accessible outside the match, but only within the immediate surrounding lexical scope:

    "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;
    print $1; # prints Zaphod

You can also capture into any user-defined variable with the binding operator :=. These variables must already be defined in the lexical scope surrounding the rule:

    my $person;
    "Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
    print $person; # prints Zaphod

Repeated matches can be captured into an array:

    my @words;
    "feefifofum" ~~ / @words := (f<-[f]>+)* /;
    # ("fee", "fi", "fo", "fum")

Pairs of repeated matches can be captured into a hash:

    my %customers;
    $records ~~ m:w/ %customers := [ <id> = <name> \n]* /;

If you don't need the captured value outside the rule, use a $? variable instead. These are lexically scoped to the rule:

    "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;

A match of a named rule stores the result in a $? variable with the same name as the rule. These variables are also accessible only within the rule:

    "Zaphod saw Zaphod" ~~ m:w/ <name> \w+ $?name /;

4.6.8 Grammars
Sometimes you don't want one rule; you need a whole collection of rules, especially for complex text parsing. Rules live in a grammar, like methods live in a class. In fact, grammars are classes; they're just classes that inherit from the universal Rule class. This means that grammars can inherit from other grammars, and that they define a namespace for their rules.

    grammar Hitchhikers {
        rule name :w {  Zaphod :: [Beeblebrox]?
                      | Ford :: [Prefect]?
                      | Arthur :: [Dent]?
                     }
        rule id :w { \d<10> }
    }

Any rule in the current grammar or in one of its parents can be called directly, but rules from other grammars need to have their package specified:

    if $newsrelease ~~ /<Hitchhikers.name>/ {
        send_alert($1);
    }
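The claim that rules live in a grammar the way methods live in a class can be made concrete with a rough Python sketch. Everything here is an assumption for illustration: the classes only approximate the Hitchhikers grammar, the :: backtracking control is not modelled, and the derived class Guide with its record rule is hypothetical rather than something defined in the text.

```python
import re

class Hitchhikers:
    """Hypothetical Python stand-in for the grammar: one method per rule."""

    # Approximation of rule name; the :: backtracking control is not modelled.
    _name = re.compile(
        r"Zaphod(?:\s+Beeblebrox)?|Ford(?:\s+Prefect)?|Arthur(?:\s+Dent)?"
    )
    # Approximation of rule id: ten digits.
    _id = re.compile(r"\d{10}")

    @classmethod
    def name(cls, text, pos=0):
        """Return the name matched at pos, or None if there is none."""
        m = cls._name.match(text, pos)
        return m.group(0) if m else None

    @classmethod
    def id(cls, text, pos=0):
        """Return the ten-digit id matched at pos, or None."""
        m = cls._id.match(text, pos)
        return m.group(0) if m else None


class Guide(Hitchhikers):
    """A derived grammar (hypothetical): inherits name and id, adds record."""

    @classmethod
    def record(cls, text):
        """Match '<id> = <name>' by calling the inherited rules in sequence."""
        ident = cls.id(text)
        if ident is None:
            return None
        sep = re.compile(r"\s*=\s*").match(text, len(ident))
        if sep is None:
            return None
        who = cls.name(text, sep.end())
        return (ident, who) if who is not None else None
```

In this sketch, Guide.record("0000000042 = Arthur Dent") returns both pieces because the inherited rules are found through ordinary class inheritance, which is the same mechanism the text describes for grammars and their parents.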