TITLE

Synopsis 5: Regexes and Rules

AUTHORS

Damian Conway <damian@conway.org> and Allison Randal <al@shadowed.net>

VERSION

   Maintainer: Patrick Michaud <pmichaud@pobox.com> and
               Larry Wall <larry@wall.org>
   Date: 24 Jun 2002
   Last Modified: 8 Oct 2008
   Number: 5
   Version: 85

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than "regular expressions" because they haven't been regular expressions for a long time, and we think the popular term "regex" is in the process of becoming a technical term with a precise meaning of: "something you do pattern matching with, kinda like a regular expression". On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

New match result and capture variables

The underlying match result object is now available as the $/ variable, which is implicitly lexically scoped. All user access to the most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as $0, $1, etc.) are just elements of $/.

By the way, unlike in Perl 5, the numbered capture variables now start at $0 instead of $1. See below.

Unchanged syntactic features

The following regex features use the same syntax as in Perl 5:

While the syntax of | does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the best of both. In short, you need not write your own tokener for a grammar because Perl will write one for you. See the section below on "Longest-token matching".

Simplified lexical parsing of patterns

Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it classifies characters by a simple rule. All glyphs (graphemes) whose base characters are either the underscore (_) or have a Unicode classification beginning with 'L' (i.e. letters) or 'N' (i.e. numbers) are always literal (i.e. self-matching) in regexes. They must be escaped with a \ to make them metasyntactic (in which case that single alphanumeric character is itself metasyntactic, but any immediately following alphanumeric character is not).

All other glyphs--including whitespace--are exactly the opposite: they are always considered metasyntactic (i.e. non-self-matching) and must be escaped or quoted to make them literal. As is traditional, they may be individually escaped with \, but in Perl 6 they may be also quoted as follows.

Sequences of one or more glyphs of either type (i.e. any glyphs at all) may be made literal by placing them inside single quotes. (Double quotes are also allowed, with the same interpolative semantics as the current language in which the regex is lexically embedded.) Quotes create a quantifiable atom, so while

    moose*

quantifies only the 'e' and matches "mooseee", saying

    'moose'*

quantifies the whole string and would match "moosemoose".

Here is a table that summarizes the distinctions:

                 Alphanumerics        Non-alphanumerics         Mixed

 Literal glyphs   a    1    _        \*  \$  \.   \\   \'       K\-9\!
 Metasyntax      \a   \1   \_         *   $   .    \    '      \K-\9!
 Quoted glyphs   'a'  '1'  '_'       '*' '$' '.' '\\' '\''     'K-9!'

In other words, identifier glyphs are literal (or metasyntactic when escaped), non-identifier glyphs are metasyntactic (or literal when escaped), and single quotes make everything inside them literal.

Note, however, that not all non-identifier glyphs are currently meaningful as metasyntax in Perl 6 regexes (e.g. \1 \_ - !). It is more accurate to say that all unescaped non-identifier glyphs are potential metasyntax, and reserved for future use. If you use such a sequence, a helpful compile-time error is issued indicating that you either need to quote the sequence or define a new operator to recognize it.

Modifiers

Changed metacharacters

New metacharacters

Bracket rationalization

Variable (non-)interpolation

Extensible metasyntax (<...>)

Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions.

For matched pairs, the first character after < determines the nature of the assertion:

The following tokens include angles but are not required to balance:

Backslash reform

Regexes are now first-class language, not strings

Backtracking control

Within those portions of a pattern that are considered procedural rather than declarative, you may control the backtracking behavior.

Regex Routines, Named and Anonymous

Nothing is illegal

Longest-token matching

Instead of representing temporal alternation, | now represents logical alternation with declarative longest-token semantics. (You may now use || to indicate the old temporal alternation. That is, | and || now work within regex syntax much the same as they do outside of regex syntax, where they represent junctional and short-circuit OR. This includes the fact that | has tighter precedence than ||.)

Historically regex processing has proceeded in Perl via a backtracking NFA algorithm. This is quite powerful, but many parsers work more efficiently by processing rules in parallel rather than one after another, at least up to a point. If you look at something like a yacc grammar, you find a lot of pattern/action declarations where the patterns are considered in parallel, and eventually the grammar decides which action to fire off. While the default Perl view of parsing is essentially top-down (perhaps with a bottom-up "middle layer" to handle operator precedence), it is extremely useful for user understanding if at least the token processing proceeds deterministically. So for regex matching purposes we define token patterns as those patterns containing no whitespace that can be matched without side effects or self-reference. Basically, Perl automatically derives a lexer from the grammar without you having to write one yourself.

To that end, every regex in Perl 6 is required to be able to distinguish its "pure" patterns from its actions, and return its list of initial token patterns (transitively including the token patterns of any subrule called by the "pure" part of that regex, but not including any subrule more than once, since that would involve self reference, which is not allowed in traditional regular expressions). A logical alternation using | then takes two or more of these lists and dispatches to the alternative that matches the longest token prefix. This may or may not be the alternative that comes first lexically.

However, if two alternatives match at the same length, the tie is broken by one of two methods. If the alternatives are in different grammars, standard MRO (method resolution order) determines which one to try first. If the alternatives are in the same grammar, the textually earlier alternative takes precedence. (If a grammar's rules are defined in more than one file, the results are undefined.)

This longest token prefix corresponds roughly to the notion of "token" in other parsing systems that use a lexer, but in the case of Perl this is largely an epiphenomenon derived automatically from the grammar definition. However, despite being automatically calculated, the set of tokens can be modified by the user; various constructs within a regex declaratively tell the grammar engine that it is finished with the pattern part and starting in on the side effects, so by inserting such constructs the user controls what is considered a token and what is not. The constructs deemed to terminate a token declaration and start the "action" part of the pattern include:

Subpatterns (captures) specifically do not terminate the token pattern, but may require a reparse of the token to find the location of the subpatterns. Likewise assertions may need to be checked out after the longest token is determined. (Alternately, if DFA semantics are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.)

Greedy quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay.

Because such assertions can be part of the token, the lexer engine must be able to recover from the failure of such an assertion and backtrack to the next best token candidate, which might be the same length or shorter, but can never be longer than the current candidate.

For a pattern that starts with a positive lookahead assertion, the assertion is assumed to be more specific than the subsequent pattern, so the lookahead's pattern is treated as the longest token; the longest-token matcher will be smart enough to rematch any text traversed by the lookahead when (and if) it continues the match.

Oddly enough, the token keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace. In contrast, the rule keyword (which assumes :sigspace) defines a pattern that tends to disqualify itself on the first whitespace. So most of the token patterns will end up coming from token declarations. For instance, a token declaration such as

    token list_composer { \[ <expr> \] }

considers its "longest token" to be just the left square bracket, because the first thing the expr rule will do is traverse optional whitespace.

The initial token matcher must take into account case sensitivity (or any other canonicalization primitives) and do the right thing even when propagated up to rules that don't have the same canonicalization. That is, they must continue to represent the set of matches that the lower rule would match.

The || form has the old short-circuit semantics, and will not attempt to match its right side unless all possibilities (including all | possibilities) are exhausted on its left. The first || in a regex makes the token patterns on its left available to the outer longest-token matcher, but hides any subsequent tests from longest-token matching. Every || establishes a new longest-token matcher. That is, if you use | on the right side of ||, that right side establishes a new top level scope for longest-token processing for this subexpression and any called subrules. The right side's longest-token automaton is invisible to the left of the || or outside the regex containing the ||.

Return values from matches

Match objects

Subpattern captures

Accessing captured subpatterns

Nested subpattern captures

Quantified subpattern captures

Indirectly quantified subpattern captures

Subpattern numbering

Subrule captures

Accessing captured subrules

Repeated captures of the same subrule

Aliasing

Aliases can be named or numbered. They can be scalar-, array-, or hash-like. And they can be applied to either capturing or non-capturing constructs. The following sections highlight special features of the semantics of some of those combinations.

Named scalar aliasing to subpatterns

Named scalar aliases applied to non-capturing brackets

Named scalar aliasing to subrules

Numbered scalar aliasing

Scalar aliases applied to quantified constructs

Array aliasing

Hash aliasing