© Moritz Lenz 2020
M. Lenz, Raku Fundamentals, https://doi.org/10.1007/978-1-4842-6109-5_9

9. Parsing INI Files Using Regexes and Grammars

Moritz Lenz, Fürth, Bayern, Germany

You’ve probably seen .ini files before; they are quite common as configuration files on the Microsoft Windows platform but are also found in many other places such as ODBC configuration files, Ansible’s inventory files,1 and so on.

This is what they look like:
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff

Raku offers regexes for parsing and grammars for structuring and reusing regexes.

You could use the Config::INI2 module (after installing it with zef install Config::INI) to parse INI files like so:
use Config::INI;
my %hash = Config::INI::parse($ini_string);
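A minimal sketch of reading values back out of the returned hash; the detail that sectionless keys land under a top-level _ key follows the module's convention (and matches what our own parser will do later in this chapter):
use Config::INI;
my $ini_string = q:to/EOI/;
key1=value1
[section1]
key2=value2
EOI
my %hash = Config::INI::parse($ini_string);
say %hash<section1><key2>;    # value2
say %hash<_><key1>;           # value1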

Under the hood it uses regexes and grammars. Here, we will explore how we could write our own INI parser.

9.1 Regex Basics

A regex is a piece of code that acts as a pattern for strings with a common structure. It’s derived from the computer science concept of a regular expression3 but adapted to provide more constructs than pure regular expressions allow and extended with some features that make them easier to use.

We’ll use named regexes to match the primitives and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up as we go.

Let’s start with parsing key/value pairs, like key1=value1. First let’s consider just the key. It may contain letters, digits, and the underscore _. There’s a shortcut to match such characters, \w, and matching one or more works by appending a + character:
use v6.d;
my regex key { \w+ }
multi sub MAIN('test') {
    use Test;
    ok 'abc'    ~~ /^ <key> $/, '<key> matches a simple identifier';
    ok '[abc]' !~~ /^ <key> $/, '<key> does not match a section header';
    done-testing;
}

my regex key { \w+ } declares a lexically (my) scoped regex called key that matches one or more word characters.

There is a long tradition among programming languages of supporting so-called Perl Compatible Regular Expressions (PCRE). Most implementations deviate from PCRE in some details, Perl itself included, but common syntax elements remain throughout. Raku still supports some of these elements but deviates substantially in others.

Here \w+ is the same as in PCRE, but in contrast to PCRE, whitespace in the regex is ignored. This allows you to write much more readable regexes, with the freedom to format them just as you would normal code.

In the testing routine, the slashes in 'abc' ~~ /^ <key> $/ delimit an anonymous regex. In this regex, ^ and $ stand for the start and the end of the matched string, respectively, which is familiar from PCRE. However, in contrast to PCRE, the <key> subrule calls the named regex key from earlier. This is a Raku extension. In PCRE, the < in a regex matches a literal <. In Raku regexes, it introduces a subrule call.

In general, all nonword characters are reserved for “special” syntax, and you have to quote or backslash them to get the literal meaning. For example, \< or '<' in a regex matches a less-than sign. Quoting can apply to more than one character, so 'a+b' in a regex matches an a, followed by a plus +, followed by a b.

Word characters (letters, digits, and the underscore) always match literally.
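A few passing tests illustrate the quoting rules:
use Test;
ok  'a+b' ~~ / 'a+b' /,  'quoted string matches literally';
nok 'aab' ~~ / 'a+b' /,  'quoting disables the special meaning of +';
ok  '2<1' ~~ / 2 \< 1 /, 'backslashed < matches a literal less-than sign';
done-testing;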

9.1.1 Character Classes

Besides literals, character classes are a common building block of regexes. There are many predefined character classes in the form of a backslash followed by a lowercase single letter; for example, \d matches a digit. Its inverse uses the uppercase letter, so \D matches any character that is not a digit.

Character Class    Negation    Matches
\d                 \D          A digit
\w                 \W          A word character (letter, digit, underscore)
\s                 \S          Whitespace, blanks, newlines, etc.
\h                 \H          Horizontal whitespace
\v                 \V          Vertical whitespace
\n                 \N          Logical newline (carriage return, line feed)
.                              Any character

You can also build your own character classes by enumerating characters or ranges of characters:

Method        Example    Matches
Enumeration   <[abc]>    a, b, or c
Negation      <-[abc]>   Anything except a, b, or c
Range         <[a..c]>   a, b, or c

Let's formulate some of these character classes and their properties as tests, all of which pass:
use Test;
ok  'a' ~~ /\w/, '"a" matches \w';
ok  'Σ' ~~ /\w/, 'Greek Sigma matches \w';
nok '!' ~~ /\w/, 'bang ! is not a word character';
nok 'ab' ~~ /^ \w $/, '\w matches just one char';
ok  'b' ~~ /<[abc]>/, 'enumeration';
nok 'B' ~~ /<[abc]>/, 'enumeration is case sensitive';
ok  'a' ~~ /<[a..c]>/, 'in range';
nok 'd' ~~ /<[a..c]>/, 'out of range';
done-testing;

The official Raku test suite contains many such tests4; we just have a few here to illustrate the behavior of the character classes.

9.1.2 Quantifiers

Matching only one repetition of anything is boring, so regexes offer quantifiers. A quantifier states how often the previous regex must match.

Quantifier    Matches how many characters?
*             0..Inf
+             1..Inf
?             0..1
** 3          3
** 1..5       1..5

Again, some examples in the form of passing tests:
nok 'ba'     ~~ / ^ ba [na]+ $ /, '+ must match at least once';
ok  'bana'   ~~ / ^ ba [na]+ $ /, '+ with a single match';
ok  'banana' ~~ / ^ ba [na]+ $ /, '+ with two matches';
ok  'bananana' ~~ / ^ ba [na]+ $ /, '+ with three matches';
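The ** forms from the table deserve a quick demonstration as well:
ok  'aaa'    ~~ /^ a ** 3 $/,    'a ** 3 matches exactly three';
nok 'aaaa'   ~~ /^ a ** 3 $/,    'four is one too many';
ok  'aa'     ~~ /^ a ** 1..5 $/, 'two is within the 1..5 range';
nok 'aaaaaa' ~~ /^ a ** 1..5 $/, 'six is out of range';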

9.1.3 Alternatives

Either-or alternatives are separated by the vertical bar |. For example, \d+ | x matches either a sequence of one or more digits or the character x.

If more than one path of an alternative matches, Raku prefers the longest match. If that behavior is not desired, || takes the first alternative that matches. Formulated as tests:
ok  'a' ~~ /a|b|c/,  'first branch of an alternative';
ok  'c' ~~ /a|b|c/,  'third branch of an alternative';
nok 'd' ~~ /a|b|c/,  'not part of alternative';
is 'x42' ~~ /x | \w+/, 'x42', 'longer branch wins';
is 'x42' ~~ /a||x||\w+/, 'x', 'with || first matching branch wins';

9.2 Parsing the INI Primitives

Coming back to INI parsing, we have to think about which characters are allowed inside a value. Listing the allowed characters seems like a futile exercise, since we are very likely to forget some. Instead, we should think about what’s not allowed in a value. Newlines certainly aren’t, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.

We can formulate this exclusion as a negated character class: <-[ \n ; ]> matches any single character that is neither a newline nor a semicolon. Note that inside a character class, nearly all characters lose their special meaning. Only the backslash, whitespace, two dots, and the closing bracket stand for anything other than themselves. Inside and outside of character classes alike, \n matches a single newline character and \s whitespace. The uppercase inverts that, so that, for example, \S matches any single character that is not whitespace.

This leads us to a version of a regex to match a value in an INI file:
my regex value { <-[ \n ; ]>+ }
There is one problem with this regex: it also matches leading and trailing whitespace, which we don’t want to consider as part of the value:
my regex value { <-[ \n ; ]>+ }
if ' abc ' ~~ /<value>/ {
    say "matched '$/'";           # matched ' abc '
}
If Raku regexes were limited to a regular language in the computer science sense, we’d have to do something like this:
my regex value {
    # match a first non-whitespace character
    <-[ \s ; ]>
    [
        # then arbitrarily many that can contain whitespace
        <-[ \n ; ]>*
        # ... terminated by one non-whitespace character
        <-[ \s ; ]>
    ]?  # and make it optional, in case the value is
        # only one non-whitespace character
}
And now you know why people respond with “And now you have two problems”5 when proposing to solve problems with regexes. A simpler solution is to match a value as introduced first and then to introduce a constraint that neither the first nor the last character may be whitespace:
my regex value { <!before \s> <-[ \n ; ]>+ <!after \s> }
along with accompanying tests:
is ' abc ' ~~ /<value>/, 'abc', '<value> does not match leading or trailing whitespace';
is ' a' ~~ /<value>/, 'a', '<value> matches single non-whitespace too';
ok "a b" !~~ /^ <value> $/, '<value> does not match ';

<!before regex> is a negated look-ahead, that is, the following text must not match the regex, and the text isn’t consumed while matching. Unsurprisingly, <!after regex> is the negated look-behind, which tries to match text that has already been matched and must not succeed in doing so for the whole match to be successful.
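A few passing tests show the two assertions in isolation:
ok  'foobar' ~~ / foo <!before \d> /, 'no digit follows "foo"';
nok 'foo123' ~~ / foo <!before \d> /, 'a digit follows, so the match fails';
ok  'xbar'   ~~ / <!after \d> bar /,  'no digit precedes "bar"';
nok '1bar'   ~~ / <!after \d> bar /,  'a digit precedes, so the match fails';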

This being Raku, there is of course yet another way to approach this problem. If you formulate the requirements as “a value must not contain a newline or semicolon, must start with a non-whitespace character, and must end with a non-whitespace character,” it becomes obvious that if we just had an AND operator in regexes, this could be easy. And it is:
my regex value { <-[ \n ; ]>+ & \S.* & .*\S }

The & operator delimits two or more smaller regex expressions that must all match the same string successfully for the whole match to succeed. \S.* matches any string that starts with a non-whitespace character (\S), followed by any character (.) any number of times (*). Likewise, .*\S matches any string that ends with a non-whitespace character.
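The same properties we tested before hold for this conjunction-based formulation:
my regex value { <-[ \n ; ]>+ & \S.* & .*\S }
is ' abc ' ~~ /<value>/, 'abc', 'all three branches agree on "abc"';
is 'a b'   ~~ /<value>/, 'a b', 'inner whitespace is still allowed';
nok ';'    ~~ /<value>/,        'a semicolon alone never matches';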

Who would have thought that matching something as seemingly simple as a value in a configuration file could be so involved? Luckily, matching a key/value pair is much simpler now that we know how to match each part on its own:
my regex pair { <key> '=' <value> }
And this works great, as long as there are no blanks surrounding the equality sign. If there are, we have to match them separately:
my regex pair { <key> \h* '=' \h* <value> }

\h matches horizontal whitespace, that is, a blank, a tab character, or any other fancy space-like thing that Unicode has in store for us (including the nonbreaking space), but not a newline.
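A few quick passing tests make the distinction concrete:
ok  "a\tb"       ~~ / a \h b /, '\h matches a tab';
ok  "a\x[00A0]b" ~~ / a \h b /, '... and a non-breaking space';
nok "a\nb"       ~~ / a \h b /, '... but not a newline';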

Speaking of newlines, it’s a good idea to match a newline at the end of regex pair, and since we ignore empty lines, let’s match more than one as well:
my regex pair { <key> \h* '=' \h* <value> \n+ }
Time to write some tests:
ok "key=value " ~~ /<pair>/, 'simple pair';
ok "key = value " ~~ /<pair>/, 'pair with blanks';
ok "key = value " !~~ /<pair>/, 'pair with newline before assignment';
A section header is a string in square brackets, so the string itself shouldn’t contain brackets or a newline:
my regex header { '[' <-[ \[ \] \n ]>+ ']' \n+ }
# and in multi sub MAIN('test'):
ok "[abc] "    ~~ /^ <header> $/, 'simple header';
ok "[a c] "    ~~ /^ <header> $/, 'header with spaces';
ok "[a [b]] " !~~ /^ <header> $/, 'cannot nest headers';
ok "[a b] "  !~~ /^ <header> $/, 'No newlines inside headers';
The last remaining primitive is the comment:
my regex comment { ';' \N* \n+ }

\N matches any character that’s not a newline, so the comment is just a semicolon, and then anything until the end of the line.
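And a couple of passing tests for it:
my regex comment { ';' \N* \n+ }
ok  "; ignored by the parser\n" ~~ /^ <comment> $/, 'a full comment line';
ok  ";\n"                       ~~ /^ <comment> $/, 'an empty comment';
nok "key=value\n"               ~~ /^ <comment> $/, 'not a comment';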

9.3 Putting Things Together

A section of an INI file is a header followed by some key/value pairs or comment lines:
my regex section {
    <header>
    [ <pair> | <comment> ]*
}

[...] groups a part of a regex so that the quantifier * after it applies to the whole group, not just to the last term.

The whole INI file consists of potentially some initial key/value pairs or comments followed by some sections:
my regex inifile {
    [ <pair> | <comment> ]*
    <section>*
}
The avid reader has noticed that the [ <pair> | <comment> ]* part of a regex has been used twice, so it’s a good idea to extract it into a stand-alone regex:
my regex block   { [ <pair> | <comment> ]* }
my regex section { <header> <block> }
my regex inifile { <block> <section>* }
It’s time for the “ultimate” test:
my $ini = q:to/EOI/;
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon, and are
; ignored by the parser
[section2]
more=stuff
EOI
ok $ini ~~ /^<inifile>$/, 'Can parse a full INI file';

9.4 Backtracking

Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren’t too hard to understand.

The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it matched so far in a cursor. If a part of a regex can’t find a match, the regex engine tries to alter the previous match to take up fewer characters and then retry the failed match at the new position.

For instance, if you execute the regex match
'abc' ~~ /.* b/

the regex engine first evaluates the .*. The . matches any character. The * quantifier is greedy, which means it tries to match as many characters as it can. It ends up matching the whole string, abc. Then the regex engine tries to match the b, which is a literal. Since the previous match gobbled up the whole string, matching b against the remaining empty string fails. So the previous regex part, .*, must give up a character. It now matches ab, and the literal matcher for the b compares b from the regex against the third character of the string, c, and fails again. So there is a final iteration where the .* once again gives up one character it matched, and now the b literal can match the second character in the string.

This back and forth between the parts of a regex is called backtracking. It’s a great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If, for example, the regex key matched the substring key2 in the input key2=value2, you don’t want it to match a shorter substring just because the next part of the regex can’t match.

There are three major reasons why you don’t want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very unintuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.

The second reason is that backtracking can lead to unexpected regex matches. For example, you want to match two words, optionally separated by whitespace, and you try to translate this directly to a regex:
/ \w+ \s* \w+ /

This seems to work: the first \w+ matches the first word, and the second one matches the second word, all fine and good, until you find that it actually matches a single word too:
say so 'two' ~~ / \w+ \s* \w+ /;    # True

How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched an empty string due to the * quantifier, and then the second \w+ failed, forcing the previous two parts of the regex to match differently. So in the second iteration, the first \w+ only matches tw, the \s* matches the empty string between tw and o, and the second \w+ matches o. And then you realize that if two words aren’t delimited by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.

The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once for each branch it can take in the case of alternatives. With backtracking, the regex engine can get stuck in backtracking loops that take disproportionately longer as the input string grows.

To disable backtracking, you simply replace the word regex with token in the declaration, or use the :ratchet modifier inside the regex.
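The two-words example above makes the difference easy to demonstrate: the same pattern stops matching a lone word once ratcheting is in effect:
ok  'two' ~~ / \w+ \s* \w+ /,          'with backtracking, a single word matches';
nok 'two' ~~ / :ratchet \w+ \s* \w+ /, 'with :ratchet, it does not';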

In the INI file parser, only the regex value needs backtracking (though the other formulations discussed earlier don’t need it); all the other regexes can safely be switched over to tokens:
my token key     { \w+ }
my regex value   { <!before \s> <-[ \n ; ]>+ <!after \s> }
my token pair    { <key> \h* '=' \h* <value> \n+ }
my token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
my token comment { ';' \N* \n+ }
my token block { [ <pair> | <comment> ]* }
my token section { <header> <block> }
my token inifile { <block> <section>* }

9.5 Grammars

This collection of regexes that parse INI files is not the pinnacle of encapsulation and reusability.

Hence, we’ll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.

A grammar is a class with some extra features that make it suitable for parsing text. Along with methods and attributes, you can put regexes into a grammar.

This is what the INI file parser looks like when formulated as a grammar:
grammar IniFile {
    token key     { \w+ }
    regex value   { <!before \s> <-[ \n ; ]>+ <!after \s> }
    token pair    { <key> \h* '=' \h* <value> \n+ }
    token header  { '[' <-[ \[ \] \n ]>+ ']' \n+ }
    token comment { ';' \N* \n+ }
    token block   { [<pair> | <comment>]* }
    token section { <header> <block> }
    token TOP     { <block> <section>* }
}
You can use it to parse some text by calling the parse method, which uses regex or token TOP as the entry point:
my $result = IniFile.parse($text);

Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. There are also some mechanisms in grammars that can make your life as a developer easier.
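To sketch the inheritance point, a subclass can override a single token, and every regex that calls <key> picks up the new definition. The DottedIniFile grammar and its relaxed key token below are made up for illustration:
# Hypothetical subclass that also accepts dots in key names,
# e.g. "app.name=frobnicator".
grammar DottedIniFile is IniFile {
    token key { [ \w | '.' ]+ }
}
say so DottedIniFile.parse("app.name=frobnicator\n");   # True
say so IniFile.parse("app.name=frobnicator\n");         # False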

One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered to be insignificant, in that key=value and key = value lead to the same configuration of the application. So far we’ve dealt with that explicitly by adding h* to token pair. But there are places we haven’t actually considered. For example, it’s OK to have a comment that’s not at the start of the line.

The mechanism that grammars offer is that you can define a regex called ws6, and when you declare a regex with rule instead of token (or enable this feature in a regex through the :sigspace modifier), Raku inserts implicit <ws> calls for you wherever there is whitespace in the regex definition:
grammar IniFile {
    token ws { \h* }
    rule pair { <key>    '='    <value> \n+ }
    # rest as before
}

This might not be worth the effort for a single rule that needs to parse whitespace, but when there are more, this really pays off by keeping whitespace parsing in a single location.

Note that you should only parse insignificant whitespace in token ws. In the case of INI files, newlines are significant, so we shouldn’t match them.

9.6 Extracting Data from the Match

So far the IniFile grammar only checks whether a given input matches the grammar or not. However, when it does match, we really want the parse result in a data structure that’s easy to use. For instance, we could translate this example INI file:
key1=value2
[section1]
key2=value2
key3 = with spaces
; comment lines start with a semicolon and are
; ignored by the parser
[section2]
more=stuff
into this data structure of nested hashes:
{
      _ => {
          key1 => "value2"
      },
      section1 => {
          key2 => "value2",
          key3 => "with spaces"
      },
      section2 => {
          more => "stuff"
      }
}

Note that key/value pairs from outside any section show up in the _ top-level key.

The result from the IniFile.parse call is a Match7 object that has (nearly) all the information necessary to extract the desired match. If you turn a Match object into a string, it becomes the matched string. But there’s more. You can use it like a hash to extract the matches from named submatches. Hence, if the top-level match from
token TOP { <block> <section>* }
produces a Match object $m, then $m<block> is again a Match object, this one from the match of the call of token block. And $m<section> is a list of Match objects from the repeated calls to token section. So a Match is really a tree of matches (Figure 9-1).
Figure 9-1. Match tree from parsing the example INI file
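For example, assuming $ini still holds the example document from the test earlier, we can poke around in this tree by hand:
my $m = IniFile.parse($ini);
say $m<block><pair>[0]<key>.Str;                  # key1
say $m<section>[0]<header>.Str.trim;              # [section1]
say $m<section>[1]<block><pair>[0]<value>.Str;    # stuff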

We can walk this data structure to extract the nested hashes. The header token matches a string like "[section1] ", and we’re only interested in "section1". To get to the inner part, we can modify header by inserting a pair of parentheses around the subregex whose match we’re interested in:
token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
#                  ^^^^^^^^^^^^^^^^^^^^^ a capturing group
That’s a capturing group, and we can get its match by using the top-level match for header as an array and accessing its first element. This leads us to the full INI parser:
sub parse-ini(Str $input) {
    my $m = IniFile.parse($input);
    unless $m {
        die "The input is not a valid INI file.";
    }
    sub block(Match $m) {
        my %result;
        for $m<block><pair> -> $pair {
            %result{ $pair<key>.Str } = $pair<value>.Str;
        }
        return %result;
    }
    my %result;
    %result<_> = block($m);
    for $m<section> -> $section {
        %result{ $section<header>[0].Str } = block($section);
    }
    return %result;
}
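Running the example document from the start of the chapter through this function produces the nested hash we were after:
my %config = parse-ini($ini);
say %config<_><key1>;           # value2
say %config<section1><key3>;    # with spaces
say %config<section2><more>;    # stuff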

This top-down approach works, but it requires a very intimate understanding of the grammar’s structure. This means that if you change the structure during maintenance, you’ll have a hard time figuring out how to change the data extraction code.

Raku offers a bottom-up approach as well. It allows you to write a data extraction or action method for each regex, token, or rule. The grammar engine passes in the match object as the single argument, and the action method can call the routine make to attach a result to the match object. The result is available through the .made method on the match object.

This execution of action methods happens as soon as a regex matches successfully; thus, an action method for a regex can rely on the fact that the action methods for subregex calls have already run. For example, when the rule pair { <key> '=' <value> \n+ } is being executed, first token key matches successfully, and its action method runs immediately. Then, token value matches, and its action method runs too. Finally, the rule pair itself can match successfully, so its action method can rely on $m<key>.made and $m<value>.made being available, assuming that the match result is stored in variable $m.

Speaking of variables, a regex match implicitly stores its result in the special variable $/, and it is customary to use $/ as a parameter in action methods. There is also a shortcut for accessing named submatches: instead of writing $/<key>, you can write $<key>. With this convention in mind, the action class becomes
class IniFile::Actions {
    method key($/)     { make $/.Str }
    method value($/)   { make $/.Str }
    method header($/)  { make $/[0].Str }
    method pair($/)    { make $<key>.made => $<value>.made }
    method block($/)   { make $<pair>.map({ .made }).hash }
    method section($/) { make $<header>.made => $<block>.made }
    method TOP($/)     {
        make {
            _ => $<block>.made,
            $<section>.map: { .made },
        }
    }
}

The first two action methods are really simple. The result of a key or value match is simply the string that matched. For a header, it’s just the substring inside the brackets. Fittingly, a pair returns a Pair8 object, composed from key and value. The block method constructs a hash from all the lines in the block by iterating over each pair submatch and extracting the already attached Pair object. One level above that in the match tree, section takes that hash and pairs it with the name of the section, extracted from $<header>.made. Finally, the top-level action method gathers the sectionless key/value pairs under the key _ as well as all the sections and returns them in a hash.

In each method of the action class, we rely only on knowledge of the regexes called directly from the regex that corresponds to the action method, and on the data types that they .made. Thus, when you refactor one regex, you also have to change only the corresponding action method. Nobody needs to be aware of the global structure of the grammar.

Now we just have to tell Raku to actually use the action class:
sub parse-ini(Str $input) {
    my $m = IniFile.parse($input, :actions(IniFile::Actions));
    unless $m {
        die "The input is not a valid INI file.";
    }
    return $m.made
}
If you want to start parsing with a different rule than TOP (e.g., which you might want to do in a test), you can pass a named argument rule to method parse:
sub parse-ini(Str $input, :$rule = 'TOP') {
    my $m = IniFile.parse($input,
        :actions(IniFile::Actions),
        :$rule,
    );
    unless $m {
        die "The input is not a valid INI file.";
    }
    return $m.made
}
say parse-ini($ini).perl;
use Test;
is-deeply parse-ini("k = v ", :rule<pair>), 'k' => 'v',
    'can parse a simple pair';
done-testing;
To better encapsulate all the parsing functionality within the grammar, we can turn parse-ini into a method:
grammar IniFile {
    # regexes/tokens unchanged as before
    method parse-ini(Str $input, :$rule = 'TOP') {
        my $m = self.parse($input,
            :actions(IniFile::Actions),
            :$rule,
        );
        unless $m {
            die "The input is not a valid INI file.";
        }
        return $m.made
    }
}
# Usage:
my $result = IniFile.parse-ini($text);

To make this work, the class IniFile::Actions either has to be declared before the grammar or needs to be predeclared with class IniFile::Actions { ... } at the top of the file (with the literal three dots to mark it as a forward declaration).
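As a minimal sketch, the second variant lays out the file like this:
class IniFile::Actions { ... }    # forward declaration: literal three dots

grammar IniFile {
    # tokens, rules, and the parse-ini method as before
}

class IniFile::Actions {
    # the action methods shown earlier
}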

9.7 Generating Good Error Messages

Good error messages are paramount to the user experience of any product. Parsers are no exception to this. Consider the difference between a message like Square bracket [ on line 3 closed by curly bracket } on line 5 and Python’s lazy and generic SyntaxError: invalid syntax.

In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what’s wrong.

We’ll explore how to generate better parsing error messages from a grammar, using our INI file parser as an example.

9.7.1 Failure Is Normal

Before we start, it’s important to realize that in a grammar-based parser, it’s normal for a regex to fail to match, even in an overall successful parse.

Let’s recall a part of the parser:
token block { [<pair> | <comment>]* }
token section { <header> <block> }
token TOP { <block> <section>* }
When this grammar matches against the string
key=value
[header]
other=stuff

then TOP calls block, which calls both pair and comment. The pair match succeeds; the comment match fails. No big deal. But since there is a * quantifier in token block, it tries again to match pair or comment. Neither succeeds, but the overall match of token block still succeeds.

A nice way to visualize passed and failed submatches is to install the Grammar::Tracer module (zef install Grammar::Tracer) and simply add the statement use Grammar::Tracer before the grammar definition. This produces debug output showing which rules matched and which didn’t:
TOP
|  block
|  |  pair
|  |  |  key
|  |  |  * MATCH "key"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  value
|  |  |  * MATCH "value"
|  |  |  ws
|  |  |  * MATCH ""
|  |  |  ws
|  |  |  * MATCH ""
|  |  * MATCH "key=value "
|  |  pair
|  |  |  key
|  |  |  * FAIL
|  |  * FAIL
|  |  comment
|  |  * FAIL
|  * MATCH "key=value "
|  section
...

9.7.2 Detecting Harmful Failure

To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained in the preceding, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.

If you recall pair
rule pair { <key>  '='  <value> \n+ }

we know that if a key was parsed, we really expect the next character to be an equals sign. If not, the input is malformed.

In code, this is written like so:
rule pair {
    <key>
    [ '=' || <expect('=')> ]
    <value> \n+
}

|| is a sequential alternative, which first tries to match the subregex on the left-hand side and only executes the right-hand side if that failed.

So now we have to define expect:
method expect($what) {
    die "Cannot parse input as INI file: Expected $what";
}
Yes, you can call methods just like regexes, because regexes really are methods under the hood. die throws an exception, so now the malformed input justakey produces the error
Cannot parse input as INI file: Expected =

followed by a backtrace. That’s already better than “invalid syntax,” though the position is still missing. Inside method expect, we can find the current parsing position through the method pos, which is supplied by the implicit parent class Grammar9 that the grammar declaration brings with it.

We can use that to improve the error message a bit:
method expect($what) {
    die "Cannot parse input as INI file: Expected $what at character {self.pos}";
}

9.7.3 Providing Context

For larger inputs, we really want to print the line number. To calculate that, we need to get hold of the target string, which is available via the method target:
method expect($what) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: Expected $what at line @lines.elems(), after '@lines[*-1]'";
}
This brings us from the “meh” realm of error messages to quite good. Thus
IniFile.parse(q:to/EOI/);
key=value
[section]
key_without_value
more=key
EOI
now dies with
Cannot parse input as INI file: Expected = at line 3, after 'key_without_value'

You can further refine the expect method by providing context both before and after the position of the parse failure. And of course you have to apply the [ thing || <expect('thing')> ] pattern at more places inside the regex to get better error messages.

Finally, you can provide different kinds of error messages too. For example, when parsing a section header, once the initial [ is parsed, you likely don’t want an error message “expected rest of section header” but rather “malformed section header, at line …”:
rule pair {
    <key>
    [ '=' || <expect('=')> ]
    [ <value> || <expect('value')> ]
    \n+
}
token header {
    '['
    [ ( <-[ \[ \] \n ]>+ ) ']'
        || <error("malformed section header")> ]
    \n+
}
...
method expect($what) {
    self.error("expected $what");
}
method error($msg) {
    my $parsed-so-far = self.target.substr(0, self.pos);
    my @lines = $parsed-so-far.lines;
    die "Cannot parse input as INI file: $msg at line @lines.elems(), after '@lines[*-1]'";
}

Since Rakudo uses grammars to parse Raku input, you can use Rakudo’s own grammar10 as a source of inspiration for more ways to make error reporting even better.

9.7.4 Shortcuts for Parsing Matching Pairs

Since it’s such a common task, Raku grammars have a special goal-matching syntax for matching a pair of delimiters with something between them. In the INI file example, that’s a pair of brackets with a section header between them.

We can change
token header { '[' ( <-[ \[ \] \n ]>+ ) ']' \n+ }
to read
token header { '[' ~ ']' ( <-[ \[ \] \n ]>+ ) \n+ }
Not only does this have the aesthetic benefit of putting the matching delimiters closer together; it also calls a method FAILGOAL for us if everything except the closing delimiter matched. We can use this to generate better error messages for parse failures of matched pairs:
method FAILGOAL($goal) {
    my $cleaned-goal = $goal.trim;
    $cleaned-goal = $0 if $goal ~~ / \' (.+) \' /;
    self.error("Cannot find closing $cleaned-goal");
}

The argument passed to FAILGOAL is the string of the regex source code that failed to match the closing delimiter, here ']' (with a trailing space). From that we want to extract the literal ] for the error message, hence the regex match in the middle of the method. If that regex matches successfully, the literal is in $/[0], for which $0 is a shortcut.

All parsing constructs using ~ can benefit from such a FAILGOAL method, so writing one is worth the effort in a grammar that parses several distinct quoting or bracketing constructs.
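To see it in action, feeding the parser a section header that is never closed, for example
IniFile.parse(q:to/EOI/);
key=value
[section1
more=stuff
EOI
now dies with a message along the lines of
Cannot parse input as INI file: Cannot find closing ] at line 2, after '[section1'
which tells the user both what is missing and where to look.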

9.8 Write Your Own Grammars

Parsing is a skill that must be learned, mostly separately from your ordinary programming skills. So I encourage you to start with something small, like a parser for CSV or comma-separated values.11 It’s tempting to write a whole grammar for that in one go, but instead I recommend starting with parsing some atoms (like a cell of data between two commas), testing it, and only then proceeding to the next one.

And even in something as deceptively simple as CSV, some complexity lurks. For example, you could allow quoted strings that themselves can contain the separator character and an escape character that allows you to use the quoting character inside a quoted string.
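A hypothetical first iteration might parse only unquoted, non-empty cells and leave all of that complexity for later (the CSV grammar below is made up for illustration):
grammar CSV {
    token TOP  { <line>+ }
    token line { <cell>+ % ',' \n }
    token cell { <-[ , \n ]>+ }
}
my $m = CSV.parse("a,b,c\n1,2,3\n");
say $m<line>[1]<cell>[2].Str;    # 3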

For a deeper treatment of Raku regexes and grammars, check out Parsing with Perl 6 Regexes and Grammars by Moritz Lenz (Apress, 2017).

9.9 Summary

Raku allows regex reuse by treating them as first-class citizens, allowing them to be named and called like normal routines. Further clutter is removed by allowing whitespace inside regexes.

These features allow you to write regexes to parse proper file formats and even programming languages. Grammars let you structure, reuse, and encapsulate regexes.

The result of a regex match is a Match object, which is really a tree with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.

To generate good error messages from a parser, you need to distinguish between expected and unexpected match failures. The sequential alternative || is a tool you can use to turn unexpected match failures into error messages by raising an exception from the second branch of the alternative.
