You’ve probably seen .ini files before; they are quite common as configuration files on the Microsoft Windows platform but are also found in many other places such as ODBC configuration files, Ansible’s inventory files,1 and so on.
Raku offers regexes for parsing and grammars for structuring and reusing regexes; the Raku compiler itself uses them under the hood to parse your programs. Here, we will explore how we could write our own INI parser.
9.1 Regex Basics
A regex is a piece of code that acts as a pattern for strings with a common structure. It’s derived from the computer science concept of a regular expression3 but adapted to provide more constructs than pure regular expressions allow and extended with some features that make them easier to use.
We’ll use named regexes to match the primitives and then use regexes that call these named regexes to build a parser for the INI files. Since INI files have no universally accepted, formal grammar, we have to make stuff up as we go.
my regex key { \w+ } declares a lexically (my) scoped regex called key that matches one or more word characters.
There is a long tradition in programming languages to support so-called Perl Compatible Regular Expressions (PCRE). Many programming languages support some deviations from PCRE, including Perl itself, but common syntax elements remain throughout most implementations. Raku still supports some of these elements but deviates substantially in others.
Here \w+ is the same as in PCRE, but in contrast to PCRE, whitespace in the regex is ignored. This allows you to write much more readable regexes, with the freedom to format a regex just like you would normal code.
In the testing routine, the slashes in 'abc' ~~ /^ <key> $/ delimit an anonymous regex. In this regex, ^ and $ stand for the start and the end of the matched string, respectively, which is familiar from PCRE. However, in contrast to PCRE, the <key> subrule calls the named regex key from earlier. This is a Raku extension. In PCRE, the < in a regex matches a literal <. In Raku regexes, it introduces a subrule call.
In general, all nonword characters are reserved for “special” syntax, and you have to quote or backslash them to get the literal meaning. For example, \< or '<' in a regex matches a less-than sign. Quoting can apply to more than one character, so 'a+b' in a regex matches an a, followed by a plus +, followed by a b.
Word characters (letters, digits, and the underscore) always match literally.
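To make this concrete, here is a small runnable sketch of a named regex and a few smartmatch tests against it (the test values are our own, not from any listing):

```raku
# Declare a lexically scoped regex named `key`
my regex key { \w+ }

# `so` coerces the match result to a Bool
say so 'abc'  ~~ /^ <key> $/;   # True: three word characters
say so 'key2' ~~ /^ <key> $/;   # True: digits are word characters too
say so 'a b'  ~~ /^ <key> $/;   # False: a blank is not a word character
```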
9.1.1 Character Classes
Character Class | Negation | Matches |
---|---|---|
\d | \D | A digit |
\w | \W | A word character (letter, digit, underscore) |
\s | \S | Whitespace, blanks, newlines, etc. |
\h | \H | Horizontal whitespace |
\v | \V | Vertical whitespace |
\n | \N | Logical newline (carriage return, line feed) |
. | | Any character |
Method | Example | Matches |
---|---|---|
Enumeration | <[abc]> | a, b, or c |
Negation | <-[abc]> | Anything except a, b, or c |
Range | <[a..c]> | a, b, or c |
The official Raku test suite contains many such tests4; we include just a few here to illustrate the behavior of the character classes.
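A handful of such illustrative tests might look like this (our own examples, assuming the classes from the tables above):

```raku
# Backslash character classes and their uppercase negations
say so '7' ~~ /^ \d $/;          # True: a digit
say so 'x' ~~ /^ \D $/;          # True: anything but a digit
say so "\t" ~~ /^ \h $/;         # True: a tab is horizontal whitespace

# User-defined character classes
say so 'b' ~~ /^ <[abc]> $/;     # True: enumeration
say so 'd' ~~ /^ <-[abc]> $/;    # True: negation
say so 'b' ~~ /^ <[a..c]> $/;    # True: range
```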
9.1.2 Quantifiers
Quantifier | Matches how many characters? |
---|---|
* | 0..Inf |
+ | 1..Inf |
? | 0..1 |
** 3 | 3 |
** 1..5 | 1..5 |
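The quantifiers from the table behave as follows (a quick sketch with our own test strings):

```raku
say so 'aaa'  ~~ /^ a* $/;        # True: zero or more
say so ''     ~~ /^ a* $/;        # True: zero is fine too
say so ''     ~~ /^ a+ $/;        # False: + needs at least one
say so 'aaa'  ~~ /^ a ** 3 $/;    # True: exactly three
say so 'aaaa' ~~ /^ a ** 1..5 $/; # True: between one and five
```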
9.1.3 Alternatives
Either-or alternatives are separated by the vertical bar |. For example, \d+ | x matches either a sequence of one or more digits or the character x.
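For example (the square brackets group the alternation without capturing, a construct we will meet again shortly):

```raku
say so '42' ~~ /^ [ \d+ | x ] $/;  # True: a sequence of digits
say so 'x'  ~~ /^ [ \d+ | x ] $/;  # True: the literal x
say so 'y'  ~~ /^ [ \d+ | x ] $/;  # False: matches neither branch
```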
9.2 Parsing the INI Primitives
Coming back to INI parsing, we have to think about what characters are allowed inside a value. Listing the allowed characters seems like a futile exercise, since we are very likely to forget some. Instead, we should think about what’s not allowed in a value. Newlines certainly aren’t, because they introduce the next key/value pair or a section heading. Neither are semicolons, because they introduce a comment.
We can formulate this exclusion as a negated character class: <-[ \n ; ]> matches any single character that is neither a newline nor a semicolon. Note that inside a character class, nearly all characters lose their special meaning. Only the backslash, whitespace, two dots, and the closing bracket stand for anything other than themselves. Inside and outside of character classes alike, \n matches a single newline character and \s whitespace. The uppercase version inverts that, so that, for example, \S matches any single character that is not whitespace.
<!before regex> is a negated look-ahead: the following text must not match the regex, and the text isn’t consumed while matching. Unsurprisingly, <!after regex> is the negated look-behind, which tries the regex against the text just before the current position and must fail for the whole match to succeed.
The & operator joins two or more smaller regex expressions that must all match the same string for the whole match to succeed. \S.* matches any string that starts with a non-whitespace character (\S), followed by any character (.) any number of times (*). Likewise, .*\S matches any string that ends with a non-whitespace character.
\h matches horizontal whitespace, that is, a blank, a tab character, or any other fancy spacelike thing that Unicode has in store for us (e.g., the nonbreaking space), but not a newline.
\N matches any character that’s not a newline, so a comment is just a semicolon and then anything until the end of the line.
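Putting the pieces above together, the INI primitives might be sketched like this (hypothetical formulations; the actual listings may differ in detail):

```raku
my regex key     { \w+ }
# value: no newlines or semicolons, and no leading or trailing whitespace,
# expressed with the & conjunction discussed above
my regex value   { <-[ \n ; ]>+ & \S.* & .*\S }
my regex comment { ';' \N* }

say so 'some text' ~~ /^ <value> $/;    # True: blanks inside are fine
say so ' padded '  ~~ /^ <value> $/;    # False: leading/trailing whitespace
say so '; a note'  ~~ /^ <comment> $/;  # True: semicolon, then the rest
```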
9.3 Putting Things Together
[...] groups a part of a regex so that the quantifier * after it applies to the whole group, not just to the last term.
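A toy example (our own, not from a listing) shows the effect of quantifying a whole group:

```raku
# The * applies to the whole bracketed group: zero or more "word, newline" lines
say so "a\nb\n" ~~ /^ [ \w+ \n ]* $/;   # True: two complete lines
say so "a\nb"   ~~ /^ [ \w+ \n ]* $/;   # False: the last line lacks its newline
```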
9.4 Backtracking
Regex matching seems magical to many programmers. You just state the pattern, and the regex engine determines for you whether a string matches the pattern or not. While implementing a regex engine is a tricky business, the basics aren’t too hard to understand.
The regex engine goes through the parts of a regex from left to right, trying to match each part of the regex. It keeps track of what part of the string it matched so far in a cursor. If a part of a regex can’t find a match, the regex engine tries to alter the previous match to take up fewer characters and then retry the failed match at the new position.
When matching a regex like / .* b / against the string abc, the regex engine first evaluates the .*. The . matches any character. The * quantifier is greedy, which means it tries to match as many characters as it can. It ends up matching the whole string, abc. Then the regex engine tries to match the b, which is a literal. Since the previous match gobbled up the whole string, matching b against the remaining empty string fails. So the previous regex part, .*, must give up a character. It now matches ab, and the literal matcher for the b compares b from the regex against the third character of the string, c, and fails again. So there is a final iteration where the .* once again gives up one character it matched, and now the b literal can match the second character in the string.
This back and forth between the parts of a regex is called backtracking. It’s a great feature when you search for a pattern in a string. But in a parser, it is usually not desirable. If, for example, the regex key matched the substring key2 in the input key2=value2, you don’t want it to match a shorter substring just because the next part of the regex can’t match.
There are three major reasons why you don’t want that. The first is that it makes debugging harder. When humans think about how a text is structured, they usually commit pretty quickly to basic tokenization, such as where a word or a sentence ends. Thus backtracking can be very unintuitive. If you generate error messages based on which regexes failed to match, backtracking basically always leads to the error message being pretty useless.
The second reason is that backtracking can lead to unexpected regex matches. For example, you want to match two words, optionally separated by whitespace, and you try to translate this directly to a regex such as / \w+ \s* \w+ /.
This seems to work: the first \w+ matches the first word, and the second one matches the second word, all fine and good—until you find that it actually matches a single word like two as well.
How did that happen? Well, the first \w+ matched the whole word, \s* successfully matched an empty string due to the * quantifier, and then the second \w+ failed, forcing the previous two parts of the regex to match differently. So in the second iteration, the first \w+ only matches tw, the \s* matches the empty string between tw and o, and the second \w+ matches o. And then you realize: if two words aren’t delimited by whitespace, how do you even tell where one word ends and the next one starts? With backtracking disabled, the regex fails to match instead of matching in an unintended way.
The third reason is performance. When you disable backtracking, the regex engine has to look at each character only once, or once per branch in the case of alternatives. With backtracking, the regex engine can get stuck in backtracking loops that take disproportionately longer as the input string grows.
To disable backtracking, you simply replace the word regex with token in the declaration, or use the :ratchet modifier inside the regex.
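The difference shows up directly when we re-declare the two-word pattern as a token (two-words is our own name for this sketch):

```raku
# With a backtracking regex, the pattern contorts itself to match one word:
say so 'two' ~~ / \w+ \s* \w+ /;        # True, via backtracking

# token implies :ratchet, which disables backtracking:
my token two-words { \w+ \s* \w+ }
say so 'two'     ~~ /^ <two-words> $/;  # False: no second word to match
say so 'two one' ~~ /^ <two-words> $/;  # True
```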
9.5 Grammars
This collection of regexes that parse INI files is not the pinnacle of encapsulation and reusability.
Hence, we’ll explore grammars, a feature that groups regexes into a class-like structure, and how to extract structured data from a successful match.
A grammar is a class with some extra features that make it suitable for parsing text. Along with methods and attributes, you can put regexes into a grammar.
Besides the standardized entry point, a grammar offers more advantages. You can inherit from it like from a normal class, thus bringing even more reusability to regexes. You can group extra functionality together with the regexes by adding methods to the grammar. There are also some mechanisms in grammars that can make your life as a developer easier.
One of them is dealing with whitespace. In INI files, horizontal whitespace is generally considered to be insignificant, in that key=value and key = value lead to the same configuration of the application. So far we’ve dealt with that explicitly by adding \h* to token pair. But there are places we haven’t actually considered. For example, it’s OK to have a comment that’s not at the start of the line.
This might not be worth the effort for a single rule that needs to parse whitespace, but when there are more, this really pays off by keeping whitespace parsing in a single location.
Note that you should only parse insignificant whitespace in token ws. In the case of INI files, newlines are significant, so we shouldn’t match them.
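A reduced sketch of this mechanism (MiniIni is our own toy grammar, not the full INI grammar): a rule implies :ratchet and sigspace, so whitespace in the rule's source turns into calls to token ws.

```raku
grammar MiniIni {
    # Only horizontal whitespace is insignificant; newlines stay significant
    token ws    { \h* }
    # rule inserts <.ws> after each atom, so blanks around '=' are allowed
    rule  pair  { <key> '=' <value> }
    token key   { \w+ }
    token value { <-[ \n ; ]>+ }
}

say so MiniIni.parse('key = value', :rule<pair>);   # True
say so MiniIni.parse('key=value',   :rule<pair>);   # True
```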
9.6 Extracting Data from the Match
Note that key/value pairs from outside any section show up in the _ top-level key.
This top-down approach works, but it requires a very intimate understanding of the grammar’s structure. This means that if you change the structure during maintenance, you’ll have a hard time figuring out how to change the data extraction code.
Raku offers a bottom-up approach as well. It allows you to write a data extraction or action method for each regex, token, or rule. The grammar engine passes in the match object as the single argument, and the action method can call the routine make to attach a result to the match object. The result is available through the .made method on the match object.
This execution of action methods happens as soon as a regex matches successfully; thus, an action method for a regex can rely on the fact that the action methods for subregex calls have already run. For example, when the rule pair { <key> '=' <value> \n+ } is being executed, first token key matches successfully, and its action method runs immediately. Then, token value matches, and its action method runs too. Finally, the rule pair itself can match successfully, so its action method can rely on $m<key>.made and $m<value>.made being available, assuming that the match result is stored in variable $m.
The first two action methods are really simple. The result of a key or value match is simply the string that matched. For a header, it’s just the substring inside the brackets. Fittingly, a pair returns a Pair8 object, composed from key and value. The block method constructs a hash from all the lines in the block by iterating over each pair submatch and extracting the already attached Pair object. One level above that in the match tree, section takes that hash and pairs it with the name of the section, extracted from $<header>.made. Finally, the top-level action method gathers the sectionless key/value pairs under the key _ as well as all the sections and returns them in a hash.
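The pattern can be sketched with a grammar reduced to a single pair (Demo and DemoActions are our own names, not the book's IniFile classes):

```raku
grammar Demo {
    token TOP   { <pair> }
    token pair  { <key> '=' <value> }
    token key   { \w+ }
    token value { <-[ \n ; ]>+ }
}

class DemoActions {
    method key($/)   { make ~$/ }   # the matched string itself
    method value($/) { make ~$/ }
    # runs after key and value, so their .made results are available
    method pair($/)  { make $<key>.made => $<value>.made }
}

my $m = Demo.parse('port=8080', :actions(DemoActions));
say $m<pair>.made;   # port => 8080
```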
In each method of the action class, we rely only on the first level of regexes called directly from the regex corresponding to that action method, and on the data types they .made. Thus, when you refactor one regex, you also have to change only the corresponding action method. Nobody needs to be aware of the global structure of the grammar.
To make this work, the class IniFile::Actions either has to be declared before the grammar, or needs to be predeclared with class IniFile::Actions { ... } at the top of the file (with the literal three dots to mark it as a forward declaration).
9.7 Generating Good Error Messages
Good error messages are paramount to the user experience of any product. Parsers are no exception to this. Consider the difference between a message like Square bracket [ on line 3 closed by curly bracket } on line 5 and Python’s lazy and generic SyntaxError: invalid syntax.
In addition to the textual message, knowing the location of the parse error helps tremendously in figuring out what’s wrong.
We’ll explore how to generate better parsing error messages from a grammar, using our INI file parser as an example.
9.7.1 Failure Is Normal
Before we start, it’s important to realize that in a grammar-based parser, it’s normal for a regex to fail to match, even in an overall successful parse.
When the grammar parses a line like key=value, TOP calls block, which calls both pair and comment. The pair match succeeds; the comment match fails. No big deal. But since there is a * quantifier in token block, it tries again to match pair or comment. Neither succeeds, but the overall match of token block still succeeds.
9.7.2 Detecting Harmful Failure
To produce good parsing error messages, you must distinguish between expected and unexpected parse failures. As explained in the preceding, a match failure of a single regex or token is not generally an indication of a malformed input. But you can identify points where you know that once the regex engine got this far, the rest of the match must succeed.
For example, we know that if a key was parsed, we really expect the next character to be an equals sign. If it isn’t, the input is malformed.
|| is a sequential alternative, which first tries to match the subregex on the left-hand side and only executes the right-hand side if that failed.
The resulting exception message is followed by a backtrace. That’s already better than “invalid syntax,” though the position is still missing. Inside method expect, we can find the current parsing position through the method pos, which is supplied by the implicit parent class Grammar9 that the grammar declaration brings with it.
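Putting || and pos together, a minimal sketch might look like this (WithErrors and expect are our own names; expect is not a built-in facility):

```raku
grammar WithErrors {
    token TOP   { <key> [ '=' || <expect('=')> ] <value> }
    token key   { \w+ }
    token value { \w+ }

    # Called like a subrule; self is the current cursor, so .pos works
    method expect($what) {
        die "Expected '$what' at position {self.pos}";
    }
}

say so WithErrors.parse('a=b');   # True
# WithErrors.parse('a b');        # dies: Expected '=' at position 1
```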
9.7.3 Providing Context
You can further refine the expect method by providing context both before and after the position of the parse failure. And of course you have to apply the [ thing || <expect('thing')> ] pattern at more places inside the regex to get better error messages.
Since Rakudo uses grammars to parse Raku input, you can use Rakudo’s own grammar10 as a source of inspiration for more ways to make error reporting even better.
9.7.4 Shortcuts for Parsing Matching Pairs
Since it’s such a common task, Raku grammars have a special goal-matching syntax for matching a pair of delimiters with something between them. In the INI file example, that’s a pair of brackets with a section header between them.
The argument passed to FAILGOAL is the string of the regex source code that failed to match the closing delimiter, here ']' (with a trailing space). From that we want to extract the literal ] for the error message, hence the regex match in the middle of the method. If that regex matches successfully, the literal is in $/[0], for which $0 is a shortcut.
All parsing constructs using ~ can benefit from such a FAILGOAL method, so writing one is worth the effort in a grammar that parses several distinct quoting or bracketing constructs.
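A sketch of goal matching with a FAILGOAL handler (Sections is our own reduced grammar):

```raku
grammar Sections {
    token TOP    { <header> }
    # '[' ~ ']' matches the brackets as a pair, with the content in between
    token header { '[' ~ ']' <-[ \] \n ]>+ }

    # $goal is the source text of the missing closing delimiter, e.g. "']' "
    method FAILGOAL($goal) {
        die "Cannot find {$goal.trim} before position {self.pos}";
    }
}

say so Sections.parse('[database]');   # True
# Sections.parse('[database');         # dies: Cannot find ']' ...
```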
9.8 Write Your Own Grammars
Parsing is a skill that must be learned, mostly separately from your ordinary programming skills. So I encourage you to start with something small, like a parser for CSV (comma-separated values).11 It’s tempting to write a whole grammar for that in one go, but instead I recommend starting with parsing some atoms (like a cell of data between two commas), testing it, and only then proceeding to the next one.
And even in something as deceptively simple as CSV, some complexity lurks. For example, you could allow quoted strings that themselves can contain the separator character and an escape character that allows you to use the quoting character inside a quoted string.
For a deeper treatment of Raku regexes and grammars, check out Parsing with Perl 6 Regexes and Grammars by Moritz Lenz (Apress, 2017).
9.9 Summary
Raku allows regex reuse by treating regexes as first-class citizens that can be named and called like normal routines. Further clutter is removed by making whitespace inside regexes insignificant.
These features allow you to write regexes to parse proper file formats and even programming languages. Grammars let you structure, reuse, and encapsulate regexes.
The result of a regex match is a Match object, which is really a tree with nodes for each named submatch and for each capturing group. Action methods make it easy to decouple parsing from data extraction.
To generate good error messages from a parser, you need to distinguish between expected and unexpected match failures. The sequential alternative || is a tool you can use to turn unexpected match failures into error messages by raising an exception from the second branch of the alternative.