What Makes Pyparsing So Special?

Pyparsing was designed with some specific goals in mind. These goals are based on the premise that the grammar must be easy to write, to understand, and to adapt as the parsing demands on a given parser change and expand over time. The intent behind these goals is to simplify the parser design task as much as possible, so that the pyparsing user can focus on the parser rather than on the mechanics of the parsing library or grammar syntax. The rest of this section lists the high points of the Zen of Pyparsing.

The grammar specification should be a natural-looking part of the Python program, easy-to-read, and familiar in style and format to Python programmers

Pyparsing accomplishes this in a couple of ways:

  • Using operators to join parser elements together. Python's support for defining operator functions allows us to go beyond standard object construction syntax, and we can compose parsing expressions that read naturally. Instead of this:

    streetAddress = And( [streetNumber, name,
                          MatchFirst( [Literal("Rd."), Literal("St.")] ) ] )

    we can write this:

    streetAddress = streetNumber + name + ( Literal("Rd.") | Literal("St.") )
  • Many attribute setting methods in pyparsing return self so that several of these methods can be chained together. This permits parser elements within a grammar to be more self-contained. For example, a common parser expression is the definition of an integer, including the specification of its name, and the attachment of a parse action to convert the integer string to a Python int. Using properties, this would look like:

    integer = Word(nums)
    integer.Name = "integer"
    integer.ParseAction = lambda t: int(t[0])

    Using attribute setters that return self, this can be collapsed to:

    integer = Word(nums).setName("integer").setParseAction(lambda t:int(t[0]))
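Both techniques can be tried together in a short script. This is a minimal sketch rather than a definitive grammar: the definitions of streetNumber and name are assumptions made for illustration, since the text does not define them, and the sample inputs are invented. The camelCase method names follow the text; pyparsing 3 also offers snake_case equivalents such as set_name and parse_string.

```python
from pyparsing import Literal, Word, alphas, nums

# Assumed sub-expressions (not defined in the text): a run of digits
# for the street number and a single alphabetic word for the name.
streetNumber = Word(nums)
name = Word(alphas)

# Operators compose the elements: + builds a sequence, | builds alternatives.
streetAddress = streetNumber + name + (Literal("Rd.") | Literal("St."))

# Chained setters name the expression and attach a conversion in one statement.
integer = Word(nums).setName("integer").setParseAction(lambda t: int(t[0]))

address_tokens = streetAddress.parseString("221 Baker St.").asList()
integer_tokens = integer.parseString("42").asList()

print(address_tokens)  # ['221', 'Baker', 'St.']
print(integer_tokens)  # [42]
```

Note that the integer comes back as a Python int, not a string, because the parse action runs as part of the match.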

Class names are easier to read and understand than specialized typography

This is probably the most explicit distinction between pyparsing and regular expressions or regular expression-based parsing tools. The IP address and phone number examples given in the introduction allude to this idea, but regular expressions become truly inscrutable when regular expression control characters are also part of the text to be matched. The result is a mishmash of backslashes that escape the control characters so that they are interpreted as input text. Here is a regular expression to match a simplified C function call, constrained to accept zero or more arguments that are either words or integers:

(\w+)\((((\d+|\w+)(,(\d+|\w+))*)?)\)

It is not easy to tell at a glance which parentheses are grouping operators, and which are part of the expression to be matched. Things get even more complicated if the input text contains \, ., *, or ? characters. The pyparsing version of this same expression is:

Word(alphas) + "(" + Group( Optional( (Word(nums)|Word(alphas)) +
                            ZeroOrMore("," + (Word(nums)|Word(alphas))) ) ) + ")"

In the pyparsing version, the grouping and repetition are explicit and easy to read. In fact, this pattern of x + ZeroOrMore("," + x) is so common that pyparsing provides a helper, delimitedList, that builds this expression and, by default, leaves the delimiters out of the results. Using delimitedList, our pyparsing rendition further simplifies to:

Word(alphas)+ "(" + Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")"

Whitespace markers clutter and distract from the grammar definition

In addition to the "special characters aren't really that special" problem, regular expressions must also explicitly indicate where whitespace can occur in the input text. In this C function example, the regular expression would match:

abc(1,2,def,5)

but would not match:

abc(1, 2, def, 5)

Unfortunately, it is not easy to predict where optional whitespace might or might not occur in such an expression, so one must include \s* expressions liberally throughout, further obscuring the real text matching that was intended:

(\w+)\s*\(\s*(((\d+|\w+)(\s*,\s*(\d+|\w+))*)?)\s*\)
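The difference between the two regular expressions can be checked with Python's standard re module. This is only a sketch: the patterns are transcribed as raw strings with their backslash escapes written out, and the variable names are chosen for illustration.

```python
import re

# The original pattern, with no allowance for whitespace.
strict = re.compile(r"(\w+)\((((\d+|\w+)(,(\d+|\w+))*)?)\)")

# The same pattern with \s* added wherever whitespace may appear.
tolerant = re.compile(r"(\w+)\s*\(\s*(((\d+|\w+)(\s*,\s*(\d+|\w+))*)?)\s*\)")

compact_call = "abc(1,2,def,5)"
spaced_call = "abc(1, 2, def, 5)"

# fullmatch requires the whole string to match the pattern.
strict_results = [bool(strict.fullmatch(s)) for s in (compact_call, spaced_call)]
tolerant_results = [bool(tolerant.fullmatch(s)) for s in (compact_call, spaced_call)]

print(strict_results)    # [True, False]
print(tolerant_results)  # [True, True]
```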

In contrast, pyparsing skips over whitespace between parser elements by default, so that this same pyparsing expression:

Word(alphas)+ "(" + Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")"

matches either of the listed calls to the abc function, without any additional whitespace indicators.
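A quick way to confirm this behavior is to parse both strings with the same expression and compare the results. The sketch below assumes only that pyparsing is installed; the variable names are illustrative.

```python
from pyparsing import Group, Optional, Word, alphas, delimitedList, nums

# The expression from the text, with no whitespace markers of any kind.
cFunction = (Word(alphas) + "(" +
             Group(Optional(delimitedList(Word(nums) | Word(alphas)))) + ")")

compact = cFunction.parseString("abc(1,2,def,5)").asList()
spaced = cFunction.parseString("abc(1, 2, def, 5)").asList()

print(compact)            # ['abc', '(', ['1', '2', 'def', '5'], ')']
print(compact == spaced)  # True
```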

This same concept also applies to comments, which can appear anywhere in a source program. Imagine trying to match a function in which the developer had inserted a comment to document each parameter in the argument list. With pyparsing, this is accomplished with the code:

cFunction = ( Word(alphas) + "(" +
              Group( Optional(delimitedList(Word(nums)|Word(alphas))) ) + ")" )
cFunction.ignore( cStyleComment )
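To see the effect of ignore, the expression can be run against a call whose arguments are annotated with comments. The input string here is invented for illustration, and the sketch assumes that the comments use the C /* ... */ form, which is what cStyleComment matches.

```python
from pyparsing import (Group, Optional, Word, alphas, cStyleComment,
                       delimitedList, nums)

cFunction = (Word(alphas) + "(" +
             Group(Optional(delimitedList(Word(nums) | Word(alphas)))) + ")")

# Comments are now skipped wherever they occur between parser elements.
cFunction.ignore(cStyleComment)

# Hypothetical input with a comment documenting each argument.
source = "abc( 1 /* count */, def /* default name */ )"
tokens = cFunction.parseString(source).asList()

print(tokens)  # ['abc', '(', ['1', 'def'], ')']
```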

The results of the parsing process should do more than just represent a nested list of tokens, especially when grammars get complicated

Pyparsing returns the results of the parsing process using a class named ParseResults. ParseResults supports simple list-based access (indexing with [], len, iter, and slicing) for simple grammars, but it can also represent nested results and provides dict-style and object attribute-style access to named fields within the results. The results from parsing our C function example are:

['abc', '(', ['1', '2', 'def', '5'], ')']

You can see that the function arguments have been collected into their own sublist, making the extraction of the function arguments easier during post-parsing analysis. If the grammar definition includes results names, specific fields can be accessed by name instead of by error-prone list indexing.

These higher-level access techniques are crucial to making sense of the results from a complex grammar.
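As a rough illustration of named access, the C function expression can be given results names. The names "name" and "args" below are assumptions chosen for this sketch; they do not come from the text.

```python
from pyparsing import Group, Optional, Word, alphas, delimitedList, nums

arg = Word(nums) | Word(alphas)

# "name" and "args" are hypothetical results names chosen for this example.
cFunction = (Word(alphas).setResultsName("name") + "(" +
             Group(Optional(delimitedList(arg))).setResultsName("args") + ")")

result = cFunction.parseString("abc(1, 2, def, 5)")

# List-style access still works...
first_token = result[0]
token_count = len(result)

# ...but named fields avoid counting positions by hand.
function_name = result["name"]         # dict-style access
function_args = result.args.asList()   # attribute-style access

print(function_name)  # abc
print(function_args)  # ['1', '2', 'def', '5']
```

If the grammar later gains tokens ahead of the argument list, the named lookups keep working, while index-based code would need to be updated.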

Parse time is a good time for additional text processing

While parsing, the parser is performing many checks on the format of fields within the input text: testing for the validity of a numeric string, or matching a pattern of punctuation such as a string within quotation marks. If left as strings, the post-parsing code will have to re-examine these fields to convert them into Python ints and strings, and likely have to repeat the same validation tests before doing the conversion.

Pyparsing supports the definition of parse-time callbacks (called parse actions) that you can attach to individual expressions within the grammar. Since the parser calls these functions immediately after matching their respective patterns, there is often little or no extra validation required. For instance, to extract the string from the body of a parsed quoted string, a simple parse action to remove the opening and closing quotation marks, such as:

quotedString.setParseAction( lambda t: t[0][1:-1] )

is sufficient. There is no need to test the leading and trailing characters to see whether they are quotation marks—the function won't be called unless they are.
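A small runnable version of this parse action might look like the following. It works on a copy of quotedString, a precaution taken in this sketch so that the shared module-level expression is left unmodified; the sample input is invented.

```python
from pyparsing import quotedString

# Copy the built-in expression before attaching the parse action, so that
# other users of quotedString in the same program are unaffected.
unquoted = quotedString.copy().setParseAction(lambda t: t[0][1:-1])

tokens = unquoted.parseString('"hello, world"').asList()

print(tokens)  # ['hello, world']
```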

Parse actions can also be used to perform additional validation checks, such as testing whether a matched word exists in a list of valid words, and raising a ParseException if not. Parse actions can also return a constructed list or application object, essentially compiling the input text into a series of executable or callable user objects. Parse actions can be a powerful tool when designing a parser with pyparsing.
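The validation case can be sketched as follows. The word list and the names VALID_COLORS, check_color, and is_valid are hypothetical, introduced only for this example; the parse action uses the three-argument form (original string, match location, tokens) so that it can report where the failure occurred.

```python
from pyparsing import ParseException, Word, alphas

# Hypothetical list of acceptable words.
VALID_COLORS = {"red", "green", "blue"}

def check_color(s, loc, tokens):
    """Reject any matched word that is not in VALID_COLORS."""
    if tokens[0] not in VALID_COLORS:
        raise ParseException(s, loc, "unknown color %r" % tokens[0])
    # Returning None leaves the matched tokens unchanged.

color = Word(alphas).setParseAction(check_color)

def is_valid(text):
    """Return True if text parses as a known color, False otherwise."""
    try:
        color.parseString(text)
        return True
    except ParseException:
        return False

print(is_valid("green"))   # True
print(is_valid("purple"))  # False
```

Because the check raises ParseException, a failed validation is treated like any other failed match, so enclosing alternatives in a larger grammar still get a chance to match.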

Grammars must tolerate change, as grammar evolves or input text becomes more challenging

The death spiral of frustration that is so common when you have to write parsers is not easy to avoid. What starts out as a simple pattern-matching exercise can become progressively complex and unwieldy. The input text can contain data that doesn't quite match the pattern but is needed anyway, so the parser gets a minor patch to include the new variation. Or, a language for which the parser was written gains a new addition to the language syntax. After this happens several times, the patches begin to get in the way of the original pattern definition, and further patches get more and more difficult. When a new change occurs after a quiet period of a few months or so, reacquiring the parser knowledge takes longer than expected, and this just adds to the frustration.

Pyparsing doesn't cure this problem, but the grammar definition techniques and the coding style it fosters in the grammar and parser code make many of these problems simpler. Individual elements of the grammar are likely to be explicit and easy to find, and correspondingly easy to extend or modify. Here is a wonderful quote sent to me by a pyparsing user considering writing a grammar for a particularly tricky parser: "I could just write a custom method, but my past experience was that once I got the basic pyparsing grammar working, it turned out to be more self documenting and easier to maintain/extend."
