Getting Started with Pyparsing

Paul McGuire

October XX, 2007

Abstract

Need to extract data from a text file or a web page? Or do you want to make your application more flexible with user-defined commands or search strings? Do regular expressions and lex/yacc make your eyes blur and your brain hurt?

Pyparsing could be the solution. Pyparsing is a pure-Python class library that makes it easy to build recursive-descent parsers quickly. There is no need to handcraft your own parsing state machine. With pyparsing, you can quickly create HTML page scrapers, logfile data extractors, or complex data structure or command processors. This Short Cut shows you how!


"I need to analyze this logfile..."
"Just extract the data from this web page..."
"We need a simple input command processor..."
"Our source code needs to be migrated to the new API..."

Each of these everyday requests generates the same reflex response in any developer faced with them: "Oh, *&#$*!, not another parser!"

The task of parsing data from loosely formatted text crops up in different forms on a regular basis for most developers. Sometimes it is for one-off development utilities, such as the API-upgrade example, that are purely for internal use. Other times, the parsing task is a user-interface function to be built in to a command-driven application.

If you are working in Python, you can tackle many of these jobs using Python's built-in string methods, most notably split(), index(), and startswith().
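As a minimal sketch of this approach, the following pulls three fields out of a log line using only those string methods. The log-line format and the function name parse_log_line are assumptions made for illustration; they do not come from any particular application.

```python
def parse_log_line(line):
    """Return (level, date, device) for an ERROR line, or None otherwise."""
    # startswith() filters out the lines we are not interested in
    if not line.startswith("ERROR"):
        return None
    # split() with maxsplit=2 separates the two leading fields from the message
    level, date, message = line.split(None, 2)
    # index() locates the marker that precedes the device name
    # (it raises ValueError if the marker is missing)
    pos = message.index("on ")
    device = message[pos + len("on "):]
    return level, date, device


print(parse_log_line("ERROR 2007-10-01 disk full on /dev/sda1"))
# prints ('ERROR', '2007-10-01', '/dev/sda1')
```

This works only as long as every line follows the assumed layout exactly, which is the limitation the next paragraphs describe.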

What makes parser writing unpleasant are those jobs that go beyond simple string splitting and indexing: jobs with a context-varying format, or with a structure defined by a language syntax rather than by a simple character pattern. For instance,

      y = 2 * x + 10

is easy to parse when split on separating spaces. Unfortunately, few users are so careful in their spacing, and an arithmetic expression parser is just as likely to encounter any of these:

      y = 2*x + 10
      y = 2*x+10
      y=2*x+10

Splitting this last expression on whitespace just returns a one-element list containing the original string, with no further insight into the separate elements y, =, 2, etc.
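The effect is easy to see by calling str.split() on each of the four spellings of the expression shown above:

```python
expressions = [
    "y = 2 * x + 10",
    "y = 2*x + 10",
    "y = 2*x+10",
    "y=2*x+10",
]

# split() with no arguments breaks the string on runs of whitespace only
results = [expr.split() for expr in expressions]
for tokens in results:
    print(tokens)

# prints:
# ['y', '=', '2', '*', 'x', '+', '10']
# ['y', '=', '2*x', '+', '10']
# ['y', '=', '2*x+10']
# ['y=2*x+10']
```

Only the first form yields one list item per element of the expression.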

The traditional tools for developing parsing utilities that are beyond processing with just str.split are regular expressions and lex/yacc. Regular expressions use a text string to describe a text pattern to be matched. The text string uses special characters (such as |, +, ., *, and ?) to denote various parsing concepts such as alternation, repetition, and wildcards. Lex and yacc are utilities that lexically detect token boundaries, and then apply processing code to the extracted tokens. Lex and yacc use a separate token-definition file, and then generate lexing and token-processing code templates for the programmer to extend with application-specific behavior.
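To make the regular-expression notation concrete, here is a small sketch that splits the arithmetic expression from earlier into its elements using Python's re module. The pattern is one possible choice written for this illustration, and the names TOKEN and tokenize are hypothetical.

```python
import re

# Alternation (|) separates the three kinds of token:
#   [A-Za-z_]\w*   an identifier: a letter or underscore, then zero or more word characters (*)
#   \d+            an integer: one or more digits (+)
#   [=+*/-]        a single operator character
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[=+*/-]")


def tokenize(expr):
    """Return the tokens found in expr, ignoring any whitespace between them."""
    return TOKEN.findall(expr)


print(tokenize("y=2*x+10"))
# prints ['y', '=', '2', '*', 'x', '+', '10']
```

The pattern copes with any spacing, but note that findall() silently skips characters it cannot match, and the meaning of the pattern lives in the comments rather than in the notation itself.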

The problem in using these traditional tools is that each introduces its own specialized notation, which must then be mapped to a Python design and Python code. In the case of lex/yacc-style tools, a separate code-generation step is usually required.

In practice, parser writing often takes the form of a seemingly endless cycle: write code, run parser on sample text, encounter unexpected text input case, modify code, rerun modified parser, find additional "special" case, etc. Combined with the notation issues of regular expressions, or the extra code-generation steps of lex/yacc, this cyclic process can spiral into frustration.

What Is Pyparsing?

Pyparsing is a pure Python module that you can add to your Python application with little difficulty. Pyparsing's class library provides a set of classes for building up a parser from individual expression elements, up to complex, variable-syntax expressions. Expressions are combined using intuitive operators, such as + for sequentially adding one expression after another, and | and ^ for defining parsing alternatives (meaning "match first alternative" and "match longest alternative," respectively). Repetition and optional elements are expressed using classes such as OneOrMore, ZeroOrMore, and Optional.
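The difference between these operators can be shown in a brief sketch, assuming the pyparsing module is installed. The expression names (integer, real, and so on) and the sample strings are made up for this illustration.

```python
from pyparsing import Combine, OneOrMore, Optional, Word, alphas, nums

integer = Word(nums)
# + joins expressions in sequence; Combine merges the matched pieces into one token
real = Combine(Word(nums) + "." + Word(nums))

# | takes the first alternative that matches, ^ takes the longest match
first_match = integer | real
longest_match = integer ^ real

print(list(first_match.parseString("3.14")))    # prints ['3']
print(list(longest_match.parseString("3.14")))  # prints ['3.14']

# Optional marks an element that may be absent; OneOrMore repeats an expression
signed = Combine(Optional("-") + integer)
words = OneOrMore(Word(alphas))

print(list(signed.parseString("-42")))            # prints ['-42']
print(list(words.parseString("hello big world")))  # prints ['hello', 'big', 'world']
```

With |, the order of the alternatives matters: integer is listed first, so it claims the leading 3 and parsing stops there.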

For example, a regular expression that would parse an IP address followed by a U.S.-style phone number might look like the following:

(\d{1,3}(?:\.\d{1,3}){3})\s+(\(\d{3}\)\d{3}-\d{4})

In contrast, the same expression using pyparsing might be written as follows:

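from pyparsing import Word, Combine, nums
# an IP address is four dot-separated fields of one to three digits each
ipField = Word(nums, max=3)
ipAddr = Combine( ipField + "." + ipField + "." + ipField + "." + ipField )
# a U.S.-style phone number such as (555)123-4567
phoneNum = Combine( "(" + Word(nums, exact=3) + ")" +
                    Word(nums, exact=3) + "-" + Word(nums, exact=4) )
userdata = ipAddr + phoneNum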

Although it is more verbose, the pyparsing version is clearly more readable; it would be much easier to go back and update this version to handle international phone numbers, for example.
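To see the pyparsing expression at work, the sketch below repeats the definitions so that it runs on its own and applies them to a made-up sample string. It assumes the pyparsing module is installed and uses its parseString method (recent releases also accept the spelling parse_string); the sample values are hypothetical.

```python
from pyparsing import Combine, ParseException, Word, nums

ipField = Word(nums, max=3)
ipAddr = Combine(ipField + "." + ipField + "." + ipField + "." + ipField)
phoneNum = Combine("(" + Word(nums, exact=3) + ")" +
                   Word(nums, exact=3) + "-" + Word(nums, exact=4))
userdata = ipAddr + phoneNum

# Whitespace between the two fields is skipped automatically
tokens = list(userdata.parseString("192.168.0.1 (555)123-4567"))
print(tokens)  # prints ['192.168.0.1', '(555)123-4567']

# Input that does not fit the expression raises ParseException
try:
    userdata.parseString("192.168.0.1 555-1234")
    matched = True
except ParseException:
    matched = False
print(matched)  # prints False
```

Each Combine returns its match as a single string, so the result holds one token for the address and one for the phone number.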

Pyparsing is:

  • 100 percent pure Python—no compiled dynamic link libraries (DLLs) or shared libraries are used in pyparsing, so you can use it on any platform that is Python 2.3-compatible.

  • Driven by parsing expressions that are coded inline, using standard Python class notation and constructs. With no separate code-generation process and no specialized character notation, your application is easier to develop, understand, and maintain.

  • Enhanced with helpers for common parsing patterns:

    • C, C++, Java, Python, and HTML comments

    • quoted strings (using single or double quotes, with \' or \" escapes)

    • HTML and XML tags (including upper-/lowercase and tag attribute handling)

    • comma-separated values and delimited lists of arbitrary expressions

  • Small in footprint—Pyparsing's code is contained within a single Python source file, easily dropped into a site-packages directory, or included with your own application.

  • Liberally licensed—MIT license permits any use, commercial or non-commercial.
