Parsing Data from a Table—Using Parse Actions and ParseResults

As our first example, let's look at a simple set of scores for college football games that might be given in a datafile. Each row of text gives the date of a game, followed by the two school names, each with that school's score.

09/04/2004  Virginia              44   Temple             14
09/04/2004  LSU                   22   Oregon State       21
09/09/2004  Troy State            24   Missouri           14
01/02/2003  Florida State        103   University of Miami 2

Our BNF for this data is simple and clean:

digit      ::= '0'..'9'
alpha      ::= 'A'..'Z' | 'a'..'z'
date       ::= digit+ '/' digit+ '/' digit+
schoolName ::= ( alpha+ )+
score      ::= digit+
schoolAndScore ::= schoolName score
gameResult ::= date schoolAndScore schoolAndScore

We begin building up our parser by converting these BNF definitions into pyparsing class instances. Just as we did in the extended "Hello, World!" program, we'll start by defining the basic building blocks that will later get combined to form the complete grammar:

# nums and alphas are already defined by pyparsing
num = Word(nums)
date = num + "/" + num + "/" + num
schoolName = OneOrMore( Word(alphas) )

Notice that you can compose pyparsing expressions using the + operator to combine pyparsing expressions and string literals. Using these basic elements, we can finish the grammar by combining them into larger expressions:

score = Word(nums)
schoolAndScore = schoolName + score
gameResult = date + schoolAndScore + schoolAndScore

We use the gameResult expression to parse the individual lines of the input text:

tests = """\
    09/04/2004  Virginia              44   Temple             14
    09/04/2004  LSU                   22   Oregon State       21
    09/09/2004  Troy State            24   Missouri           14
    01/02/2003  Florida State        103   University of Miami 2""".splitlines()

for test in tests:
    stats = gameResult.parseString(test)
    print(stats.asList())

Just as we saw in the "Hello, World!" parser, we get an unstructured list of strings from this grammar:

['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
['09', '/', '04', '/', '2004', 'LSU', '22', 'Oregon', 'State', '21']
['09', '/', '09', '/', '2004', 'Troy', 'State', '24', 'Missouri', '14']
['01', '/', '02', '/', '2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']

The first change we'll make is to combine the tokens returned by date into a single MM/DD/YYYY date string. The pyparsing Combine class does this for us by simply wrapping the composed expression:

date = Combine( num + "/" + num + "/" + num )

With this single change, the parsed results become:

['09/04/2004', 'Virginia', '44', 'Temple', '14']
['09/04/2004', 'LSU', '22', 'Oregon', 'State', '21']
['09/09/2004', 'Troy', 'State', '24', 'Missouri', '14']
['01/02/2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']

Combine actually performs two tasks for us. In addition to concatenating the matched tokens into a single string, it also enforces that the tokens are adjacent in the incoming text.
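The adjacency requirement is easy to see in a small experiment. The following is a minimal sketch, assuming pyparsing is installed; the helper date_parts and the variable names are illustrative only and are not part of the chapter's parser.

```python
from pyparsing import Combine, ParseException, Word, nums

def date_parts():
    # Build a fresh digits/digits/digits expression on each call.
    num = Word(nums)
    return num + "/" + num + "/" + num

loose_date = date_parts()             # whitespace between tokens is skipped
strict_date = Combine(date_parts())   # tokens must be adjacent in the input

spaced = "09/ 04/2004"

# Without Combine, the embedded space is ignored and separate tokens come back.
loose_tokens = loose_date.parseString(spaced).asList()

# With Combine, the same text is rejected because the tokens are not adjacent.
try:
    strict_date.parseString(spaced)
    strict_rejected = False
except ParseException:
    strict_rejected = True

# Properly formed input is returned as a single string.
strict_tokens = strict_date.parseString("09/04/2004").asList()
```

Here loose_tokens holds five separate strings, strict_rejected is True, and strict_tokens holds the single string '09/04/2004'.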

The next change to make will be to combine the school names, too. Because Combine's default behavior requires that the tokens be adjacent, we will not use it, since some of the school names have embedded spaces. Instead we'll define a routine to be run at parse time to join and return the tokens as a single string. As mentioned previously, such routines are referred to in pyparsing as parse actions, and they can perform a variety of functions during the parsing process.

For this example, we will define a parse action that takes the parsed tokens, uses the string join function, and returns the joined string. This is such a simple parse action that it can be written as a Python lambda. The parse action gets hooked to a particular expression by calling setParseAction, as in:

schoolName.setParseAction( lambda tokens: " ".join(tokens) )
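To see the effect of this parse action in isolation, here is a small sketch that parses one school name with and without it. It assumes pyparsing is installed; the names plain_name and joined_name are made up for this illustration.

```python
from pyparsing import OneOrMore, Word, alphas

# The same expression twice: once bare, once with the joining parse action.
plain_name = OneOrMore(Word(alphas))
joined_name = OneOrMore(Word(alphas))
joined_name.setParseAction(lambda tokens: " ".join(tokens))

text = "University of Miami"
plain_tokens = plain_name.parseString(text).asList()
joined_tokens = joined_name.parseString(text).asList()
```

The bare expression returns one token per word, while the version with the parse action returns the whole name as one token.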

Another common use for parse actions is to do additional semantic validation, beyond the basic syntax matching that is defined in the expressions. For instance, the expression for date will accept 03023/808098/29921 as a valid date, and this is certainly not desirable. A parse action to validate the input date could use time.strptime to parse the time string into an actual date:

time.strptime(tokens[0],"%m/%d/%Y")

If strptime fails, then it will raise a ValueError exception. Pyparsing uses its own exception class, ParseException, for signaling whether an expression matched or not. Parse actions can raise their own exceptions to indicate that, even though the syntax matched, some higher-level validation failed. Our validation parse action would look like this:

def validateDateString(tokens):
    # strptime raises ValueError if the text is not a real MM/DD/YYYY date
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])
date.setParseAction(validateDateString)

If we change the date in the first line of the input to 19/04/2004, we get the exception:

pyparsing.ParseException: Invalid date string (19/04/2004) (at char 0), (line:1, col:1)
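The check that strptime performs can be tried on its own, outside of any parser. This sketch uses only the standard library; the helper name is_valid_date is hypothetical and simply wraps the same strptime call in a function that returns True or False.

```python
import time

def is_valid_date(text):
    """Return True if text is a real calendar date in MM/DD/YYYY form."""
    try:
        time.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        # strptime rejects out-of-range fields such as month 19.
        return False

good = is_valid_date("09/04/2004")
bad_month = is_valid_date("19/04/2004")
nonsense = is_valid_date("03023/808098/29921")
```

The first call succeeds; the other two fail, even though both strings satisfy the purely syntactic date expression.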

Another modifier of the parsed results is the pyparsing Group class. Group does not change the parsed tokens; instead, it nests them within a sublist. Group is a useful class for providing structure to the results returned from parsing:

score = Word(nums)
schoolAndScore = Group( schoolName + score )

With grouping and joining, the parsed results are now structured into nested lists of strings:

['09/04/2004', ['Virginia', '44'], ['Temple', '14']]
['09/04/2004', ['LSU', '22'], ['Oregon State', '21']]
['09/09/2004', ['Troy State', '24'], ['Missouri', '14']]
['01/02/2003', ['Florida State', '103'], ['University of Miami', '2']]

Finally, we will add one more parse action to perform the conversion of numeric strings into actual integers. This is a very common use for parse actions, and it also shows how pyparsing can return structured data, not just nested lists of parsed strings. This parse action is also simple enough to implement as a lambda:

score = Word(nums).setParseAction( lambda tokens : int(tokens[0]) )

Once again, we can define our parse action to perform this conversion, without the need for error handling in case the argument to int is not a valid integer string. The only time this lambda will ever be called is with a string that matches the pyparsing expression Word(nums), which guarantees that only valid numeric strings will be passed to the parse action.
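One way to convince yourself of this is to record every string the parse action receives. The sketch below assumes pyparsing is installed; the to_int function and the calls list are illustrative stand-ins for the lambda shown above.

```python
from pyparsing import ParseException, Word, nums

calls = []  # every string handed to the parse action

def to_int(tokens):
    calls.append(tokens[0])
    return int(tokens[0])

score = Word(nums).setParseAction(to_int)

converted = score.parseString("103").asList()

# Non-numeric text fails to match Word(nums), so the action never sees it.
try:
    score.parseString("abc")
    rejected = False
except ParseException:
    rejected = True
```

After both calls, the calls list contains only '103': the failed match raised ParseException before the conversion was attempted.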

Our parsed results are starting to look like real database records or objects:

['09/04/2004', ['Virginia', 44], ['Temple', 14]]
['09/04/2004', ['LSU', 22], ['Oregon State', 21]]
['09/09/2004', ['Troy State', 24], ['Missouri', 14]]
['01/02/2003', ['Florida State', 103], ['University of Miami', 2]]

At this point, the returned data is structured and converted so that we could do some actual processing on the data, such as listing the game results by date and marking the winning team. The ParseResults object passed back from parseString allows us to index into the parsed data using nested list notation, but for data with this kind of structure, things get ugly fairly quickly:

for test in tests:
    stats = gameResult.parseString(test)
    if stats[1][1] != stats[2][1]:
        if stats[1][1] > stats[2][1]:
            result = "won by " + stats[1][0]
        else:
            result = "won by " + stats[2][0]
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats[0], stats[1][0], stats[1][1],
                                    stats[2][0], stats[2][1], result))

Not only do the indexes make the code hard to follow (and easy to get wrong!), the processing of the parsed data is very sensitive to the order of things in the results. If our grammar included some optional fields, we would have to include other logic to test for the existence of those fields, and adjust the indexes accordingly. This makes for a very fragile parser.

We could try using multiple variable assignment to reduce the indexing like we did in '"Hello, World!" on Steroids!':

for test in tests:
    stats = gameResult.parseString(test)
    gamedate,team1,team2 = stats  # <- assign parsed bits to individual variable names
    if team1[1] != team2[1]:
        if team1[1] > team2[1]:
            result = "won by " + team1[0]
        else:
            result = "won by " + team2[0]
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (gamedate, team1[0], team1[1], team2[0], team2[1],
             result))

But this still leaves us sensitive to the order of the parsed data.

Instead, we can define names in the grammar that different expressions should use to label the resulting tokens returned by those expressions. To do this, we insert calls to setResultsName into our grammar, so that expressions will label the tokens as they are accumulated into the ParseResults for the overall grammar:

schoolAndScore = Group(
    schoolName.setResultsName("school") +
    score.setResultsName("score") )
gameResult = ( date.setResultsName("date") +
    schoolAndScore.setResultsName("team1") +
    schoolAndScore.setResultsName("team2") )

And the code to process the results is more readable:

for test in tests:
    stats = gameResult.parseString(test)
    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
             stats.team2.school, stats.team2.score, result))

This code has the added bonus of being able to refer to individual tokens by name rather than by index, making the processing code immune to changes in the token order and to the presence/absence of optional data fields.
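To illustrate the point about optional fields, here is a sketch of a hypothetical variation of the grammar in which a school name may be preceded by a ranking such as "#3". The rank field is an invention for this example and does not appear in the sample data; the sketch assumes pyparsing is installed.

```python
from pyparsing import Group, OneOrMore, Optional, Suppress, Word, alphas, nums

# Hypothetical optional field: "#<digits>" in front of a school name.
rank = Suppress("#") + Word(nums).setResultsName("rank")

schoolName = OneOrMore(Word(alphas)).setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group(Optional(rank)
                       + schoolName.setResultsName("school")
                       + score.setResultsName("score"))
gameResult = (schoolAndScore.setResultsName("team1")
              + schoolAndScore.setResultsName("team2"))

ranked = gameResult.parseString("#3 LSU 22   Oregon State 21")
unranked = gameResult.parseString("LSU 22   Oregon State 21")

# The optional field shifts the positions of the tokens in the sublist...
ranked_tokens = ranked[0].asList()      # rank, school, score
unranked_tokens = unranked[0].asList()  # school, score

# ...but access by name is the same either way.
same_school = ranked.team1.school == unranked.team1.school
```

Code that reads team1.school and team1.score works unchanged for both inputs, and a test such as "rank" in ranked.team1 reports whether the optional field was present.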

Creating ParseResults with results names will enable you to use dict-style semantics to access the tokens. For example, you can use ParseResults objects to supply data values to interpolated strings with labeled fields, further simplifying the output code:

print("%(date)s %(team1)s %(team2)s" % stats)

This gives the following:

09/04/2004 ['Virginia', 44] ['Temple', 14]
09/04/2004 ['LSU', 22] ['Oregon State', 21]
09/09/2004 ['Troy State', 24] ['Missouri', 14]
01/02/2003 ['Florida State', 103] ['University of Miami', 2]

ParseResults also implements the keys(), items(), and values() methods, and supports key testing with Python's in keyword.
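A short sketch of these dict-style methods, using a deliberately tiny grammar rather than the full game parser. It assumes pyparsing is installed; the entry expression and the variable names are made up for the illustration. Because the exact return types of keys() and items() differ between pyparsing versions, the results are wrapped in sorted() and dict().

```python
from pyparsing import Word, alphas, nums

entry = (Word(alphas).setResultsName("school")
         + Word(nums).setResultsName("score"))
result = entry.parseString("Virginia 44")

names = sorted(result.keys())      # the defined results names
pairs = dict(result.items())       # name -> token mapping
has_score = "score" in result      # key testing with "in"
has_date = "date" in result        # no such name in this grammar
```

Note that the score is still the string '44' here, since this small grammar attaches no conversion parse action.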

For debugging, you can call dump() to return a string showing the nested token list, followed by a hierarchical listing of keys and values. Here is a sample of calling stats.dump() for the first line of input text:

print(stats.dump())

['09/04/2004', ['Virginia', 44], ['Temple', 14]]
- date: 09/04/2004
- team1: ['Virginia', 44]
    - school: Virginia
    - score: 44
- team2: ['Temple', 14]
    - school: Temple
    - score: 14

Finally, you can generate XML representing this same hierarchy by calling stats.asXML() and specifying a root element name:

print(stats.asXML("GAME"))

<GAME>
    <date>09/04/2004</date>
    <team1>
        <school>Virginia</school>
        <score>44</score>
    </team1>
    <team2>
        <school>Temple</school>
        <score>14</score>
    </team2>
</GAME>

There is one last issue to deal with, having to do with validation of the input text. Pyparsing will parse a grammar until it reaches the end of the grammar, and then return the matched results, even if the input string has more text in it. For instance, this statement:

word = Word("A")
data = "AAA AA AAA BA AAA"
print(OneOrMore(word).parseString(data))

will not raise an exception, but simply return:

['AAA', 'AA', 'AAA']

even though the string continues with more "AAA" words to be parsed. Many times, this "extra" text is really more data, but with some mismatch that does not satisfy the continued parsing of the grammar.

To check whether your grammar has processed the entire string, pyparsing provides a class StringEnd (and a built-in expression stringEnd) that you can add to the end of the grammar. This is your way of signifying, "at this point, I expect there to be no more text—this should be the end of the input string." If the grammar has left some part of the input unparsed, then StringEnd will raise a ParseException. Note that if there is trailing whitespace, pyparsing will automatically skip over it before testing for end-of-string.
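Applied to the small Word("A") example, the effect of stringEnd can be sketched as follows. This assumes pyparsing is installed; the variable names are illustrative only.

```python
from pyparsing import OneOrMore, ParseException, Word, stringEnd

word = Word("A")
data = "AAA AA AAA BA AAA"

# Without stringEnd, parsing quietly stops at the first mismatch.
partial = OneOrMore(word).parseString(data).asList()

# With stringEnd, the leftover "BA AAA" causes a ParseException.
try:
    (OneOrMore(word) + stringEnd).parseString(data)
    whole_string_matched = True
except ParseException:
    whole_string_matched = False

# Trailing whitespace is skipped before the end-of-string test.
padded = (OneOrMore(word) + stringEnd).parseString("AAA AA   ").asList()
```

The stringEnd expression adds no tokens of its own, so the padded result contains only the two matched words.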

In our current application, adding stringEnd to the end of our parsing expression will protect against accidentally matching

09/04/2004  LSU                   2x2   Oregon State       21

as:

09/04/2004 ['LSU', 2] ['x', 2]

treating this as a tie game between LSU and College X. Instead we get a ParseException that looks like:

pyparsing.ParseException: Expected stringEnd (at char 44), (line:1, col:45)

Here is a complete listing of the parser code:

from pyparsing import (Word, Group, Combine, Suppress, OneOrMore, alphas, nums,
    alphanums, stringEnd, ParseException)
import time
num = Word(nums)
date = Combine(num + "/" + num + "/" + num)

def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])
date.setParseAction(validateDateString)

schoolName = OneOrMore( Word(alphas) )
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group( schoolName.setResultsName("school") + 
        score.setResultsName("score") )
gameResult = ( date.setResultsName("date") + schoolAndScore.setResultsName("team1") +
        schoolAndScore.setResultsName("team2") )

tests = """\
    09/04/2004  Virginia              44   Temple             14
    09/04/2004  LSU                   22   Oregon State       21
    09/09/2004  Troy State            24   Missouri           14
    01/02/2003  Florida State        103   University of Miami 2""".splitlines()

for test in tests:
    stats = (gameResult + stringEnd).parseString(test)

    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
            stats.team2.school, stats.team2.score, result))
    # or print one of these alternative formats
    #print("%(date)s %(team1)s %(team2)s" % stats)
    #print(stats.asXML("GAME"))