As our first example, let's look at a simple set of scores for college football games that might be given in a data file. Each row of text gives the date of a game, followed by each school's name and score.
    09/04/2004 Virginia 44 Temple 14
    09/04/2004 LSU 22 Oregon State 21
    09/09/2004 Troy State 24 Missouri 14
    01/02/2003 Florida State 103 University of Miami 2
Our BNF for this data is simple and clean:
    digit ::= '0'..'9'
    alpha ::= 'A'..'Z' | 'a'..'z'
    date ::= digit+ '/' digit+ '/' digit+
    schoolName ::= ( alpha+ )+
    score ::= digit+
    schoolAndScore ::= schoolName score
    gameResult ::= date schoolAndScore schoolAndScore
We begin building up our parser by converting these BNF definitions into pyparsing class instances. Just as we did in the extended "Hello, World!" program, we'll start by defining the basic building blocks that will later get combined to form the complete grammar:
    # nums and alphas are already defined by pyparsing
    num = Word(nums)
    date = num + "/" + num + "/" + num
    schoolName = OneOrMore( Word(alphas) )
Notice that you can use the + operator to combine pyparsing expressions and string literals. Using these basic elements, we can finish the grammar by combining them into larger expressions:
    score = Word(nums)
    schoolAndScore = schoolName + score
    gameResult = date + schoolAndScore + schoolAndScore
We use the gameResult expression to parse the individual lines of the input text:
    # the backslash keeps the first test line from being an empty string
    tests = """\
    09/04/2004 Virginia 44 Temple 14
    09/04/2004 LSU 22 Oregon State 21
    09/09/2004 Troy State 24 Missouri 14
    01/02/2003 Florida State 103 University of Miami 2""".splitlines()

    for test in tests:
        stats = gameResult.parseString(test)
        print(stats.asList())
Just as we saw in the "Hello, World!" parser, we get an unstructured list of strings from this grammar:
    ['09', '/', '04', '/', '2004', 'Virginia', '44', 'Temple', '14']
    ['09', '/', '04', '/', '2004', 'LSU', '22', 'Oregon', 'State', '21']
    ['09', '/', '09', '/', '2004', 'Troy', 'State', '24', 'Missouri', '14']
    ['01', '/', '02', '/', '2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']
The first change we'll make is to combine the tokens returned by date into a single MM/DD/YYYY date string. The pyparsing Combine class does this for us by simply wrapping the composed expression:
date = Combine( num + "/" + num + "/" + num )
With this single change, the parsed results become:
    ['09/04/2004', 'Virginia', '44', 'Temple', '14']
    ['09/04/2004', 'LSU', '22', 'Oregon', 'State', '21']
    ['09/09/2004', 'Troy', 'State', '24', 'Missouri', '14']
    ['01/02/2003', 'Florida', 'State', '103', 'University', 'of', 'Miami', '2']
Combine actually performs two tasks for us. In addition to concatenating the matched tokens into a single string, it also enforces that the tokens are adjacent in the incoming text.
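
The adjacency rule is easy to check in isolation. The following is a minimal sketch, not part of the original listing: the names loose_date and strict_date and the spaced-out input string are made up for illustration. It compares the plain date expression with the Combine-wrapped one on the same input.

```python
from pyparsing import Word, Combine, ParseException, nums

num = Word(nums)

# Without Combine, pyparsing skips whitespace between tokens.
loose_date = num + "/" + num + "/" + num
# With Combine, the tokens must be adjacent and are joined into one string.
strict_date = Combine(num + "/" + num + "/" + num)

spaced = "09 / 04 / 2004"   # hypothetical input with embedded spaces
loose_tokens = loose_date.parseString(spaced).asList()

try:
    strict_date.parseString(spaced)
    strict_matched = True
except ParseException:
    strict_matched = False

compact_tokens = strict_date.parseString("09/04/2004").asList()

print(loose_tokens)     # five separate tokens
print(strict_matched)   # the Combine version rejects the spaced input
print(compact_tokens)   # one combined token
```

The uncombined expression accepts the spaced input and returns five tokens, while the Combine version only matches when the digits and slashes touch.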
The next change to make will be to combine the school names, too. Because Combine's default behavior requires the tokens to be adjacent, and some of the school names have embedded spaces, we will not use it here. Instead we'll define a routine to be run at parse time to join and return the tokens as a single string. As mentioned previously, such routines are referred to in pyparsing as parse actions, and they can perform a variety of functions during the parsing process.
For this example, we will define a parse action that takes the parsed tokens, uses the string join method, and returns the joined string. This is such a simple parse action that it can be written as a Python lambda. The parse action gets hooked to a particular expression by calling setParseAction, as in:
schoolName.setParseAction( lambda tokens: " ".join(tokens) )
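
To see the effect of this parse action on its own, here is a small sketch that parses one school name before and after the action is attached. The variable names before and after are illustrative only.

```python
from pyparsing import Word, OneOrMore, alphas

schoolName = OneOrMore(Word(alphas))

# Without a parse action, each word comes back as a separate token.
before = schoolName.parseString("University of Miami").asList()

# setParseAction modifies the expression in place; the lambda's return
# value replaces the matched tokens.
schoolName.setParseAction(lambda tokens: " ".join(tokens))
after = schoolName.parseString("University of Miami").asList()

print(before)
print(after)
```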
Another common use for parse actions is to do additional semantic validation, beyond the basic syntax matching that is defined in the expressions. For instance, the expression for date will accept 03023/808098/29921 as a valid date, and this is certainly not desirable. A parse action to validate the input date could use time.strptime to parse the time string into an actual date:
time.strptime(tokens[0],"%m/%d/%Y")
If strptime fails, then it will raise a ValueError exception. Pyparsing uses its own exception class, ParseException, for signaling whether an expression matched or not. Parse actions can raise their own exceptions to indicate that, even though the syntax matched, some higher-level validation failed. Our validation parse action would look like this:
    def validateDateString(tokens):
        try:
            time.strptime(tokens[0], "%m/%d/%Y")
        except ValueError:
            # syntax matched, but the value is not a real calendar date
            raise ParseException("Invalid date string (%s)" % tokens[0])
    date.setParseAction(validateDateString)
If we change the date in the first line of the input to 19/04/2004, we get the exception:
pyparsing.ParseException: Invalid date string (19/04/2004) (at char 0), (line:1, col:1)
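
Calling code will usually want to catch this exception rather than let it end the program. The sketch below assumes a small helper named is_valid, which is not part of the original listing, and runs the validating date expression against one good and two bad date strings.

```python
import time
from pyparsing import Word, Combine, ParseException, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)

def validateDateString(tokens):
    try:
        time.strptime(tokens[0], "%m/%d/%Y")
    except ValueError:
        raise ParseException("Invalid date string (%s)" % tokens[0])

date.setParseAction(validateDateString)

def is_valid(text):
    """Hypothetical helper: True if text parses as a valid date."""
    try:
        date.parseString(text)
        return True
    except ParseException:
        return False

samples = ("09/04/2004", "19/04/2004", "03023/808098/29921")
checks = [(text, is_valid(text)) for text in samples]
for text, ok in checks:
    print("%s -> %s" % (text, ok))
```

All three strings satisfy the syntax of the date expression; only the first survives the semantic check in the parse action.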
Another modifier of the parsed results is the pyparsing Group class. Group does not change the parsed tokens; instead, it nests them within a sublist. Group is a useful class for providing structure to the results returned from parsing:
    score = Word(nums)
    schoolAndScore = Group( schoolName + score )
With grouping and joining, the parsed results are now structured into nested lists of strings:
    ['09/04/2004', ['Virginia', '44'], ['Temple', '14']]
    ['09/04/2004', ['LSU', '22'], ['Oregon State', '21']]
    ['09/09/2004', ['Troy State', '24'], ['Missouri', '14']]
    ['01/02/2003', ['Florida State', '103'], ['University of Miami', '2']]
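
As a side-by-side illustration of what Group contributes, this sketch parses the same school-and-score text with and without grouping. The names flat and nested, and the shortened input without a date, are assumptions made for the example.

```python
from pyparsing import Word, Group, OneOrMore, alphas, nums

schoolName = OneOrMore(Word(alphas))
schoolName.setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums)

text = "LSU 22 Oregon State 21"

# Ungrouped: all tokens land in one flat list.
flat = (schoolName + score + schoolName + score).parseString(text).asList()

# Grouped: each school/score pair becomes its own sublist.
schoolAndScore = Group(schoolName + score)
nested = (schoolAndScore + schoolAndScore).parseString(text).asList()

print(flat)
print(nested)
```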
Finally, we will add one more parse action to perform the conversion of numeric strings into actual integers. This is a very common use for parse actions, and it also shows how pyparsing can return structured data, not just nested lists of parsed strings. This parse action is also simple enough to implement as a lambda:
score = Word(nums).setParseAction( lambda tokens : int(tokens[0]) )
Once again, we can define our parse action to perform this conversion, without the need for error handling in case the argument to int is not a valid integer string. The only time this lambda will ever be called is with a string that matches the pyparsing expression Word(nums), which guarantees that only valid numeric strings will be passed to the parse action.
Our parsed results are starting to look like real database records or objects:
    ['09/04/2004', ['Virginia', 44], ['Temple', 14]]
    ['09/04/2004', ['LSU', 22], ['Oregon State', 21]]
    ['09/09/2004', ['Troy State', 24], ['Missouri', 14]]
    ['01/02/2003', ['Florida State', 103], ['University of Miami', 2]]
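
The conversion matters for more than tidiness: comparing score strings can pick the wrong winner. The sketch below uses the 103 to 2 score from the sample data; raw_score and int_score are illustrative names, not part of the original grammar.

```python
from pyparsing import Word, nums

raw_score = Word(nums)   # returns strings
int_score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))

raw_pair = (raw_score + raw_score).parseString("103 2").asList()
int_pair = (int_score + int_score).parseString("103 2").asList()

# Strings compare character by character, so "103" sorts before "2".
string_says_first_wins = raw_pair[0] > raw_pair[1]
int_says_first_wins = int_pair[0] > int_pair[1]
margin = int_pair[0] - int_pair[1]

print(raw_pair, string_says_first_wins)
print(int_pair, int_says_first_wins, margin)
```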
At this point, the returned data is structured and converted so that we could do some actual processing on the data, such as listing the game results by date and marking the winning team. The ParseResults object passed back from parseString allows us to index into the parsed data using nested list notation, but for data with this kind of structure, things get ugly fairly quickly:
    for test in tests:
        stats = gameResult.parseString(test)
        if stats[1][1] != stats[2][1]:
            if stats[1][1] > stats[2][1]:
                result = "won by " + stats[1][0]
            else:
                result = "won by " + stats[2][0]
        else:
            result = "tied"
        print("%s %s(%d) %s(%d), %s" % (stats[0], stats[1][0], stats[1][1],
                                        stats[2][0], stats[2][1], result))
Not only do the indexes make the code hard to follow (and easy to get wrong!), the processing of the parsed data is very sensitive to the order of things in the results. If our grammar included some optional fields, we would have to include other logic to test for the existence of those fields, and adjust the indexes accordingly. This makes for a very fragile parser.
We could try using multiple variable assignment to reduce the indexing like we did in '"Hello, World!" on Steroids!':
    for test in tests:
        stats = gameResult.parseString(test)
        gamedate,team1,team2 = stats # <- assign parsed bits to individual variable names
        if team1[1] != team2[1]:
            if team1[1] > team2[1]:
                result = "won by " + team1[0]
            else:
                result = "won by " + team2[0]
        else:
            result = "tied"
        print("%s %s(%d) %s(%d), %s" % (gamedate, team1[0], team1[1],
                                        team2[0], team2[1], result))
But this still leaves us sensitive to the order of the parsed data.
Instead, we can define names in the grammar that different expressions should use to label the resulting tokens returned by those expressions. To do this, we insert calls to setResultsName into our grammar, so that expressions will label the tokens as they are accumulated into the ParseResults for the overall grammar:
    schoolAndScore = Group( schoolName.setResultsName("school") +
                            score.setResultsName("score") )
    gameResult = (date.setResultsName("date") +
                  schoolAndScore.setResultsName("team1") +
                  schoolAndScore.setResultsName("team2"))
And the code to process the results is more readable:
    if stats.team1.score != stats.team2.score:
        if stats.team1.score > stats.team2.score:
            result = "won by " + stats.team1.school
        else:
            result = "won by " + stats.team2.school
    else:
        result = "tied"
    print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
                                    stats.team2.school, stats.team2.score, result))
This code has the added bonus of being able to refer to individual tokens by name rather than by index, making the processing code immune to changes in the token order and to the presence/absence of optional data fields.
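
As a quick check that named access and index access refer to the same tokens, the following sketch parses a single line from the sample data and reads it both ways. The tuples by_name and by_index exist only for this comparison.

```python
from pyparsing import Word, Group, Combine, OneOrMore, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
schoolName = OneOrMore(Word(alphas))
schoolName.setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group(schoolName.setResultsName("school") +
                       score.setResultsName("score"))
gameResult = (date.setResultsName("date") +
              schoolAndScore.setResultsName("team1") +
              schoolAndScore.setResultsName("team2"))

stats = gameResult.parseString("09/04/2004 LSU 22 Oregon State 21")

# The same five values, read by results name and by position.
by_name = (stats.date, stats.team1.school, stats.team1.score,
           stats.team2.school, stats.team2.score)
by_index = (stats[0], stats[1][0], stats[1][1], stats[2][0], stats[2][1])

print(by_name)
print(by_name == by_index)
```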
Creating ParseResults with results names will enable you to use dict-style semantics to access the tokens. For example, you can use ParseResults objects to supply data values to interpolated strings with labeled fields, further simplifying the output code:
    print("%(date)s %(team1)s %(team2)s" % stats)
This gives the following:
    09/04/2004 ['Virginia', 44] ['Temple', 14]
    09/04/2004 ['LSU', 22] ['Oregon State', 21]
    09/09/2004 ['Troy State', 24] ['Missouri', 14]
    01/02/2003 ['Florida State', 103] ['University of Miami', 2]
ParseResults also implements the keys(), items(), and values() methods, and supports key testing with Python's in keyword.
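
A brief sketch of these dict-style methods follows. The key "venue" is a made-up name used only to show a failed membership test, and the calls are wrapped in sorted() so the example does not depend on the order in which keys are returned.

```python
from pyparsing import Word, Group, Combine, OneOrMore, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
schoolName = OneOrMore(Word(alphas))
schoolName.setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group(schoolName.setResultsName("school") +
                       score.setResultsName("score"))
gameResult = (date.setResultsName("date") +
              schoolAndScore.setResultsName("team1") +
              schoolAndScore.setResultsName("team2"))

stats = gameResult.parseString("09/04/2004 Virginia 44 Temple 14")

has_date = "date" in stats      # a name defined in the grammar
has_venue = "venue" in stats    # hypothetical name, never defined
names = sorted(stats.keys())
team1_items = sorted(stats.team1.items())

print(has_date, has_venue)
print(names)
print(team1_items)
```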
For debugging, you can call dump() to return a string showing the nested token list, followed by a hierarchical listing of keys and values. Here is a sample of calling stats.dump() for the first line of input text:
    print(stats.dump())

    ['09/04/2004', ['Virginia', 44], ['Temple', 14]]
    - date: 09/04/2004
    - team1: ['Virginia', 44]
      - school: Virginia
      - score: 44
    - team2: ['Temple', 14]
      - school: Temple
      - score: 14
Finally, you can generate XML representing this same hierarchy by calling stats.asXML() and specifying a root element name:
    print(stats.asXML("GAME"))

    <GAME>
      <date>09/04/2004</date>
      <team1>
        <school>Virginia</school>
        <score>44</score>
      </team1>
      <team2>
        <school>Temple</school>
        <score>14</score>
      </team2>
    </GAME>
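
The same hierarchy can also be pulled out as plain Python dictionaries. This sketch relies on the asDict() method of ParseResults, which is not covered in the text above; it is offered as an alternative because, as far as I know, asXML() was dropped in the pyparsing 3.x releases, so treat the availability of either method as dependent on the installed version.

```python
from pyparsing import Word, Group, Combine, OneOrMore, alphas, nums

num = Word(nums)
date = Combine(num + "/" + num + "/" + num)
schoolName = OneOrMore(Word(alphas))
schoolName.setParseAction(lambda tokens: " ".join(tokens))
score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
schoolAndScore = Group(schoolName.setResultsName("school") +
                       score.setResultsName("score"))
gameResult = (date.setResultsName("date") +
              schoolAndScore.setResultsName("team1") +
              schoolAndScore.setResultsName("team2"))

stats = gameResult.parseString("09/04/2004 Virginia 44 Temple 14")

# asDict() converts named results, including nested groups, to dicts.
game = stats.asDict()
print(game["date"])
print(game["team1"])
print(game["team2"])
```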
There is one last issue to deal with, having to do with validation of the input text. Pyparsing will parse a grammar until it reaches the end of the grammar, and then return the matched results, even if the input string has more text in it. For instance, this code:
    word = Word("A")
    data = "AAA AA AAA BA AAA"
    print(OneOrMore(word).parseString(data))
will not raise an exception, but simply return:
['AAA', 'AA', 'AAA']
even though the string continues with more "AAA" words to be parsed. Many times, this "extra" text is really more data, but with some mismatch that does not satisfy the continued parsing of the grammar.
To check whether your grammar has processed the entire string, pyparsing provides a class StringEnd (and a built-in expression stringEnd) that you can add to the end of the grammar. This is your way of signifying, "at this point, I expect there to be no more text; this should be the end of the input string." If the grammar has left some part of the input unparsed, then StringEnd will raise a ParseException. Note that if there is trailing whitespace, pyparsing will automatically skip over it before testing for end-of-string.
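
Here is a minimal sketch of that behavior using the Word("A") grammar from above. The names partial, strict_ok, and trailing are illustrative, and the second input string with trailing spaces is made up for the example.

```python
from pyparsing import Word, OneOrMore, StringEnd, ParseException

word = Word("A")
data = "AAA AA AAA BA AAA"

grammar = OneOrMore(word)
strict = grammar + StringEnd()

# Without StringEnd, parsing stops quietly at the first mismatch.
partial = grammar.parseString(data).asList()

# With StringEnd, the leftover "BA AAA" triggers a ParseException.
try:
    strict.parseString(data)
    strict_ok = True
except ParseException:
    strict_ok = False

# Trailing whitespace is skipped before the end-of-string test.
trailing = strict.parseString("AAA AA   ").asList()

print(partial)
print(strict_ok)
print(trailing)
```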
In our current application, adding stringEnd to the end of our parsing expression will protect against accidentally matching
09/04/2004 LSU 2x2 Oregon State 21
as:
09/04/2004 ['LSU', 2] ['x', 2]
treating this as a tie game between LSU and College X. Instead we get a ParseException that looks like:
pyparsing.ParseException: Expected stringEnd (at char 44), (line:1, col:45)
Here is a complete listing of the parser code:
    from pyparsing import (Word, Group, Combine, Suppress, OneOrMore, alphas, nums,
                           alphanums, stringEnd, ParseException)
    import time

    num = Word(nums)
    date = Combine(num + "/" + num + "/" + num)

    def validateDateString(tokens):
        try:
            time.strptime(tokens[0], "%m/%d/%Y")
        except ValueError:
            raise ParseException("Invalid date string (%s)" % tokens[0])
    date.setParseAction(validateDateString)

    schoolName = OneOrMore( Word(alphas) )
    schoolName.setParseAction( lambda tokens: " ".join(tokens) )
    score = Word(nums).setParseAction(lambda tokens: int(tokens[0]))
    schoolAndScore = Group( schoolName.setResultsName("school") +
                            score.setResultsName("score") )
    gameResult = (date.setResultsName("date") +
                  schoolAndScore.setResultsName("team1") +
                  schoolAndScore.setResultsName("team2"))

    # the backslash keeps the first test line from being an empty string
    tests = """\
    09/04/2004 Virginia 44 Temple 14
    09/04/2004 LSU 22 Oregon State 21
    09/09/2004 Troy State 24 Missouri 14
    01/02/2003 Florida State 103 University of Miami 2""".splitlines()

    for test in tests:
        stats = (gameResult + stringEnd).parseString(test)

        if stats.team1.score != stats.team2.score:
            if stats.team1.score > stats.team2.score:
                result = "won by " + stats.team1.school
            else:
                result = "won by " + stats.team2.school
        else:
            result = "tied"
        print("%s %s(%d) %s(%d), %s" % (stats.date, stats.team1.school, stats.team1.score,
                                        stats.team2.school, stats.team2.score, result))
        # or print one of these alternative formats
        #print("%(date)s %(team1)s %(team2)s" % stats)
        #print(stats.asXML("GAME"))