"Hello, World!" on Steroids!

Pyparsing comes with a number of examples, including a basic "Hello, World!" parser[1]. This simple example is also covered in the O'Reilly ONLamp.com article "Building Recursive Descent Parsers with Python" (http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html). In this section, I use this same example to introduce many of the basic parsing tools in pyparsing.

The current "Hello, World!" parsers are limited to greetings of the form:

word, word !

This limits our options a bit, so let's expand the grammar to handle more complicated greetings. Let's say we want to parse any of the following:

Hello, World!
Hi, Mom!
Good morning, Miss Crabtree!
Yo, Adrian!
Whattup, G?
How's it goin', Dude?
Hey, Jude!
Goodbye, Mr. Chips!

The first step in writing a parser for these strings is to identify the pattern that they all follow. Following our best practice, we write up this pattern as a BNF. Using ordinary words to describe a greeting, we would say, "a greeting is made up of one or more words (which is the salutation), followed by a comma, followed by one or more additional words (which is the subject of the greeting, or greetee), and ending with either an exclamation point or a question mark." As BNF, this description looks like:

greeting ::= salutation comma greetee endpunc
salutation ::= word+
comma ::= ,
greetee ::= word+
word ::= a collection of one or more characters, which are any alpha or ' or .
endpunc ::= ! | ?

This BNF translates almost directly into pyparsing, using the basic pyparsing elements Word, Literal, OneOrMore, and the helper function oneOf. (One of the translation issues in going from BNF to pyparsing is that BNF is traditionally a "top-down" definition of a grammar. Pyparsing must build its grammar "bottom-up," to ensure that referenced variables are defined before they are used.)

from pyparsing import Word, Literal, OneOrMore, oneOf, alphas

word = Word(alphas+"'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

oneOf is a handy shortcut for defining a list of literal alternatives. It is simpler to write:

endpunc = oneOf("! ?")

than:

endpunc = Literal("!") | Literal("?")

You can call oneOf with a list of items, or with a single string of items separated by whitespace.
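As a quick check of that claim, here is a minimal sketch comparing the two calling styles of oneOf with the longhand Literal version. The variable names are made up for this illustration, and the code assumes Python 3 with pyparsing installed. (Recent pyparsing releases also offer the snake_case spellings one_of, parse_string, and as_list; the camelCase names used throughout this section are kept here for consistency.)

```python
from pyparsing import Literal, oneOf

# Three ways to define the same pair of alternatives.
endpunc_from_string = oneOf("! ?")        # whitespace-separated string
endpunc_from_list = oneOf(["!", "?"])     # explicit list of items
endpunc_longhand = Literal("!") | Literal("?")

parsers = [endpunc_from_string, endpunc_from_list, endpunc_longhand]

# Each parser should accept both punctuation marks.
matches = [
    (p.parseString("!").asList(), p.parseString("?").asList())
    for p in parsers
]
for m in matches:
    print(m)
```

All three definitions should report the same tokens for the same input, so the choice between them is mostly a matter of readability.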

Using our greeting parser on the set of sample strings gives the following results:

['Hello', ',', 'World', '!']
['Hi', ',', 'Mom', '!']
['Good', 'morning', ',', 'Miss', 'Crabtree', '!']
['Yo', ',', 'Adrian', '!']
['Whattup', ',', 'G', '?']
["How's", 'it', "goin'", ',', 'Dude', '?']
['Hey', ',', 'Jude', '!']
['Goodbye', ',', 'Mr.', 'Chips', '!']
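To reproduce the listing above, the grammar and the sample strings can be put together in one small script. This is a sketch, assuming Python 3 with pyparsing installed; the list name tests matches the name used in the loops later in this section, and the name parsed is introduced only for this example.

```python
from pyparsing import Word, Literal, OneOrMore, oneOf, alphas

# The sample greetings from the start of this section.
tests = [
    "Hello, World!",
    "Hi, Mom!",
    "Good morning, Miss Crabtree!",
    "Yo, Adrian!",
    "Whattup, G?",
    "How's it goin', Dude?",
    "Hey, Jude!",
    "Goodbye, Mr. Chips!",
]

# Grammar built bottom-up, mirroring the BNF.
word = Word(alphas + "'.")
salutation = OneOrMore(word)
comma = Literal(",")
greetee = OneOrMore(word)
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

# asList() converts the returned ParseResults into a plain Python list.
parsed = [greeting.parseString(t).asList() for t in tests]
for tokens in parsed:
    print(tokens)
```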

Everything parses into individual tokens all right, but there is very little structure to the results. With this parser, there is quite a bit of work still to do to pick out the significant parts of each greeting string. For instance, to identify the tokens that compose the initial part of the greeting—the salutation—we need to iterate over the results until we reach the comma token:

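# tests is the list of sample greeting strings
for t in tests:
    results = greeting.parseString(t)
    salutation = []
    for token in results:
        # everything before the comma belongs to the salutation
        if token == ",":
            break
        salutation.append(token)
    print(salutation)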

Yuck! We might as well just have written a character-by-character scanner in the first place! Fortunately, we can avoid this drudgery by making our parser a bit smarter.

Since we know that the salutation and greetee parts of the greeting are logical groups, we can use pyparsing's Group class to give more structure to the returned results. By changing the definitions of salutation and greetee to:

salutation = Group( OneOrMore(word) )
greetee = Group( OneOrMore(word) )

our results start to look a bit more organized:

[['Hello'], ',', ['World'], '!']
[['Hi'], ',', ['Mom'], '!']
[['Good', 'morning'], ',', ['Miss', 'Crabtree'], '!']
[['Yo'], ',', ['Adrian'], '!']
[['Whattup'], ',', ['G'], '?']
[["How's", 'it', "goin'"], ',', ['Dude'], '?']
[['Hey'], ',', ['Jude'], '!']
[['Goodbye'], ',', ['Mr.', 'Chips'], '!']

and we can use basic list-to-variable assignment to access the different parts:

for t in tests:
    salutation, dummy, greetee, endpunc = greeting.parseString(t)
    print(salutation, greetee, endpunc)

prints:

['Hello'] ['World'] !
['Hi'] ['Mom'] !
['Good', 'morning'] ['Miss', 'Crabtree'] !
['Yo'] ['Adrian'] !
['Whattup'] ['G'] ?
["How's", 'it', "goin'"] ['Dude'] ?
['Hey'] ['Jude'] !
['Goodbye'] ['Mr.', 'Chips'] !
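The grouped results can also be reached by position rather than by unpacking. The following sketch (Python 3 with pyparsing assumed; the single test string and the variable names are chosen just for illustration) shows that each Group becomes its own nested list inside the results:

```python
from pyparsing import Word, Literal, OneOrMore, Group, oneOf, alphas

word = Word(alphas + "'.")
salutation = Group(OneOrMore(word))   # words before the comma, as one sublist
comma = Literal(",")
greetee = Group(OneOrMore(word))      # words after the comma, as one sublist
endpunc = oneOf("! ?")
greeting = salutation + comma + greetee + endpunc

results = greeting.parseString("Good morning, Miss Crabtree!")

nested = results.asList()             # whole result as nested Python lists
salutation_words = results[0].asList()
greetee_words = results[2].asList()   # index 1 is the comma token

print(nested)
print(salutation_words, greetee_words, results[3])
```

Having to skip over index 1 here is the positional version of the same nuisance that the dummy variable works around in the unpacking loop.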

Note that we had to put in the scratch variable dummy to handle the parsed comma character. The comma is a very important element during parsing, since it shows where the parser stops reading the salutation and starts the greetee. But in the returned results, the comma is not really very interesting at all, and it would be nice to suppress it from the returned results. You can do this by wrapping the definition of comma in a pyparsing Suppress instance:

comma = Suppress( Literal(",") )

There are actually a number of shortcuts built into pyparsing, and since this function is so common, any of the following forms accomplish the same thing:

comma = Suppress( Literal(",") )
comma = Literal(",").suppress()
comma = Suppress(",")

Using one of these forms to suppress the parsed comma, our results are further cleaned up to read:

[['Hello'], ['World'], '!']
[['Hi'], ['Mom'], '!']
[['Good', 'morning'], ['Miss', 'Crabtree'], '!']
[['Yo'], ['Adrian'], '!']
[['Whattup'], ['G'], '?']
[["How's", 'it', "goin'"], ['Dude'], '?']
[['Hey'], ['Jude'], '!']
[['Goodbye'], ['Mr.', 'Chips'], '!']

The results-handling code can now drop the distracting dummy variable, and just use:

for t in tests:
    salutation, greetee, endpunc = greeting.parseString(t)
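To see that the three ways of suppressing the comma really are interchangeable, we can build the greeting parser once per form and compare the output. This is a minimal sketch under the usual assumptions (Python 3, pyparsing installed); comma_forms and outputs are names invented for this example.

```python
from pyparsing import Word, Literal, OneOrMore, Group, Suppress, oneOf, alphas

word = Word(alphas + "'.")
salutation = Group(OneOrMore(word))
greetee = Group(OneOrMore(word))
endpunc = oneOf("! ?")

# The three equivalent spellings of a suppressed comma.
comma_forms = [
    Suppress(Literal(",")),
    Literal(",").suppress(),
    Suppress(","),
]

outputs = []
for comma in comma_forms:
    greeting = salutation + comma + greetee + endpunc
    # With the comma suppressed, exactly three items come back.
    s, g, punc = greeting.parseString("Goodbye, Mr. Chips!")
    outputs.append([s.asList(), g.asList(), punc])

for out in outputs:
    print(out)
```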

Now that we have a decent parser and a good way to get out the results, we can start to have fun with the test data. First, let's accumulate the salutations and greetees into lists of their own:

salutes = []
greetees = []
for t in tests:
    salutation, greetee, endpunc = greeting.parseString(t)
    salutes.append( ( " ".join(salutation), endpunc) )
    greetees.append( " ".join(greetee) )

I've also made a few other changes to the parsed tokens:

  • Used " ".join(list) to convert the grouped tokens back into simple strings

  • Saved the end punctuation in a tuple with each salutation to distinguish the exclamations from the questions

Now that we have collected these assorted names and salutations, we can use them to contrive some additional, never-before-seen greetings and introductions.

After importing the random module, we can synthesize some new greetings:

for i in range(50):
    salute = random.choice( salutes )
    greetee = random.choice( greetees )
    print("%s, %s%s" % ( salute[0], greetee, salute[1] ))

Now we see the all-new set of greetings:

Hello, Miss Crabtree!
How's it goin', G?
Yo, Mr. Chips!
Whattup, World?
Good morning, Mr. Chips!
Goodbye, Jude!
Good morning, Miss Crabtree!
Hello, G!
Hey, Dude!
How's it goin', World?
Good morning, Mom!
How's it goin', Adrian?
Yo, G!
Hey, Adrian!
Hi, Mom!
Hello, Mr. Chips!
Hey, G!
Whattup, Mr. Chips?
Whattup, Miss Crabtree?
...
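Putting the pieces together, the whole parse-collect-synthesize cycle fits in one short script. The sketch below assumes Python 3 with pyparsing installed. The fixed seed, the count of five, and the names new_greetings, greetee_name, and the *_tokens variables are choices made for this illustration only. As a sanity check, each synthesized greeting is fed back through the same parser, which raises a ParseException if a string does not fit the grammar.

```python
import random
from pyparsing import Word, OneOrMore, Group, Suppress, oneOf, alphas

tests = [
    "Hello, World!", "Hi, Mom!", "Good morning, Miss Crabtree!",
    "Yo, Adrian!", "Whattup, G?", "How's it goin', Dude?",
    "Hey, Jude!", "Goodbye, Mr. Chips!",
]

word = Word(alphas + "'.")
salutation = Group(OneOrMore(word))
greetee = Group(OneOrMore(word))
endpunc = oneOf("! ?")
greeting = salutation + Suppress(",") + greetee + endpunc

# Collect the salutations (with their punctuation) and the greetees.
salutes = []
greetees = []
for t in tests:
    salutation_tokens, greetee_tokens, punc = greeting.parseString(t)
    salutes.append((" ".join(salutation_tokens), punc))
    greetees.append(" ".join(greetee_tokens))

random.seed(0)  # arbitrary seed, only to make repeated runs identical

# Mix and match salutations and greetees into new greetings.
new_greetings = []
for i in range(5):
    salute = random.choice(salutes)
    greetee_name = random.choice(greetees)
    new_greetings.append("%s, %s%s" % (salute[0], greetee_name, salute[1]))

for g in new_greetings:
    greeting.parseString(g)  # raises ParseException on a malformed greeting
    print(g)
```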

We can also simulate some introductions with the following code:

for i in range(50):
    print('%s, say "%s" to %s.' % ( random.choice( greetees ),
                                    "".join( random.choice( salutes ) ),
                                    random.choice( greetees ) ))

And now the cocktail party starts shifting into high gear!

Jude, say "Good morning!" to Mom.
G, say "Yo!" to Miss Crabtree.
Jude, say "Goodbye!" to World.
Adrian, say "Whattup?" to World.
Mom, say "Hello!" to Dude.
Mr. Chips, say "Good morning!" to Miss Crabtree.
Miss Crabtree, say "Hi!" to Adrian.
Adrian, say "Hey!" to Mr. Chips.
Mr. Chips, say "How's it goin'?" to Mom.
G, say "Whattup?" to Mom.
Dude, say "Hello!" to World.
Miss Crabtree, say "Goodbye!" to Miss Crabtree.
Dude, say "Hi!" to Mr. Chips.
G, say "Yo!" to Mr. Chips.
World, say "Hey!" to Mr. Chips.
G, say "Hey!" to Adrian.
Adrian, say "Good morning!" to G.
Adrian, say "Hello!" to Mom.
World, say "Good morning!" to Miss Crabtree.
Miss Crabtree, say "Yo!" to G.
...

So, now we've had some fun with the pyparsing module. Using some of the simpler pyparsing classes and methods, we're ready to say "Whattup" to the world!



[1] Of course, writing a parser to extract the components from "Hello, World!" is beyond overkill. But hopefully, by expanding this example to implement a generalized greeting parser, I cover most of the pyparsing basics.
