A Simple S-Expression Parser

S-expressions are a plain ASCII form for representing complex data structures. They can be used to serialize data to send over a communication path, to persist into a database, or to otherwise use as a string representation of a hierarchical data element. When displayed in indented form, S-expressions are even suitable for human comprehension, providing a simple and intuitive nesting syntax, with parentheses used as the nesting operators. Here is a sample S-expression describing an authentication certificate:

(certificate
 (issuer
  (name
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   aid-committee))
 (subject
  (ref
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   tom
   mother))
(not-after "1998-01-01_09:00:00")
 (tag
(spend (account "12345678") (* numeric range "1" "1000"))))

The attraction of S-expressions is that they consist purely of lists of basic character or numeric strings, with structure represented using nested parentheses.

The languages Lisp and Scheme use S-expressions as their actual program syntax. Here is a factorial function written in Common Lisp:

(defun factorial (x)
   (if (zerop x) 1
       (* x (factorial (- x 1)))))

The online Wikipedia article (http://en.wikipedia.org/wiki/s-expression) has more background and additional links for further information on S-expressions.

In computer science classes, it is common to assign as homework the development of an S-expression parser. Doing so with pyparsing is actually a fairly straightforward task. This is also our first case of a recursive grammar, in which some expressions can be written in terms of other expressions of the same type.

Let's start with a very simple S-expression form, in which an expression can be a single alphabetic word or integer, or a sequence of alphabetic words or integers enclosed in parentheses. Following our standard practice, we start by defining the BNF for an S-expression:

alphaword ::= alphas+
integer   ::= nums+
sexp      ::= alphaword | integer | '(' sexp* ')'

The first two expressions are nothing new, and you can see how we would use pyparsing's Word class to define them:

alphaword = Word(alphas)
integer = Word(nums)

But sexp is more difficult. Our dilemma is that the definition for sexp includes sexp itself, but to define sexp we need to refer to sexp!

To resolve this "chicken-and-egg" problem, pyparsing provides the Forward class, which allows you to "forward" declare an expression before you can fully define it. The first step is to declare sexp as an empty Forward:

sexp = Forward()

At this point, sexp has no grammar definition yet. To assign the recursive definition of sexp into sexp, we use the << shift operator:

LPAREN = Suppress("(")
RPAREN = Suppress(")")
sexp << ( alphaword | integer | ( LPAREN + ZeroOrMore(sexp) + RPAREN )

The << operator "injects" the definition into the existing sexp variable. sexp now safely contains a reference to itself as part of its recursive definition.

Let's test this simple S-expression parser all by itself:

tests = """
    red
    100
    ( red 100 blue )
    ( green ( ( 1 2 ) mauve ) plaid () )""".splitlines()

for t in tests:
    print t
    print sexp.parseString(t)
    print

This gives us the following results:

red
['red']

100
['100']

( red 100 blue )
['red', '100', 'blue']

( green ( 1 2 mauve ) plaid () )
['green', '1', '2', 'mauve', 'plaid']

This successfully parses all of the expressions, but that last test case is a little disappointing. There is no representation of the subexpressions defined within the nested parentheses. Once again, the Group class provides the missing link.

Remember that Group causes the matched tokens to be enclosed within their own sublist. By changing the definition of sexp to:

sexp << ( alphaword | integer | Group( LPAR + ZeroOrMore(sexp) + RPAR ) )

the elements within nested parentheses end up within a nested sublist in the results. And, since this Group construct is defined within the recursive part of the grammar, this nesting will work recursively as deeply nested as the parenthesis nesting in the original string. With this change, our results become:

red
['red']

100
['100']

( red 100 blue )
[['red', '100', 'blue']]

( green ( ( 1 2 ) mauve ) plaid () )
[['green', [['1', '2'], 'mauve'], 'plaid', []]]

Much better!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.237.79