S-expressions are a plain ASCII form for representing complex data structures. They can be used to serialize data to send over a communication path, to persist into a database, or to otherwise use as a string representation of a hierarchical data element. When displayed in indented form, S-expressions are even suitable for human comprehension, providing a simple and intuitive nesting syntax, with parentheses used as the nesting operators. Here is a sample S-expression describing an authentication certificate:
(certificate (issuer (name (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) aid-committee)) (subject (ref (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) tom mother)) (not-after "1998-01-01_09:00:00") (tag (spend (account "12345678") (* numeric range "1" "1000"))))
The attraction of S-expressions is that they consist purely of lists of basic character or numeric strings, with structure represented using nested parentheses.
The languages Lisp and Scheme use S-expressions as their actual program syntax. Here is a factorial function written in Common Lisp:
(defun factorial (x) (if (zerop x) 1 (* x (factorial (- x 1)))))
The online Wikipedia article (http://en.wikipedia.org/wiki/s-expression) has more background and additional links for further information on S-expressions.
In computer science classes, it is common to assign as homework the development of an S-expression parser. Doing so with pyparsing is actually a fairly straightforward task. This is also our first case of a recursive grammar, in which some expressions can be written in terms of other expressions of the same type.
Let's start with a very simple S-expression form, in which an expression can be a single alphabetic word or integer, or a sequence of alphabetic words or integers enclosed in parentheses. Following our standard practice, we start by defining the BNF for an S-expression:
alphaword ::= alphas+ integer ::= nums+ sexp ::= alphaword | integer | '(' sexp* ')'
The first two expressions are nothing new, and you can see how we would use pyparsing's Word
class to define them:
alphaword = Word(alphas) integer = Word(nums)
But sexp
is more difficult. Our dilemma is that the definition for sexp
includes sexp
itself, but to define sexp
we need to refer to sexp
!
To resolve this "chicken-and-egg" problem, pyparsing provides the Forward
class, which allows you to "forward" declare an expression before you can fully define it. The first step is to declare sexp as an empty Forward
:
sexp = Forward()
At this point, sexp
has no grammar definition yet. To assign the recursive definition of sexp
into sexp
, we use the <<
shift operator:
LPAREN = Suppress("(") RPAREN = Suppress(")") sexp << ( alphaword | integer | ( LPAREN + ZeroOrMore(sexp) + RPAREN )
The <<
operator "injects" the definition into the existing sexp
variable. sexp
now safely contains a reference to itself as part of its recursive definition.
Let's test this simple S-expression parser all by itself:
tests = """ red 100 ( red 100 blue ) ( green ( ( 1 2 ) mauve ) plaid () )""".splitlines() for t in tests: print t print sexp.parseString(t) print
This gives us the following results:
red ['red'] 100 ['100'] ( red 100 blue ) ['red', '100', 'blue'] ( green ( 1 2 mauve ) plaid () ) ['green', '1', '2', 'mauve', 'plaid']
This successfully parses all of the expressions, but that last test case is a little disappointing. There is no representation of the subexpressions defined within the nested parentheses. Once again, the Group
class provides the missing link.
Remember that Group
causes the matched tokens to be enclosed within their own sublist. By changing the definition of sexp
to:
sexp << ( alphaword | integer | Group( LPAR + ZeroOrMore(sexp) + RPAR ) )
the elements within nested parentheses end up within a nested sublist in the results. And, since this Group
construct is defined within the recursive part of the grammar, this nesting will work recursively as deeply nested as the parenthesis nesting in the original string. With this change, our results become:
red ['red'] 100 ['100'] ( red 100 blue ) [['red', '100', 'blue']] ( green ( ( 1 2 ) mauve ) plaid () ) [['green', [['1', '2'], 'mauve'], 'plaid', []]]
Much better!
18.191.237.79