As it turns out, there is a formal definition for S-expressions to handle many applications requiring the representation of hierarchical data. You can imagine that this goes beyond data represented as words and integers—the authentication certificate sample in the previous section requires a variety of different field types.
The Internet Draft describing S-expressions for data representation can be found at http://people.csail.mit.edu/rivest/sexp.txt. Fortunately, the draft defines the BNF for us:
<sexp> :: <string> | <list> <string> :: <display>? <simple-string> ; <simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | <quoted-string> ; <display> :: "[" <simple-string> "]" ; <raw> :: <decimal> ":" <bytes> ; <decimal> :: <decimal-digit>+ ; -- decimal numbers should have no unnecessary leading zeros <bytes> -- any string of bytes, of the indicated length <token> :: <tokenchar>+ ; <base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ; <hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ; <quoted-string> :: <decimal>? <quoted-string-body> <quoted-string-body> :: """ <bytes> """ <list> :: "(" ( <sexp> | <whitespace> )* ")" ; <whitespace> :: <whitespace-char>* ; <token-char> :: <alpha> | <decimal-digit> | <simple-punc> ; <alpha> :: <upper-case> | <lower-case> | <digit> ; <lower-case> :: "a" | ... | "z" ; <upper-case> :: "A" | ... | "Z" ; <decimal-digit> :: "0" | ... | "9" ; <hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ; <simple-punc> :: "−" | "." | "/" | "_" | ":" | "*" | "+" | "=" ; <whitespace-char> :: " " | " " | " " | " " ; <base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
Wait! Did I say "fortunately"? This seems like a lot to digest, but going step by step, we can implement even a complex BNF such as this by converting each of these expressions to pyparsing elements. In fact, some of them are already built in to pyparsing, so we can just use them as is. Deep breath...OK, let's begin.
Since the published BNF is a "top-down" definition, we should work from the "bottom-up" in defining our pyparsing grammar. We need to do this since Python requires us to define elements before referencing them.
With some observation, we can see that the bottom half of this BNF consists mostly of defining sets of characters, rather than actual parsing expressions. In pyparsing, we will convert these definitions into strings that can be used in composing Word
elements. This BNF also explicitly indicates whitespace in places. Since pyparsing skips over whitespace by default, we can leave these terms out where they are separators, and make sure that we accommodate whitespace when it is expressly part of an expression definition.
So, to begin expressing the sets of valid characters, let's review what pyparsing already provides us (a reminder, these are strings, not expressions, to be used in defining Word
expressions):
alphas
The characters A-Z and a-z
nums
The characters 0–9
alphanums
The combination of alphas
and nums
hexnums
The combination of nums
and the characters A-F and a-f
printables
All 7-bit ASCII characters that are not whitespace or control characters
If need be, we could also use the pyparsing function srange
(for "string range"), which borrows the range syntax from regular expressions; for example, srange("[A-Z]")
returns a string containing the uppercase letters A-Z.
We can now implement these basic character sets, working bottom-up, for composing our Word
elements:
#<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ; base_64_char = alphanums + "+/=" #<whitespace-char> :: " " | " " | " " | " " ; # not needed #<simple-punc> :: "−" | "." | "/" | "_" | ":" | "*" | "+" | "=" ; simple_punc = "-./_:*+=" #<hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ; # use hexnums #<decimal-digit> :: "0" | ... | "9" ; # use nums #<alpha> :: <upper-case> | <lower-case> | <digit> ; #<lower-case> :: "a" | ... | "z" ; #<upper-case> :: "A" | ... | "Z" ; # use alphanums #<token-char> :: <alpha> | <decimal-digit> | <simple-punc> ; token_char = alphanums + simple_punc
Once again looking ahead at the next set of expressions, we can see that we will need some punctuation defined. The punctuation will be important during the parsing process, but I plan to convert the indicated fields during the parsing process using parse actions, so the punctuation itself will not be needed in the returned results. Using Python's list-to-variable assignment and the map function, we can define all our punctuation in a single compact statement:
LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")
We can now visit the top half of the list, and implement those expressions that are defined in terms of the character sets and punctuation that have been defined:
# <bytes> -- any string of bytes, of the indicated length bytes = Word( printables ) # <decimal> :: <decimal-digit>+ ; # -- decimal numbers should have no unnecessary leading zeros decimal = "0" | Word( srange("[1–9]"), nums ) # <quoted-string-body> :: """ <bytes> """ # not needed, use dblQuotedString # <quoted-string> :: <decimal>? <quoted-string-body> quoted_string = Optional( decimal ) + dblQuotedString # <hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ; hexadecimal = "#" + ZeroOrMore( Word(hexnums) ) + "#" # <base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ; base_64 = Optional(decimal) + "|" + ZeroOrMore( Word( base_64_char ) ) + "|" # <token> :: <tokenchar>+ ; token = Word( tokenchar ) # <raw> :: <decimal> ":" <bytes> ; raw = decimal + ":" + bytes # <simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | <quoted-string> ; simple_string = raw | token | base_64 | hexadecimal | quoted_string # <display> :: "[" <simple-string> "]" ; display = LBRACK + simple_string + RBRACK # <string> :: <display>? <simple-string> ; string_ = Optional(display) + simple_string
As I mentioned before, the published BNF describes elements in a general "top-down" order, or most complex expression to simplest expression. In developing these expressions in Python, we have to define the simple expressions first, so that they can be used in composing the more complicated expressions to follow.
Here are some other points in developing these next-level expressions:
The definition of decimal
states that there should be no extra leading zeros. We implemented this by using Word
's two-argument constructor, specifying the non-zero digits as the set of valid leading characters, and the set of all digits as the set of valid body characters.
Most BNF's use the conventions of +
for "1 or more," *
for "zero or more," and ?
for "zero or one." For the most part, we can map these to pyparsing classes OneOrMore
, ZeroOrMore
, and Optional
. (The exception is when the BNF indicates OneOrMore
of a set of characters; in this case, use the Word
class, as shown with token
being made up of token_char+
.)
This BNF includes optional whitespace as possible characters in the hexadecimal
and base-64
expressions. One way to implement this is to define an expression such as hexadecimal
as Word(hexnums+" ")
. Instead, I've chosen to define this as ZeroOrMore(Word(hexnums))
, which will give us a list of the parts composed of hex digits, and implicitly skip over the interior whitespace.
At this point, the remaining expressions are:
<sexp> :: <string> | <list> <list> :: "(" ( <sexp> | <whitespace> )* ")" ;
Here we have come to the recursive part of the BNF. In this case, sexp
is defined in terms of list
, but list
is defined using sexp
. To break this cycle, we'll define sexp
as a Forward
(and also rename list
to list_
, so as not to mask the Python built-in type):
# <list> :: "(" ( <sexp> | <whitespace> )* ")" ; sexp = Forward() list_ = LPAREN + Group( ZeroOrMore( sexp ) ) + RPAREN
And to close the loop, we define the body of sexp
as (string_ | list_)
. Remember, we cannot just use a Python assignment statement or else the definition of sexp
will not be used as the contents of list_
. Instead, we use the <<
shift operator to insert the definition into the previously defined sexp
:
# <sexp> :: <string> | <list> sexp << ( string_ | list_ )
Now we have completed the basic syntax definition of the S-expression BNF. Let's try it out on our previous example, the authentication certificate:
print sexp.parseString(certificateExpr)
gives us the results:
[['certificate', ['issuer', ['name', ['public-key', 'rsa-with-md5', ['e','|', 'NFGq/E3wh9f4rJIQVXhS','|'], ['n', '|', 'd738/4ghP9rFZ0gAIYZ5q9y6iskDJwAS i5rEQpEQq8ZyMZeIZzIAR2I5iGE=', '|']], 'aid-committee']], ['subject', ['ref', ['public-key', 'rsa-with-md5', ['e','|', 'NFGq/E3wh9f4rJIQVXhS','|'], ['n', '|', 'd738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=', '|']], 'tom', 'mother']], ['not-after', '"1998-01-01_09:00:00"'], ['tag', ['spend', ['account', '"12345678"'], ['*', 'numeric', 'range', '"1"', '"1000"']]]]]
Before leaving this parser, there are just a few more parse actions to add:
Removal of the quotation marks on quoted strings.
Conversion of the base64 data to binary data.
Validation of the data length values, when specified.
The quotation marks on the quoted strings are simply there to enclose some body of text into a single string. The dblQuotedString
expression already does this for us so that the leading and trailing quotation marks themselves are not needed in the results. This task is so common that pyparsing includes a method for just this purpose, removeQuotes
:
dblQuotedString.setParseAction( removeQuotes )
To convert the base64 data field, we'll add a short parse action to call b64decode
, defined in Python's base64
module:
base_64 = Optional(decimal) + "|" + OneOrMore(Word(base_64_char)).setParseAction(lambda t:b64decode("".join(t))) + "|"
This parse action does two tasks in one: it joins together the multiple tokens returned from the OneOrMore
expression, and then calls b64decode
to convert that data back to its original form.
Lastly, we'll implement the length validation for the base-64
, quoted-string
, and raw
expressions. These expressions all include an optional leading decimal
element to be used as an expected character count in the attached base64 or character string data. We can use a parse action to verify that these data length values are correct. We'll design the validation parse action to look for two named results in the input tokens: the length field will be named length
, and the content field will be named data
. Here is how that parse action will look:
def validateDataLength( tokens ): if tokens.length != "": if len(tokens.data) != int(tokens.length): raise ParseFatalException ("invalid data length, %d specified, found %s (%d chars)" % (int(tokens.length), tokens.data, len(tokens.data)))
At the start, validateDataLength
checks to see whether a length
field is given. If the length
field is not present, retrieving the length
attribute will return an empty string. If a length
field is present, we then see whether the length of the data
field matches it. If not, our parse action will a raise a different kind of ParseException
, a ParseFatalException
. Unlike the Parse-Exception
, which simply signals a failed syntax match, ParseFatal-Exception
will cause parsing to stop immediately at the current location.
For this parse action to work, we'll need to go back and attach results names to the appropriate elements of the base_64
, quoted_string
, and raw
expressions:
raw = decimal.setResultsName("length") + ":" + Word(bytes).setResultsName("data") dblQuotedString.setParseAction( removeQuotes ) quoted_string = Optional( decimal.setResultsName("length") ) + dblQuotedString.setResultsName("data") base_64_body = OneOrMore(Word(base_64_char)) base_64_body.setParseAction(lambda t:b64decode("".join(t))) base_64 = Optional(decimal.setResultsName("length")) + "|" + base_64_body.setResultsName("data") + "|"
The base_64
expression was getting rather complicated, so I broke out the content field as its own expression, base_64_body
.
With these changes in place, here is our final parser, with test cases:
from pyparsing import * from base64 import b64decode import pprint LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]") base_64_char = alphanums + "+/=" simple_punc = "-./_:*+=" token_char = alphanums + simple_punc bytes = Word( printables ) decimal = ("0" | Word( srange("[1-9]"), nums )).setParseAction(lambda t: int(t[0])) token = Word( token_char ) hexadecimal = "#" + ZeroOrMore( Word(hexnums) ) + "#" dblQuotedString.setParseAction( removeQuotes ) quoted_string = Optional( decimal.setResultsName("length") ) + dblQuotedString.setResultsName("data") base_64_body = OneOrMore(Word(base_64_char)) base_64_body.setParseAction(lambda t:b64decode("".join(t))) base_64 = Optional(decimal.setResultsName("length")) + "|" + base_64_body.setResultsName("data") + "|" raw = (decimal.setResultsName("length") + ":" + bytes.setResultsName("data")) simple_string = raw | token | base_64 | hexadecimal | quoted_string display = LBRACK + simple_string + RBRACK string_ = Optional(display) + simple_string sexp = Forward() list_ = Group( LPAREN + ZeroOrMore( sexp ) + RPAREN ) sexp << ( string_ | list_ ) def validateDataLength( tokens ): if tokens.length != "": if len(tokens.data) != int(tokens.length): raise ParseFatalException ("invalid data length, %d specified, found %s (%d chars)" % (int(tokens.length), tokens.data, len(tokens.data))) quoted_string.setParseAction( validateDataLength ) base_64.setParseAction( validateDataLength ) raw.setParseAction( validateDataLength ) ######### Test data ########### test0 = """(snicker "abc" (#03# |YWJj|))""" test1 = """(certificate (issuer (name (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) aid-committee)) (subject (ref (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) tom mother)) (not-after "1998-01-01_09:00:00") (tag (spend (account "12345678") (* numeric range "1" "1000")))) """ test2 = """ (defun factorial (x) (if (zerop x) 1 (* x (factorial (- x 1))))) """ test3 = """(3:XX "abc" (#30# |YWJj|))""" # Run tests for t in (test0, test1, test2, test3): print '-'*50 print t try: sexpr = sexp.parseString(t) pprint.pprint(sexpr.asList()) except ParseFatalException, pfe: print "Error:", pfe.msg print line(pfe.loc,t) print pfe.markInputline() print
The test output is:
-------------------------------------------------- (snicker "abc" (#03# |YWJj|)) [['snicker', 'abc', ['#', '03', '#', '|', 'abc', '|']]] -------------------------------------------------- (certificate (issuer (name (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) aid-committee)) (subject (ref (public-key rsa-with-md5 (e |NFGq/E3wh9f4rJIQVXhS|) (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|)) tom mother)) (not-after "1998-01-01_09:00:00") (tag (spend (account "12345678") (* numeric range "1" "1000")))) [['certificate', ['issuer', ['name', ['public-key', 'rsa-with-md5', ['e', '|', '4QxaaxfcMxf0x87xd7xf8xacx92x10UxR', '|'], ['n', '|', "wxbdxfcxffx88!?xdaxc5gHx00!x86yxabxdcxbax8axc9x03' x00x12x8bx9axc4Bx91x10xabxc6r1x97x88g2x00Gb9x88a", '|']], 'aid-committee']], ['subject', ['ref', ['public-key', 'rsa-with-md5', ['e', '|', '4QxaaxfcMxf0x87xd7xf8xacx92x10UxR', '|'], ['n', '|', "wxbdxfcxffx88!?xdaxc5gHx00!x86yxabxdcxbax8axc9x03' x00x12x8bx9axc4Bx91x10xabxc6r1x97x88g2x00Gb9x88a", '|']], 'tom', 'mother']], ['not-after', '1998-01-01_09:00:00'], ['tag', ['spend', ['account', '12345678'], ['*', 'numeric', 'range', '1', '1000']]]]] -------------------------------------------------- (defun factorial (x) (if (zerop x) 1 (* x (factorial (- x 1))))) [['defun', 'factorial', ['x'], ['if', ['zerop', 'x'], '1', ['*', 'x', ['factorial', ['-', 'x', '1']]]]]] -------------------------------------------------- (3:XX "abc" (#30# |YWJj|)) Error: invalid data length, 3 specified, found XX (2 chars) (3:XX "abc" (#30# |YWJj|)) >!<
3.141.200.3