A Complete S-Expression Parser

As it turns out, there is a formal definition for S-expressions to handle many applications requiring the representation of hierarchical data. You can imagine that this goes beyond data represented as words and integers—the authentication certificate sample in the previous section requires a variety of different field types.

The Internet Draft describing S-expressions for data representation can be found at http://people.csail.mit.edu/rivest/sexp.txt. Fortunately, the draft defines the BNF for us:

<sexp>        :: <string> | <list>
<string>      :: <display>? <simple-string> ;
<simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | <quoted-string> ;
<display>     :: "[" <simple-string> "]" ;
<raw>         :: <decimal> ":" <bytes> ;
<decimal>     :: <decimal-digit>+ ;
              -- decimal numbers should have no unnecessary leading zeros
<bytes>       -- any string of bytes, of the indicated length
<token>       :: <tokenchar>+ ;
<base-64>     :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
<hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ;
<quoted-string> :: <decimal>? <quoted-string-body>
<quoted-string-body> :: """ <bytes> """
<list>        :: "(" ( <sexp> | <whitespace> )* ")" ;
<whitespace>  :: <whitespace-char>* ;
<token-char>  :: <alpha> | <decimal-digit> | <simple-punc> ;
<alpha>       :: <upper-case> | <lower-case> | <digit> ;
<lower-case>  :: "a" | ... | "z" ;
<upper-case>  :: "A" | ... | "Z" ;
<decimal-digit> :: "0" | ... | "9" ;
<hex-digit>   :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
<simple-punc> :: "−" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
<whitespace-char> :: " " | "	" | "
" | "
" ;
<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ;

Wait! Did I say "fortunately"? This seems like a lot to digest, but going step by step, we can implement even a complex BNF such as this by converting each of these expressions to pyparsing elements. In fact, some of them are already built in to pyparsing, so we can just use them as is. Deep breath...OK, let's begin.

Since the published BNF is a "top-down" definition, we should work from the "bottom-up" in defining our pyparsing grammar. We need to do this since Python requires us to define elements before referencing them.

With some observation, we can see that the bottom half of this BNF consists mostly of defining sets of characters, rather than actual parsing expressions. In pyparsing, we will convert these definitions into strings that can be used in composing Word elements. This BNF also explicitly indicates whitespace in places. Since pyparsing skips over whitespace by default, we can leave these terms out where they are separators, and make sure that we accommodate whitespace when it is expressly part of an expression definition.

So, to begin expressing the sets of valid characters, let's review what pyparsing already provides us (a reminder, these are strings, not expressions, to be used in defining Word expressions):

alphas

The characters A-Z and a-z

nums

The characters 0–9

alphanums

The combination of alphas and nums

hexnums

The combination of nums and the characters A-F and a-f

printables

All 7-bit ASCII characters that are not whitespace or control characters

If need be, we could also use the pyparsing function srange (for "string range"), which borrows the range syntax from regular expressions; for example, srange("[A-Z]") returns a string containing the uppercase letters A-Z.

We can now implement these basic character sets, working bottom-up, for composing our Word elements:

#<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
base_64_char = alphanums + "+/="

#<whitespace-char> :: " " | "	" | "
" | "
" ;
# not needed

#<simple-punc>       :: "−" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
simple_punc = "-./_:*+="

#<hex-digit>  :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
# use hexnums

#<decimal-digit> :: "0" | ... | "9" ;
# use nums

#<alpha>      :: <upper-case> | <lower-case> | <digit> ;
#<lower-case>        :: "a" | ... | "z" ;
#<upper-case> :: "A" | ... | "Z" ;
# use alphanums

#<token-char> :: <alpha> | <decimal-digit> | <simple-punc> ;
token_char = alphanums + simple_punc

Once again looking ahead at the next set of expressions, we can see that we will need some punctuation defined. The punctuation will be important during the parsing process, but I plan to convert the indicated fields during the parsing process using parse actions, so the punctuation itself will not be needed in the returned results. Using Python's list-to-variable assignment and the map function, we can define all our punctuation in a single compact statement:

LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")

We can now visit the top half of the list, and implement those expressions that are defined in terms of the character sets and punctuation that have been defined:

# <bytes>     -- any string of bytes, of the indicated length
bytes = Word( printables )

# <decimal>   :: <decimal-digit>+ ;
#             -- decimal numbers should have no unnecessary leading zeros
decimal = "0" | Word( srange("[1–9]"), nums )

# <quoted-string-body> :: """ <bytes> """
# not needed, use dblQuotedString

# <quoted-string> :: <decimal>? <quoted-string-body>
quoted_string = Optional( decimal ) + dblQuotedString

# <hexadecimal>      :: "#" ( <hex-digit> | <white-space> )* "#" ;
hexadecimal = "#" + ZeroOrMore( Word(hexnums) ) + "#"

# <base-64>   :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
base_64 = Optional(decimal) + "|" + ZeroOrMore( Word( base_64_char ) ) + "|"

# <token>     :: <tokenchar>+ ;
token = Word( tokenchar )

# <raw>       :: <decimal> ":" <bytes> ;
raw = decimal + ":" + bytes

# <simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | <quoted-string> ;
simple_string = raw | token | base_64 | hexadecimal | quoted_string

# <display>   :: "[" <simple-string> "]" ;
display = LBRACK + simple_string + RBRACK

# <string>    :: <display>? <simple-string> ;
string_ = Optional(display) + simple_string

As I mentioned before, the published BNF describes elements in a general "top-down" order, or most complex expression to simplest expression. In developing these expressions in Python, we have to define the simple expressions first, so that they can be used in composing the more complicated expressions to follow.

Here are some other points in developing these next-level expressions:

  • The definition of decimal states that there should be no extra leading zeros. We implemented this by using Word's two-argument constructor, specifying the non-zero digits as the set of valid leading characters, and the set of all digits as the set of valid body characters.

  • Most BNF's use the conventions of + for "1 or more," * for "zero or more," and ? for "zero or one." For the most part, we can map these to pyparsing classes OneOrMore, ZeroOrMore, and Optional. (The exception is when the BNF indicates OneOrMore of a set of characters; in this case, use the Word class, as shown with token being made up of token_char+.)

  • This BNF includes optional whitespace as possible characters in the hexadecimal and base-64 expressions. One way to implement this is to define an expression such as hexadecimal as Word(hexnums+" "). Instead, I've chosen to define this as ZeroOrMore(Word(hexnums)), which will give us a list of the parts composed of hex digits, and implicitly skip over the interior whitespace.

At this point, the remaining expressions are:

<sexp>        :: <string> | <list>
<list>        :: "(" ( <sexp> | <whitespace> )* ")" ;

Here we have come to the recursive part of the BNF. In this case, sexp is defined in terms of list, but list is defined using sexp. To break this cycle, we'll define sexp as a Forward (and also rename list to list_, so as not to mask the Python built-in type):

# <list>      :: "(" ( <sexp> | <whitespace> )* ")" ;
sexp = Forward()
list_ = LPAREN + Group( ZeroOrMore( sexp ) ) + RPAREN

And to close the loop, we define the body of sexp as (string_ | list_). Remember, we cannot just use a Python assignment statement or else the definition of sexp will not be used as the contents of list_. Instead, we use the << shift operator to insert the definition into the previously defined sexp:

# <sexp>      :: <string> | <list>
sexp << ( string_ | list_ )

Now we have completed the basic syntax definition of the S-expression BNF. Let's try it out on our previous example, the authentication certificate:

print sexp.parseString(certificateExpr)

gives us the results:

[['certificate',
  ['issuer',
   ['name',
    ['public-key',
     'rsa-with-md5',
     ['e','|', 'NFGq/E3wh9f4rJIQVXhS','|'],
     ['n',
      '|',
      'd738/4ghP9rFZ0gAIYZ5q9y6iskDJwAS
       i5rEQpEQq8ZyMZeIZzIAR2I5iGE=',
      '|']],
    'aid-committee']],
  ['subject',
   ['ref',
    ['public-key',
     'rsa-with-md5',
     ['e','|', 'NFGq/E3wh9f4rJIQVXhS','|'],
     ['n',
      '|',
      'd738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=',
      '|']],
    'tom',
    'mother']],
  ['not-after', '"1998-01-01_09:00:00"'],
  ['tag',
   ['spend',
    ['account', '"12345678"'],
    ['*', 'numeric', 'range', '"1"', '"1000"']]]]]

Before leaving this parser, there are just a few more parse actions to add:

  • Removal of the quotation marks on quoted strings.

  • Conversion of the base64 data to binary data.

  • Validation of the data length values, when specified.

The quotation marks on the quoted strings are simply there to enclose some body of text into a single string. The dblQuotedString expression already does this for us so that the leading and trailing quotation marks themselves are not needed in the results. This task is so common that pyparsing includes a method for just this purpose, removeQuotes:

dblQuotedString.setParseAction( removeQuotes )

To convert the base64 data field, we'll add a short parse action to call b64decode, defined in Python's base64 module:

base_64 = Optional(decimal) + "|" + 
    OneOrMore(Word(base_64_char)).setParseAction(lambda t:b64decode("".join(t))) + "|"

This parse action does two tasks in one: it joins together the multiple tokens returned from the OneOrMore expression, and then calls b64decode to convert that data back to its original form.

Lastly, we'll implement the length validation for the base-64, quoted-string, and raw expressions. These expressions all include an optional leading decimal element to be used as an expected character count in the attached base64 or character string data. We can use a parse action to verify that these data length values are correct. We'll design the validation parse action to look for two named results in the input tokens: the length field will be named length, and the content field will be named data. Here is how that parse action will look:

def validateDataLength( tokens ):
    if tokens.length != "":
        if len(tokens.data) != int(tokens.length):
            raise ParseFatalException 
                ("invalid data length, %d specified, found %s (%d chars)" %
                 (int(tokens.length), tokens.data, len(tokens.data)))

At the start, validateDataLength checks to see whether a length field is given. If the length field is not present, retrieving the length attribute will return an empty string. If a length field is present, we then see whether the length of the data field matches it. If not, our parse action will a raise a different kind of ParseException, a ParseFatalException. Unlike the Parse-Exception, which simply signals a failed syntax match, ParseFatal-Exception will cause parsing to stop immediately at the current location.

For this parse action to work, we'll need to go back and attach results names to the appropriate elements of the base_64, quoted_string, and raw expressions:

raw = decimal.setResultsName("length") + ":" + Word(bytes).setResultsName("data")

dblQuotedString.setParseAction( removeQuotes )
quoted_string = Optional( decimal.setResultsName("length") ) + 
    dblQuotedString.setResultsName("data")

base_64_body = OneOrMore(Word(base_64_char))
base_64_body.setParseAction(lambda t:b64decode("".join(t)))
base_64 = Optional(decimal.setResultsName("length")) + "|" + 
    base_64_body.setResultsName("data") + "|"

The base_64 expression was getting rather complicated, so I broke out the content field as its own expression, base_64_body.

With these changes in place, here is our final parser, with test cases:

from pyparsing import *
from base64 import b64decode
import pprint

LPAREN, RPAREN, LBRACK, RBRACK = map(Suppress, "()[]")

base_64_char = alphanums + "+/="
simple_punc = "-./_:*+="
token_char = alphanums + simple_punc
bytes = Word( printables )
decimal = ("0" | Word( srange("[1-9]"), nums )).setParseAction(lambda t: int(t[0]))

token = Word( token_char )
hexadecimal = "#" + ZeroOrMore( Word(hexnums) ) + "#"

dblQuotedString.setParseAction( removeQuotes )
quoted_string = Optional( decimal.setResultsName("length") ) + 
    dblQuotedString.setResultsName("data")

base_64_body = OneOrMore(Word(base_64_char))
base_64_body.setParseAction(lambda t:b64decode("".join(t)))
base_64 = Optional(decimal.setResultsName("length")) + 
    "|" + base_64_body.setResultsName("data") + "|"

raw = (decimal.setResultsName("length") + ":" +
        bytes.setResultsName("data"))
simple_string = raw | token | base_64 | hexadecimal | quoted_string
display = LBRACK + simple_string + RBRACK
string_ = Optional(display) + simple_string

sexp = Forward()
list_ = Group( LPAREN + ZeroOrMore( sexp ) + RPAREN )
sexp << ( string_ | list_ )

def validateDataLength( tokens ):
    if tokens.length != "":
        if len(tokens.data) != int(tokens.length):
            raise ParseFatalException 
                ("invalid data length, %d specified, found %s (%d chars)" %
                 (int(tokens.length), tokens.data, len(tokens.data)))

quoted_string.setParseAction( validateDataLength )
base_64.setParseAction( validateDataLength )
raw.setParseAction( validateDataLength )

######### Test data ###########
test0 = """(snicker "abc" (#03# |YWJj|))"""
test1 = """(certificate
 (issuer
  (name
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   aid-committee))
 (subject
  (ref
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   tom
   mother))
 (not-after "1998-01-01_09:00:00")
 (tag
  (spend (account "12345678") (* numeric range "1" "1000"))))
"""
test2 = """
(defun factorial (x)
   (if (zerop x) 1
       (* x (factorial (- x 1)))))
       """
test3 = """(3:XX "abc" (#30# |YWJj|))"""

# Run tests
for t in (test0, test1, test2, test3):
    print '-'*50
    print t
    try:
        sexpr = sexp.parseString(t)
        pprint.pprint(sexpr.asList())
    except ParseFatalException, pfe:
        print "Error:", pfe.msg
        print line(pfe.loc,t)
        print pfe.markInputline()
    print

The test output is:

--------------------------------------------------
(snicker "abc" (#03# |YWJj|))
[['snicker', 'abc', ['#', '03', '#', '|', 'abc', '|']]]

--------------------------------------------------
(certificate
 (issuer
  (name
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   aid-committee))
 (subject
  (ref
   (public-key
    rsa-with-md5
    (e |NFGq/E3wh9f4rJIQVXhS|)
    (n |d738/4ghP9rFZ0gAIYZ5q9y6iskDJwASi5rEQpEQq8ZyMZeIZzIAR2I5iGE=|))
   tom
   mother))
 (not-after "1998-01-01_09:00:00")
 (tag
  (spend (account "12345678") (* numeric range "1" "1000"))))

[['certificate',
  ['issuer',
   ['name',
    ['public-key',
     'rsa-with-md5',
     ['e', '|', '4QxaaxfcMxf0x87xd7xf8xacx92x10UxR', '|'],
     ['n',
      '|',
     "wxbdxfcxffx88!?xdaxc5gHx00!x86yxabxdcxbax8axc9x03'
       x00x12x8bx9axc4Bx91x10xabxc6r1x97x88g2x00Gb9x88a",
      '|']],
    'aid-committee']],
  ['subject',
   ['ref',
    ['public-key',
     'rsa-with-md5',
     ['e', '|', '4QxaaxfcMxf0x87xd7xf8xacx92x10UxR', '|'],
     ['n',
      '|',
     "wxbdxfcxffx88!?xdaxc5gHx00!x86yxabxdcxbax8axc9x03'
       x00x12x8bx9axc4Bx91x10xabxc6r1x97x88g2x00Gb9x88a",
      '|']],
    'tom',
    'mother']],
  ['not-after', '1998-01-01_09:00:00'],
  ['tag',
   ['spend',
    ['account', '12345678'],
    ['*', 'numeric', 'range', '1', '1000']]]]]

--------------------------------------------------
(defun factorial (x)
   (if (zerop x) 1
       (* x (factorial (- x 1)))))

[['defun',
  'factorial',
  ['x'],
  ['if', ['zerop', 'x'], '1', ['*', 'x', ['factorial', ['-', 'x', '1']]]]]]

--------------------------------------------------
(3:XX "abc" (#30# |YWJj|))
Error: invalid data length, 3 specified, found XX (2 chars)
(3:XX "abc" (#30# |YWJj|))
>!<
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.141.200.3