Native Parser Interfaces

Now that we’ve looked at how SAX can be used and have seen just how regular the code is to set up the parser and the ContentHandler, you may be wondering how much of that ease comes from using SAX and how much is a matter of convenience functions in the Python libraries. While we won’t delve deeply into the native interfaces of the individual parsers, this is a good question, and can lead to some interesting observations.

The key advantage to using SAX is that the callback methods have the same names and significance regardless of the actual parser you use. There are at least two nice results of this: changing parsers does not affect your application, and your code is more maintainable because someone new to the code is more likely to know the SAX interface than any particular parser-specific interface.
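As a point of comparison for what follows, here is a minimal sketch of the parser-independent SAX style. The ElementCounter class and the sample document are hypothetical and used only for illustration; the handler never names a specific parser, which is left to xml.sax.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per start tag, whichever parser drives the handler.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = ElementCounter()
# parseString() uses whatever default parser xml.sax selects.
xml.sax.parseString(b"<doc><item/><item/></doc>", handler)
print(handler.counts)
```

Nothing in ElementCounter would need to change if a different SAX-compliant parser were substituted.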

So just how do the native interfaces to the individual parsers differ from SAX, and why would we choose to use them instead? Let’s take a quick look at the PyExpat parser to get a taste of the differences.

Using PyExpat Directly

Of course, to use PyExpat, you need to have it installed. It is included as part of the Python installer for Windows, and is built automatically on Unix if you have the Expat library installed. If you did not get PyExpat as part of your Python installation, you can obtain it by installing the PyXML package.

PyExpat resides in the xml.parsers.expat module. If we want to modify our last example to use PyExpat directly, we don’t have a lot of work to do, but there are a few changes. Since the PyExpat handler methods closely match the SAX handlers, at least for the basic use we demonstrate here, we can use the same handler class we’ve already written. The imports won’t need to change much:

#!/usr/bin/env python

import sys

from xml.parsers import expat
from handlers    import PyXMLConversionHandler

Once the module is imported, a parser can be created and used:

parser = expat.ParserCreate()

Were we to do this at the interactive prompt, we could poke at the parser object to see what attributes it has:

>>> from xml.parsers import expat
>>> parser = expat.ParserCreate()
>>> dir(parser)
['CharacterDataHandler', 'CommentHandler', 'DefaultHandler',
 'DefaultHandlerExpand', 'EndCdataSectionHandler', 'EndElementHandler',
 'EndNamespaceDeclHandler', 'ErrorByteIndex', 'ErrorCode',
 'ErrorColumnNumber', 'ErrorLineNumber', 'ExternalEntityParserCreate',
 'ExternalEntityRefHandler', 'GetBase', 'NotStandaloneHandler',
 'NotationDeclHandler', 'Parse', 'ParseFile',
 'ProcessingInstructionHandler', 'SetBase', 'StartCdataSectionHandler',
 'StartElementHandler', 'StartNamespaceDeclHandler',
 'UnparsedEntityDeclHandler', 'ordered_attributes', 'returns_unicode',
 'specified_attributes']

That certainly doesn’t look like a SAX parser!

There is no setContentHandler method, nor is there anything that takes its place. To register our content handler, we need to set various attributes to the methods of a content handler instance:

dh = PyXMLConversionHandler(sys.stdout)
parser.StartElementHandler = dh.startElement
parser.EndElementHandler = dh.endElement
parser.CharacterDataHandler = dh.characters

This isn’t hard, but it is certainly more tedious than the SAX setContentHandler method, and the setup code has to change whenever we want to use more methods of the handler object: each additional callback needs its own assignment.
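The attribute-based registration can be tried in isolation with a small stand-in handler. This is only a sketch: EventRecorder is a hypothetical class, not the PyXMLConversionHandler from the earlier examples, and the call at the end feeds a short literal string to the parser just to trigger the callbacks.

```python
from xml.parsers import expat


class EventRecorder:
    """Hypothetical stand-in handler: records each callback it receives."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs)))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, data):
        self.events.append(("text", data))


recorder = EventRecorder()
parser = expat.ParserCreate()

# One assignment per callback; handlers left unset are simply never called.
parser.StartElementHandler = recorder.startElement
parser.EndElementHandler = recorder.endElement
parser.CharacterDataHandler = recorder.characters

# Feed a small document so the registered methods are invoked.
parser.Parse("<greeting lang='en'>hello</greeting>", True)

for event in recorder.events:
    print(event)
```

Because the parser only stores references to bound methods, any object with suitably shaped methods can serve as the handler; it does not have to derive from a SAX base class.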

Once we’ve initialized the handler methods we’re interested in using, we can start the parse. Again, this is a little different from the SAX version:

parser.Parse(sys.stdin.read(), 1)

We know what sys.stdin.read() does, but the 1 passed as the second parameter looks suspiciously like a magic number in our source code. It is actually a Boolean flag indicating that the string being passed to the Parse method is the final chunk of the input. Parse can be called multiple times with smaller portions of the input and the flag set to 0, followed by a last call with the flag set to 1 for the final chunk of data. This can be useful when reading data asynchronously from a network connection.
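Incremental parsing of this kind might look like the following sketch. The chunks list is a made-up stand-in for data arriving piecemeal from a network connection, and the chunk boundaries are deliberately placed in the middle of a tag and in the middle of the text.

```python
from xml.parsers import expat

names = []       # element names, in the order their start tags are seen
text_parts = []  # character data, possibly delivered in several pieces

parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.CharacterDataHandler = text_parts.append

# Hypothetical chunks, standing in for data read from a socket.
chunks = ["<doc><ti", "tle>Native par", "sers</title></doc>"]

for chunk in chunks:
    # A false flag tells the parser that more input will follow.
    parser.Parse(chunk, False)

# An empty final call with a true flag marks the end of the document.
parser.Parse("", True)

print(names)
print("".join(text_parts))
```

Note that character data can be reported in more than one callback when it spans chunks, so the pieces are collected and joined rather than assumed to arrive whole.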

When parsing XML from a file object, the following method is also available:

parser.ParseFile(sys.stdin)
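A short sketch of ParseFile on an in-memory file object follows. One caveat worth knowing: on Python 3, ParseFile expects the object's read method to return bytes, so a binary stream is needed; for standard input that means sys.stdin.buffer rather than sys.stdin. The sample document and the names list are illustrative only.

```python
import io
from xml.parsers import expat

names = []

parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)

# Any object with a read() method returning bytes will do;
# io.BytesIO stands in for a file opened in binary mode.
source = io.BytesIO(b"<doc><item/><item/></doc>")
parser.ParseFile(source)

print(names)
```

ParseFile reads until the stream is exhausted, so there is no final-chunk flag to manage.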

The complete script that uses the handler with PyExpat is shown in Example 3-11.

Example 3-11. genhtml2.py with PyExpat
"""
genhtml2.py - generates HTML from pyxml.xml
"""

import sys

from xml.parsers import expat
from handlers    import PyXMLConversionHandler

dh = PyXMLConversionHandler(sys.stdout)
parser = expat.ParserCreate()

parser.StartElementHandler = dh.startElement
parser.EndElementHandler = dh.endElement
parser.CharacterDataHandler = dh.characters
parser.ParseFile(sys.stdin)

The script writes its output to the standard output stream. If you save that output and open it in your browser, it shows all of the classes of the PyXML package and their methods, exactly as the pure SAX version of this example did.
