Now that we’ve looked at how SAX can be used and have seen just how regular the code is to set up the parser and the ContentHandler, you may be wondering how much of that ease comes from using SAX and how much is a matter of convenience functions in the Python libraries. While we won’t delve deeply into the native interfaces of the individual parsers, this is a good question, and can lead to some interesting observations.
The key advantage to using SAX is that the callback methods have the same names and significance regardless of the actual parser you use. There are at least two nice results of this: changing parsers does not affect your application, and your code is more maintainable because someone new to the code is more likely to know the SAX interface than any particular parser-specific interface.
So just how do the native interfaces to the individual parsers differ from SAX, and why would we choose to use them instead? Let’s take a quick look at the PyExpat parser to get a taste of the differences.
Of course, to use PyExpat, you need to have it installed. It is included as part of the Python installer for Windows, and is built automatically on Unix if you have the Expat library installed. If you did not install PyExpat as part of Python, it is installed as part of the PyXML package.
PyExpat resides in the xml.parsers.expat module. If we want to modify our last example to use PyExpat directly, we don’t have a lot of work to do, but there are a few changes. Since the PyExpat handler methods closely match the SAX handlers, at least for the basic use we demonstrate here, we can use the same handler class we’ve already written. The imports won’t need to change much:
#!/usr/bin/env python
import sys
from xml.parsers import expat
from handlers import PyXMLConversionHandler
Once the parser is imported, it can be created and used:
parser = expat.ParserCreate( )
Were we to do this at the interactive prompt, we could poke at the parser object to see what attributes it has:
>>> from xml.parsers import expat
>>> parser = expat.ParserCreate( )
>>> dir(parser)
['CharacterDataHandler', 'CommentHandler', 'DefaultHandler',
'DefaultHandlerExpand', 'EndCdataSectionHandler', 'EndElementHandler',
'EndNamespaceDeclHandler', 'ErrorByteIndex', 'ErrorCode',
'ErrorColumnNumber', 'ErrorLineNumber', 'ExternalEntityParserCreate',
'ExternalEntityRefHandler', 'GetBase', 'NotStandaloneHandler',
'NotationDeclHandler', 'Parse', 'ParseFile',
'ProcessingInstructionHandler', 'SetBase', 'StartCdataSectionHandler',
'StartElementHandler', 'StartNamespaceDeclHandler',
'UnparsedEntityDeclHandler', 'ordered_attributes', 'returns_unicode',
'specified_attributes']
That certainly doesn’t look like a SAX parser!
There is no setContentHandler method, nor is there anything that takes its place. To register our content handler, we need to assign the methods of a content handler instance to the corresponding handler attributes of the parser:
dh = PyXMLConversionHandler(sys.stdout)
parser.StartElementHandler = dh.startElement
parser.EndElementHandler = dh.endElement
parser.CharacterDataHandler = dh.characters
This isn’t hard, but it is certainly more tedious than a single call to the SAX setContentHandler method, and the registration code has to change whenever we want to use more methods on the handler object.
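One way to contain that tedium is to keep the wiring in a single function. The following is only a minimal sketch: attach_handlers and EventLog are hypothetical names invented for illustration (EventLog stands in for a SAX-style handler such as PyXMLConversionHandler), and the code assumes a current Python 3 interpreter.

```python
from xml.parsers import expat


class EventLog:
    """Hypothetical SAX-style handler that simply records each callback."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, data):
        self.events.append(("text", data))


def attach_handlers(parser, handler):
    """Assign the handler's bound methods to the Expat handler attributes.

    Hypothetical helper: using another callback means adding a line here.
    """
    parser.StartElementHandler = handler.startElement
    parser.EndElementHandler = handler.endElement
    parser.CharacterDataHandler = handler.characters
    return parser


log = EventLog()
parser = attach_handlers(expat.ParserCreate(), log)
parser.Parse("<a>hi</a>", True)  # True marks this string as the final chunk
print(log.events)
```

The helper does not remove the underlying difference from SAX: each additional callback still needs its own assignment, but the change is confined to one place.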
Once we’ve initialized the handler methods we’re interested in using, we can start the parse. Again, this is a little different from the SAX version:
parser.Parse(sys.stdin.read( ), 1)
We know what sys.stdin.read( ) does, but the 1 used for the second parameter looks suspiciously like a magic number in our source code. It is actually a Boolean value indicating that the string being passed to the Parse method is the final chunk of the input; Parse can be called multiple times with smaller portions of the input and the flag set to 0, and then called with an indicator of 1 for the final chunk of data. This can be useful when reading data asynchronously from a network connection.
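To make the chunked calling pattern concrete, here is a minimal sketch assuming a current Python 3 interpreter. The function parse_in_chunks and the sample chunks are hypothetical; the list of strings stands in for data arriving piece by piece from a network connection.

```python
from xml.parsers import expat


def parse_in_chunks(chunks, on_start):
    """Feed an iterable of strings to Expat one piece at a time."""
    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    for chunk in chunks:
        parser.Parse(chunk, False)  # not final: more input may follow
    parser.Parse("", True)          # empty final chunk ends the document


seen = []
# Chunk boundaries may fall anywhere, even in the middle of a tag.
parse_in_chunks(
    ["<doc><it", "em id='1'/>", "</doc>"],
    lambda name, attrs: seen.append((name, attrs)),
)
print(seen)
```

Ending with an empty final chunk means the loop does not need to know in advance which piece of input is the last one.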
When parsing XML from a file object, the following method is also available:
parser.ParseFile(sys.stdin)
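A note for readers on Python 3, which is an assumption beyond the text here: ParseFile expects the file object’s read method to return bytes, so a file opened in binary mode (or sys.stdin.buffer rather than sys.stdin) is needed. The sketch below uses io.BytesIO as a stand-in for such a file; the sample document is made up for illustration.

```python
import io
from xml.parsers import expat

names = []
parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)

# io.BytesIO stands in for a real file opened in binary mode ("rb").
source = io.BytesIO(b"<doc><item/><item/></doc>")
parser.ParseFile(source)  # reads from the file object until it is exhausted
print(names)
```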
The complete script that uses the handler with PyExpat is shown in Example 3-11.
""" genhtml2.py - generates HTML from pyxml.xml """ import sys from xml.parsers import expat from handlers import PyXMLConversionHandler dh = PyXMLConversionHandler(sys.stdout) parser = expat.ParserCreate( ) parser.StartElementHandler = dh.startElement parser.EndElementHandler = dh.endElement parser.CharacterDataHandler = dh.characters parser.ParseFile(sys.stdin)
The output is to the standard output stream. If opened in your browser, it shows you all of the classes of the PyXML package and their methods, exactly as the pure SAX version of this example did.