Chapter 12. Processing XML

Introduction

Credit: Paul Prescod, co-author of XML Handbook (Prentice-Hall)

XML has become a central technology for all kinds of information exchange. Today, most new file formats are based on XML, and so are most new protocols. It simply isn’t possible to work with the emerging Internet infrastructure without supporting XML. Luckily, Python has supported XML for many versions, and that support has kept growing and maturing year after year.

Python and XML are perfect complements. XML is an open standards way of exchanging information. Python is an open source language that processes the information. Python excels at text processing and at handling complicated data structures. XML is text based and is, above all, a way of exchanging complicated data structures.

That said, working with XML is not so seamless that it requires no effort. There is always somewhat of a mismatch between the needs of a particular programming language and a language-independent information representation. So there is often a requirement to write code that reads (i.e., deserializes or parses) and writes (i.e., serializes) XML.

Parsing XML can be done with code written purely in Python, or with a module that is a C/Python mix. Python comes with the fast Expat parser written in C. Many XML applications use the Expat parser, and one of these recipes accesses Expat directly to build its own concept of an ideal in-memory Python representation of an XML document as a tree of “element” objects (an alternative to the standard DOM approach, which I will mention later in this introduction).

However, although Expat is ubiquitous in the XML world, it is far from being the only parser available, or necessarily the best one for any given application. A standard API called SAX allows any XML parser to be plugged into a Python program. The SAX API is demonstrated in several recipes that perform typical tasks such as checking that an XML document is well formed, extracting text from a document, or counting the tags in a document. These recipes should give you a good understanding of how SAX works. One more advanced recipe shows how to use one of SAX’s several auxiliary features, “filtering”, to normalize “text events” that might otherwise happen to get “fragmented”.

XML-RPC is a protocol built on top of XML for sending data structures from one program to another, typically across the Internet. XML-RPC allows programmers to completely hide the implementation languages of the two communicating components. Two components running on different operating systems, written in different languages, can still communicate easily. XML-RPC is built into Python. This chapter does not deal with XML-RPC, because, together with other alternatives for distributed programming, XML-RPC is covered in Chapter 15.

Other recipes in this chapter are a little bit more eclectic, dealing with issues that range from interfacing to proprietary XML parsers and document formats, to representing an entire XML document in memory as a Python object. One, in particular, shows how to auto-detect the Unicode encoding that an XML document uses without parsing the document. Unicode is central to the definition of XML, so it’s important to understand Python’s Unicode support if you will be doing any sophisticated work with XML.

The PyXML extension package supplies a variety of useful tools for working with XML. PyXML offers a full implementation of the Document Object Model (DOM)—as opposed to the subset bundled with Python itself—and a validating XML parser written entirely in Python. The DOM is a standard API that loads an entire XML document into memory. This can make XML processing easier for complicated structures in which there are many references from one part of the document to another, or when you need to correlate (i.e., compare) more than one XML document. One recipe shows how to use PyXML’s validating parser to validate and process an XML document, and another shows how to remove whitespace-only text nodes from an XML document’s DOM. You’ll find many other examples in the documentation of the PyXML package (http://pyxml.sourceforge.net/).

Other advanced tools that you can find in PyXML or, in some cases, in Fourthought’s open source 4Suite package (http://www.4suite.org/), from which much of PyXML derives, include implementations of a variety of XML-related standards, such as XPath, XSLT, XLink, XPointer, and RDF. If PyXML is already an excellent resource for XML power users in Python, 4Suite is even richer and more powerful.

XML has become so pervasive that, inevitably, you will also find XML-related recipes in other chapters of this book. Recipe 2.26 strips XML markup in a very rough and ready way. Recipe 1.23 shows how to insert XML character references while encoding Unicode text. Recipe 10.17 parses a Mac OS X pinfo-format XML stream to get detailed system information. Recipe 11.10 uses Tkinter to display an XML DOM as a GUI Tree widget. Recipe 14.11 deals with two XML file formats related to RSS[1] feeds, fetching and parsing a FOAF[2]-format input to produce an OPML[3]-format result, which is quite a typical XML-related task in today’s programming, and a good general example of how Python can help you with such tasks.

For more information on using Python and XML together, see Python and XML by Christopher A. Jones and Fred L. Drake, Jr. (O’Reilly).

12.1. Checking XML Well-Formedness

Credit: Paul Prescod, Farhad Fouladi

Problem

You need to check whether an XML document is well formed (not whether it conforms to a given DTD or schema), and you need to do this check quickly.

Solution

SAX (presumably using a fast parser such as Expat underneath) offers a fast, simple way to perform this task. Here is a script to check well-formedness on every file you mention on the script’s command line:

from xml.sax.handler import ContentHandler
from xml.sax import make_parser
from glob import glob
import sys
def parsefile(filename):
    parser = make_parser( )
    parser.setContentHandler(ContentHandler( ))
    parser.parse(filename)
for arg in sys.argv[1:]:
    for filename in glob(arg):
        try:
            parsefile(filename)
            print "%s is well-formed" % filename
        except Exception, e:
            print "%s is NOT well-formed! %s" % (filename, e)

Discussion

A text is a well-formed XML document if it adheres to all the basic syntax rules for XML documents. In other words, its XML declaration (if it has one) is correct, it has a single root element, all tags are properly nested, tag attributes are quoted, and so on.

This recipe uses the SAX API with a dummy ContentHandler that does nothing. Generally, when we parse an XML document with SAX, we use a ContentHandler instance to process the document’s contents. But in this case, we only want to know whether the document meets the most fundamental syntax constraints of XML; therefore, we need not do any processing, and the do-nothing handler suffices.

The parsefile function parses the whole document and throws an exception if an error is found. The recipe’s main code catches any such exception and prints it out like this:

$ python wellformed.py test.xml
test.xml is NOT well-formed! test.xml:1002:2: mismatched tag

This means that character 2 on line 1,002 has a mismatched tag.
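For readers on Python 3, the same check can be sketched as follows. This is a minimal sketch, not part of the original recipe: the helper name check_well_formed is an assumption made for illustration, and it works on a bytestring rather than on a filename so that it is easy to try out. It relies on SAXParseException carrying the line and column of the error.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def check_well_formed(data):
    """Return (True, None) if the bytes in data parse as well-formed XML,
    otherwise (False, "line:column: message")."""
    try:
        # The do-nothing ContentHandler suffices, as in the recipe
        xml.sax.parseString(data, ContentHandler())
    except xml.sax.SAXParseException as e:
        where = "%s:%s" % (e.getLineNumber(), e.getColumnNumber())
        return False, "%s: %s" % (where, e.getMessage())
    return True, None

print(check_well_formed(b"<a><b/></a>"))
print(check_well_formed(b"<a><b></a>"))
```

Catching SAXParseException rather than Exception keeps unrelated failures (such as a missing file, if you adapt the sketch to filenames) from being reported as malformed XML.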

This recipe does not check adherence to a DTD or schema, which is a separate procedure called validation. The performance of the script should be quite good, precisely because it focuses on performing a minimal irreducible core task. However, sometimes you need to squeeze out the last drop of performance because you’re checking the well-formedness of truly huge files. If you know for sure that you do have Expat, specifically, installed on your system, you may alternatively choose to use Expat directly instead of SAX. To try this approach, you can change function parsefile to the following code:

import xml.parsers.expat
def parsefile(file):
    parser = xml.parsers.expat.ParserCreate( )
    parser.ParseFile(open(file, "r"))

Don’t expect all that much of an improvement in performance when using Expat directly instead of SAX. However, you might gain a little bit.
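If you try the direct-Expat variant under Python 3, a sketch along the following lines may help. It is an illustration under assumptions, not the recipe's code: the function name expat_well_formed is hypothetical, and it takes the document as bytes (if you prefer ParseFile, open the file in binary mode). Expat signals any syntax problem by raising xml.parsers.expat.ExpatError.

```python
import xml.parsers.expat

def expat_well_formed(data):
    """Return True if the bytes in data are well-formed XML, per Expat."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this chunk as the final one
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True

print(expat_well_formed(b"<doc><item/></doc>"))
print(expat_well_formed(b"<doc><item></doc>"))
```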

See Also

Recipe 12.2 and Recipe 12.3, for other uses of SAX; the PyXML package (http://pyxml.sourceforge.net/) includes the pure-Python validating parser xmlproc, which checks the conformance of XML documents to specific DTDs; the PyRXP package from ReportLab is a wrapper around the fast validating parser RXP (http://www.reportlab.com/xml/pyrxp.html), which is available under the GPL license.

12.2. Counting Tags in a Document

Credit: Paul Prescod

Problem

You want to get a sense of how often particular elements occur in an XML document, and the relevant counts must be extracted rapidly.

Solution

You can subclass SAX’s ContentHandler to make your own specialized classes for any kind of task, including the collection of such statistics:

from xml.sax.handler import ContentHandler
import xml.sax
class countHandler(ContentHandler):
    def __init__(self):
        self.tags={  }
    def startElement(self, name, attr):
        self.tags[name] = 1 + self.tags.get(name, 0)
parser = xml.sax.make_parser( )
handler = countHandler( )
parser.setContentHandler(handler)
parser.parse("test.xml")
tags = handler.tags.keys( )
tags.sort( )
for tag in tags:
    print tag, handler.tags[tag]

Discussion

When I start working with a new XML content set, I like to get a sense of which elements are in it and how often they occur. For this purpose, I use several small variants of this recipe. I could also collect attributes just as easily, as you can see, since attributes are also passed to the startElement method that I’m overriding. If you add a stack, you can also keep track of which elements occur within other elements (for this, of course, you also have to override the endElement method so you can pop the stack).
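The stack-based variant mentioned above might look like the following Python 3 sketch. The class name NestingCounter and the choice of counting (parent, child) pairs are assumptions made for illustration; the root element is recorded with a parent of None.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class NestingCounter(ContentHandler):
    """Count how often each tag occurs directly inside each parent tag."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.stack = []   # names of the currently open elements
        self.pairs = {}   # (parent name, child name) -> count
    def startElement(self, name, attrs):
        parent = self.stack[-1] if self.stack else None
        key = (parent, name)
        self.pairs[key] = 1 + self.pairs.get(key, 0)
        self.stack.append(name)
    def endElement(self, name):
        # Leaving the element: it is no longer a candidate parent
        self.stack.pop()

handler = NestingCounter()
xml.sax.parseString(b"<doc><sec><p/><p/></sec><p/></doc>", handler)
for (parent, child), count in sorted(handler.pairs.items(), key=str):
    print(parent, child, count)
```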

This recipe also works well as a simple example of a SAX application, usable as the basis for any SAX application. Alternatives to SAX include pulldom and minidom. For any simple processing (including this example), these alternatives would be overkill, particularly if the document you are processing is very large. DOM approaches are generally justified only when you need to perform complicated editing and alteration on an XML document, when the document itself is made complicated by references that go back and forth inside it, or when you need to correlate (i.e., compare) multiple documents.

ContentHandler subclasses offer many other options, and the online Python documentation does a pretty good job of explaining them. This recipe’s countHandler class overrides ContentHandler’s startElement method, which the parser calls at the start of each element, passing as arguments the element’s tag name as a Unicode string and the collection of attributes. Our override of this method counts the number of times each tag name occurs. In the end, we extract the dictionary used for counting and emit it (in alphabetical order, which we easily obtain by sorting the keys).

See Also

Recipe 12.3 for other uses of SAX.

12.3. Extracting Text from an XML Document

Credit: Paul Prescod

Problem

You need to extract only the text from an XML document, not the tags.

Solution

Once again, subclassing SAX’s ContentHandler makes this task quite easy:

from xml.sax.handler import ContentHandler
import xml.sax
import sys
class textHandler(ContentHandler):
    def characters(self, ch):
        sys.stdout.write(ch.encode("Latin-1"))
parser = xml.sax.make_parser( )
handler = textHandler( )
parser.setContentHandler(handler)
parser.parse("test.xml")

Discussion

Sometimes you want to get rid of XML tags—for example, to re-key a document or to spell-check it. This recipe performs this task and works with any well-formed XML document. It is quite efficient.

In this recipe’s textHandler class, we override ContentHandler’s characters method, which the parser calls for each string of text in the XML document (excluding tags, XML comments, and processing instructions), passing as the only argument the piece of text as a Unicode string. We have to encode this Unicode before we can emit it to standard output. (See Recipe 1.22 for more information about emitting Unicode to standard output.) In this recipe, we’re using the Latin-1 (also known as ISO-8859-1) encoding, which covers all western European alphabets and is supported by many popular output devices (e.g., printers and terminal-emulation windows). However, you should use whatever encoding is most appropriate for the documents you’re handling, as long, of course, as that encoding is supported by the devices you need to use. The configuration of your devices may depend on your operating system’s concepts of locale and code page. Unfortunately, these issues vary too much between operating systems for me to go into further detail.
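A variant that collects the text instead of writing it out immediately can be sketched in Python 3 as follows. The names TextCollector and extract_text are hypothetical. Joining the pieces first lets you pick the output encoding, and an error policy such as "replace", in one place; without such a policy, a character outside Latin-1 raises UnicodeEncodeError.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TextCollector(ContentHandler):
    """Accumulate every piece of character data the parser reports."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []
    def characters(self, content):
        self.chunks.append(content)

def extract_text(data):
    """Return all the text in the XML bytes data as one string."""
    handler = TextCollector()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)

text = extract_text(b"<p>Hello, <b>XML</b> world<!-- not text --></p>")
# Encode once, at the boundary; unencodable characters become "?"
print(text.encode("latin-1", "replace"))
```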

A simple alternative, if you know that handling Unicode is not going to be a problem, is to use sgmllib. It’s not quite as fast but somewhat more robust against XML of dubious well-formedness:

from sgmllib import SGMLParser
class XMLJustText(SGMLParser):
    def handle_data(self, data):
        print data
XMLJustText( ).feed(open('text.xml').read( ))

An even simpler and rougher way to extract text from an XML document is shown in Recipe 2.26.

See Also

Recipe 12.1 and Recipe 12.2 for other uses of SAX.

12.4. Autodetecting XML Encoding

Credit: Paul Prescod

Problem

You have XML documents that may use a large variety of Unicode encodings, and you need to find out which encoding each document is using.

Solution

This task is one that we need to code ourselves, rather than getting an existing package to perform it, if we want complete generality:

import codecs, encodings
""" Caller will hand this library a buffer string, and ask us to convert
    the buffer, or autodetect what codec the buffer probably uses. """
# 'None' stands for a potentially variable byte ("##" in the XML spec...)
autodetect_dict={ # bytepattern          : ("name",
                (0x00, 0x00, 0xFE, 0xFF) : ("ucs4_be"),
                (0xFF, 0xFE, 0x00, 0x00) : ("ucs4_le"),
                (0xFE, 0xFF, None, None) : ("utf_16_be"),
                (0xFF, 0xFE, None, None) : ("utf_16_le"),
                (0x00, 0x3C, 0x00, 0x3F) : ("utf_16_be"),
                (0x3C, 0x00, 0x3F, 0x00) : ("utf_16_le"),
                (0x3C, 0x3F, 0x78, 0x6D) : ("utf_8"),
                (0x4C, 0x6F, 0xA7, 0x94) : ("EBCDIC"),
                }
def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
        The buffer string should be at least four bytes long.
        Returns None if encoding cannot be detected.
        Note that encoding_name might not have an installed
        decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain
    encoding = "utf_8" # According to the XML spec, this is the default
                       # This code successively tries to refine the default:
                       # Whenever it fails to refine, it falls back to
                       # the last place encoding was set
    # Build a tuple (not a list), since it is used as a dictionary key
    bytes = byte1, byte2, byte3, byte4 = tuple(map(ord, buffer[0:4]))
    enc_info = autodetect_dict.get(bytes, None)
    if not enc_info: # Try autodetection again, removing potentially
                     # variable bytes
        bytes = byte1, byte2, None, None
        enc_info = autodetect_dict.get(bytes)
    if enc_info:
        encoding = enc_info # We have a guess...these are
                            # the new defaults
        # Try to find a more precise encoding using XML declaration
        secret_decoder_ring = codecs.lookup(encoding)[1]
        decoded, length = secret_decoder_ring(buffer)
        first_line = decoded.split("\n", 1)[0]
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos!=-1:
                # Look for double quotes
                quote_pos = first_line.find('"', encoding_pos)
                if quote_pos==-1:                 # Look for single quote
                    quote_pos = first_line.find("'", encoding_pos)
                if quote_pos>-1:
                    quote_char = first_line[quote_pos]
                    rest = first_line[quote_pos+1:]
                    encoding = rest[:rest.find(quote_char)]
    return encoding

Discussion

The XML specification describes the outline of an algorithm for detecting the Unicode encoding that an XML document uses. This recipe implements that algorithm and helps your XML-processing programs determine which encoding is being used by a specific document.

The default encoding (unless we can determine another one specifically) must be UTF-8, as it is part of the specifications that define XML. Certain byte patterns in the first four, or sometimes even just the first two, bytes of the text can identify a different encoding. For example, if the text starts with the two bytes 0xFF, 0xFE we can be certain that these bytes are a byte-order mark that identifies the encoding type as little-endian (low byte before high byte in each character) and the encoding itself as UTF-16 (or the 32-bits-per-character UCS-4, if the next two bytes in the text are 0, 0).
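The byte-order-mark part of this reasoning can be sketched on its own in Python 3, using the constants that the codecs module provides. This is an illustration, not the recipe's code: the name sniff_bom is hypothetical, the codec names are Python 3's (utf_32_be rather than ucs4_be), and the UTF-8 mark is an addition that the recipe's table does not include.

```python
import codecs

# Longer marks come first: the UTF-32 little-endian mark begins with
# the same two bytes as the UTF-16 little-endian mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

def sniff_bom(data):
    """Return the codec name implied by a leading byte-order mark in
    the bytes data, or None if data does not start with one."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xff\xfe<\x00?\x00"))
print(sniff_bom(b"\xff\xfe\x00\x00<\x00\x00\x00"))
```

The ordering of the table matters for exactly the reason the text gives: the two bytes 0xFF, 0xFE alone cannot distinguish UTF-16 from the 32-bit encoding until the next two bytes are examined.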

If we get as far as this, we must also examine the first line of the text. For this purpose, we decode the text from a bytestring into Unicode, with the encoding determined so far and detect the first line-end '\n' character. If the first line begins with u'<?xml', it’s an XML declaration and may explicitly specify an encoding by using the keyword encoding as an attribute. The nested if statements in the recipe check for that case, and, if they find an encoding thus specified, the recipe returns the encoding thus found as the encoding the recipe has determined. This step is absolutely crucial, since any text starting with the single-byte ASCII-like representation of the XML declaration, <?xml, would be otherwise erroneously identified as encoded in UTF-8, while its explicit encoding attribute may specify it as being, for example, one of the ISO-8859 standard encodings.

This recipe makes the assumption that, as the XML specs require, the XML declaration, if any, is terminated by an end-of-line character. If you need to deal with almost-XML documents that are malformed in this very specific way (i.e., an incorrect XML declaration that is not terminated by an end-of-line character), you may need to apply some heuristic adjustments, for example, through regular expressions. However, it’s impossible to offer precise suggestions, since malformedness may come in such a wide variety of errant forms.
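One possible regular-expression heuristic of the kind just mentioned is sketched below in Python 3. It is an assumption-laden illustration rather than a complete answer: the names ENCODING_RE and declared_encoding are hypothetical, the function expects text that has already been decoded with the provisional encoding, and it does not depend on the declaration being followed by an end-of-line character.

```python
import re

# Matches encoding="..." or encoding='...' inside a leading XML declaration
ENCODING_RE = re.compile(
    r"""<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1""")

def declared_encoding(text):
    """Return the encoding named in the XML declaration that starts
    text, or None if there is no declaration or no encoding in it."""
    match = ENCODING_RE.match(text)
    if match:
        return match.group(2)
    return None

print(declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(declared_encoding("<a encoding='not-a-declaration'/>"))
```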

This code detects a variety of encodings, including some that are not yet supported by Python’s Unicode decoders. So, the fact that you can decipher the encoding does not guarantee that you can then decipher the document itself!

See Also

Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/. Library Reference and Python in a Nutshell document the built-in str and unicode types, and modules unicodedata and codecs; Recipe 1.21 and Recipe 1.22.

12.5. Converting an XML Document into a Tree of Python Objects

Credit: John Bair, Christoph Dietze

Problem

You want to load an XML document into memory, but you don’t like the complicated access procedures of DOM. You’d prefer something more Pythonic—specifically, you’d like to map the document into a tree of Python objects.

Solution

To build our tree of objects, we can directly wrap the fast expat parser:

from xml.parsers import expat
class Element(object):
    ''' A parsed XML element '''
    def __init__(self, name, attributes):
        # Record tagname and attributes dictionary
        self.name = name
        self.attributes = attributes
        # Initialize the element's cdata and children to empty
        self.cdata = ''
        self.children = [  ]
    def addChild(self, element):
        self.children.append(element)
    def getAttribute(self, key):
        return self.attributes.get(key)
    def getData(self):
        return self.cdata
    def getElements(self, name=''):
        if name:
            return [c for c in self.children if c.name == name]
        else:
            return list(self.children)
class Xml2Obj(object):
    ''' XML to Object converter '''
    def __init__(self):
        self.root = None
        self.nodeStack = [  ]
    def StartElement(self, name, attributes):
        'Expat start element event handler'
        # Instantiate an Element object
        element = Element(name.encode( ), attributes)
        # Push element onto the stack and make it a child of parent
        if self.nodeStack:
            parent = self.nodeStack[-1]
            parent.addChild(element)
        else:
            self.root = element
        self.nodeStack.append(element)
    def EndElement(self, name):
        'Expat end element event handler'
        self.nodeStack.pop( )
    def CharacterData(self, data):
        'Expat character data event handler'
        if data.strip( ):
            data = data.encode( )
            element = self.nodeStack[-1]
            element.cdata += data
    def Parse(self, filename):
        # Create an Expat parser
        Parser = expat.ParserCreate( )
        # Set the Expat event handlers to our methods
        Parser.StartElementHandler = self.StartElement
        Parser.EndElementHandler = self.EndElement
        Parser.CharacterDataHandler = self.CharacterData
        # Parse the XML File
        ParserStatus = Parser.Parse(open(filename).read( ), 1)
        return self.root
parser = Xml2Obj( )
root_element = parser.Parse('sample.xml')

Discussion

I saw Christoph Dietze’s recipe (http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/116539) about turning the structure of an XML document into a simple combination of dictionaries and lists and thought it was a really good idea. This recipe is a variation on that idea, with several differences.

For maximum speed, the recipe uses the low-level expat parser directly. It would get no real added value from the richer SAX interface, much less from the slow and memory-hungry DOM approach. Building the parent-children connections is not hard even with an event-driven interface, as this recipe shows by using a simple stack for the purpose.

The main difference with respect to Dietze’s original idea is that this recipe loads the XML document into a tree of Python objects (rather than a combination of dictionaries and lists), one per node, with nicely named attributes allowing access to each node’s characteristics—tagname, attributes (as a Python dictionary), character data (i.e., cdata in XML parlance) and children elements (as a Python list).

The various accessor methods of class Element are, of course, optional. You might prefer to access the attributes directly. I think they add no complexity and look nicer, but, obviously, your tastes may differ. This is, after all, just a recipe, so feel free to alter the mix of seasonings at will!
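To make the direct-access style concrete, here is a compact Python 3 sketch of the same stack-based idea without accessor methods. It is a variation written for illustration, not the recipe's code: the names Node, build_tree, and outline are hypothetical, and the document is passed as bytes instead of being read from a file.

```python
from xml.parsers import expat

class Node(object):
    """A parsed element whose fields are meant to be read directly."""
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = attributes
        self.cdata = ''
        self.children = []

def build_tree(data):
    """Parse the XML bytes in data and return the root Node."""
    stack = []   # currently open elements, innermost last
    roots = []   # receives the single root element
    def start(name, attributes):
        node = Node(name, attributes)
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    def end(name):
        stack.pop()
    def chars(text):
        if text.strip():
            stack[-1].cdata += text
    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)
    return roots[0]

def outline(node, depth=0):
    """Return one indented line of text per element in the tree."""
    lines = ['%s%s %r %r' % ('  ' * depth, node.name,
                             node.attributes, node.cdata)]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

root = build_tree(b'<inventory><item id="1">apple</item>'
                  b'<item id="2">pear</item></inventory>')
print('\n'.join(outline(root)))
```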

You can find other similar ideas (e.g., bypass the DOM, build something more Pythonic as the memory representation of an XML document) in many other excellent and more complete projects, such as PyRXP (http://www.reportlab.org/pyrxp.html), ElementTree (http://effbot.org/zone/element-index.htm), and XIST (http://www.livinglogic.de/Python/xist/).

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library, and xml.parsers.expat in particular. PyRXP is at http://www.reportlab.org/pyrxp.html; ElementTree is at http://effbot.org/zone/element-index.htm; XIST is at http://www.livinglogic.de/Python/xist/.

12.6. Removing Whitespace-only Text Nodes from an XML DOM Node’s Subtree

Credit: Brian Quinlan, David Wilson

Problem

You want to remove, from the DOM representation of an XML document, all the text nodes within a subtree that contain only whitespace.

Solution

XML parsers consider several complex conditions when deciding which whitespace-only text nodes to preserve during DOM construction. Unfortunately, the result is often not what you want, so it’s helpful to have a function to remove all whitespace-only text nodes from among a given node’s descendants:

from xml import dom
def remove_whitespace_nodes(node):
    """ Removes all of the whitespace-only text descendants of a DOM node. """
    # prepare the list of text nodes to remove (and recurse when needed)
    remove_list = [  ]
    for child in node.childNodes:
        if child.nodeType == dom.Node.TEXT_NODE and not child.data.strip( ):
            # add this text node to the to-be-removed list
            remove_list.append(child)
        elif child.hasChildNodes( ):
            # recurse, it's the simplest way to deal with the subtree
            remove_whitespace_nodes(child)
    # perform the removals
    for node in remove_list:
        node.parentNode.removeChild(node)
        node.unlink( )

Discussion

This recipe’s code works with any correctly implemented Python XML DOM, including the xml.dom.minidom that is part of the Python Standard Library and the more complete DOM implementation that comes with PyXML.

The implementation of function remove_whitespace_nodes is quite simple but rather instructive: in the first for loop we build a list of all child nodes to remove, and then in a second, separate loop we do the removal. This precaution is a good example of a general rule in Python: do not alter the very container you’re looping on. Sometimes you can get away with it, but it is unwise to count on it in the general case. On the other hand, the function can perfectly well call itself recursively within its first for loop because such a call does not alter the very list node.childNodes on which the loop is iterating (it may alter some items in that list, but it does not alter the list object itself).
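Another way to honor the same rule is to loop over a copy of the child list, so that removals never touch the sequence being iterated. The following Python 3 sketch, run against xml.dom.minidom, shows that alternative; the function name strip_whitespace_nodes is an assumption made for illustration, and this is a variation on the recipe rather than its code.

```python
from xml.dom import Node, minidom

def strip_whitespace_nodes(node):
    """Remove all whitespace-only text descendants of a DOM node."""
    # list(...) makes a copy, so removing a child below does not
    # disturb the sequence this loop is iterating on
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
            child.unlink()
        elif child.hasChildNodes():
            strip_whitespace_nodes(child)

doc = minidom.parseString("<a>\n  <b>text</b>\n  <c/>\n</a>")
strip_whitespace_nodes(doc)
print(doc.documentElement.toxml())
```

The two-loop form in the recipe and the copy-then-loop form here do the same work; which one reads better is largely a matter of taste.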

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.

12.7. Parsing Microsoft Excel’s XML

Credit: Thomas Guettler

Problem

You have Microsoft Excel spreadsheets saved in XML form, and want to parse them into memory as Python nested lists.

Solution

The XML form of Excel spreadsheets is quite simple: all text is in Cell tags, which are nested in Row tags nested in Table tags. SAX makes it quite simple to parse this kind of XML into memory:

import sys
from xml.sax import saxutils, parse
class ExcelHandler(saxutils.DefaultHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)
if __name__ == '__main__':
    excelHandler = ExcelHandler( )
    parse(sys.argv[1], excelHandler)
    print excelHandler.tables

Discussion

The structure of the parser presented in this recipe is pleasingly simple: at each of three logical nesting levels of data, we collect content into a list. Each time a tag of a given level begins, we start with an empty list for it; each time the tag ends, we append the tag’s contents to the list of the next upper level. The net result is that the top-level list, the one named tables, accumulates all of the spreadsheet’s contents with the proper structure (a triply nested list). At the lowest level, of course, we join all the text strings that are reported as being within the same cell into a single cell content text string, when we accumulate, because the division between the various strings is just an artefact of the XML parsing process.

For example, consider a tiny spreadsheet with one column and three rows, where the first two rows each hold the number 2 and the third one holds the number 4 obtained by summing the numbers in the first two rows with an Excel formula. The relevant snippet of the Excel XML output (XMLSS format, as Microsoft calls it) is then:

<Table ss:ExpandedColumnCount="1" ss:ExpandedRowCount="3"
       x:FullColumns="1" x:FullRows="1">
   <Row>
    <Cell><Data ss:Type="Number">2</Data></Cell>
   </Row>
   <Row>
    <Cell><Data ss:Type="Number">2</Data></Cell>
   </Row>
   <Row>
    <Cell ss:Formula="=SUM(R[-2]C, R[-1]C)">
          <Data ss:Type="Number">4</Data></Cell>
   </Row>
</Table>

and running the script in this recipe over this file emits:

[[[u'2'], [u'2'], [u'4']]]

As you can see, the XMLSS file also contains a lot of supplementary information that this recipe is not collecting—the attributes hold information about the type of data (number or string), the formula used for the computation (if any), and so on. If you need any or all of this supplemental information, it’s not hard to enrich this recipe to record and use it.
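As one possible enrichment, the following Python 3 sketch records the data type and the formula of each cell along with its text. It is an illustration under assumptions: the class name TypedCellHandler and the (text, type, formula) tuple layout are hypothetical, the attribute names ss:Type and ss:Formula are the ones visible in the snippet above, and the xmlns:ss declaration in the sample is there only so that the fragment stands alone.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class TypedCellHandler(ContentHandler):
    """Collect each cell as a (text, type, formula) tuple, grouped by row."""
    def __init__(self):
        ContentHandler.__init__(self)
        self.rows = []
        self.chars = []
        self.type = None
        self.formula = None
        self.in_data = False
    def startElement(self, name, attrs):
        if name == "Row":
            self.rows.append([])
        elif name == "Cell":
            self.formula = attrs.get("ss:Formula")  # None if absent
            self.type = None
            self.chars = []
        elif name == "Data":
            self.type = attrs.get("ss:Type")
            self.in_data = True
    def characters(self, content):
        # Keep only text inside Data, skipping layout whitespace
        if self.in_data:
            self.chars.append(content)
    def endElement(self, name):
        if name == "Data":
            self.in_data = False
        elif name == "Cell":
            cell = ("".join(self.chars), self.type, self.formula)
            self.rows[-1].append(cell)

sample = b'''<Table xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Row><Cell><Data ss:Type="Number">2</Data></Cell></Row>
 <Row><Cell ss:Formula="=SUM(R[-2]C, R[-1]C)">
      <Data ss:Type="Number">4</Data></Cell></Row>
</Table>'''
handler = TypedCellHandler()
xml.sax.parseString(sample, handler)
print(handler.rows)
```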

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library and SAX in particular.

12.8. Validating XML Documents

Credit: Paul Sholtz, Jeroen Jeroen, Marius Gedminas

Problem

You are handling XML documents and must check the validity with respect to either internal or external DTDs. You possibly also want to perform application-specific processing during the validation process.

Solution

You often want to validate an XML document file with respect to a !DOCTYPE declaration that the document file contains. On occasion, though, you may want to force loading of an external DTD from a given file. Moreover, a frequent need is to also perform application-specific processing during validation. A function with optional parameters, using modules from the PyXML package, can accommodate all of these needs:

from xml.parsers.xmlproc import utils, xmlval, xmldtd
def validate_xml_file(xml_filename, app=None, dtd_filename=None):
    # build validating parser object with appropriate error handler
    parser = xmlval.Validator( )
    parser.set_error_handler(utils.ErrorPrinter(parser))
    if dtd_filename is not None:
        # DTD file specified, load and set it as the DTD to use
        dtd = xmldtd.load_dtd(dtd_filename)
        parser.val.dtd = parser.dtd = parser.ent = dtd
    if app is not None:
        # Application processing requested, set application object
        parser.set_application(app)
    # everything being set correctly, finally perform the parsing
    parser.parse_resource(xml_filename)

If your XML data is in a string s, rather than in a file, instead of the parser.parse_resource call, you should use the following two statements in a variant of the previously shown function:

    parser.feed(s)
    parser.close( )

Discussion

Documentation on XML parsing in general, and xmlproc in particular, is easy enough to come by. However, XML is a very large subject, and PyXML is a correspondingly large package. The package’s documentation is often not entirely complete and up to date; even if it were, finding out how to perform specific tasks would still take quite a bit of digging. This recipe shows how to validate documents in a simple way that is easy to adapt to your specific needs.

If you need to perform application-specific processing, as well as validation, you need to make your own application object (an instance of some subclass of xmlproc.xmlproc.Application that appropriately overrides some or all of its various methods, most typically handle_start_tag, handle_end_tag, handle_data, and doc_end) and pass the application object as the app argument to the validate_xml_file function.

If you need to handle errors and warnings differently from the emitting of copious error messages that xmlproc.utils.ErrorPrinter performs, you need to subclass (either that class or its base xmlproc.xmlapp.ErrorHandler directly) to perform whatever tweaking you need. (See the sources of the utils.py module for examples; that module will usually be at relative path _xmlplus/parsers/xmlproc/utils.py in your Python library directory, after you have installed the PyXML package.) Then, you need to alter the call to the method set_error_handler that you see in this recipe’s validate_xml_file function so that it uses an instance of your own error-handling class. You might modify the validate_xml_file function to take yet another optional parameter err=None for the purpose, but this way overgeneralization lies. I’ve found ErrorPrinter’s diagnostics normally cover my applications’ needs, so, in the code shown in this recipe’s Solution, I have not provided for this specific customization.

See Also

The PyXML web site at http://pyxml.sourceforge.net/.

12.9. Filtering Elements and Attributes Belonging to a Given Namespace

Credit: A.M. Kuchling

Problem

While parsing an XML document with SAX, you need to filter out all of the elements and attributes that belong to a particular namespace.

Solution

The SAX filter concept is just what we need here:

from xml import sax
from xml.sax import handler, saxutils, xmlreader
# the namespace we want to remove in our filter
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
class RDFFilter(saxutils.XMLFilterBase):
    def __init__(self, *args):
        saxutils.XMLFilterBase.__init__(self, *args)
        # initially, we're not in RDF, and just one stack level is needed
        self.in_rdf_stack = [False]
    def startElementNS(self, (uri, localname), qname, attrs):
        if uri == RDF_NS or self.in_rdf_stack[-1] == True:
            # skip elements with namespace, if that namespace is RDF or
            # the element is nested in an RDF one -- and grow the stack
            self.in_rdf_stack.append(True)
            return
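        # Make dicts of the attributes (and their qualified names) that
        # DON'T belong to the RDF namespace; the loop uses its own variable
        # names, so the element's uri and localname are not clobbered
        keep_attrs = {  }
        keep_qnames = {  }
        for key, value in attrs.items( ):
            attr_uri, attr_localname = key
            if attr_uri != RDF_NS:
                keep_attrs[key] = value
                keep_qnames[key] = attrs.getQNameByName(key)
        # prepare the cleaned-up bunch of non-RDF-namespace attributes
        attrs = xmlreader.AttributesNSImpl(keep_attrs, keep_qnames)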
        # grow the stack by replicating the latest entry
        self.in_rdf_stack.append(self.in_rdf_stack[-1])
        # finally delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.startElementNS(self,
                 (uri, localname), qname, attrs)
    def characters(self, content):
        # skip characters that are inside an RDF-namespaced tag being skipped
        if self.in_rdf_stack[-1]:
            return
        # delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.characters(self, content)
    def endElementNS(self, (uri, localname), qname):
        # pop the stack -- nothing else to be done, if we were skipping
        if self.in_rdf_stack.pop( ) == True:
            return
        # delegate the rest of the operation to our base class
        saxutils.XMLFilterBase.endElementNS(self, (uri, localname), qname)
def filter_rdf(input, output):
    """ filter_rdf(input=some_input_file_or_filename, output=some_output_stream)
        Parses the XML input from the input source (a filename or a
        file-like object), filtering out all elements and attributes that
        are in the RDF namespace, and writes the result to the output stream.
    """
    output_gen = saxutils.XMLGenerator(output)
    parser = sax.make_parser( )
    filter = RDFFilter(parser)
    filter.setFeature(handler.feature_namespaces, True)
    filter.setContentHandler(output_gen)
    filter.setErrorHandler(handler.ErrorHandler( ))
    filter.parse(input)
if __name__ == '__main__':
    import StringIO, sys
    TEST_RDF = '''<?xml version="1.0"?>
<metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
   <title>  This is non-RDF content </title>
   <rdf:RDF>
     <rdf:Description rdf:about="%s">
       <dc:Creator>%s</dc:Creator>
     </rdf:Description>
   </rdf:RDF>
  <element />
</metadata>
'''
    input = StringIO.StringIO(TEST_RDF)
    filter_rdf(input, sys.stdout)

This module, when run as a main script, emits something like:

<?xml version="1.0" encoding="iso-8859-1"?>
<metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
   <title>  This is non-RDF content </title>
  <element></element>
</metadata>

Discussion

My motivation for originally writing this recipe came from processing files of metadata, containing RDF mixed with other elements. I wanted to generate a version of the metadata with the RDF filtered out.

The filter_rdf function does the job, reading XML input from the input stream and writing it to the output stream. The standard XMLGenerator class in xml.sax.saxutils is used to produce the output. Function filter_rdf internally uses a filtering class called RDFFilter, also shown in this recipe’s Solution, pushing that filter on top of the XML parser to suppress elements and attributes belonging to the RDF_NS namespace.

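Non-RDF elements contained within an RDF element are also removed. To modify this behavior, change the first line of the startElementNS method to use just if uri == RDF_NS as the guard, and append False (rather than a copy of the latest entry) to the stack for the elements you keep, so that their text and end tags are passed on too.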

This code doesn’t delete the xmlns declaration for the RDF namespace; I’m willing to live with a little unnecessary but harmless cruft in the output rather than go to huge trouble to remove it.
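For readers on Python 3, where tuple parameters such as (uri, localname) and the StringIO module are no longer available, the same idea can be sketched as follows. This is an adaptation under stated assumptions, not the recipe's own code: the names NamespaceFilter, strip_namespace, and SAMPLE are hypothetical, and a simple depth counter stands in for the stack of booleans used in the Solution.

```python
import io
from xml import sax
from xml.sax import handler, saxutils, xmlreader

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

class NamespaceFilter(saxutils.XMLFilterBase):
    """Hypothetical Python 3 variant: drop elements and attributes
    that belong to one namespace."""
    def __init__(self, parent, namespace):
        super().__init__(parent)
        self.namespace = namespace
        self.skip_depth = 0   # greater than 0 while inside a skipped element

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        if self.skip_depth or uri == self.namespace:
            # skip this element and everything nested inside it
            self.skip_depth += 1
            return
        keep_attrs, keep_qnames = {}, {}
        for key, value in attrs.items():
            if key[0] != self.namespace:
                keep_attrs[key] = value
                keep_qnames[key] = attrs.getQNameByName(key)
        super().startElementNS(
            name, qname, xmlreader.AttributesNSImpl(keep_attrs, keep_qnames))

    def characters(self, content):
        if not self.skip_depth:
            super().characters(content)

    def endElementNS(self, name, qname):
        if self.skip_depth:
            self.skip_depth -= 1
            return
        super().endElementNS(name, qname)

def strip_namespace(xml_text, namespace=RDF_NS):
    """Return xml_text with elements and attributes in namespace removed."""
    out = io.StringIO()
    filt = NamespaceFilter(sax.make_parser(), namespace)
    filt.setFeature(handler.feature_namespaces, True)
    filt.setContentHandler(saxutils.XMLGenerator(out, encoding='utf-8'))
    filt.parse(io.BytesIO(xml_text.encode('utf-8')))
    return out.getvalue()

SAMPLE = '''<?xml version="1.0"?>
<metadata xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>This is non-RDF content</title>
  <rdf:RDF>
    <rdf:Description rdf:about="urn:example:doc">
      <dc:Creator>Somebody</dc:Creator>
    </rdf:Description>
  </rdf:RDF>
  <element rdf:about="urn:example:doc" id="e1"/>
</metadata>
'''
result = strip_namespace(SAMPLE)
print(result)
```

As in the Solution, the xmlns:rdf declaration itself survives in the output; only the elements and attributes that use the namespace are dropped.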

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.

12.10. Merging Continuous Text Events with a SAX Filter

Credit: Uche Ogbuji, James Kew, Peter Cogolo

Problem

A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method), and this multiplicity of events for a single text string may give problems to SAX handlers. You want to insert a filter into the SAX handler chain to ensure that each text node in the document is reported as a single SAX characters event (meaning, in practice, that it calls characters just once).

Solution

Module xml.sax.saxutils in the standard Python library includes a class XMLFilterBase that we can subclass to implement any XML filter we may need:

from xml.sax.saxutils import XMLFilterBase
class text_normalize_filter(XMLFilterBase):
    """ SAX filter to ensure that contiguous text nodes are merged into one
    """
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        self._accumulator = [  ]
    def _complete_text_node(self):
        if self._accumulator:
            self._downstream.characters(''.join(self._accumulator))
            self._accumulator = [  ]
    def characters(self, text):
        self._accumulator.append(text)
    def ignorableWhitespace(self, ws):
        self._accumulator.append(ws)
def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node( )
        getattr(self._downstream, method_name)(*a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)
for n in '''startElement startElementNS endElement endElementNS
            processingInstruction comment'''.split( ):
    _wrap_complete(n)
if __name__ == "__main__":
    import sys
    from xml import sax
    from xml.sax.saxutils import XMLGenerator
    parser = sax.make_parser( )
    # XMLGenerator is a special predefined SAX handler that merely writes
    # SAX events back into an XML document
    downstream_handler = XMLGenerator( )
    # upstream, the parser; downstream, the next handler in the chain
    filter_handler = text_normalize_filter(parser, downstream_handler)
    # The SAX filter base is designed so that the filter takes on much of the
    # interface of the parser itself, including the "parse" method
    filter_handler.parse(sys.argv[1])

Discussion

A SAX parser can report contiguous text using multiple characters events (meaning, in practice, multiple calls to the characters method of the downstream handler). In other words, given an XML document whose content is 'abc', the text could technically be reported as up to three characters events: one for the 'a' character, one for the 'b', and a third for the 'c'. Such an extreme case of “fragmentation” of a text string into multiple events is unlikely in real life, but it is not impossible.

A typical reason that might cause a parser to report text nodes a bit at a time would be buffering of the XML input source. Most low-level parsers use a buffer of a certain number of characters that are read and parsed at a time. If a text node straddles such a buffer boundary, many parsers will just wrap up the current text event and start a new one to send characters from the next buffer. If you don’t account for this behavior in your SAX handlers, you may run into very obscure and hard-to-reproduce bugs. Even if the parser you usually use does combine text nodes for you, you never know when you may want to run your code in a situation where a different parser is selected. You’d need to write logic to accommodate the possibility, which can be rather cumbersome when mixed into typical SAX-style state machine logic.
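To see how a parser actually chunks text, you can record every characters call while feeding the document in pieces. The following is a minimal Python 3 sketch, assuming the default parser returned by make_parser supports the incremental feed and close methods (the bundled Expat reader does); the names ChunkRecorder and text_chunks are hypothetical. How many chunks you get depends on the parser and its buffering, so the only thing the sketch relies on is that the chunks join back into the original text.

```python
from xml import sax
from xml.sax.handler import ContentHandler

class ChunkRecorder(ContentHandler):
    """Remember each piece of text exactly as the parser reports it."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

def text_chunks(pieces):
    """Feed the byte strings in pieces to a parser, one at a time,
    and return the list of text chunks that were reported."""
    recorder = ChunkRecorder()
    parser = sax.make_parser()
    parser.setContentHandler(recorder)
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return recorder.chunks

# the text node straddles the boundary between the two pieces
chunks = text_chunks([b'<doc>contiguous ', b'text</doc>'])
print(len(chunks), repr(''.join(chunks)))
```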

The class text_normalize_filter presented in this recipe ensures that all text events are reported to downstream SAX handlers in the contiguous manner that most developers would expect. In this recipe’s example case, the filter would consolidate the three characters events into a single one for the entire text node 'abc'.
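The consolidation can be checked without involving a parser at all, by sending events straight into a filter. The sketch below is a Python 3 adaptation rather than the recipe's own code: TextNormalizeFilter and Recorder are hypothetical names, the filter takes only a parent (as XMLFilterBase does) and forwards through whatever handler was set with setContentHandler, and comment is left out of the wrapped methods because XMLFilterBase defines no such method.

```python
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase

class TextNormalizeFilter(XMLFilterBase):
    """Merge contiguous text events into a single characters call."""
    def __init__(self, parent=None):
        super().__init__(parent)
        self._accumulator = []

    def _complete_text_node(self):
        # flush any pending text as one event to the content handler
        if self._accumulator:
            XMLFilterBase.characters(self, ''.join(self._accumulator))
            self._accumulator = []

    def characters(self, content):
        self._accumulator.append(content)

    def ignorableWhitespace(self, whitespace):
        self._accumulator.append(whitespace)

def _wrap_complete(method_name):
    base_method = getattr(XMLFilterBase, method_name)
    def method(self, *args, **kwargs):
        # any non-text event ends the current text node
        self._complete_text_node()
        base_method(self, *args, **kwargs)
    method.__name__ = method_name
    setattr(TextNormalizeFilter, method_name, method)

for name in ('startElement', 'startElementNS', 'endElement',
             'endElementNS', 'processingInstruction', 'endDocument'):
    _wrap_complete(name)

class Recorder(ContentHandler):
    """Downstream handler that simply logs the events it receives."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, content):
        self.events.append(('text', content))

recorder = Recorder()
text_filter = TextNormalizeFilter()
text_filter.setContentHandler(recorder)
# simulate a parser that fragments 'abc' into three events
text_filter.startElement('doc', {})
for piece in 'abc':
    text_filter.characters(piece)
text_filter.endElement('doc')
print(recorder.events)
# → [('start', 'doc'), ('text', 'abc'), ('end', 'doc')]
```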

Tip

For more information on SAX filters in general, see my article “Tip: SAX filters for flexible processing,” http://www-106.ibm.com/developerworks/xml/library/x-tipsaxflex.html.

Python’s XMLGenerator does not do anything with processing instructions, so, if you run the main code presented in this recipe on an XML document that uses them, you’ll have a gap in the output, along with other minor deviations between input and output. Comments are similar but worse, because XMLFilterBase does not even filter them; if you do need to get comments, your text_normalize_filter class must multiply inherit from xml.sax.saxlib.LexicalHandler, as well as from xml.sax.saxutils.XMLFilterBase, and it must override the parse method as follows:

    def parse(self, source):
        # force connection of self as the lexical handler
        # (property_lexical_handler comes from module xml.sax.handler)
        self._parent.setProperty(property_lexical_handler, self)
        # Delegate to XMLFilterBase for the rest
        XMLFilterBase.parse(self, source)

This code is hairy enough, using the “internal” attribute self._parent, and the need to deal properly with XML comments is rare enough, to make this addition somewhat doubtful, which is why it is not part of this recipe’s Solution.

If you need ease of chaining to other filters, you may prefer not to take both upstream and downstream parameters in __init__. In this case, keep the same signature as XMLFilterBase.__init__:

    def __init__(self, parent):
        XMLFilterBase.__init__(self, parent)
        self._accumulator = [  ]

and change the _wrap_complete factory function so that the wrapper, rather than calling methods on the downstream handler directly, delegates to the default implementations in XMLFilterBase, which in turn call out to handlers that have been set on the filter with such methods as setContentHandler and the like:

def _wrap_complete(method_name):
    def method(self, *a, **k):
        self._complete_text_node( )
        getattr(XMLFilterBase, method_name)(self, *a, **k)
    # 2.4 only: method.__name__ = method_name
    setattr(text_normalize_filter, method_name, method)

This is slightly less convenient for the typical simple case, but it pays back this inconvenience by letting you easily chain filters:

parser = sax.make_parser( )
filtered_parser = text_normalize_filter(some_other_filter(parser))

as well as letting you use a filter in contexts that call the parse method on your behalf:

doc = xml.dom.minidom.parse(input_file, parser=filtered_parser)

See Also

Library Reference and Python in a Nutshell document the built-in XML support in the Python Standard Library.

12.11. Using MSHTML to Parse XML or HTML

Credit: Bill Bell

Problem

Your Python application, running on Windows, needs to use the Microsoft MSHTML COM component, which is also the parser that Microsoft Internet Explorer uses to parse HTML and XML web pages.

Solution

As usual, PyWin32 lets our Python code access COM quite simply:

from win32com.client import Dispatch
html = Dispatch('htmlfile')    # the disguise for MSHTML as a COM server
html.writeln( "<html><head><title>A title</title>"
     "<meta name='a name' content='page description'></head>"
     "<body>This is some of it. <span>And this is the rest.</span>"
     "</body></html>" )
print "Title: %s" % (html.title,)
print "Bag of words from body of the page: %s" % (html.body.innerText,)
print "URL associated with the page: %s" % (html.url,)
print "Display of name:content pairs from the metatags: "
metas = html.getElementsByTagName("meta")
for m in xrange(metas.length):
    print "\t%s: %s" % (metas[m].name, metas[m].content,)

Discussion

While Python offers many ways to parse HTML or XML, as long as you’re running your programs only on Windows, MSHTML is very speedy and simple to use. As the recipe shows, you can simply use the writeln method of the COM object to feed the page into MSHTML and then you can use the methods and properties of the components to get at all kinds of aspects of the page’s DOM. Of course, you can get the string of markup and text to feed into MSHTML in any way that suits your application, such as by using the Python Standard Library module urllib if you’re getting a page from some URL.

Since the enriched DOM that MSHTML makes available is large and complicated, I suggest you experiment with it in the PythonWin interactive environment that comes with PyWin32. The strength of PythonWin for such exploratory tasks is that it displays all of the properties and methods made available by each interface.

See Also

A detailed reference to MSHTML, albeit oriented to Visual Basic and C# users, can be found at http://www.xaml.net/articles/type.asp?o=MSHTML.



[1] RSS (Really Simple Syndication)

[2] FOAF (Friend of a Friend)

[3] OPML (Outline Processor Markup Language)
