Chapter 15. Using Python for XML

XML has exploded in popularity over the past few years as a medium for storing and transmitting structured data. Python supports the wealth of standards that have sprung up around XML, either through standard libraries or a number of third-party libraries.

In this chapter you learn to:

  • Create and manipulate XML.

  • Validate XML.

  • Work with some of the standard libraries that come bundled with Python.

What Is XML?

The term XML is bandied around in corporate boardrooms and meetings around the world. Its flexibility and extensibility have encouraged people to think big, advocating XML for everything from a new, formatting-independent semantic code storage mechanism to a replacement for object serialization. But beyond the buzzwords and hype, what is it, really? Is it a panacea for the world's woes? Probably not. But it is a powerful, flexible, open-standards–based method of data storage. Its vocabulary is infinitely customizable to fit whatever kind of data you want to store. Its format makes it human readable, while remaining easy to parse for programs. It encourages semantic markup, rather than formatting-based markup, separating content and presentation from each other, so that a single piece of data can be repurposed many times and displayed in many ways.

A Hierarchical Markup Language

At the core of XML is a simple hierarchical markup language. Tags are used to mark off sections of content with different semantic meanings, and attributes are used to add metadata about the content.

Following is an example of a simple XML document that could be used to describe a library:

<?xml version="1.0"?>
<library owner="John Q. Reader">
  <book>
    <title>Sandman Volume 1: Preludes and Nocturnes</title>
    <author>Neil Gaiman</author>
  </book>
  <book>
    <title>Good Omens</title>
    <author>Neil Gaiman</author>
    <author>Terry Pratchett</author>
  </book>
  <book>
    <title>"Repent, Harlequin!" Said the Tick-Tock Man</title>
    <author>Harlan Ellison</author>
  </book>
</library>

Notice that every piece of data is wrapped in a tag and that tags are nested in a hierarchy that contains further information about the data it wraps. Based on the previous document, you can surmise that <author> is a child piece of information for <book>, as is <title>, and that a library has an attribute called owner.

Unlike semantic markup languages like LaTeX, every piece of data in XML must be enclosed in tags. The top-level tag is known as the document root, which encloses everything in the document. An XML document can have only one document root.
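To make the hierarchy concrete, here is a minimal sketch that loads an abbreviated copy of the library document with the standard library's xml.etree.ElementTree module and walks from the document root down to the titles and authors. The variable names (LIBRARY_XML, books) are chosen for this illustration only.

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the library document, held in a string for convenience.
LIBRARY_XML = """<?xml version="1.0"?>
<library>
  <book>
    <title>Sandman Volume 1: Preludes and Nocturnes</title>
    <author>Neil Gaiman</author>
  </book>
  <book>
    <title>Good Omens</title>
    <author>Neil Gaiman</author>
    <author>Terry Pratchett</author>
  </book>
</library>
"""

# fromstring() returns the document root, the single element that
# encloses everything else in the document.
root = ET.fromstring(LIBRARY_XML)
print(root.tag)  # library

# Each <book> is a child of the root; <title> and <author> are children of <book>.
books = []
for book in root:
    title = book.find("title").text
    authors = [author.text for author in book.findall("author")]
    books.append((title, authors))
    print(title, "by", ", ".join(authors))
```

Iterating over an element yields its direct children, which is why the loop sees only the book elements and not the titles or authors nested inside them.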

Just before the document root is the XML declaration: <?xml version="1.0"?>. This declaration, which is optional in XML 1.0 but strongly recommended, lets the processor know that this is an XML document and which version of XML it uses. As of the writing of this book, there are two versions of XML: 1.0 and 1.1. Because version 1.1 is not fully supported yet, the examples here concentrate on version 1.0.

One problem with semantic markup is the possibility for confusion as data changes contexts. For instance, you might want to ship a list of book titles off to a database about authors. However, without a human to look at it, the database has no way of knowing that <title> means a book title, as opposed to an editor's business title or an author's honorific. This is where namespaces come in. A namespace is used to provide a frame of reference for tags and is given a unique ID in the form of a URL, plus a prefix to apply to tags from that namespace. For example, you might create a library namespace, with an identifier of http://server.domain.tld/NameSpaces/Library and with a prefix of lib: and use that to provide a frame of reference for the tags. With a namespace, the document would look like this:

<?xml version="1.0"?>
<lib:library owner="John Q. Reader"
        xmlns:lib="http://server.domain.tld/NameSpaces/Library">
  <lib:book>
    <lib:title>Sandman Volume 1: Preludes and Nocturnes</lib:title>
    <lib:author>Neil Gaiman</lib:author>
  </lib:book>
  <lib:book>
    <lib:title>Good Omens</lib:title>
    <lib:author>Neil Gaiman</lib:author>
    <lib:author>Terry Pratchett</lib:author>
  </lib:book>
  <lib:book>
    <lib:title>"Repent, Harlequin!" Said the Tick-Tock Man</lib:title>
    <lib:author>Harlan Ellison</lib:author>
  </lib:book>
</lib:library>

It's now explicit that the title element comes from a set of elements defined by a library namespace, and can be treated accordingly.

A namespace declaration can be added to any node in a document, and that namespace will be available to every descendant node of that node. In most documents, all namespace declarations are applied to the root element of the document, even if the namespace isn't used until deeper in the document. In this case, the namespace is applied to every tag in the document, so the namespace declaration must be on the root element.
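As a rough illustration of how a parser treats namespaced tags, the following sketch reads a shortened version of the namespaced library with xml.etree.ElementTree. The short prefix l used in the query is an arbitrary choice for this example; it shows that only the namespace URI has to match, not the prefix written in the document.

```python
import xml.etree.ElementTree as ET

LIB_NS = "http://server.domain.tld/NameSpaces/Library"

doc = """<?xml version="1.0"?>
<lib:library xmlns:lib="http://server.domain.tld/NameSpaces/Library">
  <lib:book>
    <lib:title>Good Omens</lib:title>
    <lib:author>Neil Gaiman</lib:author>
    <lib:author>Terry Pratchett</lib:author>
  </lib:book>
</lib:library>
"""

root = ET.fromstring(doc)

# ElementTree replaces the prefix with the namespace URI in braces.
print(root.tag)  # {http://server.domain.tld/NameSpaces/Library}library

# Queries take a prefix-to-URI mapping; the prefix here is local to the query.
ns = {"l": LIB_NS}
authors = [a.text for a in root.findall("l:book/l:author", ns)]
print(authors)  # ['Neil Gaiman', 'Terry Pratchett']
```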

A document can have and use multiple namespaces. For instance, the preceding example library might use one namespace for library information and a second one to add publisher information.

Notice the xmlns: prefix for the namespace declaration. The xml: and xmlns: prefixes are reserved by XML itself, and other prefixes, such as xsl: for XSLT, are used so conventionally by XML's associated languages that you should avoid reusing them for your own namespaces.

This is a fairly simple document. A more complex document might contain CDATA sections for storing unprocessed data, comments, and processing instructions for storing information specific to a single XML processor. For more thorough coverage of the subject, you may want to visit http://w3cschools.org or pick up Wrox Press's Beginning XML, 3rd Edition (9780764570773) by David Hunter et al.

A Family of Standards

XML is more than just a way to store hierarchical data. If that were all there were to it, XML would quickly fall to more lightweight data storage methods that already exist. XML's big strength lies in its extensibility and in its companion standards: XSLT, XPath, the Schema and DTD languages, and a host of others for querying, linking, describing, displaying, and manipulating data. Schemas and DTDs provide a way for describing XML vocabularies and a way to validate documents. XSLT provides a powerful transformation engine to turn one XML vocabulary into another, or into HTML, plaintext, PDF, or a host of other formats. XPath is a query language for describing XML node sets. XSL-FO provides a way to create XML that describes the format and layout of a document for transformation to PDF or other visual formats.

Another good thing about XML is that most of the tools for working with XML are also written in XML, and can be manipulated using the same tools. XSLTs are written in XML, as are schemas. What this means in practical terms is that it's easy to use an XSLT to write another XSLT or a schema, or to validate XSLTs or schemas using schemas.

What Is a Schema/DTD?

Schemas and DTDs (Document Type Definitions) are both ways of implementing document models. A document model is a way of describing the vocabulary and structure of a document. It's somewhat akin to what a DBA does when creating a database. You define the data elements that will be present in your document, what relationship they have to one another, and how many of them you expect. In plain English, a document model for the previous XML example might read as follows: "A library is a collection of books with a single owner. Each book has a title and at least one author."

DTDs and schemas have different ways of expressing this document model, but they both describe the same basic formula for the document. Subtle differences exist between the two, as you see later, but they have roughly the same capabilities.

What Are Document Models For?

Document models are used when you want to be able to validate content against a standard before manipulating or processing it. They are useful whenever you will be interchanging data with an application that may change data models unexpectedly, or when you want to constrain what a user can enter, as in an XML-based documentation system where you will be working with hand-created XML rather than with something from an application.

Do You Need One?

In some applications, a document model might not be needed. If you control both ends of the data exchange and can predict what elements you are going to be receiving, a document model would be redundant.

Document Type Definitions

A DTD is a Document Type Definition. DTDs were the original method of expressing a document model and are ubiquitous throughout the Internet. DTDs were originally created for describing SGML, and the syntax has barely changed since that time, so they have had quite a while to proliferate. The W3C (the World Wide Web Consortium, one of the groups that brings standards to the Internet) continues to express document types using DTDs, so DTDs exist for each of the HTML standards, for Scalable Vector Graphics (SVG), MathML, and for many other useful XML vocabularies.

An Example DTD

If you were to translate the English description of the example library XML document into a DTD, it might look something like the following:

<?xml version="1.0"?>
<!ELEMENT library (book+)>
<!ATTLIST library
          owner CDATA #REQUIRED
>
<!ELEMENT book (title, author+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>

To add a reference to this DTD in the library file discussed before, you would insert a line at the top of the file after the XML declaration that read <!DOCTYPE library SYSTEM "library.dtd">, where library is the name of the document's root element and library.dtd is the path to the DTD on your system.

Let's break this down, one step at a time. The first line, <?xml version="1.0"?>, tells you that this is going to be an XML document. Technically, this line is optional; DTDs don't behave like other XML documents, but we'll get to that later. The next line, <!ELEMENT library (book+)>, tells you that there is an element known as library, which can have one or more child elements of the book type. The syntax for element frequencies and grouping in DTDs is terse, but similar to that of regular expressions. The following table lists element frequency and element grouping operators in DTDs.

Operator    Definition
?           Specifies zero or one of the preceding elements. For instance, editor? would mean that a book could have an optional editor element.
+           Specifies one or more of the preceding elements. As in the previous example, author+ means that a book has one or more authors.
,           Specifies a sequence of elements that must occur in that order. (title, author+) means that the book must have a title, followed by one or more authors, in that order.
(list)      Groups elements together. An operator applied after parentheses applies to the group as a whole. For instance, (author, editor)+ would mean that a book could have one or more author elements, each followed by an editor element.
|           Or operator. This operator permits a choice between alternatives. As an example, (author | editor) would permit a book to have an author or an editor, but not both.
*           Specifies that zero or more of the preceding elements or group can appear. (book | CD)* would permit the library to have any number of books and CDs in it, in any order, or none at all.
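Because the operators above mirror regular-expression syntax, the content model (title, author+) can be imitated with Python's re module. The sketch below is only an analogy, not a DTD validator: it checks the order and number of child elements and ignores attributes and text. The names BOOK_MODEL and matches_book_model are made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# A regular expression standing in for the content model (title, author+).
# Each child element is written as its tag name followed by a space.
BOOK_MODEL = re.compile(r"(title )(author )+")

def matches_book_model(book):
    """Return True if book's children are one title followed by one or more authors."""
    child_names = "".join(child.tag + " " for child in book)
    return BOOK_MODEL.fullmatch(child_names) is not None

good = ET.fromstring(
    "<book><title>Good Omens</title>"
    "<author>Neil Gaiman</author><author>Terry Pratchett</author></book>"
)
no_author = ET.fromstring("<book><title>Good Omens</title></book>")
wrong_order = ET.fromstring(
    "<book><author>Neil Gaiman</author><title>Good Omens</title></book>"
)

print(matches_book_model(good))         # True
print(matches_book_model(no_author))    # False: author+ needs at least one author
print(matches_book_model(wrong_order))  # False: the comma fixes the order
```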

The next bit is a little more complex:

<!ATTLIST library
          owner CDATA #REQUIRED
>

The first line specifies that the library element has a list of attributes. Notice that the attribute list is separate from the library element declaration itself and linked to it by the element name. If the element name changes, the attribute list must be updated to point to the new element name. Next is a list of attributes for the element. In this case, library has only one attribute, but the list can contain an unbounded number of attributes. The attribute declaration has three mandatory elements: an attribute name, an attribute type, and an attribute description. An attribute type can either be a data type, as specified by the DTD specification, or a list of allowed values. The attribute description is used to specify the behavior of the attribute. A default value can be described here, and whether the attribute is optional or required.

DTDs Aren't Exactly XML

As a holdover from SGML, DTDs are technically not exactly XML. Unlike schemas, they are difficult to manipulate and validate using the same tools as XML. If you apply a document type declaration at the beginning of a DTD, your parser will either ignore it or, more likely, generate a syntax error. Although there is a specification for creating DTDs, there is no document model in the form of a DTD for validating the structure of a DTD. Tools exist for validating DTDs, but they are distinct from the tools used to validate XML. On the other hand, there is a document model in the form of a schema against which schemas can be validated using standard XML tools.

Limitations of DTDs

DTDs have a number of limitations. Although it is possible to express complex structures in DTDs, they become very difficult to maintain. DTDs have difficulty cleanly expressing numeric bounds on a document model. If you wanted to specify that a library could contain no more than 100 books, you could write <!ELEMENT library (book?, book?, book?, book? etc etc)>, but that quickly becomes an unreadable morass of code. DTDs also make it hard to permit a number of elements in any order. If you have three elements that you could receive in any order, you have to write <!ELEMENT book ( (author, ((title, publisher) | (publisher, title))) | (title, ((author, publisher) | (publisher, author))) | (publisher, ((author, title) | (title, author))) )>, which is beginning to look more like LISP (which is a language with a lot of parentheses) than XML and is far more complicated than it really should be. Finally, DTDs don't permit you to specify a pattern for data, so you can't express constructs such as "A telephone number should be composed of digits, dashes, and plus signs." Thankfully, the W3C has published a specification for a slightly more sophisticated language for describing documents, known as Schema.

Schemas

Schema was designed to address some of the limitations of DTDs and provide a more sophisticated XML-based language for describing document models. It enables you to cleanly specify numeric models for content, describe character data patterns using regular expressions, and express content models such as sequences, choices, and unrestricted models.

An Example Schema

If you wanted to translate the hypothetical library model into a schema with the same information contained in the DTD, you would wind up with something like the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="book" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="owner" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>
</xs:schema>

This expresses exactly the same data model as the DTD, but some differences are immediately apparent.

Schemas Are Pure XML

To begin with, this document's top-level node contains a namespace declaration, specifying that all tags starting with xs: belong to the namespace identified by the URI "http://www.w3.org/2001/XMLSchema". For practical purposes, this means that you now have a document model that you can validate your schema against, using the same tools you would use to validate any other XML document.
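One practical consequence of schemas being ordinary XML is that any XML parser can read them. The sketch below does not validate anything (the standard library has no schema validator); it only loads the example schema with xml.etree.ElementTree and lists each declared element with its maxOccurs value. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

SCHEMA = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="library">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="book" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="title" type="xs:string"/>
            <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="owner" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>
</xs:schema>
"""

schema_root = ET.fromstring(SCHEMA)

# Visit every xs:element declaration, however deeply nested, and record its
# name and maxOccurs value (Schema treats a missing maxOccurs as 1).
declared = {
    element.get("name"): element.get("maxOccurs", "1")
    for element in schema_root.iter("{%s}element" % XS)
}
print(declared)
```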

Schemas Are Hierarchical

Next, notice that the preceding document has a hierarchy very similar to the document it is describing. Rather than create individual elements and link them together using references, the document model mimics the structure of the document as closely as possible. You can also create global elements and then reference them in a structure, but you are not required to use references; they are optional. This creates a more intuitive structure for visualizing the form of possible documents that can be created from this model.

Other Advantages of Schemas

Finally, schemas support attributes such as maxOccurs, which takes either a nonnegative integer or the value unbounded, which expresses that any number of that element or grouping may occur. Although this schema doesn't illustrate it, schemas can express that an element's content matches a specific regular expression, using the pattern facet, and schemas can express more flexible content models by mixing the choice and sequence content models.

XPath

XPath is a language for describing locations and node sets within an XML document. Entire books have been written on it. However, the basics are fairly simple. An XPath expression contains a description of a pattern that a node must match. If the node matches, it is selected; otherwise, it is ignored. Patterns are composed of a series of steps, either relative to a context node or absolutely defined from the document root. An absolute path begins with a slash, a relative one does not, and each step is separated by a slash.

A step contains three parts: an axis that describes the direction to travel, a node test to select nodes along that axis, and optional predicates, which are Boolean (true or false) tests that a node must meet. An example step might be ancestor-or-self::book[1], where ancestor-or-self is the axis to move along, book is the node test, and [1] is a predicate specifying to select the first node that meets all the other conditions. If the axis is omitted, it is assumed to refer to the child axis for the current node, so library/book[1]/author[1] would select the first author of the first book in the library.

A node test can be a function as well as a node name. For instance, book/node() will return all child nodes of the selected book node, regardless of whether they are text or elements.

The following table describes a handful of shortcuts for axes.

Shortcut    Meaning
@           Specifies the attribute axis. This is an abbreviation for attribute::.
*           Specifies all child elements of the current node.
//          Specifies any descendant of the current node. This is an abbreviation for /descendant-or-self::node()/. If used at the beginning of an XPath, it matches nodes anywhere in the document.
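A few of these expressions can be tried with the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath (no axis names such as ancestor-or-self, for example). In this sketch the root library element is the context node, so the path from the text, library/book[1]/author[1], is written without its leading library step. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring("""
<library>
  <book>
    <title>Sandman Volume 1: Preludes and Nocturnes</title>
    <author>Neil Gaiman</author>
  </book>
  <book>
    <title>Good Omens</title>
    <author>Neil Gaiman</author>
    <author>Terry Pratchett</author>
  </book>
</library>
""")

# Child axis with positional predicates: first author of the first book.
first_author = library.find("book[1]/author[1]").text

# The // shortcut: every author element anywhere below the context node.
all_authors = [a.text for a in library.findall(".//author")]

# The * shortcut: all child elements of the second book.
second_book_children = [child.tag for child in library.findall("book[2]/*")]

# A predicate that tests a child's text: books with a given author.
pratchett_titles = [
    book.find("title").text
    for book in library.findall("book[author='Terry Pratchett']")
]

print(first_author)          # Neil Gaiman
print(all_authors)           # ['Neil Gaiman', 'Neil Gaiman', 'Terry Pratchett']
print(second_book_children)  # ['title', 'author', 'author']
print(pratchett_titles)      # ['Good Omens']
```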

For more thorough coverage of the subject, you may want to visit http://w3schools.org or pick up a book on XPath.

HTML as a Subset of XML

XML bears a striking resemblance to HTML. This isn't entirely by accident. XML and HTML both sprang from SGML and share a number of syntactic features. Earlier versions of HTML aren't directly compatible with XML, because XML requires that every tag be closed, and certain HTML tags don't require a closing tag, such as <br> and <img>. However, the W3C has defined XHTML, a reformulation of HTML as an XML vocabulary, in an attempt to bring the two standards in line with each other. XHTML can be manipulated using the same sets of tools as pure XML. However, Python also comes with specialized libraries designed specifically for dealing with HTML.

The HTML DTDs

The current version of HTML is 4.01, which includes 4.01 Transitional, 4.01 Strict, and 4.01 Frameset, specifically for dealing with frames.

HTMLParser

The HTMLParser class (found in the html.parser module in Python 3), unlike the older htmllib module, is not based on an SGML parser and can be used for both XHTML and earlier versions of HTML.
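As a brief sketch of how HTMLParser is typically used, you subclass it and override the handler methods you care about. The class name TagCollector and the sample markup are invented for this example. Note that the unclosed <br> tag, which would not be well-formed XML, is handled without complaint.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the start tags and the non-blank text found in some HTML."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, closed or not.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for runs of text between tags; skip whitespace-only runs.
        if data.strip():
            self.text.append(data.strip())

collector = TagCollector()
collector.feed(
    "<html><body><h1>Introduction</h1>"
    "<p>First line<br>second line</p></body></html>"
)
collector.close()

print(collector.tags)  # ['html', 'body', 'h1', 'p', 'br']
print(collector.text)  # ['Introduction', 'First line', 'second line']
```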

XML Libraries Available for Python

Python comes standard with a number of libraries designed to help you work with XML. You have your choice of several DOM (Document Object Model) implementations, an interface to the nonvalidating Expat XML parser, and several libraries for using SAX (the Simple API for XML).

The available DOM implementations are as follows:

  • xml.dom: The package that defines the DOM interfaces, node types, and exceptions that the implementations share

  • xml.dom.minidom: A lightweight implementation that does not cover the full DOM specification

What Is SAX?

When parsing XML, you have your choice of two different types of parsers: SAX and DOM. SAX stands for the Simple API for XML. Originally only implemented for Java, it was added to Python as of version 2.0. It is a stream-based, event-driven parser. The events are known as document events, and a document event might be the start of an element, the end of an element, encountering a text node, or encountering a comment. For example, the following simple document:

<?xml version="1.0"?>
<author>
  <name>Ursula K. LeGuin</name>
</author>

might fire the following events:

start document
start element: author
start element: name
characters:  Ursula K. LeGuin
end element: name
end element: author
end document

Whenever a document event occurs, the parser fires an event for the calling application to handle. More precisely, it fires an event for the calling application's Content Handler object to handle. Content Handlers are objects that implement a known interface specified by the SAX API from which the parser can call methods. In the preceding example, the parser would call the startDocument method of the content handler, followed by two calls to the startElement method, and so on.
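The event sequence above can be reproduced with a small content handler built on xml.sax. This is a minimal sketch: the class name EventRecorder is made up, and the handler joins character data and drops whitespace-only runs (the newlines and indentation between tags), which the parser would otherwise also report as characters events.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Record document events as human-readable strings."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def _flush_text(self):
        # Character data may arrive in several chunks, so join it first.
        text = "".join(self._text).strip()
        if text:
            self.events.append("characters: " + text)
        self._text = []

    def startDocument(self):
        self.events.append("start document")

    def endDocument(self):
        self.events.append("end document")

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append("start element: " + name)

    def endElement(self, name):
        self._flush_text()
        self.events.append("end element: " + name)

    def characters(self, content):
        self._text.append(content)

DOCUMENT = b"""<?xml version="1.0"?>
<author>
  <name>Ursula K. LeGuin</name>
</author>
"""

recorder = EventRecorder()
xml.sax.parseString(DOCUMENT, recorder)
for event in recorder.events:
    print(event)
```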

Stream-based

When parsing a document with SAX, the document is read and parsed in the order in which it appears. The parser opens the file or other datasource (such as a URL) as a stream of data (which means that it doesn't have to have it all at once) and then fires events whenever an element is encountered.

Because the parser does not wait for the whole document to load before beginning parsing, SAX can parse documents very soon after it starts reading the document. However, because SAX does not read the whole document before it starts firing events, it may process a partial document before discovering that the document is badly formed. SAX-based applications should implement error-checking for such conditions.
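The following sketch shows that situation with xml.sax: the handler has already received several events by the time the parser reaches the error. The class name StartTagLog and the broken sample document are invented for this illustration; a real application would decide here whether to discard or keep the partial results.

```python
import xml.sax

class StartTagLog(xml.sax.ContentHandler):
    """Remember the name of every element the parser has started."""

    def __init__(self):
        super().__init__()
        self.started = []

    def startElement(self, name, attrs):
        self.started.append(name)

# Badly formed on purpose: <title> is never closed.
BROKEN = b"<library><book><title>Good Omens</book></library>"

handler = StartTagLog()
error_message = None
try:
    xml.sax.parseString(BROKEN, handler)
except xml.sax.SAXParseException as exc:
    # The default error handler raises on fatal (well-formedness) errors.
    error_message = exc.getMessage()

# Events fired before the error was discovered have already been handled.
print(handler.started)  # ['library', 'book', 'title']
print(error_message)
```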

Event-driven

When working with SAX, document events are handled by event handlers, similar to a GUI. You declare callback functions for specific types of document events, which are then passed to the parser and called when a document event occurs that matches the callback function.

What Is DOM?

At the heart of DOM lies the Document object. This is a tree-based representation of the XML document. Tree-based models are a natural fit for XML's hierarchical structure, making this a very intuitive way of working with XML. Each element in the tree is called a Node object, and it may have attributes, child nodes, text, and so on, all of which are also objects stored in the tree. DOM objects have a number of methods for creating and adding nodes, for finding nodes of a specific type or name, and for reordering or deleting nodes.

In-memory Access

The major difference between SAX and DOM is the latter's ability to store the entire document in memory and manipulate and search it as a tree, rather than force you to parse the document repeatedly, or force you to build your own in-memory representation of the document. The document is parsed once, and then nodes can be added, removed, or changed in memory and then written back out to a file when the program is finished.
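A short sketch with xml.dom.minidom shows the parse-once, modify-in-memory, write-back cycle. The one-book document is a cut-down version of the earlier library example, and the variable names are illustrative.

```python
from xml.dom import minidom

# Parse once; the whole document now lives in memory as a tree.
document = minidom.parseString(
    "<library><book><title>Good Omens</title></book></library>"
)

# Search the tree by element name.
titles = [node.firstChild.data for node in document.getElementsByTagName("title")]
print(titles)  # ['Good Omens']

# Add a new node: nodes are created by the Document, then attached to a parent.
book = document.getElementsByTagName("book")[0]
author = document.createElement("author")
author.appendChild(document.createTextNode("Terry Pratchett"))
book.appendChild(author)

# Serialize the modified tree; this string could be written back to a file.
result = document.documentElement.toxml()
print(result)
```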

Why Use SAX or DOM

Although either SAX or DOM can do almost anything you might want to do with XML, you might want to use one over the other for a given task for several reasons. For instance, if you are working on an application in which you will be modifying an XML document repeatedly based on user input, you might want the convenient random access capabilities of DOM. On the other hand, if you're building an application that needs to process a stream of XML quickly with minimal overhead, SAX might be a better choice for you. Following are some of the advantages and disadvantages you might want to be aware of when architecting your application to use XML.

Capability Trade-Offs

DOM is architected with random access in mind. It provides a tree that can be manipulated at runtime and needs to be loaded into memory only once. SAX is stream-based so data comes in as a stream one character after the next, but the document isn't seen in its entirety before it starts getting processed; therefore, if you want to randomly access data, you have to either build a partial tree of the document in memory based on document events, or reparse the document every time you want a different piece of data.

Most people find the object-oriented behavior of DOM very intuitive and easy to learn. The event-driven model of SAX is more similar to functional programming and can be more challenging to get up to speed on.

Memory Considerations

If you are working in a memory-limited environment, DOM is probably not the right choice. Even on a fairly high-end system, constructing a DOM tree for a 2 or 3 MB XML document can bring the computer grinding to a halt while it processes. Because SAX treats the document as a stream, it never loads the whole document into memory, so it is preferable if you are memory constrained or working with very large documents.

Speed Considerations

Using DOM requires a great deal of up-front processing time while the document tree is being built, but once the tree is built DOM allows for much faster searching and manipulation of nodes because the entire document is in memory. SAX is somewhat fast for searching documents, but not as efficient for their manipulation. However, for document transformations, SAX is considered to be the parser of choice because the event-driven model is fast and very compatible with how XSLT works.

SAX and DOM Parsers Available for Python

The following Python SAX and DOM parsers are available: xml.sax and xml.dom.minidom. They each behave a bit differently, so here is an overview of each of them.

xml.sax

xml.sax is the built-in SAX package that comes with Python. It uses the Expat nonvalidating parser by default, but its make_parser() function can be passed a list of parser module names to try instead.

xml.dom.minidom

xml.dom.minidom is a lightweight DOM implementation, designed to be simpler and smaller than a full DOM implementation.

Intro to XSLT

XSLT stands for Extensible Stylesheet Language Transformations. Used for transforming XML into output formats such as HTML, it is a declarative, template-driven language.

XSLT Is XML

Like a Schema, XSLT is defined in terms of XML, and it's being used to supplement the capabilities of XML. XSLT elements are identified by the namespace "http://www.w3.org/1999/XSL/Transform". Because a stylesheet is itself XML, it can be validated like any other XML document.

Transformation and Formatting Language

XSLT is used to transform one XML syntax into another or into any other text-based format. It is often used to transform XML into HTML in preparation for web presentation or a custom document model into XSL-FO for conversion into PDF.

Functional, Template-Driven

XSLT is a functional language, much like LISP. The XSLT programmer declares a series of templates, which are functions triggered when a node in the document matches an XPath expression. The programmer cannot guarantee the order of execution, so each function must stand on its own and make no assumptions about the results of other functions.

Python doesn't directly supply a way to create an XSLT, unfortunately. To transform XML documents, an XSLT must be created, and then it can be applied via Python to the XML.

In addition, Python's core libraries don't supply a method for transforming XML via XSLT, but a couple of different options are available from other libraries. Fourthought, Inc., offers an XSLT engine as part of its freely available 4Suite package, which unfortunately, at the time of this writing, does not support Python 3.0. However, there are also Python bindings for the widely popular libxslt C library, in particular lxml.

What Is lxml?

The following examples use the latest version of lxml, which, as of this writing, is 2.2. If you don't have it installed, please download it from http://pypi.python.org/pypi/lxml/. You will need it to complete the exercises later in this chapter.

lxml is a Python binding that combines the speed and rich features of the libxml2 and libxslt libraries with a simple API that allows you to work with both HTML and XML. The package uses the ElementTree API with a few twists, trying to make coding with libxml2 less error prone.

Importing lxml is fairly simple:

>>> import lxml
>>> from lxml import etree

Element Classes

Elements are the primary container objects for the ElementTree API, providing the core of your XML tree functionality. They behave like lists of their child elements, although they are not literally Python lists. They are capable of having attributes and containing text, which we discuss in a bit. Let's first learn to create an element. Type in the following:

>>> author = etree.Element("Horror")
>>> print(author.tag)
Horror

In this example, we created a new element, bound it to the variable author, and gave it the tag name Horror. We then used the print() function to print out the name via the element's tag property. Elements follow your standard XML tree hierarchy, and therefore support both parent and child elements.

Let's say that we wanted author to be the root element. We gave author the element tag name of "horror," and now we would like to add a group of horror writers to the author element class. These new elements will now become children of our horror element.
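A minimal sketch of adding children follows. The child tag names (writer1 and so on) are placeholders invented for this illustration. It imports the standard library's xml.etree.ElementTree under the name etree so that it runs without lxml installed; the Element and SubElement calls used here are also available from lxml's etree module.

```python
import xml.etree.ElementTree as etree  # with lxml installed: from lxml import etree

author = etree.Element("Horror")

# SubElement creates a new element and appends it to the given parent.
# The tag names below are placeholders, not names from the original text.
for name in ("writer1", "writer2", "writer3"):
    etree.SubElement(author, name)

# The parent now behaves like a list of its children.
print(len(author))                      # 3
print(author[0].tag)                    # writer1
print([child.tag for child in author])  # ['writer1', 'writer2', 'writer3']
```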

Adding Text to Elements

In addition to attributes, we can also add text to our elements. When dealing with XML documents that are primarily data-driven, the only place you can place text is within the element. To do so is quite simple. In our next sample, we create a tree resembling a basic HTML document:

>>> html=etree.Element("html")
>>> body=etree.SubElement(html,"body")
>>> h1=etree.SubElement(body, "h1")
>>> h1.text="Introduction"
>>> paragraph=etree.SubElement(body, "p")
>>> paragraph.text="Here is some text"
>>> etree.tostring(html)
b'<html><body><h1>Introduction</h1><p>Here is some text</p></body></html>'
>>> etree.tostring(paragraph)
b'<p>Here is some text</p>'

The preceding example is an excellent way to showcase parent-child relationships. Even though etree.tostring prints the results out on a single line, perhaps it would be better to view it as an actual tree structure:

<html>
       <body>
               <h1>Introduction</h1>
               <p>Here is some text</p>
       </body>
</html>

As you can see, <html> is the parent, <body> is the child of <html>, and <h1> and <p> are both siblings, whose parent is <body>.

In this scenario, the text property allows us to display content that the user could view, while any attributes we added would provide data about the elements themselves.
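To make the contrast between text and attributes concrete, here is a small sketch of setting and reading attributes on an element. The attribute names and values (id, class, intro, summary) are invented for this illustration. As before, the standard library's xml.etree.ElementTree stands in for lxml so the example runs without extra packages; the calls shown exist in both.

```python
import xml.etree.ElementTree as etree  # with lxml installed: from lxml import etree

# Attributes can be given as keyword arguments when the element is created...
paragraph = etree.Element("p", id="intro")
paragraph.text = "Here is some text"

# ...or added afterward with set().
paragraph.set("class", "summary")

# get() reads one attribute and accepts a default for missing ones.
print(paragraph.get("id"))              # intro
print(paragraph.get("lang", "not set")) # not set

# attrib exposes all attributes through a dictionary-like interface.
attributes = dict(paragraph.attrib)
print(attributes)

# The text is stored separately from the attributes.
print(paragraph.text)                   # Here is some text
```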

One last thing about the preceding code. You will notice that we used etree.tostring not only to print out the entire contents of html, but also to home in on the contents of paragraph specifically. This is a great method to see what a given element contains, but there are times when we do not wish to see the tags. What if we wanted to just see the text of an element, if there was any? For that, we could do the following:

>>> etree.tostring(paragraph, method="text")
b'Here is some text'

Parsing with lxml

Parsing with lxml is pretty straightforward. There are three parser functions to choose from, each with its own benefits and pitfalls. Each supports the parsing of a particular type of object, such as files, strings, and URLs (both the HTTP and FTP variety), with the simplest being our string parser: fromstring().

Of all the parser functions lxml has to offer, fromstring() is the easiest to use:

>>> sentence="<info>Here is a sentence</info>"
>>> info=etree.fromstring(sentence)
>>> print(info.tag)
info
>>> print(info.text)
Here is a sentence

In this code, we begin by assigning our variable sentence with an open and closed <info> tag, with some text nested in-between. We then create another variable, info, and use the etree.fromstring() function to parse the data in the sentence.

Another method of parsing is to use the XML() function, which is similar to fromstring(), but differs by writing XML literals straight to the source, like so:

>>> info=etree.XML("<info>Here is a sentence</info>")
>>> print(info.tag)
info
>>> print(info.text)
Here is a sentence
>>> etree.tostring(info)
b'<info>Here is a sentence</info>'

Here, we've skipped the initial step of creating and assigning data to the sentence variable, and instead used the XML() function to assign the data straight to info, saving us a step. When we print the tag and text from info, we get the same result as before.

Parsing Files

So far you have learned to parse simple strings. To truly understand the power of lxml parsing however, you need to learn to work with files and file-like objects, including URLs, objects with a .read method, and file name strings.

Unlike our other two parser functions, the parse() function returns an ElementTree object, instead of an Element object. This allows us to parse entire documents, and not just simple XML fragments:

>>> import io
>>> newsentence=io.StringIO("<info>This is another sentence</info>")
>>> somesentence=etree.parse(newsentence)
>>> etree.tostring(somesentence)
b'<info>This is another sentence</info>'

If you want to access the value in somesentence with a print() function, you can do so in the following manner:

>>> printit=somesentence.getroot()
>>> print(printit.tag)
info
>>> print(printit.text)
This is another sentence

This only scratches the surface of what you can achieve with lxml. A complete coverage of the subject would easily encompass two books, which, unfortunately, we do not have space for here. For more information, you can visit the documentation for the module here: http://codespeak.net/lxml/index.html.

Summary

The key things to take away from this chapter are:

  • How to parse XML using both SAX and DOM

  • How to validate XML against a DTD or schema

  • How to parse HTML using HTMLParser

  • How to work with lxml

In Chapter 16, you learn more about network programming and e-mail. Before proceeding, however, try the exercises that follow to test your understanding of the material covered in this chapter. You can find the solutions to these exercises in Appendix A.

Exercises

  1. Given the following configuration file for a Python application, write some code to extract the configuration information using a DOM parser:

    <?xml version="1.0"?>
    <!DOCTYPE config SYSTEM "configfile.dtd">
    <config>
      <utilitydirectory>/usr/bin</utilitydirectory>
      <utility>grep</utility>
      <mode>recursive</mode>
    </config>

  2. Given the following DTD, named configfile.dtd, write a Python script to validate the previous configuration file:

    <!ELEMENT config  (utilitydirectory, utility, mode)>
    <!ELEMENT utilitydirectory    (#PCDATA)*>
    <!ELEMENT utility    (#PCDATA)*>
    <!ELEMENT mode  (#PCDATA)*>

  3. Use SAX to extract configuration information from the preceding config file instead of DOM.
