Many Python programmers consider working with XML to be a right royal pain. However, XML is extremely popular for a wide range of purposes. Python programs frequently have to interact with it, both as a consumer that needs to interpret XML data from another source and as a producer that needs to create XML data for other programs or computers to parse.
Python includes three well-documented libraries for interacting with XML documents in its standard library. Two are based on traditional XML parsing techniques, while the third is a nice Pythonic interface.
The SAX (Simple API for XML) library is an event-driven system that calls specific functions when specific objects are encountered in the byte stream: opening and closing tags, attributes, and contents. It can be unwieldy to work with, but has the advantage of parsing XML documents "on the fly" without having to load the entire stream into memory. This is useful for huge documents.
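As a rough illustration of the event-driven style, here is a minimal sketch using the standard library's xml.sax module. The TagCounter class, its attribute names, and the tiny sample document are made up for this example; only ContentHandler, startElement, characters, and parseString come from the library.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts opening tags and collects text."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_chunks = []

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters.
        self.tag_counts[name] = self.tag_counts.get(name, 0) + 1

    def characters(self, content):
        # Called for text content; the parser may deliver one text node
        # in several pieces, so we simply collect non-blank chunks.
        if content.strip():
            self.text_chunks.append(content.strip())


handler = TagCounter()
xml.sax.parseString(b"<list><item>one</item><item>two</item></list>", handler)
print(handler.tag_counts)
print(handler.text_chunks)
```

The handler never sees the document as a whole; it only reacts to events as they arrive, which is why this style scales to very large inputs.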
The DOM (Document Object Model) library takes a different approach. It allows any part of the document to be accessed at any time, and treats the document like a tree of connected nodes. Each node represents an element, attribute, or text in the document. Elements can have child elements and each can be accessed randomly. It allows XML documents to be both read and written as well as modified by dynamically adding new nodes to the tree.
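For comparison, the following sketch uses xml.dom.minidom, the lightweight DOM implementation in the standard library. The sample document and variable names are invented for illustration; it is meant only to show random access and in-place modification of the node tree.

```python
from xml.dom.minidom import parseString

dom = parseString("<list><item id='1'>one</item><item id='2'>two</item></list>")

# Random access: jump straight to the second <item> in the tree.
second = dom.getElementsByTagName("item")[1]
print(second.getAttribute("id"), second.firstChild.data)

# Modify the tree by creating a new node and attaching it to the root.
new_item = dom.createElement("item")
new_item.setAttribute("id", "3")
new_item.appendChild(dom.createTextNode("three"))
dom.documentElement.appendChild(new_item)

# Serialize the modified tree back to XML text.
xml_text = dom.documentElement.toxml()
print(xml_text)
```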
Both of these tools have their uses, but most common XML tasks in Python can be handled least painfully using the third built-in library, ElementTree, or a more advanced library modeled on its API, called lxml. Both libraries allow XML documents to be treated like Python objects, making them easy to read, compose, interact with, and modify.
The xml.etree.ElementTree module contains several classes and functions for manipulating XML documents. The most important of these are the Element and ElementTree classes. An ElementTree essentially represents an entire XML document in memory; it uses the composite pattern we discussed in Chapter 10 to construct a tree of Element objects. It has a single pointer to a root node, which contains relevant child nodes, which may contain more children, and so on.
Any one Element object contains everything between the opening and associated closing tags in an XML document. It references the tag name, any attributes on the opening tag, the text inside the element, and a list of child elements (nested opening and closing tags) recursively. If the element's content mixes text and child tags (as in an HTML document), any text between a child's closing tag and the following tag (whether it is a new opening tag or the parent's closing tag) is stored in a tail attribute on that child element.
As an example, let's start with a simple HTML document:
<!DOCTYPE html>
<html>
    <head>
        <title>This is a web page</title>
        <link rel="stylesheet" href="styles.css" />
    </head>
    <body>
        <h1>Welcome To My Site</h1>
        <article class="simple">
            This is <em>my</em> site. It is rather lame.
            But it's still <strong>mine</strong>. Mine, I say.
        </article>
    </body>
</html>
If you're interested, this document is valid HTML 5 source code. If you're familiar with XHTML or HTML 4, you'll be relieved to see that they've made the new version much more readable. There must be some Python programmers on the standards development team, reminding the world that readability is important!
Now, the following program will load this HTML document into memory, and illustrate how the various elements are associated:
from xml.etree.ElementTree import fromstring

# Parse the whole file and keep a reference to the root element
with open("html_document.html") as file:
    root = fromstring(file.read())

print("ROOT NODE")
print(" tag:", root.tag)
print(" number of children:", len(root))
# Iterate over the element directly; getchildren() was removed in Python 3.9
print(" children:", [c.tag for c in root])

head = root[0]
print("HEAD NODE")
print(" first child:", head[0].tag)
print(" second_child:", head[1].tag)
print(" link attributes:", head[1].attrib)

# root[1] is <body>; its second child is <article>
article = root[1][1]
print("ARTICLE NODE")
print(" tag:", article.tag)
print(" text:", article.text)
print(" first child:", article[0].tag, article[0].text)
print(" em's tail:", article[0].tail)
print(" second child:", article[1].tag, article[1].text)
print(" strong's tail:", article[1].tail)
If we run this, we can see how child list access and the tag, text, attrib, and tail attributes work on an Element object:
ROOT NODE
 tag: html
 number of children: 2
 children: ['head', 'body']
HEAD NODE
 first child: title
 second_child: link
 link attributes: {'rel': 'stylesheet', 'href': 'styles.css'}
ARTICLE NODE
 tag: article
 text: This is
 first child: em my
 em's tail: site. It is rather lame. But it's still
 second child: strong mine
 strong's tail: . Mine, I say.
The important thing to note is that the child of an element is always another element with a similar interface. The list lookup supports iteration over a node (as in for child in element) and slice notation (as in element[1:5]), so it is easy to treat an element like a normal Python sequence.
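The sequence behavior can be sketched on a small, made-up list document. The depth function is a hypothetical helper, included only to show that, because every child is itself an Element, a recursive walk needs no special cases.

```python
from xml.etree.ElementTree import fromstring

root = fromstring(
    "<ul><li>a</li><li>b</li><li>c</li><li><ul><li>d</li></ul></li></ul>"
)

# Iteration yields the direct children, each of which is itself an Element.
child_tags = [child.tag for child in root]

# Slicing returns a plain list of Element objects.
middle_text = [li.text for li in root[1:3]]


def depth(element):
    """Hypothetical helper: number of levels in the tree below and including element."""
    return 1 + max((depth(child) for child in element), default=0)


print(child_tags, middle_text, depth(root))
```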
One thing to be careful of is that ElementTree is a bit ambiguous when it comes to checking for children. Do not use the idiom "if element" to determine whether an element exists, because it may evaluate to False when the element exists but contains no children. If you want to check whether an element contains children, use "if len(element)". Conversely, if you want to check whether an element exists, use "if element is not None".
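A minimal sketch of the two checks, using an invented two-element document. The variable names are illustrative only; the point is that existence and "has children" are separate questions with separate tests.

```python
from xml.etree.ElementTree import fromstring

root = fromstring("<head><title>page</title></head>")

title = root.find("title")   # exists, but has no child elements
missing = root.find("link")  # no such child, so find returns None

# Test for existence by comparing against None.
title_exists = title is not None
link_exists = missing is not None

# Test for children with len().
title_has_children = len(title) > 0
root_has_children = len(root) > 0

print(title_exists, link_exists, title_has_children, root_has_children)
```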
This short introduction is almost enough if we need to interpret or work with an element tree that we read from a file or that was provided across the network. Typically, when we read or receive an XML file, we need to do one of two things: process the entire document, visiting every node, or find the specific elements we are interested in.
The first task can be accomplished by recursively iterating over nodes and looking at their attributes, texts, and tails. The second task usually implies some sort of searching mechanism. The Element class provides three methods to help with matching elements. They each return a different value, but all accept a path pattern as their first argument. This pattern supports a limited subset of the XPath selection language; the entire XPath language is not supported. Basic features, such as selecting a tag, selecting a tag recursively, and building a path from the current node, do work, though:
print('search for child tag:', root.find('head'))
print('search children recursively:', root.findall('.//em'))
print('build path:', root.findtext('./body/article/em'))
If we run this code it outputs:
search for child tag: <Element head at 961f7ac>
search children recursively: [<Element em at 961fb2c>]
build path: my
This example also illustrates the three different search methods, each of which accepts the same kind of path. The find method returns the first matching element. The findall method returns a list of matching elements. The findtext method is a bit different; it finds the first matching sub-element (just like find), but then returns the text attribute for that element, instead of the element itself. Thus, when a match exists and has text, e.findtext(path) gives the same result as e.find(path).text. When nothing matches, findtext returns None (or a default value, if one is supplied), whereas e.find(path).text raises an AttributeError.
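The differences between the three methods, including the no-match case, can be sketched as follows. The sample markup and variable names are made up for this illustration.

```python
from xml.etree.ElementTree import fromstring

root = fromstring(
    "<body><article>This is <em>my</em> site, <em>mine</em>.</article><p/></body>"
)

first_em = root.find(".//em")            # first matching element, or None
all_ems = root.findall(".//em")          # list of every matching element
em_text = root.findtext("./article/em")  # text of the first match

# With no match, findtext returns its default (None unless one is given).
no_match = root.findtext("./article/strong")
with_default = root.findtext("./article/strong", default="n/a")

# An element with no text: findtext gives "", while .text is None.
empty_via_findtext = root.findtext("p")
empty_via_find = root.find("p").text

# find returns None when nothing matches, so chaining .text fails.
raised = False
try:
    root.find("./article/strong").text
except AttributeError:
    raised = True
```

In practice this means findtext is the safer choice when an element may be absent, since it never raises for a missing match.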
ElementTree isn't only good for parsing and searching XML documents. It also provides an intuitive interface for constructing them, using standard list and object access features. We can use the append method on an element to add a child, or the helper function SubElement, which requires a little less typing. We can set attributes using dictionary syntax, and text and tails using object attribute access. The following code constructs a simple HTML document using these features:
from xml.etree.ElementTree import Element, SubElement, tostring

root = Element("html")
head = Element("head")
root.append(head)
title = SubElement(head, "title")
title.text = "my page"
link = SubElement(head, "link")
link.attrib['rel'] = "stylesheet"
link.attrib['href'] = "styles.css"
body = Element("body")
body.text = "This is my website."
root.append(body)

# encoding="unicode" makes tostring return a str; the default is bytes
print(tostring(root, encoding="unicode"))
First we create a root element, then append some children to it in order. The SubElement function does the same thing to append a title to the head. When we create a link element, we also update the attribute dictionary to set attributes on it. Finally, we can use the tostring function to convert the element to an XML string, which looks like this:
<html><head><title>my page</title><link rel="stylesheet" href="styles.css" /></head><body>This is my website.</body></html>
There's plenty more to ElementTree than we've considered so far, but unlike the alternative XML libraries, the basics will get you a long way.
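As one small step beyond the basics, the sketch below builds mixed content (text, a child element, and a tail) and then parses the result back to check that it survives a round trip. The document content and variable names are invented for illustration; passing attributes as keyword arguments to SubElement and calling set() are alternatives to the dictionary syntax used earlier.

```python
from xml.etree.ElementTree import Element, SubElement, fromstring, tostring

root = Element("html")
head = SubElement(root, "head")

# Attributes can be passed as keyword arguments or added with set().
link = SubElement(head, "link", rel="stylesheet")
link.set("href", "styles.css")

# Mixed content: text before the child goes in body.text,
# text after the child's closing tag goes in the child's tail.
body = SubElement(root, "body")
body.text = "Intro "
em = SubElement(body, "em")
em.text = "emphasized"
em.tail = " outro."

markup = tostring(root, encoding="unicode")
print(markup)

# Parse the generated string back into a tree to confirm the round trip.
reparsed = fromstring(markup)
```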
lxml is an advanced XML parsing library that uses the lightning fast libxml2 library to do the underlying hard work. It can be downloaded from the lxml website at http://codespeak.net/lxml/. It is a third-party library and, in the past, has been difficult to install on some operating systems, although this should not be the case with the latest releases.
If your needs are basic and can be covered by the ElementTree API we just discussed, then, by all means, use that. But if you need to parse invalid XML documents, run advanced XPath searches, or use CSS selectors, lxml is your tool.
lxml has an interface that is very similar to ElementTree; in fact, for basic usage, lxml can be used as a drop-in replacement for ElementTree. This will usually give your parsing code a speedup, but this compatibility with ElementTree isn't what makes lxml great. lxml is far more advanced and provides numerous features above and beyond ElementTree.
The previous example for parsing and searching our XML file needs only one change to work in lxml: change the import to read "from lxml.etree import fromstring" and the code will run unmodified.
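Because the basic interfaces match, one common arrangement is to prefer lxml when it is installed and fall back to the standard library otherwise. This is only a sketch of that idea, with a made-up sample document; it relies on nothing beyond the shared fromstring, iteration, and findtext behavior described above.

```python
try:
    # Third-party library; used when it is installed.
    from lxml.etree import fromstring
except ImportError:
    # Standard library fallback with the same basic interface.
    from xml.etree.ElementTree import fromstring

root = fromstring("<html><head><title>t</title></head><body/></html>")

child_tags = [child.tag for child in root]
title_text = root.findtext("head/title")
print(child_tags, title_text)
```

Code written this way should stick to the features both libraries share; anything lxml-specific, such as the xpath method, would fail under the fallback.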
My favorite advanced feature of lxml is its support for advanced XPath and CSS selectors for searching through XML documents. These are far more useful than the basic ElementTree searches. Here are some examples:
print('xpath attribute:', root.xpath('//link[@href]'))
print('xpath text filter:', root.xpath('//*[contains(em, "my")]'))
print('xpath first child:', root.xpath('/html/body/article/em[1]'))

# CSS selectors need the separate cssselect package in recent lxml releases
from lxml.cssselect import CSSSelector
print('css class selector:', CSSSelector('.simple')(root))
print('css tag selector:', CSSSelector('em')(root))
lxml will support any XPath selector that is supported by the underlying libxml2 library. This basically encompasses the entire XPath language definition, although some of the most exotic selectors may be buggy.
The CSS selectors will feel very familiar to anyone used to the jQuery JavaScript library or similar libraries. CSS selectors compile, internally, to equivalent XPath selectors before the selection is run. Both the XPath and the CSS selector functions return a list of all matching elements, similar to the ElementTree findall method.
In addition to these advanced search features, lxml provides:
We don't have space to discuss these, but if you have any advanced or complicated requirements when it comes to XML or HTML parsing, lxml is invariably the tool you want to reach for.