XML

Many Python programmers consider working with XML to be a right royal pain. However, XML is extremely popular for a wide range of purposes. Python programs frequently have to interact with it, both as a consumer that needs to interpret XML data from another source and as a producer that needs to create XML data for other programs or computers to parse.

Python includes three well-documented libraries for interacting with XML documents in its standard library. Two are based on traditional XML parsing techniques, while the third is a nice Pythonic interface.

The SAX (Simple API for XML) library is an event-driven system that calls specific functions when specific objects are encountered in the byte stream: opening and closing tags, attributes, and contents. It can be unwieldy to work with, but has the advantage of parsing XML documents "on the fly" without having to load the entire stream into memory. This is useful for huge documents.
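
To make the event-driven style concrete, here is a minimal sketch using the standard library's xml.sax module. The TagCounter class and the SAMPLE document are hypothetical names invented for this illustration; the handler methods it overrides (startElement and characters) are the callbacks SAX invokes as it reads the stream.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Count opening tags and collect text without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text_parts = []

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser reaches it.
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # May be called more than once for a single run of text.
        self.text_parts.append(content)


# Hypothetical sample document, used only for illustration.
SAMPLE = b"<library><book>Dune</book><book>Emma</book></library>"

handler = TagCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.counts)
print("".join(handler.text_parts))
```

Nothing here keeps the document in memory; the handler only retains whatever it chooses to record, which is why this approach scales to very large inputs.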

The DOM (Document Object Model) library takes a different approach. It allows any part of the document to be accessed at any time, and treats the document like a tree of connected nodes. Each node represents an element, attribute, or text in the document. Elements can have child elements and each can be accessed randomly. It allows XML documents to be both read and written as well as modified by dynamically adding new nodes to the tree.
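
A short sketch of the DOM style, using the standard library's xml.dom.minidom module, might look like the following. The SAMPLE document and the variable names are assumptions made for illustration only.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document, used only for illustration.
SAMPLE = "<library><book>Dune</book></library>"

document = parseString(SAMPLE)
library = document.documentElement

# Random access: look up nodes anywhere in the tree by tag name.
first_book = document.getElementsByTagName("book")[0]
first_title = first_book.firstChild.data  # text lives in a child Text node

# Modify the tree by creating a new element and appending it.
new_book = document.createElement("book")
new_book.setAttribute("year", "1815")
new_book.appendChild(document.createTextNode("Emma"))
library.appendChild(new_book)

print(first_title)
print(library.toxml())
```

Note how even plain text is a separate node that has to be created and attached explicitly; this verbosity is typical of DOM code.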

Both of these tools have their uses, but most common XML tasks in Python can be handled least painfully using the third built-in library, ElementTree, or a more advanced third-party library modeled on its API, called lxml. Both libraries allow XML documents to be treated like Python objects, making them easy to read, compose, interact with, and modify.

ElementTree

The xml.etree.ElementTree package contains several classes and functions for manipulating XML documents. The most important of these are the Element and ElementTree classes. An ElementTree essentially represents an entire XML document in memory; it uses the composite pattern we discussed in Chapter 10 to construct a tree of Element objects. It has a single pointer to a root node, which contains relevant child nodes, which may contain more children, and so on.

Any one Element object contains everything between the opening and associated closing tags in an XML document. It references the tag name, any attributes on the opening tag, the text inside the element, and a list of child elements (nested opening and closing tags) recursively. If the text includes interspersed text and tag elements (as in an HTML document), any text between a closing tag and the following tag (whether it is a new opening tag or the parent's closing tag) is added to a tail attribute on the element.

As an example, let's start with a simple HTML document:

	<!DOCTYPE html>
	<html>
		<head>
			<title>This is a web page</title>
			<link rel="stylesheet" href="styles.css" />
		</head>
		<body>
			<h1>Welcome To My Site</h1>
			<article class="simple">
				This is <em>my</em> site. It is rather lame.
				But it's still <strong>mine</strong>. Mine,
				I say.
			</article>
		</body>
	</html>

Note

If you're interested, this document is valid HTML 5 source code. If you're familiar with XHTML or HTML 4, you'll be relieved to see that they've made the new version much more readable. There must be some Python programmers on the standards development team, reminding the world that readability is important!

Now, the following program will load this HTML document into memory, and illustrate how the various elements are associated:

	from xml.etree.ElementTree import fromstring

	with open("html_document.html") as file:
		root = fromstring(file.read())

	print("ROOT NODE")
	print("	tag:", root.tag)
	print("	number of children:", len(root))
	# Iterate over the element directly; getchildren() was removed in Python 3.9
	print("	children:", [c.tag for c in root])
	head = root[0]
	print("HEAD NODE")
	print("	first child:", head[0].tag)
	print("	second_child:", head[1].tag)
	print("	link attributes:", head[1].attrib)
	article = root[1][1]
	print("ARTICLE NODE")
	print("	tag:", article.tag)
	print("	text:", article.text)
	print("	first child:", article[0].tag, article[0].text)
	print("		em's tail:", article[0].tail)
	print("	second child:", article[1].tag, article[1].text)
	print("	strong's tail:", article[1].tail)

If we run this, we can see how child list access and the tag, text, attrib, and tail attributes work on an Element object:

	ROOT NODE
		tag: html
		number of children: 2
		children: ['head', 'body']
	HEAD NODE
		first child: title
		second_child: link
		link attributes: {'rel': 'stylesheet', 'href': 'styles.css'}
	ARTICLE NODE
		tag: article
		text:
				This is
		first child: em my
				em's tail: site. It is rather lame.
					But it's still
		second child: strong mine
		strong's tail: . Mine,
					I say.

The important thing to note is that the child of an element is always another element with a similar interface. The list lookup supports iteration over a node (as in for child in element) and slice notation (as in element[1:5]), so it is easy to treat an element like a normal Python sequence.
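
Because every child is itself an element, walking an entire tree takes only a short recursive function. The following is a minimal sketch; the outline function and the cut-down SAMPLE page are hypothetical names introduced for this illustration.

```python
from xml.etree.ElementTree import fromstring


def outline(element, depth=0):
    """Return one indented line per element, visiting children recursively."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical cut-down version of the sample page.
SAMPLE = "<html><head><title>Hi</title></head><body><h1>Hello</h1></body></html>"

root = fromstring(SAMPLE)
print("\n".join(outline(root)))

# Slicing returns a plain list of child elements.
first_only = [child.tag for child in root[0:1]]
print(first_only)
```

The same shape of function works for converting a tree into some other structure: replace the string building with whatever the target structure needs.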

One thing to be careful of is that truth-testing an element is misleading. Do not use the idiom if element to determine whether an element exists, because an element that exists but contains no children evaluates to False (and recent Python versions warn that this truth test is deprecated). If you want to check whether an element contains children, use if len(element). Conversely, if you want to check whether an element exists, use if element is not None.
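
The two explicit checks can be sketched as follows. The sample markup and variable names are assumptions for illustration; find returns None when nothing matches, which is what makes the is not None test reliable.

```python
from xml.etree.ElementTree import fromstring

root = fromstring('<head><title>Hi</title><link rel="stylesheet" /></head>')

link = root.find("link")       # exists, but has no children
missing = root.find("script")  # no such child, so find returns None

link_exists = link is not None       # existence check
link_has_children = len(link) > 0    # children check
script_exists = missing is not None

print(link_exists, link_has_children, script_exists)
```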

This short introduction is almost enough if we need to interpret or work with an element tree that we read from a file or that was provided across the network. Typically, when we read or receive an XML file, we need to do one of two things:

  • Parse it node by node and convert it to some other structure
  • Find specific elements or attributes and look up their value

The first task can be accomplished by recursively iterating over nodes and looking at their attributes, texts, and tails. The second task usually implies some sort of searching mechanism. The Element class provides three methods to help with matching elements. They each return a different kind of value, but all accept a path pattern as their first parameter. This pattern supports only a limited subset of the XPath selection language; the entire XPath language is, unfortunately, not available. Basic features, such as selecting a tag, selecting a tag recursively, and building a path from the current node work, though:

	print('search for child tag:', root.find('head'))
	print('search children recursively:', root.findall('.//em'))
	print('build path:', root.findtext('./body/article/em'))

If we run this code, it outputs something like the following (the memory addresses will differ from run to run):

	search for child tag: <Element 'head' at 0x961f7ac>
	search children recursively: [<Element 'em' at 0x961fb2c>]
	build path: my

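This example also illustrates the three different search methods, each of which accepts the same kind of path. The find method returns the first matching element, or None if nothing matches. The findall method returns a list of matching elements. The findtext method is a bit different; it finds the first matching sub-element (just like find), but then returns the text attribute for that element, instead of the element itself. Thus, when a match exists, e.findtext(path) gives the same result as e.find(path).text. The two differ at the edges: if nothing matches, findtext returns None (or a default you supply) whereas e.find(path).text raises an AttributeError, and if the matched element has no text, findtext returns an empty string rather than None.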
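
The following sketch exercises a little more of the search interface: an attribute predicate, which is part of the XPath subset ElementTree supports, and the behavior of the search methods when nothing matches. The SAMPLE document is a hypothetical one invented for this illustration.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical sample document, used only for illustration.
SAMPLE = (
    "<html><head>"
    '<link rel="stylesheet" href="styles.css" />'
    '<link rel="icon" href="favicon.ico" />'
    "</head><body><p>Hello</p></body></html>"
)
root = fromstring(SAMPLE)

# Attribute predicates are part of the supported XPath subset.
stylesheets = root.findall(".//link[@rel='stylesheet']")
hrefs = [link.get("href") for link in stylesheets]

# findtext returns None, or the supplied default, when nothing matches.
no_footer = root.findtext("./body/footer")
fallback = root.findtext("./body/footer", default="(none)")

# find also returns None here, so chaining .text onto it would raise
# an AttributeError.
footer = root.find("./body/footer")

print(hrefs, no_footer, fallback, footer)
```

Supplying a default to findtext is often the tidiest way to read optional elements without a separate existence check.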

Constructing XML documents

ElementTree isn't only good for parsing and searching XML documents. It also provides an intuitive interface for constructing them, using standard list and object access features. We can use the append method on an element to add a child, or the helper function, SubElement, which is a little bit less typing. We can set attributes using dictionary syntax on the attrib dictionary, and text and tails using object attribute access. The following code constructs a simple HTML document using these features:

	from xml.etree.ElementTree import (Element, SubElement,
			tostring)
			
	root = Element("html")
	head = Element("head")
	root.append(head)
	title = SubElement(head, "title")
	title.text = "my page"
	link = SubElement(head, "link")
	link.attrib['rel'] = "stylesheet"
	link.attrib['href'] = "styles.css"
	body = Element("body")
	body.text = "This is my website."
	root.append(body)
	
	print(tostring(root, encoding="unicode"))

First we create a root element, then append some children to it in order. The SubElement function does the same thing in one step, creating a title and appending it to the head. When we create a link element, we also update the attribute dictionary to set attributes on it. Finally, we use the tostring function to convert the element to an XML string. Passing encoding="unicode" makes it return a str; by default, Python 3 returns a bytes object, which would print with a b'...' wrapper. The output looks like this:

	<html><head><title>my page</title><link rel="stylesheet" href="styles.css" /></head><body>This is my website.</body></html>

There's plenty more to ElementTree than we've considered so far, but unlike the alternative XML libraries, the basics will get you a long way.
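
One of those basics, the tail attribute, is easiest to understand when building mixed content, where text and tags are interleaved. The following is a minimal sketch that reconstructs a fragment similar to the article element from the earlier HTML page; the variable names are chosen for illustration.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

article = Element("article")
article.set("class", "simple")  # equivalent to article.attrib["class"] = "simple"
article.text = "This is "       # text before the first child
em = SubElement(article, "em")
em.text = "my"                  # text inside <em>
em.tail = " site."              # text after </em>, before the next tag

markup = tostring(article, encoding="unicode")  # str rather than bytes
print(markup)
# prints: <article class="simple">This is <em>my</em> site.</article>
```

The text that follows a child belongs to that child's tail, not to the parent, which is the detail that most often surprises people when composing documents by hand.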

lxml

lxml is an advanced XML parsing library that uses the lightning fast libxml2 library to do the underlying hard work. It can be downloaded from the lxml website at http://codespeak.net/lxml/. It is a third-party library and, in the past, has been difficult to install on some operating systems, although this should not be the case with the latest releases.

If your needs are basic and can be covered by the ElementTree API we just discussed, then, by all means, use that. But if you need to parse invalid XML documents, run advanced XPath searches, or use CSS selectors, lxml is your tool.

lxml has an interface that is very similar to ElementTree; in fact, for basic usage, lxml can be used as a drop-in replacement for ElementTree. This will usually give your parsing code a speedup, but this compatibility with ElementTree isn't what makes lxml great. lxml is far more advanced and provides numerous features above and beyond ElementTree.

The previous example for parsing and searching our XML file only needs one change to work in lxml: change the import to read from lxml.etree import fromstring and the code will run unmodified.
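
Because the basic interfaces match, a common pattern is to prefer lxml when it is installed and fall back to the standard library otherwise. The sketch below assumes only the shared subset of the API is used afterwards; the sample markup is hypothetical.

```python
try:
    # Prefer lxml when it is installed.
    from lxml import etree
except ImportError:
    # Fall back to the standard library implementation.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<html><body><em>my</em></body></html>")
first_em = root.findtext("./body/em")
print(root.tag, first_em)
```

Code written this way should stick to features both libraries provide; lxml-only methods such as xpath would fail under the fallback.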

My favorite advanced feature of lxml is its support for advanced XPath and CSS selectors for searching through XML documents. These are far more useful than the basic ElementTree searches. Here are some examples:

	print('xpath attribute:', root.xpath('//link[@href]'))
	print('xpath text filter:', root.xpath('//*[contains(em, "my")]'))
	print('xpath first child:', root.xpath('/html/body/article/em[1]'))
	from lxml.cssselect import CSSSelector
	print('css class selector:', CSSSelector('.simple')(root))
	print('css tag selector:', CSSSelector('em')(root))

lxml will support any XPath selector that is supported by the underlying libxml2 library. This basically encompasses the entire XPath language definition, although some of the most exotic selectors may be buggy.

The CSS selectors will feel very comfortable to anyone used to the jQuery JavaScript library or similar libraries. CSS selectors compile, internally, to equivalent XPath selectors before the selection is run. Note that in recent lxml releases, the lxml.cssselect module depends on the separately installed cssselect package. Both the XPath and the CSS selector functions return a list of all matching elements, similar to the ElementTree findall method.

In addition to these advanced search features, lxml provides:

  • A parser for badly formed HTML
  • A unique library for treating elements like objects, so you can access sub-tags as if they were attributes on objects
  • A complete XML validation tool that can utilize DTDs, XMLSchema, and RELAX NG schemas

We don't have space to discuss these, but if you have any advanced or complicated requirements when it comes to XML or HTML parsing, lxml is invariably the tool you want to reach for.
