This chapter describes a number of modules that are used to parse different file formats.
Python comes with extensive support for the Extensible Markup Language (XML) and Hypertext Markup Language (HTML) file formats. Python also provides basic support for Standard Generalized Markup Language (SGML).
All these formats share the same basic structure because both HTML and XML are derived from SGML. Each document contains a mix of start tags, end tags, plain text (also called character data), and entity references, as shown in the following:
<document name="sample.xml">
    <header>This is a header</header>
    <body>This is the body text. The text can contain
    plain text (&quot;character data&quot;), tags, and
    entities.
    </body>
</document>
In the previous example, <document>, <header>, and <body> are start tags. For each start tag, there’s a corresponding end tag that looks similar, but has a slash before the tag name. The start tag can also contain one or more attributes, like the name attribute in this example.
Everything between a start tag and its matching end tag is called an element. In the previous example, the document element contains two other elements: header and body.
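The tag and element structure described above can be made visible with a short sketch. It assumes the expat bindings in Python's standard library (xml.parsers.expat); the handler names, the SAMPLE string, and the outline list are illustrative choices, not part of any module.

```python
import xml.parsers.expat

# A shortened version of the sample document (an assumption for this sketch).
SAMPLE = (
    '<document name="sample.xml">'
    '<header>This is a header</header>'
    '<body>This is the body text.</body>'
    '</document>'
)

outline = []   # one (depth, description) pair per parser event
stack = []     # names of the elements that are currently open

def start_element(name, attrs):
    # attrs is a dictionary mapping attribute names to values
    outline.append((len(stack), "start %s %r" % (name, attrs)))
    stack.append(name)

def end_element(name):
    stack.pop()
    outline.append((len(stack), "end %s" % name))

def character_data(data):
    outline.append((len(stack), "text %r" % data))

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True   # deliver adjacent character data in one piece
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = character_data
parser.Parse(SAMPLE, True)

# Indent each event by its nesting depth.
for level, description in outline:
    print("  " * level + description)
```

The indentation of the printed outline mirrors the nesting: the header and body events appear one level inside the document element.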
Finally, &quot; is a character entity. It is used to represent reserved characters in the text sections. In this case, it stands for a double quotation mark ("). Other common entities include &amp; for the ampersand (&), which is the character used to start the entity itself, &lt; for “less than” (<), and &gt; for “greater than” (>).
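As a rough illustration of how reserved characters map to entities and back, the following sketch uses the escape and unescape helpers from xml.sax.saxutils. These helpers handle &amp;, &lt;, and &gt; on their own; the extra dictionaries that add &quot; are assumptions made for this example.

```python
from xml.sax.saxutils import escape, unescape

raw = 'Tom & "Jerry" <3'

# escape() replaces &, < and > by default; the extra mapping
# (an assumption for this sketch) also handles double quotes.
encoded = escape(raw, {'"': "&quot;"})
print(encoded)

# unescape() reverses the process when given the inverse mapping.
decoded = unescape(encoded, {"&quot;": '"'})
print(decoded)
```

Note that the ampersand must be escaped first and unescaped last, which the helpers take care of; otherwise the ampersands that start other entities would be mangled.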
While XML, HTML, and SGML all share the same building blocks, there are important differences between them. In XML, all elements must have both start tags and end tags, and the tags must be properly nested (if they are, the document is said to be well-formed). In addition, XML is case-sensitive, so <document> and <Document> are two different element types.
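A minimal sketch of a well-formedness check, assuming the standard xml.parsers.expat module; the is_well_formed helper is a hypothetical name introduced for this example. The parser raises ExpatError when tags are missing, badly nested, or differ in case.

```python
import xml.parsers.expat

def is_well_formed(text):
    """Return True if text parses as a complete XML document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)   # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True

print(is_well_formed("<a><b>text</b></a>"))          # properly nested
print(is_well_formed("<a><b>text</a></b>"))          # overlapping tags
print(is_well_formed("<document>text</Document>"))   # case mismatch
print(is_well_formed("<p>no end tag"))               # missing end tag
```

Only the first document is accepted; the other three each break one of the rules described above.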
HTML, in contrast, is much more flexible. The HTML parser can often fill in missing tags; for example, if you open a new paragraph in HTML using the <P> tag without closing the previous paragraph, the parser automatically adds a </P> end tag. HTML is also case-insensitive.
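The following sketch shows one way a parser could fill in the missing </P> tag. It assumes the html.parser module found in current Python 3 releases, which is not one of the modules this chapter covers; the ParagraphCloser class and its events list are hypothetical names. The stock HTMLParser only reports what it sees (with tag names folded to lowercase), so the implied end tag is added by the subclass.

```python
from html.parser import HTMLParser

class ParagraphCloser(HTMLParser):
    """Record parser events, adding an implied </p> where one is missing."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.open_p = False   # True while a paragraph is still open

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self.open_p:
                # A new paragraph starts before the old one was closed.
                self.events.append("</p> (implied)")
            self.open_p = True
        self.events.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_p = False
        self.events.append("</%s>" % tag)

    def handle_data(self, data):
        if data.strip():
            self.events.append(data.strip())

closer = ParagraphCloser()
closer.feed("<P>first paragraph<P>second paragraph</P>")
closer.close()

for event in closer.events:
    print(event)
```

The uppercase <P> tags arrive as "p", which reflects HTML's case-insensitivity, and no error is raised for the unclosed first paragraph.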
On the other hand, XML allows you to define your own elements, while
HTML uses a fixed element set, as defined by the HTML specifications.
SGML is even more flexible. In its full incarnation, you can use a custom declaration to define how to translate the source text into an element structure, and a document type definition (DTD) to validate the structure and fill in missing tags. Technically, both HTML and XML are SGML applications; they both have their own SGML declaration, and HTML also has a standard DTD.
Python comes with parsers for all markup flavors. While SGML is the most flexible of the formats, Python’s sgmllib parser is actually pretty simple. It avoids most of the problems by only understanding enough of the SGML standard to be able to deal with HTML. It doesn’t handle DTDs either; instead, you can customize the parser via subclassing.
Python’s HTML support is built on the SGML parser. The htmllib parser delegates the actual rendering to a formatter object. The formatter module contains a couple of standard formatters.
Python’s XML support is the most complex. In Python 1.5.2, the built-in support was limited to the xmllib parser, which is pretty similar to the sgmllib module, with one important difference: xmllib actually tries to support the entire XML standard. Python 2.0 comes with more advanced XML tools, based on the optional expat parser.