Chapter 21. XML

Introduction

The Extensible Markup Language, or XML, is a portable, human-readable format for exchanging text or data between programs. XML derives from its parent standard SGML, as does the HTML language used on web pages worldwide. XML, then, is HTML’s younger but more capable sibling. And since most developers know at least a bit of HTML, parts of this discussion will be couched in terms of comparisons with HTML. XML’s lesser-known grandparent is IBM’s GML (General Markup Language), and one of its cousins is Adobe FrameMaker’s Maker Interchange Format (MIF). Figure 21-1 depicts the family tree.

Figure 21-1. XML’s ancestry

One way of thinking about XML is that it’s HTML cleaned up, consolidated, and with the ability to define your own tags. It’s HTML with tags that can and should identify the informational content as opposed to the formatting. Another way of perceiving XML is as a general interchange format for such things as business-to-business communications over the Internet, or as a human-editable[50] description of things as diverse as word-processing files and Java documents. XML is all these things, depending on where you’re coming from as a developer and where you want to go today -- and tomorrow.

Because of the wide acceptance of XML, it is used as the basis for many other formats, including the Open Office (http://www.openoffice.org) save file format, the SVG graphics file format, and many more.

From SGML, both HTML and XML inherit the syntax of using angle brackets (< and >) around tags, each pair of which delimits one part of an XML document, called an element. An element may contain content (like a <P> tag in HTML) or may not (like an <HR> in HTML). While HTML documents can begin with either an <HTML> tag or a <!DOCTYPE...> declaration (or, informally, with neither), an XML file should always begin with an XML prolog, which is at least the following declaration (XML 1.0 makes it optional, but including it is good practice):

<?xml version="1.0"?>

The question mark is a special character used to identify the XML prolog (it’s syntactically similar to the % used in ASP and JSP).

HTML has a number of elements that accept attributes, such as:

<BODY BGCOLOR=white> ... </body>

XML attribute values (such as the 1.0 for the version in the prolog, or the white in BGCOLOR) must be quoted. In other words, quoting is optional in HTML, but required in XML.
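The quoting rule is easy to confirm with the XML parser that ships with the JDK. What follows is only a minimal sketch: the class name WellFormedCheck and the helper isWellFormed are invented for illustration, and it assumes the standard JAXP DocumentBuilder. It also tries a variant whose start and end tags differ only in case, which an XML parser rejects as well because XML names are case-sensitive.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, invented for this sketch.
class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser found a well-formedness error
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<BODY BGCOLOR=white> ... </body>",     // HTML style: unquoted value
            "<BODY BGCOLOR=\"white\"> ... </body>", // quoted, but tag case differs
            "<body bgcolor=\"white\"> ... </body>", // quoted, matching case
        };
        for (String sample : samples) {
            String verdict = isWellFormed(sample) ? "well-formed: " : "rejected:    ";
            System.out.println(verdict + sample);
        }
    }
}
```

Run as written, only the third sample should be reported as well-formed.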

Another difference between HTML and XML is that XML is case-sensitive, so that BODY, Body, and body represent three different element names. The BODY example shown here, while allowed in HTML, would draw complaint from any XML parser.

And speaking of XML parsing, there’s a great variety of XML parsers available. A parser is simply a program or class that reads an XML file, looks at it at least syntactically, and lets you access some or all of the elements. Most of these parsers conform to the Java bindings for one of the two well-known XML APIs, SAX and DOM. SAX, the Simple API for XML, reads the file and calls your code when it encounters certain events, such as start-of-element, end-of-element, start-of-document, and the like. DOM, the Document Object Model, reads the file and constructs an in-memory tree or graph corresponding to the elements and their attributes and contents in the file. This tree can be traversed, searched, modified (even constructed from scratch, using DOM), or written to a file.
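To make the contrast between the two APIs concrete, here is a small sketch that runs the same document through both, using the JAXP factories bundled with the JDK. The sample document, the class name SaxAndDomDemo, and the two helper methods are all invented for illustration. The SAX half records each callback as it arrives; the DOM half builds the tree first and then searches it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxAndDomDemo {
    // Hypothetical sample document, invented for this sketch.
    static final String XML =
        "<?xml version=\"1.0\"?>"
        + "<people>"
        + "<person id=\"1\"><name>Ada</name></person>"
        + "<person id=\"2\"><name>Grace</name></person>"
        + "</people>";

    /** SAX: the parser calls our handler as it meets each event. */
    static List<String> saxEvents(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { events.add("startDocument"); }
            @Override public void endDocument() { events.add("endDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    /** DOM: the parser builds a tree in memory; we then search and read it. */
    static List<String> domNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList people = doc.getElementsByTagName("person");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < people.getLength(); i++) {
                Element person = (Element) people.item(i);
                String name = person.getElementsByTagName("name").item(0).getTextContent();
                result.add(person.getAttribute("id") + "=" + name);
            }
            return result;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(saxEvents(XML));
        System.out.println(domNames(XML));
    }
}
```

The SAX version never holds the whole document in memory, which matters for large inputs; the DOM version costs memory but lets you revisit or modify any part of the tree after parsing.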

An alternative API called JDOM has also been released into the open source field. JDOM, by Brett McLaughlin and Jason Hunter, has the advantage of being aimed primarily at Java (DOM itself is designed to work with many different programming languages). JDOM is available at http://www.jdom.org, and has been accepted as a JSR (Java Specification Request) under the Java Community Process.

But how does the parser know if an XML file contains the correct elements? Well, the simpler, “non-validating” parsers don’t; they simply check that the XML is syntactically correct, or well-formed. Validating parsers check that the XML file conforms to a given Document Type Definition (DTD) or an XML Schema. DTDs are inherited from SGML; their syntax is discussed in Section 21.5. Schemas are newer than DTDs and, while more complex, provide such object-based features as inheritance. DTDs are written in a special meta-language derived from SGML, while XML Schemas are written in “pure” XML.
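The difference between a non-validating and a validating parse can be sketched with the JDK's own parser. In this illustration the DTD, the sample documents, and the names DtdValidationDemo and check are all invented; the only API assumed is JAXP's DocumentBuilderFactory.setValidating. Validity problems are reported to the error handler as recoverable errors, so the sketch collects them in a list rather than stopping at the first one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidationDemo {
    // Hypothetical DTD, invented for this sketch: a note holds one "to" and one "body".
    static final String DTD =
        "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (to, body)>\n"
        + "  <!ELEMENT to (#PCDATA)>\n"
        + "  <!ELEMENT body (#PCDATA)>\n"
        + "]>\n";

    static final String VALID = DTD + "<note><to>Ian</to><body>Hello</body></note>";
    // Well-formed, but uses an undeclared element and omits the required body.
    static final String INVALID = DTD + "<note><to>Ian</to><subject>Hi</subject></note>";

    /** Parses the text and returns the problems reported; an empty list means none. */
    static List<String> check(String xml, boolean validating) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating); // true: check against the DTD as well
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                // Validity violations arrive here as recoverable errors.
                @Override public void error(SAXParseException e) {
                    problems.add(e.getMessage());
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            problems.add("not well-formed: " + e.getMessage());
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("non-validating, invalid doc: " + check(INVALID, false));
        System.out.println("validating, valid doc:       " + check(VALID, true));
        System.out.println("validating, invalid doc:     " + check(INVALID, true));
    }
}
```

Only the last call should report problems: the non-validating parse accepts the invalid document because it is still well-formed.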

In addition to parsing XML, you can use an XML processor to transform it into some other format, such as HTML. This is a natural fit for a servlet (see Chapter 18): if a given web browser client can support XML, just write the data as-is, but if not, transform the data into HTML. There are two related stylesheet languages, XSLT for transformation and XSL-FO for formatting, which we’ll look at first; for more complex operations on XML, there are two parsing APIs that we’ll cover later.
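As a rough illustration of the XML-to-HTML step, the following sketch applies a tiny XSLT stylesheet using the javax.xml.transform API bundled with the JDK. The input document, the stylesheet, and the names XsltToHtmlDemo and transform are invented for this example; a servlet would write the result to its response rather than returning a string.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltToHtmlDemo {
    // Hypothetical input data, invented for this sketch.
    static final String XML =
        "<people><person>Ada</person><person>Grace</person></people>";

    // Hypothetical stylesheet: turns the people list into an HTML bullet list.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/people\">"
        + "<ul><xsl:apply-templates select=\"person\"/></ul>"
        + "</xsl:template>"
        + "<xsl:template match=\"person\">"
        + "<li><xsl:value-of select=\".\"/></li>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML text and returns the result as a string. */
    static String transform(String xml, String xslt) {
        try {
            // Compile the stylesheet into a Transformer, then run it on the input.
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(XML, XSLT));
    }
}
```

Keeping the stylesheet separate from the Java code means the HTML layout can change without recompiling anything.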

If you need to control how an XML document is displayed, you can use XSL-FO (Extensible Stylesheet Language Formatting Objects). XSL-FO is an extended version of the HTML stylesheet concept that allows you to specify formatting for particular elements. However, at the time of writing the XSL-FO standard isn’t complete yet. And XSL-FO can be complex; you are basically specifying a batch formatting language to describe how your textual data is formatted for the printed page. A comprehensive reference implementation is FOP, which produces Acrobat PDF output and is available from http://xml.apache.org.



[50] Although you can edit XML using vi, Emacs, notepad, or simpletext, it is considered preferable to use an XML-aware editor. XML’s structure is more complex, and parsing programs far less tolerant of picayune error, than was ever the case in the HTML world. XML files are kept as plain text for debugging purposes, for ease of transmission across wildly incompatible operating systems, and (as a last resort) for manual editing to repair software disasters.
