CHAPTER GOALS
Understanding XML elements and attributes
Understanding the concept of an XML parser
Being able to read and write XML documents
Being able to design Document Type Definitions for XML documents
In this chapter, you will learn about the Extensible Markup Language (XML), a mechanism for encoding data that is independent of any programming language. XML allows you to encode complex data in a form that the recipient can easily parse. XML is very popular for data exchange. It is simple enough that a wide variety of programs can easily generate XML data. XML data has a nested structure, so you can use it to describe hierarchical data sets—for example, an invoice that contains many items, each of which consists of a product and a quantity. Because the XML format is standardized, libraries for parsing the data are widely available and—as you will see in this chapter—easy to use for a programmer.
It is particularly easy to read and write XML documents in Java. In fact, it is generally easier to use XML than it is to use an "ad hoc" file format. Thus, using XML makes your programs easier to write and more professional.
To understand the advantages of using XML for encoding data, let's look at a typical example. We will encode product descriptions, so that they can be transferred to another computer. Your first attempt might be a naïve encoding like this:
Toaster 29.95
In contrast, here is an XML encoding of the same data:
<product> <description>Toaster</description> <price>29.95</price> </product>
XML allows you to encode complex data, independent from any programming language, in a form that the recipient can easily parse.
The advantage of the XML version is clear: You can look at the data and understand what they mean. Of course, this is a benefit for the programmer, not for a computer program. A computer program has no understanding of what a "price" is. As a programmer, you still need to write code to extract the price as the content of the price element. Nevertheless, the fact that an XML document is comprehensible by humans is a huge advantage for program development.
XML files are readable by computer programs and by humans.
A second advantage of the XML version is that it is resilient to change. Suppose the product data change, and an additional data item is introduced, to denote the manufacturer. In the naïve format, the manufacturer might be added after the price, like this:
Toaster 29.95 General Appliances
XML-formatted data files are resilient to change.
A program that can process the old format might get confused when reading a sequence of products in the new format. The program would think that the price is followed by the name of the next product. Thus, the program needs to be updated to work with both the old and new data formats. As data get more complex, programming for multiple versions of a data format can be difficult and time-consuming.
When using XML, on the other hand, it is easy to add new elements:
<product> <description>Toaster</description> <price>29.95</price> <manufacturer>General Appliances</manufacturer> </product>
Now a program that processes the new data can still extract the old information in the same way, as the contents of the description and price elements. The program need not be updated, and it can tolerate different versions of the data format.
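The following sketch illustrates this resilience. It is not part of the chapter's sample code: the class name ProductPriceReader and the method priceOf are made up for this illustration, and the sketch uses the XPath API that is introduced later in this chapter. The same extraction code is applied to the old and the new version of the product data.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/**
   Illustration only: reads the price from a product document.
   The class and method names are hypothetical.
*/
class ProductPriceReader
{
   /**
      Extracts the content of the price element.
      @param xml a product document
      @return the text inside the price element
   */
   static String priceOf(String xml)
   {
      XPath path = XPathFactory.newInstance().newXPath();
      try
      {
         return path.evaluate("/product/price",
            new InputSource(new StringReader(xml)));
      }
      catch (XPathExpressionException ex)
      {
         throw new IllegalArgumentException("Cannot read product data", ex);
      }
   }

   public static void main(String[] args)
   {
      String oldFormat = "<product><description>Toaster</description>"
         + "<price>29.95</price></product>";
      String newFormat = "<product><description>Toaster</description>"
         + "<price>29.95</price>"
         + "<manufacturer>General Appliances</manufacturer></product>";

      // The extra manufacturer element does not disturb the lookup
      System.out.println(priceOf(oldFormat));
      System.out.println(priceOf(newFormat));
   }
}
```

Both calls yield the same price, because the lookup names the element it wants instead of relying on its position in the input.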
If you know HTML, you may have noticed that the XML format of the product data looked somewhat like HTML code. However, there are some differences that we will discuss in this section.
Let's start with the similarities. The XML tag pairs, such as <price> and </price>, look just like HTML tag pairs, for example <li> and </li>. Both in XML and in HTML, tags are enclosed in angle brackets < >, and a start-tag is paired with an end-tag that starts with a slash / character.
However, web browsers are quite permissive about HTML. For example, you can omit an end-tag </li> and the browser will try to figure out what you mean. In XML, this is not permissible. When writing XML, pay attention to the following rules:
In XML, you must pay attention to the letter case of the tags; for example, <li> and <LI> are different tags that bear no relation to each other.
Every start-tag must have a matching end-tag. You cannot omit tags, such as </li>. However, if a tag has no end-tag, it must end in />, for example
<img src="hamster.jpeg"/>
When the parser sees the />, it knows not to look for a matching end-tag.
Finally, attribute values must be enclosed in quotes. For example,
<img src="hamster.jpeg" width=400 height=300/>
is not acceptable. You must use
<img src="hamster.jpeg" width="400" height="300"/>
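You can try these rules out with a few lines of Java. The following is a minimal sketch, not part of the chapter's sample code; the class name WellFormedCheck and the method isWellFormed are invented for this illustration. It hands a string to the JDK's XML parser (introduced later in this chapter) and reports whether the parser accepts it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/**
   Illustration only: checks whether a string follows the XML syntax rules.
   The class and method names are hypothetical.
*/
class WellFormedCheck
{
   static boolean isWellFormed(String xml)
   {
      try
      {
         DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         builder.parse(new InputSource(new StringReader(xml)));
         return true;
      }
      catch (SAXException ex)
      {
         return false; // The parser rejected the document
      }
      catch (ParserConfigurationException | IOException ex)
      {
         throw new IllegalStateException(ex);
      }
   }

   public static void main(String[] args)
   {
      // Missing end-tag </li>
      System.out.println(isWellFormed("<ul><li>One</ul>"));
      // Letter case of start-tag and end-tag differs
      System.out.println(isWellFormed("<li>One</LI>"));
      // Attribute value without quotes
      System.out.println(isWellFormed("<img src=\"hamster.jpeg\" width=400/>"));
      // Follows all three rules
      System.out.println(isWellFormed("<img src=\"hamster.jpeg\" width=\"400\"/>"));
   }
}
```

Only the last input is accepted. For the rejected inputs, the parser may also print a diagnostic message to the standard error stream.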
Moreover, there is an important conceptual difference between HTML and XML. HTML has one specific purpose: to describe web documents. In contrast, XML is an extensible syntax that can be used to specify many different kinds of data. For example, the X3D language (the successor to VRML) uses the XML syntax to describe virtual reality scenes. The MathML language uses the XML syntax to describe mathematical formulas. You can use the XML syntax to describe your own data, such as product records or invoices.
Most people who first see XML wonder how an XML document looks inside a browser. However, that is not generally a useful question to ask. Most data that are encoded in XML have nothing to do with browsers. For example, it would probably not be exciting to display an XML document with nothing but product records (such as the ones in the previous section) in a browser. Instead, you will learn in this chapter how to write programs that analyze XML data. XML does not tell you how to display data; it is merely a convenient format for representing data.
XML describes the meaning of data, not how to display them.
In this section, you will see the rules for properly formatted XML. In XML, text and tags are combined into a document. The XML standard recommends that every XML document start with a declaration
<?xml version="1.0"?>
An XML document starts out with an XML declaration and contains elements and text.
Next, the XML document contains the actual data. The data are contained in a root element. For example,
<?xml version="1.0"?> <invoice> more data </invoice>
The root element is an example of an XML element. An element has one of two forms:
<elementName>content</elementName>
or
<elementName/>
In the first case, the element has content—elements, text, or a mixture of both. A good example is a paragraph in an HTML document:
<p>Use XML for <strong>robust</strong> data formats.</p>
The p element contains
The text: "Use XML for "
A strong child element
More text: " data formats."
For XML files that contain documents in the traditional sense of the term, the mixture of text and elements is useful. The XML specification calls this type of content mixed content. But for files that describe data sets—such as our product data—it is better to stick with elements that contain either other elements or text. Content that consists only of elements is called element content.
An element can contain text, child elements, or both (mixed content). For data descriptions, avoid mixed content.
An element can have attributes. For example, the a element of HTML has an href attribute that specifies the URL of a hyperlink:
<a href="http://java.sun.com"> ... </a>
An attribute has a name (such as href) and a value. In XML, the value must be enclosed in single or double quotes.
An element can have multiple attributes, for example
<img src="hamster.jpeg" width="400" height="300"/>
And, as you have already seen, an element can have both attributes and content.
<a href="http://java.sun.com">Sun's Java web site</a>
Programmers often wonder whether it is better to use attributes or child elements. For example, should a product be described as
<product description="Toaster" price="29.95"/>
or
<product> <description>Toaster</description> <price>29.95</price> </product>
The former is shorter. However, it violates the spirit of attributes. Attributes are intended to provide information about the element content. For example, the price element might have an attribute currency that helps interpret the element content. The content 29.95 has a different interpretation in the element
<price currency="USD">29.95</price>
than it does in the element
<price currency="EUR">29.95</price>
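In a program, the distinction shows up as two separate lookups. The following sketch is an illustration only; the class name PriceAttributeDemo and the method describePrice are invented here. It uses the DOM methods getTextContent and getAttribute to read the content of a price element and the attribute that tells how to interpret it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/**
   Illustration only: reads the content and the currency attribute
   of a price element. The class and method names are hypothetical.
*/
class PriceAttributeDemo
{
   static String describePrice(String xml)
   {
      try
      {
         Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
         Element price = doc.getDocumentElement();
         String amount = price.getTextContent(); // The element content
         String currency = price.getAttribute("currency"); // Information about the content
         return amount + " " + currency;
      }
      catch (ParserConfigurationException | SAXException | IOException ex)
      {
         throw new IllegalArgumentException("Cannot parse price", ex);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(describePrice("<price currency=\"USD\">29.95</price>"));
      System.out.println(describePrice("<price currency=\"EUR\">29.95</price>"));
   }
}
```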
You have now seen the components of an XML document that are needed to use XML for encoding data. There are other XML constructs for more specialized situations; see http://www.xml.com/axml/axml.html for more information. In the next section, you will see how to use Java to parse XML documents.
What does your browser do when you load an XML file, such as the items.xml file that is contained in the companion code for this book?
Why does HTML use the src attribute to specify the source of an image instead of <img>hamster.jpeg</img>?
To read and analyze the contents of an XML document, you need an XML parser. A parser is a program that reads a document, checks whether it is syntactically correct, and takes some action as it processes the document.
Two kinds of XML parsers are in common use. Streaming parsers read the XML input one token at a time and report what they encounter: a start tag, text, an end tag, and so on. In contrast, a tree-based parser builds a tree that represents the parsed document. Once the parser is done, you can analyze the tree.
Streaming parsers are more efficient for handling large XML documents whose tree structure would require large amounts of memory. Tree-based parsers, however, are easier to use for most applications—the parse tree gives you a complete overview of the data, whereas a streaming parser gives you the information in bits and pieces.
A streaming parser reports the building blocks of an XML document. A tree-based parser builds a document tree.
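To get a feeling for the "bits and pieces" that a streaming parser reports, consider the following sketch. The chapter does not cover a particular streaming API; this illustration assumes the StAX API (package javax.xml.stream) that is part of the Java library. The class name StreamingDemo and the method events are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/**
   Illustration only: lists the building blocks that a streaming
   parser reports. The class and method names are hypothetical.
*/
class StreamingDemo
{
   static List<String> events(String xml)
   {
      List<String> result = new ArrayList<>();
      try
      {
         XMLInputFactory factory = XMLInputFactory.newInstance();
         // Report adjacent text as a single piece
         factory.setProperty(XMLInputFactory.IS_COALESCING, true);
         XMLStreamReader reader
            = factory.createXMLStreamReader(new StringReader(xml));
         while (reader.hasNext())
         {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT)
               result.add("start " + reader.getLocalName());
            else if (event == XMLStreamConstants.END_ELEMENT)
               result.add("end " + reader.getLocalName());
            else if (event == XMLStreamConstants.CHARACTERS
                  && !reader.isWhiteSpace())
               result.add("text " + reader.getText());
         }
         reader.close();
      }
      catch (XMLStreamException ex)
      {
         throw new IllegalArgumentException("Cannot parse document", ex);
      }
      return result;
   }

   public static void main(String[] args)
   {
      for (String e : events("<product><price>29.95</price></product>"))
         System.out.println(e);
   }
}
```

The program never sees the document as a whole. It is up to you to remember, for example, that the text 29.95 was reported while the price element was open.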
In this section, you will learn how to use a tree-based parser that produces a tree structure according to the DOM (Document Object Model) standard. The DOM standard defines interfaces and methods to analyze and modify the tree structure that represents an XML document.
In order to parse an XML document into a DOM tree, you need a DocumentBuilder. To get a DocumentBuilder object, first call the static newInstance method of the DocumentBuilderFactory class, then call the newDocumentBuilder method on the factory object.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder builder = factory.newDocumentBuilder();
Once you have a DocumentBuilder, you can read a document. To read a document from a file, first construct a File object from the file name, then call the parse method of the DocumentBuilder class.
String fileName = ...; File f = new File(fileName); Document doc = builder.parse(f);
A DocumentBuilder can read an XML document from a file, URL, or input stream. The result is a Document object, which contains a tree.
If the document is located on the Internet, use a URL. The parse method does not accept a URL object directly, so open a stream from the URL:
String urlName = ...; URL u = new URL(urlName); Document doc = builder.parse(u.openStream());
You can also read a document from an arbitrary input stream:
InputStream in = ...; Document doc = builder.parse(in);
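An input stream need not come from a file or a network connection. The following sketch, which is an illustration and not part of the chapter's sample code, parses a document that is held in a string by wrapping its bytes in a ByteArrayInputStream. The class name StreamParseDemo and the method rootName are invented; getDocumentElement is the DOM method that returns the root element.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/**
   Illustration only: parses a document from an input stream.
   The class and method names are hypothetical.
*/
class StreamParseDemo
{
   static String rootName(String xml)
   {
      try
      {
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         DocumentBuilder builder = factory.newDocumentBuilder();
         InputStream in = new ByteArrayInputStream(
            xml.getBytes(StandardCharsets.UTF_8));
         Document doc = builder.parse(in);
         return doc.getDocumentElement().getTagName();
      }
      catch (ParserConfigurationException | SAXException | IOException ex)
      {
         throw new IllegalArgumentException("Cannot parse document", ex);
      }
   }

   public static void main(String[] args)
   {
      String xml = "<?xml version=\"1.0\"?><invoice><total>10</total></invoice>";
      System.out.println(rootName(xml));
   }
}
```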
Once you have created a new document or read a document from a file, you can inspect and modify it.
The easiest method for inspecting a document is the XPath syntax. An XPath describes a node or set of nodes, using a syntax that is similar to directory paths. For example, consider the following XPath, applied to the document in Figure 3 and Figure 4:
/items/item[1]/quantity
This XPath selects the quantity of the first item, that is, the value 8. (In XPath, array positions start with 1. The expression /items/item[0] does not select the first item; it matches no node at all.)
Similarly, you can get the price of the second product as
/items/item[2]/product/price
To get the number of items, use the XPath expression
count(/items/item)
In our example, the result is 2.
The total number of children can be obtained as
count(/items/*)
In our example, the result is again 2 because the items element has exactly two children.
To select attributes, use an @ followed by the name of the attribute. For example,
/items/item[2]/product/price/@currency
would select the currency attribute of the price element if it had one.
Finally, if you have a document with variable or unknown structure, you can find out the name of a child with an expression such as the following:
name(/items/item[1]/*[1])
The result is the name of the first child of the first item, or product.
That is all you need to know about the XPath syntax to analyze simple documents. (See Table 1 for a summary.) There are many more options in the XPath syntax that we do not cover here. If you are interested, look up the specification (http://www.w3.org/TR/xpath) or work through the online tutorial (http://www.zvon.org/xxl/XPathTutorial/General/examples.html).
To evaluate an XPath expression in Java, first create an XPath object:
XPathFactory xpfactory = XPathFactory.newInstance(); XPath path = xpfactory.newXPath();
Then call the evaluate method, like this:
String result = path.evaluate(expression, doc);
Here, expression is an XPath expression and doc is the Document object that represents the XML document. For example, the statement
String result = path.evaluate("/items/item[2]/product/price", doc);
sets result to the string "19.95".
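The following sketch puts the XPath expressions of this section to work. It is an illustration only: the class name XPathDemo and the method evaluate are invented, and the document is assumed to have the same structure as the item list in this section (two items, with quantities 8 and 4), held in a string so that the example is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/**
   Illustration only: evaluates XPath expressions against a small
   item list. The class and method names are hypothetical.
*/
class XPathDemo
{
   private static final String ITEMS = "<items>"
      + "<item><product><description>Ink Jet Refill Kit</description>"
      + "<price>29.95</price></product><quantity>8</quantity></item>"
      + "<item><product><description>4-port Mini Hub</description>"
      + "<price>19.95</price></product><quantity>4</quantity></item>"
      + "</items>";

   static String evaluate(String expression)
   {
      try
      {
         Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(ITEMS)));
         XPath path = XPathFactory.newInstance().newXPath();
         return path.evaluate(expression, doc);
      }
      catch (ParserConfigurationException | SAXException
            | IOException | XPathExpressionException ex)
      {
         throw new IllegalArgumentException("Cannot evaluate " + expression, ex);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(evaluate("/items/item[1]/quantity")); // 8
      System.out.println(evaluate("/items/item[2]/product/price")); // 19.95
      System.out.println(evaluate("count(/items/item)")); // 2
      System.out.println(evaluate("name(/items/item[1]/*[1])")); // product
   }
}
```

Note that evaluate always returns a string. If you need a number, convert the result with Integer.parseInt or Double.parseDouble.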
Now you have all the tools that you need to read and analyze an XML document. The example program at the end of this section puts these techniques to work. (The program uses the LineItem and Product classes from Chapter 12.) The class ItemListParser can parse an XML document that contains a list of product descriptions. Its parse method takes the file name and returns an array list of LineItem objects:
ItemListParser parser = new ItemListParser(); ArrayList<LineItem> items = parser.parse("items.xml");
The ItemListParser class translates each XML element into an object of the corresponding Java class. We first get the number of items:
int itemCount = Integer.parseInt(path.evaluate("count(/items/item)", doc));
For each item element, we gather the product data and construct a Product object:
String description = path.evaluate(
   "/items/item[" + i + "]/product/description", doc);
double price = Double.parseDouble(path.evaluate(
   "/items/item[" + i + "]/product/price", doc));
Product pr = new Product(description, price);
Then we construct a LineItem object in the same way, and add it to the items array list. Here is the complete source code.
ch23/parser/ItemListParser.java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

/**
   An XML parser for item lists.
*/
public class ItemListParser
{
   private DocumentBuilder builder;
   private XPath path;

   /**
      Constructs a parser that can parse item lists.
   */
   public ItemListParser()
         throws ParserConfigurationException
   {
      DocumentBuilderFactory dbfactory
            = DocumentBuilderFactory.newInstance();
      builder = dbfactory.newDocumentBuilder();
      XPathFactory xpfactory = XPathFactory.newInstance();
      path = xpfactory.newXPath();
   }

   /**
      Parses an XML file containing an item list.
      @param fileName the name of the file
      @return an array list containing all items in the XML file
   */
   public ArrayList<LineItem> parse(String fileName)
         throws SAXException, IOException, XPathExpressionException
   {
      File f = new File(fileName);
      Document doc = builder.parse(f);

      ArrayList<LineItem> items = new ArrayList<LineItem>();
      int itemCount = Integer.parseInt(path.evaluate(
            "count(/items/item)", doc));
      for (int i = 1; i <= itemCount; i++)
      {
         String description = path.evaluate(
               "/items/item[" + i + "]/product/description", doc);
         double price = Double.parseDouble(path.evaluate(
               "/items/item[" + i + "]/product/price", doc));
         Product pr = new Product(description, price);
         int quantity = Integer.parseInt(path.evaluate(
               "/items/item[" + i + "]/quantity", doc));
         LineItem it = new LineItem(pr, quantity);
         items.add(it);
      }
      return items;
   }
}
ch23/parser/ItemListParserDemo.java
import java.util.ArrayList;

/**
   This program parses an XML file containing an item list.
   It prints out the items that are described in the XML file.
*/
public class ItemListParserDemo
{
   public static void main(String[] args) throws Exception
   {
      ItemListParser parser = new ItemListParser();
      ArrayList<LineItem> items = parser.parse("items.xml");
      for (LineItem anItem : items)
         System.out.println(anItem.format());
   }
}
Program Run
Ink Jet Refill Kit 29.95 8 239.6
4-port Mini Hub 19.95 4 79.8
Which XPath statement yields the name of the root element of any XML document?
In the preceding section, you saw how to read an XML file into a Document object and how to analyze the contents of that object. In this section, you will see how to do the opposite: build up a Document object and then save it as an XML file. Of course, you can also generate an XML file simply as a sequence of print statements. However, that is not a good idea. It is easy to build an illegal XML document in this way, as when data contain special characters such as < or &.
Recall that you needed a DocumentBuilder object to read in an XML document. You also need such an object to create a new, empty document. Thus, to create a new document, first make a document builder factory, then a document builder, and finally the empty document:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder builder = factory.newDocumentBuilder(); Document doc = builder.newDocument(); // An empty document
Now you are ready to insert nodes into the document. You use the createElement method of the Document interface to create the elements that you need.
Element priceElement = doc.createElement("price");
The Document interface has methods to create elements and text nodes.
You set element attributes with the setAttribute method. For example,
priceElement.setAttribute("currency", "USD");
You have to work a bit harder for inserting text. First create a text node:
Text textNode = doc.createTextNode("29.95");
Then add the text node to the element:
priceElement.appendChild(textNode);
Figure 7 shows the DOM interfaces for XML document nodes. To construct the tree structure of a document, it is a good idea to use a set of helper methods.
We start out with a helper method that creates an element with text:
private Element createTextElement(String name, String text)
{
   Text t = doc.createTextNode(text);
   Element e = doc.createElement(name);
   e.appendChild(t);
   return e;
}
Using this helper method, we can construct a price element like this:
Element priceElement = createTextElement("price", "29.95");
Next, we write a helper method to create a product element from a Product object:
private Element createProduct(Product p)
{
   Element e = doc.createElement("product");
   e.appendChild(createTextElement("description", p.getDescription()));
   e.appendChild(createTextElement("price", "" + p.getPrice()));
   return e;
}
This helper method is called from the createItem helper method:
private Element createItem(LineItem anItem)
{
   Element e = doc.createElement("item");
   e.appendChild(createProduct(anItem.getProduct()));
   e.appendChild(createTextElement("quantity", "" + anItem.getQuantity()));
   return e;
}
A helper method
private Element createItems(ArrayList<LineItem> items)
for the items element is implemented in the same way; see the program listing at the end of this section.
Now you build the document as follows:
ArrayList<LineItem> items = ...; doc = builder.newDocument(); Element root = createItems(items); doc.appendChild(root);
Once you have built the document, you will want to write it to a file. The DOM standard provides the LSSerializer interface for this purpose. Unfortunately, the DOM standard uses very generic methods, which makes the code that is required to obtain a serializer object look like a "magic incantation":
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
Use an LSSerializer to write a DOM document.
Once you have the serializer object, you simply use the writeToString method:
String str = ser.writeToString(doc);
By default, the LSSerializer produces an XML document without spaces or line breaks. As a result, the output looks less pretty, but it is actually more suitable for parsing by another program because it is free from unnecessary white space.
If you want white space, you use yet another magic incantation after creating the serializer:
ser.getDomConfig().setParameter("format-pretty-print", true);
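The serializer also takes care of the special characters that make print statements risky. The following sketch is an illustration only; the class name SerializerDemo and the method serialize are invented. It builds a one-element document whose text contains & and <, then serializes it with the steps shown above.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

/**
   Illustration only: serializes a description element.
   The class and method names are hypothetical.
*/
class SerializerDemo
{
   static String serialize(String description, boolean pretty)
   {
      try
      {
         DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         Document doc = builder.newDocument();
         Element e = doc.createElement("description");
         e.appendChild(doc.createTextNode(description));
         doc.appendChild(e);

         DOMImplementationLS implLS = (DOMImplementationLS)
            doc.getImplementation().getFeature("LS", "3.0");
         LSSerializer ser = implLS.createLSSerializer();
         if (pretty)
            ser.getDomConfig().setParameter("format-pretty-print", true);
         return ser.writeToString(doc);
      }
      catch (ParserConfigurationException ex)
      {
         throw new IllegalStateException(ex);
      }
   }

   public static void main(String[] args)
   {
      // The & and < in the text are written as &amp; and &lt;
      System.out.println(serialize("Salt & Pepper <set>", false));
      System.out.println(serialize("Salt & Pepper <set>", true));
   }
}
```

Because the text was added as a text node, the program never has to think about escaping. A print statement that pastes the same string between two tags would produce a document that a parser rejects.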
Here is an example program that shows how to build and print an XML document.
ch23/builder/ItemListBuilder.java
import java.util.ArrayList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

/**
   Builds a DOM document for an array list of items.
*/
public class ItemListBuilder
{
   private DocumentBuilder builder;
   private Document doc;

   /**
      Constructs an item list builder.
   */
   public ItemListBuilder()
         throws ParserConfigurationException
   {
      DocumentBuilderFactory factory
            = DocumentBuilderFactory.newInstance();
      builder = factory.newDocumentBuilder();
   }

   /**
      Builds a DOM document for an array list of items.
      @param items the items
      @return a DOM document describing the items
   */
   public Document build(ArrayList<LineItem> items)
   {
      doc = builder.newDocument();
      doc.appendChild(createItems(items));
      return doc;
   }

   /**
      Builds a DOM element for an array list of items.
      @param items the items
      @return a DOM element describing the items
   */
   private Element createItems(ArrayList<LineItem> items)
   {
      Element e = doc.createElement("items");

      for (LineItem anItem : items)
         e.appendChild(createItem(anItem));

      return e;
   }

   /**
      Builds a DOM element for an item.
      @param anItem the item
      @return a DOM element describing the item
   */
   private Element createItem(LineItem anItem)
   {
      Element e = doc.createElement("item");

      e.appendChild(createProduct(anItem.getProduct()));
      e.appendChild(createTextElement(
            "quantity", "" + anItem.getQuantity()));

      return e;
   }

   /**
      Builds a DOM element for a product.
      @param p the product
      @return a DOM element describing the product
   */
   private Element createProduct(Product p)
   {
      Element e = doc.createElement("product");

      e.appendChild(createTextElement(
            "description", p.getDescription()));
      e.appendChild(createTextElement(
            "price", "" + p.getPrice()));

      return e;
   }

   private Element createTextElement(String name, String text)
   {
      Text t = doc.createTextNode(text);
      Element e = doc.createElement(name);
      e.appendChild(t);
      return e;
   }
}
ch23/builder/ItemListBuilderDemo.java
import java.util.ArrayList;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

/**
   This program demonstrates the item list builder. It prints the XML
   file corresponding to a DOM document containing a list of items.
*/
public class ItemListBuilderDemo
{
   public static void main(String[] args) throws Exception
   {
      ArrayList<LineItem> items = new ArrayList<LineItem>();
      items.add(new LineItem(new Product("Toaster", 29.95), 3));
      items.add(new LineItem(new Product("Hair dryer", 24.95), 1));

      ItemListBuilder builder = new ItemListBuilder();
      Document doc = builder.build(items);
      DOMImplementation impl = doc.getImplementation();
      DOMImplementationLS implLS
            = (DOMImplementationLS) impl.getFeature("LS", "3.0");
      LSSerializer ser = implLS.createLSSerializer();
      String out = ser.writeToString(doc);

      System.out.println(out);
   }
}
This program uses the Product and LineItem classes from Chapter 12. The LineItem class has been modified by adding getProduct and getQuantity methods.
Program Run
<?xml version="1.0" encoding="UTF-8"?><items><item><product><description>Toaster</description><price>29.95</price></product><quantity>3</quantity></item><item><product><description>Hair dryer</description><price>24.95</price></product><quantity>1</quantity></item></items>
How would you write a document to the file output.xml?
In this section you will learn how to specify rules for XML documents of a particular type. There are several mechanisms for this purpose. The oldest and simplest mechanism is a Document Type Definition (DTD), the topic of this section. We discuss other mechanisms in Special Topic 23.1 on page 938.
Consider a document of type items. Intuitively, items denotes a sequence of item elements. Each item element contains a product and a quantity. A product contains a description and a price. Each of these elements contains text describing the product's description, price, and quantity. The purpose of a DTD is to formalize this description.
A DTD is a sequence of rules that describes
The valid attributes for each element type
The valid child elements for each element type
A DTD is a sequence of rules that describes the valid child elements and attributes for each element type.
Let us first turn to child elements. The valid child elements of an element are described by an ELEMENT rule:
<!ELEMENT items (item*)>
This means that an item list must contain a sequence of 0 or more item elements.
As you can see, the rule is delimited by <! ... >, and it contains the name of the element whose children are to be constrained (items), followed by a description of what children are allowed.
Next, let us turn to the definition of an item node:
<!ELEMENT item (product, quantity)>
This means that the children of an item node must be a product node, followed by a quantity node.
The definition for a product is similar:
<!ELEMENT product (description, price)>
Finally, here are the definitions of the three remaining node types:
<!ELEMENT quantity (#PCDATA)> <!ELEMENT description (#PCDATA)> <!ELEMENT price (#PCDATA)>
The symbol #PCDATA refers to text, called "parsed character data" in XML terminology. The character data can contain any characters. However, certain characters, such as < and &, have special meaning in XML and need to be replaced if they occur in character data. Table 2 shows the replacements for special characters.
The complete DTD for an item list has six rules, one for each element type:
<!ELEMENT items (item*)> <!ELEMENT item (product, quantity)> <!ELEMENT product (description, price)> <!ELEMENT quantity (#PCDATA)> <!ELEMENT description (#PCDATA)> <!ELEMENT price (#PCDATA)>
Let us have a closer look at the descriptions of the allowed children. Table 3 shows the expressions used to describe the children of an element. The EMPTY reserved word is self-explanatory: an element that is declared as EMPTY may not have any children. For example, the HTML DTD defines the img element to be EMPTY: an image has only attributes, specifying the image source, size, and placement, and no children.
Table 23.3. Regular Expressions for Element Content
Rule | Element Content |
---|---|
EMPTY | No children allowed |
(E*) | Any sequence of 0 or more elements E |
(E+) | Any sequence of 1 or more elements E |
(E?) | Optional element E (0 or 1 occurrences allowed) |
(E1, E2, ...) | Element E1, followed by E2, ... |
(E1 \| E2 \| ...) | Element E1 or E2 or ... |
(#PCDATA) | Text only |
(#PCDATA \| E1 \| E2 ...)* | Any sequence of text and elements E1, E2, ..., in any order |
ANY | Any children allowed |
More interesting child rules can be formed with the regular expression operations (*, +, ?, the comma, and |). (See Table 3 and Figure 8. Also see Productivity Hint 11.1 for more information on regular expressions.) You have already seen the * ("0 or more") and , (sequence) operations. The children of an items element are 0 or more item elements, and the children of an item are a sequence of product and quantity elements.
You can also combine these operations to form more complex expressions. For example,
<!ELEMENT section (title, (paragraph | (image, title?))+)>
defines an element section whose children are:
A title element
A sequence of one or more of the following:
paragraph elements
image elements followed by optional title elements
Thus,
<section> <title/> <paragraph/> <image/> <title/> <paragraph/> </section>
is valid, but
<section> <paragraph/> <paragraph/> <title/> </section>
is not—there is no starting title, and the title at the end doesn't follow an image.
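If you are familiar with regular expressions in Java, you can check your understanding of a content model by translating it into a java.util.regex pattern. The following sketch is an analogy only, not how an XML parser works; the class name SectionContentModel, the method isValid, and the encoding of the child sequence (each child name followed by one space) are choices made for this illustration.

```java
import java.util.regex.Pattern;

/**
   Illustration only: mimics the content model
   (title, (paragraph | (image, title?))+) with a Java regular expression.
   The class and method names are hypothetical.
*/
class SectionContentModel
{
   // Each child name is followed by one space, so "title " stands for
   // one title element in the sequence of children
   private static final Pattern MODEL = Pattern.compile(
      "title ((paragraph )|(image (title )?))+");

   /**
      Checks whether a sequence of child element names matches the model.
      @param childNames the names of the children, in document order
      @return true if the sequence is allowed
   */
   static boolean isValid(String... childNames)
   {
      StringBuilder sequence = new StringBuilder();
      for (String name : childNames)
         sequence.append(name).append(' ');
      return MODEL.matcher(sequence).matches();
   }

   public static void main(String[] args)
   {
      // The valid example from the text: prints true
      System.out.println(isValid(
         "title", "paragraph", "image", "title", "paragraph"));
      // The invalid example from the text: prints false
      System.out.println(isValid("paragraph", "paragraph", "title"));
   }
}
```

The comma of the DTD rule becomes plain concatenation in the pattern, while |, ?, and + keep their meaning.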
You already saw the (#PCDATA) rule. It means that the children can consist of any character data. For example, in our product list DTD, the description element can have any character data inside.
You can also allow mixed content—any sequence of character data and specified elements. However, in mixed content, you have no control over the order in which the elements appear. As explained in Quality Tip 23.2 on page 912, you should avoid mixed content for DTDs that describe data sets. This feature is intended for documents that contain both text and markup instructions, such as HTML pages.
Finally, you can allow a node to have children of any type—you should avoid that for DTDs that describe data sets.
You now know how to specify what children a node may have. A DTD also gives you control over the allowed attributes of an element. An attribute description looks like this:
<!ATTLIST Element Attribute Type Default>
The most useful attribute type descriptions are listed in Table 4. The CDATA type describes any sequence of character data. As with #PCDATA, certain characters, such as < and &, need to be encoded (as &lt;, &amp;, and so on). There is no practical difference between the CDATA and #PCDATA types. Simply use CDATA in attribute declarations and #PCDATA in element declarations.
Table 23.4. Common Attribute Types
Attribute Type | Description |
---|---|
CDATA | Any character data |
(V1 \| V2 \| ...) | One of V1, V2, ... |
Rather than allowing arbitrary attribute values, you can specify a finite number of choices. For example, you may want to restrict a currency attribute to U.S. dollar, euro, and Japanese yen. Then use the following declaration:
<!ATTLIST price currency (USD | EUR | JPY) #REQUIRED>
You can use letters, numbers, and the hyphen (-) and underscore (_) characters for the attribute values.
There are other type descriptions that are less common in practice. You can find them in the XML reference (http://www.xml.com/axml/axml.html).
The attribute type description is followed by a "default" declaration. The reserved words that can appear in a "default" declaration are listed in Table 5.
For example, this attribute declaration describes that each price element must have a currency attribute whose value is any character data:
<!ATTLIST price currency CDATA #REQUIRED>
To fulfill this declaration, each price element must have a currency attribute, such as <price currency="USD">. A price without a currency would not be valid.
For an optional attribute, you use the #IMPLIED reserved word instead.
<!ATTLIST price currency CDATA #IMPLIED>
That means that you can supply a currency attribute in a price element, or you can omit it. If you omit it, then the application that processes the XML data implicitly assumes some default currency.
A better choice would be to supply the default value explicitly:
<!ATTLIST price currency CDATA "USD">
That means that the currency attribute is understood to mean USD if the attribute is not specified. An XML parser will then report the value of currency as USD if the attribute was not specified.
Finally, you can state that an attribute can only be identical to a particular value. For example, the rule
<!ATTLIST price currency CDATA #FIXED "USD">
means that a price
element must either not have a currency
attribute at all (in which case the XML parser will report its value as USD
), or specify the currency
attribute as USD
. Naturally, this kind of rule is not very common.
Table 23.5. Attribute Defaults
Default Declaration | Explanation |
---|---|
#REQUIRED | Attribute is required |
#IMPLIED | Attribute is optional |
V | Default attribute, to be used if attribute is not specified |
#FIXED V | Attribute must either be unspecified or contain this value |
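To see how the default declarations affect what a program reads, consider the following sketch. The class AttributeDefaultsDemo and its method currencyFor are invented for this illustration. The method parses a price element that has no currency attribute, using a DTD with the given default declaration, and returns what getAttribute reports. Validation is not turned on here; the sketch assumes that the parser applies attribute defaults from a DTD that is included in the document, as the parser bundled with Java does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultsDemo {
   // Hypothetical helper: parses <price>29.95</price> under the given
   // default declaration and returns the reported currency attribute
   static String currencyFor(String defaultDeclaration) {
      String xml = "<?xml version=\"1.0\"?>\n"
         + "<!DOCTYPE price [\n"
         + "<!ELEMENT price (#PCDATA)>\n"
         + "<!ATTLIST price currency CDATA " + defaultDeclaration + ">\n"
         + "]>\n"
         + "<price>29.95</price>";
      try {
         DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         Document doc = builder.parse(new InputSource(new StringReader(xml)));
         return doc.getDocumentElement().getAttribute("currency");
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      String[] declarations = { "#IMPLIED", "\"USD\"", "#FIXED \"USD\"" };
      for (String declaration : declarations) {
         System.out.println(declaration + " -> ["
            + currencyFor(declaration) + "]");
      }
   }
}
```

With #IMPLIED, getAttribute returns an empty string, and the program must decide what a missing currency means. With an explicit or fixed default, the parser supplies USD.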
You have now seen the most common constructs for DTDs. Using these constructs, you can define your own DTDs for XML documents that describe data sets. In the next section, you will see how to specify which DTD an XML document should use, and how to have the XML parser check that a document conforms to its DTD.
When an XML document references a DTD, you can instruct the parser to check that the document follows the rules of the DTD. That way, the parser can detect errors in the document.
In the preceding section you saw how to develop a DTD for a class of XML documents. The DTD specifies the permitted elements and attributes in the document. An XML document has two ways of referencing a DTD:
An XML document can contain its DTD or refer to a DTD that is stored elsewhere.
A DTD is introduced with the DOCTYPE
declaration. If the document contains its DTD, then the declaration looks like this:
<!DOCTYPE rootElement [ rules ]>
For example, an item list can include its DTD like this:
<?xml version="1.0"?>
<!DOCTYPE items [
<!ELEMENT items (item*)>
<!ELEMENT item (product, quantity)>
<!ELEMENT product (description, price)>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
<items>
   <item>
      <product>
         <description>Ink Jet Refill Kit</description>
         <price>29.95</price>
      </product>
      <quantity>8</quantity>
   </item>
   <item>
      <product>
         <description>4-port Mini Hub</description>
         <price>19.95</price>
      </product>
      <quantity>4</quantity>
   </item>
</items>
However, if the DTD is more complex, then it is better to store it outside the XML document. In that case, you use the SYSTEM
reserved word inside the DOCTYPE
declaration to indicate that the system that hosts the XML processor must locate the DTD. The SYSTEM
reserved word is followed by the location of the DTD. For example, a DOCTYPE
declaration might point to a local file:
<!DOCTYPE items SYSTEM "items.dtd">
Alternatively, the resource might be a URL anywhere on the Web:
<!DOCTYPE items SYSTEM "http://www.mycompany.com/dtds/items.dtd">
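When the location is a relative file name such as items.dtd, the parser needs to know where the document itself is stored. The following sketch shows one way this can work. It assumes that your Java installation permits the parser to read external DTDs; the class ExternalDtdDemo and the temporary directory are invented for this illustration. The sketch writes a DTD and a document into the same directory and parses the document from a File object, so that the parser can resolve items.dtd relative to the document.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ExternalDtdDemo {
   // An external DTD contains only the rules, without a DOCTYPE declaration
   static final String DTD = "<!ELEMENT items (item*)>\n"
      + "<!ELEMENT item (#PCDATA)>\n";
   static final String XML = "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE items SYSTEM \"items.dtd\">\n"
      + "<items><item>Toaster</item></items>";

   // Hypothetical helper: stores both files, parses the document with
   // validation, and returns the name of the root element
   static String parseFromTempDirectory() {
      try {
         Path dir = Files.createTempDirectory("dtd-demo");
         File dtdFile = dir.resolve("items.dtd").toFile();
         File xmlFile = dir.resolve("items.xml").toFile();
         Files.write(dtdFile.toPath(), DTD.getBytes(StandardCharsets.UTF_8));
         Files.write(xmlFile.toPath(), XML.getBytes(StandardCharsets.UTF_8));
         dtdFile.deleteOnExit();
         xmlFile.deleteOnExit();
         dir.toFile().deleteOnExit();

         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         factory.setValidating(true);
         DocumentBuilder builder = factory.newDocumentBuilder();
         builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXParseException {
               throw e; // Treat validation errors as failures
            }
         });
         // Parsing from a File tells the parser where the document is stored
         Document doc = builder.parse(xmlFile);
         return doc.getDocumentElement().getTagName();
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      System.out.println("Root element: " + parseFromTempDirectory());
   }
}
```

If you parse from an input stream instead, the parser has no document location to start from. In that case, the parse method of the DocumentBuilder class that takes an input stream and a system ID string lets you supply one.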
For commonly used DTDs, the DOCTYPE
declaration can contain a PUBLIC
reserved word. For example,
<!DOCTYPE faces-config PUBLIC "-//Sun Microsystems, Inc.//DTD JavaServer Faces Config 1.0//EN" "http://java.sun.com/dtd/web-facesconfig_1_0.dtd">
A program parsing the DTD can look at the public identifier. If it is a familiar identifier, then it need not spend time retrieving the DTD from the URL.
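In Java, a program can look at the public identifier by installing an entity resolver in the document builder. The following is a minimal sketch, not a definitive implementation. The class KnownDtdResolver and the public identifier "-//My Company//DTD Items 1.0//EN" are hypothetical. The resolver keeps the text of familiar DTDs in a map. When the parser asks for a DTD whose public identifier is in the map, the resolver hands back the stored copy; otherwise it returns null, and the parser retrieves the DTD from the URL as usual.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class KnownDtdResolver implements EntityResolver {
   // Hypothetical identifier and DTD text, used only for this illustration
   static final String ITEMS_PUBLIC_ID = "-//My Company//DTD Items 1.0//EN";
   static final String ITEMS_DTD = "<!ELEMENT items (item*)>\n"
      + "<!ELEMENT item (#PCDATA)>\n";
   static final String ITEMS_XML = "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE items PUBLIC \"" + ITEMS_PUBLIC_ID + "\"\n"
      + "   \"http://www.mycompany.com/dtds/items.dtd\">\n"
      + "<items><item>Toaster</item></items>";

   private final Map<String, String> dtdsByPublicId = new HashMap<>();

   void register(String publicId, String dtdText) {
      dtdsByPublicId.put(publicId, dtdText);
   }

   @Override
   public InputSource resolveEntity(String publicId, String systemId) {
      String dtdText = dtdsByPublicId.get(publicId);
      if (dtdText == null) {
         return null; // Unfamiliar identifier: let the parser use the URL
      }
      InputSource source = new InputSource(new StringReader(dtdText));
      source.setPublicId(publicId);
      source.setSystemId(systemId);
      return source;
   }

   // Parses the document with this resolver and returns the root element name
   String rootNameOf(String xml) {
      try {
         DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         builder.setEntityResolver(this);
         Document doc = builder.parse(new InputSource(new StringReader(xml)));
         return doc.getDocumentElement().getTagName();
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      KnownDtdResolver resolver = new KnownDtdResolver();
      resolver.register(ITEMS_PUBLIC_ID, ITEMS_DTD);
      System.out.println("Root element: " + resolver.rootNameOf(ITEMS_XML));
   }
}
```

Because the resolver supplies the DTD for the registered identifier, the parser does not need to contact the web server named in the DOCTYPE declaration.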
When referencing an external DTD, you must supply a URL for locating the DTD.
When you include a DTD with an XML document, then you can tell the parser to validate the document. That means that the parser will check that all child elements and attributes of an element conform to the ELEMENT
and ATTLIST
rules in the DTD. If a document is invalid, then the parser reports an error. To turn on validation, you use the setValidating
method of the DocumentBuilderFactory
class before calling the newDocumentBuilder
method:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(...);
When your XML document has a DTD, you can request validation when parsing.
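How the parser reports an invalid document depends on the error handler of the document builder. As far as we know, the parser bundled with Java prints validation errors to the standard error stream and continues parsing when no handler is installed, so it is a good idea to install your own. The following sketch collects the error messages in a list. The class ValidationReport and its method validationErrors are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationReport {
   static final String DOCTYPE = "<!DOCTYPE items [\n"
      + "<!ELEMENT items (item*)>\n"
      + "<!ELEMENT item (product, quantity)>\n"
      + "<!ELEMENT product (#PCDATA)>\n"
      + "<!ELEMENT quantity (#PCDATA)>\n"
      + "]>\n";
   static final String VALID_BODY
      = "<items><item><product>Toaster</product>"
      + "<quantity>8</quantity></item></items>";
   static final String INVALID_BODY // The quantity element is missing
      = "<items><item><product>Toaster</product></item></items>";

   // Hypothetical helper: returns the validation errors for the given
   // document body, or an empty list if the document is valid
   static List<String> validationErrors(String body) {
      final List<String> messages = new ArrayList<>();
      try {
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         factory.setValidating(true);
         DocumentBuilder builder = factory.newDocumentBuilder();
         builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
               messages.add("line " + e.getLineNumber() + ": "
                  + e.getMessage());
            }
         });
         builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
      } catch (SAXException e) {
         // Fatal errors, such as malformed XML, end the parse
         messages.add("fatal: " + e.getMessage());
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
      return messages;
   }

   public static void main(String[] args) {
      System.out.println("Valid document: " + validationErrors(VALID_BODY));
      System.out.println("Invalid document: "
         + validationErrors(INVALID_BODY));
   }
}
```

A program can then refuse to process a document whose error list is not empty.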
Validation can simplify your code for processing XML documents. For example, if the DTD specifies that the child elements of each item
element are product
and quantity
elements in that order, then you can rely on that fact and don't need to put tedious checks in your code.
When you parse an XML file with a DTD, tell the parser to ignore white space.
If the parser has access to the DTD, it can make another useful improvement. By default, the parser converts all spaces in the input document to text, even if the spaces are only used to logically line up elements. As a result, the document contains text nodes that are wasteful and can be confusing when you analyze the document tree.
To make the parser ignore white space, call the setIgnoringElementContentWhitespace
method of the DocumentBuilderFactory
class.
factory.setValidating(true);
factory.setIgnoringElementContentWhitespace(true);
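You can observe the difference by counting the children of the root element. The following sketch uses a made-up class WhitespaceDemo with a small document whose items element contains two item elements, each on a line of its own. It parses the document twice, once with white space retained and once with white space ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceDemo {
   static final String XML = "<!DOCTYPE items [\n"
      + "<!ELEMENT items (item*)>\n"
      + "<!ELEMENT item (#PCDATA)>\n"
      + "]>\n"
      + "<items>\n"
      + "   <item>Toaster</item>\n"
      + "   <item>Hair dryer</item>\n"
      + "</items>";

   // Hypothetical helper: returns the number of child nodes of the
   // root element, with or without white space text nodes
   static int childCount(boolean ignoreWhitespace) {
      try {
         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
         factory.setValidating(true);
         factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
         DocumentBuilder builder = factory.newDocumentBuilder();
         Document doc = builder.parse(new InputSource(new StringReader(XML)));
         return doc.getDocumentElement().getChildNodes().getLength();
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      System.out.println("White space retained: " + childCount(false));
      System.out.println("White space ignored: " + childCount(true));
   }
}
```

When white space is retained, the root element has five children: the two item elements and three text nodes that hold the line breaks and indentation. When white space is ignored, only the two item elements remain, so a loop over the children no longer needs to skip text nodes.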
Finally, if the parser has access to the DTD, it can fill in default values for attributes. For example, suppose a DTD defines a currency
attribute for a price
element:
<!ATTLIST price currency CDATA "USD">
If a document contains a price
element without a currency
attribute, then the parser can supply the default:
String attributeValue = priceElement.getAttribute("currency");
   // Gets "USD" if no currency specified
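Occasionally a program needs to know whether a value was written in the document or filled in from the DTD. The DOM interface Attr has a getSpecified method for this purpose. The following sketch, with the made-up class SpecifiedDemo, retrieves the attribute node instead of the attribute value. It assumes that the parser records this information, as the parser bundled with Java appears to do.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SpecifiedDemo {
   static final String DOCTYPE = "<!DOCTYPE price [\n"
      + "<!ELEMENT price (#PCDATA)>\n"
      + "<!ATTLIST price currency CDATA \"USD\">\n"
      + "]>\n";

   // Hypothetical helper: parses the given price element and returns
   // the node of its currency attribute
   static Attr currencyOf(String priceElement) {
      try {
         DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
         Document doc = builder.parse(
            new InputSource(new StringReader(DOCTYPE + priceElement)));
         return doc.getDocumentElement().getAttributeNode("currency");
      } catch (Exception e) {
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      Attr defaulted = currencyOf("<price>29.95</price>");
      Attr explicit = currencyOf("<price currency=\"USD\">29.95</price>");
      System.out.println(defaulted.getValue()
         + " specified in document: " + defaulted.getSpecified());
      System.out.println(explicit.getValue()
         + " specified in document: " + explicit.getSpecified());
   }
}
```

Both attribute nodes have the value USD, but only the second one was specified in the document.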
This concludes our discussion of XML. You now know enough XML to put it to work for describing data formats. Whenever you are tempted to use a "quick and dirty" file format, you should consider using XML instead. By using XML for data interchange, your programs become more professional, robust, and flexible. This chapter covers the most important aspects of XML for everyday programming. For more advanced features that can be useful in specialized situations, please see http://www.xml.com/axml/axml.html. Furthermore, XML technology is still undergoing rapid change at the time of this writing. Therefore, it is a good idea to check out the latest developments. Good web sites are http://www.w3c.org/xml, the W3C XML web site, and http://java.sun.com/xml, the Sun Microsystems XML web site.
How can a DTD specify that a product
element can contain a description
and a price
element, in any order?
How can a DTD specify that the description
element has an optional attribute language
?
Describe the purpose of XML and the structure of an XML document.
XML allows you to encode complex data, independent from any programming language, in a form that the recipient can easily parse.
XML files are readable by computer programs and by humans.
XML-formatted data files are resilient to change.
XML describes the meaning of data, not how to display them.
An XML document starts out with an XML declaration and contains elements and text.
An element can contain text, child elements, or both (mixed content). For data descriptions, avoid mixed content.
Elements can have attributes. Use attributes to describe how to interpret the element content.
Use a parser and the XPath language to process an XML document.
A parser is a program that reads a document, checks whether it is syntactically correct, and takes some action as it processes the document.
A streaming parser reports the building blocks of an XML document. A tree-based parser builds a document tree.
A DocumentBuilder
can read an XML document from a file, URL, or input stream. The result is a Document
object, which contains a tree.
An XPath describes a node or node set, using a notation similar to that for directory paths.
Write Java programs that create XML documents.
The Document
interface has methods to create elements and text nodes.
Use an LSSerializer
to write a DOM document.
Explain the use of DTDs for validating XML documents.
A DTD is a sequence of rules that describes the valid child elements and attributes for each element type.
An XML document can contain its DTD or refer to a DTD that is stored elsewhere.
When referencing an external DTD, you must supply a URL for locating the DTD.
When your XML document has a DTD, you can request validation when parsing.
When you parse an XML file with a DTD, tell the parser to ignore white space.
javax.xml.parsers.DocumentBuilder: newDocument, parse
javax.xml.parsers.DocumentBuilderFactory: newDocumentBuilder, newInstance, setIgnoringElementContentWhitespace, setValidating
javax.xml.xpath.XPath: evaluate
javax.xml.xpath.XPathExpressionException
javax.xml.xpath.XPathFactory: newInstance, newXPath
org.w3c.dom.Document: createElement, createTextNode, getImplementation
org.w3c.dom.DOMConfiguration: setParameter
org.w3c.dom.DOMImplementation: getFeature
org.w3c.dom.Element: getAttribute, setAttribute
org.w3c.dom.ls.DOMImplementationLS: createLSSerializer
org.w3c.dom.ls.LSSerializer: getDomConfig, writeToString
R23.1 Give some examples to show the differences between XML and HTML.
R23.2 Design an XML document that describes a bank account.
R23.3 Draw a tree view for the XML document you created in Exercise R23.2.
R23.4 Write the XML document that corresponds to the parse tree in Figure 5.
R23.5 Write the XML document that corresponds to the parse tree in Figure 6.
R23.6 Make an XML document describing a book, with child elements for the author name, the title, and the publication year.
R23.7 Add a description of the book's language to the document of the preceding exercise. Should you use an element or an attribute?
R23.8 What is mixed content? What problems does it cause?
R23.9 Design an XML document that describes a purse containing three quarters, a dime, and two nickels.
R23.10 Explain why a paint program, such as Microsoft Paint, is a WYSIWYG program that is also "what you see is all you've got".
R23.11 Consider the XML file
<purse>
   <coin>
      <value>0.5</value>
      <name lang="en">half dollar</name>
   </coin>
   <coin>
      <value>0.25</value>
      <name lang="en">quarter</name>
   </coin>
</purse>
What are the values of the following XPath expressions?
/purse/coin[1]/value
/purse/coin[2]/name
/purse/coin[2]/name/@lang
name(/purse/coin[2]/*[1])
count(/purse/coin)
count(/purse/coin[2]/name)
R23.12 With the XML file of Exercise R23.11, give XPath expressions that yield:
the value of the first coin.
the number of coins.
the name of the first child element of the first coin
element.
the name of the first attribute of the first coin's name
element. (The expression @*
selects the attributes of an element.)
the value of the lang
attribute of the second coin's name
element.
R23.13 Design a DTD that describes a bank with bank accounts.
R23.14 Design a DTD that describes a library patron who has checked out a set of books. Each book has an ID number, an author, and a title. The patron has a name and telephone number.
R23.15 Write the DTD file for the following XML document:
<?xml version="1.0"?>
<productlist>
   <product>
      <name>Comtrade Tornado</name>
      <price currency="USD">2495</price>
      <score>60</score>
   </product>
   <product>
      <name>AMAX Powerstation 75</name>
      <price>2999</price>
      <score>62</score>
   </product>
</productlist>
R23.16 Design a DTD for invoices, as described in How To 23.3 on page 936.
R23.17 Design a DTD for simple English sentences, as described in Random Fact 23.2 on page 920.
R23.18 Design a DTD for arithmetic expressions, as described in Random Fact 23.2 on page 920.
P23.1 Write a program that can read XML files, such as
<purse>
   <coin>
      <value>0.5</value>
      <name>half dollar</name>
   </coin>
   ...
</purse>
Your program should construct a Purse
object and print the total value of the coins in the purse.
P23.2 Building on Exercise P23.1, make the program read an XML file as described in that exercise. Then print an XML file of the form
<purse>
   <coins>
      <coin>
         <value>0.5</value>
         <name>half dollar</name>
      </coin>
      <quantity>3</quantity>
   </coins>
   <coins>
      <coin>
         <value>0.25</value>
         <name>quarter</name>
      </coin>
      <quantity>2</quantity>
   </coins>
</purse>
P23.3 Repeat Exercise P23.1, using a DTD for validation.
P23.4 Write a program that can read XML files, such as
<bank>
   <account>
      <number>3</number>
      <balance>1295.32</balance>
   </account>
   ...
</bank>
Your program should construct a Bank
object and print the total value of the balances in the accounts.
P23.5 Repeat Exercise P23.4, using a DTD for validation.
P23.6 Enhance Exercise P23.4 as follows: First read the XML file in, then add 10 percent interest to all accounts, and write an XML file that contains the increased account balances.
P23.7 Write a DTD file that describes documents that contain information about countries: name of the country, its population, and its area. Create an XML file that has five different countries. The DTD and XML should be in different files. Write a program that uses the XML file you wrote and prints:
The country with the largest area.
The country with the largest population.
The country with the largest population density (people per square kilometer).
P23.8 Write a parser to parse invoices using the invoice structure described in How To 23.1 on page 909. The parser should parse the XML file into an Invoice
object and print out the invoice in the format used in Chapter 12.
P23.9 Modify Exercise P23.8 to support separate shipping and billing addresses. Supply a modified DTD with your solution.
P23.10 Write a document builder that turns an invoice object, as defined in Chapter 12, into an XML file of the format described in How To 23.1 on page 909.
P23.11 Modify Exercise P23.10 to support separate shipping and billing addresses.
P23.12 Write a program that can read an XML document of the form
<rectangle>
   <x>5</x>
   <y>10</y>
   <width>20</width>
   <height>30</height>
</rectangle>
and draw the shape in a window.
P23.13 Write a program that can read an XML document of the form
<ellipse>
   <x>5</x>
   <y>10</y>
   <width>20</width>
   <height>30</height>
</ellipse>
and draw the shape in a window.
P23.14 Write a program that can read an XML document of the form
<rectangularshape shape="ellipse">
   <x>5</x>
   <y>10</y>
   <width>20</width>
   <height>30</height>
</rectangularshape>
Support shape attributes "rectangle", "roundrectangle"
, and "ellipse"
.
Draw the shape in a window.
P23.15 Write a program that can read an XML document of the form
<polygon>
   <point>
      <x>5</x>
      <y>10</y>
   </point>
   ...
</polygon>
and draw the shape in a window.
P23.16 Write a program that can read an XML document of the form
<drawing>
   <rectangle>
      <x>5</x>
      <y>10</y>
      <width>20</width>
      <height>30</height>
   </rectangle>
   <line>
      <x1>5</x1>
      <y1>10</y1>
      <x2>25</x2>
      <y2>40</y2>
   </line>
   <message>
      <text>Hello, World!</text>
      <x>20</x>
      <y>30</y>
   </message>
</drawing>
and show the drawing in a window.
P23.17 Repeat Exercise P23.16, using a DTD for validation.
Project 23.1 Following Exercise P12.7, design an XML format for the appointments in an appointment calendar. Write a program that first reads in a file with appointments, then another file of the format
<commands>
   <add>
      <appointment> ... </appointment>
   </add>
   ...
   <remove>
      <appointment> ... </appointment>
   </remove>
</commands>
Your program should process the commands and then produce an XML file that consists of the updated appointments.
Project 23.2 Write a program to simulate an airline seat reservation system, using XML documents. Reference Exercise P12.8 for the airplane seat information. The program reads a seating chart, in an XML format of your choice, and a command file, in an XML format of your choice, similar to the command file of the preceding exercise. Then the program processes the commands and produces an updated seating chart.
Your answer should look similar to this:
<student>
   <name>James Bond</name>
   <id>007</id>
</student>
Most browsers display a tree structure that indicates the nesting of the tags. Some browsers display nothing at all because they can't find any HTML tags.
The text hamster.jpg
is never displayed, so it should not be a part of the document. Instead, the src
attribute tells the browser where to find the image that should be displayed.
29.95.
name(/*[1])
.
The createTextElement
method is useful for creating other documents.
First construct a string, as described, and then use a PrintWriter
to save the string to a file.
<!ELEMENT item (product, quantity?)>
<!ELEMENT product ((description, price) | (price, description))>
<!ATTLIST description language CDATA #IMPLIED>