The Document Object Model, unlike SAX, has its origins in the World Wide Web Consortium (W3C). Whereas SAX is public-domain software, developed through long discussions on the XML-dev mailing list, DOM is a standard just as the actual XML specification itself is. The DOM is also not designed specifically for Java, but to represent the content and model of documents across all programming languages and tools. Bindings exist for JavaScript, Java, CORBA, and other languages, allowing the DOM to be a cross-platform and cross-language specification.
In addition to being different from SAX in regard to standardization and language bindings, the DOM is organized into “levels” instead of versions. DOM Level One is an accepted Recommendation, and you can view the completed specification at http://www.w3.org/TR/REC-DOM-Level-1/. Level One details the functionality and navigation of content within a document. A document in the DOM is not limited to XML; it can be HTML or another content model as well. Level Two, which should be finalized in mid-2000, builds on Level One by supplying modules and options aimed at specific content models, such as XML, HTML, and Cascading Style Sheets (CSS). These less-generic modules begin to “fill in the blanks” left by the more general tools provided in DOM Level One. You can view the current Level Two Candidate Recommendation at http://www.w3.org/TR/DOM-Level-2/. Level Three is already being worked on, and should add even more facilities for specific types of documents, such as validation handlers for XML.
Using the DOM for a specific programming language requires a set of interfaces and classes that define and implement the DOM itself. Because the DOM specification focuses on the model of a document rather than on specific methods, language bindings must be developed to represent the conceptual structure of the DOM for use in Java or any other language. These language bindings then serve as APIs for us to manipulate documents in the fashion outlined in the DOM specification.
We are obviously concerned with the Java language binding. The latest Java bindings, the DOM Level Two Java bindings, can be downloaded from http://www.w3.org/TR/DOM-Level-2/java-binding.html. The classes you should be able to add to your IDE or class path are all in the org.w3c.dom package (and its subpackages). However, before downloading these yourself, you should check the XML parser and XSLT processor you purchased or downloaded; like the SAX package, the DOM package is often included with these products. This also ensures a correct match between your parser, processor, and the version of DOM that is supported.
Most processors do not generate their DOM input themselves, but instead rely on an XML parser that is capable of generating a DOM tree. For this reason, it is often the XML parser that will have the needed DOM binding classes and not the XSLT processor. In addition, this maintains the loose coupling between parser and processor, letting one or the other be substituted with comparable products. As Apache Xalan, by default, uses Apache Xerces for XML parsing and DOM generation, it is the level of support for DOM that Xerces provides that is of interest to us.
To give you an idea of how DOM works, we are going to look at how the Apache Xalan processor and other programs that need DOM input receive an XML document in the DOM tree structure. This will give us our first look at the DOM Java language binding, and start us towards understanding the concepts behind handling XML documents using the DOM.
One thing that the DOM does not specify is how a DOM tree is created. The specification instead focuses on the structure and APIs for manipulating this tree, which leaves a lot of latitude in how DOM parsers are implemented. Unlike SAX, where the XMLReaderFactory class dynamically loads an XMLReader implementation, you will need to import and instantiate your vendor’s DOM parser class explicitly. To begin, create a new Java file and call it DOMParserDemo.java. We will look at how to build a simple DOM parsing program to read in an XML document and print out its contents. Create the structure and skeleton of your example class first, as shown in Example 7-2.
Example 7-2. DOMParserDemo Class
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
/**
 * <b><code>DOMParserDemo</code></b> will take an XML file and display
 * the document using DOM.
 *
 * @version 1.0
 */
public class DOMParserDemo {
    /**
     * <p>
     * This parses the file, and then prints the document out
     * using DOM.
     * </p>
     *
     * @param uri <code>String</code> URI of file to parse.
     */
    public void performDemo(String uri) {
        System.out.println("Parsing XML File: " + uri + "\n\n");
        // Instantiate your vendor's DOM parser implementation
        DOMParser parser = new DOMParser( );
        try {
            // parser.parse(uri);
        } catch (Exception e) {
            System.out.println("Error in parsing: " + e.getMessage( ));
        }
    }
    /**
     * <p>
     * This provides a command-line entry point for this demo.
     * </p>
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Usage: java DOMParserDemo [XML URI]");
            System.exit(0);
        }
        String uri = args[0];
        DOMParserDemo parserDemo = new DOMParserDemo( );
        parserDemo.performDemo(uri);
    }
}
This is set up in a fashion similar to our earlier SAXParserDemo class, but imports the Apache Xerces DOMParser class directly and instantiates it. We have commented out our actual invocation of the parse( ) method for the moment; before looking at what is involved in parsing a document into a DOM structure, we need to address issues of vendor neutrality in our choice of parsers.
Keep in mind that this is simple and works great for many applications, but is not portable across parser implementations as our SAX example was. The initial impulse would be to use Java constructs like Class.forName(parserClass).newInstance( ) to get an instance of the correct vendor parser class. However, different DOM implementations behave in a variety of fashions: sometimes the parse( ) method returns an org.w3c.dom.Document object (which we look at next); sometimes the parser class provides a getDocument( ) method; and sometimes different parameter types (InputSource, InputStream, String, URI, etc.) are required to supply the parse( ) method with the URI. In other words, while the DOM tree created is portable, the method of obtaining that tree is not, at least not without fairly complex reflection and dynamic class and method loading.
Remember that in SAX, the focus of interest in the parser was the lifecycle of the process, as all the callback methods provided us “hooks” into the data as it was being parsed. In the DOM, the focus of interest lies in the output from the parsing process. Until the entire document has been parsed and added into the output tree structure, the data is not in a usable state. The output of a parse intended for use with the DOM interfaces is an org.w3c.dom.Document object. This object acts as a “handle” to the tree your XML data is in, and in terms of the element hierarchy we have continually discussed, it is equivalent to one level above the root element in your XML document. In other words, it owns each and every element in the XML document input.
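To make the idea of the Document as a “handle” concrete, here is a minimal sketch. It assumes the JAXP classes bundled with current JDKs (javax.xml.parsers) as a stand-in for a vendor parser class, and the class name DocumentHandleSketch and the tiny inline document are purely illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentHandleSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // True when the Document sits directly above the root element
    // and is also reported as that element's owner.
    static boolean documentOwnsRoot(Document doc) {
        Element root = doc.getDocumentElement();
        return root.getParentNode() == doc && root.getOwnerDocument() == doc;
    }

    public static void main(String[] args) {
        Document doc = parse("<Book><Title>Java and XML</Title></Book>");
        System.out.println("Is a Document node: "
                + (doc.getNodeType() == Node.DOCUMENT_NODE));
        System.out.println("Root element: "
                + doc.getDocumentElement().getNodeName());
        System.out.println("Document is parent of root: "
                + documentOwnsRoot(doc));
    }
}
```

The check in documentOwnsRoot( ) is the “one level above the root element” relationship expressed in code: the root element's parent is the Document itself.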
Unfortunately, the standardization with regard to DOM is focused on manipulating this data instead of obtaining it. This has resulted in some variety in the mechanism used to obtain the Document object after a parse. In many implementations, such as older versions of the IBM XML4J parser, the parse( ) method returned the Document object. The code to use such an implementation of a DOM parser would look like this:
public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser( );
    try {
        Document doc = parser.parse(uri);
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage( ));
    }
}
Most newer parsers, such as Apache Xerces, do not follow this methodology. In order to maintain a standard interface across both SAX and DOM parsers, the parse( ) method in these parsers returns void, as our SAX example of using the parse( ) method did. This change allows an application to use a DOM parser class and a SAX parser class interchangeably; however, it requires an additional method to obtain the Document object result from the XML parsing. In Apache Xerces, this method is named getDocument( ). Using this type of parser, we can add the following to our code to obtain the resulting DOM tree from parsing our input file:
public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser( );
    try {
        parser.parse(uri);
        Document doc = parser.getDocument( );
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage( ));
    }
}
Also be sure to import the necessary DOM class:
import org.w3c.dom.Document;
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
You should consult your vendor documentation to determine which of these mechanisms you need to employ to get the DOM result of your parse. In the next chapter, we look at Sun’s JAXP API and other ways to have a more standardized means of accessing a DOM tree from any parser implementation. Although there is some variance in getting this result, all of the uses of this result we will look at are standard across the DOM specification, so you should not have to worry about any other implementation curveballs as we continue in the chapter.
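As a preview of what a more standardized route can look like, the following sketch uses the JAXP classes that ship with current JDKs. This is an assumption on our part rather than the Xerces-specific code used in this chapter, and the class name JaxpLoadSketch and the inline XML are illustrative only. Notice that no vendor class is imported.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class JaxpLoadSketch {

    // Parses XML text into a DOM tree without naming any vendor class.
    static Document load(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // With JAXP, parse() hands back the Document directly
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = load("<Book><Title>Java and XML</Title></Book>");
        System.out.println("Root element: "
                + doc.getDocumentElement().getNodeName());
        // The concrete class depends on whichever parser JAXP locates
        System.out.println("Implementation: " + doc.getClass().getName());
    }
}
```

The second line of output will vary from one environment to another, which is the point: the application code only ever refers to the org.w3c.dom interfaces.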
Now that we have this “tree” object, let’s look at doing something useful with it. For our example, we want to move through the tree structure we have access to and print out the tree of our XML data. The easiest way to do this is to take our initial Document object and at each node in the tree, process the current node and then recursively process the children of that node. This should sound familiar to you if you have ever done any work with tree structures. To understand how this works, we need to see the basic objects that our XML data will be accessible through; we have already seen the Document object, and this and the other core DOM object interfaces are listed here. These interfaces, shown in Figure 7-2 (which includes the less frequently used DOM interfaces as well), will be the means by which we manipulate all data within our DOM tree.
In addition to absorbing these interfaces, pay special attention to the Node interface, and notice that it is the base interface for the other interfaces. Anytime you see a design pattern like this, you should immediately think of runtime object-type discovery. In other words, we can write a method that takes in a node, discovers what type of DOM structure that node is, and prints it in the correct fashion. This allows us to easily print our entire DOM tree with one method! Once we print the node, we can use the common methods that are available to move on to the next sibling element in the tree, get the attributes if it is an element, and handle any other special cases that arise. Then, iterating through the child nodes, we can recursively invoke the same method on each, until our entire DOM tree is printed. This is a simple, clean way of handling DOM trees. We take a detailed look at how to accomplish this now.
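Before building the printing method, it may help to see the bare recursion on its own. The sketch below simply counts every node reachable from a starting node, moving from child to child with getFirstChild( ) and getNextSibling( ). It assumes JAXP (from the JDK) for parsing, and the class and method names are illustrative, not part of DOMParserDemo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeCountSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Counts this node plus every descendant, whatever its type.
    // Attributes are not children, so they are not counted.
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b>x</b><c/></a>");
        // Document, a, b, the text "x", and c
        System.out.println("Nodes in tree: " + countNodes(doc));
    }
}
```

Because every DOM structure is a Node, the same method works whether it is handed the Document, an element, or a piece of text.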
Because our Document object itself is a DOM Node, we can pass it unchanged as the initial argument to our printing method. We can create the skeleton of this method, but first we need to add the appropriate import statements to our Java file:
import org.w3c.dom.Document;
import org.w3c.dom.Node;
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
We then can add our method signature, which takes in a DOM Node and will print it out:
/**
 * <p>
 * This will print a DOM <code>Node</code> and then recurse
 * on its children.
 * </p>
 *
 * @param node <code>Node</code> object to print.
 */
public void printNode(Node node) {
    // Determine the type of node
    // Print the node
    // Recurse on children
}
Finally, with our skeleton method in place, we can invoke the method on our initial Document object, letting recursion continue the printing until the tree is completely output. This works because the Document interface extends from the common DOM Node interface:
public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser( );
    try {
        parser.parse(uri);
        Document doc = parser.getDocument( );
        // Print the document from the DOM tree
        printNode(doc);
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage( ));
    }
}
At this point, you can compile your Java source file. Although there is no output, you can see that getting an application that uses a DOM output from an XML parser up and running is fairly simple. We next look at making this a usable demonstration.
Once within our printing method, our first task is to determine what type of node we have. Although we could approach this with a Java methodology, using instanceof and Java reflection, the DOM bindings for Java make our task much simpler. The Node interface defines a helper method, getNodeType( ), which returns an integer value. This value can be compared against a set of constants defined within the Node interface, and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which we can use within our method to break up printing into logical sections. We compare the type of our node with the most common node types; although there are some additional node types defined (see Figure 7-2), these are the most common and the concepts here can be applied to the less frequent node types as well:
public void printNode(Node node) {
    // Determine action based on node type
    switch (node.getNodeType( )) {
        case Node.DOCUMENT_NODE:
            // Print the contents of the Document object
            break;
        case Node.ELEMENT_NODE:
            // Print the element and its attributes
            break;
        case Node.TEXT_NODE:
        case Node.CDATA_SECTION_NODE:
            // Print the textual data
            break;
        case Node.PROCESSING_INSTRUCTION_NODE:
            // Print the processing instruction
            break;
        case Node.ENTITY_REFERENCE_NODE:
            // Print the entity reference
            break;
        case Node.DOCUMENT_TYPE_NODE:
            // Print the DTD declaration
            break;
    }
}
Notice that for CDATASection and Text nodes, we handle output in a single case statement. In this example, we are not concerned with whether the text was in a CDATA section or not in the original document; we just want to print the text within the element. We now can add printing and recursion to the appropriate blocks of code, and have our application printing our DOM tree quickly and easily. We look at how to do this for the various DOM node types next.
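If you want to experiment with getNodeType( ) on its own first, the following sketch maps a node to a descriptive label using the same switch structure. It assumes JAXP (bundled with the JDK) in its default configuration for parsing, and the labels, class name, and sample document are our own illustrative choices.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTypeSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns a label for the node, grouping Text and CDATA together
    // just as the printing method does.
    static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "Document";
            case Node.ELEMENT_NODE:
                return "Element";
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                return "Text";
            case Node.PROCESSING_INSTRUCTION_NODE:
                return "ProcessingInstruction";
            case Node.ENTITY_REFERENCE_NODE:
                return "EntityReference";
            case Node.DOCUMENT_TYPE_NODE:
                return "DocumentType";
            default:
                return "Other";
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a>hi<![CDATA[x]]></a>");
        Node root = doc.getDocumentElement();
        System.out.println(typeName(doc));
        System.out.println(typeName(root));
        System.out.println(typeName(root.getFirstChild()));
        System.out.println(typeName(root.getLastChild()));
    }
}
```

The default branch is worth keeping in your own code as well; node types such as comments fall through the cases shown here.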
Now that you have seen how the concepts and structure of DOM work, you should only have to learn the syntax for the different node types to have a solid understanding of the DOM. In other words, you can now treat the DOM Java bindings as just another API, like the JNDI packages or the servlet extensions. Learning the concepts is typically the most difficult part of mastery, while using correct syntax only requires a reference and some example code. In this section, we give you that example code, demonstrating how to print out the most common node types as well as looking at traversing the DOM tree. You can then use the online documentation for the DOM at http://www.w3.org/DOM as your reference, as well as Appendix A, which has an API reference for SAX, DOM, and JDOM (which we look at in the next chapter).
Because the Document is an extension of the Node interface itself, we can use it interchangeably with our other node types. However, it is a bit of a special case, as it contains the root element as well as the XML document’s DTD and some other special information not within the XML element hierarchy. Thus we need to extract the root element and pass that to our printing function when we run across this node. We also print out a simple version declaration to make our output conform to the XML specification:
case Node.DOCUMENT_NODE:
    System.out.println("<?xml version=\"1.0\"?>\n");
    Document doc = (Document)node;
    printNode(doc.getDocumentElement( ));
    break;
Unfortunately, DOM Level 2 (as well as SAX 2.0) does not expose the XML declaration. This may not seem to be a big problem, until you consider that the encoding of the document is included in this declaration. DOM Level 3 is expected to address this deficiency, and should be available in a draft form in mid- to late 2000. Be careful not to write DOM applications that depend on this information until this problem is corrected.
Since we need to access a Document-specific method, we must first cast the Node implementation passed to the printing method to the Document interface; we can then invoke its getDocumentElement( ) method to obtain the root element of the XML input document, and then in turn pass that on to the printing method, starting the recursion and traversal of the DOM tree.
Of course, our most common task will be to take a DOM Element and print out its name, attributes, and value, and then print its children. As you would suspect, all of these are easily accomplished with DOM method calls. First we need to get the name of the XML element, which is available through the getNodeName( ) method within the Node interface, and print it out. For now, we can leave space to add in the attributes, and then print out the closing angle bracket on our element. We then need to get the children of the current element and print these as well. A Node’s children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList.
While most Java APIs use a Vector, Hashtable, or other Java collections class to support traversing a list, the DOM API returns a DOM-specific interface, the NodeList interface. Don’t expect to be handling attributes through the Java methods you are used to; instead, you will have to familiarize yourself with these structures when using DOM.
It is trivial to obtain the length of this list, and then iterate through the children calling the printing method on each, continuing our recursion. Finally, we can output the closing of our element.
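As a small illustration of working with a NodeList, the sketch below gathers the names of an element's child elements by walking the list with getLength( ) and item( ). It assumes JAXP from the JDK for parsing, and the helper name childElementNames( ) and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildNamesSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Collects the names of the direct child elements of a node,
    // copying them out of the DOM-specific NodeList into a Java List.
    static List<String> childElementNames(Node parent) {
        List<String> names = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse("<Contents><Chapter/><Chapter/><Index/></Contents>");
        System.out.println(childElementNames(doc.getDocumentElement()));
    }
}
```

Copying into a java.util.List is only a convenience for the example; the printing code in this chapter works on the NodeList directly.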
First let’s add in the new DOM interface we need:
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
Now that we have the needed classes and interfaces accessible by their class name, we can add in the code we have been discussing. We obtain the name of the Node implementation, print it in XML format, print each of its children (checking for null to ensure that children exist), and then close our element. Although this code doesn’t yet handle attributes, it should take care of printing out our XML elements for us throughout the entire DOM tree:
case Node.ELEMENT_NODE:
    String name = node.getNodeName( );
    System.out.print("<" + name);
    // Print out attributes
    System.out.println(">");
    // recurse on each child
    NodeList children = node.getChildNodes( );
    if (children != null) {
        for (int i=0; i<children.getLength( ); i++) {
            printNode(children.item(i));
        }
    }
    System.out.println("</" + name + ">");
    break;
This seems fairly easy, right? It is just as simple to iterate through the attributes of our DOM Element. We use the getAttributes( ) method, again defined in the Node interface, to get a list of XML attributes, returned in a NamedNodeMap. This DOM interface is used for a collection of nodes that are unique with regard to their name, so it is ideal for storing a list of our XML element’s attributes. We then iterate through this list, printing out the name and value of each attribute. This is similar to the way we handled iteration through our element’s child nodes, and we use the getNodeName( ) and getNodeValue( ) methods to obtain the values needed for printing. Let’s take a look at how to do this here; add the import statement needed for NamedNodeMap and make the following changes to our code:
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
...
case Node.ELEMENT_NODE:
    String name = node.getNodeName( );
    System.out.print("<" + name);
    NamedNodeMap attributes = node.getAttributes( );
    for (int i=0; i<attributes.getLength( ); i++) {
        Node current = attributes.item(i);
        System.out.print(" " + current.getNodeName( ) +
                         "=\"" + current.getNodeValue( ) +
                         "\"");
    }
    System.out.println(">");
    // recurse on each child
    NodeList children = node.getChildNodes( );
    if (children != null) {
        for (int i=0; i<children.getLength( ); i++) {
            printNode(children.item(i));
        }
    }
    System.out.println("</" + name + ">");
    break;
...
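The attribute handling can also be tried in isolation. The sketch below builds the attribute portion of a start tag as a String. It is a variation under stated assumptions rather than the code above: it uses JAXP from the JDK for parsing, guards against the null that getAttributes( ) returns for non-element nodes, and sorts the attributes by name (an addition of ours, since a NamedNodeMap promises no particular order).

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Formats a node's attributes as they would appear in a start tag.
    static String formatAttributes(Node node) {
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            return ""; // only Element nodes carry attributes
        }
        // Sort by name so the result does not depend on map order
        Map<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node current = attributes.item(i);
            sorted.put(current.getNodeName(), current.getNodeValue());
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            out.append(' ').append(entry.getKey())
               .append("=\"").append(entry.getValue()).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<Chapter number=\"1\" focus=\"XML\"/>");
        System.out.println("<Chapter"
                + formatAttributes(doc.getDocumentElement()) + ">");
    }
}
```

Like the chapter's code, this sketch does not escape special characters inside attribute values; a production serializer would need to.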
At this point we have done quite a bit of work! With only a couple of hundred lines of code complete, we can iterate through a DOM tree and print out elements and attributes. In fact, it is this ease of use, particularly as compared to SAX, that has made DOM such a prevalent and popular way to handle XML data. Certainly it is not always the best choice, as we have already discussed and will again, but it provides a simple representation of XML that is easy to move through.
If you haven’t already, compile your Java source file and run it on the XML table of contents file we have been using. You should get output similar to the fragment shown in Example 7-3.
Example 7-3. DOMParserDemo Output
D:\prod\JavaXML>java DOMParserDemo D:\prod\JavaXML\contents.xml
Parsing XML File: D:\prod\JavaXML\contents.xml

<?xml version="1.0"?>

<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
<JavaXML:Title>
</JavaXML:Title>
<JavaXML:Contents>
<JavaXML:Chapter focus="XML">
<JavaXML:Heading>
</JavaXML:Heading>
<JavaXML:Topic subSections="4">
</JavaXML:Topic>
<JavaXML:Topic subSections="5">
</JavaXML:Topic>
<JavaXML:Topic subSections="4">
</JavaXML:Topic>
<JavaXML:Topic subSections="1">
</JavaXML:Topic>
</JavaXML:Chapter>
...
This does exactly what we expected, but perhaps not in a very clear or usable way. It is very difficult to see where elements begin and end without the indenting we used in our original document. As you remember, whitespace between elements is stripped and typically ignored by parsers, so we need to add some whitespace back in to help in our output formatting. We can pass a simple indentation string into our printing method to handle this, and lengthen that string as we recurse through the DOM tree:
/**
* <p>
* This will print a DOM <code>Node</code> and then recurse
* on its children.
* </p>
*
* @param node <code>Node</code> object to print.
* @param indent <code>String</code> spacing to insert
* before this <code>Node</code>
*/
public void printNode(Node node, String indent) {
    switch (node.getNodeType( )) {
        case Node.DOCUMENT_NODE:
            System.out.println("<?xml version=\"1.0\"?>\n");
            Document doc = (Document)node;
            printNode(doc.getDocumentElement( ), "");
            break;
        case Node.ELEMENT_NODE:
            String name = node.getNodeName( );
            System.out.print(indent + "<" + name);
            NamedNodeMap attributes = node.getAttributes( );
            for (int i=0; i<attributes.getLength( ); i++) {
                Node current = attributes.item(i);
                System.out.print(" " + current.getNodeName( ) +
                                 "=\"" + current.getNodeValue( ) +
                                 "\"");
            }
            System.out.println(">");
            // recurse on each child
            NodeList children = node.getChildNodes( );
            if (children != null) {
                for (int i=0; i<children.getLength( ); i++) {
                    printNode(children.item(i), indent + "  ");
                }
            }
            System.out.println(indent + "</" + name + ">");
            break;
        case Node.TEXT_NODE:
        case Node.CDATA_SECTION_NODE:
            // Print the textual data
            break;
        case Node.PROCESSING_INSTRUCTION_NODE:
            // Print the processing instruction
            break;
        case Node.ENTITY_REFERENCE_NODE:
            // Print the entity reference
            break;
        case Node.DOCUMENT_TYPE_NODE:
            // Print the DTD declaration
            break;
    }
}
Then make a small change to feed our method an initial indent of an empty string:
public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser( );
    try {
        parser.parse(uri);
        Document doc = parser.getDocument( );
        // Print the document from the DOM tree and
        // feed it an initial indentation of nothing
        printNode(doc, "");
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage( ));
    }
}
With this minor change, you can see in Example 7-4 that our output is much more readable.
Example 7-4. DOMParserDemo Output with Indentation
D:\prod\JavaXML>java DOMParserDemo D:\prod\JavaXML\contents.xml
Parsing XML File: D:\prod\JavaXML\contents.xml

<?xml version="1.0"?>

<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
  <JavaXML:Title>
  </JavaXML:Title>
  <JavaXML:Contents>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>
      </JavaXML:Heading>
      <JavaXML:Topic subSections="4">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="5">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="4">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="1">
      </JavaXML:Topic>
    </JavaXML:Chapter>
...
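The effect of threading an indentation string through the recursion is easier to verify when the result is collected in a String instead of being printed. The following sketch does just that for elements only. It assumes JAXP from the JDK for parsing and a two-space indent per level; the class name IndentSketch and the render( ) methods are illustrative and not part of DOMParserDemo.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class IndentSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Convenience entry point: start with no indentation.
    static String render(Node node) {
        StringBuilder out = new StringBuilder();
        render(node, "", out);
        return out.toString();
    }

    // Appends start and end tags for each element, adding two spaces
    // to the indent each time we descend one level.
    static void render(Node node, String indent, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                render(((Document) node).getDocumentElement(), "", out);
                break;
            case Node.ELEMENT_NODE:
                String name = node.getNodeName();
                out.append(indent).append('<').append(name).append(">\n");
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    render(children.item(i), indent + "  ", out);
                }
                out.append(indent).append("</").append(name).append(">\n");
                break;
            default:
                break; // other node types are ignored in this sketch
        }
    }

    public static void main(String[] args) {
        System.out.print(render(parse("<a><b><c/></b></a>")));
    }
}
```

Each recursive call receives a longer indent, but the caller's own indent is untouched, so the closing tag lines up with its opening tag without any bookkeeping.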
With this formatting in place, we are ready to add in the textual data values of our elements.
If you are wondering when we are going to handle integer nodes, numeric nodes, or Boolean nodes, we are not. As you should recall, all XML data within an element was reported through the SAX characters( ) callback. This should have given you your first clue that an XML parser handles all data as text, and an application must make data type conversions if needed. Thus, the DOM Text and CDATASection interfaces are all we need to worry about to print our elements’ values. Printing is quite simple, as we only need to use the now familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out:
case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
System.out.print(node.getNodeValue( ));
break;
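Since the parser hands everything over as text, any conversion is the application's job. The sketch below gathers an element's Text and CDATASection children into one String and converts an attribute value to an int. It assumes JAXP (from the JDK) in its default configuration for parsing; the helper names and the sample Topic element are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextValueSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Concatenates the values of the direct Text and CDATA children.
    static String textOf(Node element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            switch (child.getNodeType()) {
                case Node.TEXT_NODE:
                case Node.CDATA_SECTION_NODE:
                    text.append(child.getNodeValue());
                    break;
                default:
                    break; // skip child elements and other nodes
            }
        }
        return text.toString();
    }

    // The DOM only supplies a String; the application converts it.
    static int subSections(Element topic) {
        return Integer.parseInt(topic.getAttribute("subSections").trim());
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<Topic subSections=\"4\">Nature of <![CDATA[<XML>]]></Topic>");
        Element topic = doc.getDocumentElement();
        System.out.println(textOf(topic));
        System.out.println(subSections(topic) + 1);
    }
}
```

Integer.parseInt( ) will throw a NumberFormatException if the attribute is missing or not numeric, so real code would want to handle that case.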
With that in place, we are very close to having a complete DOM traversal. However, before moving on, let’s take a look at a few less common but useful DOM interfaces (and their corresponding Node types): ProcessingInstruction, DocumentType, and EntityReference, all of which we have in our document.
The DOM bindings for Java define an interface to handle processing instructions that are within the input XML document. This is useful, as these instructions do not follow the same markup model that XML elements and attributes do, but are still important for applications to know about. In our example document, we give instructions to the XSLT processor informing it about a stylesheet, as well as an instruction to a publishing framework, Apache Cocoon, letting it know what processing needs to occur. If you still have these PIs commented out in your XML table of contents, you should uncomment these now:
<?xml version="1.0"?>
<?xml-stylesheet href="XSLJavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSLJavaXML.wml.xsl" type="text/xsl"
                 media="wap"?>
<?cocoon-process type="xslt"?>
<!DOCTYPE JavaXML:Book SYSTEM "DTDJavaXML.dtd">
<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
The PI node in the DOM is a little bit of a break from what we have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all of the data within a PI as a single String. This allows us to quickly output the PI to the screen; however, we still use getNodeName( ) to get the name of the PI. If you were writing an application that could receive PIs from an XML document, you might prefer to use the ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI’s format. With this understanding, we can add in the code to print out our PIs:
case Node.PROCESSING_INSTRUCTION_NODE:
    System.out.println("<?" + node.getNodeName( ) +
                       " " + node.getNodeValue( ) +
                       "?>");
    break;
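To see that the two sets of method names really do expose the same data, here is a short sketch using the ProcessingInstruction interface. It assumes JAXP from the JDK for parsing; the class name and the one-line sample document are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class ProcessingInstructionSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Rebuilds the PI using the PI-specific accessor names.
    static String formatPI(ProcessingInstruction pi) {
        return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
    }

    public static void main(String[] args) {
        Document doc = parse("<?cocoon-process type=\"xslt\"?><Book/>");
        // The PI precedes the root element, so it is the first child
        ProcessingInstruction pi = (ProcessingInstruction) doc.getFirstChild();
        System.out.println(formatPI(pi));
        System.out.println("Same name:  "
                + pi.getTarget().equals(pi.getNodeName()));
        System.out.println("Same value: "
                + pi.getData().equals(pi.getNodeValue()));
    }
}
```

Which pair of methods you use is largely a matter of readability; the cast is the only extra cost of the PI-specific names.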
If you compile and run the sample program with this change, you may be in for a surprise: none of our XML document’s PIs are output! Is something wrong? Well, in a sense, no. Our code currently obtains the Document object from the XML parser, and only processes the root element of our XML input. Because our processing instructions are at the same level of the tree as that root element, they are ignored. To correct this, we need to change the section of code that handles a Node that is of type Document. We can make a modification similar to how we handled an element’s children to process all of our “top-level” XML structures instead of just the root element:
case Node.DOCUMENT_NODE:
    System.out.println("<?xml version=\"1.0\"?>\n");
    // recurse on each child
    NodeList nodes = node.getChildNodes( );
    if (nodes != null) {
        for (int i=0; i<nodes.getLength( ); i++) {
            printNode(nodes.item(i), "");
        }
    }
    /*
    Document doc = (Document)node;
    printNode(doc.getDocumentElement( ), "");
    */
    break;
Compile this change in, and you should see the PIs at the top of your output, as expected. This is another subtle point of using DOM that you should remember: always be aware of what nodes you are processing, as well as what nodes you are not processing!
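A quick way to check which nodes live at the top level of a document is to list the Document's children by name. The sketch below does so; it assumes JAXP (from the JDK) in its default configuration for parsing, and the class name and sample document are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TopLevelSketch {

    // Assumption: JAXP stands in for a vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Names of every direct child of the Document, in document order.
    static List<String> topLevelNodeNames(Document doc) {
        List<String> names = new ArrayList<>();
        NodeList nodes = doc.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<?cocoon-process type=\"xslt\"?><!-- Java and XML --><Book/>");
        System.out.println(topLevelNodeNames(doc));
        // getDocumentElement() alone would have skipped the first two
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Notice that the comment shows up as a top-level node too, under the name #comment; our printing method has no case for comments, so they are silently skipped.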
Like PIs, a DTD declaration can be helpful in exposing what set of constraints an XML document references. However, since there can be a public and system ID as well as other DTD-specific data, we need to cast our Node instance to the DocumentType interface to access this additional data. We can then use the helper methods to get the name of the Node, which is the name of the root element of the document it constrains, the public ID (if it exists), and the system ID of the DTD referenced. Using this information, we can reconstruct the DTD reference in the XML document:
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
...
case Node.DOCUMENT_TYPE_NODE:
    DocumentType docType = (DocumentType)node;
    System.out.print("<!DOCTYPE " + docType.getName( ));
    if (docType.getPublicId( ) != null) {
        System.out.print(" PUBLIC \"" +
                         docType.getPublicId( ) + "\" ");
    } else {
        System.out.print(" SYSTEM ");
    }
    System.out.println("\"" + docType.getSystemId( ) + "\">");
    break;
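The same formatting logic can be exercised without parsing a file at all. The sketch below creates DocumentType nodes through the DOM Level 2 DOMImplementation.createDocumentType( ) method and formats them as a String. It assumes JAXP from the JDK as the way to obtain a DOMImplementation, and the names and identifiers passed in (Book, book.dtd, and the XHTML identifiers) are illustrative values only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DocumentType;

class DocTypeSketch {

    // Builds a standalone DocumentType node; publicId may be null.
    // Assumption: JAXP supplies the DOMImplementation here.
    static DocumentType create(String name, String publicId, String systemId) {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
            return impl.createDocumentType(name, publicId, systemId);
        } catch (Exception e) {
            throw new IllegalStateException("Could not create DocumentType", e);
        }
    }

    // Reconstructs the DOCTYPE declaration, choosing PUBLIC or SYSTEM
    // depending on whether a public ID is present.
    static String formatDocType(DocumentType docType) {
        StringBuilder out = new StringBuilder("<!DOCTYPE ");
        out.append(docType.getName());
        if (docType.getPublicId() != null) {
            out.append(" PUBLIC \"").append(docType.getPublicId()).append("\" ");
        } else {
            out.append(" SYSTEM ");
        }
        out.append('"').append(docType.getSystemId()).append("\">");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatDocType(create("Book", null, "book.dtd")));
        System.out.println(formatDocType(create("html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")));
    }
}
```

One limitation carried over from the chapter's code: a DOCTYPE with only an internal subset has no system ID, and this logic would then print the word null between the quotes.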
The final node type we look at is the EntityReference interface. This handles the various entity references that can appear within an XML document, such as our copyright reference in our example XML document. There are no surprises in how we print this type of node:
case Node.ENTITY_REFERENCE_NODE:
System.out.println("&" + node.getNodeName( ) + ";");
break;
However, there are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser’s behavior. In fact, most XML parsers have expanded and processed entity references before the XML document’s data ever makes its way into the DOM tree. So often, when expecting to see an entity reference within your DOM structure, you will find the text or values output rather than the entity itself. To test this for your parser, comment out or remove our HTML version of the JavaXML:Copyright element (we changed this last chapter, remember?) and replace it with the OReillyCopyright entity reference:
<!--
<JavaXML:Copyright>
<center>
<table cellpadding="0" cellspacing="1" border="1" bgcolor="Black">
<tr>
<td align="center">
<table bgcolor="White" border="2">
<tr>
<td>
<font size="-1">
Copyright O'Reilly and Associates, 2000
</font>
</td>
</tr>
</table>
</td>
</tr>
</table>
</center>
</JavaXML:Copyright>
-->
<JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>
</JavaXML:Book>
This can cause a nasty bug, and leave you staring at your XML wondering what is wrong. So what good is an entity reference node type if it is pre-processed by the parser? Well, it’s actually intended more for use in creating XML than in parsing existing XML. We look at this next.
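You can check how a given parser treats entity references with a few lines of code. The sketch below declares an entity in an internal DTD subset and then looks for EntityReference nodes in the resulting tree. It assumes JAXP (from the JDK) in its default configuration, where entity references are expanded, and it uses getTextContent( ), a convenience method from a later DOM level than the Level 2 bindings discussed in this chapter. The entity text is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class EntityExpansionSketch {

    // Assumption: JAXP, with its default settings, stands in for a
    // vendor-specific parser class here.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Searches the tree below a node for any EntityReference node.
    static boolean containsEntityReference(Node node) {
        if (node.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
            return true;
        }
        for (Node child = node.getFirstChild();
             child != null;
             child = child.getNextSibling()) {
            if (containsEntityReference(child)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<!DOCTYPE Book [<!ENTITY OReillyCopyright "
                + "\"Copyright O'Reilly and Associates, 2000\">]>"
                + "<Book><Copyright>&OReillyCopyright;</Copyright></Book>");
        System.out.println("Entity reference nodes present: "
                + containsEntityReference(doc));
        System.out.println("Text seen by the application: "
                + doc.getDocumentElement().getTextContent());
    }
}
```

If the first line reports that no entity reference nodes are present, your parser has already substituted the entity's text, and the ENTITY_REFERENCE_NODE case in the printing method will never be reached for that document.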
The one glaring omission you will notice if you are familiar with the DOM is that we have not talked about the mutability of the DOM tree we have been using. It is possible to add nodes into the DOM tree very easily. In fact, next to the simplicity of use, this ability to modify and add to the tree is one of DOM’s most used and heralded features.
This brings us back full circle to our original discussion on XML transformations. Why is the DOM so important for an XSLT processor? Not only is the input in an easily accessible data form, but an XSLT processor can create a new DOM tree for the output document, and easily copy, modify, add, and remove nodes from the input tree, creating an output tree as processing occurs. This is the “apples to apples” processing that is preferred in complex operations, as often input can be passed through to the output tree with little class casting or new object instantiation, greatly reducing complexity and increasing performance of the XSLT processor. For this reason, we have spent this chapter looking closely at the DOM.
However, you are probably still wondering when we talk about using this mutability; unfortunately, you will have to wait a little longer. To delve into that process now would take us quite a bit off track in our discussions on XSL; however, rest assured that the next chapters will spend time on how to create and modify XML. Before we dive into these more advanced topical chapters, let’s take a look at some additional stumbling blocks that DOM can give you as a Java developer learning how to use these powerful APIs.