The Document Object Model (DOM)

The Document Object Model, unlike SAX, has its origins in the World Wide Web Consortium (W3C). Whereas SAX is public-domain software, developed through long discussions on the XML-dev mailing list, DOM is a standard just as the actual XML specification itself is. The DOM is also not designed specifically for Java, but to represent the content and model of documents across all programming languages and tools. Bindings exist for JavaScript, Java, CORBA, and other languages, allowing the DOM to be a cross-platform and cross-language specification.

In addition to differing from SAX in standardization and language bindings, the DOM is organized into “levels” instead of versions. DOM Level One is an accepted Recommendation, and you can view the completed specification at http://www.w3.org/TR/REC-DOM-Level-1/. Level One details the functionality and navigation of content within a document. A document in the DOM is not limited to XML, but can be HTML or other content models as well! Level Two, which should be finalized in mid-2000, builds upon Level One by supplying modules and options aimed at specific content models, such as XML, HTML, and Cascading Style Sheets (CSS). These less-generic modules begin to “fill in the blanks” left by the more general tools provided in DOM Level One. You can view the current Level Two Candidate Recommendation at http://www.w3.org/TR/DOM-Level-2/. Level Three is already being worked on, and should add even more facilities for specific types of documents, such as validation handlers for XML.

The DOM and Java

Using the DOM from a specific programming language requires a set of interfaces and classes that define and implement the DOM itself. Because the DOM specification focuses on the model of a document rather than on specific methods, language bindings must be developed to represent the conceptual structure of the DOM in Java or any other language. These language bindings then serve as the APIs we use to manipulate documents in the fashion outlined in the DOM specification.

We are obviously concerned with the Java language binding. The latest Java bindings, the DOM Level Two Java bindings, can be downloaded from http://www.w3.org/TR/DOM-Level-2/java-binding.html. The classes you should be able to add to your IDE or class path are all in the org.w3c.dom package (and its subpackages). However, before downloading these yourself, you should check the XML parser and XSLT processor you purchased or downloaded; like the SAX package, the DOM package is often included with these products. This also ensures a correct match between your parser, processor, and the version of DOM that is supported.

Most processors do not handle the task of generating a DOM input themselves, but instead rely on an XML parser that is capable of generating a DOM tree. For this reason, it is often the XML parser that will have the needed DOM binding classes and not the XSLT processor. In addition, this maintains the loose coupling between parser and processor, letting one or the other be substituted with comparable products. As Apache Xalan, by default, uses Apache Xerces for XML parsing and DOM generation, it is the level of support for DOM that Xerces provides that is of interest to us.

Getting a DOM Parser

To give you an idea of how DOM works, we are going to look at how the Apache Xalan processor and other programs that need DOM input receive an XML document in the DOM tree structure. This will give us our first look at the DOM Java language binding, and start us towards understanding the concepts behind handling XML documents using the DOM.

One thing that the DOM does not specify is how a DOM tree is created. The specification instead focuses on the structure and APIs for manipulating this tree, which leaves a lot of latitude in how DOM parsers are implemented. Unlike the SAX XMLReaderFactory class, which dynamically loads a SAX XMLReader implementation, you will need to import and instantiate your vendor’s DOM parser class explicitly. To begin, create a new Java file and call it DOMParserDemo.java. We will look at how to build a simple DOM parsing program to read in an XML document and print out its contents. Create the structure and skeleton of your example class first, as shown in Example 7.2.

Example 7-2. DOMParserDemo Class

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;

/**
 * <b><code>DOMParserDemo</code></b> will take an XML file and display 
 *   the document using DOM.
 * 
 * @version 1.0
 */
public class DOMParserDemo {

    /**
     * <p>
     * This parses the file, and then prints the document out
     *   using DOM.
     * </p>
     *
     * @param uri <code>String</code> URI of file to parse.
     */
    public void performDemo(String uri) {
        System.out.println("Parsing XML File: " + uri + "\n\n");
        
        // Instantiate your vendor's DOM parser implementation
        DOMParser parser = new DOMParser(  ); 
        try {
            // parser.parse(uri);
            
        } catch (Exception e) {
            System.out.println("Error in parsing: " + e.getMessage(  ));
        }
    }
    
    /**
     * <p>
     * This provides a command-line entry point for this demo.
     * </p>
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("Usage: java DOMParserDemo [XML URI]");
            System.exit(0);
        }
        
        String uri = args[0];
    
        DOMParserDemo parserDemo = new DOMParserDemo(  );
        parserDemo.performDemo(uri);
    }
    
}

This is set up in a fashion similar to our earlier SAXParserDemo class, but imports the Apache Xerces DOMParser class directly and instantiates it. We have commented out our actual invocation of the parse( ) method for the moment; before looking at what is involved in parsing a document into a DOM structure, we need to address issues of vendor neutrality in our choice of parsers.

Keep in mind that this approach is simple and works well for many applications, but it is not portable across parser implementations as our SAX example was. The initial impulse would be to use Java constructs like Class.forName(parserClass).newInstance( ) to get an instance of the correct vendor parser class. However, different DOM implementations behave in a variety of fashions: sometimes the parse( ) method returns an org.w3c.dom.Document object (which we look at next); sometimes the parser class provides a getDocument( ) method; and sometimes different parameter types (InputSource, InputStream, String, URI, etc.) are required by the parse( ) method to supply the URI. In other words, while the DOM tree created is portable, the method of obtaining that tree is not, unless you resort to fairly complex reflection and dynamic class and method loading.

DOM Parser Output

Remember that in SAX, the focus of interest in the parser was the lifecycle of the process, as all the callback methods provided us “hooks” into the data as it was being parsed. In the DOM, the focus of interest lies in the output from the parsing process. Until the entire document has been parsed and added into the output tree structure, the data is not in a usable state. The output of a parse intended for use with the DOM interfaces is an org.w3c.dom.Document object. This object acts as a “handle” to the tree your XML data is in, and in terms of the element hierarchy we have continually discussed, it is equivalent to one level above the root element in your XML document. In other words, it owns each and every element in the XML document input.
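To make the relationship between the Document object and the root element concrete, here is a minimal sketch. It assumes the JAXP classes bundled with current JDKs (javax.xml.parsers) rather than a vendor parser class, and the class name DocumentHandleSketch and the sample XML are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DocumentHandleSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption, not the chapter's Xerces class). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<book><title>Java and XML</title></book>");
        Element root = doc.getDocumentElement();

        // The Document sits one level above the root element
        System.out.println(root.getParentNode() == doc);
        // Nodes deeper in the tree report the same owning Document
        System.out.println(root.getFirstChild().getOwnerDocument() == doc);
        // The Document itself has no parent
        System.out.println(doc.getParentNode() == null);
    }
}
```

All three checks should print true: the root element's parent is the Document, and the Document is the top of the tree.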

Unfortunately, the standardization with regard to DOM is focused on manipulating this data instead of obtaining it. This has resulted in some variety in the mechanism used to obtain the Document object after a parse. In many implementations, such as older versions of the IBM XML4J parser, the parse( ) method returned the Document object. The code to use such an implementation of a DOM parser would look like this:

public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    
    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser(  ); 
    try {
        Document doc = parser.parse(uri);
        
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage(  ));
    }
}

Most newer parsers, such as Apache Xerces, do not follow this methodology. In order to maintain a standard interface across both SAX and DOM parsers, the parse( ) method in these parsers returns void, as our SAX example of using the parse( ) method did. This change allows an application to use a DOM parser class and a SAX parser class interchangeably; however, it requires an additional method to obtain the Document object result from the XML parsing. In Apache Xerces, this method is named getDocument( ). Using this type of parser, we can add the following example to our code to obtain the resulting DOM tree from parsing our input file:

public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");
    
    // Instantiate your vendor's DOM parser implementation        
    DOMParser parser = new DOMParser(  );
    try {
        parser.parse(uri);
               
        Document doc = parser.getDocument(  );
        
    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage(  ));
    }
}

Also be sure to import the necessary DOM class:

import org.w3c.dom.Document;

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;

You should consult your vendor documentation to determine which of these mechanisms you need to employ to get the DOM result of your parse. In the next chapter, we look at Sun’s JAXP API and other ways to have a more standardized means of accessing a DOM tree from any parser implementation. Although there is some variance in getting this result, all of the uses of this result we will look at are standard across the DOM specification, so you should not have to worry about any other implementation curveballs as we continue in the chapter.
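As a preview of that more standardized route, the following sketch obtains a Document without naming any vendor class. It assumes the javax.xml.parsers package that ships with current JDKs; the class name JaxpParseSketch and its helper methods are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class JaxpParseSketch {

    /** Asks JAXP for whichever DOM parser implementation is configured. */
    static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (Exception e) {
            throw new IllegalStateException("No usable DOM parser: " + e.getMessage(), e);
        }
    }

    /** Parses the document at the given URI, much as performDemo( ) does. */
    static Document parseUri(String uri) {
        try {
            return newBuilder().parse(uri);
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Parses XML held in a String, which is handy for quick experiments. */
    static Document parseText(String xml) {
        try {
            return newBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        Document doc = (args.length == 1)
                ? parseUri(args[0])
                : parseText("<contents><chapter/></contents>");
        System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Whichever parser sits behind the factory, the result is an org.w3c.dom.Document, so the rest of the code in this chapter would work on it unchanged.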

Using a DOM Tree

Now that we have this “tree” object, let’s look at doing something useful with it. For our example, we want to move through the tree structure we have access to and print out the tree of our XML data. The easiest way to do this is to take our initial Document object and at each node in the tree, process the current node and then recursively process the children of that node. This should sound familiar to you if you have ever done any work with tree structures. To understand how this works, we need to see the basic objects that our XML data will be accessible through; we have already seen the Document object, and this and the other core DOM object interfaces are listed here. These interfaces, shown in Figure 7.2 (which includes the less frequently used DOM interfaces as well), will be the means by which we manipulate all data within our DOM tree.


Figure 7-2. UML class model of DOM Level 2 core interfaces and classes

In addition to absorbing these interfaces, pay special attention to the Node interface, and notice that it is the base interface for the other interfaces. Anytime you see a design pattern like this, you should immediately think of runtime object-type discovery. In other words, we can write a method that takes in a node, discovers what type of DOM structure that node is, and prints it in the correct fashion. This allows us to easily print our entire DOM tree with one method! Once we print the node, we can use the common methods that are available to move on to the next sibling element in the tree, get the attributes if it is an element, and handle any other special cases that arise. Then, iterating through the child nodes, we can recursively invoke the same method on each, until our entire DOM tree is printed. This is a simple, clean way of handling DOM trees. We take a detailed look at how to accomplish this now.
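The navigation methods common to every Node can be tried in isolation before we build the printing method. This sketch walks a tree with getFirstChild( ) and getNextSibling( ) and simply counts the nodes it visits. It assumes the JDK's bundled JAXP parser for obtaining the tree, and SiblingWalkSketch and countNodes( ) are names invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SiblingWalkSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Counts this node and all of its descendants, using only methods defined on Node. */
    static int countNodes(Node node) {
        int count = 1;
        // Visit the first child, then each following sibling, recursing on every one
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b>text</b><c/></a>");
        // Document, a, b, the text inside b, and c
        System.out.println(countNodes(doc)); // prints 5
    }
}
```

Because countNodes( ) accepts any Node, the same method handles the Document, elements, and text alike; attributes are not counted, since they are not children of their element.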

Getting the ball rolling

Because our Document object itself is a DOM Node, we can pass it unchanged as the initial argument to our printing method. We can create the skeleton of this method, but first we need to add the appropriate import statements to our Java file:

import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;

We then can add our method signature, which takes in a DOM Node and will print it out:

/**
 * <p>
 * This will print a DOM <code>Node</code> and then recurse
 *   on its children.
 * </p>
 *
 * @param node <code>Node</code> object to print.
 */
public void printNode(Node node)  {
    // Determine the type of node
    // Print the node
    // Recurse on children
}

Finally, with our skeleton method in place, we can invoke the method on our initial Document object, letting recursion continue the printing until the tree is completely output. This works because the Document interface extends from the common DOM Node interface:

public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");

    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser(  );
    try {
        parser.parse(uri);
        Document doc = parser.getDocument(  );

        // Print the document from the DOM tree
        printNode(doc);

    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage(  ));
    }
}

At this point, you can compile your Java source file. Although there is no output, you can see that getting an application that uses a DOM output from an XML parser up and running is fairly simple. We next look at making this a usable demonstration.

Determining a node’s type

Once within our printing method, our first task is to determine what type of node we have. Although we could approach this with a Java methodology, using instanceof and Java reflection, the DOM bindings for Java make our task much simpler. The Node interface defines a helper method, getNodeType( ) , which returns an integer value. This value can be compared against a set of constants defined within the Node interface, and the type of Node being examined can be quickly and easily determined. This also fits very naturally into the Java switch construct, which we can use within our method to break up printing into logical sections. We compare the type of our node with the most common node types; although there are some additional node types defined (see Figure 7.2), these are the most common and the concepts here can be applied to the less frequent node types as well:

public void printNode(Node node) {
    // Determine action based on node type
    switch (node.getNodeType(  )) {
        case Node.DOCUMENT_NODE:
            // Print the contents of the Document object
            break;

        case Node.ELEMENT_NODE:
            // Print the element and its attributes
            break;

        case Node.TEXT_NODE:
        case Node.CDATA_SECTION_NODE:
            // Print the textual data
            break;

        case Node.PROCESSING_INSTRUCTION_NODE:
            // Print the processing instruction
            break;

        case Node.ENTITY_REFERENCE_NODE:
            // Print the entity reference
            break;

        case Node.DOCUMENT_TYPE_NODE:
            // Print the DTD declaration
            break;
    }
}

Notice that for CDATASection and Text nodes, we handle output in a single case statement. In this example, we are not concerned with whether the text was in a CDATA section or not in the original document; we just want to print the text within the element. We now can add printing and recursion to the appropriate blocks of code, and have our application printing our DOM tree quickly and easily. We look at how to do this for the various DOM node types next.
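Before filling in those blocks, it can help to see the type dispatch run on its own. The sketch below labels each node in a small document using the same switch on getNodeType( ). It is only an illustration: it assumes the JDK's bundled JAXP parser, and NodeTypeSketch, describe( ), and describeTree( ) are hypothetical names, not part of the demo class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeTypeSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Maps a node to a short label based solely on getNodeType( ). */
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return "document";
            case Node.ELEMENT_NODE:
                return "element";
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                // Both kinds of character data are treated alike
                return "text";
            case Node.PROCESSING_INSTRUCTION_NODE:
                return "processing instruction";
            case Node.COMMENT_NODE:
                return "comment";
            default:
                return "other";
        }
    }

    /** Labels the node and its descendants in document order. */
    static List<String> describeTree(Node node) {
        List<String> kinds = new ArrayList<>();
        collect(node, kinds);
        return kinds;
    }

    private static void collect(Node node, List<String> kinds) {
        kinds.add(describe(node));
        NodeList children = node.getChildNodes();
        if (children != null) {
            for (int i = 0; i < children.getLength(); i++) {
                collect(children.item(i), kinds);
            }
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<?pi data?><a><b>hi</b><c><![CDATA[x]]></c><!--c--></a>");
        System.out.println(describeTree(doc));
    }
}
```

Note that the Text node and the CDATA section both come back labeled as text, which is exactly the effect of letting one case fall through to the next.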

The DOM Node Types

Now that you have seen how the concepts and structure of DOM work, you should only have to learn the syntax for the different node types to have a solid understanding of the DOM. In other words, you can now treat the DOM Java bindings as just another API, like the JNDI packages or the servlet extensions. Learning the concepts is typically the most difficult part of mastery, while using correct syntax only requires a reference and some example code. In this section, we give you that example code, demonstrating how to print out the most common node types as well as looking at traversing the DOM tree. You can then use the online documentation for the DOM at http://www.w3.org/DOM as your reference, as well as Appendix A, which has an API reference for SAX, DOM, and JDOM (which we look at in the next chapter).

The Document node

Because the Document is an extension of the Node interface itself, we can use it interchangeably with our other node types. However, it is a bit of a special case, as it contains the root element as well as the XML document’s DTD and some other special information not within the XML element hierarchy. Thus we need to extract the root element and pass that to our printing function when we run across this node. We also print out a simple version declaration to make our output conform to the XML specification:

case Node.DOCUMENT_NODE:
    System.out.println("<xml version=\"1.0\">\n");
    Document doc = (Document)node;
    printNode(doc.getDocumentElement(  ));
    break;

Warning

Unfortunately, DOM Level 2 (as well as SAX 2.0) does not expose the XML declaration. This may not seem to be a big problem, until you consider that the encoding of the document is included in this declaration. DOM Level 3 is expected to address this deficiency, and should be available in a draft form in mid- to late 2000. Be careful not to write DOM applications that depend on this information until this problem is corrected.

Since we need to access a Document-specific method, we must first cast the Node implementation passed to the printing method to the Document interface; we can then invoke its getDocumentElement( ) method to obtain the root element of the XML input document, and then in turn pass that on to the printing method, starting the recursion and traversal of the DOM tree.

DOM elements

Of course, our most common task will be to take a DOM Element and print out its name, attributes, and value, and then print its children. As you would suspect, all of these are easily accomplished with DOM method calls. First we need to get the name of the XML element, which is available through the getNodeName( ) method within the Node interface, and print it out. For now, we can leave space to add in the attributes, and then print out the closing angle bracket on our element. We then need to get the children of the current element and print these as well. A Node’s children can be accessed through the getChildNodes( ) method, which returns an instance of a DOM NodeList.

Warning

While most Java APIs use a Vector, Hashtable, or other Java collections class to support traversing a list, the DOM API returns a DOM-specific interface, the NodeList interface. Don’t expect to be handling attributes through the Java methods you are used to; instead, you will have to familiarize yourself with these structures when using DOM.

It is trivial to obtain the length of this list, and then iterate through the children calling the printing method on each, continuing our recursion. Finally, we can output the closing of our element.
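If you would rather work with the Java collections you already know, a NodeList is easy to copy into one. The following is a small sketch of such an adapter; NodeListSketch and toList( ) are hypothetical names, and the JDK's bundled JAXP parser is assumed for building the sample tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NodeListSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Copies a DOM NodeList into a java.util.List, preserving order. */
    static List<Node> toList(NodeList nodes) {
        List<Node> result = new ArrayList<>();
        if (nodes != null) {
            // getLength( ) and item( ) are the only access methods NodeList offers
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = parse("<list><item>a</item><item>b</item></list>");
        for (Node child : toList(doc.getDocumentElement().getChildNodes())) {
            System.out.println(child.getNodeName());
        }
    }
}
```

Keep in mind that the copy is a snapshot: a NodeList is live and reflects later changes to the tree, while the List produced here does not.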

First let’s add in the new DOM interface we need:

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;

Now that we have the needed classes and interfaces accessible by their class name, we can add in the code we have been discussing. We obtain the name of the Node implementation, print it in XML format, print each of its children (checking for null to ensure that children exist), and then close our element. Although this code doesn’t yet handle attributes, it should take care of printing out our XML elements for us throughout the entire DOM tree:

case Node.ELEMENT_NODE:
    String name = node.getNodeName(  );
    System.out.print("<" + name);

    // Print out attributes

    System.out.println(">");

    // recurse on each child
    NodeList children = node.getChildNodes(  );
    if (children != null) {
        for (int i=0; i<children.getLength(  ); i++) {
            printNode(children.item(i));
        }
    }

    System.out.println("</" + name + ">");
    break;

This seems fairly easy, right? It is just as simple to iterate through the attributes of our DOM Element. We use the getAttributes( ) method, again defined in the Node interface, to get a list of XML attributes, returned in a NamedNodeMap. This DOM interface is used for a collection of nodes that are unique with regard to their name, so is ideal for storing a list of our XML element’s attributes. We then iterate through this list, printing out the name and value of each attribute. This is similar to the way we handled iteration through our element’s child nodes, and we use the getNodeName( ) and getNodeValue( ) methods to obtain the values needed for printing. Let’s take a look at how to do this here; add the import statement needed for NamedNodeMap and make the following changes to our code:

import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
...
            case Node.ELEMENT_NODE:
                String name = node.getNodeName(  );
                System.out.print("<" + name);
                NamedNodeMap attributes = node.getAttributes(  );
                for (int i=0; i<attributes.getLength(  ); i++) {
                    Node current = attributes.item(i);
                    System.out.print(" " + current.getNodeName(  ) +
                                     "=\"" + current.getNodeValue(  ) +
                                     "\"");
                }
                System.out.println(">");
                
                // recurse on each child
                NodeList children = node.getChildNodes(  );
                if (children != null) {
                    for (int i=0; i<children.getLength(  ); i++) {
                        printNode(children.item(i));
                    }
                }
                
                System.out.println("</" + name + ">");
                break;
...

At this point we have done quite a bit of work! With only a couple of hundred lines of code complete, we can iterate through a DOM tree and print out elements and attributes. In fact, it is this ease of use, particularly as compared to SAX, that has made DOM such a prevalent and popular way to handle XML data. Certainly it is not always the best choice, as we have already discussed and will again, but it provides a simple representation of XML that is easy to move through.
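The attribute handling can also be pulled out into a helper of its own. This sketch copies an element's NamedNodeMap into a sorted java.util.Map; it assumes the JDK's bundled JAXP parser for the sample tree, and AttributeMapSketch and attributesOf( ) are names made up for the example.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class AttributeMapSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Returns the node's attributes as name/value pairs, sorted by name. */
    static Map<String, String> attributesOf(Node node) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attributes = node.getAttributes();
        if (attributes == null) {
            // getAttributes( ) returns null for anything that is not an Element
            return result;
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            Node current = attributes.item(i);
            result.put(current.getNodeName(), current.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = parse("<Topic subSections=\"4\" focus=\"XML\"/>");
        System.out.println(attributesOf(doc.getDocumentElement()));
    }
}
```

A TreeMap is used here because a NamedNodeMap makes no promise about the order of its items; sorting by name gives repeatable output regardless of the parser.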

Applying formatting

If you haven’t already, compile your Java source file and run it on the XML table of contents file we have been using. You should get output similar to the fragment shown in Example 7.3.

Example 7-3. DOMParserDemo Output

D:\prod\JavaXML>java DOMParserDemo D:\prod\JavaXML\contents.xml
Parsing XML File: D:\prod\JavaXML\contents.xml


<xml version="1.0">

<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
<JavaXML:Title>
</JavaXML:Title>
<JavaXML:Contents>
<JavaXML:Chapter focus="XML">
<JavaXML:Heading>
</JavaXML:Heading>
<JavaXML:Topic subSections="4">
</JavaXML:Topic>
<JavaXML:Topic subSections="5">
</JavaXML:Topic>
<JavaXML:Topic subSections="4">
</JavaXML:Topic>
<JavaXML:Topic subSections="1">
</JavaXML:Topic>
</JavaXML:Chapter>
...

This does exactly what we expected, but perhaps not in a very clear or usable way. It is very difficult to see where elements begin and end without the indenting we used in our original document. As you remember, whitespace between elements is stripped and typically ignored by parsers, so we need to add some whitespace back in to help in our output formatting. We can pass a simple indentation string into our printing method to handle this. Indentation can be added to as we recurse and traverse through the DOM tree:

/**
 * <p>
 * This will print a DOM <code>Node</code> and then recurse
 *   on its children.
 * </p>
 *
 * @param node <code>Node</code> object to print.
 * @param indent <code>String</code> spacing to insert 
 *               before this <code>Node</code>
 */
public void printNode(Node node, String indent)  {
    switch (node.getNodeType(  )) {
        case Node.DOCUMENT_NODE:
            System.out.println("<xml version=\"1.0\">\n");
            Document doc = (Document)node;
            printNode(doc.getDocumentElement(  ), "");
            break;

        case Node.ELEMENT_NODE:
            String name = node.getNodeName(  );
            System.out.print(indent + "<" + name);
            NamedNodeMap attributes = node.getAttributes(  );
            for (int i=0; i<attributes.getLength(  ); i++) {
                Node current = attributes.item(i);
                System.out.print(" " + current.getNodeName(  ) +
                                 "=\"" + current.getNodeValue(  ) +
                                 "\"");
            }
            System.out.println(">");

            // recurse on each child
            NodeList children = node.getChildNodes(  );
            if (children != null) {
                for (int i=0; i<children.getLength(  ); i++) {
                    printNode(children.item(i), indent + "  ");
                }
            }

            System.out.println(indent + "</" + name + ">");
            break;

        case Node.TEXT_NODE:
        case Node.CDATA_SECTION_NODE:
            // Print the textual data
            break;

        case Node.PROCESSING_INSTRUCTION_NODE:
            // Print the processing instruction
            break;

        case Node.ENTITY_REFERENCE_NODE:
            // Print the entity reference
            break;

        case Node.DOCUMENT_TYPE_NODE:
            // Print the DTD declaration
            break;
    }
}

Then make a small change to feed our method an initial indent of an empty string:

public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");

    // Instantiate your vendor's DOM parser implementation
    DOMParser parser = new DOMParser(  );
    try {
        parser.parse(uri);
        Document doc = parser.getDocument(  );

        // Print the document from the DOM tree and
        //   feed it an initial indentation of nothing
        printNode(doc, "");

    } catch (Exception e) {
        System.out.println("Error in parsing: " + e.getMessage(  ));
    }
}

With this minor change, you can see in Example 7.4 that our output is much more readable.

Example 7-4. DOMParserDemo Output with Indentation

D:\prod\JavaXML>java DOMParserDemo D:\prod\JavaXML\contents.xml
Parsing XML File: D:\prod\JavaXML\contents.xml


<xml version="1.0">

<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">
  <JavaXML:Title>
  </JavaXML:Title>
  <JavaXML:Contents>
    <JavaXML:Chapter focus="XML">
      <JavaXML:Heading>
      </JavaXML:Heading>
      <JavaXML:Topic subSections="4">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="5">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="4">
      </JavaXML:Topic>
      <JavaXML:Topic subSections="1">
      </JavaXML:Topic>
    </JavaXML:Chapter>
...

With this formatting in place, we are ready to add in the textual data values of our elements.

Textual nodes

If you are wondering when we are going to handle integer nodes, numeric nodes, or Boolean nodes, we are not. As you should recall, all XML data within an element was reported through the SAX characters( ) callback. This should have given you your first clue that an XML parser handles all data as text, and an application must make data type conversions if needed. Thus, the DOM Text and CDATASection interfaces are all we need to worry about to print our elements’ values. Printing is quite simple, as we only need to use the now familiar getNodeValue( ) method of the DOM Node interface to get the textual data and print it out:

case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
    System.out.print(node.getNodeValue(  ));
    break;

With that in place, we are very close to having a complete DOM traversal. However, before moving on, let’s take a look at a few less common but useful DOM interfaces (and their corresponding Node types): ProcessingInstruction, DocumentType, and EntityReference, all of which we have in our document.
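Since the parser hands everything back as text, converting a value to a Java type is left to the application. The sketch below gathers an element's character data (Text and CDATA alike) and converts it to an int where that makes sense. It assumes the JDK's bundled JAXP parser; TextValueSketch, textOf( ), and intValueOf( ) are hypothetical helpers, and the order document is invented sample data.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class TextValueSketch {

    /** Parses an XML string with the JDK's built-in JAXP parser (an assumption). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Error in parsing: " + e.getMessage(), e);
        }
    }

    /** Concatenates the values of the Text and CDATASection children of an element. */
    static String textOf(Node element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    /** Converts an element's text to an int; the DOM itself never does this for us. */
    static int intValueOf(Node element) {
        return Integer.parseInt(textOf(element).trim());
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<order><quantity> 12 </quantity><note><![CDATA[a < b]]></note></order>");
        Node root = doc.getDocumentElement();

        System.out.println(intValueOf(root.getFirstChild()) + 1); // prints 13
        System.out.println(textOf(root.getLastChild()));          // prints a < b
    }
}
```

If the text is not a valid number, Integer.parseInt( ) throws a NumberFormatException, so a real application would decide how to report or recover from bad data.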

Processing instructions

The DOM bindings for Java define an interface to handle processing instructions that are within the input XML document. This is useful, as these instructions do not follow the same markup model that XML elements and attributes do, but are still important for applications to know about. In our example document, we give instructions to the XSLT processor informing it about a stylesheet, as well as an instruction to a publishing framework, Apache Cocoon, letting it know what processing needs to occur. If you still have these PIs commented out in your XML table of contents, you should uncomment these now:

<?xml version="1.0"?>
<?xml-stylesheet href="XSLJavaXML.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="XSLJavaXML.wml.xsl" type="text/xsl" 
                 media="wap"?>
<?cocoon-process type="xslt"?>

<!DOCTYPE JavaXML:Book SYSTEM "DTDJavaXML.dtd">

<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">

The PI node in the DOM is a little bit of a break from what we have seen so far: to fit the syntax into the Node interface model, the getNodeValue( ) method returns all of the data within a PI in one String. This allows us to quickly output the PI to the screen; however, we still use getNodeName( ) to get the name of the PI. If you were writing an application that could receive PIs from an XML document, you might prefer to use the ProcessingInstruction interface; although it exposes the same data, the method names (getTarget( ) and getData( )) are more in line with a PI’s format. With this understanding, we can add in the code to print out our PIs:

case Node.PROCESSING_INSTRUCTION_NODE:
    System.out.println("<?" + node.getNodeName(  ) +
                       " " + node.getNodeValue(  ) +
                       "?>");
    break;
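As a side note to the case above, here is a minimal sketch of how the two pairs of accessors line up. It is not part of the chapter's sample program: the class name PIAccessorsDemo is hypothetical, and it assumes the JAXP DocumentBuilderFactory that ships with the JDK rather than a vendor parser class. The PI is created in memory, so no document has to be parsed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

// Hypothetical demo class; assumes the JDK's built-in JAXP implementation.
class PIAccessorsDemo {

    // Creates a stand-alone PI node resembling the Cocoon instruction.
    static ProcessingInstruction samplePI() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            return doc.createProcessingInstruction(
                "cocoon-process", "type=\"xslt\"");
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM builder available", e);
        }
    }

    // Formats the PI through the generic Node methods.
    static String viaNode(Node node) {
        return "<?" + node.getNodeName() + " " + node.getNodeValue() + "?>";
    }

    // Formats the PI through the ProcessingInstruction methods.
    static String viaPI(ProcessingInstruction pi) {
        return "<?" + pi.getTarget() + " " + pi.getData() + "?>";
    }

    public static void main(String[] args) {
        ProcessingInstruction pi = samplePI();
        System.out.println(viaNode(pi));
        System.out.println(viaPI(pi));
    }
}
```

Both methods should produce the same string, since getTarget( ) mirrors getNodeName( ) and getData( ) mirrors getNodeValue( ) for this node type; which pair you use is mostly a question of readability.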

If you compile and run the sample program with the new PROCESSING_INSTRUCTION_NODE case in place, you may be in for a surprise: none of our XML document’s PIs are output! Is something wrong? Well, in a sense, no. Our code currently obtains the Document object from the XML parser, and only processes the root element of our XML input. Because our processing instructions are at the same level of the tree as that root element, they are ignored. To correct this, we need to change the section of code that handles a Node that is of type Document. We can make a modification similar to how we handled an element’s children to process all of our “top-level” XML structures instead of just the root element:

case Node.DOCUMENT_NODE:
    System.out.println("<?xml version=\"1.0\"?>\n");
    // recurse on each child
    NodeList nodes = node.getChildNodes(  );
    if (nodes != null) {
        for (int i=0; i<nodes.getLength(  ); i++) {
            printTree(nodes.item(i), "");
        }
    }
    /*
    Document doc = (Document)node;
    printTree(doc.getDocumentElement(  ), "");
    */
    break;

Compile this change in, and you should see the PIs at the top of your output, as expected. This is another subtle point of using DOM that you should remember: always be aware of what nodes you are processing, as well as what nodes you are not processing!
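To make the difference between the root element and the Document's full set of children concrete, the following sketch lists every top-level node of a small document. It is only an illustration under a few assumptions: the class name TopLevelNodesDemo and the inline sample XML are hypothetical stand-ins for our table of contents, and the parser is obtained through JAXP rather than a vendor class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample XML is a stand-in document.
class TopLevelNodesDemo {

    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
      + "<?cocoon-process type=\"xslt\"?>"
      + "<!DOCTYPE Book [<!ELEMENT Book EMPTY>]>"
      + "<!-- Java and XML -->"
      + "<Book/>";

    // Parses XML text into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    // Returns "nodeType:nodeName" for each direct child of the Document.
    static List<String> topLevelNodes(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            result.add(child.getNodeType() + ":" + child.getNodeName());
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        // Everything at the top level, not just the root element
        System.out.println(topLevelNodes(doc));
        // The single node that getDocumentElement() would hand back
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

The numeric prefixes are the Node type constants (for example, Node.PROCESSING_INSTRUCTION_NODE and Node.DOCUMENT_TYPE_NODE), which makes it easy to see which kinds of nodes a traversal starting at getDocumentElement( ) would never visit.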

Document types

Like PIs, a DTD declaration can be helpful in exposing what set of constraints an XML document references. However, since there can be a public and system ID as well as other DTD-specific data, we need to cast our Node instance to the DocumentType interface to access this additional data. We can then use the helper methods to get the name of the Node (which is the name of the root element of the document it constrains), the public ID (if it exists), and the system ID of the DTD referenced. Using this information, we can reconstruct the DTD reference in the XML document:

import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Import your vendor's DOM parser
import org.apache.xerces.parsers.DOMParser;
...
case Node.DOCUMENT_TYPE_NODE:
    DocumentType docType = (DocumentType)node;
    System.out.print("<!DOCTYPE " + docType.getName(  ));
    if (docType.getPublicId(  ) != null)  {
        System.out.print(" PUBLIC \"" + 
            docType.getPublicId(  ) + "\" ");
    } else {
        System.out.print(" SYSTEM ");
    }
    System.out.println("\"" + docType.getSystemId(  ) + "\">");
    break;
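If you want to try the DocumentType accessors outside of the sample program, a small self-contained sketch might look like the following. Several things here are assumptions for illustration only: the class name DocTypeDemo, the use of JAXP instead of a vendor parser class, and the EntityResolver that hands the parser an empty DTD so that the sketch runs without DTDJavaXML.dtd being present on disk.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the chapter's sample program.
class DocTypeDemo {

    // Rebuilds the DOCTYPE declaration from the DocumentType accessors.
    static String format(DocumentType docType) {
        StringBuilder sb = new StringBuilder("<!DOCTYPE ");
        sb.append(docType.getName());
        if (docType.getPublicId() != null) {
            sb.append(" PUBLIC \"").append(docType.getPublicId()).append("\" ");
        } else {
            sb.append(" SYSTEM ");
        }
        sb.append("\"").append(docType.getSystemId()).append("\">");
        return sb.toString();
    }

    // Parses the XML text and formats its DOCTYPE (null if there is none).
    static String docTypeOf(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: substitute an empty DTD so no file is needed
            builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
            Document doc =
                builder.parse(new InputSource(new StringReader(xml)));
            DocumentType docType = doc.getDoctype();
            return docType == null ? null : format(docType);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(docTypeOf(
            "<!DOCTYPE JavaXML:Book SYSTEM \"DTDJavaXML.dtd\">"
          + "<JavaXML:Book xmlns:JavaXML="
          + "\"http://www.oreilly.com/catalog/javaxml/\"/>"));
    }
}
```

Note that Document also offers getDoctype( ), used here, as a shortcut for reaching the DocumentType node when you are not already walking the tree.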

Entity references

The final node type we look at is represented by the EntityReference interface. This handles the various entity references that can appear within an XML document, such as the copyright reference in our example XML document. There are no surprises in how we print this type of node:

case Node.ENTITY_REFERENCE_NODE:
    System.out.println("&" + node.getNodeName(  ) + ";");               
    break;

There are a few surprises that may trip you up when it comes to the output from a node such as this. The definition of how entity references should be processed within DOM allows a lot of latitude, and also relies heavily on the underlying parser’s behavior. In fact, most XML parsers have expanded and processed entity references before the XML document’s data ever makes its way into the DOM tree. So often, when expecting to see an entity reference within your DOM structure, you will find the expanded text or values output rather than the entity reference itself. To test this for your parser, comment out or remove our HTML version of the JavaXML:Copyright element (we changed this in the last chapter, remember?) and replace it with the OReillyCopyright entity reference:

<!--
<JavaXML:Copyright>
  <center>
   <table cellpadding="0" cellspacing="1" border="1" bgcolor="Black">
    <tr>
     <td align="center">
      <table bgcolor="White" border="2">
       <tr>
        <td>
         <font size="-1">
          Copyright O'Reilly and Associates, 2000
         </font>
        </td>
       </tr>
      </table>
     </td>
    </tr>
   </table>
  </center>
 </JavaXML:Copyright>
-->
<JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>

</JavaXML:Book>

This can cause a nasty bug, and leave you staring at your XML wondering what is wrong. So what good is an entity reference node type if it is pre-processed by the parser? Well, it’s actually intended more for use in creating XML than in parsing existing XML. We look at this next.
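A quick way to check what your own parser does is to count the EntityReference nodes that actually end up in the tree. The sketch below is one way to do that, with some assumptions stated up front: the class name EntityRefDemo is hypothetical, the entity is declared inline in an internal DTD subset instead of in our real DTD, and the parser comes from JAXP. JAXP exposes a setExpandEntityReferences( ) switch on DocumentBuilderFactory; expansion is on by default, and what you get when it is turned off differs between parsers and versions, so the sketch only prints those results rather than promising them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the inline DTD is a stand-in for our real one.
class EntityRefDemo {

    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
      + "<!DOCTYPE Book ["
      + "  <!ENTITY OReillyCopyright "
      + "\"Copyright O'Reilly and Associates, 2000\">"
      + "]>"
      + "<Book><Copyright>&OReillyCopyright;</Copyright></Book>";

    // Parses the XML, asking the parser to expand entity references or not.
    static Document parse(String xml, boolean expand) {
        try {
            DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();
            factory.setExpandEntityReferences(expand);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    // Recursively counts EntityReference nodes at or below the given node.
    static int countEntityRefs(Node node) {
        int count =
            (node.getNodeType() == Node.ENTITY_REFERENCE_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countEntityRefs(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) {
        Document expanded = parse(SAMPLE, true);
        System.out.println("expand=true  entity refs: "
            + countEntityRefs(expanded.getDocumentElement()));
        System.out.println("expand=true  text: "
            + expanded.getDocumentElement().getTextContent());

        // Results for this setting vary by parser and version
        Document unexpanded = parse(SAMPLE, false);
        System.out.println("expand=false entity refs: "
            + countEntityRefs(unexpanded.getDocumentElement()));
    }
}
```

With expansion on, the Copyright element simply contains the replacement text, and the ENTITY_REFERENCE_NODE case in our traversal is never reached.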

Mutability of a DOM Tree

The one glaring omission you will notice if you are familiar with the DOM is that we have not talked about the mutability of the DOM tree we have been using. It is possible to add nodes into the DOM tree very easily. In fact, next to the simplicity of use, this ability to modify and add to the tree is one of DOM’s most used and heralded features.

This brings us back full circle to our original discussion on XML transformations. Why is the DOM so important for an XSLT processor? Not only is the input in an easily accessible data form, but an XSLT processor can create a new DOM tree for the output document, and easily copy, modify, add, and remove nodes from the input tree, creating an output tree as processing occurs. This is the “apples to apples” processing that is preferred in complex operations, as often input can be passed through to the output tree with little class casting or new object instantiation, greatly reducing complexity and increasing performance of the XSLT processor. For this reason, we have spent this chapter looking closely at the DOM.

However, you are probably still wondering when we will talk about using this mutability; unfortunately, you will have to wait a little longer. To delve into that process now would take us quite a bit off track in our discussions on XSL; rest assured, though, that the next chapters will spend time on how to create and modify XML. Before we dive into these more advanced chapters, let’s take a look at some additional stumbling blocks that DOM can put in your way as a Java developer learning how to use these powerful APIs.
