Document Object Model (DOM) Parser

When you use the DOM API to parse an XML document, a tree structure representing the XML document is built in memory. You can then analyze the nodes of the tree to discover the XML contents.

The mechanism for instantiating a DOM parser is very similar to that for a SAX parser: you obtain a new instance of DocumentBuilderFactory and use it to create a new DocumentBuilder.

The parse() method is called on this DocumentBuilder object to return an object that conforms to the public Document interface. This object represents the XML document tree. The following code fragment creates a DOM parser and reads the XML document from a file called text.xml:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File("text.xml"));

With the DocumentBuilder.parse() method, you are not restricted to reading XML only from a file; you can also use a constructed InputStream or read from a source defined by a URL.
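
As a minimal sketch of the InputStream variant, the following fragment parses XML held in a String by wrapping it in a ByteArrayInputStream. The class name ParseSources and the helper parseString() are hypothetical names used only for this illustration; the parse(InputStream) overload is part of the standard DocumentBuilder class, as is parse(String uri) for reading from a URI.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class ParseSources {

    // Hypothetical helper: parses XML text by presenting it as an InputStream.
    // Checked exceptions are wrapped so callers can use it in one-liners.
    static Document parseString(String xml) {
        try (InputStream in =
                 new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(in);   // the parse(InputStream) overload
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        Document doc = parseString("<jobs><job id=\"1\"/></jobs>");
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "jobs"
    }
}
```

The same builder could be pointed at a remote document with builder.parse("http://...") style URI strings; that variant is omitted here because it needs a reachable address.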

There are a number of methods provided in the Document interface to access the nodes in the tree. These are listed in Table 16.8.

The normalize() method should always be used to put all text nodes into a form where there are no adjacent text nodes or empty text nodes. In this form, the DOM view better reflects the XML structure.
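
To see what normalize() does, the following sketch builds a small tree by hand rather than parsing a file, because a hand-built tree makes it easy to create adjacent and empty text nodes. The class name NormalizeSketch and its helper method are hypothetical; the DOM calls themselves (newDocument(), createElement(), createTextNode(), normalize()) are standard.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeSketch {

    // Builds <note> with two adjacent text nodes plus one empty text node and
    // returns the number of children before and after calling normalize().
    static int[] childCountsBeforeAndAfter() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            Element note = doc.createElement("note");
            doc.appendChild(note);
            note.appendChild(doc.createTextNode("Hello, "));
            note.appendChild(doc.createTextNode("world"));
            note.appendChild(doc.createTextNode(""));   // empty text node

            int before = note.getChildNodes().getLength();
            note.normalize();   // merges adjacent text nodes, drops empty ones
            int after = note.getChildNodes().getLength();
            return new int[] { before, after };
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("No DOM implementation available", ex);
        }
    }

    public static void main(String[] args) {
        int[] counts = childCountsBeforeAndAfter();
        System.out.println(counts[0] + " -> " + counts[1]); // prints "3 -> 1"
    }
}
```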

As already shown, a DOM parser is instantiated in much the same way as a SAX parser, so the code should be familiar:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
document = builder.parse(new File(argv[0]));

This is where the similarity ends. At this point, the DOM parser has built an in-memory representation of the document that will look something like Figure 16.2.

Figure 16.2. Diagram of DOM tree.


The root of the DOM tree is obtained with the getDocumentElement() method.

Element root = document.getDocumentElement();

This method returns an Element, which is simply a node that may have attributes associated with it. An element can be the parent to other elements.

The Document interface provides a number of methods for accessing the nodes in the tree, most of them inherited from the Node interface; they are listed in Table 16.8. These methods return either a Node or a NodeList (an ordered collection of nodes).

Table 16.8. Document Interface Methods to Traverse a DOM Tree
Method Name                     Description
getDocumentElement()            Allows direct access to the root element of the document
getElementsByTagName(String)    Returns a NodeList of all the elements with the given tag name in the order in which they are encountered in the tree
getChildNodes()                 A NodeList that contains all children of this node
getParentNode()                 The parent of this node
getFirstChild()                 The first child of this node
getLastChild()                  The last child of this node
getPreviousSibling()            The node immediately preceding this node

In the DOM application you are about to build, the getChildNodes() method is used to recursively traverse the DOM tree. The NodeList.getLength() method can then be used to find out the number of nodes in the NodeList.

NodeList children = node.getChildNodes();
int len = (children != null) ? children.getLength() : 0;
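
The traversal methods can be tried out on a small in-memory document. The following is a sketch only: the class name TraversalSketch, the helper methods parse() and countElements(), and the sample XML are all made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TraversalSketch {

    // Hypothetical helper: parses XML text held in a String.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    // Recursively counts the element nodes in the subtree rooted at node,
    // walking the tree with getChildNodes().
    static int countElements(Node node) {
        int count = (node.getNodeType() == Node.ELEMENT_NODE) ? 1 : 0;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            count += countElements(children.item(i));
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<jobs><job><title>Cleaner</title></job>"
                           + "<job><title>Tester</title></job></jobs>");
        System.out.println(countElements(doc.getDocumentElement())); // prints 5

        // getElementsByTagName() finds matching elements anywhere in the tree.
        NodeList titles = doc.getElementsByTagName("title");
        System.out.println(titles.getLength());                              // prints 2
        System.out.println(titles.item(0).getFirstChild().getNodeValue());   // prints "Cleaner"
        System.out.println(titles.item(0).getParentNode().getNodeName());    // prints "job"
    }
}
```

The sample XML contains no whitespace between tags, so no whitespace-only text nodes appear among the children; a pretty-printed file would normally produce them.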

In addition to the tree traversal methods, the Node interface provides the methods listed in Table 16.9 for inspecting the contents of a node.

Table 16.9. Node Interface Methods to Inspect DOM Nodes
Method Name          Description
getAttributes()      A NamedNodeMap containing the attributes of a node if it is an Element, or null if it is not.
getNodeName()        A string representing the name of this node (for an element, the tag name).
getNodeType()        A code representing the type of the underlying object. A node can be one of ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_REFERENCE_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DOCUMENT_FRAGMENT_NODE, NOTATION_NODE.
getNodeValue()       A string representing the value of this node. If the node is a text node, the value is the contents of the text node; for an attribute node, it is the string assigned to the attribute. For most other node types, there is no value and a call to this method returns null.
getNamespaceURI()    The namespace URI of this node.
hasAttributes()      Returns a boolean to indicate whether this node has any attributes.
hasChildNodes()      Returns a boolean to indicate whether this node has any children.
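
The following sketch shows how these inspection methods might be combined to describe a node. The class name InspectSketch, the helpers parse() and describe(), and the sample XML are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class InspectSketch {

    // Hypothetical helper: parses XML text held in a String.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    // Returns a one-line description based on the node's type code.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                // For an element, getAttributes() returns a (possibly empty) map.
                return "element " + node.getNodeName() + " with "
                        + node.getAttributes().getLength() + " attribute(s)";
            case Node.TEXT_NODE:
                return "text \"" + node.getNodeValue() + "\"";
            case Node.COMMENT_NODE:
                return "comment \"" + node.getNodeValue() + "\"";
            default:
                return "node type " + node.getNodeType();
        }
    }

    public static void main(String[] args) {
        Element root = parse("<job id=\"7\" open=\"yes\">Cleaner<!--note--></job>")
                .getDocumentElement();
        System.out.println(describe(root));   // element job with 2 attribute(s)
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println(describe(children.item(i)));
        }
        // An attribute is itself a node whose value is the attribute string.
        System.out.println(root.getAttributes().getNamedItem("id").getNodeValue()); // 7
    }
}
```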

Listing 16.10 is the full listing of a simple standalone parser that uses DOM. It reads the file named on the command line, builds the parse tree, and outputs elements (including attributes) and text nodes as XML.

Listing 16.10. Simple DOM Parser
  1: import javax.xml.parsers.*;
  2: import org.xml.sax.*;
  3: import java.io.*;
  4: import org.w3c.dom.*;
  5: import java.util.*;
  6:
  7: public class DOMParse {
  8:
  9:     static Document document;
 10:
 11:     public static void main(String argv[]) {
 12:         if (argv.length != 1) {
 13:             System.err.println("Usage: DOMParse filename");
 14:             System.exit(1);
 15:         }
 16:         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 17:         try {
 18:             DocumentBuilder builder = factory.newDocumentBuilder();
 19:             document = builder.parse(new File(argv[0]));
 20:             document.getDocumentElement().normalize ();
 21:             Element root = document.getDocumentElement();
 22:             writeElement(root, "");
 23:         }
 24:         catch (ParserConfigurationException ex) {
 25:             System.err.println ("Failed to create DOM parser:" + ex);
 26:         }
 27:         catch (SAXException ex) {
 28:             System.err.println ("General SAX exception: " + ex);
 29:         }
 30:         catch (IOException ex) {
 31:             System.err.println ("IO exception: " + ex);
 32:         }
 33:         catch (IllegalArgumentException ex) {
 34:             System.err.println ("Invalid file argument: " + ex);
 35:         }
 36:     }
 37:
 38:     private static void writeElement(Node n, String indent) {
 39:         StringBuffer name = new StringBuffer(indent);
 40:         name.append('<');
 41:         int tag_start = name.length();  // where to insert '/' when printing the end tag
 42:         name.append(n.getNodeName());
 43:
 44:         NamedNodeMap attrs = n.getAttributes();
 45:         int attrCount = (attrs != null) ? attrs.getLength() : 0;
 46:         StringBuffer attributes = new StringBuffer();
 47:         for (int i = 0; i < attrCount; i++) {
 48:             Node attr = attrs.item(i);
 49:             attributes.append(' ');
 50:             attributes.append(attr.getNodeName());
 51:             attributes.append("=\"");
 52:             attributes.append(attr.getNodeValue());
 53:             attributes.append('"');
 54:         }
 55:         System.out.print (name);
 56:         System.out.print (attributes);
 57:         System.out.println (">");
 58:         name.append('>');
 59:
 60:         NodeList children = n.getChildNodes();
 61:         int len = (children != null) ? children.getLength() : 0;
 62:         indent += "  ";
 63:         for (int i = 0; i < len; i++) {
 64:             Node node = children.item(i);
 65:             switch (node.getNodeType())
 66:             {
 67:               case Node.TEXT_NODE:
 68:                 writeText(node, indent);
 69:                 break;
 70:
 71:               case Node.ELEMENT_NODE:
 72:                 writeElement(node, indent);
 73:                 break;
 74:             }
 75:         }
 76:         name.insert(tag_start, '/');
 77:         System.out.println (name);
 78:     }
 79:
 80:     private static void writeText(Node n, String indent) {
 81:         String value = n.getNodeValue().trim();
 82:         if (value.length() > 0) {
 83:             System.out.print(indent);
 84:             StringTokenizer XMLTokens = new StringTokenizer(value, "&<>'\"", true);
 85:             while (XMLTokens.hasMoreTokens()) {
 86:                 String t = XMLTokens.nextToken();
 87:                 if (t.length() == 1)  // might be a special char
 88:                 {
 89:                     if (t.equals("&"))
 90:                         System.out.print ("&amp;");
 91:                     else if (t.equals("<"))
 92:                         System.out.print ("&lt;");
 93:                     else if (t.equals(">"))
 94:                         System.out.print ("&gt;");
 95:                     else if (t.equals("'"))
 96:                         System.out.print ("&apos;");
 97:                     else if (t.equals("\""))
 98:                         System.out.print ("&quot;");
 99:                     else
100:                         System.out.print(t);
101:                 }
102:                 else
103:                     System.out.print(t);
104:             }
105:             System.out.println();
106:         }
107:     }
108: }
					

Although at first sight this looks more complicated than the SAX parser, most of the additional code is concerned with producing output that conforms to the XML syntax.

Lines 38–57 print out the start tag with any associated attributes. Lines 59–75 check for any child nodes and call the appropriate method according to whether the child node is an XML element or a text node. Line 76 inserts a / character into the tag name before printing it out as the end tag.

The writeText() method starting on line 80 tokenizes the text contents and replaces the special characters (listed in Table 16.2) with the appropriate XML strings.
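
The same replacement can also be written without a tokenizer. The following is an alternative sketch, not the approach used in Listing 16.10: it walks the text one character at a time and builds the escaped string. The class name XmlEscapeSketch and the method escape() are hypothetical.

```java
public class XmlEscapeSketch {

    // Replaces the five XML special characters with their entity references
    // and copies every other character through unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Fish & Chips <hot>")); // prints "Fish &amp; Chips &lt;hot&gt;"
    }
}
```

Returning a String rather than printing directly makes the helper usable for attribute values as well as text nodes.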
