Document Based Parsing with DOM Under Java

Although event-based parsing with the Simple API for XML (SAX) is fast and memory efficient, it provides no context for where the current element sits in the hierarchy of the document. This is because parsing via SAX is effectively a one-shot operation: the parser knows about the current element only at the moment it notifies the event handler, and then promptly forgets about it.

For applications in which it is useful to have the parser remember more of this contextual information, a parser that implements the Document Object Model (DOM) is appropriate. DOM keeps a complete copy of the XML document in memory, allowing programmers to traverse the hierarchy of elements without keeping track of the state of the document themselves. This provides a much higher-level interface to XML documents than SAX does, but at a significant cost in speed and memory, because all the information from the document must be held in memory.

This section provides an example of working with DOM in Java. For more specific information on the Document Object Model and its interfaces, see Chapter 14, “XML and the Document Object Model.”

Creating the DOM Parser Instance

To parse XML documents with DOM, we first need a way to generate a DOM parser. Under the Java API for XML Processing (JAXP), this is done in much the same way that SAX parsers are built: with a factory object that creates parsers on request. Under the JAXP interfaces, a DOM parser is represented by the DocumentBuilder class, and the factory responsible for generating DocumentBuilder instances is the DocumentBuilderFactory. Therefore, our example starts by creating a DocumentBuilderFactory and asking it for a DocumentBuilder. (The methods available to the resulting DocumentBuilder are shown in Table 16.3.)

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

Table 16.3. DocumentBuilder Methods
Method                Description
getDOMImplementation  The DOMImplementation associated with this document builder
isNamespaceAware      A boolean value that specifies whether this parser understands namespaces
isValidating          A boolean value that specifies whether this parser validates against DTDs and Schemas
newDocument           Creates a new W3C DOM Document object that can be manipulated
parse                 Parses the specified XML document
setEntityResolver     Sets the event callback that will be used to resolve entity references in the document
setErrorHandler       Sets the event callback that will be called when an error is encountered during the parsing of the document
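
As a rough illustration of how the factory settings relate to the query methods in Table 16.3, the following sketch configures a factory explicitly and then inspects the resulting builder. The class name BuilderSetupSketch and the helper newBuilder are hypothetical names introduced for this sketch; they are not part of the chapter's example.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class BuilderSetupSketch {

    // Builds a DocumentBuilder with explicit settings. Both flags default to
    // false in JAXP; they are passed in here only to make the configuration visible.
    static DocumentBuilder newBuilder(boolean namespaceAware, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.setValidating(validating);
            return factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be configured", e);
        }
    }

    public static void main(String[] args) {
        DocumentBuilder builder = newBuilder(true, false);
        System.out.println("Namespace aware: " + builder.isNamespaceAware());
        System.out.println("Validating: " + builder.isValidating());

        // newDocument() returns an empty Document that can be built up in code.
        Document empty = builder.newDocument();
        empty.appendChild(empty.createElement("standards"));
        System.out.println("Root of new document: "
                + empty.getDocumentElement().getTagName());
    }
}
```

The settings are made on the factory rather than on the builder, so a builder reports whatever configuration its factory had at the time newDocumentBuilder was called.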

Parsing the Document

At this point, there is a marked difference in how documents are parsed between the DOM and SAX parsers. Under SAX, a SAXParser must be created and then an XMLReader is created from the parser to actually parse the document. When using the DOM interface, the process is much more streamlined: the location of the document is passed directly to the parse method of the DocumentBuilder, which returns a Document object. (The String version of parse treats its argument as a URI; other overloads accept a File, an InputStream, or an InputSource.) The next step in the example is to use this method to parse our sample "example.xml" XML file.

Document document = documentBuilder.parse("example.xml"); 
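
When the XML is already in memory rather than in a file, one of the other parse overloads can be used. The sketch below wraps a string in an InputSource; the class name ParseFromStringSketch, the helper parseString, and the sample markup (including the root element name example) are assumptions made for illustration, since the chapter's example.xml is not reproduced in this section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class ParseFromStringSketch {

    // Assumed sample markup, loosely modeled on the output in Listing 16.4.
    static final String SAMPLE_XML =
          "<example><standards>"
        + "<standard href=\"http://www.w3.org/DOM/\">Document Object Model(DOM)</standard>"
        + "</standards></example>";

    // Parses XML held in a String by wrapping it in an InputSource.
    static Document parseString(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Covers ParserConfigurationException, SAXException, and IOException.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document document = parseString(SAMPLE_XML);
        System.out.println("Parsed root element: "
                + document.getDocumentElement().getTagName());
    }
}
```

Passing the raw XML text to parse(String) would not work, because that overload expects a URI that locates the document rather than the document's content.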

This Document object is an in-memory representation of the entire XML document. There are various methods for querying the Document object for details about the XML document that has been parsed. (See Chapter 14 if you are interested in more detailed information about these interfaces.) For our example, we start by retrieving the root element of the document. The Document Object Model specification refers to this (somewhat confusingly) as the document element, and it is retrieved with the getDocumentElement() method.

Element root_element = document.getDocumentElement();
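
The distinction between the Document node and its document element can be seen in a small sketch. The class name DocumentVsRootSketch, the helpers parse and describe, and the sample markup are hypothetical and introduced only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class DocumentVsRootSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Shows the Document node's name next to the tag name of its root element.
    static String describe(Document document) {
        Element root = document.getDocumentElement();
        return document.getNodeName() + " -> <" + root.getTagName() + ">";
    }

    public static void main(String[] args) {
        Document document = parse("<example><standards/></example>");
        System.out.println(describe(document));

        // The Document itself is a node whose child is the root element.
        Node parent = document.getDocumentElement().getParentNode();
        System.out.println("Parent of root is the Document node: "
                + (parent.getNodeType() == Node.DOCUMENT_NODE));
    }
}
```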

Now that the root element of the document has been retrieved, it's time to harness the power of the Document Object Model in Java to find out some specific information from our example XML file. If you refer back to the example file at the beginning of the chapter, you will notice that an XML element called <standards> contains several child <standard> elements, each of which names a standard in its text and provides the Web address for that standard in its href attribute. Let's add the ability to our example to print out the information for each of those standards.

First, a list of all the elements with the name "standards" needs to be retrieved. This is done with the DOM method getElementsByTagName, which returns a NodeList that contains all of the matching elements found beneath the element on which it is called.

NodeList standards_list = root_element.getElementsByTagName("standards"); 

For this example, there is only one <standards> element in the document, so the item method of the NodeList can be called to retrieve the first element in the NodeList. Although the Document Object Model specifies that a NodeList contains DOM nodes, in this case we know ahead of time that the Node is actually a DOM Element. Therefore, a Java typecast is used to convert the returned Node into a DOM Element:

Element standards_element = (Element) standards_list.item(0); 
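
The cast above relies on the document containing at least one <standards> element, because item returns null when the index is out of range. A more defensive variant might look like the following sketch; the class name FirstElementSketch and the helper firstElement are hypothetical, as is the sample markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class FirstElementSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the first descendant element with the given tag name,
    // or null when there is no such element.
    static Element firstElement(Element scope, String tagName) {
        NodeList matches = scope.getElementsByTagName(tagName);
        Node first = matches.item(0); // null when the list is empty
        return (first instanceof Element) ? (Element) first : null;
    }

    public static void main(String[] args) {
        Element root = parseRoot("<example><standards><standard/></standards></example>");
        System.out.println("standards found: " + (firstElement(root, "standards") != null));
        System.out.println("missing found: " + (firstElement(root, "missing") != null));
    }
}
```

Returning null keeps the sketch short; a real application might prefer to throw an exception that names the missing element.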

We now have a handle on the <standards> element, which can then be used to navigate to all of its children. In this case, the children are individual elements named <standard>. Therefore, we use the getElementsByTagName() method again, this time to find all <standard> elements:

NodeList children = standards_element.getElementsByTagName("standard"); 

Unlike the previous use of getElementsByTagName, where there was only one element to be retrieved, in this case several different elements need to be visited because the <standards> element contains several <standard> elements. Therefore, the getLength method of the children NodeList object is called, which returns the number of elements contained in the NodeList. This value is then used to cycle through the returned children with a Java for loop:

for (int n=0; n<children.getLength(); n++) { 

Upon each pass through the for loop, the NodeList item method is used to select the current element in the list:

Element standard = (Element)children.item(n); 
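
Because NodeList predates the Java collections framework and cannot be used in a for-each loop, one common approach is to copy its contents into a List first. The following is a minimal sketch of that idea; the class name NodeListLoopSketch, the helper elementsNamed, and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class NodeListLoopSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Copies the elements matched by getElementsByTagName into a List,
    // using the same index-based loop as the chapter's example.
    static List<Element> elementsNamed(Element scope, String tagName) {
        NodeList nodes = scope.getElementsByTagName(tagName);
        List<Element> result = new ArrayList<>();
        for (int n = 0; n < nodes.getLength(); n++) {
            result.add((Element) nodes.item(n));
        }
        return result;
    }

    public static void main(String[] args) {
        Element root = parseRoot(
              "<standards>"
            + "<standard href=\"a\">One</standard>"
            + "<standard href=\"b\">Two</standard>"
            + "<standard href=\"c\">Three</standard>"
            + "</standards>");
        for (Element standard : elementsNamed(root, "standard")) {
            System.out.println(standard.getAttribute("href"));
        }
    }
}
```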

Next we want to print out the name of each standard. We start by notifying the user that the standard's name will be printed:

System.out.println("The standard named: ");

At this point, the process becomes slightly more complicated. The name of each standard is represented as text contained within each <standard> element. Unfortunately, something like a getText method that retrieves all that text is not available, because the Document Object Model specifies that any text contained within an element is represented as child text nodes of that element. Therefore, we need to iterate over the child nodes of each <standard> element, looking for text nodes and then appending them to the result string:

String value = ""; 
// Collect the text beneath each node
NodeList child = standard.getChildNodes();
for(int i = 0; i < child.getLength(); i++ ) {
       Node current_child = child.item(i);
       // If a node is a Text Node then it is part of the standard name
       if( current_child.getNodeType() == Node.TEXT_NODE ) {
               value = value + current_child.getNodeValue();
       }
}
System.out.println(value);

This process is very similar to the one used previously to navigate into the children of the <standard> and <standards> elements. However, one important difference to note is the use of the getNodeType method of the DOM Node, which in this case is used to discover whether the current node is a text node. If it is, the value of the text node (obtained via the getNodeValue method) is appended to the current string. If it is not a text node, it is ignored.
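
Later versions of the DOM interfaces (DOM Level 3, included with Java 5 and later) add a getTextContent method to Node, which may be a convenient alternative where it is available. The sketch below compares it with the manual loop; the class name TextContentSketch, the helpers parseRoot and directText, and the sample markup with a nested <b> element are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class TextContentSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Same technique as the chapter's loop: concatenate only the text nodes
    // that are direct children of the element.
    static String directText(Element element) {
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Element standard = parseRoot(
            "<standard href=\"http://www.saxproject.org/\">"
          + "Simple API for <b>XML</b>(SAX)</standard>");
        System.out.println(directText(standard));       // Simple API for (SAX)
        System.out.println(standard.getTextContent());  // Simple API for XML(SAX)
    }
}
```

The two approaches agree when an element contains only text, as the <standard> elements in this chapter appear to. They differ when the element has child elements, because getTextContent also gathers the text of all descendants.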

Finally, the Web address for each <standard> element is printed. Retrieving attributes from the DOM in Java is much easier than retrieving text. Each Element object has a getAttribute method that, when given the name of an attribute, returns the text value of that attribute. This method is used here to find the "href" attribute for each element and print it to the console.

System.out.println("Can be accessed on the web at: ");
String href = standard.getAttribute("href");
System.out.println(href);
System.out.println();
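
One detail worth knowing is that getAttribute returns an empty string, not null, when the attribute is absent and has no default value. If the difference between a missing attribute and an empty one matters, hasAttribute can be checked first. The following sketch assumes a hypothetical class AttributeLookupSketch and helper attributeOr, along with made-up sample markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the chapter's SimpleDOM example.
class AttributeLookupSketch {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the attribute value, or the fallback when the attribute is absent.
    static String attributeOr(Element element, String name, String fallback) {
        return element.hasAttribute(name) ? element.getAttribute(name) : fallback;
    }

    public static void main(String[] args) {
        Element withHref = parseRoot("<standard href=\"http://www.w3.org/DOM/\">DOM</standard>");
        Element withoutHref = parseRoot("<standard>DOM</standard>");

        System.out.println(attributeOr(withHref, "href", "(no address given)"));
        System.out.println(attributeOr(withoutHref, "href", "(no address given)"));
        // getAttribute alone yields an empty string for the missing attribute.
        System.out.println("[" + withoutHref.getAttribute("href") + "]");
    }
}
```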

The full source code for this example is shown in Listing 16.3, and the output from the example is shown in Listing 16.4.

Listing 16.3. DOM Example in Java
//SimpleDOM.java

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import java.io.IOException;

public class SimpleDOM {

  public static void main(String[] args)
  {

    try
    {
      // Generate a Document Builder Factory that will allow
      // DOM Parsers to be generated
      DocumentBuilderFactory documentBuilderFactory =  DocumentBuilderFactory.newInstance();
      // Using the Document Builder Factory, generate a DOM Parser
      DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

      System.out.println("Start Processing: example.xml");

      // Using the generated DOM parser, parse the example file
      Document document = documentBuilder.parse("example.xml");
          
      // Get the Root Element of the Document
      Element root_element = document.getDocumentElement();

      // Get a list of all elements with the name "standards"
      NodeList standards_list = root_element.getElementsByTagName("standards");
      Element standards_element = (Element) standards_list.item(0);

      // Take the first element with the name "standards" and get all of its
      // children named "standard"
      NodeList children = standards_element.getElementsByTagName("standard");

      // Cycle through the NodeList containing elements named "standard"
      for (int n=0; n<children.getLength(); n++)
      {
        // Get each individual standard element one at a time
        Element standard = (Element)children.item(n);

        System.out.println("The standard named: ");
        String value = "";
        // Collect the text beneath each node
        NodeList child = standard.getChildNodes();
        for(int i = 0; i < child.getLength(); i++ ) {
          Node current_child = child.item(i);
          // If a node is a Text Node then it is part of the standard name
          if( current_child.getNodeType() == Node.TEXT_NODE ) {
            value = value + current_child.getNodeValue();
          }
        }
        System.out.println(value);
        // Print out the standard location, which is contained as an attribute
        System.out.println("Can be accessed on the web at: ");
        String href = standard.getAttribute("href");
        System.out.println(href);
        System.out.println();
      }
      System.out.println("End Processing: example.xml");
    } catch (ParserConfigurationException e)
    {
      System.out.println("Parser Configuration Exception :" + e);
    } catch (SAXException e)
    {
      System.out.println("SAX Exception :" + e);
    } catch (IOException e)
    {
      System.out.println("Input Output Exception :" + e);
    }
  }
}

Listing 16.4. Output from DOM Example
Start Processing: example.xml
The standard named:
Java API for XML Processing(JAXP)
Can be accessed on the web at:
http://java.sun.com/xml/jaxp
The standard named:
Document Object Model(DOM)
Can be accessed on the web at:
http://www.w3.org/DOM/
The standard named:
Simple API for XML(SAX)
Can be accessed on the web at:
http://www.saxproject.org/
End Processing: example.xml
						
