CHAPTER 9

Writing and Reading XML

XML stands for Extensible Markup Language. You might think it would be “eXtensible Markup Language,” but it's not. Odd acronym aside, XML rates inclusion in a book for beginning programmers because, as your software-development career (whether as a hobbyist or a professional) continues, you'll inevitably run into XML in all sorts of places. It's used to store documents, from the contents of a single web page to the contents of entire sets of encyclopedias. It's also used to transmit data between applications, whether the servers running those applications are halfway around the world or in the same room. It's even used (with Cascading Style Sheets) to display information in web browsers. Every company I've worked for over the last dozen years, and every application I've written (at least those applications more serious and substantial than Minesweeper), has made at least some use of XML.

Although a specialized language called XSLT (Extensible Stylesheet Language Transformations) exists specifically for processing XML, Java is also a very popular language for dealing with XML. Also, one of the best and most popular XSLT processors, called Saxon, is coded in Java. Java is especially handy for working with XML because it includes a number of packages intended specifically for processing (reading, writing, and transforming) XML. The two most common packages (largely because they are included in Java) are DOM (Document Object Model) and SAX (Simple API for XML). You can use DOM to read and write XML. SAX only reads (or, more properly, parses) XML. For writing XML, though, you can also create a String object and write that to your file. Done correctly, writing String objects offers the lowest overhead (in both memory and speed) for producing XML. This chapter will cover writing XML from DOM and from String objects, and reading XML with DOM and SAX.

The Structure of XML

Before you get to processing the stuff, you should see what XML looks like and learn a bit about its nature. First off, know that XML, while called a language, isn't a language in the same sense as Java. XML is a storage format, and it offers no processing capabilities of its own. It has no looping structure, no way to specify variables or data types (except that a program might use a bit of XML as a variable or data type, but that's not the same as what Java does). So, XML is really just text that has been organized in a particular way.

The root of any XML document is a single element. That element can have any number of other elements as children, and each of these children can have any number of children, and so on, resulting in a hierarchical structure of arbitrary complexity and depth (which is to say that an XML document can be of any size and have elements nested to any depth). Also, each element can have any number of attributes. However, attributes cannot have children, so most of the content, in most XML documents, comes from the elements.

Before going any further, take a look at the smallest possible XML file.

Listing 9-1. The Smallest Possible XML File

<?xml version="1.0" encoding="UTF-8"?>
<elementName/>

I have worked with systems that had many such files, as each directory in a set of directories meant to contain the output of a complex process had to have at least one file. Consequently, we had a bunch of XML files with content as follows: <?xml version="1.0" encoding="UTF-8"?><placeholder/> You can see the exact syntax shortly. Until then, a more meaningful example will help to clarify things. Here's one of my favorite poems, encoded as an XML document.

Listing 9-2. An Example of XML

<?xml version="1.0" encoding="UTF-8"?>
<poem title="The Great Figure" author="William Carlos Williams">
  <line>Among the rain</line>
  <line>and lights</line>
  <line>I saw the figure 5</line>
  <line>in gold</line>
  <line>on a red</line>
  <line>fire truck</line>
  <line>moving</line>
  <line>tense</line>
  <line>unheeded</line>
  <line>to gong clangs</line>
  <line>siren howls</line>
  <line>and wheels rumbling</line>
  <line>through the dark city</line>
</poem>

The first line, the document specifier (formally called the XML declaration), indicates that this document is an XML document and specifies the version (1.0, which is the most often used version, and suffices for most purposes) and the encoding. Document specifiers always begin with <? and end with ?>. This way, they can't be confused with XML elements. Most systems that can process XML will work with documents that don't have a document specifier, and XML 1.0 itself treats the specifier as recommended rather than required. Even so, your XML document may be rejected by some systems for lacking one, so it's good to get in the habit of always including a document specifier. The encoding indicates how the characters in the content are stored as bytes. UTF-8 is an encoding of Unicode, a very large character set that covers nearly every writing system in use, from English, Greek, Spanish, and Russian to Chinese, Japanese, and Korean. Characters in the basic Latin alphabet take one byte each in UTF-8, and other characters take two to four bytes. For the sake of simplicity, we'll stick to UTF-8 and documents in English.

The next line contains the root element. The first element in any XML file is that document's root element. All other elements, no matter how deeply nested, are descendants of the root. The root element, poem, contains two attributes, title and author. The root element also contains all the line elements, which make up the body of the poem.

Note the syntax for each element. Each one begins with an opening tag (<poem> or <line>) and ends with a closing tag (</poem> or </line>). The basic rule is that the names within the tags have to match (and there are various restrictions about which characters can be used, but just about any English word works). Other than that, opening tags always start with a left angle character (<) and end with a right angle character (>). Ending tags always begin with a left angle character and a forward slash (</) and end with a right angle character (>). Elements can also be empty, in which case they can take one of two forms: a beginning tag and an ending tag with nothing between them, or a special empty element tag. For example, an empty line element can be represented as either <line></line> or <line/>. That second structure provides a handy shortcut that saves some typing. If the poem included a blank line (such as a line between stanzas in a longer poem), you could represent a blank line that way.
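To see that the two forms of an empty element really are equivalent, here is a minimal sketch that parses each form with the DOM parser built into Java. The class name EmptyElementDemo and the parseRoot helper are names invented for this illustration; the chapter's own DOM examples come later.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// EmptyElementDemo is a hypothetical class name used only for this sketch
class EmptyElementDemo {

  // Parse a small XML string and return its root element
  static Element parseRoot(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      return builder.parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      // Covers ParserConfigurationException, SAXException, and IOException
      throw new IllegalStateException("Couldn't parse: " + xml, e);
    }
  }

  public static void main(String[] args) {
    Element longForm = parseRoot("<line></line>");
    Element shortForm = parseRoot("<line/>");
    // Both forms yield an element named "line" with no child nodes
    System.out.println(longForm.getTagName() + " has children: "
        + longForm.hasChildNodes());
    System.out.println(shortForm.getTagName() + " has children: "
        + shortForm.hasChildNodes());
  }
}
```

Both calls report an element named line with no children, so a program reading the document can't tell which form the author typed.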

Note XML is case-sensitive. <POEM>, <Poem>, and <poem> are all different elements, so <POEM></poem> would cause an XML parser to throw an error.
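The following sketch shows that error in practice. It assumes the parser bundled with Java, which reports a mismatched tag by throwing a SAXException (and may also print its own message to the console). The class name CaseSensitivityDemo and the isWellFormed helper are invented for this illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// CaseSensitivityDemo is a hypothetical class name used only for this sketch
class CaseSensitivityDemo {

  // Return true if the parser accepts the string as well-formed XML
  static boolean isWellFormed(String xml) {
    try {
      DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException se) {
      // The parser rejected the content
      return false;
    } catch (Exception e) {
      // ParserConfigurationException or IOException: not an XML problem
      throw new IllegalStateException("Couldn't run the parser", e);
    }
  }

  public static void main(String[] args) {
    // Matching case in both tags is accepted
    System.out.println("<poem></poem> accepted: "
        + isWellFormed("<poem></poem>"));
    // Mismatched case is treated as two different element names
    System.out.println("<POEM></poem> accepted: "
        + isWellFormed("<POEM></poem>"));
  }
}
```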

A poem offers an example of a fairly traditional document encoded as XML. Consider an example of data transmitted between systems as XML.

Listing 9-2. XML As Data

<?xml version="1.0"?>
<soap:Envelope
    xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
    soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
  <soap:Body xmlns:w="http://www.noaa.gov/weather">
    <w:GetTemperature>
      <w:Location>78701</w:Location>
    </w:GetTemperature>
  </soap:Body>
</soap:Envelope>

Look closely at the document specifier. In this case, it doesn't indicate encoding. The default encoding for XML is UTF-8, so you can omit the encoding when you are going to use UTF-8 characters. Then you have a root element named soap:Envelope. The specifier and the root element constitute the minimum content for an XML file, but that wouldn't make a very useful message between systems. So this example also contains a bit of data (the zip code for a particular city). This XML document represents a request for data from one system to another system. (I invented it from scratch, by the way; NOAA may use something else entirely.) In particular, it's a request to get the temperature in Austin, TX. Note the indentation. When XML is meant to be read by humans, it's customarily indented, such that each level of elements is farther to the right than its parent level. When creating a stream to send to another system, all the white space between elements is generally removed, resulting in the whole document being on one line. That's tough to read for a human, but it saves bandwidth, and a computer doesn't find it hard to read.

Note “Document” is the normal way to refer to any instance of XML, whether it encodes an actual document or some arbitrary bit of data. XML originally comes from the publishing industry, and it retains some of that industry's terminology.

SOAP stands for Simple Object Access Protocol. It's a common way for systems to pass data back and forth. One system makes a request, and another system sends a corresponding response. The system that processed the request may then send a request of its own. In this fashion the systems “play tag” with one another, until the data is properly transmitted (or an error condition indicates that the systems should stop trying). Inter-system and inter-process communication can get much more complex than a simple request-response cycle, but that's beyond the scope of this book.

The purpose of this kind of protocol (SOAP is one of many such standards) is to provide separate systems a way to communicate that has as few dependencies as possible on the operating system or language. One system might be a Windows server running an application written in C++, while another system might be a Linux server running a Java application, and yet another system might be a supercomputer running an Ada application. Provided the other system can recognize the request and generate an appropriate response, anything else is irrelevant. This makes SOAP, and similar protocols, very useful to software developers. You can create all kinds of systems and make them talk to one another. The Internet is the plumbing, but SOAP messages, and similar content, constitute the water in the pipes (or the information in the network, to abandon the metaphor).

Requests and responses (the latter, in particular) can be very large, by the way. I once created a system that shared insurance policy information with a national insurance clearing house. The responses in that system were very large, often several megabytes of data. (Honestly, had I designed that system, I would have sent a response that specified a location from which the other system could download a file containing the policy. Huge responses can be problematic. The longer the message, the higher the chance for corrupted data, and it's sometimes useful to store data for a time. It wasn't my decision to make, though.)

The xmlns:soap attribute declares the namespace for SOAP, so that any system receiving the message can recognize what kind of message it is (assuming the system knows about SOAP at all). (XMLNS stands for XML NameSpace.) The encodingStyle attribute (itself a member of the soap namespace) identifies the encoding rules used to serialize the data in the message. The xmlns:w attribute on the soap:Body element declares a (fictional, in this case) namespace for weather information. Note that each namespace has an alias, properly called a prefix (soap or w, in this document). The prefixes save the trouble of typing out the namespace for each element and, more importantly, reduce the number of bytes going down the pipe. The relatively simple poem example has no namespace declaration, which means that its elements don't belong to any namespace. In the SOAP example, you can see that a single XML document can contain elements from multiple namespaces (a Microsoft Word document contains as many as 14 different namespaces, just to show how complex things can get). Namespaces let different organizations use the same element names without trampling one another when they get into the same document. For example, if another organization produced weather information, they'd have their own namespace, to prevent collisions with the noaa namespace.
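To make the relationship between an alias and its namespace concrete, here is a minimal sketch that parses a trimmed-down copy of the SOAP request and asks DOM for each piece of the root element's name. The class name NamespaceDemo and its helper methods are invented for this illustration. Note that a DocumentBuilderFactory ignores namespaces unless you call setNamespaceAware(true).

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// NamespaceDemo is a hypothetical class name used only for this sketch
class NamespaceDemo {

  static final String SOAP_NS = "http://www.w3.org/2001/12/soap-envelope";
  static final String WEATHER_NS = "http://www.noaa.gov/weather";

  // A trimmed-down copy of the request shown earlier, on one line
  static final String REQUEST =
      "<soap:Envelope xmlns:soap=\"" + SOAP_NS + "\">"
      + "<soap:Body xmlns:w=\"" + WEATHER_NS + "\">"
      + "<w:GetTemperature><w:Location>78701</w:Location></w:GetTemperature>"
      + "</soap:Body></soap:Envelope>";

  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Namespace support is off by default, so turn it on
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException("Couldn't parse the request", e);
    }
  }

  // Find Location by namespace and local name, whatever alias was used
  static String findLocation(Document doc) {
    NodeList matches = doc.getElementsByTagNameNS(WEATHER_NS, "Location");
    return matches.getLength() > 0 ? matches.item(0).getTextContent() : null;
  }

  public static void main(String[] args) {
    Document doc = parse(REQUEST);
    Element root = doc.getDocumentElement();
    System.out.println("Alias: " + root.getPrefix());
    System.out.println("Local name: " + root.getLocalName());
    System.out.println("Namespace: " + root.getNamespaceURI());
    System.out.println("Location: " + findLocation(doc));
  }
}
```

Because the lookup uses the namespace rather than the alias, it would still find the element if the sender had chosen weather instead of w as the alias.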

XML and Streams

A stream is a collection of data meant to be read sequentially. That is, a stream is meant to be read one byte at a time. It is generally said that such a block of data is serialized (meaning that it is ready to be transmitted and read serially, which is another way to say one byte at a time). In Java (and in many other languages), XML is processed as streams. Reading XML is done by parsing InputStream objects, and writing XML is generally done by creating StreamResult objects. When creating XML with String objects, the result is often still exported as an OutputStream, as some other process needs to receive a stream to do its part in a larger process. For example, many systems produce large documents. In these systems, you can create XML by using String objects, create an OutputStream from the result, and send that stream as an InputStream to another object, which would produce a PDF file. The PDF file serves as the final document, to be stored on a server, printed, or both.
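As a rough sketch of the stream-in, stream-out idea, the following class parses XML from an InputStream and serializes it again onto an OutputStream. It uses in-memory byte array streams in place of files or network connections, and the class name StreamRoundTrip and its method names are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;

// StreamRoundTrip is a hypothetical class name used only for this sketch
class StreamRoundTrip {

  // Parse XML arriving on an InputStream into a DOM Document
  static Document read(InputStream in) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
  }

  // Serialize a DOM Document onto an OutputStream
  static void write(Document doc, OutputStream out) throws Exception {
    TransformerFactory.newInstance().newTransformer()
        .transform(new DOMSource(doc), new StreamResult(out));
  }

  // Stream in, stream out: return the XML text produced at the far end
  static String roundTrip(String xml) {
    try {
      InputStream in =
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      write(read(in), out);
      return new String(out.toByteArray(), StandardCharsets.UTF_8);
    } catch (Exception e) {
      throw new IllegalStateException("Round trip failed", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("<poem><line>Among the rain</line></poem>"));
  }
}
```

The read and write methods don't care where the streams come from, so the same code would work with a FileInputStream or a stream from a network connection.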

As you saw in Chapter 8, Java uses Stream objects for reading from, and writing to, files. Since you're already processing Stream objects, saving an XML document as a file is a natural and easy task. Reading an existing file as a stream also makes loading an XML file straightforward. So, there are streams in, streams out, and streams between the steps of larger processes. Fortunately, Java makes working with streams easy.

DOM and SAX

DOM (Document Object Model) and SAX (Simple API for XML) have their strengths and weaknesses. As with most data-related problems, knowing which one to use comes down to knowing your data. If you need to work with relatively small documents, DOM works well, as it loads the entire XML data stream into memory, making it fast (again, provided the document is small). SAX, on the other hand, uses only enough memory to process the current element (called a node), which makes it capable of handling documents of any size. I've used SAX to parse the contents of books as long as 2,000 pages. The downside of SAX is that you can't reach much of the document at once, as little of it is in memory. Also, SAX only works with incoming XML documents; it doesn't write XML.

So, remember to use DOM for small XML sources and SAX for larger XML sources. If you're uncertain if the XML input will be large or small, use SAX. Of course, whether a document is large or small depends on how much memory is available for processing the document. If your program can run on a computer with plenty of memory all to itself, you can use DOM to load fairly large documents. However, if your program has to share a server with other processes, or has to run on a small device (such as a phone), memory will be limited and your options will be reduced. Finally, if your application has to process multiple documents at once (perhaps for multiple users), the memory for each process will be greatly reduced. The more constrained the memory available to the application, the more you should lean toward SAX. As an example, I recently worked on an application that would trigger an arbitrary number of transforms to create sets of documents. In practice, each set contained about 15 documents. Also, multiple users could start document-production runs at the same time, leading to as many as 50 documents being processed at the same time, all with 8 MB of RAM. We definitely needed to use SAX.
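As a small preview of why SAX suits large or memory-constrained jobs, here is a sketch that counts the line elements in a document without ever building a tree of it. The only state it keeps is a single counter, no matter how long the input is. The class name LineCounter is invented for this illustration; the SAX API itself is covered later in this chapter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// LineCounter is a hypothetical class name used only for this sketch
class LineCounter extends DefaultHandler {

  // The only thing this handler keeps in memory
  private int count = 0;

  // Called by the parser each time an element starts
  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) {
    if ("line".equals(qName)) {
      count++;
    }
  }

  // Count the line elements in a piece of XML
  static int countLines(String xml) {
    try {
      LineCounter counter = new LineCounter();
      SAXParserFactory.newInstance().newSAXParser().parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
          counter);
      return counter.count;
    } catch (Exception e) {
      throw new IllegalStateException("Couldn't parse the XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<poem><line>Among the rain</line>"
        + "<line>and lights</line></poem>";
    System.out.println("Lines found: " + countLines(xml));
  }
}
```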

Writing XML

As mentioned above, you can write XML with the DOM package or by writing String objects. If you just need to write an XML file, writing strings works well enough (and it is the fastest way to create XML). On the other hand, if you need to pass your XML output to a process that requires an XML header, and perhaps even needs to ensure that the XML conforms to a schema, you might want to consider using DOM.

Before you get to writing XML, you first need a data source to provide the content that you want to turn into XML. Here is a simple class that provides the content of the poem used earlier (“The Great Figure,” by William Carlos Williams).

Listing 9-3. A Poem As a Data Source

package com.bryantcs.examples.xml;

import java.util.ArrayList;

public class Poem {

  private static String title = "The Great Figure";
  private static String author ="William Carlos Williams";
  private static ArrayList<String> lines = new ArrayList<String>();

  static {
    lines.add("Among the rain");
    lines.add("and lights");
    lines.add("I saw the figure 5");
    lines.add("in gold");
    lines.add("on a red");
    lines.add("fire truck");
    lines.add("moving");
    lines.add("tense");
    lines.add("unheeded");
    lines.add("to gong clangs");
    lines.add("siren howls");
    lines.add("and wheels rumbling");
    lines.add("through the dark city");
  }

  public static String getTitle() {
    return title;
  }

  public static String getAuthor() {
    return author;
  }
  public static ArrayList<String> getLines() {
    return lines;
  }
}

As you can see, it's a pretty simple representation of a poem. Notice that it's also entirely static. Some classes consist entirely of static members, but those classes usually define sets of helper methods (string manipulation specialized for a particular application is a common use for that kind of helper class). A class with static data, though, usually indicates that someone hasn't thought through a problem very well. This kind of thing usually belongs in a file. For our purposes, though, this slightly odd class will serve well enough.

Writing XML with DOM

Here's the code for writing an XML file with DOM, given the Poem class as the data source:

Listing 9-4. Writing XML with DOM

package com.bryantcs.examples.xml;

import java.io.File;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class WriteWithDOM {

  public static void main (String args[]) {
    // Create an empty Document
    Document doc = createDocument();

    // Create the XML
    createElements(doc);

    // Create a String representation of the XML
    String xmlContent = createXMLString(doc);

    // Write the XML to a file
    writeXMLToFile(xmlContent);
  }

  // Here's where we create the (empty for now) XML document
  private static Document createDocument() {
    Document doc = null;
    try {
      DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
      DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
      doc =  docBuilder.newDocument();
      doc.setXmlStandalone(true);
    }
    catch(ParserConfigurationException pce) {
      System.out.println("Couldn't create a DocumentBuilder");
      System.exit(1);
    }
    return doc;
  }
  // Here's where we add content to the XML document
  private static void createElements(Document doc) {
    // Create the root element
    Element poem = doc.createElement("poem");
    poem.setAttribute("title", Poem.getTitle());
    poem.setAttribute("author", Poem.getAuthor());

    // Add the root element to the document
    doc.appendChild(poem);

    // Create the child elements
    for (String lineIn : Poem.getLines() ) {
      Element line = doc.createElement("line");
      Text lineText = doc.createTextNode(lineIn);
      line.appendChild(lineText);
      // Add each element to the root element
      poem.appendChild(line);
    }
  }

  // Here's where we convert the DOM object
  // into a String that contains XML
  private static String createXMLString(Document doc) {
    // Transform the DOM to a String
    Transformer transformer = null;
    StringWriter stringWriter = new StringWriter();
    try {
      TransformerFactory transformerFactory =
        TransformerFactory.newInstance();
      transformer = transformerFactory.newTransformer();
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
        "no");
      transformer.setOutputProperty(OutputKeys.INDENT, "yes");

      // Write the XML from the Document object into the StringWriter
      StreamResult result = new StreamResult(stringWriter);
      DOMSource source = new DOMSource(doc);
      transformer.transform(source, result);
    } catch (TransformerConfigurationException e) {
      System.out.println("Couldn't create a Transformer");
      System.exit(1);
    } catch (TransformerException e) {
      System.out.println("Couldn't transform DOM to a String");
      System.exit(1);
    }
    return stringWriter.toString();
  }

  // Here's where we turn the String holding the XML
  // into a file
  private static void writeXMLToFile(String xmlContent) {
    String fileName = "C:" + File.separator + "test"
    + File.separator + "domoutput.xml";
    try {
      File domOutput = new File(fileName);
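      // Passing false replaces any existing file; appending a second
      // copy would leave two root elements, which is not valid XML
      FileOutputStream domOutputStream
          = new FileOutputStream(domOutput, false);
      // Encode the bytes as UTF-8 to match the XML header
      domOutputStream.write(xmlContent.getBytes("UTF-8"));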
      domOutputStream.close();
      System.out.println(fileName + " was successfully written");
    } catch (FileNotFoundException fnfe) {
      System.out.println("Couldn't find a file called " + fileName);
      System.exit(1);
    } catch (IOException ioe) {
      System.out.println("Couldn't write to a file called " + fileName);
      System.exit(1);
    }
  }
}

Tip Use a StringBuilder object to create a string whenever you need to append strings onto other strings. If you use the string concatenation operator (+), each concatenation creates a new String object, and the discarded intermediate String objects stay in memory until the garbage collector reclaims them, which can quickly consume a great deal of memory. Modern JVMs have gotten better about handling this problem, but it remains an issue, and good practice dictates using StringBuilder when you have more than one or two concatenations to do.
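Here is a brief sketch of the difference. Both methods produce the same text, but the StringBuilder version appends to one growing buffer, while the concatenation version builds a new String on every pass through the loop. The class name JoinLinesDemo and its method names are invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// JoinLinesDemo is a hypothetical class name used only for this sketch
class JoinLinesDemo {

  // Preferred: one StringBuilder collects all of the pieces
  static String joinWithBuilder(List<String> lines) {
    StringBuilder sb = new StringBuilder();
    for (String line : lines) {
      sb.append(line).append('\n');
    }
    return sb.toString();
  }

  // Works, but creates a new String object on every iteration
  static String joinWithConcatenation(List<String> lines) {
    String result = "";
    for (String line : lines) {
      result = result + line + "\n";
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("Among the rain", "and lights");
    System.out.print(joinWithBuilder(lines));
    System.out.println("Same result: "
        + joinWithBuilder(lines).equals(joinWithConcatenation(lines)));
  }
}
```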

As you can see, I've carved it up into a few methods to cleanly and clearly separate the parts of the algorithm. That's a practice you'll see many developers follow, and it's good to embrace this when code complexity reaches a certain level. Every programmer has a different threshold for when they think a long method should become multiple methods. My own threshold is pretty low. A method doesn't have to get very long before I start itching to split it. In this case, splitting the code also lets me handle the Exception objects thrown by each step separately.

The process for creating XML with DOM is fairly straightforward:

  1. Create an empty Document object (the top-level DOM object that contains everything else). That's done in the createDocument method.
  2. Create the elements and attributes (and their children, grandchildren, and so on, as needed) and add the elements and attributes to the Document object. The createElements method performs this step.
  3. Convert the contents of the DOM object to a String object. The createXMLString method does this step for you.
  4. Write the String object to the target (a file in this case). The writeXMLToFile method creates your file and puts your XML into the file.
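Steps 3 and 4 can also be combined by handing the Transformer an output stream directly, so that it encodes and writes the bytes itself. The following is a minimal sketch of that variation, not the approach used in Listing 9-4. It writes to a temporary file rather than a fixed path, builds a one-line poem rather than using the Poem class, and the class name WriteDomToFile and its method names are invented for this illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// WriteDomToFile is a hypothetical class name used only for this sketch
class WriteDomToFile {

  // Steps 1 and 2: build a tiny Document holding one line of the poem
  static Document buildPoem() throws ParserConfigurationException {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element poem = doc.createElement("poem");
    poem.setAttribute("title", "The Great Figure");
    doc.appendChild(poem);
    Element line = doc.createElement("line");
    line.setTextContent("Among the rain");
    poem.appendChild(line);
    return doc;
  }

  // Steps 3 and 4 together: the Transformer writes straight to the file
  static void writeToFile(Document doc, File target)
      throws TransformerException, IOException {
    Transformer transformer =
        TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    // try-with-resources closes the stream even if the transform fails
    try (OutputStream out = new FileOutputStream(target)) {
      transformer.transform(new DOMSource(doc), new StreamResult(out));
    }
  }

  // Write to a temporary file and return what ended up in it
  static String writeAndReadBack() {
    try {
      File target = File.createTempFile("domoutput", ".xml");
      target.deleteOnExit();
      writeToFile(buildPoem(), target);
      return new String(Files.readAllBytes(target.toPath()),
          StandardCharsets.UTF_8);
    } catch (Exception e) {
      throw new IllegalStateException("Couldn't write the XML file", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(writeAndReadBack());
  }
}
```

Skipping the intermediate String means the encoding named in the XML header and the encoding of the bytes in the file are both chosen by the Transformer, so they can't disagree.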

Writing XML with Strings

Here's the code to produce the same output by writing out a String object.

Listing 9-5. Writing XML with Strings

package com.bryantcs.examples.xml;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteWithStrings {

  public static void main(String[] args) {
    String xmlContent = createXMLContent();
    writeXMLToFile(xmlContent);
  }

  private static String createXMLContent() {
    // write the first line
    StringBuilder sb = new StringBuilder();
    sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    sb.append("<poem title=\"");
    sb.append(Poem.getTitle());
    sb.append("\" author=\"");
    sb.append(Poem.getAuthor());
    sb.append("\">\n");
    // write the middle lines
    for (String lineIn : Poem.getLines()) {
      sb.append("<line>");
      sb.append(lineIn);
      sb.append("</line>\n");
    }
    // write the last line
    sb.append("</poem>");
    return sb.toString();
  }

  private static void writeXMLToFile(String xmlContent) {
    String fileName = "C:" + File.separator + "test"
    + File.separator + "domoutput.xml";
    try {
      File domOutput = new File(fileName);
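      // Passing false replaces any existing file; appending a second
      // copy would leave two root elements, which is not valid XML
      FileOutputStream domOutputStream
          = new FileOutputStream(domOutput, false);
      // Encode the bytes as UTF-8 to match the XML header
      domOutputStream.write(xmlContent.getBytes("UTF-8"));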
      domOutputStream.close();
      System.out.println(fileName + " was successfully created");
    } catch (FileNotFoundException fnfe) {
      System.out.println("Couldn't find a file called " + fileName);
      System.exit(1);
    } catch (IOException ioe) {
      System.out.println("Couldn't write to a file called " + fileName);
      System.exit(1);
    }
  }
}

As you can see, the code is substantially simpler and easier to follow. It also performs more quickly. Again, it doesn't have some of the output features from DOM (you have to write the XML header yourself, for example, and nothing checks that the result is well-formed), but it works if you just need a simple XML document.
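One thing DOM does for you that string-based code has to do for itself is escaping. If a line of the poem contained an ampersand or a left angle character, appending it unchanged would produce a file that a parser rejects. Here is a minimal sketch of a helper that replaces the five characters XML reserves with their predefined entity references. The class name XmlEscaper and the escape method are invented for this illustration and are not part of Listing 9-5.

```java
// XmlEscaper is a hypothetical class name used only for this sketch
class XmlEscaper {

  // Replace the five reserved characters with XML's predefined entities
  static String escape(String raw) {
    StringBuilder sb = new StringBuilder(raw.length());
    for (int i = 0; i < raw.length(); i++) {
      char c = raw.charAt(i);
      switch (c) {
        case '&':
          sb.append("&amp;");
          break;
        case '<':
          sb.append("&lt;");
          break;
        case '>':
          sb.append("&gt;");
          break;
        case '"':
          sb.append("&quot;");
          break;
        case '\'':
          sb.append("&apos;");
          break;
        default:
          sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // prints <line>Fish &amp; chips &lt; $5</line>
    System.out.println("<line>" + escape("Fish & chips < $5") + "</line>");
  }
}
```

In a string-based writer, you would wrap each piece of data in a call like this before appending it, both for element content and for attribute values.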

Reading XML

To read XML, you can use either DOM or SAX. As mentioned earlier in this chapter, DOM is handy when you can be sure that your XML content will fit into the memory you have available. However, DOM fails when the input is too large. SAX, on the other hand, can handle any amount of input. For SAX, you need only as much memory as the largest element needs (usually not much, unless you're doing something such as processing large images or items where a single element can contain a large amount of data).

Reading XML with DOM

Here's the source for a program that reads XML with DOM. For this program to work, you need to create a file named poemsource.xml and put it in your C:\test (on Windows) or C:/test (on Unix or Linux) directory. You can use the contents of the domoutput.xml file as the contents of the poemsource.xml file.

Listing 9-6. Reading XML with DOM

package com.bryantcs.examples.xml;

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class ReadWithDOM {

  public static void main(String[] args) {
    String fileName = "C:" + File.separator + "test"
    + File.separator + "poemsource.xml";
    writeFileContentsToConsole(fileName);
  }

  // Write the contents of the file to the console
  private static void writeFileContentsToConsole(String fileName) {
    // Create a DOM Document object
    Document doc = createDocument(fileName);
    // Get the root element
    Element root = doc.getDocumentElement();
    // Create a StringBuilder object that describes the root element
    StringBuilder sb = new StringBuilder();
    sb.append("The root element is named: \"" + root.getNodeName() + "\"");
    sb.append(" and has the following attributes: ");
    NamedNodeMap attributes = root.getAttributes();
    for (int i = 0; i < attributes.getLength(); i++) {
      Node thisAttribute = attributes.item(i);
      sb.append(thisAttribute.getNodeName());
      sb.append(" (\"" + thisAttribute.getNodeValue() + "\")");
      if (i < attributes.getLength() - 1) {
        sb.append(", ");
      }
    }
    // Write the description of the root element to the console
    System.out.println(sb);

    // Work through the children of the root
    // First, get a list of the child nodes
    NodeList nodes = doc.getElementsByTagName("line");
    for (int i = 0; i < nodes.getLength(); i++) {
      // Process each element in turn
      Element element = (Element) nodes.item(i);
      // By writing its name and content to the console (System.out)
      System.out.println("Found an element named \"" +
        element.getTagName() + "\"" +
        " with the following content: \"" +
        element.getTextContent() + "\"");
    }
  }

  // Create a DOM Document object from a file
  private static Document createDocument(String fileName) {
    Document doc = null;
    try {
      // Get the file
      File xmlFile = new File(fileName);
      // Create document builder factory
      DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
      // Create a document builder object
      DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
      // Load the document by parsing the file with the document builder
      doc = docBuilder.parse(xmlFile);
      // Indicate that this document is self-contained
      doc.setXmlStandalone(true);
    }
    // Deal with the possible exceptions
    catch (IOException ioe) {
      System.out.println("Couldn't open file: " + fileName);
      System.exit(1);
    }
    catch (SAXException se) {
      System.out.println("Couldn't parse the XML file");
      System.exit(1);
    }
    catch(ParserConfigurationException pce) {
      System.out.println("Couldn't create a DocumentBuilder");
      System.exit(1);
    }
    // Finally return the Document object
    // that we built from the file
    return doc;
  }
}

In this case, we just create a Document object, read through each line of the input, and describe the content in the console. Naturally, you'll probably want to do something more than describe your input in the console, but this example shows you how to read a file. One thing to note is that each Element object is really a Node object (the Element interface extends the Node interface). Due to the way DOM has been implemented, you sometimes need to work with both Element objects and Node objects, as I had to do here when working with the attribute values.
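The distinction between Node and Element matters most when you walk an element's children yourself. In an indented document, the white space between elements shows up as Text nodes, so a list of child nodes holds more than just elements. The following sketch checks each node's type before casting it to Element. The class name ChildNodeDemo and its method names are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// ChildNodeDemo is a hypothetical class name used only for this sketch
class ChildNodeDemo {

  // An indented document: the line breaks and spaces become Text nodes
  static final String XML =
      "<poem>\n  <line>Among the rain</line>\n"
      + "  <line>and lights</line>\n</poem>";

  static Element parseRoot(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException("Couldn't parse the XML", e);
    }
  }

  // Keep only the children that are elements, skipping the Text nodes
  static List<Element> elementChildren(Element parent) {
    List<Element> result = new ArrayList<Element>();
    NodeList children = parent.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      // Only cast once we know the node really is an element
      if (child.getNodeType() == Node.ELEMENT_NODE) {
        result.add((Element) child);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Element root = parseRoot(XML);
    System.out.println("Child nodes: " + root.getChildNodes().getLength());
    System.out.println("Child elements: " + elementChildren(root).size());
  }
}
```

For this input, the root has five child nodes (three runs of white space and two line elements), but only two of them are elements.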

Reading XML with SAX

SAX uses an interface called ContentHandler to expose parsing events that you can then intercept in your own code to do whatever processing you want to do for each parsing event. The SAX packages also provide a default implementation of ContentHandler, called DefaultHandler. DefaultHandler does nothing with each event, because doing nothing is the default behavior. However, you can override the methods in DefaultHandler to do whatever you like. The advantage of extending DefaultHandler is that you can override just the methods you care about and leave the rest alone. In the example I've used here, I didn't need many of the methods in DefaultHandler, so I didn't override them.

If you look at the names of the methods, you can see why SAX uses so little memory to process XML. It triggers an event for the beginning and end of each part of an XML document, be it the document itself or an element. So, all the parser has to put in memory is the name of the current element and some other details about it (such as its attributes). It doesn't have to put the element's content into memory until it gets to the characters method, which is the method that handles an element's character content. Most elements don't have vast amounts of text content (one exception is when someone stores an image in an XML element, as Word documents do), so the memory used to process the text usually isn't much.
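One detail of the characters method is worth a small sketch: a SAX parser is allowed to deliver the text of a single element in more than one call to characters. A handler that needs the complete text can collect the pieces in a StringBuilder between the start and end events for the element. The class name LineTextHandler and the readLines method are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// LineTextHandler is a hypothetical class name used only for this sketch
class LineTextHandler extends DefaultHandler {

  private final List<String> lines = new ArrayList<String>();
  // Non-null only while the parser is inside a line element
  private StringBuilder current;

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) {
    if ("line".equals(qName)) {
      current = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] content, int start, int length) {
    // May be called several times for one element, so append each piece
    if (current != null) {
      current.append(content, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("line".equals(qName)) {
      lines.add(current.toString());
      current = null;
    }
  }

  // Return the text of every line element in the XML
  static List<String> readLines(String xml) {
    try {
      LineTextHandler handler = new LineTextHandler();
      SAXParserFactory.newInstance().newSAXParser().parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
          handler);
      return handler.lines;
    } catch (Exception e) {
      throw new IllegalStateException("Couldn't parse the XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<poem><line>Among the rain</line>"
        + "<line>and lights</line></poem>";
    for (String line : readLines(xml)) {
      System.out.println(line);
    }
  }
}
```

Even here, the handler holds only the text of the element it is currently inside (plus whatever results it chooses to keep), which is why the memory cost stays small.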

To show you how to read a simple XML document and describe its contents in the console, I first created a class (called XMLToConsoleHandler) that extends DefaultHandler and overrides the handful of methods I need to use when capturing the contents of an XML file. Here's the XMLToConsoleHandler class:

Listing 9-7. XMLToConsoleHandler

package com.bryantcs.examples.xml;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

import org.xml.sax.helpers.DefaultHandler;

public class XMLToConsoleHandler extends DefaultHandler {

  // The characters method handles the actual content of an element
  @Override
  public void characters(char[] content, int start, int length) throws SAXException {
    // Describe the content in the console
    System.out.println("Found content: " + new String(content, start, length));
  }

  // The endDocument method lets us do something
  // when the parser reaches the end of the document
  @Override
  public void endDocument() throws SAXException {
    // Announce in the console that we found the end of the document
    System.out.println("Found the end of the document");
  }

  // The endElement method lets us do something
  // when the parser reaches the end of an element
  @Override
  public void endElement(String uri, String localName, String qName)
      throws SAXException {
    // Announce in the console that we found the end of an element
    System.out.println("Found the end of an element named \"" + qName + "\"");
  }

  // The startDocument method lets us do something
  // when we find the top of the document
  @Override
  public void startDocument() throws SAXException {
    // Announce in the console that we found the beginning of the document
    System.out.println("Found the start of the document");
  }

  // The startElement method lets us do something
  // when we reach the beginning of an element
  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) throws SAXException {
    // Create a StringBuilder object to contain our description of this element
    StringBuilder sb = new StringBuilder();
    // Add the name of the element
    sb.append("Found the start of an element named \"" + qName + "\"");
    // See if the element has any attributes
    if (attributes != null && attributes.getLength() > 0) {
      sb.append(" with attributes named ");
      // If we do find attributes, describe each one
      for (int i = 0; i < attributes.getLength(); i++) {
        // Use the qualified name, as we do for elements; a parser that
        // isn't namespace-aware may report an empty local name
        String attributeName = attributes.getQName(i);
        String attributeValue = attributes.getValue(i);
        sb.append("\"" + attributeName + "\"");
        sb.append(" (value = ");
        sb.append("\"" + attributeValue + "\"");
        sb.append(")");

        // If we're not at the end of the attributes,
        // add a comma, for proper formatting
        if (i < attributes.getLength() - 1) {
          sb.append(", ");
        }
      }
    }
    // Describe the element in the console
    System.out.println(sb.toString());
  }
}

Again, we're using a StringBuilder to avoid creating an excess of String objects in memory. The only complexity comes when we work through any attributes that may be present, and most of the code is really just for “pretty printing” (a phrase that programmers often use when referring to code that formats output to be easily read by humans).

Let's look at the class that uses the XMLToConsoleHandler class to write to the console. Again, you need to have a file named poemsource.xml in the C:\test (on Windows) or C:/test (on Unix or Linux) directory. You can use the contents of the domoutput.xml file as the contents of the poemsource.xml file. Here's that class:

Listing 9-8. ReadWithSAX

package com.bryantcs.examples.xml;

import java.io.File;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.SAXException;

public class ReadWithSAX {

  public static void main(String[] args) {
    String fileName = "C:" + File.separator + "test"
        + File.separator + "poemsource.xml";
    getFileContents(fileName);
  }

  private static void getFileContents(String fileName) {
    try {
      // Make an instance of our handler
      XMLToConsoleHandler handler = new XMLToConsoleHandler();
      // Get a parser factory
      SAXParserFactory factory = SAXParserFactory.newInstance();
      // Get a parser
      SAXParser saxParser = factory.newSAXParser();
      // And now parse the file with our handler. We wrap the name in a File
      // object because the parse method that accepts a String expects a URI,
      // not a file system path
      saxParser.parse(new File(fileName), handler);

    // Deal with the possible exceptions
    } catch (IOException ioe) {
      System.out.println("Couldn't open " + fileName + " for parsing");
    } catch (ParserConfigurationException pce) {
      System.out.println("Failed to create a SAX parser");
    } catch (SAXException saxe) {
      System.out.println("Failed to parse an XML file");
    }
  }
}

Simple, isn't it? One of the joys of SAX is that it's simple to implement. Create a handler, get a parser from the factory, and pass your input and handler to the parser. Consequently, the exception-handling code is longer than the code that does the work. Of course, this arrangement is really masking the fact that the complexity is in the handler class. Still, carving up your code so that complexity is isolated to a single class is exactly the right way to use an object-oriented language such as Java.

A Word about Factory Classes

You may have noticed that to get a DocumentBuilder object, you had to use a DocumentBuilderFactory object. Similarly, to get a SAXParser object, you had to use a SAXParserFactory object. The factory pattern is often used in Java (and other object-oriented languages, such as C++) to permit the creation of objects that have varying attributes. For example, the SAXParserFactory class lets you supply a Schema object (through its setSchema method), so that the parsers it creates check that the XML conforms to a schema (which is a definition of what a set of data should contain).
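Here's a small sketch of a factory producing parsers with a feature turned on or off. It uses the factory's setNamespaceAware method, which controls whether the parsers it creates understand XML namespaces. The class name NamespaceOptionDemo, the method describeRoot, and the urn:example:poems namespace are all made up for this example.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: the same handler, fed by differently configured parsers
class NamespaceOptionDemo extends DefaultHandler {

  private boolean rootSeen = false;
  private String rootUri;
  private String rootLocalName;

  // Remember what the parser reports for the first (root) element only
  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes attributes) {
    if (!rootSeen) {
      rootSeen = true;
      rootUri = uri;
      rootLocalName = localName;
    }
  }

  // Configures the factory, gets a parser from it, and describes the root element
  static String describeRoot(String xml, boolean namespaceAware) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // The feature is set on the factory, not on the parser
    factory.setNamespaceAware(namespaceAware);
    SAXParser parser = factory.newSAXParser();

    NamespaceOptionDemo handler = new NamespaceOptionDemo();
    parser.parse(new InputSource(new StringReader(xml)), handler);
    return "uri=[" + handler.rootUri + "] localName=[" + handler.rootLocalName + "]";
  }

  public static void main(String[] args) throws Exception {
    String xml = "<p:poem xmlns:p=\"urn:example:poems\"/>";
    System.out.println("Namespace-aware:     " + describeRoot(xml, true));
    System.out.println("Not namespace-aware: " + describeRoot(xml, false));
  }
}
```

With the feature turned on, the parser reports the namespace URI and the local name (poem) of the root element. With it turned off, those two values are typically empty, which is why the handler in Listing 9-7 relies on qName instead.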

The factory pattern is a handy way to present a group of very similar objects that vary only by having some features turned on or off. Otherwise, you'd have to have a class for every possible combination of features. In some cases, that would be a lot of very similar classes. The factory pattern offers an easy-to-use and easy-to-understand solution to that problem. Consequently, Java has a number of factory objects in its standard libraries. You can also create your own factory classes. When you find yourself needing to create many very similar (but slightly different) objects, consider creating a factory for them.
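To give you a feel for writing your own factory, here's a minimal sketch. Everything in it (the ElementFormatterFactory class, the ElementFormatter interface, and the two features) is invented for this example. The factory holds two on/off settings and hands out formatter objects that behave according to whatever settings were in place when they were created, so you don't need a separate class for each of the four combinations.

```java
// Hypothetical factory for objects that turn a name and some text into an XML element
class ElementFormatterFactory {

  // The kind of object this factory creates
  interface ElementFormatter {
    String format(String name, String text);
  }

  private boolean includeDeclaration = false;
  private boolean trailingNewline = false;

  // Turn the XML declaration on or off for formatters created after this call
  void setIncludeDeclaration(boolean includeDeclaration) {
    this.includeDeclaration = includeDeclaration;
  }

  // Turn the trailing newline on or off for formatters created after this call
  void setTrailingNewline(boolean trailingNewline) {
    this.trailingNewline = trailingNewline;
  }

  // Creates a formatter that captures the current settings
  ElementFormatter newFormatter() {
    final boolean declaration = includeDeclaration;
    final boolean newline = trailingNewline;
    return (name, text) -> {
      StringBuilder sb = new StringBuilder();
      if (declaration) {
        sb.append("<?xml version=\"1.0\"?>");
      }
      // For brevity, this sketch assumes the text contains no characters
      // (such as < or &) that would need to be escaped
      sb.append('<').append(name).append('>');
      sb.append(text);
      sb.append("</").append(name).append('>');
      if (newline) {
        sb.append('\n');
      }
      return sb.toString();
    };
  }

  public static void main(String[] args) {
    ElementFormatterFactory factory = new ElementFormatterFactory();
    ElementFormatter plain = factory.newFormatter();

    factory.setIncludeDeclaration(true);
    ElementFormatter withDeclaration = factory.newFormatter();

    System.out.println(plain.format("line", "Roses are red"));
    System.out.println(withDeclaration.format("line", "Roses are red"));
  }
}
```

Notice that changing a setting on the factory affects only the formatters created afterward. The plain formatter keeps working the way it did when it was made.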

Summary

Well, that was a whirlwind tour of how to use Java to work with XML. Several good books have been written about the subject, as there's a great deal more complexity than what's been presented here. Still, this chapter should prepare you for when you have to read and write XML files or streams.

In particular, we covered the basics of Java's two main ways to deal with XML: the Document Object Model (DOM) and the Simple API for XML (SAX). You learned that DOM offers great performance because it loads the entire document into memory. However, loading the entire document into memory is also DOM's biggest problem, as large documents may exceed the available memory. SAX, on the other hand, can handle any amount of XML (including gigabyte-size streams of data), but it requires making a custom handler class for each kind of XML document.

Finally, you looked at the nature of factory classes and learned about why they exist, and when you might want to use them.
