XML stands for Extensible Markup Language. You might think it would be “eXtensible Markup Language,” but it's not. Odd acronym aside, XML rates inclusion in a book for beginning programmers because, as your software-development career (whether as a hobbyist or a professional) continues, you'll inevitably run into XML in all sorts of places. It's used to store documents, from the contents of a single web page to the contents of entire sets of encyclopedias. It's also used to transmit data between applications, whether the servers running those applications are halfway around the world or in the same room. It's even used (with Cascading Style Sheets) to display information in web browsers. Every company I've worked for over the last dozen years, and every application I've written (at least those applications more serious and substantial than Minesweeper), has made at least some use of XML.
Although a specialized language called XSLT (Extensible Stylesheet Language Transformations) exists specifically for processing XML, Java is also a very popular language for dealing with XML. Also, one of the best and most popular XSLT processors, called Saxon, is coded in Java. Java is especially handy for working with XML because it includes a number of packages intended specifically for processing (reading, writing, and transforming) XML. The two most common packages (largely because they are included in Java) are DOM (Document Object Model) and SAX (Simple API for XML). You can use DOM to read and write XML. SAX only reads (or, more properly, parses) XML. For writing XML, though, you can also create a String object and write that to your file. Done correctly, writing String objects offers the lowest overhead (in both memory and speed) for producing XML. This chapter will cover writing XML from DOM and from String objects, and reading XML with DOM and SAX.
Before you get to processing the stuff, you should see what XML looks like and learn a bit about its nature. First off, know that XML, while called a language, isn't a language in the same sense as Java. XML is a storage format, and it offers no processing capabilities of its own. It has no looping structure, no way to specify variables or data types (except that a program might use a bit of XML as a variable or data type, but that's not the same as what Java does). So, XML is really just text that has been organized in a particular way.
The root of any XML document is a single element. That element can have any number of other elements as children, and each of these children can have any number of children, and so on, resulting in a hierarchical structure of arbitrary complexity and depth (which is to say that an XML document can be of any size and have elements nested to any depth). Also, each element can have any number of attributes. However, attributes cannot have children, so most of the content, in most XML documents, comes from the elements.
Before going any further, take a look at the smallest possible XML file.
<?xml version="1.0" encoding="UTF-8"?>
<elementName/>
I have worked with systems that had many such files, as each directory in a set of directories meant to contain the output of a complex process had to have at least one file. Consequently, we had a bunch of XML files with content as follows: <?xml version="1.0" encoding="UTF-8"?><placeholder/>
You can see the exact syntax shortly. Until then, a more meaningful example will help to clarify things. Here's one of my favorite poems, encoded as an XML document.
<?xml version="1.0" encoding="UTF-8"?>
<poem title="The Great Figure" author="William Carlos Williams">
<line>Among the rain</line>
<line>and lights</line>
<line>I saw the figure 5</line>
<line>in gold</line>
<line>on a red</line>
<line>fire truck</line>
<line>moving</line>
<line>tense</line>
<line>unheeded</line>
<line>to gong clangs</line>
<line>siren howls</line>
<line>and wheels rumbling</line>
<line>through the dark city</line>
</poem>
The first line, the document specifier (formally called the XML declaration), indicates that this document is an XML document and specifies the version (1.0, which is the most often used version, and suffices for most purposes) and the encoding. Document specifiers always begin with <? and end with ?>. This way, they can't be confused with XML elements. The XML 1.0 specification makes the document specifier optional, so most systems that can process XML will work with documents that don't have one. Some systems reject such documents anyway, though, so it's good to get in the habit of always including a document specifier. The encoding indicates how the characters in the content are stored as bytes. UTF-8 is an encoding of Unicode, a very large character set that covers nearly every written language, from English, Greek, Spanish, and Russian to Chinese and Japanese. Some writing systems (Chinese and Japanese, for example) use thousands of distinct characters, which can make their text trickier to manipulate. For the sake of simplicity, we'll stick to UTF-8 and documents in English.
The next line contains the root element. The first element in any XML file is that document's root element. All other elements, no matter how deeply nested, are descendants of the root. The root element, poem, contains two attributes, title and author. The root element also contains all the line elements, which make up the body of the poem.
Note the syntax for each element. Each one begins with an opening tag (<poem> or <line>) and ends with a closing tag (</poem> or </line>). The basic rule is that the names within the tags have to match (and there are various restrictions about which characters can be used, but just about any English word works). Other than that, opening tags always start with a left angle character (<) and end with a right angle character (>). Closing tags always begin with a left angle character and a forward slash (</) and end with a right angle character (>). Elements can also be empty, in which case they can take one of two forms: an opening tag and a closing tag with nothing between them, or a special empty element tag. For example, an empty line element can be represented as either <line></line> or <line/>. That second structure provides a handy shortcut that saves some typing. If the poem included a blank line (such as a line between stanzas in a longer poem), you could represent a blank line that way.
Note XML is case-sensitive. <POEM>, <Poem>, and <poem> are all different elements, so <POEM></poem> would cause an XML parser to throw an error.
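To see the matching-name rule in action, here is a minimal sketch that hands a few snippets to the DOM parser included with Java and reports whether each one parses. The class name WellFormedCheck and the method isWellFormed are made up for this illustration; the parser classes themselves are covered later in the chapter.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, written only for this illustration
public class WellFormedCheck {

    // Returns true if the parser accepts the text, false if it rejects it
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException se) {
            // The parser found a problem, such as mismatched tag names
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<poem></poem>")); // true
        System.out.println(isWellFormed("<POEM></poem>")); // false
        System.out.println(isWellFormed("<line></line>")); // true
        System.out.println(isWellFormed("<line/>"));       // true
    }
}
```

Both forms of the empty line element pass, while the tags that differ only in case are rejected.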
A poem offers an example of a fairly traditional document encoded as XML. Consider an example of data transmitted between systems as XML.
<?xml version="1.0"?>
<soap:Envelope
xmlns:soap="http://www.w3.org/2001/12/soap-envelope"
soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding">
<soap:Body xmlns:w="http://www.noaa.gov/weather">
<w:GetTemperature>
<w:Location>78701</w:Location>
</w:GetTemperature>
</soap:Body>
</soap:Envelope>
Look closely at the document specifier. In this case, it doesn't indicate encoding. The default encoding for XML is UTF-8, so you can omit the encoding when your document uses UTF-8 characters. Then you have a root element named soap:Envelope. The specifier and the root element constitute the minimum content for an XML file, but that wouldn't make a very useful message between systems. So this example also contains a bit of data (the zip code for a particular city).
This XML document represents a request for data from one system to another system. (I invented it from scratch, by the way; NOAA may use something else entirely.) In particular, it's a request to get the temperature in Austin, TX. Note the indentation. When XML is meant to be read by humans, it's customarily indented, such that each level of elements is farther to the right than its parent level. When creating a stream to send to another system, all the white space between elements is generally removed, resulting in the whole document being on one line. That's tough to read for a human, but it saves bandwidth, and a computer doesn't find it hard to read.
Note “Document” is the normal way to refer to any instance of XML, whether it encodes an actual document or some arbitrary bit of data—XML originally comes from the publishing industry, and it retains some of that industry's terminology.
SOAP stands for Simple Object Access Protocol. It's a common way for systems to pass data back and forth. One system makes a request, and another system sends a corresponding response, which may trigger yet another request. In this fashion the systems “play tag” with one another, until the data is properly transmitted (or an error condition indicates that the systems should stop trying). Inter-system and inter-process communication can get much more complex than a simple request-response cycle, but that's beyond the scope of this book.
The purpose of this kind of protocol (SOAP is one of many such standards) is to provide separate systems a way to communicate that has as few dependencies as possible on the operating system or language. One system might be a Windows server running an application written in C++, while another system might be a Linux server running a Java application, and yet another system might be a supercomputer running an Ada application. Provided the other system can recognize the request and generate an appropriate response, anything else is irrelevant. This makes SOAP, and similar protocols, very useful to software developers. You can create all kinds of systems and make them talk to one another. The Internet is the plumbing, but SOAP messages, and similar content, constitute the water in the pipes (or the information in the network, to abandon the metaphor). Requests and responses (the latter, in particular) can be very large, by the way. I once created a system that shared insurance policy information with a national insurance clearing house. The responses in that system were very large, often several megabytes of data. (Honestly, had I designed that system, I would have sent a response that specified a location from which the other system could download a file containing the policy. Huge responses can be problematic. The longer the message, the higher the chance for corrupted data, and it's sometimes useful to store data for a time. It wasn't my decision to make, though.)
The xmlns:soap attribute specifies the namespace for SOAP, so that any system receiving it can recognize what kind of message it is (assuming the system knows about SOAP at all). (XMLNS stands for XML NameSpace.) The encodingStyle attribute (itself a member of the soap namespace) specifies the encoding rules used for the data in the message. The xmlns:w attribute on the soap:Body element specifies a (fictional, in this case) namespace for weather information. Note that each namespace has an alias (soap or w, in this document). The aliases save the trouble of typing out the namespace for each element and, more importantly, reduce the number of bytes going down the pipe. The relatively simple poem example has no namespace declaration, which means that its elements don't belong to any namespace. In the SOAP example, you can see that a single XML document can contain elements from multiple namespaces (a Microsoft Word document contains as many as 14 different namespaces, just to show how complex things can get). Namespaces let different organizations use the same element names without trampling one another when they get into the same document. For example, if another organization produced weather information, they'd have their own namespace, to prevent collisions with the NOAA namespace (the one aliased as w here).
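As a rough sketch of how a program sees namespaces and aliases, the following code parses a trimmed copy of the request with a namespace-aware DOM parser and looks up the Location element by namespace rather than by alias. The class name NamespaceDemo and the method findLocation are invented for this example, and the weather namespace is the same fictional one used in the request.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class, written only for this illustration
public class NamespaceDemo {

    // The fictional weather namespace from the sample request
    static final String WEATHER_NS = "http://www.noaa.gov/weather";

    static final String REQUEST =
        "<soap:Envelope xmlns:soap=\"http://www.w3.org/2001/12/soap-envelope\">"
        + "<soap:Body xmlns:w=\"http://www.noaa.gov/weather\">"
        + "<w:GetTemperature><w:Location>78701</w:Location></w:GetTemperature>"
        + "</soap:Body></soap:Envelope>";

    // Finds the Location element by namespace and local name, not by alias
    public static Element findLocation(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace handling is off by default, so turn it on
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagNameNS(WEATHER_NS, "Location");
            return (Element) matches.item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Couldn't parse the request", e);
        }
    }

    public static void main(String[] args) {
        Element location = findLocation(REQUEST);
        System.out.println("Alias: " + location.getPrefix());
        System.out.println("Namespace: " + location.getNamespaceURI());
        System.out.println("Local name: " + location.getLocalName());
        System.out.println("Content: " + location.getTextContent());
    }
}
```

Because the lookup uses the namespace itself, it would still work if the sender chose a different alias than w.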
A stream is a collection of data meant to be read sequentially. That is, a stream is meant to be read one byte at a time. It is generally said that such a block of data is serialized (meaning that it is ready to be transmitted and read serially, which is another way to say one byte at a time).
In Java (and in many other languages), XML is processed as streams. Reading XML is done by parsing InputStream objects, and writing XML is generally done by creating StreamResult objects. When creating XML with String objects, the result is often still exported as an OutputStream, as some other process needs to receive a stream to do its part in a larger process. For example, many systems produce large documents. In these systems, you can create XML by using String objects, create an OutputStream from the result, and send that stream as an InputStream to another object, which would produce a PDF file. The PDF file serves as the final document, to be stored on a server, printed, or both.
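The hand-off from an OutputStream to an InputStream can be sketched in a few lines. This example uses in-memory byte array streams to stand in for whatever connects the two steps in a real system, and the consumer simply reports the root element's name rather than producing a PDF. The class and method names (StreamHandOff, produce, rootName) are assumptions made for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class, written only for this illustration
public class StreamHandOff {

    // Producer side: write XML held in a String to an OutputStream
    public static byte[] produce(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        out.write(bytes, 0, bytes.length);
        return out.toByteArray();
    }

    // Consumer side: read XML from an InputStream and name its root element
    public static String rootName(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(in);
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("Couldn't read the stream", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = produce("<poem><line>Among the rain</line></poem>");
        // The bytes that came out of one stream go into the next one
        System.out.println(rootName(new ByteArrayInputStream(bytes))); // poem
    }
}
```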
As you saw in Chapter 8, Java uses Stream objects for reading from, and writing to, files. Since you're already processing Stream objects, saving an XML document as a file is a natural and easy task. Reading an existing file as a stream also makes loading an XML file straightforward. So, there are streams in, streams out, and streams between the steps of larger processes. Fortunately, Java makes working with streams easy.
DOM (Document Object Model) and SAX (Simple API for XML) have their strengths and weaknesses. As with most data-related problems, knowing which one to use comes down to knowing your data. If you need to work with relatively small documents, DOM works well, as it loads the entire XML data stream into memory, making it fast (again, provided the document is small). SAX, on the other hand, uses only enough memory to process the current element (called a node), which makes it capable of handling documents of any size. I've used SAX to parse the contents of books as long as 2,000 pages. The downside of SAX is that you can't reach much of the document at once, as little of it is in memory. Also, SAX only works with incoming XML documents; it doesn't write XML.
So, remember to use DOM for small XML sources and SAX for larger XML sources. If you're uncertain if the XML input will be large or small, use SAX. Of course, whether a document is large or small depends on how much memory is available for processing the document. If your program can run on a computer with plenty of memory all to itself, you can use DOM to load fairly large documents. However, if your program has to share a server with other processes, or has to run on a small device (such as a phone), memory will be limited and your options will be reduced. Finally, if your application has to process multiple documents at once (perhaps for multiple users), the memory for each process will be greatly reduced. The more constrained the memory available to the application, the more you should lean toward SAX. As an example, I recently worked on an application that would trigger an arbitrary number of transforms to create sets of documents. In practice, each set contained about 15 documents. Also, multiple users could start document-production runs at the same time, leading to as many as 50 documents being processed at the same time, all with 8 MB of RAM. We definitely needed to use SAX.
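To give a feel for why SAX needs so little memory, here is a minimal sketch that counts elements in a stream without ever holding the document. It keeps only a counter, no matter how large the input is. The class name ElementCounter and the method countElements are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class, written only for this illustration
public class ElementCounter {

    // Counts the elements with the given name as the parser streams past them
    public static int countElements(InputStream in, final String name) {
        final int[] count = {0};
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes attributes) {
                    // Called once per opening tag; nothing else is stored
                    if (qName.equals(name)) {
                        count[0]++;
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Couldn't parse the stream", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<poem><line>in gold</line><line>on a red</line></poem>";
        InputStream in =
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "line")); // 2
    }
}
```

A DOM version of the same count would first build the whole tree in memory, which is fine for a poem and a problem for a 2,000-page book.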
As mentioned above, you can write XML with the DOM package or by writing String objects. If you just need to write an XML file, writing strings works well enough (and it is the fastest way to create XML). On the other hand, if you need to pass your XML output to a process that requires an XML header, and perhaps even needs to ensure that the XML conforms to a schema, you might want to consider using DOM.
Before you get to writing XML, you first need a data source to provide the content that you want to turn into XML. Here is a simple class that provides the content of the poem used earlier (“The Great Figure,” by William Carlos Williams).
package com.bryantcs.examples.xml;
import java.util.ArrayList;
public class Poem {
    private static String title = "The Great Figure";
    private static String author = "William Carlos Williams";
    private static ArrayList<String> lines = new ArrayList<String>();
    static {
        lines.add("Among the rain");
        lines.add("and lights");
        lines.add("I saw the figure 5");
        lines.add("in gold");
        lines.add("on a red");
        lines.add("fire truck");
        lines.add("moving");
        lines.add("tense");
        lines.add("unheeded");
        lines.add("to gong clangs");
        lines.add("siren howls");
        lines.add("and wheels rumbling");
        lines.add("through the dark city");
    }
    public static String getTitle() {
        return title;
    }
    public static String getAuthor() {
        return author;
    }
    public static ArrayList<String> getLines() {
        return lines;
    }
}
As you can see, it's a pretty simple representation of a poem. Notice that it's also entirely static. Some classes consist entirely of static members, but those classes usually define sets of helper methods (string manipulation specialized for a particular application is a common use for that kind of helper class). A class with static data, though, usually indicates that someone hasn't thought through a problem very well. This kind of thing usually belongs in a file. For our purposes, though, this slightly odd class will serve well enough.
Here's the code for writing an XML file with DOM, given the Poem class as the data source:
package com.bryantcs.examples.xml;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
public class WriteWithDOM {
    public static void main(String[] args) {
        // Create an empty Document
        Document doc = createDocument();
        // Create the XML
        createElements(doc);
        // Create a String representation of the XML
        String xmlContent = createXMLString(doc);
        // Write the XML to a file
        writeXMLToFile(xmlContent);
    }
    // Here's where we create the (empty for now) XML document
    private static Document createDocument() {
        Document doc = null;
        try {
            DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
            doc = docBuilder.newDocument();
            doc.setXmlStandalone(true);
        } catch (ParserConfigurationException pce) {
            System.out.println("Couldn't create a DocumentBuilder");
            System.exit(1);
        }
        return doc;
    }
    // Here's where we add content to the XML document
    private static void createElements(Document doc) {
        // Create the root element
        Element poem = doc.createElement("poem");
        poem.setAttribute("title", Poem.getTitle());
        poem.setAttribute("author", Poem.getAuthor());
        // Add the root element to the document
        doc.appendChild(poem);
        // Create the child elements
        for (String lineIn : Poem.getLines()) {
            Element line = doc.createElement("line");
            Text lineText = doc.createTextNode(lineIn);
            line.appendChild(lineText);
            // Add each element to the root element
            poem.appendChild(line);
        }
    }
    // Here's where we convert the DOM object
    // into a String that contains XML
    private static String createXMLString(Document doc) {
        // Transform the DOM to a String
        Transformer transformer = null;
        // Create a StringWriter to contain the XML from the Document object
        StringWriter stringWriter = new StringWriter();
        try {
            TransformerFactory transformerFactory =
                TransformerFactory.newInstance();
            transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION,
                "no");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            StreamResult result = new StreamResult(stringWriter);
            DOMSource source = new DOMSource(doc);
            transformer.transform(source, result);
        } catch (TransformerConfigurationException e) {
            System.out.println("Couldn't create a Transformer");
            System.exit(1);
        } catch (TransformerException e) {
            System.out.println("Couldn't transform DOM to a String");
            System.exit(1);
        }
        return stringWriter.toString();
    }
    // Here's where we turn the String holding the XML
    // into a file
    private static void writeXMLToFile(String xmlContent) {
        String fileName = "C:" + File.separator + "test"
            + File.separator + "domoutput.xml";
        try {
            File domOutput = new File(fileName);
            // Overwrite any existing file; appending a second XML
            // document to the first would make the file unreadable as XML
            FileOutputStream domOutputStream
                = new FileOutputStream(domOutput);
            // Write the bytes as UTF-8, the Transformer's default encoding
            domOutputStream.write(xmlContent.getBytes("UTF-8"));
            domOutputStream.close();
            System.out.println(fileName + " was successfully written");
        } catch (FileNotFoundException fnfe) {
            System.out.println("Couldn't find a file called " + fileName);
            System.exit(1);
        } catch (IOException ioe) {
            System.out.println("Couldn't write to a file called " + fileName);
            System.exit(1);
        }
    }
}
Tip Use a StringBuilder object to create a string whenever you need to append strings onto other strings. Because String objects can't be changed once created, each use of the string concatenation operator (+) creates a new String object, and the previous String object stays in memory until the garbage collector reclaims it, which can quickly consume a great deal of memory in a loop. Modern compilers and JVMs have gotten better about handling this problem, but it remains an issue, and good practice dictates using StringBuilder when you have more than one or two concatenations to do.
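As a small sketch of the tip, the following method builds one string from a list of lines with a single StringBuilder instead of repeated concatenation. The class name JoinLines and the method join are invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class, written only for this illustration
public class JoinLines {

    // Appends every line (plus a line break) to one StringBuilder
    public static String join(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            // append returns the same StringBuilder, so calls can be chained
            sb.append(line).append('\n');
        }
        // Only one String object is created, here at the end
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(join(Arrays.asList("in gold", "on a red", "fire truck")));
    }
}
```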
As you can see, I've carved it up into a few methods to cleanly and clearly separate the parts of the algorithm. That's a practice you'll see many developers follow, and it's good to embrace this when code complexity reaches a certain level. Every programmer has a different threshold for when they think a long method should become multiple methods. My own threshold is pretty low. A method doesn't have to get very long before I start itching to split it. In this case, splitting the code also lets me handle the Exception objects thrown by each step separately.
The process for creating XML with DOM is fairly straightforward:
1. Create an empty Document object (the top-level DOM object that contains everything else). That's done in the createDocument method.
2. Add the content (the root element and its children) to the Document object. The createElements method performs this step.
3. Transform the Document object into a String object. The createXMLString method does this step for you.
4. Write the String object to the target (a file in this case). The writeXMLToFile method creates your file and puts your XML into the file.
Here's the code to produce the same output by writing out a String object.
package com.bryantcs.examples.xml;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
public class WriteWithStrings {
    public static void main(String[] args) {
        String xmlContent = createXMLContent();
        writeXMLToFile(xmlContent);
    }
    private static String createXMLContent() {
        // write the first line
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        sb.append("<poem title=\"");
        sb.append(Poem.getTitle());
        sb.append("\" author=\"");
        sb.append(Poem.getAuthor());
        sb.append("\">\n");
        // write the middle lines
        for (String lineIn : Poem.getLines()) {
            sb.append("<line>");
            sb.append(lineIn);
            sb.append("</line>\n");
        }
        // write the last line
        sb.append("</poem>");
        return sb.toString();
    }
    private static void writeXMLToFile(String xmlContent) {
        String fileName = "C:" + File.separator + "test"
            + File.separator + "domoutput.xml";
        try {
            File domOutput = new File(fileName);
            // Overwrite any existing file; appending a second XML
            // document to the first would make the file unreadable as XML
            FileOutputStream domOutputStream
                = new FileOutputStream(domOutput);
            // Write the bytes as UTF-8 to match the encoding in the first line
            domOutputStream.write(xmlContent.getBytes("UTF-8"));
            domOutputStream.close();
            System.out.println(fileName + " was successfully created");
        } catch (FileNotFoundException fnfe) {
            System.out.println("Couldn't find a file called " + fileName);
            System.exit(1);
        } catch (IOException ioe) {
            System.out.println("Couldn't write to a file called " + fileName);
            System.exit(1);
        }
    }
}
As you can see, the code is substantially simpler and easier to follow. It also performs more quickly. Again, it doesn't have some of the output features from DOM (no indentation and no automatic escaping of special characters such as & and <, for example), but it works if you just need a simple XML document.
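One thing to watch for when building XML from strings is that nothing escapes special characters for you, as DOM does. The poem happens to contain none, but a title such as "Dombey & Son" would produce a file that parsers reject. Here is a minimal sketch of a helper that replaces the five characters XML treats specially with their predefined entities. The class name XmlEscaper and the method escape are made up for this example.

```java
// Hypothetical helper class, written only for this illustration
public class XmlEscaper {

    // Replaces the characters that have special meaning in XML
    // with the five predefined XML entities
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Dombey & Son")); // Dombey &amp; Son
        System.out.println(escape("5 < 6"));        // 5 &lt; 6
    }
}
```

With a helper like this, the string-writing code could pass each title, author, and line through escape before appending it.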
To read XML, you can use either DOM or SAX. As mentioned earlier in this chapter, DOM is handy when you can be sure that your XML content will fit into the memory you have available. However, DOM fails when the input is too large. SAX, on the other hand, can handle any amount of input. For SAX, you need only as much memory as the largest element needs (usually not much, unless you're doing something such as processing large images or items where a single element can contain a large amount of data).
Here's the source for a program that reads XML with DOM. For this program to work, you need to create a file named poemsource.xml and put it in your C:\test (on Windows) or C:/test (on Unix or Linux) directory. You can use the contents of the domoutput.xml file as the contents of the poemsource.xml file.
package com.bryantcs.examples.xml;
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class ReadWithDOM {
    public static void main(String[] args) {
        String fileName = "C:" + File.separator + "test"
            + File.separator + "poemsource.xml";
        writeFileContentsToConsole(fileName);
    }
    // Write the contents of the file to the console
    private static void writeFileContentsToConsole(String fileName) {
        // Create a DOM Document object
        Document doc = createDocument(fileName);
        // Get the root element
        Element root = doc.getDocumentElement();
        // Create a StringBuilder object that describes the root element
        StringBuilder sb = new StringBuilder();
        sb.append("The root element is named: \"" + root.getNodeName() + "\"");
        sb.append(" and has the following attributes: ");
        NamedNodeMap attributes = root.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node thisAttribute = attributes.item(i);
            sb.append(thisAttribute.getNodeName());
            sb.append(" (\"" + thisAttribute.getNodeValue() + "\")");
            if (i < attributes.getLength() - 1) {
                sb.append(", ");
            }
        }
        // Write the description of the root element to the console
        System.out.println(sb);
        // Work through the children of the root
        // First, get a list of the child nodes
        NodeList nodes = doc.getElementsByTagName("line");
        for (int i = 0; i < nodes.getLength(); i++) {
            // Process each element in turn
            Element element = (Element) nodes.item(i);
            // By writing its name and content to the console (System.out)
            System.out.println("Found an element named \""
                + element.getTagName() + "\""
                + " with the following content: \""
                + element.getTextContent() + "\"");
        }
    }
    // Create a DOM Document object from a file
    private static Document createDocument(String fileName) {
        Document doc = null;
        try {
            // Get the file
            File xmlFile = new File(fileName);
            // Create document builder factory
            DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
            // Create a document builder object
            DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
            // Load the document by parsing the file with the document builder
            doc = docBuilder.parse(xmlFile);
            // Indicate that this document is self-contained
            doc.setXmlStandalone(true);
        }
        // Deal with the possible exceptions
        catch (IOException ioe) {
            System.out.println("Couldn't open file: " + fileName);
            System.exit(1);
        } catch (SAXException se) {
            System.out.println("Couldn't parse the XML file");
            System.exit(1);
        } catch (ParserConfigurationException pce) {
            System.out.println("Couldn't create a DocumentBuilder");
            System.exit(1);
        }
        // Finally return the Document object
        // that we built from the file
        return doc;
    }
}
In this case, we just create a Document object, read through each line of the input, and describe the content in the console. Naturally, you'll probably want to do something more than describe your input in the console, but this example shows you how to read a file. One thing to note is that each Element object is really a Node object (the Element interface extends the Node interface). Due to the way DOM has been implemented, you sometimes need to work with both Element objects and Node objects, as I had to do here when working with the attribute values.
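The difference between Node and Element matters most when you walk an element's children directly. In an indented document, the white space between elements shows up as text nodes, so not every child Node is an Element. The following sketch, which uses a made-up class named ChildElements, checks each node's type before casting it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class, written only for this illustration
public class ChildElements {

    // Returns the text of each child element of the root,
    // skipping the white-space text nodes between them
    public static List<String> childTexts(String xml) {
        List<String> result = new ArrayList<String>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Only cast to Element when the node really is one
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    Element element = (Element) child;
                    result.add(element.getTextContent());
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Couldn't parse the XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<poem>\n  <line>in gold</line>\n  <line>on a red</line>\n</poem>";
        System.out.println(childTexts(xml)); // [in gold, on a red]
    }
}
```

The root here has five child nodes (three runs of white space and two line elements), but only the two elements make it into the list.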
SAX uses an interface called ContentHandler to expose parsing events that you can then intercept in your own code to do whatever processing you want to do for each parsing event. The SAX packages also provide a default implementation of ContentHandler, called DefaultHandler. DefaultHandler does nothing with each event, because doing nothing is the default behavior. However, you can override the methods in DefaultHandler to do whatever you like. The advantage of extending DefaultHandler is that you can override just the methods you care about and leave the rest alone. In the example I've used here, I didn't need many of the methods in DefaultHandler, so I didn't override them.
If you look at the names of the methods, you can see why SAX uses so little memory to process XML. It triggers an event for the beginning and end of each part of an XML document, be it the document itself or an element. So, all the parser has to put in memory is the name of the current element and some other details about it (such as its attributes). It doesn't have to put the element's content into memory until it gets to the characters method, which is the method that handles an element's character content. Most elements don't have vast amounts of text content (one exception is when someone stores an image in an XML element, as Word documents do), so the memory used to process the text usually isn't much.
To show you how to read a simple XML document and describe its contents in the console, I first created a class (called XMLToConsoleHandler) that extends DefaultHandler and overrides the handful of methods I need to use when capturing the contents of an XML file. Here's the XMLToConsoleHandler class:
package com.bryantcs.examples.xml;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class XMLToConsoleHandler extends DefaultHandler {
    // The characters method handles the actual content of an element
    @Override
    public void characters(char[] content, int start, int length) throws SAXException {
        // Describe the content in the console
        System.out.println("Found content: " + new String(content, start, length));
    }
    // The endDocument method lets us do something
    // when the parser reaches the end of the document
    @Override
    public void endDocument() throws SAXException {
        // Announce in the console that we found the end of the document
        System.out.println("Found the end of the document");
    }
    // The endElement method lets us do something
    // when the parser reaches the end of an element
    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        // Announce in the console that we found the end of an element
        System.out.println("Found the end of an element named \"" + qName + "\"");
    }
    // The startDocument method lets us do something
    // when we find the top of the document
    @Override
    public void startDocument() throws SAXException {
        // Announce in the console that we found the beginning of the document
        System.out.println("Found the start of the document");
    }
    // The startElement method lets us do something
    // when we reach the beginning of an element
    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
        // Create a StringBuilder object to contain our description of this element
        StringBuilder sb = new StringBuilder();
        // Add the name of the element
        sb.append("Found the start of an element named \"" + qName + "\"");
        // See if the element has any attributes
        if (attributes != null && attributes.getLength() > 0) {
            sb.append(" with attributes named ");
            // If we do find attributes, describe each one
            for (int i = 0; i < attributes.getLength(); i++) {
                // Use the qualified name; the local name can be empty
                // when the parser isn't namespace-aware (the default)
                String attributeName = attributes.getQName(i);
                String attributeValue = attributes.getValue(i);
                sb.append("\"" + attributeName + "\"");
                sb.append(" (value = ");
                sb.append("\"" + attributeValue + "\"");
                sb.append(")");
                // If we're not at the end of the attributes,
                // add a comma, for proper formatting
                if (i < attributes.getLength() - 1) {
                    sb.append(", ");
                }
            }
        }
        // Describe the element in the console
        System.out.println(sb.toString());
    }
}
Again, we're using a StringBuilder to avoid creating an excess of String objects in memory. The only complexity comes when we work through any attributes that may be present, and most of the code is really just for “pretty printing” (a phrase that programmers often use when referring to code that formats output to be easily read by humans).
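A StringBuilder is also handy inside the characters method itself. A SAX parser is allowed to deliver one element's text in several chunks, so a handler that needs the complete text usually collects the chunks until the element ends. Here's a minimal sketch of that approach. The class name TextCollectingHandler, its collect method, and the sample XML are made up for illustration, and the sketch assumes simple elements that hold either text or child elements, not a mix of both.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers the full text of each element
class TextCollectingHandler extends DefaultHandler {

    // Holds the text of the element we are currently inside
    private final StringBuilder currentText = new StringBuilder();
    // One entry per element that contained text
    private final List<String> found = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) {
        // A new element begins, so start with an empty buffer
        currentText.setLength(0);
    }

    @Override
    public void characters(char[] content, int start, int length) {
        // May be called more than once per element, so append each chunk
        currentText.append(content, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The element is complete; record its text if it had any
        String text = currentText.toString().trim();
        if (!text.isEmpty()) {
            found.add(qName + ": " + text);
        }
        currentText.setLength(0);
    }

    // Parse an XML string and return the collected descriptions
    static List<String> collect(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        for (String entry : collect(
                "<poem><line>Roses are red</line><line>Violets are blue</line></poem>")) {
            System.out.println(entry);
        }
    }
}
```

For documents that mix text and child elements within one element, you'd likely need something more elaborate, such as a stack of buffers, but the append-then-flush idea stays the same.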
Let's look at the class that uses the XMLToConsoleHandler class to write to the console. Again, you need to have a file named poemsource.xml in the C:\test (on Windows) or C:/test (on Unix or Linux) directory. You can use the contents of the domoutput.xml file as the contents of the poemsource.xml file. Here's that class:
package com.bryantcs.examples.xml;
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
public class ReadWithSAX {

    public static void main(String[] args) {
        String fileName = "C:" + File.separator + "test"
                + File.separator + "poemsource.xml";
        getFileContents(fileName);
    }

    private static void getFileContents(String fileName) {
        try {
            // Make an instance of our handler
            XMLToConsoleHandler handler = new XMLToConsoleHandler();
            // Get a parser factory
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Get a parser
            SAXParser saxParser = factory.newSAXParser();
            // And now parse the file with our handler (we wrap the path in a
            // File object because the String version of parse expects a URI)
            saxParser.parse(new File(fileName), handler);
        // Deal with the possible exceptions
        } catch (IOException ioe) {
            System.out.println("Couldn't open " + fileName + " for parsing");
        } catch (ParserConfigurationException pce) {
            System.out.println("Failed to create a SAX parser");
        } catch (SAXException saxe) {
            System.out.println("Failed to parse an XML file");
        }
    }
}
Simple, isn't it? One of the joys of SAX is that it's simple to implement. Create a handler, create a parser, and pass your input and handler to the parser. Consequently, the exception-handling code is longer than the code that does the work. Of course, this arrangement is really masking the fact that the complexity is in the handler class. Still, carving up your code so that complexity is isolated to a single class is exactly the right way to use an object-oriented language such as Java.
You may have noticed that to get a DocumentBuilder object, you had to use a DocumentBuilderFactory object. Similarly, to get a SAXParser object, you had to use a SAXParserFactory object. The factory pattern is often used in Java (and other object-oriented languages, such as C++) to permit the creation of objects that have varying attributes. For example, the SAXParserFactory class includes a way to supply a schema (which is a definition of what a set of data should contain), so that the parsers it creates check that the XML conforms to that schema.
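To make that a bit more concrete, here's a small sketch that turns one feature of SAXParserFactory on or off and then asks the resulting parser how it was configured. The class name FactoryConfigDemo and its newParser method are made up for illustration; setNamespaceAware and isNamespaceAware are part of the standard javax.xml.parsers classes.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical example class: one factory class, differently configured parsers
class FactoryConfigDemo {

    // Ask the factory for a parser with namespace support on or off
    static SAXParser newParser(boolean namespaceAware)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Configure the factory first; every parser it makes inherits the setting
        factory.setNamespaceAware(namespaceAware);
        return factory.newSAXParser();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Namespace aware: " + newParser(true).isNamespaceAware());
        System.out.println("Namespace aware: " + newParser(false).isNamespaceAware());
    }
}
```

The calling code never names a concrete parser class. It sets options on the factory and lets the factory decide what to build.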
The factory pattern is a handy way to present a group of very similar objects that vary only by having some features turned on or off. Otherwise, you'd have to have a class for every possible combination of features. In some cases, that would be a lot of very similar classes. The factory pattern offers an easy-to-use and easy-to-understand solution to that problem. Consequently, Java has a number of factory objects in its standard libraries. You can also create your own factory classes. When you find yourself needing to create many very similar (but slightly different) objects, consider creating a factory for them.
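If you'd like to see what a factory of your own might look like, here's a minimal sketch. Everything in it (the TextFormatter and TextFormatterFactory classes and their methods) is invented for this illustration and has nothing to do with the XML packages. It simply shows one factory producing objects that differ only in which features are switched on.

```java
// Hypothetical product class: formats text according to two on/off features
class TextFormatter {

    private final boolean trim;
    private final boolean upperCase;

    TextFormatter(boolean trim, boolean upperCase) {
        this.trim = trim;
        this.upperCase = upperCase;
    }

    String format(String input) {
        String result = trim ? input.trim() : input;
        return upperCase ? result.toUpperCase() : result;
    }
}

// Hypothetical factory class: holds the settings and builds matching formatters
class TextFormatterFactory {

    private boolean trim = false;
    private boolean upperCase = false;

    void setTrim(boolean trim) {
        this.trim = trim;
    }

    void setUpperCase(boolean upperCase) {
        this.upperCase = upperCase;
    }

    // Each call returns a new formatter with the factory's current settings
    TextFormatter newFormatter() {
        return new TextFormatter(trim, upperCase);
    }

    public static void main(String[] args) {
        TextFormatterFactory factory = new TextFormatterFactory();
        factory.setTrim(true);
        factory.setUpperCase(true);
        TextFormatter formatter = factory.newFormatter();
        // Prints: [ROSES ARE RED]
        System.out.println("[" + formatter.format("  Roses are red  ") + "]");
    }
}
```

With two features, a factory saves you from writing four nearly identical classes. With ten features, it saves you from writing more than a thousand.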
Well, that was a whirlwind tour of how to use Java to work with XML. Several good books have been written about the subject, as there's a great deal more complexity than what's been presented here. Still, this chapter should prepare you for when you have to read and write XML files or streams.
In particular, we covered the basics of Java's two main ways to deal with XML: the Document Object Model (DOM) and the Simple API for XML (SAX). You learned that DOM offers great performance because it loads the entire document into memory. However, loading the entire document into memory is also DOM's biggest problem, as large documents may exceed the available memory. SAX, on the other hand, can handle any amount of XML (including gigabyte-size streams of data), but it requires making a custom handler class for each kind of XML document.
Finally, you looked at the nature of factory classes and learned about why they exist, and when you might want to use them.