Java also includes the StAX API for parsing and creating XML documents. Chapter 4 introduces you to StAX.
What Is StAX?
Streaming API for XML (StAX) is a Java API for parsing an XML document sequentially from start to finish and also for creating XML documents. StAX was introduced by Java 6 as an alternative to SAX and DOM and is located midway between these “polar opposites.”
StAX Versus SAX and DOM
StAX (like SAX) can be used to parse documents of arbitrary sizes. In contrast, the maximum size of documents parsed by DOM is limited by the available memory, which makes DOM unsuitable for mobile devices with limited amounts of memory.
StAX (like DOM) can be used to create documents. In contrast to DOM, which can create documents whose maximum size is constrained by available memory, StAX can create documents of arbitrary sizes. SAX cannot be used to create documents.
StAX (like SAX) makes infoset items available to applications almost immediately. In contrast, these items are not made available by DOM until after it finishes building the tree of nodes.
StAX (unlike SAX and DOM) adopts the pull model, in which the application tells the parser when it’s ready to receive the next infoset item. This model is based on the iterator design pattern (see http://sourcemaking.com/design_patterns/iterator), which results in an application that’s easier to write and debug. In contrast, SAX adopts the push model, in which the parser passes infoset items via events to the application, whether or not the application is ready to receive them. This model is based on the observer design pattern (see http://sourcemaking.com/design_patterns/observer), which results in an application that’s often harder to write and debug.
Summing up, StAX can parse or create documents of arbitrary size, makes infoset items available to applications almost immediately, and uses the pull model to put the application in charge. Neither SAX nor DOM offers all of these advantages.
Exploring StAX
Java implements StAX through types stored in the javax.xml.stream, javax.xml.stream.events, and javax.xml.stream.util packages. This section introduces you to various types from the first two packages while showing you how to use StAX to parse and create XML documents.
Stream-Based Versus Event-Based Readers and Writers
StAX parsers are known as document readers, and StAX document creators are known as document writers. StAX classifies document readers and document writers as stream-based or event-based.
A stream-based reader extracts the next infoset item from an input stream via a cursor (infoset item pointer). Similarly, a stream-based writer writes the next infoset item to an output stream at the cursor position. The cursor can point to only one item at a time, and always moves forward, typically by one infoset item.
Stream-based readers and writers are appropriate when writing code for memory-constrained environments such as Java ME Embedded, because you can use them to create smaller and more efficient code. They also offer better performance for low-level libraries, where performance is important.
An event-based reader extracts the next infoset item from an input stream by obtaining an event. Similarly, an event-based writer writes the next infoset item to the stream by adding an event to the output stream. In contrast to stream-based readers and writers, event-based readers and writers have no concept of a cursor.
Event-based readers and writers are appropriate for creating XML processing pipelines (sequences of components in which each component transforms the output of the previous component and passes the result to the next component in the sequence), for modifying an event sequence, and more.
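A single pipeline stage might be sketched as follows. This is an illustration only, not code from the chapter: the PipelineSketch class and its upperCaseText() method are hypothetical names, and the stage reads from and writes to in-memory strings to stay self-contained. It uses the event-based reader and writer types described later in this chapter.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PipelineSketch {
    // Hypothetical pipeline stage: copies every event from the input to the
    // output, replacing each character-data event with an upper-cased copy.
    static String upperCaseText(String xml) throws XMLStreamException {
        XMLEventReader reader =
            XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer =
            XMLOutputFactory.newFactory().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newFactory();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters()) {
                // Modify the event sequence by substituting a new event.
                String text = event.asCharacters().getData();
                event = events.createCharacters(text.toUpperCase(Locale.ROOT));
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(upperCaseText("<note><to>Ann</to></note>"));
    }
}
```

Because each stage consumes and produces XMLEvent objects, several such stages could be chained without reparsing the document between them.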
Parsing XML Documents
Document readers are obtained by calling the various “create” methods that are declared in the javax.xml.stream.XMLInputFactory class. These creational methods are organized into two categories: methods for creating stream-based readers and methods for creating event-based readers.
Note
You can also call the XMLInputFactory newInstance() static method but might not want to do so because its same-named but parameterized companion method has been deprecated to maintain API consistency, and it’s possible that newInstance() will be deprecated as well.
Before obtaining a document reader, you obtain a factory by calling one of XMLInputFactory’s static newFactory() methods. These methods follow an ordered lookup procedure to locate the XMLInputFactory implementation class. This procedure first examines the javax.xml.stream.XMLInputFactory system property and lastly returns the system-default implementation (returned from XMLInputFactory newDefaultFactory()). If there is a service configuration error, or if the implementation is not available or cannot be instantiated, the method throws an instance of the javax.xml.stream.FactoryConfigurationError class.
After creating the factory, call XMLInputFactory’s void setProperty(String name, Object value) method to set various features and properties as necessary. For example, you might execute xmlif.setProperty(XMLInputFactory.IS_VALIDATING, true); (true is passed as a java.lang.Boolean object via autoboxing—see http://docs.oracle.com/javase/tutorial/java/data/autoboxing.html ) to request a DTD-validating stream-based reader. However, the default StAX factory implementation throws java.lang.IllegalArgumentException because it doesn’t support DTD validation. Similarly, you might execute xmlif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true); to request a namespace-aware event-based reader, which is supported.
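The following sketch shows one way to probe which properties a factory accepts. The InputFactoryPropsSketch class and its trySet() helper are hypothetical names introduced for illustration. Whether a given property value is accepted depends on the StAX implementation in use, so the sketch reports the outcome instead of assuming it.

```java
import javax.xml.stream.XMLInputFactory;

class InputFactoryPropsSketch {
    // Hypothetical helper: tries to set a Boolean property and reports
    // whether the factory accepted it.
    static boolean trySet(XMLInputFactory xmlif, String name, boolean value) {
        try {
            xmlif.setProperty(name, value); // value is autoboxed to Boolean
            return true;
        } catch (IllegalArgumentException iae) {
            // The implementation doesn't support this property or value.
            return false;
        }
    }

    public static void main(String[] args) {
        XMLInputFactory xmlif = XMLInputFactory.newFactory();
        System.out.println("namespace-aware accepted: "
            + trySet(xmlif, XMLInputFactory.IS_NAMESPACE_AWARE, true));
        System.out.println("validating accepted: "
            + trySet(xmlif, XMLInputFactory.IS_VALIDATING, true));
    }
}
```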
Parsing Documents with Stream-Based Readers
A stream-based reader is created by calling one of XMLInputFactory’s createXMLStreamReader() methods, such as XMLStreamReader createXMLStreamReader(Reader reader). These methods throw javax.xml.stream.XMLStreamException when the stream-based reader cannot be created.
The low-level javax.xml.stream.XMLStreamReader interface offers the most efficient way to read XML data with StAX. This interface’s boolean hasNext() method returns true when there is a next infoset item to obtain; otherwise, it returns false. The int next() method advances the cursor by one infoset item and returns an integer code that identifies this item’s type.
Instead of comparing next()’s return value with an integer value, you would compare this value against a javax.xml.stream.XMLStreamConstants infoset constant, such as START_ELEMENT or DTD—XMLStreamReader extends the XMLStreamConstants interface.
Note
You can also obtain the type of the infoset item that the cursor is pointing to by calling XMLStreamReader’s int getEventType() method. Specifying “Event” in the name of this method is unfortunate because it confuses stream-based readers with event-based readers.
XMLStreamReader also declares various methods for extracting infoset information. For example, QName getName() returns the qualified name (as a javax.xml.namespace.QName instance) of the element at the cursor position when next() returns XMLStreamReader.START_ELEMENT or XMLStreamReader.END_ELEMENT.
Note
QName describes a qualified name as a combination of namespace URI, local part, and prefix components. After instantiating this immutable class (via a constructor such as QName(String namespaceURI, String localPart, String prefix)), you can return these components by calling QName’s String getNamespaceURI(), String getLocalPart(), and String getPrefix() methods.
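As a small illustration of these accessors, the following sketch builds a QName and reads back its components. The QNameSketch class and describe() method are hypothetical names, and the XHTML namespace URI is used only as sample data.

```java
import javax.xml.namespace.QName;

class QNameSketch {
    // Hypothetical helper: formats a QName's three components.
    static String describe(QName name) {
        return name.getPrefix() + ":" + name.getLocalPart()
            + " in " + name.getNamespaceURI();
    }

    public static void main(String[] args) {
        QName name = new QName("http://www.w3.org/1999/xhtml", "html", "h");
        System.out.println(describe(name));
        // toString() omits the prefix and shows {namespaceURI}localPart.
        System.out.println(name);
    }
}
```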
StAXDemo (Version 1)
After verifying the number of command-line arguments, Listing 4-1’s main() method creates a factory, uses the factory to create a stream-based reader that obtains its XML data from the file identified by the solitary command-line argument, and enters a parsing loop. Whenever next() returns XMLStreamReader.START_ELEMENT or XMLStreamReader.END_ELEMENT, XMLStreamReader’s getName() method is called to return the element’s qualified name.
Note
XMLStreamReader declares a void close() method that you will want to call to free any resources associated with this stream-based reader when your application is designed to run for an extended period of time. Calling this method doesn’t close the underlying input source.
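The parsing loop described above can be sketched as follows. This is not Listing 4-1: the StreamReaderSketch class and elementTrace() method are hypothetical names, and the sketch parses an in-memory string instead of a file named on the command line so that it stays self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamReaderSketch {
    // Returns one entry per start tag and end tag, in document order.
    static List<String> elementTrace(String xml) throws XMLStreamException {
        XMLInputFactory xmlif = XMLInputFactory.newFactory();
        XMLStreamReader xmlsr =
            xmlif.createXMLStreamReader(new StringReader(xml));
        List<String> trace = new ArrayList<>();
        try {
            while (xmlsr.hasNext()) {
                // next() advances the cursor and returns the item's type.
                switch (xmlsr.next()) {
                    case XMLStreamReader.START_ELEMENT:
                        trace.add("start " + xmlsr.getName());
                        break;
                    case XMLStreamReader.END_ELEMENT:
                        trace.add("end " + xmlsr.getName());
                        break;
                    default:
                        break; // ignore other infoset items
                }
            }
        } finally {
            xmlsr.close(); // frees reader resources, not the input source
        }
        return trace;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : elementTrace("<recipe><title>Toast</title></recipe>"))
            System.out.println(line);
    }
}
```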
Parsing Documents with Event-Based Readers
An event-based reader is created by calling one of XMLInputFactory’s createXMLEventReader() methods, such as XMLEventReader createXMLEventReader(Reader reader). These methods throw XMLStreamException when the event-based reader cannot be created.
The high-level javax.xml.stream.XMLEventReader interface offers a somewhat less efficient but more object-oriented way to read XML data with StAX. This interface’s boolean hasNext() method returns true when there is an event to obtain; otherwise, it returns false. The XMLEvent nextEvent() method returns the next event as an object whose class implements a subinterface of the javax.xml.stream.events.XMLEvent interface.
Note
XMLEvent is the base interface for handling markup events. It declares methods that apply to all subinterfaces; for example, Location getLocation() (return a javax.xml.stream.Location object whose int getCharacterOffset() and other methods return location information about the event) and int getEventType() (return the event type as an XMLStreamConstants infoset constant, such as START_ELEMENT and PROCESSING_INSTRUCTION—XMLEvent extends XMLStreamConstants). XMLEvent is subtyped by other javax.xml.stream.events interfaces that describe different kinds of events (such as Attribute) in terms of methods that return infoset item-specific information (such as Attribute’s QName getName() and String getValue() methods).
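To illustrate how event subinterfaces expose item-specific information, the following sketch collects each element's attributes through the StartElement and Attribute interfaces. The EventInfoSketch class and attributesOf() method are hypothetical names. The iterator is declared as Iterator<?> and cast, on the assumption that this keeps the code compilable against both older and newer versions of the API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class EventInfoSketch {
    // Returns entries of the form element@attribute=value.
    static List<String> attributesOf(String xml) throws XMLStreamException {
        XMLEventReader reader =
            XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // getEventType() is declared by XMLEvent, so it applies to all events.
            if (event.getEventType() == XMLEvent.START_ELEMENT) {
                StartElement start = event.asStartElement();
                Iterator<?> attrs = start.getAttributes();
                while (attrs.hasNext()) {
                    Attribute attr = (Attribute) attrs.next();
                    result.add(start.getName().getLocalPart() + "@"
                        + attr.getName().getLocalPart() + "=" + attr.getValue());
                }
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<book isbn=\"123\"><title lang=\"en\">X</title></book>";
        for (String entry : attributesOf(xml))
            System.out.println(entry);
    }
}
```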
StAXDemo (Version 2)
After verifying the number of command-line arguments, Listing 4-2’s main() method creates a factory, uses the factory to create an event-based reader that obtains its XML data from the file identified by the solitary command-line argument, and enters a parsing loop. Whenever the event returned by nextEvent() has an event type (obtained by calling getEventType()) of XMLEvent.START_ELEMENT or XMLEvent.END_ELEMENT, StartElement’s or EndElement’s getName() method is called to return the element’s qualified name.
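An event-based parsing loop of this kind can be sketched as follows. This is not Listing 4-2: the EventReaderSketch class and elementTrace() method are hypothetical names, and the input is an in-memory string instead of a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class EventReaderSketch {
    // Returns one entry per start tag and end tag, in document order.
    static List<String> elementTrace(String xml) throws XMLStreamException {
        XMLEventReader xmler =
            XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        List<String> trace = new ArrayList<>();
        try {
            while (xmler.hasNext()) {
                XMLEvent event = xmler.nextEvent();
                // The event object carries its own type and data.
                switch (event.getEventType()) {
                    case XMLEvent.START_ELEMENT:
                        trace.add("start " + ((StartElement) event).getName());
                        break;
                    case XMLEvent.END_ELEMENT:
                        trace.add("end " + ((EndElement) event).getName());
                        break;
                    default:
                        break; // ignore other events
                }
            }
        } finally {
            xmler.close();
        }
        return trace;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : elementTrace("<recipe><title>Toast</title></recipe>"))
            System.out.println(line);
    }
}
```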
Note
You can also create a filtered event-based reader to accept or reject various events by calling one of XMLInputFactory’s createFilteredReader() methods, such as XMLEventReader createFilteredReader(XMLEventReader reader, EventFilter filter). The javax.xml.stream.EventFilter interface declares a boolean accept(XMLEvent event) method that returns true when the specified event is part of the event sequence; otherwise, it returns false.
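A filtered reader might be used as in the following sketch, which accepts only start-element events and counts them. The FilteredReaderSketch class and countStartElements() method are hypothetical names. The filter is written as a lambda on the assumption that Java 8 or later is available, because EventFilter declares a single abstract method.

```java
import java.io.StringReader;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;

class FilteredReaderSketch {
    // Counts the start tags in a document by filtering out all other events.
    static int countStartElements(String xml) throws XMLStreamException {
        XMLInputFactory xmlif = XMLInputFactory.newFactory();
        XMLEventReader base =
            xmlif.createXMLEventReader(new StringReader(xml));
        // accept() returns true only for events that stay in the sequence.
        EventFilter startOnly = event -> event.isStartElement();
        XMLEventReader filtered = xmlif.createFilteredReader(base, startOnly);
        int count = 0;
        while (filtered.hasNext()) {
            filtered.nextEvent();
            count++;
        }
        filtered.close();
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(countStartElements("<a><b/><c/></a>"));
    }
}
```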
Creating XML Documents
Document writers are obtained by calling the various “create” methods that are declared in the javax.xml.stream.XMLOutputFactory class. These creational methods are organized into two categories: methods for creating stream-based writers and methods for creating event-based writers.
Note
You can also call the XMLOutputFactory newInstance() static method but might not want to do so because its same-named but parameterized companion method has been deprecated to maintain API consistency, and it’s possible that newInstance() will be deprecated as well.
Before obtaining a document writer, you obtain a factory by calling one of XMLOutputFactory’s static newFactory() methods. These methods follow an ordered lookup procedure to locate the XMLOutputFactory implementation class. This procedure first examines the javax.xml.stream.XMLOutputFactory system property and lastly returns the system-default implementation (returned from XMLOutputFactory newDefaultFactory()). If there is a service configuration error, or if the implementation is not available or cannot be instantiated, the method throws an instance of the FactoryConfigurationError class.
After creating the factory, call XMLOutputFactory’s void setProperty(String name, Object value) method to set various features and properties as necessary. The only property currently supported by all writers is XMLOutputFactory.IS_REPAIRING_NAMESPACES. When enabled (by passing true or a Boolean object, such as Boolean.TRUE, to value), the document writer takes care of all namespace bindings and declarations, with minimal help from the application. The output is always well formed with respect to namespaces. However, enabling this property adds some overhead to the job of writing the XML.
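The effect of this property can be explored with a sketch such as the following, which writes the same namespaced element with and without repairing and never calls writeNamespace(). The RepairingSketch class and write() method are hypothetical names, it relies on the stream-based writer described in the next section, and the exact serialization (including any generated prefixes) depends on the StAX implementation.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class RepairingSketch {
    static final String NS = "http://www.w3.org/1999/xhtml";

    // Writes one namespaced element without declaring its namespace.
    static String write(boolean repairing) throws XMLStreamException {
        XMLOutputFactory xmlof = XMLOutputFactory.newFactory();
        xmlof.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, repairing);
        StringWriter out = new StringWriter();
        XMLStreamWriter xmlsw = xmlof.createXMLStreamWriter(out);
        // No writeNamespace() call: with repairing enabled, the writer is
        // expected to emit the needed declaration itself.
        xmlsw.writeStartElement("h", "html", NS);
        xmlsw.writeEndElement();
        xmlsw.flush();
        xmlsw.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println("repairing off: " + write(false));
        System.out.println("repairing on:  " + write(true));
    }
}
```

Comparing the two lines of output shows whether the writer added a namespace declaration on the application's behalf.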
Creating Documents with Stream-Based Writers
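A stream-based writer is created by calling one of XMLOutputFactory’s createXMLStreamWriter() methods, such as XMLStreamWriter createXMLStreamWriter(Writer writer). These methods throw XMLStreamException when the stream-based writer cannot be created.
The low-level XMLStreamWriter interface declares several methods for writing infoset items to the destination, including the following: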
void close() closes this stream-based writer and frees any associated resources. The underlying writer is not closed.
void flush() writes any cached data to the underlying writer.
void setPrefix(String prefix, String uri) identifies the namespace prefix to which the uri value is bound. This prefix is used by variants of the writeStartElement(), writeAttribute(), and writeEmptyElement() methods that take namespace arguments but not prefixes. Also, it remains valid until the writeEndElement() invocation that corresponds to the last writeStartElement() invocation. This method doesn’t create any output.
void writeAttribute(String localName, String value) writes the attribute identified by localName and having the specified value to the underlying writer. A namespace prefix isn’t included. This method escapes the &, <, >, and " characters.
void writeCharacters(String text) writes text’s characters to the underlying writer. This method escapes the &, <, and > characters.
void writeEndDocument() closes any start tags and writes corresponding end tags to the underlying writer.
void writeEndElement() writes an end tag to the underlying writer, relying on the internal state of the stream-based writer to determine the tag’s prefix and local name.
void writeNamespace(String prefix, String namespaceURI) writes a namespace to the underlying writer. This method must be called to ensure that the namespace specified by setPrefix() and duplicated in this method call is written; otherwise, the resulting document will not be well formed from a namespace perspective.
void writeStartDocument() writes the XML declaration to the underlying writer.
void writeStartElement(String namespaceURI, String localName) writes a start tag with the arguments passed to namespaceURI and localName to the underlying writer.
StAXDemo (Version 3)
Although Listing 4-3 is fairly easy to follow, you might be somewhat confused by the duplication of namespace URIs in the setPrefix() and writeStartElement() method calls. For example, you might be wondering about the duplicate URIs in xmlsw.setPrefix("h", NS1); and its xmlsw.writeStartElement(NS1, "html"); successor.
The setPrefix() method call creates a mapping between a namespace prefix (the value) and a URI (the key) without generating any output. The writeStartElement() method call specifies the URI key, which this method uses to access the prefix value, which it then prepends (with a colon character) to the html start tag’s name before writing this tag to the underlying writer.
Compile Listing 4-3 and run the resulting application. You should discover a recipe.xml file in the current directory.
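The interplay between setPrefix(), writeStartElement(), and writeNamespace() can also be seen in a much smaller sketch. This is not Listing 4-3: the StreamWriterSketch class and write() method are hypothetical names, and the document is written to an in-memory string instead of recipe.xml.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamWriterSketch {
    static final String NS1 = "http://www.w3.org/1999/xhtml";

    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xmlsw =
            XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        xmlsw.writeStartDocument();
        // Map the URI to prefix "h"; this call produces no output.
        xmlsw.setPrefix("h", NS1);
        // The URI is looked up to find the prefix for the start tag.
        xmlsw.writeStartElement(NS1, "html");
        // Emit the declaration so the document is namespace well formed.
        xmlsw.writeNamespace("h", NS1);
        xmlsw.writeStartElement(NS1, "title");
        xmlsw.writeCharacters("Toast & jam"); // & is escaped on output
        xmlsw.writeEndElement();
        xmlsw.writeEndDocument(); // closes the still-open html element
        xmlsw.flush();
        xmlsw.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```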
Creating Documents with Event-Based Writers
An event-based writer is created by calling one of XMLOutputFactory’s createXMLEventWriter() methods, such as XMLEventWriter createXMLEventWriter(Writer writer). These methods throw XMLStreamException when the event-based writer cannot be created.
The high-level XMLEventWriter interface declares the void add(XMLEvent event) method for adding events that describe infoset items to the output stream implemented by the underlying writer. Each argument passed to event is an instance of a class that implements a subinterface of XMLEvent (such as Attribute and StartElement).
To save you the trouble of implementing these interfaces, StAX provides javax.xml.stream.XMLEventFactory. This utility class declares various factory methods for creating XMLEvent subinterface implementations. For example, Comment createComment(String text) returns an object whose class implements the javax.xml.stream.events.Comment subinterface of XMLEvent.
StAXDemo (Version 4)
Listing 4-4 should be fairly easy to follow; it’s the event-based equivalent of Listing 4-3. Notice that this listing includes the creation of java.util.Iterator instances from anonymous classes that implement this interface. These iterators are created to pass namespaces or attributes to XMLEventFactory’s StartElement createStartElement(String prefix, String namespaceUri, String localName, Iterator<? extends Attribute> attributes, Iterator<? extends Namespace> namespaces) method. (You can pass null to either of the last two parameters when an iterator isn’t applicable; for example, when the start tag has no attributes.)
Compile Listing 4-4 and run the resulting application. You should discover a recipe.xml file in the current directory.
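A reduced version of the event-based approach might look like the following sketch. This is not Listing 4-4: the EventWriterSketch class and write() method are hypothetical names, the output goes to an in-memory string instead of recipe.xml, and the namespace iterator comes from Collections.singletonList() as a shorter alternative to an anonymous Iterator class.

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Namespace;

class EventWriterSketch {
    static final String NS1 = "http://www.w3.org/1999/xhtml";

    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter xmlew =
            XMLOutputFactory.newFactory().createXMLEventWriter(out);
        XMLEventFactory xmlef = XMLEventFactory.newFactory();
        xmlew.add(xmlef.createStartDocument());
        // One namespace declaration for the start tag; no attributes (null).
        Iterator<Namespace> namespaces =
            Collections.singletonList(xmlef.createNamespace("h", NS1)).iterator();
        xmlew.add(xmlef.createStartElement("h", NS1, "html", null, namespaces));
        xmlew.add(xmlef.createComment("written with XMLEventWriter"));
        xmlew.add(xmlef.createCharacters("Toast"));
        xmlew.add(xmlef.createEndElement("h", NS1, "html"));
        xmlew.add(xmlef.createEndDocument());
        xmlew.flush();
        xmlew.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```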
For brevity, I added a throws Exception clause to main()’s header.
If the source XML file doesn’t have an XML declaration, it will be added to the destination XML file.
Exercises
1. Define StAX.
2. What packages make up the StAX API?
3. True or false: A stream-based reader extracts the next infoset item from an input stream by obtaining an event.
4. How do you obtain a document reader? How do you obtain a document writer?
5. What does a document writer do when you call XMLOutputFactory’s void setProperty(String name, Object value) method with XMLOutputFactory.IS_REPAIRING_NAMESPACES as the property name and true as the value?
6. Create a ParseXMLDoc application that uses a StAX stream-based reader to parse its single command-line argument, an XML document. After creating this reader, the application should verify that a START_DOCUMENT infoset item has been detected, and then enter a loop that reads the next item and uses a switch statement to output a message corresponding to the item that has been read: ATTRIBUTE, CDATA, CHARACTERS, COMMENT, DTD, END_ELEMENT, ENTITY_DECLARATION, ENTITY_REFERENCE, NAMESPACE, NOTATION_DECLARATION, PROCESSING_INSTRUCTION, SPACE, or START_ELEMENT. When START_ELEMENT is detected, output this element’s name and local name, and output the local names and values of all attributes. The loop ends when the END_DOCUMENT infoset item has been detected. Explicitly close the stream reader followed by the file reader upon which it’s based. Test this application with Exercise 1-21’s books.xml file.
Summary
StAX is a Java API for parsing an XML document sequentially from start to finish and also for creating XML documents. Java implements StAX through types stored in the javax.xml.stream, javax.xml.stream.events, and javax.xml.stream.util packages.
StAX parsers are known as document readers, and StAX document creators are known as document writers. StAX classifies document readers and document writers as stream-based or event-based.
Document readers are obtained by calling the various “create” methods that are declared in the XMLInputFactory class. Document writers are obtained by calling the various “create” methods that are declared in the XMLOutputFactory class.
The low-level XMLStreamReader interface offers the most efficient way to read XML data with StAX. This interface’s boolean hasNext() method returns true when there is a next infoset item to obtain; otherwise, it returns false. The int next() method advances the cursor by one infoset item and returns an integer code that identifies this item’s type.
The low-level XMLStreamWriter interface declares several methods for writing infoset items to the destination. Examples include void writeAttribute(String localName, String value) and void writeCharacters(String text).
The high-level XMLEventReader interface offers a somewhat less efficient but more object-oriented way to read XML data with StAX. This interface’s boolean hasNext() method returns true when there is an event to obtain; otherwise, it returns false. The XMLEvent nextEvent() method returns the next event as an object whose class implements a subinterface of the XMLEvent interface.
The high-level XMLEventWriter interface declares the void add(XMLEvent event) method for adding events that describe infoset items to the output stream implemented by the underlying writer. Each argument passed to event is an instance of a class that implements a subinterface of XMLEvent (such as Attribute and StartElement).
Chapter 5 introduces Java’s XPath API for simplifying DOM node access.