SAX is the de facto standard XML parser interface for Java. You learn how to use it here with a simple SAX application written in Java.
The Simple API for XML (SAX) is a streaming, event-based API for XML (http://www.saxproject.org). It was (and continues to be) developed by members of the xml-dev mailing list (http://www.xml.org/xml/xmldev.shtml). Discussion of a uniform API for XML parsers began on xml-dev in December 1997 (http://lists.xml.org/archives/xml-dev/199712/msg00170.html) and resulted in the release of the first version of SAX in May 1998 (http://lists.xml.org/archives/xml-dev/199805/msg00226.html), with David Megginson as the chief maintainer (http://www.megginson.com/). SAX 2.0.1, which has namespace support, was released in January 2002, with David Brownell as the lead maintainer (http://lists.xml.org/archives/xml-dev/200201/msg01943.html).
SAX provides an interface to SAX parsers. As an event-based API, it munches on documents and reports parsing events along the way, usually in one fell swoop. These reports come directly to the application through callbacks. This is called push parsing. To receive these events, an application must implement event handlers (methods) from the SAX interfaces, such as startDocument() or startElement(), and register them with the parser. Without implementing or registering these handlers, a SAX application won’t “see” the results that are pushed up from its underlying parser.
Pull parsing, on the other hand, allows you to pull events on demand. Examples of pull parsers include the C# XmlReader [Hack #98], Paul Prescod’s Python pull parser (http://www.prescod.net/python/pulldom.html), Aleksander Slominski’s XML pull parser (http://www.extreme.indiana.edu/xgws/xsoap/xpp/), and the Streaming API for XML (StAX), which is a pull parser API just now emerging from the Sun Java Specification Request, JSR 173 (http://www.jcp.org/en/jsr/detail?id=173).
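To make the contrast with push parsing concrete, here is a minimal sketch of pull parsing with StAX. It assumes a JDK that ships the javax.xml.stream package (Java SE 6 or later); the class name PullSketch and the method elementNames() are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: collects element names by pulling events.
class PullSketch {

    static String elementNames(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        StringBuilder names = new StringBuilder();
        try {
            // The application drives the loop and asks for each event
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    if (names.length() > 0) names.append(' ');
                    names.append(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // prints: time hour minute
        System.out.println(
            elementNames("<time><hour>11</hour><minute>59</minute></time>"));
    }
}
```

With SAX the parser calls your code; here your code calls reader.next() and decides when to stop.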
SAX was originally written in Java and continues to be maintained in Java, but it is also available in other languages, such as C++, Pascal, and Perl (http://www.saxproject.org/?selected=langs). This hack demonstrates a simple Java program that uses SAX.
First, have a look at the document blob.xml:
<time timezone="PST"><hour>11</hour><minute>59</minute><second>59</second><meridiem>p.m.</meridiem></time>
Not much to look at, is it? It’s just a blob of elements with only one attribute, no pretty whitespace between elements, no XML declaration, and no comments. Having elements crammed together is not a big problem from a processing standpoint, except that it gives me a headache when I’m looking at it.
When I was first learning Java a few years ago, I searched high and low for simple SAX programs, ones that were reduced to something I could grasp. I didn’t have much luck finding such programs, so I decided to write a few of my own. Example 7-21 is a short SAX program, Poco.java. This program does some readily discernible things, just right for someone getting up to speed with SAX. It will also help us do something interesting with blob.xml.
Example 7-21. Poco.java
import org.xml.sax.XMLReader;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class Poco extends DefaultHandler {

    private int depth = -1;
    private static String parser =
        "org.apache.crimson.parser.XMLReaderImpl";

    public static void main (String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader(parser);
        reader.setContentHandler(new Poco());
        reader.parse(args[0]);
    }

    public void startDocument() {
        // Emit a prolog before any element is reported
        System.out.println("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>");
        System.out.println("<!-- processed with Poco -->");
    }

    public void startElement (String uri, String name,
            String qName, Attributes atts) {
        depth++;
        // Indent child elements by one space
        if (depth > 0) System.out.print(" ");
        System.out.print("<" + qName + ">");
        // Break the line after the root start tag
        if (depth == 0) System.out.println();
    }

    public void endElement (String uri, String name, String qName) {
        System.out.print("</" + qName + ">");
        // Break the line after each child of the root
        if (depth == 1) System.out.println();
        depth--;
    }

    public void characters (char ch[], int start, int length) {
        for (int i = start; i < start + length; i++) {
            System.out.print(ch[i]);
        }
    }
}
Compiling this program as shown here requires Java version 1.4 or later:
javac Poco.java
Because 1.4 has JAXP built in, you don’t have to place a SAX JAR file (such as sax2r2.jar, available from http://sourceforge.net/projects/sax/) in the classpath. When the program is compiled, you can run it like this:
java Poco blob.xml
or like this in Windows:
java Poco file:///C:/Hacks/examples/blob.xml
The results of processing blob.xml with Poco.class are shown in Example 7-22.
Example 7-22. Results of processing blob.xml with Poco.class
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- processed with Poco -->
<time>
 <hour>11</hour>
 <minute>59</minute>
 <second>59</second>
 <meridiem>p.m.</meridiem>
</time>
An XML declaration and comment are added to the top of the resulting document. All the elements from the source file are copied, properly indented, and sent to standard output, including their character data content. The attribute on the time element, however, is not processed and so is excluded from the output. Now let’s talk about how all this happened.
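Poco drops the timezone attribute because its startElement() never reads the Attributes argument. As a rough sketch of how that argument could be used, the handler below copies each attribute into the start tag it builds. The class name AttrSketch and the method startTags() are hypothetical, and the reader is obtained through JAXP's SAXParserFactory rather than a named parser class so that the sketch runs on a current JDK. Attribute values are copied without escaping, which a real serializer would need to handle.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: rebuilds start tags, attributes included.
class AttrSketch extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String name,
            String qName, Attributes atts) {
        out.append('<').append(qName);
        // Walk the attribute list by index
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(atts.getValue(i)).append('"');
        }
        out.append('>');
    }

    static String startTags(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        AttrSketch handler = new AttrSketch();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints: <time timezone="PST"><hour>
        System.out.println(
            startTags("<time timezone=\"PST\"><hour>11</hour></time>"));
    }
}
```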
On line 1 of Example 7-21, the program imports the XMLReader interface from the package org.xml.sax, then on line 4 imports the class XMLReaderFactory from org.xml.sax.helpers. Line 13 creates an XML reader for SAX using the factory. Creating the reader is number one on your list of things to do when writing a SAX program.
The createXMLReader() method takes as an argument a string that represents a Java class name. This class name is the entry point for the underlying XML parser. JAXP’s default XML parser is Crimson, identified with the class name org.apache.crimson.parser.XMLReaderImpl. If createXMLReader() has no argument, you can pass in a class name for the parser using the -D command-line option and the system property org.xml.sax.driver. For example, you could use the -D option on the command line, like this:
java -Dorg.xml.sax.driver=org.apache.crimson.parser.XMLReaderImpl Poco blob.xml
and get the same results as placing the class name in the program itself. You might choose the Xerces parser instead of Crimson. In this case, use this command line:
java -cp .;xercesImpl.jar -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser Poco blob.xml
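This command assumes that xercesImpl.jar is in the current directory (download it from http://xml.apache.org/xerces2-j/download.cgi). The semicolon in the classpath is the Windows separator; on Unix-like systems, use a colon instead.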
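JDKs after 1.4 do not necessarily bundle Crimson, so a hardcoded parser class name may fail at runtime. One alternative, sketched here as an assumption rather than as part of Poco, is to ask JAXP's SAXParserFactory for whatever parser the platform provides and take its XMLReader. The class name ReaderSketch and the method newReader() are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical example class: obtains an XMLReader without naming a parser.
class ReaderSketch {

    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // JAXP factories are not namespace-aware unless asked
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Prints the implementation class the platform selected
        System.out.println(newReader().getClass().getName());
    }
}
```

The returned object is an ordinary XMLReader, so setContentHandler() and parse() work on it exactly as they do in Poco.java.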
The DefaultHandler class (line 3) is also from the helpers package. It implements several other SAX interfaces, and is the default base class for SAX2 event handlers. For example, the DefaultHandler class implements methods from the ContentHandler interface. More precisely, ContentHandler contains only signatures for such methods or event handlers as startDocument() and startElement(), and DefaultHandler provides no-op implementations of these and other methods. The Poco class extends the DefaultHandler class (line 6), and the call to setContentHandler() on line 14 registers a content event handler. Without this content handler, all reported events are quietly ignored.
The rubber hits the road when the parse() method is called on line 15. The argument (args[0]) is a string that represents the filename from the command line, which parse() treats as a system identifier (or URI). The parse() method can also take an argument of type InputSource (http://www.saxproject.org/apidoc/org/xml/sax/InputSource.html), which can wrap a system identifier, a byte stream, or a character stream.
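As a small illustration of the character-stream case, the sketch below wraps a StringReader in an InputSource and parses it from memory instead of from a file. The class name InputSourceSketch and the method countElements() are hypothetical, and the reader comes from JAXP's SAXParserFactory as an assumption so that the sketch does not depend on a particular parser class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements in an in-memory document.
class InputSourceSketch extends DefaultHandler {

    private int elements = 0;

    @Override
    public void startElement(String uri, String name,
            String qName, Attributes atts) {
        elements++;
    }

    static int countElements(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        InputSourceSketch handler = new InputSourceSketch();
        reader.setContentHandler(handler);
        // The InputSource wraps a character stream rather than a URI
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        // prints: 3
        System.out.println(countElements(
            "<time><hour>11</hour><minute>59</minute></time>"));
    }
}
```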
Poco.java provides working implementations for four methods: startDocument() (line 18), startElement() (line 24), endElement() (line 34), and characters() (line 41). SAX uses callbacks, which are registered to handle certain events when encountered, hence we call these methods event handlers. If we don’t implement them in our program, they actually get called at runtime, but nothing apparent happens! Only by implementing the event handlers do we get into action.
startDocument() writes an XML declaration (line 20) and a comment (line 21) to standard output. startElement() writes a start tag, and endElement() writes an end tag. The only reason why the Attributes interface is imported (line 2) is to satisfy the required method signature for startElement(), whose fourth argument is of type Attributes.
Both startElement() and endElement() use the depth variable (line 8) to determine element depth and to add whitespace appropriately, but this is not a general solution because it only works for a depth of 0 or 1! For a solid technique on handling element depth and whitespace, see David Megginson’s DataWriter.java, available at http://megginson.com/Software/xml-writer-0.2.zip.
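To show roughly what a depth-independent approach could look like, here is a minimal sketch that is not Megginson's DataWriter and makes no claim to match it. It keeps a stack with one flag per open element recording whether that element has had child elements, and uses the stack size as the indentation level. The class name IndentSketch and the method indent() are hypothetical, and the sketch assumes input like blob.xml with no whitespace between elements and no characters that need escaping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: indents elements at any depth.
class IndentSketch extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    // One flag per open element: has it contained a child element yet?
    private final Deque<Boolean> hasChild = new ArrayDeque<>();

    private void newlineAndIndent(int depth) {
        if (out.length() > 0) out.append('\n');
        for (int i = 0; i < depth; i++) out.append(' ');
    }

    @Override
    public void startElement(String uri, String name,
            String qName, Attributes atts) {
        // Mark the parent (if any) as having a child element
        if (!hasChild.isEmpty() && !hasChild.peek()) {
            hasChild.pop();
            hasChild.push(true);
        }
        newlineAndIndent(hasChild.size());
        out.append('<').append(qName).append('>');
        hasChild.push(false);
    }

    @Override
    public void endElement(String uri, String name, String qName) {
        // Only elements with child elements get their end tag on a new line
        if (hasChild.pop()) newlineAndIndent(hasChild.size());
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }

    static String indent(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        IndentSketch handler = new IndentSketch();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indent("<a><b><c>1</c></b><d>2</d></a>"));
    }
}
```

Because the indentation comes from the stack size rather than from comparisons against 0 and 1, nested elements such as c inside b inside a are indented one further space per level.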
characters() (line 41) simply prints any characters it encounters.
The program is admittedly weak in its exception handling. It only does the minimum by throwing Exception from main(). SAXException and SAXParseException both appear in the method signatures of DefaultHandler, which Poco extends. A more responsible program (and therefore a more complex one) would use try/catch blocks to handle the exceptions intelligently. I have chosen to keep this program simple so it is easier to understand.
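For comparison, here is one way try/catch blocks might be arranged around a parse. This is a sketch under assumptions, not a prescription: the class name CarefulParse and the method check() are hypothetical, the reader comes from JAXP's SAXParserFactory, and the messages returned are invented for the example. The more specific SAXParseException is caught before its superclass SAXException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: reports why a parse failed.
class CarefulParse {

    static String check(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            // DefaultHandler rethrows fatal errors as SAXParseException
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (SAXParseException e) {
            // Well-formedness problems carry a location
            return "parse error at line " + e.getLineNumber();
        } catch (SAXException | ParserConfigurationException e) {
            return "SAX error: " + e.getMessage();
        } catch (IOException e) {
            return "I/O error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<a></a>"));      // prints: ok
        System.out.println(check("<a><b></a>"));   // reports a parse error
    }
}
```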
Poco.java is only the beginning of what you can do, but it should give you a fairly good understanding of the basics of SAX programming in Java.
Karl Waclawek’s SAX for .NET: http://sf.net/projects/saxdotnet
SAX API reference: http://www.saxproject.org/apidoc/overview-summary.html
SAX2, by David Brownell (O’Reilly)
The Book of SAX, by W. Scott Means and Michael A. Bodie (No Starch Press)