Chapter 17. SAX: The Simple API for XML

 

Which painting in the National Gallery would I save if there was a fire? The one nearest the door of course.

 
 --George Bernard Shaw

If you share Mr. Shaw’s propensity for practicality, you will probably find this lesson quite interesting. Through most of this book, the discussion on XML has focused on its use as a structured document format. However, XML is also often used as a format for data storage. Unlike proprietary file formats, XML documents follow consistent structural rules and can be tested not only for well-formedness but also for compliance with specific structural rules. There are a variety of tools out there that allow you to interact with the structure of XML documents. The previous hour focused on one of these technologies, the W3C’s Document Object Model (DOM). This hour tackles another such technology, known as SAX. SAX (Simple API for XML) is an API that can be used to analyze and extract data from an XML document.

In this hour, you’ll learn

  • What SAX is and how it works

  • How to get a SAX parser for your favorite programming language

  • How to write a Java program that uses a SAX parser to process an XML file

What Is SAX?

SAX is a programming interface for event-based parsing of XML files. In practical terms, this means that SAX takes a very different approach to parsing XML code than its counterpart, the DOM. If you recall from previous hours, XML documents are processed using parsers. The parser reads the XML document; verifies that it is well formed; and, if it’s a validating parser, validates it against a schema or DTD. What happens next depends on the parser you’re using. In some cases, it might copy the data into a data structure that’s native to the programming language you’re using. In other cases, it might transform the data into a presentation format or apply styles to it. The SAX parser doesn’t do anything to the data other than trigger certain events. It’s up to the user of the SAX parser to determine what happens when those events occur.

What I mean when I say that SAX is a programming interface is that it isn’t a program, it’s a document—a standard—that describes how a SAX parser should be written. It explains which events must be supported in a compliant SAX parser and leaves it up to the implementers to make sure that the parsers they write comply.

By the Way

An interface is basically a contract offered by someone who writes a program or specifies how a program should work. It says that as long as you implement all of the features specified in the interface, any programs written to use that interface will work as expected. When someone writes a parser that implements the SAX interface, it means that any program that supports all of the events specified in the SAX interface can use that parser.

A Really Brief History of SAX

Most of the time when you’re dealing with XML, one standards body or another developed the various technologies. With SAX, that isn’t the case. SAX was developed by members of the xml-dev mailing list in order to provide XML developers with a way to deal with XML documents in a simple and straightforward manner. One of the lead developers in this mailing list was Dave Megginson, whose name often comes up in discussions related to SAX, and who has resumed maintaining SAX after a hiatus. You can find out more about SAX at http://www.saxproject.org/.

The original version of SAX, 1.0, was released in May 1998. The most recent version is SAX 2.0.2, which was released in April 2004. SAX was first defined as a set of Java interfaces, but you can write a SAX parser in any language, and there are SAX parsers available for most popular programming languages. This lesson describes the features of the Java version; other implementations generally provide the same events, adapted to the conventions of their language. Let’s look at the specifics of the two major releases, 1.0 and 2.0.

By the Way

SAX 2.0.2 is a fairly minor enhancement of the original SAX 2.0 release that came out back in May 2000. Throughout the remainder of this lesson I generally refer to the latest release of SAX as version 2.0.

SAX 1.0

SAX 1.0 provides support for triggering events on all of the standard content in an XML document. Rather than telling you everything it does support, it’s easier to tell you that SAX 1.0 does not support namespaces. A program that uses a SAX 1.0 parser must support the following methods, which are automatically invoked when events occur during the parsing of a document:

  • characters()— Triggered when character data is found inside an element

  • endDocument()— Triggered when parsing of the document is complete

  • endElement()— Triggered when the closing tag for any element is encountered

  • ignorableWhitespace()— Triggered when whitespace is encountered between elements

  • processingInstruction()— Triggered when a processing instruction is encountered in the document

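  • setDocumentLocator()— Triggered once, before any other event, to give the program an object that reports the parser’s current position in the document

  • startDocument()— Triggered when parsing of the document begins

  • startElement()— Triggered when the opening tag for an element is encountered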

By the Way

If you don’t have a programming background, allow me to clarify that a method is a sequence of programming code that performs a certain task. Methods are very similar to functions in programming languages other than Java.

SAX 1.0 also handles attributes of elements by providing them through its interface when the startElement() method of the document handler is called. SAX 1.0 has been deprecated now that SAX 2.0 has been implemented. In the Java world, most SAX 2.0 libraries (such as Xerces) still support SAX 1.0 so that they’ll work with legacy SAX 1.0 applications. But if you’re writing a new application that uses SAX, you should use SAX 2.0.

SAX 2.0

SAX 2.0 is an extension of SAX 1.0 that provides support for namespaces. As such, programs that communicate with a SAX 2.0 parser must support the following methods in addition to those listed for SAX 1.0:

  • startPrefixMapping()— Triggered when a prefix mapping (the binding of a namespace URI to a prefix) is encountered

  • endPrefixMapping()— Triggered when a prefix mapping is closed

  • skippedEntity()— Triggered whenever an entity is skipped for any number of reasons

Writing Programs That Use SAX Parsers

Unless you really develop an interest in XML parsing, chances are you won’t be writing a SAX parser. Rather, you’ll be writing a program that interacts with a SAX parser. Writing a program that works with a SAX parser is in some ways similar to writing a program with a graphical user interface (GUI), such as a traditional application for Windows or Macintosh. When you write a GUI program, the GUI library turns actions that the user takes into events that are returned to you by the library. Your job as a programmer is then to write event handlers that respond to incoming events. For example, with JavaScript, certain elements on a web page can generate events that can be handled by JavaScript. Links generate onClick and onMouseOver events. There are also documentwide events in JavaScript, such as onLoad.

In regard to event handling, SAX works the same way conceptually as JavaScript. When a SAX parser parses the data in an XML document, it fires events based on the data that it is currently parsing. All of the methods listed previously that are associated with SAX are called by the parser when the associated event occurs. It’s up to the application programmer to decide what action to take when those events are caught.

For example, you might want to print out just the contents of all of the title elements in a document, or you might want to construct a complex data structure based on all of the information you find in the document. The SAX parser doesn’t care; it just provides you with all of the data in the document in a linear manner so that you can do whatever you like with it.
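
As a rough sketch of the first idea, the handler below collects the text of every title element and ignores everything else. It is an illustration only, not part of this lesson’s sample program: the TitleCollector class and the sample XML are made up, the parser comes from the JAXP factory built into Java rather than from a separately downloaded library, and the class extends the SAX helper class DefaultHandler so that only the interesting events need any code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every <title> element.
class TitleCollector extends DefaultHandler {
  private final List<String> titles = new ArrayList<>();
  private StringBuilder current; // non-null only while inside a <title>

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    if ("title".equals(qName)) {
      current = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (current != null) {
      current.append(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if ("title".equals(qName) && current != null) {
      titles.add(current.toString().trim());
      current = null;
    }
  }

  // Parses the XML text and returns the titles in document order.
  // Checked exceptions are wrapped to keep the example short.
  static List<String> titlesIn(String xml) {
    TitleCollector handler = new TitleCollector();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.titles;
  }

  public static void main(String[] args) {
    String xml = "<books><book><title>Dune</title><pages>412</pages></book>"
        + "<book><title>Emma</title></book></books>";
    System.out.println(titlesIn(xml)); // prints [Dune, Emma]
  }
}
```

Notice that the parser never learns what the program is interested in. It reports every element, and the handler decides which events to act on.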

You might be asking yourself at this point why you would ever care to parse an XML document at such a low level. In other words, why would you ever want to print out just the contents of the title elements in a document? The main answer to this question has to do with data maintenance and integrity. As you continue to build and maintain larger and larger XML documents, you may find that you need to extract and study portions of the documents to find editorial errors or any other inconsistencies that are difficult to find when viewing raw XML code. A custom application built around a SAX parser can be used to drill down into an XML document and spit out any subset of the data that you want.

Obtaining a SAX Parser

If you want to write an application that uses SAX, the first thing you have to do is obtain a SAX parser. There are several SAX parsers available, and it’s ultimately up to your own specific development needs as to which parser you should use. Furthermore, you’ll need to look at the documentation for the parser that you choose in order to figure out how to integrate the parser with your applications. Following are several of the more popular SAX parsers you might want to consider using:

  • Xerces

  • libxml

  • Python SAX API

The next few sections provide more information about these SAX parsers, along with how to download and install them.

Xerces

Xerces is the XML parser from the Apache Software Foundation. It’s used as part of several other Apache XML and Java-related projects and can be used by itself as well. In addition to supporting SAX, it also supports DOM Level 2, which you learned about in the previous hour, as well as XML Schema validation.

You can obtain Xerces, along with lots of other open source XML-related software, at http://xml.apache.org/. Xerces is completely free as it is open source software released under the Apache Software License.

The Xerces library is available in both .tar.gz and .zip formats—download the one that’s appropriate for your platform. Included in the package are xercesImpl.jar and xml-apis.jar, which contain the compiled class files for the Xerces library itself, and xercesSamples.jar, compiled versions of the sample programs that come with Xerces. The package also includes documentation, source code for the sample programs, and some sample data files.

By the Way

A .JAR file is a lot like a .ZIP file except that it is typically used to package compressed Java programs for distribution; JAR stands for Java ARchive.

In order to use the Xerces library, you just need to include the two aforementioned .JAR files (xercesImpl.jar and xml-apis.jar) in your class path when compiling and running programs that use it.

libxml

libxml is a package of Perl modules (distributed on CPAN under the name libxml-perl, and not to be confused with the C library libxml2) that contains a number of XML processing libraries. One of these is XML::Parser::PerlSAX. The easiest way to install it is to download it from CPAN (http://www.cpan.org/) and follow the instructions to install it on your local system. The methods provided by the PerlSAX module are basically identical to those in the Java version of SAX; both implement the same interface in ways appropriate to Perl and Java, respectively.

Python

If you’re a Python programmer, things are particularly easy for you. Recent versions of Python (from 2.0 on) provide built-in support for SAX without any additional software. To use the SAX library in your programs, you just need to include the line

import xml.sax

Using SAX with Java

The sample program in this chapter is written in Java and uses the Xerces SAX parser, which I mentioned earlier. If you’re a Java programmer, I’m sure you’re perfectly happy with this state of affairs. If you have no interest in Java, much of the remainder of this lesson probably won’t be to your liking. However, the purpose of this chapter is to explain how SAX works, and while there are SAX parsers available for many languages, it started out in the Java world. And even if you have no interest in digesting the upcoming Java code, you can still experiment with the sample Java program, running it on your XML documents and analyzing the results. The syntax for this program is relatively simple, and I’ve commented the code to make it as clear as possible.

Even if you don’t care about Java programming, you may still want to see the output of the sample program on your own computer. To run the program, you’ll need Sun’s Java Development Kit (JDK) and the Xerces library mentioned previously. I already explained how to download and install Xerces; to get the JDK, just go to http://java.sun.com/j2se/.

You’ll need to download the J2SE (Java 2 Standard Edition) SDK and install it. Once it’s installed, you can run the sample program. Just put the sample program’s .java source code file in the directory where you put xercesImpl.jar and xml-apis.jar (you can put it anywhere you like, but this route is probably easiest), open a command prompt in that directory, and type the following:

javac -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter.java

By the Way

Alternatively, you can copy the xercesImpl.jar and xml-apis.jar files to the same location as the sample program and then compile and run the program from there. The main point is that the program needs to be able to access the .JAR files.

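The semicolons in the -classpath argument are the Windows path separator; on Linux, Mac OS X, and other UNIX-like systems, use colons instead. If your copy of the code for DocumentPrinter.java is correct and xercesImpl.jar and xml-apis.jar are really in the current folder, the DocumentPrinter class will be compiled and a file called DocumentPrinter.class will result. To run the program, use the following command: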

java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter file.xml

You should replace file.xml with the name of the XML file that you want to process. As an example, here’s how you would initiate the Document Printer sample program using the vehicles XML file from Hour 12, “Transforming XML with XSLT”:

java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter vehicles.xml

Listing 17.1 contains a partial listing of the resulting output of running the DocumentPrinter SAX sample program on the vehicles.xml document.

Listing 17.1. The Document Printer Sample Program Uses a SAX Parser to Display Detailed Information About the vehicles.xml Document

 1: Start document.
 2: Received processing instruction:
 3: Target: xml-stylesheet
 4: Data: href="vehicles.xsl" type="text/xsl"
 5: Start element: vehicles
 6: Start element: vehicle
 7: Start element: mileage
 8: Received characters: 13495
 9: End of element: mileage
10: Start element: color
11: Received characters: green
12: End of element: color
13: Start element: price
14: Received characters: 33900
15: End of element: price
16: End of element: vehicle
17: ...
18: Start element: vehicle
19: Start element: mileage
20: Received characters: 48405
21: End of element: mileage
22: Start element: color
23: Received characters: gold
24: End of element: color
25: Start element: price
26: Received characters: 22995
27: End of element: price
28: End of element: vehicle
29: End of element: vehicles
30: End of document reached.

Just to refresh your memory, following is a brief code excerpt from the vehicles.xml document:

<vehicle year="2004" make="Acura" model="3.2TL">
  <mileage>13495</mileage>
  <color>green</color>
  <price>33900</price>
</vehicle>

In fact, this piece of code is for the first vehicle in the document, which matches up with the code on lines 6 through 16 in Listing 17.1. If you carefully compare the XML code with the listing, you’ll notice how the program parsed and output information about every element in the document. This is the kind of detailed control you have at your disposal when using a tool such as a SAX parser.

Inside the SAX Sample Program

Let’s look at how the program you just saw uses a SAX parser to parse an XML document. The program just prints out messages that explain what it’s doing at each step while parsing, along with the associated data from the XML document. You could easily replace this code with code that performs more useful tasks, such as performing a calculation or otherwise transforming the data, but because the purpose of this program is just to illustrate how the SAX parser works, the diagnostic messages are fine.

Because you already know the scoop on SAX, Java, and the Xerces SAX parser for Java, let’s go ahead and jump right into the program code. Here are the opening lines of Java code:

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

public class DocumentPrinter implements ContentHandler, ErrorHandler {
  // A constant containing the name of the SAX parser to use.
  private static final String PARSER_NAME
    = "org.apache.xerces.parsers.SAXParser";

This code imports classes that will be used later on and declares the class (program) that you’re currently writing. The import statements indicate which classes will be used by this program. In this case, all of the classes that will be used are from the org.xml.sax package and are included in the xercesImpl.jar and xml-apis.jar archives.

This class, called DocumentPrinter, implements two interfaces—ContentHandler and ErrorHandler. These two interfaces are part of the standard SAX 2.0 package and are included in the import list. A program that implements ContentHandler is set up to handle events passed back in the normal course of parsing an XML document, and a program that implements ErrorHandler can handle any error events generated during SAX parsing.

In the Java world, an interface is a framework that specifies a list of methods that must be defined in a class. An interface is useful because it guarantees that any class that implements it meets the requirements of that interface. If you fail to include all of the methods required by the interface, your program will not compile. Because this program implements ContentHandler and ErrorHandler, the parser can be certain that it is capable of handling all of the events it triggers as it parses a document.

After the class has been declared, a single member variable is created for the class, PARSER_NAME. This variable is a constant that contains the name of the class that you’re going to use as the SAX parser. As you learned earlier, there are a number of SAX parsers available. The Xerces parser just so happens to be one of the better Java SAX parsers out there, which explains the parser name of org.apache.xerces.parsers.SAXParser.

By the Way

Although Xerces is certainly a popular Java-based XML parser given its relatively long history, you don’t necessarily have to download a parser at all. Sun, the maker of Java, includes an XML API called JAXP in the latest version of Java (J2SE 5.0). JAXP doesn’t replace SAX; it provides a standard way to obtain a SAX (or DOM) parser that is built into Java. To learn more about JAXP, visit http://java.sun.com/xml/jaxp/.
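
If you’d rather not name a specific parser class, the following sketch shows one way the parser could be obtained through JAXP instead. It is offered as an assumption-laden alternative, not as the approach the sample program takes: the JaxpReaderDemo class and its method names are invented for illustration, while SAXParserFactory, SAXParser, and XMLReader are the standard JAXP and SAX types.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the lesson's sample program.
class JaxpReaderDemo {
  // Asks JAXP for whichever SAX parser the Java runtime provides.
  static XMLReader newReader()
      throws ParserConfigurationException, SAXException {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // JAXP factories leave namespace processing off by default; turn it
    // on so that the localName argument is filled in.
    factory.setNamespaceAware(true);
    return factory.newSAXParser().getXMLReader();
  }

  // Returns the local name of every element, in start-tag order.
  static List<String> elementNames(String xml) {
    List<String> names = new ArrayList<>();
    try {
      XMLReader parser = newReader();
      parser.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String uri, String localName,
            String qName, Attributes atts) {
          names.add(localName);
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return names;
  }

  public static void main(String[] args) {
    String xml = "<vehicles><vehicle><color>green</color></vehicle></vehicles>";
    System.out.println(elementNames(xml)); // prints [vehicles, vehicle, color]
  }
}
```

Because newReader() returns an XMLReader, the rest of a SAX program (setContentHandler(), setErrorHandler(), and parse()) stays the same no matter where the parser came from.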

The main() Method

Every command-line Java application begins its life with the main() method. In the Java world, the main method indicates that a class is a standalone program, as opposed to one that just provides functionality used by other classes. Perhaps more importantly, it’s the method that gets run when you start the program. The purpose of this method is to set up the parser and get the name of the document to be parsed from the arguments passed in to the program. Here’s the code:

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("No XML document path specified.");
      System.exit(1);
    }

    DocumentPrinter dp = new DocumentPrinter();
    XMLReader parser;
    try {
      parser = (XMLReader)Class.forName(PARSER_NAME).newInstance();
      parser.setContentHandler(dp);
      parser.setErrorHandler(dp);
      parser.parse(args[0]);
    }
    // Normally it's a bad idea to catch generic exceptions like this.
    catch (Exception ex) {
      System.out.println(ex.getMessage());
      ex.printStackTrace();
    }
  }

This program expects that the user will specify the path to an XML document as its only command-line argument. If no such argument is supplied, the program prints a message saying that no XML document path was specified and then exits.

Next, the program creates an instance of the DocumentPrinter object and assigns it to the variable dp. You’ll need this object later when you tell the parser which ContentHandler and ErrorHandler to use. After instantiating dp, a try...catch block is opened to house the parsing code. This is necessary because some of the methods called to carry out the parsing can throw exceptions that must be caught within the program. All of the real work in the program takes place inside the try block.

By the Way

The try...catch block is the standard way in which Java handles errors that crop up during the execution of a program. It enables the program to compensate and work around those errors if the programmer chooses to do so. In this case, you simply print out information about the error and allow the program to exit gracefully.

Within the try block, the first order of business is creating a parser object. The class named in the PARSER_NAME constant is loaded, an instance of it is created, and that instance is assigned to the variable parser. Because SAX 2.0 parsers must implement XMLReader, you can declare the variable using that interface type rather than the parser’s own class name, SAXParser. The fact that you’re using the parser through the XMLReader interface means that you can call only those methods included in that interface. For this application, that’s fine.

After the parser has been created, you can start setting its properties. Before actually parsing the document, you have to specify the content and error handlers that the parser will use. Because the DocumentPrinter class can play both of those roles, you simply set both of those properties to dp (the DocumentPrinter object you just created). At this point, all you have to do is call the parse() method with the path or URI passed in on the command line, which is exactly what the code does.

Implementing the ContentHandler Interface

The skeleton for the program is now in place. The rest of the program consists of methods that fulfill the requirements of the ContentHandler and ErrorHandler interfaces. More specifically, these methods respond to events that are triggered during the parsing of an XML document. In this program, the methods just print out the content that they receive.

The first of these methods is the characters() method, which is called whenever character data is encountered in a document. Following is the code for this method:

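  public void characters(char[] ch, int start, int length) {
    // Copy only the slice of the array that belongs to this event.
    String chars = "";
    for (int i = start; i < start + length; i++) {
      chars = chars + ch[i];
    }

    // Skip chunks that contain nothing but whitespace.
    if ((chars.trim()).length() > 0) {
      System.out.println("Received characters: " + chars);
    }
  }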

The characters() method receives content found within elements. It accepts three arguments: an array of characters, the position in the array where the content starts, and the amount of content received. In this method, a for loop is used to extract the content from the array, starting at the position in the array where the content starts, and iterating over each element until the position of the last element is reached. When all of the characters are gathered, the code checks to make sure they aren’t just empty spaces, and then prints the results if not.

By the Way

It’s important not to just process all of the characters in the array of characters passed in unless that truly is your intent. The array can contain lots of padding on both sides of the relevant content, and including it all will result in a lot of extra characters along with the content that you actually want. On the other hand, if you know that the code contains parsed character data (PCDATA) that you want to read verbatim, then by all means process all of the characters.
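
One more detail is worth knowing about characters(): a SAX parser is allowed to deliver the text of a single element in more than one call, for example on either side of an entity reference. A handler that needs the complete text therefore usually appends each chunk to a buffer and reads the buffer when the element ends. The following is a minimal sketch of that pattern, assuming elements that hold either text or child elements but not both; the TextAccumulator class and its leafValues() method are hypothetical names, and the parser is obtained through JAXP for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers the full text of each text-only element.
class TextAccumulator extends DefaultHandler {
  private final StringBuilder text = new StringBuilder();
  private final List<String> values = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    text.setLength(0); // a new element starts with an empty buffer
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Append only the slice [start, start + length) of the array;
    // this may be just one of several chunks for the same element.
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    String value = text.toString().trim();
    if (!value.isEmpty()) {
      values.add(qName + "=" + value);
    }
    text.setLength(0);
  }

  // Returns "name=text" for every element that contains non-blank text.
  static List<String> leafValues(String xml) {
    TextAccumulator handler = new TextAccumulator();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.values;
  }

  public static void main(String[] args) {
    String xml = "<vehicle><color>green &amp; gold</color>"
        + "<price>33900</price></vehicle>";
    // prints [color=green & gold, price=33900]
    System.out.println(leafValues(xml));
  }
}
```

How many characters() calls the parser makes for the color element is up to the parser, which is exactly why the buffer is read in endElement() rather than in characters().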

The next two methods, startDocument() and endDocument(), are called when the beginning and end of the document are encountered, respectively. They accept no arguments and are called only once each during document parsing, for obvious reasons. Here’s the code for these methods:

  public void startDocument() {
    System.out.println("Start document.");
  }
  public void endDocument() {
    System.out.println("End of document reached.");
  }

Next let’s look at the startElement() and endElement() methods, which accept the most complex set of arguments of any of the methods that make up a ContentHandler:

  public void startElement(String namespaceURI, String localName,
    String qName, Attributes atts) {
    System.out.println("Start element: " + localName);
  }

  public void endElement(String namespaceURI, String localName,
    String qName) {
    System.out.println("End of element: " + localName);
  }

The startElement() method accepts four arguments from the parser. The first is the namespace URI, which you’ll see elsewhere as well. The namespace URI is the URI for the namespace associated with the element. If a namespace is used in the document, the URI for the namespace is provided in a namespace declaration. The local name is the name of the element without the namespace prefix. The qualified name is the name of the element including the namespace prefix if there is one. Finally, the attributes are provided as an instance of the Attributes object. The endElement() method accepts the same first three arguments but not the final attributes argument.

By the Way

SAX parsers must have namespace processing turned on in order to populate all of these arguments. If that option is deactivated, the namespace URI and local name arguments may be populated with empty strings. The method for turning on namespace processing varies depending on which parser you use.
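
To see the difference for yourself, the sketch below parses the same one-element document with namespace processing on and off and reports the three name arguments that startElement() receives. It assumes a parser obtained through JAXP, where the switch is SAXParserFactory.setNamespaceAware(); on a plain XMLReader, the standard SAX 2.0 feature name http://xml.org/sax/features/namespaces plays the same role through setFeature(). The NamespaceDemo class and the urn:example:vehicles namespace are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class showing what startElement() receives.
class NamespaceDemo {
  // Parses xml and describes the name arguments passed for the root element.
  static String describeRoot(String xml, boolean namespaceAware) {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(namespaceAware);
    StringBuilder out = new StringBuilder();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
          Attributes atts) {
        if (out.length() == 0) { // record the root element only
          out.append("uri=[").append(uri)
              .append("] localName=[").append(localName)
              .append("] qName=[").append(qName).append("]");
        }
      }
    };
    try {
      factory.newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String xml = "<v:car xmlns:v=\"urn:example:vehicles\"/>";
    // prints uri=[urn:example:vehicles] localName=[car] qName=[v:car]
    System.out.println(describeRoot(xml, true));
    // With namespace processing off, qName is still v:car, but the
    // uri and localName values are left to the parser (often empty).
    System.out.println(describeRoot(xml, false));
  }
}
```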

Let’s look at attribute processing specifically. Attributes are supplied to the startElement() method as an instance of the Attributes object. The startElement() method shown earlier ignores its atts argument, which is why no attributes appear in the program’s output, but retrieving them is straightforward using three methods of the Attributes object: getLength(), getLocalName(), and getValue(). The getLength() method supplies the number of attributes, which serves as the upper bound for a loop over them, while getLocalName() and getValue() accept the index of the attribute being retrieved as an argument. In case you’re curious, the full list of methods for the Attributes object appears in Table 17.1.

Table 17.1. Methods of the Attributes Object

Method                                   Purpose
getIndex(String qName)                   Retrieves an attribute’s index using its qualified name
getIndex(String uri, String localPart)   Retrieves an attribute’s index using its namespace URI and the local portion of its name
getLength()                              Returns the number of attributes in the element
getLocalName(int index)                  Returns the local name of the attribute associated with the index
getQName(int index)                      Returns the qualified name of the attribute associated with the index
getType(int index)                       Returns the type of the attribute with the supplied index
getType(String qName)                    Looks up the type of the attribute with the qualified name specified
getType(String uri, String localName)    Looks up the type of the attribute with the namespace URI and name specified
getURI(int index)                        Looks up the namespace URI of the attribute with the index specified
getValue(int index)                      Looks up the value of the attribute using the index
getValue(String qName)                   Looks up the value of the attribute using the qualified name
getValue(String uri, String localName)   Looks up the value of the attribute using the namespace URI and local name
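
To make the attribute methods concrete, here is a small sketch of a handler that walks the attributes of each element by index. It is a separate illustration rather than a piece of the DocumentPrinter class: the AttributeLister name is hypothetical, and the parser is created through JAXP with namespace processing turned on so that getLocalName() returns a usable name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records every attribute it is handed.
class AttributeLister extends DefaultHandler {
  private final List<String> found = new ArrayList<>();

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    // getLength() bounds the loop; the index selects each attribute.
    for (int i = 0; i < atts.getLength(); i++) {
      found.add(localName + "/" + atts.getLocalName(i)
          + "=" + atts.getValue(i));
    }
  }

  // Returns "element/attribute=value" for each attribute in the document.
  static List<String> attributesIn(String xml) {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true); // needed for local names
    AttributeLister handler = new AttributeLister();
    try {
      factory.newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.found;
  }

  public static void main(String[] args) {
    String xml = "<vehicle year=\"2004\" make=\"Acura\" model=\"3.2TL\"/>";
    // Prints one line per attribute, such as vehicle/make=Acura.
    // SAX does not promise any particular attribute order.
    for (String line : attributesIn(xml)) {
      System.out.println(line);
    }
  }
}
```

When you already know which attribute you want, the lookups by name in Table 17.1, such as getValue(String qName), save you the loop.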

Getting back to the endElement() method, its operation is basically the same as that of startElement() except that it doesn’t accept the attributes of the element as an argument.

The next two methods, startPrefixMapping() and endPrefixMapping(), have to do with prefix mappings for namespaces:

  public void startPrefixMapping(String prefix, String uri) {
    System.out.println("Prefix mapping: " + prefix);
    System.out.println("URI: " + uri);
  }

  public void endPrefixMapping(String prefix) {
    System.out.println("End of prefix mapping: " + prefix);
  }

These methods are used to report the beginning and end of namespace prefix mappings when they are encountered in a document.
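
The order in which these events arrive is easier to see in a trace. The sketch below logs prefix-mapping and element events for a tiny namespaced document; the PrefixMappingLogger class and the urn:example:vehicles namespace are invented for the example, and the parser is assumed to come from JAXP with namespace processing turned on, since prefix mappings are reported only in that mode.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: logs mapping and element events in order.
class PrefixMappingLogger extends DefaultHandler {
  private final List<String> events = new ArrayList<>();

  @Override
  public void startPrefixMapping(String prefix, String uri) {
    events.add("map " + prefix + " -> " + uri);
  }

  @Override
  public void endPrefixMapping(String prefix) {
    events.add("unmap " + prefix);
  }

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    events.add("start " + localName);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    events.add("end " + localName);
  }

  // Parses the XML text and returns the logged events.
  static List<String> eventsFor(String xml) {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    PrefixMappingLogger handler = new PrefixMappingLogger();
    try {
      factory.newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.events;
  }

  public static void main(String[] args) {
    String xml = "<v:vehicles xmlns:v=\"urn:example:vehicles\">"
        + "<v:vehicle/></v:vehicles>";
    for (String event : eventsFor(xml)) {
      System.out.println(event);
    }
  }
}
```

The mapping event comes before the start of the element that declares the namespace, and the matching end-of-mapping event comes after that element has ended.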

The next method, ignorableWhitespace(), is similar to characters(), except that it reports whitespace in element content that can be ignored.

  public void ignorableWhitespace(char[] ch, int start, int length) {
    System.out.println("Received whitespace.");
  }

Next on the method agenda is processingInstruction(), which reports processing instructions to the content handler. For example, a stylesheet can be associated with an XML document using the following processing instruction:

<?xml-stylesheet href="mystyle.css" type="text/css"?>

The method that handles such instructions is

  public void processingInstruction(String target, String data) {
    System.out.println("Received processing instruction:");
    System.out.println("Target: " + target);
    System.out.println("Data: " + data);
  }

The last method covered here is setDocumentLocator(). Unlike the other methods, it doesn’t report a piece of the document; the parser calls it once, before any other event, to hand the content handler a Locator object. A handler that saves this object can query it during any later event to find out where in the document the item currently being processed is located. Nothing is done with the locator in this program, but the method still has to be present. (The same goes for skippedEntity(), listed earlier among the SAX 2.0 methods: the ContentHandler interface requires it, so the class needs at least an empty version of it in order to compile.) Here’s the “do nothing” source code for setDocumentLocator():

    public void setDocumentLocator(Locator locator) { }

The methods of a Locator object are described in Table 17.2.

Table 17.2. The Methods of a Locator Object

Method               Purpose
getColumnNumber()    Returns the column number of the current position in the document being parsed
getLineNumber()      Returns the line number of the current position in the document being parsed
getPublicId()        Returns the public identifier of the current document event
getSystemId()        Returns the system identifier of the current document event

Because the sample program doesn’t concern itself with the specifics of locators, none of these methods are actually used. However, it’s good for you to know about them in case you need to develop a program that somehow is interested in locators.
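
If you do want to use a locator, the usual pattern is to save the object in setDocumentLocator() and consult it from the other event methods. The following sketch, which assumes a JAXP-supplied parser and uses the made-up class name LineReporter, records the line on which each start tag ends.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: notes the line number of each element.
class LineReporter extends DefaultHandler {
  private Locator locator; // handed over by the parser before parsing
  private final List<String> report = new ArrayList<>();

  @Override
  public void setDocumentLocator(Locator locator) {
    this.locator = locator; // keep it for use during later events
  }

  @Override
  public void startElement(String uri, String localName, String qName,
      Attributes atts) {
    // Parsers are encouraged, but not required, to supply a locator.
    String where = (locator == null)
        ? "unknown" : String.valueOf(locator.getLineNumber());
    report.add(qName + " at line " + where);
  }

  // Parses the XML text and returns one entry per element.
  static List<String> linesOf(String xml) {
    LineReporter handler = new LineReporter();
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException ex) {
      throw new IllegalStateException("Could not parse document", ex);
    }
    return handler.report;
  }

  public static void main(String[] args) {
    String xml = "<a>\n  <b/>\n</a>";
    System.out.println(linesOf(xml)); // prints [a at line 1, b at line 2]
  }
}
```

The values a locator returns are meaningful only while one of the event methods is running, so read them there rather than after parse() has returned.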

Implementing the ErrorHandler Interface

I mentioned earlier that the DocumentPrinter class implements two interfaces, ContentHandler and ErrorHandler. Let’s look at the methods that are used to implement the ErrorHandler interface. There are three types of errors that a SAX parser can generate—errors, fatal errors, and warnings. Classes that implement the ErrorHandler interface must provide methods to handle all three types of errors. Here’s the source code for the three methods:

  public void error(SAXParseException exception) { }

  public void fatalError(SAXParseException exception) { }

  public void warning(SAXParseException exception) { }

As you can see, each of the three methods accepts the same argument—a SAXParseException object. The only difference between them is that they are called under different circumstances. To keep things simple, the sample program doesn’t output any error notifications. For the sake of completeness, the full list of methods supported by SAXParseException appears in Table 17.3.

Table 17.3. Methods of the SAXParseException Class

Method               Purpose
getColumnNumber()    Returns the column number of the position in the document where the error or warning occurred
getLineNumber()      Returns the line number of the position in the document where the error or warning occurred
getPublicId()        Returns the public identifier of the entity in which the error or warning occurred
getSystemId()        Returns the system identifier of the entity in which the error or warning occurred

Similar to the Locator methods, these methods aren’t used in the Document Printer sample program, so you don’t have to worry about the ins and outs of how they work.
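
In a program where silent failure isn’t acceptable, the three error methods are the natural place to report what went wrong and where. Here is one possible sketch, not the sample program’s own code: the ErrorReporter class is hypothetical, the parser is assumed to come from JAXP, and the handler simply records each problem along with its line number, rethrowing fatal errors because parsing can’t sensibly continue after them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical example class: records each problem the parser reports.
class ErrorReporter implements ErrorHandler {
  private final List<String> problems = new ArrayList<>();

  private void record(String kind, SAXParseException ex) {
    problems.add(kind + " at line " + ex.getLineNumber());
  }

  @Override
  public void warning(SAXParseException ex) {
    record("Warning", ex);
  }

  @Override
  public void error(SAXParseException ex) {
    record("Error", ex);
  }

  @Override
  public void fatalError(SAXParseException ex) throws SAXParseException {
    record("Fatal error", ex);
    throw ex; // the document is not well formed; stop parsing
  }

  // Parses the XML text and returns the problems that were reported.
  static List<String> problemsIn(String xml) {
    ErrorReporter reporter = new ErrorReporter();
    try {
      XMLReader parser =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setErrorHandler(reporter);
      parser.parse(new InputSource(new StringReader(xml)));
    } catch (SAXException ex) {
      // Parsing was abandoned; fatalError() has already recorded why.
    } catch (ParserConfigurationException | IOException ex) {
      throw new IllegalStateException("Could not run the parser", ex);
    }
    return reporter.problems;
  }

  public static void main(String[] args) {
    // The <b> element is never closed, so this is not well formed.
    for (String problem : problemsIn("<a>\n<b></a>")) {
      System.out.println(problem);
    }
  }
}
```

Calling getMessage() on the exception, which SAXParseException inherits from its parent classes, returns the parser’s own description of the problem if you want more than the location.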

Testing the Document Printer Program

Now that you understand how the code works in the Document Printer sample program, let’s take it for a test drive one more time. This time around, you’re running the program to parse the condos.xml sample document from the previous hour. Here’s an excerpt from that document in case it’s already gotten a bit fuzzy in your memory:

  <proj status="active">
    <location lat="36.122238" long="-86.845028" />
    <description>
      <name>Woodmont Close</name>
      <address>131 Woodmont Blvd.</address>
      <address2>Nashville, TN 37205</address2>
      <img>condowc.jpg</img>
    </description>
  </proj>

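And here’s the command required to run this document through the Document Printer program. The semicolons that separate the classpath entries are the Windows convention; on Linux or macOS, use colons instead: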

java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter condos.xml

Finally, Listing 17.2 contains the output of the Document Printer program after feeding it the condominium map data stored in the condos.xml document.

Example 17.2. The Output of the Document Printer Example Program After Processing the condos.xml Document

 1:  Start document.
 2:  Start element: projects
 3:  Start element: proj
 4:  Start element: location
 5:  End of element: location
 6:  Start element: description
 7:  Start element: name
 8:  Received characters: Woodmont Close
 9:  End of element: name
10:  Start element: address
11:  Received characters: 131 Woodmont Blvd.
12:  End of element: address
13:  Start element: address2
14:  Received characters: Nashville, TN 37205
15:  End of element: address2
16:  Start element: img
17:  Received characters: condowc.jpg
18:  End of element: img
19:  End of element: description
20:  End of element: proj
21:  ...
22:  Start element: proj
23:  Start element: location
24:  End of element: location
25:  Start element: description
26:  Start element: name
27:  Received characters: Harding Hall
28:  End of element: name
29:  Start element: address
30:  Received characters: 2120 Harding Pl.
31:  End of element: address
32:  Start element: address2
33:  Received characters: Nashville, TN 37215
34:  End of element: address2
35:  Start element: img
36:  Received characters: condohh.jpg
37:  End of element: img
38:  End of element: description
39:  End of element: proj
40:  End of element: projects
41:  End of document reached.

The excerpt from the condos.xml document that you saw a moment ago corresponds to the first proj element in the XML document. Lines 3 through 20 show how the Document Printer program parses and displays detailed information for this element and all of its content.

Summary

In this hour, you learned about SAX, one of the two popular APIs for parsing XML files. You already covered the DOM in the previous lesson, so this lesson wrapped up some loose ends in terms of giving you a more rounded understanding of XML parsing. SAX (Simple API for XML) is an event-driven API; a SAX parser is usually combined with a custom program designed to process the events that the parser generates. You worked through an example of such an application in this hour that demonstrated how to use the Xerces SAX parser to step through the elements and text in an XML document and print out detailed information about them.

Q&A

Q.

I didn’t get any of that Java stuff; how am I supposed to use SAX?

A.

If you found the Java code confusing, you may be better off looking at the documentation for the SAX implementation for a programming language that you’re more comfortable using. You may want to do some online investigating to find a SAX parser that’s appropriate for you. Keep in mind that the Xerces SAX parser (http://xml.apache.org/) that you used in this lesson is also available for the C++ and Perl languages. Also, if you prefer using JavaScript or Visual Basic you may want to consider using the DOM for XML processing, which you explored in Hour 16, “Parsing XML with the DOM.”

Q.

How do I access the data structure created by SAX?

A.

The catch with SAX is that it doesn’t create its own data structure; it’s up to the programmer who writes the event handlers to generate a data structure, print the XML, or do whatever it is they want to do with the data as it’s processed by the SAX parser.
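
As a rough illustration of that idea, the sketch below uses event handlers to build a simple data structure: a list of the text found in every name element. The NameCollector and NameCollectorDemo classes are hypothetical, the XML string is a trimmed-down stand-in for condos.xml, and the code assumes the SAX parser bundled with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the text of every <name> element.
class NameCollector extends DefaultHandler {
    final List<String> names = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a <name> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("name".equals(qName)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one run of text in several chunks, so append.
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString().trim());
            text = null;
        }
    }
}

class NameCollectorDemo {
    // Parses the given XML text and returns the collected names.
    static List<String> collectNames(String xml) throws Exception {
        NameCollector collector = new NameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
        return collector.names;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down stand-in for condos.xml, for illustration only.
        String xml = "<projects>"
                + "<proj><description><name>Woodmont Close</name></description></proj>"
                + "<proj><description><name>Harding Hall</name></description></proj>"
                + "</projects>";
        System.out.println(collectNames(xml)); // prints [Woodmont Close, Harding Hall]
    }
}
```

The list lives in the handler, not in the parser. Once parsing finishes, the parser itself retains nothing about the document, so anything you want to keep has to be stored by your own code as the events arrive.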

Workshop

The Workshop is designed to help you anticipate possible questions, review what you’ve learned, and begin learning how to put your knowledge into practice.

Quiz

1.

What is an event-driven parser?

2.

What standards body was responsible for the creation of SAX?

3.

Which important feature was added when SAX was upgraded from version 1.0 to 2.0?

Quiz Answers

1.

An event-driven parser reads through an XML document from start to finish and calls specific methods in another program as it encounters each piece of the document, such as the start of an element, the end of an element, or a run of character data.

2.

I admit it, this was a trick question; a standards body did not create SAX. Rather, members of the xml-dev mailing list created it through a grassroots effort.

3.

SAX 2.0 added support for namespaces.

Exercises

1.

Modify the sample program in this hour so that it reproduces the XML document that is supplied as input.

2.

Reproduce the sample program in the language that you do your development in (if it’s not Java).
