Event-Based Parsing of XML

The XML API discussed in the previous chapter was based on a tree representation of an XML document, which is a logical approach because XML document content forms a tree structure. However, for many types of applications, this way of handling documents is extremely inefficient, like reading a thousand Yellow Pages entries to find the phone number of a single restaurant.

This chapter discusses another way of dealing with XML documents. This API, the Simple API for XML (SAX), is not a W3C Recommendation, but it is a de facto standard for XML parsing and is free for both commercial and noncommercial use.

SAX: A Simple API for XML

SAX is the result of a successful attempt to bring together ideas implemented in several products available in 1997 and to provide a common interface for Java programmers.

Unlike the DOM, SAX started as a purely Java-based project and evolved through open discussions among the members of the XML-DEV mailing list under the leadership of David Megginson.

At the time of this writing, the current version of the API is SAX 2.0, originally released in May 2000. SAX 2.0 r2 (release 2) is in the prerelease stage. Although the SAX API was initially implemented in Java, it later evolved into a standard supported in other languages and environments, including C++, Visual Basic, COM, Perl, and Pascal.

DOM Versus SAX

There is a huge difference between the approaches taken by the developers of DOM and SAX. The difference is so large that it is impossible to say that one is better than the other; they are simply suited to different types of applications.

The DOM parser reads the entire document into memory and then builds a DOM tree of the parsed XML structure, making the tree available to a developer only after the entire tree is built.

The SAX parser reads an XML document and generates notifications on different types of events depending on the content being read. The developer then filters and handles these events using easy-to-implement event handlers.
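To make the event model concrete, here is a minimal sketch of an event handler. It assumes a modern JDK with the bundled JAXP parser; the class name SaxEventsSketch, the RecordingHandler class, and the sample document are illustrative only and do not come from any particular product.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsSketch {

    // Hypothetical handler: records each event the parser reports, in document order.
    public static class RecordingHandler extends DefaultHandler {
        public final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver text in several chunks; whitespace-only chunks are skipped here.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    // Parses the given XML string and returns the recorded events.
    public static List<String> parse(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // Prints the events in the order the parser reported them.
        System.out.println(parse("<menu><dish>Soup</dish></menu>"));
    }
}
```

The handler overrides only the callbacks it cares about; DefaultHandler supplies empty implementations for the rest, which is what makes filtering events inexpensive.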

Trees Versus Events

The main problem with tree-based parsing such as the DOM is that the parser has to load an entire document into memory before preparing its tree image, even if all you need to do is read just one element out of thousands. In practice, the DOM API becomes impractical for very large documents.
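By contrast, an event-based parser lets an application stop as soon as it has what it needs. The following is a sketch of one common idiom, not a prescribed technique: the handler throws an exception once the target element has been read, which aborts the parse. The class SaxFindFirst, the StopParsing exception, and the sample element names are hypothetical, and the sketch ignores nested elements with the same name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxFindFirst {

    // Hypothetical marker exception, used only to abort parsing early.
    public static class StopParsing extends SAXException {
        public StopParsing() {
            super("target element found");
        }
    }

    // Collects the text of the first element whose name matches the target.
    public static class FindHandler extends DefaultHandler {
        private final String target;
        private final StringBuilder text = new StringBuilder();
        private boolean inside;
        public String result;

        public FindHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals(target)) {
                inside = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is accumulated.
            if (inside) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals(target)) {
                result = text.toString();
                throw new StopParsing(); // nothing after this point is read
            }
        }
    }

    // Returns the text of the first matching element, or null if there is none.
    public static String firstText(String xml, String elementName) throws Exception {
        FindHandler handler = new FindHandler(elementName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Rethrow genuine parse errors; swallow only the deliberate early stop.
            if (handler.result == null) {
                throw e;
            }
        }
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<list><entry>A</entry><phone>555-0100</phone><entry>B</entry></list>";
        System.out.println(firstText(xml, "phone"));
    }
}
```

No tree is built at any point, so memory use stays roughly constant regardless of how many entries follow the one being sought.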

Note

On a Pentium 4–based PC with Sun JDK 1.4, it takes several minutes to load a 5MB XML file and get a DOM tree.

Larger documents require increasing the memory allocation beyond the default for the JVM instance (for example, with the -Xmx option); otherwise, you will get an out-of-memory error (java.lang.OutOfMemoryError).

If you are interested in concrete benchmark results, you can get some figures at http://www.sosnoski.com/opensrc/xmlbench/index.html.


On the other hand, after the DOM tree is ready, it provides developers with full random access to any node of the document it represents.
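As an illustration of random access, the sketch below builds a DOM tree once and then reads nodes in arbitrary order. It assumes a modern JDK (getTextContent requires DOM Level 3 support); the class name DomRandomAccessSketch, its helper methods, and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomRandomAccessSketch {

    // Parses the whole document into an in-memory DOM tree.
    public static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of the index-th element with the given tag name.
    public static String textAt(Document doc, String tag, int index) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.item(index).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = load(
                "<list><entry>Alpha</entry><entry>Beta</entry><entry>Gamma</entry></list>");
        // Once the tree exists, nodes can be visited in any order, any number of times.
        System.out.println(textAt(doc, "entry", 2));
        System.out.println(textAt(doc, "entry", 0));
    }
}
```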

It is also easy to edit documents using the DOM API. SAX events, by contrast, flow one way, from the parser to the application, so there is no way to apply changes to an existing document through an event-based API of this sort.
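A minimal sketch of such an edit follows. It changes the text of one element, appends a new element, and writes the tree back out through an identity transformation. The class name DomEditSketch, the updatePhone method, and the phone and note element names are assumptions made for this example only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomEditSketch {

    // Replaces the text of the first <phone> element, appends a <note>, and serializes the result.
    public static String updatePhone(String xml, String newPhone) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Modify an existing node in place.
        Element phone = (Element) doc.getElementsByTagName("phone").item(0);
        phone.setTextContent(newPhone);

        // Add a new node to the tree.
        Element note = doc.createElement("note");
        note.setTextContent("updated");
        doc.getDocumentElement().appendChild(note);

        // Write the modified tree back out as text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(updatePhone("<entry><phone>555-0100</phone></entry>", "555-0199"));
    }
}
```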

Choosing Between DOM and SAX

Whenever you have to choose between DOM and SAX, try answering a few questions, as follows:

  • Do you need to handle large documents? If yes, choose SAX; if no, then consider another question:

  • Is performance an issue in your application? If yes, SAX will be better; if no, ask yourself the next question:

  • Do you need to search for or retrieve the content of a specific node of the document? If yes, choose SAX; if no, go on to the next question:

  • Does your application rely heavily on random access to nodes across the document? If yes, your solution is DOM; if no, the last question is:

  • Do you need to update documents? If yes, your choice will be the DOM API.

If this is not enough to make up your mind, it is probably wise to think more about the requirements of your application.
