Any SAX-based XML processing application consists of two parts: a SAX parser and a set of handlers that the developer implements. Some helper classes also make life easier.
The core of the SAX 2 API consists of two packages:
org.xml.sax
org.xml.sax.helpers
Some optional features, which parsers are not required to support, are exposed in a standard way through a third package, org.xml.sax.ext, which is not part of the SAX core.
We won't repeat the SAX 2 documentation, which is available for download on the official Web site of the project (see references at the end of the chapter).
Instead, we will discuss the steps that have to be taken and the classes that have to be implemented by developers of SAX-driven applications and by developers of SAX-compliant parsers.
First, a SAX parser must be instantiated before an application can use the SAX API to access XML documents. It is up to the developer to decide which parser to use with the application. All SAX-compliant parsers implement the same standard interfaces, so it is easy to swap in any of the available parsers.
For the purpose of this discussion, we will not focus on any particular implementation of a SAX parser. The examples given will work with any SAX implementation that supports the SAX API in Java. For more details about using specific parsers, you can refer to Chapter 16.
The developer has two choices for instantiating a parser:
Using an implementation of the interface XMLReader supplied by a parser.
Instantiating a parser with a factory provided by the SAX API.
The following lines demonstrate the first approach. Again, the example is in Java. The class instantiated must be the parser's own implementation of XMLReader; the Apache Xerces driver class is shown here only as an example, so substitute the class supplied by your parser.

    import org.xml.sax.XMLReader;
    ...
    // Instantiate the parser's own XMLReader implementation directly.
    XMLReader myReader = new org.apache.xerces.parsers.SAXParser();
The second approach is demonstrated here:

    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;
    ...
    try {
        // The class name is an example (the Apache Xerces driver);
        // pass the name of your parser's XMLReader class instead.
        XMLReader myReader =
            XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    } catch (SAXException se) {
        // Report a problem
    }
The second example instantiates a parser by passing its class name to a factory method as a string, which can be loaded from a configuration file or taken as a startup parameter. This approach is usually more flexible because it allows one compliant parser to be swapped for another without recompiling any source code.
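As a rough sketch of that flexibility, the helper below reads the parser class name from a system property. The class ReaderLoader and the property name sax.driver.class are hypothetical, and the fallback to JAXP's SAXParserFactory is an assumption added here so that the code runs on a stock JDK without any third-party parser.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class ReaderLoader {
    // Hypothetical property name, used only for this illustration.
    static final String DRIVER_PROPERTY = "sax.driver.class";

    /** Creates a reader from the named class, or a JAXP default when no name is given. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in recent JDKs
    static XMLReader load(String className)
            throws SAXException, ParserConfigurationException {
        if (className != null && !className.isEmpty()) {
            // The named class must implement XMLReader and be loadable at run time.
            return XMLReaderFactory.createXMLReader(className);
        }
        // Fallback: let JAXP pick whatever parser the platform provides.
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Reads the class name from the (hypothetical) system property. */
    static XMLReader loadConfigured()
            throws SAXException, ParserConfigurationException {
        return load(System.getProperty(DRIVER_PROPERTY));
    }
}
```

Starting the application with -Dsax.driver.class=some.parser.ClassName would then select a different parser without touching the source.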
The XMLReader interface used in this example has many useful methods, which give developers control over the event handlers associated with the document being parsed, the parser features in effect, and so on. Table 15.1 lists the methods of the interface.
The XMLReader interface is a replacement for the Parser interface from version 1.0 and is a required interface for any SAX 2 driver (parser).
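To give a feel for the feature-related methods, here is a minimal sketch that queries a feature flag and copes with a parser that does not know it. The class FeatureProbe is hypothetical, and obtaining the reader through JAXP's SAXParserFactory is an assumption made so that the example runs on a plain JDK.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {
    // Standard SAX 2 feature name for namespace processing.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Returns "true", "false", or a short reason why the feature cannot be read. */
    static String describe(XMLReader reader, String featureUri) {
        try {
            return String.valueOf(reader.getFeature(featureUri));
        } catch (SAXNotRecognizedException e) {
            return "not recognized";   // the parser has never heard of this feature
        } catch (SAXNotSupportedException e) {
            return "not supported";    // known, but its state cannot be read right now
        }
    }

    /** Creates a namespace-aware reader through JAXP (an assumption of this sketch). */
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }
}
```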
As you can see from Table 15.1, the XMLReader interface provides us with methods that can be used to register event handlers. This is what interests us in the next step.
After we have an instance of an XMLReader, we can register event handlers for the events we are interested in.
To do so, we have a choice of methods on the XMLReader interface, but before setting a handler we have to implement it.
To start with, we need to implement the ContentHandler interface, whose methods the parser invokes (calls back) every time it encounters something of interest in the document being parsed.
Note
Callback methods are often utilized to handle events programmatically. For example, callbacks are used in the Windows operating system to report mouse events (and many other kinds of events), such as a mouse movement, to applications. The use of callbacks usually involves two steps—registration and handling.
Registration is needed to advise the source of events about the parties that are interested in receiving event notifications. Methods or functions are bound to an event type through registration.
Handling is what callback methods are written for. When an event occurs, the source of events (a parser in our case) invokes (or calls back) the methods registered to handle this particular type of event.
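In general, multiple handlers can be registered for an event, and they are usually invoked in the order of registration. SAX is simpler in this respect: an XMLReader keeps only one handler of each kind at a time, so registering a new handler replaces the previous one.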
Table 15.2 lists the methods of the ContentHandler interface, which have to be implemented by an application developer.
SAX provides a default implementation of this interface that can be used as a base class for any custom implementation. This default class is the DefaultHandler and it implements do-nothing versions of the callbacks defined in the interface ContentHandler (and three other handler interfaces). Application developers can inherit their implementations from it and override the methods they need.
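The following sketch shows the two steps together: a handler derived from DefaultHandler that overrides only startElement, and its registration with setContentHandler. The class ElementCounter is hypothetical, and the reader is obtained through JAXP's SAXParserFactory as an assumption so that the example runs on a plain JDK.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter extends DefaultHandler {
    // Number of start tags seen for each qualified element name.
    final Map<String, Integer> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        counts.merge(qName, 1, Integer::sum);
    }

    /** Parses the given XML text and returns the per-element counts. */
    static Map<String, Integer> count(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ElementCounter handler = new ElementCounter();
        reader.setContentHandler(handler);                      // registration
        reader.parse(new InputSource(new StringReader(xml)));   // callbacks fire here
        return handler.counts;
    }
}
```

All other ContentHandler callbacks fall through to the do-nothing versions inherited from DefaultHandler.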
After the content handler has been implemented, we need to make sure our software is able to respond to any problems that may occur during the parsing.
Three kinds of errors can occur during the parsing:
Fatal Error— This error occurs when something prevents the parser from further processing the document. A fatal error in SAX is defined by the W3C XML 1.0 Recommendation, Section 1.2.
Error— This error is reported by the parser when it encounters a problem with the document but is able to recover and continue processing. An error in SAX is defined by the W3C XML 1.0 Recommendation, Section 1.2.
Warning— This event is reported to an application when a SAX parser wants to report something that is neither an error nor a fatal error according to Section 1.2 of the W3C XML 1.0 Recommendation.
Using these categories of errors, the ErrorHandler interface defines three callback methods that any parser expects to be implemented by application developers. These methods are described in Table 15.3:
Name | Parameters | Description, Returns, Throws |
---|---|---|
fatalError | exception (SAXParseException) | Called when the application is requested to process a fatal error that occurred during the parsing process. Returns: No return value. Throws: SAXException |
error | exception (SAXParseException) | Called when the application is requested to process an error that occurred during the parsing process. Returns: No return value. Throws: SAXException |
warning | exception (SAXParseException) | Called when the application is requested to process a warning that occurred during the parsing process. Returns: No return value. Throws: SAXException |
Again, the developers of SAX have provided a default error handler for us. The DefaultHandler class discussed earlier also implements the ErrorHandler interface: its warning and error methods do nothing, while its fatalError method rethrows the exception it receives.
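Because the inherited warning and error methods are silent, applications usually supply their own error handler. The sketch below records every report and stops at the first fatal error. The class CollectingErrorHandler is hypothetical, and the reader comes from JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class CollectingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) { record("warning", e); }

    @Override
    public void error(SAXParseException e) { record("error", e); }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e; // the document is unusable from here on, so stop parsing
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    /** Parses the XML text and returns every problem that was reported. */
    static List<String> check(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError; nothing more to do here.
        }
        return handler.messages;
    }
}
```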
The next interface to implement is DTDHandler. The parser calls its methods when it encounters notation declarations and unparsed entity declarations, which application developers can then put to use.
Note
Note that this interface has nothing to do with validating documents against DTDs. To turn validation on, an application should call the setFeature method on the implementation of the XMLReader interface, passing the feature name http://xml.org/sax/features/validation and the value true.
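A minimal sketch of switching validation on follows. The class DtdValidation is hypothetical, the reader is obtained through JAXP's SAXParserFactory as an assumption, and validity problems are simply counted through the error callback.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DtdValidation {
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    /** Parses the XML text with validation on and counts reported validity errors. */
    static int countValidityErrors(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setFeature(VALIDATION, true);

        final int[] errors = {0};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++; // validity violations arrive as recoverable errors
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }
}
```

Without an error handler the violations would go unnoticed, because the parser keeps going after a recoverable error.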
Because the future of DTDs is rather unclear and it looks as though XML Schemas will preempt the concept of DTDs, it is unlikely that many of the readers will have to implement this handler. However, if needed, Table 15.4 shows the methods to implement.
Name | Parameters | Description, Returns, Throws |
---|---|---|
notationDecl | name (String), publicId (String), systemId (String) | Process the declaration of a notation. Returns: No return value. Throws: SAXException |
unparsedEntityDecl | name (String), publicId (String), systemId (String), notationName (String) | Process the declaration of an unparsed entity. Returns: No return value. Throws: SAXException |
The DefaultHandler class implements empty methods for this handler.
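For readers who do need this handler, the sketch below overrides both callbacks and registers the handler with setDTDHandler. The class DeclarationCollector is hypothetical, and the reader is created through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DeclarationCollector extends DefaultHandler {
    final List<String> declarations = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        declarations.add("notation " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        declarations.add("entity " + name + " -> " + notationName);
    }

    /** Parses the XML text and returns the DTD declarations that were reported. */
    static List<String> collect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DeclarationCollector handler = new DeclarationCollector();
        reader.setDTDHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.declarations;
    }
}
```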
Application developers normally do not need to implement the EntityResolver interface. Its only use is to intercept external entities before the parser tries to open them.
Implementing it can be useful when an application uses nonstandard system identifiers or nonstandard ways of resolving them.
The only method of the interface is described in Table 15.5. The DefaultHandler class contains a default implementation of the interface.
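As an illustration, the resolver sketched below serves one external entity from memory instead of letting the parser open its system identifier. The class InMemoryResolver and the identifier it recognizes are hypothetical, and the reader is obtained through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InMemoryResolver implements EntityResolver {
    // Hypothetical system identifier that this application resolves itself.
    static final String GREETING_ID = "http://example.invalid/entities/greeting.txt";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (GREETING_ID.equals(systemId)) {
            return new InputSource(new StringReader("hello"));
        }
        return null; // null asks the parser to resolve the entity in its usual way
    }

    /** Parses the XML text and returns all character data, entities expanded. */
    static String expand(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new InMemoryResolver());

        final StringBuilder text = new StringBuilder();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }
}
```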
Having instantiated a parser to read through XML documents and implemented the handlers to process events, your application is now ready to work with XML documents. However, SAX has other useful features, such as the ability to change or eliminate events, that can make your life much easier when it comes to processing XML input.
One of the advantages of event-based processing is that it allows developers to build data processing solutions from chains or sequences of filters and handlers, which process and possibly modify data while events “flow” through them.
For that purpose, SAX 2.0 provides a special interface, XMLFilter, and its default implementation class, XMLFilterImpl. XMLFilter is a simple extension of XMLReader that adds methods to enable filtering of the events initiated during the processing of documents.
There are only two methods in the XMLFilter interface as described in Table 15.6.
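The sketch below shows how a filter is attached to its parent with setParent and how it can eliminate events on their way to the application. The class SkipElementFilter is hypothetical, it relies on XMLFilterImpl as its base class, and the reader is created through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class SkipElementFilter extends XMLFilterImpl {
    private final String skipName;
    private int skipDepth = 0; // greater than zero while inside a skipped element

    SkipElementFilter(String skipName) { this.skipName = skipName; }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (skipDepth > 0 || qName.equals(skipName)) { skipDepth++; return; }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (skipDepth > 0) { skipDepth--; return; }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) super.characters(ch, start, length);
    }

    /** Returns the start tags and text that survive the filter, in document order. */
    static String outline(String xml, String skip) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SkipElementFilter filter = new SkipElementFilter(skip);
        filter.setParent(reader); // events now flow reader -> filter -> handler

        final StringBuilder out = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

Because a filter is itself an XMLReader, it can serve as the parent of another filter, which is how longer chains are built.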
The XMLFilterImpl class provides application developers with a convenient default implementation of the XMLFilter interface together with all four handler interfaces previously discussed.
All you need to do to implement your filter handlers is to derive your handler class from this default implementation and provide your implementations of the handlers you want to use.
Following is a simple Java example of a Y2K-aware filter, which replaces all the letters Y with K (that's why it's Y2K-aware) in element names:
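A minimal sketch of such a filter is given here. It is an illustration under stated assumptions: the class name Y2kFilter and the helper elementNames are hypothetical, both lowercase and uppercase letters are replaced, and the reader is obtained through JAXP's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class Y2kFilter extends XMLFilterImpl {
    Y2kFilter(XMLReader parent) { super(parent); }

    // Replaces every y with k (and Y with K) in an element name.
    private static String fix(String name) {
        return name.replace('y', 'k').replace('Y', 'K');
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, fix(localName), fix(qName), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, fix(localName), fix(qName));
    }

    /** Returns the element names seen by a handler placed after the filter. */
    static List<String> elementNames(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Y2kFilter filter = new Y2kFilter(reader);

        final List<String> names = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                names.add(q);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```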
To see the result of this processing, consider the following portion of an XML file:

    <january>
        <new-year-day date = "1">
            <celebrate drink = "champagne" />
        </new-year-day>
    </january>
    <february>
    </february>
The handler following our y2kFilter will receive startElement and endElement events with the following element names: januark, new-kear-dak, celebrate, and februark.
Now that you have been introduced to several of the interfaces in SAX, you may want the details on the implementation of SAX parsers. Although we cannot delve into the details of implementing SAX parsers in this book, we can give you a list of the interfaces that you must implement to write your own.
SAX provides implementations of most of the interfaces needed for both application and parser writers. The minimum set of parser interfaces that have to be implemented is as follows:
XMLReader (Parser for SAX 1–compliant parsers): This interface is used to set features of the parser and to start the parsing process. The implementation of this interface is the source of events for applications.
Locator: Used by an application whenever it needs to associate an event received from the parser with the location of a node in the document being parsed.
Attributes (AttributeList for SAX 1–compliant parsers): Applications use this interface to obtain the list of an element's attributes.
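From the application side, the Locator supplied by the parser is typically used as sketched below: the handler keeps the locator it receives and consults it inside a callback. The class LineTracker is hypothetical, and the reader is obtained through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LineTracker extends DefaultHandler {
    private Locator locator; // supplied by the parser before parsing starts, if at all
    final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (locator != null) {
            // The locator is only meaningful while a callback is running.
            lines.put(qName, locator.getLineNumber());
        }
    }

    /** Maps each element name to the line on which its start tag was reported. */
    static Map<String, Integer> locate(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LineTracker handler = new LineTracker();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.lines;
    }
}
```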
In addition to the set of core interfaces discussed before, the SAX 2 specification defines some features that are optional for parsers but are standardized to make all the SAX-compliant parsers compatible and transparent to applications.
The SAX 2 extensions package, org.xml.sax.ext, provides the following additional interfaces:
DeclHandler: Provides application developers with the ability to do more complex handling of DTD declarations than through the standard handlers.
LexicalHandler: An additional handler interface to deal with lexical events. It reports comments and the boundaries of CDATA sections, entities, and the DTD.
Attributes2: Extends the Attributes interface with information on whether an attribute was declared in the DTD and whether its value was specified in the document or defaulted.
EntityResolver2: Extends the EntityResolver interface so that an application can supply an external DTD subset and can resolve entities using the entity name and base URI as well.
Locator2: In addition to the services provided in the Locator interface, this one defines methods to retrieve the information regarding the encoding of the XML document and the version of XML.
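Extension handlers are registered through setProperty instead of a dedicated setter. The sketch below collects comments with a LexicalHandler, using DefaultHandler2 from org.xml.sax.ext as a convenient base class. The class CommentCollector is hypothetical, and the reader is obtained through JAXP's SAXParserFactory as an assumption of this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CommentCollector extends DefaultHandler2 {
    // Standard SAX 2 property name under which a LexicalHandler is registered.
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";

    final List<String> comments = new ArrayList<>();

    @Override
    public void comment(char[] ch, int start, int length) {
        comments.add(new String(ch, start, length));
    }

    /** Parses the XML text and returns the text of every comment reported. */
    static List<String> collect(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CommentCollector collector = new CommentCollector();
        // A parser that lacks this extension throws SAXNotRecognizedException here.
        reader.setProperty(LEXICAL_HANDLER, collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.comments;
    }
}
```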