“Gotcha!”

Before leaving our introduction to parsing XML documents, there are a few pitfalls to make you aware of. These “gotchas” will help you avoid common programming mistakes when using SAX, and we will discuss more of these for other APIs in the appropriate sections.

My Parser Doesn’t Support SAX 2.0: What Can I Do?

For those of you who are unlucky enough not to have a parser with SAX 2.0 support, don’t despair. First, you always have the option of changing parsers; keeping current on SAX standards is an important part of an XML parser’s responsibility, and if your vendor is not doing this, you may have other concerns to address with them as well. However, there are certainly cases where you are forced to use a parser because of legacy code or applications; in these situations, you are still not “left out in the cold.”

SAX 2.0 includes a helper class, org.xml.sax.helpers.ParserAdapter, which makes a SAX 1.0 Parser implementation behave like a SAX 2.0 XMLReader implementation. This handy class takes a 1.0 Parser implementation as a constructor argument and can then be used in place of that implementation. It allows a ContentHandler to be set, and handles all namespace callbacks properly. The only feature loss you will see is that skipped entities will not be reported, as this capability was not available in a 1.0 implementation in any form and cannot be emulated by a 2.0 adapter class. The adapter class would be used as shown in Example 3-6.

Example 3-6. Using a SAX 1.0 Parser as a 2.0 XMLReader

try {
    // Instantiate a SAX 1.0 parser by class name
    Parser parser = 
        ParserFactory.makeParser(
            "org.apache.xerces.parsers.SAXParser");

    // Wrap it so that it behaves like a SAX 2.0 XMLReader
    ParserAdapter myParser = new ParserAdapter(parser);

    // Register the content handler
    myParser.setContentHandler(contentHandler);
    
    // Register the error handler
    myParser.setErrorHandler(errHandler);            
        
    // Parse the document      
    myParser.parse(uri);
    
} catch (ClassNotFoundException e) {
    System.out.println(
        "The parser class could not be found.");
} catch (IllegalAccessException e) {
    System.out.println(
        "Insufficient privileges to load the parser class.");
} catch (InstantiationException e) {
    System.out.println(
        "The parser class could not be instantiated.");
} catch (ClassCastException e) {
    System.out.println(
        "The parser does not implement org.xml.sax.Parser");
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage(  ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage(  ));
}

If SAX is new to you and this example doesn’t make much sense, don’t worry about it; you are using the latest and greatest version of SAX (2.0) and probably won’t ever have to write code like this. Only in cases where a 1.0 parser must be used is this code helpful.

The SAX XMLReader: Reused and Reentrant

One of Java’s nicest features is the ease with which objects can be reused, and the memory savings that reuse brings. SAX parsers are no different. Once an XMLReader has been instantiated, you can continue using that parser to parse several or even hundreds of XML documents. Different documents or InputSources may be passed to the same parser one after another, allowing it to be used for a variety of tasks. However, parsers are not reentrant. Once the parsing process has started, a parser may not be used again until the parsing of the requested document or input has completed. For those of you who are prone to coding recursive methods, this is definitely a “gotcha!” The first time that you attempt to use a parser that is in the middle of processing another document, you will typically receive a rather nasty SAXException and all parsing will stop. What is the lesson learned? Parse one document at a time, or pay the price of instantiating multiple parser instances.
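
The following is a minimal sketch of both halves of that lesson, not a definitive pattern. It assumes the JAXP SAXParserFactory bundled with the JDK as the source of XMLReader instances; the ReaderReuseDemo and NameCollector classes and the include element are hypothetical names invented for the illustration. One reader parses two documents in sequence, while the nested parse triggered from inside a callback is handed to a second reader because the first is still busy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderReuseDemo {

    // Assumption: readers come from the JDK's default JAXP factory.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    /** Records element names; parses a nested document when it sees <include>. */
    static class NameCollector extends DefaultHandler {
        private final List<String> names;
        private final String included; // hypothetical XML text parsed for each <include>

        NameCollector(List<String> names, String included) {
            this.names = names;
            this.included = included;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts)
            throws SAXException {

            names.add(qName);
            if (qName.equals("include") && included != null) {
                try {
                    // The outer reader is in the middle of a parse, so the
                    // nested document gets a reader of its own.
                    XMLReader nested = newReader();
                    nested.setContentHandler(new NameCollector(names, null));
                    nested.parse(new InputSource(new StringReader(included)));
                } catch (SAXException e) {
                    throw e;
                } catch (Exception e) {
                    throw new SAXException(e);
                }
            }
        }
    }

    static List<String> collect() throws Exception {
        List<String> names = new ArrayList<>();
        XMLReader reader = newReader();
        reader.setContentHandler(new NameCollector(names, "<inner/>"));

        // Sequential reuse: each parse finishes before the next one starts.
        reader.parse(new InputSource(new StringReader("<a><include/></a>")));
        reader.parse(new InputSource(new StringReader("<b/>")));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect()); // prints [a, include, inner, b]
    }
}
```

Creating a reader per nested document is the "price" mentioned above; if nested parsing is frequent, pooling idle readers is one possible refinement.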

The Misplaced Locator

Another dangerous but seemingly innocuous feature of SAX events is the Locator instance that is made available through the setDocumentLocator( ) callback method. This gives the application the origin of a SAX event, and is useful for making decisions about the progress of parsing and how to react to events. However, this origin point is only valid while a parse is in progress and only inside the ContentHandler callbacks; once parsing is complete, the Locator is no longer valid, including in the case when another parse begins. A “gotcha” that many XML newcomers run into is handing the Locator to another class and reading it outside of the callback methods:

public void setDocumentLocator(Locator locator) {
    // Saving the Locator to a class outside the ContentHandler
    myOtherClass.setLocator(locator);
}
...

public void myOtherClassMethod(  ) {
    // Trying to use this outside of the ContentHandler
    System.out.println(locator.getLineNumber(  ));
}

This is an extremely bad idea, as this Locator becomes meaningless as soon as the scope of the ContentHandler implementation is left. Using a reference stored this way often results in erroneous information being supplied to the application, and in faulty decisions about the document being parsed. In other words, use this object locally, and not globally. In our ContentHandler implementation, we saved the supplied Locator to a member variable of the handler itself and consulted it only inside the callbacks. There it can correctly be used (for example) to give you the line number of each element as it is encountered:

public void startElement(String namespaceURI, String localName,
                         String rawName, Attributes atts)
    throws SAXException {
        
    System.out.print("startElement: " + localName +
                     " at line " + locator.getLineNumber(  ));
                
    if (!namespaceURI.equals("")) {
        System.out.println(" in namespace " + namespaceURI + 
                           " (" + rawName + ")");
    } else {
        System.out.println(" has no associated namespace");
    }

    for (int i=0; i<atts.getLength(  ); i++)
        System.out.println("  Attribute: " + atts.getLocalName(i) +
                           "=" + atts.getValue(i));      
}
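
If other parts of an application need position information, one option is to copy the values out of the Locator while the callback is running, rather than passing the Locator itself around. The sketch below assumes the JDK's bundled JAXP parser, which supplies a Locator; LocatorDemo and LineRecorder are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo {

    /** Keeps the Locator private and exposes only values copied from it. */
    static class LineRecorder extends DefaultHandler {
        private Locator locator; // consulted only inside callbacks
        final Map<String, Integer> startLines = new LinkedHashMap<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Copy the int now; it stays meaningful after the parse ends,
            // whereas the Locator itself does not.
            if (locator != null) {
                startLines.put(qName, locator.getLineNumber());
            }
        }

        @Override
        public void endDocument() {
            // Drop the reference so that nothing can read it after the parse.
            locator = null;
        }
    }

    static Map<String, Integer> recordLines(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LineRecorder recorder = new LineRecorder();
        reader.setContentHandler(recorder);
        reader.parse(new InputSource(new StringReader(xml)));
        return recorder.startLines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <child/>\n</root>";
        System.out.println(recordLines(xml));
    }
}
```

Other classes then receive plain line numbers, which cannot go stale the way a retained Locator reference can.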

Getting Ahead of the Data

The characters( ) callback method accepts a character array along with start and length parameters that identify which part of the array holds the text for the current event. This can cause some confusion; a common mistake is to write code like this to read from the character array:

public void characters(char[] ch, int start, int length)
    throws SAXException {

    // Wrong: ignores start and length and reads the whole array
    for (int i=0; i<ch.length; i++)
        System.out.print(ch[i]);
}

The mistake here is in reading from the beginning to the end of the character array. This natural “gotcha” results from years of iterating through arrays, whether in Java, C, or another language. However, in the case of a SAX event, this can cause quite a bug. SAX parsers are required to pass in a starting index and a length, and any loop construct should use these to bound its reads: the valid characters run from ch[start] up to, but not including, ch[start + length]. This allows parsers to optimize performance through lower-level manipulation of textual data, such as reading data ahead of the current location or reusing the array. This is all legal behavior within SAX, as the expectation is that a wrapping application will not try to “get ahead” of the range passed to the callback.

Mistakes like the one shown can result in gibberish data being output to the screen or used within the wrapping application, and are almost always problematic for applications. The loop construct looks very normal and compiles without a hitch, so this “gotcha” can be a very tricky problem to track down.
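
For comparison, here is a sketch of a loop that stays within the range the parser reports. It assumes the JDK's bundled JAXP parser; CharactersDemo and TextCollector are hypothetical names chosen for the illustration. Because a parser may deliver one run of text in several characters( ) calls, the handler accumulates what it receives instead of treating each call as the complete text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CharactersDemo {

    /** Accumulates character data, reading only the reported range. */
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Valid data runs from ch[start] up to, but not including,
            // ch[start + length]; anything else in the array is off limits.
            for (int i = start; i < start + length; i++) {
                text.append(ch[i]);
            }
            // Equivalent one-liner: text.append(ch, start, length);
        }
    }

    static String extractText(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints Hello, SAX!
        System.out.println(extractText("<greeting>Hello, <b>SAX</b>!</greeting>"));
    }
}
```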
