Before leaving our introduction to parsing XML documents, there are a few pitfalls to make you aware of. These “gotchas” will help you avoid common programming mistakes when using SAX, and we will discuss more of these for other APIs in the appropriate sections.
For those of you who are unlucky enough not to have a parser with SAX 2.0 support, don’t despair. First, you always have the option of changing parsers; keeping current on SAX standards is an important part of an XML parser’s responsibility, and if your vendor is not doing this, you may have other concerns to address with them as well. However, there are certainly cases where you are forced to use a parser because of legacy code or applications; in these situations, you are still not “left out in the cold.”
SAX 2.0 includes a helper class, org.xml.sax.helpers.ParserAdapter, which can make a SAX 1.0 Parser implementation behave like a SAX 2.0 XMLReader implementation. This handy class takes a 1.0 Parser implementation as a constructor argument and can then be used in place of that implementation. It allows a ContentHandler to be set, and handles all namespace callbacks properly. The only feature loss you will see is that skipped entities will not be reported, as this capability was not available in a 1.0 implementation in any form, and cannot be emulated by a 2.0 adapter class. The adapter class would be used as shown in Example 3-6.
Example 3-6. Using a SAX 1.0 Parser as a 2.0 XMLReader
    try {
        // Register a parser with SAX
        Parser parser =
            ParserFactory.makeParser(
                "org.apache.xerces.parsers.SAXParser");

        // Wrap the 1.0 Parser so it can be used as a 2.0 XMLReader
        ParserAdapter myParser = new ParserAdapter(parser);

        // Register the document handler
        myParser.setContentHandler(contentHandler);

        // Register the error handler
        myParser.setErrorHandler(errHandler);

        // Parse the document
        myParser.parse(uri);

    } catch (ClassNotFoundException e) {
        System.out.println(
            "The parser class could not be found.");
    } catch (IllegalAccessException e) {
        System.out.println(
            "Insufficient privileges to load the parser class.");
    } catch (InstantiationException e) {
        System.out.println(
            "The parser class could not be instantiated.");
    } catch (ClassCastException e) {
        System.out.println(
            "The parser does not implement org.xml.sax.Parser");
    } catch (IOException e) {
        System.out.println("Error reading URI: " + e.getMessage( ));
    } catch (SAXException e) {
        System.out.println("Error in parsing: " + e.getMessage( ));
    }
If SAX is new to you and this example doesn’t make much sense, don’t worry about it; you are using the latest and greatest version of SAX (2.0) and probably won’t ever have to write code like this. Only in cases where a 1.0 parser must be used is this code helpful.
One of Java’s nicest features is the ease with which objects can be reused, and the memory advantages of this reuse. SAX parsers are no different. Once an XMLReader has been instantiated, it is possible to continue using that parser, parsing several or even hundreds of XML documents. Different documents or InputSources may be passed to a parser one after another, allowing it to be used for a variety of different tasks. However, parsers are not reentrant. Once the parsing process has started, a parser may not be used again until the parsing of the requested document or input has completed. For those of you who are prone to coding recursive methods, this is definitely a “gotcha!” The first time that you attempt to use a parser that is in the middle of processing another document, you will receive a rather nasty SAXException and all parsing will stop. What is the lesson learned? Parse one document at a time, or pay the price of instantiating multiple parser instances.
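As a rough illustration of sequential reuse, the sketch below feeds several documents to a single XMLReader, one after another. It assumes the reader is obtained through JAXP's SAXParserFactory, which is not how the examples in this chapter create their parser; the ReaderReuse and NameCollector names are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderReuse {

    /** Hypothetical handler that records the name of every element seen. */
    static class NameCollector extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            names.add(qName);
        }
    }

    /** Parses each document in turn with the same XMLReader instance. */
    static List<String> parseAll(String... documents) throws Exception {
        // Assumption: JAXP is used here only to obtain an XMLReader
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        NameCollector collector = new NameCollector();
        reader.setContentHandler(collector);

        for (String xml : documents) {
            // Each parse() call returns before the next one starts,
            // so the reader is never asked to parse two inputs at once.
            reader.parse(new InputSource(new StringReader(xml)));
        }
        return collector.names;
    }

    public static void main(String[] args) throws Exception {
        // Two documents, one reader; prints [a, b, c]
        System.out.println(parseAll("<a><b/></a>", "<c/>"));
    }
}
```

The important detail is that every call to parse( ) completes before the next begins. Calling parse( ) on the same reader from inside one of its own callbacks is the reentrant case the text warns about.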
Another dangerous but seemingly innocuous feature of SAX events is the Locator instance that is made available through the setDocumentLocator( ) callback method. This gives the application the origin of a SAX event, and is useful for making decisions about the progress of parsing and how to react to events. However, this origin point is only valid while the parse is in progress and a ContentHandler callback is executing; once parsing is complete, the Locator is no longer valid, including in the case when another parse begins. A “gotcha” that many XML newcomers fall into is to hand the Locator to another class and then use it outside of the callback methods:
    public void setDocumentLocator(Locator locator) {
        // Saving the Locator to a class outside the ContentHandler
        myOtherClass.setLocator(locator);
    }
    ...
    public void myOtherClassMethod( ) {
        // Trying to use this outside of the ContentHandler
        System.out.println(locator.getLineNumber( ));
    }
This is an extremely bad idea, as this Locator becomes meaningless as soon as the parse ends and control leaves the ContentHandler callbacks. Using a reference saved this way often supplies erroneous information to the application, which can in turn cause it to mishandle the XML document that was parsed. In other words, use this object locally, and not globally. In our ContentHandler implementation, we saved the supplied Locator to a member variable of the handler itself. It could then correctly be used (for example) to give you the line number of each element as it was encountered:
    public void startElement(String namespaceURI, String localName,
                             String rawName, Attributes atts)
        throws SAXException {

        System.out.print("startElement: " + localName +
            " at line " + locator.getLineNumber( ));
        if (!namespaceURI.equals("")) {
            System.out.println(" in namespace " + namespaceURI +
                               " (" + rawName + ")");
        } else {
            System.out.println(" has no associated namespace");
        }

        for (int i=0; i<atts.getLength( ); i++) {
            System.out.println(" Attribute: " + atts.getLocalName(i) +
                               "=" + atts.getValue(i));
        }
    }
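If position information is needed after parsing has finished, one option is to copy the values out of the Locator while a callback is running, rather than keeping the Locator itself. The following is a minimal sketch of that idea, not the handler used in this chapter: the LineRecorder class and its record( ) helper are hypothetical, and it assumes a parser obtained through JAXP that actually supplies a Locator (SAX encourages this but does not require it).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LineRecorder extends DefaultHandler {

    // Consulted only from inside callbacks, while the parse is running
    private Locator locator;

    // Plain values copied out of the Locator; safe to read after parsing
    private final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (locator != null) {
            // Copy the int now instead of holding on to the Locator
            lines.put(qName, locator.getLineNumber());
        }
    }

    @Override
    public void endDocument() {
        // The Locator is meaningless once parsing ends, so drop it
        locator = null;
    }

    Map<String, Integer> getLines() {
        return lines;
    }

    /** Hypothetical helper: parses a string and returns element lines. */
    static Map<String, Integer> record(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        LineRecorder handler = new LineRecorder();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getLines();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(record("<a>\n<b/>\n</a>"));
    }
}
```

Because the map holds ordinary integers, other classes can read it long after the parse has completed without ever touching the Locator.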
The characters( ) callback method accepts a character array along with start and length parameters, which signify the index at which to start reading and how many characters to read from that array. This can cause some confusion; a common mistake is to include code like this example to read from the character array:
    public void characters(char[] ch, int start, int length)
        throws SAXException {

        // Mistake: loops over the whole array, ignoring start and length
        for (int i=0; i<ch.length; i++) {
            System.out.print(ch[i]);
        }
    }
The mistake here is in reading from the beginning to the end of the character array. This natural “gotcha” results from years of iterating through arrays, whether in Java, C, or another language. However, in the case of a SAX event, this can cause quite a bug. SAX parsers are required to pass in a starting index and a length for the character array, and any loop construct should use these values to bound its reads from the array. This allows lower-level manipulation of textual data to occur to optimize parser performance, such as reading data ahead of the current location as well as array reuse. This is all legal behavior within SAX, as the expectation is that a wrapping application will not try to read outside the range given by the start and length parameters sent to the callback.
Mistakes as in the example shown can result in gibberish data being output to the screen or used within the wrapping application, and are almost always problematic for applications. The loop construct looks very normal and compiles without a hitch, so this “gotcha” can be a very tricky problem to track down.
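For contrast, here is a minimal sketch of a characters( ) implementation that stays inside the range the parser reports. It relies on the standard SAX 2.0 signature, in which the third argument is a length rather than an ending index. The TextCollector class and its collect( ) helper are hypothetical names used only for this illustration, and the parser is assumed to come from JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only ch[start] through ch[start + length - 1] belong to this
        // event; the rest of the array may hold unrelated data.
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    /** Hypothetical helper: returns all character data in a document. */
    static String collect(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        TextCollector handler = new TextCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        // Prints: Hello, SAX!
        System.out.println(collect("<msg>Hello, <b>SAX</b>!</msg>"));
    }
}
```

An explicit loop works just as well, provided it runs from start up to, but not including, start + length. A parser may also deliver one run of text in several characters( ) calls, which is why the sketch appends to a buffer instead of assuming a single call.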