Content Handlers

In order to let our application do something useful with XML data as it is being parsed, we must register handlers with the SAX parser. A handler is nothing more than a set of callbacks that SAX defines to let us interject application code at important events within a document’s parsing. Realize that these events will take place as the document is parsed, not after the parsing has occurred. This is one of the reasons that SAX is such a powerful interface: it allows a document to be handled sequentially, without having to first read the entire document into memory. We will later look at the Document Object Model (DOM), which has this limitation.

There are four core handler interfaces defined by SAX 2.0: org.xml.sax.ContentHandler, org.xml.sax.ErrorHandler, org.xml.sax.DTDHandler, and org.xml.sax.EntityResolver. In this chapter, we discuss ContentHandler, which allows standard data-related events within an XML document to be handled, and take a first look at ErrorHandler, which receives notifications from the parser when errors in the XML data are found. DTDHandler will be examined in Chapter 5. We briefly discuss EntityResolver at various points in the text; it is enough for now to understand that EntityResolver works just like the other handlers, and is built specifically for resolving external entities specified within an XML document. Custom application classes that perform specific actions within the parsing process can implement each of these interfaces. These implementation classes can be registered with the parser with the methods setContentHandler( ), setErrorHandler( ), setDTDHandler( ), and setEntityResolver( ). Then the parser invokes the callback methods on the appropriate handlers during parsing.
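As a rough illustration of the registration step just described, the following sketch registers a single object for all four handler roles. It makes several assumptions that are not part of our example: it obtains its XMLReader through the JDK's JAXP factory (javax.xml.parsers) rather than a vendor class, it extends org.xml.sax.helpers.DefaultHandler to inherit do-nothing callbacks, and the names HandlerRegistrationSketch, CountingHandler, and countElements are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class HandlerRegistrationSketch {

    /** Hypothetical handler: counts elements and inherits no-op callbacks for the rest. */
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            elements++;
        }
    }

    /** Parses the given XML text and returns how many start tags were reported. */
    static int countElements(String xml) {
        try {
            // Assumption: the JDK's JAXP factory supplies the XMLReader
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CountingHandler handler = new CountingHandler();

            // DefaultHandler implements all four core handler interfaces
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.setDTDHandler(handler);
            reader.setEntityResolver(handler);

            // Callbacks fire while parse() runs, not after it returns
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.elements;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Elements seen: " + countElements("<a><b/><c/></a>"));
    }
}
```

In a larger application each role would more likely get its own class; one object is used here only to keep the sketch short.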

For our example, we want to implement the ContentHandler interface. This interface defines several important methods within the parsing lifecycle that our application can react to. First we need to add the appropriate import statements to our source file (including the org.xml.sax.Locator and org.xml.sax.Attributes interfaces, which we will discuss in a moment), as well as a new class that will implement these callback methods. This new class can be added at the end of your source file, SAXParserDemo.java:

import java.io.IOException;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Import your vendor's XMLReader implementation here
import org.apache.xerces.parsers.SAXParser;

...

/**
 * <b><code>MyContentHandler</code></b> implements the SAX 
 *   <code>ContentHandler</code> interface and defines callback
 *   behavior for the SAX callbacks associated with an XML
 *   document's content.
 */
class MyContentHandler implements ContentHandler {

    /** Hold onto the locator for location information */
    private Locator locator;
    
    /**
     * <p>
     * Provide reference to <code>Locator</code> which provides
     *   information about where in a document callbacks occur.
     * </p>
     *
     * @param locator <code>Locator</code> object tied to callback
     *                process
     */
    public void setDocumentLocator(Locator locator) {
    }
    
    /**
     * <p>
     * This indicates the start of a Document parse--this precedes
     *   all callbacks in all SAX Handlers with the sole exception
     *   of <code>{@link #setDocumentLocator}</code>.
     * </p>
     *
     * @throws <code>SAXException</code> when things go wrong
     */
    public void startDocument(  ) throws SAXException {
    }

    /**
     * <p>
     * This indicates the end of a Document parse--this occurs after
     *   all callbacks in all SAX Handlers.
     * </p>
     *
     * @throws <code>SAXException</code> when things go wrong
     */    
    public void endDocument(  ) throws SAXException {
    }
    
    /**
     * <p>
     * This indicates that a processing instruction (other than
     *   the XML declaration) has been encountered.
     * </p>
     *
     * @param target <code>String</code> target of PI
     * @param data <code>String</code> containing all data sent to the PI.
     *             This typically looks like one or more attribute value
     *             pairs.
     * @throws <code>SAXException</code> when things go wrong
     */         
    public void processingInstruction(String target, String data)
        throws SAXException {
    }
    
    /**
     * <p>
     * This indicates the beginning of an XML Namespace prefix 
     *   mapping.  Although this typically occurs within the root element 
     *   of an XML document, it can occur at any point within the 
     *   document.  Note that a prefix mapping on an element triggers 
     *   this callback <i>before</i> the callback for the actual element 
     *   itself (<code>{@link #startElement}</code>) occurs.
     * </p>
     *
     * @param prefix <code>String</code> prefix used for the namespace 
     *               being reported
     * @param uri <code>String</code> URI for the namespace 
     *               being reported
     * @throws <code>SAXException</code> when things go wrong
     */         
    public void startPrefixMapping(String prefix, String uri)
        throws SAXException {
    }

    /**
     * <p>
     * This indicates the end of a prefix mapping, when the namespace 
     *   reported in a <code>{@link #startPrefixMapping}</code> callback 
     *   is no longer available.
     * </p>
     *
     * @param prefix <code>String</code> of namespace being reported
     * @throws <code>SAXException</code> when things go wrong
     */    
    public void endPrefixMapping(String prefix) throws SAXException {
    }
    
    /**
     * <p>
     * This reports the occurrence of an actual element. It includes
     *   the element's attributes, with the exception of XML vocabulary 
     *   specific attributes, such as 
     *   <code>xmlns:[namespace prefix]</code> and
     *   <code>xsi:schemaLocation</code>.
     * </p>
     *
     * @param namespaceURI <code>String</code> namespace URI this element
     *                     is associated with, or an empty 
     *                     <code>String</code>
     * @param localName <code>String</code> name of element (with no 
     *                  namespace prefix, if one is present)
     * @param rawName <code>String</code> XML 1.0 version of element name:
     *                [namespace prefix]:[localName]
     * @param atts <code>Attributes</code> list for this element
     * @throws <code>SAXException</code> when things go wrong
     */    
    public void startElement(String namespaceURI, String localName,
                             String rawName, Attributes atts)
        throws SAXException { 
    }
    
    /** 
     * <p>
     * Indicates the end of an element 
     *   (<code>&lt;/[element name]&gt;</code>) is reached. Note that 
     *   the parser does not distinguish between empty
     *   elements and non-empty elements, so this occurs uniformly.
     * </p>
     *
     * @param namespaceURI <code>String</code> URI of namespace this 
     *                     element is associated with
     * @param localName <code>String</code> name of element without prefix
     * @param rawName <code>String</code> name of element in XML 1.0 form
     * @throws <code>SAXException</code> when things go wrong
     */    
    public void endElement(String namespaceURI, String localName, 
                           String rawName) 
        throws SAXException {
    }

    /**
     * <p>
     * This reports character data (within an element).
     * </p>
     *
     * @param ch <code>char[]</code> character array with character data
     * @param start <code>int</code> index in array where data starts.
     * @param length <code>int</code> number of characters to read
     *               from the array, beginning at <code>start</code>.
     * @throws <code>SAXException</code> when things go wrong
     */
    public void characters(char[] ch, int start, int length) 
        throws SAXException {            
    }

    /**
     * <p>
     * This reports whitespace that can be ignored in the 
     *   originating document. This is typically invoked only when
     *   validation is occurring in the parsing process.
     * </p>
     *
     * @param ch <code>char[]</code> character array with character data
     * @param start <code>int</code> index in array where data starts.
     * @param length <code>int</code> number of characters to read
     *               from the array, beginning at <code>start</code>.
     * @throws <code>SAXException</code> when things go wrong
     */
    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException {
    }

    /**
     * <p>
     * This reports an entity that is skipped by the parser. This
     *   should only occur for non-validating parsers, and then is still
     *   implementation-dependent behavior.
     * </p>
     *
     * @param name <code>String</code> name of entity being skipped
     * @throws <code>SAXException</code> when things go wrong     
     */
    public void skippedEntity(String name) throws SAXException {
    }
}

We have added empty implementations for all the methods defined in the ContentHandler interface, which allows our source file to compile. Of course, these empty implementations don’t provide any feedback for us, so we will walk through each of these required methods now.

The Document Locator

The first method we need to define is one that sets an org.xml.sax.Locator for any SAX event. When a callback event occurs, a class that implements a handler often needs to know where in the XML file the SAX parser currently is. This can then be used to help the application make decisions about the event and its location within the XML document. The Locator interface has several useful methods, such as getLineNumber( ) and getColumnNumber( ), that return the current location within an XML file when invoked. Because this location is only valid for the current parsing lifecycle, the Locator should only be used within the scope of the ContentHandler implementation. Since we may want to use this later, we save the provided Locator instance to a member variable and print out a message indicating that the callback has occurred. This will help outline the order and occurrence of SAX events:

/** Hold onto the locator for location information */
private Locator locator;

/**
 * <p>
 * Provide reference to <code>Locator</code>, which provides
 *   information about where in a document callbacks occur.
 * </p>
 *
 * @param locator <code>Locator</code> object tied to callback
 *                process
 */
public void setDocumentLocator(Locator locator) {
    System.out.println("    * setDocumentLocator(  ) called");

    // We save this for later use if desired.
    this.locator = locator;
}

Later, we can add details to this method if we need to act upon information about the origin of events; in this example, we merely want to show information about what is occurring in the parsing process. Because the Locator is saved in a member variable, other callbacks in the class can later use it to show where in the document events occur, such as the line number an element appeared on.
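To make the idea of using the saved Locator concrete, here is a minimal sketch that records the line number reported at each startElement( ) callback. It assumes a JAXP-supplied parser and extends DefaultHandler for brevity; the names LocatorSketch, LineReportingHandler, and elementLines are hypothetical. Since a parser is not required to supply a Locator at all, the sketch guards against a null reference.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch {

    /** Hypothetical handler: remembers the Locator and notes a line per element. */
    static class LineReportingHandler extends DefaultHandler {
        private Locator locator;
        final List<String> report = new ArrayList<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            // Parsers are not required to supply a Locator, so guard against null
            int line = (locator == null) ? -1 : locator.getLineNumber();
            report.add(rawName + "@" + line);
        }
    }

    /** Returns entries of the form name@line, one per start tag. */
    static List<String> elementLines(String xml) {
        try {
            LineReportingHandler handler = new LineReportingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.report;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementLines("<root>\n  <child/>\n</root>"));
    }
}
```

The Locator is only queried inside a callback, which is the one time its values are meaningful.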

The Start and the End of a Document

In any lifecycle process, there must always be a beginning and an end. These important events should both occur once, the former before all other events, and the latter after all other events. This obvious fact is critical to applications, as it allows them to know exactly when parsing begins and exactly when it ends. SAX provides callback methods for each of these events, startDocument( ) and endDocument( ) .

The first method, startDocument( ), is called before any other callbacks, including the callback methods within other SAX handlers, such as DTDHandler. In other words, startDocument( ) is not only the first method called within ContentHandler, but also within the entire parsing process, aside from the setDocumentLocator( ) method we just discussed. This ensures a definite beginning to parsing, and lets the application perform any tasks it needs to before parsing takes place.

The second method, endDocument( ), is intended to be the last method called, again across all handlers. This includes situations in which errors occur that cause parsing to halt. We will discuss errors later, but there are both recoverable errors and unrecoverable errors. If an unrecoverable error occurs, the ErrorHandler’s callback method will be invoked, and a final call to endDocument( ) is then meant to complete the attempted parsing. Parsers differ on this point in practice, so it is safest not to depend on endDocument( ) being invoked after an unrecoverable error.

In our example, we want to output to the console when both these events occur to further illustrate the parsing lifecycle:

/**
 * <p>
 * This indicates the start of a Document parse--this precedes
 *   all callbacks in all SAX Handlers with the sole exception
 *   of <code>{@link #setDocumentLocator}</code>.
 * </p>
 *
 * @throws <code>SAXException</code> when things go wrong
 */
public void startDocument(  ) throws SAXException {
    System.out.println("Parsing begins...");
}

/**
 * <p>
 * This indicates the end of a Document parse--this occurs after
 *   all callbacks in all SAX Handlers.
 * </p>
 *
 * @throws <code>SAXException</code> when things go wrong
 */
public void endDocument(  ) throws SAXException {
    System.out.println("...Parsing ends.");
}

Both of these callback methods can throw SAXExceptions. These are the only types of exceptions that SAX events ever throw, and they provide another standard interface to the parsing behavior. However, these exceptions often wrap other exceptions that are indicative of what problems occur. For example, if an XML file was being parsed over the network via a URL, and the connection suddenly became invalid, an IOException would result. However, an application using the SAX classes should not have to catch this exception, because it should not have to know where the XML resource is located. Instead, the application can catch the single SAXException. Within the SAX parser, the original exception is caught and re-thrown as a SAXException, with the originating exception stuffed inside the new one. This allows applications to have one standard exception to trap for, while allowing specific details of what errors occurred within the parsing process to be wrapped and made available to the calling program through this standard exception. The SAXException class provides a method, getException( ), which returns the underlying Exception.
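The wrapping behavior can be sketched from the handler's side as well. In the hypothetical example below, a callback wraps an IOException in a SAXException, and the calling code recovers it with getException( ). The element name fail, the "disk full" message, and the class and method names are all invented for illustration, and a JAXP-supplied parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WrappedExceptionSketch {

    /** Hypothetical handler: simulates an I/O failure when it sees a fail element. */
    static class FailingHandler extends DefaultHandler {
        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts)
            throws SAXException {
            if ("fail".equals(rawName)) {
                // Wrap the underlying problem so callers trap one exception type
                throw new SAXException("Could not process element",
                                       new IOException("disk full"));
            }
        }
    }

    /** Returns a description of the wrapped exception, or null if parsing succeeded. */
    static String underlyingProblem(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new FailingHandler());
            return null;
        } catch (SAXException e) {
            // getException() exposes whatever was wrapped, if anything
            Exception wrapped = e.getException();
            return (wrapped == null)
                ? null
                : wrapped.getClass().getSimpleName() + ": " + wrapped.getMessage();
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(underlyingProblem("<ok><fail/></ok>"));
    }
}
```

Checking getException( ) for null matters because a SAXException may be created with only a message and no embedded exception.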

Processing Instructions

You should recall that we talked about processing instructions (PIs) within XML as a bit of a special case. They were not considered XML elements, and were handled differently by being passed to the calling application. Because of these special characteristics, SAX defines a specific callback for handling processing instructions. This method receives the target of the processing instruction and any data sent to the PI. For our example, we want to echo this information to the screen to notify us when a callback is made:

/**
 * <p>
 * This indicates that a processing instruction (other than
 *   the XML declaration) has been encountered.
 * </p>
 *
 * @param target <code>String</code> target of PI
 * @param data <code>String</code> containing all data sent to the PI.
 *             This typically looks like one or more attribute-value
 *             pairs.
 * @throws <code>SAXException</code> when things go wrong
 */
public void processingInstruction(String target, String data)
    throws SAXException {

    System.out.println("PI: Target:" + target + " and Data:" + data);
}

In a real application that is using XML data, this is where an application could receive instructions and set variable values or execute methods to perform application-specific processing. For example, the Apache Cocoon publishing framework might set flags to perform transformations on the data once it is parsed, or to display the XML as a specific content type. This method, like the other SAX callbacks, throws a SAXException when errors occur.

You may also remember that in our discussion of PIs we mentioned the XML declaration. This special processing instruction gives the version and optional information about the encoding of the document and whether it is a standalone document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

This instruction is specifically for the XML parser, allowing the parser to report an error, like a version that is not supported, at the outset of parsing. Because this instruction is only intended to be used by the parser, it does not initiate a callback to processingInstruction( ). Be sure not to build application code that expects this instruction or version information, because the application will never receive a callback for this PI. In fact, it is only the parser that should have much interest in the encoding and version of an XML document, as these items are used in parsing. Once the data is available to you through Java APIs, these details are generally irrelevant.
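A short sketch can confirm this behavior. The hypothetical handler below collects the target of every processing instruction it is told about; the XML declaration at the top of the sample document never shows up in the list. The sample document, the app-hint PI, and the class and method names are assumptions made for illustration, as is the use of a JAXP-supplied parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ProcessingInstructionSketch {

    /** Hypothetical sample: an XML declaration followed by two real PIs. */
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<?xml-stylesheet href=\"style.xsl\" type=\"text/xsl\"?>"
        + "<root><?app-hint cache=\"no\"?></root>";

    /** Collects the target of each processingInstruction() callback. */
    static class TargetCollector extends DefaultHandler {
        final List<String> targets = new ArrayList<>();

        @Override
        public void processingInstruction(String target, String data) {
            targets.add(target);
        }
    }

    /** Returns the PI targets reported for the given document, in order. */
    static List<String> piTargets(String xml) {
        try {
            TargetCollector handler = new TargetCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.targets;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // The XML declaration is consumed by the parser and is not listed
        System.out.println(piTargets(SAMPLE));
    }
}
```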

Namespace Callbacks

By the amount of discussion (and confusion) we have already encountered about namespaces in XML, you should be starting to realize their importance and impact on parsing and handling XML. Alongside XML Schema, XML Namespaces is easily the most significant concept added to XML since the original XML 1.0 Recommendation. With SAX 2.0, support for namespaces was introduced at the element level. This allows a distinction to be made between the namespace of an element, signified by an element prefix and an associated namespace URI, and the local name of an element. In this case, we use local name to refer to the unprefixed name of an element. For example, the local name of JavaXML:Book is simply Book. The namespace prefix is JavaXML, and the namespace URI (in our example) is declared as http://www.oreilly.com/catalog/javaxml.

There are two SAX callbacks specifically dealing with namespaces (although the element callbacks use them as well). These callbacks are invoked when the parser reaches the start and end of a prefix mapping. Although this is a new term, it is not a new concept; a prefix mapping is simply the association between a prefix and a namespace URI that an element declares with an xmlns attribute. The declaring element is often the root element (which may have multiple mappings), but it can be any element within an XML document that declares an explicit namespace. For example:

<root>
  <element1>
    <myNamespace:element2 xmlns:myNamespace="http://myUrl.com">
      <myNamespace:element3>Here is some data</myNamespace:element3>
    </myNamespace:element2>
  </element1>
</root>

In this case, an explicit namespace is declared several element nestings deep within the document.

The startPrefixMapping( ) callback is given the namespace prefix as well as the URI associated with that prefix. The mapping is considered “closed” or “ended” when the element that declared the mapping is closed. The only twist to this callback is that it doesn’t quite behave in the sequential manner in which SAX usually is structured; the prefix mapping callback occurs directly before the callback for the element that declares the namespace. We look at this callback now:

/**
 * <p>
 * This will indicate the beginning of an XML Namespace prefix 
 *   mapping.  Although this typically occurs within the root element 
 *   of an XML document, it can occur at any point within the 
 *   document.  Note that a prefix mapping on an element triggers 
 *   this callback <i>before</i> the callback for the actual element 
 *   itself (<code>{@link #startElement}</code>) occurs.
 * </p>
 *
 * @param prefix <code>String</code> prefix used for the namespace 
 *               being reported
 * @param uri <code>String</code> URI for the namespace 
 *            being reported
 * @throws <code>SAXException</code> when things go wrong
 */
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {

    System.out.println("Mapping starts for prefix " + prefix + 
                       " mapped to URI " + uri);
}

In our document, the only mapping we have is declared as an attribute of the root element. That means we should expect to see this callback invoked before the first element callback (which we look at next), although still after the startDocument( ) callback as well as any PIs we have at the top of our document. The other half of this namespace pair of callbacks is invoked to signify the end of the mapping, and appears directly after the closing tag of the element declaring the mapping:

/**
 * <p>
 * This indicates the end of a prefix mapping, when the namespace 
 *   reported in a <code>{@link #startPrefixMapping}</code> callback 
 *   is no longer available.
 * </p>
 *
 * @param prefix <code>String</code> of namespace being reported
 * @throws <code>SAXException</code> when things go wrong
 */
public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("Mapping ends for prefix " + prefix);
}

For the XML document fragment above, we could then expect the following output when the element2 element was reached:

Mapping starts for prefix myNamespace mapped to URI http://myUrl.com

This lets us know the prefix being mapped, and what URI that prefix is associated with.
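The ordering of the namespace callbacks relative to the element callbacks can be checked with a sketch like the following, which records every relevant event for the fragment shown earlier. It assumes a JAXP-supplied parser with namespace awareness switched on through setNamespaceAware(true); the names PrefixMappingSketch, OrderRecorder, and eventOrder are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrefixMappingSketch {

    /** The fragment from the text, written on one line. */
    static final String SAMPLE =
        "<root><element1>"
        + "<myNamespace:element2 xmlns:myNamespace=\"http://myUrl.com\">"
        + "<myNamespace:element3>Here is some data</myNamespace:element3>"
        + "</myNamespace:element2>"
        + "</element1></root>";

    /** Records namespace and element callbacks in the order they arrive. */
    static class OrderRecorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startPrefixMapping(String prefix, String uri) {
            events.add("startPrefixMapping " + prefix);
        }

        @Override
        public void endPrefixMapping(String prefix) {
            events.add("endPrefixMapping " + prefix);
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            events.add("startElement " + rawName);
        }

        @Override
        public void endElement(String namespaceURI, String localName,
                               String rawName) {
            events.add("endElement " + rawName);
        }
    }

    /** Parses with namespace processing enabled and returns the event list. */
    static List<String> eventOrder(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for prefix-mapping callbacks
            OrderRecorder handler = new OrderRecorder();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        eventOrder(SAMPLE).forEach(System.out::println);
    }
}
```

In the recorded list, the mapping for myNamespace starts immediately before the startElement entry for element2 and ends immediately after its endElement entry.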

Element Callbacks

By now you are probably ready to actually get to the data in our XML document. It is true that over half of the SAX callbacks have nothing to do with XML elements, attributes, and data. This is because the process of parsing XML is intended to do more than simply provide your application with the XML data; it should give the application instructions from XML PIs so your application can know what actions to take, let the application know when parsing starts and when it ends, and even tell it when there is whitespace that can be ignored! If some of these callbacks don’t make much sense, keep reading. We’ll explain more here, as well as in Chapter 5, when we look at how validation of XML fits into the picture.

Still, there certainly are SAX callbacks intended to give you access to the XML data within your documents. The three primary events you will concern yourself with to get that data are the start and end of elements and the characters( ) callback. These tell you when an element is parsed, the data within that element, and when the closing tag for that element is reached. The first of these, startElement( ), gives an application information about an XML element and any attributes it may have. The parameters to this callback are the name of the element (in various forms), and an org.xml.sax.Attributes instance (remember our import statement earlier?). This helper interface provides access to all of the attributes within an element. It allows easy iteration through the element’s attributes in a form similar to a Vector. In addition to being able to reference an attribute by its index (used when iterating through all attributes), it is possible to reference an attribute by its name. Of course, by now you should be a bit cautious when you see the word “name” referring to an XML element or attribute, as it can mean various things. In this case, either the complete name of the attribute (with a namespace prefix, if any), called its “raw” name, can be used, or the combination of its local name and namespace URI if a namespace is used. There are also helper methods such as getURI(int index) and getLocalName(int index) that help give additional namespace information about an attribute. Used as a whole, the Attributes interface can be a comprehensive set of information about an element’s attributes.

In addition to the element attributes, we mentioned you get several forms of the element’s name. This again is in deference to XML namespaces. The namespace URI of the element is first supplied. This places the element in its correct context across the complete document’s set of namespaces. Then the local name of the element is supplied, which we mentioned is the unprefixed element name. In addition (and for backwards compatibility), the “raw” name of the element is supplied. This is the unmodified, unchanged name of the element, which includes a namespace prefix if present. In other words, this is exactly what was in the XML document, and so it would be JavaXML:Book for our Book element. With these three types of names supplied, you should be able to describe an element with or without respect to its namespace.

Now that we’ve seen how an element and its attributes are made available, let’s look at an implementation of the SAX callback that prints this information out to the screen when it is invoked. In this example, we see if the element name has a namespace URI associated with it; if so, we print out that namespace; if not, we print a message stating that the element has no namespace associated with it:

/**
 * <p>
 * This reports the occurrence of an actual element. It will include
 *   the element's attributes, with the exception of XML vocabulary 
 *   specific attributes, such as 
 *   <code>xmlns:[namespace prefix]</code> and
 *   <code>xsi:schemaLocation</code>.
 * </p>
 *
 * @param namespaceURI <code>String</code> namespace URI this element
 *                     is associated with, or an empty 
 *                     <code>String</code>
 * @param localName <code>String</code> name of element (with no 
 *                  namespace prefix, if one is present)
 * @param rawName <code>String</code> XML 1.0 version of element name:
 *                [namespace prefix]:[localName]
 * @param atts <code>Attributes</code> list for this element
 * @throws <code>SAXException</code> when things go wrong
 */
public void startElement(String namespaceURI, String localName,
                         String rawName, Attributes atts)
    throws SAXException {

    System.out.print("startElement: " + localName);
    if (!namespaceURI.equals("")) {
        System.out.println(" in namespace " + namespaceURI + 
                           " (" + rawName + ")");
    } else {
        System.out.println(" has no associated namespace");
    }

    for (int i = 0; i < atts.getLength(  ); i++) {
        System.out.println("  Attribute: " + atts.getLocalName(i) +
                           "=" + atts.getValue(i));
    }
}

SAX makes this process very simple and straightforward. One final thing to notice when looking at the startElement( ) callback and attributes in particular is that attributes do not remain ordered. When iterating through an Attributes implementation, the attributes will not necessarily be available in the order in which they were parsed, which is the order in which they were written. This means it is not a good idea to depend on the ordering of attributes, due to XML not requiring this ordering to be maintained by XML parsers. While there are some parsers that implement an ordering, it often is not included in a parser’s feature set.
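Because attribute order should not be relied upon, looking attributes up by name is usually the safer habit. The sketch below shows both forms of lookup on the Attributes interface: by raw name and by namespace URI plus local name. The book element, the http://example.com/meta namespace, and the class and method names are hypothetical, and a namespace-aware JAXP parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class AttributeLookupSketch {

    /** Hypothetical namespace used only for this illustration. */
    static final String META_NS = "http://example.com/meta";

    /** Looks up attributes by name instead of depending on their position. */
    static class LookupHandler extends DefaultHandler {
        final List<String> results = new ArrayList<>();

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            // By raw name: exactly as written in the document
            results.add("id=" + atts.getValue("id"));
            // By namespace URI and local name: independent of the prefix chosen
            results.add("lang=" + atts.getValue(META_NS, "lang"));
            // The raw name of a prefixed attribute includes its prefix
            results.add("m:lang=" + atts.getValue("m:lang"));
            // A missing attribute yields null rather than an exception
            results.add("missing=" + atts.getValue("missing"));
        }
    }

    /** Parses a single-element document and returns the lookup results. */
    static List<String> describe(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            LookupHandler handler = new LookupHandler();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "<book xmlns:m=\"" + META_NS + "\" id=\"b1\" m:lang=\"en\"/>"));
    }
}
```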

The closing half of an element callback is the endElement( ) method. This simple callback is fairly self-explanatory, and only the name of the element is sent to the callback, allowing that name to be matched with the appropriate element name passed earlier to a startElement( ) callback. The main purpose of this callback is to signify the close of an element, and let an application know that further characters are part of another scope, rather than the element now being closed. We make note of this in our example by printing out the name of an element when it is closed:

/** 
 * <p>
 * Indicates the end of an element 
 *   (<code>&lt;/[element name]&gt;</code>) is reached. Note that 
 *   the parser does not distinguish between empty
 *   elements and non-empty elements, so this will occur uniformly.
 * </p>
 *
 * @param namespaceURI <code>String</code> URI of namespace this 
 *                     element is associated with
 * @param localName <code>String</code> name of element without prefix
 * @param rawName <code>String</code> name of element in XML 1.0 form
 * @throws <code>SAXException</code> when things go wrong
 */
public void endElement(String namespaceURI, String localName, 
                       String rawName) 
    throws SAXException {

    System.out.println("endElement: " + localName + "\n");
}

Element Data

Once the beginning and end of an element block are identified and the element’s attributes are enumerated for an application, the next piece of important information is the actual data contained within the element itself. This generally consists of additional elements, textual data, or a combination of the two. When other elements appear, the callbacks for those elements are initiated, and a type of pseudo-recursion happens: elements nested within elements results in callbacks “nested” within callbacks. At some point, textual data will be encountered. This is typically the most important information to an XML client, as this data is usually either what is shown to the client or what is processed to generate a client response.

In XML, textual data within elements is sent to a wrapping application via the characters( ) callback. This method provides the wrapping application with an array of characters, as well as a starting index and a length that together identify the relevant textual data within that array:

/**
 * <p>
 * This will report character data (within an element).
 * </p>
 *
 * @param ch <code>char[]</code> character array with character data
 * @param start <code>int</code> index in array where data starts.
 * @param length <code>int</code> number of characters to read
 *               from the array, beginning at <code>start</code>.
 * @throws <code>SAXException</code> when things go wrong
 */
public void characters(char[] ch, int start, int length) 
    throws SAXException {

    // Only the slice from start to start + length belongs to this event
    String s = new String(ch, start, length);
    System.out.println("characters: " + s);
}

Seemingly a simple callback, this method often results in a significant amount of confusion because the SAX interface and standards do not strictly define how this callback must be used for lengthy pieces of character data. In other words, a parser may choose to return all contiguous character data in one invocation, or split this data up into multiple method invocations. For any given element, this method will be called not at all (if no character data is present within the element) or one or more times. Different parsers will implement this behavior differently, often using algorithms designed to increase parsing speed. You should never count on having all the textual data for an element within one callback method; conversely, you should never assume that multiple callbacks would result for one element’s contiguous character data.
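One common way to cope with this uncertainty is to accumulate character data in a buffer and only use it once the element's end tag is reported. The following is a minimal sketch of that approach, not part of our example class; the title element, the sample document, and the names TextBufferingSketch and TitleHandler are hypothetical, and a JAXP-supplied parser is assumed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TextBufferingSketch {

    /** Collects the complete text of each title element, however it is chunked. */
    static class TitleHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inTitle = false;
        final List<String> titles = new ArrayList<>();

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            if ("title".equals(rawName)) {
                buffer.setLength(0); // start with an empty buffer
                inTitle = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one run of text
            if (inTitle) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String namespaceURI, String localName,
                               String rawName) {
            if ("title".equals(rawName)) {
                titles.add(buffer.toString()); // text is complete only now
                inTitle = false;
            }
        }
    }

    /** Returns the text of every title element in document order. */
    static List<String> titles(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titles(
            "<books><title>Java &amp; XML</title><title>SAX</title></books>"));
    }
}
```

The result is the same whether the parser delivers a title's text in one callback or in several, which is the point of buffering.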

As you are writing your SAX event handlers, you should also be sure to keep your mind in a hierarchical mode. In other words, you should not get in the habit of thinking that any element owns its data and child elements, but only that it serves as a parent. Also keep in mind that the parser is moving along, handling elements, attributes, and data as it comes across them. This can make for some surprising results. Consider the following XML document fragment:

<parent>This is<child>embedded text</child>more text</parent>

Forgetting that SAX parses sequentially, making callbacks as it sees elements and data, and forgetting that the XML is viewed as hierarchical, you might make the assumption that the output here would be something like:

startElement: parent has no associated namespace
characters: This is more text
startElement: child has no associated namespace
characters: embedded text
endElement: child
endElement: parent

This would seem logical, as the parent element completely “owns” the child element, right? Wrong. What actually occurs is that a callback is made at each SAX event-point, resulting in the following event-firing chain:

startElement: parent has no associated namespace
characters: This is
startElement: child has no associated namespace
characters: embedded text
endElement: child
characters: more text
endElement: parent

SAX does not do any reading ahead, so the result here is exactly what you would expect if you viewed the XML document as sequential data, without all the human assumptions that we tend to make. This is an important point to remember.
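To see this ordering for yourself, a small trace handler can record each event as it fires. This is only a sketch under a few assumptions: the class name EventTraceSketch and its trace( ) helper are hypothetical, the parser is obtained through the JDK's JAXP SAXParserFactory, and adjacent characters( ) calls are merged before being recorded so that the trace does not depend on how a particular parser chooses to split text:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records SAX events in the order they are fired.
class EventTraceSketch extends DefaultHandler {

    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Record any pending character data as a single event
    private void flushText() {
        if (text.length() > 0) {
            events.add("characters: " + text);
            text.setLength(0);
        }
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flushText();
        events.add("startElement: " + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("endElement: " + qName);
    }

    // Assumed convenience method: parses an XML string and returns the trace
    static List<String> trace(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EventTraceSketch handler = new EventTraceSketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        for (String event : trace(
                "<parent>This is<child>embedded text</child>more text</parent>")) {
            System.out.println(event);
        }
    }
}
```

Run against the fragment above, the trace lists the text before the child element, then the child and its text, and only then the remaining text of the parent.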

Finally, whitespace is often reported by the characters( ) method. This introduces additional confusion, as another SAX callback, ignorableWhitespace( ), also reports whitespace. In our example, we are not validating our XML document; however, we may still be using a validating (capable) parser. This subtle detail is very important, as the way in which whitespace is reported depends on whether the parser being used is a validating one or not. Validating parsers report whitespace that appears in element content (between the children of an element whose content model allows only child elements) through the ignorableWhitespace( ) method, due to some validation issues we will address in the next two chapters. Non-validating parsers can report this whitespace either through the ignorableWhitespace( ) method or the characters( ) method. To know which behavior to expect, you will need to consult your parser’s documentation to determine whether you are using a validating parser. Remember, just because you are not requesting validation of your document does not mean that your parser is non-validating; a parser that is capable of validating, even if not actively doing so, is a validating parser.

To add to this confusion, many parsers are actually made up of dual parser implementations: one for validation and one for parsing without validation. At runtime, the correct class is loaded dynamically, as a non-validating parser often performs much better than a validating one, even if validation is not occurring, due to the extra data structures that must be implemented to allow validation to be used. This is exactly the case with the Apache Xerces parser; our example will utilize an instance of a non-validating parser, although if a DTD or schema was specified and validation was requested, a different parser class would be loaded and validation could occur.

The best way to avoid this confusion altogether is to not make any assumptions at all about whitespace. You should rarely, if ever, be using whitespace as data within your XML document. If you are forced to use whitespace, such as several spaces, non-space data, and then several more spaces, and the number of spaces in this data is relevant to an application, a CDATA section should be used. This ensures that your space-specific data will not be parsed at all; instead, it will be handed to the XML wrapper application as a large “chunk” of character data. Other than that special case, whitespace should be avoided as a data representation, and assumptions about which document callback will report whitespace should not be made.

Whitespace, Just the Whitespace

We have already addressed most of the issues with whitespace. We simply need to add this last SAX callback to our MyContentHandler class. The ignorableWhitespace( ) method takes parameters in the exact same format as the characters( ) method, and should use the starting index and length provided to read from the character array supplied:

/**
 * <p>
 * This will report whitespace that can be ignored in the 
 *   originating document. This is typically only invoked when
 *   validation is occurring in the parsing process.
 * </p>
 *
 * @param ch <code>char[]</code> character array with character data
 * @param start <code>int</code> index in array where data starts.
 * @param length <code>int</code> number of characters to read from the array.
 * @throws <code>SAXException</code> when things go wrong 
 */
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {

    // The third argument is a character count, not an ending index
    String s = new String(ch, start, length);

    System.out.println("ignorableWhitespace: [" + s + "]");
}

Of course, our sample will not print out any visible content, as the String created from the character array will be made up completely of whitespace, so we enclose the output within brackets. Whitespace is reported in the same manner as character data; it can be handled with one callback, or a SAX parser may break up the whitespace and report it over several method invocations. In either case, the precautions we have already discussed about not making assumptions or counting on whitespace as textual data should be closely adhered to in order to avoid troublesome bugs in your applications.
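One way to follow this advice is to write handlers that treat whitespace the same way no matter which callback delivers it. The sketch below simply tallies whitespace arriving by each route. The class name WhitespaceTallySketch and its tally( ) helper are hypothetical, and the JDK's JAXP SAXParserFactory is assumed. Which counter increases for a given document depends on the parser, so no particular split between the two should be expected:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts whitespace characters reported through
// characters() and through ignorableWhitespace(), without assuming
// which of the two a given parser will use.
class WhitespaceTallySketch extends DefaultHandler {

    private int viaCharacters;
    private int viaIgnorable;

    @Override
    public void characters(char[] ch, int start, int length) {
        // Count only the whitespace characters within this chunk of data
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                viaCharacters++;
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Everything reported here is whitespace by definition
        viaIgnorable += length;
    }

    int total() {
        return viaCharacters + viaIgnorable;
    }

    // Assumed convenience method: parses an XML string and returns the tally
    static WhitespaceTallySketch tally(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            WhitespaceTallySketch handler = new WhitespaceTallySketch();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        WhitespaceTallySketch w = tally("<a>\n  <b>x</b>\n</a>");
        System.out.println("via characters: " + w.viaCharacters);
        System.out.println("via ignorableWhitespace: " + w.viaIgnorable);
        System.out.println("total: " + w.total());
    }
}
```

Only the total is stable across parsers; application logic that depends on either individual counter is making exactly the kind of assumption warned against above.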

Skipped Entities

As you recall, we had one entity reference in our document, the OReillyCopyright entity. When parsed and resolved, this results in another file being loaded, either from the local filesystem or some other URI. However, we are not requesting that validation occur in our document. An often overlooked facet of non-validating parsers is that they are not required to resolve entity references, and instead may skip them. This has caused some headaches before, as parser results may simply not include entity references that were expected. SAX 2.0 nicely accounts for this with a callback that is issued when an entity is skipped by a non-validating parser. The callback gives the name of the entity, which we will include in our output (although Apache Xerces does not exhibit this behavior, your parser may):

/**
 * <p>
 * This will report an entity that is skipped by the parser. This
 *   should only occur for non-validating parsers, and then is still
 *   implementation-dependent behavior.
 * </p>
 *
 * @param name <code>String</code> name of entity being skipped
 * @throws <code>SAXException</code> when things go wrong 
 */
public void skippedEntity(String name) throws SAXException {
        System.out.println("Skipping entity " + name);
}

Before you go trying to recreate this behavior, you should note that most established parsers will not skip entities, even if they are not validating. Apache Xerces, for example, will never invoke this callback; instead, the entity reference will be expanded and the result will be included in the data available after parsing. In other words, this is there for parsers to use, but you will be hard-pressed to find a case where it crops up! If you do have a parser that exhibits this behavior, be aware that the parameter passed does not include the leading ampersand and trailing semicolon in the entity reference. For &OReillyCopyright;, only the name of the entity, OReillyCopyright, is passed to skippedEntity().

The Results

Finally, we need to register our handler with the XMLReader we have instantiated. This is done with setContentHandler( ), which takes a ContentHandler implementation as its single argument. Add the following lines to the performDemo( ) method of your parser example program:

/**
 * <p>
 * This parses the file, using registered SAX handlers, and outputs 
 *   the events in the parsing process cycle.
 * </p>
 *
 * @param uri <code>String</code> URI of file to parse.
 */
public void performDemo(String uri) {
    System.out.println("Parsing XML File: " + uri + "\n\n");

    // Get instances of our handlers
    ContentHandler contentHandler = new MyContentHandler(  );
    
    try {
        // Instantiate a parser
        XMLReader parser = 
            new SAXParser(  );

        // Register the content handler
        parser.setContentHandler(contentHandler);
            
        // Parse the document
        parser.parse(uri);
        
    } catch (IOException e) {
        System.out.println("Error reading URI: " + e.getMessage(  ));
    } catch (SAXException e) {
        System.out.println("Error in parsing: " + e.getMessage(  ));
    }
}

If you have entered in all of the document callbacks as we have gone along, you should be able to compile the MyContentHandler class and the enclosing SAXParserDemo file. Once done, you may run the SAX parser demonstration on our XML sample file created earlier. The complete Java command should read:

D:\prod\JavaXML> java SAXParserDemo D:\prod\JavaXML\contents\contents.xml

This should result in fairly long and verbose output. If you are on a Windows machine, you may need to increase the buffer size of your DOS window so you can scroll and view the complete command output. The output should look similar to that in Example 3-2.[2]

Example 3-2. SAXParserDemo Output

D:\prod\JavaXML>java SAXParserDemo D:\prod\JavaXML\contents.xml
Parsing XML File: D:\prod\JavaXML\contents.xml


    * setDocumentLocator(  ) called
Parsing begins...
Mapping starts for prefix JavaXML mapped to URI 
    http://www.oreilly.com/catalog/javaxml/
startElement: Book in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Book)
characters:

startElement: Title in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Title)
characters: Java and XML
endElement: Title

characters:

startElement: Contents in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Contents)
characters:


startElement: Chapter in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Chapter)
  Attribute: focus=XML
characters:

startElement: Heading in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Heading)
characters: Introduction
endElement: Heading

characters:

startElement: Topic in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Topic)
  Attribute: subSections=7
characters: What Is It?
endElement: Topic

characters:

startElement: Topic in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Topic)
  Attribute: subSections=3
characters: How Do I Use It?
endElement: Topic

characters:

startElement: Topic in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Topic)
  Attribute: subSections=4
characters: Why Should I Use It?
endElement: Topic

characters:

startElement: Topic in namespace 
    http://www.oreilly.com/catalog/javaxml/ (JavaXML:Topic)
  Attribute: subSections=0
characters: What's Next?
endElement: Topic
...

This output should go on quite a while, as the XML document being parsed has a number of elements within it. You can clearly see exactly how the parser sequentially handles each element, the element’s attributes, any data within the element, nested elements, and the element’s end tag. This process repeats for each element within the document. In our example, a non-validating instance of the Xerces parser was used (remember our rather confusing discussion on this?), so whitespace is being reported with the characters( ) callback; in the next two chapters we will discuss validation and see how this reporting changes.

You have now seen how a SAX-compliant parser handles a well-formed XML document. You should also be getting an understanding of the document callbacks that occur within the parsing process and how an application can use these callbacks to get information about an XML document as it is parsed. In the next two chapters, we will spend time looking at validating an XML document by using additional SAX classes designed for handling DTDs. Before moving on, though, we want to address the issue of what happens when your XML document is not valid, and the errors that can result from this condition.



[2] In this and other output examples, note that carriage returns may have been inserted to ensure that the output is formatted correctly on the printed page. As long as the actual content is the same, you have got everything working correctly!
