Which painting in the National Gallery would I save if there was a fire? The one nearest the door of course.
--George Bernard Shaw
If you share Mr. Shaw’s propensity for practicality, you will probably find this lesson quite interesting. Through most of this book, the discussion on XML has focused on its use as a structured document format. However, XML is also often used as a format for data storage. Unlike proprietary file formats, XML documents follow consistent structural rules and can be tested not only for well-formedness but also for validity against a schema or DTD. There are a variety of tools that allow you to interact with the structure of XML documents. The previous hour focused on one of these technologies, the W3C’s Document Object Model (DOM). This hour tackles another such technology, SAX (Simple API for XML), an API that can be used to analyze and extract data from an XML document.
In this hour, you’ll learn
What SAX is and how it works
How to get a SAX parser for your favorite programming language
How to write a Java program that uses a SAX parser to process an XML file
SAX is a programming interface for event-based parsing of XML files. In practical terms, this means that SAX takes a very different approach to parsing XML code than its counterpart, the DOM. If you recall from previous hours, XML documents are processed using parsers. The parser reads the XML document; verifies that it is well formed; and, if it’s a validating parser, validates it against a schema or DTD. What happens next depends on the parser you’re using. In some cases, it might copy the data into a data structure that’s native to the programming language you’re using. In other cases, it might transform the data into a presentation format or apply styles to it. The SAX parser doesn’t do anything to the data other than trigger certain events. It’s up to the user of the SAX parser to determine what happens when those events occur.
What I mean when I say that SAX is a programming interface is that it isn’t a program, it’s a document—a standard—that describes how a SAX parser should be written. It explains which events must be supported in a compliant SAX parser and leaves it up to the implementers to make sure that the parsers they write comply.
An interface is basically a contract offered by someone who writes a program or specifies how a program should work. It says that as long as you implement all of the features specified in the interface, any programs written to use that interface will work as expected. When someone writes a parser that implements the SAX interface, it means that any program that supports all of the events specified in the SAX interface can use that parser.
Most of the time when you’re dealing with XML, one standards body or another developed the various technologies. With SAX, that isn’t the case. SAX was developed by members of the xml-dev mailing list in order to provide XML developers with a way to deal with XML documents in a simple and straightforward manner. One of the lead developers on this mailing list was Dave Megginson, whose name often comes up in discussions related to SAX, and who has resumed maintaining SAX after a hiatus. You can find out more about SAX at http://www.saxproject.org/.
The original version of SAX, 1.0, was released in May 1998. The most recent version is SAX 2.0.2, which was released in April 2004. The SAX API was originally defined as a set of Java interfaces, but you can write a SAX parser in any language, and indeed there are SAX parsers available for most popular programming languages. I’m going to talk about the features that were made available in the Java version; you can generally expect them to be available in whatever implementation you choose to use. Let’s look at the specifics of the two major releases, SAX 1.0 and SAX 2.0.
SAX 1.0 provides support for triggering events on all of the standard content in an XML document. Rather than telling you everything it does support, it’s easier to tell you that SAX 1.0 does not support namespaces. A program that uses a SAX 1.0 parser must support the following methods, which are automatically invoked when events occur during the parsing of a document:
characters()
— Triggered when character data is found inside an element
endDocument()
— Triggered when parsing of the document is complete
endElement()
— Triggered when the closing tag for any element is encountered
ignorableWhitespace()
— Triggered when whitespace is encountered between elements
processingInstruction()
— Triggered when a processing instruction is encountered in the document
startDocument()
— Triggered when parsing of the document begins
startElement()
— Triggered when the opening tag for an element is encountered
If you don’t have a programming background, allow me to clarify that a method is a sequence of programming code that performs a certain task. Methods are very similar to functions in programming languages other than Java.
SAX 1.0 also handles attributes of elements by providing them through its interface when the startElement() method of the document handler is called. SAX 1.0 has been deprecated now that SAX 2.0 has been implemented. In the Java world, most SAX 2.0 libraries (such as Xerces) still support SAX 1.0 so that they’ll work with legacy SAX 1.0 applications. But if you’re writing a new application that uses SAX, you should use SAX 2.0.
SAX 2.0 is an extension of SAX 1.0 that provides support for namespaces. As such, programs that communicate with a SAX 2.0 parser must support the following methods in addition to those listed for SAX 1.0:
startPrefixMapping()
— Triggered when a prefix mapping (the association of a namespace prefix with a namespace URI) is encountered
endPrefixMapping()
— Triggered when a prefix mapping is closed
skippedEntity()
— Triggered whenever an entity is skipped for any number of reasons
Unless you really develop an interest in XML parsing, chances are you won’t be writing a SAX parser. Rather, you’ll be writing a program that interacts with a SAX parser. Writing a program that works with a SAX parser is in some ways similar to writing a program with a graphical user interface (GUI), such as a traditional application for Windows or Macintosh. When you write a GUI program, the GUI library turns actions that the user takes into events that are returned to you by the library. Your job as a programmer is then to write event handlers that respond to incoming events. For example, with JavaScript, certain elements on a web page can generate events that can be handled by JavaScript. Links generate onClick and onMouseOver events. There are also documentwide events in JavaScript, such as onLoad.
In regard to event handling, SAX works the same way conceptually as JavaScript. When a SAX parser parses the data in an XML document, it fires events based on the data that it is currently parsing. All of the methods listed previously that are associated with SAX are called by the parser when the associated event occurs. It’s up to the application programmer to decide what action to take when those events are caught.
For example, you might want to print out just the contents of all of the title elements in a document, or you might want to construct a complex data structure based on all of the information you find in the document. The SAX parser doesn’t care; it just provides you with all of the data in the document in a linear manner so that you can do whatever you like with it.
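To make the title example concrete, here is a minimal sketch of a handler that collects the text of every title element. It is not the chapter's sample program: the TitleCollector class and its collect() method are hypothetical names, and the sketch assumes the SAX parser that ships with the JDK (obtained through javax.xml.parsers.SAXParserFactory) rather than a separately downloaded one. It extends DefaultHandler, a SAX 2.0 helper class that supplies empty versions of the event methods so that only the interesting ones need to be written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the text of every <title> element.
class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && current != null) {
            titles.add(current.toString().trim());
            current = null;
        }
    }

    // Parses an XML string and returns the title text in document order.
    static List<String> collect(String xml) {
        TitleCollector handler = new TitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        String xml = "<books><book><title>First</title><price>10</price></book>"
                   + "<book><title>Second</title></book></books>";
        for (String title : collect(xml)) {
            System.out.println(title);
        }
    }
}
```

Everything other than the title elements still flows past the handler as events; the handler simply chooses not to react to it.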
You might be asking yourself at this point why you would ever care to parse an XML document at such a low level. In other words, why would you ever want to print out just the contents of the title elements in a document? The main answer to this question has to do with data maintenance and integrity. As you continue to build and maintain larger and larger XML documents, you may find that you need to extract and study portions of the documents to find editorial errors or any other inconsistencies that are difficult to find when viewing raw XML code. A custom application built around a SAX parser can be used to drill down into an XML document and spit out any subset of the data that you want.
If you want to write an application that uses SAX, the first thing you have to do is obtain a SAX parser. There are several SAX parsers available, and it’s ultimately up to your own specific development needs as to which parser you should use. Furthermore, you’ll need to look at the documentation for the parser that you choose in order to figure out how to integrate the parser with your applications. Following are several of the more popular SAX parsers you might want to consider using:
Xerces
libxml
Python SAX API
The next few sections provide more information about these SAX parsers, along with how to download and install them.
Xerces is the XML parser from the Apache Software Foundation. It’s used as part of several other Apache XML and Java-related projects and can be used by itself as well. In addition to supporting SAX, it also supports DOM Level 2, which you learned about in the previous hour, as well as XML Schema validation.
You can obtain Xerces, along with lots of other open source XML-related software, at http://xml.apache.org/. Xerces is completely free as it is open source software released under the Apache Software License.
The Xerces library is available in both .tar.gz and .zip formats; download the one that’s appropriate for your platform. Included in the package are xercesImpl.jar and xml-apis.jar, which contain the compiled class files for the Xerces library itself, and xercesSamples.jar, which contains compiled versions of the sample programs that come with Xerces. The package also includes documentation, source code for the sample programs, and some sample data files.
A .JAR file is a lot like a .ZIP file except that it is typically used to package compressed Java programs for distribution; JAR stands for Java ARchive.
In order to use the Xerces library, you just need to include the two aforementioned .JAR files (xercesImpl.jar and xml-apis.jar) in your class path when compiling and running programs that use it.
libxml (published on CPAN as libxml-perl, and not to be confused with the libxml2 C library) is a package of Perl modules that contains a number of XML processing libraries. One of these is XML::Parser::PerlSAX. The easiest way to install it is to download it from CPAN (http://www.cpan.org/) and follow the instructions to install it on your local system. The methods provided by the PerlSAX module are basically identical to those in the Java version of SAX; both implement the same interface in ways appropriate to Perl and Java, respectively.
The sample program in this chapter is written in Java and uses the Xerces SAX parser, which I mentioned earlier. If you’re a Java programmer, I’m sure you’re perfectly happy with this state of affairs. If you have no interest in Java, much of the remainder of this lesson probably won’t be to your liking. However, the purpose of this chapter is to explain how SAX works, and while there are SAX parsers available for many languages, it started out in the Java world. And even if you have no interest in digesting the upcoming Java code, you can still experiment with the sample Java program, running it on your XML documents and analyzing the results. The syntax for this program is relatively simple, and I’ve commented the code to make it as clear as possible.
Even if you don’t care about Java programming, you may still want to see the output of the sample program on your own computer. To run the program, you’ll need Sun’s Java Development Kit (JDK) and the Xerces library mentioned previously. I already explained how to download and install Xerces; to get the JDK, just go to http://java.sun.com/j2se/.
You’ll need to download the J2SE (Java 2 Standard Edition) SDK and install it. Once it’s installed, you can run the sample program. Just put the sample program’s .java source code file in the directory where you put xercesImpl.jar and xml-apis.jar (you can put it anywhere you like, but this route is probably easiest), open a command prompt in that directory, and type the following command. The semicolons that separate the class path entries are the Windows convention; on Linux, Mac OS X, and other UNIX-like systems, use colons instead.
javac -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter.java
If the source file lives somewhere else, you can instead copy the xercesImpl.jar and xml-apis.jar files to the same location as the sample program and then compile and run the program from there. The main point is that the compiler and the Java runtime need to be able to find both .JAR files.
If your copy of the code for DocumentPrinter.java is correct and xercesImpl.jar and xml-apis.jar are really in the current folder, the DocumentPrinter class will be compiled and a file called DocumentPrinter.class will result. To run the program, use the following command:
java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter file.xml
You should replace file.xml with the name of the XML file that you want to process. As an example, here’s how you would initiate the Document Printer sample program using the vehicles XML file from Hour 12, “Transforming XML with XSLT”:
java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter vehicles.xml
Listing 17.1 contains a partial listing of the output that results from running the DocumentPrinter SAX sample program on the vehicles.xml document.
Listing 17.1. The Document Printer Sample Program Uses a SAX Parser to Display Detailed Information About the vehicles.xml Document
    1: Start document.
    2: Received processing instruction:
    3: Target: xml-stylesheet
    4: Data: href="vehicles.xsl" type="text/xsl"
    5: Start element: vehicles
    6: Start element: vehicle
    7: Start element: mileage
    8: Received characters: 13495
    9: End of element: mileage
    10: Start element: color
    11: Received characters: green
    12: End of element: color
    13: Start element: price
    14: Received characters: 33900
    15: End of element: price
    16: End of element: vehicle
    17: ...
    18: Start element: vehicle
    19: Start element: mileage
    20: Received characters: 48405
    21: End of element: mileage
    22: Start element: color
    23: Received characters: gold
    24: End of element: color
    25: Start element: price
    26: Received characters: 22995
    27: End of element: price
    28: End of element: vehicle
    29: End of element: vehicles
    30: End of document reached.
Just to refresh your memory, following is a brief code excerpt from the vehicles.xml document:
    <vehicle year="2004" make="Acura" model="3.2TL">
      <mileage>13495</mileage>
      <color>green</color>
      <price>33900</price>
    </vehicle>
In fact, this piece of code is for the first vehicle in the document, which matches up with the code on lines 6 through 16 in Listing 17.1. If you carefully compare the XML code with the listing, you’ll notice how the program parsed and output information about every element in the document. This is the kind of detailed control you have at your disposal when using a tool such as a SAX parser.
Let’s look at how the program you just saw uses a SAX parser to parse an XML document. The program just prints out messages that explain what it’s doing at each step while parsing, along with the associated data from the XML document. You could easily replace this code with code that performs more useful tasks, such as performing a calculation or otherwise transforming the data, but because the purpose of this program is just to illustrate how the SAX parser works, the diagnostic messages are fine.
Because you already know the scoop on SAX, Java, and the Xerces SAX parser for Java, let’s go ahead and jump right into the program code. Here are the first 12 lines of Java code:
    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.XMLReader;

    public class DocumentPrinter implements ContentHandler, ErrorHandler {
        // A constant containing the name of the SAX parser to use.
        private static final String PARSER_NAME =
            "org.apache.xerces.parsers.SAXParser";
This code imports classes that will be used later on and declares the class (program) that you’re currently writing. The import statements indicate which classes will be used by this program. In this case, all of the classes that will be used are from the org.xml.sax package and are included in the xercesImpl.jar and xml-apis.jar archives.
This class, called DocumentPrinter, implements two interfaces, ContentHandler and ErrorHandler. These two interfaces are part of the standard SAX 2.0 package and are included in the import list. A program that implements ContentHandler is set up to handle events passed back in the normal course of parsing an XML document, and a program that implements ErrorHandler can handle any error events generated during SAX parsing.
In the Java world, an interface is a framework that specifies a list of methods that must be defined in a class. An interface is useful because it guarantees that any class that implements it meets the requirements of that interface. If you fail to include all of the methods required by the interface, your program will not compile. Because this program implements ContentHandler and ErrorHandler, the parser can be certain that it is capable of handling all of the events it triggers as it parses a document.
After the class has been declared, a single member variable is created for the class, PARSER_NAME. This variable is a constant that contains the name of the class that you’re going to use as the SAX parser. As you learned earlier, there are any number of SAX parsers available. The Xerces parser just so happens to be one of the better Java SAX parsers out there, which explains the parser name of org.apache.xerces.parsers.SAXParser.
Although Xerces is certainly a popular Java-based SAX parser given its relatively long history, it has some serious competition from Sun, the makers of Java. The latest version of Java (J2SE 5.0) includes an XML API called JAXP (Java API for XML Processing) that gives Java programs access to a built-in XML parser, including support for SAX. To learn more about JAXP, visit http://java.sun.com/xml/jaxp/.
Every command-line Java application begins its life with the main() method. In the Java world, the main method indicates that a class is a standalone program, as opposed to one that just provides functionality used by other classes. Perhaps more importantly, it’s the method that gets run when you start the program. The purpose of this method is to set up the parser and get the name of the document to be parsed from the arguments passed in to the program. Here’s the code:
    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No XML document path specified.");
            System.exit(1);
        }

        DocumentPrinter dp = new DocumentPrinter();
        XMLReader parser;
        try {
            parser = (XMLReader)Class.forName(PARSER_NAME).newInstance();
            parser.setContentHandler(dp);
            parser.setErrorHandler(dp);
            parser.parse(args[0]);
        }
        // Normally it's a bad idea to catch generic exceptions like this.
        catch (Exception ex) {
            System.out.println(ex.getMessage());
            ex.printStackTrace();
        }
    }
This program expects that the user will specify the path to an XML document as its only command-line argument. If no such argument is submitted, the program prints a message saying that no XML document path was specified and then exits.
Next, the program creates an instance of the DocumentPrinter object and assigns it to the variable dp. You’ll need this object later when you tell the parser which ContentHandler and ErrorHandler to use. After instantiating dp, a try...catch block is opened to house the parsing code. This is necessary because some of the methods called to carry out the parsing can throw exceptions that must be caught within the program. All of the real work in the program takes place inside the try block.
The try...catch block is the standard way in which Java handles errors that crop up during the execution of a program. It enables the program to compensate and work around those errors if the programmer chooses to do so. In this case, you simply print out information about the error and allow the program to exit gracefully.
Within the try...catch block, the first order of business is creating a parser object. The class named in the PARSER_NAME variable is loaded, an instance of it is created, and that instance is assigned to the variable parser. Because SAX 2.0 parsers must implement XMLReader, you can refer to the object through that interface rather than through its own class name, SAXParser. The fact that you’re using it through the XMLReader interface means that you can call only those methods included in that interface. For this application, that’s fine.
After the parser has been created, you can start setting its properties. Before actually parsing the document, however, you have to specify the content and error handlers that the parser will use. Because the DocumentPrinter class can play both of those roles, you simply set both of those properties to dp (the DocumentPrinter object you just created). At this point, all you have to do is call the parse() method, passing it the URI supplied on the command line, which is exactly what the code does.
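The same setup steps (get an XMLReader, register the handlers, call parse()) can also be sketched without naming a parser class at all. The following is an illustration under stated assumptions rather than a replacement for the sample program: it asks JAXP's SAXParserFactory for whatever SAX parser the Java runtime provides, and the ReaderSetupSketch and ElementCounter names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ReaderSetupSketch {
    // Hypothetical handler that does nothing but count start tags.
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            count++;
        }
    }

    // Returns the number of elements in the supplied XML text.
    static int countElements(String xml) {
        try {
            // Step 1: obtain an XMLReader from the runtime's default parser.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            // Step 2: register one object as both content and error handler.
            ElementCounter counter = new ElementCounter();
            reader.setContentHandler(counter);
            reader.setErrorHandler(counter);

            // Step 3: parse; events are delivered to the handler as they occur.
            reader.parse(new InputSource(new StringReader(xml)));
            return counter.count;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countElements("<a><b/><c><d/></c></a>"));
    }
}
```

The design trade-off is that the parser implementation is chosen by the runtime instead of by a class name stored in a constant, which removes the class path requirement but also gives up explicit control over which parser is used.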
The skeleton for the program is now in place. The rest of the program consists of methods that fulfill the requirements of the ContentHandler and ErrorHandler interfaces. More specifically, these methods respond to events that are triggered during the parsing of an XML document. In this program, the methods just print out the content that they receive.
The first of these methods is the characters() method, which is called whenever character content is parsed in a document. Following is the code for this method:
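    public void characters(char[] ch, int start, int length) {
        // Copy only the relevant slice of the parser's character array.
        String chars = "";
        for (int i = start; i < start + length; i++)
            chars = chars + ch[i];

        // Skip content that consists of nothing but whitespace.
        if ((chars.trim()).length() > 0)
            System.out.println("Received characters: " + chars);
    }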
The characters() method receives content found within elements. It accepts three arguments: an array of characters, the position in the array where the content starts, and the amount of content received. In this method, a for loop is used to extract the content from the array, starting at the position in the array where the content starts, and iterating over each element until the position of the last element is reached. When all of the characters are gathered, the code checks to make sure they aren’t just empty spaces, and then prints the results if not.
It’s important not to just process all of the characters in the array of characters passed in unless that truly is your intent. The array can contain lots of padding on both sides of the relevant content, and including it all will result in a lot of extra characters along with the content that you actually want. On the other hand, if you know that the code contains parsed character data (PCDATA) that you want to read verbatim, then by all means process all of the characters.
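Here is a small sketch of that slicing rule in isolation. The CharacterChunks class is a hypothetical name, and the JDK's built-in parser is assumed. Each call to characters() is recorded as its own string built from exactly the start and length supplied. One related point worth knowing is that SAX allows a parser to deliver the text of a single element in several calls, so a handler that needs the whole text should join the pieces rather than assume one call per element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every characters() call as a separate chunk.
class CharacterChunks extends DefaultHandler {
    private final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Use only ch[start] .. ch[start + length - 1]; the rest of the
        // array is the parser's working buffer, not part of this event.
        chunks.add(new String(ch, start, length));
    }

    // Returns the chunks in the order the parser reported them.
    static List<String> chunksOf(String xml) {
        CharacterChunks handler = new CharacterChunks();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = chunksOf("<p>fish &amp; chips</p>");
        // How many chunks arrive is up to the parser; the joined text is not.
        System.out.println(chunks.size() + " chunk(s): " + String.join("", chunks));
    }
}
```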
The next two methods, startDocument() and endDocument(), are called when the beginning and end of the document are encountered, respectively. They accept no arguments and are called only once each during document parsing, for obvious reasons. Here’s the code for these methods:
    public void startDocument() {
        System.out.println("Start document.");
    }

    public void endDocument() {
        System.out.println("End of document reached.");
    }
Next let’s look at the startElement() and endElement() methods, which accept the most complex set of arguments of any of the methods that make up a ContentHandler:
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) {
        System.out.println("Start element: " + localName);
    }

    public void endElement(String namespaceURI, String localName,
            String qName) {
        System.out.println("End of element: " + localName);
    }
The startElement() method accepts four arguments from the parser. The first is the namespace URI, which you’ll see elsewhere as well. The namespace URI is the URI for the namespace associated with the element. If a namespace is used in the document, the URI for the namespace is provided in a namespace declaration. The local name is the name of the element without the namespace prefix. The qualified name is the name of the element including the namespace prefix if there is one. Finally, the attributes are provided as an instance of the Attributes object. The endElement() method accepts the same first three arguments but not the final attributes argument.
SAX parsers must have namespace processing turned on in order to populate all of these arguments. If that option is deactivated, any of the arguments (other than the attributes) may be populated with empty strings. The method for turning on namespace processing varies depending on which parser you use.
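As one illustration of the difference, the sketch below parses the same one-element document twice, once with namespace processing on and once with it off, and records the three name arguments each time. It assumes the JDK's built-in parser, where the switch is SAXParserFactory.setNamespaceAware(); SAX 2.0 itself also defines a standard feature name for the same purpose, http://xml.org/sax/features/namespaces. The NamespaceNames class and the urn:example:vehicles namespace are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the three name arguments of each start tag.
class NamespaceNames extends DefaultHandler {
    private final List<String[]> seen = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) {
        seen.add(new String[] { namespaceURI, localName, qName });
    }

    // Returns one {namespaceURI, localName, qName} triple per element.
    static List<String[]> namesIn(String xml, boolean namespaceAware) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // the JAXP switch
        NamespaceNames handler = new NamespaceNames();
        try {
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        String xml = "<v:car xmlns:v='urn:example:vehicles'/>";
        for (boolean aware : new boolean[] { true, false }) {
            String[] names = namesIn(xml, aware).get(0);
            System.out.println("namespace-aware=" + aware
                    + " uri=[" + names[0] + "] local=[" + names[1]
                    + "] qName=[" + names[2] + "]");
        }
    }
}
```

A handler that has to work in both modes is usually safest keying off the qualified name when namespace processing is off and off the URI and local name when it is on.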
Let’s look at attribute processing specifically. Attributes are supplied to the startElement() method as an instance of the Attributes object. The startElement() method shown above prints only the element name, but three methods of the Attributes object are all you need to print each attribute as well: getLength(), getLocalName(), and getValue(). The getLength() method returns the number of attributes, which gives you the bound for a loop over them, while getLocalName() and getValue() accept the index of the attribute being retrieved as an argument. In case you’re curious, the methods of the Attributes object appear in Table 17.1.
Table 17.1. Methods of the Attributes Object
Method | Purpose |
---|---|
getIndex(String qName) | Retrieves an attribute’s index using its qualified name |
getIndex(String uri, String localName) | Retrieves an attribute’s index using its namespace URI and the local portion of its name |
getLength() | Returns the number of attributes in the element |
getLocalName(int index) | Returns the local name of the attribute associated with the index |
getQName(int index) | Returns the qualified name of the attribute associated with the index |
getType(int index) | Returns the type of the attribute with the supplied index |
getType(String uri, String localName) | Looks up the type of the attribute with the namespace URI and name specified |
getURI(int index) | Looks up the namespace URI of the attribute with the index specified |
getValue(int index) | Looks up the value of the attribute using the index |
getValue(String qName) | Looks up the value of the attribute using the qualified name |
getValue(String uri, String localName) | Looks up the value of the attribute using the namespace URI and local name |
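A loop over an element's attributes might look like the following sketch. The AttributeDump class is hypothetical, the JDK's built-in parser is assumed, and namespace processing is switched on so that getLocalName() returns a usable name. The results go into a sorted map because SAX makes no promise about the order in which attributes are reported.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers attribute names and values from start tags.
class AttributeDump extends DefaultHandler {
    private final Map<String, String> found = new TreeMap<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) {
        // getLength() bounds the loop; the other two methods take the index.
        for (int i = 0; i < atts.getLength(); i++) {
            found.put(atts.getLocalName(i), atts.getValue(i));
        }
    }

    // Returns every attribute in the document, keyed by local name.
    // (Attributes with the same name on different elements overwrite each
    // other, which is acceptable for a one-element illustration.)
    static Map<String, String> attributesOf(String xml) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getLocalName()
        AttributeDump handler = new AttributeDump();
        try {
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.found;
    }

    public static void main(String[] args) {
        String xml = "<vehicle year=\"2004\" make=\"Acura\" model=\"3.2TL\"/>";
        for (Map.Entry<String, String> entry : attributesOf(xml).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```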
Getting back to the endElement() method, its operation is basically the same as that of startElement() except that it doesn’t accept the attributes of the element as an argument.
The next two methods, startPrefixMapping() and endPrefixMapping(), have to do with prefix mappings for namespaces:
    public void startPrefixMapping(String prefix, String uri) {
        System.out.println("Prefix mapping: " + prefix);
        System.out.println("URI: " + uri);
    }

    public void endPrefixMapping(String prefix) {
        System.out.println("End of prefix mapping: " + prefix);
    }
These methods are used to report the beginning and end of namespace prefix mappings when they are encountered in a document.
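To see these events in action, here is a short sketch that records each prefix and the URI it is mapped to. The PrefixMappings class and the example namespace URIs are hypothetical, and the JDK's built-in parser is assumed with namespace processing switched on, since prefix mappings are only reported in that mode. Note that a default namespace declaration (xmlns="...") is reported with an empty string as its prefix.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers every prefix-to-URI mapping it is told about.
class PrefixMappings extends DefaultHandler {
    private final Map<String, String> mappings = new LinkedHashMap<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        mappings.put(prefix, uri);
    }

    // Returns the mappings declared anywhere in the document.
    static Map<String, String> mappingsIn(String xml) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // prefix events need namespace processing
        PrefixMappings handler = new PrefixMappings();
        try {
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.mappings;
    }

    public static void main(String[] args) {
        String xml = "<v:car xmlns:v='urn:example:vehicles' xmlns='urn:example:default'/>";
        for (Map.Entry<String, String> m : mappingsIn(xml).entrySet()) {
            System.out.println("[" + m.getKey() + "] -> " + m.getValue());
        }
    }
}
```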
The next method, ignorableWhitespace(), is similar to characters(), except that it reports whitespace in element content that can be ignored.
    public void ignorableWhitespace(char[] ch, int start, int length) {
        System.out.println("Received whitespace.");
    }
Next on the method agenda is processingInstruction(), which reports processing instructions to the content handler. For example, a stylesheet can be associated with an XML document using the following processing instruction:
<?xml-stylesheet href="mystyle.css" type="text/css"?>
The method that handles such instructions is
    public void processingInstruction(String target, String data) {
        System.out.println("Received processing instruction:");
        System.out.println("Target: " + target);
        System.out.println("Data: " + data);
    }
The last method examined here is setDocumentLocator(). (The ContentHandler interface also requires skippedEntity(), listed earlier, which must at least be defined with an empty body for the class to compile.) A parser that supplies a locator calls setDocumentLocator() once, before it reports any other event, to hand the content handler a Locator object. Nothing is output by this method in this program, but I’ll explain what its purpose is anyway. The Locator object contains information about where in the document the event currently being reported is located, so a handler that saves a reference to it can ask for the current position from inside any other event method. Here’s the “do nothing” source code for the method:
public void setDocumentLocator(Locator locator) { }
The methods of a Locator object are described in Table 17.2.
Table 17.2. The Methods of a Locator Object
Method | Purpose |
---|---|
getColumnNumber() | Returns the column number of the current position in the document being parsed |
getLineNumber() | Returns the line number of the current position in the document being parsed |
getPublicId() | Returns the public identifier of the current document event |
getSystemId() | Returns the system identifier of the current document event |
Because the sample program doesn’t concern itself with the specifics of locators, none of these methods are actually used. However, it’s good for you to know about them in case you need to develop a program that somehow is interested in locators.
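If you do end up needing locators, the usual pattern is to save the Locator when setDocumentLocator() is called and then query it from the other event methods. The sketch below does that to note the line on which each element's start tag was reported. The LineTracker class is a hypothetical name and the JDK's built-in parser is assumed; a parser is not obliged to supply a locator, so the code checks for null before using it.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps each element name to the line of its start tag.
class LineTracker extends DefaultHandler {
    private Locator locator; // stays null if the parser never supplies one
    private final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Keep the reference; the parser updates it as parsing proceeds.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            lines.put(qName, locator.getLineNumber());
        }
    }

    // Returns element name to line number, in document order.
    static Map<String, Integer> linesOf(String xml) {
        LineTracker handler = new LineTracker();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        String xml = "<vehicles>\n<vehicle/>\n</vehicles>";
        for (Map.Entry<String, Integer> entry : linesOf(xml).entrySet()) {
            System.out.println(entry.getKey() + " reported on line " + entry.getValue());
        }
    }
}
```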
I mentioned earlier that the DocumentPrinter class implements two interfaces, ContentHandler and ErrorHandler. Let’s look at the methods that are used to implement the ErrorHandler interface. There are three types of errors that a SAX parser can generate: errors, fatal errors, and warnings. Classes that implement the ErrorHandler interface must provide methods to handle all three types of errors. Here’s the source code for the three methods:
    public void error(SAXParseException exception) { }

    public void fatalError(SAXParseException exception) { }

    public void warning(SAXParseException exception) { }
As you can see, each of the three methods accepts the same argument, a SAXParseException object. The only difference between them is that they are called under different circumstances. To keep things simple, the sample program doesn’t output any error notifications. For the sake of completeness, the methods that SAXParseException provides for locating a problem appear in Table 17.3.
Table 17.3. Methods of the SAXParseException Class
Method | Purpose |
---|---|
getColumnNumber() | Returns the column number of the current position in the document being parsed |
getLineNumber() | Returns the line number of the current position in the document being parsed |
getPublicId() | Returns the public identifier of the current document event |
getSystemId() | Returns the system identifier of the current document event |
Similar to the Locator methods, these methods aren’t used in the Document Printer sample program, so you don’t have to worry about the ins and outs of how they work.
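For a program that does want to report problems, the three methods could be filled in along the lines of this sketch. It is one possible policy, not the chapter's code: warnings and errors are recorded with their position, and a fatal error is recorded and then rethrown to stop parsing. The ReportingErrors class and its check() method are hypothetical, and the JDK's built-in parser is assumed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: turns parser problems into readable report lines.
class ReportingErrors extends DefaultHandler {
    private final List<String> reports = new ArrayList<>();

    private String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber()
                + ", column " + e.getColumnNumber() + ": " + e.getMessage();
    }

    @Override
    public void warning(SAXParseException e) {
        reports.add(describe("WARNING", e));
    }

    @Override
    public void error(SAXParseException e) {
        reports.add(describe("ERROR", e));
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        reports.add(describe("FATAL", e));
        throw e; // the document is not well formed; give up on it
    }

    // Parses the text and returns whatever problems were reported.
    static List<String> check(String xml) {
        ReportingErrors handler = new ReportingErrors();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do here.
        } catch (Exception e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
        return handler.reports;
    }

    public static void main(String[] args) {
        // The <b> element is never closed, so this is not well formed.
        for (String report : check("<a><b></a>")) {
            System.out.println(report);
        }
    }
}
```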
Now that you understand how the code works in the Document Printer sample program, let’s take it for a test drive one more time. This time around, you’re running the program to parse the condos.xml sample document from the previous hour. Here’s an excerpt from that document in case it’s already gotten a bit fuzzy in your memory:
    <proj status="active">
      <location lat="36.122238" long="-86.845028" />
      <description>
        <name>Woodmont Close</name>
        <address>131 Woodmont Blvd.</address>
        <address2>Nashville, TN 37205</address2>
        <img>condowc.jpg</img>
      </description>
    </proj>
And here’s the command required to run this document through the Document Printer program:
java -classpath xercesImpl.jar;xml-apis.jar;. DocumentPrinter condos.xml
Finally, Listing 17.2 contains the output of the Document Printer program after feeding it the condominium map data stored in the condos.xml document.
Listing 17.2. The Output of the Document Printer Example Program After Processing the condos.xml Document
    1: Start document.
    2: Start element: projects
    3: Start element: proj
    4: Start element: location
    5: End of element: location
    6: Start element: description
    7: Start element: name
    8: Received characters: Woodmont Close
    9: End of element: name
    10: Start element: address
    11: Received characters: 131 Woodmont Blvd.
    12: End of element: address
    13: Start element: address2
    14: Received characters: Nashville, TN 37205
    15: End of element: address2
    16: Start element: img
    17: Received characters: condowc.jpg
    18: End of element: img
    19: End of element: description
    20: End of element: proj
    21: ...
    22: Start element: proj
    23: Start element: location
    24: End of element: location
    25: Start element: description
    26: Start element: name
    27: Received characters: Harding Hall
    28: End of element: name
    29: Start element: address
    30: Received characters: 2120 Harding Pl.
    31: End of element: address
    32: Start element: address2
    33: Received characters: Nashville, TN 37215
    34: End of element: address2
    35: Start element: img
    36: Received characters: condohh.jpg
    37: End of element: img
    38: End of element: description
    39: End of element: proj
    40: End of element: projects
    41: End of document reached.
The excerpt from the condos.xml document that you saw a moment ago corresponds to the first proj element in the XML document. Lines 3 through 20 show how the Document Printer program parses and displays detailed information for this element and all of its content.
In this hour, you learned about one of the two popular APIs for parsing XML files, SAX. You already covered the DOM in the previous lesson, so this lesson wrapped up some loose ends in terms of giving you a more rounded understanding of XML parsing. SAX (Simple API for XML) is an event-driven parsing API; a SAX parser is usually combined with a custom program designed to process the events generated by the parser. You worked through an example of such an application in this hour that demonstrated how to use the Xerces SAX parser to step through the elements in an XML document and print out detailed information about them.
I didn’t get any of that Java stuff; how am I supposed to use SAX?
If you found the Java code confusing, you may be better off looking at the documentation for the SAX implementation for a programming language that you’re more comfortable using. You may want to do some online investigating to find a SAX parser that’s appropriate for you. Keep in mind that the Xerces SAX parser (http://xml.apache.org/) that you used in this lesson is also available for the C++ and Perl languages. Also, if you prefer using JavaScript or Visual Basic you may want to consider using the DOM for XML processing, which you explored in Hour 16, “Parsing XML with the DOM.”
How do I access the data structure created by SAX?
The catch with SAX is that it doesn’t create its own data structure; it’s up to the programmer who writes the event handlers to generate a data structure, print the XML, or do whatever it is they want to do with the data as it’s processed by the SAX parser.
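As a rough illustration of what building your own data structure can look like, the sketch below turns vehicle elements like the ones in vehicles.xml into a list of maps, one map per vehicle, holding both the attributes and the text of the child elements. The VehicleListBuilder class is a hypothetical name, the JDK's built-in parser is assumed, and the approach relies on each child element containing only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds one map of field names to values per <vehicle>.
class VehicleListBuilder extends DefaultHandler {
    private final List<Map<String, String>> vehicles = new ArrayList<>();
    private Map<String, String> current;           // the vehicle being built
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for this element afresh
        if ("vehicle".equals(qName)) {
            current = new LinkedHashMap<>();
            for (int i = 0; i < atts.getLength(); i++) {
                current.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("vehicle".equals(qName)) {
            vehicles.add(current);
            current = null;
        } else if (current != null) {
            // A child such as <color>: store its text under its own name.
            current.put(qName, text.toString().trim());
        }
    }

    // Parses the text and returns the vehicles in document order.
    static List<Map<String, String>> build(String xml) {
        VehicleListBuilder handler = new VehicleListBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.vehicles;
    }

    public static void main(String[] args) {
        String xml = "<vehicles><vehicle year=\"2004\" make=\"Acura\" model=\"3.2TL\">"
                   + "<mileage>13495</mileage><color>green</color><price>33900</price>"
                   + "</vehicle></vehicles>";
        for (Map<String, String> vehicle : build(xml)) {
            System.out.println(vehicle);
        }
    }
}
```

Once the parse finishes, the program works with ordinary Java collections and the parser is no longer involved, which is the main practical difference from the DOM approach.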
The Workshop is designed to help you anticipate possible questions, review what you’ve learned, and begin learning how to put your knowledge into practice.