Chapter 3. Parsing XML

With two solid chapters of introduction behind us, we are ready to code! By now you have seen the numerous acronyms that make up the world of XML, you have delved into the language itself, and you should be familiar with an XML document. This chapter takes the next step, and the first on our path of Java programming, by demonstrating how an XML document is parsed and how we can access the parsed data from within Java code.

One of the first things you will have to do when dealing with XML programmatically is take an XML document and parse it. As the document is parsed, the data in the document becomes available to the application using the parser, and suddenly we are within an XML-aware application! If this all sounds a little too simple to be true, it almost is. In this chapter, we will look closely at how an XML document is parsed. We will cover how to use a parser within an application and how to feed that parser your document’s data. Then we will look at the various callbacks that are available within the parsing lifecycle. These callbacks are the points where application-specific code can be inserted and data manipulation can occur.

In addition to looking at how parsers work, we will also begin our exploration of the Simple API for XML (SAX) in this chapter. SAX is what makes these parsing callbacks available. The interfaces provided in the SAX package will become an important part of our toolkit for handling XML. Even though the SAX classes are small and few in number, everything else in our discussions of XML is based on these classes. A solid understanding of how they help us access XML data is critical to effectively leveraging XML in your Java programs.
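As a preview of what these callbacks look like in code, here is a minimal sketch. It assumes the parser bundled with a current JDK, obtained through the JAXP SAXParserFactory class, rather than the Xerces download used in this book; the class names CallbackSketch and RecordingHandler are hypothetical and exist only for illustration. The handler records each callback as the parser reports it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CallbackSketch {

    /** Hypothetical handler: records each SAX callback in the order it fires. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startDocument() {
            events.add("startDocument");
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("startElement: " + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver text in several chunks; skip whitespace-only ones.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("characters: " + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("endElement: " + qName);
        }

        @Override
        public void endDocument() {
            events.add("endDocument");
        }
    }

    /** Parses the given XML text and returns the recorded callbacks. */
    static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        RecordingHandler handler = new RecordingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<book><title>Java and XML</title></book>")) {
            System.out.println(event);
        }
    }
}
```

The application never walks the document itself: the parser drives the process and calls back into the handler, which is where application-specific code belongs.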

Getting Prepared

There are several items that we should take care of before beginning to code. First, you must obtain an XML parser. Writing a parser for XML is a serious task, and there are several efforts going on to provide excellent XML parsers. We are not going to detail the process of actually writing an XML parser here; rather, we will discuss the applications that wrap this parsing behavior, focusing on using existing tools to manipulate XML data. This results in better and faster programs, as we do not seek to reinvent what is already available. After selecting a parser, we must ensure that a copy of the SAX classes is on hand. These are easy to locate, and are key to our Java code being able to process XML. Finally, we will need an XML document to parse. Then, on to the code!

Obtaining a Parser

The first step in getting ready to code Java that uses XML is locating and obtaining the parser you want to use. We briefly talked about this process in Chapter 1, and listed various XML parsers that could be used. To ensure that your parser works with all of the examples in the book, you should verify your parser’s compliance with the XML specification. Because of the variety of parsers available and the rapid pace of change within the XML community, all of the details about which parsers have what compliance levels are beyond the scope of this book. You should consult the parser’s vendor and visit the web sites previously given for this information.

In the spirit of the open source community, all of the examples in this book will use the Apache Xerces parser. Freely available in binary and source form at http://xml.apache.org, this C- and Java-based parser is already one of the most widely contributed-to parsers available. In addition, using an open source parser such as Xerces allows you to send questions or bug reports to the parser’s authors, resulting in a better product, as well as helping you use the software quickly and correctly. To subscribe to the general list and request help on the Xerces parser, send a blank email to . The members of this list can help if you have questions or problems with a parser not specifically covered in this book. Of course, the examples in this book all run normally on any parser that uses the SAX implementation covered here.

Once you have selected and downloaded an XML parser, make sure that your Java environment, whether it be an IDE (Integrated Development Environment) or a command line, has the XML parser classes in its class path. This will be a basic requirement for all further examples.
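One quick way to confirm the class path is set up is to ask the JVM to load a parser class by name. The sketch below is a hypothetical helper, not part of any parser distribution. It defaults to org.apache.xerces.parsers.SAXParser, the Xerces SAX parser class; if you chose a different parser, pass its class name as the first argument.

```java
public class ParserOnClasspath {

    /** Returns true if the named class can be located on the current class path. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ParserOnClasspath.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: Xerces is the parser in use; override with a command-line argument.
        String parserClass = args.length > 0
                ? args[0]
                : "org.apache.xerces.parsers.SAXParser";

        if (isOnClasspath(parserClass)) {
            System.out.println("Found " + parserClass);
        } else {
            System.out.println("Missing " + parserClass + "; check your class path");
        }
    }
}
```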

Getting the SAX Classes and Interfaces

Once you have your parser, you need to locate the SAX classes. These classes are almost always included with a parser when downloaded, and Xerces is no exception. If this is the case with your parser, you should be sure not to download the SAX classes explicitly, as your parser is probably packaged with the latest version of SAX that is supported by the parser. At the time of this writing, SAX 2.0 had just gone final. The SAX 2.0 classes are used throughout this book, and should come bundled with the latest version of the Apache Xerces parser.

If you are not sure whether you have the SAX classes, look at the jar file or class structure used by your parser. The SAX classes are packaged in the org.xml.sax structure. The latest version of these includes 17 classes in this root directory, as well as 9 classes in org.xml.sax.helpers and 2 in org.xml.sax.ext. If you are missing any of these classes, you should try to contact your parser’s vendor to see why the classes were not included with your distribution. It is possible that some classes may have been left out if they are not supported in whole.[1] These class counts are for SAX 2.0 as well; fewer classes may appear if only SAX 1.0 is supported.
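If you would rather not inspect the jar file by hand, the check can be sketched in a few lines of Java. The class SaxJarCheck below is a hypothetical helper that takes the path to your parser's jar file as its only argument and counts the class files found in each org.xml.sax package. Note that any nested classes would be counted as separate entries, so treat the totals as a rough guide.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

public class SaxJarCheck {

    /** Groups .class entries under org/xml/sax by package and counts them. */
    static Map<String, Long> countByPackage(Collection<String> entryNames) {
        return entryNames.stream()
                .filter(name -> name.startsWith("org/xml/sax/") && name.endsWith(".class"))
                .collect(Collectors.groupingBy(
                        // "org/xml/sax/helpers/Foo.class" -> "org.xml.sax.helpers"
                        name -> name.substring(0, name.lastIndexOf('/')).replace('/', '.'),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the parser's jar file (supplied by you).
        try (JarFile jar = new JarFile(args[0])) {
            List<String> names = jar.stream()
                    .map(JarEntry::getName)
                    .collect(Collectors.toList());
            countByPackage(names).forEach(
                    (pkg, count) -> System.out.println(pkg + ": " + count + " classes"));
        }
    }
}
```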

Finally, you may want to either download or bookmark the SAX API Javadocs on the Web. This documentation is extremely helpful in using the SAX classes, and the Javadoc structure provides a standard, simple way to find out additional information about the classes and what they do. This documentation is located at http://www.megginson.com/SAX/SAX2/javadoc/index.html. You may also generate Javadoc from the SAX source if you wish, by using the source included with your parser, or by downloading the complete source from http://www.megginson.com/SAX/SAX2.

Have an XML Document on Hand

You should also make sure that you have an XML document to parse. The output shown in the examples is based on parsing the XML document we discussed in Chapter 2. Save this file as contents.xml somewhere on your local hard drive. We highly recommend that you follow what we’re doing in this file. You can simply type the file in from the book, or you may download the XML file from the book’s web site, http://www.oreilly.com/catalog/javaxml. You are encouraged to take the time to type in the example, though, as it will almost certainly familiarize you with XML syntax more than a quick download will.

In addition to downloading or creating the XML file, you need to make a couple of small modifications. Because we haven’t yet covered how to constrain and transform documents, our programs only parse XML in this chapter. To prevent errors, we need to remove the references within the XML document to an external DTD, which constrains the XML, and to the XSL stylesheets that transform it. You should comment out these references in the XML document, as well as the processing instruction to Cocoon requesting XSL transformation:

<?xml version="1.0"?>

<!-- We don't need these yet
  <?xml-stylesheet href="XSLJavaXML.html.xsl" type="text/xsl"?>
  <?xml-stylesheet href="XSLJavaXML.wml.xsl" type="text/xsl"
                   media="wap"?>
  <?cocoon-process type="xslt"?>
  <!DOCTYPE JavaXML:Book SYSTEM "DTDJavaXML.dtd">
-->

<!-- Java and XML -->
<JavaXML:Book xmlns:JavaXML="http://www.oreilly.com/catalog/javaxml/">

Once these lines are commented out, note the full path to the XML document. You will need to supply that path to our programs in this and later chapters.

Finally, we need to comment out our reference to the OReillyCopyright external entity that would be used to load a file from the filesystem with the needed copyright information. Without a DTD to define how to resolve this entity reference, we will receive unwanted errors. In the next chapter, we will look at how to resolve this reference for the XML document.

</JavaXML:Contents>

<!-- Leave out until DTD Section
 <JavaXML:Copyright>&OReillyCopyright;</JavaXML:Copyright>
-->
</JavaXML:Book>
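To confirm that the edited file now parses cleanly, you can run it through a parser with a handler that does nothing. The sketch below is a hypothetical check, again assuming the JDK's bundled parser obtained through JAXP rather than Xerces. It takes the full path to contents.xml as its only argument and reports the first problem the parser finds, such as an entity reference that has no declaration.

```java
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns null if the source parses cleanly, else a description of the first problem. */
    static String firstProblem(InputSource source) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            // DefaultHandler ignores all content; its fatalError() rethrows parse errors.
            factory.newSAXParser().parse(source, new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: full path to your copy of contents.xml.
        File file = new File(args[0]);
        String problem = firstProblem(new InputSource(file.toURI().toString()));
        System.out.println(problem == null ? "Parsed without errors" : problem);
    }
}
```

If the OReillyCopyright reference is still active, expect this check to report a problem, because nothing in the document declares that entity once the DTD reference is removed.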


[1] Supporting SAX in whole is a very important item for a parser. Although you are certainly welcome to use any parser you like, if your parser does not have complete SAX 2.0 support, many of the examples in this book will not work. In addition, your parser is not keeping up with the latest XML developments. For either or both reasons, you may want to consider at least trying the Xerces parser for the duration of this book.
