Dealing with XML in Java

JAXP (Java API for XML Processing) is a standard set of Java APIs for dealing with XML objects. It's not strictly part of the J2EE standard, but it's an essential class library for dealing with XML instances. JDBC (Java DataBase Connectivity) is a set of Java APIs for connecting to back-end SQL databases. Together, JAXP and JDBC provide an infrastructure for building applications using XML and SQL. One way JAXP and JDBC work together is in the construction of Java objects that utilize both APIs to provide seamless access to information encoded in both XML and relational databases.

Parsers: SAX Versus DOM

As if you needed more impenetrable acronyms in your life!

Whenever you deal with XML instances in applications (Java or otherwise), you need to think about XML parsers. An XML parser turns the raw XML document into something that your application can use. SAX (Simple API for XML) and DOM (Document Object Model) are two different approaches that are used across different programming languages and application servers. Both take XML instances and turn them into something that can be manipulated programmatically, but they take very different paths to get there.

SAX (http://www.megginson.com/SAX/) uses an “event-based” model, which means that, as a SAX parser processes through your XML instance, it kicks off events that your code can “listen” for. Anyone who's written user interface code should be familiar with event-based coding. Your code listens for an event (such as a mouse click or a key press) and then kicks off some functionality based on this event. In the case of SAX events, you're waiting for a certain tag or combination of tags.

DOM takes a different, and somewhat more involved, approach, converting your entire XML instance into a tree and handing it back to you in that form. It's then up to you to walk through that tree from node to node, or search through it for what you're looking for.

The bottom line is that SAX is lighter weight, but DOM is much more powerful. With SAX, you create an object model for your data and map SAX-parsed XML into your object model. One advantage of SAX is that your application's object model can be exactly what you want it to be. DOM creates an object model for you. Choosing whether to use SAX or DOM depends on what kind of data is in your XML and how you want to use it. If you're doing something simple with XML messages being passed back and forth between systems, the lighter-weight SAX is your best bet. If you're doing anything more complex (for example, working with documents like our example movie reviews or complex hierarchical data) or if you are writing a user-oriented application that manipulates XML instances, you should use DOM. If you're building routines to do partial decomposition, I recommend using SAX because the event-based model makes more sense in this context. SAX is the parser to use if you want to manipulate XML documents “on the fly” or where storage capacity is limited (such as in a mobile device) because it doesn't require huge amounts of memory to store a document tree as DOM does.

Both SAX and DOM parsers are included in the JAXP specification, and by the way, nothing stops you from using both in the same application.

DOM (http://www.w3.org/DOM/) was developed by the W3C. DOM Level 2 has recently been released as a recommended specification. DOM Level 2 allows for an XML event model (bringing it closer to SAX in terms of functionality) and includes some other useful features. Many commercial XML parsers currently on the market support DOM Level 2.


Building Java Objects for XML Instances with DOM

In many cases, especially when your application code has to work with many of your documents simultaneously, you need to create an object type for your document instances. For the kind of XML instances we've been dealing with in this book and when you already have relational tables that contain data partially decomposed from the XML, I recommend building custom objects around a DOM-parsed tree and information retrieved directly from the relational tables. As the Review object (a simplified version of our movie review examples from the previous chapters) in Figure 9-2 illustrates, the title, author list, abstract, and publish date are represented by simple Java types—strings, arrays of strings, and timestamps (the latter being a JDBC type but still simple). These pieces of information are extracted directly from relational tables of the kind that we discussed in Chapter 6. The review text itself is stored in a DOM tree, which is parsed from the XML instance itself—probably, but not necessarily, from the same database.

Figure 9-2. Review object with data sources


Your review object, when instantiated, should populate its simple fields (such as Title, Author, Abstract, and Publish Date) but wait to populate the ReviewText field until this data is asked for. In this manner, you save yourself the trouble of parsing the XML document and storing the tree in memory until it's needed. Your application may have instantiated this object only as part of an array of reviews so that it can display a listing of reviews, showing the title, author, abstract, and publish date. The user, using your application, then selects one of the reviews for reading, at which point you can go back to the database, extract the XML instance, and parse it using a DOM parser to create the tree object necessary for whatever action you wish. Or you can send it to an XSLT transformation as described later (see the section Invoking XSLT Transformations later in this chapter).
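A minimal sketch of such a lazily populated object might look like the following (the field names and the loadReviewXml helper are hypothetical stand-ins for your own schema and JDBC code):

```java
import java.io.StringReader;
import java.sql.Timestamp;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Review {
    private String title;
    private String[] authors;
    private String abstractText;
    private Timestamp publishDate;
    private Document reviewText; // populated lazily, on first request

    public Review(String title, String[] authors,
                  String abstractText, Timestamp publishDate) {
        this.title = title;
        this.authors = authors;
        this.abstractText = abstractText;
        this.publishDate = publishDate;
    }

    public String getTitle() { return title; }

    // Parse the XML instance only when the review text is first asked for
    public Document getReviewText() throws Exception {
        if (reviewText == null) {
            String xml = loadReviewXml(); // hypothetical JDBC fetch
            reviewText = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        }
        return reviewText;
    }

    private String loadReviewXml() {
        // Placeholder: a real implementation would pull the XML
        // instance out of the database via JDBC.
        return "<REVIEW><TITLE>Ben Hur</TITLE></REVIEW>";
    }
}
```

The simple fields cost almost nothing to carry around in an array of reviews; the expensive parse happens only for the review the user actually opens.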

The code you execute to build your DOM tree when needed would look something like this:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

static Document myDocument;
DocumentBuilderFactory factory =
    DocumentBuilderFactory.newInstance();
...
try {
    DocumentBuilder builder = factory.newDocumentBuilder();
    myDocument = builder.parse( ...the source of the XML instance... );
}  catch (Exception x) {
// Do something with any exceptions generated
}

In this example, I've used the DocumentBuilderFactory class, which is part of the JAXP library (under the class hierarchy javax.xml.parsers), to give me a DocumentBuilder object whenever I need one. This approach simplifies your code by relying on an object factory to create your DocumentBuilder objects for you instead of having to instantiate and initialize them yourself. This DocumentBuilder object is then used to parse the XML file (probably taken out of the database using JDBC); it returns a parsed DOM tree into myDocument.
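Once myDocument holds the parsed tree, you walk it through the standard org.w3c.dom interfaces. For example, a sketch that pulls every MOVIE element (as in the CyberCinema instances) out of a document and reads its ID attribute and text content:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalk {
    // Collect an "ID: title" string for every MOVIE element in the instance
    public static List<String> listMovies(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList movies = doc.getElementsByTagName("MOVIE");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < movies.getLength(); i++) {
            Element movie = (Element) movies.item(i);
            result.add(movie.getAttribute("ID") + ": "
                    + movie.getFirstChild().getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<DOCUMENT ID=\"23\"><MOVIE ID=\"12\">Ben Hur</MOVIE></DOCUMENT>";
        System.out.println(listMovies(xml)); // prints [12: Ben Hur]
    }
}
```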

Using SAX Events to Drive XML Partial Decomposition

A SAX parser generates specific events as it processes through an XML instance. The generic DefaultHandler class (which is defined in the org.xml.sax.helpers package, along with many other useful helper classes and utility functions) is provided for you to subclass and add your own functionality to. The main event-handling methods provided (as null methods) are startDocument (which triggers at the start of an XML instance), endDocument (which triggers at the end of an XML instance), startElement (which triggers at the beginning of an element), endElement (which triggers at the end of an element), and characters (which handles all characters contained within elements, including white space). When the parser is invoked on an XML instance, it calls these methods as the corresponding pieces of the instance are encountered; any errors are reported as SAXException exceptions, which can be caught by whatever code invoked the parser to begin with.

When the parser finds a start tag or end tag, the name of the tag is passed to the startElement or endElement method. When a start tag is encountered, its attributes are passed as a list. Characters found in the element are also passed along.

The upshot of all this is that your custom-coded methods can wait for the SAX events described earlier and then execute JDBC statements to insert or update information in your relational tables or call preprepared, stored SQL procedures within your relational database. Your resultant code would look something like this:

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;

public class MyFunXMLHandler extends DefaultHandler
{
    public void startDocument()
    throws SAXException
    {
        // Whatever you want to do at the start of the document -- maybe
        // open your connection to the database.
    }

    public void endDocument()
    throws SAXException
    {
        // Whatever you want to do at the end of the document -- maybe close
        // your connection to the database.
    }

    public void startElement(String namespaceURI, // Only used if we're
                             String localName,    // using XML name spaces...
                             String qualifiedName,
                             Attributes attributes)
    throws SAXException
    {
    // Call a partial decomposition utility function with the name of
    // the element and attribute list as parameters.
    }
}

This code skeleton shows how you can set up a handler that reacts to SAX parser events as the parser works through an XML instance. The first part imports the necessary class libraries. You don't need to override the default method for endElement because you don't need to do anything special when an element ends.

The namespaceURI and localName parameters to startElement are used only if you're using XML namespaces, which is a method for creating a standard reference for the names of elements and attributes across different applications, schemas, and DTDs. XML namespaces is a fascinating topic, but it is beyond the scope of this book; refer to the W3C recommendation document at http://www.w3.org/TR/1999/REC-xml-names-19990114/.

Harkening back to our CyberCinema example, if you're trying to decompose the following piece of XML code:

<DOCUMENT ID="23">
...
<MOVIE ID="12">Ben Hur</MOVIE>
...
</DOCUMENT>

with the preceding SAX parser, your helper utility will want to insert a row into your review_movie bridging table (see Chapter 6), so the corresponding JDBC call will look something like this:

String updateString = "INSERT INTO review_movie " +
                    "values (" +
                    reviewId +
                    "," +
                    movieId +
                    ")";
Statement stmt = con.createStatement();
stmt.executeUpdate(updateString);

reviewId has been set to 23, and movieId has been set to 12.

The preceding code works great if the only information you ever want to decompose is stored in element attributes. If you want to decompose the contents of elements, you have to use the endElement method and the characters method to extract all the information between the beginning of the tag and the end of the tag for decomposition.
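A sketch of that endElement/characters pattern, using the MOVIE element from our example, follows. (The handler simply records the element content; a real decomposition handler would issue a JDBC insert instead. Note that the parser may call characters several times for a single element, so the text must be buffered.)

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ElementTextHandler extends DefaultHandler {
    private StringBuilder buffer = new StringBuilder();
    public String movieTitle; // captured element content, for illustration

    public void startElement(String uri, String localName,
                             String qName, Attributes attrs)
            throws SAXException {
        buffer.setLength(0); // start collecting fresh text for this element
    }

    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may be called more than once
    }

    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (qName.equals("MOVIE")) {
            // The full element content is now available for decomposition --
            // here we just record it rather than inserting a database row.
            movieTitle = buffer.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        ElementTextHandler handler = new ElementTextHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(
                    "<DOCUMENT><MOVIE ID=\"12\">Ben Hur</MOVIE></DOCUMENT>")),
                handler);
        System.out.println(handler.movieTitle); // prints Ben Hur
    }
}
```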

Invoking XSLT Transformations

So what do you do when you actually want to look at some of this XML? As we saw in Chapter 7, XSLT (Extensible Stylesheet Language Transformations) provides a mechanism for transforming (or converting) one type of XML into another. Your XML servlets (which serve HTTP requests) use XSLT stylesheets to transform the XML code you're using in your application to whatever is appropriate for the particular content delivery channel (another custom XML format, XHTML, WML, and so on). JAXP provides an easy way to convert your XML instances using XSLT transformations. The following code example illustrates how to invoke XSLT stylesheets from within your Java code:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

Transformer transformer;
TransformerFactory factory = TransformerFactory.newInstance();
String stylesheet = "file:///cybercinema_root/cc_stylesheet.xsl";
String sourceId = "file:///cybercinema_root/reviews/1234.xml";
try {
  transformer = factory.newTransformer(
       new StreamSource(stylesheet)
       );
  transformer.transform(
       new StreamSource(sourceId),
       new StreamResult(System.out)
       );
} catch (Exception e) {
  // handle whatever exceptions occur
}

This code snippet takes a style sheet from one file and an XML instance from another file and outputs the result (the fully transformed file) to the console. You can use this same method to take your XML file from any location (the database, an object in memory, a Web service, anywhere) and transform it. The preceding example is purposely simplistic, but you can begin to see how easily this kind of function is implemented.
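The result doesn't have to go to the console: anything a StreamResult can wrap, such as an in-memory StringWriter, works equally well. A runnable sketch (with a trivial inline stylesheet, purely for illustration) follows:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltToString {
    // A trivial stylesheet, inlined purely for illustration: it emits the
    // review title as plain text.
    private static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" "
      + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/REVIEW\">"
      + "Title: <xsl:value-of select=\"TITLE\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transform the given XML instance and return the result as a String
    public static String render(String sourceXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(
                new StreamSource(new StringReader(sourceXml)),
                new StreamResult(out)); // the result lands in the writer
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("<REVIEW><TITLE>Ben Hur</TITLE></REVIEW>"));
    }
}
```

In a servlet, you would typically hand the response's output stream to the StreamResult instead of buffering the whole page in memory.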

Designing an Entity Bean for Movie Reviews

The following code is a simplified design for an entity EJB (Enterprise Java Bean) for our movie reviews. The J2EE spec contains two types of EJBs: session beans and entity beans. Session beans are intended for business processes, while entity beans are intended for business objects (code objects that represent some kind of real-world object, such as an e-mail message, a movie review, a person, an airplane, and so on). For instance, the hypothetical publishing system for our movie reviews might contain session beans that take care of publishing activity (such as maintaining workflow states—see Chapter 10 for a discussion on workflow) and entity beans that represent particular movie reviews.

The code presented here is a first try at entity beans for movie reviews. In this case, I've simplified the reviews by having them include only a review author (pulled directly out of a relational table) and the review XML itself (also pulled out of a table and then parsed into a DOM tree). Note how the database routine selectByReviewId first selects the author ID out of the review_person_role table (which we defined in Chapter 6), then selects the appropriate XML out of the review table, and finally parses it using the DOM parser into a Document object.

// Import required class libraries
//
import java.io.StringReader;
import java.sql.*;
import javax.sql.*;
import java.util.*;
import javax.ejb.*;
import javax.naming.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Define our MovieReview entity bean and all its methods
//
public class MovieReview implements EntityBean
{
    private String reviewId;
    private String authorId;
    private Document document;
    private EntityContext context;
    private Connection con; // database connection
    private String logicalDBName = "...URI for data source...";

    // Public get method for the review ID
    //
    public String getReviewId()
    {
      return reviewId;
    }

    // Public get method for the review author
    //
    public String getAuthorId()
    {
      return authorId;
    }

    // Public get method for the review XML tree
    //
    public Document getDocument()
    {
      return document;
    }

    // Find method for this bean.
    //
    public String ejbFindByReviewId(String reviewId)
      throws FinderException
    {
      String foundId;

      try
        {
          foundId = selectByReviewId(reviewId);
        }
      catch (Exception ex)
        {
          throw new EJBException("ejbFindByReviewId: " +
                               ex.getMessage());
        }

      return foundId;
    }

    // Database routine: initiate connection to database
    //
    private void makeConnection() throws NamingException, SQLException
    {
      InitialContext ic = new InitialContext();
      DataSource ds = (DataSource) ic.lookup(logicalDBName);
      con = ds.getConnection();
    }

    // Database routine: get the review author, and get and parse the review
    // XML instance.
    //
    private String selectByReviewId(String reviewId)
      throws SQLException
    {
      String reviewXML = null;
      int REVIEWER_ROLE_ID = 3; // This constant for the ID number of the
                                // reviewer role in the review_person_role
                                // table would normally be set
                                // somewhere else, but is here for simplicity.

      // Execute the SQL statement to get the review
      //
      String selectStatement =
              "select person_id, review_xml " +
              "from review, review_person_role " +
              "where review.review_id = review_person_role.review_id " +
              "and review.review_id = ? and review_person_role.role_id = ?";
      PreparedStatement prepStmt =
              con.prepareStatement(selectStatement);
      prepStmt.setString(1, reviewId);
      prepStmt.setInt(2, REVIEWER_ROLE_ID);

      ResultSet rs = prepStmt.executeQuery();

      // Extract the author ID and the XML from the result set
      //
      if (rs.next())
        {
          authorId = rs.getString(1);
          reviewXML = rs.getString(2);
        }
      else
        {
          reviewId = null;
        }

      prepStmt.close();

      // Now create a DOM tree out of the XML
      //
      if (reviewXML != null)
        {
          DocumentBuilderFactory factory =
                DocumentBuilderFactory.newInstance();

          try
            {
              DocumentBuilder builder = factory.newDocumentBuilder();
              document = builder.parse(
                    new InputSource(new StringReader(reviewXML)));
            }
          catch (Exception x)
            {
              reviewId = null;
            }
        }

      return reviewId;
    }
}

Now we have a way to create self-contained entity bean objects for our movie reviews that can be manipulated by session beans, enumerated, queried, and otherwise shuffled around in the business logic of our application.

To Transform or Not to Transform

The JAXP libraries provide a powerful transformation mechanism for XML through invoking XSLT transformations, as we've seen previously in this chapter. However, I don't recommend that you do XML transformation on the fly in your applications. What do I mean by that? Let's take the following example, again building on our CyberCinema example.

When a review is published, it goes through a workflow process and ends up in a database table as a raw bit of XML with some accompanying, partially decomposed data (the author, title, publish date, and so forth) in other tables. Every time a user asks to see this review, do you want to fumble around in your database, pulling out the data, starting up the parser, running the document through the parser, transforming it to HTML, and serving it to the user? The alternative is to invoke the three most important parts of any Web application: caching, caching, caching.

Partial decomposition already represents a layer of caching. Some application servers include caching functionality at the database level and at the page generation level, which can be a good start, although I don't necessarily recommend relying on these built-in caching mechanisms. When you design your application, you should think about caching at each layer of your application. If a particular product offers built-in caching that fits the bill for one or more of these layers, then by all means make use of it.

Using the example of page generation, you can take one of several approaches:

  • Pregenerate all pages. Your document management system can pregenerate the pages that it knows need to be generated whenever a “publishing” action takes place and output them to HTML files sitting in a document root of your Web server. This strategy is the most meticulous type of page management, but when you're talking performance, there is nothing better than serving flat HTML documents off a file system. Don't let the tools vendors tell you differently. The tricky parts are making sure the right HTML documents are regenerated when you need them to be and maintaining “link integrity”—that is, making sure that intradocument links point where they're supposed to and don't point to documents that haven't been generated yet.

  • Use a page-caching application server. Some application servers include page caches, which generate a requested page on first request and then keep it in a file system cache. Subsequent requests for the same page are served directly out of the file system. The advantage is that you don't have to maintain the set of HTML files yourself, as with the previously described approach. A disadvantage is that these page-caching systems are often somewhat of a “black box” to the developer; that is, when they work, they're great, but when they fail, you have no idea why they've failed and no way to fix them. In addition, page-caching application servers are often prohibitively expensive.

  • Use a reverse-proxy cache in front of your application server. Another approach that gets you the same bang for fewer bucks is to use a reverse-proxy cache in front of your application server. Squid (http://www.squid-cache.org/) is an excellent free solution for reverse-proxy page caching. The reverse-proxy cache basically responds to all page requests from the Internet—that is, it becomes your Web server, the one users hit when they go to your site. But the reverse-proxy cache has a secret: It isn't really your Web server at all, but it knows where your server is. So if the reverse-proxy cache doesn't have the document the user is looking for, it gets it from the Web server and caches it for a configurable period of time.

The disadvantage of using a page cache that isn't part of your application server is that you don't have fine control over it from your application. Your application can't easily tell it to “forget” a page, which can mean a delay in the publishing process as you wait for your page cache to “flush.”
