12.6. Creating an Index of XML Documents

Problem

You need to search a collection of XML documents quickly. To do this, you need an index of terms that records the context in which each term appears.

Solution

Use Jakarta Lucene and Commons Digester to create an index of Lucene Document objects for the lowest level of granularity you wish to search. For example, if you want to search for speeches in a Shakespeare play that contain specific terms, create a Lucene Document object for each speech. For the purposes of this recipe, assume that you are indexing Shakespeare plays stored in the following XML format:

<?xml version="1.0"?>
<PLAY>
  <TITLE>All's Well That Ends Well</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I.  Rousillon. The COUNT's palace.</TITLE>
      <SPEECH>
        <SPEAKER>COUNTESS</SPEAKER>
        <LINE>In delivering my son from me, I bury a second husband.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>BERTRAM</SPEAKER>
        <LINE>And I in going, madam, weep o'er my father's death</LINE>
        <LINE>anew: but I must attend his majesty's command, to</LINE>
        <LINE>whom I am now in ward, evermore in subjection.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>

The following code creates a Lucene index of Shakespeare speeches. It reads the XML file for each play in the ./data/shakespeare directory and calls the PlayIndexer to create a Lucene Document object for every speech. These Document objects are then written to a Lucene index using an IndexWriter:

import java.io.File;
import java.io.FilenameFilter;

import org.apache.log4j.Logger;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.oro.io.GlobFilenameFilter;

File dataDir = new File("./data/shakespeare");
logger.info( "Looking for XML files in " + dataDir.getAbsolutePath( ) );

FilenameFilter xmlFilter = new GlobFilenameFilter( "*.xml" );
File[] xmlFiles = dataDir.listFiles( xmlFilter );

logger.info( "Creating Index");
IndexWriter writer = new IndexWriter("index", 
                                     new SimpleAnalyzer( ), true);
PlayIndexer playIndexer = new PlayIndexer( writer );
playIndexer.init( );

for (int i = 0; i < xmlFiles.length; i++) {
    // Parse one play and add a Document for each of its speeches
    playIndexer.index( xmlFiles[i] );
}

writer.optimize( );
writer.close( );

logger.info( "Parsing Complete, Index Created");

The PlayIndexer class, shown in Example 12-1, parses each XML file and creates Document objects that are written to an IndexWriter. The PlayIndexer uses Commons Digester to create a Lucene Document object for every speech. The init( ) method creates a Digester instance designed to interact with an inner class, DigestContext, which keeps track of the current context of a speech—play, act, scene, speaker—and the textual contents of a speech. At the end of every speech element, the DigestContext invokes the processSpeech( ) method that creates a Lucene Document for each speech and writes this Document to the Lucene IndexWriter. Because each Document is associated with the specific context of a speech, it will be possible to obtain a specific location for each term or phrase.

Example 12-1. PlayIndexer using Commons Digester and Jakarta Lucene

package com.discursive.jccook.xml.bardsearch;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.net.URL;

import org.apache.commons.digester.Digester;
import org.apache.commons.digester.xmlrules.DigesterLoader;
import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.SAXException;

import com.discursive.jccook.util.LogInit;

public class PlayIndexer {

    private static Logger logger = 
        Logger.getLogger( PlayIndexer.class );
    static { LogInit.init( ); }

    private IndexWriter indexWriter;
    private Digester digester;
    private DigestContext context;

    public PlayIndexer(IndexWriter pIndexWriter) {
        indexWriter = pIndexWriter;
    }

    public void init( ) {
        // The rules file name is an assumption; load the rules shown in
        // Example 12-2 from wherever they sit on your classpath.
        URL playRules = 
            PlayIndexer.class.getResource( "play-digester-rules.xml" );
        digester = DigesterLoader.createDigester( playRules );
    }

    public void index(File playXml) throws IOException, SAXException {
        context = new DigestContext( );
        digester.push( context );
        digester.parse( playXml );
        logger.info( "Parsed: " + playXml.getAbsolutePath( ) );
    }

    public void processSpeech( ) {
        Document doc = new Document( );
        doc.add(Field.Text("play", context.playTitle));
        doc.add(Field.Text("act", context.actTitle));
        doc.add(Field.Text("scene", context.sceneTitle));
        doc.add(Field.Text("speaker", context.speaker));
        // A Reader-based field is indexed but not stored
        doc.add(Field.Text("speech", 
                           new StringReader( context.speech.toString( ) )));
        try {
            indexWriter.addDocument( doc );
        } catch( IOException ioe ) {
            logger.error( "Unable to add document to index", ioe);
        }
    }

    public class DigestContext {
        File playXmlFile;
        String playTitle, actTitle, sceneTitle, speaker;
        StringBuffer speech = new StringBuffer( );

        public void setActTitle(String string) { actTitle = string; }
        public void setPlayTitle(String string) { playTitle = string; }
        public void setSceneTitle(String string){ sceneTitle = string;}
        public void setSpeaker(String string) { speaker = string; }
        // Separate lines with a space so words at line ends stay distinct
        public void appendLine(String pLine) { 
            speech.append( pLine ).append( ' ' ); 
        }

        public void speechEnd( ) {
            processSpeech( );
            speech.delete( 0, speech.length( ) );
            speaker = "";
        }
    }
}

Example 12-1 used a Digester rule set defined in Example 12-2. This set of rules is designed to invoke a series of methods in a set sequence to populate the context variables for each speech. The Digester rules in Example 12-2 never push or pop objects onto the digester stack; instead, the Digester is being used to populate variables and invoke methods on an object that creates Lucene Document objects based on a set of context variables. This example uses the Digester as a shorthand Simple API for XML (SAX) parser; the PlayIndexer contains a series of callback methods, and the Digester rule set simplifies the interaction between the underlying SAX parser and the DigestContext.

Example 12-2. Digester rules for PlayIndexer

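<?xml version="1.0"?>

<!-- propertyname and methodname values match the members of DigestContext -->
<digester-rules>
    <pattern value="PLAY">
        <bean-property-setter-rule pattern="TITLE"
                                   propertyname="playTitle"/>
        <pattern value="ACT">
            <bean-property-setter-rule pattern="TITLE"
                                       propertyname="actTitle"/>
            <pattern value="PROLOGUE">
                <bean-property-setter-rule pattern="TITLE"
                                           propertyname="sceneTitle"/>
                <pattern value="SPEECH">
                    <bean-property-setter-rule pattern="SPEAKER"
                                               propertyname="speaker"/>
                    <call-method-rule pattern="LINE" 
                                      methodname="appendLine" paramcount="0"/>
                    <call-method-rule methodname="speechEnd"/>
                </pattern>
            </pattern>
            <pattern value="SCENE">
                <bean-property-setter-rule pattern="TITLE"
                                           propertyname="sceneTitle"/>
                <pattern value="SPEECH">
                    <bean-property-setter-rule pattern="SPEAKER"
                                               propertyname="speaker"/>
                    <call-method-rule pattern="LINE"
                                      methodname="appendLine" paramcount="0"/>
                    <call-method-rule methodname="speechEnd"/>
                </pattern>
            </pattern>
        </pattern>
    </pattern>
</digester-rules>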
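To see what the Digester rules are saving you, here is a rough sketch of the same bookkeeping written directly against SAX, using only the JDK. The SpeechHandler class and its parse( ) helper are hypothetical names that do not appear in the recipe, and the handler records each speech as a string instead of writing a Lucene Document:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical hand-written counterpart to the Digester rules: it tracks the
// same context variables and reports each speech when </SPEECH> closes.
class SpeechHandler extends DefaultHandler {

    String playTitle = "";
    String actTitle = "";
    String sceneTitle = "";
    String speaker = "";
    StringBuilder speech = new StringBuilder();
    List<String> speeches = new ArrayList<String>();

    private final List<String> path = new ArrayList<String>(); // open elements
    private final StringBuilder text = new StringBuilder();    // current body

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        path.add(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.remove(path.size() - 1);
        String parent = path.isEmpty() ? "" : path.get(path.size() - 1);
        String body = text.toString().trim();

        if (qName.equals("TITLE") && parent.equals("PLAY")) {
            playTitle = body;
        } else if (qName.equals("TITLE") && parent.equals("ACT")) {
            actTitle = body;
        } else if (qName.equals("TITLE")
                   && (parent.equals("SCENE") || parent.equals("PROLOGUE"))) {
            sceneTitle = body;
        } else if (qName.equals("SPEAKER")) {
            speaker = body;
        } else if (qName.equals("LINE")) {
            speech.append(body).append(' ');
        } else if (qName.equals("SPEECH")) {
            // Stand-in for processSpeech(): record the speech and its context
            speeches.add(playTitle + " | " + actTitle + " | " + sceneTitle
                + " | " + speaker + " | " + speech.toString().trim());
            speech.setLength(0);
            speaker = "";
        }
        text.setLength(0);
    }

    // Parses a play held in a String and returns one entry per speech
    static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SpeechHandler handler = new SpeechHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.speeches;
        } catch (Exception e) {
            throw new IllegalStateException("Unable to parse play XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<PLAY><TITLE>All's Well That Ends Well</TITLE>"
            + "<ACT><TITLE>ACT I</TITLE><SCENE><TITLE>SCENE I.</TITLE>"
            + "<SPEECH><SPEAKER>COUNTESS</SPEAKER>"
            + "<LINE>In delivering my son from me, I bury a second husband.</LINE>"
            + "</SPEECH></SCENE></ACT></PLAY>";
        for (String s : parse(xml)) {
            System.out.println(s);
        }
    }
}
```

Each branch in endElement( ) corresponds to one rule in the rule set, which is why the Digester version stays short as the number of elements grows.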


Discussion

In this recipe, an IndexWriter was created with a SimpleAnalyzer. An Analyzer takes a stream of text and produces the terms to be indexed; different Analyzer implementations are appropriate for different applications. A SimpleAnalyzer splits text at non-letter characters and lowercases each token; it has no stop list, so every word is indexed. A StandardAnalyzer discards common English words with little semantic value, such as “the,” “a,” “an,” and “for.” The list of terms it discards is called a stop list. Cutting down on the number of terms indexed can save time and space in an index, but it can also limit accuracy. For example, if you used the StandardAnalyzer to index the play Hamlet, a search for “to be or not to be” would return zero results, because every term in that phrase is a common English word on StandardAnalyzer’s stop list. In this recipe, a SimpleAnalyzer is used because it keeps track of the occurrence of every word in a document.
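The effect of a stop list is easy to see in a few lines of plain Java. The following sketch is not Lucene code: StopListDemo is a hypothetical class, and its stop list is an abbreviated, assumed list chosen only to cover the words mentioned above. It tokenizes text by lowercasing it and splitting at non-letters, with and without the stop list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of a stop list; this is not Lucene's implementation.
class StopListDemo {

    // Abbreviated, assumed stop list for illustration only
    static final Set<String> STOP_WORDS = new HashSet<String>(
        Arrays.asList("a", "an", "the", "for", "to", "be", "or", "not"));

    // Lowercase the text and split it into runs of letters
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ENGLISH).split("\\P{L}+")) {
            if (token.length() > 0) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Same tokens, minus anything on the stop list
    static List<String> tokenizeWithStopList(String text) {
        List<String> terms = tokenize(text);
        terms.removeAll(STOP_WORDS);
        return terms;
    }

    public static void main(String[] args) {
        String phrase = "To be, or not to be";
        System.out.println(tokenize(phrase));             // every word kept
        System.out.println(tokenizeWithStopList(phrase)); // nothing left to index
    }
}
```

With the stop list applied, the phrase produces no terms at all, so there is nothing in the index for a query to match.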

What you end up with after running this example is a directory named index, which contains the files Lucene uses to associate terms with documents. In this example, a Lucene Document consists of the contextual information fully describing each speech: “play,” “act,” “scene,” “speaker,” and “speech.” Field objects are added to Document objects using Document’s add( ) method. The processSpeech( ) method in PlayIndexer creates Lucene Document objects that contain Fields; Field objects are created by calling Text( ), a static method on Field. The first parameter to Text( ) is the name of the field, and the second parameter is the content to be indexed.

Passing a String as the second parameter to Text( ) instructs the IndexWriter to store the content of the field in the Lucene index; a Field created with a String can be displayed in a search result. Passing a Reader as the second parameter to Text( ) instructs the IndexWriter not to store the contents of the field, and the contents of a field created with a Reader cannot be returned in a search result. In Example 12-1, the “speech” field is created with a Reader to reduce the size of the Lucene index, and every other Field is created with a String so that search results can provide a speech’s contextual coordinates.
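If the difference between an indexed field and a stored field seems abstract, a toy model may help. The TinyIndex class below is hypothetical and has nothing to do with Lucene's actual file format. It keeps each speech's context string (the stored fields) and maps every term in the speech text to the speeches that contain it, but it discards the speech text itself once the terms are recorded:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of stored versus unstored fields; not Lucene's data structures.
class TinyIndex {

    // Stored fields: one context string per speech
    private final List<String> storedContext = new ArrayList<String>();

    // Inverted index: term -> ids of the speeches that contain it
    private final Map<String, List<Integer>> postings =
        new HashMap<String, List<Integer>>();

    // The speech text is indexed term by term but never kept
    void addSpeech(String context, String speechText) {
        int id = storedContext.size();
        storedContext.add(context);
        String lower = speechText.toLowerCase(Locale.ENGLISH);
        for (String term : lower.split("\\P{L}+")) {
            if (term.length() == 0) {
                continue;
            }
            List<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(term, ids);
            }
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
    }

    // A hit can report where the term occurs, but not the speech itself
    List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        List<Integer> ids = postings.get(term.toLowerCase(Locale.ENGLISH));
        if (ids != null) {
            for (int id : ids) {
                hits.add(storedContext.get(id));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addSpeech("All's Well That Ends Well / ACT I / SCENE I. / COUNTESS",
            "In delivering my son from me, I bury a second husband.");
        System.out.println(index.search("husband"));
    }
}
```

A search hit here can name the play, act, scene, and speaker, which is the trade-off the recipe makes: smaller index, at the cost of not being able to display the matching speech.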

See Also

Sure, you’ve created a Lucene index, but how would you search it? The index created in this recipe can be searched with Lucene using techniques described in Recipe 12.7 and Recipe 12.8.

If you are indexing a huge database of English documents, consider using the StandardAnalyzer to discard common English words. If you are indexing documents written in German or Russian, Lucene ships with GermanAnalyzer and RussianAnalyzer, which both contain stop word lists for these languages. For more information about these two implementations of Analyzer, see the Lucene JavaDoc at http://jakarta.apache.org/lucene/docs/api/index.html. Analyzer implementations for French, Dutch, Chinese, and Czech can be found in the Lucene Sandbox (http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/analyzers/).

For more information about Jakarta Lucene, see the Lucene project web site at http://jakarta.apache.org/lucene.

This recipe uses The Plays of Shakespeare, compiled by Jon Bosak. To download the complete works of Shakespeare in XML format, see http://www.ibiblio.org/bosak/.

