You need to quickly search a collection of XML documents. To do this, you need to create an index of terms that keeps track of the context in which each term appears.
Use Jakarta Lucene and Jakarta Digester to create an index of Lucene Document objects at the lowest level of granularity you wish to search. For example, if you want to search for speeches in a Shakespeare play that contain specific terms, create a Lucene Document object for each speech. For the purposes of this recipe, assume that you are indexing Shakespeare plays stored in the following XML format:
<?xml version="1.0"?>
<PLAY>
  <TITLE>All's Well That Ends Well</TITLE>
  <ACT>
    <TITLE>ACT I</TITLE>
    <SCENE>
      <TITLE>SCENE I. Rousillon. The COUNT's palace.</TITLE>
      <SPEECH>
        <SPEAKER>COUNTESS</SPEAKER>
        <LINE>In delivering my son from me, I bury a second husband.</LINE>
      </SPEECH>
      <SPEECH>
        <SPEAKER>BERTRAM</SPEAKER>
        <LINE>And I in going, madam, weep o'er my father's death</LINE>
        <LINE>anew: but I must attend his majesty's command, to</LINE>
        <LINE>whom I am now in ward, evermore in subjection.</LINE>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>
The following code creates a Lucene index of Shakespeare speeches, reading the XML file for each play in the ./data/shakespeare directory and calling the PlayIndexer to create Lucene Document objects for every speech. These Document objects are then written to a Lucene index using an IndexWriter:
import java.io.File;
import java.io.FilenameFilter;
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.oro.io.GlobFilenameFilter;

// Assumes a log4j Logger named "logger" is declared in the enclosing class
File dataDir = new File("./data/shakespeare");
logger.info( "Looking for XML files in " + dataDir.getAbsolutePath( ) );

// Select every *.xml file in the data directory
FilenameFilter xmlFilter = new GlobFilenameFilter( "*.xml" );
File[] xmlFiles = dataDir.listFiles( xmlFilter );

// Create a new index in the "index" directory, overwriting any existing index
logger.info( "Creating Index");
IndexWriter writer = new IndexWriter("index", new SimpleAnalyzer( ), true);
PlayIndexer playIndexer = new PlayIndexer( writer );
playIndexer.init( );

// Index every speech in every play
for (int i = 0; i < xmlFiles.length; i++) {
    playIndexer.index(xmlFiles[i]);
}

writer.optimize( );
writer.close( );
logger.info( "Parsing Complete, Index Created");
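One detail the snippet above glosses over is that File.listFiles( ) returns null when the directory does not exist, so the loop would fail with a NullPointerException on a mistyped path. The following is a minimal sketch of a defensive file listing that uses only the JDK; the XmlFileLister class and its method names are hypothetical, and a plain FilenameFilter stands in for the ORO GlobFilenameFilter used in the recipe.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

// Hypothetical helper, not part of the recipe's source code.
class XmlFileLister {

    // Accepts any file whose name ends in ".xml", ignoring case.
    static final FilenameFilter XML_FILTER = new FilenameFilter() {
        public boolean accept(File dir, String name) {
            return name.toLowerCase().endsWith(".xml");
        }
    };

    // Returns the XML files in dataDir sorted by path, or an empty array
    // if dataDir does not exist or is not a directory.
    static File[] listXmlFiles(File dataDir) {
        File[] files = dataDir.listFiles(XML_FILTER);
        if (files == null) {
            return new File[0];
        }
        Arrays.sort(files);
        return files;
    }

    public static void main(String[] args) {
        File[] xmlFiles = listXmlFiles(new File("./data/shakespeare"));
        for (int i = 0; i < xmlFiles.length; i++) {
            System.out.println(xmlFiles[i].getName());
        }
    }
}
```

Sorting the files is optional; it only makes the order in which plays are indexed predictable from run to run.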
The PlayIndexer class, shown in Example 12-1, parses each XML file and creates Document objects that are written to an IndexWriter. The PlayIndexer uses Commons Digester to create a Lucene Document object for every speech. The init( ) method creates a Digester instance designed to interact with an inner class, DigestContext, which keeps track of the current context of a speech (play, act, scene, speaker) and the textual contents of a speech. At the end of every speech element, the DigestContext invokes the processSpeech( ) method, which creates a Lucene Document for each speech and writes this Document to the Lucene IndexWriter. Because each Document is associated with the specific context of a speech, it will be possible to obtain a specific location for each term or phrase.
Example 12-1. PlayIndexer using Commons Digester and Jakarta Lucene
package com.discursive.jccook.xml.bardsearch;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.net.URL;
import org.apache.commons.digester.Digester;
import org.apache.commons.digester.xmlrules.DigesterLoader;
import org.apache.log4j.Logger;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.xml.sax.SAXException;
import com.discursive.jccook.util.LogInit;

public class PlayIndexer {

    private static Logger logger = Logger.getLogger( PlayIndexer.class );
    static { LogInit.init( ); }

    private IndexWriter indexWriter;
    private Digester digester;
    private DigestContext context;

    public PlayIndexer(IndexWriter pIndexWriter) {
        indexWriter = pIndexWriter;
    }

    // Builds a Digester from the rule set shown in Example 12-2
    public void init( ) {
        URL playRules = PlayIndexer.class.getResource("play-digester-rules.xml");
        digester = DigesterLoader.createDigester( playRules );
    }

    // Parses one play; the rules call back into a fresh DigestContext
    public void index(File playXml) throws IOException, SAXException {
        context = new DigestContext( );
        digester.push( context );
        digester.parse( playXml );
        logger.info( "Parsed: " + playXml.getAbsolutePath( ) );
    }

    // Creates one Document per speech; context fields are stored,
    // the speech text itself is indexed but not stored
    public void processSpeech( ) {
        Document doc = new Document( );
        doc.add(Field.Text("play", context.playTitle));
        doc.add(Field.Text("act", context.actTitle));
        doc.add(Field.Text("scene", context.sceneTitle));
        doc.add(Field.Text("speaker", context.speaker));
        doc.add(Field.Text("speech",
                           new StringReader( context.speech.toString( ) )));
        try {
            indexWriter.addDocument( doc );
        } catch( IOException ioe ) {
            logger.error( "Unable to add document to index", ioe);
        }
    }

    public class DigestContext {
        File playXmlFile;
        String playTitle, actTitle, sceneTitle, speaker;
        StringBuffer speech = new StringBuffer( );

        public void setActTitle(String string) { actTitle = string; }
        public void setPlayTitle(String string) { playTitle = string; }
        public void setSceneTitle(String string) { sceneTitle = string; }
        public void setSpeaker(String string) { speaker = string; }

        // Append a space so the last word of one line and the first
        // word of the next are not fused into a single term
        public void appendLine(String pLine) {
            speech.append( pLine ).append( ' ' );
        }

        // Called at the end of each SPEECH element
        public void speechEnd( ) {
            processSpeech( );
            speech.delete( 0, speech.length( ) );
            speaker = "";
        }
    }
}
Example 12-1 uses the Digester rule set defined in Example 12-2. This set of rules is designed to invoke a series of methods in a set sequence to populate the context variables for each speech. The Digester rules in Example 12-2 never push objects onto or pop objects off the Digester stack; the only object on the stack is the DigestContext that index( ) pushes before parsing. Instead, the Digester is being used to populate variables and invoke methods on an object that creates Lucene Document objects based on a set of context variables. This example uses the Digester as a shorthand Simple API for XML (SAX) parser; the DigestContext inner class of PlayIndexer contains a series of callback methods, and the Digester rule set simplifies the interaction between the underlying SAX parser and the DigestContext.
Example 12-2. Digester rules for PlayIndexer
<?xml version="1.0"?>
<digester-rules>
  <pattern value="PLAY">
    <bean-property-setter-rule pattern="TITLE" propertyname="playTitle"/>
    <pattern value="ACT">
      <bean-property-setter-rule pattern="TITLE" propertyname="actTitle"/>
      <pattern value="PROLOGUE">
        <bean-property-setter-rule pattern="TITLE" propertyname="sceneTitle"/>
        <pattern value="SPEECH">
          <bean-property-setter-rule pattern="SPEAKER" propertyname="speaker"/>
          <call-method-rule pattern="LINE" methodname="appendLine"
                            paramtype="java.lang.String" paramcount="0"/>
          <call-method-rule methodname="speechEnd" paramtype="java.lang.Object"/>
        </pattern>
      </pattern>
      <pattern value="SCENE">
        <bean-property-setter-rule pattern="TITLE" propertyname="sceneTitle"/>
        <pattern value="SPEECH">
          <bean-property-setter-rule pattern="SPEAKER" propertyname="speaker"/>
          <call-method-rule pattern="LINE" methodname="appendLine"
                            paramtype="java.lang.String" paramcount="0"/>
          <call-method-rule methodname="speechEnd" paramtype="java.lang.Object"/>
        </pattern>
      </pattern>
    </pattern>
  </pattern>
</digester-rules>
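To see what the rule set in Example 12-2 saves you from writing, here is a rough sketch of the same context tracking done directly against SAX, using only the JDK. The SaxSpeechCollector class is hypothetical and is not part of the recipe; it collects each speech and its context into a String array instead of writing a Lucene Document, and it assumes that TITLE, SPEAKER, and LINE elements contain plain text with no child elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX counterpart to DigestContext; not the recipe's code.
class SaxSpeechCollector extends DefaultHandler {

    // Path of the currently open elements, e.g. "/PLAY/ACT/SCENE/TITLE"
    private final StringBuilder path = new StringBuilder();
    // Character data of the element being read
    private final StringBuilder text = new StringBuilder();
    // Lines of the current speech, joined with spaces
    private final StringBuilder speech = new StringBuilder();
    private String playTitle, actTitle, sceneTitle, speaker;

    // One entry per speech: { play, act, scene, speaker, speech }
    final List<String[]> speeches = new ArrayList<String[]>();

    public void startElement(String uri, String local, String qName, Attributes atts) {
        path.append('/').append(qName);
        text.setLength(0);
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        String p = path.toString();
        String body = text.toString().trim();
        if (p.equals("/PLAY/TITLE")) {
            playTitle = body;
        } else if (p.equals("/PLAY/ACT/TITLE")) {
            actTitle = body;
        } else if (p.endsWith("/SCENE/TITLE") || p.endsWith("/PROLOGUE/TITLE")) {
            sceneTitle = body;
        } else if (p.endsWith("/SPEECH/SPEAKER")) {
            speaker = body;
        } else if (p.endsWith("/SPEECH/LINE")) {
            speech.append(body).append(' ');
        } else if (p.endsWith("/SPEECH")) {
            // End of a speech: emit it with its context, then reset
            speeches.add(new String[] {
                playTitle, actTitle, sceneTitle, speaker, speech.toString().trim() });
            speech.setLength(0);
            speaker = "";
        }
        // Pop this element off the path
        path.setLength(path.length() - qName.length() - 1);
        text.setLength(0);
    }

    // Parses a play held in a String and returns the collected speeches
    static List<String[]> parse(String xml) throws Exception {
        SaxSpeechCollector handler = new SaxSpeechCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.speeches;
    }
}
```

The chain of path comparisons in endElement( ) is the part that the nested pattern elements of the rule set express declaratively, which is why the Digester version needs no parsing logic in Java at all.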
In this recipe, an IndexWriter was created with a SimpleAnalyzer. An Analyzer breaks a piece of text into tokens and decides which terms are indexed; different Analyzer implementations are appropriate for different applications. A SimpleAnalyzer splits text at non-letter characters and lowercases each token, but it keeps every word in a piece of text. A StandardAnalyzer is an Analyzer that discards common English words with little semantic value, such as “the,” “a,” “an,” and “for.” The StandardAnalyzer maintains a list of terms to discard, known as a stop list. Cutting down on the number of terms indexed can save time and space in an index, but it can also limit accuracy. For example, if one were to use the StandardAnalyzer to index the play Hamlet, a search for “to be or not to be” would return zero results, because every term in that phrase is a common English word on StandardAnalyzer’s stop list. In this recipe, a SimpleAnalyzer is used because it keeps track of the occurrence of every word in a document.
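The effect of a stop list is easy to see in a few lines of plain Java. The sketch below does not use Lucene; StopListSketch is a hypothetical class, its tokenizer only approximates what an analyzer does (lowercase the text and split at non-letters), and its stop list is a small illustrative subset chosen for this example rather than the list Lucene ships with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of stop-word filtering; not Lucene code.
class StopListSketch {

    // Illustrative subset of an English stop list (an assumption for this sketch)
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "be", "for", "not", "of", "or", "the", "to"));

    // Lowercases the text and splits it on anything that is not a letter
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (token.length() > 0) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Same as tokenize( ), but drops every token found on the stop list
    static List<String> tokenizeWithStopList(String text) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String phrase = "To be, or not to be";
        System.out.println("All terms:       " + tokenize(phrase));
        System.out.println("After stop list: " + tokenizeWithStopList(phrase));
    }
}
```

With the stop list applied, nothing is left of the phrase to index, so no query for it can ever match; without the stop list, all six terms survive.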
What you end up with after running this example is a directory named index, which contains files used by Lucene to associate terms with documents. In this example, a Lucene Document consists of the contextual information fully describing each speech: “play,” “act,” “scene,” “speaker,” and “speech.” Field objects are added to Document objects using Document’s add( ) method. The processSpeech( ) method in PlayIndexer creates Lucene Document objects that contain Fields; Field objects are created by calling Text( ), a static method on Field. The first parameter to Text( ) is the name of the field, and the second parameter is the content to be indexed. Passing a String as the second parameter to Text( ) instructs the IndexWriter to store the content of a field in a Lucene index; a Field created with a String can be displayed in a search result. Passing a Reader as the second parameter to Text( ) instructs the IndexWriter not to store the contents of a field, and the contents of a field created with a Reader cannot be returned in a search result. In the previous example, the “speech” field is created with a Reader to reduce the size of the Lucene index, and every other Field is created with a String so that our search results can provide a speech’s contextual coordinates.
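The difference between a stored field and an indexed-only field can be sketched with a toy inverted index in plain Java. This is a conceptual illustration only: ToySpeechIndex is a hypothetical class, it does not reflect how Lucene lays out its index files, and its addStored( ) and addUnstored( ) methods merely play the roles of Text( ) called with a String and with a Reader.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy index; a conceptual sketch, not Lucene's data structure.
class ToySpeechIndex {

    // "field:term" -> ids of the documents containing that term in that field
    private final Map<String, List<Integer>> postings =
        new HashMap<String, List<Integer>>();
    // document id -> the stored fields of that document
    private final List<Map<String, String>> storedFields =
        new ArrayList<Map<String, String>>();

    // Starts a new document and returns its id
    int newDocument() {
        storedFields.add(new LinkedHashMap<String, String>());
        return storedFields.size() - 1;
    }

    // Indexes the value and keeps a copy that search results can return
    void addStored(int docId, String field, String value) {
        storedFields.get(docId).put(field, value);
        index(docId, field, value);
    }

    // Indexes the value but keeps no copy of the original text
    void addUnstored(int docId, String field, String value) {
        index(docId, field, value);
    }

    private void index(int docId, String field, String value) {
        for (String term : value.toLowerCase().split("[^\\p{L}]+")) {
            if (term.length() == 0) {
                continue;
            }
            String key = field + ":" + term;
            List<Integer> ids = postings.get(key);
            if (ids == null) {
                ids = new ArrayList<Integer>();
                postings.put(key, ids);
            }
            Integer id = Integer.valueOf(docId);
            if (!ids.contains(id)) {
                ids.add(id);
            }
        }
    }

    // Returns the stored fields of every document matching field:term
    List<Map<String, String>> search(String field, String term) {
        List<Map<String, String>> hits = new ArrayList<Map<String, String>>();
        List<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        if (ids != null) {
            for (Integer id : ids) {
                hits.add(storedFields.get(id.intValue()));
            }
        }
        return hits;
    }
}
```

If a document is given a stored “speaker” field and an unstored “speech” field, a search on a word in the speech finds the document and returns the speaker, but the text of the speech is not available from the hit. To display it, you would go back to the source XML using the stored play, act, scene, and speaker values.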
Sure, you’ve created a Lucene index, but how would you search it? The index created in this recipe can be searched with Lucene using techniques described in Recipe 12.7 and Recipe 12.8.
If you are indexing a huge database of English documents, consider using the StandardAnalyzer to discard common English words. If you are indexing documents written in German or Russian, Lucene ships with GermanAnalyzer and RussianAnalyzer, which both contain stop word lists for these languages. For more information about these two implementations of Analyzer, see the Lucene JavaDoc at http://jakarta.apache.org/lucene/docs/api/index.html. Analyzer implementations for French, Dutch, Chinese, and Czech can be found in the Lucene Sandbox (http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/analyzers/).
For more information about Jakarta Lucene, see the Lucene project web site at http://jakarta.apache.org/lucene.
This recipe uses The Plays of Shakespeare, compiled by Jon Bosak. To download the complete works of Shakespeare in XML format, see http://www.ibiblio.org/bosak/.