Use Jakarta Lucene to index your documents and obtain a TermEnum
using an IndexReader. The frequency of a term is defined as the
number of documents in which that term appears, and a TermEnum
object exposes the frequency of every term in a set of documents.
Example 12-3 iterates over the terms contained in a TermEnum,
returning every term in the “speech” field that appears in at least
1,100 speeches.
Example 12-3. TermFreq finding the most frequent terms in an index
package com.discursive.jccook.xml.bardsearch;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.builder.CompareToBuilder;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

import com.discursive.jccook.util.LogInit;

public class TermFreq {

    private static Logger logger = Logger.getLogger(TermFreq.class);

    static { LogInit.init( ); }

    public static void main(String[] pArgs) throws Exception {
        int threshold = 1100;
        logger.info( "Threshold is " + threshold );

        // Open the index directory and enumerate every term it contains.
        // The variable is not named "enum", which is a reserved word
        // as of Java 5.
        IndexReader reader = IndexReader.open( "index" );
        TermEnum termEnum = reader.terms( );

        List<Freq> termList = new ArrayList<Freq>( );
        while( termEnum.next( ) ) {
            // Keep terms from the "speech" field that appear in at
            // least `threshold` documents.
            if( termEnum.docFreq( ) >= threshold &&
                termEnum.term( ).field( ).equals( "speech" ) ) {
                Freq freq = new Freq( termEnum.term( ).text( ),
                                      termEnum.docFreq( ) );
                termList.add( freq );
            }
        }
        termEnum.close( );
        reader.close( );

        // Sort ascending, then reverse to list the most frequent first.
        Collections.sort( termList );
        Collections.reverse( termList );

        System.out.println( "Frequency | Term" );
        Iterator<Freq> iterator = termList.iterator( );
        while( iterator.hasNext( ) ) {
            Freq freq = iterator.next( );
            System.out.print( freq.frequency );
            System.out.println( " | " + freq.term );
        }
    }

    public static class Freq implements Comparable<Freq> {
        String term;
        int frequency;

        public Freq( String term, int frequency ) {
            this.term = term;
            this.frequency = frequency;
        }

        // Orders by frequency first, then alphabetically by term.
        public int compareTo(Freq oFreq) {
            return new CompareToBuilder( )
                .append( frequency, oFreq.frequency )
                .append( term, oFreq.term )
                .toComparison( );
        }
    }
}
A Lucene index is opened by passing the name of the index directory
to IndexReader.open(), and a TermEnum is retrieved from the
IndexReader with a call to reader.terms(). The previous example
iterates through every term contained in the TermEnum, creating and
populating an instance of the inner class Freq whenever a term
appears in at least 1,100 documents and occurs in the “speech” field.
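To make the notion of document frequency concrete without a Lucene index, the following is a minimal plain-Java sketch of the same idea: count how many documents contain each term, keep the terms that meet a threshold, and list them from most to least frequent. The class DocFreqSketch, its method names, and the three sample strings are hypothetical and exist only for illustration; the tokenization is a crude regular-expression split, not what a Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: computes document frequency in memory.
class DocFreqSketch {

    // Returns term -> number of documents that contain the term at least once.
    static Map<String, Integer> documentFrequency(List<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String document : documents) {
            // A set collapses repeated occurrences within a single document,
            // so each document contributes at most 1 to a term's count.
            Set<String> distinct = new HashSet<>(Arrays.asList(
                document.toLowerCase(Locale.ROOT).split("[^a-z]+")));
            distinct.remove(""); // split can produce an empty leading token
            for (String term : distinct) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Keeps terms whose document frequency is at least the threshold, ordered
    // from most to least frequent; ties are broken alphabetically.
    static List<String> frequentTerms(Map<String, Integer> counts, int threshold) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                result.add(entry.getKey());
            }
        }
        result.sort((a, b) -> {
            int byFrequency = Integer.compare(counts.get(b), counts.get(a));
            return byFrequency != 0 ? byFrequency : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> speeches = Arrays.asList(
            "To be, or not to be",
            "To thine own self be true",
            "The rest is silence");
        Map<String, Integer> counts = documentFrequency(speeches);
        System.out.println("Frequency | Term");
        for (String term : frequentTerms(counts, 2)) {
            System.out.println(counts.get(term) + " | " + term);
        }
    }
}
```

With a threshold of 2, this sketch lists "be" and "to", each with a frequency of 2. The word "to" occurs three times across the sample strings, but it appears in only two of them, which is the distinction between a raw occurrence count and a document frequency.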
TermEnum has three methods of interest: next( ), docFreq( ), and
term( ). next( ) moves to the next term in the TermEnum, returning
false if no more terms are available. docFreq( ) returns the number
of documents the current term appears in, and term( ) returns a Term
object containing the text of the term and the field the term occurs
in. The List of Freq objects is sorted by frequency and reversed,
and the most frequent terms in a set of Shakespeare plays are printed
to the console:
0 INFO [main] TermFreq - Threshold is 1100
Frequency | Term
2907 | i
2823 | the
2647 | and
2362 | to
2186 | you
1950 | of
1885 | a
1870 | my
1680 | is
1678 | that
1564 | in
1562 | not
1410 | it
1262 | s
1247 | me
1200 | for
1168 | be
1124 | this
1109 | but
From this list, it appears that the most frequent terms in
Shakespeare plays are inconsequential words, such as “the,” “a,”
“of,” and “be.” The index this example was executed against was
created with a SimpleAnalyzer, which does not discard any terms. If
the index is created with a StandardAnalyzer, common stop words such
as articles, conjunctions, and prepositions are not stored as terms
in the index, and they do not show up on the most frequent terms
list; pronouns such as “i” and “you” are still indexed. Running this
example against an index created with a StandardAnalyzer and reducing
the frequency threshold to 600 documents returns the following
results:
Frequency | Term
2727 | i
2153 | you
1862 | my
1234 | me
1091 | your
1057 | have
1027 | he
973 | what
921 | so
893 | his
824 | do
814 | him
693 | all
647 | thou
632 | shall
614 | lord
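The effect of an analyzer that discards stop words can be approximated without Lucene. The plain-Java sketch below tokenizes one sentence twice: once keeping every token, and once dropping tokens found in a stop set. The class StopWordSketch, its method names, the sample sentence, and the six-word stop set are all assumptions made for illustration; the stop set is deliberately tiny and is not the list any Lucene analyzer actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of stop-word removal; not a Lucene analyzer.
class StopWordSketch {

    // Illustrative stop set only; an assumption, not an analyzer's real list.
    static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "and", "be", "of", "the", "to"));

    // Lowercases the text and keeps every alphabetic token.
    static List<String> tokenizeKeepingAll(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Same tokenization, but tokens in STOP_WORDS are discarded.
    static List<String> tokenizeDroppingStopWords(String text) {
        List<String> tokens = tokenizeKeepingAll(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String speech = "I gave the ring to you and a letter to my lord";
        // prints [i, gave, the, ring, to, you, and, a, letter, to, my, lord]
        System.out.println(tokenizeKeepingAll(speech));
        // prints [i, gave, ring, you, letter, my, lord]
        System.out.println(tokenizeDroppingStopWords(speech));
    }
}
```

Because the discarded tokens never reach the index, they cannot appear in any frequency listing, while words outside the stop set, including pronouns in this sketch, are counted as usual.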
The Jakarta Lucene Sandbox contains another example of enumerating term frequency; see the “High Frequency Terms” example (http://jakarta.apache.org/lucene/docs/lucene-sandbox/).