12.8. Finding the Frequency of Terms in an Index

Problem

You need to find the most frequently used terms in a Lucene index.

Solution

Use Jakarta Lucene to index your documents and obtain a TermEnum from an IndexReader. The frequency of a term is defined as the number of documents in which that term appears, and a TermEnum object contains the frequency of every term in an index. Example 12-3 iterates over the terms contained in a TermEnum, printing every term that appears in 1,100 or more speeches.

Example 12-3. TermFreq finding the most frequent terms in an index

package com.discursive.jccook.xml.bardsearch;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.builder.CompareToBuilder;
import org.apache.log4j.Logger;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

import com.discursive.jccook.util.LogInit;

public class TermFreq {
    private static Logger logger = Logger.getLogger(TermFreq.class);
    static { LogInit.init( ); }

    public static void main(String[] pArgs) throws Exception {
        logger.info("Threshold is 1100" );
        int threshold = 1100;

        IndexReader reader = IndexReader.open( "index" );
        // "enum" became a reserved word in Java 5, so the TermEnum
        // variable is named termEnum here.
        TermEnum termEnum = reader.terms( );
        List termList = new ArrayList( );
        while( termEnum.next( ) ) {
            if( termEnum.docFreq( ) >= threshold && 
                termEnum.term( ).field( ).equals( "speech" ) ) {
                Freq freq = new Freq( termEnum.term( ).text( ), termEnum.docFreq( ) );
                termList.add( freq );
            }
        }
        Collections.sort( termList );
        Collections.reverse( termList );

        System.out.println( "Frequency | Term" );
        Iterator iterator = termList.iterator( );
        while( iterator.hasNext( ) ) {
            Freq freq = (Freq) iterator.next( );
            System.out.print( freq.frequency );
            System.out.println( " | " + freq.term );
        }
    }
    
    public static class Freq implements Comparable {
        String term;
        int frequency;
        
        public Freq( String term, int frequency ) {
            this.term = term;
            this.frequency = frequency;
        }
        
        public int compareTo(Object o) {
            if( o instanceof Freq ) {
                Freq oFreq = (Freq) o;
                return new CompareToBuilder( )
                    .append( frequency, oFreq.frequency )
                    .append( term, oFreq.term )
                    .toComparison( );
            } else {
                return 0;
            }
        }
    }
}

A Lucene index is opened by passing the name of the index directory to IndexReader.open(), and a TermEnum is retrieved from the IndexReader with a call to reader.terms(). The previous example iterates through every term contained in the TermEnum, creating and populating an instance of the inner class Freq for each term that appears in 1,100 or more documents in the "speech" field. TermEnum contains three methods of interest: next( ), docFreq( ), and term( ). next( ) advances to the next term in the TermEnum, returning false when no more terms are available; docFreq( ) returns the number of documents a term appears in; and term( ) returns a Term object containing the text of the term and the field the term occurs in. The List of Freq objects is sorted by frequency and reversed, and the most frequent terms in a set of Shakespeare plays are printed to the console:

0    INFO  [main] TermFreq     - Threshold is 1100
Frequency | Term
2907 | i
2823 | the
2647 | and
2362 | to
2186 | you
1950 | of
1885 | a
1870 | my
1680 | is
1678 | that
1564 | in
1562 | not
1410 | it
1262 | s
1247 | me
1200 | for
1168 | be
1124 | this
1109 | but

Discussion

From this list, it appears that the most frequent terms in Shakespeare plays are inconsequential words, such as “the,” “a,” “of,” and “be.” The index this example was executed against was created with a SimpleAnalyzer that does not discard any terms. If this index is created with StandardAnalyzer, common articles and pronouns will not be stored as terms in the index, and they will not show up on the most frequent terms list. Running this example against an index created with a StandardAnalyzer and reducing the frequency threshold to 600 documents returns the following results:

Frequency | Term
2727 | i
2153 | you
1862 | my
1234 | me
1091 | your
1057 | have
1027 | he
973 | what
921 | so
893 | his
824 | do
814 | him
693 | all
647 | thou
632 | shall
614 | lord
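The effect of the stop-word filtering discussed above can be illustrated without Lucene. The sketch below uses a small hand-picked stop list (StandardAnalyzer's actual table is larger) and a crude lowercase-and-split tokenizer, so it is only an approximation of what a stop-filtering analyzer does to a field:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // A few entries from a typical English stop list; this is an
    // illustrative subset, not Lucene's actual stop table.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "the", "a", "of", "and", "to", "be", "in",
        "that", "it", "is", "not", "but", "for"));

    // Lowercase, split on non-letters, and drop stop words -- a rough
    // approximation of a stop-filtering analyzer.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Stop words are removed; only content-bearing terms remain.
        System.out.println(analyze("To be, or not to be, that is the question"));
        // Prints: [or, question]
    }
}
```

Indexing the same documents through such a filter is why "the", "a", and "of" vanish from the second frequency list while "i", "you", and "my" remain.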

See Also

There is an example of enumerating term frequency in the Jakarta Lucene Sandbox; see the “High Frequency Terms” example (http://jakarta.apache.org/lucene/docs/lucene-sandbox/).
