2.12. Measuring the Frequency of a String

Problem

You need to find out how many times a certain word or piece of text occurs in a string.

Solution

StringUtils.countMatches() returns the frequency of a piece of text within another string:

File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
StringWriter stringWriter = new StringWriter( );
while( reader.ready( ) ) { stringWriter.write( reader.read( ) ); }
String manuscript = stringWriter.toString( );

// Convert string to lowercase
manuscript = StringUtils.lowerCase(manuscript);

// count the occurrences of "futility"
int numFutility = StringUtils.countMatches( manuscript, "futility" );

Converting the entire string to lowercase ensures that all occurrences of the word “futility” are counted, regardless of capitalization. After this code executes, numFutility contains the number of times the word “futility” appears in the manuscript.
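
The matching performed by countMatches() is case-sensitive, which is why the lowercasing step matters. The following small sketch, using made-up sample text, shows the difference:

import org.apache.commons.lang.StringUtils;

public class CountMatchesDemo {
    public static void main(String[] args) {
        // Hypothetical sample text, just to demonstrate case sensitivity.
        String text = "Futility, said the poet. Pure futility.";

        // Case-sensitive match: only the lowercase "futility" is counted.
        System.out.println( StringUtils.countMatches( text, "futility" ) );  // prints 1

        // Lowercasing first counts both capitalizations.
        System.out.println( StringUtils.countMatches(
            StringUtils.lowerCase( text ), "futility" ) );                   // prints 2
    }
}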

Discussion

If the manuscript.txt file is large, it makes more sense to search this file one line at a time, and sum the number of matches as each line is read. A more efficient implementation of the previous example would look like this:

File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
LineNumberReader lineReader = new LineNumberReader( reader );
int numOccurrences = 0;

while( lineReader.ready( ) ) { 
    String line = StringUtils.lowerCase( lineReader.readLine( ) );
    numOccurrences += StringUtils.countMatches( line, "futility" );
}

Your random access memory will thank you for this implementation. Java programmers are often lulled into a false sense of security because they do not have to worry about memory management, but poor design decisions and inefficient implementations still lead to slow-running or hard-to-scale applications. Just because you don’t have to allocate and deallocate memory does not mean you should stop thinking about efficient memory use. If you need to count the occurrences of a word in a 20 MB file, don’t read the entire file into memory before searching. And when you are searching a large collection of documents, a linear scan over one huge string is the wrong tool altogether; it is far more efficient to build an index of terms. A method for indexing and searching documents using Jakarta Lucene and Jakarta Commons Digester is discussed in a later chapter.
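
As a rough illustration of what an index of terms buys you (this is a hypothetical sketch, not the Lucene approach covered later), the following code makes a single pass over the file and records a count for every whitespace-delimited word; after that, looking up the frequency of any word is a map lookup instead of another scan over the text:

import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

public class TermIndexSketch {
    public static void main(String[] args) throws IOException {
        // Build a term -> frequency index in one pass over the file.
        Map termIndex = new HashMap( );
        LineNumberReader lineReader =
            new LineNumberReader( new FileReader( "manuscript.txt" ) );
        String line;
        while( ( line = lineReader.readLine( ) ) != null ) {
            // Split the lowercased line on whitespace; real tokenization
            // would also strip punctuation, stem words, and so on.
            String[] words = StringUtils.split( StringUtils.lowerCase( line ) );
            for( int i = 0; i < words.length; i++ ) {
                Integer count = (Integer) termIndex.get( words[i] );
                termIndex.put( words[i],
                    new Integer( count == null ? 1 : count.intValue( ) + 1 ) );
            }
        }
        lineReader.close( );

        // Any subsequent lookup is a map access, not another pass over the text.
        Integer numFutility = (Integer) termIndex.get( "futility" );
        System.out.println( "futility occurs " +
            ( numFutility == null ? 0 : numFutility.intValue( ) ) + " times" );
    }
}

This naive index only supports exact word counts; a real indexing library such as Lucene also handles tokenization, stemming, phrase queries, and persistence, which is why it is the better choice for searching large document collections.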

See Also

Chapter 12 contains a number of recipes devoted to searching and filtering content. If you are creating a system that needs to search a large collection of documents, consider using Jakarta Lucene (http://jakarta.apache.org/lucene) to index your content.
