2.12. Measuring the Frequency of a String

Problem

You need to find out how many times a certain word or piece of text occurs in a string.

Solution

StringUtils.countMatches() returns the frequency of a piece of text within another string:

File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
StringWriter stringWriter = new StringWriter( );
while( reader.ready( ) ) { stringWriter.write( reader.read( ) ); }
String manuscript = stringWriter.toString( );

// Convert string to lowercase
manuscript = StringUtils.lowerCase(manuscript);

// count the occurrences of "futility"
int numFutility = StringUtils.countMatches( manuscript, "futility" );

Converting the entire string to lowercase ensures that all occurrences of the word “futility” are counted, regardless of capitalization. After this code executes, numFutility contains the number of times the word “futility” appears in the manuscript.
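
The matching performed by countMatches() is case-sensitive, which is why the lowercasing step matters. The following small sketch, using made-up sample text, shows the difference:

import org.apache.commons.lang.StringUtils;

public class CountMatchesDemo {
    public static void main(String[] args) {
        // Hypothetical sample text, just to demonstrate case sensitivity.
        String text = "Futility, said the poet. Pure futility.";

        // Case-sensitive match: only the lowercase "futility" is counted.
        System.out.println( StringUtils.countMatches( text, "futility" ) );  // prints 1

        // Lowercasing first counts both capitalizations.
        System.out.println( StringUtils.countMatches(
            StringUtils.lowerCase( text ), "futility" ) );                   // prints 2
    }
}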

Discussion

If the manuscript.txt file is large, it makes more sense to search this file one line at a time, and sum the number of matches as each line is read. A more efficient implementation of the previous example would look like this:

File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
LineNumberReader lineReader = new LineNumberReader( reader );
int numOccurrences = 0;

while( lineReader.ready( ) ) { 
    String line = StringUtils.lowerCase( lineReader.readLine( ) );
    numOccurrences += StringUtils.countMatches( line, "futility" );
}

Your random access memory will thank you for this implementation. Java programmers are often lulled into a false sense of security because they do not have to worry about memory management, but poor design decisions and inefficient implementations still lead to slow-running or hard-to-scale applications. Just because you don’t have to allocate and deallocate memory does not mean you should stop thinking about efficient memory use. If you need to count the occurrences of a word in a 20 MB file, don’t read the entire file into memory before searching. And when you are searching a large collection of documents, a linear scan over one huge string is the wrong tool altogether; it is far more efficient to build an index of terms. A method for indexing and searching documents using Jakarta Lucene and Jakarta Commons Digester is discussed in a later chapter.
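
As a rough illustration of what an index of terms buys you (this is a hypothetical sketch, not the Lucene approach covered later), the following code makes a single pass over the file and records a count for every whitespace-delimited word; after that, looking up the frequency of any word is a map lookup instead of another scan over the text:

import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

public class TermIndexSketch {
    public static void main(String[] args) throws IOException {
        // Build a term -> frequency index in one pass over the file.
        Map termIndex = new HashMap( );
        LineNumberReader lineReader =
            new LineNumberReader( new FileReader( "manuscript.txt" ) );
        String line;
        while( ( line = lineReader.readLine( ) ) != null ) {
            // Split the lowercased line on whitespace; real tokenization
            // would also strip punctuation, stem words, and so on.
            String[] words = StringUtils.split( StringUtils.lowerCase( line ) );
            for( int i = 0; i < words.length; i++ ) {
                Integer count = (Integer) termIndex.get( words[i] );
                termIndex.put( words[i],
                    new Integer( count == null ? 1 : count.intValue( ) + 1 ) );
            }
        }
        lineReader.close( );

        // Any subsequent lookup is a map access, not another pass over the text.
        Integer numFutility = (Integer) termIndex.get( "futility" );
        System.out.println( "futility occurs " +
            ( numFutility == null ? 0 : numFutility.intValue( ) ) + " times" );
    }
}

This naive index only supports exact word counts; a real indexing library such as Lucene also handles tokenization, stemming, phrase queries, and persistence, which is why it is the better choice for searching large document collections.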

See Also

Chapter 12 contains a number of recipes devoted to searching and filtering content. If you are creating a system that needs to search a large collection of documents, consider using Jakarta Lucene (http://jakarta.apache.org/lucene) to index your content.
