StringUtils.countMatches() returns the frequency of a piece of text within another string:
File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
StringWriter stringWriter = new StringWriter( );
while( reader.ready( ) ) {
    stringWriter.write( reader.read( ) );
}
reader.close( );
String manuscript = stringWriter.toString( );

// Convert string to lowercase
manuscript = StringUtils.lowerCase( manuscript );

// Count the occurrences of "futility"
int numFutility = StringUtils.countMatches( manuscript, "futility" );
Converting the entire string to lowercase ensures that all occurrences of the word “futility” are counted, regardless of capitalization. After this code executes, numFutility contains the number of occurrences of the word “futility.”
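Under the hood, countMatches() amounts to a simple indexOf() loop. The following is a plain-Java sketch of equivalent logic; the class name and sample text are illustrative, not part of Commons Lang:

```java
public class CountMatchesDemo {

    // Plain-Java equivalent of StringUtils.countMatches(str, sub):
    // counts non-overlapping occurrences of sub within str.
    static int countMatches(String str, String sub) {
        if (str == null || str.isEmpty() || sub == null || sub.isEmpty()) {
            return 0;
        }
        int count = 0;
        int idx = 0;
        while ((idx = str.indexOf(sub, idx)) != -1) {
            count++;
            idx += sub.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String manuscript =
            "Futility breeds futility; futility is its own reward.".toLowerCase();
        System.out.println(countMatches(manuscript, "futility")); // prints 3
    }
}
```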
If the manuscript.txt file is large, it makes more sense to search the file one line at a time, summing the number of matches as each line is read. A more efficient implementation of the previous example would look like this:
File manuscriptFile = new File("manuscript.txt");
Reader reader = new FileReader( manuscriptFile );
LineNumberReader lineReader = new LineNumberReader( reader );
int numOccurrences = 0;
while( lineReader.ready( ) ) {
    String line = StringUtils.lowerCase( lineReader.readLine( ) );
    numOccurrences += StringUtils.countMatches( line, "futility" );
}
lineReader.close( );
Your random access memory will thank you for this implementation. Java programmers are often lulled into a false sense of security because they do not have to worry about memory management, but poor design decisions and inefficient implementations still lead to slow-running or hard-to-scale applications. Just because you don’t have to allocate and deallocate memory does not mean you should stop thinking about efficient memory use; if you are searching for the frequency of a word in a 20 MB file, try not to read the entire file into memory before searching.

Performing a linear search on a large string is also an inappropriate way to search a large collection of documents. When searching large amounts of text, it is more efficient to create an index of terms than to scan each document in full. A method for indexing and searching documents using Jakarta Lucene and Jakarta Commons Digester will be discussed in a later chapter.
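The index-versus-linear-search point can be illustrated with a minimal in-memory inverted index. This is a hypothetical sketch, not Lucene’s API: it maps each lowercased term to the line numbers on which it appears, so a term lookup never rescans the text. A real indexer adds tokenization, stemming, and on-disk storage:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermIndexDemo {

    // Build an inverted index: each lowercased term maps to the
    // list of line numbers where it occurs.
    static Map<String, List<Integer>> buildIndex(List<String> lines) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (String term : lines.get(i).toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, k -> new ArrayList<>()).add(i);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "Futility was the theme",
            "A study in futility",
            "Nothing to see here");
        Map<String, List<Integer>> index = buildIndex(lines);
        // Lookup is a single map access, no rescan of the text.
        System.out.println(index.get("futility")); // prints [0, 1]
    }
}
```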
Chapter 12 contains a number of recipes devoted to searching and filtering content. If you are creating a system that needs to search a large collection of documents, consider using Jakarta Lucene (http://jakarta.apache.org/lucene) to index your content.