Sometimes you need to perform a word-based analysis of a text file, for example, for spell checking or gathering statistics. This recipe shows how to read a file word by word in Groovy.
For this recipe, you can create a new Groovy script file and download a large text file for testing purposes. The Project Gutenberg website has thousands of text files that can be used for text analysis, for example, William Shakespeare's Macbeth, available at http://www.gutenberg.net/cache/epub/2264/pg2264.txt.
We assume that the pg2264.txt file containing Shakespeare's masterpiece Macbeth has been downloaded, but any large text file will do for this example.
def file = new File('pg2264.txt') // Macbeth
int wordCount = 0
file.eachLine { String line ->
    line.tokenize().each { String word ->
        wordCount++
        println word
    }
}
println "Number of words: $wordCount"
...
FINIS.
THE
TRAGEDIE
OF
MACBETH.
Number of words: 20366
The preceding snippet prints every word in the file on a separate line and, at the end, outputs the total number of words. The simplest way to pick all the words from a file is to read the file line by line with the eachLine method (described in the Reading a text file line by line recipe) and then split each line into words. The java.lang.String class already provides a split method that takes a regular expression as the word separator. There is also a tokenize method added by the Groovy JDK, which splits a given string into a list of strings using whitespace as the separator (\s+).
The tokenize method is equivalent to the following split method call:
line.split(/\s+/).findAll { it.trim() }.each { String word -> ... }
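Note that we filter out the empty strings from the result of the split method (produced, for example, by a line that starts with whitespace or by an empty line) using the findAll method, which is available on all collections and arrays in Groovy. The tokenize method does this cleanup automatically for us.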
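The difference between the two methods can be checked on a single string. The following is a minimal sketch; the sample line and the variable names (tokens, raw, cleaned) are made up for illustration and are not part of the recipe:

```groovy
// Hypothetical sample line with leading, trailing, and repeated whitespace.
def line = '  Double, double  toil and\ttrouble; '

// tokenize() splits on whitespace and never returns empty tokens.
List<String> tokens = line.tokenize()

// split() returns a String[]; the leading whitespace produces an empty first element.
String[] raw = line.split(/\s+/)

// findAll drops the empty string, because an empty string evaluates to false in Groovy.
List<String> cleaned = raw.findAll { it.trim() }

println "tokenize: ${tokens.size()} words"
println "split:    ${raw.length} elements"
println "filtered: ${cleaned.size()} words"
```

Here split returns one element more than tokenize, and that extra element is the empty string at index 0. After filtering, both results contain the same words.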
Another way to split words in a text file is to use the splitEachLine method that Groovy adds to java.io.File. Like String's split method, it takes a regular expression as input, as well as a closure to which it passes the list of strings obtained by splitting each line. With the help of this method, our original code snippet can be rewritten in the following way:
int wordCount = 0
file.splitEachLine(/\s+/) { Collection words ->
    words.findAll { it.trim() }.each { String word ->
        wordCount++
        println word
    }
}
println "Number of words: $wordCount"
As with the split method, we need to filter out the empty words to get the same number of words.
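To see why the filtering matters, both approaches can be compared on a small file. The sketch below is illustrative only: it writes a few made-up lines to a temporary file instead of using pg2264.txt, and the counter names are assumptions chosen for this example.

```groovy
// Hypothetical sample text in a temporary file, so the script is self-contained.
File sample = File.createTempFile('words', '.txt')
sample.deleteOnExit()
sample.text = '''  When shall we three meet again?
In thunder, lightning,  or in rain?

When the hurly-burly's done,
'''

// Approach 1: eachLine + tokenize
int tokenizeCount = 0
sample.eachLine { String line ->
    tokenizeCount += line.tokenize().size()
}

// Approach 2: splitEachLine, counted with and without filtering empty strings
int filteredCount = 0
int unfilteredCount = 0
sample.splitEachLine(/\s+/) { List words ->
    unfilteredCount += words.size()
    filteredCount += words.findAll { it.trim() }.size()
}

println "tokenize:                   $tokenizeCount"
println "splitEachLine (filtered):   $filteredCount"
println "splitEachLine (unfiltered): $unfilteredCount"
```

The filtered splitEachLine count matches the tokenize count, while the unfiltered count is higher because the indented first line and the empty third line each contribute an empty string.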