Processing every word in a text file

Sometimes you need to analyze a text file word by word, for example, for spell checking or for gathering statistics. This recipe shows how to read a file word by word in Groovy.

Getting ready

For this recipe, you can create a new Groovy script file and download a large text file for testing purposes. The Project Gutenberg website has thousands of text files that can be used for text analysis, for example, William Shakespeare's Macbeth, available at http://www.gutenberg.net/cache/epub/2264/pg2264.txt.

How to do it...

We assume that the pg2264.txt file containing Shakespeare's masterpiece Macbeth has been downloaded, but any large text file will do for this example.

  1. Add the following code to the Groovy script:
    def file = new File('pg2264.txt') // Macbeth
    int wordCount = 0
    file.eachLine { String line ->
      line.tokenize().each { String word ->
        wordCount++
        println word
      }
    }
    println "Number of words: $wordCount"
  2. After the execution, the script should terminate with the following output:
    ...
    FINIS.
    THE
    TRAGEDIE
    OF
    MACBETH.
    Number of words: 20366
    

How it works...

The snippet in the previous section prints every word in the file on a separate line and, at the end, outputs the total number of words. The simplest way to pick out all the words in a file is to read it line by line with the eachLine method (described in the Reading a text file line by line recipe) and then split each line into words. The java.lang.String class already provides a split method that takes a regular expression as the word separator. There is also a tokenize method added by the Groovy JDK, which splits a given string into a collection of strings using whitespace as the separator ([\s\t]+).

The tokenize method is equivalent to the following split method call:

line.split(/[\s\t]+/).findAll{ it.trim() }.each { String word ->
  ...
}

Note that we filtered the empty words from the result of the split method using the findAll method available in all collections in Groovy. The tokenize method does this cleaning automatically for us.
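
To make the difference concrete, here is a minimal sketch that runs both approaches on a single line. The sample string is made up for illustration and is not taken from pg2264.txt; it starts with whitespace, which is the case where split yields an empty first element.

```groovy
// Hypothetical sample line with leading, repeated, and trailing whitespace
def line = '  When shall  we\tthree meet '

// tokenize() splits on whitespace and never returns empty tokens
def tokens = line.tokenize()

// split() keeps an empty first element because the line starts with a separator
def parts = line.split(/[\s\t]+/) as List

// findAll drops the empty string, since empty strings evaluate to false in Groovy
def cleaned = parts.findAll { it.trim() }

println tokens
println parts
println cleaned
```

After filtering, the result of split contains the same words as the result of tokenize, which is why the two approaches produce the same word count.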

There's more...

Another way to split a text file into words is to use the splitEachLine method, which Groovy adds to java.io.File. Like String's split method, it takes a regular expression as input, together with a closure to which it passes the collection of strings obtained by splitting each line. With the help of this method, our original code snippet can be rewritten in the following way:

int wordCount = 0
file.splitEachLine(/[\s\t]+/) { Collection words ->
  words.findAll{ it.trim() }.each { String word ->
    wordCount++
    println word
  }
}
println "Number of words: $wordCount"

As with the split method, we need to filter out the empty words to get the same word count.
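
The following sketch checks that claim on a small scale. It assumes a throwaway temporary file with made-up content (not pg2264.txt) and compares the count from eachLine with tokenize against the filtered and unfiltered counts from splitEachLine. The variable names are illustrative only.

```groovy
// Hypothetical sample text written to a temporary file so the sketch is self-contained
def sample = File.createTempFile('words', '.txt')
sample.deleteOnExit()
sample.text = '''  When shall we three meet again?

In thunder, lightning,\tor in rain?
'''

// Approach 1: read line by line and tokenize each line
int tokenizeCount = 0
sample.eachLine { String line ->
  tokenizeCount += line.tokenize().size()
}

// Approach 2: let splitEachLine do the splitting, then filter out empty words
int splitCount = 0
int unfilteredCount = 0
sample.splitEachLine(/[\s\t]+/) { Collection words ->
  unfilteredCount += words.size()
  splitCount += words.findAll { it.trim() }.size()
}

println "tokenize: $tokenizeCount"
println "splitEachLine (filtered): $splitCount"
println "splitEachLine (unfiltered): $unfilteredCount"
```

The unfiltered count is higher because the indented first line and the blank line each contribute an empty string to the split result.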
