Processing every word in a text file

Sometimes you need to perform a word-based analysis of a text file, for example, for spell checking or for gathering statistics. This recipe shows how to read a file word by word in Groovy.

Getting ready

For this recipe, you can create a new Groovy script file and download a large text file for testing purposes. The Project Gutenberg website has thousands of text files that can be used for text analysis, for example, William Shakespeare's Macbeth, available at http://www.gutenberg.net/cache/epub/2264/pg2264.txt.

How to do it...

We assume that the pg2264.txt file containing Shakespeare's masterpiece Macbeth has been downloaded, but any large text file will do for this example.

Add the following code to the Groovy script:

def file = new File('pg2264.txt') // Macbeth
int wordCount = 0
file.eachLine { String line ->
  line.tokenize().each { String word ->
    wordCount++
    println word
  }
}
println "Number of words: $wordCount"

After the execution, the script should terminate with the following output:
```
...
FINIS.
THE
TRAGEDIE
OF
MACBETH.
Number of words: 20366
```
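
The same loop can collect simple statistics instead of printing each word. The following is a minimal sketch rather than part of the recipe itself: the two sample lines stand in for the contents of a file, and the names `lines`, `frequencies`, and `normalized` are hypothetical. It assumes that lowercasing a word and stripping surrounding punctuation is enough to treat two spellings as the same word.

```groovy
// Hypothetical sample lines standing in for the contents of a text file.
def lines = ['Double, double toil and trouble;', 'Fire burn and cauldron bubble.']

// Map from normalized word to number of occurrences; missing keys default to 0.
def frequencies = [:].withDefault { 0 }

lines.each { String line ->
  line.tokenize().each { String word ->
    // Assumption: lowercase the word and strip leading/trailing punctuation
    // so that "Double," and "double" are counted as the same word.
    def normalized = word.toLowerCase().replaceAll('^\\p{Punct}+|\\p{Punct}+$', '')
    if (normalized) {
      frequencies[normalized]++
    }
  }
}

println frequencies
```

For a real file, the outer `lines.each` call could be replaced by `file.eachLine`, as in the script above.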

How it works...

The snippet in the previous section prints every word in the file on a separate line and then outputs the total number of words. The simplest way to pick all the words from a file is to read it line by line with the eachLine method (described in the Reading a text file line by line recipe) and then split each line into words. The java.lang.String class already provides a split method that takes a regular expression as the word separator. The Groovy JDK also adds a tokenize method, which splits a given string into a list of strings using whitespace as the separator ([\s\t]+).

The tokenize method is equivalent to the following split method call:

line.split(/[\s\t]+/).findAll { it.trim() }.each { String word ->
  ...
}

Note that we filtered the empty words from the result of the split method using the findAll method available in all collections in Groovy. The tokenize method does this cleaning automatically for us.
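
To see why the filtering step matters, the two approaches can be compared on a single string. This is a small sketch with a hypothetical sample line that has leading, trailing, and repeated whitespace. It uses the shorter pattern `\s+`, since `\s` already matches a tab.

```groovy
// Hypothetical sample line with leading, trailing, and repeated whitespace.
def line = '  Double,  double\ttoil and trouble '

// split keeps an empty string for the leading whitespace: 6 elements.
def viaSplit = line.split(/\s+/).toList()

// tokenize returns only the five words.
def viaTokenize = line.tokenize()

println "split:    ${viaSplit.size()} elements"
println "tokenize: ${viaTokenize.size()} elements"

// After dropping empty strings, both approaches agree.
assert viaSplit.findAll { it.trim() } == viaTokenize
```

The empty string appears only for leading whitespace, because split discards trailing empty strings on its own.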

There's more...

Another way to split words in a text file is to use the splitEachLine method, which the Groovy JDK adds to java.io.File. Like String's split method, it takes a regular expression as input, together with a closure to which it passes the list of strings obtained by splitting each line. With this method, our original code snippet can be rewritten as follows:

int wordCount = 0
file.splitEachLine(/[\s\t]+/) { Collection words ->
  words.findAll{ it.trim() }.each { String word ->
    wordCount++
    println word
  }
}
println "Number of words: $wordCount"

As with the split method, we need to filter out the empty words to get the same word count.
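
The effect of the filter can be checked on a small file. The sketch below writes a hypothetical two-line sample to a temporary file (the second line is indented on purpose) and counts the words three ways. The variable names are illustrative only.

```groovy
// Hypothetical sample text written to a temporary file.
def sample = File.createTempFile('words', '.txt')
sample.deleteOnExit()
sample.text = 'When shall we three meet again?\n  In thunder, lightning, or in rain?\n'

// Reference count using eachLine and tokenize.
int tokenizeCount = 0
sample.eachLine { String line ->
  tokenizeCount += line.tokenize().size()
}

// Counts using splitEachLine, with and without filtering empty words.
int unfilteredCount = 0
int filteredCount = 0
sample.splitEachLine(/\s+/) { List words ->
  unfilteredCount += words.size()
  filteredCount += words.findAll { it.trim() }.size()
}

println "tokenize: $tokenizeCount, split: $unfilteredCount, split + filter: $filteredCount"
```

The unfiltered count is one higher here, because the indented line produces an empty first element.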
