Focusing on content words with stoplists

A stoplist, or list of stopwords, is a list of words that should not be included in further analysis. Usually, this is because they're so common that they don't add much information to the analysis.

These lists are usually dominated by what are known as function words: words that have a grammatical purpose in the sentence but which carry little meaning by themselves. For example, the indicates that the noun that follows it is definite, but it doesn't have much meaning of its own. Other words, such as the preposition after, do have a meaning, but they're so common that they tend to get in the way.

On the other hand, chair has a meaning beyond what it's doing in the sentence, and in fact, its role in the sentence will vary (subject, direct object, and so on).

You won't always want to use a stoplist, since it throws away information. However, because function words are more frequent than content words, focusing on the content words can sometimes add clarity to your analysis and its output. It can also speed up processing.

Getting ready

This recipe will build on the work that we've done so far in this chapter. As such, it will use the same project.clj file that we used in the Tokenizing text and Finding sentences recipes:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

However, we'll use a slightly different set of requirements for this recipe:

(require '[opennlp.nlp :as nlp]
         '[clojure.java.io :as io])

We'll also need to have a list of stopwords. You can easily create your own list, but for the purpose of this recipe, we'll use the English stopword list included with the Natural Language Toolkit (http://www.nltk.org/). You can download this from http://nltk.github.com/nltk_data/packages/corpora/stopwords.zip. Unzip it into your project directory and make sure that the stopwords/english file exists.
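
If you want a quick sanity check that the list landed where this recipe expects it, a couple of lines at the REPL will do. This is just a convenience sketch, assuming the stopwords/english path mentioned previously; the first few entries should be lowercase function words:

(require '[clojure.java.io :as io])

;; Peek at the first few entries of the stopword file.
(with-open [r (io/reader "stopwords/english")]
  (doall (take 5 (line-seq r))))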

We'll also use the tokenize and get-sentences functions that we created in the previous two recipes.
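
If you're jumping straight into this recipe, those functions were created from pretrained OpenNLP models, along these lines (the model file paths below assume the models directory used in those recipes; adjust them to wherever you saved your models):

;; Both wrap pretrained OpenNLP models in callable Clojure functions.
(def get-sentences (nlp/make-sentence-detector "models/en-sent.bin"))
(def tokenize (nlp/make-tokenizer "models/en-token.bin"))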

How to do it…

We'll need to create a function to process and normalize the tokens, as well as a utility function to load the stopword list. Once these are in place, we'll see how to use the stopwords. To do this, perform the following steps:

  1. The words in the stopword list have been lowercased. We can also do this with the tokens that we create. We'll use the normalize function to handle the lowercasing of each token:
    (defn normalize [token-seq]
      ;; The ^String type hint avoids reflection on each toLowerCase call.
      (map #(.toLowerCase ^String %) token-seq))
  2. The stoplist will actually be represented by a Clojure set. This will make filtering a lot easier, because a Clojure set can be called as a function that tests membership (there's a short demonstration after these steps). The load-stopwords function will read in the file, break it into lines, and fold them into a set, as follows:
    (defn load-stopwords [filename]
      (with-open [r (io/reader filename)]
        ;; doall realizes every line before the reader is closed.
        (set (doall (line-seq r)))))
    (def is-stopword (load-stopwords "stopwords/english"))
  3. Finally, we can create the sequence of tokens. This will break the input into sentences, then tokenize each sentence, normalize its tokens, and remove its stopwords, as follows:
    (def tokens
      (map #(remove is-stopword (normalize (tokenize %)))
           (get-sentences
             "I never saw a Purple Cow.
             I never hope to see one.
             But I can tell you, anyhow.
             I'd rather see than be one.")))
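
As an aside on step 2, representing the stoplist as a set is what lets us pass is-stopword directly to remove. When called as a function, a Clojure set looks up its argument, returning the element itself (which is truthy) when it's a member and nil when it isn't. You can check this at the REPL once the list is loaded; for example, the NLTK list contains the but not cow:

user=> (is-stopword "the")
"the"
user=> (is-stopword "cow")
nil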

Now, you can see that the tokens returned are more focused on the content and are missing all of the function words:

user=> (pprint tokens)
(("never" "saw" "purple" "cow" ".")
 ("never" "hope" "see" "one" ".")
 ("tell" "," "anyhow" ".")
 ("'d" "rather" "see" "one" "."))