Getting document frequencies

One common and useful metric when working with text corpora is the frequency of each token in the documents. Computing this is quite easy with standard Clojure functions.

Let's see how.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We'll also use tokenize, get-sentences, normalize, load-stopwords, and is-stopword from the earlier recipes.
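In case you're not continuing directly from those recipes, here is a minimal sketch of those helpers. It assumes the OpenNLP model files live in a models directory and that the stopword filename is a placeholder for wherever you keep your stoplist:

(require '[clojure.java.io :as io]
         '[clojure.string :as str]
         '[opennlp.nlp :refer [make-sentence-detector make-tokenizer]])

;; OpenNLP-backed sentence splitter and tokenizer. The model file
;; paths are assumptions; point them at your OpenNLP models.
(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))

;; Normalization here is simply lowercasing every token.
(defn normalize [token-seq]
  (map str/lower-case token-seq))

;; Read a stopword file (one word per line) into a set. The
;; filename passed below is a placeholder.
(defn load-stopwords [filename]
  (with-open [r (io/reader filename)]
    (set (doall (line-seq r)))))

;; Sets are functions of their members, so the stopword set
;; doubles as a predicate.
(def is-stopword (load-stopwords "stopwords/english"))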

Finally, we'll reuse the tokens value that we defined in the Focusing on content words with stoplists recipe. Here it is again:

(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
         I never hope to see one.
         But I can tell you, anyhow.
         I'd rather see than be one.")))

How to do it…

Of course, the standard Clojure function for counting the items in a sequence is frequencies. We can use it to get the token counts for each sentence and then fold those per-sentence counts into a single frequency table using merge-with:

(def token-freqs
  (apply merge-with + (map frequencies tokens)))
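As an aside, if you don't need the intermediate per-sentence counts, flattening the token sequences and counting once produces the same table. This variant is a sketch, not part of the original recipe:

(def token-freqs
  (frequencies (apply concat tokens)))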

We can print or query this table to get the count for any token or piece of punctuation, as follows:

user=> (pprint token-freqs)
{"see" 2,
 "purple" 1,
 "tell" 1,
 "cow" 1,
 "anyhow" 1,
 "hope" 1,
 "never" 2,
 "saw" 1,
 "'d" 1,
 "." 4,
 "one" 2,
 "," 1,
 "rather" 1}