Scaling document frequencies by document size

While raw token frequencies can be useful, they have one major problem: frequencies from documents of different sizes aren't directly comparable. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which document is more focused on that word? It's difficult to say from the raw counts alone.

To work around this, it's common to scale the token frequencies for each document by the size of that document. That's what we'll do in this recipe.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We'll use the token frequencies that we computed in the Getting document frequencies recipe, and we'll keep them bound to the name token-freqs.
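
If you need to recreate that binding, token-freqs can be rebuilt with the tokenize and normalize functions defined in the earlier recipes of this chapter. Here's a minimal sketch, assuming those functions are already loaded and using a placeholder input string:

(def token-freqs
  (-> "Some document text goes here."   ; placeholder input
      tokenize
      normalize
      frequencies))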

How to do it…

The function used to perform this scaling is fairly simple. It calculates the total number of tokens by adding the values from the frequency hash map, and then it walks over the hash map again, dividing each frequency by that total, as shown here:

(defn scale-by-total [freqs]
  ;; Sum all of the token counts to get the document's total size.
  (let [total (reduce + 0 (vals freqs))]
    (->> freqs
         ;; Divide each token's count by that total.
         (map #(vector (first %) (/ (second %) total)))
         (into {}))))

We can now use this on token-freqs from the last recipe:

user=> (pprint (scale-by-total token-freqs))
{"see" 2/19,
 "purple" 1/19,
 "tell" 1/19,
 "cow" 1/19,
 "anyhow" 1/19,
 "hope" 1/19,
 "never" 2/19,
 "saw" 1/19,
 "'d" 1/19,
 "." 4/19,
 "one" 2/19,
 "," 1/19,
 "rather" 1/19}

Now, we can easily compare these values to the frequencies generated from other documents.
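
As a quick sanity check, the scaled values for any document should sum to 1, since each value is that token's share of the total:

user=> (float (reduce + (vals (scale-by-total token-freqs))))
1.0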

How it works…

This works by changing all of the raw frequencies into ratios based on each document's size.

These numbers are comparable. In our example from the introduction to this recipe, 0.046 (23/500) is slightly more than 0.040 (40/1000). However, both of these numbers are ridiculously high. The words that typically occur this often in English are function words such as the.
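
We can verify that arithmetic directly at the REPL:

user=> (float (/ 23 500))
0.046
user=> (float (/ 40 1000))
0.04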

Document-scaled frequencies do have problems with shorter texts. For example, take this tweet by the Twitter user @LegoAcademics:

"Dr Brown's random number algorithm is based on the baffling floor sequences chosen by the Uni library elevator".

In this tweet, let's see what the scaled frequency of random is:

(-> (str "Dr Brown's random number algorithm is based "
         "on the baffling floor seqeuences chosen by "
         "the Uni library elevator.")
    tokenize
    normalize
    frequencies
    scale-by-total
    (get "random")
    float)

This gives us 0.05. Again, this is ridiculously high. Most other tweets won't include the term random at all. Because of this, scaled frequencies from tweets are still only directly comparable with those from other tweets or similarly short documents.
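
To make that concrete, here's a sketch of comparing one term's scaled frequency across two documents of similar length. The scaled-freq helper and the sample strings are hypothetical, and the sketch assumes the tokenize and normalize functions from the earlier recipes:

;; A hypothetical helper: the scaled frequency of one token in a raw
;; document string, or 0.0 if the token doesn't appear.
(defn scaled-freq [text token]
  (-> text
      tokenize
      normalize
      frequencies
      scale-by-total
      (get token 0)
      float))

;; Two hypothetical tweet-length documents.
(def tweet-a "The random seed made the test suite fail again.")
(def tweet-b "Nothing random about that elevator, it just skips floors.")

;; These two values are comparable because the documents are about the
;; same length; comparing either against a long article would not be.
(scaled-freq tweet-a "random")
(scaled-freq tweet-b "random")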
