Calculating relative values

One way to normalize values is to scale frequencies by the sizes of their groups. For example, say the word truth appears three times in a document. This means one thing if the document has thirty words. It means something else if the document has 300 or 3,000 words. Moreover, if the dataset has documents of all these lengths, how do you compare the frequencies for words across documents?

One way to do this is to rescale the frequency counts. In some cases, we can simply scale the term frequencies by the lengths of their documents. If we want better results, we might use something more sophisticated, such as term frequency-inverse document frequency (TF-IDF).
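To give a sense of the TF-IDF alternative just mentioned, here is a minimal sketch (not this recipe's method). The `tf-idf` function name and the shape of `docs` — a map of document id to a map of word to raw count — are assumptions for illustration:

```clojure
;; A minimal TF-IDF sketch (an illustration, not this recipe's method).
;; `docs` is a hypothetical map of document id -> {word -> raw count}.
(defn tf-idf [docs]
  (let [n-docs (count docs)
        ;; In how many documents does each word appear?
        doc-freq (frequencies (mapcat keys (vals docs)))]
    (into {}
          (for [[doc-id counts] docs]
            (let [total (reduce + (vals counts))]
              [doc-id
               (into {}
                     (for [[word n] counts]
                       ;; tf = n / total, idf = log(N / df)
                       [word (* (/ n total)
                                (Math/log (/ n-docs (doc-freq word))))]))])))))
```

A word that appears in every document gets an IDF of zero, so its weight drops out no matter how frequent it is in any one document.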

For this recipe, we'll rescale some term frequencies by the total word count for their document.

Getting ready

We don't need much for this recipe. We'll use the minimal project.clj file, which is listed here:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])

However, it will be easier if we have a 'pretty-printer' available in the REPL:

(require '[clojure.pprint :as pp])

How to do it…

Actually, let's frame this problem more abstractly. If each datum is a map, we can rescale one key (:frequency) by the total of that key's values within the group defined by another key (:document). This is a more general approach and should be useful in more situations.

  1. Let's define a function that rescales by a key's total in a collection. It assigns the scaled value to a new key (dest):
    (defn rescale-by-total [src dest coll]
      (let [total (reduce + (map src coll))
            ;; Attach each item's share of the total under the dest key.
            scale #(assoc % dest (/ (% src) total))]
        (map scale coll)))
  2. Now, let's use this function in order to define a function to rescale by a group:
    (defn rescale-by-group [src group dest coll]
      (->> coll
           (sort-by group)
           (group-by group)
           vals
           (mapcat #(rescale-by-total src dest %))))
  3. We can easily make up some data to test this:
    (def word-counts
      [{:word 'the, :freq 92, :doc 'a}
       {:word 'a, :freq 76, :doc 'a}
       {:word 'jack, :freq 4, :doc 'a}
       {:word 'the, :freq 3, :doc 'b}
       {:word 'a, :freq 2, :doc 'b}
       {:word 'mary, :freq 1, :doc 'b}])

Now, we can see how it works:

user=> (pp/pprint (rescale-by-group :freq :doc :scaled 
                                    word-counts))
({:freq 92, :word the, :scaled 23/43, :doc a}
 {:freq 76, :word a, :scaled 19/43, :doc a}
 {:freq 4, :word jack, :scaled 1/43, :doc a}
 {:freq 3, :word the, :scaled 1/2, :doc b}
 {:freq 2, :word a, :scaled 1/3, :doc b}
 {:freq 1, :word mary, :scaled 1/6, :doc b})

We can immediately see that the scaled values are more easily comparable. The scaled frequencies for the, for example, are approximately in line with each other in the way that the raw frequencies just aren't (0.53 and 0.5 versus 92 and 3). Of course, since this isn't a real dataset, the frequencies are meaningless, but this still illustrates the method and how it improves the dataset.
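We can also verify one of these values by hand. Document a's frequencies total 92 + 76 + 4 = 172, so the's scaled value is 92/172, which Clojure's rational arithmetic reduces for us:

```clojure
;; Document a's total frequency is 92 + 76 + 4 = 172, so the scaled
;; value for 'the is 92/172, which reduces to 23/43.
(/ 92 (+ 92 76 4))
;; => 23/43
```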

How it works…

For each function, we pass in a couple of keys: a source key and a destination key. The first function, rescale-by-total, totals the values for the source key, and then sets the destination key to the ratio of that item's source value to the total across the whole collection.

The second function, rescale-by-group, adds another key: the group key. It sorts and groups the items by the group key and then passes each group to rescale-by-total, concatenating the results back into a single sequence with mapcat.
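To see the grouping stage on its own, here is the same sort-by/group-by/vals pipeline on a tiny, made-up dataset (the keys :g and :v are hypothetical stand-ins for :doc and :freq):

```clojure
;; Grouping a small, made-up dataset by :g. Sorting first keeps the
;; groups in key order when we take vals.
(def points [{:g :b, :v 1} {:g :a, :v 2} {:g :b, :v 3}])

(vals (group-by :g (sort-by :g points)))
;; => ([{:g :a, :v 2}] [{:g :b, :v 1} {:g :b, :v 3}])
```

Each group is then an ordinary collection, which is exactly what rescale-by-total expects.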
