Mapping documents to a sparse vector space representation

Many text algorithms work with vector space representations of documents. This means that each document is normalized into a vector, and each distinct token type is assigned a single position shared across all of the documents' vectors. For instance, the word text might be assigned position 42, so index 42 in every document's vector will hold the frequency (or some other value) of text.

However, most documents contain only a small fraction of the total vocabulary, so most positions in their vectors are zero. This makes them sparse vectors, and we can use more efficient formats to store them.
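
To make this mapping concrete, here is a toy sketch in plain Clojure, before we bring in any sparse storage. The three-word vocabulary, the doc->dense helper, and the sample document are made up for illustration:

(def toy-index {"text" 0, "data" 1, "code" 2})

;; Build a dense vector: start with all zeros, then write each
;; token's frequency into the position the index assigns it.
(defn doc->dense [index freqs]
  (reduce (fn [v [token n]] (assoc v (index token) n))
          (vec (repeat (count index) 0.0))
          freqs))

(doc->dense toy-index {"text" 2.0, "code" 1.0})
;; => [2.0 0.0 1.0]

With a real vocabulary of tens of thousands of words, almost every position would hold 0.0, which is exactly why a sparse format pays off.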

The Colt library (http://acs.lbl.gov/ACSSoftware/colt/) contains implementations of sparse vectors. In this recipe, we'll see how to read a collection of documents into them.

Getting ready…

For this recipe, we'll need the following in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]
                 [colt/colt "1.2.0"]])

For our script or REPL, we'll need these libraries:

(require '[clojure.set :as set]
         '[opennlp.nlp :as nlp])
(import [cern.colt.matrix DoubleFactory2D])

From the previous recipes, we'll use several functions. From the Tokenizing text recipe, we'll use tokenize and normalize, and from the Scaling document frequencies with TF-IDF recipe, we'll use get-corpus-terms (referenced in the code below through the tfidf alias).

For the data, we'll again use the State of the Union (SOTU) addresses that we first saw in the Scaling document frequencies with TF-IDF recipe. You can download them from http://www.ericrochester.com/clj-data-analysis/data/sotu.tar.gz. I've unpacked the data from this file into the sotu directory.

How to do it…

In order to create vectors for all of the documents, we'll first need to create a token index that maps tokens to positions in the vectors. We'll then use that index to create a sequence of Colt vectors. Finally, we can load the SOTU addresses and generate sparse feature vectors for all of the documents, as follows:

  1. Before we can create the feature vectors, we need a token index so that the vector indexes will be consistent across all of the documents. The build-index function takes care of this:
    (defn build-index [corpus]
      ;; Pair each distinct term in the corpus with a position.
      (zipmap (tfidf/get-corpus-terms corpus)
              (range)))
  2. Now, we can use this index to convert a sequence of token-frequency pairs into a feature vector. All of the tokens must be in the index (a small usage sketch follows these steps):
    (defn ->matrix [index pairs]
      (let [matrix (.make DoubleFactory2D/sparse
                     1 (count index) 0.0)
            ;; Write each token's value into its column of the 1xN matrix.
            set-cell (fn [m [k v]]
                       (.set m 0 (index k) v)
                       m)]
        (reduce set-cell matrix pairs)))
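
Before loading the real data, here's a minimal REPL sketch of ->matrix in action. The tiny-corpus and tiny-index names are made up for illustration, and the index shown is just one ordering that build-index might produce:

(def tiny-corpus [{"the" 2.0, "state" 1.0}
                  {"the" 1.0, "union" 1.0}])

;; One possible index over this corpus's three distinct terms.
(def tiny-index {"the" 0, "state" 1, "union" 2})

(def tiny-vec (->matrix tiny-index (first tiny-corpus)))

(.cardinality tiny-vec)                 ;; => 2 non-zero cells
(.get tiny-vec 0 (tiny-index "state"))  ;; => 1.0

Only the two tokens that actually occur in the first document take up storage in the sparse matrix.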

With these in place, let's make use of them by loading the token frequencies for a corpus and then creating the index from it:

(def corpus
  (->> "sotu"
       (java.io.File.)
       (.list)                  ; the file names in the sotu directory
       (map #(str "sotu/" %))
       (map slurp)
       (map tokenize)
       (map normalize)
       (map frequencies)))      ; one token-frequency map per address
(def index (build-index corpus))

With the index, we can finally move the document frequencies into sparse vectors:

(def vecs (map #(->matrix index %) corpus))
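
To sanity-check the result, we can ask Colt about the shape and occupancy of these vectors. This is just an inspection sketch; the actual numbers will depend on the corpus:

(count vecs)                 ;; one 1-row matrix per address
(.columns (first vecs))      ;; the vocabulary size, i.e., (count index)
(.cardinality (first vecs))  ;; non-zero cells: distinct tokens in that address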