Preparing the data

For this experiment, I've randomly selected 500 hotel reviews and classified them manually. A better option might be to use Amazon's Mechanical Turk (https://www.mturk.com/mturk/) to get more reviews classified than any one person could easily manage alone. Really, a few hundred is about the minimum we'd want, as both the training and test sets need to come from this sample. I made sure that the sample contained an equal number of positive and negative reviews. (You can find the sample in the data directory of the code download.)

The data files are tab-separated values (TSV). After manual classification, each line has four fields: the classification as a + or - sign, the date of the review, the title of the review, and the review itself. Some of the reviews are quite long.

After tagging the files, we'll take them and create feature vectors from the vocabulary of the title and review of each one. For this chapter, we'll see what works best: unigrams (single tokens), bigrams, trigrams, or part-of-speech annotated unigrams. These features comprise several common ways to extract features from the text:

  • Unigrams are single tokens, for example, "features" from the preceding sentence
  • Bigrams are two tokens next to each other, for example, "features comprise"
  • Trigrams are three tokens next to each other, for example, "features comprise several"
  • Part-of-speech annotated unigrams would look something like features_N, which just means that the unigram "features" is a noun

We'll also use these features to train a variety of classifiers on the reviews. Just like Bo Pang and Lillian Lee did, we'll run experiments with naive Bayes and maximum entropy classifiers. To compare how well each of these does, we'll use cross-validation to train and test our classifiers multiple times.

Tokenizing

Before we get started on the code for this chapter, note that the Leiningen 2 project.clj file looks like the following code:

(defproject sentiment "0.1.0-SNAPSHOT"
  :plugins [[lein-cljsbuild "0.3.2"]]
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.csv "0.1.2"]
                 [org.clojure/data.json "0.2.3"]
                 [org.apache.opennlp/opennlp-tools "1.5.3"]
                 [nz.ac.waikato.cms.weka/weka-dev "3.7.7"]]
  :jvm-opts ["-Xmx4096m"])

First, let's create some functions to handle tokenization. Under the covers, we'll use methods from the OpenNLP library (http://opennlp.apache.org/) to process the text and methods from the Weka machine learning library (http://www.cs.waikato.ac.nz/ml/weka/) to perform the sentiment analysis. However, we'll wrap these to provide a more natural, Clojure-like interface.

Let's start in the src/sentiment/tokens.clj file, which will begin in the following way:

(ns sentiment.tokens
  (:require [clojure.string :as str]
            [clojure.java.io :as io])
  (:import [opennlp.tools.tokenize SimpleTokenizer]
           [opennlp.tools.postag POSModel POSTaggerME]))

Our tokenizer will use SimpleTokenizer from the OpenNLP library and normalize all characters to lowercase:

(defn tokenize [s]
  (map (memfn toLowerCase)
       (seq
         (.tokenize SimpleTokenizer/INSTANCE s))))

I've aliased the sentiment.tokens namespace to t in the REPL. This function is used to break an input string into a sequence of token substrings:

user=> (t/tokenize "How would this be TOKENIZED?")
("how" "would" "this" "be" "tokenized" "?")

Next, we'll take the token streams and create feature vectors from them.

Creating feature vectors

A feature vector is a vector that summarizes an observation or document. Each vector contains the values associated with each variable or feature. The values may be boolean, indicating presence or absence with 1 or 0; they may be raw counts; or they may be proportions scaled by the size of the overall document. As much of machine learning is based on linear algebra, vectors and matrices are very convenient data structures.

In order to maintain consistent indexes for each feature, we have to maintain a mapping from features to indexes. Whenever we encounter a new feature, we need to assign it a new index.

For example, the following table traces the steps to create a feature vector based on token frequencies from the phrase "the cat in the hat":

Step   Feature   Index   Feature Vector
1      the       0       [1]
2      cat       1       [1, 1]
3      in        2       [1, 1, 1]
4      the       0       [2, 1, 1]
5      hat       3       [2, 1, 1, 1]

So, the final feature vector for the cat in the hat would be [2, 1, 1, 1]. In this case, we're counting the features. In other applications, we might use a bag-of-words approach that only tests the presence of the features. In that case, the feature vector would be [1, 1, 1, 1].
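Converting between these two representations is easy to do at the REPL. Neither of the following snippets is a function from this chapter; they're just quick illustrations of the presence-only and proportion forms of the same count vector:

user=> (mapv #(min 1 %) [2 1 1 1])
[1 1 1 1]
user=> (let [total (reduce + [2 1 1 1])]
         (mapv #(/ % total) [2 1 1 1]))
[2/5 1/5 1/5 1/5]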

We'll include the code to do this in the sentiment.tokens namespace. First, we'll create a function that increments the value of a feature in the feature vector. It looks up the index of the feature in the vector from the feature index (f-index). If the feature hasn't been seen yet, this function also allocates an index for it:

(defn inc-feature [f-index f-vec feature]
  (if-let [i (f-index feature)]
    [f-index (assoc f-vec i (inc (nth f-vec i)))]
    (let [i (count f-index)]
      [(assoc f-index feature i) (assoc f-vec i 1)])))
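
We can check this function at the REPL (using the t alias for sentiment.tokens again). These calls are just illustrations: starting from an index and a vector that only know about "the", a new feature gets the next index, while a repeated feature just increments its count:

user=> (t/inc-feature {"the" 0} [1] "cat")
[{"the" 0, "cat" 1} [1 1]]
user=> (t/inc-feature {"the" 0, "cat" 1} [1 1] "the")
[{"the" 0, "cat" 1} [2 1]]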

We can use this function to convert a feature sequence into a feature vector. This function initially creates a vector of zeroes for the feature sequence, and then it reduces over the features, updating the feature index and vector as necessary:

(defn ->feature-vec [f-index features]
  (reduce #(inc-feature (first %1) (second %1) %2)
          [f-index (vec (repeat (count f-index) 0))]
          features))
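
As a quick sanity check, we can run this function over the tokens of the phrase from the table. This REPL snippet is only for illustration, but it reproduces the index and the final count vector we traced above:

user=> (t/->feature-vec {} (t/tokenize "the cat in the hat"))
[{"the" 0, "cat" 1, "in" 2, "hat" 3} [2 1 1 1]]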

Finally, for this task, we have several functions that we'll look at together. The first function, accum-features, builds up the index and the list of feature vectors. Each time it's called, it takes the sequence of features passed to it and creates a feature vector. It appends this to the collection of feature vectors that is also passed in. The next function, pad-to, makes sure that each feature vector has the same number of elements as the feature index. This makes it slightly easier to work with the feature vectors later on. The final function, ->features, takes a sequence of feature sequences and returns the feature index and the feature vectors for the data:

(defn accum-features [state features]
  (let [[index accum] state
        [new-index feature] (->feature-vec index features)]
    [new-index (conj accum feature)]))

(defn pad-to [f-index f-vec]
  (vec (take (count f-index) (concat f-vec (repeat 0)))))

(defn ->features [feature-seq]
  (let [[f-index f-vecs]
        (reduce accum-features [{} []] feature-seq)]
    [f-index (map #(pad-to f-index %) f-vecs)]))

We can use these functions to build up a matrix of feature vectors from a set of input sentences. Let's see how this works in the first few sentences of an Emily Dickinson poem:

user=> (def f-out
         (t/->features
           (map set
             (map t/tokenize ["I'm nobody."
                              "Who are you?"
                              "Are you nobody too?"]))))
#'user/f-out
user=> (first f-out)
{"nobody" 0, "'" 1, "i" 2, "m" 3, "." 4, "too" 9, "are" 5,
 "who" 6, "you" 7, "?" 8}
user=> (print (second f-out))
([1 1 1 1 1 0 0 0 0 0] [0 0 0 0 0 1 1 1 1 0]
 [1 0 0 0 0 1 0 1 1 1])

Notice that after tokenizing each document, we created a set of the tokens. This changes the system to use a bag-of-words approach: we're only looking at the presence or absence of a feature, not its frequency. This does put the tokens out of order ("nobody" was evidently the first token indexed), but that doesn't matter.

Now, by inverting the feature index, we can look up the words in a document from the features that it contains. This allows us to recreate a frequency map for each document as well as to recreate the tokens in each document. In this case, we'll look up the words from the first feature vector, I'm nobody:

user=> (def index (map first (sort-by second (first f-out))))
#'user/index
user=> index
("nobody" "'" "i" "m" "." "are" "who" "you" "?" "too")
user=> (->> f-out
         second
         first
         (map-indexed vector)
         (remove #(zero? (second %)))
         (map first)
         (map #(nth index %)))
("nobody" "'" "i" "m" ".")

This block of code gets the indexes for each position in the feature vector, removes the features that didn't occur, and then looks up the index in the inverted feature index. This provides us with the sequence of features that occurred in that document. Notice that they're out of order. This is to be expected because neither the input sequence of features (in this case a set) nor the feature vector itself preserves the order of the features.
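
As a side note, if we'd rather have an actual inverted map than a sorted list, clojure.set/map-invert will build an index-to-feature map directly. This is only an alternative way to do the same lookup, not something the rest of the chapter relies on:

user=> (require '[clojure.set :as set])
nil
user=> (def inverted (set/map-invert (first f-out)))
#'user/inverted
user=> (map inverted [0 1 2 3 4])
("nobody" "'" "i" "m" ".")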

Creating feature vector functions and POS tagging

We'll also include some functions to turn a list of tokens into a list of features. By wrapping these into functions, we make it easier to compose pipelines of processing functions and experiment with different feature sets.

The simplest and probably the most common type of feature is the unigram, or a single token. As the tokenize function already outputs single tokens, the unigram function is very simple to implement:

(def unigrams identity)

Another way to construct features is to use a number of consecutive tokens. In the abstract, these are called n-grams. Bigrams (two tokens) and trigrams (three tokens) are common instances of this type of function. We'll define all of these as functions:

(defn n-grams [n coll]
  (map #(str/join " " %) (partition n 1 coll)))
(defn bigrams [coll] (n-grams 2 coll))
(defn trigrams [coll] (n-grams 3 coll))
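
Because each of these is just a function from a token sequence to a feature sequence, we can compose them with tokenize to build small processing pipelines. The following is a minimal sketch of that idea, assuming we're still in the sentiment.tokens namespace; feature-pipeline is a hypothetical helper for experimenting, not something the rest of the chapter depends on:

(defn feature-pipeline
  "Compose tokenization, a feature function (unigrams, bigrams,
  trigrams, and so on), and the bag-of-words set step."
  [feature-fn]
  (comp set feature-fn tokenize))

;; ((feature-pipeline bigrams) "the cat in the hat")
;; => #{"the cat" "cat in" "in the" "the hat"} (order may vary)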

There are a number of different features we could create and experiment with, but we won't show them all here. However, before we move on, here's one more common type of feature: the token tagged with its part of speech (POS). POS is the category for words, which determines their range of uses in sentences. You probably remember these from elementary school. Nouns are people, places, and things. Verbs are actions.

To get this information, we'll use OpenNLP's trained POS tagger. This takes a word and associates it with a part of speech. In order to use it, we need to download the training model file. You can find it at http://opennlp.sourceforge.net/models-1.5/. Download the English POS tagger, the one described as "Maxent model with tag dictionary." The file itself is named en-pos-maxent.bin, and I put it into the data directory of my project.

This tagger uses the POS tags defined by the Penn Treebank (http://www.cis.upenn.edu/~treebank/). It uses a trained, probabilistic tagger to associate tags with each token from a sentence. For example, it might associate the token things with the NNS tag, which is the abbreviation for plural nouns. We'll create the string for this feature by putting these two together so that this feature would look like things_NNS.

Once we have the data file, we need to load it into a POS model. We'll write a function to do this and return the tagger object:

(defn read-me-tagger [filename]
  (->> filename
       io/input-stream
       POSModel.
       POSTaggerME.))

Using the tagger is pretty easy. We just call its tag method as follows:

(defn with-pos [model coll]
  (map #(str/join "_" [%1 %2])
       coll
       (.tag model (into-array coll))))

Now that we have these functions ready, let's take a short sentence and generate the features for it. For this set of examples, we'll use the clauses "Time flies like an arrow; fruit flies like a banana." To begin with, we'll define the input data and load the POS tagger.

user=> (def data
         "Time flies like an arrow; fruit flies like a banana.")
user=> (def tagger (t/read-me-tagger "data/en-pos-maxent.bin"))
user=> (def tokens (t/tokenize data))
user=> (t/unigrams tokens)
("time" "flies" "like" "an" "arrow" ";" "fruit" "flies" "like" "a"
 "banana" ".")
user=> (t/bigrams tokens)
("time flies" "flies like" "like an" "an arrow" "arrow ;"
 "; fruit" "fruit flies" "flies like" "like a" "a banana"
 "banana .")
user=> (t/trigrams tokens)
("time flies like" "flies like an" "like an arrow" "an arrow ;"
 "arrow ; fruit" "; fruit flies" "fruit flies like" "flies like a"
 "like a banana" "a banana .")
user=> (t/with-pos tagger tokens)
("time_NN" "flies_VBZ" "like_IN" "an_DT" "arrow_NN" ";_:"
 "fruit_NN" "flies_NNS" "like_IN" "a_DT" "banana_NN" "._.")

In the last output, the words are associated with part-of-speech tags. This output uses the tags from the Penn Treebank (http://www.cis.upenn.edu/~treebank/). You can look at it for more information, but briefly, here are the tags used in the preceding code snippet:

  • NN means a noun
  • VBZ means a present tense verb, third person singular
  • IN means a preposition or subordinating conjunction
  • DT means a determiner

So we can see that the POS-tagged features provide the most information about the individual tokens; however, the n-grams (bigrams and trigrams) provide more information about the context around each word. Later on, we'll see which one gets better results.

Now that we have the preprocessing out of the way, let's turn our attention to the documents and how we want to structure the rest of the experiment.
