Using the Weka machine learning library

We're going to test a couple of machine learning algorithms that are commonly used for sentiment analysis. Some of them are implemented in the OpenNLP library; however, it doesn't implement the others. So instead, we'll use the Weka machine learning library (http://www.cs.waikato.ac.nz/ml/weka/). Weka doesn't include the tokenization or segmentation classes that a natural language processing application requires, but it does have a more complete palette of machine learning algorithms.

All of the classes in the Weka library also have a standard, consistent interface. These classes are really designed to be used from the command line, so each takes its options as an array of strings with a command-line-like syntax. For example, the array for a naive Bayesian classifier may have a flag to indicate that it should use the kernel density estimator rather than the normal distribution. This would be indicated by the -K flag being included in the option array. Other options may include a parameter that would follow the option in the array. For example, the logistic regression classifier can take a parameter to indicate the maximum number of iterations it should run. This would include the items -M and 1000 (say) in the options array.
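For instance, the option arrays for the two examples above could be built like this (a quick sketch; the flags are Weka's, but the var names here are just for illustration):

```clojure
;; Options for a naive Bayes classifier using kernel density estimation.
(def nb-options (into-array String ["-K"]))

;; Options for a logistic regression classifier capped at 1000 iterations.
;; Note that the parameter value is passed as a string, not a number.
(def logistic-options (into-array String ["-M" "1000"]))
```

Each of these arrays can then be passed to the classifier's setOptions method before training.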

The Clojure interface functions for these classes are very regular. In fact, they're almost boilerplate. Unfortunately, they're also a little redundant. Option names are repeated in the functions' parameter list, the default values for those parameters, and where the parameters are fed into the options array. It would be better to have one place for a specification of each option, its name, its flag, its semantics, and its default value.

This is a perfect application of Clojure's macro system. The data to create the functions can be transformed into the function definition, which is then compiled into the interface function.
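To give a flavor of the approach, here's a much-simplified sketch of such a macro. This is not the book's defanalysis; def-options-fn and its spec format are hypothetical, and it only generates the options-building part, not the call into Weka:

```clojure
(defmacro def-options-fn
  "Defines a function that builds a Weka-style options vector from
  keyword arguments. Each spec is a [flag keyword default] triple."
  [fname specs]
  `(defn ~fname [& {:as opts#}]
     (vec (mapcat (fn [[flag# kw# default#]]
                    (let [v# (get opts# kw# default#)]
                      (cond
                        (true? v#)                 [flag#]
                        (or (nil? v#) (false? v#)) []
                        :else                      [flag# (str v#)])))
                  '~specs))))

;; Each option's flag, name, and default live in exactly one place:
(def-options-fn logistic-options
  [["-D" :debugging false]
   ["-M" :max-iterations -1]])

(logistic-options :max-iterations 1000)
;; => ["-M" "1000"]
```

Calling (logistic-options) with no arguments falls back on the defaults, yielding ["-M" "-1"].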

The final product of this is the defanalysis macro, which takes the name of the function, the class, the method it's based on, and the options it accepts. We'll see several uses of it later in this chapter.

Unfortunately, at almost 40 lines, this system is a little long and disruptive to include here, however interesting it may be. You can find it in the src/sentiment/weka.clj file in the code download, and I have discussed it at somewhat greater length in Clojure Data Analysis Cookbook, Packt Publishing.

We do still need to convert the HotelReview records that we loaded earlier into a Weka Instances collection. We'll need to do this several times as we train and test the classifiers, and it will give us a somewhat shorter example of interacting with Weka.

To store a data matrix, Weka uses an Instances object. This implements a number of standard Java collection interfaces, and it holds objects that implement the Instance interface, such as DenseInstance or SparseInstance.

Instances also keeps track of which fields each item has in its collection of Attribute objects. To create these, we'll populate an ArrayList with all of the features that we accumulated in the feature index. We'll also create a feature for the ratings and add it to the list. We'll return both the full collection of attributes and the single attribute for the review's rating:

(defn instances-attributes [f-index]
  (let [attrs (->> f-index
                   (sort-by second)
                   (map #(Attribute. (first %)))
                   (ArrayList.))
        review (Attribute. "review-rating"
                           (ArrayList. ["+" "-"]))]
    (.add attrs review)
    [attrs review]))

(At this point, we're hardcoding the markers for the sentiments as a plus sign and a minus sign. However, these could easily be made into parameters for a more flexible system.)

Each hotel review can be converted separately. As most documents will only have a fraction of the full number of features, we'll use SparseInstance. Sparse vectors are more memory efficient when most of the values in the instance are zero. If a feature is nonzero in the feature vector, we'll set it on the Instance. Finally, we'll also set the rating attribute as follows:

(defn review->instance [attrs review]
  (let [i (SparseInstance. (.size attrs))]
    (doseq [[attr value] (map vector attrs (:feature-vec review))]
      (when-not (zero? value)
        (.setValue i attr (double value))))
    (.setValue i (last attrs) (:rating review))
    i))

With these, we can populate Instances with the data from the HotelReview records:

(defn ->instances
  ([f-index review-coll]
   (->instances f-index review-coll "hotel-reviews"))
  ([f-index review-coll name]
   (let [[attrs review] (instances-attributes f-index)
         instances (Instances. name attrs (count review-coll))]
     (doseq [review review-coll]
       (let [i (review->instance attrs review)]
         (.add instances i)))
     (.setClass instances review)
     instances)))

Now we can define some functions to sit between the cross-validation functions we defined earlier and the Weka interface functions.

Connecting Weka and cross-validation

The first of these functions will classify an instance and determine which rating symbol it is classified by (+ or -), given the distribution of probabilities for each category. This function is used to run the classifier on all data in an Instances object:

(defn run-instance [classifier instances instance]
  (let [dist (.distributionForInstance classifier instance)
        i (first (apply max-key second
                        (map vector (range) dist)))]
    (.. instances classAttribute (value i))))

(defn run-classifier [classifier instances]
  (map #(run-instance classifier instances %) instances))

The next function defines the cross-validation procedure for a group of HotelReview records. It takes a training function and returns a function that takes the feature index and a collection of HotelReview records and performs the cross validation. This will allow us to create wrapper functions for each type of classifier:

(defn run-k-fold [trainer]
  (fn [f-index coll]
    (let [do-train (fn [xs]
                     (let [is (w/->instances f-index xs)]
                       (trainer is)))
          do-test (fn [classifier xs]
                    (->> xs
                         (w/->instances f-index)
                         (w/filter-class-index)
                         (run-classifier classifier)
                         (xv/compute-error (map :rating xs))
                         (vector)))]
      (xv/k-fold do-train do-test concat 10 coll))))

When executed, this function will return a list of ten of whatever the do-test function returns. In this case, that means a list of ten precision and recall mappings. We can average the output of this to get a summary of each classifier's performance.
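The averaging step might look like this (a sketch; summarize-folds is hypothetical, and it assumes each fold's result is a map with :precision and :recall keys):

```clojure
;; Average the per-fold error maps into a single summary map.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn summarize-folds [folds]
  {:precision (mean (map :precision folds))
   :recall    (mean (map :recall folds))})

(summarize-folds [{:precision 0.8, :recall 0.6}
                  {:precision 0.9, :recall 0.7}])
;; => approximately {:precision 0.85, :recall 0.65}
```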

Now we can start actually defining and testing classifiers.

Understanding maximum entropy classifiers

Maximum entropy (maxent) classifiers are, in a sense, very conservative classifiers. They assume nothing about hidden variables and base their classifications strictly upon the evidence they've been trained on. They are consistent with the facts that they've seen, but otherwise they assume a completely uniform distribution. What does this mean?

Let's say that we have a set of reviews with positive or negative ratings, and we wish to predict the value of a rating when it's unavailable, given the tokens or other features in the review. The probability that a rating is positive would be p(+). Initially, before we see any actual evidence, we might intuit that this probability would be uniform across the possible ratings. So, before training, we might expect the probability function to return these values:

p(+) = ½
p(-) = ½

This is perfectly uniform but not very useful. We have to make observations from the data in order to train the classifier.

The process of training involves observing the features in each document along with its rating and determining the probability of any given feature being found in a document with a given rating. We'll denote this as p(x, y), the joint probability of feature x and rating y.
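As a concrete sketch of this counting step (joint-p and the toy data below are hypothetical, not part of the chapter's code), we can estimate p(x, y) as the fraction of observations in which a feature and a rating co-occur:

```clojure
;; Estimate the joint probability of a feature and a rating by counting
;; co-occurrences in a collection of [feature-set rating] observations.
(defn joint-p [observations feature rating]
  (/ (count (filter (fn [[features r]]
                      (and (contains? features feature) (= r rating)))
                    observations))
     (double (count observations))))

(joint-p [[#{"good"} "+"] [#{"bad"} "-"]
          [#{"good" "clean"} "+"] [#{"noisy"} "-"]]
         "good" "+")
;; => 0.5
```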

These features impose constraints on our model. As we gather more and more constraints, finding a distribution that is consistent with them while remaining uniform over the unconstrained probabilities becomes increasingly difficult.

Essentially, this is the maxent algorithm's job. It takes into account all of the constraints imposed by the probabilities found in the training data, but it maintains a uniform distribution over everything that's unconstrained. This makes for a consistent, robust algorithm that usually performs very well. Cross validation will also help us evaluate its performance.

Another benefit is that maxent doesn't make any assumptions about the relationships between different features. In a bit, we'll look at the naive Bayesian classifier, which does make an assumption about the relationships between features, and an often unrealistic one. Because maxent avoids that assumption, it can better match the data involved.

For this chapter, we'll use the maxent classifier found in the Weka class weka.classifiers.functions.Logistic (maxent is equivalent to logistic regression, which attempts to predict a binary categorical label from one or more features). We'll use the defanalysis macro to define a utility function that cross validates a logistic regression classifier as follows:

(w/defanalysis train-logistic Logistic buildClassifier
  [["-D" debugging false :flag-true]
   ["-R" ridge nil :not-nil]
   ["-M" max-iterations -1]])
(def k-fold-logistic (run-k-fold train-logistic))

Now let's define something similar for a naive Bayesian classifier.

Understanding naive Bayesian classifiers

A common, generally well-performing classifier is the naive Bayesian classifier. It's naive because it makes an assumption about the data and its features: it assumes that the features are independent of each other. That is, the probability of, say, good occurring in a document is not influenced at all by the probability of any other token or feature, such as, say, not. Unfortunately, language doesn't work this way, and there are dependencies throughout the features of any linguistic dataset.

Fortunately, even when the data and features are not completely independent, this classifier often still performs quite well in practice. For example, in An analysis of data characteristics that affect naive Bayes performance by Irina Rish, Joseph Hellerstein, and Jayram Thathachar, it was found that Bayesian classifiers perform best with features that are completely independent or functionally dependent.

This classifier works by knowing several probabilities and then using Bayes' theorem to turn them around to predict the classification of the document. The following are the probabilities that it needs to know:

  • It needs to know the probability for each feature in the training set. We'll call this p(F). Say the word good occurs in 40 percent of the documents. This is the evidence of the classification.
  • It needs to know the probability that a document will be part of a classification. We'll call this p(C). Say that the rate of positive ratings in the corpus of reviews is 80 percent. This is the prior distribution.
  • Now it needs to know the probability that the good feature is in the document if the document is rated positively. This is p(F|C). For this hypothetical example, say that good appears in 40 percent of the positive reviews. This is the likelihood.

Bayes' theorem allows us to turn this around and compute the probability that a document is positively rated if it contains the feature good.


For this example, this turns out to be p(C|F) = p(C) × p(F|C) / p(F) = (0.8 × 0.4) / 0.4, or 0.8 (80 percent). So, if the document contains the feature good, it is very likely to be positively rated.
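Bayes' theorem itself is a one-liner, and we can check the arithmetic above with it (posterior is just an illustrative name, not part of the chapter's code):

```clojure
;; p(C|F) = p(C) * p(F|C) / p(F)
(defn posterior [p-class p-feature-given-class p-feature]
  (/ (* p-class p-feature-given-class) p-feature))

(posterior 0.8 0.4 0.4)
;; => approximately 0.8
```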

Of course, things begin to get more and more interesting as we start to track more and more features. If the document contains both not and good, for instance, the probability that the review is positive may change drastically.

The Weka implementation of a naive Bayesian classifier is found in weka.classifiers.bayes.NaiveBayes, and we'll wrap it in a manner that is similar to the one we used for the maxent classifier:

(w/defanalysis train-naive-bayes NaiveBayes buildClassifier
  [["-K" kernel-density false :flag-true]
   ["-D" discretization false :flag-true]])
(def k-fold-naive-bayes (run-k-fold train-naive-bayes))

Now that we have both the classifiers in place, let's look again at the features we'll use and how we'll compare everything.
