Performing naïve Bayesian classification with MALLET

MALLET earned its reputation as a library for topic modeling. However, it also implements a number of other algorithms.

One popular algorithm that MALLET implements is naïve Bayesian classification. If you have documents that are already divided into categories, you can train a classifier to categorize new documents into those same categories. Often, this works surprisingly well.

One common use for this is spam email detection, and that's the example we'll use here too.

Getting ready

We'll need to include MALLET in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cc.mallet/mallet "2.0.7"]])

Just as in the Performing topic modeling with MALLET recipe, the list of classes to import is a little long, but most of them make up the processing pipeline, as shown here:

(require '[clojure.java.io :as io])
(import [cc.mallet.types InstanceList]
        [cc.mallet.pipe
         Input2CharSequence TokenSequenceLowercase
         CharSequence2TokenSequence SerialPipes
         SaveDataInSource Target2Label
         TokenSequence2FeatureSequence
         TokenSequenceRemoveStopwords
         FeatureSequence2AugmentableFeatureVector]
        [cc.mallet.pipe.iterator FileIterator]
        [cc.mallet.classify NaiveBayesTrainer])

For data, we can get preclassified emails from the SpamAssassin website. Take a look at https://spamassassin.apache.org/publiccorpus/. From this directory, I downloaded 20050311_spam_2.tar.bz2, 20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2. I decompressed these into the training directory. This added three subdirectories: training/easy_ham_2, training/hard_ham, and training/spam_2.

I also downloaded two other archives: 20021010_hard_ham.tar.bz2 and 20021010_spam.tar.bz2. I decompressed these into the test-data directory in order to create the test-data/hard_ham and test-data/spam directories.

How to do it…

Now, we can define the functions to create the processing pipeline and a list of document instances, as well as to train the classifier and classify the documents:

  1. We'll create the processing pipeline separately. The same pipeline instance has to be used to process all of the training, test, and new data, so hang on to it:
    (defn make-pipe-list []
      (SerialPipes.
        [(Target2Label.)
         (SaveDataInSource.)
         (Input2CharSequence. "UTF-8")
         ;; Tokens are runs of letters, possibly with
         ;; internal punctuation.
         (CharSequence2TokenSequence.
            #"\p{L}[\p{L}\p{P}]+\p{L}")
         (TokenSequenceLowercase.)
         (TokenSequenceRemoveStopwords.)
         (TokenSequence2FeatureSequence.)
         (FeatureSequence2AugmentableFeatureVector.
            false)]))
  2. We can use that to create the instance list over the files in a directory. When we do, we'll use the name of each document's parent directory as its classification. This is what we'll train the classifier on:
    (defn add-input-directory [dir-name pipe]
      (doto (InstanceList. pipe)
        (.addThruPipe
          ;; The regex's capture group pulls out the name of
          ;; each file's parent directory to use as its label.
          (FileIterator. (io/file dir-name)
                         #".*/([^/]*?)/\d+\..*$"))))
  3. Finally, these two utility functions aren't strictly necessary, but they make the training and classification calls easier to read:
    (defn train [instance-list]
      (.train (NaiveBayesTrainer.) instance-list))
    (defn classify [bayes instance-list]
      (.classify bayes instance-list))

Now, we can use these functions to load the training documents from the training directory and train the classifier:

(def pipe (make-pipe-list))
(def instance-list (add-input-directory "training" pipe))
(def bayes (train instance-list))

Then, we can use the trained classifier to classify the test files:

(def test-list (add-input-directory "test-data" pipe))
(def classes (classify bayes test-list))

Finding the results just takes digging into the class structure:

user=> (.. (first (seq classes)) getLabeling getBestLabel
           toString)
"hard_ham"

We can use this to construct a matrix that shows how the classifier performs, as follows (the Expected columns give the emails' true categories, taken from their directory names, and the Actually rows give the classifier's output):

                 Expected ham    Expected spam
Actually ham     246             99
Actually spam    4               402
From this confusion matrix, you can see that it does pretty well. Moreover, it errs on the side of misclassifying spam as ham. This is good, because it means that we'd only need to dig into our spam folder for four legitimate emails.
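If you want to tally these counts yourself, here's a minimal sketch. It assumes that each Classification's instance still carries its directory-derived label as its target; confusion-matrix is a hypothetical helper, not part of MALLET:

(defn confusion-matrix
  "Hypothetical helper: tally [predicted expected] label pairs
  from a sequence of MALLET Classification objects."
  [classifications]
  (frequencies
    (for [c classifications]
      [(.. c getLabeling getBestLabel toString)
       (.. c getInstance getTarget toString)])))

(confusion-matrix classes)
;; Returns a map from [predicted expected] pairs to counts,
;; which is exactly the information in the matrix above.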

How it works…

Naïve Bayesian classifiers work by starting with a reasonable guess about how likely each feature is to mark a document as spam. Often, this starting point is 50/50. Then, as the classifier sees more and more documents along with their classifications, it updates this model and gets better results.

For example, it might notice that the word free is found in 100 ham emails but in 900 spam emails. This makes it a very strong indicator of spam, and the classifier will update its expectations accordingly. It then combines all of the relevant probabilities from the features it sees in a document in order to classify it one way or the other.
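To make that arithmetic concrete, here's a toy Bayes' rule calculation for that single feature. The counts are the made-up figures from the example above, assuming for simplicity 1,000 emails of each class; they aren't values from our trained model:

;; How strongly does seeing "free" suggest spam, given a
;; 50/50 prior and the counts from the example above?
(def p-free-given-ham  (/ 100 1000.0))  ; P("free" | ham)
(def p-free-given-spam (/ 900 1000.0))  ; P("free" | spam)
(def p-spam 0.5)                        ; the 50/50 prior
(def p-ham  0.5)

;; Bayes' rule:
;; P(spam | "free")
;;   = P("free" | spam) P(spam)
;;     / (P("free" | spam) P(spam) + P("free" | ham) P(ham))
(/ (* p-free-given-spam p-spam)
   (+ (* p-free-given-spam p-spam)
      (* p-free-given-ham p-ham)))
;; => 0.9

For a whole document, the classifier combines one such factor per feature for each class (in practice by summing logarithms to avoid numeric underflow) and picks the class with the higher total.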

There's more…

Alexandru Nedelcu has a good introduction to Bayesian modeling and classifiers at https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html

See also…

We'll take a look at how to use the Weka machine learning library to train a naïve Bayesian classifier in order to sort edible and poisonous mushrooms in the Classifying data with the Naïve Bayesian classifier recipe in Chapter 9, Clustering, Classifying, and Working with Weka.
