Classifying data with the Naive Bayesian classifier

Bayesian classification is a way of updating your estimate of the probability that an item belongs to a given category, based on what you already know about that item, the category, and the world at large. A Naive Bayesian classifier adds the simplifying ("naive") assumption that all features of the items are independent of each other. For example, elevation and average snowfall are not independent (higher elevations tend to have more snow), but elevation and median income plausibly are. Despite this simplification, the algorithm has been useful in a number of interesting areas, for example, spam detection in emails, automatic language detection, and document classification. In this recipe, we'll apply it to the mushroom dataset that we looked at in the Classifying data with decision trees recipe.

Getting ready

First, we'll need to use the dependencies that we specified in the project.clj file in the Loading CSV and ARFF files into Weka recipe. We'll also use the defanalysis macro from the Discovering groups of data using K-Means clustering recipe, and we'll need this import in our script or REPL:

(import [weka.classifiers.bayes NaiveBayes]
        [weka.core Instances])

For data, we'll use the mushroom dataset that we used in the Classifying data with decision trees recipe. You can download it from http://www.ericrochester.com/clj-data-analysis/data/UCI/mushroom.arff. We'll also need to ensure that the class attribute is marked, just as we did in that recipe:

(def shrooms (doto (load-arff "data/UCI/mushroom.arff")
               (.setClassIndex 22)))
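
Index 22 works because the class is the last of the file's 23 attributes. If you'd rather not hard-code that index, the following equivalent sketch (our own variation, not from the original recipe) marks whatever attribute comes last, which assumes, as the UCI mushroom file does, that the class attribute is listed last:

(def shrooms
  (let [data (load-arff "data/UCI/mushroom.arff")]
    (doto data
      ;; Mark the last attribute as the class attribute.
      (.setClassIndex (dec (.numAttributes data))))))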

How to do it…

In order to test the classifier, we'll take a sample of the data and train the classifier on that. We'll then see how well it classifies the entire dataset:

  1. The following function takes a dataset of instances and a sample size, and it returns a sample of the dataset:
    (defn sample-instances
      "Returns a random sample of the dataset containing size
      instances, selected without replacement. Datasets with
      size instances or fewer are returned unchanged."
      [instances size]
      (let [inst-count (.numInstances instances)]
        (if (<= inst-count size)
          instances
          (let [indexes
                ;; Draw distinct random indexes into a set until
                ;; we have enough of them, then sort them.
                (loop [sample #{}]
                  (if (= (count sample) size)
                    (sort sample)
                    (recur
                      (conj sample
                            (rand-int inst-count)))))
                sample (Instances. instances size)]
            ;; Copy each selected instance into the new, empty
            ;; dataset.
            (doseq [i indexes]
              (.add sample (.get instances i)))
            sample))))
  2. We also need to create the wrapper function for the Bayesian analyzer:
    (defanalysis
      naive-bayes NaiveBayes buildClassifier
      [["-K" kernel-density false :flag-true]
       ["-D" discretization false :flag-true]])
  3. Now we can create a sample of the mushroom data and apply the analyzer to it:
    (def shroom-sample (sample-instances shrooms 2000))
    (def bayes (naive-bayes shroom-sample))
  4. We can pull an instance from the original dataset and use the Bayesian model to classify it. In this dataset, edible (e) is 0 and poisonous (p) is 1, so the model has correctly classified this instance (see the snippet after these steps for translating the numeric class value back into its label):
    user=> (.get shrooms 2)
    #<DenseInstance b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m,e>
    user=> (.classifyInstance bayes (.get shrooms 2))
    0.0
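
The classifyInstance method returns the predicted class as a numeric index. If you'd rather see the label itself, Weka's attribute API can translate it back. Here, class-label is our own small helper, not part of the original recipe:

(defn class-label
  "Translates a numeric class value into its string label."
  [instances class-index]
  (.value (.classAttribute instances) (int class-index)))

(class-label shrooms (.classifyInstance bayes (.get shrooms 2)))
;; => "e", since edible is class 0 in this dataset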

How it works…

Bayesian models work by starting from a prior estimate of the chance that a mushroom is edible or poisonous, say a fifty-fifty split. For each item in the training set, the values of that mushroom's attributes nudge the estimate in the direction of the mushroom's classification. So if a mushroom has a bell-shaped cap, smells of anise, and is edible, the model will assume that another mushroom with a bell-shaped cap that smells of anise is slightly more likely to be edible than poisonous.
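
To make this nudging concrete, here's a toy version of the underlying arithmetic: naive Bayes scores each class as its prior probability times the product of each feature's conditional probability, and then normalizes the scores. The priors and likelihoods below are made-up numbers for illustration, not values estimated from the mushroom data:

;; Hypothetical prior and conditional probabilities.
(def priors {:edible 0.5, :poisonous 0.5})
(def likelihoods
  {:edible    {:bell-cap 0.6, :anise-odor 0.7}
   :poisonous {:bell-cap 0.3, :anise-odor 0.05}})

(defn posterior
  "Scores each class as prior times the product of the
  feature likelihoods, then normalizes so the scores sum
  to 1."
  [features]
  (let [scores (into {}
                     (for [[c prior] priors]
                       [c (reduce * prior
                                  (map (likelihoods c) features))]))
        total (reduce + (vals scores))]
    (into {} (for [[c s] scores] [c (/ s total)]))))

(posterior [:bell-cap :anise-odor])
;; => {:edible 0.9655..., :poisonous 0.0344...}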

By classifying each instance in the mushroom dataset, we can evaluate how well the model has done. The following code snippet will do this:

user=> (frequencies
         (map #(vector (.classValue (.get shrooms %))
                       (.classifyInstance bayes (.get shrooms %)))
              (range (.numInstances shrooms))))
{[1.0 0.0] 383, [0.0 0.0] 4173, [1.0 1.0] 3533, [0.0 1.0] 35}

Thus, it classified approximately 95 percent of the data correctly (4173 plus 3533 correct, out of 8124 instances). It misclassified 383 poisonous mushrooms as edible and 35 edible mushrooms as poisonous. This isn't a bad result, although with 383 poisonous mushrooms labeled edible, we might wish that it erred on the side of caution!
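
Since the frequencies map pairs each [actual, predicted] combination with its count, we can compute that accuracy figure directly from the output shown above:

(let [results {[1.0 0.0] 383, [0.0 0.0] 4173,
               [1.0 1.0] 3533, [0.0 1.0] 35}
      correct (+ (results [0.0 0.0]) (results [1.0 1.0]))
      total (reduce + (vals results))]
  (double (/ correct total)))
;; => 0.9485..., or approximately 95 percent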
