Classifying data with support vector machines

Support vector machines (SVMs) try to divide two groups of data along a plane. An SVM finds the plane that is farthest from both groups: rather than a plane that passes much closer to one group than the other, it prefers one that is approximately equidistant from both. SVMs have a number of nice properties. While other clustering or classification algorithms work best with well-defined clusters of data, SVMs can work fine with data that isn't in well-defined and delineated groupings. They are also not affected by local minima. Algorithms such as K-Means or SOMs, which begin from a random starting point, can get caught in solutions that look good for the area around them but aren't the best for the entire space. This isn't a problem for SVMs.
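To make the geometry concrete, a separating plane can be written as w·x + b = 0, and a point's distance from it is |w·x + b| / ‖w‖; the SVM chooses w and b to maximize the smallest such distance over the training points. The following sketch just illustrates that distance calculation (the function names and vectors here are made up for illustration and aren't part of the recipe):

```clojure
;; A plane is w.x + b = 0. A point's distance from it is
;; |w.x + b| / ||w||, and an SVM maximizes the smallest
;; such distance (the margin) over the training points.
(defn dot-product [xs ys] (reduce + (map * xs ys)))
(defn norm [xs] (Math/sqrt (dot-product xs xs)))

(defn distance [w b x]
  (/ (Math/abs (+ (dot-product w x) b)) (norm w)))

(defn margin [w b points]
  (apply min (map #(distance w b %) points)))
```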

Getting ready

First, we'll need these dependencies in our project.clj file:

(defproject d-mining "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [nz.ac.waikato.cms.weka/weka-dev "3.7.11"]
                 [nz.ac.waikato.cms.weka/LibSVM "1.0.6"]])

In the script or REPL, we'll import the SVM library:

(import [weka.classifiers.functions LibSVM])

We'll also use the ionosphere dataset from the Weka datasets. (You can download this from http://www.ericrochester.com/clj-data-analysis/data/UCI/ionosphere.arff.) This data is taken from a phased-array antenna system in Goose Bay, Labrador. For each observation, the first 34 attributes come from 17 pulse numbers for the system, with two attributes per pulse number. The thirty-fifth attribute indicates whether the reading is good or bad. Good readings show evidence of some kind of structure in the ionosphere; bad readings do not, and their signals pass through the ionosphere. We'll load this and set the last column, the "good" or "bad" column, as the class index:

(def ion
  (doto (load-arff "data/UCI/ionosphere.arff")
    (.setClassIndex 34)))
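The load-arff function used above comes from an earlier recipe. In case you don't have it handy, a minimal version might look like the following; it is a sketch built on Weka's ConverterUtils$DataSource, and the original recipe's definition may differ:

```clojure
(import '[weka.core.converters ConverterUtils$DataSource])

;; Read an ARFF file into a weka.core.Instances object.
;; A minimal sketch; the load-arff from the earlier recipe
;; may be defined differently.
(defn load-arff [filename]
  (.getDataSet (ConverterUtils$DataSource. filename)))
```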

Finally, we'll use the defanalysis macro from the Discovering groups of data with K-Means clustering recipe and the sample-instances function from the Classifying data with Naive Bayesian classifiers recipe.

How to do it…

For this recipe, we'll define some utility functions and the analysis algorithm wrapper. Then we'll put it through its paces:

  1. A number of the options for the SVM analysis have to be converted from Clojure-friendly values into the strings that Weka expects. For example, we want to pass true to one option and a mnemonic keyword to another, but Weka wants both of these as integers. To keep the parameter values natural in Clojure, we'll define several lookup tables and a function that convert the Clojure parameters to the integers that Weka wants:
    (defn bool->int [b] (if b 1 0))
    (def svm-types
      {:c-svc 0, :nu-svc 1, :one-class-svm 2,
       :epsilon-svr 3, :nu-svr 4})
    (def svm-fns
      {:linear 0, :polynomial 1, :radial-basis 2,
       :sigmoid 3})
  2. We'll use these to define the wrapper function for the LibSVM class, which is a standalone library that works with Weka:
    (defanalysis
      svm LibSVM buildClassifier
      [["-S" svm-type :c-svc svm-types]
       ["-K" kernel-fn :radial-basis svm-fns]
       ["-D" degree 3]
       ["-G" gamma nil :not-nil]
       ["-R" coef0 0]
       ["-C" c 1]
       ["-N" nu 0.5]
       ["-Z" normalize false bool->int]
       ["-P" epsilon 0.1]
       ["-M" cache-size 40]
       ["-E" tolerance 0.001]
       ["-H" shrinking true bool->int]
       ["-W" weights nil :not-nil]])
  3. Before we use this, let's also write a function to calculate the classification accuracy by re-classifying each instance and tracking whether the SVM identifies each class value correctly. We'll use this to see how well the trained SVM performs:
    (defn eval-instance
      ([] {:correct 0, :incorrect 0})
      ([_] {:correct 0, :incorrect 0})
      ([classifier sums instance]
       (if (= (.classValue instance)
              (.classifyInstance classifier instance))
         (assoc sums
                :correct (inc (sums :correct)))
         (assoc sums
                :incorrect (inc (sums :incorrect))))))
  4. Now, let's get a sample of 35 of the observations (about 10 percent of the total) and train the SVM on them:
    (def ion-sample (sample-instances ion 35))
    (def ion-svm (svm ion-sample))
  5. It'll output some information about the optimizations, and then it will be ready to use. We'll use eval-instance to see how it did:
    user=> (reduce (partial eval-instance ion-svm)
                   (eval-instance) ion)
    {:incorrect 81, :correct 270}

This gives us an overall accuracy of about 77 percent (270 correct out of 351 instances).
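If you'd rather compute that percentage directly from the counts that eval-instance accumulates, a small helper (not part of the original recipe) does the arithmetic:

```clojure
;; Turn the {:correct n, :incorrect m} map produced by
;; eval-instance into a proportion between 0.0 and 1.0.
(defn accuracy [{:keys [correct incorrect]}]
  (double (/ correct (+ correct incorrect))))

;; (accuracy {:correct 270, :incorrect 81}) => ~0.769
```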

There's more…
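One natural next step is to experiment with the wrapper's parameters. Assuming the defanalysis macro turns each option row into a keyword argument (as the parameter list in step 2 suggests), we could, for example, try a linear kernel with a larger soft-margin cost and compare its accuracy against the default radial-basis kernel:

```clojure
;; Train with a linear kernel and a larger cost parameter C.
;; This assumes the svm wrapper accepts keyword options named
;; after the rows in its defanalysis definition.
(def ion-svm-linear
  (svm ion-sample :kernel-fn :linear :c 10))

(reduce (partial eval-instance ion-svm-linear)
        (eval-instance) ion)
```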
