Classification with support vector machines

Support vector machines (SVM) can be used for regression, in the form of support vector regression (SVR), and for classification (SVC). The current formulation of the algorithm was developed by Vladimir Vapnik and Corinna Cortes in 1993 (see http://en.wikipedia.org/wiki/Support_vector_machine). SVM maps data points into a multidimensional space. The mapping is performed by a so-called kernel function, which can be linear or nonlinear. The classification problem is then reduced to finding a hyperplane or hyperplanes that best separate the points into classes. Performing the separation with hyperplanes can be hard, which led to the emergence of the soft margin concept. The soft margin measures the tolerance for misclassification and is governed by a constant commonly denoted by C. Another important parameter is the type of the kernel function, which can be one of the following (see the short sketch after this list):

  • A linear function
  • A polynomial function
  • A radial basis function
  • A sigmoid function
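
The following is a minimal sketch of how the kernel type and the soft-margin constant C are passed to scikit-learn's SVC class; the tiny two-class dataset is made up purely for illustration:

from sklearn.svm import SVC
import numpy as np

# A made-up, linearly separable two-class dataset with two features per sample
x = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])

# kernel selects the kernel function; C controls the soft-margin tolerance
clf = SVC(kernel='rbf', C=1.0)
clf.fit(x, y)
print(clf.predict([[2.5, 2.5]]))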

A grid search can find the proper parameters for a problem. It is a systematic method that tries all possible parameter combinations. We will perform a grid search with the scikit-learn GridSearchCV class. We give this class a classifier or regressor object together with a dictionary. The keys of the dictionary are the names of the parameters we want to tweak, and the values are the corresponding lists of parameter values to try. The scikit-learn API has a number of classes that add cross-validation functionality to a counterpart class; GridSearchCV itself cross-validates each parameter combination, using three folds by default. Create a GridSearchCV object as follows:

clf = GridSearchCV(SVC(random_state=42, max_iter=100), {'kernel': ['linear', 'poly', 'rbf'], 'C':[1, 10]})

In this line, we capped the maximum number of iterations so as not to test our patience too much, and we varied the kernel type and the soft-margin parameter C. Cross-validation was left at its default of three folds.

The preceding code snippet created a two-by-three grid of possible parameter variations. If we had more time, we could create a bigger grid with more possible values. We would also set the cv parameter of GridSearchCV to the number of folds we want, such as 5 or 10. The maximum number of iterations should be set to a higher value as well. The different kernels can vary wildly in the time required to fit. We can print more information, such as the execution time for each combination of parameter values, by setting the verbose parameter to a non-zero integer value. Typically, we want to vary the soft-margin parameter by orders of magnitude, for instance, from 1 to 10,000, which we can achieve with the NumPy logspace() function, as in the sketch below.
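
The following is a minimal sketch of such a larger grid; the parameter ranges, fold count, and verbosity level are illustrative choices only:

import numpy as np
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV

# Soft-margin values spanning orders of magnitude: 1, 10, 100, 1000, and 10000
c_values = np.logspace(0, 4, 5)

param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'], 'C': c_values}

# Five-fold cross-validation; a non-zero verbose value prints progress and timings
clf = GridSearchCV(SVC(random_state=42, max_iter=1000), param_grid, cv=5, verbose=1)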

Applying this classifier, we obtain an accuracy of 56 percent for the weather data and an accuracy of 82 percent for the iris sample dataset. The grid_scores_ field of GridSearchCV contains scores resulting from the grid search. For the weather data, the scores are as follows:

[mean: 0.42879, std: 0.11308, params: {'kernel': 'linear', 'C': 1},
 mean: 0.55570, std: 0.00559, params: {'kernel': 'poly', 'C': 1},
 mean: 0.36939, std: 0.00169, params: {'kernel': 'rbf', 'C': 1},
 mean: 0.30658, std: 0.03034, params: {'kernel': 'linear', 'C': 10},
 mean: 0.41673, std: 0.20214, params: {'kernel': 'poly', 'C': 10},
 mean: 0.49195, std: 0.08911, params: {'kernel': 'rbf', 'C': 10}]

For the iris sample data, we get the following scores:

[mean: 0.80000, std: 0.03949, params: {'kernel': 'linear', 'C': 1},
 mean: 0.58667, std: 0.12603, params: {'kernel': 'poly', 'C': 1},
 mean: 0.80000, std: 0.03254, params: {'kernel': 'rbf', 'C': 1},
 mean: 0.74667, std: 0.07391, params: {'kernel': 'linear', 'C': 10},
 mean: 0.56667, std: 0.13132, params: {'kernel': 'poly', 'C': 10},
 mean: 0.79333, std: 0.03467, params: {'kernel': 'rbf', 'C': 10}]
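
After fitting, the best score and the parameter combination that produced it can be read directly from the GridSearchCV object. The following is a small sketch, assuming clf is a fitted GridSearchCV object such as the one in the classify() function below:

# The best mean cross-validation score and the winning parameter combination
print("Best score", clf.best_score_)
print("Best parameters", clf.best_params_)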

Refer to the svm_class.py file in this book's code bundle:

from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV
from sklearn import datasets
import numpy as np
from pprint import PrettyPrinter

def classify(x, y):
    # Grid search over the kernel type and the soft-margin parameter C
    clf = GridSearchCV(SVC(random_state=42, max_iter=100), {'kernel': ['linear', 'poly', 'rbf'], 'C':[1, 10]})

    clf.fit(x, y)
    print("Score", clf.score(x, y))
    PrettyPrinter().pprint(clf.grid_scores_)

# Rain amounts and day-of-year values from the book's code bundle
rain = np.load('rain.npy')
dates = np.load('doy.npy')

# Use each day's date and rain amount to classify the sign of the next day's rain
x = np.vstack((dates[:-1], rain[:-1]))
y = np.sign(rain[1:])
classify(x.T, y)

# Iris example: classify species from the first two features
iris = datasets.load_iris()
x = iris.data[:, :2]
y = iris.target
classify(x, y)
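
In newer scikit-learn versions, the grid_search module and the grid_scores_ attribute have been removed. The following is a roughly equivalent sketch using the model_selection module; the cv_results_ keys shown are those of the newer API:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets

iris = datasets.load_iris()
clf = GridSearchCV(SVC(random_state=42, max_iter=100),
                   {'kernel': ['linear', 'poly', 'rbf'], 'C': [1, 10]})
clf.fit(iris.data[:, :2], iris.target)

# cv_results_ replaces grid_scores_ and holds per-combination means and deviations
print(clf.cv_results_['mean_test_score'])
print(clf.cv_results_['std_test_score'])
print(clf.cv_results_['params'])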