Support Vector Machine

A Support Vector Machine (SVM) is a learner for binary classification (and regression) that tries to separate examples from the two different class labels with a decision boundary that maximizes the margin between the two classes.

Let's return to our example of positive and negative data samples, each of which has exactly two features (x and y). Consider the following two possible decision boundaries:

[Figure: two candidate decision boundaries, both of which separate the positive and negative samples]

Both of these decision boundaries get the job done; they separate all the positive samples from the negative ones with zero misclassifications. However, one of them seems intuitively better. How can we quantify "better" and thus learn the "best" parameter settings?

This is where SVMs come in. SVMs are also called maximal margin classifiers because they can be used to do exactly that; they define the decision boundary so as to push the two clouds of + and - samples as far apart as possible.

For the preceding example, an SVM would find the two lines that pass through the data points on the class margins (the dashed lines in the following figure), and then take the line that runs through the center between them as the decision boundary (the bold black line in the following figure):

[Figure: the maximum-margin decision boundary (bold line) centered between the two class margins (dashed lines)]

It turns out that to find the maximal margin, it is only important to consider the data points that lie on the class margins. These points are called support vectors, which is where the algorithm gets its name.

Note

In addition to performing linear classification (that is, when the decision boundary is a straight line), SVMs can also perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs to high-dimensional feature spaces.

Using SVMs for multi-class classification

Whereas some classification algorithms, such as neural networks, naturally lend themselves to working with more than two classes, SVMs are binary classifiers by nature. They can, however, be turned into multi-class classifiers.

Here, we will consider two different strategies:

  • one-vs-all: The one-vs-all strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. For k classes, this strategy thus requires the training of k different SVMs. During testing, all classifiers can express a "+1" vote by predicting that an unseen sample belongs to their class. In the end, an unseen sample is classified by the ensemble as the class with the most votes. Usually, this strategy is used in combination with confidence scores instead of predicted labels so that in the end, the class with the highest confidence score can be picked.
  • one-vs-one: The one-vs-one strategy involves training a single classifier per class pair, with the samples of the first class as positive samples and the samples of the second class as negative samples. For k classes, this strategy requires the training of k*(k-1)/2 classifiers. However, the classifiers have to solve a significantly easier task, so there is a trade-off when considering which strategy to use. During testing, all classifiers can express a "+1" vote for either the first or the second class. In the end, an unseen sample is classified by the ensemble as the class with the most votes. The sketch after this list makes the classifier counts of both strategies concrete.
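Neither strategy depends on OpenCV; the following minimal sketch (our own illustration, not part of the chapter's code) enumerates the training tasks that each strategy creates for a hypothetical k=6 classes:

import itertools

k = 6  # hypothetical number of classes

# one-vs-all: one classifier per class (that class vs. the rest)
ova_tasks = [(c, "rest") for c in range(k)]

# one-vs-one: one classifier per unordered pair of classes
ovo_tasks = list(itertools.combinations(range(k), 2))

assert len(ova_tasks) == k                  # 6 classifiers
assert len(ovo_tasks) == k * (k - 1) // 2   # 15 classifiers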

Which strategy to use can be specified by the user via an input argument (mode) of the MultiClassSVM class:

class MultiClassSVM(Classifier):
    """Multi-class classification using Support Vector Machines (SVMs) """
    def __init__(self, num_classes, mode="one-vs-all", params=None):
        self.num_classes = num_classes
        self.mode = mode
        self.params = params or dict()

As mentioned earlier, depending on the classification strategy, we will need either k or k*(k-1)/2 SVM classifiers, for which we will maintain a list in self.classifiers:

        # initialize correct number of classifiers
        self.classifiers = []
        if mode == "one-vs-one":
            # k classes: need k*(k-1)/2 classifiers
            for i in xrange(num_classes * (num_classes - 1) // 2):
                self.classifiers.append(cv2.SVM())
        elif mode == "one-vs-all":
            # k classes: need k classifiers
            for i in xrange(num_classes):
                self.classifiers.append(cv2.SVM())
        else:
            print "Unknown mode:", mode

Once the classifiers are correctly initialized, we are ready for training.

Training the SVM

Following the requirements defined by the Classifier base class, we need to perform training in a fit method:

def fit(self, X_train, y_train, params=None):
    """ fit model to data """
    if params is None:
        params = self.params

The training will differ depending on the classification strategy that is being used. The one-vs-one strategy requires us to train an SVM for each pair of classes:

if self.mode == "one-vs-one":
    svm_id = 0
    for c1 in xrange(self.num_classes):
        for c2 in xrange(c1 + 1, self.num_classes):

Here, we use svm_id to keep track of which SVM in self.classifiers we are currently training. In contrast to the one-vs-all strategy, we need to train a much larger number of SVMs here. However, each SVM only has to consider the training samples that belong to one of its two classes, c1 or c2:

y_train_c1 = np.where(y_train==c1)[0]
y_train_c2 = np.where(y_train==c2)[0]

data_id = np.sort(np.concatenate((y_train_c1, y_train_c2), axis=0))
X_train_id = X_train[data_id,:]
y_train_id = y_train[data_id]

Because an SVM is a binary classifier, we need to convert our integer class labels into 0s and 1s. We assign label 1 to all samples of the c1 class, and label 0 to all samples of the c2 class, again using the handy np.where function:

y_train_bin = np.where(y_train_id==c1, 1, 0).flatten()
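As a quick illustration of these two steps, consider a made-up label vector (toy values of our own, with numpy imported as np, as elsewhere in this chapter):

y_train = np.array([0, 2, 1, 2, 0, 1])
c1, c2 = 1, 2
data_id = np.sort(np.concatenate((np.where(y_train == c1)[0],
                                  np.where(y_train == c2)[0])))
# data_id == [1, 2, 3, 5], and y_train[data_id] == [2, 1, 2, 1]
y_train_bin = np.where(y_train[data_id] == c1, 1, 0)
# y_train_bin == [0, 1, 0, 1]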

Then we are ready to train the SVM:

self.classifiers[svm_id].train(X_train_id, y_train_bin, params=params)

Here, we pass the dictionary of training parameters, params (which defaults to self.params), to the SVM. For now, we only consider the (already suitable) default parameter values, but feel free to experiment with different settings.
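If you do want to experiment, a typical parameter dictionary for the OpenCV 2.4 cv2.SVM API might look like the following sketch; the specific values are illustrative rather than tuned for this dataset:

# illustrative (untuned) settings for OpenCV 2.4's cv2.SVM
params = dict(
    svm_type=cv2.SVM_C_SVC,      # C-support vector classification
    kernel_type=cv2.SVM_LINEAR,  # try cv2.SVM_RBF for the kernel trick
    C=1.0                        # penalty for misclassified training samples
)
svm = MultiClassSVM(num_classes=3, mode="one-vs-one", params=params)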

Don't forget to update svm_id so that you can move on to the next SVM in the list:

svm_id += 1

On the other hand, the one-vs-all strategy requires us to train a total of self.num_classes SVMs, which makes indexing a lot easier:

elif self.mode == "one-vs-all":
    for c in xrange(self.num_classes):

Again, we need to convert integer labels to binary labels. In contrast to the one-vs-one strategy, every SVM here considers all training samples. We assign label 1 to all samples of the c class and label 0 to all other samples, and pass the data to the classifier's training method:

y_train_bin = np.where(y_train == c, 1, 0).flatten()
self.classifiers[c].train(X_train, y_train_bin, params=params)

OpenCV will take care of the rest. What happens under the hood is that the SVM training uses Lagrange multipliers to solve a constrained optimization problem whose solution is the maximum-margin decision boundary. The optimization is usually run until some termination criteria are met, which can be specified via the SVM's optional arguments:

params["term_crit"] = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-6)

Testing the SVM

There are many ways to evaluate a classifier, but most often, we are simply interested in the accuracy metric, that is, how many data samples from the test set were classified correctly.

In order to arrive at this metric, we need to have each individual SVM predict the labels of the test data, and assemble their consensus in a 2D voting matrix (Y_vote):

def evaluate(self, X_test, y_test, visualize=False):
    """Evaluates model performance"""
    Y_vote = np.zeros((len(y_test), self.num_classes))

For each sample in the dataset, the voting matrix will contain the number of times the sample has been voted to belong to a certain class. Populating the voting matrix will take a slightly different form depending on the classification strategy. In the case of the one-vs-one strategy, we need to loop over all k*(k-1)/2 classifiers and obtain a vector, called y_hat, that contains the predicted labels for all test samples that belong to either the c1 class or the c2 class:

if self.mode == "one-vs-one":
    svm_id = 0
    for c1 in xrange(self.num_classes):
        for c2 in xrange(c1 + 1, self.num_classes):
            data_id = np.where((y_test == c1) + (y_test == c2))[0]
            X_test_id = X_test[data_id, :]
            y_test_id = y_test[data_id]

            # predict labels
            y_hat = self.classifiers[svm_id].predict_all(X_test_id)

The y_hat vector will contain 1s wherever the classifier believes that a sample belongs to the c1 class, and 0s wherever it believes that a sample belongs to the c2 class. The tricky part is how to increment the correct cells in the Y_vote matrix. For example, if the ith entry in y_hat is 1 (meaning that we believe the ith sample belongs to the c1 class), we want to increment the value in the ith row and c1th column of the Y_vote matrix. This indicates that the present classifier expressed a vote to classify the ith sample as belonging to the c1 class.

Since we know the indices of all test samples that belong to either class, c1 or c2 (stored in data_id), we know which rows of Y_vote to access. Because data_id is used as an index for Y_vote, we only need to find the indices in data_id that correspond to entries where y_hat is 1:

            # we vote for c1 where y_hat is 1, and for c2 where 
            # y_hat is 0
            # np.where serves as the inner index into the 
            # data_id array, which in turn serves as index 
            # into the Y_vote matrix
            Y_vote[data_id[np.where(y_hat==1)[0]],c1] += 1

Similarly, we can express a vote for the c2 class:

            Y_vote[data_id[np.where(y_hat==0)[0]],c2] += 1
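To see the double indexing in action, consider a small made-up example: five test samples, three classes, and the classifier for the pair c1=0, c2=2, whose predictions we invent:

data_id = np.array([0, 2, 4])   # test samples whose label is 0 or 2
y_hat = np.array([1, 0, 1])     # this classifier's binary predictions
Y_vote = np.zeros((5, 3))
Y_vote[data_id[np.where(y_hat == 1)[0]], 0] += 1  # samples 0 and 4 vote for class 0
Y_vote[data_id[np.where(y_hat == 0)[0]], 2] += 1  # sample 2 votes for class 2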

Then we increment svm_id and move on to the next classifier:

            svm_id += 1

The one-vs-all strategy poses a different problem. Indexing in the Y_vote matrix is less tricky, because we always consider all the test data samples. We repeat the procedure of predicting labels for each data sample:

elif self.mode == "one-vs-all":
    for c in xrange(self.num_classes):
        # predict labels
        y_hat = self.classifiers[c].predict_all(X_test)

Wherever y_hat has a value of 1, the classifier expresses a vote that the data sample belongs to class c, and we update the voting matrix:

        # we vote for c where y_hat is 1
        Y_vote[np.where(y_hat == 1)[0], c] += 1

The tricky part now is, "What do we do with entries of y_hat that have a value of 0?" Since the classifier was trained one-vs-all, we only know that the sample does not belong to the c class (that is, it belongs to the "rest"), but we do not know what the exact class label is supposed to be. Thus, we cannot add these samples to the voting matrix.

Since we neglected to include samples that are consistently classified as belonging to the "rest," it is possible that some rows in the Y_vote matrix will have all 0s. In such a case, simply pick a class at random (unless you have a better idea):

        # find all rows without votes, pick a class at random
        no_label = np.where(np.sum(Y_vote, axis=1) == 0)[0]
        Y_vote[no_label, np.random.randint(self.num_classes, size=len(no_label))] = 1
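Here is a minimal demonstration of this tie-breaking step, with made-up votes for three classes and four samples:

Y_vote = np.array([[2., 0., 1.],
                   [0., 0., 0.],   # sample 1 received no votes
                   [0., 3., 0.],
                   [0., 0., 0.]])  # sample 3 received no votes
no_label = np.where(np.sum(Y_vote, axis=1) == 0)[0]  # array([1, 3])
Y_vote[no_label, np.random.randint(3, size=len(no_label))] = 1
# every row now has at least one vote, so np.argmax below is well defined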

Now, we are ready to calculate the desired performance metrics as described in detail in later sections. For the purpose of this chapter, we choose to calculate accuracy, precision, and recall, which are implemented in their own dedicated private methods:

accuracy = self.__accuracy(y_test, Y_vote)
precision = self.__precision(y_test, Y_vote)
recall = self.__recall(y_test, Y_vote)

return accuracy, precision, recall
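Putting it all together, a hypothetical end-to-end run might look like this; we assume that X_train and X_test are np.float32 feature matrices and that y_train and y_test are integer label vectors (the data itself is not part of this chapter):

svm = MultiClassSVM(num_classes=3, mode="one-vs-one")
svm.fit(X_train, y_train)
accuracy, precision, recall = svm.evaluate(X_test, y_test)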

Note

The scikit-learn machine learning package (http://scikit-learn.org) supports the three metrics—accuracy, precision, and recall (as well as others)—straight out of the box, and also comes with a variety of other useful tools. For educational purposes (and to minimize software dependencies), we will derive the three metrics ourselves.

Confusion matrix

A confusion matrix is a 2D matrix of size (self.num_classes, self.num_classes), where the rows correspond to the actual class labels and the columns correspond to the predicted class labels. Then, the [r,c] matrix element contains the number of samples that in reality have label r, but were predicted to have label c. Having access to a confusion matrix will allow us to calculate precision and recall.
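For example, a made-up confusion matrix for three classes could look as follows under this convention; the diagonal holds the correctly classified samples:

conf = np.array([[10, 2, 0],   # true label 0: 10 correct, 2 mistaken for class 1
                 [ 1, 8, 1],   # true label 1: 8 correct, 1 each mistaken for 0 and 2
                 [ 0, 3, 9]])  # true label 2: 9 correct, 3 mistaken for class 1
# class 1: precision = 8./(2+8+3), recall = 8./(1+8+1)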

The confusion matrix can be calculated from a vector of ground-truth labels (y_test) and the voting matrix (Y_vote). We first convert the voting matrix into a vector of predicted labels by picking the column index (that is, the class label) that received the most votes:

def __confusion(self, y_test, Y_vote):
    y_hat = np.argmax(Y_vote, axis=1)
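For instance, with three classes and four test samples (made-up votes), the conversion looks like this:

Y_vote = np.array([[3, 1, 0],
                   [0, 2, 2],
                   [1, 0, 4],
                   [2, 2, 1]])
y_hat = np.argmax(Y_vote, axis=1)  # array([0, 1, 2, 0]); ties go to the lower index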

Then we need to loop twice over all classes and count the number of times a data sample of the c_true ground-truth class was predicted as having the c_pred class:

    conf = np.zeros((self.num_classes, self.num_classes)).astype(np.int32)
    for c_true in xrange(self.num_classes):
        # looking at all samples of a given class, c_true
        # how many were classified as c_true? how many as others?
        for c_pred in xrange(self.num_classes):

All elements of interest in each iteration are thus the samples that have the c_true label in y_test and the c_pred label in y_hat:

            y_this = (y_test == c_true) * (y_hat == c_pred)

The corresponding cell of the confusion matrix is then the number of non-zero elements in y_this:

            conf[c_true, c_pred] = np.count_nonzero(y_this)

After the nested loops complete, we pass the confusion matrix for further processing:

    return conf

As you may have guessed already, the goal of a good classifier is to make the confusion matrix diagonal, which would imply that the ground-truth class (c_true) and the predicted class (c_pred) of every sample are the same.

The one-vs-one strategy, in combination with HOG features, actually performs really well, which is evident from this resulting confusion matrix, wherein most of the off-diagonal elements are zero:

[Figure: confusion matrix for the one-vs-one strategy with HOG features; most off-diagonal elements are zero]

Accuracy

The most straightforward metric to calculate is probably accuracy. This metric simply counts the number of test samples that have been predicted correctly, and returns the number as a fraction of the total number of test samples:

def __accuracy(self, y_test, Y_vote):
    """Calculates the accuracy based on a vector of ground-truth labels
       (y_test) and a 2D voting matrix (Y_vote) of size
       (len(y_test), num_classes).
    """

The classification prediction (y_hat) can be extracted from the vote matrix by finding out which class has received the most votes:

    y_hat = np.argmax(Y_vote, axis=1)

The correctness of the prediction can be verified by comparing the predicted label of a sample to its actual label:

    # all cases where predicted class was correct
    mask = (y_hat == y_test)

Accuracy is then calculated by counting the number of correct predictions (that is, non-zero entries in mask) and normalizing that number by the total number of test samples:

    return np.count_nonzero(mask)*1./len(y_test)
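As a quick sanity check with made-up labels:

y_test = np.array([0, 1, 2, 1])
y_hat = np.array([0, 2, 2, 1])
mask = (y_hat == y_test)  # [True, False, True, True]
accuracy = np.count_nonzero(mask) * 1. / len(y_test)  # 0.75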

Precision

Precision in binary classification is a useful metric for measuring the fraction of retrieved instances that are relevant (it is also called the positive predictive value). In a classification task, the number of true positives is defined as the number of items correctly labeled as belonging to the positive class. Precision is defined as the number of true positives divided by the total number of samples that the classifier labeled as positive. In other words, out of all the pictures in the test set that the classifier thinks contain a cat, precision is the fraction that actually contain a cat.

The total number of positives can also be calculated as the sum of true positives and false positives, the latter being the number of samples incorrectly labeled as belonging to a particular class. This is where the confusion matrix comes in handy, because it will allow us to quickly calculate the number of false positives.

Again, we start by translating the voting matrix into a vector of predicted labels:

def __precision(self, y_test, Y_vote):
    """ precision extended to multi-class classification """
    # predicted classes
    y_hat = np.argmax(Y_vote, axis=1)

The procedure will differ slightly depending on the classification strategy. The one-vs-one strategy requires us to work with the confusion matrix:

    if self.mode == "one-vs-one":
        # need confusion matrix
        conf = self.__confusion(y_test, Y_vote)

        # consider each class separately
        prec = np.zeros(self.num_classes)
        for c in xrange(self.num_classes):

Since true positives are the samples that are predicted to have label c and also have label c in reality, they correspond to the diagonal elements of the confusion matrix:

            # true positives: label is c, classifier predicted c
            tp = conf[c,c]

Similarly, false positives correspond to the off-diagonal elements of the confusion matrix that lie in the same column as the true positives. The quickest way to calculate that number is to sum all the elements in column c and subtract the true positives:

            # false positives: label is not c, but classifier
            # predicted c
            fp = np.sum(conf[:,c]) - conf[c,c]

Precision is the number of true positives divided by the sum of true positives and false positives:

            if tp + fp != 0:
                prec[c] = tp*1./(tp+fp)

The one-vs-all strategy makes the math slightly easier. Since we always operate on the full test set, we need to find only those samples that have a ground-truth label of c in y_test and a predicted label of c in y_hat:

    elif self.mode == "one-vs-all":
        # consider each class separately
        prec = np.zeros(self.num_classes)
        for c in xrange(self.num_classes):
            # true positives: label is c, classifier predicted c
            tp = np.count_nonzero((y_test==c) * (y_hat==c))

Similarly, false positives have a ground-truth label other than c in y_test but a predicted label of c in y_hat:

            # false positives: label is not c, but classifier
            # predicted c
            fp = np.count_nonzero((y_test!=c) * (y_hat==c))

Again, precision is the number of true positives divided by the sum of true positives and false positives:

            if tp + fp != 0:
                prec[c] = tp*1./(tp+fp)

After that, we return the precision vector:

    return prec

Recall

Recall is similar to precision in the sense that it measures the fraction of relevant instances that are retrieved (as opposed to the fraction of retrieved instances that are relevant). In a classification task, the number of false negatives is the number of items that were not labeled as belonging to the positive class but should have been. Recall is the number of true positives divided by the sum of true positives and false negatives. In other words, out of all the pictures in the test set that actually contain a cat, recall is the fraction that have been correctly identified as pictures of cats.

Again, we start off by translating the voting matrix into a vector of predicted labels:

def __recall(self, y_test, Y_vote):
    """ recall extended to multi-class classification """
    # predicted classes
    y_hat = np.argmax(Y_vote, axis=1)

The procedure is almost identical to calculating precision. The one-vs-one strategy once again requires some arithmetic involving the confusion matrix:

    if self.mode == "one-vs-one":
        # need confusion matrix
        conf = self.__confusion(y_test, Y_vote)

        # consider each class separately
        recall = np.zeros(self.num_classes)
        for c in xrange(self.num_classes):

True positives once again correspond to diagonal elements in the confusion matrix:

            # true positives: label is c, classifier predicted c
            tp = conf[c,c]

To get the number of false negatives, we sum all the elements in row c of the confusion matrix and subtract the number of true positives. This gives us the number of samples whose ground-truth label is c but that the classifier predicted as belonging to some other class:

            # false negatives: label is c, but classifier
            # predicted not c
            fn = np.sum(conf[c,:]) - conf[c,c]

Recall is the number of true positives divided by the sum of true positives and false negatives:

            if tp + fn != 0:
                recall[c] = tp*1./(tp+fn)

Again, the one-vs-all strategy makes the math slightly easier. Since we always operate on the full test set, we need to find only those samples whose ground-truth label is c in y_test but whose predicted label is not c in y_hat:

    elif self.mode == "one-vs-all":
        # consider each class separately
        recall = np.zeros(self.num_classes)
        for c in xrange(self.num_classes):
            # true positives: label is c, classifier predicted c
            tp = np.count_nonzero((y_test==c) * (y_hat==c))

            # false negatives: label is c, but classifier
            # predicted not c
            fn = np.count_nonzero((y_test==c) * (y_hat!=c))

            if tp + fn != 0:
                recall[c] = tp*1./(tp+fn)

After that, we return the recall vector:

    return recall