This algorithm, the Support Vector Machine (SVM), tries to geometrically separate the dataset into two subsets labeled with $y_i = +1$ and $y_i = -1$. The next figure shows data perfectly separated into two classes (empty circles and black circles), that is, the case in which the decision boundary (or hyperplane), given by the black line, fully separates the two classes (in other words, there are no misclassified data points):
The hyperplane is mathematically described by the equation $\mathbf{w} \cdot \mathbf{x} + b = 0$, where $|b|/\|\mathbf{w}\|$ is the distance of the hyperplane from the origin and $\mathbf{w}$ is the normal to the hyperplane. The goal of the algorithm is to maximize the distance of the decision boundary from the data points. In practice, we consider the closest points $i$ to the hyperplane, called support vectors, that lie in two planes H1, H2 at distances d1, d2 from the decision boundary such that:
$$\mathbf{w} \cdot \mathbf{x}_i + b = +1 \quad \text{for } H_1 \text{ such that } y_i = +1 \qquad (1)$$
$$\mathbf{w} \cdot \mathbf{x}_i + b = -1 \quad \text{for } H_2 \text{ such that } y_i = -1 \qquad (2)$$
Assuming $d_1 = d_2$, the common distance is called the margin, and the support vector machine method finds the values of $\mathbf{w}$ and b that maximize it.
Since the distance between H1 and H2 is given by $2/\|\mathbf{w}\|$, the margin is equal to $1/\|\mathbf{w}\|$ and the support vector machine algorithm is equivalent to:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{such that} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \ \forall i$$
Here, the square operation and the 1/2 factor have been added to allow the use of a quadratic programming method to solve the mathematical problem. Now, the problem can be rewritten in a Lagrangian form using the Lagrange multipliers $a_i \geq 0$:
$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i a_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$$
Setting the derivatives with respect to $\mathbf{w}$ and b to 0, we obtain:
$$\mathbf{w} = \sum_i a_i y_i \mathbf{x}_i \qquad (3)$$
$$\sum_i a_i y_i = 0 \qquad (4)$$
So the optimized Lagrangian becomes:
$$\tilde{L} = \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j$$
Here, $H_{ij} = y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$.
This is known as the dual form of the original problem, which depends only on the multipliers $a_i$ and has to be maximized:
$$\max_{\mathbf{a}} \left[ \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j \right] \quad \text{such that} \quad a_i \geq 0, \ \sum_i a_i y_i = 0$$
The solutions $a_s > 0$ (the cases $a_i = 0$ contribute null terms) are found using a technique called quadratic programming; they identify the support vectors and determine $\mathbf{w}$ through formula (3):
$$\mathbf{w} = \sum_{s \in S} a_s y_s \mathbf{x}_s \qquad (5)$$
The parameter b can be calculated because the support vectors $\mathbf{x}_s$ satisfy the following equation (a combination of equations (1) and (2)):
$$y_s(\mathbf{w} \cdot \mathbf{x}_s + b) = 1$$
Substituting equation (3) and multiplying both sides by $y_s$ (which is +1 or -1), we obtain:
$$b = y_s - \sum_{m \in S} a_m y_m \, \mathbf{x}_m \cdot \mathbf{x}_s$$
Averaging over all the $N_s$ support vectors, we can have a better estimate of the parameter b:
$$b = \frac{1}{N_s}\sum_{s \in S}\left( y_s - \sum_{m \in S} a_m y_m \, \mathbf{x}_m \cdot \mathbf{x}_s \right) \qquad (6)$$
Equations (5) and (6) return the values of the parameters that define the support vector machine algorithm, from which it is possible to predict the class of any test point $\mathbf{x}_t$:
$$y_t = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}_t + b)$$
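To make the derivation concrete, the following is a minimal from-scratch sketch (not from the original text) that solves the dual problem with a quadratic programming solver, assuming the cvxopt package is available, and then recovers $\mathbf{w}$ and b through formulas (5) and (6); all variable names are illustrative.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed QP solver


def train_hard_margin_svm(X, y):
    """X: (n, d) array of points, y: (n,) array of labels in {-1, +1}."""
    n = X.shape[0]
    Xy = X * y[:, None]
    H = Xy @ Xy.T                               # H_ij = y_i y_j x_i . x_j
    P = matrix(H.astype(float))
    q = matrix(-np.ones(n))                     # maximize sum(a) -> minimize -sum(a)
    G = matrix(-np.eye(n))                      # -a_i <= 0, that is a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint sum_i a_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    a = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    sv = a > 1e-6                               # support vectors have a_s > 0
    w = ((a[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)  # formula (5)
    b_param = np.mean(y[sv] - X[sv] @ w)                # formula (6)
    return w, b_param


def predict(w, b_param, X_test):
    return np.sign(X_test @ w + b_param)        # class of a test point x_t
```

In practice, a library implementation such as scikit-learn's SVC handles the quadratic programming step internally.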
If a line is not able to completely separate the data points into two classes, we need to allow each data point to be misclassified by an error $e_i \geq 0$ such that:
$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - e_i \quad \forall i$$
And we need to maximize the margin while trying to minimize the misclassification errors. This condition is translated into the following problem:
$$\min_{\mathbf{w}, b, e_i} \left[ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i e_i \right] \quad \text{such that} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - e_i, \ e_i \geq 0 \ \forall i$$
Here, the parameter C is set to balance the size of the margin against the misclassification errors: a small C tolerates many misclassified points and yields a wide margin, while a large C (C >> 1) penalizes the errors heavily, leading to few misclassified points but a narrow margin. Applying the same method as before, the dual problem is subject to the Lagrange multiplier conditions with an upper bound C:
$$\max_{\mathbf{a}} \left[ \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j \right] \quad \text{such that} \quad 0 \leq a_i \leq C, \ \sum_i a_i y_i = 0$$
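As a quick illustration of the role of C, here is a brief sketch assuming scikit-learn (the dataset and the C values are arbitrary); increasing C reduces the number of margin violations at the price of a narrower margin:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two partially overlapping clusters, so a perfect linear separation is impossible.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Larger C: fewer margin violations, fewer support vectors, narrower margin.
    print(f"C={C}: support vectors={clf.n_support_.sum()}, "
          f"training accuracy={clf.score(X, y):.2f}")
```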
Until now, we have treated problems in which only two classes are considered. Real problems may have multiple classes, and two procedures are commonly used to apply this method (as seen for logistic regression): one versus all or one versus one. Given a problem with M classes, the first method trains M SVM models, each one labeling the data of the considered class j with +1 and all the rest with -1. The second method instead trains a model for each pair of classes i, j, leading to $M(M-1)/2$ trained models. Clearly, the second method is computationally more expensive, but the results are generally more precise.
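The following sketch (assuming scikit-learn, with the Iris dataset used purely as an example) contrasts the two strategies; with M = 3 classes both happen to train three binary models, but in general one versus one trains $M(M-1)/2$ of them:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                            # M = 3 classes

ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, y)    # M binary models
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, y)     # M*(M-1)/2 pairwise models
print(len(ovr.estimators_), len(ovo.estimators_))            # 3 and 3 for M = 3
```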
In a similar way, SVM can be used in regression problems, that is, whenever the target values are continuous rather than discrete class labels. In this case, the goal is to find the parameters $\mathbf{w}$ and b such that the prediction for each point is:
$$y_i = \mathbf{w} \cdot \mathbf{x}_i + b$$
We assume that the true value $t_i$ can differ from the predicted value $y_i$ by at most $\epsilon$, and that the prediction can further deviate by an error $\xi_i \geq 0$ or $\hat{\xi}_i \geq 0$, depending on whether $y_i$ is larger or smaller than $t_i$. The following figure shows, for an example point i, the various predictions $y_i$ lying around the true value $t_i$ and the associated errors:
The minimization problem becomes:
$$\min_{\mathbf{w}, b, \xi_i, \hat{\xi}_i} \left[ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i (\xi_i + \hat{\xi}_i) \right]$$
such that:
$$t_i \leq y_i + \epsilon + \xi_i, \quad t_i \geq y_i - \epsilon - \hat{\xi}_i, \quad \xi_i, \hat{\xi}_i \geq 0 \quad \forall i$$
It is possible to show that the associated dual problem is now equal to:
$$\max_{\mathbf{a}, \hat{\mathbf{a}}} \left[ \sum_i (a_i - \hat{a}_i)\, t_i - \epsilon \sum_i (a_i + \hat{a}_i) - \frac{1}{2}\sum_{i,j} (a_i - \hat{a}_i)(a_j - \hat{a}_j)\, \mathbf{x}_i \cdot \mathbf{x}_j \right]$$
subject to $0 \leq a_i \leq C$, $0 \leq \hat{a}_i \leq C$, and $\sum_i (a_i - \hat{a}_i) = 0$.
Here, $a_i$ and $\hat{a}_i$ are the Lagrange multipliers.
The new prediction, $y_p$, can be found by applying the formula $y_p = \sum_i (a_i - \hat{a}_i)\, \mathbf{x}_i \cdot \mathbf{x}_p + b$, where the parameter b can be obtained as before, averaging over the subset S of support vectors that satisfy $0 < a_s < C$ and $\xi_s = 0$:
$$b = \frac{1}{N_s}\sum_{s \in S}\left( t_s - \epsilon - \sum_m (a_m - \hat{a}_m)\, \mathbf{x}_m \cdot \mathbf{x}_s \right)$$
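A short support vector regression sketch follows, assuming scikit-learn and synthetic data chosen only for illustration; epsilon is the width of the tolerance tube around the predictions and C penalizes the points falling outside it:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
t = 0.5 * X.ravel() + 0.1 * rng.randn(80)        # noisy continuous targets t_i

# epsilon-insensitive tube of width 0.1; errors beyond it are penalized through C.
reg = SVR(kernel='linear', C=1.0, epsilon=0.1).fit(X, t)
print(reg.predict([[0.5]]))                      # prediction y_p for a new point
```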
There are datasets that are not linearly separable in a certain space, but if the data is transformed into the right space, then a hyperplane can separate it into the desired two or more classes. Consider the example shown in the following figure:
We can clearly see that the two classes are not linearly separable in the two-dimensional space (the left figure). Suppose we then apply a kernel function K to the data such that:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$$
Here, $\phi$ is a transformation that maps each point into a higher-dimensional space.
The data is now separable by a two-dimensional plane (the right figure). In the SVM algorithm, the kernel function is applied to the matrix $H_{ij}$, replacing the dot product between the points i, j:
$$H_{ij} = y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j)$$
Popular kernel functions used in the SVM algorithm are the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + c)^d$, the Gaussian radial basis function (RBF) kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, and the sigmoid kernel $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\kappa\, \mathbf{x}_i \cdot \mathbf{x}_j + c)$.
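As a brief sketch (scikit-learn assumed), these kernels can be compared on data similar to the figure above, where two concentric classes are not linearly separable in the original space:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not separable by a straight line in the original 2D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    acc = SVC(kernel=kernel, gamma='scale').fit(X, y).score(X, y)
    print(f"{kernel}: training accuracy={acc:.2f}")   # the rbf kernel separates the circles
```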