Support vector machine

This algorithm, Support Vector Machine (SVM), tries to geometrically separate the dataset $(\mathbf{x}_i, y_i)$ into two subsets labeled with $y_i=+1$ and $y_i=-1$. The next figure shows the data perfectly separated into two classes (empty circles and black circles), that is, the case in which the decision boundary (or hyperplane) given by the black line fully separates the two classes (in other words, there are no misclassified data points):

[Figure: Sketch of the dataset separated into two classes (empty and filled circles) by the black line (decision boundary)]

The hyperplane is mathematically described by the equation $\mathbf{w}\cdot\mathbf{x} + b = 0$, where $\frac{|b|}{\|\mathbf{w}\|}$ is the distance of the hyperplane from the origin and $\mathbf{w}$ is the normal to the hyperplane. The goal of the algorithm is to maximize the distance of the decision boundary from the data points. In practice, we consider the closest points $\mathbf{x}_i$ to the hyperplane, called support vectors, which lie in two planes H1, H2 at distances d1, d2 from the decision boundary such that:

$\mathbf{w}\cdot\mathbf{x}_i + b = +1$ for H1 such that $y_i=+1$ ––––––––(1)

$\mathbf{w}\cdot\mathbf{x}_i + b = -1$ for H2 such that $y_i=-1$ ––––––––(2)

Assuming d1=d2, this common distance is called the margin, and the support vector machine method finds the values of w and b that maximize it.
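
As a quick numerical illustration (a minimal sketch with made-up values for w and b, not taken from the text), the distance of a point from the decision boundary, the distance of the hyperplane from the origin, and the resulting margin can be computed directly:

```python
import numpy as np

# Hypothetical hyperplane parameters (illustrative values only)
w = np.array([2.0, 1.0])   # normal to the hyperplane
b = -3.0                   # offset term

# Distance of the hyperplane w.x + b = 0 from the origin
origin_distance = abs(b) / np.linalg.norm(w)

# Signed distance of a point x from the decision boundary
x = np.array([1.5, 2.0])
signed_distance = (w @ x + b) / np.linalg.norm(w)

# Margin: distance between the decision boundary and the planes H1, H2
margin = 1.0 / np.linalg.norm(w)

print(origin_distance, signed_distance, margin)
```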

Since the distance between H1 and H2 is given by $\frac{2}{\|\mathbf{w}\|}$, the margin is equal to $\frac{1}{\|\mathbf{w}\|}$ and the support vector machine algorithm is equivalent to:

$\min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2$ such that $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1 \;\; \forall i$

Here, the square operation and the factor $\frac{1}{2}$ have been added to allow the use of a quadratic programming method to solve the mathematical problem. Now, the problem can be rewritten in a Lagrangian form using the Lagrange multipliers $a_i \geq 0$:

$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i a_i \left[ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 \right]$

Setting the derivatives with respect to $\mathbf{w}$ and b to 0, we obtain:

$\mathbf{w} = \sum_i a_i y_i \mathbf{x}_i$ ––––––––(3)

$\sum_i a_i y_i = 0$ ––––––––(4)

So the optimized Lagrangian becomes:

$L = \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i a_j y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j = \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j$

Here, $H_{ij} = y_i y_j\, \mathbf{x}_i\cdot\mathbf{x}_j$.
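
For a small, hypothetical dataset, the matrix $H_{ij}$ can be assembled with a couple of NumPy operations; the points and labels below are purely illustrative:

```python
import numpy as np

# Hypothetical training data: 4 points in 2 dimensions with labels in {-1, +1}
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [-1.0, -1.0],
              [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# H_ij = y_i * y_j * (x_i . x_j): the matrix appearing in the dual Lagrangian
H = np.outer(y, y) * (X @ X.T)
print(H)
```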

This is known as a dual form of the original problem, which depends only on the maximization of ai:

$\max_{a} \left[ \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j \right]$ subject to $a_i \geq 0$ and $\sum_i a_i y_i = 0$

The solutions $a_s > 0$ (the cases $a_i = 0$ give null terms) are found using a technique called quadratic programming; the corresponding points $\mathbf{x}_s$ are the support vectors, and they determine w through formula (3):

$\mathbf{w} = \sum_s a_s y_s \mathbf{x}_s$ ––––––––(5)

The support vectors $\mathbf{x}_s$ satisfy the following equation (a combination of equations (1) and (2)):

$y_s(\mathbf{w}\cdot\mathbf{x}_s + b) = 1$

Substituting equation (3) and multiplying both sides by ys (which is +1 or -1), we obtain:

$b = y_s - \sum_m a_m y_m\, \mathbf{x}_m\cdot\mathbf{x}_s$

Averaging over all the Ns support vectors, we can obtain a better estimate of the parameter b:

$b = \frac{1}{N_s}\sum_{s \in S}\left( y_s - \sum_m a_m y_m\, \mathbf{x}_m\cdot\mathbf{x}_s \right)$ ––––––––(6)

Equations (5) and (6) return the values of the parameters that define the support vector machine algorithm, from which it is possible to predict the class of any test point $\mathbf{x}_t$:

$y_t = \operatorname{sign}\!\left(\mathbf{w}\cdot\mathbf{x}_t + b\right) = \operatorname{sign}\!\left(\sum_s a_s y_s\, \mathbf{x}_s\cdot\mathbf{x}_t + b\right)$
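
As a hedged sketch of how formulas (5) and (6) are used in practice, a linear SVM fitted with scikit-learn exposes the products $a_s y_s$ (dual_coef_), the support vectors, and the intercept b, so w can be rebuilt and the sign rule applied by hand; the toy data here is synthetic, and the large C only approximates the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, linearly separable toy data (illustrative only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # large C approximates the hard-margin case
clf.fit(X, y)

# Formula (5): w = sum_s a_s y_s x_s  (dual_coef_ already stores a_s * y_s)
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_          # the b of formula (6), as estimated by the library

# Class of a test point: sign(w . x_t + b)
x_t = np.array([1.0, 0.5])
manual_prediction = np.sign(w @ x_t + b)
print(manual_prediction, clf.predict([x_t]))
```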

If a line is not able to completely separate the data points into two classes, we need to allow the data points to be misclassified by an error $\xi_i$ such that:

$\mathbf{w}\cdot\mathbf{x}_i + b \geq +1 - \xi_i$ for $y_i=+1$
$\mathbf{w}\cdot\mathbf{x}_i + b \leq -1 + \xi_i$ for $y_i=-1$, with $\xi_i \geq 0 \;\; \forall i$

We still want to maximize the margin while keeping the misclassification errors as small as possible. This condition is translated into the following problem:

$\min_{\mathbf{w},b,\xi} \left[ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \right]$ such that $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \geq 1 - \xi_i,\;\; \xi_i \geq 0 \;\; \forall i$

Here, the parameter C is set to balance the size of the margin against the misclassification errors: a small C barely penalizes the errors $\xi_i$, allowing many misclassified points but a wide margin, while $C \gg 1$ strongly penalizes misclassified points and leads to a narrow margin. Applying the same method as before, the dual problem is subject to the Lagrange multipliers' conditions with an upper bound C:

$\max_{a} \left[ \sum_i a_i - \frac{1}{2}\sum_{i,j} a_i H_{ij} a_j \right]$ subject to $0 \leq a_i \leq C$ and $\sum_i a_i y_i = 0$
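
The effect of C can be checked empirically. The following sketch, on noisy made-up data, simply compares the number of support vectors and the training accuracy of a linear SVM for a small and a large C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Two overlapping Gaussian blobs: not perfectly separable (illustrative data)
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5],
               rng.randn(50, 2) - [1.5, 1.5]])
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: support vectors={clf.support_vectors_.shape[0]}, "
          f"training accuracy={clf.score(X, y):.2f}")
```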

Until now, we have treated problems in which only two classes are considered. Real problems may have multiple classes, and two procedures are commonly used to extend this method (as seen for logistic regression): one versus all or one versus one. Given a problem with M classes, the first method trains M SVM models, each one assigning the label +1 to the considered class j and -1 to all the rest. The second method instead trains a model for each pair of classes i, j, leading to $\frac{M(M-1)}{2}$ trained models. Clearly, the second method is computationally more expensive, but the results are generally more precise.
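
Both strategies are available in scikit-learn as ready-made wrappers; the sketch below, on a hypothetical four-class toy set, confirms that one versus all trains M models while one versus one trains M(M-1)/2:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Four synthetic classes in two dimensions (illustrative data only)
X = np.vstack([rng.randn(30, 2) + c
               for c in ([0, 4], [4, 0], [-4, -4], [4, 4])])
y = np.repeat([0, 1, 2, 3], 30)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))  # one versus all: M = 4 models
print(len(ovo.estimators_))  # one versus one: M(M-1)/2 = 6 models
```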

In a similar way, SVM can be used in regression problems, that is, whenever yi is continuous between -1 and 1. In this case, the goal is to find the parameters w and b such that:

$y_i = \mathbf{w}\cdot\mathbf{x}_i + b$

We assume that the true values $t_i$ can differ from the predicted values $y_i$ by at most $\epsilon$, and that the predictions can further deviate by an error $\xi_i^+$ or $\xi_i^-$, depending on whether $y_i$ is smaller or larger than $t_i$. The following figure shows, for an example point i, the various predictions $y_i$ lying around the true value $t_i$ and the associated errors:

[Figure: The predictions yi lie around the true value ti]

The minimization problem becomes:

$\min_{\mathbf{w},b,\xi^{\pm}} \left[ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \left(\xi_i^+ + \xi_i^-\right) \right]$

Such that:

$t_i - y_i \leq \epsilon + \xi_i^+,\quad y_i - t_i \leq \epsilon + \xi_i^-,\quad \xi_i^+, \xi_i^- \geq 0 \;\; \forall i$

It is possible to show that the associated dual problem is now equal to:

$\max_{a^+, a^-} \left[ \sum_i \left(a_i^+ - a_i^-\right) t_i - \epsilon \sum_i \left(a_i^+ + a_i^-\right) - \frac{1}{2}\sum_{i,j} \left(a_i^+ - a_i^-\right)\left(a_j^+ - a_j^-\right)\mathbf{x}_i\cdot\mathbf{x}_j \right]$ subject to $0 \leq a_i^{\pm} \leq C$ and $\sum_i \left(a_i^+ - a_i^-\right) = 0$.

Here, $a_i^+$ and $a_i^-$ are the Lagrangian multipliers.

The new prediction, $y_p$, can be found by applying the formula $y_p = \sum_i \left(a_i^+ - a_i^-\right)\mathbf{x}_i\cdot\mathbf{x}_p + b$, where the parameter b can be obtained as before, by averaging over the subset S of support vectors associated with $0 < a_s^+ < C$ and $\xi_s^+ = 0$:

$b = \frac{1}{N_s}\sum_{s \in S}\left[ t_s - \epsilon - \sum_m \left(a_m^+ - a_m^-\right)\mathbf{x}_m\cdot\mathbf{x}_s \right]$
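
In scikit-learn the regression variant is exposed as SVR, where the epsilon parameter corresponds to the tolerance ε and C plays the same role as in classification; a minimal sketch on synthetic one-dimensional data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)

# Synthetic 1-D regression data with a little noise (illustrative only)
X = rng.uniform(-3, 3, size=(80, 1))
t = 0.5 * X.ravel() + 0.1 * rng.randn(80)

# epsilon is the width of the insensitive tube; C penalizes deviations beyond it
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, t)

print(reg.predict([[0.5]]))           # prediction y_p for a new point
print(reg.support_vectors_.shape[0])  # points on or outside the epsilon-tube
```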

Kernel trick

There are datasets that are not linearly separable in a certain space, but if they are transformed into the right space, a hyperplane can separate the data into the desired two or more classes. Consider the example shown in the following figure:

[Figure: In a two-dimensional space, the dataset shown on the left is not separable. Mapping the dataset into a three-dimensional space, the two classes become separable.]

We can clearly see that the two classes are not linearly separable in two-dimensional space (the left figure). Suppose we then apply a kernel function K on the data such that:

$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)$, where $\phi$ maps each point into the higher-dimensional space

The data is now separable by a two-dimensional plane (the right figure). In the SVM algorithm, the kernel function is applied to the matrix $H_{ij}$, replacing the dot product between the points i, j:

$H_{ij} = y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$

Popular kernel functions used in the SVM algorithm are:

  • Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i\cdot\mathbf{x}_j$
  • Radial basis kernel (RBF): $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
  • Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i\cdot\mathbf{x}_j + r\right)^d$
  • Sigmoid kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\!\left(\kappa\, \mathbf{x}_i\cdot\mathbf{x}_j + c\right)$
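
These kernels map directly to the kernel parameter of scikit-learn's SVC. The sketch below, on synthetic ring-shaped data similar in spirit to the previous figure, shows the RBF kernel separating classes that no linear boundary can handle:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic "disk inside a ring" data: not linearly separable in 2-D
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.hstack([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.hstack([np.ones(100), -np.ones(100)])

for kernel in ("linear", "rbf", "poly", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(f"{kernel}: training accuracy = {clf.score(X, y):.2f}")
```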