Kernels and support vector machines

So far, we've introduced the notion of maximum margin classification under linearly separable conditions and its extension to the support vector classifier, which still uses a hyperplane as the separating boundary but handles data sets that are not linearly separable by specifying a budget for tolerating errors. The observations that lie on or within the margin, or that are misclassified by the support vector classifier, are known as support vectors. The critical role these play in positioning the decision boundary was also apparent in an alternative representation of the support vector classifier that expresses the model in terms of inner products.

What the situations we have seen so far in this chapter have in common is that our model is always linear in terms of the input features. We've seen that models which implement nonlinear boundaries between the classes to be separated are far more flexible in terms of the different kinds of underlying target functions that they can handle. One way to introduce nonlinearity that builds on our new representation involving inner products is to apply a nonlinear transformation to those inner products. We can define a general function K, which we'll call a kernel function, that operates on two vectors and produces a scalar result. This allows us to generalize our model as follows:

f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)

where S denotes the set of support vectors.

Our model now has as many features as there are support vectors, and each feature is defined as the result of the kernel acting on the current observation and one of the support vectors. For the support vector classifier, the kernel we applied is known as the linear kernel, as it is simply the inner product itself, producing a linear model.
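
As a concrete illustration (not code from the text), a minimal Python sketch of evaluating such a kernelized decision function might look like the following; the support vectors, coefficients, and intercept are hypothetical placeholder values rather than the output of an actual fit:

import numpy as np

def linear_kernel(u, v):
    # The linear kernel is just the inner product of the two vectors.
    return float(np.dot(u, v))

def decision_function(x, support_vectors, alphas, beta0, kernel):
    # f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))

# Hypothetical support vectors, coefficients, and intercept (illustrative only).
support_vectors = [np.array([1.0, 2.0]), np.array([-0.5, 1.5]), np.array([2.0, -1.0])]
alphas = [0.7, -0.4, 0.9]
beta0 = -0.2

x_new = np.array([0.8, 0.3])
score = decision_function(x_new, support_vectors, alphas, beta0, linear_kernel)
print("decision value:", score, "-> class", 1 if score > 0 else -1)

Swapping linear_kernel for a nonlinear kernel function changes nothing else in this computation, which is exactly the generalization described above.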

Kernel functions are also known as similarity functions as we can consider the output they produce as a measure of the similarity between the two input vectors provided. We introduce nonlinearity in our model using nonlinear kernels, and when we do this, our model is now known as a support vector machine. There are a number of different types of nonlinear kernels. The two most common ones are the polynomial kernel and the radial basis function kernel. The polynomial kernel uses a power expansion of the inner product between two vectors. For a polynomial of degree d, the form of the polynomial kernel is:

K(x_i, x_j) = \left(1 + \sum_{k=1}^{p} x_{ik} x_{jk}\right)^d

where p is the number of input features.
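
To see the higher-dimensional transformation at work, consider, as an illustration not taken from the text, two vectors x and z with p = 2 features and degree d = 2. Expanding the kernel shows that it is exactly an inner product between two vectors in a six-dimensional transformed space:

K(x, z) = (1 + x_1 z_1 + x_2 z_2)^2
        = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2
        = \langle \phi(x), \phi(z) \rangle

with \phi(x) = (1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2).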

Using this kernel, we are essentially transforming our feature space into a higher-dimensional space. Computing the kernel directly on the original features is much more efficient than first transforming all the features into a high-dimensional space and then fitting a linear model in that space. This is especially true when we use the radial basis function kernel, often referred to simply as the radial kernel, where the number of dimensions of the transformed feature space is actually infinite due to the infinite number of terms in the expansion. The form of the radial kernel is:

K(x_i, x_j) = \exp\left(-\gamma \sum_{k=1}^{p} (x_{ik} - x_{jk})^2\right)

where \gamma is a positive constant.

Upon close inspection, we should be able to spot that the radial kernel does not use the inner product between two vectors. Instead, the summation in the exponent is just the squared Euclidean distance between these two vectors. The radial kernel is often referred to as a local kernel because, when the Euclidean distance between the two input vectors is large, the value the kernel computes is very small due to the negative sign in the exponent. Consequently, when we use a radial kernel, only support vectors that lie close to the observation for which we want a prediction play a significant role in the computation. We're now ready to put all of this into practice with some real-world data sets.
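
Before turning to real data, the following is a minimal, self-contained sketch of these ideas; it uses Python with scikit-learn and a synthetic data set purely for illustration, and is not an example drawn from the text:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Locality of the radial kernel: nearby points give values near 1, distant points near 0.
def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

print("rbf for nearby points: ", rbf(np.array([0.0, 0.0]), np.array([0.1, 0.1])))
print("rbf for distant points:", rbf(np.array([0.0, 0.0]), np.array([3.0, 3.0])))

# Synthetic, non-linearly-separable data: one class forms a ring around the other,
# so no hyperplane in the original feature space can separate the classes.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial kernel: scikit-learn parameterizes it as (gamma * <x, x'> + coef0)^degree.
poly_svm = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)
poly_svm.fit(X_train, y_train)

# Radial (RBF) kernel: exp(-gamma * ||x - x'||^2); larger gamma makes the kernel more local.
rbf_svm = SVC(kernel="rbf", gamma=1.0, C=1.0)
rbf_svm.fit(X_train, y_train)

print("polynomial kernel accuracy:", poly_svm.score(X_test, y_test))
print("radial kernel accuracy:    ", rbf_svm.score(X_test, y_test))
print("number of support vectors (rbf):", rbf_svm.n_support_.sum())

Because the inner circle cannot be separated from the outer ring by any straight line, this is precisely the kind of data set for which a nonlinear kernel is needed.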
