Chapter 5. Support Vector Machines

In this chapter, we are going to take a fresh look at nonlinear predictive models by introducing support vector machines. Support vector machines, often abbreviated as SVMs, are most commonly used for classification problems, although there are certainly ways to perform function approximation and regression tasks with them; here, we will focus on the more typical case of their role in classification. To do this, we'll first present the notion of maximal margin classification, an alternative criterion for choosing between the many possible classification boundaries, one that differs from approaches such as maximum likelihood that we have seen thus far. We'll then introduce the related idea of support vectors and show how, together with maximal margin classification, they yield a linear model in the form of a support vector classifier. Finally, we'll see how these ideas can be generalized to introduce nonlinearity through the use of certain functions known as kernels, arriving at our destination: the support vector machine.

Maximal margin classification

We'll begin this chapter by returning to a situation that should be very familiar by now: the binary classification task. Once again, we'll be thinking about how to design a model that will correctly predict whether an observation belongs to one of two possible classes. We've already seen that this task is simplest when the two classes are linearly separable, that is, when we can find a separating hyperplane (a flat surface whose dimension is one less than that of the feature space) such that all the observations on one side of the hyperplane belong to one class and all the observations on the other side belong to the second class. Depending on the structure, assumptions, and optimization criterion that our particular model uses, we could end up with any one of infinitely many such hyperplanes.

Let's visualize this scenario using some data in a two-dimensional feature space, where the separating hyperplane is just a separating line:

[Figure: Two linearly separable classes in a two-dimensional feature space, with three candidate separating lines]

In the preceding diagram, we can see two clusters of observations, each of which belongs to a different class; we've used different symbols for the two classes to denote this explicitly. The diagram also shows three different lines that could serve as the decision boundary of a classifier, all of which would achieve 100 percent classification accuracy on this data set. Recall that a hyperplane is defined by setting a linear combination of the input features, the dimensions of the space in which the hyperplane resides, equal to zero:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0$$

A separating hyperplane has this property:

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0, \quad \text{if } y_i = 1$$

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0, \quad \text{if } y_i = -1$$

The first equation simply says that the data points that belong to class 1 all lie above the hyperplane, and the second equation says that the data points that belong to class -1 all lie below the hyperplane. The subscript i is used to index observations, and the subscript k is used to index features, so that $x_{ik}$ means the value of the kth feature in the ith observation. We can combine these two equations into a single equation for simplicity, as follows:

$$y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) > 0, \quad i = 1, \ldots, n$$

To see why this simplification works, consider an observation of the class -1 ($y_i = -1$). This observation will lie below the separating hyperplane, so the linear combination in brackets will produce a negative value. Multiplying this by its $y_i$ value of -1 results in a positive value. A similar argument works for observations of class 1.
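To make this argument concrete, the following tiny R snippet checks the combined condition using made-up coefficients and two illustrative points (these particular values are purely for illustration and do not come from the plotted data set):

```r
# Hypothetical separating line: 0 + 1*x1 + 1*x2 = 0, that is, beta = (0, 1, 1)
beta <- c(0, 1, 1)
f <- function(x) beta[1] + sum(beta[-1] * x)   # the linear combination

x_above <- c(2, 3);   y_above <- 1    # a class 1 point lying above the line
x_below <- c(-2, -3); y_below <- -1   # a class -1 point lying below the line

y_above * f(x_above)  #  1 *  5 = 5 > 0: correctly classified
y_below * f(x_below)  # -1 * -5 = 5 > 0: correctly classified
```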

Looking back at our diagram, note that the two dashed lines come quite close to certain observations. The solid line intuitively feels like a better decision boundary than the other two, as it separates the two classes without coming too close to either, traversing the middle of the space between them and thus distributing that space equally between the two classes. We can define a quantity known as the margin that a particular separating hyperplane generates, as the smallest perpendicular distance from any point in the data set to the hyperplane. For the best such line in two dimensions, which we'll shortly define as the maximal margin line, we will always have at least two points that lie at a perpendicular distance equal to the margin from the separating line, one on each side of the line. Sometimes, as is the case with our data, we may have more than two points whose perpendicular distance from the separating line equals the margin.
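It is worth recalling the standard formula for the perpendicular distance of the ith observation from a hyperplane, since it connects the margin to the optimization problem we will write down shortly:

$$d_i = \frac{\left| \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right|}{\sqrt{\sum_{k=1}^{p} \beta_k^2}}$$

In particular, when the coefficients are scaled so that $\sum_{k=1}^{p} \beta_k^2 = 1$, the distance is simply the absolute value of the linear combination itself, a normalization we will take advantage of when we formalize the problem.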

The next plot shows the margin of the solid line from the previous plot, demonstrating that we have three points at a distance equal to the margin from this separating line:

[Figure: The margin of the solid separating line, with three observations lying at a distance equal to the margin]

Now that we have the definition of the margin under our belt, we have a way to codify the intuition that led us to choose the solid line as the best decision boundary among the three lines in the first plot. We can go a step further and define the maximal margin hyperplane as the separating hyperplane whose margin is the largest among all possible separating hyperplanes. In our 2D example, we are essentially looking for the line that separates the two classes while at the same time lying as far away from the observations as possible. It turns out that the solid line from our example is actually the maximal margin line: no other line can be drawn with a margin larger than its margin of 2 units. This explains why we chose to label it as the line of maximum margin separation in our first plot.

In order to understand how we found the maximal margin hyperplane in our simple example, we need to formalize the problem as an optimization problem with p features using the following equations:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p, \, M} \; M$$

$$\text{subject to} \quad \sum_{k=1}^{p} \beta_k^2 = 1$$

$$y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) \ge M, \quad i = 1, \ldots, n$$

Together, these two constraints in our optimization problem express the idea that the observations in our data need not only to be correctly classified, but also to lie at least M units away from the separating hyperplane. The goal is to maximize this distance M by appropriately choosing the coefficients $\beta_0, \beta_1, \ldots, \beta_p$. Thus, we need an optimization procedure that can handle this type of problem. The details of how the optimization is actually implemented in practice are beyond the scope of this book, but we will see it in action later on when we do some programming with R.
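As a small preview of that, the following sketch fits a linear support vector machine with the svm() function from the e1071 package, using simulated data and a very large cost so that, on linearly separable data, the fit behaves like a maximal margin classifier. The package choice, the simulated data, and the parameter values here are illustrative assumptions rather than the exact setup we will use later:

```r
# A minimal sketch of maximal margin classification on separable data
library(e1071)

set.seed(1)
# Two well-separated clusters of 20 observations each in two dimensions
x <- rbind(matrix(rnorm(40), ncol = 2) + 3,
           matrix(rnorm(40), ncol = 2) - 3)
y <- factor(c(rep(1, 20), rep(-1, 20)))
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# A very large cost leaves essentially no room for margin violations, so on
# separable data the fitted linear SVM approximates the maximal margin line
fit <- svm(y ~ x1 + x2, data = train, kernel = "linear",
           cost = 1e5, scale = FALSE)

fit$index         # row indices of the support vectors (the points on the margin)
plot(fit, train)  # decision boundary with the support vectors highlighted
```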

We have a natural way forward now, which is to look at how the situation changes when our data is not linearly separable, something that we know by now is the typical scenario for a real-world data set. Before doing this, let's take a step back. We've already studied two different methods for estimating the parameters of a model, namely maximum likelihood estimation and the least squared error criterion for linear regression. For example, when we looked at classification with logistic regression, we considered the idea of maximizing the likelihood of our data, which takes all of the available data points into account. The same is true when classifying with multilayer perceptrons.

With the maximal margin classifier, however, the construction of our decision boundary is supported only by the points that lie on the margin. Put differently, with the data in our 2D example, we can freely adjust the position of any observation except the three on the margin, and as long as the adjustment does not result in the observation falling inside the margin, our separating line will stay in exactly the same position. For this reason, the vectors drawn perpendicularly from the points that lie on the margin to the separating hyperplane are known as the support vectors; our 2D example therefore has three support vectors. The fact that only a subset of all the points in our data essentially determines the placement of the separating hyperplane means that we have the potential to overfit our training data.
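Continuing the illustrative sketch from the previous code block (it assumes the fit and train objects defined there), we can verify this behavior directly: pick an observation that is not a support vector, move it further away from the decision boundary, refit, and confirm that the separating line does not move.

```r
# Recover the line's coefficients from the fitted linear SVM:
# w is a weighted sum of the support vectors, and the intercept is -rho
w_before <- t(fit$coefs) %*% fit$SV
b_before <- -fit$rho

# Choose an observation that is NOT a support vector and shift it further
# from the boundary (the direction of the shift depends on its class)
moved <- setdiff(seq_len(nrow(train)), fit$index)[1]
shift <- if (train$y[moved] == "1") 1 else -1
train2 <- train
train2[moved, c("x1", "x2")] <- train2[moved, c("x1", "x2")] + shift

refit <- svm(y ~ x1 + x2, data = train2, kernel = "linear",
             cost = 1e5, scale = FALSE)
w_after <- t(refit$coefs) %*% refit$SV
b_after <- -refit$rho

# The separating line is unchanged, up to numerical tolerance
all.equal(c(w_before, b_before), c(w_after, b_after), tolerance = 1e-4)
```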

On the other hand, this approach does have a couple of nice properties. We split the space between the two classes equally, without applying any bias toward either one. Moreover, points that lie well inside the region occupied by a particular class play a much smaller role in the model than points on the fringes of the two classes, which is precisely where the decision boundary needs to be placed.
