Support vector classification

We need our data to be linearly separable in order to classify with a maximal margin classifier. When our data is not linearly separable, we can still use the notion of support vectors that define a margin, but this time, we will allow some examples to be misclassified. Thus, we essentially define a soft margin, in that some of the observations in our data set are allowed to violate the constraint that they must be at least as far from the separating hyperplane as the margin. It is also important to note that sometimes, we may want to use a soft margin even for linearly separable data, in order to limit the degree to which we overfit the data. Note that the larger the margin, the more confident we are about our ability to correctly classify new observations, because the classes are further apart from each other in our training data. If we achieve separation using a very small margin, we are less confident about our ability to correctly classify new observations, and we may instead want to allow a few errors in exchange for a larger margin that is more robust. Study the following plot:

[Figure: the maximal margin classifier refit after adding a new class 1 observation close to the class -1 boundary; the margin is visibly narrower]

In order to get a firmer grasp of why a soft margin may be preferable to a hard margin even for linearly separable data, we've changed our data slightly. We used the same data that we had previously, but we added an extra observation to class 1 and placed it close to the boundary of class -1. Note that with the addition of this single new data point, with feature values f1 = 16 and f2 = 40, our maximal margin line has moved drastically! The margin has shrunk from 2 units to 0.29 units. Looking at this graph, we might suspect that the new point is either an outlier or a mislabeled observation in our data set. If we were to allow our model to make a single misclassification using a soft margin, we would go back to our previous line, which separates the two classes with a much wider margin and is less likely to have overfit the data. We formalize the notion of our soft classifier by modifying our optimization problem setup:

$$
\begin{aligned}
& \underset{\beta_0, \beta_1, \ldots, \beta_p,\; \xi_1, \ldots, \xi_n}{\text{maximize}} && M \\
& \text{subject to} && \sum_{j=1}^{p} \beta_j^2 = 1, \\
& && y_i \left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \right) \ge M (1 - \xi_i), \quad i = 1, \ldots, n, \\
& && \xi_i \ge 0, \qquad \sum_{i=1}^{n} \xi_i \le C
\end{aligned}
$$

Under this new setup, we've introduced a new set of variables, ξi, known as the slack variables. There is one slack variable for every observation in our data set, and the value of the slack variable ξi depends on where the ith observation falls with respect to the margin. When an observation is on the correct side of the separating hyperplane and outside the margin, the slack variable for that observation takes the value 0. This is the ideal situation that we saw for all observations under a hard margin. When an observation is correctly classified but falls inside the margin, the corresponding slack variable takes a positive value less than 1. When an observation is actually misclassified, thus falling on the wrong side of the hyperplane altogether, its associated slack variable takes a value greater than 1. In summary, take a look at the following:

Slack variable value      Position of the ith observation
ξi = 0                    Correct side of the hyperplane, outside the margin
0 < ξi < 1                Correct side of the hyperplane, but inside the margin
ξi > 1                    Wrong side of the hyperplane (misclassified)

When an observation is incorrectly classified, the magnitude of its slack variable grows with the distance between that observation and the separating hyperplane. The fact that the sum of the slack variables must be less than a constant C means that we can think of this constant as an error budget that we are prepared to tolerate. As the misclassification of a single observation results in its slack variable taking a value of at least 1, and our constant C bounds the sum of all the slack variables, setting a value of C less than 1 means that our model will tolerate a few observations falling inside the margin but no misclassifications. A high value of C often results in many observations either falling inside the margin or being misclassified, and as these are all support vectors, we end up with a greater number of support vectors. This results in a model that has a lower variance, but because we have shifted our boundary in a way that has increased tolerance to margin violations and errors, we may have a higher bias. By contrast, a lower value of C produces a much stricter model that depends on fewer support vectors, which may result in a lower bias in our model. These support vectors, however, will individually affect the position of our boundary to a much greater degree. Consequently, we will experience a higher variance in our model performance across different training sets. Once again, the interplay between model bias and variance resurfaces in the design decisions that we must make as predictive modelers.
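To see this trade-off in practice, here is a minimal sketch using the e1071 package (an assumption on our part; the package and the simulated data below are not part of this chapter's examples). Note that the cost parameter of e1071's svm() function is the penalty on margin violations from the C-classification formulation, so it behaves inversely to the budget C described above: a small cost tolerates many violations, while a large cost penalizes them heavily.

# A minimal sketch (assumes the e1071 package is installed); the data is
# simulated purely for illustration and is not the chapter's data set.
library(e1071)

set.seed(101)
# Two slightly overlapping classes in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 1.5), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

# Fit a linear support vector classifier for several values of cost and
# report how many observations end up as support vectors
for (cost in c(0.01, 1, 100)) {
  model <- svm(x, y, type = "C-classification",
               kernel = "linear", cost = cost, scale = FALSE)
  cat("cost =", cost, "-> support vectors:", length(model$index), "\n")
}

With a small cost (a generous error budget), most of the observations become support vectors and the boundary is stable but potentially biased; with a large cost, only the few observations closest to the boundary remain support vectors, and the fit becomes more sensitive to them.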

Inner products

The exact details of how the parameters of the support vector classifier model are computed are beyond the scope of this book. However, it turns out that the model itself can be simplified into a more convenient form that uses the inner products of the observations. An inner product of two vectors, v1 and v2, of identical length is computed by first computing the element-wise multiplication of the two vectors and then taking the sum of the resulting elements. In R, we obtain an element-wise multiplication of two vectors by simply using the multiplication symbol, so we can compute the inner product of two vectors as follows:

> v1 <- c(1.2, 3.3, -5.6, 4.5, 0, 9.0)
> v2 <- c(-3.5, 0.1, -0.2, 1.0, -8.7, 0)
> v1 * v2
[1] -4.20  0.33  1.12  4.50  0.00  0.00
> inner_product <- sum(v1 * v2)
> inner_product
[1] 1.75
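The same value can also be computed with R's matrix multiplication operator, %*%, which returns the result as a 1 x 1 matrix:

> v1 %*% v2
     [,1]
[1,] 1.75
> drop(v1 %*% v2)
[1] 1.75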

In mathematical terms, we use angle brackets to denote the inner product operation, and we represent the process as follows:

$$\langle v_1, v_2 \rangle = \sum_{i=1}^{p} v_{1i}\, v_{2i}$$

In the preceding equation, for the two vectors v1 and v2, the index i is iterating over the p features or dimensions. Now, here is the original form of our support vector classifier:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

This is just the standard equation for a linear combination of the input features. It turns out that for the support vector classifier, the model's solution can be expressed in terms of the inner products between the observation x that we are trying to classify and the observations xi in our training data set. More concretely, the form of our support vector classifier can also be written as:

$$\hat{y}(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle$$

For this equation, we have explicitly indicated that our model predicts y as a function of an input observation x. The sum now computes a weighted sum of the inner products of the current observation with every other observation in the data set, which is why we are now summing across the n observations. We want to make it very clear that we haven't changed anything in the original model itself; we have simply written two different representations of the same model. Note that we cannot assume that a linear model takes this form in general; this is only true for the support vector classifier. Now, in a real-world scenario, the number of observations in our data set, n, is typically much greater than the number of features, p, so the number of α coefficients is considerably larger than the number of β coefficients. Additionally, whereas in the first equation we were considering observations independently of each other, the form of the second equation shows us that to classify all our observations, we need to consider all possible pairs of observations and compute their inner products. There are $\binom{n}{2} = \frac{n(n-1)}{2}$ such pairs, which is of the order of n². Thus, it would seem that we are introducing complexity rather than producing a simpler representation. It turns out, however, that the α coefficients are zero for all observations in our data set except those that are support vectors.

The number of support vectors in our data set is typically much smaller than the total number of observations. Thus, we can simplify our new representation by explicitly showing that we sum over elements from the set of support vectors, S, in our data set:

$$\hat{y}(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle$$
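To make this concrete, the following is a minimal, self-contained sketch, again assuming the e1071 package and simulated data (neither of which comes from this chapter). In e1071's fitted model, the SV component holds the support vectors, the coefs component holds their α coefficients multiplied by the class labels, and -rho plays the role of the intercept β0, so the decision value for a new point can be recovered from the support vectors alone:

# A minimal sketch (assumes e1071); the data is simulated for illustration only.
library(e1071)

set.seed(101)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 1.5), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))

model <- svm(x, y, type = "C-classification",
             kernel = "linear", cost = 1, scale = FALSE)

x_new <- c(0.5, 0.5)

# Weighted sum of inner products between x_new and the support vectors only,
# plus the intercept (-rho): this mirrors the representation above
manual <- sum(model$coefs * (model$SV %*% x_new)) - model$rho

# Compare with the decision value that predict() reports for the same point
reported <- attr(predict(model, t(x_new), decision.values = TRUE),
                 "decision.values")
c(manual = manual, reported = as.numeric(reported))

The two numbers should agree, illustrating that only the support vectors, through their nonzero α coefficients, contribute to the classification of a new observation.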