Chapter 6. Building Support Vector Machines

In this chapter, we will explore Support Vector Machines (SVMs). We will study several SVM implementations in Clojure that can be used to build and train an SVM using some given training data.

SVMs are supervised learning models that are used for both regression and classification. In this chapter, however, we will focus on the problem of classification within the context of SVMs. SVMs find applications in text mining, chemical classification, and image and handwriting recognition. Of course, we should not overlook the fact that the overall performance of a machine learning model mostly depends on the amount and nature of the training data and is also affected by which machine learning model we use to model the available data.

In its simplest form, an SVM separates two classes of data, and predicts the class of a given sample, by estimating the optimal separating hyperplane between the two classes represented in vector space. A hyperplane can be simply defined as a plane that has one less dimension than the ambient space. For a three-dimensional space, we would obtain a two-dimensional hyperplane.

A basic SVM is a non-probabilistic binary classifier that uses linear classification. In addition to linear classification, SVMs can also be used to perform nonlinear classification over several classes. An interesting aspect of SVMs is that the estimated hyperplane maintains a substantially large and distinct gap, or margin, between the two classes of input values. Due to this, SVMs often have good generalization performance and also implement a kind of automatic complexity control to avoid overfitting. Hence, SVMs are also called large margin classifiers. In this chapter, we will also study how SVMs achieve this large margin between classes of input data, when compared to other classifiers. Another interesting fact about SVMs is that they scale very well with the number of features being modeled, and thus SVMs are often used in machine learning problems that deal with a large number of features.

Understanding large margin classification

As we previously mentioned, SVMs classify input data across large margins. Let's examine how this is achieved. We use our definition of a logistic classification model, which we previously described in Chapter 3, Categorizing Data, as a basis for reasoning about SVMs.

We can use the logistic or sigmoid function to separate two classes of input values, as we described in Chapter 3, Categorizing Data. This function can be formally defined as a function of an input variable X as follows:

$$Y = \frac{1}{1 + e^{-\beta^T X}}$$

In the preceding equation, the output variable $Y$ depends not only on the variable $X$, but also on the coefficient $\beta$. The variable $X$ is analogous to the vector of input values in our model, and the term $\beta$ is the parameter vector of the model. For binary classification, the value of $Y$ must lie in the range of 0 to 1. Also, the class of a set of input values is determined by whether the output variable $Y$ is closer to 0 or to 1. For these values of $Y$, the term $\beta^T X$ is either much greater than or much less than 0. This can be formally expressed as follows:

$$Y \approx 1 \;\Rightarrow\; \beta^T X \gg 0, \qquad Y \approx 0 \;\Rightarrow\; \beta^T X \ll 0$$
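To make this concrete, the following is a minimal Clojure sketch of the logistic function and the resulting class decision. The names sigmoid, logistic-prob, and logistic-class are our own, used purely for illustration, and are not taken from any library used later in this chapter:

```clojure
;; The logistic (sigmoid) function applied to z = beta^T X.
(defn sigmoid [z]
  (/ 1.0 (+ 1.0 (Math/exp (- z)))))

;; Estimated output Y for an input vector x and parameter vector beta.
(defn logistic-prob [beta x]
  (sigmoid (reduce + (map * beta x))))

;; The predicted class is determined by whether Y is closer to 0 or to 1,
;; that is, by the sign of beta^T X.
(defn logistic-class [beta x]
  (if (>= (logistic-prob beta x) 0.5) 1 0))
```

For example, (logistic-class [2.0 -1.0] [1.0 3.0]) evaluates the model for a two-feature input and returns 0, since $\beta^T X = -1$ in this case.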

For $N$ training samples with input vectors $X_i$ and output values $Y_i$, we define the cost function $J(\beta)$ as follows:

$$J(\beta) = \frac{1}{N}\sum_{i=1}^{N}\left[-\,Y_i \log\left(Y_i'\right) - \left(1 - Y_i\right)\log\left(1 - Y_i'\right)\right]$$

Note

Note that the term $Y_i'$ represents the output variable calculated from the estimated model.

For a logistic classification model, $Y_i'$ is the value of the logistic function when applied to the set of input values $X_i$. We can simplify and expand the summation term in the cost function defined in the preceding equation, as follows:

$$J(\beta) = \frac{1}{N}\sum_{i=1}^{N}\left[-\,Y_i \log\!\left(\frac{1}{1 + e^{-\beta^T X_i}}\right) - \left(1 - Y_i\right)\log\!\left(1 - \frac{1}{1 + e^{-\beta^T X_i}}\right)\right]$$

It's obvious that the cost function shown in the preceding expression depends on the two logarithmic terms in the expression. Thus, we can represent the cost function as a function of these two logarithmic terms, which we will denote as $\text{cost}_1$ and $\text{cost}_0$. Now, let's define these two terms as shown in the following equation:

$$\text{cost}_1\!\left(\beta^T X\right) = -\log\!\left(\frac{1}{1 + e^{-\beta^T X}}\right), \qquad \text{cost}_0\!\left(\beta^T X\right) = -\log\!\left(1 - \frac{1}{1 + e^{-\beta^T X}}\right)$$
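As a small illustration, these two terms can be written directly in Clojure using the sigmoid function from the earlier sketch; the function names here are again our own, hypothetical helpers:

```clojure
;; The two logarithmic cost terms as functions of z = beta^T X.
(defn logistic-cost-1 [z]
  (- (Math/log (sigmoid z))))

(defn logistic-cost-0 [z]
  (- (Math/log (- 1.0 (sigmoid z)))))
```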

Both the functions $\text{cost}_1$ and $\text{cost}_0$ are composed using the logistic function. A classifier that models the logistic function must be trained such that these two functions are minimized over all possible values of the parameter vector $\beta$. We can use the hinge-loss function to approximate the desired behavior of a linear classifier that uses the logistic function (for more information, refer to "Are Loss Functions All the Same?"). We will now study the hinge-loss function by comparing it to the logistic function. The following diagram depicts how the $\text{cost}_1$ function must vary with respect to the term $\beta^T X$ and how it can be modeled using the logistic and hinge-loss functions:

[Figure: the $\text{cost}_1$ term plotted against $\beta^T X$, modeled by the logistic function (a smooth curve) and by the hinge-loss function (two line segments)]

In the plot shown in the preceding diagram, the logistic function is represented as a smooth curve. The function decreases rapidly up to a given point and then decreases at a lower rate. In this example, the point at which this change in the rate of decrease occurs is $\beta^T X = 0$. The hinge-loss function approximates this behavior by using two line segments that meet at the point $\beta^T X = 0$. Interestingly, both these functions decrease toward zero as the value of $\beta^T X$ increases. Similarly, we can approximate the effect of the $\text{cost}_0$ function using the hinge-loss function, as follows:

[Figure: the $\text{cost}_0$ term plotted against $\beta^T X$, modeled by the logistic and hinge-loss functions]

Note that the $\text{cost}_0$ function increases with the term $\beta^T X$, in contrast to $\text{cost}_1$. Thus, we can achieve the classification ability of the logistic function by modeling the hinge-loss function, and a classifier built using the hinge-loss function will perform comparably to a classifier that uses the logistic function.

As seen in the preceding diagrams, the hinge-loss function only changes its value at the point $\beta^T X = 0$. This applies to both the functions $\text{cost}_1$ and $\text{cost}_0$. Thus, we can use the hinge-loss function to separate two classes of data depending on whether the value of $\beta^T X$ is greater or less than 0. In this case, there's virtually no margin of separation between the two classes. To improve the margin of classification, we can modify the hinge-loss function such that $\text{cost}_1$ is greater than 0 only when $\beta^T X < 1$, and $\text{cost}_0$ is greater than 0 only when $\beta^T X > -1$.

The modified hinge-loss functions can be plotted for the two classes of data as follows. The following plot describes the case where $Y = 1$:

[Figure: the modified hinge-loss function $\text{cost}_1$, which is zero for $\beta^T X \ge 1$]

Similarly, the modified hinge-loss function for the case $Y = 0$ can be illustrated by the following plot:

[Figure: the modified hinge-loss function $\text{cost}_0$, which is zero for $\beta^T X \le -1$]

Note that the hinge occurs at $\beta^T X = -1$ in the case of $Y = 0$.
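The modified hinge-loss terms shown in the two preceding plots can be sketched in Clojure as follows; the names hinge-cost-1 and hinge-cost-0 are illustrative:

```clojure
;; Hinge-loss approximations of the two cost terms, with the hinge at
;; z = 1 for the class Y = 1 and at z = -1 for the class Y = 0,
;; where z stands for beta^T X.
(defn hinge-cost-1 [z]
  (max 0.0 (- 1.0 z)))   ; zero when z >= 1

(defn hinge-cost-0 [z]
  (max 0.0 (+ 1.0 z)))   ; zero when z <= -1
```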

If we substitute the hinge-loss functions in place of the $\text{cost}_1$ and $\text{cost}_0$ functions, we arrive at the optimization problem of SVMs (for more information, refer to "Support-vector networks"), which can be formally written as follows:

$$\min_{\beta} \; \frac{1}{N}\sum_{i=1}^{N}\left[Y_i\,\text{cost}_1\!\left(\beta^T X_i\right) + \left(1 - Y_i\right)\text{cost}_0\!\left(\beta^T X_i\right)\right] + \frac{\lambda}{2N}\sum_{j=1}^{n}\beta_j^2$$

In the preceding equation, the term $\lambda$ is the regularization parameter. When $\lambda$ is small, the behavior of the SVM is affected more by the cost terms than by the regularization term, and vice versa when $\lambda$ is large. In some contexts, the regularization in the optimization problem is instead expressed through a constant $C$, where $C$ is analogous to $1/\lambda$. This representation of the optimization problem can be formally expressed as follows:

$$\min_{\beta} \; C\sum_{i=1}^{N}\left[Y_i\,\text{cost}_1\!\left(\beta^T X_i\right) + \left(1 - Y_i\right)\text{cost}_0\!\left(\beta^T X_i\right)\right] + \frac{1}{2}\sum_{j=1}^{n}\beta_j^2$$
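Under the assumptions of the earlier sketches, the value of this objective for a candidate parameter vector can be computed as follows; dot-product and svm-objective are hypothetical helper names, not part of any SVM library:

```clojure
(defn dot-product [u v]
  (reduce + (map * u v)))

;; Value of the regularized hinge-loss objective for a candidate beta,
;; given the regularization constant c, the input vectors xs, and the
;; 0/1 labels ys. Uses hinge-cost-1 and hinge-cost-0 from the earlier sketch.
(defn svm-objective [c beta xs ys]
  (+ (* c (reduce + (map (fn [x y]
                           (let [z (dot-product beta x)]
                             (+ (* y (hinge-cost-1 z))
                                (* (- 1 y) (hinge-cost-0 z)))))
                         xs ys)))
     (* 0.5 (reduce + (map #(* % %) beta)))))
```

An SVM training algorithm searches for the $\beta$ that minimizes this value; the sketch only evaluates the objective and does not perform the minimization itself.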

As we only deal with two classes of data in which $Y$ is either 0 or 1, we can rewrite the optimization problem described previously, as follows:

$$\min_{\beta} \; \frac{1}{2}\sum_{j=1}^{n}\beta_j^2 \quad \text{such that} \quad \beta^T X_i \ge 1 \;\text{ if }\; Y_i = 1, \qquad \beta^T X_i \le -1 \;\text{ if }\; Y_i = 0$$

Let's try to visualize the behavior of an SVM on some training data. Suppose we have two input variables $X_1$ and $X_2$ in our training data. The input values and their classes can be represented by the following plot:

[Figure: sample training data with two classes plotted against the input variables $X_1$ and $X_2$, together with several candidate decision boundaries]

In the preceding plot, the two classes in the training data are represented as circles and squares. A linear classifier will attempt to partition these samples into two distinct classes and will produce a decision boundary that could be any one of the lines shown in the plot. Of course, the classifier should strive to minimize the overall error of the formulated model, while also finding a model that generalizes the data well. An SVM will also attempt to partition the sample data into two classes, just like any other classification model. However, an SVM determines a hyperplane of separation that has the largest possible margin between the two classes of input data.

This behavior of an SVM can be illustrated using the following plot:

[Figure: the maximum-margin hyperplane of separation estimated by an SVM, with the margin on either side of it]

As shown in the preceding plot, an SVM will determine the optimal hyperplane that separates the two classes of data with the maximum possible margin between them. From the optimization problem of an SVM, which we previously described, we can show that the equation of the hyperplane of separation estimated by the SVM is as follows:

$$\beta^T X - \beta_0 = 0$$

Note

Note that in the preceding equation, the constant $\beta_0$ is simply the y-intercept of the hyperplane.

To understand more about how an SVM achieves this large margin of separation, we need to use some elementary vector arithmetic. Firstly, we can define the length of a given vector as follows:

$$\|U\| = \sqrt{u_1^2 + u_2^2 + \cdots + u_n^2} = \sqrt{U^T U}$$
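In Clojure, this length can be computed with a small helper function such as the following (the name length is ours):

```clojure
;; Euclidean length (norm) of a vector.
(defn length [u]
  (Math/sqrt (reduce + (map #(* % %) u))))

;; (length [3 4]) ;=> 5.0
```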

Another operation that is often used to describe SVMs is the inner product of two vectors. The inner product of two given vectors can be formally defined as follows:

$$U \cdot V = U^T V = \sum_{i=1}^{n} u_i v_i$$

Note

Note that the inner product of two vectors only exists if the two vectors have the same number of elements.

As shown in the preceding equation, the inner product $U \cdot V$ of the two vectors $U$ and $V$ is equal to the product of the transpose of $U$ and the vector $V$. Another way to represent the inner product of two vectors is in terms of the projection of one vector onto the other, which is given as follows:

$$U \cdot V = p\,\|U\|, \quad \text{where } p \text{ is the length of the projection of } V \text{ onto } U$$

Note that the term $p\,\|U\|$ is equivalent to the product $U^T V$ of the transpose of the vector $U$ and the vector $V$. Since the expression $\beta^T X_i$ is thus equivalent to the product $p_i\,\|\beta\|$, where $p_i$ is the projection of the input vector $X_i$ onto the parameter vector $\beta$, we can rewrite the optimization problem described earlier in terms of these projections. This can be formally expressed as follows:

$$\min_{\beta} \; \frac{1}{2}\sum_{j=1}^{n}\beta_j^2 \quad \text{such that} \quad p_i\,\|\beta\| \ge 1 \;\text{ if }\; Y_i = 1, \qquad p_i\,\|\beta\| \le -1 \;\text{ if }\; Y_i = 0$$

Hence, an SVM attempts to minimize the squared sum of the elements in the parameter vector $\beta$ while ensuring that the optimal hyperplane separating the two classes of data lies between the two planes $\beta^T X = 1$ and $\beta^T X = -1$. The training samples that lie on these two planes are called the support vectors of the SVM. Since we must minimize the values of the elements in the parameter vector $\beta$, the magnitude of the projection $p_i$ must be large enough to ensure that $p_i\,\|\beta\| \ge 1$ or $p_i\,\|\beta\| \le -1$, depending on the class of the sample:

[Figure: the projections of the training samples onto the parameter vector $\beta$, illustrating the margin of separation]

Thus, the SVM will ensure that the projection of each input vector $X_i$ onto the parameter vector $\beta$ is as large as possible in magnitude. This implies that the SVM will find the largest possible margin between the two classes of input values in the training data.
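To make the vector operations used in this argument concrete, the inner product and the projection length can be sketched in Clojure as follows, reusing the length function from the earlier sketch; inner-product and projection-length are illustrative names:

```clojure
;; Inner (dot) product of two vectors of the same dimension.
(defn inner-product [u v]
  (reduce + (map * u v)))

;; Signed length p of the projection of v onto u, so that
;; (inner-product u v) equals (* p (length u)).
(defn projection-length [u v]
  (/ (inner-product u v) (length u)))
```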

Alternative forms of SVMs

We will now describe a couple of alternative forms used to represent an SVM. The remainder of this section can be safely skipped, but the reader is advised to become familiar with these forms, as they are also widely used notations for SVMs.

If $w$ is the normal to the hyperplane estimated by an SVM, we can represent this hyperplane of separation using the following equation:

$$w \cdot X - b = 0$$

Note

Note that in the preceding equation, the term $b$ is the y-intercept of the hyperplane and is analogous to the term $\beta_0$ in the equation of the hyperplane that we previously described.

The two peripheral support vectors of this hyperplane have the following equations:

$$w \cdot X - b = 1 \qquad \text{and} \qquad w \cdot X - b = -1$$

We can use the expression $w \cdot X - b$ to determine the class of a given set of input values. If the value of this expression is less than or equal to -1, then we can say that the input values belong to one of the two classes of data. Similarly, if the value of the expression $w \cdot X - b$ is greater than or equal to 1, the input values are predicted to belong to the second class. In this form, the two classes are labeled as $y_i = -1$ and $y_i = 1$, respectively. This can be formally expressed as follows:

$$w \cdot X_i - b \le -1 \;\text{ if }\; y_i = -1, \qquad w \cdot X_i - b \ge 1 \;\text{ if }\; y_i = 1$$

The two inequalities described in the preceding equation can be combined into a single inequality, as follows:

$$y_i\left(w \cdot X_i - b\right) \ge 1$$
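Given an estimated normal w and intercept b, the resulting decision rule can be sketched in Clojure as follows; classify is a hypothetical helper name:

```clojure
;; Predicted class (+1 or -1) for the input vector x, based on the
;; sign of the decision value (w . x) - b.
(defn classify [w b x]
  (if (neg? (- (reduce + (map * w x)) b))
    -1
    1))
```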

Thus, we can concisely rewrite the optimization problem of SVMs as follows:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{such that} \quad y_i\left(w \cdot X_i - b\right) \ge 1, \quad i = 1, \ldots, N$$

In the constrained problem defined in the preceding equation, we use the normal $w$ instead of the parameter vector $\beta$ to parameterize the optimization problem. By using the Lagrange multipliers $\alpha_i$, we can express this optimization problem as follows:

$$\min_{w,\,b}\;\max_{\alpha_i \ge 0} \; \left\{\frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i\left(w \cdot X_i - b\right) - 1\right]\right\}$$

This form of the optimization problem of an SVM is known as the primal form. Note that, in practice, only a few of the Lagrange multipliers will have a value greater than 0. Also, the solution can be expressed as a linear combination of the input vectors $X_i$ and the output variables $y_i$, as follows:

$$w = \sum_{i=1}^{N} \alpha_i\, y_i\, X_i$$

We can also express the optimization problem of an SVM in the dual form, which is a constrained representation that can be described as follows:

$$\max_{\alpha} \; \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\,y_i\,y_j\,k\!\left(X_i, X_j\right) \quad \text{such that} \quad \alpha_i \ge 0, \qquad \sum_{i=1}^{N}\alpha_i\,y_i = 0$$

In the constrained problem described in the preceding equation, the function $k(X_i, X_j)$ is called the kernel function, and we will discuss more about the role of this function in SVMs in later sections of this chapter.
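As a preview of the kernel function's role, the following sketch shows the simplest possible kernel, the inner product, and how the dual form uses kernel evaluations against the training samples to classify a new input. All names here (linear-kernel, dual-decision) are our own, and the intercept handling follows the $w \cdot X - b$ convention used above:

```clojure
;; The linear kernel: simply the inner product of its two arguments.
(defn linear-kernel [u v]
  (reduce + (map * u v)))

;; Class predicted by the dual form: the sign of the sum of kernel
;; evaluations against the training samples xs, weighted by the
;; Lagrange multipliers alphas and the labels ys (in {-1, 1}),
;; shifted by the intercept b.
(defn dual-decision [kernel alphas xs ys b x]
  (let [s (reduce + (map (fn [a xi yi] (* a yi (kernel xi x)))
                         alphas xs ys))]
    (if (neg? (- s b)) -1 1)))
```

Replacing linear-kernel with a nonlinear kernel is what allows an SVM to perform nonlinear classification, which we will return to later in this chapter.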
