Case 2 – determining the optimal hyperplane

Look at the following example: hyperplane C is the preferred one because it maximizes the sum of the distance from the nearest data point on the positive side to the hyperplane and the distance from the nearest data point on the negative side to the hyperplane:

The nearest point(s) on the positive side constitute a hyperplane parallel to the decision hyperplane, which we call the Positive hyperplane; likewise, the nearest point(s) on the negative side constitute the Negative hyperplane. The perpendicular distance between the positive and negative hyperplanes is called the Margin, and it equals the sum of the two aforementioned distances. A Decision hyperplane is deemed optimal if this margin is maximized.

The optimal (also called maximum-margin) hyperplane and the distance margins for a trained SVM model are illustrated in the following diagram. Again, the samples on the margin (two from one class and one from the other class, as shown) are the so-called support vectors:

We can describe this mathematically by first defining the positive and negative hyperplanes as follows:

$$wx^{(p)} + b = 1$$

$$wx^{(n)} + b = -1$$

Here, x^(p) is a data point on the positive hyperplane and x^(n) is a data point on the negative hyperplane, respectively.

The distance from x^(p) to the decision hyperplane can be calculated as follows:

$$\frac{wx^{(p)} + b}{\|w\|} = \frac{1}{\|w\|}$$

Similarly, the distance from x^(n) to the decision hyperplane is as follows:

$$\frac{|wx^{(n)} + b|}{\|w\|} = \frac{1}{\|w\|}$$

So the margin becomes 2/||w||. As a result, we need to minimize ||w|| in order to maximize the margin. Importantly, to comply with the fact that the support vectors on the positive and negative hyperplanes are the nearest data points to the decision hyperplane, we add a condition that no data point falls between the positive and negative hyperplanes:

$$wx^{(i)} + b \geq 1, \quad \text{if } y^{(i)} = 1$$

$$wx^{(i)} + b \leq -1, \quad \text{if } y^{(i)} = -1$$

Here, (x^(i), y^(i)) is an observation. These two conditions can be combined into the following:

$$y^{(i)}(wx^{(i)} + b) \geq 1$$

To summarize, w and b, which determine the SVM decision hyperplane, are learned by solving the following optimization problem (a quick numerical check of these conditions is sketched right after the list):

  • Minimizing ||w||
  • Subject to y^(i)(wx^(i) + b) ≥ 1, for a training set of (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))
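
For intuition, here is a minimal NumPy sketch that uses a made-up toy dataset and a hand-picked candidate w and b (purely illustrative values, not the output of any solver) to verify the constraint y^(i)(wx^(i) + b) ≥ 1 and evaluate the corresponding margin 2/||w||:

import numpy as np

# Hypothetical, linearly separable 2D toy data (made-up values).
X = np.array([[1.0, 3.0], [2.0, 4.0],   # positive class
              [3.0, 1.0], [4.0, 2.0]])  # negative class
y = np.array([1, 1, -1, -1])

# A hand-picked candidate hyperplane wx + b = 0 (for illustration only).
w = np.array([-1.0, 1.0])
b = 0.0

# Feasibility check: every sample must satisfy y^(i)(wx^(i) + b) >= 1.
print(y * (X @ w + b))          # [2. 2. 2. 2.] -> all >= 1, so (w, b) is feasible

# Margin achieved by this candidate: 2 / ||w||.
print(2 / np.linalg.norm(w))    # ~1.414

Training an SVM amounts to searching over all feasible (w, b) pairs for the one that yields the largest such margin.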

To solve this optimization problem, we need to resort to quadratic programming techniques, which are beyond the scope of our learning journey. Therefore, we will not cover the computation methods in detail, and will instead implement the classifier with the SVC and LinearSVC classes from scikit-learn, which are built on libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear/), respectively, two popular open source SVM libraries. That said, it is always worthwhile to understand the concepts behind computing SVMs.
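
As a preview of that implementation, the following sketch fits both classes on the same made-up toy data used above; SVC with kernel='linear' and LinearSVC both learn w and b, exposed as coef_ and intercept_ (the two solvers use slightly different formulations, so the learned values need not match exactly):

import numpy as np
from sklearn.svm import SVC, LinearSVC

# The same made-up toy data as above.
X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1, 1, -1, -1])

# SVC (libsvm-based) with a linear kernel.
svc = SVC(kernel='linear', C=1.0)
svc.fit(X, y)
print(svc.coef_, svc.intercept_)     # learned w and b
print(svc.support_vectors_)          # the support vectors it found

# LinearSVC (liblinear-based) solves a closely related linear formulation.
linear_svc = LinearSVC(C=1.0)
linear_svc.fit(X, y)
print(linear_svc.coef_, linear_svc.intercept_)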

Shai Shalev-Shwartz et al., "Pegasos: Primal estimated sub-gradient solver for SVM" (Mathematical Programming, March 2011, volume 127, issue 1, pp. 3-30), and Cho-Jui Hsieh et al., "A dual coordinate descent method for large-scale linear SVM" (Proceedings of the 25th International Conference on Machine Learning, pp. 408-415) are great learning materials. They cover two modern approaches, sub-gradient descent and coordinate descent, respectively.
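
To give a flavor of the first approach, here is a minimal, Pegasos-style sub-gradient descent sketch. It is my own simplified rendering of the update rule, with the bias term and the optional projection step omitted for brevity, so treat it as illustrative rather than a faithful reproduction of the paper:

import numpy as np

def pegasos_style_fit(X, y, lam=0.01, n_iters=1000, seed=42):
    """Simplified Pegasos-style sub-gradient descent for a linear SVM."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))      # sample one training example at random
        eta = 1.0 / (lam * t)         # decaying learning rate
        if y[i] * (w @ X[i]) < 1:     # the example violates the margin
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                         # no margin violation: only shrink w
            w = (1 - eta * lam) * w
    return w

# Example usage with the made-up toy data from earlier:
X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1, 1, -1, -1])
print(pegasos_style_fit(X, y))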

The learned model parameters w and b are then used to classify a new sample x', based on the following conditions:

$$y' = 1, \quad \text{if } wx' + b > 0$$

$$y' = -1, \quad \text{if } wx' + b < 0$$

Moreover, |wx' + b|, which is proportional to the distance from the data point x' to the decision hyperplane, can be interpreted as the confidence of the prediction: the higher the value, the further the point is from the decision boundary, and hence the more certain the prediction.
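
In scikit-learn, these signed values are available via decision_function: the sign gives the predicted class and the magnitude reflects the prediction confidence. A small sketch on the earlier made-up toy data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 3.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear', C=1.0).fit(X, y)

x_new = np.array([[1.0, 4.0], [4.0, 1.0]])
print(clf.decision_function(x_new))  # signed values of wx' + b for each new sample
print(clf.predict(x_new))            # class labels taken from the sign of wx' + b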

Although you might be eager to implement the SVM algorithm right away, let's take a step back and look at a common scenario where data points are not strictly linearly separable. Try to find a separating hyperplane in the following example:
