Case 5 – solving linearly non-separable problems

The hyperplanes we have found so far are linear, for instance, a line in a two-dimensional feature space, or a plane in a three-dimensional one. However, in the following example, we are unable to find any linear hyperplane that separates the two classes:

Intuitively, we observe that data points from one class are closer to the origin than those from the other class. The distance to the origin therefore provides distinguishable information. So we add a new feature, $z = x_1^2 + x_2^2$ (the squared distance to the origin), and transform the original two-dimensional space into a three-dimensional one. In the new space, as displayed in the following plot, we can find a hyperplane separating the data, which appears as a line in the two-dimensional view. With the additional feature, the dataset becomes linearly separable in the higher-dimensional space, $(x_1, x_2, z)$:

Based on similar logic, SVMs with kernels were invented to solve non-linear classification problems by converting the original feature space, $x^{(i)}$, to a higher-dimensional feature space with a transformation function, $\Phi$, such that the transformed dataset $\Phi(x^{(i)})$ is linearly separable. A linear hyperplane $(w, b)$ is then learned using the observations $(\Phi(x^{(i)}), y^{(i)})$. For an unknown sample $x'$, it is first transformed into $\Phi(x')$; the predicted class is determined by the sign of $w \cdot \Phi(x') + b$.
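
To make this workflow concrete, here is a minimal sketch that explicitly applies the squared-distance mapping from the previous example, $\Phi((x_1, x_2)) = (x_1, x_2, x_1^2 + x_2^2)$, trains a linear SVM in the new space, and classifies new samples with the learned hyperplane (the data points and variable names here are our own illustration, not ones used elsewhere in this chapter):

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> # one class close to the origin, the other far from it
>>> X_toy = np.array([[0.2, 0.1], [-0.3, 0.2], [0.1, -0.4], [-0.2, -0.1],
...                   [2.5, 0.0], [0.0, -2.8], [-2.2, 1.5], [1.8, -2.0]])
>>> y_toy = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
>>> def transform(X):
...     """Explicit mapping (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
...     return np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
...
>>> # learn a linear hyperplane in the transformed three-dimensional space
>>> svm_linear = SVC(kernel='linear').fit(transform(X_toy), y_toy)
>>> # a new sample is transformed first; its class is the sign of w . phi(x') + b
>>> x_new = np.array([[0.5, 0.5], [2.0, 2.0]])
>>> svm_linear.predict(transform(x_new))
array([-1,  1])

The explicit mapping is cheap here because the new space has only three dimensions; the point of kernels, discussed next, is to avoid constructing the high-dimensional space at all.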

An SVM with kernels enables non-linear separation, but it does not explicitly map each original data point into the high-dimensional space and then perform expensive computation in that space. Instead, it takes a clever shortcut:

During the course of solving the SVM quadratic optimization problem, feature vectors are involved only in the form of pairwise dot products, $x^{(i)} \cdot x^{(j)}$, although we will not expand this mathematically in this book. With kernels, the new feature vectors are $\Phi(x^{(i)})$, and their pairwise dot products can be expressed as $\Phi(x^{(i)}) \cdot \Phi(x^{(j)})$. It would be computationally efficient if we could perform a pairwise operation on the two low-dimensional vectors directly and obtain the same result as the dot product in the high-dimensional space, without ever carrying out the mapping explicitly. In fact, a function $K$ that satisfies this does exist:

$K(x^{(i)}, x^{(j)}) = \Phi(x^{(i)}) \cdot \Phi(x^{(j)})$

The function $K$ is the so-called kernel function. With this trick, the transformation $\Phi$ becomes implicit, and the non-linear decision boundary can be efficiently learned by simply replacing the term $x^{(i)} \cdot x^{(j)}$ with $K(x^{(i)}, x^{(j)})$.
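
To see the trick in action, take a simple quadratic kernel of our own choosing for illustration, $K(a, b) = (a \cdot b)^2$, which corresponds to the explicit four-dimensional mapping $\Phi((a_1, a_2)) = (a_1^2, a_1 a_2, a_2 a_1, a_2^2)$. The sketch below verifies numerically that evaluating the kernel on the low-dimensional vectors gives exactly the same value as mapping first and taking the dot product in the higher-dimensional space:

>>> import numpy as np
>>> def phi(a):
...     """Explicit mapping of a 2-D vector to the 4-D space of pairwise products."""
...     return np.array([a[0] * a[0], a[0] * a[1], a[1] * a[0], a[1] * a[1]])
...
>>> def quadratic_kernel(a, b):
...     """The same similarity computed directly in the original 2-D space."""
...     return np.dot(a, b) ** 2
...
>>> a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
>>> print(np.dot(phi(a), phi(b)))   # dot product after explicit mapping
1.0
>>> print(quadratic_kernel(a, b))   # identical value, without ever leaving 2-D
1.0

The RBF kernel introduced next plays the same role, except that its implicit feature space is infinite-dimensional, so the explicit route would not even be possible.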

The most popular kernel function is probably the radial basis function (RBF) kernel (also called the Gaussian kernel), which is defined as follows:

$K(x^{(i)}, x^{(j)}) = \exp\left(-\gamma \left\| x^{(i)} - x^{(j)} \right\|^2\right)$

Here, $\gamma = \frac{1}{2\sigma^2}$. In the Gaussian function, the standard deviation $\sigma$ controls the amount of variation or dispersion allowed: the higher $\sigma$ (or the lower $\gamma$), the wider the bell and the wider the range over which data points are allowed to spread out. Therefore, $\gamma$, as the kernel coefficient, determines how strictly or loosely the kernel function fits the observations. A large $\gamma$ implies that only a small variance is allowed, giving a relatively exact fit on the training samples, which might lead to overfitting. On the other hand, a small $\gamma$ implies that a high variance is allowed, giving a loose fit on the training samples, which might cause underfitting.
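
Before fitting any classifiers, it may help to look at the raw kernel values themselves. The short sketch below (our own illustration, using two arbitrary points at a squared distance of 5) computes the RBF kernel by hand and checks it against scikit-learn's rbf_kernel; notice how a larger $\gamma$ makes the similarity between the two points decay towards zero much faster:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import rbf_kernel
>>> x_i = np.array([[1.0, 2.0]])
>>> x_j = np.array([[2.0, 4.0]])
>>> squared_distance = np.sum((x_i - x_j) ** 2)   # ||x_i - x_j||^2 = 5
>>> for gamma in [0.1, 1.0, 4.0]:
...     manual = np.exp(-gamma * squared_distance)
...     print(gamma, round(manual, 4), round(rbf_kernel(x_i, x_j, gamma=gamma)[0, 0], 4))
...
0.1 0.6065 0.6065
1.0 0.0067 0.0067
4.0 0.0 0.0

To illustrate the resulting overfitting and underfitting trade-off, let's apply the RBF kernel with different $\gamma$ values to a toy dataset: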

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.c_[# negative class
... (.3, -.8),
... (-1.5, -1),
... (-1.3, -.8),
... (-1.1, -1.3),
... (-1.2, -.3),
... (-1.3, -.5),
... (-.6, 1.1),
... (-1.4, 2.2),
... (1, 1),
... # positive class
... (1.3, .8),
... (1.2, .5),
... (.2, -2),
... (.5, -2.4),
... (.2, -2.3),
... (0, -2.7),
... (1.3, 2.1)].T
>>> Y = [-1] * 8 + [1] * 8

Eight data points are from one class, and eight from the other. We take three values, 1, 2, and 4, for the kernel coefficient $\gamma$ as an example:

>>> gamma_option = [1, 2, 4]

Under each kernel coefficient, we fit an individual SVM classifier and visualize the trained decision boundary:

>>> from sklearn.svm import SVC
>>> for i, gamma in enumerate(gamma_option, 1):
...     svm = SVC(kernel='rbf', gamma=gamma)
...     svm.fit(X, Y)
...     plt.scatter(X[:, 0], X[:, 1], c=['b'] * 8 + ['r'] * 8,
...                 zorder=10, cmap=plt.cm.Paired)
...     plt.axis('tight')
...     XX, YY = np.mgrid[-3:3:200j, -3:3:200j]
...     Z = svm.decision_function(np.c_[XX.ravel(), YY.ravel()])
...     Z = Z.reshape(XX.shape)
...     plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
...     plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
...                 linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
...     plt.title('gamma = %d' % gamma)
...     plt.show()

Refer to the following screenshot for the end results:

We can observe that a larger $\gamma$ results in a stricter fit on the dataset. Of course, $\gamma$ can be fine-tuned through cross-validation to obtain the best performance.
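
As a brief sketch of what such tuning might look like (an illustration of our own, not the tuning procedure used elsewhere in this book), GridSearchCV from scikit-learn can search over candidate $\gamma$ values with cross-validation on the toy dataset defined above:

>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import GridSearchCV
>>> # candidate kernel coefficients; the grid values here are arbitrary examples
>>> parameters = {'gamma': [0.1, 1, 2, 4, 10]}
>>> # 3-fold cross-validation using the X and Y defined earlier
>>> grid_search = GridSearchCV(SVC(kernel='rbf'), parameters, cv=3).fit(X, Y)
>>> best_gamma = grid_search.best_params_['gamma']
>>> # best_gamma now holds the value with the highest mean cross-validated accuracy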

Some other common kernel functions include the polynomial kernel and the sigmoid kernel. In their commonly used forms (also the forms implemented in scikit-learn, with offset $r$ and polynomial degree $d$), they are defined as follows:

$K(x^{(i)}, x^{(j)}) = \left(\gamma \, x^{(i)} \cdot x^{(j)} + r\right)^d$

$K(x^{(i)}, x^{(j)}) = \tanh\left(\gamma \, x^{(i)} \cdot x^{(j)} + r\right)$

In the absence of prior knowledge of the distribution, the RBF kernel is usually preferable in practice: the polynomial kernel has an additional parameter to tweak (the polynomial degree, $d$), while the empirically motivated sigmoid kernel performs roughly on a par with the RBF kernel only under certain parameter choices. Hence, given a dataset, the debate is mainly between the linear kernel (also considered "no kernel") and the RBF kernel.
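
One practical way to settle that debate for a specific dataset (a sketch of our own, not a procedure prescribed in this chapter) is simply to cross-validate both kernels and compare their scores:

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> from sklearn.model_selection import cross_val_score
>>> # compare a linear SVM and an RBF SVM on the same data with 3-fold cross-validation
>>> for kernel in ['linear', 'rbf']:
...     scores = cross_val_score(SVC(kernel=kernel), X, Y, cv=3)
...     print(kernel, np.mean(scores))

Whichever kernel yields the higher mean score is the more promising starting point for further tuning.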
