Support vector machines

To put it simply, SVM algorithms search for hyperplanes in order to build classifiers and regressions. The mathematics behind them is nothing short of amazing. The core idea is to look for better perspectives (hyperplanes) from which to separate data points, making it possible to split apart classes that are not linearly separable in their original space.

In other words, two variables may not be linearly separable in the X-Y plane, but you could apply a transformation that adds an extra dimension (Z). Looking from this new perspective, you might be able to find a hyperplane that separates the distinct classes nicely. In an extreme scenario, the number of dimensions required could explode, depending on the problem at hand. Lucky for us, there is the kernel trick.
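To make the idea concrete, here is a small toy example of my own (not taken from the book's dataset): points inside and outside a circle cannot be split by a straight line in the x-y plane, but adding the coordinate z = x^2 + y^2 makes a flat plane do the job:

# Assumed toy example: a circular class boundary in two dimensions
set.seed(1)
n <- 100
x <- runif(n, -1, 1); y <- runif(n, -1, 1)
class <- ifelse(x^2 + y^2 < 0.5, 'inner', 'outer')  # not linearly separable in (x, y)
z <- x^2 + y^2                                       # lift the points into a third dimension
table(class, below_plane = z < 0.5)                  # the plane z = 0.5 separates the classes perfectly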

However, there is no need to actually know which transformation to use. All that is asked of you is to pick a kernel function and go with it; the kernel trick was truly an ace in the hole for SVMs. Thanks to convex optimization, SVMs will never get stuck in local optima, but they can still suffer from overfitting.

A great way to avoid overfitting is to pick the right kernel and parameters.

I truly recommend the reader study the so-called kernel trick. Kernels are very useful because they can replace dot products. As many machine learning algorithms can be entirely expressed in terms of dot products, they can also be expressed (and coded) using kernels instead.

That being said, if you are not a kernel adopter yet, I encourage you to gather a deeper understanding of kernels; it might help you the next time you are writing down a machine learning algorithm. Which is your favorite kernel? Mine is the radial basis.
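As a quick illustration of what "replacing dot products" means (the feature map phi and the toy vectors below are my own, assumed example): the degree-2 polynomial kernel computed on the original vectors equals an ordinary dot product computed in a higher-dimensional feature space, while the radial basis kernel corresponds to a feature space too large to ever write out explicitly:

phi <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])  # explicit feature map
x <- c(1, 2); y <- c(3, 4)
sum(x * y)^2          # degree-2 polynomial kernel in the original space: 121
sum(phi(x) * phi(y))  # the same value, as a plain dot product in feature space: 121

rbf_kernel <- function(x, y, sigma = 0.025) exp(-sigma * sum((x - y)^2))  # radial basis
rbf_kernel(x, y)      # the same parameterization used by kernlab's rbfdot()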

Enough with the theory for the moment. Things are going to get practical now. R is ready to deal with SVMs. There are a couple of packages I could name (really a couple): e1071 and kernlab. Where there is smoke, there is fire; and where there are multiple machine learning packages, there is the caret package.

The caret package wraps many machine learning packages under a single interface. Training different models with the different packages gathered under caret is actually very easy. The package is extensively explained here: http://topepo.github.io/caret/.

Given that we're exploring SVM now, you may want to take a shortcut and go straight to the following link: http://topepo.github.io/caret/train-models-by-tag.html#support-vector-machines.

But you will surely benefit from reading the entire document. Don't worry, you don't need to read it right now. I will quickly introduce you to training an SVM model using a radial basis kernel. The latter link will give you an overview of which other SVM variants are available, while the former will give you a deeper understanding of how to use caret and what it's capable of doing.

As caret relies on many other packages, we want to install not only caret itself; some other packages, such as lubridate, may also be requested. Let's begin by installing both:

if(!require('lubridate')){install.packages('lubridate')}
if(!require('caret')){install.packages('caret')}
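Since the 'svmRadial' method we will use later is fitted through the kernlab package behind the scenes, it doesn't hurt to install that one up front as well (otherwise caret will ask for it the first time you train):

if(!require('kernlab')){install.packages('kernlab')}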

Even more packages may be requested. How can you know? Call library(caret) and pay close attention to the warnings; they will tell you if some packages are missing. Install the missing packages if there are any. The next code block shows how to load caret and set the random seed (to some degree, training an SVM with caret depends on random number generation):

library(caret)
set.seed(2018)

The random seed was set so there is a greater chance of you and me getting similar results. A nice feature of caret is the tuning grid. Usually, it's possible to name several parameters in order to tune your model. These parameters need to be arranged into a data frame. The expand.grid() function can easily create this data frame using every possible combination of the given values:

tune_grid <- expand.grid(sigma = c(.025,.0025,.001), C = c(1,2))

After creating tune_grid, you can call this object to check how expand.grid() has worked things out. Use this tuning grid feature wisely. The more values you set, the more time it will take to train the models, plus you might increase the risk of overfitting depending on the range used.
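If you print tune_grid, you should see one row for every combination of the three sigma values with the two cost values, six rows in total (the exact column formatting may differ slightly on your machine):

tune_grid
#    sigma C
# 1 0.0250 1
# 2 0.0025 1
# 3 0.0010 1
# 4 0.0250 2
# 5 0.0025 2
# 6 0.0010 2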

There are two parameters you can try while training with the 'svmRadial' method: sigma and C (cost). Both have to be given their exact names in the data frame that feeds the tuneGrid argument. While sigma is a parameter of the radial basis function kernel, the cost is associated with the constraints in the optimization problem.
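If you ever forget which parameters a given caret method exposes, you can ask caret directly; modelLookup() lists the tunable parameters for a method and whether it supports regression and classification:

caret::modelLookup('svmRadial')
# lists sigma and C as the tunable parameters for this method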

Now, with the grid in hand, it's time to finally train an SVM model for classification. For the sake of comparability, we will stick with the dt_Chile dataset:

time0_svm <- Sys.time()

svm_caret <- caret::train(vote ~ . ,
                          data = dt_Chile[-i_out,],
                          tuneGrid = tune_grid,
                          method = 'svmRadial')

time1_svm <- Sys.time()

Just as we've done before, the core function used to train the model (its name is actually train()) sits between the two variables storing the time (time0_svm and time1_svm), so we can measure how long training takes. Before bringing in the test-sample results, let's check the training-sample results:

mean(predict(svm_caret, newdata = dt_Chile[-i_out,]) == dt_Chile[-i_out, 'vote'])
# [1] 0.6721504

A hit rate of nearly 67.21% was the accuracy we got from the in-sample results (dt_Chile[-i_out,]). Let's see how it performed in the test sample, plus how much time it took to train (keep in mind that this might vary depending on many things):


mean(predict(svm_caret, newdata = dt_Chile[test,]) == dt_Chile[test, 'vote'])
# [1] 0.6931507
time1_svm - time0_svm
# Time difference of 1.108781 mins

The performance is even better (a 69.31% hit rate), which makes me a little suspicious about it. This better performance was achieved on the test data, which is out-of-sample since we did not use it for training. You should never expect your model to do better on out-of-sample measures, although it can happen.
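Whenever a result looks suspicious, it's worth inspecting the fitted train object; it stores the resampled accuracy for every sigma and C combination as well as the winning pair (the exact numbers will vary from run to run):

svm_caret$bestTune   # the sigma/C combination caret selected
svm_caret$results    # resampled accuracy for every combination in tune_grid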

The time required to train the model is also considerably higher in comparison with random forests. It went from about one second to about one minute. That's something to take into account. The beepr::beep() trick may be useful here:

alert <- function(x, n = 1, s = 8){
  # plays x beeps, six seconds apart; e.g. call alert(1) right after training finishes
  if(x > 0){cat('Alert #', n, '.\n', sep = ''); beepr::beep(s); Sys.sleep(6); alert(x - 1, n + 1, s)}
}

Deciding which kernel function (along with its parameters) to use is a central matter when designing an SVM model. Some algorithms will only allow regressions; others will be suitable for two-class classification but not for multi-class classification. Others may be broader and implement both regression and multi-class classification.

It's really important to get to know how the algorithm you are trying really works. Picking the wrong one might be disastrous. This section gave a brief introduction (theory and practice) to SVMs. For the moment, we've focused on supervised learning methods while mostly dealing with classification problems. The next section discusses how these same models we've been using until now can be used to fit regressions, while talking a little bit about error measures. Afterward, we may take a detour around unsupervised learning.
