Predicting chemical biodegradation

In this section, we are going to use R's e1071 package to try out the models we've discussed on a real-world data set. As our first example, we have chosen the QSAR biodegradation data set, which can be found at https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation#. This data set contains 41 numerical variables that describe the molecular composition and properties of 1055 chemicals. The modeling task is to predict whether a particular chemical will be biodegradable based on these properties. Example properties include the percentages of carbon, nitrogen, and oxygen atoms, as well as the number of heavy atoms in the molecule. These features are highly specialized and numerous enough that a full listing won't be given here; the complete list and further details of the quantities involved can be found on the website. We've downloaded the data and will read it into a data frame called bdf:

> bdf <- read.table("biodeg.csv", sep = ";", quote = "\"", stringsAsFactors = TRUE)
> head(bdf, n = 3)
     V1     V2 V3 V4 V5 V6 V7   V8 V9 V10 V11 V12   V13
1 3.919 2.6909  0  0  0  0  0 31.4  2   0   0   0 3.106
2 4.170 2.1144  0  0  0  0  0 30.8  1   1   0   0 2.461
3 3.932 3.2512  0  0  0  0  0 26.7  2   4   0   0 3.279
    V14   V15 V16   V17   V18 V19 V20 V21   V22 V23 V24
1 2.550 9.002   0 0.960 1.142   0   0   0 1.201   0   0
2 1.393 8.723   1 0.989 1.144   0   0   0 1.104   1   0
3 2.585 9.110   0 1.009 1.152   0   0   0 1.092   0   0
  V25 V26   V27    V28 V29 V30   V31 V32 V33 V34 V35
1   0   0 1.932  0.011   0   0 4.489   0   0   0   0
2   0   0 2.214 -0.204   0   0 1.542   0   0   0   0
3   0   0 1.942 -0.008   0   0 4.891   0   0   0   1
    V36   V37 V38   V39 V40 V41 V42
1 2.949 1.591   0 7.253   0   0  RB
2 3.315 1.967   0 7.257   0   0  RB
3 3.076 2.417   0 7.601   0   0  RB

The final column, V42, contains the output variable, which takes the value NRB for chemicals that are not biodegradable and RB for those that are. We'll recode this into the familiar labels of 0 and 1:

> levels(bdf$V42) <- c(0, 1)
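
It is also worth checking how balanced the two classes are. The UCI documentation for this data set reports 356 ready biodegradable and 699 non-biodegradable chemicals, which a one-line check confirms:

> table(bdf$V42)

  0   1 
699 356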

Now that we have our data ready, we'll begin, as usual, by splitting them into training and testing sets, with an 80-20 split.

> library(caret)
> set.seed(23419002)
> bdf_sampling_vector <- createDataPartition(bdf$V42, p = 0.80, 
                                             list = FALSE)
> bdf_train <- bdf[bdf_sampling_vector,]
> bdf_test <- bdf[-bdf_sampling_vector,] 
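
Before we turn to modeling, it's worth taking a quick look at the features themselves. A quick check along these lines confirms what the output of head() already hints at, namely that the feature scales vary considerably and that a large fraction of the entries are exactly zero:

> # Spread of the per-feature standard deviations across the 41 features
> summary(apply(bdf_train[,1:41], 2, sd))
> # Proportion of entries in the feature matrix that are exactly zero
> mean(as.matrix(bdf_train[,1:41]) == 0)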

There are a number of packages available in R that implement support vector machines. In this chapter, we'll explore the use of the e1071 package, which provides us with the svm() function. If we examine our training data, we'll quickly notice two things: the scales of the various features are quite different, and many of the features are sparse, meaning that they take a zero value for most observations. It is a good idea to scale the features, as we did with neural networks, especially if we want to work with radial kernels. Fortunately for us, the svm() function has a scale parameter, which is set to TRUE by default. This normalizes the input features so that they have zero mean and unit variance before the model is trained, which spares us from carrying out this preprocessing step manually. The first model that we will investigate will use a linear kernel:

> library(e1071)
> model_lin <- svm(V42 ~ ., data = bdf_train, kernel = "linear", cost = 10)

The call to the svm() function follows the familiar paradigm of providing a formula, then the name of the data frame, and finally any other parameters relevant to the model. In our case, we want to train a model where the final V42 column is the output column and all other columns are used as features. For this reason, we can simply use the formula V42 ~ . instead of having to enumerate all the other columns. After specifying our data frame, we then specify the type of kernel, and in this case, we've opted for a linear kernel. We also specify a value for the cost parameter, which is related to the error budget C in our model:

> model_lin

Call:
svm(formula = V42 ~ ., data = bdf_train, kernel = "linear", cost = 10)
Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 
      gamma:  0.02439024 

Number of Support Vectors:  272

Our model doesn't provide us with much information on its performance beyond the details of the parameters we specified. One interesting piece of information is the number of data points that were support vectors in our model; in this case, 272. If we use the str() function to examine the structure of the fitted model, however, we will find that it contains a number of useful attributes. For example, the fitted attribute contains the model's predictions on the training data. We'll use this to gauge the quality of the model's fit by computing the accuracy on the training data and the confusion matrix:

> mean(bdf_train[,42] == model_lin$fitted)
[1] 0.8887574
> table(actual = bdf_train[,42], predictions = model_lin$fitted)
      predictions
actual   0   1
     0 519  41
     1  53 232
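
The confusion matrix also lets us compute class-specific measures of performance. As a quick sketch (the cm variable is our own), the precision and recall for the biodegradable class follow directly from the table above:

> cm <- table(actual = bdf_train[,42], predictions = model_lin$fitted)
> # Precision for class 1: of the chemicals predicted biodegradable,
> # how many actually are? This is 232 / (41 + 232)
> cm["1", "1"] / sum(cm[, "1"])
[1] 0.8498168
> # Recall for class 1: of the truly biodegradable chemicals,
> # how many did we identify? This is 232 / (53 + 232)
> cm["1", "1"] / sum(cm["1", ])
[1] 0.8140351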

We have a training accuracy of just under 89 percent, which is a decent start. Next, we'll examine the performance on the test data using the predict() function to see whether we can get a test accuracy close to this, or whether we have ended up overfitting the training data:

> test_predictions <- predict(model_lin, bdf_test[,1:41])
> mean(bdf_test[,42] == test_predictions)
[1] 0.8619048

Our test accuracy is slightly lower than the training accuracy we obtained earlier, but close enough for us to be reasonably confident that we are not overfitting the training data. Now, we've seen that the cost parameter plays an important role in our model, and that choosing it involves a trade-off between model bias and variance. Consequently, we'll want to try different values of the cost parameter before settling on a final model. After manually repeating the preceding code for a few values of this parameter, we obtained the following set of results:

> linearPerformances
         0.01  0.1   1     10    100   1000 
training 0.858 0.888 0.883 0.889 0.886 0.886
test     0.886 0.876 0.876 0.862 0.862 0.862
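
Rather than rerunning the code by hand, we can automate the sweep over cost values. Here is a minimal sketch of one way to build a table like linearPerformances:

> costs <- c(0.01, 0.1, 1, 10, 100, 1000)
> linearPerformances <- sapply(costs, function(cost) {
+   model <- svm(V42 ~ ., data = bdf_train, kernel = "linear", cost = cost)
+   # Accuracy on the training data and on the held-out test data
+   c(training = mean(bdf_train[,42] == model$fitted),
+     test = mean(bdf_test[,42] == predict(model, bdf_test[,1:41])))
+ })
> colnames(linearPerformances) <- costs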

Tip

Sometimes, when building a model, we may see a warning informing us that the maximum number of iterations has been reached. If this happens, we should be doubtful of the model that we produced, as it may be an indication that a solution was not found and the optimization procedure did not converge. In such a case, it is best to experiment with a different cost value and/or kernel type.

These results show that for most values of the cost parameter, we are seeing a very similar level of quality of fit on our training data, roughly 88 percent. Ironically, the best performance on the test data was obtained using the model whose fit on the training data was the worst, using a cost of 0.01. In short, although we have reasonable performance on our training and test data sets, the low variance in the results shown in the table essentially tells us that we are not going to get a significant improvement in the quality of fit by tweaking the cost parameter on this particular data set.

Now let's try using a radial kernel to see whether introducing some nonlinearity can allow us to improve our performance. When we specify a radial kernel, we must also specify a positive gamma parameter. In the equation of the radial kernel, K(u, v) = exp(−γ‖u − v‖²), gamma corresponds to 1/(2σ²), and it controls the locality of the similarity computation between the kernel's two vector inputs. A large gamma means that the kernel will produce values close to zero unless the two vectors are very close together; a smaller gamma results in a smoother kernel that also takes into account pairs of vectors that are farther apart. Once again, this choice boils down to a trade-off between bias and variance, so just as with the cost parameter, we'll have to try out different values of gamma. For now, let's see how we can create a support vector machine model using a radial kernel with a specific configuration:

> model_radial <- svm(V42 ~ ., data = bdf_train, kernel = "radial", 
                      cost = 10, gamma = 0.5)
> mean(bdf_train[,42] == model_radial$fitted)
[1] 0.9964497 
> test_predictions <- predict(model_radial, bdf_test[,1:41])
> mean(bdf_test[,42] == test_predictions)
[1] 0.8047619

Note that with these settings, the radial kernel is able to fit the training data much more closely, as indicated by the near-100 percent training accuracy; however, the performance on the test data set is substantially lower than what we obtained on the training data. Consequently, we have a very clear indication that this model is overfitting the data. To get around this problem, we will manually experiment with a few different settings of the gamma and cost parameters to see whether we can improve the fit:

> radialPerformances
         [,1]  [,2]  [,3] [,4]  [,5]  [,6]  [,7]  [,8]  [,9] 
cost     0.01  0.1   1    10    100   0.01  0.1   1     10   
gamma    0.01  0.01  0.01 0.01  0.01  0.05  0.05  0.05  0.05 
training 0.663 0.824 0.88 0.916 0.951 0.663 0.841 0.918 0.964
test     0.662 0.871 0.89 0.89  0.886 0.662 0.848 0.89  0.89 
         [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
cost     100   0.01  0.1   1     10    100   0.01  0.1  
gamma    0.05  0.1   0.1   0.1   0.1   0.1   0.5   0.5  
training 0.989 0.663 0.815 0.937 0.985 0.995 0.663 0.663
test     0.838 0.662 0.795 0.886 0.867 0.824 0.662 0.662
         [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
cost     1     10    100   0.01  0.1   1     10    100  
gamma    0.5   0.5   0.5   1     1     1     1     1    
training 0.98  0.996 0.998 0.663 0.663 0.991 0.996 0.999
test     0.79  0.805 0.805 0.662 0.662 0.748 0.757 0.757
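
As with the linear kernel, the grid of results above can be generated programmatically rather than by hand. A sketch of one way to do this, sweeping over all combinations of cost and gamma:

> grid <- expand.grid(cost = c(0.01, 0.1, 1, 10, 100),
+                     gamma = c(0.01, 0.05, 0.1, 0.5, 1))
> radialPerformances <- apply(grid, 1, function(params) {
+   model <- svm(V42 ~ ., data = bdf_train, kernel = "radial",
+                cost = params["cost"], gamma = params["gamma"])
+   # Record the parameter pair along with training and test accuracy
+   c(params,
+     training = mean(bdf_train[,42] == model$fitted),
+     test = mean(bdf_test[,42] == predict(model, bdf_test[,1:41])))
+ })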

As we can see, the combination of the two parameters, cost and gamma, yields a much wider range of results with the radial kernel. From the table of results we built previously, we can see that some combinations, such as cost = 1 and gamma = 0.05, bring our accuracy on the test data up to 89 percent while maintaining comparable performance on the training data. The table also contains many examples of settings in which the training accuracy is nearly 100 percent but the test accuracy falls well below this.

As a result, we conclude that using a nonlinear kernel, such as the radial kernel, needs to be done with care in order to avoid overfitting. Nonetheless, radial kernels are very powerful and can be quite effective at modeling a highly nonlinear decision boundary, often allowing us to achieve higher rates of classification accuracy than a linear kernel. At this point in our analysis, we would usually want to settle on a particular value for the cost and gamma parameters and then retrain our model using the entire data available before deploying it in the real world.
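
For example, if we settled on the combination of cost = 1 and gamma = 0.05 that performed well above, this final retraining step would look as follows (final_model is our own name for the result):

> final_model <- svm(V42 ~ ., data = bdf, kernel = "radial",
+                    cost = 1, gamma = 0.05)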

Unfortunately, having used the test set to guide our choice of parameters, it no longer represents an unseen data set from which we could predict the model's accuracy in the real world. One possible solution to this problem is to use a validation set, but this would require us to set aside some of our data, resulting in smaller training and test sets. The next section will propose a way out of this dilemma.
