Predicting glass type revisited

In Chapter 3, Logistic Regression, we analyzed the glass identification data set, whose task is to identify the type of glass comprising a glass fragment found at a crime scene. The output of this data set is a factor with six class levels corresponding to different types of glass. Our previous approach was to build a one-versus-all model using multinomial logistic regression. The results were not very promising, and one of the main points of concern was a poor model fit on the training data.

In this section, we will revisit this data set and see whether a neural network model can do better. At the same time, we will demonstrate how neural networks can handle classification problems as well:

> glass <- read.csv("glass.data", header = FALSE)
> names(glass) <- c("id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", 
                    "Ba", "Fe", "Type")
> glass$id <- NULL

Our output is a multiclass factor and so we will want to dummy-encode this into binary columns. With the neuralnet package, we would normally need to do this manually as a preprocessing step before we can build our model.
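For instance, one way to do this manually with base R is model.matrix(), which expands a factor into one 0/1 indicator column per glass type. This is just a sketch of the preprocessing step that nnet will let us skip:

> # Manual dummy encoding (for illustration only): one binary column per type
> glass_type_dummies <- model.matrix(~ factor(Type) - 1, data = glass)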

In this section, we will look at a second package that contains functions for building neural networks, nnet. This is actually the same package that we used for multinomial logistic regression. One of the benefits of this package is that for multiclass classification, the nnet() function that trains the neural network will automatically detect outputs that are factors and perform the dummy encoding for us. With that in mind, we will prepare a training and test set:

> library(caret)   # provides createDataPartition() and preProcess()
> glass$Type <- factor(glass$Type)
> set.seed(4365677)
> glass_sampling_vector <- createDataPartition(glass$Type, p = 
                           0.80, list = FALSE)
> glass_train <- glass[glass_sampling_vector,]
> glass_test <- glass[-glass_sampling_vector,]

Next, just as with our previous data set, we will normalize our input data:

> glass_pp <- preProcess(glass_train[1:9], method = c("range"))
> glass_train <- cbind(predict(glass_pp, glass_train[1:9]), Type = glass_train$Type)
> glass_test  <- cbind(predict(glass_pp, glass_test[1:9]), Type = glass_test$Type)

We are now ready to train our model. Whereas the neuralnet package is able to model multiple hidden layers, the nnet package is designed to model neural networks with a single hidden layer. As a result, we still specify a formula as before, but this time, instead of a hidden parameter that can be either a scalar or a vector of integers, we specify a size parameter that is an integer representing the number of nodes in the single hidden layer of our model.
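To make this concrete, the two calls compare roughly as follows; the neuralnet line is purely illustrative, with f and train_data standing in for a formula and data frame like those used in the earlier section:

> # neuralnet(): hidden can be a scalar or a vector with one entry per layer
> # neuralnet(f, data = train_data, hidden = c(5, 3))   # two hidden layers
> # nnet(): size is a single integer, the node count of the one hidden layer
> # nnet(Type ~ ., data = glass_train, size = 10)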

Also, the default neural network model in the nnet package is designed for classification: the output layer uses a logistic activation function when the response factor has two levels, and switches to a softmax output layer when it has more, as in our case. When working with different packages that train the same type of model, such as multilayer perceptrons, it is really important to check the default values of the various model parameters, as these differ from package to package. One other difference between the two packages worth mentioning is that nnet currently does not offer any plotting capabilities. Without further ado, we will now train our model:

> library(nnet)
> glass_model <- nnet(Type ~ ., data = glass_train, size = 10)
# weights:  166
initial  value 343.685179 
iter  10 value 265.604188
iter  20 value 220.518320
iter  30 value 194.637078
iter  40 value 192.980203
iter  50 value 192.569751
iter  60 value 192.445198
iter  70 value 192.421655
iter  80 value 192.415382
iter  90 value 192.415166
iter 100 value 192.414794
final  value 192.414794 
stopped after 100 iterations

From the output, we can see that the model has not converged, stopping after the default maximum of 100 iterations. To obtain a converged model, we can either rerun this code a few times, since each run starts from a different random initialization of the weights, or we can raise the limit on the number of iterations to 1,000 using the maxit parameter:

> glass_model <- nnet(Type ~ ., data = glass_train, size = 10, maxit = 
                      1000)

Let's first investigate the accuracy of our model on the training data in order to assess the quality of the fit. To compute predictions, we use the predict() function and set the type parameter to class. This tells predict() that we want the class with the highest probability to be selected. If we want to see the probabilities of each class instead, we can specify the value raw for the type parameter. Finally, remember that we must pass a data frame without the output column to the predict() function, hence the need to subset the training data frame:

> train_predictions <- predict(glass_model, glass_train[,1:9], 
                               type = "class")
> mean(train_predictions == glass_train$Type)
[1] 0.7183908046
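
As a brief aside, if we want to look at the per-class probabilities mentioned above, the same call with type = "raw" returns one column of probabilities per glass type (output not shown here):

> train_probabilities <- predict(glass_model, glass_train[,1:9], 
                                 type = "raw")
> head(train_probabilities, 3)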

Our first attempt shows us that we are getting the same quality of fit as with our multinomial logistic regression model. To improve upon this, we'll increase the complexity of the model by adding more neurons in our hidden layer. We will also increase our maxit parameter to 10,000 as the model is more complex and might need more iterations to converge:

> glass_model2 <- nnet(Type ~ ., data = glass_train, size = 50, maxit = 
                       10000)
> train_predictions2 <- predict(glass_model2, glass_train[,1:9], 
                                type = "class")
> mean(train_predictions2 == glass_train$Type)
[1] 1

As we can see, we have now achieved 100 percent training accuracy. Now that we have a decent model fit, we can investigate our performance on the test set:

> test_predictions2 <- predict(glass_model2, glass_test[,1:9], 
                               type = "class")
> mean(test_predictions2 == glass_test$Type)
[1] 0.6

Even though our model fits the training data perfectly, we see that the accuracy on the test set is only 60 percent. Even factoring in that the data set is very small, this discrepancy is a classic signal that our model is overfitting on the training data. When we looked at linear and logistic regression, we saw that there are shrinkage methods, such as the lasso, which are designed to combat overfitting by restricting the size of the coefficients in the model.

An analogous technique known as weight decay exists for neural networks. With this approach, the product of a decay constant and the sum of the squares of all the network weights is added to the cost function. This prevents the weights from taking overly large values and thus regularizes the network. Whereas neuralnet() currently offers no option for regularization, nnet() exposes it through the decay parameter:

> glass_model3 <- nnet(Type~., data = glass_train, size = 10, maxit = 
                       10000, decay = 0.01)
> train_predictions3 <- predict(glass_model3, glass_train[,1:9], 
                                type = "class")
> mean(train_predictions3 == glass_train$Type)
[1] 0.9367816092
> test_predictions3 <- predict(glass_model3, glass_test[,1:9],  
                               type = "class")
> mean(test_predictions3 == glass_test$Type)
[1] 0.775
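
As a quick sanity check on what the decay term contributes, recall that the penalty added to the fitting criterion is the decay constant multiplied by the sum of squared weights. Since an nnet model stores its fitted weights in the wts component, we can compute this directly (output not shown):

> # Weight-decay penalty included in glass_model3's final fitting criterion
> 0.01 * sum(glass_model3$wts ^ 2)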

With this model, the fit on our training data is still very high, and substantially higher than we achieved with multinomial logistic regression. On the test set, the performance is still worse than on the training set, but much better than we had before.

We won't spend any more time on the glass identification data. Instead, we will reflect on a few lessons learned before moving on. The first of these is that achieving good performance with a neural network, and sometimes even just reaching convergence, can be tricky. Training the model involves a random initialization of the network weights, and the final result is often quite sensitive to these starting conditions. We can convince ourselves of this fact by training each of the model configurations we have seen so far several times: some configurations will fail to converge on certain runs, and the performance on the training and test sets tends to differ from one run to the next.
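
One practical consequence is that fixing the random seed just before a call to nnet() makes a particular run reproducible. The snippet below is only an illustration of this; trace = FALSE merely suppresses the iteration log:

> set.seed(4365677)
> glass_model_repro <- nnet(Type ~ ., data = glass_train, size = 10, 
                            maxit = 1000, trace = FALSE)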

Another insight is that training a neural network involves tuning a diverse range of parameters, from the number and arrangement of hidden neurons to the value of the decay parameter. Others that we did not experiment with include the choice of nonlinear activation function to use with the hidden layer neurons, the criteria for convergence, and the particular cost function we use to fit our model. For example, instead of using least squares, we could use a criterion known as entropy.
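
To illustrate this last point, nnet() exposes an entropy argument that switches from least squares to entropy (maximum conditional likelihood) fitting. Note that when the response is a factor with more than two levels, as here, the formula interface already uses a softmax output layer with conditional likelihood fitting, so the explicit switch below is mainly relevant for two-class problems; outcome and binary_data are hypothetical placeholders:

> # Hypothetical two-class example: request entropy fitting instead of
> # least-squares fitting
> # nnet(outcome ~ ., data = binary_data, size = 10, entropy = TRUE)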

Before settling on a final choice of model, therefore, it pays to try out as many different combinations of these as possible. A good place to experiment with different parameter combinations is the train() function of the caret package. It provides a unified interface for both neural network packages we have seen and, in conjunction with expand.grid(), allows the simultaneous training and evaluation of several different neural network configurations. We'll provide just a brief example here; the interested reader can use it as a starting point for further investigation:

> library(caret)
> nnet_grid <- expand.grid(.decay = c(0.1, 0.01, 0.001, 0.0001), 
                           .size = c(50, 100, 150, 200, 250))
> nnetfit <- train(Type ~ ., data = glass_train, method = "nnet", 
  maxit = 10000, tuneGrid = nnet_grid, trace = F, MaxNWts = 10000)
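
Once the grid search completes, the object returned by train() can be queried for the winning parameter combination and the resampled performance of every configuration, for example (output not shown):

> nnetfit$bestTune     # best decay/size combination found
> nnetfit$results      # resampled accuracy for each grid point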