This chapter explains a very important set of techniques often used when training deep networks: regularization. We look at methods such as ℓ1 and ℓ2 regularization, dropout, and early stopping. You will learn how these methods help prevent overfitting and, when applied correctly, help you achieve much better results from your models. We look at the mathematics behind the methods and at how to implement them correctly in Python and Keras.
Complex Networks and Overfitting
In the previous chapters, you learned how to build and train complex networks. One of the most common problems you will encounter when using complex networks is overfitting. In this chapter, we face an extreme case of overfitting and discuss a few strategies to avoid it. A perfect dataset for studying this problem is the Boston housing price dataset [1].
CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 square feet
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX: Nitric oxides concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property-tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk − 0.63)², where Bk is the proportion of African Americans by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
The np.random.seed(42) is there so that you will always get the same training and dev dataset (so the results are reproducible).
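The loading and splitting code is not reproduced here; the following is a minimal sketch of one common way to obtain the data and build the two sets. The 80/20 split ratio is an assumption, and any feature normalization is omitted.

```python
import numpy as np
import pandas as pd

# One common way of loading the Boston housing data from the StatLib copy;
# in the raw file, each observation spans two lines.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]                                         # MEDV

np.random.seed(42)                               # makes the random split reproducible
idx = np.random.permutation(features.shape[0])   # 506 observations
n_train = int(0.8 * features.shape[0])           # assumed 80/20 train/dev split
train_x, train_y = features[idx[:n_train]], target[idx[:n_train]]
dev_x, dev_y = features[idx[n_train:]], target[idx[n_train:]]
```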
The function that builds and trains our feed-forward neural network model and evaluates it on the training and dev sets is sketched after the next two paragraphs. You learned how to implement feed-forward neural networks in the last chapter, so you should understand what it does.
We use He initialization here, since we will use ReLU activation functions. In the output layer, we have one neuron with the identity activation function for regression (remember that, in Keras, when you do not specify an activation function, the identity function is used by default). Additionally, we use the Adam optimizer.
As you may notice, to make things simpler, we avoided writing a function with many input parameters and simply hard-coded most of the values (the learning rate, for example) in the function body, since in this case we do not need to tune the parameters very much. Moreover, we are not using mini-batches here, since we only have a few hundred observations.
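A minimal sketch of such a function follows. The function name, the default architecture (four hidden layers of 20 neurons each), the learning rate, and the number of epochs are assumptions made for illustration; the original listing may differ.

```python
import tensorflow as tf
from tensorflow import keras

def create_and_train_model(train_x, train_y, dev_x, dev_y,
                           num_neurons=20, num_layers=4):
    model = keras.Sequential()
    # Hidden layers: ReLU activations with He initialization
    model.add(keras.layers.Dense(num_neurons, activation='relu',
                                 kernel_initializer='he_normal',
                                 input_shape=(train_x.shape[1],)))
    for _ in range(num_layers - 1):
        model.add(keras.layers.Dense(num_neurons, activation='relu',
                                     kernel_initializer='he_normal'))
    # One output neuron; no activation specified -> identity (linear) function
    model.add(keras.layers.Dense(1))

    # Adam optimizer with a hard-coded learning rate
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss='mse', metrics=['mse'])

    # No mini-batches: the whole training set is used as a single batch
    hist = model.fit(train_x, train_y, epochs=1000,
                     batch_size=train_x.shape[0], verbose=0,
                     validation_data=(dev_x, dev_y))
    return hist, model
```

A typical call would then look like hist, model = create_and_train_model(train_x, train_y, dev_x, dev_y, 20, 4).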
What can we do in this case to avoid the problem of overfitting? One solution would, of course, be to reduce the complexity of the network, that is, to reduce the number of layers and/or the number of neurons in each layer. But, as you can imagine, this strategy is very time consuming: you must try several network architectures to see how the training error and the dev error behave. In this case, it is still a viable solution, but if you are working on a problem where the training phase takes several days, it becomes quite difficult and extremely time consuming. Several strategies have been developed to deal with this problem; the most common is called regularization and is the focus of this chapter.
What Is Regularization
Before going into the different methods, we must quickly discuss how the deep learning community interprets the term regularization. The term has deeply (pun intended) evolved over time. In the traditional sense from the '90s, for example, it was reserved for a penalty term in the loss function [2]. Lately, the term has gained a much broader meaning. For example, Goodfellow [3] defines it as “any modification we make to a learning algorithm that is intended to reduce its test error but not its training error.” Kukačka [4] generalizes the term even more and provides this definition: “Regularization is any supplementary technique that aims at making the model generalize better, i.e. produce better results on the test set.” So be aware when using the term, and always be precise about what you mean.
You may also have heard or read the claim that regularization has been developed to fight overfitting. This is also a way of understanding it. Remember, a model that is overfitting the training dataset is not generalizing well to new data. This definition is also in line with all the others.
This is just a matter of definitions, but it’s important to have seen them so that you better understand what is meant when reading papers or books. This is a very active research area; to give you an idea, Kukačka, in his review paper, lists 58 different regularization methods. Note that, under the most general definition, SGD (stochastic gradient descent) is also considered a regularization method, something not everyone agrees on. So be warned when reading research material: the term regularization is used with varying meanings.
In this chapter, we look at the three most common and well-known methods: ℓ1, ℓ2, and dropout. We also briefly talk about early stopping, although this method does not, technically speaking, fight overfitting. ℓ1 and ℓ2 achieve so-called weight decay by adding a regularization term to the cost function, while dropout removes, in a random fashion, nodes from the network during the training phase. To understand the three methods properly, we need to study them in detail. Let’s start with the most instructive one: ℓ2 regularization.
At the end of the chapter, we look at a few other ideas on how to fight overfitting and get the model to generalize better. Instead of changing the model or the learning algorithm, those strategies modify the training data to make learning more effective.
About Network Complexity
Let’s spend a few moments discussing a term we have used very often: network complexity. You have read here, and can find almost everywhere, that with regularization you want to reduce network complexity. But what are we really referring to? It is very difficult to define network complexity precisely; in fact, hardly anyone attempts it. You can find several research papers on the problem of model complexity (note that this is not exactly network complexity), with roots in information theory. In this chapter, you will see how the number of weights different from zero changes dramatically with the number of epochs, with the optimization algorithm, and so on, making this vague concept of complexity also dependent on how long you train your model. To make a long story short, the term network complexity should be used only at a general level, since it is theoretically a very difficult concept to define. A complete discussion of the subject is beyond the scope of this book.
ℓp Norm
The ℓp norm of a vector x is defined as

$$\|\mathbf{x}\|_p = \left(\sum_i |x_i|^p\right)^{1/p}, \qquad p \ge 1$$

where the sum is performed over all components xi of the vector x.
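As a quick illustration, these norms can be computed directly with NumPy:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
print(np.linalg.norm(x, ord=1))   # l1 norm: |1| + |-2| + |3| = 6.0
print(np.linalg.norm(x, ord=2))   # l2 norm: sqrt(1 + 4 + 9) ≈ 3.742
```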
Let’s now start with the most instructive norm: the ℓ2.
ℓ2 Regularization
One of the most common regularization methods, ℓ2 regularization, consists of adding a term to the cost function that has the goal of effectively reducing the capacity of the network to adapt to complex datasets. Let’s first look at the mathematics behind the method.
Theory of ℓ2 Regularization
When using ℓ2 regularization, we add a term to the cost function J(w):

$$\tilde{J}(\mathbf{w}) = J(\mathbf{w}) + \frac{\lambda}{2m}\|\mathbf{w}\|_2^2$$

The second term is called the regularization term and is nothing more than the ℓ2 norm of the weight vector w squared, multiplied by a constant factor λ/2m (with m the number of training observations). λ is called the regularization parameter.
The regularization parameter λ is a new hyper-parameter that you need to tune to find its optimal value.
With this new cost function, the gradient-descent weight update becomes

$$w_{j,[n+1]} = w_{j,[n]}\left(1 - \frac{\gamma\lambda}{m}\right) - \gamma\,\frac{\partial J}{\partial w_{j,[n]}}$$

where γ is the learning rate. This is the equation that we need to use for the weight update. The difference with respect to the one we already know from plain gradient descent is that now the weight wj,[n] is multiplied by a constant 1 − γλ/m < 1. This has the effect of shifting the weight values toward zero during the update, therefore making the network less complex, which in turn helps to prevent overfitting. Let’s see what is really happening to the weights by applying the method to the Boston housing dataset.
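The following is a tiny illustrative sketch of a single such update step; gamma, lambda_, and m are placeholder values, and grad_J stands for the gradient of the unregularized cost.

```python
import numpy as np

gamma, lambda_, m = 0.01, 10.0, 400
w = np.random.randn(20)        # a weight vector
grad_J = np.random.randn(20)   # gradient of the unregularized cost w.r.t. w

# Each weight is first shrunk by the factor (1 - gamma * lambda_ / m) < 1,
# then updated with the usual gradient term: this is the weight decay.
w = w * (1.0 - gamma * lambda_ / m) - gamma * grad_J
```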
Keras Implementation
The implementation in Keras is extremely easy. The library performs all the computations for us, and we just have to decide which regularization we want to use, set the λ parameter, and apply it to each layer. The model construction remains the same.
In Keras, regularization is applied at the layer level, not globally on the cost function: you will notice that it is added to each layer rather than when defining the cost function. The explanation above remains valid, and the method works in the same way as we discussed. The reason for this design is that it can be helpful to add regularization only to certain layers, for example the largest ones, and not to layers with only a few neurons.
The main differences with respect to the previous function (the one that we used to build a network without regularization) are highlighted in bold.
With the line reg = tf.keras.regularizers.l2(l2=lambda_), we define the ℓ2 regularizer, setting the value for λ. Then we apply the regularizer to each layer by assigning it to the kernel_regularizer argument, which applies the penalty to the layer’s kernel (its weight matrix). The layers also expose the keyword arguments bias_regularizer and activity_regularizer, which apply a penalty to the layer’s bias and output, respectively, but these are used less often. Here we employ only the kernel_regularizer argument.
Remember that in Python lambda is a reserved word, so we cannot use it. This is why we use lambda_.
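The complete listing is not reproduced here; the following is a minimal sketch of the regularized version of the model-building function. It assumes the imports from the earlier sketch, the architecture, learning rate, and number of epochs are the same assumptions as before, and the function name and signature follow the ones used in the exercises at the end of the chapter.

```python
def create_and_train_reg_model_L2(train_x, train_y, dev_x, dev_y,
                                  num_neurons=20, num_layers=4, lambda_=0.0):
    # l2 regularizer applied to the kernel (weight matrix) of every hidden layer
    reg = tf.keras.regularizers.l2(l2=lambda_)

    model = keras.Sequential()
    model.add(keras.layers.Dense(num_neurons, activation='relu',
                                 kernel_initializer='he_normal',
                                 kernel_regularizer=reg,
                                 input_shape=(train_x.shape[1],)))
    for _ in range(num_layers - 1):
        model.add(keras.layers.Dense(num_neurons, activation='relu',
                                     kernel_initializer='he_normal',
                                     kernel_regularizer=reg))
    model.add(keras.layers.Dense(1))

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss='mse', metrics=['mse'])
    hist = model.fit(train_x, train_y, epochs=1000,
                     batch_size=train_x.shape[0], verbose=0,
                     validation_data=(dev_x, dev_y))
    return hist, model
```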
Now let’s train and evaluate our network to see what happens. This time we will print the MSE on the training dataset (MSEtrain) and on the dev dataset (MSEdev) to check what is going on. As mentioned, applying this method makes many weights go toward zero, effectively reducing the complexity of the network and therefore fighting overfitting. Let’s run the model for λ = 0.0 (without regularization) and for λ = 10.0.
Q (the total number of learnable parameters of the network) is a very big number. Even without regularization, it is interesting to note that roughly 6% of the weights are, after 1000 epochs, smaller than 10⁻³ in absolute value, so effectively close to zero. This is why talking about complexity in terms of the number of learnable parameters is risky. Additionally, using regularization will completely change the scenario. Complexity is a difficult concept to define: it depends on many things, including the architecture, the optimization algorithm, the cost function, and the number of epochs trained.
Defining the complexity of a network only in terms of the number of weights is not completely correct. The total number of weights gives you an idea, but it can be misleading since many may be zero after the training, effectively disappearing from the network, and making it less complex. It is more correct to talk about model complexity instead of network complexity, since many more aspects are involved than simply how many neurons or layers the network has.
Incredibly enough, only half of the weights play a role in the predictions at the end. This is why defining the network complexity only with the Q parameter is misleading. Given your problem, your loss function, and optimizer, you may well end up with a network that when trained is much simpler than what it was during the construction phase. So be very careful when using the term complexity in the deep learning world. Be aware of the subtleties involved.
Percentage of Weights Less Than 10⁻³ With and Without Regularization After 1000 Epochs (ℓ2 regularization)
Layer | Percent of Weights Less Than 10⁻³ for λ = 0.0 | Percent of Weights Less Than 10⁻³ for λ = 3.0
---|---|---
1 | 0.77 | 1.54 |
2 | 0.0 | 28.25 |
3 | 1.0 | 40.0 |
4 | 0.25 | 45.75 |
As you can see, for small values of λ (effectively without regularization) we are in an overfitting regime (MSEtrain ≪ MSEdev). Up to λ ≈ 6 the model overfits the training data; then the two curves cross and the overfitting ends. After the crossing, the two errors grow together: the model is now too simple to capture the fine structures of the data, and the error on the training dataset gets bigger because the model no longer fits even the training data well. In this specific case, a good value to choose for λ is around 6, roughly where the two lines cross, since there you are no longer in an overfitting region (MSEtrain ≈ MSEdev). Remember that the main goal of the regularization term is to obtain a model that generalizes as well as possible when applied to new data. You can also look at it in a different way: λ ≈ 6 gives you the minimum of MSEdev outside the overfitting region (which is λ ≲ 6), so it’s a good choice. Note that you may observe a very different behavior for your optimizing metric, so you have to decide on the best value for λ on a case-by-case basis.
A good way to estimate the optimal value of the regularization parameter λ is to plot your optimizing metric (in this example, the MSE) for the training and the dev datasets for various values of λ and see how they behave. You then choose the value that gives the minimum of your optimizing metric on the dev dataset while, at the same time, giving a model that is not overfitting the training data.
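A sketch of such a scan, using the function defined above, might look like the following; the list of λ values is arbitrary, and the 'mse' and 'val_mse' history keys come from the metrics=['mse'] setting, so they exclude the regularization term itself.

```python
import matplotlib.pyplot as plt

lambdas = [0.0, 0.5, 1.0, 2.0, 5.0, 7.0, 10.0, 15.0]
mse_train, mse_dev = [], []

for lam in lambdas:
    hist, model = create_and_train_reg_model_L2(train_x, train_y,
                                                dev_x, dev_y, 20, 4, lam)
    mse_train.append(hist.history['mse'][-1])      # final MSE on the training set
    mse_dev.append(hist.history['val_mse'][-1])    # final MSE on the dev set

plt.plot(lambdas, mse_train, label='MSE train')
plt.plot(lambdas, mse_dev, label='MSE dev')
plt.xlabel('lambda')
plt.ylabel('MSE')
plt.legend()
plt.show()
```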
ℓ1 Regularization
This section looks at a regularization technique that is very similar to ℓ2 regularization. It is based on the same principle, adding a term to the cost function. This time, the mathematical form of the added term is different, but the method works very similarly to what you saw in the previous sections. Let’s again look at the mathematics behind the algorithm.
Theory of ℓ1 Regularization and Keras Implementation
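In the ℓ1 case, the added term is proportional to the ℓ1 norm of the weights, that is, to the sum of their absolute values (a common convention is to add λ/m·‖w‖₁ to the cost function). In Keras, the only change with respect to the ℓ2 version is the regularizer object itself; a minimal sketch, under the same assumptions as the earlier sketches:

```python
# Assuming the same imports and model-building code as in the l2 sketch,
# the only change is the regularizer object passed to the layers.
lambda_ = 3.0                                  # example value
reg = tf.keras.regularizers.l1(l1=lambda_)     # penalty proportional to sum(|w|)

# Each hidden layer then receives it exactly as before, for example:
layer = keras.layers.Dense(20, activation='relu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=reg)
```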
As you can see, ℓ1 regularization has the same effect as ℓ2. It reduces the effective complexity of the network, reducing many weights to zero.
Percentage of Weights Less Than 10⁻³ With and Without Regularization After 1000 Epochs (ℓ1 regularization)
Layer | Percent of Weights Less Than 10⁻³ for λ = 0.0 | Percent of Weights Less Than 10⁻³ for λ = 3.0
---|---|---
1 | 0.0 | 90.77 |
2 | 0.5 | 94.50 |
3 | 0.0 | 96.75 |
4 | 0.0 | 94.50 |
Are the Weights Really Going to Zero?
Note that when using regularization, you end up having tensors with a lot of zero elements, called sparse tensors. You can then profit from special routines that are extremely efficient with sparse tensors. This is something to keep in mind when you start moving toward more complex models, but a subject too advanced for this book.
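To verify this yourself, you can count, layer by layer, how many weights of the trained model lie below the 10⁻³ threshold used in the tables above. A minimal sketch, assuming model is one of the Dense-only models trained earlier:

```python
import numpy as np

for i, layer in enumerate(model.layers):
    weights = layer.get_weights()[0]   # the kernel (weight matrix) of the layer
    percent = np.sum(np.abs(weights) < 1e-3) / weights.size * 100.0
    print(f"Layer {i + 1}: {percent:.2f}% of the weights are below 1e-3")
```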
Dropout
The basic idea of dropout is different: during the training phase you remove nodes from layer l randomly with a probability p[l]. In each iteration you remove different nodes, effectively training at each iteration a different network (when using mini-batches, you train a different network for each batch, for example).
In Keras, you simply add a dropout layer after each layer whose outputs you want to drop, using keras.layers.Dropout(rate). You must set the rate parameter, which can assume float values in the range [0, 1), since it represents the fraction of the input units to drop; therefore, it is not possible to drop all the units (by setting rate equal to 1). Usually, the rate parameter is set the same for all layers, although technically speaking it can be layer specific.
Very importantly, no dropout should be used when doing predictions on a dev dataset! Keras takes care of this automatically: dropout is applied only during the training phase, and no units are dropped when the model is evaluated on a different set.
During training, dropout removes nodes randomly during each iteration. But when doing predictions on a dev dataset, the entire network needs to be used without dropout. Keras will automatically consider this case for you.
Dropout can be layer-specific. For example, for layers with many neurons, rate can be small. For layers with a few neurons, you can set rate = 0.0, effectively keeping all neurons in such layers.
As you can see in the sketch that follows, you must put a dropout layer after each layer you want to modify, setting the rate parameter.
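The original listing is not reproduced here; the following is a minimal sketch of a model-building function with dropout, under the same assumptions as the earlier sketches (the function name and the default rate value are made up for illustration).

```python
def create_and_train_model_dropout(train_x, train_y, dev_x, dev_y,
                                   num_neurons=20, num_layers=4, rate=0.2):
    model = keras.Sequential()
    model.add(keras.layers.Dense(num_neurons, activation='relu',
                                 kernel_initializer='he_normal',
                                 input_shape=(train_x.shape[1],)))
    model.add(keras.layers.Dropout(rate))       # drops `rate` of this layer's outputs
    for _ in range(num_layers - 1):
        model.add(keras.layers.Dense(num_neurons, activation='relu',
                                     kernel_initializer='he_normal'))
        model.add(keras.layers.Dropout(rate))   # one dropout layer per hidden layer
    model.add(keras.layers.Dense(1))            # no dropout on the output layer

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss='mse', metrics=['mse'])
    hist = model.fit(train_x, train_y, epochs=1000,
                     batch_size=train_x.shape[0], verbose=0,
                     validation_data=(dev_x, dev_y))
    return hist, model
```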
Figure 4-13 shows that MSEdev, after an initial drop, keeps growing: the model is in a clear, extreme overfitting regime (MSEtrain ≪ MSEdev) and generalizes poorly when applied to new data. In Figure 4-12, in contrast, you can see that MSEtrain and MSEdev are of the same order of magnitude and MSEdev does not continue to grow, so we have a model that generalizes much better than the one shown in Figure 4-13.
When applying dropout, your metric (in this case, the MSE) will oscillate, so do not be surprised when trying to find the best hyper-parameters if you see your optimizing metric oscillating.
Early Stopping
Early stopping simply consists of stopping the training at the point where MSEdev has its minimum (in Figure 4-14, the minimum is indicated by a vertical line). Note that this is not an ideal way of solving the overfitting problem: your model will still most probably generalize rather badly to new data, and it is usually preferable to use other techniques. Additionally, doing it by hand is time consuming and error prone. You can get a good overview of the different contexts in which early stopping appears by checking the Wikipedia page at https://goo.gl/xnKo2s.
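If you do want to use early stopping, Keras provides a callback that automates the procedure: training stops when the monitored dev-set metric has not improved for a given number of epochs, and the best weights can be restored. A minimal sketch, where the patience value is arbitrary and model is one of the compiled models built earlier:

```python
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=50,
                                              restore_best_weights=True)

hist = model.fit(train_x, train_y, epochs=10000,
                 batch_size=train_x.shape[0], verbose=0,
                 validation_data=(dev_x, dev_y),
                 callbacks=[early_stop])
```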
Additional Methods
Get more data. This is the simplest way of fighting overfitting. Unfortunately, very often in real life this is not possible. If you are classifying pictures of cats taken with a smartphone you may think of getting more data from the web. Although this may seem like a perfectly good idea, you may discover that the images have different quality, that possibly not all the images are really cats (what about cat toys?), you may only find images of white cats, and so on. Basically, your additional observations may come from a very different distribution than your original data and that will be a problem. So, when getting additional data, consider this problem well before proceeding.
Augment your data. For example, if you are working with images, you can generate additional images by rotating, stretching, shifting, and otherwise editing your original images (see the sketch below). That is a very common technique that may help.
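The Boston housing data is tabular, so augmentation does not apply to it directly, but for an image problem a sketch could look like the following; all parameter values, as well as the image arrays and model, are hypothetical.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each epoch, the generator produces randomly rotated, shifted, and flipped
# variants of the original images, effectively enlarging the training set.
datagen = ImageDataGenerator(rotation_range=20,       # random rotations up to 20 degrees
                             width_shift_range=0.1,   # horizontal shifts up to 10%
                             height_shift_range=0.1,  # vertical shifts up to 10%
                             horizontal_flip=True)

# Hypothetical image model and data:
# image_model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=50)
```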
The problem of making a model generalize better on new data is one of machine learning’s biggest goals. It is a complicated problem that requires experience and tests, lots of tests. A lot of research is currently going on to solve it for very complex problems.
Exercises
Number of Layers | Number of Neurons in Each Layer |
---|---|
1 | 3 |
1 | 5 |
2 | 3 |
2 | 5 |
Find the minimum value of λ (in the case of ℓ2) for which the overfitting stops. Perform a set of tests using the function hist, model = create_and_train_reg_model_L2(train_x, train_y, dev_x, dev_y, 20, 4, 0.0), varying the value of λ in regular increments (you can decide which values you want to test). Use at a minimum the values 0, 0.5, 1.0, 2.0, 5.0, 7.0, 10.0, and 15.0. After that, plot the value of the cost function on the training dataset and on the dev dataset vs. λ.
In the ℓ1 regularization example applied to the Boston dataset, plot the number of weights close to zero in hidden layer 3 vs. λ. Considering only layer 3, evaluate the quantity np.sum(np.abs(weights3) < 1e-3) / weights3.size * 100.0 (the percentage of weights smaller than 10⁻³, which we computed before) for several values of λ; consider at least the values 0, 0.5, 1.0, 2.0, 5.0, 7.0, 10.0, and 15.0. Plot this percentage vs. λ. What shape does the curve have? Does it flatten out?
Implement ℓ2 regularization from scratch.
References
[1] Delve (Data for Evaluating Learning in Valid Experiments), “The Boston Housing Dataset,” www.cs.toronto.edu/~delve/data/boston/bostonDetail.html, 1996, last accessed 22.03.2021.
[2] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
[3] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
[4] Kukačka, J. et al. (2017). “Regularization for Deep Learning: A Taxonomy,” arXiv:1710.10686v1, available at https://goo.gl/wNkjXz, last accessed 28.03.2021.