
5. Regularization


In this chapter, you will look at a very important technique often used when training deep networks: regularization. You will look at techniques such as the ℓ2 and ℓ1 methods, dropout, and early stopping. You will see how these methods, when applied correctly, help avoid the problem of overfitting and achieve much better results from your models. You will look at the mathematics behind the methods and at how to implement them correctly in Python and TensorFlow.

Complex Networks and Overfitting

In the previous chapters, you have learned how to build and train complex networks. One of the most common problems you will encounter when using complex networks is overfitting. Review Chapter 3 for an overview of what overfitting is. In this chapter, you will face an extreme case of overfitting, and I will discuss a few strategies to avoid it. A perfect dataset to study this problem is the Boston housing price dataset discussed in Chapter 2. Let’s review how to get the data (for a more detailed discussion, please refer to Chapter 2). Start with the packages we need.
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
import sklearn.linear_model as sk
Then import the dataset.
boston = load_boston()
features = np.array(boston.data)
target = np.array(boston.target)
The dataset has 13 features (contained in the features NumPy array) and the house price (contained in the target NumPy array). As in Chapter 2, to normalize the features, we will use the following function:
def normalize(dataset):
    mu = np.mean(dataset, axis = 0)
    sigma = np.std(dataset, axis = 0)
    return (dataset-mu)/sigma
To conclude our dataset preparation, let’s normalize the features and then create a training and a dev dataset.
features_norm = normalize(features)
np.random.seed(42)
rnd = np.random.rand(len(features_norm)) < 0.8
train_x = np.transpose(features_norm[rnd])
train_y = np.transpose(target[rnd])
dev_x = np.transpose(features_norm[~rnd])
dev_y = np.transpose(target[~rnd])
The np.random.seed(42) is there so that you will always get the same training and dev dataset (this way, your results will be reproducible). Now, let’s reshape the arrays we need.
train_y = train_y.reshape(1,len(train_y))
dev_y = dev_y.reshape(1,len(dev_y))
Next, let’s build a complex neural network with 4 layers and 20 neurons for each layer. Define the following function to build each layer:
def create_layer (X, n, activation):
    ndim = int(X.shape[0])
    stddev = 2.0 / np.sqrt(ndim)
    initialization = tf.truncated_normal((n, ndim), stddev = stddev)
    W = tf.Variable(initialization)
    b = tf.Variable(tf.zeros([n,1]))
    Z = tf.matmul(W,X)+b
    return activation(Z), W, b
Note that this time, we return the weights tensor W and the bias b. We will need them when implementing regularization. You have already seen this function at the end of Chapter 3, so you should understand what it does. We use the He initialization here, because we will use ReLU activation functions. The network can be created with the following code:
tf.reset_default_graph()
n_dim = 13
n1 = 20
n2 = 20
n3 = 20
n4 = 20
n_outputs = 1
tf.set_random_seed(5)
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
hidden1, W1, b1 = create_layer (X, n1, activation = tf.nn.relu)
hidden2, W2, b2 = create_layer (hidden1, n2, activation = tf.nn.relu)
hidden3, W3, b3 = create_layer (hidden2, n3, activation = tf.nn.relu)
hidden4, W4, b4 = create_layer (hidden3, n4, activation = tf.nn.relu)
y_, W5, b5 = create_layer (hidden4, n_outputs, activation = tf.identity)
cost = tf.reduce_mean(tf.square(y_-Y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8).minimize(cost)
In our output layer, we have one neuron with the identity activation function for regression. Additionally, we use the Adam optimizer, as suggested in Chapter 4. Now let’s run the model with this code:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
cost_train_history = []
cost_dev_history = []
for epoch in range(10000+1):
    sess.run(optimizer, feed_dict = {X: train_x, Y: train_y, learning_rate: 0.001})
    cost_train_ = sess.run(cost, feed_dict={ X:train_x, Y: train_y, learning_rate: 0.001})
    cost_dev_ = sess.run(cost, feed_dict={ X:dev_x, Y: dev_y, learning_rate: 0.001})
    cost_train_history = np.append(cost_train_history, cost_train_)
    cost_dev_history = np.append(cost_dev_history, cost_dev_)
    if (epoch % 1000 == 0):
        print("Reached epoch",epoch,"cost J(train) =", cost_train_)
        print("Reached epoch",epoch,"cost J(test) =", cost_test_)
As you may have noticed, there are a few differences from what we did before. To make things simpler, I avoided writing a function and simply hard-coded all the values in the code, because, in this case, we don’t need to tune the parameters much. I am not using mini-batches here, because we have only a few hundred observations, and I am calculating the MSE (mean squared error) for both training and dev datasets with the following lines:
cost_train_ = sess.run(cost, feed_dict={ X:train_x, Y: train_y, learning_rate: 0.001})
cost_dev_ = sess.run(cost, feed_dict={ X:dev_x, Y: dev_y, learning_rate: 0.001})
In this way, we can check what is happening on both datasets at the same time. Now if you let the code run and plot the two MSEs , one for the training, which we will indicate by MSEtrain, and one for the dev dataset, indicated by MSEdev, we get Figure 5-1.
Figure 5-1

MSE for the training (continuous line) and the dev dataset (dashed line) for the neural network with 4 layers, each having 20 neurons

You will notice how the training error goes down to zero, while the dev error, after dropping rapidly at the beginning, remains roughly constant around a value of 20. If you remember the introduction to basic error analysis, you will know that this means we are in a regime of extreme overfitting (when MSEtrain ≪ MSEdev). The error on the training dataset is practically zero, while the one on the dev dataset is not. The model cannot generalize at all when applied to new data. In Figure 5-2, you can see the predicted value plotted vs. the real value. You will notice how, in the plot on the left, for the training data, the prediction is almost perfect, while the plot on the right, for the dev dataset, is not that good. You will remember that a perfect model would give you predicted values exactly equal to the measured ones, so, when plotting one vs. the other, they would all lie on the 45-degree line of the plot, as in Figure 5-2, on the left.
Figure 5-2

Predicted value vs. the real value for the target variable (the house price). You will notice how in the left-hand plot, for the training data, the prediction is almost perfect, while on the plot on the right, for the dev dataset, the predictions are more spread.

What can we do in this case to avoid the problem of overfitting? One solution, of course, would be to reduce the complexity of the network, that is, reducing the number of layers and/or the number of neurons in each layer. But, as you can imagine, this strategy is very time-consuming. You must try several network architectures to see how the training error and the dev error behave. In this case, this is still a viable solution, but if you are working on a problem for which the training phase takes several days, this can be quite difficult and extremely time-consuming. Several strategies have been developed to deal with this problem. The most common is called regularization, the focus of this chapter.

What Is Regularization?

Before going into the different methods, I would like to quickly discuss what the deep-learning community understands by the term regularization. The term has deeply (pun intended) evolved over time. For example, in the traditional sense (from the ’90s), the term was reserved for a penalty term added to the loss function (Christopher M. Bishop, Neural Networks for Pattern Recognition, New York: Oxford University Press, 1995). Lately, the term has gained a much broader meaning. For example, Ian Goodfellow et al. (Deep Learning, Cambridge, MA: MIT Press, 2016) define it as “any modification we make to a learning algorithm that is intended to reduce its test error but not its training error.” Jan Kukačka et al. (“Regularization for deep learning: a taxonomy,” arXiv:1710.10686v1, available at https://goo.gl/wNkjXz ) generalize the term even further and offer the following definition: “Regularization is any supplementary technique that aims at making the model generalize better, i.e., produce better results on the test set.” So, be aware when using the term, and always be precise about what you mean.

You may also have heard or read the claim that regularization was developed to fight overfitting. This is also a valid way of understanding it. Remember: a model that is overfitting the training dataset is not generalizing well to new data. This definition can also be found online, along with all the others. Although these are merely definitions, it is important to be familiar with them, so that you may better understand what is meant when reading papers or books. This is a very active research area, and to give you an idea, Kukačka et al., in the review paper referenced above, list 58 different regularization methods. Yes, 58; that is not a typo. But it is important to understand that under their general definition, SGD (stochastic gradient descent) is also considered a regularization method, something not everyone agrees on. So, be warned: when reading research material, check what is understood by the term regularization.

In this chapter, you will look at the three most common and well-known methods: ℓ1, ℓ2, and dropout, and I will briefly discuss early stopping, although this last method does not, technically speaking, fight overfitting. ℓ1 and ℓ2 achieve a so-called weight decay, by adding a regularization term to the cost function, while dropout simply removes nodes from the network, in a random fashion, during the training phase. To understand the three methods properly, we must study them in detail. Let’s start with probably the most instructive one: ℓ2 regularization.

At the end of the chapter, we will explore a few other ideas about how to fight overfitting and get the model to generalize better. Instead of modifying the model or the learning algorithm, we will consider strategies based on modifying the training data, to make learning more effective.

About Network Complexity

I would like to spend a few moments briefly discussing a term I’ve used very often: network complexity. You have read here, as you can almost anywhere, that with regularization, you want to reduce network complexity. But what does that really mean? Actually, it is relatively difficult to give a definition of network complexity, so much so that nobody does it. You can find several research papers on the problem of model complexity (note that I did not say network complexity), with roots in information theory. You will see in this chapter how, for example, the number of weights that are different from zero changes dramatically with the number of epochs, with the optimization algorithm, and so on, therefore making this vaguely intuitive concept of complexity dependent also on how long you train your model. To make a long story short, the term network complexity should be used only on an intuitive level, because, theoretically, it is a very difficult concept to define. A complete discussion of the subject would be way beyond the scope of this book.

ℓp Norm

Before we start studying what ℓ1 and ℓ2 regularization are, I must introduce the ℓp norm notation. We define the ℓp norm of a vector x with components xi as
$$ {\left\Vert x \right\Vert}_p = \sqrt[p]{\sum_i {\left| x_i \right|}^p} \qquad p \in \mathbb{R} $$

where the sum is performed over all components of the vector x.
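To make the definition concrete, here is a quick NumPy check (a small sketch, not code from the book), which also compares the result with NumPy’s own np.linalg.norm:
import numpy as np
def lp_norm(x, p):
    # p-th root of the sum of the absolute values of the components raised to p
    return np.sum(np.abs(x)**p)**(1.0/p)
x = np.array([1.0, -2.0, 3.0])
print(lp_norm(x, 2), np.linalg.norm(x, 2))   # the l2 norm, both give ~3.742
print(lp_norm(x, 1), np.linalg.norm(x, 1))   # the l1 norm, both give 6.0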

Let’s begin with the most instructive norm: the ℓ2.

ℓ2 Regularization

One of the most common regularization methods, ℓ2 regularization consists of adding a term to the cost function that has the goal of effectively reducing the capacity of the network to adapt to complex datasets. Let’s first have a look at the mathematics behind the method.

Theory of ℓ2 Regularization

When doing plain regression, as you will remember from Chapter 2, our cost function is simply the MSE
$$ J(w) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \widehat{y}_i \right)^2 $$
where yi is our measured target variable, $\widehat{y}_i$ is the predicted value, w is the vector of all the weights of our network, including the bias, and m is the number of observations. Now let’s define a new cost function $\tilde{J}(w)$
$$ \tilde{J}(w) = J(w) + \frac{\lambda}{2m} {\left\Vert w \right\Vert}_2^2 $$
This additional term,
$$ \frac{\lambda}{2m} {\left\Vert w \right\Vert}_2^2 $$

is called the regularization term and is nothing else than the ℓ2 norm squared of w, multiplied by a constant factor λ/2m. λ is called the regularization parameter.

Note

The new regularization parameter, λ, is a new hyperparameter that you must tune to find the optimal value.

Now let’s try to get an intuitive understanding of the effect of this term on the GD (gradient descent) algorithm. Let’s consider the update equation for the weight wj
$$ w_{j,[n+1]} = w_{j,[n]} - \gamma \frac{\partial \tilde{J}\left(w_{[n]}\right)}{\partial w_j} = w_{j,[n]} - \gamma \frac{\partial J\left(w_{[n]}\right)}{\partial w_j} - \frac{\gamma \lambda}{m} w_{j,[n]} $$
Since
$$ \frac{\partial}{\partial w_j} {\left\Vert w \right\Vert}_2^2 = 2 w_j $$
this gives us
$$ w_{j,[n+1]} = w_{j,[n]} \left( 1 - \frac{\gamma \lambda}{m} \right) - \gamma \frac{\partial J\left(w_{[n]}\right)}{\partial w_j} $$

This is the equation that we must use for the weight update. The difference from the one we already know from plain GD is that, now, the weight wj,[n] is multiplied by a constant $1 - \frac{\gamma \lambda}{m} < 1$, and, therefore, this has the effect of shifting the weight values toward zero during the update, making the network less complex (intuitively), thus fighting overfitting. Let’s try to see what is really happening to the weights, by applying the method to the Boston housing dataset.
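To make the weight-decay factor explicit, here is a tiny NumPy sketch of the regularized update for a single weight (the numbers are purely illustrative, not taken from the book):
import numpy as np
gamma, lambd, m = 0.001, 10.0, 400     # learning rate, regularization parameter, observations
w = 0.5                                # current value of the weight w_j
grad_J = 0.1                           # gradient of the unregularized cost with respect to w_j
w_new = w*(1.0 - gamma*lambd/m) - gamma*grad_J
# the factor (1 - gamma*lambd/m) < 1 shrinks the weight a little at every update,
# on top of the usual gradient step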

TensorFlow Implementation

The implementation in TensorFlow is quite easy. Remember: We must calculate the additional term $\left\Vert w \right\Vert_2^2$, then add it to the cost function. The model construction remains almost the same. We can do it with the following code:
tf.reset_default_graph()
n_dim = 13
n1 = 20
n2 = 20
n3 = 20
n4 = 20
n_outputs = 1
tf.set_random_seed(5)
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
hidden1, W1, b1 = create_layer (X, n1, activation = tf.nn.relu)
hidden2, W2, b2 = create_layer (hidden1, n2, activation = tf.nn.relu)
hidden3, W3, b3 = create_layer (hidden2, n3, activation = tf.nn.relu)
hidden4, W4, b4 = create_layer (hidden3, n4, activation = tf.nn.relu)
y_, W5, b5 = create_layer (hidden4, n_outputs, activation = tf.identity)
lambd = tf.placeholder(tf.float32, shape=())
reg = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3) + \
      tf.nn.l2_loss(W4) + tf.nn.l2_loss(W5)
cost_mse = tf.reduce_mean(tf.square(y_-Y))
cost = tf.reduce_mean(cost_mse + lambd*reg)
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8).minimize(cost)
For our new regularization parameter λ, we create a placeholder.
lambd = tf.placeholder(tf.float32, shape=())
Remember that in Python, lambda is a reserved word, so we cannot use it. This is the reason we use lambd. Then we calculate our regularization term $\left\Vert w \right\Vert_2^2$
reg = tf.nn.l2_loss(W1) + tf.nn.l2_loss(W2) + tf.nn.l2_loss(W3) + \
      tf.nn.l2_loss(W4) + tf.nn.l2_loss(W5)
with the useful TensorFlow function tf.nn.l2_loss(), and then we add it to the MSE function cost_mse. (Note that tf.nn.l2_loss() returns the sum of the squared elements divided by 2, so the factor 1/2 appearing in the regularization term is already taken care of.)
cost_mse = tf.reduce_mean(tf.square(y_-Y))
cost = tf.reduce_mean(cost_mse + lambd*reg)
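As a side note (this is not the book’s code, just a compact alternative), the same regularization term can be built without listing every weight tensor by hand, using tf.add_n to sum a list of tensors:
weights = [W1, W2, W3, W4, W5]
reg = tf.add_n([tf.nn.l2_loss(w) for w in weights])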
Now our cost tensor will contain the MSE plus the regularization term. Then we simply need to train the network and observe what happens. To train the network, we use this function:
def model(training_epochs, features, target, logging_step = 100, learning_r = 0.001, lambd_val = 0.1):
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    cost_history = []
    for epoch in range(training_epochs+1):
        sess.run(optimizer, feed_dict = {X: features, Y: target, learning_rate: learning_r, lambd: lambd_val})
        cost_ = sess.run(cost_mse, feed_dict={ X:features, Y: target, learning_rate: learning_r, lambd: lambd_val})
        cost_history = np.append(cost_history, cost_)
        if (epoch % logging_step == 0):
                pred_y_dev = sess.run(y_, feed_dict = {X: dev_x, Y: dev_y})
                print("Reached epoch",epoch,"cost J =", cost_)
                print("Training MSE = ", cost_)
                print("Dev MSE      = ", sess.run(cost_mse, feed_dict = {X: dev_x, Y: dev_y}))
    return sess, cost_history
This time, I printed the MSE coming from the training (MSEtrain) and dev (MSEdev) datasets, to check what is going on. As mentioned, applying this method drives many weights to zero, effectively reducing the complexity of the network and, therefore, fighting overfitting. Let’s run the model for λ = 0 (that is, without regularization) and for λ = 10.0. We can run our model with the following code:
sess, cost_history = model(learning_r = 0.01,
                                training_epochs = 5000,
                                features = train_x,
                                target = train_y,
                                logging_step = 5000,
                                lambd_val = 0.0)
which gives us
Reached epoch 0 cost J = 238.378
Training MSE = 238.378
Dev MSE = 205.561
Reached epoch 5000 cost J = 0.00527479
Training MSE = 0.00527479
Dev MSE = 28.401
As expected, we are in an extreme overfitting regime (MSEtrain ≪ MSEdev) after 5000 epochs. Now let’s try it with λ = 10.
sess, cost_history = model(learning_r = 0.01,
                                training_epochs = 5000,
                                features = train_x,
                                target = train_y,
                                logging_step = 5000,
                                lambd_val = 10.0)
This gives the result
Reached epoch 0 cost J = 248.026
Training MSE = 248.026
Dev MSE = 214.921
Reached epoch 5000 cost J = 23.795
Training MSE = 23.795
Dev MSE = 21.6406
Now we are no longer in an overfitting regime, because the two MSE values are of the same order of magnitude. The best way of checking what is going on is to study the weights distribution for each layer. In Figure 5-3, the weights distributions for the first 4 layers are plotted. The light gray histogram is for the weights without regularization, and the darker (and much more concentrated around zero) area is for the weights with regularization. I neglected layer 5, since it is the output layer.
Figure 5-3

Weights distribution for each layer

You can clearly see how the weights, when we apply regularization, are much more concentrated around zero, meaning they are much smaller than without regularization. This makes the weight-decay effect of regularization very evident. I would like to take the chance to make another brief digression on network complexity. I said that this method reduces the network complexity. I told you in Chapter 3 that you can consider the number of learnable parameters an indication of the complexity of a network, but I also warned you that this can be very misleading. Now I would like to show you why. You will remember from Chapter 3 that the total number of learnable parameters in a network like the one we are using here is given by the formula
$$ Q = \sum_{l=1}^{L} n_l \left( n_{l-1} + 1 \right) $$
where nl is the number of neurons in layer l, and L is the total number of layers, including the output layer. In our case, we have an input layer with 13 features, then 4 layers, each with 20 neurons, and then an output layer with 1 neuron. Therefore, Q is given by
$$ Q = 20 \times (13+1) + 20 \times (20+1) + 20 \times (20+1) + 20 \times (20+1) + 1 \times (20+1) = 1561 $$
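If you want to verify the number, a one-line Python check (a sketch, not from the book) gives the same result:
layers = [13, 20, 20, 20, 20, 1]   # input features, four hidden layers, output neuron
Q = sum(n_l*(n_prev + 1) for n_prev, n_l in zip(layers[:-1], layers[1:]))
print(Q)   # 1561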

Q is quite a big number. It is interesting to note that, already without regularization, roughly 48% of the weights are, after 10,000 epochs, smaller than 10^-10, so, effectively, zero. This is the reason I warned you about talking about complexity in terms of numbers of learnable parameters. Additionally, using regularization will change the scenario completely. Complexity is a difficult concept to define: it depends on many things, among others, the architecture, the optimization algorithm, the cost function, and the number of epochs trained.

Note

Defining the complexity of a network only in terms of number of weights is not completely correct. The total number of weights gives an idea, but it can be quite misleading, because many may be zero after the training, effectively disappearing from the network, and making it less complex. It is more correct to talk about “Model Complexity,” instead of network complexity, because many more aspects are involved than simply how many neurons or layers the network has.

Incredibly enough, only about half of the weights play a role in the predictions in the end. This is the reason I told you in Chapter 3 that defining the network complexity only with the parameter Q is misleading. Given your problem, your loss function, and your optimizer, you may well end up with a network that, when trained, is much simpler than it was at the construction phase. So be very careful when using the term complexity in the deep learning world. Be aware of the subtleties involved.

To give you an idea of how effective regularization is in reducing the weights, see Table 5-1, which compares, for each layer, the percentage of weights less than 1e-3 with and without regularization after 1000 epochs.
Table 5-1 Percentage of Weights Less Than 1e-3 with and Without Regularization After 1000 Epochs

Layer    % of Weights Less Than 1e-3 for λ = 0    % of Weights Less Than 1e-3 for λ = 3
1        0.0                                       20.0
2        0.25                                      41.5
3        0.75                                      60.5
4        0.25                                      66.0
5        0.0                                       35.0
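In case you want to reproduce numbers like those in Table 5-1, a minimal sketch (an assumption on my part, the book does not show this code) is to evaluate a weight tensor from the trained session and count the small entries:
W3_val = sess.run(W3)                                         # weights of layer 3 after training
pct_small = 100.0*np.sum(np.abs(W3_val) < 1e-3)/W3_val.size
print(pct_small)                                              # percentage of weights below 1e-3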

But how should we choose λ? To get an idea (repeat after me: “In the deep learning world, there is no universal rule.”), it is useful to see what happens to your optimizing metric (in this case, the MSE) when you vary the parameter λ. In Figure 5-4, you can see the behavior of MSEtrain (continuous line) and MSEdev (dashed line) for our network, as λ varies, after 1000 epochs.
Figure 5-4

Behavior of the MSE for the training (continuous line) dataset and for the dev (dashed) dataset for our network varying λ.

As you can see, for small values of λ (effectively without regularization), we are in an overfitting regime (MSEtrain ≪ MSEdev): MSEtrain slowly increases with λ, while MSEdev remains roughly constant. Until λ ≈ 7.5, the model overfits the training data; then the two curves cross, and the overfitting ends. After that, they grow together: the model becomes too simple to capture the fine structures of the data, and, therefore, the errors grow together, and the error on the training dataset also gets bigger, because the model no longer even fits the training data well. In this specific case, a good value to choose for λ would be about 7.5, roughly the value at which the two lines cross, because there you are no longer in an overfitting region, as MSEtrain ≈ MSEdev. Remember: the main goal of the regularization term is to get a model that generalizes in the best possible way when applied to new data. You can also look at it in a different way: λ ≈ 7.5 gives you the minimum of MSEdev outside the overfitting region (λ ≲ 7.5); therefore, it would be a good choice. Note that you may observe a very different behavior of your optimizing metric for your own problems, so you will have to decide on a case-by-case basis what the best value of λ is for you.
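A minimal sketch of such a λ sweep, assuming the model() function defined earlier in this chapter (the list of λ values is illustrative):
lambdas = [0.0, 2.5, 5.0, 7.5, 10.0, 12.5, 15.0]
mse_train, mse_dev = [], []
for l in lambdas:
    sess, _ = model(learning_r = 0.01, training_epochs = 1000,
                    features = train_x, target = train_y,
                    logging_step = 1000, lambd_val = l)
    # record the final MSE on the training and dev datasets for this lambda
    mse_train.append(sess.run(cost_mse, feed_dict = {X: train_x, Y: train_y}))
    mse_dev.append(sess.run(cost_mse, feed_dict = {X: dev_x, Y: dev_y}))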

Note

A good way to estimate the optimal value of the regularization parameter λ is to plot your optimizing metric (in this example, the MSE) for the training and dev datasets and observe how they behave for various values of λ. Then choose the value that gives the minimum of your optimizing metric on the dev dataset and, at the same time, gives you a model that no longer overfits your training data.

I would now like to show you the effects of ℓ2 regularization in an even more visual way. Let’s consider a dataset generated with the following code:
nobs = 30
np.random.seed(42)
xx1 = np.array([np.random.normal(0.3,0.15) for i in range (0,nobs)])
yy1 = np.array([np.random.normal(0.3,0.15) for i in range (0,nobs)])
xx2 = np.array([np.random.normal(0.1,0.1) for i in range (0,nobs)])
yy2 = np.array([np.random.normal(0.3,0.1) for i in range (0,nobs)])
c1_ = np.c_[xx1.ravel(), yy1.ravel()]
c2_ = np.c_[xx2.ravel(), yy2.ravel()]
c = np.concatenate([c1_,c2_])
yy1_ = np.full(nobs, 0, dtype=int)
yy2_ = np.full(nobs, 1, dtype=int)
yyL = np.concatenate((yy1_, yy2_), axis = 0)
train_x = c.T
train_y = yyL.reshape(1,60)
Our dataset has two features: x and y. We generate two groups of points, xx1,yy1 and xx2,yy2, from a normal distribution. To the first group, we assign the label 0 (contained in the array yy1_), and to the second, the label 1 (in the array yy2_). Now let’s use a network such as that described before (with 4 layers, each having 20 neurons) to do some binary classification on this dataset. We can take the same code given before, modifying the output layer and the cost function. You will remember that for binary classification, we need one neuron in the output layer with the sigmoid activation function
y_, W5, b5 = create_layer (hidden4, n_outputs, activation = tf.sigmoid)
and the following cost function:
cost_class = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))
cost = tf.reduce_mean(cost_class + lambd*reg)
All the rest remains the same as was described earlier. Let’s plot the decision boundary for this problem. This means that we will run our network on our dataset with the code
sess, cost_history = model(learning_r = 0.005,
                                training_epochs = 100,
                                features = train_x,
                                target = train_y,
                                logging_step = 10,
                                lambd_val = 0.0)
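The book does not show the plotting code; a minimal sketch of how the gray region in Figure 5-5 can be drawn, assuming the classification network above was built with n_dim = 2 and that sess, y_, and X come from the training run, is the following:
xx, yy = np.meshgrid(np.arange(-0.3, 0.8, 0.005), np.arange(-0.1, 0.7, 0.005))
grid = np.c_[xx.ravel(), yy.ravel()].T                    # shape (2, n_points), as expected by X
probs = sess.run(y_, feed_dict = {X: grid})               # sigmoid output between 0 and 1
preds = (probs > 0.5).astype(int).reshape(xx.shape)
plt.contourf(xx, yy, preds, cmap = 'Greys', alpha = 0.4)
plt.scatter(xx1, yy1, c = 'white', edgecolors = 'black')  # class 0
plt.scatter(xx2, yy2, c = 'black')                        # class 1
plt.show()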
In Figure 5-5, you can see our dataset, where the white points belong to the first class and the black ones to the second. The gray area is the zone that the network classifies as being of one class, and the white area as the other. You can see that the network is able to capture the complex structure of our data in a flexible way.
Figure 5-5

Decision boundary without regularization. White points are of the first class, and the black of the second.

Now let’s apply regularization to the network, exactly as we did before, and see how the decision boundary is modified. Here, we will use a regularization parameter λ = 0.1.

You can clearly see in Figure 5-6 how the decision boundary is now almost linear and no longer able to capture the complex structure of our data. This is exactly what we expected: the regularization term makes the model simpler and, therefore, less able to capture the fine structures.
Figure 5-6

Decision boundary, as predicted by the network with ℓ2 regularization and a regularization parameter λ = 0.1

It is interesting to compare the decision boundary of our network with the result of logistic regression performed with just one neuron. I will not put the code here, for space considerations, but if you compare the two decision boundaries in Figure 5-7 (the one coming from the network with one neuron is linear), you can see that they are almost the same. A regularization parameter of λ = 0.1 effectively gives the same results as a network with just one neuron.
Figure 5-7

Decision boundaries for a complex network with λ = 0.1 and for one with just one neuron. The two boundaries almost overlap completely.

ℓ1 Regularization

Now we will look at a regularization technique that is very similar to ℓ2 regularization. It is based on the same principle, adding a term to the cost function. This time, the mathematical form of the added term is different, but the method works very similarly to what I explained in the previous sections. Let’s again first have a look at the mathematics behind the algorithm.

Theory of ℓ1 Regularization and TensorFlow Implementation

ℓ1 regularization also works by adding an additional term to the cost function
$$ \tilde{J}(w) = J(w) + \frac{\lambda}{m} {\left\Vert w \right\Vert}_1 $$
The effect it has on the learning is effectively the same as was described for ℓ2 regularization. TensorFlow does not offer, as it does for ℓ2, a ready-to-use function, so we must code the term manually:
reg = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2)) + tf.reduce_sum(tf.abs(W3)) + \
      tf.reduce_sum(tf.abs(W4)) + tf.reduce_sum(tf.abs(W5))
The rest of the code remains the same as discussed. We can again compare the weights distribution between the model without a regularization term (λ = 0) and with regularization (λ = 3, Figure 5-8). We used the Boston dataset for the calculation and trained the model with the following call:
sess, cost_history = model(learning_r = 0.01,
                                training_epochs = 1000,
                                features = train_x,
                                target = train_y,
                                logging_step = 1000,
                                lambd_val = 3.0)
Figure 5-8

Weights distribution comparison between the model without the ℓ1 regularization term (λ = 0, light gray) and with ℓ1 regularization (λ = 3, dark gray)

once with λ = 0, and once with λ = 3.

As you can see, ℓ1 regularization has the same effect as ℓ2. It reduces the effective complexity of the network, reducing many weights to zero.

To give you an idea of how effective regularization is in reducing the weights, see Table 5-2, which compares the percentage of weights less than 1e-3 with and without regularization after 1000 epochs.
Table 5-2 Comparison of Percentage of Weights Less Than 1e-3 with and Without Regularization

Layer    % of Weights Less Than 1e-3 for λ = 0    % of Weights Less Than 1e-3 for λ = 3
1        0.0                                       52.7
2        0.25                                      53.8
3        0.75                                      46.3
4        0.25                                      45.3
5        0.0                                       60.0

Are Weights Really Going to Zero?

It is very instructive to see how the weights actually go to zero. In Figure 5-9, you can see the weight $w_{12,5}^{[3]}$ (from layer 3) plotted vs. the number of epochs, for our artificial dataset with two features, ℓ2 regularization, γ = 10^-3, and λ = 0.1, trained for 1000 epochs. You can see how it quickly decreases to zero. Its value after 1000 epochs is 2·10^-21, so, for all practical purposes, zero.
Figure 5-9

Weight $w_{12,5}^{[3]}$ plotted vs. the epochs for our artificial dataset with two features, ℓ2 regularization, γ = 10^-3, λ = 0.1, trained for 1000 epochs
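The book does not show the code used to record this curve; a possible sketch, assuming the ℓ2-regularized classification network and the two-feature dataset from the previous section, is the following (the index [11, 4] is the 0-based position of the weight $w_{12,5}^{[3]}$):
sess = tf.Session()
sess.run(tf.global_variables_initializer())
weight_history = []
for epoch in range(1000+1):
    sess.run(optimizer, feed_dict = {X: train_x, Y: train_y,
                                     learning_rate: 0.001, lambd: 0.1})
    weight_history.append(sess.run(W3)[11, 4])   # record one single weight at each epoch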

In case you are wondering, the weight goes to zero almost exponentially. A way of understanding why this is the case is the following. Let’s consider the weight update equation for one weight.
$$ w_{j,[n+1]} = w_{j,[n]} \left( 1 - \frac{\gamma \lambda}{m} \right) - \gamma \frac{\partial J\left(w_{[n]}\right)}{\partial w_j} $$
Let’s now suppose that we find ourselves close to the minimum, in a region where the derivative of the cost function J is almost zero, so that we can neglect it. In other words, let’s suppose
$$ \frac{\partial J\left(w_{[n]}\right)}{\partial w_j} \approx 0 $$
We can rewrite the weight update equation as
$$ w_{j,[n+1]} - w_{j,[n]} = - \frac{\gamma \lambda}{m} \, w_{j,[n]} $$
Now the equation can be read as follows: the rate of variation of the weight with respect to the iteration number is proportional to the weight itself. For those of you with knowledge of differential equations, you may realize that we can draw a parallel to the following equation:
$$ \frac{dx(t)}{dt} = - \frac{\gamma \lambda}{m} \, x(t) $$
This can be read as the rate of variation of x(t) with respect to time is proportional to the function itself. For those of you who know how to solve this equation, you may know that a generic solution is
$$ x(t) = A \, e^{-\frac{\gamma \lambda}{m} \left( t - t_0 \right)} $$
By drawing a parallel between the two equations, you can now see why the weight decays in a way similar to an exponential function. In Figure 5-10, you can see the weight decay already discussed, together with a pure exponential decay. The two curves are not identical, as expected, because, especially at the beginning, the gradient of the cost function is surely not zero. But the similarity is remarkable and gives you an idea of how fast the weights can go to zero (read: really fast).
Figure 5-10

Weight $w_{12,5}^{[3]}$ plotted vs. the epochs for our artificial dataset with two features, ℓ2 regularization, γ = 10^-3, λ = 0.1, trained for 1000 epochs (continuous line), together with a pure exponential decay (dashed line), provided for illustrative purposes

Note that when using regularization, you end up with tensors having a lot of zero elements, called sparse tensors. You can then profit from special routines that are extremely efficient with sparse tensors. This is something to keep in mind when you start moving toward more complex models, but it is a subject too advanced for this book and would require too much space.

Dropout

The basic idea of dropout is different: during the training phase, you remove nodes from layer l randomly, with a probability p[l]. In each iteration, you remove different nodes, effectively training a different network at each iteration (when using mini-batches, you train a different network for each batch, for example). Usually, the keep probability (often called keep_prob in Python) is set the same for the entire network (but, technically speaking, it can be layer-specific). Intuitively, let’s consider the output tensor Z of a layer l. In Python, we can define a mask such as
d = np.random.rand(Z.shape[0], Z.shape[1]) < keep_prob
and then simply multiply the layer output Z by d, as follows:
Z = np.multiply(Z, d)

This effectively sets to zero (removes) each element of Z with probability 1 - keep_prob; in other words, each element is kept with probability keep_prob. It is very important that no dropout be used when making predictions on a dev dataset!
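Note that tf.nn.dropout, which we will use below, additionally rescales the kept elements by 1/keep_prob (so-called inverted dropout), so that the expected value of the layer output does not change and no rescaling is needed at prediction time. A self-contained NumPy sketch of this idea (illustrative, not the book’s code):
import numpy as np
np.random.seed(0)
Z = np.random.randn(20, 100)               # stand-in layer output: 20 neurons, 100 observations
keep_prob = 0.5
d = np.random.rand(*Z.shape) < keep_prob   # keep each element with probability keep_prob
Z_drop = np.multiply(Z, d)/keep_prob       # rescale so that the expected activation is unchanged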

Note

During training, dropout removes nodes randomly each iteration. But when doing predictions on a dev dataset, the entire network without dropout must be used. In other words, you must set keep_prob=1.

Dropout can be layer-specific. For example, for layers with many neurons, keep_prob can be small. For layers with a few neurons, one can set keep_prob = 1.0, effectively keeping all neurons in such layers.

The implementation in TensorFlow is easy. First, you define a placeholder that will contain the value of the keep_prob parameter
keep_prob = tf.placeholder(tf.float32, shape=())
and then for each layer, you add a regularization operation in this way:
hidden1, W1, b1 = create_layer (X, n1, activation = tf.nn.relu)
hidden1_drop = tf.nn.dropout(hidden1, keep_prob)
Then, when creating the next layer, instead of using hidden1, you use hidden1_drop. The entire construction code looks like this:
tf.reset_default_graph()
n_dim = 13
n1 = 20
n2 = 20
n3 = 20
n4 = 20
n_outputs = 1
tf.set_random_seed(5)
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
keep_prob = tf.placeholder(tf.float32, shape=())
hidden1, W1, b1 = create_layer (X, n1, activation = tf.nn.relu)
hidden1_drop = tf.nn.dropout(hidden1, keep_prob)
hidden2, W2, b2 = create_layer (hidden1_drop, n2, activation = tf.nn.relu)
hidden2_drop = tf.nn.dropout(hidden2, keep_prob)
hidden3, W3, b3 = create_layer (hidden2_drop, n3, activation = tf.nn.relu)
hidden3_drop = tf.nn.dropout(hidden3, keep_prob)
hidden4, W4, b4 = create_layer (hidden3_drop, n4, activation = tf.nn.relu)
hidden4_drop = tf.nn.dropout(hidden4, keep_prob)
y_, W5, b5 = create_layer (hidden4_drop, n_outputs, activation = tf.identity)
  
cost = tf.reduce_mean(tf.square(y_-Y))
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8).minimize(cost)
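The training calls below pass a keep_prob_val argument. The book does not repeat the training function, but a minimal sketch of how the model() function used earlier could be adapted for dropout (an assumption on my part) is the following; note that the cost is always evaluated with keep_prob = 1.0:
def model(training_epochs, features, target, logging_step = 100,
          learning_r = 0.001, keep_prob_val = 1.0):
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    cost_history = []
    for epoch in range(training_epochs+1):
        # dropout is active only during the training step
        sess.run(optimizer, feed_dict = {X: features, Y: target,
                                         learning_rate: learning_r,
                                         keep_prob: keep_prob_val})
        # for evaluation, the full network is used (keep_prob = 1.0)
        cost_ = sess.run(cost, feed_dict = {X: features, Y: target, keep_prob: 1.0})
        cost_history = np.append(cost_history, cost_)
        if (epoch % logging_step == 0):
            print("Reached epoch", epoch, "cost J =", cost_)
            print("Dev MSE      = ", sess.run(cost, feed_dict = {X: dev_x, Y: dev_y, keep_prob: 1.0}))
    return sess, cost_history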
Now let’s analyze what happens to the cost function when using dropout. Let’s run our model on the Boston dataset for two values of the keep_prob variable: 1.0 (without dropout) and 0.5. In Figure 5-11, you can see that, when applying dropout, the cost function becomes very irregular: it oscillates wildly. The two models have been evaluated with calls like the following
sess, cost_history05 = model(learning_r = 0.01,
                                training_epochs = 5000,
                                features = train_x,
                                target = train_y,
                                logging_step = 1000,
                                keep_prob_val = 1.0)
once with keep_prob_val = 1.0 (as shown) and once with keep_prob_val = 0.5.
Figure 5-11

Cost function for the training dataset for our model with two values of the keep_prob variable: 1.0 (no dropout) and 0.5. The other parameters are γ = 0.01, and the models have been trained for 5000 epochs. No mini-batches have been used. The oscillating line is the one obtained with dropout.

In Figure 5-12, you can see the evolution of the MSE for the training and the dev dataset in the case of dropout (keep_prob=0.4).
Figure 5-12

MSE for the training and dev datasets with dropout (keep_prob=0.4)

In Figure 5-13, you can see the same plot but without dropout. The difference is quite striking. Very interesting is the fact that without dropout, MSEdev grows with epochs, while using dropout, it is rather stable.
Figure 5-13

MSE for the training and dev datasets without dropout (keep_prob=1.0)

In Figure 5-13, MSEdev grows after dropping at the beginning. The model is in a clear, extreme overfitting regime (MSEtrain ≪ MSEdev), and it generalizes worse and worse when applied to new data. In Figure 5-12, you can see how MSEtrain and MSEdev are of the same order of magnitude and MSEdev does not continue to grow. So, we have a model that is a lot better at generalizing than the one whose results are shown in Figure 5-13.

Note

When applying dropout, your metric (in this case, the MSE) will oscillate, so don’t be surprised when trying to find the best hyperparameters, if you see your optimizing metric oscillating.

Early Stopping

There is another technique that is sometimes used to fight overfitting. Strictly speaking, this method does nothing to avoid overfitting; it simply stops the learning before the overfitting problem becomes too bad. Consider the example in the last section. In Figure 5-14, you can see MSEtrain and MSEdev plotted on the same plot.
Figure 5-14

MSE for the training and the dev datasets without dropout (keep_prob=1.0). Early stopping consists in stopping the learning phase at the iteration when the MSEdev is minimum (indicated with a vertical line in the plot). At right, you can see a zoom of the left plot for the first 1000 epochs.

Early stopping simply consists of stopping the training at the point at which MSEdev has its minimum (in Figure 5-14, the minimum is indicated by a vertical line). Note that this is not an ideal way to solve the overfitting problem: your model will still most probably generalize very badly to new data, and I usually prefer to use other techniques. Additionally, when done by hand, this is time-consuming and very error-prone. You can get a good overview of the different application contexts by checking the Wikipedia page for early stopping: https://goo.gl/xnKo2s .
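A minimal sketch of early stopping (not the book’s code), assuming the unregularized Boston network and the train/dev arrays from the beginning of the chapter: train as usual, keep track of the best MSE seen on the dev dataset, and stop when it has not improved for a given number of epochs:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
best_dev_mse, best_epoch, patience = np.inf, 0, 500
for epoch in range(10000+1):
    sess.run(optimizer, feed_dict = {X: train_x, Y: train_y, learning_rate: 0.001})
    dev_mse = sess.run(cost, feed_dict = {X: dev_x, Y: dev_y})
    if dev_mse < best_dev_mse:
        best_dev_mse, best_epoch = dev_mse, epoch    # new best result on the dev dataset
    elif epoch - best_epoch > patience:
        print("Stopping at epoch", epoch, "- best dev MSE", best_dev_mse, "at epoch", best_epoch)
        break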

Additional Methods

All the methods I have discussed so far consist, in one form or another, of making the model less complex. You keep the data as it is and modify your model. But we can also try to do the opposite: leave the model as it is and work on the data. Here are two common strategies for fighting overfitting (although they are not always easy to apply):
  • Get more data. This is the simplest way of fighting overfitting. Unfortunately, very often in real life, this is not possible. Keep in mind that this is a complicated matter that I will discuss at length in the next chapter. If you are classifying cat pictures taken with a smartphone, you may think of getting more data from the Web. Although this may seem a perfectly good idea, you may discover that the images have varying quality, that possibly not all the images are really of cats (what about cat toys?), that you may find only images of young white cats, and so on. Basically, your additional observations may well come from a very different distribution than your original data, and that will be a problem, as you will see. So, when getting additional data, consider the potential problems carefully before proceeding.

  • Augment your data. For example, if you are working with images, you can generate additional ones by rotating, stretching, shifting, etc., your existing images. This is a very common technique that can really be useful (see the short sketch after this list).
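As a toy illustration of the augmentation idea (assumed, not from the book), here are a few simple transformations applied to an image stored as a NumPy array of shape (height, width, channels):
import numpy as np
img = np.random.rand(64, 64, 3)                # stand-in for a real image
flipped = np.fliplr(img)                       # horizontal mirror
shifted = np.roll(img, shift = 5, axis = 1)    # shift 5 pixels to the right
rotated = np.rot90(img)                        # rotate by 90 degrees
augmented = [img, flipped, shifted, rotated]   # four training examples from one image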

Getting a model to generalize better on new data is one of machine learning’s biggest goals. It is a complicated problem that requires experience and tests. Lots of tests. Much research is going on that tries to address these kinds of problems when working on very complex tasks. I will discuss additional techniques in the next chapter.
