Example of dropout with Keras

We can now test the effectiveness of dropout on a more challenging classification problem. The dataset is the classical MNIST handwritten digits dataset; Keras allows downloading and working with the original version, which is made up of 70,000 (60,000 training and 10,000 test) 28 × 28 grayscale images. Even though this is not the best strategy, because a convolutional network should be the first choice for managing images, we want to try to classify the digits considering them as flattened 784-dimensional arrays.

The first step is loading and normalizing the dataset so that each value becomes a float bounded between 0 and 1:

import numpy as np

from keras.datasets import mnist
from keras.utils import to_categorical

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

width = height = X_train.shape[1]

X_train = X_train.reshape((X_train.shape[0], width * height)).astype(np.float32) / 255.0
X_test = X_test.reshape((X_test.shape[0], width * height)).astype(np.float32) / 255.0

Y_train = to_categorical(Y_train, num_classes=10)
Y_test = to_categorical(Y_test, num_classes=10)
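As a quick optional sanity check, we can verify the shapes and value ranges produced by the preprocessing step (the expected values follow directly from the MNIST splits and the normalization above):

print(X_train.shape, X_test.shape, Y_train.shape)
# (60000, 784) (10000, 784) (60000, 10)

print(X_train.min(), X_train.max())
# 0.0 1.0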

At this point, we can start by testing a model without dropout. The structure, which is common to all the experiments, is based on three fully connected ReLU layers (2048-1024-1024) followed by a softmax layer with 10 units. Considering the problem, we can try to train the model using an Adam optimizer with η = 0.0001 and a decay set to 10⁻⁶:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam

model = Sequential()

model.add(Dense(2048, input_shape=(width * height, )))
model.add(Activation('relu'))

model.add(Dense(1024))
model.add(Activation('relu'))

model.add(Dense(1024))
model.add(Activation('relu'))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(optimizer=Adam(lr=0.0001, decay=1e-6),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
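Before training, model.summary() can be used to inspect the layer stack; with this configuration, the network has about 4.77 million trainable parameters:

model.summary()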

The model is trained for 200 epochs with a batch size of 256 samples:

history = model.fit(X_train, Y_train,
                    epochs=200,
                    batch_size=256,
                    validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/200
60000/60000 [==============================] - 11s 189us/step - loss: 0.4026 - acc: 0.8980 - val_loss: 0.1601 - val_acc: 0.9523
Epoch 2/200
60000/60000 [==============================] - 7s 116us/step - loss: 0.1338 - acc: 0.9621 - val_loss: 0.1062 - val_acc: 0.9669
Epoch 3/200
60000/60000 [==============================] - 7s 124us/step - loss: 0.0872 - acc: 0.9744 - val_loss: 0.0869 - val_acc: 0.9732

...

Epoch 199/200
60000/60000 [==============================] - 7s 114us/step - loss: 1.1935e-07 - acc: 1.0000 - val_loss: 0.1214 - val_acc: 0.9838
Epoch 200/200
60000/60000 [==============================] - 7s 116us/step - loss: 1.1935e-07 - acc: 1.0000 - val_loss: 0.1214 - val_acc: 0.9840

Even without further analysis, we can immediately notice that the model is overfitted. After 200 epochs, the training accuracy is 1.0 with a loss close to 0.0, while the validation accuracy is reasonably high, but with a validation loss slightly higher than the one obtained at the end of the second epoch.

To better understand what happened, it's useful to plot both accuracy and loss during the training process:
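A minimal sketch for producing such plots from the History object returned by model.fit() is shown here (in this Keras version the history keys are 'acc' and 'val_acc'; newer releases use 'accuracy' and 'val_accuracy'):

import matplotlib.pyplot as plt

# Accuracy and loss curves collected by Keras during training
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(14, 5))

ax_acc.plot(history.history['acc'], label='Training accuracy')
ax_acc.plot(history.history['val_acc'], label='Validation accuracy')
ax_acc.set_xlabel('Epoch')
ax_acc.set_ylabel('Accuracy')
ax_acc.legend()

ax_loss.plot(history.history['loss'], label='Training loss')
ax_loss.plot(history.history['val_loss'], label='Validation loss')
ax_loss.set_xlabel('Epoch')
ax_loss.set_ylabel('Loss')
ax_loss.legend()

plt.show()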

As it's possible to see, the validation loss reached a minimum during the first 10 epochs and then immediately started to grow again (this is sometimes called a U-curve because of its shape). At the same moment, the training accuracy reached 1.0. From that epoch on, the model started overfitting, learning a perfect representation of the training set but losing its generalization ability. In fact, even if the final validation accuracy is rather high, the loss function indicates a lack of robustness when new samples are presented. As the loss is a categorical cross-entropy, the result can be interpreted as meaning that the model has learned a distribution that partially mismatches the one of the validation set. As our goal is to use the model to predict new samples, this configuration cannot be considered acceptable.

Therefore, we try again, using some dropout layers. As suggested by the authors, we also increase the learning rate to 0.1 (switching to a momentum-based SGD optimizer in order to avoid explosions due to the adaptivity of RMSProp or Adam), initialize the weights with a uniform distribution in (-0.05, 0.05), and impose a maximum norm constraint equal to 2.0. This choice allows the exploration of more sub-configurations without the risk of excessively high weights. Dropout is applied to 25% of the input units and to all the ReLU fully connected layers with a rate of 50%:

from keras.constraints import maxnorm
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD

model = Sequential()

model.add(Dropout(0.25, input_shape=(width * height, ), seed=1000))

model.add(Dense(2048, kernel_initializer='uniform', kernel_constraint=maxnorm(2.0)))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(Dense(1024, kernel_initializer='uniform', kernel_constraint=maxnorm(2.0)))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(Dense(1024, kernel_initializer='uniform', kernel_constraint=maxnorm(2.0)))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(optimizer=SGD(lr=0.1, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

The training process is performed with the same parameters:

history = model.fit(X_train, Y_train,
                    epochs=200,
                    batch_size=256,
                    validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/200
60000/60000 [==============================] - 11s 189us/step - loss: 0.4964 - acc: 0.8396 - val_loss: 0.1592 - val_acc: 0.9511
Epoch 2/200
60000/60000 [==============================] - 6s 97us/step - loss: 0.2300 - acc: 0.9300 - val_loss: 0.1081 - val_acc: 0.9645
Epoch 3/200
60000/60000 [==============================] - 6s 93us/step - loss: 0.1867 - acc: 0.9435 - val_loss: 0.0941 - val_acc: 0.9713

...

Epoch 199/200
60000/60000 [==============================] - 6s 99us/step - loss: 0.0184 - acc: 0.9939 - val_loss: 0.0473 - val_acc: 0.9884
Epoch 200/200
60000/60000 [==============================] - 6s 101us/step - loss: 0.0190 - acc: 0.9941 - val_loss: 0.0484 - val_acc: 0.9883

The final situation is dramatically different. The model is no longer overfitted (even though it's possible to improve it further in order to increase the validation accuracy) and the final validation loss is lower than the initial one. As a confirmation, let's analyze the accuracy/loss plots:

The result shows some imperfections, because the validation loss is almost flat for many epochs; however, the same model, with a higher learning rate and a simpler optimization algorithm (SGD with momentum), achieved a better final performance (0.988 validation accuracy) and a superior generalization ability. State-of-the-art models can also reach a validation accuracy equal to 0.995, but our goal was to show the effect of dropout layers in preventing overfitting and, moreover, in yielding a final configuration that is much more robust to new or noisy samples. I invite the reader to repeat the experiment with different parameters, bigger or smaller networks, and other optimization algorithms, trying to further reduce the final validation loss.

Keras also implements two additional dropout layers. The first one is GaussianDropout, which multiplies the input samples by a one-centered Gaussian noise n ∼ N(1, ρ/(1 − ρ)), whose variance is σ² = ρ/(1 − ρ).

The value of the constant ρ can be set through the parameter rate (bounded between 0 and 1). When ρ → 1, σ² → ∞, while small values yield a null effect, as n ≈ 1. This layer can be very useful as an input layer, in order to simulate a random data augmentation process. The other class is AlphaDropout, which works like the previous one, but renormalizes the output to keep the original mean and variance (this effect is very similar to the one obtained by employing the technique described in the next paragraph together with noisy layers).
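As a minimal sketch (the layer stack and rates are purely illustrative, not the model used above), both classes can be plugged into a Sequential model exactly like the standard Dropout layer:

from keras.models import Sequential
from keras.layers import Dense, Activation, GaussianDropout, AlphaDropout

noisy_model = Sequential()

# Multiplicative Gaussian noise on the input simulates a random data augmentation process
noisy_model.add(GaussianDropout(0.2, input_shape=(width * height, )))

noisy_model.add(Dense(1024))
noisy_model.add(Activation('selu'))
# AlphaDropout renormalizes its output to keep the original mean and variance
# (it is normally paired with SELU activations)
noisy_model.add(AlphaDropout(0.1, seed=1000))

noisy_model.add(Dense(10))
noisy_model.add(Activation('softmax'))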

When working with probabilistic layers (such as dropout), I always suggest setting the random seed (np.random.seed(...) and tf.set_random_seed(...) when the TensorFlow backend is used). In this way, it's possible to repeat the experiments and compare the results without any bias. If the random seed is not explicitly set, every new training process will be different and it won't be easy to compare the performance, for example, after a fixed number of epochs.
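For example, a minimal sketch (the seed value is arbitrary; with TensorFlow 2.x, tf.random.set_seed() replaces tf.set_random_seed()):

import numpy as np
import tensorflow as tf

# Fix the seeds before building and training the model so that
# weight initialization and dropout masks are reproducible
np.random.seed(1000)
tf.set_random_seed(1000)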