Examples of deep convolutional networks with Keras

In the first example, we want to consider again the complete MNIST handwritten digit dataset, but instead of using an MLP, we are going to employ a small deep convolutional network. The first step consists of loading and normalizing the dataset:

import numpy as np

from keras.datasets import mnist
from keras.utils import to_categorical

# Load the MNIST dataset
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

width = height = X_train.shape[1]

# Reshape to (samples, width, height, channels) and normalize the pixel values to [0, 1]
X_train = X_train.reshape((X_train.shape[0], width, height, 1)).astype(np.float32) / 255.0
X_test = X_test.reshape((X_test.shape[0], width, height, 1)).astype(np.float32) / 255.0

# One-hot encode the labels
Y_train = to_categorical(Y_train, num_classes=10)
Y_test = to_categorical(Y_test, num_classes=10)

We can now define the model architecture. The samples are rather small (28 × 28); therefore, it can be helpful to use small kernels. This is not a general rule, and it's useful to also evaluate larger kernels (in particular in the first layers); however, many state-of-the-art architectures have confirmed that large kernels applied to small images can lead to a performance loss. In my personal experiments, I've always obtained the best results when the largest kernels were 8 to 10 times smaller than the image dimensions. Our model is made up of the following layers:

  1. Input dropout 25%.
  2. Convolution with 16 filters, (3 × 3) kernel, strides equal to 1, ReLU activation, and 'same' padding (the default weight initializer is Xavier). Keras implements the Conv2D class, whose main parameters are immediately understandable.
  3. Dropout 50%.
  4. Convolution with 32 filters, (3 × 3) kernel, strides equal to 1, ReLU activation, and 'same' padding.
  5. Dropout 50%.
  6. Average pooling with (2 × 2) pool size and the default strides, equal to the pool size (using the Keras class AveragePooling2D).
  7. Convolution with 64 filters, (3 × 3) kernel, strides equal to 1, ReLU activation, and 'same' padding.
  8. Average pooling with (2 × 2) pool size and the default strides.
  9. Convolution with 64 filters, (3 × 3) kernel, strides equal to 1, ReLU activation, and 'same' padding.
  10. Dropout 50%.
  11. Average pooling with (2 × 2) pool size and the default strides.
  12. Flattening, followed by a fully-connected layer with 1024 ReLU units.
  13. Dropout 50%.
  14. Fully-connected layer with 10 Softmax units.

The goal is to capture the low-level features (horizontal and vertical lines, intersections, and so on) in the first layers and use the pooling layers and all the subsequent convolutions to increase the accuracy when distorted samples are presented. At this point, we can create and compile the model (using the Adam optimizer with η = 0.001 and a decay rate equal to 10⁻⁵):

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Conv2D, AveragePooling2D, Flatten
from keras.optimizers import Adam

model = Sequential()

model.add(Dropout(0.25, input_shape=(width, height, 1), seed=1000))

model.add(Conv2D(16, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(AveragePooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))

model.add(AveragePooling2D(pool_size=(2, 2), padding='same'))

model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(AveragePooling2D(pool_size=(2, 2), padding='same'))

model.add(Flatten())

model.add(Dense(1024))
model.add(Activation('relu'))
model.add(Dropout(0.5, seed=1000))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(optimizer=Adam(lr=0.001, decay=1e-5),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
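
Before training, it can be useful to verify the dimensional flow: with 'same' padding and the default pooling strides (equal to the pool size), the feature maps shrink from 28 × 28 to 14 × 14, 7 × 7, and finally 4 × 4, so the Flatten layer outputs 4 × 4 × 64 = 1024 values, which feed the first fully-connected layer. This can be checked by printing the model summary:

# Print the layer-by-layer output shapes and parameter counts
model.summary()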

We can now proceed to train the model with 200 epochs and a batch size of 256 samples:

history = model.fit(X_train, Y_train,
                    epochs=200,
                    batch_size=256,
                    validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/200
60000/60000 [==============================] - 30s 496us/step - loss: 0.4474 - acc: 0.8531 - val_loss: 0.0993 - val_acc: 0.9693
Epoch 2/200
60000/60000 [==============================] - 20s 338us/step - loss: 0.1497 - acc: 0.9530 - val_loss: 0.0682 - val_acc: 0.9780
Epoch 3/200
60000/60000 [==============================] - 21s 346us/step - loss: 0.1131 - acc: 0.9647 - val_loss: 0.0598 - val_acc: 0.9839

...

Epoch 199/200
60000/60000 [==============================] - 21s 349us/step - loss: 0.0083 - acc: 0.9974 - val_loss: 0.0137 - val_acc: 0.9950
Epoch 200/200
60000/60000 [==============================] - 22s 373us/step - loss: 0.0083 - acc: 0.9972 - val_loss: 0.0143 - val_acc: 0.9950

The final validation accuracy is now 0.9950, which means that only 50 samples (out of 10,000) have been misclassified. To better understand the behavior, we can plot the accuracy and loss diagrams:
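
The diagrams can be produced, for example, with Matplotlib from the history object returned by fit(). The following is a minimal sketch (assuming the metric keys acc and val_acc, which are the ones shown in the training output above):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy (training vs. validation)
ax1.plot(history.history['acc'], label='Training accuracy')
ax1.plot(history.history['val_acc'], label='Validation accuracy')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()
ax1.grid(True)

# Loss (training vs. validation)
ax2.plot(history.history['loss'], label='Training loss')
ax2.plot(history.history['val_loss'], label='Validation loss')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()
ax2.grid(True)

plt.show()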

As it's possible to see, both validation accuracy and loss quickly approach their optimal values. In particular, the validation accuracy is already about 0.97 after the first epoch, and the remaining epochs are needed to improve the performance on those samples whose shapes can lead to confusion (for example, malformed 8s that resemble 0s, or 7s that are very similar to 1s). It's evident that the geometric approach employed by convolutions guarantees much higher robustness than a standard fully-connected network, thanks also to the contribution of the pooling layers, which reduce the variance due to noisy samples.
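
To better understand which samples remain problematic, it's possible to compare the predicted classes with the ground truth on the test set. The following is a minimal sketch (assuming the trained model and the one-hot encoded Y_test defined previously):

# Predict class probabilities on the test set and take the argmax
Y_pred = model.predict(X_test)
predicted_classes = np.argmax(Y_pred, axis=1)
true_classes = np.argmax(Y_test, axis=1)

# Indices of the misclassified samples (about 50 are expected)
misclassified = np.where(predicted_classes != true_classes)[0]
print('Number of misclassified samples: {}'.format(misclassified.shape[0]))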
