Example of a deep convolutional network with Keras and data augmentation

In this example, we are going to use the Fashion MNIST dataset, which was freely provided by Zalando as a more difficult replacement for the standard MNIST dataset. In this case, instead of handwritten digits, there are greyscale photos of different articles of clothing. An example of a few samples is shown in the following screenshot:

In this case, we want to employ a utility class provided by Keras (ImageDataGenerator) to create a data-augmented sample set and improve the generalization ability of the deep convolutional network. This class allows us to apply transformations (such as standardization, random rotations, shifts, flips, zooms, shears, and so on) and output the samples using a Python generator (with an infinite loop). Let's start by loading the dataset (we don't need to standardize it, as this transformation is performed by the generator):

from keras.datasets import fashion_mnist

(X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()

At this point, we can create the generators, selecting the transformations that best suit our case. As the dataset is rather standard (all the samples are represented only in a few positions), we've decided to augment the dataset by applying a sample-wise standardization (which doesn't rely on the entire dataset), horizontal flips, zooming, small rotations, and small shears. This choice has been made according to an objective analysis, but I suggest the reader repeat the experiment with different parameters (for example, adding whitening, vertical flips, horizontal/vertical shifting, and extended rotations). Of course, increasing the augmentation variability requires larger processed sets. In our case, we are going to use 384,000 training samples per epoch (1,500 batches of 256 samples, while the original size is 60,000), but larger values can be employed to train deeper networks:

import numpy as np

from keras.preprocessing.image import ImageDataGenerator
from keras.utils import to_categorical

nb_classes = 10
train_batch_size = 256
test_batch_size = 100

train_idg = ImageDataGenerator(rescale=1.0 / 255.0,
                               samplewise_center=True,
                               samplewise_std_normalization=True,
                               horizontal_flip=True,
                               rotation_range=10.0,
                               shear_range=np.pi / 12.0,
                               zoom_range=0.25)

train_dg = train_idg.flow(x=np.expand_dims(X_train, axis=3),
                          y=to_categorical(Y_train, num_classes=nb_classes),
                          batch_size=train_batch_size,
                          shuffle=True,
                          seed=1000)

test_idg = ImageDataGenerator(rescale=1.0 / 255.0,
                              samplewise_center=True,
                              samplewise_std_normalization=True)

test_dg = test_idg.flow(x=np.expand_dims(X_test, axis=3),
                        y=to_categorical(Y_test, num_classes=nb_classes),
                        shuffle=False,
                        batch_size=test_batch_size,
                        seed=1000)
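
To make the preprocessing performed by the generators more concrete, the following NumPy-only sketch reproduces the rescaling, the sample-wise standardization, the channel expansion, and the one-hot encoding on a small synthetic batch. The random array X, the labels Y, and the constant eps are assumptions introduced only for illustration; this is an approximation of what the generator options do, not the Keras implementation:

```python
import numpy as np

rng = np.random.default_rng(1000)

# Hypothetical stand-in for a few Fashion MNIST images: uint8, shape (n, 28, 28)
X = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
Y = np.array([0, 6, 9, 3])

# Add the channel axis expected by Conv2D: (n, 28, 28) -> (n, 28, 28, 1)
X4 = np.expand_dims(X, axis=3).astype(np.float32)

# rescale=1/255 maps the pixel values into [0, 1]
X4 *= 1.0 / 255.0

# Sample-wise centering and scaling: each image uses only its own statistics,
# so no pass over the entire dataset is needed
eps = 1e-6  # assumed small constant to avoid a division by zero
X4 -= X4.mean(axis=(1, 2, 3), keepdims=True)
X_std = X4 / (X4.std(axis=(1, 2, 3), keepdims=True) + eps)

# One-hot encoding, similar in spirit to to_categorical(Y, num_classes=10)
Y_onehot = np.eye(10, dtype=np.float32)[Y]

print(X_std.shape, Y_onehot.shape)
print(X_std.mean(axis=(1, 2, 3)), X_std.std(axis=(1, 2, 3)))
```

After this transformation, every image has approximately zero mean and unit standard deviation, independently of the other samples.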

Once an image data generator has been initialized, its flow() method must be called, specifying the input dataset and the desired batch size (the output of this operation is the actual Python generator). The test image generator is deliberately kept without transformations, except for rescaling and standardization, in order to avoid validating on a dataset drawn from a different distribution. At this point, we can create and compile our network, using 2D convolutions based on Leaky ReLU activations (using the LeakyReLU class, which replaces the standard layer Activation), batch normalizations, and max poolings:

from keras.models import Sequential
from keras.layers import Activation, Dense, Flatten, LeakyReLU, Conv2D, MaxPooling2D, BatchNormalization
from keras.optimizers import Adam

model = Sequential()

model.add(Conv2D(filters=32,
                 kernel_size=(3, 3),
                 padding='same',
                 input_shape=(X_train.shape[1], X_train.shape[2], 1)))

model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(Conv2D(filters=64,
                 kernel_size=(3, 3),
                 padding='same'))

model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(filters=64,
                 kernel_size=(3, 3),
                 padding='same'))

model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(Conv2D(filters=128,
                 kernel_size=(3, 3),
                 padding='same'))

model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(Conv2D(filters=128,
                 kernel_size=(3, 3),
                 padding='same'))

model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())

model.add(Dense(units=1024))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(Dense(units=1024))
model.add(BatchNormalization())
model.add(LeakyReLU(alpha=0.1))

model.add(Dense(units=nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.0001, decay=1e-5),
              metrics=['accuracy'])
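
As a quick sanity check of the architecture, the following sketch traces the shape of the feature maps through the convolutional part of the model. It assumes 28 × 28 single-channel inputs, unit strides for the convolutions, and the default stride for the 2 × 2 max poolings; the function trace_shapes() is a hypothetical helper and not part of Keras:

```python
def trace_shapes(height=28, width=28):
    # (layer kind, output channels), in the same order as the Sequential model
    layers = [("conv", 32), ("conv", 64), ("pool", None),
              ("conv", 64), ("conv", 128), ("conv", 128), ("pool", None)]

    h, w, c = height, width, 1
    shapes = []

    for kind, filters in layers:
        if kind == "conv":
            # padding='same' with stride 1 keeps height and width unchanged
            c = filters
        else:
            # 2x2 max pooling (assumed stride 2) halves height and width
            h, w = h // 2, w // 2
        shapes.append((kind, (h, w, c)))

    # Number of units produced by Flatten()
    return shapes, h * w * c


shapes, flat_units = trace_shapes()

for kind, shape in shapes:
    print(kind, shape)

print("Flatten output units:", flat_units)
```

Under these assumptions, the two pooling layers reduce the spatial size from 28 × 28 to 7 × 7, so the first dense layer receives 7 × 7 × 128 values per sample. The same information can be obtained from the compiled network with model.summary().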

All the batch normalizations are applied to the linear transformation before the activation function. Considering the additional complexity, we are also going to use a callback, which is a class that Keras uses in order to perform in-training operations. In our case, we want to reduce the learning rate when the validation loss stops improving. The specific callback is called ReduceLROnPlateau and it's tuned in order to reduce η by multiplying it by 0.1 (after a number of epochs without improvement equal to the value of the patience parameter), with a cooldown period (the number of epochs to wait after a reduction before resuming normal monitoring) of 1 epoch and a minimum η = 10^-6. The training method is now fit_generator(), which accepts Python generators instead of finite datasets, together with the number of iterations per epoch (all the other parameters are the same as implemented by fit()):

from keras.callbacks import ReduceLROnPlateau

nb_epochs = 100
steps_per_epoch = 1500

history = model.fit_generator(generator=train_dg,
                              epochs=nb_epochs,
                              steps_per_epoch=steps_per_epoch,
                              validation_data=test_dg,
                              validation_steps=int(X_test.shape[0] / test_batch_size),
                              callbacks=[
                                 ReduceLROnPlateau(factor=0.1, patience=1, cooldown=1, min_lr=1e-6)
                              ])

Epoch 1/100
1500/1500 [==============================] - 471s 314ms/step - loss: 0.3457 - acc: 0.8722 - val_loss: 0.2863 - val_acc: 0.8952
Epoch 2/100
1500/1500 [==============================] - 464s 309ms/step - loss: 0.2325 - acc: 0.9138 - val_loss: 0.2721 - val_acc: 0.8990
Epoch 3/100
1500/1500 [==============================] - 460s 307ms/step - loss: 0.1929 - acc: 0.9285 - val_loss: 0.2522 - val_acc: 0.9112

...

Epoch 99/100
1500/1500 [==============================] - 449s 299ms/step - loss: 0.0438 - acc: 0.9859 - val_loss: 0.2142 - val_acc: 0.9323
Epoch 100/100
1500/1500 [==============================] - 449s 299ms/step - loss: 0.0443 - acc: 0.9857 - val_loss: 0.2136 - val_acc: 0.9339
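
The effect of a plateau-based learning rate reduction can be illustrated with a small, pure-Python simulation. This is only a simplified sketch of the idea (the actual Keras callback may differ in some details); the function simulate_reduce_lr(), the min_delta threshold, and the sequence of validation losses are hypothetical and are not taken from the training run shown previously:

```python
def simulate_reduce_lr(val_losses, lr=1e-4, factor=0.1, patience=1,
                       cooldown=1, min_lr=1e-6, min_delta=1e-4):
    best = float("inf")
    wait = 0
    cooldown_counter = 0
    schedule = []

    for loss in val_losses:
        # During the cooldown, non-improving epochs are not counted
        if cooldown_counter > 0:
            cooldown_counter -= 1
            wait = 0

        if loss < best - min_delta:
            # The monitored value has improved: reset the counter
            best = loss
            wait = 0
        elif cooldown_counter == 0:
            wait += 1
            if wait >= patience and lr > min_lr:
                # Plateau detected: shrink the learning rate, never below min_lr
                lr = max(lr * factor, min_lr)
                cooldown_counter = cooldown
                wait = 0

        schedule.append(lr)

    return schedule


# Hypothetical validation losses, one per epoch
val_losses = [0.30, 0.28, 0.29, 0.27, 0.28, 0.28, 0.28]

for epoch, lr in enumerate(simulate_reduce_lr(val_losses), start=1):
    print(epoch, lr)
```

With a small patience, the learning rate decays very quickly once the validation loss stops improving, and it is never reduced below min_lr.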

In this case, the complexity is higher and the result is not as accurate as the one obtained with the standard MNIST dataset. The accuracy and loss plots are shown in the following graph:

The loss plot doesn't show a U-curve, but it seems that there are no real improvements starting from the 20th epoch. This is also confirmed by the validation accuracy plot, which continues oscillating between 0.935 and about 0.94. On the other hand, the training loss hasn't reached its minimum (nor has the training accuracy reached its maximum), mainly because of the batch normalizations. However, considering several benchmarks, the result is not bad (even if state-of-the-art models can reach a validation accuracy of about 0.96). I suggest that the reader try different configurations (with and without dropout and other activations) based on deeper architectures with larger training sets. This example offers many chances to practice with this kind of model, as the complexity is not so high as to require dedicated hardware, but at the same time, there are many ambiguities (for example, between shirts and t-shirts) that can reduce the generalization ability.
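
One way to check how much the ambiguities between similar classes affect the result is to compute a confusion matrix. The following NumPy sketch uses hypothetical true and predicted labels (not results obtained from the model) and relies on the Fashion MNIST label convention, where class 0 is T-shirt/top and class 6 is Shirt; the helper confusion_matrix() is defined here only for illustration:

```python
import numpy as np


def confusion_matrix(y_true, y_pred, nb_classes=10):
    # Rows are the true classes, columns are the predicted classes
    cm = np.zeros((nb_classes, nb_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


# Hypothetical labels used only to illustrate the computation
y_true = np.array([0, 0, 0, 6, 6, 6, 9, 9])
y_pred = np.array([0, 6, 0, 0, 6, 6, 9, 9])

cm = confusion_matrix(y_true, y_pred)

print("T-shirts predicted as shirts:", cm[0, 6])
print("Shirts predicted as t-shirts:", cm[6, 0])
```

With a trained network, y_pred could be obtained by applying argmax(axis=1) to the predicted class probabilities, and large off-diagonal values would highlight the pairs of classes that are confused most often.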
