Building a convolutional neural network to classify images in the Zalando Research dataset, using Keras

In this section, we'll be building our convolutional neural network to classify images of clothing, using Zalando Research's fashion dataset. The repository for this dataset is available at https://github.com/zalandoresearch/fashion-mnist.

This dataset contains 70,000 grayscale images, each depicting one of 10 kinds of clothing article. Specifically, the target classes are as follows: T-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot.

Zalando, a Germany-based e-commerce company, released this dataset to provide researchers with an alternative to the classic MNIST dataset of handwritten digits. Additionally, this dataset, which they call Fashion MNIST, is somewhat harder to classify well: the MNIST handwritten-digits dataset can be predicted with 99.7% accuracy without extensive preprocessing or particularly deep neural networks.

So, let's get started! Follow these steps:

Clone the repository to our desktop. From the terminal, run the following:

cd ~/Desktop/
git clone git@github.com:zalandoresearch/fashion-mnist.git

If you haven't done so already, please install Keras by running pip install keras from the command line. We'll also need to install TensorFlow. To do this, run pip install tensorflow from the command line.

Import the libraries we'll be using:

import sys
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPool2D
from keras.utils import np_utils, plot_model
from PIL import Image
import matplotlib.pyplot as plt

Many of these libraries should look familiar by now. However, for some of you, this may be your first time using Keras. Keras is a popular Python deep learning library. It's a wrapper that can run on top of machine learning frameworks such as TensorFlow, CNTK, or Theano.

For our project, Keras will be running TensorFlow under the hood. Using TensorFlow directly would allow us more explicit control of the behavior of our networks; however, because TensorFlow uses dataflow graphs to represent its operations, this can take some getting used to. Luckily for us, Keras abstracts a lot of this away and its API is a breeze to learn for those comfortable with sklearn.

The only other library that may be new to some of you here will be the Python Imaging Library (PIL). PIL provides certain image-manipulation functionalities. We'll use it to visualize our Keras network's topology.

Load in the data. Zalando has provided us with a helper script that does the loading in for us. We just have to make sure that fashion-mnist/utils/ is in our path:

sys.path.append('/Users/Mike/Desktop/fashion-mnist/utils/')
import mnist_reader
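The hard-coded /Users/Mike/... paths will only work on one machine. Here's a minimal sketch of one way to build the same locations portably, assuming you cloned the repository to your desktop as in the earlier step. The names repo_root, utils_dir, and data_dir are introduced here for illustration only:

```python
import sys
from pathlib import Path

# Assumption: the repository was cloned to ~/Desktop/fashion-mnist.
repo_root = Path.home() / "Desktop" / "fashion-mnist"
utils_dir = repo_root / "utils"            # holds mnist_reader.py
data_dir = repo_root / "data" / "fashion"  # holds the image and label files

# Make the helper script importable without hard-coding a user name.
if str(utils_dir) not in sys.path:
    sys.path.append(str(utils_dir))

print(utils_dir)
print(data_dir)
```

You could then pass str(data_dir) to mnist_reader.load_mnist in place of the literal path.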

Load in the data using the helper script:

X_train, y_train = mnist_reader.load_mnist('/Users/Mike/Desktop/fashion-mnist/data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('/Users/Mike/Desktop/fashion-mnist/data/fashion', kind='t10k')

Take a look at the shapes of X_train, X_test, y_train, and y_test:

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Running that code gives us the following output:

Here, we can see our training set contains 60,000 images and our test set contains 10,000 images. Each image is currently a vector of 784 values. Let's now check the data types:

print(type(X_train))
print(type(y_train))
print(type(X_test))
print(type(y_test))

This returns the following:

Next, let's see what the data looks like. Remember, in its current form, each image is a vector of values. We know the images are grayscale, so to visualize each image, we'll have to reshape these vectors into a 28 x 28 matrix. Let's do this and peek at the first image:

image_1 = X_train[0].reshape(28,28)
plt.axis('off')
plt.imshow(image_1, cmap='gray');

This generates the following output:

Awesome! We can check to see the class this image belongs to by running the following:

y_train[0]

This generates the following output:

The classes are encoded from 0-9. In the README, Zalando provides us with the mapping:

Given this, we now know our first image is of an ankle boot. Sweet! Let's create an explicit mapping of these encoded values to their class names. This will come in handy momentarily:

mapping = {0: "T-shirt/top", 1:"Trouser", 2:"Pullover", 3:"Dress", 
 4:"Coat", 5:"Sandal", 6:"Shirt", 7:"Sneaker", 8:"Bag", 9:"Ankle Boot"}

Great. We've seen a single image, but we still need to get a feel for what's in our data. What do the images look like? Getting a grasp of this will tell us certain things. As an example, I'm interested to see how visually distinct the classes are. Classes that look similar to other classes will be harder for a classifier to differentiate than classes that are more unique.

Here, we define a helper function to help us through our visualization journey:

def show_fashion_mnist(plot_rows, plot_columns, feature_array, target_array, cmap='gray', random_seed=None):
    '''Generates a plot_rows * plot_columns grid of randomly selected images from a feature
    array. Sets the title of each subplot to the class of the associated entry in the target
    array, decoded (i.e. the title is in plain English, not numeric). Takes as optional args
    a color map and a random seed. Meant for EDA.'''

    # Grabs plot_rows*plot_columns indices at random from feature_array.
    if random_seed is not None:
        np.random.seed(random_seed)
        
    feature_array_indices = np.random.randint(0,feature_array.shape[0], size = plot_rows*plot_columns)
    
    # Creates our plots
    fig, ax = plt.subplots(plot_rows, plot_columns, figsize=(18,18))
    
    reshaped_images_list = []

    for feature_array_index in feature_array_indices:
        # Reshapes our images, appends tuple with reshaped image and class to a reshaped_images_list.
        reshaped_image = feature_array[feature_array_index].reshape((28,28))
        image_class = mapping[target_array[feature_array_index]]
        reshaped_images_list.append((reshaped_image, image_class))
    
    # Plots each image in reshaped_images_list to its own subplot
    counter = 0
    for row in range(plot_rows):
        for col in range(plot_columns):
            ax[row,col].axis('off')
            ax[row, col].imshow(reshaped_images_list[counter][0], 
                                cmap=cmap)
            ax[row, col].set_title(reshaped_images_list[counter][1])
            counter +=1

What does this function do? It creates a grid of images selected at random from the data so that we can view multiple images simultaneously.

It takes as arguments the desired number of image rows (plot_rows) and image columns (plot_columns), our X_train (feature_array), and our y_train (target_array), and generates a grid of images with plot_rows rows and plot_columns columns. As optional arguments, you can specify a cmap, or colormap (the default is 'gray' because these are grayscale images), and a random_seed, if replicating the visualization is important.

Let's see how to run this, as follows:

show_fashion_mnist(4,4, X_train, y_train, random_seed=72)

This returns the following:

Visualization output

Remove the random_seed argument and rerun this function several times. Specifically, run the following code:

show_fashion_mnist(4,4, X_train, y_train)

You may have noticed that at this resolution some classes look quite similar and others quite distinct. For example, samples of the T-shirt/top target class can look very similar to samples from the shirt and coat target classes, whereas the sandal target class seems to be quite different from the rest. This is food for thought when thinking about where our model may be weak versus where it's likely to be strong.

Now let's take a peek at the distribution of target classes in our dataset. Will we have to do any upsampling or downsampling? Let's check:

y = pd.Series(np.concatenate((y_train, y_test)))
plt.figure(figsize=(10,6))
plt.bar(x=[mapping[x] for x in y.value_counts().index], height = y.value_counts());
plt.xlabel("Class")
plt.ylabel("Number of Images per Class")
plt.title("Distribution of Target Classes");

Running the preceding code generates the following plot:

Awesome! No class-balancing to do here.
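If you'd like a number to go with the bar chart, here's a minimal sketch of a numeric balance check. The labels array below is synthetic stand-in data built to mimic a perfectly balanced set of 70,000 labels; with the real data, you would pass np.concatenate((y_train, y_test)) instead:

```python
import numpy as np

# Stand-in labels: 10 classes with 7,000 samples each, shuffled.
rng = np.random.default_rng(0)
labels = rng.permutation(np.repeat(np.arange(10), 7000))

# Count how many samples fall into each class code 0-9.
counts = np.bincount(labels, minlength=10)

# A ratio of 1.0 means the largest and smallest classes are the same size.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(imbalance_ratio)
```

A ratio well above 1 would be a hint to consider upsampling or downsampling.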

Next, let's start preprocessing our data to get it ready for modeling.

As we discussed in our Image-feature extraction section, these grayscale images contain pixel values ranging from 0 to 255. We confirm this by running the following code:

print(X_train.max())
print(X_train.min())
print(X_test.max())
print(X_test.min())

This returns the following values:

For the purposes of modeling, we're going to want to normalize these values on a 0–1 scale. This is a common preprocessing step when preparing image data for modeling. Keeping our values in this range will allow our neural network to converge more quickly. We can normalize the data by running the following:

# First we cast as float
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# Then normalize
X_train /= 255
X_test /= 255
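Why cast before dividing? The sketch below uses a tiny made-up array (pixels) rather than the real data, and assumes the raw pixel values are stored as unsigned 8-bit integers. It shows that NumPy refuses an in-place division on an integer array, while the cast-then-divide pattern works:

```python
import numpy as np

# Tiny stand-in for raw pixel data: unsigned 8-bit integers in [0, 255].
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# In-place true division on an integer array is rejected by NumPy,
# because the float result cannot be stored back into uint8.
try:
    raw = pixels.copy()
    raw /= 255
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Casting first gives the array room to hold fractional values.
scaled = pixels.astype('float32')
scaled /= 255

print(in_place_failed)
print(scaled.min(), scaled.max())
```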

Our data is now scaled from 0.0 to 1.0. We can confirm this by running the following code:

print(X_train.max())
print(X_train.min())
print(X_test.max())
print(X_test.min())

This returns the following output:

The next preprocessing step we'll need to perform before running our first Keras network will be to reshape our data. Remember, the shapes of our X_train and X_test are currently (60000, 784) and (10000, 784), respectively. Our images are still vectors. For us to convolve these lovely kernels all over the image, we'll need to reshape them into their 28 x 28 matrix form. Additionally, Keras requires that we explicitly declare the number of channels for our data. Accordingly, when we reshape these grayscale images for modeling, we'll declare 1:

X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
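To convince ourselves that this reshape only changes the layout and not the pixel values, here's a small sketch on a made-up batch (flat_batch is an illustrative stand-in, not the real data):

```python
import numpy as np

# Stand-in batch: 5 flattened "images" of 784 values each.
flat_batch = np.arange(5 * 784, dtype='float32').reshape(5, 784)

# Same call pattern as above: (samples, height, width, channels).
image_batch = flat_batch.reshape(flat_batch.shape[0], 28, 28, 1)

# Row-major reshape keeps pixel order: flat index i maps to (i // 28, i % 28).
i = 100
same_pixel = image_batch[2, i // 28, i % 28, 0] == flat_batch[2, i]

print(image_batch.shape)
print(same_pixel)
```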

Lastly, we'll one-hot encode our y vectors to conform with the target shape requirements of Keras:

y_train = np_utils.to_categorical(y_train, 10)
y_test = np_utils.to_categorical(y_test, 10)
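To make one-hot encoding concrete, here's a NumPy-only sketch of the transformation. It illustrates what the encoding produces; it isn't Keras's own implementation, and the labels array is made up for the example:

```python
import numpy as np

# Illustrative integer labels (class codes 0-9).
labels = np.array([9, 0, 3])
num_classes = 10

# Row k of the identity matrix is the one-hot vector for class k.
one_hot = np.eye(num_classes, dtype='float32')[labels]

# argmax along each row recovers the original integer codes.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)
print(recovered)
```

The argmax step is worth remembering: it's how you get back to class codes from one-hot targets or from predicted probabilities.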

We're now ready for modeling. Our first network will have eight hidden layers. The first six hidden layers will consist of alternating convolutional and max pooling layers. We'll then flatten the output of this network and feed that into a two-layer feedforward neural network before generating our predictions. Here's what this looks like, in code:

model = Sequential()
model.add(Conv2D(filters = 35, kernel_size=(3,3), input_shape=(28,28,1), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 35, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 45, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Let's describe what's happening on each line in some depth:

Line 1: Here, we just instantiate our model object. We'll further define the architecture—that is, the number of layers—sequentially with a series of .add() method calls that follow. This is the beauty of the Keras API.
Line 2: Here, we add our first convolutional layer. We specify 35 kernels, each 3 x 3 in size. After this, we specify the image input shape, 28 x 28 x 1. We only have to specify the input shape in the first .add() call of our network. Lastly, we specify our activation function as relu. Activation functions transform the output of a layer before it's passed into the next layer. We'll apply activation functions to our Conv2D and Dense layers. These transformations have many important properties. Using relu here speeds up the convergence of our network (see http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf), and relu, relative to alternative activation functions, isn't expensive to compute: we're just transforming negative values to 0 and keeping all positive values unchanged. Mathematically, the relu function is given by max(0, value). For the purpose of this chapter, we'll stick to the relu activation for every layer but the output layer.
Line 3: Here, we add our first max pooling layer. We specify that the window size of this layer will be 2 x 2.
Line 4: This is our second convolutional layer. We set it up just as we set up the first convolutional layer.
Line 5: This is the second max pooling layer. We set this layer up just as we set up the first max pooling layer.
Line 6: This is our third and final convolutional layer. This time, we add additional filters (45 versus the 35 in previous layers). This is just a hyperparameter, and I encourage you to try multiple variations of this.
Line 7: This is the third and final max pooling layer. It's configured the same as all max pooling layers that came before it.
Line 8: Here's where we flatten the output of our convolutional neural network.
Line 9: Here's the first layer of our fully-connected network. We specify 64 neurons in this layer and a relu activation function.
Line 10: Here's the second layer of our fully-connected network. We specify 32 neurons for this layer and a relu activation function.
Line 11: This is our output layer. We specify 10 neurons, equal to the number of target classes in our data. Since this is a multi-class classification problem, we specify a softmax activation function. The output will represent the predicted probability of the image belonging to classes 0–9. These probabilities will sum to 1. The highest predicted probability of the 10 will represent the class our model believes to be the most likely class.
Line 12: Here's where we compile our Keras model. In the compile step, we specify our optimizer, Adam, a gradient-descent algorithm that automatically adapts its learning rate. We specify our loss function, in this case categorical cross entropy, because this is a multi-class classification problem. Lastly, for the metrics argument, we specify accuracy. By specifying this, Keras will inform us of our train and validation accuracy for each epoch that our model runs.
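The relu and softmax activations and the categorical cross entropy loss described above are simple enough to sketch in plain NumPy. The functions below are illustrative re-implementations, not the code Keras runs internally, and the scores and target values are made up:

```python
import numpy as np

def relu(values):
    """Element-wise max(0, value)."""
    return np.maximum(0, values)

def softmax(logits):
    """Turns a vector of raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -np.sum(one_hot_target * np.log(probabilities))

# Illustrative raw scores for a 3-class problem.
scores = np.array([-2.0, 0.5, 3.0])
probs = softmax(scores)
target = np.array([0.0, 0.0, 1.0])      # true class is index 2

print(relu(scores))
print(probs, probs.sum())
print(categorical_crossentropy(target, probs))
```

Notice that the loss shrinks toward 0 as the probability assigned to the true class approaches 1.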

We can get a summary of our model by running the following:

model.summary()

This outputs the following:

Notice how the output shapes change as the data passes through the model. Specifically, look at the shape of our output after the flattening occurs—just 45 features. The raw data in X_train and X_test consisted of 784 features per row, so this is fantastic!
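We can reproduce that figure of 45 by hand. The sketch below assumes the Keras defaults for these layers: convolutions with a stride of 1 and no padding, and pooling windows that don't overlap. The two helper functions are hypothetical names introduced for this illustration:

```python
def conv_output_size(size, kernel_size):
    """Spatial size after a stride-1 convolution with no padding."""
    return size - kernel_size + 1

def pool_output_size(size, pool_size):
    """Spatial size after non-overlapping pooling (any remainder is dropped)."""
    return size // pool_size

size = 28
filter_counts = [35, 35, 45]   # filters in each of our three Conv2D layers
for filters in filter_counts:
    size = conv_output_size(size, 3)
    print("after conv:", (size, size, filters))
    size = pool_output_size(size, 2)
    print("after pool:", (size, size, filters))

# Flatten multiplies height, width, and the final number of filters.
flattened_features = size * size * filter_counts[-1]
print("flattened:", flattened_features)
```

The spatial size shrinks from 28 to 26, 13, 11, 5, 3, and finally 1, so the flattened output is 1 x 1 x 45 values.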

You'll need to install pydot to render the visualization. To install it, run pip install pydot from the terminal; pydot in turn needs the Graphviz binaries to be installed on your system. You may need to restart your kernel for the install to take effect.

Using the plot_model function in Keras, we can visualize the topology of our network differently. To do this, run the following code:

plot_model(model, to_file='Conv_model1.png', show_shapes=True)
Image.open('Conv_model1.png')

Running the preceding code saves the topology to Conv_model1.png and generates the following:

This model will take several minutes to fit. If you have concerns about your system's hardware specs, you can easily reduce the training time by reducing the number of epochs to 10.

Running the following code block will fit the model:

my_fit_model = model.fit(X_train, y_train, epochs=25, validation_data=
                        (X_test, y_test))

In the fit step, we specify our X_train and y_train. We then specify the number of epochs we'd like to train the model. Then we plug in the validation data—X_test and y_test—to observe our model's out-of-sample performance. I like to save the model.fit step as a variable, my_fit_model, so we can later easily visualize the training and validation losses over epochs.

As the code runs, you'll see the model's train and validation loss, and accuracy after each epoch. Let's plot our model's train loss and validation loss using the following code:

plt.plot(my_fit_model.history['val_loss'], label="Validation")
plt.plot(my_fit_model.history['loss'], label = "Train")
plt.xlabel("Epoch", size=15)
plt.ylabel("Cat. Crossentropy Loss", size=15)
plt.title("Conv Net Train and Validation loss over epochs", size=18)
plt.legend();

Running the preceding code generates the following plot. Your plot won't be identical—there are several stochastic processes taking place here—but it should look roughly the same:

A quick glance at this plot shows us that our model is overfitting. We see our train loss continue to fall in every epoch, but the validation loss doesn't move in lockstep. Let's glance at our accuracy scores to grasp how well this model did at the classification task. We can do this by running the following code:

plt.plot(my_fit_model.history['val_acc'], label="Validation")
plt.plot(my_fit_model.history['acc'], label = "Train")
plt.xlabel("Epoch", size=15)
plt.ylabel("Accuracy", size=15)
plt.title("Conv Net Train and Validation accuracy over epochs", 
           size=18)
plt.legend();

This generates the following:

This plot, too, tells us we've overfit. But it appears as though our validation accuracy is in the high 80s, which is great! To get the max accuracy our model achieved and the epoch in which it occurred, we can run the following code:

print(max(my_fit_model.history['val_acc']))
print(my_fit_model.history['val_acc'].index(max(my_fit_model.history['val_acc'])))

Your specific results will differ from mine, but here's my output:

Using our convolutional neural network, we achieved a max classification accuracy of 89.48% in the 21st epoch. This is amazing! But we've still got to address that overfitting problem. Next, we'll rebuild our model using dropout regularization.

Dropout regularization is a form of regularization we can apply to the fully-connected layers of our neural network. Using dropout regularization, we randomly drop neurons and their connections from the network during training. By doing this, the network doesn't become too reliant on the weights or biases associated with any specific node, allowing it to generalize better out of sample.
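Here's a minimal NumPy sketch of the idea, using a common formulation known as inverted dropout: a random subset of units is zeroed, and the survivors are scaled up so the expected activation stays the same. This is an illustration under those assumptions, not Keras's internal code, and the activations array is a made-up stand-in for a layer's output:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction of units, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
activations = np.ones((1000, 64))       # stand-in layer output
dropped = dropout(activations, rate=0.35, rng=rng)

fraction_zeroed = np.mean(dropped == 0)
print(fraction_zeroed)    # close to the requested rate
print(dropped.mean())     # stays near the original mean of 1.0
```

Dropout is only applied during training; at prediction time, every unit is used.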

Here, we add dropout regularization, specifying that we'd like to drop 35% of the neurons at each Dense layer:

model = Sequential()
model.add(Conv2D(filters = 35, kernel_size=(3,3), input_shape=
         (28,28,1), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 35, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Conv2D(filters = 45, kernel_size=(3,3), activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.35))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.35))
model.add(Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Running the preceding code will compile our new model. Let's have another look at the summary by rerunning the following:

model.summary()

Running the preceding code returns the following output:

Let's refit our model by rerunning the following:

my_fit_model = model.fit(X_train, y_train, epochs=25, validation_data=
                        (X_test, y_test))

Once your model has refit, rerun the plot code to visualize loss. Here's mine:

This looks better! The difference between our training and validation losses has shrunk, which was the intended purpose, though there does appear to be some room for improvement.

Next, re-plot your accuracy curves. Here are mine for this run:

This also looks better from an overfitting perspective. Fantastic! What was the best classification accuracy we achieved after applying regularization? Let's run the following code:

print(max(my_fit_model.history['val_acc']))
print(my_fit_model.history['val_acc'].index(max(my_fit_model.history['val_acc'])))

My output from this run of the model was as follows:

Interesting! The best validation accuracy we achieved was lower than that in our unregularized model, but not by much. And it's still quite good! Our model is telling us that we predict the correct type of clothing article 88.85% of the time.
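When comparing runs like these, keep in mind that a list's .index() is zero-based, so position 20 corresponds to the 21st epoch. Also, newer Keras releases record this metric under 'val_accuracy' rather than 'val_acc'. The sketch below wraps both details in a hypothetical helper, best_epoch, and runs it on made-up numbers rather than real results:

```python
import numpy as np

def best_epoch(history, candidates=('val_acc', 'val_accuracy')):
    """Returns (best validation accuracy, 1-based epoch) from a history dict."""
    # Use whichever key name this Keras version recorded.
    key = next(k for k in candidates if k in history)
    scores = np.asarray(history[key])
    best_index = int(scores.argmax())          # zero-based position
    return float(scores[best_index]), best_index + 1

# Made-up numbers purely for illustration, not results from the model above.
example_history = {'val_accuracy': [0.80, 0.86, 0.88, 0.87]}
print(best_epoch(example_history))
```

With a real run, you'd call best_epoch(my_fit_model.history).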

One way to think about how well we've done here is to compare our model's accuracy with the baseline accuracy for our dataset. The baseline accuracy is simply the score we would get by naïvely selecting the most-commonly occurring class in the dataset. For this specific dataset, because the classes are perfectly balanced and there are 10 classes, the baseline accuracy is 10%. Our model handily beats this baseline accuracy. It's clearly learned something about the data!
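The baseline is easy to compute directly. Here's a minimal sketch using stand-in labels (10 balanced classes with 1,000 images each) instead of the real test labels:

```python
import numpy as np

# Stand-in labels: 10 balanced classes with 1,000 test images each.
true_labels = np.repeat(np.arange(10), 1000)

# The naive classifier always predicts the most common class.
majority_class = np.bincount(true_labels).argmax()
baseline_predictions = np.full_like(true_labels, majority_class)

# Fraction of images the naive classifier gets right.
baseline_accuracy = np.mean(baseline_predictions == true_labels)
print(baseline_accuracy)
```

On an imbalanced dataset the same code would report a higher baseline, which is why it's worth computing rather than assuming.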

There are so many different places you can go from here! Try building deeper models or grid-searching over the many hyperparameters we used in our models. Assess your classifier's performance as you would with any other model: try building a confusion matrix to understand which classes we predicted well and which classes we weren't as strong in!
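As a starting point for that last suggestion, here's a NumPy-only sketch of a confusion matrix. The confusion_matrix function is a hypothetical helper written for this illustration, and the labels are made up for a 3-class toy problem:

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when (row, col) pairs repeat.
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

# Made-up labels for a 3-class toy problem, purely for illustration.
true_labels = np.array([0, 0, 1, 1, 2, 2])
predicted_labels = np.array([0, 1, 1, 1, 2, 0])

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)

# Share of each true class that was predicted correctly.
per_class_recall = matrix.diagonal() / matrix.sum(axis=1)

print(matrix)
print(per_class_recall)
```

With our model, the integer labels would come from y_test.argmax(axis=1) and model.predict(X_test).argmax(axis=1), since both the targets and the predictions are arrays with one column per class.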
