5 Advanced CNN architectures

This chapter covers

  • Working with CNN design patterns
  • Understanding the LeNet, AlexNet, VGGNet, Inception, and ResNet network architectures

Welcome to part 2 of this book. Part 1 presented the foundation of neural networks architectures and covered multilayer perceptrons (MLPs) and convolutional neural networks (CNNs). We wrapped up part 1 with strategies to structure your deep neural network projects and tune their hyperparameters to improve network performance. In part 2, we will build on this foundation to develop computer vision (CV) systems that solve complex image classification and object detection problems.

In chapters 3 and 4, we talked about the main components of CNNs and setting up hyperparameters such as the number of hidden layers, learning rate, optimizer, and so on. We also talked about other techniques to improve network performance, like regularization, augmentation, and dropout. In this chapter, you will see how these elements come together to build a convolutional network. I will walk you through five of the most popular CNNs that were cutting edge in their time, and you will see how their designers thought about building, training, and improving networks. We will start with LeNet, developed in 1998, which performed fairly well at recognizing handwritten characters. You will see how CNN architectures have evolved since then to deeper CNNs like AlexNet and VGGNet, and beyond to more advanced and super-deep networks like Inception and ResNet, developed in 2014 and 2015, respectively.

For each CNN architecture, you will learn the following:

  • Novel features --We will explore the novel features that distinguish these networks from others and what specific problems their creators were trying to solve.

  • Network architecture --We will cover the architecture and components of each network and see how they come together to form the end-to-end network.

  • Network code implementation --We will walk step-by-step through the network implementations using the Keras deep learning (DL) library. The goal of this section is for you to learn how to read research papers and implement new architectures as the need arises.

  • Setting up learning hyperparameters --After you implement a network architecture, you need to set up the hyperparameters of the learning algorithms that you learned in chapter 4 (optimizer, learning rate, weight decay, and so on). We will implement the learning hyperparameters as presented in the original research paper of each network. In this section, you will see how performance evolved from one network to another over the years.

  • Network performance --Finally, you will see how each network performed on benchmark datasets like MNIST and ImageNet, as represented in their research papers.

The three main objectives of this chapter follow:

  • Understanding the architecture and learning hyperparameters of advanced CNNs. You will be implementing simpler CNNs like AlexNet and VGGNet for simple- to medium-complexity problems. For very complex problems, you might want to use deeper networks like Inception and ResNet.

  • Understanding the novel features of each network and the reasons they were developed. Each succeeding CNN architecture solves a specific limitation in the previous one. After reading about the five networks in this chapter (and their research papers), you will build a strong foundation for reading and understanding new networks as they emerge.

  • Learning how CNNs have evolved and their designers’ thought processes. This will help you build an instinct for what works well and what problems may arise when building your own network.

In chapter 3, you learned about the basic building blocks of convolutional layers, pooling layers, and fully connected layers of CNNs. As you will see in this chapter, in recent years a lot of CV research has focused on how to put together these basic building blocks to form effective CNNs. One of the best ways for you to develop your intuition is to examine and learn from these architectures (similar to how most of us may have learned to write code by reading other people’s code).

To get the most out of this chapter, you are encouraged to read the research papers linked in each section before you read my explanation. What you have learned in part 1 of this book fully equips you to start reading research papers written by pioneers in the AI field. Reading and implementing research papers is by far one of the most valuable skills that you will build from reading this book.

TIP Personally, I feel the task of going through a research paper, interpreting the crux behind it, and implementing the code is a very important skill every DL enthusiast and practitioner should possess. Practically implementing research ideas brings out the thought process of the author and also helps transform those ideas into real-world industry applications. I hope that, by reading this chapter, you will get comfortable reading research papers and implementing their findings in your own work. The fast-paced evolution in this field requires us to always stay up-to-date with the latest research. What you will learn in this book (or in other publications) now will not be the latest and greatest in three or four years--maybe even sooner. The most valuable asset that I want you to take away from this book is a strong DL foundation that empowers you to get out in the real world and be able to read the latest research and implement it yourself.

Are you ready? Let’s get started!

5.1 CNN design patterns

Before we jump into the details of the common CNN architectures, let's look at some common design choices when it comes to CNNs. It might seem at first that there are way too many choices to make: every new deep learning technique we learn about brings more hyperparameters to set. So it helps to narrow down our choices by looking at some common patterns established by pioneering researchers in the field, understand their motivation, and start from where they ended rather than doing things completely at random:

  • Pattern 1: Feature extraction and classification --Convolutional nets are typically composed of two parts: the feature extraction part, which consists of a series of convolutional layers; and the classification part, which consists of a series of fully connected layers (figure 5.1). This is pretty much always the case with ConvNets, starting from LeNet and AlexNet to the very recent CNNs that have come out in the past few years, like Inception and ResNet.

    Figure 5.1 Convolutional nets generally include feature extraction and classification.

  • Pattern 2: Image depth increases, and dimensions decrease --The data flowing through the network can be thought of as an image at every layer: each convolutional layer takes the image produced by the previous layer and transforms it into a new one. This pushes us to think of an image in a more generic way. First, each image is a 3D object that has a height, width, and depth. Depth here is the number of channels: 1 for grayscale images and 3 for color images. In the later layers, the images still have depth, but they are not colors per se: they are feature maps that represent the features extracted by the previous layers. That's why the depth increases as we go deeper through the network layers. In figure 5.2, the depth of an image is 96; this is the number of feature maps in that layer. So, that's one pattern you will always see: the image depth increases, and the spatial dimensions decrease.

    Figure 5.2 Image depth increases, and the dimensions decrease.

  • Pattern 3: Fully connected layers --This generally isn't as strict a pattern as the previous two, but it's very helpful to know. Typically, all fully connected layers in a network either have the same number of hidden units or decrease at each layer. It is rare to find a network where the number of units in the fully connected layers increases at each layer. Research has found that keeping the number of units constant doesn't hurt the neural network, so it may be a good approach if you want to limit the number of choices you have to make when designing your network. This way, all you have to do is pick a number of units per layer and apply it to all your fully connected layers. (The short sketch after this list shows all three patterns in a toy network.)
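
To make these patterns concrete, here is a small toy model (my own illustrative example, not one of the architectures covered in this chapter) that follows all three patterns: a convolutional feature extractor followed by a dense classifier, depth that grows while the spatial dimensions shrink, and fully connected layers of constant width.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential()
# feature extraction part: depth increases, spatial dimensions decrease
model.add(Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(64, 64, 3)))
model.add(MaxPool2D((2, 2)))                  # 64 x 64 x 32  -> 32 x 32 x 32
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPool2D((2, 2)))                  # 32 x 32 x 64  -> 16 x 16 x 64
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPool2D((2, 2)))                  # 16 x 16 x 128 -> 8 x 8 x 128
# classification part: fully connected layers of constant width, then softmax
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(10, activation='softmax'))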

Now that you understand the basic CNN patterns, let’s look at some architectures that have implemented them. Most of these architectures are famous because they performed well in the ImageNet competition. ImageNet is a famous benchmark that contains millions of images; DL and CV researchers use the ImageNet dataset to compare algorithms. More on that later.

NOTE The snippets in this chapter are not meant to be runnable. The goal is to show you how to implement the specifications that are defined in a research paper. Visit the book's website (www.manning.com/books/deep-learning-for-vision-systems) or GitHub repo (https://github.com/moelgendy/deep_learning_for_vision_systems) for the full executable code.

Now, let’s get started with the first network we are going to discuss in this chapter: LeNet.

5.2 LeNet-5

In 1998, LeCun et al. introduced a pioneering CNN called LeNet-5.1 The LeNet-5 architecture is straightforward, and the components are not new to you (they were new back in 1998); you learned about convolutional, pooling, and fully connected layers in chapter 3. The architecture is composed of five weight layers, hence the name LeNet-5: three convolutional layers and two fully connected layers.

DEFINITION We refer to the convolutional and fully connected layers as weight layers because they contain trainable weights as opposed to pooling layers that don’t contain any weights. The common convention is to use the number of weight layers to describe the depth of the network. For example, AlexNet (explained next) is said to be eight layers deep because it contains five convolutional and three fully connected layers. The reason we care more about weight layers is mainly because they reflect the model’s computational complexity.

5.2.1 LeNet architecture

The architecture of LeNet-5 is shown in figure 5.3:

INPUT IMAGE ⇒ C1 ⇒ TANH ⇒ S2 ⇒ C3 ⇒ TANH ⇒ S4 ⇒ C5 ⇒ TANH ⇒ FC6 ⇒ SOFTMAX7

where C is a convolutional layer, S is a subsampling or pooling layer, and FC is a fully connected layer.

Notice that Yann LeCun and his team used tanh as the activation function instead of ReLU, which is the state of the art today. In 1998, ReLU had not yet been used in the context of DL, and it was more common to use tanh or sigmoid as the activation function in the hidden layers. Without further ado, let's implement LeNet-5 in Keras.

Figure 5.3 LeNet architecture

5.2.2 LeNet-5 implementation in Keras

To implement LeNet-5 in Keras, read the original paper and follow the architecture information from pages 6-8. Here are the main takeaways for building the LeNet-5 network:

  • Number of filters in each convolutional layer --As you can see in figure 5.3 (and as defined in the paper), the depth (number of filters) of each convolutional layer is as follows: C1 has 6 filters, C3 has 16 filters, and C5 has 120 filters.

  • Kernel size of each convolutional layer --The paper specifies that the kernel_size is 5 × 5.

  • Subsampling (pooling) layers --A subsampling (pooling) layer is added after each convolutional layer. The receptive field of each unit is a 2 × 2 area (that is, pool_size is 2). Note that the LeNet-5 creators used average pooling, which computes the average value of its inputs, instead of the max pooling layer that we used in our earlier projects, which passes along the maximum value of its inputs. You can try both if you are interested, to see the difference. For this experiment, we are going to follow the paper's architecture.

  • Activation function --As mentioned before, the creators of LeNet-5 used the tanh activation function for the hidden layers because symmetric functions are believed to yield faster convergence compared to sigmoid functions (figure 5.4).

Figure 5.4 The LeNet architecture consists of convolutional kernels of size 5 × 5; pooling layers; an activation function (tanh); and three fully connected layers with 120, 84, and 10 neurons, respectively.

Now let’s put that in code to build the LeNet-5 architecture:

from keras.models import Sequential                                  
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense    
  
model = Sequential()                                               
  
# C1 Convolutional Layer
model.add(Conv2D(filters = 6, kernel_size = 5, strides = 1, activation = 'tanh',  
                 input_shape = (28,28,1), padding = 'same'))
  
# S2 Pooling Layer
model.add(AveragePooling2D(pool_size = 2, strides = 2, padding = 'valid'))
  
# C3 Convolutional Layer
model.add(Conv2D(filters = 16, kernel_size = 5, strides = 1,activation = 'tanh',
                 padding = 'valid'))
  
# S4 Pooling Layer
model.add(AveragePooling2D(pool_size = 2, strides = 2, padding = 'valid'))
  
# C5 Convolutional Layer
model.add(Conv2D(filters = 120, kernel_size = 5, strides = 1,activation = 'tanh',
                 padding = 'valid'))
  
model.add(Flatten())                                               
  
# FC6 Fully Connected Layer
model.add(Dense(units = 84, activation = 'tanh'))
  
# FC7 Output layer with softmax activation
model.add(Dense(units = 10, activation = 'softmax'))
  
model.summary()                                                   

Imports the Keras model and layers

Instantiates an empty sequential model

Flattens the CNN output to feed it to the fully connected layers

Prints the model summary (figure 5.5)

LeNet-5 is a small neural network by today's standards. It has 61,706 parameters, compared to the millions of parameters in more modern networks, as you will see later in this chapter.
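
As a sanity check, we can recover the 61,706 figure by counting weights and biases layer by layer, based on the layer shapes used in the code above:

# Rough parameter count for the LeNet-5 model above (weights + biases per layer),
# assuming the 28 x 28 x 1 input and the 'same'/'valid' padding used in the code.
c1  = 6   * (5 * 5 * 1  + 1)     #    156 -- C1: 6 filters over 1 input channel
c3  = 16  * (5 * 5 * 6  + 1)     #  2,416 -- C3: 16 filters over 6 channels
c5  = 120 * (5 * 5 * 16 + 1)     # 48,120 -- C5: 120 filters over 16 channels
fc6 = 120 * 84 + 84              # 10,164 -- FC6: C5 output is 1 x 1 x 120, so 120 inputs
out = 84 * 10 + 10               #    850 -- FC7: softmax output layer
print(c1 + c3 + c5 + fc6 + out)  # 61,706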

A note when reading the papers discussed in this chapter

When you read the LeNet-5 paper, just know that it is harder to read than the others we will cover in this chapter. Most of the ideas that I mention in this section are in sections 2 and 3 of the paper. The later sections of the paper talk about something called the graph transformer network, which isn’t widely used today. So if you do try to read the paper, I recommend focusing on section 2, which talks about the LeNet architecture and the learning details; then maybe take a quick look at section 3, which includes a bunch of experiments and results that are pretty interesting.

I recommend starting with the AlexNet paper (discussed in section 5.3), followed by the VGGNet paper (section 5.4), and then the LeNet paper. It is a good classic to look at once you go over the other ones.

Figure 5.5 LeNet-5 model summary

5.2.3 Setting up the learning hyperparameters

LeCun and his team used a scheduled learning rate decay, where the value of the learning rate was decreased according to the following schedule: 0.0005 for the first two epochs, 0.0002 for the next three epochs, 0.00005 for the next four, and then 0.00001 thereafter. In the paper, the authors trained their network for 20 epochs.

Let’s build a lr_schedule function with this schedule. The method takes an integer epoch number as an argument and returns the learning rate (lr):

def lr_schedule(epoch):
    if epoch <= 2:                 
        lr = 5e-4
    elif epoch > 2 and epoch <= 5:
        lr = 2e-4
    elif epoch > 5 and epoch <= 9:
        lr = 5e-5
    else: 
        lr = 1e-5
    return lr

lr is 0.0005 for the first two epochs, 0.0002 for the next three epochs (3 to 5), 0.00005 for the next four (6 to 9), then 0.00001 thereafter (more than 9).

We use the lr_schedule function in the following code snippet to compile the model:

from keras.callbacks import ModelCheckpoint, LearningRateScheduler
 
lr_scheduler = LearningRateScheduler(lr_schedule)
checkpoint = ModelCheckpoint(filepath='path_to_save_file/file.hdf5',
                             monitor='val_acc',
                             verbose=1,
                             save_best_only=True)
 
callbacks = [checkpoint, lr_scheduler]
 
model.compile(loss='categorical_crossentropy', optimizer='sgd',
              metrics=['accuracy'])

Now start the network training for 20 epochs, as mentioned in the paper:

hist = model.fit(X_train, y_train, batch_size=32, epochs=20,
          validation_data=(X_test, y_test), callbacks=callbacks, 
          verbose=2, shuffle=True)

See the downloadable notebook included with the book’s code for the full code implementation, if you want to see this in action.

5.2.4 LeNet performance on the MNIST dataset

When you train LeNet-5 on the MNIST dataset, you will get above 99% accuracy (see the code notebook with the book’s code). Try to re-run this experiment with the ReLU activation function in the hidden layers, and observe the difference in the network performance.
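
For example, the ReLU variant only changes the activation argument of the hidden layers; the first convolutional layer would become the following, and the remaining layers change in the same way (this is just a sketch of the swap, not code from the original paper):

# Same C1 layer as before, with tanh swapped for ReLU
model.add(Conv2D(filters = 6, kernel_size = 5, strides = 1, activation = 'relu',
                 input_shape = (28,28,1), padding = 'same'))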

5.3 AlexNet

LeNet performs very well on the MNIST dataset. But it turns out that the MNIST dataset is very simple: it contains grayscale images (1 channel) and only 10 classes, which makes it an easier challenge. The main motivation behind AlexNet was to build a deeper network that can learn more complex functions.

AlexNet (figure 5.6) was the winner of the ILSVRC image classification competition in 2012. Krizhevsky et al. created the network architecture and trained it on the 1.2 million high-resolution images of the ImageNet dataset, classifying them into 1,000 different classes.2 AlexNet was state of the art at its time because it was the first truly "deep" network that opened the door for the CV community to seriously consider convolutional networks in their applications. We will explain deeper networks later in this chapter, like VGGNet and ResNet, but it is good to see how ConvNets evolved and which drawbacks of AlexNet motivated the later networks.

Figure 5.6 AlexNet architecture

As you can see in figure 5.6, AlexNet has a lot of similarities to LeNet but is much deeper (more hidden layers) and bigger (more filters per layer). They have similar building blocks: a series of convolutional and pooling layers stacked on top of each other followed by fully connected layers and a softmax. We’ve seen that LeNet has around 61,000 parameters, whereas AlexNet has about 60 million parameters and 650,000 neurons, which gives it a larger learning capacity to understand more complex features. This allowed AlexNet to achieve remarkable performance in the ILSVRC image classification competition in 2012.

ImageNet and ILSVRC

ImageNet (http://image-net.org/index) is a large visual database designed for use in visual object recognition software research. It is aimed at labeling and categorizing images into almost 22,000 categories based on a defined set of words and phrases. The images were collected from the web and labeled by humans using Amazon's Mechanical Turk crowdsourcing tool. At the time of this writing, there are over 14 million images in the ImageNet project. To organize such a massive amount of data, the creators of ImageNet followed the WordNet hierarchy where each meaningful word/phrase in WordNet is called a synonym set (synset for short). Within the ImageNet project, images are organized according to these synsets, with the goal being to have 1,000+ images per synset.

The ImageNet project runs an annual software contest called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC, www.image-net.org/challenges/LSVRC), where software programs compete to correctly classify and detect objects and scenes. We will use the ILSVRC challenge as a benchmark to compare different networks’ performance.

5.3.1 AlexNet architecture

You saw a version of the AlexNet architecture in the project at the end of chapter 3. The architecture is pretty straightforward. It consists of:

  • Convolutional layers with the following kernel sizes: 11 × 11, 5 × 5, and 3 × 3

  • Max pooling layers for image downsampling

  • Dropout layers to avoid overfitting

  • Unlike LeNet, ReLU activation functions in the hidden layers and a softmax activation in the output layer

AlexNet consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax. The architecture can be represented in text as follows:

INPUT IMAGE ⇒ CONV1 ⇒ POOL2 ⇒ CONV3 ⇒ POOL4 ⇒ CONV5 ⇒ CONV6 ⇒ CONV7 ⇒ POOL8 ⇒ FC9 ⇒ FC10 ⇒ SOFTMAX11

5.3.2 Novel features of AlexNet

Before AlexNet, DL was starting to gain traction in speech recognition and a few other areas. But AlexNet was the milestone that convinced a lot of people in the CV community to take a serious look at DL by demonstrating that it really works in CV. AlexNet presented some novel features that were not used in previous CNNs (like LeNet). You are already familiar with all of them from the previous chapters, so we'll go through them quickly here.

ReLU activation function

AlexNet uses ReLU for the nonlinear part instead of the tanh and sigmoid functions that were the earlier standard for traditional neural networks (like LeNet). ReLU was used in the hidden layers of the AlexNet architecture because it trains much faster. This is because the derivative of the sigmoid function becomes very small in the saturating region, and therefore the updates applied to the weights almost vanish. This phenomenon is called the vanishing gradient problem. ReLU is represented by this equation:

f (x) = max(0,x)

It’s discussed in detail in chapter 2.

The vanishing gradient problem

Certain activation functions, like the sigmoid function, squish a large input space into a small output space between 0 and 1 (-1 to 1 for tanh activations). Therefore, a large change in the input of the sigmoid function causes a small change in the output. As a result, the derivative becomes very small:

The vanishing gradient problem: a large change in the input of the sigmoid function causes a negligible change in the output.

We will talk more about the vanishing gradient phenomenon later in this chapter when we look at the ResNet architecture.

Dropout layer

As explained in chapter 3, dropout layers are used to prevent the neural network from overfitting. The neurons that are “dropped out” do not contribute to the forward pass and do not participate in backpropagation. This means every time an input is presented, the neural network samples a different architecture, but all of these architectures share the same weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. Therefore, the neuron is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Krizhevsky et al. used dropout with a probability of 0.5 in the two fully connected layers.

Data augmentation

One popular and very effective approach to avoid overfitting is to artificially enlarge the dataset using label-preserving transformations. This happens by generating new instances of the training images with transformations like image rotation, flipping, scaling, and many more. Data augmentation is explained in detail in chapter 4.
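
As a rough sketch, this is how such label-preserving transformations can be set up with Keras's ImageDataGenerator (the specific transformation values here are illustrative, not the exact cropping-and-flipping pipeline used by Krizhevsky et al.):

from keras.preprocessing.image import ImageDataGenerator

# Generate transformed copies of the training images on the fly
datagen = ImageDataGenerator(rotation_range=15,       # random rotations up to 15 degrees
                             horizontal_flip=True,    # random horizontal flips
                             width_shift_range=0.1,   # random horizontal shifts
                             height_shift_range=0.1,  # random vertical shifts
                             zoom_range=0.1)          # random zoom

# model.fit_generator(datagen.flow(X_train, y_train, batch_size=128), epochs=90)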

Local response normalization

AlexNet uses local response normalization. It is different from the batch normalization technique (explained in chapter 4). Normalization helps to speed up convergence. Nowadays, batch normalization is used instead of local response normalization; we will use BN in our implementation in this chapter.

Weight regularization

Krizhevsky et al. used a weight decay of 0.0005. Weight decay is another term for the L2 regularization technique explained in chapter 4. This approach reduces the overfitting of the DL neural network model on training data to allow the network to generalize better on new data:

model.add(Conv2D(32, (3,3), kernel_regularizer=l2(weight_decay)))

The weight_decay value is the lambda (λ) hyperparameter that you can tune. If you still see overfitting, you can reduce it by increasing the lambda value. In this case, Krizhevsky and his team found that a small decay value of 0.0005 was good enough for the model to learn.

Training on multiple GPUs

Krizhevsky et al. used a GTX 580 GPU with only 3 GB of memory. It was state of the art at the time but not large enough to train on the 1.2 million training examples in the dataset. Therefore, the team developed a complicated way to spread the network across two GPUs. The basic idea was that a lot of the layers were split across two different GPUs that communicated with each other. You don't need to worry about these details today: there are far more advanced ways to train deep networks on distributed GPUs, as we will discuss later in this book.

5.3.3 AlexNet implementation in Keras

Now that you’ve learned the basic components of AlexNet and its novel features, let’s apply them to build the AlexNet neural network. I suggest that you read the architecture description on page 4 of the original paper and follow along.

As depicted in figure 5.7, the network contains eight weight layers: the first five are convolutional, and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax that produces a distribution over the 1,000 class labels.

NOTE AlexNet input starts with 227 × 227 × 3 images. If you read the paper, you will notice that it refers to input dimensions of 224 × 224 × 3, but the numbers make sense only for 227 × 227 × 3 images (figure 5.7). This is most likely a typo in the paper.

Figure 5.7 AlexNet contains eight weight layers: five convolutional and three fully connected. Two contain 4,096 neurons, and the output is fed to a 1,000-neuron softmax.

The layers are stacked together as follows:

  • CONV1--The authors used a large kernel size (11). They also used a large stride (4), which makes the input dimensions shrink by roughly a factor of 4 (from 227 × 227 to 55 × 55). We calculate the dimensions of the output as follows:

    (227 - 11)/4 + 1 = 55

    and the depth is the number of filters in the convolutional layer (96). The output dimensions are 55 × 55 × 96.

  • POOL with a filter size of 3 × 3--This reduces the dimensions from 55 × 55 to 27 × 27:

    (55 - 3)/2 + 1 = 27

    The pooling layer doesn’t change the depth of the volume. The output dimensions are 27 × 27 × 96.

Similarly, we can calculate the output dimensions of the remaining layers (the short sketch after this list verifies these spatial dimensions):

  • CONV2--Kernel size = 5, depth = 256, and stride = 1

  • POOL--Size = 3 × 3, which downsamples its input dimensions from 27 × 27 to 13 × 13

  • CONV3--Kernel size = 3, depth = 384, and stride = 1

  • CONV4--Kernel size = 3, depth = 384, and stride = 1

  • CONV5--Kernel size = 3, depth = 256, and stride = 1

  • POOL--Size = 3 × 3, which downsamples its input from 13 × 13 to 6 × 6

  • Flatten layer--Flattens the dimension volume 6 × 6 × 256 to 1 × 9,216

  • FC with 4,096 neurons

  • FC with 4,096 neurons

  • Softmax layer with 1,000 neurons
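
The sketch below verifies these spatial dimensions by applying the output-size formula (W - F + 2P)/S + 1 layer by layer (conv_out is my own helper function, not part of Keras):

def conv_out(size, kernel, stride, pad=0):
    # Output spatial size of a conv/pool layer: (W - F + 2P)/S + 1
    return (size - kernel + 2 * pad) // stride + 1

s = conv_out(227, 11, 4)     # CONV1 -> 55
s = conv_out(s, 3, 2)        # POOL  -> 27
s = conv_out(s, 5, 1, 2)     # CONV2 ('same' padding) -> 27
s = conv_out(s, 3, 2)        # POOL  -> 13
s = conv_out(s, 3, 1, 1)     # CONV3, CONV4, CONV5 ('same' padding) -> 13
s = conv_out(s, 3, 2)        # POOL  -> 6
print(s * s * 256)           # flattened volume: 6 x 6 x 256 = 9,216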

NOTE You might be wondering how Krizhevsky and his team decided on this configuration. Setting up the right values of network hyperparameters like kernel size, depth, stride, and pooling size is tedious and requires a lot of trial and error. The idea remains the same: we want to apply many weight layers to increase the model's capacity to learn more complex functions. We also need to add pooling layers in between to downsample the input dimensions, as discussed in chapter 3. With that said, setting the exact hyperparameters is one of the challenges of CNNs. VGGNet (explained next) solves this problem by implementing a uniform layer configuration that reduces the amount of trial and error when designing your network.

Note that all of the convolutional layers are followed by a batch normalization layer, and all of the hidden layers are followed by ReLU activations. Now, let’s put that in code to build the AlexNet architecture:

from keras.models import Sequential
from keras.regularizers import l2
from keras.layers import (Conv2D, Flatten, Dense, Activation,
                          MaxPool2D, BatchNormalization, Dropout)
 
model = Sequential()                                                
# 1st layer (CONV + pool + batchnorm)
model.add(Conv2D(filters= 96, kernel_size= (11,11), strides=(4,4), padding='valid', 
                 input_shape = (227,227,3)))
model.add(Activation('relu'))                                       
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
model.add(BatchNormalization())
    
# 2nd layer (CONV + pool + batchnorm)
model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same',   
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2), padding='valid'))
model.add(BatchNormalization())
            
# layer 3 (CONV + batchnorm)                                        
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
        
# layer 4 (CONV + batchnorm)                                        
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
            
# layer 5 (CONV + batchnorm)  
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',
                 kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2), padding='valid'))
 
model.add(Flatten())                                               
 
# layer 6 (Dense layer + dropout)  
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
 
# layer 7 (Dense layers) 
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
                           
# layer 8 (softmax output layer) 
model.add(Dense(units = 1000, activation = 'softmax'))
 
model.summary()                                                    

Imports the Keras model, layers, and regularizers

Instantiates an empty sequential model

The activation function can be added as its own layer or within the Conv2D function, as we did in previous implementations.

Note that the AlexNet authors did not add a pooling layer here.

Similar to layer 3

Flattens the CNN output to feed it to the fully connected layers

Prints the model summary

When you print the model summary, you will see that the number of total parameters is 62 million:

____________________________________________
Total params: 62,383,848
Trainable params: 62,381,096
Non-trainable params: 2,752
 

NOTE Both LeNet and AlexNet have many hyperparameters to tune. The authors of those networks had to go through many experiments to set the kernel size, strides, and padding for each layer, which makes the networks harder to understand and manage. VGGNet (explained next) solves this problem with a very simple, uniform architecture.

5.3.4 Setting up the learning hyperparameters

AlexNet was trained for 90 epochs, which took 6 days on two Nvidia Geforce GTX 580 GPUs simultaneously. This is why you will see that the network is split into two pipelines in the original paper. Krizhevsky et al. started with an initial learning rate of 0.01 with a momentum of 0.9. The lr is then divided by 10 when the validation error stops improving:

from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD
 
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1)
 
optimizer = SGD(lr=0.01, momentum=0.9)
 
model.compile(loss='categorical_crossentropy', optimizer=optimizer, 
              metrics=['accuracy'])                                           
 
model.fit(X_train, y_train, batch_size=128, epochs=90, 
          validation_data=(X_test, y_test), verbose=2, callbacks=[reduce_lr]) 

Reduces the learning rate by a factor of 0.1 when the validation error plateaus

Sets the SGD optimizer with lr of 0.01 and momentum of 0.9

Compiles the model

Trains the model and calls the reduce_lr value using callbacks in the training method

5.3.5 AlexNet performance

AlexNet significantly outperformed all the prior competitors in the 2012 ILSVRC challenge. It achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry that year, which used other traditional classifiers. This huge improvement in performance attracted the CV community's attention to the potential that convolutional networks have to solve complex vision problems and led to more advanced CNN architectures, as you will see in the following sections of this chapter.

Top-1 and top-5 error rates?

Top-1 and top-5 are terms used mostly in research papers to describe the accuracy of an algorithm on a given classification task. The top-1 error rate is the percentage of the time that the classifier did not give the correct class the highest score, and the top-5 error rate is the percentage of the time that the classifier did not include the correct class among its top five guesses.

Let’s apply this in an example. Suppose there are 100 classes, and we show the network an image of a cat. The classifier outputs a score or confidence value for each class as follows:

  1. Cat: 70%

  2. Dog: 20%

  3. Horse: 5%

  4. Motorcycle: 4%

  5. Car: 0.6%

  6. Plane: 0.4%

This means the classifier was able to correctly predict the true class of the image in the top-1. Try the same experiment for 100 images and observe how many times the classifier missed the true label, and that’s your top-1 error rate.

The same idea holds for the top-5 error rate. In the example, if the true label is Horse, then the classifier missed the true label in the top-1 but caught it in the first five predicted classes (that is, the top-5). Count how many times the classifier missed the true label in the top five predictions, and that's your top-5 error rate.

Ideally, we want the model to always predict the correct class in the top-1. But top-5 gives a more holistic evaluation of the model’s performance by defining how close the model is to the correct prediction for the missed classes.
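
If you want to compute these error rates yourself, here is a minimal NumPy sketch (top_k_error is my own helper; Keras also provides a top_k_categorical_accuracy metric that does something similar):

import numpy as np

def top_k_error(y_true, y_scores, k=5):
    # y_true: integer class labels, shape (N,); y_scores: predicted class scores, shape (N, C)
    top_k = np.argsort(y_scores, axis=1)[:, -k:]            # indices of the k highest scores
    hits = np.any(top_k == y_true.reshape(-1, 1), axis=1)   # is the true label among them?
    return 1.0 - hits.mean()                                # fraction of missed images

# top_1_error = top_k_error(y_true, y_scores, k=1)
# top_5_error = top_k_error(y_true, y_scores, k=5)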

5.4 VGGNet

VGGNet was developed in 2014 by the Visual Geometry Group at Oxford University (hence the name VGG).3 The building components are exactly the same as those in LeNet and AlexNet, except that VGGNet is an even deeper network with more convolutional, pooling, and dense layers. Other than that, no new components are introduced here.

The most commonly used VGGNet configuration, known as VGG16, consists of 16 weight layers: 13 convolutional layers and 3 fully connected layers. VGGNet's uniform architecture makes it appealing in the DL community because it is very easy to understand.

5.4.1 Novel features of VGGNet

We've seen how challenging it can be to set up CNN hyperparameters like kernel size, padding, strides, and so on. VGGNet's novel concept is its simple architecture containing uniform components (convolutional and pooling layers). It improves on AlexNet by replacing the large kernel-sized filters (11 × 11 and 5 × 5 in the first and second convolutional layers, respectively) with multiple 3 × 3 kernel-sized filters stacked one after another.

The architecture is composed of a series of uniform convolutional building blocks followed by a unified pooling layer, where:

  • All convolutional layers are 3 × 3 kernel-sized filters with a strides value of 1 and a padding value of same.

  • All pooling layers have a 2 × 2 pool size and a strides value of 2.

Simonyan and Zisserman decided to use a smaller 3 × 3 kernel to allow the network to extract finer-level features of the image compared to AlexNet's large kernels (11 × 11 and 5 × 5). The idea is that, for a given receptive field, multiple stacked smaller kernels are better than one larger kernel: the additional nonlinear layers increase the depth of the network, which enables it to learn more complex features, and they do so at a lower cost because they have fewer learning parameters.

For example, in their experiments, the authors noticed that a stack of two 3 × 3 convolutional layers (without spatial pooling in between) has an effective receptive field of 5 × 5, and that three 3 × 3 convolutional layers have the effect of a 7 × 7 receptive field. Replacing one large kernel with a deeper stack of 3 × 3 convolutions gives you two benefits. First, you get more nonlinear rectification layers (ReLU), which makes the decision function more discriminative. Second, you decrease the number of training parameters: a stack of three 3 × 3 convolutional layers over C channels is parameterized by 3(3²C²) = 27C² weights, whereas a single 7 × 7 convolutional layer requires 7²C² = 49C² weights, which is 81% more parameters.
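
The arithmetic is easy to check; C here is just an illustrative channel count:

C = 256                              # number of channels (illustrative)
stacked_3x3 = 3 * (3 * 3 * C * C)    # three 3 x 3 layers: 27C^2 weights
single_7x7  = 7 * 7 * C * C          # one 7 x 7 layer:    49C^2 weights
print(single_7x7 / stacked_3x3)      # ~1.81, i.e., about 81% more parameters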

Receptive field

As explained in chapter 3, the receptive field is the effective area of the input image on which the output depends:

This unified configuration of the convolutional and pooling components simplifies the neural network architecture, which makes it very easy to understand and implement.

The VGGNet architecture is developed by stacking 3 × 3 convolutional layers with 2 × 2 pooling layers inserted after several convolutional layers. This is followed by the traditional classifier, which is composed of fully connected layers and a softmax, as depicted in figure 5.8.

Figure 5.8 VGGNet-16 architecture

5.4.2 VGGNet configurations

Simonyan and Zisserman created several configurations for the VGGNet architecture, as shown in figure 5.9. All of the configurations follow the same generic design. Configurations D and E are the most commonly used and are called VGG16 and VGG19, referring to the number of weight layers. Each block contains a series of 3 × 3 convolutional layers with similar hyperparameter configuration, followed by a 2 × 2 pooling layer.

Figure 5.9 VGGNet architecture configurations

Table 5.1 lists the number of learning parameters (in millions) for each configuration. VGG16 yields ~138 million parameters; VGG19, which is a deeper version of VGGNet, has more than 144 million parameters. VGG16 is more commonly used because it performs almost as well as VGG19 but with fewer parameters.

Table 5.1 VGGNet architecture parameters (in millions)

Network             A, A-LRN   B     C     D     E
No. of parameters   133        133   134   138   144

VGG16 in Keras

Configurations D (VGG16) and E (VGG19) are the most commonly used configurations because they are deeper networks that can learn more complex functions. So, in this chapter, we will implement configuration D, which has 16 weight layers. VGG19 (configuration E) can be implemented similarly by adding a fourth convolutional layer to the third, fourth, and fifth blocks, as you can see in figure 5.9. The downloadable code for this chapter includes a full implementation of both VGG16 and VGG19.

Note that Simonyan and Zisserman used the following regularization techniques to avoid overfitting:

  • L2 regularization with weight decay of 5 × 10⁻⁴. For simplicity, this is not added to the implementation that follows.

  • Dropout regularization for the first two fully connected layers, with a dropout ratio set to 0.5.

The Keras code is as follows:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
 
model = Sequential()             
 
# block #1
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu',
                 padding='same', input_shape=(224,224, 3)))
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #2
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #3
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #4
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #5
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', 
                 padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
 
# block #6 (classifier)
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
 
model.summary()                        

Instantiates an empty sequential model

Prints the model summary

When you print the model summary, you will see that the number of total parameters is ~138 million:

____________________________________________
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0

5.4.3 Learning hyperparameters

Simonyan and Zisserman followed a training procedure similar to that of AlexNet: the training is carried out using mini-batch gradient descent with momentum of 0.9. The learning rate is initially set to 0.01 and then decreased by a factor of 10 when the validation set accuracy stops improving.
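
Put into Keras, a minimal sketch of this training setup might look like the following, mirroring the AlexNet snippet. The batch size of 256 and the roughly 74 epochs reflect my reading of the paper; the patience value is my own choice:

from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.1, patience=2)   # divide lr by 10 on plateau
optimizer = SGD(lr=0.01, momentum=0.9)                                     # mini-batch GD with momentum

model.compile(loss='categorical_crossentropy', optimizer=optimizer,
              metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=256, epochs=74,
#           validation_data=(X_test, y_test), callbacks=[reduce_lr])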

5.4.4 VGGNet performance

VGG16 achieved a top-5 error rate of 8.1% on the ImageNet dataset compared to 15.3% achieved by AlexNet. VGG19 did even better: it was able to achieve a top-5 error rate of ~7.4%. It is worth noting that in spite of the larger number of parameters and the greater depth of VGGNet compared to AlexNet, VGGNet required fewer epochs to converge due to the implicit regularization imposed by greater depth and smaller convolutional filter sizes.

5.5 Inception and GoogLeNet

The Inception network came to the world in 2014 when a group of researchers at Google published their paper, “Going Deeper with Convolutions.”4 The main hallmark of this architecture is building a deeper neural network while improving the utilization of the computing resources inside the network. One particular incarnation of the Inception network is called GoogLeNet and was used in the team's submission for ILSVRC 2014. It is 22 layers deep (deeper than VGGNet) yet, as the paper reports, uses 12 times fewer parameters than AlexNet (roughly 60 million down to about 5 million) while achieving significantly more accurate results. The network used a CNN inspired by the classical networks (AlexNet and VGGNet) but implemented a novel element dubbed the inception module.

5.5.1 Novel features of Inception

Szegedy et al. took a different approach when designing their network architecture. As we’ve seen in the previous networks, there are some architectural decisions that you need to make for each layer when you are designing a network, such as these:

  • The kernel size of the convolutional layer --We’ve seen in previous architectures that the kernel size varies: 1 × 1, 3 × 3, 5 × 5, and, in some cases, 11 × 11 (as in AlexNet). When designing the convolutional layer, we find ourselves trying to pick and tune the kernel size of each layer that fits our dataset. Recall from chapter 3 that smaller kernels capture finer details of the image, whereas bigger filters will leave out minute details.

  • When to use the pooling layer --AlexNet uses pooling layers every one or two convolutional layers to downsize spatial features. VGGNet applies pooling after every two, three, or four convolutional layers as the network gets deeper.

Configuring the kernel size and positioning the pooling layers are decisions we make mostly by trial and error, experimenting to get optimal results. Inception says, “Instead of choosing a desired filter size in a convolutional layer and deciding where to place the pooling layers, let's apply all of them together in one block and call it the inception module.”

That is, rather than stacking layers on top of each other as in classical architectures, Szegedy and his team suggest that we create an inception module consisting of several convolutional layers with different kernel sizes. The architecture is then developed by stacking the inception modules on top of each other. Figure 5.10 shows how classical convolutional networks are architected versus the Inception network.

Figure 5.10 Classical convolutional networks vs. the Inception network

From the diagram, you can observe the following:

  • In classical architectures like LeNet, AlexNet, and VGGNet, we stack convolutional and pooling layers on top of each other to build the feature extractors. At the end, we add the dense fully connected layers to build the classifier.

  • In the Inception architecture, we start with a convolutional layer and a pooling layer, stack the inception modules and pooling layers to build the feature extractors, and then add the regular dense classifier layers.

We’ve been treating the inception modules as black boxes to understand the bigger picture of the Inception architecture. Now, we will unpack the inception module to understand how it works.

5.5.2 Inception module: Naive version

The inception module is a combination of four layers:

  • 1 × 1 convolutional layer

  • 3 × 3 convolutional layer

  • 5 × 5 convolutional layer

  • 3 × 3 max-pooling layer

The outputs of these layers are concatenated into a single output volume forming the input of the next stage. The naive representation of the inception module is shown in figure 5.11.

The diagram may look a little overwhelming, but the idea is simple to understand. Let’s follow along with this example:

  1. Suppose we have an input dimensional volume from the previous layer of size 32 × 32 × 200.

  2. We feed this input to four convolutions simultaneously:

    • 1 × 1 convolutional layer with depth = 64 and padding = same. The output of this kernel = 32 × 32 × 64.
    • 3 × 3 convolutional layer with depth = 128 and padding = same. Output = 32 × 32 × 128.
    • 5 × 5 convolutional layer with depth = 32 and padding = same. Output = 32 × 32 × 32.
    • 3 × 3 max-pooling layer with padding = same and strides = 1. Output = 32 × 32 × 32.
  3. We concatenate the depth of the four outputs to create one output volume of dimensions 32 × 32 × 256.

Figure 5.11 Naive representation of an inception module

Now we have an inception module that takes an input volume of 32 × 32 × 200 and outputs a volume of 32 × 32 × 256.

NOTE In the previous example, we use a padding value of same. In Keras, padding can be set to same or valid, as we saw in chapter 3. The same value results in padding the input such that the output has the same length as the original input. We do that because we want the output to have width and height dimensions similar to the input. And we want to output similar dimensions in the inception module to simplify the depth concatenation process. Now we can just add up the depths of all the outputs to concatenate them into one output volume to be fed to the next layer in our network.

5.5.3 Inception module with dimensionality reduction

The naive representation of the inception module that we just saw has a big computational cost problem that comes with processing larger filters like the 5 × 5 convolutional layer. To get a better sense of the compute problem with the naive representation, let’s calculate the number of operations that will be performed for the 5 × 5 convolutional layer in the previous example.

The input volume with dimensions of 32 × 32 × 200 will be fed to the 5 × 5 convolutional layer of 32 filters with dimensions = 5 × 5 × 32. This means the total number of multiplications that the computer needs to compute is 32 × 32 × 200 multiplied by 5 × 5 × 32, which is more than 163 million operations. While we can perform this many operations with modern computers, this is still pretty expensive. This is when the dimensionality reduction layers can be very useful.

Dimensionality reduction layer (1 × 1 convolutional layer)

The 1 × 1 convolutional layer can reduce the operational cost of 163 million operations to about a tenth of that. That is why it is called a reduce layer. The idea here is to add a 1 × 1 convolutional layer before the bigger kernels like the 3 × 3 and 5 × 5 convolutional layers, to reduce their depth, which in turn will reduce the number of operations.

Let’s look at an example. Suppose we have an input dimension volume of 32 × 32 × 200. We then add a 1 × 1 convolutional layer with a depth of 16. This reduces the dimension volume from 200 to 16 channels. We can then apply the 5 × 5 convolutional layer on the output, which has much less depth (figure 5.12).

Figure 5.12 Dimensionality reduction is used to reduce the computational cost by reducing the depth of the layer.

Notice that the 32 × 32 × 200 input is processed through the two convolutional layers and outputs a volume of dimensions 32 × 32 × 32, which is the same as produced without applying the dimensionality reduction layer. But here, instead of processing the 5 × 5 convolutional layer on the entire 200 channels of the input volume, we take this huge volume and shrink its representation to a much smaller intermediate volume that has only 16 channels.
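
In Keras's functional API, the two layers in figure 5.12 would look roughly like this, assuming x is the 32 × 32 × 200 output tensor of the previous layer:

from keras.layers import Conv2D

reduced = Conv2D(16, (1, 1), padding='same', activation='relu')(x)        # 32 x 32 x 16
output  = Conv2D(32, (5, 5), padding='same', activation='relu')(reduced)  # 32 x 32 x 32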

Now, let’s look at the computational cost involved in this operation and compare it to the 163 million multiplications that we got before applying the reduce layer:

Computation

= operations in the 1 × 1 convolutional layer + operations in the 5 × 5 convolutional layer

= (32 × 32 × 200 multiplied by 1 × 1 × 16) + (32 × 32 × 16 multiplied by 5 × 5 × 32)

= 3.2 million + 13.1 million

The total number of multiplications in this operation is 16.3 million, which is a tenth of the 163 million multiplications that we calculated without the reduce layers.
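
You can verify both counts with a few lines of arithmetic:

# Multiplication counts for a 32 x 32 x 200 input volume
naive_5x5  = (32 * 32 * 200) * (5 * 5 * 32)   # 163,840,000 -- 5 x 5 CONV applied directly
reduce_1x1 = (32 * 32 * 200) * (1 * 1 * 16)   #   3,276,800 -- 1 x 1 reduce layer
then_5x5   = (32 * 32 * 16) * (5 * 5 * 32)    #  13,107,200 -- 5 x 5 CONV on the reduced volume
print(reduce_1x1 + then_5x5)                  #  16,384,000 -- a tenth of the naive cost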

The 1 × 1 convolutional layer

The idea of the 1 × 1 convolutional layer is that it preserves the spatial dimensions (height and width) of the input volume but changes the number of channels of the volume (depth):

1 × 1 conv layers preserve the spatial dimensions but change the depth.

The 1 × 1 convolutional layers are also known as bottleneck layers because the bottleneck is the smallest part of the bottle and reduce layers reduce the dimensionality of the network, making it look like a bottleneck:

1 × 1 convolutional layers are called bottleneck layers.

Impact of dimensionality reduction on network performance

You might be wondering whether shrinking the representation size so dramatically hurts the performance of the neural network. Szegedy et al. ran experiments and found that as long as you implement the reduce layer in moderation, you can shrink the representation size significantly without hurting performance--and save a lot of computations.

Now, let’s put the reduce layers into action and build a new inception module with dimensionality reduction. To do that, we will keep the same concept of concatenating the four layers from the naive representation. We will add a 1 × 1 convolutional reduce layer before the 3 × 3 and 5 × 5 convolutional layers to reduce their computational cost. We will also add a 1 × 1 convolutional layer after the 3 × 3 max-pooling layer because pooling layers don’t reduce the depth for their inputs. So, we will need to apply the reduce layer to their output before we do the concatenation (figure 5.13).

Figure 5.13 Building an inception module with dimensionality reduction

We add dimensionality reduction prior to bigger convolutional layers to allow for increasing the number of units at each stage significantly without an uncontrolled blowup in computational complexity at later stages. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.

Recap of inception modules

To summarize, if you are building a layer of a neural network and you don’t want to have to decide what filter size to use in the convolutional layers or when to add pooling layers, the inception module lets you use them all and concatenate the depth of all the outputs. This is called the naive representation of the inception module.

We then run into the problem of computational cost that comes with using large filters. Here, we use a 1 × 1 convolutional layer called the reduce layer that reduces the computational cost significantly. We add reduce layers before the 3 × 3 and 5 × 5 convolutional layers and after the max-pooling layer to create an inception module with dimensionality reduction.

5.5.4 Inception architecture

Now that we understand the components of the inception module, we are ready to build the Inception network architecture. We use the dimension reduction representation of the inception module, stack inception modules on top of each other, and add a 3 × 3 pooling layer in between for downsampling, as shown in figure 5.14.

Figure 5.14 We build the Inception network by adding a stack of inception modules on top of each other.

We can stack as many inception modules as we want to build a very deep convolutional network. In the original paper, the team built a specific incarnation of the inception module and called it GoogLeNet. They used this network in their submission for the ILSVRC 2014 competition. The GoogLeNet architecture is shown in figure 5.15.

Figure 5.15 The full GoogLeNet model consists of three parts: the first part has the classical CNN architecture like AlexNet and LeNet, the second part is a stack of inception modules and pooling layers, and the third part is the traditional fully connected classifier layers.

As you can see, GoogLeNet uses a stack of a total of nine inception modules and a max pooling layer every several blocks to reduce dimensionality. To simplify this implementation, we are going to break down the GoogLeNet architecture into three parts:

  • Part A--Identical to the AlexNet and LeNet architectures; contains a series of convolutional and pooling layers.

  • Part B --Contains nine inception modules stacked as follows: two inception modules + pooling layer + five inception modules + pooling layer + two inception modules.

  • Part C --The classifier part of the network, consisting of the fully connected and softmax layers.

5.5.5 GoogLeNet in Keras

Now, let’s implement the GoogLeNet architecture in Keras (figure 5.16). Notice that the inception module takes the features from the previous module as input, passes them through four routes, concatenates the depth of the output of all four routes, and then passes the concatenated output to the next module. The four routes are as follows:

  • 1 × 1 convolutional layer

  • 1 × 1 convolutional layer + 3 × 3 convolutional layer

  • 1 × 1 convolutional layer + 5 × 5 convolutional layer

  • 3 × 3 pooling layer + 1 × 1 convolutional layer

Figure 5.16 The inception module of GoogLeNet

First we’ll build the inception_module function. It takes the number of filters of each convolutional layer as an argument and returns the concatenated output:

from keras.layers import Conv2D, MaxPool2D, concatenate

def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3, filters_5x5_reduce,
                     filters_5x5, filters_pool_proj, name=None):
    # kernel_init and bias_init are the initializers defined in part A below

    # 1 × 1 route = 1 × 1 CONV that takes its input directly from the previous layer
    conv_1x1 = Conv2D(filters_1x1, kernel_size=(1, 1), padding='same', activation='relu',
                      kernel_initializer=kernel_init, bias_initializer=bias_init)(x)

    # 3 × 3 route = 1 × 1 CONV + 3 × 3 CONV
    pre_conv_3x3 = Conv2D(filters_3x3_reduce, kernel_size=(1, 1), padding='same',
                          activation='relu', kernel_initializer=kernel_init,
                          bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, kernel_size=(3, 3), padding='same', activation='relu',
                      kernel_initializer=kernel_init,
                      bias_initializer=bias_init)(pre_conv_3x3)

    # 5 × 5 route = 1 × 1 CONV + 5 × 5 CONV
    pre_conv_5x5 = Conv2D(filters_5x5_reduce, kernel_size=(1, 1), padding='same',
                          activation='relu', kernel_initializer=kernel_init,
                          bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, kernel_size=(5, 5), padding='same', activation='relu',
                      kernel_initializer=kernel_init,
                      bias_initializer=bias_init)(pre_conv_5x5)

    # pool route = POOL + 1 × 1 CONV
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu',
                       kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)

    # concatenate the depth of the four routes
    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)

    return output

Creates the 1 × 1 convolutional layer that takes its input directly from the previous layer

Concatenates together the depth of the three filters

GoogLeNet architecture

Now that the inception_module function is ready, let’s build the GoogLeNet architecture from figure 5.16. To get the values of the inception_module function’s arguments, we will go through figure 5.17, which shows the hyperparameter setup implemented by Szegedy et al. in the original paper. (Note that “#3 × 3 reduce” and “#5 × 5 reduce” in the figure represent the 1 × 1 filters in the reduction layers used before the 3 × 3 and 5 × 5 convolutional layers.)

Figure 5.17 Hyperparameters implemented by Szegedy et al. in the original Inception paper

Now, let’s go through the implementations of parts A, B, and C.

Part A: Building the bottom part of the network

Let’s build the bottom part of the network. This part consists of a 7 × 7 convolutional layer ⇒ 3 × 3 pooling layer ⇒ 1 × 1 convolutional layer ⇒ 3 × 3 convolutional layer ⇒ 3 × 3 pooling layer, as you can see in figure 5.18.

Figure 5.18 The bottom part of the network

Similar to AlexNet, the original GoogLeNet uses a local response normalization (LocalResponseNorm) layer here to help speed up convergence. Nowadays, batch normalization is used instead, which is what we do in the code that follows.

Here is the Keras code for part A:

# input layer with size = 224 × 224 × 3
input_layer = Input(shape=(224, 224, 3))
kernel_init = keras.initializers.glorot_uniform()
bias_init = keras.initializers.Constant(value=0.2)
x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7/2', 
kernel_initializer=kernel_init, bias_initializer=bias_init)(input_layer)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3/2')(x)
x = BatchNormalization()(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu')(x)
 
x = BatchNormalization()(x)
 
x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)

Part B: Building the inception modules and max-pooling layers

To build inception modules 3a and 3b and the first max-pooling layer, we use the hyperparameters listed in table 5.2. The code follows the table:

Table 5.2 Inception modules 3a and 3b

Type            #1 × 1   #3 × 3 reduce   #3 × 3   #5 × 5 reduce   #5 × 5   Pool proj
Inception (3a)  64       96              128      16              32       32
Inception (3b)  128      128             192      32              96       64

x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128, 
                     filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32,
                     name='inception_3a')
 
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192,
                     filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64,
                     name='inception_3b')
 
x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)

Similarly, let’s create inception modules 4a, 4b, 4c, 4d, and 4e and the max pooling layer:

x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208, 
                     filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64,
                     name='inception_4a')
 
x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224,
                     filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4b')
 
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256,
                     filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4c')
 
x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288,
                     filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64,
                     name='inception_4d')
 
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320,
                     filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128,
                     name='inception_4e')
 
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3/2')(x)

Now, let’s create modules 5a and 5b:

x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, 
                     filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, 
                     name='inception_5a')
 
x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384, 
                     filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128, 
                     name='inception_5b')

Part C: Building the classifier part

In their experiments, Szegedy et al. found that adding a 7 × 7 average pooling layer improved the top-1 accuracy by about 0.6%. They then added a dropout layer with a 40% dropout rate to reduce overfitting:

x = AveragePooling2D(pool_size=(7,7), strides=1, padding='valid')(x)   # 7 × 7 average pooling
x = Dropout(0.4)(x)                                                    # dropout with a 40% rate
x = Dense(10, activation='softmax', name='output')(x)                  # softmax output (10 classes in this example)

5.5.6 Learning hyperparameters

The team used stochastic gradient descent (SGD) with 0.9 momentum. They also implemented a fixed learning rate schedule that decreases the learning rate by 4% every 8 epochs. An example of how to implement training specifications similar to the paper’s is as follows:

import math
from keras.callbacks import LearningRateScheduler
from keras.optimizers import SGD

epochs = 25
initial_lrate = 0.01

# implements the learning rate decay function: drop by 4% every 8 epochs
def decay(epoch, steps=100):
    initial_lrate = 0.01
    drop = 0.96
    epochs_drop = 8
    lrate = initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate

lr_schedule = LearningRateScheduler(decay, verbose=1)

sgd = SGD(lr=initial_lrate, momentum=0.9, nesterov=False)

model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X_train, y_train, batch_size=256, epochs=epochs,
          validation_data=(X_test, y_test), callbacks=[lr_schedule],
          verbose=2, shuffle=True)

5.5.7 Inception performance on the ImageNet dataset

GoogLeNet was the winner of the ILSVRC 2014 competition. It achieved a top-5 error rate of 6.67%, which was very close to human-level performance and much better than previous CNNs like AlexNet and VGGNet.

5.6 ResNet

The Residual Neural Network (ResNet) was developed in 2015 by a team at Microsoft Research.5 They introduced a novel residual module architecture with skip connections. The network also features heavy batch normalization for the hidden layers. This technique allowed the team to train very deep neural networks with 50, 101, and 152 weight layers while still having lower complexity than smaller networks like VGGNet (19 layers). ResNet achieved a top-5 error rate of 3.57% in the ILSVRC 2015 competition, which beat the performance of all prior ConvNets.

5.6.1 Novel features of ResNet

Looking at how neural network architectures evolved from LeNet, AlexNet, VGGNet, and Inception, you might have noticed that the deeper the network, the larger its learning capacity, and the better it extracts features from images. This mainly happens because very deep networks are able to represent very complex functions, which allows the network to learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers).

Earlier in this chapter, we saw deep neural networks like VGGNet-19 (19 layers) and GoogLeNet (22 layers). Both performed very well in the ImageNet challenge. But can we build even deeper networks? We learned from chapter 4 that one downside of adding too many layers is that doing so makes the network more prone to overfit the training data. Overfitting is not a major blocker, because we can use regularization techniques like dropout, L2 regularization, and batch normalization to avoid it. So, if we can take care of the overfitting problem, wouldn’t we want to build networks that are 50, 100, or even 150 layers deep? The answer is yes. We definitely should try to build very deep neural networks. We need to fix just one other problem to unlock the ability to build super-deep networks: a phenomenon called vanishing gradients.

Vanishing and exploding gradients

The problem with very deep networks is that the signal required to change the weights becomes very small at earlier layers. To understand why, let’s consider the gradient descent process explained in chapter 2. As the network backpropagates the gradient of the error from the final layer back to the first layer, it is multiplied by the weight matrix at each step; thus the gradient can decrease exponentially quickly to zero, leading to a vanishing gradient phenomenon that prevents the earlier layers from learning. As a result, the network’s performance gets saturated or even starts to degrade rapidly.

In other cases, the gradient grows exponentially quickly and “explodes” to take very large values. This phenomenon is called exploding gradients.
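To see why the magnitude matters, here is a minimal numerical sketch (scalar “weights” standing in for real weight matrices, so the numbers are purely illustrative): multiplying the gradient by a factor slightly below 1 at every layer shrinks it exponentially, while a factor slightly above 1 makes it explode:

# toy illustration of vanishing and exploding gradients:
# the gradient is multiplied by one scalar factor per layer
layers = 50
grad_small = 1.0
grad_large = 1.0
for _ in range(layers):
    grad_small *= 0.9   # factor slightly below 1 at each layer
    grad_large *= 1.1   # factor slightly above 1 at each layer

print(f"vanishing: {grad_small:.6f}")   # about 0.005 after 50 layers
print(f"exploding: {grad_large:.1f}")   # about 117 after 50 layers

In a 50-layer network, a per-layer factor of 0.9 leaves the earliest layers with a gradient roughly 200 times smaller than the one that left the loss layer, which is why those layers effectively stop learning.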

To solve the vanishing gradient problem, He et al. created a shortcut that allows the gradient to be directly backpropagated to earlier layers. These shortcuts are called skip connections: they flow information from earlier layers in the network to later layers, creating an alternate shortcut path for the gradient to flow through. Another important benefit of skip connections is that they allow the model to learn an identity function, which ensures that a layer will perform at least as well as the previous layer (figure 5.19).

Figure 5.19 Traditional network without skip connections (left); network with a skip connection (right).

At left in figure 5.19 is the traditional stacking of convolutional layers one after the other. On the right, we still stack convolutional layers as before, but we also add the original input to the output of the convolutional block. This is a skip connection. We then add both signals: skip connection + main path.

Note that the shortcut arrow points to the end of the second convolutional layer--not after it. The reason is that we add both paths before we apply the ReLU activation function of this layer. As you can see in figure 5.20, the x signal is passed along the shortcut path and then added to the main path, f(x). Then, we apply the ReLU activation to f(x) + x to produce the output signal: relu( f(x) + x ).

Figure 5.20 Adding the paths and applying the ReLU activation function to solve the vanishing gradient problem that usually comes with very deep networks

The code implementation of the skip connection is straightforward:

X_shortcut = X                       # stores the value of the shortcut, equal to the input X

# main path operations: CONV + ReLU + CONV
# ('same' padding keeps the shapes compatible for the addition)
X = Conv2D(filters=F1, kernel_size=(3, 3), strides=(1, 1), padding='same')(X)
X = Activation('relu')(X)
X = Conv2D(filters=F1, kernel_size=(3, 3), strides=(1, 1), padding='same')(X)

# adds both paths together
X = Add()([X, X_shortcut])

# applies the ReLU activation function
X = Activation('relu')(X)

This combination of the skip connection and convolutional layers is called a residual block. Similar to the Inception network, ResNet is composed of a series of these residual blocks stacked on top of each other (figure 5.21).

Figure 5.21 Classical CNN architecture (left). The Inception network consists of a set of inception modules (middle). The residual network consists of a set of residual blocks (right).

From the figure, you can observe the following:

  • Feature extractors --To build the feature extractor part of ResNet, we start with a convolutional layer and a pooling layer and then stack residual blocks on top of each other to build the network. When we are designing our ResNet network, we can add as many residual blocks as we want to build even deeper networks.

  • Classifiers --The classification part is still the same as we learned for other networks: fully connected layers followed by a softmax.

Now that you know what a skip connection is and you are familiar with the high-level architecture of ResNet, let’s unpack residual blocks to understand how they work.

5.6.2 Residual blocks

A residual module consists of two branches:

  • Shortcut path (figure 5.22)--Connects the input directly to an addition with the output of the main path (the second branch).

  • Main path --A series of convolutions and activations. The main path consists of three convolutional layers with ReLU activations. We also add batch normalization to each convolutional layer to reduce overfitting and speed up training. The main path architecture looks like this: [CONV ⇒ BN ⇒ ReLU] × 3.

Figure 5.22 The output of the main path is added to the input value through the shortcut before they are fed to the ReLU function.

Similar to what we explained earlier, the shortcut path is added to the main path right before the activation function of the last convolutional layer. Then we apply the ReLU function after adding the two paths.

Notice that there are no pooling layers in the residual block. Instead, He et al. decided to do dimension reduction using bottleneck 1 × 1 convolutional layers, similar to the Inception network. Each residual block starts with a 1 × 1 convolutional layer that reduces the depth of the input volume, followed by a 3 × 3 convolutional layer, and ends with another 1 × 1 convolutional layer that expands the depth back up. This is a good technique for keeping control of the volume dimensions across many layers. This configuration is called a bottleneck residual block.

When we stack residual blocks on top of each other, the volume dimensions change from one block to another. And as you might recall from the matrix introduction in chapter 2, to perform matrix addition, the matrices must have the same dimensions. To fix this problem, we need to downsample the shortcut path as well, before merging the two paths. We do that by adding a bottleneck layer (1 × 1 convolutional layer + batch normalization) to the shortcut path, as shown in figure 5.23. This is called the reduce shortcut.

Figure 5.23 To reduce the input dimensionality, we add a bottleneck layer (1 × 1 convolutional layer + batch normalization) to the shortcut path. This is called the reduce shortcut.

Before we jump into the code implementation, let’s recap the discussion of residual blocks:

  • Residual blocks contain two paths: the shortcut path and the main path.

  • The main path consists of three convolutional layers, and we add a batch normalization layer to them:

    • 1 × 1 convolutional layer
    • 3 × 3 convolutional layer
    • 1 × 1 convolutional layer
  • There are two ways to implement the shortcut path:

    • Regular shortcut --Add the input dimensions to the main path.
    • Reduce shortcut --Add a convolutional layer in the shortcut path before merging with the main path.

When we implement the ResNet network, we will use both regular and reduce shortcuts. This will be clearer when you see the full implementation. For now, we will implement a bottleneck_residual_block function that takes a reduce Boolean argument. When reduce is True, the function uses the reduce shortcut; otherwise, it uses the regular shortcut. The bottleneck_residual_block function takes the following arguments:

  • X--Input tensor of shape (number of samples, height, width, channel)

  • kernel_size--Integer specifying the size of the middle convolutional layer’s window in the main path

  • filters--Python list of integers defining the number of filters in the convolutional layers of the main path

  • reduce--Boolean: when True, the block uses the reduce shortcut

  • s--Integer (strides)

The function returns X: the output of the residual block, which is a tensor of shape (number of samples, height, width, channel).

The function is as follows:

def bottleneck_residual_block(X, kernel_size, filters, reduce=False, s=2):
    # unpacks the list to retrieve the number of filters of each convolutional layer
    F1, F2, F3 = filters

    # saves the input value to add back to the main path later
    X_shortcut = X

    if reduce:
        # to reduce the spatial size, applies a 1 × 1 convolutional layer (+ BN) to the
        # shortcut path; both paths need the same strides so their shapes match
        X_shortcut = Conv2D(filters=F3, kernel_size=(1, 1), strides=(s, s))(X_shortcut)
        X_shortcut = BatchNormalization(axis=3)(X_shortcut)

        # first component of main path: uses the same strides as the shortcut
        X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(s, s), padding='valid')(X)
        X = BatchNormalization(axis=3)(X)
        X = Activation('relu')(X)
    else:
        # first component of main path
        X = Conv2D(filters=F1, kernel_size=(1, 1), strides=(1, 1), padding='valid')(X)
        X = BatchNormalization(axis=3)(X)
        X = Activation('relu')(X)

    # second component of main path
    X = Conv2D(filters=F2, kernel_size=kernel_size, strides=(1, 1), padding='same')(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)

    # third component of main path
    X = Conv2D(filters=F3, kernel_size=(1, 1), strides=(1, 1), padding='valid')(X)
    X = BatchNormalization(axis=3)(X)

    # final step: adds the shortcut value to the main path and applies the ReLU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

5.6.3 ResNet implementation in Keras

You’ve learned a lot about residual blocks so far. Let’s add these blocks on top of each other to build the full ResNet architecture. Here, we will implement ResNet50: a version of the ResNet architecture that contains 50 weight layers (hence the name). You can use the same approach to develop ResNet with 18, 34, 101, and 152 layers by following the architecture in figure 5.24 from the original paper.

Figure 5.24 Architecture of several ResNet variations from the original paper

We know from the previous section that each residual module contains three convolutional layers, so we can now compute the total number of weight layers inside the ResNet50 network as follows:

  • Stage 1: 7 × 7 convolutional layer

  • Stage 2: 3 residual blocks, each containing [1 × 1 convolutional layer + 3 × 3 convolutional layer + 1 × 1 convolutional layer] = 9 convolutional layers

  • Stage 3: 4 residual blocks = total of 12 convolutional layers

  • Stage 4: 6 residual blocks = total of 18 convolutional layers

  • Stage 5: 3 residual blocks = total of 9 convolutional layers

  • Fully connected softmax layer

When we sum all these layers together, we get a total of 50 weight layers that describe the architecture of ResNet50. Similarly, you can compute the number of weight layers in the other ResNet versions.
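As a quick sanity check, here is the same count written out in Python (a sketch of the arithmetic only, with three convolutional layers per residual block):

# ResNet50 weight-layer count: 1 initial conv + residual blocks (3 conv layers each)
# across stages 2-5 + 1 fully connected softmax layer
blocks_per_stage = [3, 4, 6, 3]              # stages 2, 3, 4, and 5
conv_layers = 1 + sum(3 * b for b in blocks_per_stage)
total_weight_layers = conv_layers + 1        # add the fully connected layer
print(total_weight_layers)                   # 50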

NOTE In the following implementation, we use the residual block with reduce shortcut at the beginning of each stage to reduce the spatial size of the output from the previous layer. Then we use the regular shortcut for the remaining layers of that stage. Recall from our implementation of the bottleneck_residual_block function that we will set the argument reduce to True to apply the reduce shortcut.

Now let’s follow the 50-layer architecture from figure 5.24 to build the ResNet50 network. We build a ResNet50 function that takes input_shape and classes as arguments and outputs the model:

def ResNet50(input_shape, classes):
    # defines the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Stage 1
    X = Conv2D(64, (7, 7), strides=(2, 2), name='conv1')(X_input)
    X = BatchNormalization(axis=3, name='bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = bottleneck_residual_block(X, 3, [64, 64, 256], reduce=True, s=1)
    X = bottleneck_residual_block(X, 3, [64, 64, 256])
    X = bottleneck_residual_block(X, 3, [64, 64, 256])

    # Stage 3
    X = bottleneck_residual_block(X, 3, [128, 128, 512], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])
    X = bottleneck_residual_block(X, 3, [128, 128, 512])

    # Stage 4
    X = bottleneck_residual_block(X, 3, [256, 256, 1024], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])
    X = bottleneck_residual_block(X, 3, [256, 256, 1024])

    # Stage 5
    X = bottleneck_residual_block(X, 3, [512, 512, 2048], reduce=True, s=2)
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])
    X = bottleneck_residual_block(X, 3, [512, 512, 2048])

    # AVGPOOL: with a 224 × 224 input, the final feature map is 7 × 7,
    # so this acts as the global average pooling used in the paper
    X = AveragePooling2D((7, 7))(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes))(X)

    # creates the model
    model = Model(inputs=X_input, outputs=X, name='ResNet50')

    return model

5.6.4 Learning hyperparameters

He et al. followed a training procedure similar to AlexNet’s: training is carried out using mini-batch gradient descent with a momentum of 0.9. They set the learning rate to start at 0.1 and then decreased it by a factor of 10 when the validation error stopped improving. They also used L2 regularization with a weight decay of 0.0001 (not implemented in this chapter, for simplicity). As you saw in the earlier implementation, they used batch normalization right after each convolutional layer and before the activation to speed up training:

import numpy as np
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD

# sets the training parameters
epochs = 200
batch_size = 256

# min_lr is the lower bound on the learning rate, and factor is the factor
# by which the learning rate will be reduced when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1),
                              patience=5, min_lr=0.5e-6)

# compiles the model with an SGD optimizer (learning rate 0.1, momentum 0.9)
sgd = SGD(lr=0.1, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# trains the model, passing reduce_lr to the training method via callbacks
model.fit(X_train, Y_train, batch_size=batch_size, validation_data=(X_test, Y_test),
          epochs=epochs, callbacks=[reduce_lr])

5.6.5 ResNet performance on the ImageNet dataset

Similar to the other networks explained in this chapter, the performance of ResNet models is benchmarked based on their results in the ILSVRC competition. ResNet-152 won first place in the 2015 classification competition with a top-5 error rate of 4.49% with a single model and 3.57% using an ensemble of models. This was much better than all the other networks, such as GoogLeNet (Inception), which achieved a top-5 error rate of 6.67%. ResNet also won first place in many object detection and image localization challenges, as we will see in chapter 7. More importantly, the residual blocks concept in ResNet opened the door to new possibilities for efficiently training super-deep neural networks with hundreds of layers.

Using open source implementations

Now that you have learned some of the most popular CNN architectures, I want to share some practical advice on how to use them. It turns out that many of these networks are difficult or finicky to replicate, because details such as learning rate decay and other hyperparameter settings make a real difference to performance. DL researchers can have a hard time replicating someone else’s polished work just from reading the paper.

Fortunately, many DL researchers routinely open source their work on the internet. A simple search for the network implementation on GitHub will point you toward implementations in several DL libraries that you can clone and train. If you can locate the authors’ implementation, you can usually get going much faster than by trying to re-implement the network from scratch--although sometimes, re-implementing from scratch can be a good exercise, as we did earlier in this chapter.
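As an example of how little code this can take, Keras ships pretrained versions of several of the networks covered in this chapter under keras.applications. The following sketch (the image filename is a placeholder) loads a ResNet50 pretrained on ImageNet and classifies a single image:

from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.preprocessing import image
import numpy as np

# loads ResNet50 with weights pretrained on ImageNet
model = ResNet50(weights='imagenet')

# loads and preprocesses a single image (placeholder filename)
img = image.load_img('my_image.jpg', target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# predicts and prints the top three ImageNet classes
preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])

Note that this pretrained ResNet50 class from keras.applications is separate from the ResNet50 function we implemented by hand earlier; the hand-built version is for learning, while the pretrained one is what you would typically reach for in practice.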

Summary

  • Classical CNN architectures share the same basic design: convolutional and pooling layers stacked on top of each other, with different configurations for their layers.

  • LeNet consists of five weight layers: three convolutional and two fully connected layers, with a pooling layer after the first and second convolutional layers.

  • AlexNet is deeper than LeNet and contains eight weight layers: five convolutional and three fully connected layers.

  • VGGNet solved the problem of setting up the hyperparameters of the convolutional and pooling layers by creating a uniform configuration for them to be used across the entire network.

  • Inception tried to solve the same problem as VGGNet: instead of having to decide which filter size to use and where to add the pooling layer, Inception says, “Let’s use them all.”

  • ResNet followed the same approach as Inception and created residual blocks that, when stacked on top of each other, form the network architecture. ResNet attempted to solve the vanishing gradient problem that made learning plateau or degrade when training very deep neural networks. The ResNet team introduced skip connections that allow information to flow from earlier layers in the network to later layers, creating an alternate shortcut path for the gradient to flow through. The fundamental breakthrough with ResNet was that it allowed us to train extremely deep neural networks with hundreds of layers.


1.Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE 86 (11): 2278-2324, http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf.

2.Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Communications of the ACM 60 (6): 84-90, https://dl.acm.org/doi/10.1145/3065386.

3.Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014, https://arxiv.org/pdf/1409.1556v6.pdf.

4.Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going Deeper with Convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9, 2015, http://mng.bz/YryB.

5.Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” 2015, http://arxiv.org/abs/1512.03385.
