6 Transfer learning

This chapter covers

  • Understanding the transfer learning technique
  • Using a pretrained network to solve your problem
  • Understanding network fine-tuning
  • Exploring open source image datasets for training a model
  • Building two end-to-end transfer learning projects

Transfer learning is one of the most important techniques of deep learning. When building a vision system to solve a specific problem, you usually need to collect and label a huge amount of data to train your network. You can build convnets, as you learned in chapter 3, and start the training from scratch; that is an acceptable approach. But what if you could download an existing neural network that someone else has tuned and trained, and use it as a starting point for your new task? Transfer learning allows you to do just that. You can download an open source model that someone else has already trained and tuned and use their optimized parameters (weights) as a starting point to train your model on a smaller dataset for a given task. This way, you can train your network a lot faster and achieve better results.

DL researchers and practitioners have posted many research papers and open source projects of trained algorithms that they have worked on for weeks and months and trained on GPUs to get state-of-the-art results on an array of problems. Often, the fact that someone else has done this work and gone through the painful high-performance research process means you can download an open source architecture and weights and use them as a good start for your own neural network. This is transfer learning: the transfer of knowledge from a pretrained network in one domain to your own problem in a different domain.

In this chapter, I will explain transfer learning and outline reasons why using it is important. I will also detail different transfer learning scenarios and how to use them. Finally, we will see examples of using transfer learning to solve real-world problems. Ready? Let’s get started!

6.1 What problems does transfer learning solve?

As the name implies, transfer learning means transferring what a neural network has learned from being trained on a specific dataset to another related problem (figure 6.1). Transfer learning is currently very popular in the field of DL because it enables you to train deep neural networks with comparatively little data in a short training time. The importance of transfer learning comes from the fact that in most real-world problems, we typically do not have millions of labeled images to train such complex models.

Figure 6.1 Transfer learning is the transfer of the knowledge that the network has acquired from one task to a new task. In the context of neural networks, the acquired knowledge is the extracted features.

The idea is pretty straightforward. First we train a deep neural network on a very large amount of data. During the training process, the network extracts a large number of useful features that can be used to detect objects in this dataset. We then transfer these extracted features (feature maps) to a new network and train this new network on our new dataset to solve a different problem. Transfer learning is a great way to shortcut the process of collecting and training huge amounts of data simply by reusing the model weights from pretrained models that were developed for standard CV benchmark datasets, such as the ImageNet image-recognition tasks. Top-performing models can be downloaded and used directly, or integrated into a new model for your own CV problems.

The question is, why would we want to use transfer learning? Why don’t we just train a neural network directly on our new dataset to solve our problem? To answer this question, we first need to know the main problems that transfer learning solves. We’ll discuss those now; then I’ll go into the details of how transfer learning works and the different approaches to apply it.

Deep neural networks are immensely data-hungry and rely on huge amounts of labeled data to achieve high performance. In practice, very few people train an entire convolutional network from scratch. This is due to two main problems:

  • Data problem --Training a network from scratch requires a lot of data in order to get decent results, which is not feasible in most cases. It is relatively rare to have a dataset of sufficient size to solve your problem. It is also very expensive to acquire and label data: this is mostly a manual process done by humans capturing images and labeling them one by one, which makes it a nontrivial task.

  • Computation problem --Even if you are able to acquire hundreds of thousands of images for your problem, it is computationally very expensive to train a deep neural network on millions of images because doing so usually requires weeks of training on multiple GPUs. Also keep in mind that training a neural network is an iterative process. So, even if you happen to have the computing power required to train a complex neural network, spending weeks experimenting with different hyperparameters in each training iteration until you finally reach satisfactory results will make the project very costly.

Additionally, an important benefit of using transfer learning is that it helps the model generalize its learnings and avoid overfitting. When you apply a DL model in the wild, it is faced with countless conditions it may never have seen before and does not know how to deal with; each client has its own preferences and generates data that is different from the data used for training. The model is asked to perform well on many tasks that are related to but not exactly similar to the task it was trained for.

For example, when you deploy a car classifier model to production, people usually have different camera types, each with its own image quality and resolution. Also, images can be taken during different weather conditions. These image nuances vary from one user to another. To train the model on all these different cases, you either have to account for every case and acquire a lot of images to train the network on, or try to build a more robust model that is better at generalizing to new use cases. This is what transfer learning does. Since it is not realistic to account for all the cases the model may face in the wild, transfer learning can help us deal with novel scenarios. It is necessary for production-scale use of DL that goes beyond tasks and domains where labeled data is plentiful. Transferring features extracted from another network that has seen millions of images will make our model less prone to overfit and help it generalize better when faced with novel scenarios. You will be able to fully grasp this concept when we explain how transfer learning works in the following sections.

6.2 What is transfer learning?

Armed with the understanding of the problems that transfer learning solves, let’s look at its formal definition. Transfer learning is the transfer of the knowledge (feature maps) that the network has acquired from one task, where we have a large amount of data, to a new task where data is not abundantly available. It is generally used where a neural network model is first trained on a problem similar to the problem that is being solved. One or more layers from the trained model are then used in a new model trained on the problem of interest.

As we discussed earlier, to train an image classifier that will achieve image classification accuracy near to or above the human level, we’ll need massive amounts of data, large compute power, and lots of time on our hands. I’m sure most of us don’t have all these things. Knowing that this would be a problem for people with little-to-no resources, researchers built state-of-the-art models that were trained on large image datasets like ImageNet, MS COCO, Open Images, and so on, and then shared their models with the general public for reuse. This means you should never have to train an image classifier from scratch again, unless you have an exceptionally large dataset and a very large computation budget to train everything from scratch by yourself. Even if that is the case, you might be better off using transfer learning to fine-tune the pretrained network on your large dataset. Later in this chapter, we will discuss the different transfer learning approaches, and you will understand what fine-tuning means and why it is better to use transfer learning even when you have a large dataset. We will also talk briefly about some of the popular datasets mentioned here.

NOTE When we talk about training a model from scratch, we mean that the model starts with zero knowledge of the world, and the model’s structure and parameters begin as random guesses. Practically speaking, this means the weights of the model are randomly initialized, and they need to go through a training process to be optimized.

The intuition behind transfer learning is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic representation of the visual world. We can then use the feature maps it has learned, without having to train on a large dataset, by transferring what it learned to our model and using that as a base starting model for our own task.

In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.

   --Jason Yosinski et al.1

Let’s jump directly to an example to get a better intuition for how to use transfer learning. Suppose we want to train a model that classifies dog and cat images, and we have only two classes in our problem: dog and cat. We need to collect hundreds of thousands of images for each class, label them, and train our network from scratch. Another option is to use transfer knowledge from another pretrained network.

First, we need to find a dataset that has similar features to our problem at hand. This involves spending some time exploring different open source datasets to find the one closest to our problem. For the sake of this example, let’s use ImageNet, since we are already familiar with it from the previous chapter and it has a lot of dog and cat images. So the pretrained network is familiar with dog and cat features and will require minimal additional training. (Later in this chapter, we will explore other datasets.) Next, we need to choose a network that has been trained on ImageNet and achieved good results. In chapter 5, we learned about state-of-the-art architectures like VGGNet, GoogLeNet, and ResNet. Any of them would work fine. For this example, we will go with a VGG16 network that has been trained on the ImageNet dataset.

To adapt the VGG16 network to our problem, we are going to download it with the pretrained weights, remove the classifier part, add our own classifier, and then retrain the new network (figure 6.2). This is called using a pretrained network as a feature extractor. We will discuss the different types of transfer learning later in this chapter.

DEFINITION A pretrained model is a network that has been previously trained on a large dataset, typically on a large-scale image classification task. We can either use the pretrained model directly as is to run our predictions, or use the pretrained feature extraction part of the network and add our own classifier. The classifier here could be one or more dense layers or even traditional ML algorithms like support vector machines (SVMs).

Figure 6.2 Example of applying transfer learning to a VGG16 network. We freeze the feature extraction part of the network and remove the classifier part. Then we add our new softmax classifier layer with two units.

To fully understand how to use transfer learning, let’s implement this example in Keras. (Luckily, Keras has a set of pretrained networks that are ready for us to download and use: the complete list of models is at https://keras.io/api/applications.) Here are the steps:

  1. Download the open source code of the VGG16 network and its weights to create our base model, and remove the classification layers from the VGG network (FC_4096 > FC_4096 > Softmax_1000):

    from keras.applications.vgg16 import VGG16                     
     
    base_model = VGG16(weights = "imagenet", include_top=False, 
                       input_shape = (224,224, 3))                 
    base_model.summary()

    Imports the VGG16 model from Keras

    Downloads the model’s pretrained weights and saves them in the variable base_model. We specify that Keras should download the ImageNet weights. include_top is false to ignore the fully connected classifier part on top of the model.

  2. When you print a summary of the base model, you will notice that we downloaded the exact VGG16 architecture that we implemented in chapter 5. This is a fast approach to download popular networks that are supported by the DL library you are using. Alternatively, you can build the network yourself, as we did in chapter 5, and download the weights separately. I’ll show you how in the project at the end of this chapter. But for now, let’s look at the base_model summary that we just downloaded:

    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 224, 224, 3)       0         
    _________________________________________________________________
    block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
    _________________________________________________________________
    block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
    _________________________________________________________________
    block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
    _________________________________________________________________
    block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
    _________________________________________________________________
    block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
    _________________________________________________________________
    block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
    _________________________________________________________________
    block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
    _________________________________________________________________
    block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
    _________________________________________________________________
    block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
    _________________________________________________________________
    block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
    _________________________________________________________________
    block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
    =================================================================
    Total params: 14,714,688
    Trainable params: 14,714,688
    Non-trainable params: 0
    _________________________________________________________________

    Notice that this downloaded architecture does not contain the classifier part (three fully connected layers) at the top of the network because we set the include_top argument to False. More importantly, notice the number of trainable and non-trainable parameters in the summary. As downloaded, all of the network’s parameters are trainable. As you can see, our base_model has more than 14 million trainable parameters. Next, we want to freeze all the downloaded layers and add our own classifier.

  3. Freeze the feature extraction layers that have been trained on the ImageNet dataset. Freezing layers means freezing their trained weights to prevent them from being retrained when we run our training:

    for layer in base_model.layers:        
        layer.trainable = False
     
    base_model.summary()

    Iterates through layers and locks them to make them non-trainable with this code

    The model summary is omitted in this case for brevity, as it is similar to the previous one. The difference is that all the weights have been frozen, the trainable parameters are now equal to zero, and all the parameters of the frozen layers are non-trainable:

    Total params: 14,714,688
    Trainable params: 0
    Non-trainable params: 14,714,688
  4. Add our own classification dense layer. Here, we will add a softmax layer with two units because we have only two classes in our problem (see figure 6.3):

    from keras.layers import Dense, Flatten                       
    from keras.models import Model
     
    last_layer = base_model.get_layer('block5_pool')             
    last_output = last_layer.output                              
     
    x = Flatten(name='flatten_layer')(last_output)              
     
    x = Dense(2, activation='softmax', name='softmax')(x)       

    Imports Keras modules

    Uses the get_layer method to save the last layer of the network

    Saves the output of the last layer to be the input of the next layer

    Flattens the classifier input, which is the output of the last layer of the VGG16 model

    Adds our new softmax layer with two units

    Figure 6.3 Remove the classifier part of the network, and add a softmax layer with two nodes.

  5. Build a new_model that takes the input of the base model as its input and the output of the last softmax layer as its output. The new model is composed of all the feature extraction layers in VGGNet with the pretrained weights, plus our new, untrained, softmax layer. In other words, when we train the model, we are only going to train the softmax layer in this example to detect the specific features of our new problem (dog or cat):

    new_model = Model(inputs=base_model.input, outputs=x)      
     
    new_model.summary()                                        
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 224, 224, 3)       0         
    _________________________________________________________________
    block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
    _________________________________________________________________
    block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
    _________________________________________________________________
    block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
    _________________________________________________________________
    block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
    _________________________________________________________________
    block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
    _________________________________________________________________
    block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
    _________________________________________________________________
    block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
    _________________________________________________________________
    block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
    _________________________________________________________________
    block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
    _________________________________________________________________
    block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
    _________________________________________________________________
    block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
    _________________________________________________________________
    flatten_layer (Flatten)      (None, 25088)             0         
    _________________________________________________________________
    softmax (Dense)              (None, 2)                 50178     
    =================================================================
    Total params: 14,789,955
    Trainable params: 50,178
    Non-trainable params: 14,714,688
    _________________________________________________________________

    Instantiates a new_model using Keras’s Model class

    Prints the new_model summary

Training the new model is a lot faster than training the network from scratch. To verify that, look at the number of trainable params in this model (~50,000) compared to the number of non-trainable params in the network (~14 million). These “non-trainable” parameters are already trained on a large dataset, and we froze them to use the extracted features in our problem. With this new model, we don’t have to train the entire VGGNet from scratch because we only have to deal with the newly added softmax layer.
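To make this concrete, here is a minimal sketch of what compiling and training this model might look like. The directory path, batch size, and number of epochs are hypothetical placeholders, and the full end-to-end walkthrough comes in the projects at the end of this chapter:

    from keras.preprocessing.image import ImageDataGenerator
    from keras.applications.vgg16 import preprocess_input
     
    # Hypothetical folder with one subdirectory per class (dog/, cat/)
    train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
    train_generator = train_datagen.flow_from_directory('data/train',
                                                        target_size=(224, 224),
                                                        batch_size=32,
                                                        class_mode='categorical')
     
    # Only the ~50,000 parameters of the new softmax layer get updated;
    # the frozen VGG16 layers simply extract features
    new_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])
    new_model.fit_generator(train_generator,
                            steps_per_epoch=len(train_generator),
                            epochs=5)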

Additionally, we often get much better performance with transfer learning because the new model has effectively been trained on millions of images (the ImageNet dataset plus our small dataset). This allows the network to pick up on finer object details, which in turn helps it generalize better to new, previously unseen images.

Note that in this example, we only explored the part where we build the model, to show how transfer learning is used. At the end of this chapter, I’ll walk you through two end-to-end projects to demonstrate how to train the new network on your small dataset. But now, let’s see how transfer learning works.

6.3 How transfer learning works

So far, we learned what the transfer learning technique is and the main problems it solves. We also saw an example of how to take a pretrained network that was trained on ImageNet and transfer its learnings to our specific task. Now, let’s see why transfer learning works, what is really being transferred from one problem to another, and how a network that is trained on one dataset can perform well on a different, possibly unrelated, dataset.

The following quick questions are reminders from previous chapters to get us to the core of what is happening in transfer learning:

  1. What is really being learned by the network during training? The short answer is: feature maps.

  2. How are these features learned? During the backpropagation process, the weights are updated until we get to the optimized weights that minimize the error function.

  3. What is the relationship between features and weights? A feature map is the result of sliding a weight filter over the input image during the convolution process (figure 6.4).

    Figure 6.4 Example of generating a feature map by applying a convolutional kernel to the input image

  4. What is really being transferred from one network to another? To transfer features, we download the optimized weights of the pretrained network. These weights are then reused as the starting point for the training process and retrained to adapt to the new problem.

Okay, let’s dive into the details to understand what we mean when we say pretrained network. When we’re training a convolutional neural network, the network extracts features from an image in the form of feature maps: outputs of each layer in a neural network after applying the weight filters. They are representations of the features that exist in the training set. They are called feature maps because they map where a certain kind of feature is found in the image. CNNs look for features such as straight lines, edges, and even objects. Whenever they spot these features, they report them to the feature map. Each weight filter is looking for something different that is reflected in the feature maps: one filter could be looking for straight lines, another for curves, and so on (figure 6.5).

Figure 6.5 The network extracts features from an image in the form of feature maps. They are representations of the features that exist in the training set after applying the weight filters.
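If you want to inspect these feature maps yourself, you can build a truncated Keras model whose output is the activation of any intermediate layer. The following sketch does this for the first convolutional layer of VGG16; the image path is a hypothetical placeholder:

    from keras.applications.vgg16 import VGG16, preprocess_input
    from keras.preprocessing.image import load_img, img_to_array
    from keras.models import Model
    import numpy as np
     
    # Reuse the convolutional base of VGG16 (no classifier on top)
    base_model = VGG16(weights='imagenet', include_top=False,
                       input_shape=(224, 224, 3))
     
    # Truncate the network at an early layer to expose its feature maps
    feature_extractor = Model(inputs=base_model.input,
                              outputs=base_model.get_layer('block1_conv1').output)
     
    # 'dog.jpg' is a hypothetical sample image path
    image = img_to_array(load_img('dog.jpg', target_size=(224, 224)))
    image = preprocess_input(np.expand_dims(image, axis=0))
     
    feature_maps = feature_extractor.predict(image)
    print(feature_maps.shape)    # (1, 224, 224, 64): one 224x224 map per filter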

Now, recall that neural networks iteratively update their weights during the training cycle of feedforward and backpropagation. We say the network has been trained when we go through a series of training iterations and hyperparameter tuning until the network yields satisfactory results. When training is complete, we output two main items: the network architecture and the trained weights. So, when we say that we are going to use a pretrained network, we mean that we will download the network architecture together with the weights.
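In Keras, those two items map directly onto model methods. Here is a minimal sketch of saving and restoring them, assuming model is a trained Keras model (for example, the new_model we built earlier); the file names are placeholders:

    from keras.models import model_from_json
     
    # Save the two artifacts of training: the architecture and the learned weights
    with open('model_architecture.json', 'w') as f:
        f.write(model.to_json())                # architecture as a JSON string
    model.save_weights('model_weights.h5')      # learned weights in HDF5 format
     
    # Later (or on another machine), rebuild the network and reload the weights
    with open('model_architecture.json') as f:
        restored_model = model_from_json(f.read())
    restored_model.load_weights('model_weights.h5')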

During training, the model learns only the features that exist in this training dataset. But when we download large models (like Inception) that have been trained on huge datasets (like ImageNet), all the features that have already been extracted from these large datasets are available for us to use. I find that really exciting because these pretrained models have spotted features that may not exist in our own dataset and will help us build better convolutional networks.

In vision problems, there’s a huge amount of stuff for neural networks to learn about the training dataset. There are low-level features like edges, corners, round shapes, curvy shapes, and blobs; and then there are mid- and higher-level features like eyes, circles, squares, and wheels. There are many details in the images that CNNs can pick up on--but if we have only 1,000 images or even 25,000 images in our training dataset, this may not be enough data for the model to learn all those things. By using a pretrained network, we can basically download all this knowledge into our neural network to give it a huge and much faster start with even higher performance levels.

6.3.1 How do neural networks learn features?

A neural network learns the features in a dataset step by step in increasing levels of complexity, one layer after another. These are called feature maps. The deeper you go through the network layers, the more image-specific features are learned. In figure 6.6, the first layer detects low-level features such as edges and curves. The output of the first layer becomes input to the second layer, which produces higher-level features like semicircles and squares. The next layer assembles the output of the previous layer into parts of familiar objects, and a subsequent layer detects the objects. As we go through more layers, the network yields an activation map that represents more complex features. As we go deeper into the network, the filters begin to be more responsive to a larger region of the pixel space. Higher-level layers amplify aspects of the received inputs that are important for discrimination and suppress irrelevant variations.

Figure 6.6 An example of how CNNs detect low-level generic features at the early layers of the network. The deeper you go through the network layers, the more image-specific features are learned.

Consider the example in figure 6.6. Suppose we are building a model that detects human faces. We notice that the network learns low-level features like lines, edges, and blobs in the first layer. These low-level features appear not to be specific to a particular dataset or task; they are general features that are applicable to many datasets and tasks. The mid-level layers assemble those lines to be able to recognize shapes, corners, and circles. Notice that the extracted features start to get a little more specific to our task (human faces): mid-level features contain combinations of shapes that form objects in the human face like eyes and noses. As we go deeper through the network, we notice that features eventually transition from general to specific and, by the last layer of the network, form high-level features that are very specific to our task. We start seeing parts of human faces that distinguish one person from another.

Now, let’s take this example and compare the feature maps extracted from four models that are trained to classify faces, cars, elephants, and chairs (see figure 6.7). Notice that the earlier layers’ features are very similar for all the models. They represent low-level features like edges, lines, and blobs. This means models that are trained on one task capture similar low-level structure in the earlier layers of the network and can easily be reused for different problems in other domains. The deeper we go into the network, the more specific the features become, until the network overfits its training data and it becomes harder to generalize to different tasks. The lower-level features are almost always transferable from one task to another because they contain generic information like the structure and nature of how images look. Transferring information like lines, dots, curves, and small parts of objects helps the network learn the new task faster and with less data.

Figure 6.7 Feature maps extracted from four models that are trained to classify faces, cars, elephants, and chairs

6.3.2 Transferability of features extracted at later layers

The transferability of features that are extracted at later layers depends on the similarity of the original and new datasets. The idea is that all images must have shapes and edges, so the early layers are usually transferable between different domains. We can only identify differences between objects when we start extracting higher-level features: say, the nose on a face or the tires on a car. Only then can we say, “Okay, this is a person, because it has a nose. And this is a car, because it has tires.” Based on the similarity of the source and target domains, we can decide whether to transfer only the low-level features from the source domain, or the high-level features, or somewhere in between. This is motivated by the observation that the later layers of the network become progressively more specific to the details of the classes contained in the original dataset, as we are going to discuss in the next section.

DEFINITIONS The source domain is the original dataset that the pretrained network is trained on. The target domain is the new dataset that we want to train the network on.

6.4 Transfer learning approaches

There are three major transfer learning approaches: pretrained network as a classifier, pretrained network as a feature extractor, and fine-tuning. Each approach can be effective and save significant time in developing and training a deep CNN model. It may not be clear which use of a pretrained model may yield the best results on your new CV task, so some experimentation may be required. In this section, we will explain these three scenarios and give examples of how to implement them.

6.4.1 Using a pretrained network as a classifier

Using a pretrained network as a classifier doesn’t involve freezing any layers or doing extra model training. Instead, we just take a network that was trained on a similar problem and deploy it directly to our task. The pretrained model is used directly to classify new images with no changes applied to it and no extra training. All we do is download the network architecture and its pretrained weights and then run the predictions directly on our new data. In this case, we are saying that the domain of our new problem is very similar to the one that the pretrained network was trained on, and it is ready to be deployed.

In the dog and cat example earlier in this chapter, we could have used a VGG16 network that was trained on the ImageNet dataset directly to run predictions. ImageNet already contains a lot of dog images, so a significant portion of the representational power of the pretrained network may be devoted to features that are specific to differentiating between dog breeds.

Let’s see how to use a pretrained network as a classifier. In this example, we will use a VGG16 network that was pretrained on the ImageNet dataset to classify the image of the German Shepherd dog in figure 6.8.

Figure 6.8 A sample image of a German Shepherd that we will use to run predictions

The steps are as follows:

  1. Import the necessary libraries:

    from keras.preprocessing.image import load_img
    from keras.preprocessing.image import img_to_array
    from keras.applications.vgg16 import preprocess_input
    from keras.applications.vgg16 import decode_predictions
    from keras.applications.vgg16 import VGG16
  2. Download the pretrained model of VGG16 and its ImageNet weights. We set include_top to True because we want to use the entire network as a classifier:

    model = VGG16(weights = "imagenet", include_top=True, input_shape = (224,224, 3))
  3. Load and preprocess the input image:

    image = load_img('path/to/image.jpg', target_size=(224, 224))              
     
    image = img_to_array(image)                                                
     
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) 
     
    image = preprocess_input(image)                                            

    Loads an image from a file

    Converts the image pixels to a NumPy array

    Reshapes the data for the model

    Prepares the image for the VGG model

  4. Now our input image is ready for us to run predictions:

    yhat = model.predict(image)                         
     
    label = decode_predictions(yhat)                    
     
    label = label[0][0]                                 
     
    print('%s (%.2f%%)' % (label[1], label[2]*100))     

    Predicts the probability across all output classes

    Converts the probabilities to class labels

    Retrieves the most likely result with the highest probability

    Prints the classification

When you run this code, you will get the following output:

>> German_shepherd (99.72%)

You can see that the model was already trained to predict the correct dog breed with a high confidence score (99.72%). This is because the ImageNet dataset has more than 20,000 labeled dog images classified into 120 classes. Go to the book’s website to play with the code yourself: www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com. Feel free to explore the classes available in ImageNet and run this experiment on your own images.

6.4.2 Using a pretrained network as a feature extractor

This approach is similar to the dog breed example that we implemented earlier in this chapter: we take a pretrained CNN on ImageNet, freeze its feature extraction part, remove the classifier part, and add our own new, dense classifier layers. In figure 6.9, we use a pretrained VGG16 network, freeze the weights in all 13 convolutional layers, and replace the old classifier with a new one to be trained from scratch.

We usually go with this scenario when our new task is similar to the original dataset that the pretrained network was trained on. Since the ImageNet dataset has a lot of dog and cat examples, the feature maps that the network has learned contain a lot of dog and cat features that are very applicable to our new task. This means we can use the high-level features that were extracted from the ImageNet dataset in this new task.

To do that, we freeze all the layers from the pretrained network and only train the classifier part that we just added on the new dataset. This approach is called using a pretrained network as a feature extractor because we freeze the feature extractor part to transfer all the learned feature maps to our new problem. We only add a new classifier, which will be trained from scratch, on top of the pretrained model so that we can repurpose the previously learned feature maps for our dataset.

We remove the classification part of the pretrained network because it is often very specific to the original classification task and, consequently, to the set of classes on which the model was trained. For example, ImageNet has 1,000 classes, and the classifier part has been trained to map the extracted features onto those 1,000 classes. But in our new problem, let’s say cats versus dogs, we have only two classes. So, it is a lot more effective to train a new classifier from scratch to fit these two classes.

Figure 6.9 Load a pretrained VGG16 network, remove the classifier, and add your own classifier.

6.4.3 Fine-tuning

So far, we’ve seen two basic approaches of using a pretrained network in transfer learning: using a pretrained network as a classifier or as a feature extractor. We generally use these approaches when the target domain is somewhat similar to the source domain. But what if the target domain is different from the source domain? What if it is very different? Can we still use transfer learning? Yes. Transfer learning works great even when the domains are very different. We just need to extract the correct feature maps from the source domain and fine-tune them to fit the target domain.

In figure 6.10, we show the different approaches to transferring knowledge from a pretrained network. If you are downloading the entire network with no changes and just running predictions, then you are using the network as a classifier. If you are freezing the convolutional layers only, then you are using the pretrained network as a feature extractor and transferring all of its high-level feature maps to your domain. The formal definition of fine-tuning is freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model. It is called fine-tuning because when we retrain the feature extraction layers, we fine-tune the higher-order feature representations to make them more relevant for the new task dataset.

In more practical terms, if we freeze feature maps 1 and 2 in figure 6.10, the new network will take feature maps 2 as its input and will start learning from this point to adapt the features of the later layers to the new dataset. This saves the network the time that it would have spent learning feature maps 1 and 2.

Figure 6.10 The network learns features through its layers. In transfer learning, we make a decision to freeze specific layers of a pretrained network to preserve the learned features. For example, if we freeze the network at feature maps of layer 3, we preserve what it has learned in layers 1, 2, and 3.

As we discussed earlier, feature maps that are extracted early in the network are generic. The feature maps get progressively more specific as we go deeper in the network. This means feature maps 4 in figure 6.10 are very specific to the source domain. Based on the similarity of the two domains, we can decide to freeze the network at the appropriate level of feature maps:

  • If the domains are similar, we might want to freeze the network up to the last feature map level (feature maps 4, in the example).

  • If the domains are very different, we might decide to freeze the pretrained network after feature maps 1 and retrain all the remaining layers.

Between these two possibilities are a range of fine-tuning options that we can apply. We can retrain the entire network, or freeze the pretrained network at any level of feature maps 1, 2, 3, or 4 and retrain the remainder of the network. We typically decide the appropriate level of fine-tuning by trial and error. But there are guidelines that we can follow to intuitively decide on the fine-tuning level for the pretrained network. The decision is a function of two factors: the amount of data we have and the level of similarity between the source and target domains. We will explain these factors and the four possible scenarios to choose the appropriate level of fine-tuning in section 6.5.
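Mechanically, choosing a fine-tuning level in Keras comes down to deciding which layers keep trainable = False. As an illustration, the following sketch freezes everything up to VGG16's fourth block and fine-tunes the fifth block together with a new classifier; treating block5_conv1 as the cutoff is just one possible choice:

    from keras.applications.vgg16 import VGG16
    from keras.layers import Dense, Flatten
    from keras.models import Model
     
    base_model = VGG16(weights='imagenet', include_top=False,
                       input_shape=(224, 224, 3))
     
    # Freeze every layer up to block4_pool; from block5_conv1 onward stays trainable
    set_trainable = False
    for layer in base_model.layers:
        if layer.name == 'block5_conv1':
            set_trainable = True
        layer.trainable = set_trainable
     
    # Add a new classifier (two classes here, as in the dog-vs-cat example)
    x = Flatten()(base_model.output)
    x = Dense(2, activation='softmax')(x)
    new_model = Model(inputs=base_model.input, outputs=x)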

Why is fine-tuning better than training from scratch?

When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the best set of weights that optimizes our error function (as discussed in chapter 2). Since these weights start with random values, there is no guarantee that they will begin with values that are close to the desired optimal values. And if the initialized value is far from the optimal value, the optimizer will take a long time to converge. This is when fine-tuning can be very useful. The pretrained network’s weights have already been optimized on its source dataset. Thus, when we use this network in our problem, we start with the weight values that it ended with. So, the network converges much faster than if it had to randomly initialize the weights. We are basically fine-tuning the already-optimized weights to fit our new problem instead of training the entire network from scratch with random weights. Even if we decide to retrain the entire pretrained network, starting from the trained weights will make it converge faster than training the network from scratch with randomly initialized weights.

Using a smaller learning rate when fine-tuning

It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights for the new linear classifier that computes the class scores of a new dataset. This is because we expect that the ConvNet weights are relatively good, so we don’t want to distort them too quickly and too much (especially while the new classifier above them is being trained from random initialization).
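For example, here is a sketch of compiling a model for fine-tuning with a reduced learning rate; the value 1e-5 is only a starting point you would tune, and new_model is assumed to be a fine-tuned network like the one sketched above:

    from keras.optimizers import Adam
     
    # A learning rate one or two orders of magnitude below the usual default (0.001)
    # nudges the pretrained weights gently instead of distorting them
    new_model.compile(optimizer=Adam(lr=1e-5),
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])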

6.5 Choosing the appropriate level of transfer learning

Recall that early convolutional layers extract generic features and become more specific to the training data the deeper we go through the network. With that said, we can choose the level of detail for feature extraction from an existing pretrained model. For example, if a new task is quite different from the source domain of the pretrained network (for example, different from ImageNet), then perhaps the output of the pretrained model after the first few layers would be appropriate. If a new task is similar to the source domain, then perhaps the output from layers much deeper in the model can be used, or even the output of the fully connected layer prior to the softmax layer.

As mentioned earlier, choosing the appropriate level for transfer learning is a function of two important factors:

  • Size of the target dataset (small or large) --When we have a small dataset, the network probably won’t learn much from training more layers, so it will tend to overfit the new data. In this case, we most likely want to do less fine-tuning and rely more on the source dataset.

  • Domain similarity of the source and target datasets --How similar is our new problem to the domain of the original dataset? For example, if your problem is to classify cars and boats, ImageNet could be a good option because it contains a lot of images of similar features. On the other hand, if your problem is to classify lung cancer on X-ray images, this is a completely different domain that will likely require a lot of fine-tuning.

These two factors lead to the four major scenarios:

  1. The target dataset is small and similar to the source dataset.

  2. The target dataset is large and similar to the source dataset.

  3. The target dataset is small and very different from the source dataset.

  4. The target dataset is large and very different from the source dataset.

Let’s discuss these scenarios one by one to learn the common rules of thumb for navigating our options.

6.5.1 Scenario 1: Target dataset is small and similar to the source dataset

Since the original dataset is similar to our new dataset, we can expect that the higher-level features in the pretrained ConvNet are relevant to our dataset as well. Then it might be best to freeze the feature extraction part of the network and only retrain the classifier.

Another reason it might not be a good idea to fine-tune the network is that our new dataset is small. If we fine-tune the feature extraction layers on a small dataset, that will force the network to overfit to our data. This is not good because, by definition, a small dataset doesn’t have enough information to cover all possible features of its objects, which makes it fail to generalize to new, previously unseen, data. So in this case, the more fine-tuning we do, the more the network is prone to overfit the new data.

For example, suppose all the images in our new dataset contain dogs in a specific weather environment--snow, for example. If we fine-tuned on this dataset, we would force the new network to pick up features like snow and a white background as dog-specific features and make it fail to classify dogs in other weather conditions. Thus the general rule of thumb is: if you have a small amount of data, be careful of overfitting when you fine-tune your pretrained network.

6.5.2 Scenario 2: Target dataset is large and similar to the source dataset

Since both domains are similar, we can freeze the feature extraction part and retrain the classifier, similar to what we did in scenario 1. But since we have more data in the new domain, we can get a performance boost from fine-tuning through all or part of the pretrained network with more confidence that we won’t overfit. Fine-tuning through the entire network is not really needed because the higher-level features are related (since the datasets are similar). So a good start is to freeze approximately 60-80% of the pretrained network and retrain the rest on the new data.
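One simple way to express "freeze roughly 60-80% of the network" in Keras is to freeze layers by index. The 0.7 fraction below is only an illustrative starting point (for scenario 3, you would lower it toward a third or a half), and base_model is assumed to be a pretrained convolutional base such as VGG16 with include_top=False:

    # Freeze roughly the first 70% of the layers and fine-tune the rest
    freeze_until = int(len(base_model.layers) * 0.7)
     
    for layer in base_model.layers[:freeze_until]:
        layer.trainable = False     # keep the generic early feature maps
    for layer in base_model.layers[freeze_until:]:
        layer.trainable = True      # fine-tune the more task-specific layers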

6.5.3 Scenario 3: Target dataset is small and different from the source dataset

Since the dataset is different, it might not be best to freeze the higher-level features of the pretrained network, because they contain more dataset-specific features. Instead, it would work better to retrain layers from somewhere earlier in the network--or to not freeze any layers and fine-tune the entire network. However, since you have a small dataset, fine-tuning the entire network on the dataset might not be a good idea, because doing so will make it prone to overfitting. A midway solution will work better in this case. A good start is to freeze approximately the first third or half of the pretrained network. After all, the early layers contain very generic feature maps that will be useful for your dataset even if it is very different.

6.5.4 Scenario 4: Target dataset is large and different from the source dataset

Since the new dataset is large, you might be tempted to just train the entire network from scratch and not use transfer learning at all. However, in practice, it is often still very beneficial to initialize weights from a pretrained model, as we discussed earlier. Doing so makes the model converge faster. In this case, we have a large dataset that provides us with the confidence to fine-tune through the entire network without having to worry about overfitting.

6.5.5 Recap of the transfer learning scenarios

We’ve explored the two main factors that help us define which transfer learning approach to use (size of our data and similarity between the source and target datasets). These two factors give us the four major scenarios defined in table 6.1. Figure 6.11 summarizes the guidelines for the appropriate fine-tuning level to use in each of the scenarios.

Figure 6.11 Guidelines for the appropriate fine-tuning level to use in each of the four scenarios

Table 6.1 Transfer learning scenarios

Scenario   Size of the target data   Similarity of the original and new datasets   Approach
--------   -----------------------   -------------------------------------------   -------------------------------------------------
1          Small                     Similar                                        Pretrained network as a feature extractor
2          Large                     Similar                                        Fine-tune through the full network
3          Small                     Very different                                 Fine-tune from activations earlier in the network
4          Large                     Very different                                 Fine-tune through the entire network

6.6 Open source datasets

The CV research community has been pretty good about posting datasets on the internet. So, when you hear names like ImageNet, MS COCO, Open Images, MNIST, CIFAR, and many others, these are datasets that people have posted online and that a lot of computer vision researchers have used as benchmarks to train their algorithms and get state-of-the-art results.

In this section, we will review some of the popular open source datasets to help guide you in your search to find the most suitable dataset for your problem. Keep in mind that the ones listed in this chapter are the most popular datasets used in the CV research community at the time of writing; we do not intend to provide a comprehensive list of all the open source datasets out there. A great many image datasets are available, and the number is growing every day. Before starting your project, I encourage you to do your own research to explore the available datasets.

6.6.1 MNIST

MNIST (http://yann.lecun.com/exdb/mnist) stands for Modified National Institute of Standards and Technology. It contains labeled handwritten images of digits from 0 to 9. The goal of this dataset is to classify handwritten digits. MNIST has been popular with the research community for benchmarking classification algorithms. In fact, it is considered the “hello, world!” of image datasets. But nowadays, the MNIST dataset is comparatively pretty simple, and a basic CNN can achieve more than 99% accuracy, so MNIST is no longer considered a benchmark for CNN performance. We implemented a CNN classification project using the MNIST dataset in chapter 3; feel free to go back and review it.

Figure 6.12 Samples from the MNIST dataset

MNIST consists of 60,000 training images and 10,000 test images. All are grayscale (one-channel), and each image is 28 pixels high and 28 pixels wide. Figure 6.12 shows some sample images from the MNIST dataset.

6.6.2 Fashion-MNIST

Fashion-MNIST was created with the intention of replacing the original MNIST dataset, which has become too simple for modern convolutional networks. The data is stored in the same format as MNIST, but instead of handwritten digits, it contains 60,000 training images and 10,000 test images of 10 fashion clothing classes: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Visit https://github.com/zalandoresearch/fashion-mnist to explore and download the dataset. Figure 6.13 shows a sample of the represented classes.

Figure 6.13 Sample images from the Fashion-MNIST dataset
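Like MNIST, Fashion-MNIST ships with the Keras datasets module, so loading it takes a single call:

    from keras.datasets import fashion_mnist
     
    # 60,000 training and 10,000 test grayscale images of shape 28 x 28,
    # with integer labels 0-9 for the ten clothing classes
    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    print(x_train.shape, y_train.shape)    # (60000, 28, 28) (60000,)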

6.6.3 CIFAR

CIFAR-10 (www.cs.toronto.edu/~kriz/cifar.html) is considered another benchmark dataset for image classification in the CV and ML literature. CIFAR images are more complex than those in MNIST in the sense that MNIST images are all grayscale with perfectly centered objects, whereas CIFAR images are color (three channels) with dramatic variation in how the objects appear. The CIFAR-10 dataset consists of 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. Figure 6.14 shows the classes in the dataset.

Figure 6.14 Sample images from the CIFAR-10 dataset

CIFAR-100 is the bigger brother of CIFAR-10: it contains 100 classes with 600 images each. These 100 classes are grouped into 20 superclasses. Each image comes with a fine label (the class to which it belongs) and a coarse label (the superclass to which it belongs).

6.6.4 ImageNet

We’ve discussed the ImageNet dataset several times in the previous chapters and used it extensively in chapter 5 and this chapter. But for completeness of this list, we are discussing it here as well. At the time of writing, ImageNet is considered the current benchmark and is widely used by CV researchers to evaluate their classification algorithms.

ImageNet is a large visual database designed for use in visual object recognition software research. It is aimed at labeling and categorizing images into almost 22,000 categories based on a defined set of words and phrases. The images were collected from the web and labeled by humans via Amazon’s Mechanical Turk crowdsourcing tool. At the time of this writing, there are over 14 million images in the ImageNet project. To organize such a massive amount of data, the creators of ImageNet followed the WordNet hierarchy: each meaningful word/phrase in WordNet is called a synonym set (synset for short). Within the ImageNet project, images are organized according to these synsets, with the goal being to have 1,000+ images per synset. Figure 6.15 shows a collage of ImageNet examples put together by Stanford University.

Figure 6.15 A collage of ImageNet examples compiled by Stanford University

The CV community usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) when talking about ImageNet. In this challenge, software programs compete to correctly classify and detect objects and scenes. We will be using the ILSVRC challenge as a benchmark to compare the different networks’ performance.

6.6.5 MS COCO

MS COCO (http://cocodataset.org) is short for Microsoft Common Objects in Context. It is an open source database that aims to enable future research for object detection, instance segmentation, image captioning, and localizing person keypoints. It contains 328,000 images. More than 200,000 of them are labeled, and they include 1.5 million object instances and 80 object categories that would be easily recognizable by a 4-year-old. The original research paper by the creators of the dataset describes the motivation for and content of this dataset.2 Figure 6.16 shows a sample of the dataset provided on the MS COCO website.

Figure 6.16 A sample of the MS COCO dataset (Image copyright © 2015, COCO Consortium, used by permission under Creative Commons Attribution 4.0 License.)

6.6.6 Google Open Images

Open Images (https://storage.googleapis.com/openimages/web/index.html) is an open source image database created by Google. It contains more than 9 million images as of this writing. What makes it stand out is that these images are mostly of complex scenes that span thousands of classes of objects. Additionally, more than 2 million of these images are hand-annotated with bounding boxes, making Open Images by far the largest existing dataset with object-location annotations (see figure 6.17). In this subset of images, there are ~15.4 million bounding boxes of 600 classes of objects. Similar to ImageNet and ILSVRC, Open Images has a challenge called the Open Images Challenge (http://mng.bz/aRQz).

6.6.7 Kaggle

In addition to the datasets listed in this section, Kaggle (www.kaggle.com) is another great source for datasets. Kaggle is a website that hosts ML and DL challenges where people from all around the world can participate and submit algorithms for evaluation.

You are strongly encouraged to explore these datasets and search for the many other open source datasets that come up every day, to gain a better understanding of the classes and use cases they support. We mostly use ImageNet in this chapter’s projects; and throughout the book, we will be using MS COCO, especially in chapter 7.

6.7 Project 1: A pretrained network as a feature extractor

In this project, we use a very small amount of data to train a classifier that detects images of dogs and cats. This is a pretty simple project, but the goal of the exercise is to see how to implement transfer learning when you have a very small amount of data and the target domain is similar to the source domain (scenario 1). As explained in this chapter, in this case, we will use the pretrained convolutional network as a feature extractor. This means we are going to freeze the feature extractor part of the network, add our own classifier, and then retrain the network on our new small dataset.

One other important takeaway from this project is learning how to preprocess custom data and make it ready to train your neural network. In previous projects, we used the CIFAR and MNIST datasets: they are preprocessed by Keras, so all we had to do was download them from the Keras library and use them directly to train the network. This project provides a tutorial of how to structure your data repository and use the Keras library to get your data ready.

Visit the book’s website at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com to download the code notebook and the dataset used for this project. Since we are using transfer learning, the training does not require high computation power, so you can run this notebook on your personal computer; you don’t need a GPU.

For this implementation, we’ll use the VGG16 network. Although it didn’t record the lowest error rate in the ILSVRC, I found that it worked well for this task and was quicker to train than other models. I got an accuracy of about 96%, but feel free to experiment with GoogLeNet or ResNet and compare results; a sketch of how to swap in a different backbone follows.
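
If you want to try a different backbone, Keras exposes its other pretrained models through the same interface. The following is a minimal sketch of how you could swap in ResNet50 in step 3 below; it is an illustration rather than part of the book’s notebook, and the rest of the steps would stay the same:

    from keras.applications import resnet50

    # ResNet50 pretrained on ImageNet, without its classifier head
    base_model = resnet50.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))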

The process to use a pretrained model as a feature extractor is well established:

  1. Import the necessary libraries.

  2. Preprocess the data to make it ready for the neural network.

  3. Load pretrained weights from the VGG16 network trained on a large dataset.

  4. Freeze all the weights in the convolutional layers (feature extraction part). Remember, the layers to freeze are adjusted depending on the similarity of the new task to the original dataset. In our case, ImageNet contains many dog and cat images, so the network has already learned to extract detailed features of our target objects.

  5. Replace the fully connected layers of the network with a custom classifier. You can add as many fully connected layers as you see fit, each with as many hidden units as you want. For a simple problem like this, we will just add one hidden layer with 64 units. You can observe the results and increase the classifier’s capacity if the model is underfitting or decrease it if the model is overfitting. For the softmax layer, the number of units must be set equal to the number of classes (two units, in our case).

  6. Compile the network, and run the training process on the new data of cats and dogs to optimize the model for the smaller dataset.

  7. Evaluate the model.

Now, let’s go through these steps and implement this project:

  1. Import the necessary libraries:

    from keras.preprocessing.image import ImageDataGenerator
    from keras.preprocessing import image
    from keras.applications import imagenet_utils
    from keras.applications import vgg16
    from keras.applications import mobilenet
    from keras.optimizers import Adam, SGD
    from keras.metrics import categorical_crossentropy
    from keras.layers import Dense, Flatten, Dropout, BatchNormalization
    from keras.models import Model
    from sklearn.metrics import confusion_matrix
    import itertools
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Preprocess the data to make it ready for the neural network. Keras has an ImageDataGenerator class that allows us to easily perform image augmentation on the fly; you can read about it at https://keras.io/api/preprocessing/image. In this example, we use ImageDataGenerator to generate our image tensors, but for simplicity, we will not implement image augmentation.

    The ImageDataGenerator class has a method called flow_from_directory() that reads images from folders on disk. This method expects your data directory to be structured as in figure 6.18.

    Figure 6.18 The required directory structure for your dataset to use the .flow_from_directory() method from Keras
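
    Concretely, flow_from_directory() expects one subfolder per class under each of the train, valid, and test folders; the folder names become the class labels. Roughly, the layout looks like this (the cat and dog folder names are an assumption about how the downloaded data is organized):

    data/
    ├── train/
    │   ├── cat/
    │   └── dog/
    ├── valid/
    │   ├── cat/
    │   └── dog/
    └── test/
        ├── cat/
        └── dog/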

    I have the data structured in the book’s code so it’s ready for you to use flow_from_directory(). Now, load the data into train_path, valid_path, and test_path variables, and then generate the train, valid, and test batches:

    train_path  = 'data/train'
    valid_path  = 'data/valid'
    test_path  = 'data/test'
     
    train_batches = ImageDataGenerator().flow_from_directory(train_path,          
                                                             target_size=(224,224),
                                                             batch_size=10)
     
    valid_batches = ImageDataGenerator().flow_from_directory(valid_path,
                                                             target_size=(224,224),
                                                             batch_size=30)
     
    test_batches = ImageDataGenerator().flow_from_directory(test_path, 
                                                            target_size=(224,224),
                                                            batch_size=50,
                                                            shuffle=False)

    ImageDataGenerator generates batches of tensor image data with real-time data augmentation. The data will be looped over (in batches). In this example, we won’t be doing any image augmentation.
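
    If you do want on-the-fly augmentation, you can pass transformation arguments to ImageDataGenerator. Here is a minimal sketch; the variable name and the specific augmentation values are illustrative, not part of this project’s code:

    augmented_train_batches = ImageDataGenerator(
        rotation_range=20,          # random rotations of up to 20 degrees
        width_shift_range=0.1,      # random horizontal shifts
        height_shift_range=0.1,     # random vertical shifts
        horizontal_flip=True        # random left-right flips
    ).flow_from_directory(train_path, target_size=(224,224), batch_size=10)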

  3. Load in pretrained weights from the VGG16 network trained on a large dataset. Similar to the examples in this chapter, we download the VGG16 network from Keras and download its weights after they are pretrained on the ImageNet dataset. Remember that we want to remove the classifier part from this network, so we set the parameter include_top=False:

    base_model = vgg16.VGG16(weights = "imagenet", include_top=False, 
                             input_shape = (224,224, 3))
  4. Freeze all the weights in the convolutional layers (feature extraction part). We freeze the convolutional layers from the base_model created in the previous step and use that as a feature extractor, and then add a classifier on top of it in the next step:

    for layer in base_model.layers:        
        layer.trainable = False

    Iterates through the layers and locks them to make them non-trainable

  5. Add the new classifier, and build the new model. We add a few layers on top of the base model. In this example, we add one fully connected layer with 64 hidden units and a softmax with 2 hidden units. We also add batch norm and dropout layers to avoid overfitting:

    last_layer = base_model.get_layer('block5_pool')         
    last_output = last_layer.output
     
    x = Flatten()(last_output)                               
     
    x = Dense(64, activation='relu', name='FC_2')(x)         
    x = BatchNormalization()(x)                              
    x = Dropout(0.5)(x)                                      
    x = Dense(2, activation='softmax', name='softmax')(x)    
     
    new_model = Model(inputs=base_model.input, outputs=x)    
    new_model.summary()
     
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 224, 224, 3)       0         
    _________________________________________________________________
    block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
    _________________________________________________________________
    block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
    _________________________________________________________________
    block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
    _________________________________________________________________
    block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
    _________________________________________________________________
    block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
    _________________________________________________________________
    block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
    _________________________________________________________________
    block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
    _________________________________________________________________
    block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
    _________________________________________________________________
    block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
    _________________________________________________________________
    block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
    _________________________________________________________________
    block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
    _________________________________________________________________
    flatten_1 (Flatten)          (None, 25088)             0         
    _________________________________________________________________
    FC_2 (Dense)                 (None, 64)                1605696   
    _________________________________________________________________
    batch_normalization_1 (Batch (None, 64)                256       
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 64)                0         
    _________________________________________________________________
    softmax (Dense)              (None, 2)                 130       
    =================================================================
    Total params: 16,320,770
    Trainable params: 1,605,954
    Non-trainable params: 14,714,816
    _________________________________________________________________

    Uses the get_layer method to save the last layer of the network. Then saves the output of the last layer to be the input of the next layer.

    Flattens the classifier input, which is output of the last layer of the VGG16 model

    Adds one fully connected layer that has 64 units and batchnorm, dropout, and softmax layers

    Instantiates a new_model using Keras’s Model class

  6. Compile the model and run the training process:

    new_model.compile(Adam(lr=0.0001), loss='categorical_crossentropy', 
                      metrics=['accuracy'])
     
    new_model.fit_generator(train_batches, steps_per_epoch=4,
                            validation_data=valid_batches, validation_steps=2,
                            epochs=20, verbose=2)

    When you run the previous code snippet, the training progress is printed after each epoch as follows:

    Epoch 1/20
     - 28s - loss: 1.0070 - acc: 0.6083 - val_loss: 0.5944 - val_acc: 0.6833
    Epoch 2/20
     - 25s - loss: 0.4728 - acc: 0.7754 - val_loss: 0.3313 - val_acc: 0.8605
    Epoch 3/20
     - 30s - loss: 0.1177 - acc: 0.9750 - val_loss: 0.2449 - val_acc: 0.8167
    Epoch 4/20
     - 25s - loss: 0.1640 - acc: 0.9444 - val_loss: 0.3354 - val_acc: 0.8372
    Epoch 5/20
     - 29s - loss: 0.0545 - acc: 1.0000 - val_loss: 0.2392 - val_acc: 0.8333
    Epoch 6/20
     - 25s - loss: 0.0941 - acc: 0.9505 - val_loss: 0.2019 - val_acc: 0.9070
    Epoch 7/20
     - 28s - loss: 0.0269 - acc: 1.0000 - val_loss: 0.1707 - val_acc: 0.9000
    Epoch 8/20
     - 26s - loss: 0.0349 - acc: 0.9917 - val_loss: 0.2489 - val_acc: 0.8140
    Epoch 9/20
     - 28s - loss: 0.0435 - acc: 0.9891 - val_loss: 0.1634 - val_acc: 0.9000
    Epoch 10/20
     - 26s - loss: 0.0349 - acc: 0.9833 - val_loss: 0.2375 - val_acc: 0.8140
    Epoch 11/20
     - 28s - loss: 0.0288 - acc: 1.0000 - val_loss: 0.1859 - val_acc: 0.9000
    Epoch 12/20
     - 29s - loss: 0.0234 - acc: 0.9917 - val_loss: 0.1879 - val_acc: 0.8372
    Epoch 13/20
     - 32s - loss: 0.0241 - acc: 1.0000 - val_loss: 0.2513 - val_acc: 0.8500
    Epoch 14/20
     - 29s - loss: 0.0120 - acc: 1.0000 - val_loss: 0.0900 - val_acc: 0.9302
    Epoch 15/20
     - 36s - loss: 0.0189 - acc: 1.0000 - val_loss: 0.1888 - val_acc: 0.9000
    Epoch 16/20
     - 30s - loss: 0.0142 - acc: 1.0000 - val_loss: 0.1672 - val_acc: 0.8605
    Epoch 17/20
     - 29s - loss: 0.0160 - acc: 0.9917 - val_loss: 0.1752 - val_acc: 0.8667
    Epoch 18/20
     - 25s - loss: 0.0126 - acc: 1.0000 - val_loss: 0.1823 - val_acc: 0.9070
    Epoch 19/20
     - 29s - loss: 0.0165 - acc: 1.0000 - val_loss: 0.1789 - val_acc: 0.8833
    Epoch 20/20
     - 25s - loss: 0.0112 - acc: 1.0000 - val_loss: 0.1743 - val_acc: 0.8837

    Notice that the model was trained very quickly using regular CPU computing power. Each epoch took approximately 25 to 29 seconds, which means the model took less than 10 minutes to train for 20 epochs.
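
    If you want to reuse the trained classifier later without retraining, you could save it to disk; the filename below is illustrative:

    new_model.save('cats_vs_dogs_feature_extractor.h5')   # saves architecture + weights to an HDF5 file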

  7. Evaluate the model. First, let’s define the load_dataset() method that we will use to convert our dataset into tensors:

    from sklearn.datasets import load_files
    from keras.utils import np_utils
    import numpy as np
     
    def load_dataset(path):
        data = load_files(path)
        paths = np.array(data['filenames'])
        targets = np_utils.to_categorical(np.array(data['target']))
        return paths, targets
     
    test_files, test_targets = load_dataset('small_data/test')

    Then, we create test_tensors to evaluate the model on them:

    from keras.preprocessing import image  
    from keras.applications.vgg16 import preprocess_input
    from tqdm import tqdm
     
    def path_to_tensor(img_path):
        img = image.load_img(img_path, target_size=(224, 224))     
        x = image.img_to_array(img)                                
        return np.expand_dims(x, axis=0)                           
     
    def paths_to_tensor(img_paths):
        list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
        return np.vstack(list_of_tensors)
     
    test_tensors = preprocess_input(paths_to_tensor(test_files))

    Loads an RGB image as PIL.Image.Image type

    Converts the PIL.Image.Image type to a 3D tensor with shape (224, 224, 3)

    Converts the 3D tensor to a 4D tensor with shape (1, 224, 224, 3) and returns the 4D tensor

    Now we can run Keras’s evaluate() method to calculate the model accuracy:

    print('\nTesting loss: {:.4f}\nTesting accuracy: {:.4f}'.format(
        *new_model.evaluate(test_tensors, test_targets)))
     
    Testing loss: 0.1042
    Testing accuracy: 0.9579
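
    Beyond the aggregate accuracy, you can also run the model on a single image. The following is a sketch; the file path is illustrative, and the class-to-column mapping comes from train_batches.class_indices:

    from keras.preprocessing import image
    from keras.applications.vgg16 import preprocess_input
    import numpy as np

    img = image.load_img('data/test/cat/cat.1000.jpg', target_size=(224, 224))   # illustrative path
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    print(train_batches.class_indices)    # maps class-folder names to output columns
    print(new_model.predict(x))           # softmax probabilities for the two classes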

The model has achieved an accuracy of 95.79% in less than 10 minutes of training. This is very good, given our very small dataset.

6.8 Project 2: Fine-tuning

In this project, we are going to explore scenario 3, discussed earlier in this chapter, where the target dataset is small and very different from the source dataset. The goal of this project is to build a sign language classifier that distinguishes 10 classes: the sign language digits from 0 to 9. Figure 6.19 shows a sample of our dataset.

Following are the details of our dataset:

  • Number of classes = 10 (digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9)

  • Image size = 100 × 100

  • Color space = RGB

  • 1,712 images in the training set

  • 300 images in the validation set

  • 50 images in the test set

Figure 6.19 A sample from the sign language dataset

It is very noticeable how small our dataset is. If you try to train a network from scratch on this very small dataset, you will not achieve good results. On the other hand, by using transfer learning we were able to reach 98% accuracy, even though the source and target domains are very different.

NOTE Please take this evaluation with a grain of salt, because the network hasn't been thoroughly tested with a lot of data. We only have 50 test images in this dataset. Transfer learning is expected to achieve good results anyway, but I wanted to highlight this fact.

Visit the book’s website at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com to download the source code notebook and the dataset used for this project. Similar to project 1, the training does not require high computation power, so you can run this notebook on your personal computer; you don’t need a GPU.

For ease of comparison with the previous project, we will use the VGG16 network trained on the ImageNet dataset. The process to fine-tune a pretrained network is as follows:

  1. Import the necessary libraries.

  2. Preprocess the data to make it ready for the neural network.

  3. Load in pretrained weights from the VGG16 network trained on a large dataset (ImageNet).

  4. Freeze part of the feature extractor part.

  5. Add the new classifier layers.

  6. Compile the network, and run the training process to optimize the model for the smaller dataset.

  7. Evaluate the model.

Now let’s implement this project:

  1. Import the necessary libraries:

    from keras.preprocessing.image import ImageDataGenerator
    from keras.preprocessing import image
    from keras.applications import imagenet_utils
    from keras.applications import vgg16
    from keras.optimizers import Adam, SGD
    from keras.metrics import categorical_crossentropy
    from keras.layers import Dense, Flatten, Dropout, BatchNormalization
    from keras.models import Model
    from sklearn.metrics import confusion_matrix
    import itertools
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Preprocess the data to make it ready for the neural network. Similar to project 1, we use the ImageDataGenerator class from Keras and the flow_from_directory() method to preprocess our data. The data is already structured for you to directly create your tensors:

    train_path  = 'dataset/train'
    valid_path  = 'dataset/valid'
    test_path  = 'dataset/test'
     
    train_batches = ImageDataGenerator().flow_from_directory(train_path,        
                                                             target_size=(224,224),
                                                             batch_size=10)
     
    valid_batches = ImageDataGenerator().flow_from_directory(valid_path,
                                                             target_size=(224,224),
                                                             batch_size=30)
     
    test_batches = ImageDataGenerator().flow_from_directory(test_path, 
                                                            target_size=(224,224), 
                                                            batch_size=50, 
                                                            shuffle=False)
     
    Found 1712 images belonging to 10 classes.
    Found 300 images belonging to 10 classes.
    Found 50 images belonging to 10 classes.

    ImageDataGenerator generates batches of tensor image data with real-time data augmentation. The data will be looped over (in batches). In this example, we won’t be doing any image augmentation.

  3. Load in pretrained weights from the VGG16 network trained on a large dataset (ImageNet). We download the VGG16 architecture from the Keras library with ImageNet weights. Note that we use the parameter pooling='avg' here: this basically means global average pooling will be applied to the output of the last convolutional layer, and thus the output of the model will be a 2D tensor. We use this as an alternative to the Flatten layer before adding the fully connected layers:

    base_model = vgg16.VGG16(weights = "imagenet", include_top=False, 
                             input_shape = (224,224, 3), pooling='avg')
  4. Freeze some of the feature extractor part, and fine-tune the rest on our new training data. The level of fine-tuning is usually determined by trial and error. VGG16 has 13 convolutional layers: you can freeze them all or freeze only a few of them, depending on how similar your data is to the source data. In the sign language case, the target domain is very different from the source domain, so we will start by fine-tuning only the last five layers; if we don’t get satisfying results, we can fine-tune more. It turns out that after training the new model, we got 98% accuracy, so this was a good level of fine-tuning. In other cases, if you find that your network doesn’t converge, try fine-tuning more layers.

    for layer in base_model.layers[:-5]:            
        layer.trainable = False
     
    base_model.summary()
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 224, 224, 3)       0         
    _________________________________________________________________
    block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
    _________________________________________________________________
    block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
    _________________________________________________________________
    block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
    _________________________________________________________________
    block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
    _________________________________________________________________
    block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
    _________________________________________________________________
    block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
    _________________________________________________________________
    block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
    _________________________________________________________________
    block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
    _________________________________________________________________
    block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
    _________________________________________________________________
    block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
    _________________________________________________________________
    block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
    _________________________________________________________________
    global_average_pooling2d_1 ( (None, 512)               0         
    =================================================================
    Total params: 14,714,688
    Trainable params: 7,079,424
    Non-trainable params: 7,635,264
    _________________________________________________________________

    Iterates through layers and locks them, except for the last five layers

  5. Add the new classifier layers, and build the new model:

    last_output = base_model.output                                      
     
    x = Dense(10, activation='softmax', name='softmax')(last_output)     
     
    new_model = Model(inputs=base_model.input, outputs=x)                
     
    new_model.summary()                                                  
     
    Layer (type)                 Output Shape              Param #   
    =================================================================
    input_1 (InputLayer)         (None, 224, 224, 3)       0         
    _________________________________________________________________
    block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
    _________________________________________________________________
    block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
    _________________________________________________________________
    block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
    _________________________________________________________________
    block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
    _________________________________________________________________
    block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
    _________________________________________________________________
    block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
    _________________________________________________________________
    block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
    _________________________________________________________________
    block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
    _________________________________________________________________
    block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
    _________________________________________________________________
    block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
    _________________________________________________________________
    block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
    _________________________________________________________________
    block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
    _________________________________________________________________
    block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
    _________________________________________________________________
    block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
    _________________________________________________________________
    global_average_pooling2d_1 ( (None, 512)               0         
    _________________________________________________________________
    softmax (Dense)              (None, 10)                5130      
    =================================================================
    Total params: 14,719,818
    Trainable params: 7,084,554
    Non-trainable params: 7,635,264

    Saves the output of base_model to be the input of the next layer

    Adds our new softmax layer with 10 hidden units

    Instantiates a new_model using Keras’s Model class

    Prints the new_model summary

  6. Compile the network, and run the training process to optimize the model for the smaller dataset:

    new_model.compile(Adam(lr=0.0001), loss='categorical_crossentropy', 
                      metrics=['accuracy'])
     
    from keras.callbacks import ModelCheckpoint
     
    checkpointer = ModelCheckpoint(filepath='signlanguage.model.hdf5', 
                                   save_best_only=True)
     
    history = new_model.fit_generator(train_batches, steps_per_epoch=18,
                       validation_data=valid_batches, validation_steps=3, 
                       epochs=20, verbose=1, callbacks=[checkpointer])
     
    Epoch 1/150
    18/18 [==============================] - 40s 2s/step - loss: 3.2263 - acc: 0.1833 - val_loss: 2.0674 - val_acc: 0.1667
    Epoch 2/150
    18/18 [==============================] - 41s 2s/step - loss: 2.0311 - acc: 0.1833 - val_loss: 1.7330 - val_acc: 0.3000
    Epoch 3/150
    18/18 [==============================] - 42s 2s/step - loss: 1.5741 - acc: 0.4500 - val_loss: 1.5577 - val_acc: 0.4000
    Epoch 4/150
    18/18 [==============================] - 42s 2s/step - loss: 1.3068 - acc: 0.5111 - val_loss: 0.9856 - val_acc: 0.7333
    Epoch 5/150
    18/18 [==============================] - 43s 2s/step - loss: 1.1563 - acc: 0.6389 - val_loss: 0.7637 - val_acc: 0.7333
    Epoch 6/150
    18/18 [==============================] - 41s 2s/step - loss: 0.8414 - acc: 0.6722 - val_loss: 0.7550 - val_acc: 0.8000
    Epoch 7/150
    18/18 [==============================] - 41s 2s/step - loss: 0.5982 - acc: 0.8444 - val_loss: 0.7910 - val_acc: 0.6667
    Epoch 8/150
    18/18 [==============================] - 41s 2s/step - loss: 0.3804 - acc: 0.8722 - val_loss: 0.7376 - val_acc: 0.8667
    Epoch 9/150
    18/18 [==============================] - 41s 2s/step - loss: 0.5048 - acc: 0.8222 - val_loss: 0.2677 - val_acc: 0.9000
    Epoch 10/150
    18/18 [==============================] - 39s 2s/step - loss: 0.2383 - acc: 0.9276 - val_loss: 0.2844 - val_acc: 0.9000
    Epoch 11/150
    18/18 [==============================] - 41s 2s/step - loss: 0.1163 - acc: 0.9778 - val_loss: 0.0775 - val_acc: 1.0000
    Epoch 12/150
    18/18 [==============================] - 41s 2s/step - loss: 0.1377 - acc: 0.9667 - val_loss: 0.5140 - val_acc: 0.9333
    Epoch 13/150
    18/18 [==============================] - 41s 2s/step - loss: 0.0955 - acc: 0.9556 - val_loss: 0.1783 - val_acc: 0.9333
    Epoch 14/150
    18/18 [==============================] - 41s 2s/step - loss: 0.1785 - acc: 0.9611 - val_loss: 0.0704 - val_acc: 0.9333
    Epoch 15/150
    18/18 [==============================] - 41s 2s/step - loss: 0.0533 - acc: 0.9778 - val_loss: 0.4692 - val_acc: 0.8667
    Epoch 16/150
    18/18 [==============================] - 41s 2s/step - loss: 0.0809 - acc: 0.9778 - val_loss: 0.0447 - val_acc: 1.0000
    Epoch 17/150
    18/18 [==============================] - 41s 2s/step - loss: 0.0834 - acc: 0.9722 - val_loss: 0.0284 - val_acc: 1.0000
    Epoch 18/150
    18/18 [==============================] - 41s 2s/step - loss: 0.1022 - acc: 0.9611 - val_loss: 0.0177 - val_acc: 1.0000
    Epoch 19/150
    18/18 [==============================] - 41s 2s/step - loss: 0.1134 - acc: 0.9667 - val_loss: 0.0595 - val_acc: 1.0000
    Epoch 20/150
    18/18 [==============================] - 39s 2s/step - loss: 0.0676 - acc: 0.9777 - val_loss: 0.0862 - val_acc: 0.9667

    Notice the training time of each epoch from the verbose output. The model was trained very quickly using regular CPU computing power. Each epoch took approximately 40 seconds, which means it took the model less than 15 minutes to train for 20 epochs.

  7. Evaluate the accuracy of the model. Similar to the previous project, we create a load_dataset() method to create test_targets and test_tensors and then use the evaluate() method from Keras to run inferences on the test images and get the model accuracy:
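
    Here is a minimal sketch of those helpers, mirroring project 1; reusing paths_to_tensor exactly as defined there and reloading the best checkpoint saved by ModelCheckpoint are assumptions about the notebook:

    from sklearn.datasets import load_files
    from keras.utils import np_utils
    from keras.applications.vgg16 import preprocess_input
    import numpy as np

    def load_dataset(path):
        data = load_files(path)
        paths = np.array(data['filenames'])
        targets = np_utils.to_categorical(np.array(data['target']))
        return paths, targets

    test_files, test_targets = load_dataset('dataset/test')          # test folder from step 2
    test_tensors = preprocess_input(paths_to_tensor(test_files))     # paths_to_tensor as in project 1

    new_model.load_weights('signlanguage.model.hdf5')                # restore the best saved weights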

    print('\nTesting loss: {:.4f}\nTesting accuracy: {:.4f}'.format(
        *new_model.evaluate(test_tensors, test_targets)))
    
    Testing loss: 0.0574
    Testing accuracy: 0.9800

    A deeper level of evaluating your model involves creating a confusion matrix. We explained the confusion matrix in chapter 4: it is a table that is often used to describe the performance of a classification model, to provide a deeper understanding of how the model performed on the test dataset. See chapter 4 for details on the different model evaluation metrics. Now, let’s build the confusion matrix for our model (see figure 6.20):

    from sklearn.metrics import confusion_matrix
    import numpy as np
    
    cm_labels = ['0','1','2','3','4','5','6','7','8','9']
     
    cm = confusion_matrix(np.argmax(test_targets, axis=1),
                          np.argmax(new_model.predict(test_tensors), axis=1))
    plt.imshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    indexes = np.arange(len(cm_labels))
    for i in indexes:
        for j in indexes:
            plt.text(j, i, cm[i, j])
    plt.xticks(indexes, cm_labels, rotation=90)
    plt.xlabel('Predicted label')
    plt.yticks(indexes, cm_labels)
    plt.ylabel('True label')
    plt.title('Confusion matrix')
    plt.show()

    Figure 6.20 Confusion matrix for the sign language classifier

    To read this confusion matrix, look at a number on the Predicted Label axis and check whether it was correctly classified on the True Label axis. For example, look at number 0 on the Predicted Label axis: all five images were classified as 0, and no images were mistakenly classified as any other number. Similarly, go through the rest of the numbers on the Predicted Label axis. You will notice that the model made correct predictions for all the test images except the one whose true label is 8; the model mistakenly classified that image of the number 8 as a 7.
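
    For per-class precision and recall, you could also print a classification report with scikit-learn; this goes beyond the book’s notebook but uses the same test tensors and labels:

    from sklearn.metrics import classification_report
    import numpy as np

    y_true = np.argmax(test_targets, axis=1)
    y_pred = np.argmax(new_model.predict(test_tensors), axis=1)
    print(classification_report(y_true, y_pred, target_names=cm_labels))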

Summary

  • Transfer learning is usually the go-to approach when starting a classification and object detection project, especially when you don’t have a lot of training data.

  • Transfer learning migrates the knowledge learned from the source dataset to the target dataset, to save training time and computational cost.

  • The neural network learns the features in your dataset step by step, in increasing levels of complexity. The deeper you go through the network layers, the more specific to the training dataset the learned features become.

  • Early layers in the network learn low-level features like lines, blobs, and edges. The output of the first layer becomes input to the second layer, which produces higher-level features. The next layer assembles the output of the previous layer into parts of familiar objects, and a subsequent layer detects the objects.

  • The three main transfer learning approaches are using a pretrained network as a classifier, using a pretrained network as a feature extractor, and fine-tuning.

  • Using a pretrained network as a classifier means using the network directly to classify new images without freezing layers or applying model training.

  • Using a pretrained network as a feature extractor means freezing the feature extraction part of the network, replacing the original classifier, and training the new classifier.

  • Fine-tuning means freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model.

  • The transferability of features from one network to another is a function of the size of the target data and the domain similarity between the source and target data.

  • Generally, the pretrained layers being fine-tuned use a smaller learning rate, while the newly added output layers, which are trained from scratch, can use a larger learning rate.


1. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, “How Transferable Are Features in Deep Neural Networks?” Advances in Neural Information Processing Systems 27 (Dec. 2014): 3320-3328, https://arxiv.org/abs/1411.1792.

2. Tsung-Yi Lin, Michael Maire, Serge Belongie, et al., “Microsoft COCO: Common Objects in Context” (February 2015), https://arxiv.org/pdf/1405.0312.pdf.
