Transfer learning is one of the most important techniques in deep learning. When building a vision system to solve a specific problem, you usually need to collect and label a huge amount of data to train your network. You can build convnets, as you learned in chapter 3, and start the training from scratch; that is an acceptable approach. But what if you could download an existing neural network that someone else has tuned and trained, and use it as a starting point for your new task? Transfer learning allows you to do just that. You can download an open source model that someone else has already trained and tuned and use their optimized parameters (weights) as a starting point to train your model on a smaller dataset for a given task. This way, you can train your network a lot faster and often achieve better results.
DL researchers and practitioners have published many research papers and open source projects containing trained models that they worked on for weeks or months and trained on GPUs to get state-of-the-art results on an array of problems. Often, the fact that someone else has done this work and gone through the painful high-performance research process means you can download an open source architecture and weights and use them as a good start for your own neural network. This is transfer learning: the transfer of knowledge from a pretrained network in one domain to your own problem in a different domain.
In this chapter, I will explain transfer learning and outline reasons why using it is important. I will also detail different transfer learning scenarios and how to use them. Finally, we will see examples of using transfer learning to solve real-world problems. Ready? Let’s get started!
As the name implies, transfer learning means transferring what a neural network has learned from being trained on a specific dataset to another related problem (figure 6.1). Transfer learning is currently very popular in the field of DL because it enables you to train deep neural networks with comparatively little data in a short training time. The importance of transfer learning comes from the fact that in most real-world problems, we typically do not have millions of labeled images to train such complex models.
The idea is pretty straightforward. First we train a deep neural network on a very large amount of data. During the training process, the network extracts a large number of useful features that can be used to detect objects in this dataset. We then transfer these extracted features (feature maps) to a new network and train this new network on our new dataset to solve a different problem. Transfer learning is a great way to shortcut the process of collecting and training huge amounts of data simply by reusing the model weights from pretrained models that were developed for standard CV benchmark datasets, such as the ImageNet image-recognition tasks. Top-performing models can be downloaded and used directly, or integrated into a new model for your own CV problems.
The question is, why would we want to use transfer learning? Why don’t we just train a neural network directly on our new dataset to solve our problem? To answer this question, we first need to know the main problems that transfer learning solves. We’ll discuss those now; then I’ll go into the details of how transfer learning works and the different approaches to apply it.
Deep neural networks are immensely data-hungry and rely on huge amounts of labeled data to achieve high performance. In practice, very few people train an entire convolutional network from scratch. This is due to two main problems:
Data problem--Training a network from scratch requires a lot of data in order to get decent results, which is not feasible in most cases. It is relatively rare to have a dataset of sufficient size to solve your problem. It is also very expensive to acquire and label data: this is mostly a manual process done by humans capturing images and labeling them one by one, which makes it a nontrivial task.
Computation problem--Even if you are able to acquire hundreds of thousands of images for your problem, it is computationally very expensive to train a deep neural network on millions of images because doing so usually requires weeks of training on multiple GPUs. Also keep in mind that training a neural network is an iterative process. So, even if you happen to have the computing power required to train a complex neural network, spending weeks experimenting with different hyperparameters in each training iteration until you finally reach satisfactory results will make the project very costly.
Additionally, an important benefit of using transfer learning is that it helps the model generalize its learnings and avoid overfitting. When you apply a DL model in the wild, it is faced with countless conditions it may never have seen before and does not know how to deal with; each client has its own preferences and generates data that is different from the data used for training. The model is asked to perform well on many tasks that are related to but not exactly similar to the task it was trained for.
For example, when you deploy a car classifier model to production, people usually have different camera types, each with its own image quality and resolution. Also, images can be taken during different weather conditions. These image nuances vary from one user to another. To train the model on all these different cases, you either have to account for every case and acquire a lot of images to train the network on, or try to build a more robust model that is better at generalizing to new use cases. This is what transfer learning does. Since it is not realistic to account for all the cases the model may face in the wild, transfer learning can help us deal with novel scenarios. It is necessary for production-scale use of DL that goes beyond tasks and domains where labeled data is plentiful. Transferring features extracted from another network that has seen millions of images will make our model less prone to overfit and help it generalize better when faced with novel scenarios. You will be able to fully grasp this concept when we explain how transfer learning works in the following sections.
Armed with the understanding of the problems that transfer learning solves, let’s look at its formal definition. Transfer learning is the transfer of the knowledge (feature maps) that the network has acquired from one task, where we have a large amount of data, to a new task where data is not abundantly available. It is generally used where a neural network model is first trained on a problem similar to the problem that is being solved. One or more layers from the trained model are then used in a new model trained on the problem of interest.
As we discussed earlier, to train an image classifier that will achieve image classification accuracy near to or above the human level, we’ll need massive amounts of data, large compute power, and lots of time on our hands. I’m sure most of us don’t have all these things. Knowing that this would be a problem for people with little-to-no resources, researchers built state-of-the-art models that were trained on large image datasets like ImageNet, MS COCO, Open Images, and so on, and then shared their models with the general public for reuse. This means you should never have to train an image classifier from scratch again, unless you have an exceptionally large dataset and a very large computation budget to train everything from scratch by yourself. Even if that is the case, you might be better off using transfer learning to fine-tune the pretrained network on your large dataset. Later in this chapter, we will discuss the different transfer learning approaches, and you will understand what fine-tuning means and why it is better to use transfer learning even when you have a large dataset. We will also talk briefly about some of the popular datasets mentioned here.
NOTE When we talk about training a model from scratch, we mean that the model starts with zero knowledge of the world, and the model’s structure and parameters begin as random guesses. Practically speaking, this means the weights of the model are randomly initialized, and they need to go through a training process to be optimized.
The intuition behind transfer learning is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic representation of the visual world. We can then use the feature maps it has learned, without having to train on a large dataset, by transferring what it learned to our model and using that as a base starting model for our own task.
In transfer learning, we first train a base network on a base dataset and task, and then we repurpose the learned features, or transfer them to a second target network to be trained on a target dataset and task. This process will tend to work if the features are general, meaning suitable to both base and target tasks, instead of specific to the base task.
--Jason Yosinski et al.1
Let’s jump directly to an example to get a better intuition for how to use transfer learning. Suppose we want to train a model that classifies dog and cat images, and we have only two classes in our problem: dog and cat. One option is to collect hundreds of thousands of images for each class, label them, and train our network from scratch. Another option is to transfer knowledge from a pretrained network.
First, we need to find a dataset that has similar features to our problem at hand. This involves spending some time exploring different open source datasets to find the one closest to our problem. For the sake of this example, let’s use ImageNet, since we are already familiar with it from the previous chapter and it has a lot of dog and cat images. So the pretrained network is familiar with dog and cat features and will require minimal training. (Later in this chapter, we will explore other datasets.) Next, we need to choose a network that has been trained on ImageNet and achieved good results. In chapter 5, we learned about state-of-the-art architectures like VGGNet, GoogLeNet, and ResNet. Any of them would work fine. For this example, we will go with a VGG16 network that has been trained on the ImageNet dataset.
To adapt the VGG16 network to our problem, we are going to download it with the pretrained weights, remove the classifier part, add our own classifier, and then retrain the new network (figure 6.2). This is called using a pretrained network as a feature extractor. We will discuss the different types of transfer learning later in this chapter.
DEFINITION A pretrained model is a network that has been previously trained on a large dataset, typically on a large-scale image classification task. We can either use the pretrained model directly as is to run our predictions, or use the pretrained feature extraction part of the network and add our own classifier. The classifier here could be one or more dense layers or even traditional ML algorithms like support vector machines (SVMs).
To fully understand how to use transfer learning, let’s implement this example in Keras. (Luckily, Keras has a set of pretrained networks that are ready for us to download and use: the complete list of models is at https://keras.io/api/applications.) Here are the steps:
Download the open source code of the VGG16 network and its weights to create our base model, and remove the classification layers from the VGG network (FC_4096 > FC_4096 > Softmax_1000):
from keras.applications.vgg16 import VGG16  # ❶

base_model = VGG16(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))  # ❷
base_model.summary()
❶ Imports the VGG16 model from Keras
❷ Downloads the model’s pretrained weights and saves them in the variable base_model. We specify that Keras should download the ImageNet weights. include_top is false to ignore the fully connected classifier part on top of the model.
When you print a summary of the base model, you will notice that we downloaded the exact VGG16 architecture that we implemented in chapter 5. This is a fast approach to download popular networks that are supported by the DL library you are using. Alternatively, you can build the network yourself, as we did in chapter 5, and download the weights separately. I’ll show you how in the project at the end of this chapter. But for now, let’s look at the base_model summary that we just downloaded:
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
Notice that this downloaded architecture does not contain the classifier part (three fully connected layers) at the top of the network because we set the include_top argument to False. More importantly, notice the number of trainable and non-trainable parameters in the summary. The downloaded network as it is makes all the network parameters trainable. As you can see, our base_model has more than 14 million trainable parameters. Next, we want to freeze all the downloaded layers and add our own classifier.
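The parameter counts in the summary can be checked by hand. The following is a minimal sketch in plain Python, assuming 3 x 3 kernels and one bias per filter (which matches the per-layer counts printed above). The helper name conv2d_params is ours for illustration and is not part of Keras:

```python
def conv2d_params(in_channels, out_channels, kernel_size=3):
    """Weights plus biases of one convolutional layer."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels

# (input channels, output channels) for the 13 conv layers of VGG16, in order
vgg16_conv_layers = [
    (3, 64), (64, 64),                      # block1
    (64, 128), (128, 128),                  # block2
    (128, 256), (256, 256), (256, 256),     # block3
    (256, 512), (512, 512), (512, 512),     # block4
    (512, 512), (512, 512), (512, 512),     # block5
]

per_layer = [conv2d_params(c_in, c_out) for c_in, c_out in vgg16_conv_layers]
total = sum(per_layer)

print(per_layer[0])   # parameters in block1_conv1
print(total)          # parameters in the whole convolutional base
```

Pooling layers contribute no parameters, so the sum over the 13 convolutional layers accounts for the full total in the summary.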
Freeze the feature extraction layers that have been trained on the ImageNet dataset. Freezing layers means freezing their trained weights to prevent them from being retrained when we run our training:
for layer in base_model.layers:  # ❶
    layer.trainable = False

base_model.summary()
❶ Iterates through the layers and locks them to make them non-trainable
The model summary is omitted in this case for brevity, as it is similar to the previous one. The difference is that all the weights have been frozen, the trainable parameters are now equal to zero, and all the parameters of the frozen layers are non-trainable:
Total params: 14,714,688
Trainable params: 0
Non-trainable params: 14,714,688
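To see what freezing means numerically, here is a minimal NumPy sketch (not Keras code). A tiny two-layer network stands in for our model: the first layer plays the role of the frozen feature extractor, and the second is the trainable classifier. The data, layer sizes, and learning rate are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 64 samples, 8 inputs, binary labels (all hypothetical)
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# "Pretrained" feature extractor (frozen) and a new classifier head (trainable)
W_frozen = rng.normal(size=(8, 16))
W_head = np.zeros((16, 1))
W_frozen_before = W_frozen.copy()

def forward(inputs):
    features = np.maximum(inputs @ W_frozen, 0.0)        # ReLU features
    probs = 1.0 / (1.0 + np.exp(-(features @ W_head)))   # sigmoid output
    return features, probs

def loss(probs):
    eps = 1e-9
    return float(-np.mean(y * np.log(probs + eps)
                          + (1 - y) * np.log(1 - probs + eps)))

_, probs = forward(X)
loss_before = loss(probs)

for _ in range(500):
    features, probs = forward(X)
    grad_head = features.T @ (probs - y) / len(X)   # gradient for the head only
    W_head -= 0.01 * grad_head                      # frozen weights get no update

_, probs = forward(X)
loss_after = loss(probs)
print(loss_before, loss_after)
```

Only W_head changes during training; W_frozen ends exactly as it started, which is the effect that setting layer.trainable = False has on the pretrained layers.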
Add our own classification dense layer. Here, we will add a softmax layer with two units because we have only two classes in our problem (see figure 6.3):
from keras.layers import Dense, Flatten  # ❶
from keras.models import Model

last_layer = base_model.get_layer('block5_pool')  # ❷
last_output = last_layer.output  # ❸
x = Flatten()(last_output)  # ❹
x = Dense(2, activation='softmax', name='softmax')(x)  # ❺
❶ Imports the Keras layers and the Model class
❷ Uses the get_layer method to save the last layer of the network
❸ Saves the output of the last layer to be the input of the next layer
❹ Flattens the classifier input, which is the output of the last layer of the VGG16 model
❺ Adds our new softmax layer with two units
Build a new_model that takes the input of the base model as its input and the output of the last softmax layer as an output. The new model is composed of all the feature extraction layers in VGGNet with the pretrained weights, plus our new, untrained, softmax layer. In other words, when we train the model, we are only going to train the softmax layer in this example to detect the specific features of our new problem (dog versus cat):
new_model = Model(inputs=base_model.input, outputs=x)  # ❶ Instantiates new_model from the base model input to the new softmax output
new_model.summary()  # ❷ Prints the new_model summary

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten_layer (Flatten)      (None, 25088)             0
_________________________________________________________________
softmax (Dense)              (None, 2)                 50178
=================================================================
Total params: 14,764,866
Trainable params: 50,178
Non-trainable params: 14,714,688
_________________________________________________________________
Training the new model is a lot faster than training the network from scratch. To verify that, look at the number of trainable params in this model (~50,000) compared to the number of non-trainable params in the network (~14 million). These “non-trainable” parameters are already trained on a large dataset, and we froze them to use the extracted features in our problem. With this new model, we don’t have to train the entire VGGNet from scratch because we only have to deal with the newly added softmax layer.
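The size of the new classifier follows directly from the shapes in the summary. Here is a small worked calculation in plain Python, assuming the two-class softmax layer added above and the 7 x 7 x 512 output of block5_pool:

```python
# Output of block5_pool is a 7 x 7 x 512 volume, flattened into one vector
flattened_size = 7 * 7 * 512

num_classes = 2
head_params = flattened_size * num_classes + num_classes   # weights + biases

frozen_params = 14_714_688          # VGG16 convolutional base, from the summary
total_params = frozen_params + head_params
trainable_share = head_params / total_params

print(flattened_size, head_params, total_params)
print(f"{trainable_share:.2%} of the parameters are trainable")
```

Well under 1% of the parameters are updated during training, which is why each training step on the new model is so much cheaper than training the full network.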
Additionally, we typically get better performance with transfer learning because the new model builds on features learned from millions of images (the ImageNet dataset) in addition to our small dataset. This allows the network to understand the finer details of object nuances, which in turn makes it generalize better on new, previously unseen images.
Note that in this example, we only explored the part where we build the model, to show how transfer learning is used. At the end of this chapter, I’ll walk you through two end-to-end projects to demonstrate how to train the new network on your small dataset. But now, let’s see how transfer learning works.
So far, we learned what the transfer learning technique is and the main problems it solves. We also saw an example of how to take a pretrained network that was trained on ImageNet and transfer its learnings to our specific task. Now, let’s see why transfer learning works, what is really being transferred from one problem to another, and how a network that is trained on one dataset can perform well on a different, possibly unrelated, dataset.
The following quick questions are reminders from previous chapters to get us to the core of what is happening in transfer learning:
What is really being learned by the network during training? The short answer is: feature maps.
How are these features learned? During the backpropagation process, the weights are updated until we get to the optimized weights that minimize the error function.
What is the relationship between features and weights? A feature map is the result of sliding the weight filter over the input image during the convolution process (figure 6.4).
What is really being transferred from one network to another? To transfer features, we download the optimized weights of the pretrained network. These weights are then reused as the starting point for the training process and retrained to adapt to the new problem.
Okay, let’s dive into the details to understand what we mean when we say pretrained network. When we’re training a convolutional neural network, the network extracts features from an image in the form of feature maps: outputs of each layer in a neural network after applying the weights filter. They are representations of the features that exist in the training set. They are called feature maps because they map where a certain kind of feature is found in the image. CNNs look for features such as straight lines, edges, and even objects. Whenever they spot these features, they report them to the feature map. Each weight filter is looking for something different that is reflected in the feature maps: one filter could be looking for straight lines, another for curves, and so on (figure 6.5).
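The relationship between a filter and its feature map can be made concrete with a few lines of NumPy. This is a minimal sketch, assuming a single-channel image, no padding, and a stride of 1. The vertical-edge filter is written by hand here, whereas a trained network learns its filter values during training:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the filter over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

# 6 x 6 image: dark left half, bright right half (a vertical edge in the middle)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter; a trained network would learn its own
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, vertical_edge)
print(feature_map)
```

The resulting feature map is nonzero only in the columns where the filter overlaps the edge, which is the sense in which a feature map records where a certain kind of feature is found in the image.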
Now, recall that neural networks iteratively update their weights during the training cycle of feedforward and backpropagation. We say the network has been trained when we go through a series of training iterations and hyperparameter tuning until the network yields satisfactory results. When training is complete, we output two main items: the network architecture and the trained weights. So, when we say that we are going to use a pretrained network, we mean that we will download the network architecture together with the weights.
During training, the model learns only the features that exist in its training dataset. But when we download large models (like Inception) that have been trained on huge datasets (like ImageNet), all the features that have already been extracted from these large datasets are now available for us to use. I find that really exciting because these pretrained models have spotted other features that weren’t in our dataset and will help us build better convolutional networks.
In vision problems, there’s a huge amount of stuff for neural networks to learn about the training dataset. There are low-level features like edges, corners, round shapes, curvy shapes, and blobs; and then there are mid- and higher-level features like eyes, circles, squares, and wheels. There are many details in the images that CNNs can pick up on--but if we have only 1,000 images or even 25,000 images in our training dataset, this may not be enough data for the model to learn all those things. By using a pretrained network, we can basically download all this knowledge into our neural network to give it a huge and much faster start with even higher performance levels.
A neural network learns the features in a dataset step by step in increasing levels of complexity, one layer after another. These are called feature maps. The deeper you go through the network layers, the more image-specific features are learned. In figure 6.6, the first layer detects low-level features such as edges and curves. The output of the first layer becomes input to the second layer, which produces higher-level features like semicircles and squares. The next layer assembles the output of the previous layer into parts of familiar objects, and a subsequent layer detects the objects. As we go through more layers, the network yields an activation map that represents more complex features. As we go deeper into the network, the filters begin to be more responsive to a larger region of the pixel space. Higher-level layers amplify aspects of the received inputs that are important for discrimination and suppress irrelevant variations.
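The claim that deeper filters respond to a larger region of the pixel space can be quantified as the receptive field of each layer. The following is a rough sketch in plain Python for the VGG16 convolutional base, assuming 3 x 3 convolutions with stride 1 and 2 x 2 pooling with stride 2. The variable names are ours, and the numbers are theoretical receptive fields, not measurements from a trained network:

```python
# (layer name, kernel size, stride) for the VGG16 convolutional base
layers = []
for block, n_convs in enumerate([2, 2, 3, 3, 3], start=1):
    for i in range(1, n_convs + 1):
        layers.append((f"block{block}_conv{i}", 3, 1))
    layers.append((f"block{block}_pool", 2, 2))

receptive_field = 1   # a single input pixel sees only itself
jump = 1              # distance in input pixels between neighboring units
fields = {}
for name, kernel, stride in layers:
    receptive_field += (kernel - 1) * jump
    jump *= stride
    fields[name] = receptive_field

for name in ("block1_conv1", "block3_conv3", "block5_conv3", "block5_pool"):
    print(name, fields[name])
```

A unit in the first layer sees a 3 x 3 patch of pixels, while a unit in the last block sees a region that covers most of a 224 x 224 input, which is consistent with early layers detecting edges and later layers detecting object parts.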
Consider the example in figure 6.6. Suppose we are building a model that detects human faces. We notice that the network learns low-level features like lines, edges, and blobs in the first layer. These low-level features appear not to be specific to a particular dataset or task; they are general features that are applicable to many datasets and tasks. The mid-level layers assemble those lines to be able to recognize shapes, corners, and circles. Notice that the extracted features start to get a little more specific to our task (human faces): mid-level features contain combinations of shapes that form objects in the human face like eyes and noses. As we go deeper through the network, we notice that features eventually transition from general to specific and, by the last layer of the network, form high-level features that are very specific to our task. We start seeing parts of human faces that distinguish one person from another.
Now, let’s take this example and compare the feature maps extracted from four models that are trained to classify faces, cars, elephants, and chairs (see figure 6.7). Notice that the earlier layers’ features are very similar for all the models. They represent low-level features like edges, lines, and blobs. This means models that are trained on one task capture similar relations in the data in their earlier layers, and these layers can easily be reused for different problems in other domains. The deeper we go into the network, the more specific the features become, until they are so closely fitted to the training data that they are harder to generalize to different tasks. The lower-level features are almost always transferable from one task to another because they contain generic information like the structure and nature of how images look. Transferring information like lines, dots, curves, and small parts of objects is very valuable for the network to learn faster and with less data on the new task.
The transferability of features that are extracted at later layers depends on the similarity of the original and new datasets. The idea is that all images must have shapes and edges, so the early layers are usually transferable between different domains. We can only identify differences between objects when we start extracting higher-level features: say, the nose on a face or the tires on a car. Only then can we say, “Okay, this is a person, because it has a nose. And this is a car, because it has tires.” Based on the similarity of the source and target domains, we can decide whether to transfer only the low-level features from the source domain, or the high-level features, or somewhere in between. This is motivated by the observation that the later layers of the network become progressively more specific to the details of the classes contained in the original dataset, as we are going to discuss in the next section.
DEFINITIONS The source domain is the original dataset that the pretrained network is trained on. The target domain is the new dataset that we want to train the network on.
There are three major transfer learning approaches: pretrained network as a classifier, pretrained network as a feature extractor, and fine-tuning. Each approach can be effective and save significant time in developing and training a deep CNN model. It may not be clear which use of a pretrained model may yield the best results on your new CV task, so some experimentation may be required. In this section, we will explain these three scenarios and give examples of how to implement them.
Using a pretrained network as a classifier doesn’t involve freezing any layers or doing extra model training. Instead, we just take a network that was trained on a similar problem and deploy it directly to our task. The pretrained model is used directly to classify new images with no changes applied to it and no extra training. All we do is download the network architecture and its pretrained weights and then run the predictions directly on our new data. In this case, we are saying that the domain of our new problem is very similar to the one that the pretrained network was trained on, and it is ready to be deployed.
In a dog breed classification task, for example, we could use a VGG16 network that was trained on the ImageNet dataset directly to run predictions. ImageNet already contains a lot of dog images, so a significant portion of the representational power of the pretrained network may be devoted to features that are specific to differentiating between dog breeds.
Let’s see how to use a pretrained network as a classifier. In this example, we will use a VGG16 network that was pretrained on the ImageNet dataset to classify the image of the German Shepherd dog in figure 6.8.
Import the necessary libraries:
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
from keras.applications.vgg16 import VGG16
Download the pretrained model of VGG16 and its ImageNet weights. We set include_top to True because we want to use the entire network as a classifier:
model = VGG16(weights="imagenet", include_top=True, input_shape=(224, 224, 3))
Load and preprocess the input image:
image = load_img('path/to/image.jpg', target_size=(224, 224))  # ❶
image = img_to_array(image)  # ❷
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))  # ❸
image = preprocess_input(image)  # ❹
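❶ Loads an image from a file and resizes it to 224 x 224
❷ Converts the image pixels to a NumPy array
❸ Reshapes the array to add the batch dimension that the model expects
❹ Prepares the image for the VGG16 model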
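To make the reshape and preprocessing steps less opaque, here is a NumPy-only sketch of what they do to the array. It assumes, to the best of my understanding of the Keras VGG16 preprocessing, that preprocess_input flips the channels from RGB to BGR and subtracts the per-channel ImageNet means without any scaling. The constant-valued image is a stand-in for a real photo, and this is an approximation for illustration, not the library's implementation:

```python
import numpy as np

# Stand-in for img_to_array(load_img(...)): a 224 x 224 RGB image of floats
image = np.full((224, 224, 3), 128.0, dtype="float32")

# Add the batch dimension: (height, width, channels) -> (1, height, width, channels)
batch = image.reshape((1,) + image.shape)

# Approximation of VGG16's preprocess_input: RGB -> BGR, then subtract the
# per-channel ImageNet means (listed in BGR order); no scaling is applied
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype="float32")
preprocessed = batch[..., ::-1] - IMAGENET_MEAN_BGR

print(batch.shape)
print(preprocessed[0, 0, 0])
```

The point of this step is that the pretrained weights expect inputs in the same format that was used during the original training, so new images must go through the same transformation.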
Now our input image is ready for us to run predictions:
yhat = model.predict(image)  # ❶
label = decode_predictions(yhat)  # ❷
label = label[0][0]  # ❸
print('%s (%.2f%%)' % (label[1], label[2]*100))  # ❹
❶ Predicts the probability across all output classes
❷ Converts the probabilities to class labels
❸ Retrieves the most likely result with the highest probability
❹ Prints the predicted class name and its probability
When you run this code, you will get the following output:
>> German_shepherd (99.72%)
You can see that the model was already trained to predict the correct dog breed with a high confidence score (99.72%). This is because the ImageNet dataset has more than 20,000 labeled dog images classified into 120 classes. Go to the book’s website to play with the code yourself with your own images: www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com. Feel free to explore the classes available in ImageNet and run this experiment on your own images.
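The decoding step boils down to sorting the softmax output and looking up class names. Here is a simplified NumPy sketch with a made-up five-class output. The class list, the probabilities, and the helper top_k are hypothetical stand-ins, and the real decode_predictions also returns a class ID with each entry, which is why the code above indexes label[1] and label[2]:

```python
import numpy as np

# Hypothetical stand-ins: a 5-class softmax output and its class names
class_names = ["German_shepherd", "malinois", "beagle", "tabby", "sports_car"]
yhat = np.array([[0.90, 0.06, 0.02, 0.01, 0.01]])   # shape (batch, classes)

def top_k(probabilities, names, k=3):
    """Return the k most likely (name, probability) pairs, best first."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(names[i], float(probabilities[i])) for i in order]

best_name, best_prob = top_k(yhat[0], class_names)[0]
print('%s (%.2f%%)' % (best_name, best_prob * 100))
```

With the full ImageNet model, the same idea is applied to a vector of 1,000 probabilities instead of five.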
Using a pretrained network as a feature extractor is similar to the dog breed example that we implemented earlier in this chapter: we take a CNN pretrained on ImageNet, freeze its feature extraction part, remove the classifier part, and add our own new, dense classifier layers. In figure 6.9, we use a pretrained VGG16 network, freeze the weights in all 13 convolutional layers, and replace the old classifier with a new one to be trained from scratch.
We usually go with this scenario when our new task is similar to the original dataset that the pretrained network was trained on. Since the ImageNet dataset has a lot of dog and cat examples, the feature maps that the network has learned contain a lot of dog and cat features that are very applicable to our new task. This means we can use the high-level features that were extracted from the ImageNet dataset in this new task.
To do that, we freeze all the layers from the pretrained network and only train the classifier part that we just added on the new dataset. This approach is called using a pretrained network as a feature extractor because we freeze the feature extractor part to transfer all the learned feature maps to our new problem. We only add a new classifier, which will be trained from scratch, on top of the pretrained model so that we can repurpose the previously learned feature maps for our dataset.
We remove the classification part of the pretrained network because it is often very specific to the original classification task and, consequently, to the set of classes on which the model was trained. For example, the ImageNet subset used to pretrain these networks has 1,000 classes. The classifier part has been trained to fit the training data and classify it into those 1,000 classes. But in our new problem, let’s say cats versus dogs, we have only two classes. So, it is a lot more effective to train a new classifier from scratch to fit these two classes.
So far, we’ve seen two basic approaches of using a pretrained network in transfer learning: using a pretrained network as a classifier or as a feature extractor. We generally use these approaches when the target domain is somewhat similar to the source domain. But what if the target domain is different from the source domain? What if it is very different? Can we still use transfer learning? Yes. Transfer learning can still work well even when the domains are very different. We just need to extract the correct feature maps from the source domain and fine-tune them to fit the target domain.
In figure 6.10, we show the different approaches of transferring knowledge from a pretrained network. If you are downloading the entire network with no changes and just running predictions, then you are using the network as a classifier. If you are freezing the convolutional layers only, then you are using the pretrained network as a feature extractor and transferring all of its high-level feature maps to your domain. The formal definition of fine-tuning is freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model. It is called fine-tuning because when we retrain the feature extraction layers, we fine-tune the higher-order feature representations to make them more relevant for the new task dataset.
In more practical terms, if we freeze feature maps 1 and 2 in figure 6.10, the new network will take feature maps 2 as its input and will start learning from this point to adapt the features of the later layers to the new dataset. This saves the network the time that it would have spent learning feature maps 1 and 2.
As we discussed earlier, feature maps that are extracted early in the network are generic. The feature maps get progressively more specific as we go deeper in the network. This means feature maps 4 in figure 6.10 are very specific to the source domain. Based on the similarity of the two domains, we can decide to freeze the network at the appropriate level of feature maps:
If the domains are similar, we might want to freeze the network up to the last feature map level (feature maps 4, in the example).
If the domains are very different, we might decide to freeze the pretrained network after feature maps 1 and retrain all the remaining layers.
Between these two possibilities is a range of fine-tuning options that we can apply. We can retrain the entire network, or freeze the pretrained network at any level of feature maps 1, 2, 3, or 4 and retrain the remainder of the network. We typically decide the appropriate level of fine-tuning by trial and error. But there are guidelines that we can follow to intuitively decide on the fine-tuning level for the pretrained network. The decision is a function of two factors: the amount of data we have and the level of similarity between the source and target domains. We will explain these factors and the four possible scenarios to choose the appropriate level of fine-tuning in section 6.5.
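Freezing the network at a chosen level can be sketched as a small helper that walks the layer list and marks everything up to a cutoff as non-trainable. This is only an illustration under stated assumptions: freeze_up_to is a hypothetical helper, and the stand-in layer objects (named after VGG16's pooling layers) are used so the sketch runs without downloading a network. The same function should work on the layers list of a Keras model, because it only relies on the name and trainable attributes.

```python
from types import SimpleNamespace

def freeze_up_to(layers, last_frozen_name):
    """Freezes every layer up to and including last_frozen_name; unfreezes the rest.

    Works on any sequence of objects that expose .name and .trainable,
    such as the layers list of a Keras model.
    """
    names = [layer.name for layer in layers]
    if last_frozen_name not in names:
        raise ValueError("no layer named %r" % last_frozen_name)
    cutoff = names.index(last_frozen_name)
    for i, layer in enumerate(layers):
        layer.trainable = i > cutoff   # only layers after the cutoff will learn
    return layers

# Stand-in layers, one per feature-map level, instead of a real network.
demo_layers = [SimpleNamespace(name=n, trainable=True)
               for n in ["block1_pool", "block2_pool", "block3_pool",
                         "block4_pool", "block5_pool"]]

freeze_up_to(demo_layers, "block3_pool")
print([(layer.name, layer.trainable) for layer in demo_layers])
```

With a real Keras model, remember that changes to trainable take effect when the model is compiled, so compile after freezing.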
When we train a network from scratch, we usually randomly initialize the weights and apply a gradient descent optimizer to find the best set of weights that optimizes our error function (as discussed in chapter 2). Since these weights start with random values, there is no guarantee that they will begin with values that are close to the desired optimal values. And if the initialized value is far from the optimal value, the optimizer will take a long time to converge. This is when fine-tuning can be very useful. The pretrained network’s weights have been already optimized to learn from its dataset. Thus, when we use this network in our problem, we start with the weight values that it ended with. So, the network converges much faster than if it had to randomly initialize the weights. We are basically fine-tuning the already-optimized weights to fit our new problem instead of training the entire network from scratch with random weights. Even if we decide to retrain the entire pretrained network, starting with the trained weights will converge faster than having to train the network from scratch with randomly initialized weights.
It’s common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the (randomly initialized) weights for the new linear classifier that computes the class scores of a new dataset. This is because we expect that the ConvNet weights are relatively good, so we don’t want to distort them too quickly and too much (especially while the new classifier above them is being trained from random initialization).
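The effect of using two learning rates can be seen in a toy gradient descent step. The sketch below is not how Keras implements it; it is a hand-rolled update, with made-up weights, gradients, and rates, that only shows how a smaller rate keeps the pretrained weights close to where they started while the new classifier moves faster.

```python
import numpy as np

def sgd_step(weights, grads, learning_rates):
    """One gradient descent update where each parameter group has its own rate."""
    return {name: weights[name] - learning_rates[name] * grads[name]
            for name in weights}

# Hypothetical parameter groups and gradients, chosen only for illustration.
weights = {"pretrained_conv": np.array([0.50, -0.20]),
           "new_classifier": np.array([0.01, 0.03])}
grads = {"pretrained_conv": np.array([1.0, 1.0]),
         "new_classifier": np.array([1.0, 1.0])}

# Assumption: the pretrained group gets a rate 10 times smaller.
learning_rates = {"pretrained_conv": 1e-4, "new_classifier": 1e-3}

updated = sgd_step(weights, grads, learning_rates)
for name in weights:
    change = np.abs(updated[name] - weights[name]).max()
    print(name, "moved by at most", change)
```

For the same gradient, the pretrained weights move one-tenth as far as the new classifier's weights, which is exactly the kind of gentle adjustment we want while the classifier is still settling.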
Recall that early convolutional layers extract generic features and become more specific to the training data the deeper we go through the network. With that said, we can choose the level of detail for feature extraction from an existing pretrained model. For example, if a new task is quite different from the source domain of the pretrained network (for example, different from ImageNet), then perhaps the output of the pretrained model after the first few layers would be appropriate. If a new task is similar to the source domain, then perhaps the output from layers much deeper in the model can be used, or even the output of the fully connected layer prior to the softmax layer.
As mentioned earlier, choosing the appropriate level for transfer learning is a function of two important factors:
Size of the target dataset (small or large): When we have a small dataset, the network probably won’t learn much from training more layers, so it will tend to overfit the new data. In this case, we most likely want to do less fine-tuning and rely more on the source dataset.
Domain similarity of the source and target datasets: How similar is our new problem to the domain of the original dataset? For example, if your problem is to classify cars and boats, ImageNet could be a good option because it contains a lot of images of similar features. On the other hand, if your problem is to classify lung cancer on X-ray images, this is a completely different domain that will likely require a lot of fine-tuning.
These two factors lead to the four major scenarios:
The target dataset is small and similar to the source dataset.
The target dataset is large and similar to the source dataset.
The target dataset is small and very different from the source dataset.
The target dataset is large and very different from the source dataset.
Let’s discuss these scenarios one by one to learn the common rules of thumb for navigating our options.
In scenario 1, the target dataset is small and similar to the source dataset. Since the original dataset is similar to our new dataset, we can expect that the higher-level features in the pretrained ConvNet are relevant to our dataset as well. Then it might be best to freeze the feature extraction part of the network and only retrain the classifier.
Another reason it might not be a good idea to fine-tune the network is that our new dataset is small. If we fine-tune the feature extraction layers on a small dataset, that will force the network to overfit to our data. This is not good because, by definition, a small dataset doesn’t have enough information to cover all possible features of its objects, which makes it fail to generalize to new, previously unseen, data. So in this case, the more fine-tuning we do, the more the network is prone to overfit the new data.
For example, suppose all the images in our new dataset contain dogs in a specific weather environment--snow, for example. If we fine-tuned on this dataset, we would force the new network to pick up features like snow and a white background as dog-specific features and make it fail to classify dogs in other weather conditions. Thus the general rule of thumb is: if you have a small amount of data, be careful of overfitting when you fine-tune your pretrained network.
In scenario 2, the target dataset is large and similar to the source dataset. Since both domains are similar, we can freeze the feature extraction part and retrain the classifier, similar to what we did in scenario 1. But since we have more data in the new domain, we can get a performance boost from fine-tuning through all or part of the pretrained network with more confidence that we won’t overfit. Fine-tuning through the entire network is not really needed because the higher-level features are related (since the datasets are similar). So a good start is to freeze approximately 60-80% of the pretrained network and retrain the rest on the new data.
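A rule of thumb like "freeze approximately 60-80% of the network" translates into a layer count. The following sketch assumes a hypothetical helper, freeze_fraction, and ten stand-in layers; with a real Keras model you would pass the layers list of the model and compile afterward.

```python
from types import SimpleNamespace

def freeze_fraction(layers, fraction):
    """Freezes the first `fraction` of the layers and leaves the rest trainable."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_frozen = int(round(len(layers) * fraction))
    for i, layer in enumerate(layers):
        layer.trainable = i >= n_frozen
    return n_frozen

# Ten stand-in layers; a real network would supply its own layer list.
demo_layers = [SimpleNamespace(name="layer_%d" % i, trainable=True)
               for i in range(10)]

n_frozen = freeze_fraction(demo_layers, 0.7)   # 0.7 sits inside the 60-80% range
n_trainable = sum(layer.trainable for layer in demo_layers)
print("frozen:", n_frozen, "trainable:", n_trainable)
```

Treat the fraction as a starting point for experiments rather than a fixed setting.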
In scenario 3, the target dataset is small and very different from the source dataset. Since the dataset is different, it might not be best to freeze the higher-level features of the pretrained network, because they contain more dataset-specific features. Instead, it would work better to retrain layers from somewhere earlier in the network, or to not freeze any layers and fine-tune the entire network. However, since you have a small dataset, fine-tuning the entire network on the dataset might not be a good idea, because doing so will make it prone to overfitting. A midway solution will work better in this case. A good start is to freeze approximately the first third or half of the pretrained network. After all, the early layers contain very generic feature maps that will be useful for your dataset even if it is very different.
In scenario 4, the target dataset is large and very different from the source dataset. Since the new dataset is large, you might be tempted to just train the entire network from scratch and not use transfer learning at all. However, in practice, it is often still very beneficial to initialize weights from a pretrained model, as we discussed earlier. Doing so makes the model converge faster. In this case, we have a large dataset that provides us with the confidence to fine-tune through the entire network without having to worry about overfitting.
We’ve explored the two main factors that help us define which transfer learning approach to use (size of our data and similarity between the source and target datasets). These two factors give us the four major scenarios defined in table 6.1. Figure 6.11 summarizes the guidelines for the appropriate fine-tuning level to use in each of the scenarios.
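The four scenarios can also be written down as a tiny lookup function, which is a handy way to keep the rules of thumb straight. The function name suggest_strategy and its two Boolean arguments are assumptions made for this sketch; the advice strings simply restate the guidelines from this section.

```python
def suggest_strategy(dataset_is_large, domains_are_similar):
    """Returns (scenario number, rule of thumb) for the four cases in this section."""
    if domains_are_similar:
        if dataset_is_large:
            return 2, "freeze roughly 60-80% of the pretrained network and retrain the rest"
        return 1, "freeze the whole feature extractor and train only a new classifier"
    if dataset_is_large:
        return 4, "initialize from the pretrained weights and fine-tune the entire network"
    return 3, "freeze roughly the first third to half of the network and retrain the rest"

for large in (False, True):
    for similar in (True, False):
        number, advice = suggest_strategy(large, similar)
        print("large=%s, similar=%s -> scenario %d: %s"
              % (large, similar, number, advice))
```

These are starting points, not guarantees; the appropriate level of fine-tuning is still usually confirmed by trial and error.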
The CV research community has been pretty good about posting datasets on the internet. So, when you hear names like ImageNet, MS COCO, Open Images, MNIST, CIFAR, and many others, these are datasets that people have posted online and that a lot of CV researchers have used as benchmarks to train their algorithms and get state-of-the-art results.
In this section, we will review some of the popular open source datasets to help guide you in your search to find the most suitable dataset for your problem. Keep in mind that the ones listed in this chapter are the most popular datasets used in the CV research community at the time of writing; we do not intend to provide a comprehensive list of all the open source datasets out there. A great many image datasets are available, and the number is growing every day. Before starting your project, I encourage you to do your own research to explore the available datasets.
MNIST (http://yann.lecun.com/exdb/mnist) stands for Modified National Institute of Standards and Technology. It contains labeled handwritten images of digits from 0 to 9. The goal of this dataset is to classify handwritten digits. MNIST has been popular with the research community for benchmarking classification algorithms. In fact, it is considered the “hello, world!” of image datasets. But nowadays, the MNIST dataset is comparatively pretty simple, and a basic CNN can achieve more than 99% accuracy, so MNIST is no longer considered a benchmark for CNN performance. We implemented a CNN classification project using MNIST dataset in chapter 3; feel free to go back and review it.
MNIST consists of 60,000 training images and 10,000 test images. All are grayscale (one-channel), and each image is 28 pixels high and 28 pixels wide. Figure 6.12 shows some sample images from the MNIST dataset.
Fashion-MNIST was created with the intention of replacing the original MNIST dataset, which has become too simple for modern convolutional networks. The data is stored in the same format as MNIST, but instead of handwritten digits, it contains 60,000 training images and 10,000 test images of 10 fashion clothing classes: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. Visit https://github.com/zalandoresearch/fashion-mnist to explore and download the dataset. Figure 6.13 shows a sample of the represented classes.
CIFAR-10 (www.cs.toronto.edu/~kriz/cifar.html) is considered another benchmark dataset for image classification in the CV and ML literature. CIFAR images are more complex than those in MNIST in the sense that MNIST images are all grayscale with perfectly centered objects, whereas CIFAR images are color (three channels) with dramatic variation in how the objects appear. The CIFAR-10 dataset consists of 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. Figure 6.14 shows the classes in the dataset.
CIFAR-100 is the bigger brother of CIFAR-10: it contains 100 classes with 600 images each. These 100 classes are grouped into 20 superclasses. Each image comes with a fine label (the class to which it belongs) and a coarse label (the superclass to which it belongs).
We’ve discussed the ImageNet dataset several times in the previous chapters and used it extensively in chapter 5 and this chapter. But for completeness of this list, we are discussing it here as well. At the time of writing, ImageNet is considered the current benchmark and is widely used by CV researchers to evaluate their classification algorithms.
ImageNet is a large visual database designed for use in visual object recognition software research. It is aimed at labeling and categorizing images into almost 22,000 categories based on a defined set of words and phrases. The images were collected from the web and labeled by humans via Amazon’s Mechanical Turk crowdsourcing tool. At the time of this writing, there are over 14 million images in the ImageNet project. To organize such a massive amount of data, the creators of ImageNet followed the WordNet hierarchy: each meaningful word/phrase in WordNet is called a synonym set (synset for short). Within the ImageNet project, images are organized according to these synsets, with the goal being to have 1,000+ images per synset. Figure 6.15 shows a collage of ImageNet examples put together by Stanford University.
The CV community usually refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) when talking about ImageNet. In this challenge, software programs compete to correctly classify and detect objects and scenes. We will be using the ILSVRC challenge as a benchmark to compare the different networks’ performance.
MS COCO (http://cocodataset.org) is short for Microsoft Common Objects in Context. It is an open source database that aims to enable future research for object detection, instance segmentation, image captioning, and localizing person keypoints. It contains 328,000 images. More than 200,000 of them are labeled, and they include 1.5 million object instances and 80 object categories that would be easily recognizable by a 4-year-old. The original research paper by the creators of the dataset describes the motivation for and content of this dataset.2 Figure 6.16 shows a sample of the dataset provided on the MS COCO website.
Open Images (https://storage.googleapis.com/openimages/web/index.html) is an open source image database created by Google. It contains more than 9 million images as of this writing. What makes it stand out is that these images are mostly of complex scenes that span thousands of classes of objects. Additionally, more than 2 million of these images are hand-annotated with bounding boxes, making Open Images by far the largest existing dataset with object-location annotations (see figure 6.17). In this subset of images, there are ~15.4 million bounding boxes of 600 classes of objects. Similar to ImageNet and ILSVRC, Open Images has a challenge called the Open Images Challenge (http://mng.bz/aRQz).
In addition to the datasets listed in this section, Kaggle (www.kaggle.com) is another great source for datasets. Kaggle is a website that hosts ML and DL challenges where people from all around the world can participate and submit algorithms for evaluations.
You are strongly encouraged to explore these datasets and search for the many other open source datasets that come up every day, to gain a better understanding of the classes and use cases they support. We mostly use ImageNet in this chapter’s projects; and throughout the book, we will be using MS COCO, especially in chapter 7.
In this project, we use a very small amount of data to train a classifier that detects images of dogs and cats. This is a pretty simple project, but the goal of the exercise is to see how to implement transfer learning when you have a very small amount of data and the target domain is similar to the source domain (scenario 1). As explained in this chapter, in this case, we will use the pretrained convolutional network as a feature extractor. This means we are going to freeze the feature extractor part of the network, add our own classifier, and then retrain the network on our new small dataset.
One other important takeaway from this project is learning how to preprocess custom data and make it ready to train your neural network. In previous projects, we used the CIFAR and MNIST datasets: they are preprocessed by Keras, so all we had to do was download them from the Keras library and use them directly to train the network. This project provides a tutorial of how to structure your data repository and use the Keras library to get your data ready.
Visit the book’s website at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com to download the code notebook and the dataset used for this project. Since we are using transfer learning, the training does not require high computation power, so you can run this notebook on your personal computer; you don’t need a GPU.
For this implementation, we’ll be using VGG16. Although it didn’t record the lowest error in the ILSVRC, I found that it worked well for the task and was quicker to train than other models. I got an accuracy of about 96%, but feel free to use GoogLeNet or ResNet to experiment and compare results.
The process to use a pretrained model as a feature extractor is well established:
Preprocess the data to make it ready for the neural network.
Load pretrained weights from the VGG16 network trained on a large dataset.
Freeze all the weights in the convolutional layers (feature extraction part). Remember, the layers to freeze are adjusted depending on the similarity of the new task to the original dataset. In our case, we observed that ImageNet has a lot of dog and cat images, so the network has already been trained to extract the detailed features of our target object.
Replace the fully connected layers of the network with a custom classifier. You can add as many fully connected layers as you see fit, and each can have as many hidden units as you want. For simple problems like this, we will just add one hidden layer with 64 units. You can observe the results and tune up if the model is underfitting or down if the model is overfitting. For the softmax layer, the number of units must be set equal to the number of classes (two units, in our case).
Compile the network, and run the training process on the new data of cats and dogs to optimize the model for the smaller dataset.
Now, let’s go through these steps and implement this project:
Import the necessary libraries:
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image
from keras.applications import imagenet_utils
from keras.applications import vgg16
from keras.applications import mobilenet
from keras.optimizers import Adam, SGD
from keras.metrics import categorical_crossentropy
from keras.layers import Dense, Flatten, Dropout, BatchNormalization
from keras.models import Model
from sklearn.metrics import confusion_matrix
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
Preprocess the data to make it ready for the neural network. Keras has an ImageDataGenerator class that allows us to easily perform image augmentation on the fly; you can read about it at https://keras.io/api/preprocessing/image. In this example, we use ImageDataGenerator to generate our image tensors, but for simplicity, we will not implement image augmentation.
The ImageDataGenerator class has a method called flow_from_directory() that is used to read images from folders containing images. This method expects your data directory to be structured as in figure 6.18.
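The layout that flow_from_directory() expects is one folder per split, with one subfolder per class inside it. The sketch below builds such an empty skeleton in a temporary directory so you can see the shape of the tree; the split names match the paths used in this project, while the class folder names cats and dogs and the helper make_layout are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def make_layout(root, splits=("train", "valid", "test"), classes=("cats", "dogs")):
    """Creates root/<split>/<class>/ folders, one subfolder per class label."""
    root = Path(root)
    for split in splits:
        for cls in classes:
            (root / split / cls).mkdir(parents=True, exist_ok=True)
    return root

# Build the skeleton in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    root = make_layout(Path(tmp) / "data")
    created = sorted(p.relative_to(root).as_posix() for p in root.glob("*/*"))

for folder in created:
    print(folder)
```

The class labels are inferred from the subfolder names, so every image must sit inside the folder of the class it belongs to.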
I have the data structured in the book’s code so it’s ready for you to use flow_from_directory(). Now, load the data into train_path, valid_path, and test_path variables, and then generate the train, valid, and test batches:
train_path = 'data/train'
valid_path = 'data/valid'
test_path = 'data/test'
train_batches = ImageDataGenerator().flow_from_directory(train_path, ❶
    target_size=(224,224), batch_size=10)
valid_batches = ImageDataGenerator().flow_from_directory(valid_path,
    target_size=(224,224), batch_size=30)
test_batches = ImageDataGenerator().flow_from_directory(test_path,
    target_size=(224,224), batch_size=50, shuffle=False)
❶ ImageDataGenerator generates batches of tensor image data with real-time data augmentation. The data will be looped over (in batches). In this example, we won’t be doing any image augmentation.
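Because the generators loop over the data in batches, it helps to know how many batches make up one full pass over a folder. This is simple arithmetic, sketched below with a hypothetical helper steps_per_pass; the image counts are made up, and only the batch sizes (10, 30, and 50) come from the code above.

```python
import math

def steps_per_pass(num_images, batch_size):
    """Number of batches needed to see every image exactly once."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    return math.ceil(num_images / batch_size)

# Hypothetical image counts per split, paired with the batch sizes used above.
splits = {"train": (200, 10), "valid": (60, 30), "test": (100, 50)}
steps = {name: steps_per_pass(n, b) for name, (n, b) in splits.items()}
print(steps)

# A count that is not a multiple of the batch size needs one extra, smaller batch.
print(steps_per_pass(205, 10))
```

If you pass fewer steps than this to the training call, each epoch sees only part of the data, which can be a deliberate choice to keep epochs short.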
Load in pretrained weights from the VGG16 network trained on a large dataset. Similar to the examples in this chapter, we download the VGG16 network from Keras along with the weights it learned from pretraining on the ImageNet dataset. Remember that we want to remove the classifier part from this network, so we set the parameter include_top=False:
base_model = vgg16.VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
Freeze all the weights in the convolutional layers (feature extraction part). We freeze the convolutional layers from the base_model created in the previous step and use that as a feature extractor, and then add a classifier on top of it in the next step:
for layer in base_model.layers: ❶
    layer.trainable = False
❶ Iterates through the layers and freezes them so that they are non-trainable
Add the new classifier, and build the new model. We add a few layers on top of the base model. In this example, we add one fully connected layer with 64 hidden units and a softmax output layer with 2 units. We also add batch norm and dropout layers to avoid overfitting:
last_layer = base_model.get_layer('block5_pool') ❶
last_output = last_layer.output
x = Flatten()(last_output) ❷
x = Dense(64, activation='relu', name='FC_2')(x) ❸
x = BatchNormalization()(x) ❸
x = Dropout(0.5)(x) ❸
x = Dense(2, activation='softmax', name='softmax')(x) ❸
new_model = Model(inputs=base_model.input, outputs=x) ❹
new_model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten_1 (Flatten)          (None, 25088)             0
_________________________________________________________________
FC_2 (Dense)                 (None, 64)                1605696
_________________________________________________________________
batch_normalization_1 (Batch (None, 64)                256
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0
_________________________________________________________________
softmax (Dense)              (None, 2)                 130
=================================================================
Total params: 16,320,770
Trainable params: 1,605,954
Non-trainable params: 14,714,816
_________________________________________________________________
❶ Uses the get_layer method to save the last layer of the network. Then saves the output of the last layer to be the input of the next layer.
❷ Flattens the classifier input, which is the output of the last layer of the VGG16 model
❸ Adds one fully connected layer that has 64 units and batchnorm, dropout, and softmax layers
❹ Instantiates the new model using Keras’s Model class
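It is worth checking where the parameter counts in the summary come from, because they show how little of the network we actually train. The arithmetic below reproduces the numbers from the summary: a dense layer has one weight per input-output pair plus one bias per output, and a batch norm layer keeps four values per unit, of which two (the scale and shift) are trainable and two (the moving mean and variance) are not. The variable names are just for this sketch.

```python
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

flattened = 7 * 7 * 512                 # block5_pool output, flattened
fc_2 = dense_params(flattened, 64)      # new hidden layer
bn_total = 4 * 64                       # gamma, beta, moving mean, moving variance
bn_trainable = 2 * 64                   # only gamma and beta are learned
softmax = dense_params(64, 2)           # output layer

# Sum of the 13 frozen convolutional layers listed in the summary.
frozen_base = (1792 + 36928 + 73856 + 147584 + 295168 + 590080 + 590080
               + 1180160 + 5 * 2359808)

trainable = fc_2 + bn_trainable + softmax
non_trainable = frozen_base + (bn_total - bn_trainable)
total = trainable + non_trainable

print("flattened features:", flattened)
print("trainable:", trainable)
print("non-trainable:", non_trainable)
print("total:", total)
```

Roughly 1.6 million of the 16.3 million parameters are trainable, which is why this model can be trained on a CPU.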
Compile the model and run the training process:
new_model.compile(Adam(lr=0.0001), loss='categorical_crossentropy',
                  metrics=['accuracy'])
new_model.fit_generator(train_batches, steps_per_epoch=4,
                        validation_data=valid_batches, validation_steps=2,
                        epochs=20, verbose=2)
When you run the previous code snippet, the verbose training is printed after each epoch as follows:
Epoch 1/20
 - 28s - loss: 1.0070 - acc: 0.6083 - val_loss: 0.5944 - val_acc: 0.6833
Epoch 2/20
 - 25s - loss: 0.4728 - acc: 0.7754 - val_loss: 0.3313 - val_acc: 0.8605
Epoch 3/20
 - 30s - loss: 0.1177 - acc: 0.9750 - val_loss: 0.2449 - val_acc: 0.8167
Epoch 4/20
 - 25s - loss: 0.1640 - acc: 0.9444 - val_loss: 0.3354 - val_acc: 0.8372
Epoch 5/20
 - 29s - loss: 0.0545 - acc: 1.0000 - val_loss: 0.2392 - val_acc: 0.8333
Epoch 6/20
 - 25s - loss: 0.0941 - acc: 0.9505 - val_loss: 0.2019 - val_acc: 0.9070
Epoch 7/20
 - 28s - loss: 0.0269 - acc: 1.0000 - val_loss: 0.1707 - val_acc: 0.9000
Epoch 8/20
 - 26s - loss: 0.0349 - acc: 0.9917 - val_loss: 0.2489 - val_acc: 0.8140
Epoch 9/20
 - 28s - loss: 0.0435 - acc: 0.9891 - val_loss: 0.1634 - val_acc: 0.9000
Epoch 10/20
 - 26s - loss: 0.0349 - acc: 0.9833 - val_loss: 0.2375 - val_acc: 0.8140
Epoch 11/20
 - 28s - loss: 0.0288 - acc: 1.0000 - val_loss: 0.1859 - val_acc: 0.9000
Epoch 12/20
 - 29s - loss: 0.0234 - acc: 0.9917 - val_loss: 0.1879 - val_acc: 0.8372
Epoch 13/20
 - 32s - loss: 0.0241 - acc: 1.0000 - val_loss: 0.2513 - val_acc: 0.8500
Epoch 14/20
 - 29s - loss: 0.0120 - acc: 1.0000 - val_loss: 0.0900 - val_acc: 0.9302
Epoch 15/20
 - 36s - loss: 0.0189 - acc: 1.0000 - val_loss: 0.1888 - val_acc: 0.9000
Epoch 16/20
 - 30s - loss: 0.0142 - acc: 1.0000 - val_loss: 0.1672 - val_acc: 0.8605
Epoch 17/20
 - 29s - loss: 0.0160 - acc: 0.9917 - val_loss: 0.1752 - val_acc: 0.8667
Epoch 18/20
 - 25s - loss: 0.0126 - acc: 1.0000 - val_loss: 0.1823 - val_acc: 0.9070
Epoch 19/20
 - 29s - loss: 0.0165 - acc: 1.0000 - val_loss: 0.1789 - val_acc: 0.8833
Epoch 20/20
 - 25s - loss: 0.0112 - acc: 1.0000 - val_loss: 0.1743 - val_acc: 0.8837
Notice that the model was trained very quickly using regular CPU computing power. Each epoch took approximately 25 to 36 seconds, which means the model took less than 10 minutes to train for 20 epochs.
Evaluate the model. First, let’s define the load_dataset() method that we will use to convert our dataset into tensors:
from sklearn.datasets import load_files
from keras.utils import np_utils
import numpy as np

def load_dataset(path):
    data = load_files(path)
    paths = np.array(data['filenames'])
    targets = np_utils.to_categorical(np.array(data['target']))
    return paths, targets

test_files, test_targets = load_dataset('small_data/test')
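The to_categorical call turns integer class labels into one-hot vectors, which is the format that categorical cross-entropy expects. A minimal NumPy sketch of the same idea is shown below; the helper one_hot and the four example labels are assumptions for illustration, not part of Keras.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Converts integer class labels to one-hot rows (one column per class)."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

# Hypothetical labels: 0 and 1 stand for the two class folders.
example_targets = one_hot([0, 1, 1, 0])
print(example_targets)
print(example_targets.shape)
```

Each row has exactly one 1, in the column of its class, so the target shape here is (number of images, 2).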
Then, we create test_tensors to evaluate the model on them:
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from tqdm import tqdm

def path_to_tensor(img_path):
    img = image.load_img(img_path, target_size=(224, 224)) ❶
    x = image.img_to_array(img) ❷
    return np.expand_dims(x, axis=0) ❸

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)

test_tensors = preprocess_input(paths_to_tensor(test_files))
❶ Loads an RGB image as PIL.Image.Image type
❷ Converts the PIL.Image.Image type to a 3D tensor with shape (224, 224, 3)
❸ Converts the 3D tensor to a 4D tensor with shape (1, 224, 224, 3) and returns the 4D tensor
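To see how the shapes change as single images are stacked into one batch, here is a short NumPy sketch that uses arrays of zeros in place of real images, so it runs without any image files. The helper to_batch is a stand-in written for this illustration.

```python
import numpy as np

def to_batch(images):
    """Adds a batch axis to each 3D image and stacks them into one 4D tensor."""
    return np.vstack([np.expand_dims(img, axis=0) for img in images])

# Five blank stand-ins for 224 x 224 RGB images.
fake_images = [np.zeros((224, 224, 3), dtype=np.float32) for _ in range(5)]

single = np.expand_dims(fake_images[0], axis=0)
batch = to_batch(fake_images)
print(single.shape)   # one image with a batch axis
print(batch.shape)    # all images in one tensor
```

The first axis of the stacked tensor is the number of images, which is the layout the network expects for prediction and evaluation.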
Now we can run Keras’s evaluate() method to calculate the model accuracy:
print('\nTesting loss: {:.4f}\nTesting accuracy: {:.4f}'.format(*new_model.evaluate(test_tensors, test_targets)))
Testing loss: 0.1042
Testing accuracy: 0.9579
The model has achieved an accuracy of 95.79% in less than 10 minutes of training. This is very good, given our very small dataset.
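A single accuracy number hides which class the errors fall on, so it can be useful to tabulate predictions against the true labels. The sketch below computes accuracy and a confusion matrix with plain NumPy; the helper accuracy_and_confusion and the four made-up predictions are assumptions for illustration and are not results from this project.

```python
import numpy as np

def accuracy_and_confusion(probabilities, one_hot_targets):
    """Returns (accuracy, confusion matrix) with rows = actual, columns = predicted."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    num_classes = one_hot_targets.shape[1]
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for a, p in zip(actual, predicted):
        confusion[a, p] += 1
    return float(np.mean(predicted == actual)), confusion

# Made-up softmax outputs and targets for four images of two classes.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
targets = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

acc, cm = accuracy_and_confusion(probs, targets)
print("accuracy:", acc)
print(cm)
```

With a trained model, the probabilities would come from calling predict on the test tensors, and the targets would be the one-hot test labels.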
In this project, we are going to explore scenario 3, discussed earlier in this chapter, where the target dataset is small and very different from the source dataset. The goal of this project is to build a sign language classifier that distinguishes 10 classes: the sign language digits from 0 to 9. Figure 6.19 shows a sample of our dataset.
Following are the details of our dataset:
It is very noticeable how small our dataset is. If you try to train a network from scratch on this very small dataset, you will not achieve good results. On the other hand, we were able to achieve an accuracy of 98% by using transfer learning, even though the source and target domains were very different.
NOTE Please take this evaluation with a grain of salt, because the network hasn't been thoroughly tested with a lot of data. We only have 50 test images in this dataset. Transfer learning is expected to achieve good results anyway, but I wanted to highlight this fact.
Visit the book’s website at www.manning.com/books/deep-learning-for-vision-systems or www.computervisionbook.com to download the source code notebook and the dataset used for this project. Similar to project 1, the training does not require high computation power, so you can run this notebook on your personal computer; you don’t need a GPU.
For ease of comparison with the previous project, we will use the VGG16 network trained on the ImageNet dataset. The process to fine-tune a pretrained network is as follows:
Preprocess the data to make it ready for the neural network.
Load in pretrained weights from the VGG16 network trained on a large dataset (ImageNet).
Compile the network, and run the training process to optimize the model for the smaller dataset.
Now let’s implement this project:
Import the necessary libraries:
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing import image
from keras.applications import imagenet_utils
from keras.applications import vgg16
from keras.optimizers import Adam, SGD
from keras.metrics import categorical_crossentropy
from keras.layers import Dense, Flatten, Dropout, BatchNormalization
from keras.models import Model
from sklearn.metrics import confusion_matrix
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
Preprocess the data to make it ready for the neural network. Similar to project 1, we use the ImageDataGenerator class from Keras and the flow_from_directory() method to preprocess our data. The data is already structured for you to directly create your tensors:
train_path = 'dataset/train'
valid_path = 'dataset/valid'
test_path = 'dataset/test'
train_batches = ImageDataGenerator().flow_from_directory(train_path, ❶
                                                         target_size=(224,224),
                                                         batch_size=10)
valid_batches = ImageDataGenerator().flow_from_directory(valid_path,
                                                         target_size=(224,224),
                                                         batch_size=30)
test_batches = ImageDataGenerator().flow_from_directory(test_path,
                                                        target_size=(224,224),
                                                        batch_size=50,
                                                        shuffle=False)
Found 1712 images belonging to 10 classes.
Found 300 images belonging to 10 classes.
Found 50 images belonging to 10 classes.
❶ ImageDataGenerator generates batches of tensor image data with real-time data augmentation. The data will be looped over (in batches). In this example, we won’t be doing any image augmentation.
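Because the generators loop over the data in batches, it helps to know how many batches make up one full pass over each split. The following is a small sketch using the image counts and batch sizes reported above; steps_to_cover is a hypothetical helper name, not a Keras function:

```python
import math

def steps_to_cover(num_images, batch_size):
    """Hypothetical helper: batches needed to see every image exactly once."""
    return math.ceil(num_images / batch_size)

# (number of images, batch size) for each split, taken from the output above
splits = {"train": (1712, 10), "valid": (300, 30), "test": (50, 50)}

steps = {name: steps_to_cover(n, b) for name, (n, b) in splits.items()}
print(steps)
```

If you pass fewer steps than this to the training call, each epoch simply draws only part of the split before moving on.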
Load in pretrained weights from the VGG16 network trained on a large dataset (ImageNet). We download the VGG16 architecture from the Keras library with ImageNet weights. Note that we use the parameter pooling='avg' here: this basically means global average pooling will be applied to the output of the last convolutional layer, and thus the output of the model will be a 2D tensor. We use this as an alternative to the Flatten layer before adding the fully connected layers:
base_model = vgg16.VGG16(weights="imagenet", include_top=False,
                         input_shape=(224, 224, 3), pooling='avg')
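To see what global average pooling does to the shape of the final feature maps, here is a minimal NumPy sketch. The random array is only a stand-in for the (7, 7, 512) output of VGG16's last pooling layer, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one image's final feature maps: (batch, height, width, channels)
feature_maps = rng.random((1, 7, 7, 512)).astype("float32")

# Flatten keeps every value: 7 * 7 * 512 = 25,088 features per image
flattened = feature_maps.reshape(feature_maps.shape[0], -1)

# Global average pooling keeps one value per channel: the mean of its 7 x 7 map
pooled = feature_maps.mean(axis=(1, 2))

print(flattened.shape)
print(pooled.shape)
```

With a 10-unit dense layer on top, the pooled features need 512 × 10 + 10 = 5,130 weights, whereas the flattened features would need 25,088 × 10 + 10 = 250,890. That is one reason to prefer pooling when the dataset is small.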
Freeze some of the feature extractor part, and fine-tune the rest on our new training data. The level of fine-tuning is usually determined by trial and error. VGG16 has 13 convolutional layers: you can freeze them all or freeze a few of them, depending on how similar your data is to the source data. In the sign language case, the new domain is very different from the source domain (ImageNet), so we will start with fine-tuning only the last five layers; if we don’t get satisfying results, we can fine-tune more. It turns out that after we trained the new model, we got 98% accuracy, so this was a good level of fine-tuning. But in other cases, if you find that your network doesn’t converge, try fine-tuning more layers.
for layer in base_model.layers[:-5]: ❶
    layer.trainable = False

base_model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
global_average_pooling2d_1 ( (None, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 7,079,424
Non-trainable params: 7,635,264
_________________________________________________________________
❶ Iterates through layers and locks them, except for the last five layers
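The trainable and non-trainable totals in this summary can be reproduced by hand. The sketch below assumes the standard parameter count for a 3 × 3 convolution with bias, (3 × 3 × input channels + 1) × output channels. Pooling layers have no weights, so of the last five layers only the three block5 convolutions contribute trainable parameters. The helper conv_params is hypothetical:

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one Conv2D layer (hypothetical helper)."""
    return (kernel * kernel * in_channels + 1) * out_channels

# (layer name, input channels, output channels) for VGG16's 13 conv layers
vgg16_convs = [
    ("block1_conv1", 3, 64), ("block1_conv2", 64, 64),
    ("block2_conv1", 64, 128), ("block2_conv2", 128, 128),
    ("block3_conv1", 128, 256), ("block3_conv2", 256, 256),
    ("block3_conv3", 256, 256),
    ("block4_conv1", 256, 512), ("block4_conv2", 512, 512),
    ("block4_conv3", 512, 512),
    ("block5_conv1", 512, 512), ("block5_conv2", 512, 512),
    ("block5_conv3", 512, 512),
]

counts = {name: conv_params(cin, cout) for name, cin, cout in vgg16_convs}

total = sum(counts.values())
# Only block5's conv layers fall within the last five (unfrozen) layers
trainable = sum(n for name, n in counts.items() if name.startswith("block5"))
frozen = total - trainable

print(total, trainable, frozen)
```

The three numbers match the Total, Trainable, and Non-trainable lines of the summary.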
Add the new classifier layers, and build the new model:
last_output = base_model.output ❶
x = Dense(10, activation='softmax', name='softmax')(last_output) ❷
new_model = Model(inputs=base_model.input, outputs=x) ❸
new_model.summary() ❹

Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 224, 224, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
global_average_pooling2d_1 ( (None, 512) 0
_________________________________________________________________
softmax (Dense) (None, 10) 5130
=================================================================
Total params: 14,719,818
Trainable params: 7,084,554
Non-trainable params: 7,635,264
❶ Saves the output of base_model to be the input of the next layer
❷ Adds our new softmax layer with 10 output units, one per class
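The new layer's parameter count and its output can both be worked out by hand. The following plain-Python sketch assumes a dense layer has inputs × units weights plus one bias per unit; the helper names and the example scores are made up for illustration:

```python
import math

def dense_param_count(in_features, units):
    """Weights plus biases of a fully connected layer (hypothetical helper)."""
    return in_features * units + units

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# 512 pooled features feed 10 class units
head_params = dense_param_count(512, 10)
total_params = 14_714_688 + head_params      # base model + new layer
trainable_params = 7_079_424 + head_params   # unfrozen block5 + new layer

# Made-up raw scores for the 10 digit classes; class 3 has the highest score
probs = softmax([0.1, 0.3, 0.2, 2.5, 0.0, 0.4, 0.1, 0.2, 0.3, 0.1])
predicted_class = probs.index(max(probs))

print(head_params, total_params, trainable_params, predicted_class)
```

The new layer adds 5,130 parameters, all of them trainable, which accounts for the difference between this summary and the base model's.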
Compile the network, and run the training process to optimize the model for the smaller dataset:
new_model.compile(Adam(lr=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])

from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(filepath='signlanguage.model.hdf5', save_best_only=True)

history = new_model.fit_generator(train_batches, steps_per_epoch=18,
                                  validation_data=valid_batches, validation_steps=3,
                                  epochs=20, verbose=1, callbacks=[checkpointer])

Epoch 1/150
18/18 [==============================] - 40s 2s/step - loss: 3.2263 - acc: 0.1833 - val_loss: 2.0674 - val_acc: 0.1667
Epoch 2/150
18/18 [==============================] - 41s 2s/step - loss: 2.0311 - acc: 0.1833 - val_loss: 1.7330 - val_acc: 0.3000
Epoch 3/150
18/18 [==============================] - 42s 2s/step - loss: 1.5741 - acc: 0.4500 - val_loss: 1.5577 - val_acc: 0.4000
Epoch 4/150
18/18 [==============================] - 42s 2s/step - loss: 1.3068 - acc: 0.5111 - val_loss: 0.9856 - val_acc: 0.7333
Epoch 5/150
18/18 [==============================] - 43s 2s/step - loss: 1.1563 - acc: 0.6389 - val_loss: 0.7637 - val_acc: 0.7333
Epoch 6/150
18/18 [==============================] - 41s 2s/step - loss: 0.8414 - acc: 0.6722 - val_loss: 0.7550 - val_acc: 0.8000
Epoch 7/150
18/18 [==============================] - 41s 2s/step - loss: 0.5982 - acc: 0.8444 - val_loss: 0.7910 - val_acc: 0.6667
Epoch 8/150
18/18 [==============================] - 41s 2s/step - loss: 0.3804 - acc: 0.8722 - val_loss: 0.7376 - val_acc: 0.8667
Epoch 9/150
18/18 [==============================] - 41s 2s/step - loss: 0.5048 - acc: 0.8222 - val_loss: 0.2677 - val_acc: 0.9000
Epoch 10/150
18/18 [==============================] - 39s 2s/step - loss: 0.2383 - acc: 0.9276 - val_loss: 0.2844 - val_acc: 0.9000
Epoch 11/150
18/18 [==============================] - 41s 2s/step - loss: 0.1163 - acc: 0.9778 - val_loss: 0.0775 - val_acc: 1.0000
Epoch 12/150
18/18 [==============================] - 41s 2s/step - loss: 0.1377 - acc: 0.9667 - val_loss: 0.5140 - val_acc: 0.9333
Epoch 13/150
18/18 [==============================] - 41s 2s/step - loss: 0.0955 - acc: 0.9556 - val_loss: 0.1783 - val_acc: 0.9333
Epoch 14/150
18/18 [==============================] - 41s 2s/step - loss: 0.1785 - acc: 0.9611 - val_loss: 0.0704 - val_acc: 0.9333
Epoch 15/150
18/18 [==============================] - 41s 2s/step - loss: 0.0533 - acc: 0.9778 - val_loss: 0.4692 - val_acc: 0.8667
Epoch 16/150
18/18 [==============================] - 41s 2s/step - loss: 0.0809 - acc: 0.9778 - val_loss: 0.0447 - val_acc: 1.0000
Epoch 17/150
18/18 [==============================] - 41s 2s/step - loss: 0.0834 - acc: 0.9722 - val_loss: 0.0284 - val_acc: 1.0000
Epoch 18/150
18/18 [==============================] - 41s 2s/step - loss: 0.1022 - acc: 0.9611 - val_loss: 0.0177 - val_acc: 1.0000
Epoch 19/150
18/18 [==============================] - 41s 2s/step - loss: 0.1134 - acc: 0.9667 - val_loss: 0.0595 - val_acc: 1.0000
Epoch 20/150
18/18 [==============================] - 39s 2s/step - loss: 0.0676 - acc: 0.9777 - val_loss: 0.0862 - val_acc: 0.9667
Notice the training time of each epoch from the verbose output. The model was trained very quickly using regular CPU computing power. Each epoch took approximately 40 seconds, which means it took the model less than 15 minutes to train for 20 epochs.
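The checkpointer in this step is created with save_best_only=True, which, with the callback's default of monitoring val_loss, overwrites the saved file only when the validation loss improves. The following plain-Python sketch replays that rule on the val_loss values from the log above. It imitates the behavior for illustration and is not the Keras implementation:

```python
# Validation loss per epoch, copied from the training log above
val_losses = [
    2.0674, 1.7330, 1.5577, 0.9856, 0.7637, 0.7550, 0.7910, 0.7376,
    0.2677, 0.2844, 0.0775, 0.5140, 0.1783, 0.0704, 0.4692, 0.0447,
    0.0284, 0.0177, 0.0595, 0.0862,
]

best_loss = float("inf")
saved_epochs = []  # epochs at which the checkpoint file would be rewritten

for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:  # improvement: keep this epoch's weights
        best_loss = loss
        saved_epochs.append(epoch)

print("Checkpoint written after epochs:", saved_epochs)
print("Best epoch:", saved_epochs[-1], "with val_loss", best_loss)
```

Under this rule, the file on disk ends up holding the weights from epoch 18, not from the final epoch, so the saved checkpoint and the in-memory model can differ.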
Evaluate the accuracy of the model. Similar to the previous project, we create a load_dataset() method to create test_targets and test_tensors and then use the evaluate() method from Keras to run inferences on the test images and get the model accuracy:
print('\nTesting loss: {:.4f}\nTesting accuracy: {:.4f}'.format(*new_model.evaluate(test_tensors, test_targets)))

Testing loss: 0.0574
Testing accuracy: 0.9800
A deeper level of evaluating your model involves creating a confusion matrix. We explained the confusion matrix in chapter 4: it is a table that is often used to describe the performance of a classification model, to provide a deeper understanding of how the model performed on the test dataset. See chapter 4 for details on the different model evaluation metrics. Now, let’s build the confusion matrix for our model (see figure 6.20):
from sklearn.metrics import confusion_matrix
import numpy as np

cm_labels = ['0','1','2','3','4','5','6','7','8','9']

cm = confusion_matrix(np.argmax(test_targets, axis=1),
                      np.argmax(new_model.predict(test_tensors), axis=1))
plt.imshow(cm, cmap=plt.cm.Blues)
plt.colorbar()
indexes = np.arange(len(cm_labels))
for i in indexes:
    for j in indexes:
        plt.text(j, i, cm[i, j])
plt.xticks(indexes, cm_labels, rotation=90)
plt.xlabel('Predicted label')
plt.yticks(indexes, cm_labels)
plt.ylabel('True label')
plt.title('Confusion matrix')
plt.show()
To read this confusion matrix, pick a number on the Predicted Label axis and check the True Label axis to see whether the images assigned to it were classified correctly. For example, look at number 0 on the Predicted Label axis: all five images were classified as 0, and no images were mistakenly classified as any other number. Go through the rest of the numbers on the Predicted Label axis in the same way. You will notice that the model made the correct prediction for every test image except one with true label 8, which it mistakenly classified as 7.
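The link between this matrix and the reported test accuracy can be checked with a few lines of plain Python. The sketch below assumes the 50 test images are split evenly, five per digit, and rebuilds the situation described above, with a single image of 8 predicted as 7. The label lists are reconstructed for illustration, not read from the model:

```python
NUM_CLASSES = 10

# Assumed layout: five test images per digit, 50 in total
true_labels = [digit for digit in range(NUM_CLASSES) for _ in range(5)]

# Start from perfect predictions, then flip one image of 8 to a predicted 7
pred_labels = list(true_labels)
pred_labels[true_labels.index(8)] = 7

# Rows are true labels, columns are predicted labels
cm = [[0] * NUM_CLASSES for _ in range(NUM_CLASSES)]
for t, p in zip(true_labels, pred_labels):
    cm[t][p] += 1

# Correct predictions sit on the diagonal
correct = sum(cm[k][k] for k in range(NUM_CLASSES))
accuracy = correct / len(true_labels)

print("Row for true label 8:", cm[8])
print("Accuracy:", accuracy)
```

Every off-diagonal entry is a mistake, so one error out of 50 images gives 49/50 = 0.98, the same value reported by evaluate().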
Transfer learning is usually the go-to approach when starting a classification or object detection project, especially when you don’t have a lot of training data.
Transfer learning migrates the knowledge learned from the source dataset to the target dataset, to save training time and computational cost.
The neural network learns the features in your dataset step by step in increasing levels of complexity. The deeper you go through the network layers, the more image-specific the features that are learned.
Early layers in the network learn low-level features like lines, blobs, and edges. The output of the first layer becomes input to the second layer, which produces higher-level features. The next layer assembles the output of the previous layer into parts of familiar objects, and a subsequent layer detects the objects.
The three main transfer learning approaches are using a pretrained network as a classifier, using a pretrained network as a feature extractor, and fine-tuning.
Using a pretrained network as a classifier means using the network directly to classify new images without freezing layers or applying model training.
Using a pretrained network as a feature extractor means freezing the feature extraction part of the network, removing its original classifier, and training a new classifier on top.
Fine-tuning means freezing a few of the network layers that are used for feature extraction, and jointly training both the non-frozen layers and the newly added classifier layers of the pretrained model.
The transferability of features from one network to another is a function of the size of the target data and the domain similarity between the source and target data.
Generally, fine-tuning parameters use a smaller learning rate, while training the output layer from scratch can use a larger learning rate.
1. Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, “How Transferable Are Features in Deep Neural Networks?” Advances in Neural Information Processing Systems 27 (Dec. 2014): 3320-3328, https://arxiv.org/abs/1411.1792.
2. Tsung-Yi Lin, Michael Maire, Serge Belongie, et al., “Microsoft COCO: Common Objects in Context” (February 2015), https://arxiv.org/pdf/1405.0312.pdf.