In the previous chapters, we explained how we can use deep neural networks for image classification tasks. In image classification, we assume that there is only one main target object in the image, and the model’s sole focus is to identify the target category. However, in many situations, we are interested in multiple targets in the image. We want to not only classify them, but also obtain their specific positions in the image. In computer vision, we refer to such tasks as object detection. Figure 7.1 explains the difference between image classification and object detection tasks.
Object detection is a CV task that involves two main tasks: localizing one or more objects within an image and classifying each object in the image (see table 7.1). This is done by drawing a bounding box around each identified object and labeling it with its predicted class. This means the system doesn’t just predict the class of the image, as in image classification tasks; it also predicts the coordinates of the bounding box that fits the detected object. This is a challenging CV task because it requires both successful object localization, in order to locate and draw a bounding box around each object in an image, and object classification, to predict the correct class of the object that was localized.
Object detection is widely used in many fields. For example, in self-driving technology, we need to plan routes by identifying the locations of vehicles, pedestrians, roads, and obstacles in a captured video image. Robots often perform this type of task to detect targets of interest. And systems in the security field need to detect abnormal targets, such as intruders or bombs.
This chapter’s layout is as follows:
We will explore the general framework of the object detection algorithms.
We will dive deep into three of the most popular detection algorithms: the R-CNN family of networks, SSD, and the YOLO family of networks.
We will use what we’ve learned in a real-world project to train an end-to-end object detector.
By the end of this chapter, we will have gained an understanding of how DL is applied to object detection, and how the different object detection models inspire and diverge from one another. Let’s get started!
Before we jump into the object detection systems like R-CNN, SSD, and YOLO, let’s discuss the general framework of these systems to understand the high-level workflow that DL-based systems follow to detect objects and the metrics they use to evaluate their detection performance. Don’t worry about the code implementation details of object detectors yet. The goal of this section is to give you an overview of how different object detection systems approach this task and introduce you to a new way of thinking about this problem and a set of new concepts to set you up to understand the DL architectures that we will explain in sections 7.2, 7.3, and 7.4.
Typically, an object detection framework has four components:
Region proposal --An algorithm or a DL model is used to generate regions of interest (RoIs) to be further processed by the system. These are regions that the network believes might contain an object; the output is a large number of bounding boxes, each of which has an objectness score. Boxes with large objectness scores are then passed along the network layers for further processing.
Feature extraction and network predictions --Visual features are extracted for each of the bounding boxes. They are evaluated, and it is determined whether and which objects are present in the proposals based on visual features (for example, an object classification component).
Non-maximum suppression (NMS) --In this step, the model has likely found multiple bounding boxes for the same object. NMS helps avoid repeated detection of the same instance by combining overlapping boxes into a single bounding box for each object.
Evaluation metrics --Similar to accuracy, precision, and recall metrics in image classification tasks (see chapter 4), object detection systems have their own metrics to evaluate their detection performance. In this section, we will explain the most popular metrics, like mean average precision (mAP), precision-recall curve (PR curve), and intersection over union (IoU).
Now, let’s dive one level deeper into each one of these components to build an intuition about what their goals are.
In this step, the system looks at the image and proposes RoIs for further analysis. RoIs are regions that the system believes have a high likelihood of containing an object; that likelihood is called the objectness score (figure 7.2). Regions with high objectness scores are passed to the next steps; regions with low scores are abandoned.
There are several approaches to generate region proposals. Originally, the selective search algorithm was used to generate object proposals; we will talk more about this algorithm when we discuss the R-CNN network. Other approaches use more complex visual features extracted from the image by a deep neural network to generate regions (for example, based on the features from a DL model).
We will talk in more detail about how different object detection systems approach this task. The important thing to note is that this step produces a lot (thousands) of bounding boxes to be further analyzed and classified by the network. During this step, the network analyzes these regions in the image and classifies each region as foreground (object) or background (no object) based on its objectness score. If the objectness score is above a certain threshold, then this region is considered a foreground and pushed forward in the network. Note that this threshold is configurable based on your problem. If the threshold is too low, your network will exhaustively generate all possible proposals, and you will have a better chance of detecting all objects in the image. On the flip side, this is very computationally expensive and will slow down detection. So, the trade-off with generating region proposals is the number of regions versus computational complexity--and the right approach is to use problem-specific information to reduce the number of RoIs.
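To make the threshold trade-off concrete, here is a minimal sketch of objectness filtering. The function name `filter_proposals`, the box values, and the scores are all hypothetical and chosen only for illustration; real detectors apply this step inside the network.

```python
def filter_proposals(proposals, objectness_threshold):
    """Keep only the proposals whose objectness score reaches the threshold.

    `proposals` is a list of (box, objectness_score) pairs.
    """
    return [(box, score) for box, score in proposals
            if score >= objectness_threshold]


# Hypothetical proposals: (x, y, w, h) boxes with made-up objectness scores.
proposals = [
    ((120, 80, 60, 90), 0.92),
    ((118, 85, 64, 88), 0.40),
    ((300, 200, 30, 30), 0.05),
]

# A higher threshold keeps fewer regions (cheaper, but may miss objects).
strict = filter_proposals(proposals, objectness_threshold=0.5)
# A very low threshold keeps almost everything (thorough, but expensive).
loose = filter_proposals(proposals, objectness_threshold=0.01)
print(len(strict), len(loose))
```

With these made-up scores, the strict setting keeps one region and the loose setting keeps all three, which is the number-of-regions versus computation trade-off in miniature.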
This component includes the pretrained CNN that is used to extract features from the input image that are representative for the task at hand and to use these features to determine the class of the image. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalize fairly well. For example, a model trained on the MS COCO or ImageNet dataset is able to extract fairly generic features.
In this step, the network analyzes all the regions that have been identified as having a high likelihood of containing an object and makes two predictions for each region:
Bounding-box prediction--The coordinates that locate the box surrounding the object. The bounding box coordinates are represented as the tuple (x, y, w, h), where x and y are the coordinates of the center point of the bounding box and w and h are the width and height of the box.
Class prediction --The classic softmax function that predicts the class probability for each object.
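The two predictions above can be illustrated with a small sketch. It assumes boxes are stored as center-format tuples (x, y, w, h), as described, and converts them to corner coordinates, which is often the more convenient form for drawing or computing overlap. The helper names and the example logits are hypothetical.

```python
import math


def center_to_corners(box):
    """Convert a (x, y, w, h) center-format box to (x1, y1, x2, y2) corners."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# A box centered at (50, 40) that is 20 wide and 10 high.
corners = center_to_corners((50, 40, 20, 10))

# Hypothetical raw scores for three classes, for example (dog, cat, background).
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
print(corners, predicted_class)
```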
Since thousands of regions are proposed, each object will often have multiple bounding boxes surrounding it with the correct classification. For example, take a look at the image of the dog in figure 7.3. The network was clearly able to find the object (dog) and successfully classify it. But the detection fired a total of five times because the dog was present in the five RoIs produced in the previous step: hence the five bounding boxes around the dog in the figure. Although the detector was able to successfully locate the dog in the image and classify it correctly, this is not exactly what we need. We need just one bounding box for each object for most problems. In some problems, we only want the one box that fits the object best. What if we are building a system to count dogs in an image? Our current system will count five dogs. We don’t want that. This is when the non-maximum suppression technique comes in handy.
As you can see in figure 7.4, one of the problems of an object detection algorithm is that it may find multiple detections of the same object. So, instead of creating only one bounding box around the object, it draws multiple boxes for the same object. NMS is a technique that makes sure the detection algorithm detects each object only once. As the name implies, NMS looks at all the boxes surrounding an object to find the box that has the maximum prediction probability, and it suppresses or eliminates the other boxes (hence the name).
The general idea of NMS is to reduce the number of candidate boxes to only one bounding box for each object. For example, if the object in the frame is fairly large and more than 2,000 object proposals have been generated, it is quite likely that some of them will have significant overlap with each other and the object.
Let’s see the steps of how the NMS algorithm works:
Discard all bounding boxes that have predictions that are less than a certain threshold, called the confidence threshold. This threshold is tunable, which means a box will be suppressed if the prediction probability is less than the set threshold.
Look at all the remaining boxes, and select the bounding box with the highest probability.
Calculate the overlap between the selected box and each of the remaining boxes that have the same class prediction. This overlap metric is called intersection over union (IoU). IoU is explained in detail in the next section.
Suppress any box whose IoU with the selected box is greater than a certain threshold (called the NMS threshold), because it most likely covers the same object; then repeat the selection with the boxes that remain. Usually the NMS threshold is equal to 0.5, but it is tunable as well if you want to output fewer or more bounding boxes.
NMS techniques are typically standard across the different detection frameworks, but it is an important step that may require tweaking hyperparameters such as the confidence threshold and the NMS threshold based on the scenario.
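The NMS steps can be sketched in a few lines of plain Python. This is a minimal illustration of the common greedy formulation, in which a box is suppressed when its IoU with the selected, higher-scoring box of the same class exceeds the NMS threshold. It assumes boxes are given as (x1, y1, x2, y2) corners; the function names and the example detections are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, conf_threshold=0.5, nms_threshold=0.5):
    """Greedy NMS over a list of (box, score, label) detections."""
    # Step 1: discard low-confidence boxes, then sort by score (highest first).
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    while candidates:
        # Step 2: select the remaining box with the highest score.
        best = candidates.pop(0)
        kept.append(best)
        # Steps 3 and 4: drop same-class boxes that overlap it too much.
        candidates = [
            d for d in candidates
            if d[2] != best[2] or iou(d[0], best[0]) <= nms_threshold
        ]
    return kept


# Hypothetical detections: three overlapping "dog" boxes, one "cat" box,
# and one low-confidence box.
detections = [
    ((100, 100, 200, 200), 0.9, "dog"),
    ((105, 105, 205, 205), 0.8, "dog"),
    ((110, 95, 210, 195), 0.7, "dog"),
    ((300, 300, 400, 400), 0.6, "cat"),
    ((0, 0, 50, 50), 0.2, "dog"),
]
kept = non_max_suppression(detections)
print([(label, score) for _, score, label in kept])
```

In this example the three overlapping dog boxes collapse to the single highest-scoring one, the cat box survives because it belongs to another class, and the 0.2 box is removed by the confidence threshold.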
When evaluating the performance of an object detector, we use two main evaluation metrics: frames per second and mean average precision.
The most common metric used to measure detection speed is the number of frames per second (FPS). For example, Faster R-CNN operates at only 7 FPS, whereas SSD operates at 59 FPS. In benchmarking experiments, you will see the authors of a paper state their network results as: “Network X achieves mAP of Y% at Z FPS,” where X is the network name, Y is the mAP percentage, and Z is the FPS.
The most common evaluation metric used in object recognition tasks is mean average precision (mAP). It is a percentage from 0 to 100, and higher values are typically better, but its value is different from the accuracy metric used in classification.
To understand how mAP is calculated, you first need to understand intersection over union (IoU) and the precision-recall curve (PR curve). Let’s explain IoU and the PR curve and then come back to mAP.
This measure evaluates the overlap between two bounding boxes: the ground truth bounding box ($B_{\text{ground truth}}$) and the predicted bounding box ($B_{\text{predicted}}$). By applying the IoU, we can tell whether a detection is valid (True Positive) or not (False Positive). Figure 7.5 illustrates the IoU between a ground truth bounding box and a predicted bounding box.
The intersection over the union value ranges from 0 (no overlap at all) to 1 (the two bounding boxes overlap each other 100%). The higher the overlap between the two bounding boxes (IoU value), the better (figure 7.6).
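To calculate the IoU of a prediction, we need the following:
The ground truth bounding box ($B_{\text{ground truth}}$): the hand-labeled bounding box created during the labeling process
The predicted bounding box ($B_{\text{predicted}}$): the bounding box output by our model
We calculate IoU by dividing the area of overlap by the area of the union, as in the following equation:
$$\text{IoU} = \frac{\text{area}(B_{\text{ground truth}} \cap B_{\text{predicted}})}{\text{area}(B_{\text{ground truth}} \cup B_{\text{predicted}})}$$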
IoU is used to define a correct prediction, meaning a prediction (True Positive) that has an IoU greater than some threshold. This threshold is a tunable value depending on the challenge, but 0.5 is a standard value. For example, some challenges, like Microsoft COCO, use mAP@0.5 (IoU threshold of 0.5) or mAP@0.75 (IoU threshold of 0.75). If the IoU value is above this threshold, the prediction is considered a True Positive (TP); and if it is below the threshold, it is considered a False Positive (FP).
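As a worked example, the following sketch computes the IoU of a predicted box against a ground truth box and applies the threshold rule described above. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function names and the box values are hypothetical.

```python
def iou(box_a, box_b):
    """Area of overlap divided by area of union for (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, iou_threshold=0.5):
    """A prediction counts as a TP when its IoU is above the threshold."""
    return iou(predicted, ground_truth) > iou_threshold


ground_truth = (50, 50, 150, 150)
predicted = (60, 60, 160, 160)

# Overlap is 90 x 90 = 8100; union is 10000 + 10000 - 8100 = 11900.
score = iou(predicted, ground_truth)
print(round(score, 3))
print(is_true_positive(predicted, ground_truth, iou_threshold=0.5))
print(is_true_positive(predicted, ground_truth, iou_threshold=0.75))
```

The same prediction is a True Positive at an IoU threshold of 0.5 but a False Positive at 0.75, which is why results should always be read together with the threshold they were computed at.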
With the TP and FP defined, we can now calculate the precision and recall of our detection for a given class across the testing dataset. As explained in chapter 4, we calculate the precision and recall as follows (recall that FN stands for False Negative):
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
After calculating the precision and recall for all classes, the PR curve is then plotted as shown in figure 7.7.
The PR curve is a good way to evaluate the performance of an object detector as the confidence threshold is varied; a separate curve is plotted for each object class. A detector is considered good if its precision stays high as recall increases, which means that as you vary the confidence threshold, both the precision and the recall remain high. On the other hand, a poor detector needs to increase the number of FPs (lower precision) in order to achieve a high recall. That’s why the PR curve usually starts with high precision values, decreasing as recall increases.
Now that we have the PR curve, we can calculate the average precision (AP) by calculating the area under the curve (AUC). Finally, the mAP for object detection is the average of the AP calculated for all the classes. It is also important to note that some research papers use AP and mAP interchangeably.
To recap, the mAP is calculated as follows:
Get each bounding box’s associated objectness score (probability of the box containing an object).
Compute the PR curve for each class by varying the score threshold.
Calculate the AP: the area under the PR curve. In this step, the AP is computed for each class.
Calculate the mAP: the average AP over all the different classes.
The last thing to note about mAP is that it is more complicated to calculate than other traditional metrics like accuracy. The good news is that you don’t need to compute mAP values yourself: most DL object detection implementations handle computing the mAP for you, as you will see later in this chapter.
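For intuition only, the recap steps can be sketched as follows. This is a simplified illustration, not the exact protocol of any benchmark: it assumes each detection has already been marked TP or FP using the IoU threshold, and it computes the area under the PR curve with an all-point interpolation in which precision is made non-increasing before summing. PASCAL VOC and MS COCO each define their own sampling details. All names and numbers are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """AP for one class.

    `detections` is a list of (score, is_true_positive) pairs;
    `num_ground_truth` is the number of labeled objects of this class.
    """
    if num_ground_truth == 0 or not detections:
        return 0.0
    # Varying the score threshold is equivalent to walking down the ranked list.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = false_pos = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            true_pos += 1
        else:
            false_pos += 1
        precisions.append(true_pos / (true_pos + false_pos))
        recalls.append(true_pos / num_ground_truth)
    # Make precision non-increasing from right to left (interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated PR curve.
    area, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def mean_average_precision(per_class):
    """mAP: the mean of the per-class AP values."""
    aps = [average_precision(dets, n_gt) for dets, n_gt in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical results: class -> (detections, number of ground truth boxes).
results = {
    "dog": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "cat": ([(0.95, True), (0.6, True)], 2),
}
ap_dog = average_precision(*results["dog"])
ap_cat = average_precision(*results["cat"])
map_value = mean_average_precision(results)
print(round(ap_dog, 3), round(ap_cat, 3), round(map_value, 3))
```

In practice you would rely on the evaluation code that ships with your detection framework or dataset so that your numbers are comparable with published results.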
Now that we understand the general framework of object detection algorithms, let’s dive deeper into three of the most popular. In this chapter, we will discuss the R-CNN family of networks, SSD, and YOLO networks in detail to see how object detectors have evolved over time. We will also examine the pros and cons of each network so you can choose the most appropriate algorithm for your problem.
The R-CNN family of object detection techniques, usually referred to as R-CNNs (short for region-based convolutional neural networks), was developed by Ross Girshick et al. in 2014.¹ The R-CNN family expanded to include Fast R-CNN² and Faster R-CNN³ in 2015 and 2016, respectively. In this section, I’ll quickly walk you through the evolution of the R-CNN family from R-CNNs to Fast R-CNN to Faster R-CNN, and then we will dive deeper into the Faster R-CNN architecture and code implementation.
R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how all of the object detection algorithms in the family work. It was one of the first large, successful applications of convolutional neural networks to the problem of object detection and localization, and it paved the way for the other advanced detection algorithms. The approach was demonstrated on benchmark datasets, achieving then-state-of-the-art results on the PASCAL VOC-2012 dataset and the ILSVRC 2013 object detection challenge. Figure 7.8 shows a summary of the R-CNN model architecture.
The R-CNN model consists of four components:
Extract regions of interest --Also known as extracting region proposals. These regions have a high probability of containing an object. An algorithm called selective search scans the input image to find regions that contain blobs, and proposes them as RoIs to be processed by the next modules in the pipeline. The proposed RoIs are then warped to have a fixed size; they usually vary in size, but as we learned in previous chapters, CNNs require a fixed input image size.
Feature extraction module --We run a pretrained convolutional network on top of the region proposals to extract features from each candidate region. This is the typical CNN feature extractor that we learned about in previous chapters.
Classification module --We train a classifier like a support vector machine (SVM), a traditional machine learning algorithm, to classify candidate detections based on the extracted features from the previous step.
Localization module --Also known as a bounding-box regressor. Let’s take a step back to understand regression. ML problems are categorized as classification or regression problems. Classification algorithms output discrete, predefined classes (dog, cat, elephant), whereas regression algorithms output continuous value predictions. In this module, we want to predict the location and size of the bounding box that surrounds the object. The bounding box is represented by identifying four values: the x and y coordinates of the box’s origin (x, y), the width, and the height of the box (w, h). Putting this together, the regressor predicts the four real-valued numbers that define the bounding box as the following tuple: (x, y, w, h).
Figure 7.9 illustrates the R-CNN architecture in an intuitive way. As you can see, the network first proposes RoIs, then extracts features, and then classifies those regions based on their features. In essence, we have turned object detection into an image classification problem.
We learned in the previous section that R-CNNs are composed of four modules: selective search region proposal, feature extractor, classifier, and bounding-box regressor. All of the R-CNN modules need to be trained except the selective search algorithm. So, in order to train R-CNNs, we need to do the following:
Train the feature extractor CNN. This is a typical CNN training process. We either train a network from scratch, which rarely happens, or fine-tune a pretrained network, as we learned to do in chapter 6.
Train the SVM classifier. The SVM algorithm is not covered in this book, but it is a traditional ML classifier that is no different from DL classifiers in the sense that it needs to be trained on labeled data.
Train the bounding-box regressors. This model outputs four real-valued numbers for each of the K object classes to tighten the region bounding boxes.
Looking through the R-CNN learning steps, you can see that training an R-CNN model is expensive and slow. The training process involves training three separate modules without much shared computation. This multistage pipeline training is one of the disadvantages of R-CNNs, as we will see next.
R-CNN is very simple to understand, and it achieved state-of-the-art results when it first came out, especially when using deep ConvNets to extract features. However, it is not actually a single end-to-end system that learns to localize via a deep neural network. Rather, it is a combination of standalone algorithms, added together to perform object detection. As a result, it has the following notable drawbacks:
Object detection is very slow. For each image, the selective search algorithm proposes about 2,000 RoIs to be examined by the entire pipeline (CNN feature extractor and classifier). This is very computationally expensive because it performs a ConvNet forward pass for each object proposal without sharing computation, which makes it incredibly slow. This high computation need means R-CNN is not a good fit for many applications, especially real-time applications that require fast inferences like self-driving cars and many others.
Training is a multi-stage pipeline. As discussed earlier, R-CNNs require the training of three modules: CNN feature extractor, SVM classifier, and bounding-box regressors. Thus the training process is very complex and not an end-to-end training.
Training is expensive in terms of space and time. When training the SVM classifier and bounding-box regressor, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, the training process for a few thousand images takes days using GPUs. The training process is expensive in space as well, because the extracted features require hundreds of gigabytes of storage.
What we need is an end-to-end DL system that fixes the disadvantages of R-CNN while improving its speed and accuracy.
Fast R-CNN was an immediate descendant of R-CNN, developed in 2015 by Ross Girshick. Fast R-CNN resembled the R-CNN technique in many ways but improved on its detection speed while also increasing detection accuracy through two main changes:
Instead of starting with the region proposal module and then using the feature extraction module, like R-CNN, Fast R-CNN proposes that we apply the CNN feature extractor first to the entire input image and then propose regions. This way, we run only one ConvNet over the entire image instead of 2,000 ConvNets over 2,000 overlapping regions.
It extends the ConvNet’s job to do the classification part as well, by replacing the traditional SVM machine learning algorithm with a softmax layer. This way, we have a single model to perform both tasks: feature extraction and object classification.
As shown in figure 7.10, Fast R-CNN generates region proposals based on the last feature map of the network, not from the original image like R-CNN. As a result, we can train just one ConvNet for the entire image. In addition, instead of training many different SVM algorithms to classify each object class, a single softmax layer outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVMs.
The architecture of Fast R-CNN consists of the following modules:
Feature extractor module --The network starts with a ConvNet to extract features from the full image.
RoI extractor --The selective search algorithm proposes about 2,000 region candidates per image.
RoI pooling layer --This is a new component that was introduced to extract a fixed-size window from the feature map before feeding the RoIs to the fully connected layers. It uses max pooling to convert the features inside any valid RoI into a small feature map with a fixed spatial extent of height × width (H × W ). The RoI pooling layer will be explained in more detail in the Faster R-CNN section; for now, understand that it is applied on the last feature map layer extracted from the CNN, and its goal is to extract fixed-size RoIs to feed to the fully connected layers and then the output layers.
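To see how RoI pooling turns regions of different sizes into fixed-size outputs, consider this minimal single-channel sketch. It assumes the feature map is a 2D list and the RoI is already expressed in feature-map coordinates as (x1, y1, x2, y2), with x2 and y2 exclusive; the bin boundaries use one simple rounding scheme, and real implementations differ in such details. The function name and values are hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the RoI of a 2D feature map into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi  # x2 and y2 are exclusive
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range covered by output row i.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # Column range covered by output column j.
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# A 4 x 6 feature map filled with the values 1..24.
feature_map = [[row * 6 + col + 1 for col in range(6)] for row in range(4)]

wide_roi = (1, 0, 6, 3)   # 5 wide, 3 high
tall_roi = (0, 0, 2, 4)   # 2 wide, 4 high

pooled_wide = roi_max_pool(feature_map, wide_roi)
pooled_tall = roi_max_pool(feature_map, tall_roi)
print(pooled_wide)
print(pooled_tall)
```

Both RoIs produce a 2 × 2 output even though their shapes differ, which is what allows the fully connected layers that follow to receive inputs of a fixed size.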
Since Fast R-CNN is an end-to-end learning architecture to learn the class of an object as well as the associated bounding box position and size, the loss is multi-task loss. With multi-task loss, the output has the softmax classifier and bounding-box regressor, as shown in figure 7.10.
In any optimization problem, we need to define a loss function that our optimizer algorithm is trying to minimize. (Chapter 2 gives more details about optimization and loss functions.) In object detection problems, our goal is to optimize for two goals: object classification and object localization. Therefore, we have two loss functions in this problem: Lcls for the classification loss and Lloc for the bounding box prediction defining the object location.
A Fast R-CNN network has two sibling output layers with two loss functions:
Classification --The first outputs a discrete probability distribution (per RoI) over K + 1 categories (we add one class for the background). The probability p is computed by a softmax over the K + 1 outputs of a fully connected layer. The classification loss function is a log loss for the true class u:
$$L_{cls}(p, u) = -\log p_u$$
where u is the true label, u ∈ {0, 1, 2, . . ., K}; u = 0 is the background; and p is the discrete probability distribution per RoI over K + 1 classes.
Regression --The second sibling layer outputs bounding box regression offsets $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$ for each of the K object classes, which are compared against the true bounding box v = (x, y, w, h). The loss function is the bounding box loss for class u:
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^{\text{smooth}}(t^u_i - v_i)$$
The overall loss combines the two:
$$L(p, u, t^u, v) = L_{cls}(p, u) + [u \geq 1]\, L_{loc}(t^u, v)$$
Note that [u ≥ 1] is added before the regression loss to indicate 0 when the region inspected doesn’t contain any object and contains a background. It is a way of ignoring the bounding box regression when the classifier labels the region as a background. The indicator function [u ≥ 1] is defined as
$$[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$
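The multi-task loss can be sketched numerically for a single RoI. This is an illustration under stated assumptions: it uses the smooth L1 function as defined in the Fast R-CNN paper (0.5x² when |x| < 1, and |x| − 0.5 otherwise) and weights the two terms equally. The function names and the example numbers are hypothetical.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def fast_rcnn_loss(p, u, t_u, v):
    """Multi-task loss for one RoI.

    p   -- class probabilities over K + 1 classes (index 0 is background)
    u   -- true class label
    t_u -- predicted box offsets for class u
    v   -- ground truth box regression targets
    """
    classification_loss = -math.log(p[u])
    localization_loss = sum(smooth_l1(t - target) for t, target in zip(t_u, v))
    indicator = 1 if u >= 1 else 0  # ignore the box loss for background RoIs
    return classification_loss + indicator * localization_loss


p = [0.1, 0.7, 0.2]            # background, class 1, class 2
t_u = (0.1, 0.2, 0.0, 2.0)     # predicted offsets
v = (0.0, 0.0, 0.0, 0.5)       # regression targets

object_loss = fast_rcnn_loss(p, u=1, t_u=t_u, v=v)
background_loss = fast_rcnn_loss(p, u=0, t_u=t_u, v=v)
print(round(object_loss, 4), round(background_loss, 4))
```

For the object RoI both terms contribute; for the background RoI only the classification term remains, because the indicator switches the box loss off.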
Fast R-CNN is much faster in terms of testing time, because we don’t have to feed 2,000 region proposals to the convolutional neural network for every image. Instead, a convolution operation is done only once per image, and a feature map is generated from it. Training is also faster because all the components are in one CNN network: feature extractor, object classifier, and bounding-box regressor. However, there is a big bottleneck remaining: the selective search algorithm for generating region proposals is very slow, and the proposals are generated separately, outside the network. The last step to achieve a complete end-to-end object detection system using DL is to find a way to combine the region proposal algorithm into our end-to-end DL network. This is what Faster R-CNN does, as we will see next.
Faster R-CNN is the third iteration of the R-CNN family, developed in 2016 by Shaoqing Ren et al. Similar to Fast R-CNN, the image is provided as an input to a convolutional network that provides a convolutional feature map. Instead of using a selective search algorithm on the feature map to identify the region proposals, a region proposal network (RPN) is used to predict the region proposals as part of the training process. The predicted region proposals are then reshaped using an RoI pooling layer and used to classify the image within the proposed region and predict the offset values for the bounding boxes. These improvements both reduce the number of region proposals and accelerate the test-time operation of the model to near real-time with then-state-of-the-art performance.
The architecture of Faster R-CNN can be described using two main networks:
Region proposal network (RPN) --Selective search is replaced by a ConvNet that proposes RoIs from the last feature maps of the feature extractor to be considered for investigation. The RPN has two outputs: the objectness score (object or no object) and the box location.
Fast R-CNN --It consists of the typical components of Fast R-CNN:
As you can see in figure 7.11, the input image is presented to the network, and its features are extracted via a pretrained CNN. These features, in parallel, are sent to two different components of the Faster R-CNN architecture:
The RPN to determine where in the image a potential object could be. At this point, we do not know what the object is, just that there is potentially an object at a certain location in the image.
The output is then passed into two fully connected layers: one for the object classifier and one for the bounding box coordinate predictions to obtain our final localizations.
This architecture achieves an end-to-end trainable, complete object detection pipeline where all of the required components are inside the network:
Similar to Fast R-CNN, the first step is to use a pretrained CNN and slice off its classification part. The base network is used to extract features from the input image. We covered how this works in detail in chapter 6. In this component, you can use any of the popular CNN architectures based on the problem you are trying to solve. The original Faster R-CNN paper used ZF⁴ and VGG⁵ pretrained networks on ImageNet; but since then, there have been lots of different networks with a varying number of weights. For example, MobileNet,⁶ a smaller and efficient network architecture optimized for speed, has approximately 3.3 million parameters, whereas ResNet-152 (152 layers)--once the state of the art in the ImageNet classification competition--has around 60 million. Most recently, new architectures like DenseNet⁷ are both improving results and reducing the number of parameters.
As we learned in earlier chapters, each convolutional layer creates abstractions based on the previous information. The first layer usually learns edges, the second finds patterns in edges to activate for more complex shapes, and so forth. Eventually we end up with a convolutional feature map that can be fed to the RPN to extract regions that contain objects.
The RPN identifies regions that could potentially contain objects of interest, based on the last feature map of the pretrained convolutional neural network. An RPN is also known as an attention network because it guides the network’s attention to interesting regions in the image. Faster R-CNN uses an RPN to bake the region proposal directly into the R-CNN architecture instead of running a selective search algorithm to extract RoIs.
The architecture of the RPN is composed of two layers (figure 7.12):
A 3 × 3 convolutional layer, applied as a sliding window over the last feature map of the base network.
Two parallel 1 × 1 convolutional layers: a classification layer that is used to predict whether the region contains an object (the score of it being background or foreground), and a layer for regression or bounding box prediction.
The 3 × 3 convolutional layer is applied on the last feature map of the base network where a sliding window of size 3 × 3 is passed over the feature map. The output is then passed to two 1 × 1 convolutional layers: a classifier and a bounding-box regressor. Note that the classifier and the regressor of the RPN are not trying to predict the class of the object and its bounding box; this comes later, after the RPN. Remember, the goal of the RPN is to determine whether the region has an object to be investigated afterward by the fully connected layers. In the RPN, we use a binary classifier to predict the objectness score of the region, to determine the probability of this region being a foreground (contains an object) or a background (doesn’t contain an object). It basically looks at the region and asks, “Does this region contain an object?” If the answer is yes, then the region is passed along for further investigation by RoI pooling and the final output layers (see figure 7.13).
To understand how the regressor predicts the bounding box, let’s first define the bounding box. It is the box that surrounds the object and is identified by the tuple (x, y, w, h), where x and y are the coordinates in the image that describe the center of the bounding box and h and w are the height and width of the bounding box. Researchers have found that defining the (x, y) coordinates of the center point can be challenging because we have to enforce some rules to make sure the network predicts values inside the boundaries of the image. Instead, we can create reference boxes called anchor boxes in the image and make the regression layer predict offsets from these boxes called deltas (Δx, Δy, Δw, Δh) to adjust the anchor boxes to better fit the object to get final proposals (figure 7.14).
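The idea of adjusting an anchor box with predicted deltas can be sketched as follows. The text does not spell out how the deltas are applied, so this example assumes the parameterization used in the Faster R-CNN paper: the center shifts are scaled by the anchor’s width and height, and the size changes are applied in log space. The function name and values are hypothetical.

```python
import math


def apply_deltas(anchor, deltas):
    """Adjust a center-format anchor (x, y, w, h) with (dx, dy, dw, dh)."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    x = xa + dx * wa          # shift the center by a fraction of the width
    y = ya + dy * ha          # shift the center by a fraction of the height
    w = wa * math.exp(dw)     # scale the width (log-space delta)
    h = ha * math.exp(dh)     # scale the height (log-space delta)
    return (x, y, w, h)


anchor = (100.0, 100.0, 64.0, 64.0)

# Zero deltas leave the anchor unchanged.
unchanged = apply_deltas(anchor, (0.0, 0.0, 0.0, 0.0))

# Move right by a quarter width, up by half a height, and double the height.
proposal = apply_deltas(anchor, (0.25, -0.5, 0.0, math.log(2)))
print(unchanged)
print(proposal)
```

Because the network predicts small relative adjustments instead of absolute coordinates, the same regression layer can be reused at every sliding-window position.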
Using a sliding window approach, the RPN generates k regions for each location in the feature map. These regions are represented as anchor boxes. The anchors are centered in the middle of their corresponding sliding window and differ in terms of scale and aspect ratio to cover a wide variety of objects. They are fixed bounding boxes that are placed throughout the image to be used for reference when first predicting object locations. In their paper, Ren et al. generated nine anchor boxes that all had the same center but that had three different aspect ratios and three different scales.
Figure 7.15 shows an example of how anchor boxes are applied. Anchors are at the center of the sliding windows; each window has k anchor boxes with the anchor at their center.
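Anchor generation can be sketched in a few lines. The specific scales (128, 256, 512 pixels), aspect ratios (1:2, 1:1, 2:1), and the stride of 16 used below are the values reported in the Faster R-CNN paper for a VGG backbone; they are stated here as assumptions and are typically tuned per dataset. The function names are hypothetical.

```python
import math


def generate_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered at (cx, cy).

    Each anchor is (x, y, w, h); `ratio` is height / width, and each scale s
    gives an anchor whose area is approximately s * s.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors


def grid_anchors(feature_h, feature_w, stride=16):
    """Place the anchor set at the center of every feature-map location."""
    all_anchors = []
    for row in range(feature_h):
        for col in range(feature_w):
            cx = col * stride + stride / 2
            cy = row * stride + stride / 2
            all_anchors.extend(generate_anchors(cx, cy))
    return all_anchors


anchors = generate_anchors(8.0, 8.0)
all_anchors = grid_anchors(feature_h=3, feature_w=4)
print(len(anchors), len(all_anchors))
```

Even a tiny 3 × 4 feature map yields 108 anchors, which shows why the RPN must score and filter anchors before the later stages examine them.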
The RPN is trained to classify each anchor box by outputting an objectness score and to approximate the four coordinates of the object (location parameters). It is trained on bounding boxes labeled by human annotators. A labeled box is called the ground truth.
For each anchor box, the overlap probability value (p) is computed, which indicates how much these anchors overlap with the ground-truth bounding boxes:
If an anchor has high overlap with a ground-truth bounding box, then it is likely that the anchor box includes an object of interest, and it is labeled as positive with respect to the object versus no object classification task. Similarly, if an anchor has small overlap with a ground-truth bounding box, it is labeled as negative. During the training process, the positive and negative anchors are passed as input to two fully connected layers corresponding to the classification of anchors as containing an object or no object, and to the regression of location parameters (four coordinates), respectively. Corresponding to the k number of anchors from a location, the RPN network outputs 2k scores and 4k coordinates. Thus, for example, if the number of anchors per sliding window (k) is 9, then the RPN outputs 18 objectness scores and 36 location coordinates (figure 7.16).
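The labeling rule can be sketched with an IoU helper and two thresholds. The values 0.7 (positive) and 0.3 (negative) are the ones given in the Faster R-CNN paper; anchors that fall in between are ignored during training. This is a simplified sketch: it uses corner-format boxes (xmin, ymin, xmax, ymax) for convenience, and it omits the paper's additional rule that the best-matching anchor for each ground-truth box is also marked positive. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (background), or -1 (ignored during training)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


gt_boxes = [(0, 0, 10, 10)]
example_anchors = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
labels = [label_anchor(a, gt_boxes) for a in example_anchors]
print(labels)

k = 9                                    # anchors per sliding-window location
n_scores, n_coords = 2 * k, 4 * k        # object/no-object scores and box coordinates
print(n_scores, n_coords)
```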
The output fully connected layer takes two inputs: the feature maps coming from the base ConvNet and the RoIs coming from the RPN. It then classifies the selected regions and outputs their prediction class and the bounding box parameters. The object classification layer in Faster R-CNN uses softmax activation, while the location regression layer uses linear regression over the coordinates defining the location as a bounding box. All of the network parameters are trained together using multi-task loss.
Similar to Fast R-CNN, Faster R-CNN is optimized for a multi-task loss function that combines the losses of classification and bounding box regression:
The loss equation might look a little overwhelming at first, but it is simpler than it appears. Understanding it is not necessary to be able to run and train Faster R-CNNs, so feel free to skip this section. But I encourage you to power through this explanation, because it will add a lot of depth to your understanding of how the optimization process works under the hood. Let’s go through the symbols first; see table 7.2.
Now that you know the definitions of the symbols, let’s try to read the multi-task loss function again. To help understand this equation, just for a moment, ignore the normalization terms and the (i) terms. Here’s the simplified loss function for each instance (i):
Loss = Lcls(p, p*) + p* · L1smooth(t - t*)
This simplified function is the summation of two loss functions: the classification loss and the location loss (bounding box). Let’s look at them one at a time:
The idea of any loss function is that it measures how far the predicted value is from the true value. The classification loss is the cross-entropy function explained in chapter 2. Nothing new. It is a log loss function that calculates the error between the prediction probability (p) and the ground truth (p*):
Lcls(p, p*) = -p* · log(p) - (1 - p*) · log(1 - p)
The location loss is the difference between the predicted and true location parameters (t, t*), measured with the smooth L1 loss function. The difference is then multiplied by p*, the ground-truth probability of the region containing an object. If the region does not contain an object, p* is 0, which eliminates the entire location loss for non-object regions.
Finally, we add the values of both losses to create the multi-loss function:
There you have it: the multi-loss function for each instance (i). Put back the (i) and Σ symbols to calculate the summation of losses over all instances.
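The simplified per-instance loss can be written out in a few lines. This is a minimal sketch rather than a training-ready implementation: it uses the standard binary cross-entropy for the classification term and the smooth L1 definition from the Fast R-CNN paper (0.5x² when |x| < 1, otherwise |x| - 0.5), and it leaves out the normalization and balancing terms. The function names are assumptions.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for larger errors."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def classification_loss(p, p_star):
    """Binary cross-entropy between predicted objectness p and label p_star."""
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))


def simplified_loss(p, p_star, t, t_star):
    """Lcls(p, p*) + p* * smooth-L1(t - t*), summed over the four coordinates."""
    location = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return classification_loss(p, p_star) + p_star * location


# Positive anchor: both terms contribute
positive = simplified_loss(0.9, 1, (0.5, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
# Negative anchor: p* = 0 switches the location term off entirely
negative = simplified_loss(0.1, 0, (3.0, 3.0, 3.0, 3.0), (0.0, 0.0, 0.0, 0.0))
print(positive, negative)
```

Note how the negative anchor's large coordinate errors have no effect on its loss, which is exactly the role of the p* factor.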
Table 7.3 recaps the evolution of the R-CNN architecture:
R-CNN --Bounding boxes are proposed by the selective search algorithm. Each is warped, and features are extracted via a deep convolutional neural network such as AlexNet, before a final set of object classifications and bounding box predictions is made with linear SVMs and linear regressors.
Fast R-CNN --A simplified design with a single model. An RoI pooling layer is used after the CNN to consolidate regions. The model predicts both class labels and RoIs directly.
Faster R-CNN --A fully end-to-end DL object detector. It replaces the selective search algorithm to propose RoIs with a region proposal network that interprets features extracted from the deep CNN and learns to propose RoIs directly.
As you might have noticed, each paper proposes improvements to the seminal work done in R-CNN to develop a faster network, with the goal of achieving real-time object detection. The achievements displayed through this set of work are truly amazing, yet none of these architectures manages to create a real-time object detector. Without going into too much detail, the following problems have been identified with these networks:
Fortunately, in the last few years, new architectures have been created to address the bottlenecks of R-CNN and its successors, enabling real-time object detection. The most famous are the single-shot detector (SSD) and you only look once (YOLO), which we will explain in sections 7.3 and 7.4.
Models in the R-CNN family are all region-based. Detection happens in two stages, and thus these models are called two-stage detectors:
The model proposes a set of RoIs using selective search or an RPN. The proposed regions are sparse because the potential bounding-box candidates can be infinite.
One-stage detectors take a different approach. They skip the region proposal stage and run detection directly over a dense sampling of possible locations. This approach is faster and simpler but can cost some accuracy: in general, single-stage detectors tend to be less accurate than two-stage detectors but are significantly faster. In the next two sections, we will examine the SSD and YOLO one-stage object detectors.
The SSD paper was released in 2016 by Wei Liu et al.⁸ The SSD network set new records in speed and accuracy for object detection tasks, scoring over 74% mAP at 59 FPS on standard datasets such as PASCAL VOC and Microsoft COCO.
We learned earlier that the R-CNN family are multi-stage detectors: the network first predicts the objectness score of the bounding box and then passes this box through a classifier to predict the class probability. In single-stage detectors like SSD and YOLO (discussed in section 7.4), the convolutional layers make both predictions directly in one shot: hence the name single-shot detector. The image is passed once through the network, and the objectness score for each bounding box is predicted using logistic regression to indicate the level of overlap with the ground truth. If the bounding box overlaps 100% with the ground truth, the objectness score is 1; and if there is no overlap, the objectness score is 0. We then set a threshold value (0.5) that says, “If the objectness score is above 50%, this bounding box likely has an object of interest, and we get predictions. If it is less than 50%, we ignore the predictions.”
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by an NMS step to produce the final detections. The architecture of the SSD model is composed of three main parts:
Base network to extract feature maps --A standard pretrained network used for high-quality image classification, which is truncated before any classification layers. In their paper, Liu et al. used a VGG16 network. Other networks like VGG19 and ResNet can be used and should produce good results.
Multi-scale feature layers --A series of convolution filters are added after the base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
Non-maximum suppression --NMS is used to eliminate overlapping boxes and keep only one box for each object detected.
As you can see in figure 7.17, layers 4_3, 7, 8_2, 9_2, 10_2, and 11_2 make predictions directly to the NMS layer. We will talk about why these layers progressively decrease in size in section 7.3.3. For now, let’s follow along to understand the end-to-end flow of data in SSD.
You can see in figure 7.17 that the network makes a total of 8,732 detections per class, which are then fed to an NMS layer to reduce them to one detection per object. Where did the number 8,732 come from?
To have more accurate detection, different layers of feature maps also go through a small 3 × 3 convolution for object detection. For example, Conv4_3 is of size 38 × 38 × 512, and a 3 × 3 convolution is applied. There are four bounding boxes for each location, each of which has (number of classes + 4 box values) outputs. Suppose there are 20 object classes plus 1 background class; then each box has 21 + 4 = 25 outputs, and the number of bounding boxes is 38 × 38 × 4 = 5,776. Similarly, we calculate the number of bounding boxes for the other convolutional layers:
Conv7: 19 × 19 × 6 = 2,166 boxes (6 boxes for each location)
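Conv8_2: 10 × 10 × 6 = 600 boxes (6 boxes for each location)
Conv9_2: 5 × 5 × 6 = 150 boxes (6 boxes for each location)
Conv10_2: 3 × 3 × 4 = 36 boxes (4 boxes for each location)
Conv11_2: 1 × 1 × 4 = 4 boxes (4 boxes for each location)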
If we sum them up, we get 5,776 + 2,166 + 600 + 150 + 36 + 4 = 8,732 boxes produced. This is a huge number of boxes to show for our detector. That’s why we apply NMS to reduce the number of the output boxes. As you will see in section 7.4, in YOLO there are 7 × 7 locations at the end with two bounding boxes for each location: 7 × 7 × 2 = 98 boxes.
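The arithmetic is easy to verify with a short script. The feature-map sizes and boxes per location for the last three layers (5 × 5 × 6, 3 × 3 × 4, and 1 × 1 × 4) are taken from the terms 150, 36, and 4 in the sum above; the variable names are made up for this example.

```python
# (layer name, feature-map side length, boxes per location)
prediction_layers = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]

boxes_per_layer = {name: side * side * boxes for name, side, boxes in prediction_layers}
total_boxes = sum(boxes_per_layer.values())

for name, count in boxes_per_layer.items():
    print(f"{name}: {count} boxes")
print("total:", total_boxes)
```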
Now, let’s dive a little deeper into each component of the SSD architecture.
As you can see in figure 7.17, the SSD architecture builds on the VGG16 architecture after slicing off the fully connected classification layers (VGG16 is explained in detail in chapter 5). VGG16 was used as the base network because of its strong performance in high-quality image classification tasks and its popularity for problems where transfer learning helps to improve results. Instead of the original VGG fully connected layers, a set of supporting convolutional layers (from Conv6 onward) was added to enable us to extract features at multiple scales and progressively decrease the size of the input to each subsequent layer.
Following is a simplified code implementation of the VGG16 network used in SSD using Keras. You will not need to implement this from scratch; my goal in including this code snippet is to show you that this is a typical VGG16 network like the one implemented in chapter 5:
from keras.layers import Input, Conv2D, MaxPooling2D

# Input tensor; 300 × 300 × 3 is the SSD300 input size
x = Input(shape=(300, 300, 3))

# Block 1
conv1_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
conv1_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv1_1)
pool1 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv1_2)

# Block 2
conv2_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool1)
conv2_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv2_1)
pool2 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv2_2)

# Block 3
conv3_1 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool2)
conv3_2 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv3_1)
conv3_3 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv3_2)
pool3 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv3_3)

# Block 4 (conv4_3 is the first layer used for predictions)
conv4_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool3)
conv4_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv4_1)
conv4_3 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv4_2)
pool4 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv4_3)

# Block 5 (pool5 uses a 3 × 3 window with stride 1, so the spatial size is kept)
conv5_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool4)
conv5_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv5_1)
conv5_3 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv5_2)
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(1, 1), padding='same')(conv5_3)
You saw VGG16 implemented in Keras in chapter 5. The two main takeaways from adding this here are as follows:
Layer conv4_3 will be used again to make direct predictions.
Layer pool5 will be fed to the next layer (conv6), which is the first of the multiscale features layers.
Consider the following example. Suppose you have the image in figure 7.18, and the network’s job is to draw bounding boxes around all the boats in the image. The process goes as follows:
Similar to the anchors concept in R-CNN, SSD overlays a grid of anchors around the image. For each anchor, the network creates a set of bounding boxes at its center. In SSD, anchors are called priors.
The base network looks at each bounding box as a separate image. For each bounding box, the network asks, “Is there a boat in this box?” Or in other words, “Did I extract any features of a boat in this box?”
When the network finds a bounding box that contains boat features, it sends its coordinates prediction and object classification to the NMS layer.
NMS eliminates all the overlapping boxes except the one with the highest confidence score.
NOTE Liu et al. used VGG16 because of its strong performance in complex image classification tasks. You can use other networks like the deeper VGG19 or ResNet for the base network, and it should perform as well if not better in accuracy; but it could be slower if you chose to implement a deeper network. MobileNet is a good choice if you want a balance between a complex, high-performing deep network and being fast.
Now, on to the next component of the SSD architecture: multi-scale feature layers.
These are convolutional feature layers that are added to the end of the truncated base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
To understand the goal of the multi-scale feature layers and why they vary in size, let’s look at the image of horses in figure 7.19. As you can see, the base network may be able to detect the horse features in the background, but it may fail to detect the horse that is closest to the camera. To understand why, take a close look at the dotted bounding box and try to imagine this box alone outside the context of the full image (see figure 7.20).
Can you see horse features in the bounding box in figure 7.20? No. To deal with objects of different scales in an image, some methods suggest preprocessing the image at different sizes and combining the results afterward (figure 7.21). However, by using feature maps from several different layers in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. As a CNN reduces the spatial dimension gradually, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to detect larger-scale objects. For example, 4 × 4 feature maps are used for larger-scale objects.
To visualize this, imagine that the network reduces the image dimensions to be able to fit all of the horses inside its bounding boxes (figure 7.22). The multi-scale feature layers resize the image dimensions and keep the bounding-box sizes so that they can fit the larger horse. In reality, convolutional layers do not literally reduce the size of the image; this is just for illustration to help us intuitively understand the concept. The image is not just resized, it actually goes through the convolutional process and thus won’t look anything like itself anymore. It will be a completely random-looking image, but it will preserve its features. The convolutional process is explained in detail in chapter 3.
Using multi-scale feature maps improves network accuracy significantly. Liu et al. ran an experiment to measure the advantage gained by adding the multi-scale feature layers. Figure 7.23 shows a decrease in accuracy with fewer layers; you can see the accuracy with different numbers of feature map layers used for object detection.
Notice that network accuracy drops from 74.3% when having the prediction source from all six layers to 62.4% for one source layer. When using only the conv7 layer for prediction, performance is the worst, reinforcing the message that it is critical to spread boxes of different scales over different layers.
Liu et al. decided to add six convolutional layers that decrease in size. They did this with a lot of tuning and trial and error until they produced the best results. As you saw in figure 7.17, convolutional layers 6 and 7 are pretty straightforward. Conv6 has a kernel size of 3 × 3, and conv7 has a kernel size of 1 × 1. Layers 8 through 11, on the other hand, are treated more like blocks, where each block consists of two convolutional layers of kernel sizes 1 × 1 and 3 × 3.
Here is the code implementation in Keras for layers 6 through 11 (you can see the full implementation in the book’s downloadable code):
from keras.layers import Conv2D, ZeroPadding2D

# conv6 and conv7 (these continue from pool5 of the base network)
conv6 = Conv2D(1024, (3, 3), dilation_rate=(6, 6), activation='relu', padding='same')(pool5)
conv7 = Conv2D(1024, (1, 1), activation='relu', padding='same')(conv6)

# conv8 block: zero-pad by 1 so the stride-2 'valid' convolution maps 19 × 19 to 10 × 10
conv8_1 = Conv2D(256, (1, 1), activation='relu', padding='same')(conv7)
conv8_1 = ZeroPadding2D(padding=((1, 1), (1, 1)))(conv8_1)
conv8_2 = Conv2D(512, (3, 3), strides=(2, 2), activation='relu', padding='valid')(conv8_1)

# conv9 block: the same padding maps 10 × 10 to 5 × 5
conv9_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv8_2)
conv9_1 = ZeroPadding2D(padding=((1, 1), (1, 1)))(conv9_1)
conv9_2 = Conv2D(256, (3, 3), strides=(2, 2), activation='relu', padding='valid')(conv9_1)

# conv10 block: 5 × 5 to 3 × 3
conv10_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv9_2)
conv10_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', padding='valid')(conv10_1)

# conv11 block: 3 × 3 to 1 × 1
conv11_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv10_2)
conv11_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', padding='valid')(conv11_1)
As mentioned before, if you are not working in research or academia, you most probably won’t need to implement object detection architectures yourself. In most cases, you will download an open source implementation and build on it to work on your problem. I just added these code snippets to help you internalize the information discussed about different layer architectures.
Next, we discuss the third and last component of the SSD architecture: NMS.
Given the large number of boxes generated by the detection layer per class during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes by applying the NMS technique (explained earlier in this chapter). Boxes whose confidence score falls below a certain threshold are discarded, as are boxes that overlap a higher-scoring box by more than an IoU threshold, and only the top N predictions are kept (figure 7.24). This ensures that only the most likely predictions are retained by the network, while the noisier ones are removed.
How does SSD use NMS to prune the bounding boxes? SSD sorts the predicted boxes by their confidence scores. Starting from the top confidence prediction, SSD evaluates whether any previously kept bounding boxes for the same class overlap with the current box above a certain threshold by calculating their IoU. (The IoU threshold value is tunable. Liu et al. chose 0.45 in their paper.) Boxes with IoU above the threshold are ignored because they overlap too much with another box that has a higher confidence score, so they are most likely detecting the same object. At most, we keep the top 200 predictions per image.
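The pruning procedure just described can be sketched in plain Python for the boxes of a single class. This is a minimal greedy version for illustration, not the implementation used by any particular SSD codebase; the function names are assumptions, boxes are given as (xmin, ymin, xmax, ymax), and the defaults of 0.45 and 200 are the values quoted above.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def non_max_suppression(boxes, scores, iou_threshold=0.45, top_k=200):
    """Return the indices of the boxes to keep, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep the box only if it does not overlap too much with a kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == top_k:
            break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

In this example the second box overlaps the first one heavily and has a lower score, so it is suppressed, while the distant third box survives.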
Similar to the R-CNN family, YOLO is a family of object detection networks developed by Joseph Redmon et al. and improved over the years through the following versions:
YOLOv1, published in 2016⁹--Called “unified, real-time object detection” because it is a single-detection network that unifies the two components of a detector: object detector and class predictor.
YOLOv2 (also known as YOLO9000), published later in 2016¹⁰--Capable of detecting over 9,000 objects; hence the name. It has been trained on ImageNet and COCO datasets and has achieved 16% mAP, which is not good; but it was very fast during test time.
YOLOv3, published in 2018¹¹--Significantly larger than previous models and has achieved a mAP of 57.9%, which is the best result yet out of the YOLO family of object detectors.
The YOLO family is a series of end-to-end DL models designed for fast object detection, and it was among the first attempts to build a fast real-time object detector. It is one of the faster object detection algorithms out there. Although the models’ accuracy is close to, but not as good as, that of R-CNNs, they are popular for object detection because of their detection speed, often demonstrated in real-time video or camera feed input.
The creators of YOLO took a different approach than the previous networks. YOLO does not undergo the region proposal step like R-CNNs. Instead, it only predicts over a limited number of bounding boxes by splitting the input into a grid of cells; each cell directly predicts a bounding box and object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction using NMS (figure 7.25).
YOLOv1 proposed the general architecture, YOLOv2 refined the design and made use of predefined anchor boxes to improve bounding-box proposals, and YOLOv3 further refined the model architecture and training process. In this section, we are going to focus on YOLOv3 because it is currently the state-of-the-art architecture in the YOLO family.
The YOLO network splits the input image into a grid of S × S cells. If the center of the ground-truth box falls into a cell, that cell is responsible for detecting the existence of that object. Each grid cell predicts B number of bounding boxes and their objectness score along with their class predictions, as follows:
Coordinates of B bounding boxes --Similar to previous detectors, YOLO predicts four coordinates for each bounding box (bx, by, bw, bh), where x and y are set to be offsets of a cell location.
Objectness score (P0)--indicates the probability that the cell contains an object. The objectness score is passed through a sigmoid function to be treated as a probability with a value range between 0 and 1. The objectness score is calculated as follows:
Class prediction --If the bounding box contains an object, the network predicts the probability of K number of classes, where K is the total number of classes in your problem.
It is important to note that before v3, YOLO used a softmax function for the class scores. In v3, Redmon et al. decided to use sigmoid instead. The reason is that softmax imposes the assumption that each box has exactly one class: if an object belongs to one class, then it’s guaranteed not to belong to another class. While this assumption is true for some datasets, it does not hold when we have overlapping classes like Woman and Person. A multilabel approach models the data more accurately.
As you can see in figure 7.26, for each bounding box (b), the prediction looks like this: [(bounding box coordinates), (objectness score), (class predictions)]. We’ve learned that the bounding box coordinates are four values plus one value for the objectness score and K values for class predictions. Then the total number of values predicted for all bounding boxes is 5B + K multiplied by the number of cells in the grid S × S :
Total predicted values = S × S × (5B + K)
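To make the formula concrete, this small sketch evaluates it for one grid. The values S = 13, B = 3, and K = 80 (the COCO class count) are assumptions chosen for illustration. For comparison, it also computes the per-box layout S × S × B × (5 + K), which is how the YOLOv3 paper describes its output tensor (N × N × [3 × (4 + 1 + 80)]), where every box carries its own class predictions rather than sharing them across the cell.

```python
def values_shared_classes(S, B, K):
    """S × S × (5B + K): class predictions shared by all boxes of a cell."""
    return S * S * (5 * B + K)


def values_per_box_classes(S, B, K):
    """S × S × B × (5 + K): every box has its own class predictions."""
    return S * S * B * (5 + K)


S, B, K = 13, 3, 80  # illustrative values, not fixed by the text
shared = values_shared_classes(S, B, K)
per_box = values_per_box_classes(S, B, K)
print(shared, per_box)
```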
Look closely at figure 7.26. Notice that the prediction feature map has three boxes. You might have wondered why there are three boxes. Similar to the anchors concept in SSD, YOLOv3 has nine anchors to allow for prediction at three different scales per cell. The detection layer makes detections at feature maps of three different sizes having strides 32, 16, and 8, respectively. This means that with an input image of size 416 × 416, we make detections on scales 13 × 13, 26 × 26, and 52 × 52 (figure 7.27). The 13 × 13 layer is responsible for detecting large objects, the 26 × 26 layer is for detecting medium objects, and the 52 × 52 layer detects smaller objects.
This results in the prediction of three bounding boxes for each cell (B = 3). That’s why in figure 7.26, the prediction feature map is predicting Box 1, Box 2, and Box 3. The bounding box responsible for detecting the dog will be the one whose anchor has the highest IoU with the ground-truth box.
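The grid sizes follow directly from dividing the input size by each stride, and multiplying the number of cells by B gives the number of boxes predicted across all three scales. The short sketch below works this out for the 416 × 416 example; the variable names are made up for illustration.

```python
input_size = 416
strides = (32, 16, 8)
boxes_per_cell = 3  # B = 3

# One detection grid per stride: 416 / 32, 416 / 16, 416 / 8
grid_sizes = [input_size // stride for stride in strides]

# Every cell of every grid predicts B boxes
total_boxes = sum(size * size for size in grid_sizes) * boxes_per_cell

print(grid_sizes)
print(total_boxes)
```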
NOTE Detections at different layers help address the issue of detecting small objects, which was a frequent complaint with YOLOv2. The upsampling layers can help the network preserve and learn fine-grained features, which are instrumental for detecting small objects.
The network does this by downsampling the input image until the first detection layer, where a detection is made using feature maps of a layer with stride 32. Layers are then upsampled by a factor of 2 and concatenated with feature maps of previous layers having identical feature-map sizes. Another detection is now made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer with stride 8.
For an input image of size 416 × 416, YOLO predicts ((52 × 52) + (26 × 26) + (13 × 13)) × 3 = 10,647 bounding boxes. That is a huge number of boxes for an output. In our dog example, we have only one object. We want only one bounding box around this object. How do we reduce the boxes from 10,647 down to 1?
First, we filter the boxes based on their objectness score. Generally, boxes having scores below a threshold are ignored. Second, we use NMS to cure the problem of multiple detections of the same object. For example, all three bounding boxes of the outlined grid cell at the center of the image may detect the object, or the adjacent cells may detect the same object.
Now that you understand how YOLO works, going through the architecture will be very simple and straightforward. YOLO is a single neural network that unifies object detection and classification into one end-to-end network. The neural network architecture was inspired by the GoogLeNet model (Inception) for feature extraction. Instead of the Inception modules, YOLO uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers. Redmon and Farhadi called this DarkNet (figure 7.28).
YOLOv2 used a custom deep architecture, darknet-19, an originally 19-layer network supplemented with 11 more layers for object detection. With a 30-layer architecture, YOLOv2 often struggled with small object detections. This was attributed to loss of fine-grained features as the layers downsampled the input. YOLOv2’s architecture was also lacking some of the most important elements that are now staples in most state-of-the-art algorithms: it had no residual blocks, no skip connections, and no upsampling. YOLOv3 incorporates all of these updates.
YOLOv3 uses a variant of DarkNet called Darknet-53 (figure 7.29). It has a 53-layer network that is trained on ImageNet. For the task of detection, 53 more layers are stacked onto it, giving us a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2--but this comes with a great boost in detection accuracy.
We just learned that YOLOv3 makes predictions across three different scales. This becomes a lot clearer when you see the full architecture, shown in figure 7.30.
The input image goes through the DarkNet-53 feature extractor, and then the image is downsampled by the network until layer 79. The network branches out and continues to downsample the image until it makes its first prediction at layer 82. This detection is made on a grid scale of 13 × 13 that is responsible for detecting large objects, as we explained before.
Next the feature map from layer 79 is upsampled by 2x to dimensions 26 × 26 and concatenated with the feature map from layer 61. Then the second detection is made by layer 94 on a grid scale of 26 × 26 that is responsible for detecting medium objects.
Finally, a similar procedure is followed again, and the feature map from layer 91 is subjected to a few upsampling convolutional layers before being depth concatenated with a feature map from layer 36. A third prediction is made by layer 106 on a grid scale of 52 × 52, which is responsible for detecting small objects.
The code for this project was created by Pierluigi Ferrari in his GitHub repository (https://github.com/pierluigiferrari/ssd_keras). The project was adapted for this chapter; you can find this implementation with the book’s downloadable code.
Note that for this project, we are going to build a smaller SSD network called SSD7. SSD7 is a seven-layer version of the SSD300 network. It is important to note that while an SSD7 network would yield some acceptable results, this is not an optimized network architecture. The goal is just to build a low-complexity network that is fast enough for you to train on your personal computer. It took me around 20 hours to train this network on the road traffic dataset; training could take a lot less time on a GPU.
NOTE The original repository created by Pierluigi Ferrari comes with implementation tutorials for SSD7, SSD300, and SSD512 networks. I encourage you to check it out.
In this project, we will use a toy dataset created by Udacity. You can visit Udacity’s GitHub repository for more information on the dataset (https://github.com/udacity/self-driving-car/tree/master/annotations). It has more than 22,000 labeled images and 5 object classes: car, truck, pedestrian, bicyclist, and traffic light. All of the images have been resized to a height of 300 pixels and a width of 480 pixels. You can download the dataset as part of the book’s code.
NOTE The GitHub data repository is owned by Udacity, and it may be updated after this writing. To avoid any confusion, I downloaded the dataset that I used to create this project and provided it with the book’s code to allow you to replicate the results in this project.
What makes this dataset very interesting is that these are real-time images taken while driving in Mountain View, California, and neighboring cities during daylight conditions. No image cleanup was done. Take a look at the image examples in figure 7.31.
As stated on Udacity’s page, the dataset was labeled by CrowdAI and Autti. You can find the labels in CSV format in the folder, split into three files: training, validation, and test datasets. The labeling format is straightforward, as follows:
Xmin, xmax, ymin, and ymax are the bounding box coordinates. Class_id is the correct label, and frame is the image name.
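To get a feel for working with labels in this shape, here is a minimal sketch that groups the boxes by image using Python's standard csv module. The column header names and their order, the file name example_1.jpg, and all coordinate values are made up for illustration; check the actual CSV files that ship with the dataset before relying on them.

```python
import csv
import io

# Hypothetical sample in the assumed layout: one row per labeled box
sample_csv = io.StringIO(
    "frame,xmin,xmax,ymin,ymax,class_id\n"
    "example_1.jpg,192,215,141,176,1\n"
    "example_1.jpg,74,110,138,159,2\n"
    "example_2.jpg,10,40,20,60,1\n"
)

boxes_by_frame = {}
for row in csv.DictReader(sample_csv):
    box = tuple(int(row[key]) for key in ("xmin", "xmax", "ymin", "ymax"))
    label = int(row["class_id"])
    # Collect every (box, label) pair under the image it belongs to
    boxes_by_frame.setdefault(row["frame"], []).append((box, label))

for frame, annotations in boxes_by_frame.items():
    print(frame, annotations)
```

To read a real label file instead, the in-memory sample can be replaced with an opened file object.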
Before jumping into the model training, take a close look at the build_model method in the keras_ssd7.py file. This file builds a Keras model with the SSD architecture. As we learned earlier in this chapter, the model consists of convolutional feature layers and a number of convolutional predictor layers that take their input from different feature layers.
Here is what the build_model method looks like. Please read the comments in the keras_ssd7.py file to understand the arguments passed:
def build_model(image_size,
                n_classes,
                mode='training',
                l2_regularization=0.0,
                min_scale=0.1,
                max_scale=0.9,
                scales=None,
                aspect_ratios_global=[0.5, 1.0, 2.0],
                aspect_ratios_per_layer=None,
                two_boxes_for_ar1=True,
                steps=None,
                offsets=None,
                clip_boxes=False,
                variances=[1.0, 1.0, 1.0, 1.0],
                coords='centroids',
                normalize_coords=False,
                subtract_mean=None,
                divide_by_stddev=None,
                swap_channels=False,
                confidence_thresh=0.01,
                iou_threshold=0.45,
                top_k=200,
                nms_max_output_size=400,
                return_predictor_sizes=False):
    ...  # function body omitted here; see keras_ssd7.py
In this section, we set the model configuration parameters. First we set the height, width, and number of color channels to whatever we want the model to accept as image input. If your input images have a different size than defined here, or if your images have non-uniform size, you must use the data generator’s image transformations (resizing and/or cropping) so that your images end up having the required input size before they are fed to the model:
img_height = 300              ❶
img_width = 480               ❶
img_channels = 3              ❶
intensity_mean = 127.5        ❷
intensity_range = 127.5       ❷
❶ Height, width, and channels of the input images
❷ Set to your preference (may be None). The current settings transform the input pixel values to the interval [-1,1].
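To see why these two values map pixels to [-1,1], we can work through the arithmetic by hand. This is a small sketch of the transformation, assuming the model subtracts intensity_mean and then divides by intensity_range. The function name normalize_pixel is ours, not part of the library.

```python
def normalize_pixel(value, intensity_mean=127.5, intensity_range=127.5):
    """Shift and scale one pixel value (hypothetical helper for illustration)."""
    return (value - intensity_mean) / intensity_range


# The darkest, middle, and brightest 8-bit intensities
for pixel in (0, 127.5, 255):
    print(pixel, "->", normalize_pixel(pixel))
```

The darkest pixel (0) maps to -1, the brightest (255) maps to 1, and everything else falls in between.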
The number of classes is the number of positive classes in your dataset: for example, 20 for PASCAL VOC or 80 for COCO. Class ID 0 must always be reserved for the background class:
n_classes = 5                                  ❶
scales = [0.08, 0.16, 0.32, 0.64, 0.96]        ❷
aspect_ratios = [0.5, 1.0, 2.0]                ❸
steps = None                                   ❹
offsets = None                                 ❺
two_boxes_for_ar1 = True                       ❻
clip_boxes = False                             ❼
variances = [1.0, 1.0, 1.0, 1.0]               ❽
normalize_coords = True                        ❾
❶ Number of classes in our dataset
❷ An explicit list of anchor box scaling factors. If this is passed, it overrides the min_scale and max_scale arguments.
❸ List of aspect ratios for the anchor boxes
❹ In case you’d like to set the step sizes for the anchor box grids manually; not recommended
❺ In case you’d like to set the offsets for the anchor box grids manually; not recommended
❻ Specifies whether to generate two anchor boxes for aspect ratio 1
❼ Specifies whether to clip the anchor boxes to lie entirely within the image boundaries
❽ List of variances by which the encoded target coordinates are scaled
❾ Specifies whether the model is supposed to use coordinates relative to the image size
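To make the scales and aspect_ratios settings more concrete, here is a rough sketch of how anchor box sizes can be derived from them. It follows the formulas from the SSD paper (width = s·√ar, height = s/√ar, plus one extra square box with scale √(s_k·s_k+1) for aspect ratio 1). Using the shorter image side as the reference size is an assumption on our part, and the anchor_sizes function is a hypothetical helper, not the library's code.

```python
from math import sqrt


def anchor_sizes(scale, next_scale, aspect_ratios, img_height, img_width,
                 two_boxes_for_ar1=True):
    """Return (width, height) pairs in pixels for one predictor layer."""
    size = min(img_height, img_width)  # assumed reference size
    boxes = []
    for ar in aspect_ratios:
        if ar == 1.0:
            boxes.append((scale * size, scale * size))
            if two_boxes_for_ar1:
                # Extra, slightly larger square box for aspect ratio 1
                extra = sqrt(scale * next_scale) * size
                boxes.append((extra, extra))
        else:
            boxes.append((scale * size * sqrt(ar), scale * size / sqrt(ar)))
    return boxes


scales = [0.08, 0.16, 0.32, 0.64, 0.96]
aspect_ratios = [0.5, 1.0, 2.0]

# Five scales describe four predictor layers: each layer uses its own scale
# and borrows the next one for the extra aspect-ratio-1 box.
for layer, (s, s_next) in enumerate(zip(scales[:-1], scales[1:])):
    sizes = anchor_sizes(s, s_next, aspect_ratios, img_height=300, img_width=480)
    print(layer, [(round(w, 1), round(h, 1)) for w, h in sizes])
```

Under these assumptions, each predictor layer gets four anchor shapes per grid cell: one per aspect ratio, plus the extra square box.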
Now we call the build_model() function to build our model:
model = build_model(image_size=(img_height, img_width, img_channels),
                    n_classes=n_classes,
                    mode='training',
                    l2_regularization=0.0005,
                    scales=scales,
                    aspect_ratios_global=aspect_ratios,
                    aspect_ratios_per_layer=None,
                    two_boxes_for_ar1=two_boxes_for_ar1,
                    steps=steps,
                    offsets=offsets,
                    clip_boxes=clip_boxes,
                    variances=variances,
                    normalize_coords=normalize_coords,
                    subtract_mean=intensity_mean,
                    divide_by_stddev=intensity_range)
You can optionally load saved weights. If you don’t want to load weights, skip the following code snippet:
model.load_weights('<path/to/model.h5>', by_name=True)
Instantiate an Adam optimizer and the SSD loss function, and compile the model. Here, we will use a custom Keras function called SSDLoss. It implements the multi-task log loss for classification and smooth L1 loss for localization. neg_pos_ratio and alpha are set as in the SSD paper (Liu et al., 2016):
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)
model.compile(optimizer=adam, loss=ssd_loss.compute_loss)
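To build some intuition for what neg_pos_ratio and alpha control, here is a simplified sketch of the pieces of the SSD loss as described in the SSD paper. It is not the SSDLoss implementation. The function names are ours, and the per-box losses are assumed to have been computed already.

```python
def smooth_l1(error):
    """Smooth L1 loss for a single coordinate error."""
    abs_error = abs(error)
    if abs_error < 1.0:
        return 0.5 * error ** 2   # quadratic near zero
    return abs_error - 0.5        # linear for large errors


def n_negatives_to_keep(n_positive, n_negative, neg_pos_ratio=3):
    """Hard negative mining keeps at most neg_pos_ratio negatives per positive."""
    return min(neg_pos_ratio * n_positive, n_negative)


def ssd_total_loss(conf_loss_sum, loc_loss_sum, n_positive, alpha=1.0):
    """Combine classification and localization loss, normalized by positives."""
    if n_positive == 0:
        return 0.0
    return (conf_loss_sum + alpha * loc_loss_sum) / n_positive


print(smooth_l1(0.5), smooth_l1(-2.0))
print(n_negatives_to_keep(n_positive=4, n_negative=100))
print(ssd_total_loss(conf_loss_sum=6.0, loc_loss_sum=2.0, n_positive=4))
```

With neg_pos_ratio=3, a batch with 4 matched anchor boxes contributes at most 12 background boxes to the classification loss, which keeps the many easy negatives from dominating training. With alpha=1.0, the two loss terms are weighted equally.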
To load the data, follow these steps:
Instantiate two DataGenerator objects, one for training and one for validation:
train_dataset = DataGenerator(load_images_into_memory=False, hdf5_dataset_path=None)
val_dataset = DataGenerator(load_images_into_memory=False, hdf5_dataset_path=None)
Parse the image and label lists for the training and validation datasets:
images_dir = 'path_to_downloaded_directory'
train_labels_filename = 'path_to_dataset/labels_train.csv'        ❶
val_labels_filename = 'path_to_dataset/labels_val.csv'

train_dataset.parse_csv(images_dir=images_dir,
                        labels_filename=train_labels_filename,
                        input_format=['image_name', 'xmin', 'xmax', 'ymin', 'ymax', 'class_id'],
                        include_classes='all')

val_dataset.parse_csv(images_dir=images_dir,
                      labels_filename=val_labels_filename,
                      input_format=['image_name', 'xmin', 'xmax', 'ymin', 'ymax', 'class_id'],
                      include_classes='all')

train_dataset_size = train_dataset.get_dataset_size()             ❷
val_dataset_size = val_dataset.get_dataset_size()                 ❷
❷ Gets the number of samples in the training and validation datasets
This cell should print out the size of your training and validation datasets as follows:
Number of images in the training dataset: 18000
Number of images in the validation dataset: 4241
batch_size = 16
As you learned in chapter 4, you can increase the batch size to get a boost in the computing speed based on the hardware that you are using for this training.
Define the data augmentation process:
data_augmentation_chain = DataAugmentationConstantInputSize(
    random_brightness=(-48, 48, 0.5),
    random_contrast=(0.5, 1.8, 0.5),
    random_saturation=(0.5, 1.8, 0.5),
    random_hue=(18, 0.5),
    random_flip=0.5,
    random_translate=((0.03, 0.5), (0.03, 0.5), 0.5),
    random_scale=(0.5, 2.0, 0.5),
    n_trials_max=3,
    clip_boxes=True,
    overlap_criterion='area',
    bounds_box_filter=(0.3, 1.0),
    bounds_validator=(0.5, 1.0),
    n_boxes_min=1,
    background=(0, 0, 0))
Instantiate an encoder that can encode ground-truth labels into the format needed by the SSD loss function. Here, the encoder constructor needs the spatial dimensions of the model’s predictor layers to create the anchor boxes:
predictor_sizes = [model.get_layer('classes4').output_shape[1:3],
                   model.get_layer('classes5').output_shape[1:3],
                   model.get_layer('classes6').output_shape[1:3],
                   model.get_layer('classes7').output_shape[1:3]]

ssd_input_encoder = SSDInputEncoder(img_height=img_height,
                                    img_width=img_width,
                                    n_classes=n_classes,
                                    predictor_sizes=predictor_sizes,
                                    scales=scales,
                                    aspect_ratios_global=aspect_ratios,
                                    two_boxes_for_ar1=two_boxes_for_ar1,
                                    steps=steps,
                                    offsets=offsets,
                                    clip_boxes=clip_boxes,
                                    variances=variances,
                                    matching_type='multi',
                                    pos_iou_threshold=0.5,
                                    neg_iou_limit=0.3,
                                    normalize_coords=normalize_coords)
Create the generator handles that will be passed to Keras’s fit_generator() function:
train_generator = train_dataset.generate(batch_size=batch_size,
                                         shuffle=True,
                                         transformations=[data_augmentation_chain],
                                         label_encoder=ssd_input_encoder,
                                         returns={'processed_images',
                                                  'encoded_labels'},
                                         keep_images_without_gt=False)

val_generator = val_dataset.generate(batch_size=batch_size,
                                     shuffle=False,
                                     transformations=[],
                                     label_encoder=ssd_input_encoder,
                                     returns={'processed_images',
                                              'encoded_labels'},
                                     keep_images_without_gt=False)
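The pos_iou_threshold and neg_iou_limit arguments passed to the encoder in the steps above decide which anchor boxes count as matches during training. The following sketch illustrates the idea with a plain IoU computation. It is a simplification, not the SSDInputEncoder code: the helper names are ours, boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, and the real encoder may treat boundary values and one-to-one matching differently.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_iou_threshold=0.5, neg_iou_limit=0.3):
    """Classify an anchor by its best overlap with any ground-truth box."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_iou_threshold:
        return "positive"     # trained to predict the matched object
    if best < neg_iou_limit:
        return "background"   # trained as a negative example
    return "neutral"          # too ambiguous; ignored during training


anchor = (0, 0, 10, 10)
print(label_anchor(anchor, [(1, 0, 11, 10)]))     # large overlap
print(label_anchor(anchor, [(5, 0, 15, 10)]))     # partial overlap
print(label_anchor(anchor, [(20, 20, 30, 30)]))   # no overlap
```

Anchors whose best overlap falls between the two thresholds are neither clearly an object nor clearly background, so leaving them out of the loss avoids training on confusing examples.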
Everything is set, and we are ready to train our SSD7 network. We’ve already chosen an optimizer and a learning rate and set the batch size; now let’s set the remaining training parameters and train the network. There are no new parameters here that you haven’t learned already. We will set up callbacks for model checkpointing, CSV logging, early stopping, and learning rate reduction:
model_checkpoint = ModelCheckpoint(
    filepath='ssd7_epoch-{epoch:02d}_loss-{loss:.4f}_val_loss-{val_loss:.4f}.h5',
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    save_weights_only=False,
    mode='auto',
    period=1)

csv_logger = CSVLogger(filename='ssd7_training_log.csv',
                       separator=',',
                       append=True)

early_stopping = EarlyStopping(monitor='val_loss',                 ❶
                               min_delta=0.0,
                               patience=10,
                               verbose=1)

reduce_learning_rate = ReduceLROnPlateau(monitor='val_loss',       ❷
                                         factor=0.2,
                                         patience=8,
                                         verbose=1,
                                         epsilon=0.001,
                                         cooldown=0,
                                         min_lr=0.00001)

callbacks = [model_checkpoint, csv_logger, early_stopping, reduce_learning_rate]
❶ Stops training early if val_loss does not improve for 10 consecutive epochs
❷ Reduces the learning rate when val_loss plateaus
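To see how factor, patience, and min_lr interact, we can simulate the learning rate schedule on a made-up sequence of validation losses. This is a simplified sketch of the plateau logic, not the Keras callback itself: it ignores cooldown, and the min_delta parameter here plays the role of the epsilon argument in the listing above. The function name and the loss values are ours.

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.2, patience=8,
                               min_delta=0.001, min_lr=0.00001):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:
            best = loss          # meaningful improvement: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # never drop below min_lr
                wait = 0
        history.append(lr)
    return history


# Hypothetical run: two improving epochs followed by eight flat ones
losses = [1.0, 0.9] + [0.9] * 8
for epoch, lr in enumerate(simulate_reduce_on_plateau(losses), start=1):
    print(epoch, lr)
```

In this made-up run, the learning rate stays at 0.001 until the eighth epoch without improvement and then drops to one-fifth of its value. Because the reduction patience (8) is shorter than the early stopping patience (10), the network gets a couple of epochs at the lower rate before training is stopped.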
Set one epoch to consist of 1,000 training steps. I’ve arbitrarily set the number of epochs to 20 here. This does not necessarily mean that 20,000 training steps is the optimum number. Depending on the model, dataset, learning rate, and so on, you might have to train much longer (or less) to achieve convergence:
initial_epoch = 0         ❶
final_epoch = 20          ❶
steps_per_epoch = 1000

history = model.fit_generator(generator=train_generator,           ❷
                              steps_per_epoch=steps_per_epoch,
                              epochs=final_epoch,
                              callbacks=callbacks,
                              validation_data=val_generator,
                              validation_steps=ceil(val_dataset_size/batch_size),
                              initial_epoch=initial_epoch)
❶ If you’re resuming previous training, set initial_epoch and final_epoch accordingly.
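It is worth checking what these numbers mean in terms of images seen. The short calculation below uses the dataset sizes printed earlier and the batch size of 16; the variable names on the left are ours.

```python
from math import ceil

train_dataset_size = 18000
val_dataset_size = 4241
batch_size = 16
steps_per_epoch = 1000
final_epoch = 20

# Images drawn from the training generator in one "epoch" of 1,000 steps
images_per_epoch = steps_per_epoch * batch_size

# Total training steps and the equivalent number of passes over the data
total_steps = steps_per_epoch * final_epoch
passes_over_training_set = total_steps * batch_size / train_dataset_size

# Batches needed to cover the whole validation set once
validation_steps = ceil(val_dataset_size / batch_size)

print(images_per_epoch)                     # 16000
print(total_steps)                          # 20000
print(round(passes_over_training_set, 2))   # 17.78
print(validation_steps)                     # 266
```

Note that an epoch defined this way covers 16,000 images, slightly fewer than the 18,000 training images, so 20 epochs correspond to a little under 18 full passes over the training set.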
Let’s visualize the loss and val_loss values to look at how the training and validation loss evolved and check whether our training is going in the right direction (figure 7.32):
plt.figure(figsize=(20,12))
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend(loc='upper right', prop={'size': 24})
Now let’s make some predictions on the validation dataset with the trained model. For convenience, we’ll use the validation generator that we’ve already set up. Feel free to change the batch size:
predict_generator = val_dataset.generate(batch_size=1,             ❶
                                         shuffle=True,
                                         transformations=[],
                                         label_encoder=None,
                                         returns={'processed_images',
                                                  'processed_labels',
                                                  'filenames'},
                                         keep_images_without_gt=False)

batch_images, batch_labels, batch_filenames = next(predict_generator)   ❷

y_pred = model.predict(batch_images)                                ❸

y_pred_decoded = decode_detections(y_pred,                          ❹
                                   confidence_thresh=0.5,
                                   iou_threshold=0.45,
                                   top_k=200,
                                   normalize_coords=normalize_coords,
                                   img_height=img_height,
                                   img_width=img_width)

np.set_printoptions(precision=2, suppress=True, linewidth=90)
print("Predicted boxes:\n")
print(' class conf xmin ymin xmax ymax')
print(y_pred_decoded[0])   # batch_size is 1, so the batch holds a single image
❶ 1. Set the generator for the predictions.
❹ 4. Decode the raw prediction y_pred.
This code snippet prints the predicted bounding boxes along with their class and the level of confidence for each one, as shown in figure 7.33.
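The confidence_thresh, iou_threshold, and top_k arguments in the decoding step correspond to the filtering stages we discussed earlier in this chapter: confidence thresholding followed by non-maximum suppression. Here is a minimal pure-Python sketch of that idea, assuming each detection is a list in the order [class, conf, xmin, ymin, xmax, ymax]. It is an illustration only, not the decode_detections implementation, and the detection values are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def filter_detections(detections, confidence_thresh=0.5, iou_threshold=0.45,
                      top_k=200):
    """Threshold by confidence, then run greedy per-class NMS."""
    kept = []
    for class_id in sorted({d[0] for d in detections}):
        candidates = sorted(
            (d for d in detections
             if d[0] == class_id and d[1] > confidence_thresh),
            key=lambda d: d[1], reverse=True)
        while candidates:
            best = candidates.pop(0)   # most confident remaining box
            kept.append(best)
            # Drop boxes that overlap the chosen box too much
            candidates = [d for d in candidates
                          if iou(best[2:], d[2:]) <= iou_threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:top_k]


# Hypothetical raw detections: [class, conf, xmin, ymin, xmax, ymax]
raw = [
    [1, 0.9, 10, 10, 50, 50],
    [1, 0.8, 12, 12, 52, 52],      # near-duplicate of the first box
    [1, 0.3, 100, 100, 120, 120],  # below the confidence threshold
    [2, 0.7, 11, 11, 51, 51],      # different class, so not suppressed
]
for detection in filter_detections(raw):
    print(detection)
```

In this example, the second box is suppressed because it overlaps a more confident box of the same class, and the third is dropped for low confidence. The box of class 2 survives even though it overlaps the first box, because suppression here is applied per class.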
When we draw these predicted boxes onto the image, as shown in figure 7.34, each predicted box has its confidence next to the category name. The ground-truth boxes are also drawn onto the image for comparison.
Image classification is the task of predicting the type or class of an object in an image.
Object detection is the task of predicting the location of objects in an image via bounding boxes and the classes of the located objects.
The general framework of object detection systems consists of four main components: region proposals, feature extraction and predictions, non-maximum suppression, and evaluation metrics.
Object detection algorithms are evaluated using two main metrics: frames per second (FPS) to measure the network’s speed, and mean average precision (mAP) to measure the network’s precision.
The three most popular object detection systems are the R-CNN family of networks, SSD, and the YOLO family of networks.
The R-CNN family of networks has three main variations: R-CNN, Fast R-CNN, and Faster R-CNN. R-CNN and Fast R-CNN use a selective search algorithm to propose RoIs, whereas Faster R-CNN is an end-to-end DL system that uses a region proposal network to propose RoIs.
The YOLO family of networks includes YOLOv1, YOLOv2 (or YOLO9000), and YOLOv3.
R-CNN is a multi-stage detector: it separates the process to predict the objectness score of the bounding box and the object class into two different stages.
SSD and YOLO are single-stage detectors: the image is passed once through the network to predict the objectness score and the object class.
In general, single-stage detectors tend to be less accurate than two-stage detectors but are significantly faster.
1. Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” 2014, http://arxiv.org/abs/1311.2524.
2. Ross Girshick, “Fast R-CNN,” 2015, http://arxiv.org/abs/1504.08083.
3. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” 2016, http://arxiv.org/abs/1506.01497.
4. Matthew D. Zeiler and Rob Fergus, “Visualizing and Understanding Convolutional Networks,” 2013, http://arxiv.org/abs/1311.2901.
5. Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014, http://arxiv.org/abs/1409.1556.
6. Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017, http://arxiv.org/abs/1704.04861.
7. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely Connected Convolutional Networks,” 2016, http://arxiv.org/abs/1608.06993.
8. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, “SSD: Single Shot MultiBox Detector,” 2016, http://arxiv.org/abs/1512.02325.
9. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016, http://arxiv.org/abs/1506.02640.
10. Joseph Redmon and Ali Farhadi, “YOLO9000: Better, Faster, Stronger,” 2016, http://arxiv.org/abs/1612.08242.
11. Joseph Redmon and Ali Farhadi, “YOLOv3: An Incremental Improvement,” 2018, http://arxiv.org/abs/1804.02767.