7 Object detection with R-CNN, SSD, and YOLO

This chapter covers

Understanding image classification vs. object detection
Understanding the general framework of object detection projects
Using object detection algorithms like R-CNN, SSD, and YOLO

In the previous chapters, we explained how we can use deep neural networks for image classification tasks. In image classification, we assume that there is only one main target object in the image, and the model’s sole focus is to identify the target category. However, in many situations, we are interested in multiple targets in the image. We want to not only classify them, but also obtain their specific positions in the image. In computer vision, we refer to such tasks as object detection. Figure 7.1 explains the difference between image classification and object detection tasks.

Figure 7.1 Image classification vs. object detection tasks. In classification tasks, the classifier outputs the class probability (cat), whereas in object detection tasks, the detector outputs the bounding box coordinates that localize the detected objects (four boxes in this example) and their predicted classes (two cats, one duck, and one dog).

Object detection is a CV task that involves two main tasks: localizing one or more objects within an image and classifying each object in the image (see table 7.1). This is done by drawing a bounding box around each identified object, labeled with its predicted class. This means the system doesn’t just predict the class of the image, as in image classification tasks; it also predicts the coordinates of the bounding box that fits the detected object. This is a challenging CV task because it requires both successful object localization, to locate and draw a bounding box around each object in an image, and object classification, to predict the correct class of each object that was localized.

Table 7.1 Image classification vs. object detection

Image classification	Object detection
The goal is to predict the type or class of an object in an image.	The goal is to predict the location of objects in an image via bounding boxes and the classes of the located objects.
Input: an image with a single object	Input: an image with one or more objects
Output: a class label (cat, dog, etc.)	Output: one or more bounding boxes (defined by coordinates) and a class label for each bounding box
Example output: class probability (for example, 84% cat)	Example output for an image with two objects: box1 coordinates (x, y, w, h) and class probability; box2 coordinates and class probability

Note that the box coordinates (x, y, w, h) are as follows: x and y are the coordinates of the bounding-box center point, and w and h are the width and height of the box.

Object detection is widely used in many fields. For example, in self-driving technology, we need to plan routes by identifying the locations of vehicles, pedestrians, roads, and obstacles in a captured video image. Robots often perform this type of task to detect targets of interest. And systems in the security field need to detect abnormal targets, such as intruders or bombs.

This chapter’s layout is as follows:

We will explore the general framework of the object detection algorithms.
We will dive deep into three of the most popular detection algorithms: the R-CNN family of networks, SSD, and the YOLO family of networks.
We will use what we’ve learned in a real-world project to train an end-to-end object detector.

By the end of this chapter, we will have gained an understanding of how DL is applied to object detection, and how the different object detection models inspire and diverge from one another. Let’s get started!

7.1 General object detection framework

Before we jump into object detection systems like R-CNN, SSD, and YOLO, let’s discuss the general framework of these systems: the high-level workflow that DL-based systems follow to detect objects, and the metrics they use to evaluate their detection performance. Don’t worry about the code implementation details of object detectors yet. The goal of this section is to give you an overview of how different object detection systems approach this task and to introduce the concepts you will need to understand the DL architectures explained in sections 7.2, 7.3, and 7.4.

Typically, an object detection framework has four components:

Region proposal --An algorithm or a DL model is used to generate regions of interest (RoIs) to be further processed by the system. These are regions that the network believes might contain an object; the output is a large number of bounding boxes, each of which has an objectness score. Boxes with large objectness scores are then passed along the network layers for further processing.
Feature extraction and network predictions --Visual features are extracted for each of the bounding boxes. They are evaluated, and it is determined whether and which objects are present in the proposals based on visual features (for example, an object classification component).
Non-maximum suppression (NMS) --In this step, the model has likely found multiple bounding boxes for the same object. NMS helps avoid repeated detection of the same instance by keeping the highest-scoring box for each object and suppressing the boxes that overlap it.
Evaluation metrics --Similar to accuracy, precision, and recall metrics in image classification tasks (see chapter 4), object detection systems have their own metrics to evaluate their detection performance. In this section, we will explain the most popular metrics, like mean average precision (mAP), precision-recall curve (PR curve), and intersection over union (IoU).

Now, let’s dive one level deeper into each one of these components to build an intuition about what their goals are.

7.1.1 Region proposals

In this step, the system looks at the image and proposes RoIs for further analysis. RoIs are regions that the system believes are likely to contain an object; this likelihood is called the objectness score (figure 7.2). Regions with high objectness scores are passed to the next steps; regions with low scores are abandoned.

Figure 7.2 Regions of interest (RoIs) proposed by the system. Regions with high objectness score represent areas of high likelihood to contain objects (foreground), and the ones with low objectness score are ignored because they have a low likelihood of containing objects (background).

There are several approaches to generate region proposals. Originally, the selective search algorithm was used to generate object proposals; we will talk more about this algorithm when we discuss the R-CNN network. Other approaches use more complex visual features, extracted from the image by a deep neural network, to generate regions.

We will talk in more detail about how different object detection systems approach this task. The important thing to note is that this step produces a lot (thousands) of bounding boxes to be further analyzed and classified by the network. During this step, the network analyzes these regions in the image and classifies each region as foreground (object) or background (no object) based on its objectness score. If the objectness score is above a certain threshold, then this region is considered a foreground and pushed forward in the network. Note that this threshold is configurable based on your problem. If the threshold is too low, your network will exhaustively generate all possible proposals, and you will have a better chance of detecting all objects in the image. On the flip side, this is very computationally expensive and will slow down detection. So, the trade-off with generating region proposals is the number of regions versus computational complexity--and the right approach is to use problem-specific information to reduce the number of RoIs.
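The effect of the objectness threshold can be illustrated with a minimal sketch. The function name `filter_proposals`, the dictionary layout of a proposal, and the sample scores are assumptions made for illustration; they do not come from any particular detection framework.

```python
def filter_proposals(proposals, objectness_threshold=0.5):
    """Keep only proposals whose objectness score reaches the threshold."""
    return [p for p in proposals if p["objectness"] >= objectness_threshold]


# Hypothetical proposals: box is (x_center, y_center, width, height).
proposals = [
    {"box": (50, 40, 20, 10), "objectness": 0.92},
    {"box": (52, 41, 22, 12), "objectness": 0.61},
    {"box": (10, 10, 5, 5), "objectness": 0.08},
]

# A lower threshold forwards more regions (more chances to find objects,
# more computation); a higher threshold forwards fewer.
for threshold in (0.05, 0.5, 0.9):
    kept = filter_proposals(proposals, threshold)
    print(threshold, len(kept))
```

With these sample scores, the three thresholds forward three, two, and one region, respectively, which is the regions-versus-computation trade-off described above.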

7.1.2 Network predictions

This component includes the pretrained CNN that is used for feature extraction: it extracts features from the input image that are representative for the task at hand and uses these features to determine the class of the image. In object detection frameworks, people typically use pretrained image classification models to extract visual features, as these tend to generalize fairly well. For example, a model trained on the MS COCO or ImageNet dataset is able to extract fairly generic features.

In this step, the network analyzes all the regions that have been identified as having a high likelihood of containing an object and makes two predictions for each region:

Bounding-box prediction --The coordinates that locate the box surrounding the object. The bounding box coordinates are represented as the tuple (x, y, w, h), where x and y are the coordinates of the center point of the bounding box and w and h are the width and height of the box.
Class prediction --The classic softmax function predicts the class probability for each object.
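Because the (x, y, w, h) tuple stores the center point rather than a corner, it often helps to convert it to corner coordinates when drawing boxes or measuring overlap. The sketch below shows that conversion in both directions; the function names are assumptions made for illustration.

```python
def center_to_corners(box):
    """Convert (x_center, y_center, w, h) to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def corners_to_center(box):
    """Convert (x_min, y_min, x_max, y_max) back to (x_center, y_center, w, h)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


print(center_to_corners((50, 40, 20, 10)))  # (40.0, 35.0, 60.0, 45.0)
```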

Since thousands of regions are proposed, each object will typically have multiple bounding boxes surrounding it with the correct classification. For example, take a look at the image of the dog in figure 7.3. The network was clearly able to find the object (dog) and successfully classify it. But the detection fired a total of five times because the dog was present in five of the RoIs produced in the previous step: hence the five bounding boxes around the dog in the figure. Although the detector was able to successfully locate the dog in the image and classify it correctly, this is not exactly what we need. For most problems, we need just one bounding box for each object: the one box that fits the object best. What if we are building a system to count dogs in an image? Our current system will count five dogs. We don’t want that. This is when the non-maximum suppression technique comes in handy.

Figure 7.3 The bounding-box detector produces more than one bounding box for an object. We want to consolidate these boxes into one bounding box that fits the object the most.

7.1.3 Non-maximum suppression (NMS)

As you can see in figure 7.4, one of the problems of an object detection algorithm is that it may find multiple detections of the same object. So, instead of creating only one bounding box around the object, it draws multiple boxes for the same object. NMS is a technique that makes sure the detection algorithm detects each object only once. As the name implies, NMS looks at all the boxes surrounding an object to find the box that has the maximum prediction probability, and it suppresses or eliminates the other boxes (hence the name).

Figure 7.4 Multiple regions are proposed for the same object. After NMS, only the box that fits the object the best remains; the rest are ignored, as they have large overlaps with the selected box.

The general idea of NMS is to reduce the number of candidate boxes to only one bounding box for each object. For example, if the object in the frame is fairly large and more than 2,000 object proposals have been generated, it is quite likely that some of them will have significant overlap with each other and the object.

Let’s see the steps of how the NMS algorithm works:

Discard all bounding boxes whose prediction probability is less than a certain threshold, called the confidence threshold. This threshold is tunable: a box is suppressed if its prediction probability is less than the set threshold.
Look at all the remaining boxes, and select the bounding box with the highest probability.
Calculate the overlap between the selected box and each of the remaining boxes that have the same class prediction. This overlap metric is called intersection over union (IoU). IoU is explained in detail in the next section.
Suppress any box whose IoU with the selected box is greater than a certain threshold (called the NMS threshold), because it most likely covers the same object. Then repeat the selection and suppression steps on the boxes that are left until every box has been either selected or suppressed. Usually the NMS threshold is equal to 0.5, but it is tunable as well if you want to output fewer or more bounding boxes.

NMS techniques are typically standard across the different detection frameworks, but it is an important step that may require tweaking hyperparameters such as the confidence threshold and the NMS threshold based on the scenario.
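The procedure can be sketched in a few lines of Python. This is a minimal sketch of the common greedy, per-class form of NMS, not the implementation used by any particular framework. The function names (`iou`, `nms`), the (box, score, class) tuple layout, and the sample detections are assumptions made for illustration. Boxes use the (x, y, w, h) center format from section 7.1.2.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    overlap_w = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    overlap_h = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def nms(detections, confidence_threshold=0.5, nms_threshold=0.5):
    """Greedy per-class NMS over a list of (box, score, class_name) tuples."""
    # Step 1: drop low-confidence boxes, then rank the rest by score.
    candidates = sorted(
        (d for d in detections if d[1] >= confidence_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    while candidates:
        # Step 2: select the highest-scoring remaining box.
        best = candidates.pop(0)
        kept.append(best)
        # Steps 3 and 4: suppress same-class boxes that overlap it too much.
        candidates = [
            d for d in candidates
            if d[2] != best[2] or iou(d[0], best[0]) <= nms_threshold
        ]
    return kept


# Hypothetical detections: three overlapping "dog" boxes, one "cat" box,
# and one low-confidence box.
detections = [
    ((50, 50, 40, 40), 0.9, "dog"),
    ((52, 50, 40, 40), 0.8, "dog"),
    ((55, 52, 42, 40), 0.7, "dog"),
    ((150, 60, 30, 30), 0.85, "cat"),
    ((50, 50, 40, 40), 0.3, "dog"),
]
for box, score, name in nms(detections):
    print(name, score, box)
```

In this example, only the 0.9 dog box and the cat box survive. Raising `nms_threshold` keeps more overlapping boxes; lowering it keeps fewer.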

7.1.4 Object-detector evaluation metrics

When evaluating the performance of an object detector, we use two main evaluation metrics: frames per second and mean average precision.

Frames per second (FPS) to measure detection speed

The most common metric used to measure detection speed is the number of frames per second (FPS). For example, Faster R-CNN operates at only 7 FPS, whereas SSD operates at 59 FPS. In benchmarking experiments, you will see the authors of a paper state their network results as: “Network X achieves mAP of Y% at Z FPS,” where X is the network name, Y is the mAP percentage, and Z is the FPS.
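As a rough illustration of how such a number is obtained, the sketch below times a detector over a batch of frames and divides the frame count by the elapsed time. Both `measure_fps` and `fake_detector` are hypothetical names; the fake detector simply sleeps to stand in for a real model’s inference call.

```python
import time


def measure_fps(detect_fn, frames):
    """Return frames processed per second by detect_fn over the given frames."""
    if not frames:
        raise ValueError("need at least one frame to measure FPS")
    start = time.perf_counter()
    for frame in frames:
        detect_fn(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def fake_detector(frame):
    """Stand-in for a real detector: pretend inference takes about 10 ms."""
    time.sleep(0.01)


fps = measure_fps(fake_detector, frames=list(range(5)))
print(f"{fps:.1f} FPS")
```

Published FPS figures also depend on the hardware, batch size, and input resolution, so numbers are only comparable when those settings match.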

Mean average precision (mAP) to measure network precision

The most common evaluation metric used in object recognition tasks is mean average precision (mAP). It is a percentage from 0 to 100, and higher values are typically better, but its value is different from the accuracy metric used in classification.

To understand how mAP is calculated, you first need to understand intersection over union (IoU) and the precision-recall curve (PR curve). Let’s explain IoU and the PR curve and then come back to mAP.

Intersection over union (IoU)

This measure evaluates the overlap between two bounding boxes: the ground truth bounding box (B_{ground truth}) and the predicted bounding box (B_{predicted}). By applying the IoU, we can tell whether a detection is valid (True Positive) or not (False Positive). Figure 7.5 illustrates the IoU between a ground truth bounding box and a predicted bounding box.

Figure 7.5 The IoU score is the overlap between the ground truth bounding box and the predicted bounding box.

The intersection over the union value ranges from 0 (no overlap at all) to 1 (the two bounding boxes overlap each other 100%). The higher the overlap between the two bounding boxes (IoU value), the better (figure 7.6).

Figure 7.6 IoU scores range from 0 (no overlap) to 1 (100% overlap). The higher the overlap (IoU) between the two bounding boxes, the better.

To calculate the IoU of a prediction, we need the following:

The ground truth bounding box (B_{ground truth}): the hand-labeled bounding box created during the labeling process
The predicted bounding box (B_{predicted}) from our model

We calculate IoU by dividing the area of overlap by the area of the union, as in the following equation:

IoU = area of overlap / area of union = area(B_{ground truth} ∩ B_{predicted}) / area(B_{ground truth} ∪ B_{predicted})

IoU is used to define a correct prediction: a prediction is a True Positive if its IoU with the ground truth is greater than some threshold. This threshold is a tunable value depending on the challenge, but 0.5 is a standard value. For example, some challenges, like Microsoft COCO, use mAP@0.5 (IoU threshold of 0.5) or mAP@0.75 (IoU threshold of 0.75). If the IoU value is above this threshold, the prediction is considered a True Positive (TP); and if it is below the threshold, it is considered a False Positive (FP).
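A minimal sketch of the IoU calculation and the TP/FP decision follows. It assumes boxes in the (x, y, w, h) center format used earlier in this chapter; the function names and the sample boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    overlap_w = max(0.0, min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2))
    overlap_h = max(0.0, min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2))
    intersection = overlap_w * overlap_h
    # Union = sum of both areas minus the part counted twice.
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(predicted, ground_truth, iou_threshold=0.5):
    """A prediction counts as a TP when its IoU exceeds the threshold."""
    return iou(predicted, ground_truth) > iou_threshold


ground_truth = (50, 50, 40, 40)
predicted = (60, 50, 40, 40)  # same size, shifted 10 pixels to the right

print(iou(predicted, ground_truth))                     # 0.6
print(is_true_positive(predicted, ground_truth, 0.5))   # True
print(is_true_positive(predicted, ground_truth, 0.75))  # False
```

The same prediction is a TP at an IoU threshold of 0.5 but an FP at 0.75, which is why results reported at different thresholds are not directly comparable.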

Precision-recall curve (PR curve)

With the TP and FP defined, we can now calculate the precision and recall of our detection for a given class across the testing dataset. As explained in chapter 4, we calculate the precision and recall as follows (recall that FN stands for False Negative):

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

After calculating the precision and recall for all classes, the PR curve is then plotted as shown in figure 7.7.

Figure 7.7 A precision-recall curve is used to evaluate the performance of an object detector.

The PR curve is a good way to evaluate the performance of an object detector: it shows how precision and recall change as the confidence threshold is varied, with one curve plotted for each object class. A detector is considered good if its precision stays high as recall increases, which means that as you vary the confidence threshold, both precision and recall remain high. On the other hand, a poor detector needs to increase the number of FPs (lower precision) in order to achieve a high recall. That’s why the PR curve usually starts with high precision values, decreasing as recall increases.

Now that we have the PR curve, we can calculate the average precision (AP) by calculating the area under the curve (AUC). Finally, the mAP for object detection is the average of the AP calculated for all the classes. It is also important to note that some research papers use AP and mAP interchangeably.

Recap

To recap, the mAP is calculated as follows:

Get each bounding box’s associated objectness score (probability of the box containing an object).
Calculate precision and recall.
Compute the PR curve for each class by varying the score threshold.
Calculate the AP: the area under the PR curve. In this step, the AP is computed for each class.
Calculate the mAP: the average AP over all the different classes.
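To make the recap concrete, here is a minimal sketch that computes AP per class and then mAP. It assumes each detection has already been marked as a TP or FP by IoU matching against the ground truth, and it approximates the area under the PR curve with simple rectangles. Benchmark suites typically use interpolated variants of this calculation, so their numbers can differ. All function names and sample values are illustrative assumptions.

```python
def precision_recall_points(detections, num_ground_truths):
    """Return (recall, precision) after each detection, ranked by score.

    detections: list of (score, is_true_positive) for one class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    points = []
    tp = fp = 0
    for _score, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Recall = TP / (TP + FN); TP + FN is the number of ground truth objects.
        points.append((tp / num_ground_truths, tp / (tp + fp)))
    return points


def average_precision(detections, num_ground_truths):
    """Area under the PR curve, summed as precision times the gain in recall."""
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in precision_recall_points(detections, num_ground_truths):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (detections, num_ground_truths)."""
    aps = [average_precision(dets, n) for dets, n in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical results for two classes, each with two ground truth objects.
per_class = {
    "dog": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "cat": ([(0.95, True), (0.6, True)], 2),
}
for name, (dets, n) in per_class.items():
    print(name, round(average_precision(dets, n), 3))
print("mAP", round(mean_average_precision(per_class), 3))
```

Here the false positive ranked second for the dog class lowers its AP below 1.0, while the cat class, with no false positives, reaches an AP of 1.0.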

The last thing to note about mAP is that it is more complicated to calculate than other traditional metrics like accuracy. The good news is that you don’t need to compute mAP values yourself: most DL object detection implementations handle computing the mAP for you, as you will see later in this chapter.

Now that we understand the general framework of object detection algorithms, let’s dive deeper into three of the most popular. In this chapter, we will discuss the R-CNN family of networks, SSD, and YOLO networks in detail to see how object detectors have evolved over time. We will also examine the pros and cons of each network so you can choose the most appropriate algorithm for your problem.

7.2 Region-based convolutional neural networks (R-CNNs)

The R-CNN family of object detection techniques, usually referred to as R-CNNs (short for region-based convolutional neural networks), was developed by Ross Girshick et al. in 2014.¹ The R-CNN family expanded to include Fast R-CNN² and Faster R-CNN³ in 2015 and 2016, respectively. In this section, I’ll quickly walk you through the evolution of the R-CNN family from R-CNN to Fast R-CNN to Faster R-CNN, and then we will dive deeper into the Faster R-CNN architecture and code implementation.

7.2.1 R-CNN

R-CNN is the least sophisticated region-based architecture in its family, but it is the basis for understanding how all the object-recognition algorithms in the family work. It was one of the first large, successful applications of convolutional neural networks to the problem of object detection and localization, and it paved the way for the other advanced detection algorithms. The approach was demonstrated on benchmark datasets, achieving then-state-of-the-art results on the PASCAL VOC-2012 dataset and the ILSVRC 2013 object detection challenge. Figure 7.8 shows a summary of the R-CNN model architecture.

Figure 7.8 Summary of the R-CNN model architecture. (Modified from Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.”)

The R-CNN model consists of four components:

Extract regions of interest --Also known as extracting region proposals. These regions have a high probability of containing an object. An algorithm called selective search scans the input image to find regions that contain blobs, and proposes them as RoIs to be processed by the next modules in the pipeline. The proposed RoIs are then warped to have a fixed size; they usually vary in size, but as we learned in previous chapters, CNNs require a fixed input image size.
Feature extraction module --We run a pretrained convolutional network on top of the region proposals to extract features from each candidate region. This is the typical CNN feature extractor that we learned about in previous chapters.
Classification module --We train a classifier like a support vector machine (SVM), a traditional machine learning algorithm, to classify candidate detections based on the extracted features from the previous step.
Localization module --Also known as a bounding-box regressor. Let’s take a step back to understand regression. ML problems are categorized as classification or regression problems. Classification algorithms output discrete, predefined classes (dog, cat, elephant), whereas regression algorithms output continuous value predictions. In this module, we want to predict the location and size of the bounding box that surrounds the object. The bounding box is represented by four values: the x and y coordinates of the box’s center point (x, y), and the width and height of the box (w, h). Putting this together, the regressor predicts the four real-valued numbers that define the bounding box as the following tuple: (x, y, w, h).

Selective search is a greedy search algorithm that is used to provide region proposals that potentially contain objects. It tries to find areas that might contain an object by combining similar pixels and textures into rectangular boxes. Selective search combines the strength of both the exhaustive search algorithm (which examines all possible locations in the image) and the bottom-up segmentation algorithm (which hierarchically groups similar regions) to capture all possible object locations.

The selective search algorithm works by applying a segmentation algorithm to find blobs in an image, in order to figure out what could be an object (see the image on the right in the following figure).

The selective search algorithm looks for blob-like areas in the image to extract regions. At right, the segmentation algorithm defines blobs that could be objects. Then the selective search algorithm selects these areas to be passed along for further investigation.

Bottom-up segmentation recursively combines these groups of regions together into larger ones to create about 2,000 areas to be investigated, as follows:

The similarities between all neighboring regions are calculated.
The two most similar regions are grouped together, and new similarities are calculated between the resulting region and its neighbors.
This process is repeated until the entire object is covered in a single region.

Note that a review of the selective search algorithm and how it calculates regions’ similarity is outside the scope of this book. If you are interested in learning more about this technique, you can refer to the original paper.ᵃ For the purpose of understanding R-CNNs, you can treat the selective search algorithm as a black box that intelligently scans the image and proposes RoI locations for us to use.

An example of bottom-up segmentation using the selective search algorithm. It combines similar regions in every iteration until the entire object is covered in a single region.

Figure 7.9 illustrates the R-CNN architecture in an intuitive way. As you can see, the network first proposes RoIs, then extracts features, and then classifies those regions based on their features. In essence, we have turned object detection into an image classification problem.

Figure 7.9 Illustration of the R-CNN architecture. Each proposed RoI is passed through the CNN to extract features, followed by a bounding-box regressor and an SVM classifier to produce the network output prediction.

Training R-CNNs

We learned in the previous section that R-CNNs are composed of four modules: selective search region proposal, feature extractor, classifier, and bounding-box regressor. All of the R-CNN modules need to be trained except the selective search algorithm. So, in order to train R-CNNs, we need to do the following:

Train the feature extractor CNN. This is a typical CNN training process. We either train a network from scratch, which rarely happens, or fine-tune a pretrained network, as we learned to do in chapter 6.
Train the SVM classifier. The SVM algorithm is not covered in this book, but it is a traditional ML classifier that is no different from DL classifiers in the sense that it needs to be trained on labeled data.
Train the bounding-box regressors. This model outputs four real-valued numbers for each of the K object classes to tighten the region bounding boxes.

Looking through the R-CNN learning steps, you can easily see that training an R-CNN model is expensive and slow. The training process involves training three separate modules without much shared computation. This multistage pipeline training is one of the disadvantages of R-CNNs, as we will see next.

Disadvantages of R-CNN

R-CNN is very simple to understand, and it achieved state-of-the-art results when it first came out, especially when using deep ConvNets to extract features. However, it is not actually a single end-to-end system that learns to localize via a deep neural network. Rather, it is a combination of standalone algorithms, added together to perform object detection. As a result, it has the following notable drawbacks:

Object detection is very slow. For each image, the selective search algorithm proposes about 2,000 RoIs to be examined by the entire pipeline (CNN feature extractor and classifier). This is very computationally expensive because it performs a ConvNet forward pass for each object proposal without sharing computation, which makes it incredibly slow. This high computation need means R-CNN is not a good fit for many applications, especially real-time applications that require fast inferences like self-driving cars and many others.
Training is a multi-stage pipeline. As discussed earlier, R-CNNs require the training of three modules: CNN feature extractor, SVM classifier, and bounding-box regressors. Thus the training process is very complex and not an end-to-end training.
Training is expensive in terms of space and time. When training the SVM classifier and bounding-box regressor, features are extracted from each object proposal in each image and written to disk. With very deep networks, such as VGG16, the training process for a few thousand images takes days using GPUs. The training process is expensive in space as well, because the extracted features require hundreds of gigabytes of storage.

What we need is an end-to-end DL system that fixes the disadvantages of R-CNN while improving its speed and accuracy.

7.2.2 Fast R-CNN

Fast R-CNN was an immediate descendant of R-CNN, developed in 2015 by Ross Girshick. Fast R-CNN resembled the R-CNN technique in many ways but improved on its detection speed while also increasing detection accuracy through two main changes:

Instead of starting with the region proposal module and then using the feature extraction module, like R-CNN, Fast R-CNN proposes that we apply the CNN feature extractor first to the entire input image and then propose regions. This way, we run the ConvNet only once over the entire image instead of 2,000 times over 2,000 overlapping regions.
It extends the ConvNet’s job to do the classification part as well, by replacing the traditional SVM machine learning algorithm with a softmax layer. This way, we have a single model to perform both tasks: feature extraction and object classification.

Fast R-CNN architecture

As shown in figure 7.10, Fast R-CNN takes the features for each region proposal from the last feature map of the network instead of cropping each region from the original image and running it through the CNN, like R-CNN. As a result, we need just one ConvNet for the entire image. In addition, instead of training many different SVMs to classify each object class, a single softmax layer outputs the class probabilities directly. Now we only have one neural net to train, as opposed to one neural net and many SVMs.

The architecture of Fast R-CNN consists of the following modules:

Feature extractor module --The network starts with a ConvNet to extract features from the full image.
RoI extractor --The selective search algorithm proposes about 2,000 region candidates per image.
RoI pooling layer --This is a new component that was introduced to extract a fixed-size window from the feature map before feeding the RoIs to the fully connected layers. It uses max pooling to convert the features inside any valid RoI into a small feature map with a fixed spatial extent of height × width (H × W). The RoI pooling layer will be explained in more detail in the Faster R-CNN section; for now, understand that it is applied on the last feature map layer extracted from the CNN, and its goal is to extract fixed-size RoIs to feed to the fully connected layers and then the output layers.
Two-head output layer --The model branches into two heads:
- A softmax classifier layer that outputs a discrete probability distribution per RoI
- A bounding-box regressor layer to predict offsets relative to the original RoI
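To build intuition for the RoI pooling layer, the sketch below max-pools RoIs of different sizes on a single-channel feature map into the same fixed H × W output. It is a simplified illustration in plain Python: the function names, the RoI format (row and column ranges in feature-map coordinates, end exclusive), and the integer rule for splitting an RoI into bins are assumptions, and real implementations differ in how they round bin boundaries.

```python
def bin_edges(start, length, num_bins):
    """Split [start, start + length) into num_bins ranges, each at least one cell wide."""
    edges = []
    for i in range(num_bins):
        lo = (i * length) // num_bins
        hi = max(((i + 1) * length) // num_bins, lo + 1)
        edges.append((start + lo, start + hi))
    return edges


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a 2D feature map into a fixed-size grid.

    roi is (row_start, col_start, row_end, col_end), end exclusive.
    """
    row_start, col_start, row_end, col_end = roi
    out_h, out_w = output_size
    row_bins = bin_edges(row_start, row_end - row_start, out_h)
    col_bins = bin_edges(col_start, col_end - col_start, out_w)
    return [
        [
            max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
            for (c0, c1) in col_bins
        ]
        for (r0, r1) in row_bins
    ]


# A toy 4 x 6 feature map with values 1..24.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two RoIs of different sizes both come out as 2 x 2 grids.
print(roi_max_pool(feature_map, (0, 1, 3, 6)))  # [[3, 6], [15, 18]]
print(roi_max_pool(feature_map, (1, 0, 4, 2)))  # [[7, 8], [19, 20]]
```

Whatever the size of the incoming RoI, the output has the same shape, which is what allows the fully connected layers that follow to accept every region.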

Figure 7.10 The Fast R-CNN architecture consists of a feature extractor ConvNet, RoI extractor, RoI pooling layers, fully connected layers, and a two-head output layer. Note that, unlike R-CNNs, Fast R-CNNs apply the feature extractor to the entire input image before applying the regions proposal module.

Multi-task loss function in Fast R-CNNs

Since Fast R-CNN is an end-to-end learning architecture that learns the class of an object as well as the associated bounding box position and size, the loss is a multi-task loss. With a multi-task loss, the output has both the softmax classifier and the bounding-box regressor, as shown in figure 7.10.

In any optimization problem, we need to define a loss function that our optimizer algorithm is trying to minimize. (Chapter 2 gives more details about optimization and loss functions.) In object detection problems, we optimize for two goals: object classification and object localization. Therefore, we have two loss functions in this problem: L_cls for the classification loss and L_loc for the bounding box prediction defining the object location.

A Fast R-CNN network has two sibling output layers with two loss functions:

Classification --The first outputs a discrete probability distribution (per RoI) over K + 1 categories (we add one class for the background). The probability distribution p is computed by a softmax over the K + 1 outputs of a fully connected layer. The classification loss function is the log loss for the true class u:

L_cls(p, u) = −log p_u

where u is the true label, u ∈ {0, 1, 2, . . ., K}; u = 0 is the background; and p is the discrete probability distribution per RoI over the K + 1 classes.
Regression --The second sibling layer outputs bounding-box regression offsets t^u = (t_x^u, t_y^u, t_w^u, t_h^u) for each of the K object classes. The localization loss is the bounding-box loss for the true class u:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} L1_smooth(t_i^u − v_i)

where:
- v is the true bounding box, v = (v_x, v_y, v_w, v_h), that is, its (x, y, w, h) values.
- t^u is the predicted bounding box correction for class u:
  t^u = (t_x^u, t_y^u, t_w^u, t_h^u)
- L1_smooth is the bounding box loss that measures the difference between t_i^u and v_i using the smooth L1 loss function. It is a robust function and is claimed to be less sensitive to outliers than other regression losses like L2.

The overall loss function is

L = L_cls + L_loc

L(p, u, t^u, v) = L_cls(p, u) + [u ≥ 1] L_loc(t^u, v)

Note that the indicator [u ≥ 1] multiplies the regression loss so that it evaluates to 0 when the inspected region doesn’t contain any object and is labeled as background. It is a way of ignoring the bounding box regression when the classifier labels the region as a background. The indicator function [u ≥ 1] is defined as

[u ≥ 1] = 1 if u ≥ 1, and 0 otherwise
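The multi-task loss for a single RoI can be sketched directly from these definitions. This is an illustrative sketch in plain Python, not the training code of any framework. The function names and sample numbers are assumptions, and the piecewise form of the smooth L1 function (quadratic for small differences, linear otherwise) is the one given in the Fast R-CNN paper, not derived in this chapter.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic when |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5


def classification_loss(p, u):
    """Log loss for the true class u, given class probabilities p."""
    return -math.log(p[u])


def localization_loss(t_u, v):
    """Sum of smooth L1 differences over the four box values (x, y, w, h)."""
    return sum(smooth_l1(t_i - v_i) for t_i, v_i in zip(t_u, v))


def multi_task_loss(p, u, t_u, v):
    """Classification loss plus localization loss, skipped for background (u = 0)."""
    indicator = 1 if u >= 1 else 0
    return classification_loss(p, u) + indicator * localization_loss(t_u, v)


# Hypothetical RoI with K = 2 object classes plus background at index 0.
p = [0.1, 0.7, 0.2]          # softmax output over K + 1 classes
t_u = (0.2, -0.1, 0.5, 2.0)  # predicted box values for the true class
v = (0.0, 0.0, 0.0, 0.0)     # true box values

print(round(multi_task_loss(p, 1, t_u, v), 4))  # object RoI: both terms count
print(round(multi_task_loss(p, 0, t_u, v), 4))  # background RoI: classification only
```

Note how the large error of 2.0 on the last box value contributes only 1.5 to the loss, whereas a squared (L2) penalty would grow quadratically with the error; this is the reduced sensitivity to outliers mentioned above.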

Disadvantages of Fast R-CNN

Fast R-CNN is much faster in terms of testing time, because we don’t have to feed 2,000 region proposals to the convolutional neural network for every image. Instead, a convolution operation is done only once per image, and a feature map is generated from it. Training is also faster because all the components are in one CNN network: feature extractor, object classifier, and bounding-box regressor. However, there is a big bottleneck remaining: the selective search algorithm for generating region proposals is very slow, and the proposals are generated separately from the network by another model. The last step to achieve a complete end-to-end object detection system using DL is to find a way to combine the region proposal algorithm into our end-to-end DL network. This is what Faster R-CNN does, as we will see next.

7.2.3 Faster R-CNN

Faster R-CNN is the third iteration of the R-CNN family, developed in 2016 by Shaoqing Ren et al. Similar to Fast R-CNN, the image is provided as an input to a convolutional network that provides a convolutional feature map. Instead of using a selective search algorithm on the feature map to identify the region proposals, a region proposal network (RPN) is used to predict the region proposals as part of the training process. The predicted region proposals are then reshaped using an RoI pooling layer and used to classify the image within the proposed region and predict the offset values for the bounding boxes. These improvements both reduce the number of region proposals and accelerate the test-time operation of the model to near real-time with then-state-of-the-art performance.

Faster R-CNN architecture

The architecture of Faster R-CNN can be described using two main networks:

Region proposal network (RPN) --Selective search is replaced by a ConvNet that proposes RoIs from the last feature maps of the feature extractor to be considered for investigation. The RPN has two outputs: the objectness score (object or no object) and the box location.
Fast R-CNN --It consists of the typical components of Fast R-CNN:
- Base network for the feature extractor: a typical pretrained CNN model to extract features from the input image
- RoI pooling layer to extract fixed-size RoIs
- Output layer that contains two fully connected layers: a softmax classifier to output the class probability and a bounding box regressor for the bounding box predictions

As you can see in figure 7.11, the input image is presented to the network, and its features are extracted via a pretrained CNN. These features, in parallel, are sent to two different components of the Faster R-CNN architecture:

The RPN to determine where in the image a potential object could be. At this point, we do not know what the object is, just that there is potentially an object at a certain location in the image.
RoI pooling to extract fixed-size windows of features.

Figure 7.11 The Faster R-CNN architecture has two main components: an RPN that identifies regions that may contain objects of interest and their approximate location, and a Fast R-CNN network that classifies objects and refines their location defined using bounding boxes. The two components share the convolutional layers of the pretrained VGG16.

The output is then passed into two fully connected layers: one for the object classifier and one for the bounding box coordinate predictions to obtain our final localizations.

This architecture achieves an end-to-end trainable, complete object detection pipeline where all of the required components are inside the network:

Base network feature extractor
Region proposal
RoI pooling
Object classification
Bounding-box regressor

Base network to extract features

Similar to Fast R-CNN, the first step is to use a pretrained CNN and slice off its classification part. The base network is used to extract features from the input image. We covered how this works in detail in chapter 6. In this component, you can use any of the popular CNN architectures based on the problem you are trying to solve. The original Faster R-CNN paper used ZF4 and VGG5 pretrained networks on ImageNet; but since then, there have been lots of different networks with a varying number of weights. For example, MobileNet,6 a smaller and efficient network architecture optimized for speed, has approximately 3.3 million parameters, whereas ResNet-152 (152 layers)--once the state of the art in the ImageNet classification competition--has around 60 million. Most recently, new architectures like DenseNet7 are both improving results and reducing the number of parameters.

Nowadays, ResNet architectures have mostly replaced VGG as a base network for extracting features. The obvious advantage of ResNet over VGG is that it has many more layers (is deeper), giving it more capacity to learn very complex features. This is true for the classification task and should be equally true in the case of object detection. In addition, ResNet makes it easy to train deep models with the use of residual connections and batch normalization, which was not invented when VGG was first released. Please revisit chapter 5 for a more detailed review of the different CNN architectures.

As we learned in earlier chapters, each convolutional layer creates abstractions based on the previous information. The first layer usually learns edges, the second finds patterns in edges to activate for more complex shapes, and so forth. Eventually we end up with a convolutional feature map that can be fed to the RPN to extract regions that contain objects.

Region proposal network (RPN)

The RPN identifies regions that could potentially contain objects of interest, based on the last feature map of the pretrained convolutional neural network. An RPN is also known as an attention network because it guides the network’s attention to interesting regions in the image. Faster R-CNN uses an RPN to bake the region proposal directly into the R-CNN architecture instead of running a selective search algorithm to extract RoIs.

The architecture of the RPN is composed of two layers (figure 7.12):

A 3 × 3 fully convolutional layer with 512 channels
Two parallel 1 × 1 convolutional layers: a classification layer that is used to predict whether the region contains an object (the score of it being background or foreground), and a layer for regression or bounding box prediction.

One important aspect of object detection networks is that they should be fully convolutional. A fully convolutional neural network means that the network does not contain any fully connected layers, typically found at the end of a network prior to making output predictions.

In the context of image classification, removing the fully connected layers is normally accomplished by applying average pooling across the entire volume prior to using a single dense softmax classifier to output the final predictions. An FCN has two main benefits:

It is faster because it contains only convolution operations and no fully connected layers.
It can accept images of any spatial resolution (width and height), provided the image and network can fit into the available memory.

Being an FCN makes the network invariant to the size of the input image. However, in practice, we might want to stick to a constant input size due to issues that only become apparent when we are implementing the algorithm. A significant such problem is that if we want to process images in batches (because images in batches can be processed in parallel by the GPU, leading to speed boosts), all of the images must have a fixed height and width.

Figure 7.12 Convolutional implementation of an RPN architecture, where k is the number of anchors

The 3 × 3 convolutional layer is applied on the last feature map of the base network where a sliding window of size 3 × 3 is passed over the feature map. The output is then passed to two 1 × 1 convolutional layers: a classifier and a bounding-box regressor. Note that the classifier and the regressor of the RPN are not trying to predict the class of the object and its bounding box; this comes later, after the RPN. Remember, the goal of the RPN is to determine whether the region has an object to be investigated afterward by the fully connected layers. In the RPN, we use a binary classifier to predict the objectness score of the region, to determine the probability of this region being a foreground (contains an object) or a background (doesn’t contain an object). It basically looks at the region and asks, “Does this region contain an object?” If the answer is yes, then the region is passed along for further investigation by RoI pooling and the final output layers (see figure 7.13).

Figure 7.13 The RPN classifier predicts the objectness score, which is the probability of a region containing an object (foreground) or a background.

How does the regressor predict the bounding box?

To answer this question, let’s first define the bounding box. It is the box that surrounds the object and is identified by the tuple (x, y, w, h), where x and y are the coordinates in the image that describe the center of the bounding box and h and w are the height and width of the bounding box. Researchers have found that defining the (x, y) coordinates of the center point can be challenging because we have to enforce some rules to make sure the network predicts values inside the boundaries of the image. Instead, we can create reference boxes called anchor boxes in the image and make the regression layer predict offsets from these boxes, called deltas (Δx, Δy, Δw, Δh), that adjust the anchor boxes to better fit the object and produce the final proposals (figure 7.14).

Figure 7.14 Illustration of predicting the delta shift from the anchor boxes and the bounding box coordinates
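
The delta adjustment can be sketched in a few lines of Python. This is a minimal illustration that assumes the parameterization used in the Faster R-CNN paper (center offsets scaled by the anchor size, width and height scaled in log space); the function name `apply_deltas` and the sample numbers are hypothetical.

```python
import math


def apply_deltas(anchor, deltas):
    """Shift and scale an anchor (x, y, w, h) by predicted deltas."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = deltas
    # Center offsets are expressed relative to the anchor size.
    x = xa + dx * wa
    y = ya + dy * ha
    # Width and height are scaled in log space, so they stay positive.
    w = wa * math.exp(dw)
    h = ha * math.exp(dh)
    return (x, y, w, h)


anchor = (100.0, 100.0, 64.0, 128.0)         # hypothetical anchor centered at (100, 100)
deltas = (0.25, -0.5, 0.0, math.log(0.5))    # hypothetical network output
print(apply_deltas(anchor, deltas))
```

Because the network only predicts small corrections to a box that already lies inside the image, it does not have to learn absolute coordinates from scratch.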

Anchor boxes

Using a sliding window approach, the RPN generates k regions for each location in the feature map. These regions are represented as anchor boxes. The anchors are centered in the middle of their corresponding sliding window and differ in terms of scale and aspect ratio to cover a wide variety of objects. They are fixed bounding boxes that are placed throughout the image to be used for reference when first predicting object locations. In their paper, Ren et al. generated nine anchor boxes that all had the same center but that had three different aspect ratios and three different scales.

Figure 7.15 shows an example of how anchor boxes are applied. Anchors are at the center of the sliding windows; each window has k anchor boxes with the anchor at their center.

Figure 7.15 Anchors are at the center of each sliding window. IoU is calculated to select the bounding box that overlaps the most with the ground truth.
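
As a rough sketch, the nine anchors at one sliding-window position can be generated as follows. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported by Ren et al.; the function name `make_anchors` and the example center are assumptions made for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors (x, y, w, h) at one center."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


anchors = make_anchors(cx=300, cy=200)   # hypothetical sliding-window center
print(len(anchors))
for a in anchors:
    print(a)
```

All nine boxes share the same center; only their widths and heights change, which is what lets one location cover small, large, tall, and wide objects.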

Training the RPN

The RPN is trained to classify an anchor box to output an objectness score and to approximate the four coordinates of the object (location parameters). The training data consists of bounding boxes labeled by human annotators. A labeled box is called the ground truth.

For each anchor box, the overlap probability value (p) is computed, which indicates how much these anchors overlap with the ground-truth bounding boxes:

If an anchor has high overlap with a ground-truth bounding box, then it is likely that the anchor box includes an object of interest, and it is labeled as positive with respect to the object versus no object classification task. Similarly, if an anchor has small overlap with a ground-truth bounding box, it is labeled as negative. During the training process, the positive and negative anchors are passed as input to two fully connected layers corresponding to the classification of anchors as containing an object or no object, and to the regression of location parameters (four coordinates), respectively. Corresponding to the k number of anchors from a location, the RPN network outputs 2k scores and 4k coordinates. Thus, for example, if the number of anchors per sliding window (k) is 9, then the RPN outputs 18 objectness scores and 36 location coordinates (figure 7.16).
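
Here is a minimal sketch of how anchors could be labeled from their overlap with the ground truth. It assumes boxes in corner format (x1, y1, x2, y2) and the IoU thresholds reported by Ren et al. (above 0.7 is positive, below 0.3 is negative, anything in between is ignored during training). It leaves out the paper's extra rule that the best-overlapping anchor for each ground-truth box is also marked positive. The function names are hypothetical, and `gt_boxes` is assumed to be non-empty.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (background), or -1 (ignored during training)."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


gt_boxes = [(50, 50, 150, 150)]                        # hypothetical ground-truth box
print(label_anchor((55, 55, 150, 150), gt_boxes))      # high overlap
print(label_anchor((200, 200, 300, 300), gt_boxes))    # no overlap
print(label_anchor((100, 50, 200, 150), gt_boxes))     # partial overlap

k = 9
print(2 * k, 4 * k)   # scores and coordinates per sliding-window location
```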

An RPN can be used as a standalone application. For example, in problems with a single class of objects, the objectness probability can be used as the final class probability. This is because in such a case, foreground means the single class of interest, and background means everything else.

The reason you would want to use RPN for cases like single-class detection is the gain in speed in both training and prediction. Since the RPN is a very simple network that only uses convolutional layers, the prediction time can be faster than using the classification base network.

Figure 7.16 Region proposal network

Fully connected layer

The output fully connected layer takes two inputs: the feature maps coming from the base ConvNet and the RoIs coming from the RPN. It then classifies the selected regions and outputs their prediction class and the bounding box parameters. The object classification layer in Faster R-CNN uses softmax activation, while the location regression layer uses linear regression over the coordinates defining the location as a bounding box. All of the network parameters are trained together using multi-task loss.

Multi-task loss function

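Similar to Fast R-CNN, Faster R-CNN is optimized for a multi-task loss function that combines the losses of classification and bounding box regression:

L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1 / N_loc) Σ_i p_i^* · L1_smooth(t_i − t_i^*)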

The loss equation might look a little overwhelming at first, but it is simpler than it appears. Understanding it is not necessary to be able to run and train Faster R-CNNs, so feel free to skip this section. But I encourage you to power through this explanation, because it will add a lot of depth to your understanding of how the optimization process works under the hood. Let’s go through the symbols first; see table 7.2.

Table 7.2 Multi-task loss function symbols

Symbol	Explanation
p_i and p_i^*	p_i is the predicted probability of anchor i being an object, and p_i^* is the binary ground truth (0 or 1) of the anchor being an object.
t_i and t_i^*	t_i is the predicted four parameters that define the bounding box, and t_i^* is the ground-truth parameters.
N_cls	Normalization term for the classification loss. Ren et al. set it to be a mini-batch size of ~256.
N_loc	Normalization term for the bounding box regression. Ren et al. set it to the number of anchor locations, ~2400.
L_cls(p_i, p_i^*)	The log loss function over two classes. We can easily translate a multi-class classification into a binary classification by predicting whether a sample is a target object: L_cls(p_i, p_i^*) = −p_i^* log p_i − (1 − p_i^*) log (1 − p_i)
L1_smooth	As described in section 7.2.2, the bounding box loss measures the difference between the predicted and true location parameters (t_i, t_i^*) using the smooth L1 loss function. It is a robust function and is claimed to be less sensitive to outliers than other regression losses like L2.
λ	A balancing parameter, set to be ~10 in Ren et al. (so the L_cls and L_loc terms are roughly equally weighted).

Now that you know the definitions of the symbols, let’s try to read the multi-task loss function again. To help understand this equation, just for a moment, ignore the normalization terms and the (i) terms. Here’s the simplified loss function for each instance (i):

Loss = L_cls(p, p^*) + p^* · L1_smooth(t - t^*)

This simplified function is the summation of two loss functions: the classification loss and the location loss (bounding box). Let’s look at them one at a time:

The idea of any loss function is to measure the error between the predicted value and the true value. The classification loss is the cross-entropy function explained in chapter 2. Nothing new. It is a log loss function that calculates the error between the prediction probability (p) and the ground truth (p^*):

L_cls(p_i, p_i^* ) = −p_i^* log p_i − (1 − p_i^*) log (1 - p_i)
The location loss is the difference between the predicted and true location parameters (t_i , t_i^*) using the smooth L1 loss function. The difference is then multiplied by the ground truth probability of the region containing an object p^*. If it is not an object, p^* is 0 to eliminate the entire location loss for non-object regions.

Finally, we add the values of both losses to create the multi-task loss function:

L = L_cls + L_loc

There you have it: the multi-task loss function for each instance (i). Put back the (i) indices, the normalization terms, and the Σ symbols to calculate the summation of losses over all instances.
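
To see the pieces working together, here is a minimal pure-Python sketch of the full loss, using the symbols from the table above and the default values reported by Ren et al. (N_cls = 256, N_loc = 2400, λ = 10). The function names and the two sample anchors are illustrative assumptions, and the predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def smooth_l1(x):
    """Quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cls_loss(p, p_star):
    """Binary log loss between predicted objectness p and label p_star."""
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))


def loc_loss(t, t_star):
    """Smooth L1 summed over the four box parameters."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


def rpn_loss(preds, targets, n_cls=256, n_loc=2400, lam=10.0):
    """preds: list of (p_i, t_i); targets: list of (p_i*, t_i*)."""
    total_cls = sum(cls_loss(p, ps) for (p, _), (ps, _) in zip(preds, targets))
    # p_i* multiplies the box term, so background anchors contribute nothing.
    total_loc = sum(ps * loc_loss(t, ts)
                    for (_, t), (ps, ts) in zip(preds, targets))
    return total_cls / n_cls + lam * total_loc / n_loc


# Two hypothetical anchors: one object (p* = 1) and one background (p* = 0).
preds = [(0.9, (0.1, 0.1, 0.2, 0.2)), (0.2, (1.0, 1.0, 1.0, 1.0))]
targets = [(1, (0.0, 0.0, 0.0, 0.0)), (0, (0.0, 0.0, 0.0, 0.0))]
print(rpn_loss(preds, targets))
```

Changing the box prediction of the background anchor leaves the result unchanged, which is exactly the effect of multiplying the location loss by p^*.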

7.2.4 Recap of the R-CNN family

Table 7.3 recaps the evolution of the R-CNN architecture:

R-CNN --Bounding boxes are proposed by the selective search algorithm. Each is warped, and features are extracted via a deep convolutional neural network such as AlexNet, before a final set of object classifications and bounding box predictions is made with linear SVMs and linear regressors.
Fast R-CNN --A simplified design with a single model. An RoI pooling layer is used after the CNN to consolidate regions. The model predicts both class labels and RoIs directly.
Faster R-CNN --A fully end-to-end DL object detector. It replaces the selective search algorithm to propose RoIs with a region proposal network that interprets features extracted from the deep CNN and learns to propose RoIs directly.

Table 7.3 The evolution of the R-CNN family of networks from R-CNN to Fast R-CNN to Faster R-CNN

	R-CNN	Fast R-CNN	Faster R-CNN

mAP on the PASCAL Visual Object Classes Challenge 2007	66.0%	66.9%	66.9%
Features	Applies selective search to extract RoIs (~2,000) from each image. A ConvNet is used to extract features from each of the ~2,000 regions extracted. Uses classification and bounding box predictions.	Each image is passed only once to the CNN, and feature maps are extracted. A ConvNet is used to extract feature maps from the input image. Selective search is used on these maps to generate predictions. This way, we run only one ConvNet over the entire image instead of ~2,000 ConvNets over 2,000 overlapping regions.	Replaces the selective search method with a region proposal network, which makes the algorithm much faster. An end-to-end DL network.
Limitations	High computation time, as each region is passed to the CNN separately. Also, uses three different models for making predictions.	Selective search is slow and, hence, computation time is still high.	Object proposal takes time. And as there are different systems working one after the other, the performance of systems depends on how the previous system performed.
Test time per image	50 seconds	2 seconds	0.2 seconds
Speed-up from R-CNN	1x	25x	250x

R-CNN limitations

As you might have noticed, each paper proposes improvements to the seminal work done in R-CNN to develop a faster network, with the goal of achieving real-time object detection. The achievements displayed through this set of work are truly amazing, yet none of these architectures manages to create a real-time object detector. Without going into too much detail, the following problems have been identified with these networks:

Training the model is unwieldy and takes too long.
Training happens in multiple phases (such as training the region proposal network versus training the classifier).
The network is too slow at inference time.

Fortunately, in the last few years, new architectures have been created to address the bottlenecks of R-CNN and its successors, enabling real-time object detection. The most famous are the single-shot detector (SSD) and you only look once (YOLO), which we will explain in sections 7.3 and 7.4.

Multi-stage vs. single-stage detector

Models in the R-CNN family are all region-based. Detection happens in two stages, and thus these models are called two-stage detectors:

The model proposes a set of RoIs using selective search or an RPN. The proposed regions are sparse because the potential bounding-box candidates can be infinite.
A classifier only processes the region candidates.

One-stage detectors take a different approach. They skip the region proposal stage and run detection directly over a dense sampling of possible locations. This approach is simpler and significantly faster, but it tends to be somewhat less accurate than two-stage detection. In the next two sections, we will examine the SSD and YOLO one-stage object detectors.

7.3 Single-shot detector (SSD)

The SSD paper was released in 2016 by Wei Liu et al.8 The SSD network reached new records in terms of performance and precision for object detection tasks, scoring over 74% mAP at 59 FPS on standard datasets such as the PASCAL VOC and Microsoft COCO.

As discussed at the beginning of this chapter, the most common metric for measuring detection speed is the number of frames per second. For example, Faster R-CNN operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline, but so far, significantly increased speed has come only at the cost of significantly decreased detection accuracy. In this section, you will see why single-stage networks like SSD can achieve faster detections that are more suitable for real-time detection.

For benchmarking, SSD300 achieves 74.3% mAP at 59 FPS, while SSD512 achieves 76.8% mAP at 22 FPS, which outperforms Faster R-CNN (73.2% mAP at 7 FPS). SSD300 refers to an input image of size 300 × 300, and SSD512 refers to an input image of size 512 × 512.

We learned earlier that the R-CNN family are multi-stage detectors: the network first predicts the objectness score of the bounding box and then passes this box through a classifier to predict the class probability. In single-stage detectors like SSD and YOLO (discussed in section 7.4), the convolutional layers make both predictions directly in one shot: hence the name single-shot detector. The image is passed once through the network, and the objectness score for each bounding box is predicted using logistic regression to indicate the level of overlap with the ground truth. If the bounding box overlaps 100% with the ground truth, the objectness score is 1; and if there is no overlap, the objectness score is 0. We then set a threshold value (0.5) that says, “If the objectness score is above 50%, this bounding box likely has an object of interest, and we get predictions. If it is less than 50%, we ignore the predictions.”

7.3.1 High-level SSD architecture

The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by an NMS step to produce the final detections. The architecture of the SSD model is composed of three main parts:

Base network to extract feature maps --A standard pretrained network used for high-quality image classification, which is truncated before any classification layers. In their paper, Liu et al. used a VGG16 network. Other networks like VGG19 and ResNet can be used and should produce good results.
Multi-scale feature layers --A series of convolution filters are added after the base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.
Non-maximum suppression --NMS is used to eliminate overlapping boxes and keep only one box for each object detected.

As you can see in figure 7.17, layers 4_3, 7, 8_2, 9_2, 10_2, and 11_2 make predictions directly to the NMS layer. We will talk about why these layers progressively decrease in size in section 7.3.3. For now, let’s follow along to understand the end-to-end flow of data in SSD.

Figure 7.17 The SSD architecture is composed of a base network (VGG16), extra convolutional layers for object detection, and a non-maximum suppression (NMS) layer for final detections. Note that convolution layers 7, 8, 9, 10, and 11 make predictions that are directly fed to the NMS layer. (Source: Liu et al., 2016.)

You can see in figure 7.17 that the network makes a total of 8,732 detections per class that are then fed to an NMS layer to reduce down to one detection per object. Where did the number 8,732 come from?

To have more accurate detection, different layers of feature maps also go through a small 3 × 3 convolution for object detection. For example, Conv4_3 is of size 38 × 38 × 512, and a 3 × 3 convolution is applied. There are four bounding boxes at each location, each of which has (number of classes + 4 box values) outputs. Suppose there are 20 object classes plus 1 background class; then each box has 21 + 4 = 25 outputs, and the number of bounding boxes for this layer is 38 × 38 × 4 = 5,776. Similarly, we calculate the number of bounding boxes for the other convolutional layers:

Conv7: 19 × 19 × 6 = 2,166 boxes (6 boxes for each location)
Conv8_2: 10 × 10 × 6 = 600 boxes (6 boxes for each location)
Conv9_2: 5 × 5 × 6 = 150 boxes (6 boxes for each location)
Conv10_2: 3 × 3 × 4 = 36 boxes (4 boxes for each location)
Conv11_2: 1 × 1 × 4 = 4 boxes (4 boxes for each location)

If we sum them up, we get 5,776 + 2,166 + 600 + 150 + 36 + 4 = 8,732 boxes produced. This is a huge number of boxes to show for our detector. That’s why we apply NMS to reduce the number of the output boxes. As you will see in section 7.4, in YOLO there are 7 × 7 locations at the end with two bounding boxes for each location: 7 × 7 × 2 = 98 boxes.
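
The arithmetic above is easy to check with a short script. The feature-map sizes and boxes per location are the ones listed in this section; the variable names are just for illustration.

```python
# (layer name, feature-map side length, boxes per location), as listed above
feature_layers = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]

total = 0
for name, side, boxes_per_location in feature_layers:
    n = side * side * boxes_per_location
    total += n
    print(f"{name}: {side} x {side} x {boxes_per_location} = {n}")

print("total boxes:", total)   # 8732
```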

For each feature, the network predicts the following:

4 values that describe the bounding box (x, y, w, h)
1 value for the objectness score
C values that represent the probability of each class

That’s a total of 5 + C prediction values. Suppose there are four object classes in our problem. Then each prediction will be a vector that looks like this: [x, y, w, h, objectness score, C1, C2, C3, C4].

An example visualization of the output prediction when we have four classes in our problem. The convolutional layer predicts the bounding box coordinates, objectness score, and four class probabilities: C1, C2, C3, and C4.
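
As a small sketch of how such a prediction vector might be consumed, the following function splits it into its parts and applies the 0.5 objectness threshold described earlier. The function name `decode_prediction`, the dictionary layout, and the sample numbers are assumptions made for illustration, not part of the SSD code.

```python
def decode_prediction(vector, class_names, threshold=0.5):
    """Split [x, y, w, h, objectness, C1..Cn] and apply the objectness threshold."""
    x, y, w, h, objectness = vector[:5]
    class_probs = vector[5:]
    if objectness < threshold:
        return None   # treated as background; the prediction is ignored
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {
        "box": (x, y, w, h),
        "objectness": objectness,
        "label": class_names[best],
        "prob": class_probs[best],
    }


names = ["C1", "C2", "C3", "C4"]
confident = [0.5, 0.5, 0.2, 0.3, 0.9, 0.1, 0.7, 0.1, 0.1]   # hypothetical output
uncertain = [0.5, 0.5, 0.2, 0.3, 0.3, 0.1, 0.7, 0.1, 0.1]   # low objectness

print(decode_prediction(confident, names))
print(decode_prediction(uncertain, names))
```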

Now, let’s dive a little deeper into each component of the SSD architecture.

7.3.2 Base network

As you can see in figure 7.17, the SSD architecture builds on the VGG16 architecture after slicing off the fully connected classification layers (VGG16 is explained in detail in chapter 5). VGG16 was used as the base network because of its strong performance in high-quality image classification tasks and its popularity for problems where transfer learning helps to improve results. Instead of the original VGG fully connected layers, a set of supporting convolutional layers (from Conv6 onward) was added to enable us to extract features at multiple scales and progressively decrease the size of the input to each subsequent layer.

Following is a simplified code implementation of the VGG16 network used in SSD using Keras. You will not need to implement this from scratch; my goal in including this code snippet is to show you that this is a typical VGG16 network like the one implemented in chapter 5:

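from keras.layers import Input, Conv2D, MaxPooling2D

# Assumed input: a 300 x 300 RGB image, as in SSD300
inputs = Input(shape=(300, 300, 3))

conv1_1 = Conv2D(64, (3, 3), activation='relu', padding='same')(inputs)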
conv1_2 = Conv2D(64, (3, 3), activation='relu', padding='same')(conv1_1)
pool1 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv1_2)
  
conv2_1 = Conv2D(128, (3, 3), activation='relu', padding='same')(pool1)
conv2_2 = Conv2D(128, (3, 3), activation='relu', padding='same')(conv2_1)
pool2 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv2_2)
  
conv3_1 = Conv2D(256, (3, 3), activation='relu', padding='same')(pool2)
conv3_2 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv3_1)
conv3_3 = Conv2D(256, (3, 3), activation='relu', padding='same')(conv3_2)
pool3 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv3_3)
  
conv4_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool3)
conv4_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv4_1)
conv4_3 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv4_2)
pool4 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same')(conv4_3)
  
conv5_1 = Conv2D(512, (3, 3), activation='relu', padding='same')(pool4)
conv5_2 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv5_1)
conv5_3 = Conv2D(512, (3, 3), activation='relu', padding='same')(conv5_2)
pool5 = MaxPooling2D(pool_size=(3, 3), strides=(1, 1), padding='same')(conv5_3)

You saw VGG16 implemented in Keras in chapter 5. The two main takeaways from adding this here are as follows:

Layer conv4_3 will be used again to make direct predictions.
Layer pool5 will be fed to the next layer (conv6), which is the first of the multi-scale feature layers.

How the base network makes predictions

Consider the following example. Suppose you have the image in figure 7.18, and the network’s job is to draw bounding boxes around all the boats in the image. The process goes as follows:

Similar to the anchors concept in Faster R-CNN, SSD overlays a grid of anchors on the image. For each anchor, the network creates a set of bounding boxes at its center. In SSD, anchors are called priors.
The base network looks at each bounding box as a separate image. For each bounding box, the network asks, “Is there a boat in this box?” Or in other words, “Did I extract any features of a boat in this box?”
When the network finds a bounding box that contains boat features, it sends its coordinates prediction and object classification to the NMS layer.
NMS eliminates the overlapping boxes and keeps only one box for each detected boat: the one with the highest confidence score.
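
The last step can be sketched as a greedy procedure. This minimal version assumes boxes in corner format (x1, y1, x2, y2), ranks them by confidence score (the common formulation at prediction time, when no ground truth is available), and uses an IoU threshold of 0.5. The threshold, the function names, and the sample boxes are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


# Two hypothetical boxes on the same boat and one box on a different boat.
boxes = [(10, 10, 110, 110), (15, 15, 115, 115), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # indices of the boxes that survive
```

In a multi-class detector, this procedure is typically run separately for each class.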

NOTE Liu et al. used VGG16 because of its strong performance in complex image classification tasks. You can use other networks like the deeper VGG19 or ResNet for the base network, and it should perform as well if not better in accuracy; but it could be slower if you choose to implement a deeper network. MobileNet is a good choice if you want a balance between a complex, high-performing deep network and being fast.

Now, on to the next component of the SSD architecture: multi-scale feature layers.

Figure 7.18 The SSD base network looks at the anchor boxes to find features of a boat. Solid boxes indicate that the network has found boat features. Dotted boxes indicate no boat features.

7.3.3 Multi-scale feature layers

These are convolutional feature layers that are added to the end of the truncated base network. These layers decrease in size progressively to allow predictions of detections at multiple scales.

Multi-scale detections

To understand the goal of the multi-scale feature layers and why they vary in size, let’s look at the image of horses in figure 7.19. As you can see, the base network may be able to detect the horse features in the background, but it may fail to detect the horse that is closest to the camera. To understand why, take a close look at the dotted bounding box and try to imagine this box alone outside the context of the full image (see figure 7.20).

Figure 7.19 Horses at different scales in an image. The horses that are far from the camera are easier to detect because they are small in size and can fit inside the priors (anchor boxes). The base network might fail to detect the horse closest to the camera because it needs a different scale of anchors to be able to create priors that cover more identifiable features.

Figure 7.20 An isolated horse feature

Can you see horse features in the bounding box in figure 7.20? No. To deal with objects of different scales in an image, some methods suggest preprocessing the image at different sizes and combining the results afterward (figure 7.21). However, by using feature maps from several convolutional layers of different sizes in a single network for prediction, we can mimic the same effect while also sharing parameters across all object scales. As a CNN reduces the spatial dimension gradually, the resolution of the feature maps also decreases. SSD uses lower-resolution layers to detect larger-scale objects. For example, 4 × 4 feature maps are used for larger-scale objects.

Figure 7.21 Lower-resolution feature maps detect larger-scale objects (right); higher-resolution feature maps detect smaller-scale objects (left).

To visualize this, imagine that the network reduces the image dimensions to be able to fit all of the horses inside its bounding boxes (figure 7.22). The multi-scale feature layers resize the image dimensions and keep the bounding-box sizes so that they can fit the larger horse. In reality, convolutional layers do not literally reduce the size of the image; this is just for illustration to help us intuitively understand the concept. The image is not just resized, it actually goes through the convolutional process and thus won’t look anything like itself anymore. It will be a completely random-looking image, but it will preserve its features. The convolutional process is explained in detail in chapter 3.

Figure 7.22 Multi-scale feature layers reduce the spatial dimensions of the input image to detect objects with different scales. In this image, you can see that the new priors are kind of zoomed out to cover more identifiable features of the horse close to the camera.

Using multi-scale feature maps improves network accuracy significantly. Liu et al. ran an experiment to measure the advantage gained by adding the multi-scale feature layers. Figure 7.23 reports the accuracy obtained with different numbers of feature-map layers used for object detection; accuracy decreases as fewer layers are used.

Figure 7.23 Effects of using multiple output layers from the original paper. The detector’s accuracy (mAP) increases when the authors add multi-scale features. (Source: Liu et al., 2016.)

Notice that network accuracy drops from 74.3% when predictions are sourced from all six layers to 62.4% with a single source layer. Performance is worst when only the conv7 layer is used for prediction, reinforcing the message that it is critical to spread boxes of different scales over different layers.

Architecture of the multi-scale layers

Liu et al. decided to add six convolutional layers that decrease in size. They did this with a lot of tuning and trial and error until they produced the best results. As you saw in figure 7.17, convolutional layers 6 and 7 are pretty straightforward. Conv6 has a kernel size of 3 × 3, and conv7 has a kernel size of 1 × 1. Layers 8 through 11, on the other hand, are treated more like blocks, where each block consists of two convolutional layers of kernel sizes 1 × 1 and 3 × 3.

Here is the code implementation in Keras for layers 6 through 11 (you can see the full implementation in the book’s downloadable code):

# conv6 and conv7
conv6 = Conv2D(1024, (3, 3), dilation_rate=(6, 6), activation='relu', 
    padding='same')(pool5)
conv7 = Conv2D(1024, (1, 1), activation='relu', padding='same')(conv6)
 
# conv8 block
conv8_1 = Conv2D(256, (1, 1), activation='relu', padding='same')(conv7)
conv8_2 = Conv2D(512, (3, 3), strides=(2, 2), activation='relu', 
    padding='valid')(conv8_1)
  
# conv9 block
conv9_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv8_2)
conv9_2 = Conv2D(256, (3, 3), strides=(2, 2), activation='relu', 
    padding='valid')(conv9_1)
  
# conv10 block
conv10_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv9_2)
conv10_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', 
    padding='valid')(conv10_1)
  
# conv11 block
conv11_1 = Conv2D(128, (1, 1), activation='relu', padding='same')(conv10_2)
conv11_2 = Conv2D(256, (3, 3), strides=(1, 1), activation='relu', 
    padding='valid')(conv11_1)

As mentioned before, if you are not working in research or academia, you most probably won’t need to implement object detection architectures yourself. In most cases, you will download an open source implementation and build on it to work on your problem. I just added these code snippets to help you internalize the information discussed about different layer architectures.

Dilated convolutions introduce another parameter to convolutional layers: the dilation rate. This defines the spacing between the values in a kernel. A 3 × 3 kernel with a dilation rate of 2 has the same field of view as a 5 × 5 kernel while only using nine parameters. Imagine taking a 5 × 5 kernel and deleting every second column and row.

This delivers a wider field of view at the same computational cost.


Dilated convolutions are particularly popular in the field of real-time segmentation. Use them if you need a wide field of view and cannot afford multiple convolutions or larger kernels.

The following code builds a dilated 3 × 3 convolution layer with a dilation rate of 2 using Keras:

Conv2D(1024, (3, 3), dilation_rate=(2,2), activation='relu', padding='same')
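To make the field-of-view claim concrete, here is a minimal sketch that computes how many input pixels a dilated kernel spans along one axis, using the standard relation k_eff = k + (k − 1)(d − 1). The helper name effective_kernel_size is ours for illustration; it is not part of Keras.

```python
def effective_kernel_size(kernel_size, dilation_rate):
    """Span (in input pixels) covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


# Rate 1 is an ordinary convolution, rate 2 is the example in the text,
# and rate 6 is the value used for conv6 in the SSD snippet above.
for rate in (1, 2, 6):
    span = effective_kernel_size(3, rate)
    print(f"3x3 kernel, dilation rate {rate}: spans {span}x{span} pixels, 9 weights")
```

In every case the layer still learns only nine weights per input channel; only the spacing between them changes.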

Next, we discuss the third and last component of the SSD architecture: NMS.

7.3.4 Non-maximum suppression

Given the large number of boxes generated by the detection layer per class during a forward pass of SSD at inference time, it is essential to prune most of the bounding boxes by applying the NMS technique (explained earlier in this chapter). Boxes with a confidence score below a certain threshold are discarded, boxes that overlap a higher-scoring box by more than an IoU threshold are suppressed, and only the top N predictions are kept (figure 7.24). This ensures that only the most likely predictions are retained by the network, while the noisier ones are removed.

How does SSD use NMS to prune the bounding boxes? SSD sorts the predicted boxes by their confidence scores. Starting from the top-confidence prediction, SSD checks whether the current box overlaps any previously kept bounding box of the same class above a certain threshold by calculating their IoU. (The IoU threshold value is tunable. Liu et al. chose 0.45 in their paper.) Boxes with IoU above the threshold are ignored because they overlap too much with another box that has a higher confidence score, so they are most likely detecting the same object. At most, we keep the top 200 predictions per image.

Figure 7.24 Non-maximum suppression reduces the number of bounding boxes to only one box for each object.
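The pruning procedure can be sketched in a few lines of NumPy. This is a simplified, single-class illustration rather than the SSD implementation itself: the function names iou and nms and the three toy boxes are made up for the example, and boxes are assumed to be in (xmin, ymin, xmax, ymax) format. The default thresholds mirror the values mentioned in this chapter (IoU of 0.45, top 200), although SSD applies the top-200 limit per image rather than per class.

```python
import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all (xmin, ymin, xmax, ymax)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, conf_thresh=0.01, iou_thresh=0.45, top_k=200):
    """Greedy NMS for a single class; returns the indices of the kept boxes."""
    candidates = np.where(scores >= conf_thresh)[0]        # drop low-confidence boxes
    order = candidates[np.argsort(-scores[candidates])]    # highest score first
    keep = []
    while order.size > 0 and top_k > len(keep):
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[iou_thresh >= overlaps]                # suppress heavy overlaps
    return keep


# Toy data: the first two boxes cover the same object, the third is separate.
boxes = np.array([[10, 10, 60, 60],
                  [12, 12, 62, 62],
                  [100, 100, 150, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

With these toy values, the second box is suppressed because it overlaps the higher-scoring first box by far more than 0.45, while the third box is kept because it does not overlap either of them.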

7.4 You only look once (YOLO)

Similar to the R-CNN family, YOLO is a family of object detection networks developed by Joseph Redmon et al. and improved over the years through the following versions:

YOLOv1, published in 2016--Called “unified, real-time object detection” because it is a single-detection network that unifies the two components of a detector: object detector and class predictor.
YOLOv2 (also known as YOLO9000), published later in 2016--Capable of detecting over 9,000 object categories; hence the name. It has been trained on ImageNet and COCO datasets and has achieved 16% mAP, which is not good; but it was very fast during test time.
YOLOv3, published in 2018--Significantly larger than previous models and has achieved a mAP of 57.9%, which is the best result yet out of the YOLO family of object detectors.

The YOLO family is a series of end-to-end DL models designed for fast object detection, and it was among the first attempts to build a fast real-time object detector. It is one of the faster object detection algorithms out there. Although the accuracy of the models is close to, but not as good as, that of R-CNNs, they are popular for object detection because of their detection speed, often demonstrated on real-time video or camera feed input.

The creators of YOLO took a different approach than the previous networks. YOLO does not undergo the region proposal step like R-CNNs. Instead, it only predicts over a limited number of bounding boxes by splitting the input into a grid of cells; each cell directly predicts a bounding box and object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction using NMS (figure 7.25).

Figure 7.25 YOLO splits the image into grids, predicts objects for each grid, and then uses NMS to finalize predictions.

YOLOv1 proposed the general architecture, YOLOv2 refined the design and made use of predefined anchor boxes to improve bounding-box proposals, and YOLOv3 further refined the model architecture and training process. In this section, we are going to focus on YOLOv3 because it is currently the state-of-the-art architecture in the YOLO family.

7.4.1 How YOLOv3 works

The YOLO network splits the input image into a grid of S × S cells. If the center of the ground-truth box falls into a cell, that cell is responsible for detecting the existence of that object. Each grid cell predicts B number of bounding boxes and their objectness score along with their class predictions, as follows:

Coordinates of B bounding boxes --Similar to previous detectors, YOLO predicts four coordinates for each bounding box (b_x , b_y , b_w , b_h), where b_x and b_y are set to be offsets of a cell location.
Objectness score (P₀)--Indicates the probability that the cell contains an object. The objectness score is passed through a sigmoid function to be treated as a probability with a value range between 0 and 1. The objectness score is calculated as follows:

P₀ = P_r (containing an object) × IoU (pred, truth)
Class prediction --If the bounding box contains an object, the network predicts the probability of K number of classes, where K is the total number of classes in your problem.

It is important to note that before v3, YOLO used a softmax function for the class scores. In v3, Redmon et al. decided to use sigmoid instead. The reason is that softmax imposes the assumption that each box has exactly one class: if an object belongs to one class, then it’s guaranteed not to belong to another. While this assumption is true for some datasets, it does not hold when we have overlapping classes like Woman and Person. A multilabel approach models the data more accurately.
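The difference between the two activations is easy to see on a toy example. The sketch below applies softmax and sigmoid to the same made-up class scores for a single box; the class names and logit values are hypothetical and chosen only to illustrate overlapping labels.

```python
import numpy as np


def softmax(logits):
    """Scores compete: the outputs always sum to 1 across classes."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def sigmoid(logits):
    """Each class is scored independently, so several can be close to 1."""
    return 1.0 / (1.0 + np.exp(-logits))


class_names = ["person", "woman", "dog"]   # hypothetical label set
logits = np.array([3.0, 2.5, -4.0])        # made-up raw scores for one box

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)
for name, p_soft, p_sig in zip(class_names, softmax_probs, sigmoid_probs):
    print(f"{name:>6}: softmax={p_soft:.2f}  sigmoid={p_sig:.2f}")
```

With softmax, the two overlapping labels must split the probability mass between them, so neither can be very confident. With sigmoid, both can score above 0.9 at the same time.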

Figure 7.26 Example of a YOLOv3 workflow when applying a 13 × 13 grid to the input image. The input image is split into 169 cells. Each cell predicts B number of bounding boxes and their objectness score along with their class predictions. In this example, we show the cell at the center of the ground-truth making predictions for 3 boxes (B = 3). Each prediction has the following attributes: bounding box coordinates, objectness score, and class predictions.

As you can see in figure 7.26, for each bounding box (b), the prediction looks like this: [(bounding box coordinates), (objectness score), (class predictions)]. We’ve learned that each bounding box has four coordinate values, one objectness score, and K class predictions, for a total of 5 + K values. With B boxes per cell, the total number of values predicted at one scale is B × (5 + K) multiplied by the number of cells in the grid S × S:

Total predicted values = S × S × B × (5 + K)

Predictions across different scales

Look closely at figure 7.26. Notice that the prediction feature map has three boxes. You might have wondered why there are three boxes. Similar to the anchors concept in SSD, YOLOv3 has nine anchors to allow for prediction at three different scales per cell. The detection layer makes detections at feature maps of three different sizes having strides 32, 16, and 8, respectively. This means that with an input image of size 416 × 416, we make detections on scales 13 × 13, 26 × 26, and 52 × 52 (figure 7.27). The 13 × 13 layer is responsible for detecting large objects, the 26 × 26 layer is for detecting medium objects, and the 52 × 52 layer detects smaller objects.

Figure 7.27 Prediction feature maps at different scales

This results in the prediction of three bounding boxes for each cell (B = 3). That’s why in figure 7.26, the prediction feature map is predicting Box 1, Box 2, and Box 3. The bounding box responsible for detecting the dog will be the one whose anchor has the highest IoU with the ground-truth box.

NOTE Detections at different layers help address the issue of detecting small objects, which was a frequent complaint with YOLOv2. The upsampling layers can help the network preserve and learn fine-grained features, which are instrumental for detecting small objects.

The network does this by downsampling the input image until the first detection layer, where a detection is made using the feature maps of a layer with stride 32. Then the layers are upsampled by a factor of 2 and concatenated with the feature maps of earlier layers that have identical feature-map sizes. A second detection is made at the layer with stride 16. The same upsampling procedure is repeated, and a final detection is made at the layer with stride 8.
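The relationship between stride, grid size, and the number of predicted boxes is simple arithmetic. The following sketch works it out for the 416 × 416 input and B = 3 used in this section; the variable names are ours.

```python
input_size = 416
strides = (32, 16, 8)
boxes_per_cell = 3  # B

total_boxes = 0
for stride in strides:
    grid = input_size // stride               # cells along each side
    n_boxes = grid * grid * boxes_per_cell    # boxes predicted at this scale
    total_boxes += n_boxes
    print(f"stride {stride:>2}: {grid} x {grid} grid -> {n_boxes} boxes")

print("total boxes:", total_boxes)
```

Most of the boxes come from the finest grid (stride 8), which is the scale responsible for small objects.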

YOLOv3 output bounding boxes

For an input image of size 416 × 416, YOLO predicts ((52 × 52) + (26 × 26) + (13 × 13)) × 3 = 10,647 bounding boxes. That is a huge number of boxes for an output. In our dog example, we have only one object. We want only one bounding box around this object. How do we reduce the boxes from 10,647 down to 1?

First, we filter the boxes based on their objectness score. Generally, boxes with scores below a threshold are ignored. Second, we use NMS to cure the problem of multiple detections of the same object. For example, all three bounding boxes of the outlined grid cell at the center of the image may detect the same object, or the adjacent cells may detect it as well.
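The first filtering step amounts to a single comparison per box. Here is a minimal sketch using synthetic objectness scores; the random scores and the 0.5 threshold are assumptions made for illustration, not values produced by a trained YOLOv3 network.

```python
import numpy as np

rng = np.random.default_rng(0)
n_boxes = 10_647

# Synthetic scores skewed toward 0, since most boxes see only background.
objectness = rng.beta(0.5, 8.0, size=n_boxes)

objectness_thresh = 0.5  # assumed value; tune it for your application
keep_mask = objectness >= objectness_thresh
print(f"{keep_mask.sum()} of {n_boxes} boxes survive the objectness filter")
```

Only the surviving boxes are then passed on to NMS, which keeps the second step cheap.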

7.4.2 YOLOv3 architecture

Now that you understand how YOLO works, going through the architecture will be very simple and straightforward. YOLO is a single neural network that unifies object detection and classification into one end-to-end network. The neural network architecture was inspired by the GoogLeNet model (Inception) for feature extraction. Instead of the Inception modules, YOLO uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers. Redmon and Farhadi called this DarkNet (figure 7.28).

Figure 7.28 High-level architecture of YOLO

YOLOv2 used a custom deep architecture, darknet-19, an originally 19-layer network supplemented with 11 more layers for object detection. With a 30-layer architecture, YOLOv2 often struggled with small object detections. This was attributed to the loss of fine-grained features as the layers downsampled the input. In addition, YOLOv2’s architecture was still lacking some of the most important elements that are now staples in most state-of-the-art algorithms: residual blocks, skip connections, and upsampling. YOLOv3 incorporates all of these updates.

YOLOv3 uses a variant of DarkNet called Darknet-53 (figure 7.29). It has a 53-layer network that is trained on ImageNet. For the task of detection, 53 more layers are stacked onto it, giving us a 106-layer fully convolutional underlying architecture for YOLOv3. This is the reason behind the slowness of YOLOv3 compared to YOLOv2--but this comes with a great boost in detection accuracy.

Figure 7.29 DarkNet-53 feature extractor architecture. (Source: Redmon and Farhadi, 2018.)

Full architecture of YOLOv3

We just learned that YOLOv3 makes predictions across three different scales. This becomes a lot clearer when you see the full architecture, shown in figure 7.30.

Figure 7.30 YOLOv3 network architecture. (Inspired by the diagram in Ayoosh Kathuria’s post “What’s new in YOLO v3?” Medium, 2018, http://mng.bz/lGN2.)

The input image goes through the DarkNet-53 feature extractor, and then the image is downsampled by the network until layer 79. The network branches out and continues to downsample the image until it makes its first prediction at layer 82. This detection is made on a grid scale of 13 × 13 that is responsible for detecting large objects, as we explained before.

Next the feature map from layer 79 is upsampled by 2x to dimensions 26 × 26 and concatenated with the feature map from layer 61. Then the second detection is made by layer 94 on a grid scale of 26 × 26 that is responsible for detecting medium objects.

Finally, a similar procedure is followed again, and the feature map from layer 91 is subjected to a few upsampling convolutional layers before being depth concatenated with a feature map from layer 36. A third prediction is made by layer 106 on a grid scale of 52 × 52, which is responsible for detecting small objects.
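To see why the upsampled map and the earlier map can be concatenated, it helps to follow the tensor shapes. The sketch below mimics the 13 × 13 to 26 × 26 step with NumPy arrays. The channel counts (256 and 512) are illustrative assumptions rather than values read from the model, and upsample_2x is a helper written for this example.

```python
import numpy as np


def upsample_2x(feature_map):
    """Nearest-neighbor 2x upsampling of an (H, W, C) feature map."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)


# Channel counts are assumptions for illustration only.
coarse = np.zeros((13, 13, 256))   # stands in for the branch taken at layer 79
skip = np.zeros((26, 26, 512))     # stands in for the feature map from layer 61

upsampled = upsample_2x(coarse)                        # (26, 26, 256)
merged = np.concatenate([upsampled, skip], axis=-1)    # depth concatenation
print(upsampled.shape, merged.shape)
```

Upsampling matches the spatial dimensions, and the concatenation stacks the channels, so the merged map carries both the coarse semantic features and the finer-grained features from the earlier layer.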

7.5 Project: Train an SSD network in a self-driving car application

The code for this project was created by Pierluigi Ferrari in his GitHub repository (https://github.com/pierluigiferrari/ssd_keras). The project was adapted for this chapter; you can find this implementation with the book’s downloadable code.

Note that for this project, we are going to build a smaller SSD network called SSD7. SSD7 is a seven-layer version of the SSD300 network. It is important to note that while an SSD7 network would yield some acceptable results, this is not an optimized network architecture. The goal is just to build a low-complexity network that is fast enough for you to train on your personal computer. It took me around 20 hours to train this network on the road traffic dataset; training could take a lot less time on a GPU.

NOTE The original repository created by Pierluigi Ferrari comes with implementation tutorials for SSD7, SSD300, and SSD512 networks. I encourage you to check it out.

In this project, we will use a toy dataset created by Udacity. You can visit Udacity’s GitHub repository for more information on the dataset (https://github.com/udacity/self-driving-car/tree/master/annotations). It has more than 22,000 labeled images and 5 object classes: car, truck, pedestrian, bicyclist, and traffic light. All of the images have been resized to a height of 300 pixels and a width of 480 pixels. You can download the dataset as part of the book’s code.

NOTE The GitHub data repository is owned by Udacity, and it may be updated after this writing. To avoid any confusion, I downloaded the dataset that I used to create this project and provided it with the book’s code to allow you to replicate the results in this project.

What makes this dataset very interesting is that these are real-time images taken while driving in Mountain View, California, and neighboring cities during daylight conditions. No image cleanup was done. Take a look at the image examples in figure 7.31.

As stated on Udacity’s page, the dataset was labeled by CrowdAI and Autti. You can find the labels in CSV format in the folder, split into three files: training, validation, and test datasets. The labeling format is straightforward, as follows:

frame	xmin	xmax	ymin	ymax	class_id
1478019952686311006.jpg	237	251	143	155	1

Xmin, xmax, ymin, and ymax are the bounding box coordinates. Class_id is the correct label, and frame is the image name.
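If you want to inspect the labels before training, a few lines of standard-library Python are enough. This sketch parses the sample row shown above and computes the box size in pixels. It assumes tab-separated columns, as displayed here; if your copy of the label files is comma-separated, change the delimiter accordingly. Note the column order: xmin, xmax, ymin, ymax.

```python
import csv
import io

# The header and sample row from the text, held in memory for the example.
sample = (
    "frame\txmin\txmax\tymin\tymax\tclass_id\n"
    "1478019952686311006.jpg\t237\t251\t143\t155\t1\n"
)

parsed = []
for row in csv.DictReader(io.StringIO(sample), delimiter="\t"):
    width = int(row["xmax"]) - int(row["xmin"])
    height = int(row["ymax"]) - int(row["ymin"])
    parsed.append((row["frame"], int(row["class_id"]), width, height))

for frame, class_id, width, height in parsed:
    print(f"{frame}: class {class_id}, box {width} x {height} pixels")
```

The sample box is only 14 × 12 pixels in a 480 × 300 image, which hints at how small many of the objects in this dataset are.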

7.5.1 Step 1: Build the model

Before jumping into the model training, take a close look at the build_model method in the keras_ssd7.py file. This file builds a Keras model with the SSD architecture. As we learned earlier in this chapter, the model consists of convolutional feature layers and a number of convolutional predictor layers that take their input from different feature layers.

Here is what the build_model method looks like. Please read the comments in the keras_ssd7.py file to understand the arguments passed:

def build_model(image_size,
                n_classes,
                mode='training',
                l2_regularization=0.0,
                min_scale=0.1,
                max_scale=0.9,
                scales=None,
                aspect_ratios_global=[0.5, 1.0, 2.0],
                aspect_ratios_per_layer=None,
                two_boxes_for_ar1=True,
                steps=None,
                offsets=None,
                clip_boxes=False,
                variances=[1.0, 1.0, 1.0, 1.0],
                coords='centroids',
                normalize_coords=False,
                subtract_mean=None,
                divide_by_stddev=None,
                swap_channels=False,
                confidence_thresh=0.01,
                iou_threshold=0.45,
                top_k=200,
                nms_max_output_size=400,
                return_predictor_sizes=False):
    ...  # Function body omitted here; see keras_ssd7.py

7.5.2 Step 2: Model configuration

In this section, we set the model configuration parameters. First we set the height, width, and number of color channels to whatever we want the model to accept as image input. If your input images have a different size than defined here, or if your images have non-uniform size, you must use the data generator’s image transformations (resizing and/or cropping) so that your images end up having the required input size before they are fed to the model:

img_height = 300            ❶
img_width = 480             ❶
img_channels = 3            ❶
 
intensity_mean = 127.5      ❷
intensity_range = 127.5     ❷

❶ Height, width, and channels of the input images

❷ Set to your preference (maybe None). The current settings transform the input pixel values to the interval [-1,1].
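To check the claim about the [-1, 1] interval, here is the arithmetic on three representative pixel values. The sketch assumes the model subtracts intensity_mean and then divides by intensity_range, which is how these two settings are passed to build_model later in this project (as subtract_mean and divide_by_stddev).

```python
import numpy as np

intensity_mean = 127.5
intensity_range = 127.5

pixels = np.array([0.0, 127.5, 255.0])   # darkest, middle, and brightest values
normalized = (pixels - intensity_mean) / intensity_range
print(normalized)  # -1, 0, and 1
```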

The number of classes is the number of positive classes in your dataset: for example, 20 for PASCAL VOC or 80 for COCO. Class ID 0 must always be reserved for the background class:

n_classes = 5                                ❶
 
scales = [0.08, 0.16, 0.32, 0.64, 0.96]      ❷
 
aspect_ratios = [0.5, 1.0, 2.0]              ❸
steps = None                                 ❹
offsets = None                               ❺
 
two_boxes_for_ar1 = True                     ❻

clip_boxes = False                           ❼
 
variances = [1.0, 1.0, 1.0, 1.0]             ❽
 
normalize_coords = True                      ❾

❶ Number of classes in our dataset

❷ An explicit list of anchor box scaling factors. If this is passed, it overrides the min_scale and max_scale arguments.

❸ List of aspect ratios for the anchor boxes

❹ In case you’d like to set the step sizes for the anchor box grids manually; not recommended

❺ In case you’d like to set the offsets for the anchor box grids manually; not recommended

❻ Specifies whether to generate two anchor boxes for aspect ratio 1

❼ Specifies whether to clip the anchor boxes to lie entirely within the image boundaries

❽ List of variances by which the encoded target coordinates are scaled

❾ Specifies whether the model is supposed to use coordinates relative to the image size
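To get a feel for what these settings mean in pixels, the following sketch computes the anchor box sizes for each predictor layer. It follows the anchor formula from the SSD paper (width = s·√a, height = s/√a, plus an extra square box of scale √(s_k · s_k+1) for aspect ratio 1) and assumes that scales are relative to the shorter image side. The helper anchor_sizes is ours, so treat the output as an approximation of what the repository's encoder generates.

```python
import math

img_height, img_width = 300, 480
scales = [0.08, 0.16, 0.32, 0.64, 0.96]
aspect_ratios = [0.5, 1.0, 2.0]
two_boxes_for_ar1 = True


def anchor_sizes(scale, next_scale):
    """Return (width, height) pairs in pixels for one predictor layer."""
    size = min(img_height, img_width)   # assumption: scale is relative to the shorter side
    sizes = []
    for ar in aspect_ratios:
        sizes.append((scale * size * math.sqrt(ar), scale * size / math.sqrt(ar)))
        if ar == 1.0 and two_boxes_for_ar1:
            extra = math.sqrt(scale * next_scale) * size   # second, larger square box
            sizes.append((extra, extra))
    return sizes


# Four predictor layers use the first four scales; the fifth scale is needed
# only to compute the extra square box of the last layer.
for layer, (scale, next_scale) in enumerate(zip(scales[:-1], scales[1:]), start=1):
    boxes = ", ".join(f"{w:.0f}x{h:.0f}" for w, h in anchor_sizes(scale, next_scale))
    print(f"predictor layer {layer}: {boxes}")
```

With three aspect ratios and two_boxes_for_ar1 enabled, each location on a predictor layer gets four anchor boxes.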

7.5.3 Step 3: Create the model

Now we call the build_model() function to build our model:

model = build_model(image_size=(img_height, img_width, img_channels),
                    n_classes=n_classes,
                    mode='training',
                    l2_regularization=0.0005,
                    scales=scales,
                    aspect_ratios_global=aspect_ratios,
                    aspect_ratios_per_layer=None,
                    two_boxes_for_ar1=two_boxes_for_ar1,
                    steps=steps,
                    offsets=offsets,
                    clip_boxes=clip_boxes,
                    variances=variances,
                    normalize_coords=normalize_coords,
                    subtract_mean=intensity_mean,
                    divide_by_stddev=intensity_range)

You can optionally load saved weights. If you don’t want to load weights, skip the following code snippet:

model.load_weights('<path/to/model.h5>', by_name=True)

Instantiate an Adam optimizer and the SSD loss function, and compile the model. Here, we will use a custom Keras function called SSDLoss. It implements the multi-task log loss for classification and smooth L1 loss for localization. neg_pos_ratio and alpha are set as in the SSD paper (Liu et al., 2016):

adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
 
ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)
 
model.compile(optimizer=adam, loss=ssd_loss.compute_loss)

7.5.4 Step 4: Load the data

To load the data, follow these steps:

Instantiate two DataGenerator objects--one for training and one for validation:

train_dataset = DataGenerator(load_images_into_memory=False, 
    hdf5_dataset_path=None)
val_dataset = DataGenerator(load_images_into_memory=False, 
    hdf5_dataset_path=None)

Parse the image and label lists for the training and validation datasets:

images_dir = 'path_to_downloaded_directory'
 
train_labels_filename = 'path_to_dataset/labels_train.csv'     ❶
val_labels_filename   = 'path_to_dataset/labels_val.csv'
 
train_dataset.parse_csv(images_dir=images_dir,
                      labels_filename=train_labels_filename,
                      input_format=['image_name', 'xmin', 'xmax', 'ymin',
                                    'ymax', 'class_id'],
                      include_classes='all')
 
val_dataset.parse_csv(images_dir=images_dir,
                      labels_filename=val_labels_filename,
                      input_format=['image_name', 'xmin', 'xmax', 'ymin',
                                    'ymax', 'class_id'],
                      include_classes='all')
 
 
train_dataset_size = train_dataset.get_dataset_size()          ❷
val_dataset_size   = val_dataset.get_dataset_size()            ❷
 
print("Number of images in the training dataset:	{:>6}".format(train_dataset_size))
print("Number of images in the validation dataset:	{:>6}".format(val_dataset_size))

❶ Ground truth

❷ Gets the number of samples in the training and validation datasets

This cell should print out the size of your training and validation datasets as follows:

Number of images in the training dataset:     18000
Number of images in the validation dataset:    4241

Set the batch size:
```
batch_size = 16
```
As you learned in chapter 4, you can increase the batch size to get a boost in the computing speed based on the hardware that you are using for this training.

Define the data augmentation process:

data_augmentation_chain = DataAugmentationConstantInputSize(
                                       random_brightness=(-48, 48, 0.5),
                                       random_contrast=(0.5, 1.8, 0.5),
                                       random_saturation=(0.5, 1.8, 0.5),
                                       random_hue=(18, 0.5),
                                       random_flip=0.5,
                                       random_translate=((0.03,0.5),
                                                         (0.03,0.5), 0.5),
                                       random_scale=(0.5, 2.0, 0.5),
                                       n_trials_max=3,
                                       clip_boxes=True,
                                       overlap_criterion='area',
                                       bounds_box_filter=(0.3, 1.0),
                                       bounds_validator=(0.5, 1.0),
                                       n_boxes_min=1,
                                       background=(0,0,0))

Instantiate an encoder that can encode ground-truth labels into the format needed by the SSD loss function. Here, the encoder constructor needs the spatial dimensions of the model’s predictor layers to create the anchor boxes:

predictor_sizes = [model.get_layer('classes4').output_shape[1:3],
                   model.get_layer('classes5').output_shape[1:3],
                   model.get_layer('classes6').output_shape[1:3],
                   model.get_layer('classes7').output_shape[1:3]]
 
ssd_input_encoder = SSDInputEncoder(img_height=img_height,
                                    img_width=img_width,
                                    n_classes=n_classes,
                                    predictor_sizes=predictor_sizes,
                                    scales=scales,
                                    aspect_ratios_global=aspect_ratios,
                                    two_boxes_for_ar1=two_boxes_for_ar1,
                                    steps=steps,
                                    offsets=offsets,
                                    clip_boxes=clip_boxes,
                                    variances=variances,
                                    matching_type='multi',
                                    pos_iou_threshold=0.5,
                                    neg_iou_limit=0.3,
                                    normalize_coords=normalize_coords)

Create the generator handles that will be passed to Keras’s fit_generator() function:

train_generator = train_dataset.generate(batch_size=batch_size,
                                         shuffle=True,
                                         transformations=[data_augmentation_chain],
                                         label_encoder=ssd_input_encoder,
                                         returns={'processed_images',
                                                  'encoded_labels'},
                                         keep_images_without_gt=False)

val_generator = val_dataset.generate(batch_size=batch_size,
                                     shuffle=False,
                                     transformations=[],
                                     label_encoder=ssd_input_encoder,
                                     returns={'processed_images',
                                              'encoded_labels'},
                                     keep_images_without_gt=False)

7.5.5 Step 5: Train the model

Everything is set, and we are ready to train our SSD7 network. We’ve already chosen an optimizer and a learning rate and set the batch size; now let’s set the remaining training parameters and train the network. There are no new parameters here that you haven’t learned already. We will set the model checkpoint, early stopping, and learning rate reduction callbacks:

model_checkpoint = ModelCheckpoint(
    filepath='ssd7_epoch-{epoch:02d}_loss-{loss:.4f}_val_loss-{val_loss:.4f}.h5',
    monitor='val_loss',
    verbose=1,
    save_best_only=True,
    save_weights_only=False,
    mode='auto',
    period=1)
 
csv_logger = CSVLogger(filename='ssd7_training_log.csv',
                       separator=',',
                       append=True)
 
early_stopping = EarlyStopping(monitor='val_loss',                ❶
                               min_delta=0.0,
                               patience=10,
                               verbose=1)
 
reduce_learning_rate = ReduceLROnPlateau(monitor='val_loss',      ❷
                                         factor=0.2,
                                         patience=8,
                                         verbose=1,
                                         epsilon=0.001,
                                         cooldown=0,
                                         min_lr=0.00001)
 
callbacks = [model_checkpoint, csv_logger, early_stopping, reduce_learning_rate]

❶ Early stopping if val_loss did not improve for 10 consecutive epochs

❷ Reduces the learning rate when val_loss plateaus
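It can be useful to know in advance which learning rates this configuration can produce. The sketch below walks through the successive reductions, assuming the callback multiplies the current rate by factor on each plateau and never goes below min_lr. The starting value 0.001 is the learning rate we gave the Adam optimizer earlier.

```python
initial_lr = 0.001   # learning rate passed to Adam
factor = 0.2
min_lr = 0.00001

lr = initial_lr
schedule = [lr]
while lr > min_lr:
    lr = max(lr * factor, min_lr)   # reduce, but respect the floor
    schedule.append(lr)

for step, value in enumerate(schedule):
    print(f"after {step} reductions: lr = {value:.6f}")
```

The floor is reached after three reductions, so with a patience of 8 epochs, a 20-epoch run will see at most two of them.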

Set one epoch to consist of 1,000 training steps. I’ve arbitrarily set the number of epochs to 20 here. This does not necessarily mean that 20,000 training steps is the optimum number. Depending on the model, dataset, learning rate, and so on, you might have to train much longer (or less) to achieve convergence:

initial_epoch   = 0                                                ❶
final_epoch     = 20                                               ❶
steps_per_epoch = 1000

history = model.fit_generator(generator=train_generator,           ❷
                              steps_per_epoch=steps_per_epoch,
                              epochs=final_epoch,
                              callbacks=callbacks,
                              validation_data=val_generator,
                              validation_steps=ceil(val_dataset_size/batch_size),
                              initial_epoch=initial_epoch)

❶ If you’re resuming previous training, set initial_epoch and final_epoch accordingly.

❷ Starts training
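To put these settings in perspective, here is the arithmetic behind them, using the dataset sizes printed in step 4 and the batch size of 16. Note that ceil comes from Python's math module.

```python
from math import ceil

train_dataset_size = 18000
val_dataset_size = 4241
batch_size = 16
steps_per_epoch = 1000
final_epoch = 20

total_steps = steps_per_epoch * final_epoch
images_per_epoch = steps_per_epoch * batch_size
passes_per_epoch = images_per_epoch / train_dataset_size
validation_steps = ceil(val_dataset_size / batch_size)

print(f"total training steps: {total_steps}")
print(f"images drawn per epoch: {images_per_epoch} "
      f"(~{passes_per_epoch:.2f} passes over the training set)")
print(f"validation steps per epoch: {validation_steps}")
```

In other words, one "epoch" as defined here covers a little less than one full pass over the 18,000 training images, and the validation set is covered completely after every epoch.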

7.5.6 Step 6: Visualize the loss

Let’s visualize the loss and val_loss values to look at how the training and validation loss evolved and check whether our training is going in the right direction (figure 7.32):

plt.figure(figsize=(20,12))
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend(loc='upper right', prop={'size': 24})

Figure 7.32 Visualized loss and val_loss values during SSD7 training for 20 epochs

7.5.7 Step 7: Make predictions

Now let’s make some predictions on the validation dataset with the trained model. For convenience, we’ll use the validation generator that we’ve already set up. Feel free to change the batch size:

predict_generator = val_dataset.generate(batch_size=1,                    ❶
                                         shuffle=True,
                                         transformations=[],
                                         label_encoder=None,
                                         returns={'processed_images',
                                                  'processed_labels',
                                                  'filenames'},
                                         keep_images_without_gt=False)
 
batch_images, batch_labels, batch_filenames = next(predict_generator)     ❷
 
y_pred = model.predict(batch_images)                                      ❸
  
y_pred_decoded = decode_detections(y_pred,                                ❹
                                   confidence_thresh=0.5,
                                   iou_threshold=0.45,
                                   top_k=200,
                                   normalize_coords=normalize_coords,
                                   img_height=img_height,
                                   img_width=img_width)
 
np.set_printoptions(precision=2, suppress=True, linewidth=90)
i = 0                                                                     # Index of the batch item to inspect
print("Predicted boxes:\n")
print('   class   conf xmin   ymin   xmax   ymax')
print(y_pred_decoded[i])

❶ 1. Set the generator for the predictions.

❷ 2. Generate samples.

❸ 3. Make a prediction.

❹ 4. Decode the raw prediction y_pred.

This code snippet prints the predicted bounding boxes along with their class and the level of confidence for each one, as shown in figure 7.33.

Figure 7.33 Predicted bounding boxes, confidence level, and class
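
To give a feel for what the confidence_thresh, iou_threshold, and top_k arguments control, here is a simplified, pure-Python sketch of confidence filtering followed by greedy per-class non-maximum suppression. It is not the implementation behind decode_detections; the function names iou and filter_detections, the exact comparison rules, and the sample boxes are assumptions made for illustration:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_detections(detections, confidence_thresh=0.5,
                      iou_threshold=0.45, top_k=200):
    """Filter rows of the form [class_id, conf, xmin, ymin, xmax, ymax]."""
    # 1. Drop low-confidence boxes and sort the rest, most confident first.
    candidates = sorted((d for d in detections if d[1] >= confidence_thresh),
                        key=lambda d: d[1], reverse=True)
    kept = []
    # 2. Greedy NMS: keep a box unless it overlaps an already-kept box
    #    of the same class by more than iou_threshold.
    for det in candidates:
        if all(det[0] != k[0] or iou(det[2:], k[2:]) <= iou_threshold
               for k in kept):
            kept.append(det)
    # 3. Keep at most top_k boxes.
    return kept[:top_k]

# Made-up raw detections: two overlapping class-1 boxes, one weak box,
# and one class-2 box in the same region.
raw = [[1, 0.9, 10, 10, 110, 110],
       [1, 0.8, 15, 15, 115, 115],
       [1, 0.3, 200, 200, 250, 250],
       [2, 0.7, 12, 12, 108, 108]]
filtered = filter_detections(raw)
print(filtered)
```

In this example the second class-1 box is suppressed because it overlaps the more confident one, the third is dropped for low confidence, and the class-2 box survives because suppression is applied per class.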

When we draw these predicted boxes onto the image, as shown in figure 7.34, each predicted box has its confidence next to the category name. The ground-truth boxes are also drawn onto the image for comparison.

Figure 7.34 Predicted boxes drawn onto the image
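
One way to prepare such a drawing is to turn each decoded row into a text label and rectangle parameters first. The sketch below assumes rows in the printed order (class, conf, xmin, ymin, xmax, ymax); the helper name box_annotations, the class list, and the sample row are hypothetical and only illustrate the idea:

```python
def box_annotations(decoded_rows, class_names):
    """Turn [class_id, conf, xmin, ymin, xmax, ymax] rows into drawing parameters."""
    annotations = []
    for class_id, conf, xmin, ymin, xmax, ymax in decoded_rows:
        # Label text: category name followed by the confidence.
        label = '{}: {:.2f}'.format(class_names[int(class_id)], conf)
        # matplotlib's Rectangle patch, for example, takes a corner point
        # plus a width and a height rather than two corner points.
        annotations.append({'label': label,
                            'xy': (xmin, ymin),
                            'width': xmax - xmin,
                            'height': ymax - ymin})
    return annotations

# Hypothetical class list and decoded row, for illustration only.
class_names = ['background', 'car', 'truck', 'pedestrian']
rows = [[1, 0.93, 120, 140, 200, 190]]
annotations = box_annotations(rows, class_names)
print(annotations)
```

Each resulting dictionary can then be passed to the plotting library of your choice to draw the box and place the label at its top-left corner.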

Summary

Image classification is the task of predicting the type or class of an object in an image.
Object detection is the task of predicting the location of objects in an image via bounding boxes and the classes of the located objects.
The general framework of object detection systems consists of four main components: region proposals, feature extraction and predictions, non-maximum suppression, and evaluation metrics.
Object detection algorithms are evaluated using two main metrics: frames per second (FPS) to measure the network’s speed, and mean average precision (mAP) to measure the network’s precision.
The three most popular object detection systems are the R-CNN family of networks, SSD, and the YOLO family of networks.
The R-CNN family of networks has three main variations: R-CNN, Fast R-CNN, and Faster R-CNN. R-CNN and Fast R-CNN use a selective search algorithm to propose RoIs, whereas Faster R-CNN is an end-to-end DL system that uses a region proposal network to propose RoIs.
The YOLO family of networks includes YOLOv1, YOLOv2 (or YOLO9000), and YOLOv3.
R-CNN is a multi-stage detector: it predicts the objectness score of the bounding box and the object class in two separate stages.
SSD and YOLO are single-stage detectors: the image is passed once through the network to predict the objectness score and the object class.
In general, single-stage detectors tend to be less accurate than two-stage detectors but are significantly faster.

1.Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” 2014, http://arxiv.org/abs/1311.2524.

2.Ross Girshick, “Fast R-CNN,” 2015, http://arxiv.org/abs/1504.08083.

3.Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” 2016, http://arxiv.org/abs/1506.01497.

4.Matthew D. Zeiler and Rob Fergus, “Visualizing and Understanding Convolutional Networks,” 2013, http://arxiv.org/abs/1311.2901.

5.Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” 2014, http://arxiv.org/abs/1409.1556.

6.Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” 2017, http://arxiv.org/abs/1704.04861.

7.Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely Connected Convolutional Networks,” 2016, http://arxiv.org/abs/1608.06993.

8.Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, “SSD: Single Shot MultiBox Detector,” 2016, http://arxiv.org/abs/1512.02325.

9.Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016, http://arxiv.org/abs/1506.02640.

10.Joseph Redmon and Ali Farhadi, “YOLO9000: Better, Faster, Stronger,” 2016, http://arxiv.org/abs/1612.08242.

11.Joseph Redmon and Ali Farhadi, “YOLOv3: An Incremental Improvement,” 2018, http://arxiv.org/abs/1804.02767.
