The Principles and Foundations of Semantic Segmentation

In this chapter, we are going to talk about how deep learning and convolutional neural networks (CNNs) can be adapted to solve semantic segmentation tasks in computer vision.

In a self-driving car (SDC), the vehicle must know exactly where another vehicle is on the road or where a person is crossing it. Semantic segmentation helps make these identifications. Semantic segmentation with CNNs effectively means classifying each pixel in the image. Thus, the idea is to create a map of the detected object areas in the image. Basically, what we want is an output image where every pixel has a label associated with it.

For example, semantic segmentation will label all the cars in an image, as shown here:

Fig 8.1: Semantic segmentation output

The demand for understanding visual data has increased in the field of computer vision due to the availability of data from mobile phones, surveillance systems, and vehicles. The advancement of computational power in recent years has enabled deep learning to take strides toward better visual understanding. Deep neural networks have achieved human-level performance on tasks such as image classification and traffic sign recognition, similar to what we implemented in Chapter 6, Improving the Image Classifier with CNNs. However, deep learning has a high computational cost, as an increase in performance is usually accomplished by increasing the network's size. Large neural networks are difficult to deploy in SDCs, where computational resources and power are limited.

As we already know, the first step in autonomous driving systems is based on perception, or visual inputs, namely object recognition, object localization, and semantic segmentation. Semantic segmentation assigns each pixel in an image to a given semantic class. Typically, these classes could be streets, traffic signs, street markings, cars, pedestrians, or sidewalks. When we deploy deep learning for semantic segmentation, the recognition of major objects in the image, such as people or vehicles, happens in the higher layers of a neural network. The biggest benefit of this strategy is that slight variance at the pixel level doesn't impact identification. However, semantic segmentation also requires the pixel-exact classification of small features, which are usually only captured in the lower layers.

In this chapter, we will cover the following topics:

  • Introduction to semantic segmentation
  • Understanding the semantic segmentation architecture
  • Overview of different semantic segmentation architectures
  • Deep learning for semantic segmentation

Let's get started!

Introduction to semantic segmentation

Numerous technology systems designed to identify a car's surroundings have emerged in recent years. Understanding the scene around the vehicle turns out to be an important area of research for analyzing the geometry of scenes and the objects in them. CNNs have proven to be the most effective computer vision tool for image classification, object detection, and semantic segmentation. In an automated environment, it is important to make critical decisions based on an understanding of a given scene at the pixel level. Semantic segmentation has proven to be one of the most effective methods of assigning labels to the individual pixels in an image.

Researchers have proposed numerous approaches to semantic pixel-wise labeling; some have applied deep architectures to pixel-wise labeling, and the results have been impressive. Since segmentation at the pixel level provides better performance, researchers started using these methods in real-time automated systems. Lately, driver assistance systems have become a top research area as they provide various opportunities for improving the driving experience. Driver performance can be improved by using the semantic segmentation techniques available in Advanced Driver Assistance Systems (ADAS).

Semantic segmentation is the process that associates each pixel of an image with a class label, where the classes can be a person, a street, a road, the sky, the ocean, or a car.

A semantic segmentation algorithm consists of the following steps (a minimal sketch of the resulting output follows this list):

  1. It partitions the image into meaningful regions.
  2. It associates every pixel in the input image with a class label, such as a person, tree, street, road, car, bus, and so on.
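
The following is a minimal sketch of step 2, assuming a hypothetical model whose final layer emits one score per class at every pixel; the predicted label map is simply the argmax over the class axis. The class list and array shapes here are purely illustrative:

import numpy as np

NUM_CLASSES = 5          # e.g. road, car, person, tree, sky (illustrative)
height, width = 256, 256

# Stand-in for a real network's output: one score per class at every pixel.
scores = np.random.rand(height, width, NUM_CLASSES)

# Assign every pixel its most likely class label.
label_map = np.argmax(scores, axis=-1)   # shape (256, 256), values 0 to 4

print(label_map.shape, label_map.min(), label_map.max())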

In the next section, we will understand the semantic segmentation architecture.

Understanding the semantic segmentation architecture

The semantic segmentation network generally consists of an encoder-decoder network. The encoder produces high-level features using convolutions, while the decoder interprets these high-level features in terms of classes. The encoder is typically a common, pre-trained network, while the decoder weights are learned while training the segmentation network. The following diagram shows the architecture of an encoder-decoder-based FCN for semantic segmentation:

Fig 8.2: Semantic segmentation architecture

You can check out the preceding diagram at the following link: https://www.mdpi.com/2313-433X/4/10/116/pdf.

The encoder gradually reduces the spatial dimensions with the help of pooling layers, while the decoder recovers the object details and spatial dimensions. You can read more about semantic segmentation in the paper ECRU: An Encoder-Decoder-Based Convolution Neural Network (CNN) for Road-Scene Understanding.
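
The following is a minimal encoder-decoder sketch in PyTorch; it is an illustration of the idea rather than a production network. The encoder halves the spatial dimensions twice with pooling, and the decoder recovers them with transposed convolutions, ending with one score per class at every pixel:

import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Encoder: convolution + ReLU + 2x2 max-pooling, twice.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # H/2 x W/2
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # H/4 x W/4
        )
        # Decoder: transposed convolutions restore the spatial size.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),  # H/2 x W/2
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),    # H x W
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyEncoderDecoder()
out = model(torch.randn(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 10, 256, 256]): per-pixel class scores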

One of the important concepts to understand is how semantic segmentation works in convolutional networks. The concept behind semantic segmentation is finding the meaningful parts of an image. We can see how the pixels belonging to one class occur in correlation with the pixels of another class. Let's consider the first layer of encoding in a CNN. The basic operation of a convolution is to encode the image into a higher-level representation, with the idea of expressing the image as a combination of parts, such as edges or gradients. The encoded features, such as edges, are not exclusive to one class but carry the context of their neighborhood as well. When we upsample and decode, these features are associated with classes, guided by the per-pixel label maps during the backpropagation of the network.

Hence, the main goal of semantic segmentation is to represent an image in a way that is easy to analyze. More precisely, we can say that image segmentation is the process of labeling every pixel in an image so that pixels with the same label have similar characteristics. 

In the next section, we will read about popular semantic segmentation architectures.

Overview of different semantic segmentation architectures

There are lots of deep learning architectures and pre-trained models for semantic segmentation that have been released in recent times. In this section, we will discuss the popular semantic segmentation architectures, which are as follows:

  • U-Net
  • SegNet
  • PSPNet
  • DeepLabv3+
  • E-Net

We will start by introducing U-Net.

U-Net

U-Net won the Grand Challenge for Computer-Automated Detection of Caries in Bitewing Radiography at the International Symposium on Biomedical Imaging (ISBI) 2015, and it also won the Cell Tracking Challenge at ISBI that same year.

U-Net is one of the fastest and most precise semantic segmentation architectures. It outperformed methods such as the sliding-window CNN at the ISBI challenge for the semantic segmentation of neuronal structures in electron microscopy stacks. At ISBI 2015, it also won the two most challenging transmitted light microscopy categories, phase contrast and DIC microscopy, by a large margin.

The main idea behind U-Net is to supplement a usual contracting network with successive layers in which pooling operations are replaced by upsampling operators, which increase the resolution of the output. The most important modification in U-Net is the upsampling part, which contains a large number of feature channels that allow the network to propagate contextual information to higher-resolution layers.

The network is composed of a contracting path and an expansive path, which gives it its U-shaped architecture. The contracting path is a standard convolutional neural network composed of repeated convolutions, each followed by a rectified linear unit (ReLU) and a max-pooling operation. Spatial information is reduced while feature information is increased during the contraction.

The expansive pathway is used to combine the spatial information and image features through a sequence of concatenations with high-resolution features from the contracting path and up-convolutions.
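
The following is a minimal U-Net-style sketch in PyTorch, far smaller than the network in the original paper: one contracting step, one expansive step, and a skip connection that concatenates the high-resolution features from the contracting path into the decoder:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # up-convolution
        # After concatenation: 16 (skip) + 16 (upsampled) = 32 channels.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16, num_classes, 1)            # per-pixel scores

    def forward(self, x):
        skip = self.enc(x)                     # high-resolution features
        mid = self.mid(self.pool(skip))        # contracted representation
        up = self.up(mid)                      # back to input resolution
        merged = torch.cat([skip, up], dim=1)  # the U-Net skip connection
        return self.out(self.dec(merged))

model = TinyUNet()
print(model(torch.randn(1, 3, 256, 256)).shape)   # [1, 2, 256, 256]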

The architecture of U-Net can be seen in the following diagram:

Fig 8.3: U-Net architecture producing a 256 * 256 mask using a 256 * 256 image

U-Net: Convolutional Networks for Biomedical Image Segmentation is a paper by Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Click on the following link for more details: https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/.

In the next section, we will cover SegNet.

SegNet

SegNet is a deep encoder-decoder architecture for multi-class pixel-wise segmentation that was researched and developed by members of the Computer Vision and Robotics Group (http://mi.eng.cam.ac.uk/Main/CVR) at the University of Cambridge, UK. 

The SegNet architecture consists of an encoder network, a corresponding decoder network, and a final pixel-wise classification layer. In other words, it is a series of non-linear processing layers (encoders) and a corresponding collection of decoders, followed by a pixel-wise classifier.

The architecture of SegNet can be seen in the following diagram:

Fig 8.4: SegNet architecture

You can also check out this diagram at https://mi.eng.cam.ac.uk/projects/segnet/.

The encoder typically consists of one or more convolutional layers with batch normalization and a ReLU, followed by non-overlapping max-pooling and sub-sampling. The sparse encoding that results from the pooling process is upsampled in the decoder using the encoder's max-pooling indices. That is, SegNet uses the max-pooling indices in the decoders to upsample the low-resolution feature maps. This has the significant advantage of keeping high-frequency details in the segmented images, as well as reducing the total number of trainable parameters in the decoders. SegNet is trained using stochastic gradient descent.
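
The following is a small sketch of this trick, using PyTorch's built-in support for pooling indices: MaxPool2d with return_indices=True records where each maximum came from, and MaxUnpool2d uses those indices to place the values back in their original positions during decoding:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
pooled, indices = pool(x)       # pooled: 1x1x2x2, indices: argmax locations
restored = unpool(pooled, indices)   # sparse 1x1x4x4: the maxima return to
                                     # their original positions; everything
                                     # else is filled with zeros

print(pooled.shape, restored.shape)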

In the next section, we will provide an overview of the encoder and decoder parts of the SegNet architecture.

Encoder

Convolutions and max-pooling are performed in the encoder, whose 13 convolutional layers are taken from VGG-16. The corresponding max-pooling indices are stored while performing 2x2 max-pooling.

Decoder

Upsampling and convolutions are performed in the decoder, with a softmax classifier for each pixel at the end. During upsampling, the max-pooling indices from the corresponding encoder layer are recalled and used. Then, a K-class softmax classifier is used to predict the class of each pixel.

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling was researched and developed by members of the Computer Vision and Robotics Group at the University of Cambridge, UK. Click on the following link for more details: http://mi.eng.cam.ac.uk/projects/segnet/.

In the next section, we'll cover the Pyramid Scene Parsing Network (PSPNet).

PSPNet

Full-Resolution Residual Networks (FRRNs) were very computationally intensive, and using them on full-scale images was really slow. PSPNet came into the picture in order to deal with this problem. It applies four pooling operations with four different window sizes and strides. Using these pooling layers allows us to extract feature information from different scales more efficiently.

PSPNet achieved state-of-the-art performance on various datasets. It became popular after the ImageNet scene parsing challenge in 2016, and it set records on the PASCAL VOC 2012 benchmark with 85.4% mIoU and on the Cityscapes benchmark with 80.2% mIoU. The following is a link to the relevant paper: https://arxiv.org/pdf/1612.01105.

The following diagram shows the architecture of PSPNet:

Fig 8.5: PSPNet architecture

Check out https://hszhao.github.io/projects/pspnet/ to find out more about the PSPNet architecture and its implementation.

In the preceding diagram, we can see the proposed architecture of PSPNet. For a given input image, a feature map is extracted using a convolutional neural network. Then, the pyramid parsing module is used to harvest representations of the different sub-regions. This is followed by upsampling and concatenation layers that form the final feature representation, which contains both local and global context information. Finally, this representation is fed into a convolution layer to get the final per-pixel prediction.
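
The following is a simplified sketch of the pyramid parsing idea: pool the feature map to several grid sizes, reduce the channels with 1x1 convolutions, upsample each result back to the feature map's size, and concatenate everything with the original features. The channel counts here are illustrative; the published PSPNet uses average pooling at bin sizes 1, 2, 3, and 6:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_channels=64, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),   # pool the features to a b x b grid
                nn.Conv2d(in_channels, in_channels // len(bins), 1),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        pyramids = [x]   # keep the original (local) features
        for stage in self.stages:
            # Upsample each pooled representation back to the input size.
            pyramids.append(F.interpolate(stage(x), size=(h, w),
                                          mode="bilinear", align_corners=False))
        return torch.cat(pyramids, dim=1)   # local + global context

ppm = PyramidPooling()
print(ppm(torch.randn(1, 64, 32, 32)).shape)   # [1, 128, 32, 32]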

In the next section, we will look at DeepLabv3+.

DeepLabv3+

DeepLab is a state-of-the-art semantic segmentation model that was developed and open sourced by Google in 2016. Multiple versions have been released and many improvements have been made to the model since then, including DeepLabv2, DeepLabv3, and DeepLabv3+.

Before the release of DeepLabv3+, networks either encoded multi-scale contextual information by applying filters and pooling operations at different rates, or captured objects with sharper boundaries by gradually recovering spatial information. DeepLabv3+ combines these two approaches: it uses both an encoder-decoder structure and a spatial pyramid pooling module.

The following diagram shows the architecture of DeepLabv3+, which consists of encoder and decoder modules: 

Fig 8.6: DeepLabv3+ architecture

Let's look at the encoder and decoder modules in more detail:

  • Encoder: In the encoder step, essential information from the input image is extracted using a pre-trained convolutional neural network. The essential information for segmentation tasks is the objects present in the image and their locations.
  • Decoder: The information that's extracted in the encoder step is used to create an output that is the same size as the original input image.

If you want to learn more about DeepLabv3+, you can read the paper Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation at https://arxiv.org/pdf/1802.02611.
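
The following is a simplified sketch of the atrous (dilated) convolutions behind DeepLab's spatial pyramid pooling module, not the full DeepLabv3+ module: the same 3x3 kernel is applied at several dilation rates, enlarging the receptive field to capture multi-scale context without extra parameters or downsampling:

import torch
import torch.nn as nn

in_ch, out_ch = 32, 16
branches = nn.ModuleList([
    # rate=1 is an ordinary convolution; larger rates insert gaps
    # ("holes") between the kernel taps. padding=rate keeps the
    # spatial size fixed.
    nn.Conv2d(in_ch, out_ch, 3, padding=rate, dilation=rate)
    for rate in (1, 6, 12, 18)
])

x = torch.randn(1, in_ch, 64, 64)
out = torch.cat([branch(x) for branch in branches], dim=1)
print(out.shape)   # [1, 64, 64, 64]: four branches of multi-scale context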

In the next section, we will look at the E-Net architecture.

E-Net

Real-time pixel-wise semantic segmentation is one of the great applications of semantic segmentation for SDCs. Semantic segmentation can improve an SDC's perception accuracy, but deploying it in real time remains a challenge. In this section, we'll look at the efficient neural network (E-Net), which aims to run on low-power mobile devices while remaining accurate.

E-Net is a popular network due to its ability to perform real-time pixel-wise semantic segmentation. E-Net is up to 18x faster, requires 75x fewer FLOPs, and has 79x fewer parameters than existing models such as SegNet, while providing similar or better accuracy. E-Net has been tested on the popular CamVid, Cityscapes, and SUN datasets.

The architecture of E-Net is as follows:

Fig 8.7: E-Net architecture

You can check out the preceding diagram at https://arxiv.org/pdf/1606.02147.pdf.

This is a framework with one master branch and several side branches that split from the master but also merge back into it via element-wise addition. The model architecture consists of an initial block and five bottleneck stages. The first three stages encode the input image, while the last two decode it. Let's learn more about the initial block and the bottleneck block.

Initial block: Let's say the resolution of the input image is 512x512. The following diagram shows that this results in an output of size 16x256x256 after concatenating a stride-2 convolution with 13 filters and a non-overlapping 2x2 max-pooling of the input:

Fig 8.8: Initial E-Net architecture block

You can find the preceding diagram at https://arxiv.org/abs/1606.02147.
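
The following is a sketch of the initial block as described above: a 3x3 convolution with stride 2 producing 13 feature maps, concatenated with a non-overlapping 2x2 max-pooling of the 3-channel input, giving 13 + 3 = 16 channels at half the input resolution:

import torch
import torch.nn as nn

class InitialBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 13, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        # Concatenate along the channel axis: 13 conv maps + 3 pooled maps.
        return torch.cat([self.conv(x), self.pool(x)], dim=1)

block = InitialBlock()
out = block(torch.randn(1, 3, 512, 512))
print(out.shape)   # torch.Size([1, 16, 256, 256])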

Bottleneck block: As shown in the following diagram, each branch consists of three convolutional layers. A 1x1 projection first reduces the dimensionality, and a 1x1 expansion restores it at the end. In between these sits the main convolution, which is either a regular, dilated, full, or asymmetric convolution. We can also see that batch normalization and PReLU are placed between all convolutions, and spatial dropout is used at the end of the branch:

Fig 8.9: Bottleneck block of E-Net

You can check out the preceding diagram at https://arxiv.org/abs/1606.02147.

We should note that max-pooling on the master branch is only applied when the bottleneck is downsampling. In that case, a non-overlapping 2x2 convolution with stride 2 also replaces the first 1x1 projection in the branch, and the activations are zero-padded to match the number of feature maps. In the decoder, max-unpooling replaces max-pooling, and spatial convolution is performed without bias.
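
The following is a simplified sketch of a non-downsampling bottleneck, with illustrative channel counts: a 1x1 projection, a main 3x3 convolution (here, optionally dilated), and a 1x1 expansion, with batch normalization and PReLU in between, spatial dropout at the end, and an element-wise addition back onto the master branch:

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=64, internal=16, dilation=1):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, internal, 1, bias=False),   # 1x1 projection
            nn.BatchNorm2d(internal), nn.PReLU(),
            nn.Conv2d(internal, internal, 3, padding=dilation,
                      dilation=dilation, bias=False),       # main convolution
            nn.BatchNorm2d(internal), nn.PReLU(),
            nn.Conv2d(internal, channels, 1, bias=False),   # 1x1 expansion
            nn.BatchNorm2d(channels),
            nn.Dropout2d(0.1),                              # spatial dropout
        )
        self.activation = nn.PReLU()

    def forward(self, x):
        # Merge the branch back into the master via element-wise addition.
        return self.activation(x + self.branch(x))

bottleneck = Bottleneck(dilation=2)
print(bottleneck(torch.randn(1, 64, 64, 64)).shape)   # [1, 64, 64, 64]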

You can learn more about E-Net in the paper written by Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. Click on the following link for more information: https://openreview.net/forum?id=HJy_5Mcll.

Summary

In this chapter, we learned about the importance of semantic segmentation in the field of SDCs. We also looked at an overview of a few popular deep learning architectures related to semantic segmentation: U-Net, SegNet, PSPNet, DeepLabv3+, and E-Net.

In the next chapter, we will implement semantic segmentation using E-Net. We will use it to detect various objects in images and videos. 
