Classifying pixels 

As we have already discussed, the desired output of a model performing semantic segmentation is an image in which each pixel is assigned the label of its most likely class (or even a specific instance of a class). Throughout this book, we have also seen that the layers of a deep neural network learn features that activate when the corresponding pattern is detected in the input. We can visualize these activations using a technique called class activation maps (CAMs). The technique produces a heatmap of class activations over the input image; the heatmap is a matrix of scores associated with a specific class, essentially giving us a spatial map of how strongly each input region activates that class. The following figure shows the output of a CAM visualization for the class cat. Here, you can see that the heatmap highlights what the model considers important features (and therefore regions) for this class:

The preceding figure was produced using the implementation described in the paper Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization by R. Selvaraju et al. The approach is to take the output feature map of a convolutional layer and weight every channel in that feature map by the gradient of the class score with respect to it. For more details on how it works, please refer to the original paper: https://arxiv.org/abs/1610.02391.
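If you would like to experiment with the idea yourself, the following is a minimal sketch of Grad-CAM using TensorFlow/Keras, not the authors' reference implementation. It assumes you already have a trained classifier (model), the name of its final convolutional layer (last_conv_layer_name), and the index of the class you want to visualize; all three are placeholders you would substitute for your own.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    # Build a model that maps the input image to the last conv layer's
    # feature map and to the final predictions.
    grad_model = tf.keras.models.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_output, predictions = grad_model(image[np.newaxis, ...])
        class_score = predictions[:, class_index]

    # Gradient of the class score with respect to the feature map,
    # averaged spatially to give one weight per channel.
    grads = tape.gradient(class_score, conv_output)
    weights = tf.reduce_mean(grads, axis=(1, 2))

    # Weight each channel by its gradient and sum to get the heatmap.
    cam = tf.reduce_sum(conv_output[0] * weights[0], axis=-1)

    # Keep only positive influence and normalize to [0, 1] for display.
    cam = tf.nn.relu(cam)
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()

The returned heatmap has the spatial resolution of the chosen convolutional layer, so you would typically resize it to the input image's size before overlaying it, as in the cat figure above.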

Early attempts at semantic segmentation used slightly adapted classification models, such as VGG and AlexNet, but they produced only coarse approximations. This can be seen in the preceding figure and is largely due to the repeated pooling layers in these networks, which cause a loss of spatial information.

U-Net is one architecture that addresses this; it consists of an encoder and a decoder, with shortcuts added between the two to preserve spatial information. Released in 2015 by O. Ronneberger, P. Fischer, and T. Brox for biomedical image segmentation, it has since become one of the go-to architectures for segmentation thanks to its effectiveness (it can be trained on a small dataset) and performance. The following figure shows the modified U-Net we will be using in this chapter:

U-Net is one of many architectures for semantic segmentation. Sasank Chilamkurthy's post A 2017 Guide to Semantic Segmentation with Deep Learning provides a great overview and comparison of the most popular architectures, available at http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review. For further details on U-Net, please refer to the original paper mentioned earlier. It is available at https://arxiv.org/pdf/1505.04597.pdf.

On the left of the preceding figure, we have the full network used in this chapter's project, and on the right we have an expanded view of the blocks used in the encoder and decoder parts of the network. As a reminder, the focus of this book is on applying machine learning rather than on the details of the models themselves, so we won't be delving into those details here, but there are a few interesting and useful things worth pointing out.

The first is the general structure of the network; it consists of an encoder and a decoder. The encoder's role is to capture context. The decoder's task is to use this context, together with features from the corresponding shortcuts, to project its understanding back onto pixel space and produce a dense, precise classification. It is common practice to bootstrap the encoder with the architecture and weights of a trained classification model, such as VGG16. This not only speeds up training but is also likely to increase performance, as the pretrained encoder brings with it a depth (pun intended) of understanding gained from what is typically a much larger dataset.
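As an illustration of this bootstrapping idea (not the exact model used in this chapter), the following sketch reuses a VGG16 trained on ImageNet as the encoder; the chosen layer names are one plausible set of tap points for the shortcuts:

from tensorflow.keras.applications import VGG16

# Load VGG16 without its classification head; its convolutional blocks
# become the encoder. Running this downloads the ImageNet weights.
backbone = VGG16(weights="imagenet", include_top=False,
                 input_shape=(448, 448, 3))

# Tap the output of each block before pooling to use as shortcut features.
skip_layer_names = ("block1_conv2", "block2_conv2",
                    "block3_conv3", "block4_conv3")
skips = [backbone.get_layer(name).output for name in skip_layer_names]

# The deepest block's output feeds the bottom of the U-Net.
encoder_output = backbone.get_layer("block5_conv3").output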

Another point worth highlighting is the shortcuts between the encoder and decoder. As mentioned previously, they preserve the spatial information produced by the convolutional layers of each encoder block before it is lost when the block's output is downsampled using max pooling. The decoder uses this information to achieve precise localization.
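To make this concrete, here is a minimal sketch, in Keras, of what one encoder block and one matching decoder block might look like; the layer counts and filter sizes are illustrative assumptions rather than the exact blocks from the figure:

from tensorflow.keras import layers

def encoder_block(x, filters):
    # Two convolutions capture context; max pooling then downsamples.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    skip = x  # kept as the shortcut passed across to the decoder
    x = layers.MaxPooling2D(2)(x)
    return x, skip

def decoder_block(x, skip, filters):
    # Upsample, then concatenate the encoder's shortcut to restore
    # the spatial detail lost during pooling.
    x = layers.UpSampling2D(2)(x)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x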

This is also the first time in this book that we have seen an upsampling layer. As the name implies, it upsamples your image (or feature maps) to a higher resolution. One of the simplest approaches is the same technique used when resizing images: rescale the input to the desired size and calculate the value at each point using an interpolation method, such as bilinear interpolation.
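For example, the following snippet (a toy illustration, not taken from this chapter's model) upsamples a 2 x 2 feature map to 4 x 4 using bilinear interpolation in Keras:

import tensorflow as tf

# A single-channel 2 x 2 feature map, shaped (batch, height, width, channels).
x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])[tf.newaxis, :, :, tf.newaxis]

# Double the spatial resolution, interpolating the in-between values.
upsampled = tf.keras.layers.UpSampling2D(size=2, interpolation="bilinear")(x)
print(upsampled[0, :, :, 0])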

Lastly, I want to bring to your attention the inputs and outputs of the model. The model expects a 448 x 448 color image as its input and outputs a 448 x 448 x 1 (single-channel) matrix. If you inspect the architecture, you will notice that the last layer uses a sigmoid activation; a sigmoid function is typically used for binary classification, which is precisely what we are doing here. Typically, you would perform multi-class classification for semantic segmentation tasks, in which case you would replace the sigmoid activation with a softmax activation. An example commonly used when introducing semantic segmentation is scene understanding for self-driving cars. The following is an example of a labeled scene from Cambridge University's Motion-based Segmentation and Recognition Dataset, where each color represents a different class:

Source: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/

But in this example, a binary classifier is sufficient, which will become apparent as we go into the details of the project. However, I wanted to highlight it here because the architecture scales to multi-class classification simply by swapping the sigmoid in the last layer for a softmax activation and changing the loss function, as sketched below.
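The following sketch illustrates that swap; the single convolutional layer stands in for the full U-Net body, and num_classes is a hypothetical parameter rather than something defined by this chapter's project:

from tensorflow.keras import Input, Model, layers

def build_segmentation_model(num_classes=1):
    # A stand-in for the full U-Net body; only the output head matters here.
    inputs = Input(shape=(448, 448, 3))
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)

    if num_classes == 1:
        # Binary segmentation: one channel per pixel, sigmoid activation,
        # trained with binary cross-entropy.
        outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:
        # Multi-class segmentation: one channel per class, softmax activation,
        # trained with categorical cross-entropy.
        outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
        loss = "categorical_crossentropy"

    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model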

You have now seen the architecture we will be using in this chapter. Let's look at how we will use it and at the data used to train the model.
