Understanding images

As mentioned previously, it's not my intention to give you a theoretical or deep understanding of any particular ML algorithm, but rather to gently introduce you to some of the main concepts. This will help you gain an intuitive understanding of how these algorithms work so that you know where and how to apply them, and give you a platform from which to dive deeper into each subject, which I strongly encourage you to do.

For a good introductory text on deep learning, I strongly recommend Andrew Trask's book Grokking Deep Learning. For a general introduction to ML, I would recommend Toby Segaran's book Programming Collective Intelligence: Building Smart Web 2.0 Applications.

In this section, we will introduce CNNs: what they are and why they are well suited to spatial data, that is, images. But before discussing CNNs, we will start by inspecting the data; then we'll see why CNNs perform better than their counterpart, the fully connected neural network (or simply, neural network).

For the purpose of illustrating these concepts, consider the task of classifying the following digits, where each digit is represented as a 5 x 5 matrix of pixels. The dark gray pixels have a value of 1 and light gray pixels have a value of 0:

Using a fully connected neural network (with a single hidden layer), our model would learn a weight for each pixel with respect to each label; that is, the model assigns positive weights to pixels that correlate with a given label, and the output with the highest likelihood is taken as the predicted label. During training, we take each image and flatten it before feeding it into our network, as shown in the following diagram:
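To make the flattening step concrete, here is a minimal sketch in Swift; the digitOne matrix and its values are purely illustrative, not taken from any particular dataset:

let digitOne: [[Float]] = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0]
]

// Row-major flattening: each pixel becomes one independent input feature.
let flattened: [Float] = digitOne.flatMap { $0 }
print(flattened.count) // 25

Note that once the image is flattened, the network has no way of knowing which of those 25 values were neighbors in the original grid; that loss of structure is exactly the problem described next.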

This works remarkably well, and if you have experience with ML, particularly deep learning, you will likely have come across the MNIST dataset. It's a dataset of labeled handwritten digits, where each digit is rendered in the center of a 28 x 28 grayscale image (a single channel, with pixel values ranging from 0 to 255). Using a fully connected network with a single hidden layer will likely result in a validation accuracy close to 90%. But what happens if we introduce some complexity, such as moving the digit around a larger space, as illustrated in the following diagram?

The fully connected network has no concept of space or local relationships; in this case, the model would need to learn every variant of each digit at every possible location. To further emphasize the importance of capturing spatial relationships, consider the task of learning from more complex images, such as classifying dogs and cats, using a network that discards 2D information. Individual pixels alone cannot portray complex shapes such as eyes, a nose, or ears; it's only when you consider neighboring pixels that you can describe these more complex shapes:

Images taken from the Kaggle competition Dogs vs. Cats (https://www.kaggle.com/c/dogs-vs-cats)

We need something that can abstract away from the raw pixels, something that can describe images using higher-level features. Let's return to our digits dataset and investigate how we might go about extracting higher-level features for the task of classification. As alluded to in the earlier example, we need a set of features that abstracts away from the raw pixels, is unaffected by position, and preserves 2D spatial information. If you're familiar with image processing, or even image processing tools, you have most probably come across edge detection or edge filters; in the simplest terms, these work by passing a set of kernels across the whole image, and the output is the image with its edges emphasized. Let's see how this looks diagrammatically. First, we have our set of kernels; each one extracts a specific feature of the image, such as the presence of horizontal edges, vertical edges, or edges at a 45-degree angle:
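As a rough sketch, such a kernel set might look like the following in Swift; the values are illustrative only (real edge filters, such as Sobel kernels, use different weights):

// Each 3 x 3 kernel responds strongly to one kind of edge when passed over the image.
let horizontalKernel: [[Float]] = [
    [-1, -1, -1],
    [ 1,  1,  1],
    [-1, -1, -1]
]

let verticalKernel: [[Float]] = [
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, -1]
]

let diagonalKernel: [[Float]] = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1]
]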

We pass each of these filters over our image to extract the corresponding feature; to help illustrate this, let's take one digit and pass the vertical kernel over it:

As illustrated in the previous diagram, we slide the vertical kernel across the image, producing a new image from the values of the image and the kernel. We continue until we reach the bounds of the image, as shown in the following diagram:

The output of this is a map showing the presence of vertical lines detected within the image. Using this and the other kernels, we can now describe each class by its dominant gradients rather than by pixel positions. This higher-level abstraction allows us to recognize classes independently of their location, as well as to describe more complex objects.
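Here is a minimal sketch of that sliding operation, reusing the digitOne and verticalKernel values defined earlier; the function name convolve2D is our own and, strictly speaking, it computes a cross-correlation (no kernel flip), which is what CNN layers do in practice:

func convolve2D(_ image: [[Float]], kernel: [[Float]], stride: Int = 1) -> [[Float]] {
    let kH = kernel.count
    let kW = kernel[0].count
    // With valid padding, the output shrinks by (kernel size - 1) in each dimension.
    let outH = (image.count - kH) / stride + 1
    let outW = (image[0].count - kW) / stride + 1
    var output = [[Float]](repeating: [Float](repeating: 0, count: outW), count: outH)

    for y in 0..<outH {
        for x in 0..<outW {
            // Multiply the overlapping pixels by the kernel and sum the result.
            var sum: Float = 0
            for ky in 0..<kH {
                for kx in 0..<kW {
                    sum += image[y * stride + ky][x * stride + kx] * kernel[ky][kx]
                }
            }
            output[y][x] = sum
        }
    }
    return output
}

// A 3 x 3 map: each value measures how strongly that region of the digit contains a vertical line.
let verticalFeatureMap = convolve2D(digitOne, kernel: verticalKernel)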

Two useful things to be aware of when dealing with kernels are the stride and padding. The stride determines how large a step you take when sliding the kernel across the image. In the preceding example, our stride is set to 1; that is, we slide by only a single pixel at a time. Padding refers to how you deal with the boundaries of the image; here, we are using valid padding, where we only process pixels within the valid range. Same padding would mean adding a border around the image to ensure that the output remains the same size as the input.
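A quick way to see the effect of stride and padding is to compute the size of the resulting feature map; this helper and the values below are illustrative:

func outputSize(input: Int, kernel: Int, stride: Int, padding: Int) -> Int {
    return (input - kernel + 2 * padding) / stride + 1
}

// 5 x 5 input with a 3 x 3 kernel and a stride of 1:
let valid = outputSize(input: 5, kernel: 3, stride: 1, padding: 0) // valid padding -> 3 x 3 output
let same  = outputSize(input: 5, kernel: 3, stride: 1, padding: 1) // same padding  -> 5 x 5 output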

What we have performed here is known as feature engineering, and it's something neural networks perform automatically; in particular, this is what CNNs do. They create a series of kernels (or convolution matrices) that are convolved with the image to extract local features from neighboring pixels. Unlike our hand-engineered example, these kernels are learned during training. Because they are learned automatically, we can afford to create many filters that extract the more granular nuances of the image, and we can effectively stack convolution layers on top of each other. This allows increasingly higher levels of abstraction to be learned; for example, your first layer may learn to detect simple edges, and your second layer (operating on the previously extracted features) may learn to extract simple shapes. The deeper we go, the higher the level of abstraction our features achieve, as illustrated in the following diagram:
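To hint at what stacking looks like, the following sketch reuses the convolve2D function from earlier; it applies the same illustrative kernel twice purely to show the composition, whereas a real CNN learns a different set of kernels at each layer and applies a non-linearity between them:

// Layer 1 operates on raw pixels and responds to simple edges.
let edgeMap = convolve2D(digitOne, kernel: verticalKernel)

// Layer 2 operates on layer 1's feature map, so it responds to
// combinations of edges (simple shapes) rather than raw pixels.
let shapeMap = convolve2D(edgeMap, kernel: verticalKernel)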

And there we have it! An architecture capable of understanding the world by learning features and layers of abstraction to efficiently describe it. Let's now put this into practice using a pretrained model and Core ML to get our phone to recognize the objects it sees. 
