Convolutional layers and filters

Convolutional layers and filters are at the heart of convolutional neural networks. In these layers, we slide a filter (also referred to in this text as a window or kernel) over our input ndarray and take the inner product at each step. Convolving our ndarray and kernel in this way results in a lower-dimensional representation of the image. Let's explore how this works on this grayscale image (available in the image-assets repository):

The preceding image is a 5 x 5 pixel grayscale image showing a black diagonal line against a white background.

Extracting the features from the preceding image, we get the following matrix of pixel intensities:

Next, let's assume we (or Keras) instantiate the following kernel:
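For reference, here's how the image matrix and the kernel might look as NumPy arrays. This is just an illustrative sketch: the exact pixel values (0 for the black diagonal, 255 for the white background) are assumptions based on the description of the image above:

```python
import numpy as np

# 5 x 5 grayscale image: a black diagonal (0) against a white background (255)
image = np.array([
    [  0, 255, 255, 255, 255],
    [255,   0, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255,   0, 255],
    [255, 255, 255, 255,   0],
])

# 3 x 3 identity kernel: only the center weight is non-zero
kernel = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
])
```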

We'll now visualize the convolution process. The window starts at the top-left of our image matrix. We'll slide the window right by a predetermined stride size. In this case, our stride size will be 1, but in general, the stride size should be considered another hyperparameter of your model. Once the window reaches the rightmost edge of the image, we'll slide our window down by 1 (our stride size), move the window back to the leftmost edge of the image, and start the process of taking the inner product again.

Now let's do this step by step:

  1. Slide the kernel over the top-left part of the matrix and calculate the inner product:

I'll explicitly map out the inner product for this first step so that you can easily follow along:  

(0 × 0) + (255 × 0) + (255 × 0) + (255 × 0) + (0 × 1) + (255 × 0) + (255 × 0) + (255 × 0) + (0 × 0) = 0

We write the result to our feature map and continue!

  1. Take the inner product and write the result to our feature map:

  1. Slide the window right again and take the inner product:

  1. We've reached the rightmost edge of the image. Slide the window down by 1, our stride size, and start the process again at the leftmost edge of the image:

  1. Slide the window right and take the inner product:

  1. Slide the window right again and take the inner product; we've now covered every position in this row:

  1. We've reached the rightmost edge once more, so slide the window down by 1, return to the leftmost edge, and take the inner product:

  1. Slide the window right and take the inner product:

  1. Slide the window right one last time and take the final inner product, completing the feature map:

Voila! We've now represented our original 5 x 5 image in a 3 x 3 matrix (our feature map). In this toy example, we've been able to reduce the dimensionality from 25 features down to just 9. Let's take a look at the image that results from this operation:

If you're thinking that this looks exactly like our original black diagonal line but smaller, you're right. The values the kernel takes determine what's being identified, and in this specific example, we used what's called an identity kernel. Kernels taking other values will return other properties of the image—detecting the presence of lines, edges, outlines, areas of high contrast, and more.
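To make the mechanics concrete, here's a minimal NumPy sketch of the sliding-window process we just walked through, using a stride of 1 and no padding. The pixel values are the same assumed intensities as in the earlier snippet:

```python
import numpy as np

# Same assumed image (black diagonal = 0, white = 255) and identity kernel as before
image = np.array([
    [  0, 255, 255, 255, 255],
    [255,   0, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255,   0, 255],
    [255, 255, 255, 255,   0],
])
kernel = np.array([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# With no padding, a 3 x 3 kernel over a 5 x 5 image yields a 3 x 3 feature map
out_size = image.shape[0] - kernel.shape[0] + 1  # 5 - 3 + 1 = 3
feature_map = np.zeros((out_size, out_size), dtype=int)

for i in range(out_size):        # slide the window down, one stride at a time
    for j in range(out_size):    # slide the window right, one stride at a time
        window = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(window * kernel)  # inner product

print(feature_map)
# [[  0 255 255]
#  [255   0 255]
#  [255 255   0]]
```

As expected, the feature map is a smaller version of the original diagonal: at each window position, the identity kernel simply copies the pixel under its center.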

We'll apply multiple kernels to the image, simultaneously, at each convolutional layer. The number of kernels used is up to the modeler—another hyperparameter. Ideally, you want to use as few as possible while still achieving acceptable cross-validation results. The simpler, the better! However, depending on the complexity of the task, we may see performance gains by using more. The same thinking can be applied when tuning the other hyperparameters of the model, such as the number of layers in the network or the number of neurons per layer. We're trading simplicity for complexity, and generalizability and speed for detail and precision.

While the number of kernels is our choice, the values that each kernel takes are parameters of our model, learned from our training data and optimized during training to reduce the cost function.
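In Keras, both of these ideas map directly onto the layer definition: the number of kernels (filters) and their size are hyperparameters we specify, while the weights inside each kernel are initialized randomly and learned by the optimizer. Here's a minimal sketch using tf.keras; the filter count, input shape, and activation are illustrative choices, not prescriptions:

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 8 kernels (filters), each 3 x 3, slid over the input with a stride of 1.
    # The number of filters is our hyperparameter; the weights inside each
    # filter are parameters learned during training.
    layers.Conv2D(filters=8, kernel_size=(3, 3), strides=(1, 1),
                  activation='relu', input_shape=(28, 28, 1)),
])

model.summary()  # shows the learnable parameters contributed by the kernels
```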

We've seen the step-by-step process of how to convolve a filter with our image features to create a single feature map. But what happens when we apply multiple kernels simultaneously? And how do these feature maps pass through each layer of the network? Let's have a look at the following screenshot:

Image source: Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, via Stack Exchange. Source paper available here: https://ai.stanford.edu/~ang/papers/icml09-ConvolutionalDeepBeliefNetworks.pdf

The preceding screenshot visualizes the feature maps generated at each convolutional layer of a network trained on images of faces. In the early layers of the network (at the very bottom), we detect the presence of simple visual structures—simple lines and edges. We did this with our identity kernel! The output of this first layer gets passed on to the next layer (the middle row), which combines these simple shapes into more abstract forms. We see here that combinations of edges build the components of a face—eyes, noses, ears, mouths, and eyebrows. The output of this middle layer, in turn, gets passed to a final layer, which combines these components into complete objects—in this case, different people's faces.

One particularly powerful property of this entire process is that all of these features and representations are learned from the data. At no point do we explicitly tell our model: Model, for this task, I'd like to use an identity kernel and a bottom Sobel kernel in the first convolutional layer because I think these two kernels will extract the most signal-rich feature maps. Once we've set the hyperparameter for the number of kernels we want to use, the model learns through optimization which lines, edges, shadows, and complex combinations thereof are best suited to determine what a face is or isn't. The model performs this optimization with no domain-specific, hardcoded rules about what faces, cat burritos, or clothes are.

There are many other fascinating properties of convolutional neural networks, which we won't cover in this chapter. However, we did explore the fundamentals, and hopefully you have a sense of the importance of using convolutional neural networks to extract highly expressive, signal-rich, low-dimensional features.

Next, we'll discuss Max pooling layers.
