Convolution on RGB images

Let's see how convolution is done with color images, and how we can obtain multi-dimensional output matrices.

As we saw previously, a color image is represented as a three-dimensional matrix of numbers:

The third dimension is usually called the channels. In this case, we have three channels: red, green, and blue. With grayscale images, convolution meant convolving one two-dimensional matrix with one filter. Since we now have three two-dimensional matrices, one reasonable thing to do is to convolve with three filters:

Each of these filters will be convolved with one of the channels.

So far, we've seen 3 x 3 filters, but the two filter dimensions can take other values as well.

This kind of operation will now produce three outputs:

Let's look in a bit more detail at what's happened so far.

Let's take the following example:

We have three 6 x 6 images that represent the channels, and they will be convolved with three 3 x 3 filters, and, in the end, we'll have three 4 x 4 outputs. Notice now how these three 4 x 4 matrices don't represent the same pixel values as in the input.
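The step above can be sketched in NumPy. This is a minimal sketch under assumptions that are not in the text: stride 1, no padding, random pixel values, and the same vertical edge filter reused for all three channels. The helper name conv2d_valid is made up for illustration and, as in most deep learning libraries, the filter is not flipped.

```python
import numpy as np

def conv2d_valid(channel, kernel):
    """Slide `kernel` over one 2-D `channel` (stride 1, no padding, no flip)."""
    kh, kw = kernel.shape
    out_h = channel.shape[0] - kh + 1
    out_w = channel.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the filter, then sum.
            out[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 6 x 6 RGB image stored as height x width x channels.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6, 3)).astype(float)

# One 3 x 3 filter per channel (assumption: the same vertical edge filter three times).
vertical = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
filters = [vertical, vertical, vertical]

outputs = [conv2d_valid(image[:, :, c], filters[c]) for c in range(3)]
print([o.shape for o in outputs])  # [(4, 4), (4, 4), (4, 4)]
```

Each output side is 6 - 3 + 1 = 4, which is where the three 4 x 4 matrices come from.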

So the real question is: does it make sense to keep them separate? These matrices now describe detected edges rather than colors, so they are no longer the original channels. Indeed, many experiments have shown that keeping them separate adds no value and only wastes resources. Hence, one reasonable thing to do here is to add them together:

In the end, we'll have just one 4 x 4 matrix, in which each cell is the sum of the corresponding cells of the three outputs. The values shown are uniform just to keep things simple; in reality, they will vary a lot.
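To make the summation concrete, here is a small sketch with uniform toy values. The choice of all ones for both the pixels and the filter weights is an assumption made only so the arithmetic is easy to follow, and conv2d_valid is an illustrative helper, not a library function.

```python
import numpy as np

def conv2d_valid(channel, kernel):
    """2-D convolution with stride 1 and no padding (kernel is not flipped)."""
    kh, kw = kernel.shape
    out_h = channel.shape[0] - kh + 1
    out_w = channel.shape[1] - kw + 1
    return np.array([[np.sum(channel[i:i + kh, j:j + kw] * kernel)
                      for j in range(out_w)]
                     for i in range(out_h)])

# Uniform toy values: every pixel and every filter weight is 1.
image = np.ones((6, 6, 3))
filters = np.ones((3, 3, 3))  # filters[:, :, c] is the 3 x 3 filter for channel c

per_channel = [conv2d_valid(image[:, :, c], filters[:, :, c]) for c in range(3)]
summed = per_channel[0] + per_channel[1] + per_channel[2]

print(per_channel[0][0, 0])  # 9.0: nine products of 1 * 1
print(summed.shape)          # (4, 4)
print(summed[0, 0])          # 27.0: 9 from each of the three channels
```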

To summarize once again:

We had a 6 x 6 input image with three channels, which is convolved with three 3 x 3 filters. Since the convolution products are summed in the end, we get just one two-dimensional 4 x 4 matrix. So, a convolution with one such set of filters always reduces the number of channels to 1, regardless of the input; in this case, from 3 to 1. Notice also how each side of the two-dimensional matrix shrinks from 6 to 4. In convolution architectures, it's usually fine to reduce these two dimensions, but we do need a large number of channels, because more channels let us detect, or capture, more features.
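The summary can be checked with a short sketch that treats the per-channel filters as one three-dimensional filter. The function name conv_volume, the random values, and the stride 1 / no padding setting are assumptions for illustration only.

```python
import numpy as np

def conv_volume(volume, filt):
    """Convolve an H x W x C volume with one kh x kw x C filter (stride 1, no padding).

    Multiplying the whole 3-D window by the filter and summing is the same as
    convolving each channel separately and adding the results.
    """
    kh, kw, depth = filt.shape
    assert volume.shape[2] == depth, "filter depth must match input depth"
    out_h = volume.shape[0] - kh + 1
    out_w = volume.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(volume[i:i + kh, j:j + kw, :] * filt)
    return out

rng = np.random.default_rng(0)
# Try inputs with 1, 3 and 5 channels: the result is always one 4 x 4 matrix.
shapes = {
    c: conv_volume(rng.normal(size=(6, 6, c)), rng.normal(size=(3, 3, c))).shape
    for c in (1, 3, 5)
}
print(shapes)  # {1: (4, 4), 3: (4, 4), 5: (4, 4)}
```

Whatever the number of input channels, one filter of matching depth yields a single two-dimensional output.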

For example, we may want not one channel but five:

A simple way to do this is to use five 3 x 3 x 3 filters:

For example, suppose the filter in the image is the horizontal filter: inside it, we have three horizontal filters, one convolving with each of the three channels. Since we sum at the end, this gives just one 4 x 4 two-dimensional matrix. If the next one is a vertical filter, convolving and summing again gives a 4 x 4 two-dimensional matrix, and for a Sobel or a Scharr filter the output would look much like it did in our previous experiments. Each of these filters gives a different two-dimensional matrix, and stacking the five results gives a 4 x 4 x 5 output.
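A sketch of the five-filter case follows. The specific kernel values, the choice of a transposed Sobel as the fifth filter, the random image, and the name conv_layer are all assumptions for illustration; the text only fixes the sizes.

```python
import numpy as np

def conv_layer(volume, filters):
    """Apply each kh x kw x C filter to an H x W x C volume and stack the 2-D results."""
    n, kh, kw, depth = filters.shape
    out_h = volume.shape[0] - kh + 1
    out_w = volume.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, n))
    for k in range(n):
        for i in range(out_h):
            for j in range(out_w):
                out[i, j, k] = np.sum(volume[i:i + kh, j:j + kw, :] * filters[k])
    return out

horizontal = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float)
vertical = horizontal.T
sobel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
scharr = np.array([[3, 0, -3], [10, 0, -10], [3, 0, -3]], dtype=float)
kernels_2d = [horizontal, vertical, sobel, scharr, sobel.T]

# Repeat each 2-D kernel across the three channels to get a 3 x 3 x 3 filter.
filters = np.stack([np.repeat(k[:, :, None], 3, axis=2) for k in kernels_2d])

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(6, 6, 3)).astype(float)
features = conv_layer(image, filters)
print(filters.shape, features.shape)  # (5, 3, 3, 3) (4, 4, 5)
```

Each slice features[:, :, k] is the 4 x 4 matrix produced by filter k, so five filters give five output channels.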

We can even go one step further and convolve the resulting 4 x 4 x 5 output with a 3 x 3 x 5 filter:

Since the five 3 x 3 slices of the filter are convolved separately with the five input channels, and the results are summed at the end, we get a 2 x 2 x 1 output, that is, a single two-dimensional matrix:

If we want to increase the number of filters to three, we just use three 3 x 3 x 5 filters, which gives a 2 x 2 x 3 output:

Notice how the third dimension of the filters is always equal to the third dimension of the input. It has to be, because we need five two-dimensional filters to handle the five input channels. In the end, of course, they produce just one output channel because of the final sum, but we still need five of them to handle the five channels.
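The last two steps and the depth rule can be sketched together. This assumes the input is a 4 x 4 x 5 volume (which is what the 2 x 2 output size implies), stride 1, no padding, and random values; conv_layer is an illustrative helper, not a library function.

```python
import numpy as np

def conv_layer(volume, filters):
    """volume: H x W x C; filters: N x kh x kw x C; returns (H-kh+1) x (W-kw+1) x N."""
    n, kh, kw, depth = filters.shape
    if volume.shape[2] != depth:
        raise ValueError("filter depth must equal the number of input channels")
    out_h = volume.shape[0] - kh + 1
    out_w = volume.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i:i + kh, j:j + kw, :]
            # Sum over height, width and depth for every filter at once.
            out[i, j, :] = np.tensordot(filters, window, axes=3)
    return out

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 5))          # assumed 4 x 4 x 5 input

one_filter = rng.normal(size=(1, 3, 3, 5))     # a single 3 x 3 x 5 filter
three_filters = rng.normal(size=(3, 3, 3, 5))  # three 3 x 3 x 5 filters

out_one = conv_layer(features, one_filter)
out_three = conv_layer(features, three_filters)
print(out_one.shape)    # (2, 2, 1)
print(out_three.shape)  # (2, 2, 3)
```

Passing a filter whose depth differs from the input, for example a 3 x 3 x 3 filter with this five-channel input, raises the ValueError, which is the depth rule stated above in code form.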

We'll now see how to control the output matrix dimensions by introducing a number of parameters and techniques.
