Convolutional sliding window 

In this section, we'll address the drawbacks of the sliding window by replacing it with a convolutional sliding window, and build some intuition for how this technique works.

Before we delve into this new method, we need to modify the convolutional architecture that we've used so far.

Here is a typical CNN:

We have the input, a red, green, and blue (RGB) image with three channels; here we'll use a small 32 x 32 image. This is followed by a convolution that leaves the first two dimensions unchanged and increases the number of channels to 64, and a max pooling layer that divides the first two dimensions by 2 and leaves the number of channels unchanged, giving 16 x 16 x 64. In this example, we have just these two layers; in practical architectures, there are several. At the end, we have the fully-connected layers. Here, we have two hidden layers, one with 256 neurons and the other with 128 neurons, followed by a softmax layer that gives us the probabilities for the classes we want to predict. Here, we have five classes.
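To make these shapes concrete, here is a minimal Keras sketch of such a network; the 3 x 3 filter size and the ReLU activations are assumptions, since the text only specifies the shapes:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal sketch of the CNN described above; only the shapes
# come from the text, the filter size and activations are assumed.
model = keras.Sequential([
    layers.Input(shape=(32, 32, 3)),                               # 32 x 32 RGB input
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),  # 32 x 32 x 64
    layers.MaxPooling2D((2, 2)),                                   # 16 x 16 x 64
    layers.Flatten(),
    layers.Dense(256, activation="relu"),                          # first hidden layer
    layers.Dense(128, activation="relu"),                          # second hidden layer
    layers.Dense(5, activation="softmax"),                         # 5 class probabilities
])
model.summary()
```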

The modification we'll introduce is to convert these fully-connected layers into convolutional layers. The fully-connected layers have two properties. The first one is that every value of the input is linked to every point in the output, so each of the 16 x 16 x 64 values enters as an input to every neuron in the first hidden layer; here, that's all 256 neurons. The second property of the fully-connected layer is that before the output is given to the next layer, we apply an activation function, which is a nonlinear function that gives us the ability to learn really complex functions.

This means that, for each output neuron, all of the values leaving the second hidden layer are multiplied by the weights, summed up to give only one output, and then given to an activation function.
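As a quick illustration, here is what that computation looks like for a single neuron in plain NumPy; the names and the ReLU activation are illustrative assumptions:

```python
import numpy as np

x = np.random.rand(128)   # the 128 values leaving the second hidden layer
w = np.random.rand(128)   # one weight per input value
b = 0.1                   # bias

z = np.dot(w, x) + b      # multiply by the weights and sum to a single output
a = np.maximum(z, 0.0)    # pass the sum through an activation (ReLU assumed)
```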

Let's understand how we can achieve the same effect using convolutional layers. The only thing that changes is that instead of the first fully-connected layer, we use a 16 x 16 convolutional layer:

The filter's first two dimensions match the input's 16 x 16, so there is only one valid position for the filter, and the output is 1 x 1 x 256; the 256 matches the number of neurons. Let's check whether the first property, having every input connected to the output, is fulfilled.

If we take the convolution operation into account, each 16 x 16 filter also has a third dimension, which we usually don't show because it's always the same as the input's third dimension; thus, the filter is really 16 x 16 x 64. All 16 x 16 x 64 input values that pass through the filter are multiplied by the corresponding 16 x 16 x 64 weights, and then summed up to give only one output value. If we apply 256 of these 16 x 16 x 64 filters, we'll have 256 such output values. Now, notice how every input contributes to every output, so the first property of the fully-connected layer is achieved.
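We can verify this equivalence numerically. Here is a sketch with random weights (the variable names are illustrative):

```python
import numpy as np

x = np.random.rand(16, 16, 64)        # feature map entering the former FC layer
w = np.random.rand(256, 16, 16, 64)   # 256 filters, each 16 x 16 x 64
b = np.random.rand(256)

# The filter covers the whole input, so there is a single valid position:
# every input value is multiplied by a weight and everything is summed,
# exactly like a fully-connected neuron.
conv_out = np.tensordot(w, x, axes=([1, 2, 3], [0, 1, 2])) + b   # shape (256,)

# The same computation expressed as a fully-connected layer
# acting on the flattened input.
dense_out = w.reshape(256, -1) @ x.ravel() + b

assert np.allclose(conv_out, dense_out)
```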

Next, we apply a 1 x 1 x 128 convolution, which gives us a 1 x 1 x 128 output; the 128 channels match the 128 neurons of the second hidden layer.

As for the second property, when we use the 1 x 1 convolution we don't just multiply the weights with the input values and sum them up; we also apply an activation function to the result and use that as the output of the convolution. That nonlinearity fulfills the second condition.

Lastly, we apply a 1 x 1 x 5 convolution followed by the softmax, which enables us to predict the five classes.
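Putting the whole conversion together, a minimal sketch of the fully convolutional network might look like this; again, the 3 x 3 filter and the ReLU activations are assumptions, and the `None` spatial dimensions let us feed in larger images later:

```python
from tensorflow import keras
from tensorflow.keras import layers

fcn = keras.Sequential([
    layers.Input(shape=(None, None, 3)),                           # any image size
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),  # keeps H x W, 64 channels
    layers.MaxPooling2D((2, 2)),                                   # halves H and W
    layers.Conv2D(256, (16, 16), activation="relu"),               # replaces Dense(256)
    layers.Conv2D(128, (1, 1), activation="relu"),                 # replaces Dense(128)
    layers.Conv2D(5, (1, 1), activation="softmax"),                # replaces the softmax layer
])
```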

The question that still remains is, if the fully-connected layer and the convolution have the same mathematical effect, why did we feel the need to change this in the first place?

In order to understand that, we use a bigger image; instead of a 32 x 32 image, we use a 36 x 36 image:

In order for a 32 x 32 window to cover the 36 x 36 image, we need nine movements. The rectangle depicted in the image is 32 x 32, and if we use a stride of 2, we need three positions in each dimension, or nine movements in total, to cover all the pixels in the image.

Now, let's apply the convolutional and max pooling layers. The first convolutional layer increases the number of channels, giving 36 x 36 x 64, and the max pooling divides the first two dimensions by 2, giving 18 x 18 x 64. A 16 x 16 window needs nine movements to cover 18 x 18, so applying the same 16 x 16 convolutional layer now gives us 3 x 3 x 256: the nine 1 x 1 outputs cover the 3 x 3 matrix.

Similarly, we apply the same 1 x 1 convolutional layer again and obtain 3 x 3 x 128, again with nine movements. Lastly, we apply the 1 x 1 x 5 convolution, and the output obtained is 3 x 3 x 5. Notice that this output is equivalent to having nine movements of a 1 x 1 x 5 prediction, because the output is 3 x 3 x 5 and we need nine movements to cover the entire 36 x 36 structure using a 32 x 32 window.
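Continuing with the hypothetical `fcn` model sketched earlier, we can check this with both input sizes:

```python
import numpy as np

small = np.random.rand(1, 32, 32, 3)   # the original window size
big   = np.random.rand(1, 36, 36, 3)   # the larger image

print(fcn.predict(small).shape)   # (1, 1, 1, 5) -> one prediction
print(fcn.predict(big).shape)     # (1, 3, 3, 5) -> nine predictions in a single pass
```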

In a way, we obtain nine predictions, one for each of these windows, with only one network execution. By the end, the window will have moved across the entire image, like this:

Notice that none of the pixels are processed twice; instead, the shared computations are reused.

Compare this to the sliding window method, where we would need to execute the network separately for each position. Assume another position, as shown in the following diagram:

Basically, this part of the output structure holds the prediction for the selected pixels.

To understand this better, let's compare the sliding window with the convolutional sliding window. The picture for the sliding window looked like this:

Several window sizes move across the image to cover it, and each time a window moves, the selected piece is given to a neural network, which is asked for a prediction. This means the neural network executes all of its weights each time. When we use the convolutional sliding window, with just one execution we obtain all the predictions, as shown in the following image.
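To see the difference in cost, compare a naive sliding window loop against the single convolutional pass. This sketch reuses the hypothetical `fcn` model and the `big` image from the previous snippet, with a stride of 2; because of the zero padding in the convolution, the per-window predictions match the single-pass output only approximately at the borders:

```python
# Naive sliding window: nine separate forward passes over 32 x 32 crops.
window_preds = []
for top in range(0, 5, 2):          # row offsets 0, 2, 4
    for left in range(0, 5, 2):     # column offsets 0, 2, 4
        crop = big[:, top:top + 32, left:left + 32, :]
        window_preds.append(fcn.predict(crop))   # each is (1, 1, 1, 5)

# Convolutional sliding window: one forward pass, with the computation
# shared across all nine windows.
all_preds = fcn.predict(big)        # (1, 3, 3, 5)
```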

All this is possible because of the ability to reuse the shared pixels. This is a huge improvement, and it enables us to do real-time object detection in video. Still, even with the improved performance, we have one last problem: sometimes the sliding window will output an inaccurate bounding box. To resolve this, we'll use the YOLO algorithm.
