Substituting big convolutions

Before we jump in, we will first learn about techniques that can reduce the number of parameters a model uses. This is important, firstly, because a model with fewer parameters should generalize better, as it needs less training data to make good use of the parameters it has. Secondly, fewer parameters mean better hardware efficiency, as less memory is needed to store them.

Here, we will start by explaining one important technique for reducing model parameters: cascading several small convolutions together. In the diagram that follows, we have two 3x3 convolution layers. If we look at the second layer, on the right of the diagram, and work backward, we can see that one neuron in the second layer has a 3x3 receptive field:


When we say "receptive field," we mean the area of the previous layer that a neuron can see. In this example, a 3x3 area is needed to create one output, hence a 3x3 receptive field.

Working back another layer, each element of that 3x3 area also has a 3x3 receptive field on the input. So, if we combine the receptive fields of all nine elements, we can see that the total receptive field created on the input is of size 5x5.

So, in simpler words, cascading smaller convolutions together can produce the same receptive field as using a single bigger one. That means we can replace big convolutions with a cascade of small ones.
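If it helps to make this concrete, here is a minimal sketch in plain Python (the function name is just illustrative) of the receptive field arithmetic for a stack of stride-1 convolutions: each k x k layer grows the receptive field on the input by k - 1.

    def cascaded_receptive_field(kernel_sizes):
        """Receptive field on the input of a stack of stride-1 convolutions."""
        rf = 1
        for k in kernel_sizes:
            rf += k - 1  # each stride-1 layer adds (k - 1) to the receptive field
        return rf

    print(cascaded_receptive_field([3, 3]))     # 5 -> same as one 5x5 convolution
    print(cascaded_receptive_field([3, 3, 3]))  # 7 -> same as one 7x7 convolution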

Note that this substitution cannot be done on the very first convolution layer acting on the input image, due to the depth mismatch between the first convolution layer and the input image depth (the depths of the outputs need to be consistent). Also observe in the diagram how we calculate the number of parameters per layer.

In the preceding diagram, we substitute one 7x7 convolution with three 3x3 convolutions. Let's calculate this for ourselves to see that fewer parameters are used.

Imagine a 7x7 convolution, with C filters, being used on an input volume of shape WxHxC. We can calculate the number of weights in its filters as follows:

7 x 7 x C x C = 49C²

Now, instead, if we cascade three 3x3 convolutions (substituting the 7x7 convolution), we can calculate the number of weights as follows:

3 x (3 x 3 x C x C) = 27C²

Here, we can see that this uses fewer parameters than before!
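If you want to verify the arithmetic, a couple of lines of plain Python reproduce both counts (C = 64 is just an illustrative channel count; any value gives the same 49 versus 27 ratio):

    C = 64  # illustrative channel count; any value shows the same ratio

    params_7x7 = 7 * 7 * C * C                # one 7x7 convolution: 49 * C^2 weights
    params_3x3_cascade = 3 * (3 * 3 * C * C)  # three 3x3 convolutions: 27 * C^2 weights

    print(params_7x7)          # 200704
    print(params_3x3_cascade)  # 110592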

Also observe that between each of those three convolution layers, we place ReLU activations. Doing this introduces more non-linearities into the model than we would have had using just a single large convolution layer. This added depth (and these non-linearities) is a good thing, as it means the network can compose more concepts together and increases its capacity for learning!
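As a rough sketch of what this looks like in code, assuming TensorFlow/Keras is available (the filter count of 64 is just an illustrative value), the single large convolution and the cascade that replaces it could be written as:

    import tensorflow as tf

    # The single large convolution we want to replace (64 filters is illustrative).
    big_conv = tf.keras.layers.Conv2D(filters=64, kernel_size=7, padding='same')

    # The replacement cascade: three 3x3 convolutions with a ReLU after each one,
    # giving the same 7x7 receptive field with fewer weights and more non-linearities.
    small_conv_cascade = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
        tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    ])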

The trend in most new successful models is to replace all large filters with many smaller convolutions (usually 3x3 in size) cascaded together. As explained earlier, we get two huge benefits from doing this: not only does it reduce the number of parameters, it also increases the depth and the number of non-linearities in the network, which is good for increasing its learning capacity.
