So, now that you hopefully have an idea of how convolutional layers work, let's talk about why we did all this crazy math. Why would we use a convolutional layer instead of the normal, fully connected layers we have been using so far?
Let's say that we did use a normal layer to get the same output shape we talked about previously. We started with a 32 x 32 x 3 image, so that's 3,072 values in total. We were left with a 30 x 30 x 64 output, which is 57,600 values in total. If we were to use a fully connected layer to connect the two, every input value would need its own weight to every output value, so that layer would have 3,072 x 57,600 = 176,947,200 trainable weights, and that's before counting biases. That's nearly 177 million.
However, the convolutional layer above uses 64 filters of size 3 x 3 x 3, which gives 64 x 27 = 1,728 learnable weights plus 64 biases (one per filter), for a total of 1,792 parameters.
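If you'd like to check these numbers yourself, here is a minimal sketch in plain Python. The helper names (`count_values`, `dense_weights`, `conv_params`) are made up for this illustration and don't come from any library, and the fully connected count leaves out biases to match the figure quoted above.

```python
# Shapes taken from the text: a 32 x 32 x 3 input and a 30 x 30 x 64 output.
input_shape = (32, 32, 3)
output_shape = (30, 30, 64)


def count_values(shape):
    """Multiply the dimensions together to get the number of values."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def dense_weights(n_in, n_out):
    """Weights in a fully connected layer: one per (input, output) pair.
    Biases are deliberately left out here (an assumption of this sketch)."""
    return n_in * n_out


def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Parameters in a convolutional layer: each filter has
    kernel_h * kernel_w * in_channels weights plus a single bias."""
    weights = kernel_h * kernel_w * in_channels * n_filters
    biases = n_filters
    return weights + biases


n_in = count_values(input_shape)     # 3,072
n_out = count_values(output_shape)   # 57,600

dense_total = dense_weights(n_in, n_out)   # 176,947,200
conv_total = conv_params(3, 3, 3, 64)      # 1,792

print(f"Fully connected weights: {dense_total:,}")
print(f"Convolutional parameters: {conv_total:,}")
```

The convolutional count does not depend on the image size at all, only on the filter size and the number of filters, because the same filters are reused at every position in the image.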
So, a convolutional layer clearly requires far fewer parameters, but why does this matter?