ConvNets summary

A CNN is essentially a stack of convolutional layers, each followed by a nonlinear activation function and, typically, a pooling layer. Each layer applies many different filters (often hundreds or thousands). The key point is that the filters are not pre-assigned; instead, they are learned during training so as to minimize a suitable loss function. It has been observed that lower layers learn to detect basic features such as edges, while higher layers detect progressively more sophisticated features such as shapes or faces. Note that, thanks to pooling, individual neurons in later layers see more of the original image, and can therefore compose the basic features learned in earlier layers.
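The convolution–activation–pooling pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real CNN layer: the filter here is hand-crafted (a vertical-edge detector) rather than learned, and the function names (`conv2d_valid`, `max_pool2x2`) are purely illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a single filter over a 2-D image ('valid' padding) and
    return the map of filter responses at each spatial position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Nonlinear activation applied element-wise to the response map."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling; halves each spatial dimension."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.random.rand(8, 8)
# A hand-crafted vertical-edge filter; in a real CNN these weights are learned.
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
fmap = max_pool2x2(relu(conv2d_valid(image, kernel)))
print(fmap.shape)  # (3, 3): 8x8 -> 6x6 after 3x3 conv -> 3x3 after 2x2 pooling
```

Pooling is also what widens the effective receptive field: each value in the final 3×3 map summarizes a 4×4 region of filter responses in the original image.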

So far we have described the basic concepts of ConvNets. CNNs apply convolution and pooling operations in one dimension for audio and text data (along the time axis), in two dimensions for images (along the height × width axes), and in three dimensions for videos (along the height × width × time axes). For images, sliding a filter over the input volume produces a map of the filter's responses at each spatial position.
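The same sliding-window operation generalizes directly across dimensionalities. A minimal NumPy sketch (using `sliding_window_view`, available in NumPy 1.20+; `convolve_valid` is an illustrative name) shows one function handling 1-D, 2-D, and 3-D inputs, with the output shrinking by `kernel size - 1` along each convolved axis under "valid" padding:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve_valid(signal, kernel):
    """N-dimensional 'valid' cross-correlation: the same code works
    for 1-D (audio/text), 2-D (images) and 3-D (video) inputs."""
    windows = sliding_window_view(signal, kernel.shape)
    # Sum over the kernel axes (the last kernel.ndim axes of the view).
    axes = tuple(range(-kernel.ndim, 0))
    return (windows * kernel).sum(axis=axes)

audio = np.random.rand(100)          # (time,)
image = np.random.rand(28, 28)       # (height, width)
video = np.random.rand(16, 28, 28)   # (time, height, width)

print(convolve_valid(audio, np.ones(5)).shape)          # (96,)
print(convolve_valid(image, np.ones((3, 3))).shape)     # (26, 26)
print(convolve_valid(video, np.ones((3, 3, 3))).shape)  # (14, 26, 26)
```

In a deep learning framework the same distinction appears as separate 1-D, 2-D, and 3-D convolutional layer types, each sliding its filters along the corresponding axes.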

In other words, a ConvNet has multiple filters stacked together that learn to recognize specific visual features independently of their location in the image. Those features are simple in the initial layers of the network and become progressively more sophisticated in the deeper layers.