© Umberto Michelucci 2022
U. Michelucci, Applied Deep Learning with TensorFlow 2, https://doi.org/10.1007/978-1-4842-8020-1_7

7. Convolutional Neural Networks

Umberto Michelucci
Dübendorf, Switzerland

In the previous chapters, we looked at fully connected networks and all the problems you encounter while training them. The network architecture we used, one where each neuron in a layer is connected to all the neurons in the previous and next layers, is not good at many fundamental tasks like image recognition, speech recognition, time series prediction, and many more. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are the most advanced architectures used today. This chapter looks at convolution and pooling, the basic building blocks of CNNs. We also discuss a complete, although basic, implementation of CNNs in Keras. RNNs are discussed, although briefly, in the next chapter.

Kernels and Filters

One of the main components of CNNs are filters, which are square matrices of dimensions nK × nK, where nK is usually a small number, like 3 or 5. Filters are sometimes also called kernels. Let's define four different filters and check their effect later in the chapter, when we use them in convolution operations. For those examples, we will work with 3 × 3 filters. For the moment, just take the following definitions as a reference; you will see how to use them shortly.
  • The following kernel will allow the detection of horizontal edges
    $$ \mathfrak{I}_H=\left(\begin{array}{ccc}1 & 1 & 1\\ 0 & 0 & 0\\ -1 & -1 & -1\end{array}\right) $$
  • The following kernel will allow the detection of vertical edges
    $$ \mathfrak{I}_V=\left(\begin{array}{ccc}1 & 0 & -1\\ 1 & 0 & -1\\ 1 & 0 & -1\end{array}\right) $$
  • The following kernel will allow the detection of edges when luminosity changes drastically
    $$ \mathfrak{I}_L=\left(\begin{array}{ccc}-1 & -1 & -1\\ -1 & 8 & -1\\ -1 & -1 & -1\end{array}\right) $$
  • The following kernel will blur edges in an image
    $$ \mathfrak{I}_B=-\frac{1}{9}\left(\begin{array}{ccc}1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1\end{array}\right) $$

In the next sections, we will apply convolution to a test image with the filters and you will see the effect.

Convolution

The first step to understanding CNNs is to understand convolution. The easiest way to understand it is to see it in action in a few simple cases. First, in the context of neural networks, convolution is done between tensors. The operation takes two tensors as input and produces a tensor as output. The operation is usually indicated with the operator ∗. Let's see how it works. Let's take two tensors, both with dimensions 3 × 3. The convolution operation is done by applying the following formula
$$ \left(\begin{array}{ccc}a_1 & a_2 & a_3\\ a_4 & a_5 & a_6\\ a_7 & a_8 & a_9\end{array}\right)\ast \left(\begin{array}{ccc}k_1 & k_2 & k_3\\ k_4 & k_5 & k_6\\ k_7 & k_8 & k_9\end{array}\right)=\sum_{i=1}^{9} a_i k_i $$
In this case, the result is simply the sum of each element ai multiplied by the respective element ki. In a more typical matrix formalism, this formula could be written with a double sum, as
$$ \left(\begin{array}{ccc}a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23}\\ a_{31} & a_{32} & a_{33}\end{array}\right)\ast \left(\begin{array}{ccc}k_{11} & k_{12} & k_{13}\\ k_{21} & k_{22} & k_{23}\\ k_{31} & k_{32} & k_{33}\end{array}\right)=\sum_{i=1}^{3}\sum_{j=1}^{3} a_{ij} k_{ij} $$

The first version has the advantage of making the fundamental idea very clear: each element of one tensor is multiplied by the corresponding element (the element in the same position) of the second tensor, and then all the products are summed to get the result.
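To make this concrete, here is a minimal NumPy check of the element-wise multiply-and-sum; the two matrices are just example values chosen for illustration, not taken from the text:
import numpy as np
# Two example 3 x 3 tensors (arbitrary values chosen for illustration)
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])
# Convolution of two tensors of equal size: element-wise product, then sum
print(np.sum(A * K))  # (1+4+7) - (3+6+9) = -6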

In the previous section we talked about kernels, and the reason is that convolution is usually done between a tensor, which we indicate here with A, and a kernel. Typically, kernels are small, 3 × 3 or 5 × 5, while the input tensors A are normally bigger. In image recognition for example, the input tensors A are the images that may have dimensions as high as 1024 × 1024 × 3, where 1024 × 1024 is the resolution and the last dimension (3) is the number of the color channels, the RGB values. In advanced applications the images may even have higher resolutions. How do we apply convolution when we have matrices with different dimensions? To understand this, let’s consider a matrix A that is 4 × 4
$$ A=\left(\begin{array}{cccc}a_1 & a_2 & a_3 & a_4\\ a_5 & a_6 & a_7 & a_8\\ a_9 & a_{10} & a_{11} & a_{12}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
Let’s see how to do convolution with a kernel K that we will take for this example to be 3 × 3
$$ K=\left(\begin{array}{ccc}k_1 & k_2 & k_3\\ k_4 & k_5 & k_6\\ k_7 & k_8 & k_9\end{array}\right) $$
The idea is to start on the top-left corner of the matrix A and select a 3 × 3 region. In the example that would be
$$ A_1=\left(\begin{array}{ccc}a_1 & a_2 & a_3\\ a_5 & a_6 & a_7\\ a_9 & a_{10} & a_{11}\end{array}\right) $$
Or the elements marked in bold here
$$ A=\left(\begin{array}{cccc}\boldsymbol{a_1} & \boldsymbol{a_2} & \boldsymbol{a_3} & a_4\\ \boldsymbol{a_5} & \boldsymbol{a_6} & \boldsymbol{a_7} & a_8\\ \boldsymbol{a_9} & \boldsymbol{a_{10}} & \boldsymbol{a_{11}} & a_{12}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
Then we perform the convolution between this smaller matrix A1 and K getting (we will indicate the result with B1)
$$ B_1=A_1\ast K=a_1 k_1+a_2 k_2+a_3 k_3+a_5 k_4+a_6 k_5+a_7 k_6+a_9 k_7+a_{10} k_8+a_{11} k_9 $$
Then we need to shift the selected 3 × 3 region in matrix A one column to the right and select the elements marked in bold
$$ A=\left(\begin{array}{cccc}a_1 & \boldsymbol{a_2} & \boldsymbol{a_3} & \boldsymbol{a_4}\\ a_5 & \boldsymbol{a_6} & \boldsymbol{a_7} & \boldsymbol{a_8}\\ a_9 & \boldsymbol{a_{10}} & \boldsymbol{a_{11}} & \boldsymbol{a_{12}}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
That will give us the second sub-matrix A2
$$ A_2=\left(\begin{array}{ccc}a_2 & a_3 & a_4\\ a_6 & a_7 & a_8\\ a_{10} & a_{11} & a_{12}\end{array}\right) $$
We perform the convolution between this smaller matrix A2 and K
$$ B_2=A_2\ast K=a_2 k_1+a_3 k_2+a_4 k_3+a_6 k_4+a_7 k_5+a_8 k_6+a_{10} k_7+a_{11} k_8+a_{12} k_9 $$
Now we cannot shift our 3 × 3 region anymore to the right, since we have reached the end of the matrix A. So what we do is shift it one row down and start again from the left side. The next selected region would be
$$ A_3=\left(\begin{array}{ccc}a_5 & a_6 & a_7\\ a_9 & a_{10} & a_{11}\\ a_{13} & a_{14} & a_{15}\end{array}\right) $$
Again, we perform convolution of A3 with K
$$ B_3=A_3\ast K=a_5 k_1+a_6 k_2+a_7 k_3+a_9 k_4+a_{10} k_5+a_{11} k_6+a_{13} k_7+a_{14} k_8+a_{15} k_9 $$
As you might have guessed at this point, the last step is to shift the 3 × 3 selected region to the right one column and perform convolution. Our selected region will now be
$$ A_4=\left(\begin{array}{ccc}a_6 & a_7 & a_8\\ a_{10} & a_{11} & a_{12}\\ a_{14} & a_{15} & a_{16}\end{array}\right) $$
And the convolution will give the result
$$ B_4=A_4\ast K=a_6 k_1+a_7 k_2+a_8 k_3+a_{10} k_4+a_{11} k_5+a_{12} k_6+a_{14} k_7+a_{15} k_8+a_{16} k_9 $$
Now we cannot shift our 3 × 3 region anymore, neither right nor down. We have calculated four values: B1, B2, B3, and B4. Those elements will form the resulting tensor of the convolution operation, giving us the tensor B
$$ B=\left(\begin{array}{cc}B_1 & B_2\\ B_3 & B_4\end{array}\right) $$

The same process can be applied when the tensor A is bigger. You will simply get a bigger resulting tensor B, but the algorithm for getting the elements Bi is the same. Before moving on, there is still a small detail to discuss: the concept of stride. In the process above, we moved our 3 × 3 region one column to the right and one row down. The number of rows and columns by which we shift the region, in this example 1, is called the stride and is often indicated with s. A stride of s = 2 simply means that we shift our 3 × 3 region two columns to the right and two rows down at each step.

Something else that we need to discuss is the size of the selected region in the input matrix A. The dimensions of the selected region that we shifted around in the process must be the same as the kernel used. If you use a 5 × 5 kernel, you need to select a 5 × 5 region in A. In general, given a nK × nK kernel, you will select a nK × nK region in A.

In a more formal definition, convolution with stride s in the neural network context is a process that takes a tensor A of dimensions nA × nA and a kernel K of dimensions nK × nK and gives as output a matrix B of dimensions nB × nB with
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor $$
where we have indicated with ⌊x⌋ the integer part of x (in the programming world, this is often called the floor of x). A proof of this formula would take too long to discuss here, but it is easy to see why it is true (try to derive it). To make things a bit easier, we will suppose that nK is odd. You will see soon why this is important (although not fundamental). We start by explaining the case with stride s = 1. The algorithm generates a new tensor B from an input tensor A and a kernel K according to this formula
$$ B_{ij}={\left(A\ast K\right)}_{ij}=\sum_{f=0}^{n_K-1}\ \sum_{h=0}^{n_K-1} A_{i+f,\,j+h}\, K_{1+f,\,1+h} $$

The formula is cryptic and difficult to grasp at first, so let's look at some more examples. Figure 7-1 shows how convolution works. Suppose you have a 3 × 3 filter. In the figure, you can see that the top-left nine elements of matrix A, marked by a square drawn with a black continuous line, are the ones used to generate the first element of matrix B, B1, according to the formula. The elements marked by the square drawn with a dotted line are the ones used to generate the second element, B2, and so on.

To reiterate, the basic idea is that each element of the 3 × 3 square from matrix A is multiplied by the corresponding element of the kernel K and all the products are summed. The sum is then an element of the new matrix B. After calculating the value for B1, you shift the region you are considering in the original matrix by one column to the right (the square indicated in Figure 7-1 with a dotted line) and repeat the operation. You continue to shift your region to the right until you reach the border, then you move one row down and start again from the left. You continue in this fashion until you reach the lower-right corner of the matrix. The same kernel is used for all the regions in the original matrix.
Figure 7-1

A visual explanation of convolution

Given the kernel $$ \mathfrak{I}_H $$, for example, you can see in Figure 7-2 which element of A gets multiplied by which element in $$ \mathfrak{I}_H $$, and the result for the element B11, which is nothing more than the sum of all the products
$$ B_{11}=1\times 1+2\times 1+3\times 1+1\times 0+2\times 0+3\times 0+4\times \left(-1\right)+3\times \left(-1\right)+2\times \left(-1\right)=-3 $$
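If you want to check this value numerically, a minimal sketch (the 3 × 3 region of A is the one shown in Figure 7-2, read off from the calculation above):
import numpy as np
# 3 x 3 region of A as shown in Figure 7-2
region = np.array([[1, 2, 3],
                   [1, 2, 3],
                   [4, 3, 2]])
# Horizontal edge kernel defined at the beginning of the chapter
I_H = np.array([[1, 1, 1],
                [0, 0, 0],
                [-1, -1, -1]])
print(np.sum(region * I_H))  # prints -3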
Figure 7-2

A visualization of convolution with the kernel $$ \mathfrak{I}_H $$

Figure 7-3 shows an example of convolution with stride s = 2.
Figure 7-3

A visual explanation of convolution with stride s = 2

The reason that the dimension of the output matrix takes only the floor (the integer part) of $$ \frac{n_A-n_K}{s}+1 $$ can be seen in Figure 7-4. If s > 1, what can happen, depending on the dimensions of A, is that at a certain point you cannot shift your window over matrix A anymore (the black square you can see in Figure 7-3, for example), and you cannot cover the matrix A completely. Figure 7-4 shows that you would need an additional column on the right of matrix A (marked by Xs) to be able to perform the convolution operation there. In Figure 7-4, we chose s = 3, and since we have nA = 5 and nK = 3, B will be a scalar:
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor =\left\lfloor \frac{5-3}{3}+1\right\rfloor =\left\lfloor \frac{5}{3}\right\rfloor =1 $$
Figure 7-4

A visual explanation of why the floor function is needed when evaluating the dimensions of the resulting matrix B

You can see from Figure 7-4 how, with a 3 × 3 region, you can only cover the top-left region of A, since with stride s = 3 you would end up outside A. Therefore, you can consider one region only for the convolution operation, thus ending up with a scalar for the resulting tensor B.
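For a quick sanity check of the output-size formula, here is a small helper; the function name conv_output_size is just for illustration:
import math

def conv_output_size(n_A, n_K, s):
    # Output dimension of a convolution (or pooling) with stride s and no padding
    return math.floor((n_A - n_K) / s + 1)

print(conv_output_size(5, 3, 3))  # 1, as in the example of Figure 7-4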

Let's now look at a few additional examples to make this formula even clearer. We start with a small 3 × 3 matrix
$$ A=\left(\begin{array}{ccc}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{array}\right) $$
and consider this kernel
$$ K=\left(\begin{array}{ccc}k_1 & k_2 & k_3\\ k_4 & k_5 & k_6\\ k_7 & k_8 & k_9\end{array}\right) $$
with stride s = 1. The convolution is given by
$$ B=A\ast K=1\cdot k_1+2\cdot k_2+3\cdot k_3+4\cdot k_4+5\cdot k_5+6\cdot k_6+7\cdot k_7+8\cdot k_8+9\cdot k_9 $$
and the result B will be a scalar, since nA = 3, nK = 3, therefore
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor =\left\lfloor \frac{3-3}{1}+1\right\rfloor =1 $$
If you now consider a matrix A with dimensions 4 × 4, that is, nA = 4, with nK = 3 and s = 1, you will get as output a matrix B with dimensions 2 × 2, since
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor =\left\lfloor \frac{4-3}{1}+1\right\rfloor =2 $$
For example, you can verify that given
$$ A=\left(\begin{array}{cccc}1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12\\ 13 & 14 & 15 & 16\end{array}\right) $$
and
$$ K=\left(\begin{array}{ccc}1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{array}\right) $$
you have with stride s = 1
$$ B=A\ast K=\left(\begin{array}{cc}348 & 393\\ 528 & 573\end{array}\right) $$
Let's verify one of the elements, B11, with the formula we saw before. We have
$$ B_{11}=1\cdot 1+2\cdot 2+3\cdot 3+5\cdot 4+6\cdot 5+7\cdot 6+9\cdot 7+10\cdot 8+11\cdot 9=348 $$

Note that the formula for the convolution works only for stride s = 1, but can be easily generalized for other values of s.

This calculation is very easy to implement in Python. The following function evaluates the convolution of two matrices for s = 1 (you could do it with existing Python functions, but it is instructive to see how to do it from scratch)
import numpy as np
def conv_2d(A, kernel):
    # Convolution (strictly speaking, cross-correlation) with stride s = 1.
    # The kernel is assumed to be square, with an odd dimension n_K.
    n_K = kernel.shape[0]
    half = n_K // 2
    output = np.zeros([A.shape[0]-(n_K-1), A.shape[1]-(n_K-1)])
    for row in range(half, A.shape[0]-half):
        for column in range(half, A.shape[1]-half):
            # Element-wise product of the selected region and the kernel, summed
            region = A[row-half:row+half+1, column-half:column+half+1]
            output[row-half, column-half] = np.tensordot(region, kernel)
    return output
Note that the input matrix A does not even need to be square, but it is assumed that the kernel is, and that its dimension nK is odd. The previous example can be evaluated with the following code
A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
K = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(conv_2d(A,K))
This gives the result
[[ 348. 393.]
[ 528. 573.]]
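As noted above, the procedure can easily be generalized to other values of the stride s. A minimal sketch of such a strided version (the name conv_2d_stride is just for illustration) could look like this:
def conv_2d_stride(A, kernel, s=1):
    # Generalization of conv_2d to an arbitrary stride s (no padding)
    n_K = kernel.shape[0]
    n_rows = (A.shape[0] - n_K) // s + 1
    n_cols = (A.shape[1] - n_K) // s + 1
    output = np.zeros([n_rows, n_cols])
    for i in range(n_rows):
        for j in range(n_cols):
            region = A[i*s:i*s + n_K, j*s:j*s + n_K]
            output[i, j] = np.tensordot(region, kernel)
    return output

print(conv_2d_stride(A, K, s=1))  # with s = 1 it reproduces the result above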

Examples of Convolution

We’ll now apply the kernels we defined to a test image and see the results. As a test image let’s create a chessboard that’s 160 × 160 pixels with this code
chessboard = np.zeros([8*20, 8*20])
for row in range(0, 8):
    for column in range (0, 8):
        if ((column+8*row) % 2 == 1) and (row % 2 == 0):
            chessboard[row*20:row*20+20, column*20:column*20+20] = 1
        elif ((column+8*row) % 2 == 0) and (row % 2 == 1):
            chessboard[row*20:row*20+20, column*20:column*20+20] = 1
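To display the chessboard (and, later, the convolution outputs), you can use matplotlib; a minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt
plt.imshow(chessboard, cmap='gray')
plt.axis('off')
plt.show()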
Figure 7-5 shows the chessboard.
Figure 7-5

The chessboard image generated with code

Now let’s try to apply convolution to this image with the different kernels and with stride s = 1.

Using the kernel $$ \mathfrak{I}_H $$ will detect the horizontal edges. This can be applied with the code
edgeh = np.matrix('1 1 1; 0 0 0; -1 -1 -1')
outputh = conv_2d (chessboard, edgeh)
Figure 7-6 shows the output.
Figure 7-6

The result of performing a convolution between the kernel $$ \mathfrak{I}_H $$ and the chessboard image

Now you can understand why this kernel detects horizontal edges. Additionally, this kernel detects when you go from light to dark or vice versa. Note that this image is only 158 × 158 pixels, as expected, since
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor =\left\lfloor \frac{160-3}{1}+1\right\rfloor =\left\lfloor \frac{157}{1}+1\right\rfloor =\left\lfloor 158\right\rfloor =158 $$
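You can verify this directly by checking the shape of the output array, assuming you are following along with the code above:
print(outputh.shape)  # (158, 158)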
Now let's apply $$ \mathfrak{I}_V $$ with the code
edgev = np.matrix('1 0 -1; 1 0 -1; 1 0 -1')
outputv = conv_2d (chessboard, edgev)
This gives the result shown in Figure 7-7.
Figure 7-7

The result of performing a convolution between the kernel $$ \mathfrak{I}_V $$ and the chessboard image

Now we can use the kernel $$ \mathfrak{I}_L $$
edgel = np.matrix ('-1 -1 -1; -1 8 -1; -1 -1 -1')
outputl = conv_2d (chessboard, edgel)
This gives the result shown in Figure 7-8.
Figure 7-8

The result of performing a convolution between the kernel $$ \mathfrak{I}_L $$ and the chessboard image

And finally, we can apply the blurring kernel $$ \mathfrak{I}_B $$
edge_blur = -1.0/9.0*np.matrix('1 1 1; 1 1 1; 1 1 1')
output_blur = conv_2d (chessboard, edge_blur)
Figure 7-9 shows two plots: on the left the blurred image and on the right the original one. The images show only a small region of the original chessboard to make the blurring clearer.
Figure 7-9

The effect of the blurring kernel $$ \mathfrak{I}_B $$. On the left is the blurred image and on the right is the original one

To finish this section, let’s try to understand how the edges can be detected. Consider the following matrix with a sharp vertical transition, since the left part is full of 10s and the right part full of 0s.
ex_mat = np.matrix('10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0; 10 10 10 10 0 0 0 0')
It looks like this
matrix([[10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0],
        [10, 10, 10, 10,  0,  0,  0,  0]])
Let's consider the kernel $$ \mathfrak{I}_V $$. We can perform the convolution with the code
ex_out = conv_2d (ex_mat, edgev)
The result is
array([[ 0.,  0., 30., 30.,  0.,  0.],
       [ 0.,  0., 30., 30.,  0.,  0.],
       [ 0.,  0., 30., 30.,  0.,  0.],
       [ 0.,  0., 30., 30.,  0.,  0.],
       [ 0.,  0., 30., 30.,  0.,  0.],
       [ 0.,  0., 30., 30.,  0.,  0.]])
Figure 7-10 shows the original matrix (on the left) and the output of the convolution on the right. The convolution with the kernel $$ \mathfrak{I}_V $$ has clearly detected the sharp transition in the original matrix, marking with a vertical black line where the transition from black to white happens. For example, consider B11 = 0
$$ \begin{array}{l}B_{11}=\left(\begin{array}{ccc}10 & 10 & 10\\ 10 & 10 & 10\\ 10 & 10 & 10\end{array}\right)\ast \mathfrak{I}_V=\left(\begin{array}{ccc}10 & 10 & 10\\ 10 & 10 & 10\\ 10 & 10 & 10\end{array}\right)\ast \left(\begin{array}{ccc}1 & 0 & -1\\ 1 & 0 & -1\\ 1 & 0 & -1\end{array}\right)\\ =10\times 1+10\times 0+10\times \left(-1\right)+10\times 1+10\times 0+10\times \left(-1\right)+10\times 1+10\times 0+10\times \left(-1\right)=0\end{array} $$
Note that in the input matrix
$$ \left(\begin{array}{ccc}10 & 10 & 10\\ 10 & 10 & 10\\ 10 & 10 & 10\end{array}\right) $$
there is no transition, as all the values are the same. On the contrary, if you consider B13, you need to consider this region of the input matrix
$$ \left(\begin{array}{ccc}10 & 10 & 0\\ 10 & 10 & 0\\ 10 & 10 & 0\end{array}\right) $$
where there is a clear transition, since the right-most column is made up of 0s, and the rest are 10s. Now you get a different result
$$ \begin{array}{l}B_{13}=\left(\begin{array}{ccc}10 & 10 & 0\\ 10 & 10 & 0\\ 10 & 10 & 0\end{array}\right)\ast \mathfrak{I}_V=\left(\begin{array}{ccc}10 & 10 & 0\\ 10 & 10 & 0\\ 10 & 10 & 0\end{array}\right)\ast \left(\begin{array}{ccc}1 & 0 & -1\\ 1 & 0 & -1\\ 1 & 0 & -1\end{array}\right)\\ =10\times 1+10\times 0+0\times \left(-1\right)+10\times 1+10\times 0+0\times \left(-1\right)+10\times 1+10\times 0+0\times \left(-1\right)=30\end{array} $$
As soon as there is a big change in values along the horizontal direction, the convolution returns a large value, because the values multiplied by the column of 1s in the kernel are bigger than those multiplied by the column of -1s. When the transition goes from small to high values along the horizontal axis, the elements multiplied by -1 dominate instead, so the result is negative and large in absolute value. This is why this kernel can also detect whether you pass from a light color to a darker color or vice versa. In fact, if you consider the opposite transition (from 0 to 10) in a hypothetical different matrix A, you would have
$$ \begin{array}{l}B_{11}=\left(\begin{array}{ccc}0 & 10 & 10\\ 0 & 10 & 10\\ 0 & 10 & 10\end{array}\right)\ast \mathfrak{I}_V=\left(\begin{array}{ccc}0 & 10 & 10\\ 0 & 10 & 10\\ 0 & 10 & 10\end{array}\right)\ast \left(\begin{array}{ccc}1 & 0 & -1\\ 1 & 0 & -1\\ 1 & 0 & -1\end{array}\right)\\ =0\times 1+10\times 0+10\times \left(-1\right)+0\times 1+10\times 0+10\times \left(-1\right)+0\times 1+10\times 0+10\times \left(-1\right)=-30\end{array} $$
Since this time, we move from 0 to 10 along the horizontal direction.
Figure 7-10

The result of the convolution of the matrix ex_mat with the kernel $$ \mathfrak{I}_V $$

Note how, as expected, the output matrix is 6 × 6, since the original matrix is 8 × 8 and the kernel is 3 × 3.

Pooling

Pooling is the second operation that is fundamental in CNNs. This operation is much easier to understand than convolution. To understand it, let's look at a concrete example of what is called max pooling. Let's again consider the 4 × 4 matrix we used in the convolution section
$$ A=\left(\begin{array}{cccc}a_1 & a_2 & a_3 & a_4\\ a_5 & a_6 & a_7 & a_8\\ a_9 & a_{10} & a_{11} & a_{12}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
To perform max pooling, we need to define a region of size nK × nK, analogous to what we did for convolution. Let’s consider nK = 2. What we need to do is start in the top-left corner of our matrix A and select a nK × nK region, in our case 2 × 2 from A. Here we would select
$$ \left(\begin{array}{cc}a_1 & a_2\\ a_5 & a_6\end{array}\right) $$
or the elements marked in bold face in matrix A here
$$ A=\left(\begin{array}{cccc}\boldsymbol{a_1} & \boldsymbol{a_2} & a_3 & a_4\\ \boldsymbol{a_5} & \boldsymbol{a_6} & a_7 & a_8\\ a_9 & a_{10} & a_{11} & a_{12}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
From the elements selected, a1, a2, a5, and a6, the max pooling operation selects the maximum value giving a result that we will indicate with B1
$$ B_1=\max_{i=1,2,5,6} a_i $$
Then we need to shift our 2 × 2 window to the right by two columns (typically by the same number of columns as the selected region is wide) and select the elements marked in bold
$$ A=\left(\begin{array}{cccc}a_1 & a_2 & \boldsymbol{a_3} & \boldsymbol{a_4}\\ a_5 & a_6 & \boldsymbol{a_7} & \boldsymbol{a_8}\\ a_9 & a_{10} & a_{11} & a_{12}\\ a_{13} & a_{14} & a_{15} & a_{16}\end{array}\right) $$
Or in other words the smaller matrix
$$ \left(\begin{array}{cc}a_3 & a_4\\ a_7 & a_8\end{array}\right) $$
The max pooling algorithm will then select the maximum of the values giving a result that we will indicate with B2
$$ B_2=\max_{i=3,4,7,8} a_i $$
At this point we cannot shift the 2 × 2 region to the right anymore, so we shift it two rows down and start the process again from the left side of A, selecting the elements marked in bold and getting the maximum and calling it B3.
$$ A=\left(\begin{array}{cccc}a_1 & a_2 & a_3 & a_4\\ a_5 & a_6 & a_7 & a_8\\ \boldsymbol{a_9} & \boldsymbol{a_{10}} & a_{11} & a_{12}\\ \boldsymbol{a_{13}} & \boldsymbol{a_{14}} & a_{15} & a_{16}\end{array}\right) $$
The stride s in this context has the same meaning discussed for convolution: it is simply the number of rows or columns you move your region by when selecting the elements. Finally, we select the last 2 × 2 region in the lower-right part of A, containing the elements a11, a12, a15, and a16. We then take the maximum and call it B4. With the values obtained in this process, in our example the four values B1, B2, B3, and B4, we build an output tensor
$$ B=\left(\begin{array}{cc}B_1 & B_2\\ B_3 & B_4\end{array}\right) $$
In the example, we have s = 2. Basically, this operation takes as input a matrix A, a stride s, and a kernel size nK (the dimension of the region we selected in the example before) and returns a new matrix B with dimensions given by the same formula we discussed for convolution
$$ n_B=\left\lfloor \frac{n_A-n_K}{s}+1\right\rfloor $$
To reiterate, the idea is to start at the top-left of your matrix A, take a region of dimensions nK × nK, apply the max function to the selected elements, then shift the region by s elements toward the right, select a new region of dimensions nK × nK, apply the function to its values, and so on. Figure 7-11 shows how you select the elements from matrix A with stride s = 2.
Figure 7-11

A visualization of pooling with stride s = 2

For example, applying max pooling to the input A
$$ A=\left(\begin{array}{cccc}1 & 3 & 5 & 7\\ 4 & 5 & 11 & 3\\ 4 & 1 & 21 & 6\\ 13 & 15 & 1 & 2\end{array}\right) $$
will get you the results (which are very easy to verify)
$$ B=\left(\begin{array}{cc}5 & 11\\ 15 & 21\end{array}\right) $$
Since 5 is the maximum of the values marked in bold
$$ A=\left(\begin{array}{cccc}\mathbf{1} & \mathbf{3} & 5 & 7\\ \mathbf{4} & \mathbf{5} & 11 & 3\\ 4 & 1 & 21 & 6\\ 13 & 15 & 1 & 2\end{array}\right) $$
11 is the maximum of the values marked in bold
$$ A=\left(\begin{array}{cccc}1 & 3 & \mathbf{5} & \mathbf{7}\\ 4 & 5 & \mathbf{11} & \mathbf{3}\\ 4 & 1 & 21 & 6\\ 13 & 15 & 1 & 2\end{array}\right) $$

and so on. It’s worth mentioning another way of doing pooling, although not as widely used as max pooling, and that’s average pooling. Instead of returning the maximum of the selected values, it returns the average.
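As an illustration, here is a minimal NumPy sketch of max pooling (and average pooling) with a window of size nK and stride s; the function name pool_2d is just for illustration:
import numpy as np

def pool_2d(A, n_K=2, s=2, mode='max'):
    # Pooling with window n_K x n_K and stride s (no padding)
    n_rows = (A.shape[0] - n_K) // s + 1
    n_cols = (A.shape[1] - n_K) // s + 1
    output = np.zeros([n_rows, n_cols])
    for i in range(n_rows):
        for j in range(n_cols):
            region = A[i*s:i*s + n_K, j*s:j*s + n_K]
            output[i, j] = region.max() if mode == 'max' else region.mean()
    return output

A = np.array([[1, 3, 5, 7], [4, 5, 11, 3], [4, 1, 21, 6], [13, 15, 1, 2]])
print(pool_2d(A))              # [[ 5. 11.], [15. 21.]], as in the example above
print(pool_2d(A, mode='avg'))  # average pooling over the same regions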

Note

The most common pooling operation is max pooling. Average pooling is not as widely used, but can be found in specific network architectures.

Padding

Sometimes, when dealing with images, it is not optimal to get a result from a convolution operation whose dimensions differ from those of the original image. In that case, you use what is called padding. The idea is very simple: you add rows of pixels at the top and bottom and columns of pixels on the left and right of the final image, filled with values that make the resulting matrix the same size as the original one. Some strategies fill the added pixels with zeros, others with the values of the closest pixels, and so on. For example, zero-padding the ex_out matrix from the previous example would look like this
array([[ 0., 0., 0., 0., 0., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 30., 30., 0., 0., 0.],
       [ 0., 0., 0., 0., 0., 0., 0., 0.]])
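This padded matrix can be obtained directly with NumPy's pad function, for example:
import numpy as np
ex_out_padded = np.pad(ex_out, pad_width=1, mode='constant', constant_values=0)
print(ex_out_padded.shape)  # (8, 8)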
The uses and reasons behind padding go beyond the scope of this book, but it is important to know that it exists. Just as a reference, if you use padding p (the width of the rows and columns you add as padding), the final dimension of the matrix B, for both convolution and pooling, is given by
$$ n_B=\left\lfloor \frac{n_A+2p-n_K}{s}+1\right\rfloor $$
Note

When dealing with real images, you always have color images, coded in three channels: RGB. That means that you need to do convolution and pooling in three dimensions: width, height, and color channel. This will add a layer of complexity to the algorithms.

Building Blocks of a CNN

Basically, convolution and pooling operations are used to build the layers of CNNs. In CNNs, you can typically find the following layers
  • Convolutional layers

  • Pooling layers

  • Fully connected layers

Fully connected layers are exactly what you have seen in all the previous chapters: layers in which each neuron is connected to all the neurons of the previous and subsequent layers. The first two types, however, require some additional explanation.

Convolutional Layers

A convolutional layer takes a tensor as input (it can be three-dimensional, due to the three color channels), for example an image of certain dimensions. It then applies a certain number of kernels, typically 10, 16, or even more, adds a bias, applies an activation function (ReLU, for example) to introduce non-linearity into the result of the convolution, and produces an output matrix B. If you remember the notation we used in the previous chapters, the result of the convolution plays the role of W[l]Z[l − 1], which we discussed in Chapter 3.

In the previous sections, we saw some examples of applying convolutions with just one kernel. How can you apply several kernels at the same time? Well, the answer is very simple. The final tensor (now we use the word tensor since it will not be a simple matrix anymore) B will have not two dimensions but three. Let's indicate the number of kernels you want to apply with nc (the c is used since sometimes people talk about channels). You simply apply each filter to the input independently and stack the results. So instead of a single matrix B with dimensions nB × nB, you get a final tensor $$ \tilde{B} $$ of dimensions nB × nB × nc. That means that
$$ \tilde{B}_{i,j,1}\quad \forall\, i,j\in \left[1,n_B\right] $$
will be the output of convolution of the input image with the first kernel. And
$$ \tilde{B}_{i,j,2}\quad \forall\, i,j\in \left[1,n_B\right] $$

will be the output of convolution with the second kernel, and so on. The convolutional layer is nothing more than something that transforms the input into an output tensor. But what are the weights in this layer? The weights, or the parameters that the network learns during the training phase, are the elements of the kernels themselves. We said that we have nc kernels, each of dimensions nK × nK. That means that we have $$ n_K^2 n_c $$ parameters in a convolutional layer.
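As a quick sanity check against the Keras model summary shown later in this chapter, note that in practice each kernel also spans all the input channels and carries one bias term, so the framework reports slightly more parameters than the bare nK²nc count:
# First convolutional layer of the example network later in the chapter:
# 6 kernels of size 5 x 5 on a 1-channel (grayscale) input, plus one bias per kernel
n_K, n_c, input_channels = 5, 6, 1
params = (n_K * n_K * input_channels + 1) * n_c
print(params)  # 156, the value reported by model.summary()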

Note

The number of parameters that you have in a convolutional layer, $$ n_K^2 n_c $$, is independent of the input image size. This fact helps reduce overfitting, especially when dealing with large input images.

Sometimes this layer is indicated with the word CONV and then a number. In our case, we could call this layer CONV1. Figure 7-12 shows a representation of a convolutional layer: the input image gets transformed, by applying convolution with nc kernels, into a tensor of dimensions nB × nB × nc.
Figure 7-12

A representation of a convolutional layer

Of course, a convolutional layer does not necessarily have to be placed immediately after the inputs; it may take as input the output of any other layer. Keep in mind that your input image will usually have dimensions nA × nA × 3, since a color image has three channels: red, green, and blue. A complete analysis of the tensors involved in a CNN when considering color images goes beyond the scope of this book. Very often in diagrams, the layer is simply indicated as a cube or a square.

Pooling Layers

A pooling layer is usually indicated with POOL and a number: for example, POOL1. It takes as input a tensor and gives as output another tensor after applying pooling to the input.

Note

A pooling layer has no parameters to learn, but it introduces additional hyper-parameters: nK and the stride s. Typically, in pooling layers you do not use padding, since one of the reasons for using pooling is to reduce the dimensionality of the tensors.

Stacking Layers Together

In CNNs, you often stack convolutional and pooling layers together, one after the other. Figure 7-13 shows a stack of a convolutional and a pooling layer. A convolutional layer is typically followed by a pooling layer, and sometimes the two together are called a layer. The reason is that a pooling layer has no learnable weights and is therefore seen as a simple operation associated with the convolutional layer. So be aware when you read papers or blogs, and be sure you understand the authors' conventions.
Figure 7-13

A representation of how to stack convolutional and pooling layers

To conclude this part on CNNs, Figure 7-14 shows an example of a CNN similar to the very famous LeNet-5 network [1]. You have the inputs, then two convolution-pooling layers, then three fully connected layers, and then an output layer, where you may have your softmax function if, for example, you perform multiclass classification. We included some numbers in the figure to give you an idea of the size of the different layers.
Figure 7-14

A representation of a CNN similar to the LeNet-5 network

An Example of a CNN

Let’s try to build a network to give you a feeling of how the process works and what the code looks like. We will not do any hyper-parameter tuning or optimization to keep the section understandable. We will build the following architecture with the following layers, in this order:
  • Convolution layer 1 (CONV1): Six filters 5 × 5, stride s = 1.

  • We then apply ReLU to the output of the previous layer.

  • Max pooling layer 1 (POOL1) with a window 2 × 2, stride s = 2.

  • Convolution layer 2 (CONV2): 16 filters 5 × 5, stride s = 1.

  • We then apply ReLU to the output of the previous layer.

  • Max pooling layer 2 (POOL2) with a window 2 × 2, stride s = 2.

  • Fully Connected Layer with 128 neurons with activation function ReLU.

  • Fully Connected Layer with ten neurons for classification of the Zalando dataset.

  • Softmax output neuron.

We import the Zalando (Fashion-MNIST) dataset
from tensorflow.keras.datasets import fashion_mnist
import numpy as np
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()
Refer to the earlier chapters if you do not remember the dataset's details. Let's prepare the data (reshaping the samples and one-hot encoding the labels):
labels_train = np.zeros((60000, 10))
labels_train[np.arange(60000), trainY] = 1
data_train = trainX.reshape(60000, 28, 28, 1)
and
labels_test = np.zeros((10000, 10))
labels_test[np.arange(10000), testY] = 1
data_test = testX.reshape(10000, 28, 28, 1)
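As a side note, the same one-hot encoding can also be obtained with a Keras utility; a minimal sketch:
from tensorflow.keras.utils import to_categorical
labels_train = to_categorical(trainY, num_classes=10)
labels_test = to_categorical(testY, num_classes=10)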
Note that in this case, we use as the network's inputs tensors of dimensions (number_of_images, image_height, image_width, color_channels). Since the Zalando dataset is made up of grayscale images, color_channels will be equal to 1. Unlike in the previous chapters, where each observation was flattened into a row vector for a feed-forward network, here each image keeps its two-dimensional structure (plus the channel dimension). If you check the dimensions with the code
print('Dimensions of the training dataset: ', data_train.shape)
print('Dimensions of the test dataset: ', data_test.shape)
print('Dimensions of the training labels: ', labels_train.shape)
print('Dimensions of the test labels: ', labels_test.shape)
You will get the results
Dimensions of the training dataset:  (60000, 28, 28, 1)
Dimensions of the test dataset:  (10000, 28, 28, 1)
Dimensions of the training labels:  (60000, 10)
Dimensions of the test labels:  (10000, 10)
We need to normalize the data
data_train_norm = np.array(data_train/255.0)
data_test_norm = np.array(data_test/255.0)
We can now start to build our network. With Keras, creating and training a CNN model is straightforward; the following function defines the network’s architecture
from tensorflow.keras import models, layers

def build_model():
  # create model
  model = models.Sequential()
  model.add(layers.Conv2D(6, (5, 5), strides = (1, 1),
            activation = 'relu', input_shape = (28, 28, 1)))
  model.add(layers.MaxPooling2D(pool_size = (2, 2),
            strides = (2, 2)))
  model.add(layers.Conv2D(16, (5, 5), strides = (1, 1),
            activation = 'relu'))
  model.add(layers.MaxPooling2D(pool_size = (2, 2),
            strides = (2, 2)))
  model.add(layers.Flatten())
  model.add(layers.Dense(128, activation = 'relu'))
  model.add(layers.Dense(10, activation = 'softmax'))
  # compile model
  model.compile(loss = 'categorical_crossentropy',
                optimizer = 'adam',
                metrics = ['categorical_accuracy'])
  return model

Remember that the convolutional layer will require the two-dimensional image, and not a flattened list of gray values of the pixels.

Note

One of the biggest advantages of CNNs is that they use the two-dimensional information contained in the input image. This is why the input of a convolutional layer is a two-dimensional image (plus the channel dimension), and not a flattened vector.

When building CNNs in Keras, a single line of code (a Keras layer) corresponds to each layer of the network. The build_model function creates a CNN by stacking Conv2D (which builds a convolutional layer) and MaxPooling2D (which builds a max pooling layer) layers. The stride is a tuple since it gives the stride in the different dimensions (for rows and columns). In our examples we have grayscale images, but we could also have RGB images, for example. That would mean having more dimensions: the three color channels.

Let's create the model and display its architecture so far, using model.summary():
model = build_model()
model.summary()
Model: "sequential"
______________________________________________________________
Layer (type)                 Output Shape              Param #
==============================================================
conv2d (Conv2D)              (None, 24, 24, 6)         156
______________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 6)         0
______________________________________________________________
conv2d_1 (Conv2D)            (None, 8, 8, 16)          2416
______________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 4, 4, 16)          0
______________________________________________________________
flatten (Flatten)            (None, 256)               0
______________________________________________________________
dense (Dense)                (None, 128)               32896
______________________________________________________________
dense_1 (Dense)              (None, 10)                1290
==============================================================
Total params: 36,758
Trainable params: 36,758
Non-trainable params: 0
______________________________________________________________

Note that the output of every convolutional and pooling layer is a 3D tensor of shape (height, width, number_of_filters). The first dimension (the batch size) is shown as None, since the network does not fix it and can therefore be applied to a batch of samples of any size. The width and height dimensions decrease as you go deeper into the network. The number of output channels of each Conv2D layer is controlled by its first argument. Typically, as the width and height decrease, you can afford (computationally) to add more output filters to each Conv2D layer.

To complete the model, we added two Dense layers. They take vectors as input (which are 1D), while the current output is a 3D tensor. This is why you first need to flatten the 3D output to 1D, then add one or more Dense layers on top.

Now it’s time to train and test our network. We will use mini-batch gradient descent with a batch size of 100 and we will train our network for ten epochs.
model.fit(data_train_norm, labels_train, validation_data = (data_test_norm, labels_test), epochs = 10, batch_size = 100, verbose = 1)

If you run this code (it took roughly four minutes on a medium-performance laptop), it will reach, after just one epoch, a training accuracy of 76.3%. After ten epochs it reaches a training accuracy of 91% (88% on the dev set). We have trained our network here for only ten epochs; you can get much higher accuracy if you train longer. Additionally, note that we have not done any hyper-parameter tuning, so you would get much better results if you spent some time tuning the parameters.
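If you want to check the dev-set accuracy explicitly after training, a minimal sketch:
test_loss, test_acc = model.evaluate(data_test_norm, labels_test, verbose=0)
print(test_acc)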

As you may have noticed, every time you introduce a convolutional layer, you will introduce new hyper-parameters for each layer:
  • Kernel size

  • Stride

  • Padding

Those will need to be tuned to get optimal results. Typically, researchers tend to use existing architectures for specific tasks that have already been optimized by other practitioners and are well documented in papers.

Conclusion

You should now have a basic understanding of how CNNs work and of the principles they are based on. Convolutional neural networks are used extensively, in multiple forms, for various tasks: from classification (as you have seen here) to object localization, object segmentation, instance segmentation, and much more. This chapter just scratched the surface, but you should now understand the building blocks of CNNs and be able to understand how more complex architectures are built and structured.

Exercises

Exercise 1 (Convolution) (Level: Easy)
Try to apply the different convolution operators like the ones you saw in this chapter, but to different images, such as the handwritten digits of the MNIST database (http://yann.lecun.com/exdb/mnist/). To download the dataset from TensorFlow, use the following lines of code:
from tensorflow import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
Exercise 2 (CNN) (Level: Easy)

Try to build a multiclass classification model like the one you saw in this chapter, but using the MNIST database of handwritten digits instead.

Exercise 3 (CNN) (Level: Medium)

Try to change the network’s parameters to see if you can get a better accuracy. Change kernel size, stride, and padding.

References

[1] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.