© Umberto Michelucci 2022
U. Michelucci, Applied Deep Learning with TensorFlow 2, https://doi.org/10.1007/978-1-4842-8020-1_3

3. Feed-Forward Neural Networks

Umberto Michelucci
Dübendorf, Switzerland

In the last chapter we did some amazing things with one neuron, but that is hardly flexible enough to tackle more complex cases. The real power of neural networks comes to light when several (thousands, or even millions of) neurons interact with each other to solve a specific problem. The network architecture (how neurons are connected to each other, how they behave, and so on) plays a crucial role in how efficiently a network learns, how good its predictions are, and what kind of problems it can solve. There are many kinds of architectures that have been extensively studied and that are very complex, but from a learning perspective it is important to start from the simplest kind of neural network with multiple neurons. It makes sense to start with a feed-forward neural network, in which data enters at the input layer and passes through the network, layer by layer, until it arrives at the output layer (this is what gives these networks their name: feed-forward neural networks).

This chapter discusses networks where each neuron in a layer gets its input from all neurons from the preceding layer and feeds its output into each neuron of the next layer. As it is easy to imagine, with more complexity come more challenges. It is more difficult to get quick learning and good accuracy, since the number of hyper-parameters that are available grows due to the increase in network complexity. A simple gradient descent algorithm is not as efficient when dealing with big datasets. When developing models with many neurons, we need to have at our disposal an expanded set of tools that allow us to deal with all the challenges that those networks present.

This chapter starts looking at some more advanced methods and algorithms that will allow you to work efficiently with big datasets and big networks. These complex networks can do some interesting multiclass classifications, one of the tasks that big networks are most often required to do (for example, handwriting recognition, face recognition, image recognition, and so on), so we will use a dataset that will allow us to perform multiclass classification and study its difficulties.

We start the chapter with the network architecture and the needed matrix formalism. Next is a short overview of the new hyper-parameters that come with this new type of network. You learn how to do multiclass classification using the softmax function and what kind of output layer is needed. Then, before starting with the Python code, we take a brief digression to explain, with a simple example, what exactly overfitting is and how to do a basic error analysis with complex networks.

Then we will start using Keras to construct bigger networks, applying them to an MNIST-like dataset based on images of clothing items (the Fashion-MNIST dataset, from Zalando). We will look at how to add many layers in an efficient way and how to initialize the weights and the biases in the best way possible to make training fast and stable. We will look at Xavier and He initialization for the sigmoid and ReLU activation functions, respectively. Finally, we describe a rule of thumb for comparing the complexity of networks that goes beyond only counting neurons. The chapter concludes with some tips on how to choose the right network and a method to estimate the memory footprint depending on the architecture.

A Short Review of Network’s Architecture and Matrix Notation

The network architecture is quite easy to understand. It consists of an input layer (the inputs xi, j), several layers (called hidden because they are sandwiched between the input and the output layers, so they are “invisible” from the outside so to speak), and then an output layer. In each layer you may have one to several neurons. The main property of such a network is that each neuron gets input from each neuron in the preceding layer and feeds its output to every neuron in the next layer. Figure 3-1 shows a graphical representation of such a network (in the inputs, we omitted the first index indicating the observation index for clarity).
Figure 3-1

The schematic representation of a deep feed-forward neural network with many hidden layers, where each neuron gets input from each neuron in the preceding layer and feeds its output to every neuron in the next layer

Jumping from one neuron to this is quite a big step. To build the model, we need to work with matrix formalism, and therefore we need to get all the matrix dimensions right. Let's first discuss some new notation:
  • L is the number of layers in the network, excluding the input layer but including the output layer

  • nl is the number of neurons in layer l

In a network such as the one in Figure 3-1, we indicate the total number of neurons with Nneurons, which can be written as follows
$$ N_{neurons} = n_x + \sum_{i=1}^{L} n_i = \sum_{i=0}^{L} n_i $$
where, by convention, we define n0 = nx. Each connection between two neurons will have its own weight. Let's indicate the weight between neuron i in layer l and neuron j in layer l − 1 with $$ w_{ij}^{[l]} $$. Figure 3-2 shows only the first two layers (input and layer 1) of the generic network from Figure 3-1, with the weights between the first neuron in the input layer and all the others in layer 1. All other neurons are grayed out for clarity.
Figure 3-2

The first two layers of a generic neural network, with the weights of the connections between the first neuron in the input layers and the others in the second layer. All other neurons and connections are drawn in light gray to make the diagram clearer

The weights between the input layer and layer 1 can be written as a matrix as follows
$$ W^{[1]} = \begin{pmatrix} w_{11}^{[1]} & \dots & w_{1 n_x}^{[1]} \\ \vdots & \ddots & \vdots \\ w_{n_1 1}^{[1]} & \dots & w_{n_1 n_x}^{[1]} \end{pmatrix} $$

That means that the matrix W[1] has dimensions n1 × nx. This can of course be generalized to any two adjacent layers l and l − 1, meaning that the weight matrix between two adjacent layers l and l − 1, which we indicate with W[l], will have dimensions nl × nl − 1. By convention, n0 = nx is the number of input features (not the number of observations, which we indicate with m).

Note

The weight matrix between two adjacent layers l and l − 1, which we indicate with W[l], will have dimensions nl × nl − 1, where, by convention, n0 = nx is the number of input features.

The bias (indicated with b) will be a matrix this time. Remember that each neuron that receives inputs will have its own bias, so when considering our two layers l and l − 1 we will need nl different values of b. We indicate this matrix with b[l] and it will have dimensions nl × 1.

Note

The bias matrix for two adjacent layers l and l − 1, which we indicate with b[l], will have dimensions nl × 1.

Output of Neurons

Now let's consider the output of our neurons. To begin, we consider the ith neuron of the first layer (remember that our input layer is by definition layer 0). Let's indicate its output with $$ \hat{y}_i^{[1]} $$ and let's assume that all neurons in layer l use the same activation function, which we indicate with g[l]. Then we will have
$$ \hat{y}_i^{[1]} = g^{[1]}\left(z_i^{[1]}\right) = g^{[1]}\left(\sum_{j=1}^{n_x} w_{ij}^{[1]} x_j + b_i^{[1]}\right) $$
where we have indicated zi[1] as
$$ z_i^{[1]} = \sum_{j=1}^{n_x} w_{ij}^{[1]} x_j + b_i^{[1]} $$
As you can imagine, we want a matrix with all the outputs of layer 1, so we will use this notation
$$ Z^{[1]} = W^{[1]} X + b^{[1]} $$

where Z[1] has dimensions n1 × 1 when X contains a single observation, and n1 × m when X is the matrix with all our observations (rows for the features and columns for the observations); the bias vector b[1] is added to each column. We assume that all neurons in layer l use the same activation function, which we indicate with g[l].

We can easily generalize the previous equation to a generic layer l:
$$ Z^{[l]} = W^{[l]} \hat{Y}^{[l-1]} + b^{[l]} $$
Since layer l gets its input from layer l − 1, we just need to substitute X with the output of the previous layer, Ŷ[l − 1]. For a single observation, Z[l] will have dimensions nl × 1. Our output in matrix form will then be
$$ \hat{Y}^{[l]} = g^{[l]}\left(Z^{[l]}\right) $$

where the activation function acts, as usual, element by element.

A Short Summary of Matrix Dimensions

Let's summarize the dimensions of all the matrices we have described so far (written here for a single observation; with m observations, the second dimension of Z and Ŷ becomes m)
  • W[l] has dimensions nl × nl − 1 (where we have n0 = nx by definition)

  • b[l] has dimensions nl × 1

  • Z[l − 1] has dimensions nl − 1 × 1

  • Z[l] has dimensions nl × 1

  • Ŷ[l] has dimensions nl × 1

In each case, l goes from 1 to L.

Example: Equations for a Network with Three Layers

To make all this discussion a bit more concrete, let’s consider an example of a network with three layers (so L = 3) with n1 = 3, n2 = 2, and n3 = 1, as depicted in Figure 3-3.
Figure 3-3

A practical example of a feed-forward neural network

In this case, we need to calculate the following quantities
  • $$ \hat{Y}^{[1]} = g^{[1]}\left(W^{[1]} X + b^{[1]}\right) $$, where W[1] has dimensions 3 × nx, b[1] has dimensions 3 × 1, and X has dimensions nx × m

  • $$ \hat{Y}^{[2]} = g^{[2]}\left(W^{[2]} \hat{Y}^{[1]} + b^{[2]}\right) $$, where W[2] has dimensions 2 × 3, b[2] has dimensions 2 × 1, and $$ \hat{Y}^{[1]} $$ has dimensions 3 × m

  • $$ \hat{Y}^{[3]} = g^{[3]}\left(W^{[3]} \hat{Y}^{[2]} + b^{[3]}\right) $$, where W[3] has dimensions 1 × 2, b[3] has dimensions 1 × 1, and $$ \hat{Y}^{[2]} $$ has dimensions 2 × m

The network output $$ \hat{Y}^{[3]} $$ will have, as expected, dimensions 1 × m.

All this may seem rather abstract (and in fact it is). You will see later in the chapter how easy it is to implement this in Keras by simply building the right architecture, based on the steps just discussed.
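In the meantime, if you want to convince yourself that the matrix dimensions work out, the following minimal NumPy sketch builds the network of Figure 3-3 with random weights and runs one forward pass. The values nx = 4 and m = 5, and the sigmoid used as the activation function, are arbitrary choices made only for this illustration.
import numpy as np

nx, m = 4, 5                          # hypothetical number of features and observations
n = [nx, 3, 2, 1]                     # n[0] = nx, then n1 = 3, n2 = 2, n3 = 1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.random.rand(nx, m)             # features on the rows, observations on the columns
Y_hat = X
for l in range(1, len(n)):
    W = np.random.rand(n[l], n[l - 1])    # W[l] has dimensions n_l x n_(l-1)
    b = np.random.rand(n[l], 1)           # b[l] has dimensions n_l x 1
    Z = np.dot(W, Y_hat) + b              # broadcasting adds b[l] to every observation
    Y_hat = sigmoid(Z)
    print('Layer', l, 'output dimensions:', Y_hat.shape)
The three printed shapes are (3, 5), (2, 5), and finally (1, 5), that is, 1 × m, as expected.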

Hyper-Parameters in Fully Connected Networks

In networks like the ones just discussed, there are quite a few parameters that you can tune to find the best model for your problem.

Note

Parameters that you fix at the beginning and then do not change during the training phase are called hyper-parameters (like the number of epochs).

You need to tune the additional following hyper-parameters for feed-forward networks:
  • The number of layers, L

  • The number of neurons in each layer, ni, for i from 1 to L

  • The choice of activation function for each layer g[l]

Then of course you still have the following hyper-parameters:
  • The number of iterations (or epochs)

  • The learning rate

A Short Review of the Softmax Activation Function for Multiclass Classifications

You still need to suffer a bit more theory before getting to some TensorFlow code. These kinds of networks start to be complex enough to be able to perform multiclass classifications with reasonable results. To do this, we must introduce the softmax function.

Mathematically speaking, the softmax function S transforms a k-dimensional vector into another k-dimensional vector of real values, each between 0 and 1, that sum up to 1. Given k real values zi for i = 1, …, k, we define the vector z = (z1, …, zk) and the softmax vector function S(z) = (S(z)1, S(z)2, …, S(z)k) as follows
$$ S(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} $$
Since the denominator is always bigger than the numerator, S(z)i < 1. Additionally, we have
$$ \sum_{i=1}^{k} S(\mathbf{z})_i = \sum_{i=1}^{k} \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} = \frac{\sum_{i=1}^{k} e^{z_i}}{\sum_{j=1}^{k} e^{z_j}} = 1 $$

So S(z)i behaves like a probability, since its sum over i is 1 and its elements are all less than 1. We will look at S(z)i as a probability distribution over k possible outcomes. For this example, S(z)i will simply be the probability of our input observation being of class i. Let's suppose we are trying to classify an observation into three classes. We may get the following output: S(z)1 = 0.1, S(z)2 = 0.6, and S(z)3 = 0.3. That means that the observation has a 10% probability of being in class 1, a 60% probability of being in class 2, and a 30% probability of being in class 3. Normally, you would classify the input observation into the class with the highest probability—in this example, class 2 with 60% probability.
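As a quick numerical check of the formula, here is a minimal NumPy sketch of the softmax function; the input values z are arbitrary and chosen only for illustration.
import numpy as np

def softmax(z):
    # subtracting the maximum is a standard trick for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))          # three values, each between 0 and 1
print(softmax(z).sum())    # they sum to 1.0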

Note

We will look at S(z)i with i = 1, …, k as a probability distribution over k possible outcomes. For this example, S(z)i will simply be the probability of our input observation being in class i.

To be able to use the softmax function for classification, we need a specific output layer. We need ten neurons (in the case of a ten-class multiclass classification problem, like the one we see later in the chapter), each giving zi as its output, followed by one neuron that outputs S(z). This last neuron has the softmax function as its activation function and takes as inputs the ten outputs zi of the previous layer. In Keras, you apply the tf.keras.activations.softmax function to the last layer with ten neurons. Note that, unlike the activation functions we have seen so far, softmax does not act element by element: it normalizes across the whole vector of ten values. You will see a concrete example of how to implement this from start to finish later in this chapter.
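As a small illustration (the numbers below are made up), this is how tf.keras.activations.softmax can be applied to a batch of ten-dimensional outputs; note that it expects at least a two-dimensional tensor, with one row per observation.
import tensorflow as tf

# a batch of two hypothetical ten-dimensional outputs z
z = tf.constant([[1.0, 2.0, 0.5, 0.1, 0.0, 3.0, 1.5, 0.2, 0.3, 0.4],
                 [0.0, 0.0, 5.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
probabilities = tf.keras.activations.softmax(z)   # softmax applied row by row
print(tf.reduce_sum(probabilities, axis = 1))     # each row sums to 1
In a model you will normally not call this function directly; you simply pass activation = 'softmax' to the last Dense layer, as we do later in this chapter.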

A Brief Digression: Overfitting

One of the most common problems that you will encounter when training deep neural networks is overfitting. Your network may, due to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. It is very important to understand what overfitting is, so we will now go through a practical example of what can happen, to get a decent understanding of it. To make it easier to visualize, we will work with a simple two-dimensional dataset created for this purpose.

A Practical Example of Overfitting

To understand overfitting, consider the following problem: find the best polynomial that approximates a given dataset. Given a set of two-dimensional points (xi, yi), we want to find the best polynomial of degree K in the form1
$$ f(x_i) = \sum_{j=0}^{K} a_j x_i^j $$
that minimizes the mean square error
$$ \frac{1}{m} \sum_{i=1}^{m} \left(y_i - f(x_i)\right)^2 $$
where, as usual, m indicates the number of data points we have. We want to determine not only all the parameters aj but also the value of K that best approximates the data. K in this case measures our model complexity. For example, for K = 0 we simply have f(xi) = a0 (a constant), the simplest polynomial we can think of. For higher K, we have higher-order polynomials, meaning our function is more complex and has more parameters available for training. Here is the function for K = 3
$$ f(x_i) = \sum_{j=0}^{3} a_j x_i^j = a_0 + a_1 x_i + a_2 x_i^2 + a_3 x_i^3 $$
where we have four parameters that can be tuned during the model's training. Let's generate some data, starting from a second-order polynomial (K = 2)
$$ 1 + 2 x_i + 3 x_i^2 $$
We are adding some random error (this will make overfitting visible). Let's first import our standard libraries, with the addition of the curve_fit function, which performs a least-squares fit and finds the best parameters automatically. Do not worry too much about this function; the goal here is to show you what can happen when you use a model that is too complex.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
Let’s define a function for a second-degree polynomial
def func_2(p, a, b, c):
    return a + b*p + c*p**2
Then let’s generate the dataset
x = np.arange(-5.0, 5.0, 0.05, dtype = np.float64)
y = func_2(x, 1, 2, 3) + 18.0 * np.random.normal(0, 1, size = len(x))

To add some random noise to the function, we used the np.random.normal(0, 1, size = len(x)) function, which generates a NumPy array of random values from a normal distribution of length len(x) with average 0 and standard deviation 1.

Figure 3-4 shows the data for a = 1, b = 2, and c = 3.
Figure 3-4

The data we generated with a = 1, b = 2, and c = 3

Now let's consider a model that is too simple to capture the features of the data, meaning we will see what a model with high bias2 can do. Consider a linear model (K = 1). The code will be
def func_1(p, a, b):
    return a + b*p
popt, pcov = curve_fit(func_1, x, y)
That gives the best values for a and b that minimize the mean square error. In Figure 3-5, it's clear how this model, being too simple, completely misses the main feature of the data.
Figure 3-5

The linear model misses the main feature of the data, being too simple. In this case, the model has high bias

Let’s try to fit a two-degree polynomial (K = 2). The results are shown in Figure 3-6.
Figure 3-6

The result (red line) for a two-degree polynomial

That is better. This model seems to capture the main features of the data, ignoring the random noise. Now let’s try a very complex model, a 21-degree polynomial (K = 21). The results are shown in Figure 3-7.
Figure 3-7

The results for a 21-degree polynomial model
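If you want to reproduce a fit like the one in Figure 3-7, a minimal sketch is the following; it uses np.polyfit instead of curve_fit purely for brevity, and the degree 21 matches the discussion above.
# fit a 21-degree polynomial to the (x, y) data generated above
# (NumPy may warn that the fit is poorly conditioned, which is itself a
# hint that the model is too complex for this data)
coeffs = np.polyfit(x, y, 21)
y_fit = np.polyval(coeffs, x)

plt.scatter(x, y, s = 10, label = 'data')
plt.plot(x, y_fit, color = 'red', label = '21-degree polynomial')
plt.legend()
plt.show()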

This model shows features that we know are wrong (since we created the data ourselves). Those features are not present in the underlying process, but the model is so flexible that it captures the random variability that we introduced with the noise. The oscillations that appear with this high-order polynomial are wrong and do not describe the data correctly.

In this case, we talk about overfitting, meaning our model starts capturing features that are due, for example, to random error. It is easy to understand that this generalizes quite badly. If we applied this 21-degree polynomial model to new data, it would not work well, since the random noise would be different in the new data and the oscillations (the ones shown in Figure 3-7) would make no sense.

Figure 3-8 shows the best 21-degree polynomial models obtained by fitting data generated with ten different random noise realizations. You can clearly see how much the fit varies. It is not stable and depends strongly on the random noise. The oscillations are always different! In this case, we talk about high variance.
Figure 3-8

The result of our model with a 21-degree polynomial fitted to ten different datasets generated with different random noise values

Now let's produce the same plot with the linear model, varying the random noise as we did in Figure 3-8. You can see the results in Figure 3-9.
Figure 3-9

The result of the linear model applied to data where we have randomly changed the random noise. For easier comparison with Figure 3-8, we used the same scale

You can see that the model is much more stable. The linear model does not capture features that depend on the noise, but it misses the main feature of the data (its concave shape). Here we talk about high bias.

Figure 3-10 illustrates the concepts of bias and variance.
Figure 3-10

Bias is a measure of how close the measurements are to the true values (the center in the figure) and variance is a measure of how spread the measurements are around the average (not necessarily the true value, as you can see on the left)

In the case of neural networks, we have many hyper-parameters (number of layers, number of neurons in each layer, activation functions, and so on) and it is very difficult to know in which regime we are. How can we tell if our model has high variance or high bias, for example? An entire chapter is dedicated to this subject, but the first step in performing this error analysis is to split the dataset into two different subsets. Let's see what that means and why we have to do it.

Note

The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure [1]. The opposite is called underfitting, when the model cannot capture the structure of the data.

The problem with overfitting and deep neural networks is that there is no way of easily visualizing the results. Therefore we need a different approach to determine if our model is overfitting, underfitting, or is just right. This can be achieved by splitting the dataset into different parts and comparing the metrics of the parts. You learn about the basic idea in the next section.

Basic Error Analysis

To check how your model is doing and to perform a proper error analysis, you need to split your dataset into two parts3:
  • Training dataset: The model is trained on this dataset, using the inputs and the corresponding labels and an optimizer algorithm like gradient descent. Often this is called the training set.

  • Development (or validation) set: The trained model will then be used on this dataset to check how it is doing. On this dataset we will test different hyper-parameters. For example, we can train two different models with a different number of layers on the training dataset and test them on this dataset to check how they are doing. Often this is called the dev set.

There is an entire chapter dedicated to error analysis, but it is a good idea to read an overview of why it is important to split the dataset. Let's suppose we are dealing with classification, and that the metric we use to judge our model is one minus the accuracy, in other words, the percentage of cases that are wrongly classified.
Table 3-1

Examples of the Difference Between Models with High Bias and Models with High Variance

Error            Case A   Case B   Case C   Case D
Train set error  1%       15%      14%      0.3%
Dev set error    11%      16%      32%      1.1%

Let's consider the four cases reported in Table 3-1:
  • Case A: We are overfitting (high variance), because we are doing very well on the training set, but our model generalizes very badly to our dev set (see Figure 3-8 again).

  • Case B: We see a problem with high bias, meaning that our model is not doing very well generally, on both datasets (see Figure 3-9 again).

  • Case C: We have a high bias (the model cannot predict the training set very well) and high variance (the model does not generalize on the dev set very well).

  • Case D: Everything seems okay. The error is low on the training set and low on the dev set. This is a good candidate for our best model.

We will explain all these concepts in much more detail later in the book, where we provide recipes for solving problems of high bias, high variance, both at once, and even more complex cases.

To recap, to do a very basic error analysis, you need to split your dataset into at least two sets: train and dev. You should then calculate your metric on both sets and compare them. You want to have a model that has low error on the train set, low error on the dev set (as with Case D in the previous example), and the two values should be comparable.
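In code, this comparison boils down to two evaluations of the same trained model. The following minimal sketch assumes a compiled Keras model called model and the normalized train and dev arrays that we prepare later in this chapter.
# error = 1 - accuracy, evaluated on both datasets
_, train_acc = model.evaluate(data_train_norm, labels_train, verbose = 0)
_, dev_acc = model.evaluate(data_test_norm, labels_test, verbose = 0)
print('Train set error:', 1.0 - train_acc)
print('Dev set error:  ', 1.0 - dev_acc)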

Note

Your main takeaway from this section should be two-fold: 1) you need a set of recipes and guidelines for understanding how your model is doing: is it overfitting, underfitting, or is it just right? 2) to answer question 1 and perform the analysis, you need to split the dataset into two parts (later in the book, you will also see what you can do with the dataset split into three or even four parts).

Implementing a Feed-Forward Neural Network in Keras

Building a feed-forward neural network in Keras is straightforward and is simply a generalization of the one-neuron model you built in the last chapter. Let's compare the two cases. Here is the schematic one-neuron model Keras implementation
model = keras.Sequential([
   layers.Dense(1, input_shape = [...])
])
The following is a feed-forward network model with one hidden layer made of 15 neurons (the first model we will use for our multiclass classification task on the Zalando dataset):
model = keras.Sequential([
   layers.Dense(15, input_shape = [...]),
   layers.Dense(10)
])

As you can see, we added more neurons (15) to the hidden layer (which in the one-neuron model was also the output layer) and we added an output layer made of ten neurons, since we have ten classes. Notice how easily you can create very complex models by simply adding one layer after another to the stack.

In the next paragraphs, you will see a practical example of how to use this model, choosing the right activation function and the right loss function (given as additional parameters) to solve a multiclass classification task.

Multiclass Classification with Feed-Forward Neural Networks

The task we are going to solve together is a multiclass classification problem on the Zalando dataset: predicting, for each image, the correct label among ten possible classes (ten different types of clothing). To solve it, we will use a feed-forward network architecture and try different configurations (different optimizers and architectures), performing some error analysis to see which works best. Let's start by looking at the data.

The Zalando Dataset for the Real-World Example

Zalando SE is a German e-commerce company based in Berlin. The company maintains a cross-platform store that sells shoes, clothing, and other fashion items [2]. For a Kaggle competition, they prepared an MNIST-like dataset of Zalando's clothing article images [4], providing 60000 training images and 10000 test images. (If you do not know what Kaggle is, check out their website [3], where you can participate in many competitions whose goal is to solve problems with data science.) As in MNIST, each image is 28x28 pixels in grayscale. They classified all images into ten different classes and provided the labels for each image. In its CSV form, the dataset has 785 columns, where the first column is the class label (an integer going from 0 to 9) and the remaining 784 contain the pixel grayscale values of the image (you can calculate that 28x28=784).

Each training and test example is assigned to one of the following labels (as from the documentation):
  • 0: T-shirt/top

  • 1: Trouser

  • 2: Pullover

  • 3: Dress

  • 4: Coat

  • 5: Sandal

  • 6: Shirt

  • 7: Sneaker

  • 8: Bag

  • 9: Ankle boot

Figure 3-11 shows an example of each class chosen randomly from the dataset.
Figure 3-11

One example from each of the ten classes in the Zalando dataset

The dataset is provided under the MIT License4. The data file can be downloaded from Kaggle (https://www.kaggle.com/zalando-research/fashionmnist/data) or directly from GitHub (https://github.com/zalandoresearch/fashion-mnist). If you choose the second option, you need to prepare the data a bit (you can convert it to CSV with the script located at https://pjreddie.com/projects/mnist-in-csv/). If you download it from Kaggle, you have all the data in the right format. You will find two CSV files zipped on the Kaggle website. The ZIP file contains fashion-mnist_train.csv, with 60000 images (roughly 130 MB), and fashion-mnist_test.csv, with 10000 images (roughly 21 MB).

In our example, we will retrieve the dataset a third way: from the TensorFlow datasets catalog (https://www.tensorflow.org/datasets/catalog/fashion_mnist), since this way we do not have to perform any preprocessing steps and the data is automatically imported into our notebook. Now, let's code!

We will need the following imports in our code
# general libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
from random import *
import time
# tensorflow libraries
from tensorflow.keras.datasets import fashion_mnist
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Sequential and Dense are used directly in the models built later in this chapter
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling
Then, to retrieve the dataset, we can simply run the following command
((trainX, trainY), (testX, testY)) = fashion_mnist.load_data()

Incredibly easy! Now we have two NumPy arrays (trainX and testX) containing all the pixel values of the training and test images, and two NumPy arrays (trainY and testY) containing the associated labels.

Let’s print the datasets’ dimensions
print('Dimensions of the training dataset: ', trainX.shape)
print('Dimensions of the test dataset: ', testX.shape)
print('Dimensions of the training labels: ', trainY.shape)
print('Dimensions of the test labels: ', testY.shape)
which will return as output
Dimensions of the training dataset:  (60000, 28, 28)
Dimensions of the test dataset:  (10000, 28, 28)
Dimensions of the training labels:  (60000,)
Dimensions of the test labels:  (10000,)
Tip

Remember that you should not focus on the Python implementation. Focus on the model, on the concepts behind the implementation. You can achieve the same results using pandas, NumPy, or even C. Try to concentrate on how to prepare the data, how to normalize it, how to check the training, and so on.

As you can see, we have a training dataset made of 60000 items, stored as images of 28x28 pixels each, and a test dataset made of 10000 items, stored in the same way. Then, an array of corresponding labels is associated with each dataset.

Now we need to modify our data to obtain a "flattened" version of each image, meaning an array of 784 pixels instead of a matrix of 28x28 pixels. This step is necessary because, as we saw when we discussed the feed-forward network architecture, the network receives all the features as separate input values. Therefore, we need to have all the pixels stored in the same array. Convolutional Neural Networks (CNNs), on the contrary, do not work with flattened versions of the images, but that's a topic for Chapter 7. For now, keep this in mind.

The following lines reshape the matrix dimensions
data_train = trainX.reshape(60000, 784)
data_test = testX.reshape(10000, 784)
Let’s summarize our data so far
  • labels: Has dimensions (60000) and contains the class labels (integers from 0 to 9)

  • train: Has dimensions m × nx (60000x784) and contains the features, where each column contains the grayscale value of a single pixel in the image (remember 28x28=784)

See Figure 3-11 again to get an idea of how the images look. Finally, let's normalize the input so that instead of having values from 0 to 255 (the grayscale values), it only has values between 0 and 1. This is very easy to do with the code
data_train_norm = np.array(data_train/255.0)
data_test_norm = np.array(data_test/255.0)

Before developing the network, we need to solve another problem. Labels must be provided in a different form when performing a multiclass classification task.

Modifying Labels for the Softmax Function: One-Hot Encoding

In classification, we use the following cost function, called cross-entropy (written here for the binary case)
$$ L(\hat{y}_i, y_i) = -\left(y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right) $$

where yi contains our labels and $$ \hat{y}_i $$ is the output of our network. So, the two elements must have the same dimensions. In our case, the network gives as output a vector with ten elements, while a label in our dataset is simply a scalar: $$ \hat{y}_i $$ has dimensions (10,1) and yi has dimensions (1,1). This will not work unless we do something smart. We need to transform our labels into vectors of dimensions (10,1), with one value for each class. But what value should we use?

What we need to do is called one-hot encoding5: we transform our labels (integers from 0 to 9) into vectors of length 10 with the following rule: the one-hot encoded vector is all zeros, except at the index given by the label. For example, for the label 2 our vector of length 10 has all zeros except at the position with index 2; in other words, it is (0,0,1,0,0,0,0,0,0,0). Let's look at some other examples (see Table 3-2).
Table 3-2

Examples of How One-Hot Encoding Works. Remember that Labels Go from 0 to 9, as Indexes

Label   One-Hot Encoded Label
0       (1,0,0,0,0,0,0,0,0,0)
2       (0,0,1,0,0,0,0,0,0,0)
5       (0,0,0,0,0,1,0,0,0,0)
7       (0,0,0,0,0,0,0,1,0,0)

Figure 3-12 shows a graphical representation of the process of one-hot encoding a label.
Figure 3-12

A graphical representation of the process of one-hot encoding a label. Two labels (2 and 5) are one-hot encoded into two vectors. The grayed element of each vector is the one that becomes 1, while the white ones remain 0

Sklearn has several ways of doing this automatically (check, for example, the OneHotEncoder() function). But I think it's instructive to do it manually to really see how it's done. Once you understand why you need it and in which format you need it, you can use whichever function you like best. The Python code to do this is very simple:
labels_train = np.zeros((60000, 10))
labels_train[np.arange(60000), trainY] = 1
labels_test = np.zeros((10000, 10))
labels_test[np.arange(10000), testY] = 1

First you create a new array with the right dimensions, (60000,10), filled with zeros, using the NumPy function np.zeros((60000,10)). Then you set to 1 only the column corresponding to the label itself, using NumPy fancy indexing in the line labels_train[np.arange(60000), trainY] = 1 (the same is of course also done for the test dataset). In the end, you obtain an array of dimensions (60000,10), where each row corresponds to a different observation.
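If you prefer not to do it manually, an equivalent shortcut is tf.keras.utils.to_categorical, which produces the same kind of one-hot matrices as the code above.
from tensorflow.keras.utils import to_categorical

labels_train = to_categorical(trainY, num_classes = 10)
labels_test = to_categorical(testY, num_classes = 10)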

Now we can compare yi and $$ \hat{y}_i $$, since both have dimensions (10,1) for one observation, or (10000,10) when considering the entire test dataset (and the same holds for the training dataset). Each column in $$ \hat{y}_i $$ represents the probability of the observation being of a specific class. At the very end, when calculating the accuracy of our model, we will assign to each observation the class with the highest probability.

Note

Our network will give us ten probabilities for each observation, one for each of the ten classes. At the end, we will assign to the observation the class that has the highest probability.

The Feed-Forward Network Model

We will start with a network with just one hidden layer. We will have an input layer with 784 features, then a hidden layer (where we will vary the number of neurons), then an output layer of ten neurons that feed their outputs into a neuron whose activation function is the softmax function. Figure 3-13 shows a graphical representation of the network. We will spend some time looking at the various parts, especially the output layer.
Figure 3-13

The network architecture with a single hidden layer. We will vary the number of neurons n1 in the hidden layer during our analysis

Let's look at why this output layer has ten neurons and why we need an additional neuron for the softmax function. Remember, we want to be able to tell which class each image belongs to. To do this, as explained when discussing the softmax function, we need ten outputs for each observation: each being the probability of the image belonging to one of the classes. So, given the input xi, we need ten values: P(yi = 1| xi), P(yi = 2| xi), …, P(yi = 10| xi) (the probability of the observation class yi being each of the ten possibilities, given the input xi). In other words, our output should be a vector of dimensions 1x10 in this form
$$ \hat{\mathbf{y}} = \left( P(y_i = 1 | x_i) \quad P(y_i = 2 | x_i) \quad \dots \quad P(y_i = 10 | x_i) \right) $$
And since the observation must be of one single class, this condition must be satisfied
$$ \sum_{j=1}^{10} P(y_i = j | x_i) = 1 $$
This can be understood as: the observation has a 100% probability of being one of the ten classes. Or, in other words, all the probabilities must add up to 1. We solve this problem in two steps:
  • We create an output layer with ten neurons, in this way we will have our ten values as output.

  • Then we feed the ten values into a new neuron (let's call it the "softmax" neuron) that takes the ten inputs and gives as output ten values, each less than one, that sum to 1.

Calling zj the output of the jth neuron in the last layer (with j going from 1 to 10), we will have
$$ P(y_i = j | x_i) = \frac{e^{z_j}}{\sum_{k=1}^{10} e^{z_k}} $$

In Keras, this is straightforward, but it is instructive to know exactly what each line of code does. This is what the Keras call model.add(Dense(10, activation = 'softmax')) does: it takes a vector as input and returns a vector with the same dimensions, "normalized" as discussed above. In other words, if we feed z = (z1, z2, …, z10) into the layer, it returns a vector with the same dimensions as z, meaning 1x10, whose elements sum to 1.

Keras Implementation

Now is time to build our model with Keras. The following code will do the job
def build_model(opt):
  # create model
  model = Sequential()
  # add first hidden layer and set input dimensions
  model.add(Dense(15, input_dim = 784, activation = 'relu'))
  # add output layer
  model.add(Dense(10, activation = 'softmax'))
  # compile model
  model.compile(loss = 'categorical_crossentropy',
                optimizer = opt,
                metrics = ['categorical_accuracy'])
  return model
We will not go through each line of code, since you should by now understand how a basic Keras model is built (remember our simple one-neuron model in Chapter 2). But a few details of the code need to be stressed:
  • Our last layer will use the softmax function: model.add(Dense(10, activation = 'softmax')).

  • The two parameters—15 (n1) and 10 (n2)—define the number of neurons in the different layers. Remember the second (output) layer must have ten neurons to be able to use the softmax function. But we will play with the value of n1. Increasing n1 will increase the complexity of the network.

  • We set the categorical cross-entropy (loss = 'categorical_crossentropy') as the loss function and the categorical accuracy (metrics = ['categorical_accuracy']) as the metric. The reason for this choice is that we have one-hot encoded the labels, and therefore the categorical versions of these functions are needed.

Now let's perform the training as we did in the last chapter for the single-neuron model. The code structure is always the same. Try to run the following code on your laptop:
model = build_model(tf.keras.optimizers.SGD(momentum = 0.0, learning_rate = 0.01))
EPOCHS = 1000
history = model.fit(
  data_train_norm, labels_train,
  epochs = EPOCHS, verbose = 0,
  batch_size = data_train_norm.shape[0]
)

We have set as the optimizer the standard (batch) version of gradient descent. The biggest problem is that the model, as we coded it, creates a huge matrix for all observations (there are 60,000 of them) and modifies the weights and biases only after a complete sweep over all observations. This requires a lot of resources, memory, and CPU. If that were the only choice we had, we would be doomed. Keep in mind that in the deep learning world, 60,000 examples of 784 features is not a big dataset at all. We need to find a way of letting this model learn faster.

Moreover, notice that, when training the model, we set batch_size = data_train_norm.shape[0] inside the Keras fit method. Keras by default sets the batch size to 32 observations [5], but batch gradient descent updates the weights and biases only after all training observations have been seen by the network. Therefore, we need to change this parameter to obtain the basic version of gradient descent.

In the same way, we need to set momentum = 0.0 inside the tf.keras.optimizers.SGD method. To summarize, since Keras does not include a separate function for plain batch gradient descent, we use the stochastic gradient descent optimizer, setting the momentum to zero and the batch size to the entire number of observations.
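To summarize this in code, the three variants differ only in the batch_size passed to the fit method (and in the optimizer settings). The sketch below uses the 100 epochs and the mini-batch size of 50 of the comparison that follows; the learning rate is just an example value, and each variant is trained on a freshly built model.
# batch gradient descent: one update per epoch
model_bgd = build_model(tf.keras.optimizers.SGD(momentum = 0.0, learning_rate = 0.01))
model_bgd.fit(data_train_norm, labels_train, epochs = 100, verbose = 0,
              batch_size = data_train_norm.shape[0])

# stochastic gradient descent: one update per observation
model_sgd = build_model(tf.keras.optimizers.SGD(momentum = 0.0, learning_rate = 0.01))
model_sgd.fit(data_train_norm, labels_train, epochs = 100, verbose = 0,
              batch_size = 1)

# mini-batch gradient descent: one update every 50 observations
model_mbgd = build_model(tf.keras.optimizers.SGD(momentum = 0.0, learning_rate = 0.01))
model_mbgd.fit(data_train_norm, labels_train, epochs = 100, verbose = 0,
               batch_size = 50)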

To do some basic error analysis, you also need the dev dataset, which we loaded and prepared in the previous paragraphs.

Do not get confused by the fact that the filename contains the word test. Sometimes the dev dataset is called the test dataset. When we discuss error analysis later in the book, we use three datasets: train, dev, and test. To remain consistent in the book, we use the name dev here as well, not to confuse you with different names in different chapters.

To calculate accuracy on the dev dataset, we use the model.evaluate() function and apply the built model on the dev dataset.
test_loss, test_accuracy = model.evaluate(data_test_norm, labels_test, verbose = 0)
print('The accuracy on the test set is equal to: ', int(test_accuracy*100), '%.')
which returns
The accuracy on the test set is equal to:  74 %.

To recap, we applied the model trained on the 60,000 training observations to the dev set (made up of 10,000 observations) and calculated the accuracy on both datasets.

A good exercise is to include this calculation in your model so that your build_model() function automatically returns the two values.

Gradient Descent Variations Performances

In Chapter 2, we looked at the different GD variations and discussed their advantages and disadvantages. Let's now see how they differ in a practical case.

Comparing the Variations

Let’s summarize the findings for the three variations of gradient descent for 100 epochs (see Table 3-3).
Table 3-3

Comparing the Performances of Three Variations of Gradient Descent

Gradient Descent Variation                           Running Time   Final Value of Cost Function   Accuracy
Batch gradient descent                               0.35 min       1.86                           43%
Stochastic gradient descent                          60.23 min      0.26                           91%
Mini-batch gradient descent (mini-batch size = 50)   1.70 min       0.26                           90%

Mini-batch gradient descent is clearly the best compromise between execution time and classification performance. You should now be convinced that it is currently the preferred variant to use as an optimizer in deep neural networks, since it can reach high performance while maintaining a good trade-off between accuracy and execution time.

Figure 3-14 shows how the cost function decreases with different mini-batch sizes. Note how, with respect to the number of epochs, a smaller mini-batch size means a faster decrease (faster in epochs, not in wall-clock time).
Figure 3-14

A comparison of speed of convergence of the mini-batch gradient descent algorithm with different mini-batch sizes. The learning rate used for this figure was γ = 0.0001. Note that the time needed by each case is not the same. The smaller the mini-batch size, the more time the algorithm needs

Tip

The best compromise between running time and convergence speed (with respect to the number of epochs) is achieved using mini-batch gradient descent. The optimal size of the mini-batches depends on your problem, but small numbers like 30 or 50 are a good place to start.

To get an idea of how the running time relates to the value the cost function can reach after 100 epochs, see Figure 3-15 (compared to Chapter 2, the times here are evaluated on a real dataset). Each point is labeled with the size of the mini-batch used in that run. You can see that decreasing the mini-batch size decreases the value of J after 100 epochs. This happens quickly, without increasing the running time significantly, until you arrive at a mini-batch size of around 20. At that point, the running time starts to increase quickly while the value of J after 100 epochs stops decreasing and flattens out.

The best compromise is to choose a mini-batch size where the curve is closest to the origin (a small running time and a small cost function value), which in this specific case is around 20. This is the reason small values like these are the most common choice. Beyond that point, the increase in running time becomes very steep and is not helpful. Note that for other datasets the optimal value may be very different, so it's worth trying different values to see which one works best. For very big datasets, you may want to try bigger values, such as 200, 300, or 500. In this case, we have 60,000 observations, and a mini-batch size of 50 means 1200 batches. If you had much more data, for example 1e6 observations, a mini-batch size of 50 would give 20,000 batches. So, keep that in mind and try different values to see which one works best.
Figure 3-15

The plot shows the value of the cost function after 100 epochs for the Zalando dataset vs. the running time needed to run through 100 epochs. Note that the points are single runs, and the plot is only indicative of the dependency. Running time and cost function have a small variance when evaluated over several runs. This variance is not shown in the plot

As a tip, it is good programming practice to write a function that runs your evaluations. This way, you can tune your hyper-parameters (like the mini-batch size) without copying and pasting the same chunk of code over and over. The following function trains our model with different mini-batch sizes (of course, you can add more parameters to be tuned, such as the number of epochs, the learning rate, and so on):
def mini_batch_gradient_descent(mb_size):
  # build model
  model_mbgd = build_model(tf.keras.optimizers.SGD(momentum = 0.9, learning_rate = 0.0001))
  # set number of epochs
  EPOCHS = 100
  # train model
  history_mbgd = model_mbgd.fit(
    data_train_norm, labels_train,
    epochs = EPOCHS, verbose = 0,
    batch_size = mb_size,
    callbacks = [tfdocs.modeling.EpochDots()])
  # save performances
  hist_mbgd = pd.DataFrame(history_mbgd.history)
  hist_mbgd['epoch'] = history_mbgd.epoch
  return hist_mbgd
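For example, you can then compare several mini-batch sizes with just a few lines (the sizes here are only examples):
hist_20 = mini_batch_gradient_descent(20)
hist_50 = mini_batch_gradient_descent(50)
hist_200 = mini_batch_gradient_descent(200)

plt.plot(hist_20['epoch'], hist_20['loss'], label = 'mini-batch size 20')
plt.plot(hist_50['epoch'], hist_50['loss'], label = 'mini-batch size 50')
plt.plot(hist_200['epoch'], hist_200['loss'], label = 'mini-batch size 200')
plt.legend()
plt.show()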
Tip

Writing a function with the hyper-parameters as inputs is common practice. This allows you to test different models with different values for the hyper-parameters and check which ones are better.

Examples of Wrong Predictions

Running the model with batch gradient descent—with one hidden layer with 15 neurons for 1000 epochs and at a learning rate of 0.0001—will give us an accuracy on the training set of 86%. You can increase the accuracy by using more neurons in your hidden layer. For example, using 50 neurons, 1000 epochs, and a learning rate of 0.0001 will allow you to reach 87% on the training set and 85% on the test set. It is interesting to check a few examples of incorrectly classified images, to see if you can learn something from these errors. Figure 3-16 shows an example of incorrectly classified images for each class.
Figure 3-16

One example of an incorrectly classified image for each class. Over each image, the true class (labeled "True") and the predicted class (labeled "Pred") are reported. This model has one hidden layer with 15 neurons and has been run for 1000 epochs with a learning rate of 0.0001

Some errors are understandable, such as where a coat was wrongly classified as a pullover. We could easily make the same mistake. The wrongly classified dress is, on the other hand, easy for a human to get right.

Weight Initialization

If you tried to run the code, you may have noticed that the convergence of the algorithm varies considerably and depends on how you initialize your weights. In the previous sections, we focused on understanding how such a network works, so as not to get distracted by additional details, but it is time to look at this problem a bit more closely, since it plays a fundamental role in networks with many layers.

Basically, we want to prevent the gradient descent algorithm from exploding and returning nan values. For example, in the first layer, for the ith neuron we need to calculate the ReLU activation function of the quantity (see the beginning of this chapter if you have forgotten why):
$$ z_i = \sum_{j=1}^{n_x} w_{ij}^{[1]} x_j + b_i^{[1]} $$

Normally, in a deep network, the number of weights is quite big, so you can easily imagine that if the $$ w_{ij}^{[1]} $$ are big, the quantity zi can become quite big, and the downstream computations can return nan values because the numbers become too big for Python to handle properly (remember that in a classification problem you have a log() function, so, for example, arguments equal to zero are not acceptable). So, you want the zi to be small enough to avoid an explosion of the neurons' outputs, and big enough to avoid having the outputs die out, which would make convergence a very slow process.

The problem has been researched extensively [6], and there are different initialization strategies depending on the activation function you are using. Let’s outline a few of them in Table 3-4, where we assume that the weights will be initialized with a normal distribution with a mean of 0 and a standard deviation as given in the table (note that the standard deviation will depend on the activation function you want to use).
Table 3-4

Some Examples of Weight Initialization for Deep Neural Networks

Activation Function   Standard Deviation σ for a Given Layer
Sigmoid               $$ \sigma = \sqrt{\frac{2}{n_{inputs} + n_{outputs}}} $$ (usually called Xavier initialization)
ReLU                  $$ \sigma = \sqrt{\frac{4}{n_{inputs} + n_{outputs}}} $$ (usually called He initialization)

In a layer l, the number of inputs is the number of neurons in the preceding layer l − 1, and the number of outputs is the number of neurons in the next layer l + 1. So, we will have
$$ n_{inputs} = n_{l-1} $$
and
$$ n_{outputs} = n_{l+1} $$
Very often, deep networks like the ones we discussed before have several layers, all with the same number of neurons. Therefore, for most of the layers nl − 1 = nl + 1, and the Xavier initialization becomes
$$ \sigma_{Xavier} = \sqrt{1/n_{l+1}} \quad \text{or} \quad \sqrt{1/n_{l-1}} $$
and for the ReLU activation function the He initialization becomes
$$ \sigma_{He} = \sqrt{2/n_{l+1}} \quad \text{or} \quad \sqrt{2/n_{l-1}} $$
Let’s consider the ReLU activation function (the one we used in this chapter). Every layer, as we have discussed, will have nl neurons. A way of initializing the weights for our single hidden layer for example would then be
initializer = tf.keras.initializers.HeNormal()
layer = tf.keras.layers.Dense(15, kernel_initializer = initializer)
In practice, to make it easier to construct and evaluate networks, the most commonly used forms are
$$ \sigma_{He} = \sqrt{2/n_{l-1}} $$
for the ReLU activation function and
$$ \sigma_{Xavier} = \sqrt{1/n_{l-1}} $$
for the sigmoid activation function.

Using these initializations can speed up training considerably; they are the standard way many libraries initialize weights (for example, the Caffe library).

In Keras, weight initialization is straightforward by means of the tf.keras.initializers module. Look at the Keras documentation to see which initialization strategies are available [7].
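For example, a minimal sketch of how the initializers can be plugged into our model is the following; the string shortcuts 'he_normal' and 'glorot_normal' are equivalent to the corresponding initializer objects. Note that Keras's built-in HeNormal uses σ = sqrt(2/n_inputs), a slightly different but widely used variant of the He formula given above, while GlorotNormal matches the Xavier formula in Table 3-4.
model = keras.Sequential([
    # He initialization for the ReLU hidden layer
    layers.Dense(15, input_dim = 784, activation = 'relu',
                 kernel_initializer = 'he_normal'),
    # Xavier (Glorot) initialization for the output layer
    layers.Dense(10, activation = 'softmax',
                 kernel_initializer = 'glorot_normal')
])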

Adding Many Layers Efficiently

Typing all this code each time is tedious and error prone. You can instead define a function that builds the whole network for you. That can be done easily with this code
def model_nlayers(num_neurons, num_layers):
    # build model
    inputs = keras.Input(shape = 784) # input layer
    # first hidden layer
    dense = layers.Dense(num_neurons,
                         activation = 'relu')(inputs)
    # customized number of layers and neurons per layer
    for i in range(num_layers - 1):
        dense = layers.Dense(num_neurons,
                             activation = 'relu')(dense)
    # output layer
    outputs = layers.Dense(10, activation = 'softmax')(dense)
    model = keras.Model(inputs = inputs,
                        outputs = outputs,
                        name = 'model')
    # set optimizer and loss
    opt = tf.keras.optimizers.SGD(momentum = 0.9,
                                  learning_rate = 0.0001)
    model.compile(loss = 'categorical_crossentropy',
                  optimizer = opt,
                  metrics = ['categorical_accuracy'])
    # train model
    history = model.fit(
      data_train_norm, labels_train,
      epochs = 200, verbose = 0,
      batch_size = 20,
      callbacks = [tfdocs.modeling.EpochDots()])
    # save performances
    hist = pd.DataFrame(history.history)
    hist['epoch'] = history.epoch
    return hist
Let’s go through the code:
  • First we define the dimensions of the input layer.

  • Then we define the first hidden layer (the number of neurons is given as the function’s input).

  • Then we add the other hidden layers one at a time. The number of layers is given as the function’s input.

  • We then add the output layer and we stack all the layers together inside the model.

  • We then compile and train the model, returning its performance.

Notice that in the previous code we used the Keras functional API (see Appendix A if you are not sure how it works), which provides a more flexible way to create models compared to the tf.keras.Sequential API. With this functionality, we easily created a model with a customizable number of layers and neurons per layer.

To create the networks, we can simply apply the function with a different number of neurons and layers as inputs:
res_10_1 = model_nlayers(10, 1)
res_10_2 = model_nlayers(10, 2)
res_10_3 = model_nlayers(10, 3)
res_10_4 = model_nlayers(10, 4)
res_100_4 = model_nlayers(100, 4)

The code is now much easier to understand, and you can use it to create networks as big as you want.

With the function defined, it’s easy to run several models and compare them, as we have done in Figure 3-17, where we tested five different models:
  • One layer with ten neurons

  • Two layers with ten neurons each

  • Three layers with ten neurons each

  • Four layers with ten neurons each

  • Four layers with 100 neurons each

Figure 3-17

The cost function vs. epochs for five models

In case you are wondering, the model with four layers of 100 neurons each, which seems much better than the others, is starting to enter the overfitting regime, with an accuracy of 91% on the train set and 87% on the dev set (after only 200 epochs).

Advantages of Additional Hidden Layers

It is instructive to play with the models. Try varying the number of layers, the number of neurons, how you initialize the weights, and so on. If you invest some time, you can reach an accuracy of over 90% in a few minutes of running time, but that requires some work. If you try several models, you may realize that in this case using several layers does not seem to bring benefits over a network with just one. This is often the case.

Theoretically speaking, a one-layer network can approximate almost any function you can imagine, but the number of neurons needed may be very large, making the model much less useful. The catch is that the ability to approximate a function in principle does not mean that the network can learn it in practice, due, for example, to the sheer number of neurons involved or the time needed.

Empirically it has been shown that networks with more layers require a much smaller number of neurons to reach the same results and usually generalize better to unknown data.

Note

Theoretically speaking, you do not need multiple layers in your networks, but in practice you often should use them. It is almost always a good idea to try a network with several layers and a few neurons in each, instead of a network with one layer populated by a huge number of neurons. There is no fixed rule on how many neurons or layers are best. You should start with low numbers of layers and neurons and then increase them until your results stop improving.

In addition, having more layers may allow your network to learn different aspects of your inputs. For example, one layer may learn to recognize the vertical edges of an image, and another the horizontal ones. Remember that in this chapter we discussed networks where each layer is identical (up to the number of neurons) to all the others. In Chapter 7, you will learn how to build networks whose layers perform very different tasks and are structured very differently from each other, which makes them much more powerful for certain tasks than the models discussed in this chapter.

As a simple example, imagine predicting the selling prices of houses. In this case a network with several layers may learn more information on how the features relate to the price. For example, the first layer may learn basic relationships, such as bigger houses mean higher prices. But the second layer may learn more complex features, such as a big house with a small number of bathrooms means a lower selling price.

Comparing Different Networks

Now you should know how to build neural networks with a huge number of layers or neurons. But it is relatively easy to lose yourself in a forest of possible models without knowing which ones are worth trying. Suppose you start with a network (as we have done in the previous sections) with one hidden layer of five neurons and an output layer of ten neurons with the softmax activation function. Now suppose you have reached some accuracy and you would like to try different models. At first, you should try increasing the number of neurons in your hidden layer to see what you can achieve. Figure 3-18 shows how the cost function decreases for different numbers of neurons. The calculations have been performed with mini-batch gradient descent with a batch size of 50, one hidden layer with 1, 5, 15, and 30 neurons, and a learning rate of 0.0001. You can see how moving from one neuron to five immediately makes the convergence faster, but further increasing the number of neurons does not bring as much improvement. For example, increasing from 15 to 30 neurons brings no improvement at all.
Figure 3-18

The cost function vs. epochs for a neural network with one hidden layer and 1, 5, 15, and 30 neurons. The calculations have been performed with mini-batch gradient descent with a batch size of 50 and a learning rate of 0.0001
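
If you want to reproduce this kind of comparison yourself, a minimal sketch along the following lines should work. It is not the exact code used to produce Figure 3-18; the activation function and the preprocessing are assumptions, and the helper name one_layer_model is just a suggestion:
from tensorflow import keras

# Load and prepare the Zalando (Fashion-MNIST) data; the preprocessing may
# differ slightly from the one used earlier in the chapter.
(x_train, y_train), _ = keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
y_train = keras.utils.to_categorical(y_train, 10)

def one_layer_model(num_neurons, epochs=100):
    # One hidden layer with num_neurons neurons and a 10-neuron softmax output.
    model = keras.Sequential([
        keras.layers.Dense(num_neurons, input_dim=784, activation='relu'),
        keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer=keras.optimizers.SGD(learning_rate=0.0001),
                  metrics=['accuracy'])
    return model.fit(x_train, y_train, epochs=epochs, batch_size=50, verbose=0)

histories = {n: one_layer_model(n) for n in (1, 5, 15, 30)}
# histories[n].history['loss'] can then be plotted vs. epochs as in Figure 3-18.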

Let’s first try to find a way of comparing those networks. Comparing only the number of neurons can be very misleading, as you will see shortly. Remember that your algorithm is trying to find the best combination of weights and biases to minimize your cost function. So how many learnable parameters does our model have? We have the weights and the biases. You will remember from the theoretical discussion that we can associate a certain number of weights to each layer: the number of learnable parameters in layer l, which we indicate with Q[l], is given by the total number of elements in the matrix W[l], that is nl ⋅ nl − 1 (where n0 = nx by definition), plus the number of biases (each layer l has nl biases). The number Q[l] can then be written as follows
$$ Q^{[l]} = n_l n_{l-1} + n_l = n_l\left(n_{l-1}+1\right) $$
So that the total number of learnable parameters in our network (indicated here with Q) can be written as follows
$$ Q = \sum\limits_{l=1}^{L} n_l\left(n_{l-1}+1\right) $$
Where by definition n0 = nx. Note that the Q parameter of our network is strongly architecture dependent. Let’s calculate it with some practical examples (see Table 3-5).
Table 3-5
Examples of Different Network Architectures and Their Corresponding Q Parameters

Network Architecture                                                | Parameter Q (Number of Learnable Parameters)              | Number of Neurons
Network A: 784 features, 2 layers: n1 = 15, n2 = 10                 | QA = 15 ∗ (784 + 1) + 10 ∗ (15 + 1) = 11935                | 25
Network B: 784 features, 16 layers: n1 = n2 = … = n15 = 1, n16 = 10 | QB = 1 ∗ (784 + 1) + 1 ∗ (1 + 1) + … + 10 ∗ (1 + 1) = 833  | 25
Network C: 784 features, 3 layers: n1 = 10, n2 = 10, n3 = 10        | QC = 10 ∗ (784 + 1) + 10 ∗ (10 + 1) + 10 ∗ (10 + 1) = 8070 | 30

Turn your attention to networks A and B. Both have 25 neurons, but QA is more than ten times bigger than QB. You can imagine how network A will be much more flexible in learning than network B, even though the number of neurons is the same.
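
If you want to verify these numbers, or compute Q for your own architectures, a small helper along these lines may be handy (the function name is just a suggestion):
def count_learnable_parameters(n_x, layers):
    """Return Q, the total number of learnable parameters.

    n_x: number of input features (n_0).
    layers: list with the number of neurons in each layer, output layer included.
    """
    q = 0
    previous = n_x
    for n_l in layers:
        q += n_l * (previous + 1)  # n_l * n_{l-1} weights plus n_l biases
        previous = n_l
    return q

# The three architectures of Table 3-5
print(count_learnable_parameters(784, [15, 10]))         # 11935 (network A)
print(count_learnable_parameters(784, [1] * 15 + [10]))  # 833 (network B)
print(count_learnable_parameters(784, [10, 10, 10]))     # 8070 (network C)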

Note

In practice, Q is not a measure of how complex or how good a network is. It may well happen that only a few of all the neurons play a role, so calculating Q in this way does not tell the entire story. There is a vast amount of research on the so-called effective degrees of freedom of deep neural networks, but that would go well beyond the scope of this book. Still, this parameter gives a good rule of thumb for deciding whether the set of models you want to test is in a reasonable complexity progression.

Nonetheless, checking Q for the models you want to test may give you some hints about what you can neglect and what is worth trying. For example, let’s consider the cases tested in Figure 3-18 and calculate the Q parameter for each network (see Table 3-6).
Table 3-6
Network Architectures Tested in Figure 3-18 with Their Corresponding Q Parameters

Network Architecture                                           | Parameter Q (Number of Learnable Parameters) | Number of Neurons
784 features, 1 layer with 1 neuron, 1 layer with 10 neurons   | Q = 1 ∗ (784 + 1) + 10 ∗ (1 + 1) = 805       | 11
784 features, 1 layer with 5 neurons, 1 layer with 10 neurons  | Q = 5 ∗ (784 + 1) + 10 ∗ (5 + 1) = 3985      | 15
784 features, 1 layer with 15 neurons, 1 layer with 10 neurons | Q = 15 ∗ (784 + 1) + 10 ∗ (15 + 1) = 11935   | 25
784 features, 1 layer with 30 neurons, 1 layer with 10 neurons | Q = 30 ∗ (784 + 1) + 10 ∗ (30 + 1) = 23860   | 40

From Figure 3-18, let’s suppose we choose the model with 15 neurons as the candidate for the best model. Now suppose we want to try a model with three layers, all with the same number of neurons, that should compete with (and possibly beat) our current candidate model with one layer and 15 neurons. What should we choose as a starting point for the number of neurons in the three layers? Let’s call model A the one with one layer and 15 neurons, and model B the one with three layers, each with an unknown number of neurons indicated with nB. We can easily calculate the Q parameter for both networks
$$ Q_A = 15 \ast \left(784+1\right) + 10 \ast \left(15+1\right) = 11935 $$
And
$$ Q_B = n_B \ast \left(784+1\right) + n_B \ast \left(n_B+1\right) + n_B \ast \left(n_B+1\right) + 10 \ast \left(n_B+1\right) = 2\,n_B^2 + 797\,n_B + 10 $$
What value for nB will give QB ≈ QA? We can solve the equation
$$ 2\,n_B^2 + 797\,n_B + 10 = 11935 $$

You should be able to solve a quadratic equation, so we will only look at the solution here. The positive solution is nB ≈ 14.4, but since we cannot have 14.4 neurons, we will use the closest integer, nB = 14. For nB = 14, we get QB = 11560, a value very close to 11935.
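
If you prefer to let Python do the algebra, the check takes only a few lines with the quadratic formula:
import math

# Positive root of 2*n_B^2 + 797*n_B + 10 = 11935
a, b, c = 2.0, 797.0, 10.0 - 11935.0
n_B = (-b + math.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
print(n_B)  # approximately 14.4

# With the closest integer number of neurons
n = 14
print(2 * n ** 2 + 797 * n + 10)  # 11560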

Note

The fact that two networks have the same number of learnable parameters does not mean that they can reach the same accuracy, and it does not even mean that if one learns very quickly, the other will learn at all!

The model with three layers, each with 14 neurons, could be a good starting point for further testing.

Let’s discuss another point that is important when dealing with a complex dataset. Consider the first layer. Suppose we take the Zalando dataset and create a network with two layers: the first with one neuron and the second with many. All the complex features of your dataset may well be lost in that single first neuron, since it combines all the features into a single value and passes that same value to every neuron of the second layer.

Tips for Choosing the Right Network

We have discussed a lot of cases and you have seen a lot of formulas, but how can you decide how to design your network?

Unfortunately, there is no fixed set of rules. But you may consider the following tips:
  • When considering a set of models (or network architectures) you want to test, a good rule of thumb is to start with a less complex one and move to more complex ones. A good way to estimate their relative complexity (and to make sure you are moving in the right direction) is the Q parameter.

  • You may not be able to reach good accuracy if any of your layers has a particularly low number of neurons. Such a layer may kill the effective capacity of your network to learn from a complex dataset. Consider for example the case with one neuron in Figure 3-18: the model cannot reach low values of the cost function because the network is too simple to learn from a dataset as complex as the Zalando one.

  • Remember that a low or high number of neurons is always relative to the number of features you have. If you have only two features in your dataset, one neuron may well be enough, but if you have a few hundred (like in the Zalando dataset, where nx = 784), you should not expect one neuron to be enough.

  • Which architecture you need is also dependent on what you want to do. It’s always worth checking the online literature to see what others have already discovered about specific problems. For example, it’s well known that for image recognition, convolutional networks are very good, so they would be a good choice.

Final Tip

When moving from a model with L layers to one with L + 1 layers, it’s a good idea to start the new model with a slightly smaller number of neurons in each layer and then increase it step by step. Remember that more layers have a chance of learning complex features more effectively, so if you are lucky, fewer neurons may be enough. It is worth trying. Always keep track of your optimizing metric for all your models. When you are no longer getting much improvement, it may be worth trying completely different architectures (maybe convolutional neural networks, and so on).

Estimating the Memory Requirements of Models

For the calculation, let’s consider a feed-forward neural network with NL hidden layers, each having n neurons, and let’s look at a concrete case to make things clearer. Suppose we are working with the MNIST dataset. In this case, the input data is a vector of the 784 gray values of each image (each image is 28 × 28 pixels in grayscale). The output layer of a classification network will be composed of ten neurons (the ten classes into which the model tries to classify the images). The total number of parameters NW (weights and biases) can be easily calculated and written as follows
$$ N_W = \underbrace{784\,n + N_L n^2 + 10\,n}_{\text{Weights}} + \underbrace{N_L\,n}_{\text{Biases}} \underset{\text{large } n}{\sim} N_L n^2 $$
Where the second part of the equation means that, for large n, NW will asymptotically grow quadratically in n, the number of neurons in each layer. In general, three components need to be taken into account.
  • Parameters: You need to keep in memory the parameters, their gradients during backpropagation, and additional information if the optimizer uses momentum, Adagrad, Adam, or RMSProp. A good rule of thumb to account for all these factors6 is to multiply the memory taken by the weights alone by roughly 3. With the notation we have used so far, the memory used by the parameters (indicated with MW), in GB, would be
    $$ M_W = \underbrace{3}_{\text{Correction factor}} \times \underbrace{64/8}_{\text{Conversion to bytes for floating point 64}} \times \underbrace{\frac{N_W}{1024^3}}_{\text{Conversion to GB}} $$
  • Activations: Each neuron’s output must be stored, normally together with its gradient for backpropagation. Conservatively, only one mini-batch needs to be kept in memory. Calling SMB the mini-batch size, the memory needed for the activations MA, in GB, can be written as
    $$ M_A = 2\,S_{MB}\left(2n+10\right)\frac{8}{1024^3} $$
  • Miscellaneous: This part includes the data that must be loaded into memory, and so on. For the purposes of a rough estimate, the memory taken here, MM, will be estimated from the dataset size alone. In the case of MNIST, it is given (in GB) by the following equation. Each pixel value, although originally an INT8, must be converted to the floating-point 64 data type to perform the training
    $$ M_M = 60000 \times 784 \times \frac{8}{1024^3} $$
For example, to find out whether a model will run on a limited-memory device with MD GB free, it is enough to check whether the following equation has a solution in n or NL
$$ M_W + M_A + M_M = M_D $$

Note that this is just a rough indication and will not be precise, since the amount of memory taken by a model may depend on software versions, the operating system, and many other factors. Solving the last equation for n, for example, gives a good estimate of the biggest FFNN that could be run on a given device when applied to MNIST. Consider the case of a Raspberry Pi 4 with 2 GB of memory. Typically, such a system has roughly MD = 1.7 GB free at any moment. For a network with NL = 2 and SMB = 128, the last equation gives a solution of n ≈ 8200. Indeed, trying to train a network with more than 8200 neurons per layer on such a device will give a memory error, since there is not enough RAM to keep everything available. (If you test it, your results may vary, depending, as mentioned, on which version of the Raspberry Pi you have, which version of TensorFlow you are using, and so on.)

For practical purposes, to get a rougher estimate you can neglect the terms linear in n in the last equation and still obtain a usable guideline. In the example discussed previously, neglecting the linear terms gives an estimate of n ≈ 8700: higher than the actual value, but still a useful rough indication of the maximum number of usable neurons. Finally, remember that only a practical test will guarantee that a specific model can run on a low-memory device.
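
To turn these formulas into something you can play with, here is a rough sketch that follows the approximations above. The function name is just a suggestion, and the activation term is generalized slightly to NL hidden layers; treat the output as an order-of-magnitude estimate only:
def estimate_memory_gb(n, n_hidden_layers, batch_size,
                       n_features=784, n_outputs=10, dataset_size=60000):
    """Rough memory footprint estimate in GB, following the formulas in the text."""
    bytes_per_value = 8  # floating-point 64
    gb = 1024 ** 3
    # Parameters (weights and biases), times 3 for gradients and optimizer state
    n_w = (n_features * n + n_hidden_layers * n ** 2 + n_outputs * n
           + n_hidden_layers * n)
    m_w = 3 * bytes_per_value * n_w / gb
    # Activations and their gradients for one mini-batch
    m_a = 2 * batch_size * (n_hidden_layers * n + n_outputs) * bytes_per_value / gb
    # The dataset itself
    m_m = dataset_size * n_features * bytes_per_value / gb
    return m_w + m_a + m_m

# Example: two hidden layers of 100 neurons each, mini-batch size 128
print(estimate_memory_gb(n=100, n_hidden_layers=2, batch_size=128))  # roughly 0.35 GB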

General Formula for the Memory Footprint

In general, when working with a dataset whose inputs have nx features, the formula for the number of parameters can be written as follows
$$ N_W = \underbrace{n_x n + N_L n^2 + N_O n}_{\text{Weights}} + \underbrace{N_L\,n}_{\text{Biases}} \underset{\text{large } n}{\sim} N_L n^2 $$
Where we indicate the number of neurons in the output layer as NO. If you refer to the previous section, you can see that the formulas for MW and MA are unchanged, while MM needs to be written as
$$ M_M = m\,n_x\frac{8}{1024^3} $$

This accounts for the fact that the dataset size is not 60000 but m and that the input dimension is not 784 but nx.

Exercises

Exercise 1 (Level: Easy)
Try to build a multiclass classification model like the one you saw in this chapter, but with a different dataset, the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). To download the dataset from TensorFlow, use the following code:
# Load the MNIST dataset: 60,000 training and 10,000 test images of handwritten digits
from tensorflow import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
Exercise 2 (Level: Medium)

Apply He weight initialization to the multiclass classification problem you saw in this chapter to see if you can speed up the learning phase.

Exercise 3 (Level: Difficult)

Try to optimize the feed-forward neural network built in this chapter to reach the best possible accuracy (without overfitting the training dataset!). Tune the number of epochs, the learning rate, the optimizer, the number of neurons, the number of layers, and the mini-batch size. Hint: Write a function like the one we used to test different numbers of layers and neurons and give it all the tunable parameters as inputs.

Exercise 4 (Level: Difficult)

Consider the regression problem we solved with a model made of a single neuron (predicting radon activity in U.S. houses). Try to build a feed-forward neural network to solve the same regression task and see if you can get better prediction performance. Hint: You will need to change the loss function and the metrics used to evaluate your results, and one-hot encoding will not be necessary anymore. As a starting point, you can find the entire code in the online version of the book at https://adl.toelt.ai.
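
As an additional starting hint, a minimal Keras skeleton for such a regression network could look like the following; the layer sizes are arbitrary, and train_x and train_y are assumed to be the already prepared feature matrix and target vector:
from tensorflow import keras

num_features = train_x.shape[1]  # train_x is assumed to be already prepared
model = keras.Sequential([
    keras.layers.Dense(20, activation='relu', input_shape=(num_features,)),
    keras.layers.Dense(20, activation='relu'),
    keras.layers.Dense(1)  # single linear output neuron for regression
])
# For regression, use a loss such as the mean squared error instead of
# categorical cross-entropy, and no one-hot encoding of the targets.
model.compile(optimizer='adam', loss='mse', metrics=['mse'])
model.fit(train_x, train_y, epochs=100, batch_size=32, verbose=0)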
