In Chapter 2, we did some amazing things with one neuron, but that is hardly flexible enough to tackle more complex cases. The real power of neural networks comes to light when several (thousands, or even millions of) neurons interact with each other to solve a specific problem. The network architecture (how neurons are connected to each other, how they behave, and so on) plays a crucial role in how efficiently a network learns, how good its predictions are, and what kind of problems it can solve.
There are many kinds of architectures that have been extensively studied and are very complex, but from a learning perspective, it is important to start from the simplest kind of neural network with multiple neurons. It makes sense to begin with so-called feedforward neural networks, in which data enters at the input layer and passes through the network, layer by layer, until it arrives at the output layer (this forward flow of data gives the networks their name). In this chapter, we will consider networks in which each neuron in a layer gets its input from all neurons in the preceding layer and feeds its output into each neuron of the next layer.
As is easy to imagine, with more complexity come more challenges. It is more difficult to achieve fast learning and good accuracy; the number of available hyperparameters grows, due to the increased network complexity; and a simple gradient descent algorithm will no longer cut it when dealing with big datasets. When developing models with many neurons, we will need to have at our disposal an expanded set of tools that will allow us to deal with all the challenges that these networks present. In this chapter, we will start to look at some more advanced methods and algorithms that will allow us to work efficiently with big datasets and big networks. These complex networks become good enough to do some interesting multiclass classification, one of the most frequent tasks that big networks are required to perform (for example, handwriting recognition, face recognition, image recognition, and so on), so I have chosen a dataset that will allow us to try it out and study its difficulties.
I will start the chapter by discussing the network architecture and the necessary matrix formalism. A short overview of the new hyperparameters that come with this type of network is then given. Next, I explain how to implement multiclass classification using the softmax function, and what kind of output layer is needed. Then, before starting with Python code, a brief digression explains in detail what exactly overfitting is, with a simple example, and how to conduct a basic error analysis with complex networks. We will then start to use TensorFlow to construct bigger networks, applying them to an MNIST-like dataset based on images of clothing items (which will be lots of fun). We will look at how to make the gradient descent algorithm covered in Chapter 2 faster, introducing two new variations: stochastic and mini-batch gradient descent. Then we will look at how to add many layers in an efficient way and how to initialize the weights and the biases in the best way possible, to make training fast and stable. In particular, we will look at Xavier and He initialization, for the sigmoid and ReLU activation functions, respectively. Finally, a rule of thumb on how to compare the complexity of networks beyond only the number of neurons is offered, and the chapter concludes with some tips on how to choose the right network.
Network Architecture
L: Number of layers in the network, excluding the input layer but including the output layer
nl: Number of neurons in layer l
This means that our matrix W[1] has dimensions n1 × nx. Of course, this can be generalized between any two layers l and l − 1, meaning that the weight matrix between two adjacent layers l and l − 1, indicated by W[l], will have dimensions nl × nl − 1. By convention, n0 = nx is the number of input features (not the number of observations that we indicate with m).
Note
The weight matrix between two adjacent layers l and l − 1, which we indicate with W[l ], will have dimensions nl × nl − 1, where, by convention, n0 = nx is the number of input features.
The bias (indicated by b in Chapter 2) will be a matrix this time. Remember that each neuron that receives inputs will have its own bias, so when considering our two layers, l and l − 1, we will require nl different values of b. We will indicate this matrix with b[l], and it will have dimensions nl × 1.
Note
The bias matrix for two adjacent layers l and l − 1, which we indicate with b[l], will have dimensions nl × 1.
Output of Neurons
Z[1] = W[1]X + b[1]

where Z[1] will have dimensions n1 × m, and where with X, we have indicated our matrix with all our observations (rows for the features, and columns for observations), as I have already discussed in Chapter 2. We assume here that all neurons in layer l will use the same activation function, which we will indicate with g[l].
Y[l] = g[l](Z[l])

where the activation function acts, as usual, element by element.
Summary of Matrix Dimensions
W[l] has dimensions nl × nl − 1 (where we have n0 = nx by definition)
b[l] has dimensions nl × 1
Z[l − 1] has dimensions nl − 1 × 1
Z[l] has dimensions nl × 1
Y[l] has dimensions nl × 1
In each case, l goes from 1 to L. (These are the dimensions for a single observation; when all m observations are fed at once, the second dimension of Z[l] and Y[l] becomes m.)
Example: Equations for a Network with Three Layers
Z[1] = W[1]X + b[1], where W[1] has dimensions 3 × nx, b[1] has dimensions 3 × 1, and X has dimensions nx × m
Z[2] = W[2]g[1](Z[1]) + b[2], where W[2] has dimensions 2 × 3, b[2] has dimensions 2 × 1, and Z[1] has dimensions 3 × m
Z[3] = W[3]g[2](Z[2]) + b[3], where W[3] has dimensions 1 × 2, b[3] has dimensions 1 × 1, and Z[2] has dimensions 2 × m
and your network output, Ŷ = g[3](Z[3]), will have, as expected, dimensions 1 × m.
All this may seem rather abstract (and, in fact, it is). You will see later in the chapter how easy it is to implement in TensorFlow, simply by building the right computational graph, based on the steps just discussed.
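To make the matrix formalism concrete, here is a minimal NumPy sketch of the three-layer example above, with made-up dimensions (nx = 4 features, m = 5 observations) and sigmoid activations assumed for all layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

nx, m = 4, 5
X = np.random.rand(nx, m)  # rows are features, columns are observations

# Dimensions follow the rule: W[l] is (nl, nl-1), b[l] is (nl, 1)
W1, b1 = np.random.randn(3, nx), np.zeros((3, 1))
W2, b2 = np.random.randn(2, 3), np.zeros((2, 1))
W3, b3 = np.random.randn(1, 2), np.zeros((1, 1))

Z1 = W1 @ X + b1            # (3, m); b1 broadcasts over the m columns
Z2 = W2 @ sigmoid(Z1) + b2  # (2, m)
Z3 = W3 @ sigmoid(Z2) + b3  # (1, m)
Y_hat = sigmoid(Z3)         # network output, dimensions (1, m)
print(Y_hat.shape)
```

If the matrix dimensions are chosen following the rules above, the shapes compose automatically, layer after layer.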
Hyperparameters in Fully Connected Networks
Number of layers: L
Number of neurons in each layer: ni for i from 1 to L
Choice of activation function for each layer: g[l]
Number of iterations (or epochs)
Learning rate
softmax Function for Multiclass Classification
You will still have to suffer a bit more theory before getting to some TensorFlow code. The kinds of networks described in this chapter start to be complex enough to be able to perform some multiclass classification with reasonable results. To do this, we must first introduce the softmax function.
For a vector z = (z1, z2, …, zk), the softmax function is defined as S(z)i = e^zi / (e^z1 + e^z2 + ⋯ + e^zk), for i = 1, …, k. So, S(z)i behaves like a probability, because its sum over i is 1, and its elements are all less than 1. We will consider S(z)i as a probability distribution over k possible outcomes. For us, S(z)i will simply be the probability of our input observation being of class i. Let’s suppose we are trying to classify an observation into three classes. We may get the following output: S(z)1 = 0.1, S(z)2 = 0.6, and S(z)3 = 0.3. That means that our observation has a 10% probability of being of class 1, a 60% probability of being of class 2, and a 30% probability of being of class 3. Normally, one chooses to classify the input observation into the class with the highest probability, in this example, class 2, with 60% probability.
Note
We will look at S(z )i as a probability distribution over k with i = 1, …, k possible outcomes. For us, S(z )i will simply be the probability of our input observation being of class i.
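The softmax formula is easy to sketch in plain NumPy (the max subtraction is a standard numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(z):
    # S(z)_i = exp(z_i) / sum_j exp(z_j); subtracting the max avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 3.0, 0.5])
s = softmax(z)
print(s)                  # three values, each between 0 and 1
print(round(s.sum(), 6))  # they sum to 1
```

The largest input (z2 = 3.0 here) always gets the largest probability, which is why we can use the outputs directly for classification.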
To be able to use the softmax function for classification, we will have to use a specific output layer: one with ten neurons, each of which will give zi as its output, followed by a softmax “neuron” that takes the ten values zi as inputs and outputs S(z). This neuron will have the softmax function as its activation function. In TensorFlow, you apply the tf.nn.softmax function to the last layer with 10 neurons. Remember that this TensorFlow function normalizes the ten values along an axis of the tensor, so that they sum to 1. Later in the chapter, you will find a concrete example showing how to implement this from start to finish.
A Brief Digression: Overfitting
One of the most common problems that you will encounter when training deep neural networks will be overfitting. What can happen is that your network may, owing to its flexibility, learn patterns that are due to noise, errors, or simply wrong data. It is very important to understand what overfitting is, so I will give you a practical example of what can happen, to give you an intuitive understanding of it. To make it easier to visualize, I will work with a simple two-dimensional dataset, which I will create for the purpose. I hope that at the end of the next section, you will have a clear idea of what overfitting is.
A Practical Example of Overfitting
We will fit the data with a polynomial of degree K, f(x) = a0 + a1x + ⋯ + aKx^K, choosing the parameters aj that minimize the mean squared error (1/m) Σi (y(i) − f(x(i)))², where, as usual, m indicates the number of data points we have. We want to determine not only all the parameters aj, but also the value of K that best approximates our data. K, in this case, measures our model complexity. For example, for K = 0, we simply have f(x(i)) = a0 (a constant), the simplest polynomial we can think of. For higher K, we have higher-order polynomials, meaning that our function is more complex, having more parameters available for training.
To add some random noise to the function, we use np.random.normal(0, 1, size=len(x)), which generates a NumPy array of length len(x) of random values from a normal distribution with average 0 and standard deviation 1.
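The whole experiment can be reproduced in a few lines. This sketch assumes a quadratic as the true function (the exact coefficients and degrees are illustrative); watch how the training error keeps shrinking as K grows, even though the high-degree fits are just chasing the noise:

```python
import numpy as np

np.random.seed(42)
x = np.linspace(-3, 3, 30)
# True model: a quadratic, plus the random noise described above
y = 1.0 + 2.0 * x + 3.0 * x**2 + np.random.normal(0, 1, size=len(x))

errors = {}
for K in (0, 1, 2, 9):
    coeffs = np.polyfit(x, y, K)       # least-squares fit of degree K
    y_fit = np.polyval(coeffs, x)
    errors[K] = np.mean((y - y_fit) ** 2)
    print(K, round(errors[K], 3))      # training MSE never increases with K
```

A lower training error for K = 9 than for K = 2 does not mean a better model: the degree-9 fit has captured part of the noise, which is exactly the overfitting behavior discussed above.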
Now, this model shows features that we know are wrong, because we created the data ourselves. These features are not present in the underlying function, but the model is so flexible that it captures the random variability we introduced with the noise. Here, I am referring to the oscillations that appear when using this high-order polynomial.
You can see that this model is much more stable. Our linear model does not capture any feature that depends on the noise, but it misses the main feature of our data (its concave nature). Here we are talking of high bias.
In the case of neural networks, we have many hyperparameters (number of layers, number of neurons in each layer, activation function, and so on), and it is very difficult to know in which regime we are. How can we tell if our model has a high variance or a high bias, for example? I will dedicate an entire chapter to this subject, but the first step in performing this error analysis is to split our dataset into two different ones. Let’s see what this means and why we do it.
Note
The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented the underlying model structure (see Burnham, K. P.; Anderson, D. R., Model Selection and Multimodel Inference, 2nd ed., New York: Springer-Verlag, 2002). The opposite, when the model cannot capture the structure of the data, is called underfitting.
The problem with overfitting and deep neural networks is that there is no easy way of visualizing the results; therefore, we require a different approach to determine whether our model is overfitting, underfitting, or just right. This can be achieved by splitting our dataset into different parts and evaluating and comparing the metrics on all of them. Let’s explore the basic idea in the next section.
Basic Error Analysis
Training dataset: The model is trained on this dataset, using the inputs and the corresponding labels, by an optimization algorithm such as gradient descent, as we did in Chapter 2. Often, this set is called the “train set.”
Development (or validation) set: The trained model will then be used on this dataset, to check how it is doing. On this dataset, we will test different hyperparameters. For example, we can train two different models with a different number of layers on the training dataset and test them on this dataset, to check how they are doing. Often, this set is termed the “dev set.”
Four different cases to show how to recognize overfitting from the training and the dev set error
Error | Case A | Case B | Case C | Case D |
---|---|---|---|---|
Training set error | 1% | 15% | 14% | 0.3% |
Dev set error | 11% | 16% | 32% | 1.1% |
Case A: Here, we are overfitting (high variance), because we are doing very well on the training set, but our model generalizes very badly to our dev set (refer again to Figure 3-8).
Case B: Here, we see a problem with high bias, meaning that our model is not doing very well generally, on both datasets (refer again to Figure 3-9).
Case C: Here, we have a high bias (the model cannot predict very well the training set) and high variance (the model does not generalize well on the dev set).
Case D: Here, everything seems OK. The error is good on the train set and good on the dev set. That is a good candidate for our best model.
I will explain all these concepts more thoroughly later in the book, where I will provide recipes for how to solve problems of high bias, high variances, both, or even more complex cases.
To recap: To perform a very basic error analysis, you will have to split your dataset into at least two sets: train and dev. You should then calculate your metric on both sets and compare them. You want to have a model that has low error on the train set and on the dev set (as in Case D, in the preceding example), and the two values should be comparable.
Note
Your main takeaways from this section should be (1) a set of recipes and guidelines is required for understanding how your model is doing (is it overfitting, underfitting, or is it just right?); (2) to answer the preceding questions, you must split your dataset in two, to perform the relevant analysis. Later in the book, you will see what you can do with a dataset split into three, or even four, parts.
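A train/dev split is easy to sketch in NumPy. The 80/20 proportion and the array sizes here are illustrative; the important detail is that features and labels are shuffled with the same permutation, so each column keeps its label:

```python
import numpy as np

np.random.seed(0)
m = 100
features = np.random.rand(784, m)     # columns are observations
labels = np.random.randint(0, 10, m)  # one class label per observation

idx = np.random.permutation(m)        # shuffle before splitting
split = int(0.8 * m)                  # 80% train, 20% dev
train_idx, dev_idx = idx[:split], idx[split:]

X_train, y_train = features[:, train_idx], labels[train_idx]
X_dev, y_dev = features[:, dev_idx], labels[dev_idx]
print(X_train.shape, X_dev.shape)
```

You would then compute your metric (accuracy, for example) separately on (X_train, y_train) and (X_dev, y_dev) and compare the two values, as in the table above.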
The Zalando Dataset
Zalando SE is a German e-commerce company based in Berlin. The company maintains a cross-platform store that sells shoes, clothing, and other fashion items.3 For a Kaggle competition (if you don’t know what this is, check the website www.kaggle.com, where you can participate in many competitions that have the goal of solving problems with data science), Zalando prepared an MNIST-like dataset of images of its clothing, providing 60,000 training images and 10,000 test images. As in MNIST, each image is 28 × 28 pixels in grayscale. Zalando grouped all images into ten different classes and provided the labels for each image. The dataset has 785 columns. The first column is the class label (an integer going from 0 to 9), and the remaining 784 contain the pixel gray values of the image (you can check that 28 × 28 = 784), exactly as we saw in Chapter 2, in the discussion of the MNIST dataset of handwritten digits.
0: T-shirt/top
1: Trouser
2: Pullover
3: Dress
4: Coat
5: Sandal
6: Shirt
7: Sneaker
8: Bag
9: Ankle boot
The dataset has been provided under the MIT License.4 The data file can be downloaded from Kaggle (www.kaggle.com/zalando-research/fashionmnist/data) or directly from GitHub (https://github.com/zalandoresearch/fashion-mnist). If you choose the second option, you will have to prepare the data a bit. (You can convert it to CSV with the script located at https://pjreddie.com/projects/mnist-in-csv/.) If you download it from Kaggle, the data will already be in the correct format. You will find two zipped CSV files on the Kaggle website. When unzipped, you will have fashion-mnist_train.csv, with 60,000 images (roughly 130MB), and fashion-mnist_test.csv, with 10,000 (roughly 21MB). Let’s fire up a Jupyter notebook and start coding!
You can also read the file with standard NumPy functions (such as loadtxt()), but using read_csv() from pandas gives you a lot of flexibility in slicing and analyzing your data. Additionally, it is a lot faster. Reading the file (that is, roughly 130MB) with pandas takes about 10 seconds, while with NumPy, it takes 1 minute, 20 seconds on my laptop. So, if you are dealing with big datasets, keep this in mind. It is common practice to use pandas to read and prepare the data. If you aren’t familiar with pandas, don’t worry. All you need to understand will be explained in detail.
Note
Remember: You should not focus on the Python implementation. Focus on the model, on the concepts behind the implementation. You can achieve the same results using pandas, NumPy, or even C. Try to concentrate on how to prepare the data, how to normalize it, how to check the training, and so on.
You can extract a single column from a pandas DataFrame by giving the column name in square brackets.
Now the tensor labels have the dimension (1, 60000), as we want.
Labels: This has the dimensions 1 × m (1 × 60000) and contains the class labels (integers from 0 to 9).
Train: This has the dimensions nx × m (784 × 60000) and contains the features, in which each row contains the grayscale value of a single pixel in the image (remember 28 × 28 = 784).
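The two layouts described above are obtained with a reshape and a transpose. This sketch uses random stand-in data and a smaller m, but the shapes follow exactly the (1, m) and (nx, m) convention of the text:

```python
import numpy as np

m, nx = 1000, 784  # the real training set has m = 60000
# Stand-in for the CSV contents: one row per observation, label first
raw = np.random.randint(0, 256, size=(m, 1 + nx))

labels = raw[:, 0].reshape(1, m)  # first column  -> dimensions (1, m)
train = raw[:, 1:].transpose()    # pixel columns -> dimensions (nx, m)
print(labels.shape, train.shape)
```

After this step, each column of train is one image (784 grayscale values), and the matching entry of labels is its class.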
Building a Model with TensorFlow
Now it is time to expand what we did with TensorFlow in Chapter 2 with one neuron to networks with many layers and neurons. Let’s first discuss the network architecture and what kind of output layer we need, and then let’s build our model with TensorFlow.
Network Architecture
We create an output layer with ten neurons. In this way, we will have our ten values as output.
Then we feed the ten values to a new neuron (let’s call it “softmax” neuron) that will take the ten inputs and give as output ten values that are all less than 1 and that add up to 1.
That is exactly what the TensorFlow function tf.nn.softmax() does. It takes a tensor as input and returns a tensor with the same dimensions as the input but “normalized,” as discussed previously. In other words, if we feed z = (z1, z2, …, z10) to the function, it will return a tensor with the same dimensions as z, meaning 1 × 10, where each element is given by the softmax formula S(z)i.
Modifying Labels for the softmax Function—One-Hot Encoding
where Y contains our labels and y_ is the result of our network. So, the two tensors must have the same dimensions. In our case, I explained that our network will give as output a vector with ten elements, while a label in our dataset is simply a scalar. Therefore, for a single observation, y_ has dimensions (10,1) and Y has dimensions (1,1). This will not work unless we do something smart: we must transform each label into a tensor with dimensions (10,1), that is, a vector with one value for each class. But what values should we use?
Examples of How One-Hot Encoding Works (Remember that labels go from 0 to 9 as indexes.)
Label | One-Hot Encoded Label |
---|---|
0 | (1,0,0,0,0,0,0,0,0,0) |
2 | (0,0,1,0,0,0,0,0,0,0) |
5 | (0,0,0,0,0,1,0,0,0,0) |
7 | (0,0,0,0,0,0,0,1,0,0) |
First, you create a new array with the right dimensions, (60000,10), filled with zeros, with the NumPy function np.zeros((60000,10)). Next, you set to 1 only the column related to the label itself, using NumPy fancy indexing with the line labels_[np.arange(60000), labels] = 1. Then you transpose it, to have the dimensions we want at the end: (10, 60000), where each column indicates a different observation.
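The three steps just described can be run on a small example (six made-up labels instead of 60,000), to see the fancy-indexing trick at work:

```python
import numpy as np

m = 6
labels = np.array([0, 2, 5, 7, 9, 2])   # stand-in for the 60,000 labels

labels_ = np.zeros((m, 10))             # step 1: (m, 10) array of zeros
labels_[np.arange(m), labels] = 1       # step 2: one 1 per row, at the label's index
labels_ = labels_.transpose()           # step 3: -> (10, m), one column per observation

print(labels_.shape)
print(labels_[:, 0])  # first observation, label 0: 1 in position 0, zeros elsewhere
```

The line labels_[np.arange(m), labels] = 1 pairs each row index with its label, so row i gets a 1 exactly in column labels[i].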
Now in our code, we can finally compare Y and y_, because both now have dimensions (10,1) for one observation, or (10, 60000) when considering the entire training dataset. Each row in y_ will now represent the probability of our observation being of a specific class. At the very end, when calculating the accuracy of our model, we will assign to each observation the class with the highest probability.
Note
Our network will give us the ten probabilities for the observation as being of each of the ten classes. At the end, we will assign to the observation the class that has the highest probability.
The TensorFlow Model
When we initialize the weights, we use the code tf.Variable(tf.truncated_normal([n1, n_dim], stddev=.1)). The truncated_normal() function returns values from a normal distribution, with the peculiarity that values that are more than 2 standard deviations from the average are dropped and re-picked. The reason for choosing a small stddev of 0.1 is to prevent the output of the ReLU activation function from becoming too big, which would make NaNs appear, owing to Python not being able to calculate properly numbers that are too big. I will discuss a better way of choosing the right stddev later in the chapter.
Our last neuron will use the softmax function: y_ = tf.nn.softmax(Z2,0). Remember that y_ will not be a scalar but a tensor of the same dimensions as Z2. The second parameter, 0, tells TensorFlow that we want to apply the softmax function along the vertical axis (the rows).
The two parameters n1 and n2 define the number of neurons in the different layers. Remember that the second (output) layer must have ten neurons to be able to use the softmax function. But we will play with the value for n1. Increasing n1 will increase the complexity of the network.
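As a sketch of the computation the TensorFlow graph performs, here is the same network written eagerly in NumPy: one hidden layer with n1 ReLU neurons and a 10-neuron output layer followed by softmax along the vertical axis (the 0.1 weight scale matches the stddev discussed above; all other numbers are made up):

```python
import numpy as np

np.random.seed(1)
n_dim, n1, n2, m = 784, 15, 10, 32      # n2 = 10 classes; m observations

X = np.random.rand(n_dim, m)
W1 = np.random.randn(n1, n_dim) * 0.1   # small weights, as in the text
b1 = np.zeros((n1, 1))
W2 = np.random.randn(n2, n1) * 0.1
b2 = np.zeros((n2, 1))

Z1 = np.maximum(0, W1 @ X + b1)         # ReLU hidden layer, (n1, m)
Z2 = W2 @ Z1 + b2                       # output layer, (10, m)
e = np.exp(Z2 - Z2.max(axis=0))         # softmax along axis 0 (the rows)
y_ = e / e.sum(axis=0)                  # each column sums to 1
print(y_.shape)
```

Each column of y_ is the ten class probabilities for one observation, which is exactly the shape the cost function compares against the one-hot-encoded labels.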
You should immediately notice one thing: it is very slow. Unless you have a very powerful CPU, or have installed TensorFlow with GPU support and have a powerful graphics card, this code will take a few hours on a 2017 laptop (from a couple to several, depending on the hardware you have). The problem is that the model, as we coded it, creates a huge matrix for all 60,000 observations and then modifies the weights and biases only after a complete sweep over all observations. This requires quite some resources: memory and CPU. If that were the only choice we had, we would be doomed. Keep in mind that in the deep-learning world, 60,000 examples of 784 features is not a big dataset at all. So, we must find a way of letting our model learn faster.
The tf.argmax() function returns the index with the largest value along an axis of a tensor. You will remember that when I discussed the softmax function, I said that we will assign an observation to the class that has the highest probability (y_ is a tensor with ten values, each containing the probability of the observation being of each class). So, tf.argmax(y_,0) will give us the most probable class for each observation. tf.argmax(Y,0) will do the same for our labels. Remember that we one-hot encoded our labels, so that, for example, class 2 is now (0,0,1,0,0,0,0,0,0,0). Therefore, tf.argmax([0,0,1,0,0,0,0,0,0,0],0) will return 2 (the index of the highest value, in this case, the only one different than zero).
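The accuracy computation is the same in NumPy, shown here on a tiny made-up example with three classes and two observations:

```python
import numpy as np

y_ = np.array([[0.1, 0.7],
               [0.6, 0.2],
               [0.3, 0.1]])          # (3 classes, 2 observations)
Y = np.array([[0, 1],
              [1, 0],
              [0, 0]])               # one-hot labels

predictions = np.argmax(y_, axis=0)  # most probable class per column
truth = np.argmax(Y, axis=0)         # true class per column
accuracy = np.mean(predictions == truth)
print(predictions, truth, accuracy)
```

argmax along axis 0 collapses each column (one observation) to a single class index, so comparing the two index vectors and averaging gives the accuracy directly.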
Don’t get confused by the fact that the file name contains the word test. Sometimes, the dev dataset is called test dataset. Later in the book, when I discuss error analysis, we will use three datasets: train, dev, and test. To remain consistent throughout the book, I prefer to stick with the name dev, so as not to confuse you with different names in different chapters.
A good exercise would be to include this calculation in your model, so that your model() function automatically returns the two values.
Gradient Descent Variations
In Chapter 2, I described the very basic gradient descent algorithm (also called batch gradient descent). This is not the smartest way of finding the cost function minimum. Let’s have a look at the variations that you need to know, and let’s compare how efficient they are, using the Zalando dataset.
Batch Gradient Descent
The gradient descent algorithm described in Chapter 2 calculates the weights and bias variations for each observation but performs the learning (weights and bias update) only after all observations have been evaluated, or, in other words, after a so-called epoch. (Remember that a cycle through the entire dataset is called an epoch.)
Fewer weights and bias updates mean a more stable gradient, which usually results in a more stable convergence.
Usually, this algorithm is implemented in such a way that the entire dataset must be in memory, which is computationally quite intensive.
This algorithm is typically very slow for very big datasets.
After 100 epochs, we only reached an accuracy of 16% on our training set!
Stochastic Gradient Descent
Stochastic6 gradient descent (abbreviated SGD) calculates the gradient of the cost function and then updates the weights and biases after each single observation in the dataset.
The frequent updates allow an easy check on how the model learning is progressing. (You don’t have to wait until the entire dataset has been processed.)
In a few problems, this algorithm may be faster than batch gradient descent.
The model is intrinsically noisy, and that may allow it to avoid local minima when trying to find the absolute minimum of the cost function.
On large datasets, this method is quite slow, because it is very computationally intensive, owing to the continuous updates.
The fact that the algorithm is noisy can make it hard for it to settle on a minimum for the cost function, and the convergence may not be as stable as expected.
As mentioned, this method can be quite unstable. For example, using a learning rate of 1e-3 will make NaNs appear before reaching epoch 100. Try playing with the learning rate and see what happens. You need a rather small value for the method to converge nicely. In comparison, with bigger learning rates (as big as 0.05, for example), a method such as batch gradient descent converges without problems. As I mentioned before, the method is quite computationally intensive, and 100 epochs require roughly 35 minutes on my laptop. With this variation, after only 100 epochs, we have already reached an accuracy of 80%. In terms of epochs, learning is very efficient, but in terms of running time, it is very slow.
Mini-Batch Gradient Descent
With this variation of the gradient descent, the dataset is split into a certain number of small groups of observations called batches (hence the term mini), and weights and biases are updated only after each batch has been fed to the model. This is by far the most commonly used method in the field of deep learning.
The model update frequency is higher than with batch gradient descent but lower than with SGD, therefore allowing for a more robust convergence.
This method is computationally much more efficient than batch gradient descent or SGD, because fewer calculations and resources are needed.
This variation is by far (as we will see later) the fastest of the three.
The use of this variation introduces a new hyperparameter that must be tuned: the batch size (number of observations in the mini-batch).
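The update schedule described above is easy to see in code. This sketch uses a toy linear model (fitting w and b of y = 3x + 1) rather than a neural network, just to show the loop structure: the parameters change once per mini-batch, not once per epoch or once per observation:

```python
import numpy as np

np.random.seed(0)
m, batch_size, learning_rate = 100, 32, 0.1
x = np.random.rand(m)
y = 3.0 * x + 1.0          # toy data: true w = 3, true b = 1
w, b = 0.0, 0.0

for epoch in range(500):
    for i in range(0, m, batch_size):          # one update per mini-batch
        xb, yb = x[i:i + batch_size], y[i:i + batch_size]
        err = (w * xb + b) - yb
        # gradient of the mean squared error (up to a constant factor)
        w -= learning_rate * np.mean(err * xb)
        b -= learning_rate * np.mean(err)

print(round(w, 2), round(b, 2))  # close to 3.0 and 1.0
```

Note that with m = 100 and batch_size = 32, the inner loop runs four times per epoch, the last time on a batch of only 4 observations, exactly as discussed below for the minibatch_size hyperparameter.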
In this case, we used a learning rate of 1e-3 (much bigger than the one in SGD) and reached a cost function value of 0.14, a bigger value than the 0.094 reached with SGD but much smaller than the 0.32 reached with batch gradient descent, and it required only 2.5 minutes. So, it is faster than SGD by a factor of 14. After 100 epochs, we achieved an accuracy of 66%.
Comparison of the Variations
Summary of the Findings for Three Variations of Gradient Descent for 100 Epochs
Gradient Descent Variation | Running Time | Final Value of Cost Function | Accuracy |
---|---|---|---|
Batch gradient descent | 2.5 min | 0.323 | 16% |
Mini-batch gradient descent | 2.5 min | 0.14 | 66% |
Stochastic gradient descent (SGD) | 35 min | 0.094 | 80% |
Now you can see that SGD is the algorithm that achieves the lowest value of the cost function for the same number of epochs, although it is by far the slowest. For mini-batch gradient descent to reach a value of 0.094 for the cost function takes 450 epochs and roughly 11 minutes. Still, this is a huge improvement over SGD: 31% of the time for the same results.
Note
The best compromise between running time and convergence speed (with respect to number of epochs) is achieved by mini-batch gradient descent. The optimal size of the mini-batches is dependent on your problem, but, usually, small numbers, such as 30 or 50, are a good option. You will find a compromise between running time and convergence speed.
minibatch_size: The number of observations we want in each batch. Note that if we choose for this hyperparameter a number q that is not a divisor of m (the number of observations), or, in other words, if m/q is not an integer, the last mini-batch will have a different number of observations than all the others. This will not be an issue for the training. For example, suppose we have a hypothetical dataset with m = 100, and you decide to use mini-batches of 32 observations. Then you will have 3 complete mini-batches with 32 observations and 1 with just 4, since 100 = 3 · 32 + 4. Now you may wonder what will happen with a line such as
X_train_mini = features[:,i:i + 32]
when i=96 and features has only 100 elements. Are we not going over the limits of the array? Fortunately, Python is nice to programmers and takes care of this. Consider the following code:
l = np.arange(0,100)
for i in range (0, 100, 32):
print (l[i:i+32])
The result is
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63]
[64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95]
[96 97 98 99]
And as you see, the last batch has only four elements, and we don’t get any error. So, you should not worry about this, and you can choose any mini-batch size that works better for your problem.
training_epochs: The number of epochs we want
features: The tensor that contains our features
classes: The tensor that contains our labels
logging_step: This tells the function to print the value of the cost function every logging_step epoch
learning_r: The learning rate we want to use
Note
Writing a function with the hyperparameters as inputs is common practice. This allows you to test different models with different values for the hyperparameters and check which one is better.
Examples of Wrong Predictions
Some errors are understandable, such as, for example, the one at the lower left of the figure: a shirt has been wrongly classified as a coat. Even for a human it is difficult to determine which item is which, and I could easily have made the same mistake. The wrongly classified bag, on the other hand, is easy for a human to sort out.
Weight Initialization
But why choose a standard deviation of 0.1?
Normally, in a deep network, the number of weights is quite big, so you can easily imagine that if the weights are big, the quantity zi, too, can be quite big, and the ReLU activation function can return a nan value, because the argument is too big for Python to calculate properly. So, you want the zi to be small enough to avoid an explosion of the output of the neurons and big enough to prevent the outputs from dying out and, therefore, making convergence a very slow process.
Different Initialization Strategies, Depending on Activation Functions
Activation Function | Standard Deviation σ for a Given Layer |
---|---|
Sigmoid | σ = √(1/nl − 1), usually called Xavier initialization |
ReLU | σ = √(2/nl − 1), usually called He initialization |
Using this initialization can speed up training considerably and is the standard way in which many libraries initialize weights (for example, the Caffe library).
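Both initializations are one-liners in NumPy. The formulas assumed here are the common forms, Xavier σ = √(1/nl − 1) and He σ = √(2/nl − 1), where nl − 1 is the number of neurons in the previous layer:

```python
import numpy as np

def xavier_init(n, n_prev):
    # For sigmoid layers: sigma = sqrt(1 / n_prev)
    return np.random.randn(n, n_prev) * np.sqrt(1.0 / n_prev)

def he_init(n, n_prev):
    # For ReLU layers: sigma = sqrt(2 / n_prev)
    return np.random.randn(n, n_prev) * np.sqrt(2.0 / n_prev)

np.random.seed(0)
W = he_init(100, 784)       # layer with 100 neurons fed by 784 inputs
print(W.shape)
print(round(W.std(), 3))    # close to sqrt(2/784), roughly 0.05
```

Notice that for a 784-input layer, He initialization gives a standard deviation of about 0.05, even smaller than the 0.1 used earlier, and it adapts automatically as the layer sizes change.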
Adding Many Layers Efficiently
First, we get the dimension of the inputs, to be able to define the right weight matrix.
Then, we initialize the weights with the He initialization discussed in the previous section.
Next, we create the weights W and bias b.
Then, we evaluate the quantity Z and return the activation function evaluated on Z. (Note that in Python, you can pass functions as parameters to other functions. In this case, activation may be tf.nn.relu.)
Now the code is much easier to understand, and you can use it to create networks as big as you wish.
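The four steps just listed can be sketched as a NumPy helper (in the book’s TensorFlow code the same idea builds graph nodes; here everything is computed eagerly, just to show the structure, and the function name is illustrative):

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def create_layer(A_prev, n, activation):
    n_prev = A_prev.shape[0]                   # 1. dimension of the inputs
    # 2. He initialization, as in the previous section
    W = np.random.randn(n, n_prev) * np.sqrt(2.0 / n_prev)
    b = np.zeros((n, 1))                       # 3. the bias b
    Z = W @ A_prev + b                         # 4. the quantity Z
    return activation(Z)                       # activation passed as a function

np.random.seed(0)
X = np.random.rand(784, 8)                     # 8 stand-in observations
A1 = create_layer(X, 10, relu)                 # layer 1: ten neurons
A2 = create_layer(A1, 10, relu)                # layer 2: ten neurons
print(A1.shape, A2.shape)
```

Adding a layer is now a single call, so stacking four layers (or forty) is just four (or forty) lines, with each layer inferring its input dimension from the previous one.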
One layer and ten neurons each layer
Two layers and ten neurons each layer
Three layers and ten neurons each layer
Four layers and ten neurons each layer
Four layers and 100 neurons each layer
In case you are wondering, the model with four layers, each with 100 neurons, which seems much better than the others, is starting to go into the overfitting regime, with an accuracy of 94% on the train set and 88% on the dev set (after only 200 epochs).
Advantages of Additional Hidden Layers
I suggest you play with the models. Try varying the number of layers, the number of neurons, the way you initialize the weights, and so on. If you invest some time, you can achieve an accuracy of more than 90% in a few minutes of running time, but that requires some work. If you try several models, you may realize that in this case, using several layers does not seem to bring any benefit over a network with just one. This is often the case.
Theoretically speaking, a one-layer network can approximate every function you can imagine, but the number of neurons needed may be very large, and, therefore, the model becomes much less useful. The catch is that the ability to approximate a function does not mean that the network is able to learn to do it, owing, for example, to the sheer number of neurons involved or the time required.
Empirically, it has been shown that networks with more layers require much smaller numbers of neurons to reach the same results and usually generalize better to unknown data.
Note
Theoretically speaking, you don’t need to have multiple layers in your networks, but often, in practice, you should. It is almost always a good idea to try a network of several layers with a few neurons in each, instead of a network with one layer populated by a huge number of neurons. There is no fixed rule on how to decide how many neurons or layers are best. You should try starting with low numbers of layers and neurons and then increase these until your results stop improving.
In addition, having more layers may allow your network to learn different aspects of your inputs. For example, one layer may learn to recognize the vertical edges of an image, and another, horizontal ones. Remember that in this chapter, I have discussed networks in which each layer is identical (up to the number of neurons) to all the others. You will see later, in Chapter 4, how you can build networks in which each layer performs a very different task and is structured very differently from the others, making this kind of network much more powerful for certain tasks.
You may remember that in Chapter 2, we tried to predict the selling prices of houses in the Boston area. In that case, a network with several layers might reveal more information about how the features relate to the price. For example, the first layer might reveal basic relationships, such as bigger houses equal higher prices. But the second layer might reveal more complex relationships, such as big houses with a smaller number of bathrooms equal lower selling prices.
Comparing Different Networks
A comparison of the values of Q for different network architectures
Network Architecture | Parameter Q (Number of Learnable Parameters) | Number of Neurons |
---|---|---|
Network A: 784 features, 2 layers: n1 = 15, n2 = 10 | QA = 15 ∗ (784 + 1) + 10 ∗ (15 + 1) = 11935 | 25 |
Network B: 784 features, 16 layers: n1 = n2 = … = n15 = 1, n16 = 10 | QB = 1 ∗ (784 + 1) + 14 ∗ 1 ∗ (1 + 1) + 10 ∗ (1 + 1) = 833 | 25 |
Network C: 784 features, 3 layers: n1 = 10, n2 = 10, n3 = 10 | QC = 10 ∗ (784 + 1) + 10 ∗ (10 + 1) + 10 ∗ (10 + 1) = 8070 | 30 |
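The values of Q in the table follow a simple pattern: each layer contributes (number of neurons) × (number of inputs to that layer + 1), the +1 accounting for the bias. A small helper function (mine, for illustration) makes this explicit:

```python
def count_parameters(n_features, layer_sizes):
    """Q = sum over layers l of n_l * (n_{l-1} + 1),
    where n_0 is the number of input features."""
    Q, prev = 0, n_features
    for n in layer_sizes:
        Q += n * (prev + 1)   # weights plus one bias per neuron
        prev = n
    return Q

Q_A = count_parameters(784, [15, 10])       # network A
Q_C = count_parameters(784, [10, 10, 10])   # network C
```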
I would like to draw your attention to networks A and B. Both have 25 neurons, but QA is larger than QB by more than a factor of ten. You can easily imagine that network A will be much more flexible in learning than network B, even though the number of neurons is the same.
Note
I would be misleading you if I told you that this number Q is a measure of how complex or how good a network is. This is not the case, and it may well happen that of all the neurons, only a few will play a role. Therefore, calculating Q alone, as I have shown you, will not tell the entire story. There is a vast amount of research on the so-called effective degrees of freedom of deep neural networks, but that goes way beyond the scope of this book. Nonetheless, this parameter provides a good rule of thumb for deciding whether the set of models you want to test is in a reasonable complexity progression.
A comparison of the values of Q for different network architectures
Network Architecture | Parameter Q | Number of Neurons |
---|---|---|
784 features, 1 layer with 1 neuron, 1 layer with 10 neurons | Q = 1 ∗ (784 + 1) + 10 ∗ (1 + 1) = 805 | 11 |
784 features, 1 layer with 5 neurons, 1 layer with 10 neurons | Q = 5 ∗ (784 + 1) + 10 ∗ (5 + 1) = 3985 | 15 |
784 features, 1 layer with 15 neurons, 1 layer with 10 neurons | Q = 15 ∗ (784 + 1) + 10 ∗ (15 + 1) = 11935 | 25 |
784 features, 1 layer with 30 neurons, 1 layer with 10 neurons | Q = 30 ∗ (784 + 1) + 10 ∗ (30 + 1) = 23860 | 40 |
You should be able to solve a quadratic equation, so I will give only the solution here (hint: try to solve it yourself). The equation is solved for nB = 14.4, but because we cannot have 14.4 neurons, we must use the closest integer, nB = 14. For nB = 14, we get QB = 11560, a value very close to 11935.
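If you want to check the numbers, the quadratic comes from equating Q to QA = 11935, assuming three hidden layers of nB neurons each followed by the ten-neuron output layer (my sketch of the calculation, under that assumption):

```python
import math

# Q(n) for 784 features, three hidden layers of n neurons, output layer of 10:
# Q(n) = n*(784 + 1) + 2*n*(n + 1) + 10*(n + 1) = 2*n**2 + 797*n + 10
# Setting Q(n) = 11935 gives 2*n**2 + 797*n - 11925 = 0.
a, b, c = 2.0, 797.0, -11925.0
n_B = (-b + math.sqrt(b**2 - 4*a*c)) / (2*a)
print(round(n_B, 1))   # ~14.4

def Q(n):
    return n * (784 + 1) + 2 * n * (n + 1) + 10 * (n + 1)

print(Q(14))           # Q for the closest integer, n = 14
```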
Note
Please let me say it again. The fact that the two networks have the same number of learnable parameters does not mean that they can reach the same accuracy. It does not even mean that if one learns very fast the second will learn at all!
Our model with three layers of 14 neurons each could, however, be a good starting point for further testing.
Let’s discuss another point that is important when dealing with a complex dataset. Consider the first layer. Suppose we take the Zalando dataset and create a network with two layers: the first with one neuron and the second with many. All the complex features in your dataset may well be lost in that single first neuron, because it combines all the features into one single value and then passes that exact same value to every neuron of the second layer.
Tips for Choosing the Right Network
I hear you crying, “You’ve discussed a lot of cases, given us a lot of formulas, but how can we decide how to design our network?”
When considering a set of models (or network architectures) that you want to test, a good rule of thumb is to start with the least complex one and move to more complex ones. Another is to use the parameter Q to estimate the relative complexity of the models, to make sure that you are moving in the right direction.
In case you cannot achieve good accuracy, check whether any of your layers has a particularly low number of neurons. Such a layer may kill your network’s effective capacity to learn from a complex dataset. Consider, for example, the case with one neuron in Figure 3-20: the model cannot reach low values of the cost function, because the network is too simple to learn from a dataset as complex as the Zalando one.
Remember that a low or high number of neurons is always relative to the number of features you have. If you have only two features in your dataset, one neuron may well be sufficient, but if you have a few hundred (as in the Zalando dataset, where nx = 784), you should not expect one neuron to be enough.
Which architecture you need is also dependent on what you want to do. It is always worth checking online literature to see what others have already discovered about specific problems. For example, it is well known that for image recognition, convolutional networks are very good, so they would be an excellent choice.
Note
When moving from a model with L layers to one with L + 1 layers, it is always a good idea to start the new model with a slightly lower number of neurons in each layer and then increase them step by step. Remember that more layers have a chance of learning complex features more effectively, so, if you are lucky, fewer neurons may be enough. It is something worth trying. Always keep track of your optimizing metric (remember this from Chapter 2?) for all your models. When you are no longer getting much improvement, it may be worth trying completely different architectures (maybe convolutional neural networks, etc.).