© Umberto Michelucci 2018
Umberto Michelucci, Applied Deep Learning, https://doi.org/10.1007/978-1-4842-3790-8_2

2. Single Neuron

Umberto Michelucci (toelt.ai, Dübendorf, Switzerland)

In this chapter, I will discuss what a neuron is and what its components are. I will clarify the mathematical notation we will require and cover the many activation functions that are used today in neural networks. Gradient descent optimization will be discussed in detail, and the concept of learning rate and its quirks will be introduced. To make things a bit more fun, we will then use a single neuron to perform linear and logistic regression on real datasets. I will then discuss and explain how to implement the two algorithms with TensorFlow.

To keep the chapter focused and the learning efficient, I have left out a few things on purpose. For example, we will not split the dataset into training and test parts. We simply use all the data. Using the two would force us to do some proper analysis, and that would distract from the main goal of this chapter and make it way too long. Later in the book, I will conduct a proper analysis of the consequences of using several datasets and see how to do this properly, especially in the context of deep learning. This is a subject that requires its own chapter.

You can do wonderful, amazing, and fun things with deep learning. Let’s start to have fun!

The Structure of a Neuron

Deep learning is based on large and complex networks made up of a large number of simple computational units. Companies on the forefront of research are dealing with networks with 160 billion parameters [1]. To put things in perspective, this number is half that of the stars in our galaxy, or 1.5 times the number of people who ever lived. On a basic level, neural networks are a large set of differently interconnected units, each performing a specific (and usually relatively easy) computation. They recall LEGO toys, with which you can build very complex things using very simple and basic units. Neural networks are similar. Using relatively simple computational units, you can build very complex systems. We can vary the basic units, changing how they compute the result, how they are connected to each other, how they use the input values, and so on. Roughly formulated, all those aspects define what is known as the network architecture. Changing it will change how the network learns, how accurate the predictions are, and so on.

Those basic units are known, due to a biological parallel with the brain [2], as neurons. Each neuron does a very simple thing: it takes a certain number of inputs (real numbers) and calculates an output (also a real number). In this book, our inputs will be indicated by xi ∈ ℝ (real numbers), with i = 1, 2, …, nx, where i ∈ ℕ is an integer and nx is the number of input attributes (often called features). As an example of input features, you can imagine the age and weight of a person (so, we would have nx = 2). x1 could be the age, and x2 could be the weight. In real life, the number of features can easily be very big. In the dataset that we will use for our logistic regression example later in the chapter, we will have nx = 784.

There are several kinds of neurons that have been extensively studied. In this book, we will concentrate on the most commonly used one. The neuron we are interested in simply applies a function to a linear combination of all the inputs. In a more mathematical form, given nx real parameters wi ∈ ℝ (with i = 1, 2, …, nx) and a constant b ∈ ℝ (usually called bias), the neuron will first calculate what is usually indicated in the literature by z.
$$ z={w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b $$
It will then apply a function f to z, giving the output $$ \widehat{y} $$.
$$ \widehat{y}=f(z)=f\left({w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b\right) $$

Note

Practitioners mostly use the following nomenclature: wi refers to weights, b bias, xi input features, and f the activation function.

Owing to a biological parallel, the function f is called the neuron activation function (and sometimes transfer function), which will be discussed at length in the next sections.

Let’s summarize the neuron computational steps again.
  1. Combine linearly all inputs xi, calculating $$ z={w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b $$;

  2. Apply f to z, giving the output $$ \widehat{y}=f(z)=f\left({w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b\right) $$.

     
You may remember that in Chapter 1, I discussed computational graphs. In Figure 2-1, you will find the graph for the neuron described previously.
Figure 2-1. The computational graph for the neuron described in the text

This is not what you usually find in blogs, books, and tutorials. It is rather complicated and not very practical to use, especially when you want to draw networks with many neurons. In the literature, you can find numerous representations for neurons. In this book, we will use the one shown in Figure 2-2, because it is widely used and is easy to understand.
Figure 2-2. The neuron representation mostly used by practitioners

Figure 2-2 must be interpreted in the following way:
  • The inputs are not put in a bubble. This is simply to distinguish them from nodes that perform an actual calculation.

  • The weights’ names are written along the arrows. This means that before an input is passed to the central bubble (or node), it is first multiplied by the corresponding weight, as labeled on the arrow. The first input, x1, will be multiplied by w1, x2 by w2, and so on.

  • The central bubble (or node) will perform several calculations, one after the other. First, it will sum the weighted inputs (the xiwi for i = 1, 2, …, nx), then add the bias b to the result, and, finally, apply the activation function to the resulting value.

All neurons we will deal with in this book will have exactly this structure. Very often, an even simpler representation is used, as in Figure 2-3. In such a case, unless otherwise stated, it is understood that the output is
$$ \widehat{y}=f(z)=f\left({w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b\right) $$
Figure 2-3. A simplified version of the representation in Figure 2-2. Unless otherwise stated, it is usually understood that the output is $$ \widehat{y}=f(z)=f\left({w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x}+b\right) $$. The weights are often not explicitly reported in the neuron representation.

Matrix Notation

When dealing with big datasets, the number of features is large (nx will be big), and so it is better to use a vector notation for the features and the weights, as follows:
$$ \mathbf{x}=\left(\begin{array}{c}{x}_1\\ \vdots \\ {x}_{n_x}\end{array}\right) $$
where we have indicated the vector with a boldfaced x. For the weights, we use the same notation:
$$ \mathbf{w}=\left(\begin{array}{c}{w}_1\\ \vdots \\ {w}_{n_x}\end{array}\right) $$
For consistency with formulas that we will use later, to multiply x and w, we will use matrix multiplication notation, and, therefore, we will write
$$ {\mathbf{w}}^T\mathbf{x}=\left({w}_1\ \dots\ {w}_{n_x}\right)\left(\begin{array}{c}{x}_1\\ \vdots \\ {x}_{n_x}\end{array}\right)={w}_1{x}_1+{w}_2{x}_2+\cdots +{w}_{n_x}{x}_{n_x} $$
where wT indicates the transpose of w. z can then be written with this vector notation as
$$ z={\mathbf{w}}^T\mathbf{x}+b $$
and the neuron output $$ \widehat{y} $$ as
$$ \widehat{y}=f(z)=f\left({\mathbf{w}}^T\mathbf{x}+b\right)\tag{3} $$
Let’s now summarize the different components that define our neuron and the notation we will use in this book.
  • $$ \widehat{y} $$ → neuron output

  • f(z) → activation function (or transfer function) applied to z

  • w → weights (vector with nx components)

  • b → bias
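As a minimal sketch of the notation just summarized, the neuron output can be computed in NumPy with a dot product. The function name neuron_output and the numeric values chosen for x, w, and b are made up for illustration and do not come from any dataset used in this book.

```python
import numpy as np

def neuron_output(x, w, b, f):
    """Return y_hat = f(w^T x + b) for a single observation x."""
    z = np.dot(w, x) + b  # linear combination of the inputs plus the bias
    return f(z)

# Hypothetical observation with n_x = 2 features (for example, age and weight).
x = np.array([30.0, 70.0])
w = np.array([0.1, -0.05])  # illustrative weights
b = 0.5                     # illustrative bias

# Identity activation: the output is simply z.
y_hat = neuron_output(x, w, b, lambda z: z)
print(y_hat)
```

Any of the activation functions discussed later in the chapter can be passed as f in place of the identity.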

Python Implementation Tip: Loops and NumPy

The calculation that we have outlined in equation (3) can be done in Python with standard lists and loops, but those tend to become very slow as the number of variables and observations grows. A good rule of thumb is to avoid loops, when possible, and to use NumPy (or TensorFlow, as we will see later) methods as often as possible.

It is easy to get an idea of how fast NumPy can be (and how slow loops are). Let’s start by creating two standard lists of random numbers in Python with 10⁷ elements in each.
import random
lst1 = random.sample(range(1, 10**8), 10**7)
lst2 = random.sample(range(1, 10**8), 10**7)

The actual values are not relevant for our purposes. We are simply interested in how fast Python can multiply two lists, element by element. The times reported were measured on a 2017 Microsoft Surface laptop and will vary greatly, depending on the hardware the code runs on. We are not interested in the absolute values, but only in how much faster NumPy is in comparison with standard Python loops. To time Python code in a Jupyter notebook, we can use a “magic command.” Usually, in a Jupyter notebook, these commands start with %% or %. A good idea is to check the official documentation, accessible from http://ipython.readthedocs.io/en/stable/interactive/magics.html, to better understand how they work.

Going back to our test, let’s measure how much time a standard laptop takes to multiply, element by element, the two lists with standard loops. Using the code
%%timeit
ab = [lst1[i]*lst2[i] for i in range(len(lst1))]
gives us the following result (note that on your computer, you will probably get a different result):
2.06 s ± 326 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Over seven runs, the code needed roughly two seconds on average. Now let’s try to do the same multiplication, but, this time, using NumPy, where we have first converted the two lists to NumPy arrays, with the following code:
import numpy as np
list1_np = np.array(lst1)
list2_np = np.array(lst2)
# New notebook cell (%%timeit must be the first line of its cell):
%%timeit
Out2 = np.multiply(list1_np, list2_np)
This time, we get the following result:
20.8 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The NumPy code needed only 21 ms, or, in other words, was roughly 100 times faster than the code with standard loops. NumPy is faster for two reasons: the underlying routines are written in C, and it uses vectorized code as much as possible to speed up calculations on big amounts of data.

Note

Vectorized code refers to operations that are performed on multiple components of a vector (or a matrix) at the same time (in one statement). Passing matrices to NumPy functions is a good example of vectorized code. NumPy will perform operations on big chunks of data at the same time, obtaining a much better performance with respect to standard Python loops, which must operate on one element at a time. Note that part of the good performance NumPy is showing is also owing to the underlying routines being written in C.

While training deep learning models, you will find yourself doing this kind of operation over and over, and, therefore, such a speed gain will make the difference between having a model that can be trained and one that will never give you a result.
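The %%timeit magic is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library timeit module can be used instead. The array size of 10⁵ elements and the function names loop_multiply and numpy_multiply are arbitrary choices for this illustration, and the absolute times will depend on your hardware.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(1, 100, size=10**5)
b = rng.integers(1, 100, size=10**5)
a_list, b_list = a.tolist(), b.tolist()

def loop_multiply():
    # Element-by-element product with a standard Python loop.
    return [a_list[i] * b_list[i] for i in range(len(a_list))]

def numpy_multiply():
    # The same product as a single vectorized NumPy call.
    return np.multiply(a, b)

# Both approaches give the same numbers.
print(loop_multiply() == numpy_multiply().tolist())

# Average time per call over a few repetitions.
t_loop = timeit.timeit(loop_multiply, number=5) / 5
t_numpy = timeit.timeit(numpy_multiply, number=5) / 5
print(f"loop:  {t_loop * 1e3:.2f} ms per run")
print(f"numpy: {t_numpy * 1e3:.2f} ms per run")
```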

Activation Functions

There are many activation functions at our disposal to change the output of our neuron. Remember: An activation function is simply a mathematical function that transforms z into the output $$ \widehat{y} $$. Let’s have a look at the most commonly used ones.

Identity Function

This is the most basic function that you can use. Usually, it is indicated by I(z). It simply returns the input value unchanged. Mathematically, we have
$$ f(z)=I(z)=z $$
This simple function will come in handy when I discuss linear regression with one neuron later in the chapter. Figure 2-4 shows what it looks like.
Figure 2-4. The identity function

Implementing an identity function in Python is particularly simple, and it works unchanged on NumPy arrays.
def identity(z):
    return z

Sigmoid Function

This is a very commonly used function that gives only values between 0 and 1. It is usually indicated by σ(z).
$$ f(z)=\sigma (z)=\frac{1}{1+{e}^{-z}} $$
It is especially used for models in which we must predict a probability as an output (remember that a probability may only assume values between 0 and 1). You can see its shape in Figure 2-5. Note that in Python, if |z| is big enough, the function can return exactly 0 or 1 (depending on the sign of z) because of rounding errors. In classification problems, we will calculate log σ(z) or log(1 − σ(z)) very often, and, therefore, this can be a source of errors in Python, because it will try to calculate log 0, which is not defined. For example, you can start seeing nan appearing while calculating the cost function (more on that later). We will see a practical example of this phenomenon later in the chapter.
Figure 2-5. The sigmoid activation function is an s-shaped function that goes from 0 to 1

Note

Although σ(z) should never be exactly 0 or 1, while programming in Python, the reality can be quite different. Due to a very big |z| (with z positive or negative), Python may round the results to exactly 0 or 1. This could give you errors while calculating the cost function for classification (I will give you a detailed explanation and practical example later in the chapter), because we will need to calculate log σ(z) and log(1 − σ(z)), and, therefore, Python will try to calculate log 0, which is not defined. This may occur, for example, if we don’t normalize our input data correctly, or if we don’t initialize our weights correctly. For the moment, it is important to remember that although mathematically everything seems under control, the reality while programming can be more difficult. It is something that is good to keep in mind while debugging models that, for example, give nan as a result for the cost function.
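The rounding effect described in this note is easy to reproduce. The following sketch uses the sigmoid formula given above with 64-bit floats; the input values 10 and 40 are arbitrary. The clipping step at the end is only one possible workaround, shown here as an assumption rather than as the approach taken in this book, and the constant eps is a hypothetical choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([10.0, 40.0])
s = sigmoid(z)
print(s == 1.0)  # True where the result was rounded to exactly 1

# log(1 - s) becomes log(0) for the saturated entry. NumPy returns -inf there,
# and multiplying -inf by 0 (as can happen inside a cost function) gives nan.
with np.errstate(divide="ignore", invalid="ignore"):
    log_term = np.log(1.0 - s)
    product = 0.0 * log_term
print(log_term)
print(product)

# Possible workaround (assumption): keep sigma(z) a tiny distance away
# from 0 and 1 before taking the logarithm.
eps = 1e-12  # hypothetical small constant
s_safe = np.clip(s, eps, 1.0 - eps)
print(np.log(1.0 - s_safe))
```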

The behavior with z can be seen in Figure 2-5. The calculation can be written in this form using numpy functions:
s = np.divide(1.0, np.add(1.0, np.exp(-z)))

Note

It is very useful to know that if we have two numpy arrays, A and B, the following are equivalent: A/B is equivalent to np.divide(A,B), A+B is equivalent to np.add(A,B), A-B is equivalent to np.subtract(A,B), and A*B is equivalent to np.multiply(A,B). In case you are familiar with object-oriented programming, we say that in numpy, basic operations, such as /, *, +, and -, are overloaded. Note also that all of these four basic operations in numpy act element by element.

We can write the sigmoid function in a more readable (at least for humans) form as follows:
def sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s

As stated previously, 1.0 + np.exp(-z) is equivalent to np.add(1.0, np.exp(-z)), and 1.0 / (np.add(1.0, np.exp(-z))) to np.divide(1.0, np.add(1.0, np.exp(-z))). I want to draw your attention to another point in the formula. np.exp(-z) will have the dimensions of z (usually a vector that will have a length equal to the number of observations), while 1.0 is a scalar (a zero-dimensional entity). How can Python sum the two? What happens is what is called broadcasting.¹ NumPy, subject to certain constraints, will “broadcast” the smaller array (in this case, the 1.0) across the larger one, so that at the end, the two have the same dimensions. In this case, the 1.0 behaves like an array of the same dimension as z, all filled with 1.0. This is an important concept to understand, as it is very useful. You don’t have to transform numbers into arrays, for example. NumPy will take care of it for you. The rules on how broadcasting works in other cases are rather complex and beyond the scope of this book. However, it is important to know that NumPy is doing something in the background.
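A small sketch can make the effect of broadcasting visible. The arrays z, X, and v below are arbitrary examples; the second half goes slightly beyond the scalar case discussed in the text and shows a vector being broadcast across the rows of a matrix.

```python
import numpy as np

z = np.array([-1.0, 0.0, 2.0])

# Broadcasting: the scalar 1.0 is combined with every element of the array.
implicit = 1.0 + np.exp(-z)

# The same computation with an explicit array of ones shaped like z.
explicit = np.ones_like(z) + np.exp(-z)

print(implicit.shape)                      # (3,)
print(np.array_equal(implicit, explicit))  # True

# A vector of shape (3,) is broadcast across each row of a (2, 3) matrix.
X = np.arange(6.0).reshape(2, 3)
v = np.array([10.0, 20.0, 30.0])
print(X + v)
```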

Tanh (Hyperbolic Tangent Activation) Function

The hyperbolic tangent is also an s-shaped curve that goes from -1 to 1.
$$ f(z)=\tanh (z) $$
In Figure 2-6, you can see its shape. In Python, this can be easily implemented, as follows:
def tanh(z):
    return np.tanh(z)
Figure 2-6. The tanh (or hyperbolic tangent) function is an s-shaped curve that goes from -1 to 1
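The similarity between the two s-shaped curves is not accidental: the hyperbolic tangent is a rescaled and shifted sigmoid, tanh(z) = 2σ(2z) − 1. The following short check, with an arbitrary grid of test points, illustrates the identity numerically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)

# tanh(z) written in terms of the sigmoid function.
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

print(np.allclose(np.tanh(z), tanh_from_sigmoid))
```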

ReLU (Rectified Linear Unit) Activation Function

The ReLU function (Figure 2-7) has the following formula:
$$ f(z)=\max \left(0,z\right) $$
Figure 2-7. The ReLU function

It is useful to spend a few moments exploring how to implement the ReLU function in a smart way in Python. Note that when we start using TensorFlow, it will already be implemented for us, but it is very instructive to observe how different Python implementations can make a difference when implementing complex deep-learning models.

In Python, you can implement the ReLU function in several ways. Listed below are four different methods. (Try to understand why they work before proceeding.)
  1. np.maximum(x, 0, x)

  2. np.maximum(x, 0)

  3. x * (x > 0)

  4. (abs(x) + x) / 2

     
The four methods have very different execution speeds. Let’s generate a numpy array with 10⁸ elements, as follows:
x = np.random.random(10**8)
Now let’s measure the time needed by the four different versions of the ReLU function when applied to it. Let the following code run:
x = np.random.random(10**8)
print("Method 1:")
%timeit -n10 np.maximum(x, 0, x)
print("Method 2:")
%timeit -n10 np.maximum(x, 0)
print("Method 3:")
%timeit -n10 x * (x > 0)
print("Method 4:")
%timeit -n10 (abs(x) + x) / 2
The results follow:
Method 1:
2.66 ms ± 500 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 2:
6.35 ms ± 836 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 3:
4.37 ms ± 780 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 4:
8.33 ms ± 784 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The difference is stunning. Method 1 is roughly three times faster than Method 4. The numpy library is highly optimized, with many routines written in C. But knowing how to code efficiently still makes a difference and can have a great impact. Why is np.maximum(x, 0, x) faster than np.maximum(x, 0)? The first version updates x in place, without creating a new array. This can save a lot of time, especially when arrays are big. If you don’t want to (or can’t) update the input vector in place, you can still use the np.maximum(x, 0) version.

An implementation could look like this:
def relu(z):
    return np.maximum(z, 0)
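Because the in-place version overwrites its input, it is worth seeing the side effect on a small array before using it. The sketch below uses an arbitrary four-element array; it also checks that the four methods listed earlier agree on the result.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

# Method 2 returns a new array and leaves x untouched.
out_of_place = np.maximum(x, 0)
print(x)

# Method 1 writes the result into its third argument (the output array),
# so the original values of x_copy are lost.
x_copy = x.copy()
in_place = np.maximum(x_copy, 0, x_copy)
print(x_copy)
print(in_place is x_copy)  # True: no new array was created

# Methods 3 and 4 give the same values as Method 2.
print(np.allclose(x * (x > 0), out_of_place))
print(np.allclose((abs(x) + x) / 2, out_of_place))
```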

Note

Remember: When optimizing your code, even small changes may make a huge difference. In deep-learning programs, the same chunk of code will be repeated millions and billions of times, so even a small improvement will have a huge impact in the long run. Spending time to optimize your code is a necessary step that will pay off.

Leaky ReLU

The Leaky ReLU (closely related to the parametric rectified linear unit, in which α is learned during training rather than fixed) is given by the formula
$$ f(z)=\left\{\begin{array}{ll}\alpha z& \text{for } z<0\\ z& \text{for } z\ge 0\end{array}\right. $$
with α a parameter typically of the order of 0.01. In Figure 2-8, you can see an example for α = 0.05. This value has been chosen to make the difference between z > 0 and z < 0 more marked. Usually, smaller values for α are used, but testing with your model is required to find the best value.
Figure 2-8. The Leaky ReLU activation function with α = 0.05

In Python, if the relu(z) function has already been defined, this can be implemented, for example, as
def lrelu(z, alpha):
    # relu(z) keeps the positive part; -alpha * relu(-z) equals alpha * z for z < 0
    return relu(z) - alpha * relu(-z)
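To see that this expression matches the piecewise formula, it can be compared with a direct translation of the two cases. The function lrelu_where below is a hypothetical alternative written for this comparison, and the test values are arbitrary.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def lrelu(z, alpha):
    return relu(z) - alpha * relu(-z)

def lrelu_where(z, alpha):
    # Direct translation of the formula: alpha * z for z < 0, z otherwise.
    return np.where(z < 0, alpha * z, z)

z = np.array([-2.0, -0.5, 0.0, 3.0])
print(lrelu(z, 0.05))
print(lrelu_where(z, 0.05))
```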

Swish Activation Function

Recently, Ramachandran, Zoph, and Le at Google Brain [4] studied a new activation function, called Swish, that shows great promise in the deep-learning world. It is defined as
$$ f(z)=z\sigma \left(\beta z\right) $$
where β is a learnable parameter. In Figure 2-9, you can see how this activation function looks for three values of the parameter β: 0.1, 0.5, and 10.0. The team’s studies have shown that simply replacing ReLU activation functions with Swish improves classification accuracy on ImageNet by 0.9%. In today’s deep-learning world, that is a lot. You can find more information on ImageNet at www.image-net.org/.
Figure 2-9. The Swish activation function for three different values of the parameter β

ImageNet is a large database of images that is often used to benchmark new network architectures or algorithms, such as, in this case, networks with a different activation function.
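A minimal NumPy sketch of the Swish formula follows. Here β is passed as a fixed argument rather than learned, which is a simplification made for this illustration; the function names and the grid of z values are likewise arbitrary. Two limiting cases are easy to verify: for β = 0 the function reduces to z/2, and for large β it approaches the ReLU.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z, beta=1.0):
    # f(z) = z * sigma(beta * z), with beta kept fixed in this sketch.
    return z * sigmoid(beta * z)

z = np.linspace(-5, 5, 11)
for beta in (0.1, 0.5, 10.0):  # the three values shown in Figure 2-9
    print(beta, np.round(swish(z, beta), 3))
```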

Other Activation Functions

There are many other activation functions, but these are rarely used. As a reference, following are some additional ones. The list is by no means comprehensive but should serve the purposes of giving you an idea of the variety of activation functions that can be used when developing neural networks.
  • ArcTan

$$ f(z)={\tan}^{-1}z $$
  • Exponential Linear Unit (ELU)

$$ f(z)=\left\{\begin{array}{ll}\alpha \left({e}^z-1\right)& \text{for } z<0\\ z& \text{for } z\ge 0\end{array}\right. $$
  • Softplus

$$ f(z)=\ln \left(1+{e}^z\right) $$
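For completeness, here is one way the three functions above could be written with NumPy. This is a sketch under a few assumptions: the default α = 1.0 for the ELU is an arbitrary choice, and np.expm1 and np.logaddexp are used instead of a literal transcription of the formulas only because they behave better numerically for inputs of large magnitude.

```python
import numpy as np

def arctan(z):
    return np.arctan(z)

def elu(z, alpha=1.0):
    # alpha * (e^z - 1) for z < 0, z otherwise. The exponential is evaluated
    # only on min(z, 0), so large positive inputs cannot overflow.
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0)), z)

def softplus(z):
    # ln(1 + e^z), computed as log(e^0 + e^z) to avoid overflow for large z.
    return np.logaddexp(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(arctan(z))
print(elu(z))
print(softplus(z))
```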

Note

Practitioners almost always use only two activation functions: the sigmoid and the ReLU (the ReLU probably most often). With both, you can achieve good results, and, given a complex enough network architecture, both can approximate any nonlinear function [5,6]. Remember that when using TensorFlow, you will not have to implement the functions by yourself. TensorFlow will offer an efficient implementation for you to use. But it is important to know how each activation function behaves, to understand when to use which one.

Cost Function and Gradient Descent: The Quirks of the Learning Rate

Now that you understand clearly what a neuron is, I will discuss what it means for it (and, in general, for a neural network) to learn. This will allow us to introduce concepts such as hyperparameters and learning rate. In almost all neural network problems, learning simply means finding the weights (remember that a neural network is composed of many neurons, and each neuron will have its own set of weights) and biases of the network that minimize a chosen function, which is usually called the cost function and typically indicated by J.

In calculus, there are several methods for finding the minimum of a given function analytically. Unfortunately, in all neural network applications, the number of weights is so big that it is not possible to use these methods. Numerical methods must be relied on, the most famous being gradient descent. It is the easiest method to understand, and it will give you the perfect basis from which to understand the more complex algorithms that you will see later in the book. Let me give a brief overview on how it works, because it is one of the best algorithms in machine learning to introduce the reader to the concept of learning rate and its quirks.

Given a generic function J(w), where w is a vector of weights, the minimum location in weight space (meaning the value for w for which J(w) has a minimum) can be found with an algorithm based on the following steps:
  1. Iteration 0: Choose a random initial guess w0.

  2. Iteration n + 1 (with n starting from 0): The weights at iteration n + 1, wn+1, will be updated from the previous values at iteration n, wn, using the formula

$$ {\mathbf{w}}_{n+1}={\mathbf{w}}_n-\gamma \nabla J\left({\mathbf{w}}_n\right) $$
With ∇J(w), we have indicated the gradient of the cost function, which is a vector whose components are the partial derivatives of the cost function with respect to all the components of the weight vector w, as follows:
$$ \nabla J\left(\mathbf{w}\right)=\left(\begin{array}{c}\partial J\left(\mathbf{w}\right)/\partial {w}_1\\ \vdots \\ \partial J\left(\mathbf{w}\right)/\partial {w}_{n_x}\end{array}\right) $$

To decide when to stop, we could check when the cost function J(w) stops changing too much. In other words, you could define a threshold ϵ and stop at an iteration k (an integer that you have to find) such that | J(wq+1) − J(wq) | < ϵ for all q > k. The problem with this approach is that it is complicated, and this check is very expensive in terms of performance when implemented in Python (remember: you will have to do this step a very large number of times), so, usually, people simply let the algorithm run for a fixed big number of iterations and check the final results. If the result is not what is expected, they increase the fixed big number. How big? Well, that depends on your problem. What you do is choose a certain number of iterations (for example, 10,000 or 1,000,000) and let the algorithm run. At the same time, you plot the cost function vs. the number of iterations, and you check that the number of iterations you have chosen is sensible. Later in this chapter, you will see a practical example in which I will show you how to check if the number you chose was big enough. For the moment, you should know that you simply stop the algorithm after a fixed number of iterations.

Note

Why this algorithm converges toward the minimum (and how to show it) is beyond the scope of this book. A proof would make this chapter too long and would distract the reader from the main learning goal, which is to understand the effect of choosing a specific learning rate and the consequences of choosing too big or too small a rate.

We will assume here that the cost function is differentiable. This is not usually the case, but a discussion of this issue goes well beyond the scope of this book. People tend to use a practical approach in this case. The implementations work very well, and so these kinds of theoretical problems are usually ignored by a large number of practitioners. Remember that in deep-learning models, the cost function becomes an incredibly complex function, and studying it is almost impossible.

The series wn will hopefully converge toward the minimum location, after a reasonable number of iterations. The parameter γ is called the learning rate and is one of the most important parameters required in the neural network learning process.
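The update rule and the fixed-iteration stopping strategy can be sketched in a few lines of NumPy. The function gradient_descent, the toy cost function J(w) = ‖w − c‖², and the values of c, γ, and the number of iterations are all assumptions chosen for this illustration; the starting point is fixed rather than random, so that the run is reproducible.

```python
import numpy as np

def gradient_descent(grad_J, w_init, gamma, n_iterations):
    """Apply w_{n+1} = w_n - gamma * grad_J(w_n) a fixed number of times."""
    w = np.asarray(w_init, dtype=float)
    history = [w.copy()]
    for _ in range(n_iterations):
        w = w - gamma * grad_J(w)
        history.append(w.copy())
    return w, np.array(history)

# Toy cost function with its minimum at c (hypothetical example).
c = np.array([1.0, -2.0])
J = lambda w: np.sum((w - c) ** 2)
grad_J = lambda w: 2.0 * (w - c)

w_final, history = gradient_descent(grad_J, w_init=[0.0, 0.0],
                                    gamma=0.1, n_iterations=100)
costs = [J(w) for w in history]
print(w_final)
print(costs[0], costs[-1])
```

Keeping the whole history makes it straightforward to plot the cost function against the number of iterations afterward.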

Note

To distinguish it from weights, the learning rate is called a hyperparameter. We will encounter more of those. A hyperparameter is a parameter whose value is not determined by training and is usually set before the learning process begins. In contrast, the values of parameters w and b are derived via training.

The word hopefully has been chosen for good reason. It is possible that the algorithm will not converge toward the minimum. It is even possible that the series wn will oscillate between values without converging at all, or diverge outright. Choose γ too big, and your model may not converge; choose it too small, and it will converge too slowly. To understand why this is the case, let’s consider a practical case and see how the method works while choosing different learning rates.

Learning Rate in a Practical Example

Let’s consider the dataset formed by m = 30 observations y generated by the following code:
m = 30
w0 = 2
w1 = 0.5
x = np.linspace(-1,1,m)
y = w0 + w1 * x
As a cost function, we choose the classical mean squared error (MSE), scaled here by a factor of 1/2 so that the factor 2 coming from the derivative of the square cancels:
$$ J\left({w}_0,{w}_1\right)=\frac{1}{2m}\sum \limits_{i=1}^m{\left({y}^{(i)}-f\left({w}_0,{w}_1,{x}^{(i)}\right)\right)}^2 $$
where we have indicated with the superscript (i) the ith observation. Remember that with the subscript i (xi), we have indicated the ith feature. To recap our notation, we have indicated with $$ {x}_j^{(i)} $$ the jth feature of the ith observation. In the example here, we have just one feature, so we don’t need the subscript j. The cost function can be implemented in Python easily as
np.average((y - hypothesis(x, w0, w1))**2) / 2
where we have defined
def hypothesis(x, w0, w1):
    # Linear model f(w0, w1, x) = w0 + w1 * x
    return w0 + w1*x

Our goal is to find the values for w0 and w1 that minimize J(w0, w1).

To apply the gradient descent method, we must calculate the series for w0,n and w1,n. We have the following equations:
$$ \left\{\begin{array}{l}{w}_{0,n+1}={w}_{0,n}-\gamma \frac{\partial J\left({w}_{0,n},{w}_{1,n}\right)}{\partial {w}_0}={w}_{0,n}+\gamma \frac{1}{m}\sum \limits_{i=1}^m\left({y}^{(i)}-f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)\right)\frac{\partial f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)}{\partial {w}_0}\\ {w}_{1,n+1}={w}_{1,n}-\gamma \frac{\partial J\left({w}_{0,n},{w}_{1,n}\right)}{\partial {w}_1}={w}_{1,n}+\gamma \frac{1}{m}\sum \limits_{i=1}^m\left({y}^{(i)}-f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)\right)\frac{\partial f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)}{\partial {w}_1}\end{array}\right. $$
Simplifying the equations by using ∂f(w0, w1, x(i))/∂w0 = 1 and ∂f(w0, w1, x(i))/∂w1 = x(i) gives
$$ \left\{\begin{array}{l}{w}_{0,n+1}={w}_{0,n}+\frac{\gamma }{m}\sum \limits_{i=1}^m\left({y}^{(i)}-f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)\right)={w}_{0,n}\left(1-\gamma \right)+\frac{\gamma }{m}\sum \limits_{i=1}^m\left({y}^{(i)}-{w}_{1,n}{x}^{(i)}\right)\\ {w}_{1,n+1}={w}_{1,n}+\frac{\gamma }{m}\sum \limits_{i=1}^m\left({y}^{(i)}-f\left({w}_{0,n},{w}_{1,n},{x}^{(i)}\right)\right){x}^{(i)}={w}_{1,n}-\frac{\gamma {w}_{0,n}}{m}\sum \limits_{i=1}^m{x}^{(i)}+\frac{\gamma }{m}\sum \limits_{i=1}^m\left({y}^{(i)}-{w}_{1,n}{x}^{(i)}\right){x}^{(i)}\end{array}\right. $$

These are the equations that must be implemented in Python, if we want to code the gradient descent algorithm by ourselves.
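The two update equations translate almost literally into NumPy. The following is only a compact sketch, not a complete implementation: the starting point (0, 0), the number of iterations, and the function names cost and gradient_descent are assumptions made for this illustration, and the cost includes the factor 1/2 used in the Python expression given earlier.

```python
import numpy as np

# Dataset from the text.
m = 30
x = np.linspace(-1, 1, m)
y = 2 + 0.5 * x

def hypothesis(x, w0, w1):
    return w0 + w1 * x

def cost(w0, w1):
    return np.average((y - hypothesis(x, w0, w1)) ** 2) / 2

def gradient_descent(gamma, n_iterations, w0=0.0, w1=0.0):
    cost_history = [cost(w0, w1)]
    for _ in range(n_iterations):
        residual = y - hypothesis(x, w0, w1)
        # Both weights are updated simultaneously from the same residual.
        w0, w1 = (w0 + gamma * np.mean(residual),
                  w1 + gamma * np.mean(residual * x))
        cost_history.append(cost(w0, w1))
    return w0, w1, cost_history

w0_hat, w1_hat, costs = gradient_descent(gamma=0.8, n_iterations=50)
print(w0_hat, w1_hat, costs[-1])
```

With these settings, the estimates should end up close to the values 2 and 0.5 used to generate the data.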

Note

The derivation of the preceding update equations has the goal of showing how the equations for gradient descent become very complicated very quickly, even for a very easy case. In the next section, we will build our first model with TensorFlow. One of the best aspects of the library is that all those formulas are calculated automatically, and you don’t have to bother calculating anything. Implementing equations such as the ones shown here and debugging them can take quite some time and can prove to be impossible the moment you are dealing with large neural networks of interconnected neurons.

I have omitted in this book the complete Python implementation of the example, because it would require too much space.

It is instructive to check how the model works, by varying the learning rate. In Figures 2-10, 2-11, and 2-12, the contour lines² of the cost function have been drawn, and on top of these, the series (w0,n, w1,n) has been plotted as points, to visualize how the series converges (or doesn’t). In the figures, the minimum is indicated by a circle placed approximately at the center. We will consider the values γ = 0.8 (in Figure 2-10), γ = 2 (in Figure 2-11), and γ = 0.05 (in Figure 2-12).

In the first case (in Figure 2-10), the convergence is well behaved, and in just eight steps, the method converges toward the minimum. When γ = 2 (Figure 2-11), the method makes steps that are too big (remember: the steps are given by −γ∇J(w), and therefore the bigger γ, the bigger the steps) and is unable to get close to the minimum. It keeps oscillating around it, without reaching it. In this case, the model will never converge. In the last case, when γ = 0.05 (Figure 2-12), the learning is so slow that it will take many more steps to get close to the minimum. In some cases, the cost function may be so flat around the minimum that the method takes such a big number of iterations to converge that, practically, you will not get close enough to the real minimum in a reasonable amount of time. In Figure 2-12, 300 iterations are plotted, but the method is not even very close to the minimum.

Note

Choosing the right learning rate is of paramount importance when coding the learning part of a neural network. Choose too big a rate, and the method may just bounce around the minimum, without ever reaching it. Choose too small a rate, and the algorithm may become so slow that you will not be able to find the minimum in a reasonable amount of time (or number of iterations). A typical sign of a learning rate that is too big is that the cost function may become nan (“not a number,” in Python slang). Printing the cost function at regular intervals during the training process is a good way of checking for this kind of problem. This will give you a chance to stop the process and avoid wasting time (in case you see nan appearing). A concrete example appears later in the chapter.
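The advice in this note can be turned into a simple guard inside the training loop. The sketch below reuses the linear dataset from the practical example with a deliberately excessive learning rate; the value γ = 3, the starting point (0, 0), the printing interval, and the iteration cap are all hypothetical choices. The check with np.isfinite stops the run as soon as the cost is inf or nan.

```python
import numpy as np

# Dataset from the practical example.
m = 30
x = np.linspace(-1, 1, m)
y = 2 + 0.5 * x

def cost(w0, w1):
    return np.average((y - (w0 + w1 * x)) ** 2) / 2

gamma = 3.0         # deliberately too big (hypothetical value)
w0, w1 = 0.0, 0.0   # assumed starting point
stopped_at = None

# Overflow warnings are expected here, so they are silenced.
with np.errstate(over="ignore", invalid="ignore"):
    for n in range(5000):
        residual = y - (w0 + w1 * x)
        w0, w1 = (w0 + gamma * np.mean(residual),
                  w1 + gamma * np.mean(residual * x))
        J = cost(w0, w1)
        if n % 100 == 0:
            print(f"iteration {n}: J = {J:.3e}")
        if not np.isfinite(J):
            stopped_at = n
            print(f"cost is no longer finite at iteration {n}, stopping")
            break
```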

In deep-learning problems, each iteration will cost time, and you will have to perform this process several times. Choosing the right learning rate is a key part of designing a good model, because it will make training much faster (or make it impossible).
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig10_HTML.jpg
Figure 2-10

Illustration of a gradient descent algorithm with well-behaved convergence

../images/463356_1_En_2_Chapter/463356_1_En_2_Fig11_HTML.jpg
Figure 2-11

Illustration of a gradient descent algorithm when the learning rate is too big. The method is not able to converge toward the minimum.

../images/463356_1_En_2_Chapter/463356_1_En_2_Fig12_HTML.jpg
Figure 2-12

Illustration of a gradient descent algorithm when the learning rate is too small. The method is so slow that it will take a huge number of iterations to converge toward the minimum.

Sometimes it is efficient to change the learning rate during the process. You start with a bigger value to get close to the minimum faster, and then you reduce it progressively, to make sure that you get as close as possible to the real minimum. I will discuss this approach later in the book.

Note

There are no fixed rules on how to choose the right learning rate. It depends on the model, on the cost function, on the starting point, and so on. A good rule of thumb is to start with γ = 0.05 and then see how the cost function behaves. It is rather common to plot J(w) vs. the number of iterations, to check that it decreases and the speed at which it is decreasing.

A good way of checking the convergence is to plot the cost function vs. the number of iterations. In this way, you can check its behavior. Figure 2-13 shows how the cost function behaves for the three learning rates of the preceding example. You can clearly see how the case with γ = 0.8 goes to zero rather quickly, indicating that we have reached a minimum. The case with γ = 2 does not even start to go down. It remains at almost the same initial value. And, finally, the case with γ = 0.05 starts to go down, but it is a lot slower than the first case.
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig13_HTML.jpg
Figure 2-13

The cost function vs. the number of iterations (only the first eight are considered)

So, here are the conclusions we should draw from Figure 2-13 for the three cases:
  • γ = 0.05 → J is decreasing, which is good, but after eight iterations, we have not reached a plateau, so we must use many more iterations, until we see that J is not changing much anymore.

  • γ = 2 → J is not decreasing. We should change our learning rate to see if it helps. Trying smaller values would be a good starting point.

  • γ = 0.8 → The cost function decreases rather quickly and then remains constant. That is a good sign and indicates that we have reached a minimum.
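The three conclusions above can be turned into a small helper that looks at a recorded cost history. This is only a sketch: the function name diagnose_cost_history and the threshold rel_tol are hypothetical, and the rule used to decide that J “is not changing much anymore” is one simple choice among many. The function expects at least two recorded values.

```python
import numpy as np

def diagnose_cost_history(cost_history, rel_tol=1e-3):
    # rel_tol is a hypothetical threshold for "J is not changing much anymore".
    cost_history = np.asarray(cost_history, dtype=float)
    if np.any(np.isnan(cost_history)):
        return "nan: try a smaller learning rate"
    first, last = cost_history[0], cost_history[-1]
    if last >= first:
        return "not decreasing: try a smaller learning rate"
    # Compare the change in the last step with the total decrease so far.
    last_change = abs(cost_history[-2] - last)
    if last_change <= rel_tol * (first - last):
        return "plateau: a minimum has probably been reached"
    return "still decreasing: run more iterations"

print(diagnose_cost_history([10.0, 2.0, 0.5, 0.1, 0.1, 0.1]))
print(diagnose_cost_history([10.0, 10.0, 10.0]))
print(diagnose_cost_history([10.0, 9.5, 9.0, 8.6]))
```

A helper like this does not replace the plot; it is only a quick check to run while the training is in progress.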

Remember also that the absolute value of the cost function is not relevant. What is important is its behavior. We can multiply our cost function by a constant, and that would not move its minimum; it would only rescale the gradient, which has the same effect as rescaling the learning rate. Don’t look at the absolute values; check how fast the cost function is decreasing and how it behaves. Additionally, the cost function will almost never reach zero, so don’t expect it. The value of J at its minimum is almost never zero (it depends on the function itself). In the section about linear regression, you will see an example in which the cost function will not reach zero.

Note

When training your models, remember to always check the cost function vs. the number of iterations (or number of sweeps over the entire training set, called epochs). This gives you an easy way of judging whether the training is efficient, or working at all, and gives you hints on how to optimize it.

Now that we have defined the basis, we will use a neuron to solve two simple problems with machine learning: linear and logistic regression.

Example of Linear Regression in tensorflow

The first type of regression will offer an opportunity to understand how to build a model in tensorflow. To explain how to perform linear regression efficiently with one neuron, I must first explain some additional notation. In the previous sections, I discussed inputs $$ x=\left({x}_1,{x}_2,\dots ,{x}_{n_x}\right) $$. These are the so-called features that describe an observation. Normally, we have many observations. As briefly explained before, we will use an upper index between parentheses to indicate the different observations. Our ith observation will be indicated with x(i), and the jth feature of the ith observation will be indicated as $$ {x}_j^{(i)} $$. We will indicate the number of observations with m.

Note

In this book, m is the number of observations, and nx is the number of features. Our jth feature of the ith observation will be indicated with $$ {x}_j^{(i)} $$. In deep-learning projects, the bigger the m the better. So be prepared to deal with a huge number of observations.

You will remember that I have said many times that numpy is highly optimized to perform several parallel operations at the same time. To get the best performance possible, it is important to write our equations in matrix form and feed the matrices to numpy. In this way, our code will be as efficient as possible. Remember: Avoid loops at all costs whenever possible. Let’s spend some time now in writing all our equations in matrix form. In this way, our Python implementation will be much easier later.

The entire set of inputs (features and observations) can be written in matrix form. We will use the following notation:
$$ X=\left(\begin{array}{ccc}{x}_1^{(1)}& \dots & {x}_1^{(m)}\\ \vdots & \ddots & \vdots \\ {x}_{n_x}^{(1)}& \dots & {x}_{n_x}^{(m)}\end{array}\right) $$
where each column is an observation and each row represents a feature in the matrix X, which has dimensions nx × m. We can also write the output values $$ {\widehat{y}}^{(i)} $$ in matrix form. If you recall our neuron discussion, we have defined a z(i) = wTx(i) + b for one observation i. Putting each observation in a column, we can use the following notation:
$$ z=\left({z}^{(1)}\kern0.5em {z}^{(2)}\kern0.5em \dots \kern0.5em {z}^{(m)}\right)={w}^TX+b $$
where we have b = (b b … b), a matrix with one row and m columns. We will define $$ \widehat{y} $$ as
$$ \widehat{y}=\left({\widehat{y}}^{(1)}\kern0.5em {\widehat{y}}^{(2)}\kern0.5em \dots \kern0.5em {\widehat{y}}^{(m)}\right)=\left(f\left({z}^{(1)}\right)\kern0.5em f\left({z}^{(2)}\right)\kern0.5em \dots \kern0.5em f\left({z}^{(m)}\right)\right)=f(z) $$

where f(z) means that the function f is applied element by element to the matrix z.

Note

Although z has dimensions 1 × m, we will use the term matrix for it and not vector, to use consistent names in the book. This will also help you to remember that we should always use matrix operations. For our purposes, z is simply a matrix with just one row.

You know from Chapter 1 that in tensorflow, you must declare explicitly the dimensions of our matrices (or tensors), so it is a good idea to have them well under control. Here is an overview of the dimensions of all the vectors and matrices we will use:
  • X has dimensions nx × m

  • z has dimensions 1 × m

  • $$ \widehat{y} $$ has dimensions 1 × m

  • w has dimensions nx × 1

  • b has dimensions 1 × m
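To make the dimensions above concrete, here is a minimal numpy sketch that builds matrices with these shapes and compares the matrix form z = wTX + b with an explicit loop over the observations. The sizes n_x = 3 and m = 5, the random values, and the variable names are assumptions for illustration only. The scalar b is expanded to 1 × m by broadcasting.

```python
import numpy as np

n_x, m = 3, 5                      # hypothetical number of features and observations
rng = np.random.default_rng(0)
X = rng.normal(size=(n_x, m))      # one observation per column, n_x x m
w = rng.normal(size=(n_x, 1))      # weights, n_x x 1
b = 0.5                            # scalar bias, broadcast to 1 x m in the sum

# Matrix form: z = w^T X + b, computed in a single call.
z = np.matmul(w.T, X) + b

# The same result, one observation at a time (the loop we want to avoid).
z_loop = np.zeros((1, m))
for i in range(m):
    z_loop[0, i] = np.dot(w[:, 0], X[:, i]) + b

print(z.shape)
print(np.allclose(z, z_loop))
```

Both versions give the same numbers; the matrix form simply hands the whole computation to numpy at once, which is what makes it efficient for large m.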

Now that the formalism is clear, we will prepare the dataset.

Dataset for Our Linear Regression Model

To make things a bit more interesting, let’s use a real dataset. We will use the so-called Boston dataset.3 This contains information collected by the US Census Bureau concerning housing around Boston. Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows [3]:
  • CRIM: Per capita crime rate by town

  • ZN: Proportion of residential land zoned for lots over 25,000 square feet

  • INDUS: Proportion of non-retail business acres per town

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • NOX: Nitric oxides concentration (parts per 10 million)

  • RM: Average number of rooms per dwelling

  • AGE: Proportion of owner-occupied units built prior to 1940

  • DIS: Weighted distances to five Boston employment centers

  • RAD: Index of accessibility to radial highways

  • TAX: Full-value property-tax rate per $10,000

  • PTRATIO: Pupil-teacher ratio by town

  • B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town

  • LSTAT: % lower status of the population

  • MEDV: Median value of owner-occupied homes in $1000s

Our target variable MEDV, the one we want to predict, is the median price of the house in $1000s for each suburb. For our example, we don’t have to understand or study the features. My goal here is to show you how to build a linear regression model with what you have learned. Normally, in a machine-learning project, you would first study your input data, check their distribution, quality, missing values, and so on; however, I will skip this part to concentrate on how to implement what you learned with tensorflow.

Note

In machine learning, the variable we want to predict is usually called the target variable.

Let’s import the usual libraries, including sklearn.datasets. Importing the data and getting features and target is very easy with the help of the sklearn.datasets package. You don’t have to download CSV files and import them. Simply run the following code:
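import matplotlib.pyplot as plt
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
import tensorflow as tf
import numpy as np
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release.
from sklearn.datasets import load_boston
boston = load_boston()
features = np.array(boston.data)    # one row per observation, one column per feature
labels = np.array(boston.target)    # MEDV, the target variable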
Every dataset in the sklearn.datasets package comes with a description. You can check it with the following command:
print(boston["DESCR"])
Now let’s check how many observations and features we have.
n_training_samples = features.shape[0]
n_dim = features.shape[1]
print('The dataset has',n_training_samples,'training samples.')
print('The dataset has',n_dim,'features.')
Linking the mathematical notation with the Python code: n_training_samples is m, and n_dim is nx. The code will give the following results:
The dataset has 506 training samples.
The dataset has 13 features.
It is a good idea to normalize each numerical feature, defining normalized features $$ {x}_{\operatorname{norm},j}^{(i)} $$ according to the formula
$$ {x}_{\operatorname{norm},j}^{(i)}=\frac{x_j^{(i)}-\left\langle {x}_j^{(i)}\right\rangle }{\sigma_j^{(i)}} $$
where $$ \left\langle {x}_j^{(i)}\right\rangle $$ is the average of the jth feature, and $$ {\sigma}_j^{(i)} $$ is its standard deviation. This can be easily calculated in numpy with the following function:
def normalize(dataset):
    mu = np.mean(dataset, axis = 0)
    sigma = np.std(dataset, axis = 0)
    return (dataset-mu)/sigma

To normalize our features numpy array, we must simply call the function features_norm = normalize(features). Now each feature contained in the numpy array features_norm will have an average of zero and a standard deviation of one.
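A quick way to convince yourself that the normalization does what it should is to apply it to a small synthetic array and check the averages and standard deviations of the result. The sketch below repeats the normalize function so that it can be run on its own; the array fake_features, with three features on very different scales, is made up for illustration and has nothing to do with the Boston data.

```python
import numpy as np

def normalize(dataset):
    # Subtract the per-feature mean and divide by the per-feature standard deviation.
    mu = np.mean(dataset, axis=0)
    sigma = np.std(dataset, axis=0)
    return (dataset - mu) / sigma

# Hypothetical data: 200 observations (rows) of 3 features on very different scales.
rng = np.random.default_rng(42)
fake_features = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(200, 3))

fake_norm = normalize(fake_features)
print(np.mean(fake_norm, axis=0))   # each entry is zero up to rounding errors
print(np.std(fake_norm, axis=0))    # each entry is one up to rounding errors
```

Note that a feature that is constant over all observations has a standard deviation of zero, so the division would fail for it; such a feature would have to be removed or treated separately.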

Note

It is generally a good idea to normalize the features, so that their average is zero, and the standard deviation is one. Sometimes, some features are much bigger than others and can have a stronger influence on the model, thus leading to poor predictions. Particular care is needed when the dataset is split into training and test datasets, to have consistent normalizations.

For this chapter, we will simply use all the data for the training, to concentrate on implementation details.
train_x = np.transpose(features_norm)
train_y = np.transpose(labels)
print(train_x.shape)
print(train_y.shape)
The last two prints will give us the dimensions of our new matrices.
(13, 506)
(506,)

The train_x array has dimensions of (13, 506), and that is exactly what we expect. Remember for our discussion that X has dimensions nx × m.

The training target train_y has dimensions of (506,), which is how numpy describes one-dimensional arrays. tensorflow wants to have dimensions of (1, 506) (remember our previous discussion?), so we must reshape the array in this way:
train_y = train_y.reshape(1,len(train_y))
print(train_y.shape)
and our print statements give us what we need:
(1, 506)

Neuron and Cost Function for Linear Regression

A neuron that can perform linear regression uses the identity activation function. The cost function that needs to be minimized is the MSE (mean square error) that can be written as
$$ J\left(w,b\right)=\frac{1}{m}\sum \limits_{i=1}^m{\left({y}^{(i)}-{w}^T{x}^{(i)}-b\right)}^2 $$

where the sum is over all m observations.
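Before handing everything over to tensorflow, it can help to see the cost function and one possible gradient descent loop written out in plain numpy. The following is a sketch under stated assumptions: the function names, the synthetic noise-free data, the learning rate of 0.1, and the 500 iterations are all chosen for illustration. The gradients are obtained by differentiating J(w, b) by hand, which gives ∂J/∂w = (2/m) X (ŷ − y)T and ∂J/∂b = (2/m) Σ (ŷ(i) − y(i)).

```python
import numpy as np

def mse_cost(w, b, X, y):
    # X: (n_x, m), y: (1, m), w: (n_x, 1), b: scalar.
    y_hat = np.matmul(w.T, X) + b
    return np.mean((y - y_hat) ** 2)

def mse_gradients(w, b, X, y):
    # Derivatives of the MSE with respect to w and b.
    m = X.shape[1]
    residual = np.matmul(w.T, X) + b - y           # y_hat - y, shape (1, m)
    grad_w = (2.0 / m) * np.matmul(X, residual.T)  # shape (n_x, 1)
    grad_b = (2.0 / m) * np.sum(residual)
    return grad_w, grad_b

# Synthetic, noise-free data generated from known (hypothetical) parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))
true_w = np.array([[1.5], [-2.0]])
true_b = 0.7
y = np.matmul(true_w.T, X) + true_b

# Plain gradient descent starting from zero weights and zero bias.
w, b = np.zeros((2, 1)), 0.0
for _ in range(500):
    grad_w, grad_b = mse_gradients(w, b, X, y)
    w = w - 0.1 * grad_w
    b = b - 0.1 * grad_b

print(w.ravel(), b)
print(mse_cost(w, b, X, y))
```

Because the synthetic targets contain no noise, the cost can get arbitrarily close to zero here. With real data, as discussed earlier, the minimum of J is almost never zero.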

The tensorflow code to build this neuron and define the cost function is actually very simple.
# Note: this is the TensorFlow 1.x graph API (tf.placeholder, tf.Session)
tf.reset_default_graph()
# Inputs that are fed at run time
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
# Parameters that are updated during training
W = tf.Variable(tf.ones([n_dim,1]))
b = tf.Variable(tf.zeros(1))
init = tf.global_variables_initializer()
# Neuron output (identity activation) and MSE cost
y_ = tf.matmul(tf.transpose(W),X)+b
cost = tf.reduce_mean(tf.square(y_-Y))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

Note that in tensorflow, you don’t have to explicitly declare the number of observations. You can use None in the code. In this way, you will be able to run the model on any dataset independently of the number of observations, without modifying your code.

In the code, we have indicated the neuron output $$ \widehat{y} $$ as y_, because we don’t have a hat in Python. Let me clarify a bit which line of code does what.
  • X = tf.placeholder(tf.float32, [n_dim, None]) → contains the matrix X, which must have dimensions nx × m. Remember that in our code, n_dim is nx and that m is not declared explicitly in tensorflow. In its place, we use None.

  • Y = tf.placeholder(tf.float32, [1, None]) → contains the observed target values y, which must have the same dimensions as the neuron output $$ \widehat{y} $$, that is, 1 × m. Again, instead of m, we use None, because we want to use the same model for different datasets (that will have a different number of observations).

  • learning_rate = tf.placeholder(tf.float32, shape=()) → contains the learning rate as a parameter instead of a constant, so that we can run the same model varying it, without creating a new neuron each time.

  • W = tf.Variable(tf.ones([n_dim,1])) → defines and initializes the weights, w, with ones. Remember that the weights, w, must have dimensions nx × 1.

  • b = tf.Variable(tf.zeros(1)) → defines and initializes the bias, b, with zero.

Remember that in tensorflow, a placeholder is a tensor that will not change during the learning phase, whereas a variable is one that will change. Weights, w, and bias, b, will be updated during the learning. Now we must define what to do with all those quantities. Remember: We must calculate z. The chosen activation function is the identity function, so z will also be the output of our neuron.
  • init = tf.global_variables_initializer() → creates a piece of the graph that initializes the variables and adds it to the graph.

  • y_ = tf.matmul(tf.transpose(W),X)+b → calculates the output of the neuron. The output of a neuron is $$ \widehat{y}=f(z)=f\left({w}^TX+b\right) $$. Because the activation function for linear regression is the identity, the output is $$ \widehat{y}={w}^TX+b $$. Remember that b being a scalar is not a problem. Broadcasting will take care of it, expanding it to the right dimensions, to make the sum between the 1 × m matrix wTX and the scalar b possible.

  • cost = tf.reduce_mean(tf.square(y_-Y)) → defines the cost function. tensorflow provides an easy and efficient way of calculating the average—tf.reduce_mean()—that simply performs the sum of all the elements of the tensor and divides it by the number of elements.

  • training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost) → tells tensorflow which algorithm to use to minimize the cost function. In tensorflow language, the algorithms used to minimize the cost function are called optimizers. We now use gradient descent with the given learning rate. Later in the book, other optimizers will be extensively studied.

You will remember from the introduction in Chapter 1 that the previous code will not run any model. It simply defines the computational graph. Let’s define a function that will perform the actual learning and will run our model. It is easier to define it in a function, so that we can rerun it, changing, for example, the learning rate or the number of iterations we want to use.
def run_linear_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    # Value of the cost function after each gradient descent step
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        # One gradient descent step on the full dataset
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        # Evaluate the cost with the updated weights and bias
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if (epoch % 1000 == 0) and debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    # The session is left open, so that the caller can evaluate other tensors
    return sess, cost_history
Let’s go through the code again, line by line.
  • sess = tf.Session() → creates a tensorflow session.

  • sess.run(init) → runs the initialization of the different elements of the graph.

  • cost_history = np.empty(shape=[0], dtype = float) → creates an empty vector (for the moment with zero elements) in which the value of our cost function at each iteration is stored.

  • for loop... → In this loop, tensorflow performs the gradient descent steps that we have discussed earlier and updates the weights and the bias. In addition, it will save in the array cost_history the value of the cost function each time: cost_history = np.append(cost_history, cost_).

  • if (epoch % 1000 == 0)... → Every 1000 epochs we will print the value of the cost function. This is an easy way of checking if the cost function is really decreasing or if nans are appearing. If you perform some initial tests in an interactive environment (such as a Jupyter notebook), you can stop the process if you see that the cost function is not behaving as you expect.

  • return sess, cost_history → returns the session (in case you want to calculate something else) and the array containing the cost function values (we will use this array to plot it).

Running the model is as easy as making the following call:
sess, cost_history = run_linear_model(learning_r = 0.01,
                                training_epochs = 10000,
                                train_obs = train_x,
                                train_labels = train_y,
                                debug = True)
The output of the command will be the cost function every 1000 epochs (check in the function definition the if, starting with if (epoch % 1000 == 0)).
Reached epoch 0 cost J = 613.947144
Reached epoch 1000 cost J = 22.131165
Reached epoch 2000 cost J = 22.081099
Reached epoch 3000 cost J = 22.076544
Reached epoch 4000 cost J = 22.076109
Reached epoch 5000 cost J = 22.07606
Reached epoch 6000 cost J = 22.076057
Reached epoch 7000 cost J = 22.076059
Reached epoch 8000 cost J = 22.076059
Reached epoch 9000 cost J = 22.076054
Reached epoch 10000 cost J = 22.076054
The cost function clearly decreases and then reaches a value and stays almost constant. You can see a plot of it in Figure 2-14. That is a good sign, indicating that the cost function has reached a minimum. That does not mean that our model is good or that it will give good predictions. This tells us only that the learning has worked efficiently.
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig14_HTML.jpg
Figure 2-14

The cost function resulting in our model applied to the Boston dataset with a learning rate of γ=0.01. We plot only the first 500 epochs, since the cost function has almost already reached its final value.

It would be nice to be able to visualize graphically how good our fit is. Because we have 13 features, it is not possible to plot the price vs. the other features. However, it is helpful to get a feel of how well the model predicts the observed values. This can be done by plotting our predicted target variable vs. the observed one, as I have done in Figure 2-15. If we can perfectly predict our target variable, all the points should be on a diagonal line in the plot. The more spread the points are around the line, the worse our model is at predicting. Let’s check how our model is doing.
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig15_HTML.jpg
Figure 2-15

The predicted target value vs. the measured target value for our model, applied to our training data

The points lie reasonably well around the line, so it seems we can predict our price to a certain degree. A more quantitative method for estimating the accuracy of our regression is the MSE itself (which, in our case, is simply our cost function). Whether the value we are obtaining (22.08, in units of 1000 USD squared, which corresponds to a root mean squared error of about 4.7 in 1000 USD) is good enough depends on the problem you are trying to solve, or the constraints and requirements you have been given.

Satisficing and Optimizing a Metric

We have seen that it is not easy to decide whether a model is good. Figure 2-15 will not allow us to describe quantitatively how good (or not good) our model is. For this, we must define a metric.

The easiest way is to set up what is called a single number evaluation metric. That means that you calculate one single number and base your model evaluation on that number. It is easy and very practical. For example, you could use the accuracy or the F1 score, in the case of classification, or the MSE, in the case of regression. Normally, in real life, you will receive goals and constraints for your model. For example, your company may want to predict house prices with an MSE < 20 (in 1000 USD), and your model should be able to run on an iPad, or in less than 1 second. It is useful, therefore, to distinguish between two types of metrics:
  • Satisficing metric → Searching through available alternatives until an acceptability threshold is met, for example, on the code running time (RT): minimizing the cost function subject to RT < 1 sec, or choosing among models the one that has an RT < 1 sec

  • Optimizing metric → Searching through available alternatives to maximize a specific metric, for example, choosing the model (or the hyperparameters) that maximizes accuracy

Note

If you have several metrics, you should always choose one optimizing and the rest satisficing.
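As an illustration of how the two types of metrics work together, here is a minimal sketch that first discards the candidates that do not satisfy a running-time threshold (the satisficing metric) and then picks, among the remaining ones, the one with the highest accuracy (the optimizing metric). The candidate models, their accuracies, and their running times are invented numbers, and the function name select_model is hypothetical.

```python
# Hypothetical candidate models with made-up measurements, for illustration only.
candidates = [
    {"name": "model_a", "accuracy": 0.95, "runtime_sec": 1.8},
    {"name": "model_b", "accuracy": 0.92, "runtime_sec": 0.6},
    {"name": "model_c", "accuracy": 0.89, "runtime_sec": 0.2},
]

def select_model(candidates, max_runtime_sec):
    # Satisficing metric: keep only the models that meet the running-time threshold.
    acceptable = [c for c in candidates if c["runtime_sec"] < max_runtime_sec]
    if not acceptable:
        return None
    # Optimizing metric: among those, take the one with the highest accuracy.
    return max(acceptable, key=lambda c: c["accuracy"])

best = select_model(candidates, max_runtime_sec=1.0)
print(best["name"])
```

In this made-up example, the most accurate model is rejected because it is too slow, and the choice falls on the most accurate of the models that run in less than 1 second.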

We have written our code to be able to run our model with different parameters. It is very instructive now to do that. Here is how the cost function behaves for three different learning rates: 0.1, 0.01, and 0.001. You can check the different behaviors in Figure 2-16.
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig16_HTML.jpg
Figure 2-16

The cost function for linear regression applied to the Boston dataset for three learning rates: 0.1 (solid line), 0.01 (dashed line), and 0.001 (dotted line). The smaller the learning rate, the slower the learning process.

As expected, for very small learning rates (0.001), the gradient descent algorithm is very slow in finding the minimum, whereas with a bigger value (0.1), the method works quickly. This kind of plot is very useful for giving you an idea of how fast and how well the learning process is going. You will see cases later in the book where the cost function is much less well behaved. For example, when applying dropout regularization, the cost function will not be smooth anymore.

Example of Logistic Regression

Logistic regression is a classic classification algorithm. To keep it simple, we will consider here a binary classification. This means that we will deal with the problem of recognizing two classes, which we will label as 0 or 1, only. We will need an activation function different from the one we used for linear regression, a different cost function to minimize, and a slight modification of the output of our neuron. Our goal is to be able to build a model that can predict if a certain new observation is of one of two classes. The neuron should give as output the probability P(y = 1| x) of the input x to be of class 1. We will then classify our observation as of class 1, if P(y = 1| x) > 0.5, or of class 0, if P(y = 1| x) < 0.5.

Cost Function

As a cost function, we will use the cross entropy.4 The function for one observation is
$$ L\left({\widehat{y}}^{(i)},{y}^{(i)}\right)=-\left({y}^{(i)}\log {\widehat{y}}^{(i)}+\left(1-{y}^{(i)}\right)\log \left(1-{\widehat{y}}^{(i)}\right)\right) $$
In the presence of more than one observation, the cost function is the average over all observations
$$ J\left(w,b\right)=\frac{1}{m}\sum \limits_{i=1}^m L\left({\widehat{y}}^{(i)},{y}^{(i)}\right) $$

In Chapter 10, I will provide a complete derivation of logistic regression from scratch, but for the moment, tensorflow will take care of all the details—derivatives, gradient descent implementation, and so on. We only have to build the right neuron, and we will be on our way.

Activation Function

Remember: We want our neuron to output the probability of our observation to be of class 0 or 1. Therefore, we need an activation function that can assume only values between 0 and 1. Otherwise, we cannot regard it as a probability. For our logistic regression, we will use the sigmoid function as the activation function.
$$ \sigma (z)=\frac{1}{1+{e}^{-z}} $$
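The sigmoid function, the cross-entropy cost, and the classification rule with the 0.5 threshold can be sketched in a few lines of numpy. The function names and the example values of z and y are assumptions made for illustration. Two choices in this sketch are not taken from the text: the small constant eps, which keeps the predictions away from exactly 0 and 1, where the logarithm is not finite, and the assignment of a probability of exactly 0.5 to class 0.

```python
import numpy as np

def sigmoid(z):
    # Maps any real value to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(y_hat, y, eps=1e-12):
    # eps is an assumption of this sketch: it keeps y_hat away from exactly 0 and 1,
    # where the logarithm is not finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def predict_class(y_hat):
    # Class 1 if P(y = 1 | x) > 0.5, class 0 otherwise (0.5 itself goes to class 0 here).
    return (y_hat > 0.5).astype(int)

z = np.array([[-4.0, -0.5, 0.0, 0.5, 4.0]])   # hypothetical values of w^T x + b
y = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]])     # hypothetical true labels
y_hat = sigmoid(z)

print(y_hat)
print(predict_class(y_hat))
print(cross_entropy_cost(y_hat, y))
```

The cost is small when the predicted probabilities are close to the true labels and grows quickly when a confident prediction turns out to be wrong.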

The Dataset

To build an interesting model, we will use a modified version of the MNIST dataset. You will find all relevant information at the following link: http://yann.lecun.com/exdb/mnist/.

The MNIST database is a large database of handwritten digits that we can use to train our model. The MNIST database contains 70,000 images. “The original black and white (bilevel) images from NIST were size normalized to fit in a 20×20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28×28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28×28 field” (source: http://yann.lecun.com/exdb/mnist/ ).

Our features will be the gray value for each pixel, so we will have 28 × 28 = 784 features whose values will go from 0 to 255 (gray values). The dataset contains all ten digits, from 0 to 9. With the following code, you can prepare the data to use in the sections below. As usual, let’s first import the necessary library.
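# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22,
# so this import requires an older scikit-learn release.
from sklearn.datasets import fetch_mldata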
Then let’s load the data.
mnist = fetch_mldata('MNIST original')
X,y = mnist["data"], mnist["target"]
Now X contains the input images and y the target labels (remember that the value we want to predict is called target in machine-learning jargon). Just typing X.shape will give you the shape of X: (70000, 784). Note that X has 70,000 rows (each row is an image) and 784 columns (each column is a feature, or a pixel gray value, in our case). Let’s check how many digits we have in our dataset.
for i in range(10):
    print ("digit", i, "appears", np.count_nonzero(y == i), "times")
That gives us the following:
digit 0 appears 6903 times
digit 1 appears 7877 times
digit 2 appears 6990 times
digit 3 appears 7141 times
digit 4 appears 6824 times
digit 5 appears 6313 times
digit 6 appears 6876 times
digit 7 appears 7293 times
digit 8 appears 6825 times
digit 9 appears 6958 times
It is useful to define a function to visualize the digits, to get an idea of how they look.
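import matplotlib  # needed for matplotlib.cm.binary below
def plot_digit(some_digit):
    # Each observation is a flat array of 784 gray values; reshape it to 28 x 28
    some_digit_image = some_digit.reshape(28,28)
    plt.imshow(some_digit_image, cmap = matplotlib.cm.binary, interpolation = "nearest")
    plt.axis("off")
    plt.show()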
For example, we can plot one chosen at random (see Figure 2-17).
plot_digit(X[36003])
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig17_HTML.jpg
Figure 2-17

The digit at index 36003 in the dataset. It is easily recognizable as a 5.

The model we want to implement here is a simple logistic regression for binary classification, so the dataset must be reduced to two classes, or in this case, to two digits. We choose ones and twos. Let’s extract from our dataset only the images that represent a 1 or a 2. Our neuron will try to recognize if a given image is of class 0 (a digit 1) or of class 1 (a digit 2).
X_train = X[np.any([y == 1,y == 2], axis = 0)]
y_train = y[np.any([y == 1,y == 2], axis = 0)]
Next, the input observations must be normalized. (Remember: You don’t want your input data to be too big when using the sigmoid activation function, because you have 784 of them.)
X_train_normalised = X_train/255.0
We chose 255, because each feature is the gray value of a pixel in the image, and gray levels in the source images go from 0 to 255. Later in the book I will discuss at length why we need to normalize the input features. For now, trust me that this is a necessary step. In each column, we want to have an input observation, and each row should represent a feature (a pixel gray value), so we must reshape the tensors
X_train_tr = X_train_normalised.transpose()
y_train_tr = y_train.reshape(1,y_train.shape[0])
and we can define a variable n_dim to contain the number of features
n_dim = X_train_tr.shape[0]

Now comes a very important point. The labels in our dataset as imported will be 1 or 2 (they simply tell you which digit the image represents). However, we will build our cost function with the assumption that our class labels are 0 and 1, so we must rescale our y_train_tr array.

Note

When doing binary classification, remember to check the values of the labels you are using for training. Sometimes, using the wrong labels (not 0 and 1) may cost you quite some time in understanding why the model is not working.

y_train_shifted = y_train_tr - 1
Now all images representing a 1 will have a label of 0, and all images representing a 2 will have a label of 1. Finally, let’s use some proper names for our Python variables.
Xtrain = X_train_tr
ytrain = y_train_shifted
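The preparation steps above (selecting two digits, scaling the gray values, putting the observations in columns, and shifting the labels) can be checked on a tiny synthetic stand-in for MNIST. In the following sketch, the arrays X_fake and y_fake, with 12 “images” of 4 pixels each, are made up for illustration; only the sequence of operations mirrors what we have done with the real data.

```python
import numpy as np

# Hypothetical stand-in for MNIST: 12 tiny "images" with 4 pixels each.
rng = np.random.default_rng(1)
X_fake = rng.integers(0, 256, size=(12, 4)).astype(float)
y_fake = np.array([0, 1, 2, 1, 2, 3, 1, 2, 9, 1, 2, 2], dtype=float)

# Keep only the observations whose label is 1 or 2.
mask = np.any([y_fake == 1, y_fake == 2], axis=0)
X_sel, y_sel = X_fake[mask], y_fake[mask]

# Scale the gray values to [0, 1] and put one observation in each column.
X_prepared = (X_sel / 255.0).transpose()
# Reshape the labels to 1 x m and shift them from {1, 2} to {0, 1}.
y_prepared = y_sel.reshape(1, y_sel.shape[0]) - 1

print(X_prepared.shape, y_prepared.shape)
print(np.unique(y_prepared))
```

Checking the shapes and the set of label values in this way, before building the model, is a cheap guard against the label problem described in the preceding Note.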
Figure 2-18 shows some of the digits we are dealing with.
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig18_HTML.jpg
Figure 2-18

Six random digits chosen from the dataset. The respective rescaled labels (remember: labels in our dataset are now 0 or 1) are given in brackets.

tensorflow Implementation

The tensorflow implementation is not difficult and is almost the same as for the linear regression. First, let’s define placeholders and variables.
tf.reset_default_graph()
X = tf.placeholder(tf.float32, [n_dim, None])
Y = tf.placeholder(tf.float32, [1, None])
learning_rate = tf.placeholder(tf.float32, shape=())
W = tf.Variable(tf.zeros([1, n_dim]))
b = tf.Variable(tf.zeros(1))
init = tf.global_variables_initializer()
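Note that the code is almost the same as the one we used for the linear regression model; the only difference is that W is now declared with dimensions 1 × n_dim and initialized with zeros, so no transpose is needed when computing the neuron output. However, we must define a different cost function (as discussed earlier) and a different neuron output (the sigmoid function).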
y_ = tf.sigmoid(tf.matmul(W,X)+b)
cost = - tf.reduce_mean(Y * tf.log(y_)+(1-Y) * tf.log(1-y_))
training_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
We have used the sigmoid function for the output of our neuron, with tf.sigmoid(). The code that will run the model is the same as that we have used for the linear regression. We have only changed the name of the function.
def run_logistic_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if (epoch % 500 == 0) and debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    return sess, cost_history
Let’s run the model and see the results. We will choose to start with a learning rate of 0.01.
sess, cost_history = run_logistic_model(learning_r = 0.01,
                                training_epochs = 5000,
                                train_obs = Xtrain,
                                train_labels = ytrain,
                                debug = True)
The output of our code (stopped after 3000 epochs) follows:
Reached epoch 0 cost J = 0.678598
Reached epoch 500 cost J = 0.108655
Reached epoch 1000 cost J = 0.078912
Reached epoch 1500 cost J = 0.066786
Reached epoch 2000 cost J = 0.059914
Reached epoch 2500 cost J = 0.055372
Reached epoch 3000 cost J = nan
What happened? Suddenly, at some point, our cost function assumes the value nan (not a number). It seems that the model does not do well after a certain point. If the learning rate is too big, or you initialize your weights wrongly, your values for $$ \widehat{y}^{(i)}=P\left(y^{(i)}=1 \mid x^{(i)}\right) $$ may get very close to zero or one (the sigmoid function assumes values very close to 0 or 1 for very big negative or positive values of z). Remember that in the cost function, you have the two terms tf.log(y_) and tf.log(1-y_). The log function is not defined for a value of zero, so if y_ is rounded to exactly 0 or 1, the code will try to evaluate tf.log(0), and you will get a nan (in floating point, tf.log(0) evaluates to -inf, and a product such as 0 * (-inf) is nan). As an example, we can run the model with a learning rate of 2.0. After only one epoch, you will already get a nan value for the cost function. It is easy to understand why, if you print out the value for b before and after the first training step. Simply modify your model code and use the following version:
def run_logistic_model(learning_r, training_epochs, train_obs, train_labels, debug = False):
    sess = tf.Session()
    sess.run(init)
    cost_history = np.empty(shape=[0], dtype = float)
    for epoch in range(training_epochs+1):
        print('epoch: ', epoch)
        print(sess.run(b, feed_dict={X:train_obs, Y: train_labels, learning_rate: learning_r}))
        sess.run(training_step, feed_dict = {X: train_obs, Y: train_labels, learning_rate: learning_r})
        print(sess.run(b, feed_dict={X:train_obs, Y: train_labels, learning_rate: learning_r}))
        cost_ = sess.run(cost, feed_dict={ X:train_obs, Y: train_labels, learning_rate: learning_r})
        cost_history = np.append(cost_history, cost_)
        if epoch % 500 == 0 and debug:
            print("Reached epoch",epoch,"cost J =", str.format('{0:.6f}', cost_))
    return sess, cost_history
You will get the following result (after stopping the training after just one epoch):
epoch:  0
[ 0.]
[-0.05966223]
Reached epoch 0 cost J = nan
epoch:  1
[-0.05966223]
[ nan]

You see how b goes from 0 to -0.05966223 and then to nan? Therefore, z = w^T X + b turns into nan, then y = σ(z) also turns into nan, and, finally, the cost function, being a function of y, will also result in nan. This is simply because the learning rate is way too big.
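The mechanism can be reproduced without tensorflow. The sketch below is a plain Python illustration under stated assumptions: the helper `ieee_log` is hypothetical and only mimics the floating-point behavior of returning -inf for log(0), and the value z = 40.0 is arbitrary. The tensorflow graph works in float32, where the sigmoid saturates at smaller values of z.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def ieee_log(x):
    """Mimic floating-point log: log(0) gives -inf instead of raising."""
    return math.log(x) if x > 0.0 else float("-inf")

# A large z makes the sigmoid round to exactly 1.0 in floating point.
y_hat = sigmoid(40.0)
print("y_hat == 1.0:", y_hat == 1.0)

# For an observation with label 1, the term (1 - Y) * log(1 - y_hat)
# becomes 0 * (-inf), which is nan.
label = 1.0
term = (1.0 - label) * ieee_log(1.0 - y_hat)
print("term:", term)

# nan is contagious: any sum or mean that includes it is also nan.
cost = -(label * ieee_log(y_hat) + term)
print("cost is nan:", math.isnan(cost))
```

Note that the observation in this example is classified correctly. The nan comes from the saturated output alone, not from a wrong prediction.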

What is the solution? You should try a different (read: much smaller) learning rate.

Let’s try and see if we can get a result that is more stable after 2500 epochs. We run the model with the call, as follows:
sess, cost_history = run_logistic_model(learning_r = 0.005,
                                training_epochs = 5000,
                                train_obs = Xtrain,
                                train_labels = ytrain,
                                debug = True)
The output of the command is
Reached epoch 0 cost J = 0.685799
Reached epoch 500 cost J = 0.154386
Reached epoch 1000 cost J = 0.108590
Reached epoch 1500 cost J = 0.089566
Reached epoch 2000 cost J = 0.078767
Reached epoch 2500 cost J = 0.071669
Reached epoch 3000 cost J = 0.066580
Reached epoch 3500 cost J = 0.062715
Reached epoch 4000 cost J = 0.059656
Reached epoch 4500 cost J = 0.057158
Reached epoch 5000 cost J = 0.055069
No more nan in our output. You can see a plot of the cost function in Figure 2-19. To evaluate our model, we must choose an optimizing metric (as discussed before). For a binary classification problem, a classical metric is the accuracy (which we can indicate with a), that is, the fraction of observations that the model classifies correctly. Mathematically, it can be calculated as
$$ a=\frac{\text{number of cases correctly identified}}{\text{total number of cases}} $$
To get the accuracy, we can run the following code. (Remember: We will classify an observation i as class 0 if $$ P\left(y^{(i)}=1 \mid x^{(i)}\right) \le 0.5 $$, or as class 1 if $$ P\left(y^{(i)}=1 \mid x^{(i)}\right) > 0.5 $$.)
correct_prediction1 = tf.equal(tf.greater(y_, 0.5), tf.equal(Y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction1, tf.float32))
print(sess.run(accuracy, feed_dict={X:Xtrain, Y: ytrain, learning_rate: 0.005}))
With this model, we reach an accuracy of 98.6%. Not bad for a network with only one neuron.
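The same accuracy computation can be sketched in plain Python, which may help when checking the tensorflow result by hand. The function name `accuracy`, the `threshold` parameter, and the five made-up probabilities and labels are assumptions for illustration only; they are not taken from the dataset.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the label."""
    correct = 0
    for p, y in zip(probabilities, labels):
        # Class 1 only if the predicted probability exceeds the threshold.
        predicted_class = 1 if p > threshold else 0
        if predicted_class == y:
            correct += 1
    return correct / len(labels)

# Made-up model outputs and true labels for five observations.
probabilities = [0.92, 0.15, 0.61, 0.40, 0.08]
labels = [1, 0, 0, 1, 0]

a = accuracy(probabilities, labels)
print("accuracy:", a)
```

Here three of the five observations are classified correctly. The third and fourth fall on the wrong side of the 0.5 threshold.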
../images/463356_1_En_2_Chapter/463356_1_En_2_Fig19_HTML.jpg
Figure 2-19

The cost function vs. epochs for a learning rate of 0.005

You could also try to run the previous model (with a learning rate of 0.005) for more epochs. You will discover that at about 7000 epochs, the nan will reappear. The solution here would be to reduce the learning rate as the number of epochs increases. A simple approach, such as halving the learning rate every 500 epochs, will get rid of the nans. I will discuss a similar approach in more detail later in the book.
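One possible reading of "halving the learning rate every 500 epochs" is the step schedule sketched below. The function name `halved_learning_rate` and its `halve_every` parameter are hypothetical and chosen only for this illustration.

```python
def halved_learning_rate(initial_rate, epoch, halve_every=500):
    """Return the learning rate for `epoch`, halved once every `halve_every` epochs."""
    return initial_rate * 0.5 ** (epoch // halve_every)

# Print the rate at a few epochs, starting from 0.005.
for epoch in (0, 499, 500, 1000, 2500):
    print(epoch, halved_learning_rate(0.005, epoch))
```

Because learning_rate is a placeholder in the model above, a value computed this way could be passed in feed_dict at each epoch in place of the constant learning_r.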

References

[1] Jeremy Hsu, “Biggest Neural Network Ever Pushes AI Deep Learning,” https://spectrum.ieee.org/tech-talk/computing/software/biggest-neural-network-ever-pushes-ai-deep-learning, 2015.

[2] Raúl Rojas, Neural Networks: A Systematic Introduction, Berlin: Springer-Verlag, 1996.

[3] Delve (Data for Evaluating Learning in Valid Experiments), “The Boston Housing Dataset,” www.cs.toronto.edu/~delve/data/boston/bostonDetail.html, 1996.

[4] Prajit Ramachandran, Barret Zoph, Quoc V. Le, “Searching for Activation Functions,” arXiv:1710.05941 [cs.NE], 2017.

[5] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, “On the Number of Linear Regions of Deep Neural Networks,” https://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.pdf, 2014.

[6] Brendan Fortuner, “Can Neural Networks Solve Any Problem?” https://towardsdatascience.com/can-neural-networks-really-learn-any-function-65e106617fc6, 2017.