Chapter 8. Deep Learning

Every model in Part I of this book employed classic machine learning algorithms that form the core of ML itself: logistic regression, random forests, and so on. Such models are often referred to as traditional machine learning models to differentiate them from deep-learning models. Recall from Chapter 1 that deep learning is a subset of machine learning that relies primarily on neural networks, and that most of what’s considered AI today is accomplished with deep learning. From recognizing objects in photos to real-time speech translation to using computers to generate art, music, poetry, and photorealistic faces, deep learning allows computers to perform feats that traditional machine learning does not.

I frequently introduce deep learning to software developers by challenging them to devise an algorithmic means for determining whether a photo contains a dog. If they offer a solution, I’ll counter with a dog picture that foils the algorithm. Traditional ML models can partially solve the problem, but when it comes to recognizing objects in images, deep learning represents the state of the art. It’s not terribly difficult to train a neural network to recognize dog pictures, sometimes more accurately than humans. Once you learn how to do that, it’s a small step forward to recognizing defective parts coming off an assembly line or bicycles passing in front of a self-driving car.

Neural networks have been around for decades, but it’s only in the past 10 years or so that sufficient compute power has been available to train sophisticated networks. Cutting-edge neural networks are trained on graphics processing units (GPUs) and tensor processing units (TPUs), often attached to high-performance computing clusters. GPUs are great for gaming because they deliver high-performance graphics. They are also efficient parallel processing machines that allow data scientists to train neural networks in a fraction of the time required on ordinary CPUs. Today, any researcher with a credit card can purchase an NVIDIA GPU or spin up GPUs in Azure or AWS and have access to compute power that researchers 20 years ago could only have dreamed of. This, more than anything else, has driven AI’s resurgence and precipitated continual advances in the state of the art.

This chapter is the first of several focused on deep learning. In it, you’ll learn:

  • What a neural network is and where the “deep” in deep learning comes from

  • How a neural network transforms input into output using simple mathematical operations

  • What happens when a neural network is trained, as well as the challenges that training entails

You won’t start building and training neural networks just yet; that begins in Chapter 9. Before you build a house, you need a foundation to build upon. That foundation begins right now.

Understanding Neural Networks

Neural networks come in many varieties. Convolutional neural networks (CNNs), for example, excel at computer-vision tasks such as classifying images. Recurrent neural networks (RNNs) find application in handwriting recognition and natural language processing (NLP), while generative adversarial networks, or GANs, enable computers to create art, music, and other content. But the first step in wrapping your head around deep learning is to understand what a neural network is and how it works.

The simplest type of neural network is the multilayer perceptron. It consists of nodes or neurons arranged in layers. The depth of the network is the number of layers; the width is the number of neurons in each layer, which can be different for every layer. State-of-the-art neural networks sometimes contain 100 or more layers and thousands of neurons in individual layers. A deep neural network is one that contains many layers, and it’s where the term deep learning is derived from.

The multilayer perceptron in Figure 8-1 contains three layers: an input layer with two neurons, a middle layer (also known as a hidden layer) with three neurons, and an output layer with one neuron. Because the input layer is often ignored when counting layers, some would argue that this network contains two layers, not three. Regardless, the network’s job is to take two floating-point values as input and produce a single floating-point number as output. Neural networks work with floating-point numbers. They only work with floating-point numbers. As with traditional machine learning models, a neural network can only process non-numeric data—for example, text strings—if the data is first converted to numbers.

Figure 8-1. Multilayer perceptron

The orange arrows in Figure 8-1 represent connections between neurons. Each neuron in each layer is connected to each neuron in the next layer, giving rise to the term fully connected layers. Each connection is assigned a weight, which is typically a small floating-point number. In addition, each neuron outside the input layer is assigned a bias, which is also a small floating-point number. Figure 8-2 shows a set of weights and biases that enable the network to sum two inputs (for example, to add 2 and 2). The blocks labeled “ReLU” represent activation functions, which apply simple nonlinear transforms to values propagated through the network. The most commonly used activation function is the rectified linear unit (ReLU) function, which passes positive numbers through unchanged while converting negative numbers to 0s. Without activation functions, a stack of fully connected layers would collapse into a single linear transformation, and the network could not model nonlinear data. And it’s no secret that real-world data tends to be nonlinear.

Figure 8-2. Weights and biases

Neurons perform simple linear transformations on data input to them. For a neuron with a single input x, the neuron’s value y is computed by multiplying x by the weight m assigned to the input and adding b, the neuron’s bias:

y = mx + b

Look familiar? That’s the equation for linear regression. Scikit-Learn has a Perceptron class built around the same weighted-sum-plus-bias computation, although it is a linear classifier rather than a regressor. It even offers classes named MLPRegressor and MLPClassifier for building simple multilayer perceptrons. Scikit-Learn is not, however, a deep-learning library. Real deep-learning libraries do more to support advanced neural networks.

Note

The combination of neurons that perform linear transformations and activation functions that apply nonlinear transforms is an embodiment of the universal approximation theorem, which states, loosely, that a network with enough neurons can approximate any continuous function f to arbitrary accuracy by summing the output from linear functions and transforming it with a nonlinear function. Textbooks often say that activation functions “add nonlinearity” to neural networks. Now you know why.
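One way to see why the nonlinear step matters is to chain two linear neurons together with and without an activation function between them. The following sketch is illustrative only: the weights and biases (2.0, 1.0, -3.0, 0.5) and the helper names are made up for the example and do not come from any figure in this chapter.

```python
def relu(x):
    # Pass positive values through unchanged; clamp negative values to 0
    return max(0.0, x)

def linear(x, m, b):
    # A single neuron with one input: y = mx + b
    return m * x + b

def stacked_linear(x):
    # Two linear neurons in sequence with no activation function between them
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)

def collapsed(x):
    # The equivalent single neuron: m = 2.0 * -3.0, b = (-3.0 * 1.0) + 0.5
    return linear(x, -6.0, -2.5)

def stacked_with_relu(x):
    # The same two neurons with ReLU applied between them
    return linear(relu(linear(x, 2.0, 1.0)), -3.0, 0.5)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, stacked_linear(x), collapsed(x), stacked_with_relu(x))
```

Without ReLU, the two neurons behave exactly like one neuron with m = -6 and b = -2.5, so the extra layer adds nothing. With ReLU in between, the output flattens out for inputs that drive the first neuron negative, which no single linear function can do.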

To turn inputs into outputs, a neural network assigns the input values to the neurons in the input layer. Then it multiplies the values of the input neurons by the weights connecting them to the neurons in the next layer, sums the inputs for each neuron, and adds the biases. It repeats this process to propagate values from left to right all the way to the output layer. Figure 8-3 shows what happens in the first two layers when the network in Figure 8-2 adds 2 and 2.
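The multiply, sum, and add-bias step described above can be sketched as a small function that computes one fully connected layer. This is a minimal illustration rather than how a deep-learning library implements it; the name dense_layer, the weights[j][i] layout, and the sample numbers are assumptions made for the example.

```python
def dense_layer(inputs, weights, biases):
    """Compute the values of one fully connected layer.

    weights[j][i] is the weight on the connection from input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Multiply each input by its connection weight and sum the results
        weighted_sum = sum(x * w for x, w in zip(inputs, neuron_weights))
        # Add the neuron's bias
        outputs.append(weighted_sum + bias)
    return outputs

# Hypothetical values, chosen only to make the arithmetic easy to follow
inputs = [1.0, 2.0]
weights = [[0.5, -1.0],    # connections into neuron 0
           [0.25, 0.75]]   # connections into neuron 1
biases = [0.1, 0.2]

layer_output = dense_layer(inputs, weights, biases)
print(layer_output)  # approximately [-1.4, 1.95]
```

Calling dense_layer once per layer, and feeding each layer’s output into the next, propagates values from the input layer to the output layer.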

Figure 8-3. Flow of data from the input layer to the hidden layer when adding 2 and 2

Values propagate from the hidden layer to the output layer the same way, with one exception: they are transformed by an activation function before they’re multiplied by weights. Remember that the ReLU activation function turns negative numbers into 0s. In Figure 8-4, the –1.83 calculated for the middle neuron in the hidden layer is converted to 0 when forwarded to the output layer, effectively eliminating that neuron’s contribution to the output.

Figure 8-4. Flow of data from the hidden layer to the output layer when adding 2 and 2

Given a set of weights and biases, it isn’t difficult to code a neural network by hand. The following Python code models the network in Figure 8-2:

# Weights
w0 = 0.9907079
w1 = 1.0264927
w2 = 0.01417504
w3 = -0.8950311
w4 = 0.88046944
w5 = 0.7524377
w6 = 0.794296
w7 = 1.1687347
w8 = 0.2406084
 
# Biases
b0 = -0.00070612
b1 = -0.06846002
b2 = -0.00055442
b3 = -0.00000929
 
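def relu(x):
    # Pass positive values through unchanged; convert negative values to 0
    return max(0, x)

def predict(x1, x2):
    # Hidden layer: each neuron sums its weighted inputs and adds its bias
    h1 = (x1 * w0) + (x2 * w1) + b0
    h2 = (x1 * w2) + (x2 * w3) + b1
    h3 = (x1 * w4) + (x2 * w5) + b2
    # Output layer: ReLU is applied to each hidden value before it is weighted
    y = (relu(h1) * w6) + (relu(h2) * w7) + (relu(h3) * w8) + b3
    return y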

If you’d like to see for yourself, paste the code into a Jupyter notebook and call the predict function with the inputs 2 and 2. The answer should be very close to the actual sum of 2 and 2.
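To watch the intermediate values as well as the final answer, the same network can be written so that it returns the hidden-layer values alongside the output. The weights and biases below are copied from the listing above; the list-based layout and the name predict_with_trace are assumptions made for this sketch.

```python
# Weights and biases copied from the listing above
hidden_weights = [(0.9907079, 1.0264927),     # into hidden neuron 1
                  (0.01417504, -0.8950311),   # into hidden neuron 2
                  (0.88046944, 0.7524377)]    # into hidden neuron 3
hidden_biases = [-0.00070612, -0.06846002, -0.00055442]
output_weights = [0.794296, 1.1687347, 0.2406084]
output_bias = -0.00000929

def relu(x):
    return max(0.0, x)

def predict_with_trace(x1, x2):
    # Input layer to hidden layer: weighted sums plus biases
    hidden = [x1 * wa + x2 * wb + b
              for (wa, wb), b in zip(hidden_weights, hidden_biases)]
    # ReLU is applied before values are forwarded to the output layer
    activated = [relu(h) for h in hidden]
    # Hidden layer to output layer
    y = sum(a * w for a, w in zip(activated, output_weights)) + output_bias
    return hidden, activated, y

hidden, activated, y = predict_with_trace(2, 2)
print([round(h, 2) for h in hidden])  # the middle value is the -1.83 from Figure 8-4
print(activated)                      # the middle value has been converted to 0
print(round(y, 2))
```

The output lands near 4 rather than exactly on it because the weights and biases only approximate an adder.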

For a given problem, there are infinitely many combinations of weights and biases that produce the desired outcome. Figure 8-5 shows the same network with a completely different set of weights and biases. Yet if you plug the values into the preceding code (or propagate values through the network by hand), you’ll find that the network is equally capable of adding 2 and 2, or other small values for that matter.

Figure 8-5. Adding 2 and 2 with a different set of weights and biases

Given a set of weights and biases, using a neural network to make predictions is simplicity itself. It’s little more than multiplication and addition. But coming up with a set of weights and biases to begin with is a challenge. It’s why neural networks must be trained.

Training Neural Networks

Training a traditional machine learning model fits it to a dataset. Neural networks require training too, and it is during training that weights and biases are calculated. Weights are typically initialized with small random numbers. Biases are usually initialized with 0s. In its untrained state, a neural network can do little more than generate random outputs. Once training is complete, the weights and biases enable the network to distinguish dogs from cats, translate a book review to another language, or do whatever else it was designed to do.
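The starting state described above, small random weights and zero biases, can be sketched in a few lines. The uniform range of plus or minus 0.1, the fixed seed, and the function names are arbitrary choices made for illustration; deep-learning libraries generally ship their own initialization schemes.

```python
import random

def init_layer(n_inputs, n_neurons, rng, scale=0.1):
    # One row of weights per neuron, one weight per incoming connection
    weights = [[rng.uniform(-scale, scale) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    # Biases start at 0
    biases = [0.0] * n_neurons
    return weights, biases

def init_network(layer_sizes, seed=0):
    # layer_sizes lists the number of neurons per layer, input layer first
    rng = random.Random(seed)
    return [init_layer(n_in, n_out, rng)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# Same shape as the network in Figure 8-1: 2 inputs, 3 hidden neurons, 1 output
network = init_network([2, 3, 1])
for weights, biases in network:
    print(weights, biases)
```

A network initialized this way has the right shape but produces meaningless output until training replaces the random weights with useful ones.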

What happens when a neural network is trained? At a high level, training samples are fed through the network, the error (the difference between the computed output and the correct output) is computed using a loss function, and a backpropagation algorithm goes backward through the network adjusting the weights and biases (Figure 8-6). This is done repeatedly until the error is sufficiently small. With each iteration, the weights and biases become incrementally more refined and the error commensurately smaller.
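The feed-forward, measure-error, adjust cycle can be sketched with the smallest possible model: a single linear neuron with two inputs that learns to add them. This is a simplified stand-in for backpropagation through a multilayer network, not a reproduction of it. The loss function (mean squared error), the learning rate of 0.1, the 200 passes, and the randomly generated training data are all assumptions made for the example.

```python
import random

def train_adder(epochs=200, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Training samples: pairs of small numbers and their true sums
    samples = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(50)]
    targets = [x1 + x2 for x1, x2 in samples]

    # Small random weights and a zero bias
    w1, w2, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1), 0.0
    history = []
    n = len(samples)

    for _ in range(epochs):
        grad_w1 = grad_w2 = grad_b = 0.0
        loss = 0.0
        for (x1, x2), target in zip(samples, targets):
            prediction = w1 * x1 + w2 * x2 + b   # forward pass
            error = prediction - target
            loss += error ** 2                   # squared error
            # Derivatives of the squared error with respect to each parameter
            grad_w1 += 2 * error * x1
            grad_w2 += 2 * error * x2
            grad_b += 2 * error
        # Nudge each parameter against its averaged gradient
        w1 -= learning_rate * grad_w1 / n
        w2 -= learning_rate * grad_w2 / n
        b -= learning_rate * grad_b / n
        history.append(loss / n)

    return (w1, w2, b), history

(w1, w2, b), history = train_adder()
print(w1, w2, b)                 # should approach 1, 1, and 0
print(history[0], history[-1])   # error before and after training
```

Each pass shrinks the error a little, and the weights drift toward 1, 1, and 0, the values that make the neuron compute x1 + x2.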

Figure 8-6. Adjusting weights and biases during training

The most critical component of the backpropagation regimen is the optimizer, which on each backward pass decides how much and in which direction, positive or negative, to adjust the weights and biases. Data scientists work constantly to find better and more efficient optimizers to train networks more accurately and in less time.

Do a search on “neural networks” and you’ll turn up lots of articles with lots of complex math. Most of the math is related to optimization. Because there are so many weights and biases, an optimizer can’t simply guess how to adjust them. A neural network containing two hidden layers with 1,000 neurons each has 1,000,000 connections between those two layers alone, and therefore at least 1,000,000 weights to adjust. Training would take forever if the optimization strategy were simply randomly guessing. An optimizer must be intelligent enough to make adjustments that reduce the error in each successive iteration.
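The arithmetic behind that figure is easy to check. In a fully connected network, the number of weights between two adjacent layers is the product of their sizes, and there is one bias per neuron outside the input layer. The helper name count_parameters is hypothetical.

```python
def count_parameters(layer_sizes):
    # layer_sizes lists the number of neurons per layer, input layer first
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    # Every neuron outside the input layer has one bias
    biases = sum(layer_sizes[1:])
    return weights, biases

# The network in Figure 8-1: 2 inputs, 3 hidden neurons, 1 output
print(count_parameters([2, 3, 1]))

# Only the connections between two 1,000-neuron hidden layers
print(count_parameters([1000, 1000]))
```

The first call reports 9 weights and 4 biases, matching w0 through w8 and b0 through b3 in the earlier listing. The second reports 1,000,000 weights, before counting the connections to the input and output layers.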

Data scientists use plots like the one in Figure 8-7 to visualize what optimizers do. The plot is called a loss landscape. It has been reduced to three dimensions for visualization purposes, but in reality, it contains many dimensions—sometimes millions of them. The multicolored contour charts the error for different combinations of weights and biases. The optimizer’s goal is to navigate the contour and find the combination that produces the least error, which corresponds to the lowest point, or global minimum, in the loss landscape.

Figure 8-7. Loss landscape (Source: Alexander Amini, Ava Soleimany, Sertac Karaman, and Daniela Rus, “Spatial Uncertainty Sampling for End-to-End Control,” NeurIPS Bayesian Deep Learning [2017], https://arxiv.org/pdf/1805.04829)

The optimizer’s job isn’t an easy one. It involves partial derivatives (calculating the slope of the contour with respect to each weight and bias), gradient descent (adjusting the weights and biases to go down the slope rather than up it or sideways), and learning rates, which drive the fractional adjustments made to the weights and biases in each backpropagation pass. If the learning rate is too great, the optimizer might miss the global minimum. If it’s too small, the network will take a long time to train. Modern optimizers use adaptive learning rates that take smaller steps as they approach a minimum, where the slope of the contour is 0. To complicate matters, the optimizer must avoid getting trapped in local minima so that it can continue traversing the contour toward the global minimum where the error is the smallest. It also has to be wary of “saddle points” where the slope increases in one direction but falls off in a perpendicular direction.
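The effect of the learning rate shows up even on a one-dimensional toy problem. The sketch below runs plain gradient descent on the function w squared, whose slope is 2w and whose minimum is at 0. The function, the starting point, the step count, and the three learning rates are arbitrary choices for illustration, and a fixed learning rate is used rather than the adaptive ones that modern optimizers employ.

```python
def gradient_descent(learning_rate, start=5.0, steps=25):
    w = start
    for _ in range(steps):
        gradient = 2 * w               # slope of w**2 at the current point
        w -= learning_rate * gradient  # step downhill, scaled by the learning rate
    return w

too_small = gradient_descent(0.001)  # barely moves from the starting point
reasonable = gradient_descent(0.1)   # ends up close to the minimum at 0
too_large = gradient_descent(1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the smallest rate, 25 steps are not nearly enough to reach the minimum. With the largest, each step jumps past the minimum and lands farther away than it started, so the error grows instead of shrinking.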

Note

If you research gradient descent, you’ll encounter terms such as stochastic gradient descent (SGD) and mini-batch gradient descent (MBGD). Optimization via gradient descent is an iterative process in which samples are fed forward through the network, gradients are computed, and the gradients are combined with the learning rate to update weights and biases. Updating the weights and biases after every individual sample, which is SGD in its strictest sense, is computationally expensive, so training typically involves running batches of perhaps 30 to 40 samples through the network, averaging the error, and then performing a backpropagation pass. That’s MBGD. It speeds training and helps the optimizer bypass local minima. For more information, and for a very readable introduction to the challenges inherent to training neural networks, see the article “How Neural Networks Are Trained”.
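Mini-batching can be sketched by shuffling the training samples, slicing them into batches, and updating the parameters once per batch using the batch’s average gradient. The example below fits a single neuron, y = mx + b, to synthetic data generated from y = 3x + 1. The batch size of 32, the learning rate, the pass count, and the function names are all assumptions chosen for illustration.

```python
import random

def mini_batches(samples, batch_size, rng):
    # Shuffle a copy, then yield consecutive slices of batch_size samples
    shuffled = samples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

def train(samples, batch_size=32, epochs=100, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    for _ in range(epochs):
        for batch in mini_batches(samples, batch_size, rng):
            # Average the gradients of the squared error over the batch
            grad_m = sum(2 * (m * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (m * x + b - y) for x, y in batch) / len(batch)
            # One update per batch rather than one per sample
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b
    return m, b

# Synthetic, noise-free data from y = 3x + 1 for x between -2 and 2
data = [(x / 10, 3 * (x / 10) + 1) for x in range(-20, 21)]
m, b = train(data)
print(m, b)  # should approach 3 and 1
```

Setting batch_size to 1 in this sketch updates the parameters after every sample, while setting it to the size of the dataset performs one update per pass over the data.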

Neural networks are fundamentally simple. Training them is mathematically complex. Fortunately, you don’t have to understand everything that happens during training in order to build them. Deep-learning libraries such as Keras and TensorFlow insulate you from the math and provide cutting-edge optimizers to do the heavy lifting. But now when you use one of these libraries and it asks you to pick a loss function and an optimizer, you’ll understand what it’s asking for and why.

Summary

Deep learning is a subset of machine learning that relies on deep neural networks, and it is the root of modern AI. It’s how computers identify objects in images, translate text and speech into other languages, generate artwork and music, and perform other tasks that were virtually impossible a few years ago.

The multilayer perceptron is a simple neural network comprising layers of neurons. Each neuron turns input into output using a simple mathematical formula. Activation functions further transform the data as it passes between layers by introducing nonlinearities, enabling neural networks to fit to a variety of datasets. Hidden layers between the input layer and the output layer perform the bulk of the computational work, and a multilayer perceptron with many hidden layers is referred to as a deep neural network.

Training a neural network fits it to a dataset by iteratively adjusting weights and biases—the weights connecting neurons in adjacent layers and the biases assigned to the neurons themselves—to produce the desired outcome. The backpropagation passes that adjust the weights and biases are the heart of the training regimen. The component responsible for making adjustments is the optimizer; its ultimate goal is to find the optimum combination of weights and biases with as few backpropagation passes as possible.

Now that you understand how neural networks work, the next step is to learn how to build and train them. For that, data scientists rely on frameworks such as Keras and TensorFlow. Chapter 9 begins a deep dive into both.
