Learning

I want you to think about how you learn. Not learning styles, mind; no, I want you to give your learning process a long, hard thought. Think of the various ways you learn. Maybe you once touched a hot stove. Or, if you've ever learned a new language, maybe you started out by memorizing phrases before becoming fluent. Think about all the chapters that preceded this one. What do they have in common?

In broad strokes, learning is done by means of corrections. If you touched a hot stove, you made a mistake. The correction is to never touch a hot stove again. You've learned not to touch the stove while it's hot.

Similarly, a neural network learns by means of correction. If we want to train a machine to classify handwriting, we need to provide some sample images and tell the machine the correct labels. If the machine predicts the labels wrongly, we need to tell it to change something in the neural network and try again.

What can be changed? The weights of course. The inputs can't be changed; they're inputs. But we can always try different weights. Hence, the process of learning can be broken down into two steps:

  • Telling the neural network that it is wrong when it makes a mistake.
  • Updating the weights so that the next try will yield a better result.

When broken down like this, we have a good idea of how to proceed next. One way would be a binary determination mechanism: if the neural network predicted the correct answer, don't update the weights. If it's wrong, update the weights.

How to update the weights, then? Well, one way would be to completely replace the weight matrix with new values and try again. Since the weight matrix was initially filled with values drawn from a random distribution, the new weight matrix would simply be another random matrix.

It should be quite obvious that these two methods, when combined, would take a very very long time before the neural network learns anything; it's as if we're simply guessing our way into the correct weight matrices.
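To see just how inefficient this is, here is a minimal sketch in Python (my own illustration; the toy dataset, the threshold rule, and all names are assumptions, not anything from a real network): it keeps drawing entirely fresh random weights until, by sheer luck, every prediction comes out right.

```python
import random

# Toy dataset: (inputs, correct label). The hidden rule the
# machine must stumble upon is: label is 1 when x0 + x1 > 1.
data = [((0.2, 0.1), 0), ((0.9, 0.8), 1), ((0.1, 0.9), 0), ((0.7, 0.6), 1)]

def predict(weights, x):
    # A single linear unit with a hard threshold.
    return 1 if weights[0] * x[0] + weights[1] * x[1] > 1 else 0

random.seed(0)
found = False
for tries in range(1, 100_000):
    # "Update" step: throw the old weights away entirely and
    # draw a fresh random weight matrix (here, just a pair).
    weights = [random.uniform(-1, 2), random.uniform(-1, 2)]
    # "Correction" step: a binary right-or-wrong check, nothing more.
    if all(predict(weights, x) == y for x, y in data):
        found = True
        break

print(tries, found)  # how many blind guesses it took
```

Even on this four-example toy problem the loop burns through guess after guess; with the millions of weights in a real network, pure guessing would effectively never terminate.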

Instead, modern neural networks use the concept of backpropagation to tell the neural network that it's made a mistake, and some form of gradient descent to update the weights.

The specifics of backpropagation and gradient descent are outside the scope of this chapter (and book). I'll, however, briefly run through the big ideas by sharing a story. I was having lunch with a couple of friends who also work in machine learning, and that lunch ended with us arguing. This was because I had casually mentioned that backpropagation was "discovered", as opposed to "invented". My friends were adamant that backpropagation was invented, not discovered. My reasoning was simple: mathematics is "discovered" if multiple people independently stumble upon the same formulation; it is "invented" if there is no such parallel discovery.

Backpropagation, in various forms, has been rediscovered constantly over time. It was first discovered during the invention of linear regression. I should note that this was a very specific form of backpropagation, particular to linear regression: the sum of squared errors can be propagated back to the inputs by differentiating it with regard to those inputs.

We start with a cost. Remember that we have to tell the neural network it has made a mistake. We do so by telling the neural network the cost of making a prediction, computed by what is called a cost function. We can define the cost so that when the neural network makes a correct prediction the cost is low, and when it makes a wrong prediction the cost is high.
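As a concrete sketch of the idea (the squared error below is one common choice of cost, picked by me for illustration, not the only option):

```python
def squared_error(prediction, target):
    # Zero when the prediction is exactly right; grows
    # quadratically the further off the prediction is.
    return (prediction - target) ** 2

print(squared_error(1.0, 1.0))  # correct prediction: cost is 0.0
print(squared_error(0.2, 1.0))  # wrong prediction: much higher cost
```

The exact formula matters less than the shape: low cost for right answers, high cost for wrong ones, so that "reduce the cost" becomes a stand-in for "make fewer mistakes".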

Imagine for now that the cost function is a simple quadratic, say c(x) = x². How do you know at what values of x the cost will be lowest? From high-school math, we know that the solution is to differentiate c with regard to x and solve for x when the derivative is 0:

dc/dx = 2x = 0, which gives x = 0.

Backpropagation takes the same cue. In short, backpropagation is just a bunch of partial differentiations with regard to the weights. The main difference between our toy example and real backpropagation is that the derivative of our expression is easy to solve analytically. For more complex mathematical expressions, it can be computationally too expensive to solve for the minimum directly. Instead, we rely on gradient descent to find the answer.
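To make "a bunch of partial differentiations" a little more concrete, here is a small sketch of my own (a one-weight "network" y = sigmoid(w·x) with a squared-error cost, not the full algorithm): the gradient with respect to the weight is built by chaining the partial derivatives of each stage, and checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, x, target):
    # Forward pass: one weight, one input, sigmoid activation,
    # squared-error cost.
    return (sigmoid(w * x) - target) ** 2

def grad(w, x, target):
    # Backward pass: the chain rule, stage by stage.
    y = sigmoid(w * x)
    dcost_dy = 2.0 * (y - target)   # d/dy of (y - t)^2
    dy_dz = y * (1.0 - y)           # derivative of sigmoid
    dz_dw = x                       # d/dw of (w * x)
    return dcost_dy * dy_dz * dz_dw

w, x, target = 0.5, 1.5, 1.0
analytic = grad(w, x, target)
eps = 1e-6
numeric = (cost(w + eps, x, target) - cost(w - eps, x, target)) / (2 * eps)
print(analytic, numeric)  # the two estimates should agree closely
```

Real backpropagation does exactly this chaining, layer by layer, for every weight in the network at once.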

Gradient descent assumes we start our x somewhere and update x iteratively, moving toward the lowest cost. In each iteration, we update the weights. The simplest form of gradient descent is to subtract a small fraction of the gradient (the gradient scaled by a learning rate) from the weights themselves.
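That iteration can be sketched in a few lines. This is my own minimal illustration on the cost c(x) = x², whose gradient is 2x; the starting point and learning rate are arbitrary choices:

```python
def cost(x):
    return x ** 2

def gradient(x):
    return 2 * x  # derivative of x^2

x = 3.0              # start somewhere (arbitrary)
learning_rate = 0.1  # step size (arbitrary, kept small)
for _ in range(100):
    # Step against the gradient: downhill on the cost surface.
    x = x - learning_rate * gradient(x)

print(x)  # ends up very close to 0, where the cost is lowest
```

Each step shrinks x toward 0, the same answer we got analytically, but found without ever solving the equation dc/dx = 0 directly.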

The key takeaway is the powerful notion that you can tell the inputs that an error has occurred by differentiating the cost function with regard to them, and then following those derivatives downhill to the point at which the cost is at its minimum.
