Training MLPs with backpropagation

This is where backpropagation comes in, which is an algorithm for efficiently computing the gradient of the cost function in a neural network. Some might say that it is basically a fancy word for the chain rule, which is a means to calculate the derivative of composite functions, that is, functions whose inputs are themselves functions of other variables. Nonetheless, it is a method that helped bring the field of artificial neural networks back to life, so we should be thankful for that.

Understanding backpropagation involves quite a bit of calculus, so I will only give you a brief introduction here.

Let's remind ourselves that the cost function, and therefore its gradient, depends on the difference between the true output (y_i) and the current output (ŷ_i) for every data sample, i. If we choose to define the cost function as the mean squared error, the corresponding equation would look like this:
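One common way to write it, averaging the squared differences over all N data samples, is:

$$E = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$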

Here, the sum is over all data samples, i. Of course, the output of a neuron (ŷ) depends on its input (z), which, in turn, depends on the input features (x_j) and weights (w_j):
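Written out for a single neuron, and assuming the neuron applies some activation function φ to its input (the choice of φ is left open here), this reads:

$$\hat{y} = \phi(z), \qquad z = \sum_j w_j x_j$$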

Therefore, in order to calculate the slope of the cost function as shown in the preceding equations, we have to calculate the derivative of E with respect to some weight, w_j. This calculation can involve quite a bit of mathematics because E depends on ŷ, which depends on z, which depends on w_j. However, thanks to the chain rule, we know that we can separate this calculation in the following way:
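With the dependencies just described (E on ŷ, ŷ on z, and z on w_j), the chain rule gives:

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_j}$$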

Here, the term on the left-hand side is the partial derivative of E with respect to w_j, which is what we are trying to calculate. If we had more neurons in the network, the equation would not end at w_j but might include some additional terms describing how the error flows through different hidden layers.

This is also where backpropagation gets its name from: similar to the activity flowing from the input layer via the hidden layers to the output layer, the error gradient flows backward through the network. In other words, the error gradient in a hidden layer depends on the error gradient in the output layer, and so on. You can see that it is as if the error were propagated backward through the network.
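To make this backward flow a little more concrete, it is often written using a "delta" term δ^(l) for the error gradient at layer l (a notational convenience beyond the symbols used above):

$$\delta^{(l)} = \left( W^{(l+1)\top} \, \delta^{(l+1)} \right) \odot \phi'\!\left( z^{(l)} \right)$$

Here, W^(l+1) holds the weights connecting layer l to layer l+1, and ⊙ denotes element-wise multiplication, so each layer's error gradient is computed from the one that comes after it.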

Well, this is still kind of complicated, you might say. The good news, however, is that the only term that depends on w_j in the preceding equation is the rightmost one, which makes the math at least a little easier. But you are right, this is still kind of complicated, and there's no reason for us to dig deeper here.
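Still, it can help to see the three chain-rule factors side by side in code. The following is a minimal sketch for a single neuron trained on the mean squared error above; the sigmoid activation, the toy data, and all variable names are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, phi(z) = 1 / (1 + exp(-z)); an assumed choice of phi."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: N samples with D input features each (illustrative values only).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))        # input features x_j
y = rng.integers(0, 2, size=100)     # true outputs y_i
w = rng.normal(size=3)               # weights w_j

# Forward pass: z depends on x_j and w_j, and y_hat depends on z.
z = X @ w                            # z_i = sum_j w_j * x_ij
y_hat = sigmoid(z)                   # y_hat_i = phi(z_i)
E = np.mean((y - y_hat) ** 2)        # mean squared error

# Backward pass: chain rule, dE/dw_j = dE/dy_hat * dy_hat/dz * dz/dw_j.
dE_dyhat = 2.0 * (y_hat - y) / len(y)    # derivative of the mean squared error
dyhat_dz = y_hat * (1.0 - y_hat)         # derivative of the sigmoid
dz_dw = X                                # the only factor that depends on w_j
grad = dz_dw.T @ (dE_dyhat * dyhat_dz)   # one partial derivative per weight

# Quick numerical check of the first component of the gradient.
eps = 1e-6
w_plus = w.copy()
w_plus[0] += eps
E_plus = np.mean((y - sigmoid(X @ w_plus)) ** 2)
print(grad[0], (E_plus - E) / eps)       # the two numbers should roughly agree
```

The last two lines compare the analytic gradient to a finite-difference estimate, which is a handy sanity check whenever you implement backpropagation by hand.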

For more information on backpropagation, I can warmly recommend the following book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016), Deep Learning (Adaptive Computation and Machine Learning series), MIT Press, ISBN 978-0-262-03561-3.

After all, what we are really interested in is how all this stuff works in practice.
