B. Backpropagation

In this appendix, we use the formal neural network notation from Appendix A to dive into the partial-derivative calculus behind the backpropagation method introduced in Chapter 8.

Let’s begin by defining some additional notation to help us along. Backpropagation works backwards, so the notation is centered on the final layer (denoted L), and the earlier layers are annotated relative to it (L – 1, L – 2, . . . L – n). The weights, biases, and function outputs are labeled with this same layer notation. Recall from Equations 7.1 and 7.2 that the layer activation aL is calculated by multiplying the preceding layer’s activation aL–1 by the weight wL and adding the bias bL to produce zL, and then passing zL through an activation function (denoted simply as σ here). Also, we apply a simple cost function at the end; here we’re using the quadratic cost (squared Euclidean distance). Thus, for the final layer we have:

z^L = w^L · a^{L−1} + b^L    (Equation B.1)

a^L = σ(z^L)    (Equation B.2)

C = ½ (y − a^L)²    (Equation B.3)

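As a concrete illustration, the forward pass of Equations B.1 through B.3 can be sketched in a few lines of Python. The scalar values below (the preceding activation, weight, bias, and target) are hypothetical, and a logistic sigmoid stands in for σ:

```python
import math

def sigmoid(z):
    # Logistic activation: sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar values for a single-neuron final layer
a_prev = 0.6          # a^{L-1}: activation from the preceding layer
w_L, b_L = 0.5, 0.1   # weight and bias of layer L
y = 1.0               # target output

z_L = w_L * a_prev + b_L   # Equation B.1
a_L = sigmoid(z_L)         # Equation B.2
C = 0.5 * (y - a_L) ** 2   # Equation B.3: quadratic cost
```

This is a sketch of the single-neuron case only; the multi-neuron generalization appears later in the appendix.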
In every iteration, we need the gradient of the cost with respect to a layer’s output (∂C/∂aL); it is through this value that the total error of the system is propagated backwards. We’ll call this value δL. Because backpropagation runs back-to-front, we start with the output layer. This layer is a special case given that the error originates here in the form of the cost function and there are no layers after it. Thus, δL is given as follows:

δ^L = ∂C/∂a^L = a^L − y    (Equation B.4)

Again, this is a special case for the initial δ value; the remaining layers’ δ values are computed differently (more on that shortly). Now, to update the weights in layer L we need to find the gradient of the cost w.r.t. (with respect to) the weights, ∂C/∂wL. According to the chain rule, this is the product of the gradient of the cost w.r.t. the layer’s output, the gradient of the activation function w.r.t. zL, and the gradient of zL w.r.t. the weights wL:

∂C/∂w^L = (∂C/∂a^L) · (∂a^L/∂z^L) · (∂z^L/∂w^L)    (Equation B.5)

Since ∂C/∂aL = δL (Equation B.4), ∂aL/∂zL = σ′(zL), and ∂zL/∂wL = aL–1, this equation can be simplified to:

∂C/∂w^L = δ^L · σ′(z^L) · a^{L−1}    (Equation B.6)

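A minimal sketch of Equations B.4 and B.6 in Python, continuing the hypothetical scalar setup with a logistic sigmoid for σ:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the logistic function: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical scalar values (same single-neuron setup as before)
a_prev, w_L, b_L, y = 0.6, 0.5, 0.1, 1.0
z_L = w_L * a_prev + b_L
a_L = sigmoid(z_L)

delta_L = a_L - y                                  # Equation B.4
dC_dw_L = delta_L * sigmoid_prime(z_L) * a_prev    # Equation B.6
```

The sign follows from C = ½(y − aL)²: differentiating w.r.t. aL gives aL − y, so the gradient is negative when the output undershoots the target.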
This value is essentially the relative amount by which the weights at layer L affect the total cost, and we use this to update the weights at this layer. Our work isn’t complete, however; now we need to continue down the rest of the layers. For layer L – 1:

∂C/∂a^{L−1} = (∂C/∂a^L) · (∂a^L/∂z^L) · (∂z^L/∂a^{L−1})    (Equation B.7)

Again, ∂C/∂aL = δL (Equation B.4). In this way, the total error is being incorporated down the line, or backpropagated. Taking the derivatives of the remaining terms (note that ∂zL/∂aL–1 = wL), the equation becomes:

δ^{L−1} = ∂C/∂a^{L−1} = δ^L · σ′(z^L) · w^L    (Equation B.8)

Now we need to find the gradient of the cost w.r.t. the weights at this layer, L – 1, as before:

∂C/∂w^{L−1} = (∂C/∂a^{L−1}) · (∂a^{L−1}/∂z^{L−1}) · (∂z^{L−1}/∂w^{L−1})    (Equation B.9)

Once again, substituting δL–1 for ∂C/∂aL–1 (Equation B.8) and taking the derivatives of the other terms, we get:

∂C/∂w^{L−1} = δ^{L−1} · σ′(z^{L−1}) · a^{L−2}    (Equation B.10)

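One full backward step, from the output down through layer L – 1, might look as follows in Python. All scalar values are hypothetical and the logistic sigmoid stands in for σ:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical scalar chain: layer L-2 activation feeds layer L-1, which feeds layer L
a_Lm2 = 0.3
w_Lm1, b_Lm1 = 0.8, -0.2   # parameters of layer L-1
w_L, b_L = 0.5, 0.1        # parameters of layer L
y = 1.0

# Forward pass (Equations B.1 and B.2, applied at each layer)
z_Lm1 = w_Lm1 * a_Lm2 + b_Lm1
a_Lm1 = sigmoid(z_Lm1)
z_L = w_L * a_Lm1 + b_L
a_L = sigmoid(z_L)

# Backward pass
delta_L = a_L - y                                       # Equation B.4
delta_Lm1 = delta_L * sigmoid_prime(z_L) * w_L          # Equation B.8
dC_dw_Lm1 = delta_Lm1 * sigmoid_prime(z_Lm1) * a_Lm2    # Equation B.10
```

A handy sanity check on a sketch like this is to compare `dC_dw_Lm1` against a finite-difference estimate of the same derivative.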
This process is repeated layer by layer all the way down to the first layer.

To recap, we first find δL (Equation B.4), the gradient of the cost function (Equation B.3) w.r.t. the output of layer L, and we use that value in the equation for the gradient of the cost function w.r.t. the weights in layer L (Equation B.6). In the next layer, we find δL–1 (Equation B.8), the gradient of the cost w.r.t. the output of layer L – 1. As before, this is used in the equation to calculate the gradient of the cost function w.r.t. the weights in layer L – 1 (Equation B.10). And so on; backpropagation continues until we reach the model inputs.
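The layer-by-layer recursion just described can be sketched as a short Python loop. The chain of single-neuron layers below is hypothetical; the forward pass stores every z and a so the backward pass can reuse them:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# A toy chain of single-neuron layers (hypothetical weights and biases)
weights = [0.5, -0.3, 0.8]   # w^1 ... w^L
biases  = [0.1,  0.2, -0.1]  # b^1 ... b^L
x, y = 0.7, 1.0              # network input and target

# Forward pass, storing each z and a for the backward pass
activations = [x]
zs = []
for w, b in zip(weights, biases):
    z = w * activations[-1] + b
    zs.append(z)
    activations.append(sigmoid(z))

# Backward pass: start with delta at the output (Equation B.4), collect the
# weight gradient at each layer (Equations B.6 / B.10), and propagate delta
# down to the next layer (Equation B.8).
delta = activations[-1] - y
grads = [0.0] * len(weights)
for l in range(len(weights) - 1, -1, -1):
    grads[l] = delta * sigmoid_prime(zs[l]) * activations[l]
    delta = delta * sigmoid_prime(zs[l]) * weights[l]
```

Note the order of the two lines inside the loop: the gradient at layer l must be taken before δ is updated for the layer below.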

Up to this point in this appendix, we’ve dealt only with networks that have a single input, a single neuron per hidden layer, and a single output. In practice, deep learning models are never this simple. Thankfully, the math shown above scales straightforwardly to multiple neurons per layer and multiple network inputs and outputs.

Consider the case where there are multiple output classes, such as when you’re classifying MNIST digits. In this case, there are 10 output classes (n = 10) representing the digits 0–9. For each class, the model provides a probability that a given input image belongs to that class. To find the total cost, we find the sum of the (quadratic, in this case) cost over all the classes:

C = ½ ∑_{j=1}^{n} (y_j − a_j^L)²    (Equation B.11)

In Equation B.11, aL and y are vectors, each containing n elements.
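A quick NumPy sketch of Equation B.11; the output activations and the one-hot target below are hypothetical:

```python
import numpy as np

# Hypothetical output activations and one-hot target for 10 MNIST classes
a_L = np.array([0.05, 0.02, 0.80, 0.01, 0.02, 0.03, 0.01, 0.02, 0.02, 0.02])
y = np.zeros(10)
y[2] = 1.0   # the true class is the digit 2

# Equation B.11: total quadratic cost summed over all n classes
C = 0.5 * np.sum((y - a_L) ** 2)
```

Because the target is one-hot, the cost is dominated by the true class’s term whenever its predicted probability is far from 1.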

Examining ∂C/∂wL for this, the output layer, we must account for the fact that there may be many neurons in the final hidden layer, each of them connected to each output neuron. It’s helpful here to switch the notation slightly: Let i index the neurons of the final hidden layer and j index the neurons of the output layer. In this way, we have a matrix of weights with a row for each output neuron and a column for each hidden-layer neuron, and each weight can be denoted wji. So now, we find the gradient on each weight (remember, there are i × j weights: one for each connection between each neuron in the two layers):

∂C/∂w_{ji} = δ_j^L · σ′(z_j^L) · a_i^{L−1}    (Equation B.12)

We do this for every single weight in the layer, creating a gradient matrix with the same j × i shape as the weight matrix.
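In NumPy, the whole gradient matrix of Equation B.12 falls out of a single outer product. The layer sizes, weights, and target below are hypothetical, with the logistic sigmoid for σ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_hidden, n_out = 3, 2                   # hypothetical sizes: i hidden, j output neurons
W = rng.normal(size=(n_out, n_hidden))   # weight matrix; entry [j, i] is w_ji
b = rng.normal(size=n_out)
a_prev = rng.uniform(size=n_hidden)      # a^{L-1}, the final hidden layer's output
y = np.array([1.0, 0.0])                 # hypothetical one-hot target

z_L = W @ a_prev + b                     # Equation B.1, vectorized
a_L = sigmoid(z_L)                       # Equation B.2

delta_L = a_L - y                        # Equation B.4, one entry per output neuron
sp = a_L * (1.0 - a_L)                   # sigma'(z^L) for the logistic sigmoid

# Equation B.12 for every weight at once: the outer product yields the full
# j-by-i matrix of gradients, dC/dw_ji = delta_j * sigma'(z_j) * a_i
dC_dW = np.outer(delta_L * sp, a_prev)
```

The `[j, i]` layout mirrors the wji convention in the text: one row per output neuron, one column per hidden-layer neuron.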

Although this is essentially the same as our single-neuron-per-layer backprop (refer to Equation B.7), the equation for the gradient of the cost w.r.t. the preceding layer’s output aL–1 (i.e., the δL–1 value) will change. Because this gradient is composed of partial derivatives taken over the current layer’s inputs and weights, and because there are now multiple of each, we need to sum everything up. Sticking with the i and j notation:

δ_i^{L−1} = ∑_j δ_j^L · σ′(z_j^L) · w_{ji}    (Equation B.13)

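The sum in Equation B.13 can be written out explicitly as a loop over j, or collapsed into a single matrix-transpose product; both forms are sketched below with hypothetical layer sizes and a logistic sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

n_hidden, n_out = 3, 2                   # hypothetical layer sizes (i and j)
W = rng.normal(size=(n_out, n_hidden))   # weight matrix; entry [j, i] is w_ji
b = rng.normal(size=n_out)
a_prev = rng.uniform(size=n_hidden)      # a^{L-1}
y = np.array([0.0, 1.0])                 # hypothetical one-hot target

z_L = W @ a_prev + b
a_L = sigmoid(z_L)
delta_L = a_L - y                        # Equation B.4, one entry per output neuron
sp = a_L * (1.0 - a_L)                   # sigma'(z^L)

# Equation B.13 written out: each hidden neuron i sums the error arriving
# back over all j of its outgoing connections...
delta_prev = np.array([
    sum(delta_L[j] * sp[j] * W[j, i] for j in range(n_out))
    for i in range(n_hidden)
])

# ...which collapses to a single matrix-transpose product:
delta_prev_vec = W.T @ (delta_L * sp)
```

The transpose form is what vectorized implementations typically use: the same weight matrix that carried activations forward carries the error backward.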
This is a lot of math to take in, so let’s review in simple terms: Relative to the simpler network of Equations B.1 through B.10, the equations haven’t changed except that instead of calculating the gradient on a single weight, we need to calculate the gradient on multiple weights (Equation B.12). In order to calculate the gradient on any given weight, we need that δ value—which itself is composed of the error over a number of connections in the preceding layer—so we calculate the sum over all these errors (Equation B.13).
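Putting the review into code: below is a compact, vectorized sketch of the whole procedure for a small fully connected network. All sizes and values are hypothetical, the activation is the logistic sigmoid, and the cost is the quadratic cost of Equation B.11:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(42)

# Hypothetical layer sizes: 4 inputs -> 3 hidden neurons -> 2 outputs
sizes = [4, 3, 2]
Ws = [rng.normal(scale=0.5, size=(n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(scale=0.5, size=n_out) for n_out in sizes[1:]]

x = rng.uniform(size=sizes[0])
y = np.array([1.0, 0.0])

def backprop(Ws, bs, x, y):
    # Forward pass, storing z and a at each layer
    activations, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Output-layer delta (Equation B.4, vectorized over classes)
    delta = activations[-1] - y
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        d = delta * sigmoid_prime(zs[l])
        grads[l] = np.outer(d, activations[l])   # Equation B.12, all weights at once
        delta = Ws[l].T @ d                      # Equation B.13, summed over j
    return grads

grads = backprop(Ws, bs, x, y)
```

Each gradient matrix has the same shape as its weight matrix, so a gradient-descent update is just `W -= learning_rate * grad`, layer by layer.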
