In this appendix, we use the formal neural network notation from Appendix A to dive into the partial-derivative calculus behind the backpropagation method introduced in Chapter 8.

Let’s begin by defining some additional notation to help us along. Backpropagation works backwards, so the notation is based on the final layer (denoted *L*), and the earlier layers are annotated with respect to it (*L* – 1, *L* – 2, . . . *L* – *n*). The weights, biases, and outputs from functions are subscripted appropriately with this same notation. Recall from Equations 7.1 and 7.2 that the layer activation *a^L* is calculated by multiplying the preceding layer’s activation (*a^{L–1}*) by the weights *w^L*, adding the bias *b^L*, and passing the resulting *z^L* through the activation function *σ*. The quadratic cost for a single training example with true label *y* follows:

$\begin{array}{cc}{z}^{L}={w}^{L}\cdot {a}^{L-1}+{b}^{L}& \left(\text{B.1}\right)\end{array}$

$\begin{array}{cc}{a}^{L}=\sigma \left({z}^{L}\right)& \left(\text{B.2}\right)\end{array}$

$\begin{array}{cc}{C}_{0}={\left({a}^{L}-y\right)}^{2}& \left(\text{B.3}\right)\end{array}$
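Equations B.1 through B.3 can be sketched directly in code. This is a minimal illustration for a single neuron with sigmoid activation; the specific weight, bias, input activation, and label values are made-up for the example.

```python
import math

def sigmoid(z):
    """Logistic activation function, as in Equation B.2."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar values for a one-neuron layer L:
w_L, b_L = 0.5, 0.1   # weight and bias at layer L
a_prev = 0.8          # activation a^{L-1} from the preceding layer
y = 1.0               # true label

z_L = w_L * a_prev + b_L   # Equation B.1: weighted input
a_L = sigmoid(z_L)         # Equation B.2: layer activation
C0 = (a_L - y) ** 2        # Equation B.3: quadratic cost
```

With these values, *z^L* = 0.5 and the cost is small but nonzero because the sigmoid output cannot quite reach the label of 1.0.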

In every iteration, we need the gradient of the total cost with respect to each layer’s output, starting with the final layer (*∂C/∂a^L*); in this way, the total error of the system is propagated backwards. We’ll call this value *δ_L*:

$\begin{array}{cc}{\delta}_{L}=\frac{\partial C}{\partial {a}^{L}}=2\left({a}^{L}-y\right)& \left(\text{B.4}\right)\end{array}$

Again, this is a special case for the initial *δ* value; the remaining layers will be different (more on that shortly). Now, to update the weights in layer *L*, we need to find the gradient of the cost *w.r.t.* (with respect to) the weights, *∂C/∂w^L*. According to the chain rule, this is the product of the gradient of the cost w.r.t. the layer’s output, the gradient of the activation function w.r.t. its input *z^L*, and the gradient of *z^L* w.r.t. the weights:

$\begin{array}{cc}\frac{\partial C}{\partial {w}^{L}}=\frac{\partial C}{\partial {a}^{L}}\cdot \frac{\partial {a}^{L}}{\partial {z}^{L}}\cdot \frac{\partial {z}^{L}}{\partial {w}^{L}}& \left(\text{B.5}\right)\end{array}$

Since *∂C/∂a^L* = *δ_L*, and, for the sigmoid activation, *∂a^L/∂z^L* = *a^L*(1 – *a^L*) while *∂z^L/∂w^L* = *a^{L–1}* (from Equation B.1), substituting gives:

$\begin{array}{cc}\frac{\partial C}{\partial {w}^{L}}={\delta}_{L}\cdot {a}^{L}\left(1-{a}^{L}\right)\cdot {a}^{L-1}& \left(\text{B.6}\right)\end{array}$
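This weight gradient can be computed and verified numerically. The sketch below uses the sigmoid derivative *σ′(z)* = *a*(1 – *a*) and checks the analytic gradient against a central finite difference; the parameter values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values: weight, bias, preceding activation, label.
w_L, b_L, a_prev, y = 0.5, 0.1, 0.8, 1.0

def cost(w):
    """Forward pass through layer L followed by the quadratic cost."""
    a = sigmoid(w * a_prev + b_L)
    return (a - y) ** 2

a_L = sigmoid(w_L * a_prev + b_L)
delta_L = 2 * (a_L - y)                      # Equation B.4
grad_w = delta_L * a_L * (1 - a_L) * a_prev  # Equation B.6

# Sanity check: compare against a numerical (finite-difference) derivative.
eps = 1e-6
numeric = (cost(w_L + eps) - cost(w_L - eps)) / (2 * eps)
assert abs(grad_w - numeric) < 1e-8
```

The gradient comes out negative here, which makes sense: the output is below the label *y* = 1, so increasing the weight would reduce the cost.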

This value is essentially the relative amount by which the weights at layer *L* affect the total cost, and we use this to update the weights at this layer. Our work isn’t complete, however; now we need to continue down the rest of the layers. For layer *L* – 1:

$\begin{array}{cc}{\delta}_{L-1}=\frac{\partial C}{\partial {a}^{L-1}}=\frac{\partial C}{\partial {a}^{L}}\cdot \frac{\partial {a}^{L}}{\partial {z}^{L}}\cdot \frac{\partial {z}^{L}}{\partial {a}^{L-1}}& \left(\text{B.7}\right)\end{array}$

Again, *∂C/∂a^L* = *δ_L* and *∂a^L/∂z^L* = *a^L*(1 – *a^L*); the new final term is *∂z^L/∂a^{L–1}* = *w^L* (from Equation B.1). Substituting, we get:

$\begin{array}{cc}{\delta}_{L-1}=\frac{\partial C}{\partial {a}^{L-1}}={\delta}_{L}\cdot {a}^{L}\left(1-{a}^{L}\right)\cdot {w}^{L}& \left(\text{B.8}\right)\end{array}$

Now we need to find the gradient of the cost w.r.t. the weights at this layer *L* – 1 as before:

$\begin{array}{cc}\frac{\partial C}{\partial {w}^{L-1}}=\frac{\partial C}{\partial {a}^{L-1}}\cdot \frac{\partial {a}^{L-1}}{\partial {z}^{L-1}}\cdot \frac{\partial {z}^{L-1}}{\partial {w}^{L-1}}& \left(\text{B.9}\right)\end{array}$

Once again, substituting *δ_{L–1}* for *∂C/∂a^{L–1}*, *a^{L–1}*(1 – *a^{L–1}*) for *∂a^{L–1}/∂z^{L–1}*, and *a^{L–2}* for *∂z^{L–1}/∂w^{L–1}*, we get:

$\begin{array}{cc}\frac{\partial C}{\partial {w}^{L-1}}={\delta}_{L-1}\cdot {a}^{L-1}\left(1-{a}^{L-1}\right)\cdot {a}^{L-2}& \left(\text{B.10}\right)\end{array}$

This process is repeated layer by layer all the way down to the first layer.

To recap, we first find *δ_L* (Equation B.4), which is the error of the cost function (Equation B.3), and we use that value in the equation for the derivative of the cost function w.r.t. the weights in layer *L* (Equation B.6). We then use *δ_L* to compute *δ_{L–1}* (Equation B.8), which in turn gives us the gradient for the weights in layer *L* – 1 (Equation B.10), and we repeat this process until we reach the first layer.
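The whole recap can be sketched as one backward loop. This is a minimal illustration assuming a chain of single sigmoid neurons; the weights, biases, input, label, and learning rate are made-up values, and bias updates are omitted for brevity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical three-layer chain of single neurons.
weights = [0.9, 0.5, -0.3]
biases = [0.1, -0.2, 0.4]
x, y, lr = 0.7, 1.0, 0.5  # input, label, learning rate

# Forward pass, storing every activation (a^0 is the input itself).
activations = [x]
for w, b in zip(weights, biases):
    activations.append(sigmoid(w * activations[-1] + b))

# Backward pass: delta starts as Equation B.4, then follows Equation B.8.
delta = 2 * (activations[-1] - y)
for l in range(len(weights) - 1, -1, -1):
    a, a_prev = activations[l + 1], activations[l]
    grad_w = delta * a * (1 - a) * a_prev      # Equations B.6 / B.10
    delta = delta * a * (1 - a) * weights[l]   # Equation B.8, for the layer below
    weights[l] -= lr * grad_w                  # gradient-descent update
```

Note that *δ* for the next-lower layer is computed before the current layer’s weight is updated, since Equation B.8 uses the weight value from the forward pass.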

Up to this point in this appendix, we’ve only dealt with networks with single inputs, single hidden neurons, and single outputs. In practice, deep learning models are never this simple. Thankfully, the math shown above scales straightforwardly given multiple neurons in a layer and multiple network inputs and outputs.

Consider the case where there are multiple output classes, such as when you’re classifying MNIST digits. In this case, there are 10 output classes (*n* = 10) representing the digits 0–9. For each class, the model provides a probability that a given input image belongs to that class. To find the total cost, we find the sum of the (quadratic, in this case) cost over all the classes:

$\begin{array}{cc}{C}_{0}={\displaystyle \sum _{n=1}^{10}}{\left({a}_{n}^{L}-{y}_{n}\right)}^{2}& \left(\text{B.11}\right)\end{array}$

In Equation B.11, *a_n^L* and *y_n* are the predicted output and the true label for the *n*th class, respectively; summing the per-class costs over all 10 classes gives the total cost *C_0*.
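Equation B.11 amounts to a single sum over the output classes. The per-class outputs and one-hot label below are made-up values for a network that is fairly confident the input image is a “3”:

```python
# Hypothetical per-class outputs a_n^L for the 10 MNIST digit classes,
# and the one-hot true label y_n (the correct digit is 3).
a_L = [0.05, 0.02, 0.01, 0.7, 0.03, 0.04, 0.05, 0.02, 0.05, 0.03]
y = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

# Equation B.11: total quadratic cost summed over all classes.
C0 = sum((a_n - y_n) ** 2 for a_n, y_n in zip(a_L, y))
```

The cost is dominated by the (0.7 – 1)² term for the correct class, with small contributions from the nonzero probabilities assigned to the wrong digits.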

Examining *∂C/∂w^L* for this, the output layer, we must account for the fact that there may be many neurons in the final hidden layer, each one of them connected to each output neuron. It’s helpful here to switch the notation slightly: Let the neurons of the final hidden layer be indexed by *i* and the output neurons by *j*, so that *w_{ji}^L* denotes the weight connecting hidden neuron *i* to output neuron *j*. Then, for each individual weight:

$\begin{array}{cc}\frac{\partial C}{\partial {w}_{ji}^{L}}=\frac{\partial C}{\partial {a}_{j}^{L}}\cdot \frac{\partial {a}_{j}^{L}}{\partial {z}_{j}^{L}}\cdot \frac{\partial {z}_{j}^{L}}{\partial {w}_{ji}^{L}}& \left(\text{B.12}\right)\end{array}$

We do this for every single weight in the layer, producing one gradient per weight — *j* × *i* gradients in total.

Although this is essentially the same as our single-neuron-per-layer backprop (refer to Equation B.7), the equation for the gradient of the cost w.r.t. the preceding layer’s output *a_i^{L–1}* must now account for every output neuron that hidden neuron *i* feeds into, so we sum over all *n_j* of them:

$\begin{array}{cc}{\delta}_{L-1}=\frac{\partial C}{\partial {a}_{i}^{L-1}}={\displaystyle \sum _{j=0}^{{n}_{j}-1}}\frac{\partial C}{\partial {a}_{j}^{L}}\cdot \frac{\partial {a}_{j}^{L}}{\partial {z}_{j}^{L}}\cdot \frac{\partial {z}_{j}^{L}}{\partial {a}_{i}^{L-1}}& \left(\text{B.13}\right)\end{array}$

This is a lot of math to take in, so let’s review in simple terms: Relative to the simpler network of Equations B.1 through B.10, the equations haven’t changed except that instead of calculating the gradient on a single weight, we need to calculate the gradient on multiple weights (Equation B.12). In order to calculate the gradient on any given weight, we need that *δ* value—which itself is composed of the error over a number of connections in the preceding layer—so we calculate the sum over all these errors (Equation B.13).
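Equations B.12 and B.13 can be sketched with explicit loops, which mirrors the per-weight gradients and the sum over output neurons directly. This is a minimal illustration assuming sigmoid outputs and the quadratic cost; the layer sizes, weights, biases, activations, and label are made-up values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical layer: 3 hidden neurons (index i) feed 2 output neurons (index j).
a_prev = [0.2, 0.6, 0.9]                  # a_i^{L-1}
w = [[0.1, -0.4, 0.7], [0.5, 0.3, -0.2]]  # w[j][i] = w_{ji}^L
b = [0.05, -0.1]                          # b_j^L
y = [1.0, 0.0]                            # one-hot label

# Forward pass through the output layer.
z = [sum(w[j][i] * a_prev[i] for i in range(3)) + b[j] for j in range(2)]
a = [sigmoid(zj) for zj in z]

# Equation B.12: one gradient per (j, i) weight.
grad_w = [[2 * (a[j] - y[j]) * a[j] * (1 - a[j]) * a_prev[i]
           for i in range(3)] for j in range(2)]

# Equation B.13: delta for hidden neuron i sums over every output neuron j.
delta_prev = [sum(2 * (a[j] - y[j]) * a[j] * (1 - a[j]) * w[j][i]
                  for j in range(2)) for i in range(3)]
```

In a real implementation these loops would be a handful of matrix operations, but the loop form makes the correspondence with the indices in Equations B.12 and B.13 explicit.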
