Gradient descent

As a result of forward propagation, we are at the output layer. Now, we will backpropagate through the network from the output layer to the input layer and update the weights by calculating the gradient of the cost function with respect to the weights, so as to minimize the error. Sounds confusing, right? Let's begin with an analogy. Imagine you are on top of a hill, as shown in the following diagram, and you want to reach the lowest point of the hill. You have to take steps downhill, which lead you towards the lowest point (that is, you descend the hill towards the lowest point). There could be many regions that look like the lowest point of the hill, but we have to reach the point that is actually the lowest of all. That is, you should not get stuck at a point believing it is the lowest point when the global lowest point exists:

Similarly, we can represent our cost function as follows: it is a plot of the cost against the weights. Our objective is to minimize the cost function, that is, to reach the lowest point, where the cost is at its minimum. The marked point shows our initial weights (that is, where we are on the hill). If we move this point downward, we reach the place where the error is minimal, that is, the lowest point of the cost function (the lowest point of the hill):

How can we move this point (the initial weight) downward? How do we descend and reach the lowest point? We move it by calculating the gradient of the cost function with respect to the weight at that point. The gradient is the derivative, which is the slope of the tangent line at that point, as shown in the following diagram. So, by calculating the gradient and moving in the direction opposite to it, we descend (move downward) and reach the lowest point:

 

After calculating the gradients, we update our old weights with the weight update rule:

W = W - α ∂J/∂W

What is α? It is known as the learning rate. If the learning rate is small, then we take small steps downward and gradient descent can be slow. If the learning rate is large, then we take large steps and gradient descent will be fast, but we might overshoot the minimum and fail to converge. So, the learning rate should be chosen carefully, as illustrated in the following diagram:
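To make this concrete, here is a minimal sketch of gradient descent on a simple one-dimensional cost, J(w) = w², whose gradient is dJ/dw = 2w. The names cost, gradient, w, and alpha are purely illustrative and not part of our network code:

def cost(w):
    # A simple one-dimensional cost function: J(w) = w^2
    return w ** 2

def gradient(w):
    # Its derivative (the slope of the tangent line): dJ/dw = 2w
    return 2 * w

w = 5.0       # initial weight (where we start on the hill)
alpha = 0.1   # learning rate

for step in range(20):
    # Weight update rule: w = w - alpha * dJ/dw
    w = w - alpha * gradient(w)

print(w, cost(w))   # w has moved close to 0, where the cost is minimal

With alpha = 0.1, the weight shrinks steadily towards the minimum at w = 0; with a much larger value such as alpha = 1.1, the updates overshoot and diverge, which is why the learning rate has to be chosen carefully.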

 

Now, let's look at this mathematically. We are going to look at a lot of interesting math now, so put on your calculus hats and follow these steps. We have two sets of weights: wxh, the input-to-hidden layer weights, and why, the hidden-to-output layer weights. We need to update these weights according to our weight update rule. To do that, we first need to calculate the derivative of the cost function with respect to each of the weights.

Since we are backpropagating, that is, going from the output layer to the input layer, our first weight will be why. So, we now need to calculate the derivative of J with respect to why. How do we calculate this derivative? Recall that our cost function is J = 1/2 Σ(y - ŷ)². We cannot compute the derivative directly, as there is no why term in J.

Recall the forward propagation equations, given as follows:

z1 = X · Wxh
a1 = σ(z1)
z2 = a1 · Why
ŷ = σ(z2)
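As a quick reminder of how these equations look in code, the following is a minimal NumPy sketch of the forward pass together with the cost. The shapes chosen here (4 samples, 2 input features, 3 hidden units, 1 output) are purely illustrative:

import numpy as np

def sigmoid(z):
    # Sigmoid activation function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data and weights
X = np.random.rand(4, 2)      # input
y = np.random.rand(4, 1)      # target
Wxh = np.random.randn(2, 3)   # input-to-hidden weights
Why = np.random.randn(3, 1)   # hidden-to-output weights

z1 = np.dot(X, Wxh)    # input to the hidden layer
a1 = sigmoid(z1)       # hidden layer activation
z2 = np.dot(a1, Why)   # input to the output layer
yHat = sigmoid(z2)     # predicted output

# Cost: J = 1/2 * sum((y - yHat)^2)
J = 0.5 * np.sum((y - yHat) ** 2)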

First, we will calculate the partial derivative of the cost J with respect to the prediction ŷ; then, from ŷ, we will calculate the partial derivative with respect to z2; and from z2, we can directly calculate the derivative with respect to why. This is nothing but the chain rule.

So, our equation becomes:

∂J/∂Why = ∂J/∂ŷ · ∂ŷ/∂z2 · ∂z2/∂Why ---- (1)

We will compute each of these terms:

∂J/∂ŷ = -(y - ŷ)
∂ŷ/∂z2 = σ'(z2)
∂z2/∂Why = a1

Here, σ'(z2) is the derivative of our sigmoid activation function. We know that the sigmoid function is σ(z) = 1 / (1 + e^(-z)), so the derivative of the sigmoid function is σ'(z) = e^(-z) / (1 + e^(-z))².
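The code that follows calls a sigmoidPrime function for this derivative. One straightforward way to define it, matching the formula above, is:

import numpy as np

def sigmoidPrime(z):
    # Derivative of the sigmoid: e^(-z) / (1 + e^(-z))^2
    return np.exp(-z) / ((1.0 + np.exp(-z)) ** 2)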

We will substitute all of these terms into equation (1), which gives us:

∂J/∂Why = -(y - ŷ) · σ'(z2) · a1

Now, we need to compute the derivative of J with respect to our next weight, wxh. Similarly, we cannot calculate the derivative of J with respect to wxh directly, as we don't have any wxh terms in J. So, we need to use the chain rule; recall the forward propagation steps again:

z1 = X · Wxh
a1 = σ(z1)
z2 = a1 · Why
ŷ = σ(z2)

Now, the gradient calculation for weight wxh becomes:

∂J/∂Wxh = ∂J/∂ŷ · ∂ŷ/∂z2 · ∂z2/∂a1 · ∂a1/∂z1 · ∂z1/∂Wxh --- (2)

We will compute each of these terms. The first two, ∂J/∂ŷ and ∂ŷ/∂z2, are the same as in equation (1); the remaining three are:

∂z2/∂a1 = Why
∂a1/∂z1 = σ'(z1)
∂z1/∂Wxh = X

Once we have calculated the gradients for both weights, we will update our previous weights according to our weight update rule. 

Now, let's do some coding. Look at equations (1) and (2). We have the terms ∂J/∂ŷ and ∂ŷ/∂z2 in both of them, so we don't have to compute them again and again. We define their product as delta3:

delta3 = np.multiply(-(y-yHat),sigmoidPrime(z2))

Now, we compute the gradient for why as:

dJ_dWhy = np.dot(a1.T,delta3)

We compute the gradient for wxh as:

delta2 = np.dot(delta3,Why.T)*sigmoidPrime(z1)
dJ_dWxh = np.dot(X.T,delta2) 

We will update the weights according to our weight update rule as:

Wxh += -alpha * dJ_dWxh
Why += -alpha * dJ_dWhy

The complete code for backpropagation, written as a function that takes the forward-pass values and the current weights and returns the updated weights, will be as follows:

import numpy as np

def backProp(X, y, yHat, z1, a1, z2, Wxh, Why, alpha):
    # Error at the output layer: dJ/dyHat * dyHat/dz2
    delta3 = np.multiply(-(y - yHat), sigmoidPrime(z2))
    # Gradient of the cost with respect to Why
    dJ_dWhy = np.dot(a1.T, delta3)

    # Error propagated back to the hidden layer
    delta2 = np.dot(delta3, Why.T) * sigmoidPrime(z1)
    # Gradient of the cost with respect to Wxh
    dJ_dWxh = np.dot(X.T, delta2)

    # Update the weights with the weight update rule
    Wxh += -alpha * dJ_dWxh
    Why += -alpha * dJ_dWhy

    return Wxh, Why
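As a usage sketch, assuming the sigmoid, sigmoidPrime, data, weights, and backProp definitions from the snippets above, a simple training loop just repeats the forward pass and the backward pass for a chosen number of epochs:

alpha = 0.01       # learning rate
num_epochs = 1000

for epoch in range(num_epochs):
    # Forward pass
    z1 = np.dot(X, Wxh)
    a1 = sigmoid(z1)
    z2 = np.dot(a1, Why)
    yHat = sigmoid(z2)

    # Backward pass: compute the gradients and update the weights
    Wxh, Why = backProp(X, y, yHat, z1, a1, z2, Wxh, Why, alpha)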

Before going ahead, let's familiarize ourselves with some of the frequently used terminologies in neural networks:

  • Forward pass: Forward pass implies forward propagating from the input layer to the output layer.
  • Backward pass: Backward pass implies backpropagating from the output layer to the input layer.
  • Epoch: Epoch specifies the number of times the neural network sees our whole training data. So, we can say one epoch is equal to one forward pass and one backward pass for all training samples. 
  • Batch size: The batch size specifies the number of training samples we use in one forward pass and one backward pass. 
  • No. of iterations: The number of iterations implies the number of passes where one pass = one forward pass + one backward pass.

Say that we have 12,000 training samples and that our batch size is 6,000. It will take us two iterations to complete one epoch. That is, in the first iteration, we pass the first 6,000 samples and perform a forward pass and a backward pass; in the second iteration, we pass the next 6,000 samples and perform a forward pass and a backward pass. After two iterations, our neural network will see the whole 12,000 training samples, which makes it one epoch. 
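The same bookkeeping can be written as a tiny sketch; the numbers mirror the example above and the variable names are just illustrative:

num_samples = 12000   # total training samples
batch_size = 6000     # samples per forward/backward pass

# Iterations needed to see the whole training set once, that is, one epoch
iterations_per_epoch = num_samples // batch_size
print(iterations_per_epoch)   # 2

num_epochs = 10
total_iterations = num_epochs * iterations_per_epoch
print(total_iterations)       # 20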
