Backpropagation through time

Now, how do we train RNNs? Just as we train normal neural networks, we can use backpropagation to train RNNs. But in an RNN, since there is a dependency on all the time steps, the gradient at each output depends not only on the current time step but also on the previous time steps. We call this backpropagation through time (BPTT). It is essentially the same as backpropagation, except that it is applied to an RNN. To see how it works, let's consider the unrolled version of an RNN, as shown in the following diagram:

[Figure: An RNN unrolled over three time steps, with losses L1, L2, and L3 at each time step]

In the preceding diagram, L1, L2, and L3 are the losses at each time step. Now, we need to compute the gradients of this loss with respect to our weight matrices U, V, and W at each time step. Just as we previously calculated the total loss by summing up the loss at each time step, we update the weight matrices using the sum of the gradients at each time step:

∂L/∂U = ∑_j ∂L_j/∂U,    ∂L/∂V = ∑_j ∂L_j/∂V,    ∂L/∂W = ∑_j ∂L_j/∂W
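
To make this concrete, the following is a minimal NumPy sketch of BPTT for a vanilla RNN, assuming the usual setup in which U maps the input to the hidden state, W maps the previous hidden state to the current one, and V maps the hidden state to the output. The dimensions, random data, and softmax/cross-entropy loss are chosen purely for illustration:

import numpy as np

input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output

x = rng.normal(size=(T, input_dim))            # inputs x_1, ..., x_T
targets = rng.integers(0, output_dim, size=T)  # target class at each step

# Forward pass: unroll the network over all T time steps
h = np.zeros((T + 1, hidden_dim))              # h[0] is the initial hidden state
y_hat = np.zeros((T, output_dim))
for t in range(T):
    h[t + 1] = np.tanh(U @ x[t] + W @ h[t])
    logits = V @ h[t + 1]
    y_hat[t] = np.exp(logits - logits.max())
    y_hat[t] /= y_hat[t].sum()                 # softmax output at step t

# Backward pass (BPTT): the gradient of the total loss is the sum of the
# gradients contributed by the loss at every time step
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dh_next = np.zeros(hidden_dim)
for t in reversed(range(T)):
    dlogits = y_hat[t].copy()
    dlogits[targets[t]] -= 1.0                 # d(cross-entropy)/d(logits)
    dV += np.outer(dlogits, h[t + 1])
    dh = V.T @ dlogits + dh_next               # gradient flowing into h_t
    dz = (1.0 - h[t + 1] ** 2) * dh            # backprop through tanh
    dU += np.outer(dz, x[t])
    dW += np.outer(dz, h[t])
    dh_next = W.T @ dz                         # carry the gradient to step t-1

After the loop, dU, dW, and dV hold the summed gradients over all time steps and can be used to update U, W, and V with gradient descent.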

However, we have a problem with this method. The gradient calculation involves computing the gradient with respect to the activation function. When we compute the gradient with respect to the sigmoid/tanh function, the gradient becomes very small. When we backpropagate through the network over many time steps and multiply these gradients together, they become smaller and smaller. This is called the vanishing gradient problem. So, what happens because of this problem? Because the gradients vanish over time, we cannot learn information about long-term dependencies; that is, RNNs cannot retain information in memory for long periods.
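
As a rough illustration of why the gradients shrink, the following sketch multiplies together the sigmoid derivatives that BPTT would encounter across many time steps; the number of steps and the random pre-activations here are made up purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(size=50)                      # pre-activations at 50 time steps
factors = sigmoid(z) * (1.0 - sigmoid(z))    # sigmoid'(z) is at most 0.25
print(np.prod(factors))                      # prints a vanishingly small number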

The vanishing gradient problem occurs not only in RNNs but also in other deep networks with many hidden layers that use sigmoid/tanh activation functions. There is also a problem called exploding gradients, where the gradient values become greater than one; when we multiply these gradients together, the result is a very large number.
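
Similarly, here is a rough sketch of the exploding case: when the factor that multiplies the gradient at each step is larger than one (here, a hypothetical recurrent weight matrix scaled well above one), the repeated multiplications that BPTT performs grow very quickly:

import numpy as np

rng = np.random.default_rng(0)
W = 1.5 * np.eye(8) + 0.01 * rng.normal(size=(8, 8))  # largest singular value > 1
grad = np.ones(8)
for _ in range(50):
    grad = W.T @ grad            # the same kind of multiplication BPTT repeats
print(np.linalg.norm(grad))      # prints a very large number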

One solution is to use ReLU as the activation function. However, there is also a variant of the RNN called LSTM, which can solve the vanishing gradient problem effectively. We will see how it works in the upcoming section.
