Backpropagation through time

Training an RNN requires a slightly different implementation of backpropagation, known as backpropagation through time (BPTT).

As with standard backpropagation, the goal of BPTT is the same: use the overall network error to adjust each unit's weights in proportion to that unit's contribution to the error, as measured by the gradient.

When using BPTT, however, our definition of error changes slightly. As we just saw, a recurrent neuron can be unrolled through several time steps. We care about the prediction quality at all of those time steps, not just the terminal one, because the goal of an RNN is to predict the entire sequence correctly. Accordingly, a given unit's error is defined as the sum of its error across all unrolled time steps.

When using BPTT, we need to sum up the error across all time steps. Then, after we've computed that overall error, we will adjust the unit's weights by the gradients for each time step.
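To make that summation and the shared-weight update concrete, here is a minimal NumPy sketch of BPTT for a single vanilla recurrent unit (a plain tanh RNN rather than an LSTM). The dimensions, random data, and squared-error loss are illustrative assumptions, not the example we'll build later:

import numpy as np

# A minimal BPTT sketch for one vanilla recurrent unit (not an LSTM).
# The sizes, weights, and squared-error loss here are illustrative only.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 1, 4, 1, 5   # T = number of unrolled time steps

Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wy = rng.normal(scale=0.1, size=(n_out, n_hidden))
b = np.zeros((n_hidden, 1))

x = rng.normal(size=(T, n_in, 1))        # one sequence of T inputs
target = rng.normal(size=(T, n_out, 1))  # a target at every time step

# Forward pass: unroll the unit across all T time steps.
h = {-1: np.zeros((n_hidden, 1))}
y, loss = {}, 0.0
for t in range(T):
    h[t] = np.tanh(Wx @ x[t] + Wh @ h[t - 1] + b)
    y[t] = Wy @ h[t]
    loss += 0.5 * np.sum((y[t] - target[t]) ** 2)  # error summed over time steps

# Backward pass (BPTT): accumulate gradients from every time step
# into the *same* shared weight matrices.
dWx, dWh, dWy, db = [np.zeros_like(w) for w in (Wx, Wh, Wy, b)]
dh_next = np.zeros((n_hidden, 1))
for t in reversed(range(T)):
    dy = y[t] - target[t]
    dWy += dy @ h[t].T
    dh = Wy.T @ dy + dh_next           # gradient flowing in from later time steps
    dpre = dh * (1.0 - h[t] ** 2)      # derivative of tanh
    dWx += dpre @ x[t].T
    dWh += dpre @ h[t - 1].T
    db += dpre
    dh_next = Wh.T @ dpre              # pass the gradient back one more time step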

Summing and backpropagating across a fixed number of steps forces us to explicitly define how far we will unroll our LSTM. You'll see this in the following example, when we create a specific set of time steps that we will train on for each observation.
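Here is a sketch of one common way to build those fixed windows of time steps from a univariate series; the window length of 10 and the helper name make_windows are assumptions for illustration, not the dataset used in the example:

import numpy as np

def make_windows(series, n_steps):
    """Slice a 1D series into (samples, n_steps, 1) windows and next-step targets."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps])
    X = np.array(X).reshape(-1, n_steps, 1)  # [samples, time steps, features]
    return X, np.array(y)

series = np.sin(np.linspace(0, 20, 500))  # toy series for illustration
X, y = make_windows(series, n_steps=10)   # unroll/backpropagate across 10 steps
print(X.shape, y.shape)                   # (490, 10, 1) (490,)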

The number of steps you choose to backpropagate across is, of course, a hyperparameter. If you need to learn something from very far back in the sequence, you'll have to include that many lags in the series; you need to be able to capture the relevant period. On the other hand, capturing too many time steps isn't desirable either. The network will become very hard to train because, as the gradient propagates back through time, it will become very small. This is another instantiation of the vanishing gradient problem that I've described in previous chapters.

As you imagine this scenario, you might wonder whether choosing too many time steps will crash your program. If our gradients are driven to values so extreme that they become NaN, we can't complete the update operation. A common and easy way to handle this issue is to clamp each gradient between an upper and a lower threshold, a technique called gradient clipping. All Keras optimizers support gradient clipping through their clipnorm and clipvalue arguments, as shown below. If your gradient is clipped, the network probably won't learn much for that time step, but at least your program won't crash.
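For instance, assuming the tf.keras API, clipping is requested when the optimizer is constructed; the thresholds of 1.0 below are arbitrary examples:

from tensorflow.keras.optimizers import Adam

# Clip each gradient element to the range [-1.0, 1.0].
clipped_value_adam = Adam(learning_rate=0.001, clipvalue=1.0)

# Or rescale a gradient whenever its L2 norm exceeds 1.0.
clipped_norm_adam = Adam(learning_rate=0.001, clipnorm=1.0)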

If BPTT seems confusing, just imagine the LSTM in its unrolled state, where a copy of the unit exists for each time step. For that network structure, the algorithm is essentially identical to standard backpropagation, with the exception that all of the unrolled copies share the same weights.
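A quick way to convince yourself of that weight sharing, again assuming the tf.keras API, is to note that an LSTM layer's parameter count doesn't change no matter how many time steps you unroll it over:

from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

# The parameter count depends only on the layer size and the input features,
# never on the number of time steps, because every step reuses the same weights.
for timesteps in (10, 100):
    inputs = Input(shape=(timesteps, 1))
    outputs = LSTM(32)(inputs)
    print(timesteps, Model(inputs, outputs).count_params())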
