Backpropagation through time 

With RNNs, we have multiple copies of the same network, one for each timestep. Therefore, we need a way to backpropagate the error derivatives and calculate weight updates for the parameters at every timestep. The way we do this is simple: as with ordinary backpropagation, we follow the gradient of the loss function to optimize the network. Because the unrolled network contains multiple copies of the trainable parameters, one at each timestep, and we want these copies to stay consistent with each other, we calculate the gradient for a given parameter at every timestep and take their average. We then use this averaged gradient to update the single shared parameter (the copy at t0) at each iteration of the learning process.
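The following is a minimal NumPy sketch of one such update for a simple tanh RNN. The architecture, the squared-error loss, and names such as W_xh, W_hh, and W_hy are illustrative assumptions rather than anything defined in this chapter; the point is only to show the same weights being reused at every timestep and the per-timestep gradients being averaged into a single update:

```python
import numpy as np

def bptt_step(inputs, targets, W_xh, W_hh, W_hy, h0, lr=0.01):
    """Unroll over the given timesteps, average the per-timestep
    gradients of the shared weights, and apply a single update."""
    T, hidden_size = len(inputs), W_hh.shape[0]
    hs, ys = {-1: h0}, {}

    # Forward pass: the same three weight matrices are reused at every step.
    for t in range(T):
        hs[t] = np.tanh(W_xh @ inputs[t] + W_hh @ hs[t - 1])
        ys[t] = W_hy @ hs[t]

    # Backward pass: accumulate one gradient contribution per timestep.
    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros((hidden_size, 1))
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]             # gradient of a squared-error loss
        dW_hy += dy @ hs[t].T
        dh = W_hy.T @ dy + dh_next          # error flowing back through time
        dh_raw = (1.0 - hs[t] ** 2) * dh    # back through the tanh
        dW_xh += dh_raw @ inputs[t].T
        dW_hh += dh_raw @ hs[t - 1].T
        dh_next = W_hh.T @ dh_raw

    # Average the per-timestep gradients and update the shared parameters
    # in place, so every timestep keeps seeing the same weights.
    for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
        W -= lr * dW / T
    return hs[T - 1]                        # final hidden state
```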

The goal is to calculate the error as it accumulates across timesteps, unroll and then roll the network back up, and update the weights accordingly. There is, of course, a computational cost to this: the amount of computation required grows with the number of timesteps. The method for dealing with this is to truncate the sequence of input/output pairs (hence, truncated BPTT), meaning that we only unroll and backpropagate over a sequence of 20 timesteps at a time, making the problem tractable.
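Continuing the sketch above, truncated BPTT can be expressed as a simple chunking loop: each 20-step window is unrolled and backpropagated on its own, while the hidden state, but not the gradient, is carried across window boundaries. Again, this reuses the hypothetical bptt_step function and is only an illustration of the idea:

```python
TRUNCATION = 20  # number of timesteps unrolled per update

def truncated_bptt(inputs, targets, W_xh, W_hh, W_hy, hidden_size):
    h = np.zeros((hidden_size, 1))
    for start in range(0, len(inputs), TRUNCATION):
        # Each window is treated as its own short sequence, so the cost of
        # a single update no longer grows with the full sequence length.
        h = bptt_step(inputs[start:start + TRUNCATION],
                      targets[start:start + TRUNCATION],
                      W_xh, W_hh, W_hy, h)
    return h
```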

Additional information for those who are interested in exploring the math behind this can be found in the Further reading section of this chapter.
