Backpropagation through time (BPTT)

The simplest way to train an RNN is based on a representational trick. As the input sequences are finite and their length can be fixed, it's possible to restructure the simple neuron with a feedback connection as an unrolled feed-forward network. The following diagram shows an example with k timesteps:

Example of an unrolled recurrent network

This network (which can be easily extended to more complex architectures with several layers) is exactly like an MLP, but in this case the weights of each clone are the same, that is, they are shared across all the timesteps. The algorithm called BPTT is the natural extension of standard backpropagation to unrolled recurrent networks. The procedure is straightforward: once all the outputs have been computed, it's possible to determine the value of the cost function at every timestep. At this point, starting from the last step, the corrections (the gradients) are computed and stored, and the process is repeated back to the initial step. Then, all of the gradients are summed and applied to the network. As every single contribution is based on a precise temporal experience (made up of a local sample and the previous memory element), standard backpropagation learns how to manage a dynamic condition as if it were a point-wise prediction. However, we know that the actual network is not unrolled and the past dependencies are theoretically propagated and remembered. I deliberately used the word theoretically, because all practical experiments show a completely different behavior that we are going to discuss. This technique is very easy to implement, but it can be very expensive for deep networks that must be unrolled for a large number of timesteps. For this reason, a variant called truncated backpropagation through time (TBPTT) has been proposed (in Subgrouping reduces complexity and speeds up learning in recurrent networks, Zipser D., Advances in Neural Information Processing Systems 2, 1990).
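
To make the procedure concrete before moving to the truncated variant, here is a minimal NumPy sketch of BPTT for a single-layer tanh RNN. The network sizes, the parameter names (Wx, Wh, Wy), the squared-error loss, and the random toy data are illustrative assumptions, not code taken from any specific implementation:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, k = 3, 5, 2, 10    # k = number of unrolled timesteps

# Parameters shared by every unrolled clone of the neuron (hypothetical names)
Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wy = rng.normal(scale=0.1, size=(n_out, n_hidden))

x = rng.normal(size=(k, n_in))        # toy input sequence
y_true = rng.normal(size=(k, n_out))  # toy target sequence

# Forward pass: unroll the network for k timesteps
h = np.zeros((k + 1, n_hidden))       # h[0] is the initial state
y = np.zeros((k, n_out))
for t in range(k):
    h[t + 1] = np.tanh(Wx @ x[t] + Wh @ h[t])
    y[t] = Wy @ h[t + 1]

# Backward pass (BPTT): start from the last step, compute and store the
# corrections of every clone, then sum them into a single update
dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
dh_next = np.zeros(n_hidden)          # gradient flowing from step t+1 back to step t
for t in reversed(range(k)):
    dy = y[t] - y_true[t]             # gradient of 0.5 * ||y - y_true||^2
    dWy += np.outer(dy, h[t + 1])
    dh = Wy.T @ dy + dh_next          # local error plus error coming from the future
    dz = (1.0 - h[t + 1] ** 2) * dh   # through the tanh non-linearity
    dWx += np.outer(dz, x[t])
    dWh += np.outer(dz, h[t])
    dh_next = Wh.T @ dz               # propagate to the previous timestep

# A single gradient-descent step with the summed corrections
lr = 0.01
Wx -= lr * dWx
Wh -= lr * dWh
Wy -= lr * dWy

Even in this toy example, all k clones share the same Wx, Wh, and Wy, and the final update is the sum of the per-timestep gradients, exactly as described above.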

The idea is to use two sequence lengths, t1 and t2 (with t1 >> t2): the longer one (t1) is employed for the feed-forward phase, while the shorter one (t2) is used to train the network. At first sight, this version seems like normal BPTT with a short sequence; however, the key idea is to force the network to update the hidden states with more information and then to compute the corrections according to the result of the longer sequence (even if the updates are propagated back through a limited number of previous timesteps). Clearly, this is an approximation that can speed up the training process, but the final result is normally comparable to the one obtained by processing long sequences, particularly when the dependencies can be split into shorter temporal chunks (the underlying assumption is therefore that there are no very long dependencies).
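
As a rough sketch of this idea (again with arbitrary sizes, hypothetical names, and toy data), the snippet below updates the hidden states over the longer window t1 but backpropagates the corrections only through the last t2 timesteps:

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 3, 5, 2
t1, t2 = 100, 10                      # t1 >> t2

Wx = rng.normal(scale=0.1, size=(n_hidden, n_in))
Wh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
Wy = rng.normal(scale=0.1, size=(n_out, n_hidden))

x = rng.normal(size=(t1, n_in))
y_true = rng.normal(size=(t1, n_out))

# Feed-forward phase over the long window t1: the hidden states keep
# accumulating information from the whole sequence
h = np.zeros((t1 + 1, n_hidden))
y = np.zeros((t1, n_out))
for t in range(t1):
    h[t + 1] = np.tanh(Wx @ x[t] + Wh @ h[t])
    y[t] = Wy @ h[t + 1]

# Training phase: the corrections are propagated only through the last t2 steps
dWx, dWh, dWy = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(Wy)
dh_next = np.zeros(n_hidden)
for t in reversed(range(t1 - t2, t1)):
    dy = y[t] - y_true[t]
    dWy += np.outer(dy, h[t + 1])
    dz = (1.0 - h[t + 1] ** 2) * (Wy.T @ dy + dh_next)
    dWx += np.outer(dz, x[t])
    dWh += np.outer(dz, h[t])
    dh_next = Wh.T @ dz               # the loop stops after t2 steps and never reaches t = 0

lr = 0.01
Wx -= lr * dWx
Wh -= lr * dWh
Wy -= lr * dWy

The truncation is entirely in the backward loop: it never reaches the earliest timesteps, so the cost of a single update stays bounded regardless of t1.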

Even if the BPTT algorithm is mathematically correct and it's not difficult to learn short-term dependencies (corresponding to short unrolled networks), several experiments confirmed that it's extremely difficult (or almost impossible) to learn long-term dependencies. In other words, it's easy to exploit past experiences whose contribution is limited to a short window (and therefore whose importance is limited, because they cannot manage the most complex trends), but the network cannot easily learn behaviors that, for example, have a periodicity of hundreds of timesteps. In 1994, Bengio, Simard, and Frasconi provided a theoretical explanation of the problem (in Learning Long-Term Dependencies with Gradient Descent is Difficult, Bengio Y., Simard P., Frasconi P., IEEE Transactions on Neural Networks, 5(2), 1994). The mathematical details are rather complex, because they involve dynamic system theory; however, the final result is that a network whose neurons are forced to become robust to noise (the normal expected behavior) is affected by the vanishing gradients problem when t → ∞. More generally, we can represent a vectorial recurrent neuron dynamic as follows (here f is the activation function, W the recurrent weight matrix, U the input weight matrix, and b the bias):

$h_t = f(W h_{t-1} + U x_t + b)$

The multiplicative effect of BPTT forces the gradients to be proportional to W^t. If the largest absolute eigenvalue (also known as the spectral radius) of W is smaller than 1, then the following applies:

$\rho(W) < 1 \Rightarrow \lim_{t \to \infty} W^{t} = 0$
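
As a quick numerical check of this limit (the matrix size and the rescaling to a spectral radius of 0.9 are arbitrary, illustrative choices), the norm of W^t collapses after a few hundred multiplications:

import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(10, 10))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale W so that rho(W) = 0.9

for t in (10, 50, 100, 200, 500):
    print(f"t = {t:3d}   ||W^t|| = {np.linalg.norm(np.linalg.matrix_power(W, t)):.3e}")

The norm shrinks roughly like 0.9^t, so after a few hundred steps it drops below the resolution of single-precision floating point, which is the practical counterpart of the asymptotic result.
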
More simply, we can re-express this result by saying that the magnitude of the gradients decays exponentially with the length of the sequence, and even if the condition is only asymptotically valid, many experiments confirmed that the limited precision of numerical computations and the exponential decay due to the repeated multiplications can force the gradients to vanish even when the sequences are not extremely long. This seems to be the end of any RNN architecture but, luckily, more recent approaches have been designed and proposed to resolve this problem, allowing RNNs to learn both short- and long-term dependencies without particular complications. A new era of RNNs started, and the results were immediately outstanding.
