GRU

This model, named Gated Recurrent Unit (GRU) and proposed by Cho et al. (in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho K., Van Merrienboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y., arXiv:1406.1078 [cs.CL]), can be considered a simplified LSTM with a few variations. The structure of a generic fully gated unit is represented in the following diagram:

The main differences from LSTM are the presence of only two gates and the absence of an explicit state. These simplifications can speed up both the training and the prediction phases while still avoiding the vanishing gradient problem.

The first gate is called the reset gate (conventionally denoted with the letter r) and its function is analogous to the forget gate:

r_t = σ(W_r x_t + U_r y_{t-1} + b_r)

Similar to the forget gate, its role is to decide which content of the previous output must be preserved and to what degree. In fact, the additive contribution to the new output is obtained as follows:

a_t = tanh(W_h x_t + r_t ∘ (U_h y_{t-1}) + b_h)

In the previous expression, I've preferred to separate the weight matrices to better expose the behavior. The argument of tanh(•) is the sum of a linear function of the new input and a weighted term that is a function of the previous output. Now, it's clear how the reset gate works: it modulates the amount of history (accumulated in the previous output value) that must be preserved and what instead can be discarded. However, the reset gate alone is not enough to determine the output with sufficient accuracy when both short- and long-term dependencies must be considered. In order to increase the expressivity of the unit, an update gate (with a role similar to the LSTM input gate) has been added:

z_t = σ(W_z x_t + U_z y_{t-1} + b_z)

The update gate controls the amount of information that must contribute to the new output (and hence to the state, which in a GRU coincides with the output). As its value is bounded between 0 and 1, GRUs are trained to mix the old output and the new additive contribution with an operation similar to a weighted average:

y_t = (1 − z_t) ∘ y_{t-1} + z_t ∘ a_t

Therefore, the update gate becomes a modulator that can select which components of each flow must be output and stored for the next operation. This unit is structurally simpler than an LSTM, but several studies have confirmed that its performance is, on average, equivalent to that of an LSTM, and in some particular cases GRUs have even outperformed the more complex cell. My suggestion is to test both models, starting with LSTM. The computational cost has been dramatically reduced by modern hardware, and in many contexts the advantage of GRUs is negligible. In both cases, the philosophy is the same: the error is kept inside the cell and the weights of the gates are corrected in order to maximize the accuracy. This behavior prevents the multiplicative cascade of small gradients and increases the ability to learn very complex temporal behaviors.
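To make the previous equations more concrete, the following is a minimal NumPy sketch of a single GRU forward step. It is not an optimized or reference implementation; the parameter names (W_r, U_r, b_r, and so on) are chosen here only to match the gates described above and are passed in a simple dictionary:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, y_prev, p):
    # Reset gate: how much of the previous output y_{t-1} is kept
    r_t = sigmoid(p['W_r'] @ x_t + p['U_r'] @ y_prev + p['b_r'])
    # Update gate: how the old output and the new contribution are mixed
    z_t = sigmoid(p['W_z'] @ x_t + p['U_z'] @ y_prev + p['b_z'])
    # Additive contribution, with the reset gate modulating the history term (element-wise)
    a_t = np.tanh(p['W_h'] @ x_t + r_t * (p['U_h'] @ y_prev) + p['b_h'])
    # New output: weighted-average-like mix controlled by the update gate
    return (1.0 - z_t) * y_prev + z_t * a_t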

However, a single cell/layer is often not able to achieve the desired accuracy. In all these cases, it's possible to stack multiple layers made up of a variable number of cells. Every layer can normally output either the last value or the entire sequence. The former is used when connecting the LSTM/GRU layer to a fully-connected one, while the whole sequence is necessary to feed another recurrent layer (a short sketch follows this paragraph). We are going to see how to implement these techniques with Keras in the following example.
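As a preliminary sketch (with hypothetical dimensions: sequences of 100 timesteps with 20 features each, and arbitrary layer sizes), this is how two GRU layers can be stacked in Keras. Note that the first layer returns the whole sequence, while the second one returns only the last value before the fully-connected layer:

from keras.models import Sequential
from keras.layers import GRU, Dense

model = Sequential()
# The whole output sequence is needed to feed the next recurrent layer
model.add(GRU(64, return_sequences=True, input_shape=(100, 20)))
# Only the last output value is needed before the fully-connected layer
model.add(GRU(32, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')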

Just like for LSTMs, Keras implements the GRU class and its NVIDIA CUDA optimized version, CuDNNGRU.
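For example (a sketch assuming Keras 2.x with the TensorFlow backend and a CUDA-enabled GPU), the optimized layer can be swapped in with the same basic parameters:

from keras.layers import CuDNNGRU

# Drop-in replacement for GRU when a CUDA-enabled GPU is available
gru_layer = CuDNNGRU(64, return_sequences=True, input_shape=(100, 20))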