There's more...

In this section, we briefly mention some characteristics and extensions of RNNs.

RNNs can be used for a variety of tasks. Here, we present the groups of models defined by the type of mapping they offer (together with an example of a use case):

  • one to one: binary image classification
  • one to many: creating a caption for an image
  • many to one: classifying the sentiment of a sentence
  • many to many: machine translation
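
The following is a minimal sketch (with hypothetical shapes and variable names, not taken from this recipe) showing how the same recurrent layer can serve a many-to-one or a many-to-many task, depending on which outputs we keep:

```python
import torch
from torch import nn

# Hypothetical sizes, for illustration only
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
fc = nn.Linear(16, 1)

x = torch.randn(4, 10, 8)           # batch of 4 sequences, 10 time steps, 8 features
output, h_n = rnn(x)                # output: (4, 10, 16), h_n: (1, 4, 16)

many_to_one = fc(output[:, -1, :])  # e.g. sentiment of a sentence: keep only the last time step
many_to_many = fc(output)           # e.g. sequence labeling: one prediction per time step
```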

Over the years, researchers have come up with many new extensions/variants of the vanilla RNN model, each one overcoming some of the shortcomings of the original.

Bidirectional RNNs account for the fact that the output at time t might depend not only on the past observations in the sequence, but also on the future ones. Identifying a missing word in a sentence could serve as an example: to properly understand the context, we might want to look at both sides of the missing word.
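
In PyTorch, a bidirectional variant is obtained by passing bidirectional=True; the sketch below (with hypothetical sizes) shows that the output then concatenates the forward and backward directions:

```python
import torch
from torch import nn

# Hypothetical sizes, for illustration only
birnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 8)   # (batch, seq_len, features)
output, h_n = birnn(x)      # output: (4, 10, 32) - forward and backward outputs concatenated
                            # h_n: (2, 4, 16) - one final hidden state per direction
```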

Long Short-Term Memory (LSTM) networks were already mentioned as one of the possible solutions to the vanishing gradient problem. The LSTM's core is based on the cell state and various gates. We can think of the cell state as the memory of the network, which can carry information even from the early steps of a sequence to the final ones. Additionally, there are several gates that add and remove information from the cell state as it is passed along the sequence. The goal of the gates is to learn which information is important to retain and which can be forgotten.

LSTMs contain the following gates:

  • Forget gate: By using the sigmoid function, it decides what information should be kept (values of the sigmoid closer to 1) or forgotten (values closer to 0). The function is applied to the current input and the previous hidden state.
  • Input gate: Decides which information from the past hidden state and the current input to save into the cell state.
  • Output gate: Decides what the LSTM cell will output as the next hidden state. It uses the current input and the previous hidden state to filter the cell state in order to create the next hidden state. It can filter out short- and/or long-term memories from the cell state.

The forget and input gates govern what is removed and what is added to the cell state. Each LSTM cell outputs the hidden state and the cell state.
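
To make the role of the gates more concrete, the following is a simplified, illustrative re-implementation of the LSTM cell logic (the class and variable names are ours, not from this recipe; in practice we would simply use PyTorch's built-in nn.LSTM or nn.LSTMCell):

```python
import torch
from torch import nn

class NaiveLSTMCell(nn.Module):
    # Simplified sketch of the gate logic; not intended for production use
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.forget_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.input_gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)
        self.output_gate = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        combined = torch.cat([x_t, h_prev], dim=1)
        f_t = torch.sigmoid(self.forget_gate(combined))  # what to keep in the cell state
        i_t = torch.sigmoid(self.input_gate(combined))   # how much new information to add
        g_t = torch.tanh(self.candidate(combined))       # candidate values to add
        c_t = f_t * c_prev + i_t * g_t                   # updated cell state (the "memory")
        o_t = torch.sigmoid(self.output_gate(combined))  # what to expose as the hidden state
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t

# Example usage with hypothetical sizes
cell = NaiveLSTMCell(input_size=8, hidden_size=16)
h_t, c_t = cell(torch.randn(4, 8), torch.zeros(4, 16), torch.zeros(4, 16))
```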

The Gated Recurrent Unit (GRU) is a variation of the LSTM network. In comparison to LSTMs, GRUs have only two gates. The first one is the update gate, which is a combination of the LSTM's forget and input gates. This gate determines how much information about the past to retain in the hidden state. The second gate is the reset gate, which decides how much of the past information to forget. In the extreme case of setting all the values of the reset gate to ones and the values of the update gate to zeros, we would obtain the vanilla RNN. Another distinction between the two architectures is the fact that GRUs do not have a cell state; all the information is stored and passed through the cells using the hidden state.
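
The lack of a separate cell state is also visible in PyTorch's interface, as the small sketch below (with hypothetical sizes) shows:

```python
import torch
from torch import nn

x = torch.randn(4, 10, 8)  # (batch, seq_len, features); hypothetical shapes

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

# The LSTM returns both the final hidden state and the final cell state...
lstm_out, (h_n, c_n) = lstm(x)
# ...while the GRU stores everything in the hidden state alone.
gru_out, h_n = gru(x)
```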

Summing up, GRUs are simpler than LSTMs, have fewer trainable parameters, and are thus faster to train. That does not necessarily mean they offer worse performance.

Thanks to PyTorch, it is very easy to define LSTM/GRU models. We only need to replace the nn.RNN module in the class definition from this recipe with nn.LSTM or nn.GRU, as all of them share the same parameters (with the exception of nonlinearity, which can only be defined in the case of nn.RNN).
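
As a sketch of what such a swap could look like (the class and argument names below are illustrative, not the exact ones used in the recipe):

```python
from torch import nn

class SequenceModel(nn.Module):
    # Illustrative wrapper that lets us switch between the three recurrent layers
    def __init__(self, input_size, hidden_size, n_outputs, cell_type="lstm"):
        super().__init__()
        if cell_type == "rnn":
            # nonlinearity ('tanh' or 'relu') is only available for the vanilla RNN
            self.rnn = nn.RNN(input_size, hidden_size, batch_first=True, nonlinearity="tanh")
        elif cell_type == "lstm":
            self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)
        else:
            self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_outputs)

    def forward(self, x):
        output, _ = self.rnn(x)           # works for all three; the state(s) are ignored here
        return self.fc(output[:, -1, :])  # use the output of the last time step
```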

An additional concept that can come in handy when training RNNs is teacher forcing, which is commonly used in language models. At a very high level, while training at time t, the network receives the actual (or expected) output from time t-1 as its input, instead of the output predicted by the network (which can potentially be incorrect).
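
A minimal, self-contained sketch of the idea is shown below; the sizes, module names, and the 50% teacher forcing ratio are hypothetical choices for illustration, not part of this recipe:

```python
import random
import torch
from torch import nn

# Hypothetical vocabulary and layer sizes
vocab_size, embed_dim, hidden_size = 100, 32, 64
embedding = nn.Embedding(vocab_size, embed_dim)
gru = nn.GRU(embed_dim, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

target_seq = torch.randint(0, vocab_size, (4, 12))  # ground-truth token ids (batch, seq_len)
hidden = torch.zeros(1, 4, hidden_size)
input_token = target_seq[:, 0]                      # start from the first true token
teacher_forcing_ratio = 0.5                         # hypothetical ratio

for t in range(1, target_seq.shape[1]):
    emb = embedding(input_token).unsqueeze(1)       # (batch, 1, embed_dim)
    out, hidden = gru(emb, hidden)
    logits = fc(out.squeeze(1))                     # (batch, vocab_size)
    if random.random() < teacher_forcing_ratio:
        input_token = target_seq[:, t]              # teacher forcing: feed the true token
    else:
        input_token = logits.argmax(dim=1)          # feed the network's own prediction
```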
