Vanishing gradients

We already know that capturing long-term dependencies is essential for RNNs to work correctly, and that unrolling an RNN over a long sequence effectively makes it a very deep network. The vanishing gradient problem arises when the gradient of the activation function is very small. During backpropagation through time, the error signal is repeatedly multiplied by these small derivatives and by the recurrent weights, so it shrinks toward zero as it travels back through earlier timesteps. As a result, the network effectively forgets long-term dependencies. The following diagram illustrates what causes vanishing gradients:

The cause of vanishing gradients
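The effect is easy to see numerically. The short NumPy sketch below (illustrative only, not taken from the text, with an assumed recurrent weight of 0.5) backpropagates an error signal through 20 timesteps of a sigmoid RNN; each step multiplies the gradient by at most 0.5 × 0.25, so it collapses toward zero within a handful of steps:

```python
# A minimal sketch of how repeated multiplication by small activation
# derivatives shrinks the gradient during backpropagation through time.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value is 0.25, reached at x = 0

# Assumed setup: recurrent weight of 0.5 and pre-activations of 0 at every
# step, so each backward step multiplies the gradient by w * sigma'(0).
w = 0.5
grad = 1.0
for t in range(1, 21):
    grad *= w * sigmoid_derivative(0.0)
    print(f"gradient after {t:2d} steps back in time: {grad:.2e}")
# The gradient quickly becomes vanishingly small, so errors from distant
# timesteps contribute almost nothing to the weight updates.
```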

To summarize, because of the vanishing gradient problem, RNNs have difficulty remembering words that appear far back in the sequence and end up making predictions based mainly on the most recent words. This hurts prediction accuracy, and in some cases the model fails at the prediction or classification task altogether.

There are several ways to handle the vanishing gradient problem. The following are some of the most popular techniques:

  • Initializing the recurrent weights to the identity matrix, so that the potential for vanishing gradients is minimized.
  • Setting the activation function to ReLU instead of sigmoid or tanh. This keeps the network's computations close to the identity function, and it works well because the derivative of ReLU is either 0 or 1, so the error derivatives propagated backwards through time are not repeatedly scaled down and are far less likely to vanish (a sketch combining this with identity initialization follows this list).
  • Using LSTMs, a variant of the regular recurrent network designed to make it easier to capture long-term dependencies in sequence data. In a standard RNN, the hidden state activation is influenced mainly by the local activations closest to it, which corresponds to a short-term memory, whereas the network weights are shaped by the computations that take place over entire long sequences, which corresponds to a long-term memory. The LSTM redesigns the RNN so that it has an activation state that can also act like the weights and preserve information over long distances, hence the name long short-term memory.
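The following is a minimal PyTorch sketch (the layer sizes and dummy data are illustrative assumptions, not the book's code) that combines the first two fixes: a ReLU-activated RNN whose recurrent weights start as the identity matrix, so that early in training the hidden state is simply copied forward and gradients pass through near-identity maps:

```python
# Hedged sketch: ReLU RNN with identity-initialized recurrent weights.
import torch
import torch.nn as nn

hidden_size = 64
rnn = nn.RNN(input_size=32, hidden_size=hidden_size,
             nonlinearity='relu', batch_first=True)

with torch.no_grad():
    # Hidden-to-hidden weights start as the identity matrix, so the hidden
    # state is initially carried forward unchanged from step to step.
    rnn.weight_hh_l0.copy_(torch.eye(hidden_size))
    rnn.bias_hh_l0.zero_()
    rnn.bias_ih_l0.zero_()

x = torch.randn(8, 100, 32)   # a batch of 8 sequences, 100 steps, 32 features
output, h_n = rnn(x)          # ReLU derivatives are 0 or 1, so backpropagated
                              # errors are not repeatedly shrunk
```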

In an LSTM, rather than each hidden node being simply a node with a single activation function, each node is a memory cell in itself that can store additional information; specifically, it maintains its own cell state. A normal RNN takes in its previous hidden state and the current input and outputs a new hidden state. An LSTM does the same, except that it also takes in its old cell state and outputs its new cell state.
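The single-timestep sketch below (again an illustrative assumption in PyTorch, not code from the text) makes this concrete: the LSTM cell consumes the current input together with both the previous hidden state and the previous cell state, and returns updated versions of both, whereas a plain RNN cell would only take and return the hidden state:

```python
# Hedged sketch: one LSTM timestep, showing the extra cell state c_t.
import torch
import torch.nn as nn

input_size, hidden_size = 32, 64
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(8, input_size)     # current input for a batch of 8
h_t = torch.zeros(8, hidden_size)    # previous hidden state
c_t = torch.zeros(8, hidden_size)    # previous cell state (the cell's memory)

# One step: the cell takes (x_t, h_t, c_t) and emits a new (h_t, c_t).
h_t, c_t = cell(x_t, (h_t, c_t))
```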
