LSTMs

RNNs gradually lose historical context as sequences grow longer, which makes them hard to train for practical purposes. This is where LSTMs come into the picture! Introduced by Hochreiter and Schmidhuber in 1997, LSTMs can remember information from very long sequences and mitigate issues such as the vanishing gradient problem. An LSTM cell usually consists of three or four gates, including input, output, and forget gates.

The following diagram shows a high-level representation of a single LSTM cell:

The input gate allows or denies incoming signals or inputs to alter the memory cell state. The output gate propagates the value to other neurons as necessary. The forget gate controls the memory cell's self-recurrent connection, letting the cell remember or forget previous states as necessary. Multiple LSTM cells are usually stacked in a deep learning network to solve real-world problems, such as sequence prediction. In the following diagram, we compare the basic structures of RNNs and LSTMs: 

A detailed architecture of an LSTM cell and the information flow through it is presented in the following diagram. Let t indicate the time step; C, the cell state; and h, the hidden state. The LSTM cell has the ability to remove or add information to the cell state through structures called gates. The gates i, f, and o represent the input, forget, and output gates, respectively, and each of them is modulated by a sigmoid layer, which outputs numbers between zero and one, controlling how much of the output from these gates should pass. This helps in protecting and controlling the cell state:
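In terms of equations, the cell computations shown in that diagram can be written as follows (a minimal sketch in standard notation; the weight matrices W and bias vectors b of each gate are generic placeholders, as they are not named in the text):

\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}

Here, σ denotes the sigmoid function and ⊙ denotes element-wise multiplication.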

The information flow through LSTMs has four steps:

  1. Decide what information to throw away from the cell state: This decision is made by a sigmoid layer called the forget gate layer. An affine transformation is applied to ht-1 and xt, and the output is passed through a sigmoid squashing function to get a number between zero and one for each number in the cell state, Ct-1. One indicates that the memory should be kept and zero indicates that the memory should be erased completely.

  2. Decide what new information to write to memory: This is a two-step process. Firstly, a sigmoid layer called the input gate layer, it, is used to decide which locations to write the information to. Next, a tanh layer creates the new candidate information to be written (see the code sketch after this list). 
  3. Update the memory state: The old memory state is multiplied by ft, erasing the things that were determined forgettable. Then, the new candidate information computed in step 2 is added after scaling it by it.
  4. Output the memory state: The final output of the cell depends on the current input and the updated cell state. Firstly, a sigmoid layer is used to decide which parts of the cell state we're going to output. Then, the cell state is passed through tanh and multiplied by the output of the sigmoid gate.
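Putting these four steps together, the following is a minimal NumPy sketch of a single LSTM cell step (the weight and bias names, dimensions, and toy data are hypothetical, chosen only to illustrate the computation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    # All gates look at the previous hidden state and the current input.
    z = np.concatenate([h_prev, x_t])
    # Step 1: forget gate decides what to erase from the old cell state.
    f_t = sigmoid(W["f"] @ z + b["f"])
    # Step 2: input gate decides where to write; a tanh layer creates candidates.
    i_t = sigmoid(W["i"] @ z + b["i"])
    c_tilde = np.tanh(W["c"] @ z + b["c"])
    # Step 3: update the memory state.
    c_t = f_t * c_prev + i_t * c_tilde
    # Step 4: output gate decides which parts of the state to expose.
    o_t = sigmoid(W["o"] @ z + b["o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Toy usage: hidden size 4, input size 3, sequence length 5.
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((hidden, hidden + inputs)) for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, inputs)):
    h, c = lstm_cell_step(x_t, h, c, W, b)

The hidden state h after the loop summarizes the whole toy sequence, which is exactly what the classifiers discussed later feed to a dense softmax layer.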

I recommend that you check out Christopher Olah's blog post at http://colah.github.io/posts/2015-08-Understanding-LSTMs/ for a more detailed explanation of the LSTM steps. Most of the diagrams we have looked at here are taken from that post.

LSTMs can be used for sequence forecasting as well as sequence classification. For example, we can forecast future stock prices. We can also use LSTMs to build classifiers that predict whether an input signal from a health-monitoring system is fatal or non-fatal (a binary classifier). We can even build a text document classifier with LSTMs: the sequence of words goes as input to the LSTM layer, and the hidden state of the LSTM is connected to a dense softmax layer that acts as the classifier. 
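To make the text classifier idea concrete, here is a minimal Keras sketch (the vocabulary size, embedding dimension, number of LSTM units, and number of classes are hypothetical placeholders, not values from the text):

import tensorflow as tf
from tensorflow.keras import layers

vocab_size = 20000    # hypothetical vocabulary size
embed_dim = 128       # hypothetical embedding dimension
num_classes = 5       # hypothetical number of document classes

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),         # word indices -> dense vectors
    layers.LSTM(64),                                 # final hidden state summarizes the document
    layers.Dense(num_classes, activation="softmax")  # dense softmax layer as the classifier
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

Integer-encoded, padded word sequences of shape (batch, time) would then be passed to model.fit, with integer class labels as the targets.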
