Long Short-Term Memory (LSTM)

An LSTM network can control when to let the input enter the neuron, when to remember what was learned in the previous time step, and when to let the output pass on to the next time step. All these decisions are self-tuned and based only on the input. At first glance, an LSTM looks difficult to understand, but it is not. Let's use the following figure to explain how it works:

An example of an LSTM cell

First, we need a logistic function σ (see Chapter 2, Regression) to compute a value between 0 and 1 and control which pieces of information flow through the LSTM gates. Remember that the logistic function is differentiable, and therefore it allows backpropagation. Then, we need an operator ⊗ that takes two matrices of the same dimensions and produces another matrix where each element ij is the product of the elements ij of the original two matrices. Similarly, we need an operator ⊕ that takes two matrices of the same dimensions and produces another matrix where each element ij is the sum of the elements ij of the original two matrices. With these basic blocks, we consider the input x_i at time i and simply juxtapose it with the output y_{i-1} from the previous step.
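
To make these building blocks concrete, here is a minimal NumPy sketch; the array shapes and values are illustrative assumptions, not taken from the text. It shows the logistic function, the element-wise product ⊗, the element-wise sum ⊕, and the juxtaposition of x_i with y_{i-1}:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes each element into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative vectors (shapes and values chosen arbitrarily for the example).
x_i = np.array([0.5, -1.2, 3.0])         # input at time i
y_prev = np.array([0.1, 0.7, -0.3])      # output from the previous step

elementwise_product = x_i * y_prev       # the ⊗ operator
elementwise_sum = x_i + y_prev           # the ⊕ operator
concat = np.concatenate([y_prev, x_i])   # juxtaposition [y_{i-1}, x_i]

print(sigmoid(x_i), elementwise_product, elementwise_sum, concat)
```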

The equation f_i = σ(W_f · [y_{i-1}, x_i] + b_f) implements a logistic regression that controls the forget gate ⊗ and is used to decide how much information from the previous candidate value C_{i-1} should be passed on to the next candidate value C_i (here, W_f and b_f are the weight matrix and the bias used for the logistic regression). If the logistic outputs 1, this means don't forget the previous cell state C_{i-1}; if it outputs 0, this means forget the previous cell state C_{i-1}. Any number in (0, 1) represents the amount of information to be passed on.
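
As a sketch of this gate alone, the following code computes f_i; the dimensions, random weights, and zero biases are made-up placeholders standing in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3           # illustrative dimensions
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # weight matrix
b_f = np.zeros(hidden_size)                                     # bias

y_prev = rng.normal(size=hidden_size)    # y_{i-1}
x_i = rng.normal(size=input_size)        # x_i

# f_i = sigma(W_f . [y_{i-1}, x_i] + b_f): each entry in (0, 1) says how much
# of the corresponding entry of C_{i-1} to keep.
f_i = sigmoid(W_f @ np.concatenate([y_prev, x_i]) + b_f)
print(f_i)
```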

Then we have two equations: s_i = σ(W_s · [y_{i-1}, x_i] + b_s) is used to control, via ⊗, how much of the information Ĉ_i = tanh(W_C · [y_{i-1}, x_i] + b_C) produced by the current cell should be added to the next candidate value C_i via the operator ⊕, according to the scheme represented in the preceding figure.
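
The same pattern gives the gate s_i and the candidate Ĉ_i; again, the sizes and random weights below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

W_s = rng.normal(size=(hidden_size, hidden_size + input_size))
b_s = np.zeros(hidden_size)
W_C = rng.normal(size=(hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

y_prev = rng.normal(size=hidden_size)
x_i = rng.normal(size=input_size)
z = np.concatenate([y_prev, x_i])        # [y_{i-1}, x_i]

s_i = sigmoid(W_s @ z + b_s)             # how much new information to let in
C_hat_i = np.tanh(W_C @ z + b_C)         # candidate values proposed by this cell
print(s_i, C_hat_i)
```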

To implement what has been discussed with the operators ⊕ and ⊗, we need another equation where the actual sums + and multiplications * take place: C_i = f_i * C_{i-1} + s_i * Ĉ_i
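
In code, this update is nothing more than element-wise arithmetic; the values below are placeholders for the gate outputs and states computed above:

```python
import numpy as np

# Placeholder gate outputs and states; in practice these come from the
# logistic and tanh expressions shown earlier.
f_i = np.array([0.9, 0.1, 0.5])          # forget gate
s_i = np.array([0.2, 0.8, 0.5])          # input gate
C_prev = np.array([1.0, -2.0, 0.5])      # previous cell state C_{i-1}
C_hat_i = np.array([0.3, 0.7, -0.1])     # candidate values

# C_i = f_i * C_{i-1} + s_i * C_hat_i
C_i = f_i * C_prev + s_i * C_hat_i
print(C_i)
```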

Finally, we need to decide which part of the current cell should be sent to the y_i output. This is simple: we take a logistic regression equation one more time and use it to control, via an ⊗ operation, which part of the candidate value should go to the output. Here, there is one little piece that deserves care, and it is the use of the tanh function to squash the output into [-1, 1]. This latest step is described by the equations: o_i = σ(W_o · [y_{i-1}, x_i] + b_o) and y_i = o_i * tanh(C_i)
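
The following sketch ties all the pieces together into a single LSTM-cell forward step. It follows the equations in this section, with randomly initialized weights and arbitrary sizes standing in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_i, y_prev, C_prev, params):
    """One LSTM step: returns (y_i, C_i) following the equations above."""
    z = np.concatenate([y_prev, x_i])                      # [y_{i-1}, x_i]
    f_i = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    s_i = sigmoid(params["W_s"] @ z + params["b_s"])       # input gate
    C_hat = np.tanh(params["W_C"] @ z + params["b_C"])     # candidate values
    C_i = f_i * C_prev + s_i * C_hat                       # new cell state
    o_i = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    y_i = o_i * np.tanh(C_i)                               # squashed output
    return y_i, C_i

hidden, inp = 4, 3
rng = np.random.default_rng(42)
params = {name: rng.normal(size=(hidden, hidden + inp))
          for name in ("W_f", "W_s", "W_C", "W_o")}
params.update({name: np.zeros(hidden) for name in ("b_f", "b_s", "b_C", "b_o")})

y, C = lstm_cell_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), params)
print(y, C)
```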

Now, I understand that this looks like a lot of math, but there are two pieces of good news. First, the math is not so difficult after all once you understand the goal we want to achieve. Second, you can use LSTM cells as a black-box drop-in replacement for standard RNN cells and immediately get the benefit of solving the vanishing gradient problem. For this reason, you really don't need to know all the math. You just take the TensorFlow LSTM implementation from the library and use it.
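
As a sketch of this drop-in usage, the following example assumes TensorFlow 2.x with the Keras API; the layer sizes and input shape are arbitrary choices for illustration:

```python
import tensorflow as tf

# A small sequence model: using the LSTM layer gives the gated cell described
# in this section without writing any of its equations by hand.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),  # 10 time steps, 8 features
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```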
