Long short-term memory networks

LSTM is a particular RNN architecture, originally proposed by Hochreiter and Schmidhuber in 1997. This type of neural network has recently been rediscovered in the context of deep learning because it largely avoids the vanishing gradient problem, and in practice it offers excellent results and performance.

The vanishing gradient problem affects the training of ANNs with gradient-based learning methods. In gradient-based methods such as backpropagation, weights are adjusted in proportion to the gradient of the error. Because these gradients are computed by repeatedly applying the chain rule, their magnitude can decrease exponentially as we move back toward the earliest layers. The problem is that in some cases the gradient becomes vanishingly small, effectively preventing a weight from changing its value; in the worst case, this may stop the neural network from training any further.
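
To see the effect numerically, consider the following minimal sketch (plain NumPy; the weight value and the number of steps are purely illustrative assumptions). It multiplies an error gradient by one chain-rule factor per layer, the product of a weight and a sigmoid derivative, exactly as backpropagation does:

```python
import numpy as np

# Minimal sketch of the chain-rule effect: the gradient reaching a given
# layer is the product of one local factor per layer above it.
np.random.seed(0)

T = 50        # number of layers (or time steps) to backpropagate through
w = 0.9       # a hypothetical weight, constant across layers for simplicity
grad = 1.0    # gradient of the error at the output

for _ in range(T):
    s = 1.0 / (1.0 + np.exp(-np.random.randn()))  # sigmoid activation
    grad *= w * s * (1.0 - s)   # chain-rule factor; s * (1 - s) <= 0.25

# The magnitude shrinks exponentially with depth: after 50 layers it is
# dozens of orders of magnitude below 1, so the deepest weights barely move.
print(f"gradient magnitude after {T} layers: {abs(grad):.3e}")
```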

LSTM-based networks are ideal for the prediction and classification of time sequences, and they are supplanting many classic machine learning approaches. In fact, in 2012, Google replaced its voice recognition models, moving from Hidden Markov Models (which had been the standard for over 30 years) to deep learning neural networks. In 2015, it switched to LSTM RNNs combined with connectionist temporal classification (CTC).

CTC is a type of neural network output and associated scoring function for training RNNs.

This is due to the fact that LSTM networks are able to take long-term dependencies between data into account; in the case of speech recognition, this means managing the context within a sentence to improve recognition accuracy.
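
As a brief illustration of how CTC is used in practice, the following sketch computes the CTC loss with PyTorch's nn.CTCLoss (the choice of PyTorch, and all the dimensions, are assumptions made here for illustration; the text does not prescribe a library). CTC lets the network emit one prediction per time step while being trained against a transcription that is shorter and not aligned frame by frame:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
T, N, C = 50, 4, 28   # input length, batch size, alphabet size (incl. blank)
S = 10                # maximum target transcription length

# Log-probabilities over the alphabet at each time step, as an LSTM with a
# log-softmax output layer would produce them: shape (T, N, C).
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Integer-encoded target transcriptions (the blank label is index 0 here).
targets = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(low=1, high=S + 1, size=(N,), dtype=torch.long)

# CTC aligns the per-time-step predictions with the shorter target sequence,
# so no frame-level labeling of the audio is required.
ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```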

An LSTM network consists of cells (LSTM blocks) linked together. Each cell is in turn composed of three types of gates: the input gate, the output gate, and the forget gate. They respectively implement the write, read, and reset functions on the cell memory. The gates are not binary but analog (generally governed by a sigmoid activation function mapped onto the range (0, 1), where 0 indicates total inhibition and 1 indicates total activation), and they act multiplicatively. The presence of these gates allows LSTM cells to remember information for an indefinite amount of time: if the input gate is below its activation threshold, the cell keeps its previous state, while if it is enabled, the current state is combined with the input value. As its name suggests, the forget gate resets the current state of the cell (when its value is driven to zero), and the output gate decides whether the value inside the cell is propagated to the output or not.

The following figure shows an LSTM unit:
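
The gating mechanism just described can be written down directly. The following NumPy sketch of a single LSTM cell step is an illustrative implementation of the standard equations, not code taken from the text; all names and dimensions are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a single LSTM cell.

    W, U, and b hold the parameters of the four internal transformations
    (input gate, forget gate, output gate, candidate state), stacked
    row-wise; all names here are illustrative.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b      # all four pre-activations at once
    i = sigmoid(z[0:n])             # input gate: how much input to write
    f = sigmoid(z[n:2*n])           # forget gate: how much state to keep
    o = sigmoid(z[2*n:3*n])         # output gate: how much state to read out
    g = np.tanh(z[3*n:4*n])         # candidate value for the cell memory
    c = f * c_prev + i * g          # reset/keep old state, write new input
    h = o * np.tanh(c)              # expose (or not) the cell content
    return h, c

# Toy dimensions with randomly initialized parameters (illustrative only).
n_in, n_hid = 3, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.normal(size=(10, n_in)):   # unroll over a short input sequence
    h, c = lstm_cell_step(x, h, c, W, U, b)
print(h)
```

Note that the cell state c is only ever rescaled by the forget gate and added to, which is what lets the gradient flow through many time steps without vanishing.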

Approaches based on neural networks are very powerful, as they make it possible to capture the characteristics of the data and the relationships between them. In particular, LSTM networks have been shown to offer high performance and excellent recognition rates in practice. One disadvantage is that neural networks are black-box models: their behavior is not predictable, and it is not possible to trace the logic by which they process the data.
