Recurrent neural networks for time series forecasting

Recurrent Neural Networks (RNNs) are a special type of neural network designed to work with sequential data. They are popular for time series forecasting as well as for solving NLP problems such as machine translation, text generation, and speech recognition. There are numerous extensions of RNNs, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU) networks, which are currently part of some state-of-the-art architectures. However, it is good to be familiar with the original vanilla RNN first. The following diagram presents the typical RNN schema:

One of the main differences between feedforward networks and RNNs is that the former take a fixed-size input at once to produce a fixed-size output. RNNs, on the other hand, do not take all the input data at once; instead, they ingest it sequentially, one element at a time. At each step, the network applies a series of calculations to produce an output, also known as the hidden state. The hidden state is then passed on and combined with the next input to produce the following output, and the algorithm continues until it reaches the end of the input sequence. The hidden states can be described as the memory of an RNN, as they contain the context of the past elements of the sequence. Summing up, RNNs take the past observations in the sequence into account to model the future ones.
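This step-by-step logic can be illustrated with a few lines of code. The following is a minimal sketch assuming PyTorch; the layer sizes and toy data are assumptions made purely for illustration. A single nn.RNNCell is applied to a short univariate series, with the hidden state carried over from one step to the next:

import torch
import torch.nn as nn

# Hypothetical sizes: one feature per time step, a hidden state of 8 units.
rnn_cell = nn.RNNCell(input_size=1, hidden_size=8)

# A toy univariate series of 5 time steps, shaped (seq_len, batch, features).
sequence = torch.randn(5, 1, 1)

# The hidden state starts as zeros and accumulates the context of past inputs.
hidden = torch.zeros(1, 8)
for x_t in sequence:
    # The same cell combines the current input with the previous hidden state.
    hidden = rnn_cell(x_t, hidden)

# After the loop, `hidden` summarizes the whole sequence and can be mapped
# to a forecast, for example, with a linear layer.
forecast = nn.Linear(8, 1)(hidden)
print(forecast.shape)  # torch.Size([1, 1])

Note that the loop body is identical at every step; only the input and the carried-over hidden state change, which leads directly to the next point.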

Another difference is that RNNs share their weights/parameters across all the time steps, as it is the same RNN cell that performs the calculations for each element of the sequence. In other words, only the input and the hidden state are unique at each time step. For training (reducing the loss), RNNs use a variation of the backpropagation algorithm called Backpropagation Through Time (BPTT). The gradient of the error at each time step also depends on the previous steps, so those need to be taken into account while updating the weights. In BPTT, the structure of the network is unrolled (as in the preceding diagram) by creating copies of the neurons that have recurrent connections. By doing so, we turn the cyclic graph of an RNN into an acyclic graph (such as that of an MLP) and can use the standard backpropagation algorithm.
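The following sketch, again assuming PyTorch and purely illustrative toy data, shows both properties: the parameter shapes of an nn.RNN layer do not depend on the sequence length (weight sharing), and calling loss.backward() backpropagates through the unrolled computation graph, which is how BPTT is carried out in practice:

import torch
import torch.nn as nn

# One nn.RNN layer holds a single set of weights reused at every time step.
model = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)

# The parameter shapes do not depend on the sequence length --
# evidence of weight sharing across time steps.
for name, p in model.named_parameters():
    print(name, tuple(p.shape))

optimizer = torch.optim.Adam(
    list(model.parameters()) + list(head.parameters()), lr=1e-3
)
loss_fn = nn.MSELoss()

# Toy batch: 16 sequences of 20 steps each, predicting the next value.
x = torch.randn(16, 20, 1)
y = torch.randn(16, 1)

outputs, last_hidden = model(x)        # the forward pass unrolls the cell over 20 steps
prediction = head(outputs[:, -1, :])   # use the last hidden state for the forecast
loss = loss_fn(prediction, y)

loss.backward()    # autograd backpropagates through the unrolled graph (BPTT)
optimizer.step()
optimizer.zero_grad()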

Theoretically, RNNs have infinite memory. In practice, however, issues with the backpropagation algorithm make learning long-term dependencies pretty much impossible. As the gradient at each step also depends on the previous steps, we can experience either exploding or vanishing gradients. The exploding gradient problem occurs when the values of the gradient grow very large due to accumulation during the update. When this happens, the minimum of the loss function is never reached and the model becomes unstable. This issue can be solved by applying gradient clipping, that is, capping the value of the gradient at a predefined threshold. The vanishing gradient problem occurs when the values of the gradient become close to zero (also due to accumulation), which results in no (or very small) updates to the weights of the initial layers (the earliest time steps in the RNN's unrolled structure).
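Gradient clipping typically takes a single extra line. The following is a minimal sketch assuming PyTorch, with a placeholder loss and a threshold of 1.0 chosen purely for illustration; torch.nn.utils.clip_grad_norm_ rescales the gradients after the backward pass and before the optimizer step so that their overall norm does not exceed the threshold:

import torch
import torch.nn as nn

model = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 20, 1)
outputs, _ = model(x)
loss = outputs.pow(2).mean()   # placeholder loss, just to produce gradients

loss.backward()

# Cap the overall gradient norm at a predefined threshold (here 1.0) to
# protect against exploding gradients before the weights are updated.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()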

The derivatives of the hyperbolic tangent (tanh) and sigmoid activation functions approach zero at both ends of their input range (the so-called saturation regions). When this happens and the values of the gradient are close to (or equal to) zero, they drive the gradients of the layers further back toward zero as well, thus stopping the learning of the initial layers. This problem is not exclusive to RNNs, but it is frequently observed when dealing with long sequences.
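Saturation is easy to verify numerically. The short sketch below (a plain NumPy illustration with arbitrarily chosen inputs) evaluates the derivative of tanh, which is 1 - tanh(x)**2, and shows how it collapses toward zero as the input moves away from zero:

import numpy as np

# The derivative of tanh is 1 - tanh(x)**2; it shrinks rapidly for large |x|.
for x in [0.0, 2.0, 5.0, 10.0]:
    grad = 1 - np.tanh(x) ** 2
    print(f"x = {x:5.1f}  d tanh/dx = {grad:.10f}")

At x = 10, the derivative is roughly 8e-9, so multiplying many such factors during backpropagation quickly drives the gradients of the earliest time steps toward zero.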

That is why RNNs have trouble learning long-term dependencies in the input series. Some of the solutions include using the ReLU activation function (instead of tanh or sigmoid) or using more advanced architectures such as LSTMs or GRUs, as sketched below.
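Both remedies amount to a one-line change in most frameworks. The following sketch (assuming PyTorch; the layer sizes are arbitrary) shows how a vanilla nn.RNN layer can be swapped for nn.LSTM or nn.GRU, or switched to a ReLU activation:

import torch.nn as nn

# The gated layers share the (input_size, hidden_size) constructor interface
# with nn.RNN, so trying them out only requires changing the layer class.
vanilla = nn.RNN(input_size=1, hidden_size=8, batch_first=True)
lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)   # also returns a cell state
gru = nn.GRU(input_size=1, hidden_size=8, batch_first=True)

# Alternatively, keep the vanilla RNN but replace tanh with ReLU:
relu_rnn = nn.RNN(input_size=1, hidden_size=8, nonlinearity="relu", batch_first=True)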
