Recurrent neural networks 

A recurrent neural network (RNN) is specialized for processing a sequence of values, such as x(1), . . . , x(t). We need sequence modeling if, say, we want to predict the next term in a sequence given its recent history, or translate a sequence of words from one language into another. RNNs are distinguished from feedforward networks by the presence of a feedback loop in their architecture. It is often said that RNNs have memory: the sequential information is preserved in the RNN's hidden state, so the hidden layer acts as the memory of the network. In theory, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps.

We will explain why this is the case later in this section.

By unrolling the feedback loop in the network, we can obtain a feedforward network. For example, if our input sequence is of length 4, the network can be unrolled into four copies of the same cell, one per time step. After unrolling, the same set of weights, U, V, and W, is shared across all the steps, unlike in a traditional DNN, which uses different parameters at each layer. So, we are actually performing the same task at each step, just with different inputs, which greatly reduces the total number of parameters we need to learn. Now, to learn these shared weights, we need a loss function. At each time step, we can compare the network output, y(t), with the target sequence, s(t), and get an error, E(t). So, the total error is the sum of the per-step errors, $E = \sum_{t} E(t)$.
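To make this concrete, here is a minimal NumPy sketch of the unrolled computation for a length-4 sequence. The layer sizes, the tanh activation, and the squared-error per-step loss are illustrative assumptions; the point is only that the same U, V, and W are reused at every step and that the per-step errors E(t) are summed into the total error E.

import numpy as np

np.random.seed(0)
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 4

# The same U, V, W are reused at every unrolled step (shared weights).
U = np.random.randn(hidden_dim, input_dim) * 0.1   # input -> hidden
W = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden -> hidden (the feedback loop)
V = np.random.randn(output_dim, hidden_dim) * 0.1  # hidden -> output

x = [np.random.randn(input_dim) for _ in range(T)]   # input sequence x(1), ..., x(4)
s = [np.random.randn(output_dim) for _ in range(T)]  # target sequence s(1), ..., s(4)

h = np.zeros(hidden_dim)  # initial hidden state, the network's memory
E = 0.0
for t in range(T):
    h = np.tanh(U @ x[t] + W @ h)      # h(t) depends on x(t) and on h(t-1)
    y = V @ h                          # network output y(t)
    E_t = 0.5 * np.sum((y - s[t])**2)  # per-step error E(t)
    E += E_t                           # total error E = sum of E(t) over t

print("Total error over the unrolled steps:", E)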

Let's take a look at the derivatives of the total error that we will need for learning the weights with gradient-based optimization algorithms. We have $h_t = \Phi(U x_t + W h_{t-1})$, $\Phi$ being the non-linear activation, and $y_t = V h_t$.

Now, $\frac{\partial E}{\partial W} = \sum_{t} \frac{\partial E_t}{\partial W}$, where by the chain rule we have $\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial h_k}{\partial W}$.

Here, the Jacobian $\frac{\partial h_t}{\partial h_k}$ (that is, the derivative of layer t with respect to a previous layer k) is itself a product of Jacobians: $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$.
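Substituting this product into the preceding chain-rule expression makes the source of the trouble explicit; this is just a restatement of the two formulas above, not a new result:

$$\frac{\partial E_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right)\frac{\partial h_k}{\partial W}$$

For a step k far in the past, this product contains t - k Jacobian factors, and the behavior of such long products is what we examine next.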

Using the preceding equation for $h_t$, we have $\frac{\partial h_j}{\partial h_{j-1}} = W^{\top}\,\mathrm{diag}\!\left[\Phi'\!\left(U x_j + W h_{j-1}\right)\right]$. So, the norm of the Jacobian is bounded by the product $\left\|\frac{\partial h_j}{\partial h_{j-1}}\right\| \leq \left\|W^{\top}\right\|\,\left\|\mathrm{diag}\!\left[\Phi'\right]\right\|$. If these quantities are less than one, then over a long sequence of, say, 100 steps, the product of these norms will tend to zero. Similarly, if the norms are greater than one, then the product over long sequences will grow exponentially. These problems are called vanishing gradients and exploding gradients in RNNs. So, in practice, RNNs cannot have very long-term memories.
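A tiny Python sketch gives a feel for the scale of this effect; the per-step norm values 0.9 and 1.1 are arbitrary illustrative choices, not derived from any particular network.

# Product of per-step Jacobian norm bounds over a long sequence.
# 0.9 and 1.1 are arbitrary example values for the per-step norm.
steps = 100
for per_step_norm in (0.9, 1.1):
    product = per_step_norm ** steps
    print(f"per-step norm {per_step_norm}: product over {steps} steps = {product:.3e}")

# Typical output:
#   per-step norm 0.9: product over 100 steps = 2.656e-05   (vanishing)
#   per-step norm 1.1: product over 100 steps = 1.378e+04   (exploding)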
