In the last chapter, we looked at convolutional neural networks (CNNs). Another network architecture that is widely used, for example in natural language processing, is the recurrent one. Networks with this architecture are called recurrent neural networks, or RNNs. A full explanation of RNNs would require multiple books, so this chapter gives only a high-level description of how they work, together with one small application that should help you better understand their inner workings. I discuss only the most basic components of RNNs, to elucidate their fundamental aspects; it is useful for machine learning engineers to have at least this level of understanding. At the end of the chapter, I suggest further reading in case you find the subject interesting and want to study RNNs more deeply.
Introduction to RNNs
RNNs are very different from CNNs and are typically used when dealing with sequential information, that is, with data in which the order matters. The typical example is a series of words in a sentence. You can easily understand how the order of words in a sentence can make a big difference: “the man eats the rabbit” has a different meaning than “the rabbit eats the man.” The order of the words changes who gets eaten by whom.
RNNs use sequential data and the information encoded in the order of the terms in a sequence. They apply the same operation to every term in the sequence and build up a memory of the previous terms in order to predict the next one. Typical applications include the following:
Generating text: Predicting the probability of words, given the previous words. For example, you can easily generate text that reads like Shakespeare with RNNs, as A. Karpathy has done on his blog [2].
Translation: Given a set of words in one language, you predict the corresponding words in a different language.
Speech recognition: Given a series of audio signals (words), you want to predict the sequence of letters forming the spoken words.
Generating image labels: Together with CNNs, RNNs can be used to generate labels for images. Check out the paper “Deep Visual-Semantic Alignments for Generating Image Descriptions” by A. Karpathy on the subject [3]. Be aware that this is a rather advanced paper that requires a mathematical background.
Chatbots: Given a sequence of words as input, an RNN tries to generate an answer to it.
As you can imagine, solving these problems requires sophisticated architectures that are not easy to describe in a few sentences and that require a deeper (pun intended) understanding of how RNNs work. Those topics go beyond the scope of this chapter and book.
Notation
As a running example, take the sentence “Paris is the capital of France” and consider it as a sequence of words:
“Paris” will be the first word of the sequence: w1 = 'Paris'
“is” will be the second word of the sequence: w2 = 'is'
“the” will be the third word of the sequence: w3 = 'the'
“capital” will be the fourth word of the sequence: w4 = 'capital'
“of” will be the fifth word of the sequence: w5 = 'of'
“France” will be the sixth word of the sequence: w6 = 'France'
xt: The input at time t. For example, w1 could be the input at time 1 (x1 = w1), w2 the input at time 2 (x2 = w2), and so on.
st: The internal memory (not yet defined) of the network at time t. This quantity contains the accumulated information about the previous terms of the sequence that we discussed previously. An intuitive understanding of it will have to suffice, since a precise mathematical definition would require a very detailed explanation.
ot: The output of the network at time t; in other words, the output after all the elements of the sequence up to and including xt have been fed into the network.
The Basic Idea of RNNs
The internal memory at time t is evaluated from the current input and the memory at the previous time step:

st = f(U xt + W st−1)

where we indicate with f() one of the activation functions we have seen, such as ReLU or tanh, and where U and W are the weight matrices of the network. Of course, the formula is multi-dimensional: xt and st are vectors, and U and W are matrices. st can be understood as the memory of the network at time t. The number of neurons (or time steps) that can be used is a new hyperparameter that needs to be tuned, depending on the problem. Research has shown that when this number is too big, the network has problems during training.
Something very important to note is that the weights do not change from one time step to the next: we perform the same operation at each step, and only the input changes. Additionally, Figure 8-1 shows an output at every step (ot, ot + 1, and ot + 2), but typically not all of them are needed. In the example where we want to predict the final word in a sentence, we may need just the final output. The sketch below illustrates both points.
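To make this concrete, here is a minimal sketch of the recurrence in plain Python with NumPy. Everything in it (the number of hidden units, the random initialization, the example sequence) is an illustrative assumption, not a trained network:

    import numpy as np

    rng = np.random.default_rng(0)
    n_hidden = 4                                # illustrative size, not tuned
    U = rng.normal(size=(n_hidden, 1))          # input-to-state weights
    W = rng.normal(size=(n_hidden, n_hidden))   # state-to-state weights

    def step(s_prev, x_t):
        # The same U and W are applied at every time step; only the
        # input x_t and the carried-over state s_prev change.
        return np.tanh(U @ x_t + W @ s_prev)

    sequence = [np.array([1.0]), np.array([0.0]), np.array([1.0])]
    s = np.zeros(n_hidden)                      # initial memory state
    for x_t in sequence:
        s = step(s, x_t)                        # s accumulates past information
    print(s)                                    # the memory after the whole sequence

Note how U and W are created once and reused at every step: that is exactly the weight sharing just described.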
Why the Name Recurrent
The internal memory state at time t is evaluated using the memory state at time t − 1; that state, in turn, is evaluated using the state at time t − 2, and so on. Unrolling the formula once makes this explicit: st = f(U xt + W f(U xt−1 + W st−2)). This recursive structure is at the origin of the name recurrent.
Learning to Count
To give you an idea of the power of such networks, this section shows a very basic example of something RNNs are very good at and that standard fully connected networks, like the ones you saw in the previous chapters, are really bad at. Let's try to teach a network to count.
The problem we want to solve is the following: given a vector of 15 elements containing only 0s and 1s, we want to build a neural network that can count the number of 1s in it. This is a difficult problem for a standard network, but why? Consider the problem we analyzed of distinguishing the digits 1 and 2 in the MNIST dataset. In that case, learning happens because 1s and 2s have black pixels in fundamentally different positions. A digit 1 will always differ from a digit 2 in the same way (at least in the MNIST dataset), and the network can identify those differences; as soon as they are detected, a clear identification can be made. In the counting problem, that is not possible.
Consider for example a simpler case: a vector with just five elements in which a 1 appears exactly once. We have five possible cases: [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], and [0,0,0,0,1]. There is no discernible pattern to be detected here, and no easy weight configuration could cover all those cases at the same time. For an image, this problem is similar to that of detecting the position of a black square in a white image. We can build a network in TensorFlow and check how well such networks count. Due to the introductory nature of this chapter, there is no hyperparameter discussion, metric analysis, and so on. We simply look at a basic network that can count.
As you can easily verify, the list contains all possible combinations: all combinations with the 1 appearing once ([0001], [0010], [0100], and [1000] in the four-digit case), with the 1s appearing twice, and so on. For this example, we use 15 digits, which means generating all the numbers up to 2^15 = 32,768. The rest of the code simply transforms a string like '0100' into a list [0,1,0,0] and then concatenates the lists for all the possible combinations; a sketch of this data preparation follows the next paragraph.
Remember that this works because we shuffled the vectors at the beginning, so we have a random distribution of cases. We use 2,000 cases for the dev set and the rest (30,768) for the training set. The train_input will have dimensions (30768, 15, 1) and the dev_input (2000, 15, 1); the targets, one-hot encoded over the 16 possible counts (0 through 15), have dimensions (30768, 16) and (2000, 16).
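The following is a minimal sketch of this data preparation in NumPy. The variable names (train_input, dev_input, and so on) match the discussion above, but the listing is a reconstruction, not the exact original code:

    import numpy as np

    n_digits = 15
    # All 2**15 = 32,768 binary strings of length 15, e.g. '000000000000101'.
    strings = ['{:015b}'.format(i) for i in range(2**n_digits)]

    # Turn each string into a list of 0s and 1s, shaped (15, 1) so the
    # network sees one digit per time step.
    X = np.array([[int(c) for c in s] for s in strings], dtype=np.float32)
    X = X.reshape(-1, n_digits, 1)

    # The label for each vector is its count of 1s (0 through 15),
    # one-hot encoded over 16 classes.
    counts = X.sum(axis=(1, 2)).astype(int)
    targets = np.eye(n_digits + 1, dtype=np.float32)[counts]

    # Shuffle, then hold out 2,000 cases for the dev set; the remaining
    # 30,768 form the training set.
    rng = np.random.default_rng(42)
    perm = rng.permutation(len(X))
    X, targets = X[perm], targets[perm]
    dev_input, dev_target = X[:2000], targets[:2000]
    train_input, train_target = X[2000:], targets[2000:]
    print(train_input.shape, dev_input.shape)  # (30768, 15, 1) (2000, 15, 1)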
After just ten epochs, the network is correct in 99% of the cases; just let it run for more epochs to reach even higher accuracy. An instructive exercise is to try to train a fully connected network (like the ones we have discussed so far) to count. You will see that it cannot solve this task.
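For reference, a network along these lines can be built in a few lines of TensorFlow (Keras). It reuses the arrays from the previous sketch; the LSTM size, optimizer, and batch size are illustrative assumptions, not necessarily the configuration that produced the result just quoted:

    import tensorflow as tf

    # A small recurrent model for the counting task; it assumes train_input,
    # train_target, dev_input, and dev_target from the previous sketch.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(24, input_shape=(15, 1)),    # one digit per time step
        tf.keras.layers.Dense(16, activation='softmax'),  # one class per count 0-15
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_input, train_target,
              validation_data=(dev_input, dev_target),
              epochs=10, batch_size=32)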
Conclusion
This chapter was a very brief description of RNNs. You should now have a basic idea of how they work and of how LSTM neurons are structured. There is a lot more to say about RNNs, but that would go beyond the scope of this book, so I have left it out. RNNs are an advanced topic that requires a bit more know-how to understand fully. In the next section, I list two sources, freely available on the Internet, that you can use to kick-start your RNN education.
Further Reading
A much more complete and advanced treatment of RNNs can be found at www.deeplearningbook.org/contents/rnn.html. Be aware that it requires a much stronger mathematical background.
This review paper is full of information and further references that you can track down and read: https://arxiv.org/pdf/1808.03314.pdf.