© Umberto Michelucci 2022
U. Michelucci, Applied Deep Learning with TensorFlow 2, https://doi.org/10.1007/978-1-4842-8020-1_8

8. A Brief Introduction to Recurrent Neural Networks

Umberto Michelucci
Dübendorf, Switzerland

In the last chapter, we looked at convolutional neural networks (CNNs). Another widely used network architecture (for example, in natural language processing) is the recurrent one. Networks with this architecture are called recurrent neural networks, or RNNs. This chapter is a high-level description of how RNNs work, with one small application that should help you better understand their inner workings. A full explanation of RNNs would require multiple books, so the goal of this chapter is only to give you a basic understanding of how they work; it is useful for machine learning engineers to have at least that. I discuss only the fundamental components of RNNs. At the end of the chapter, I suggest further reading in case you find the subject interesting and want to understand RNNs more deeply.

Introduction to RNNs

RNNs are very different from CNNs and are typically used when dealing with sequential information, that is, with data in which the order matters. The typical example is a series of words in a sentence: you can easily understand how the order of words in a sentence can make a big difference. For example, saying “the man eats the rabbit” has a different meaning than “the rabbit eats the man.” The order of the words changes who gets eaten by whom.

You can use RNNs to predict, for example, the next word in a sentence. Take for example the phrase “Paris is the capital of.” It is easy to complete the sentence with “France,” and that means that there is information about the final word of the sentence encoded in the previous words. That information is what RNNs exploit in order to predict the next terms in a sequence. The name recurrent comes from how they work: the network applies the same operation on each element of the sequence, accumulating information about the previous terms. To summarize:
  • RNNs use sequential data and the information encoded in the order of the terms in a sequence.

  • RNNs apply the same kind of operation to all terms in a sequence and build a memory of the previous terms in the sequence to predict the next term.

Before exploring how they work in more depth, let's consider a few important use cases. These examples show the range of applications possible.
  • Generating text : Predicting the probability of words, given a previous set of words. For example, you can easily generate text that looks like Shakespeare with RNNs, as A. Karpathy has done in his blog [2].

  • Translation : Given a set of words in a language, you predict words in a different language.

  • Speech recognition : Given a series of audio signals (words), you want to predict the sequence of letters forming the spoken words.

  • Generating image labels : Combined with CNNs, RNNs can be used to generate labels for images. Check out the paper “Deep Visual-Semantic Alignments for Generating Image Descriptions” by A. Karpathy on the subject [3]. Be aware that this is a rather advanced paper that requires a mathematical background.

  • Chatbots : When a sequence of words is given as input, RNNs try to generate answers to the input.

As you can imagine, to solve those problems you need sophisticated architectures that are not easy to describe in a few sentences and that require a deeper (pun intended) understanding of how RNNs work. These are topics that go beyond the scope of this chapter and book.

Notation

Consider the sequence: “Paris is the capital of France.” This sentence will be fed to an RNN one word at a time: first “Paris,” then “is,” then “the,” and so on.
  • “Paris” will be the first word of the sequence: w1 = 'Paris'

  • “is” will be the second word of the sequence: w2 = 'is'

  • “the” will be the third word of the sequence: w3 = 'the'

  • “capital” will be the fourth word of the sequence: w4 = 'capital'

  • “of” will be the fifth word of the sequence: w5 = 'of'

  • “France” will be the sixth word of the sequence: w6 = 'France'

The words will be fed into the RNN in the following order: w1, w2, w3, w4, w5, and then w6. The network processes the words one after the other, at different time points: if word w1 is processed at time t, then w2 is processed at time t + 1, w3 at time t + 2, and so on. The time t is not related to real time or to computing time; it simply indicates the position of an element in the sequence, and the increment from t to t + 1 just means moving to the next element. In other words, the elements are processed sequentially, not in parallel. You may see the following notations when reading papers, blogs, or books:
  • xt: The input at time t. For example, w1 could be the input x1 at time 1, w2 the input x2 at time 2, and so on.

  • st: The internal memory of the network at time t (we have not defined it precisely yet). This quantity accumulates the information about the previous terms in the sequence that we discussed earlier. An intuitive understanding of it will have to suffice, since a precise mathematical definition would require a much more detailed explanation.

  • ot: The output of the network at time t, or in other words the output after all the elements of the sequence up to and including xt have been fed into the network.
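To make the notation concrete, here is a tiny sketch (not part of the book's example) in which each word wt of the sentence is mapped to a numeric input xt, here simply an integer index into a toy vocabulary:
sentence = "Paris is the capital of France".split()
vocab = {word: idx for idx, word in enumerate(sentence)}   # a toy vocabulary built from the sentence itself
x = [vocab[w] for w in sentence]   # x_1, ..., x_6: the inputs fed to the network one at a time, in order
print(x)                           # [0, 1, 2, 3, 4, 5]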

The Basic Idea of RNNs

Typically, an RNN is depicted in the literature as in the leftmost part of Figure 8-1. The notation is indicative and has the goal of simply identifying the different elements of the network: x is the input, s is the internal memory, W is one set of weights, and U is another set of weights. In reality, this schematic representation is simply a compact way of depicting the real structure of the network, which you can see on the right side of Figure 8-1. This is sometimes called the unfolded version of the network.
Figure 8-1. A schematic representation of an RNN

The right side of Figure 8-1 should be read left to right. The first neuron in the figure does its evaluation at an indicative time t, produces an output ot, and creates an internal memory state st. The second neuron, which does its evaluation at time t + 1, after the first neuron, receives as input both the next element in the sequence, xt + 1, and the previous memory state, st. It then generates an output ot + 1 and a new internal memory state st + 1. The third neuron (the one at the extreme right of Figure 8-1) receives as input the next element of the sequence, xt + 2, and the previous internal memory state, st + 1. The process proceeds this way for a finite number of neurons. You can see in Figure 8-1 that there are two sets of weights, W and U: one set (indicated with W) is applied to the internal memory states, and the other (U) to the sequence elements. Typically, each neuron generates the new internal memory state with a formula that looks like this
$$ s_t = f\left( U x_t + W s_{t-1} \right) $$

where f() indicates one of the activation functions we have seen, such as ReLU or tanh. Keep in mind that this formula is, of course, multi-dimensional: xt and st are vectors, and U and W are matrices. st can be understood as the memory of the network at time t. The number of neurons (or time steps) that is used is a new hyperparameter that needs to be tuned, depending on the problem. Research has shown that when this number is too big, the network has problems during training.
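To make this concrete, here is a minimal NumPy sketch (not taken from the book's code) of the recurrence for st shown above, using f = tanh; the dimensions and values are arbitrary:
import numpy as np

rng = np.random.default_rng(0)
input_dim, state_dim, T = 3, 4, 6              # toy dimensions and sequence length
U = rng.normal(size = (state_dim, input_dim))  # weights applied to the input x_t
W = rng.normal(size = (state_dim, state_dim))  # weights applied to the memory s_{t-1}
x = rng.normal(size = (T, input_dim))          # a toy sequence of T elements
s = np.zeros(state_dim)                        # the initial memory state s_0
for t in range(T):
    s = np.tanh(U @ x[t] + W @ s)              # the same U and W are reused at every step
print(s)                                       # the memory after the whole sequence has been processed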

Something very important to note is that the weights don't change from one time step to the next: we perform the same operation at each step, simply changing the input every time we do an evaluation. Additionally, in Figure 8-1 there is an output at every step in the diagram (ot, ot + 1, and ot + 2), but typically this is not necessary. In the example where we wanted to predict the final word in a sentence, we may need only the final output.
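In Keras, whether a recurrent layer returns the output at every time step or only the final one is controlled by its return_sequences argument. Here is a minimal sketch (not part of the book's example) showing the difference:
import tensorflow as tf
from tensorflow.keras import layers

seq = tf.random.normal((1, 6, 8))                        # one sequence of 6 steps with 8 features each
rnn_all = layers.SimpleRNN(4, return_sequences = True)   # returns o_t for every time step
rnn_last = layers.SimpleRNN(4)                           # returns only the final output
print(rnn_all(seq).shape)                                # (1, 6, 4): one output per time step
print(rnn_last(seq).shape)                               # (1, 4): only the output after the last element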

Why the Name Recurrent

We need to discuss very briefly why these networks are called recurrent. We have mentioned that the internal memory state at time t is given by the following
$$ s_t = f\left( U x_t + W s_{t-1} \right) $$

The internal memory state at time t is evaluated using the memory state at time t − 1, which in turn is evaluated using the state at time t − 2, and so on. This recursion is at the origin of the name recurrent.
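You can make the recursion explicit by substituting the formula for st − 1 into the formula for st, and so on backward through the sequence:
$$ s_t = f\left( U x_t + W\, f\left( U x_{t-1} + W s_{t-2} \right) \right) = \dots $$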

Learning to Count

To give you an idea of the power of such networks, this section shows a very basic example of something RNNs are very good at, and that standard fully connected networks, such as the ones you saw in the previous chapters, are really bad at. Let's try to teach a network to count.

The problem we want to solve is the following: given a vector of 15 elements containing only 0s and 1s, we want to build a neural network that can count how many 1s it contains. This is a difficult problem for a standard network, but why? Consider the problem we analyzed of distinguishing the digits 1 and 2 in the MNIST dataset. In that case, learning is possible because the 1s and the 2s have black pixels in fundamentally different positions. A digit 1 will always differ (at least in the MNIST dataset) in the same way from a digit 2, and the network can identify those differences. As soon as they are detected, a clear identification can be made. In this case, that is not possible.

Consider, for example, a simpler case of a vector with just five elements, and suppose a 1 appears exactly one time. There are five possible cases: [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0], and [0,0,0,0,1]. There is no discernible pattern to be detected here, and no easy weight configuration that could cover all those cases at the same time. In an image, this problem is similar to detecting the position of a black square in a white image. We can build a network in TensorFlow and check how good such networks are at this task. Due to the introductory nature of this chapter, there is no hyperparameter discussion, metric analysis, and so on. We simply look at a basic network that can count.

Let's start by creating the vectors. We will create 2^15 = 32,768 vectors, which we will split into training and dev sets.
import numpy as np
import tensorflow as tf
from random import shuffle
from tensorflow import keras
from tensorflow.keras import layers
Now we will create the list of vectors. The code is slightly more complicated, so let's look at it in a bit more detail.
nn = 15
ll = 2**15
train_input = ['{0:015b}'.format(i) for i in range(ll)]
# consider every number up to 2^15 in binary format
shuffle(train_input) # shuffle inputs
train_input = [map(int, i) for i in train_input]
ti  = []
for i in train_input:
  temp_list = []
  for j in i:
    temp_list.append([j])
  ti.append(np.array(temp_list))
train_input = ti
We want to have all possible combinations of 1s and 0s in vectors of 15 elements, so an easy way to get them is to take all numbers up to 2^15 in binary format. To understand why, suppose you want to do this with only four elements: you want all possible combinations of four 0s and 1s. Consider all the numbers up to 2^4 in binary, which you can get with this code
['{0:04b}'.format(i) for i in range(2**4)]
The code simply formats all the numbers from 0 to 2**4 - 1, generated by the range(2**4) function, in binary format with {0:04b}, which pads each binary representation to four digits. The result is the following:
['0000',
 '0001',
 '0010',
 '0011',
 '0100',
 '0101',
 '0110',
 '0111',
 '1000',
 '1001',
 '1010',
 '1011',
 '1100',
 '1101',
 '1110',
 '1111']

As you can easily verify, the list contains all possible combinations: all combinations with the 1 appearing one time ('0001', '0010', '0100', and '1000'), with the 1 appearing two times, and so on. For this example, we simply do it with 15 digits, which means using all numbers up to 2^15. The rest of the code simply transforms a string like '0100' into an array of single-element entries ([[0], [1], [0], [0]]) and then collects all these arrays, covering all the possible combinations.

If you check the dimension of the resulting array, you will notice that you get (32768, 15, 1): each observation is an array of dimensions (15, 1). Then you prepare the target variable, a one-hot encoded version of the counts. That means that if an input vector contains four 1s, the target vector will look like [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]. As expected, the train_output array will have dimensions (32768, 16).
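A minimal sketch of how train_output could be built is the following: count the 1s in each vector and one-hot encode the count into 16 classes (0 through 15).
train_output = []
for vec in train_input:
    count = int(np.sum(vec))              # number of 1s in this 15-element vector
    one_hot = np.zeros(16)                # 16 possible counts: 0, 1, ..., 15
    one_hot[count] = 1
    train_output.append(one_hot)
train_output = np.array(train_output)     # shape (32768, 16)
With the inputs and the targets ready, let's split the set into a train and a dev set, as we have done several times. We will do it here in a dumb way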
NUM_EXAMPLES = ll - 2000
test_input = train_input[NUM_EXAMPLES:]
test_output = train_output[NUM_EXAMPLES:] # the last 2,000 observations for the dev set
train_input = train_input[:NUM_EXAMPLES]
train_output = train_output[:NUM_EXAMPLES] # the first 30,768 observations for training

Remember that this works because we shuffled the vectors at the beginning, so the cases should be randomly distributed. We use 2,000 cases for the dev set and the rest (30,768) for the training set. After the split, train_input has dimensions (30768, 15, 1) and train_output has dimensions (30768, 16), while test_input has dimensions (2000, 15, 1) and test_output has dimensions (2000, 16).

Now you can build a network with this code, and you should be able to understand almost all of it by now
model = keras.Sequential()
model.add(layers.Embedding(input_dim = 15, output_dim = 15))
# Add an LSTM layer with 24 internal units.
model.add(layers.LSTM(24, input_dim = 15))
# Add a Dense layer with 16 units (one for each possible count, 0 to 15).
model.add(layers.Dense(16, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['categorical_accuracy'])
Let's train the network
# we need to convert the input and output to numpy array to be used by the network
train_input = np.array(train_input)
train_output = np.array(train_output)
test_input = np.array(test_input)
test_output = np.array(test_output)
model.fit(train_input, train_output, validation_data = (test_input, test_output), epochs = 10, batch_size = 100)
For performance reasons, and to show how efficient RNNs can be, we use an LSTM (Long Short-Term Memory) kind of neuron. LSTM neurons calculate the internal state in a special way; a discussion of how they work goes well beyond the scope of this book. For the moment, you should focus on the results and not on the code itself. If you let the code run, you will get the following result
Epoch 1/10
308/308 [==============================] - 4s 9ms/step - loss: 1.9441 - categorical_accuracy: 0.3063 - val_loss: 1.1784 - val_categorical_accuracy: 0.6840
Epoch 2/10
308/308 [==============================] - 2s 7ms/step - loss: 0.7472 - categorical_accuracy: 0.8332 - val_loss: 0.4515 - val_categorical_accuracy: 0.9270
Epoch 3/10
308/308 [==============================] - 2s 7ms/step - loss: 0.3311 - categorical_accuracy: 0.9554 - val_loss: 0.2360 - val_categorical_accuracy: 0.9630
Epoch 4/10
308/308 [==============================] - 2s 7ms/step - loss: 0.1921 - categorical_accuracy: 0.9658 - val_loss: 0.1530 - val_categorical_accuracy: 0.9675
Epoch 5/10
308/308 [==============================] - 2s 7ms/step - loss: 0.1306 - categorical_accuracy: 0.9760 - val_loss: 0.1071 - val_categorical_accuracy: 0.9775
Epoch 6/10
308/308 [==============================] - 2s 7ms/step - loss: 0.0937 - categorical_accuracy: 0.9824 - val_loss: 0.0778 - val_categorical_accuracy: 0.9870
Epoch 7/10
308/308 [==============================] - 2s 7ms/step - loss: 0.0696 - categorical_accuracy: 0.9905 - val_loss: 0.0586 - val_categorical_accuracy: 0.9930
Epoch 8/10
308/308 [==============================] - 2s 7ms/step - loss: 0.0533 - categorical_accuracy: 0.9921 - val_loss: 0.0446 - val_categorical_accuracy: 0.9945
Epoch 9/10
308/308 [==============================] - 2s 7ms/step - loss: 0.0422 - categorical_accuracy: 0.9924 - val_loss: 0.0367 - val_categorical_accuracy: 0.9960
Epoch 10/10
308/308 [==============================] - 2s 7ms/step - loss: 0.0346 - categorical_accuracy: 0.9943 - val_loss: 0.0301 - val_categorical_accuracy: 0.9955
<tensorflow.python.keras.callbacks.History at 0x7f6b7b3bd990>

After just ten epochs, the network is right in more than 99% of the cases; just let it run for more epochs to reach even higher accuracy. An instructive exercise is to try to train a fully connected network (such as the ones we have discussed so far) to count; you will see how this is not possible. A possible starting point for the exercise is sketched below.
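Here is a minimal sketch (one possible baseline, not taken from the book's code) of such a fully connected network trained on the same data; the (15, 1) vectors are simply flattened into 15 plain features:
ffn = keras.Sequential()
ffn.add(layers.Flatten(input_shape = (15, 1)))      # flatten the (15, 1) vectors into 15 plain features
ffn.add(layers.Dense(24, activation = 'relu'))      # a fully connected hidden layer
ffn.add(layers.Dense(16, activation = 'softmax'))   # 16 classes, one for each possible count
ffn.compile(loss = 'categorical_crossentropy', optimizer = 'adam', metrics = ['categorical_accuracy'])
ffn.fit(train_input, train_output, validation_data = (test_input, test_output), epochs = 10, batch_size = 100)
Compare the validation accuracy you obtain with the one reached by the LSTM network above.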

Conclusion

This chapter was a very brief description of RNNs. You should now have a basic understanding of how they work and of why they are well suited to sequential data. There is a lot more to discuss about RNNs, but that would go beyond the scope of this book, and therefore I have chosen to leave it out here. RNNs are an advanced topic and require quite a bit more know-how to master. In the next section, I list two sources, freely available on the Internet, that you can use to kick-start your RNN education.

Further Readings

If you found this chapter intriguing and would like to learn more about RNNs, there is a huge amount of material that you can find on the Internet. Here are two good sources: