Chapter 14

Sequence-to-Sequence Networks and Natural Language Translation

In Chapter 11, “Text Autocompletion with LSTM and Beam Search,” we discussed many-to-many sequence prediction problems and showed with a programming example how such a network can be used for text autocompletion. Another important sequence prediction problem is to translate text from one natural language to another. In such a setting, the input sequence is a sentence in the source language, and the predicted output sequence is the corresponding sentence in the destination language. The two sentences do not necessarily consist of the same number of words. A good English translation of the French sentence Je suis étudiant is “I am a student,” where we see that the English sentence contains one more word than its French counterpart. Another thing to note is that we want the network to consume the entire input sequence before starting to emit the output sequence, because in many cases, you need to consider the full meaning of a sentence to produce a good translation. A popular approach to handle this is to teach the network to interpret and emit START and STOP tokens as well as to ignore padding values. Both the padding value and the START and STOP tokens should be values that do not naturally appear in the text. For example, with words represented by indices that are inputs to an embedding layer, we would simply reserve specific indices for these tokens.

START tokens, STOP tokens, and padding can be used to create training examples that enable many-to-many sequences with variable lengths.

Figure 14-1 illustrates this process. The upper part of the figure shows a many-to-many network where gray represents the input, blue is the network, and green is the output. For now, ignore the ghosted (white) shapes. The network is unrolled in time from left to right. The desired behavior is that during the first four timesteps, we present the symbols for Je, suis, étudiant, START to the network. During the timestep in which the network receives the START token, it outputs the first word (I) of the translated sentence, followed by am, a, student, and STOP during the subsequent timesteps. Let us now consider the white shapes. As previously noted, it is impossible for the network to not output a value, and similarly, the network will always get some kind of input for every timestep. This applies to the first three timesteps for the output and the last four timesteps for the input. A simple solution would be to use our padding value on both the output and the input for these timesteps. However, it turns out that a better solution is to help the network by feeding the output from the previous timestep back as input to the next timestep, just as we did in the neural language models in previous chapters. This is what is shown in Figure 14-1.

Figure 14-1 Neural machine translation is an example of a many-to-many sequence where the input and output sequences are not necessarily of the same length.

To make this abundantly clear, the lower part of the figure shows the corresponding training example without the network. That is, during training, the network will see both the source and the destination sequences on its input and be trained to predict the destination sequence on its output. Predicting the destination sequence as output might not seem that hard given that the destination sequence is also presented as input. However, they are skewed in time, so the network needs to predict the next word in the destination sequence before it has seen it. When we later use the network to produce translations, we do not have the destination sequence. We start by feeding the source sequence to the network, followed by the START token, and then feed back its output prediction as input to the next timestep until the network produces a STOP token. At that point, we have produced the full translated sentence.

Encoder-Decoder Model for Sequence-to-Sequence Learning

How does the model that we just described relate to the neural language models studied in previous chapters? Let us consider our translation network at the timestep when the START token is presented at its input. The only difference between this network and the neural language model networks is its initial accumulated state. In our language model, we started with 0 as internal state and presented one or more words on the input. Then the network completed the sentence. Our translation network starts with an accumulated state from seeing the source sequence, is then presented with a single START symbol, and then completes the sentence in the destination language. That is, during the second half of the translation process, the network simply acts like a neural language model in the destination language. It turns out that the internal state is all that the network needs to produce the right sentence. We can think of the internal state as a language-independent representation of the overall meaning of the sentence. Sometimes this internal state is referred to as the context or a thought vector.

Now let us consider the first half of the translation process. The goal of this phase is to consume the source sentence and build up this language-independent representation of the meaning of the sentence. Apart from being a somewhat different task than generating a sentence, it is also working with a different language/vocabulary than the second phase of the translation process. A reasonable question, then, is whether both phases should be handled by the same neural network or if it is better to have two specialized networks. The first network would be specialized in encoding the source sentence into the internal state, and the second network would be specialized in decoding the internal state into a destination sentence. Such an architecture is known as an encoder-decoder architecture, and one example is illustrated in Figure 14-2. The network is not unrolled in time. The network layers in the encoder are distinct from the network layers in the decoder. The horizontal arrow represents reading out the internal states of the recurrent layers in the encoder and initializing the internal states of the recurrent layers in the decoder. Thus, the assumption in the figure is that both networks contain the same number of hidden recurrent layers of the same size and type. In our programming example, we implement this model with two hidden recurrent layers in both networks, each consisting of 256 long short-term memory (LSTM) units.

Figure 14-2 Encoder-decoder model for language translation

In an encoder-decoder architecture, the encoder creates an internal state known as context or thought vector, which is a language-independent representation of the meaning of the sentence.

Figure 14-2 shows just one example of an encoder-decoder model. Given how we evolved from a single RNN to this encoder-decoder network, it might not be that odd that the communication channel between the two networks is to transfer the internal state from one network to another. However, we should also recognize that the statement “Discarded output” is a little misleading in the figure. The internal state of an LSTM layer consists of the cell state (often denoted by c) and the recurrent layer hidden state (often denoted by h), where h is identical to the output of the layer. Similarly, if we had used a gated recurrent unit (GRU) instead of LSTM, there would not be a cell state, and the internal state of the network would be simply the recurrent layer hidden state, which again is identical to the output of the recurrent layer. Still, we chose to call it discarded output because that term is commonly found in other descriptions.

One can envision other ways of connecting the encoder and the decoder. For example, we could feed the state/output as a regular input to the decoder just during the first timestep, or we could give the decoder network access to it during each timestep. Or, in the case of an encoder with multiple layers, we could choose to just present the state/output from the topmost layer as inputs to the bottommost decoder layer. It is also worth noting that encoder-decoder models are not limited to working with sequences. We can construct other combinations, such as cases where only one of the encoder or decoder, or neither of them, has recurrent layers. We discuss more details about this in the next couple of chapters, but at this point, we move on to implementing our neural machine translator (NMT) in Keras.

Encoder-decoder architectures can be built in many different ways. Different network types can be used for the encoder and decoder, and the connection between the two can also be done in multiple ways.

Introduction to the Keras Functional API

It is not obvious how to implement the described architecture using the constructs that we have used in the Keras API so far. To implement this architecture, we need to use the Keras Functional API, which is specifically created to enable creation of complex models. There is a key difference compared to the sequential models that we have used so far. Instead of just declaring a layer, adding it to the model, and letting Keras automatically connect the layers in a sequential manner, we now need to explicitly describe how the layers are connected to each other. This process is more complex and error-prone than letting Keras do it for us, but the benefit is the increased flexibility that enables us to describe a more complex model.

Keras Functional API is more flexible than the Sequential API and can therefore be used to build more complex network architectures.

We use the example models in Figure 14-3 to illustrate how to use the Keras Functional API. The model to the left is a simple sequential model that could easily have been implemented with the Sequential API, but the model to the right has an input that bypasses the first layer and therefore needs to use the Functional API.

Figure 14-3 Two simple models. The left one is straightforward to implement with the Sequential API, but the right one requires the Functional API.

The implementation of the left model is shown in Code Snippet 14-1. We start by declaring an Input object. This is different from the Sequential API, where the input layer was implicitly created when the first layer was created. We then declare the two fully connected layers in the model. Once this is done, it is time to connect the layers by using the assigned variable name as a function and passing it its inputs as an argument. The function returns an object representing the outputs of the layer, which can then be used as input argument when connecting the next layer.

Code Snippet 14-1 Example of How to Implement a Simple Sequential Model Using the Functional API

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Declare inputs.
inputs = Input(shape=(10,))

# Declare layers.
layer1 = Dense(64, activation='relu')
layer2 = Dense(64, activation='relu')

# Connect inputs and layers.
layer1_outputs = layer1(inputs)
layer2_outputs = layer2(layer1_outputs)

# Create model.
model = Model(inputs=inputs, outputs=layer2_outputs)
model.summary()

Now that we have declared and connected layers to each other, we are ready to create the model. This is done by simply calling the Model() constructor and providing arguments informing the model what its inputs and outputs should be.

Creating the more complex model with a bypass path from the input to the second layer is shown in Code Snippet 14-2. There are just a few minor changes compared to the previous example. First, we declare two sets of inputs. One is the input to the first layer, and the other is the bypass input that will go straight to the second layer. Next, we declare a Concatenate layer, which is used to concatenate the outputs from the first layer with the bypass input to form a single variable that can be provided as input to the second layer. Finally, when declaring the model, we need to tell it that its inputs now consist of a list of two inputs.

Code Snippet 14-2 Keras Implementation of a Network with a Bypass Path

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Concatenate

# Declare inputs.
inputs = Input(shape=(10,))
bypass_inputs = Input(shape=(5,))

# Declare layers.
layer1 = Dense(64, activation='relu')
concat_layer = Concatenate()
layer2 = Dense(64, activation='relu')

# Connect inputs and layers.
layer1_outputs = layer1(inputs)
layer2_inputs = concat_layer([layer1_outputs, bypass_inputs])
layer2_outputs = layer2(layer2_inputs)

# Create model.
model = Model(inputs=[inputs, bypass_inputs],
              outputs=layer2_outputs)
model.summary()

After this brief introduction to the Keras Functional API, we are ready to move on to implementing our neural machine translation network.

Programming Example: Neural Machine Translation

As usual, we begin by importing modules that we need for the program. This is shown in Code Snippet 14-3.

Code Snippet 14-3 Import Statements

import numpy as np
import random
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.text \
    import text_to_word_sequence
from tensorflow.keras.preprocessing.sequence \
    import pad_sequences
import tensorflow as tf
import logging
tf.get_logger().setLevel(logging.ERROR)

Next, we define some constants in Code Snippet 14-4. We specify a vocabulary size of 10,000 symbols, out of which four indices are reserved for padding, out-of-vocabulary words (denoted as UNK), START tokens, and STOP tokens. Our training corpus is large, so we set the parameter READ_LINES to the number of lines in the input file we want to use in our example (60,000). Our layers consist of 256 units (LAYER_SIZE), and the embedding layers output 128 dimensions (EMBEDDING_WIDTH). We use 20% (TEST_PERCENT) of the dataset as test set and further select 20 sentences (SAMPLE_SIZE) to inspect in detail during training. We limit the length of the source and destination sentences to, at most, 60 words (MAX_LENGTH). Finally, we provide the path to the data file, where each line is expected to contain two versions of the same sentence (one in each language) separated by a tab character.

Code Snippet 14-4 Definition of Constants

# Constants
EPOCHS = 20
BATCH_SIZE = 128
MAX_WORDS = 10000
READ_LINES = 60000
LAYER_SIZE = 256
EMBEDDING_WIDTH = 128
TEST_PERCENT = 0.2
SAMPLE_SIZE = 20
OOV_WORD = 'UNK'
PAD_INDEX = 0
OOV_INDEX = 1
START_INDEX = MAX_WORDS - 2
STOP_INDEX = MAX_WORDS - 1
MAX_LENGTH = 60
SRC_DEST_FILE_NAME = '../data/fra.txt'

Code Snippet 14-5 shows the function used to read the input data file and do some initial processing. Each line is split into two strings, where the first contains the sentence in the destination language and the second contains the sentence in the source language. We use the function text_to_word_sequence() to clean the data somewhat (make everything lowercase and remove punctuation) and split each sentence into a list of individual words. If the list (sentence) is longer than the maximum allowed length, then it is truncated.

Code Snippet 14-5 Function to Read Input File and Create Source and Destination Word Sequences

# Function to read file.
def read_file_combined(file_name, max_len):
    file = open(file_name, 'r', encoding='utf-8')
    src_word_sequences = []
    dest_word_sequences = []
    for i, line in enumerate(file):
        if i == READ_LINES:
            break
        pair = line.split('\t')
        word_sequence = text_to_word_sequence(pair[1])
        src_word_sequence = word_sequence[0:max_len]
        src_word_sequences.append(src_word_sequence)
        word_sequence = text_to_word_sequence(pair[0])
        dest_word_sequence = word_sequence[0:max_len]
        dest_word_sequences.append(dest_word_sequence)
    file.close()
    return src_word_sequences, dest_word_sequences

Code Snippet 14-6 shows functions used to turn sequences of words into sequences of tokens, and vice versa. We call tokenize() a single time for each language, so the argument sequences is a list of lists where each of the inner lists represents a sentence. The Tokenizer class assigns indices to the most common words and returns either these indices or the reserved OOV_INDEX for less common words that did not make it into the vocabulary. We tell the Tokenizer to use a vocabulary of 9998 (MAX_WORDS-2)—that is, use only indices 0 to 9997, so that we can use indices 9998 and 9999 as our START and STOP tokens (the Tokenizer does not support the notion of START and STOP tokens but does reserve index 0 to use as a padding token and index 1 for out-of-vocabulary words). Our tokenize() function returns both the tokenized sequence and the Tokenizer object itself. This object will be needed anytime we want to convert tokens back into words.

Code Snippet 14-6 Functions to Turn Word Sequences into Tokens, and Vice Versa

# Functions to tokenize and un-tokenize sequences.
def tokenize(sequences):
    # "MAX_WORDS-2" used to reserve two indices
    # for START and STOP.
    tokenizer = Tokenizer(num_words=MAX_WORDS-2,
                          oov_token=OOV_WORD)
    tokenizer.fit_on_texts(sequences)
    token_sequences = tokenizer.texts_to_sequences(sequences)
    return tokenizer, token_sequences

def tokens_to_words(tokenizer, seq):
    word_seq = []
    for index in seq:
        if index == PAD_INDEX:
            word_seq.append('PAD')
        elif index == OOV_INDEX:
            word_seq.append(OOV_WORD)
        elif index == START_INDEX:
            word_seq.append('START')
        elif index == STOP_INDEX:
            word_seq.append('STOP')
        else:
            word_seq.append(tokenizer.sequences_to_texts(
                [[index]])[0])
    print(word_seq)

The function tokens_to_words() requires a Tokenizer and a list of indices. We simply check for the reserved indices: If we find a match, we replace them with hardcoded strings, and if we find no match, we let the Tokenizer convert the index to the corresponding word string. The Tokenizer expects a list of lists of indices and returns a list of strings, which is why we need to call it with [[index]] and then select the 0th element to arrive at a string.
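
To see this quirk in action, here is a small hedged sketch (not part of the chapter's program; the toy corpus and variable names are made up) that fits a Tokenizer on two short word sequences and converts a single index back to a word string:

from tensorflow.keras.preprocessing.text import Tokenizer

# Toy corpus, for illustration only.
toy_tokenizer = Tokenizer(num_words=100, oov_token='UNK')
toy_tokenizer.fit_on_texts([['i', 'am', 'a', 'student'],
                            ['i', 'am', 'happy']])
# texts_to_sequences expects a list of sequences.
index = toy_tokenizer.texts_to_sequences([['student']])[0][0]
# sequences_to_texts also expects a list of sequences and returns
# a list of strings, hence the [[index]] and the trailing [0].
print(toy_tokenizer.sequences_to_texts([[index]])[0])  # 'student'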

Now, given that we have these helper functions, it is trivial to read the input data file and convert it into tokenized sequences. This is done in Code Snippet 14-7.

Code Snippet 14-7 Read and Tokenize the Input File

# Read file and tokenize.
src_seq, dest_seq = read_file_combined(SRC_DEST_FILE_NAME,
                                       MAX_LENGTH)
src_tokenizer, src_token_seq = tokenize(src_seq)
dest_tokenizer, dest_token_seq = tokenize(dest_seq)

It is now time to arrange the data into tensors that can be used for training and testing. In Figure 14-1, we indicated that we need to pad the start of the output sequence with as many PAD symbols as there are words in the input sequence, but that was when we envisioned a single neural network. Now that we have broken up the network into an encoder and a decoder, this is no longer necessary because we will simply not input anything to the decoder until we have run the full input through the encoder. Following is a more accurate example of what we need as input and output for a single training example, where src_input is the input to the encoder network, dest_input is the input to the decoder network, and dest_target is the desired output from the decoder network:

src_input = [PAD, PAD, PAD, id(“je”), id(“suis”),
id(“étudiant”)]

dest_input = [START, id(“i”), id(“am”), id(“a”),
id(“student”), STOP, PAD, PAD]

dest_target = [one_hot_id(“i”), one_hot_id(“am”), one_hot_
id(“a”), one_hot_id(“student”), one_hot_id(STOP), one_hot_
id(PAD), one_hot_id(PAD), one_hot_id(PAD)]

In the example, id(string) refers to the tokenized index of the string, and one_hot_id is the one-hot encoded version of the index. We have assumed that the longest source sentence is six words, so we padded src_input to be of that length. Similarly, we have assumed that the longest destination sentence is eight words including START and STOP tokens, so we padded both dest_input and dest_target to be of that length. Note how the symbols in dest_input are offset by one location compared to the symbols in dest_target because when we later do inference, the inputs into the decoder network will be coming from the output of the network for the previous timestep. Although this example has shown the training example as being lists, in reality, they will be rows in NumPy arrays, where each array contains multiple training examples.

The padding is done to ensure that we can use mini-batches for training. That is, all source sentences need to be the same length, and all destination sentences need to be the same length. We pad the source input at the beginning (known as prepadding) and the destination at the end (known as postpadding), which is nonobvious. We previously stated that when using padding, the model can learn to ignore the padded values, but there is also a mechanism in Keras to mask out padded values. Based on these two statements, it seems like it should not matter whether the padding is at the beginning or end. However, as always, things are not as simple as they might appear. If we rely on the model learning to ignore the padded values, it will not learn to do so perfectly. The ease with which it learns to ignore padding values might depend on how the data is arranged. It is not hard to imagine that inputting a considerable number of zeros at the end of a sequence will dilute the input and affect the internal state of the network. From that perspective, it makes sense to pad the input values with zeros in the beginning of the sequence instead. Similarly, in a sequence-to-sequence network, if the encoder has created an internal state that is transferred to the decoder, diluting this state by presenting a number of zeros before the START token also seems like it could be bad.

This reasoning supports the chosen padding (prepadding of the source input and postpadding of the destination input) in a case where the network needs to learn to ignore the padded values. However, given that we will use the mask_zero=True parameter for our embedding layers, it should not matter what type of padding we use. It turns out that the behavior of mask_zero is not what we had expected when using it for our custom encoder-decoder network. We observed that the network learned poorly when we used postpadding for the source input. We do not know the exact reason for this but suspect that there is some interaction where the masked input values to the encoder somehow cause the decoder to ignore the beginning of the output sequences.1

1. This is just a theory, and the behavior could be something else. Further, it is unclear to us whether it is due to a bug or an expected but undocumented behavior. Regardless, when using the suggested padding, we do not see the problem.

Padding can be done in the beginning or end of the sequence. This is known as prepadding and postpadding.
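
To make the difference concrete, the following small sketch (an illustration with made-up sequences, not part of the chapter's program) shows where pad_sequences() places the padding value in the two cases:

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[11, 12, 13], [21, 22]]
print(pad_sequences(seqs))                  # default is prepadding:
                                            # [[11 12 13]
                                            #  [ 0 21 22]]
print(pad_sequences(seqs, padding='post'))  # postpadding:
                                            # [[11 12 13]
                                            #  [21 22  0]]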

Code Snippet 14-8 shows a compact way of creating the three arrays that we need. The first two lines create two new lists, both containing the destination sequences: the first (dest_target_token_seq) is augmented with a STOP_INDEX after each sequence, and the second (dest_input_token_seq) is augmented with both a START_INDEX and a STOP_INDEX. It is easy to miss that dest_input_token_seq has a STOP_INDEX, but that falls out naturally because it is created from dest_target_token_seq, to which a STOP_INDEX was just appended for each sentence.

Code Snippet 14-8 Compact Version of Code to Convert the Tokenized Sequences into NumPy Arrays

# Prepare training data.
dest_target_token_seq = [x + [STOP_INDEX] for x in dest_token_seq]
dest_input_token_seq = [[START_INDEX] + x for x in
                        dest_target_token_seq]
src_input_data = pad_sequences(src_token_seq)
dest_input_data = pad_sequences(dest_input_token_seq,
                                padding='post')
dest_target_data = pad_sequences(
    dest_target_token_seq, padding='post', maxlen
    = len(dest_input_data[0]))

Next, we call pad_sequences() on both the original src_input_data list (of lists) and on these two new destination lists. The pad_sequences() function pads the sequences with the PAD value and then returns a NumPy array. The default behavior of pad_sequences() is to do prepadding, and we do that for the source sequences but explicitly ask for postpadding for the destination sequences. You might wonder why there is no call to to_categorical() in the statement that creates the target (output) data. For textual data, we are used to one-hot encoding the ground truth, but skipping that step here is an optimization to avoid wasting too much memory. With a vocabulary of 10,000 words and 60,000 training examples, where each training example is a sentence, the memory footprint of the one-hot encoded data starts becoming a problem. Therefore, instead of one-hot encoding all data up front, there is a way to let Keras deal with that in the loss function itself.
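
The following sketch (for illustration only; the probability values are made up) shows that sparse_categorical_crossentropy applied to integer targets computes the same per-example loss as categorical_crossentropy applied to the one-hot encoded equivalents:

import numpy as np
import tensorflow as tf

probs = np.array([[0.1, 0.7, 0.2],
                  [0.3, 0.3, 0.4]])   # predicted distributions
int_targets = np.array([1, 2])        # integer class indices
one_hot_targets = tf.one_hot(int_targets, depth=3)

print(tf.keras.losses.sparse_categorical_crossentropy(
    int_targets, probs).numpy())      # approx. [0.357 0.916]
print(tf.keras.losses.categorical_crossentropy(
    one_hot_targets, probs).numpy())  # same values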

Before we build our model, Code Snippet 14-9 demonstrates how we can manually split our dataset into a training dataset and a test dataset. In previous examples, we either relied on datasets that are already split this way or we used functionality inside of Keras when calling the fit() function. However, in this case, we want some more control ourselves because we will want to inspect a few select members of the test set in detail. We split the dataset by first creating a list test_indices, which contains a 20% (TEST_PERCENT) subset of all the numbers from 0 to N–1, where N is the size of our original dataset. We then create a list train_indices, which contains the remaining 80%. We can now use these lists to select a number of rows in the matrices representing the dataset and create two new collections of matrices, one to be used as training set and one to be used as test set. Finally, we create a third collection of matrices, which only contains 20 (SAMPLE_SIZE) random examples from the test dataset. We will use them to inspect the resulting translations in detail, but since that is a manual process, we limit ourselves to a small number of sentences.

Code Snippet 14-9 Manually Splitting the Dataset into a Training Set and a Test Set

# Split into training and test set.
rows = len(src_input_data[:,0])
all_indices = list(range(rows))
test_rows = int(rows * TEST_PERCENT)
test_indices = random.sample(all_indices, test_rows)
train_indices = [x for x in all_indices if x not in test_indices]

train_src_input_data = src_input_data[train_indices]
train_dest_input_data = dest_input_data[train_indices]
train_dest_target_data = dest_target_data[train_indices]

test_src_input_data = src_input_data[test_indices]
test_dest_input_data = dest_input_data[test_indices]
test_dest_target_data = dest_target_data[test_indices]

# Create a sample of the test set that we will inspect in detail.
test_indices = list(range(test_rows))
sample_indices = random.sample(test_indices, SAMPLE_SIZE)
sample_input_data = test_src_input_data[sample_indices]
sample_target_data = test_dest_target_data[sample_indices]

As usual, we have now spent a whole lot of code just preparing the data, but we are finally ready to build our model. This time, building the model will be more exciting than in the past because we are now building a less trivial model and will make use of the Keras Functional API.

Before going over the code, we revisit the architecture of the model that we intend to build. The network consists of an encoder part and a decoder part. We define these as two separate models, which we later tie together. The two models are illustrated in Figure 14-4. The upper part of the figure shows the encoder, which consists of an embedding layer and two LSTM layers. The lower part of the figure shows the decoder, which consists of an embedding layer, two LSTM layers, and a fully connected softmax layer. The names in the figure correspond to the variable names that we use in our implementation.

Figure 14-4 Topology of the encoder and decoder models

Apart from the layer names, the figure also contains names of the outputs of all layers, which will be used in the code when connecting layers. Four noteworthy outputs (illustrated as two sets of outputs) are the state outputs from the two encoder LSTM layers. These are used as inputs into the decoder LSTM layers to communicate the accumulated state from the encoder to the decoder.

Code Snippet 14-10 contains the implementation of the encoder model. It should be straightforward to map the code to Figure 14-4, but there are a few things worth pointing out. Because we are now interested in accessing the internal state of the LSTM layers, we need to provide the argument return_state=True. This argument instructs the LSTM object to return not only a variable representing the layer’s output but also variables representing the c and h states. Further, as previously described, for a recurrent layer that feeds another recurrent layer, we need to provide the argument return_sequences=True so that the subsequent layer sees the outputs of each timestep. This is also true for the final recurrent layer if we want the network to produce an output during each timestep. For our encoder, we are only interested in the final state, so we do not set return_sequences to True for enc_layer2.

Code Snippet 14-10 Implementation of Encoder Model

# Build encoder model.
# Input is input sequence in source language.
enc_embedding_input = Input(shape=(None, ))
# Create the encoder layers.
enc_embedding_layer = Embedding(
    output_dim=EMBEDDING_WIDTH, input_dim
    = MAX_WORDS, mask_zero=True)
enc_layer1 = LSTM(LAYER_SIZE, return_state=True,
                  return_sequences=True)
enc_layer2 = LSTM(LAYER_SIZE, return_state=True)
# Connect the encoder layers.
# We don't use the last layer output, only the state.
enc_embedding_layer_outputs = \
    enc_embedding_layer(enc_embedding_input)
enc_layer1_outputs, enc_layer1_state_h, enc_layer1_state_c = \
    enc_layer1(enc_embedding_layer_outputs)
_, enc_layer2_state_h, enc_layer2_state_c = \
    enc_layer2(enc_layer1_outputs)

# Build the model.
enc_model = Model(enc_embedding_input,
                  [enc_layer1_state_h, enc_layer1_state_c,
                   enc_layer2_state_h, enc_layer2_state_c])
enc_model.summary()

Once all layers are connected, we create the actual model by calling the Model() constructor and providing arguments to specify what inputs and outputs will be external to the model. The model takes the source sentence as input and produces the internal states of the two LSTM layers as outputs. Each LSTM layer has both an h state and c state, so in total, the model will output four state variables as output. Each state variable is in itself a tensor consisting of multiple values.
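
As a quick sanity check (not part of the chapter's program; the dummy token indices are made up), something like the following can be used to confirm that the encoder returns four state tensors, each of width LAYER_SIZE:

# Run a dummy batch of two tokenized sentences through the encoder.
dummy_batch = np.array([[0, 0, 5, 6, 7],
                        [0, 9, 10, 11, 12]])
states = enc_model.predict(dummy_batch, verbose=0)
for state in states:
    print(state.shape)  # each prints (2, 256), i.e., (batch, LAYER_SIZE)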

Code Snippet 14-11 shows the implementation of the decoder model. In addition to the sentence in the destination language, it takes the output state from the encoder model as inputs. We initialize the decoder LSTM layers (using the argument initial_state) with this state at the first timestep.

Code Snippet 14-11 Implementation of Decoder Model

# Build decoder model.
# Input to the network is input sequence in destination
# language and intermediate state.
dec_layer1_state_input_h = Input(shape=(LAYER_SIZE,))
dec_layer1_state_input_c = Input(shape=(LAYER_SIZE,))
dec_layer2_state_input_h = Input(shape=(LAYER_SIZE,))
dec_layer2_state_input_c = Input(shape=(LAYER_SIZE,))
dec_embedding_input = Input(shape=(None, ))

# Create the decoder layers.
dec_embedding_layer = Embedding(output_dim=EMBEDDING_WIDTH,
                                input_dim=MAX_WORDS,
                                mask_zero=True)
dec_layer1 = LSTM(LAYER_SIZE, return_state = True,
                  return_sequences=True)
dec_layer2 = LSTM(LAYER_SIZE, return_state = True,
                  return_sequences=True)
dec_layer3 = Dense(MAX_WORDS, activation='softmax')

# Connect the decoder layers.
dec_embedding_layer_outputs = dec_embedding_layer(
    dec_embedding_input)
dec_layer1_outputs, dec_layer1_state_h, dec_layer1_state_c = \
    dec_layer1(dec_embedding_layer_outputs,
    initial_state=[dec_layer1_state_input_h,
                   dec_layer1_state_input_c])
dec_layer2_outputs, dec_layer2_state_h, dec_layer2_state_c = \
    dec_layer2(dec_layer1_outputs,
    initial_state=[dec_layer2_state_input_h,
                   dec_layer2_state_input_c])
dec_layer3_outputs = dec_layer3(dec_layer2_outputs)

# Build the model.
dec_model = Model([dec_embedding_input,
                   dec_layer1_state_input_h,
                   dec_layer1_state_input_c,
                   dec_layer2_state_input_h,
                   dec_layer2_state_input_c],
                  [dec_layer3_outputs, dec_layer1_state_h,
                   dec_layer1_state_c, dec_layer2_state_h,
                   dec_layer2_state_c])
dec_model.summary()

For the decoder, we do want the top LSTM layer to produce an output for each timestep (the decoder should create a full sentence and not just a final state), so we set return_sequences=True for both LSTM layers.

We create the model by calling the Model() constructor. The inputs consist of the destination sentence (time shifted by one timestep) and initial state for the LSTM layers. As we soon will see, when using the model for inference, we need to explicitly manage the internal state for the decoder. Therefore, we declare the states as outputs of the model in addition to the softmax output.

We are now ready to connect the two models to build a full encoder-decoder network corresponding to what is shown in Figure 14-5. The corresponding TensorFlow implementation is shown in Code Snippet 14-12.

Figure 14-5 Architecture of full encoder-decoder model

Code Snippet 14-12 Code to Define, Build, and Compile the Model Used for Training

# Build and compile full training model.
# We do not use the state output when training.
train_enc_embedding_input = Input(shape=(None, ))
train_dec_embedding_input = Input(shape=(None, ))
intermediate_state = enc_model(train_enc_embedding_input)
train_dec_output, _, _, _, _ = dec_model(
    [train_dec_embedding_input] +
    intermediate_state)
training_model = Model([train_enc_embedding_input,
                        train_dec_embedding_input],
                        train_dec_output)
optimizer = RMSprop(learning_rate=0.01)
training_model.compile(loss='sparse_categorical_crossentropy',
                       optimizer=optimizer, metrics =['accuracy'])
training_model.summary()

One thing that looks odd is that, as we described previously, we provide the argument return_state=True when creating the decoder LSTM layers, but then when we create this model, we discard the state outputs. It seems reasonable to not have set the return_state=True argument to begin with. The reason will be apparent when we describe how to use the encoder and decoder models for inference.

We decided to use RMSProp as the optimizer because some experiments indicate that it performs better than Adam for this specific model. We use sparse_categorical_crossentropy instead of the normal categorical_crossentropy as the loss function. This is the loss function to use in Keras if the categorical output data is not already one-hot encoded. As described earlier, we avoided one-hot encoding the data up front to reduce the memory footprint of the application.

Although we just connected the encoder and decoder models to form a joint model, they can both still be used in isolation. Note that the encoder and decoder models used by the joint model are the same instances as the individual models. That is, if we train the joint model, it will update the weights of the two individual models. This is useful because, when we do inference, we want an encoder model that is decoupled from the decoder model.
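
If you want to convince yourself of this weight sharing, a small hedged check along the following lines can be used (an illustration only, not part of the chapter's program):

# The encoder and decoder models appear as layers inside
# training_model, and they are the very same objects as
# enc_model and dec_model.
print(enc_model in training_model.layers)  # expected to print True
print(dec_model in training_model.layers)  # expected to print True
# Consequently, weights updated by training_model.fit() are
# immediately visible through enc_model and dec_model.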

During inference, we first run the source sentence through the encoder model to create the internal state. This state is then provided as initial state to the decoder model during the first timestep. At this timestep, we also feed the START token to the embedding layer of the model. This results in the model producing the first word in the translated sentence as its output. It also produces outputs representing the internal state of the two LSTM layers. In the next timestep, we feed the model with the predicted output as well as the internal state from the previous timestep (we explicitly manage the internal state) in an autoregressive manner.

Instead of explicitly managing the state, we could have declared the layers as stateful=True, as we did in our text autocompletion example, but that would complicate the training process. We cannot have stateful=True during training if we do not want multiple subsequent training examples to affect each other.

Finally, the reason that we do not need to explicitly manage state during training is that we fed the entire sentence at once to the model, in which case TensorFlow automatically feeds the state from the last timestep back to be used as the current state for the next timestep.

This whole discussion may seem unclear until you get more familiar with Keras, but the short of it is that there are many ways of doing the same thing and each method has its own benefits and drawbacks.

When declaring a recurrent layer in Keras, there are three arguments: return_state, return_sequences, and stateful. At first, it can be tricky to tell them apart because of their similar names. If you want to build your own complicated networks, it is well worth spending some time to fully understand what they do and how they interact with each other.
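
A minimal sketch of what return_sequences and return_state control is shown below (the layer sizes are made up for this illustration; stateful is omitted because it mainly affects how state is carried between batches):

from tensorflow.keras.layers import Input, LSTM

x = Input(shape=(7, 16))  # 7 timesteps, 16 features per timestep
y = LSTM(4)(x)                                # (None, 4): last output only
y_seq = LSTM(4, return_sequences=True)(x)     # (None, 7, 4): one output per timestep
y_last, h, c = LSTM(4, return_state=True)(x)  # y_last and h are identical;
                                              # c is the cell state
print(y.shape, y_seq.shape, y_last.shape, h.shape, c.shape)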

We are now ready to train and test the model, which is shown in Code Snippet 14-13. We take a slightly different approach than in previous examples. In previous examples, we instructed fit() to train for multiple epochs, and then we studied the results and ended our program. In this example, we create our own training loop where we instruct fit() to train for only a single epoch at a time. We then use our model to create some predictions before going back and training for another epoch. This approach enables some detailed evaluation of just a small set of samples after each epoch. We could have done this by providing a callback function as an argument to the fit function, but we figured that it was unnecessary to introduce yet another Keras construct at this point.

Code Snippet 14-13 Training and Testing the Model

# Train and test repeatedly.
for i in range(EPOCHS):
    print('step: ' , i)
    # Train model for one epoch.
    history = training_model.fit(
        [train_src_input_data, train_dest_input_data],
        train_dest_target_data, validation_data=(
            [test_src_input_data, test_dest_input_data],
            test_dest_target_data), batch_size=BATCH_SIZE,
        epochs=1)

    # Loop through samples to see result
    for (test_input, test_target) in zip(sample_input_data,
                                         sample_target_data):
        # Run a single sentence through encoder model.
        x = np.reshape(test_input, (1, -1))
        last_states = enc_model.predict(
            x, verbose=0)
        # Provide resulting state and START_INDEX as input
        # to decoder model.
        prev_word_index = START_INDEX
        produced_string = ''
        pred_seq = []
        for j in range(MAX_LENGTH):
            x = np.reshape(np.array(prev_word_index), (1, 1))
            # Predict next word and capture internal state.
            preds, dec_layer1_state_h, dec_layer1_state_c, \
                dec_layer2_state_h, dec_layer2_state_c = \
                    dec_model.predict(
                        [x] + last_states, verbose=0)
            last_states = [dec_layer1_state_h,
                           dec_layer1_state_c,
                           dec_layer2_state_h,
                           dec_layer2_state_c]
            # Find the most probable word.
            prev_word_index = np.asarray(preds[0][0]).argmax()
            pred_seq.append(prev_word_index)
            if prev_word_index == STOP_INDEX:
                break
        tokens_to_words(src_tokenizer, test_input)
        tokens_to_words(dest_tokenizer, test_target)
        tokens_to_words(dest_tokenizer, pred_seq)
        print('\n\n')

Keras callback functions are a good topic for further reading if you want to customize the behavior of the training process (keras.io).
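
If you do want to go the callback route, a minimal sketch could look like the following (the helper name inspect_samples is made up for this illustration; it would contain the same sample translation loop as in Code Snippet 14-13):

from tensorflow.keras.callbacks import LambdaCallback

def inspect_samples(epoch, logs):
    # Hypothetical helper; would run the sample translation loop here.
    print('finished epoch', epoch, 'val_loss:', logs['val_loss'])

sample_callback = LambdaCallback(on_epoch_end=inspect_samples)
# training_model.fit(..., epochs=EPOCHS, callbacks=[sample_callback])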

Most of the code sequence is the loop used to create translations for the smaller set of samples that we created from the test dataset. This piece of code consists of a loop that iterates over all the examples in sample_input_data. We provide the source sentence to the encoder model to create the resulting internal state and store it in the variable last_states. We also initialize the variable prev_word_index with the index corresponding to the START symbol. We then enter the innermost loop and predict a single word using the decoder model. We also read out the internal state. This data is then used as input to the decoder model in the next iteration, and we iterate until the model produces a STOP token or until a given number of words have been produced. Finally, we convert the produced tokenized sequences into the corresponding word sequences and print them out.

Experimental Results

Training the network for 20 epochs resulted in high accuracy metrics for both training and test data. Accuracy is not necessarily the most meaningful metric to use when working on machine translation, but it still gives us some indication that our translation network works. More interesting is to inspect the resulting translations for our sample set.

The first example is shown here:

['PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD',
'PAD', "j'ai", 'travaillé', 'ce', 'matin']
['i', 'worked', 'this', 'morning', 'STOP', 'PAD', 'PAD', 'PAD',
'PAD', 'PAD']
['i', 'worked', 'this', 'morning', 'STOP']

The first line shows the input sentence in French. The second line shows the corresponding training target, and the third line shows the prediction from our trained model. That is, for this example, the model predicted the translation exactly right!

Additional examples are shown in Table 14-1, where we have stripped out the padding and STOP tokens as well as removed characters associated with printing out the Python lists. When looking at the first two examples, it should be clear why we said that accuracy is not necessarily a good metric. The prediction is not identical to the training target, so the accuracy would be low. Still, it is hard to argue that the translations are wrong, given that the predictions express the same meaning as the targets. To address this, a metric known as BiLingual Evaluation Understudy (BLEU) score is used within the machine translation community (Papineni et al., 2002). We do not use or discuss that metric further, but it is certainly something to learn about if you want to dive deeper into machine translation. For now, we just recognize that there can be multiple correct translations to a single sentence.

Table 14-1 Examples of Translations Produced by the Model

SOURCE                                   | TARGET                    | PREDICTION
je déteste manger seule                  | i hate eating alone       | i hate to eat alone
je n’ai pas le choix                     | i don’t have a choice     | i have no choice
je pense que tu devrais le faire         | i think you should do it  | i think you should do it
tu habites où                            | where do you live         | where do you live
nous partons maintenant                  | we’re leaving now         | we’re leaving now
j’ai pensé que nous pouvions le faire    | i thought we could do it  | i thought we could do it
je ne fais pas beaucoup tout ça          | i don’t do all that much  | i’m not busy at all
il a été élu roi du bal de fin d’année   | he was voted prom king    | he used to negotiate and look like golfer

BLEU score can be used to judge how well a machine translation system works (Papineni et al., 2002). Learning the details of how it is computed makes sense if you want to dive deeper into machine translation.
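
As an illustration only (using the separate NLTK library rather than anything from this chapter's program, and with a smoothing function chosen arbitrarily), a sentence-level BLEU score for the first row of Table 14-1 could be computed along these lines:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [['i', 'hate', 'eating', 'alone']]  # list of reference translations
hypothesis = ['i', 'hate', 'to', 'eat', 'alone']
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)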

Looking at the third through sixth rows, it almost seems too good to be true. The translations are identical to the expected translations. Is it possible for the model to be that good? Inspecting the training data gives us a clue about what is going on. It turns out that the dataset contains many minor variations of a single sentence in the source language, and all these sentences are translated to the same sentence in the destination language. Thus, the model is trained on a specific source/target sentence pair and is later presented with a slightly different source sentence. It is not all that unexpected that the model then predicts exactly the same target sentence that it was trained on, so we might view this as cheating. On the other hand, we do want to train the model to recognize similarities and be able to generalize, so it is not completely obvious that we should strip out these training examples. Still, we did some experiments where we removed any training example that had a duplicate in either the source or the destination language, and the model still performed well. Thus, the model clearly does not fully rely on cheating.

One example of where the model does work without cheating is the second to last example. The test example has the sentence “I don’t do all that much” as target. The model predicts the fairly different sentence “I’m not busy at all,” which arguably still conveys a similar message. Interestingly, when searching through the whole dataset, the phrase “busy at all” does not show up a single time, so the model constructed that translation from smaller pieces. On the other hand, the model also produces some translations that are just wrong. For the last example in the table, the target was “he was voted prom king” but the model came up with “he used to negotiate and look like golfer.”

Properties of the Intermediate Representation

We previously showed that the word embeddings learned in a neural language model capture some syntactic and semantic structure of the language it models. Sutskever, Vinyals, and Le (2014) made a similar observation when analyzing the intermediate representation produced by the encoder in a sequence-to-sequence model. They used principal component analysis (PCA) to reduce this representation to two dimensions to be able to visualize the vectors. For the purpose of this discussion, the only thing you need to know about PCA is that the resulting lower dimensional vectors still maintain some properties of the original vectors. In particular, if two vectors are similar to each other before reducing the dimensionality, then these two vectors will still be similar to each other in the new lower dimensional space.2

2. PCA can also be used to reduce the dimensionality of word embeddings and plot them in 2D space to be able to visualize their similarity.

PCA can be used to reduce the number of dimensions of a set of vectors. It is a good technique to know if working with vector representations in many-dimensional spaces.

Figure 14-6 shows a chart that visualizes the intermediate representation of six phrases. The six phrases are grouped into two groups of three phrases each, where the three phrases within a single group express approximately the same meaning but with some grammatical variations (e.g., passive voice and word order). However, phrases in different groups express different meanings. Interestingly, as can be seen in the chart, the intermediate representation chosen by the model is such that the three phrases with similar meaning also have similar encodings, and they cluster together.

Figure 14-6 2D representation of intermediate representation of six sentences. (Source: Adapted from Sutskever, I., Vinyals, O., and Le, Q. (2014), “Sequence to Sequence Learning with Neural Networks,” in Proceedings of the 27th International Conference on Neural Information Processing [NIPS’14], MIT Press, 3104–3112.)

We can view this intermediate representation as a sentence embedding or phrase embedding, where similar phrases will be embedded close to each other in vector space. Hence, we can use this encoding to analyze the semantics of phrases. Looking at the example, it seems likely that this methodology will be more powerful than the previously discussed bag-of-words approach. As opposed to the bag-of-words approach, the sequence-to-sequence model does take word order into account.
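
A hedged sketch of how a similar visualization could be produced with the models from this chapter is shown below, using scikit-learn's PCA; taking only the top LSTM layer's h state as the sentence embedding is an assumption made for this illustration:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Encode the sample sentences; the encoder returns
# [layer1_h, layer1_c, layer2_h, layer2_c].
_, _, top_h, _ = enc_model.predict(sample_input_data, verbose=0)
points = PCA(n_components=2).fit_transform(top_h)
plt.scatter(points[:, 0], points[:, 1])
plt.show()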

Concluding Remarks on Language Translation

Although this programming example was longer and more complicated than most examples we have shown so far, from a software development point of view, it is a simple implementation. It is a basic encoder-decoder architecture without any bells and whistles, and it consists of fewer than 300 lines of code. If you are interested in experimenting with this model to improve translation quality, a starting point is to tweak the network by increasing the number of units in the layers or increasing the number of layers. You can also experiment with using bidirectional layers instead of unidirectional layers. One problem that has been observed is that sequence-to-sequence networks of this type find it challenging to deal with long sentences. A simple trick that mitigates this problem is to reverse the input sentence. One hypothesis is that doing so helps because the temporal distance between the model observing the initial words of the source sentence (that are now at the end after reversing) and observing the initial words of the destination sentence is smaller, which makes it easier for the model to learn how they relate to each other. Functionality to reverse the source sentences can trivially be added to the function that reads the dataset file.
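
As an example of the kind of change needed (a sketch of a suggested modification, not code from the chapter), reversing the source word sequence inside read_file_combined() is a one-line change:

# Inside read_file_combined(), reverse the source sentence
# before appending it:
src_word_sequence = word_sequence[0:max_len][::-1]
src_word_sequences.append(src_word_sequence)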

If you want to learn more about neural machine translation, Luong’s PhD thesis (2016) is a good start. It also contains a brief historical overview of the traditional machine translation field. Another good resource is the paper by Wu and colleagues (2016), which describes a neural-based translation system deployed in production. You will notice that it is built using the same basic architecture as the network described in this chapter. However, it also uses a more advanced technique, known as attention, to improve its ability to handle long sentences.

More recently, neural machine translation systems have moved on from LSTM-based models to using a model known as the Transformer, which is based on both attention and self-attention. Although a Transformer-based translation network does not use LSTM cells, it is still an encoder-decoder architecture. That is, key points from this chapter carry over to this more recent architecture. Attention, self-attention, and the Transformer are the topics of Chapter 15.
