Building an automated prose generator with an RNN

In this project, we will attempt to build a character-level language model using an RNN to generate prose given some initial seed characters. The main task of a character-level language model is to predict the next character given all of the previous characters in a sequence. In other words, the RNN generates text character by character.

To start with, we feed the RNN a large chunk of text as input and ask it to model the probability distribution of the next character in the sequence, given the sequence of previous characters. These probability distributions learned by the RNN model will then allow us to generate new text, one character at a time.

The first requirement for building a language model is to secure a corpus of text that the model can use to compute the probability distribution of various characters. The larger the input text corpus, the better the RNN will model the probabilities.

We do not have to work very hard to obtain the large text corpus required to train the RNN. Classical texts (books) such as The Bible can be used as a corpus. Better still, many classical texts are no longer protected by copyright, so they can be downloaded and used freely in our models.

Project Gutenberg is the best place to get access to free books that are no longer protected by copyright. Project Gutenberg can be accessed through the URL http://www.gutenberg.org. Several books, such as The Bible and Alice's Adventures in Wonderland, are available from Project Gutenberg. As of December 2018, there are 58,486 books available for download. The books are available in several formats, so they can be downloaded and used not just for this project, but for any project that requires a huge text corpus as input. The following screenshot is of a sample book from Project Gutenberg and the various formats in which the book is available for download:

Sample book available from Project Gutenberg in various formats

Irrespective of the format of the file that is downloaded, Project Gutenberg adds standard header and footer text to the actual book text. The following is an example of the header and footer that can be seen in a book:

*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***

*** END OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***

It is essential that we remove this header and footer text from the book text downloaded from the Project Gutenberg website. For a downloaded text file, one can simply open the file in a text editor and delete the header and footer.
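If you prefer to work directly with a downloaded text file, the header and footer can also be stripped programmatically. The following is a minimal sketch, assuming the plain-text version of Alice's Adventures in Wonderland (ebook #11); the URL and the exact marker strings may differ for other books or newer releases:

# downloading the plain-text version of the book (assumed URL for ebook #11)
book_url <- "https://www.gutenberg.org/files/11/11-0.txt"
book_lines <- readLines(book_url, encoding = "UTF-8")
# locating the standard Project Gutenberg start and end markers
start_line <- grep("*** START OF", book_lines, fixed = TRUE)[1]
end_line <- grep("*** END OF", book_lines, fixed = TRUE)[1]
# keeping only the lines between the markers, in other words, the actual book text
book_text <- book_lines[(start_line + 1):(end_line - 1)]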

For our project in this chapter, let's use a favorite book from childhood as the text corpus: Alice's Adventures in Wonderland by Lewis Carroll. While we have the option to download the text format of this book from Project Gutenberg and use it as a text corpus, the R language's languageR library makes the task even easier for us. The languageR library already includes the text of Alice's Adventures in Wonderland. After installing the languageR library, use the following code to load the text data into memory and print the loaded text:

# including the languageR library
library("languageR")
# loading the "Alice’s Adventures in Wonderland" to memory
data(alice)
# printing the loaded text
print(alice)

You will get the following output:

[1] "ALICE"           "S"                "ADVENTURES"       "IN"               "WONDERLAND"      
[6] "Lewis" "Carroll" "THE" "MILLENNIUM" "FULCRUM"
[11] "EDITION" "3" "0" "CHAPTER" "I"
[16] "Down" "the" "Rabbit-Hole" "Alice" "was"
[21] "beginning" "to" "get" "very" "tired"
[26] "of" "sitting" "by" "her" "sister"
[31] "on" "the" "bank" "and" "of"
[36] "having" "nothing" "to" "do" "once"
[41] "or" "twice" "she" "had" "peeped"
[46] "into" "the" "book" "her" "sister"
[51] "was" "reading" "but" "it" "had"
[56] "no" "pictures" "or" "conversations" "in"

We see from the output that the book text is stored as a character vector, where each element of the vector is a word from the book text, split on punctuation. It may also be noted that not all of the punctuation is retained in the book text.

The following code reconstructs the running text from the words in the character vector. Of course, we do not recover things like sentence boundaries during the reconstruction process, because the character vector does not retain most of the punctuation. Now, let's reconstruct the book text from the individual words:

alice_in_wonderland <- paste(alice, collapse = " ")
print(alice_in_wonderland)

You will get the following output:

[1] "ALICE S ADVENTURES IN WONDERLAND Lewis Carroll THE MILLENNIUM FULCRUM EDITION 3 0 CHAPTER I Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversation So she was considering in her own mind as well as she could for the hot day made her feel very sleepy and stupid whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies when suddenly a White Rabbit with pink eyes ran close by her There was nothing so VERY remarkable in that nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself Oh dear Oh dear I shall be late when she thought it over afterwards it occurred to her that she ought to have wondered at this but at the time it all seemed quite natural but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT- POCKET and looked at it and then hurried on Alice started to her feet for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket or a watch to take out of it and burning with curiosity she ran across the field after it and fortunately was just in time to see it pop down a large rabbit-hole under the hedge In another moment down went Alice after it never once considering how in the world she was to get out again The rabbit-hole we .......

From the output, we see that a long string of text has been constructed from the words. Now, we can move on to some preprocessing of this text so that it can be fed to the RNN, allowing the model to learn the dependencies between characters and the conditional probabilities of characters in sequences.
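As a first preprocessing step, one reasonable approach (a sketch, not the only possibility) is to split the reconstructed string into individual characters and build a character vocabulary with an integer index for each character; the variable names below are illustrative:

# splitting the reconstructed text into individual characters
chars <- strsplit(alice_in_wonderland, split = "")[[1]]
# building the vocabulary, in other words, the set of unique characters in the corpus
vocab <- sort(unique(chars))
vocab_size <- length(vocab)
# mapping each character to an integer index and back
char_to_index <- setNames(seq_along(vocab), vocab)
index_to_char <- setNames(vocab, seq_along(vocab))
# inspecting the vocabulary size and the encoding of the first few characters
print(vocab_size)
print(char_to_index[head(chars, 10)])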

One thing to note is that, just as a character-level language model generates the next character in a sequence, you can also build a word-level language model. However, the character-level language model has the advantage that it can create its own unique words that are not in the vocabulary we train it on.

Let's now look at how an RNN learns the dependencies between characters in sequences. Assume that we only have a vocabulary of four possible letters, [a, p, l, e], and that the intent is to train an RNN on the training sequence apple. This training sequence is, in fact, a source of four separate training examples (a small sketch extracting these input and target pairs follows the list):

  • The letter p should be likely given the context of a; in other words, the conditional probability of p given the letter a in the word apple
  • Similar to the first point, p should be likely in the context of ap
  • The letter l should also be likely given the context of app
  • The letter e should be likely given the context of appl

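To make these four training examples concrete, the following small sketch extracts every context and target pair from the training sequence apple:

# the training sequence and its individual characters
training_sequence <- "apple"
chars <- strsplit(training_sequence, split = "")[[1]]
# each position from the second character onwards yields one training example:
# the preceding characters form the context and the current character is the target
for (i in 2:length(chars)) {
  context <- paste(chars[1:(i - 1)], collapse = "")
  target <- chars[i]
  cat("context:", context, "-> target:", target, "\n")
}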
We start by encoding each character in the word apple into a vector using 1-of-k encoding. 1-of-k encoding represents each character as a vector of all zeros, except for a single 1 at the index of that character in the vocabulary. Each character represented with 1-of-k encoding is then fed into the RNN one at a time with the help of a step function. The RNN takes this input and generates a four-dimensional output vector (one dimension per character; recall that we only have four characters in our vocabulary). This output vector can be interpreted as the confidence that the RNN currently assigns to each character coming next in the sequence (a small encoding sketch follows the discussion of the diagram below). The following diagram is a visualization of the RNN learning the characters:

RNN learning the character language model

In the preceding diagram, we see an RNN with four-dimensional input and output layers, as well as a hidden layer with three neurons. The diagram displays the activations in the forward pass when the RNN is fed the input characters a, p, p, and l. The output layer contains the confidence that the RNN assigns to each possible next character. We want the green numbers in the output layer to be higher than the red numbers, because high values for the green numbers correspond to predicting the correct character for each input.
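A minimal sketch of the 1-of-k encoding of these inputs is shown below; the helper function is illustrative and not part of any library:

# the vocabulary for the toy example
vocab <- c("a", "p", "l", "e")
# encoding a single character as a 1-of-k vector: all zeros except
# a single 1 at the index of the character in the vocabulary
one_hot_encode <- function(char, vocab) {
  vec <- rep(0, length(vocab))
  vec[which(vocab == char)] <- 1
  vec
}
# encoding every character of the training sequence "apple"
inputs <- t(sapply(strsplit("apple", split = "")[[1]], one_hot_encode, vocab = vocab))
print(inputs)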

We see that in the first time step, when the RNN is fed the input character a, it assigns a confidence of 1.0 to the next letter being a, 2.2 to the letter p, -3.0 to l, and 4.1 to e. As per our training data, the sequence we are considering is apple; therefore, the correct next character, given the character a as input in the first time step, is p. We would like our RNN to maximize the confidence assigned to p in the first time step (indicated in green) and minimize the confidence of all other letters (indicated in red). Likewise, we have a desired output character at each of the four time steps that we would like our RNN to assign a greater confidence to.

Since the RNN consists entirely of differentiable operations, we can run the backpropagation algorithm to figure out in what direction we should adjust each one of its weights to increase the scores of the correct targets (the bold green numbers).

Based on the gradient direction, the parameters are updated: the algorithm nudges each weight by a tiny amount in that direction. Ideally, once gradient descent has run and updated the weights, we would see a slightly higher score for the correct choice and lower scores for the incorrect characters. For example, the score of the correct character p in the first time step would be slightly higher, say 2.3 instead of 2.2, while the scores for the other characters a, l, and e would be slightly lower than the scores assigned before the update.
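A single update of this kind boils down to nudging each weight along its gradient. The following is a toy sketch in which the weight matrix and its gradient are made up purely for illustration; in practice, the gradient comes from backpropagation:

# a toy weight matrix and a toy gradient of the scores of the correct targets
W <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)
grad_W <- matrix(rnorm(3 * 4), nrow = 3, ncol = 4)
# nudging every weight a tiny amount in the gradient direction
learning_rate <- 0.01
W <- W + learning_rate * grad_W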

The process of updating the parameters through gradient descent is repeated multiple times in the RNN until the network converges, in other words, until the predictions are consistent with the training data.

Technically speaking, we run the standard softmax classifier (also called the cross-entropy loss) on every output vector simultaneously. The RNN is trained with mini-batch stochastic gradient descent, or with adaptive learning rate methods such as RMSProp or Adam, to stabilize the updates.
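As an illustration of what this means at a single time step, the following sketch applies a softmax to the raw output scores from the first time step of our example and computes the cross-entropy loss for the correct next character p:

# raw output scores for the first time step (confidences for a, p, l, and e)
scores <- c(a = 1.0, p = 2.2, l = -3.0, e = 4.1)
# softmax turns the raw scores into a probability distribution
softmax <- function(x) exp(x - max(x)) / sum(exp(x - max(x)))
probs <- softmax(scores)
# cross-entropy loss for the correct target character "p"
loss <- -log(probs["p"])
print(probs)
print(loss)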

You may notice that the first time the character p is input, the desired output is p; however, the second time the same character is fed in, the desired output is l. An RNN, therefore, cannot rely only on the current input. This is where an RNN uses its recurrent connections to keep track of the context, perform the task, and make the correct predictions. Without the context, it would be very difficult for the network to predict the right output given the same input.

When we want to generate text using the trained RNN model, we provide a seed input character to the network and obtain a distribution over which characters are likely to come next. We sample from this distribution and feed the sampled character right back in to get the next letter. The process is repeated until the maximum number of characters is reached (a user-defined text length), or until the model produces an end-of-sequence token such as <EOS> or <END>.
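The generation loop can be sketched as follows. Here, predict_next_char_probs() is a hypothetical placeholder standing in for the trained RNN; a real model would return the learned probability distribution over the vocabulary given the text generated so far:

# a hypothetical stand-in for the trained RNN: it should return a probability
# vector over the vocabulary given the context generated so far
vocab <- c("a", "p", "l", "e")
predict_next_char_probs <- function(context, vocab) {
  rep(1 / length(vocab), length(vocab))  # placeholder: uniform probabilities
}
# generating text one character at a time from a seed character
generate_text <- function(seed, max_length = 20) {
  generated <- seed
  for (i in seq_len(max_length)) {
    probs <- predict_next_char_probs(generated, vocab)
    next_char <- sample(vocab, size = 1, prob = probs)  # sample from the distribution
    generated <- paste0(generated, next_char)           # feed it back in as context
  }
  generated
}
print(generate_text("a"))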
