Generating Text with RNNs and GPT-2

When your mobile phone completes a word as you type a message or when Gmail suggests a short reply or completes a sentence as you reply to an email, a text generation model is working in the background. The Transformer architecture forms the basis of state-of-the-art text generation models. BERT, as explained in the previous chapter, uses only the encoder part of the Transformer architecture.

However, BERT, being bi-directional, is not suitable for the generation of text. A left-to-right (or right-to-left, depending on the language) language model built on the decoder part of the Transformer architecture is the foundation of text generation models today.

Text can be generated a character at a time or with words and sentences together. Both of these approaches are shown in this chapter. Specifically, we will cover the following topics:

  • Generating text with:
    • Character-based RNNs for generating news headlines and completing text messages
    • GPT-2 to generate full sentences
  • Improving the quality of text generation using techniques such as:
    • Greedy search
    • Beam search
    • Top-K sampling
  • Using advanced techniques such as learning rate annealing and checkpointing to enable long training times
  • Details of the Transformer decoder architecture
  • Details of the GPT and GPT-2 models

A character-based approach for generating text is shown first. Such models can be quite useful for generating completions of a partially typed word in a sentence on a messaging platform, for example.

Generating text – one character at a time

Text generation yields a window into whether deep learning models are learning about the underlying structure of language. Text will be generated using two different approaches in this chapter. The first approach is an RNN-based model that generates a character at a time.

In the previous chapters, we have seen different tokenization methods based on words and sub-words. In this model, text is tokenized into individual characters, which include capital and small letters, punctuation symbols, and digits. There are 96 tokens in total. This tokenization is an extreme example to test how much a model can learn about the language structure. The model will be trained to predict the next character based on a given set of input characters. If there is indeed an underlying structure in the language, the model should pick it up and generate reasonable-looking sentences.

Generating coherent sentences one character at a time is a very challenging task. The model does not have a dictionary or vocabulary, and it has no sense of capitalization of nouns or any grammar rules. Yet, we are expecting it to generate reasonable-looking sentences. The structure of words and their order in a sentence is not random but driven by grammar rules in a language. Words have some structure, based on parts of speech and word roots. A character-based model has the smallest possible vocabulary, but we hope that the model learns a lot about the use of the letters. This may seem like a tall order but be prepared to be surprised. Let's get started with the data loading and pre-processing steps.

Data loading and pre-processing

For this particular example, we are going to use data from a constrained domain – a set of news headlines. The hypothesis is that news headlines are usually short and follow a particular structure. These headlines are usually a summary of an article and contain a large number of proper nouns like names of companies and celebrities. For this particular task, data from two different datasets are joined together and used. The first dataset is called the News Aggregator dataset generated by the Artificial Intelligence Lab, part of the Faculty of Engineering at Roma Tre University in Italy. The University of California, Irvine, has made the dataset available for download from https://archive.ics.uci.edu/ml/datasets/News+Aggregator. This dataset has over 420,000 news article titles, URLs, and other information. The second dataset is a set of over 200,000 news articles from The Huffington Post, called the News Category dataset, collected by Rishabh Mishra and posted on Kaggle at https://www.kaggle.com/rmisra/news-category-dataset.

News article headlines from both datasets are extracted and compiled into one file. This step is already done to save time. The compressed output file is called news-headlines.tsv.zip and is located in the chapter5-nlg-with-transformer-gpt/char-rnn GitHub folder corresponding to this chapter. The folder is located inside the GitHub repository for this book. The format of this file is pretty simple. It has two columns separated by a tab. The first column is the original headline, and the second column is an uncased version of the same headline. This example uses the first column of the file only.

However, you can try the uncased version to see how the results differ. Training such models usually takes a lot of time, often several hours. Training in an IPython notebook can be difficult because issues such as losing the connection to the kernel, or the kernel process dying, can result in the loss of the trained model. What we are attempting to do in this example is akin to training BERT from scratch. Don't worry; we train the model for a much shorter time than it took to train BERT. Long training loops run the risk of crashing in the middle, and in such a case we don't want to restart training from scratch. The model is therefore checkpointed frequently during training so that, if a failure occurs, the model state can be restored and training resumed from the last checkpoint. Python files executed from the command line give the most control when running long training loops.

The command-line instructions shown in this example were tested on an Ubuntu 18.04 LTS machine. These commands should mostly work as is on a macOS command line, though some may need minor adjustments. Windows users may need to translate these commands for their operating system. Windows 10 power users should be able to use the Windows Subsystem for Linux (WSL) capabilities to execute the same commands.

Going back to the data format, all that needs to be done for loading the data is to unzip the prepared headline file. Navigate to the folder where the ZIP file has been pulled down from GitHub. The compressed file of headlines can be unzipped and inspected:

$ unzip news-headlines.tsv.zip
Archive:  news-headlines.tsv.zip
  inflating: news-headlines.tsv

Let's inspect the contents of the file to get a sense of the data:

$ head -3 news-headlines.tsv
There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV there were 2 mass shootings in texas last week, but only 1 on tv
Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song will smith joins diplo and nicky jam for the 2018 world cup's official song
Hugh Grant Marries For The First Time At Age 57 hugh grant marries for the first time at age 57

The model is trained on the headlines shown above. We are ready to move on to the next step and load the file to perform normalization and tokenization.

Data normalization and tokenization

As discussed above, this model uses a token per character. So, each letter, including punctuation, numbers, and space, becomes a token. Three additional tokens are added. These are:

  • <EOS>: Denotes end of sentences. The model can use this token to indicate that the generation of text is complete. All headlines end with this token.
  • <UNK>: While this is a character-level model, it is possible to have different characters from other languages or character sets in the dataset. When a character is detected that is not present in our set of 96 characters, this token is used. This approach is consistent with word-based vocabulary approaches where it is common to replace out-of-vocabulary words with a special token.
  • <PAD>: This is a unique padding token used to pad all headlines to the same length. Padding is done by hand in this example as opposed to using TensorFlow methods, which we have seen previously.

All the code in this section will refer to the rnn-train.py file from the chapter5-nlg-with-transformer-gpt folder of the GitHub repo of the book. The first part of this file has the imports and optional instructions for setting up a GPU. Ignore this section if your setup does not use a GPU.

A GPU is an excellent investment for deep learning engineers and researchers. A GPU can speed up training substantially, often by an order of magnitude or more. It would be worthwhile to outfit your deep learning setup with a GPU like the Nvidia GeForce RTX 2070.

The code for data normalization and tokenization is between lines 32 and 90 of this file. To start, the tokenization function needs to be set up:

chars = sorted(set("abcdefghijklmnopqrstuvwxyz0123456789 -,;.!?:'''/|_@#$%ˆ&*˜'+-=()[]{}' ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
chars = list(chars)
EOS = '<EOS>'
UNK = "<UNK>"
PAD = "<PAD>"      # need to move mask to '0'index for Embedding layer
chars.append(UNK)  # unknown character, looked up by char_idx() below
chars.append(EOS)  # end of sentence
chars.insert(0, PAD)  # now padding should get index of 0

Once the token list is ready, methods need to be defined for converting characters to tokens and vice versa. Creating the mappings is relatively straightforward:

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(chars)}
idx2char = np.array(chars)
def char_idx(c):
    # takes a character and returns an index
    # if character is not in list, returns the unknown token
    if c in chars:
        return char2idx[c]
    return char2idx[UNK]
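To make the round trip concrete, here is a minimal sketch of the same idea on a toy vocabulary. The names toy_chars, encode, and decode are hypothetical and used only for illustration; the toy character set is much smaller than the 96-token list used in this chapter, so digits fall back to the unknown token here.

```python
# Toy vocabulary for illustration only; the chapter's real list has 96 tokens.
PAD, UNK, EOS = "<PAD>", "<UNK>", "<EOS>"
toy_chars = [PAD] + sorted(set("abcdefghijklmnopqrstuvwxyz ")) + [UNK, EOS]

# Character -> index and index -> character lookups
toy_char2idx = {ch: i for i, ch in enumerate(toy_chars)}
toy_idx2char = toy_chars


def encode(text):
    """Map each character to its index, falling back to <UNK>."""
    unk = toy_char2idx[UNK]
    return [toy_char2idx.get(ch, unk) for ch in text]


def decode(indices):
    """Map a list of indices back to a string."""
    return "".join(toy_idx2char[i] for i in indices)


ids = encode("route 66")   # digits are not in the toy vocabulary
print(ids)
print(decode(ids))
```

Using dict.get with a default avoids scanning the character list on every lookup, which matters when hundreds of thousands of headlines are being converted.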

Now, the data can be read in from the TSV file. A maximum length of 75 characters is used for the headlines. If the headlines are shorter than this length, they are padded. Any headlines longer than 75 characters are snipped. The <EOS> token is appended to the end of every headline. Let's set this up:

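data = []     # load into this list of lists
MAX_LEN = 75  # maximum length of a headline
with open("news-headlines.tsv", "r") as file:
    lines = csv.reader(file, delimiter='\t')
    for line in lines:
        hdln = line[0]
        # note: hdln[:-1] leaves out the final character of each headline
        cnvrtd = [char_idx(c) for c in hdln[:-1]]
        if len(cnvrtd) >= MAX_LEN:
            # snip long headlines, leaving room for <EOS>
            cnvrtd = cnvrtd[0:MAX_LEN-1]
            cnvrtd.append(char2idx[EOS])
        else:
            cnvrtd.append(char2idx[EOS])
            # add padding tokens
            remain = MAX_LEN - len(cnvrtd)
            if remain > 0:
                for i in range(remain):
                    cnvrtd.append(char2idx[PAD])
        data.append(cnvrtd)
print("**** Data file loaded ****")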
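The snip, end-of-sentence, and padding rules can also be expressed as one small helper. The sketch below is an illustration, not the code from rnn-train.py; the function name fit_to_length and the token ids (99 for <EOS>, 0 for <PAD>) are assumptions chosen to keep the example readable.

```python
def fit_to_length(token_ids, max_len, eos_id, pad_id):
    """Return exactly max_len ids: truncate, append EOS, then pad."""
    clipped = list(token_ids[: max_len - 1])   # leave room for EOS
    clipped.append(eos_id)
    clipped.extend([pad_id] * (max_len - len(clipped)))
    return clipped


# A short sequence is padded; a long one is snipped before EOS is added.
short = fit_to_length([5, 6, 7], max_len=6, eos_id=99, pad_id=0)
long_ = fit_to_length(list(range(1, 20)), max_len=6, eos_id=99, pad_id=0)
print(short)
print(long_)
```

Every row produced this way has the same length, which is what allows the list of lists to be converted into a rectangular array in the next step.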

All the data is loaded into a list with the code above. You may be wondering about the ground truth here for training as we only have a line of text. Since we want this model to generate text, the objective can be reduced to predicting the next character given a set of characters. Hence, a trick will be used to construct the ground truth – we will just shift the input sequence by one character and set it as the expected output. This transformation is quite easy to do with NumPy:

# now convert to numpy array
np_data = np.array(data)
# for training, we use one character shifted data
np_data_in = np_data[:, :-1]
np_data_out = np_data[:, 1:]
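To see what the shift does, consider a single made-up sequence in plain Python. The token ids below are hypothetical (9 stands in for <EOS> and 0 for <PAD>); the slicing is the same as in the NumPy version, applied to one row.

```python
# Hypothetical encoded headline: three character ids, then <EOS>=9, <PAD>=0.
sequence = [4, 7, 2, 9, 0, 0]

inputs = sequence[:-1]   # everything except the last token
targets = sequence[1:]   # everything except the first token

# Each input position is paired with the token that follows it.
for x, y in zip(inputs, targets):
    print(f"given {x} -> predict {y}")
```

At every position, the target is simply the next token of the same sequence, so no separate labels are needed.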

With this nifty trick, we have both inputs and expected outputs ready for training. The final step is to convert them into a tf.data.Dataset for ease of batching and shuffling:

# Create TF dataset
x = tf.data.Dataset.from_tensor_slices((np_data_in, np_data_out))

Now everything is ready to start training.

Training the model

The code for model training starts at line 90 in the rnn-train.py file. The model is quite simple. It has an embedding layer, followed by a GRU layer, a dropout layer, and a dense layer. The size of the vocabulary, the number of RNN units, and the size of the embeddings are set up:

# Length of the vocabulary in chars
vocab_size = len(chars)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
# batch size
BATCH_SIZE = 256

With the batch size being defined, training data can be batched and ready for use by the model:

# create tf.DataSet
x_train = x.shuffle(100000, reshuffle_each_iteration=True).batch(BATCH_SIZE, drop_remainder=True)
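Because the model is built with a fixed batch size, a smaller final batch would not fit, which is why drop_remainder=True is used. The effect on the number of steps per epoch can be sketched in plain Python; the function name steps_per_epoch and the figure of 1,000 examples are made up for illustration.

```python
def steps_per_epoch(num_examples, batch_size, drop_remainder=True):
    """Number of batches produced in one pass over the data."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full
    return full + 1   # one extra, partially filled batch


# 1,000 examples is a made-up figure for illustration.
dropped = steps_per_epoch(1000, 256)
kept = steps_per_epoch(1000, 256, drop_remainder=False)
print(dropped, kept)
```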

Similar to code in previous chapters, a convenience method to build models is defined like so:

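# define the model
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True,
                              batch_input_shape=[batch_size, None]),
    # stateful GRU is an assumption based on the fixed batch size
    tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True),
    tf.keras.layers.Dropout(0.1),  # rate is assumed; not shown in the text
    tf.keras.layers.Dense(vocab_size)  # logits over the vocabulary
  ])
  return model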

A model can be instantiated with this method:

model = build_model(
                  vocab_size = vocab_size,
                  embedding_dim = embedding_dim,
                  rnn_units = rnn_units,
                  batch_size = BATCH_SIZE)
print("**** Model Instantiated ****")
model.summary()
**** Model Instantiated ****
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (256, None, 256)          24576
gru (GRU)                    (256, None, 1024)         3938304
dropout (Dropout)            (256, None, 1024)         0
dense (Dense)                (256, None, 96)           98400
Total params: 4,061,280
Trainable params: 4,061,280
Non-trainable params: 0
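The parameter counts in the summary can be checked by hand. The sketch below assumes the Keras default GRU configuration (reset_after=True), in which each of the three weight groups (update gate, reset gate, and candidate state) has input weights, recurrent weights, and two bias vectors.

```python
vocab_size, embedding_dim, rnn_units = 96, 256, 1024

# One embedding vector per token
embedding_params = vocab_size * embedding_dim

# Three weight groups, each with input weights, recurrent weights,
# and two bias vectors (Keras default reset_after=True is assumed).
gru_params = 3 * (embedding_dim * rnn_units
                  + rnn_units * rnn_units
                  + 2 * rnn_units)

# Dense layer: weights plus one bias per output token
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```

The dropout layer contributes no parameters, which matches the zero shown in the summary.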

There are just over 4 million trainable parameters in this model. The Adam optimizer, with a sparse categorical cross-entropy loss function computed on logits, is used for training this model:

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer = 'adam', loss = loss)

Since training is potentially going to take a long time, we need to set up checkpoints along with the training. If there is any problem in training and training stops, these checkpoints can be used to restart the training from the last saved checkpoint. A directory is created using the current timestamp for saving these checkpoints:

# Setup checkpoints 
# dynamically build folder names
dt = datetime.datetime.today().strftime("%Y-%b-%d-%H-%M-%S")
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints/'+dt
# Name of the checkpoint files
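checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
# Save only the layer weights at the end of every epoch
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True)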

A callback that saves checkpoints during training is defined in the last line of code above. This is passed to the model.fit() function to be called at the end of every epoch. Starting the training loop is straightforward:

print("**** Start Training ****")
start = time.time()
history = model.fit(x_train, epochs=EPOCHS,
                    callbacks=[checkpoint_callback])
print("**** End Training ****")
print("Training time: ", time.time()- start)

The model will be trained for 25 epochs. The time taken in training will be logged as well in the code above. The final piece of code uses the history to plot the loss and save it as a PNG file in the same directory:

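# Plot the training loss
lossplot = "loss-" + dt + ".png"
plt.plot(history.history['loss'])
plt.title('model loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.savefig(lossplot)
print("Saved loss to: ", lossplot)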

The best way to start training is to start the Python process so that it can run in the background without needing a Terminal or command-line. On Unix systems, this can be done with the nohup command:

$ nohup python rnn-train.py > training.log &

This command line starts the process in a way that disconnecting the Terminal would not interrupt the training process. On my machine, this training took approximately 1 hour and 43 minutes. Let's check out the loss curve:

Figure 5.1: Loss curve

As we can see, the loss decreases to a point and then shoots up. The standard expectation is that loss would monotonically decrease as the model was trained for more epochs. In the case shown above, the loss suddenly shoots up. In other cases, you may observe a NaN, or Not-A-Number, error. NaNs result from the exploding gradient problem during backpropagation through RNNs. The gradient direction causes weights to grow very large quickly and overflow, resulting in NaNs. Given how prevalent this is, there are quite a few jokes about NLP engineers and Indian food to go with the nans (referring to a type of Indian bread).

The primary reason behind these occurrences is gradient descent overshooting the minima and starting to climb the slope before reducing again. This happens when the steps gradient descent is taking are too large. One way to prevent the NaN issue is gradient clipping, where gradients are clipped to an absolute maximum, preventing the loss from exploding. Another is to shrink the steps themselves: in the RNN model above, a scheme needs to be used that reduces the learning rate over time. Reducing the learning rate over epochs reduces the chances for gradient descent to overshoot the minima. This technique of reducing the learning rate over time is called learning rate annealing or learning rate decay. The next section walks through implementing learning rate decay while training the model.
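Overshooting is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which is a stand-in for the real loss surface rather than anything from the RNN model; the function name descend and all the numbers are chosen only for illustration.

```python
def descend(lr, steps=50, x=5.0, clip=None):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        grad = 2.0 * x
        if clip is not None:
            grad = max(-clip, min(clip, grad))  # clip to [-clip, clip]
        x -= lr * grad
    return x


small = descend(lr=0.1)              # steps shrink toward the minimum at 0
large = descend(lr=1.1)              # every step overshoots and grows
clipped = descend(lr=1.1, clip=1.0)  # same step size, but bounded updates
print(small, large, clipped)
```

With the large learning rate, each update lands further from the minimum than the previous one, so the value grows without bound. Clipping keeps the updates bounded, but the iterate still bounces around the minimum instead of settling, which is why reducing the learning rate over time is the more complete fix.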

Implementing learning rate decay as custom callback

There are two ways to implement learning rate decay in TensorFlow. The first way is to use one of the prebuilt schedules that are part of the tf.keras.optimizers.schedules package and use a configured instance with the optimizer. An example of a prebuilt schedule is InverseTimeDecay, and it can be set up as shown below:

lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(

The first parameter, 0.001 in the example above, is the initial learning rate. The number of steps per epoch can be calculated by dividing the number of training examples by batch size. The number of decay steps determines how the learning rate is reduced. The equation used to compute the learning rate is:

learning_rate = initial_learning_rate / (1 + decay_rate * step / decay_steps)

After being set up, all this function needs is the step number for computing the new learning rate. Once the schedule is set up, it can be passed to the optimizer:

optimizer = tf.keras.optimizers.Adam(lr_schedule)

That's it! The rest of the training loop code is unchanged. However, this learning rate scheduler starts reducing the learning rate from the first epoch itself. A lower learning rate increases the amount of training time. Ideally, we would keep the learning rate unchanged for the first few epochs and then reduce it.
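The behavior of inverse time decay can be explored without TensorFlow. The sketch below is a plain-Python version of the schedule's formula; the function name inverse_time_decay is hypothetical, and the decay_steps and decay_rate values are illustrative rather than the ones used in this chapter.

```python
def inverse_time_decay(step, initial_lr, decay_steps, decay_rate):
    """lr = initial_lr / (1 + decay_rate * step / decay_steps)"""
    return initial_lr / (1.0 + decay_rate * step / decay_steps)


# decay_steps and decay_rate are illustrative values, not the chapter's.
rates = [inverse_time_decay(s, 0.001, decay_steps=1000, decay_rate=0.5)
         for s in (0, 1000, 5000, 10000)]
for r in rates:
    print("%.6f" % r)
```

The learning rate is already below its initial value after the very first steps, which illustrates why this schedule starts slowing training down from the first epoch.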

Looking at Figure 5.1 above, the learning rate is probably effective until about the tenth epoch. BERT also uses learning rate warmup before learning rate decay. Learning rate warmup generally refers to increasing the learning rate for a few epochs. BERT was trained for 1,000,000 steps, which roughly translates to 40 epochs. For the first 10,000 steps, the learning rate was increased, and then it was linearly decayed. Implementing such a learning rate schedule is better accomplished by a custom callback.
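A warmup followed by linear decay can be sketched as a simple function of the step number. This is an illustration of the shape of such a schedule, not BERT's actual implementation; the function name warmup_linear_decay and the peak learning rate of 1e-4 are assumptions.

```python
def warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Linear ramp up to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


PEAK = 1e-4   # assumed peak learning rate, for illustration
for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    lr = warmup_linear_decay(step, PEAK, warmup_steps=10_000,
                             total_steps=1_000_000)
    print(step, lr)
```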

Custom callbacks in TensorFlow enable the execution of custom logic at various points during training and inference. We saw an example of a prebuilt callback that saves checkpoints during training. A custom callback provides hooks where the desired logic can be executed at various points during training. The main step is to define a subclass of tf.keras.callbacks.Callback. Then, one or more of the following functions can be implemented to hook onto the events exposed by TensorFlow:

  • on_[train,test,predict]_begin / on_[train,test,predict]_end: This callback happens at the start of training or the end of the training. There are methods for training, testing, and prediction loops. Names for these methods can be constructed using the appropriate stage name from the possibilities shown in brackets. The method naming convention is a common pattern across other methods in the rest of the list.
  • on_[train,test,predict]_batch_begin / on_[train,test,predict]_batch_end: These callbacks happen when processing of a specific batch starts or ends.
  • on_epoch_begin / on_epoch_end: This is a training-specific function called at the start or end of an epoch.
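The order in which these hooks fire can be illustrated with a toy stand-in. The classes below are not Keras; ToyCallback and toy_fit are hypothetical names that only mimic the hook names listed above, so the example can run without TensorFlow.

```python
class ToyCallback:
    """Stand-in that uses Keras-style hook names (this is not Keras)."""
    def __init__(self):
        self.events = []

    def on_train_begin(self, logs=None):
        self.events.append("train_begin")

    def on_epoch_begin(self, epoch, logs=None):
        self.events.append(f"epoch_begin:{epoch}")

    def on_train_batch_end(self, batch, logs=None):
        self.events.append(f"batch_end:{batch}")

    def on_epoch_end(self, epoch, logs=None):
        self.events.append(f"epoch_end:{epoch}")

    def on_train_end(self, logs=None):
        self.events.append("train_end")


def toy_fit(callback, epochs, batches_per_epoch):
    """Calls the hooks in the order a training loop would."""
    callback.on_train_begin()
    for epoch in range(epochs):            # epochs are zero-indexed
        callback.on_epoch_begin(epoch)
        for batch in range(batches_per_epoch):
            callback.on_train_batch_end(batch)
        callback.on_epoch_end(epoch)
    callback.on_train_end()


cb = ToyCallback()
toy_fit(cb, epochs=2, batches_per_epoch=2)
print(cb.events)
```

Note that the epoch number passed to the hooks starts at zero, which matters when deciding at which epoch a schedule should begin.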

We will implement a callback for the start of the epoch that adjusts that epoch's learning rate. Our implementation will keep the learning rate constant for a configurable number of initial epochs and then reduce the learning rate in a fashion similar to the inverse time decay function described above. This learning rate would look like the following Figure 5.2:

Figure 5.2: Custom learning rate decay function

First, a subclass is created with the function defined in it. The best place to put this in rnn-train.py is near the checkpoint callback, before the start of training. This class definition is shown below:

class LearningRateScheduler(tf.keras.callbacks.Callback):
  """Learning rate scheduler which decays the learning rate"""
  def __init__(self, init_lr, decay, steps, start_epoch):
    super().__init__()
    self.init_lr = init_lr          # initial learning rate
    self.decay = decay              # how sharply to decay
    self.steps = steps              # total number of steps of decay
    self.start_epoch = start_epoch  # which epoch to start decaying
  def on_epoch_begin(self, epoch, logs=None):
    if not hasattr(self.model.optimizer, 'lr'):
      raise ValueError('Optimizer must have a "lr" attribute.')
    # Get the current learning rate
    lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
    if(epoch >= self.start_epoch):
        # Get the scheduled learning rate.
        scheduled_lr = self.init_lr / (1 + self.decay * (epoch / self.steps))
        # Set the new learning rate
        tf.keras.backend.set_value(self.model.optimizer.lr, scheduled_lr)
        print('\nEpoch %05d: Learning rate is %6.8f.' % (epoch, scheduled_lr))

Using this callback in the training loop requires the instantiation of the callback. The following parameters are set while instantiating the callback:

  • The initial learning rate is set to 0.001.
  • The decay rate is set to 4. Please feel free to play around with different settings.
  • The number of steps is set to the number of epochs. The model is trained for 150 epochs.
  • Learning rate decay should start after epoch 10, so the start epoch is set to 10.

The training loop is updated to include the callback like so:

print("**** Start Training ****")
lr_decay = LearningRateScheduler(0.001, 4., EPOCHS, 10)
start = time.time()
history = model.fit(x_train, epochs=EPOCHS,
                    callbacks=[checkpoint_callback, lr_decay])
print("**** End Training ****")
print("Training time: ", time.time()- start)
print("Checkpoint directory: ", checkpoint_dir)

The main changes from the earlier loop are the instantiation of lr_decay and its addition to the callbacks list. Now, the model is ready to be trained using the same nohup command as before. Training for 150 epochs took over 10 hours on the GPU-capable machine. The loss curve is shown in Figure 5.3:

Figure 5.3: Model loss after learning rate decay

In the figure above, the loss drops very fast for the first few epochs before plateauing near epoch 10. Learning rate decay kicks in at that point, and the loss starts to fall again. This can be verified from a snippet of the log file:

Epoch 8/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.9055
Epoch 9/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.9052
Epoch 10/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.9064
Epoch 00010: Learning rate is 0.00078947.
Epoch 11/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.8949
Epoch 00011: Learning rate is 0.00077320.
Epoch 12/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.8888
...
Epoch 00149: Learning rate is 0.00020107.
Epoch 150/150
2434/2434 [==================] - 249s 102ms/step - loss: 0.7667
**** End Training ****
Training time:  37361.16723680496
Checkpoint directory:  ./training_checkpoints/2021-Jan-01-09-55-03
Saved loss to:  loss-2021-Jan-01-09-55-03.png

Note the loss reported for epoch 10 above. The loss increased slightly around epoch 10, at which point learning rate decay kicked in and the loss started falling again. The small bumps in the loss that can be seen in Figure 5.3 correlate with places where the learning rate was higher than needed, and learning rate decay kicked it down to make the loss go lower. The learning rate started at 0.001 and ended at a fifth of that at 0.0002.
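The learning rates in the log can be reproduced with a few lines of plain Python. The function name scheduled_lr is hypothetical; the formula and the parameter values (initial rate 0.001, decay 4, 150 steps, start epoch 10) are the ones used by the callback in this section.

```python
def scheduled_lr(epoch, init_lr=0.001, decay=4.0, steps=150, start_epoch=10):
    """Constant until start_epoch, then inverse-time decay by epoch."""
    if epoch < start_epoch:
        return init_lr
    return init_lr / (1 + decay * (epoch / steps))


# Epoch numbers are zero-indexed, as passed to on_epoch_begin.
for epoch in (0, 9, 10, 11, 149):
    print("epoch %3d -> %.8f" % (epoch, scheduled_lr(epoch)))
```

The values for epochs 10, 11, and 149 match the learning rates printed in the training log.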

Training this model took a long time and required advanced tricks like learning rate decay. But how does this model do in terms of generating text? That is the focus of the next section.

Generating text with greedy search

Checkpoints were taken during the training process at the end of every epoch. These checkpoints are used to load a trained model for generating text. This part of the code is implemented in an IPython notebook. The code for this section is found in the charRNN-text-generation.ipynb file in this chapter's folder in GitHub. The generation of text is dependent on the same normalization and tokenization logic used during training. The Setup Tokenization section of the notebook has this code replicated.

There are two main steps in generating text. The first step is restoring a trained model from the checkpoint. The second step is generating a character at a time from a trained model until a specific end condition is met.

The Load the Model section of the notebook has the code to define the model. Since the checkpoints only stored the weights for the layers, defining the model structure is important. The main difference from the training network is the batch size. We want to generate a sentence at a time, so we set the batch size as 1:

# Length of the vocabulary in chars
vocab_size = len(chars)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
# Batch size
BATCH_SIZE = 1

A convenience function for setting up the model structure is defined like so:

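# this one is without padding masking or dropout layer
def build_gen_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    # stateful, so the hidden state carries over between calls
    tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model
gen_model = build_gen_model(vocab_size, embedding_dim, rnn_units,
                            BATCH_SIZE)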

Note that the embedding layer does not use masking because, in text generation, we are not passing an entire sequence but only part of a sequence that needs to be completed. Now that the model is defined, the weights for the layers can be loaded in from the checkpoint. Please remember to replace the checkpoint directory with your local directory containing the checkpoints from training:

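checkpoint_dir = './training_checkpoints/<YOUR-CHECKPOINT-DIR>'
# load the weights saved by the most recent checkpoint in this directory
gen_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
gen_model.build(tf.TensorShape([1, None]))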

The second main step is to generate text a character at a time. Generating text needs a seed or a starting few letters, which are completed by the model into a sentence. The process of generation is encapsulated in the function below:

def generate_text(model, start_string, temperature=0.7, num_generate=75):
  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)
  # Empty list to store our results
  text_generated = []
  # Here batch size == 1
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)
      # scale the logits by the temperature, then sample the next
      # character from the resulting categorical distribution
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions,
                               num_samples=1)[-1,0].numpy()
      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)
      text_generated.append(idx2char[predicted_id])
      # let's break if the <EOS> token is generated
      # if idx2char[predicted_id] == EOS:
      # break #end of a sentence reached, let's stop
  return (start_string + ''.join(text_generated))

The generation method takes in a seed string that is used as the starting point for the generation. This seed string is vectorized. The actual generation happens in a loop, where one character is generated at a time and appended to the sequence generated so far. At every step, a single next character is chosen and never revisited. Always choosing the next letter with the highest probability is called greedy search. The code above draws the next character from the predicted distribution using tf.random.categorical, so it behaves greedily only to the extent that the distribution is peaked around the most likely character. A configuration parameter called temperature controls how peaked the distribution is, and can therefore be used to adjust the predictability of the generated text.

Once the model has produced scores (logits) for all characters, dividing them by the temperature changes the distribution from which the next character is drawn. Smaller values of the temperature generate text that is closer to the original text. Larger values of the temperature generate more creative text. Here, a value of 0.7 is chosen; being below 1, it sharpens the distribution toward the more likely characters while still leaving room for surprising choices.
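The effect of temperature on the distribution can be seen on a handful of made-up scores. The sketch below is plain Python rather than the notebook code; the function name softmax_with_temperature and the three logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize with softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate characters
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature falls toward zero, nearly all of the probability mass moves to the highest-scoring character, and sampling becomes indistinguishable from always picking the top choice. As it rises, the distribution flattens and less likely characters are drawn more often.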

To generate the text, all that is needed is one line of code:

print(generate_text(gen_model, start_string=u"Google"))
Google plans to release the Xbox One vs. Samsung Galaxy Gea<EOS><PAD>ote on Mother's Day 

Each execution of the command may generate slightly different results. The line generated above, while obviously nonsensical, is pretty well structured. The model has learned capitalization rules and headline structure. Normally, we would not generate text beyond the <EOS> token, but all 75 characters are generated here for the sake of understanding the model output.

Note that the output shown for text generation is indicative. You may see a different output for the same prompt. There is some inherent randomness built into this process, which we can try to control by setting random seeds. When a model is retrained, it may end up at a slightly different point on the loss surface, where even though the loss numbers look similar, there may be slight differences in the model weights. Please take the outputs presented in the entire chapter as indicative rather than exact.

Here are some other examples of seed strings and model outputs, snipped after the end-of-sentence tag:


Seed       Generated Sentence

S&P        S&P 500 closes above 190<EOS>
           S&P: Russell Slive to again find any business manufacture<EOS>
           S&P closes above 2000 for first tim<EOS>

Beyonce    Beyonce and Solange pose together for 'American Idol' contes<EOS>
           Beyonce's sister Solange rules' Dawn of the Planet of the Apes' report<EOS>
           Beyonce & Jay Z Get Married<EOS>

Note the model's use of quotes in the first two sentences for Beyonce as the seed word. The following table shows the impact of different temperature settings for similar seed words:



Generated Sentence

S&P 500 Closes Above 1900 For First Tim<EOS>
S&P Close to $5.7 Billion Deal to Buy Beats Electronic<EOS>
S&P 500 index slips to 7.2%, signaling a strong retail sale<EOS>
S&P, Ack Factors at Risk of what you see This Ma<EOS>

Kim Kardashian and Kanye West wedding photos release<EOS>
Kim Kardashian Shares Her Best And Worst Of His First Look At The Met Gala<EOS>
Kim Kardashian Wedding Dress Dress In The Works From Fia<EOS>
Kim Kardashian's en<EOS>

Generally, the quality of the text goes down at higher values of temperature. All these examples were generated by passing in the different temperature values to the generation function.

A practical application of such a character-based model is to complete words in a text messaging or email app. By default, the generate_text() method is generating 75 characters to complete the headline. It is easy to pass in much shorter lengths to see what the model proposes as the next few letters or words.

The table below shows some experiments of trying to complete the next 10 characters of text fragments. These completions were generated using:

print(generate_text(gen_model, start_string=u"Lets meet tom", 
                    temperature=0.7, num_generate=10))
Lets meet tomorrow to t



Text fragment               Completion

I need some money from ba   I need some money from bank chairma
Swimming in the p           Swimming in the profitabili
Can you give me a           Can you give me a Letter to
are you fr                  are you from around
The meeting is              The meeting is back in ex
Lets have coffee at S       Lets have coffee at Samsung hea
                            Lets have coffee at Staples stor
                            Lets have coffee at San Diego Z

Given that the dataset used was only from news headlines, it is biased toward certain types of activities. For example, the second sentence could be completed with pool instead of the model trying to fill it in with profitability. If a more general text dataset were used, then this model could do quite well at generating completions for partially typed words at the end of the sentence. However, there is one limitation that this text generation method has – the use of the greedy search algorithm.

The greedy search process is a crucial part of the text generation above. It is one of several ways to generate text. Let's take an example to understand this process. For this example, bigram frequencies were analyzed by Peter Norvig and published on http://norvig.com/mayzner.html. Over 743 billion English words were analyzed in this work. With 26 characters in an uncased model, there are theoretically 26 x 26 = 676 bigram combinations. However, the article reports that the following bigrams were never seen in roughly 2.8 trillion bigram instances: JQ, QG, QK, QY, QZ, WQ, and WZ.

The Greedy Search with Bigrams section of the notebook has code to download and process the full dataset and show the process of greedy search. After downloading the set of all n-grams, bigrams are extracted. A set of dictionaries is constructed to help look up the highest-probability next letter given a starting letter. Then, using some recursive code, a tree is constructed, picking the top three choices for the next letter. In the generation code above, only the top letter is chosen. However, the top three letters are chosen to show how greedy search works and its shortcomings.
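The lookup-and-recurse idea described above can be sketched in a few lines. This is a simplified illustration, not the notebook's code: the bigram counts are invented toy values rather than Norvig's data, and the names next_letter_probs, build_tree, and show are hypothetical.

```python
# Toy bigram counts, invented for illustration (not Norvig's data)
bigram_counts = {
    "W": {"I": 40, "A": 35, "H": 25, "E": 20},
    "I": {"N": 50, "T": 30, "S": 20, "C": 10},
    "N": {"G": 40, "D": 30, "T": 20, "E": 10},
    "T": {"H": 60, "I": 20, "O": 15, "E": 5},
    "S": {"T": 40, "E": 30, "S": 20, "H": 10},
}

def next_letter_probs(letter):
    """Turn the raw counts for `letter` into conditional probabilities."""
    counts = bigram_counts.get(letter, {})
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

def build_tree(prefix, prob, depth, top_n=3):
    """Recursively expand `prefix` with its top_n most likely next letters."""
    node = {"text": prefix, "prob": prob, "children": []}
    if depth == 0:
        return node
    probs = next_letter_probs(prefix[-1])
    best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    for letter, p in best:
        # The cumulative probability of a path is the product along its edges
        node["children"].append(build_tree(prefix + letter, prob * p, depth - 1, top_n))
    return node

def show(node, indent=0):
    """Print the tree with one level of indentation per generated letter."""
    print(" " * indent + f"{node['text']} ({node['prob']:.4f})")
    for child in node["children"]:
        show(child, indent + 2)

tree = build_tree("WI", 1.0, depth=2)
show(tree)
```

Setting top_n=1 reduces the tree to the single path that greedy search would follow.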

Using the nifty anytree Python package, a nicely formatted tree can be visualized. This tree is shown in the following figure:

Figure 5.4: Greedy search tree starting with WI


The algorithm was given the task of completing WI in a total of five characters. The preceding tree shows cumulative probabilities for a given path. More than one path is shown so that the branches not taken by greedy search can also be seen. If a three-character word was being built, the highest probability choice is WIN with a probability of 0.243, followed by WIS at 0.01128. If four-letter words are considered, then the greedy search would consider only those words that start with WIN as that was the path with the highest probability considering the first three letters. WIND has the highest probability of 0.000329 in this path. However, a quick scan across all four-letter words shows that the highest probability word should be WITH having a probability of 0.000399.

This, in essence, is the challenge of the greedy search algorithm for text generation. Higher-probability options considering joint probabilities are hidden due to optimization at each character instead of cumulative probability. Whether the text is generated a character or a word at a time, greedy search suffers from the same issue.

An alternative algorithm, called beam search, allows tracking multiple options, and pruning out the lower-probability options as generation proceeds. The tree shown in Figure 5.4 can also be seen as an illustration of tracking beams of probabilities. To see the power of this technique, a more sophisticated model for generating text would be better. The GPT-2 model, based on Generative Pre-Training (GPT) and published by OpenAI, set many benchmarks, including in open-ended text generation. This is the subject of the next half of this chapter, where the GPT-2 model is explained first. The next topic is fine-tuning a GPT-2 model for completing email messages. Beam search and other options to improve the quality of the generated text are also shown.

Generative Pre-Training (GPT-2) model

OpenAI released the first version of the GPT model in June 2018. They followed up with GPT-2 in February 2019. This paper attracted much attention as full details of the large GPT-2 model were not released with the paper due to concerns of nefarious uses. The large GPT-2 model was released subsequently in November 2019. The GPT-3 model is the most recent, released in May 2020.

Figure 5.5 shows the number of parameters in the largest of each of these models:

Figure 5.5: Parameters in different GPT models

The first model used the standard Transformer decoder architecture with twelve layers, each with twelve attention heads and 768-dimensional embeddings, for a total of approximately 110 million parameters, which is very similar to the BERT model. The largest GPT-2 has over 1.5 billion parameters, and the most recently released GPT-3 model's largest variant has over 175 billion parameters!

Cost of training language models

As the number of parameters and dataset sizes increase, the time taken for training also increases. As per a Lambda Labs article, if the GPT-3 model were to be trained on a single Nvidia V100 GPU, it would take 342 years. Using stock Microsoft Azure pricing, this would cost over $3 million. GPT-2 model training is estimated to run to $256 per hour. Assuming a similar running time as BERT, which is about four days, that would cost about $25,000. If the cost of training multiple models during research is factored in, the overall cost can easily increase ten-fold.
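The GPT-2 estimate above is simple arithmetic, reproduced here as a quick check. The variable names are illustrative; the hourly rate and the four-day duration are the figures quoted in the text.

```python
hourly_rate_usd = 256     # estimated GPT-2 training cost per hour (from the text)
training_days = 4         # assumed to match BERT's training time (from the text)

single_run = hourly_rate_usd * 24 * training_days
print(f"Single training run: ${single_run:,}")
print(f"With a ten-fold research overhead: ${single_run * 10:,}")
```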

At such costs, training these models from scratch is out of reach for individuals and even most companies. Transfer learning and the availability of pre-trained models from companies like Hugging Face make it possible for the general public to use these models.

The base architecture of GPT models uses the decoder part of the Transformer architecture. The decoder is a left-to-right language model. The BERT model, in contrast, is a bidirectional model. A left-to-right model is autoregressive, that is, it uses tokens generated thus far to generate the next token. Since it cannot see future tokens like a bi-directional model, this language model is ideal for text generation.

Figure 5.6 shows the full Transformer architecture with the encoder blocks on the left and decoder blocks on the right:

Figure 5.6: Full Transformer architecture with encoder and decoder blocks

The left side of Figure 5.6 should be familiar – it is essentially Figure 4.6 from the Transformer model section of the previous chapter. The encoder blocks shown are the same as the BERT model. The decoder blocks are very similar to the encoder blocks with a couple of notable differences.

In the encoder block, there is only one source of input: the input sequence. All of the input tokens are available for the multi-head attention to operate on. This enables the encoder to understand the context of a token from both the left and right sides.

In the decoder block, there are two inputs to each block. The outputs generated by the encoder blocks are available to all the decoder blocks and fed to the middle of the decoder block through multi-head attention and layer norms.

What is layer normalization?

Large deep neural networks are trained using the Stochastic Gradient Descent (SGD) optimizer or a variant like Adam. Training large models on big datasets can take a significant amount of time for the model to converge. Techniques such as weight normalization, batch normalization, and layer normalization are aimed at reducing training time by helping models to converge faster while also acting as a regularizer. The idea behind layer normalization is to scale the inputs of a given hidden layer with the mean and standard deviation of the inputs. First, the mean and standard deviation are computed over the inputs $a_i^l$ to the hidden units of layer $l$:

$$\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i^l - \mu^l\right)^2}$$

H denotes the number of hidden units in layer l. Inputs to the layer are normalized using the above-calculated values:

$$\bar{a}_i^l = \frac{g_i^l}{\sigma^l}\left(a_i^l - \mu^l\right)$$

where g is a gain parameter. Note that the formulation of the mean and standard deviation is not dependent on the size of the mini-batches or dataset size. Hence, this type of normalization can be used for RNNs and other sequence modeling problems.
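As a rough illustration of the computation just described, the sketch below normalizes the inputs of a single layer in plain Python. The function name layer_norm is hypothetical, a single scalar gain is used for simplicity, and the small eps term is a common safeguard against division by zero that is not part of the description above.

```python
import math

def layer_norm(inputs, gain=1.0, eps=1e-5):
    """Normalize one layer's inputs using their own mean and standard deviation."""
    h = len(inputs)                                    # number of hidden units H
    mean = sum(inputs) / h
    std = math.sqrt(sum((a - mean) ** 2 for a in inputs) / h)
    # eps is an added assumption to avoid dividing by zero for constant inputs
    return [gain * (a - mean) / (std + eps) for a in inputs]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])
```

Because the statistics come from a single example's hidden units, the result is the same regardless of batch size.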

The second input to the decoder block is its own output: the tokens generated by the decoder thus far are fed back through a masked multi-head self-attention layer, and the result is combined with the output from the encoder blocks. Masked here refers to the fact that tokens to the right of the token being generated are masked, and the decoder cannot see them. Similar to the encoder, there are several such blocks stacked on top of each other. However, the GPT architecture is only one half of the Transformer. This requires some modifications to the architecture.
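The effect of masking can be illustrated with a small sketch. It shows only the mask and the masked softmax over attention scores, not a full attention layer; the function names and the all-equal scores are assumptions made for clarity.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        exps = [math.exp(s) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical attention scores for a 3-token sequence (all equal for clarity)
scores = [[0.0] * 3 for _ in range(3)]
attention = masked_softmax(scores, causal_mask(3))
for row in attention:
    print([round(w, 2) for w in row])
```

Each row is the attention distribution for one position: the first token can attend only to itself, while the last token can attend to every token up to and including itself.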

The modified architecture for GPT is shown in Figure 5.7. Since there is no encoder block to feed the representation of the input sequence, the multi-head attention layer that attends to the encoder output is no longer required. The outputs generated by the model are recursively fed back to generate the next token.

The smallest GPT-2 model has twelve layers and 768 dimensions for each token. The largest GPT-2 model has 48 layers and 1,600 dimensions per token. To pre-train models of this size, the authors of GPT-2 needed to create a new dataset. Web pages provide a great source of text, but the text comes with quality issues. To solve this challenge, they scraped all outbound links from Reddit, which had received at least three karma points. The assumption made by the authors is that karma points are an indicator of the quality of the web page being linked. This assumption allows scraping a huge set of text data. The resulting dataset was approximately 45 million links.

To extract text from the HTML on the web pages, two Python libraries were used: Dragnet and Newspaper. After some quality checks and deduplication, the final dataset was about 8 million documents with 40 GB of text. One interesting thing that the authors did was to remove any Wikipedia documents, as they felt many of the test datasets used Wikipedia, and adding these pages would cause an overlap between test and training datasets. The pre-training objective is a standard LM training objective of predicting the next word given a set of previous words. In the notation of the original GPT paper, for a corpus of tokens $\mathcal{U} = \{u_1, \ldots, u_n\}$, a context window of size $k$, and model parameters $\Theta$:

$$L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)$$

Figure 5.7: GPT architecture (Source: Improving Language Understanding by Generative Pre-Training by Radford et al.)

During pre-training, the GPT-2 model is trained with a maximum sequence length of 1,024 tokens. A Byte Pair Encoding (BPE) algorithm is used for tokenization, with a vocabulary size of about 50,000 tokens. GPT-2 uses byte sequences rather than Unicode code points for the byte pair merges. If GPT-2 only used bytes for encoding, then the vocabulary would only be 256 tokens. On the other hand, using Unicode code points would yield a vocabulary of over 130,000 tokens. By cleverly using bytes in BPE, GPT-2 is able to keep the vocabulary size to a manageable 50,257 tokens.
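The difference between code points and bytes can be seen with plain Python strings. This sketch does not run GPT-2's tokenizer; it only shows that any text reduces to byte values in the range 0 to 255, and it uses the commonly described breakdown of the 50,257-token vocabulary (256 byte tokens, 50,000 learned merges, and one end-of-text token), which is stated here as background rather than taken from the text above.

```python
text = "naïve café ☕"

code_points = [ord(ch) for ch in text]      # one integer per Unicode character
utf8_bytes = list(text.encode("utf-8"))     # one integer per byte, each in 0..255

print(len(code_points), max(code_points))
print(len(utf8_bytes), max(utf8_bytes))

# Commonly described composition of the GPT-2 vocabulary
base_tokens, merges, special = 256, 50_000, 1
print(base_tokens + merges + special)
```

Non-ASCII characters need more than one byte, so the byte sequence is longer than the code point sequence, but the base alphabet never grows beyond 256 symbols.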

A peculiarity of the tokenizer for the original GPT model is that it converts all text to lowercase and uses the spaCy and ftfy libraries for tokenization prior to applying BPE. The ftfy library is quite useful for fixing Unicode issues. If these two are not available, then the basic BERT tokenizer is used. The GPT-2 tokenizer, in contrast, applies byte-level BPE directly and preserves case.

There are several ways to encode the inputs to solve various problems, even though the left-to-right model may seem limiting. These are shown in Figure 5.8:

Figure 5.8: Input transformations in GPT-2 for different problems (Source: Improving Language Understanding by Generative Pre-Training by Radford et al.)

The figure above shows how a pre-trained GPT-2 model can be used for a variety of tasks other than text generation. In each instance, start and end tokens are added before and after the input sequence. In all cases, a linear layer is added to the end that is trained during model fine-tuning. The major advantage being claimed is that many different types of tasks can be accomplished using the same architecture. The topmost architecture in Figure 5.8 shows how it can be used for classification. GPT-2 could be used for IMDb sentiment analysis using this approach, for example.

The second example is of textual entailment. Textual entailment is an NLP task where the relationship between two fragments of text needs to be established. The first text fragment is called a premise, and the second fragment is called the hypothesis. Different relationships can exist between the premise and hypothesis. The premise can validate or contradict the hypothesis, or they may be unrelated.

Let's say the premise is Exercising every day is an important part of a healthy lifestyle and longevity. If the hypothesis is exercise increases lifespan, then the premise entails or validates the hypothesis. Alternatively, if the hypothesis is Running has no benefits, then the premise contradicts the hypothesis. Lastly, if the hypothesis is that lifting weights can build a six-pack, then the premise neither entails nor contradicts the hypothesis. To perform entailment with GPT-2, the premise and hypothesis are concatenated with a delimiter, usually $, in between them.
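The concatenation step can be sketched as simple string formatting. The token strings <start> and <extract> are placeholders chosen for illustration, not the model's actual vocabulary entries, and entailment_input is a hypothetical helper.

```python
# Placeholder special tokens; the names are illustrative only
START, DELIM, END = "<start>", "$", "<extract>"

def entailment_input(premise, hypothesis):
    """Join premise and hypothesis into a single left-to-right sequence."""
    return f"{START} {premise} {DELIM} {hypothesis} {END}"

premise = "Exercising every day is an important part of a healthy lifestyle and longevity."
for hypothesis in ("Exercise increases lifespan.", "Running has no benefits."):
    print(entailment_input(premise, hypothesis))
```

In a real pipeline, the resulting sequence would be tokenized and passed through the model, with the added linear layer predicting the relationship.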

For text similarity, two input sequences are constructed, one with the first text sequence first and the other with the second text sequence first. The outputs from the GPT model for the two sequences are added together and fed to the linear layer. A similar approach is used for multiple-choice questions. However, our focus in this chapter is text generation.

Generating text with GPT-2

Hugging Face's transformers library simplifies the process of generating text with GPT-2. Similar to the pre-trained BERT model, as shown in the previous chapter, Hugging Face provides pre-trained GPT and GPT-2 models. These pre-trained models are used in the rest of the chapter. Code for this and the rest of the sections of this chapter can be found in the IPython notebook named text-generation-with-GPT-2.ipynb. After running the setup, scoot over to the Generating Text with GPT-2 section. A section showing the generation of text with GPT is also provided for reference. The first step in generating text is to download the pre-trained model, and its corresponding tokenizer:

from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
gpt2tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
gpt2 = TFGPT2LMHeadModel.from_pretrained("gpt2", 
                                         pad_token_id=gpt2tokenizer.eos_token_id)

This may take a few minutes as the models need to be downloaded. When loading the original GPT tokenizer, you may see a warning if spaCy and ftfy are not available in your environment. These two libraries are not mandatory for text generation. The following code can be used to generate text using a greedy search algorithm:

# encode context the generation is conditioned on
input_ids = gpt2tokenizer.encode('Robotics is the domain of ', return_tensors='tf')
# generate text until the output length 
# (which includes the context length) reaches 50
greedy_output = gpt2.generate(input_ids, max_length=50)
print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(greedy_output[0], skip_special_tokens=True))
Robotics is the domain of the United States Government.
The United States Government is the primary source of information on the use of drones in the United States.
The United States Government is the primary source of information on the use of drones

A prompt was supplied for the model to complete. The model started in a promising manner but soon resorted to repeating the same output.

Note that the output shown for text generation is indicative. You may see different outputs for the same prompt. There are a few different reasons for this. There is some inherent randomness that is built into this process, which we can try and control by setting random seeds. The models themselves may be retrained periodically by the Hugging Face team and may evolve with newer versions.

Issues with the greedy search were noted in the previous section. Beam search can be considered as an alternative. At each step of generating a token, a set of top probability tokens are kept as part of the beam instead of just the highest-probability token. The sequence with the highest overall probability is returned at the end of the generation. Figure 5.4, in the previous section with a greedy search, can be considered as the output of a beam search algorithm with a beam size of 3.
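A minimal sketch of beam search over single letters makes the difference from greedy search concrete. The next-letter probabilities are invented so that the greedy choice is not the best path; the function beam_search is a hypothetical helper and is unrelated to the Hugging Face implementation.

```python
# Toy next-letter probabilities, invented so that greedy search misses the best path
next_probs = {
    "I": {"N": 0.5, "T": 0.4, "S": 0.1},
    "N": {"D": 0.4, "G": 0.3, "E": 0.3},
    "T": {"H": 0.9, "E": 0.1},
    "S": {"H": 0.6, "E": 0.4},
}

def beam_search(start, steps, beam_size):
    """Keep the beam_size most probable sequences after every step."""
    beams = [(start, 1.0)]
    for _ in range(steps):
        candidates = []
        for text, prob in beams:
            for letter, p in next_probs.get(text[-1], {}).items():
                candidates.append((text + letter, prob * p))
        # Prune to the sequences with the highest cumulative probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

greedy = beam_search("WI", steps=2, beam_size=1)   # a beam size of 1 is greedy search
beam = beam_search("WI", steps=2, beam_size=3)
print("greedy:", greedy[0])
print("beam:  ", beam[0])
```

With a beam size of 1, the search commits to WIN after the first step and ends at WIND. With a beam size of 3, WIT survives the first pruning step, which lets the higher-probability WITH win overall.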

Generating text using beam search is trivial:

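# activate beam search and early_stopping
# (num_beams=5 is an illustrative value)
beam_output = gpt2.generate(input_ids, max_length=50,
                            num_beams=5, early_stopping=True)
print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))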
Robotics is the domain of science and technology. It is the domain of science and technology. It is the domain of science and technology. It is the domain of science and technology. It is the domain of science and technology. It is the domain

Qualitatively, the first sentence makes a lot more sense than the one generated by the greedy search. The early_stopping parameter signals generation to stop when all beams reach the EOS token. However, there is still much repetition going on. One parameter that can be used to control the repetition is by setting a limit on n-grams being repeated:

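# set no_repeat_ngram_size to 3
# (num_beams=5 is an illustrative value)
beam_output = gpt2.generate(input_ids, max_length=50, num_beams=5,
                            no_repeat_ngram_size=3, early_stopping=True)
print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))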
Robotics is the domain of science and technology.
In this article, we will look at some of the most important aspects of robotics and how they can be used to improve the lives of people around the world. We will also take a look

This has made a considerable difference in the quality of the generated text. The no_repeat_ngram_size parameter prevents the model from generating any n-gram of the configured size more than once; with a size of 3, no triplet of tokens repeats. While this improves the quality of the text here, the n-gram constraint can also hurt it. If the generated text is about The White House, then this three-word sequence can appear only once in the entire generated text. In such a case, using the n-gram constraint will be counter-productive.
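The idea behind the constraint can be sketched as follows. This is a simplified illustration that works on whole words, not the Hugging Face implementation, which operates on token IDs; banned_next_tokens is a hypothetical helper.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < n:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])   # the last n-1 tokens generated so far
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

generated = "the white house said the white".split()
print(banned_next_tokens(generated, n=3))
```

Since "the white house" has already appeared, "house" cannot follow "the white" a second time, which is exactly the counter-productive case described above.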

To beam or not to beam

Beam search works well in cases where the generated sequence is of a restricted length. As the length of the sequence increases, the number of beams to be maintained and computed increases significantly. Consequently, beam search works well in tasks like summarization and translation but performs poorly in open-ended text generation. Further, beam search, by trying to maximize the cumulative probability, generates more predictable text. The text feels less natural. The following piece of code can be used to get a feel for the various beams being generated. Just make sure that the number of beams is greater than or equal to the number of sequences to be returned:

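# Returning multiple beams (num_beams must be >= num_return_sequences)
beam_outputs = gpt2.generate(
    input_ids, max_length=50,
    num_beams=5,               # illustrative value
    no_repeat_ngram_size=3,
    num_return_sequences=3,    # three beams are printed below
    early_stopping=True)
print("Output:\n" + 50 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, gpt2tokenizer.decode(beam_output, skip_special_tokens=True)))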
0: Robotics is the domain of the U.S. Department of Homeland Security. The agency is responsible for the security of the United States and its allies, including the United Kingdom, Canada, Australia, New Zealand, and the European Union.
1: Robotics is the domain of the U.S. Department of Homeland Security. The agency is responsible for the security of the United States and its allies, including the United Kingdom, France, Germany, Italy, Japan, and the European Union.
2: Robotics is the domain of the U.S. Department of Homeland Security. The agency is responsible for the security of the United States and its allies, including the United Kingdom, Canada, Australia, New Zealand, the European Union, and the United

The text generated is very similar but differs near the end. Also, note that the temperature parameter is available to control the creativity of the generated text.

There is another method for improving the coherence and creativity of the text being generated called Top-K sampling. This is the preferred method in GPT-2 and plays an essential role in the success of GPT-2 in story generation. Before explaining how this works, let's try it out and see the output:

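# Top-K sampling
import tensorflow as tf
tf.random.set_seed(42)  # for reproducible results
beam_output = gpt2.generate(
    input_ids, max_length=50,
    do_sample=True,    # sample instead of always taking the top token
    top_k=25,          # consider only the 25 most likely tokens
    temperature=2.0)   # a high temperature; 2.0 is an illustrative value
print("Output:\n" + 50 * '-')
print(gpt2tokenizer.decode(beam_output[0], skip_special_tokens=True))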
Robotics is the domain of people with multiple careers working with robotics systems. The purpose of Robotics & Machine Learning in Science and engineering research is not necessarily different for any given research type because the results would be much more diverse.
Our team uses

The above sample was generated by selecting a high temperature value. A random seed was set to ensure repeatable results. The Top-K sampling method was published in a paper titled Hierarchical Neural Story Generation by Fan, Lewis, and Dauphin in 2018. The algorithm is relatively simple: at every step, it samples the next token from only the K highest-probability tokens. If K is set to 1, then this algorithm is identical to the greedy search.
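The sampling step itself can be sketched without any deep learning library. This is a minimal illustration under assumed inputs: top_k_sample is a hypothetical helper, and the six scores stand in for a model's output over a tiny vocabulary.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-scoring entries of `logits`."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Unnormalized softmax weights over the surviving k tokens
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Hypothetical scores over a six-token vocabulary
logits = [1.2, 3.1, 0.3, 2.8, -1.0, 2.9]
rng = random.Random(42)
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))          # only indices from the top 3 can appear
print(top_k_sample(logits, k=1))     # k=1 always returns the highest-scoring index
```

Tokens outside the top K are never chosen, no matter how many samples are drawn, which trims the long tail of unlikely tokens while still leaving room for variety.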

In the gpt2.generate() call above, the model looks at the 25 top tokens out of the 50,000+ tokens while generating text. Then, it samples one token from these, in proportion to their probabilities, and continues the generation. Choosing larger values of K will result in more surprising or creative text. Choosing lower values of K will result in more predictable text. If you are a little underwhelmed by the results thus far, that is because the prompt selected is a really tough one. Consider this output generated with Top-K of 50 for the prompt In the dark of the night, there was a:

In the dark of the night, there was a sudden appearance of light.

Sighing, Xiao Chen slowly stood up and looked at Tian Cheng standing over. He took a step to look closely at Tian Cheng's left wrist and frowned.

Lin Feng was startled, and quickly took out a long sword!

Lin Feng didn't understand what sort of sword that Long Fei had wielded in the Black and Crystal Palace!

The Black and Crystal Palace was completely different than his original Black Stone City. Long Fei carried a sword as a souvenir, which had been placed on the back of his father's arm by Tian Cheng.

He drew the sword from his dad's arm again!

The black blade was one of the most valuable weapons within the Black and Crystal Palace. The sword was just as sharp as the sharpest of all weapons, which had been placed on Long Fei's father's arm by the Black Stone City's Black Ice, for him to

The above longer form text was generated by the smallest GPT-2 model, which has roughly 124 million parameters. Several different settings and model sizes are available for you to now play with. Remember, with great power comes great responsibility.

Between the last chapter and this one, we have covered both the encoder and decoder parts of the Transformer architecture conceptually. Now, we are ready to put both parts together in the next chapter. Let's quickly review what we covered in this chapter.


Generating text is a complicated task. There are practical uses that can make typing text messages or composing emails easier. On the other hand, there are creative uses, like generating stories. In this chapter, we covered a character-based RNN model to generate headlines one character at a time and noted that it picked up the structure, capitalization, and other things quite well. Even though the model was trained on a particular dataset, it showed promise in completing short sentences and partially typed words based on the context. The next section covered the state-of-the-art GPT-2 model, which is based on the Transformer decoder architecture. The previous chapter had covered the Transformer encoder architecture, which is used by BERT.

Generating text has many knobs to tune like temperature to resample distributions, greedy search, beam search, and Top-K sampling to balance the creativity and predictability of the generated text. We saw the impact of these settings on text generation and used a pre-trained GPT-2 model provided by Hugging Face to generate text.

Now that both the encoder and decoder parts of the Transformer architecture have been covered, the next chapter will use the full Transformer to build a text summarization model. Text summarization is at the cutting edge of NLP today. We will build a model that will read news articles and summarize them in a few sentences. Onward!

