Chapter 6. Text Generation

One of the most uncanny features of Transformer-based language models is their ability to generate text that is almost indistinguishable from human-written text. A famous example is OpenAI’s GPT-2,[1] which, when given the prompt

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

was able to generate a compelling news article about talking unicorns:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez. Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them - they were so close they could touch their horns. While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English …

What makes this example so remarkable is that it was generated without any explicit supervision! By simply learning to predict the next word in millions of webpages, GPT-2 and its more powerful descendants like GPT-3 [2] are able to acquire a broad set of skills and pattern recognition abilities that can be activated with different kinds of input prompts. Figure 6-1 shows how language models are sometimes exposed to sequences of sub-tasks like addition, unscrambling words, and translation during pretraining, which allows them to transfer this knowledge effectively during fine-tuning or (if the model is large enough) inference time. These sub-tasks are not chosen ahead of time, but occur naturally in the huge corpora used to train billion-parameter language models.

Figure 6-1. During pretraining, language models are exposed to sequences of sub-tasks that can be adapted to during inference (courtesy of Tom B. Brown).

This ability of Transformers to generate realistic text has produced a diverse range of applications such as Talk To Transformer, Write With Transformer, AI Dungeon, and conversational agents like Google’s Meena, which can even tell corny jokes (Figure 6-2)!

Figure 6-2. Meena (left) telling a corny joke to a human (right) (courtesy of Daniel Adiwardana and Thang Luong)

In this chapter we’ll use GPT-2 to illustrate how text generation works for language models and explore how a recent class of Transformer architectures can be applied to one of the most challenging tasks in NLP: generating accurate summaries from long text documents like news articles or business reports.

The Challenge With Generating Coherent Text

In this book, we have focused on tackling NLP tasks via a combination of pretraining and supervised fine-tuning. For task-specific heads like sequence or token classification, generating predictions was fairly straightforward; the model produced some logits and we either took the maximum value to get the predicted class, or applied a softmax function to obtain the predicted probabilities per class. By contrast, converting the model’s probabilistic output to text requires a decoding method, which introduces a few challenges that are unique to text generation:

  • The decoding is done iteratively and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.

  • The quality and diversity of the generated text depends on the choice of decoding method and associated hyperparameters.

To understand how this decoding process works, let’s start by examining how GPT-2 is pretrained and subsequently applied to generate text.

Like other autoregressive or causal language models, GPT-2 is pretrained to estimate the probability $P(y_1, y_2, \ldots, y_t \mid \mathbf{x})$ of a sequence of tokens $y_1, y_2, \ldots, y_t$ occurring in the text, given some initial prompt or context sequence $\mathbf{x}$. Since it is impractical to acquire enough training data to estimate $P(\mathbf{y} \mid \mathbf{x})$ directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities

$$P(y_1, \ldots, y_t \mid \mathbf{x}) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, \mathbf{x}),$$

where $y_{<t}$ is a shorthand notation for the sequence $y_1, \ldots, y_{t-1}$. It is from these conditional probabilities that we pick up the intuition that autoregressive language modeling amounts to predicting each word given the preceding words in a sentence; this is exactly what the probability on the right-hand side of the preceding equation describes. Notice that this pretraining objective is quite different from BERT’s, which utilizes both past and future contexts to predict a masked token. As shown in Figure 6-3, to prevent the attention heads from peeking at future tokens, GPT-2 applies a mask to the input sentence so that tokens to the right of the current position are hidden.

Figure 6-3. Difference between the self-attention mechanisms of BERT (left) and GPT-2 (right) for three token embeddings. In the BERT case, each token embedding can attend to all other embeddings. In the GPT-2 case, token embeddings can only attend to previous embeddings in the sequence.
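
To make the masking idea concrete, here is a minimal, self-contained sketch (with random scores, not GPT-2’s actual implementation) of how a lower-triangular mask applied before the softmax prevents each position from attending to later ones:

import torch

seq_len = 3
# Toy attention scores for a three-token sequence (random, for illustration only)
scores = torch.randn(seq_len, seq_len)

# Lower-triangular mask: position t may only attend to positions <= t
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
masked_scores = scores.masked_fill(~causal_mask, float("-inf"))

# After the softmax, the masked (future) positions receive zero attention weight
attention_weights = torch.softmax(masked_scores, dim=-1)
print(attention_weights)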

By now you may have guessed how we can adapt this next-token prediction task to generate text sequences of arbitrary length. As shown in Figure 6-4, we start with a prompt like “Transformers are the” and use the model to predict the next token. Once we have determined the next token, we append it to the prompt and then use the new input sequence to generate another token. We do this until we have reached a special end of sequence (EOS) token or a predefined maximum length.

Figure 6-4. Generating text from an input sequence by adding a new word to the input at each step.
Note

Since the output sequence is conditioned on the choice of input prompt, this type of text generation is often called conditional text generation.

At the heart of this process lies a decoding method that determines which token is selected at each timestep. Since the language model head produces a logit $z_{t,i}$ per token in the vocabulary at each step, we can get the probability distribution over the next possible token $w_i$ by taking the softmax:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \operatorname{softmax}(z_{t,i}).$$

The goal of most decoding methods is to search for the most likely overall sequence by picking a $\hat{\mathbf{y}}$ such that

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}}\; P(\mathbf{y} \mid \mathbf{x}).$$

Since there does not exist an algorithm that can find the optimal decoded sequence in polynomial time, we rely on approximations instead. In this section we’ll explore a few of these approximations and gradually build up towards smarter and more complex algorithms that can be used to generate high-quality text.

Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model’s continuous output is to greedily select the token with the highest probability at each timestep:

$$\hat{y}_t = \underset{y_t}{\operatorname{argmax}}\; P(y_t \mid y_{<t}, \mathbf{x}).$$

To see how greedy search works, let’s start by loading the 1.5 billion-parameter version of GPT-2 with a language modeling head:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

Now let’s generate some text! Although Transformers provides a generate function for autoregressive models like GPT-2, let’s implement this decoding method ourselves to see what goes on under the hood. To warm up, we’ll take the same iterative approach shown in Figure 6-4 and use “Transformers are the” as the input prompt and run the decoding for eight timesteps. At each timestep, we pick out the model’s logits for the last token in the prompt and wrap them with a softmax to get a probability distribution. We then pick the next token with the highest probability, add it to the input sequence and run the process again. The following code does the job, and also stores the five most probable tokens at each timestep so we can visualize the alternatives:

import pandas as pd

input_txt = "Transformers are the"

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

pd.DataFrame(iterations)
Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5
Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%)
Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%)
Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%)
Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%)
Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%)
Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%)
Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%)
Transformers are the most popular toy line in the world | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%)

With this simple method we were able to generate the perfectly reasonable sentence “Transformers are the most popular toy line in the world” from the input prompt. We can also see explicitly the iterative nature of text generation; unlike other tasks such as sequence classification or question answering where a single forward pass suffices to generate the predictions, with text generation we need to decode the output tokens one at a time.

Implementing greedy search wasn’t too hard, but we’ll want to use the built-in generate function from Transformers to explore more sophisticated decoding methods. To reproduce our simple example, let’s make sure sampling is switched off (it’s off by default, unless the specific configuration of the checkpoint you are loading states otherwise) and set max_length to 11 tokens, which covers our three-word prompt plus the newly generated tokens:

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_length=11, do_sample=False)
print(tokenizer.decode(output[0]))
Transformers are the most popular toy line in the world

Now let’s try something a bit more interesting: can we reproduce the unicorn story from OpenAI? As we did above, we’ll encode the prompt with the tokenizer and specify a larger value for max_length to generate a longer sequence of text:

max_length = 128
input_txt = """In a shocking finding, scientist discovered 
a herd of unicorns living in a remote, previously unexplored 
valley, in the Andes Mountains. Even more surprising to the 
researchers was the fact that the unicorns spoke perfect English.


"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy[0]))
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The researchers, from the University of California, Davis, and the University of
 > Colorado, Boulder, were conducting a study on the Andean cloud forest, which
 > is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to
 > communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

Well, the first few sentences are quite different from the OpenAI example and amusingly involve different universities being credited with the discovery! We can also see one of the main drawbacks with greedy search decoding: it tends to produce repetitive output sequences, which is certainly undesirable in a news article. This is a common problem with greedy search algorithms, which can fail to give you the optimal solution; in the context of decoding, they can miss word sequences whose overall probability is higher just because high-probability words happen to be preceded by low-probability ones.

Fortunately, we can do better - let’s examine a popular method known as beam search decoding.

Note

Although greedy search decoding is rarely used for text generation tasks that require diversity, it can be useful for producing short sequences like arithmetic, where a deterministic and factually correct output is preferred.[3] For these tasks, you can condition GPT-2 by providing a few line-separated examples in the format "5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 0 =>" as the input prompt.
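
For instance, here is a rough sketch of how such a prompt could be fed to the model we loaded earlier and decoded greedily; whether the model actually completes the sum correctly is not guaranteed:

arithmetic_prompt = "5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 0 =>"
arithmetic_ids = tokenizer(arithmetic_prompt, return_tensors="pt")["input_ids"].to(device)
# Greedy decoding: generate a couple of tokens beyond the prompt
output_arithmetic = model.generate(arithmetic_ids, do_sample=False,
                                   max_length=arithmetic_ids.shape[1] + 2)
print(tokenizer.decode(output_arithmetic[0]))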

Beam Search Decoding

Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the b most likely extensions. The process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to their log-probabilities. An example of beam search is illustrated in Figure 6-5.
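
To make the bookkeeping concrete, here is a toy sketch of the algorithm in plain Python. It uses a hypothetical fake_next_token_probs() function with a made-up, fixed distribution in place of a real model:

import numpy as np

# Toy vocabulary and a fake "model" that returns fixed next-token probabilities
vocab = ["the", "nice", "car", "drives", "<eos>"]

def fake_next_token_probs(sequence):
    # In a real model these would come from a softmax over the logits
    return np.array([0.05, 0.4, 0.3, 0.2, 0.05])

def toy_beam_search(num_beams=2, max_steps=3):
    # Each beam is a (token_sequence, cumulative_log_prob) pair
    beams = [([], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            probs = fake_next_token_probs(seq)
            # Extend every beam with every possible next token
            for token, p in zip(vocab, probs):
                candidates.append((seq + [token], score + np.log(p)))
        # Keep only the b most likely partial hypotheses
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:num_beams]
    return beams

for seq, score in toy_beam_search():
    print(" ".join(seq), f"(log-prob: {score:.2f})")

The generate function in Transformers implements the same idea far more efficiently on real model outputs.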

Why do we score the sequences using log-probabilities instead of the probabilities themselves? One reason is that calculating the overall probability of a sequence $P(y_1, y_2, \ldots, y_t \mid \mathbf{x})$ involves calculating a product of conditional probabilities $P(y_t \mid y_{<t}, \mathbf{x})$. Since each conditional probability is typically a small number in the range $[0, 1]$, their product can easily underflow. For example, suppose we have a sequence of $t = 1024$ tokens and generously assume that the probability for each token is 0.5. The overall probability for this sequence is an extremely small number

0.5 ** 1024
5.562684646268003e-309

which leads to numerical instability as we run into underflow. We can avoid this by calculating a related term: the log-probability. If we apply the logarithm to the joint and conditional probabilities, then with the help of the product rule for logarithms we get:

$$\log P(y_1, \ldots, y_t \mid \mathbf{x}) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, \mathbf{x}).$$

In other words, the product of probabilities on the right-hand side of the equation becomes a sum of log-probabilities, which is much less prone to numerical underflow. For example, calculating the log-probability of the same example as before gives

import numpy as np

sum([np.log(0.5)] * 1024)
-709.7827128933695

This is a number we can easily deal with, and the approach also works for much smaller probabilities. Since we only want to compare relative probabilities, we can work directly with log-probabilities.

Let’s calculate and compare the log-probabilities of the text generated by greedy and beam search to see if beam search can improve the overall probability. Since Transformer models return the unnormalized logits for the next token given the input tokens, we first need to normalize the logits to create a probability distribution over the whole vocabulary for each token in the sequence. We then need to select only the probabilities of the tokens that are actually present in the sequence. The following function implements these steps:

import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    # Normalize the logits into log-probabilities over the vocabulary
    logp = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability of each label token at its position
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

This gives us the log-probability for a single token, so to get the total log-probability of a sequence we just need to sum the log-probabilities for each token:

def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        # Align logits and labels: the logit at position t predicts token t+1
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        # Sum only over the generated tokens, skipping the input prompt
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()

Note that we ignore the log-probabilities of the input sequence since they were not generated by the model. We can also see that it is important to align the logits and the labels: since the model predicts the next token, we do not get a logit for the first label, and we don’t need the last logit since we don’t have a ground truth token for it.
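
As a quick sanity check of this alignment, we can run the function on some placeholder tensors; the values below are random and purely illustrative:

# Suppose the model saw 5 tokens; it returns one logit vector per position
dummy_logits = torch.randn(1, 5, 10)            # (batch, seq_len, vocab_size)
dummy_labels = torch.randint(0, 10, (1, 5))     # (batch, seq_len)

# The logit at position t predicts the token at position t+1, so we drop the
# last logit and the first label to line them up
aligned = log_probs_from_logits(dummy_logits[:, :-1, :], dummy_labels[:, 1:])
print(aligned.shape)  # torch.Size([1, 4])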

Let’s use these functions to first calculate the sequence log-probability of the greedy decoder on the OpenAI prompt:

logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The researchers, from the University of California, Davis, and the University of
 > Colorado, Boulder, were conducting a study on the Andean cloud forest, which
 > is home to the rare species of cloud forest trees.


The researchers were surprised to find that the unicorns were able to
 > communicate with each other, and even with humans.


The researchers were surprised to find that the unicorns were able

log-prob: -87.43

Now let’s compare this to a sequence that is generated with beam search. To activate beam search with the generate function we just need to specify the number of beams with the num_beams parameter. The more beams we choose, the better the result potentially gets; however, the generation process becomes much slower since we generate parallel sequences for each beam:

output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The discovery of the unicorns was made by a team of scientists from the
 > University of California, Santa Cruz, and the National Geographic Society.


The scientists were conducting a study of the Andes Mountains when they
 > discovered a herd of unicorns living in a remote, previously unexplored
 > valley, in the Andes Mountains. Even more surprising to the researchers was
 > the fact that the unicorns spoke perfect English

log-prob: -55.23

We can see that we get a better log-probability (higher is better) with beam search than we did with simple greedy decoding. However, beam search also suffers from repetitive text. One way to address this is to impose an n-gram penalty with the no_repeat_ngram_size parameter, which tracks which n-grams have been seen and sets the next-token probability to zero if it would produce a previously seen n-gram:

output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The discovery was made by a team of scientists from the University of
 > California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the
 > area when they came across the herd. They were surprised to find that they
 > were able to converse with the animals in English, even though they had never
 > seen a unicorn in person before. The researchers were

log-prob: -93.12

This is not too bad! We’ve managed to stop the repetitions, and we can see that despite producing a lower score, the text remains coherent. Beam search with an n-gram penalty is a good way to trade off focusing on high-probability tokens (with beam search) against reducing repetitions (with the n-gram penalty), and it is commonly used in applications such as summarization or machine translation where factual correctness is important. When factual correctness is less important than the diversity of the generated output, for instance in open-domain chitchat or story generation, another alternative to reduce repetitions while improving diversity is to use sampling instead of greedy decoding or beam search.

Let’s thus round out our exploration of text generation by examining a few of the most common sampling methods.

Sampling Methods

The simplest sampling method is to randomly sample from the model output’s probability distribution over the full vocabulary at each timestep:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})},$$

where $|V|$ denotes the cardinality of the vocabulary. We can easily control the diversity of the output by adding a temperature parameter $T$ that rescales the logits before taking the softmax:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{|V|} \exp(z_{t,j}/T)}.$$

With temperature we can control the shape of the probability distribution, and if you remember your high-school physics, you may recognize that this equation bears a striking similarity to the Boltzmann distribution, which describes the probability $p_i$ that a system will be in an energy state $E_i$ as a function of temperature $T$ and a constant $k$:

$$p_i = \frac{\exp(-E_i / kT)}{\sum_j \exp(-E_j / kT)}.$$

In the limit of very low temperatures, only the lowest energy state is occupied; in other words, $p_0 = 1$. In the opposite limit of high temperatures, each energy state is equally likely ($p_i = p_j$).

Now, if we replace energy with our model’s output logits we can adapt the concept of temperature: low temperature means that the tokens with high probability get boosted while the probabilities of less likely tokens get damped. In other words the distribution becomes much sharper. When we increase the temperature the distribution smooths out and the probabilities get closer to each other. The effect of temperature on token probabilities is shown in Figure 6-6.

Figure 6-6. Token probabilities as a function of temperature
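
We can reproduce the qualitative behavior shown in Figure 6-6 on a handful of made-up logits:

import torch.nn.functional as F

toy_logits = torch.tensor([5.0, 3.0, 1.0, 0.5])
for T in [0.5, 1.0, 2.0]:
    # Low T sharpens the distribution, high T flattens it
    scaled_probs = F.softmax(toy_logits / T, dim=-1)
    print(f"T={T}: {scaled_probs.numpy().round(3)}")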

To see how we can use temperature to influence the generated text, let’s sample with T=2 by setting the temperature parameter in the generate function:

torch.manual_seed(42)
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


While the station aren protagonist receive Pengala nostalgiates tidbitRegarding
 > Jenny loclonju AgreementCON irrational �rite Continent seaf A jer Turner
 > Dorbecue WILL Pumpkin mere Thatvernuildagain YoAniamond disse *
 > Runewitingkusstemprop});b zo coachinginventorymodules deflation press
 > Vaticanpres Wrestling chargesThingsctureddong Ty physician PET KimBi66 graz
 > Oz at aff da temporou MD6 radi iter

We can clearly see that a high temperature has produced mostly gibberish; by accentuating the rare tokens, we’ve caused the model to create strange grammar and quite a few made-up words! Let’s see what happens if we cool down the temperature:

torch.manual_seed(42)
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The scientists were searching for the source of the mysterious sound, which was
 > making the animals laugh and cry.


The unicorns were living in a remote valley in the Andes mountains

'When we first heard the noise of the animals, we thought it was a lion or a
 > tiger,' said Luis Guzman, a researcher from the University of Buenos Aires,
 > Argentina.


'But when

This is significantly more coherent and even includes a quote from yet another university being credited with the discovery! The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there’s always a trade-off between coherence (low temperature) and diversity (high temperature) that one has to tune to the use-case at hand.

Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature, but within a more limited range that excludes words that would be too strange in the given context (i.e., low-probability words). There are two main ways to do this: top-k and nucleus (or top-p) sampling. Let’s take a look.

Top-k and Nucleus Sampling

Top-k and nucleus (top-p) sampling are two popular alternatives or extensions to using temperature. In both cases the basic idea is to restrict the number of possible tokens we can sample from at each timestep. To see how this works, let’s first visualize the cumulative probability distribution of the model’s outputs at T=1:

torch.manual_seed(42)
with torch.no_grad():
    output = model(input_ids=input_ids)
    next_token_logits = output.logits[:, -1, :]
    probs = F.softmax(next_token_logits, dim=-1).detach().cpu().numpy()
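
The two plots discussed next (a histogram of the token probabilities and the cumulative probability of the tokens sorted by rank) can be produced, for example, with Matplotlib. The following plotting code is a sketch and assumes Matplotlib is installed:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))
# Left: histogram of the next-token probabilities on logarithmic axes
axes[0].hist(probs[0], bins=np.logspace(-10, -1, 100))
axes[0].set_xscale("log")
axes[0].set_yscale("log")
axes[0].set_xlabel("Token probability")
axes[0].set_ylabel("Number of tokens")
# Right: cumulative probability of the 10,000 most likely tokens
sorted_probs = np.sort(probs[0])[::-1]
axes[1].plot(np.cumsum(sorted_probs[:10000]))
axes[1].set_xlabel("Token rank")
axes[1].set_ylabel("Cumulative probability")
plt.tight_layout()
plt.show()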

Let’s tease apart these plots since they contain a lot of information. In the left plot we can see a histogram of the token probabilities. It has a peak around $10^{-8}$ and a second, smaller peak around $10^{-4}$, followed by a sharp drop, with just a handful of tokens occurring with a probability between $10^{-2}$ and $10^{-1}$. Looking at this diagram, we can see that the probability of picking the single most likely token (the isolated bar at $10^{-1}$) is about 1 in 10.

In the right plot we ordered the tokens by descending probability and calculated the cumulative sum of the first 10,000 tokens (in total there are 50,257 tokens in GPT-2’s vocabulary). The way to read the graph is that the line represents the probability of picking any of the preceding tokens. For example, there is roughly a 96% chance of picking any of the 1,000 tokens with the highest probability. We see that the probability rises quickly above 90% but only approaches 100% after several thousand tokens. The plot also shows that there is about a 1 in 100 chance of picking a token that is not even among the top 2,000.

Although these numbers might appear small at first sight, they matter because when generating text we sample once per generated token, so over a long passage we sample many times. Even if there is only a 1 in 100 or 1 in 1,000 chance of picking a very unlikely token at any given step, sampling hundreds of times makes it quite likely to happen at some point, and picking such tokens can badly influence the quality of the generated text. For this reason we generally want to avoid these very unlikely tokens. This is where top-k and top-p sampling come into play.

The idea behind top-k sampling is to avoid the low-probability choices by only sampling from the k tokens with the highest probability. This puts a fixed cut on the long tail of the distribution and ensures that we only sample from likely choices. Going back to the cumulative probabilities, top-k sampling is equivalent to defining a vertical line on the cumulative-probability plot and sampling only from the tokens to its left. Again, the generate function provides an easy way to achieve this with the top_k argument:

torch.manual_seed(42)
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50)
print(tokenizer.decode(output_topk[0]))
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The wild unicorns roam the Andes Mountains in the region of Cajamarca, on the
 > border with Argentina (Picture: Alamy/Ecole Nationale Supérieure d'Histoire
 > Naturelle)

The researchers came across about 50 of the animals in the valley. They had
 > lived in such a remote and isolated area at that location for nearly a
 > thousand years that

This is arguably the most human-looking text we’ve generated so far. But how do we choose k? The value of k is chosen manually and is the same for each step in the sequence, independent of the actual output distribution. We can find a good value for k by looking at some text quality metrics, which we will explore later in this chapter, but that fixed cutoff might not be very satisfactory.

Instead of defining a fixed cutoff, nucleus (top-p) sampling uses a dynamic one: rather than choosing a fixed number of tokens, we set a condition for when to cut off, namely when a certain probability mass in the selection is reached. Let’s say we set that value to 95%. We then order all tokens by probability and add one token after another, starting from the top of the list, until the cumulative probability of the selected tokens reaches 95%. Depending on the output distribution, this could be just one (very likely) token or a hundred (more equally likely) tokens. There is again a visual interpretation of nucleus sampling: the value of p defines a horizontal line on the cumulative-probability plot, and we sample only from the tokens below the line. At this point you are probably not surprised that the generate function also provides an argument to activate top-p sampling:

torch.manual_seed(42)
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90)
print(tokenizer.decode(output_topp[0]))
In a shocking finding, scientist discovered a herd of unicorns living in a
 > remote, previously unexplored valley, in the Andes Mountains. Even more
 > surprising to the researchers was the fact that the unicorns spoke perfect
 > English.


The scientists studied the DNA of the animals and came to the conclusion that
 > the herd are descendants of a prehistoric herd that lived in Argentina about
 > 50,000 years ago.


According to the scientific analysis, the first humans who migrated to South
 > America migrated into the Andes Mountains from South Africa and Australia,
 > after the last ice age had ended.


Since their migration, the animals have been adapting to

Top-p sampling has also produced a coherent story, this time with a new twist about migrations from Australia to South America. You can also combine the two approaches to get the best of both worlds: setting top_k=50 and top_p=0.9 corresponds to the rule of choosing tokens with a probability mass of 90%, but from a pool of at most 50 tokens.
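
As a quick sketch (reusing the same prompt and random seed as before; the exact output will vary), combining both arguments in a single call looks like this:

torch.manual_seed(42)
# Restrict sampling to the top 50 tokens AND a 90% probability mass
output_topk_topp = model.generate(input_ids, max_length=max_length,
                                  do_sample=True, top_k=50, top_p=0.90)
print(tokenizer.decode(output_topk_topp[0]))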

Which Decoding Method is Best?

Unfortunately, there is no universally “best” decoding method. Which approach is best will depend on the nature of the task you are generating text for. If you want your model to perform a precise task like arithmetic or providing an answer to a specific question, then you should lower the temperature or use deterministic methods like greedy or beam search to guarantee getting the most likely answer. If you want the model to generate longer text and even be a bit creative, then you should switch to sampling methods and control the temperature or use a mix of top-k and nucleus sampling.

Conclusion

In this chapter we looked at text generation, which is a very different task from the NLU tasks we encountered previously. Generating text requires at least one forward pass per generated token, and even more if we use beam search. This makes text generation computationally demanding, and one needs the right infrastructure to run it at scale. In addition, a good decoding strategy that transforms the model’s output probabilities into discrete tokens can improve the text quality. Finding the best decoding strategy requires some experimentation and a careful evaluation of the generated texts.

In practice, however, we don’t want to make these decisions based on gut feeling alone! Like for other NLP tasks, we should choose a model performance metric that reflects the problem we want to solve. Unsurprisingly, there is also a wide range of choices here, and we will encounter the most common ones in the next chapter, where we have a look at how to train and evaluate a model for text summarization. If you can’t wait to learn how to train a GPT-type model from scratch, you can skip right to [Link to Come], where we collect a large dataset of code and then train an autoregressive language model on it.

[1] “Language Models Are Unsupervised Multitask Learners,” A. Radford et al. (2019)

[2] “Language Models Are Few-Shot Learners,” T.B. Brown et al. (2020)

[3] “CTRL: A Conditional Transformer Language Model for Controllable Generation,” N. Keskar et al. (2019)
