One of the most uncanny features of Transformer-based language models is their ability to generate text that is almost indistinguishable from human-written text. A famous example is OpenAI’s GPT-2, which when given the prompt
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
was able to generate a compelling news article about talking unicorns:
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez. Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them - they were so close they could touch their horns. While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English …
What makes this example so remarkable is that it was generated without any explicit supervision! By simply learning to predict the next word in millions of webpages, GPT-2 and its more powerful descendants like GPT-3 are able to acquire a broad set of skills and pattern recognition abilities that can be activated with different kinds of input prompts. Figure 6-1 shows how language models are sometimes exposed to sequences of sub-tasks like addition, unscrambling words, and translation during pretraining, which allows them to transfer this knowledge effectively during fine-tuning or (if the model is large enough) inference time. These sub-tasks are not chosen ahead of time, but occur naturally in the huge corpora used to train billion-parameter language models.
This ability of Transformers to generate realistic text has produced a diverse range of applications like Talk To Transformer, Write With Transformer, AI Dungeon, and conversational agents like Google’s Meena that can even tell corny jokes (Figure 6-2)!
In this chapter we’ll use GPT-2 to illustrate how text generation works for language models and explore how a recent class of Transformer architectures can be applied to one of the most challenging tasks in NLP: generating accurate summaries from long text documents like news articles or business reports.
In this book, we have focused on tackling NLP tasks via a combination of pretraining and supervised fine-tuning. For task-specific heads like sequence or token classification, generating predictions was fairly straightforward; the model produced some logits and we either took the maximum value to get the predicted class, or applied a softmax function to obtain the predicted probabilities per class. By contrast, converting the model’s probabilistic output to text requires a decoding method which introduces a few challenges that are unique to text generation:
The decoding is done iteratively and thus involves significantly more compute than simply passing inputs once through the forward pass of a model.
The quality and diversity of the generated text depends on the choice of decoding method and associated hyperparameters.
To understand how this decoding process works, let’s start by examining how GPT-2 is pretrained and subsequently applied to generate text.
Like other autoregressive or causal language models, GPT-2 is pretrained to estimate the probability $P(\mathbf{y} \mid \mathbf{x})$ of a sequence of tokens $\mathbf{y} = y_1, y_2, \ldots, y_N$ occurring in the text, given some initial prompt or context sequence $\mathbf{x} = x_1, x_2, \ldots, x_k$. Since it is impractical to acquire enough training data to estimate $P(\mathbf{y} \mid \mathbf{x})$ directly, it is common to use the chain rule of probability to factorize it as a product of conditional probabilities

$$P(y_1, \ldots, y_N \mid \mathbf{x}) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, \mathbf{x}),$$

where $y_{<t}$ is shorthand for the sequence $y_1, \ldots, y_{t-1}$. It is from these conditional probabilities that we pick up the intuition that autoregressive language modeling amounts to predicting each token given the tokens that precede it; this is exactly the next-token prediction task used during pretraining.
By now you may have guessed how we can adapt this next-token prediction task to generate text sequences of arbitrary length. As shown in Figure 6-4, we start with a prompt like “Transformers are the” and use the model to predict the next token. Once we have determined the next token, we append it to the prompt and then use the new input sequence to generate another token. We do this until we have reached a special end of sequence (EOS) token or a predefined maximum length.
Since the output sequence is conditioned on the choice of input prompt, this type of text generation is often called conditional text generation.
At the heart of this process lies a decoding method that determines which token is selected at each timestep. Since the language model head produces a logit $z_{t,i}$ per token in the vocabulary at each step, we can get the probability distribution over the next possible token $w_i$ by taking the softmax:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \operatorname{softmax}(z_{t,i})$$

The goal of most decoding methods is to search for the most likely overall sequence by picking a $\hat{\mathbf{y}}$ such that

$$\hat{\mathbf{y}} = \underset{\mathbf{y}}{\operatorname{argmax}}\; P(\mathbf{y} \mid \mathbf{x})$$
Since there does not exist an algorithm that can find the optimal decoded sequence in polynomial time, we rely on approximations instead. In this section we’ll explore a few of these approximations and gradually build up towards smarter and more complex algorithms that can be used to generate high quality texts.
The simplest decoding method to get discrete tokens from a model’s continuous output is to greedily select the token with the highest probability at each timestep:

$$\hat{y}_t = \underset{y_t}{\operatorname{argmax}}\; P(y_t \mid y_{<t}, \mathbf{x})$$
To see how greedy search works, let’s start by loading the 1.5 billion-parameter version of GPT-2 with a language modeling head:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
```
Now let’s generate some text! Although Transformers provides
a generate
function for autoregressive models like GPT-2,
let’s implement this decoding method ourselves to see what
goes on under the hood. To warm up, we’ll take the same
iterative approach shown in Figure 6-4 and use
“Transformers are the” as the input prompt and run the decoding for
eight timesteps. At each timestep, we pick out the model’s
logits for the last token in the prompt and wrap them with a softmax to
get a probability distribution. We then pick the next token with the
highest probability, add it to the input sequence and run the process
again. The following code does the job, and also stores the five most
probable tokens at each timestep so we can visualize the alternatives:
```python
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)
        # Select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        # Store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        # Append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)

# Display the stored choices as a table
pd.DataFrame.from_records(iterations)
```
| Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|
| Transformers are the | most (8.53%) | only (4.96%) | best (4.65%) | Transformers (4.37%) | ultimate (2.16%) |
| Transformers are the most | popular (16.78%) | powerful (5.37%) | common (4.96%) | famous (3.72%) | successful (3.20%) |
| Transformers are the most popular | toy (10.63%) | toys (7.23%) | Transformers (6.60%) | of (5.46%) | and (3.76%) |
| Transformers are the most popular toy | line (34.38%) | in (18.20%) | of (11.71%) | brand (6.10%) | line (2.69%) |
| Transformers are the most popular toy line | in (46.28%) | of (15.09%) | , (4.94%) | on (4.40%) | ever (2.72%) |
| Transformers are the most popular toy line in | the (65.99%) | history (12.42%) | America (6.91%) | Japan (2.44%) | North (1.40%) |
| Transformers are the most popular toy line in the | world (69.26%) | United (4.55%) | history (4.29%) | US (4.23%) | U (2.30%) |
| Transformers are the most popular toy line in the world | , (39.73%) | . (30.64%) | and (9.87%) | with (2.32%) | today (1.74%) |
With this simple method we were able to generate the perfectly reasonable sentence “Transformers are the most popular toy line in the world” from the input prompt. We can also see explicitly the iterative nature of text generation; unlike other tasks such as sequence classification or question answering where a single forward pass suffices to generate the predictions, with text generation we need to decode the output tokens one at a time.
Implementing greedy search wasn’t too hard, but
we’ll want to use the built-in generate
function from
Transformers to explore more sophisticated decoding methods. To
reproduce our simple example, let’s make sure sampling is
switched off (it’s off by default, unless the specific
configuration of the model you are loading the checkpoint from states
otherwise) and set max_length to eleven tokens, since our
prompt already contains three words:
```python
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_length=11, do_sample=False)
print(tokenizer.decode(output[0]))
```
Transformers are the most popular toy line in the world
Now let’s try something a bit more interesting: can we
reproduce the unicorn story from OpenAI? As we did above,
we’ll encode the prompt with the tokenizer and specify a
larger value for max_length
to generate a longer sequence of text:
```python
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy[0]))
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.

The researchers were surprised to find that the unicorns were able
Well, the first few sentences are quite different from the OpenAI example and amusingly involve different universities being credited with the discovery! We can also see one of the main drawbacks of greedy search decoding: it tends to produce repetitive output sequences, which is certainly undesirable in a news article. This is a common problem with greedy search algorithms, which can fail to give you the optimal solution; in the context of decoding, they can miss word sequences whose overall probability is higher just because high-probability words happen to be preceded by low-probability ones.
Fortunately, we can do better - let’s examine a popular method known as beam search decoding.
Although greedy search decoding is rarely used for text generation tasks that require diversity, it can be useful for producing short sequences such as arithmetic, where a deterministic and factually correct output is preferred. For these tasks, you can condition GPT-2 by providing a few line-separated examples in the format "5 + 8 => 13
7 + 2 => 9
1 + 0 =>"
as the input prompt.
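As a quick sketch of this idea, the few-shot prompt can be fed through greedy decoding with the model and tokenizer loaded earlier; the variable names here are just illustrative and the exact completion depends on the checkpoint:

```python
# Sketch: few-shot arithmetic prompt with greedy decoding
# (illustrative names; the completion depends on the checkpoint)
prompt = "5 + 8 => 13\n7 + 2 => 9\n1 + 0 =>"
prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
output_arithmetic = model.generate(
    prompt_ids, max_length=prompt_ids.shape[1] + 2, do_sample=False)
print(tokenizer.decode(output_arithmetic[0]))
```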
Instead of decoding the token with the highest probability at each step, beam search keeps track of the top-b most probable next tokens, where b is referred to as the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the b most likely extensions. The process is repeated until we reach the maximum length or an EOS token, and the most likely sequence is selected by ranking the b beams according to their log-probabilities.
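To make the bookkeeping concrete, here is a minimal sketch of this procedure on top of the model and tokenizer loaded earlier (the function and variable names are ours). It scores each partial sequence by the sum of its token log-probabilities, a choice the next paragraphs motivate, and it ignores EOS handling and length normalization, so it is an illustration rather than the implementation used by the generate function:

```python
import torch
import torch.nn.functional as F

def toy_beam_search(input_ids, n_steps=8, num_beams=3):
    # Each beam is a (token ids, cumulative log-probability) pair
    beams = [(input_ids, 0.0)]
    for _ in range(n_steps):
        candidates = []
        for ids, score in beams:
            with torch.no_grad():
                logits = model(input_ids=ids).logits[0, -1, :]
            logp = F.log_softmax(logits, dim=-1)
            # Extend each beam with its most likely next tokens
            top_logp, top_ids = torch.topk(logp, num_beams)
            for token_logp, token_id in zip(top_logp, top_ids):
                new_ids = torch.cat([ids, token_id.view(1, 1)], dim=-1)
                candidates.append((new_ids, score + token_logp.item()))
        # Keep only the highest-scoring partial sequences
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:num_beams]
    return beams[0]

best_ids, best_logp = toy_beam_search(
    tokenizer("Transformers are the", return_tensors="pt")["input_ids"].to(device))
print(tokenizer.decode(best_ids[0]), f"(log-prob: {best_logp:.2f})")
```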
Why do we score the sequences using log-probabilities instead of the probabilities themselves? One reason is that calculating the overall probability of a sequence involves taking a product of conditional probabilities $P(y_t \mid y_{<t}, \mathbf{x})$, each of which is a small number between 0 and 1. Multiplying many such numbers together quickly produces a value that is too small to represent. For example, a sequence of 1,024 tokens in which each token generously has probability 0.5 gives

```python
0.5 ** 1024
```

5.562684646268003e-309
which leads to numerical instability as we run into underflow. We can avoid this by calculating a related term: the log-probability. If we apply the logarithm to the joint and conditional probabilities, then with the help of the product rule for logarithms we get:

$$\log P(y_1, \ldots, y_N \mid \mathbf{x}) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, \mathbf{x})$$
In other words, the product of probabilities on the right-hand side of the equation becomes a sum of log-probabilities, which is much less likely to run into numerical problems since each term is a moderately sized negative number. For example, calculating the log-probability of the same example as before gives
```python
import numpy as np

sum([np.log(0.5)] * 1024)
```
-709.7827128933695
This is a number we can easily deal with, and the approach still works for much smaller numbers. Since we only want to compare relative probabilities, we can work directly with log-probabilities.
Let’s calculate and compare the log-probabilities of the text generated by greedy and beam search to see if beam search can improve the overall probability. Since Transformer models return the unnormalized logits for the next token given the input tokens, we first need to normalize the logits to create a probability distribution over the whole vocabulary for each token in the sequence. We then need to select only the probabilities of the tokens that actually appear in the sequence. The following function implements these steps:
```python
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label
```
This gives us the log-probability for a single token, so to get the total log-probability of a sequence we just need to sum the log-probabilities for each token:
```python
def sequence_logprob(model, labels, input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:])
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()
```
Note that we ignore the log-probabilities of the input sequence since they were not generated by the model. We can also see that it is important to align the logits and the labels: since the model predicts the next token, we do not get a logit for the first label, and we don’t need the last logit because there is no ground-truth token for it.
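As a quick sanity check of that alignment, here is a tiny sketch with toy tensors (the shapes and names are purely illustrative): five labels yield only four aligned (logit, label) pairs after shifting.

```python
# Toy example: 5 tokens -> 5 logits, but only 4 aligned (logit, label) pairs
toy_labels = torch.tensor([[10, 11, 12, 13, 14]])
toy_logits = torch.randn(1, 5, 50257)  # (batch, seq_len, vocab_size)
aligned = log_probs_from_logits(toy_logits[:, :-1, :], toy_labels[:, 1:])
print(aligned.shape)  # torch.Size([1, 4])
```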
Let’s use these functions to first calculate the sequence log-probability of the greedy decoder on the OpenAI prompt:
```python
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The researchers, from the University of California, Davis, and the University of Colorado, Boulder, were conducting a study on the Andean cloud forest, which is home to the rare species of cloud forest trees.

The researchers were surprised to find that the unicorns were able to communicate with each other, and even with humans.

The researchers were surprised to find that the unicorns were able

log-prob: -87.43
Now let’s compare this to a sequence that is generated with
beam search. To activate beam search with the generate
function we
just need to specify the number of beams with the num_beams
parameter.
The more beams we choose, the better the result potentially gets; however, the generation process becomes much slower since we generate parallel sequences for each beam:
```python
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery of the unicorns was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

The scientists were conducting a study of the Andes Mountains when they discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English

log-prob: -55.23
We can see that we get a better log-probability (higher is better) with beam search than we did with simple greedy decoding. However, beam search also suffers from repetitive text. One way to address this is to impose an n-gram penalty with the no_repeat_ngram_size parameter, which tracks which n-grams have been seen and sets the probability of any next token that would produce a previously seen n-gram to zero:
```python
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5,
                             do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The discovery was made by a team of scientists from the University of California, Santa Cruz, and the National Geographic Society.

According to a press release, the scientists were conducting a survey of the area when they came across the herd. They were surprised to find that they were able to converse with the animals in English, even though they had never seen a unicorn in person before. The researchers were

log-prob: -93.12
This is not too bad! We’ve managed to stop the repetitions, and we can see that despite producing a lower score, the text remains coherent. Beam search with an n-gram penalty offers a good trade-off between focusing on high-probability tokens (with beam search) and reducing repetitions (with the n-gram penalty), and it is commonly used in applications such as summarization or machine translation where factual correctness is important. When diversity matters more than strict factual correctness, for instance in open-ended story generation or dialogue, an alternative is to sample from the model’s distribution instead of searching it.
Let’s thus round out our exploration of text generation by examining a few of the most common sampling methods.
The simplest sampling method is to randomly sample from the model output’s probability distribution over the full vocabulary at each timestep:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \operatorname{softmax}(z_{t,i}) = \frac{\exp(z_{t,i})}{\sum_{j=1}^{|V|} \exp(z_{t,j})}$$

where $|V|$ denotes the cardinality of the vocabulary. We can easily control the diversity of the output by adding a temperature parameter $T$ that rescales the logits before taking the softmax:

$$P(y_t = w_i \mid y_{<t}, \mathbf{x}) = \frac{\exp(z_{t,i}/T)}{\sum_{j=1}^{|V|} \exp(z_{t,j}/T)}$$
With the temperature we can control the shape of the probability distribution, and if you remember your high-school physics, you may recognize that this equation bears a striking similarity to the Boltzmann distribution, which describes the probability $p_i$ that a system occupies a state $i$ with energy $E_i$ at temperature $T$:

$$p_i = \frac{\exp(-E_i / k_B T)}{\sum_j \exp(-E_j / k_B T)}$$

In the limit of very low temperature, only the lowest-energy state is occupied; in other words, $p_i$ approaches 1 for that state and 0 for all others. At very high temperature, every state becomes almost equally likely.
Now, if we replace energy with our model’s output logits we can adapt the concept of temperature: low temperature means that the tokens with high probability get boosted while the probabilities of less likely tokens get damped. In other words the distribution becomes much sharper. When we increase the temperature the distribution smooths out and the probabilities get closer to each other. The effect of temperature on token probabilities is shown in Figure 6-6.
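To make this concrete, here is a small sketch (using toy logits rather than the model’s, and NumPy only; softmax_with_temperature is our own helper name) of how rescaling the logits by T sharpens or flattens the resulting distribution:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    scaled = logits / T
    scaled = scaled - scaled.max()  # subtract max for numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

toy_logits = np.array([3.0, 2.0, 1.0, 0.5])
for T in [0.5, 1.0, 2.0]:
    print(T, softmax_with_temperature(toy_logits, T).round(3))
# Low T concentrates mass on the top token; high T spreads it out
```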
To see how we can use temperature to influence the generated text, let’s sample with $T=2.0$ by setting the temperature parameter in the generate function:
```python
torch.manual_seed(42)
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

While the station aren protagonist receive Pengala nostalgiates tidbitRegarding Jenny loclonju AgreementCON irrational �rite Continent seaf A jer Turner Dorbecue WILL Pumpkin mere Thatvernuildagain YoAniamond disse * Runewitingkusstemprop});b zo coachinginventorymodules deflation press Vaticanpres Wrestling chargesThingsctureddong Ty physician PET KimBi66 graz Oz at aff da temporou MD6 radi iter
We can clearly see that a high temperature has produced mostly gibberish; by accentuating the rare tokens, we’ve caused the model to create strange grammar and quite a few made-up words! Let’s see what happens if we cool down the temperature:
```python
torch.manual_seed(42)
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The scientists were searching for the source of the mysterious sound, which was making the animals laugh and cry.

The unicorns were living in a remote valley in the Andes mountains

'When we first heard the noise of the animals, we thought it was a lion or a tiger,' said Luis Guzman, a researcher from the University of Buenos Aires, Argentina.

'But when
This is significantly more coherent and even includes a quote from yet another university being credited with the discovery! The main lesson we can draw from temperature is that it allows us to control the quality of the samples, but there’s always a trade-off between coherence (low temperature) and diversity (high temperature) that one has to tune to the use-case at hand.
Another way to adjust the trade-off between coherence and diversity is to truncate the distribution of the vocabulary. This allows us to adjust the diversity freely with the temperature, but in a more limited range that excludes words which would be too strange in the context, i.e. low probability words. There are two main ways to do this: top-k and nucleus (or top-p) sampling. Let’s take a look.
Top-k and nucleus (top-p) sampling both restrict the set of tokens we can sample from at each timestep, rather than only reshaping the distribution with the temperature. To see how this works, let’s first compute the model’s probability distribution over the next token at $T=1$ so we can examine it:
```python
torch.manual_seed(42)
with torch.no_grad():
    output = model(input_ids=input_ids)
    next_token_logits = output.logits[:, -1, :]
    probs = F.softmax(next_token_logits, dim=-1).detach().cpu().numpy()
```
Let’s tease apart these plots since they contain a lot of information. In the left plot we can see a histogram of the token probabilities: the vast majority of the vocabulary sits at vanishingly small probabilities, and only a handful of tokens have an appreciable chance of being selected.
In the right plot we ordered the tokens by descending probability and calculated the cumulative sum of the first 10,000 tokens (in total there are 50,257 tokens in GPT-2’s vocabulary). The way to read the graph is that the line represents the probability of picking any of the preceding tokens. For example, the cumulative probability climbs steeply over the few hundred most likely tokens, but it only approaches 100% after several thousand tokens, so a long tail of individually unlikely tokens still carries a small but real share of the probability mass.
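If you want to inspect these quantities directly rather than reading them off the plots, a short sketch (reusing the probs array computed above; the variable names are ours) looks like this:

```python
import numpy as np

# Sort the next-token probabilities in descending order and accumulate them
sorted_probs = np.sort(probs[0])[::-1]
cumulative_probs = np.cumsum(sorted_probs)
print(f"Top token probability: {sorted_probs[0]:.3f}")
print(f"Cumulative probability of the 1,000 most likely tokens: "
      f"{cumulative_probs[999]:.3f}")
```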
Although these numbers might appear small at first sight, they become important because we sample many times when generating text, once per generated token. So even if a token only has a 1 in 100 or 1 in 1,000 chance of being picked, sampling hundreds of times gives a significant chance of picking an unlikely token at some point, and picking such tokens can badly hurt the quality of the generated text. For this reason we generally want to avoid these very unlikely tokens. This is where top-k and top-p sampling come in.
The idea behind top-k sampling is to avoid the low-probability choices by only sampling from the k tokens with the highest probability. This puts a fixed cutoff on the long tail of the distribution and ensures that we only sample from likely choices. The generate function provides an easy way to achieve this with the top_k argument:
```python
torch.manual_seed(42)
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_k=50)
print(tokenizer.decode(output_topk[0]))
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The wild unicorns roam the Andes Mountains in the region of Cajamarca, on the border with Argentina (Picture: Alamy/Ecole Nationale Supérieure d'Histoire Naturelle)

The researchers came across about 50 of the animals in the valley. They had lived in such a remote and isolated area at that location for nearly a thousand years that
This is arguably the most human-looking text we’ve generated so far. Now how do we choose k? The value of k is chosen manually and is the same for every step in the sequence, independent of the actual output distribution, so a cutoff that works well in one context may be too aggressive or too permissive in another.
Instead of defining a fixed cutoff, we can use a dynamic one with nucleus or top-p sampling. Here we set a condition on when to cut off: we only sample from the smallest set of tokens whose cumulative probability exceeds a threshold p (say, 90%), so the number of candidate tokens changes from step to step depending on how peaked the distribution is. The generate function also provides an argument to activate top-p sampling, top_p:
```python
torch.manual_seed(42)
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True,
                             top_p=0.90)
print(tokenizer.decode(output_topp[0]))
```
In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The scientists studied the DNA of the animals and came to the conclusion that the herd are descendants of a prehistoric herd that lived in Argentina about 50,000 years ago.

According to the scientific analysis, the first humans who migrated to South America migrated into the Andes Mountains from South Africa and Australia, after the last ice age had ended.

Since their migration, the animals have been adapting to
Top-p sampling has also produced a coherent story. Note that the two approaches can be combined: setting top_k=50 and top_p=0.9 corresponds to the rule of choosing tokens from a probability mass of 90%, drawn from a pool of at most 50 tokens.
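As a sketch, combining the two truncation strategies in a single generate call looks like this (output_topk_topp is just an illustrative variable name; the parameter values are those just described):

```python
torch.manual_seed(42)
output_topk_topp = model.generate(input_ids, max_length=max_length,
                                  do_sample=True, top_k=50, top_p=0.90)
print(tokenizer.decode(output_topk_topp[0]))
```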
Unfortunately, there is no universally “best” decoding method. Which
approach is best will depend on the nature of the task you are
generating text for. If you want your model to perform a precise task
like arithmetic or providing an answer to a specific question, then you
should lower the temperature or use deterministic methods like greedy or
beam search to guarantee getting the most likely answer. If you want the
model to generate longer text and even be a bit creative, then you
should switch to sampling methods and control the temperature or use a
mix of top-k and nucleus sampling.
In this chapter we looked at text generation, which is a very different task from the NLU tasks we encountered previously. Generating text requires at least one forward pass per generated token, and even more if we use beam search. This makes text generation computationally demanding, and one needs the right infrastructure to run it at scale. In addition, a good decoding strategy that transforms the model’s output probabilities into discrete tokens can improve the text quality. Finding the best decoding strategy requires some experimentation and a qualitative evaluation of the generated texts.
In practice, however, we don’t want to make these decisions based on gut feeling alone! Like other NLP tasks, we should choose a model performance metric that reflects the problem we want to solve. Unsurprisingly, there is also a wide range of choices here, and we will encounter the most common ones in the next chapter, where we look at how to train and evaluate a model for text summarization. If you can’t wait to learn how to train a GPT-type model from scratch, you can skip right to [Link to Come], where we collect a large dataset of code and then train an autoregressive language model on it.