So far in this book we have applied Transformers to solve NLP tasks on English corpora, so what do you do when your documents are written in Greek, Swahili, or Klingon? One approach is to search the HuggingFace Model Hub for a suitable pretrained language model and fine-tune it on the task at hand. However, these pretrained models tend to exist only for “high-resource” languages like German, Russian, or Mandarin, where plenty of webtext is available for pretraining. Another common challenge arises when your corpus is multilingual – maintaining multiple monolingual models in production will not be any fun for you or your engineering team.
Fortunately, there is a class of multilingual Transformers to the rescue! Like BERT, these models use masked language modeling as a pretraining objective, but are trained jointly on texts in over 100 languages. By pretraining on huge corpora across many languages, these multilingual Transformers enable zero-shot cross-lingual transfer, where a model that is fine-tuned on one language can be applied to others without any further training! This also makes these models well suited for “code-switching”, where a speaker alternates between two or more languages or dialects in the context of a single conversation.
In this chapter we will explore how a single Transformer model called XLM-RoBERTa1 can be fine-tuned to perform named entity recognition (NER) across several languages. NER is a common NLP task that identifies entities like people, organizations, or locations in text. These entities can be used for various applications such as gaining insights from company documents, augmenting the quality of search engines, or simply building a structured database from a corpus.
For this chapter let’s assume that we want to perform NER for a customer based in Switzerland, where there are four national languages, with English often serving as a bridge between them. Let’s start by getting a suitable multilingual corpus for this problem.
Zero-shot transfer or zero-shot learning usually refers to the task of training a model on one set of labels and then evaluating it on a different set of labels. In the context of Transformers, zero-shot learning may also refer to situations where a language model like GPT-3 is evaluated on a downstream task it wasn’t even fine-tuned on!
In this chapter we will be using a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME)2 benchmark called Wikiann3 or PAN-X. This dataset consists of Wikipedia articles in many languages, including the four most commonly spoken languages in Switzerland: German (62.9%), French (22.9%), Italian (8.4%), and English (5.9%). Each article is annotated with LOC (location), PER (person) and ORG (organization) tags in the “inside-outside-beginning” (IOB2) format, where a B- prefix indicates the beginning of an entity, and consecutive positions of the same entity are given an I- prefix. An O tag indicates that the token does not belong to any entity. For example, the following sentence
Jeff Dean is a computer scientist at Google in California
would be labeled in IOB2 format as shown in Table 4-1.
Tokens | Jeff | Dean | is | a | computer | scientist | at | Google | in | California |
---|---|---|---|---|---|---|---|---|---|---|
Tags | B-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC |
To load PAN-X with HuggingFace Datasets we first need to manually download the file AmazonPhotos.zip from XTREME’s Amazon Cloud Drive, and place it in a local directory (data in our example). Having done that, we can then load a PAN-X corpus using one of the two-letter ISO 639-1 language codes supported in the XTREME benchmark (see Table 5 of the paper for a list of the 40 available language codes). For example, to load the German corpus we use the “de” code as follows:
```python
from datasets import load_dataset

load_dataset("xtreme", "PAN-X.de", data_dir="data")
```
```
DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs'],
        num_rows: 20000
    })
})
```
In this case, load_dataset
returns a DatasetDict
where each key
corresponds to one of the splits, and each value is a Dataset
object
with features
and num_rows
attributes. To make a representative
Swiss corpus, we’ll sample the German (de), French (fr),
Italian (it), and English (en) corpora from PAN-X according to their
spoken proportions. This will create a language imbalance that is very
common in real-world datasets, where acquiring labeled examples in a
minority language can be expensive due to the lack of domain experts who
are fluent in that language.
To keep track of each language, let’s create a Python
defaultdict
that stores the language code as the key and a PAN-X
corpus of type DatasetDict
as the value:
```python
from collections import defaultdict
from datasets import DatasetDict

langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]
# return a DatasetDict if a key doesn't exist
panx_ch = defaultdict(DatasetDict)

for lang, frac in zip(langs, fracs):
    # load monolingual corpus
    ds = load_dataset("xtreme", f"PAN-X.{lang}", data_dir="data")
    # shuffle and downsample each split according to spoken proportion
    for split in ds.keys():
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows))))
```
Here we’ve used the Dataset.shuffle
function to make sure
we don’t accidentally bias our dataset splits, while
Dataset.select
allows us to downsample each corpus according to the
values in fracs
. Let’s have a look at how many examples we
have per language in the training sets by accessing the
Dataset.num_rows
attribute:
```python
import pandas as pd

pd.DataFrame({lang: [panx_ch[lang]["train"].num_rows] for lang in langs},
             index=["Number of training examples"])
```
de | fr | it | en | |
---|---|---|---|---|
Number of training examples | 12580 | 4580 | 1680 | 1180 |
By design, we have more examples in German than all other languages combined, so we’ll use it as a starting point from which to perform zero-shot cross-lingual transfer to French, Italian, and English. Let’s inspect one of the examples in the German corpus:
```python
panx_ch["de"]["train"][0]
```
```
{'langs': ['de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de', 'de'],
 'ner_tags': [0, 0, 0, 0, 5, 6, 0, 0, 5, 5, 6, 0],
 'tokens': ['2.000', 'Einwohnern', 'an', 'der', 'Danziger', 'Bucht', 'in', 'der',
  'polnischen', 'Woiwodschaft', 'Pommern', '.']}
```
As with our previous encounters with Dataset
objects, the keys of our
example correspond to the column names of an Apache Arrow table, while
the values denote the entry in each column. In particular, we see that
the ner_tags
column corresponds to the mapping of each entity to an
integer. This is a bit cryptic to the human eye, so let’s
create a new column with the familiar LOC, PER, and ORG tags. To
do this, the first thing to notice is that our Dataset
object has a
features
attribute that specifies the underlying data types associated
with each column:
```python
panx_ch["de"]["train"].features
```
```
{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(num_classes=7, names=['O', 'B-PER',
  'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC'], names_file=None, id=None),
  length=-1, id=None),
 'langs': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
```
The Sequence
class specifies that the field contains a list of
features, which in the case of ner_tags
corresponds to a list of
ClassLabel
features. Let’s pick out this feature from the
training set as follows:
```python
tags = panx_ch["de"]["train"].features["ner_tags"].feature
tags
```
```
ClassLabel(num_classes=7, names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
 'B-LOC', 'I-LOC'], names_file=None, id=None)
```
One handy property of the ClassLabel
feature is that it has methods to convert from the class name to an integer and vice versa. For
example, we can find the integer associated with the B-PER tag by
using the ClassLabel.str2int
function as follows:
```python
tags.str2int("B-PER")
```
1
Similarly, we can map back from an integer to the corresponding class name:
```python
tags.int2str(1)
```
'B-PER'
Let’s use the ClassLabel.int2str
function to create a new
column in our training set with class names for each tag.
We’ll use the Dataset.map
function to return a dict
with
the key corresponding to the new column name and the value as a list
of class names:
```python
def create_tag_names(batch):
    return {"ner_tags_str": [tags.int2str(idx) for idx in batch["ner_tags"]]}

panx_de = panx_ch["de"].map(create_tag_names)
```
Now that we have our tags in human-readable format, let’s see how the tokens and tags align for the first example in the training set:
```python
de_example = panx_de["train"][0]
df = pd.DataFrame([de_example["tokens"], de_example["ner_tags_str"]],
                  ['Tokens', 'Tags'])
display_df(df, header=None)
```
Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |
The presence of the LOC tags makes sense, since the sentence “2,000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern” means “2,000 inhabitants at the Gdansk Bay in the Polish voivodeship of Pomerania” in English. Gdansk Bay is a bay in the Baltic Sea, while “voivodeship” corresponds to a state in Poland.
As a sanity check that we don’t have any unusual imbalance in the tags, let’s calculate the frequencies of each entity across each split:
```python
from itertools import chain
from collections import Counter

split2freqs = {}
for split in panx_de.keys():
    tag_names = []
    for row in panx_de[split]["ner_tags_str"]:
        tag_names.append([t.split("-")[1] for t in row if t.startswith("B")])
    split2freqs[split] = Counter(chain.from_iterable(tag_names))

pd.DataFrame.from_dict(split2freqs, orient="index")
```
ORG | LOC | PER | |
---|---|---|---|
validation | 2683 | 3172 | 2893 |
test | 2573 | 3180 | 3071 |
train | 5366 | 6186 | 5810 |
This looks good: the distributions of the PER, LOC, and ORG frequencies are roughly the same for each split, so the validation and test sets should provide a good measure of our NER tagger’s ability to generalize. Next, let’s look at a few popular multilingual Transformers and how they can be adapted to tackle our NER task.
Multilingual Transformers involve similar architectures and training procedures as their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. A remarkable feature of this approach is that despite receiving no explicit information to differentiate among the languages, the resulting linguistic representations are able to generalize well across languages for a variety of downstream tasks. In some cases, this ability to perform cross-lingual transfer can produce results that are competitive with monolingual models, which circumvents the need to train one model per language!
To measure the progress of cross-lingual transfer for NER, the CoNLL-2002 and CoNLL-2003 datasets are often used as a benchmark for English, Dutch, Spanish, and German. This benchmark consists of news articles annotated with the same LOC, PER, and ORG categories as PAN-X, but contains an additional MISC label for miscellaneous entities that do not belong to the previous three groups. Multilingual Transformer models are then evaluated in three different ways:
Fine-tune on the English training data and then evaluate on each language’s test set.
Fine-tune and evaluate on monolingual training data to measure per-language performance.
Fine-tune on all the training data to evaluate multilingual learning.
We will adopt a similar evaluation strategy for our NER task and we’ll use XLM-RoBERTa (or XLM-R for short) which, as of this book’s writing, is the current state-of-the-art Transformer model for multilingual applications. But first, let’s take a look at the two models that inspired its development: mBERT and XLM.
Multilingual BERT (mBERT)4 was developed by the authors of BERT from Google Research in 2018 and was the first multilingual Transformer model. It has the same architecture and training procedure as BERT, except that the pretraining corpus consists of Wikipedia articles from 104 languages. The tokenizer is also WordPiece, but the vocabulary is learnt from the whole corpus so that the model can share embeddings across languages.
To handle the fact that each language’s Wikipedia dump can vary greatly in size, the data for pretraining and learning the WordPiece vocabulary is weighted with an exponential smoothing function that down-samples high-resource languages like English and up-samples low-resource languages like Burmese.
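To make the idea concrete, here is a minimal sketch (not from the original text) of how such exponentially smoothed sampling weights could be computed; the corpus sizes and the smoothing exponent of 0.7 are illustrative assumptions rather than the exact values used for mBERT:

```python
# Illustrative sketch of exponentially smoothed sampling probabilities.
# The corpus sizes and alpha=0.7 are assumptions for demonstration only.
corpus_sizes = {"en": 2_500_000, "de": 800_000, "my": 20_000}  # hypothetical article counts

def smoothed_sampling_probs(sizes, alpha=0.7):
    total = sum(sizes.values())
    raw = {lang: size / total for lang, size in sizes.items()}
    smoothed = {lang: p ** alpha for lang, p in raw.items()}
    norm = sum(smoothed.values())
    return {lang: p / norm for lang, p in smoothed.items()}

print(smoothed_sampling_probs(corpus_sizes))
# English ends up with a smaller share than its raw proportion,
# while Burmese ("my") ends up with a larger one.
```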
In the Cross-lingual Language Model
Pretraining paper, Guillaume Lample and Alexis Conneau from Facebook AI
Research investigated three pretraining objectives for cross-lingual
language models (XLM). One of these objectives is the masked language
modeling (MLM) objective from BERT, but instead of receiving complete
sentences as input, XLM receives sentences that can be truncated
arbitrarily (there is also no next-sentence prediction task). To
increase the number of tokens associated with low-resource languages,
the sentences are sampled from each monolingual corpus according to a
smoothed distribution that up-weights low-resource languages.
There are several variants of XLM based on the choice of pretraining objective and number of languages to be trained on. For the purposes of this discussion, we’ll use XLM to denote the model trained on the same 100 languages used for mBERT.
Like its predecessors, XLM-R uses MLM as a pretraining objective for 100 languages, but, as shown in Figure 4-2, is distinguished by the huge size of the corpus used for pretraining: Wikipedia dumps for each language and 2.5 terabytes of Common Crawl data from the web. This corpus is several orders of magnitude larger than the ones used in previous models and provides a significant boost in signal for low-resource languages like Burmese and Swahili, where only a small number of Wikipedia articles exist.
The RoBERTa part of the model’s name refers to the fact that the pretraining approach is the same as monolingual RoBERTa models. In the RoBERTa paper,5 the authors improved on several aspects of BERT, in particular by removing the next sentence prediction task altogether. XLM-R also drops the language embeddings used in XLM and uses SentencePiece6 to tokenize the raw texts directly. Besides its multilingual nature, a notable difference between XLM-R and RoBERTa is the size of the respective vocabularies: 250,000 tokens versus 55,000!
Table 4-2 summarizes the main architectural differences between these multilingual Transformers.
Model | Languages | Tokenizer | Layers | Hidden States | Attention Heads | Vocabulary Size | Parameters |
---|---|---|---|---|---|---|---|
mBERT | 104 | WordPiece | 12 | 768 | 12 | 110k | 172M |
XLM | 100 | BytePairEncoding | 16 | 1280 | 16 | 200k | 570M |
XLM-R (Base) | 100 | SentencePiece | 12 | 768 | 12 | 250k | 270M |
XLM-R (Large) | 100 | SentencePiece | 24 | 1024 | 16 | 250k | 550M |
The performance of mBERT and XLM-R on the CoNLL benchmark is also shown in Figure 4-3. We see that when trained on all the languages, the XLM-R models significantly outperform mBERT and earlier state-of-the-art approaches.
From this research it becomes apparent that XLM-R is the best choice for multilingual NER. In the next section we explore how to fine-tune XLM-R for this task on a new dataset.
In Chapter 1, we saw that for text
classification, BERT uses the special [CLS]
token to represent an
entire sequence of text. As shown in the left diagram of
Figure 4-4, this representation is then fed through a
fully-connected or dense layer to output the distribution of all the
discrete label values. BERT and other encoder Transformers take a
similar approach for NER, except that the representation of every
input token is fed into the same fully-connected layer to output the
entity of the token. For this reason, NER is often framed as a token
classification task and the process looks something like the right
diagram of Figure 4-4.
So far, so good, but how should we handle subwords in a token classification task? For example, the last name “Sparrow” in Figure 4-4 is tokenized by WordPiece into the subwords “Spa” and “##rrow”, so which one (or both) should be assigned the I-PER label?
In the BERT paper,7 the authors used the representation from the first subword (i.e. “Spa” in our example), and this is the convention we’ll adopt here. Although we could have chosen to include the representation from the “##rrow” subword by assigning it a copy of the I-PER label, this introduces extra complexity when subwords are associated with a B- entity, because duplicating a B- tag across subwords would violate the IOB2 format.
Fortunately, all this intuition from BERT carries over to XLM-R since the architecture is based on RoBERTa, which is identical to BERT! However, there are some slight differences, especially around the choice of tokenizer. Let’s see how the two differ.
Instead of using a WordPiece tokenizer, XLM-R uses a tokenizer called SentencePiece that is trained on the raw text of all 100 languages. The SentencePiece tokenizer is based on a type of subword segmentation called Unigram and encodes input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages like Japanese do not have whitespace characters.
To get a feel for how SentencePiece compares to WordPiece, let’s load the BERT and XLM-R tokenizers in the usual way with Transformers:
```python
from transformers import AutoTokenizer

bert_model_name = "bert-base-cased"
xlmr_model_name = "xlm-roberta-base"
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
xlmr_tokenizer = AutoTokenizer.from_pretrained(xlmr_model_name)
```
By encoding a small sequence of text we can also retrieve the special tokens that each model used during pretraining:
```python
text = "Jack Sparrow loves New York!"
bert_tokens = bert_tokenizer(text).tokens()
xlmr_tokens = xlmr_tokenizer(text).tokens()
```
BERT | [CLS] | Jack | Spa | ##rrow | loves | New | York | ! | [SEP] | None |
---|---|---|---|---|---|---|---|---|---|---|
XLM-R | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
Here we see that instead of the [CLS]
and [SEP]
tokens that BERT
uses for sentence classification tasks, XLM-R uses <s>
and </s>
to
denote the start and end of a sequence. Another special feature of
SentencePiece is that it treats raw text as a sequence of Unicode
characters, with whitespace given the Unicode symbol U+2581 or _
character. By assigning a special symbol for whitespace, SentencePiece
is able to detokenize a sequence without ambiguities. In our example
above, we can see that WordPiece has lost the information that there is
no whitespace between “York” and “!”. By contrast, SentencePiece
preserves the whitespace in the tokenized text so we can convert back to
the raw text without ambiguity:
""
.
join
(
xlmr_tokens
)
.
replace
(
"_"
,
" "
)
'<s> Jack Sparrow loves New York!</s>'
Now that we understand how SentencePiece works, let’s see how we can encode our simple example in a form suitable for NER. The first thing to do is load the pretrained model with a token classification head. But instead of loading this head directly from the Transformers library we will build it ourselves! By diving deeper into the Transformers API, let’s see how we can do this with just a few steps.
As we have seen in previous chapters, the Transformers library is organized around dedicated classes for each architecture and task. The list of supported tasks can be found in the Transformers documentation, and as of this book’s writing includes
Sequence classification
Extractive question answering
Language modeling
Named entity recognition
Summarization
Translation
and the associated classes are named according to a ModelNameForTask
convention. Most of the time, we load these models using the
ModelNameForTask.from_pretrained
function and since the architecture
can usually be guessed from the name alone (e.g. bert-base-uncased
),
Transformers provides a convenient set of AutoClasses to automatically
load the relevant configuration, vocabulary, or weights. In practice,
these AutoClasses are extremely useful because it means that we can
switch to a completely different architecture in our experiments by
simply changing the model name!
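As a quick hedged illustration (the checkpoint names below are just examples, not taken from this chapter), swapping architectures with AutoClasses can look like this:

```python
# Sketch: the same code works for different architectures by changing the checkpoint name.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

for checkpoint in ["bert-base-uncased", "distilbert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
    # the rest of the training or inference code stays unchanged
```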
However, this approach has its limitations, and to motivate going deeper in the Transformers API consider the following scenario. Suppose you work for a consulting company that is engaged with many customer projects each year. By studying how these projects evolve, you’ve noticed that the initial estimates for person-months, number of required people, and the total project timespan are extremely inaccurate. After thinking about this problem, you have the idea that feeding the written project descriptions to a Transformer model might yield much better estimates of these quantities.
So you set up a meeting with your boss and, with an artfully crafted
Powerpoint presentation, you pitch that you could increase the accuracy
of the project estimates and thus increase the efficiency of the staff
and revenue by making more accurate offers. Impressed with your colorful
presentation and talk of efficiency and profits, your boss generously
agrees to give you one week to build a proof-of-concept. Happy with the
outcome, you start working straight away and decide that the only thing you need is a regression model to predict the three variables (person-months, number of people, and timespan). You fire up your favorite GPU and open a notebook. You execute from transformers import BertForRegression, and the color drains from your face as dreaded red fills your screen: ImportError: cannot import name 'BertForRegression'.
Oh no, there is no BERT model for regression! How should you complete
the project in one week if you have to implement the whole model
yourself?! Where should you even start?
Don’t panic! The Transformers library is designed to enable you to easily extend existing models for your specific use case. With it you have access to various utilities, such as loading the weights of pretrained models or task-specific helper functions. This lets you build custom models for specific objectives with very little overhead.
The main concept that makes Transformers so versatile is the split of
the architecture into a body and head. We have already seen that
when we switch from the pretraining task to the downstream task, we need
to replace the last layer of the model with one that is suitable for the
task. This last layer is called the model head and is the part that is
task specific. The rest of the model is called the body and includes
the token embeddings and Transformer layers that are task agnostic.
This structure is reflected in the Transformers code as well: The body
of a model is implemented in a class such as BertModel
or GPT2Model
that returns the hidden states of the last layer. Task specific models
such as BertForMaskedLM
or BertForSequenceClassification
use the
base model and add the necessary head on top of the hidden states as
shown in Figure 4-5.
Figure 4-5. The BertModel class only contains the body of the model, while the BertForTask classes combine the body with a dedicated head for a given task.

This separation of bodies and heads allows us to build a custom head for any task and just mount it on top of a pretrained model! Let’s go through the exercise of building a custom token classification head for XLM-R. Since XLM-R uses the same model architecture as RoBERTa, we will use RoBERTa as the base model, but augment it with settings specific to XLM-R.
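Before building the head, it can help to look at what the model body actually returns. The following small sketch (not part of the original code) reuses the xlmr_model_name, xlmr_tokenizer, and text variables defined earlier to inspect the per-token hidden states that our classification head will map to tag logits:

```python
# Sketch: the body returns one hidden-state vector per token,
# which the token classification head will project onto the NER tags.
import torch
from transformers import AutoModel

xlmr_body = AutoModel.from_pretrained(xlmr_model_name)
encoding = xlmr_tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden_states = xlmr_body(**encoding).last_hidden_state
print(hidden_states.shape)  # expected: torch.Size([1, 10, 768]) for xlm-roberta-base
```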
To get started we need a data structure that will represent our XLM-R
NER tagger. As a first guess, we’ll need a configuration
file to initialize the model and a forward
function to generate the
outputs. With these considerations, let’s go ahead and build
our XLM-R class for token classification:
```python
import torch.nn as nn
from transformers import XLMRobertaConfig
from transformers.models.roberta.modeling_roberta import (
    RobertaModel, RobertaPreTrainedModel)

class XLMRobertaForTokenClassification(RobertaPreTrainedModel):
    config_class = XLMRobertaConfig

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # load model body
        self.roberta = RobertaModel(config, add_pooling_layer=False)
        # setup token classification head
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # load and initialize weights
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None,
                token_type_ids=None, labels=None, **kwargs):
        pass
```
The config_class
ensures that the standard XLM-R settings are used
when we initialize a new model. If you want to change the default
parameters you can do this by overwriting the default settings in the
configuration. With the super()
function we call the initialization
function of RobertaPreTrainedModel
. Then we define our model
architecture by taking the model body from RobertaModel
and extending
it with our own classification head consisting of a dropout and a
standard feedforward layer. Finally, we initialize all the weights by
calling the init_weights
function which will load the pretrained
weights for the model body and randomly initialize the weights of our
token classification head.
The only thing left to do is to define what the model should do in a
forward pass. We define the following behavior in the forward
function:
```python
from transformers.modeling_outputs import TokenClassifierOutput

def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
            labels=None, **kwargs):
    # use model body to get encoder representations
    outputs = self.roberta(input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids, **kwargs)
    # apply classifier to encoder representation
    sequence_output = self.dropout(outputs[0])
    logits = self.classifier(sequence_output)
    # calculate losses
    loss = None
    if labels is not None:
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    # return model output object
    return TokenClassifierOutput(loss=loss, logits=logits,
                                 hidden_states=outputs.hidden_states,
                                 attentions=outputs.attentions)
```
During the forward pass the data is first fed through the model body.
There are a number of input variables, but the important ones you should
recognize are the input_ids
and attention_masks
which are the only
ones we need for now. The hidden state, which is part of the model body
output, is then fed through the dropout and classification layer. If we
also provide labels in the forward pass we can directly calculate the
loss. If there is an attention mask we need to do a little bit more work
to make sure we only calculate the loss of the unmasked tokens. Finally,
we wrap all the outputs in a TokenClassifierOutput
object that allows
us to access elements in the familiar named-tuple format from previous chapters.
The only thing left to do is to update the placeholder function in the model class with our freshly baked forward function:
```python
XLMRobertaForTokenClassification.forward = forward
```
Looking back at the example of the triple regression problem at the beginning of this section we now see that we can easily solve this by adding a custom regression head to the model with the necessary loss function and still have a chance at meeting the challenging deadline.
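As a hedged sketch of what such a head could look like (this is not code from this chapter; BertForTripleRegression and its three outputs are hypothetical names for the scenario above), we can follow exactly the same body-plus-head pattern:

```python
# Hypothetical regression head following the same pattern as the token classification head.
import torch.nn as nn
from transformers.models.bert.modeling_bert import BertModel, BertPreTrainedModel

class BertForTripleRegression(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # model body
        self.bert = BertModel(config)
        # regression head: pooled [CLS] representation -> three continuous targets
        # (person-months, number of people, timespan)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.regressor = nn.Linear(config.hidden_size, 3)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        outputs = self.bert(input_ids, attention_mask=attention_mask, **kwargs)
        preds = self.regressor(self.dropout(outputs[1]))  # outputs[1] is the pooled output
        loss = nn.MSELoss()(preds, labels) if labels is not None else None
        return loss, preds
```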
Now we are ready to load our token classification model. Here we need to
provide some additional information beyond the model name, including the
tags that we will use to label each entity and the mapping of each tag
to an ID and vice versa. All of this information can be derived from our
tags
variable, which as a ClassLabel
object has a names
attribute
that we can use to derive the mapping:
```python
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}
```
With this information and the ClassLabel.num_classes
attribute, we can
load the XLM-R configuration for NER as follows:
```python
from transformers import AutoConfig

xlmr_config = AutoConfig.from_pretrained(xlmr_model_name,
                                         num_labels=tags.num_classes,
                                         id2label=index2tag,
                                         label2id=tag2index)
```
Now, we can load the model weights as usual with the from_pretrained
function. Note that we did not implement this ourselves; we get this for
free by inheriting from RobertaPreTrainedModel
:
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained(xlmr_model_name, config=xlmr_config)
              .to(device))
```
As a sanity check that we have initialized the tokenizer and model correctly, let’s test the predictions on our small sequence of known entities:
```python
input_ids = xlmr_tokenizer.encode(text, return_tensors="pt").to(device)
```
Tokens | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
---|---|---|---|---|---|---|---|---|---|---|
Input IDs | 0 | 21763 | 37456 | 15555 | 5161 | 7 | 2356 | 5753 | 38 | 2 |
As we can see, the start <s>
and end </s>
tokens are given the IDs 0
and 2 respectively. For reference we can find the mappings of the other
special characters via the all_special_ids
and all_special_tokens
attributes of xlmr_tokenizer
:
```python
df = pd.DataFrame([xlmr_tokenizer.all_special_tokens,
                   xlmr_tokenizer.all_special_ids],
                  index=["Special Token", "Special Token ID"])
display_df(df, header=None)
```
Special Token | <s> | </s> | <unk> | <pad> | <mask> |
---|---|---|---|---|---|
Special Token ID | 0 | 2 | 3 | 1 | 250001 |
Finally, we need to pass the inputs to the model and extract the predictions by taking the argmax to get the most likely class per token:
```python
outputs = xlmr_model(input_ids).logits
predictions = torch.argmax(outputs, dim=-1)
print(f"Number of tokens in sequence: {len(xlmr_tokens)}")
print(f"Shape of outputs: {outputs.shape}")
```
```
Number of tokens in sequence: 10
Shape of outputs: torch.Size([1, 10, 7])
```
Here we see that the logits have the shape
[batch_size, num_tokens, num_tags]
, with each token given a logit
among the 7 possible NER tags. By enumerating over the sequence, we can
quickly see what the pretrained model predicts:
```python
preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
df = pd.DataFrame([xlmr_tokens, preds], index=["Tokens", "Tags"])
display_df(df, header=None)
```
Tokens | <s> | _Jack | _Spar | row | _love | s | _New | _York | ! | </s> |
---|---|---|---|---|---|---|---|---|---|---|
Tags | I-ORG | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-PER | I-ORG | I-ORG |
Unsurprisingly, our token classification layer with random weights leaves a lot to be desired; let’s fine-tune on some labeled data to make it better! Before doing so, let’s wrap the above steps into a helper function for later use:
```python
def tag_text(text, tags, model, tokenizer):
    # get tokens with special characters
    tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(text)))
    # encode the sequence into IDs
    inputs = tokenizer.encode(text, return_tensors="pt").to(device)
    # get predictions as distribution over 7 possible classes
    outputs = model(inputs)[0]
    # take argmax to get most likely class per token
    predictions = torch.argmax(outputs, dim=2)
    # convert to DataFrame
    preds = [tags.names[p] for p in predictions[0].cpu().numpy()]
    df = pd.DataFrame([tokens, preds], index=["Tokens", "Tags"])
    display_df(df, header=None)
```
Now that we’ve established that the tokenizer and model can
encode a single example, our next step is to tokenize the whole dataset
so that we can pass it to the XLM-R model for fine-tuning. As we saw in
Chapter 1, Datasets provides a fast way to
tokenize a Dataset
object with the Dataset.map
operation. To achieve
this, recall that we first need to define a function with the minimal
signature
```python
function(examples: Dict[str, List]) -> Dict[str, List]
```
where examples
is equivalent to a slice of a Dataset
, e.g.
panx_de['train'][:10]
. Since the XLM-R
tokenizer returns the input IDs for the model’s inputs, we
just need to augment this information with the attention mask and the
label IDs that encode the information about which token is associated
with each NER tag.
Following the approach taken in the Transformers documentation, let’s look at how this works with our single German example by first collecting the words and tags as ordinary lists:
```python
words, labels = de_example["tokens"], de_example["ner_tags"]
```
Next we tokenize each word and use the is_split_into_words argument to tell the tokenizer that our input sequence has already been split into words:
```python
tokenized_input = xlmr_tokenizer(de_example["tokens"], is_split_into_words=True)
tokens = xlmr_tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
```
Tokens | <s> | _2.000 | _Einwohner | n | _an | _der | _Dan | ... | schaft | _Po | mmer | n | _ | . | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
In this example we can see that the tokenizer has split “Einwohnern”
into the two subwords “_Einwohner” and “n”. Since we’re
following the convention that only the first subword (“_Einwohner”) should be
associated with the original word’s label (see “Training a Named Entity Recognition Tagger”),
we need a way to mask the subword representations after the first subword.
Fortunately, tokenized_input is an object with a word_ids
method that can help us achieve this:
```python
word_ids = tokenized_input.word_ids()
```
Tokens | <s> | _2.000 | _Einwohner | n | _an | _der | _Dan | ... | schaft | _Po | mmer | n | _ | . | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Word IDs | None | 0 | 1 | 1 | 2 | 3 | 4 | ... | 9 | 10 | 10 | 10 | 11 | 11 | None |
Here we can see that word_ids
has mapped each subword to the
corresponding index in the words
sequence, so the first subword
“_2.000” is assigned the index 0, while “_Einwohner” and “n” are
assigned the index 1 since “Einwohnern” is the second word in words
.
We can also see that special tokens like <s>
and </s>
are mapped to None. Let’s set -100 as the label for these special tokens
and the subwords we wish to mask during training:
```python
previous_word_idx = None
label_ids = []

for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)
    elif word_idx != previous_word_idx:
        label_ids.append(labels[word_idx])
    else:
        label_ids.append(-100)
    previous_word_idx = word_idx
```
Tokens | <s> | _2.000 | _Einwohner | n | _an | _der | ... | _Po | mmer | n | _ | . | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Word IDs | None | 0 | 1 | 1 | 2 | 3 | ... | 10 | 10 | 10 | 11 | 11 | None |
Label IDs | -100 | 0 | 0 | -100 | 0 | 0 | ... | 6 | -100 | -100 | 0 | -100 | -100 |
Labels | IGN | O | O | IGN | O | O | ... | I-LOC | IGN | IGN | O | IGN | IGN |
Why did we choose -100 as the ID to mask subword representations? The reason is that in PyTorch the cross entropy loss class torch.nn.CrossEntropyLoss
has an attribute called ignore_index
whose value is -100. This index is ignored during training and so we can use it to ignore the tokens associated with consecutive subwords.
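As a quick sanity check (a small sketch, not from the original text), we can verify that positions labeled with -100 contribute nothing to the loss:

```python
# Sketch: positions labeled -100 are excluded from the cross-entropy loss,
# since ignore_index defaults to -100 in PyTorch.
import torch
import torch.nn as nn

logits = torch.randn(4, 7)                 # 4 tokens, 7 possible NER tags
labels = torch.tensor([5, -100, 0, -100])  # two positions masked out
loss_fct = nn.CrossEntropyLoss()
full = loss_fct(logits, labels)
manual = loss_fct(logits[[0, 2]], labels[[0, 2]])
print(torch.allclose(full, manual))        # True
```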
And that’s it! We can clearly see how the label IDs align with the tokens, so let’s scale this out to the whole dataset by defining a single function that wraps all the logic:
```python
def tokenize_and_align_labels(examples):
    tokenized_inputs = xlmr_tokenizer(examples["tokens"], truncation=True,
                                      is_split_into_words=True)
    labels = []
    for idx, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
Next let’s verify whether our function works as expected on a single training example:
```python
single_sample = panx_de["train"].select(range(1))
single_sample_encoded = single_sample.map(tokenize_and_align_labels,
                                          batched=True)
```
First, we should be able to decode the training example from the
input_ids
:
```python
print(" ".join(token for token in single_sample[0]["tokens"]))
print(xlmr_tokenizer.decode(single_sample_encoded["input_ids"][0]))
```
```
2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern .
<s> 2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern.</s>
```
Good, the decoded output from the tokenizer makes sense and we can see
the appearance of the special tokens <s>
and </s>
for the start and
end of the sentence. Next let’s check that the label IDs are
implemented correctly by filtering out the padding label IDs and mapping
back from ID to tag:
```python
original_labels = single_sample["ner_tags_str"][0]
reconstructed_labels = [index2tag[idx] for idx
                        in single_sample_encoded["labels"][0] if idx != -100]
```
Original Labels | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Reconstructed Labels | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |
We now have all the ingredients we need to encode each split, so let’s write a function we can iterate over:
```python
def encode_panx_dataset(corpus):
    return corpus.map(tokenize_and_align_labels, batched=True,
                      remove_columns=['langs', 'ner_tags', 'tokens'])
```
By applying this function to a DatasetDict
object, we get an encoded
Dataset
object per split. Let’s use this to encode our
German corpus:
```python
panx_de_encoded = encode_panx_dataset(panx_ch["de"])
panx_de_encoded["train"]
```
```
Dataset({
    features: ['attention_mask', 'input_ids', 'labels'],
    num_rows: 12580
})
```
Evaluating NER taggers is similar to evaluating other classification tasks, and it is common to report results for precision, recall, and F1-score. For NER these metrics are computed at the level of entities rather than individual tokens, which we can do with the seqeval library:
```python
from seqeval.metrics import classification_report

y_true = [["O", "O", "O", "B-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-MISC", "I-MISC", "I-MISC", "I-MISC", "O"],
          ["B-PER", "I-PER", "O"]]
print(classification_report(y_true, y_pred))
```
```
              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2
```
As we can see, seqeval expects the predictions and labels as a list of lists, with each list corresponding to a single example in our validation or test sets. To integrate these metrics during training we need a function that can take the outputs of the model and convert them into the lists that seqeval expects. The following does the trick by ensuring we ignore the label IDs associated with subsequent subwords:
```python
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []

    for batch_idx in range(batch_size):
        example_labels, example_preds = [], []
        for seq_idx in range(seq_len):
            # ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])
        labels_list.append(example_labels)
        preds_list.append(example_preds)

    return preds_list, labels_list
```
We now have all the ingredients to fine-tune our model! Our first
strategy will be to fine-tune our base model on the German subset of
PAN-X and then evaluate its zero-shot cross-lingual
performance on French, Italian, and English. As usual, we’ll
use the Transformers Trainer
to handle our training loop, so first
we need to define the training attributes using the TrainingArguments
class:
```python
from transformers import TrainingArguments

num_epochs = 3
batch_size = 24
logging_steps = len(panx_de_encoded["train"]) // batch_size
training_args = TrainingArguments(
    output_dir="results", num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size, evaluation_strategy="epoch",
    save_steps=1e6, weight_decay=0.01, disable_tqdm=False,
    logging_steps=logging_steps)
```
Here we evaluate the model’s predictions on the validation
set at the end of every epoch, tweak the weight decay, and set
save_steps
to a large number to disable checkpointing and thus
speed-up training.
We also need to tell the Trainer how to compute metrics on the
validation set, so here we can use the align_predictions function that
we defined earlier to extract the predictions and labels in the format
needed by seqeval to calculate the F1-score:
```python
from seqeval.metrics import f1_score

def compute_metrics(eval_pred):
    y_pred, y_true = align_predictions(eval_pred.predictions,
                                       eval_pred.label_ids)
    return {"f1": f1_score(y_true, y_pred)}
```
The final step is to define a data collator so we can pad each input sequence to the largest sequence length in a batch. Transformers provides a dedicated data collator for token classification which will also pad the label sequences along with the inputs:
```python
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(xlmr_tokenizer)
```
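To see what the collator actually does, here is a small sketch (not from the original text) with two hand-crafted examples of different lengths; it assumes the collator's default label_pad_token_id of -100:

```python
# Sketch: the collator pads the input IDs and pads the label sequences with -100,
# so the padded positions are ignored by the loss.
features = [{"input_ids": [0, 10699, 2], "labels": [-100, 3, -100]},
            {"input_ids": [0, 21763, 37456, 15555, 2],
             "labels": [-100, 1, 2, -100, -100]}]
batch = data_collator(features)
print(batch["labels"])
# expected: tensor([[-100,    3, -100, -100, -100],
#                   [-100,    1,    2, -100, -100]])
```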
Let’s pass all of this information, together with the encoded datasets, to the Trainer
```python
from transformers import Trainer

trainer = Trainer(model=xlmr_model, args=training_args,
                  data_collator=data_collator,
                  compute_metrics=compute_metrics,
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"],
                  tokenizer=xlmr_tokenizer)
```
and then run the training loop as follows:
```python
trainer.train();
```
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 0.270888 | 0.162622 | 0.819401 |
2 | 0.129113 | 0.137760 | 0.851463 |
3 | 0.081817 | 0.136745 | 0.863226 |
Now that the model is fine-tuned, it’s a good idea to save the weights and tokenizer so we can reuse them at a later stage:
```python
trainer.save_model("models/xlm-roberta-base-finetuned-panx-de")
```
As a sanity check that our model works as expected, let’s test it on the German translation of our simple example:
```python
text_de = "Jeff Dean ist ein Informatiker bei Google in Kalifornien"
tag_text(text_de, tags, trainer.model, xlmr_tokenizer)
```
Tokens | <s> | _Jeff | _De | an | _ist | _ein | _Informati | ker | _bei | _Google | _in | _Kaliforni | en | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tags | O | B-PER | I-PER | I-PER | O | O | O | O | O | B-ORG | O | B-LOC | I-LOC | O |
It works! But we should never get too confident about performance based on a single example. Instead, we should conduct a proper and thorough investigation of the model’s errors. In the next section we explore how to do this for the NER task.
Before we dive deeper into the multilingual aspects of XLM-R let’s take a minute to investigate the errors of our model. As we saw in Chapter 1, a thorough error analysis of your model is one of the most important aspects when training and debugging Transformers (and machine learning models in general). There are several failure modes where it might look like the model is performing well while in practice it has some serious flaws. Examples where Transformers can fail include:
We can accidentally mask too many tokens and also mask some of our labels to get a really promising loss drop.
The compute_metrics
function can have a bug that overestimates the
true performance.
We might include the zero class or O entity in NER as a normal class, which will heavily skew the accuracy and F1-score.
When the model performs much worse than expected, looking at the errors can also yield useful insights and reveal bugs which would be hard to spot by just looking at the code. Even if the model performs well and there are no bugs in the code, error analysis is still a useful tool to understand the strengths and weaknesses of the model. These are aspects we always need to keep in mind when we deploy a model in a production environment.
We will again use one of the most powerful tools at our disposal, which is to look at the validation examples with the highest loss. We can reuse much of the function we built to analyze the sequence classification model in Chapter 1, but this time we calculate a loss per token in the sequence.
Let’s first load our fine-tuned model:
```python
xlmr_model = (XLMRobertaForTokenClassification
              .from_pretrained("models/xlm-roberta-base-finetuned-panx-de")
              .to(device))
```
and define a function that we can apply to the validation set:
```python
from torch.nn.functional import cross_entropy

def forward_pass_with_label(batch):
    # convert dict of lists to list of dicts
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    # pad inputs and labels
    batch = data_collator(features)
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    labels = batch["labels"].to(device)
    with torch.no_grad():
        output = xlmr_model(input_ids, attention_mask)
        batch["predicted_label"] = torch.argmax(output.logits, axis=-1)
    loss = cross_entropy(output.logits.view(-1, 7), labels.view(-1),
                         reduction="none")
    loss = loss.view(len(input_ids), -1)
    batch["loss"] = loss
    # datasets requires list of NumPy array data types
    for k, v in batch.items():
        batch[k] = v.cpu().numpy()
    return batch
```
We now apply this function to the whole validation set using
Dataset.map
and load all the data into a DataFrame
for further
analysis:
```python
valid_set = panx_de_encoded["validation"]
valid_set = valid_set.map(forward_pass_with_label, batched=True, batch_size=32)
valid_set.set_format("pandas")
df = valid_set[:]
```
The tokens and the labels are still encoded with their IDs, so
let’s map the tokens and labels back to strings to make it
easier to read the results. For the padding tokens with label -100 we
assign a special label IGN
so we can filter them later:
```python
index2tag[-100] = "IGN"
df["input_tokens"] = df["input_ids"].apply(
    lambda x: xlmr_tokenizer.convert_ids_to_tokens(x))
df["predicted_label"] = df["predicted_label"].apply(
    lambda x: [index2tag[i] for i in x])
df["labels"] = df["labels"].apply(
    lambda x: [index2tag[i] for i in x])
```
attention_mask | input_ids | labels | loss | predicted_label | input_tokens | |
---|---|---|---|---|---|---|
0 | [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 10699, 11, 15, 16104, 1388, 2, 1, 1, 1, 1,... | [IGN, B-ORG, IGN, I-ORG, I-ORG, I-ORG, IGN, IG... | [0.0, 0.015054655261337757, 0.0, 0.01456521265... | [I-ORG, B-ORG, I-ORG, I-ORG, I-ORG, I-ORG, I-O... | [<s>, _Ham, a, _(, _Unternehmen, _), </s>, <pa... |
Each column contains a list of tokens, labels, predicted labels, and so
on for each sample. Let’s have a look at the tokens
individually by unpacking these lists. The pandas.Series.explode
function allows us to do exactly that in one line by creating a row for
each element in the original rows list. Since all the lists in one row
have the same length we can do this in parallel for all columns. We also
drop the padding tokens since their loss is zero anyway:
```python
df_tokens = df.apply(pd.Series.explode)
df_tokens = df_tokens.query("labels != 'IGN'")
df_tokens["loss"] = df_tokens["loss"].astype(float)
```
attention_mask | input_ids | labels | loss | predicted_label | input_tokens |
---|---|---|---|---|---|
1 | 10699 | B-ORG | 0.015055 | B-ORG | _Ham |
1 | 15 | I-ORG | 0.014565 | I-ORG | _( |
1 | 16104 | I-ORG | 0.017757 | I-ORG | _Unternehmen |
1 | 1388 | I-ORG | 0.017589 | I-ORG | _) |
1 | 56530 | O | 0.000149 | O | _WE |
With the data in this shape we can now group it by the input tokens and aggregate the losses for each token with the count, mean, and sum. Finally, we sort the aggregated data by the sum of the losses and see which tokens have accumulated most loss in the validation set:
```python
(df_tokens.groupby("input_tokens")[["loss"]]
 .agg(["count", "mean", "sum"])
 .droplevel(level=0, axis=1)  # get rid of multi-level columns
 .sort_values(by="sum", ascending=False)
 .reset_index()
 .head(20)
)
```
input_tokens | count | mean | sum | |
---|---|---|---|---|
0 | _ | 6066 | 0.038087 | 231.037361 |
1 | _der | 1388 | 0.093476 | 129.744606 |
2 | _in | 989 | 0.128473 | 127.059641 |
3 | _von | 808 | 0.148930 | 120.335503 |
4 | _/ | 163 | 0.545257 | 88.876889 |
5 | _( | 246 | 0.340354 | 83.726985 |
6 | _) | 246 | 0.317856 | 78.192667 |
7 | _und | 1171 | 0.066501 | 77.872532 |
8 | _'' | 2898 | 0.024963 | 72.342368 |
9 | _A | 125 | 0.489913 | 61.239122 |
10 | _die | 860 | 0.053126 | 45.688522 |
11 | _D | 89 | 0.492030 | 43.790635 |
12 | _des | 366 | 0.115172 | 42.152793 |
13 | _West | 48 | 0.873962 | 41.950152 |
14 | _' | 2133 | 0.019044 | 40.621803 |
15 | _Am | 35 | 1.055619 | 36.946656 |
16 | _I | 94 | 0.384951 | 36.185379 |
17 | _Ober | 27 | 1.315953 | 35.530743 |
18 | _The | 45 | 0.737148 | 33.171655 |
19 | _of | 125 | 0.256964 | 32.120553 |
We can observe several patterns in this list:
The whitespace token has the highest total loss, which is not surprising since it is also the most common token in the list. On average, however, its loss is well below that of most tokens in the list.
Words like in, von, der, and und appear relatively frequently. They often appear together with named entities and are sometimes part of them which explains why the model might mix them up.
Parentheses, slashes, and capital letters at the beginning of words are rarer but have a relatively high average loss. We will investigate them further.
At the end of the list we see some subwords that appear rarely but have a
very high average loss. For example, _West
appears with almost every entity class and thus poses a classification
challenge to the model:
```python
df_tokens.query("input_tokens == '_West'")["labels"].value_counts()
```
```
O        23
B-LOC     6
I-ORG     6
B-ORG     5
I-LOC     4
I-PER     3
B-PER     1
Name: labels, dtype: int64
```
We can also group the label IDs and look at the losses for each class. We see that B-ORG has the highest average loss which means that determining the beginning of an organization poses a challenge to our model:
```python
(df_tokens.groupby("labels")[["loss"]]
 .agg(["count", "mean", "sum"])
 .droplevel(level=0, axis=1)
 .sort_values(by="mean", ascending=False)
 .reset_index()
)
```
labels | count | mean | sum | |
---|---|---|---|---|
0 | B-ORG | 2683 | 0.627179 | 1682.721081 |
1 | I-LOC | 1462 | 0.575575 | 841.490868 |
2 | I-ORG | 3820 | 0.508612 | 1942.896333 |
3 | B-LOC | 3172 | 0.292173 | 926.773229 |
4 | B-PER | 2893 | 0.271715 | 786.071277 |
5 | I-PER | 4139 | 0.201903 | 835.677673 |
6 | O | 43648 | 0.032605 | 1423.160096 |
We can break this down further by plotting the confusion matrix of the token classification, where we see that the beginning of an organization is often confused with the subsequent I-ORG token:
```python
plot_confusion_matrix(df_tokens["labels"], df_tokens["predicted_label"],
                      tags.names)
```
Now that we’ve examined the errors at the token level,
let’s move on and look at sequences with high losses. For
this calculation, we revisit the “unexploded” DataFrame and calculate
the total loss by summing over the loss per token. To do this,
let’s first write a function that helps us display the token
sequences with their labels and losses:
```python
def display_samples(df):
    for _, row in df.iterrows():
        labels, preds, tokens, losses = [], [], [], []
        for i, mask in enumerate(row["attention_mask"]):
            if mask == 1:
                labels.append(row["labels"][i])
                preds.append(row["predicted_label"][i])
                tokens.append(row["input_tokens"][i])
                losses.append(f"{row['loss'][i]:.2f}")
        df_tmp = pd.DataFrame({"tokens": tokens, "labels": labels,
                               "preds": preds, "losses": losses}).T
        display_df(df_tmp, header=None, max_cols=10)

df["total_loss"] = df["loss"].apply(sum)
display_samples(df.sort_values(by="total_loss", ascending=False).head(3))
```
tokens | <s> | _' | _'' | _Τ | Κ | ... | k | _'' | _' | ala | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|
labels | IGN | O | O | O | IGN | ... | IGN | I-LOC | I-LOC | IGN | IGN |
preds | O | O | O | B-ORG | I-ORG | ... | O | O | O | B-LOC | O |
losses | 0.00 | 0.00 | 0.00 | 2.90 | 0.00 | ... | 0.00 | 10.00 | 9.81 | 0.00 | 0.00 |
tokens | <s> | _'' | 8 | . | _Juli | ... | n | ischen | _Gar | de | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|
labels | IGN | B-ORG | IGN | IGN | I-ORG | ... | IGN | IGN | I-ORG | IGN | IGN |
preds | O | O | O | O | O | ... | I-ORG | I-ORG | I-ORG | I-ORG | O |
losses | 0.00 | 8.91 | 0.00 | 0.00 | 6.29 | ... | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 |
tokens | <s> | _United | _Nations | _Multi | dimensional | ... | _the | _Central | _African | _Republic | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|
labels | IGN | B-PER | I-PER | I-PER | IGN | ... | I-PER | I-PER | I-PER | I-PER | IGN |
preds | I-ORG | B-ORG | I-ORG | I-ORG | I-ORG | ... | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG |
losses | 0.00 | 6.51 | 6.83 | 6.45 | 0.00 | ... | 5.31 | 5.30 | 6.33 | 5.82 | 0.00 |
It is apparent that something is wrong with the labels of these samples; for example, the United Nations is labeled as a person! It turns out the annotations for the Wikiann dataset were generated through an automated process. Such annotations are often referred to as “silver-standard” (in contrast to the “gold-standard” of human-generated annotations), and it is no surprise that there are cases where the automated approach failed to produce sensible labels. However, such failure modes are not unique to automatic approaches; even when humans carefully annotate data, mistakes can occur when the annotators lose concentration or simply misunderstand the sentence.
Another pattern we noticed among the tokens with the highest loss was parentheses and slashes. Let’s look at a few examples of sequences with an opening parenthesis:
```python
display_samples(df.loc[df["input_tokens"].apply(lambda x: "_(" in x)].head(3))
```
tokens | <s> | _Ham | a | _( | _Unternehmen | _) | </s> |
---|---|---|---|---|---|---|---|
labels | IGN | B-ORG | IGN | I-ORG | I-ORG | I-ORG | IGN |
preds | I-ORG | B-ORG | I-ORG | I-ORG | I-ORG | I-ORG | I-ORG |
losses | 0.00 | 0.02 | 0.00 | 0.01 | 0.02 | 0.02 | 0.00 |
tokens | <s> | _Kesk | kül | a | _( | _Mart | na | _) | </s> |
---|---|---|---|---|---|---|---|---|---|
labels | IGN | B-LOC | IGN | IGN | I-LOC | I-LOC | IGN | I-LOC | IGN |
preds | I-LOC | B-LOC | I-LOC | I-LOC | I-LOC | I-LOC | I-LOC | I-LOC | I-LOC |
losses | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 |
tokens | <s> | _Pik | e | _Town | ship | ... | _ | , | _Ohio | _) | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|
labels | IGN | B-LOC | IGN | I-LOC | IGN | ... | I-LOC | IGN | I-LOC | I-LOC | IGN |
preds | I-LOC | B-LOC | I-LOC | I-LOC | I-LOC | ... | I-LOC | I-LOC | I-LOC | I-LOC | I-LOC |
losses | 0.00 | 0.02 | 0.00 | 0.01 | 0.00 | ... | 0.01 | 0.00 | 0.01 | 0.01 | 0.00 |
Since Wikiann is a dataset created from Wikipedia, many entities contain parentheses from the introductory sentence of each article, where the subject of the article is described. In the first example, the parenthesis simply states that Hama is an “Unternehmen”, or company in English. In general we would not include the parenthesis and its content as part of the named entity, but this seems to be the way the automatic extraction annotated the documents. In the other examples the parenthesis contains a geographic specification. While this is indeed a location as well, we might want to disconnect it from the original location in the annotations. These are important details to know when we roll out the model, since they might have implications for the downstream performance of the whole pipeline the model is part of.
With a relatively simple analysis we have found weaknesses in both our model and the dataset. In a real use case we would iterate on this step, clean up the dataset, retrain the model, and analyze the new errors until we were satisfied with the performance.
So far we have analyzed the errors on a single language, but we are also interested in the performance across languages. In the next section we perform some experiments to see how well cross-lingual transfer works in XLM-R.
Now that we have fine-tuned XLM-R on German, we can evaluate its ability
to transfer to other languages via the Trainer.predict
function that
generates predictions on Dataset
objects. For example, to get the
predictions on the validation set we can run the following:
panx_de_encoded["validation"].reset_format()
preds_valid = trainer.predict(panx_de_encoded["validation"])
The output of Trainer.predict is a trainer_utils.PredictionOutput object which contains arrays of predictions and label_ids, along with the metrics computed by the compute_metrics function we passed to the trainer. For example, the metrics on the validation set can be accessed as follows:
preds_valid.metrics
{'eval_loss': 0.13674472272396088, 'eval_f1': 0.863226026230625}
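As a quick sanity check, we can also inspect the shapes of the raw outputs: the predictions are logits over the label classes for every token, while label_ids holds the corresponding label IDs (with -100 marking the positions we ignore). This is just an illustrative peek, not a required step:
# predictions: [num_examples, seq_len, num_labels]; label_ids: [num_examples, seq_len]
print(preds_valid.predictions.shape, preds_valid.label_ids.shape)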
The predictions and label IDs are not quite in a form suitable for
seqeval’s classification report, so let’s
align them using our align_predictions
function and print out the
classification report with the following function:
def generate_report(trainer, dataset):
    preds = trainer.predict(dataset)
    preds_list, label_list = align_predictions(preds.predictions, preds.label_ids)
    print(classification_report(label_list, preds_list, digits=4))
    return preds.metrics["eval_f1"]
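As a reminder, align_predictions was defined earlier in the chapter; a rough sketch of its logic is shown below. It takes the argmax over the logits and drops every position whose label ID is -100 (our IGN positions), mapping the remaining IDs back to tag strings via the index2tag lookup that we assume is still in scope:
import numpy as np

def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    preds_list, labels_list = [], []
    for batch_idx in range(batch_size):
        example_preds, example_labels = [], []
        for seq_idx in range(seq_len):
            # Skip special tokens and subword pieces labeled with -100 (IGN)
            if label_ids[batch_idx, seq_idx] != -100:
                example_preds.append(index2tag[preds[batch_idx][seq_idx]])
                example_labels.append(index2tag[label_ids[batch_idx][seq_idx]])
        preds_list.append(example_preds)
        labels_list.append(example_labels)
    return preds_list, labels_list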
To keep track of our performance per language, our generate_report function also returns the micro-averaged F1-score, which we’ll store in a nested dict:
from collections import defaultdict

f1_scores = defaultdict(dict)
f1_scores["de"]["de"] = generate_report(trainer, panx_de_encoded["test"])
              precision    recall  f1-score   support

         LOC     0.8596    0.8950    0.8769      3180
         ORG     0.7979    0.7765    0.7871      2573
         PER     0.9162    0.9225    0.9194      3071

   micro avg     0.8619    0.8700    0.8659      8824
   macro avg     0.8579    0.8647    0.8611      8824
weighted avg     0.8613    0.8700    0.8655      8824
These are pretty good results for a NER task. Our metrics are in the ballpark of 85% and we can see that the model seems to struggle the most on the ORG entities, probably because ORG entities are the least common in the training data and many organization names are rare in XLM-R’s vocabulary. How about on other languages? To warm up, let’s see how our model fine-tuned on German fares on French:
text_fr = "Jeff Dean est informaticien chez Google en Californie"
tag_text(text_fr, tags, trainer.model, xlmr_tokenizer)
Tokens | <s> | _Jeff | _De | an | _est | _informatic | ien | _chez | _Google | _en | _Cali | for | nie | </s> |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tags | O | B-PER | I-PER | I-PER | O | O | O | O | B-ORG | O | B-LOC | I-LOC | I-LOC | O |
Not bad! Although the name and organization are spelled the same in both languages, the model still managed to correctly label “Californie”, the French equivalent of “Kalifornien”. Next, let’s quantify how well our German model fares on the whole French test set by writing a simple function that encodes a dataset and generates the classification report on it:
def evaluate_zero_shot_performance(lang, trainer):
    panx_ds = encode_panx_dataset(panx_ch[lang])
    return generate_report(trainer, panx_ds["test"])
f1_scores["de"]["fr"] = evaluate_zero_shot_performance("fr", trainer)
              precision    recall  f1-score   support

         LOC     0.7239    0.7239    0.7239      1130
         ORG     0.6371    0.6407    0.6389       885
         PER     0.7207    0.7703    0.7447      1045

   micro avg     0.6981    0.7157    0.7068      3060
   macro avg     0.6939    0.7116    0.7025      3060
weighted avg     0.6977    0.7157    0.7064      3060
Although we see a drop of about 15 points in the micro-averaged metrics, remember that our model has not seen a single labeled French example! In general, the size of the performance drop is related to how “far away” the languages are from each other. Although German and French are both Indo-European languages, they technically belong to the different language families of Germanic and Romance, respectively.
Next, let’s evaluate the performance on Italian. Since Italian is also a Romance language, we expect to get a similar result as we found on French:
f1_scores["de"]["it"] = evaluate_zero_shot_performance("it", trainer)
              precision    recall  f1-score   support

         LOC     0.7143    0.7042    0.7092       426
         ORG     0.5961    0.6185    0.6071       346
         PER     0.6736    0.8094    0.7353       362

   micro avg     0.6647    0.7116    0.6874      1134
   macro avg     0.6613    0.7107    0.6839      1134
weighted avg     0.6652    0.7116    0.6864      1134
Indeed, our expectations are borne out by the macro-averaged metrics. Finally, let’s examine the performance on English, which belongs to the Germanic language family:
f1_scores["de"]["en"] = evaluate_zero_shot_performance("en", trainer)
              precision    recall  f1-score   support

         LOC     0.4847    0.6148    0.5421       283
         ORG     0.6034    0.6449    0.6235       276
         PER     0.6512    0.6932    0.6716       264

   micro avg     0.5722    0.6501    0.6086       823
   macro avg     0.5798    0.6510    0.6124       823
weighted avg     0.5779    0.6501    0.6109       823
Surprisingly, our model fares worst on English, even though we might intuitively expect English to be more similar to German than French is. Let’s next examine the trade-offs between zero-shot cross-lingual transfer and fine-tuning directly on the target language.
So far we’ve seen that fine-tuning XLM-R on the German
corpus yields an
In this section we will explore this question for the French corpus by fine-tuning XLM-R on training sets of increasing size. By tracking the performance this way, we can determine at which point zero-shot cross-lingual transfer is superior, which in practice can be useful for guiding decisions about whether to collect more labeled data.
Since we want to train several models, we’ll use the
model_init
feature of the Trainer
class so that we can instantiate a
fresh model with each call to Trainer.train
:
def model_init():
    return (XLMRobertaForTokenClassification
            .from_pretrained(xlmr_model_name, config=xlmr_config)
            .to(device))
For simplicity, we’ll also keep the same hyperparameters from the fine-tuning run on the German corpus, except that we’ll tweak TrainingArguments.logging_steps to account for the changing training set sizes. We can wrap this all together in a simple function that takes a DatasetDict object corresponding to a monolingual corpus, downsamples the training split to num_samples examples, and fine-tunes XLM-R on that sample, returning the F1-score on the test set:
def train_on_subset(dataset, num_samples):
    train_ds = dataset["train"].shuffle(seed=42).select(range(num_samples))
    valid_ds = dataset["validation"]
    test_ds = dataset["test"]
    training_args.logging_steps = len(train_ds) // batch_size

    trainer = Trainer(model_init=model_init, args=training_args,
                      data_collator=data_collator, compute_metrics=compute_metrics,
                      train_dataset=train_ds, eval_dataset=valid_ds,
                      tokenizer=xlmr_tokenizer)
    trainer.train()
    metrics = trainer.predict(test_ds).metrics
    return pd.DataFrame.from_dict(
        {"num_samples": [len(train_ds)], "f1_score": [metrics["eval_f1"]]})
As we did with fine-tuning on the German corpus, we also need to encode the French corpus into input IDs, attention masks, and label IDs:
panx_fr_encoded = encode_panx_dataset(panx_ch["fr"])
Next let’s check that our function works by running it on a small training set of 250 examples:
metrics_df = train_on_subset(panx_fr_encoded, 250)
metrics_df
num_samples | f1_score | |
---|---|---|
0 | 250 | 0.172973 |
We can see that with only 250 examples, fine-tuning on French under-performs the zero-shot transfer from German by a large margin. Let’s now increase our training set sizes to 500, 1,000, 2,000, and 4,000 examples to get an idea of how the performance increases:
for num_samples in [500, 1000, 2000, 4000]:
    metrics_df = metrics_df.append(
        train_on_subset(panx_fr_encoded, num_samples), ignore_index=True)
We can compare how fine-tuning on French samples of increasing size stacks up against zero-shot cross-lingual transfer from German by plotting the F1-scores on the test set as a function of the training set size.
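The plotting code is omitted here, but a minimal matplotlib sketch could look as follows; the zero-shot reference value is taken from f1_scores["de"]["fr"], which we computed earlier:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Dashed reference line: zero-shot transfer from the German model to French
ax.axhline(f1_scores["de"]["fr"], ls="--", color="r")
# F1-scores from fine-tuning directly on French subsets of increasing size
metrics_df.set_index("num_samples").plot(ax=ax, legend=False)
ax.set_xlabel("Number of Training Samples")
ax.set_ylabel("F1 Score")
ax.legend(["Zero-shot from de", "Fine-tuned on fr"], loc="lower right")
plt.show()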
From the plot we can see that zero-shot transfer remains competitive until about 750 training examples, after which fine-tuning on French reaches a similar level of performance to what we obtained when fine-tuning on German. Nevertheless, this result is not to be sniffed at! In our experience, getting domain experts to label even hundreds of documents can be costly, especially for NER, where the labeling process is fine-grained and time-consuming.
There is one final technique we can try to evaluate multilingual learning: fine-tune on multiple languages at once! Let’s see how we can do this in the next section.
So far we’ve seen that zero-shot cross-lingual transfer from
German to French or Italian produces a drop of around 15 points in
performance. One way to mitigate this is by fine-tuning on multiple
languages at the same time! To see what type of gains we can get,
let’s first use the concatenate_datasets
function from
Datasets to concatenate the German and French corpora together:
from datasets import DatasetDict, concatenate_datasets

def concatenate_splits(corpora):
    multi_corpus = DatasetDict()
    for split in corpora[0].keys():
        multi_corpus[split] = concatenate_datasets(
            [corpus[split] for corpus in corpora]).shuffle(seed=42)
    return multi_corpus
panx_de_fr_encoded = concatenate_splits([panx_de_encoded, panx_fr_encoded])
For training, we’ll also use the same hyperparameters from the previous sections, so we can simply update the logging steps, model, and datasets in the trainer:
training_args.logging_steps = len(panx_de_fr_encoded["train"]) // batch_size
trainer.train_dataset = panx_de_fr_encoded["train"]
trainer.eval_dataset = panx_de_fr_encoded["validation"]
trainer.train();
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 0.206954 | 0.276173 | 0.827138 |
2 | 0.205289 | 0.276173 | 0.827138 |
3 | 0.206151 | 0.276173 | 0.827138 |
This model achieves an F1-score of around 83% on the combined validation set. Let’s now see how fine-tuning on German and French together affects the zero-shot performance on Italian:
evaluate_zero_shot_performance("it", trainer);
              precision    recall  f1-score   support

         LOC     0.7143    0.7042    0.7092       426
         ORG     0.5961    0.6185    0.6071       346
         PER     0.6736    0.8094    0.7353       362

   micro avg     0.6647    0.7116    0.6874      1134
   macro avg     0.6613    0.7107    0.6839      1134
weighted avg     0.6652    0.7116    0.6864      1134
Wow, this is roughly a 10-point improvement compared to the model fine-tuned only on German, which scored an F1-score of about 69% on the Italian test set. Let’s also check how the performance on English changes:
evaluate_zero_shot_performance("en", trainer);
              precision    recall  f1-score   support

         LOC     0.4847    0.6148    0.5421       283
         ORG     0.6034    0.6449    0.6235       276
         PER     0.6512    0.6932    0.6716       264

   micro avg     0.5722    0.6501    0.6086       823
   macro avg     0.5798    0.6510    0.6124       823
weighted avg     0.5779    0.6501    0.6109       823
Here we also see a significant boost in zero-shot performance of 7-8 points, with most of the gain coming from a dramatic improvement on the PER entities! Apparently the Norman Conquest of 1066 left a lasting mark on the English language.
Let’s round out our analysis by comparing the performance of
fine-tuning on each language separately against multilingual learning on
all the corpora. Since we have already fine-tuned on the German corpus,
we can fine-tune on the remaining languages with our train_on_subset
function, but where num_samples
is equal to the number of examples in
the training set:
corpora = [panx_de_encoded]
# exclude German from iteration
for lang in langs[1:]:
    # fine-tune on monolingual corpus
    ds_encoded = encode_panx_dataset(panx_ch[lang])
    metrics = train_on_subset(ds_encoded, ds_encoded["train"].num_rows)
    # collect F1-scores in common dict
    f1_scores[lang][lang] = metrics["f1_score"][0]
    # add monolingual corpus to list of corpora to concatenate
    corpora.append(ds_encoded)
Now that we’ve fine-tuned on each language’s
corpus, the next step is to concatenate all the splits together to
create a multilingual corpus of all four languages. As we did with the
previous German and French analysis, we can use our concatenate_splits
function to do this step for us on the list of coropora we generate in
the previous step:
corpora_encoded = concatenate_splits(corpora)
Now that we have our multilingual corpus, we can run the familiar steps with the trainer:
training_args.logging_steps = len(corpora_encoded["train"]) // batch_size
trainer.train_dataset = corpora_encoded["train"]
trainer.eval_dataset = corpora_encoded["validation"]
trainer.train();
Epoch | Training Loss | Validation Loss | F1 |
---|---|---|---|
1 | 0.307639 | 0.199173 | 0.819988 |
2 | 0.160570 | 0.162879 | 0.849782 |
3 | 0.101694 | 0.171258 | 0.854648 |
The final step is to generate the predictions from the trainer on each language’s test set. This will give us insight into how well multilingual learning is really working. We’ll collect the F1-scores in the f1_scores dictionary and then create a DataFrame that summarizes the main results from our multilingual experiments:
for idx, lang in enumerate(langs):
    f1_scores["all"][lang] = (trainer.predict(corpora[idx]["test"])
                              .metrics["eval_f1"])
scores_data = {"de": f1_scores["de"],
               "each": {lang: f1_scores[lang][lang] for lang in langs},
               "all": f1_scores["all"]}
f1_scores_df = pd.DataFrame.from_dict(scores_data, orient="index").round(4)
f1_scores_df.index.name = "Fine-tune on"
f1_scores_df
de | fr | it | en | |
---|---|---|---|---|
Fine-tune on | ||||
de | 0.8659 | 0.7068 | 0.6874 | 0.6086 |
each | 0.8659 | 0.8365 | 0.8161 | 0.7164 |
all | 0.8688 | 0.8646 | 0.8594 | 0.7512 |
From these results we can draw a few general conclusions:
Multilingual learning can provide significant gains in performance, especially if the low-resource languages for cross-lingual transfer belong to similar language families. In our experiments we can see that German, French, and Italian achieve similar performance in the all category, suggesting that these languages are more similar to each other than to English.
As a general strategy, it is a good idea to focus attention on cross-lingual transfer within language families, especially when dealing with languages that use a different script, like Japanese.
Although the Trainer
object is useful for training and evaluation, in
production we would like to be able to pass raw text as input and
receive the model’s predictions as output. Fortunately,
there is a way to do that using the Transformers pipeline abstraction!
For named entity recognition, we can use the
TokenClassificationPipeline
so we just need to load the model and
tokenizer and wrap them as follows:
from transformers import TokenClassificationPipeline

pipeline = TokenClassificationPipeline(trainer.model.to("cpu"), xlmr_tokenizer,
                                       grouped_entities=True)
Note that we moved the model to the cpu: for single, short inputs like this, inference is fast enough on a CPU, and production environments often don’t have a GPU available. Once the pipeline is loaded, we can then pass raw text to retrieve the predictions in a structured format:
pipeline(text_de)
[{'entity_group': 'PER', 'score': 0.9977577924728394, 'word': 'Jeff Dean', 'start': 0, 'end': 9}, {'entity_group': 'ORG', 'score': 0.984943151473999, 'word': 'Google', 'start': 35, 'end': 41}, {'entity_group': 'LOC', 'score': 0.9276596009731293, 'word': 'Kalifornien', 'start': 45, 'end': 56}]
By inspecting the output we see that each entity group is given a predicted entity type, a confidence score, and the start and end character indices that locate it in the original text.
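If a tabular view is more convenient, the list of dictionaries can be loaded straight into a DataFrame for inspection; this is purely a convenience step and assumes pandas is already imported as pd:
# Each detected entity group becomes one row with its type, score, and character span
pd.DataFrame(pipeline(text_de))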
In this chapter we saw how one can tackle an NLP task on a multilingual corpus using a single Transformer pretrained on 100 languages: XLM-R. Although we were able to show that cross-lingual transfer from German to French is competitive when only a small number of labeled examples are available for fine-tuning, this good performance generally does not occur if the target language is significantly different from German or was not one of the 100 languages used during pretraining. In such cases, poor performance can be traced to a lack of model capacity in both the vocabulary and the space of cross-lingual representations. Recent proposals like MAD-X8 are designed precisely for these low-resource scenarios, and since MAD-X is built on top of Transformers you can easily adapt the code in this chapter to work with it!
In this chapter we saw that cross-lingual transfer helps improve the performance on tasks in a language where labels are scarce. In the next chapter we will see how to deal with few labels in cases where we can’t use cross-lingual transfer, for example if there is no language with many labels available.
1 Unsupervised Cross-lingual Representation Learning at Scale, A. Conneau et al. (2019)
2 XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization, J. Hu et al. (2020)
3 Cross-lingual Name Tagging and Linking for 282 Languages, X. Pan et al. (2017)
4 Released as part of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, J. Devlin et al. (2018)
5 RoBERTa: A Robustly Optimized BERT Pretraining Approach, Y. Liu et al. (2019)
6 SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, T. Kudo and J. Richardson (2018)
7 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, J. Devlin et al. (2018)
8 MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer, J. Pfeiffer et al. (2020)