3

Pretraining a RoBERTa Model from Scratch

In this chapter, we will build a RoBERTa model from scratch, using only the bricks of the Transformer construction kit that BERT-like models require. No pretrained tokenizers or models will be used. The RoBERTa model will be built following the fifteen-step process described in this chapter.

We will use the knowledge of transformers acquired in the previous chapters to build a model that can perform language modeling on masked tokens step by step. In Chapter 1, Getting Started with the Model Architecture of the Transformer, we went through the building blocks of the original Transformer. In Chapter 2, Fine-Tuning BERT Models, we fine-tuned a pretrained BERT model.

This chapter will focus on building a pretrained transformer model from scratch using a Jupyter notebook based on Hugging Face's seamless modules. The model is named KantaiBERT.

KantaiBERT first loads a compilation of Immanuel Kant books created for this chapter. We will see how the data was obtained and how you can create your own datasets for this notebook.

KantaiBERT trains its own tokenizer from scratch. It will build its merge and vocabulary files, which will be used during the pretraining process.

KantaiBERT then processes the dataset, initializes a trainer, and trains the model.

Finally, KantaiBERT uses the trained model to perform an experimental downstream language modeling task and fills a mask using Immanuel Kant's logic.

By the end of the chapter, you will know how to build a transformer model from scratch.

This chapter covers the following topics:

  • RoBERTa- and DistilBERT-like models
  • How to train a tokenizer from scratch
  • Byte-level byte-pair encoding
  • Saving the trained tokenizer to files
  • Recreating the tokenizer for the pretraining process
  • Initializing a RoBERTa model from scratch
  • Exploring the configuration of the model
  • Exploring the 80 million parameters of the model
  • Building the dataset for the trainer
  • Initializing the trainer
  • Pretraining the model
  • Saving the model
  • Applying the model to the downstream tasks of masked language modeling

Our first step will be to describe the transformer model that we are going to build.

Training a tokenizer and pretraining a transformer

In this chapter, we will train a transformer model named KantaiBERT using the building blocks provided by Hugging Face for BERT-like models. We covered the theory of the building blocks of the model we will be using in Chapter 2, Fine-Tuning BERT Models.

We will describe KantaiBERT, building on the knowledge we acquired in the previous chapters.

KantaiBERT is a Robustly Optimized BERT Pretraining Approach (RoBERTa)-like model based on the architecture of BERT.

The initial BERT models were undertrained. RoBERTa increases the performance of pretraining transformers for downstream tasks. RoBERTa has improved the mechanics of the pretraining process. For example, it does not use WordPiece tokenization but goes down to byte-level Byte Pair Encoding (BPE).

In this chapter, KantaiBERT, like BERT, will be trained using masked language modeling.

KantaiBERT will be trained as a small model with 6 layers, 12 heads, and 84,095,008 parameters. It might seem that 84 million parameters represent a large number of parameters. However, the parameters are spread over 6 layers and 12 heads, which makes it relatively small. A small model will make the pretraining experience smooth so that each step can be viewed in real time without waiting for hours to see a result.

KantaiBERT is a DistilBERT-like model because it has the same architecture of 6 layers and 12 heads. DistilBERT is a distilled version of BERT. We know that large models provide excellent performance. But what if you want to run a model on a smartphone? Miniaturization has been the key to technological evolution. Transformers will have to follow the same path during implementation. The Hugging Face approach using a distilled version of BERT is thus a good step forward. Distillation, or other such methods in the future, is a clever way of taking the best of pretraining and making it efficient for the needs of many downstream tasks.

KantaiBERT will implement a byte-level byte-pair encoding tokenizer like the one used by GPT-2. The special tokens will be the ones used by RoBERTa. BERT models most often use a WordPiece tokenizer.

There are no token type IDs to indicate which segment a token belongs to. Instead, segments are separated with the separation token </s>.

KantaiBERT will use a custom dataset, train a tokenizer, train the transformer model, save it, and run it with a masked language modeling example.

Let's get going and build a transformer from scratch.

Building KantaiBERT from scratch

We will build KantaiBERT in 15 steps from scratch and then run it on a masked language modeling example.

Open Google Colaboratory (you need a Gmail account). Then upload KantaiBERT.ipynb, which is on GitHub in this chapter's directory.

The titles of the 15 steps of this section are similar to the titles of the cells of the notebook, which makes it easy to follow.

Let's start by loading the dataset.

Step 1: Loading the dataset

Ready-to-use datasets provide an objective way to train and compare transformers. In Chapter 4, Downstream NLP Tasks with Transformers, we will explore several datasets. However, the goal of this chapter is to understand the training process of a transformer with notebook cells that could be run in real time without having to wait for hours to obtain a result.

I chose to use the works of Immanuel Kant (1724-1804), the German philosopher, who was the epitome of the Age of Enlightenment. The idea is to introduce human-like logic and pretrained reasoning for downstream reasoning tasks.

Project Gutenberg, https://www.gutenberg.org, offers a wide range of free eBooks that can be downloaded in text format. You can use other books if you want to create customized datasets of your own based on books.

I compiled the following three books by Immanuel Kant into a text file named kant.txt:

  • The Critique of Pure Reason
  • The Critique of Practical Reason
  • Fundamental Principles of the Metaphysic of Morals

kant.txt provides a small training dataset to train the transformer model of this chapter. The result obtained remains experimental. For a real-life project, I would add the complete works of Immanuel Kant, René Descartes, Pascal, and Leibniz, for example.

The text file contains the raw text of the books:

…For it is in reality vain to profess _indifference_ in regard to such
inquiries, the object of which cannot be indifferent to humanity.

You can load kant.txt, which is in the directory of this chapter on GitHub, using Colab's file manager. Alternatively, the following cell downloads it automatically from GitHub with curl:

#@title Step 1: Loading the Dataset
# 1. Load kant.txt using the Colab file manager
# 2. Or download the file from GitHub with curl
!curl -L https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/master/Chapter03/kant.txt --output "kant.txt"

You can see it appear in the Colab file manager pane once you have loaded or downloaded it:

Figure 3.1: Colab file manager

Note that Google Colab deletes the files when you restart the VM.

The dataset is defined and loaded.

Note: Do not run the subsequent cells without kant.txt. Training data is a prerequisite.

Now, the program will install the Hugging Face transformers.

Step 2: Installing Hugging Face transformers

We will need to install Hugging Face transformers and tokenizers, but we will not need TensorFlow in this instance of the Google Colab VM:

#@title Step 2: Installing Hugging Face Transformers
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.9.1
# tokenizers version at notebook update --- 0.7.0

The output displays the versions installed:

Successfully built transformers
tokenizers               0.7.0          
transformers             2.10.0

Transformer versions are evolving at quite a speed. The version you run may differ and be displayed differently.

The program will now begin by training a tokenizer.

Step 3: Training a tokenizer

In this section, the program does not use a pretrained tokenizer. For example, a pretrained GPT-2 tokenizer could be used. However, the training process in this chapter includes training a tokenizer from scratch.

Hugging Face's ByteLevelBPETokenizer() will be trained using kant.txt. A byte-level tokenizer will break a string or word down into a sub-string or sub-word. There are two main advantages among many others:

  • The tokenizer can break words into minimal components. Then it will merge these small components into statistically interesting ones. For example, "smaller" and "smallest" can become "small," "er," and "est." The tokenizer can go further, and we could obtain "sm" and "all," for example. In any case, the words are broken down into sub-word tokens and smaller units of sub-word parts such as "sm" and "all" instead of simply "small."
  • The chunks of strings classified as unknown (unk_token) tokens, which are common with WordPiece-level encoding, will practically disappear.

In this model, we will be training the tokenizer with the following parameters:

  • files=paths is the path to the dataset.
  • vocab_size=52_000 is the size of our tokenizer's vocabulary.
  • min_frequency=2 is the minimum frequency threshold.
  • special_tokens=[] is a list of special tokens.

In this case, the list of special tokens is:

  • <s>: a start token
  • <pad>: a padding token
  • </s>: an end token
  • <unk>: an unknown token
  • <mask>: the mask token for language modeling

The tokenizer will be trained to generate merged sub-string tokens and analyze their frequency.

Let's take these two words in the middle of a sentence:

…the tokenizer…

The first step will be to tokenize the string:

'Ġthe', 'Ġtoken', 'izer'

The string is now tokenized into tokens with Ġ (whitespace) information.

The next step is to replace them with their indices:

Token       Index
'Ġthe'      150
'Ġtoken'    5430
'izer'      4712

The program runs the tokenizer as expected:

#@title Step 3: Training a Tokenizer
%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

The tokenizer outputs the time taken to train:

CPU times: user 14.8 s, sys: 14.2 s, total: 29 s
Wall time: 7.72 s

The tokenizer is trained and is ready to be saved.
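Before saving it, you can check the mapping shown earlier for yourself. The following is a minimal sketch; the exact ids depend on your training run:

# A minimal sketch: checking the trained tokenizer (the ids depend on your run)
enc = tokenizer.encode(" the tokenizer")
print(enc.tokens)   # e.g. ['Ġthe', 'Ġtoken', 'izer']
print(enc.ids)      # e.g. [150, 5430, 4712]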

Step 4: Saving the files to disk

The tokenizer will generate two files when trained:

  • merges.txt, which contains the merged tokenized sub-strings
  • vocab.json, which contains the indices of the tokenized sub-strings

The program first creates the KantaiBERT directory and then saves the two files:

#@title Step 4: Saving the files to disk
import os
token_dir = '/content/KantaiBERT'
if not os.path.exists(token_dir):
  os.makedirs(token_dir)
tokenizer.save_model('KantaiBERT')

The program output shows that the two files have been saved:

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

The two files should appear in the file manager pane:

Figure 3.2: Colab file manager

The files in this example are small. You can double-click on them to view their contents.

merges.txt contains the tokenized sub-strings as planned:

#version: 0.2 - Trained by `huggingface/tokenizers`
Ġ t
h e
Ġ a
o n
i n
Ġ o
Ġt he
r e
i t
Ġo f

vocab.json contains the indices:

[…,"Ġthink":955,"preme":956,"ĠE":957,"Ġout":958,"Ġdut":959,"aly":960,"Ġexp":961,…]

The trained tokenized dataset files are ready to be processed.

Step 5: Loading the trained tokenizer files

We could have loaded pretrained tokenizer files. However, we trained our own tokenizer and now are ready to load the files:

#@title Step 5: Loading the Trained Tokenizer Files
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
tokenizer = ByteLevelBPETokenizer(
    "./KantaiBERT/vocab.json",
    "./KantaiBERT/merges.txt",
)

The tokenizer can encode a sequence:

tokenizer.encode("The Critique of Pure Reason.").tokens

"The Critique of Pure Reason" will become:

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']

We can also ask to see the number of tokens in this sequence:

tokenizer.encode("The Critique of Pure Reason.")

The output will show that there are 6 tokens in the sequence:

Encoding(num_tokens=6, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

The tokenizer now processes the tokens to fit the BERT model variant used in this notebook. The post processor will add a start and end token, for example:

tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

Let's encode a post-processed sequence:

tokenizer.encode("The Critique of Pure Reason.")

The output shows that we now have 8 tokens:

Encoding(num_tokens=8, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

If we want to see what was added, we can ask the tokenizer to encode the post-processed sequence by running the following cell:

tokenizer.encode("The Critique of Pure Reason.").tokens

The output shows that the start and end tokens have been added, which brings the number of tokens to 8:

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']

The tokenized data is now ready for the training process. We will now check the system information of the machine we are running the notebook on.

Step 6: Checking resource constraints: GPU and CUDA

KantaiBERT runs at optimal speed with a Graphics Processing Unit (GPU).

We will first run a command to see if an NVIDIA GPU card is present:

#@title Step 6: Checking Resource Constraints: GPU and NVIDIA 
!nvidia-smi

The output displays the information and version on the card:

Figure 3.3: Information on the NVIDIA card

We will now check to make sure PyTorch sees CUDA:

#@title Checking that PyTorch Sees CUDA
import torch
torch.cuda.is_available()

The results should be True:

True

Compute Unified Device Architecture (CUDA) was developed by NVIDIA to use the parallel computing power of its NVIDIA card.
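The Trainer we will create in Step 12 moves the model to the GPU automatically, but you can select the device explicitly if you want to run your own experiments. A minimal sketch:

# A sketch: selecting the device manually (the Trainer does this for us)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)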

We are now ready to define the configuration of the model.

Step 7: Defining the configuration of the model

We will be pretraining a RoBERTa-type transformer model using the same number of layers and heads as a DistilBERT transformer. The model will have a vocabulary size set to 52,000, 12 attention heads, and 6 layers:

#@title Step 7: Defining the configuration of the Model
from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
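If you want to see the full configuration right away, you can print it. The values we did not set, such as hidden_size=768 and intermediate_size=3072, come from the RobertaConfig defaults:

print(config)   # displays both the values we set and the defaults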

We will explore the configuration in more detail in Step 9: Initializing a model from scratch.

Let's first recreate the tokenizer in our model.

Step 8: Reloading the tokenizer in transformers

We are now ready to load our trained tokenizer, which is our pretrained tokenizer, with RobertaTokenizer.from_pretrained():

#@title Step 8: Re-creating the Tokenizer in Transformers
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT", max_length=512)
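A quick check confirms that the reloaded tokenizer adds the special tokens we defined. This is a minimal sketch; the ids depend on your vocabulary:

# A sketch: the reloaded tokenizer adds <s> and </s> automatically
ids = tokenizer.encode("The Critique of Pure Reason.")
print(ids)                                   # token ids, including <s> and </s>
print(tokenizer.convert_ids_to_tokens(ids))  # ['<s>', 'The', 'ĠCritique', ..., '</s>']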

Now that we have loaded our trained tokenizer, let's initialize a RoBERTa model from scratch.

Step 9: Initializing a model from scratch

In this section, we will initialize a model from scratch and examine the size of the model.

The program first imports a RoBERTa masked model for language modeling:

#@title Step 9: Initializing a Model From Scratch
from transformers import RobertaForMaskedLM

The model is initialized with the configuration defined in Step 7:

model = RobertaForMaskedLM(config=config)

If we print the model, we can see that it is a BERT-like model with 6 layers and 12 heads:

print(model)

The building blocks of the encoder of the original Transformer model are present with different dimensions, as shown in this excerpt of the output:

RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(52000, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
…/…
        

Take some time to go through the details of the output of the configuration before continuing. You will get to know the model from the inside.

The LEGO® type building blocks of transformers make it fun to analyze. For example, you will note that dropout regularization is present throughout the sub-layers.

Now, let's explore the parameters.

Exploring the parameters

The model is small and contains 84,095,008 parameters.

We can check its size:

print(model.num_parameters())

The output shows the number of parameters, which might vary from one transformers version to another:

84095008
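An equivalent one-liner, which we can use to cross-check the count we will compute by hand below, is a standard PyTorch idiom:

# A sketch: cross-checking the parameter count with PyTorch
print(sum(p.numel() for p in model.parameters()))   # 84095008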

Let's now look into the parameters. We first store the parameters in LP and calculate the length of the list of parameters:

#@title Exploring the Parameters
LP=list(model.parameters())
lp=len(LP)
print(lp)

The output shows that there are 108 matrices and vectors, a number that might vary from one transformer model to another:

108

Now, let's display the 108 matrices and vectors in the tensors that contain them:

for p in range(0,lp):
  print(LP[p])

The output displays all the parameters as shown in the following excerpt of the output:

Parameter containing:
tensor([[-0.0175, -0.0210, -0.0334,  ...,  0.0054, -0.0113,  0.0183],
        [ 0.0020, -0.0354, -0.0221,  ...,  0.0220, -0.0060, -0.0032],
        [ 0.0001, -0.0002,  0.0036,  ..., -0.0265, -0.0057, -0.0352],
        ...,
        [-0.0125, -0.0418,  0.0190,  ..., -0.0069,  0.0175, -0.0308],
        [ 0.0072, -0.0131,  0.0069,  ...,  0.0002, -0.0234,  0.0042],
        [ 0.0008,  0.0281,  0.0168,  ..., -0.0113, -0.0075,  0.0014]],
       requires_grad=True)

Take a few minutes to peek inside the parameters to add to your understanding of how transformers are built.

The number of parameters is calculated by taking all parameters in the model and adding them up; for example:

  • The vocabulary (52,000) x dimensions (768)
  • The size of many vectors is 1 x 768
  • The many other dimensions found

You will note that d_model = 768. There are 12 heads in the model. The dimension of d_k for each head will thus be d_model / 12 = 768 / 12 = 64. This shows, once again, the optimized LEGO® concept of the building blocks of a transformer.

We will now see how the number of parameters of a model is calculated and how the figure 84,095,008 is reached.

If we hover over LP in the notebook, we will see some of the shapes of the Torch tensors:

Figure 3.4: LP

Note that all of the numbers we are displaying might vary depending on the version of the transformers module we are using.

We will take this further and count the number of parameters of each tensor.

First, the program initializes a parameter counter named np (number of parameters) and goes through the lp (108) number of elements in the list of parameters:

#@title Counting the parameters
np=0
for p in range(0,lp):#number of tensors

The parameters are matrices and vectors of different sizes; for example:

  • 768 x 768
  • 768 x 1
  • 768

We can see that some parameters are two-dimensional, and some are one-dimensional.

An easy way to find out is to try and see if a parameter p in the list LP[p] has two dimensions or not:

  PL2=True
  try:
    L2=len(LP[p][0]) #check if 2D
  except:
    L2=1             #not 2D but 1D
    PL2=False
  

If the parameter has two dimensions, its second dimension will be L2>0 and PL2=True (2 dimensions=True). If the parameter has only one dimension, its second dimension will be L2=1 and PL2=False (2 dimensions=False).

L1 is the size of the first dimension of the parameter. L3, the total size of the parameter, is then defined by:

L1=len(LP[p])      
L3=L1*L2

We can now add the parameters up at each step of the loop:

np+=L3             # number of parameters per tensor

We will obtain the sum of the parameters, but we also want to see exactly how the number of parameters of a transformer model is calculated:

  if PL2==True:
    print(p,L1,L2,L3)  # displaying the sizes of the parameters
  if PL2==False:
    print(p,L1,L3)  # displaying the sizes of the parameters
print(np)              # total number of parameters

Note that if a parameter only has one dimension, PL2=False, then we only display the first dimension.
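Putting the fragments above together, the complete counting cell looks like the following sketch; it produces the same total as model.num_parameters():

# Counting the parameters (consolidated sketch)
np = 0                        # parameter counter
for p in range(0, lp):        # loop over the 108 tensors
  PL2 = True
  try:
    L2 = len(LP[p][0])        # the tensor is 2D
  except:
    L2 = 1                    # the tensor is 1D
    PL2 = False
  L1 = len(LP[p])             # size of the first dimension
  L3 = L1 * L2                # number of parameters in this tensor
  np += L3                    # running total
  if PL2 == True:
    print(p, L1, L2, L3)      # 2D: index, rows, columns, size
  if PL2 == False:
    print(p, L1, L3)          # 1D: index, length, size
print(np)                     # total number of parameters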

The output is the list of how the number of parameters was calculated for all the tensors in the model, as shown in the following excerpt:

0 52000 768 39936000
1 514 768 394752
2 1 768 768
3 768 768
4 768 768
5 768 768 589824
6 768 768
7 768 768 589824
8 768 768
9 768 768 589824
10 768 768

The total number of parameters of the RoBERTa model is displayed at the end of the list:

84,095,008

The number of parameters might vary with the version of the libraries used.

We now know precisely what the number of parameters represents in a transformer model.

Take a few minutes to go back and look at the output of the configuration, the content of the parameters, and the size of the parameters.

At this point, you will have a precise mental representation of the building blocks of the model.

The program now builds the dataset.

Step 10: Building the dataset

The program will now load the dataset line by line for batch training with block_size=128 limiting the length of an example:

#@title Step 10: Building the Dataset
%%time
from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./kant.txt",
    block_size=128,
)

The output shows that Hugging Face has invested a considerable amount of resources into optimizing the time it takes to process data:

CPU times: user 8.48 s, sys: 234 ms, total: 8.71 s
Wall time: 3.88 s

The wall time, the actual elapsed time, is lower than the total CPU time because the work is spread over several processes.
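You can take a quick look at the encoded dataset before moving on. This is a sketch; depending on your transformers version, an item is either a tensor of token ids or a dictionary containing an input_ids tensor:

# A sketch: peeking at the encoded dataset
print(len(dataset))   # number of encoded lines
print(dataset[0])     # the first encoded example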

The program will now define a data collator, which prepares batches of samples for the backpropagation process.

Step 11: Defining a data collator

We need to define a data collator before initializing the trainer. A data collator will take samples from the dataset and collate them into batches. The results are dictionary-like objects.

We are preparing a batched sample process for Masked Language Modeling (MLM) by setting mlm=True.

We also set mlm_probability=0.15, which determines the percentage of tokens masked during the pretraining process.

We now initialize data_collator with our tokenizer, MLM activated, and the proportion of masked tokens set to 0.15:

#@title Step 11: Defining a Data Collator
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
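To see what the collator produces, you can collate a few examples by hand. The following is a sketch for recent transformers versions, where the collator object is callable; older 2.x versions expose a collate_batch() method instead. In the labels tensor, positions that were not masked are set to -100 so that the loss ignores them:

# A sketch: inspecting a collated batch (recent transformers versions)
batch = data_collator([dataset[i] for i in range(4)])
print(batch["input_ids"].shape)   # (4, padded sequence length)
print(batch["labels"][0])         # -100 everywhere except at the masked positions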

We are now ready to initialize the trainer.

Step 12: Initializing the trainer

The previous steps have prepared the information required to initialize the trainer. The dataset has been tokenized and loaded. Our model is built. The data collator has been created.

The program can now initialize the trainer. For educational purposes, the program trains the model quickly. The number of epochs is limited to one. The GPU comes in handy since it lets us process the training batches in parallel:

#@title Step 12: Initializing the Trainer
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

The model is now ready for training.

Step 13: Pretraining the model

Everything is ready. The trainer is launched with one line of code:

#@title Step 13: Pre-training the Model
%%time
trainer.train()

The output displays the training process in real time showing the loss, learning rate, epoch, and steps:

Epoch: 100%
1/1 [17:59<00:00, 1079.91s/it]
Iteration: 100%
2672/2672 [17:59<00:00, 2.47it/s]
{"loss": 5.6455852394104005, "learning_rate": 4.06437125748503e-05, "epoch": 0.18712574850299402, "step": 500}
{"loss": 4.940259679794312, "learning_rate": 3.12874251497006e-05, "epoch": 0.37425149700598803, "step": 1000}
{"loss": 4.639936000347137, "learning_rate": 2.1931137724550898e-05, "epoch": 0.561377245508982, "step": 1500}
{"loss": 4.361462069988251, "learning_rate": 1.2574850299401197e-05, "epoch": 0.7485029940119761, "step": 2000}
{"loss": 4.228510192394257, "learning_rate": 3.218562874251497e-06, "epoch": 0.9356287425149701, "step": 2500}
CPU times: user 11min 36s, sys: 6min 25s, total: 18min 2s
Wall time: 17min 59s
TrainOutput(global_step=2672, training_loss=4.7226536670130885)

The model has been trained. It's time to save our work.

Step 14: Saving the final model (+tokenizer + config) to disk

We will now save the model and configuration:

#@title Step 14: Saving the Final Model(+tokenizer + config) to disk
trainer.save_model("./KantaiBERT")

Click on Refresh in the file manager and the files should appear:

Figure 3.5: Colab file manager

config.json, pytorch_model.bin, and training_args.bin should now appear in the file manager.

merges.txt and vocab.json contain the pretrained tokenization of the dataset.
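If you need to come back to this work later, you can reload the saved model and tokenizer from the same directory. A minimal sketch:

# A sketch: reloading the saved model and tokenizer
from transformers import RobertaForMaskedLM, RobertaTokenizer
model = RobertaForMaskedLM.from_pretrained("./KantaiBERT")
tokenizer = RobertaTokenizer.from_pretrained("./KantaiBERT")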

We have built a model from scratch.

Let's import the pipeline to perform a language modeling task with our pretrained model and tokenizer.

Step 15: Language modeling with FillMaskPipeline

We will now import a language modeling fill-mask task. We will use our trained model and trained tokenizer to perform masked language modeling:

#@title Step 15: Language Modeling with the FillMaskPipeline
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="./KantaiBERT",
    tokenizer="./KantaiBERT"
)

We can now ask our model to think like Immanuel Kant:

fill_mask("Human thinking involves human <mask>.")

The output will likely change after each run because we are pretraining the model from scratch with a limited amount of data. However, the output obtained in this run is interesting because it introduces conceptual language modeling:

[{'score': 0.022831793874502182,
  'sequence': '<s> Human thinking involves human reason.</s>',
  'token': 393},
 {'score': 0.011635891161859035,
  'sequence': '<s> Human thinking involves human object.</s>',
  'token': 394},
 {'score': 0.010641072876751423,
  'sequence': '<s> Human thinking involves human priori.</s>',
  'token': 575},
 {'score': 0.009517930448055267,
  'sequence': '<s> Human thinking involves human conception.</s>',
  'token': 418},
 {'score': 0.00923212617635727,
  'sequence': '<s> Human thinking involves human experience.</s>',
  'token': 531}]
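If you prefer a more compact display of the predictions, you can loop over the results. A minimal sketch:

# A sketch: displaying the predictions compactly
for prediction in fill_mask("Human thinking involves human <mask>."):
  print(round(prediction["score"], 4), prediction["sequence"])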

The predictions might vary at each run and each time Hugging Face updates its models.

However, the following output comes out often:

Human thinking involves human reason

The goal here was to see how to train a transformer model. We can see that very interesting human-like predictions can be made.

These results are experimental and subject to variations during the training process. They will change each time we train the model again.

The model would require much more data from other Age of Enlightenment thinkers.

However, the goal of this model is to show that we can create datasets to train a transformer for a specific type of complex language modeling task.

Thanks to the Transformer, we are only at the beginning of a new era of AI!

Next steps

You have trained a transformer from scratch. Take some time to imagine what you could do in your personal or corporate environment. You could create a dataset for a specific task and train it from scratch. Use your areas of interest or company projects to experiment with the fascinating world of transformer construction kits!

Once you have made a model you like, you can share it with the Hugging Face community. Your model will appear on the Hugging Face models page: https://huggingface.co/models

You can upload your model in a few steps using the instructions described on this page: https://huggingface.co/transformers/model_sharing.html

You can also download models the Hugging Face community has shared to get new ideas for your personal and professional projects.

Summary

In this chapter, we built KantaiBERT, a RoBERTa-like model transformer, from scratch using the construction blocks provided by Hugging Face.

We first started by loading a customized dataset on a specific topic related to the works of Immanuel Kant. You can load an existing dataset or create your own depending on your goals. We saw that using a customized dataset provides insights into the way a transformer model thinks. However, this experimental approach has its limits. It would take a much larger dataset to train a model beyond educational purposes.

The KantaiBERT project was used to train a tokenizer on the kant.txt dataset. The trained merges.txt and vocab.json files were saved. A tokenizer was recreated with our pretrained files. KantaiBERT built the customized dataset and defined a data collator to process the training batches for backpropagation. The trainer was initialized, and we explored the parameters of the RoBERTa model in detail. The model was trained and saved.

Finally, the saved model was loaded for a downstream language modeling task. The goal was to fill the mask using Immanuel Kant's logic.

The door is now wide open for you to experiment on existing or customized datasets to see what results you obtain. You can share your model with the Hugging Face community. Transformers are data-driven. You can use this to your advantage to discover new ways of using transformers.

In the next chapter, Downstream NLP Tasks with Transformers, we will discover yet another innovative architecture of transformers.

Questions

  1. RoBERTa uses a byte-level byte-pair encoding tokenizer. (True/False)
  2. A trained Hugging Face tokenizer produces merges.txt and vocab.json. (True/False)
  3. RoBERTa does not use token type IDs. (True/False)
  4. DistilBERT has 6 layers and 12 heads. (True/False)
  5. A transformer model with 80 million parameters is enormous. (True/False)
  6. We cannot train a tokenizer. (True/False)
  7. A BERT-like model has 6 decoder layers. (True/False)
  8. Masked language modeling predicts a word contained in a mask token in a sentence. (True/False)
  9. A BERT-like model has no self-attention sub-layers. (True/False)
  10. Data collators are helpful for backpropagation. (True/False)
