© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
S. M. Jain, Introduction to Transformers for NLP, https://doi.org/10.1007/978-1-4842-8844-3_6

6. Fine-Tuning Pretrained Models

Shashank Mohan Jain
Bangalore, India

So far we have seen how to use the huggingface APIs and their pretrained models to create simple applications. Wouldn’t it be amazing if you could train your own model using only your own data?

Utilizing transfer learning is the most effective strategy to take if you do not have a large amount of spare time or computing resources at your disposal. There are two main advantages of utilizing transfer learning with Hugging Face as opposed to starting from scratch when training a model.

As we stated in Chapter 4, models like GPT-3 take an enormous amount of infrastructural resources to train. This is beyond the capability of most of us. So how can we use these models more flexibly, rather than simply downloading the pretrained models and using them as is? The answer lies in fine-tuning these models with the additional data we have. This requires far fewer resources and is much easier than training a large language model from scratch.

Transforming a basic model into something that generates reliable outcomes requires a significant investment of time and resources. Thanks to transfer learning, you can forgo most of this laborious training and devote only a small amount of time to adapting the dataset to your specific requirements.

In fact, pretrained models from Hugging Face are capable of excelling at tasks in a variety of domains even without additional fine-tuning. You can often use these models in a zero-shot learning scenario as well, but if you do have a specific dataset, our good friend, the huggingface library, provides the abstractions needed for fine-tuning these existing models.
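
As a quick, illustrative aside (the pipeline task and the candidate labels below are our own choices, not something the fine-tuning workflow requires), a zero-shot classification might look like this:
from transformers import pipeline
# Zero-shot classification with a pretrained model; no fine-tuning involved
zero_shot = pipeline("zero-shot-classification")
result = zero_shot(
    "The plot was thin, but the performances kept me hooked till the end.",
    candidate_labels=["positive", "negative"],  # illustrative labels
)
print(result["labels"][0], result["scores"][0])
When the domain or labels are very specific, however, fine-tuning on your own data will usually outperform this zero-shot approach.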

Therefore, we can basically consider transfer learning to be a kind of a shortcut when it comes to training. Simply by making use of pretrained language models, you can save tens of thousands of dollars and thousands of hours in terms of your computing needs. You should stick to transfer learning unless the tasks you are working on are extremely specific and cannot be solved using models that already exist.

We can now move on to our fine-tuning guide with Hugging Face because we have a better understanding of the applications and advantages of transfer learning.

The workflow for fine-tuning is shown in the following:
  • Select a pretrained model from huggingface that suits the need for your use case.

  • The custom dataset has to conform to the huggingface dataset spec, so we need to preprocess our data into the required format (a short sketch follows this list).

  • Upload the dataset to Colab, S3, or any other store.

  • Use the Trainer API from huggingface to fine-tune the existing model.

  • Save the model locally or upload it to the huggingface repo.
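
As mentioned in the second bullet, here is a minimal sketch that converts a small, hypothetical pandas DataFrame with text and label columns into a huggingface Dataset; the column names and values are purely illustrative:
import pandas as pd
from datasets import Dataset
# Hypothetical custom data with the same two columns the IMDB examples use later
df = pd.DataFrame({
    "text": ["Loved every minute of it.", "A dull and predictable movie."],
    "label": [1, 0],
})
# Convert the DataFrame into a huggingface Dataset object
custom_dataset = Dataset.from_pandas(df)
print(custom_dataset)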

With this basic idea in place, let’s get started on some transfer learning with the Hugging Face libraries.

During fine-tuning, most of the neural architecture is typically kept frozen, which means we only adjust the weights of the output layers (we will show a small illustrative sketch of this later, when we load the model for fine-tuning). Since we have already covered tokenizers in an earlier chapter, we will give a brief overview of huggingface datasets here, as they are the most important construct for this chapter. Once we understand the Datasets API, we will proceed to using a custom dataset for a pretrained model through transfer learning.

Datasets

In this section, we describe the dataset construct from huggingface and some of its basic functions.

The data you use over the course of any machine learning project is of utmost significance. Regardless of the algorithm or model you are working with, real accuracy derives not only from the quantity but also from the quality of that data.

Accessing large datasets can be a challenging endeavor at times. The process of scraping, accumulating, and then cleaning up this data in the appropriate manner can take a significant amount of time. Hugging Face, fortunately for people interested in NLP as well as image and audio processing, comes with a central repository of datasets that are already prepared for use. In the following paragraphs, we will have a brief look at how you can work with this datasets module to select and prepare the appropriate dataset for your project.

To install the datasets library, use the following command:
!pip install datasets
The datasets library exposes several primary methods. The first, list_datasets, lets us investigate the datasets that are readily available; you should see close to 6800 different datasets listed:
from datasets import list_datasets, load_dataset, list_metrics, load_metric
# Print all the available datasets
print(len(list_datasets()))
Output:
6783
Load a dataset:
dataset = load_dataset('imdb')
Print the dataset object:
print(dataset)
We get the following output:
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

The object is a dictionary containing train, test, and unsupervised splits, each with its own features and num_rows. Since we loaded the IMDB dataset, the text on which we will perform sentiment analysis consists of IMDB movie reviews.

Let us access the train dataset:
dataset['train'][2]
{'label': 0, 'text': "If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />"}
Describe the dataset:
dataset['train'].description
We get the following output:
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
List features of the dataset:
dataset['train'].features
We can see there are two features:
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None), 'text': Value(dtype='string', id=None)}

In certain circumstances, you may not want to use one of the Hugging Face datasets. The datasets library can also load locally stored CSV files, among other file types. If, for example, you want to work with a CSV file, you can simply pass the format and the path to the file on your local machine to the load_dataset method.
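
For example, a minimal sketch of loading a local CSV file (the file name here is hypothetical) would be:
from datasets import load_dataset
# "my_reviews.csv" is a hypothetical local file, say with text and label columns
local_dataset = load_dataset("csv", data_files="my_reviews.csv")
print(local_dataset)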

Fine-Tuning a Pretrained Model

Now that we understand the dataset construct, it’s time to apply some transfer learning to a pretrained model using our own dataset. In the following, we show an example of how to fine-tune a pretrained model with the IMDB dataset.

We will divide the fine-tuning work into two parts. In the training section, we use the huggingface Trainer API to fine-tune the model and save it. In the inference section, we load the fine-tuned model and use it to make predictions.

Training for Fine-Tuning

First, install transformers and datasets using the following command:
!pip install datasets transformers
Next, load the IMDB dataset:
from datasets import load_dataset
dataset = load_dataset("imdb")
dataset["train"][100]
The following is a sample review:
{'label': 0, 'text': "Terrible movie. Nuff Said.<br /><br />These Lines are Just Filler. The movie was bad. Why I have to expand on that I don't know. This is already a waste of my time. I just wanted to warn others. Avoid this movie. The acting sucks and the writing is just moronic. Bad in every way. Even that was ruined though by a terrible and unneeded rape scene. The movie is a poorly contrived and totally unbelievable piece of garbage.<br /><br />OK now I am just going to rag on IMDb for this stupid rule of 10 lines of text minimum. First I waste my time watching this offal. Then feeling compelled to warn others I create an account with IMDb only to discover that I have to write a friggen essay on the film just to express how bad I think it is. Totally unnecessary."}
Next, we need to tokenize the dataset we loaded using the BERT tokenizer. The first step is to create a new Jupyter notebook in Google Colab and copy the following code line by line:
from transformers import AutoTokenizer
brt_tkn = AutoTokenizer.from_pretrained("bert-base-cased")
def generate_tokens_for_imdb(examples):
    return brt_tkn(examples["text"], padding="max_length", truncation=True)
tkn_datasets = dataset.map(generate_tokens_for_imdb, batched=True)
The aforementioned code yields the following output:
loading configuration file https://huggingface.co/bert-base-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/a803e0468a8fe090683bdc453f4fac622804f49de86d7cecaee92365d4a0f829.a64a22196690e0e82ead56f388a3ef3a50de93335926ccfa20610217db589307
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}
loading file https://huggingface.co/bert-base-cased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/6508e60ab3c1200bffa26c95f4b58ac6b6d95fba4db1f195f632fa3cd7bc64cc.437aa611e89f6fc6675a049d2b5545390adbc617e7d655286421c191d2be2791
loading file https://huggingface.co/bert-base-cased/resolve/main/tokenizer.json from cache at /root/.cache/huggingface/transformers/226a307193a9f4344264cdc76a12988448a25345ba172f2c7421f3b6810fddad.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6
loading file https://huggingface.co/bert-base-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-cased/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/ec84e86ee39bfe112543192cf981deebf7e6cbe8c91b8f7f8f63c9be44366158.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
Once the dataset is tokenized, we will fine-tune on only 200 samples so that, for simplicity’s sake, the model trains faster. You are encouraged to try with more samples:
training_dataset = tkn_datasets["train"].shuffle(seed=42).select(range(200))
evaluation_dataset = tkn_datasets["test"].shuffle(seed=42).select(range(200))
Load the BERT-based sequence classification model:
from transformers import AutoModelForSequenceClassification
mdl = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
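
As mentioned earlier, fine-tuning often keeps most of the network frozen. The chapter’s Trainer run below fine-tunes the model exactly as loaded, so the following is only an illustrative sketch of what freezing the BERT encoder could look like:
from transformers import AutoModelForSequenceClassification
# Illustrative only: freeze the encoder of a freshly loaded model so that
# only the classification head on top would be updated during training
frozen_mdl = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
for param in frozen_mdl.bert.parameters():
    param.requires_grad = False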

The Transformers library includes a Trainer class that is specifically designed for training huggingface transformer models. This class makes it much simpler to begin training without manually writing your own training loop. The Trainer API also provides features such as logging and monitoring.

We provide the training arguments by instantiating the TrainingArguments class, which exposes all of the hyperparameters one can experiment with. In this case, we will just use the defaults:
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="imdb")
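If you do want to experiment, a TrainingArguments instance with a few commonly tuned hyperparameters could look like the following sketch (the values are illustrative, not recommendations):
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="imdb",
    learning_rate=2e-5,              # optimizer learning rate
    per_device_train_batch_size=8,   # training batch size per device
    per_device_eval_batch_size=8,    # evaluation batch size per device
    num_train_epochs=3,              # total number of training epochs
    weight_decay=0.01,               # weight decay for regularization
)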
During training, Trainer does not automatically evaluate how well the model is performing. If you want Trainer to be able to compute and report metrics, you will need to pass it a function. This is what we will do in the following code segment:
import numpy as np
from datasets import load_metric
mdl_metrics = load_metric("accuracy")
def calculate_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return mdl_metrics.compute(predictions=predictions, references=labels)
from transformers import TrainingArguments, Trainer
trng_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", num_train_epochs=3)
Instantiate a Trainer object that contains your model, the training arguments, the datasets to be used for training and testing, and the evaluation function:
mdl_trainer = Trainer(
    model=mdl,
    args=trng_args,
    train_dataset=training_dataset,
    eval_dataset=evaluation_dataset,
    compute_metrics=calculate_metrics,
)
Train the model:
mdl_trainer.train()
Figure 6-1 shows the training run for the IMDB dataset we used for fine-tuning an existing pretrained model.


Figure 6-1

Training run for the IMDB dataset for fine-tuning

Save the fine-tuned trained model:
mdl_trainer.save_model()


Figure 6-2

Save the model locally (we have a PyTorch-based model with extension .bin)

We can see that the fine-tuned model is saved with the name pytorch_model.bin.

We can check the accuracy of the model using the following code:
metrics = mdl_trainer.evaluate(evaluation_dataset)
mdl_trainer.log_metrics("eval", metrics)
mdl_trainer.save_metrics("eval", metrics)


Figure 6-3

Evaluation of the fine-tuned model in terms of its accuracy

Inference

Once we have fine-tuned the model and saved it, it’s time to do inference on data outside the train dataset.

We will load the fine-tuned model from the path and use it to make a classification, which in this case is a sentiment classification on IMDB movie reviews:
from transformers import AutoModelForSequenceClassification
Load the fine-tuned model from the following path and move it to the GPU (we reuse the brt_tkn tokenizer created earlier):
PATH = 'test_trainer/'
md = AutoModelForSequenceClassification.from_pretrained(PATH, local_files_only=True).to("cuda")
def make_classification(text):
    # Tokenize the input text and move the tensors to the GPU
    inps = brt_tkn(text, padding=True, truncation=True, max_length=512, return_tensors="pt").to("cuda")
    # Run the fine-tuned model to get the output logits
    outputs = md(**inps)
    # Apply softmax to turn the logits into probabilities
    probabilities = outputs[0].softmax(1)
    # Return the class with the highest probability
    return probabilities.argmax()
Here is the first inference:
text = """
This is the show that puts a smile on your face as you watch it. You get in love with each and every character of the show. At the end, I felt eight episode were not enough. Will wait for season 2.
"""
print(make_classification(text))
This yields the following output:
tensor(1, device='cuda:0')
An output of 1 indicates a positive review.
Here is the second inference:
text = """
It was fun to watch but It did not impress that much I think i waste my money popcorn time pizza burgers everything.
Akshay should make only comedy movies these King type movies suits on king like personality of actors Total waste.
"""
print(make_classification(text))
This yields the following output:
tensor(0, device='cuda:0')
An output of 0 indicates a negative review.

Summary

In this chapter, we learned about huggingface datasets and their various functions. We also learned how to use the huggingface APIs to fine-tune existing pretrained models on additional datasets.
