The key ingredient for supervised learning is a labeled dataset. Weakly supervised techniques provide machine learning practitioners with a powerful approach for getting the labeled datasets that are needed to train NLP models.
In Chapter 3, you learned how to use Snorkel to label the data from the FakeNews dataset. In this chapter, you will learn how to use the following Python libraries to perform text classification on the weakly labeled dataset produced by Snorkel:
ktrain - We begin the chapter by introducing the ktrain library. ktrain is a Python library that provides a lightweight wrapper for transformer-based models and offers anyone (including someone new to NLP) a gentle introduction to NLP.
HuggingFace - Once you are comfortable with the different NLP concepts, you will learn how to unleash the powerful capabilities of the HuggingFace library.
By showing both ktrain and pre-trained models from the HuggingFace libraries, we hope to help you get started by providing a gentle introduction to performing text classification on the weakly-labeled dataset before moving to use the full capabilities of the HuggingFace library.
We included a section showing you how you can deal with a class-imbalanced dataset. While the FakeNews dataset used in this book is not imbalanced, we take you through the exercise of handling the class imbalance to help you build up the skills, and prepare you to be ready when you actually have to deal with a class-imbalanced dataset in the future.
NLP enables one to automate the processing of text data in tasks including parsing sentences to extract their grammatical structure, extracting entities from documents, classifying documents into categories, document ranking for retrieving the most relevant documents, summarizing documents, answering questions, document translation, and more. The field of NLP has been continuously evolving and has made significant progress in recent years.
For a theoretical and practical introduction to NLP, the books Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze and Natural Language Processing with Python will be useful.
The tutorial by Sebastian Ruder and colleagues, delivered at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT) and titled “Transfer Learning in Natural Language Processing” [Sebastian et al, 2019], is a great resource if you are looking to jumpstart your understanding of the exciting field of NLP. In the tutorial, Sebastian and colleagues share a comprehensive overview of how transfer learning for NLP works.
The goal of this chapter is to show how you can leverage NLP libraries to perform text classification, using a labeled dataset produced by Snorkel. This chapter is not meant to demonstrate how weak supervision enables application debugging or supports iterative error analysis with the Snorkel model that is provided. Readers should refer to the Snorkel documentation and tutorials to learn more.
Let us get started by learning about how Transformer-based approaches have enabled transfer learning for NLP, and how we can use it for performing text classification using the Snorkel-labeled FakeNews dataset.
Transformers have been the key catalyst for many innovative Natural Language Processing (NLP) applications. In a nutshell, a Transformer enables one to perform sequence-to-sequence tasks by leveraging a novel self-attention technique.
Transformers use a novel architecture that does not require a recurrent neural network (RNN) or convolutional neural network (CNN). In the paper Attention is All You Need, the authors showed how Transformers outperform both recurrent and convolutional neural network approaches.
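To make the idea of self-attention more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are the input vectors themselves (a real Transformer learns separate projection matrices and uses multiple heads), and the function names and toy embeddings are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys, and values are the inputs themselves;
    a real Transformer learns separate projections for each.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d embeddings
attended = self_attention(tokens)
```

Every output vector mixes information from all positions in the sequence at once, which is why no recurrence or convolution is needed.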
One popular transformer-based model, which uses a left-to-right language modeling objective, was described in the paper Improving Language Understanding by Generative Pre-Training. Over the last few years, Bidirectional Encoder Representations from Transformers (BERT) has inspired an entire family of transformer-based approaches for pre-training language representations. The paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding provides a good overview of how BERT works. The family ranges from pre-trained BERT models of different sizes (from tiny to extremely large) to RoBERTa, ALBERT, and more.
To understand how Transformers and self-attention work, the following articles provide a good read:
A Primer in BERTology: What We Know About How BERT Works (https://arxiv.org/abs/2002.12327) - An extensive survey of more than 100 studies of the BERT model.
Transformers from Scratch by Peter Bloem - A good, simplified discussion of why self-attention works.
Google AI Blog: Transformer: A Novel Neural Network Architecture for Language Understanding - A good overview of the Transformer.
Motivated by the need to reduce the size of a BERT model, [Victor et al, 2020] proposed DistilBERT, a language model that is 40% smaller and yet retains 97% of BERT’s language understanding capabilities, while being 60% faster.
Many novel applications and evolutions of Transformer-based approaches are continuously being made available. For example, OpenAI used Transformers to create language models in the form of GPT via unsupervised pre-training. These language models are then fine-tuned for a specific downstream task [Radford et al, 2018]. More recent Transformer innovations include GPT-3 and Switch Transformers.
In this chapter, you will learn how to use two transformer-based models for text classification: DistilBert and RoBERTa.
We will fine-tune the pre-trained DistilBert and RoBERTa models, and use them to perform text classification on the FakeNews dataset. To help everyone get a quick start, we begin the text classification exercise with ktrain. Once we are used to the steps for training Transformer-based models (e.g., fine-tuning DistilBert using ktrain), we will learn how to tap the full capabilities of HuggingFace. HuggingFace is a popular Python NLP library that provides a rich collection of pre-trained models, with more than 12,638 models (as of July 2021) on its model hub.
In this chapter, we will use HuggingFace for training the RoBERTa model and fine-tune it for the text classification task. Let’s get started!
In Chapter 3, you learned how to use Snorkel to produce the labels for the dataset that will be used for training. As noted by the authors of the Snorkel Intro Tutorial: Data Labeling, the goal is to leverage the labels from each of the Snorkel labeling functions and convert them into a single noise-aware probabilistic (or confidence-weighted) label.
In this chapter, we use the FakeNews dataset that was labeled by Snorkel in Chapter 3. The Snorkel label is stored in a column called snorkel_labels, and it is referred to by that name in the rest of this chapter.
Similar to the Snorkel Spam Tutorial, we will use snorkel.labeling.MajorityLabelVoter. The labels are produced by the predict() method of snorkel.labeling.MajorityLabelVoter. According to the documentation (https://bit.ly/2Wlf02o), the predict() method returns the predicted labels as an ndarray of integer labels. In addition, the method supports different policies for breaking ties (e.g., abstain and random). By default, the abstain policy is used.
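The following is a minimal sketch of the idea behind majority voting with an abstain tie-break policy. It is not Snorkel's implementation; the function name, the -1 abstain constant, and the toy label matrix are assumptions made for illustration.

```python
from collections import Counter

ABSTAIN = -1  # assumed abstain value, matching the -1 used in this chapter

def majority_vote(votes, abstain=ABSTAIN):
    """Return the most common non-abstain vote, or abstain on a tie."""
    counts = Counter(v for v in votes if v != abstain)
    if not counts:
        return abstain  # every labeling function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return abstain  # tie between the top two labels
    return ranked[0][0]

# Each row holds the votes of three hypothetical labeling functions
label_matrix = [
    [1, 1, 0],
    [0, -1, 0],
    [1, 0, -1],
    [-1, -1, -1],
]
voted_labels = [majority_vote(row) for row in label_matrix]
```

The third row is a one-to-one tie and the fourth row has no votes at all, so both end up as abstentions.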
It is important to note that the Snorkel labeling functions (LFs) may be correlated. This might cause a majority-vote-based model to overrepresent some of the signals.
To address this, snorkel.labeling.model.label_model.LabelModel can be used. The predict() method of LabelModel returns an ndarray of integer labels and, if return_probs is set to True, an ndarray of probabilistic labels. These probabilistic labels can be used to train a classifier.
You can modify the code discussed in this chapter to use the probabilistic labels provided by LabelModel as well. The HuggingFace implementation of transformers makes use of the BCEWithLogitsLoss function, which can be used with probabilistic labels. (See the HuggingFace code for RoBERTa, available at https://bit.ly/2Vg8Rnw, to understand the different loss functions supported.)
For simplicity, this chapter uses the label outputs from MajorityLabelVoter.
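To see why a loss such as binary cross-entropy with logits can work with probabilistic labels, consider the sketch below. It computes the loss for a single logit in plain Python, using the standard numerically stable formula; the function name and the example values are hypothetical, and this is not the library code itself.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit; target may be any probability in [0, 1]."""
    # Stable form of: -[t * log(sigmoid(x)) + (1 - t) * log(1 - sigmoid(x))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

hard_loss = bce_with_logits(2.0, 1.0)   # hard label: the row is class 1
soft_loss = bce_with_logits(2.0, 0.7)   # probabilistic label: 70% class 1
```

With the soft target, a confident prediction is penalized more than with the hard target, which is how the uncertainty in a weak label can be carried into training.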
To help everyone get started, we will use the Python library ktrain to illustrate how to train a transformer-based model. ktrain enables anyone to get started with using a transformer-based model quickly. ktrain enables us to leverage the pre-trained DistilBert models (available in HuggingFace) to perform text classification.
We load the fakenews_snorkel_labels.csv file and show the first few rows of the dataset.
import pandas as pd

# Read the Fake News Dataset and show the first few rows
fakenews_df = pd.read_csv('../data/fakenews_snorkel_labels.csv')
fakenews_df[['id', 'statement', 'snorkel_labels']].head()
Let us first take a look at some of the columns in the Fake News dataset. For text classification, we will be using the columns statement and snorkel_labels. A value of 1 indicates it is real news, and 0 indicates it is fake news.
id | statement | … | snorkel_labels |
---|---|---|---|
1248 | During the Reagan era, while productivity incr… | … | 1 |
4518 | “It costs more money to put a person on death … | … | 1 |
15972 | Says that Minnesota Democratic congressional c… | … | 0 |
11165 | “This is the most generous country in the worl… | … | 1 |
14618 | “Loopholes in current law prevent ‘Unaccompani… | … | 0 |
In the dataset, you will notice a -1 value in snorkel_labels. This is the value set by Snorkel when it is unsure of the label. We will remove the rows that have snorkel_labels = -1 using the following code:
# Keep only the rows where Snorkel produced a label (0 or 1)
fakenews_df = fakenews_df[fakenews_df['snorkel_labels'] >= 0]
Next, let us take a look at the unique labels in the dataset.
# Get the unique labels
categories = fakenews_df.snorkel_labels.unique()
categories
The code produces the following output:
array([1, 0])
Let us look at the number of occurrences of real news (1) vs. fake news (0). We use the following code to get the value_counts of fakenews_df['snorkel_labels']. This helps you understand how the real news vs. fake news data is distributed, and whether the dataset is imbalanced.
# count the number of rows with label 0 or 1
fakenews_df['snorkel_labels'].value_counts()
The code produces the following output:
1    6287
0    5859
Name: snorkel_labels, dtype: int64
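As a quick check of how balanced these counts are, the arithmetic can be sketched as follows. The variable names are hypothetical; the counts are the ones printed above.

```python
counts = {1: 6287, 0: 5859}  # value_counts() output shown above
total = sum(counts.values())

# Share of each class, and the ratio of the larger class to the smaller one
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())
```

Roughly 52% of the rows are labeled as real news, and the ratio between the two classes is close to 1, which supports the observation that this dataset is not imbalanced.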
Next, we will split the dataset into training and testing data. We use train_test_split from sklearn.model_selection. The dataset is split into 70% training data and 30% testing data. In addition, we initialize the random generator seed to 98052. You can set the random generator seed to any value. Having a fixed value for the seed makes the results of your experiment reproducible across multiple runs.
# Prepare training and test data
from sklearn.model_selection import train_test_split

X = fakenews_df['statement']

# get the labels
labels = fakenews_df.snorkel_labels

# Split the data into train/test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.30, random_state=98052)
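The effect of a fixed seed can be illustrated without scikit-learn. The sketch below is a simplified stand-in for a seeded split, not how train_test_split is implemented; the helper name and the toy rows are assumptions.

```python
import random

def seeded_split(items, test_size, seed):
    """Shuffle a copy of items with a fixed seed, then cut off a test slice."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # toy stand-in for the rows of the dataset
train_a, test_a = seeded_split(rows, test_size=0.30, seed=98052)
train_b, test_b = seeded_split(rows, test_size=0.30, seed=98052)
```

Because both calls use the same seed, they produce identical train and test partitions, which is what makes repeated runs of an experiment comparable.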
Let us count the number of labels in the training dataset.
# Count of label 0 and 1 in the training data set
print("Rows in X_train %d : " % len(X_train))
type(X_train.values.tolist())
y_train.value_counts()
The code produces the following output:
Rows in X_train 8502 :
1    4395
0    4107
Name: snorkel_labels, dtype: int64
While the data distribution for the current dataset does not indicate an imbalanced dataset, we have included a section on how to deal with an imbalanced dataset, which we hope will be useful to you in future experiments where you have to deal with one.
In many real-world cases, the dataset is imbalanced. That is, there are more instances of one class (majority class), than the other class (minority class). In this section, we show how you can deal with class imbalance.
There are different approaches to dealing with an imbalanced dataset. One commonly used technique is resampling. In resampling, the data from the majority class is under-sampled, or the data from the minority class is over-sampled. In this way, you get a balanced dataset that has equal occurrences of both classes.
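Conceptually, random under-sampling keeps every row of the minority class and a random subset of the same size from each other class. The sketch below shows that idea in plain Python; it is not how imblearn implements it, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_undersample(texts, labels, seed):
    """Keep all minority rows and an equally sized random subset of each other class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    # The smallest class determines how many rows every class keeps
    target = min(len(rows) for rows in by_label.values())
    out_texts, out_labels = [], []
    for label, rows in by_label.items():
        for text in rng.sample(rows, target):
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

toy_texts = [f"statement {i}" for i in range(10)]
toy_labels = [1] * 7 + [0] * 3   # imbalanced: seven rows of class 1, three of class 0
balanced_texts, balanced_labels = random_undersample(toy_texts, toy_labels, seed=98052)
```

The trade-off of under-sampling is that some majority-class rows are discarded, so it is most attractive when the majority class is large enough that losing rows does not hurt.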
In this exercise, we will use imblearn.under_sampling.RandomUnderSampler. This approach will perform random under-sampling of the majority class. Before using imblearn.under_sampling.RandomUnderSampler, we will need to prepare the data so it is in the input shape expected by RandomUnderSampler.
# Getting the dataset ready for using RandomUnderSampler
import numpy as np

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()

# Convert 1D to 2D (used as input to sampler)
X_train_np2D = np.reshape(X_train_np, (-1, 1))
X_test_np2D = np.reshape(X_test_np, (-1, 1))
Once the data is in the expected shape, we use RandomUnderSampler to perform under-sampling of the training dataset.
from imblearn.under_sampling import RandomUnderSampler

# Perform random under-sampling
sampler = RandomUnderSampler(random_state=98052)
X_train_rus, Y_train_rus = sampler.fit_resample(X_train_np2D, y_train)
X_test_rus, Y_test_rus = sampler.fit_resample(X_test_np2D, y_test)
The results are returned in the variables X_train_rus, and Y_train_rus. Let us count the number of occurrences.
from collections import Counter

print('Resampled Training dataset %s' % Counter(Y_train_rus))
print('Resampled Test dataset %s' % Counter(Y_test_rus))
The results are shown below. You will see that the number of occurrences for both labels 0 and 1 in the training and test datasets are now balanced.
Resampled Training dataset Counter({0: 4107, 1: 4107})
Resampled Test dataset Counter({0: 1752, 1: 1752})
Before we start training, we first flatten both training and testing datasets.
# Preparing the resampled datasets
# Flatten train and test dataset
X_train_rus = X_train_rus.flatten()
X_test_rus = X_test_rus.flatten()
In this section, we will be using ktrain to train a DistilBert model using the FakeNews dataset.
ktrain provides a lightweight TensorFlow Keras wrapper that empowers any data scientist to quickly train various deep learning models (text, vision, and many more). From version v0.8 onwards, ktrain has also added support for Hugging Face transformers.
Using text.Transformer(), we first load the distilbert-base-uncased model provided by Hugging Face.
import ktrain
from ktrain import text

model_name = 'distilbert-base-uncased'
t = text.Transformer(model_name, class_names=labels.unique(), maxlen=500)
Once the model has been loaded, we use t.preprocess_train() and t.preprocess_test() to process both the training and testing data.
train = t.preprocess_train(X_train_rus.tolist(), Y_train_rus.to_list())
When running the above code snippet on the training data, you will see the following output:
preprocessing train...
language: en
train sequence lengths:
    mean : 18
    95percentile : 34
    99percentile : 43
Similar to how we process the training data, we process the testing data as well.
val = t.preprocess_test(X_test_rus.tolist(), Y_test_rus.to_list())
When running the above code snippet on the testing data, you will see the following output:
preprocessing test...
language: en
test sequence lengths:
    mean : 18
    95percentile : 34
    99percentile : 44
Once we have pre-processed both training and test datasets, we are ready to train the model. First, we retrieve the classifier and store it in the model variable.
In the BERT paper, the authors selected the best fine-tuning hyperparameters from batch sizes of 8, 16, 32, 64, and 128 and learning rates of 3e-4, 1e-4, 5e-5, and 3e-5, and trained the model for 4 epochs.
For this exercise, we use a batch_size of 8 and a learning rate of 3e-5, and train for 3 epochs. These values are chosen based on common defaults used in many papers. The number of epochs is set to 3 to reduce the risk of overfitting.
# Retrieve the classifier and wrap it in a ktrain learner
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=train, val_data=val, batch_size=8)

# Train for 3 epochs using the one-cycle policy with a maximum learning rate of 3e-5
learner.fit_onecycle(3e-5, 3)
After you have run fit_onecycle(), you will observe the following output:
begin training using onecycle policy with max lr of 3e-05...
Train for 1027 steps, validate for 110 steps
Epoch 1/3
1027/1027 [==============================] - 1118s 1s/step - loss: 0.6494 - accuracy: 0.6224 - val_loss: 0.6207 - val_accuracy: 0.6527
Epoch 2/3
1027/1027 [==============================] - 1113s 1s/step - loss: 0.5762 - accuracy: 0.6980 - val_loss: 0.6039 - val_accuracy: 0.6695
Epoch 3/3
1027/1027 [==============================] - 1111s 1s/step - loss: 0.3620 - accuracy: 0.8398 - val_loss: 0.7672 - val_accuracy: 0.6567
<tensorflow.python.keras.callbacks.History at 0x7f309c747898>
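Two numbers in this log can be checked with simple arithmetic. The sketch below assumes that the last partial batch counts as a step, and it copies the validation losses from the log; the variable names are hypothetical.

```python
import math

train_rows = 4107 + 4107   # balanced training rows after under-sampling
batch_size = 8
steps_per_epoch = math.ceil(train_rows / batch_size)

# Validation loss per epoch, copied from the training log
val_losses = [0.6207, 0.6039, 0.7672]
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
```

The step count matches the 1027 steps reported per epoch. The validation loss is lowest after epoch 2 and rises in epoch 3 while the training loss keeps falling, which is a typical sign that the model is starting to overfit.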
Next, we evaluate the quality of the model by using learner.validate().
learner.validate(class_names=t.get_classes())
The output shows the precision, recall, f1-score, and support.
              precision    recall  f1-score   support
           1       0.67      0.61      0.64      1752
           0       0.64      0.70      0.67      1752
    accuracy                           0.66      3504
   macro avg       0.66      0.66      0.66      3504
weighted avg       0.66      0.66      0.66      3504

array([[1069,  683],
       [ 520, 1232]])
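The precision, recall, and f1-score values can be derived from the confusion matrix printed at the end of the output. The sketch below assumes that rows are the true classes and columns are the predicted classes, both in the order (1, 0); that ordering is inferred from the fact that it reproduces the reported numbers. The helper name is hypothetical.

```python
# Assumed layout: rows = true class (1, 0), columns = predicted class (1, 0)
confusion = [[1069, 683],
             [520, 1232]]

def per_class_metrics(matrix, index):
    """Precision, recall, and F1 for the class at the given row/column index."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)   # column total
    actual = sum(matrix[index])                     # row total
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
class_1 = per_class_metrics(confusion, 0)
class_0 = per_class_metrics(confusion, 1)
```

Rounded to two decimals, these values agree with the report, and the accuracy agrees with the val_accuracy of the final training epoch.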
ktrain enables you to easily view the top N rows where the model made mistakes. This lets you quickly troubleshoot the model or identify areas for improvement. To get the top 3 rows where the model made mistakes, use learner.view_top_losses().
# show the top 3 rows where the model made mistakes
learner.view_top_losses(n=3, preproc=t)
This produces the following output:
id:2826 | loss:5.31 | true:0 | pred:1)
----------
id:3297 | loss:5.29 | true:0 | pred:1)
----------
id:1983 | loss:5.25 | true:0 | pred:1)
Once you have the identifier of the top 3 rows, let us examine one of the rows. As this is based on the quality of the weak labels, it is used as an example only. In a real-world case, you will need to leverage various data sources, and subject matter experts (SMEs) to deeply understand why the model has made a mistake in this area.
# Show the text for the entry where we made mistakes
# We predicted 1, when this should be predicted as 0
print("Ground truth: %d" % Y_test_rus[2826])
print("-------------")
print(X_test_rus[2826])
The output is shown as follows. You can observe that even though the ground truth label is 0, the model has predicted it as a 1.
Ground truth: 1
-------------
"Tim Kaine announced he wants to raise taxes on everyone."
Let us use the trained model on a new instance of news, extracted from CNN News.
Using ktrain.get_predictor(), we first get the predictor. Next, we invoke predictor.predict() on the news text. You will see that we obtain an output of 1.
# Adjacent string literals are joined into one string
news_txt = ('Now is a time for unity. We must '
            'respect the results of the U.S. presidential election and, '
            'as we have with every election, honor the decision of the voters '
            'and support a peaceful transition of power," said Jamie Dimon, '
            'CEO of JPMorgan Chase .')

predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.predict(news_txt)
ktrain also makes it easy to explain the results, using predictor.explain().
predictor.explain(news_txt)
Running predictor.explain() shows the following output and the top features that contributed to the prediction.
y=0 (probability 0.023, score -3.760) top features
Contribution? | Feature |
---|---|
+0.783 | of the |
+0.533 | transition |
+0.456 | the decision |
+0.438 | a time |
+0.436 | and as |
+0.413 | the results |
+0.373 | support a |
+0.306 | jamie dimon |
+0.274 | said jamie |
+0.272 | u s |
+0.264 | every election |
+0.247 | we have |
+0.243 | transition of |
+0.226 | the u |
+0.217 | now is |
+0.205 | is a |
+0.198 | results of |
+0.195 | the voters |
+0.179 | must respect |
+0.167 | election honor |
+0.165 | jpmorgan chase |
+0.165 | s presidential |
+0.143 | for unity |
+0.124 | support |
+0.124 | honor the |
+0.104 | respect the |
+0.071 | results |
+0.066 | decision of |
+0.064 | dimon ceo |
+0.064 | as we |
+0.055 | time for |
-0.060 | have |
-0.074 | power said |
-0.086 | said |
-0.107 | every |
-0.115 | voters and |
-0.132 | of jpmorgan |
-0.192 | must |
-0.239 | s |
-0.247 | now |
-0.270 | <BIAS> |
-0.326 | of power |
-0.348 | respect |
-0.385 | power |
-0.394 | u |
-0.442 | of |
-0.491 | presidential |
-0.549 | honor |
-0.553 | jpmorgan |
-0.613 | jamie |
-0.622 | dimon |
-0.653 | time |
-0.708 | a |
-0.710 | we |
-0.731 | peaceful |
-1.078 | the |
-1.206 | election |
It is important to find a good learning rate before you start training the model.
ktrain provides the lr_find() function for finding a good learning rate. lr_find() outputs the plot that shows the loss vs the learning rate (expressed in a logarithmic scale).
# Using lr_find to find a good learning rate
learner.lr_find(show_plot=True, max_epochs=5)
The output and plot from running learner.lr_find() are shown below. In this example, the loss value stays roughly in the range of 0.6 to 0.7. Once the learning rate gets closer to 10e-2, the loss increases significantly. As a general best practice, it is usually beneficial to choose a learning rate that is near the lowest point of the graph.
simulating training for different learning rates... this may take a few moments...
Train for 1026 steps
Epoch 1/5
1026/1026 [==============================] - 972s 947ms/step - loss: 0.6876 - accuracy: 0.5356
Epoch 2/5
1026/1026 [==============================] - 965s 940ms/step - loss: 0.6417 - accuracy: 0.6269
Epoch 3/5
1026/1026 [==============================] - 964s 939ms/step - loss: 0.6968 - accuracy: 0.5126
Epoch 4/5
 368/1026 [=========>....................] - ETA: 10:13 - loss: 1.0184 - accuracy: 0.5143
done.
Visually inspect loss plot and select learning rate associated with falling loss
Using the output from lr_find(), and visually inspecting the loss plot as shown in Figure 4-1, you can start training the model with the learning rate that has the lowest loss. This will give you a good starting point when training the model.
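A learning rate range test of this kind typically sweeps the learning rate over several orders of magnitude, which is why the plot uses a logarithmic axis. The sketch below generates such a sweep; it shows the general technique only, and the start value, end value, and function name are assumptions rather than ktrain's actual defaults or implementation.

```python
def lr_sweep(start_lr, end_lr, num_steps):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

# Hypothetical range: eight rates from 1e-7 up to 1.0, one per decade
rates = lr_sweep(1e-7, 1.0, num_steps=8)
```

Training briefly at each rate and recording the loss gives the curve that you inspect to pick a learning rate in the region where the loss is still falling.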
In the previous section, we showed how you can use ktrain to perform text classification. As you get familiar with using transformer-based models, you might want to leverage the full capabilities of the Hugging Face Python library directly.
In this section, we show how you can use Hugging Face with RoBERTa, one of the state-of-the-art transformers it provides. RoBERTa uses an architecture similar to BERT, with byte-level BPE as its tokenizer. RoBERTa also makes several optimizations to BERT's training procedure, including a bigger batch size, longer training time, and more diversified training data.
Let us get started by loading the relevant Hugging Face Transformer libraries, and sklearn.
In the code snippet, you will observe that we are loading several libraries. For example, we will be using RobertaForSequenceClassification, RobertaTokenizer for performing text classification and tokenization respectively.
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn

import transformers as tfs
from transformers import AdamW, BertConfig
from transformers import RobertaTokenizer, RobertaForSequenceClassification
Similar to the earlier sections, we will load the FakeNews dataset. After we have loaded the fakenews_df DataFrame, we will extract the relevant columns that we will use to fine-tune the RoBERTa model.
# Read the Fake News Dataset and keep the columns used for fine-tuning
fakenews_df = pd.read_csv('./data/data2_final.csv')
X = fakenews_df.loc[:, ['statement', 'snorkel_labels']]

# remove rows with a -1 snorkel_labels value
X = X[X['snorkel_labels'] >= 0]
labels = X.snorkel_labels
Next, we will split the data in X into training, validation, and testing datasets. The training and validation datasets will be used when we train and evaluate the model. The training, validation, and testing datasets are assigned to the variables X_train, X_val, and X_test, respectively.
# Split the data into train/test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X['statement'], X['snorkel_labels'],
    test_size=0.20, random_state=122520)

# withhold test cases for testing; the remainder becomes the validation set
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.30, random_state=122420)
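Because the second split is applied to the 20% that was held out by the first split, it can help to work out the resulting share of each dataset. The arithmetic below uses the two test_size values from the code above; the variable names are hypothetical.

```python
first_test_size = 0.20    # first train_test_split call
second_test_size = 0.30   # second call, applied to the held-out 20%

train_share = 1 - first_test_size
val_share = first_test_size * second_test_size          # assigned to X_val
test_share = first_test_size * (1 - second_test_size)   # assigned to X_test
```

This gives approximately 80% of the rows for training, 6% for validation, and 14% for testing.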
In this section, we provide sample code to determine the number of available GPUs that can be used for training. In addition, we also print out the type of GPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f'GPU(s) available: {torch.cuda.device_count()} ')
    print('Device:', torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print('No GPUs available. Default to use CPU')
For example, running the above code on an Azure Standard NC6_Promo (6 vcpus, 56 GiB memory) Virtual Machine (VM), the following output is printed. The output will differ depending on the NVidia GPUs that are available on the machine that you are using for training.
GPU(s) available: 1
Device: Tesla K80
Let us learn how you can perform tokenization using the RoBERTa tokenizer.
In the code shown, you will see that we first load the pre-trained roberta-base model using RobertaForSequenceClassification.from_pretrained(…).
Next, we load the pre-trained tokenizer using RobertaTokenizer.from_pretrained(…).
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base', return_dict=True)
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
Next, you will use tokenizer() to prepare the training and validation data by performing tokenization, truncation, and padding of the data. The encoded training and validation data is stored in the variables tokenized_train and tokenized_validation respectively.
We specify padding='max_length' so that every input is padded to the same length. In addition, we specify truncation=True to make sure that inputs longer than the specified maximum length are truncated.
max_length = 256

# Use the Tokenizer to tokenize and encode
tokenized_train = tokenizer(
    X_train.to_list(),
    padding='max_length',
    max_length=max_length,
    truncation=True,
    return_token_type_ids=False,
)

tokenized_validation = tokenizer(
    X_val.to_list(),
    padding='max_length',
    max_length=max_length,
    truncation=True,
    return_token_type_ids=False
)
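The effect of padding and truncation on a single sequence can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: the token IDs and the pad ID are hypothetical, and a real tokenizer also keeps the special end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding    # padding up to max_length
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask

# Hypothetical token IDs, for illustration only
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=1)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=1)
```

The attention mask is what allows the model to tell the real tokens apart from the padding, which is why both input_ids and attention_mask are passed to the model during training.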
Hugging Face v3.x introduced new APIs for all tokenizers. See https://bit.ly/2VIqauj on how to migrate from v2.X to v3.X. In this chapter, we are using the new v3.X APIs for tokenization.
Next, we will convert the tokenized input_ids, attention mask, and labels into Tensors that we can use as inputs to training.
# Convert to Tensor
train_input_ids_tensor = torch.tensor(tokenized_train["input_ids"])
train_attention_mask_tensor = torch.tensor(tokenized_train["attention_mask"])
train_labels_tensor = torch.tensor(y_train.to_list())

val_input_ids_tensor = torch.tensor(tokenized_validation["input_ids"])
val_attention_mask_tensor = torch.tensor(tokenized_validation["attention_mask"])
val_labels_tensor = torch.tensor(y_val.to_list())
Before we start to fine-tune the RoBERTa model, we will create the DataLoader for both the training and validation data. The DataLoader will be used during the fine-tuning of the model. To do this, we first convert the input_ids, attention_mask, and labels to a TensorDataset. Next, we create the DataLoader using the TensorDataset as input and specify the batch size. We set the variable batch_size to 16.
# Preparing the DataLoaders
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.data import RandomSampler

# Specify a batch size of 16
batch_size = 16

# 1. Create a Tensor Dataset
# 2. Define the data sampling approach
# 3. Create the DataLoader
train_data_tensor = TensorDataset(
    train_input_ids_tensor, train_attention_mask_tensor, train_labels_tensor)
train_dataloader = DataLoader(
    train_data_tensor, batch_size=batch_size, shuffle=True)

val_data_tensor = TensorDataset(
    val_input_ids_tensor, val_attention_mask_tensor, val_labels_tensor)
val_dataloader = DataLoader(
    val_data_tensor, batch_size=batch_size, shuffle=True)
Next, we specify the number of epochs that will be used for fine-tuning the model and also compute the total_steps needed based on the number of epochs, and the number of batches in train_dataloader.
num_epocs = 2
total_steps = num_epocs * len(train_dataloader)
Next, we specify the optimizer that will be used. For this exercise, we will use the AdamW optimizer, which is part of the HuggingFace optimization module. The AdamW optimizer implements the Adam algorithm with the weight decay fix that can be used when fine-tuning models.
In addition, you will notice that we specified a scheduler, using get_linear_schedule_with_warmup(). This creates a schedule with a learning rate that decreases linearly, using the initial learning rate that was set in the optimizer as the reference point. The learning rate decreases linearly after a warmup period.
# Use the Hugging Face optimizer
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-5)

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=total_steps)
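The shape of a linear schedule with warmup can be sketched as a multiplier that is applied to the initial learning rate at each step. The function below illustrates the technique in plain Python; the function name and the total of 1,000 training steps are assumptions for the example, not values from this chapter.

```python
def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the initial learning rate at a given step."""
    if step < num_warmup_steps:
        # Warmup: rise linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # After warmup: fall linearly from 1 to 0 at the final step
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
# Hypothetical run of 1,000 steps with 100 warmup steps
schedule = [base_lr * linear_warmup_factor(s, 100, 1000)
            for s in (0, 50, 100, 550, 1000)]
```

The learning rate reaches its peak of 3e-5 at the end of warmup and is back at zero on the last training step, so the largest updates happen early and the final updates are small.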
The HuggingFace optimization module provides several optimizers, learning rate schedulers, and a gradient accumulator. See https://huggingface.co/transformers/main_classes/optimizer_schedules.html for how to use the different capabilities provided by the HuggingFace optimization module.
Now that we have created the optimizer and scheduler, we are ready to define the fine-tuning training function, called train(). First, we set the model to training mode. Next, we iterate through each batch of data obtained from train_dataloader. We use model.zero_grad() to clear previously calculated gradients. Next, we invoke the forward pass with the model() function, and retrieve both the loss and logits after it completes. We add the loss obtained to total_loss, and then invoke the backward pass by calling loss.backward().
To mitigate the exploding gradient problem, we clip the gradient norm to 1.0. Next, we update the parameters using optimizer.step().
def train():
    total_loss = 0.0
    total_preds = []

    # Set model to training mode
    model.train()

    # Iterate over the batches in the dataloader
    for step, batch in enumerate(train_dataloader):
        # Move the batch to the device (GPU or CPU)
        batch = [r.to(device) for r in batch]
        input_ids, mask, labels = batch

        # Clear previously calculated gradients
        model.zero_grad()

        outputs = model(input_ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits

        # add on to the total loss (as a float, so the graph is not kept alive)
        total_loss = total_loss + loss.item()

        # backward pass
        loss.backward()

        # Reduce the effects of the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # update parameters
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

        # append the model predictions (detached and moved off the GPU)
        total_preds.append(logits.detach().cpu())

    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # returns the average loss
    return avg_loss
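The gradient clipping step in train() rescales all gradients by a single factor whenever their combined norm exceeds the threshold. The sketch below shows the idea on a flat list of numbers; it is a simplified illustration rather than the PyTorch implementation, and the function name and example gradients are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + 1e-6)   # small epsilon avoids division by zero
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Gradients with a norm of 5.0 are scaled down to a norm of about 1.0, while gradients that are already below the threshold are left unchanged. The direction of the update is preserved in both cases.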
Similar to how we defined the fine-tuning function, we define the evaluation function, called evaluate(). We set the model to be in the evaluation model, and iterate through each batch of data provided by val_dataloader. We used torch.no_grad() as we do not require the gradients during the evaluation of the model.
The average validation loss is computed once we have iterated through all the batches of validation data.
def evaluate():
    total_loss = 0.0

    # Set model to evaluation mode
    model.eval()

    # Iterate over batches
    for step, batch in enumerate(val_dataloader):
        # Move the batch to the device
        batch = [t.to(device) for t in batch]
        input_ids, mask, labels = batch

        # Deactivate autograd
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

        # Add on to the total loss (as a Python float)
        total_loss = total_loss + loss.item()

    # Compute the validation loss of the epoch
    avg_loss = total_loss / len(val_dataloader)

    return avg_loss
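Dividing the summed loss by the number of batches gives the mean of the per-batch losses. This matches the per-example mean only when every batch has the same size. The sketch below, which uses hypothetical loss values and batch sizes, shows how a small final batch can shift the result.

```python
def mean_of_batch_means(batch_losses):
    """Average the per-batch losses, giving every batch the same weight."""
    return sum(batch_losses) / len(batch_losses)


def sample_weighted_mean(batch_losses, batch_sizes):
    """Average the per-batch losses, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)


# Hypothetical values: two full batches and a small final batch
batch_losses = [0.50, 0.50, 0.80]
batch_sizes = [32, 32, 4]

per_batch = mean_of_batch_means(batch_losses)
per_sample = sample_weighted_mean(batch_losses, batch_sizes)
```

Here the per-batch average is higher than the per-example average because the four examples in the last batch count as much as 32 examples elsewhere. For large datasets with a single short batch, the difference is usually small.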
Now that we have defined both the training and evaluation functions, we are ready to start fine-tuning the model and performing an evaluation.
We first push the model to the available GPU, and then iterate through multiple epochs. For each epoch, we invoke the train() and evaluate() functions and obtain both the training and validation loss.
Whenever we find a better validation loss, we save the model to disk by invoking torch.save(), and update the variable best_val_loss.
train_losses = []
valid_losses = []

# Set initial loss to infinite
best_val_loss = float('inf')

# Push the model to GPU
model = model.to(device)

# Specify the name of the saved weights file
saved_file = "fakenewsnlp-saved_weights.pt"

# For each epoch
for epoch in range(num_epocs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, num_epocs))

    train_loss = train()
    val_loss = evaluate()

    print(f' Loss: {train_loss:.3f} - Val_Loss: {val_loss:.3f}')

    # Save the best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), saved_file)

    # Track the training/validation loss
    train_losses.append(train_loss)
    valid_losses.append(val_loss)

# Release the memory in GPU
model.cpu()
torch.cuda.empty_cache()
When you run the training loop, you will see the training and validation loss for each epoch. In this example, training ends after epoch 2, and we now have a RoBERTa model fine-tuned on the data from the FakeNews dataset.
Epoch 1 / 2
Loss: 0.649 - Val_Loss: 0.580
Epoch 2 / 2
Loss: 0.553 - Val_Loss: 0.546
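The save-on-improvement rule used in the loop can be isolated in a few lines. The sketch below is pure Python with a hypothetical helper named best_checkpoint_epochs. Given a sequence of validation losses, it reports which epochs would trigger a save and what the best loss is. The first list reuses the two validation losses shown above, and the second list is made up to show an epoch that does not improve.

```python
def best_checkpoint_epochs(val_losses):
    """Return (epochs that trigger a save, best validation loss)."""
    best_val_loss = float("inf")
    saved_epochs = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        # Save only when the validation loss strictly improves
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            saved_epochs.append(epoch)
    return saved_epochs, best_val_loss


# Validation losses from the run shown above
saved, best = best_checkpoint_epochs([0.580, 0.546])

# Hypothetical run in which the third epoch gets worse
saved_hyp, best_hyp = best_checkpoint_epochs([0.58, 0.55, 0.57])
```

In the run shown above, the last epoch is also the best one, so the weights in memory match the weights on disk. When a later epoch is worse, as in the hypothetical run, the saved file holds the better weights, and they can be restored with model.load_state_dict(torch.load(saved_file)) before testing.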
Let’s run the fine-tuned RoBERTa model on the test dataset. Similar to how we prepared the training and validation datasets earlier, we will start by tokenizing the test data, performing truncation, and padding.
tokenized_test = tokenizer(
    X_test.to_list(),
    padding='max_length',
    max_length=max_length,
    truncation=True,
    return_token_type_ids=False
)
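To see what padding to max_length and truncation produce, consider the following simplified sketch. It is pure Python and does not call the tokenizer. The helper pad_or_truncate, the token ids, and the pad_id value are assumptions made for illustration. A real tokenizer also preserves the closing special token when it truncates, which this sketch does not attempt.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    # Truncation: keep at most max_length tokens
    ids = token_ids[:max_length]
    # Real tokens get a mask value of 1
    mask = [1] * len(ids)
    # Padding: fill the remainder with pad_id, masked out with 0
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token ids for a short and a long input
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6, pad_id=1)
long_ids, long_mask = pad_or_truncate([0, 5, 6, 7, 8, 2], max_length=4, pad_id=1)
```

The attention mask is what lets the model ignore the padded positions, which is why both input_ids and attention_mask are passed to the model.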
Next, we prepare the test_dataloader that we will use for testing.
test_input_ids_tensor = torch.tensor(tokenized_test["input_ids"])
test_attention_mask_tensor = torch.tensor(tokenized_test["attention_mask"])
test_labels_tensor = torch.tensor(y_test.to_list())

test_data_tensor = TensorDataset(test_input_ids_tensor,
                                 test_attention_mask_tensor,
                                 test_labels_tensor)

test_dataloader = DataLoader(test_data_tensor,
                             batch_size=batch_size,
                             shuffle=False)
We are now ready to test the fine-tuned RoBERTa model. To do this, we iterate through multiple batches of data provided by test_dataloader. To obtain the predicted label, we use torch.argmax() to get the label using the logits that are provided. The predicted results are then stored in the variable predictions.
predictions = []

model = model.to(device)

# Set model to evaluation mode
model.eval()

# Iterate over batches
for step, batch in enumerate(test_dataloader):
    # Move the batch to the device
    batch = [t.to(device) for t in batch]
    input_ids, mask, labels = batch

    # Deactivate autograd
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=mask)
        logits = outputs.logits

    # The predicted label is the index of the largest logit
    predictions.append(torch.argmax(logits, dim=1).tolist())

# Move the model back to the CPU
model.cpu()
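The step from logits to predicted labels can be illustrated without a model. In the pure-Python sketch below, the logits are hypothetical values for two batches of a two-class problem, and the helper argmax_rows plays the role that torch.argmax(logits, dim=1) plays in the loop.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Hypothetical logits: one row per example, one column per class
batches = [
    [[2.1, -0.3], [0.2, 1.7]],   # batch 1: two examples
    [[-1.0, 0.5]],               # batch 2: one example
]

# One list of predicted labels per batch, as in the loop
predictions = [argmax_rows(batch) for batch in batches]

# Flatten the per-batch lists into a single list of labels
y_pred = [label for batch in predictions for label in batch]
```

Because predictions holds one list per batch, it has to be flattened before it can be compared with the true labels.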
Now that we have the predicted results, we are ready to compute the performance metrics of the model. We will use the classification_report() function from scikit-learn to get the different performance metrics for the evaluation.
import numpy as np
from sklearn.metrics import classification_report

y_true = y_test.tolist()

# Flatten the per-batch predictions into a single list
y_pred = list(np.concatenate(predictions).flat)

print(classification_report(y_true, y_pred))
The output of running the code is shown:
              precision    recall  f1-score   support

           0       0.76      0.52      0.62       816
           1       0.66      0.85      0.74       885

    accuracy                           0.69      1701
   macro avg       0.71      0.69      0.68      1701
weighted avg       0.71      0.69      0.68      1701
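The columns of the report follow from simple counts. The sketch below is a pure-Python illustration, with hypothetical helper names and a small made-up set of labels, of how precision, recall, F1, and support are derived for one class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Support is the number of true examples of this class
    return precision, recall, f1_score(precision, recall), tp + fn


# Hypothetical labels, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

metrics_0 = per_class_metrics(y_true, y_pred, label=0)
metrics_1 = per_class_metrics(y_true, y_pred, label=1)
```

The same formula reproduces the F1 values in the report: with a precision of 0.76 and a recall of 0.52, f1_score() rounds to 0.62, and with 0.66 and 0.85 it rounds to 0.74.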
Snorkel has been used in many real-world NLP applications across industry, medicine, and academia. At the same time, the field of NLP is evolving at a rapid pace. Transformer-based models have enabled many NLP tasks to be performed with high-quality results.
In this chapter, you learned how to use ktrain and HuggingFace to perform text classification on the FakeNews dataset that was labeled by Snorkel in Chapter 3.
By combining the power of Snorkel for weak labeling with NLP libraries like HuggingFace, ML practitioners can get started with developing innovative NLP applications!