If you have even a passing familiarity with the advances made in machine learning and artificial intelligence since 2018, you are almost certainly aware of the tremendous strides taken in natural language processing (NLP). Most of the progress in this area can be attributed to large language models (LLMs). The architecture behind these LLMs is the transformer, whose encoder-decoder design we discussed in Chapter 2.
The success of transformers came from the architecture’s ability to process input data in parallel and to capture context better via the attention mechanism. We already referred to the “Attention Is All You Need” paper by Vaswani et al. in previous chapters. Before the emergence of transformers, context was typically captured by a vanilla RNN or an LSTM without attention.
Hugging Face is now widely recognized as a one-stop shop for all things related to natural language processing (NLP), offering not only datasets and pretrained models but also a community and even a course.
Hugging Face is a relatively new company built on the principles of open source software and data. In a true sense, the revolution in NLP started with the democratization of NLP models based on the transformer architecture. Hugging Face turned out to be a pioneer not just in open sourcing these models but also in providing a handy, easy-to-use abstraction in the form of the Transformers library, which made it really easy to load these models and run inference with them.
Hugging Face provided a central place, or hub, where model developers can publish models in the huggingface repository and where developers looking to build applications on top of these models can consume them. As an example, the BERT (Bidirectional Encoder Representations from Transformers) model was contributed by Google to huggingface, which allowed a community of users to consume it in their applications. Then came GPT models from OpenAI such as GPT2, which are generative models that allow the end user to write applications that can generate, say, stories or novels. GPT2 is also part of the huggingface ecosystem. Hugging Face provided not only APIs for consuming those models but also a way to fine-tune them with our own datasets and to monitor and benchmark them. In a nutshell, the emergence of an ecosystem like huggingface has opened up a plethora of opportunities for developers intending to build applications on top of NLP models.
Currently the scope of these models is not limited to text processing. We now see the emergence of vision transformers and transformer-based models for audio. People are building applications for music generation or voice cloning in the audio domain and for synthetic (fake) image generation in the vision domain. There are also models trained on scientific literature that can be used to extract knowledge from scientific journals. Similarly, models based on law-related documents have emerged, and people can use them to build question-answering systems. These models come in a variety of sizes depending on the underlying architecture. As an example, the GPT3 model has around 175 billion parameters.
Similarly, the size and scope of their training datasets have both increased significantly, and so have the models themselves. For instance, the original transformer was followed by the much larger Transformer-XL; the number of parameters grew from 110 million in BERT-Base to 340 million in BERT-Large; and the GPT2 model, which had 1.5 billion parameters, was succeeded by the GPT3 model with 175 billion parameters. The Beijing Academy of Artificial Intelligence in China launched a model named Wu Dao 2.0, which has around 1.75 trillion parameters. Proponents of scaling argue that as we increase the sizes of these models, we will move closer to the goal of artificial general intelligence (AGI).
To give an example of the infrastructure needed for such large models, GPT3 was trained on a supercomputer with more than 10,000 GPUs. This means that training these models lies only within the realm of big companies. Now, with the ability to consume most of these models, the end user becomes part of the application development ecosystem built around these large language models.
Features of the Hugging Face Platform
Because the Hugging Face platform is built around attention-based transformer models, it should come as no surprise that the Transformers library is at the center of the Hugging Face ecosystem. The accompanying Datasets and Tokenizers libraries support the Transformers library. Keep in mind that transformers are unable to comprehend text in its original form, which is a string of characters. Since our inputs are text, this text has to be encoded in a way that makes it consumable by the transformer-based neural network architecture. For this we make use of the huggingface-provided APIs for tokenizing, which are known as tokenizers.
Apart from tokenizing, we might need custom datasets to either fine-tune existing models or train models from scratch. For uniformity, huggingface provides an abstraction for datasets via the Datasets API. Users can then bring their own datasets, upload them, train or fine-tune models, and upload the trained models, all just by using the huggingface APIs. This is nothing less than revolutionary.
Components of Hugging Face
The huggingface library is based on a set of rich abstractions that hide the complexity of creating applications based on natural language processing. These abstractions give us a single interface for loading models, tokenizers, and datasets, regardless of which specific model, tokenizer, or dataset we use. This is an empowering experience for the developer, whose job is simplified by these abstractions. We discuss some of them in the following subsections.
Pipelines
Pipelines provide a simple interface to common NLP tasks such as the following:
1. Sentiment analysis determines whether the overall sentiment of a sentence is positive or negative.
2. Question answering takes a question and extracts the corresponding answer from a given text.
3. Masked language modeling suggests possible words to fill a masked position in the input, given the context.
4. Named entity recognition assigns a label to each of the tokens in the input.
5. Summarization reduces a longer piece of writing or an article to a more concise summary.
A pipeline consists of three components:
1. Tokenizer
2. Model
3. Post-processor
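The three components can be pictured as three functions applied in sequence. The following is a minimal sketch in plain Python, not the Transformers implementation: the vocabulary, the word-counting stand-in for the model, and all function names (`toy_tokenizer`, `toy_model`, `toy_post_processor`, `toy_pipeline`) are assumptions made up for illustration.

```python
# Hypothetical toy vocabulary; a real tokenizer ships its vocabulary with the checkpoint.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "this": 4, "movie": 5}
LABELS = ["NEGATIVE", "POSITIVE"]


def toy_tokenizer(text):
    """Stage 1: convert raw text into a list of integer token IDs."""
    return [VOCAB.get(word, VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(input_ids):
    """Stage 2: stand-in for a neural network; returns one raw score (logit) per label."""
    negative = float(input_ids.count(VOCAB["hate"]))
    positive = float(input_ids.count(VOCAB["love"]))
    return [negative, positive]


def toy_post_processor(logits):
    """Stage 3: turn the raw scores into a human-readable label."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best_index]


def toy_pipeline(text):
    """Chain the three stages in order."""
    return toy_post_processor(toy_model(toy_tokenizer(text)))


print(toy_pipeline("I love this movie"))  # POSITIVE
print(toy_pipeline("I hate this movie"))  # NEGATIVE
```

Each stage only depends on the output of the previous one, which is why the same pipeline shape can serve different tasks by swapping the model and the post-processor.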
Tokenizer
The first component in the pipeline is a tokenizer, which takes raw text as input and converts it into numbers so that the model can interpret it. It does so in three steps:
1. Splitting the input into individual tokens, which may be words, sub-words, or symbols (such as punctuation marks)
2. Converting each token into an integer
3. Adding any additional inputs that might be useful to the model
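The three steps can be sketched with a toy tokenizer in plain Python. This is only an illustration under stated assumptions: the five-entry vocabulary and the helper names (`split_into_tokens`, `convert_to_ids`, `encode`) are hypothetical, and real tokenizers usually split text into sub-words rather than whole words. The attention mask stands in for the extra input added in the third step.

```python
import re

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"[UNK]": 0, "hello": 1, ",": 2, "world": 3, "!": 4}


def split_into_tokens(text):
    """Step 1: split the text so that words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def convert_to_ids(tokens):
    """Step 2: map each token to an integer, falling back to the unknown-token ID."""
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


def encode(text):
    """Run all three steps and return a dictionary of model inputs."""
    input_ids = convert_to_ids(split_into_tokens(text))
    # Step 3: an extra input; here every token is real, so every mask entry is 1.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = encode("Hello, world!")
print(encoded)  # {'input_ids': [1, 2, 3, 4], 'attention_mask': [1, 1, 1, 1]}
```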
When the model was trained, its inputs were tokenized with a particular tokenizer. When we use the model on actual inputs, we need to make sure that we use the same tokenizer. This task is made easy by the AutoTokenizer class, which automatically loads the tokenizer that was used during training. This makes life considerably simpler for the developer.
We load the tokenizer used to pretrain the GPT2 model in the following code.
Code for a simple tokenizer
The tokenizer will provide a dictionary with the following entries:
The numerical representations of your tokens are referred to as input IDs.
The attention mask is a mask that specifies which tokens need to have attention paid to them.
Code for using a tokenizer for tokenizing text
Using BERT for tokenizing the text
1. The input_ids entry holds the indices that correspond to each token in the sentence.
2. The attention_mask entry specifies whether or not a token needs to be attended to.
3. When there is more than one sequence, the token_type_ids entry identifies which sequence a token is a part of.
As can be seen, the tokenizer inserted two special tokens into the sentence. These tokens are known as CLS and SEP, which stand for classification and separator, respectively. The tokenizer will take care of adding any necessary special tokens for you, provided that the model in question actually requires them.
This code takes two sentences as input and generates tokens for them
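To make the three entries concrete for a pair of sentences, here is a minimal sketch in plain Python that assembles them by hand. It is an illustration rather than the library's code: `encode_pair` is a hypothetical helper, the word IDs are made up, and the special-token IDs 101 and 102 are assumed to be those of CLS and SEP in the bert-base-uncased vocabulary.

```python
CLS_ID = 101  # assumed ID of the [CLS] token
SEP_ID = 102  # assumed ID of the [SEP] token


def encode_pair(first_ids, second_ids):
    """Build BERT-style inputs for two already-tokenized sentences."""
    # Layout: [CLS] first sentence [SEP] second sentence [SEP]
    input_ids = [CLS_ID] + first_ids + [SEP_ID] + second_ids + [SEP_ID]
    # Segment 0 covers [CLS], the first sentence, and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(first_ids) + 2) + [1] * (len(second_ids) + 1)
    # Nothing is padded here, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Made-up word IDs standing in for two short sentences.
pair = encode_pair([7, 8], [9, 10, 11])
print(pair["input_ids"])       # [101, 7, 8, 102, 9, 10, 11, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(pair["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1]
```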
Padding
When we process a group of sentences, their lengths are not always the same. However, the inputs to a model need to have the same size, because a batch is passed to the model as a single rectangular tensor. This presents a problem. The strategy known as “padding” solves it by adding padding tokens to any sentence that contains too few tokens.
This code shows how padding for tokenizers works
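The mechanism can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, and the padding ID is assumed to be 0. In the Transformers library, padding is requested by passing `padding=True` when calling the tokenizer.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # The attention mask is 1 for real tokens and 0 for padding positions.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three made-up encoded sentences; the third one is shorter than the others.
batch = [
    [101, 7, 8, 9, 102],
    [101, 10, 11, 12, 102],
    [101, 13, 102],
]
padded = pad_batch(batch)
print(padded["input_ids"][2])       # [101, 13, 102, 0, 0]
print(padded["attention_mask"][2])  # [1, 1, 1, 0, 0]
```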
As we can see, the third sentence was shorter, so the tokenizer padded it with zeroes.
Truncation
Sometimes a sequence is too long for a model to handle. In this scenario, you need to truncate the sequence to a more manageable length.
This code shows how the truncation flag works in tokenizers
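As a rough sketch of what truncation does, the following plain-Python helper cuts a sequence so that it fits a maximum length while keeping the special tokens at both ends. The helper name `truncate_to_max_length`, the made-up word IDs, and the special-token IDs 101 and 102 are assumptions for illustration. In the Transformers library, truncation is requested by passing `truncation=True`, optionally together with `max_length`, when calling the tokenizer.

```python
def truncate_to_max_length(token_ids, max_length, cls_id=101, sep_id=102):
    """Keep at most max_length IDs in total, including the two special tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    kept = token_ids[: max_length - 2]  # drop tokens from the end of the sequence
    return [cls_id] + kept + [sep_id]


long_sentence = [7, 8, 9, 10, 11, 12]  # made-up word IDs
print(truncate_to_max_length(long_sentence, max_length=5))  # [101, 7, 8, 9, 102]
print(truncate_to_max_length([7, 8], max_length=5))         # [101, 7, 8, 102]
```

A sequence that already fits is returned unchanged apart from the added special tokens, so truncation can safely be applied to every input.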
The next stage in the pipeline is the model.
AutoModel
We will now see how AutoModel makes life simpler for us when loading pretrained models.
The Transformers library makes loading pretrained instances straightforward and uniform. This means that you can load an AutoModel in the same way that you load an AutoTokenizer. The sole distinction lies in selecting the appropriate AutoModel for the task at hand.
If we take text classification as an example, the model is loaded and used as follows:
1. Create an instance of a tokenizer and a model based on the name of the checkpoint. The library determines from the checkpoint that the model is a BERT model and then loads the weights saved in the checkpoint into it.
2. Get the tokens and pass them to the model.
3. The model returns the logits.
4. Apply a softmax to calculate the probability of each class into which the sentence can be classified (negative or positive in our example).
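The last step can be worked through by hand. Below is a minimal sketch of softmax in plain Python; the logit values and the label order are hypothetical and do not come from a real model. The maximum logit is subtracted before exponentiating, a common trick that leaves the result unchanged while avoiding overflow.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["NEGATIVE", "POSITIVE"]  # assumed label order
logits = [-1.5, 2.0]               # hypothetical model output for one sentence

probs = softmax(logits)
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")
# NEGATIVE: 0.0293
# POSITIVE: 0.9707
```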
Loading the tokenizer
Listing 4-7-2 shows how a model is loaded via the Transformers library.
Loading the model
Generate the logits
Print probabilities specific to the individual class of sentiment
This reflects a negative sentiment in the sentence.
Next is the result for the second sentence:
This reflects a positive sentiment in the sentence.
This code shows how to use the pipeline API for doing sentiment analysis
We can verify that the outputs match whether or not we use the pipeline class in our code.
Summary
In this chapter we discussed the architecture of the huggingface library and its components, such as tokenizers and models. We also learned how to use these components for a simple task such as analyzing the sentiment of sentences. In the next chapter, we will look at more examples of different kinds of tasks using the Transformers library.