© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
S. M. Jain, Introduction to Transformers for NLP, https://doi.org/10.1007/978-1-4842-8844-3_1

1. Introduction to Language Models

Shashank Mohan Jain
Bangalore, India

Language is power, life and the instrument of culture, the instrument of domination and liberation.

—Angela Carter (English writer)

One of the biggest developments that set Homo sapiens apart from other animal species on this planet was the evolution of language. Language allowed us to exchange and communicate ideas, which in turn led to countless scientific discoveries, including the Internet itself. That is how important language is.

So when we venture into the area of artificial intelligence, progress will stall unless we ensure that machines can comprehend natural language. It is therefore pertinent for anyone who wants to venture into artificial intelligence, and thereby artificial general intelligence, to develop a good grasp of how we are progressing on teaching machines to understand language.

The intent of this chapter is to take you through the evolution of the natural language processing domain, covering some of its history and its development into the state-of-the-art neural network–based language models of today.

History of NLP

Natural language processing is one of the fastest-developing areas of machine learning and artificial intelligence. Its aim is to give machines the capability to understand natural language and to assist humans in tasks involving it. The field grew from the seed of machine translation (MT), whose roots trace back to the code-breaking efforts of the Second World War. The intent of NLU, or natural language understanding, is to allow machines to comprehend natural language and accomplish tasks like translating from one language to another, determining the sentiment of a text segment, or summarizing, say, a paragraph.

A language can be broken down into component parts, which can be thought of as a set of rules or symbols. These symbols are then combined and used to transmit and broadcast information. The field of natural language processing is divided into several subfields, the most notable of which are natural language generation and natural language understanding. As their names imply, these subfields are concerned with the production of text and with its comprehension, respectively. Do not let terms such as phonology, pragmatics, morphology, syntax, and semantics throw you off.

One of the main goals of NLP and NLU is to capture not just the statistical properties of a language but also its semantics. With machine learning, the aim is to feed content in a certain language to the machine and let it learn not just the statistical properties but also the meaning and context of, say, a certain word.

An NLP engineer’s workflow would consist of first attempting to transform the words into numbers that computers are able to interpret and then developing machine learning architectures that are able to use these numbers for the many tasks that are necessary. In more specific terms, it should involve the following steps:
  1. Collecting Data: The first thing to do in every project is to collect data directly connected to the problem you are working on. Data collection is essential to machine learning and is a discipline in its own right. We supply many algorithms with vast volumes of data, some of which may have been prohibitively expensive to obtain. In the context of NLP, this stage may entail collecting tweets or reviews from ecommerce websites like Amazon or Yelp, and it may also involve cleaning and categorizing what was collected.
  2. Tokenization: This is the process of chopping each piece of text into manageable word chunks in preparation for the subsequent stage. In this phase, you may also remove stop words, perform lemmatization, or stem the text.
  3. Vectorization: In this stage, the tokens obtained in the previous step are transformed into vectors that ML algorithms can process. It should be clear at this point that the models we develop do not truly see and comprehend words in the manner that we humans do (or think we do), but rather operate on vector representations of those words. (A minimal sketch of steps 2 and 3 appears right after this list.)
  4. Model Creation and Evaluation: This step entails developing ML models and architectures, such as transformers, that can chew on the word vectors for the required tasks: translation, semantic analysis, named entity recognition, and similar activities. The NLP engineer is responsible for ongoing evaluation of the models and architectures, measuring them against previously established goals and KPIs.

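As a minimal sketch of steps 2 and 3 (the sample sentence, the tiny stop-word list, and the integer-id scheme are illustrative choices, not from any particular library):

# Sketch of tokenization (step 2) and the simplest vectorization (step 3).
text = "The battery life of this phone is amazing"

stop_words = {"the", "of", "this", "is", "a", "an"}  # illustrative only

# Tokenization: lowercase and split on whitespace, then drop stop words.
# Real pipelines use smarter tokenizers that handle punctuation, etc.
tokens = [word for word in text.lower().split() if word not in stop_words]
print(tokens)  # ['battery', 'life', 'phone', 'amazing']

# Vectorization (simplest form): assign each unique token an integer id.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]
print(vocab)  # {'amazing': 0, 'battery': 1, 'life': 2, 'phone': 3}
print(ids)    # [1, 2, 3, 0]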
When we want to understand something, we form a mental model of that thing. For example, if I say I understand what a cat looks like, I have a mental model of its features: fur, eyes, legs, and so on. These features, or dimensions, enable us to represent a thing or a concept.

Similarly, to begin understanding a sentence or a word, we first need a mathematical representation of the word itself, because machines only understand numbers. So we need an approach that encodes the representation as numbers. We start with a very simplistic statistical approach called bag of words, described next.

Bag of Words

Bag of words is a statistical technique that uses word counts to create mathematical representations of the documents in which those words occur. By a mathematical representation, we mean a vector that places a document in a vector space where each word can be thought of as a separate dimension.

Let’s take a simple example. Assume we have three documents with one sentence each as illustrated in the following:
  • Document 1: I am having fun of my lifetime.

  • Document 2: I am going to visit a tourist destination this time.

  • Document 3: Tourist destinations provide such fun.

First, we create a vocabulary based on all the words present in our document set. As our document set consists of these three documents, the vocabulary is as follows (taking only the unique words):
  • I

  • am

  • having

  • fun

  • of

  • my

  • lifetime

  • going

  • to

  • visit

  • tourist

  • destination

  • this

  • time

  • provide

  • such

So we have 16 words here (for simplicity, we keep most stop words such as "to" and "this" in the set, dropping only the article "a"). This becomes our vocabulary. Now think of each word as a dimension in a 16-dimensional vector space.

If we take document 1, the dimensions can be coded as in the following:
[1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]
Document 2 will look like this:
[1,1,0,0,0,0,0,1,1,1,1,1,1,1,0,0]
Document 3 will look like this:
[0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,1]

Here, 1 represents the presence of a word in the document. This mechanism is also called one-hot encoding and is the most simplistic representation of a document or a sentence. As we move ahead in the book, we will see how word representations get better and better.

This representation allows the machine to play around with these numbers and perform mathematical operations on them.
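As a sketch, the whole bag-of-words procedure above fits in a few lines of Python. This from-scratch version reproduces the three vectors exactly; in practice a library such as scikit-learn's CountVectorizer does the same job:

# A minimal bag-of-words encoder reproducing the three document vectors.
docs = [
    "I am having fun of my lifetime",
    "I am going to visit a tourist destination this time",
    "Tourist destinations provide such fun",
]

# Build the vocabulary in first-seen order, as in the text. We fold
# "destinations" into "destination" by hand to mirror the example.
normalize = {"destinations": "destination"}
vocab = []
for doc in docs:
    for word in doc.lower().split():
        word = normalize.get(word, word)
        if word != "a" and word not in vocab:  # only "a" is dropped here
            vocab.append(word)

def to_vector(doc):
    """Mark 1 for each vocabulary word present in the document."""
    words = {normalize.get(w, w) for w in doc.lower().split()}
    return [1 if term in words else 0 for term in vocab]

for doc in docs:
    print(to_vector(doc))
# [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]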

n-grams

With the bag of words model covered, we can now turn to n-grams.

Because of the sequential quality of language, the sequence in which words are presented in a text is of utmost importance. This is something that we are all well aware of, even if we don’t consciously think about it very often. n-grams provide us the capability to predict the next word given the previous words.

The primary concept behind generating text with n-grams is statistical: determine the probability of occurrence of the nth word in a sequence given the n-1 words that precede it. In effect, it uses the chain rule of probability to predict the occurrence of a word with a certain probability.

This calculation is shown in the following:

$$ P\left({\boldsymbol{x}}^{(t+1)} \mid {\boldsymbol{x}}^{(t)}, \dots, {\boldsymbol{x}}^{(1)}\right) = P\left({\boldsymbol{x}}^{(t+1)} \mid {\boldsymbol{x}}^{(t)}, \dots, {\boldsymbol{x}}^{(t-n+2)}\right) $$

The left side of the equation signifies the probability of seeing the word x(t+1) given that we have seen the words x(1) through x(t); the right side makes the n-gram simplification, conditioning only on the most recent n-1 words.

The whole concept rests on the chain rule of probability, which in simple terms allows a complex joint distribution to be factored into a set of conditional distributions.
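For concreteness, here is the chain rule factorization in symbols (standard probability notation, written in the same style as the equation above):

$$ P\left({\boldsymbol{x}}^{(1)}, \dots, {\boldsymbol{x}}^{(T)}\right) = \prod_{t=1}^{T} P\left({\boldsymbol{x}}^{(t)} \mid {\boldsymbol{x}}^{(t-1)}, \dots, {\boldsymbol{x}}^{(1)}\right) $$

An n-gram model then approximates each conditional factor by truncating the context to the most recent n-1 words.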

Let’s examine a simple illustration of a trigram model. A trigram model keeps the context of the last two words to predict the next word in the sequence.

Take for example the following utterance:
  • “Anuj is walking on the ___.”

Here, we need to predict the word after the utterance “on the.”

Assume that, based on the dataset, the probable continuations are "road" and "pavement."

Now we need to calculate the probabilities:

P(road | on the) and P(pavement | on the)

The first probability is for the occurrence of the word road given the words on the have already occurred before it. The second probability is for the word pavement given the words on the have occurred before it.

After the probabilities have been calculated, the word that has the highest conditional probability would be the next word.

We can restate this as a simple mechanism, based on statistical methods, for calculating the probability of occurrence of the next word given the context of the previous words. As the corpus grows and the number of sentences increases, doing these calculations beyond simple bigrams becomes extremely challenging. We therefore need a way to learn these conditional probability distributions. By conditional distribution we mean: given some words, what is the probability distribution over the words that can occur next? Since we don't know what the shape of these distributions could be, we use neural networks to approximate their parameters. In the next section, we will cover recurrent neural networks (RNNs), which allow us to achieve this task.
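Before moving on, the counting approach itself is simple enough to sketch in code. The following is a minimal, from-scratch illustration of the trigram calculation above; the tiny corpus is made up for the example:

# Estimating trigram probabilities by counting.
from collections import Counter

corpus = [
    "anuj is walking on the road",
    "the dog is walking on the pavement",
    "cars drive on the road",
]

bigram_counts = Counter()
trigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 1):
        bigram_counts[(words[i], words[i + 1])] += 1
    for i in range(len(words) - 2):
        trigram_counts[(words[i], words[i + 1], words[i + 2])] += 1

def p_next(w1, w2, w3):
    """P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p_next("on", "the", "road"))      # 2/3
print(p_next("on", "the", "pavement"))  # 1/3
# "road" has the higher conditional probability, so it is predicted next.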

Recurrent Neural Networks

Siri on Apple products and voice search on Google products have both made use of recurrent neural networks (RNNs), for years the leading algorithm for processing sequential input.

As the name suggests, an RNN is a recurrent network, and due to this recurrent nature, RNNs are well suited to handling sequences of data, like time series or language. When dealing with sequences, the most important aspect is handling context, which entails remembering what happened earlier in the sequence and using that information to represent the current input in a better way.

Because they maintain an internal memory, recurrent neural networks (RNNs) are extremely strong and reliable. They are also among the most promising algorithms in use.

In comparison with many other types of deep learning algorithms, recurrent neural networks have been around for quite some time. They were first developed in the 1980s, but it has only been in the most recent decades that we have realized the full extent of their potential.

Though RNNs have been used for handling sequential data, they tend to suffer from issues like vanishing gradients and an inability to capture long-term dependencies. This led to the emergence of new neural network architectures like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit), which overcame many of the issues with plain RNNs.

LSTMs and GRUs, which have their own internal memories and more refined architectures than a plain RNN, are able to remember significant aspects of their input, which enables them to make accurate predictions about what comes next.

LSTMs and GRUs became the default neural network architectures when it came to handling sequential data like time series or languages or data provided by sensors deployed in the field mainly for IoT-based applications.

What Exactly Is a Recurrent Neural Network (RNN)?

Recurrent neural networks, or RNNs, are a subcategory of neural networks that is advantageous for modeling sequence data. RNNs are derived from feed-forward networks and display behavior loosely analogous to the way human brains operate. Put another way, recurrent neural networks can produce predictions from sequential data in a way that other algorithms cannot.

How RNNs Work

In order to have a complete comprehension of RNNs, you will need to have a functional understanding of “regular” feed-forward neural networks as well as sequential data.

The most fundamental definition of sequential data is simply ordered data in which related items follow one another in time. Examples include DNA sequences and financial data. The most common kind of sequential data is probably time series data, which is nothing more than a series of data points presented in chronological order.

The manner in which information flows through them is what gives RNNs and feed-forward neural networks their respective names.

A feed-forward network is a single-pass network: there is no recurrence in it. Input is passed through the layers, the network's output is compared with the target output, and error corrections are introduced via the backpropagation (BP) mechanism. This handles only one input at a time.

Because it handles just one input at a time, a feed-forward neural network is poor at forecasting what will happen next: it has no memory in which to store the information it takes in. A feed-forward network has no concept of chronological order because it considers only the current input; it retains nothing of what occurred in the past, other than what was baked in during training.
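To make the "no memory" point concrete, here is a minimal feed-forward pass (the shapes and random weights are arbitrary, purely for illustration); note that nothing carries over between calls:

# A minimal stateless feed-forward pass.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input (3 features) -> hidden (4 units)
W2 = rng.normal(size=(2, 4))  # hidden -> output (2 units)

def feed_forward(x):
    h = np.tanh(W1 @ x)   # one pass through the layers...
    return W2 @ h         # ...and the input is forgotten immediately

y1 = feed_forward(np.array([1.0, 0.0, 0.0]))
y2 = feed_forward(np.array([0.0, 1.0, 0.0]))
# y2 is computed with no trace of the first input: there is no state
# carried over between calls.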

In an RNN, information cycles through a loop. When it comes time to make a decision, the network takes into account the current input in addition to what it has learned from the inputs it received previously.

Figure 1-1 presents this difference in a visual form.


Figure 1-1

Difference between a simple feed-forward network and an RNN

Another effective way to illuminate the idea of a recurrent neural network's memory is with an example.

Imagine you have a typical feed-forward neural network and you feed it the word "machine" as input, one character at a time. By the time it reaches the character "h," it has already forgotten about "m," "a," and "c," which makes it nearly impossible for this kind of network to anticipate which character will come next.

However, because it possesses its own internal memory, a recurrent neural network is able to remember those characters. It generates output, copies that output, and feeds it back into the network along with the next input.

To put it another way, recurrent neural networks incorporate recent history into the analysis of the present.

As a result, an RNN takes into consideration both the present and the recent past as its inputs. This is significant because the sequence of data provides essential information about what is to come after it. This is one of the reasons an RNN is able to accomplish tasks that other algorithms are unable to.

Like every other deep learning algorithm, a recurrent neural network produces its output by applying weight matrices to its inputs. Note, however, that an RNN applies weights not only to the current input but also to a hidden state that carries the inputs that came before it. In addition, a recurrent neural network adjusts these weights over the course of training using gradient descent and backpropagation (BP) through time.
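To make the recurrence concrete, here is a minimal sketch of a single recurrent layer processing the word "machine" one character at a time. The weights are random placeholders; a real network would learn them with backpropagation through time:

# A minimal recurrent step: the hidden state h carries a summary of the
# characters seen so far, which is the "internal memory" described above.
import numpy as np

rng = np.random.default_rng(0)
chars = sorted(set("machine"))             # a, c, e, h, i, m, n
V, H = len(chars), 8
W_xh = rng.normal(scale=0.1, size=(H, V))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden (recurrence)

def one_hot(ch):
    v = np.zeros(V)
    v[chars.index(ch)] = 1.0
    return v

h = np.zeros(H)  # memory starts empty
for ch in "machine":
    # The new state depends on the current character AND the old state,
    # so by the time we reach "h", the network still "remembers" m, a, c.
    h = np.tanh(W_xh @ one_hot(ch) + W_hh @ h)

print(h)  # a fixed-size summary of the whole sequence seen so far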

As things evolved, we discovered that there are challenges with RNNs and their counterparts in terms of processing time and capturing long-term dependencies between words in a sentence. This led us to the evolution of language models, which we describe in the next section.

Language Models

Over the course of the last decade, there has been a substantial amount of development in the field of information extraction from textual data. Natural language processing replaced text mining as the name of this field, and as a result, the approach that is applied in this field has also undergone a significant change. The development of language models as a foundation for a wide variety of applications that seek to extract useful insights from unprocessed text was one of the primary factors that brought about this transition.

A probability distribution over words or word sequences is the fundamental building block of a language model. In application, a language model provides the probability that a particular word sequence is "valid." Here, "validity" does not refer to grammatical correctness at all; it means the sequence resembles the way people speak (or, to be more specific, write), because that is how the language model acquires its knowledge. A language model is "just" a tool for condensing abundant information in a form that is reusable in an out-of-sample setting. This is an important point to keep in mind because it shows that there is no magic to a language model (as with other machine learning models, particularly deep neural networks).

What Advantages Does Using a Language Model Give Us?

The abstract comprehension of natural language, which is required in order to deduce word probabilities from context, can be put to use in a variety of tasks.

With an accurate language model, we are able to perform extractive or abstractive summarization of texts. If we have models for a variety of languages, it becomes much simpler to develop automatic translation systems. Among the more complicated applications is question answering (with or without context). Language models today are being put to very interesting uses such as software code generation, text-to-image generation (for example, DALL-E 2 from OpenAI), and open-ended text generation (for example, GPT3).

It is essential to understand that there is a distinction between
  a) Statistical techniques that come under probabilistic language models
  b) Language models that are built on neural networks

As explained in the "n-grams" section, calculating the probabilities of n-grams yields a straightforward probabilistic language model (case a), an n-gram being a sequence of n words, with n an integer greater than 0. The likelihood of an n-gram can be defined as the conditional probability that the n-gram's final word follows the preceding n-1 gram (the n-gram without its last word). In everyday terms, it is the relative frequency with which the final word follows that n-1 gram. Given the n-1 gram, which represents the present, the probability of the n-gram, which represents the future, does not depend on the n-2, n-3, etc. grams, which represent the past. This is the Markov assumption.
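Written out for the trigram case, the estimate described above is simply a ratio of counts (standard maximum-likelihood notation, stated here for concreteness):

$$ P\left(w_3 \mid w_1\ w_2\right) = \frac{\operatorname{count}\left(w_1\ w_2\ w_3\right)}{\operatorname{count}\left(w_1\ w_2\right)} $$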

There are obvious disadvantages to this method. Most essentially, the probability distribution of the next word is influenced only by the n words that came before it. Rich texts carry long-range context that can have a significant impact on the choice of the next word, so the identity of the next word might not be discernible from the preceding n words, even with n as high as 50.

On top of that, this method does not scale well: the number of possible n-grams skyrockets as n increases (with a vocabulary of size V, there are V^n possible n-grams), even though the majority of them never appear in any actual text. In addition, every occurring probability (that is, every n-gram count) needs to be computed and stored. Non-occurring n-grams also cause a sparsity problem: the granularity of the probability distribution can be quite low, with word probabilities taking few distinct values, so most words end up with the same probability.

Neural Network–Based Language Models

The manner in which neural network–based language models encode inputs makes it easier to deal with the sparsity problem. An embedding layer produces a vector of arbitrary size for each word, in which the semantic links between words are taken into account. These continuous vectors provide the granularity that is so desperately needed in the probability distribution of the next word. In addition, a language model is basically a function (as are all neural networks, which involve a great deal of matrix calculation), so it is not necessary to store all of the n-gram counts in order to construct the probability distribution of the next word.
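As a rough sketch, an embedding layer amounts to a learnable lookup table; the values below are random placeholders rather than trained embeddings, and the tiny vocabulary is made up for illustration:

# A minimal sketch of an embedding layer: each vocabulary word maps to a
# dense vector of fixed size. Training would move related words closer.
import numpy as np

vocab = ["road", "pavement", "walking", "cat"]
embedding_dim = 5
rng = np.random.default_rng(0)

# The embedding "layer" is just a lookup table of shape (|V|, d).
embeddings = rng.normal(size=(len(vocab), embedding_dim))

def embed(word):
    return embeddings[vocab.index(word)]

print(embed("road"))      # a dense 5-dimensional vector, not a sparse count
print(embed("pavement"))  # after training, this would sit near "road"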

The sparsity problem can be solved using neural networks, but the difficulty with context remains. Developing language models first meant solving the context problem ever more effectively, by bringing in ever more context words to influence the probability distribution. Second, the objective was to design an architecture that endows the model with the capability of discovering which words in a given context are more significant than others.

Utilizing recurrent neural networks (described in the previous section) is a step in the right direction with regard to handling context. Whether as a plain RNN or as an LSTM or GRU cell-based network, such a model takes into account all of the words that came before when selecting the following word.

The fact that RNN-based designs are sequential is their primary downside. With no opportunity for parallel processing, the amount of time required for training skyrockets on lengthy sequences. The transformer architecture is the answer to this predicament. It is recommended that you read the original "Attention Is All You Need" paper produced by Google Brain.

Additionally, OpenAI's GPT models and Google's BERT utilize the transformer design (which we will discuss in upcoming chapters). These models also make use of a technique known as attention, by which the model learns which inputs, in particular circumstances, deserve more attention than others.
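To give a flavor of the idea, here is a minimal sketch of scaled dot-product attention, the core operation behind these models. It follows the standard softmax(QK^T / sqrt(d)) V formula; it is a generic illustration with random toy data, not the actual GPT or BERT code:

# Scaled dot-product self-attention over a toy sequence.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # how relevant each
    weights = np.exp(scores)                        # position is to each other
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 token vectors of dimension 8
out = attention(x, x, x)      # self-attention: tokens attend to each other
print(out.shape)              # (4, 8): same shape, context-mixed content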

In terms of model architecture, the most significant quantum leaps were made first by RNNs (specifically LSTM and GRU), which helped with the sparsity problem and allowed language models to use far less disk space, and then, more recently, by the transformer architecture, which made parallelization possible and introduced attention mechanisms. Both developments were important. However, architecture is not the only domain in which a language model can demonstrate its prowess.

OpenAI has released several language models based on the transformer architecture. The first was GPT1, and the latest is GPT3. Contrasted with the architecture of GPT1, GPT3 contains almost no innovative features; but it is massive. It was trained on the largest corpus a model had ever been trained on, known as the Common Crawl, and it has 175 billion parameters. This is made possible in part by the semi-supervised training method of a language model, which allows any text to be used as a training example while some of its words are omitted. The remarkable power of GPT3 stems from the fact that it has read virtually all of the text published on the Internet over the last few years and has the capacity to reflect most of the complexity that natural language possesses.

In conclusion, I would like to present the T5 model from Google (Figure 1-2). In the past, language models were used for conventional natural language processing tasks, such as part-of-speech (POS) tagging or machine translation, albeit with a few tweaks here and there. Because of its abstract ability to understand the fundamental structure of natural language, BERT, for instance, may be retrained to function as a POS tagger with only a little additional training.

[Image: Google's T5 text-to-text model translating English to German.]

Figure 1-2

Google's T5 model, which is a text-to-text model

Figure 1-2 shows a representation of the T5 language model.

When using T5, no modification of the model is required to perform different NLP jobs. If it receives a text containing mask (sentinel) tokens, it recognizes that these tokens represent blanks to be filled in with the proper words. It is also capable of providing answers: if it is given some context along with a question, it will search that context for the answer; otherwise, it answers from the information it already possesses. It is an interesting fact that its designers have been bested by it in a trivia contest. Additional examples of possible applications can be seen on the left of Figure 1-2.
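As a rough illustration of this text-to-text interface, a sketch using the Hugging Face transformers library might look like the following (it assumes the transformers and sentencepiece packages are installed and that the t5-small checkpoint can be downloaded; the exact output may vary):

# T5's text-to-text interface: the task is named in the input text
# itself, so no task-specific model surgery is needed.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = "translate English to German: The house is wonderful."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected output along the lines of: "Das Haus ist wunderbar."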

Summary

In my opinion, NLP is the area in which we are most likely to succeed in developing artificial intelligence. There is a great deal of hype around this term: many straightforward decision-making systems, and virtually every neural network, are referred to as AI, but this is primarily marketing. The Oxford Dictionary of English, like just about any other dictionary, defines artificial intelligence as the performance by a machine of tasks analogous to those performed by human intelligence. One of the key aspects of AI is generalization: the ability of a single model to do many tasks. The fact that the same model can perform a wide variety of NLP tasks and can infer what to do from the input is astounding in itself, and it takes us one step closer to genuinely developing artificial intelligence systems comparable to human intellect.
