Exploring Sentence-, Document-, and Character-Level Embeddings

In Chapter 5, Word Embeddings and Distance Measurements for Text, we looked at how information related to the ordering of words, along with their semantics, can be taken into account when building embeddings to represent words. The idea of building embeddings will be extended in this chapter. We will explore techniques that will help us build embeddings for documents and sentences, as well as for words based on their characters. We will start by looking into an algorithm called Doc2Vec, which, as the name suggests, provides document- or paragraph-level contextual embeddings. A sentence can essentially be treated as a paragraph, and embeddings for individual sentences can also be obtained using Doc2Vec. We will briefly discuss techniques such as Sent2Vec, which are focused on obtaining embeddings for sentences based on n-grams. Before Sent2Vec, we will discuss fastText extensively, which is a technique for building word representations using character n-grams. An introduction to the Universal Sentence Encoder (USE) will be provided toward the end of this chapter.

The following topics will be covered in this chapter:

  • Venturing into Doc2Vec
  • Exploring fastText
  • Understanding Sent2Vec and the Universal Sentence Encoder

Technical requirements

The code files for this chapter can be found at the following GitHub link: https://github.com/PacktPublishing/Hands-On-Python-Natural-Language-Processing/tree/master/Chapter06.

Venturing into Doc2Vec

As we saw in Chapter 5, Word Embeddings and Distance Measurements for Text, Word2Vec helped in fetching semantic embeddings for word-level representations. However, most of the NLP tasks we deal with operate on combinations of words, or what we essentially call paragraphs:

How do we fetch paragraph-level embeddings?

One simple mechanism would be to take the word embeddings for the words occurring in the paragraph and average them out to have representations of paragraphs:
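For instance, here's a minimal sketch of this averaging baseline, using a small Word2Vec model trained on Gensim's toy common_texts corpus (the helper function below is purely illustrative):

import numpy as np
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# Illustrative only: a tiny Word2Vec model trained on the toy corpus
w2v_model = Word2Vec(common_texts, size=5, min_count=1)

def average_embedding(tokens, model):
    # Average the vectors of the tokens that are present in the vocabulary
    vectors = [model.wv[token] for token in tokens if token in model.wv.vocab]
    return np.mean(vectors, axis=0)

print(average_embedding(['human', 'interface', 'computer'], w2v_model))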

Can we do better than averaging word embeddings?

Le and Mikolov extended the idea of Word2Vec to develop paragraph-level embeddings so that paragraphs of differing lengths can be represented by fixed-length vectors. In doing so, they presented the paper Distributed Representations of Sentences and Documents (https://arxiv.org/abs/1405.4053), which aimed at building paragraph-level embeddings. Similar to Word2Vec, the idea here is to predict certain words as well. However, in addition to using word representations to predict words, as we did in the Word2Vec model, document representations are used here as well.

These documents are represented using dense vectors, similar to how we represent words. These vectors are called document or paragraph vectors and are trained to predict words in the document. Document vectors are updated similarly to how word vectors are. The paragraph vector is concatenated with multiple word vectors to predict the next word in the context. Similar to Word2Vec, Doc2Vec also falls under the class of unsupervised algorithms since the data used here is unlabeled.

The paper described two ways of building paragraph vectors, as follows:

  • Distributed Memory Model of Paragraph Vectors (PV-DM): This is similar to the continuous bag-of-words approach we discussed regarding Word2Vec. Paragraph vectors are concatenated with the word vectors to predict the target word. Another approach is to use the average of the word and paragraph vectors to predict the target word. How are embeddings built for unseen documents after training? The word vectors learned during training are kept fixed, and a paragraph vector for the unseen document is inferred and added to the document matrix. The following diagram shows how the PV-DM model is trained. In Learn Natural Language Processing (NLP), along with the word vectors of Learn, Natural, and Language, the Document vector is used to predict the next word, Processing. The model is tuned based on how well it predicts the word Processing:

  • Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW): In this approach, word vectors aren't taken into account. Instead, the paragraph vector is used to predict randomly sampled words from the paragraph. Using gradient descent and backpropagation, the paragraph vectors are adjusted based on how well or poorly they predict the sampled words. This approach is analogous to the Skip-gram approach used in Word2Vec.

The following diagram shows the architecture of a PV-DBOW model wherein the paragraph vector gets trained by predicting words in the paragraph itself:

The PV-DBOW model is simpler and more memory-efficient since word vectors don't need to be stored in this approach. The representations learned by the distributed memory model and the distributed bag-of-words model can each be treated as paragraph vectors on their own, or they can be combined to form a single paragraph vector. These learned representations serve as vector representations of documents and can be fed as input to any machine learning model to perform tasks such as document classification or clustering.

Now that we have understood the intuition behind Doc2Vec, let's look at it in action. We will use the Doc2Vec module that was built as part of the Gensim library for our experimentation. Here, we will look at some basic examples to understand the theory we described here. We will use these in conjunction with machine learning algorithms to solve actual problems in Chapter 7, Identifying Patterns in Text Using Machine Learning.

Building a Doc2Vec model

Next, we will look into the step-wise details of building a Doc2Vec model. Let's begin!

  1. We will begin by importing common_texts from gensim. This is a small document corpus. Along with this, we will import the Doc2Vec and TaggedDocument modules since Doc2Vec expects sentences in TaggedDocument format:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
  2. Now, let's check the training corpus:
common_texts

Here's our training corpus:

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
  3. We will now convert the tokenized documents into TaggedDocument format and validate this:
documents = [TaggedDocument(doc, [i]) for i, doc in 
enumerate(common_texts)]
documents

Here is our corpus in the TaggedDocument form:

[TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]),
 TaggedDocument(words=['survey', 'user', 'computer', 'system', 'response', 'time'], tags=[1]),
 TaggedDocument(words=['eps', 'user', 'interface', 'system'], tags=[2]),
 TaggedDocument(words=['system', 'human', 'system', 'eps'], tags=[3]),
 TaggedDocument(words=['user', 'response', 'time'], tags=[4]),
 TaggedDocument(words=['trees'], tags=[5]),
 TaggedDocument(words=['graph', 'trees'], tags=[6]),
 TaggedDocument(words=['graph', 'minors', 'trees'], tags=[7]),
 TaggedDocument(words=['graph', 'minors', 'survey'], tags=[8])]

Here, we have used a simple iterator to act as a tag for the documents. This can be extended to a list of topics and so on. Also, note that Doc2Vec expects a list of tokens as input for each document.
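For instance, a hypothetical topic label could be used as the tag instead of an integer index (the labels below are made up purely for illustration):

# Hypothetical example: tagging documents with topic labels instead of indices
documents_with_topics = [TaggedDocument(['human', 'interface', 'computer'], ['hci']),
                         TaggedDocument(['graph', 'minors', 'survey'], ['graph_theory'])]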

  4. Next, let's build and train a basic Doc2Vec model using the following code:
model = Doc2Vec(documents, vector_size=5, min_count=1, workers=4, epochs = 40)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Here, a vector_size of 5 denotes that each document will be represented by a vector of five floating-point values. The min_count parameter sets a threshold so that only terms that occur at least min_count times will be considered in the vocabulary.

The workers parameter denotes the number of threads to be used while training to speed up the process. Finally, the epochs parameter represents the number of iterations that will be made over the corpus.

  5. Now, we will validate the vector size for the document embeddings:
model.vector_size

Our vectors are of the following size:

5
  6. Let's check whether the number of document vectors being built is equal to the number of documents being used in the training process:
len(model.docvecs)

There are 9 documents in total, as can be seen in the following output block. This is in line with our expectations:

9
  7. Now, we need to check the vocabulary and the vocabulary size of the model we've developed. Let's begin by checking the length of our vocabulary:
len(model.wv.vocab)

Here's our vocabulary size:

12

Now, let's take a look at our vocabulary:

model.wv.vocab

Here's our vocabulary:

{'human': <gensim.models.keyedvectors.Vocab at 0x1275bfa58>,
 'interface': <gensim.models.keyedvectors.Vocab at 0x1275bfa90>,
 'computer': <gensim.models.keyedvectors.Vocab at 0x1275bfac8>,
 'survey': <gensim.models.keyedvectors.Vocab at 0x1275bfb00>,
 'user': <gensim.models.keyedvectors.Vocab at 0x1275bfb38>,
 'system': <gensim.models.keyedvectors.Vocab at 0x1275bfb70>,
 'response': <gensim.models.keyedvectors.Vocab at 0x1275bfba8>,
 'time': <gensim.models.keyedvectors.Vocab at 0x1275bfbe0>,
 'eps': <gensim.models.keyedvectors.Vocab at 0x1275bfc18>,
 'trees': <gensim.models.keyedvectors.Vocab at 0x1275bfc50>,
 'graph': <gensim.models.keyedvectors.Vocab at 0x1275bfc88>,
 'minors': <gensim.models.keyedvectors.Vocab at 0x1275bfcc0>}
  8. Now that we have trained a very basic Doc2Vec model, let's build a document vector for a new sentence:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

Here's our vector for the document specified in the previous code block:

[-0.00837848  0.02508169 -0.07431821 -0.0596405  -0.0423368 ]

Now, let's experiment with the other important parameters. This can be useful for building paragraph vectors.

Changing vector size and min_count

We will begin by building a Doc2Vec model, but this time with vectors of size 50 and the min_count parameter set to 3. We will take a look at these in detail in the upcoming code and output blocks:

  1. First, let's build our Doc2Vec model using the following code block:
model = Doc2Vec(documents, vector_size=50, min_count=3, epochs=40)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
  2. Now that we have built our models, let's do some basic checks in terms of the vocabulary and its size. Let's check the vocabulary size using the following code:
len(model.wv.vocab)

Here's our vocabulary size:

4
  3. Now, let's check the vocabulary:
model.wv.vocab

Here's our vocabulary:

{'user': <gensim.models.keyedvectors.Vocab at 0x1275e5278>,
 'system': <gensim.models.keyedvectors.Vocab at 0x1275e52b0>,
 'trees': <gensim.models.keyedvectors.Vocab at 0x1275e52e8>,
 'graph': <gensim.models.keyedvectors.Vocab at 0x1275e5320>}
  4. Let's build a new paragraph vector using the Doc2Vec model:
vector = model.infer_vector(['user', 'interface', 'for', 'computer'])
print(vector)

Here's our paragraph vector:

[-0.0007554   0.00245294 -0.00753151 -0.00607859 -0.00448105  0.00735318
 -0.00594467 0.00859313  0.00224831  0.00329965 -0.00813412 -0.00946166
 -0.00889105 -0.00073677  0.00183127  0.00870271  0.00402407 -0.00895064
 -0.00469407 -0.00866868  0.00176067 -0.00080887 -0.00720792  0.0097493
  0.00787539  0.00132159  0.00142888  0.00662106  0.00739355 -0.0035373
 -0.004258    0.00317122 -0.00414719  0.0087981   0.00254999  0.0062838
  0.00276298 -0.00396981  0.00029113  0.0015878   0.0088333   0.00634579
 -0.00670296  0.00886645 -0.00246914 -0.00679858 -0.0062902   0.00156767
  0.00728981  0.00063676]

As we can see, the vector size is now 50 and only 4 terms are in the vocabulary. This is because min_count was changed to 3 and, consequently, only terms that occur at least 3 times in the corpus are present in the vocabulary now.

Earlier, we discussed that there are two approaches we can use to build paragraph vectors: the PV-DM and PV-DBOW approaches. Next, we'll check how we can change between them.

The dm parameter for switching between modeling approaches

The value of dm, when set to 1, indicates that the model should be based on the distributed memory approach.

The following code builds a PV-DM model:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Setting dm to 0 builds the Doc2Vec model based on the distributed bag-of-words approach, as shown in the following code block:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
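As mentioned earlier, the representations learned by the PV-DM and PV-DBOW approaches can be combined. A minimal sketch of that idea, reusing the documents we prepared previously, is to concatenate the vectors inferred by the two models (the settings here are illustrative):

import numpy as np

# Train one model of each kind
pv_dm_model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1)
pv_dbow_model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=0)

tokens = ['user', 'interface', 'for', 'computer']
combined_vector = np.concatenate([pv_dm_model.infer_vector(tokens),
                                  pv_dbow_model.infer_vector(tokens)])
print(combined_vector.shape)   # (100,) since each model contributes 50 dimensions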

The distributed memory model takes word vectors into account and comes with two additional parameters, dm_concat and dm_mean. We'll discuss them next.

The dm_concat parameter

The dm_concat parameter is used in the PV-DM approach. Its value, when set to 1, indicates to the algorithm that the context vectors should be concatenated while trying to predict the target word. This, of course, leads to building a larger model since multiple word embeddings get concatenated.

Let's see how it can be built in the following code snippet:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05, dm_concat=1)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

What if I don't wish to concatenate the vectors and want a lighter model?

The dm_concat parameter can be set to 0 for that.

However, how do I take into account information related to the context vectors?

Next, we'll look at this in terms of the dm_mean parameter.

The dm_mean parameter

In the previous section, The dm_concat parameter, we saw that context vectors can be concatenated. Here, we will look at the alternatives to concatenation: the context vectors can instead be summed or averaged, and the choice between the two is controlled by the dm_mean parameter.

When the dm_mean parameter is set to 1, the mean of the context word vectors is taken. The sum of the context word vectors is taken into account when dm_mean is set to 0. Let's see the two in action.

Using the code in the following code block, the mean of the context vectors can be taken:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, dm_concat=0, dm_mean=1, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

The following piece of code can be executed to take the sum of the context vectors:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, dm_concat=0, dm_mean=0, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Now, we will see the effect the window size has.

Window size

The window parameter controls the maximum distance between the word under consideration and the word to be predicted, similar to the Word2Vec approach. The following code block illustrates its usage:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=0)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Now, let's explore what the learning rate is and how it can be leveraged.

Learning rate

Most machine learning models come with a learning rate, which we will look at in detail in Chapter 8, From Human Neurons to Artificial Neurons for Understanding Text. For Doc2Vec, the initial learning rate can be specified using the alpha parameter. With the min_alpha parameter, we can specify what value the learning rate should drop to over the course of training. These details have been specified in the following code block:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, window=2, dm=1, alpha=0.3, min_alpha=0.05)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

Apart from these, there are other parameters, including negative for enabling negative sampling similar to Word2Vec, max_vocab_size to limit the vocabulary, and more.
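As a quick illustrative sketch (the parameter values here are arbitrary), these can be passed in the same way as the parameters we have seen so far:

model = Doc2Vec(documents, vector_size=50, min_count=2, epochs=40, dm=1, negative=5, max_vocab_size=10000)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)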

Before we proceed and briefly discuss other algorithms that have been built for developing sentence-level representations, let's discuss a character-based n-gram approach known as fastText, which is used to build word-level embeddings that outperform Word2Vec in most use cases. We will build on the fastText approach later to see how sentence-level embeddings can be built in a similar manner.

Exploring fastText

We discussed and built models based on the Word2Vec approach in Chapter 5, Word Embeddings and Distance Measurements for Text, wherein each word in the vocabulary had a vector representation. Word2Vec relies heavily on the vocabulary it was trained on. Words encountered at inference time that are not present in the vocabulary are simply mapped to an unknown-token representation, and there can be a lot of such unseen words:

Can we do better than this?

In certain languages, sub-words or internal word representations and structures carry important morphological information:

Can we capture this information?

To answer the preceding question: yes, we can, and we will use fastText to capture the information contained in the sub-words:

What is fastText and how does it work?

Bojanowski et al., researchers from Facebook, built on top of the Word2Vec Skip-gram model developed by Mikolov et al., which we discussed in Chapter 5, Word Embeddings and Distance Measurements for Text, by encapsulating each word as a combination of character n-grams. Each of these n-grams has a vector representation. Word representations are actually a result of the summation of their character n-grams:

What are the character n-grams?

Let's see the two- and three-character n-grams for the word language:

la, lan, an, ang, ng, ngu, gu, gua, ua, uag, ag, age, ge
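As a quick sketch (ignoring the word-boundary symbols that the actual fastText implementation also adds around each word), these n-grams can be generated with a few lines of plain Python:

def char_ngrams(word, n_values=(2, 3)):
    # Collect all character n-grams of the requested sizes
    grams = []
    for n in n_values:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams('language'))
# ['la', 'an', 'ng', 'gu', 'ua', 'ag', 'ge', 'lan', 'ang', 'ngu', 'gua', 'uag', 'age']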

fastText leads to parameter sharing among various words that have any overlapping n-grams. We capture their morphological information from sub-words to build an embedding for the word itself. Also, when certain words are missing from the training vocabulary or rarely occur, we can still have a representation for them if their n-grams are present as part of other words.

The authors kept most of the settings similar to the Word2Vec model. They initially trained fastText using a Wikipedia corpus based on 9 different languages. As of March 18, 2020, the fastText GitHub documentation states that fastText models have been built for 157 languages.

Facebook released the fastText library as a standalone implementation that can be directly imported and worked on in Python. Gensim offers its own fastText implementation and has also built a wrapper around Facebook's fastText library. Since we have focused on Gensim for most of our tasks, we will use Gensim's fastText implementation next to build word representations.

We will discuss parameters that are new to fastText as most of them are common to the Word2Vec and Doc2Vec models. We have taken the same common_texts data to explore fastText.

Building a fastText model

In this section, we will look at how to build a fastText model:

  1. We will begin by importing the necessary libraries and dataset using the following code block:
from gensim.models import FastText
from gensim.test.utils import common_texts
  2. Let's instantiate and train a basic FastText model using the following code:
model = FastText(size=5, window=3, min_count=1)

model.build_vocab(sentences=common_texts)
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
  3. Now, let's validate our vocabulary:
model.wv.vocab

Here's our vocabulary:

{'human': <gensim.models.keyedvectors.Vocab at 0x1103db780>,
 'interface': <gensim.models.keyedvectors.Vocab at 0x1103db7f0>,
 'computer': <gensim.models.keyedvectors.Vocab at 0x1274b84a8>,
 'survey': <gensim.models.keyedvectors.Vocab at 0x1274b8710>,
 'user': <gensim.models.keyedvectors.Vocab at 0x1274b8748>,
 'system': <gensim.models.keyedvectors.Vocab at 0x1274b8780>,
 'response': <gensim.models.keyedvectors.Vocab at 0x1274b87b8>,
 'time': <gensim.models.keyedvectors.Vocab at 0x1274b87f0>,
 'eps': <gensim.models.keyedvectors.Vocab at 0x1274b8828>,
 'trees': <gensim.models.keyedvectors.Vocab at 0x1274b8860>,
 'graph': <gensim.models.keyedvectors.Vocab at 0x1274b8898>,
 'minors': <gensim.models.keyedvectors.Vocab at 0x1274b88d0>}

Let's visualize the vector of the word human:

model.wv['human']

Here's the vector of the word human:

array([ 0.03953331, -0.02951075,  0.02039873,  0.00304991, -0.00968183],
      dtype=float32)
The size of the vector is 5, matching the size=5 we specified when building our fastText model.
  4. Now, let's explore the most_similar method, as we did with Word2Vec in Chapter 5, Word Embeddings and Distance Measurements for Text. We will see what the closest vector is to the following vector expression:

vec(computer) + vec(interface) - vec(human)

model.wv.most_similar(positive=['computer', 'interface'], negative=['human'])

Here's the output:

[('system', 0.908109724521637),
 ('eps', 0.886881947517395),
 ('response', 0.6286922097206116),
 ('user', 0.38861846923828125),
 ('minors', 0.24753454327583313),
 ('time', 0.06086184084415436),
 ('survey', -0.0791618824005127),
 ('trees', -0.40337082743644714),
 ('graph', -0.46148836612701416)]
  5. Let's understand the very important min_n and max_n parameters.

Since word representations in fastText are built from character n-grams, the min_n and max_n parameters let us set the minimum and maximum lengths of the character n-grams used to build those representations. In the following code block, we have used a range of 1-gram to 5-grams to build our fastText model:

model = FastText(size=5, window=3, min_count=1, min_n=1, max_n=5)

model.build_vocab(sentences=common_texts)
model.train(sentences=common_texts, total_examples=len(common_texts), epochs=10)
  6. Now, we will try and build a representation of a word that does not occur in our vocabulary. Let's try and fetch the vector for the word rubber:
model.wv['rubber']

Here's the vector for rubber:

array([-0.01671136, -0.01868909, -0.03945312, -0.01389101, -0.0250267 ],
      dtype=float32)
  7. Now, let's use an out-of-vocabulary term in the most_similar function to validate whether it works:
model.wv.most_similar(positive=['computer', 'human'], negative=['rubber'])

Here's the output:

[('time', 0.5615436434745789),
 ('system', 0.4772699475288391),
 ('minors', 0.3850055932998657),
 ('eps', 0.15983597934246063),
 ('user', -0.2565014064311981),
 ('graph', -0.411243200302124),
 ('response', -0.4405473470687866),
 ('trees', -0.6079868078231812),
 ('interface', -0.6381739377975464),
 ('survey', -0.8393087387084961)]
  8. Now, we will try and extend our model so that it incorporates new sentences and vocabulary. This can be done using the following code snippet:
sentences_to_be_added = [["I", "am", "learning", "Natural", "Language", "Processing"],
["Natural", "Language", "Processing", "is", "cool"]]

model.build_vocab(sentences_to_be_added, update=True)
model.train(sentences=sentences_to_be_added, total_examples=len(sentences_to_be_added), epochs=10)
Note: The update parameter is set to True.

Checking model.wv.vocab again gives the following output:

{'human': <gensim.models.keyedvectors.Vocab at 0x1103db908>,
 'interface': <gensim.models.keyedvectors.Vocab at 0x1274cbcf8>,
 'computer': <gensim.models.keyedvectors.Vocab at 0x1274cb9e8>,
 'survey': <gensim.models.keyedvectors.Vocab at 0x1274cba20>,
 'user': <gensim.models.keyedvectors.Vocab at 0x1274cba58>,
 'system': <gensim.models.keyedvectors.Vocab at 0x1274cba90>,
 'response': <gensim.models.keyedvectors.Vocab at 0x1274cbac8>,
 'time': <gensim.models.keyedvectors.Vocab at 0x1274cbdd8>,
 'eps': <gensim.models.keyedvectors.Vocab at 0x1274cbcc0>,
 'trees': <gensim.models.keyedvectors.Vocab at 0x1274cbe10>,
 'graph': <gensim.models.keyedvectors.Vocab at 0x1274cbb38>,
 'minors': <gensim.models.keyedvectors.Vocab at 0x1274cbef0>,
 'I': <gensim.models.keyedvectors.Vocab at 0x1274cb320>,
 'am': <gensim.models.keyedvectors.Vocab at 0x1274cb240>,
 'learning': <gensim.models.keyedvectors.Vocab at 0x1274cb2b0>,
 'Natural': <gensim.models.keyedvectors.Vocab at 0x1274cbf28>,
 'Language': <gensim.models.keyedvectors.Vocab at 0x1274cbbe0>,
 'Processing': <gensim.models.keyedvectors.Vocab at 0x1274cb5c0>,
 'is': <gensim.models.keyedvectors.Vocab at 0x1274cb550>,
 'cool': <gensim.models.keyedvectors.Vocab at 0x1274cbc88>}

As we can see, the model was updated to incorporate the new vocabulary terms.

The original fastText research paper extended the Skip-gram approach of Word2Vec, but today, both the Skip-gram and continuous bag-of-words approaches can be used. Pre-trained fastText models across multiple languages are available online and can be directly used or fine-tuned so that we can understand a specific dataset better.
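For example, Gensim can load the pre-trained vectors released by Facebook. The following sketch assumes the English cc.en.300.bin file has already been downloaded from https://fasttext.cc and placed in the working directory (the file is several gigabytes in size):

from gensim.models.fasttext import load_facebook_vectors

# Load pre-trained fastText vectors (path assumed; adjust to your download location)
pretrained_wv = load_facebook_vectors('cc.en.300.bin')
print(pretrained_wv.most_similar('language', topn=3))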

Since it is based on sub-word representations, fastText can be applied to solve a plethora of problems, such as spelling correction, auto-suggestions, and so on. Datasets such as user search queries, chatbot conversations, reviews, and ratings can be used to build fastText models. We can apply these models to enhance the customer experience by providing better suggestions, displaying better products, autocorrecting user input, and so on. In the next section, we'll take a look at the spelling corrector/auto-suggestion use case and build a fastText model for it.

Building a spelling corrector/word suggestion module using fastText

Let's try and build a fastText model based on some comments data that can be obtained from Kaggle's toxic comment classification challenge. This data can be sourced from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. We will take the comments column from the dataset and build a fastText model on top of it. We will also provide some incorrect spellings to the built model and see how well the model does in terms of correcting them. The same code can be extended to the problem statements mentioned in the previous section. We will use the Gensim implementation of fastText for this exercise. Let's begin!

  1. We will start by importing the necessary libraries:
import nltk
import re
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.models import FastText
import io
import collections
  2. Let's read the data into basic data structures using the following code snippet:
words = []
data = []
with io.open('comment_text.txt', 'r') as file:
    for entry in file:
        entry = entry.strip()
        data.append(entry)
        words.extend(entry.split())
  3. Let's fetch some basic information about the data in terms of the most common words in the corpus using the following code snippet:
unique_words = []
unique_words = collections.Counter(words)
unique_words.most_common(10)

Here are our most common terms:

[('the', 445892),
 ('to', 288753),
 ('of', 219279),
 ('and', 207335),
 ('a', 201765),
 ('I', 182618),
 ('is', 164602),
 ('you', 157025),
 ('that', 140495),
 ('in', 130244)]

As we can see, the data is dominated by stopwords. We can apply necessary preprocessing in terms of keeping only alphanumeric data, case-folding, and removing stopwords. We won't lemmatize or stem because we want the model to understand incorrect spellings as well.

  4. Let's preprocess the data using the preprocessing pipeline we built in Chapter 3, Building Your NLP Vocabulary:
data = preprocess(data)

You can learn about the preprocess method in more detail by reading Chapter 3, Building Your NLP Vocabulary, or by viewing the code files.
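For reference, here is a minimal sketch of what such a preprocess function could look like, based on the steps just described (keeping alphanumeric tokens, case-folding, and removing stopwords); the actual Chapter 3 implementation may differ:

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def preprocess(corpus):
    # Keep alphanumeric tokens, lowercase them, and drop stopwords
    cleaned = []
    for document in corpus:
        tokens = re.findall(r'[a-z0-9]+', document.lower())
        cleaned.append(' '.join(token for token in tokens if token not in stop_words))
    return cleaned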

  5. fastText expects the input to be a list of tokenized sentences, so let's modify our data so that it conforms to this format. The following code block does that for us:
preprocessed_data = []
for line in data:
    if line != "":
        preprocessed_data.append(line.split())
  6. Now, we will initialize our fastText model:
model = FastText(size=300, window=3, min_count=1, min_n=1, max_n=5)
  7. Now, let's build our vocabulary and check the size of the built vocabulary. Here, we're building the vocabulary:
model.build_vocab(sentences=preprocessed_data)

Now, let's check the size of our vocabulary:

len(model.wv.vocab)

Here's our vocabulary size:

182228

The size would have been smaller if we had applied stemming or lemmatization.

  8. Let's train our model now:
model.train(sentences=preprocessed_data, total_examples=len(preprocessed_data), epochs=10)
  9. Now, we will check whether our model can actually predict the correct spelling for the incorrect words as part of the top 5 similar suggestions.

Let's see what autocorrect suggestion our model provides for the word eplain:

model.wv.most_similar('eplain', topn=5)

Here's the output:

[('xplain', 0.8792348504066467),
 ('eexplain', 0.8370275497436523),
 ('explain', 0.8350315093994141),
 ('plain', 0.8258184194564819),
 ('reexplain', 0.8141466379165649)]

explain and plain occur in the top 5 most similar words to eplain, which is very positive for us.

Now, let's see the outputs for the term reminder:

model.wv.most_similar('reminder', topn=5)

Here's the output:

[('remainder', 0.9140011668205261),
 ('rejoinder', 0.9139667749404907),
 ('reminde', 0.9069227576255798),
 ('minderbinder', 0.9042780995368958),
 ('reindeer', 0.9034557342529297)]

Even though reminder is a correct word in itself, the model suggests remainder as a potential correct spelling:

How does our model do for the incorrectly spelled term relevnt?

Now, let's check out how the model does for relevnt:

model.wv.most_similar('relevnt', topn=5)

Here are the top 5 suggestions for relevnt:

[('relevant', 0.7919449806213379),
 ('relev', 0.7878341674804688),
 ('relevanmt', 0.7624361515045166),
 ('releant', 0.7576485276222229),
 ('releve', 0.7547794580459595)]

relevant appears right at the top of the suggestions for relevnt, which is what we wanted:

What suggestions does my model provide for the possibly correctly spelled word, purse?

Next, let's look at how the model does for purse:

model.wv.most_similar('purse', topn=5)

Here are the top 5 suggestions for purse:

[('purpse', 0.9245591163635254),
 ('cpurse', 0.910297691822052),
 ('pursue', 0.8908491134643555),
 ('pure', 0.8890833258628845),
 ('pulse', 0.8745534420013428)]

Again, purse is a correctly spelled word; however, pursue and pulse are valid suggestions provided by the model.

Our fastText model does a good job in terms of suggesting corrections and potential alternatives for input text. This model can further be improved by providing better data where incorrect and correct spellings have been used in the same context across different sentences. An ideal example of such data would be conversations, wherein a lot of short forms and incorrect spellings are typed in by users. Next, we'll learn how document distances can be computed using fastText.

fastText and document distances

Let's use the model we built for spelling correction to check for document distances using the Word Mover's Distance (WMD) algorithm. We will use the same example that we used in the Word2Vec section in Chapter 5, Word Embeddings and Distance Measurements for Text.

Let's get started:

  1. We will start by initializing the sentences that we wish to compute the distances between:
sentence_1 = "Obama speaks to the media in Illinois"
sentence_2 = "President greets the press in Chicago"
sentence_3 = "Apple is my favorite company"
  2. Let's compute the distance between the document pairs using WMD, which we discussed extensively in Chapter 5, Word Embeddings and Distance Measurements for Text.

Let's compute the WMD between sentence_1 and sentence_2 using fastText-based vectors:

word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

Here's the distance between sentence_1 and sentence_2:

16.179816809121103

Now, we can compute the distance between sentence_2 and sentence_3:

word_mover_distance = model.wmdistance(sentence_2, sentence_3)
word_mover_distance

Here's the corresponding distance:

21.01126373005312

As expected, sentences 1 and 2 have a smaller distance compared to the distance between sentences 2 and 3.

The results that we obtained in the spelling correction and distance calculations would be potentially better if pre-trained fastText models were used since those are mostly built on Wikipedia text corpora and are more generalized to understand different data points.

fastText is a very convenient technique for building word representations using character-level features. It outperformed Word2Vec since it incorporated internal word structure information and associated it with morphological features, which are very important in certain languages. It also allows us to represent words not present in the original vocabulary. Now, we will extend our understanding of n-grams and briefly discuss how this can be extended to build embeddings for documents and sentences by using an approach called Sent2Vec. We will also briefly touch upon the Universal Sentence Encoder, which is one of the most recent algorithms that's used to build sentence-level embeddings.

Understanding Sent2Vec and the Universal Sentence Encoder

In the previous sections, we discussed Doc2Vec and fastText extensively. We will build on the concepts we learned about there and try to understand the basic underlying concepts of another algorithm, called Sent2Vec. We will briefly touch on the Universal Sentence Encoder (USE) in the second part of this section.

Sent2Vec

Sent2Vec combines the continuous bag-of-words approach we discussed regarding Word2Vec with the fastText idea of using constituent n-grams to build sentence embeddings.

Pagliardini et al. devised the Sent2Vec approach, wherein contextual word embeddings and target word embeddings are learned by trying to predict the target words based on their context, similar to the C-BOW approach. However, they extended the C-BOW methodology to define sentence embeddings as the average of the context word embeddings present in the sentence, wherein the context word embeddings are not restricted to unigrams but extended to n-grams in each sentence, similar to the fastText approach. The sentence embedding is then represented as the average of these n-gram embeddings. Research has shown that Sent2Vec outperforms Doc2Vec in the majority of the tasks it undertakes and that it is a better representation method for sentences or documents. The Sent2Vec library is an open source implementation of the model that's built on top of fastText and can be used similarly to the Doc2Vec and fastText models, which we have discussed extensively so far.
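As an illustration, the sent2vec package exposes a small API along the following lines; the exact method names and the model path used here are assumptions on our part, so check the library's README before relying on them:

import sent2vec

# Assumed API and hypothetical pre-trained model path
model = sent2vec.Sent2vecModel()
model.load_model('sent2vec_wiki_unigrams.bin')

embedding = model.embed_sentence('natural language processing is fun')
embeddings = model.embed_sentences(['first sentence', 'another sentence'])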

Before we close this chapter, we will briefly look at the Universal Sentence Encoder, a very recent technique that has been open sourced by Google for building sentence- or document-level embeddings.

The Universal Sentence Encoder

The Universal Sentence Encoder (USE) is a model for fetching embeddings at the sentence level. These models are trained using Wikipedia, web news, web question-answer pages, and discussion forums. The pre-trained generalized model can be used for transfer learning directly or can be fine-tuned to a specific task or dataset. The basic building block of USE is an encoder (we will learn about this in Chapter 9, Applying Convolutions to Text). The USE model can be built using the transformers methodology, which will be discussed in Chapter 10, Capturing Temporal Relationships in Text, or it can be built by combining unigram and bigram representations and feeding them to a neural network to output sentence embeddings through a technique known as deep averaging networks. Several models that have been built using USE-based transfer learning have outperformed state-of-the-art results in the recent past. USE can be used similar to TF-IDF, Word2Vec, Doc2Vec, fastText, and so on for fetching sentence-level embeddings.
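As a brief illustration, a pre-trained USE model can be fetched from TensorFlow Hub and used to embed sentences in a few lines; the module URL below points to version 4 of the model and may change over time:

import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder from TensorFlow Hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["How old are you?", "What is your age?"]
embeddings = embed(sentences)
print(embeddings.shape)   # each sentence is mapped to a 512-dimensional vector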

Summary

In this chapter, we began by extending our discussion on Word2Vec, applied a similar thought process to building document-level embedding, and discussed the Doc2Vec algorithm extensively. We followed that up by building word representations using character n-grams from the words themselves, a technique referred to as fastText. The fastText model helped us capture morphological information from sub-word representations. fastText is also flexible as it can provide embeddings for out-of-vocabulary words since embeddings are a result of sub-word representations. After that, we briefly discussed Sent2Vec, which combines the C-BOW and fastText approaches to building sentence-level representations. Finally, we introduced the Universal Sentence Encoder, which can also be used for fetching sentence-level embeddings and is based on complex deep learning architectures, all of which we will read about in the upcoming chapters.

In the next chapter, we will use whatever we have discussed so far in terms of text cleaning, preprocessing, and word and document representations to build models that can solve real-life machine learning tasks.
