Word Embeddings and Distance Measurements for Text

In Chapter 4, Transforming Text into Data Structures, we discussed the bag-of-words and term frequency-inverse document frequency (TF-IDF)-based methods of representing text in the form of numbers. These methods mostly rely on the syntactical aspects of a word, in terms of its presence or absence in a document or across a text corpus. However, information about the neighborhood of a word, in terms of which words come before or after it, wasn't taken into account in the approaches we have discussed so far. The neighborhood of a word carries important information about the context in which the word is used in a sentence. The relationship between a word and its neighborhood tends to define the word's semantics and its overall positioning and presence in a sentence. In this chapter, we will use this idea to build word vectors that try to capture the meaning of a word based on the context it is used in.

The following topics will be covered in this chapter:

  • Understanding word embeddings
  • Demystifying Word2vec
  • Training a Word2vec model
  • Word mover's distance

Technical requirements

The code files for this chapter can be found at the following GitHub link: https://github.com/PacktPublishing/Hands-On-Python-Natural-Language-Processing/tree/master/Chapter05.

Understanding word embeddings

Word embedding is a learned representation of a word wherein each word is represented using a vector in n-dimensional space. Words with similar meanings should have similar representations. These representations can also help in identifying synonyms, antonyms, and various other relationships between words. We mentioned that embeddings can be built to correspond to individual words; however, this idea can be extended to develop embeddings for individual sentences, documents, characters, and so on. Word2vec captures relationships in text; consequently, similar words have similar representations. Let's try to understand what type of semantic information Word2vec can actually encapsulate.

We will look at a few examples to understand what relationships and analogies can be captured by a Word2vec model. A very frequently used example deals with the embeddings of King, Man, Queen, and Woman. Once a Word2vec model is built properly and the embeddings for these words are obtained from it, the following relationship is frequently observed, provided that these words are actually a part of the vocabulary:

vector (Man) – vector (King) + vector (Queen) = vector (Woman)

This equation boils down to the following relationship:

vector (Man) + vector (Queen) = vector (King) + vector (Woman)

The thought process here is that the relationship of Man:King is the same as Woman:Queen. The Word2vec algorithm is able to capture these semantic relationships when it devises an embedding for each of these words.

Let's take one more example, but this time we will relate countries to capitals. If we build vectors for France, Italy, Rome, and Paris using Word2vec, what would be the output of the following equation?

vector (France) + vector (Rome) - vector (Italy) = ??

The output would be vector (Paris).

Similar to the previous example, the analogy here is that the Italy:Rome relationship is the same as the France:Paris relationship.

All of this seems to be magic!

Now, let's try to understand how exactly we capture all of this information. It all boils down to the Word2vec algorithm. Let's look at Word2vec in detail in the next section.

The values or vectors obtained from the simple arithmetic discussed previously are not exactly equal to the actual vector representations of the words, but they are close enough to show that these relationships are indeed captured by the Word2vec methodology.

Demystifying Word2vec

Word2vec targets exactly what John Rupert Firth famously said:

"A word is known by the company it keeps."

It is a model that enables the building of word vectors using contextual information from the neighborhood of a word. The embedding developed for each word is based on the words that appear around it. Word2vec uses a simple, shallow neural network to learn these embeddings. We'll discuss the details of neural networks in depth from Chapter 8, From Human Neurons to Artificial Neurons for Text Understanding, onward.

The Word2vec paper came out in 2013 and was one of the revolutionary contributions in the domain of Natural Language Processing (NLP). It was developed by Tomas Mikolov et al. at Google and was later made open source for the community to use and build on. A link to the paper can be found at https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf.

Before we get into the details of Word2vec, we will try to define what supervised and unsupervised learning are.

Supervised and unsupervised learning

Supervised and unsupervised learning will be covered in detail in Chapter 7, Identifying Patterns in Text Using Machine Learning. To give you a brief heads-up on supervised and unsupervised learning, we'll look at a few examples here:

  • Supervised learning: This includes cases such as breast cancer prediction, where we have labeled data in which each data point belongs either to a person suffering from breast cancer or to someone who is not.
  • Unsupervised learning: In contrast, there are tasks where we need to figure out segments or groups of customers based on their spending patterns. These data points do not have any labels, such as high-spending or low-spending, and the aim is simply to group similar users. Such tasks come under the scope of unsupervised learning.

Now that we have understood the difference between supervised and unsupervised learning, let's find out which category Word2vec belongs to.

Word2vec – supervised or unsupervised?

Word2vec is an unsupervised methodology for building word embeddings. In the Word2vec architecture, an attempt is made to do either of the following:

  • Predict the target word based on the context word
  • Predict the context word based on the target word

Even though words are being predicted, the prediction component or the class attribute itself comes from the text or the corpus. Hence, there is no specific class attribute available, as is the case in a supervised learning scenario. Due to this, Word2vec falls under the class of unsupervised algorithms. All the learning comes from unstructured data in an unsupervised manner.

Pretrained Word2vec

As discussed previously, the Word2vec algorithm tries to capture relationships between words in the text corpus. In this section, we will explore the pretrained implementations available for Word2vec. This will be followed by a deep dive into the Word2vec architecture, where, using that knowledge, we will try to understand how exactly the Word2vec model encapsulates contextual information.

The output of the Word2vec algorithm is a |V| * D matrix, where |V| is the size of the vocabulary we want vector representations for and D is the number of dimensions used to represent each word vector. As you may have guessed, each row in this matrix carries the embedding for an individual word in the vocabulary. The value of D can be changed and played around with depending on several factors, such as the size of the text corpus and the various relationships that need to be captured. Generally, D takes values between 50 and 300 in real-life use cases.

There is a pretrained, publicly available Word2vec model that Google trained on the Google News dataset. It has a vocabulary of 3 million words and phrases, and each vector has 300 dimensions. The model download is around 1.5 GB in size and can be obtained from https://code.google.com/archive/p/word2vec/. Python's gensim library provides various methods to use the pretrained model directly or to fine-tune it. It also allows a Word2vec model to be built from scratch on any provided dataset. We will use this model extensively as part of this chapter.

Exploring the pretrained Word2vec model using gensim

Let's go through a few steps in detail that will help us import, explore, and infer from the pretrained model:

  1. Install the gensim library:
          pip install gensim
        

The preceding statement can be run from the command line.

  2. Import the gensim library and the KeyedVectors component:
import gensim
from gensim.models import KeyedVectors
  3. Load the vectors from the pretrained Word2vec model file:
model=KeyedVectors.load_word2vec_format('/Users/amankedia/Desktop/Sunday/nlp-book/Chapter5/Code/GoogleNews-vectors-negative300.bin', binary=True) 
  4. Validate the size of the pretrained Word2vec vocabulary:
len(model.wv.vocab)

Here is the output:

 3000000

As you can see, the vocabulary size for this model is 3000000.

  5. Explore the size of each Word2vec vector:
model.vector_size

Here is the output:

300

As you can see, each vector is 300-dimensional.

  6. Explore the pretrained Word2vec vocabulary:
model.wv.vocab
You can check the output of the preceding command in the Jupyter notebook (code files) for this book as the output is too large to be displayed here.
  7. Check the most_similar functionality:
model.most_similar('Delhi')

Here is the output:

[('Kolkata', 0.7663769721984863),
 ('Mumbai', 0.7306068539619446),
 ('Lucknow', 0.7277829051017761),
 ('Patna', 0.7159016728401184),
 ('Guwahati', 0.7072612643241882),
 ('Jaipur', 0.6992814540863037),
 ('Hyderabad', 0.6983195543289185),
 ('Ranchi', 0.6962575912475586),
 ('Bhubaneswar', 0.6959235072135925),
 ('Chandigarh', 0.6940240859985352)]

This output shows that the embedding for 'Delhi' is most similar to 'Kolkata'.

  8. Let's validate the king, queen, woman, and man example from earlier. First, we'll check the single closest word:
result = model.most_similar(positive=['man', 'queen'], negative=['king'], topn=1)
print(result)

Here is the output:

[('woman', 0.7609435319900513)]

  9. Now, let's see what the two closest words are:
result = model.most_similar(positive=['man', 'queen'], negative=['king'], topn=2)
print(result)

Here is the output:

[('woman', 0.7609435319900513), ('girl', 0.6139994263648987)]

This output validates the first equation that we saw earlier, where we had vector (man) + vector (queen) – vector (king) = vector (woman). The second closest entity here is girl.

  10. Let's now validate the country and capital example we saw earlier in this chapter:
result = model.most_similar(positive=['France', 'Rome'], negative=['Italy'], topn=1)
print(result)

Here is the output:

[('Paris', 0.7190686464309692)]

The result is Paris, which is consistent with our expected output.

The Word2vec architecture

In the previous section, Pretrained Word2vec, we saw the pretrained Word2vec offering from Google and explored its various associated features. In this section, we will try to understand how Word2vec models are trained and what the architecture for training a Word2vec algorithm is.

As we discussed earlier, Word2vec models can be trained by two approaches, as follows:

  • Predicting the context word using the target word as input, which is referred to as the Skip-gram method
  • Predicting the target word using the context words as input, which is referred to as the Continuous Bag-of-Words (CBOW) method

Here, we will discuss the Skip-gram method in detail, but you can use the ideas from our Skip-gram discussion to build the Word2vec model using the CBOW approach.

The Skip-gram method

The Skip-gram method builds a Word2vec model by trying to predict a context word when a target word is taken as input. These words are present in each other's neighborhoods. Each target-context pair helps in building the embedding of the target word. Let's see how the Skip-gram method works.

How do you define target and context words?

Let's take the following sentence:

All that glitters is not gold

Here, glitters is the target word.

The context words consist of the words appearing in the neighborhood of glitters. We can define something called window_size, which is a configurable parameter that conveys to the model the size of the neighborhood to consider when a word is taken as the target word. For the preceding sentence, let's define a window_size value of 5. With a window size of 5, the model takes in the two words to the left and the two words to the right of the target word as the context words (the target word plus four context words make up the window of five).

In this example, the mapping would be as follows for glitters:

Target/Input Word

Context Word

glitters

All

glitters

that

glitters

is

glitters

not

The expectation is that whenever glitters is provided as input, the model should be able to predict the correct context word. Based on how it does in terms of predicting the correct context word, it learns and, over time, gets better at predicting the right context word.

Now that we understand what the target word and context word are, let's try to generalize our understanding. We will follow a sliding window approach to generate target and context word pairs for a sentence.

Say we have the sentence Let us make an effort to understand natural language processing. Every word in the sentence is taken as the target word in turn, and the words within the window around it become its context words. As before, the window_size value used here is 5.

As an example, let's pick the fourth word, an, as the target word. The words us, make, effort, and to are then the context words for the target word, an. A short sketch of this sliding window approach follows.
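The following is a minimal sketch of how such (target word, context word) pairs could be generated with a sliding window. The function name and the choice of representing the window as two words on each side are illustrative assumptions, not part of the original Word2vec implementation:

def generate_skipgram_pairs(sentence, window_size=5):
    # With a window_size of 5, we look at two words on each side of the target word
    half_window = (window_size - 1) // 2
    tokens = sentence.split()
    pairs = []
    for i, target in enumerate(tokens):
        # Collect the neighbors within the window, skipping the target word itself
        start = max(0, i - half_window)
        end = min(len(tokens), i + half_window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = generate_skipgram_pairs("Let us make an effort to understand natural language processing")
print([pair for pair in pairs if pair[0] == "an"])

For the target word an, this produces the pairs (an, us), (an, make), (an, effort), and (an, to), matching the example above.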

Let's now dive into the various components that are required as part of building the Skip-gram model and attain the functionality to predict correct context words based on the target word.

Exploring the components of a Skip-gram model

Let's now understand and explore the various components that are involved in building a Skip-gram model.

Input vector

The input is a one-hot vector with a size of |V| * 1, where |V| is the size of the vocabulary. Only one entry in this vector is marked 1, which corresponds to the position of the target word. All other entries are marked 0.

For example, let's assume that our vocabulary contains the words the, sun, is, and rising. Assuming the words are indexed in this order, the one-hot vector for each of these words would be as follows:

  • For the, we would have the following:

1 0 0 0

  • For sun, we would have the following:

0 1 0 0

  • For is, we would have the following:

0 0 1 0

  • For rising, we would have the following:

0 0 0 1

The size of each vector is 4, since our vocabulary contains four words. In each of the vectors, only one entry is 1, which corresponds to the index of the word in the vocabulary.
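As a minimal sketch, one-hot vectors for this toy vocabulary could be built as follows (the vocabulary list and helper function are illustrative, not part of any library API):

import numpy as np

vocabulary = ["the", "sun", "is", "rising"]

def one_hot(word, vocabulary):
    # A |V|-dimensional vector with a single 1 at the word's index
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("sun", vocabulary))   # [0. 1. 0. 0.]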

Embedding matrix

The next entry in the Word2vec architecture is the embedding matrix, which has a size of |V| * N, where |V| is the size of the vocabulary and N is the number of dimensions we wish to represent each word vector with.

The embedding matrix can be instantiated with random numbers; however, certain initialization methods, such as Xavier initialization, provide better results than random initialization. You can read more about this at http://cs231n.github.io/neural-networks-2/#init.

A dot product is performed between the embedding matrix and the input vector, which yields an intermediate vector. In effect, when this dot product is performed, the row corresponding to the target word in the embedding matrix is selected and comes out as the intermediate vector, because only that particular word's entry is 1 in the input vector and the rest are 0.

Context matrix

The next matrix in our architecture is called the context matrix, which also has a size of |V| * N, the same dimensionality as the embedding matrix. The dot product of the intermediate vector obtained previously and the context matrix is performed to yield the output vector.

The thinking here is that the target word's embedding obtained as the intermediate vector will be able to activate the context word's entry in the context matrix.

Output vector

The dot product of the intermediate vector and the context matrix yields the output vector, which has a size of |V| * 1, where |V| is the size of the vocabulary. Each entry in this vector has a number that represents the chances of the word corresponding to that index being the context word predicted by the model. The higher the value in a particular position, the higher the model's inclination to predict the word corresponding to that index as the context word.
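To make the shapes concrete, here is a minimal sketch of this forward pass for the toy vocabulary above. The random initialization and the dimension N = 3 are illustrative assumptions:

import numpy as np

V, N = 4, 3                                     # vocabulary size and embedding dimension
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(V, N))      # |V| x N embedding matrix
context_matrix = rng.normal(size=(V, N))        # |V| x N context matrix

input_vector = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot vector for the target word 'sun'
intermediate = input_vector @ embedding_matrix  # selects row 1 of the embedding matrix, shape (N,)
output_vector = context_matrix @ intermediate   # one raw score per vocabulary word, shape (|V|,)
print(output_vector.shape)                      # (4,)

Each entry of output_vector is a raw score for one word in the vocabulary being the context word.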

These entries can take in any real numbers as their values. However, we want normalized values between 0 and 1, and for that, we use something called the softmax function, which is discussed next.

Softmax

The softmax function takes the following form:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Here, z_i is the predicted score for the ith word being the context word, and the sum in the denominator runs over the scores of all the words in the vocabulary.

The softmax function returns normalized probabilities for a set of numbers.

Let's look at an example so as to be able to understand softmax.

Assuming we have seven words, the following array shows z, or the predicted value of each word to be the context word:

z = [2.0, 3.0, 1.0, 4.0, 2.0, 3.0, 2.0]

Now, we want the normalized probabilities such that they sum up to 1.

The normalized probability of the first entry, 2.0, will be as follows:

exp(2.0) / (exp(2.0) + exp(3.0) + exp(1.0) + exp(4.0) + exp(2.0) + exp(3.0) + exp(2.0)) ≈ 0.0618

Let's see how we can achieve this using three lines of code:

import numpy as np
z = [2.0, 3.0, 1.0, 4.0, 2.0, 3.0, 2.0]
np.exp(z) / np.sum(np.exp(z))

The normalized probabilities outputted by our simple code are as follows:

array([0.06175318, 0.16786254, 0.02271772, 0.45629768, 0.06175318,
       0.16786254, 0.06175318])

The first value in our output array gives the normalized probability of the first entry in the input array, z. The same is true for other indices as well. So, the normalized probability of 2.0, in this case, is 0.06175318.

Loss calculation and backpropagation

After the normalized probability of each word being the context word is obtained, it is compared with the actual expected context word, and the loss function, or error in prediction, is calculated. The model produces the predicted vector, which contains the normalized probability of each word in the vocabulary being the context word. The target vector is a one-hot vector, which indicates which word we expect to be the context word. These two vectors are subtracted to compute the error made in predicting the context word when the target word is given as input.

As part of the loss function calculation, we attempt to figure out how close or far the model was from predicting the correct context word. The result of the loss function shows how well or badly the model performed in predicting the context word. The computed error is sent back to the model, where the weights or entries in the embedding and context matrices are adjusted based on how much they were responsible for predicting the context word correctly or incorrectly. This methodology is referred to as backpropagation. You can read more about backpropagation and loss functions at http://cs231n.github.io/optimization-2/ and http://cs231n.github.io/neural-networks-2/#losses, respectively.

Inference

The preceding steps are repeated several times, or for several epochs (which is a configurable parameter), and, at the end of training, the embedding matrix provides the output we need. It is drawn out of the architecture, and each row in this trained matrix contains the word embedding for a word in the vocabulary; the ith row contains the word vector for the ith word.

Taken together, these components and their interactions are what is involved in building a Word2vec model based on the Skip-gram method.
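To tie the pieces together, here is a minimal end-to-end sketch of Skip-gram training with a full softmax, written in plain NumPy. This is purely illustrative: the real Word2vec implementation relies on the optimizations described in the next sections and on highly tuned code, and the function name, learning rate, dimensions, and toy training pairs here are assumptions:

import numpy as np

def train_skipgram(pairs, vocab_size, dim=50, lr=0.05, epochs=200):
    rng = np.random.default_rng(42)
    embedding_matrix = rng.normal(scale=0.1, size=(vocab_size, dim))
    context_matrix = rng.normal(scale=0.1, size=(vocab_size, dim))
    for _ in range(epochs):
        for target, context in pairs:                          # pairs of word indices
            intermediate = embedding_matrix[target]            # embedding lookup (the intermediate vector)
            scores = context_matrix @ intermediate             # one raw score per vocabulary word
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                               # softmax
            error = probs.copy()
            error[context] -= 1.0                              # predicted vector minus one-hot target vector
            # backpropagation: adjust both matrices in proportion to their contribution to the error
            grad_context = np.outer(error, intermediate)
            grad_embedding = context_matrix.T @ error
            context_matrix -= lr * grad_context
            embedding_matrix[target] -= lr * grad_embedding
    return embedding_matrix                                    # each row is a learned word vector

# Example usage with (target index, context index) pairs from a toy corpus
word_vectors = train_skipgram([(0, 1), (1, 0), (1, 2), (2, 1)], vocab_size=3, dim=10)
print(word_vectors.shape)   # (3, 10)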

The CBOW method

The CBOW method works similarly to the Skip-gram method. However, the change here is that the vectors corresponding to the context words are sent in as input and the model tries to predict the target word.

Computational limitations of the methods discussed and how to overcome them

The methods we discussed previously are computationally expensive, since all the weights or entries in the embedding and context matrices are updated for each (target word, context word) or (context word, target word) pair. Mikolov et al. addressed this problem by employing two strategies: subsampling and negative sampling. We will discuss both of them in the following sections.

Subsampling

There are some situations where certain words, such as a, an, and the, don't add much context when they appear in the neighborhood of a target word. Also, these words occur very frequently in any text corpus, so the creators of this method decided to subsample such words, deleting some of their occurrences from the text itself. These occurrences are then neither used as target words, which reduces the training data size, nor do they play a role as context words for other target words. Whether a word is kept or dropped depends on a metric called the sampling rate.

The sampling rate, or the probability of keeping a word, is determined by a formula of the following form (this is the formulation from the original Word2vec paper; the exact expression used in the released implementation differs slightly):

P(keep word_i) = sqrt(t / f(word_i)), capped at 1

Here, f(word_i) is the fraction of the total words in the corpus that are word_i, and t is a small threshold, typically around 10^-5. The rarer a word is, the more likely it is to be kept.
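A minimal sketch of computing this keep probability, assuming the formula above (the toy word counts and the threshold value are illustrative assumptions, not values from any real model):

import math
from collections import Counter

counts = Counter({"the": 5000, "learning": 120, "word2vec": 15})   # toy corpus word counts
total = sum(counts.values())
t = 1e-5                                                           # subsampling threshold (assumed typical value)

for word, count in counts.items():
    fraction = count / total                                       # f(word_i)
    keep_probability = min(1.0, math.sqrt(t / fraction))
    print(word, round(keep_probability, 4))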

Negative sampling

The other methodology, applied to prevent all the weights from being updated, is referred to as negative sampling. As part of negative sampling, a very small subset of negative words, or words that are not expected to appear in the context of a target word, is selected, and only their weights, along with those of the actual context word, are updated. As a result, only a very small fraction of the weights in the matrices is updated, instead of all the weights.

How to select negative samples

The negative samples, or the words whose weights are updated, again depend on the frequency of occurrence of each word relative to the other words in the corpus.

The probability of picking a word as a negative sample is given by the following formula:

P(word_i) = freq(word_i)^(3/4) / Σ_j freq(word_j)^(3/4)

Here, freq(word_i) is the number of times the ith word occurs in the corpus. Raising the counts to the power of 3/4 dampens the dominance of very frequent words in the sampling distribution.
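A minimal sketch of drawing negative samples from such a distribution (the toy words, counts, and number of samples are illustrative assumptions; in practice, the true context word would also be excluded from the draw):

import numpy as np

words = ["the", "learning", "word2vec", "language", "model"]
counts = np.array([5000, 120, 15, 300, 450], dtype=float)     # toy corpus frequencies

probs = counts ** 0.75           # raise the unigram counts to the 3/4 power
probs /= probs.sum()             # renormalize into a probability distribution

rng = np.random.default_rng(0)
negative_samples = rng.choice(words, size=5, p=probs)         # draw 5 negative words for one training pair
print(negative_samples)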

These two methodologies largely help in reducing the computational efforts required to build Word2vec models.

Training a Word2vec model

Now that we know how the pretrained Word2vec model can be leveraged and we have looked at and understood the Word2vec model architecture, let's try to actually train a Word2vec model. We can create a custom implementation for this; however, for the sake of this exercise, we will leverage the functionalities provided by the gensim library.

The gensim library provides a convenient interface for building a Word2vec model. We will start by building a very simple model using the fewest possible parameters and then we will build on it.

Building a basic Word2vec model

Let's build a basic Word2vec model by executing the following steps:

  1. We will start by importing the Word2Vec module from gensim, defining a few sentences as our data, and then building a model using the following code:
from gensim.models import Word2Vec
sentences = [["I", "am", "trying", "to", "understand", "Natural",
              "Language", "Processing"],
             ["Natural", "Language", "Processing", "is", "fun",
              "to", "learn"],
             ["There", "are", "numerous", "use", "cases", "of",
              "Natural", "Language", "Processing"]]
model = Word2Vec(sentences, min_count=1)

We can provide the Word2vec module with a list of tokenized sentences as input, as we have done in the preceding example. We can also provide a text corpus as input using the corpus_file parameter, where the corpus file contains one sentence per line and the words in each sentence are separated by whitespace.

The min_count parameter helps create a custom vocabulary based on the text itself. The value of min_count sets a minimum frequency threshold so that vectors are built only for words that occur at least min_count times in the corpus.

Here, we have used a very small list of custom-built sentences to build the Word2vec model. However, this can be extended to any dataset. In real-life scenarios, the entire dataset is provided as a list of sentences or as a corpus file as a whole.
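As a quick sanity check, we could pull out the vector for a word and ask for its nearest neighbors. This is just a sketch; with such a tiny corpus, the neighbors will be largely noise:

vector = model.wv['Natural']                       # the learned vector for the word 'Natural'
print(model.wv.most_similar('Natural', topn=3))    # its three nearest neighbors in the toy vocabulary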

  2. Let's see what the size of each vector we just built is, using the following one-line code:
model.vector_size

The output for this is as follows:

100

The default vector size in Word2vec is 100; however, this is a configurable parameter and we will look at changing it in the upcoming sections.

  3. Let's find out the size of the vocabulary we built:
len(model.wv.vocab)

Our vocabulary has a size of 17, as shown in the following output:

17

The size of the vocabulary is equal to the number of unique words in the sentences we have defined.

Now that we have built a basic Word2vec model, let's learn how to modify the min_count parameter in the following section.

Modifying the min_count parameter

In order to modify the min_count parameter, we execute the following steps:

  1. The min_count parameter helps restrict the vocabulary so that word vectors are only built for words that occur at least min_count times in the corpus:
model = Word2Vec(sentences, min_count=2)
  2. Let's find out what the vocabulary size is when we set min_count to 2 based on the previous code block:
len(model.wv.vocab)

Here is the output:

4

The vocabulary size is 4 because only four words occur twice or more in our corpus.

  3. Let's see what those words are:
model.wv.vocab

Here is the output:

{'to': <gensim.models.keyedvectors.Vocab at 0x127591a58>,
 'Natural': <gensim.models.keyedvectors.Vocab at 0x127591a90>,
 'Language': <gensim.models.keyedvectors.Vocab at 0x127591ac8>,
 'Processing': <gensim.models.keyedvectors.Vocab at 0x127591b00>}

  4. The dimension for these vectors would still be 100. Let's validate that:
model.vector_size

As expected, the vector size is 100, as we can see in the following output code block:

100

Let's move on and try some more interesting things with the Word2vec parameters.

Playing with the vector size

Higher-dimensional vectors capture more information across dimensions, especially when the corpus and vocabulary are big and the data is highly varied.

Let's try to build a model where each vector is 300-dimensional using the following code block:

model = Word2Vec(sentences, min_count=2, size = 300)

Let's now find out the vector size for the model we just built using the following line of code:

model.vector_size

Here is our vector size:

300

As we can see, each of the four words that occur more than once is now represented using 300 dimensions.

Other important configurable parameters

Apart from min_count and size, some other important parameters are as follows:

  • sg, which uses the Skip-gram approach when set to 1 and the CBOW approach when set to 0
  • negative, which, when greater than 0, indicates that negative sampling should be used, with the integer value signifying the number of negative samples to draw
  • workers, which defines the number of threads to use for training:

model = Word2Vec(sentences, min_count=1, size=300, workers=2, sg=1, negative=1)

Let's find out the vocabulary size and vocabulary for this model:

len(model.wv.vocab)

Our vocabulary size is as follows:

17

Let's check the vocabulary using the following code:

model.wv.vocab

Here's our vocabulary:

{'I': <gensim.models.keyedvectors.Vocab at 0x1275ab5c0>,
 'am': <gensim.models.keyedvectors.Vocab at 0x1275ab588>,
 'trying': <gensim.models.keyedvectors.Vocab at 0x1275ab518>,
 'to': <gensim.models.keyedvectors.Vocab at 0x1275ab4e0>,
 'understand': <gensim.models.keyedvectors.Vocab at 0x1275ab4a8>,
 'Natural': <gensim.models.keyedvectors.Vocab at 0x1275ab438>,
 'Language': <gensim.models.keyedvectors.Vocab at 0x1275ab400>,
 'Processing': <gensim.models.keyedvectors.Vocab at 0x1275ab3c8>,
 'is': <gensim.models.keyedvectors.Vocab at 0x1275ab390>,
 'fun': <gensim.models.keyedvectors.Vocab at 0x1275ab358>,
 'learn': <gensim.models.keyedvectors.Vocab at 0x1275ab2e8>,
 'There': <gensim.models.keyedvectors.Vocab at 0x1275ab208>,
 'are': <gensim.models.keyedvectors.Vocab at 0x1275ab240>,
 'numerous': <gensim.models.keyedvectors.Vocab at 0x1275ab1d0>,
 'use': <gensim.models.keyedvectors.Vocab at 0x127591a20>,
 'cases': <gensim.models.keyedvectors.Vocab at 0x1275919e8

Word2vec models are generally persisted by serializing them to disk as pickle-based files; gensim's save() method can be used for this.
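Here is a minimal sketch of saving and reloading the model we just trained (the filename is an illustrative choice):

model.save("word2vec_toy.model")                      # serialize the whole model to disk
loaded_model = Word2Vec.load("word2vec_toy.model")    # load it back for querying or further training
print(loaded_model.vector_size)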

Limitations of Word2vec

Word2vec is a great tool for capturing semantic information from text, and we have seen how well it captures information. However, the Word2vec model has some limitations. Let's take the following two sentences:

I am eating an apple.
I am using an apple desktop.

apple in the first sentence signifies the fruit, whereas in the second sentence it signifies the company. However, the word vector generated for apple would be the same in both cases. In other words, Word2vec creates a single static embedding for each word during training; not being able to generate an embedding on the fly based on the context of a word's specific usage is a limitation of the Word2vec model.

Word2vec can also capture stereotypical or biased relationships depending on the text corpus it was trained on. These biases can be related to gender, ethnicity, religion, and so on. For example, some patterns that can be observed are as follows:

man : doctor :: woman : nurse
man : computer programmer :: woman : homemaker

This is another limitation of the Word2vec model, but it is highly dependent on the text provided and, as is always said, a model is only as good as the data it was trained on.

Applications of the Word2vec model

Word2vec has large-scale applications. It can be used in search engines and for building classification and clustering models, where sentences can be represented using the embeddings of the words in them. Another very important scenario where Word2vec is used is in capturing document similarity, or how related two or more documents are to each other. These are only some of its use cases, and the internet is filled with other examples of where Word2vec finds its place and is highly relevant.

Word mover’s distance

In the previous section, we discussed how measuring document similarity is one of the major use cases of Word2vec. Think of a problem statement, such as one where we are building an engine that can rank resumes based on their relevance to a job description. Here, we ideally need to figure out the distance between the job description and the set of resumes. The smaller the distance between the resume and the job description, the higher the relevance of the resume to the job description.

One measure we discussed in Chapter 4, Transforming Text into Data Structures, was to use cosine similarity to find how close or far text documents are to one another or how far removed they are from one another. In this section, we will discuss another measure, Word Mover's Distance (WMD), which is more relevant than cosine similarity, especially when we base the distance measure for documents on word embeddings.

Kusner et al. devised the WMD algorithm. They define the dissimilarity between two text documents as the minimum amount of distance that the embedded words of one document need to travel to reach the embedded words of another document.

Let's look at an example that the authors use in their research paper:

Sentence 1: Obama speaks to the media in Illinois.
Sentence 2: President greets the press in Chicago.

Based on the Word2vec model, the embedding for Obama would be very close to President. Similarly, speaks would be pretty close to greets, media would be pretty close to press, and Illinois would map pretty closely to Chicago.

Let's take a look at a third sentence—Apple is my favorite company. Now, this is likely to be more distant to sentence 1 than sentence 2 is. This is because there is not much of a semantic relationship between the words in the first and third sentences.

WMD computes the pairwise Euclidean distances between the embedded words of the two sentences, and it defines the distance between two documents as the minimum cumulative cost, in terms of these Euclidean distances, required to move all the words of the first document onto the words of the second.
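As a minimal sketch of the first ingredient, the following computes the pairwise Euclidean distances between the embedded words of the two sentences. It assumes the pretrained model loaded earlier in the chapter is still in memory; the stop-word list is an illustrative simplification of the preprocessing used in the WMD paper, and WMD itself then solves an optimal transport problem over such a distance matrix:

import numpy as np
from scipy.spatial.distance import cdist

stop_words = {"to", "the", "in"}                               # illustrative stop-word list
tokens_1 = [w for w in "Obama speaks to the media in Illinois".split() if w not in stop_words]
tokens_2 = [w for w in "President greets the press in Chicago".split() if w not in stop_words]

vectors_1 = np.array([model[w] for w in tokens_1])             # Word2vec vectors for sentence 1
vectors_2 = np.array([model[w] for w in tokens_2])             # Word2vec vectors for sentence 2
distances = cdist(vectors_1, vectors_2, metric="euclidean")    # 4 x 4 matrix of word-to-word distances
print(distances.shape)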

Let's see how we implement this using gensim:

  1. We will import the libraries using the following two lines of code:
import gensim
from gensim.models import KeyedVectors
  2. Now, we will load our pretrained model:
model=KeyedVectors.load_word2vec_format('/Users/amankedia/Desktop/Sunday/nlp-book/Chapter 5/Code/GoogleNews-vectors-negative300.bin', binary=True)
  3. Now that we have loaded our model, let's define our data:
sentence_1 = "Obama speaks to the media in Illinois"
sentence_2 = "President greets the press in Chicago"
sentence_3 = "Apple is my favorite company"

We will get into the real action next!

  4. We will now compute the WMD between the sentences from the data we just defined. Let's begin by calculating the WMD between sentence_1 and sentence_2 first:
word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

This is the WMD between sentence_1 and sentence_2:

1.1642040735998236
  5. Now, we will compute the distance between sentence_1 and sentence_3:
word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

The distance between sentence_1 and sentence_3 is given in the following output block:

1.365806580758697

  6. Let's normalize our word embeddings to unit length using the following line of code to get a better-behaved measure of distance:
model.init_sims(replace = True)
  7. Let's now recompute the WMD between the sentences based on the normalized embeddings we created in the previous step. We will again start by calculating the WMD between sentence_1 and sentence_2, this time with normalized embeddings:
word_mover_distance = model.wmdistance(sentence_1, sentence_2)
word_mover_distance

Here's the distance between sentence_1 and sentence_2 using normalized embeddings:

0.4277553083600646
  8. Let's repeat this for sentence_1 and sentence_3:
word_mover_distance = model.wmdistance(sentence_1, sentence_3)
word_mover_distance

The WMD between sentence_1 and sentence_3 based on normalized embeddings is as follows:

0.47793400675650705

As we can see, the distance between sentence 1 and sentence 2 is much smaller than the distance between sentence 1 and sentence 3. This indicates that sentence 2 is much more similar to sentence 1 compared to sentence 3. With this understanding of how WMD works, we are now better equipped to apply it to cases where we need to compute distances between documents based on their Word2vec representations. A simple use case would be to apply this to document clustering, where documents with small WMDs between them are clustered together and documents with larger WMDs are kept further apart.
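As a sketch of that clustering idea, the following builds a matrix of pairwise WMDs and feeds it to scikit-learn's agglomerative clustering as a precomputed distance matrix. The extra document, the cluster count, and the use of scikit-learn are illustrative choices, not something prescribed by the WMD paper (note that newer scikit-learn versions name the affinity parameter metric):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

documents = [sentence_1, sentence_2, sentence_3,
             "The stock market rallied after the earnings report"]   # one extra illustrative document

n = len(documents)
distance_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = model.wmdistance(documents[i], documents[j])              # pairwise WMD
        distance_matrix[i, j] = distance_matrix[j, i] = d

clustering = AgglomerativeClustering(n_clusters=2, affinity="precomputed", linkage="average")
labels = clustering.fit_predict(distance_matrix)
print(labels)   # documents sharing a label have relatively small WMDs between them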

Summary

In this chapter, we expanded on the ideas introduced in Chapter 4, Transforming Text into Data Structures. Instead of using the syntactical aspects of a document, we focused on capturing the semantics of words in a sentence. Properties such as the co-occurrence of words help in understanding the context of a word, and we tried to leverage this to build vector representations of text using the Word2vec algorithm. We explored the pretrained Word2vec model developed by Google and looked at a few relationships that it can capture. We followed this up by learning about the architecture of a Word2vec model. After that, we trained a few Word2vec models from scratch. Limitations of and biases in the Word2vec model were then discussed, followed by a discussion of some of its applications. Finally, we looked at how the WMD algorithm uses word vectors to capture document distances.

In the next chapter, we will take this idea further to build vectors for documents, sentences, and characters.
