Understanding word embedding

The BoW models that we discussed in the earlier section suffer from the problem that they capture no information about a word's meaning or context. As a result, potential relationships, such as contextual closeness, are not captured across collections of words. For example, the approach cannot capture even simple relationships, such as the fact that the words "cars" and "buses" both refer to vehicles that are often discussed in the context of transportation. Word embedding overcomes this limitation of the BoW approach by mapping semantically similar words to similar representations.

Word vectors represent words as multidimensional, continuous floating-point numbers, where semantically similar words are mapped to proximate points in geometric space. For example, the words fruit and leaves would have word vectors similar to that of the word tree, because their meanings are related, whereas the word television would lie quite distant in the geometric space. In other words, words that are used in similar contexts are mapped to a proximate vector space.

A word vector can have n dimensions, where n is chosen by the user creating it (for example, 10, 70, or 500). The dimensions are latent in the sense that it may not be apparent to humans what each of them means in reality. Methods such as Continuous Bag of Words (CBOW) and Skip-Gram enable word embedding algorithms to learn these word vectors from the text provided as training input. The individual numbers in a word vector represent the word's distributed weight across the dimensions. In a general sense, each dimension represents a latent meaning, and the word's numerical weight on that dimension captures how closely the word is associated with that meaning. Thus, the semantics of the word are embedded across the dimensions of the vector.
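As a rough illustration, the following is a minimal sketch of training such vectors with the gensim library, which implements both CBOW and Skip-Gram. The toy corpus, the choice of 50 dimensions, and the other parameter values are assumptions for illustration only (gensim 4.x names the dimensionality parameter vector_size; older releases call it size):

# A minimal sketch of training word vectors with gensim's Word2Vec.
from gensim.models import Word2Vec

# A tiny toy corpus, used purely for illustration.
sentences = [
    ["cars", "and", "buses", "are", "vehicles"],
    ["buses", "carry", "passengers", "on", "roads"],
    ["cars", "drive", "on", "roads"],
]

# sg=0 trains a CBOW model; sg=1 would train a Skip-Gram model instead.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Each word in the vocabulary is now an n-dimensional vector (here n = 50).
print(model.wv["cars"].shape)   # (50,)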

Though word vectors are multidimensional and cannot be visualized directly, it is possible to visualize the learned vectors by projecting them down to two dimensions using a dimensionality reduction technique such as t-SNE. The following diagram displays learned word vectors in a two-dimensional space for country-capital, verb-tense, and gender relationships:

Visualization of word embeddings in a two dimensional space
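A projection of this kind can be sketched with scikit-learn's t-SNE implementation. The snippet below assumes the small gensim model trained in the earlier sketch and is meant only to show the mechanics, not to reproduce the diagram:

# Project learned word vectors down to two dimensions with t-SNE and plot them.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(model.wv.index_to_key)   # the model's vocabulary
vectors = model.wv[words]             # matrix of word vectors, one row per word

# perplexity must be smaller than the number of points being projected.
tsne = TSNE(n_components=2, perplexity=3, random_state=0)
points = tsne.fit_transform(vectors)

plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()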

When we observe the word embedding visualization, we can see that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. Each word in the text can now be represented as a row in a matrix, much as in the BoW approach, but, unlike the BoW approach, this representation captures the relationships between words.

The advantage of representing words as vectors is that they lend themselves to mathematical operations. For example, we can add and subtract vectors. The canonical example shows that, using word vectors, we can determine the following:

king - man + woman = queen

In this example, we subtracted the gender component (man) from the word vector for king and added another gender (woman); the resulting vector (king - man + woman) maps most closely to the word vector for queen.
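To make the arithmetic concrete, the following sketch performs the same operation on raw vectors with NumPy. It assumes pretrained vectors loaded into a gensim KeyedVectors object named wv that contains these words (one way to obtain such vectors is shown at the end of this section):

# The king - man + woman arithmetic on raw vectors.
# Assumes wv is a gensim KeyedVectors object holding pretrained vectors.
import numpy as np

result = wv["king"] - wv["man"] + wv["woman"]

# Cosine similarity between the resulting vector and the vector for "queen".
cosine = np.dot(result, wv["queen"]) / (
    np.linalg.norm(result) * np.linalg.norm(wv["queen"])
)
print(cosine)   # close to 1.0 when the analogy holds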

A few more striking examples of mathematical operations that can be performed on word vectors are shown as follows:

  • Given two words, we can establish the degree of similarity between them:
model.similarity('woman','man')

And the output is as follows:

0.73723527
  • Finding the odd one out from the set of words given as input:
model.doesnt_match("breakfast cereal dinner lunch".split())

The odd one is given as the following output:

'cereal'
  • Derive analogies, for example:
model.most_similar(positive=['woman','king'],negative=['man'],topn=1)

The output is as follows:

queen: 0.508

What all of this means for us is that machines can identify semantically similar words in a sentence. The following diagram is a gag related to word embedding that made me laugh, but it does convey the power of word embedding, which would not be possible with a BoW style of text representation:

A gag demonstrating the power of word embeddings application

There are several techniques that can be used to learn word embedding from text data. Word2vec, GloVe, and fastText are some of the popular techniques. Each of these techniques allows us to either train our own word embedding from the text data we have, or use the readily available pretrained vectors.

Learning our own word embedding requires a lot of training data and can be slow, but it produces an embedding tailored both to the specific text data and to the NLP task at hand.

Pretrained word embedding vectors are trained on large amounts of text data (usually billions of words) from sources such as Wikipedia. These are generally high-quality word embedding vectors made available by companies such as Google or Facebook. We can download these pretrained vector files and consume them to obtain word vectors for the words in the text that we would like to classify or cluster.
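As one possible way to consume such vectors, the following sketch uses gensim's downloader API; glove-wiki-gigaword-100 is one of the pretrained models bundled with that API, and the first call downloads it over the network:

# Load pretrained vectors through gensim's downloader API.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object

# The operations shown earlier in this section work directly on these vectors.
print(wv.similarity("woman", "man"))
print(wv.doesnt_match("breakfast cereal dinner lunch".split()))
print(wv.most_similar(positive=["woman", "king"], negative=["man"], topn=1))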
