Word embeddings

Up until this point, we have used scikit-learn to embed documents (tweets, reviews, URLs, and so on) into a vectorized format by regarding tokens (words, n-grams) as features and documents as containing a certain number of these tokens. For example, if we had 1,583 documents and we told our CountVectorizer to learn the top 1,000 tokens with an ngram_range from one to five, we would end up with a matrix of shape (1583, 1000), where each row represented a single document and the 1,000 columns represented literal n-grams found in the corpus. But how do we push this understanding down to an even lower level? How do we start to teach the machine what words mean in context?
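As a quick sketch of that earlier workflow (assuming docs is a list holding the 1,583 raw text documents), it might look like this:

from sklearn.feature_extraction.text import CountVectorizer

# docs is assumed to be a list of the 1,583 raw text documents
vectorizer = CountVectorizer(max_features=1000, ngram_range=(1, 5))
X = vectorizer.fit_transform(docs)

X.shape

(1583, 1000)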

For example, if we were to ask you the following questions, you might give these answers:

Q: What would you get if we took a king, removed the man aspect of it, and replaced it with a woman?

A: A queen

Q: London is to England as Paris is to ____.

A: France

You, a human, may find these questions simple, but how would a machine figure this out without knowing what the words by themselves mean in context? This is, in fact, one of the greatest challenges that we face in natural language processing (NLP) tasks.

Word embeddings are one approach to helping a machine understand context. A word embedding is a vectorization of a single word in a feature space of n dimensions, where n represents the number of latent characteristics that a word can have. This means that every word in our vocabulary is no longer just a string, but a vector in and of itself. For example, if we extracted n=5 characteristics about each word, then each word in our vocabulary would correspond to a 1 x 5 vector. For instance, we might have the following vectorizations:

import numpy as np

# set some fake word embeddings
king = np.array([.2, -.5, .7, .2, -.9])
man = np.array([-.5, .2, -.2, .3, 0.])
woman = np.array([.7, -.3, .3, .6, .1])

queen = np.array([1.4, -1., 1.2, .5, -.8])

And with these vectorizations, we can tackle the question "What would you get if we took a king, removed the man aspect of it, and replaced it with a woman?" by performing the following operation:

king - man + woman

In code, this would look like:

np.array_equal((king - man + woman), queen)

True
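Note that these toy vectors were chosen so that the arithmetic matches exactly. With embeddings learned from real data, king - man + woman will only land near the vector for queen, so a tolerant comparison such as np.allclose is the safer check:

np.allclose(king - man + woman, queen)

True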

This seems simple, but it does come with a few caveats:

  • Context (in the form of word embeddings) changes from corpus to corpus, as do word meanings. This means that static word embeddings by themselves are not always the most useful
  • Word embeddings are dependent on the corpus that they were learned from

Word embeddings allow us to perform very precise calculations on single words to achieve what we might consider context.
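To make that concrete, one common pattern is to compute the target vector and then look up the closest word in the vocabulary by cosine similarity. The small helper below is only an illustrative sketch over our fake embeddings; most_similar and the vocab dictionary are our own names, not a library API:

from numpy.linalg import norm

vocab = {'king': king, 'man': man, 'woman': woman, 'queen': queen}

def most_similar(target, vocab):
    # rank every word in the vocabulary by cosine similarity to the target vector
    scores = {word: np.dot(vec, target) / (norm(vec) * norm(target))
              for word, vec in vocab.items()}
    return max(scores, key=scores.get)

most_similar(king - man + woman, vocab)

'queen'

With real embeddings, you would typically exclude the query words (king, man, and woman) from the search, since the raw result often sits closest to one of the inputs.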
