Word embedding

Bag of Words models have a few less-than-ideal properties that are worth noting.

The first problem with the Bag of Words models we've previously looked at is that they ignore the context of each word; they don't capture the relationships that exist between the words in a document.

A second, related concern is that the assignment of words to positions in the vector space is somewhat arbitrary, so relationships that exist between two words in the corpus vocabulary may not be captured. For example, a model that has learned to process the word alligator can leverage very little of that learning when it comes across the word crocodile, even though alligators and crocodiles are similar creatures that share many characteristics (bring on the herpetologist hate mail).

Lastly, because the vocabulary of a corpus can be very large and most of its terms don't appear in any given document, BoW models tend to produce very sparse vectors.
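To get a rough feel for this sparsity, consider the following minimal sketch, which uses scikit-learn's CountVectorizer on an invented toy corpus (the documents and the choice of library are assumptions for illustration, not part of the text):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny, made-up corpus; real corpora have vocabularies of tens of
# thousands of terms, which is what makes BoW vectors so sparse.
docs = [
    "the alligator lurked in the swamp",
    "the crocodile basked on the riverbank",
    "stock prices fell sharply on monday",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # returns a SciPy sparse matrix

print(bow.shape)       # (3 documents, one column per vocabulary term)
print(bow.toarray())   # mostly zeros: each document uses few of the terms
```

Even on three short sentences, most entries in each row are zero; with a realistic vocabulary the fraction of zeros grows dramatically.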

Word-embedding models address these problems by learning a vector for each word such that semantically similar words are mapped to (embedded in) nearby points. Additionally, the entire vocabulary is represented in a much smaller vector space than a BoW model would require. This dimensionality reduction leaves us with smaller, denser vectors that capture each word's semantic value.
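As a minimal sketch of what this looks like in practice, the snippet below trains word2vec-style embeddings with gensim (assuming gensim 4.x is installed). The toy corpus and parameter values are purely illustrative; with this little data the learned vectors and similarity scores won't be meaningful, but the shapes and API mirror what you'd do on a real corpus:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus, invented for illustration; real embeddings are
# trained on millions of sentences.
sentences = [
    ["the", "alligator", "lurked", "in", "the", "swamp"],
    ["the", "crocodile", "basked", "on", "the", "riverbank"],
    ["the", "alligator", "and", "the", "crocodile", "are", "reptiles"],
]

# vector_size sets the embedding dimensionality: far smaller than a
# vocabulary-sized BoW vector, and dense rather than sparse.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vec = model.wv["alligator"]  # a dense 50-dimensional vector
print(vec.shape)
print(model.wv.similarity("alligator", "crocodile"))  # cosine similarity
```

Trained on enough text, words that appear in similar contexts (such as alligator and crocodile) end up with high cosine similarity, which is exactly the property BoW representations lack.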

Word-embedding models often provide quite a bit of lift over Bag of Words models in real-world document classification and semantic analysis problems because of this ability to preserve the semantic value of a word relative to the other words in the corpus.
