As in the previous chapter, traditional NLP approaches rely on converting individual words--the tokens we produced via tokenization--into a format that a machine learning algorithm can learn from (in our case, to predict movie sentiment). Doing so required us to convert a single review of N tokens into a fixed-length representation by creating a TF-IDF matrix. In the process, we did two important things behind the scenes (see the sketch after this list):
- Individual words were assigned an integer ID (for example, via hashing): the word friend might be assigned the ID 39,584, while the word bestie might be assigned 99,928,472. Cognitively, we know that friend is very similar to bestie; however, any notion of that similarity is lost when these tokens are converted into integer IDs.
- By converting each token into an integer ID, we consequently lose the context in which the token was used. This is important because, in order to understand the meaning of words--and thereby train a computer to learn that friend and bestie are similar--we need to understand how the two tokens are used (that is, their respective contexts).
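To make these two points concrete, here is a minimal, self-contained sketch using scikit-learn's TfidfVectorizer. It is an illustrative stand-in rather than the exact pipeline from the previous chapter (which may, for instance, have used hashing to assign IDs), but the effect is the same: each review becomes a fixed-length row of a TF-IDF matrix, and each word is mapped to an arbitrary integer ID that carries no notion of similarity.

```python
# Illustrative sketch: tokens -> integer IDs -> fixed-length TF-IDF rows.
# The reviews below are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "my best friend loved this movie",
    "my bestie hated this movie",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews)  # shape: (num_reviews, vocab_size)

# Each word gets an arbitrary integer ID; nothing about these IDs tells us
# that 'friend' and 'bestie' are semantically related.
print(vectorizer.vocabulary_)  # e.g. {'my': ..., 'friend': ..., 'bestie': ..., ...}
print(tfidf_matrix.shape)      # every review is a row of the same fixed length
```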
Given these limitations of traditional NLP techniques in encoding the semantic and syntactic meaning of words, Tomas Mikolov and other researchers explored methods that employ neural networks to encode the meaning of each word as a vector of N numbers (for example, vector bestie = [0.574, 0.821, 0.756, ... , 0.156]). When the vectors are calculated properly, we discover that the vectors for bestie and friend are close together in this space, where closeness is measured by cosine similarity. It turns out that these vector representations (often referred to as word embeddings) give us the ability to capture a richer understanding of text.
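As a toy illustration of that closeness, the following sketch computes cosine similarity between a few made-up, low-dimensional vectors. Real embeddings produced by a trained model such as word2vec typically have hundreds of dimensions, but the calculation is identical.

```python
# Toy example: cosine similarity between hypothetical word vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings (placeholders, not trained values).
friend = np.array([0.57, 0.80, 0.76, 0.16])
bestie = np.array([0.55, 0.83, 0.70, 0.20])
table  = np.array([0.90, -0.10, 0.05, 0.40])

print(cosine_similarity(friend, bestie))  # high: used in similar contexts
print(cosine_similarity(friend, table))   # lower: used in different contexts
```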
Interestingly, word embeddings also give us the ability to capture the same semantics across multiple languages despite differences in written form (for example, Japanese and English). For example, the Japanese word for movie is eiga (映画); it follows that, using word vectors, the two words movie and 映画 should be close in the vector space despite their difference in appearance. Thus, word embeddings allow applications to be language-agnostic--yet another reason why this technology is hugely popular!
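The sketch below illustrates this cross-lingual idea under the assumption that the English and Japanese embeddings live in a single shared (or aligned) vector space; the vectors themselves are hypothetical placeholders, not output from a trained model.

```python
# Toy sketch: translations land near each other in a shared embedding space.
import numpy as np

# Hypothetical aligned embeddings (placeholder values for illustration only).
shared_space = {
    "movie":  np.array([0.61, 0.12, -0.45, 0.88]),
    "映画":   np.array([0.60, 0.15, -0.43, 0.85]),  # eiga, "movie" in Japanese
    "friend": np.array([-0.20, 0.75, 0.33, 0.05]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(shared_space["movie"], shared_space["映画"]))    # close to 1.0
print(cosine(shared_space["movie"], shared_space["friend"]))  # noticeably lower
```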