In the two previous chapters, we applied the bag-of-words model to convert text data into a numerical format. The result was sparse, fixed-length vectors that represent documents in a high-dimensional word space. These vectors allow us to evaluate the similarity of documents and serve as features for training a machine learning algorithm to classify a document's content or rate the sentiment it expresses. However, they ignore the context in which a term is used, so that, for example, a different sentence containing the same words would be encoded by the same vector.
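The loss of context can be made concrete with a minimal sketch (using scikit-learn's `CountVectorizer` as one possible bag-of-words implementation; any other would behave the same way): two sentences with opposite meanings receive identical count vectors because word order is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words but very different meanings
docs = ['the dog bites the man', 'the man bites the dog']

# Build the sparse document-term matrix and densify it for inspection
vectors = CountVectorizer().fit_transform(docs).toarray()

# Both documents map to the same count vector: context is lost
print(vectors[0])
print((vectors[0] == vectors[1]).all())  # True
```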
In this chapter, we will introduce an alternative class of algorithms that use neural networks to learn a vector representation of individual semantic units, such as a word or a paragraph. These vectors are dense rather than sparse, and contain a few hundred real-valued entries rather than tens of thousands of binary or discrete ones. They are called embeddings because they assign each semantic unit a location in a continuous vector space.
Embeddings result from training a model to relate tokens to their context, with the benefit that similar usage implies a similar vector. Moreover, we will see how embeddings encode semantic aspects, such as relationships among words, by means of their relative locations. As a result, they are powerful features for the deep learning models that we will introduce in the following chapters.
More specifically, in this chapter, we will cover the following topics:
- What word embeddings are and how they capture semantic information
- How to use trained word vectors
- Which network architectures are useful for training Word2vec models
- How to train a Word2vec model using Keras, gensim, and TensorFlow
- How to visualize and evaluate the quality of word vectors
- How to train a Word2vec model using SEC filings
- How Doc2vec extends Word2vec