In the two previous chapters, we applied the bag-of-words model to convert text data into a numerical format. The result was sparse, fixed-length vectors that represent documents in a high-dimensional word space. These vectors allow us to evaluate the similarity of documents and serve as features for training a machine learning algorithm to classify a document's content or rate the sentiment it expresses. However, they ignore the context in which a term is used, so that, for example, a different sentence containing the same words would be encoded by the same vector.
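The loss of context can be made concrete with a minimal sketch (using scikit-learn's `CountVectorizer` as one possible bag-of-words implementation; any other would behave the same way): two sentences with opposite meanings receive identical count vectors because word order is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words but very different meanings
docs = ['the dog bites the man', 'the man bites the dog']

# Build the sparse document-term matrix and densify it for inspection
vectors = CountVectorizer().fit_transform(docs).toarray()

# Both documents map to the same count vector: context is lost
print(vectors[0])
print((vectors[0] == vectors[1]).all())  # True
```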
In this chapter, we will introduce an alternative class of algorithms that use neural networks to learn a vector representation of individual semantic units, such as a word or a paragraph. These vectors are dense rather than sparse, and contain a few hundred real-valued entries rather than tens of thousands of binary or discrete ones. They are called embeddings because they assign each semantic unit a location in a continuous vector space.
Embeddings result from training a model to relate tokens to their context, with the benefit that similar usage implies a similar vector. Moreover, we will see how embeddings encode semantic aspects, such as relationships among words, by means of their relative locations. As a result, they are powerful features for the deep learning models that we will introduce in the following chapters.
More specifically, in this chapter, we will cover the following topics:
- What word embeddings are and how they capture semantic information
- How to use trained word vectors
- Which network architectures are useful for training Word2vec models
- How to train a Word2vec model using Keras, gensim, and TensorFlow
- How to visualize and evaluate the quality of word vectors
- How to train a Word2vec model using SEC filings
- How Doc2vec extends Word2vec