Topic Modeling

In the last chapter, we converted unstructured text data into a numerical format using the bag-of-words model. This model abstracts from word order and represents documents as word vectors, where each entry represents the relevance of a token to the document.

The resulting document-term matrix (DTM), which you may also come across in its transposed form as the term-document matrix, is useful for comparing documents to each other or to a query vector based on their token content. It lets us quickly find a needle in a haystack or classify documents accordingly.
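As a minimal sketch of this idea, the following builds a small count-based DTM by hand and ranks documents against a query vector by cosine similarity. The toy documents, the whitespace tokenizer, and the helper names (`tokenize`, `build_dtm`, `vectorize`, `cosine`) are illustrative assumptions rather than the chapter's code, and raw counts stand in for whichever relevance weighting is actually used.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only
docs = [
    "revenue growth beat guidance this quarter",
    "the pizza was great and the service was friendly",
    "management raised revenue guidance for next quarter",
]

def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real pipelines do more)
    return text.lower().split()

def build_dtm(documents):
    # One row per document, one column per vocabulary token, entries are counts
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    index = {tok: j for j, tok in enumerate(vocab)}
    dtm = []
    for doc in documents:
        row = [0] * len(vocab)
        for tok, count in Counter(tokenize(doc)).items():
            row[index[tok]] = count
        dtm.append(row)
    return vocab, dtm

def vectorize(text, vocab):
    # Map a query onto the same vocabulary; unknown tokens are ignored
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab, dtm = build_dtm(docs)
query = vectorize("revenue guidance", vocab)
scores = [cosine(row, query) for row in dtm]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The restaurant review shares no tokens with the query and therefore scores zero, while the two finance-related documents are ranked by how much of their content the query tokens account for.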

However, this document model is both high-dimensional and very sparse. As a result, it does little to summarize the content or get closer to understanding what it is about. In this chapter, we will use unsupervised machine learning in the form of topic modeling to extract hidden themes from documents. These themes can produce detailed insights into a large body of documents in an automated way. They are very useful for understanding the haystack itself and permit the concise tagging of documents based on the degree of association between topics and documents.
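To make the sparsity point concrete, here is a small sketch that measures the share of nonzero cells in a count-based DTM. The four toy documents and the whitespace tokenization are assumptions for illustration; real corpora have far larger vocabularies and correspondingly lower density.

```python
# Hypothetical toy corpus, for illustration only
docs = [
    "earnings call discusses margin pressure",
    "customers praise fast delivery",
    "contract renewal terms and pricing",
    "margin outlook and pricing power",
]

tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

# Dense count matrix: one row per document, one column per vocabulary token
dtm = [[toks.count(tok) for tok in vocab] for toks in tokenized]

n_cells = len(dtm) * len(vocab)
nonzero = sum(1 for row in dtm for value in row if value)
density = nonzero / n_cells

print(f"{len(dtm)} documents x {len(vocab)} tokens, density = {density:.3f}")
```

Even with only four short documents, most cells are zero, and each new document tends to add columns that are empty for nearly every other row.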

Topic models permit the extraction of sophisticated, interpretable text features that can be used in various ways to extract trading signals from large collections of documents. They speed up the review of documents, help identify and cluster similar documents, and can be annotated as a basis for predictive modeling. Applications include identifying key themes in company disclosures, earnings call transcripts, customer reviews, or contracts. These themes can then be annotated using, for example, sentiment analysis, or labeled directly with subsequent asset returns.

More specifically, in this chapter, we will cover these topics:

What topic modeling achieves, why it matters, and how it has evolved
How Latent Semantic Indexing (LSI) reduces the dimensionality of the DTM
How probabilistic Latent Semantic Analysis (pLSA) uses a generative model to extract topics
How Latent Dirichlet Allocation (LDA) refines pLSA and why it is the most popular topic model
How to visualize and evaluate topic modeling results
How to implement LDA using sklearn and gensim
How to apply topic modeling to collections of earnings calls and Yelp business reviews

The code samples for the following sections are in the directory of the GitHub repository for this chapter, and references are listed in the main README file.
