From linear algebra to hierarchical probabilistic models

Initial attempts to improve on the vector space model (developed in the mid-1970s) applied linear algebra to reduce the dimensionality of the document-term matrix. This approach resembles principal component analysis, which we discussed in Chapter 12, Unsupervised Learning. While effective, these models are difficult to evaluate in the absence of a benchmark model.
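To make this concrete, here is a minimal sketch of the idea (an illustration, not from the text; it assumes scikit-learn and uses a toy four-document corpus): it applies truncated SVD, the linear algebra at the heart of LSI, to a TF-IDF document-term matrix to obtain a low-dimensional latent representation of each document:

```python
# A minimal sketch of SVD-based dimensionality reduction on a
# document-term matrix; corpus and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stocks rallied as tech earnings beat expectations",
    "bond yields fell after the central bank statement",
    "tech shares led the stock market higher",
    "the central bank kept interest rates unchanged",
]

# Build the (documents x terms) TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# Reduce the term space to two latent dimensions via truncated SVD
svd = TruncatedSVD(n_components=2, random_state=42)
doc_coords = svd.fit_transform(dtm)   # each document in the latent space

print(doc_coords.round(2))            # per-document latent coordinates
print(svd.explained_variance_ratio_)  # variance captured per component
```

Documents that share vocabulary end up close together in the latent space, but there is no probabilistic model against which to judge how well the components capture the corpus.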

In response, probabilistic models emerged that assume an explicit document generation process and provide algorithms to reverse-engineer this process and recover the underlying topics.

This table highlights key milestones in the model evolution that we will address in more detail in the following sections:

| Model | Year | Description |
| --- | --- | --- |
| Latent Semantic Indexing (LSI) | 1988 | Reduces the word-space dimensionality to capture semantic document-term relationships by applying singular value decomposition (SVD) to the document-term matrix |
| Probabilistic Latent Semantic Analysis (pLSA) | 1999 | Reverse-engineers a generative process in which topics generate words and each document is a mixture of topics |
| Latent Dirichlet Allocation (LDA) | 2003 | Adds a generative process for documents: a three-level hierarchical Bayesian model |
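To contrast the probabilistic approach with the SVD sketch above, the following snippet (again an illustration under the same assumptions: scikit-learn, a toy corpus, and arbitrary hyperparameters) fits an LDA model to raw word counts and prints the most likely words per topic, i.e., the recovered topic-word distributions of the assumed generative process:

```python
# A minimal LDA sketch: recover latent topics from word counts.
# Corpus, topic count, and random seed are illustrative choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks rallied as tech earnings beat expectations",
    "bond yields fell after the central bank statement",
    "tech shares led the stock market higher",
    "the central bank kept interest rates unchanged",
]

# LDA models documents as mixtures over topics and topics as
# distributions over words, so it uses raw counts, not TF-IDF
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = lda.fit_transform(dtm)  # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):  # per-topic word weights
    top = topic.argsort()[::-1][:4]          # four most likely words
    print(f"Topic {i}:", ", ".join(terms[top]))
```

Because the model specifies how documents are generated, its output can be evaluated against that process, for example via the likelihood it assigns to held-out documents.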
