Summary

This chapter was devoted to learning about topic models; after sentiment analysis on movie reviews, this was our second foray into working with real-life text data. This time, our predictive task was classifying the topics of news articles on the web. The primary technique for topic modeling on which we focused was latent Dirichlet allocation (LDA). LDA derives its name from its assumption that the topic and word distributions found inside a document arise from hidden multinomial distributions that are sampled from Dirichlet priors. We saw that the generative process of sampling words and topics from these multinomial distributions mirrors many of our natural intuitions about this domain; however, it notably fails to account for correlations between the various topics that can co-occur inside a document.
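The chapter works in R, but the generative story itself is language-agnostic. The following is a minimal Python sketch of how LDA assumes a corpus arises; the function names, corpus dimensions, and hyperparameter values are all illustrative, not taken from the chapter:

```python
import random

random.seed(0)

def sample_dirichlet(concentration, k):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [random.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs):
    """Draw an index from the discrete distribution given by probs."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

def generate_corpus(n_docs=3, doc_len=8, n_topics=2, vocab_size=5,
                    alpha=0.1, beta=0.1):
    """Generate word-id documents following the LDA generative process."""
    # One word distribution phi_k ~ Dirichlet(beta) per topic.
    phi = [sample_dirichlet(beta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Per-document topic mixture theta ~ Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta)             # pick a topic for this word
            words.append(sample_categorical(phi[z]))  # pick a word from that topic
        docs.append(words)
    return docs

docs = generate_corpus()
```

Note that each topic is drawn independently from theta; this independence is exactly why plain LDA cannot model correlations between co-occurring topics.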

In our experiments with LDA, we saw that there is more than one way to fit an LDA model; in particular, we saw that the method known as Gibbs sampling tends to be more accurate, even though it is often more computationally expensive. In terms of performance, we saw that when the topics in question are quite distinct from each other, such as the topics in the BBC dataset, we obtained very high accuracy in our topic prediction.
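The Gibbs sampling approach to fitting LDA can be sketched compactly. Below is a toy collapsed Gibbs sampler in Python, written as an illustration of the idea rather than the implementation used in the chapter; the function name, the two tiny "documents", and the hyperparameter values are invented for the example:

```python
import random

random.seed(1)

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.1, iters=50):
    """Toy collapsed Gibbs sampler for LDA; docs are lists of word ids."""
    # Count tables: document-topic counts, topic-word counts, topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * vocab_size for _ in range(n_topics)]
    nk = [0] * n_topics
    # Random initial topic assignment for every token.
    z = [[random.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts...
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # ...and resample its topic from the conditional posterior.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) /
                           (nk[t] + vocab_size * beta)
                           for t in range(n_topics)]
                r, cum, k = random.random() * sum(weights), 0.0, n_topics - 1
                for t, wt in enumerate(weights):
                    cum += wt
                    if r < cum:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk

# Two tiny "documents" with disjoint vocabularies.
docs = [[0, 0, 1, 1, 0], [2, 3, 2, 3, 3]]
z, ndk = gibbs_lda(docs, n_topics=2, vocab_size=4)
```

The repeated remove-resample-restore loop over every token is what makes Gibbs sampling accurate but computationally expensive compared with variational methods.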

At the same time, however, when we classified documents with topics that are more similar to each other, such as the different sports documents in the BBCSports dataset, we saw that this posed more of a challenge and our accuracy was not quite as high. In our case, another factor that probably played a role is that both the documents and the available features were much fewer in number than in the BBC dataset. An increasing number of variations on LDA are currently being researched and developed in order to deal with limitations in both performance and training speed.

As an interesting exercise, we also downloaded an archive of tweets and used R commands to create a document-term matrix object, which we then used as the input for creating a word cloud that visualized the words found within the tweets.
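The chapter builds this document-term matrix with R commands; as a language-agnostic illustration of the same data structure, here is a small pure-Python sketch (the tweet texts and the function name are made up for the example):

```python
from collections import Counter

def document_term_matrix(docs):
    """Build a dense document-term matrix from tokenized documents.

    Rows correspond to documents, columns to vocabulary terms,
    and each cell holds the term's count in that document.
    """
    vocab = sorted({w for doc in docs for w in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

# Illustrative stand-ins for downloaded tweets.
tweets = ["topic models are fun", "models of topic data", "fun with data"]
vocab, dtm = document_term_matrix([t.split() for t in tweets])
```

Summing each column of such a matrix gives corpus-wide term frequencies, which is precisely the input a word cloud needs for sizing its words.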

Topic models can be viewed as a form of clustering, and this was our first glimpse into that area. In the next chapter, on recommendation systems, we will delve more deeply into the field of clustering in order to understand how websites such as Amazon are able to make product recommendations by predicting which products a shopper is most likely to be interested in, based on their previous shopping history and the shopping habits of similar shoppers.
