Summary

We now have a basic understanding of how probabilistic topic modeling works, and we have worked with one of the most popular tools for performing this analysis on text: the Gensim implementation of Latent Dirichlet Allocation, or LDA. We learned how to write a simple program to apply LDA modeling to a variety of text samples, some with greater success than others. We learned how the model can be tuned by changing the input variables, such as the number of topics and the number of passes over the data. We also discovered that topic lists can change over time, and that while more data tends to produce a stronger model, it also tends to obscure niche topics that might have been very important for only a moment in time.
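As a quick reminder of what that workflow looks like in code, the following is a minimal sketch of a Gensim LDA run. The sample documents, topic count, and pass count are hypothetical placeholders, not the chapter's actual data or settings.

```python
# A minimal sketch of a Gensim LDA workflow; the documents and parameter
# values below are hypothetical, not the chapter's actual data or settings.
from gensim import corpora, models

# A tiny, already-tokenized collection of documents (placeholder data)
texts = [
    ["topic", "modeling", "finds", "latent", "themes"],
    ["gensim", "implements", "latent", "dirichlet", "allocation"],
    ["more", "passes", "over", "the", "data", "can", "stabilize", "topics"],
]

# Map each token to an integer ID, then convert documents to bag-of-words vectors
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# num_topics and passes are the two input variables discussed in the summary
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Inspect the top words in each discovered topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```

Changing `num_topics` and `passes` and rerunning the model is the experiment we repeated throughout the chapter to see how the resulting topic lists shift.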

In this topic modeling chapter, perhaps even more than in some of the other chapters, our unsupervised learning approach meant that we saw firsthand how dependent our results are on the volume, quality, and uniformity of the data we started with. Producing coherent topics from text is possible, but results will vary wildly depending on the initial documents in the collection. It is fitting, then, that in the next chapter we will turn our attention to using data mining techniques to address issues with the data itself. We will use these techniques to identify data quality problems, including finding and fixing missing values and identifying anomalous data.
