Summary

The project in this chapter was about finding hidden similarities in the newsgroups data, whether in the form of semantic groups, themes, or word clouds. We started with what unsupervised learning does and the typical types of unsupervised learning algorithms. We then introduced clustering and studied a popular clustering algorithm, k-means, in detail. We also talked about tf-idf as a more efficient feature extraction tool for text data. After that, we performed k-means clustering on the newsgroups data and obtained four meaningful clusters. After examining the key terms in each resulting cluster, we moved on to extracting representative terms from the original documents using topic modeling techniques. Two powerful topic modeling approaches, NMF and LDA, were discussed and implemented. Finally, we had some fun interpreting the topics we obtained from both methods.

So far, we have covered all the main categories of unsupervised learning: dimensionality reduction in Chapter 2, Mining the 20 Newsgroups Dataset with Clustering and Topic Modeling Algorithms; clustering in this chapter; and topic modeling, which is also a form of dimensionality reduction. Starting from the next chapter, we will talk about supervised learning, with binary classification as our entry point.
