Exploring the 20 Newsgroups Dataset with Text Analysis Algorithms

We went through a number of fundamental machine learning concepts in the last chapter. We learned them the fun way, through analogies such as studying for exams and designing a driving schedule. As promised, starting from this chapter, as the second step of our learning journey, we will explore several important machine learning algorithms and techniques in detail. Beyond analogies, we will work through and solve real-world examples, which makes our journey more interesting. We start in this chapter with a classic natural language processing problem: newsgroups topic modeling. We will gain hands-on experience in working with text data, especially in how to convert words and phrases into machine-readable values. We will tackle the project in an unsupervised manner, using k-means clustering and non-negative matrix factorization.

We will get into the details of the following topics:

What is NLP and what are its applications?
Touring Python NLP libraries
Natural Language Toolkit and common NLP tasks
The newsgroups data
Getting the data
Thinking about features
Visualizing the data
Data preprocessing: tokenization, stemming, and lemmatization
Clustering and unsupervised learning
k-means clustering
Non-negative matrix factorization
Topic modeling
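
As a small preview of what converting words into machine-readable values can look like, here is a minimal sketch that turns a few sentences into word-count vectors using only the Python standard library. The toy sentences, the whitespace tokenization, and the helper name to_count_vector are assumptions made for illustration; they are not taken from the newsgroups data or from any particular library.

```python
from collections import Counter

# Toy documents, invented purely for illustration (not from the 20 newsgroups data).
docs = [
    "the rocket reached orbit",
    "the team won the game",
    "the rocket launch was delayed",
]

# Tokenize naively: lowercase, then split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Build a sorted vocabulary so that each word gets a fixed column index.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_count_vector(tokens):
    """Hypothetical helper: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Each document becomes a list of integers, one entry per vocabulary word.
vectors = [to_count_vector(tokens) for tokens in tokenized]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each document ends up as a list of numbers of the same length, which is the kind of input that clustering algorithms operate on. Real text needs more careful tokenization and preprocessing than this sketch uses.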
