Exploring the 20 Newsgroups Dataset with Text Analysis Algorithms

We covered a number of fundamental machine learning concepts in the last chapter, learning them the fun way through analogies such as studying for exams and designing a driving schedule. As promised, starting from this chapter, the second step of our learning journey, we will explore several important machine learning algorithms and techniques in detail. Beyond analogies, we will work through and solve real-world examples, which makes our journey more interesting. In this chapter, we start with a classic natural language processing problem, newsgroups topic modeling. We will gain hands-on experience in working with text data, especially in converting words and phrases into machine-readable values. We will tackle the project in an unsupervised learning manner, using clustering algorithms, including k-means clustering and non-negative matrix factorization.

We will cover the following topics in detail:

  • What is NLP and what are its applications?
  • Touring Python NLP libraries
  • Natural Language Toolkit and common NLP tasks
  • The newsgroups data
  • Getting the data
  • Thinking about features
  • Visualizing the data
  • Data preprocessing: tokenization, stemming, and lemmatization
  • Clustering and unsupervised learning
  • k-means clustering
  • Non-negative matrix factorization
  • Topic modeling
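
As a small preview of what converting words into machine-readable values can look like, the following is a minimal sketch of a bag-of-words count representation in plain Python. The three toy sentences, the whitespace tokenization, and the helper name `to_count_vector` are assumptions made for illustration only; they are not taken from the newsgroups data or from the approach developed later in the chapter.

```python
from collections import Counter

# Hypothetical toy documents (not from the 20 newsgroups dataset)
docs = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "rocket engines power the launch",
]

# Simplified tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is the sorted set of all distinct tokens
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def to_count_vector(tokens, vocabulary):
    """Return how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# One numeric vector per document, aligned with the vocabulary order
vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Each document becomes a fixed-length list of integers, which is the kind of numeric input that clustering algorithms such as k-means can operate on.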