Exploring the 20 Newsgroups Dataset with Text Analysis Techniques

We covered a number of fundamental machine learning concepts in the previous chapter, learning them through analogies such as studying for exams and designing a driving schedule. Starting from this chapter, the second step of our learning journey, we will look in detail at several important machine learning algorithms and techniques. Beyond analogies, we will work through real-world examples, which makes the journey more interesting. We will start with a natural language processing (NLP) problem: exploring newsgroups data. We will gain hands-on experience in working with text data, especially how to convert words and phrases into machine-readable values and how to clean up words that carry little meaning. We will also visualize text data by mapping it into a two-dimensional space in an unsupervised learning manner.

We will go into detail for each of the following topics:

  • What NLP is and its applications
  • NLP basics
  • Touring Python NLP libraries
  • Tokenization
  • Part-of-speech tagging
  • Named entity recognition
  • Stemming and lemmatization
  • Getting and exploring the newsgroups data
  • Data visualization using seaborn and matplotlib
  • The bag-of-words (BoW) model and token count vectorization
  • Text preprocessing
  • Stop words removal
  • Dimensionality reduction
  • t-SNE
  • t-SNE for text visualization
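
As a rough preview of what converting words into machine-readable values and cleaning up words with little meaning can look like, here is a minimal sketch in plain Python. It is not the chapter's own code: the `tokenize` and `bag_of_words` helpers, the sample sentences, and the tiny `STOP_WORDS` set are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only;
# real stop-word lists are much longer).
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs, stop_words=STOP_WORDS):
    """Return (vocabulary, count_matrix) for a list of documents.

    Each row of count_matrix holds one document's token counts,
    with columns ordered like the sorted vocabulary.
    """
    # Count the tokens of each document, skipping stop words.
    counts = [
        Counter(t for t in tokenize(doc) if t not in stop_words)
        for doc in docs
    ]
    # The vocabulary is every distinct token seen, in sorted order.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for tokens a document does not contain.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["The cat sat in the hat.", "A dog and a cat."]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each document becomes a row of counts over a shared vocabulary, so texts of different lengths end up as numeric vectors of the same size. Word order is discarded, which is the simplification the bag-of-words name refers to.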