Chapter 6. Analyzing Text Data

In this chapter, we will cover the following recipes:

  • Preprocessing data using tokenization
  • Stemming text data
  • Converting text to its base form using lemmatization
  • Dividing text using chunking
  • Building a bag-of-words model
  • Building a text classifier
  • Identifying the gender
  • Analyzing the sentiment of a sentence
  • Identifying patterns in text using topic modeling

Introduction

Text analysis and natural language processing (NLP) are an integral part of modern artificial intelligence systems. Computers are good at understanding rigidly structured data with limited variety. However, when we deal with unstructured free-form text, things get difficult. Developing NLP applications is challenging because computers have a hard time understanding the underlying concepts. There are also many subtle variations in the way we communicate, such as dialects, context, and slang.

To address this problem, NLP applications are built on machine learning algorithms, which detect patterns in text data so that we can extract insights from it. Artificial intelligence companies make heavy use of NLP and text analysis to deliver relevant results. Some of the most common applications of NLP are search engines, sentiment analysis, topic modeling, part-of-speech tagging, and entity recognition. The goal of NLP is to develop a set of algorithms that let us interact with computers in plain English. If we could achieve this, we wouldn't need programming languages to instruct computers about what they should do.

In this chapter, we will look at a few recipes that focus on text analysis and on how we can extract meaningful information from text data. We will make heavy use of a Python package called Natural Language Toolkit (NLTK). Make sure that you install it before you proceed. You can find the installation steps at http://www.nltk.org/install.html. You also need to install NLTK Data, which contains many corpora and trained models. This is an integral part of text analysis! You can find the installation steps at http://www.nltk.org/data.html.
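Before moving on to the recipes, it can help to confirm that NLTK and its data are actually available. The following is a minimal sketch, not part of the official installation steps. It assumes NLTK was installed with a standard tool such as pip. The helper names nltk_is_installed and ensure_nltk_resource are hypothetical, and the stopwords corpus is used only as an illustrative resource; each recipe may need different data packages.

```python
import importlib.util


def nltk_is_installed():
    """Return True if the nltk package can be found in this environment."""
    return importlib.util.find_spec("nltk") is not None


def ensure_nltk_resource(resource_path, package_id):
    """Return True if an NLTK Data resource is available locally.

    resource_path is the path inside the NLTK Data directory
    (for example "corpora/stopwords"); package_id is the downloader
    identifier (for example "stopwords"). Both values are illustrative
    assumptions, not requirements stated by this chapter.
    """
    if not nltk_is_installed():
        return False

    import nltk

    try:
        # nltk.data.find raises LookupError when the resource is missing.
        nltk.data.find(resource_path)
        return True
    except LookupError:
        # Try to fetch the package; this needs network access.
        return bool(nltk.download(package_id, quiet=True))


if __name__ == "__main__":
    if not nltk_is_installed():
        print("NLTK is not installed; see http://www.nltk.org/install.html")
    elif ensure_nltk_resource("corpora/stopwords", "stopwords"):
        print("NLTK and the stopwords corpus are ready")
    else:
        print("NLTK is installed, but the data download failed; "
              "see http://www.nltk.org/data.html")
```

If the check reports missing data, the NLTK Data page linked above describes alternative ways to install it, including a manual download.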
