Chapter 9. Analyzing Textual Data and Social Media

In the previous chapters, we focused on the analysis of structured data, mostly in tabular format. In practice, a large share of the data available today is plain text. Text analysis includes the analysis of word frequency distributions, pattern recognition, tagging, link and association analysis, sentiment analysis, and visualization. We will analyze text with the Python Natural Language Toolkit (NLTK) library, which comes with a collection of sample texts called corpora. We will also cover a small example of network analysis. The following topics will be discussed in this chapter:

  • Installing NLTK
  • Filtering out stopwords, names, and numbers
  • The bag-of-words model
  • Analyzing word frequencies
  • Naive Bayes classification
  • Sentiment analysis
  • Creating word clouds
  • Social network analysis

Installing NLTK

NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool. Install NLTK with the following command:

$ sudo pip install nltk
$ pip freeze|grep nltk
nltk==2.0.4

As usual, we will check the installation with a new version of the pkg_check.py file. The following import statement is required:

import nltk

If everything works, we should get a result similar to the following:

nltk version 2.0.4
nltk.app DESCRIPTION chartparser: Chart Parser chunkparser: Regular-Expression Chunk Parser collocations: Find collocations in text concordance: Part
nltk.ccg DESCRIPTION For more information see nltk/doc/contrib/ccg/ccg.pdf PACKAGE CONTENTS api chart combinator lexicon DATA BackwardApplication<n
nltk.chat DESCRIPTION A class for simple chatbots. These perform simple pattern matching on sentences typed by users, and respond with automatically g
nltk.chunk DESCRIPTION Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This 
nltk.classify DESCRIPTION Classes and interfaces for labeling tokens with category labels (or "class labels"). Typically, labels are represented with stri
nltk.cluster DESCRIPTION This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar ite
nltk.corpus
nltk.draw DESCRIPTION # Natural Language Toolkit: graphical representations package # # Copyright (C) 2001-2012 NLTK Project # Author: Edward Loper<e
nltk.examples
nltk.inference
nltk.metrics DESCRIPTION Classes and methods for scoring processing modules. PACKAGE CONTENTS agreement association confusionmatrix distance scores segme
nltk.misc DESCRIPTION # Natural Language Toolkit: Miscellaneous modules # # Copyright (C) 2001-2012 NLTK Project # Author: Steven Bird <[email protected]
nltk.model DESCRIPTION # Natural Language Toolkit: Language Models # # Copyright (C) 2001-2012 NLTK Project # Author: Steven Bird <[email protected].
nltk.parse DESCRIPTION Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as "
nltk.sem DESCRIPTION This package contains classes for representing semantic structure in formulas of first-order logic and for evaluating such formu
nltk.stem DESCRIPTION Interfaces used to remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those 
nltk.tag DESCRIPTION This package contains classes and interfaces for part-of-speech tagging, or simply "tagging". A "tag" is a case-sensitive string
nltk.test DESCRIPTION Unit tests for the NLTK modules. These tests are intended to ensure that changes that we make to NLTK's code don't accidentally 
nltk.tokenize DESCRIPTION Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the list of sentences or words i
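
The pkg_check.py script itself is not reproduced in this section. The following is only a rough sketch of how such a check could be written, assuming it reports the package version followed by each subpackage and a truncated description. The function name describe_package and the width parameter are hypothetical, and the summaries here come from module docstrings, so the text will not match the listing above exactly.

```python
import importlib
import pkgutil


def describe_package(package_name, width=60):
    """Return the package version and a short summary per subpackage.

    Hypothetical helper: the name and the `width` parameter are
    illustrative and not taken from the original pkg_check.py.
    """
    package = importlib.import_module(package_name)
    # Not every package defines __version__, so fall back to a placeholder.
    version = getattr(package, "__version__", "unknown")
    rows = []
    prefix = package.__name__ + "."
    for info in pkgutil.iter_modules(package.__path__, prefix):
        if not info.ispkg:
            continue  # list subpackages only, skip plain modules
        try:
            doc = importlib.import_module(info.name).__doc__ or ""
        except Exception:
            doc = ""  # a subpackage may need optional dependencies
        # Collapse whitespace and truncate to keep one line per subpackage.
        summary = " ".join(doc.split())[:width]
        rows.append((info.name, summary))
    return version, rows
```

Calling describe_package("nltk") and printing the returned version and rows should give a listing broadly similar in shape to the one shown, provided NLTK is installed.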

We are not done yet; we still need to download the NLTK corpora. The download is relatively large (about 1.8 GB), but we only have to do it once. Unless you know exactly which corpora you require, it's best to download all the available corpora. Download the corpora from the Python shell as follows:

$ python
>>> import nltk 
>>> nltk.download()

A GUI application should appear, where you can specify a destination and what to download. If you are new to NLTK, it's most convenient to choose the default options and download everything. In this chapter, we will need the stopwords, movie reviews, names, and Gutenberg corpora.
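
If the full download is impractical, or no GUI is available, the four corpora named above can also be requested individually by passing an identifier to nltk.download(). The following is a minimal sketch under that assumption. The helper download_corpora and the constant CHAPTER_CORPORA are hypothetical names introduced here for illustration, and the identifiers are the ones we believe the NLTK downloader uses for these corpora.

```python
# Downloader identifiers for the corpora used in this chapter
# (assumed to match the names shown in the NLTK downloader).
CHAPTER_CORPORA = ["stopwords", "movie_reviews", "names", "gutenberg"]


def download_corpora(identifiers, download=None):
    """Download each corpus and report success per identifier.

    Hypothetical helper: `download` defaults to nltk.download, but any
    callable taking an identifier can be passed in, for example to try
    the function without network access.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tried without NLTK
        download = nltk.download
    results = {}
    for name in identifiers:
        # nltk.download() returns a truthy value on success.
        results[name] = bool(download(name))
    return results
```

Calling download_corpora(CHAPTER_CORPORA) fetches only these four corpora into the default NLTK data directory; corpora that are already present should not be downloaded again.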
