In the previous chapters, we focused on the analysis of structured data, mostly in tabular format. In reality, plain text is the predominant form of data available today. Text analysis draws on word frequency distributions, pattern recognition, tagging, link and association analysis, sentiment analysis, and visualization. We will analyze text with the Python Natural Language Toolkit (NLTK) library. NLTK comes with a collection of sample texts called corpora. A small example of network analysis will also be covered. The following topics will be discussed in this chapter:
NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool. Install NLTK with the following command:
$ sudo pip install nltk
$ pip freeze|grep nltk
nltk==2.0.4
As usual, we will check the installation with a new version of the pkg_check.py file. The following import statement is required:
import nltk
If everything works, we should get a result similar to the following:
nltk version 2.0.4
nltk.app DESCRIPTION chartparser: Chart Parser chunkparser: Regular-Expression Chunk Parser collocations: Find collocations in text concordance: Part
nltk.ccg DESCRIPTION For more information see nltk/doc/contrib/ccg/ccg.pdf PACKAGE CONTENTS api chart combinator lexicon DATA BackwardApplication<n
nltk.chat DESCRIPTION A class for simple chatbots. These perform simple pattern matching on sentences typed by users, and respond with automatically g
nltk.chunk DESCRIPTION Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This
nltk.classify DESCRIPTION Classes and interfaces for labeling tokens with category labels (or "class labels"). Typically, labels are represented with stri
nltk.cluster DESCRIPTION This module contains a number of basic clustering algorithms. Clustering describes the task of discovering groups of similar ite
nltk.corpus
nltk.draw DESCRIPTION # Natural Language Toolkit: graphical representations package # # Copyright (C) 2001-2012 NLTK Project # Author: Edward Loper<e
nltk.examples
nltk.inference
nltk.metrics DESCRIPTION Classes and methods for scoring processing modules. PACKAGE CONTENTS agreement association confusionmatrix distance scores segme
nltk.misc DESCRIPTION # Natural Language Toolkit: Miscellaneous modules # # Copyright (C) 2001-2012 NLTK Project # Author: Steven Bird <[email protected]
nltk.model DESCRIPTION # Natural Language Toolkit: Language Models # # Copyright (C) 2001-2012 NLTK Project # Author: Steven Bird <[email protected].
nltk.parse DESCRIPTION Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as "
nltk.sem DESCRIPTION This package contains classes for representing semantic structure in formulas of first-order logic and for evaluating such formu
nltk.stem DESCRIPTION Interfaces used to remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those
nltk.tag DESCRIPTION This package contains classes and interfaces for part-of-speech tagging, or simply "tagging". A "tag" is a case-sensitive string
nltk.test DESCRIPTION Unit tests for the NLTK modules. These tests are intended to ensure that changes that we make to NLTK's code don't accidentally
nltk.tokenize DESCRIPTION Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the list of sentences or words i
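A listing like the one above can be produced by walking a package's subpackages and printing the start of each docstring. The following is only a minimal sketch of that idea, not the pkg_check.py script itself: the function names summarize, describe_package, and print_report are hypothetical, the 128-character cutoff is an assumption, and the output format will differ from the listing shown.

```python
import importlib
import pkgutil


def summarize(doc, width=128):
    """Collapse runs of whitespace in a docstring and cut it to `width` characters."""
    return " ".join((doc or "").split())[:width]


def describe_package(name, width=128):
    """Return (version, rows), where rows holds (subpackage, summary) pairs."""
    package = importlib.import_module(name)
    # Not every package exposes __version__, so fall back to a placeholder.
    version = getattr(package, "__version__", "unknown")
    rows = []
    # Walk only the direct children of the package and keep the subpackages.
    for _, child, is_pkg in pkgutil.iter_modules(package.__path__, name + "."):
        if not is_pkg:
            continue
        try:
            module = importlib.import_module(child)
        except Exception:
            # Some subpackages need optional dependencies; list them without text.
            rows.append((child, ""))
            continue
        rows.append((child, summarize(module.__doc__, width)))
    return version, rows


def print_report(name):
    """Print the package version followed by one line per subpackage."""
    version, rows = describe_package(name)
    print(name, "version", version)
    for child, summary in rows:
        print(child, summary)
```

With NLTK installed, calling print_report("nltk") should print the version followed by one line per subpackage. Subpackages without a docstring appear with their name only, much like nltk.corpus in the listing.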
We are not done yet; we still need to download the NLTK corpora. The download is relatively large (about 1.8 GB), but we only have to do it once. Unless you know exactly which corpora you require, it's best to download all the available corpora. Download the corpora from the Python shell as follows:
$ python
>>> import nltk
>>> nltk.download()
A GUI application should appear, where you can specify a destination and what to download. If you are new to NLTK, it's most convenient to choose the default options and download everything. In this chapter, we will need the stopwords, movie reviews, names, and Gutenberg corpora.
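If you prefer to skip the GUI and fetch only the corpora used in this chapter, nltk.download() also accepts a corpus identifier. The sketch below assumes the downloader identifiers stopwords, movie_reviews, names, and gutenberg, and relies on nltk.download() returning a true value on success. The helper fetch_corpora and its download parameter are hypothetical names introduced for illustration.

```python
# Identifiers assumed to match the entries shown in the NLTK downloader.
REQUIRED = ["stopwords", "movie_reviews", "names", "gutenberg"]


def fetch_corpora(corpus_ids, download=None):
    """Download each corpus in turn and return the IDs that failed."""
    if download is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        import nltk
        download = nltk.download
    failed = []
    for corpus_id in corpus_ids:
        if not download(corpus_id):
            failed.append(corpus_id)
    return failed
```

Calling fetch_corpora(REQUIRED) downloads the four corpora to the default NLTK data directory and returns an empty list if every download succeeded.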