Preface

What This Book Covers

This book introduces the basic ideas of text mining, a group of techniques that extract useful information from one or more texts. This is a practical book, one that focuses on applications and examples. Although some statistics and mathematics are required, they are kept to a minimum, and what is used is explained.

This book, however, does make one demand: it assumes that you are willing to learn to write simple programs using Perl. This programming language is explicitly designed to work with text. In addition, it is open-source software that is available over the Web for free. That is, you can download the latest full-featured version of Perl right now and install it on as many computers as you want without paying a cent.

Chapters 2 and 3 give the basics of Perl, including a detailed introduction to regular expressions, a text pattern matching methodology used in a variety of programming languages, not just Perl. For each concept there are several examples of how to use it to analyze texts. Initial examples analyze short strings, for example, a few words or a sentence. Later examples use text from a variety of literary works, for example, the short stories of Edgar Allan Poe, Charles Dickens’s A Christmas Carol, Jack London’s The Call of the Wild, and Mary Shelley’s Frankenstein. All the texts used here are in the public domain, so you can download these for free, too. Finally, if you are interested in word games, Perl plus extensive word lists are a great combination, a topic covered in chapter 3.

Chapters 4 through 8 each introduce a core idea used in text mining. For example, chapter 4 explains the basics of probability, and chapter 5 discusses the term-document matrix, which is an important tool from information retrieval.

This book assumes that you want to analyze one or more texts, so the focus is on the practical. All the techniques in this book have immediate applications. Moreover, learning a minimal amount of Perl enables you to modify the code in this book to analyze the texts that interest you.

The level of mathematical knowledge assumed is minimal: you need to know how to count. Mathematics that arises for text applications is explained as needed and is kept to the minimum to do the job at hand. Although most of the techniques used in this book were created by researchers knowledgeable in math, a few basic ideas are all that are needed to read this book.

Although I am a statistician by training, the level of statistical knowledge assumed is also minimal. The core tools of statistics, for example, variability and correlations, are explained. It turns out that a few techniques are applicable in many ways.

The level of prior programming experience assumed is again minimal: Perl is explained from the beginning, and the focus is on working with text. The emphasis is on creating short programs that do a specific task, not general-purpose text mining tools. However, it is assumed that you are willing to put effort into learning Perl. If you have never programmed in any computer language at all, then doing this is a challenge. Nonetheless, the payoff is big if you rise to this challenge.

Finally, all the code, output, and figures in this book are produced with software that is available from the Web at no cost to you, which is also true of all the texts analyzed. Consequently, you can work through all the computer examples with no additional costs.

What Is Text Mining?

The text in text mining refers to written language that has some informational content. For example, newspaper stories, magazine articles, fiction and nonfiction books, manuals, blogs, email, and online articles are all texts. The amount of text that exists today is vast, and it is ever growing.

Although there are numerous techniques and approaches to text mining, the overall goal is simple: to discover new and useful information contained in one or more text documents. In practice, text mining is done by running computer programs that read in documents and process them in a variety of ways. The results are then interpreted by humans.

Text mining combines the expertise of several disciplines: mathematics, statistics, probability, artificial intelligence, information retrieval, and databases, among others. Some of its methods are conceptually simple, for example, concordancing, where all instances of a word are listed in context (like a Bible concordance). There are also sophisticated algorithms such as hidden Markov models (used for identifying parts of speech). This book focuses on the simpler techniques. These are nonetheless useful and practical, and they serve as a good introduction to more advanced text mining books.

This Book’s Approach to Text Mining

This book has three broad themes. First, text mining is built upon counting and text pattern matching. Second, although language is complex, some aspects of it can be studied by considering its simpler properties. Third, combining computer and human strengths is a powerful way to study language. We briefly consider each of these.

First, text pattern matching means identifying a pattern of letters in a document. For example, finding all instances of the word cat requires using a variety of patterns, some of which are shown below.

cat Cat cats Cats cat’s Cat’s cats’ cat, cat. cat!

It also requires rejecting words like catastrophe or scatter, which contain the string cat, but are not otherwise related. Using regular expressions, this can be explained to a computer, which is not daunted by the prospect of searching through millions of words. See section 2.2.1 for further discussion of this example and chapter 2 for text patterns in general.
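To give a taste of what chapter 2 covers, here is a minimal Perl sketch of this idea. The file name sample.txt is just a placeholder for a text of your own, and the pattern shown is only one reasonable way to write it.

#!/usr/bin/perl
# A minimal sketch: count the word cat, allowing capitalization,
# plurals, and possessives, while rejecting catastrophe and scatter.
# The file name sample.txt is a placeholder.
use strict;
use warnings;

open(my $fh, '<', 'sample.txt') or die "Cannot open sample.txt: $!";
my $count = 0;
while (my $line = <$fh>) {
    # \b marks a word boundary, so the cat inside scatter does not match;
    # s?'?s? allows cats, cat's, and cats'.
    $count++ while $line =~ /\b[Cc]ats?'?s?\b/g;
}
close($fh);
print "Found $count matches.\n";

The word boundary \b is what rules out catastrophe and scatter: it requires that the match start and end where a word does.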

It turns out that counting the number of matches to a text pattern occurs again and again in text mining, even in sophisticated techniques. For example, one way to compute the similarity of two text documents is by counting how many times each word appears in both documents. Chapter 5 considers this problem in detail.
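As a preview, here is a rough sketch of such counting: it tallies how often each word appears in whatever files are named on the command line. The crude definition of a word used here, a run of letters and apostrophes, is just for illustration.

#!/usr/bin/perl
# A rough sketch of word counting: tally how often each word appears,
# using a hash. Chapter 5 builds counts like these into measures of
# document similarity.
use strict;
use warnings;

my %freq;
while (my $line = <>) {            # reads the files named on the command line
    $line = lc $line;              # fold case so The and the count together
    # treat any run of letters and apostrophes as a word
    foreach my $word (split /[^a-z']+/, $line) {
        $freq{$word}++ if length $word;
    }
}
foreach my $word (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    print "$word\t$freq{$word}\n";
}

Saved under a name of your choosing, say count.pl, running perl count.pl story.txt prints each word in story.txt with its frequency, most common first.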

Second, while it is true that the complexity of language is immense, some information about language is obtainable by simple techniques. For example, recent language reference books are often checked against large text collections (called corpora). Language patterns have been both discovered and verified by examining how words are used in writing and speech samples. The words big, large, and great are similar in meaning, but the examination of corpora shows that they are not used interchangeably: the sentences “he has big feet,” “she has large feet,” and “she has great insight” sound natural, but “he has big insight” and “she has large insight” are less fluent. In this type of analysis, the computer finds the examples of usage among vast amounts of text, and a human examines these to discover patterns of meaning. See section 6.4.2 for an example.

Third, as noted above, computers follow directions well and are untiring, while humans are experts at using and interpreting language. However, computers have limited understanding of language, and humans have limited endurance. These facts suggest an iterative and collaborative strategy: the results of a program are interpreted by a human who, in turn, decides what further computer analyses are needed, if any. This back-and-forth process is repeated as many times as necessary. It is analogous to exploratory data analysis, which exploits the interplay between computer analyses and human understanding of what the data means.

Why Use Perl?

This section title is really three questions. First, why use Perl as opposed to an existing text mining package? Second, why use Perl as opposed to other programming languages? Third, why use Perl instead of so-called pseudo-code? Here are three answers, respectively.

First, if you have a text mining package that can do everything you want with all the texts that interest you, and if this package works exactly the way you want, and if you believe that your future processing needs will be met by this package, then keep using it. However, it has been my experience that analyzing texts suggests new ideas requiring new analyses, and that the boundaries of any package that does not let the user program are reached too soon. So at the very least, I prefer packages that allow the user to add new features, which requires a programming language. Finally, learning how to use a package also takes time and effort, so why not invest that time in learning a flexible tool like Perl?

Second, Perl is a programming language with built-in text pattern matching (called regular expressions or regexes), and these are easy to use with a variety of commands. It also has a vast number of free add-ons available on the Web, many of which are for text processing. Additionally, there are numerous books, tutorials, and online resources for Perl, so it is easy to find out how to make it do what you want. Finally, you can get on the Web and download full-strength Perl right now, for free: no hidden charges!

Larry Wall created Perl as a text processing language. He also studied linguistics in graduate school, so he is knowledgeable about natural languages, and this knowledge influenced Perl's design. Although many programming languages support text pattern matching, Perl is designed to make this feature easy to use.

Third, many books use pseudo-code, which excels at showing the programming logic. In my experience, this has one big disadvantage. Students without a solid programming background often find it hard to convert pseudo-code to running code. However, once Perl is installed on a computer, accurate typing is all that is required to run a program. In fact, one way to learn programming is by taking existing code and modifying it to see what happens, and this can only be done with examples written in a specific programming language.

Finally, personally, I enjoy using Perl, and it has helped me finish numerous text processing tasks. It is easy to learn a little Perl and then apply it, which leads to learning more, and then trying more complex applications. I use Perl for a text mining class I teach at Central Connecticut State University, and the students generally like the language. Hence, even if you are unfamiliar with it, you are likely to enjoy applying it to analyzing texts.

Organization of This Book

After an overview of this book in chapter 1, chapter 2 covers regular expressions in detail. This methodology is quite powerful and useful, and the time spent learning it pays off in the later chapters. Chapter 3 covers the data structures of Perl. Often a large number of linguistic items are considered all at once, and to work with all of them requires knowing how to use arrays and hashes as well as more complex data structures.

With the basics of Perl in hand, chapter 4 introduces probability. This lays the foundation for the more complex techniques in later chapters, but it also provides an opportunity to study some of the properties of language. For example, the distribution of the letters of the alphabet in a Poe story is analyzed in section 4.2.2.1.
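As a small preview of that kind of analysis, here is a rough sketch that tallies letter frequencies in whatever file is named on the command line; nothing in it is specific to Poe.

#!/usr/bin/perl
# A rough sketch: estimate the distribution of the letters a through z
# in a text by counting them and dividing by the total letter count.
use strict;
use warnings;

my %tally;
while (my $line = <>) {
    foreach my $char (split //, lc $line) {
        $tally{$char}++ if $char =~ /[a-z]/;
    }
}
my $total = 0;
$total += $_ for values %tally;
die "No letters found\n" unless $total;
foreach my $letter ('a' .. 'z') {
    printf "%s %.4f\n", $letter, ($tally{$letter} || 0) / $total;
}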

Chapter 5 introduces the basics of vectors and arrays. These are put to good use as term-document matrices, a fundamental tool of information retrieval. Because it is possible to represent a text as a vector, the similarity of two texts can be measured by the angle between the two vectors representing them.
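As a preview of that idea, here is a minimal sketch using made-up counts of four words in two hypothetical texts; chapter 5 shows how to obtain such vectors from real documents.

#!/usr/bin/perl
# A minimal sketch: measure the similarity of two texts by the angle
# between their word-count vectors. The counts below are invented
# purely for illustration.
use strict;
use warnings;

my @text1 = (10, 4, 0, 7);    # counts of four words in text 1
my @text2 = ( 8, 1, 3, 5);    # counts of the same words in text 2

my ($dot, $len1, $len2) = (0, 0, 0);
for my $i (0 .. $#text1) {
    $dot  += $text1[$i] * $text2[$i];
    $len1 += $text1[$i] ** 2;
    $len2 += $text2[$i] ** 2;
}
my $cosine = $dot / (sqrt($len1) * sqrt($len2));
my $angle  = atan2(sqrt(1 - $cosine**2), $cosine);   # arccosine via atan2
printf "Cosine: %.3f, angle: %.3f radians\n", $cosine, $angle;

The smaller the angle (equivalently, the closer the cosine is to 1), the more similar the two texts are in this sense.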

Corpus linguistics is the study of language using large samples of texts. This field obviously overlaps with text mining, and chapter 6 introduces the fundamental idea of creating a text concordance. This builds on the text pattern matching ability of regular expressions and allows a researcher to compare matches in a variety of ways.
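As a preview, the sketch below prints every match of a target word together with some surrounding context, which is the heart of a concordance. The target word and the amount of context are placeholders to change as you please.

#!/usr/bin/perl
# A rough sketch of a concordance: show each match of a word in context.
# Chapter 6 develops this idea much further.
use strict;
use warnings;

my $target = 'cat';    # the word to look up (a placeholder)
my $width  = 30;       # characters of context on each side

my $text = do { local $/; <> };    # slurp the whole input
$text =~ s/\s+/ /g;                # flatten line breaks for display

while ($text =~ /\b\Q$target\E\b/gi) {
    my $start = pos($text) - length($target) - $width;
    $start = 0 if $start < 0;
    print substr($text, $start, length($target) + 2 * $width), "\n";
}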

Text can be measured in numerous ways, which produces a data set with many variables. Chapter 7 introduces the statistical technique of principal components analysis (PCA), which is one way to reduce a large set of variables to a smaller and, one hopes, easier-to-interpret set. PCA is a popular tool among researchers, and this chapter teaches you the basic idea of how it works.

Given a set of texts, it is often useful to find out if these can be split into groups such that (1) each group has texts that are similar to each other and (2) texts from two different groups are dissimilar. This is called clustering. A related technique is to classify texts into existing categories, which is called classification. These topics are introduced in chapter 8.

Chapter 9 has three shorter sections, each of which discusses an idea that did not fit in one of the other chapters. Each of these is illustrated with an example, and each one has ties to earlier work in this book.

Finally, the first appendix gives an overview of the basics of Perl, while the second appendix lists the R commands used at the end of chapter 5 as well as chapters 7 and 8. R is a statistical software package that is also available for free from the Web. This book uses it for some examples, and references for documentation and tutorials are given so that an interested reader can learn more about it.

ROGER BILISOLY

New Britain, Connecticut

May 2008
