CHAPTER 2

TEXT PATTERNS

2.1 INTRODUCTION

Did you ever remember a certain passage in a book but forgot where it was? With the advent of electronic texts, this unpleasant experience has been replaced by the joy of using a search utility. Computers have limitations, but their ability to do what they are told without tiring is invaluable when it comes to combing through large electronic documents. Many of the more sophisticated techniques later in this book rely on an initial analysis that starts with one or more searches.

Before beginning with text patterns, consider the following question. Since humans are experts at understanding text, and, at present, computers are essentially illiterate, can a procedure as simple as a search really find something unexpected to a human? Yes, it can, and here is an example. Anyone fluent in English knows that “the” precedes its noun, so the following sentence is clearly ungrammatical.

(2.1) Dog the is hungry.

Putting “the” before the noun corrects the problem, so sentence (2.2) is correct.

(2.2) The dog is hungry.

A systematically collected sample of text is called a corpus (its plural form is corpora), and large corpora have been collected to study language. For example, the Cambridge International Corpus has over 800 million words and is used in Cambridge University Press language reference books [26]. Since a book has roughly 500 words on a page, this corresponds to roughly 1.6 million pages of text. In such a corpus, is it possible to find a noun followed by “the”? Our intuition suggests no, but such constructions do occur, and, in fact, they do not seem unusual when read. Try to think of an example before reading the next sentence.

(2.3) Dottie gave the small dog the large bone.

The only place “the” appears adjacent to a noun in sentence (2.3) is after the word “dog.” Once this construction is seen, it is clear how it works: “the small dog” is the indirect object (that is, the recipient of the action of giving), and “the large bone” is the direct object (that is, the object that is given). So it is the direct object’s “the” that happens to follow “dog.”

A new generation of English reference books has been created using corpora. For example, the Longman Dictionary of American English [74] uses the Longman Corpus of Spoken American English as well as the Longman Corpus of Written American English, and the Cambridge Grammar of English [26] is based on the Cambridge International Corpus. One way to study a corpus is to construct a concordance, where examples of a word along with the surrounding text are extracted. This is sometimes called a KWIC concordance, which stands for Key Word In Context. The results are then examined by humans to detect patterns of usage. This technique is useful, so much so that some concordances were made by hand before the age of computers, mostly for important texts such as religious works. We come back to this topic in section 2.5 as well as section 6.4.

This chapter introduces a powerful text pattern matching methodology called regular expressions. These patterns are often complex, which makes them difficult to apply by hand, so we also learn the basics of programming using the computer language Perl. Many programming languages have regular expressions, but Perl’s implementation is both powerful and easy to invoke. This chapter teaches both techniques in parallel, which allows the easy testing of sophisticated text patterns. By the end of this chapter we will know how to create both a concordance and a program that breaks text into its constituent sentences using Perl. Because different types of texts can vary so much in structure, the ability to create one’s own programs enables a researcher to fine-tune a program to the text or texts of interest. Learning how to program can be frustrating, so when you are struggling with some Perl code (and this will happen), remember that there is a concrete payoff.

2.2 REGULAR EXPRESSIONS

A text pattern is called a regular expression, often shortened to regex. We focus on regexes in this section and then learn how to use them in Perl programs starting in section 2.3. The notation we use for the regexes is the same as Perl’s, which makes this transition easier.

2.2.1 First Regex: Finding the Word Cat

Suppose we want to find all the instances of the word cat in a long manuscript. This type of task is ideal for a computer since it never tires, never becomes bored. In Perl, text is found with regexes, and the simplest regex is just a sequence of characters to be found. These are placed between two forward slashes, which denote the beginning and the end of the regex. That is, the forward slashes act as delimiters. So to find instances of cat, the following regex suggests itself.

/cat/

However, this matches all character strings containing the substring “cat,” for example, caterwaul, implicate, or scatter. Clearly a more specific pattern is needed because /cat/ finds many words not of interest, that is, it produces many false positives.

If spaces are added before and after the word cat, then we have / cat /. This certainly removes the false positives already noted; however, a new problem arises. For instance, cat in sentence (2.4) is not found.

(2.4) Sherby looked all over but never found the cat.

At first this might seem mysterious: cat is at the end of the sentence. However, the string “cat.” has a period after the t, not a blank, so / cat / does not match. Normal texts use punctuation marks, which pose no problems to humans, but computers are less insightful and require instructions on how to deal with these.

Since punctuation is the norm, it is useful to have a symbol that stands for a word boundary, that is, a location such that one side of the boundary has an alphanumeric character and the other side does not. In Perl this is denoted \b. Note that this stands for a location between two characters, not a character itself. Now the following regex no longer rejects strings such as “cat.” or “cat,”.

/\bcat\b/

Note that alphanumeric characters are precisely the characters a-z (that is, the letters a through z), A-Z, 0-9, and _ (the underscore). Hence the pattern /\bcat\b/ matches all of the following:

(2.5) “cat.” “cat,” “cat?” “cat’s” “-cat-”

but none of these:

(2.6) “cat0” “9cat.” “cat_” “implicate” “location”

In a typical text, a string such as “cat0” is unlikely to appear, so this regex matches most of the words that are desired. However, /\bcat\b/ does have one last problem. If Cat appears in a text, it does not match because regexes are case sensitive. This is easily solved: just add an i (which stands for case insensitive) after the second forward slash as shown below.

/\bcat\b/i

This regex matches both “cat” and “Cat.” Note that it also matches “cAt,” “cAT,” and so forth.
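As a quick check of the behavior just described, the following minimal sketch tests a few strings against /\bcat\b/ with and without the i modifier. The sub name matches_cat and the sample strings are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text contains "cat" as a whole word.
# A true second argument adds the i modifier (case-insensitive matching).
sub matches_cat {
    my ($text, $ignore_case) = @_;
    my $found = $ignore_case ? $text =~ /\bcat\b/i : $text =~ /\bcat\b/;
    return $found ? 1 : 0;
}

for my $s ("cat.", "-cat-", "cat0", "cat_", "implicate", "Cat") {
    printf "%-10s case-sensitive: %d  case-insensitive: %d\n",
        $s, matches_cat($s, 0), matches_cat($s, 1);
}
```

Only “Cat” should change its result between the two columns, since the word boundaries behave the same way regardless of case.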

In English some types of words are inflected, for example, nouns often have singular and plural forms, and the latter are usually formed by adding the ending -s or -es. However, the pattern /\bcat\b/, thanks to the second \b, cannot match the plural form cats. If both singular and plural forms of this noun are desired, then there are several fixes. First, two separate regexes are possible: /\bcat\b/i and /\bcats\b/i.

Second, these can be combined into a single regex. The vertical line character is the logical operator or, also called alternation. So the following regex finds both forms of cat.

Regular Expression 2.1 A regex that finds the words cat and cats, regardless of case.

/\bcat\b|\bcats\b/i

Other regexes can work here, too. Alternatively, there is a more efficient way to search for the two words cat and cats, but it requires further knowledge of regexes. This is done in regular expression 2.3 in section 2.2.3.

2.2.2 Character Ranges and Finding Telephone Numbers

Initially, searching for the word cat seems simple, but it turns out that the regex that finally works requires a little thought. In particular, punctuation and plural forms must be considered. In general, regexes require fine tuning to the problem at hand. Whatever pattern is searched for, knowledge of the variety of forms this pattern might take is needed. Additionally, there are several ways to represent any particular pattern.

In this section we consider regexes for phone numbers. Again, this seems like a straightforward task, but the details require consideration of several cases. We begin with a brief introduction to telephone numbers (based on personal communications [19]).

For most countries in the world, an international call requires an International Direct Dialing (IDD) prefix, a country code, a city code, then the local number. To call long-distance within a country requires a National Direct Dialing (NDD) prefix, a city code, then a local number. However, the United States uses a different system, so the regexes considered below are not generalizable to most other countries. Moreover, because city and country codes can differ in length, and since different countries use differing ways to write local phone numbers, making a completely general international phone regex would require an enormous amount of work.

In the United States, the country code is 1, usually written +1; the NDD prefix is also 1; and the IDD prefix is 011. So when a person calls long-distance within the United States, the initial 1 is the NDD prefix, not the country code. Instead of a city code, the United States uses area codes (as do Canada and some Caribbean countries) plus the local number. So a typical long-distance phone number is 1-860-555-1212 (this is the information number for area code 860). However, many people write 860-555-1212 or (860) 555-1212 or (860)555-1212 or some other variant like 860.555.1212. Notice that none of these forms is what we really dial. The digits actually pressed are 18605551212, or if calling from a work phone, perhaps 918605551212, where the initial 9 is needed to call outside the company’s phone system. Clearly, phone numbers are written in many ways, and there are more possibilities than discussed above (for instance, extensions, access codes for different long-distance companies, and so forth). So before constructing a regex for phone numbers, some thought on what forms are likely to appear is needed.

Suppose a company wants to test the long-distance phone numbers in a column of a spreadsheet to determine how well they conform to a list of formats. To work with these numbers, we can copy the column into a text file (or flat file), which is easily readable by a Perl program. Note that it is assumed below that each row has exactly one number. The goal is to check which numbers match the following formats: an initial optional 1, the three digits for the area code within parentheses, the next three digits (the exchange), and then the final four digits. In addition, spaces may or may not appear both before and after the area code. These forms are given in table 2.1, where d stands for a digit. Knowing these, below we design a regex to find them.

Table 2.1 Telephone number formats we wish to find with a regex. Here d stands for a digit 0 through 9.

1 (ddd) ddd-dddd

1(ddd)ddd-dddd

1(ddd) ddd-dddd

(ddd) ddd-dddd

(ddd)ddd-dddd

To create the desired regex, we must specify patterns such as three digits in a row. A range of characters is specified by enclosing them in square brackets, so one way to specify a digit is [0123456789], which is abbreviated by [0-9] or \d in Perl.

To specify a range of the number of replications of a character, the symbol {m,n} is used, which means that the character must appear at least m times, and at most n times (so m ≤ n). The symbol {m,m} is abbreviated by {m}. Hence \d{3} or [0-9]{3} or [0123456789]{3,3} specifies a sequence of exactly three digits. Note that {m,} means m or more repetitions. Because some repetitions are common, there are other abbreviations used in regexes, for example, {0,1} is denoted ? and is used below.
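The repetition symbols can be tried out on a small set of strings. The sketch below is only an illustration: the test strings, the helper sub matching, and the use of qr// (which stores a regex in a variable) are assumptions introduced here for convenience.

```perl
use strict;
use warnings;

# Hypothetical test strings: an "a", some number of "b"s, then a "c".
my @strings = ("ac", "abc", "abbc", "abbbc", "abbbbc");

# Return those candidates that match the given regex.
sub matching {
    my ($regex, @candidates) = @_;
    return grep { $_ =~ $regex } @candidates;
}

print "ab{2,3}c: ", join(" ", matching(qr/ab{2,3}c/, @strings)), "\n";  # two or three b's
print "ab{2}c:   ", join(" ", matching(qr/ab{2}c/,   @strings)), "\n";  # exactly two b's
print "ab{2,}c:  ", join(" ", matching(qr/ab{2,}c/,  @strings)), "\n";  # two or more b's
print "ab?c:     ", join(" ", matching(qr/ab?c/,     @strings)), "\n";  # zero or one b
```

Because each pattern names the characters on both sides of the repeated b, the counts are enforced exactly.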

Finally, parentheses are used to identify substrings of strings that match the regex, so they have a special meaning. Hence the following regex is interpreted as a group of three digits, not as three digits in parentheses.

/(\d{3})/

To use characters that have special meaning to regexes, they must be escaped, that is, a backslash needs to precede them. This tells Perl to treat them as literal characters rather than giving them their special meaning. So to detect parentheses, the following works.

/\(\d{3}\)/
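The difference between grouping parentheses and escaped parentheses can be seen by testing both patterns on the same strings. This is a minimal sketch; the sub names and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Unescaped parentheses only group, so this looks for any three digits in a row.
sub has_digit_group { return $_[0] =~ /(\d{3})/ ? 1 : 0 }

# Escaped parentheses are literal, so the digits must sit inside "(" and ")".
sub has_parenthesized_digits { return $_[0] =~ /\(\d{3}\)/ ? 1 : 0 }

for my $s ("860", "(860)") {
    print "$s: group=", has_digit_group($s),
          " literal=", has_parenthesized_digits($s), "\n";
}
```

The string 860 satisfies only the grouping version, while (860) satisfies both.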

Now we have the tools to specify a pattern for the long-distance phone numbers. The regex below finds them, assuming they are in the forms given in table 2.1.

/(1 ?)?\(\d{3}\) ?\d{3}-\d{4}/

This regex is complicated, so let us take it apart to convince ourselves that it is matching what is claimed. First, “1 ?” means either “1 ” or “1”, since ? means zero or one occurrence of the character immediately before it. So (1 ?)? means that the pattern inside the parentheses appears zero or one time. That is, either “1 ” or “1” appears zero or one time. This allows for the presence or absence of the NDD prefix in the phone number. Second, there is the area code in parentheses, which must be escaped to prevent the regex from interpreting these as a group. So the area code is matched by \(\d{3}\). The space between the area code and the exchange is optional, which is denoted by “ ?”, that is, zero or one space. The last seven digits split into groups of three and four separated by a dash, which is denoted by \d{3}-\d{4}.

Unfortunately, this regex matches some unexpected patterns. For instance, it matches (ddd) ddd-ddddd and (ddd) ddd-dddd-ddd. Why is this true? Both these strings contain the substring (ddd) ddd-dddd, which matches the above regex. For example, the pattern (ddd) ddd-ddddd matches by ignoring the last digit. That is, although the pattern -\d{4} matches only if there are four digits in the text after the dash, there are no restrictions on what can come after the fourth digit, so any character is allowed, even more digits. One way to rule this behavior out is by specifying that each number is on its own line.

Fortunately, Perl has special characters to denote the start and end of a line of text. Like the symbol \b, which denotes not a character but the location between two characters, the symbol ^, called a caret, denotes the start of a new line. In a computer, text is actually one long string of characters, and lines of text are created by newline characters, which are the computer analog of the carriage return on an old-fashioned typewriter. So ^ denotes the location such that a newline character precedes it. Similarly, the $ denotes the end of a line of text, or the position such that the character just after it is a newline. Both ^ and $ are called anchors, which are symbols that denote positions, not literal characters. With this discussion in mind, regular expression 2.2 suggests itself.

Regular Expression 2.2 A regex for testing long-distance telephone numbers.

/^(1 ?)?\(\d{3}\) ?\d{3}-\d{4}$/
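To see what the anchors contribute, the phone pattern can be tested with and without them on strings that contain extra digits. The following is a minimal sketch; the sub names and the sample numbers are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# The phone pattern without anchors: it only needs to match somewhere in the string.
sub matches_unanchored { return $_[0] =~ /(1 ?)?\(\d{3}\) ?\d{3}-\d{4}/ ? 1 : 0 }

# Regular expression 2.2: ^ and $ force the pattern to span the whole line.
sub matches_anchored { return $_[0] =~ /^(1 ?)?\(\d{3}\) ?\d{3}-\d{4}$/ ? 1 : 0 }

for my $s ("(860) 555-1212", "(860) 555-12123", "(860) 555-1212-999") {
    printf "%-20s unanchored: %d  anchored: %d\n",
        $s, matches_unanchored($s), matches_anchored($s);
}
```

All three strings satisfy the unanchored pattern, but only the first one satisfies regular expression 2.2.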

Often it is quite hard to find a regex that matches precisely the pattern one wants and no others. However, in practice, one only needs a regex that finds the patterns one wants, and if other patterns can match, but do not appear in the text, it does not matter. If one gets too many false positives, then further fine-tuning is needed.

Finally, note there is a second use of the caret, which occurs inside the square brackets. When used this way, it means the negation of the characters that follow. For example, [^abc] means all characters other than the lowercase versions of a, b, and c. Problem 2.3 gives a few examples (but it assumes knowledge of material later in this chapter).
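The two uses of the caret can be compared side by side. In this minimal sketch the sub names and test strings are hypothetical, chosen only to show negation inside brackets versus anchoring outside them.

```perl
use strict;
use warnings;

# Inside brackets, ^ negates: true if any character other than a, b, or c appears.
sub has_char_other_than_abc { return $_[0] =~ /[^abc]/ ? 1 : 0 }

# Outside brackets, ^ anchors: true if the string starts with a, b, or c.
sub starts_with_abc { return $_[0] =~ /^[abc]/ ? 1 : 0 }

for my $s ("cab", "cabs", "taxi") {
    printf "%-5s other character: %d  starts with a, b, or c: %d\n",
        $s, has_char_other_than_abc($s), starts_with_abc($s);
}
```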

We have seen that although identifying a phone number is straightforward to a human, there are several issues that arise when constructing a regex for it. Moreover, regex 2.2 is complex enough that it might have a mistake. What is needed is a way to test regexes against some text. In the next section we see how to use a simple Perl script to read in a text file line by line, each of which is compared with regex 2.2. To get the most out of this book, download Perl now (go to http://www.perl.org/ [45] and follow their instructions) and try running the programs yourself.

2.2.3 Testing Regexes with Perl

Many computer languages support regexes, so why use Perl? First, Perl makes it easy to read in a text document piece by piece. Second, regexes are well integrated into the language. For example, almost any computer language supports addition in the usual form 3+5 instead of a function call like plus(3,5). In Perl, regexes can be used like the first form, which enables the programmer to employ them throughout the program. Third, it is free. If you have access to the Internet, you can have the complete, full-featured version of Perl right now, on as many computers as you wish. Fourth, there is an active Perl community that has produced numerous sources of help, from Web tutorials to books on how to use it.

Other authors feel the same way. For example, Friedl’s Mastering Regular Expressions [47] covers regexes in general. The later chapters discuss regex implementation in several programming languages. Chapter 2 of that book gives introductory examples of regexes, and of all the programming languages it covers, Friedl uses Perl because it makes it easy to show what regexes can do.

This book focuses on text, not Perl, so if the latter catches your interest, there are numerous books devoted to learning Perl. For example, two introductory texts are Lemay’s Sams Teach Yourself Perl in 21 Days [71] and Schwartz, Phoenix, and Foy’s Learning Perl [109]. Another introductory book that should appeal to readers of this book is Hammond’s Programming for Linguists [51].

To get the most out of this book, however, download Perl to your computer (instructions are at http://www.perl.org/ [45]) and try writing and running the programs that are discussed in the text. To learn how to program requires hands-on experience, and reading about text mining is not nearly as fun as doing it yourself.

For our first Perl program, we write a script that reads in a text file and matches each line to regular expression 2.2 in the previous section. This is one way to test the regex for mistakes. Conceptually, the task is easy. First, open a file for Perl to read. Second, loop through the file line by line. Third, try to match each line with the regex, and fourth, print out the lines that match. This program is an effective regex testing tool, and, fortunately, it is not hard to write.

Program 2.1 performs the above steps. To try this script yourself, type the commands into a file with the suffix .pl, for example, call it test_regex.pl. Perl is case sensitive, so do not change from lower to uppercase or the reverse. Once Perl is installed on your computer, you need to find out how to use your computer’s command line interface, which allows the typing of commands for execution by pressing the enter key. Once you do this, type the statement below on the command line and then press the enter key. The output will appear below it.

perl test_regex.pl

Program 2.1 Perl script for testing regular expression 2.2.
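A script along the lines described in this section (a filehandle FILE, a while loop over its lines, an if statement testing regular expression 2.2, and a bare print) might look like the sketch below. It is a reconstruction from that description rather than the exact listing. So that it runs on its own, an in-memory string stands in for testfile.txt, and the array @matched is extra bookkeeping; both are assumptions.

```perl
use strict;
use warnings;

# Assumption: an in-memory string stands in for testfile.txt so that the
# sketch runs on its own. Each line holds one candidate phone number.
my $sample = "(000) 000-0000\n000-000-0000\n1 (000)000-0000\n(000)000-00000\n";
my @matched;    # extra bookkeeping, not part of the described program

# To read a real file instead, use: open(FILE, "testfile.txt");
open(FILE, "<", \$sample) or die "Cannot open input: $!";

while (<FILE>) {                                  # each line is assigned to $_
    if ( /^(1 ?)?\(\d{3}\) ?\d{3}-\d{4}$/ ) {     # regular expression 2.2
        print;                                    # prints $_, the matching line
        push @matched, $_;
    }
}
close(FILE);
```

With this sample input, only the first and third lines are printed.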

Semicolons mark the end of statements, so it is critical to use them correctly. A programmer can put several statements on one line (each with its own semicolon), or write one statement over several lines. However, it is common to use one statement per line, which is usually the case in this book. Finally, as claimed, the code is quite short, and the only complex part is the regex itself. Let us consider program 2.1 line by line.

First, to read a file, the Perl program needs to know where the file is located. Program 2.1 looks in the current working directory, which is the directory where the program itself is stored if it is run from there. If the file “testfile.txt” were in another directory, the full path name would be required, for example, “c:/dirname/testfile.txt”. The open statement is a function that acts on two values, called arguments. The first argument is a name, called a filehandle, that refers to the file, the name of which is the second argument. In this example, FILE is the filehandle of “testfile.txt”, which is read in by the while loop.

Second, the while loop reads the contents of the file designated by FILE. Its structure is as follows.

Code Sample 2.1 Form of a while loop.

while (<FILE>) {
  # commands
}

The angle brackets around FILE indicate that each iteration returns a piece of FILE. The default is to read it line by line, but there are other possibilities, for example, reading paragraph by paragraph, or reading the entire file at once. The curly brackets delimit all the commands that are executed by the while loop. That is, for each line of the file, the commands in the curly brackets are executed, and such a group of commands is called a block. Note that program 2.1 has only an if statement within the curly brackets of the while loop. Finally, the # symbol in Perl denotes that the rest of the line is a comment, which allows a programmer to put remarks in the code, and these are ignored by Perl. This symbol is called a number sign or sometimes a hash (or even an octothorp). Hence code sample 2.1 is valid Perl code, although nothing is done as it stands.

Third, the if statement in program 2.1 tests each line of the file designated by FILE against the regex that is in the parentheses, which is regular expression 2.2. Note that these parentheses are required: leaving them out produces a syntax error. If the line matches the regex, then the commands in the curly brackets are executed, which is only the print statement in this case.

Finally, the print statement prints out the value of the current line of text from FILE. It can print out other strings, too, but the default is the current value of a variable denoted by $_, which is Perl’s generic default variable. That is, if a function is evaluated, and its argument is not given, then the value of $_ is used. In program 2.1, each line read by the while loop is automatically assigned to $_. Hence the statement print; is equivalent to the following.

print "$_";
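The default variable is used by more than print. The following minimal sketch shows a loop that assigns each item to $_ and a regex that is tested against $_ when no other string is named. The sub count_with_digit and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how many of its arguments contain a digit.
sub count_with_digit {
    my $count = 0;
    for (@_) {              # no loop variable, so each item is assigned to $_
        $count++ if /\d/;   # no =~ given, so the regex is tested against $_
    }
    return $count;
}

for ("first line\n", "second line\n") {
    print;                  # same as: print "$_";
}
print count_with_digit("area 860", "no digits", "x1"), "\n";
```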

Assuming that Perl has been installed on your computer, you can run program 2.1 by putting its commands into a file and saving this file under a name ending in .pl, for example, test_regex.pl. Then create a text file called testfile.txt containing phone numbers to test against regular expression 2.2. Remember that this regex assumes that each line has exactly one potential phone number. Suppose that table 2.2 is typed into testfile.txt. On the command line enter the following, which produces output 2.1 on your computer screen.

Table 2.2 Telephone number input to test regular expression 2.2.

(000) 000-0000

(000)000-0000

000-000-0000

(000)0000-000

1-000-000-0000

1(000)000-0000

1(000) 000-0000

1 (000)000-0000

1 (000) 000-0000

(0000)000-0000

(000)0000-0000

(000)000-00000

perl test_regex.pl

Output 2.1 Output from the program test_regex.pl using table 2.2 as input.

(000) 000-0000

(000)000-0000

1(000)000-0000

1(000) 000-0000

1 (000)000-0000

1 (000) 000-0000

Regular expression 2.2 is able to find all the forms in table 2.1. It also matches the pattern 1 (ddd)ddd-dddd, which is not in table 2.1, but it is a reasonable way to write a phone number.

Program 2.1 prints out the matches, but it is also informative to see what strings do not match the regex. This can be done by putting the logical operator not in front of the regex. See problem 2.2 for more on this.
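One possible way to collect the nonmatching strings is sketched below, with the operator not placed in front of regular expression 2.2. The sub non_matching_lines and the sample input are assumptions, and other approaches are equally reasonable.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that do NOT match regular expression 2.2.
sub non_matching_lines {
    my @rejected;
    for (@_) {
        if ( not /^(1 ?)?\(\d{3}\) ?\d{3}-\d{4}$/ ) {
            push @rejected, $_;
        }
    }
    return @rejected;
}

print "$_\n" for non_matching_lines("(000) 000-0000", "000-000-0000", "(000)0000-000");
```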

Returning to old business in section 2.2.1, we can now simplify regular expression 2.1 as promised. Instead of using the vertical bar (which denotes the logical operator or), we can use the zero or one symbol (denoted by a question mark). This is shown in regular expression 2.3. It turns out that this regex is more efficient than the original version. Instead of checking for both cat and cats independently, now the regex just checks for cat and when this is found, it checks for an optional s character.

Regular Expression 2.3 Another regex that finds the words cat and cats.

/\bcats?\b/i
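A simple way to gain confidence that regular expressions 2.1 and 2.3 find the same words is to run both on a few test strings. This is a minimal sketch; the sub names and the word list are assumptions made for illustration.

```perl
use strict;
use warnings;

sub matches_regex_2_1 { return $_[0] =~ /\bcat\b|\bcats\b/i ? 1 : 0 }   # alternation
sub matches_regex_2_3 { return $_[0] =~ /\bcats?\b/i ? 1 : 0 }          # optional s

for my $w ("cat", "Cats", "cat's", "scatter", "catsup") {
    printf "%-8s 2.1: %d  2.3: %d\n",
        $w, matches_regex_2_1($w), matches_regex_2_3($w);
}
```

Agreement on a handful of words is not a proof of equivalence, but any disagreement would point to a mistake in one of the patterns.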

2.3 FINDING WORDS IN A TEXT

In the last section we saw that a little Perl is useful. This section starts with a review of the regexes that we have covered so far. Then we consider the task of identifying the words of a text, and our eventual goal is to write a Perl script that finds and prints these words without punctuation. That is, the words in a text are segmented, and this is often the initial step in more complicated analyses, so it is useful later in this book.

2.3.1 Regex Summary

Table 2.3 summarizes parts of regexes already seen as well as a few additional, related patterns. Remember that regexes search for substrings contained in a string that match the pattern. For example, \d stands for a single digit, so as long as there is one digit somewhere in the string, there is a match. Hence “The US NDD prefix is 1.” matches \d since there is a digit in that string. What the string is depends on how the Perl code is set up. For example, in program 2.1, each line of the input file is the string. If a pattern runs across two different lines of the input file, then it does not match the regex. However, there are ways to deal with multiple lines of text at one time, one of which is discussed in table 2.5 in section 2.5.

Table 2.3 Summary of some of the special characters used by regular expressions with examples of strings that match.

Regex          Description                         Example of a Match
/cat/          Specified substring                 “cat” or “scatter”
/[BT]erry/     Choice of characters in []          “Berry” or “Terry”
/cat|dog/      | means or                          “cats” or “boondoggle”
/\d/           Short for [0123456789] or [0-9]     “1963” or “(860)”
/\D/           Not \d, i.e., [^0-9]                “(860)” or “Lucy”
/\w/           Alphanumeric: [0-9a-zA-Z_]          “Daisy!” or “Pete??”
/\W/           Not \w, i.e., [^0-9a-zA-Z_]         “Dog!” or “Taffy??”
/\s/           Whitespace: space, tab, newline     “Hello, Hattie”
/\S/           Not whitespace                      “Hello, Wally”
/\b/           Word boundary                       “Dave!” or “Pam’s”
/\B/           Not a word boundary                 “Brownie” or “Annie’s”
/./            Any character                       Any nonempty string
/^Cat/         ^ denotes the start of a string     “Catastrophic”
/Cat$/         $ denotes the end of a string       “The Cat”
/cats?/        ? matches 0 or 1 occurrences        “scat” or “scats”
/Gary’s+/      + matches 1 or more occurrences     “Gary’s” or “Gary’sss”
/Gina’s*/      * matches 0 or more occurrences     “Gina’” or “Gina’sss”
/cat(cat)?/    ( ) denotes grouping                “ssscat” or “catcatty”

Recall that {0,1} is denoted ?, and we see that {1,} is denoted + and {0,} is denoted *, where {m,n} stands for at least m repetitions and at most n repetitions. Also remember that unless surrounding characters are explicitly specified, the restrictions of {m,n} are seemingly broken. For example, /\d{3}/ matches both “000” and “0000.” However, /-\d{3}-/ matches “-000-” but not “-0000-.” The latter regex demands a pattern of a hyphen, then exactly three digits, then another hyphen, while the former regex only looks for a substring of three digits, which can be found in a string of four digits. Finally, note that the caret inside square brackets signifies negation, so [^0-9] (also denoted by \D) means any character except a digit. However, the caret outside the brackets means the start of a string (see problem 2.3 for further examples of how to use a caret). Problem 2.4 has further regex examples.
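The point about surrounding characters is easy to confirm by experiment. In this minimal sketch the sub names are hypothetical; the four test strings are the ones mentioned in the paragraph above.

```perl
use strict;
use warnings;

# Three digits anywhere in the string are enough for this pattern.
sub has_three_digits { return $_[0] =~ /\d{3}/ ? 1 : 0 }

# Here the hyphens pin down both ends, so exactly three digits are required.
sub has_hyphenated_three { return $_[0] =~ /-\d{3}-/ ? 1 : 0 }

for my $s ("000", "0000", "-000-", "-0000-") {
    printf "%-7s /\\d{3}/: %d  /-\\d{3}-/: %d\n",
        $s, has_three_digits($s), has_hyphenated_three($s);
}
```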

In the discussion of phone numbers, it is seen that parentheses have special meaning to a regex. Specifically, parentheses specify subpatterns. For example, (1 ?)? in regular expression 2.2 means that the substring (1 ?) is subject to the second ?. That is, the parentheses designate (1 ?) as a unit, and the second ? says that this unit appears zero or one time. It turns out that parentheses do more than specify parts: they also store whatever matches each part into match variables that are available for later use in the program (see section 2.3.4).

As already noted, to match against a range of characters, square brackets are used. For example, lowercase letters are represented by [a-z]. However, certain characters have special meanings in regexes, for example, the question mark means zero or one instance of the preceding character. To match a literal question mark in a text, one has to use an escaped version, which is done by placing a backslash in front of the question mark as follows: \?. However, to include this character in a range of values, the escaped version is not needed, so [?!] means either a question mark or an exclamation point. Conversely, a hyphen is a special symbol in a range, so [a-z] means only the lowercase letters and does not match the hyphen. To include a hyphen in the square brackets, just put it first (or last), so [-a] matches either the letter a or a hyphen. However, the hyphen has no special meaning elsewhere in a regex. For example, /\d{3}-\d{4}/ matches a U.S.-style seven-digit phone number.
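The rules for question marks and hyphens inside and outside square brackets can be checked with a few short patterns. The sub names and test strings in this sketch are assumptions made for illustration.

```perl
use strict;
use warnings;

sub has_mark          { return $_[0] =~ /[?!]/  ? 1 : 0 }   # ? needs no backslash inside [ ]
sub has_question_mark { return $_[0] =~ /\?/    ? 1 : 0 }   # outside [ ], ? must be escaped
sub has_a_or_hyphen   { return $_[0] =~ /[-a]/  ? 1 : 0 }   # a leading hyphen is literal
sub has_lowercase     { return $_[0] =~ /[a-z]/ ? 1 : 0 }   # here the hyphen marks a range

for my $s ("Really?", "Stop!", "Fine.", "-", "xyz") {
    printf "%-8s [?!]: %d  \\?: %d  [-a]: %d  [a-z]: %d\n", $s,
        has_mark($s), has_question_mark($s), has_a_or_hyphen($s), has_lowercase($s);
}
```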

We have now seen the basic components of regexes: parentheses for grouping; {m, n} for repetition; characters and anchors; and | for alternation. These, however, do not have equal weight in Perl. In fact, Perl considers these in the order just given. That is, grouping is considered first, repetition second, characters and anchors third, and alternation last. For example, a|b+ means either the letter a or one or more copies of the letter b. So + is considered first, and only then |. This ordering is called the precedence of these regex components.
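Precedence can be explored by comparing a|b+ with a version in which parentheses force the alternation to happen first. In the sketch below, both patterns are wrapped in ^ and $ so that the whole string must match; the sub names and test strings are hypothetical.

```perl
use strict;
use warnings;

# a|b+ with the + applied first: either a single "a" or a run of one or more "b"s.
sub plus_before_or { return $_[0] =~ /^(a|b+)$/ ? 1 : 0 }

# Parentheses change the order: one or more characters, each an "a" or a "b".
sub or_before_plus { return $_[0] =~ /^(a|b)+$/ ? 1 : 0 }

for my $s ("a", "bbb", "aaa", "abab") {
    printf "%-5s a|b+: %d  (a|b)+: %d\n",
        $s, plus_before_or($s), or_before_plus($s);
}
```

The strings aaa and abab distinguish the two patterns, which shows that + binds to b alone unless parentheses say otherwise.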

2.3.2 Nineteenth-Century Literature

A general, all-purpose regex would be great to have, but, in practice, different types of texts vary too much. For example, compared to formal business letters, a file of emails has many more abbreviations, misspellings, slang words, jargon, and odd symbols like smiley faces. What might work for analyzing business letters probably fails for emails, and vice versa. However, even documents from the same source sometimes have systematic differences. In text mining it always pays to examine the texts at the beginning of a project.

To test the word regexes that we develop below, we use literary texts. These have several advantages. First, literature comes in a variety of lengths, from short stories to novels. Second, if older texts are used, for instance, nineteenth-century literature, then public domain versions are often available from the Web. It is true that these versions may not be definitive, but they are certainly satisfactory for testing out text mining techniques. For this task of segmenting words, the short stories of Edgar Allan Poe are used. These stories are generally short and put all together still fit in one book. Plus Poe wrote in a variety of styles: although he is most famous for his horror stories (or as he called them, Tales of the Grotesque and Arabesque), he also wrote detective stories, early science fiction, and parodies of other genres. Finally, some of his fiction has unusual words in it. For example, Poe is fond of quoting foreign languages, and several of his stories have dialog with people using heavy dialects. With the goal of segmenting words as motivation, we need to learn a few more Perl tools to test our regexes.

2.3.3 Perl Variables and the Function split

Program 2.1 assumes that a line of text contains just one phone number and nothing more. A file with a Poe short story has many words per line, and so a way to break this into words is needed. However, the spaces in text naturally break a line into substrings, and Perl has a function called split that does this. The results require storage, and using variables and arrays is an effective way to do this, which is the next topic.

Programming languages use variables to store results. In Perl, scalar variables (variables that contain a single value) always start with a dollar sign. Consider the following statement.

$x = "They named their cat Charlie Brown.";

Perl stores this sentence in the variable $x. Later in the code $x is available for use or modification. We are interested in taking a line of text and breaking it into substrings that are potential words. These are tested against a regex, and the substrings that match the regex are labeled as words, and the nonmatches are declared nonwords. The initial substrings are stored in variables so that they can be later tested against the regex. One complication, however, is that the number of words per line of text varies, and it is not known beforehand. This requires the ability to store a variable number of substrings, which is easily done with arrays.

An array is an ordered collection of variables with a common name that starts with an @ character. Each variable in an array is indexed by the nonnegative numbers 0, 1, 2, .... Code sample 2.2 gives an example. Since the text “The sun is rising.” has three blanks, it splits into four substrings, and “The” is stored in $word[0], “sun” in $word[1], “is” in $word[2], and “rising.” in $word[3]. The array @word refers to this collection of four variables as a single unit. One convenient feature of Perl is that the programmer does not need to specify how big the array is beforehand. So the output of the function split can be stored in an array even though the number of substrings produced is unknown beforehand.

Code Sample 2.2 An example of an array as well as string interpolation.

$line = "The sun is rising.";
@word = split(/ /,$line);
print "$word[0], $word[1], $word[2], $word[3]";

The first line of code sample 2.2 stores the string in the variable $line. The function split in the second line has two arguments: a regex and a string for splitting into pieces. In this case, the regex is a blank between two forward slashes, which splits at every place where there is a single space. The string for splitting is the second argument, which is $line. Finally, the print statement shows the results. Because the string in the print statement has double quotes, each variable in this string is replaced by its value. That is, $word[0] is replaced by its value, as is true for the other three. In Perl, scalar variables in strings with double quotes are replaced by their values, which is called interpolation. Note that this does not happen for strings in single quotes.
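The difference between the two kinds of quotes can be checked directly. The following is a minimal sketch, not one of the numbered code samples: the subroutine name show_quotes is invented for illustration, and the use strict, use warnings, and my declarations are a defensive habit that the shorter samples in this chapter omit.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the chapter): builds the same text once
# with double quotes and once with single quotes.
sub show_quotes {
    my ($line) = @_;
    my @word = split(/ /, $line);          # split on single spaces
    my $double = "$word[0], $word[1]";     # variables are interpolated
    my $single = '$word[0], $word[1]';     # text is kept literally
    return ($double, $single);
}

my ($double, $single) = show_quotes("The sun is rising.");
print "$double\n";   # the first two words
print "$single\n";   # the variable names themselves
```

With double quotes the first line printed contains the words The and sun, while with single quotes the second line contains the characters $word[0], $word[1] exactly as typed.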

The regex / / in the function split is not flexible: it only splits on exactly one space. If the text has two spaces instead of one between words, this causes an unexpected result which is seen by running code sample 2.3, and the results are output 2.2.

Code Sample 2.3 Note what happens to the double space between the words The and sun.

$line = "The  sun is rising.";
@word = split(/ /,$line);
print "$word[0], $word[1], $word[2], $word[3]";

Output 2.2 Output for code sample 2.3.

The, , sun, is

When split on / /, the double space is split so that the first piece is “The”, the second piece is “” (the empty string), and the third piece is “sun”, and so forth. This empty string produces the two commas in a row in the output. One way to fix this behavior is by changing the regex in the function split to match all the whitespace between words, as is done in code sample 2.4.

Code Sample 2.4 Now the double space is treated as a unit, unlike code sample 2.3.

$line = "The  sun is rising.";
@word = split(/\s+/, $line);
print "$word[0], $word[1], $word[2], $word[3]\n";

Recall from table 2.3 that \s stands for whitespace (space, tab, or newline) and the + means one or more of the preceding character, which is one or more whitespace characters in this case. Running this code produces output 2.3, which is what we expect.

Output 2.3 Output for code sample 2.4.

The, sun, is, rising.
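The two splitting regexes of code samples 2.3 and 2.4 can also be compared side by side. The sketch below is an illustration only: the helper bracket is a made-up name, and it wraps every piece in square brackets so that an empty string shows up as [].

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each piece in brackets so empty strings
# produced by split are visible.
sub bracket {
    my @piece = @_;
    my $out = '';
    foreach my $x (@piece) {
        $out .= "[$x]";
    }
    return $out;
}

my $line = "The  sun is rising.";         # two spaces after "The"
my @by_space = split(/ /, $line);         # exactly one space
my @by_white = split(/\s+/, $line);       # one or more whitespace characters

print bracket(@by_space), "\n";           # [The][][sun][is][rising.]
print bracket(@by_white), "\n";           # [The][sun][is][rising.]
```

Splitting on / / yields five pieces, one of them empty, while splitting on /\s+/ yields the four expected words.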

Now that we can break a line of text into substrings and store these in an array, these substrings are tested against a regex. In program 2.1, the if statement has the following form.

if ( /$regex/ ) { # commands }

Although no variable is explicitly mentioned, Perl understands that the default variable, denoted $_, is compared with $regex, which contains a string denoting a regex. This is another example of Perl interpolation: the contents of this variable are used as the regex. One can use other variables with the following syntax.

if ( $x =~ /$regex/ ) { # commands }

Now the variable $x, not $_, is compared with the regex. Note that $_ can be written out explicitly instead of suppressing it, so the following is valid syntax.

if ( $_ =~ /$regex/ ) { # commands }

When testing a regex, it is useful to examine which strings match and which do not match the regex. Using an if–else construction is one way to do this, so if the regex matches, one set of commands is executed, and if there is no match, then another set is executed. The form is given in code sample 2.5.

Code Sample 2.5 The structure of an if statement.

if ( $x =~ /$regex/ ) {
    # commands for a match
} else {
    # commands for no match
}

Now we can split a line of text as well as test a variable against a regex. However, splitting text produces an array, not just one variable, so we need to test each component of the array. Just as the while loop can access the lines of a file, the foreach loop can access the values of an array by using the following syntax.

Code Sample 2.6 The structure of a foreach loop.

foreach $x (@word) {
    # commands
}

[Image: listing for program 2.2, not reproduced]

Program 2.2 Code for reading a string a word at a time.

Here the variable $x takes on each value of the array @word. The commands in the curly brackets then can use these $x values. For example, compare each one against a regex.

Let us put together the individual parts just discussed to create program 2.2. Here the split function splits on one or more whitespace characters, and the regex in the if statement matches strings having one or more alphanumeric characters. So there are four matches, as expected. Note that the semicolons inside the double quotes are characters, not Perl end-of-the-line delimiters. The output is as follows.

The matches; sun matches; is matches; rising. matches;
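Since the listing of program 2.2 is not reproduced here, the following sketch shows one way to write a program that fits its description and output. It is a reconstruction under that assumption, not the original listing; the subroutine name match_report and the use of strict, warnings, and my are additions.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: one "matches;" entry per piece that
# contains at least one alphanumeric character.
sub match_report {
    my ($line) = @_;
    my $report = '';
    my @word = split(/\s+/, $line);       # one or more whitespace characters
    foreach my $x (@word) {
        if ( $x =~ /\w+/ ) {              # at least one alphanumeric character
            $report .= "$x matches; ";    # this semicolon is just text
        }
    }
    return $report;
}

print match_report("The sun is rising."), "\n";
```

Note that rising. is reported with its period, because the if statement only asks whether some part of the piece matches, not which part.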

Finally, it is sometimes useful to undo split, and Perl has a function to do this called join. Code sample 2.7 has a short example, and it prints out the same sentence that is in $line1, except that the double space is replaced by a single space. Note that the first argument can be any string. For example, using the string ' + ' produces The + sun + is + rising.

Code Sample 2.7 The function join undoes the results of the function split.

$line1 = "The  sun is rising.";
@word = split(/\s+/, $line1);
$line2 = join(' ', @word);
print "$line2";

In the next section we learn how to access the substring that matches a regex. For example, in program 2.2 we do not know which characters actually do match /\w+/.

2.3.4 Match Variables

The content of parentheses in a regex is treated as a unit. However, parentheses also store the substring that matches this unit in a variable. These are called match variables, and they are written $1, $2, and so forth. The first, $1, holds the match for the first set of parentheses, $2 holds the match for the second set, and so forth.

Using a match variable we can modify program 2.2 by adding parentheses to \w+. Now the part of the string that matches this is stored in the match variable $1, and this is a substring composed completely of alphanumeric characters. In particular, it does not match punctuation, so this seems like a promising regex for extracting words from a text. Let us try it out in the script below.

[Image: listing for program 2.3, not reproduced]

Program 2.3 Code for removing nonalphanumeric characters.

The sun is rising

Program 2.3 produces the above output. Indeed, the period has been removed. If this code is placed into a while loop that goes through a file line by line, then we can see how well punctuation is removed for a longer text. Keep in mind that in text mining, promising initial solutions often have unexpected consequences, so we may need to patch up the regex used in program 2.3.

2.4 DECOMPOSING POE’S “THE TELL-TALE HEART” INTO WORDS

As noted earlier, public domain versions of Poe’s short stories are available on the Web and are not hard to find using a search engine. Our goal here is to decompose “The Tell-Tale Heart” [94] into words. This task is called word segmentation. Program 2.3 suggests that this is not too hard, but there are many details to consider if we are to do this task well.

To extract words from text, punctuation must be removed. The discussion below covers the basics, but this is not the complete story. The following two references give more details and background on the grammar of English. First, section 506 of the Cambridge Grammar of English [26] gives a short synopsis of punctuation use. Second, section 4.2.2 of Foundations of Statistical Natural Language Processing [75] discusses some issues of the tokenization of a string into words.

Before discussing code, there are two more Perl functions that are useful here. Although text in files looks like it is stored in lines, in fact, it is stored as one long string of characters. The lines are created by the program displaying the text, which knows to put them in wherever a newline character exists. So the last word in a line ends with a newline character, which is not how humans think of the text. Therefore, it is useful to cut off this newline character, which is what the following command does.

chomp;

Note that this function has no explicit argument, which implies it is using the default variable of Perl, $_. In a while loop, each line read is automatically stored in $_, so chomp in a while loop cuts off the newline character at the end of each line.

The second function is die, which is often used to test for a failure in opening a file. In Perl, or is the logical operator of the same name, and it has a clever use in the following command.

open(FILE, "The Tell-Tale Heart.txt") or die("File not found");

If the file opens successfully, open returns the value true and die is not executed. If the file fails to open, then die runs, which halts Perl and prints out a warning that starts with the string in its argument, namely, “File not found.” See problem 2.5 for why this happens.
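The effect of chomp and the open ... or die idiom can be tried without a file on disk. The sketch below makes two assumptions that differ from the chapter's own style: it reads from an in-memory string (Perl allows a reference to a scalar in place of a file name), and it uses the three-argument form of open with a lexical filehandle $fh instead of the bareword FILE. The sample text is made up.

```perl
use strict;
use warnings;

# Assumption: an in-memory string stands in for a text file so that
# the sketch runs anywhere.
my $text = "True!\nnervous\n";
open(my $fh, '<', \$text) or die("File not found");

my @raw;
my @chomped;
while (<$fh>) {
    push(@raw, $_);       # the line as read, still ending in "\n"
    chomp;                # no argument, so it acts on $_
    push(@chomped, $_);   # the same line without the newline
}
close($fh);

print length($raw[0]), " ", length($chomped[0]), "\n";   # 6 5
```

The first line has six characters as read and five after chomp, which confirms that only the trailing newline is removed.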

Now let us consider a Perl script to remove punctuation from the short story “The Tell-Tale Heart.” Note that this code assumes that the file “The Tell-Tale Heart.txt” exists in the same directory as the program itself. If it does not, then the directory that contains the file must be specified, for example, open(FILE, "C:/Poe/story.txt");.

[Image: listing for program 2.4, not reproduced]

Program 2.4 Code for extracting words.

Let us consider program 2.4 line by line. First, the open command makes FILE the filehandle for this short story, and the die command stops execution and prints a warning if the open command fails to work. Second, the while command loops through the story line by line. Third, @word is an array with one entry for each substring created by the split command, which splits on one or more whitespace characters. Note that while splits on newlines, so the split function only has spaces and tabs left for it.

With each line of text split into substrings, the foreach loop goes through each of these, which are stored in the variable $x in the body of the loop, that is, for the commands within the curly brackets. In this case, there is only one command, an if statement, which itself contains one command, a print statement. The if statement tests $x against the regex /(\w+)/, and the substring that matches is stored in the variable $1. For example, testing “dog.” puts “dog” in $1, since only these characters are alphanumeric. Finally, $1 is printed out, so that the output should have no punctuation.

If you obtain a copy of “The Tell-Tale Heart” and run program 2.4, the punctuation seems entirely removed at first glance. However, depending on the public domain version you choose, there are exceptions, five of which are shown in table 2.4. The first line should have the word watch’s, but the apostrophe and the ending -s are both removed, which also happens in line 2. The hyphenated word over-acuteness is reduced to over in the third line, and o’clock is truncated to the letter o in the fourth. Finally, the last line has underscores since this character is included in \w, and it is commonly used to denote italics in electronic texts.

Hence, there is a major problem with program 2.4. The regex /(\w+)/ matches all the alphanumeric characters of a word only if it has no internal punctuation. Otherwise, only the first group of contiguous alphanumeric characters is matched, and the rest is ignored, which happened in the first four lines of this table. So problems are caused by contractions, possessive nouns, and hyphenated words.

Table 2.4 Removing punctuation: a sample of five mistakes made by program 2.4.

A watch minute hand moves more quickly

Who there

madness is but over of the sense

it was four o still dark as midnight

I now grew _very_ pale
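The mistakes in table 2.4 can be reproduced with a short sketch in the spirit of program 2.4. This is not the original listing: the three input phrases are typed in by hand with straight apostrophes (an assumption about the electronic text), they are read from an in-memory string instead of a file, and the subroutine name extract_words is invented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the story file: three phrases containing
# the punctuation that table 2.4 shows being mishandled.
my $story = "A watch's minute hand moves more quickly\n"
          . "madness is but over-acuteness of the sense\n"
          . "I now grew _very_ pale\n";

# Returns the first \w+ run of each whitespace-separated piece.
sub extract_words {
    my ($line) = @_;
    my @found;
    foreach my $x (split(/\s+/, $line)) {
        if ( $x =~ /(\w+)/ ) {
            push(@found, $1);     # only the first alphanumeric run
        }
    }
    return join(' ', @found);
}

open(my $fh, '<', \$story) or die("File not found");
while (<$fh>) {
    chomp;
    print extract_words($_), "\n";
}
close($fh);
```

The printed lines lose the 's of watch's and the -acuteness of over-acuteness, and they keep the underscores around very, just as in the table.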

An additional problem, which does not appear in this version of “The Tell-Tale Heart,” is dashes written as double hyphens that abut the words on either side. For example, this happens in sentence 2.7. Since punctuation does occur within words, let us consider the cases noted above one at a time.

(2.7) Cheryl saw Dave--he wore black--and she ran.

2.4.1 Dashes and String Substitutions

Dashes are written in several ways. My version of “The Tell-Tale Heart” uses a single hyphen with one space on each side, that is, “ - ”. As long as the dash is written with spaces on each side, that is, “ - ” or “ -- ”, splitting on whitespace never produces a word attached to a dash.

If there are no spaces between the adjacent words and the dash, then the form of the latter must be “--”; otherwise a dash is indistinguishable from a hyphen. There are two ways to deal with this situation. First, if dashes are not of interest to the researcher, then these can be replaced with a single space. Second, if dashes are kept, then “--” can be replaced with “ -- ”.

Perl has string substitutions. For example, s/dog/cat/ replaces the first instance of the string “dog” with “cat”. Note that the letter s stands for substitution. To replace every instance, just append the letter g, which stands for global. For example, see code sample 2.8, which produces the output below. Note that using s/--/ /g instead of s/--/ -- /g replaces each dash with a single space, thereby removing them altogether.

Code Sample 2.8 This adds spaces around the dashes in the string stored in the variable $line.

$line = "Cheryl saw Dave--he wore black--and she ran.";
$line =~ s/--/ -- /g;
print "$line\n";

Cheryl saw Dave -- he wore black -- and she ran.

So dashes are not hard to work with. Unfortunately, sometimes -- is used in other ways. For example, Poe sometimes wrote a year as “18--” in his short stories. But such special cases are detectable by regexes, and then a decision on what to do can be made by the researcher. For example, the following code finds all instances of “--” and notes the nonstandard uses, which are those that do not have a letter or whitespace adjacent to both the front and the back of the dash.

[Image: listing for program 2.5, not reproduced]

Program 2.5 Code to search for -- and to decide if it is between two words or not.

Non-standard dash: "18--,"
Standard dash: "April--some"

Program 2.5 produces the above output. The first dash is nonstandard since it has a number on its left, while the second dash is between two words, and so is standard. Looking at this Perl program, much of the syntax has already been introduced above. A string of text to test is stored in $line. The results of splitting on whitespace are stored in the array @word, and the foreach statement loops over the substrings stored in this array. The first if statement tests for any instance of --, while the second if tests if a space or a letter is both before and after the dash. If this is the case, then the first print statement is executed, otherwise the second one is. Finally, note that the print statements print out strings that contain a double quote, yet the strings themselves are delimited by them. To put one in a string, it must be escaped, that is, a backslash precedes the double quotes inside the string. Next we consider hyphens.
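The logic just described can be sketched as follows. This is a reconstruction from the description, not the original program 2.5: the test sentence, the subroutine name classify_dashes, and the exact form of the second regex are assumptions chosen so that the two reported substrings agree with the output shown above.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: report each whitespace-separated piece
# that contains "--" as a standard or non-standard dash.
sub classify_dashes {
    my ($line) = @_;
    my @report;
    foreach my $x (split(/\s+/, $line)) {
        if ( $x =~ /--/ ) {                          # any dash at all
            if ( $x =~ /[a-zA-Z\s]--[a-zA-Z\s]/ ) {  # letter or space on both sides
                push(@report, "Standard dash: \"$x\"");
            } else {
                push(@report, "Non-standard dash: \"$x\"");
            }
        }
    }
    return @report;
}

# Made-up test sentence containing both kinds of dash.
my $line = "In the year 18--, in April--some said May.";
foreach my $message (classify_dashes($line)) {
    print "$message\n";
}
```

The escaped double quotes \" inside the strings are what allow the quotation marks to appear in the printed messages.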

2.4.2 Hyphens

Hyphens are used in two distinct ways. First, many published works use justified typesetting; that is, the text is aligned on both its margins so that its width is constant. In order to do this, some words are broken into two pieces: the first ends a line of text and is followed by a hyphen, and the second starts the next line. However, for electronic texts, this is typically not done because the raw text can be entered into a word processing or typesetting program, which can convert it into justified text. For example, this book was originally written as a LaTeX text file, which includes typesetting commands.

Second, some words are hyphenated because they are built up from two or more words. For example, mother-in-law, forty-two, and self-portrait are written with hyphens. So, in practice, electronic text generally only uses hyphens for words that are themselves hyphenated.

Unfortunately, not everyone agrees on which words should be hyphenated. For example, is it e-mail or email? Both are used now (the latter is easier to type, so it should win out). Moreover, three or more words are combinable. For example, one-size-fits-all is sometimes written with hyphens.

For many text mining applications, words are counted up as part of the analysis. Should we count a hyphenated word as one, or as more than one? Note that its components may or may not be words themselves. For example, e-mail is not a combination of two words. Even if the components are all words, their individual meanings can differ from the collective meaning. For example, mother-in-law is composed of mother, in, and law. However, the connotations of the noun law or the preposition in have little to do with the concept mother-in-law. Finally, a word like once-in-a-lifetime is roughly equivalent to its four constituent words. So there is no easy answer to whether a hyphenated word should be counted as one or several words. In this book, we take the former approach, which is simpler.

Hence, /(\w+)/ used in program 2.3 only matches up to the first hyphen. Because \w matches digits and underscores, specifying only letters with [a-zA-Z] is helpful. Since the hyphen is used to denote a range of characters, it must be first or last in the square brackets to stand for itself, for example, [a-zA-Z-]. So a first attempt at a regex is made by specifying exactly these characters, one or more times. Consider code sample 2.9, which produces the output below.

Code Sample 2.9 First attempt at a regex that finds words with hyphens.

$line = "Her sister-in-law came--today---and -it- is a-okay!";
@word = split(/\s+/, $line);
foreach $x (@word) {
    if ( $x =~ /([a-zA-Z-]+)/ ) {
        print "$1 ";
    }
}

Her sister-in-law came--today---and -it- is a-okay

This is not what we want. The dashes remain as well as three or more hyphens in a row, but both can be removed by the substitution in code sample 2.8.

In addition, the hyphens of -it- require removal, but sister-in-law cannot be changed. This can be accomplished by thinking in terms of groups of characters. Hyphenated words start with one or more letters, then one hyphen, then one or more letters, perhaps another hyphen, then one or more letters, and so forth. To include words with no hyphens, make the first one optional. So the regex starts with ([a-zA-Z]+-?), which says that any word (hyphenated or not) must start with one or more letters followed by zero or one hyphen. Now if we specify that this pattern happens one or more times, which is done by adding a + after the parentheses, then this matches hyphenated words as well as regular words. Running code sample 2.10 produces the output below.

Code Sample 2.10 Second attempt at a regex that finds words with hyphens.

$line = "Her sister-in-law came--today---and -it- is a-okay!";
$line =~ s/--+/ /g;
@word = split(/\s+/, $line);
foreach $x (@word) {
    if ( $x =~ /(([a-zA-Z]+-?)+)/ ) {
        print "$1 ";
    }
}

Her sister-in-law came today and it- is a-okay

This almost works. The only problem is that it matches a group of letters ending in a hyphen, so -it- still ends in one in the output. This is easily fixed by specifying the last character as a letter. Code sample 2.11 does this, which produces the desired output below.

Code Sample 2.11 This code extracts words at least two letters long, including hyphenated ones.

$line = "Her sister-in-law came--today---and -it- is a-okay!";
$line =~ s/--+/ /g;
@word = split(/\s+/, $line);
foreach $x (@word) {
    if ( $x =~ /(([a-zA-Z]+-?)+[a-zA-Z])/ ) {
        print "$1 ";
    }
}

Her sister-in-law came today and it is a-okay

Although this test is successful, there is one problem with code sample 2.11. It won’t match words that are exactly one letter long. This is an example of why testing is paramount: it is easy to make a change that fixes one problem only to discover that a new one arises. Upon reflection, this regex requires at least two letters because the part in the inner parentheses requires at least one letter, and the requirement of a final letter forces a potential match to have at least two. This is not hard to fix, and the solution is given in code sample 2.12. Note that the word I is matched in the output below.

Code Sample 2.12 This code extracts hyphenated and one-letter words.

$line = "Her sister-in-law came--today---and I -am- a-okay!";
$line =~ s/--+/ /g;
@word = split(/\s+/, $line);
foreach $x (@word) {
    if ( $x =~ /(([a-zA-Z]+-)*[a-zA-Z]+)/ ) {
        print "$1 ";
    }
}

Her sister-in-law came today and I am a-okay

Remember that * means zero or more occurrences, so this regex matches zero or more groups of letters followed by exactly one hyphen, and which ends in one or more letters. This now matches one-letter words. Finally, suppose $line is set to the string “She received an A- on her paper.” This regex now improves the grade. This is an example of the difficulties of writing for the general case instead of for a particular group of texts.
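Because each change to the regex can fix one case and break another, it helps to wrap the final version in a subroutine and rerun it on every test sentence. The following is a sketch along those lines; the subroutine name hyphen_words is invented here, and the regex and substitution are the ones from code sample 2.12.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex of code sample 2.12.
sub hyphen_words {
    my ($line) = @_;
    $line =~ s/--+/ /g;                   # two or more hyphens become a space
    my @found;
    foreach my $x (split(/\s+/, $line)) {
        if ( $x =~ /(([a-zA-Z]+-)*[a-zA-Z]+)/ ) {
            push(@found, $1);
        }
    }
    return join(' ', @found);
}

print hyphen_words("Her sister-in-law came--today---and I -am- a-okay!"), "\n";
print hyphen_words("She received an A- on her paper."), "\n";
```

The second line printed is She received an A on her paper, which shows the lost minus sign mentioned above.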

Our work, however, is not quite done. Now that we have considered both dashes and hyphens, we next discuss the apostrophe.

2.4.3 Apostrophes

Apostrophes are problematic because they serve more than one purpose. First, they are used to show possession, for example, Gary’s dog. Second, they are also used for contractions, for example, Gina’s going home. Third, they are used for quotation marks; see section 488 of [26]. In addition, quotes within quotations use the other type of quotation marks: for example, if double quotes are used for direct speech, then direct speech that quotes another person uses single quotes. An example of this is the following sentence.

(2.8) Katy said, “I thought he said, ‘Sam,’ but I was wrong.”

Moreover, all three uses of the apostrophe are combinable. This is seen in the following example.

(2.9) Bart said, “I thought he said, ‘That’s Scoot’s,’ but I was wrong.”

When processing sentence 2.9, unless care is taken, it is easy to match ‘That’ as the inner quotation. Although humans can easily use symbols in multiple ways depending on the context, this makes pattern matching more difficult for a computer.

Two further possible complications are worth noting. First, contractions can have an initial apostrophe, for example, ’twas. Second, nouns ending in -s are made into possessive nouns by adding an apostrophe at the end of the word, for example, my parents’ cats. If single quotes are used for direct speech, then these examples become harder to deal with.

For the short story “The Tell-Tale Heart,” however, double quotes are used, and there are no quotes within direct speech. So all the single quotes are either contractions or possessive nouns. If the regex in code sample 2.12 has the single quote added to it, then using this new regex in program 2.4 extracts the words from this particular short story, which is done in program 2.6. However, putting the single quote in the range of characters also allows multiple single quotes in a row, which may or may not be desired.

[Image: listing for program 2.6, not reproduced]

Program 2.6 Improved version of program 2.4 for extracting words.

Looking at the output of program 2.6, the problems in table 2.4 are corrected. Hence, over-acuteness appears, as do watch’s, Who’s, and four o’clock. Finally, _very_ is changed to very. Therefore, this program works, at least for the input “The Tell-Tale Heart.”
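The effect of adding the single quote to the character class can be sketched without the full program. The code below is not the listing of program 2.6: the name poe_words is made up, the inputs are typed with straight apostrophes (a curly apostrophe in a text file would need to be added to the class separately), and the last input is a contrived string that shows the caveat about several single quotes in a row.

```perl
use strict;
use warnings;

# Hypothetical sketch: the regex of code sample 2.12 with the single
# quote added to both character classes.
sub poe_words {
    my ($line) = @_;
    my @found;
    foreach my $x (split(/\s+/, $line)) {
        if ( $x =~ /(([a-zA-Z']+-)*[a-zA-Z']+)/ ) {
            push(@found, $1);
        }
    }
    return join(' ', @found);
}

print poe_words("A watch's minute hand"), "\n";
print poe_words("madness is but over-acuteness of the sense"), "\n";
print poe_words("it was four o'clock"), "\n";
print poe_words("I now grew _very_ pale"), "\n";
print poe_words("rock ''n'' roll"), "\n";     # doubled quotes are kept
```

The first four lines come out as intended, while the last keeps ''n'' intact, so a text that uses single quotes for speech would need further work.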

2.5 A SIMPLE CONCORDANCE

Sections 2.3 and 2.4 give us some tools to extract words from a text. We use these to create a concordance program in Perl. That is, we want to write code that finds a target word, and then extracts the same number of characters of text both before and after it. These extracts are then printed out one per line so that the target appears in the same location in each line. For example, output 2.4 shows four lines of output for the word the in Poe’s “The Tell-Tale Heart,” which is produced by program 2.7. This kind of listing allows a researcher to see which words are associated with a target word. In this case, since the is a determiner, a class of words that modify nouns, it is not surprising that the precedes nouns in this output.

Output 2.4 Four lines of concordance output for Poe’s “The Tell-Tale Heart.”

 say that I am mad? The disease had sharpen
 hem. Above all was the sense of hearing ac
heard all things in the heaven and in the e
n the heaven and in the earth. I heard man

Conceptually, a concordance program is straightforward. When the first instance of the target word is found, its location is determined, and then the characters surrounding the target are printed out. Starting just after the target, the search continues until the next instance is found. This repeats until all instances are discovered.

We have already used regexes to find the target word. Here is a second approach. Perl has a function index that locates a substring within a string, and the function substr extracts a substring given its position. We first learn how these two functions work, and then apply them to the concordance program.

Code Sample 2.13 Example of the string function index.

$line = "He once lived in Coalton, West Virginia.";
$target = "lived";
$position = index($line, $target, 0);
print "The word \"$target\" is at position $position.";

For a first example, consider code sample 2.13. Here index looks for the string in $target within $line starting at position 0 (given by the third argument), which is the beginning of $line. If this string is not found, then $position is assigned the value -1; otherwise the position of the start of the target string is returned. Running this code produces the following output.

Output 2.5 Results of code sample 2.13.

The word "lived" is at position 8.

Counting from the first letter of $line such that H is 0, e is 1, the blank is 2, and so forth, we do find that the letter l, the first letter of lived, is number 8.

Notice that index only finds the first instance after the starting position. To find the second instance, the starting position must be updated after the first match. Since the word lived is at position 8, then the following command finds the second instance of this word, if any.

$position = index($line, $target, 9);

If the value 8 is used instead, then $position is still 8 since the next instance of $target is, in fact, at that position. Updating the starting position is achievable as follows. Here the old value of $position is used in the function index, and then it is updated to the result returned by index.

$position = index($line, $target, $position+1);

Code Sample 2.14 This searches for all instances of the target word she.

[Image: listing for code sample 2.14, not reproduced]

By repeatedly using this updating of $position, all instances of the target word are found by code sample 2.14. Running this produces the values 9 and 21, but not the value 0. However, the reason for this is simple: the first she is capitalized, and so it is not a match. Unfortunately, the function index does not take a regex for its argument, so we cannot find all the instances of she by putting the regex /she/i into $target. However, applying the function lc to the string in $line changes all the letters to lowercase. So replacing the third line in code sample 2.14 with the statement below and making the analogous change to the seventh line finds all three instances of she, regardless of case.

$position = index(lc($line), $target, 0);

The heart of code sample 2.14 is the while loop, which keeps going as long as $position is greater than -1. This is true as long as instances of the target word are found. Once this does not happen, index returns the value -1, which halts this loop. If the text has no instance of the target, then the value of $position before the while loop begins is -1, which prevents the loop from executing even once.
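The loop just described can be sketched as follows. This is an illustration written from the description, not the listing of code sample 2.14, so its line numbers do not correspond to the third and seventh lines mentioned above. The test sentence is made up so that she falls at positions 0, 9, and 21, and the subroutine name find_all is an invention.

```perl
use strict;
use warnings;

# Hypothetical sketch: collect every position at which $target starts.
sub find_all {
    my ($line, $target) = @_;
    my @positions;
    my $position = index($line, $target, 0);
    while ( $position > -1 ) {
        push(@positions, $position);
        # Resume the search one character past the last match.
        $position = index($line, $target, $position + 1);
    }
    return @positions;
}

my $line = "She said she thought she knew.";
print join(' ', find_all($line, "she")), "\n";       # 9 21
print join(' ', find_all(lc($line), "she")), "\n";   # 0 9 21
```

Passing lc($line) instead of $line is what brings in the capitalized She at position 0.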

Finally, let us consider the function substr, which extracts substrings from text. It is easy to use: the first argument is the string, the second is the starting position, and the third is the length of the substring to be extracted. So the following line prints Nell.

print substr("I saw Nell on A level.", 6, 4);

Combining index and substr can produce a concordance program for a fixed string. However, regexes are more powerful, so we return to this approach using the ideas just discussed. Consider the following syntax.

while ( $var =~ /$target/g ) { # commands }

The letter g means to match globally, that is, all matches are found. Each one causes the commands in the curly brackets to execute once. However, how is the location of the match determined? Is there a function analogous to index for a regex? Yes, there is. Perl has pos, which returns the position of the character after the regex match. This is seen in code sample 2.15, which prints out the numbers 4 and 7. The former is the position of the space after the word This, which is the first occurrence of the letters is. The latter is the position of the space after the word is. These ideas are put into action in code sample 2.16.

Code Sample 2.15 An example of the pos function. When run, this code prints out the numbers 4 and 7.

[Image: listing for code sample 2.15, not reproduced]
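A small sketch shows pos at work inside a global match loop. It is not the listing of code sample 2.15: the test string "This is a test." is an assumption chosen so that the two matches of is end at positions 4 and 7, and the subroutine name match_ends is invented.

```perl
use strict;
use warnings;

# Hypothetical sketch: for each match of "is", record the position of
# the character just after the match.
sub match_ends {
    my ($line) = @_;
    my @ends;
    while ( $line =~ /is/g ) {
        push(@ends, pos($line));
    }
    return @ends;
}

print join(' ', match_ends("This is a test.")), "\n";   # 4 7
```

Each pass through the loop resumes the search where the previous match stopped, which is why both instances are found.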

Code Sample 2.16 Core code for a concordance.

[Image: listing for code sample 2.16, not reproduced]

The target word, cat, is made into a simple regex, and the parentheses store the matched text in $1. The variable $pos has the location of the character right after the matched text. The output has four matches as seen below.

Cat 3 cat 8 cat 13 cat 18

Note that the letter i after the regex makes the match case insensitive, hence the first Cat is matched. This also can be done in the regex, for example, using /[Cc]at/. As discussed above, finding the substring cat is necessary but not sufficient. For example, the word catastrophe has the substring cat, but it is not the word cat. However, since $target can have any regex, this is easily fixed. For example, the regex /\b[Cc]at\b/, where \b matches a word boundary, rules out words that merely contain the letters cat. Similarly, the regex /\b[Cc]ats?\b/ finds both the word cat and its plural. This ability to find regexes as opposed to fixed strings makes it easy to match complex text patterns. Hence, while the function index is useful, the while loop in code sample 2.16 is much more flexible.
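The following sketch combines the global match loop, the match variable $1, and pos, and then tightens the target with word boundaries. It is written from the description and is not the listing of code sample 2.16: the test strings are made up (the first is chosen so that the reported positions are 3, 8, 13, and 18), and the name matches_with_pos is invented.

```perl
use strict;
use warnings;

# Hypothetical sketch: report each case-insensitive match of $target
# together with the position just after it.
sub matches_with_pos {
    my ($line, $target) = @_;
    my @report;
    while ( $line =~ /($target)/gi ) {
        push(@report, "$1 " . pos($line));
    }
    return @report;
}

print join(' ', matches_with_pos("Cat, cat, cat, cat", 'cat')), "\n";

# Without \b the letters inside "catastrophe" also match.
my $text = "The cat saw a catastrophe.";
print scalar(my @loose = matches_with_pos($text, 'cat')), "\n";          # 2
print join(' ', matches_with_pos($text, '\bcats?\b')), "\n";             # cat 7
```

Only the regex, held in a single-quoted string so that the backslashes survive, has to change to move from substring matching to whole-word matching.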

Up to now the text has been stored in the variable $line, which has been short. For a longer text like Poe’s “The Tell-Tale Heart,” it is natural to read it in with a while loop. However, the default is to read it line by line, which can prevent the concordance program from getting sufficient text surrounding the target. One way around this is to change the unit of text read in, for example, reading in the entire document at once. This is possible and is called the slurp mode. However, if the text is very long, this can slow the program down. A compromise is to read in a text paragraph by paragraph. Many electronic texts use blank lines between paragraphs, and Perl knows this convention and can read in each paragraph if these are separated by blank lines. Changing the default only requires changing the value of the Perl variable $/. Table 2.5 gives some common values, but any string is possible.

Table 2.5 Some values of the Perl variable $/ and their effects.

$/ = undef;    Slurp mode
$/ = "";       Paragraph mode
$/ = "\n";     Line-by-line mode
$/ = " ";      Almost word-by-word mode

The reason that $/ set to a blank is not quite a word by word mode is that the last word of a line of text has a newline character after it, not a space. This combines the last word of a line with a newline character and then with the first word of the next line.
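The four settings in table 2.5 can be compared on a small piece of text. The sketch below is an illustration under stated assumptions: the sample text is made up, it is read from an in-memory string instead of a file, and the subroutine name read_units is invented. The keyword local restores the previous value of $/ when the subroutine returns.

```perl
use strict;
use warnings;

# Hypothetical helper: read $text with the given value of $/ and
# return the units that the diamond operator produces.
sub read_units {
    my ($text, $separator) = @_;
    local $/ = $separator;                 # input record separator
    open(my $fh, '<', \$text) or die("Cannot open in-memory text");
    my @unit = <$fh>;
    close($fh);
    return @unit;
}

# Two paragraphs separated by a blank line.
my $text = "One two.\nThree.\n\nFour five.\n";

my @lines      = read_units($text, "\n");
my @paragraphs = read_units($text, "");
my @whole      = read_units($text, undef);
my @words      = read_units($text, " ");

print scalar(@lines), " ", scalar(@paragraphs), " ",
      scalar(@whole), " ", scalar(@words), "\n";     # 4 2 1 3
```

In the last case the second unit is "two.\nThree.\n\nFour ", which shows how a word at the end of one line is fused with the start of the next when $/ is a blank.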

Now we have the tools to write program 2.7, which creates concordances. Remember that programming comments follow a #. It is good practice to comment your programs because it is surprisingly easy to forget the logic of your own code over time. If the program is used by others, then it is especially helpful to put in comments to explain how it works.

Program 2.7 builds on the discussions above on the index function and the trick of a while loop that iterates over all the matches of a regex. However, there are several additional points worth making. First, note that this program requests no input from the person running it, which restricts it to a concordance just for “The Tell-Tale Heart” and for the target the. However, it is easy to modify <FILE> to refer to any other specific text file, and the target word can be any regex, not just a specific string. It is also straightforward to enable this code to accept arguments on the command line. This technique is discussed in section 2.5.1.

Second, note the use of parentheses in the string assigned to $target. This stores the matching substring in the Perl variable $1, which is then stored in $match. This is later used to ensure that the number of characters extracted before and after the matched substring have the precise length given in $radius. If no parentheses were used, then each line printed has exactly the number of characters in $width instead of length($target) + $width.

Third, the first while loop goes through the text paragraph by paragraph. For each of these paragraphs, the second while loop goes through each match found by the regex in $target. This is not the only way to go through the entire text, but it is one that is easy for a person to grasp.

Fourth, the if statement checks whether or not there are as many characters as $radius before the matched text. If not, then $start is negative, and spaces are added to the beginning of the concordance line, that is, to $extract. The operator x shown below creates a string of blanks that has length equal to the value of -$start.

" "x -$start

However, this is not the only way to add spaces to $extract. The function sprintf creates a string with a specified format, which can be constructed by the program to make it the correct length. See problem 2.6 for more details.
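Both ways of padding can be compared side by side. This is a minimal sketch under stated assumptions: the subroutine names are hypothetical, and $start plays the same role as in the discussion above, that is, a negative value says how many blanks are missing on the left.

```perl
use strict;
use warnings;

# Left-pad $extract with blanks using the repetition operator x.
sub pad_with_x {
    my ($extract, $start) = @_;
    $extract = (" " x -$start) . $extract if $start < 0;
    return $extract;
}

# Same result with sprintf: %Ns right-justifies a string in a field
# that is N characters wide.
sub pad_with_sprintf {
    my ($extract, $start) = @_;
    my $width = length($extract) + ($start < 0 ? -$start : 0);
    return sprintf("%${width}s", $extract);
}

print "[", pad_with_x("MARLEY was", -3), "]\n";
print "[", pad_with_sprintf("MARLEY was", -3), "]\n";
```

Both print statements produce the same bracketed string with three leading blanks.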

Finally, running the program produces 150 lines of output, one for each the in “The Tell-Tale Heart.” The first 10 lines of the output are displayed in output 2.6. Remember that the first 4 lines of this are displayed above in output 2.4.

image

Program 2.7 A regex concordance program.

Although program 2.7 is short, it is powerful, especially because of its ability to match regular expressions. Although concordance programs already exist, we know exactly how this one works, and it is modifiable for different types of texts and tasks. For example, if a concordance for long-distance phone numbers were desired, then the work of section 2.2.3 provides a regex for this program. Then a document containing such numbers is analyzable to determine in what contexts these appear. This ability to adapt to new circumstances is one major payoff of knowing how to program, and when dealing with the immense variety and complexity of a natural language, such flexibility is often rewarded.
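The core of such a concordance can be sketched in a few lines. The code below is not the listing of program 2.7; it is a simplified illustration that works on a string instead of a file, and the subroutine name concordance_lines, the use of pos to locate each match, and the sample sentence are all assumptions.

```perl
use strict;
use warnings;

# Return one concordance line per match of $target in $text.
# $target must contain parentheses so that $1 holds the matched text.
sub concordance_lines {
    my ($text, $target, $radius) = @_;
    my @lines;
    while ($text =~ /$target/gi) {
        my $match = $1;
        my $end   = pos($text);                       # position just after the match
        my $start = $end - $radius - length($match);  # where the extract begins
        my $extract;
        if ($start < 0) {
            # Too close to the beginning: take what exists and pad with blanks.
            $extract = substr($text, 0, $end + $radius);
            $extract = (" " x -$start) . $extract;
        } else {
            $extract = substr($text, $start, 2 * $radius + length($match));
        }
        push @lines, $extract;
    }
    return @lines;
}

# Made-up sample sentence for illustration.
my $line = "The cat sat on the mat while the dog slept.";
print "$_\n" for concordance_lines($line, '(\bthe\b)', 10);
```

Each printed line has the matched word starting in the same column, which is what makes a concordance easy to scan.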

One drawback of program 2.7 is that to change the target regex, the code itself requires modification. Changing code always allows the possibility of introducing an error. So enabling the program to accept the regex as input is worth doing. We know how to open a file so that its contents are read into the program, but for something short, this is overkill. The next section introduces an easy way to give a program a few pieces of information when it starts.

Output 2.6 First 10 lines of the output of program 2.7.

 say that I am mad? The disease had sharpen
them. Above all was the sense of hearing ac
heard all things in the heaven and in the e
n the heaven and in the earth. I heard many
lmly I can tell you the whole story.
le to say how first the idea entered my bra
e was none. I loved the old man. He had nev
it was this! He had the eye of a vulture - 
 up my mind to take the life of the old man
to take the life of the old man, and thus r

2.5.1 Command Line Arguments

Perl is run by using the command line. If text is placed after the name of the program, then there is a way to access this within the program. For instance, consider the following command.

perl program.pl dog cat

The two words after the program name are put into the array called @ARGV. The value of $ARGV[0] is dog, and $ARGV[1] is cat. Clearly this becomes tedious if many strings are needed, but it is quite useful for only a few values.

As an application, let us modify program 2.7 so that it expects three strings: a word to match, the size of the radius of the extract, and a file to open. For example, suppose all the instances of and in the file text.txt are desired along with the 30 characters before and after it. The modified version finds these by typing the following on the command line.

perl program.pl and 30 text.txt

This is easy to do. In program 2.7, remove all the code before the first while loop and replace it with code sample 2.17. One point to note: the definition of $target requires two backslashes before the b. Inside double quotes, Perl processes backslash escapes before the regex ever sees the string, and there a single backslash followed by b stands for a backspace character, not the word boundary \b. That is, the backslash must be escaped by adding another backslash. Note that the use of single quotes does not require escaping the backslash, but then $ARGV[0] would not be interpolated.
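A minimal sketch of this kind of argument handling is given below. It is not code sample 2.17 itself: the subroutine parse_arguments and the usage message are assumptions, and the arguments are passed in explicitly so that the example runs without a command line.

```perl
use strict;
use warnings;

# Turn a list shaped like @ARGV into the three settings described above:
# a target regex, a radius, and a file name.
sub parse_arguments {
    my @args = @_;
    die "Usage: perl program.pl word radius file\n" unless @args == 3;
    my ($word, $radius, $file) = @args;
    # Inside double quotes, "\\b" yields the two characters \b, which the
    # regex engine later reads as a word boundary.
    my $target = "(\\b$word\\b)";
    return ($target, $radius, $file);
}

# In a real program the call would be parse_arguments(@ARGV).
my ($target, $radius, $file) = parse_arguments("and", 30, "text.txt");
print "target=$target radius=$radius file=$file\n";
```

Here $target ends up holding the string (\band\b), which matches the word and but not the letters and inside hand.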

This is enough for now on extracting words, but this task is essential since it is typically the first step of many text mining tasks. An implementation of command line arguments for a regex concordance is done in section 3.7: see program 3.2.

2.5.2 Writing to Files

For a large text, a concordance program can produce much output, and scrolling through this on a computer screen is a pain. It is often more convenient to store the output to a text file, which is doable in three ways.

Code Sample 2.17 Replace the code prior to the first while loop in program 2.7 with the commands here to make that program run as described above.

image

First, the function open can open a file for output as well as input. Code sample 2.18 gives two examples. Note that OUT1 is a filehandle for filename1.txt. The greater than sign means that this file is written to. If this file already exists, then the original contents are lost.

Second, the use of two greater than signs means to append to the file. So in this case, if filename2.txt already exists, then the original contents are appended to, not overwritten.

Code Sample 2.18 How to write or append to a text file.

open (OUT1, ">filename1.txt") or die;    # > overwrites the file
open (OUT2, ">>filename2.txt") or die;   # >> appends to the file
# commands
print OUT1 "$x, $y, etc.\n";
print OUT2 "$x, $y, etc.\n";

Third, redirecting the output from the command line is possible. For example, the following command stores the output from the program to filename3.txt. In addition, using a double greater than sign appends the output to the file listed on the command line.

perl program.pl > filename3.txt

Although this is rarely discussed in this book’s code examples, in practice, it is useful to store voluminous outputs in a file. Now we turn to a new problem in the next section: a first attempt to identify sentences using regexes.

2.6 FIRST ATTEMPT AT EXTRACTING SENTENCES

The general problem of extracting strings from a text is called tokenization, and the extracts are called tokens. This is a useful term since it covers any type of string, not just words, for example, telephone numbers, Web addresses, dollar amounts, stock prices, and so forth. One challenge of extracting words is how to define what a word is, and this is a complex issue. For example, in this book, sections 2.4.2 and 2.4.3 discuss the issues of hyphens and apostrophes, respectively. See section 4.2.2 of Foundations of Statistical Natural Language Processing [75] for further discussion on defining a word. Moreover, many other specialized cases come to mind with a little thought. For example, this book has regexes and computer code, and each of these has unusual tokens. However, we usually focus on tokens arising from literature.

Words are joined to create phrases, which are joined to form clauses, which are combined into sentences. So sentences inherit the complexity of words, plus they have their own structure, ranging from simple one-word exclamations up to almost arbitrarily long constructions. An early statistical paper on sentences by Yule in 1939 [128] starts out by noting that one difficulty he had in his analysis is deciding exactly what constituted both a word and a sentence. In fact, there is no definitive answer to these questions, and since language changes over time, any proposed definition becomes out of date. So the general issues are complex and are not dealt with in this book in detail, but see section 4.2.4 of the Foundations of Statistical Natural Language Processing [75] and sections 269 through 280 of the Cambridge Grammar of English [26] for further discussion on sentences. Fortunately, breaking a particular text into sentences might be easy because the author uses only certain kinds of punctuation and syntax. And even if a text cannot be broken into sentences perfectly, if the error rate is small, the results are useful.

Finally, note that written English and spoken English have many significant differences; for example, the former typically uses sentences as a basic unit. However, analyses of speech corpora have revealed that sentences are not the best unit of study for a discourse among people. Section 83 of the Cambridge Grammar of English [26] states that clauses are the basic unit of conversations. Although some texts analyzed in this book have dialogs, these are more structured than what people actually say when they talk to one another. Due to both its complexity and the lack of public domain transcriptions, this book does not analyze transcribed spoken English.

2.6.1 Sentence Segmentation Preliminaries

Another term for finding sentences is sentence segmentation. This is equivalent to detecting sentence boundaries; however, there is no built-in regex command analogous to \b for sentences. One reason for this difference is that (in English) whitespace typically separates words. Although there are exceptions, for example, once-in-a-lifetime can be written either with or without the hyphens, these are not typical. Sentences, however, are combinable in numerous ways so that a writer has a choice between using many shorter sentences, or a few medium length sentences, or just one long sentence. The examples in table 2.6 show several ways to combine the sentences He woke up and It was dark.

Table 2.6 A variety of ways of combining two short sentences.

He woke up, and it was dark.

He woke up, but it was dark.

He woke up when it was dark.

When he woke up, it was dark.

Although he woke up, it was dark.

He woke up; it was dark.

He woke up: it was dark.

He woke up–it was dark.

The freedom of choice in combining adjacent sentences is not the only consideration. First, sentences can be combined by nesting. For example, this is common in depicting dialog in a novel (see sentence 2.10 below). Second, sentence-ending punctuation marks can be ambiguous because they serve more than one purpose. We discuss the basics of both of these below, but many details are left out. For an in-depth discussion of this, see chapters 5, 10, and 14 of The Chicago Manual of Style [27].

First, sentences can be nested, that is, one sentence can interrupt the other as in sentence 2.10.

(2.10) “When I drive to Enfield,” Dave said, “I take I-91.”

Here the direct quote is interrupted by the sentence Dave said. A similar situation occurs with parenthetical remarks, such as sentence 2.11. Here the sentence I do it daily interrupts the first sentence.

(2.11) “When I drive to Enfield–I do it daily–I take I-91.”

From the examples just discussed, nested sentences are not uncommon. However, ambiguous punctuation is even more of a problem for sentence segmentation. Sentences end in question marks, exclamation points, and periods. Unfortunately, all of these symbols have other uses than terminating sentences.

In many types of texts, question marks and exclamation points are used primarily for marking the end of a sentence, although we have seen that direct quotes make the situation more complicated as shown in sentence 2.12.

(2.12) She said, “You named your cat Charlie Brown?” to me.

However, some types of texts use these two punctuation marks for other purposes. For example, this book also uses the question mark as a regex symbol. In addition, a book on chess uses the question mark to denote a poor move and an exclamation point to denote a good move, and a calculus text uses the exclamation point to denote the factorial function. However, for many kinds of texts, both the question mark and the exclamation point do not serve any other purpose besides ending a sentence.

Periods, however, have several uses besides ending a sentence, and all these can easily appear in many types of texts. First, periods are commonly used in numbered lists, right after the numeral. Second, periods are used as decimal points within numerals. Third, the ellipsis, used to indicate missing material in a quotation, is written with three periods in a row. But there is another common, alternative use of the period: abbreviations.

Using periods with abbreviations is common, especially in American English. For example, a person’s name is often accompanied by a social title such as Mr., Mrs., or Dr. There are many other titles, too: Capt., Prof., or Rev. Academic degrees are sometimes added to a name, for example, B.A. or Ph.D. Instead of the full name, many times parts of a name are replaced by an initial, for example, John X. Doe. But this is just a start, and after some thought, numerous other abbreviations come to mind: U.S. for the United States, Ave. for avenue, A.D. for anno Domini, in. for inches, Co. for company, and so forth. Of course, not all abbreviations use periods, for instance, one can use either U.S. or US, and the symbols for the chemical elements never use them. Yet enough abbreviations do use periods that one should never ignore this possibility. Finally, note that a period can mark the end of a sentence and denote an abbreviation at the same time. For example, this is true of “I live in the U.S.”

So creating a general-purpose sentence segmentation tool requires more than a few simple rules. Nonetheless, in the 1990s, error rates below 1% were achieved, for example, see Palmer’s paper on SATZ [85], which is his software package to do sentence segmentation (note that Satz is the German word for sentence).

However, an imperfect program is still useful, and a regex written for a particular set of texts might be quite good at detecting sentences. With the above discussion in mind, we now try to write such a program.

2.6.2 Sentence Segmentation for A Christmas Carol

Sentence segmentation is an interesting challenge for a regex. We try to solve it in several different ways for Charles Dickens’s A Christmas Carol [39]. This lacks generality, but the process of analyzing this novel’s sentence structure to create the regex also increases one’s familiarity with this text, which is a worthwhile payoff.

Common sense suggests a sentence begins with a capital letter and ends with either a period, an exclamation point, or a question mark. Let us call these end punctuation, although as noted above, they do not always mark the end of a sentence.

Since program 2.7 is a regex-based concordance maker, we can easily analyze A Christmas Carol on its use of end punctuation with minor changes to this program. For instance, the filehandle FILE must be linked to the file containing this novel. Then just changing $target to the strings '(.)', '(?)', and '(!)', respectively, finds all instances of these three punctuation marks. Remember that the parentheses are needed to save the matched substring.

Output 2.7 A coding error fails to find just periods.

                              MARLEY was dead: to begin with.
                             MARLEY was dead: to begin with.
                            MARLEY was dead: to begin with. T
                           MARLEY was dead: to begin with. Th
                          MARLEY was dead: to begin with. The
                         MARLEY was dead: to begin with. Ther
                        MARLEY was dead: to begin with. There
                       MARLEY was dead: to begin with. There
                      MARLEY was dead: to begin with. There i
                     MARLEY was dead: to begin with. There is

Running program 2.7 with the change of FILE and with $target set to '(.)' produces output 2.7. Clearly something has gone wrong. In this case, the period has a special meaning in a regex. As stated in table 2.3, the period matches every character. So the reason why each line in output 2.7 moves over by one is that every single character matches, hence in the first line M is matched, and so is placed in the center. In the second line, A is matched, so it is placed in the center, which means that M has moved one space to the left, and so forth. Consequently, the regexes should be '(\.)', '(\?)', and '(!)' since both the period and the question mark have special meaning in regexes, but the exclamation point does not.
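The difference between the escaped and unescaped period is easy to verify by counting matches. In this small sketch, the subroutine count_matches is an illustrative assumption, and the sample string is the opening of the novel as shown in output 2.7.

```perl
use strict;
use warnings;

# Count how many times the regex in $target matches $text.
sub count_matches {
    my ($text, $target) = @_;
    my $count = 0;
    $count++ while $text =~ /$target/g;
    return $count;
}

my $text = "MARLEY was dead: to begin with. There is no doubt whatever about that.";

print "characters in text: ", length($text), "\n";
print "matches of (.):     ", count_matches($text, '(.)'), "\n";   # unescaped
print "matches of (\\.):    ", count_matches($text, '(\.)'), "\n";  # escaped
```

The unescaped version matches once per character, while the escaped version matches only the two periods.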

Changing $target to '(\.)' produces many lines, and output 2.8 shows the first 10. Note that the fifth and ninth lines each come from the end of a paragraph, which explains why there is no text after either period.

This program lists all uses of the period, but not all of these are of interest. In particular, the use of abbreviations is important for us to check. There are several ways to do this. First, find a list of abbreviations and check for these directly. Second, find a list of words and then flag tokens that do not match this list. However, both of these approaches require more advanced programming techniques, for example, the use of hashes (discussed in the next chapter), so a third approach is tried here. We search for periods followed by a lowercase letter.

Output 2.8 First 10 periods in Dickens’s A Christmas Carol.

MARLEY was dead: to begin with. There is no doubt whatever ab
s no doubt whatever about that. The register of his burial wa
ertaker, and the chief mourner. Scrooge signed it: and Scroog
ng he chose to put his hand to. Old Marley was as dead as a d
ley was as dead as a door-nail.
cularly dead about a door-nail. I might have been inclined, m
ce of ironmongery in the trade. But the wisdom of our ancesto
 it, or the Country’s done for. You will therefore permit me
ley was as dead as a door-nail.
 he was dead? Of course he did. How could it be otherwise? Sc

A regex to do this needs to find a period followed by zero or more whitespace characters, single quotes, double quotes, or commas, all of which is then followed by a lowercase letter. Remembering that the period and the single quote mark must be escaped, this regex is reasonably easy to write down: /(\.[\s\'",]*[a-z])/, which is assigned to $target. One more change is required: in the second while loop, the letters after the regex must be changed from gi to just g, otherwise the matches are case insensitive, but now we want to detect lowercase letters. After making these changes, running program 2.7 produces output 2.9, which shows exactly one match.

Output 2.9 All instances of a period followed by whitespace followed by a lowercase letter in A Christmas Carol.

his domestic ball broke up. Mr. and Mrs. Fezziwig took their st

This match occurs starting with the period in Mr. and ends with the first letter of and, which is in lowercase. So there are apparently no abbreviations in the interior of a sentence in this story. However, there are clearly social titles since Mr. and Mrs. do exist.
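The behavior of this regex can be tried on a short string before running it on the whole novel. The sketch below is an illustration only: the subroutine name and the sample text are assumptions, and the regex is the one given above, stored in a single-quoted string so that the single quote inside the square brackets must be escaped.

```perl
use strict;
use warnings;

# Return each place where a period is followed (after optional whitespace,
# quotes, or commas) by a lowercase letter. No i is used, so case matters.
sub periods_before_lowercase {
    my ($text) = @_;
    my $target = '(\.[\s\'",]*[a-z])';
    my @found;
    while ($text =~ /$target/g) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample text for illustration.
my $text = 'He left early. Mr. and Mrs. Fezziwig stayed. "Go." she said.';
print "[$_]\n" for periods_before_lowercase($text);
```

Only the period in Mr. and the period inside the quotation are reported, since every other period is followed by a capital letter or by the end of the string.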

Changing $target to '([MD]rs?\.)' matches 3 common social titles. Making this change to program 2.7 now produces 45 lines, and the first 10 are given in output 2.10. Looking at the entire output, it turns out that there are no occurrences of Dr. While there are certainly other titles we might consider, familiarity with this story suggests that these should account for all of them.

Let us next look at the first 10 uses of either a question mark or an exclamation point by setting $target to '([?!])'. Recall that inside the square brackets, the question mark has no special meaning, so there is no need to escape it with a backslash here.

From output 2.11, we see that line 5 has an exclamation point followed by a lowercase letter, so this pattern needs to be checked by setting $target to '([?!][\s\'",]*[a-z])'. Now rerunning the program produces output 2.12, which shows that this construction happens 182 times, the first 10 of which are shown. Not surprisingly, most of these examples are nested sentences arising from conversations in the novel.

Output 2.10 First 10 instances of Mr., Mrs., or Dr. in A Christmas Carol.

the pleasure of addressing Mr. Scrooge, or Mr. Marley?"
addressing Mr. Scrooge, or Mr. Marley?"
                          "Mr. Marley has been dead these
estive season of the year, Mr. Scrooge," said the gentlem
s First of Exchange pay to Mr. Ebenezer Scrooge or his or
fty stomach-aches. In came Mrs. Fezziwig, one vast substan
ig stood out to dance with Mrs. Fezziwig. Top couple, too;
tch for them, and so would Mrs. Fezziwig. As to her, she w
 And when old Fezziwig and Mrs. Fezziwig had gone all thro
is domestic ball broke up. Mr. and Mrs. Fezziwig took the

Output 2.11 The first 10 instances of exclamation points or question marks in A Christmas Carol

image

Output 2.12 First 10 instances of exclamation points or question marks followed by a lowercase letter in A Christmas Carol.

image

Note that the first line of this output does not have the exclamation point at the end of a quote. By dropping the double quotes in the regex, all instances like this first line are found, and all of these are shown in output 2.13. As far as sentence segmentation goes, note that many of these lines are ambiguous. For example, line 5 can be rewritten as “Rise and walk with me!” or “Rise! And walk with me!” However, as in output 2.12, usually the lowercase letter after a question mark or exclamation point means that the sentence need not end at either of these. Adopting this as a rule even for output 2.13 also produces reasonable sentences, even though alternatives exist. Furthermore, this rule also applies to the period, although there is only one instance of this in the novel.

Output 2.13 All instances of exclamation points or question marks followed by a lowercase letter but not immediately followed by quote marks in A Christmas Carol.

image

Based on the above discussion, the following steps for sentence segmentation suggest themselves. First, for each case of Mr. or Mrs., remove the period. After this is done, assume that a sentence starts with a capital letter and consists of a string of symbols ending in a period, question mark, or exclamation point, but only if that mark is followed by another capital letter.

Code Sample 2.19 A first attempt to write a simple sentence segmentation program.

image

The initial attempt to write a sentence segmentation program is given in code sample 2.19, which applies the regex to $test. Looking at the regex, the part in parentheses, ([A-Z].*[.?!]), finds a capital letter, zero or more characters, and finally one of the three sentence-ending punctuation marks. Outside the parentheses, whitespace is matched up to another capital letter. Note that this regex requires that a sentence be followed by an uppercase letter, so just ending in a question mark, for example, is not sufficient by itself. Finally, each match is printed on a separate line. When run, however, this program produces exactly the one line given below, so there is exactly one sentence match, not three matches as expected.

What went wrong? To answer this, we need to know more about how matching occurs for a regex, which is the topic of the next section.

Output 2.14 Output of code sample 2.19, which did not run as expected.

Testing. one, two, three. Hello!
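The result in output 2.14 can be reproduced with a sketch along the following lines. This is not the listing of code sample 2.19: the subroutine name is hypothetical, and the value of $test is an assumption inferred from the three sentences discussed in the text.

```perl
use strict;
use warnings;

# First attempt: a capital letter, any characters, end punctuation,
# and then optional whitespace followed by another capital letter.
sub greedy_sentences {
    my ($test) = @_;
    my @found;
    while ($test =~ /([A-Z].*[.?!])\s*[A-Z]/g) {
        push @found, $1;
    }
    return @found;
}

# Assumed test string; the original value is not shown in the text.
my $test = "Testing. one, two, three. Hello! What?";
print "$_\n" for greedy_sentences($test);
```

Only one line is printed, and it runs from Testing through Hello! instead of stopping at the first sentence.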

2.6.3 Leftmost Greediness and Sentence Segmentation

All regexes start looking for matches at the leftmost character, and if this fails, then it tries the next character to the right, and if this fails, then it tries the next character, and so forth. So the regex /(10+)/ applied to the string “10010000” matches the first three characters, “100”, but not the last five.

Hence, the default for {m,n} is to find the first match going from left to right. If there is more than one match starting at the same location, the longest one is picked, which is called greediness. In particular, this applies to the three special cases of {m,n}, namely * (same as {0,}), + ({1,}), and ? ({0,1}). With this in mind, output 2.15 of code sample 2.20 is understandable because instead of matching the first word and period, the regex matches as much text as possible, which is the entire line. Note that the print statement produces one slash per iteration of the while loop. Therefore the number of slashes equals the number of sentences matched.

Code Sample 2.20 The regex matches as many characters as possible starting as far left as possible.

image

Output 2.15 Output of code sample 2.20. The single forward slash means that only one match is made.

Hello. Hello. Hello. Hello./
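Both the leftmost rule and greediness can be checked directly. In the sketch below, the subroutines first_match and all_matches are illustrative assumptions, and the greedy regex .+\. is a guess at the kind of pattern used in code sample 2.20, which is not reproduced here.

```perl
use strict;
use warnings;

# Return the first (leftmost) match of $regex in $string, or undef.
sub first_match {
    my ($string, $regex) = @_;
    return $string =~ /($regex)/ ? $1 : undef;
}

# Return every successive match of $regex in $string.
sub all_matches {
    my ($string, $regex) = @_;
    my @found;
    while ($string =~ /($regex)/g) {
        push @found, $1;
    }
    return @found;
}

print first_match("10010000", '10+'), "\n";             # leftmost match only
print join("/", all_matches("10010000", '10+')), "\n";  # all matches in order
print join("/", all_matches("Hello. Hello. Hello. Hello.", '.+\.')), "/\n";
```

The first call returns 100, and with repeated matching the later 10000 is found as well. The greedy pattern on the last line swallows all four sentences in a single match.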

For sentence segmentation, a greedy match is not wanted. In fact, the shortest substring is desired. Fortunately, it is easy to denote this pattern in the regex: just append the question mark to the repetition operators. That is, *?, +?, ??, and {m,n}? match as short a substring as possible. Thus making code sample 2.20 nongreedy is simple: just add a question mark after the plus, which creates code sample 2.21, which produces output 2.16. Now each Hello. is a separate match. Although there are other considerations, usually the heuristic that a regex is greedy is accurate.

There is another way to fix code sample 2.20. If sentences end only with periods (and no periods are used for abbreviations), then a sentence is precisely what is between two periods. A regex to match this is /[^.]*\./, which means search for as many nonperiods in a row as possible, then a period. Hence code sample 2.22 produces the same as output 2.16.
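The three variants can be compared by counting their matches on the same string. This is a hedged sketch, not the book's code samples: count_matches is a hypothetical helper, and the greedy and nongreedy patterns are assumed to use the plus operator.

```perl
use strict;
use warnings;

# Count how many times $regex matches $text.
sub count_matches {
    my ($text, $regex) = @_;
    my $count = 0;
    $count++ while $text =~ /$regex/g;
    return $count;
}

my $text = "Hello. Hello. Hello. Hello.";

# Greedy, nongreedy, and negated character class versions.
for my $regex ('.+\.', '.+?\.', '[^.]*\.') {
    printf "%-8s matches %d time(s)\n", $regex, count_matches($text, $regex);
}
```

The greedy pattern matches once, while the nongreedy pattern and the negated character class each match four times.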

Now we can fix code sample 2.19 using the nongreedy zero or more match, which is given in code sample 2.23. The results are in output 2.17, but there is still only one line, meaning that there is only one sentence detected, not three. However, unlike output 2.14, the output does give the first sentence correctly. Remember that the word after a period must be capitalized, so the first sentence does end with the word three. Before reading on, can you discover the cause of this problem?

Code Sample 2.21 The regex matches as few characters as possible since *? is nongreedy.

image

Output 2.16 Output of code sample 2.21. Four matches are found.

Hello./ Hello./ Hello./ Hello./

Code Sample 2.22 This regex searches for as many nonperiods as possible, then a period.

image

Code Sample 2.23 A second attempt using nongreedy matches to write a simple sentence segmentation program.

image

Output 2.17 Output of code sample 2.23, which still did not run as expected.

Testing. one, two, three.

There are, in fact, two problems in this code. One is that after the first match is found, the regex searches for another match starting at the character immediately after the prior match. Since the regex in this case searches for a capital letter following a sentence-ending punctuation mark, the start of the search for the second match occurs at the letter e in the sentence Hello! Hence the second sentence is not found. However, why is the third sentence What? not discovered?

Here the problem is different. Since the regex looks for a capital letter following a sentence-ending punctuation mark, and because there is no text after the third sentence, it is undetected. One way to fix this code requires two changes. First, we need a way to back up where a search starts after a match. Second, the last sentence in a string requires a different pattern than the other ones.

When Perl finds a match for a regex, it stores the position of the character following the match, which is obtainable by using the function pos. Moreover, the value of pos can be changed by the programmer, which starts the next search at this new value. Applying these two ideas to code sample 2.23 produces code sample 2.24. Note that the first line within the while loop uses a shortcut. In general, -= decreases the variable on the left by the value on the right. So the statement below decreases $y by one, then assigns this value to $x.

$x = $y -= 1;

In addition, the variable $loc is needed because when the while loop makes its last test, it fails (which ends the while loop’s execution), and it resets the pos function. Finally, the substr function with only two arguments returns the rest of the string starting at the position given in the second argument. That is, starting from the position in $loc, the remainder of $test is printed.
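The bookkeeping just described can be sketched as follows. This is an illustration under stated assumptions and not the listing of code sample 2.24: the subroutine name and the test string are hypothetical, and the position is backed up in two explicit steps instead of with the shortcut above.

```perl
use strict;
use warnings;

# Find sentences with a nongreedy regex, backing up one character after
# each match so that the capital letter just consumed can start the next one.
sub sentences_with_pos {
    my ($test) = @_;
    my @sentences;
    my $loc = 0;
    while ($test =~ /([A-Z].*?[.?!])\s*[A-Z]/g) {
        push @sentences, $1;
        $loc = pos($test) - 1;   # position of the capital letter
        pos($test) = $loc;       # the next search starts there
    }
    # The last sentence has no capital letter after it, so it is taken
    # as the rest of the string starting at $loc.
    push @sentences, substr($test, $loc);
    return @sentences;
}

# Assumed test string; the original value is not shown in the text.
my $test = "Testing. one, two, three. Hello! What?";
print "$_\n" for sentences_with_pos($test);
```

Saving the position in $loc inside the loop matters because the failed final test of the while loop resets pos.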

Code Sample 2.24 A third attempt using nongreedy matches and the function pos to write a simple sentence segmentation program.

image

Output 2.18 Output of code sample 2.24, which succeeds in finding all three sentences.

image

Although using pos provides a second way to do sentence segmentation, it means that the programmer is doing the bookkeeping, and this is best left to Perl itself, if possible. Fortunately, there is a third technique that finds sentences: the use of character negation in square brackets. If periods are not used for abbreviations, then sentences are the longest strings not containing end punctuation, which is denoted by [^.?!]. This does not take into account lowercase letters after the end punctuation, but that can be tested for. So this is the idea: use greedy matches to find the longest substrings that do not have end punctuation. If the next character after one of these substrings is uppercase, then it is a sentence. Otherwise save and combine it with the next substring without end punctuation. Repeat this process until either an uppercase letter is found or the end of the paragraph is reached.

Code sample 2.25 uses this approach, and its code has two constructions worth noting. First, the variable $buffer stores the substrings that conclude with end punctuation but are followed by a lowercase letter. The two statements below do exactly the same thing: they concatenate $match to $buffer. Note that if $match has not been assigned anything, then $buffer is unchanged.

$buffer .= $match;

$buffer = $buffer . $match;

Second, this code also introduces a new Perl variable, $'. When a regex matches a substring, it is stored in the variable $&. The string up to the match is assigned to $` (using a backquote), and the last part of the string is saved in $' (using a single quote). Hence the regex matched against $' is checking either if the next character is a capital letter or if the end of the string has been reached (not counting whitespace).

Code Sample 2.25 Sentence segmentation by character class negation.

image

Running code sample 2.25 produces output 2.19. Notice that the initial spaces before the sentences are retained, which is why the second and third sentence are indented.
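A sketch of the buffer technique is given below. It is not the listing of code sample 2.25: the subroutine name and the test string are assumptions, the text after the match is copied out of $' into an ordinary variable before testing it, and \z is used to stand for the end of the string.

```perl
use strict;
use warnings;

# Segment $test by finding the longest runs without end punctuation and
# joining runs whose end punctuation is followed by a lowercase letter.
sub sentences_by_negation {
    my ($test) = @_;
    my @sentences;
    my $buffer = "";
    while ($test =~ /([^.?!]*[.?!])/g) {
        my $match = $1;
        my $rest  = $';          # text after the match
        $buffer .= $match;
        # A sentence ends if a capital letter or the end of the string
        # comes next (ignoring whitespace).
        if ($rest =~ /^\s*(?:[A-Z]|\z)/) {
            push @sentences, $buffer;
            $buffer = "";
        }
    }
    push @sentences, $buffer if $buffer ne "";   # leftover text, if any
    return @sentences;
}

# Assumed test string; the original value is not shown in the text.
my $test = "Testing. one, two, three. Hello! What?";
print "$_\n" for sentences_by_negation($test);
```

As with output 2.19, the second and third sentences keep their leading blank, since whitespace is matched by the negated character class.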

Output 2.19 Output of code sample 2.25, which succeeds in finding all three sentences.

image

Now we have the tools to go back to sentence segmentation in A Christmas Carol. One complication in this novel (and any novel with dialog) is that quotation marks are used, and exclamation points and question marks can go either inside or outside these (see sections 5.20 and 5.28 of The Chicago Manual of Style [27] for the details). For example, if a person is quoted asking a question, then the question mark goes inside the quotation marks, but if a question is asked about what a person says, then the question mark goes outside. But including the possibility of quotation marks is easy: just place ["']{0,2} in the appropriate places in the regex, which is shown in regular expression 2.4. This is needed in case there are quotes within quotes, which does happen in A Christmas Carol. The heart of this regex is the character class [^.?!]*, which matches as many characters that are not end punctuation as possible. This is eventually followed by [.?!], so together these two pieces search for a substring having no end punctuation except at the end of the substring. Most of the rest of this regex is checking for possible quotation marks.

If regular expression 2.4 replaces the regex in the while statement of code sample 2.25, then quotation marks are taken into account. If we add a while loop that goes through A Christmas Carol paragraph by paragraph, then we have program 2.8, which performs sentence segmentation. This code has several features that are commented on below.

First, the default variable $_ is explicitly given to emphasize its role, although this is optional. Second, the beginning of the while loop has four simplifying substitutions. For example, the periods in Mr. and Mrs. are removed so that they are not mistaken for end punctuation. Third, the if statement includes an else clause. This if–then–else statement checks whether or not a capital letter follows the match found in the while statement. Hence the underlying structure is given in code sample 2.26.

Regular Expression 2.4 A regex that matches a substring up to end punctuation and that may contain either double or single quotes.

image
image

Program 2.8 A simple sentence segmentation program.

Running program 2.8 produces much output (hopefully all the sentences of A Christmas Carol). Visual inspection reveals that the program does a good job, but the results are not perfect. For example, one error is caused by the sentence in table 2.7. Try to figure out what went wrong: the solution is given in problem 2.7.

Code Sample 2.26 The underlying structure of program 2.8.

image

Table 2.7 Sentence segmentation by program 2.8 fails for this sentence.

But the great effect of the evening came after the Roast and Boiled, when the fiddler (an artful dog, mind! The sort of man who knew his business better than you or I could have told it him!) struck up "Sir Roger de Coverley."

Program 2.8 has been created with A Christmas Carol in mind, so applying it to other texts will probably require further modifications. For more robust sentence rules, see figure 4.1 of section 4.2.4 of the Foundations of Statistical Natural Language Processing [75]. However, there is another point of view: this program is only 28 lines long (counting every line, even the blank ones). Given experience with regexes, creating this code is not difficult, and the process of fine-tuning it helps the programmer understand the text itself, a worthwhile payoff. As long as a programmer is facing a homogeneous group of texts, this approach is fruitful. Analyzing a heterogeneous group of texts makes the programming challenge much harder.

Finally, in any programming language a given task can be done in several ways, and this is especially true in Perl. For another example of sentence segmentation using a different approach, see section 6.8 of Hammond’s Programming for Linguists [51], which employs regexes to create arrays by using the functions push and splice. In addition, we return to sentence segmentation in section 2.7.3 after introducing the idea of lookaround; that solution is the most elegant. Section 9.2.3 has one last approach to this task in Perl.

The next section introduces a few more Perl techniques for creating regexes. These examples highlight new programming ideas and syntax.

2.7 REGEX ODDS AND ENDS

This section goes over a few miscellaneous techniques. It is also a chance to review some of the earlier material discussed in this chapter.

However, there are techniques that we do not discuss, and an excellent book covering regexes in depth is Friedl’s Mastering Regular Expressions [47]. Although this book discusses several programming languages’ implementations of regexes, chapters 2 and 7 focus on Perl, which pops up in several other chapters, too. Conversely, almost all books on Perl have at least one chapter on regexes. Historically, Perl has been at the forefront of regexes as both have evolved over the years, and this co-evolution is likely to continue.

2.7.1 Match Variables and Backreferences

We have already seen the match variables $1, $2, and so forth, which store the substrings that match the parts of a regex inside parentheses. These can be nested as in code sample 2.27. This program examines a list of plural nouns, and the regex matches the last letter of the base word as well as the final -s or -es, if any. This is a simple example of lexical morphology, the study of the structure of words, and using a larger list of plural nouns would uncover the rules of plural forms. For information on these rules, see sections 523–532 of Practical English Usage [114]. Notice that the order of the variables $1, $2, $3, and $4 is determined by the order of the left parentheses. Hence $1 is the entire word; $2 is the singular form of the noun; $3 is the last letter before the addition of either -s or -es; and $4 is one of these two letter groups. The results are given in output 2.20. Note that the code fails for the word moose, which has an irregular plural form. Finally, see problem 2.8 for two more comments on this code sample.

Code Sample 2.27 Example of nested parentheses and the associated match variables.

image

Output 2.20 Output of code sample 2.27.

dogs, dog, g, s

cats, cat, t, s

wishes, wish, h, es

passes, pass, s, es

moose, moose, e,
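Code sample 2.27 is not reproduced here, so the following sketch shows one regex with nested parentheses that yields the lines of output 2.20. The regex is an assumption chosen to fit the description above, and it may differ from the one in the original code sample.

```perl
use strict;
use warnings;

my @plurals = qw(dogs cats wishes passes moose);
my @rows;
foreach my $word (@plurals) {
    # Groups are numbered by their left parentheses:
    # $1 = whole word, $2 = base word, $3 = last letter of base, $4 = ending.
    if ($word =~ /^((\w+?(\w))(es|s)?)$/) {
        my $ending = defined $4 ? $4 : "";   # $4 is undefined for moose
        push @rows, join(", ", $1, $2, $3, $ending);
    }
}
print "$_\n" for @rows;
```

The nongreedy \w+? keeps the base word as short as possible, which is what allows the optional ending group to claim -es in wishes and passes.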

Backreferences are related to match variables. While the latter allow the programmer to use matched text outside of the regex, backreferences allow it to be used inside the regex itself. For example, given text, let us find the words with doubled letters, that is, two letters in a row that are the same. Since \w stands for [a-zA-Z0-9_], the goal is to match a word character and then, immediately afterwards, match that same character again (so doubled digits and underscores also count). The backreferences \1, \2, ..., store the substrings that match the parts of a regex in parentheses. Hence, the regex /(\w)\1/ matches a double letter since \1 has the value of the previous character.

Code sample 2.28 tests this. The code breaks the string into words using the function split, and the results are stored in an array. Then the foreach loop tests each word in the array against this regex. The matches are then printed out, as shown in output 2.21.

Notice that testing $x against the regex informs the if statement whether or not there is a match. So a true or a false value is generated, which suggests that there is more going on in $x =~ /(\w)\1/ than meets the eye. These details are discussed in the next section.

Code Sample 2.28 Example of using the backreference \1 to detect double letters in words.

image

Output 2.21 Output of code sample 2.28

moose

Nell

911
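The steps described for code sample 2.28 can be sketched as follows. The test sentence and the array @doubled are assumptions made for illustration, so the words found differ slightly from output 2.21.

```perl
use strict;
use warnings;

# Hypothetical test sentence (not the one used in code sample 2.28).
my $text = "The moose called Nell dialed 911 at noon";

my @doubled;
foreach my $x (split /\s+/, $text) {
    # (\w) captures one word character; \1 requires that same character again.
    push @doubled, $x if $x =~ /(\w)\1/;
}
print "$_\n" for @doubled;
```

Because \w includes digits, 911 is reported along with moose, called, Nell, and noon.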

2.7.2 Regular Expression Operators and Their Output

Since $x =~ /$regex/ can be put into an if or while statement, both of which require logical values to operate, the match must produce a logical value. This raises the question of how Perl represents true and false.

Perl has only two types of scalar values, strings and numbers, and we have seen that Perl is flexible even with these. For example, $x = "3" + "4" assigns the number 7 to $x. Logical values can be either strings or numbers, and there are exactly seven values that are equivalent to the logical value false, as shown in table 2.8. Note that the empty parentheses stand for an array with no entries (see section 3.3 for information on arrays).

Table 2.8 Defining true and false in Perl.

Values                              Logical value
0, '0', "0", '', "", (), undef      false
All other numbers and strings       true

If matching a regex returns something, then this can be assigned to a variable, which can then be printed out. Code sample 2.29 does exactly this using dashes as delimiters. The output is just --1-, so $result1 has the empty string (since there is nothing between the first two dashes), and $result2 has the string or number 1.

Code Sample 2.29 Proof that matching a text to a regex produces either a logical true or false.

image
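Code sample 2.29 is not reproduced here, so the following is a minimal sketch of the same experiment. The test string and the two regexes are assumptions; only the idea of printing the results between dashes comes from the text.

```perl
use strict;
use warnings;

# Hypothetical test string containing no capital letters.
my $text = "a lowercase string";

my $result1 = ($text =~ /[A-Z]/);   # no match: the empty string (false)
my $result2 = ($text =~ /[a-z]/);   # match: 1 (true)

my $line = "-$result1-$result2-";
print "$line\n";
```

A failed match returns a defined but empty string, which is why nothing appears between the first two dashes.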

To see if a string does not match a regex is easily done by replacing =~ by !~. For example, the following is true for $text in code sample 2.29.

$text !~ /upper/i

There is an alternative way to write matching a regex, which is done by putting an m before the initial forward slash. That is, the two statements below are equivalent. This parallels the substitution notation, s///.

$x =~ /$regex/

$x =~ m/$regex/

We have seen two types of regex operators, matching and substitution, denoted m// and s///, respectively. Not only does the match operator return a value, so does substitution. For example, s/Mr\./Mr/g removes the periods in the abbreviation Mr. However, it also returns the number of substitutions performed. If none are performed, then the empty string is returned, which is equivalent to false. Remember that any positive number returned is equivalent to true, so s/// can be used in both if and while statements, just like m//. Hence, in code sample 2.30, $result has the value 2, which means that Mr. appears twice in $text. This sort of flexibility is common in Perl, which makes it fun to program in but can make the code harder to understand.

Code Sample 2.30 Example of the substitution operator returning the number of substitutions made.

$text = "Mr. Scrooge and Mr. Marley";
# s///g returns the number of substitutions made (here 2).
$result = ($text =~ s/Mr\./Mr/g);
print "$result";

Besides matching and substitution, there is one more regex operator, quote regex, denoted qr//, and this allows precompilation of a regex, which can make the program run quicker. The syntax is similar to assigning a string representing a regular expression to a variable. The operator qr// takes this one more step as seen in code sample 2.31, where two regexes are precompiled and stored into two variables: the first matches Mr. and Mrs., and the second matches a name. The output of code sample 2.31 is given below.

Mr. Dickens Mrs. Poe

Code Sample 2.31 Example of two quote regex operators.

$title = qr/(Mrs?\.)/;           # matches Mr. or Mrs.
$surname = qr/([A-Z][a-z]*)/;    # matches a capitalized name
$text = "Mr. Dickens and Mrs. Poe";
while ( $text =~ /$title $surname/g ) {
    print "$1 $2 ";
}

Finally, there is translation (or transliteration), denoted tr///, which does not allow regexes, but the structure is similar to the operators m//, s///, and qr//. See problem 2.9 for some discussion on this.

Now we discuss one last regex technique. This allows us to solve the sentence segmentation problem of A Christmas Carol in a new and better fashion.

2.7.3 Lookaround

Lookaround allows a regex to test whether or not a condition is true without affecting which characters are matched. For example, the word boundary, \b, is a location satisfying the condition that one side has a word character and the other side does not. This idea is also called a zero-width assertion. The concept of lookaround allows the programmer to test locations for more complex conditions.

Lookaround comes in four types. It can look ahead (forward in the text) or look behind (backward in the text), and it can search either for a regex (positive form) or for the negation of a regex (negative form). Hence, for any regular expression, call it $regex, there is positive lookahead with the syntax (?=$regex), as well as negative lookahead, (?!$regex). In addition, there is positive lookbehind, (?<=$regex), as well as negative lookbehind, (?<!$regex). We consider only two examples: a simple introductory one and positive lookahead for sentence segmentation.

First, in HTML, many tags come in pairs, which surround text. For example, bold font is indicated by the tags <B> and </B>. One way to match text inside these is to use lookbehind and lookahead to ensure that the tags exist, but these are not included in the match. This is done in code sample 2.32, which prints out the word think. Note that lookaround is not required: the regex /<B>(.*)<\/B>/ does the same task and is simpler.

Code Sample 2.32 An example of lookahead and lookbehind.

$test = "Don't even <B>think</B> it!";
# The slash in </B> is escaped so that it does not end the regex.
$test =~ /(?<=<B>)(.*)(?=<\/B>)/;
print "$1 ";
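As a side note to the two examples in the text, the four lookaround forms listed earlier in this section can be compared side by side. This is only a sketch: the test strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my $prices = "10 dollars, 20 euros";
my $codes  = "USD 10, EUR 20";

# Positive lookahead: a number followed by " dollars".
my ($ahead_pos)  = $prices =~ /(\d+)(?= dollars)/;
# Negative lookahead: a complete number not followed by " dollars".
my ($ahead_neg)  = $prices =~ /(\d+)\b(?! dollars)/;
# Positive lookbehind: a number preceded by "USD ".
my ($behind_pos) = $codes =~ /(?<=USD )(\d+)/;
# Negative lookbehind: a complete number not preceded by "USD ".
my ($behind_neg) = $codes =~ /(?<!USD )\b(\d+)/;

print "$ahead_pos $ahead_neg $behind_pos $behind_neg\n";
```

The word boundary \b in the two negative forms keeps the regex from settling for part of a number, such as the 1 in 10, which would otherwise satisfy the negative condition.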

The second example is yet another approach to sentence segmentation. Suppose that no periods are used for abbreviations (or that this type of period has been removed). Suppose a sentence is required to start with a capital letter. Then a sentence starts with [A-Z], followed by one or more occurrences of [^.?!]*[.?!], and ends in whitespace followed by either a capital letter or the end of the string. Although code sample 2.24 tests for this, it matches up to and including the capital letter. Lookahead can test for this capital letter, which is not included in the match.

Consider the regex in code sample 2.33. It breaks into two pieces: regexes 2.5 and 2.6. The former matches the sentence, which is stored in $1. The latter looks ahead for either the following capital letter or the end of the string.

Code Sample 2.33 A simple sentence matching regex using positive lookahead.

image

Regular Expression 2.5 First part of the regex in code sample 2.33.

([A-Z]([^.?!]*[.?!])+?)

Regular Expression 2.6 Second part of the regex in code sample 2.33.

(?=\s+[A-Z]|\w*$)

It is essential that the quantifier +? in regular expression 2.5 is not greedy. If it is changed to the greedy +, giving ([^.?!]*[.?!])+, then only one line of output is produced, which implies that only one sentence is found. Output 2.22 shows the correct, nongreedy results.

Output 2.22 Output of code sample 2.33.

Short. a test.

A test? a text?

No problem!
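Code sample 2.33 is not reproduced here, so the following sketch combines the two pieces described above into one program. The test string is an assumption chosen to fit output 2.22. The sketch also makes two small changes that are labeled in the comments: the inner group is non-capturing, and the lookahead uses \s*$ so that trailing whitespace is tolerated.

```perl
use strict;
use warnings;

# Hypothetical test string chosen to reproduce output 2.22.
my $text = "Short. a test. A test? a text? No problem!";

# Assumption: (?:...) replaces the inner capturing group, since only $1 is used.
my $sentence  = qr/([A-Z](?:[^.?!]*[.?!])+?)/;
# Assumption: \s*$ is used for the end of the string.
my $lookahead = qr/(?=\s+[A-Z]|\s*$)/;

my @sentences;
while ($text =~ /$sentence$lookahead/g) {
    push @sentences, $1;
}
print "$_\n" for @sentences;
```

Because the lookahead consumes no characters, the capital letter that confirms the end of one sentence is still available to start the next match.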

As is, this regex does not take into account quotation marks, but including these is not hard. The result is program 2.9. Note that the qr// construction has been used. Here it shortens the length of the regex, and it allows the programmer to label the two pieces of this regex in a more understandable fashion.

image

Program 2.9 Using lookahead to segment A Christmas Carol into sentences.

Again the code is not perfect when applied to A Christmas Carol, but the problem arises from the text: there are sentences that do not end in one of the three end punctuation marks. For example, at the end of the novel, after Mr. Scrooge gives a large sum of money to two gentlemen who are collecting funds for charity, one of the men replies, “I don’t know what to say to such munifi–” This is the end of the paragraph, and there is no end punctuation, so program 2.9 does not print this out.

This is almost the end of sentence segmentation in this book. It turns out that code to do this has already been written for Perl and all a programmer needs to do is download a certain package (called a Perl module). This is discussed in section 9.2.3.

This is a long chapter, yet much about Perl and regexes has been left out. So before moving on to chapter 3, the last section lists some Perl references for your reading pleasure. In addition, more advanced references for Perl are given in section 3.9.

2.8 REFERENCES

This section gives some introductory references for Perl. These represent only a small portion of the documentation on Perl, but they give the reader a place to start.

There are many books that introduce programming using Perl. Three good beginning books are Learning Perl by Randal Schwartz, Tom Phoenix, and brian d foy [109], Sams Teach Yourself Perl in 21 Days by Laura Lemay [71], and Perl 5 Interactive Course by Jon Orwant [83]. Finally, Programming for Linguists: Perl for Language Researchers by Michael Hammond [51] is a gentle introduction to both Perl and programming and is intended for people interested in natural languages.

All of the above books discuss regular expressions, but to learn much more about them, start with Andrew Watt’s Beginning Regular Expressions [124], where Perl is covered in chapter 26. Then try Mastering Regular Expressions by Jeffrey Friedl [47]. Chapter 7 covers Perl’s implementation in detail, and chapter 2 introduces regexes mostly using Perl, which also appears in a few other chapters. It gives the details on how regexes work, and how to optimize them. Chapter 2 of Daniel Jurafsky and James Martin’s Speech and Language Processing [64] covers regexes, and the book covers many topics on natural language processing and computational linguistics. Finally, there is a mathematical theory of regular expressions. If this interests the reader, try John Hopcroft and Jeffrey Ullman’s Introduction to Automata Theory, Languages and Computation [58].

Of course, the most up-to-date information on Perl is always online. Web sites change unpredictably, so only three of them are given here, all of which are by The Perl Foundation. Perl documentation is available at http://perldoc.perl.org/ [3]; and http://www.perl.org/ [45] maintains many great links for Perl. Third, the Comprehensive Perl Archive Network (known as CPAN) at http://cpan.perl.org [54] has numerous existing Perl programs for a vast number of applications, and all of these are free.

For more advanced references on Perl, see section 3.9. The next chapter describes Perl’s data structures. These are useful for many tasks, including counting the matches made by a regex.

PROBLEMS

2.1 One way to learn a programming language is to copy a piece of code, modify it, and then rerun it. Try this with some of the Perl code in this chapter. What happens if a semicolon is removed? Try modifying a regex to find out what it matches after it is changed. Try adjusting the arguments of a function, for example, index or substr. Be adventuresome!

2.2 Program 2.1 finds the lines in table 2.2 that match regular expression 2.2. For this problem, print out the lines that do not match, which can be done in at least two ways.

First, put the logical operator not in front of the regex in the if statement as shown below. Try this modification of this program and run the resulting code.

if ( not /^(1 ?)?(\d{3}) ?\d{3}-\d{4}$/ ) { print "$_"; }

Second, the two statements below are equivalent. The default variable is now explicit in the second one. Replacing =~ by !~ makes the expression inside the parentheses true only if there is no match. That is, !~ is the nonmatching regex operator. Again, try modifying program 2.1 in this way and run it.

if ( /^(1 ?)?(\d{3}) ?\d{3}-\d{4}$/ ) { print "$_"; }

if ( $_ =~ /^(1 ?)?(\d{3}) ?\d{3}-\d{4}$/ ) { print "$_"; }

2.3 As noted in section 2.2.2, the caret is used in two distinct ways in a regex. Outside square brackets, it stands for the start of a line, and when it is the first character inside square brackets, it means to match all characters except for those that follow. Some examples are given in code sample 2.34. Try to guess what each line of code prints out, and then check your guess by running this code in Perl.

Code Sample 2.34 The uses of the caret. This is for problem 2.3.

image

Finally, note that to match the caret outside the square brackets, it must be escaped with a backslash. However, to include a caret as a character inside square brackets, it does not require escaping, but it cannot be the first character.

2.4 In section 2.3.1, table 2.3 gives examples of regexes as well as strings that match each one. Sometimes, however, this is too inclusive, that is, too many matches are obtained. For example, if a researcher is looking for the word cat, then matching Cat, cats, and cat’s are probably all desired, but matching scatter or catastrophe are false positives.

It is useful to think about what delimits a target string, that is, what characters might begin or end this string. Is there punctuation? whitespace? XML tags? the end or beginning of a line? In addition, what forms of the string are desired?

For this problem, create a regex for the examples below, which represent different parts of speech.

a) Write a regex to find the noun rat. Remember to prevent matches like vituperation, but to allow Rat, rat’s, rats.

b) Write a regex to find the adjective old. Remember that adjectives have comparative and superlative forms, and do not forget about preventing words like golden from matching.

c) Write a regex to find the verb jump in all its forms (past tense, third person singular, and so forth). Remember to prevent matches like jumper, which is a noun.

d) Write a regex to find all the forms of the verb sit. This, unlike jump, is irregular. How does this change the task?

2.5 In section 2.4, the construction given below is used. It stops execution if the file does not open for any reason.

open(FILE, "filename.txt") or die("Message");

This problem discusses why this works. Recall that the statement A or B is true if either A is true or B is or both are. In particular, if A is true, then the status of B is irrelevant. For the Perl command, if open is successful, it returns the value true and then there is no need to evaluate the second part of the or statement. That is, there is no need to execute die. If the open statement fails, it returns false and then die is executed. So this command does what is desired: if the file opens, the program runs on; otherwise, the program is halted. Perl is famous for shortcuts like this, which is why having an advanced programming book on Perl is useful.

a) Try changing or to and to see what happens.

b) Try putting die first to see what happens.
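The short-circuit behavior described in problem 2.5 can be observed without halting the program by wrapping the statement in eval. This is a sketch under stated assumptions: the file name is hypothetical and is assumed not to exist, and the three-argument form of open with a lexical filehandle is used instead of the form shown above.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist.
my $path = "no_such_file_for_this_demo.txt";

# eval traps the die, so the program can report what happened.
my $opened = eval {
    # If open succeeds (true), the right side of "or" is never evaluated.
    # If open fails (false), die is executed.
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    close($fh);
    1;
};
my $error = $opened ? "" : $@;
print $opened ? "opened\n" : "halted: $error";
```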

2.6 In the discussion of program 2.7, it is noted that sprintf acts like print, but it allows formatting and produces a string as output. The function printf is like print except that it allows formatting. Code sample 2.35 shows a simple example of both functions.

a) Change the numbers in the double quotes to see how the output changes.

b) Look up other types of formats in a Perl book or online.

c) Try modifying program 2.7 by replacing the existing string construction for $extract with the sprintf function instead.
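As a starting point for this problem, here is a small sketch of the two functions. The values and format strings are assumptions made for illustration and are not those of code sample 2.35.

```perl
use strict;
use warnings;

my $pi = 3.14159;

# sprintf returns the formatted string instead of printing it.
my $fixed  = sprintf("%5.2f", $pi);     # width 5, two decimal places
my $padded = sprintf("%-6s|", "cat");   # left-justified in a field of width 6
my $zeros  = sprintf("%03d", 7);        # padded with zeros to width 3

# printf formats and prints in one step.
printf("%s %s %s\n", $fixed, $padded, $zeros);
```

Changing the numbers in the format strings, as part a) suggests, changes the field widths and the number of decimal places.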

Code Sample 2.35 Example of the functions printf and sprintf. For problem 2.6.

image

Output 2.23 Output of code sample 2.35.

image

2.7 Program 2.8 does make mistakes. One way to tell this is by a word count (doable in Perl or in a word processing program). Show that the output of this program has fewer words than the original story. As noted in the text, table 2.7 shows a sentence that is broken into two pieces by this program. Where is this sentence broken? Hint: the problem is due to nesting.

2.8 In code sample 2.27, a simple regex finds the letter just before the -s and -es in a small list of plural nouns. Whichever ending is appropriate is often determined by the last letter (or letters) of the noun. For example, nouns that end in -s generally have the plural form -es: alias becomes aliases; loss becomes losses; and sinus becomes sinuses. More complete rules are available in section 523 of Practical English Usage [114].

a) Create or find a larger list of plural nouns and use them as input into this code sample. What patterns do you see?

b) The regex in this code sample uses a nongreedy version of + (that is, +?). The greedy version gives different results: try to predict what it does and then test it using Perl.

2.9 Translation, tr///, is character-by-character substitution. Suppose a programmer wants to change all letters in a text to lowercase. One way to do this is by using the function lc. A second way is to specify a letter-by-letter translation with tr/A-Z/a-z/. Note that no letter g is needed since translation is inherently global. See code sample 2.36 for an example. The value returned by this operator is the number of translations made. If none are made, then the number 0 is returned. Like s///, this value can be used as a logical value, where true means one or more translations, and false means no translations. Hence, $result gets the number 4 because four capital letters are made into lowercase letters. Like substitution, translation can be used in if and while statements.

Code Sample 2.36 Example of the translation operator returning the number of translations made. For problem 2.9.

$text = "Mr. Scrooge and Mr. Marley";
# tr/// returns the number of characters translated (here 4).
$result = ($text =~ tr/A-Z/a-z/);
print "$result";

a) The Caesar cipher takes each letter and replaces it by the letter three places ahead where the alphabet is seen as cyclic (see section 1.1 of Abraham Sinkov’s Elementary Cryptanalysis [110] for a discussion). Hence, D replaces A, E replaces B,..., B replaces Y, and C replaces Z. Use tr/// to accomplish this.

b) DNA molecules are long and contain sequences of four bases, which are abbreviated as A, T, C, and G. DNA is double stranded, and the bases in one strand have the following relationship with the bases in the other: A and T always pair up, as do C and G. For example, given the fragment ATTTCTG, the other strand must be TAAAGAC. Try implementing this conversion in Perl using tr/ACGT/TGCA/. Note that the letter translations are all done in parallel.

Although the above DNA sequence is made up, there are vast amounts of real DNA sequences available at the National Center for Biotechnology Information (NCBI) via their Web page: http://www.ncbi.nlm.nih.gov/ [81]. For information on using Perl to analyze DNA, see the excellent book Perl for Exploring DNA by Mark LeBlanc and Betsey Dexter Dyer [70].

c) Use tr/// to count the number of vowels in Dickens’s A Christmas Carol. Assume that these are a, e, i, o, and u.

d) How hard is it to find vowels if y is included? Remember that it is not always a vowel, for example, it is a consonant in yellow.

2.10 This problem illustrates how Perl combined with a word list can be applied to word recreations. Fortunately, there are word game word lists that are in the public domain, and we use Grady Ward’s CROSSWD.TXT, which is one of the Moby Word Lists [123] available at Project Gutenberg. It contains all inflections; for example, nouns appear both in singular and plural forms; verbs appear in all their conjugated forms, and so forth.

Regexes find strings that have a certain pattern, and this is applicable to a word list. Program 2.10 shows a simple program that prints out all words that match a regex that is entered on the command line.

image

Program 2.10 Searching for words that match a regex. For problem 2.10.

The name of the file CROSSWD.TXT suggests one use: filling in words in a crossword puzzle. Here the length of any word is known, and if there are one or more letters known, so are their positions. This is also the situation in the game hangman. To solve any puzzle of these types, create a regex such that each unknown letter is represented by \w, and each known letter is put into the regex at its proper place. Finally, anchor the start and end of the word. Note that using Perl for word games is also discussed further in section 3.7.2 of this book.
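The crossword technique just described can be sketched with a small in-memory word list. This is not program 2.10: the five-word list stands in for CROSSWD.TXT, and the names @words, $pattern, and @hits are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the word list file.
my @words = qw(chipmunk shipment pump equipment program);

# Eight letters with p and m in the middle: three unknown letters,
# then pm, then three more unknown letters, anchored at both ends.
my $pattern = qr/^\w{3}pm\w{3}$/;

my @hits = grep { /$pattern/ } @words;
print "$_\n" for @hits;
```

Without the anchors, longer words such as equipment would also match, since they contain the pattern somewhere inside them.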

a) Find an eight-letter word where the middle two letters are p and m. By the above discussion, this corresponds to the following regex. Note the use of starting and ending anchors.

/^\w\w\wpm\w\w\w$/

It is possible to shorten this regex, for example, \w{3} can replace the three letters before and after pm. Try finding all such words using program 2.10. For example, chipmunk and shipment both match.

b) Find all four-letter words that start with p and end with m.

c) Code sample 2.28 shows how to find double letters. Generalize this to find triple letters, which are much rarer. Examples of words with three or more repeated letters in a row are given in section 32 of Ross Eckler’s Making the Alphabet Dance [41].

d) The above puzzles just scratch the surface. The book The Oxford A to Z of Word Games by Tony Augarde [4] lists numerous games, many of which have the goal of finding as many words as possible with certain patterns of letters. Often the length of the word is not specified, but this makes it even easier to write the regex.

For example, find all words that contain the letters pm (in that order) using /pm/ in program 2.10. As another example, find all the words that end in mp (in that order) using /mp$/, which has an ending anchor. Tony Augarde also wrote a book on the history of a selection of word games, which is called The Oxford Guide to Word Games [5]. Both of his books are enjoyable and informative.
