CHAPTER 6

CONCORDANCE LINES AND CORPUS LINGUISTICS

6.1 INTRODUCTION

A corpus (plural corpora) is a collection of texts that have been put together to research one or more aspects of language. This term is from the Latin and means body. Not surprisingly, corpus linguistics is the study of language using a corpus.

The idea of collecting language samples is old. For example, Samuel Johnson’s dictionary was the first in English to emphasize how words are used by supplying over 100,000 quotations (see the introduction of the abridged version edited by Lynch [61] for more details). Note that his dictionary is still in print. In fact, a complete digital facsimile of the first edition is available [62].

In the spirit of Samuel Johnson, a number of large corpora have been developed to support language references, for example, the Longman Dictionary of American English [74] or the Cambridge Grammar of English [26]. To analyze such corpora, this chapter creates concordances.

The next section introduces a few ideas of statistical sampling, and then considers how to apply these to text sampling. The rest of this chapter discusses examples of concordancing, which provide ample opportunity to apply the Perl programming techniques covered in the earlier chapters.

6.2 SAMPLING

Sampling replaces measuring all of the objects in a population with measuring those from a subset. Assuming that the sample is representative of the population, then estimates are computable along with their accuracy. Although taking a subset loses information, it also requires fewer resources to measure. For example, asking questions of a thousand adults in the United States is cheaper and faster than contacting all the adults across America. Hence, sampling is a trade-off between accuracy and costs. The next section introduces a few basic ideas of sampling, and while reading this, think about how these ideas might apply to text.

6.2.1 Statistical Survey Sampling

This section gives one example of statistical survey sampling, which is compared to text sampling in the next section. The former is a well-researched area and many methods of sample selection have been proposed and analyzed.

We consider this example: for an upcoming statewide election, the percentage of voters for Candidate X is desired. A survey is the usual way to answer this question.

The underlying idea of a survey is straightforward. Some unknown properties (called parameters) of a population are desired. Measuring all the members is too expensive (in time or money or both), so a sample of the population is taken. The members of this sample are measured, and these measurements are used to compute an estimate of the population parameters.

In this election survey, the population is all people who do, in fact, vote on election day. However, it is not known who these voters are prior to that time. Although there exist lists of registered voters, even if these were without error, they only indicate who can vote, not who will vote on election day. Let us call the actual voters in the upcoming election the target population, while the registered voters are the frame population. The percentage of target voters in favor of Candidate X over Candidate Y is the desired parameter, but we must sample the frame population to make an estimate.

A good sample is representative of the population. A variety of demographic variables such as gender and marital status have been measured by region, for example, by ZIP codes. Assume that these two variables are available for registered voters, say from a marketing firm. Then the sample can be selected to ensure that its demographics match those of the registered voters.

Consider the following sampling design. A subset of the registered voter lists (the frame population) is taken so that every person on it has an equal chance of selection, which is called a simple random sample. When these people are contacted, each person is asked if he or she plans to vote in the next presidential election. If the answer is no, then the questioning ends. Otherwise, three more questions are asked. First, “What is your gender?” Second, “Are you single, married, or divorced?” Third, “Who do you prefer: Candidate X or Candidate Y?” Quotas for gender and marital status are established prior to the survey, and once each category is filled, no more responses from people in that category are used in the analysis. For example, if the quota for single females were 250, then only the first 250 contacted are asked about the candidates. Once all the quotas are filled, then the percentage of people (in the sample) who said that they will vote for Candidate X on election day is computed, which is the estimate of the election day results.

The above sampling design is stratified. The strata are all the combinations of demographic variables, for example, single females, single males, married females, and so forth. In general, strata are picked because they are important in making the sample representative of the target population. In addition, the strata must not overlap, that is, a person is not a member of two strata at once, which is the case above.

The above description ignores many practical considerations. For example, what should be done about all the people picked from the frame population who are unreachable? In the question “Who do you prefer: Candidate X or Candidate Y?” does the order of the candidates matter? However, these issues are not pertinent to text sampling, so we ignore them. See problem 6.1 for another sampling design applicable to this situation. For a much more detailed discussion on statistical sampling, see Thompson’s book Sampling [115]. With this example in mind, let us consider text.

6.2.2 Text Sampling

Language comes in two forms: written and spoken. Speech is transcribable, but not without simplifications. For example, intonation, body language, and speed of speaking are generally not recorded but convey important information. To simplify matters a little, this book only analyzes written language.

Suppose we want to study a sample of written American English. As in statistical sampling, it should be representative of some form of language, but what is the target population? There are nonfiction and literary texts. The latter includes short stories and novels, and these come in a variety of genres including romance, detective, historical, fantasy, mysteries, children’s, and science fiction. Moreover, nonfiction texts cover an immense number of topics and come in many forms including books, magazines, newspapers, journal articles, Web pages, letters, emails, and pamphlets. Finally, all the texts have to be accessible to the researchers, which suggests using a library database to construct the frame population.

Although registered voter lists exist, the situation varies for texts. Most books have International Standard Book Numbers (ISBN), and most periodicals have International Standard Serial Numbers (ISSN). However, no such lists exist for emails, product catalogs, graffiti, and so forth. And there are texts between these two extremes, for instance, many Web pages are cataloged by search engines, but not the deep Web.

Since there is no hope of a complete list of all texts, the specification of the target population requires some thought. One simplification is to limit texts to those that are cataloged in some way. This underrepresents certain categories, but the alternative is to create one’s own list of unusual texts, which requires effort, and the idea of sampling is to decrease the amount of work.

The key question for a corpus is to decide exactly what it is sampling. Should it focus on a very narrow type of usage? For example, English is the official language of international air traffic control, but this type of English is limited to discussions about aviation. Another example is an English as a Second Language (ESL) program that creates a collection of student essays in order to study typical mistakes. In fact, this has been done, and for more information on such a corpus, see the Web page of the Cambridge Learner Corpus [24], which is part of the Cambridge International Corpus [23].

However, a broader focus is possible. For example, English is spoken in many countries, and each has its own peculiarities, so it makes sense to speak of American English, British English, Australian English, and so forth. In addition, it is both written and spoken. How a person talks depends upon what he or she is doing, for example, speaking to a friend in a bar is different from discussing work with one’s boss over the phone. These changes in use due to circumstances are called registers. Already this makes three ways to classify English (country, written/spoken and register), and a corpus can be made for any combination of these three, for example, American written English as it appears in newspaper business stories.

The divisions in the preceding paragraph are strata. That is, a researcher can stratify by countries, written/spoken language, registers, and so forth. This is a useful way to approach a corpus that is representative of many types of English. Such a corpus might first divide texts by country. Then for each country, written and spoken examples are collected, and each of these can be further broken down. The end result is a set of strata narrow enough in scope that it is possible to gather a representative collection of texts. If enough are found, then the result is representative of a wide variety of English.

There exist giant corpora that are representative of a large swath of Englishes, for example, the Cambridge International Corpus of English [23]. This combines other corpora owned by Cambridge University Press. For a detailed description of the construction of a corpus that is representative of American English, see the Brown Corpus Manual [46]. This explains many details, including the strata used and the sizes of each text sample.

Although survey sampling and text sampling have similarities, there is one important difference: many texts have copyrights. Researchers must request permission to use copyrighted texts in their corpus, but permission need not be given. For a discussion of this issue, see unit A9 of McEnery, Xiao, and Tono’s Corpus-Based Language Studies [78]. Note that the issue of copyright is why this book uses texts that are in the public domain, which explains the preponderance of pre–World War I literary works.

With the above discussion of both survey and text sampling in mind, we are ready to consider corpus linguistics. The next section introduces the first of several uses of corpora.

6.3 CORPUS AS BASELINE

A common statistical task is deciding whether or not a measurement is typical. For example, output 3.4 gives counts of the word lengths for the story “The Tell-Tale Heart.” On average, are these unusually long? unusually short? Without a standard of comparison, it is hard to say.

If a corpus is representative of some part of the English language, it can be used as a baseline for comparison. For example, the most ambitious to date are the giant corpora such as the Cambridge International Corpus [23], which have been built to study English as a whole. Texts that deviate from this kind of corpus are likely to be atypical in some way.

Unfortunately, even smaller, less ambitious corpora are labor intensive, and although many exist, few are in the public domain. For example, the Linguistic Data Consortium [73] has many available, but most of them cost money to obtain. Note that many older texts are in the public domain and are available online, so a researcher can create certain types of corpora without obtaining copyright permissions. For example, to create a corpus of 19th-century novels, many texts are available, so the frame population is clear. However, care is needed to define the target population as well as choosing texts that are representative of this.

This section compares three novels to a large, public domain corpus consisting of Enron emails released to the public by the Federal Energy Regulatory Commission during its investigation of the company [31]. This is called the EnronSent corpus, and it was prepared for public distribution by the CALO Project [76], created by SRI International’s Artificial Intelligence Center. This corpus is distributed as 45 files, each containing from 1.5 to 2.5 megabytes of text (uncompressed). However, the analysis here is based on just one file, enronsent00, for simplicity.

Although the novels and emails are both in English, clearly these types of texts have differences. A more appropriate corpus might be made from public domain literary novels of the 19th century, but as noted above, this requires time and effort. However, this comparison shows the use of a corpus as a baseline.

We compare three novels with the EnronSent corpus with respect to word frequencies. Since the latter is composed of emails, it is wise to check its contents for odd characters. Table 6.1 shows the frequencies for enronsent00, which are obtained from program 4.3. Note that the original output is changed from one column to three to save space.

Table 6.1 Character counts for the EnronSent email corpus.

space: 377706      v: 14816        ): 2310
e: 169596          k: 13619        (: 2244
t: 128570          ,: 12423        q: 1858
o: 114242          1: 8808         >: 1805
a: 112451          -: 8459         $: 1575
i: 102610          2: 8428         ": 1345
n: 100985          I: 5613         z: 1129
s: 89229           3: 5609         @: 867
r: 88811           =: 5592         <: 802
l: 61438           5: 4908         !: 711
h: 59689           *: 4907         +: 614
d: 50958           :: 4726         &: 512
c: 45425           j: 4375         ~: 501
u: 43439           x: 4222         %: 448
m: 38346           ‘: 4162         #: 419
p: 34210           9: 4035         [: 353
y: 31347           4: 3642         ]: 353
f: 29674           7: 3269         ;: 218
g: 29336           ?: 3250         ‘: 95
w: 27443           6: 2807         |: 88
.: 24821           tab: 2695       : 11
b: 21606           8: 2661         }: 2
0: 15875           _: 2600

Not surprisingly, there are some unusual characters, for example, 4907 asterisks and 1805 greater than signs. Using the regex concordance (program 2.7), it is easy to print out all the lines that match some pattern, which gives insight into how the special symbols are used.

In section 2.4, it is noted that the dash, hyphen, and apostrophe can cause complications. For the EnronSent corpus, we remove dashes, but leave all the hyphenated words in place. To decide what to do with the apostrophes, we examine them using program 2.7. Output 6.1 shows a selection of the 270 matches for the regex (\w'\W|\W'\w), which finds all the apostrophes at the beginning or the end of a word. Note the initial output line numbers, which start at 1.

Output 6.1 clearly shows a variety of uses of the apostrophe. For example, in lines 2 and 41, it is used to form a possessive noun, while in lines 5, 6, and 196, it forms contractions. In lines 7 and 87, single quotes are used to highlight a phrase, but in line 19 it is used to indicate length in feet. Notice that lines 75, 129, and 136 have typos, which is common in emails. In line 109, two apostrophes are used to make double quotes, as is true in line 157 except that the initial double quote is made with back-quotes. Finally, line 233 shows the back-quote, apostrophe pair to highlight a term.

Output 6.1 Selected concordance lines matching an apostrophe in the EnronSent corpus.


Since there are many patterns of typos to fix (like cant’ and who ‘s), and because changing nothin’ to nothin is still comprehensible, all apostrophes at both ends of a word are removed in the word counting program. This creates some ambiguities, such as changing 55’ to 55, but these are not that common (after all, there are only 270 total matches.)

To count the words in this corpus, all the nonalphanumeric characters shown in table 6.1 are removed, as are two single quotes in a row and single quotes either at the beginning or the end of a word. With these in mind, it is straightforward to rewrite program 3.3 and code sample 3.26 to create code sample 6.1.

The output from code sample 6.1 is quite long, but we focus on the 20 most frequent words, which are shown in the first column of table 6.2. These word frequencies are used as a baseline for comparisons, which assumes that similarities or differences from this corpus are of interest.

Table 6.2 Twenty most frequent words in the EnronSent email corpus, Dickens’s A Christmas Carol, London’s The Call of the Wild, and Shelley’s Frankenstein using code sample 6.1.


We compare the EnronSent corpus with the following literary works: Charles Dickens’s A Christmas Carol (compare with output 3.21), Jack London’s The Call of the Wild, and Mary Shelley’s Frankenstein. The results for these three novels have been combined in table 6.2.

Notice that there are both similarities and differences among the four columns. First, the same word is at the top of each column, the, which is the case for most texts. Second, the top five words in the first column are highly ranked in all four columns. These are common function words.

Third, the first column has is, are, and have, but these do not appear in the other three columns (although these words are in the three novels, their ranking is below 20). On the other hand, was, were, and had are in the last three columns but are not in the first. So there is a difference in verb tense of the auxiliary verbs to be and to have: the novels prefer the past, the emails prefer the present.

Fourth, character names appear in the second and the third columns. For example, Scrooge is the protagonist of A Christmas Carol as is Buck for The Call of the Wild. Not surprisingly, Scrooge does not appear in the corpus or the other two novels. However, buck is in the corpus (both as a name and as a word) as well as in A Christmas Carol (though not as a name). Nonetheless, it appears infrequently in these works.

Code Sample 6.1 Computes word frequencies for the EnronSent corpus.

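The original listing is not reproduced here, so the following is a minimal sketch of the kind of code the text describes: lowercase everything, remove dashes, doubled single quotes, and single quotes at the ends of words, then count what remains. The file name and the exact regexes are assumptions, not the book's code.

#!/usr/bin/perl
# A sketch, not the book's listing: word frequencies for one EnronSent file.
use strict;
use warnings;

my %freq;
open my $fh, '<', 'enronsent00' or die "Cannot open enronsent00: $!";
while (my $line = <$fh>) {
    $line = lc $line;
    $line =~ s/--+/ /g;        # remove dashes but keep hyphenated words
    $line =~ s/''+/ /g;        # two or more single quotes in a row
    $line =~ s/(?<!\w)'//g;    # single quote at the start of a word
    $line =~ s/'(?!\w)//g;     # single quote at the end of a word
    # Treat every remaining nonalphanumeric character (other than internal
    # apostrophes and hyphens) as a word separator.
    foreach my $word ( split /[^\w'-]+/, $line ) {
        ++$freq{$word} if $word =~ /\w/;
    }
}
close $fh;

# Print the words from most to least frequent, as in code sample 3.26.
foreach my $word ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    print "$word: $freq{$word}\n";
}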

Finally, the rate of use of I varies greatly among the columns. It is twelfth in A Christmas Carol, but 121st in The Call of the Wild. Both of these novels have narrators who are not characters in the stories they tell, so lower ranks are not surprising. However, Buck is a dog and does not use this pronoun. Checking with the concordance program, all the uses of I in The Call of the Wild appear in direct quotes where humans are speaking.

In contrast, I is third in Frankenstein, which is told in a series of first-person accounts. For example, Robert Walton writes letters to his sister of his expedition to the North Pole, during which he sees the monster and then meets Victor Frankenstein. The latter eventually tells Walton his tale of creating the monster that he is currently pursuing. This tale also includes other first-person narratives, for example, the monster’s story. All this explains the high frequencies of first-person pronouns.

Even though the above analysis only requires finding and counting words, these top 20 counts do reveal differences. That is, even though table 6.2 represents a tremendous loss of information, enough is left to draw interesting conclusions. Hence reductive techniques like word counting can be informative.

6.3.1 Function vs. Content Words in Dickens, London, and Shelley

Words can be split into two classes: function and content. The former are often frequent and provide grammatical information. Examples of this are the, to, and and, which are the three most frequent words in the EnronSent corpus. The latter provide content. For instance, the sentence “The prancing blue cat is on a snowboard” is evocative because of its four content words (cat, snowboard, blue, and prancing) and is grammatical because of its four function words (the, is, on, and a).

Function words are common in a stoplist, which is a collection of terms to ignore in an analysis. For example, the is seen to distort the angles between word count vectors in section 5.6. Since the serves a grammatical role, but has little meaning to contribute, decreasing its influence is reasonable, and removing it completely is the most extreme way of achieving this.

However, this distinction between function words and content words is not so clear-cut. For example, consider table 6.3, which contains eight examples of phrasal verbs using up, which is a common preposition and is called a particle when used this way. First, note that adding up changes the meaning of these verbs, for example, to throw up is much different than to throw. Second, the meaning of up ranges from literal to idiomatic. For example, to walk up a hill implies that a person has increased his or her elevation, but to screw something up is an idiomatic phrase. See section 235 of the Cambridge Grammar of English [26] for an explanation of the grammar of multiword verbs, which include phrasal verbs.

Table 6.3 Eight phrasal verbs using the preposition up.

Phrasal Verb    Rough Meaning
Screw up        Make a mistake
Shape up        Exercise
Wake up         Awaken
Shut up         Be quiet
Speak up        Talk louder
Throw up        Vomit
Walk up         Ambulate upward or toward
Pick up         Lift upward

In the last section, the pronouns I and my are informative. The narrative structure of the novel Frankenstein is reflected by the higher than expected proportion of these two pronouns. Hence, ignoring these words does lose information. For more on stoplists, see section 15.1.1 of Foundations of Statistical Natural Language Processing [75]. Also see section 9.2.2 for one way to obtain stoplists for a variety of natural languages using Perl.

The preceding section shows that examining the most frequent words in a text is informative, even when these are function words. Code sample 6.1 prints out the word frequencies from highest to lowest, and as the frequency decreases, the proportion of content words increases.

For example, in The Call of the Wild, the word dogs appears 111 times (ranking 33rd) and dog appears another 57 times. Since the novel is a narrative about the dog, Buck, this is not surprising. Furthermore, most of the story takes place outdoors in the far north of Canada, and this is reflected in the ranks of sled (58th), camp (71st), and trail (88th). In addition, the names of the important characters in the story appear frequently. For example, Buck is in the top 20, Thornton (44th) is the human Buck loves most before he joins the wolves at the end of the book, and Spitz (58th) dislikes Buck so much that the two of them fight it out to the death.

Hence, studying word frequency lists does give insight to a novel. However, considering how a word is used in a text provides additional information, which is easy to do by running a concordance program. This technique combines the computer’s ability to find text with a human’s ability to understand it, which is both simple and powerful and is the topic of the next section.

6.4 CONCORDANCING

Concordancing is also called Key Word In Context (KWIC). The goal is to find all instances of a regex, and print these out along with the surrounding text for the researcher to inspect. Section 2.5 discusses how this is done in Perl, which culminates with program 2.7.

We first use the concordance program to check the accuracy of the word frequencies found in the last section. For example, applying program 2.7 with the regex (buck) to The Call of the Wild results in 360 matches, which is 47 more than the 313 noted in table 6.2. Using (he), there are 817 matches compared to 814. Finally, using (the), both the table and the program report 2274 matches. This is a reminder that a programmer must be careful when using regexes. Try to think of what causes the counts to differ in these cases before reading the next paragraph.

Since the concordance counts are always at least as big as those in table 6.2, code sample 6.1 can miss instances of words. One possibility is that the programs treat capitalization differently. However, code sample 6.1 uses lc, so all letters are converted to lowercase, and program 2.7 uses the regex /\b$target\b/gi, which is case insensitive. Hence, both programs ignore capitalization, although they use different means to achieve this.

The answer is due to the boundary condition, \b. Putting this before and after the target only matches nonalphanumeric characters before and after it, and punctuation satisfies this condition. However, do both programs take into account all punctuation? Not quite: the apostrophe and hyphen match \b, but they are not removed in code sample 6.1.

It turns out that there are 47 Buck’s and 3 he’s in The Call of the Wild, which cause the discrepancies noted above. Another potential problem is hyphenated words because the hyphens match the boundary condition. Now we are ready to consider how to sort concordance lines, which is the topic of the next section.

6.4.1 Sorting Concordance Lines

The output of program 2.7 shows the matches in the order they appear in the input text. However, other arrangements can be useful.

We consider three kinds of orders in this section. First, since the text that matches a regex can, in general, produce a variety of strings, sorting these in alphabetical order can be interesting. For example, the regex (dogs?) matches both the singular and plural forms of dog, and sorting these means that all the concordance lines with dog come first, and then all the lines with dogs.

Second, one way to create table 6.3 is by looking at concordance lines that match the word up. Output 6.2 does exactly this, which finds many phrasal verbs, for example, wrap up, take up, and pent up. This output, however, would be easier to use if the lines were sorted alphabetically by the word immediately before up. Moreover, some of the verbs are not adjacent to up, for example, takin’ ‘m up and licked some up. Hence, sorting two or three words before up is also useful.

Output 6.2 First 10 concordance lines matching the word up in The Call of the Wild.

"You might wrap up the goods before you deli
struggle. "I’m takin’ ‘m up for the boss to ‘Frisco.
ere they keeping him pent up in this narrow crate? He
ur men entered and picked up the crate. More tormento
o long, and Buck crumpled up and went down, knocked ut
riously, then licked some up on his tongue. It bit li
the bone for three inches up and down. Forever after B
strils, and there, curled up under the snow in a snug
snarl he bounded straight up into the blinding day, the
in harness and swinging up the trail toward the Dyea

Third, the prepositions that go with a phrasal verb can be studied. For example, output 6.3 shows the first 10 matches of sprang in The Call of the Wild. This reveals several phrasal verbs: to spring at, to spring for, to spring to, and so forth. Now sorting the lines by the word right after the match is of interest. With these types of sorts in mind, we implement them in the next section.

Output 6.3 First 10 concordance lines matching the word sprang in The Call of the Wild.

breath. In quick rage he sprang at the man, who met him
h a kidnapped king. The man sprang for his throat, but Buck
times during the night he sprang to his feet when the shed
the first meal. As Buck sprang to punish him, the lash o
ething very like mud. He sprang back with a snort. More
ggled under his feet. He sprang back, bristling and snarl
e beast in him roared. He sprang upon Spitz with a fury wh
ing, and when the two men sprang among them with stout clu
, terrified into bravery, sprang through the savage circle
bristling with fear, then sprang straight for Buck. He had

6.4.1.1 Code for Sorting Concordance Lines

We start with sorting concordance lines by the strings that match the regex. Program 2.7 finds these lines, and only two additional ideas are needed. First, instead of immediately printing out the lines, store them in an array, say @lines. Second, once stored, use sort to order them.

Code sample 6.2 prints its output in the order the lines are found, and adding subroutines to this program enables the types of sorts discussed above. Compared to program 2.7, program 6.1 has several advantages. First, it allows the programmer to enter the regex and the radius on the command line. Second, each extract is padded with spaces so that the final string always has $radius characters both before and after the regex match, which makes sorting the concordance lines easier. Third, these are stored in the array @lines. Fourth, sorting is achieved by replacing (@lines) with (sort byOrder @lines) in the final foreach loop.

One detail to remember is that all characters have an order, which has two consequences. First, the order of words containing punctuation can be surprising. For example, don’t comes before done because the single quote comes before all the letters of the alphabet. This is solved by removing all the punctuation within the sort function (except that a dash adjacent to words is replaced by a space), which is done by removePunctuation in code sample 6.3. Note that the function lc is required because capital letters come before lowercase letters. Otherwise Zebra comes before aardvark.

Second, a function to order the concordance lines is required. Recall that the lines have exactly $radius characters before and after the match. Hence, given a line, the match is recoverable by removing these characters, which can be done by substr. This is the reason that program 6.1 adds spaces to each line if they are short due to paragraph boundaries. What remains is processed by removePunctuation and sorted by cmp. Code sample 6.4 does this.

Finally, changing the foreach loop of code sample 6.2 to the following reorders the output as discussed above.

foreach $x (sort byMatch @lines)

Combining code samples 6.2, 6.3, and 6.4 with program 6.1 accomplishes our original goal of sorting by regex matches. See problem 6.5 for some simple punctuation searches to try. The next section discusses some applications.


Program 6.1 Core concordance code for use with code samples 6.2, 6.3, 6.4, 6.7, 6.8, 6.9, and 6.10.
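Since the listing itself is not reproduced, here is a rough sketch of what such a core program might look like, based only on the description above. The command-line interface, the variable names ($target, $radius, @lines), and the padding details are assumptions.

#!/usr/bin/perl
# A sketch of a core concordance program in the spirit of program 6.1.
use strict;
use warnings;

my ($file, $target, $radius) = @ARGV;
die "Usage: concordance.pl file regex radius\n" unless defined $radius;

my @lines;    # the concordance lines, used by the code samples below

open my $fh, '<', $file or die "Cannot open $file: $!";
$/ = "";      # read paragraph by paragraph
while (my $text = <$fh>) {
    chomp $text;
    $text =~ s/\s+/ /g;        # flatten each paragraph onto one line
    # The target regex may include \b anchors if whole words are wanted.
    while ( $text =~ /$target/gi ) {
        my $match  = $&;
        my $before = substr $text, 0, $-[0];
        my $after  = substr $text, $+[0];
        $before = substr $before, -$radius if length($before) > $radius;
        $after  = substr $after, 0, $radius;
        # Pad with spaces so every stored line has exactly $radius
        # characters of context on each side, which makes sorting easier.
        push @lines, sprintf "%${radius}s%s%-${radius}s",
                             $before, $match, $after;
    }
}
close $fh;

# Printing and sorting are left to code samples 6.2, 6.3, 6.4, and 6.10.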

6.4.2 Application: Word Usage Differences between London and Shelley

Lexicography is the study of words. Recent dictionaries like the Longman Dictionary of American English [74] use proprietary corpora representative of written and spoken English. However, any corpus can be used to study the language it contains. For example, the EnronSent corpus gives insight into the register of corporate emails.

When studying usage, even though English has relatively few inflected forms, it does have some. For example, many nouns have singular and plural forms, and verbs have conjugated forms. Hence, to study a word often requires finding several forms, which are collectively called a lemma. Fortunately, finding multiple patterns is easy for a regex.

Code Sample 6.2 Code to print out the concordance lines found by program 6.1.

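The listing is not shown, but a plausible fragment, meant to be appended to the program 6.1 sketch above, is simply a loop over @lines; the numbering of the printed lines is an assumption.

# Print the concordance lines, numbered, in the order they were found.
my $count = 0;
foreach my $x (@lines) {
    printf "%4d %s\n", ++$count, $x;
}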

Code Sample 6.3 A function to remove punctuation, which is used in code sample 6.4.

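A sketch of what removePunctuation might look like, following the description above: dashes adjacent to words become spaces and all other punctuation is deleted. In these sketches, lowercasing with lc is left to the comparison functions, which is one reasonable reading of the text.

sub removePunctuation {
    my $string = $_[0];
    $string =~ s/--+/ /g;        # a dash adjacent to words becomes a space
    $string =~ s/[^\w\s]//g;     # delete the remaining punctuation
    return $string;
}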

Code Sample 6.4 A function to sort concordance lines by the strings that match the regex. This is used in program 6.1.

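A sketch of the ordering function: it recovers each match by trimming the $radius characters of context from both ends with substr, cleans and lowercases the result, and compares with cmp. The name byMatch comes from the foreach loop quoted earlier.

sub byMatch {
    my $matchA = lc( removePunctuation( substr($a, $radius, length($a) - 2 * $radius) ) );
    my $matchB = lc( removePunctuation( substr($b, $radius, length($b) - 2 * $radius) ) );
    return $matchA cmp $matchB;
}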

For the first example, we consider how the pronoun I is used in The Call of the Wild. This word ranks 121st most frequent, and it is that low because it is only used by the humans in the novel, not by Buck, the dog protagonist.

Although I has no inflected forms, it is often used in contractions. Hence the regex (i('\w+)?) is used, where the innermost parentheses match any potential contraction (as long as only one apostrophe is used). Output 6.4 gives the first 10 occurrences of I in addition to all its contractions. Note that the first line matched the roman numeral I, not the pronoun.

Although some of these lines are too short to know for sure, most of them include double quotes. In fact, the last 42 lines are direct quotations of human characters in the novel. Also, as promised, the matches are in alphabetical order, that is, all the occurrences of I are first, then all instances of I’ll, I’m, and I’ve, in that order.

Output 6.4 Representative lines containing I in The Call of the Wild.


It is easy to confirm that the usage of I in the EnronSent corpus is much different, which agrees with common sense. A person writing an email commonly uses the first-person pronoun, but without using quotation marks. Nonetheless, there are examples in this corpus where the person quoted uses I, for example, some emails include news reports that include first-person direct quotes. However, these are rare.

Hence, these two texts differ in the grammatical use of the pronoun I. The Call of the Wild always uses this word in direct quotations of the human characters. However, the EnronSent corpus rarely uses direct quotations.

This example, however, is somewhat crude. Concordance lines can be used to discover which meanings a word has in a text. To illustrate this, we compare the word body as it is used in the novels The Call of the Wild and Frankenstein by searching for matches to the regex (bod(y|ies)). Both texts use the word about the same number of times, but not in the same way.

The extracts in tables 6.4 and 6.5 are found by the concordance program, but are edited to fit on the page. The numbering does not start at 1 because the initial lines match the word bodies. In the former table, all the lines refer to Buck’s body except 11 (Perrault’s) and 14 (a primitive man’s). Note that all 10 lines refer to living bodies. The Call of the Wild is an adventure story of the dog, Buck, and the emphasis is on physical action. Although a few of the characters in the novel die, the focus is on what they did, not their corpses.

Table 6.4 First 10 lines containing the word body in The Call of the Wild.

5 an opening sufficient for the passage of Buck’s body.

6 he received a shock that checked his body and brought his teeth together

7 With drooping tail and shivering body, very forlorn indeed,

8 In a trice the heat from his body filled the confined space

9 The muscles of his whole body contracted spasmodically

10 blood carried it to the farthest reaches of his body

11 it fell each time across the hole made by his body,

12 his splendid body flashing forward, leap by leap,

13 and every hair on his body and drop of blood in his veins;

14 but on his body there was much hair.

Table 6.5 First 10 lines containing the word body in Frankenstein.

4 I commenced by inuring my body to hardship.

5 His limbs were nearly frozen, and his body dreadfully emaciated by fatigue

6 in a state of body and mind whose restoration

7 No word, no expression could body forth the kind of relation

8 I must also observe the natural decay and corruption of the human body.

9 renew life where death had apparently devoted the body to corruption.

10 for the sole purpose of infusing life into an inanimate body.

11 where the body of the murdered child had been afterwards found.

12 When shown the body, she fell into violent hysterics

13 If she had gone near the spot where his body lay,

In the second novel, however, the story revolves around Victor Frankenstein’s successful attempt to animate a dead body. Although his experiment succeeds, he abandons his monster, who through ill treatment grows to hate men, especially the man who created him. This leads to the monster systematically killing the friends and family of Victor until both of them die in the frozen wastelands of the arctic.

Unlike The Call of the Wild, where the word body is consistently used to refer to a living body doing something, the uses of body in table 6.5 are more diverse. Line 4 is written by Robert Walton about his preparations to explore the arctic. He refers to his body in a matter-of-fact way that is reminiscent of The Call of the Wild.

Lines 5 and 6 refer to Victor Frankenstein, who is discovered by Walton in the arctic. Line 7, however, uses body as a verb, not as a noun, which is a very different use of this word. And lines 8, 9, and 10 refer to Frankenstein’s research of animating the dead. In all three cases, the discussion is about abstract science, not physical deeds, and body is replaceable by corpse.

In lines 11, 12, and 13, all three uses of body refer to the corpse of William Frankenstein, Victor’s brother. The murderer is the monster, but he also planted evidence on Justine Moritz, who is accused, found guilty, and executed for the crime. Finally, some of the lines not shown introduce other meanings, for example, “the phenomena of the heavenly bodies ...,” which refers to astronomical bodies.

The differences in the usage of body in the two novels reflect the different genres of these stories. In The Call of the Wild, the emphasis is on Buck’s physical adventures as fate prepares him to answer the call of the wild. However, in Frankenstein, Victor pays the price for daring to create life. He sees his friends and family killed one by one until he himself dies at the end. While this story also has adventures (Walton’s arctic explorations, for instance), the focus is on the life and death of the body, not the physical deeds that it can perform.

In general, by looking at many examples of word usage in many texts, patterns are discovered, which are studied by lexicographers to create dictionary definitions. Although finishing a dictionary requires an immense amount of research, the spirit of this process is captured in the idea of concordancing.

6.4.3 Application: Word Morphology of Adverbs

The preceding section examines concordance lines to explore word usage, which is a task of corpus linguistics. This section shows that word forms are also amenable to this approach.

English is famous for irregular spelling. Nonetheless, there are still many patterns that can be written as a regex. For example, plural nouns often follow the pattern of adding either an -s or an -es to the singular form, and which one it is depends on the end of the word. For example, nouns that end in s usually form a plural by adding -es. However, there are irregular plurals, for example, it is “one mouse,” but “many mice.” See problem 2.8 for more on this.

This section finds adverbs formed from adjectives by adding -ly. For example, consider sentences 6.1 and 6.2. The former uses quick to modify the proper noun Scoot, while the latter uses quickly to modify the verb runs.

(6.1) Scoot is quick.

(6.2) Scoot runs quickly.

The above pattern does not cover all adverbs. In fact, sentence 6.3 uses quick as an adverb. However, even a rule that is applicable some of the time has value.

(6.3) Run here, Scoot, and be quick about it.

A simplistic idea is to search for all words that end in -ly, which is easy to do. Using the regex (\w+ly), and sorting the concordance lines alphabetically by the matches, produces output 6.5.

Output 6.5 First 10 lines in alphabetical order of words that end in -ly. Text is The Call of the Wild.


In this output, all the matches are adverbs. For example, in line 5, apparently modifies the adjective lifeless. However, looking at the entire output, there are matches that are not adverbs, for example, belly (a noun), Curly (a name of a dog), and silly (an adjective). Hence there are false positives. One idea is to find words ending in -ly such that removing this still results in a word: perhaps these are more likely to be adverbs. However, sometimes adding -ly changes the final letter, so investigating this possibility first is useful.

Using the Grady Ward CROSSWD.TXT word list, code sample 6.5 finds all the words that end in -ly such that when this is removed, the result is no longer a word.

Code Sample 6.5 Finding words that end in -ly that are not words when this ending is removed.

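A sketch of this search follows, assuming the word list CROSSWD.TXT has one word per line; the hash name is an invention, not the book's.

#!/usr/bin/perl
# A sketch: words ending in -ly whose stems are not themselves words.
use strict;
use warnings;

my %isWord;
open my $fh, '<', 'CROSSWD.TXT' or die "Cannot open CROSSWD.TXT: $!";
while (my $word = <$fh>) {
    chomp $word;
    $isWord{lc $word} = 1;
}
close $fh;

foreach my $word ( sort keys %isWord ) {
    # Keep $word if it ends in -ly and removing the -ly leaves a nonword.
    if ( $word =~ /^(\w+)ly$/ and not $isWord{$1} ) {
        print "$word ";
    }
}
print "\n";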

Running this code produces many words, which reveal several patterns. First, adjectives that end in -y change to -ily, hence happily is an adverb, yet happi is not a word. Second, there are words like seasonable and incredible where the -ble changes to a -bly. Third, words like automatic change to automatically. All these patterns can be removed with an if statement coupled with substr, which is done by code sample 6.6.

Running this produces output 6.6. It turns out that there are still word patterns that produce false positives, for example, words that end in -ll change to -lly, or words that end in -le change to -ly (which includes the -ble words noted above). It also turns out that there are gaps in the original word list, for example, analogous and deceptive are missing.

Note that the process of trying to find exceptions to a potentially useful rule reveals morphological patterns for some adverbs, although some unrelated words also appear, for example, anomaly, apply, and billy.

The next step is to remove the not in the if statement of code sample 6.5. This produces a long list of words, which must be examined to determine how many of them are adverbs. This is not done here, but see problem 6.6 for the first few words of output.

The above example shows that language is complex, and that a back-and-forth programming approach is helpful. That is, the researcher programs an idea, which is found to have exceptions. This program is revised, and then new exceptions might appear. However, in practice, this process usually ends after several iterations with code that has an acceptably low error rate. Moreover, this process itself teaches the programmer about the texts under analysis.

A final lesson is to realize that data sets often have errors. In the above case, two words that should not have been in output 6.6 did appear: analogously and deceptively.

Code Sample 6.6 Variant of code sample 6.5 that removes certain patterns of words.


Output 6.6 Initial 10 lines of output from code sample 6.6.

agly analogously anomaly antimonopoly apetaly aphylly apply assembly beauteously bialy bihourly billy biweekly biyearly blackfly blowfly botfly brambly bristly brolly bubbly buirdly bully burbly butterfly catchfly chilly cicely coly contumely crinkly cuddly cully dayfly deceptively deerfly dhooly dicycly dilly dimply dooly doubly doyly drizzly drolly dully duly duopoly eely emboly epiboly feebly felly firefly fly folly freckly frilly fully gadfly gallfly giggly gilly gingelly gingely girly glowfly goggly golly googly gorbelly greenfly grisly grizzly grumbly hepatomegaly hillbilly hilly holly

Unfortunately, errors are always a possibility. Even the act of finding and correcting these can introduce new errors. Instead of expecting no errors, it is wise for any data analyst to assume that errors exist and try to detect them.

For another example of morphology, see section 3.2 of Corpus Linguistics by Biber, Conrad, and Reppen [11]. That section studies proportions of nominalizations, which are classes of nouns that are formed by using particular endings. For example, happy is an adjective, and happiness is a noun formed by adding the suffix -ness (and changing y to i, which also happens with happily).

This reference also states how the authors approach their task. What computer tools do they use to study nouns ending in these suffixes? As they note on page 59 of Corpus Linguistics, although some concordance programs can do what they want, they find it easier to write their own computer code. These authors are linguists, not computer scientists, yet they realize the flexibility of programming is worth the effort.

In the next section we move from sorting the regex matches of concordance lines to sorting the words near these matches. This can be applied to collocations.

6.5 COLLOCATIONS AND CONCORDANCE LINES

Table 6.3 shows examples of verbs that change their meaning when used with up. For example, to throw your lunch into the sink means something completely different than to throw up your lunch into the sink. However, there are examples where the meaning of a phrase is solely determined by its constituent words. For example, a green phone is a phone that is green.

Certain words occur together more often than chance, and these are called collocations. For example, throw up is a collocation, as is blue moon, but green phone is not. Recall from section 4.3.1 that the probability of two independent words is the product of the probabilities of each word. However, the probability of a collocation is higher than this product, as stated below.

P(word1 word2 appearing together) > P(word1) × P(word2)

Note that the words in a collocation can be separated by other words, for example, he threw his hands up into the air, where his hands separates threw and up. In addition, there are some words that avoid each other. For example, people say brag about something but not brag up something nor brag something up.

For further information on collocations, see chapter 5 of Barnbrook’s Languages and Computers [9]. In addition, see section A10.2 of Corpus-Based Language Studies by McEnery et al. [78]. In the latter, on page 81, the authors note that native speaker intuition about collocations is not that reliable. Hence there is a need for analyzing language data.

Before we analyze collocations, the next section extends the sorting capabilities of the concordance program used above. It shows how to replace code sample 6.4 with functions that can sort the concordance lines by the words near the regex match instead of the regex match itself.

6.5.1 More Ways to Sort Concordance Lines

This section shows how to sort concordance lines by the words neighboring the regex matches. For example, suppose the regex matches up, then sorting by the words that appear immediately to the left produces output 6.7. This helps identify potential collocations.

Doing this requires a function that takes a concordance line and returns the word that is a specified distance away from the regex match. For example, in output 6.7, a function that returns the word to the left of up is used with sort to order the lines. This requires a few steps. First, take the characters up to but not including the match. Since $radius has the number of characters that come before (and after) the match, this is easy to do. Second, use removePunctuation given in code sample 6.3. Third, split the result into an array of words. Fourth, pick the last word of this array, which is the closest to up.

Output 6.7 Ten concordance lines from The Call of the Wild with the word up. These are sorted in alphabetical order of the word immediately to the left of up.

 1  heem chew dat Spitz all up an’ spit heem out on de
 2  across his shoulders and up his neck, till he whimpe
 3  at it again. He backed up Spitz with his whip, whi
 4  whip, while Buck backed up the remainder of the tea
 5  ond day saw them booming up the Yukon well on their
 6  was the pride that bore up Spitz and made him thras
 7  not like it, but he bore up well to the work, taking
 8  ed in to the bank bottom up, while Thornton, flung s
 9  boats against the break-up of the ice in the spring
10  eek bed, till he brought up against a high gravel ba

The general case is no more difficult. Let $ordinal be the number of words to go to the left. For example, setting $ordinal to 1 corresponds to output 6.7. Each of the four lines of Perl code in code sample 6.7 corresponds to one of the four steps in the last paragraph. Remember that a negative index counts from the end of the array, for example, $word[-1] is the last entry.

Code Sample 6.7 A function that returns a word to the left of the regex match in a concordance line.

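A sketch of onLeft is given below; it assumes the file-scoped $radius and $ordinal from the program 6.1 sketch and the removePunctuation sketch above. The book's version is described as four lines, so the guard against short lines here is an addition.

sub onLeft {
    my $before = removePunctuation( substr($_[0], 0, $radius) );
    my @word   = split ' ', $before;     # words to the left of the match
    return '' if @word < $ordinal;       # guard against too few words
    return $word[-$ordinal];             # $word[-1] is the nearest word
}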

With the function onLeft, defining an order is easy. Use it to select the word, which is changed to lowercase, then sort these using cmp. This is done in code sample 6.8.

Code Sample 6.8 An ordering for sort. It uses the function onLeft defined in code sample 6.7.

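A sketch of the corresponding ordering; only the subroutine name byLeftWords comes from the text.

sub byLeftWords {
    return lc( onLeft($a) ) cmp lc( onLeft($b) );
}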

The same logic applies to sorting on words to the right of the match. The function onRight and the ordering byRightWords are both shown in code sample 6.9.

With program 6.1 and the above code samples, there are only two details left. First, code sample 6.10 prints out the results. Second, set $ordinal to $ARGV[2].

Code Sample 6.9 Subroutines to sort concordance lines by a word to the right of the match.

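A sketch of the right-hand counterparts, mirroring onLeft and byLeftWords; the details are assumptions.

sub onRight {
    my $after = removePunctuation( substr($_[0], -$radius) );
    my @word  = split ' ', $after;       # words to the right of the match
    return '' if @word < $ordinal;
    return $word[$ordinal - 1];          # $word[0] is the nearest word
}

sub byRightWords {
    return lc( onRight($a) ) cmp lc( onRight($b) );
}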

Code Sample 6.10 Commands to print out the sorted concordance lines.

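A guess at the printing commands: one loop for each ordering, numbered as in output 6.7.

# Sorted by the word to the left of the match.
my $count = 0;
foreach my $x ( sort byLeftWords @lines ) {
    printf "%4d %s\n", ++$count, $x;
}

# Sorted by the word to the right of the match.
$count = 0;
foreach my $x ( sort byRightWords @lines ) {
    printf "%4d %s\n", ++$count, $x;
}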

By putting code samples 6.7, 6.8, 6.9, and 6.10 together with program 6.1, we have a concordance program able to sort by words near the regex match. This produces output 6.7. The next section shows some applications.

6.5.2 Application: Phrasal Verbs in The Call of the Wild

Let us apply the programming of the last section to phrasal verbs. Our first task is to recreate output 6.7, and all the pieces are in place.

Start with program 6.1 and add removePunctuation defined in code sample 6.3. Second, add code sample 6.7, which defines the subroutine onLeft. This is used by the subroutine byLeftWords in code sample 6.8, which defines the ordering of the concordance lines. Finally, the first foreach loop of code sample 6.10 prints out the sorted results.

Combining the above parts into one file and running it produces output 6.7. There are 104 total lines, so it is convenient to summarize these by listing the words that appear immediately to the left of up, along with their frequencies. By using code sample 3.26, these frequencies are sorted in descending order. Putting code sample 6.11 at the end of the code described in the preceding paragraph produces output 6.8 (words appearing once are not shown.)

Code Sample 6.11 Code to print out the sorted frequencies of words to the left of the match. This uses code sample 3.26.

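A sketch of the frequency summary, assuming $ordinal is 1 so that onLeft returns the word immediately to the left; the hash name is an invention.

my %leftFreq;
foreach my $x (@lines) {
    ++$leftFreq{ lc( onLeft($x) ) };
}
# Print in descending order of frequency, as in code sample 3.26.
foreach my $word ( sort { $leftFreq{$b} <=> $leftFreq{$a} } keys %leftFreq ) {
    print "$word, $leftFreq{$word}\n";
}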

Output 6.8 Frequencies of the words immediately to the left of up in The Call of the Wild.

get, 8

him, 4

harness-, 3

straight, 3

went, 3

backed, 2

bore, 2

curled, 2

it, 2

made, 2

picked, 2

team, 2

took, 2

This output contains more than verbs. Looking at the concordance lines reveals the reason for this. For example, lines 41, 43, and 44 are all phrasal verbs where him is between the verb and the preposition. However, in line 42 up is part of the prepositional phrase up to Buck as shown in the quote of the entire sentence.

At another time Spitz went through, dragging the whole team after him up to Buck, who strained backward with all his strength, his fore paws on the slippery edge and the ice quivering and snapping all around.

While sorting on the word immediately to the left of up is certainly helpful in finding phrasal verbs, intervening words cause problems. Even if words were labeled by their part of speech (see section 9.2.4 for a way to do this), finding the first verb to the left can fail as shown in the above sentence.

Output 6.9 The four concordance lines with him up, which are counted in output 6.8.

41  ancois’s whip backed him up, Buck found it to be che
42  the whole team after him up to Buck, who strained ba
43  s. Francois followed him up, whereupon he again retr
44  rior weight, and cut him up till he ceased snapping

Clearly, creating a fully automated phrasal verb finder is difficult. However, by combining a concordance program with a human, shorter texts like a novel are analyzable.

Phrasal verbs can be studied by analyzing the words that come after a verb. In fact, there are dictionaries that do this such as NTC’s Dictionary of Phrasal Verbs and Other Idiomatic Verbal Phrases [111]. The subroutines onRight and byRightWords in place of onLeft and byLeftWords, respectively, perform this task.

As an example, output 6.3 is redone so that the lines are sorted into alphabetical order by the first word after the verb sprang. The complete results are in output 6.10.

Output 6.10 Concordance lines sorted by the word after sprang in The Call of the Wild.


By inspection, upon is the most common preposition after sprang (lines 21-26), followed by sprang to (lines 15-18). As noted above, a collocation should occur more often than chance predicts. For example, equation 6.4 should hold for the suspected collocation sprang upon.

(6.4) P(sprang upon) > P(sprang) × P(upon)

The output shows sprang occurs 26 times, and it is easy to check with the concordance program that upon occurs 79 times. The total number of words in the novel is 31,830, so by equation 4.2, equation 6.5 shows that sprang upon is a collocation because the left-hand side equals 0.00022, which is much bigger than the right-hand side of 0.00000203.

(6.5) P(sprang upon) = 0.00022 > P(sprang) × P(upon) = (26/31,830) × (79/31,830) = 0.00000203

Although sprang upon is a collocation for The Call of the Wild, this may or may not be true for other texts. For example, the same analysis for Frankenstein gives equation 6.6. Here the left-hand side is zero, so it is certainly not bigger than the right-hand side. So for this text, sprang upon is not a collocation. In retrospect, it is not a surprise that a story about the adventures of a dog contains many scenes of springing upon objects.

(6.6) P(sprang upon) = 0, which cannot exceed P(sprang) × P(upon)

Unfortunately, the zero count of sprang upon in Frankenstein is all too common. The next section comments on this phenomenon.

6.5.3 Grouping Words: Colors in The Call of the Wild

Counts from texts can vary by large amounts. For example, output 4.2 counts letters in A Christmas Carol, and e occurs 14,869 times while z appears only 84. Zipf’s law (see section 3.7.1) shows the same phenomenon with words, for example, in The Call of the Wild, the appears 2274 times, but about half the words appear exactly once.

Unfortunately, analyzing linguistic objects two at a time when many of them are rare often produces zero counts, which are hard to work with. For example, program 4.4 analyzes letter bigrams, and finds that roughly a third of the possibilities do not appear at all.

If letter bigrams are rare, then the situation is only worse with more complex combinations. What can a researcher do? This section considers analyzing groups of words together. For example, this includes lemmas, each of which consists of all the inflected forms of a word.

But grouping need not stop here. Instead of considering just one lemma, create a group of related lemmas. For example, instead of analyzing the use of the adjective red (along with the comparative forms redder and reddest), analyze color words as a group, which is done below with The Call of the Wild.

To do this, a list of such words is needed, but a thesaurus has this type of information, and both the Moby Thesaurus [122] and Roget’s Thesaurus [107] exist online and are in the public domain.

Looking at sections 430-439 in Roget’s Thesaurus [107], numerous color words are listed. Selecting the more common ones produces the following list: white, black, gray, silver, brown, red, pink, green, blue, yellow, purple, violet, and orange. One group of these words has the form noun-colored, for example, sulfur-colored. Some of the words have multiple meanings and are left out, for example, yolk. To find each word of a group, alternation works, as shown below.

(white|black|red|...|orange)

Finally, running the concordance program produces output 6.11, which shows the first 12 lines out of a total of 61. The most common color is white with 21 instances, then red with 14 instances. Hence this strategy worked: the 61 matches for the group are roughly three times as frequent as even the most common individual color in The Call of the Wild.

Output 6.11 Color words in The Call of the Wild. The first 12 lines out of 61 are shown.

 1  ult and turned over to a black-faced giant called Franc
 2  emonstrative, was a huge black dog, half bloodhound and
 3  ensions were realized. “Black” Burton, a man evil-temp
 4  l wood moss, or into the black soil where long grasses
 5  stream he killed a large black bear, blinded by the mos
 6  eir backs, rafted across blue mountain lakes, and desc
 7  ious. But for the stray brown on his muzzle and above
 8  re seen with splashes of brown on head and muzzle, and
 9  a middle-aged, lightish-colored man, with weak and water
10  the dark, and the first gray of dawn found them hitti
11  ne only he saw,--a sleek gray fellow, flattened agains
12  low, flattened against a gray dead limb so that he see

However, words are not the only lexical items that are analyzable with this technique. As noted in section 6.4.3, morphological structures are detectable by regexes, for example, analyzing adverbs ending in -ly as done earlier in this chapter. But there are many other grammatical forms, for example, gerunds, which are verb forms ending in -ing that can be used as nouns or adjectives, such as running is an exercise or there is a running cat.

Furthermore, there is a Perl module that can identify parts of speech, which is discussed in section 9.2.4. Using this, patterns involving types of words are possible, for example, finding two adjectives in a row.

The next section gives a number of references, which contain many examples. These provide many further ideas well suited for programming in Perl.

6.6 APPLICATIONS WITH REFERENCES

Much work has been done in corpus linguistics, and there are many examples in the academic literature of using concordancing techniques to analyze one or more texts. This section lists a few of these for the interested reader to pursue.

First, section 2.6 of Douglas Biber, Susan Conrad, and Randi Reppen’s Corpus Linguistics [11] compares words that seem synonymous. This is essential for the lexicographer who needs to find all the different shades of meaning, and it can be used to investigate grammar, as well as how grammar interacts with words.

Specifically, section 2.6 analyzes the adjectives big, large, and great. Grammatically, these seem interchangeable; however, there are a variety of restrictions on how they are used. This analysis by Biber et al. has similarities with section 106 of Practical English Usage [114], and perhaps there is a connection between the two.

Section 3.2 of Biber et al. [11] gives another example of how concordancing gives insight into language. The authors study the distributional properties of nominalizations among registers. These are a way to form new nouns by adding an ending to existing words. For example, nominalization is a noun formed from the verb nominalize by adding -tion (plus a vowel change). As long as there are a reasonable number of well-defined patterns, finding these can be done by regexes.
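As a rough illustration, the following sketch counts words ending in a few common nominalizing suffixes; the suffix list and the command-line file name are assumptions, and words like nation or station show that false positives occur.

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my %count;
while (<FILE>) {
    # Tally words ending in -tion, -sion, -ment, -ness, or -ity.
    while ( /\b([a-z]+(?:tion|sion|ment|ness|ity))\b/gi ) {
        ++$count{lc $1};
    }
}
# Print the candidates in decreasing order of frequency.
foreach my $word ( sort { $count{$b} <=> $count{$a} } keys %count ) {
    print "$word $count{$word}\n";
}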

In general, Biber et al. [11] is an excellent book for anyone interested in quantitative analysis of corpora. It has many practical examples and is quite readable. Another great book is Corpora in Applied Linguistics by Susan Hunston [59]. She gives numerous interesting examples of searching corpora for a variety of word patterns. For example, she defines a frame as three words in a row where the first and third are fixed, but the second is arbitrary. She lists several examples on page 49. But converting this pattern to a regex is straightforward, as shown in the example below (which ignores punctuation).

$word1 (\w+) $word2
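A minimal working sketch of this frame search is given below; the frame words the and of, the command-line file name, and the tally of middle words are assumptions made for illustration.

my ($word1, $word2) = ("the", "of");
open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my %middle;
while (<FILE>) {
    # Find every "the ... of" frame and tally the middle word.
    while ( /\b$word1 (\w+) $word2\b/gi ) {
        ++$middle{lc $1};
    }
}
foreach my $word ( sort { $middle{$b} <=> $middle{$a} } keys %middle ) {
    print "$word1 $word $word2: $middle{$word}\n";
}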

Language teaching is another application of concordancing. At present, many people across the world want to take English as a Second Language (ESL) classes. This has increased the need for language references designed for nonnative speakers. In particular, learner dictionaries specifically for ESL students have been influenced by corpus linguistics. One key idea is ranking words in the order of frequency, which has been done several times in this book, for example, see table 6.2 or section 3.7.1.

Giant corpora have been developed to study language. The Cambridge International Corpus (CIC) is used by the language reference books published by Cambridge University Press [25]. An example of a dictionary using frequency information from the CIC is the Cambridge Advanced Learner’s Dictionary (CALD) [121]. This book has three labels to indicate broadly how common a word is; for example, a word labeled essential usually has a rate of more than 40 per million words.

Unfortunately, the CIC is proprietary, but a frequency analysis is done easily for any text available in electronic form. For example, if an ESL student wants to read The Call of the Wild, then table 6.2 is useful.
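A rough sketch of such a frequency analysis follows; the crude tokenization and the command-line file name are assumptions, and the programs earlier in this book handle the details more carefully. Reporting rates per million words makes the counts comparable with labels such as the CALD’s.

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my (%freq, $total);
while (<FILE>) {
    # Lowercase the line, split on nonletters, and tally each word.
    foreach my $word ( split /[^a-z']+/, lc $_ ) {
        next unless length $word;
        ++$freq{$word};
        ++$total;
    }
}
# Print each word with its count and its rate per million words.
foreach my $word ( sort { $freq{$b} <=> $freq{$a} } keys %freq ) {
    printf "%s %d %.1f\n", $word, $freq{$word}, 1e6*$freq{$word}/$total;
}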

For more on using corpus linguistics for teaching, see chapters 6, 7, and 8 of Hunston [59]. These give concrete examples of how to apply this to teaching. The use of corpora in two languages is also discussed, for example, the French-English parallel corpus of Le Petit Prince/The Little Prince (a French novel by Antoine de Saint-Exupéry).

An excellent book on language instruction is I. S. P. Nation’s Learning Vocabulary in Another Language [80]. The focus of this book is how to teach vocabulary, and right away it addresses quantitative questions involving word frequencies. For example, chapter 1 starts off discussing how many words a student should know. Implicit in this question is the idea that a core vocabulary needs to include the highest frequency words. Finally, this book has many ideas of interest to the corpus linguist; for example, chapter 7 discusses principal components analysis.

Finally, stylometry is the analysis of one or more texts to determine quantitative measures indicative of the author’s style. Although literary critics and historians have long done this by analyzing historical evidence along with close readings of the texts in question, stylometry is usually associated with computer analyses that look at a large number of variables. These are split into two types: first, textual features that are probably unconscious habits of the author; second, features which the author consciously tries to manipulate.

Some researchers prefer working with the first type. For example, an author probably does not consciously think about his or her use of function words. So the rate of usage of up probably reflects a writer’s habitual style.

A famous example in the statistical literature is Frederick Mosteller and David Wallace’s book Applied Bayesian and Classical Inference: The Case of the Federalist Papers [79]. Here the authors try to determine who wrote 12 texts among the Federalist Papers whose authorship was disputed at the time. The Federalist Papers are a series of newspaper articles arguing that New York should ratify the United States Constitution. This makes a great example because it is acknowledged that either James Madison or Alexander Hamilton wrote these 12 papers, so an analysis only needs to decide between these two. Moreover, both have many other written documents that are available for determining the writing style of each.

Mosteller and Wallace’s analysis is detailed and thoughtful, and much attention is given to checking the assumptions of the statistical models used. The details are best left to their book, but note that much of their analysis focuses on finding function words that do a good job of distinguishing between Madison and Hamilton. Since the Federalist Papers are a joint effort (with John Jay), each author might have consciously tried to write in a common style. Hence it is important to focus on stylistic markers that were probably unconscious habits.

However, to a human reader, this type of analysis is not satisfying. For example, a person asked why he or she enjoys The Call of the Wild does not mention the usage of the preposition up, but instead discusses the exciting plot, how dogs think, and so forth.

Perhaps to reflect the reader’s perspective, some researchers have measured textual characteristics that an author is certainly aware of and perhaps is trying to manipulate consciously. We consider one example of this.

In 1939, G. Udny Yule published an analysis of sentence lengths in Biometrika [128], using them to distinguish between a pair of writers. Like Mosteller and Wallace, Yule was a statistician, so his analysis discusses the statistical issues involved. Finally, he applied his analysis to two texts with disputed authorship: De Imitatione Christi and Observations upon the Bills of Mortality. Sentence lengths are clearly something an author does think about, and it is believable that some writers might prefer different length sentences, on average.
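As a rough illustration of this kind of measurement, the sketch below computes the average sentence length of a text in words; splitting on end punctuation is a simplification that mishandles abbreviations, and the command-line file name is an assumption.

open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my $text = join " ", <FILE>;
# Splitting on . ! ? is crude: abbreviations like Mr. break it.
my @sentences = split /[.!?]+/, $text;
my ($total, $count) = (0, 0);
foreach my $sentence (@sentences) {
    my @words = ( $sentence =~ /[a-zA-Z']+/g );
    next unless @words;
    $total += @words;
    ++$count;
}
printf "%d sentences, average length %.2f words\n", $count, $total/$count if $count;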

On the whole, although a number of stylometric analyses have been done, the methods in print tend not to generalize well to other texts. For a great overview of this topic along with numerous examples, see Chapter 5 (“Literary Detective Work”) of Michael Oakes’ Statistics for Corpus Linguistics [82]. Also see the articles by Binongo and Smith [16] and Holmes, Gordon, and Wilson [57].

6.7 SECOND TRANSITION

We have reached a second transition in this book. Up to this point, the level of the statistics and mathematics is quite modest. The next two chapters, however, introduce topics that require a little more mathematical knowledge, although this is kept to a minimum. The emphasis is still on applications and practical problems, and the ideas behind the mathematics are explained.

In addition, the free statistical package R, which is introduced in section 5.5.1, is used to do calculations in chapters 7 and 8. The R code is explained, but like Perl, only certain parts are used in this book. If this taste of R piques your interest, there are free tutorials and documentation on the Web at http://cran.r-project.org/.

PROBLEMS

6.1 Voter registration lists are created by town clerks and registrars of voters, and there are many of these officials, so obtaining all the lists requires considerable effort. This workload can be reduced by performing two stages of sampling. The first is a simple random sample of precincts, and the second is a simple random sample of registered voters within each precinct.

Is this a good idea for text sampling? For example, suppose that a researcher creates a list of different types of texts. Picking a random subset of these types is then the first stage of sampling, and for each selected type, a random sample of texts is chosen (say from a library catalog). It turns out that this design has been used.

The Brown Corpus Manual [46] describes the details of how the Brown corpus was created. In section 1, it says that the sampling design does have two phases. While the first phase (or stage) consists of picking categories of text, it says this is done subjectively using expert judgment as opposed to random selection. How does this compare to using a random selection of topics? Is one or the other technique superior? Once you have thought about this, go online and find this manual (the URL is in the bibliography) and compare your thoughts to the actual design details used.

6.2 Table 6.2 contains the word counts for Shelley’s Frankenstein and London’s The Call of the Wild. Using the programs in this chapter, recompute the word frequencies of these two novels, and compare your results to this table.

Now do the same task with a novel of your own choosing. When you compute your counts, how might these be checked?

6.3 Shelley’s novel, Frankenstein, is analyzed using a concordance program in the book Language and Computers by Geoff Barnbrook [9]. For example, he points out that the monster created by Victor Frankenstein has no name, but he is referred to by a variety of terms.

For this problem, think about how one might try to find the words that refer to the monster in Frankenstein. Then check chapters 3 and 4 of Barnbrook’s book to see how he analyzes this novel. With the concordancing programs developed in this chapter, you can reproduce his results for yourself. Moreover, his book has many additional text analyses to try.

6.4 Table 6.1 can be created by mimicking either program 4.3 or program 5.1. Using these as a model, create a program that counts the character frequencies of a file where the name of the file is put on the command line as follows.

perl character_frequencies.pl text.txt

6.5 Type in the Perl code for program 6.1 with code samples 6.2, 6.3, and 6.4. Once this runs, do the following problems.

a) Table 6.1 reveals that there are underscores in the EnronSent corpus. Find out all the ways these are used. Note that many underscores in a row occur in this text.

b) Table 6.1 reveals that square brackets are used in the EnronSent corpus. Find out how these are used. Were they all typed by the authors of the emails? Finally, remember that a square bracket has special meaning in a regex.

c) Table 6.1 reveals that there are parentheses in the EnronSent corpus. How are these used?

d) Find all the words with the character @ in the EnronSent corpus, and see how they are used. Since these are emails, one use is easy to guess.

6.6 In section 6.4.3, the form of adverbs is explored. For this problem try the suggestion given at the end of this section of removing the not in the if statement of code sample 6.5. Examine the resulting output to estimate the proportion of adverbs. Output 6.12 shows the first few words.

Output 6.12 Ten lines of output from the suggested adverb analysis in problem 6.6.

abasedly abdominally abjectly abnormally aborally

abrasively abruptly absently absentmindedly absolutely

absorbingly abstemiously abstractly abstrusely absurdly

abundantly abusively abysmally accidentally accordingly

accurately achingly acidly acoustically acquiescently

acridly actively actually acutely adamantly addedly

additionally adeptly adequately adjectivally

administratively admiringly admittedly adroitly adultly

advantageously adventitiously aerially aerodynamically

aeronautically affectedly affectingly affectionately

6.7 In code sample 6.9, why is $word[$_[2]-1] returned instead of $word[$_[2]]?

6.8 In section 2.2.2, the problem of finding phone numbers is discussed. The EnronSent corpus includes many of these as well as various other numbers, for example, ZIP codes, dates, times, prices, and addresses.

a) Write a regex to find U.S. style phone numbers. In this corpus many styles are used. For example, there are company phone numbers that start with x and just give the last five digits. There are seven-digit phone numbers, as well as ones with area codes, and ones that start with the U.S. country code +1. Many numbers use dashes, but some use spaces or periods.

b) How do you check how accurate your regex is? One idea is to run a promising regex and save the results. Then run a less restrictive regex that matches many more patterns, and then check if among the extra matches there are patterns overlooked by the promising regex.

For example, to test a phone number regex, a pattern like \d{3}\D\d{4} is useful since it should match most numbers. However, there are false positives, for example, ZIP+4 codes. Applying this to the EnronSent corpus, what other false positives are there? Also, does it match all U.S. style numbers? What broader regex can be used to check this?
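One possible starting point for this comparison is sketched below; it is not a solution to the problem, and the restrictive pattern is only a placeholder to be replaced by the reader's own regex.

my $restrictive = qr/\b\d{3}-\d{3}-\d{4}\b/;   # placeholder: dashes only
my $broad       = qr/\d{3}\D\d{4}/;            # the broad checking pattern
open(FILE, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<FILE>) {
    # Print lines matched by the broad pattern but missed by the restrictive one.
    print "line $.: $_" if /$broad/ and not /$restrictive/;
}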

6.9 Adjectives have comparative and superlative forms. These are not quite straightforward because there are irregular forms and differences depending on the number of syllables in the adjective. See section 236c and sections 460 through 464 of the Cambridge Grammar of English [26] for more details.

For this problem, find a book on grammar and look up the rules for making comparative forms of adjectives. Write a Perl program that finds some of these. Remember that many words end in -er and -est that are not adjectives such as reader and pest. Remember that some adjectives use the words more and most. Finally, making a program that produces no errors is a monumental task, so part of the challenge of this problem is to decide on how much error is tolerable.
