Appendix A

Overview of Perl for Text Mining

This appendix summarizes the basics of Perl in these areas: basic data structures, operators, branching and looping, functions, and regular expressions. The focus is on Perl’s text capabilities, and many references are made to code throughout this book.

The form of these code samples differs slightly from those in the rest of this book. To save space, the output is placed at the end of the code.

To run Perl, first download it by going to http://www.perl.org/ [45] and following the instructions there. Second, type the statements into a file with the suffix .pl, for example, program.pl. Third, find out how to use your computer’s command line interface, which allows the typing of commands for execution. Fourth, type the statement below on the command line and then press the enter key. The output will appear below it.

perl program.pl

Remember that Perl is case sensitive. For example, commands have to be in lowercase, and the three variables $cat, $Cat, and $CAT are all distinct. Finally, do not forget to use semicolons to end each statement.

A.1 BASIC DATA STRUCTURES

A programmer must be able to store and modify information, which is kept in scalar, array, and hash variables. We start with scalars, which store a single value, and their names always start with a dollar sign. First, consider the examples in code sample A.1, which demonstrates Perl’s two types of scalars, strings and numbers. If a string is used as a number, then Perl tries to convert it. Conversely, a number used as a string is always converted.

Code Sample A.1 Perl converting a string to a number and vice versa.

$x1 = "4";
$y1 = "5";
$z1 = $x1 + $y1; # Addition
$x2 = 4;
$y2 = 5;
$z2 = $x2 . $y2; # Concatenation
$x3 = "4";
$y3 = 5;
$z3 = $x3 . $y3;
$z4 = $x3 + $y3;
print "$z1, $z2, $z3, $z4\n";

OUTPUT: 9, 45, 45, 9

Code sample A.2 shows that the logical values true and false are represented by either strings or numbers. The values 0, '0', "0", '', "", (), and undef are false, and all other numbers and strings are true.

Code Sample A.2 Numbers and strings represent true and false.

if ( 0 ) { print "True "; } else { print "False "; }
if ( '0' ) { print "True "; } else { print "False "; }
if ( "0" ) { print "True "; } else { print "False "; }
if ( 7 ) { print "True "; } else { print "False "; }
if ( '7' ) { print "True "; } else { print "False "; }
if ( "7" ) { print "True "; } else { print "False "; }
if ( '' ) { print "True "; } else { print "False "; }
if ( "" ) { print "True "; } else { print "False "; }
if (()) { print "True "; } else { print "False "; }
if ( undef ) { print "True "; } else { print "False "; }

OUTPUT: False False False True True True False False False False

Code sample A.3 gives examples of references, which are hexadecimal numbers representing a memory location. These are created by placing a backslash in front of the scalar (or array or hash). References can be chained together, for example, a reference to another reference. A dereferencing operator is used to access the value in the memory location. For a reference to a scalar, this operator is ${}. Finally, references are used extensively in complex data structures such as arrays of hashes, as discussed in section 3.8.

Code Sample A.3 Examples of scalar references.

$x = 1729;
$xref = \$x; # A reference to the value in $x
$xrefref = \$xref; # A reference to a reference
$y = ${$xref}; # Dereferencing a reference
$zref = ${$xrefref};
$z = ${${$xrefref}}; # Two dereferences in a row
print "$x, $xref, $xrefref, $y\n$z, $zref";

OUTPUT: 1729, SCALAR(0x1832960), REF(0x1832984), 1729

1729, SCALAR(0x1832960)

Examples of working with arrays are shown in code sample A.4. While scalars contain only one value, arrays have many. While scalars always start with a dollar sign, arrays as a whole always start with an at sign, for example, @array. An array is a collection of variables indexed by 0, 1, 2, ..., and these individual values are accessed as follows: $array[0], $array[1], $array[2], ... Note that each starts with a dollar sign.

Arrays can be defined by listing the values in parentheses, or created by a variety of functions, for example, split. They can be built up by functions, too, for example, push. Setting a scalar equal to an array does not produce a syntax error. Instead, the scalar is set to the number of entries in the array. For example, $scalar in code sample A.4 has the value 5 because @array2 has five elements. Finally, an array of indices can be placed in the square brackets to select a subset.

Code Sample A.4 Making and modifying arrays.

@array1 = ("Katy", "Sam", 16);
$array1[3] = "Taffy";
@array2 = split(//, "Test");
push(@array2, "ing");
$scalar = @array2; # Number of entries in @array2
@indices = (0,0,1,0,2,1);
@array3 = @array1[@indices];
print "$array1[0], @array1, $array2[0], @array2\n";
print "$scalar, @array3\n";

OUTPUT: Katy, Katy Sam 16 Taffy, T, T e s t ing

5, Katy Katy Sam Katy 16 Sam

References to arrays are illustrated in code sample A.5. These can be created by listing values between square brackets or by putting a backslash in front of an array name. To access the array, dereferencing must be done. Just as scalar names start with a dollar sign, and scalar dereferencing is done with ${}, arrays start with an at symbol and array dereferencing is done with @{}.

Code Sample A.5 Array references and dereferences.

$ref1 = [1,2,"Cat"];
@array1 = @{$ref1};
$ref2 = \@array1;
@array2 = @{$ref2};
print "$ref1, @array1, $ref2, @array2";

OUTPUT: ARRAY(0x18330b8), 1 2 Cat, ARRAY(0x1832be4), 1 2 Cat

Code sample A.6 has examples of hashes, which are like arrays except they are indexed by strings, not nonnegative integers. Hashes can be defined the same way as an array, but the odd entries are the keys (or indices), and the even entries are the values. A hash can also be created by setting it equal to an array. The functions keys and values return an array of keys and values, respectively.

Code Sample A.6 Working with hashes.

%hash = (Cat, 3, Dog, 4, Rabbit, 6); # Quotes are optional

@array = %hash;

@keys = keys(%hash);

@values = values(%hash);

print "$hash{Cat}, $hash{Rabbit}, @array\n";
print "@keys, @values\n";

OUTPUT: 3, 6, Rabbit 6 Dog 4 Cat 3

Rabbit Dog Cat, 6 4 3
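Because hashes are indexed by strings, they are a natural fit for counting words. The following is a minimal sketch of that idea rather than a sample from this book: the sentence and the variable names are assumptions, and strict and warnings are enabled, which the shorter samples in this appendix omit.

```perl
use strict;
use warnings;

# Hypothetical sample text; any string would do.
my $text = "the cat saw the dog and the rabbit";

# Count how often each word occurs, using the word itself as the hash key.
my %count;
for my $word (split(/ /, $text)) {
    $count{$word}++;
}

# Sort the keys so that the output order is predictable.
for my $word (sort keys %count) {
    print "$word $count{$word}\n";
}
```

Here $count{the} ends up as 3, and each of the other five words has a count of 1.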

References are made to hashes by either listing values between curly brackets or putting a backslash in front of a hash name as shown in code sample A.7. These are dereferenced by using %{}, and note that hash names and dereferencing both begin with a percent sign. Unlike scalars and arrays, hashes are not interpolated inside double quotation marks because hashes are unordered. Note that the two print statements do not list the hashes in the same order. However, a hash can be assigned to an array, which can be interpolated.

Finally, scalars, arrays, and hashes can be mixed together using references to form complex data structures. See section 3.8 for a discussion.
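As a rough illustration of mixing these structures, the sketch below builds an array of hash references and then dereferences the entries. The pet data and all variable names are assumptions made for this example, not code from section 3.8.

```perl
use strict;
use warnings;

# An array whose entries are references to hashes (hypothetical data).
# The => symbol is a comma that also quotes the word to its left.
my @pets = (
    { name => "Taffy", species => "dog" },
    { name => "Katy",  species => "cat" },
);

# Add another hash reference with push.
push(@pets, { name => "Sam", species => "rabbit" });

# $pets[1] is a hash reference; ${$pets[1]}{name} looks up a key in it.
my $second = ${$pets[1]}{name};

# Loop over the array, dereferencing each hash with %{}.
for my $ref (@pets) {
    my %pet = %{$ref};
    print "$pet{name} is a $pet{species}\n";
}
```

Each entry of @pets is a scalar (a reference), which is what allows an array to hold hashes at all.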

A.1.1 Special Variables and Arrays

Perl defines many variables that it uses for a variety of purposes. Table A.1 contains just a few of these along with examples of use (if any) in this book. For more information, read one of the books on Perl programming listed in section 2.8.

Table A.1 A few special variables and their use in Perl.

Name       Purpose                             Example
$_         Default variable                    Program 2.7
$1         Contents of leftmost parentheses    Program 3.3
$2         Contents of next parentheses        Code sample 2.31
@ARGV      Command line arguments              Code sample 2.17
@_         Subroutine arguments                Code sample A.20
$/         Separator for reading input file    Program 2.7
$/="";     Paragraph mode                      Program 2.7
$/="\n";   Line-by-line mode                   Default
$"         Separator for interpolation         Code sample 3.21
$`         Stores string before regex match
$&         Stores regex match
$'         Stores string after regex match     Code sample 2.25
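Two of the variables in table A.1 can be tried out in a few lines. This is a small sketch under assumed data: the word list and the names are hypothetical, and local (not covered in this appendix) is used only to limit the change of $" to one block.

```perl
use strict;
use warnings;

my @words = ("text", "mining", "with", "Perl");

# $" is the separator used when an array is interpolated in a string.
my $default = "@words";          # entries separated by single spaces
{
    local $" = "-";              # change the separator inside this block only
    print "@words\n";
}

# $1 and $2 hold the contents of the first and second parentheses.
my ($first, $second);
if ("text mining" =~ /(\w+) (\w+)/) {
    ($first, $second) = ($1, $2);
}
print "$second $first\n";
```

The first print joins the entries with hyphens, and the second prints the two captured words in swapped order.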

Code Sample A.7 Hash referencing and dereferencing.

$ref1 = {Cat, 3, Dog, 4, Rabbit, 6};
%hash1 = (Cat, 3, Dog, 4, Rabbit, 6);
$ref2 = \%hash1;
%hash2 = %{$ref1};
@array1 = %hash1;
@array2 = %hash2;
print "$ref1, @array1\n";
print "$ref2, @array2\n";

OUTPUT: HASH(0x18332f8), Rabbit 6 Dog 4 Cat 3
HASH(0x18330e8), Rabbit 6 Cat 3 Dog 4

A.2 OPERATORS

Operators are just functions, but they use special symbols and syntax. For example, addition is an arithmetic operator in Perl and is denoted by +, which is written in between numbers. The common mathematical operators in Perl are given in code sample A.8. Note that the percent sign is the modulus operator, which returns the remainder for an integer division.

Code sample A.9 demonstrates two string operators. The second uses the letter x and can be used with arrays, too.

In Perl, only the values 0, '0', "0", '', "", (), and undef are false, and all other numbers or strings are true. So logical operators work with both numbers and strings as shown in code sample A.10.

However, when Perl computes a logical true or false, it uses 1 and "" as seen in code sample A.11. To make the output more readable, the symbol printed between array entries is changed to a comma (by setting $" to this). Perl has two types of comparison operators: one for numbers, the other for strings. The former uses symbols (for example, == for numerical equality), while the latter uses letters (for example, eq for string equality). However, a comparison like 14 gt 3 is possible, but it changes both numbers to strings, and then does a string comparison, so in this case it is false because the string “14” precedes the string “3” when using alphabetical order. Finally, <=> and cmp are both comparison operators. If the first value is greater than the second, then 1 is returned; if the two values are equal, 0 is returned; and if the second value is greater than the first, −1 is returned.
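The difference between numeric and string comparison matters most when sorting. As a brief sketch (the list of numbers is an assumption chosen to echo the 14 versus 3 example), the built-in sort compares as strings unless a comparison such as <=> is supplied:

```perl
use strict;
use warnings;

my @numbers = (14, 3, 25, 7);

# The default sort compares its arguments as strings, like cmp.
my @as_strings = sort @numbers;

# Supplying <=> compares them as numbers instead.
my @as_numbers = sort { $a <=> $b } @numbers;

print "@as_strings\n";
print "@as_numbers\n";
```

The first line printed is 14 25 3 7, because the string "14" precedes "3", and the second is 3 7 14 25.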

Code Sample A.8 Examples of mathematical operators.

$x = 2;
$y = 3;
$z = 8;
$answer[0] = $x + $y; # Addition
$answer[1] = $x * $y; # Multiplication
$answer[2] = $x - $y; # Subtraction
$answer[3] = $x / $y; # Division
$answer[4] = $x ** $y; # Exponentiation
$answer[5] = $z % $y; # Modulus
print "@answer";

OUTPUT: 5 6 -1 0.666666666666667 8 2

Code Sample A.9 Examples of two string operators.

$concat1 = 'abc' . '123';
$concat2 = "cat" . 123;
$mult = 'cat' x 3;
@array1 = (1, 2, 3) x 2;
@array2 = @array1 x 5;
@array3 = (@array1) x 2;
@array4 = 2 x (1, 2, 3);
@array5 = 2 x @array1;
@array6 = 2 x (@array1);
print "$concat1, $concat2, $mult\n";
print "@array1, @array2, @array3\n";
print "@array4, @array5, @array6\n";

OUTPUT: abc123, cat123, catcatcat

1 2 3 1 2 3, 66666, 1 2 3 1 2 3 1 2 3 1 2 3

222, 222222, 222222

Code sample A.12 gives a few miscellaneous operators. The regex matching operators return true and false; see section 2.7.2 for a discussion. The ++ operator increments the variable it is next to: if it is before the variable, then that variable is incremented before it is assigned, but if it is after the variable, then it is incremented after it is assigned. For example, compare the values of $answer[3] and $answer[4].

Code Sample A.10 Logical operator examples.

$answer[0] = 5 and 6;
$answer[1] = 5 && 6;
$answer[2] = 3 or "";
$answer[3] = 0 || "";
$answer[4] = 3 xor 5;
$answer[5] = ! '';
$answer[6] = not '';
$" = ',';
print "@answer\n";

OUTPUT: 5,6,3,,3,1,1

Code Sample A.11 Comparison operator examples.

$answer[0] = 14 > 3;
$answer[1] = 14 gt 3;
$answer[2] = "14" > "3";
$answer[3] = "14" gt "3";
$answer[4] = 25 < 7;
$answer[5] = 'cat' lt "dog";
$answer[6] = 22.24 < 3.99;
$answer[7] = 'cat' ge 'dog';
$answer[8] = 'cat' eq 'cat';
$answer[9] = 5 == 6;
$answer[10] = 5 <=> 6;
$answer[11] = 6 <=> 6;
$answer[12] = 7 <=> 6;
$answer[13] = 'cats' cmp 'dog';
$answer[14] = 'dog' cmp 'dog';
$answer[15] = 'rats' cmp 'dog';
$" = ',';
print "@answer\n";

OUTPUT: 1,,1,,,1,,,1,,-1,0,1,-1,0,1

Finally, the double period is the range operator, and it produces an array of numbers or strings. This is shown below where @array receives the elements 1 through 10. This also works with letters, for example, ('a'..'z') produces the lowercase alphabet.

@array = (1..10);

The above operators are not exhaustive, but they are useful in text mining. Next we review branching and looping in a program.

Code Sample A.12 A few miscellaneous operators.

$text = "It’s never too late ...";
$answer[0] = $text =~ /never/;
$answer[1] = $text !~ /never/;
$answer[2] = $text =~ /cat/;
$x = 6;
$answer[3] = ++$x;
$x = 6;
$answer[4] = $x++;
$answer[5] = $x;
$x = 6;
$answer[6] = --$x;
$" = ',';
print "@answer\n";

OUTPUT: 1,,,7,6,7,5

A.3 BRANCHING AND LOOPING

The ability to make decisions and to repeat portions of code is essential to programming. The if statement tests logical conditions, and examples are given in code sample A.13. A logical test is performed within parentheses, the outcome of which determines which block of code (contained in curly brackets) to execute. Note that the if can come after a statement. In this code sample, all the regexes are tested against the default variable $_, which can either be explicitly written or left out. To test another variable against the regex, it must be explicitly given.

Code Sample A.13 Examples of if statements.

$_ = "This is a test.";
if (/test/) { print "Match, "; }
if ($_ =~ /test/) { print "Match, "; }
print "Match, " if (/test/);
if ($_ =~ /test/) {
  print "Match, ";
} else {
  print "No Match, ";
}

OUTPUT: Match, Match, Match, Match,

An if-elsif-else statement allows more than one test, and two or more elsifs are permissible. See code sample A.14 for an example.

Code Sample A.14 Example of using elsif twice.

$_ = "This is a test.";

if ($_ =~ /cat/) {

print "Matches cat ";

} elsif (/dog/) {

print "Matches dog ";

} elsif (/bat/) {

print "Matches bat ";

} else {

print "No matches here ";

}

OUTPUT: No matches here

The for statement loops over a block of code, as seen in code sample A.15. The number of iterations is determined by a counter (the variable $i in the first example), or over a sequence of values like 0..9, or over the elements of an array. Finally, although a programmer may prefer either for or foreach depending on the situation, these two statements are interchangeable.

Code Sample A.15 Examples of for loops.

for ($i = 0; $i < 10; ++$i) {
  print "$i ";
}
print "\n";
for $i (0..9) {
  print "$i ";
}
print "\n";
@array = ('a'..'j');
for $i (@array) {
  print "$i ";
}

OUTPUT: 0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
a b c d e f g h i j

It is also possible to use a hash in a for or foreach loop, which is seen in code sample A.16. However, the order is unpredictable since Perl determines how the hash is stored, which can differ from how it is constructed. A better approach is to iterate over the keys of the hash, which can be sorted as desired, for example, put into alphabetical order.

Code Sample A.16 Using a hash in a for loop.

%hash = (dog, 1, cat, 2, rabbit, 3);

for $i (%hash) { print "$i "; }
print "\n";

for $i (sort keys %hash) {

print "$i $hash{$i} ";

}

OUTPUT: cat 2 rabbit 3 dog 1

cat 2 dog 1 rabbit 3

The while loop is also enormously useful. For example, text files are usually read in by a statement of the following form.

while (<FILE>) {
  # code
}

Here the filehandle within the angle brackets reads the file piece by piece (the default is line by line). The while statement can also iterate over all the matches of a regex, as seen in code sample A.17.

Code Sample A.17 Looping over the matches of a regex with while.

$_ = "This is a test.";

$vowels = 0;

while ( /[aeiou]/g ) {

++$vowels;

}

print "# of vowels = $vowels";

OUTPUT: # of vowels = 4
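The line-by-line form while (<FILE>) can be tried without an external file by opening a filehandle on a string. This is only a sketch: the three-line text and the variable names are assumptions, and it uses the three-argument form of open with a lexical filehandle rather than the style used elsewhere in this appendix.

```perl
use strict;
use warnings;

# Hypothetical three-line text standing in for the contents of a file.
my $text = "This is a test.\nThis is only a test.\nThe end.\n";

# Opening a reference to a string gives a filehandle that reads from it.
open(my $fh, "<", \$text) or die "Cannot open in-memory file: $!";

my $lines = 0;
my $matches = 0;
while (<$fh>) {            # each line is placed in $_
    chomp;                 # remove the trailing newline
    ++$lines;
    ++$matches if (/test/);
}
close($fh);

print "$matches of $lines lines contain 'test'\n";
```

With this text the loop runs three times and the regex matches on the first two lines.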

Finally, a while loop can test for all types of logical conditions. Code sample A.18 shows an example where the while loop executes like a for loop.

Code Sample A.18 A while loop executing like a for loop.

$i = 0;

while ($i < 10) {

++$i;

print "$i ";

}

OUTPUT: 1 2 3 4 5 6 7 8 9 10

Sometimes, modification of the execution of a for or while is needed. There are three commands to do this; for example, when last is executed, the loop immediately ends, as seen in code sample A.19. Other statements to modify looping behavior are next and redo. See problem 7.2 for an example.

Code Sample A.19 Ending a loop with the statement last.

$text = "This is a test.";

@words = split(/ /, $text);

foreach $word (@words) {

if ($word eq "a") { last; }

print "$word ";

}

OUTPUT: This is
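For comparison with last, here is a minimal sketch of next, which skips the rest of the current iteration and moves on to the following one. The length cutoff and the @kept array are assumptions made for this example.

```perl
use strict;
use warnings;

my $text = "This is a test.";
my @words = split(/ /, $text);
my @kept;

foreach my $word (@words) {
    # Skip the rest of this iteration for words of one or two characters.
    if (length($word) < 3) { next; }
    push(@kept, $word);
}

print "@kept\n";
```

Unlike last, the loop continues after the short words, so both "This" and "test." are kept.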

Another way to execute blocks of code is the subroutine. The analogous idea for a group of subroutines is called a module, but this topic is more advanced: see section 9.2.

The default in Perl allows subroutines to access and modify any variables in the main code. This can cause problems, especially if subroutines are reused in other programs. The solution is easy: make the variables in a subroutine local to that subroutine, which can be done with the my statement at the beginning for all the variables at once, as shown below.

my ($variable1, @array1, %hash1);

The my statement can also be used the first time a variable is used, as shown in code sample A.20. Also see problem 5.5.
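The effect of my can be seen by comparing two subroutines that differ only in whether they use it. This is a sketch with hypothetical names (careless, careful, $count), written with strict and warnings enabled:

```perl
use strict;
use warnings;

my $count = 10;              # a variable belonging to the main code

sub careless {
    $count = 0;              # no my here, so the main $count is overwritten
    return $count;
}

sub careful {
    my $count = 0;           # my creates a separate $count inside the subroutine
    return $count;
}

careful();
my $after_careful = $count;    # the main $count is untouched
careless();
my $after_careless = $count;   # the main $count has been changed

print "$after_careful, $after_careless\n";
```

The main program's $count survives the call to careful but not the call to careless, which is the kind of side effect that makes subroutines hard to reuse.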

Code Sample A.20 Example of a subroutine. Note the use of my, which makes variables local to the subroutine.

sub letterrank {
  my $lcletter = lc($_[0]);
  if ( 'a' le $lcletter and $lcletter le 'z' ) {
    return ord($lcletter) - ord('a') + 1;
  } else {
    return '';
  }
}
$letter = 'R';
print "$letter has letter rank ", letterrank($letter), "\n";
print "$letter has letter rank ", &letterrank($letter), "\n";

OUTPUT: R has letter rank 18
R has letter rank 18

In this code sample, the subroutine uses return, which returns a value back to the main program. Hence this subroutine is a function. Also note that after a subroutine is defined, then its name does not require an initial ampersand, although there is no harm in always using it. Also see problem 5.4.

The array @_ contains the subroutine arguments, so $_[0] is the first argument, $_[1] the second, and so forth. Finally, the function ord returns the ASCII rank of a character, which assigns 65 to A, 66 to B, ..., 90 to Z, and 97 to a, ..., 122 to z. So by changing each letter to lowercase and then subtracting the rank of a and adding 1, we get the desired letter rank (that is, A has rank 1, B has rank 2, and so forth).

Functions break a complex task into a sequence of simpler tasks. The ability of functions to use other functions enables a programmer to create a hierarchy of them to do a task.

Finally, many functions are already built into Perl, or can be easily loaded into Perl, so before writing a subroutine, it is wise to check to see if it already exists by checking online. The next section discusses just a few of these.

A.4 A FEW PERL FUNCTIONS

Perl has many functions built into it and even more that can be downloaded. The Perl documentation online [3] or the Comprehensive Perl Archive Network (CPAN) [54] are great places to check for information. This section, however, only discusses some of these applicable to text mining.

First, table A.2 lists string functions with an example of where they are used in this book. Note that the inverse of ord is chr. Finally, note that reverse reverses a string when in scalar context as shown below. However, it reverses an array if it is in an array context as shown in table A.3, which has examples of array functions used in this book.

Table A.2 String functions in Perl with examples.

Name      Purpose                        Example
chomp     Remove trailing newline        Program 2.4
index     Find position of substrings    Code sample 2.13
join      Combine strings                Code sample 2.7
lc        Convert to lowercase           Code sample 3.15
ord       ASCII value                    Code sample 3.38
pos       Position in string             Code sample 2.15
split     Split up strings               Code sample 2.2
sprintf   Create formatted string        Code sample 2.35
substr    Find substrings                Program 2.7

Table A.3 Array functions in Perl with examples.

Name      Purpose                         Example
grep      Apply regex to entries          Code sample 3.12
map       Apply a function to an array    Code sample 3.38
pop       Remove last entry               Code sample 3.11
push      Add new last entry              Code sample 3.11
reverse   Reverse an array                Code sample 3.16
shift     Remove first entry              Code sample 3.11
split     Output is an array              Code sample 2.2
sort      Sort an array                   Code sample 3.14
unshift   Add new first entry             Code sample 3.11

$answer = reverse("testing, testing");
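The dependence of reverse on context is easy to trip over, so a short sketch may help. The strings and variable names here are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Scalar context: the characters of the string are reversed.
my $backward = reverse("testing");

# List context: the order of the entries is reversed, not their characters.
my @words = reverse("one", "two", "three");

# Inside print's argument list reverse is in list context,
# so scalar() is needed to reverse the characters.
print scalar(reverse("testing")), "\n";
print "$backward, @words\n";
```

Assigning to a scalar, as in the statement above, is what puts reverse in scalar context.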

Second, since an array is easily converted into a hash, the functions in table A.3 are not unrelated to hashes. Table A.4, however, has functions for hashes only. Note that values is analogous to keys: the former returns an array of the values of a hash. In addition, there is the function each, which is shown in code sample A.21.

Table A.4 Hash functions in Perl with examples.

Name     Purpose                 Example
exists   Test if key exists      Program 3.5
keys     Create array of keys    Code sample 3.22

Third, there are many mathematical functions available in Perl. For example, the natural logarithm, log, and the square root, sqrt, are used in program 5.5. Perl has limited trigonometric functions unless the Math::Trig module is loaded with use, which is done in code sample 5.5.
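As a small sketch of these functions, the lines below use the built-in sqrt, log, and exp, and then tan and pi, which become available once Math::Trig is loaded. The particular values are assumptions picked so that the results are easy to check by hand.

```perl
use strict;
use warnings;
use Math::Trig;              # adds pi, tan, and other trigonometric functions

my $root = sqrt(16);         # square root
my $ln   = log(exp(2));      # natural logarithm of e squared
my $log2 = log(8) / log(2);  # logarithm base 2 via change of base
my $tangent = tan(pi / 4);   # tan and pi come from Math::Trig

printf("%.3f %.3f %.3f %.3f\n", $root, $ln, $log2, $tangent);
```

Since log is the natural logarithm, logarithms to other bases are obtained by dividing by the log of the base, as in the third line.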

Code Sample A.21 Example of the function each.

%hash = (cat, 3, dog, 3, rabbit, 6);

while ( ($key, $value) = each %hash ) {

print "$key, $value; ";

}

OUTPUT: cat, 3; rabbit, 6; dog, 3;

Finally, the functions open and close manipulate external files. There are three modes for the former: input, overwriting, and appending. These are shown in code sample A.22 along with close and die, where the latter halts execution of the program if open fails to work.

The command line can be used to overwrite or append to a file as shown below. The character > overwrites, while >> appends. This completes our review of functions, and regular expressions are the topic of the next section.

perl program.pl > output.txt

perl program.pl >> output.txt

A.5 INTRODUCTION TO REGULAR EXPRESSIONS

Chapter 2 is devoted to regular expressions (also called regexes), which are central to text patterns, so reading it is essential. This section, however, just summarizes the regex syntax and refers the reader to illustrative examples in this book. Finally, table 2.3 summarizes some of the special symbols used.

Code Sample A.22 Three different ways to open a file for reading, overwriting, and appending.

open (INPUT, "test.txt") or die; # Reads file
while (<INPUT>) {
  # commands
}
close (INPUT);
open (OUTPUT, ">test.txt") or die; # Rewrites file
print OUTPUT "Your Text";
close (OUTPUT);
open (OUTPUT, ">>test.txt") or die; # Appends to file
print OUTPUT "Your Text";
close (OUTPUT);

First, the syntax for matching a regex is as follows.

$text =~ m/$regex/;

There is also an operator for not matching.

$text !~ m/$regex/;

Note that the m is not required. Substitution has a similar syntax, but the s is required.

$text =~ s/$regex/string/;

Both matching and substitution have modifiers, which are placed after the last forward slash. Two useful ones are g for global, and i for case insensitive. See code sample 6.1 for an example of the former and code sample 3.35 for an example of the latter.
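The effect of the two modifiers can be sketched as follows. The sentence and the variable names are assumptions made for this illustration.

```perl
use strict;
use warnings;

my $text = "The cat sat on the mat near the cat.";

# Without i only the lowercase "the" matches; with i "The" matches too.
# In list context, a match with g returns all of the matches.
my @lower = ($text =~ /the/g);
my @any   = ($text =~ /the/gi);

# Without g only the first match is replaced; with g every match is.
my $first_only = $text;
$first_only =~ s/cat/dog/;
my $all = $text;
$all =~ s/cat/dog/g;

print scalar(@lower), " ", scalar(@any), "\n";
print "$first_only\n$all\n";
```

The counts printed are 2 and 3, and only the second substitution changes both occurrences of "cat".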

As shown above, variables are interpolated inside matching or substitution. An example of this is program 3.6. The qr// construct allows storing a regex in a variable; for example, see code sample 3.12.
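A brief sketch of both ideas, interpolating a string variable into a regex and storing a compiled regex with qr//, is given here. The text, the search word, and the pattern are assumptions, not taken from program 3.6 or code sample 3.12.

```perl
use strict;
use warnings;

my $text = "Text mining finds patterns in text.";

# A variable holding a plain string is interpolated into the regex.
my $word = "text";
my $count = 0;
while ($text =~ /$word/gi) { ++$count; }

# qr// compiles a regex, modifiers included, and stores it in a scalar.
my $regex = qr/\bpat\w*/i;
my ($found) = ($text =~ /($regex)/);

print "$count, $found\n";
```

A regex stored with qr// can be reused in several matches or combined with other pieces inside a larger regex.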

Table A.5 lists the common special characters in regexes along with an example of a program that uses each one, if any. Note that the ^ also stands for character negation when it is the first character inside a pair of square brackets.

Table A.5 Some special characters used in regexes as implemented in Perl.

Name   Purpose           Example
\b     Word boundary     Program 6.1
\B     Negation of \b
\d     Digit             Program 2.1
\D     Negation of \d
\s     Whitespace        Program 4.4
\S     Negation of \s
\w     Word character    Code sample 3.12
\W     Negation of \w    Program 3.3
.      Any character     Program 3.7
^      Start of line     Program 5.1
$      End of line       Program 3.7.2.1

Table A.6 summarizes the syntax for how many times a pattern can appear in a regex. Note that all the examples in the table are greedy, but by adding a final question mark, these become nongreedy: ??, *?, +?, {m,n}?. See code sample 2.21 and program 5.2 for examples of *?.

Table A.6 Repetition syntax in regexes as implemented in Perl.

Name    Number of matches        Example
?       Zero or one              Program 2.1
*       Zero or more             Program 2.6
+       One or more              Program 2.9
{m,n}   At least m, at most n    Program 2.9
{m}     Exactly m                Program 2.1
{m,}    m or more
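The difference between greedy and nongreedy repetition is easiest to see on a concrete string. The following sketch uses a made-up snippet of markup and hypothetical variable names:

```perl
use strict;
use warnings;

my $text = "<b>bold</b> and <i>italic</i>";

# Greedy: .* takes as much as possible, so the match runs to the last >.
my ($greedy) = ($text =~ /(<.*>)/);

# Nongreedy: .*? takes as little as possible, so the match stops at the first >.
my ($nongreedy) = ($text =~ /(<.*?>)/);

# {m,n} in action: whole words of two to four lowercase letters.
my @runs = ("a bb ccc dddd" =~ /\b[a-z]{2,4}\b/g);

print "$greedy\n$nongreedy\n@runs\n";
```

The greedy version matches the entire string, while the nongreedy version matches only the first tag.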

Square brackets specify a collection of characters. Examples are seen in programs 3.3, 4.4, and 6.1. Parentheses form groups, for example, (abc)+ means one or more repetitions of the string “abc.” The part of the text matched by the regex within each set of parentheses is stored in the variables $1, $2, $3, .... For two examples of this, see programs 3.7 and 5.1. To refer to these matches within the regex itself, the backreferences \1, \2, \3, ... are used. See code sample 3.12 for an example.
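A short sketch of grouping and backreferences follows. The sentence, which contains a deliberately doubled word, and the variable names are assumptions made for this example.

```perl
use strict;
use warnings;

my $text = "It was was the best of times";

# \1 refers back to whatever the first parentheses matched,
# so this regex finds a word that is immediately repeated.
my $doubled = "";
if ($text =~ /\b(\w+) \1\b/) {
    $doubled = $1;           # after the match, $1 holds the captured word
}

# A group with +: one or more repetitions of "ab".
my ($repeats) = ("xxabababyy" =~ /((ab)+)/);

print "$doubled, $repeats\n";
```

Inside the regex the captured text is written \1, while after the match it is available as $1.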

Finally, the idea of lookaround is discussed in section 2.7.3. Lookarounds match conditions at a position as opposed to characters. This is a generalization of \b, which matches a word boundary, that is, a position with a word character on one side but not on the other.
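As a rough sketch of lookaround in Perl, the regexes below use lookahead and lookbehind to test what surrounds a match without including it. The price string and the variable names are assumptions, not examples from section 2.7.3.

```perl
use strict;
use warnings;

my $text = "price: 100 dollars, 200 euros, 300 dollars";

# Positive lookahead (?=...): digits that are followed by " dollars".
# The lookahead checks the condition but is not part of the match.
my @dollars = ($text =~ /\d+(?= dollars)/g);

# Negative lookahead (?!...): whole numbers not followed by " dollars".
my @others = ($text =~ /\b\d+\b(?! dollars)/g);

# Positive lookbehind (?<=...): a word that is preceded by "200 ".
my ($unit) = ($text =~ /(?<=200 )\w+/);

print "@dollars; @others; $unit\n";
```

In each case only the digits or the word are returned, because the lookaround itself consumes no characters.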

For an in-depth examination of regexes, start with Watt’s Beginning Regular Expressions [124], and then you will be ready for Friedl’s Mastering Regular Expressions [47].
