Learn Perl

This is a short introduction to Perl.

Many of the hacks in this book use Perl to process text files. I’m sure that many of you have never used Perl (or any programming language, for that matter), so I want to give a short introduction. This introduction should be enough to let you write really simple programs, to understand the sample programs in this book, and to modify the sample programs to do different things. If you don’t want to understand my scripts, modify my scripts, or write your own scripts, you can safely skip this hack.

Perl programs are usually called Perl scripts because they are interpreted in real time by the Perl program. (Perl is an example of an interpreted language. Other languages you might have heard of, like C++ and Java, are compiled languages. Here’s a quick analogy: suppose the computer is a baker. The Perl program is similar to a recipe to make cake from scratch, and the compiled C++ program is similar to a cake mix. You can bake a cake from either recipe, but the cake mix eliminates steps for you ahead of time.)

The Basics

This section introduces some basic concepts in Perl, showing constructs we’ll use many times in this book.

Statements.

Perl statements tell the computer to do something. As a simple example, let’s write a short program to print something:

	print "hello world\n";

(Notice the semicolon at the end of the statement. You need to put a semicolon after the end of every statement in Perl. Oh, and the \n means “start a new line.”) If you save this to a file called helloworld.pl, you can run the program like this:

	% perl helloworld.pl
	hello world

We’ll get back to statements soon, but first we need to introduce variables.

Variables.

In Perl, you can give a convenient name to a value and then use it later. The named thing is called a variable. (If that statement bothers you, you know too much to be reading this chapter. This book is about saving time! Don’t read this. Skip to another hack.) Here are a couple of examples:

	$name = "Barry Bonds";
	$OPS2004 = 1.422;
	print "The OPS of $name in 2004 was $OPS2004\n";

Running this program will produce the following:

The OPS of Barry Bonds in 2004 was 1.422

Datatypes.

As you might guess, it can be nice to keep lists of things, where you can refer to each item in the list by a name or position. These lists of things are called data structures. In this book, I use only four different datatypes in Perl.

The first one is scalar values. Scalar values include numbers and strings, like the ones I used in the earlier example for variables. Notice that there is a dollar sign ($) in front of scalar values.

The second one is arrays. An array is a simple list of items. Here’s an example:

	@positions = ("P", "C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH");
	print "The third position in the list is $positions[2]\n";

You might have read that and said, “Hey, you wrote a two, but said third.” You’re right about that. Array positions are numbered from zero, not one. (You’ll also notice that I put an at sign [@] in front of the variable name when I defined a list, but a $ when I looked up a value from the list. This is a kind of funny feature of Perl; you put a sign in front of a variable to explain how to interpret the result in the current context.) Running this code will show the following:

The third position in the list is 1B
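Here is a minimal sketch of a few other things you can do with the same list. It assumes two habits this hack doesn't otherwise use: use strict, which makes Perl insist that variables are declared, and my, which declares them. The names $first, $last, and $count are only illustrative.

```perl
use strict;
use warnings;

my @positions = ("P", "C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH");

my $first = $positions[0];       # positions are numbered from zero
my $last  = $positions[-1];      # a negative position counts back from the end
my $count = scalar(@positions);  # the number of items in the list

print "first: $first, last: $last, count: $count\n";
```

Running this should print first: P, last: DH, count: 10.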

The third datatype is associative arrays (also called hashes). An associative array lets you keep a list of stuff, with a name associated with each item. Here’s an example:

	%AVG = ("Suzuki"   => .372,
	        "Bonds"    => .362,
	        "Helton"   => .347,
	        "Mora"     => .340,
	        "Guerrero" => .337);
	print "Melvin Mora's batting average was ", $AVG{"Mora"}, "\n";

Notice the percent sign (%) in front of the variable name when I define the variable, but the $ when I look up a value. Oh, and notice the way I used print in this example. You can print a set of different things, separated by commas. Running this code produces the following:

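Melvin Mora's batting average was 0.34

(Perl stores .340 as a number, so print shows it as 0.34, with a leading zero and without the trailing zero.)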
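You will often want to walk through everything in an associative array rather than look up one name. The following is a sketch of one way to do that, using Perl's built-in keys, sort, and printf functions; the @names variable is just an illustrative name.

```perl
use strict;
use warnings;

my %AVG = ("Suzuki"   => .372,
           "Bonds"    => .362,
           "Helton"   => .347,
           "Mora"     => .340,
           "Guerrero" => .337);

# keys returns the names; sort puts them in order from highest average to lowest
my @names = sort { $AVG{$b} <=> $AVG{$a} } keys %AVG;

foreach my $name (@names) {
  # %.3f always shows three decimal places, so Mora appears as 0.340
  printf "%-10s %.3f\n", $name, $AVG{$name};
}
```

An associative array has no built-in order, which is why the sketch sorts the names before printing them.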

The last datatype we use in this book is file handles, which we use to reference files. Traditionally, you write file handles in uppercase. Also, they don’t have any symbol in front of them. See the next section for an example of using a file handle.

Control structures.

I hate using that term (it seems so technical), but I can’t think of a better one. Sometimes you want to tell Perl to do one thing if something is true and something else if it’s false. You can use the if {} else {} structure to do this. I’ll build on the previous example here:

	if($AVG{"Bonds"} > .350) {
	  print "Whoah!\n";
	} else {
	  print "Eh.\n";
	}

Of course, this code will print the following:

Whoah!

You can also use loops to repeat things in Perl. There are many different types of loops, but here is a simple example of one:

	$i = 0;
	while ($i < scalar(@positions)) {
	  print $positions[$i], " ";
	  $i = $i + 1;
	}
	print "\n";

Running this program produces the following output:

P C 1B 2B 3B SS LF CF RF DH
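One of the other loop types is foreach, which hands you each item in turn so that you don't have to manage a counter yourself. This is only a sketch of the same loop written that way; the $line variable is an illustrative addition that collects the text before printing it.

```perl
use strict;
use warnings;

my @positions = ("P", "C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH");

my $line = "";
foreach my $position (@positions) {
  # the dot joins two strings together
  $line = $line . $position . " ";
}
print $line, "\n";
```

This should produce the same output as the while loop.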

Comments.

You can put comments in Perl programs that Perl ignores. They start with a pound sign (#). It is good practice to write comments in your programs. Sometimes a piece of code is complicated and difficult to read. It is often hard to remember what a piece of code does and how it works. To avoid confusing yourself and other people later, you can include comments that explain what the code does.
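Here is a short sketch of what commented code can look like. The numbers are illustrative and are not taken from any data file used in this book.

```perl
use strict;
use warnings;

# illustrative numbers for one player's season
my $hits    = 187;
my $at_bats = 550;

# batting average is hits divided by at bats
my $average = $hits / $at_bats;

printf "%.3f\n", $average;  # a comment can also follow a statement on the same line
```

Perl ignores everything from the # to the end of the line, so the comments change nothing about what the program does.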

An Example Program

Almost every Perl script in this book follows the same basic structure, which I show in this section. This program will open an input file specified by the user and write the output to an output file specified by the user. Just to make it (slightly) interesting, this program will print a line number at the beginning of each line.

	16 # check to make sure that there are the right number of arguments
	17 unless ($#ARGV == 1) {die "usage $0 <input file> <output file>\n";}
	18
	19 # open the input and output files
	20 open INFILE, "<$ARGV[0]" or die "couldn't open input file $ARGV[0]: $!\n";
	21 open OUTFILE, ">$ARGV[1]" or die "couldn't open output file $ARGV[1]: $!\n";
	22
	23 $lineno = 1;
	24 while(<INFILE>) {
	25   # loop over each line in the input file
	26   print OUTFILE "$lineno: ", $_;
	27   $lineno++;
	28 }
	29
	30 close INFILE;
	31 close OUTFILE;

A couple of quick notes on this example: first, you’ll notice the weird filename in angle brackets in line 24. The part inside the parentheses is Perl shorthand for “read each line in the INFILE file and assign it to the $_ variable.” The while means “keep doing the following until the thing inside the parentheses equals false.” The stuff inside the parentheses equals false when the file runs out. In lines 20 and 21, you’ll notice the < and > signs added to the filenames. The < tells Perl to open the file for input, and the > tells Perl to open the file for output (replacing whatever was in it). In lines 17, 20, and 21, you’ll see a reference to a list called ARGV. This is automatically set to whatever follows the script name on the command line. In line 17, $#ARGV is the position of the last item in that list, so it equals 1 when there are exactly two arguments. Finally, the ++ in line 27 means “add one to the variable.” Oh, and in line 26, print OUTFILE means write to the OUTFILE file rather than to the screen.
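The same idea can be written in a slightly different style that you will see in a lot of other Perl code. This is a sketch and not the form used in this book: it assumes the three-argument version of open, keeps the file handles in ordinary variables ($in and $out), and wraps the work in a subroutine with the hypothetical name number_lines (subroutines are covered later in this hack).

```perl
use strict;
use warnings;

# number_lines is a hypothetical name; it copies $infile to $outfile,
# putting a line number at the start of each line
sub number_lines {
  my ($infile, $outfile) = @_;

  open(my $in,  "<", $infile)  or die "couldn't open input file $infile: $!\n";
  open(my $out, ">", $outfile) or die "couldn't open output file $outfile: $!\n";

  my $lineno = 1;
  while (my $line = <$in>) {
    print $out "$lineno: ", $line;
    $lineno++;
  }

  close $in;
  close $out;
  return $lineno - 1;  # how many lines were copied
}

# do the work only when two filenames are given on the command line
if (@ARGV == 2) {
  my $count = number_lines($ARGV[0], $ARGV[1]);
  print "numbered $count lines\n";
}
```

Keeping the mode ("<" or ">") separate from the filename means a filename that happens to start with an odd character can't be mistaken for part of the mode.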

Some Not-so-Basic Basics

We’ll use three other Perl features in this book: objects, subroutines, and patterns.

Pattern matching through regular expressions.

When trying to understand files from the Internet, we want to match a lot of patterns. For example, consider the play-by-play calls “Roger Clemens intentionally walks Barry Bonds,” “Keith Foulke intentionally walks Rafael Palmeiro,” and “Jake Westbrook intentionally walks Doug Mientkiewicz.” You probably noticed that each expression fits a pattern: the name of a pitcher followed by the expression “intentionally walks” followed by a batter’s name. A regular expression gives you a concise way to express this relationship.

Let’s start with an example that uses this pattern:

	$playbyplay = "Jake Westbrook intentionally walks Doug Mientkiewicz.";
	if($playbyplay =~ /\w+\s\w+ intentionally walks \w+\s\w+/) {
	  print "found an intentional walk\n";
	}
	}

If you run this code fragment, Perl will print the words “found an intentional walk.” Here is how this works. The expression =~ means “if the variable on the left matches the pattern on the right.” The expression between the slashes, /\w+\s\w+ intentionally walks \w+\s\w+/, is the pattern. You can interpret this pattern as “a bunch of characters, then a space, then some more characters, another space, the words ‘intentionally walks,’ another space, some characters, another space, and some more characters.” More specifically, the \w means “any word character,” the \s means “any space character,” and the + means “at least one of the things to the left.”

See “Use Regular Expressions to Identify Events” [Hack #23] for an overview of regular expressions.

One feature needs an explanation. If you place parentheses around part of an expression, Perl will set a variable to the thing that matched inside the parentheses. These variables are named $1, $2, $3, etc.

This can be useful for extracting things from patterns. Let’s use this technique to extract the name of the pitcher and batter in the earlier example:

	$playbyplay = "Jake Westbrook intentionally walks Doug Mientkiewicz.";
	if($playbyplay =~ /(\w+\s\w+) intentionally walks (\w+\s\w+)/) {
	  print "pitcher: $1, batter: $2\n";
	}
	}

If you run this code fragment, Perl will print:

pitcher: Jake Westbrook, batter: Doug Mientkiewicz
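You can also copy the matched pieces straight into variables of your own instead of reading $1 and $2 afterward. This is a sketch of that variation; the names $pitcher and $batter are illustrative.

```perl
use strict;
use warnings;

my $playbyplay = "Roger Clemens intentionally walks Barry Bonds.";

# when a match is assigned to a list, Perl returns the pieces in parentheses
my ($pitcher, $batter) =
  $playbyplay =~ /(\w+\s\w+) intentionally walks (\w+\s\w+)/;

if (defined $pitcher) {
  print "pitcher: $pitcher, batter: $batter\n";
} else {
  print "no intentional walk found\n";
}
```

If the pattern doesn't match, both variables are left undefined, which is what the defined test checks for.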

Subroutines.

Often, you will want to repeat the same Perl code inside a script. Instead of writing the same few lines over again, you can place them in a subroutine and give the subroutine a name. Here’s a simple example of a subroutine that prints the name of a pitcher and batter when given a play-by-play expression matching the “intentional walk” pattern:

	sub intentional_walk_message($) {
	  # takes one argument: play by play string
	  my ($playbyplay) = (@_);
	  if ($playbyplay =~ /(\w+\s\w+) intentionally walks (\w+\s\w+)/) {
	    print "pitcher: $1, batter: $2\n";
	  }
	  }
	}

In this subroutine, the ($) is called a prototype; it tells Perl that this subroutine takes one scalar argument. The arguments are contained in the list (@_). Finally, the my qualifier tells the script “set the variable $playbyplay only inside this subroutine” so that if another part of the script uses a variable with the same name and the subroutine changes it, there will be no surprises.
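A subroutine can also hand values back to the code that called it, using return. The sketch below assumes a hypothetical subroutine named parse_intentional_walk that returns the two names instead of printing them; the second play in the list is made up to show what happens when the pattern doesn't match.

```perl
use strict;
use warnings;

# parse_intentional_walk is a hypothetical name
sub parse_intentional_walk {
  # takes one argument: play by play string
  my ($playbyplay) = @_;
  if ($playbyplay =~ /(\w+\s\w+) intentionally walks (\w+\s\w+)/) {
    return ($1, $2);
  }
  return;  # an empty list when the pattern doesn't match
}

my @plays = ("Keith Foulke intentionally walks Rafael Palmeiro.",
             "Barry Bonds homers to right field.");  # made-up play

foreach my $play (@plays) {
  my ($pitcher, $batter) = parse_intentional_walk($play);
  if (defined $pitcher) {
    print "pitcher: $pitcher, batter: $batter\n";
  }
}
```

Returning the names lets the calling code decide what to do with them: print them, count them, or write them to a file.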

Modules and packages.

Perl supports a system called packages for grouping together data structures, functions, and variables that are often used together. A module is just a package defined in a file of the same name; I use these terms interchangeably in this book. Modules are a method of grouping together commonly used datatypes and subroutines into a single package. (This is an example of object-oriented programming.) The datatypes are called objects and the subroutines are called methods.

You tell Perl that you want to use a package through the use command. You access a method (subroutine) associated with an object through an expression such as $object->method(). Here is a specific example that opens a file for output using the FileHandle package:

	use FileHandle;
	$fh = new FileHandle;
	$fh->open(">file");

The first line loads the package. The second line creates a new FileHandle object. The third line opens a file for output, associating the file with the FileHandle object. (You can use the FileHandle object in Perl to refer to the file.)
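To show a few more methods in action, here is a sketch that writes a line to a file and reads it back through FileHandle objects. It assumes the standard File::Temp module to make a scratch file, so that trying it out doesn't overwrite anything; the variable names are illustrative.

```perl
use strict;
use warnings;
use FileHandle;
use File::Temp qw(tempfile);

# make a scratch file and close the handle that tempfile opened on it
my ($tmp, $filename) = tempfile();
close $tmp;

# open the file for output, write one line, and close it
my $fh = FileHandle->new;
$fh->open(">$filename") or die "couldn't open $filename: $!\n";
$fh->print("hello world\n");
$fh->close;

# open the same file for input and read the line back
$fh->open("<$filename") or die "couldn't open $filename: $!\n";
my $line = $fh->getline;
$fh->close;

print $line;
unlink $filename;  # remove the scratch file
```

Here print, getline, and close are methods of the FileHandle object, called with the same -> notation as open.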

In this book, we use Perl packages for parsing web pages and reading baseball data. For more information on packages, see the Perl documentation.

Editors

You might be wondering where I write and edit all of these Perl scripts. You could write them in Microsoft Word and save them as text files, but the spellchecking and grammar checks would quickly drive you crazy. There are many, many options for text editors, but I’ll list just a few favorites:[3]

Notepad.exe

The Microsoft Windows Notepad application is a perfectly acceptable text editor. It’s small, fast, and works well for editing simple files.

UltraEdit

I like UltraEdit for editing code on Windows. It has some nice features for syntax highlighting and macros, and some very convenient functions for editing columns of text.

Xcode

If you’re fortunate enough to be using Mac OS X, you probably have the option to use Xcode, Apple’s development environment. It includes some nice Perl editing functionality, including syntax highlighting and automatic indentation.

Komodo

ActiveState, the company that makes ActivePerl, sells a tool called Komodo for editing dynamic languages like Perl. Komodo is an example of an Integrated Development Environment (IDE), and it provides a very sophisticated environment for editing scripts. Versions are currently available for Windows, Linux, and Mac OS. You can learn more about this tool at http://www.activestate.com/Products/Komodo/?tn=1.

Hacking the Hack

The key part of this hack is the short program. All of the Perl programs in this book are variations on this theme. Here are some things you can do with this structure:

Read multiple files, maybe a whole directory of files

I do this in “Make a Historical Play-by-Play Database” [Hack #22].

Don’t blindly copy input to output; do it only if a pattern matches

I show how to do this in “Use Regular Expressions to Identify Events” [Hack #23], and I use this technique in several other hacks.

Keep track of state while you’re looping through a file

I show how to do this in “Load Baseball Data into MySQL” [Hack #20].

Download web pages directly into Perl

That makes Perl much more interesting, doesn’t it? You can use it to automatically get web pages and then do whatever you want with them; read Spidering Hacks (O’Reilly) to learn how to do this. I use this technique in a few different places in this book.

In general, it’s pretty easy to modify Perl scripts to do more stuff. Here are a few tips:

Add more variables in the middle of code

It’s usually safer to add than it is to delete, so just try adding new variables in the middle of code. For example, suppose you have a program that reformats play-by-play data. While you’re reading it, suppose you want to check if a double was hit in the play, and you want to set a variable called $double to 1 if a double was hit and to 0 if not.
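A sketch of what that addition might look like follows. The play text is made up, and the sketch assumes that the word “doubles” appears in the description whenever a double is hit; real play-by-play wording may differ.

```perl
use strict;
use warnings;

# made-up play description; real play-by-play wording may differ
my $playbyplay = "Melvin Mora doubles to left field.";

# start by assuming no double, then change the value if the pattern matches
my $double = 0;
if ($playbyplay =~ /\bdoubles\b/) {  # \b marks the edge of a word
  $double = 1;
}

print "double: $double\n";
```

Because the new lines only read $playbyplay and set a new variable, they don't disturb anything the existing program already does.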

Add more print statements for debugging

If you don’t understand what a program is doing, just add some print statements. Print to the screen what the program is reading in, the value of each variable, and what the program is writing out. This will help you understand what’s going on.

Learn to use the Perl help files

For simple stuff, you can type man perl at a command line. This will tell you other stuff that you can type at a command line to get more specific help, like man perlintro. You can also check out http://www.perl.com (a shameless O’Reilly plug; this comes from me, not my editors). Best of all, you can invest in a Perl book, like O’Reilly’s Programming Perl (yet another shameless plug, but this book is the bible of Perl).



[3] I actually use Emacs instead of these programs.
