Chapter 10. Getting a Handle on Files

Now it is time to see how Perl interacts with files, pipes, and command-line arguments. By the time you have finished this chapter, you should be able to explain the following script.

use feature 'say';
die "Insufficient arguments" if scalar @ARGV < 1;

while(<>){
   say "$ARGV $. $_ ";
   say "x" x 30 and close ARGV if eof;
}

10.1 The User-Defined Filehandle

If you are processing text, you will regularly be opening, closing, reading from, and writing to files. In Perl, we use filehandles to get access to system files.

A filehandle is a name for a file, device, pipe, or socket. In Chapter 4, “Getting a Handle on Printing,” we discussed the three default filehandles, STDIN, STDOUT, and STDERR. Perl also allows you to create your own filehandles for input and output operations on files, devices, pipes, or sockets. A filehandle allows you to associate the filehandle name with a system file and to use that filehandle to access the file.

10.1.1 Opening Files—The open Function

The open function lets you name a filehandle and the file you want to attach to that handle. If the filehandle is an undefined scalar variable, a new anonymous filehandle is created and a reference to it is stored in the variable (this is called autovivification). If the filehandle is an expression, its value is used as the name of the filehandle, that is, as a symbolic reference to it. The file can be opened for reading, writing, or appending (or both reading and writing), and the file can be opened to pipe data to or from a process. The open function returns a nonzero result if successful and the undefined value if it fails. Like scalars, arrays, and labels, filehandles have their own namespace. So that they will not be confused with reserved words, it is recommended that you use lexical scalar variables to hold your filehandles (as we will do in most of the examples in this chapter). If you use a bareword name for your filehandle, it is recommended that it be in capital letters (see the open function in Appendix A, “Perl Builtins, Pragmas, Modules, and the Debugger”).

When text files are opened on Win32 platforms, the \r\n sequence (octal 15 and octal 12, the characters representing return and newline) is translated into \n when the files are read from disk, and the ^Z character is read as an end-of-file (EOF) marker. The following functions for opening files should work fine with text files but will cause a problem with binary files (see Section 10.2.8, “Win32 Binary Files”).

10.1.2 Opening for Reading

The following examples illustrate how to open files for reading with both the older style and modern style. Even though the examples represent UNIX files, they will work the same way on Windows, Mac OS, and other systems.
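
The two styles can be sketched side by side. This is a minimal illustration, not one of the chapter's numbered examples: the scratch file and its two names are made up, and File::Temp is used only so the sketch runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the sketch is self-contained (contents are illustrative).
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "Steve Blenheim\nBetty Boop\n";
close($tmp) or die "Cannot close $path: $!";

# Older style: bareword filehandle and two-argument open.
open(FH, "<$path") or die "Cannot open $path: $!";
my $old_first = <FH>;
close(FH);

# Modern style: lexical filehandle and three-argument open.
open(my $fh, '<', $path) or die "Cannot open $path: $!";
my $new_first = <$fh>;
close($fh);

print "Older style read:  $old_first";
print "Modern style read: $new_first";
```

The three-argument form keeps the mode separate from the filename, so a name that happens to begin with a character such as > or | cannot change how the file is opened.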

Closing the Filehandle

The close function closes the file, pipe, socket, or device attached to FILEHANDLE. Once FILEHANDLE is opened, it stays open until it goes out of scope, the script ends, or you call the open function again. (The next call to open closes FILEHANDLE before reopening it.) If you reopen the file this way without explicitly closing it first, the line counter variable, $., will not be reset. Closing a pipe causes the process to wait until the command on the other end of the pipe has finished, and the exit status of that command is reported in the $? variable; if close itself fails, the system error is in $! (see the following section, “The die Function,” for more about the $! variable). It’s a good idea to explicitly close files and handles after you are finished using them, although a lexically scoped scalar used as the filehandle will be closed automatically as soon as it goes out of scope.

The die Function

In the following examples, the die function is used if a call to the open function fails. If Perl cannot open the file, the die function is used to exit the Perl script and print a message to STDERR, usually the screen.

If you were to go to your shell or MS-DOS prompt and type

cat junk  (UNIX)

or

type junk (DOS)

and if junk is a nonexistent file, the following system error would appear on your screen:

cat: junk: No such file or directory    (UNIX "cat" command)
The system cannot find the file specified.   (Windows "type" command)

When using the die function, Perl provides a special variable $! to hold the value of the system error that occurs when you are unable to successfully open a file or execute a system utility. This is very useful for detecting a problem with the filehandle before continuing with the execution of the script.
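
A minimal sketch of the open-or-die idiom follows. The filename is hypothetical and assumed not to exist, and the eval block is an addition for illustration only: it traps the die so the message can be displayed instead of ending the script.

```perl
use strict;
use warnings;

my $file = 'junk_file_that_should_not_exist';   # hypothetical, assumed to be missing

# eval traps the die so this sketch can show the message rather than exit.
my $ok = eval {
    open(my $fh, '<', $file) or die "Can't open $file: $!\n";
    1;
};
my $error = $@;

print "die would have sent to STDERR: $error" unless $ok;
```

In a real script you would normally leave out the eval and let die end the program. Ending the message with a newline suppresses the script name and line number that die otherwise appends.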

10.1.3 Reading from a File and Scalar Assignment

The Filehandle and $_

In Example 10.4, a file called datebook is opened for reading. Each line read is assigned, in turn, to $_, the default scalar that holds what was just read until the end of file is reached.
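
The pattern can be sketched as follows. This is not the book's Example 10.4; a two-record scratch file with made-up contents stands in for datebook.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the datebook file; the records are illustrative.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "Steve Blenheim:95400\nBetty Boop:63200\n";
close($tmp) or die "Cannot close $path: $!";

open(my $fh, '<', $path) or die "Cannot open $path: $!";
my $count = 0;
while (<$fh>) {          # each line read is assigned to $_
    print if /Betty/;    # both print and the pattern match default to $_
    $count++;
}
close($fh);

print "Read $count lines\n";
```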

The Filehandle and a User-Defined Scalar Variable

In addition to the default $_ variable, Perl allows you to create your own user-defined scalar variables to hold input from a file.

“Slurping” a File into an Array

When assigning input from a file to an array, Perl takes each line (ending in \n) as an element of the array, “slurping” up each line of the file and adding it to the array until end of file is reached.
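
A short sketch of slurping, assuming an in-memory string opened as a file (a standard feature of open) so that no real file is needed; the names are made up.

```perl
use strict;
use warnings;

# An in-memory "file" keeps the sketch self-contained.
my $data = "Steve Blenheim\nBetty Boop\nLori Gortz\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @lines = <$fh>;      # list context: every remaining line is read at once
close($fh);

chomp @lines;           # strip the trailing newline from each element
print "Slurped ", scalar(@lines), " lines; the last is '$lines[-1]'\n";
```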

Using map to Create Fields from a File

You can use the map function in conjunction with the split function to break up input into several elements of an array.
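
One way this might look, assuming colon-separated records (the record layout and values are illustrative) read from an in-memory file:

```perl
use strict;
use warnings;

my $data = "Steve Blenheim:95400\nBetty Boop:63200\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# For every line: remove the newline, then split on colons.
# map flattens the pieces from all lines into a single list.
my @fields = map { chomp; split /:/ } <$fh>;
close($fh);

print join('|', @fields), "\n";
```

Because map flattens its results, the name and salary of each record end up as consecutive elements of one array rather than as nested lists.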

Slurping a File into a String with the read Function

The read function allows you to read in a specified number of characters, and put them in a variable. It returns the number of characters that were read. If you know the size of a file, you can read the entire file into a string, as shown in the next example.
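
A minimal sketch, assuming the -s file test operator supplies the size and a scratch file stands in for real data. The binmode calls are an added precaution so that the byte count on disk matches the number of characters read on every platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);                       # no newline translation while writing
print {$tmp} "line one\nline two\n";
close($tmp) or die "Cannot close $path: $!";

my $size = -s $path;                 # size of the file in bytes
open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode($fh);

my $contents;
my $got = read($fh, $contents, $size);   # returns how many characters were read
close($fh);

print "Read $got of $size bytes\n";
```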

10.1.4 Loading a Hash from a File

Loading a hash from a file requires selecting what will be the key and what will be the value for the hash. Since keys must be unique, this method can be used for removing duplicate entries based on a key.
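
As a sketch of the idea, assume each record is name:salary and the name is chosen as the key; the data, including the deliberate duplicate, is made up.

```perl
use strict;
use warnings;

my $data = "Betty Boop:63200\nSteve Blenheim:95400\nBetty Boop:63200\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my %salary;
while (my $line = <$fh>) {
    chomp $line;
    my ($name, $pay) = split /:/, $line;
    $salary{$name} = $pay;      # a repeated name overwrites the earlier entry
}
close($fh);

print "$_ earns $salary{$_}\n" for sort keys %salary;
```

Three lines go in but only two keys come out, which is the duplicate removal the text describes.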

10.2 Reading from STDIN

The three filehandles STDIN, STDOUT, and STDERR, as you may recall, are names given to three predefined streams, stdin, stdout, and stderr. By default, these filehandles are associated with your terminal. When printing output to the terminal screen, STDOUT is used. When printing errors, STDERR is used. When assigning user input to a variable, STDIN is used.

The Perl <> input operator encloses the STDIN filehandle so that the next line of standard input can be read from the terminal keyboard and assigned to a variable. Unlike the shell and C operations for reading input, Perl retains the newline on the end of the string when reading a line from standard input. If you don’t want the newline, then you have to explicitly remove it, or “chomp” it off (see the following Section 10.2.2, “The chop and chomp Functions”).

10.2.1 Assigning Input to a Scalar Variable

When reading input from the filehandle STDIN, if the context is scalar, one line of input is read, including the newline, and assigned to a scalar variable as a single string.

10.2.2 The chop and chomp Functions

The chop function removes the last character in a scalar variable and the last character of each element in an array. Its return value is the character it chopped. The chop function is used primarily to remove the newline from the line of input coming into your program, whether it is STDIN, a file, or the result of command substitution. When you first start learning Perl, the trailing newline can be a real pain!

The chomp function was introduced in Perl 5 to remove the last character in a scalar variable and the last character of each element in an array only if that character is the newline (or, to be more precise, the character that represents the input line separator, initially defined as a newline and stored in the $/ variable). It returns the number of characters it chomped. Using chomp instead of chop protects you from inadvertently removing some character other than the newline.
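
The difference shows up when a string has no trailing newline. A small sketch with made-up strings:

```perl
use strict;
use warnings;

my $with_newline = "Ellie\n";
my $removed = chop $with_newline;     # removes the newline and returns it

my $no_newline = "Ellie";
chop $no_newline;                     # removes the final letter instead

my $line = "Ellie\n";
my $n = chomp $line;                  # removes the newline, returns 1

my $plain = "Ellie";
my $m = chomp $plain;                 # nothing to remove, returns 0

print "chop without a newline leaves: $no_newline\n";
print "chomp removed $n character from one string and $m from the other\n";
```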

10.2.3 The read Function

The read function[1] allows you to read a number of characters into a variable from a specified filehandle. (The first character is character 0.) If reading from standard input, the filehandle is STDIN. The read function returns the number of bytes that were read. You will normally use this function with files or reading input from a server using CGI. To read the entire file you will need to know the size in bytes of that file.

1. The read function is similar to the fread function in the C language.

10.2.4 The getc Function

The getc function gets a single character from the keyboard or from a file. At EOF, getc returns the undefined value.
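
A brief sketch, reading from an in-memory file rather than the keyboard so it can run unattended. The loop checks the return value with defined, which is how perlfunc documents end of file being signalled for getc.

```perl
use strict;
use warnings;

my $data = "yes\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my $first = getc($fh);                     # one character: "y"

my @rest;
while (defined(my $char = getc($fh))) {    # stop at end of file
    push @rest, $char;
}
close($fh);

print "First character: $first\n";
print "Characters remaining after it: ", scalar(@rest), "\n";
```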

10.2.5 Assigning Input to an Array

When reading input from the filehandle STDIN, if the context is list (for example, assignment to an array), then each line is read with its newline and is treated as a single list item, and reading continues until you press <CTRL>+D (in UNIX) or <CTRL>+Z (in Windows) for end of file (EOF). Normally, you will not assign input to an array, because it could eat up a large amount of memory, or because the user of your program may not realize that he should press <CTRL>+D or <CTRL>+Z to stop reading input.

10.2.6 Assigning Input to a Hash

Reading input from STDIN and assigning it to a hash is like reading from a file. The line read can be assigned as a value corresponding to a hash key or as the key itself.

10.2.7 Opening for Writing

When opening a file for writing, the file will be created if it does not exist, and if it already exists, it must have write permission. If the file exists, its contents will be overwritten. The filehandle is used to access the system file.
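
The overwrite behavior can be seen in a short sketch. The file name is hypothetical, and a temporary directory is used so nothing of yours is clobbered.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/newfile";                 # hypothetical name; does not exist yet

open(my $out, '>', $path) or die "Cannot create $path: $!";   # creates the file
print {$out} "first version\n";
close($out) or die "Cannot close $path: $!";

open($out, '>', $path) or die "Cannot open $path: $!";        # truncates the file
print {$out} "second version\n";
close($out) or die "Cannot close $path: $!";

open(my $in, '<', $path) or die "Cannot open $path: $!";
my @lines = <$in>;
close($in);

print "File now holds: @lines";
```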

10.2.8 Win32 Binary Files

Win32 distinguishes between text and binary files. If ^Z is found, the program may abort prematurely or have problems with the newline translation. When reading and writing Win32 binary files, use the binmode function to prevent these problems. The binmode function arranges for a specified filehandle to be read or written to in either binary (raw) or text mode. If the discipline argument is not specified, the mode is set to “raw.” The discipline is one of :raw, :crlf, :text, :utf8, :latin1, and so forth.

10.2.9 Opening for Appending

When opening a file for appending, the file will be created if it does not exist, and if it already exists, it must have write permission. If the file exists, its contents will be left intact, and the output will be appended to the end of the file. Again, the filehandle is used to access the file rather than accessing it by its real name.
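
A matching sketch for append mode, again with a hypothetical file in a temporary directory. The same open call is made twice, and both lines survive.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/logfile";                 # hypothetical name

for my $entry ('first entry', 'second entry') {
    # >> creates the file on the first pass and appends on the second.
    open(my $out, '>>', $path) or die "Cannot append to $path: $!";
    print {$out} "$entry\n";
    close($out) or die "Cannot close $path: $!";
}

open(my $in, '<', $path) or die "Cannot open $path: $!";
my @lines = <$in>;
close($in);

print "File holds ", scalar(@lines), " lines\n";
```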

10.2.10 The select Function

The select function sets the default output to the specified filehandle and returns the previously selected filehandle. All printing will go to the selected handle. Once you use select, you must remember to reset your default output to STDOUT or all output from your script will continue to be sent to the “selected” filehandle.
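
A minimal sketch of selecting a handle and then restoring the previous one. The output handle here writes to an in-memory string, an assumption made only so the result can be displayed afterward.

```perl
use strict;
use warnings;

my $buffer = '';
open(my $fh, '>', \$buffer) or die "Cannot open in-memory file: $!";

my $previous = select($fh);          # $fh is now the default for print
print "sent to the selected handle\n";
select($previous);                   # restore the earlier default (STDOUT)
close($fh);

print "Back on STDOUT. Captured: $buffer";
```

Saving the return value of select and passing it back later restores whatever handle was selected before, which is safer than assuming it was STDOUT.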

10.2.11 File Locking with flock

To prevent two programs from writing to a file at the same time, you can lock the file so you have exclusive access to it, and then unlock it when you’re finished using it. The flock function takes two arguments: a filehandle and a file-locking operation. The operations are listed in Table 10.1.

Table 10.1 File-Locking Operations

Read permission is required on a file to obtain a shared lock, and write permission is required to obtain an exclusive lock. With operations 1 and 2, normally the caller requesting the file will block (wait) until the file is unlocked. If a nonblocking lock is used on a filehandle, an error is produced immediately if a request is made to get the locked file. (See Fcntl.pm for a better implementation of locks.)
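
A minimal sketch of taking and releasing an exclusive lock, using the symbolic names LOCK_SH, LOCK_EX, LOCK_NB, and LOCK_UN exported by the Fcntl module (conventionally the numbers 1, 2, 4, and 8). The scratch file is an assumption for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);                 # imports LOCK_SH, LOCK_EX, LOCK_NB, LOCK_UN
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);

# Blocks until the exclusive lock is granted.
# LOCK_EX | LOCK_NB would return false at once instead of waiting.
flock($fh, LOCK_EX) or die "Cannot lock $path: $!";
print {$fh} "only one writer at a time\n";
flock($fh, LOCK_UN) or die "Cannot unlock $path: $!";

close($fh) or die "Cannot close $path: $!";
print "Wrote ", -s $path, " bytes under an exclusive lock\n";
```

These locks are advisory: they only coordinate programs that also call flock on the same file.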

10.2.12 The seek and tell Functions

The seek Function

Seek allows you to randomly access a file. The seek function is the same as the fseek standard I/O function in C. Rather than closing the file and then reopening it, the seek function allows you to move to some byte (not line) position within the file. The seek function returns 1 if successful, 0 otherwise.

The seek function sets a position in a file, where the first byte is 0. Positions are as follows:

• 0 = Beginning of the file

• 1 = Current position in the file

• 2 = End of the file

The offset is the number of bytes from the file position. A positive offset moves the position forward in the file; a negative offset moves the position backward in the file for position 1 or 2.

The od command lets you look at how the characters in a file are stored. This file was created on a Win32 platform, where each line ends in \r\n; on UNIX systems, the linefeed/newline is one character, \n.

$ od -c db
0000000000   S   t   e   v   e       B   l   e   n   h   e   i   m  \r  \n
0000000020   B   e   t   t   y       B   o   o   p  \r  \n   L   o   r   i
0000000040       G   o   r   t   z  \r  \n   S   i   r       L   a   n   c
0000000060   e   l   o   t  \r  \n   N   o   r   m   a       C   o   r   d
0000000100  \r  \n   J   o   n       D   e   L   o   a   c   h  \r  \n   K
0000000120   a   r   e   n       E   v   i   c   h  \r  \n
0000000134

The tell Function

The tell function returns the current byte position in the file and is used with the seek function to move to that position in the file. If FILEHANDLE is omitted, tell returns the position of the file last read.
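
The two functions can be sketched together. The file contents are made up, and binmode is applied as a precaution so byte offsets are the same on every platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "Steve Blenheim\nBetty Boop\n";
close($tmp) or die "Cannot close $path: $!";

open(my $fh, '<', $path) or die "Cannot open $path: $!";
binmode($fh);

my $first = <$fh>;                 # read the first line
my $pos   = tell($fh);             # byte offset where the second line starts

seek($fh, -5, 2) or die "Cannot seek: $!";     # 5 bytes back from the end (position 2)
my $tail = <$fh>;

seek($fh, $pos, 0) or die "Cannot seek: $!";   # offset from the beginning (position 0)
my $second = <$fh>;
close($fh);

print "Second line starts at byte $pos\n";
print "Last five bytes: $tail";
print "Second line: $second";
```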

10.2.13 Opening for Reading and Writing

Two files are used in the next example: a text file, called visitor.txt, which has an initial value of 1 as its only text, and countem.pl, the script that will be used to track the number of users who have run the script.

The visitor_count file is a Perl script that will add one to the visitor.txt file every time the script is executed.
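
The counter logic might be sketched as follows. This is not the book's script: the file lives in a temporary directory so the sketch can run anywhere, and it relies on +< opening an existing file for both reading and writing without truncating it (+> would empty the file first).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/visitor.txt";

# Seed the counter file with an initial value of 1.
open(my $init, '>', $file) or die "Cannot create $file: $!";
print {$init} "1\n";
close($init) or die "Cannot close $file: $!";

open(my $fh, '+<', $file) or die "Cannot open $file: $!";   # read and write
my $count = <$fh>;
chomp $count;
$count++;

seek($fh, 0, 0)   or die "Cannot rewind $file: $!";   # back to the first byte
truncate($fh, 0)  or die "Cannot truncate $file: $!"; # discard the old value
print {$fh} "$count\n";
close($fh) or die "Cannot close $file: $!";

open(my $check, '<', $file) or die "Cannot open $file: $!";
my $stored = <$check>;
close($check);
print "Counter file now holds: $stored";
```

If several copies of such a script could run at once, the read-increment-write sequence would also need a lock around it.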

Table 10.2 Reading and Writing Operations

10.2.14 Opening for Anonymous Pipes

When using a pipe (also called a filter), a connection is made from one program to another. The program on the left-hand side of a pipe symbol sends its output into a temporary buffer and writes into it. On the other side of the pipe is a program that is a reader. It gets its input from the buffer. Here is an example of a typical UNIX pipe (see Figure 10.1):

who | wc -l

and an MS-DOS pipe:

dir /b | more

Figure 10.1 UNIX pipe example.

The output of the who command is sent to the wc command. The who command sends its output to the pipe; meaning, it writes to the pipe. The wc command gets its input from the pipe; it reads from the pipe. (If the wc command were not a reader, it would ignore what is in the pipe.) The output of the wc command is finally sent to the STDOUT, the terminal screen. The number of people logged on is printed.

When a Perl pipe is opened, the operating system command is either on the left-hand side or right-hand side of the pipe. For example, if you see | sort, the OS command is on the right side of the pipe symbol. There is nothing on the left side, which implies that Perl is there and Perl is the writer. Perl sends its output to the pipe and the sort command reads from it. On the other hand, if you see ls | or dir |, the OS command is on the left-hand side of the pipe, implying that Perl is on the right-hand side, making Perl the reader.

(It is important to keep in mind that the process connecting to Perl is an operating system command. If you are running Perl on a UNIX or Linux system, the commands will be different from those on a Windows system, thereby making Perl scripts implementing pipes unportable between systems.)

The Output Filter

When creating a handle with the open function, you can open a filter so that the output is piped to a system command. The command is preceded by a pipe symbol (|) and replaces the filename argument in the previous examples. The output will be piped to the command and sent to STDOUT (see Figure 10.2). You can use the two-argument format or the three-argument format, as shown in the following example. With the three-argument format, your shell (bash, korn, and so on) may be avoided and, thus, shell wildcard expansion, redirection, and multistage pipelines will not be handled.
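
A hedged sketch of an output filter in the three-argument form. To avoid depending on a particular OS command, the filter here is a second copy of Perl (found through the $^X variable) running a one-line sorting program; that substitution is an assumption for portability, and in practice the command would more often be something like sort.

```perl
use strict;
use warnings;

# '|-' means Perl writes and the command reads. With a list of arguments
# after the mode, no shell is involved.
open(my $sorter, '|-', $^X, '-e', 'print sort <STDIN>')
    or die "Cannot start filter: $!";

print {$sorter} "$_\n" for qw(pear apple fig);

my $closed = close($sorter);   # waits for the filter to finish
my $status = $?;               # exit status of the filter

print "Filter finished with status $status\n";
```

The sorted lines appear on STDOUT because the filter inherits it from the script.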

Figure 10.2 Perl output filter.

Sending the Output of a Filter to a File

In the previous example, what if you had wanted to send the output of the filter to a file instead of to STDOUT? You can’t send output to a pipe and a filehandle at the same time, but you can redirect STDOUT to a filehandle. Since, later in the program, you may want STDOUT to be redirected back to the screen, you can first save it or simply reopen STDOUT to the terminal device by typing

open(STDOUT, ">/dev/tty");

The following example can better be accomplished by using Capture::Tiny from CPAN. Capture::Tiny fixes pitfalls, including accidentally clobbering someone else’s global filehandles.

Input Filter

When creating a filehandle with the open function, you can also open a filter so that input is piped into Perl. The OS shell normally handles any special characters that need interpretation during the processing.

If you don’t have any need for the shell to process the command in the pipe (meaning you aren’t using redirection, wildcard expansion, or multiple pipes), you can use the three-argument format as previously shown in Example 10.30. See Figure 10.3.
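
A corresponding sketch of an input filter in the three-argument form. As an assumption for portability, the command is again a second copy of Perl (via $^X) that prints three words; a real script would more likely run ls, dir, or a similar OS command.

```perl
use strict;
use warnings;

# '-|' means the command writes and Perl reads. The list form bypasses the shell.
open(my $reader, '-|', $^X, '-e', 'print "$_\n" for qw(pear apple fig)')
    or die "Cannot start command: $!";

my @lines = <$reader>;          # read everything the command printed
chomp @lines;
close($reader) or die "Command failed with status $?";

print "Received: @lines\n";
```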

Figure 10.3 Perl input filter.

10.3 Passing Arguments

How does Perl pass command-line arguments to a Perl script? If you are coming from a C, C++, awk, or C shell background, at first glance you might think, “Oh, I already know this!” Beware! There are some subtle differences. So, read on.

10.3.1 The @ARGV Array

Perl does store arguments in a special array called @ARGV. The subscript starts at zero and, unlike C and awk, $ARGV[0] does not represent the name of the program; it represents the name of the first word after the script name. Like the shell languages, the $0 special variable is used to hold the name of the Perl script. Unlike the C shell, the $#ARGV expression contains the number of the last subscript in the array, not the number of elements in the array. The number of arguments is $#ARGV + 1. $#ARGV initially has a value of -1. To get the size of the @ARGV array, it is easier to just say scalar @ARGV.
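
These relationships can be checked with a few lines. The assignment to @ARGV simulates a command line such as perl script.pl alpha beta gamma and is there only so the sketch behaves the same however it is run.

```perl
use strict;
use warnings;

@ARGV = qw(alpha beta gamma);     # simulated command-line arguments

my $count = scalar @ARGV;         # number of arguments
my $last  = $#ARGV;               # subscript of the last argument

print "Script name:         $0\n";
print "First argument:      $ARGV[0]\n";
print "Number of arguments: $count\n";
print "Last subscript:      $last\n";
```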

When ARGV, the filehandle, is enclosed in angle brackets, <ARGV>, the command-line argument is treated as a filename. The filename is assigned to ARGV and the @ARGV array is shifted immediately to the left by one, thereby shortening the @ARGV array.

The value that is shifted off the @ARGV array is assigned to $ARGV, so $ARGV contains the name of the file currently being read through the ARGV filehandle. See Figure 10.4.

Figure 10.4 The many faces of ARGV.

10.3.2 ARGV and the Null Filehandle

When used in loop expressions and enclosed in the input angle brackets (<>), each element of the @ARGV array is treated as the name of a file to open. Perl shifts through the array, storing each element in turn in the variable $ARGV. A set of empty angle brackets is called the null filehandle; with it, Perl implicitly uses each element of the @ARGV array as a file to read through the ARGV filehandle. When using the input operator, either with or without the keyword ARGV (<> or <ARGV>), Perl shifts through its arguments one at a time, allowing you to process each argument in turn. Because the arguments are shifted off as each file is opened, they must be saved in another array if they are to be used later.

10.3.3 The eof Function

The eof function can be used to test if end of file has been reached. It returns 1 if either the next read operation on a FILEHANDLE is at the end of the file, or the file was not opened. Without an argument, the eof function returns the eof status of the last file read. The eof function with parentheses can be used in a loop block to test the end of file when the last filehandle has been read. Without parentheses, each file opened can be tested for end of file.
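
A sketch that ties the null filehandle, $ARGV, $., and eof together. The two scratch files and the assignment to @ARGV are assumptions standing in for files named on a real command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my @files;
for my $name (qw(first second)) {
    my $path = "$dir/$name.txt";
    open(my $out, '>', $path) or die "Cannot create $path: $!";
    print {$out} "$name line 1\n$name line 2\n";
    close($out) or die "Cannot close $path: $!";
    push @files, $path;
}

@ARGV = @files;                   # simulate: perl script.pl first.txt second.txt
my @seen;
while (my $line = <>) {
    chomp $line;
    push @seen, "$.:$line";       # $. is the current line number
    if (eof) {                    # true at the end of each individual file
        print "-- end of $ARGV\n";
        close(ARGV);              # closing resets $. for the next file
    }
}

print "$_\n" for @seen;
print "\@ARGV now holds ", scalar(@ARGV), " elements\n";
```

Without the close, $. would keep counting across both files instead of restarting at 1.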

10.3.4 The -i Switch—Editing Files in Place

The -i option is used to edit files in place. The files are named at the command line and stored in the @ARGV array. Perl will automatically rename the output file to the same name as the input file. The output file will be the selected default file for printing. To ensure that you keep a backup of the original file, you can specify an extension to the -i flag, such as -i.bak. The original file will be renamed filename.bak. The file must be assigned to the ARGV filehandle when it is being read from. Multiple files can be passed in from the command line and each, in turn, will be edited in place.
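
The same behavior can be reached from inside a script through the $^I variable, which holds the backup extension that -i would have been given. The following is a sketch under that assumption, with a hypothetical file in a temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/names.txt";              # hypothetical file name
open(my $out, '>', $path) or die "Cannot create $path: $!";
print {$out} "betty boop\nsteve blenheim\n";
close($out) or die "Cannot close $path: $!";

{
    # Comparable to running: perl -i.bak -pe 's/^(\w)/\u$1/' names.txt
    local ($^I, @ARGV) = ('.bak', $path);
    while (<>) {
        s/^(\w)/\u$1/;    # capitalize the first letter of each line
        print;            # output goes to the rewritten file, not the screen
    }
}

open(my $in, '<', $path) or die "Cannot open $path: $!";
my @edited = <$in>;
close($in);

print "Edited file:\n", @edited;
print "Backup kept as names.txt.bak\n" if -e "$path.bak";
```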

10.4 File Testing

Like the shells, Perl provides a number of file test operators (see Table 10.3) to check for the various attributes of a file, such as existence, access permissions, directories, and so on. Most of the operators return 1 for true and “” (the empty string) for false.

Table 10.3 File Test Operators

A single underscore can be used to represent the name of the file if the same file is tested more than once. The stat structure of the previous file test is used.
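
A short sketch of a few common operators and the underscore shortcut. The scratch file is an assumption; only -e, -r, -s, and -d are shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "some text\n";
close($tmp) or die "Cannot close $path: $!";

my $exists   = -e $path;    # performs the stat on the file
my $readable = -r _;        # _ reuses the stat information from the previous test
my $size     = -s _;        # size in bytes (false if the file is empty)
my $is_dir   = -d _;        # empty string here: a plain file, not a directory

print "File exists\n"           if $exists;
print "File is readable\n"      if $readable;
print "File holds $size bytes\n";
print "File is not a directory\n" unless $is_dir;
```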

10.5 What You Should Know

1. What is a filehandle?

2. What does it mean to open a file for reading?

3. When opened for writing, if the file exists, what happens to it?

4. How does > differ from >> when opening a file?

5. What is the purpose of the select() function?

6. What is binmode?

7. What does the die() function accomplish when working with files?

8. How do Windows and UNIX differ in how they terminate a line?

9. What is an exclusive lock?

10. What does the tell() function return?

11. What is the difference between the +< and +> symbols?

12. What does the stat() function do?

13. How do you reposition the file pointer in a file?

14. How does the -M switch work when testing a file?

10.6 What’s Next?

Until this point, all the functions you have used were provided by Perl. The print(), printf(), push(), pop(), and chomp() functions are all examples of built-in Perl functions. All you had to know was what they were supposed to do and how to use them. You did not have to know what the Perl authors did to make the function work; you just assumed they knew what they were doing. In the next chapter, you will write your own functions, also called subroutines, and learn how to send messages to them and return some result.

Exercise 10: Getting a Handle on Things

Part 1

1. Create a filehandle for reading from the datebook file (on the CD); print to another filehandle the names of all those who have a salary greater than $50,000.

2. Ask the user to input data for a new entry in the datebook file. (The name, phone, address, and so on, will be stored in separate scalars.) Append the new line to the datebook file by using a user-defined filehandle.

Part 2

This problem appeared on a Web site called daniweb.com. Can you solve it?

1. We need a Perl program that will check whether or not an IP address entered by a user is valid. The user is to enter the IP address as a command-line parameter. For example, the user could type at the prompt

check_ip.pl 192.168.9.23

and the script will attempt to validate the IP address 192.168.9.23.

2. The script must first check whether the user has input any data and if not, display an appropriate error message. A valid IP address must have:

a. Four octets, each separated by a dot.

b. Only numbers are allowed in each of the four octets (meaning, no alphabetic or punctuation characters are allowed within each octet).

c. The first octet values are between 1 and 255. The second, third, and fourth octet values are between 0 and 255. Only one IP Address is to be input and validated (meaning, there is no looping through several IP addresses).

Part 3

1. Use a pipe to list all the files in your current directory, and print only those files that are readable text files. Use the die function to quit if the open fails. For UNIX users, the command is ls. For Windows use dir /b. (Hint: Don’t forget to chomp!)

2. Rewrite the program to test whether any of the files listed have been modified in the last 12 hours. Print the names of those files.

Part 4

1. Sort the datebook file by names, using a pipe.

Part 5

1. Create a number of duplicate entries in the datebook file. Fred Fardbarkle, for example, might appear five times, and Igor Chevsky three times. In most editors, this will be a simple copy/paste operation.

a. Write a program that will assign the name of the datebook file to a scalar and check to see if the file exists. If it does exist, the program will check to see if the file is readable and writeable. Use the die function to send any errors to the screen. Also tell the user when the datebook was last modified.

b. The program will read each line of the datebook file giving each person a 10% raise in salary. If, however, the person appears more than once in the file (assume having the same first and last name means it is a duplicate), he will be given a raise the first time, but if he appears again, he will be skipped. Send each line of output to a file called raise. The raise file should not contain any person’s name more than once. It will also reflect the 10% increase in pay. Display on the screen the average salary for all the people in the datebook file. For duplicate entries, print the names of those who appeared in the file more than once, and how many times each appeared.

2. Write a script called checking that will take any number of filenames as command-line arguments and will print the names of those files that are readable and writeable text files. The program will print an error message if there are no arguments, and exit.