Chapter 8. Creating Command Pipelines

The designers of Unix created an operating system with a philosophy that remains valid to this day. They established the following:

  • Everything is a file. Devices are represented as special files, as are networking connections and plain old normal files.

  • Each process runs in an environment. This environment includes standard files for input, output, and errors.

  • Unix has many small commands, each of which was designed to perform one task and to do that task well. This saves on memory and processor usage. It also leads to a more elegant system.

  • These small commands were designed to accept input from the standard input file and send output to the standard output file.

  • You can combine these small commands into more complex commands by creating command pipelines.

This chapter delves into these concepts from the perspective of shell scripts. Because shell scripts were designed to call commands, the ability to create command pipelines, thereby making new, complex commands from the simple primitive commands, provides you with extraordinary power. (Be sure to laugh like a mad scientist here.)

This chapter covers how you can combine commands and redirect the standard input, output, and errors, as well as pipe commands together.

Working with Standard Input and Output

Every process on Unix or a Unix-like system is provided with three open files, each referred to by a number called a file descriptor. These files are the standard input, output, and error files. By default:

  • Standard input is the keyboard, abstracted as a file to make it easier to write scripts and programs.

  • Standard output is the shell window or terminal from which the script runs, abstracted as a file to again make writing scripts and programs easier.

  • Standard error is the same as standard output: the shell window or terminal from which the script runs.

When your script calls the read command, for example, it reads data from the standard input file. When your script calls the echo command, it sends data to the standard output file.
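
For example, a two-line script makes both calls visible. Here is a minimal sketch (the script name greet.sh is just for illustration):

#!/bin/sh
# greet.sh: read one line from standard input,
# then write a message to standard output.
read name
echo "Hello, $name"

Run the script, type a name, and it echoes a greeting:

$ sh greet.sh
Alice
Hello, Alice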

A file descriptor is simply a number that refers to an open file. By default, file descriptor 0 (zero) refers to standard input, often abbreviated as stdin. File descriptor 1 refers to standard output (stdout), and file descriptor 2 refers to standard error (stderr). These numbers become important when you need to access a particular file, especially when you want to redirect these files to other locations. File descriptor numbers go up from zero.

Redirecting Standard Input and Output

Because the keyboard and shell window are treated as files, redirecting a script's input or output is easy. That is, you can send the output of a script or a command to a file instead of to the shell window. Similarly, you can change the input of a script or command to come from a file instead of the keyboard. To do this, you create commands with a special > or < syntax.

To review, the basic syntax for a command is:

command options_and_arguments

The options are items such as -l for a long file listing (for the ls command). Arguments are items such as file names.

To redirect the output of a command to a file, use the following syntax:

command options_and_arguments > output_file

To redirect the input of a command to come from a file, use the following syntax:

command options_and_arguments < input_file

You can combine both redirections with the following syntax:

command options_and_arguments < input_file > output_file

You can use this syntax within your scripts or at the command line.
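
For example, the following commands save a directory listing to a file, feed that file to the wc command as input, and then combine both redirections to write a sorted copy (the file names here are arbitrary):

$ ls /usr/bin > listing.txt
$ wc -l < listing.txt
$ sort < listing.txt > sorted.txt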

Redirecting Standard Error

In addition to redirecting the standard input and output for a script or command, you can redirect standard error. Even though standard error goes, by default, to the same place as standard output (the shell window or terminal), there are good reasons why stdout and stderr are treated separately. The main reason is that when you redirect the output of a command to a file, you would otherwise have no way of knowing whether an error occurred: any error messages would be buried in the file. Keeping stderr separate from stdout allows the error messages to appear on your screen while the output still goes to the file.

To redirect stderr from a command to a file, use the following syntax:

command options_and_arguments 2> output_file

The 2 in 2> refers to file descriptor 2, the descriptor number for stderr.
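
For example, assuming your system has no directory named /fred, the following command captures the resulting error message in a file instead of displaying it in the shell window:

$ ls /fred 2> errors.txt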

The C shell uses a different syntax for redirecting standard error. See the next section for more on this.

Redirecting Both Standard Output and Standard Error

In the Bourne shell (as well as Bourne-shell derivatives such as bash and ksh), you can redirect stderr to the same location as stdout in a number of ways. You can also redirect standard error to a separate file. As part of this, you need to remember that the file descriptors for the standard files are 0 for stdin, 1 for stdout, and 2 for stderr.
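
For example, the following lines sketch the most common forms. The first sends stdout to a file and then sends stderr to the same place as stdout (note that 2>&1 must appear after the output redirection); the second is a bash shorthand for the first; the third sends stderr to its own file. The last line shows the C shell form mentioned previously, which redirects stdout and stderr together:

$ command > results.txt 2>&1
$ command &> results.txt
$ command > results.txt 2> errors.txt
% command >& results.txt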

Appending to Files

The > operator can be quite destructive. Each time you run a command redirecting stdout to a file with >, the file will be truncated and replaced by any new output. In many cases, you'll want this behavior because the file will contain just the output of the command. But if you write a script that outputs to a log file, you typically don't want to destroy the log each time. This defeats the whole purpose of creating a log.

To get around this problem, you can use the >> operator to redirect the output of a command, but append to the file, if it exists. The syntax follows:

command >> file_to_append

The shell will create the file if it does not exist.
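
For example, run twice, the following command creates script.log on the first run and adds a second line on the second run:

$ date >> script.log
$ date >> script.log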

Truncating Files

You can use a shorthand syntax for truncating files by omitting the command before the > operator. The syntax follows:

> filename

You can also use an alternate format with a colon:

: > filename

Note that : > predates the use of smiley faces in email messages.

Both of these command-less commands will create the file if it does not exist and truncate the file to zero bytes if the file does exist.

Sending Output to Nowhere Fast

On occasion, you not only want to redirect the output of a command, you want to throw the output away. This is most useful if:

  • A command creates a lot of unnecessary output.

  • You want to see error messages only, if there are any.

  • You are interested only in whether the command succeeded or failed. You do not need to see the command's output. This is most useful if you are using the command as a condition in an if or while statement, as in the example at the end of this section.

Continuing in the Unix tradition of treating everything as a file, you can redirect a command's output to the null file, /dev/null. The null file consumes all output sent to it, as if it were a black hole.

The file /dev/null is often called a bit bucket.

To use this handy file, simply redirect the output of a command to the file. For example:

$ ls /usr/bin > /dev/null
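
The same trick works when a command serves as a condition, as mentioned previously. For example, the following test cares only whether grep found a match, so the matching line itself goes to /dev/null:

if grep root /etc/passwd > /dev/null
then
    echo "Found the root account."
fi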

The Cygwin environment for Windows includes a /dev/null to better support Unix shell scripts.

Redirecting input and output is merely the first step. The next step is to combine commands into command pipelines.

Piping Commands

Command pipelines extend the idea of redirecting the input and output for a program. If you can redirect the output of one command and also redirect the input of another, why not connect the output of one command as the input of another? That's exactly what command pipelines do.

The basic syntax is:

command options_and_arguments | command2 options_and_arguments

The pipe character, |, acts to connect the two commands. The shell redirects the output of the first command to the input of the second command.

Note that command pipelines often overlap with normal redirection. For example, you can pass a file as input to the wc command, and the wc command will count the lines, words, and characters in the file:

$ wc < filename

You can also pass the name of the file as a command-line argument to the wc command:

$ wc filename

Or you can pipe the output of the cat command to the wc command:

$ cat filename | wc

Not all commands accept file names as arguments, so you still need pipes or input redirection. In addition, you can place as many commands as needed on the pipeline. For example:

command1 options_and_arguments | command2 | command3 | command4 > output.txt

Each of the commands in the pipeline can have as many arguments and options as needed. Because of this, you will often need to use the shell line-continuation marker, \, at the end of a line. For example:

command1 options_and_arguments | \
    command2 | \
    command3 | \
    command4 > output.txt

You can use the line-continuation marker, \, with any long command, but it is especially useful when you pipe together a number of commands.

Note that when a line ends with the pipe character, the shell assumes the command continues on the next line, so the line-continuation marker is not strictly required after a pipe. Using it anyway makes the continuation explicit.

Piping with Unix Commands

Unix commands were designed with pipes in mind, as each command performs one task. The designers of Unix expected you to pipe commands together to get any useful work done.

For example, the spell command outputs all the words it does not recognize from a given file. (This is sort of a backward way to check the spelling of words in a file.) The sort command sorts text files, line by line. The uniq command removes duplicate lines. You can combine these commands into a primitive spell-checking command.
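
A first cut at such a command might look like the following sketch (report.txt is just an example file name; on systems without spell, a similar checker such as aspell can stand in):

$ spell report.txt | sort | uniq > mistakes.txt

The sort step matters because uniq removes only adjacent duplicate lines.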

Creating Pipelines

Creating command pipelines can be difficult. It's best to approach this step by step, making sure each part of the pipeline works before going on to the next part.

For example, you can create a series of commands to determine which of many user accounts on a Unix or Linux system are for real users. Many background services, such as database servers, are given user accounts. This is mostly for the sake of file permissions. The postgres user can then own the files associated with the Postgres database service, for example. So the task is to separate these pseudo user accounts from real live people who have accounts on a system.

On Unix and Linux, user accounts are traditionally stored in /etc/passwd, a specially formatted text file with one line per user account.

Mac OS X supports a /etc/passwd file, but in most cases, user accounts are accessed from DirectoryServices or lookupd. You can still experiment with the following commands to process formatted text in the /etc/passwd file, however. In addition, many systems do not use /etc/passwd to store all user accounts. Again, you can run the examples to see how to process formatted text.

An /etc/passwd file from a Linux system follows:

$ more /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
uucp:x:10:14:uucp:/var/spool/uucp:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
rpm:x:37:37::/var/lib/rpm:/sbin/nologin
vcsa:x:69:69:virtual console memory owner:/dev:/sbin/nologin
nscd:x:28:28:NSCD Daemon:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
rpc:x:32:32:Portmapper RPC user:/:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
pcap:x:77:77::/var/arpwatch:/sbin/nologin
mailnull:x:47:47::/var/spool/mqueue:/sbin/nologin
smmsp:x:51:51::/var/spool/mqueue:/sbin/nologin
apache:x:48:48:Apache:/var/www:/sbin/nologin
squid:x:23:23::/var/spool/squid:/sbin/nologin
webalizer:x:67:67:Webalizer:/var/www/usage:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
xfs:x:43:43:X Font Server:/etc/X11/fs:/sbin/nologin
named:x:25:25:Named:/var/named:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
gdm:x:42:42::/var/gdm:/sbin/nologin
postgres:x:26:26:PostgreSQL Server:/var/lib/pgsql:/bin/bash
ericfj:x:500:500:Eric Foster-Johnson:/home2/ericfj:/bin/bash
bobmarley:x:501:501:Bob Marley:/home/bobmarley:/bin/bash

The /etc/passwd file uses the following format for each user account:

username:password:userID:groupID:Real Name:home_directory:starting_shell

Each field is separated by a colon. So you can parse the information for an individual user:

bobmarley:x:501:501:Bob Marley:/home/bobmarley:/bin/bash

In this case, the user name is bobmarley. The password, x, is a placeholder; it commonly means that the real (encrypted) password is stored elsewhere, such as in the /etc/shadow file, or that another system handles login authentication. The user ID is 501. So is the user's default group ID. (Linux systems often create a group for each user, a group of one, for security reasons.) The user's real name is Bob Marley. His home directory is /home/bobmarley. His starting shell is bash. (Good choice.)

As with the ancient spell command used previously, this example makes broad assumptions, which is fun, although not always accurate. For this example, a real user account is one that runs a shell (or what the script thinks is a shell) on login and does not run a program in /sbin or /usr/sbin, the locations for system administration commands. As with the spell command, this is not fully accurate, but it is good enough to start processing the /etc/passwd file.

You can combine all this information and start extracting data from the /etc/passwd file one step at a time.
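
The following is a minimal sketch of one such pipeline (the exercises later refer to a listusers script built along these lines; treat this version as an approximation). The cut command extracts the user name, real name, and shell fields; grep -v sbin drops accounts whose login program lives in /sbin or /usr/sbin; grep 'sh$' keeps lines whose login program looks like a shell; the final cut and sort produce a sorted list of names:

cut -d: -f1,5,7 /etc/passwd | \
    grep -v sbin | \
    grep 'sh$' | \
    cut -d: -f1,2 | \
    sort

Note that this sketch still counts the postgres account as a real user, because that account runs /bin/bash; the exercises return to this problem.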

In addition to piping between commands, you can pipe data to and from your shell scripts, as in the following example.
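
Here is a minimal sketch of a script that reads from a pipe (the name countshells.sh is just for illustration). Because commands inside a script inherit the script's standard input, the grep here reads whatever you pipe into the script:

#!/bin/sh
# countshells.sh: count input lines that end in sh,
# reading from this script's standard input.
grep 'sh$' | wc -l

You might run it by piping the shell field of /etc/passwd into it:

$ cut -d: -f7 /etc/passwd | sh countshells.sh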

Using tee to Send the Output to More Than One Process

The tee command sends output to two locations: a file as well as stdout. The tee command copies all input to both locations. This proves useful, for example, if you need to redirect the output of a command to a file and yet still want to see it on the screen. The basic syntax is:

original_command | tee filename.txt | next_command

In this example, the tee command sends all the output of the original_command to both the next_command and to the file filename.txt. This allows you to extract data from the command without modifying the result. You get a copy of the data, written to a file, as well as the normal command pipeline.
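
For example, the following command saves the full listing of /usr/bin to listing.txt while wc counts the entries on the screen:

$ ls -1 /usr/bin | tee listing.txt | wc -l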

Summary

You can get a lot of work done by combining simple commands. Unix systems (and Unix-like systems) are packed full of these types of commands. Many in the programming community liken scripting to the glue that ties commands together. You can think of the operating system as a toolbox and the shell as a way to access these tools. This philosophy will make it a lot easier to write shell scripts that you can use again and again.

This chapter covered redirecting input, output, and errors, as well as creating command pipelines.

  • You can redirect the output of commands to files using the > operator. The > operator will truncate a file if it already exists. Use >> in place of > if you want to append to the file.

  • You can redirect the error output of commands using 2>. To send the error output to the same location as the normal output, use 2>&1 (or, in bash, &>).

  • You can redirect the input of commands to come from files using the < operator.

  • Redirect the output of one command to the input of another using the pipe character, |. You can pipe together as many commands as you need.

  • The tee command will copy its input to both stdout and to any files listed on the command line.

The next chapter shows how to control processes, capture the output of commands into variables, and mercilessly kill processes.

Exercises

  1. Discuss the ways commands can generate output. Focus particularly on commands called from shell scripts.

  2. Use pipes or redirection to create an infinite feedback loop, where the final output becomes the input again to the command line. Be sure to stop this command before it fills your hard disk. (If you are having trouble, look at the documentation for the tail command.)

  3. Modify the listusers script so that it does not generate a false positive for the postgres user and other, similar accounts that are for background processes, not users. You may want to go back to the original data, /etc/passwd, to come up with a way to filter out the postgres account.
