CHAPTER 11
Mixing and Matching

The shell is a powerful language, but it does not do everything. Some other languages are heavily used by shell scripts to perform tasks that the shell itself is not very well suited for; similarly, programs in some other languages use the shell to do some of their heavy lifting. This chapter discusses a few of the issues you may encounter when using other languages from the shell, or the shell from other languages.

This chapter starts with some general information about embedding code in one language within code in another language. Following that are sections on embedding shell code in specific other languages and on embedding code in other languages in shell code. These sections briefly discuss the reasons for which you would use each combination, but they do not attempt to completely explain other languages.

Mixing Quoting Rules

The most fundamental problem of mixing shell and other code is that other languages typically have different quoting rules than the shell. For instance, both Perl and Ruby allow \' to occur in single-quoted strings as a way to include single quotes in them. This is useful because they do not share the shell's implicit concatenation of adjacent quoted strings, so the shell idiom wouldn't work, but it is often surprising to shell programmers. While very few large shell scripts are embedded in either Perl or Ruby programs, both have convenient syntax for embedding small scripts, including command substitution.

Nested quoting is complicated and easy to get wrong, and conflicting quoting rules only make it worse; nesting quotes across two sets of rules is a real hassle, and no one gets it right on the first try every time. It is usually most effective to separate scripts out, and this tends to produce other benefits, as you'll be able to generalize and make more use of each component. However, there are cases where a small embedded script is really too specialized, and not large enough, to justify a separate executable.

In general, your best bet with languages that do not require single quotes very often is to use single quotes in the shell to pass code into other languages. Here documents are only occasionally useful; many scripting languages read scripts from files, not standard input, and the most common programs to write this way are filters, which need to be able to use standard input for data anyway.

To get nested quoting correct, start by writing the embedded program correctly as a separate file with correct quoting. Once you have done this, you can look at how to quote this string in the outer scripting language. You may find it practical to bypass the first step with experience, but if something goes wrong, try a separate program first; it is a lot easier to debug.

As an example, the following awk script extracts information from C header files. Many headers defining symbolic names for constants use the following convention to describe the meanings of each constant:

#define NAME 23 /* Description of name */

For instance, somewhere under most implementations of <errno.h>, there is a file containing lines like this:

#define ENOMEM          12            /* Cannot allocate memory */

This format lends itself well to extraction in a simple awk script:

/^#define/ && $2 == "ENOMEM" {
  for (i = 5; i < NF; ++i) {
    foo = foo " " $i;
  }
  printf("%-22s%s\n", $2 " [" $3 "]:", foo);
  foo = ""
}

This script could be passed in as a single argument to the awk command (using single quotes in a shell script) or saved as a file and invoked with awk -f file. This script combines a number of awk's features somewhat awkwardly to produce output such as this:

ENOMEM [12]:         Cannot allocate memory

The output format is a bit elaborate and bears a little explanation; the output looks better if the error name is left-aligned and the numbers are immediately next to it. First constructing the string ENOMEM [12]:, then printing it left-adjusted in a field, provides an interface where the descriptive text is also aligned, making it easier to read larger blocks of output (such as multiple lines in sequence).

This program can be easily wrapped in a simple shell script. Because the script uses only double quotes, it can be wrapped using a single pair of single quotes, except for embedding a value. Here's a way you might do it:

for arg
do
  awk '/^#define/ && $2 == "'"$arg"'" {
    for (i = 5; i < NF; ++i) {
      foo = foo " " $i;
    }
    printf("%-22s%s\n", $2 " [" $3 "]:", foo);
    foo = ""
  }' < /usr/include/sys/errno.h
done

This scriptlet (assuming that your system's <errno.h> is structured like a BSD one, which it may not be) prints similar output for each matched argument. The interesting part is the argument embedding; "'"$arg"'" is a simple case of handling nested quoting. This awk script is composed from three adjacent strings; the first single-quoted string ends with a double quote, then there is a double-quoted string containing only $arg, and then the next single-quoted string starts with a double quote. If $arg contains the string ENOMEM, this expands to "ENOMEM" in the awk program. This is not necessarily the best way to pass data to awk. You might do better to use awk's -v option to assign variables:

for arg
do
  awk -v arg="$arg" '/^#define/ && $2 == arg {
    for (i = 5; i < NF; ++i) {
      foo = foo " " $i;
    }
    printf("%-22s%s\n", $2 " [" $3 "]:", foo);
    foo = ""
  }' < /usr/include/sys/errno.h
done

When you have to embed multiple kinds of quotes, it gets trickier. Just remember that you can always switch quoting styles to combine different rules. Be especially careful when trying to get backslashes into embedded code; this is one of the big arguments for using single quotes as much as possible.
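
For example, to pass a single quote to sed from within an otherwise single-quoted script, close the quotes, add a backslash-escaped quote, and reopen them:

sed -e 's/don'\''t/do not/g'

The shell concatenates the three pieces into the single argument s/don't/do not/g; sed never sees any of the quoting.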

For a more extreme example, m4's quoting rules are totally different from the shell's (although arguably superior). By default, m4 quotes start with ` and end with '. Obviously, this is vastly superior in terms of unambiguous nesting. Just as obviously, it is not a good fit for shell code. To resolve this, m4sh uses the m4 language's built-in (and thoroughly sanity-destroying) changequote primitive to change the quoting mechanism; in m4sh, quotes are []. These were selected, not because they are uncommonly used, but because they are almost always balanced. By contrast, an unmatched ending parenthesis is often seen in case statements. This is the real reason the examples in this book have preferred test to [.

Embedding Shell Scripts in Code

Shell code can be embedded in other programs. Many UNIX programs have some limited shell-out functionality, allowing them to run single commands; these commands are almost always passed to the shell. Quoting rules vary widely between languages; be sure you know which quoting rules apply. Editors that allow you to run shell commands may have their own special quoting and input rules; check the documentation.

By far the most common program in which shell code is included is make, and it deserves a bit of discussion.

Shell and make

The shell is heavily used by most implementations of make because it is the canonical command interpreter and is used to execute commands. In general, each individual command (a single line in a make rule) is run in a separate shell. However, you can join lines together using backslashes at the end of each line, and it is possible to write many shell scripts on a single line by using semicolons instead of new lines as line terminators. This section discusses the use of shell commands embedded as make rules, but it does not try to explain the rest of make; there are wonderful books and tutorials available on the topic, and it is beyond the scope of this book.

As with any language, make has its own quoting rules. Like the shell, make also substitutes variables. These expansions occur before the command is passed to the shell; unfortunately, they may also be confusingly similar to shell expressions. To pass a dollar sign to the shell, use the make variable $$. To substitute a make variable, use parentheses around its name, as in $(CFLAGS). This is confusingly similar to shell command substitution, but it is unrelated; think of $(CFLAGS) as being the equivalent of a shell program's ${CFLAGS}. The extra punctuation is much less optional in make, however. You should use it always, not just when there are other words nearby.
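
As a brief illustration (the show target and the CFLAGS variable here are hypothetical, standing in for whatever your makefile actually defines):

show:
        echo "make expands this before the shell runs: $(CFLAGS)"
        echo "the shell itself expands this one: $$HOME"

make rewrites $(CFLAGS) and turns $$ into $ before either line reaches the shell.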

Anything that make passes to the shell is a single line. The behavior of comments in a makefile is thus not what you would expect for the same text occurring in a shell script. For instance, the following script is a two-line shell script:

echo hello # \
echo world

Run as a shell script, this prints both words:

hello
world

The line continuation character is ignored because it is in a comment. However, if you specified the same text as a rule in a makefile, it would behave differently:

$ cat test.mk
default:
        echo hello # \
        echo world
$ make -f test.mk
echo hello #  echo world
hello

The rule is joined into a single line by make; the lines are joined, and the resulting rule is echo hello #  echo world. Although make recognizes lines starting with # as comments, it does nothing special with a comment character in the middle of the line, so the whole line is passed to the shell as is. The shell comment extends to the end of the whole command because the whole command is a single line; the second command is not executed. This is a common mistake. There is a more subtle additional mistake; even if the comment character weren't there, the output would be hello echo world because there is no statement separator. To write multi-line scripts as single commands in a makefile, you must use semicolons between statements.
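
For example, a conditional spanning several lines of a makefile needs backslashes for make and semicolons for the shell (the greet target is just an example name):

greet:
        if true; then \
          echo hello; \
          echo world; \
        fi

make joins this into a single line equivalent to if true; then echo hello; echo world; fi, which is a valid shell command.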

Be extremely careful about shell portability in make rules. There is no portable or safe way to cause a different shell to be used, so you are generally stuck with whatever shell the make program chooses. Users developing on some Linux systems sometimes produce make rules that only work if /bin/sh is actually bash. Don't do that.

Embedding shell code in make is not especially risky. Just remember that the code you write in the makefile is subject to processing, substitution, and quoting in make before it is passed to the shell. With that in mind, the shell gives make the flow control features (such as iteration through loops) it has always lacked. When writing longer sections of code, remember that make determines success or failure the same way the shell does, and it usually aborts whenever any build rule fails. If you write a build rule as an embedded shell script, be sure its return code is correct. For example, the following build rule has a serious problem:

timestamps:
        ( date; $(MAKE) build; date ) > build.log 2>&1

This rule appears to run a build with timestamps before and after the build, storing the output in a log file. In fact, it does this. However, the exit status of the shell command will always be the exit status of the second date command, which is unlikely to fail. If the build fails, make will not know about it or stop processing additional rules. To fix this, store the exit status and provide it to the caller:

timestamps:
        ( date; $(MAKE) build; x=$?; date; exit $x ) > build.log 2>&1

This rule passes back the status of the significant command rather than the status of another command. Another choice you might consider is to use && to join commands:

timestamps:
        ( date && $(MAKE) build && date ) > build.log 2>&1

This does preserve exit status, but it deprives you of the second timestamp if a build fails.

Shell and C

On UNIX-like systems, the system() library call usually passes its argument to the shell for processing. On non-UNIX systems, the command processor may be a different shell or may be absent entirely; relying on the UNIX shell makes C code less portable than it would be otherwise. The C language has relatively simple quoting rules inside double quotes; you can pass new lines (written \n), single quotes (no special treatment needed), and double quotes (written \"). Line continuation in C, as in make, happens before data are passed to the shell, so do not rely on it. If you really want to write a long multi-line script in C, the most idiomatic way is to rely on C's automatic string concatenation and use new lines:

system("if true; then\n"
       "  echo hello\n"
       "fi");

This produces a string equivalent to "if true; then\n  echo hello\nfi", with real new lines in place of the \n sequences, but it is easier to read. Some C compilers offer extensions to accept embedded new lines in strings; do not rely on this. It is not portable, and it is also not especially useful.

This, of course, begs the question of whether it makes sense to embed non-trivial script code in C at all. In general, it does not. If you want to run external commands from C, you should normally restrict yourself to calling out to external programs using fork() and exec(). If you want to run a script, it is usually better to have an external script program rather than trying to embed it.

Embedding Code in Shell Scripts

Code in other languages is usually embedded in shell scripts when the languages lend themselves well to being used as filters. Two of the most famous examples are sed and awk, which are discussed in detail in the rest of this chapter.

Once you are comfortable with the shell's quoting mechanisms, embedding programs in your shell scripts is usually easy. Most of the time, single quotes will do everything you want, except for shell variables substituted into the embedded program.

Embedding shell variables in programs can range from relatively simple to fairly difficult, depending on the context in the embedded code. If you can be sure that the variable never has a value that would require special quoting in the embedded language, it is pretty easy:

cmd -e 'code '"$var"' more code'

This embeds the shell variable $var between code and more code. It is not quoted in the embedded code, though. If it needs quoting, you have to ensure that the shell variable's value is correctly quoted before embedding it. File names and user-supplied input can require a great deal of work to sanitize correctly for embedded code. In some cases, it is better to truncate or remove invalid inputs rather than try to preserve them through quoting.
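
For instance, rather than trying to quote arbitrary input, you might simply reject anything outside a known-safe character set (a sketch; cmd and the embedded code are placeholders as in the previous example):

case $var in
*[!A-Za-z0-9_]*)
  echo "unsafe value in var" >&2
  exit 1 ;;
esac
cmd -e 'code '"$var"' more code'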

Shell and sed

The sed utility provides a generalized editing facility that can be used as a filter (the -i option for editing in place is not universal). It is most heavily used to perform simple substitutions using regular expression patterns, but it is substantially more powerful than this. The sed utility uses basic regular expressions—mostly. Some versions support additional features, such as alternation (|) or the ? and + extensions; others do not. Do not use these. In general, do not escape anything with a backslash in sed unless it is a character that has special meaning when escaped with a backslash or is the expression delimiter. A few versions of sed do not support using asterisks on subexpressions, only on single character atoms (including character classes).

Mostly, sed is used for cases where you want to perform reasonably simple translations of files—for instance, replacing special marker text with string values. Like many utilities, sed is built around the assumptions of a shell script. Given no other instructions, it reads from standard input and writes to standard output. By default, it performs any instructions given to it, then prints each line.

One of the major uses of sed is to work around shells that lack the POSIX parameter substitution operators, such as ${var#pattern}. (You can also usually do this with expr.) For instance, one idiomatic usage would be to grab the file name from the end of a pathname:

$ file=`printf "%s\n" "$path" | sed -e 's,.*/,,'`

While this usage is idiomatic, it is probably better to use expr for simple cases like this. If all you want to do is display part of a string, use expr.
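
For instance, expr can extract the file name directly; the extra leading slash is a hedge against values of $path that expr might otherwise mistake for operators:

file=`expr "/$path" : '.*/\(.*\)'`

The : operator matches the whole string against a basic regular expression anchored at the start and prints the part matched by \( \).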

While the s// command in sed is usually illustrated with slashes, it is portable to use other characters instead. When working with path names, it is even preferable. Commas are a common choice. Exclamation points are popular too, but they cause problems in shells (csh, very old bash) that use ! for history expansion. In general, sed commands that use delimiters let you pick a delimiter.

Even a small sed command can do very useful things. You have previously seen the common convention of prefixing strings with X to prevent them from being taken as options. This leads to a handy idiom:

Xsed="sed -e s,^X,,"
func_echo () {
  echo X"$@" | $Xsed
}

Even if the first argument is -n, this function can display it reliably. Unfortunately, this is not enough to work around the versions of echo that strip backslashes. You might wonder why this example doesn't just use expr, as previously suggested. The reason is that sed can take multiple -e arguments, and this provides a useful idiom:

echo X"$var" | $Xsed -e s/foo/bar/

This replaces foo with bar in the contents of $var, even if $var happens to start with a hyphen. Since piping small strings into sed is a fairly common task (and to be fair, there are many substitutions expr cannot make), and many versions of echo are obnoxious, this is a great way to magically hide the problem. On modern systems (or even moderately old ones, as long as they're not stripped-down embedded systems), printf may be better. Still, it's a good idiom to know. You never know when you'll suddenly need it.

sed scripts do not need quoting beyond backslashes, and those only in limited circumstances, such as when a regular expression contains the delimiter character used for the command. When writing a sed command with multiple separate commands, you have several options. You can use multiple -e arguments or separate commands with semicolons, but if you want to write a longer script, it is often better to write a single script using embedded new lines. It is usually easier to read multiple commands on multiple lines than squished together on one. Long single-quoted strings are your best friend here. Use the standard concatenation trick to embed variable substitutions in sed scripts; the following trivial example shows how you might emulate grep using sed:

sed -n -e '/'"$regex"'/p'

This prints every line matching $regex, unless it contains forward slashes. The sed command ends up being /regex/p, which prints lines matching regex. The -n option prevents sed from printing every line automatically, so only lines explicitly printed are displayed. A common mistake is to omit the -n:

sed -e '/'"$regex"'/p'

This command prints every input line and prints lines matching $regex twice.

Solving the delimiter problem is a bit tricky. In general, you want to escape delimiter characters with backslashes, but the backslash itself is special to sed. Luckily, this is easier than it sounds:

pat=`printf "%s\n" "$pat" | sed -e 's,/,\\/,g'`

This causes the variable $pat to have every slash replaced with a backslash followed by a slash. If you then expand $pat in a sed script, the backslashes protect the forward slashes and cause them not to be interpreted as delimiters. It is important to use single quotes to quote the sed script; otherwise, you need twice as many backslashes because each pair of backslashes becomes a single backslash in the argument passed to sed, which then simply protects the following character and disappears. Be sure to sanitize variables you plan to embed in sed scripts; otherwise, you may get unpleasant surprises.


Caution You will also see this idiom using echo, but some versions of echo strip backslashes. If you try to use this to escape backslashes, or if your string happens to contain backslashes for any other reason, it may not work portably with echo. You can find examples of how to work around this in libtool. Some of them have a lot of backslashes.
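
Putting the pieces together, here is a sketch of a small grep-like filter built on this escaping trick (the function name is mine, and it still assumes the pattern contains no other regex metacharacters you did not intend):

psearch () {
  esc=`printf "%s\n" "$1" | sed -e 's,/,\\/,g'`
  sed -n -e '/'"$esc"'/p'
}
ps ax | psearch /usr/sbin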


Longer sed scripts can do truly amazing things. This is a good time to review the configure script code that replaces $LINENO with the line number of a script:

sed '=' <$as_myself |
    sed '
      N
      s,$,-,
      : loop
      s,^\(['$as_cr_digits']*\)\(.*\)[$]LINENO\([^'$as_cr_alnum'_]\),\1\2\1\3,
      t loop
      s,-$,,
      s,^['$as_cr_digits']*\n,,
    ' >$as_me.lineno

The first sed script ('=') prints the line number of each line, then the line itself. So the output for the top of the script might be this:

1
#!/bin/sh
2
# Guess values for system-dependent variables and create Makefiles.

The body of the script is impressive, impressive enough, in fact, that the script gives credit to the inventors (plural):

# (Raja R Harinath suggested sed '=', and Paul Eggert wrote the
# second 'sed' script.  Blame Lee E. McMahon for sed's syntax.  :-)

To understand what this script does, first have a look at what it comes out to when the autoconf $as_cr values have been filled in. I've used character ranges for expressiveness; the actual variables are completely spelled out.

N
s,$,-,
: loop
s,^\([0-9]*\)\(.*\)[$]LINENO\([^0-9A-Za-z_]\),\1\2\1\3,
t loop
s,-$,,
s,^[0-9]*\n,,

For each line, sed begins by merging it with the next line (the N command). A hyphen is appended to the line. This is a trick reminiscent of the case ",$list," in trick introduced in Chapter 2; the purpose is to ensure that $LINENO never occurs at the end of the line, so you can always check the following character to see whether it could be part of an identifier.

Next, there is a small loop. The : command in sed introduces a label, which can be branched to later. (Yes, sed has flow control.) Each iteration of the loop performs a replacement. It replaces the text $LINENO (the dollar sign is in a character class, so it matches a literal dollar sign rather than the end of the string) with the string of digits at the beginning of the buffer, which is the line number. This idiom is extremely important to understand; it forms the basis of all sorts of things you can do with regular expressions that you cannot do without them.

The key is the use of grouped matches and the ability to refer back to them. (Since the reference back is not part of the matching regular expression, this is not technically a backreference; the same technique is available even in extended regular expression implementations lacking backreference support.)
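
A trivial example of the technique: group the context you need to match, and then put it back with a reference.

$ echo "path: /usr/local/bin" | sed -e 's,^\(path: \).*,\1(hidden),'
path: (hidden)

The prefix is matched (so the substitution applies only to such lines) but survives unchanged; only the rest of the line is replaced.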

When this substitute pattern is reached, a typical input buffer might be this:

124
echo "configure: $LINENO: I am an example code fragment."

Table 11-1 shows how this line matches the regular expression.

Table 11-1. Matching a Complicated Regular Expression

Pattern                  Text
^\([0-9]*\)              124
\(.*\)                   <newline>echo "configure:<space>
[$]LINENO                $LINENO
\([^0-9A-Za-z_]\)        :

The rest of the line is not matched; the regular expression matches only up through the colon after $LINENO. Because there is a successful match, this is replaced. The first chunk (containing the line number) is replaced by \1; since it was this text that formed \1, nothing changes. The second part is replaced by \2; again, nothing changes. After \2, the script inserts \1 again. Finally, \3 is replaced by itself. Because $LINENO was not in any group, it is not kept; instead, it is replaced by \1. So, $LINENO is replaced by 124, and nothing else happens.

This idiom is important because it means that you can do a replacement operation where you match on surrounding context that you do not wish to replace or modify; you can use groups around the material you need to match on, and then use \N references to replace those groups with their original text.

After the replacement, there is a branch; the t command branches back to the label if any substitutions have been made. (This is necessary because the regular expression in question can't be repeated with the /g modifier.) Once all instances of $LINENO on a line have been replaced, the t does not branch, and the script continues.

The last two commands remove the trailing hyphen from the line and remove the line number from the beginning of the buffer. The N command joined the two lines, preserving the new line between them; the last command removes the initial string of digits and the new line after it, leaving the original line (before the = script) with only the $LINENO changes.

You may be surprised to find that, to an experienced sed user, this is fairly obvious. It's a powerful language and worth learning.

Shell and awk

The awk language (named after its creators, Aho, Weinberger, and Kernighan) fills a number of roles in shell scripts. While it is overkill for many simple substitution or pattern-matching operations, it offers a great deal of flexibility in performing more elaborate computations and generating interesting reports. In general, an awk script consists of a series of conditions and associated actions; a condition determines which actions to perform, and actions do things like calculating and printing values. Unlike sed, awk uses extended regular expressions. This section introduces the basic features of awk and the many variants of awk you are likely to encounter.

Why Use awk?

There are several key features awk provides that make it useful in shell scripts. The first, and most obvious, is associative arrays (also called hashes). In awk, a variable can be an array in which the indices are arbitrary strings rather than just numbers. This is an exceedingly flexible data type, allowing for the creation of lookup tables with keys such as file names or other arbitrary strings. The second is that, in many cases, it is desirable to accumulate data as you process input and then do something with the accumulated data only after all the input has been processed; awk's END rules (discussed below) make this easy. Finally, awk's implicit splitting of input into fields, and its flexible operations on fields, make it easy to express a lot of common operations without a lot of additional setup work.

While sed scripts tend to be short and terse, often only a single short line, many awk scripts run across multiple lines. Resist the temptation to cram a whole complicated awk script onto a single line; go ahead and write a longer script over multiple lines.

In awk, strings should be quoted (using double quotes). An unquoted word is interpreted as a variable, not a literal string. There is no shell-like distinction between assignment and substitution; variable names are always given as plain words. Operators and literal numbers need no quoting or special markers.

While processing each line, awk automatically splits it into fields; usually these are the words of a line, delimited by whitespace, but you can specify a different delimiter. Fields are numbered starting at 1, with field 0 referring to the whole line. To get the value of a field, you use a dollar sign: $0 is the whole line, $1 is the first field, and so on. The built-in variable NF holds the number of fields on the current line; you can refer to the last field as $NF. In general, any variable can be used this way. This can be a bit of a shock for shell programmers, who expect $var to be the value of var.
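
A quick illustration:

$ echo "one two three" | awk '{ print NF, $1, $NF }'
3 one three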

Like the shell, awk treats uninitialized variables as empty. However, awk can perform both numeric and string operations; in a numeric context, an uninitialized variable is treated as a zero. Strings of digits and numbers are mostly interchangeable; if you try to add a number to a string of digits, the string is converted to a number and added. The transparent conversion between strings and numbers, and the implicit initialization of fields, make awk a very friendly language for writing reports. Associative arrays in particular are a wonderful feature. Many of the behaviors you see in awk are also common in Perl scripts.
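
For example, digit strings pulled out of the input can be used in arithmetic directly, and the uninitialized total starts at zero:

$ echo "12 apples, 30 oranges" | awk '{ total += $1 + $3 } END { print total }'
42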


Basic Concepts

The central concept of awk is the rule, also called a pattern-action statement. A rule is a condition (called the pattern) and a block of code (called the action). Conditions are just awk expressions, typically referring to the fields of the current line. The expression /regex/ implicitly matches the extended regular expression regex against $0. If the expression is true for a given input line, the block is executed for that line. An empty expression is always true. An empty action is interpreted as print, which implicitly prints $0. The following fragment of awk code prints the last word of each line of input containing the text hello:

/hello/ {
  print $NF
}

You can also perform matches on a particular field (the ~ operator is the explicit regex match operator):

$1 ~ /hello/ {
  print "goodbye, " $2
}

The special conditions BEGIN and END define rules that are executed once only; BEGIN rules before any input is read, and END rules after all input has been read. If you want to change the special variables RS and FS (record separator and field separator), you must use a BEGIN rule. Older code sometimes uses a BEGIN rule to insert shell variables into awk variables:

awk 'BEGIN { x='"$x"' } ...'

In "new awk" (1986 or so and later), you can use the -v option instead:

awk -v x="$x" '...'
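
Setting the field separator works the same way; for instance, to print the first colon-separated field of each line of /etc/passwd:

awk 'BEGIN { FS=":" } { print $1 }' /etc/passwd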

Typically, END rules are used to provide summaries or reports after processing and interpreting a data file. The following example reads the output of an ls -l command:

/\.h$/ { h += $5 }
/\.c$/ { c += $5 }
END { print "C source: ", c; print "Headers: ", h }

Saved as a file named codesize, this could be used from the command line:

$ ls -l | awk -f codesize
C source: 98093
Headers: 14001

The lack of initialization and setup code is one of the reasons awk is popular.

If you provide two expressions separated by commas for a rule, the rule is considered to match every line between one where the first expression matches and one where the second expression matches. The lines matching the first and second expressions are included, as shown in the following example (using the implicit print $0 action):

$ awk '/a/,/b/' << EOF
> 1
> a
> 2
> b
> 3
> EOF
a
2
b

Expressions can use variables defined by the user, not just the predefined variables and the fields from the current line. Of particular interest to shell programmers, because the shell has no native equivalent, are awk's arrays. An array in awk is a collection of values indexed by strings. You can use just about any expression as the index for an array. Members that do not exist, like variables that do not exist, are treated as zero or empty. The following script prints a list of the first words of its input lines, with a count of the occurrences of each:

$ awk '{ count[$1]++ }
> END { for (val in count) print val ": " count[val]; }' <<EOF
> example
> test
> example
> script
> awk
> program
> EOF
program: 1
script: 1
awk: 1
example: 2
test: 1

The order of output is not deterministic in this case; arrays are not stored in any particular order in awk. You can, of course, use sort on your output. You can also do your own sorting of output, although this is a bit more complicated (there is no built-in sort function).

In addition to the basic operators, awk has functions. A function takes an argument list (which may be empty for some functions) and returns a value. For instance, the tolower() function returns its argument converted to lowercase:

$ echo "WHAT CAPS LOCK KEY?" | awk '{ print tolower($0) }'
what caps lock key?

Functions can be used anywhere in an expression; the following awk script prints only input containing lowercase letters:

$0 != toupper($0)

Since the rule has no specified action, the implicit print is used. Note that the input is not modified by the function call; normal values passed to functions are passed by value (meaning that the function can change the copy passed to it, but not the original object). This is different for arrays.

Furthermore, in addition to the built-in functions awk provides, you can define your own functions. A user-defined function looks a little like a rule:

function greet(who) {
  print "Hello, " who
}

The parameters a function is declared with are local variables, assigned from any parameters passed to it when the function is called. Functions are declared at the top level of the awk program, not inside rules. Functions in awk may be recursive; each function gets its own local copy of the parameters.
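
A small sketch of a recursive function; the /dev/null input keeps implementations that insist on reading input happy even though only a BEGIN rule runs:

$ awk 'function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }
> BEGIN { print fact(5) }' < /dev/null
120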

Variants

The original awk language was quite popular, but it had some limitations. A newer version, called nawk, was developed starting around 1985, and was the basis of the 1988 book The AWK Programming Language (by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger; Addison-Wesley). Since then, the GNU project has contributed another version (gawk), and Mike Brennan introduced another version called mawk. In practice, nearly every system in use today provides something roughly compatible with nawk, although some variants provide many more features. There is also a small awk implementation available in busybox; it seems to be nawk-compatible. The most substantial upgrade in "new awk" is user-defined functions, which are essentially universally available now. However, a few systems provide a "traditional" awk, usually the same systems that provide a "traditional" shell.

If you are doing a great deal of awk programming, it makes sense to search around for the best available awk implementation. In fact, even if a program is otherwise pure awk, it may be better to embed it in a shell script that can do some initial command-line argument parsing and pick a good awk interpreter. The following snippet looks for a good version of awk, preferring the faster and more powerful interpreters:

if test -z "$AWK"; then
  for variant in mawk gawk nawk awk
  do
    save_IFS=$IFS
    IFS=:
    for dir in $PATH
    do
      IFS=$save_IFS
      if test -x "$dir/$variant"; then
        AWK="$dir/$variant"
        break
      fi
    done
    IFS=$save_IFS
    test -n "$AWK" && break
  done
fi

As always, trust the user; a user is unlikely to specify $AWK without good reason. Combining this with command-line parsing, whether using getopts or something like the boilerplate introduced in Chapter 6, allows you to write powerful and fairly portable awk scripts that handle arguments much more gracefully than traditional awk. Note that there are two points at which this script restores $IFS. The line at the top of the inner loop ensures that following commands will execute with $IFS restored; the line after the loop ensures that $IFS gets restored even if $PATH is empty and the loop never executes. In this particular case, neither of these boundary conditions is likely to come up, but it is good to develop careful habits.


Portability Concerns

Essentially every system since the late 80s has provided some variant of "new awk." This section covers the key portability notes among the new awk variants (including gawk and mawk). Special thanks are due to the autoconf portability notes, which caught a number of quirks I had never run into.

Do not put spaces in front of the parentheses in a function declaration; this is not portable. Function declarations in awk do need the function keyword, unlike shell functions:

function foo() { print "Please give your function a more creative name." }

The order of operations when iterating over an array is not deterministic; do not assume it is any order (not "order of insertion" or "sorted," for instance). It is not even guaranteed that the order is the same on successive iterations! A single for (var in array) will hit every member of the array, but there is no guarantee at all about order.

The last semicolon in a block is probably optional. Some people use them based on vague recollections that there was an awk implementation somewhere that required them. (Shell {} blocks definitely require a trailing semicolon or new line; awk may not.) I have omitted them because I cannot find an awk implementation that needs them.

If an awk script is not supposed to process any lines of input, run it with /dev/null (or any empty file) as input; some implementations may try to read input even when the POSIX spec says they shouldn't.

At least one awk mishandles anchors in alternating expressions, such as /^foo|bar/. If you have to use such expressions, put them inside a group—for instance, /^(foo|.*bar)/.

Several implementations reset NF and the field variables before executing an END block; if you need to refer to their last values, save them in regular user-defined variables.

Features you can use portably across new awk implementations include user-defined functions, the ?: operator, getline, the exponentiation operator (^), and a number of string and math functions. Variable assignment using -v is universal in new awk, but not found in traditional awk. If you are using autoconf, AC_PROG_AWK can find a working new awk on every known system.

Only single-dimensional arrays are available in traditional awk. In fact, even in modern awk, what is really available is not multidimensional arrays, but a convenient syntactic shorthand for constructing array keys:

a[x,y,z] = 3

This syntax allows you to store data structures much like multidimensional arrays, but you cannot easily extract "every array member in column 1."
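
The keys are really single strings, joined with the value of the special variable SUBSEP; you can take them back apart with split. A sketch:

$ awk 'BEGIN { a["x", 1] = 3
>   for (k in a) { split(k, part, SUBSEP); print part[1], part[2] } }' < /dev/null
x 1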

Embedding awk in Shell Scripts

There are two good pieces of advice to consider about embedding small awk scripts in your shell scripts. The first is that you should always think about whether what you want to do can be done better using tr, expr, cut, paste, or one of the other similar small and specialized tools UNIX provides. Many tasks can be performed more efficiently by sort and uniq -c than they can by an awk script building a table of values and printing them out. There is no need to use awk to display fields of output when cut can do the same thing.

The second piece of advice is that maybe you should use tiny little awk scripts for a lot of things like this anyway. It is true that a script often runs faster using smaller and more specialized utilities. However, it is often easier to write the correct code using awk, and this may be more important when you are in a hurry. For instance, if you want the process IDs of every process matching a pattern, it is easy to write:

$ ps ax | awk '/pattern/ { print $1 }'

However, there's no reason you couldn't do this just as well with grep and cut:

$ ps ax | grep -e 'pattern' | cut -f 1

Well, there's one. This doesn't work. By default, cut expects delimiters to be tabs, and ps doesn't normally use tabs, so the whole line is a single field. No problem! Just use spaces:

$ ps ax | grep -e 'pattern' | cut -f 1 -d ' '

Oops. It turns out this works only when the pid reaches the left edge of the display; the default right-aligned output puts spaces in front of shorter pids (on my system, those with 1–4 digits), and cut treats each of those spaces as a field separator, so the first field is empty.

What this means to you: For a script where performance matters, it is probably worth figuring out the right way to do something with other tools. Often they will be much faster. However, in the fairly common case where you're just writing something to get a result right now, it is worth being comfortable enough with awk to emit one-line scripts quickly and easily.

Slightly longer scripts can generally be embedded using single quotes, but if your script gets to be a screenful or full of text, it is worth considering making it a separate file and using awk -f file. If you need to pass variables into the awk script, use the -v option. Even if you are embedding the script, it may be easier to follow it if you use the -v option to pass in variables instead of messing around with quotes.

Utilities and Languages

Is sed a utility or a language? Really, it is both. One of the sources of the flexibility of many UNIX utilities is that they have substantial expressive power, and indeed, often implement complete (if simplistic) languages. There are programming languages whose expression parsers are not as flexible as those used by find or test. The downside of this is that, to program the shell effectively, you have to have at least basic familiarity with a handful of smaller languages that are used for particular purposes.

In the end, the shell is just another utility. It is an extremely powerful one with a complex (sometimes regrettably so) command language, which uses other utilities, and even other programming languages, as its building blocks. You can develop new utilities using existing utilities, and new programs relying on those new utilities. For many tasks, the shell's performance weaknesses have long since ceased to be significant on modern systems; many shell scripts operate many times faster than their human users can type.

Used carefully, with a bit of attention to detail and planning, the shell allows for extremely rapid development of programs with unusually high portability across a broad range of systems.

What's Next?

The appendices. By kind permission of The Open Group, this book includes the specification for the POSIX shell; while some of the features described are not perfectly portable (yet...), the POSIX shell spec offers a clear description of many core shell features.

Beyond that, what's next is up to you. I recommend making a point of reading existing shell scripts; you may find a number of interesting idioms in distributions like shtool (a collection of small but very useful and highly portable shell scripts). When looking at programs you've never used, check to see whether they might be shell scripts; of the 404 commands in /usr/bin on one of my home systems, 32 are shell scripts. Reading a script you've never seen before can be informative.

If you want to master the shell, read lots of scripts and write lots of scripts. Don't settle for merely being able to guess what a script does; understand it. Find out what other programs it uses, and find out what they do. Automate aggressively. Feel free to write something that just automates part of a task; it's a great way to get started. You may be surprised at how easy it is to fill in the rest. About halfway through writing this book, I decided to automate "just the easy part" of a task which usually took me about three or four hours. Six hours later, I had it all automated.

Test your code on multiple systems and with multiple shells. You will learn a lot by doing this, and it will save you a lot of trouble when you unexpectedly have to target a new machine. I say when, rather than if, because personal experience has taught me that it is so.
