CHAPTER 8
Utility Portability

This chapter discusses the portability of programs external to the shell. Most shell scripts need to use a number of programs other than the shell itself to achieve their ends. Compared to the divergence in the functions and options offered by utilities, the variance of all the shell languages is relatively trivial.

This chapter does not attempt to provide a complete or comprehensive list of differences between different utilities; such a list would be much larger than this book. A shell script may have access to hundreds, or even thousands, of distinct programs. Many programs exist in three or more distinct variants with different sets of options, different names for the same options, and other sources of confusion. Utilities can acquire, or lose, features between one version of a system and another. Keeping track of every detail specifically is impractical, amplifying the need to stick with standard features most of the time. The autoconf documentation has a particularly good list of issues you are likely to run into. This chapter gives a somewhat narrower list, but it also goes into general principles of portability and explores some of the ways to go about finding out what will be portable.

This chapter begins with an overview of common families of utilities, such as BSD and System V. Following this is a section on avoiding unnecessary dependencies, and ways to check to ensure that your code will be portable. The third section discusses a number of specific examples of common utility portability issues. Finally, I close with a discussion of how to cope when something you thought was portable enough turns out not to be.

Common Variations

While there are dozens of specific variants of many commands with particular local features added, there are broad categories into which many utility programs fall. The famous historical distinction in UNIX utilities is System V and BSD, with BSD utilities often offering radically different semantics and options than System V utilities. Often, if you recognize one utility on a system as favoring a particular heritage, it will turn out that many other utilities on the same system have the same background.

Many utilities on modern systems are more explicit about which of their features are standard and which are extensions. Start with the online manuals, called the man pages (they are accessed using a command called man). When reading the man page for a utility, check to see whether it has a "Standards" heading; if it does, this will give you guidance on where to look for information about what that utility might do on other systems. With that in mind, it's time to look into some of the heritage of the UNIX utility environment.

Days of Yore: System V and BSD

The first major portability problems from UNIX arose, unsurprisingly, when there started being more than one kind of UNIX. When students at the University of California, Berkeley, started distributing their own modified versions of UNIX, one of the most noticeable changes was a huge upswing in the number of utilities and the number of options for those utilities. As AT&T turned UNIX into a commercial product (System III, and then System V), some of these ideas were adopted and others were not. Meanwhile, AT&T's new features often didn't make it into BSD. The result was that, while the core features that had been present in the original code base were usually portable, features added by one group or the other tended not to be. Even worse, both groups showed some tendency to reject things they had not developed locally, a syndrome often referred to as "not invented here" (NIH) syndrome.

A general trend was for Berkeley systems to add a lot more features and options, sometimes changing a utility from a fairly specialized program into a much more general one. The love of special cases and options led to a famous quip:

Stop! Whoever crosseth the bridge of Death, must answer first these questions three, ere the other side he see:

"What is your name?"

"Sir Brian of Bell."

"What is your quest?"

"I seek the Holy Grail."

"What are four lowercase letters that are not legal flag arguments to the Berkeley UNIX version of 'ls'?"

"I, er... AIIIEEEEEE!"

—Mark-Jason Dominus

Of course, he's joking; in fact, there are five (e, j, v, y, and z).

Berkeley and System V UNIX continued to diverge in many respects, with subtle differences in the C library as well as their utilities. Going from one to the other could be quite confusing; the basic selection of utilities available differed widely, and even the utilities that existed on both might have radically different behaviors. This is also where the difficulties with echo originated (see the section "Common Utility Issues" later in this chapter).

Modern systems often support many of the idioms from both BSD and System V utilities. For instance, some versions of ps accept Berkeley options without hyphens and System V options with hyphens; on Mac OS X, you can use either ps aux or ps -ef, but ps ef complains that the f option is not valid. (The original Berkeley ps did not use hyphens to introduce its options, making this behavior moderately idiomatic for users of either system.)
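On many modern systems you can see both heritages accepted side by side. Here is a small sketch; treat the exact flag support as an assumption to verify on each target:

```shell
# Berkeley-style options take no hyphen; System V-style options do.
# Modern ps implementations often accept both forms, but not mixed forms.
ps aux  > bsd_style.out  2>/dev/null   # Berkeley form: all processes, user-oriented
ps -ef  > sysv_style.out 2>/dev/null   # System V form: every process, full listing
```

If either file comes out empty, that ps speaks only one dialect, which is itself useful information about the system's heritage.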

GNU Arrives

The GNU project began in 1984, when Richard Stallman began to work on developing free utilities. It is important to note the distinction; he means "free" as in speech, not "free" as in beer. That is to say, the emphasis is not on cost; you may sell GNU software if you want to. Rather, the emphasis is on privileges or rights; if you have GNU software, you may sell it to other people, modify it, or otherwise use it pretty much as you wish. However, you must offer these same freedoms to others in any derivative works. So while you are free to acquire GNU make, modify it in any way you want, and use it, if you begin to pass on modified copies, you must make the source for the modifications available under equally nonrestrictive terms. (A more detailed discussion of the licensing implications is beyond the scope of this book, but if you write much code, you should make a point of being familiar with the common open source licenses.)

As time went on, the GNU project began to develop mostly compatible versions of a number of core UNIX utilities, such as grep. With many concerns in the air about software litigation, the GNU project adopted a philosophy that went beyond writing utilities from scratch to implementing them in ways that were expected to make it very unlikely that their internals were even similar to those of other implementations. This is how the GNU coding standards put it:

For example, UNIX utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in memory and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the UNIX program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).

—GNU Coding Standards, Section 2.1

GNU utilities often mixed and matched pieces of both System V and Berkeley functionality, as well as introduced many interesting new options. Some of these options later made it back into other systems, but not quickly.

GNU utilities frequently have exceptionally broad feature sets. Information about standards conformance is often kept in a separate set of documentation to be browsed with the GNU info reader, rather than put into man pages. For a long time, the GNU project advocated putting only incomplete summary documentation into man pages; while the info format is arguably better at many things, this results in users having to use more than one documentation reader, and many users remain unaware of the much greater documentation detail in the info pages.

GNU utilities introduced a new convention of option flags, which were whole words rather than single letters; these are called long options. The most commonly used options will also have single-letter abbreviations, but sometimes the long form is easier to remember. Long options are introduced with a double hyphen (--option). This behavior has shown up widely in other utilities but has not been formally standardized.
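For example, GNU grep spells the same option both ways; the long form is a GNU-style extension, not part of POSIX:

```shell
# The same option in short and long form; the long form is easier to
# read in scripts but is not portable beyond GNU-style utilities.
echo "Hello" | grep -i hello
echo "Hello" | grep --ignore-case hello
```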

The GNU utilities are fairly widely portable, and in many cases you can arrange to install them even on a non-GNU system if you need a particular feature they offer. However, be careful when doing this, as other scripts may be depending on the behavior of the system's non-native utilities. Typically, a system that has both GNU and non-GNU versions of a utility prefixes the names of GNU utilities with g, as in gmake, gfind, or gtar.
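A minimal sketch of coping with the g prefix, assuming the naming convention described above (whether gmake exists at all is an assumption about the local install):

```shell
# Prefer the GNU version of make under its g-prefixed name when
# installed, falling back to the system's native make otherwise.
MAKE=make
for candidate in gmake make; do
  if command -v "$candidate" >/dev/null 2>&1; then
    MAKE=$candidate
    break
  fi
done
echo "using: $MAKE"
```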

Standardization

The POSIX standard is one of many UNIX standards that have come along; it is neither the first standard to come out nor the most recent. UNIX users found the difficulty of porting between systems frustrating, and standards and portability work began showing up. From the early Uniform Draft Standard (UDS) and System V Interface Definition (SVID) guides came POSIX and the X/Open Portability Guides. In many cases, these standards tended to track AT&T more closely than BSD, but both systems had some decisions adopted and ratified.

The POSIX and X/Open work has gradually converged on more recent standards, such as the Single UNIX Specification. Unfortunately, these standards have become gigantic, and conformance testing is large, complicated, and not always adequate. This is not to say the standards are not useful; they are excellent, but very few systems are fully conformant. If you are developing a system, you should pursue conformance, and scripts that do not need to be portable to older systems gain a lot of benefit from increased standardization.

You will find two central problems in trying to rely heavily on standardized features in portable scripting. The first is that many systems, especially open source systems, lack the funding to pursue every last nook and cranny of the gigantic specifications. The second is that many of the behaviors required need not be defaults. So, while POSIX does require a broad range of basic shell functionality, it may not be the case that /bin/sh is the shell that provides that functionality. In practice, NetBSD's /bin/sh is much closer to POSIX compliance than Solaris's /bin/sh, but the /usr/xpg4/bin/sh on Solaris might be more compliant than NetBSD's ash derivative. (As a disclaimer, I must admit that I have not done comprehensive testing of either; neither of them has ever failed to run a POSIX script for me, though.) A third, more subtle problem is that POSIX does not specify nearly as much as most people expect it to. The shebang notation (#! at the beginning of scripts) is not mandated or defined by POSIX, even though I have never personally seen a system that didn't use it.

Similar issues apply throughout the utilities used by the shell. Commands often continue to accept nonstandard options by default, converting to standard behavior only when you take special effort to obtain that behavior.

A number of utilities, especially GNU utilities, will adhere somewhat more closely to the POSIX specification if the environment variable $POSIXLY_CORRECT has been set.
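A sketch of the effect, using GNU ls as the example; this particular behavior, option permutation, is specific to GNU-style argument parsing:

```shell
# GNU utilities normally permute arguments, so an option may follow an
# operand.  Setting POSIXLY_CORRECT disables the permutation, and -l is
# then taken as a (nonexistent) file operand instead of an option.
ls /tmp -l >/dev/null 2>&1 && echo "permuted: -l parsed as an option"
POSIXLY_CORRECT=1 ls /tmp -l >/dev/null 2>&1 || echo "POSIX mode: -l taken as an operand"
```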

In short, while standardization does help some, and your chances of getting a reasonable selection of basic utilities with standard features are much higher than they were in the early 80s, you still can't just code for the standard and forget about the details if you want your code to be portable. Unlike shell language portability, where preambles can generally fix things up well enough, utility programs are extremely difficult to replace on the fly in most cases; however, there are some cases where it may be practical to build a portable shell or C version of a utility and use that.

busybox

The busybox program, used heavily in embedded Linux systems, offers customized (and stripped-down) versions of a number of standard UNIX utilities. You may be wondering whether busybox is a shell or a utility; in fact, it is a combined binary that performs different functions depending on the name it is called with. Typically, a single binary containing a shell and a number of basic utilities is installed in a root file system, with most of the standard utilities as symbolic links to the busybox binary.

Porting to a system that uses busybox for most of its utilities can be a challenge. Not all busybox systems offer the same utilities; individual utilities may be removed from the busybox package to reduce space usage, and some systems install other utilities. While there is a busybox variant of grep, a vendor could also install a standard grep on a busybox system.

In general, if you are expecting to run on embedded systems, you need to know that before you start writing any code. You are no longer targeting a standard UNIX desktop environment, and there are a number of surprises. The biggest surprise, though, is how many programs work just fine on busybox.

Shell Builtins

There are two significant aspects of shell builtins that affect utility portability. The first is the question of which standard utilities, such as test or expr, a given shell may have chosen to implement as built-in commands. The second is the question of which features a shell provides that must be implemented as builtins to work at all. The second question is better viewed as a shell language portability question, but the first is significant when considering utility portability.

In many cases, the risk of shell builtins is that they will hide an underlying portability issue. For instance, the exceptionally useful printf utility is not universally available. It is provided as a builtin by ksh93 and bash, though, so it is quite possible to run a quick test in your preferred login shell on a target system and conclude that the command is provided by the system. (In fact, in the particular case of printf, it turns out that no system I can find lacks it in /bin/sh using the default path, whether as an external utility or a built-in command.) Even worse, a built-in command may have different behaviors in some cases than the external command or may offer additional features that you might accidentally use, thinking they are provided as part of the standard behavior of the command.

In most cases, the solution is simple: Rely on the common behavior rather than on extensions. However, you may want to ensure the use of an external command in some cases; specify the path to the command you want to execute. By contrast, if you know that you want a feature provided by the shell's built-in command, and that the external utility lacks it, do not specify the path.
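For instance, assuming the external printf lives in /usr/bin (which you would verify on each target):

```shell
# Bypass any builtin by giving an explicit path; only the external
# utility can be reached this way.
/usr/bin/printf 'external: %s\n' "hello"

# With no path, the shell's builtin printf (if any) runs instead.
printf 'resolved by shell: %s\n' "hello"
```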

Unfortunately, shells do not always document exactly which commands are built in or provide a standard way to check. You can, however, determine this indirectly by temporarily setting the $PATH environment variable to an empty string:

$ PATH="" ls; echo $?
ls: not found
127
$ PATH="" echo hello; echo $?
hello
0

The shell searches $PATH for a command (containing no slashes) that is not a builtin but does not search the path for built-in commands. This allows you to at least find out whether a shell is providing its own version of a command. The which utility can tell you whether a command is in your current path, but it does not tell you whether the shell has a builtin of the same name. Some shells offer useful additional features to let you find out whether a command is a function, alias, built-in command, or external command, but these are not portable or consistent between shells.
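As an example of such nonportable introspection, many shells provide a type builtin; treat its output as informational only, since the format is unspecified:

```shell
# Ask the shell how it would resolve each name.  The wording of the
# answer varies between shells, so scripts should not parse it closely.
type echo
type ls
```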

The question of whether a command is a builtin is not the only question; on some systems, there may be multiple versions of the same external command with different behaviors; see the section "Multiple Different Versions" later in this chapter for more information on that problem.

A third aspect of builtins and their impact on portability is more subtle; in some cases, a shell builtin has nothing to do with a command-line utility of the same name. For instance, some old systems had a FORTRAN compiler named fc. The POSIX history editing mechanism uses a shell builtin named fc. This kind of thing can be a bit surprising. Luckily, it is also fairly rare; developers tend to try to avoid clashing names, so it is atypical for two commands with the same name to come into widespread use.

Avoiding Unnecessary Dependencies

A program often has requirements or assumptions about its operating environment. In general, the things that must be true for a program to work are called its dependencies. When developing a script, you should aim to eliminate as many dependencies as possible in order to increase the number of systems your script will work on.

Determining whether a dependency is optional is not always easy. Be sure to distinguish between the particular programs you have available and the functions for which you are using them. For example, consider a script that uses a small Perl program to sort lines by their length:

#!/bin/sh
perl -e 'print sort { length $a <=> length $b } <>;' "$@"

This script (which could also be implemented easily enough as a pure Perl script) obviously requires the perl program to run. You might conclude that, to port this script, you must have perl installed on your target system. However, it is possible to duplicate this script's functionality without perl:

#!/bin/sh
cat "$@" | while IFS= read -r i
do
  echo "${#i} $i"
done | sort -n | sed -e 's/[0-9]* //'

This script is probably less efficient than the original script. However, it produces the same effect; each line is sorted by its length. The only external programs used are cat, sort, and sed (and possibly echo in a shell that does not provide it as a built-in command). None of these commands are rare, and none of the features used are rare or even particularly new. The result is almost certainly a great deal slower, but it is quite portable.

The cat "$@" idiom bears a moment's examination. In general, the cat command is not very useful in pipes; you can always just redirect something from a file or omit cat in the case where input should just be standard input. However, in this case, it performs the useful service of either concatenating all of the command-line arguments or simply passing standard input into the while loop. (Remember that "$@" does not expand to an empty string when there are no arguments, at least in recent shells; instead, it expands to nothing at all. In a very few older shells, you may need to use ${1+"$@"}; this issue was discussed in more detail in Chapter 7.)

The hardest part of avoiding dependencies is usually identifying them. Most dependencies are possible to work around, but it can be very easy to mistakenly rely on something without realizing that it is not portable.

Relying on Extensions Considered Harmful

With all the material you've seen so far on portable programming, you might reasonably wonder why there are so many features that are not portable. While some portability problems are the result of a developer not completely implementing something, many more are the result of a developer choosing to implement something additional. Many system utilities have shortcomings that may be frustrating enough that a developer would correct for them.

With that in mind, understand that one of the limitations of portable shell programming is that it may be harder to write something portably than it would have been to write it relying on an extension. So why bother? The answer is that the payoff in portability is usually worth it. Often, a special option streamlines or simplifies something that you can do by a slightly longer route. For instance, some versions of cat have a line-numbering option, but this is easy enough to implement without using cat. While the option might be convenient, it is hardly necessary. It may save you a minute while writing the script—and cost you an hour when trying to use the script on another system.
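For example, here is a sketch of replacing cat's nonstandard line-numbering flag with standard awk:

```shell
# Sample input (hypothetical file name).
printf 'alpha\nbeta\n' > sample.txt

# Nonportable convenience, on systems whose cat supports it:
#   cat -n sample.txt
# Portable equivalent using only standard awk:
awk '{ printf "%6d  %s\n", NR, $0 }' sample.txt
```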

Two major difficulties may arise when you try to run a script that relied on extensions on a new system. The first is that the new system, lacking the extensions, will probably have no documentation to tell you what they did. This can make it very difficult to determine what to use to replace the nonworking code. The second is that there is a chance that the new system may have different extensions using the same utility name or option flag; in this case, you get no warning as to what has gone wrong.

This leads to a piece of advice you may not expect in a book advocating portable programming: Learn to use the extensions available on the systems you use. Understand them because they will save you a lot of time and effort in one-off scripts. Keep in mind that they are extensions, but don't be afraid to learn them and use them when they are appropriate. Despite the apparent contradiction, I have generally found that familiarity with extensions makes it easier to avoid them. It also makes it easier to understand scripts that were not written portably.

Try to ensure that any features you rely on beyond the basic standards are clearly identified and are genuinely necessary. You will find it easier to keep your scripts running, and you will also find it easier to write new scripts if you have formed good habits.

During development, you may find it rewarding to use wrappers that warn you about nonportable options rather than accepting them. These can be shell functions, or wrapper scripts collected in a directory placed early in your $PATH while developing. Either way, they make it easier to catch nonportable code during testing. (I am not aware of an existing collection of these, but it sounds really useful.)
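A minimal sketch of such a wrapper, here as a shell function that flags a few cat extensions; the option list is illustrative, not complete:

```shell
# Development-time wrapper: warn about some nonportable cat options,
# then run the real cat anyway so the script still works.
cat () {
  for arg in "$@"; do
    case $arg in
      --) break ;;
      -n|-A|-v) echo "portability warning: cat $arg" >&2 ;;
    esac
  done
  command cat "$@"
}

echo hello > wrapped.txt
cat wrapped.txt                 # no warning
cat -n wrapped.txt 2>warn.txt   # warning goes to stderr
```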

Test with More Than One Shell on More Than One System

For some reason, authors love to advise people to try things and see what happens. This is excellent advice if you keep in mind that what happens today on a particular system may not be what happens tomorrow on a different system. When writing portable code, you should start by looking at documentation, standards, and other things that will be less ephemeral than a particular implementation's combination of bugs and extensions. However, sooner or later you will probably need to test some things out.

When this time comes, test on multiple targets. Run your code in several shells. Run your code on several systems. Developers sometimes refer to the set of possible combinations as the "matrix of pain" because it can be very difficult to keep multiple combinations working all at once. Luckily, portable shell code is not nearly as painful to maintain as some things are (such as combinations of kernel options across multiple architectures). The purpose of tests like these is to maximize the chance that you find out sooner, rather than later, about a prospective portability problem.

If you are writing for the plain POSIX shell, it may seem counterintuitive to intentionally seek out shells with additional features and extensions to test with. However, these additional features and extensions might, in some rare cases, cause a nonobvious bug. Furthermore, in many cases, differences are merely implementation choices, where any of several possible outcomes is permissible. You want to find out whether you are relying on any of these as soon as possible.

If you are writing significant production code, set up a test environment with all the target systems available. It may be impossible to catch every possible target environment, but for production code, it is reasonable to set up five or six target systems for testing. Be sure that your test code is run in the default environment for each system, not just in your personal shell. Most shell programmers have a number of changes in their environment that may create surprises, such as an unusual path, special environment variables, and more. (Note that you must also test your scripts in these environments, not just in the default environment.)

In fact, you may wish to steal an idea from developers who work in compiled languages and set up regular and automated testing of scripts across a variety of systems. Automated tests are more likely to be run than manual tests. In my experience, the times when I break the portability of my code are not the times when I am worried that I am about to do something nonportable, but rather the times when I am not thinking about it. Because of this, manual testing is surprisingly unlikely to catch the real errors; the times when I think to run the tests are the times when I've probably gotten the code right to begin with.

Document Your Assumptions

It is quite possible that, after reviewing your target environment, you will conclude that a particular dependency is simply too important. If you have a clear notion of what target systems you need to worry about, and you are quite sure they all provide a utility or a particular extension, you can go ahead and use it. Do yourself (and everyone else) a favor, though, and identify what the assumption was you made. A script that only runs in bash can be annoying. A script that only runs in bash, but presents itself as a generic shell script, is maddening.

If your script requires particular utilities, especially if they are not very widely installed, comment on them, or even check for them and warn the user if they are not available. A few comment lines at the beginning of a script, or near the line where you do something risky, could save a future developer (possibly you) hours or days of work. For example, the quilt patch utility is excellent but relies heavily on GNU utility extensions. This is particularly difficult when targeting Mac OS X, as simply building GNU utilities for that system does not always produce desirable results; the file system has some unusual features not found in other UNIX systems, which are supported by special code unique to the OS X versions of some core utilities. Gary V. Vaughan, the technical reviewer for this book, spent a number of working days getting quilt working on Mac OS X a couple of years back. I've personally spent as much as a day on a single smallish script trying to make it work on a different version of UNIX; it can be very hard to track down portability bugs, especially undocumented ones.

Good documentation can help a lot. If a script clearly indicates what it is relying on, this makes it easier to understand the intended behavior. The biggest problem is often being unable to figure out what the writer intended the program to do in the first place.

Common Utility Issues

While it is impractical to give a list of all the variances you will encounter in the wild while writing shell scripts, there are a few issues so utterly persistent and pernicious that I want to call special attention to them. These are the programs that have bitten me repeatedly across a number of projects, have haunted me for years, and have otherwise caused more than their fair share of grief.

Public Enemy #1: echo

The echo utility is in a class by itself. While other programs may vary more widely in their behavior, none vary more widely per feature provided. The initial issue that started this all is quite simple; it is very common to wish to display some text without going to a new line immediately. This allows long lines of text or output to be assembled in bits and pieces and often dramatically improves the user interface of a script. Unfortunately, it cannot be done portably. The reason is that the Berkeley people added a feature to implement this using an option flag; the -n option suppresses the trailing new line. The System V people also added a feature to implement this; they added support for backslash escape sequences, and text ending with \c causes echo to suppress the line ending.

Neither of these is a good choice for a fundamental utility like echo. The key function of echo is to reproduce its arguments precisely. As soon as you create special cases, you have made things hard for the user. How should a user of Berkeley echo cause it to produce the string -n followed by a new line? Causing echo to interpret backslash escape sequences creates an additional nightmare. Depending on the version (and different versions, of course, do it differently), you not only have to deal with the shell's usual quoting rules, but must also take care to handle possibly differing versions of the quoting rules for echo. For instance, the command echo '\\' produces two backslashes in ksh but only one in zsh. You can mitigate this by calling /bin/echo explicitly, but it may not always have the behavior you want.

The blunt reality is that it is not portable to use echo when either the first argument begins with a hyphen or any argument contains a backslash. What's worse, there may not always be a portable alternative. Many modern systems provide an extremely flexible utility called printf. When it is available, this solves the problem by allowing you to specify both data and format, as well as providing for a number of other useful features, such as alignment, significant digits, and so on. In fact, I know of no desktop or server systems lacking printf; unfortunately, it is still sometimes omitted from embedded systems.
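A few examples of printf sidestepping echo's traps; all of this is standard printf behavior wherever printf exists at all:

```shell
# Options are never guessed from the data, and backslash escapes live
# only in the format string, never in %s arguments.
printf '%s' 'partial line'     # no trailing newline
printf '\n'
printf '%s\n' '-n'             # the literal two-character string -n
printf '%s\n' 'a\b'            # backslash in the data passes through
```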

The Korn shell introduced a print built-in command to alleviate this problem, but it is portable only to ksh and pdksh. The Bourne-again shell provides a printf builtin, reasonably compatible with the common external utility.

The closest I have found to a remotely acceptable portable solution is to take advantage of other commands to filter the output of echo:

func_echo_noline () {
  /bin/echo X"$@" | awk '
{
  if (NR > 1) print lastline;
  else sub(/^X/, "");
  lastline = $0;
}
END { printf "%s", lastline; }
'
}

Instead of trying to coerce the echo command into doing something useful, you can filter its output using the awk command. This small awk script strips the leading X (which is used to prevent echo from interpreting a -n argument as an option), then prints every line it receives, suppressing the new line after the last line. This is the simplest solution I've found to the "no new line" problem. (This awk script is explained in more detail in Chapter 11.)

None of this gets into a much more fundamentally awful decision; some versions of echo process backslash escapes and generally discard backslashes. This breaks the conventional idiom for escaping backslashes:

foo=`echo "$foo" | sed -e 's/\\/\\\\/g'`

A complete solution to this is large (around 140 lines of shell code in recent configure scripts). In short, it is extremely hard; this is a case where, if you can avoid embedded systems, the shortest path is to switch to printf or use a preamble to reexecute in a shell that has a built-in echo without the bug. (I am aware that the bug in question was implemented that way on purpose; it's still a bug.)
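If printf is available, you can avoid echo in this idiom entirely. A sketch, using $( ) command substitution (which, unlike backquotes, does not itself consume backslashes):

```shell
# Double every backslash without trusting echo: printf's %s emits the
# data untouched, so only sed's quoting matters.
foo='path\to\thing'
foo=$(printf '%s\n' "$foo" | sed -e 's/\\/\\\\/g')
printf '%s\n' "$foo"           # prints path\\to\\thing
```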

Multiple Different Versions

Some vendors helpfully provide more than one version of some common utilities. This can create a portability nightmare for scripts. For instance, on some older systems, utility programs might be found in /usr/bin, /usr/ucb, and /usr/5bin. The ucb directory would hold Berkeley versions of utilities, while 5bin would hold System V versions or programs designed to act mostly like them. Many modern systems have common programs stored in directories with names like /opt/package/bin, /usr/local/bin, or /usr/pkg/bin. Furthermore, users may have their own personal binaries in a directory in the path, and sometimes these binaries clash with standard system binaries.

Correcting for this is hard. If you provide your own setting for $PATH, you have to be sure you get it right on a broad variety of foreign systems. Even worse, in some cases, users may have chosen an unusual path value because they really do need your script to use different binaries. In general, your best bet is to test your scripts with multiple path settings but not to try to outsmart the user; use the provided path and hope for the best. However, do be aware that on systems with multiple binary directories, a change in the order of the path sometimes changes the behavior of utilities, even common and well-known utilities.

Archive Utilities

The UNIX environment provides a broad selection of archive utilities, but they are not as portable as you might hope. Archiving utilities are significant in many shell scripts because they are often a useful way to copy a large number of files while preserving metadata—attributes such as modification dates, permissions, or other traits beyond just the data in the file. Copying batches of files using an archive utility is frequently better than copying them using plain old cp. For network copying, there is often a substantial performance improvement from using archive files.

There are three primary archive utilities commonly used on UNIX systems: cpio, pax, and tar. Historically, cpio is associated with AT&T System V, tar with BSD, and pax was introduced by POSIX. The tar utility is the least flexible but the most commonly available; it supports only its own native format. (Some versions may handle two or three variants of the tar format, but only variants within the format, not other formats.) The cpio and pax utilities support both the reading and writing of a variety of formats. The modern cpio format is probably the most comprehensive and flexible; it handles a variety of special cases (such as extracting only one of several hard links to the same file), which the others lack support for. There is an additional pax-only archive format (referred to as the pax format in the documentation) that is a superset of the POSIX tar format, which also supports hard links and very long pathnames.

All three produce uncompressed archives; if you want compression, you combine your archive utility with a compression utility. Some versions of the tar utility have options to do this automatically for you, but this is not portable. In general, all three are happy to work with archives on standard input or standard output, so you can use them in a pipeline with a compression utility. As an example, here are three roughly equivalent commands to create a compressed archive of the directory files:

tar cf - files | gzip > files.tar.gz
pax -w files | gzip > files.pax.gz
find files -print | cpio -o | gzip > files.cpio.gz
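Extraction is the same pipeline run in reverse. The following sketch builds a small archive first so it is self-contained (the file names are illustrative); the pax and cpio analogues appear as comments:

```shell
# Create a tiny archive, then unpack it elsewhere.
mkdir -p files && echo demo > files/a
tar cf - files | gzip > files.tar.gz
mkdir -p unpack && cd unpack
gzip -dc ../files.tar.gz | tar xf -
# gzip -dc ../files.pax.gz | pax -r
# gzip -dc ../files.cpio.gz | cpio -id
cat files/a   # prints demo
```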

The tar and pax utilities implicitly recurse through named directories (this behavior can be disabled in pax). The cpio utility packs only those files explicitly named. Both cpio and pax can take file lists from standard input; tar only takes them on the command line. If you send the output of find into pax, be wary; it will generate multiple copies of files in a directory if you name both the files and the directory containing them. This all gets very complicated, and it is easier to see in tabular form, as shown in Table 8-1.

Table 8-1. The Archive Utilities

Utility   Recursive   Files on Input   Files on Command Line   Passthrough   Formats
tar       Yes         No               Yes                     No            tar, POSIX tar
cpio      No          Yes              No                      Yes           tar, POSIX tar, cpio
pax       Option      Yes              Yes                     Yes           tar, POSIX tar, cpio, pax

There are several common cpio archive formats; most modern cpio utilities (and most versions of pax) can read and write each others' archives.

Because the use of archive utilities to create an archive and immediately unpack it is so common, both cpio and pax have the option of running in a direct copy mode. In this mode, they are similar to cp -R, only much better at preserving metadata.
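A sketch of the idea, using tar pipes to illustrate (the directory names are illustrative); cpio's built-in passthrough mode appears as a comment:

```shell
mkdir -p src/sub dst
echo data > src/sub/file
# The tar equivalent of passthrough mode: pack and unpack in one pipe.
(cd src && tar cf - .) | (cd dst && tar xf -)
# cpio's passthrough form (better metadata fidelity), for comparison:
#   (cd src && find . -print | cpio -pdm ../dst)
cat dst/sub/file   # prints data
```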

Finally, there are many sorts of metadata. While all three utilities can generally handle the metadata traditionally supported by UNIX systems, many systems have newer features. Access control lists (ACLs) are commonly supported on modern UNIX. Mac OS X can have additional data stored under the same name in a separate "fork," which no non-Mac utility can even describe, let alone refer to or copy. (The utilities included with Mac OS X have, since Mac OS X version 10.4, generally handled this for you transparently.) Unfortunately, it is not portable to expect any of these utilities to copy such additional data. This is mitigated somewhat by the fact that the additional data themselves are not portable either.

For the most part, you can use the native archive utilities provided with a system and expect them to copy files well enough that they look the same to users. If metadata are lost, they are usually metadata that users would not have noticed in the first place. There are exceptions; on Mac OS X version 10.3, tar did not preserve resource forks. However, if you wanted to build a new version of GNU tar and use it on Mac OS X, it might not copy things correctly. (Luckily for you, the system's provided tar is a suitably modified GNU tar as of this writing.)

The implications of these differences for portable scripting are a bit of a pain. Nearly every system provides at least one of these utilities. Many provide two of them. Right now, tar is the most commonly available, but its limitations in dealing with long pathnames may make it impractical for some tasks. If you can find pax on the systems you need to work on, it is usually pretty good; it has a broader set of fundamental functionality than either of the others. However, if you need special features, such as reliable duplication of hard links in a copy, you may need cpio. Luckily, the GNU project provides a workable cpio program in fairly portable source.

Passing file names as standard input has one key limitation; file names containing new lines break it. This is not a specific bug in either pax or cpio, but a general limitation of the protocol. This bug does not necessarily result only in failures to archive files. If you are running as root and try to archive a user's home directory with find | cpio, think about what the archive ends up containing if the user has a directory named myroot<newline> containing files with the same relative paths as system files. A user who wanted a copy of /etc/passwd.master (the BSD convention for the file actually holding password data) could create a file called myroot<newline>/etc/passwd.master. Passed on standard input, this becomes two things: first, a directory named myroot (which does not exist and is not archived) and second, a file named /etc/passwd.master. You can pass file names as arguments to pax, but cpio does not have such an option.


Another common problem with archive utilities is their behavior when dealing with absolute paths. A few utilities, such as GNU tar, do not restore absolute file names by default; instead, they strip the leading slash and restore everything relative to the current directory. This behavior is wonderful, except when you get used to it and assume other programs will do it, too. Be very wary when unpacking archives that seem to have absolute paths, and be sure you know how your archiver of choice behaves before unpacking any archive containing an absolute path. Note that some variants may treat this differently; I have seen pax variants that automatically stripped leading slashes and others that didn't. Assuming that an archiver will safely restore only into the current directory can cause you a lot of trouble; I've lost a lot of data by assuming I could just extract a backup into a temporary directory, only to find that it restored over the data from which it had been backed up a week earlier.

The final problem often encountered in archives is long file names (including the path). The original tar header format allows the specification of file names up to 100 characters. This is not long enough. Modern POSIX archivers can usually handle up to 250 characters and should be able to handle each others' files; but older tar programs may choke on such archives, or extract files whose names exceed 100 characters under the wrong names. To avoid these problems, use terse file names or use one of the cpio formats.


Block Sizes

Many utilities give sizes or disk usage information in blocks. Most often, a block is 512 bytes, but it may also be 1024 bytes. Some utilities have (sadly, not always portable) options to specify other block sizes. The environment variable $BLOCKSIZE is not portable; there is no reliable expectation that BLOCKSIZE=1m du will give disk usage in megabytes rather than in kilobytes or half-kilobyte blocks. However, this variable does exist on many systems, especially BSD systems, and users may have set it. This can produce very surprising behavior; the typical failure mode is for an installer to query for free space on a disk, apply a scaling factor to the reported number of blocks, and conclude that your disk does not have enough space free. If you are using any utilities that rely on a block size, be sure to check what block size they use and verify that you know what units they are using. With larger disks, users may have set $BLOCKSIZE to something large; a megabyte is not uncommon, and gigabytes are starting to show up.

This has no impact on the block sizes used by dd.
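If you just need a number in known units, ask for them explicitly. A sketch, assuming df -k and the usual column layout (which is common but not guaranteed everywhere):

```shell
# -k requests 1024-byte blocks regardless of $BLOCKSIZE.
free_kb=`df -k . | awk 'NR == 2 { print $4 }'`
echo "free: ${free_kb}KB"
```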

Other Common Problems

For a comprehensive and often-updated list of issues with common utilities, check out the autoconf manual, which discusses issues you may face in writing portable shell scripts. This section gives a brief overview of some of the general problems you might encounter, as well as a few highlights from the larger list. Many of the issues described in the autoconf manual are rare on modern systems.

Avoiding these problems can be tricky. As the size of this list suggests, it is difficult to learn all the nooks and crannies of hundreds of utilities. A few basic techniques are available to you. The first is to be aware of standards. Check standard documentation, such as The Open Group's online standards (found in their catalog at www.opengroup.org/products/publications/catalog/t_c.htm) showing the current state of the standard. Read the man pages, and look in particular for information about standard conformance and extensions. When you see a new option you had not previously encountered, be sure to check the man pages on other systems as well.

Testing a feature quickly on several unrelated systems may also help. You have to have a representative sample of your target systems; if you are writing truly portable code, this can be very difficult to obtain. Portability to systems you don't yet have access to is where a list like the one in this chapter (or the longer one in the autoconf manual) comes in handy.

You cannot simply assume that the features of the oldest system you have available are universal; in some cases, newer systems remove features when implementing new ones. This can lead to a problem where two systems provide mutually exclusive ways to solve a given problem. For instance, the option used with sort when using a key other than the entire input line varies between systems, with some new systems no longer accepting the historical options (details are provided in the following alphabetized list of commands).

While the core functions of utilities like sed are reasonably stable, many more esoteric utilities are unique to particular systems. Many systems provide the column command to convert input lines into columns (although not all do); only a few provide the rs command (a more general program that reshapes input data).

Case-insensitive file systems can occasionally create an extra utility portability problem; for instance, one common Perl module can provide aliases with names like GET and HEAD. If you install this on a machine with a case-insensitive (or case-preserving) file system, your path now determines whether the head command is a convenient utility to obtain results from a web server or a standard utility to display the first few lines of a file.

When trying to guess how portable a program will be, check the man page for "Standards" and "History" sections. A command that complies with POSIX is probably more portable than one with no listed provenance or that claims to have been introduced in 4.4BSD.

awk

In general, every system ought to have some variety of awk. Check for variants; there may also be nawk, gawk, or mawk. Any of these variants may be installed under the name awk. A few systems provide an old pre-POSIX awk by default, so check for others. If there is an oawk command, it is quite possible that plain awk is really nawk. If you have to use one of the old versions, it has a number of limitations, such as not supporting user-defined functions. The printf (or sprintf) function may have gratuitous smallish limits on format lengths in old versions.

If you are writing a significant amount of awk code, be sure to test it against at least a couple of variants. (See Chapter 11 for more information on awk.)
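One way to do that is to probe for a usable variant at startup. A sketch; the candidate list is a common set, not a guarantee:

```shell
# Find an awk that supports user-defined functions (pre-POSIX awk lacks them).
AWK=
for candidate in awk nawk gawk mawk; do
  if "$candidate" 'function f() { return 1 } BEGIN { exit !f() }' </dev/null 2>/dev/null; then
    AWK=$candidate
    break
  fi
done
echo "usable awk: ${AWK:-none}"
```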

basename

Unfortunately, this is not universally available. You can use expr, or you can use ${var##*/} constructs. The latter are marginally less portable, but if you find yourself committed to a POSIX shell, they're available.
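Both spellings in one sketch; note that the expr form prints nothing if the path contains no slash:

```shell
path=/usr/local/bin/tool
expr "X$path" : 'X.*/\(.*\)'   # prints tool (works in pre-POSIX shells)
echo "${path##*/}"             # prints tool (POSIX parameter expansion)
```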

cat

There are no options to the cat command in portable code. The things that some variants provide as options must be done using other programs or utilities.

chmod

Do not rely on the behavior of chmod when you don't specify which users a permission change applies to; use chmod a-w, not chmod -w.

cmp

On systems that distinguish between text and binary files, cmp compares binary files; for instance, on DOS systems, it will report differences in line endings and differences in content. The diff utility treats files as text files.

cp

There is no portable -r option, and even when there was, its behavior was often undesirable. Use -R or look at archive utilities as another option. The -p option may lose subsecond timestamps on file systems that provide them.

cpio

See the "Archive Utilities" section earlier in the chapter. Some versions have options (usually -s and -S) to swap bytes of input data, which are useful when processing old binary archives; but these are not universal and may be unimplemented on modern systems.

cut

Most systems now distinguish between characters (-c) and bytes (-b). Some older systems may have only -c; these systems usually mean "bytes" when they say characters and are not multibyte aware.

date

The use of specifiers similar to those of strftime() in date to format output is fairly standard, but some systems may not implement all the standard specifiers. There is no portable way to specify a time other than the present to format or display, and not all systems provide one. BSD systems typically recognize -r seconds, where seconds is a number of seconds since the epoch. GNU systems typically recognize -d date, where date is any of a number of common date formats. One particular date format recognized by GNU date is @seconds, where seconds is the number of seconds since the epoch. SVR4 date has neither.

While the GNU date utility does have a -r option, it is -r file to report the last modification time of file; there is no corresponding option in BSD date. This is an example of a case where the error message you get using a construct on a different system might be surprising; if a script written on a BSD system uses date -r, the error message on the GNU system will not indicate an invalid argument, but rather complain about a nonexistent file.
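A guarded sketch that tries both spellings; the epoch value is arbitrary, and -u pins the time zone so the output is predictable:

```shell
secs=1000000000
if date -u -d "@$secs" '+%Y-%m-%d' 2>/dev/null; then
  : # GNU date understood @seconds
elif date -u -r "$secs" '+%Y-%m-%d' 2>/dev/null; then
  : # BSD date understood -r seconds
else
  echo >&2 "no known way to format an epoch time"
fi
```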

diff

A few rare versions do not compare /dev/null to any file correctly. The unified patch format (-u option) is not completely portable. Patches in the "context diff" format (-c option) are not as easy to read, but they are more portable.

dirname

Not universally available. As with basename, you can use expr or the POSIX parameter substitution rules.

ditroff

See nroff.

dos2unix

Many systems include a program named dos2unix (and often another named unix2dos) to translate line endings. This is not portable; many other systems lack it. The general difficulty is that there are three common choices about line endings; new line only (UNIX), carriage return followed by new line (DOS and Windows), and carriage return only (classic Mac OS). Translating between these generically is slightly difficult because line-oriented UNIX utilities do not lend themselves well to translating line endings. You can use tr to remove trailing carriage returns from files known to be in the carriage return followed by new line (CRLF) format.
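For example, a file known to be CRLF can be cleaned with tr; \015 is the old-school spelling of carriage return, which even old tr implementations accept:

```shell
# Strip carriage returns from a known-CRLF file.
printf 'one\r\ntwo\r\n' > dos.txt
tr -d '\015' < dos.txt > unix.txt
```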

Removing or adding carriage returns is easy. Translating carriage returns to new lines is also easy. However, handling all three input formats generically is a bit hard. Even worse, some files genuinely contain both carriage returns and new lines, such as captured output from some programs. Be cautious when trying to translate line endings. (I know of no completely general solution. Sometimes you just have to know what the format is.)

echo

Not very portable. Avoid any backslashes in arguments to echo, and avoid first arguments starting with hyphens. Sadly, this is still often all there is, but see the previous discussion under the "Public Enemy #1: echo" section of this chapter. It may be better to just use printf, if you do not need to target embedded systems.

egrep

Not quite universally available, but neither is grep -E. Modern systems are more likely to support grep -E, and older systems are more likely to provide egrep. You can test for this:

if echo foo | grep -E '(f)oo' >/dev/null 2>&1; then
  EGREP='grep -E'
elif echo foo | egrep '(f)oo' >/dev/null 2>&1; then
  EGREP='egrep'
else
  echo >&2 "Cannot find viable egrep/grep -E!"
  exit 1
fi

The pattern (f)oo matches the string foo only if it is an extended regex. While I have never seen a system that provided an egrep (or grep -E) that did not work on extended regular expressions, I am also deeply distrustful of vendors at this stage in my career.

expect

The expect program is not a standard utility, although many systems install a variant of it.

expr

Prefix both strings and regular expressions (for the : operator) with X to avoid possible confusion or misparsing. While other characters might work, X is idiomatic. Many implementations offer extensions that are not portable; be careful. There are a number of quirks in regular expression handling in some systems; be careful. For additional information on regular expression matching and expr, see Chapter 2.
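A small illustration of the guard; without it, values like "match" or "-v" can collide with expr operators:

```shell
val=match
expr "X$val" : 'Xma.*'   # prints 6, the number of characters matched
```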

fgrep

As with egrep, POSIX specifies a new grep -F usage, which is not universally supported. Old systems may have fgrep, new systems may have grep -F. Unlike other grep variants, fgrep does not check for regular expressions; it checks for fixed strings. Contrary to various rumors, the "f" stands for "fixed," not "fast" or "file."

find

The replacement of {} with a file name in -exec is portable only when {} is an argument by itself. The -print0 option, which uses ASCII NUL characters instead of new lines to separate file names, is unfortunately not universal. See the discussion of xargs.

Older implementations of find used special predicates such as -depth, -follow, and -xdev to change searching behavior; newer implementations use corresponding -d, -h, and -x options, which precede the pathname. Neither solution is completely portable now. The -regex option is not portable either; the only portable name matching is the standard shell globbing used by -name. Different versions of find that support -regex may not use the same regex rules; GNU find defaults to the emacs regex rules (described in Chapter 2), for instance. Remember that the glob matching is done within find, so switching shells does not change which glob rules are used.
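A sketch of the portable -exec form; the file names are illustrative, and note that {} stands alone as an argument:

```shell
mkdir -p scratch
touch scratch/a.tmp scratch/keep.txt
find scratch -name '*.tmp' -exec rm -f {} \;
ls scratch   # only keep.txt remains
```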

grep

Do not rely on options other than -clnv. There is no portable option to completely suppress output and yield only a return code; instead, redirect the output (and error) from grep to /dev/null. Some implementations of grep may behave surprisingly on binary files, and the behavior is not portable. Do not pass non-text data to grep in portable scripts.
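For example, to use grep purely for its exit status:

```shell
printf 'needle\n' > haystack.txt
# No portable -q or -s option; redirect instead.
if grep needle haystack.txt >/dev/null 2>&1; then
  echo found
fi
```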

groff

See nroff.

info

The GNU info documentation system is not universally available. It offers a number of very useful documentation tools, but info documentation tends to be underused because users prefer to read man pages. Do not rely on the availability of any of the tools usually used with this.

killall

This is not particularly portable. Even worse, different systems offer wildly different semantics for some common and likely command lines. Avoid this.

ldd

The ldd utility examines an executable file and tells you which shared libraries it uses. If you are writing an installation utility or something similar, this is exactly the tool you are likely to look at to see whether suitable shared libraries are available. Unfortunately, this utility is not universal. In particular, Mac OS X does not provide any program named ldd; the closest equivalent is otool -L, which is unique to Mac OS X. Some embedded systems will lack this, or have a version that does not function, as the version distributed with glibc is a bash script, and many embedded systems lack bash.

ln

Symbolic links are not totally portable, although they are pretty close these days; exceptions will usually be Windows machines running one of the UNIX emulation layers. The -f option is not portable. You may want to write a wrapper function that removes the destination file before linking when called with -f. The question of what to do if you need symbolic links and they are not available is difficult. In some cases, creating a hard link might be a workable alternative; in others, it could be a disaster (or simply not work). Copying files may be viable, but in some cases won't be. I do not think it is practical to try to develop a one-size-fits-all replacement for ln -s if you need to target Windows machines. Instead, for each use, think about what behavior you really want, and use it explicitly.
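A sketch of such a wrapper; the function name is mine:

```shell
# Emulate ln -sf: remove the destination, then link.
func_ln_sf() {
  rm -f "$2" && ln -s "$1" "$2"
}
echo v1 > t1
echo v2 > t2
func_ln_sf t1 link
func_ln_sf t2 link   # plain ln -s would fail here: link already exists
cat link             # prints v2
```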

lp/lpr

The two most common printer commands, lp and lpr, are not quite compatible. The simplest cases are fairly stable, but many of the options and control features vary widely, and some historic options are unimplemented in some modern systems. If you need to do printing, you have to know a lot about the specific host systems you need to print on; there is nothing even approximating a portable way to guess at printer selection, printer options, and so on. Not all systems will react gracefully to all inputs to the print command, either. Behavior such as automatically converting input files to desired output formats is not portable. Do not pass large binary files to the lpr command unless you are quite sure about your target system doing what you want, or you will get a lot of scratch paper.

If there is a correctly configured printer at all, you should get consistently acceptable results from sending plain text files to either lp or lpr as standard input. To do more printing than this, you will have to accept that there is some lack of portability in your code. (Of course, any printing at all is at least a little unportable; not every system has a printer.) You can do quite well writing a small wrapper function that targets each of a handful of known systems, though. A representative code fragment might look something like this:

case `hostname` in
  server*)
    LP_TEXT='lpr -P line'
    LP_PS='lp -P postscript -o media=letter'
    ;;
  ...
esac
case $file in
  *.txt) $LP_TEXT "$file";;
  *.ps) $LP_PS "$file";;
  *) echo >&2 "Unknown file type for '$file'.";;
esac

This might be a good candidate for a separate wrapper script, which other scripts use for printing.

m4

The m4 macro processing language is extremely common, and I have never seen a nonembedded system without it. However, some programs may rely on additional features of GNU m4, which may be installed as gm4 on some systems. (The name shares the counting etymology of the term i18n for internationalization.) This program is especially useful if you decide to work with m4sh, which is (of course) written in very portable m4, but it is of some general utility for writing code to generate more code.

make

There is no portable way to include other files in a makefile. The good news is, if you are using make from a shell script, you can assemble the makefiles yourself. While many of the more elaborate make features are unique to a particular variant (usually GNU or BSD make, the two most elaborate members of the family), you can do a great deal with a few simple rules.

If you need inclusion or similar features, you can look at tools like automake or autoconf, which generate makefiles automatically. In simple cases, it may be sufficient to generate makefiles using a preprocessing tool, such as m4, or even create them using a simple shell script. Interestingly, while there is no single standard way to include other files in a makefile, it seems to be quite consistent that every variant of make supports some variant of include file or -include file. (See Chapter 11 for information on how make uses the shell to execute commands.)
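A sketch of the assemble-it-yourself approach; the fragment names are illustrative, and note the literal tab that make requires in rules (written here with printf):

```shell
cat > common.mk <<'EOF'
MSG = hello from common.mk
EOF
printf 'all:\n\t@echo "$(MSG)"\n' > main.mk
# "Include" by concatenation, then build.
cat common.mk main.mk > Makefile
make -s all
```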

makeinfo

See info.

mkdir

Do not rely on the -p option. If you ignore that advice, do not combine -p and -m; the modes of intermediate directories are ambiguous. Some versions can fail if an intermediate directory is created while mkdir -p is running. You can use the widely available mkdirhier and mkinstalldirs shell code. You can also write your own wrapper for mkdir, which handles the -p option:

func_mkdir_p() {
  for dir in "$@"; do
    save_IFS=$IFS
    IFS=/
    set -- $dir
    IFS=$save_IFS
    (
      case $dir in
      /*) cd /; shift;;
      esac
      for subdir in "$@"; do
        test -z "$subdir" && continue
        if test -d "$subdir" || mkdir "$subdir"; then
          if cd "$subdir"; then
            :
          else
            echo >&2 "func_mkdir_p: Can't enter $subdir while creating $dir."
            exit 1
          fi
        else
          exit 1
        fi
      done
    )
  done
}

This function preserves the normal behavior of mkdir -p, including succeeding without comment if a given directory or subdirectory already exists. A leading / creates an empty $1, which is discarded using shift. The subshell is used to avoid having to keep track of the current directory. Empty subdirectories are ignored by standard UNIX convention. Finally, note the setting of the positional parameters inside a loop iterating over them. This works because the loop is not executed until the shell has already expanded "$@".

mktemp

The mktemp utility is not universally available. If you need temporary files, create a directory with mode 700, then create files in it. (To do this, create a directory after running umask 077; you may want to use a subshell for this to restore the previous umask). There is an excellent wrapper function for this (func_mktempdir) in libtoolize.
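A minimal sketch of the technique; a real implementation should mix in more than $$ so the name cannot be guessed in advance:

```shell
# Create a private (mode 700) directory, then put files in it.
tmpdir=${TMPDIR:-/tmp}/myscript.$$
(umask 077 && mkdir "$tmpdir") || exit 1
: > "$tmpdir/work"
```

The subshell keeps the umask change from leaking into the rest of the script.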

mv

Only the -f and -i options are portable. Do not try to move directories across file systems, as this is not portable; use archive utilities or cp and rm. There is no easy way to check whether a move is across file systems, but in general, it is safe to rename a directory without changing its parent directory; anything else you should probably do by copying in some way. The "Archive Utilities" section earlier in this chapter covers some of the issues you may face in copying files; there is no completely portable good answer.

Moving individual files is portable, although very old versions might have a brief window during which neither file exists. Windows- or DOS-based hosts do not allow you to rename an open file.

nroff

The roff utilities (nroff, troff, groff, and so on) are fairly widely supported. However, they do not always produce identical output; do not expect page layout or line break choices to be identical between versions. Not every system has these installed, although they are usually available.

pax

The pax utility is the POSIX "portable archiver," which is a cleaned-up interface for a program similar in functionality to tar. See the section "Archive Utilities" earlier in the chapter. pax is widely available, but not completely universal. If you can verify its availability on the systems you need to target, this may be the best choice.

perl

The perl program is the interpreter for the Perl programming language. Some systems use perl for a Perl 4 interpreter, and perl5 for a Perl 5 interpreter. Others use perl4 and perl, respectively. Do not count on Perl being installed in /usr/bin, and do not count on the version without testing it. In fact, do not count on Perl being installed at all; but if you must, remember that it may be somewhere unusual. Some systems may have multiple installations, some older and some newer. Many users like to use #!/usr/bin/env perl, but this prevents you from specifying the -w option on many systems. (And you should always, always use the -w option.)

pkill

As with killall, pkill is not universally available. Do not rely on this.

printf

The printf utility is found on SVR4-derived systems, on BSD systems, and on Linux systems; in fact, it is essentially universal outside of embedded systems. If you can be sure this is available on all the systems you need, this is infinitely superior to echo. Check your target systems, but this should be considered reasonably portable now. Avoid using options; there are no portable options. Avoid using format strings that begin with hyphens, as they might be taken as options. Most versions should handle a first argument of -- as a nonprinting sign that there are no options coming, but there is really no need for this. If you want to start a format string with a variable of some sort, begin it with %s and specify the variable as the first argument.

Note that printf is not a direct replacement for echo. It is actually a much nicer command, but you cannot simply change echo to printf and expect scripts to work. The most noticeable change is that printf does not provide a trailing new line automatically. The escape sequences used by printf all use backslashes, so it is simplest to use single quotes around format strings.
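The usual translations look like this:

```shell
msg='hello, world'
printf '%s\n' "$msg"   # like echo "$msg", but safe with backslashes
printf '%s' "$msg"     # like echo -n "$msg", minus the portability mess
echo                   # supply the final new line by hand
```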

You can implement much of this functionality through clever use of a one-line awk script because the awk programming language has a standard printf function:

awk 'BEGIN { printf("%s", "foo") }'

Some very old awk implementations may only provide sprintf, requiring you to write print sprintf(...) and making it impossible to omit the trailing new line.

Some printf implementations have fairly short limits on the total length of formatted output or the total length of individual conversions. The lowest limits I have encountered are a bit over 2,000 characters for zero-padded numeric values. (Space-padded values did not have the same problem.) If you stay under that limit, you should have no problems.

ps

Predicting the option flags available for the ps command is an exercise in futility. I planned to write about how systems generally now support the modern POSIX (and historic System V) flags, but the first system I tested doesn't. On Berkeley systems, ps -ax will usually get you a list of all running processes; on other systems, it is ps -ef. It is very hard to write anything portable using ps output, but a partial example is provided in Chapter 10.

python

The python binary is the interpreter for Python. As with Perl, you may have to look around a bit to find a particular version. Conventionally, specific versions are installed as pythonX.Y, so you can tell which version you have. One of my test systems has python2.4 but not python; do not assume that a system with Python installed will have a binary without a version number.

ranlib

The ranlib utility is used to create headers for archives created by the ar archiver. It is only rarely still needed on modern systems. Today, its primary function is as an example of choosing a utility on the fly: on systems that no longer provide or require ranlib, true serves as a substitute.

rm

Both the -f and -r options are portable. There is no guarantee that a silent rm -f has succeeded, and there are circumstances under which it can fail. It is not portable to call rm without any file arguments.

rpm2cpio

The rpm utility is not available for all systems. The rpm2cpio utility is also not available on all systems, but many systems provide it as a way to extract files from RPM packages without using the RPM database. Try not to depend on this in a script, but be aware that there is a way to get files out of an RPM package even without the rpm utility. However, rpm2cpio does not report the ownership of files in the package.

rs

The rs utility "reshapes" input; for instance, it can convert data with one word per line into columns. It is not universally available, being most common on BSD systems. Many of the things rs would be used for can be handled by some combination of cut, paste, or join; failing that, you can do just about anything in awk.
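For instance, one common rs job, reshaping one word per line into columns, can be done portably with paste; each - tells paste to take the next line of standard input for that column:

```shell
# Reshape one word per line into three tab-separated columns.
printf '%s\n' alpha beta gamma delta epsilon zeta | paste - - -
```

Each output row joins three consecutive input lines with tabs.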

sed

The character used to separate patterns (like / in s/from/to/) may not portably appear in patterns, even in a character class. Use a separator that does not occur in these strings. There are a number of special-case bugs with sed on older systems. On modern systems, you get the best portability writing a sed script as a single argument, using newlines to separate commands and avoiding the -e option. If you use ! on a command, do not leave a trailing space after the !. Some versions of sed strip leading whitespace from arguments to the a, c, and i commands. You can use a backslash at the beginning of a line to create an escaped space that suppresses this obnoxious behavior. (See Chapter 11 for more information about sed.)
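A small example of the single-argument style, with newlines rather than -e separating the commands, and no space after the !; the input lines here are purely illustrative:

```shell
# Two commands in one newline-separated sed script: substitute
# "from" only on lines not matching /keep/, then "line" anywhere.
printf '%s\n' 'keep this line' 'change from here' |
sed '
/keep/!s/from/to/
s/line/row/
'
```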

sh

It would take a book (this one, by preference) to describe all the portability issues you may see in variants on the sh program. Remember that there is no guarantee that a shell script is actually running in /bin/sh; if you call a script with sh, it may behave differently than it would if you read it in with the . command or invoked $SHELL on it. If you call it with $SHELL, and your script is being run by a csh or tcsh user, it could be even worse. Your best bet is usually to make sure that all of your scripts either run correctly in /bin/sh or know how to reexecute themselves if they have to, and then always use sh to invoke them.

In general, avoid options other than -c and -x. It is fine to pass commands to a shell through a pipe, but you might be better off using some combination of eval and subshells.

sort

Pre-POSIX systems used +N to indicate an N-column offset in each line as the search key. POSIX uses -k N to indicate the Nth column. So, sort +4 is the same as sort -k 5. The -k option is available on all modern systems, so use it; you are much more likely to encounter a system that does not handle the +N notation than a system that does not handle the -k notation. Behavior when handling complicated keys (such as numeric sorting by multiple keys, or even just numeric sorting by anything but the beginning of the line) is occasionally buggy, although most common cases work. Behavior when sorting nonnumeric keys numerically can be unpredictable. Bugs are much more common when inputs are large enough to require temporary files to hold intermediate results; try not to sort more data than can fit in memory.
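A quick illustration of the -k notation; the sample lines are my own:

```shell
# Sort on the fifth whitespace-separated field; the historic
# spelling of the same sort was +4.
printf '%s\n' 'a b c d second' 'a b c d first' | sort -k 5
```

This prints the line ending in "first" before the one ending in "second".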

stat

The stat utility (usually a wrapper around the stat(2) system call) is not portable. Many GNU and BSD systems provide utilities of this name, with comparable functionality, but they have wildly different invocations and outputs. You can use the test command to answer many questions about files and the ls command for others.
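A sketch of answering common stat-style questions with portable tools; /etc/passwd is just an example path, and some wc implementations pad the count with leading spaces:

```shell
# test(1) answers yes/no questions about a file; wc -c gives a
# regular file's size; ls -l would show mode, owner, and dates.
f=/etc/passwd
if [ -f "$f" ] && [ -r "$f" ]; then
  size=`wc -c < "$f"`
  echo "$f: $size bytes"
fi
```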

tar

See the section "Archive Utilities" earlier in the chapter. Remember to keep an eye out also for GNU tar (sometimes named gtar) and Jörg Schilling's "Schily tar," usually named star. You cannot safely assume that a particular version is available.

touch

On systems where the file system can represent subsecond timestamps, touch may not store information more precise than the second; this can actually change a file's timestamp to be up to one second in the past. On fast machines, this can be a problem.

troff

See nroff.

tsort

The tsort utility performs topological sorts. The most common use of it is to identify the order in which to specify object files to a linker that has to receive files in a sorted order. However, this program can be used for just about any kind of dependency checking. A typical usage would be to make a list of relationships between activities that must be performed in order. For example, if you were writing a script to raid and terrorize coastal villages, you might begin with a list of observations; you have to loot before you can pillage, and you must pillage before you burn. Furthermore, you must defeat all who oppose you before you can loot. You would express this to tsort as a file (or standard input stream) containing pairs of words. Idiomatically, there is one pair to a line:

$ tsort <<EOF
loot pillage
pillage burn
defeat loot
EOF
defeat
loot
pillage
burn
$

This utility is available on most systems, but sometimes outside the standard path; on Solaris, for instance, it is in /usr/ccs/bin.

unix2dos

See dos2unix.

unzip

See zip.

xargs

The -0 option, which uses ASCII NUL characters instead of newlines to separate file arguments, is not totally portable. Unfortunately, its absence creates a serious problem that is at the very least a bug magnet and can create serious security holes. If you cannot ensure that your script will generally be run on systems that provide this option, avoid xargs with file lists that contain files you did not create. Note that this is no worse than the behavior you get passing a list into a while read var loop. It is a potential security hole if you aren't alert to it, but it may be livable. If you can be sure of systems where find and xargs support NUL character separators, use those options.
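Where -0 is available, NUL-separated names survive embedded spaces (and even newlines); here printf '%s\0' builds the list, though find -print0 is the usual producer:

```shell
# Each name arrives at echo intact, spaces and all, because the
# separators are NUL bytes rather than whitespace.
printf '%s\0' 'a file' 'another file' | xargs -0 -n 1 echo
```

Both -print0 and -0 must be checked for; neither is guaranteed on older systems.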

zip

This is an archive utility, usually paired with unzip. It is not universally available, although it's quite common on desktop systems.

What to Do When Something Is Unavailable

Sooner or later, you will find yourself in the uncomfortable circumstance of having guessed wrong on utility portability. The field is too large to keep track of; there are too many utilities to learn, and there are too many systems with local variants and surprises.

But all is not lost. You can generally work around the absence of a utility one way or another, and this section goes into some of the techniques used to handle these circumstances. There are several possible solutions to the problem of a missing utility. You can develop your own clone of it, if it is small, and include it with your script (or even implement it as a shell function in your script). You may be able to get the utility added to your target system, if you have any influence over it. In some cases, you can patch other utilities together to obtain the results you need. Sometimes, you can settle for something nearly good enough. If a system simply does not have symbolic links, you may be able to make do with hard links, or with copies.

One other resolution is on the table: Sometimes, you may find that the best you can do is insist on a more complete or modern system. This is a rare choice and should never be your first response to a problem, but keep it in mind.

Roll Your Own

Sometimes the best way to be sure you can rely on a utility or feature being available is to develop your own. Many common utility programs can be implemented (sometimes more tediously, or more slowly) in terms of other existing utility programs. There are a few ways to approach this. You can write separate utility programs, whether as shell scripts or in another language. However, this leaves you with an additional problem at installation time, which is ensuring that your helper programs also get installed. Many simpler utilities can be implemented as shell functions, which allows you to embed them in a program. You can even use your own script as multiple different programs by defining special command-line options to tell your script to do something special; for instance, the standard autoconf configure script behaves very differently when run with the --fallback-echo argument.

This technique is of limited and specialized applicability, but as long as the programs you need are simple enough to duplicate, it can work. It is also sometimes your only option.
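As a small illustration of the shell-function approach, here is a stand-in for dirname, defined only when the real utility is missing; it sketches the common cases rather than every edge (a lone "/foo", for instance, yields an empty string where the real utility prints /):

```shell
# Define dirname as a function only if no dirname utility exists;
# ${1%/*} strips the shortest trailing /component.
if ! command -v dirname >/dev/null 2>&1; then
  dirname() {
    case "$1" in
      */*) printf '%s\n' "${1%/*}" ;;
      *)   echo . ;;
    esac
  }
fi
dirname /usr/local/bin
```

Because the function is defined conditionally, the script uses the native utility wherever one exists.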

Add a Missing Utility

You can require the user to install additional software or install it yourself. This presumes an environment with some control of the target system; for instance, a script to be run on machines on a corporate network may be able to simply impose a requirement that particular packages must be installed for the script to work. If you are shipping a product and want to rely on particular utilities, you can document them as requirements; this does not work as well because users never read documentation, but it is better than not documenting the requirement.

The weakness of this strategy is that there are times when it is simply impossible for the user to comply. While most scripts do not need to run on embedded systems today, there is a rapidly increasing pool of small portable devices that contain some variety of UNIX and a somewhat stripped-down environment.

Use Something Else

If it turns out that a utility you thought was universal isn't, use something else. The UNIX shell environment is a fairly full-featured programming language, and you can do just about anything in it with enough time and attention. Often, the problem is not that there is no portable solution, but that the portable solution requires you to make effective use of a utility you've never even heard of. This is a great time to go browsing around documentation, trying to think of other key words to search for, and so on.

Demand a Real System

This is sort of the antithesis of portable code, but it may apply. In the case where other requirements, such as performance or development time, are simply too crucial, and a particular system is causing you grief, you may want to see whether that system can be removed from the project definition. As is often the case, 10% of the targets can consume 90% of the work. If you can establish that your code is fine except on systems with a particular flaw, but working around it is going to be difficult and time-consuming, this may be the course to pursue.

Do not do this merely because a system lacks an extension it would be neat to have. Reserve this for cases where the offending system is clearly wrong. Obviously, this never applies when a particular target system is central to the problem specification. If you are trying to write a script that will be used exclusively on an embedded system, you have to work with what that system provides. On the other hand, if you have a script aimed at full-featured desktop systems, it may be impractical to expect you to make it run also on an embedded system with only a stripped-down busybox.

In some cases, after further discussion, this can become the friendlier case of adding a missing utility (see the previous section). If the problem is just that a particular program is absent or buggy, a replacement can perhaps be found and stated as an explicit requirement.

A Few Examples

The first example I ever saw of a common problem along these lines was a little script called install.sh, which was common in free software packages. Because Berkeley and other systems had, in their typical fashion, all disagreed on how to write a program to copy a file to a given location with particular ownership and permissions, many programmers took to writing a portable script that performed the expected functions. The full functionality of the script can be quite complicated; some versions check to ensure they are not copying a file onto itself, strip binaries of debugging symbols, and otherwise do things that are commonly needed or useful when installing a file, but that are tedious to get right. Variants of this script are still found in many systems (as are dozens of totally unrelated files named install.sh).

The portability problems of echo have been solved (or at least worked around) in several different ways, illustrating some of the previously described strategies. Many scripts test for common behaviors and define two variables (nearly always named $C and $N, or $ECHO_C and $ECHO_N), which allow commands like echo $N No newline:$C. This is moderately idiomatic, and most shell programmers will recognize it. The following variant of an example from Chapter 7 illustrates this:

case `echo -n "\c"` in
-n*c) ECHO_N='' ECHO_C='' ;;
*c) ECHO_N='-n' ECHO_C='' ;;
*) ECHO_N='' ECHO_C='\c' ;;
esac
echo $ECHO_N Testing...$ECHO_C
echo "Ok."



Testing...Ok.

The AC_PROG_LIBTOOL macro in configure.in scripts (implemented by the libtool.m4 macro file distributed with libtool) provides a particularly complete workaround for a much more insidious problem: working around echo implementations that interpret backslashes. This code is about 250 lines of fairly complicated shell scripting. It is too much to reproduce here, but this is a fairly typical sample:

if test "X$echo" = Xecho; then
    # We didn't find a better echo, so look for alternatives.
    if test "X`(print -r ' ') 2>/dev/null`" = 'X ' &&
       echo_testing_string=`(print -r "$echo_test_string") 2>/dev/null` &&
       test "X$echo_testing_string" = "X$echo_test_string"; then
      # This shell has a builtin print -r that does the trick.
      echo='print -r'
    elif (test -f /bin/ksh || test -f /bin/ksh$ac_exeext) &&
         test "X$CONFIG_SHELL" != X/bin/ksh; then
[...]

One of the cases used is to define a special argument, --fallback-echo, which causes the script to try to display its own arguments. The implementation is excellent:

if test "X$1" = X--fallback-echo; then
  # used as fallback echo
  shift
  cat <<EOF
$*
EOF
  exit 0
fi

This does not handle the case where you want to produce output without a newline, but it does eliminate the common problem of shells stripping backslashes.

Some users do something like any of the options previously discussed, but they create a shell function to wrap the desired behavior; this has been the solution I've generally encouraged (as in the original version of the $ECHO_N example, which was in Chapter 7). Some programmers simply rely on the printf command or on shell builtins like ksh's print. Any of these can provide consistent behavior, and some allow reliable production of unterminated lines. And, finally, a last option exists: Avoid starting the arguments to echo with a hyphen, avoid backslashes, and just accept the lack of a portable way to produce output with no newline at the end. This limits your output options, but it is completely portable.

A couple of counterexamples are also worth considering. A number of application installers for Linux systems have been pretty awful. The most obvious and recurring theme is the assumption that sh is always bash. A number of install scripts I have tried to use have choked badly because I usually have $BLOCKSIZE set to 1m in my environment; this resulted in naive install scripts declaring that a disk with 5GB free (5000 one-megabyte blocks) is not large enough to support an installation that requires 10MB because they interpreted 5000 blocks of reported free space as 2.5MB (5000 half-kilobyte blocks). Another common failure mode is assumptions about the stat utility, which I've seen several times in different installers.

What's Next?

This is just about it for the fiddly little details. What comes next is a bit of a higher-level perspective of portability. Chapter 9 talks about how to design scripts so that they will be easier to write portably and how to identify a good candidate for development as a shell script in the first place. Portability is important for getting reuse and value out of your code, but there are other things you should consider as well. A portable script that is only useful once does you little good; next up is the question of how to make a script you will want to reuse on other systems.
