Chapter 1. Introduction

A big program is made up of many small modules. These modules provide the functions, procedures, and data structures used in the program. Ideally, most of these modules are ready-made and come from libraries; only those that are specific to the application at hand need to be written from scratch. Assuming that library code has been tested thoroughly, only the application-specific code will contain bugs, and debugging can be confined to just that code.

Unfortunately, this theoretical ideal rarely occurs in practice. Most programs are written from scratch, and they use libraries only for the lowest level facilities, such as I/O and memory management. Programmers often write application-specific code for even these kinds of low-level components; it’s common, for example, to find applications in which the C library functions malloc and free have been replaced by custom memory-management functions.

There are undoubtedly many reasons for this situation; one of them is that widely available libraries of robust, well designed modules are rare. Some of the libraries that are available are mediocre and lack standards. The C library has been standardized since 1989, and is only now appearing on most platforms.

Another reason is size: Some libraries are so big that mastering them is a major undertaking. If this effort even appears to be close to the effort required to write the application, programmers may simply reimplement the parts of the library they need. User-interface libraries, which have proliferated recently, often exhibit this problem.

Library design and implementation are difficult. Designers must tread carefully between generality, simplicity, and efficiency. If the routines and data structures in a library are too general, they may be too hard to use or inefficient for their intended purposes. If they’re too simple, they run the risk of not satisfying the demands of applications that might use them. If they’re too confusing, programmers won’t use them. The C library itself provides a few examples; its realloc function, for instance, is a marvel of confusion.

Library implementors face similar hurdles. Even if the design is done well, a poor implementation will scare off users. If an implementation is too slow or too big — or just perceived to be so — programmers will design their own replacements. Worst of all, if an implementation has bugs, it shatters the ideal outlined above and renders the library useless.

This book describes the design and implementation of a library that is suitable for a wide range of applications written in the C programming language. The library exports a set of modules that provide functions and data structures for “programming-in-the-small.” These modules are suitable for use as “piece parts” in applications or application components that are a few thousand lines long.

Most of the facilities described in the subsequent chapters are those covered in undergraduate courses on data structures and algorithms. But here, more attention is paid to how they are packaged and to making them robust. Each module is presented as an interface and its implementation. This design methodology, explained in Chapter 2, separates module specifications from their implementations, promotes clarity and precision in those specifications, and helps provide robust implementations.

Literate Programs

This book describes modules not by prescription, but by example. Each chapter describes one or two interfaces and their implementations in full. These descriptions are presented as literate programs. The code for an interface and its implementation is intertwined with prose that explains it. More important, each chapter is the source code for the interfaces and implementations it describes. The code is extracted automatically from the source text for this book; what you see is what you get.

A literate program is composed of English prose and labeled chunks of program code. For example,

<compute x · y>≡
   sum = 0;
   for (i = 0; i < n; i++)
       sum += x[i]*y[i];

defines a chunk named <compute x · y>; its code computes the dot product of the arrays x and y. This chunk is used by referring to it in another chunk:

<function dotproduct>≡
   int dotProduct(int x[], int y[], int n) {
       int i, sum;

       <compute x · y>
       return sum;
   }

When the chunk <function dotproduct> is extracted from the file that holds this chapter, its code is copied verbatim, uses of chunks are replaced by their code, and so on. The result of extracting <function dotproduct> is a file that holds just the code:

   int dotProduct(int x[], int y[], int n) {
       int i, sum;

       sum = 0;
       for (i = 0; i < n; i++)
           sum += x[i]*y[i];
       return sum;
   }

A literate program can be presented in small pieces and documented thoroughly. English prose subsumes traditional program comments, and isn’t limited by the comment conventions of the programming language.

The chunk facility frees literate programs from the ordering constraints imposed by programming languages. The code can be revealed in whatever order is best for understanding it, not in the order dictated by rules that insist, for example, that definitions of program entities precede their uses.

The literate-programming system used in this book has a few more features that help describe programs piecemeal. To illustrate these features and to provide a complete example of a literate C program, the rest of this section describes double, a program that detects adjacent identical words in its input, such as “the the.” For example, the UNIX command

   % double intro.txt inter.txt
   intro.txt:10: the
   inter.txt:110: interface
   inter.txt:410: type
   inter.txt:611: if

shows that “the” occurs twice in the file intro.txt; the second occurrence appears on line 10; and double occurrences of “interface,” “type,” and “if” appear in inter.txt at the lines shown. If double is invoked with no arguments, it reads its standard input and omits the file names from its output. For example:

   % cat intro.txt inter.txt | double
   10: the
   143: interface
   343: type
   544: if

In these and other displays, commands typed by the user are shown in a slanted typewriter font, and the output is shown in a regular typewriter font.

Let’s start double by defining a root chunk that uses other chunks for each of the program’s components:

<double.c 4>≡
   <includes 5>
   <data 6>
   <prototypes 6>
   <functions 5>

By convention, the root chunk is labeled with the program’s file name; extracting the chunk <double.c 4> extracts the program. The other chunks are labeled with double’s top-level components. These components are listed in the order dictated by the C programming language, but they can be presented in any order.

The 4 in <double.c 4> is the page number on which the definition of the chunk begins. The numbers in the chunks used in <double.c 4> are the page numbers on which their definitions begin. These page numbers help readers navigate the code.

The main function handles double’s arguments. It opens each file and calls doubleword to scan the file:

<functions 5>≡
   int main(int argc, char *argv[]) {
       int i;

       for (i = 1; i < argc; i++) {
           FILE *fp = fopen(argv[i], "r");
           if (fp == NULL) {
               fprintf(stderr, "%s: can't open '%s' (%s)
",
                   argv[0], argv[i], strerror(errno));
               return EXIT_FAILURE;
           } else {
               doubleword(argv[i], fp);
               fclose(fp);
           }
       }
       if (argc == 1) doubleword(NULL, stdin);
       return EXIT_SUCCESS;
   }

<includes 5>≡
   #include <stdio.h>
   #include <stdlib.h>
   #include <errno.h>

The function doubleword needs to read words from a file. For the purposes of this program, a word is one or more nonspace characters, and case doesn’t matter. getword reads the next word from an opened file into buf[0..size-1] and returns one; it returns zero when it reaches the end of file.

<functions 5>+≡
   int getword(FILE *fp, char *buf, int size) {
       int c;

       c = getc(fp);
       <scan forward to a nonspace character or EOF 6>
       <copy the word into buf[0..size-1] 7>
       if (c != EOF)
           ungetc(c, fp);
       return <found a word? 7>;
   }

<prototypes 6>≡
   int getword(FILE *, char *, int);

This chunk illustrates another literate programming feature: The +≡ that follows the chunk labeled <functions 5> indicates that the code for getword is appended to the code for the chunk <functions 5>, so that chunk now holds the code for main and for getcode. This feature permits the code in a chunk to be doled out a little at a time. The page number in the label for a continued chunk refers to the first definition for the chunk, so it’s easy to find the beginning of a chunk’s definition.

Since getword follows main, the call to getword in main needs a prototype, which is the purpose of the <prototypes 6> chunk. This chunk is something of a concession to C’s declaration-before-use rule, but if it is defined consistently and appears before <functions 5> in the root chunk, then functions can be presented in any order.

In addition to plucking the next word from the input, getword increments linenum whenever it runs across a new-line character. doubleword uses linenum when it emits its output.

<data 6>≡
   int linenum;

<scan forward to a nonspace character or EOF 6>≡
   for ( ; c != EOF && isspace(c); c = getc(fp))
       if (c == '
')
           linenum++;

<includes 5>+≡
   #include <ctype.h>

The definition of linenum exemplifies chunks that are presented in an order different from what is required by C. linenum is given here, when it is first used, instead of at the top of the file or before the definition of getword, which is where C insists that it be defined.

The value of size is the limit on the length of words stored by getword, which discards the excess characters and folds uppercase letters to lowercase:

<copy the word into buf[0..size-1] 7>≡
   {
       int i = 0;
       for ( ; c != EOF && !isspace(c); c = getc(fp))
           if (i < size - 1)
               buf[i++] = tolower(c);
       if (i < size)
           buf[i] = '';
   }

The index i is compared to size - 1 to guarantee there’s room to store a null character at the end of the word. The if statement protecting this assignment handles the case when size is zero. This case won’t occur in double, but this kind of defensive programming helps catch “can’t happen” bugs.

All that remains is for getword to return one if buf holds a word, and zero otherwise:

<found a word? 7>≡
   buf[0] != ''

This definition shows that chunks don’t have to correspond to statements or to any other syntactic unit of C; they’re simply text.

doubleword reads each word, compares it with the previous word, and complains about duplicates. It looks only at words that begin with letters:

<functions 5>+≡
   void doubleword(char *name, FILE *fp) {
       char prev[128], word[128];

       linenum = 1;
       prev[0] = '';
       while (getword(fp, word, sizeof word)) {
           if (isalpha(word[0]) && strcmp(prev, word)==0)
               <word is a duplicate 8>
           strcpy(prev, word);
       }
   }

<prototypes 6>+≡
   void doubleword(char *, FILE *);

<includes 5>+≡
   #include <string.h>

Emitting the output is easy, but the file name and its trailing colon are printed only if name isn’t null:

<word is a duplicate 8>≡
   {
       if (name)
           printf("%s:", name);
       printf("%d: %s
", linenum, word);
   }

This chunk is defined as a compound statement so that it can appear as the consequent of the if statement in which it is used.

Programming Style

double illustrates many of the stylistic conventions used for the programs in this book. It is more important for programs to be read easily and understood by people than it is for them to be compiled easily by computers. The compiler doesn’t care about the names chosen for variables, how the code is laid out, or how the program is divided into modules. But these kinds of details can have enormous impact on how easily programmers can read and understand a program.

The code in this book follows established stylistic conventions for C programs. It uses consistent conventions for naming variables, types, and routines, and, to the extent permitted by the typographical constraints imposed by this book, a consistent indentation style. Stylistic conventions are not a rigid set of rules that must be followed at all costs; rather, they express a philosophical approach to programming that seeks to maximize readability and understanding. Thus, the “rules” are broken whenever varying the conventions helps to emphasize important facets of the code or makes complicated code more readable.

In general, longer, evocative names are used for global variables and routines, and short names, which may mirror common mathematical notation, are used for local variables. The loop index i in <compute x·y> is an example of the latter convention. Using longer names for indices and variables that are used for similarly traditional purposes usually makes the code harder to read; for example, in

   sum = 0;
   for (theindex = 0; theindex < numofElements; theindex++)
       sum += x[theindex]*y[theindex];

the variable names obscure what the code does.

Variables are declared near their first use, perhaps in chunks. The declaration of linenum near its first use in getword is an example. Locals are declared at the beginning of the compound statements in which they are used, when possible. An example is the declaration of i in <copy the word into buf[0..size-1] 7>.

In general, the names of procedures and functions are chosen to reflect what the procedures do and what the functions return. Thus getword returns the next word in the input and doubleword finds and announces words that occur two or more times. Most routines are short, no more than a page of code; chunks are even shorter, usually less than a dozen lines.

There are almost no comments in the code because the prose surrounding the chunks that comprise the code take their place. Stylistic advice on commenting conventions can evoke nearly religious wars among programmers. This book follows the lead of classics in C programming, in which comments are kept to a minimum. Code that is clear and that uses good naming and indentation conventions usually explains itself. Comments are called for only to explain, for example, the details of data structures, special cases in algorithms, and exceptional conditions. Compilers can’t check that comments and code agree; misleading comments are usually worse than no comments. Finally, some comments are just clutter; those in which the noise and excess typography drown out the content do nothing but smother the code.

Literate programming avoids many of the battles that occur in comment wars because it isn’t constrained by the comment mechanisms of the programming language. Programmers can use whatever typographical features are best for conveying their intentions, including tables, equations, pictures, and citations. Literate programming seems to encourage accuracy, precision, and clarity.

The code in this book is written in C; it uses most of the idioms commonly accepted — and expected — by experienced C programmers. Some of these idioms can confuse programmers new to C, but they must master them to become fluent in C. Idioms involving pointers are often the most confusing because C provides several unique and expressive operators for manipulating pointers. The library function strcpy, which copies one string to another and returns the destination string, illustrates the differences between “idiomatic C” and code written by newcomers to C; the latter kind of code often uses arrays:

   char *strcpy(char dst[], const char src[]) {
       int i;

       for (i = 0; src[i] != ''; i++)
           dst[i] = src[i];
       dst[i] = '';
       return dst;
   }

The idiomatic version uses pointers:

   char *strcpy(char *dst, const char *src) {
       char *s = dst;

       while (*dst++ = *src++)
           ;
       return s;
   }

Both versions are reasonable implementations of strcpy. The pointer version uses the common idiom that combines assignment, incrementing a pointer, and testing the result of the assignment into the single assignment expression. It also modifies its arguments, dst and src, which is acceptable in C because all arguments are passed by value — arguments are just initialized locals.

A good case can be made for preferring the array version to the pointer version. For example, the array version is easier for all programmers to understand, regardless of their fluency in C. But the pointer version is the one most experienced C programmers would write, and hence the one programmers are most likely to encounter when reading existing code. This book can help you learn these idioms, understand C’s strong points, and avoid common pitfalls.

Efficiency

Programmers seem obsessed with efficiency. They can spend hours tweaking code to make it run faster. Unfortunately, much of this effort is wasted. Programmers’ intuitions are notoriously bad at guessing where programs spend their time.

Tuning a program to make it faster almost always makes it bigger, more difficult to understand, and more likely to contain errors. There’s no point in such tuning unless measurements of execution time show that the program is too slow. A program needs only to be fast enough, not necessarily as fast as possible.

Tuning is often done in a vacuum. If a program is too slow, the only way to find its bottlenecks is to measure it. A program’s bottlenecks rarely occur where you expect them or for the reasons you suspect, and there’s no point in tuning programs in the wrong places. When you’ve found the right place, tuning is called for only if the time spent in that place is a significant amount of the running time. It’s pointless to save 1 percent in a search routine if I/O accounts for 60 percent of the program’s running time.

Tuning often introduces errors. The fastest program to a crash isn’t a winner. Reliability is more important than efficiency; delivering fast software that crashes is more expensive in the long run than delivering reliable software that’s fast enough.

Tuning is often done at the wrong level. Straightforward implementations of inherently fast algorithms are better than hand-tuned implementations of slow algorithms. For example, squeezing instructions out of the inner loop of a linear search is doomed to be less profitable than using a binary search in the first place.

Tuning can’t fix a bad design. If the program is slow everywhere, the inefficiency is probably built into the design. This unfortunate situation occurs when designs are drawn from poorly written or imprecise problem specifications, or when there’s no overall design at all.

Most of the code in this book uses efficient algorithms that have good average-case performance and whose worst-case performance is easy to characterize. Their execution times on typical inputs will almost always be fast enough for most applications. Those cases where performance might pose problems in some applications are clearly identified.

Some C programmers make heavy use of macros and conditional compilation in their quests for efficiency. This book avoids both whenever possible. Using macros to avoid function calls is rarely necessary. It pays only when objective measurements demonstrate that the costs of the calls in question overwhelm the running times of the rest of the code. I/O is one of the few places where macros are justified; the standard I/O functions getc, putc, getchar, and putchar, for example, are often implemented as macros.

Conditional compilation is often used to configure code for specific platforms or environments, or to enable or disable debugging code. These problems are real, but conditional compilation is usually the easy way out of them and always makes the code harder to read. And it’s often more useful to rework the code so that platform dependencies are selected during execution. For example, a single compiler that can select one of, say, six architectures for which to generate code at execution time — a cross compiler — is more useful than having to configure and build six different compilers, and it’s probably easier to maintain.

If an application must be configured at compile time, version-control tools are better at it than C’s conditional-compilation facilities. The code isn’t littered with preprocessor directives that make the code hard to read and obscure what’s being compiled and what isn’t. With version-control tools, what you see is what is executed. These tools are also ideal for keeping track of performance improvements.

Further Reading

The ANSI standard (1990) and the technically equivalent ISO standard (1990) are the definitive references for the standard C library, but Plauger (1992) gives a more detailed description and a complete implementation. Similarly, the standards are the last word on C, but Kernighan and Ritchie (1988) is probably the most widely used reference. The latest edition of Harbison and Steele (1995) is perhaps the most up-to-date with respect to the standards, and it also describes how to write “clean C” — C code that can be compiled with C++ compilers. Jaeschke (1991) condenses the essence of Standard C into a compact dictionary format, which is a useful reference for C programmers.

Software Tools by Kernighan and Plauger (1976) gives early examples of literate programs, although the authors used ad hoc tools to include code in the book. WEB is the one of the first tools designed explicitly for literate programming. Knuth (1992) describes WEB and some of its variants and uses; Sewell (1989) is a tutorial introduction to WEB. Simpler tools (Hanson 1987; Ramsey 1994) can go a long way to providing much of WEB’s essential functionality. This book uses notangle, one of the programs in Ramsey’s noweb system, to extract the chunks. noweb is also used by Fraser and Hanson (1995) to present an entire C compiler as a literate program. This compiler is also a cross compiler.

double is taken from Kernighan and Pike (1984) where it’s implemented in the AWK programming language (Aho, Kernighan, and Weinberger 1988). Despite its age, Kernighan and Pike remains one of the best books on the UNIX programming philosophy.

The best way to learn good programming style is to read programs that use good style. This book follows the enduring style used in Kernighan and Pike (1984) and Kernighan and Ritchie (1988). Kernighan and Plauger (1978) is the classic book on programming style, but it doesn’t include any examples in C. Ledgard’s brief book (1987) offers similar advice, and Maguire (1993) provides a perspective from the world of PC programming. Koenig (1989) exposes C’s dark corners and highlights the ones that should be avoided. McConnell (1993) offers sound advice on many aspects of program construction, and gives a balanced discussion of the pros and cons of using goto statements.

The best way to learn to write efficient code is to have a thorough grounding in algorithms and to read other code that is efficient. Sedgewick (1990) surveys all of the important algorithms most programmers need to know, and Knuth (1973a) gives the gory details on the fundamental ones. Bentley (1982) is 170 pages of good advice and common sense on how to write efficient code.

Exercises

1.1

getword increments linenum in <scan forward to a nonspace or EOF 6> but not after <copy the word into buf[0..size-1] 7> when a word ends at a new-line character. Explain why. What would happen if linenum were incremented in this case?

1.2

What does double print when it sees three or more identical words in its input? Change double to fix this “feature.”

1.3

Many experienced C programmers would include an explicit comparison in strcpy’s loop:

   char *strcpy(char *dst, const char *src) {
       char *s = dst;

       while ((*dst++ = *src++) != '')
           ;
       return s;
   }

The explicit comparison makes it clear that the assignment isn’t a typographical error. Some C compilers and related tools, like Gimpel Software’s PC-Lint and LCLint (Evans 1996), issue a warning when the result of an assignment is used as a conditional, because such usage is a common source of errors. If you have PC-Lint or LCLint, experiment with it on some “tested” programs.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.222.111.24