Chapter 15. Debugging

In this chapter

  • 15.1 First Things First page 568

  • 15.2 Compilation for Debugging page 569

  • 15.3 GDB Basics page 570

  • 15.4 Programming for Debugging page 577

  • 15.5 Debugging Tools page 605

  • 15.6 Software Testing page 632

  • 15.7 Debugging Rules page 633

  • 15.8 Suggested Reading page 637

  • 15.9 Summary page 638

  • Exercises page 639

There are many practices, starting with program logic and data design, through code breakdown and organization, and finally implementation, that can help minimize errors and problems. We encourage you to study these; find good books on software design and software engineering, and put their advice into practice! Any program over a few hundred lines in size should be carefully thought out and designed, instead of just hacked on until it appears to work.

However, since programmers are human, programming errors are unavoidable. Debugging is the process of tracking down and removing errors in programs. Even well-designed, well-implemented programs occasionally don’t work; when something’s going wrong and you can’t figure out why, it’s a good idea to point a debugger at the code, and watch it fail.

This chapter covers a range of topics, starting off with basic debugging advice and techniques (compiling for debugging and elementary use of GDB, the GNU debugger), moving on to a range of techniques for use during program development and debugging that make debugging easier, and then looking at a number of tools that help the debugging process. It then closes with a brief introduction to software testing, and a wonderful set of “debugging rules,” extracted from a book that we highly recommend.

Most of our advice is based on long-term experience as a volunteer for the GNU project, maintaining gawk (GNU awk). Most, if not all, of the specific examples we present come from that program.

Throughout the chapter, specific recommendations are marked Recommendation.

First Things First

When a program misbehaves, you may be at a loss as to what to do first. Often, strange behavior is due to misusing memory—using uninitialized values, reading or writing outside the bounds of dynamic memory, and so on. Therefore, you may get faster results by trying out a memory-debugging tool before you crank up a debugger.

The reason is that memory tools can point you directly at the failing line of code, whereas using a debugger is more like embarking on a search-and-destroy mission, in which you first have to isolate the problem and then fix it. Once you’re sure that memory problems aren’t the issue, you can proceed to using a debugger.

Because the debugger is a more general tool, we cover it first. We discuss a number of memory-debugging tools later in the chapter.

Compilation for Debugging

For a source code debugger to be used, the executable being debugged (the debuggee, if you will) must be compiled with the compiler’s -g option. This option causes the compiler to emit extra debugging symbols into the object code; that is, extra information giving the names and types of variables, constants, functions, and so on. The debugger then uses this information to match source code locations with the code being executed and to retrieve or store variable values in the running program.

On many Unix systems, the -g compiler option is mutually exclusive with the -O option, which turns on optimizations. This is because optimizations can cause rearrangement of bits and pieces of the object code, such that there is no longer a direct relationship between what’s being executed and a linear reading of the source code. By disabling optimizations, you make it much easier for the debugger to relate the object code to the source code, and in turn, single-stepping through a program’s execution works in the obvious way. (Single-stepping is described shortly.)

GCC, the GNU Compiler Collection, does allow -g and -O together. However, this introduces exactly the problem we wish to avoid when debugging: that following the execution in a debugger becomes considerably more difficult. The advantage of allowing the two together is that you can leave the debugging symbols in an optimized, for-production-use executable. They occupy only disk space, not memory. Then, an installed executable can still be debugged in an emergency.

In our experience, if you need to use a debugger, it’s better to recompile the application from scratch, with only the -g option. This makes tracing considerably easier; there’s enough detail to keep track of just going through the program as it’s written, without also having to worry about how the compiler rearranged the code.

There is one caveat: Be sure the program still misbehaves. Reproducibility is the key to debugging; if you can’t reproduce the problem, it’s much harder to track it down and fix it. Rarely, compiling a program without -O can cause it to stop failing.[1] Typically, the problem persists when compiled without -O, meaning there is indeed a logic bug of some kind, waiting to be discovered.

GDB Basics

A debugger is a program that allows you to control the execution of another program and examine and change the subordinate program’s state (such as variable values). There are two kinds of debuggers: machine-level debuggers, which work on the level of machine instructions, and source-level debuggers, which work in terms of the program’s source code. For example, in a machine-level debugger, to change a variable’s value, you specify its address in memory. In a source-level debugger, you just use the variable’s name.

Historically, V7 Unix had adb, which was a machine-level debugger. System III had sdb, which was a source-level debugger, and BSD Unix provided dbx, also a source-level debugger. (Both continued to provide adb.) dbx survives on some commercial Unix systems.

GDB, the GNU Debugger, is a source-level debugger. It has many more features, is more broadly portable, and is more usable than either sdb or dbx.[2]

Like its predecessors, GDB is a command-line debugger. It prints one line of source code at a time, prints a prompt, and reads one line of input containing a command to execute.

There are graphical debuggers; these provide a larger view of the source code and usually provide the ability to manipulate the program both through a command-line window and through GUI components such as buttons and menus. The ddd debugger[3] is one such; it is built on top of GDB, so if you learn GDB, you can make some use of ddd right away. (ddd has its own manual, which you should read if you’ll be using it heavily.) Another graphical debugger is Insight,[4] which uses Tcl/Tk to provide a graphical interface on top of GDB. (You should use a graphical debugger if one is available to you and you like it. Since our intent is to provide an introduction to debuggers and debugging, we’ve chosen to go with a simple interface that can be presented in print.)

GDB understands C and C++, including support for name demangling, which means that you can use the regular C++ source code names for class member functions and overloaded functions. In particular, GDB understands C expression syntax, which is useful when you wish to look at the value of complicated expressions, such as ’*ptr->x.a[1]->q’. It also understands Fortran 77, although you may have to append an underscore character to the Fortran variable and function names. GDB has partial support for Modula-2, and limited support for Pascal.

If you’re running GNU/Linux or a BSD system (and you installed the development tools), then you should have a recent version of GDB already installed and ready to use. If not, you can download the GDB source code from the GNU Project’s FTP site for GDB[5] and build it yourself.

GDB comes with its own manual, which is over 300 pages long. You can generate the printable version of the manual in the GDB source code directory and print it yourself. You can also buy printed and bound copies from the Free Software Foundation; your purchase helps the FSF and contributes directly to the production of more free software. (See the FSF web site[6] for ordering information.) This section describes the basics of GDB; we recommend reading the manual to learn how to take full advantage of GDB’s capabilities.

Running GDB

The basic usage is this:

gdb [ options ] [ executable [ core-file-name ]]

Here, executable is the executable program to be debugged. If provided, core-file-name is the name of a core file created when a program was killed by the operating system and dumped core. Under GNU/Linux, such files (by default) are named core.pid,[7] where pid is the process ID number of the running program that died. The pid extension means you can have multiple core dumps in the same directory, which is helpful, but also good for consuming disk space!

If you forget to name the files on the command line, you can use ’file executable’ to tell GDB the name of the executable file, and ’core-file core-file-name’ to tell GDB the name of the core file.

With a core dump, GDB indicates where the program died. The following program, ch15-abort.c, creates a few nested function calls and then purposely dies by abort() to create a core dump:

/* ch15-abort.c --- produce a core dump */

#include <stdio.h>
#include <stdlib.h>

/* recurse --- build up some function calls */

void recurse(void)
{
    static int i;

    if (++i == 3)
        abort();
    else
        recurse();
}

int main(int argc, char **argv)
{
    recurse();
}

Here’s a short GDB session with this program:

$ gcc -g ch15-abort.c -o ch15-abort                     Compile, no -O
$ ch15-abort                                            Run the program
Aborted (core dumped)                                   It dies miserably
$ gdb ch15-abort core.4124                              Start GDB on it
GNU gdb 5.3
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu"...
Core was generated by 'ch15-abort'.
Program terminated with signal 6, Aborted.
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0  0x42028cc1 in kill () from /lib/i686/libc.so.6
(gdb) where                                                Print stack trace
#0  0x42028cc1 in kill () from /lib/i686/libc.so.6
#1  0x42028ac8 in raise () from /lib/i686/libc.so.6
#2  0x4202a019 in abort () from /lib/i686/libc.so.6
#3  0x08048342 in recurse () at ch15-abort.c:13           <--- We need to examine here
#4  0x08048347 in recurse () at ch15-abort.c:15
#5  0x08048347 in recurse () at ch15-abort.c:15
#6  0x0804835f in main (argc=1, argv=0xbffff8f4) at ch15-abort.c:20
#7  0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6

The where command prints a stack trace, that is, a list of all the functions called, most recent first. Note that there are three invocations of the recurse() function. The command bt, for “back trace,” is an alias for where; it’s easier to type.

Each function invocation in the stack is referred to as a frame. This term comes from the compiler field, in which each function’s parameters, local variables, and return address, grouped on the stack, are referred to as a stack frame. The GDB frame command lets you examine a particular frame. In this case, we want frame 3. This is the most recent invocation of recurse(), which called abort():

(gdb) frame 3                                Move to frame 3
#3  0x08048342 in recurse () at ch15-abort.c:13
13               abort();                    GDB prints source location in frame
(gdb) list                                   Show several lines of source code
8       void recurse(void)
9       {
10          static int i;
11
12          if (++i == 3)
13              abort();
14          else
15              recurse();
16      }
17
(gdb)                                        Pressing ENTER repeats the last command
18      int main(int argc, char **argv)
19      {
20          recurse();
21      }
(gdb) quit                                  Leave the debugger (for now)

As demonstrated, pressing ENTER repeats the last command, in this case, list, to show source code lines. This is an easy way to step through the source code.

GDB uses the readline library for command-line editing, so you can use Emacs or vi commands (as you prefer) for recalling and editing previous lines. The Bash shell uses the same library, so if you’re familiar with command-line editing at the shell prompt, GDB’s works the same way. This feature saves considerable typing.

Setting Breakpoints, Single-Stepping, and Setting Watchpoints

Often, program failures produce a core dump. The first step is to use GDB on the core file to determine the routine in which the program died. If the original binary was not compiled for debugging (that is, no -g), all GDB can tell you is the function’s name, but no other details.

The next step is to recompile the program with debugging symbols and without optimization, and to verify that it still fails. Assuming it does, you’ll want to run the program under the control of the debugger and set a breakpoint in the failing routine.

A breakpoint is a point at which execution should break, or stop. You can set breakpoints by function name, source code line number, source code file and line number together, as well as in other ways.
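A brief sketch of the common forms (the function name, file, and line number here are illustrative):

```
(gdb) break do_print                 Break at entry to function do_print
(gdb) break 1504                     Break at line 1504 of the current source file
(gdb) break builtin.c:1504           Break at a specific file and line
(gdb) break do_print if tree == 0    Conditional breakpoint: stop only when the condition holds
(gdb) info breakpoints               List all breakpoints
(gdb) delete 1                       Delete breakpoint number 1
```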

After setting the breakpoint, you start the program running, using the run command, possibly followed by any command-line arguments to be passed on to the debuggee. (GDB conveniently remembers the arguments for you; if you wish to start the program over again from the beginning, all you need do is type the run command by itself, and GDB will start a fresh copy with the same arguments as before.) Here’s a short session using gawk:

$ gdb gawk                                               Start GDB on gawk
GNU gdb 5.3
...
(gdb) break do_print                                     Set breakpoint in do_print
Breakpoint 1 at 0x805a36a: file builtin.c, line 1504.
(gdb) run 'BEGIN { print "hello, world" }'               Start the program running
Starting program: /home/arnold/Gnu/gawk/gawk-3.1.3/gawk 'BEGIN { print
    "hello, world" }'

Breakpoint 1, do_print (tree=0x8095290) at builtin.c:1504
1504            struct redirect *rp = NULL;              Execution reaches breakpoint
(gdb) list                                               Show source code
1499
1500    void
1501    do_print(register NODE *tree)
1502    {
1503        register NODE **t;
1504        struct redirect *rp = NULL;
1505        register FILE *fp;
1506        int numnodes, i;
1507        NODE *save;
1508        NODE *tval;

Once the breakpoint is reached, you proceed through the program by single-stepping it. This means that GDB allows the program to execute one source code statement at a time. GDB prints the line it’s about to execute and then prints its prompt. To run the statement, use the next command:

(gdb) next                                             Run current statement (1504, above)
1510         fp = redirect_to_fp(tree->rnode, & rp);   GDB prints next statement
(gdb)                                                  Hit ENTER to run it, and go to next
1511         if (fp == NULL)
(gdb)                                                  ENTER again
1519         save = tree = tree->lnode;
(gdb)                                                  And again
1520         for (numnodes = 0; tree != NULL; tree = tree->rnode)

The step command is an alternative command for single-stepping. There is an important difference between next and step. next executes the next statement. If that statement contains a function call, the function is called and returns before GDB regains control of the running program.

On the other hand, when you use step on a statement with a function call, GDB descends into the called function, allowing you to continue single-stepping (or tracing) the program. If a statement doesn’t contain a function call, then step is the same as next.

Note

It’s easy to forget which command you’re using and keep pressing ENTER to run each subsequent statement. If you’re using step, you can accidentally enter a library function, such as strlen() or printf(), which you really don’t want to bother with. In such a case, you can use the command finish, which causes the program to run until the current function returns.

You can print memory contents by using the print command. GDB understands C expression syntax, which makes it easy and natural to examine structures pointed to by pointers:

(gdb) print *save                               Print the structure pointed to by save
$1 = {sub = {nodep = {l = {lptr = 0x8095250, param_name = 0x8095250 "pR	",
        ll = 134828624}, r = {rptr = 0x0, pptr = 0, preg = 0x0, hd = 0x0,
        av = 0x0, r_ent = 0}, x = {extra = 0x0, xl = 0, param_list = 0x0},
      name = 0x0, number = 1, reflags = 0}, val = {
      fltnum = 6.6614191194446594e-316, sp = 0x0, slen = 0, sref = 1,
      idx = 0}, hash = {next = 0x8095250, name = 0x0, length = 0, value = 0x0,
      ref = 1}}, type = Node_expression_list, flags = 1}

Finally, the cont (continue) command lets you continue the program’s execution. It will run until the next breakpoint or until it exits normally if it doesn’t hit any breakpoints. This example picks up where the previous one left off:

1520       for (numnodes = 0; tree != NULL; tree = tree->rnode)
(gdb) cont                                           Continue
Continuing.
hello, world

Program exited normally.                             Informative message from GDB
(gdb) quit                                           Leave the debugger

A watchpoint is like a breakpoint, but for data instead of executable code. You set a watchpoint on a variable (or field in a struct or union, or array element), and when it changes, GDB notifies you. GDB checks the value of the watchpoint as it single-steps the program, and stops when the value changes. For example, the do_lint_old variable in gawk is true when the --lint-old option was issued. This variable is set to true by getopt_long(). (We covered getopt_long() in Section 2.1.2, “GNU Long Options,” page 27.) In gawk’s main.c file:

int do_lint_old = FALSE;         /* warn about stuff not in V7 awk */
...
static const struct option optab[] = {
    ...
    { "lint-old", no_argument, & do_lint_old, 1 },
    ...
};

Here’s a sample session, showing a watchpoint in action:

$ gdb gawk                                     Start GDB on gawk
GNU gdb 5.3
...
(gdb) watch do_lint_old                        Set watchpoint on variable
Hardware watchpoint 1: do_lint_old
(gdb) run --lint-old 'BEGIN { print "hello, world" }'    Run the program
Starting program: /home/arnold/Gnu/gawk/gawk-3.1.4/gawk --lint-old
    'BEGIN { print "hello, world" }'
Hardware watchpoint 1: do_lint_old
Hardware watchpoint 1: do_lint_old
Hardware watchpoint 1: do_lint_old              Watchpoint checked as program runs
Hardware watchpoint 1: do_lint_old
Hardware watchpoint 1: do_lint_old

Old value = 0                                   Watchpoint stops the program
New value = 1
0x420c4219 in _getopt_internal () from /lib/i686/libc.so.6
(gdb) where                                   Stack trace
#0  0x420c4219 in _getopt_internal () from /lib/i686/libc.so.6
#1  0x420c4e83 in getopt_long () from /lib/i686/libc.so.6
#2  0x080683a1 in main (argc=3, argv=0xbffff8a4) at main.c:293
#3  0x420158d4 in __libc_start_main () from /lib/i686/libc.so.6
(gdb) quit                                    We're done for now
The program is running. Exit anyway? (y or n) Y Yes, really

GDB can do much more than we’ve shown here. Although the GDB manual is large, it is worthwhile to read it in its entirety at least once, to familiarize yourself with its commands and capabilities. After that, it’s probably sufficient to look at the NEWS file in each new GDB distribution to see what’s new or changed.

It’s also worth printing the GDB reference card, which comes in the file gdb/doc/refcard.tex within the GDB source distribution. You can create a printable PostScript version of the reference card, after extracting the source and running configure, by using these commands:

$ cd gdb/doc                                           Change to doc subdirectory
$ make refcard.ps                                      Format the reference card

The reference card is meant to be printed dual-sided, on 8.5 x 11 inch (“letter”) paper, in landscape format. It provides a six-column summary of the most useful GDB commands. We recommend printing it and having it by your keyboard as you work with GDB.

Programming for Debugging

There are many techniques for making source code easier to debug, ranging from simple to involved. We look at a number of them in this section.

Compile-Time Debugging Code

Several techniques relate to the source code itself.

Use Debugging Macros

Perhaps the simplest compile-time technique is the use of the preprocessor to provide conditionally compiled code. For example:

#ifdef DEBUG
     fprintf(stderr, "myvar = %d\n", myvar);
     fflush(stderr);
#endif /* DEBUG */

Adding -DDEBUG to the compiler command line causes the call to fprintf() to execute when the program runs.

Recommendation: Send debug messages to stderr so that they aren’t lost down a pipeline and so that they can be captured with an I/O redirection. Be sure to use fflush() so that messages are forced to the output as soon as possible.

Note

The symbol DEBUG, while obvious, is also highly overused. It’s a better idea to use a symbol specific to your program, such as MYAPPDEBUG. You can even use different symbols for debugging code in different parts of your program, such as file I/O, data verification, memory management, and so on.

Scattering lots of #ifdef statements throughout your code quickly becomes painful. And too many #ifdefs obscure the main program logic. There’s got to be a better way, and indeed, a technique that’s often used is to conditionally define special macros for printing:

/* TECHNIQUE 1 --- commonly used but not recommended, see text */
/* In application header file: */
#ifdef MYAPPDEBUG
#define DPRINT0(msg)                 fprintf(stderr, msg)
#define DPRINT1(msg, v1)             fprintf(stderr, msg, v1)
#define DPRINT2(msg, v1, v2)         fprintf(stderr, msg, v1, v2)
#define DPRINT3(msg, v1, v2, v3)     fprintf(stderr, msg, v1, v2, v3)
#else /* ! MYAPPDEBUG */
#define DPRINT0(msg)
#define DPRINT1(msg, v1)
#define DPRINT2(msg, v1, v2)
#define DPRINT3(msg, v1, v2, v3)
#endif /* ! MYAPPDEBUG */

/* In application source file: */
DPRINT1("myvar = %d\n", myvar);
...
DPRINT2("v1 = %d, v2 = %f\n", v1, v2);

There are multiple macros, one for each different number of arguments, up to whatever limit you wish to provide. When MYAPPDEBUG is defined, the calls to the DPRINTx() macros expand into calls to fprintf(). When MYAPPDEBUG isn’t defined, then those same calls expand to nothing. (This is essentially how assert() works; we described assert() in Section 12.1, “Assertion Statements: assert(),” page 428.)

This technique works; we have used it ourselves and seen it recommended in text-books. However, it can be refined a bit further, reducing the number of macros down to one:

/* TECHNIQUE 2 --- most portable; recommended */
/* In application header file: */
#ifdef MYAPPDEBUG
#define DPRINT(stuff) fprintf stuff
#else
#define DPRINT(stuff)
#endif

/* In application source file: */
DPRINT((stderr, "myvar = %d\n", myvar));           Note the double parentheses

Note how the macro is invoked, with two sets of parentheses! By making the entire argument list for fprintf() into a single argument, you no longer need to have an arbitrary number of debugging macros.

If you are using a compiler that conforms to the 1999 C standard, you have an additional choice, which produces the cleanest-looking debugging code:

/* TECHNIQUE 3 --- cleanest, but C99 only */
/* In application header file: */
#ifdef MYAPPDEBUG
#define DPRINT(mesg, ...)    fprintf(stderr, mesg, __VA_ARGS__)
#else
#define DPRINT(mesg, ...)
#endif

/* In application source file: */
DPRINT("myvar = %d\n", myvar);
...
DPRINT("v1 = %d, v2 = %f\n", v1, v2);

The 1999 C standard provides variadic macros; that is, macros that can accept a variable number of arguments. (This is similar to variadic functions, like printf().) In the macro definition, the three periods, ’...’, indicate that there will be zero or more arguments. In the macro body, the special identifier __VA_ARGS__ is replaced with the provided arguments, however many there are.

The advantage to this mechanism is that only one set of parentheses is necessary when the debugging macro is invoked, making the code read much more naturally. It also preserves the ability to have just one macro name, instead of multiple names that vary according to the number of arguments. The disadvantage is that C99 compilers are not yet widely available, reducing the portability of this construct. (However, this situation will improve with time.)

Recommendation: Current versions of GCC do support C99 variadic macros. Thus, if you know that you will never be using anything but GCC (or some other C99 compiler) to compile your program, you can use the C99 mechanism. However, as of this writing, C99 compilers are still not commonplace. So, if your code has to be compiled by different compilers, you should use the double-parentheses-style macro.

Avoid Expression Macros If Possible

In general, C preprocessor macros are a rather sharp, two-edged sword. They provide you with great power, but they also provide a great opportunity to injure yourself.[8]

For efficiency or code clarity, it’s common to see macros such as this:

#define RS_is_null    (RS_node->var_value == Nnull_string)
...
if (RS_is_null || today == TUESDAY) ...

At first glance, this looks fine. The condition ’RS_is_null’ is clear and easy to understand, and abstracts the details inherent in the test. The problem comes when you try to print the value in GDB:

(gdb) print RS_is_null
No symbol "RS_is_null" in current context.

In such a case, you have to track down the definition of the macro and print the expanded value.

Recommendation: Use variables to represent important conditions in your program, with explicit code to change the variable values when the conditions change.

Here is an abbreviated example, from io.c in the gawk distribution:

void set_RS()
{
    ...
    RS_is_null = FALSE;
    ...
    if (RS->stlen == 0) {
        RS_is_null = TRUE;
        matchrec = rsnullscan;
    }
    ...
}

Once RS_is_null is set and maintained, it can be tested by code and printed from within a debugger.

Note

Beginning with GCC 3.1 and version 5 of GDB, if you compile your program with the options -gdwarf-2 and -g3, you can use macros from within GDB. The GDB manual states that the GDB developers hope to eventually find a more compact representation for macros, and that the -g3 option will be subsumed into -g.

However, only the combination of GCC, GDB, and the special options allows you to use macros this way: If you’re not using GCC (or if you’re using an older version), you still have the problem. We stand by our recommendation to avoid such macros if you can.

The problem with macros extends to code fragments as well. If a macro defines multiple statements, you can’t set a breakpoint inside the middle of the macro. This is also true of C99 and C++ inline functions: If the compiler substitutes the body of an inline function into the generated code, it is again difficult or impossible to set a breakpoint inside it. This ties in with our advice to compile with -g alone; in this case, compilers usually don’t do function inlining.

Along similar lines, it’s common to have a variable that represents a particular state. It’s easy, and encouraged by many C programming books, to #define symbolic constants for these states. For example:

/* The various states to be in when scanning for the end of a record. */
#define NOSTATE   1     /* scanning not started yet (all) */
#define INLEADER  2     /* skipping leading data (RS = "") */
#define INDATA    3     /* in body of record (all) */
#define INTERM    4     /* scanning terminator (RS = "", RS = regexp) */
int state;
...
state = NOSTATE;
...
state = INLEADER;
...
if (state != INTERM) ...

At the source code level, this looks great. But again, there is a problem when you look at the code from within GDB:

(gdb) print state
$1 = 2

Here too, you’re forced to go back and look at the header file to figure out what the 2 means. So, what’s the alternative?

Recommendation: Use enums instead of macros to define symbolic constants. The source code usage is the same, and the debugger can print the enums’ values too.

An example, also from io.c in gawk:

typedef enum scanstate {
    NOSTATE,    /* scanning not started yet (all) */
    INLEADER,   /* skipping leading data (RS = "") */
    INDATA,     /* in body of record (all) */
    INTERM      /* scanning terminator (RS = "", RS = regexp) */
} SCANSTATE;
SCANSTATE state;
...rest of code remains unchanged!...

Now, when looking at state from within GDB, we see something useful:

(gdb) print state
$1 = NOSTATE

Reorder Code If Necessary

It’s not uncommon to have a condition in an if or while consist of multiple component tests, separated by && or ||. If these tests are function calls (or even if they’re not), it’s impossible to single-step each separate part of the condition. GDB’s step and next commands work on the basis of statements, not expressions. (Splitting such things across lines doesn’t help, either.)

Recommendation: Rewrite the original code, using explicit temporary variables that store return values or conditional results so that you can examine them in a debugger. The original code should be maintained in a comment so that you (or some later programmer) can tell what’s going on.

Here’s a concrete example: the function do_input() from gawk’s file io.c:

 1   /* do_input --- the main input processing loop */
 2
 3   void
 4   do_input()
 5   {
 6       IOBUF *iop;
 7       extern int exiting;
 8       int rval1, rval2, rval3;
 9
10       (void) setjmp(filebuf); /* for 'nextfile' */
11
12       while ((iop = nextfile(FALSE)) != NULL) {
13           /*
14            * This was:
15           if (inrec(iop) == 0)
16               while (interpret(expression_value) && inrec(iop) == 0)
17                   continue;
18            * Now expand it out for ease of debugging.
19            */
20           rval1 = inrec(iop);
21           if (rval1 == 0) {
22               for (;;) {
23                   rval2 = rval3 = -1; /* for debugging */
24                   rval2 = interpret(expression_value);
25                   if (rval2 != 0)
26                       rval3 = inrec(iop);
27                   if (rval2 == 0 || rval3 != 0)
28                       break;
29               }
30           }
31           if (exiting)
32               break;
33       }
34   }

(The line numbers are relative to the start of the routine, not the file.) This function is the heart of gawk’s main processing loop. The outer loop (lines 12 and 33) steps through the command-line data files. The comment on lines 13–19 shows the original code, which reads each record from the current file and processes it.

A return value of 0 from inrec() indicates an OK status, while for interpret() a nonzero return value indicates an OK status. When we tried to step through this loop to verify the record-reading process, it became necessary to execute each part of the condition individually.

Lines 20–30 are the rewritten code, which calls each function separately, storing the return values in local variables so that they can be printed from the debugger. Note how line 23 forces these variables to have known, invalid values each time around the loop: Otherwise, they would retain their values from previous loop iterations. Line 27 is the exit test; because the code has changed to an infinite loop (compare line 22 to line 16), the test for breaking out of the loop is the opposite of the original test.

As an aside, we admit to having had to study the rewrite carefully when we made it, to make sure it did exactly the same as the original code; it did. It occurs to us now that perhaps this version of the loop might be closer to the original:

/* Possible replacement for lines 22 - 29 */
do {
    rval2 = rval3 = -1; /* for debugging */
    rval2 = interpret(expression_value);
    if (rval2 != 0)
        rval3 = inrec(iop);
} while (rval2 != 0 && rval3 == 0);

The truth is, both versions are harder to read than the original and thus potentially in error. However, since the current code works, we decided to leave well enough alone.

Finally, we note that not all expert programmers would agree with our advice here. When each component of a condition is a function call, you can set a breakpoint on each one, use step to step into each function, and then use finish to complete the function. GDB will tell you the function’s return value, and from that point you can use cont or step to continue. We like our approach because the results are kept in variables, which can be checked (and rechecked) after the function calls, and even a few statements later.

Use Debugging Helper Functions

A common technique, applicable in many cases, is to have a set of flag values; when a flag is set (that is, true), a certain fact is true or a certain condition applies. This is commonly done with #defined symbolic constants and the C bitwise operators. (We discussed the use of bit flags and the bit manipulation operators in the sidebar in Section 8.3.1, “POSIX Style: statvfs() and fstatvfs(),” page 244.)

For example, gawk’s central data structure is called a NODE. It has a large number of fields, the last of which is a set of flag values. From the file awk.h:

typedef struct exp_node {
    ...                             Lots of stuff omitted
    unsigned short flags;
#       define  MALLOC  1       /* can be free'd */
#       define  TEMP    2       /* should be free'd */
#       define  PERM    4       /* can't be free'd */
#       define  STRING  8       /* assigned as string */
#       define  STRCUR  16      /* string value is current */
#       define  NUMCUR  32      /* numeric value is current */
#       define  NUMBER  64      /* assigned as number */
#       define  MAYBE_NUM 128   /* user input: if NUMERIC then
                                 * a NUMBER */
#       define  ARRAYMAXED 256  /* array is at max size */
#       define  FUNC    512     /* this parameter is really a
                                  * function name; see awkgram.y */
#       define  FIELD   1024    /* this is a field */
#       define  INTLSTR 2048    /* use localized version */
} NODE;

The reason to use flag values is that they provide considerable savings in data space. If the NODE structure used a separate char field for each flag, that would use 12 bytes instead of the 2 used by the unsigned short. The current size of a NODE (on an Intel x86) is 32 bytes. Adding 10 more bytes would bump that to 42 bytes. Since gawk can allocate potentially hundreds of thousands (or even millions) of NODEs,[9] keeping the size down is important.

What does this have to do with debugging? Didn’t we just recommend using enums for symbolic constants? Well, in the case of OR’d values enums are no help, since they’re no longer individually recognizable!

Recommendation: Provide a function to convert flags to a string. If you have multiple independent flags, set up a general-purpose routine.

Note

What’s unusual about these debugging functions is that application code never calls them. They exist only so that they can be called from a debugger. Such functions should always be compiled in, without even a surrounding #ifdef, so that you can use them without having to take special steps. The (usually minimal) extra code size is justified by the developer’s time savings.

First we’ll show you how we did this initially. Here is (an abbreviated version of) flags2str() from an earlier version of gawk (3.0.6):

 1   /* flags2str --- make a flags value readable */
 2
 3   char *
 4   flags2str(flagval)
 5   int flagval;
 6   {
 7       static char buffer[BUFSIZ];
 8       char *sp;
 9
10       sp = buffer;
11
12       if (flagval & MALLOC) {
13           strcpy(sp, "MALLOC");
14           sp += strlen(sp);
15       }
16       if (flagval & TEMP) {
17           if (sp != buffer)
18               *sp++ = '|';
19           strcpy(sp, "TEMP");
20           sp += strlen(sp);
21       }
22       if (flagval & PERM) {
23           if (sp != buffer)
24               *sp++ = '|';
25           strcpy(sp, "PERM");
26           sp += strlen(sp);
27       }
         ...much more of the same, omitted for brevity...
82
83       return buffer;
84   }

(The line numbers are relative to the start of the function.) The result is a string, something like "MALLOC|PERM|NUMBER". Each flag is tested separately, and if present, each one’s action is the same: test if not at the beginning of the buffer so we can add the ’|’ character, copy the string into place, and update the pointer. Similar functions existed for formatting and displaying the other kinds of flags in the program.

The code is both repetitive and error prone, and for gawk 3.1 we were able to simplify and generalize it. Here’s how gawk now does it. Starting with this definition in awk.h:

/* for debugging purposes */
struct flagtab {
    int val;                       Integer flag value
    const char *name;              String name
};

This structure can be used to represent any set of flags with their corresponding string values. Each different group of flags has a corresponding function that returns a printable representation of the flags that are currently set. From eval.c:

/* flags2str --- make a flags value readable */

const char *
flags2str(int flagval)
{
    static const struct flagtab values[] = {
        { MALLOC, "MALLOC" },
        { TEMP, "TEMP" },
        { PERM, "PERM" },
        { STRING, "STRING" },
        { STRCUR, "STRCUR" },
        { NUMCUR, "NUMCUR" },
        { NUMBER, "NUMBER" },
        { MAYBE_NUM, "MAYBE_NUM" },
        { ARRAYMAXED, "ARRAYMAXED" },
        { FUNC, "FUNC" },
        { FIELD, "FIELD" },
        { INTLSTR, "INTLSTR" },
        { 0,   NULL },
    };

    return genflags2str(flagval, values);
}

flags2str() defines an array that maps flag values to strings. By convention, a 0 flag value indicates the end of the array. The code calls genflags2str() (“general flags to string”) to do the work; it is a general-purpose routine that converts a flag value into a string. From eval.c:

 1   /* genflags2str --- general routine to convert a flag value to a string */
 2
 3   const char *
 4   genflags2str(int flagval, const struct flagtab *tab)
 5   {
 6       static char buffer[BUFSIZ];
 7       char *sp;
 8       int i, space_left, space_needed;
 9
10       sp = buffer;
11       space_left = BUFSIZ;
12       for (i = 0; tab[i].name != NULL; i++) {
13           if ((flagval & tab[i].val) != 0) {
14               /*
15                * note the trick, we want 1 or 0 for whether we need
16                * the '|' character.
17                */
18               space_needed = (strlen(tab[i].name) + (sp != buffer));
19               if (space_left < space_needed)
20                   fatal(_("buffer overflow in genflags2str"));
21
22               if (sp != buffer) {
23                   *sp++ = '|';
24                   space_left--;
25               }
26               strcpy(sp, tab[i].name);
27               /* note ordering! */
28               space_left -= strlen(sp);
29               sp += strlen(sp);
30           }
31       }
32
33       return buffer;
34   }

(Line numbers are relative to the start of the function, not the file.) As with the previous version, the idea here is to fill in a static buffer with a string value such as "MALLOC|PERM|STRING|MAYBE_NUM" and return the address of that buffer. We discuss the reasons for using a static buffer shortly; first let’s examine the code.

The sp pointer tracks the position of the next empty spot in the buffer, while space_left tracks how much room is left; this keeps us from overflowing the buffer.

The bulk of the function is a loop (line 12) through the array of flag values. When a flag is found (line 13), the code computes how much space is needed for the string (line 18) and tests to see if that much room is left (lines 19–20).

The test ’sp != buffer’ evaluates to 0 for the first flag value found and to 1 for all subsequent flags. This tells us whether we need the ’|’ separator character between values. By adding the result (1 or 0) to the length of the string, we get the correct value for space_needed. The same test, for the same reason, is used on line 22 to control lines 23 and 24, which insert the ’|’ character.

Finally, lines 26–29 copy in the string value, adjust the amount of space left, and update the sp pointer. Line 33 returns the address of the buffer, which contains the printable representation of the string.
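The trick on line 18 relies on the fact that a C comparison always evaluates to exactly 1 or 0, so it can be added directly into an arithmetic expression. Here is a minimal sketch of that computation in isolation (the function name is ours, not gawk’s):

```c
#include <string.h>

/* space_needed_for --- illustrate adding a comparison result (1 or 0)
 * directly into a length computation, as genflags2str() does. */
size_t space_needed_for(const char *name, const char *sp, const char *buffer)
{
    /* strlen(name) bytes for the flag name, plus one more for the
     * '|' separator unless sp is still at the start of the buffer. */
    return strlen(name) + (sp != buffer);
}
```
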

Now, what about that static buffer? Normally, good programming practice discourages the use of functions that return the address of static buffers: It’s easy to have multiple calls to such a function overwrite the buffer each time, forcing the caller to copy the returned data.

Furthermore, a static buffer is by definition a buffer of fixed size. What happened to the GNU “no arbitrary limits” principle?

The answer to both of these questions is to remember that this is a debugging function. Normal code never calls genflags2str(); it’s only called by a human using a debugger. No caller holds a pointer to the buffer; as a developer doing debugging, we don’t care that the buffer gets overwritten each time we call the function.

In practice, the fixed size isn’t an issue either; we know that BUFSIZ is big enough to represent all the flags that we use. Nevertheless, being experienced and knowing that things can change, genflags2str() has code to protect itself from overrunning the buffer. (The space_left variable and the code on lines 18–20.)

As an aside, the use of BUFSIZ is arguable. That constant should be used exclusively for I/O buffers, but it is often used for general string buffers as well. Such code would be better off defining explicit constants, such as FLAGVALSIZE, and using ’sizeof(buffer)’ on line 11.
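Such a change might look like the following sketch; FLAGVALSIZE is a hypothetical constant of our own invention, not something gawk defines:

```c
#include <stddef.h>

#define FLAGVALSIZE 512    /* hypothetical: sized for flag names, not I/O */

/* flag_buffer_capacity --- using sizeof(buffer) instead of repeating the
 * constant means the overflow check always tracks the true array size. */
size_t flag_buffer_capacity(void)
{
    static char buffer[FLAGVALSIZE];

    buffer[0] = '\0';          /* buffer would hold the flag string */
    return sizeof(buffer);     /* stays correct if FLAGVALSIZE changes */
}
```
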

Here is an abbreviated GDB session showing flags2str() in use:

$ gdb gawk                                                  Start GDB on gawk
GNU gdb 5.3
...
(gdb) break do_print                                        Set a breakpoint
Breakpoint 1 at 0x805a584: file builtin.c, line 1547.
(gdb) run 'BEGIN { print "hello, world" }'                  Start it running
Starting program: /home/arnold/Gnu/gawk/gawk-3.1.4/gawk 'BEGIN { print
    "hello, world" }'

Breakpoint 1, do_print (tree=0x80955b8) at builtin.c:1547   Breakpoint hit
1547             struct redirect *rp = NULL;
(gdb) print *tree                                           Print NODE
$1 = {sub = {nodep = {l = {lptr = 0x8095598, param_name = 0x8095598 "xU	",
        ll = 134829464}, r = {rptr = 0x0, pptr = 0, preg = 0x0, hd = 0x0,
        av = 0x0, r_ent = 0}, x = {extra = 0x0, xl = 0, param_list = 0x0},
      name = 0x0, number = 1, reflags = 0}, val = {
      fltnum = 6.6614606209589101e-316, sp = 0x0, slen = 0, sref = 1,
      idx = 0}, hash = {next = 0x8095598, name = 0x0, length = 0, value = 0x0,
      ref = 1}}, type = Node_K_print, flags = 1}
(gdb) print flags2str(tree->flags)                          Print flag value
$2 = 0x80918a0 "MALLOC"
(gdb) next                                                  Keep going
1553             fp = redirect_to_fp(tree->rnode, & rp);
...
1588                 efwrite(t[i]->stptr, sizeof(char), t[i]->stlen, fp,
                             "print", rp, FALSE);
(gdb) print *t[i]                                           Print NODE again
$4 = {sub = {nodep = {l = {lptr = 0x8095598, param_name = 0x8095598 "xU	",
        ll = 134829464}, r = {rptr = 0x0, pptr = 0, preg = 0x0, hd = 0x0,
        av = 0x0, r_ent = 0}, x = {extra = 0x8095ad8, xl = 134830808,
        param_list = 0x8095ad8}, name = 0xc <Address 0xc out of bounds>,
      number = 1, reflags = 4294967295}, val = {
      fltnum = 6.6614606209589101e-316, sp = 0x8095ad8 "hello, world",
      slen = 12, sref = 1, idx = -1}, hash = {next = 0x8095598, name = 0x0,
      length = 134830808, value = 0xc, ref = 1}}, type = Node_val, flags = 29}
(gdb) print flags2str(t[i]->flags)                           Print flag value
$5 = 0x80918a0 "MALLOC|PERM|STRING|STRCUR"

We hope you’ll agree that the current general-purpose mechanism is considerably more elegant than the original one, and easier to use.

Recommendation: Careful design and use of arrays of structs can often replace or consolidate repetitive code.

Avoid Unions When Possible

         “There’s no such thing as a free lunch.”                      —Lazarus Long

The C union is a relatively esoteric facility. It allows you to save memory by storing different items within the same physical space; how the program treats it depends on how it’s accessed:

/* ch15-union.c --- brief demo of union usage. */

#include <stdio.h>
#include <stdlib.h>    /* for exit() */

int main(void)
{
    union i_f {
        int i;
        float f;
    } u;

    u.f = 12.34;    /* Assign a floating point value */
    printf("%f also looks like %#x\n", u.f, u.i);
    exit(0);
}

Here is what happens when the program is run on an Intel x86 GNU/Linux system:

$ ch15-union
12.340000 also looks like 0x414570a4

The program prints the bit pattern that represents a floating-point number as a hexadecimal integer. The storage for the two fields occupies the same memory; the difference is in how the memory is treated: u.f acts like a floating-point number, whereas the same bits in u.i act like an integer.

Unions are particularly useful in compilers and interpreters, which often create a tree structure representing the structure of a source code file (called a parse tree). This models the way programming languages are formally described: if statements, while statements, assignment statements, and so on are all instances of the more generic “statement” type. Thus, a compiler might have something like this:

struct if_stmt { ... };                Structure for IF statement
struct while_stmt { ... };             Structure for WHILE statement
struct for_stmt { ... };               Structure for FOR statement
... structures for other statement types ...

typedef enum stmt_type {
    IF, WHILE, FOR, ...
} TYPE;                                What we actually have

/* This contains the type and unions of the individual kinds of statements. */
struct statement {
    TYPE type;
    union stmt {
        struct if_stmt if_st;
        struct while_stmt while_st;
        struct for_stmt for_st;
        ...
    } u;
};

Along with the union, it is conventional to use macros to make the components of the union look like they were fields in a struct. For example:

#define if_s      u.if_st                   So can use s->if_s instead of s->u.if_st
#define while_s   u.while_st                And so on...
#define for_s     u.for_st
...

At the level just presented, this seems reasonable and looks manageable. The real world, however, is a more complicated place, and practical compilers and interpreters often have several levels of nested structs and unions. This includes gawk, in which the definition of the NODE, its flag values, and macros for accessing union components takes over 120 lines![10] Here is enough of that definition to give you a feel for what’s happening:

typedef struct exp_node {
    union {
        struct {
            union {
                struct exp_node *lptr;
                char *param_name;
                long ll;
            } l;
            union {
                ...
            } r;
            union {
                ...
            } x;
            char *name;
            short number;
            unsigned long reflags;
            ...
        } nodep;
        struct {
            AWKNUM fltnum;
            char *sp;
            size_t slen;
            long sref;
            int idx;
        } val;
        struct {
            struct exp_node *next;
            char *name;
            size_t length;
            struct exp_node *value;
            long ref;
        } hash;
#define hnext    sub.hash.next
#define hname    sub.hash.name
#define hlength  sub.hash.length
#define hvalue   sub.hash.value
...
    } sub;
    NODETYPE type;
    unsigned short flags;
...
} NODE;

#define vname sub.nodep.name
#define exec_count sub.nodep.reflags

#define lnode   sub.nodep.l.lptr
#define nextp   sub.nodep.l.lptr
#define source_file sub.nodep.name
#define source_line sub.nodep.number
#define param_cnt   sub.nodep.number
#define param   sub.nodep.l.param_name

#define stptr   sub.val.sp
#define stlen   sub.val.slen
#define stref   sub.val.sref
#define stfmt   sub.val.idx

#define var_value lnode
...

The NODE has a union inside a struct inside a union inside a struct! (Ouch.) On top of that, multiple macro “fields” map to the same struct/union components, depending on what is actually stored in the NODE! (Ouch, again.)

The benefit of this complexity is that the C code is relatively clear. Something like ’NF_node->var_value->slen’ is straightforward to read.

There is, of course, a price to pay for the flexibility that unions provide. When your debugger is deep down in the guts of your code, you can’t use the nice macros that appear in the source. You must use the real expansion.[11] (And for that, you have to find the definition in the header file.)

For example, compare ’NF_node->var_value->slen’ to what it expands to: ’NF_node->sub.nodep.l.lptr->sub.val.slen’! You must type the latter into GDB to look at your data value. Look again at this excerpt from the earlier GDB debugging session:

(gdb) print *tree                                           Print NODE
$1 = {sub = {nodep = {l = {lptr = 0x8095598, param_name = 0x8095598 "xU	",
        ll = 134829464}, r = {rptr = 0x0, pptr = 0, preg = 0x0, hd = 0x0,
        av = 0x0, r_ent = 0}, x = {extra = 0x0, xl = 0, param_list = 0x0},
      name = 0x0, number = 1, reflags = 0}, val = {
      fltnum = 6.6614606209589101e-316, sp = 0x0, slen = 0, sref = 1,
      idx = 0}, hash = {next = 0x8095598, name = 0x0, length = 0, value = 0x0,
      ref = 1}}, type = Node_K_print, flags = 1}

That’s a lot of goop. However, GDB does make this a little easier to handle. You can use expressions like ’($1).sub.val.slen’ to step through the tree and list data structures.

There are other reasons to avoid unions. First of all, unions are unchecked. Nothing but programmer attention ensures that when you access one part of a union, you are accessing the same part that was last stored. We saw this in ch15-union.c, which accessed both of the union’s “identities” simultaneously.

A second reason, related to the first, is to be careful of overlays in complicated nested struct/union combinations. For example, an earlier version of gawk[12] had this code:

/* n->lnode overlays the array size, don't unref it if array */
if (n->type != Node_var_array && n->type != Node_array_ref)
    unref(n->lnode);

Originally, there was no if, just a call to unref(), which frees the NODE pointed to by n->lnode. However, it was possible to crash gawk at this point. You can imagine how long it took, in a debugger, to track down the fact that what was being treated as a pointer was in reality an array size!

As an aside, unions are considerably less useful in C++. Inheritance and object-oriented features make data structure management a different ball game, one that is considerably safer.

Recommendation: Avoid unions if possible. If not, design and code them carefully!
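When a union is unavoidable, one defensive option (a general technique, not what gawk does) is a “tagged union”: store a type code alongside the union and check it on every access, so a mismatched access fails immediately instead of silently misreading memory. A sketch:

```c
#include <assert.h>

typedef enum { AS_INT, AS_FLOAT } val_type;

typedef struct {
    val_type type;      /* which union member is currently valid */
    union {
        int i;
        float f;
    } u;
} value;

void set_int(value *v, int i)     { v->type = AS_INT;   v->u.i = i; }
void set_float(value *v, float f) { v->type = AS_FLOAT; v->u.f = f; }

int get_int(const value *v)
{
    assert(v->type == AS_INT);    /* catch mismatched access immediately */
    return v->u.i;
}

float get_float(const value *v)
{
    assert(v->type == AS_FLOAT);
    return v->u.f;
}
```
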

Runtime Debugging Code

Besides things you add to your code at compile time, you can also add extra code to enable debugging features at runtime. This is particularly useful for applications that are installed in the field, where a customer’s system won’t have the source code installed (and maybe not even a compiler!).

This section presents some runtime debugging techniques that we have used over the years, ranging from simple to more complex. Note that our treatment is by no means exhaustive. This is an area where it pays to have some imagination and to use it!

Add Debugging Options and Variables

The simplest technique is to have a command-line option that enables debugging. Such an option can be conditionally compiled in when you are debugging. But it’s more flexible to leave the option in the production version of the program. (You may or may not wish to document the option. This involves tradeoffs: Documenting it can allow your customers or clients to learn more about the internals of your system, which you may not want. On the other hand, not documenting it seems rather sneaky. If you’re writing Open Source or Free Software, it’s better to document the option.)

If your program is large, you may wish your debugging option to take an argument indicating what subsystem should be debugged. Based on the argument, you can set different flag variables or possibly different bit flags in a single debugging variable. Here is an outline of this technique:

struct option options[] = {
    ...
    { "debug", required_argument, NULL, 'D' },
    ...
}

int main(int argc, char **argv)
{
    int c;
    
    while ((c = getopt_long(argc, argv, "...D:")) != -1) {
        switch (c) {
        ...
        case 'D':
            parse_debug(optarg);
            break;
        ...
        }
    }
    ...
}

The parse_debug() function reads through the argument string. For example, it could be a comma- or space-separated string of subsystems, like "file,memory,ipc". For each valid subsystem name, the function would set a bit in a debugging variable:

extern int debugging;

void parse_debug(const char *subsystems)
{
    const char *sp;

    for (sp = subsystems; *sp != '\0';) {
        if (strncmp(sp, "file", 4) == 0) {
            debugging |= DEBUG_FILE;
            sp += 4;
        } else if (strncmp(sp, "memory", 6) == 0) {
            debugging |= DEBUG_MEM;
            sp += 6;
        } else if (strncmp(sp, "ipc", 3) == 0) {
            debugging |= DEBUG_IPC;
            sp += 3;
        ...
        }
        while (*sp == ' ' || *sp == ',')
            sp++;
    }
}

Finally, application code can then test the flags:

if ((debugging & DEBUG_FILE) != 0) ...          In the I/O part of the program

if ((debugging & DEBUG_MEM) != 0) ...           In the memory manager

It is up to you whether to use a single variable with flag bits, separate variables, or even a debugging array, indexed by symbolic constants (preferably from an enum).
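The third alternative, an array indexed by symbolic constants, might be sketched like this (the subsystem names and helper functions are illustrative):

```c
/* One entry per subsystem, indexed by an enum. */
enum subsystem { D_FILE, D_MEM, D_IPC, D_NSUBSYSTEMS };

static int debugging[D_NSUBSYSTEMS];    /* zero-initialized: all off */

/* debug_enable --- what parse_debug() would call for each name found */
void debug_enable(enum subsystem s)
{
    debugging[s] = 1;
}

/* debug_enabled --- what application code tests */
int debug_enabled(enum subsystem s)
{
    return debugging[s] != 0;
}
```
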

The cost of leaving the debugging code in your production executable is that the program will be larger. Depending on the placement of your debugging code, it may also be slower since the tests are always performed, but are always false until debugging is turned on. And, as mentioned, it may be possible for someone to learn about your program, which you may not want. Or worse, a malevolent user could enable so much debugging that the program slows to an unusable state! (This is called a denial of service attack.)

The benefit, which can be great, is that your already installed program can be reinvoked with debugging turned on, without requiring you to build, and then download, a special version to your customer site. When the software is installed in remote places that may not have people around and all you can do is access the system remotely through the Internet (or worse, a slow telephone dial-in!), such a feature can be a lifesaver.

Finally, you may wish to mix and match: use conditionally compiled debugging code for fine-grained, high-detail debugging, and save the always-present code for a coarser level of output.
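The mix-and-match approach might look like the following sketch, with the coarse check always compiled in and the high-detail output guarded by a build-time symbol (MEMDEBUG and debug_malloc() are our invented names):

```c
#include <stdio.h>
#include <stdlib.h>

int debugging = 0;              /* runtime flag word */
#define DEBUG_MEM 0x02

/* debug_malloc --- allocate memory, with two levels of debug output */
void *debug_malloc(size_t nbytes)
{
    void *p = malloc(nbytes);

    if ((debugging & DEBUG_MEM) != 0)       /* coarse: always present */
        fprintf(stderr, "allocated %zu bytes\n", nbytes);
#ifdef MEMDEBUG                             /* fine: debug builds only */
    fprintf(stderr, "debug_malloc: returning %p\n", p);
#endif
    return p;
}
```
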

Use Special Environment Variables

Another useful trick is to have your application pay attention to special environment variables (documented or otherwise). This can be particularly useful for testing. Here’s another example from our experience with gawk, but first, some background.

gawk uses a function named optimal_bufsize() to obtain the optimal buffer size for I/O. For small files, the function returns the file size. Otherwise, if the filesystem defines a size to use for I/O, it returns that (the st_blksize member in the struct stat, see Section 5.4.2, “Retrieving File Information,” page 141). If that member isn’t available, optimal_bufsize() returns the BUFSIZ constant from <stdio.h>. The original function (in posix/gawkmisc.c) looked like this:

 1   /* optimal_bufsize --- determine optimal buffer size */
 2
 3   int
 4   optimal_bufsize(fd, stb)         int optimal_bufsize(int fd, struct stat *stb);
 5   int fd;
 6   struct stat *stb;
 7   {
 8       /* force all members to zero in case OS doesn't use all of them. */
 9       memset(stb, '\0', sizeof(struct stat));
10
11       /*
12        * System V.n, n < 4, doesn't have the file system block size in the
13        * stat structure. So we have to make some sort of reasonable
14        * guess. We use stdio's BUFSIZ, since that is what it was
15        * meant for in the first place.
16        */
17   #ifdef HAVE_ST_BLKSIZE
18   #define DEFBLKSIZE (stb->st_blksize ? stb->st_blksize : BUFSIZ)
19   #else
20   #define DEFBLKSIZE BUFSIZ
21   #endif
22
23       if (isatty(fd))
24           return BUFSIZ;
25       if (fstat(fd, stb) == -1)
26           fatal("can't stat fd %d (%s)", fd, strerror(errno));
27       if (lseek(fd, (off_t)0, 0) == -1)   /* not a regular file */
28           return DEFBLKSIZE;
29       if (stb->st_size > 0 && stb->st_size < DEFBLKSIZE) /* small file */
30           return stb->st_size;
31       return DEFBLKSIZE;
32   }

The constant DEFBLKSIZE is the “default block size”; that is, the value from the struct stat, or BUFSIZ. For terminals (line 23) or for files that aren’t regular files (lseek() fails, line 27), the return value is also BUFSIZ. For regular files that are small, the file size is used. In all other cases, DEFBLKSIZE is returned. Knowing the “optimal” buffer size is particularly useful on filesystems in which the block size is larger than BUFSIZ.

We had a problem whereby one of our test cases worked perfectly on our development GNU/Linux system and every other Unix system we had access to. However, this test would fail consistently on certain other systems.

For a long time, we could not get direct access to a failing system in order to run GDB. Eventually, however, we did manage to reproduce the problem; it turned out to be related to the size of the buffer gawk was using for reading data files: On the failing systems, the buffer size was larger than for our development system.

We wanted a way to be able to reproduce the problem on our development machine: The failing system was nine time zones away, and running GDB interactively across the Atlantic Ocean is painful. We reproduced the problem by having optimal_bufsize() look at a special environment variable, AWKBUFSIZE. When the value is "exact", optimal_bufsize() always returns the size of the file, whatever that may be. If the value of AWKBUFSIZE is some integer number, the function returns that number. Otherwise, the function falls back to the previous algorithm. This allows us to run tests without having to constantly recompile gawk. For example,

$ AWKBUFSIZE=42 make check

This runs the gawk test suite, using a buffer size of 42 bytes. (The test suite passes.) Here is the modified version of optimal_bufsize():

 1   /* optimal_bufsize --- determine optimal buffer size */
 2
 3   /*
 4    * Enhance this for debugging purposes, as follows:
 5    *
 6    * Always stat the file, stat buffer is used by higher-level code.
 7    *
 8    * if (AWKBUFSIZE == "exact")
 9    *    return the file size
10    * else if (AWKBUFSIZE == a number)
11    *    always return that number
12    * else
13    *    if the size is < default_blocksize
14    *      return the size
15    *    else
16    *      return default_blocksize
17    *    end if
18    * endif
19    *
20    * Hair comes in an effort to only deal with AWKBUFSIZE
21    * once, the first time this routine is called, instead of
22    * every time. Performance, dontyaknow.
23    */
24
25   size_t
26   optimal_bufsize(fd, stb)
27   int fd;
28   struct stat *stb;
29   {
30       char *val;
31       static size_t env_val = 0;
32       static short first = TRUE;
33       static short exact = FALSE;
34
35       /* force all members to zero in case OS doesn't use all of them. */
 36       memset(stb, '\0', sizeof(struct stat));
37
38       /* always stat, in case stb is used by higher level code. */
39       if (fstat(fd, stb) == -1)
40           fatal("can't stat fd %d (%s)", fd, strerror(errno));
41
42       if (first) {
43           first = FALSE;
44
45           if ((val = getenv("AWKBUFSIZE")) != NULL) {
46               if (strcmp(val, "exact") == 0)
47                   exact = TRUE;
48               else if (ISDIGIT(*val)) {
49                   for (; *val && ISDIGIT(*val); val++)
50                       env_val = (env_val * 10) + *val - '0';
51
52                   return env_val;
53               }
54           }
55       } else if (! exact && env_val > 0)
56           return env_val;
57       /* else
58           fall through */
59
60       /*
61        * System V.n, n < 4, doesn't have the file system block size in the
62        * stat structure. So we have to make some sort of reasonable
63        * guess. We use stdio's BUFSIZ, since that is what it was
64        * meant for in the first place.
65        */
66   #ifdef HAVE_ST_BLKSIZE
67   #define DEFBLKSIZE (stb->st_blksize > 0 ? stb->st_blksize : BUFSIZ)
68   #else
69   #define DEFBLKSIZE BUFSIZ
70   #endif
71
72       if (S_ISREG(stb->st_mode)            /* regular file */
73           && 0 < stb->st_size              /* non-zero size */
74           && (stb->st_size < DEFBLKSIZE    /* small file */
75           || exact))                       /* or debugging */
76           return stb->st_size;             /* use file size */
77
78       return DEFBLKSIZE;
79   }

The comment on lines 3–23 explains the algorithm. Since searching the environment can be expensive and it only needs to be done once, the function uses several static variables to collect the appropriate information the first time.

Lines 42–54 execute the first time the function is called, and only the first time. Line 43 enforces this condition by setting first to false. Lines 45–54 handle the environment variable, looking for either "exact" or a number. In the latter case, it converts the string value to decimal, saving it in env_val. (We probably should have used strtoul() here; it didn’t occur to us at the time.)
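With strtoul(), the conversion on lines 48–53 might look like this sketch (parse_bufsize() is our name for the extracted logic):

```c
#include <stdlib.h>

/* parse_bufsize --- convert a decimal string to a size; 0 if not a number */
size_t parse_bufsize(const char *val)
{
    char *endp;
    unsigned long n = strtoul(val, &endp, 10);

    /* endp == val means no digits were consumed */
    return (endp != val) ? (size_t) n : 0;
}
```
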

Line 55 executes every time but the first. If a numeric value was given, the condition will be true and that value is returned (line 56). Otherwise, it falls through to the rest of the function.

Lines 60–70 define DEFBLKSIZE; this part has not changed. Finally, lines 72–76 return the file size if appropriate. If not (line 78), DEFBLKSIZE is returned.

We did fix the problem,[13] but in the meantime we left the new version of optimal_bufsize() in place, so that we can be sure the problem hasn’t recurred.

The marginal increase in code size and complexity is more than offset by the increased flexibility we now have for testing. Furthermore, since this is production code, it’s easy to have a user in the field use this feature for testing, to determine if a similar problem has occurred. (So far, we haven’t had to ask for a test, but it’s nice to know that we could handle it if we had to.)

Add Logging Code

It is often the case that your application program is running on a system on which you can’t use a debugger (such as at a customer site). In that case, your goal is to be able to examine the program’s internal state, but from the outside. The only way to do that is to have the program itself produce this information for you.

There are multiple ways to do this:

  • Always log information to a specific file. This is simplest: The program always writes logging information. You can then look at the file at your convenience.

    The disadvantage is that at some point the log file will consume all available disk space. Therefore, you should have multiple log files, with your program switching to a new one periodically.

    Brian Kernighan recommends naming the log files by day of the week: myapp.log.sun, myapp.log.mon, and so on. The advantage here is that you don’t have to manually move old files out of the way; you get a week’s worth of log files for free.

  • Write to a log file only if it already exists. When your program starts up, if the log file exists, it writes information to the log. Otherwise, it doesn’t. To enable logging, first create an empty log file.

  • Use a fixed format for messages, one that can be easily parsed by scripting languages such as awk or Perl, for summary and report generation.

  • Alternatively, generate some form of XML, which is self-describing, and possibly convertible to other formats. (We’re not big fans of XML, but you shouldn’t let that stop you.)

  • Use syslog() to do logging; the final disposition of logging messages can be controlled by the system administrator. (syslog() is a fairly advanced interface; see the syslog(3) manpage.)
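
Brian Kernighan's day-of-week naming scheme is easy to implement. The following sketch (the names logmsg and myapp.log are ours, purely illustrative) appends a timestamped message to myapp.log.sun, myapp.log.mon, and so on, choosing the file from the current date:

```c
#include <stdio.h>
#include <stdarg.h>
#include <time.h>

/* logmsg --- append a formatted message to today's log file.
 * Hypothetical helper; "logmsg" and "myapp.log" are illustrative names. */
void logmsg(const char *fmt, ...)
{
    static const char *days[] = {
        "sun", "mon", "tue", "wed", "thu", "fri", "sat"
    };
    char fname[64];
    time_t now = time(NULL);
    struct tm *tm = localtime(&now);
    FILE *fp;
    va_list args;

    snprintf(fname, sizeof fname, "myapp.log.%s", days[tm->tm_wday]);
    if ((fp = fopen(fname, "a")) == NULL)
        return;                 /* can't log; don't kill the application */

    fprintf(fp, "%02d:%02d:%02d ", tm->tm_hour, tm->tm_min, tm->tm_sec);
    va_start(args, fmt);
    vfprintf(fp, fmt, args);
    va_end(args);
    putc('\n', fp);
    fclose(fp);
}
```

One detail this sketch omits: when a day comes around again, its week-old file should be truncated rather than appended to; a production version would check the file's modification time first.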

Choosing how to log information is, of course, the easy part. The hard part is choosing what to log. As with all parts of program development, it pays to think before you code. Log information about critical variables. Check their values to make sure they’re in range or are otherwise what you expect. Log exceptional conditions; if something occurs that shouldn’t, log it, and if possible, keep going.

The key is to log only the information you need to track down problems, no more and no less.

Runtime Debugging Files

In a previous life, we worked for a startup company with binary executables of the product installed at customer sites. It wasn’t possible to attach a debugger to a running copy of the program or to run it from a debugger on the customer’s system. The main component of the product was not started directly from a command line, but indirectly, through shell scripts that did considerable initial setup.

To make the program start producing logging information, we came up with the idea of special debugging files. When a file of a certain name existed in a certain directory, the program would produce informational messages to a log file that we could then download and analyze. Such code looks like this:

struct stat sbuf;
extern int do_logging;       /* initialized to zero */

if (stat ("/path/to/magic/.file", &sbuf) == 0)
    do_logging = TRUE;
...
if (do_logging) {
    /* logging code here: open file, write info, close file, etc. */
}

The call to stat() happened for each job the program processed. Thus, we could dynamically enable and disable logging without having to stop and restart the application!

As with debugging options and variables, there are any number of variations on this theme: different files that enable logging of information about different subsystems, debugging directives added into the debugging file itself, and so on. As with all features, you should plan a design for what you will need and then implement it cleanly instead of hacking out some quick and dirty code at 3:00 A.M. (a not uncommon possibility in startup companies, unfortunately).

Note

All that glitters is not gold. Special debugging files are but one example of techniques known as back doors—one or more ways for developers to do undocumented things with a program, usually for nefarious purposes. In our instance, the back door was entirely benign. But an unscrupulous developer could just as easily arrange to generate and download a hidden copy of a customer list, personnel file, or other sensitive data. For this reason alone, you should think extra hard about whether this technique is usable in your application.

Add Special Hooks for Breakpoints

Often, a problem may be reproducible, but only after your program has first processed many megabytes of input data. Or, while you may know in which function your program is failing, the failure occurs only after the function has been called many hundreds, or even thousands, of times.

This is a big problem when you’re working in a debugger. If you set a breakpoint in the failing routine, you have to type the continue command and press ENTER hundreds or thousands of times to get your program into the state where it’s about to fail. This is tedious and error prone, to say the least! It may even be so difficult to do that you’ll want to give up before starting.

The solution is to add special debugging “hook” functions that your program can call when it is close to the state you’re interested in.

For example, suppose that you know that the check_salary() function is the one that fails, but only when it’s been called 1,427 times. (We kid you not; we’ve seen some rather strange things in our time.)

To catch check_salary() before it fails, create a special dummy function that does nothing but return, and then arrange for check_salary() to call it just before the 1,427th time that it itself is called:

/* debug_dummy --- debugging hook function */
void debug_dummy(void) { return; }

struct salary *check_salary(void)
{
    /* ... real variable declarations here ... */
    static int count = 0;     /* for debugging */

    if (++count == 1426)
        debug_dummy();

    /* ... rest of the code here ... */
}

Now, from within GDB, set a breakpoint in debug_dummy(), and then run the program normally:

(gdb) break debug_dummy                            Set breakpoint for dummy function
Breakpoint 1 at 0x8055885: file whizprog.c, line 3137.
(gdb) run                                          Start program running

Once the breakpoint for debug_dummy() is reached, you can set a second breakpoint for check_salary() and then continue execution:

(gdb) run                                          Start program running
Starting program: /home/arnold/whizprog

Breakpoint 1, debug_dummy () at whizprog.c:3137
3137 void debug_dummy(void)  { return; }           Breakpoint reached
(gdb) break check_salary                           Set breakpoint for function of interest
Breakpoint 2 at 0x8057913: file whizprog.c, line 3140.
(gdb) cont   

When the second breakpoint is reached, the program is about to fail and you can single-step through it, doing whatever is necessary to track down the problem.

Instead of using a fixed constant (’++count == 1426’), you may wish to have a global variable that can be set by the debugger to whatever value you need. This avoids the need to recompile the program.
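
That variation looks like the following sketch (the names stop_count, debug_dummy(), and check_salary() are illustrative; here debug_dummy() counts its calls so the example can check itself, whereas in real use it would simply return):

```c
/* The trigger count lives in a global instead of a compiled-in
 * constant, so it can be changed from GDB without recompiling:
 *     (gdb) set var stop_count = 2500
 */
int stop_count = 1426;      /* default; settable from the debugger */
int dummy_calls = 0;        /* for self-checking only */

/* debug_dummy --- debugging hook function; breakpoint target */
void debug_dummy(void) { dummy_calls++; }

void check_salary(void)
{
    static int count = 0;   /* for debugging */

    if (++count == stop_count)
        debug_dummy();

    /* ... real work here ... */
}
```

With the breakpoint still set on debug_dummy(), the same binary can now be stopped just before the 1,427th call, the 2,001st, or wherever the current bug requires.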

For gawk, we have gone a step further and brought the debugging hook facility into the language, so the hook function can be called from the awk program. When compiled for debugging, a special do-nothing function named stopme() is available. This function in turn calls a C function of the same name. This allows us to put calls to stopme() into a failing awk program right before things go wrong. For example, if gawk is producing bad results for an awk program on the 1,200th input record, we can add a line like this to the awk program:

NR == 1198 { stopme() }    # Stop for debugging when Number of Records == 1198

... rest of awk program as before ...

Then, from within GDB, we can set a breakpoint on the C function stopme() and run the awk program. Once that breakpoint fires, we can then set breakpoints on the other parts of gawk where we suspect the real problem lies.

The hook-function technique is useful in and of itself. However, the ability to bring it to the application level multiplies its usefulness, and it has saved us untold hours of debugging time when tracking down obscure problems.

Debugging Tools

Besides GDB and whatever source code hooks you use for general debugging, there are a number of useful packages that can help find different kinds of problems. Because dynamic memory management is such a difficult task in large-scale programs, many tools focus on that area, often acting as drop-in replacements for malloc() and free().

There are commercial tools that do many (or all) of the same things as the programs we describe, but not all of them are available for GNU/Linux, and many are quite expensive. All of the packages discussed in this section are freely available.

The dbug Library — A Sophisticated printf()

The first package we examine is the dbug library. It is based on the idea of conditionally compiled debugging code we presented earlier in this chapter, but carries things much further, providing relatively sophisticated runtime tracing and conditional debug output. It implements many of the tips we described, saving you the trouble of implementing them yourself.

The dbug library, written by Fred Fish in the early 1980s, has seen modest enhancements since then. It is now explicitly in the public domain, so it can be used in both free and proprietary software, without problems. It is available from Fred Fish’s FTP archive,[14] as both a compressed tar file, and as a ZIP archive. The documentation summarizes dbug well:

dbug is an example of an internal debugger. Because it requires internal instrumentation of a program, and its usage does not depend on any special capabilities of the execution environment, it is always available and will execute in any environment that the program itself will execute in. In addition, since it is a complete package with a specific user interface, all programs which use it will be provided with similar debugging capabilities. This is in sharp contrast to other forms of internal instrumentation where each developer has their own, usually less capable, form of internal debugger...

The dbug package imposes only a slight speed penalty on executing programs, typically much less than 10 percent, and a modest size penalty, typically 10 to 20 percent. By defining a specific C preprocessor symbol both of these can be reduced to zero with no changes required to the source code.

The following list is a quick summary of the capabilities of the dbug package. Each capability can be individually enabled or disabled at the time a program is invoked by specifying the appropriate command line arguments.

  • Execution trace showing function level control flow in a semi-graphical manner using indentation to indicate nesting depth.

  • Output the values of all, or any subset of, key internal variables.

  • Limit actions to a specific set of named functions.

  • Limit function trace to a specified nesting depth.

  • Label each output line with source file name and line number.

  • Label each output line with name of current process.

  • Push or pop internal debugging state to allow execution with built-in debugging defaults.

  • Redirect the debug output stream to standard output (stdout) or a named file. The default output stream is standard error (stderr). The redirection mechanism is completely independent of normal command line redirection to avoid output conflicts.

The dbug package requires you to use a certain discipline when writing your code. In particular, you have to use its macros when doing a function return or calling setjmp() and longjmp(). You have to add a single macro call as the first executable statement of each function and call a few extra macros from main(). Finally, you have to add a debugging command-line option: By convention this is -#, which is rarely, if ever, used as a real option. In return for the extra work, you get all the benefits just outlined. Let’s look at the example in the manual:

 1   #include <stdio.h>
 2   #include "dbug.h"
 3
 4   int
 5   main (argc, argv)
 6   int argc;
 7   char *argv[];
 8   {
 9       register int result, ix;
10       extern int factorial (), atoi ();
11
12       DBUG_ENTER ("main");
13       DBUG_PROCESS (argv[0]);
14       DBUG_PUSH_ENV ("DBUG");
15       for (ix = 1; ix < argc && argv[ix][0] == '-'; ix++) {
16           switch (argv[ix][1]) {
17           case '#':
18               DBUG_PUSH (&(argv[ix][2]));
19               break;
20           }
21       }
22       for (; ix < argc; ix++) {
23           DBUG_PRINT ("args", ("argv[%d] = %s", ix, argv[ix]));
24           result = factorial (atoi (argv[ix]));
25           printf ("%d\n", result);
26           fflush (stdout);
27       }
28       DBUG_RETURN (0);
29   }

This program illustrates most of the salient points. The DBUG_ENTER() macro (line 12) must be called after any variable declarations and before any other code. (This is because it declares some private variables of its own.[15])

The DBUG_PROCESS() macro (line 13) sets the name of the program, primarily for use in output messages from the library. This macro should be called only once, from main().

The DBUG_PUSH_ENV() macro (line 14) causes the library to look at the named environment variable (DBUG in this case) for control strings. (The dbug control strings are discussed shortly.) The library is capable of saving its current state and using a new one, creating a stack of saved states. Thus, this macro pushes the state obtained from the given environment variable onto the stack of saved states. As used in this example, the macro creates the initial state. If there is no such environment variable, nothing happens. (As an aside, DBUG is rather generic; perhaps something like GAWK_DBUG [for gawk] would be better.)

The DBUG_PUSH() macro (line 18) passes in the control string value obtained from the -# command-line option. (New code should use getopt() or getopt_long() instead of manual argument parsing.) This is normally how debugging is enabled, but using an environment variable as well provides additional flexibility.

The DBUG_PRINT() macro (line 23) is what produces output. The second argument uses the technique we described earlier (see Section 15.4.1.1, “Use Debugging Macros,” page 577) of enclosing the entire printf() argument list in parentheses, making it a single argument as far as the C preprocessor is concerned. Note that a terminating newline character is not provided in the format string; the dbug library provides the newline for you.

When printing, by default, the dbug library outputs all DBUG_PRINT() statements. The first argument is a string that can be used to limit the output just to DBUG_PRINT() macros using that string.

Finally, the DBUG_RETURN() macro (line 28) is used instead of a regular return statement to return a value. There is a corresponding DBUG_VOID_RETURN macro for use in void functions.

The rest of the program is completed with the factorial() function:

 1   #include <stdio.h>
 2   #include "dbug.h"
 3   
 4   int factorial (value)
 5   register int value;
 6   {
 7       DBUG_ENTER ("factorial");
 8       DBUG_PRINT ("find", ("find %d factorial", value));
 9       if (value > 1) {
10           value *= factorial (value - 1);
11       }
12       DBUG_PRINT ("result", ("result is %d", value));
13       DBUG_RETURN (value);
14   }

Once the program is compiled and linked with the dbug library, it can be run normally. By default, the program produces no debugging output. With debugging enabled, though, different kinds of output are available:

$ factorial 1 2 3                       Regular run, no debugging
1
2
6
$ factorial -#t 1 2 3                   Show function call trace, note nesting
| >factorial
| <factorial
1                                       Regular output is on stdout
| >factorial
| | >factorial
| | <factorial                          Debugging output is on stderr
| <factorial
2
| >factorial
| | >factorial
| | | >factorial
| | | <factorial
| | <factorial
| <factorial
6
<?func?
$ factorial -#d 1 2                     Show debugging messages from DBUG_PRINT()
?func?: args: argv[2] = 1
factorial: find: find 1 factorial
factorial: result: result is 1
1
?func?: args: argv[3] = 2
factorial: find: find 2 factorial
factorial: find: find 1 factorial
factorial: result: result is 1
factorial: result: result is 2
2

The -# option controls the dbug library. It is “special” in the sense that DBUG_PUSH() will accept the entire string, ignoring the leading ’-#’ characters, although you could use a different option if you wish, passing DBUG_PUSH() just the option argument string (this is optarg if you use getopt()).

The control string consists of a set of options and arguments. Each group of options and arguments is separated from the others by a colon character. Each option is a single letter, and the arguments to that option are separated from it by commas. For example:

$ myprog -#d,mem,ipc:f,check_salary,check_start_date -f infile -o outfile

The d option enables DBUG_PRINT() output, but only if the first argument string is one of "mem" or "ipc". (With no arguments, all DBUG_PRINT() messages are printed.) Similarly, the f option limits the function call trace to just the named functions: check_salary() and check_start_date().

The following list of options and arguments is reproduced from the dbug library manual. Square brackets enclose optional arguments. We include here only the ones we find useful; see the documentation for the full list.

d[, keywords]

  • Enable output from macros with specified keywords. A null list of keywords implies that all keywords are selected.

F

  • Mark each debugger output line with the name of the source file containing the macro causing the output.

i

  • Identify the process emitting each line of debug or trace output with the process ID for that process.

L

  • Mark each debugger output line with the source-file line number of the macro causing the output.

o[, file]

  • Redirect the debugger output stream to the specified file. The default output stream is stderr. A null argument list causes output to be redirected to stdout.

t[, N]

  • Enable function control flow tracing. The maximum nesting depth is specified by N, and defaults to 200.

To round out the discussion, here are the rest of the macros defined by the dbug library.

DBUG_EXECUTE(string, code)

  • This macro is similar to DBUG_PRINT(): The first argument is a string selected with the d option, and the second is code to execute:

    DBUG_EXECUTE("abort", abort());
    

DBUG_FILE

  • This is a value of type FILE *, for use with the <stdio.h> routines. It allows you to do your own output to the debugging file stream.

DBUG_LONGJMP(jmp_buf env, int val)

  • This macro wraps a call to longjmp(), taking the same arguments, so that the dbug library will know when you’ve made a nonlocal jump.

DBUG_POP()

  • This macro pops one level of saved debugging state, as created by DBUG_PUSH(). It is rather esoteric; you probably won’t use it.

DBUG_SETJMP(jmp_buf env)

  • This macro wraps a call to setjmp(), taking the same argument. It allows the dbug library to handle nonlocal jumps.

In a different incarnation, at the first startup company we worked for,[16] we used the dbug library in our product. It was invaluable during development, and by defining DBUG_OFF (with -DDBUG_OFF) for the final build, we were able to build a production version, with no other source code changes.

To get the most benefit out of the dbug library, you must use it consistently, throughout your program. This is easier if you use it from the beginning of a project, but as an experiment, we found that with the aid of a simple awk script, we could incorporate the library into a 30,000-line program with a few hours' work. If you can afford the overhead, it's best to leave it in the production build of your program so that you can debug with it without first having to recompile.

We find that the dbug library is a nice complement to external debuggers such as GDB; it provides an organized and consistent way to apply instrumentation to C code. It also rather nicely combines many of the techniques that we outlined separately, earlier in the chapter. The dynamic function call trace feature is particularly useful, and it proves invaluable for help in learning about a program’s behavior if you’re unfamiliar with it.

Memory Allocation Debuggers

Ignoring issues such as poor program design, for any large-scale, practical application, the C programmer’s single biggest challenge is dynamic memory management (by malloc(), realloc(), and free()).

This fact is borne out by the large number of tools that are available for debugging dynamic memory. There is a fair amount of overlap in what these tools provide. For example:

  • Memory leak detection: memory that is allocated and then becomes unreachable.

  • Unfreed memory detection: memory that is allocated but never freed. Never-freed memory isn’t always a bug, but detecting such occurrences allows you to verify that they’re indeed OK.

  • Detection of bad frees: memory that is freed twice, or pointers passed to free() that didn’t come from malloc().

  • Detection of use of already freed memory: memory that is freed is being used through a dangling pointer.

  • Memory overrun detection: accessing or storing into memory outside the bounds of what was allocated.

  • Warning about the use of uninitialized memory. (Many compilers can warn about this.)

  • Dynamic function tracing: When a bad memory access occurs, you get a traceback from where the memory is used to where it was allocated.

  • Tool control through the use of environment variables.

  • Log files for raw debugging information that can be postprocessed to produce useful reports.

Some tools merely log these events. Others arrange for the application program to die a horrible death (through SIGSEGV) so that the offending code can be pinpointed from within a debugger. Additionally, most are designed to work well with GDB.

Some tools require source code modification, such as calling special functions, or using a special header file, extra #defines, and a static library. Others work by using special Linux/Unix shared library mechanisms to transparently install themselves as replacements for the standard library versions of malloc() and free().

In this section we look at three dynamic memory debuggers, and then we provide pointers to several others.

GNU/Linux mtrace

GNU/Linux systems using GLIBC provide two functions for enabling and disabling memory tracing at runtime:

#include <mcheck.h>                                           GLIBC

void mtrace(void);
void muntrace(void);

When mtrace() is called, the library looks at the environment variable MALLOC_TRACE. It is expected that this names a writable file (existing or not). The library opens the file and begins logging information about memory allocations and frees. (No logging is done if the file can’t be opened. The file is truncated each time the program runs.) When muntrace() is called, the library closes the file and does not log any further allocations or frees.

The use of separate functions makes it possible to do memory tracing for specific parts of the program; it’s not necessary to trace everything. (We find it most useful to enable logging at the start of the program and be done, but this design provides flexibility, which is nice to have.)

Once the application program exits, you use the mtrace program to analyze the log file. (The log file is ASCII, but the information isn’t directly usable.) For example, gawk turns on tracing if TIDYMEM is defined:

$ export TIDYMEM=1 MALLOC_TRACE=mtrace.out                Export environment variables
$ ./gawk 'BEGIN { print "hello, world" }'                 Run the program
hello, world
$ mtrace ./gawk mtrace.out                                Generate report

Memory not freed:
-----------------
   Address      Size      Caller
0x08085858      0x20   at /home/arnold/Gnu/gawk/gawk-3.1.3/main.c:1102
0x08085880     0xc80   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:398
0x08086508       0x2   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:337
0x08086518       0x6   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:337
0x08086528      0x10   at /home/arnold/Gnu/gawk/gawk-3.1.3/eval.c:2082
0x08086550       0x3   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:337
0x08086560       0x3   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:337
0x080865e0       0x4   at /home/arnold/Gnu/gawk/gawk-3.1.3/field.c:76
0x08086670      0x78   at /home/arnold/Gnu/gawk/gawk-3.1.3/awkgram.y:1369
0x08086700       0xe   at /home/arnold/Gnu/gawk/gawk-3.1.3/node.c:337
0x08086718      0x1f   at /home/arnold/Gnu/gawk/gawk-3.1.3/awkgram.y:1259

The output is a list of locations at which gawk allocates memory that is never freed. Note that permanently hanging onto dynamic memory is fine if it’s done on purpose. All the cases shown here are allocations of that sort.

Electric Fence

In Section 3.1, “Linux/Unix Address Space,” page 52, we described how dynamic memory comes from the heap, which can be made to grow and shrink (with the brk() or sbrk() calls, described in Section 3.2.3, “System Calls: brk() and sbrk(),” page 75).

Well, the picture we presented there is a simplified version of reality. More advanced system calls (not covered in this volume) make it possible to add additional, not necessarily contiguous, segments of memory into a process’s address space. Many malloc() debuggers work by using these system calls to add a new piece of address space for every allocation. The advantage of this scheme is that the operating system and the computer’s memory-protection hardware cooperate to make access to memory outside these discontiguous segments invalid, generating a SIGSEGV signal. The scheme is depicted in Figure 15.1.


Figure 15.1. Linux/Unix process address space, including special areas

The first debugging package to implement this scheme was Electric Fence. Electric Fence is a drop-in replacement for malloc() et al. It works on many Unix systems and GNU/Linux; it is available from its author’s FTP archive.[17] Many GNU/Linux distributions supply it, although you may have to choose it explicitly when you install your system.

Once a program is linked with Electric Fence, any access that is out of bounds generates a SIGSEGV. Electric Fence also catches attempts to use memory that has already been freed. Here is a simple program that illustrates both problems:

 1   /* ch15-badmem1.c --- do bad things with memory */
 2
 3   #include <stdio.h>
 4   #include <stdlib.h>
 5
 6   int main(int argc, char **argv)
 7   {
 8       char *p;
 9       int i;
10
11       p = malloc(30);
12
13       strcpy(p, "not 30 bytes");
14       printf("p = <%s>\n", p);
15
16       if (argc == 2) {
17           if (strcmp(argv[1], "-b") == 0)
18               p[42] = 'a';    /* touch outside the bounds */
19           else if (strcmp(argv[1], "-f") == 0) {
20               free(p);        /* free memory and then use it */
21               p[0] = 'b';
22           }
23       }
24
25       /* free(p); */
26
27       return 0;
28   }

This program does simple command-line option checking to decide how to misbehave: -b touches memory out of bounds, and -f attempts to use freed memory. (Lines 18 and 21 are the dangerous ones, respectively.) Note that with no options, the pointer is never freed (line 25); Electric Fence doesn’t catch this case.

One way to use Electric Fence, a way that is guaranteed to work across Unix and GNU/Linux systems, is to statically link your program with it. The program should then be run from the debugger. (The Electric Fence documentation is explicit that Electric Fence should not be linked with a production binary.) The following session demonstrates this procedure and shows what happens for both command-line options:

$ cc -g ch15-badmem1.c -lefence -o ch15-badmem1       Compile; link statically
$ gdb ch15-badmem1                                    Run it from the debugger
GNU gdb 5.3
...
(gdb) run -b                                          Try -b option
Starting program: /home/arnold/progex/code/ch15/ch15-badmem1 -b
[New Thread 8192 (LWP 28021)]

  Electric Fence 2.2.0 Copyright (C) 1987–1999 Bruce Perens <[email protected]>
p = <not 30 bytes>

Program received signal SIGSEGV, Segmentation fault.   SIGSEGV: GDB prints where
[Switching to Thread 8192 (LWP 28021)]
0x080485b6 in main (argc=2, argv=0xbffff8a4) at ch15-badmem1.c:18
18                               p[42] = 'a';     /* touch outside the bounds */
(gdb) run -f                                           Now try the -f option
The program being debugged has been started already.
Start it from the beginning? (y or n) y                Yes, really

Starting program: /home/arnold/progex/code/ch15/ch15-badmem1 -f
[New Thread 8192 (LWP 28024)]

  Electric Fence 2.2.0 Copyright (C) 1987–1999 Bruce Perens <[email protected]>
p = <not 30 bytes>

Program received signal SIGSEGV, Segmentation fault.      SIGSEGV again
[Switching to Thread 8192 (LWP 28024)]
0x080485e8 in main (argc=2, argv=0xbffff8a4) at ch15-badmem1.c:21
21                              p[0] = 'b';

On systems that support shared libraries and the LD_PRELOAD environment variable (including GNU/Linux), you don’t need to explicitly link in the efence library. Instead, the ef shell script arranges to run the program with the proper setup.

Although we haven’t described the mechanisms in detail, GNU/Linux (and other Unix systems) support shared libraries, special versions of the library routines that are kept in a single file on disk instead of copied into every single executable program’s binary file. Shared libraries save some space on disk and can save system memory, since all programs using a shared library use the same in-memory copy of the library. The cost is that program startup is slower because the program and the shared library have to be hooked together before the program can start running. (This is usually transparent to you, the user.)

The LD_PRELOAD environment variable causes the system’s program loader (which brings executable files into memory) to link in a special library before the standard libraries. The ef script uses this feature to link in Electric Fence’s version of the malloc() suite. Thus, relinking isn’t even necessary. This example demonstrates ef:

$ cc -g ch15-badmem1.c -o ch15-badmem1                   Compile normally
$ ef ch15-badmem1 -b                                     Run using ef, dumps core

  Electric Fence 2.2.0 Copyright (C) 1987–1999 Bruce Perens <[email protected]>
p = <not 30 bytes>
/usr/bin/ef: line 20: 28005 Segmentation fault      (core dumped)
    ( export LD_PRELOAD=libefence.so.0.0; exec $* )
$ ef ch15-badmem1 -f                                     Run using ef, dumps core again

  Electric Fence 2.2.0 Copyright (C) 1987–1999 Bruce Perens <[email protected]>
p = <not 30 bytes>
/usr/bin/ef: line 20: 28007 Segmentation fault      (core dumped)
    ( export LD_PRELOAD=libefence.so.0.0; exec $* )
$ ls -l core*                                            Linux gives us separate core files
-rw-------    1 arnold   devel      217088 Aug 28 15:40 core.28005
-rw-------    1 arnold   devel      212992 Aug 28 15:40 core.28007

GNU/Linux creates core files that include the process ID number in the file name. In this instance this behavior is useful because we can debug each core file separately:

$ gdb ch15-badmem1 core.28005                             From the -b option
GNU gdb 5.3
...
Core was generated by 'ch15-badmem1 -b'.
Program terminated with signal 11, Segmentation fault.
...
#0  0x08048466 in main (argc=2, argv=0xbffff8c4) at ch15-badmem1.c:18
18                             p[42] = 'a';     /* touch outside the bounds */
(gdb) quit   

$ gdb ch15-badmem1 core.28007                             From the -f option
GNU gdb 5.3
...
Core was generated by 'ch15-badmem1 -f'.
Program terminated with signal 11, Segmentation fault.
...
#0 0x08048498 in main (argc=2, argv=0xbffff8c4) at ch15-badmem1.c:21
21                             p[0] = 'b';

The efence(3) manpage describes several environment variables that can be set to tailor Electric Fence’s behavior. The following three are the most notable.

EF_PROTECT_BELOW

  • Setting this variable to 1 causes Electric Fence to look for underruns instead of overruns. An overrun, accessing memory beyond the allocated area, was demonstrated previously. An underrun is accessing memory located in front of the allocated area.

EF_PROTECT_FREE

  • Setting this variable to 1 prevents Electric Fence from reusing memory that was correctly freed. This is helpful when you think a program may be accessing freed memory; if the freed memory was subsequently reallocated, access to it from a previously dangling pointer would otherwise go undetected.

EF_ALLOW_MALLOC_0

  • When given a nonzero value, Electric Fence allows calls to ’malloc(0)’. Such calls are technically valid in Standard C, but likely represent a software bug. Thus, by default, Electric Fence disallows them.
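These variables are ordinarily set per run, on the command line, so that they affect only the program being debugged. The following invocations are illustrative (they assume the ef script and a test program are available):

```shell
# Look for underruns instead of overruns, for this one run only:
EF_PROTECT_BELOW=1 ef ch15-badmem1 -b

# Trap any access to freed memory, even memory that could have been reallocated:
EF_PROTECT_FREE=1 ef ch15-badmem1 -f
```

Prefixing the command this way keeps the setting out of your login environment, so ordinary programs run unaffected.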

In addition to environment variables, Electric Fence supplies similarly named global variables. You can change their values from within a debugger, so you can dynamically alter the behavior of a program that has already started running. See efence(3) for the details.

Debugging Malloc: dmalloc

The dmalloc library provides a large number of debugging options. Its author is Gray Watson, and it has its own web site.[18] As with Electric Fence, it may already be installed on your system, or you may have to retrieve it and build it yourself.

The dmalloc library examines the DMALLOC_OPTIONS environment variable for control information. For example, it might look like this:

$ echo $DMALLOC_OPTIONS
debug=0x4e40503, inter=100, log=dm-log

The ’debug’ part of this variable is a set of OR’d bit flags which is nearly impossible for most people to manage directly. Therefore, the documentation describes a two-stage process for making things easier to use.

The first step is to define a shell function named dmalloc that calls the dmalloc driver program:

$ dmalloc () {
> eval `command dmalloc -b $*`      The 'command' command bypasses shell functions
> }

Once that’s done, you can pass options to the function to set the log file (-l), specify the number of iterations after which dmalloc should verify its internal data structures (-i), and specify a debugging level or other tag (’low’):

$ dmalloc -l dm-log -i 100 low

Like Electric Fence, the dmalloc library can be statically linked into the application or dynamically linked with LD_PRELOAD. The following example demonstrates the latter:

$ LD_PRELOAD=libdmalloc.so ch15-badmem1 -b     Run with checking on
p = <not 30 bytes>                             Normal output shown

Note

Do not use ’export LD_PRELOAD=libdmalloc.so’! If you do, every program you run, such as ls, will run with malloc() checking turned on. Your system will become unusable, quickly. If you do this by accident, you can use ’unset LD_PRELOAD’ to restore normal behavior.

The results go into the dm-log file, as specified:

$ cat dm-log
1062078174: 1: Dmalloc version '4.8.1' from 'http://dmalloc.com/'
1062078174: 1: flags = 0x4e40503, logfile 'dm-log'
1062078174: 1: interval = 100, addr = 0, seen # = 0
1062078174: 1: starting time = 1062078174
1062078174: 1: free bucket count/bits: 63/6
1062078174: 1: basic-block 4096 bytes, alignment 8 bytes, heap grows up
1062078174: 1: heap: 0x804a000 to 0x804d000, size 12288 bytes (3 blocks)
1062078174: 1: heap checked 0
1062078174: 1: alloc calls: malloc 1, calloc 0, realloc 0, free 0
1062078174: 1: alloc calls: recalloc 0, memalign 0, valloc 0
1062078174: 1:  total memory allocated: 30 bytes (1 pnts)
1062078174: 1:  max in use at one time: 30 bytes (1 pnts)
1062078174: 1: max alloced with 1 call: 30 bytes
1062078174: 1: max alloc rounding loss: 34 bytes (53%)
1062078174: 1: max memory space wasted: 3998 bytes (98%)
1062078174: 1: final user memory space: basic 0, divided 1, 4062 bytes
1062078174: 1:  final admin overhead: basic 1, divided 1, 8192 bytes (66%)
1062078174: 1:  final external space: 0 bytes (0 blocks)
1062078174: 1: top 10 allocations:
1062078174: 1:  total-size  count in-use-size  count  source
1062078174: 1:          30      1          30      1  ra=0x8048412
1062078174: 1:          30      1          30      1  Total of 1
1062078174: 1: dumping not-freed pointers changed since 0:
1062078174: 1:  not freed: '0x804c008|s1' (30 bytes) from 'ra=0x8048412'
1062078174: 1:  total-size   count   source
1062078174: 1:          30       1   ra=0x8048412     Allocation is here
1062078174: 1:          30       1   Total of 1
1062078174: 1:  unknown memory: 1 pointer, 30 bytes
1062078174: 1: ending time = 1062078174, elapsed since start = 0:00:00

The output includes many statistics, which we’re not interested in at the moment. The line that is interesting is the one indicating memory that wasn’t freed, with a return address indicating the function that allocated the memory (’ra=0x8048412’). The dmalloc documentation explains how to get the source code location for this address, using GDB:

$ gdb ch15-badmem1                       Start GDB
GNU gdb 5.3
...
(gdb) x 0x8048412                        Examine address
0x8048412 <main+26>:     0x8910c483
(gdb) info line *(0x8048412)             Get line information
Line 11 of "ch15-badmem1.c" starts at address 0x8048408 <main+16>
   and ends at 0x8048418 <main+32>.

This is painful, but workable if you have no other choice. However, if you include the "dmalloc.h" header file in your program (after all other #include statements), you can get source code information directly in the report:

...
1062080258: 1:   top 10 allocations:
1062080258: 1:    total-size  count in-use-size  count  source
1062080258: 1:            30      1          30      1  ch15-badmem2.c:13
1062080258: 1:            30      1          30      1  Total of 1
1062080258: 1:   dumping not-freed pointers changed since 0:
1062080258: 1:    not freed: '0x804c008|s1' (30 bytes) from 'ch15-badmem2.c:13'
1062080258: 1:    total-size  count  source
1062080258: 1:            30      1  ch15-badmem2.c:13
1062080258: 1:            30      1  Total of 1
...

(The ch15-badmem2.c file is the same as ch15-badmem1.c, except that it includes "dmalloc.h", so we haven’t bothered to show it.)

Individual debugging features are enabled or disabled with tokens—specially recognized identifiers—using the -p option to add a token (feature) and the -m option to remove one. There are predefined combinations, ’low’, ’med’, and ’high’. You can see what these combinations are with ’dmalloc -Lv’:

$ dmalloc low                                       Set things to low
$ dmalloc -Lv                                       Show settings
Debug Malloc Utility: http://dmalloc.com/
  For a list of the command-line options enter: dmalloc --usage
Debug-Flags 0x4e40503 (82052355) (low)              Current tokens
  log-stats, log-non-free, log-bad-space, log-elapsed-time, check-fence,
  free-blank, error-abort, alloc-blank, catch-null
Address     not-set
Interval    100
Lock-On     not-set
Logpath     'log2'
Start-File  not-set

The full set of tokens, along with a brief explanation and each token’s corresponding numeric value, is available from ’dmalloc -DV’:

$ dmalloc -DV
Debug Tokens:
none (nil) -- no functionality (0)
log-stats (lst) -- log general statistics (0x1)
log-non-free (lnf) -- log non-freed pointers (0x2)
log-known (lkn) -- log only known non-freed (0x4)
log-trans (ltr) -- log memory transactions (0x8)
log-admin (lad) -- log administrative info (0x20)
log-blocks (lbl) -- log blocks when heap-map (0x40)
log-bad-space (lbs) -- dump space from bad pnt (0x100)
log-nonfree-space (lns) -- dump space from non-freed pointers (0x200)
log-elapsed-time (let) -- log elapsed-time for allocated pointer (0x40000)
log-current-time (lct) -- log current-time for allocated pointer (0x80000)
check-fence (cfe) -- check fence-post errors (0x400)
check-heap (che) -- check heap adm structs (0x800)
check-lists (cli) -- check free lists (0x1000)
check-blank (cbl) -- check mem overwritten by alloc-blank, free-blank (0x2000)
check-funcs (cfu) -- check functions (0x4000)
force-linear (fli) -- force heap space to be linear (0x10000)
catch-signals (csi) -- shutdown program on SIGHUP, SIGINT, SIGTERM (0x20000)
realloc-copy (rco) -- copy all re-allocations (0x100000)
free-blank (fbl) -- overwrite freed memory space with BLANK_CHAR (0x200000)
error-abort (eab) -- abort immediately on error (0x400000)
alloc-blank (abl) -- overwrite newly alloced memory with BLANK_CHAR (0x800000)
heap-check-map (hcm) -- log heap-map on heap-check (0x1000000)
print-messages (pme) -- write messages to stderr (0x2000000)
catch-null (cnu) -- abort if no memory available (0x4000000)
never-reuse (nre) -- never re-use freed memory (0x8000000)
allow-free-null (afn) -- allow the frees of NULL pointers (0x20000000)
error-dump (edu) -- dump core on error and then continue (0x40000000)

By now you should have a feel for how to use dmalloc and its flexibility. dmalloc is overkill for our simple demonstration program, but it is invaluable for larger-scale, real-world applications.

Valgrind: A Versatile Tool

The tools described in the previous section all focus on dynamic memory debugging, and indeed this is a significant problem area for many programs. However, dynamic memory problems aren’t the only kind. The GPL-licensed Valgrind program catches a large variety of problems, including those that arise from dynamic memory.

The Valgrind manual describes the program as well as or better than we can, so we’ll quote (and abbreviate) it as we go along.

Valgrind is a flexible tool for debugging and profiling Linux-x86 executables. The tool consists of a core, which provides a synthetic x86 CPU in software, and a series of “skins”, each of which is a debugging or profiling tool. The architecture is modular, so that new skins can be created easily and without disturbing the existing structure.

The most useful “skin” is memcheck:

The memcheck skin detects memory-management problems in your programs.

All reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted. As a result, memcheck can detect the following problems:

  • Use of uninitialized memory.

  • Reading/writing memory after it has been free’d.

  • Reading/writing off the end of malloc’d blocks.

  • Reading/writing inappropriate areas on the stack.

  • Memory leaks—where pointers to malloc’d blocks are lost forever.

  • Mismatched use of malloc/new/new [] vs free/delete/delete [].

  • Some misuses of the POSIX pthreads API.

Problems like these can be difficult to find by other means, often lying undetected for long periods, then causing occasional, difficult-to-diagnose crashes.

Other skins are more specialized:

  • cachegrind performs detailed simulation of the I1, D1, and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code.

  • The addrcheck [skin] is identical to memcheck except for the single detail that it does not do any uninitialized-value checks. All of the other checks—primarily the fine-grained address checking—are still done. The downside of this is that you don’t catch the uninitialized-value errors that memcheck can find.

    But the upside is significant: Programs run about twice as fast as they do on memcheck, and a lot less memory is used. It still finds reads/writes of freed memory, memory off the end of blocks and in other invalid places, bugs which you really want to find before release!

  • helgrind is a debugging skin designed to find data races in multithreaded programs.

Finally, the manual notes:

Valgrind is closely tied to details of the CPU, operating system and to a lesser extent, compiler and basic C libraries. This makes it difficult to make it portable, so we have chosen at the outset to concentrate on what we believe to be a widely used platform: Linux on x86s. Valgrind uses the standard Unix ’./configure’, ’make’, ’make install’ mechanism, and we have attempted to ensure that it works on machines with kernel 2.2 or 2.4 and glibc 2.1.X, 2.2.X or 2.3.1. This should cover the vast majority of modern Linux installations. Note that glibc-2.3.2+, with the NPTL (Native POSIX Thread Library) package won’t work. We hope to be able to fix this, but it won’t be easy.

If you’re using GNU/Linux on a different platform or if you’re using a commercial Unix system, then Valgrind won’t be of much help to you. However, as x86 GNU/Linux systems are quite common (and affordable), it’s likely you can acquire one on a moderate budget, or at least find one to borrow! What’s more, once Valgrind has found a problem for you, it’s fixed for whatever platform your program is compiled to run on. Thus, it’s reasonable to use an x86 GNU/Linux system for development, and some other commercial Unix system for deployment of a high-end product.[19]

Although the Valgrind manual might lead you to expect that there are separate commands named memcheck, addrcheck, and so on, this isn’t the case. Instead, a driver shell program named valgrind runs the debugging core, with the appropriate “skin” as specified by the --skin= option. The default skin is memcheck; thus, running a plain valgrind is the same as ’valgrind --skin=memcheck’. (This provides compatibility with earlier versions of Valgrind that only did memory checking, and it also makes the most sense, since the memcheck skin provides the most information.)

Valgrind provides a number of options. We refer you to its documentation for the full details. The options are split into groups; of those that apply to the core (that is, work for all skins), these are likely to be most useful:

--gdb-attach=no | yes

  • Start up with a GDB attached to the process, for interactive debugging. The default is no.

--help

  • List the options.

--logfile=file

  • Log messages to file.pid.

--num-callers=number

  • Show number callers in stack traces. The default is 4.

--skin=skin

  • Use the skin named skin. Default is memcheck.

--trace-children=no | yes

  • Also run the trace on child processes. The default is no.

-v, --verbose

  • Be more verbose. This includes listing the libraries that are loaded, as well as the counts of all the different kinds of errors.

Of the options for the memcheck skin, these are the ones we think are most useful:

--leak-check=no | yes

  • Find memory leaks once the program is finished. The default is ’no’.

--show-reachable=no | yes

  • Show reachable blocks when the program is finished. If --show-reachable=yes is used, Valgrind looks for dynamically allocated memory that still has a pointer pointing to it. Such memory is not a memory leak, but it may be useful to know about anyway. The default is ’no’.

Let’s take a look at Valgrind in action. Remember ch15-badmem1.c? (See Section 15.5.2.2, “Electric Fence” page 614.) The -b option writes into memory that is beyond the area allocated with malloc(). Here’s what Valgrind reports:

$ valgrind ch15-badmem1 -b
 1  ==8716== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
 2  ==8716== Copyright (C) 2002-2003, and GNU GPL'd, by Julian Seward.
 3  ==8716== Using valgrind-20030725, a program supervision framework for x86-linux.
 4  ==8716== Copyright (C) 2000-2003, and GNU GPL'd, by Julian Seward.
 5  ==8716== Estimated CPU clock rate is 2400 MHz
 6  ==8716== For more details, rerun with: -v
 7  ==8716==
 8  p = <not 30 bytes>
 9  ==8716== Invalid write of size 1
10  ==8716==    at 0x8048466: main (ch15-badmem1.c:18)
11  ==8716==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
12  ==8716==    by 0x8048368: (within /home/arnold/progex/code/ch15/ch15-badmem1)
13  ==8716==    Address 0x4104804E is 12 bytes after a block of size 30 alloc'd
14  ==8716==    at 0x40025488: malloc (vg_replace_malloc.c:153)
15  ==8716==    by 0x8048411: main (ch15-badmem1.c:11)
16  ==8716==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
17  ==8716==    by 0x8048368: (within /home/arnold/progex/code/ch15/ch15-badmem1)
18  ==8716==
19  ==8716== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
20  ==8716== malloc/free: in use at exit: 30 bytes in 1 blocks.
21  ==8716== malloc/free: 1 allocs, 0 frees, 30 bytes allocated.
22  ==8716== For a detailed leak analysis, rerun with: --leak-check=yes
23  ==8716== For counts of detected errors, rerun with: -v

(Line numbers in the output were added to aid in the discussion.) Line 8 is the output from the program; the others are all from Valgrind, on standard error. The error report is on lines 9–17. It indicates how many bytes were incorrectly written (line 9), where this happened (line 10), and a stack trace. Lines 13–17 describe where the memory was allocated from. Lines 19–23 provide a summary.

The -f option to ch15-badmem1 frees the allocated memory and then writes into it through a dangling pointer. Here is what Valgrind reports for this case:

$ valgrind ch15-badmem1 -f
==8719== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
...
p = <not 30 bytes>
==8719== Invalid write of size 1
==8719==    at 0x8048498: main (ch15-badmem1.c:21)
==8719==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
==8719==    by 0x8048368: (within /home/arnold/progex/code/ch15/ch15-badmem1)
==8719==    Address 0x41048024 is 0 bytes inside a block of size 30 free'd
==8719==    at 0x40025722: free (vg_replace_malloc.c:220)
==8719==    by 0x8048491: main (ch15-badmem1.c:20)
==8719==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
==8719==    by 0x8048368: (within /home/arnold/progex/code/ch15/ch15-badmem1)
...

This time the report indicates that the write was to freed memory and that the call to free() is on line 20 of ch15-badmem1.c.

When called with no options, ch15-badmem1 allocates memory and uses it but does not release it. The --leak-check=yes option reports this case:

$ valgrind --leak-check=yes ch15-badmem1
 1  ==8720== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
    ...
 8  p = <not 30 bytes>
 9  ==8720==
10  ==8720== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
11  ==8720== malloc/free: in use at exit: 30 bytes in 1 blocks.
12  ==8720== malloc/free: 1 allocs, 0 frees, 30 bytes allocated.
    ...
16  ==8720==
17  ==8720== 30 bytes in 1 blocks are definitely lost in loss record 1 of 1
18  ==8720==    at 0x40025488: malloc (vg_replace_malloc.c:153)
19  ==8720==    by 0x8048411: main (ch15-badmem1.c:11)
20  ==8720==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
21  ==8720==    by 0x8048368: (within /home/arnold/progex/code/ch15/ch15-badmem1)
22  ==8720==
23  ==8720== LEAK SUMMARY:
24  ==8720==    definitely lost: 30 bytes in 1 blocks.
25  ==8720==    possibly lost:   0 bytes in 0 blocks.
26  ==8720==    still reachable: 0 bytes in 0 blocks.
27  ==8720==         suppressed: 0 bytes in 0 blocks.
28  ==8720== Reachable blocks (those to which a pointer was found) are not shown.
29  ==8720== To see them, rerun with: --show-reachable=yes

Lines 17–29 provide the leak report; the leaked memory was allocated on line 11 of ch15-badmem1.c.

Besides giving reports on misuses of dynamic memory, Valgrind can diagnose uses of uninitialized memory. Consider the following program, ch15-badmem3.c:

 1   /* ch15-badmem3.c --- do bad things with nondynamic memory */
 2
 3   #include <stdio.h>
 4   #include <stdlib.h>
 5
 6   int main(int argc, char **argv)
 7   {
 8       int a_var; /* Both of these are uninitialized */
 9       int b_var;
10
11       /* Valgrind won't flag this; see text. */
12       a_var = b_var;
13
14       /* Use uninitialized memory; this is flagged. */
15       printf("a_var = %d\n", a_var);
16
17       return 0;
18   }

When run, Valgrind produces this (abbreviated) report:

==29650== Memcheck, a.k.a. Valgrind, a memory error detector for x86-linux.
...
==29650== Use of uninitialised value of size 4
==29650==    at 0x42049D2A: _IO_vfprintf_internal (in /lib/i686/libc-2.2.93.so)
==29650==    by 0x420523C1: _IO_printf (in /lib/i686/libc-2.2.93.so)
==29650==    by 0x804834D: main (ch15-badmem3.c:15)
==29650==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
==29650==
==29650== Conditional jump or move depends on uninitialised value(s)
==29650==    at 0x42049D32: _IO_vfprintf_internal (in /lib/i686/libc-2.2.93.so)
==29650==    by 0x420523C1: _IO_printf (in /lib/i686/libc-2.2.93.so)
==29650==    by 0x804834D: main (ch15-badmem3.c:15)
==29650==    by 0x420158D3: __libc_start_main (in /lib/i686/libc-2.2.93.so)
...
a_var = 1107341000
==29650==
==29650== ERROR SUMMARY: 25 errors from 7 contexts (suppressed: 0 from 0)
==29650== malloc/free: in use at exit: 0 bytes in 0 blocks.
==29650== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==29650== For a detailed leak analysis, rerun with: --leak-check=yes
==29650== For counts of detected errors, rerun with: -v

The Valgrind documentation explains that copying of uninitialized data doesn’t produce any reports. The memcheck skin notes the status of the data (uninitialized) and keeps track of it as data are moved around. Thus, a_var is considered uninitialized, since its value came from b_var, which started out uninitialized.

It is only when an uninitialized value is used that memcheck reports a problem. Here, the use occurs down in the C library (_IO_vfprintf_internal()), which has to convert the value to a string; to do so it does a computation with it.

Unfortunately, although Valgrind can detect the use of uninitialized memory, all the way down to the bit level, it cannot do array bounds checking for local and global variables. (Valgrind can do bounds checking for dynamic memory since it handles such memory itself and therefore knows the start and end of each region.)

In conclusion, Valgrind is a powerful memory debugging tool. It has been used on large-scale, multithreaded production programs such as KDE 3, OpenOffice, and the Konqueror web browser. It rivals several commercial offerings, and a variant version has even been used (together with the WINE Emulator[20]) to debug programs written for Microsoft Windows, using Visual C++! You can get Valgrind from its web site.[21]

Other Malloc Debuggers

Two articles by Cal Erickson in Linux Journal describe mtrace and dmalloc, as well as most of the other tools listed below. These articles are “Memory Leak Detection in Embedded Systems”, Issue 101,[22] September 2002, and “Memory Leak Detection in C++”, Issue 110,[23] June 2003. Both articles are available on the Linux Journal web site.

The other tools are similar in nature to those described earlier.

ccmalloc

Mark Moraes’s malloc

mpatrol

memwatch

njamd

  • Not Just Another Malloc Debugger. This library doesn’t require special linking of the application; instead, it uses LD_PRELOAD to replace standard routines. See http://sourceforge.net/projects/njamd.

yamd

  • Similar to Electric Fence, but with many more options. See http://www3.hmc.edu/~neldredge/yamd.

Almost all of these packages use environment variables to tune their behavior. Based on the Linux Journal articles, Table 15.1 summarizes the features of the different packages.

Table 15.1. Memory tool features summary

Tool        OS              Header      Module/Program    Thread safe
ccmalloc    Multivendor     No          Program           No
dmalloc     Multivendor     Optional    Program           Yes
efence      Multivendor     No          Program           No
memwatch    Multivendor     Yes         Program           No
Moraes      Multivendor     Optional    Program           No
mpatrol     Multivendor     No          Program           Yes
mtrace      Linux(GLIBC)    Yes         Module            No
njamd       Multivendor     No          Program           No
valgrind    Linux(GLIBC)    No          Program           Yes
yamd        Linux, DJGPP    No          Program           No

As is clear, a range of choices is available for debugging dynamic memory problems. On GNU/Linux and BSD systems, one or more of these tools are likely to already be installed, saving you the trouble of downloading and building them.

It is also useful to use multiple tools in succession on your program. For example, mtrace to catch unfreed memory, and Electric Fence to catch invalid memory accesses.

A Modern lint

In Original C, the compiler couldn’t check whether the parameters passed in a function call matched the parameter list in the function’s definition; there were no prototypes. This often led to subtle bugs, since a bad function call might produce only mildly erroneous results that went unnoticed during testing, or the call might never be executed at all during testing. For example:

if (argc < 2)
    fprintf("usage: %s [ options ] files\n", argv[0]);       stderr is missing

If a program containing this fragment is never invoked with the wrong number of arguments, the fprintf(), which is missing the initial FILE * argument, is never called.

The V7 lint program was designed to solve such problems. It made two passes over all the files in a program, first collecting information about function arguments and then comparing function calls to the gathered information. Special “lint library” files provided information about the standard library functions so that they could be checked as well. lint also checked other questionable constructs.

With prototypes in Standard C, the need for lint is decreased, but not eliminated, since C89 still allows old-style function declarations:

extern int some_func();             Argument list unknown

Additionally, many other aspects of a program can be checked statically, that is, by analysis of the source code text.

The splint program (Secure Programming Lint[24]) is a modern lint replacement. It provides too many options and facilities to list here, but is worth investigating.

One thing to be aware of is that lint-like programs can produce a flood of warning messages. Many of the reported warnings are really harmless. In such cases, the tools allow you to provide special comments that indicate “yes, I know about this, it’s not a problem.” splint works best when you provide lots of such annotations in your code.

splint is a powerful but complicated tool; spending some time learning how to use it and then using it frequently will help you keep your code clean.

Software Testing

Software development contains elements of both art and science; this is one aspect of what makes it such a fascinating and challenging profession. This section introduces the topic of software testing, which also involves both art and science; thus, it is somewhat more general and higher level (read: “handwavy”) than the rest of this chapter.

Software testing is an integral part of the software development process. It is very unusual for a program to work 100 percent correctly the first time it compiles. The program isn’t responsible for being correct; the author of the program is. One of the most important ways to verify that a program functions the way it’s supposed to is to test it.

One way to break down the different kinds of testing is as follows:

Unit tests

  • These are tests you write for each separate unit or functional component of your program. As part of this effort, you may also need to write scaffolding—code designed to provide enough supporting framework to run the unit as a standalone program.

  • It is important to design the tests for each functional component when you design the component. Doing so helps you clarify the feature design; knowing how you’ll test it helps you define what it should and shouldn’t do in the first place.

Integration tests

  • These are tests you apply when all the functional components have been written, tested, and debugged individually. The idea is that everything is then hooked into place in the overall framework and the whole thing is tested to make sure that the interactions between the components are working.

Regression tests

  • Inevitably, you (or your users!) will discover problems. These may be real bugs, or design limitations, or failures in weird “corner cases.” Once you’ve been able to reproduce and fix the problem, keep the original failing case as a regression test.

  • A regression test lets you make sure that when you make changes, you haven’t reintroduced an old problem. (This can happen easily.) By running a program through its test suite after making a change, you can be (more) confident that everything is working the way it’s supposed to.

Testing should be automated as much as possible. This is particularly easy to do for non-GUI programs written in the style of the Linux/Unix tools: programs that read standard input or named files, and write to standard output and standard error. At the very least, testing can be done with simple shell scripts. More involved testing is usually done with a separate test subdirectory and the make program.

Software testing is a whole subfield in itself, and we don’t expect to do it justice here; rather, our point is to make you aware that testing is an integral part of development and often the motivating factor for using your debugging skills! Here is a very brief summary list:

  • Design the test along with the feature.

  • Test boundary conditions: Make sure the feature works both inside and at valid boundaries and that it fails correctly outside them. (For example, the sqrt() function has to fail when given a negative argument.)

  • Use assertions in your code (see Section 12.1, “Assertion Statements: assert()”, page 428), and run your tests with the assertions enabled.

  • Create and reuse test scaffolding.

  • Save failure cases for regression testing.

  • Automate testing as much as possible.

  • Print a count of failed tests so that success or failure, and the degree of failure, can be determined easily.

  • Use code coverage tools such as gcov to verify that your test suite exercises all of your code.

  • Test early and test often.

  • Study software-testing literature to improve your ability to develop and test software.

Debugging Rules

Debugging isn’t a “black art.” Its principles and techniques can be learned, and consistently applied, by anyone. To this end, we highly recommend the book Debugging by David J. Agans (ISBN: 0-8144-7168-4). The book has a web site[25] that summarizes the rules and provides a downloadable poster for you to print and place on your office wall.

To round off our discussion, we present the following material. It was adapted by David Agans, by permission, from Debugging, Copyright © 2002 David J. Agans, published by AMACOM,[26] a division of American Management Association, New York, New York. We thank him.

  1. Understand the system. When all else fails, read the manual. You have to know what the troubled system and all of its parts are supposed to do, if you want to figure out why they don’t. So read any and all documentation you can get your hands (or browser) on.

    Knowing where functional blocks and data paths are, and how they interact, gives you a roadmap for failure isolation. Of course, you also have to know your domain (language, operating system, application) and your tools (compiler, source code debugger).

  2. Make it fail. In order to see the bug, you have to be able to make the failure occur consistently. Document your procedures and start from a known state, so that you can always make it fail again. Look at the bug on the system that fails, don’t try to simulate the problem on another system. Don’t trust statistics on intermittent problems; they will hide the bug more than they will expose it. Rather, try to make it consistent by varying inputs, initial conditions, and timing.

    If it’s still intermittent, you have to make it look like it’s not. Capture in a log every bit of information you can, during every run; then when you have some bad runs and some good runs, compare them to each other. If you’ve captured enough data you’ll be able to home in on the problem as if you could make it fail every time. Being able to make it fail every time also means you’ll be able to tell when you’ve fixed it.

  3. Quit thinking and look. There are more ways for something to fail than you can possibly imagine. So don’t imagine what could be happening, look at it—put instrumentation on the system so you can actually see the failure mechanism. Use whatever instrumentation you can—debuggers, printf()s, assert()s, logic analyzers, and even LEDs and beepers. Look at it deeply enough until the bug is obvious to the eye, not just to the brain.

    If you do guess, use the guess only to focus the search—don’t try to fix it until you can see it. If you have to add instrumentation code, do it, but be sure to start with the same code base as the failing system, and make sure it still fails with your added code running. Often, adding the debugger makes it stop failing (that’s why they call it a debugger).

  4. Divide and conquer. Everybody knows this one. You do a successive approximation—start at one end, jump halfway, see which way the error is from there, and jump half again in that direction. Binary search, you’re there in a few jumps. The hard part is knowing whether you’re past the bug or not. One helpful trick is to put known, simple data into the system, so that trashed data is easier to spot. Also, start at the bad end and work back toward the good: there are too many good paths to explore if you start at the good end. Fix the bugs you know about right away, since sometimes two bugs interact (though you’d swear they can’t), and successive approximation doesn’t work with two target values.

  5. Change one thing at a time. If you’re trying to improve a stream-handling module and you simultaneously upgrade to the next version of the operating system, it doesn’t matter whether you see improvement, degradation, or no change—you will have no idea what effect your individual changes had. The interaction of multiple changes can be unpredictable and confusing. Don’t do it. Change one thing at a time, so you can bet that any difference you see as a result came from that change.

    If you make a change and it seems to have no effect, back it out immediately. It may have had some effects that you didn’t see, and those may show up in combination with other changes. This goes for changes in testing as well as in coding.

  6. Keep an audit trail. Much of the effectiveness of the above rules depends on keeping good records. In all aspects of testing and debugging, write down what you did, when you did it, how you did it, and what happened as a result. Do it electronically if possible, so that the record can be emailed and attached to the bug database. Many a clue is found in a pattern of events that would not be noticed if it wasn’t recorded for all to see and compare. And the clue is likely to be in the details that you didn’t think were important, so write it all down.

  7. Check the plug. Everyone has a story about some problem that turned out to be “it wasn’t plugged in.” Sometimes it’s literally unplugged, but in software, “unplugged” can mean a missing driver or an old version of code you thought you replaced. Or bad hardware when you swear it’s a software problem. One story had the hardware and software engineers pointing fingers at each other, and it was neither: The test device they were using was not up to spec. The bottom line is that sometimes you’re looking for a problem inside a system, when in fact the problem is outside the system, or underlying the system, or in the initialization of the system, or you’re not looking at the right system.

    Don’t necessarily trust your tools, either. The tool vendors are engineers, too; they have bugs, and you may be the one to find them.

  8. Get a fresh view. There are three reasons to ask for help while debugging.

    The first reason is to get fresh insight—another person will often see something just because they aren’t caught up in it like you are. The second reason is to tap expertise—they know more about the system than you do. The third reason is to get experience—they’ve seen this one before.

    When you describe the situation to someone, report the symptoms you’ve seen, not your theories about why it’s acting that way. You went to them because your theories aren’t getting you anywhere—don’t pull them down into the same rut you’re stuck in.

  9. If you didn’t fix it, it ain’t fixed. So you think it’s fixed? Prove it. Since you were able to make it fail consistently, set up the same situation and make sure it doesn’t fail. Don’t assume that just because the problem was obvious, it’s all fixed now. Maybe it wasn’t so obvious. Maybe your fix wasn’t done right. Maybe your fix isn’t even in the new release! Test it! Make it not fail.

    Are you sure your code is what fixed it? Or did the test change, or did some other code get in there? Once you see that your fix works, take the fix out and make it fail again. Then put the fix back in and see that it doesn’t fail. This step assures you that it was really your fix that solved the problem.

More information about the book Debugging and a free downloadable debugging rules poster can be found at http://www.debuggingrules.com.

Suggested Reading

The following books are excellent, with much to say about both testing and debugging. All but the first relate to programming in general. They’re all worth reading.

  1. Debugging, David J. Agans. AMACOM, New York, New York, USA 2003. ISBN: 0-8144-7168-4.

    We highly recommend this book. Its tone is light, and amazing as it sounds, it’s fun reading!

  2. Programming Pearls, 2nd edition, by Jon Louis Bentley. Addison-Wesley, Reading, Massachusetts, USA, 2000. ISBN: 0-201-65788-0. See also this book’s web site.[27]

    Chapter 5 of this book gives a good discussion of unit testing and building test scaffolding.

  3. Literate Programming, by Donald E. Knuth. Center for the Study of Language and Information (CSLI), Stanford University, USA, 1992. ISBN: 0-9370-7380-6.

    This fascinating book contains a number of articles by Donald Knuth on literate programming—a programming technique he invented, and used for the creation of TEX and Metafont. Of particular interest is the article entitled “The Errors of TEX,” which describes how he developed and debugged TEX, including his log of all the problems found and fixed.

  4. Writing Solid Code, by Steve Maguire. Microsoft Press, Redmond, Washington, USA, 1993. ISBN: 1-55615-551-4.

  5. Code Complete: A Practical Handbook of Software Construction, by Steve McConnell. Microsoft Press, Redmond, Washington, USA, 1994. ISBN: 1-55615-484-4.

  6. The Practice of Programming, by Brian W. Kernighan and Rob Pike. Addison-Wesley, Reading, Massachusetts, USA, 1999. ISBN: 0-201-61585-X.

Summary

  • Debugging is an important part of software development. Good design and development practices should be used to minimize the introduction of bugs, but debugging will always be with us.

  • Programs should be compiled without optimization and with debugging symbols included to make debugging under a debugger more straightforward. On many systems, compiling with optimization and compiling with debugging symbols are mutually exclusive. This is not true of GCC, which accepts both at once; the GNU/Linux developer therefore needs to be aware of the issue, since single-stepping through optimized code often doesn’t track the source.

  • The GNU debugger GDB is standard on GNU/Linux systems and can be used on just about any commercial Unix system as well. (Graphical debuggers based on GDB are also available and easily portable.) Breakpoints, watchpoints, and single-stepping with next, step, and cont provide basic control over a program as it’s running. GDB also lets you examine data and call functions within the debuggee.

  • There are many things you can do when writing your program to make it easier when you inevitably have to debug it. We covered the following topics:

    • Debugging macros for printing state.

    • Avoiding expression macros.

    • Reordering code to make single-stepping easier.

    • Writing helper functions for use from a debugger.

    • Avoiding unions.

    • Having runtime debugging code in the production version of a program and having different ways to enable that code’s output.

    • Adding dummy functions to make breakpoints easier to set.

  • A number of tools and libraries besides just general-purpose debuggers exist to help with debugging. The dbug library provides a nice internal debugger that uses many of the techniques we described, in a consistent, coherent way.

  • Multiple dynamic memory debugging libraries exist, with many similar features. We looked at three of them (mtrace, Electric Fence, and dmalloc), and provided pointers to several others. The Valgrind program goes further, finding problems related to uninitialized memory, not just dynamic memory.

  • splint is a modern alternative to the venerable V7 lint program. It is available on at least one vendor’s GNU/Linux system and can be easily downloaded and built from source.

  • Besides debugging tools, software testing is also an integral part of the software development process. It should be understood, planned for, and managed from the beginning of any software development project, even personal ones.

  • Debugging is a skill that can be learned. We recommend reading the book Debugging by David J. Agans and learning to apply his rules.

Exercises

  1. Compile one of your programs with GCC, using both -g and -O. Run it under GDB, setting a breakpoint in main(). Single-step through the program, and see how closely execution relates (or doesn’t relate) to the original source code. This is particularly good to do with code using a while or for loop.

  2. Read up on GDB’s conditional breakpoint feature. How does that simplify dealing with problems that occur only after a certain number of operations have been done?

  3. Rewrite the parse_debug() function from Section 15.4.2.1, “Add Debugging Options and Variables,” page 595, to use a table of debugging option strings, flag values, and string lengths.

  4. (Hard.) Study the gawk source code, in particular the NODE structure in awk.h. Write a debugging helper function that prints the contents of a NODE based on the value in the type field.

  5. Take one of your programs and modify it to use the dbug library. Compile it first without -DDBUG, to make sure it compiles and works OK. (Do you have a regression test suite for it? Did your program pass all the tests?)

    Once you’re sure that adding the dbug library has not broken your program, recompile it with -DDBUG. Does your program still pass all its tests? What is the performance difference with the library enabled and disabled?

    Run your test suite with the -#t option to see the function-call trace. Do you think this will help you in the future when you have to do debugging? Why or why not?

  6. Run one of your programs that uses dynamic memory with Electric Fence or one of the other dynamic memory testers. Describe the problems, if any, that you found.

  7. Rerun the same program, using Valgrind with leak checking enabled. Describe the problems, if any, that you found.

  8. Design a set of tests for the mv program. (Read mv(1): make sure you cover all its options.)

  9. Search on the Internet for software testing resources. What interesting things did you find?



[1] Compiler optimizations are a notorious scapegoat for logic bugs. In the past, finger-pointing at the compiler was more justified. In our experience, using modern systems and compilers, it is very unusual to find a case in which compiler optimization introduces bugs into working code.

[2] We’re speaking of the original BSD dbx. We have used GDB exclusively for well over a decade.

[3] ddd comes with many GNU/Linux systems. The source code is available from the GNU Project’s FTP site for ddd (ftp://ftp.gnu.org/gnu/ddd/).

[4] http://sources.redhat.com/insight/

[7] See sysctl(8) if you wish to change this behavior.

[8] Bjarne Stroustrup, the creator of C++, worked hard to make the use of the C preprocessor completely unnecessary in C++. In our opinion, he didn’t quite succeed: #include is still needed, but regular macros aren’t. For C, the preprocessor remains a valuable tool, but it should be used judiciously.

[9] Seriously! People often run megabytes of data through gawk. Remember, no arbitrary limits!

[10] We inherited this design. In general it works, but it does have its problems. The point of this section is to pass on the experience we’ve acquired working with unions.

[11] Again, GCC 3.1 or newer and GDB 5 can let you use macros directly, but only if you’re using them together, with specific options. This was described earlier, in Section 15.4.1.2, “Avoid Expression Macros If Possible,” page 580.

[12] This part of the code has since been revised, and the example lines are no longer there.

[13] By rewriting the buffer management code!

[15] C99, which allows variable declarations mixed with executable code, makes this less of a problem, but remember that this package was designed for K&R C.

[16] Although we should have learned our lesson after the first one, we went to a second one. Since then we’ve figured it out and generally avoid startup companies. Your mileage may vary, of course.

[17] ftp://ftp.perens.com/pub/ElectricFence

[19] Increasingly, GNU/Linux is being used for high-end product deployment, too!
