Chapter 1. Introduction

In this chapter

  • 1.1 The Linux/Unix File Model page 4

  • 1.2 The Linux/Unix Process Model page 10

  • 1.3 Standard C vs. Original C page 12

  • 1.4 Why GNU Programs Are Better page 14

  • 1.5 Portability Revisited page 19

  • 1.6 Suggested Reading page 20

  • 1.7 Summary page 21

  • Exercises page 22

If there is one phrase that summarizes the primary GNU/Linux (and therefore Unix) concepts, it’s “files and processes.” In this chapter we review the Linux file and process models. These are important to understand because the system calls are almost all concerned with modifying some attribute or part of the state of a file or a process.

Next, because we’ll be examining code in both styles, we briefly review the major difference between 1990 Standard C and Original C. Finally, we discuss at some length what makes GNU programs “better,” programming principles that we’ll see in use in the code.

This chapter contains a number of intentional simplifications. The full details are covered as we progress through the book. If you’re already a Linux wizard, please forgive us.

The Linux/Unix File Model

One of the driving goals in the original Unix design was simplicity. Simple concepts are easy to learn and use. When the concepts are translated into simple APIs, simple programs are then easy to design, write, and get correct. In addition, simple code is often smaller and more efficient than more complicated designs.

The quest for simplicity was driven by two factors. From a technical point of view, the original PDP-11 minicomputers on which Unix was developed had a small address space: 64 kilobytes total on the smaller systems, and 64K of code plus 64K of data on the larger ones. These restrictions applied not just to regular programs (so-called user-level code), but to the operating system itself (kernel-level code). Thus, not only “Small Is Beautiful” aesthetically, but “Small Is Beautiful” because there was no other choice!

The second factor was a negative reaction to contemporary commercial operating systems, which were needlessly complicated, with obtuse command languages, multiple kinds of file I/O, and little generality or symmetry. (Steve Johnson once remarked that “Using TSO is like trying to kick a dead whale down a beach.” TSO is one of the obtuse mainframe time-sharing systems just described.)

Files and Permissions

The Unix file model is as simple as it gets: A file is a linear stream of bytes. Period. The operating system imposes no preordained structure on files: no fixed or varying record sizes, no indexed files, nothing. The interpretation of file contents is entirely up to the application. (This isn’t quite true, as we’ll see shortly, but it’s close enough for a start.)

Once you have a file, you can do three things with the file’s data: read them, write them, or execute them.

Unix was designed for time-sharing minicomputers; this implies a multiuser environment from the get-go. Once there are multiple users, it must be possible to specify a file’s permissions: Perhaps user jane is user fred’s boss, and jane doesn’t want fred to read the latest performance evaluations.

For file permission purposes, users are classified into three distinct categories: user: the owner of a file; group: the group of users associated with this file (discussed shortly); and other: anybody else. For each of these categories, every file has separate read, write, and execute permission bits associated with it, yielding a total of nine permission bits. This shows up in the first field of the output of ’ls -l’:

$ ls -l progex.texi
-rw-r--r--    1 arnold   devel          5614 Feb 24 18:02 progex.texi

Here, arnold and devel are the owner and group of progex.texi, and -rw-r--r-- are the file type and permissions. The first character is a dash for a regular file, a d for a directory, or one of a small set of other characters for other kinds of files that aren’t important at the moment. Each subsequent group of three characters represents read, write, and execute permission for the owner, group, and “other,” respectively.

In this example, progex.texi is readable and writable by the owner, and readable by the group and other. The dashes indicate absent permissions, thus the file is not executable by anyone, nor is it writable by the group or other.

The owner and group of a file are stored as numeric values known as the user ID (UID) and group ID (GID); standard library functions that we present later in the book make it possible to print the values as human-readable names.

A file’s owner can change the permissions by using the chmod (change mode) command. (As such, file permissions are sometimes referred to as the “file mode.”) A file’s group can be changed with the chgrp (change group) command, and its owner with the chown (change owner) command.[1]

Group permissions were intended to support cooperative work: Although one person in a group or department may own a particular file, perhaps everyone in that group needs to be able to modify it. (Consider a collaborative marketing paper or data from a survey.)

When the system goes to check a file access (usually upon opening a file), if the UID of the process matches that of the file, the owner permissions apply. If those permissions deny the operation (say, a write to a file with -r--rw-rw- permissions), the operation fails; Unix and Linux do not proceed to test the group and other permissions.[2] The same is true if the UID is different but the GID matches; if the group permissions deny the operation, it fails.
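In code, that checking order might be sketched as follows. This is purely illustrative and deliberately simplified; the real kernel check also considers the supplementary group set, the superuser, and other details covered later in the book:

/* may_read --- sketch of the access-check order described above.
   Note that the categories are tested in order, with no fallthrough. */
#include <sys/types.h>
#include <sys/stat.h>

int may_read(uid_t uid, gid_t gid, uid_t file_uid, gid_t file_gid, mode_t mode)
{
    if (uid == file_uid)                 /* owner: only the owner bits apply */
        return (mode & S_IRUSR) != 0;
    if (gid == file_gid)                 /* group: only the group bits apply */
        return (mode & S_IRGRP) != 0;
    return (mode & S_IROTH) != 0;        /* everybody else */
}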

Unix and Linux support the notion of a superuser: a user with special privileges. This user is known as root and has the UID of 0. root is allowed to do anything; all bets are off, all doors are open, all drawers unlocked.[3] (This can have significant security implications, which we touch on throughout the book but do not cover exhaustively.) Thus, even if a file is mode ----------, root can still read and write the file. (One exception is that the file can’t be executed. But as root can add execute permission, the restriction doesn’t prevent anything.)

The user/group/other, read/write/execute permissions model is simple, yet flexible enough to cover most situations. Other, more powerful but more complicated, models exist and are implemented on different systems, but none of them are well enough standardized and broadly enough implemented to be worth discussing in a general-purpose text like this one.

Directories and Filenames

Once you have a file, you need someplace to keep it. This is the purpose of the directory (known as a “folder” on Windows and Apple Macintosh systems). A directory is a special kind of file, which associates filenames with particular collections of file metadata, known as inodes. Directories are special because they can only be updated by the operating system, by the system calls described in Chapter 4, “Files and File I/O,” page 83. They are also special in that the operating system dictates the format of directory entries.

Filenames may contain any valid 8-bit byte except the / (forward slash) character and ASCII NUL, the character whose bits are all zero. Early Unix systems limited filenames to 14 bytes; modern systems allow individual filenames to be up to 255 bytes.

The inode contains all the information about a file except its name: the type, owner, group, permissions, size, modification and access times. It also stores the locations on disk of the blocks containing the file’s data. All of these are data about the file, not the file’s data itself, thus the term metadata.
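As a preview of facilities described later in the book, here is a small sketch that uses stat() to retrieve a file’s metadata and getpwuid() and getgrgid() to turn the numeric UID and GID into names:

/* showmeta --- print some of a file's metadata (illustrative sketch) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <pwd.h>
#include <grp.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct stat sb;
    struct passwd *pw;
    struct group *gr;

    if (argc != 2 || stat(argv[1], &sb) < 0) {
        fprintf(stderr, "usage: showmeta file\n");
        return 1;
    }
    pw = getpwuid(sb.st_uid);            /* may return NULL */
    gr = getgrgid(sb.st_gid);            /* may return NULL */
    printf("owner: %s (%ld)\n", pw ? pw->pw_name : "?", (long) sb.st_uid);
    printf("group: %s (%ld)\n", gr ? gr->gr_name : "?", (long) sb.st_gid);
    printf("mode:  %o\n", (unsigned int) (sb.st_mode & 07777));
    printf("size:  %ld bytes\n", (long) sb.st_size);
    printf("mtime: %s", ctime(&sb.st_mtime));
    return 0;
}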

Directory permissions have a slightly different meaning from those for files. Read permission is the ability to list the directory’s contents; that is, to see what files it contains. Write permission is the ability to create and remove files in the directory. Execute permission (often called search permission) is the ability to go through the directory when opening or otherwise accessing a contained file or subdirectory.

Note

If you have write permission on a directory, you can remove files in that directory, even if they don’t belong to you! When used interactively, the rm command notices this, and asks you for confirmation in such a case.

The /tmp directory has write permission for everyone, but your files in /tmp are quite safe because /tmp usually has the so-called sticky bit set on it:

$ ls -ld /tmp
drwxrwxrwt   11 root      root          4096 May 15 17:11 /tmp

Note the t in the last position of the first field. On most directories this position has an x in it. With the sticky bit set, only you, as the file’s owner, or root may remove your files. (We discuss this in more detail in Section 11.5.2, “Directories and the Sticky Bit,” page 414.)

Executable Files

Remember we said that the operating system doesn’t impose a structure on files? Well, we’ve already seen that that was a white lie when it comes to directories. It’s also the case for binary executable files. To run a program, the kernel has to know what part of a file represents instructions (code) and what part represents data. This leads to the notion of an object file format, which is the definition for how these things are laid out within a file on disk.

Although the kernel will only run a file laid out in the proper format, it is up to user-level utilities to create these files. The compiler for a programming language (such as Ada, Fortran, C, or C++) creates object files, and then a linker or loader (usually named ld) binds the object files with library routines to create the final executable. Note that even if a file has all the right bits in all the right places, the kernel won’t run it if the appropriate execute permission bit isn’t turned on (or at least one execute bit for root).

Because the compiler, assembler, and loader are user-level tools, it’s (relatively) easy to change object file formats as needs develop over time; it’s only necessary to “teach” the kernel about the new format and then it can be used. The part of the kernel that loads executables is relatively small, so this isn’t an impossible task. Thus, Unix object file formats have evolved over time. The original format was known as a.out (Assembler OUTput). The next format, still used on some commercial systems, is COFF (Common Object File Format), and the current, most widely used format is ELF (Executable and Linking Format). Modern GNU/Linux systems use ELF.

The kernel recognizes that an executable file contains binary object code by looking at the first few bytes of the file for special magic numbers. These are sequences of two or four bytes that the kernel recognizes as being special. For backwards compatibility, modern Unix systems recognize multiple formats. ELF files begin with the four bytes "\177ELF" (the byte whose octal value is 177, followed by the letters ELF).

Besides binary executables, the kernel also supports executable scripts. Such a file also begins with a magic number: in this case, the two regular characters #!. A script is a program executed by an interpreter, such as the shell, awk, Perl, Python, or Tcl. The #! line provides the full path to the interpreter and, optionally, one single argument:

#! /bin/awk -f

BEGIN { print "hello, world" }

Let’s assume the above contents are in a file named hello.awk and that the file is executable. When you type ’hello.awk’, the kernel runs the program as if you had typed ’/bin/awk -f hello.awk’. Any additional command-line arguments are also passed on to the program. In this case, awk runs the program and prints the universally known hello, world message.

The #! mechanism is an elegant way of hiding the distinction between binary executables and script executables. If hello.awk is renamed to just hello, the user typing ’hello’ can’t tell (and indeed shouldn’t have to know) that hello isn’t a binary executable program.
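Both kinds of magic numbers can be inspected from user code as well. The following sketch mimics the kernel’s test just to show what such a check looks like; the real work happens inside the kernel:

/* whatisit --- guess a file type from its magic number (sketch only) */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    unsigned char buf[4];
    size_t n;
    FILE *fp;

    if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
        fprintf(stderr, "usage: whatisit file\n");
        return 1;
    }
    n = fread(buf, 1, sizeof buf, fp);
    if (n == 4 && memcmp(buf, "\177ELF", 4) == 0)
        printf("%s: ELF executable\n", argv[1]);
    else if (n >= 2 && buf[0] == '#' && buf[1] == '!')
        printf("%s: executable script\n", argv[1]);
    else
        printf("%s: something else\n", argv[1]);
    fclose(fp);
    return 0;
}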

Devices

One of Unix’s most notable innovations was the unification of file I/O and device I/O.[4] Devices appear as files in the filesystem, regular permissions apply to their access, and the same I/O system calls are used for opening, reading, writing, and closing them. All of the “magic” to make devices look like files is hidden in the kernel. This is just another aspect of the driving simplicity principle in action: We might phrase it as no special cases for user code.

Two devices appear frequently in everyday use, particularly at the shell level: /dev/null and /dev/tty.

/dev/null is the “bit bucket.” All data sent to /dev/null is discarded by the operating system, and attempts to read from it always return end-of-file (EOF) immediately.
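For instance, the same open(), write(), and read() calls used on regular files work unchanged on /dev/null. The following small sketch writes some data (which vanishes) and then sees an immediate end-of-file:

/* nullplay --- /dev/null behaves like any other file (sketch) */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    int fd = open("/dev/null", O_RDWR);  /* opened like any other file */

    if (fd < 0) {
        perror("/dev/null");
        return 1;
    }
    write(fd, "discard me\n", 11);       /* the data are silently thrown away */
    printf("read returned %ld\n", (long) read(fd, buf, sizeof buf));  /* 0 means EOF */
    close(fd);
    return 0;
}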

/dev/tty is the process’s current controlling terminal—the one to which it listens when a user types the interrupt character (typically CTRL-C) or performs job control (CTRL-Z).

GNU/Linux systems, and many modern Unix systems, supply /dev/stdin, /dev/stdout, and /dev/stderr devices, which provide a way to name the open files each process inherits upon startup.

Other devices represent real hardware, such as tape and disk drives, CD-ROM drives, and serial ports. There are also software devices, such as pseudo-ttys, that are used for networking logins and windowing systems. /dev/console represents the system console, a particular hardware device on minicomputers. On modern computers, /dev/console is the screen and keyboard, but it could be a serial port.

Unfortunately, device-naming conventions are not standardized, and each operating system has different names for tapes, disks, and so on. (Fortunately, that’s not an issue for what we cover in this book.) Devices have either a b or c in the first character of ’ls -l’ output:

$ ls -l /dev/tty /dev/hda
brw-rw----    1 root     disk       3,   0 Aug 31 02:31 /dev/hda
crw-rw-rw-    1 root     root       5,   0 Feb 26 08:44 /dev/tty

The initial b represents block devices, and a c represents character devices. Device files are discussed further in Section 5.4, “Obtaining Information about Files,” page 139.

The Linux/Unix Process Model

A process is a running program.[5] Processes have the following attributes:

  • A unique process identifier (the PID)

  • A parent process (with an associated identifier, the PPID)

  • Permission identifiers (UID, GID, groupset, and so on)

  • An address space, separate from those of all other processes

  • A program running in that address space

  • A current working directory (’.’)

  • A current root directory (/; changing this is an advanced topic)

  • A set of open files, directories, or both

  • A permissions-to-deny mask for use in creating new files

  • A set of strings representing the environment

  • A scheduling priority (an advanced topic)

  • Settings for signal disposition (an advanced topic)

  • A controlling terminal (also an advanced topic)

When the main() function begins execution, all of these things have already been put in place for the running program. System calls are available to query and change each of the above items; covering them is the purpose of this book.
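As a small taste of what’s to come, here is a sketch that queries a few of these attributes with calls described later in the book:

/* procinfo --- print a few attributes of the current process (sketch) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char cwd[4096];
    const char *home = getenv("HOME");

    printf("PID:  %ld\n", (long) getpid());
    printf("PPID: %ld\n", (long) getppid());
    printf("UID:  %ld, GID: %ld\n", (long) getuid(), (long) getgid());
    if (getcwd(cwd, sizeof cwd) != NULL)
        printf("cwd:  %s\n", cwd);
    printf("HOME: %s\n", home != NULL ? home : "(not set)");
    return 0;
}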

New processes are always created by an existing process. The existing process is termed the parent, and the new process is termed the child. Upon booting, the kernel handcrafts the first, primordial process, which runs the program /sbin/init; it has process ID 1 and serves several administrative functions. All other processes are descendants of init. (init’s parent is the kernel, often listed as process ID 0.)

The child-to-parent relationship is one-to-one; each process has only one parent, and thus it’s easy to find out the PID of the parent. The parent-to-child relationship is one-to-many; any given process can create a potentially unlimited number of children. Thus, there is no easy way for a process to find out the PIDs of all its children. (In practice, it’s not necessary, anyway.) A parent process can arrange to be notified when a child process terminates (“dies”), and it can also explicitly wait for such an event.
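The system calls involved, fork() and wait(), are treated in detail later. The following minimal sketch shows a parent creating one child and waiting for it to die:

/* onechild --- create a child process and wait for it (minimal sketch) */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                  /* one process calls, two return */

    if (pid < 0) {
        perror("fork");
        return 1;
    } else if (pid == 0) {               /* child: fork() returned 0 */
        printf("child:  PID %ld, parent %ld\n",
               (long) getpid(), (long) getppid());
        return 0;
    }
    wait(NULL);                          /* parent: wait for the child to terminate */
    printf("parent: child %ld finished\n", (long) pid);
    return 0;
}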

Each process’s address space (memory) is separate from that of every other. Unless two processes have made explicit arrangement to share memory, one process cannot affect the address space of another. This is important; it provides a basic level of security and system reliability. (For efficiency, the system arranges to share the read-only executable code of the same program among all the processes running that program. This is transparent to the user and to the running program.)

The current working directory is the one to which relative pathnames (those that don’t start with a /) are relative. This is the directory you are “in” whenever you issue a ’cd someplace’ command to the shell.

By convention, all programs start out with three files already open: standard input, standard output, and standard error. These are where input comes from, output goes to, and error messages go to, respectively. In the course of this book, we will see how these are put in place. A parent process can open additional files and have them already available for a child process; the child will have to know they’re there, either by way of some convention or by a command-line argument or environment variable.

The environment is a set of strings, each of the form ’name=value’. Functions exist for querying and setting environment variables, and child processes inherit the environment of their parents. Typical environment variables are things like PATH and HOME in the shell. Many programs look for the existence and value of specific environment variables in order to control their behavior.

It is important to understand that a single process may execute multiple programs during its lifetime. Unless explicitly changed, all of the other system-maintained attributes (current directory, open files, PID, etc.) remain the same. The separation of “starting a new process” from “choosing which program to run” is a key Unix innovation. It makes many operations simple and straightforward. Other operating systems that combine the two operations are less general and more complicated to use.

Pipes: Hooking Processes Together

You’ve undoubtedly used the pipe construct (’|’) in the shell to connect two or more running programs. A pipe acts like a file: One process writes to it using the normal write operation, and the other process reads from it using the read operation. The processes don’t (usually) know that their input/output is a pipe and not a regular file.

Just as the kernel hides the “magic” for devices, making them act like regular files, so too the kernel does the work for pipes, arranging to pause the pipe’s writer when the pipe fills up and to pause the reader when no data is waiting to be read.

The file I/O paradigm with pipes thus acts as a key mechanism for connecting running programs; no temporary files are needed. Again, generality and simplicity at work: no special cases for user code.
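To make the point concrete, here is a sketch in which one process plays both ends of a pipe; the data move through the same read() and write() calls used for regular files. (Real pipelines put the two ends in different processes, as we’ll see later.)

/* selfpipe --- write into a pipe and read the data back (sketch) */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];                          /* fds[0] is the read end, fds[1] the write end */
    char buf[64];
    ssize_t n;

    if (pipe(fds) < 0) {
        perror("pipe");
        return 1;
    }
    write(fds[1], "through the pipe\n", 17);
    n = read(fds[0], buf, sizeof buf);
    if (n > 0)
        fwrite(buf, 1, (size_t) n, stdout);   /* prints: through the pipe */
    close(fds[0]);
    close(fds[1]);
    return 0;
}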

Standard C vs. Original C

For many years, the de facto definition of C was found in the first edition of the book The C Programming Language, by Brian Kernighan and Dennis Ritchie. This book described C as it existed for Unix and on the systems to which the Bell Labs developers had ported it. Throughout this book, we refer to it as “Original C,” although it’s also common for it to be referred to as “K&R C,” after the book’s two authors. (Dennis Ritchie designed and implemented C.)

The 1990 ISO Standard for C formalized the language’s definition, including the functions in the C library (such as printf() and fopen()). The C standards committee did an admirable job of standardizing existing practice and avoided inventing new features, with one notable exception (and a few minor ones). The most visible change in the language was the use of function prototypes, borrowed from C++.

Standard C, C++, and the Java programming language use function prototypes for function declarations and definitions. A prototype describes not only the function’s return value but also the number and type of its arguments. With prototypes, a compiler can do complete type checking at the point of a function call:

extern int myfunc(struct my_struct *a,      Declaration
                  struct my_struct *b,
                  double c, int d);

int myfunc(struct my_struct *a,             Definition
           struct my_struct *b,
           double c, int d)
{
    ...
}

...
struct my_struct s, t;
int j;

...
/* Function call, somewhere else: */
j = myfunc(& s, & t, 3.1415, 42);

This function call is fine. But consider an erroneous call:

j = myfunc(-1, -2, 0);                      Wrong number and types of arguments

The compiler can immediately diagnose this call as being invalid. However, in Original C, functions are declared without the argument list being specified:

extern int myfunc();                        Returns int, arguments unknown

Furthermore, function definitions list the parameter names in the function header, and then declare the parameters before the function body. Parameters of type int don’t have to be declared, and if a function returns int, that doesn’t have to be declared either:

myfunc(a, b, c, d)                          Return type is int
struct my_struct *a, *b;
double c;                                   Note, no declaration of parameter d
{
    ...
}

Consider again the same erroneous function call: ’j = myfunc(-1, -2, 0);’. In Original C, the compiler has no way of knowing that you’ve (accidentally, we assume) passed the wrong arguments to myfunc(). Such erroneous calls generally lead to hard-to-find runtime problems (such as segmentation faults, whereby the program dies), and the Unix lint program was created to deal with these kinds of things.

So, although function prototypes were a radical departure from existing practice, their additional type checking was deemed too important to be without, and they were added to the language with little opposition.

In 1990 Standard C, code written in the original style, for both declarations and definitions, is valid. This makes it possible to continue to compile millions of lines of existing code with a standard-conforming compiler. New code, obviously, should be written with prototypes because of the improved possibilities for compile-time error checking.

1999 Standard C continues to allow original style declarations and definitions. However, the “implicit int” rule was removed; functions must have a return type, and all parameters must be declared.

Furthermore, when a program called a function that had not been formally declared, Original C would create an implicit declaration for the function, giving it a return type of int. 1990 Standard C did the same, additionally noting that it had no information about the parameters. 1999 Standard C no longer provides this “auto-declare” feature.

Other notable additions in Standard C are the const keyword, also from C++, and the volatile keyword, which the committee invented. For the code you’ll see in this book, understanding the different function declaration and definition syntaxes is the most important thing.

For V7 code using original style definitions, we have added comments showing the equivalent prototype. Otherwise, we have left the code alone, preferring to show it exactly as it was originally written and as you’ll see it if you download the code yourself.

Although 1999 C adds some additional keywords and features beyond the 1990 version, we have chosen to stick to the 1990 dialect, since C99 compilers are not yet commonplace. Practically speaking, this doesn’t matter: C89 code should compile and run without change when a C99 compiler is used, and the new C99 features don’t affect our discussion or use of the fundamental Linux/Unix APIs.

Why GNU Programs Are Better

What is it that makes a GNU program a GNU program?[6] What makes GNU software “better” than other (free or non-free) software? The most obvious difference is the GNU General Public License (GPL), which describes the distribution terms for GNU software. But this is usually not the reason you hear people saying “Get the GNU version of xyz, it’s much better.” GNU software is generally more robust, and performs better, than standard Unix versions. In this section we look at some of the reasons why, and at the document that describes the principles of GNU software design.

The GNU Coding Standards describes how to write software for the GNU project. It covers a range of topics. You can read the GNU Coding Standards online at http://www.gnu.org/prep/standards.html. See the online version for pointers to the source files in other formats.

In this section, we describe only those parts of the GNU Coding Standards that relate to program design and implementation.

Program Design

Chapter 3 of the GNU Coding Standards provides general advice about program design. The four main issues are compatibility (with standards and Unix), the language to write in, reliance on nonstandard features of other programs (in a word, “none”), and the meaning of “portability.”

Compatibility with Standard C and POSIX, and to a lesser extent, with Berkeley Unix is an important goal. But it’s not an overriding one. The general idea is to provide all necessary functionality, with command-line switches to provide a strict ISO or POSIX mode.

C is the preferred language for writing GNU software since it is the most commonly available language. In the Unix world, Standard C is now common, but if you can easily support Original C, you should do so. Although the coding standards prefer C over C++, C++ is now commonplace too. One widely used GNU package written in C++ is groff (GNU troff). With GCC supporting C++, it has been our experience that installing groff is not difficult.

The standards state that portability is a bit of a red herring. GNU utilities are ultimately intended to run on the GNU kernel with the GNU C Library.[7] But since the kernel isn’t finished yet and users are using GNU tools on non-GNU systems, portability is desirable, just not paramount. The standard recommends using Autoconf for achieving portability among different Unix systems.

Program Behavior

Chapter 4 of the GNU Coding Standards provides general advice about program behavior. We will return to look at one of its sections in detail, below. The chapter focuses on program design, formatting error messages, writing libraries (by making them reentrant), and standards for the command-line interface.

Error message formatting is important since several tools, notably Emacs, use the error messages to help you go straight to the point in the source file or data file at which an error occurred.

GNU utilities should use a function named getopt_long() for processing the command line. This function provides command-line option parsing for both traditional Unix-style options (’gawk -F: ...’) and GNU-style long options (’gawk --field-separator=: ...’). All programs should provide --help and --version options, and when a long name is used in one program, it should be used the same way in other GNU programs. To this end, there is a rather exhaustive list of long options used by current GNU programs.

As a simple yet obvious example, --verbose is spelled exactly the same way in all GNU programs. Contrast this to -v, -V, -d, etc., in many Unix programs. Most of Chapter 2, “Arguments, Options, and the Environment,” page 23, is devoted to the mechanics of argument and option parsing.
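A bare-bones use of getopt_long() might look like the following sketch; real GNU programs do considerably more, particularly for their --help and --version output. (The option table and messages here are made up for illustration.)

/* opts --- skeletal option parsing with getopt_long() (sketch) */
#include <stdio.h>
#include <getopt.h>

int main(int argc, char **argv)
{
    int c, verbose = 0;
    static struct option longopts[] = {
        { "verbose", no_argument, NULL, 'v' },
        { "help",    no_argument, NULL, 'h' },
        { "version", no_argument, NULL, 'V' },
        { NULL, 0, NULL, 0 }
    };

    while ((c = getopt_long(argc, argv, "vhV", longopts, NULL)) != -1) {
        switch (c) {
        case 'v':
            verbose = 1;
            break;
        case 'h':
            printf("usage: opts [--verbose] [--help] [--version]\n");
            return 0;
        case 'V':
            printf("opts 1.0\n");
            return 0;
        default:                         /* getopt_long() already printed a diagnostic */
            return 1;
        }
    }
    if (verbose)
        printf("being verbose\n");
    return 0;
}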

C Code Programming

The most substantive part of the GNU Coding Standards is Chapter 5, which describes how to write C code, covering things like formatting the code, correct use of comments, using C cleanly, naming your functions and variables, and declaring, or not declaring, standard system functions that you wish to use.

Code formatting is a religious issue; many people have different styles that they prefer. We personally don’t like the FSF’s style, and if you look at gawk, which we maintain, you’ll see it’s formatted in standard K&R style (the code layout style used in both editions of the Kernighan and Ritchie book). But this is the only variation in gawk from this part of the coding standards.

Nevertheless, even though we don’t like the FSF’s style, we feel that when modifying some other program, sticking to the coding style already used is of the utmost importance. Having a consistent coding style is more important than which coding style you pick. The GNU Coding Standards also makes this point. (Sometimes, there is no detectable consistent coding style, in which case the program is probably overdue for a trip through either GNU indent or Unix’s cb.)

What we find important about the chapter on C coding is that the advice is good for any C coding, not just if you happen to be working on a GNU program. So, if you’re just learning C or even if you’ve been working in C (or C++) for a while, we recommend this chapter to you since it encapsulates many years of experience.

Things That Make a GNU Program Better

We now examine the section titled Writing Robust Programs in Chapter 4, Program Behavior for All Programs, of the GNU Coding Standards. This section provides the principles of software design that make GNU programs better than their Unix counterparts. We quote selected parts of the chapter, with some examples of cases in which these principles have paid off.

Avoid arbitrary limits on the length or number of any data structure, including file names, lines, files, and symbols, by allocating all data structures dynamically. In most Unix utilities, “long lines are silently truncated.” This is not acceptable in a GNU utility.

This rule is perhaps the single most important rule in GNU software design—no arbitrary limits. All GNU utilities should be able to manage arbitrary amounts of data.

While this requirement perhaps makes it harder for the programmer, it makes things much better for the user. At one point, we had a gawk user who regularly ran an awk program on more than 650,000 files (no, that’s not a typo) to gather statistics. gawk would grow to over 192 megabytes of data space, and the program ran for around seven CPU hours. He would not have been able to run his program using another awk implementation.[8]
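The usual technique is to grow buffers with realloc() as the data arrive, rather than truncating at some fixed size. Here is one way to read a line of unbounded length (a sketch only; GLIBC users can also reach for the getline() function):

/* read_long_line --- read a line of arbitrary length (sketch) */
#include <stdio.h>
#include <stdlib.h>

char *read_long_line(FILE *fp)
{
    size_t size = 128, len = 0;
    char *buf = malloc(size);
    int c;

    if (buf == NULL)
        return NULL;
    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 2 > size) {            /* need room for c and the final '\0' */
            char *newbuf = realloc(buf, size *= 2);
            if (newbuf == NULL) {
                free(buf);
                return NULL;             /* out of memory */
            }
            buf = newbuf;
        }
        buf[len++] = c;
    }
    if (c == EOF && len == 0) {          /* end of input, nothing read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;                          /* caller must free() the result */
}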

Utilities reading files should not drop NUL characters, or any other nonprinting characters including those with codes above 0177. The only sensible exceptions would be utilities specifically intended for interface to certain types of terminals or printers that can’t handle those characters.

It is also well known that Emacs can edit any arbitrary file, including files containing binary data!

Whenever possible, try to make programs work properly with sequences of bytes that represent multibyte characters, using encodings such as UTF-8 and others.[9]

Check every system call for an error return, unless you know you wish to ignore errors. Include the system error text (from perror or equivalent) in every error message resulting from a failing system call, as well as the name of the file if any and the name of the utility. Just “cannot open foo.c” or “stat failed” is not sufficient.

Checking every system call provides robustness. This is another case in which life is harder for the programmer but better for the user. An error message detailing what exactly went wrong makes finding and solving any problems much easier.[10]
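In practice this means a pattern like the following sketch, which names the program, the file, and the system’s explanation of the failure:

/* openit --- report errors the GNU way (sketch) */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        /* not just "cannot open": say who, what, and why */
        fprintf(stderr, "%s: cannot open %s: %s\n",
                argv[0], argv[1], strerror(errno));
        return 1;
    }
    close(fd);
    return 0;
}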

Finally, we quote from Chapter 1 of the GNU Coding Standards, which discusses how to write your program differently from the way a Unix program may have been written.

For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in core and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).

Or, on the contrary, emphasize simplicity instead of speed. For some applications, the speed of today’s computers makes simpler algorithms adequate.

Or go for generality. For example, Unix programs often have static tables or fixed-size strings, which make for arbitrary limits; use dynamic allocation instead. Make sure your program handles NULs and other funny characters in the input files. Add a programming language for extensibility and write part of the program in that language.

Or turn some parts of the program into independently usable libraries. Or use a simple garbage collector instead of tracking precisely when to free memory, or use a new GNU facility such as obstacks.

An excellent example of the difference an algorithm can make is GNU diff. One of our system’s early incarnations was an AT&T 3B1: a system with a MC68010 processor, a whopping two megabytes of memory and 80 megabytes of disk. We did (and do) lots of editing on the manual for gawk, a file that is almost 28,000 lines long (although at the time, it was only in the 10,000-lines range). We used to use ’diff -c’ quite frequently to look at our changes. On this slow system, switching to GNU diff made a stunning difference in the amount of time it took for the context diff to appear. The difference is almost entirely due to the better algorithm that GNU diff uses.

The final paragraph mentions the idea of structuring a program as an independently usable library, with a command-line wrapper or other interface around it. One example of this is GDB, the GNU debugger, which is partially implemented as a command-line tool on top of a debugging library. (The separation of the GDB core functionality from the command interface is an ongoing development project.) This implementation makes it possible to write a graphical debugging interface on top of the basic debugging functionality.

Parting Thoughts about the “GNU Coding Standards”

The GNU Coding Standards is a worthwhile document to read if you wish to develop new GNU software, enhance existing GNU software, or just learn how to be a better programmer. The principles and techniques it espouses are what make GNU software the preferred choice of the Unix community.

Portability Revisited

Portability is something of a holy grail; always sought after, but not always obtainable, and certainly not easily. There are several aspects to writing portable code. The GNU Coding Standards discusses many of them. But there are others as well. Keep portability in mind at both higher and lower levels as you develop. We recommend these practices:

Code to standards.

  • Although it can be challenging, it pays to be familiar with the formal standards for the language you’re using. In particular, pay attention to the 1990 and 1999 ISO standards for C and the 2003 standard for C++ since most Linux programming is done in one of those two languages.

  • Also, the POSIX standard for library and system call interfaces, while large, has broad industry support. Writing to POSIX greatly improves the chances of successfully moving your code to other systems besides GNU/Linux. This standard is quite readable; it distills decades of experience and good practice.

Pick the best interface for the job.

  • If a standard interface does what you need, use it in your code. Use Autoconf to detect an unavailable interface, and supply a replacement version of it for deficient systems. (For example, some older systems lack the memmove() function, which is fairly easy to code by hand or to pull from the GLIBC library; a sketch of such a replacement appears after this list.)

Isolate portability problems behind new interfaces.

  • Sometimes, you may need to do operating-system-specific tasks that apply on some systems but not on others. (For example, on some systems, each program has to expand command-line wildcards instead of the shell doing it.) Create a new interface that does nothing on systems that don’t need it but does the correct thing on systems that do.

Use Autoconf for configuration.

  • Avoid #ifdef if possible. If not, bury it in low-level library code. Use Autoconf to do the checking for the tests to be performed with #ifdef.
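For instance, the memmove() replacement mentioned above might be supplied along the following lines. This is only a sketch; HAVE_MEMMOVE is the conventional symbol an Autoconf-generated configure script would define when the real function exists, and actual replacement files in GNU packages differ in the details.

/* memmove replacement for deficient systems (sketch only) */
#ifndef HAVE_MEMMOVE
#include <stddef.h>

void *memmove(void *dest, const void *src, size_t n)
{
    char *d = dest;
    const char *s = src;

    if (d < s) {                         /* copy forward */
        while (n-- > 0)
            *d++ = *s++;
    } else {                             /* copy backward, in case the regions overlap */
        d += n;
        s += n;
        while (n-- > 0)
            *--d = *--s;
    }
    return dest;
}
#endif /* ! HAVE_MEMMOVE */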

Suggested Reading

  1. The C Programming Language, 2nd edition, by Brian W. Kernighan and Dennis M. Ritchie. Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1989. ISBN: 0-13-110370-9.

    This is the “bible” for C, covering the 1990 version of Standard C. It is a rather dense book, with lots of information packed into a startlingly small number of pages. You may need to read it through more than once; doing so is well worth the trouble.

  2. C, A Reference Manual, 5th edition, by Samuel P. Harbison III and Guy L. Steele, Jr. Prentice-Hall, Upper Saddle River, New Jersey, USA, 2002. ISBN: 0-13-089592-X.

    This book is also a classic. It covers Original C as well as the 1990 and 1999 standards. Because it is current, it makes a valuable companion to The C Programming Language. It covers many important items, such as internationalization-related types and library functions, that aren’t in the Kernighan and Ritchie book.

  3. “Notes on Programming in C”, by Rob Pike, February 21, 1989. Available on the Web from many sites. Perhaps the most widely cited location is http://www.lysator.liu.se/c/pikestyle.html. (Many other useful articles are available from one level up: http://www.lysator.liu.se/c/.)

    Rob Pike worked for many years at the Bell Labs research center where C and Unix were invented and did pioneering development there. His notes distill many years of experience into a “philosophy of clarity in programming” that is well worth reading.

  4. The various links at http://www.chris-lott.org/resources/cstyle/. This site includes Rob Pike’s notes and several articles by Henry Spencer. Of particular note is the “Recommended C Style and Coding Standards”, originally written at the Bell Labs Indian Hill site.

Summary

  • “Files and processes” summarizes the Linux/Unix worldview. The treatment of files as byte streams and devices as files, and the use of standard input, output, and error, simplify program design and unify the data access model. The permissions model is simple, yet flexible, applying to both files and directories.

  • Processes are running programs that have user and group identifiers associated with them for permission checking, as well as other attributes such as open files and a current working directory.

  • The most visible difference between Standard C and Original C is the use of function prototypes for stricter type checking. A good C programmer should be able to read Original-style code, since many existing programs use it. New code should be written using prototypes.

  • The GNU Coding Standards describe how to write GNU programs. They provide numerous valuable techniques and guiding principles for producing robust, usable software. The “no arbitrary limits” principle is perhaps the single most important of these. This document is required reading for serious programmers.

  • Making programs portable is a significant challenge. Guidelines and tools help, but ultimately experience is needed too.

Exercises

  1. Read and comment on the article “The GNU Project”,[11] by Richard M. Stallman, originally written in August of 1998.



[1] Some systems allow regular users to change the ownership on their files to someone else, thus “giving them away.” The details are standardized by POSIX but are a bit messy. Typical GNU/Linux configurations do not allow it.

[2] The owner can always change the permission, of course. Most users don’t disable write permission for themselves.

[3] There are some rare exceptions to this rule, all of which are beyond the scope of this book.

[4] This feature first appeared in Multics, but Multics was never widely used.

[5] Processes can be suspended, in which case they are not “running”; however, neither are they terminated. In any case, in the early stages of the climb up the learning curve, it pays not to be too pedantic.

[6] This section is adapted from an article by the author that appeared in Issue 16 of Linux Journal. (See http://www.linuxjournal.com/article.php?sid=1135.) Reprinted and adapted by permission.

[7] This statement refers to the HURD kernel, which is still under development (as of early 2004). GCC and GNU C Library (GLIBC) development take place mostly on Linux-based systems today.

[8] This situation occurred circa 1993; the truism is even more obvious today, as users process gigabytes of log files with gawk.

[9] Section 13.4, “Can You Spell That for Me, Please?”, page 521, provides an overview of multibyte characters and encodings.

[10] The mechanics of checking for and reporting errors are discussed in Section 4.3, “Determining What Went Wrong,” page 86.
