Catherine: “Why commit Evil?” Goetz: “Because Good has already been done.” Catherine: “Who has done it?” Goetz: “God the Father. I, on the other hand, am improvising.”
—Jean-Paul Sartre, The Devil and the Good Lord, act IV, scene 4
The C Standard defines formatted output functions that accept a variable number of arguments, including a format string.1 Examples of formatted output functions include printf() and sprintf().
1. Formatted output originated in Fortran and found its way into C in 1972 with the portable I/O package, which M. E. Lesk described in a 1973 internal memorandum, “A Portable I/O Package.” This package was reworked and became the C Standard I/O functions.
Example 6.1 shows a C program that uses formatted output functions to provide usage information about required arguments that are not provided. Because the executable may be renamed, the actual name of the program entered by the user (argv[0]) is passed as an argument to the usage() function on line 13 of main(). The call to snprintf() on line 6 constructs the usage string by substituting the %s in the format string with the runtime value of pname. Finally, printf() is called on line 8 to output the usage information.
01 #include <stdio.h>
02 #include <stdlib.h>
03
04 void usage(char *pname) {
05   char usageStr[1024];
06   snprintf(usageStr, 1024,
07     "Usage: %s <target>\n", pname);
08   printf(usageStr);
09 }
10
11 int main(int argc, char * argv[]) {
12   if (argc > 0 && argc < 2) {
13     usage(argv[0]);
14     exit(-1);
15   }
16 }
This program implements a common programming idiom, particularly for UNIX command-line programs. However, this implementation is flawed in a manner that can be exploited to run arbitrary code. But how is this accomplished? (Hint: It does not involve a buffer overflow.)
Formatted output functions consist of a format string and a variable number of arguments. The format string, in effect, provides a set of instructions that are interpreted by the formatted output function. By controlling the content of the format string, a user can, in effect, control execution of the formatted output function.
Formatted output functions are variadic, meaning that they accept a variable number of arguments. Limitations of variadic function implementations in C contribute to vulnerabilities in the use of formatted output functions. Variadic functions are examined in the following section before formatted output functions are examined in detail.
The <stdarg.h> header declares a type and defines four macros for advancing through a list of arguments whose number and types are not known to the called function when it is compiled. POSIX defines the legacy header <varargs.h>, which dates from before the standardization of C and provides functionality similar to <stdarg.h> [ISO/IEC/IEEE 9945:2009]. The older <varargs.h> header has been deprecated in favor of <stdarg.h>. Both approaches require that the contract between the developer and the user of the variadic function not be violated by the user. The newer C Standard version is described here.
Variadic functions are declared using a partial parameter list followed by the ellipsis notation. For example, the variadic average() function shown in Example 6.2 accepts a single, fixed argument followed by a variable argument list. No type checking is performed on the arguments in the variable list. One or more fixed parameters precede the ellipsis notation, which must be the last token in the parameter list.
01 int average(int first, ...) {
02   int count = 0, sum = 0, i = first;
03   va_list marker;
04
05   va_start(marker, first);
06   while (i != -1) {
07     sum += i;
08     count++;
09     i = va_arg(marker, int);
10   }
11   va_end(marker);
12   return(sum ? (sum / count) : 0);
13 }
A function with a variable number of arguments is invoked simply by specifying the desired number of arguments in the function call:
average(3, 5, 8, -1);
The <stdarg.h> header defines the va_start(), va_arg(), and va_end() macros shown in Example 6.3 for implementing variadic functions, as well as the va_copy() macro not used in this example. All of these macros operate on the va_list data type, and the argument list is declared using the va_list type. For example, the marker variable on line 3 of Example 6.2 is declared as a va_list type. The va_start() macro initializes the argument list and must be called before marker can be used. In the average() implementation, va_start() is called on line 5 and passed marker and the last fixed argument (first). This fixed argument allows va_start() to determine the location of the first variable argument. The va_arg() macro requires an initialized va_list and the type of the next argument. The macro returns the next argument and increments the argument pointer based on the type size. The va_arg() macro is invoked on line 9 of the average() function to access the second through last arguments. Finally, va_end() is called to perform any necessary cleanup before the function returns. If the va_end() macro is not invoked before the return, the behavior is undefined.
The termination condition for the argument list is a contract between the programmers who implement the function and those who use it. In this implementation of the average() function, termination of the variable argument list is indicated by an argument whose value is -1. If the programmer calling the function neglects to provide this argument, the average() function will continue to read successive values from memory until a -1 value is encountered or a fault occurs.
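The contract can be shifted from a sentinel value to an explicit count. The following is a minimal sketch of that alternative, not part of the original example: average_n() is a hypothetical name, and the function still trusts the caller to pass exactly count arguments of type int.

```c
#include <stdarg.h>

/* Hypothetical variant of average(): the caller states how many values
 * follow instead of terminating the list with -1, so -1 can be a value. */
int average_n(int count, ...) {
    va_list marker;
    long sum = 0;

    if (count <= 0) {
        return 0;                     /* nothing to average */
    }
    va_start(marker, count);
    for (int i = 0; i < count; i++) {
        /* Still unchecked: the caller must really pass `count` ints. */
        sum += va_arg(marker, int);
    }
    va_end(marker);
    return (int)(sum / count);
}
```

A call such as average_n(3, 3, 5, 8) averages the three values that follow the count. The count removes the reserved sentinel, but a wrong count still causes the function to read arguments that were never passed.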
Example 6.3 shows the va_list type and the va_start(), va_arg(), and va_end() macros2 as implemented by Visual C++. Defining the va_list type as a character pointer is an obvious implementation with sequentially ordered arguments such as the ones generated by Visual C++ and GCC on x86-32.
2. C99 adds the va_copy() macro.
#define _ADDRESSOF(v) (&(v))
#define _INTSIZEOF(n) \
  ((sizeof(n)+sizeof(int)-1) & ~(sizeof(int)-1))
typedef char *va_list;
#define va_start(ap,v) (ap=(va_list)_ADDRESSOF(v)+_INTSIZEOF(v))
#define va_arg(ap,t) (*(t *)((ap+=_INTSIZEOF(t))-_INTSIZEOF(t)))
#define va_end(ap) (ap = (va_list)0)
Figure 6.1 illustrates how the arguments are sequentially ordered on the stack when the average(3,5,8,-1) function is called on these systems. The character pointer is initialized by va_start() to reference the parameters following the last fixed argument. The va_start() macro adds the size of the argument to the address of the last fixed parameter. When va_start() completes, the va_list object points to the address of the first optional argument.
Not all systems define the va_list type as a character pointer. Some systems define va_list as an array of pointers, and other systems pass arguments in registers. When arguments are passed in registers, va_start() may have to allocate memory to store the arguments. In this case, the va_end() macro is used to free allocated memory.
Formatted output function implementations differ significantly based on their history. The formatted output functions defined by the C Standard include the following:
• fprintf() writes output to a stream based on the contents of the format string. The stream, format string, and a variable list of arguments are provided as arguments.
• printf() is equivalent to fprintf() except that printf() assumes that the output stream is stdout.
• sprintf() is equivalent to fprintf() except that the output is written into an array rather than to a stream. The C Standard stipulates that a null character is added at the end of the written characters.
• snprintf() is equivalent to sprintf() except that the maximum number of characters n to write is specified. If n is nonzero, output characters beyond the n–1st are discarded rather than written to the array, and a null character is added at the end of the characters written into the array.3
• vfprintf(), vprintf(), vsprintf(), and vsnprintf() are equivalent to fprintf(), printf(), sprintf(), and snprintf() with the variable argument list replaced by an argument of type va_list. These functions are useful when the argument list is determined at runtime.
3. The snprintf() function was introduced in the C99 standard to improve the security of the standard library.
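A typical use of the va_list variants is a wrapper that is itself variadic and forwards its arguments. The sketch below is illustrative only: format_checked() is a hypothetical helper, and it assumes a C99-conforming vsnprintf() that returns the number of characters the complete output would have required.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: formats into buf and reports whether the whole
 * result fit. Returns 1 if complete, 0 if truncated or on error. */
int format_checked(char *buf, size_t size, const char *fmt, ...) {
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(buf, size, fmt, args);   /* forwards the argument list */
    va_end(args);

    /* n is the length the full output needs, excluding the null byte. */
    return n >= 0 && (size_t)n < size;
}
```

Such a wrapper is a formatted output function in its own right, so its fmt parameter needs the same care as the format string of printf(): it should come from the program, not from the user.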
Another formatted output function not defined by the C specification but defined by POSIX is syslog(). The syslog() function accepts a priority argument, a format specification, and any arguments required by the format and generates a log message to the system logger (syslogd). The syslog() function first appeared in BSD 4.2 and is supported by Linux and other modern POSIX implementations. It is not available on Windows systems.
The interpretation of format strings is defined in the C Standard. C runtimes typically adhere to the C Standard but often include nonstandard extensions. You can usually rely on all the formatted output functions for a particular C runtime interpreting format strings the same way because they are almost always implemented using a common subroutine.
The following sections describe the C Standard definition of format strings, GCC and Visual C++ implementations, and some differences between these implementations and the C Standard.
Format strings are character sequences consisting of ordinary characters (excluding %) and conversion specifications. Ordinary characters are copied unchanged to the output stream. Conversion specifications consume arguments, convert them according to a corresponding conversion specifier, and write the results to the output stream.
Conversion specifications begin with a percent sign (%) and are interpreted from left to right. Most conversion specifications consume a single argument, but they may consume multiple arguments or none. The programmer must match the number of arguments to the specified format. If there are more arguments than conversion specifications, the extra arguments are ignored. If there are not enough arguments for all the conversion specifications, the results are undefined.
A conversion specification consists of optional fields (flags, width, precision, and length modifier) and required fields (conversion specifier) in the following form:
%[flags] [width] [.precision] [{length-modifier}] conversion-specifier
For example, in the conversion specification %-10.8ld, - is a flag, 10 is the width, the precision is 8, the letter l is a length modifier, and d is the conversion specifier. This particular conversion specification prints a long int argument in decimal notation, with a minimum of eight digits left-justified in a field at least ten characters wide.
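The effect of this conversion specification can be checked with a short sketch. format_example() is a hypothetical helper introduced here for illustration; the square brackets are added only to make the padding visible.

```c
#include <stdio.h>

/* Formats value with the %-10.8ld conversion specification discussed
 * above, bracketed so that the trailing padding can be seen. */
int format_example(char *buf, size_t size, long value) {
    return snprintf(buf, size, "[%-10.8ld]", value);
}
```

For the value 42, the precision first pads the number with leading zeros to eight digits, and the width then adds two trailing blanks because the - flag left-justifies the field, giving [00000042  ].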
Each field is a single character or a number signifying a particular format option. The simplest conversion specification contains only % and a conversion specifier (for example, %s).
A conversion specifier indicates the type of conversion to be applied. The conversion specifier character is the only required format field, and it appears after any optional format fields. Table 6.1 lists some of the conversion specifiers from the C Standard, including n, which plays a key role in many exploits.
Flags justify output and print signs, blanks, decimal points, and octal and hexadecimal prefixes. More than one flag directive may appear in a format specification. The flag characters are described in the C Standard.
Width is a nonnegative decimal integer that specifies the minimum number of characters to output. If the number of characters output is less than the specified width, the output is padded with blank characters.
A small width does not cause field truncation. If the result of a conversion is wider than the field width, the field expands to contain the conversion result. If the width specification is an asterisk (*), an int argument from the argument list supplies the value. In the argument list, the width argument must precede the value being formatted.
Precision is a nonnegative decimal integer that specifies the number of characters to be printed, the number of decimal places, or the number of significant digits.4 Unlike the width field, the precision field can cause truncation of the output or rounding of a floating-point value. If precision is specified as 0 and the value to be converted is 0, no characters are output. If the precision field is an asterisk (*), the value is supplied by an int argument from the argument list. The precision argument must precede the value being formatted in the argument list.
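The width and precision rules above can be observed directly. The following sketch is illustrative; width_precision_demo() and the fixed 16-byte slots are assumptions made for the example.

```c
#include <stdio.h>

/* Fills five 16-byte slots, one per rule described in the text. */
void width_precision_demo(char out[5][16]) {
    snprintf(out[0], 16, "%5d", 42);         /* padded with blanks to width 5 */
    snprintf(out[1], 16, "%2d", 12345);      /* width too small: field expands */
    snprintf(out[2], 16, "%.3s", "abcdef");  /* precision truncates the string */
    snprintf(out[3], 16, "%*d", 6, 42);      /* width supplied as an int argument */
    snprintf(out[4], 16, "%.0d", 0);         /* precision 0, value 0: no output */
}
```

The second and third slots show the asymmetry between the two fields: a width that is too small leaves 12345 intact, whereas a precision of 3 cuts the string down to abc.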
4. The conversion specifier determines the interpretation of the precision field and the default precision when the precision field is omitted.
Length modifier specifies the size of the argument. The length modifiers and their meanings are listed in Table 6.2. If a length modifier appears with any conversion specifier other than the ones specified in this table, the resulting behavior is undefined.
The GCC implementation of formatted output functions conforms to the C Standard but also implements POSIX extensions.
Formatted output functions in GCC version 3.2.2 handle width and precision fields up to INT_MAX (2,147,483,647 on x86-32). Formatted output functions also keep and return a count of characters output as an int. This count continues to increment even if it exceeds INT_MAX, which results in a signed integer overflow and a signed negative number. However, if interpreted as an unsigned number, the count is accurate until an unsigned overflow occurs. The fact that the count value can be successfully incremented through all possible bit patterns plays an important role when we examine exploitation techniques later in this chapter.
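The character count is visible to callers as the return value of the formatted output functions. A minimal sketch, assuming a C99-conforming snprintf() (which accepts a null buffer when the size is 0); formatted_length() is a hypothetical helper, not a library function.

```c
#include <stdio.h>

/* Returns the number of characters the conversion would produce,
 * without writing any output (null buffer, size 0). */
int formatted_length(int width, int value) {
    return snprintf(NULL, 0, "%*d", width, value);
}
```

Because the width field alone determines how far the count advances, a caller who controls the width can set the count to a chosen value without supplying that many characters of input.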
The Visual C++ implementation of formatted output functions is based on the C Standard and Microsoft-specific extensions.
Formatted output functions in at least some Visual C++ implementations share a common definition of format string specifications. Therefore, format strings are interpreted by a common function called _output(). The _output() function parses the format string and determines the appropriate action based on the character read from the format string and the current state.
The _output() function stores the width as a signed integer. Widths of up to INT_MAX are supported. Because the _output() function makes no attempt to detect or deal with signed integer overflow, values exceeding INT_MAX can cause unexpected results.
The _output() function stores the precision as a signed integer but uses a conversion buffer of 512 characters, which restricts the maximum precision to 512 characters. Table 6.3 shows the resulting behavior for precision values and ranges.
The character output counter is also represented as a signed integer. Unlike the GCC implementation, however, the main loop of _output() exits if this value becomes negative, which prevents values in the INT_MAX+1 to UINT_MAX range.
Visual Studio 2012 does not support C’s h, j, z, and t length modifiers. It does, however, provide an I32 length modifier that behaves the same as the l length modifier and an I64 length modifier that approximates the ll length modifier; that is, I64 prints the full value of a long long int but writes only 32 bits when used with the n conversion specifier.
Formatted output became a focus of the security community in June 2000 when a format string vulnerability was discovered in WU-FTP.5 Format string vulnerabilities can occur when a format string (or a portion of a string) is supplied by a user or other untrusted source. Buffer overflows can occur when a formatted output routine writes beyond the boundaries of a data structure. The sample proof-of-concept exploits included in this section were developed with Visual C++ and tested on Windows, but the underlying vulnerabilities are common to many platforms.
5. See www.kb.cert.org/vuls/id/29823.
Formatted output functions that write to a character array (for example, sprintf()) assume arbitrarily long buffers, which makes them susceptible to buffer overflows. Example 6.4 shows a buffer overflow vulnerability involving a call to sprintf(). The function writes to a fixed-length buffer, replacing the %s conversion specifier in the format string with a (potentially malicious) user-supplied string. Any string longer than 495 bytes results in an out-of-bounds write (512 bytes – 16 character bytes – 1 null byte).
char buffer[512];
sprintf(buffer, "Wrong command: %s\n", user);
Buffer overflows need not be this obvious. Example 6.5 shows a short program containing a programming flaw that can be exploited to cause a buffer overflow [Scut 2001].
1 char outbuf[512];
2 char buffer[512];
3 sprintf(
4 buffer,
5 "ERR Wrong command: %.400s",
6 user
7 );
8 sprintf(outbuf, buffer);
The sprintf() call on line 3 cannot be directly exploited because the %.400s conversion specifier limits the number of bytes written to 400. This same call can be used to indirectly attack the sprintf() call on line 8, for example, by providing the following value for user:
%497d\x3c\xd3\xff\xbf<nops><shellcode>
The sprintf() call on lines 3–7 inserts this string into buffer. The buffer array is then passed to the second call to sprintf() on line 8 as the format string argument. The %497d format specification instructs sprintf() to read an imaginary argument from the stack and write 497 characters to outbuf. Including the ordinary characters in the format string, the total number of characters written now exceeds the length of outbuf by 4 bytes.
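The 4-byte figure follows from the 19 ordinary characters of "ERR Wrong command: " plus the 497 characters produced by %497d, or 516 characters against a 512-byte array. The sketch below measures this safely rather than performing the overflow; expanded_length() is a hypothetical helper, and it assumes a C99 snprintf() that accepts a null buffer of size 0.

```c
#include <stdio.h>

/* Measures how many characters the second sprintf() in Example 6.5
 * would produce for the "%497d" prefix, without writing anything. */
int expanded_length(void) {
    /* The 0 stands in for whatever value the attack reads off the stack. */
    return snprintf(NULL, 0, "ERR Wrong command: %497d", 0);
}
```

The remaining bytes of the attacker's string (the address and the shellcode) are written after these 516 characters, which is what places them beyond the end of outbuf.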
The user input can be manipulated to overwrite the return address with the address of the exploit code supplied in the malicious format string argument (0xbfffd33c). When the current function exits, control is transferred to the exploit code in the same manner as a stack-smashing attack (see Section 2.3).
This is a format string vulnerability because the format string is manipulated by the user to exploit the program. Such cases are often hidden deep inside complex software systems and are not always obvious. For example, qpopper versions 2.53 and earlier contain a vulnerability of this type.6
6. See www.auscert.org.au/render.html?it=81.
The programming flaw in this case is that sprintf() is being used inappropriately on line 8 as a string copy function when strcpy() or strncpy() should be used instead. Paradoxically, replacing this call to sprintf() with a call to strcpy() eliminates the vulnerability.
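If a formatted output function is kept for the copy, the same effect can be had by passing the message as an argument to a constant format string. This is a sketch of that alternative, not the strcpy() replacement named above; copy_message() is a hypothetical helper.

```c
#include <stdio.h>

/* Copies msg into outbuf as data. Because the format string is the
 * constant "%s", any '%' characters in msg are copied, not interpreted.
 * Returns 1 if the whole message fit, 0 if truncated or on error. */
int copy_message(char *outbuf, size_t size, const char *msg) {
    int n = snprintf(outbuf, size, "%s", msg);
    return n >= 0 && (size_t)n < size;
}
```

With this form, a message containing %497d arrives in outbuf as those five literal characters, and the size argument bounds the write regardless of the message length.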
Formatted output functions that write to a stream instead of a character array (such as printf()) are also susceptible to format string vulnerabilities.
The simple function shown in Example 6.6 contains a format string vulnerability. If the user argument can be fully or partially controlled by a user, this program can be exploited to crash the program, view the contents of the stack, view memory content, or overwrite memory. The following sections detail each of these exploits.
int func(char *user) {
  printf(user);
}
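For comparison, the flaw disappears when the user-supplied string is only ever treated as data. The following is a sketch under that assumption; func_safe() is a hypothetical name and is not part of the original example.

```c
#include <stdio.h>

/* Variant of func() that never interprets user as a format string.
 * Returns a nonnegative value on success and EOF on a write error. */
int func_safe(const char *user) {
    return fputs(user, stdout);   /* equivalent here: printf("%s", user) */
}
```

With this version, input such as %s%s%s%s is printed literally instead of being parsed as four conversion specifications.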
Format string vulnerabilities are often discovered when a program crashes. For most UNIX systems, an invalid pointer access causes a SIGSEGV signal to be sent to the process. Unless the signal is caught and handled, the program will abnormally terminate and dump core. Similarly, an attempt to read an unmapped address in Windows results in a general protection fault followed by abnormal program termination. An invalid pointer access or unmapped address read can usually be triggered by calling a formatted output function with the following format string:
printf("%s%s%s%s%s%s%s%s%s%s%s%s");
The %s conversion specifier displays memory at an address specified in the corresponding argument on the execution stack. Because no string arguments are supplied in this example, printf() reads arbitrary memory locations from the stack until the format string is exhausted or an invalid pointer or unmapped address is encountered.
Unfortunately, it is relatively easy to crash many programs—but this is only the start of the problem. Attackers can also exploit formatted output functions to examine the contents of memory. This information is often used for further exploitation.
As described in Section 6.1, formatted output functions accept a variable number of arguments that are typically supplied on the stack. Figure 6.2 shows a sample of the assembly code generated by Visual C++ for a simple call to printf(). Arguments are pushed onto the stack in reverse order. Because the stack grows toward low memory on x86-32 (the stack pointer is decremented after each push), the arguments appear in memory in the same order as in the printf() call.
Figure 6.3 shows the contents of memory after the call to printf().7 The address of the format string 0xe0f84201 appears in memory followed by the argument values 1, 2, and 3. The memory directly preceding the arguments (not shown in the figure) contains the stack frame for printf(). The memory immediately following the arguments contains the automatic variables for the calling function, including the contents of the format character array 0x2e253038.
7. The bytes in Figure 6.3 appear exactly as they would in memory when using little endian byte ordering.
The format string in this example, %08x.%08x.%08x.%08x, instructs printf() to retrieve four arguments from the stack and display them as eight-digit padded hexadecimal numbers. The call to printf(), however, places only three arguments on the stack. So what is displayed, in this case, by the fourth conversion specification?
Formatted output functions including printf() use an internal variable to identify the location of the next argument. This argument pointer initially refers to the first argument (the value 1). As each argument is consumed by the corresponding format specification, the argument pointer is increased by the length of the argument, as shown by the arrows along the top of Figure 6.3. The contents of the stack or the stack pointer are not modified, so execution continues as expected when control returns to the calling program.
Each %08x in the format string reads a value it interprets as an int from the location identified by the argument pointer. The values output by each conversion specification are shown below the format string in Figure 6.3. The first three integers correspond to the three arguments to the printf() function. The fourth “integer” contains the first 4 bytes of the format string: the ASCII codes for %08x. The formatted output function will continue displaying the contents of memory in this fashion until a null byte is encountered in the format string.
After displaying the remaining automatic variables for the currently executing function, printf() displays the stack frame for the currently executing function (including the return address and arguments for the currently executing function). As printf() moves sequentially through stack memory, it displays the same information for the calling function, the function that called that function, and so on, up through the call stack. Using this technique, it is possible to reconstruct large parts of the stack memory. An attacker can use this data to determine offsets and other information about the program to further exploit this or other vulnerabilities.
It is possible for an attacker to examine memory at an arbitrary address by using a format specification that displays memory at a specified address. For example, the %s conversion specifier displays memory at the address specified by the argument pointer as an ASCII string until a null byte is encountered. If an attacker can manipulate the argument pointer to reference a particular address, the %s conversion specifier will output memory at that location.
As stated earlier, the argument pointer can be advanced in memory using the %x conversion specifier, and the distance it can be moved is restricted only by the size of the format string. Because the argument pointer initially traverses the memory containing the automatic variables for the calling function, an attacker can insert an address in an automatic variable in the calling function (or any other location that can be referenced by the argument pointer). If the format string is stored as an automatic variable, the address can be inserted at the beginning of the string. For example, the address 0x0142f5dc can be represented as the 32-bit, little endian encoded string \xdc\xf5\x42\x01. The printf() function treats these bytes as ordinary characters and outputs the corresponding displayable ASCII character (if one is defined). If the format string is located elsewhere (for example, the data or heap segments), it is easier for an attacker to store the address closer to the argument pointer.
By concatenating these elements, an attacker can create a format string of the following form to view memory at a specified address:
address advance-argptr %s
Figure 6.4 shows an example of a format string in this form: \xdc\xf5\x42\x01%x%x%x%s. The hex constants representing the address are output as ordinary characters. They do not consume any arguments or advance the argument pointer. The series of three %x conversion specifiers advances the argument pointer 12 bytes to the start of the format string. The %s conversion specifier displays memory at the address supplied at the beginning of the format string. In this example, printf() displays memory from 0x0142f5dc until a