2.1 Overview
C requires explicit data typing for variables, arguments passed to a function, and a value returned from a function. The names for C data types occur in many other languages as well: int for signed integers, float for floating-point numbers, char for numeric values that serve as character codes, and so on. C programmers can define arbitrarily rich data types of their own such as Employee and Movie, which reduce ultimately to primitive types such as int and float. C’s built-in data types deliberately mirror machine-level types such as integers and floating-point numbers of various sizes.
The amount of memory required to store values of the type (e.g., the int value -3232, a pointer to the string “ABC”)
The operations allowed on values of the type (e.g., an int value can be shifted left or right, whereas a float value should not be shifted at all)
The sizeof various basic data types
The dataTypes (see Listing 2-1) program prints the byte sizes for the basic C data types. These sizes are the usual ones on modern devices. The following sections focus on C’s built-in data types and built-in operations on these types. Technical matters such as the 2’s complement representation of integers and the IEEE 754 standard for floating-point formats are covered in detail.
2.2 Integer Types
The most significant (by convention, the leftmost) bit is the sign bit, with 0 for nonnegative and 1 for negative.
The remaining bits are magnitude bits.
The signed and unsigned integer types come in various sizes.
Basic integer data types
Type | Byte size | Range |
---|---|---|
unsigned char | 1 | 0 to 255 |
signed char | 1 | -128 to 127 |
unsigned short | 2 | 0 to 65,535 |
signed short | 2 | -32,768 to 32,767 |
unsigned int | 4 | 0 to 4,294,967,295 |
signed int | 4 | -2,147,483,648 to 2,147,483,647 |
unsigned long | 8 | 0 to 18,446,744,073,709,551,615 |
signed long | 8 | –9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
As the examples indicate, unsigned must be used explicitly if unsigned is the desired variant.
Data types and bits
The data type of a variable does not restrict the bits that can be assigned to it. For example, the compiler does not warn against assigning the negative value -1 to the unsigned variable n. For the compiler, the decimal value -1 is, in the 2’s complement representation now common across computing devices, all 1s in binary. Accordingly, the variable n holds 32 1s when -1 is assigned to this variable. (Further details of the 2’s complement representation are covered shortly.)
The equality operator ==, when applied to integer values, checks for identical bit patterns. If the left and the right side expressions (in this example, the values of two variables) have identical bit patterns, the comparison is true; otherwise, false. The variables n, m, and k all store 32 1s in binary; hence, they are all equal in value by the equality operator ==.
In print statements, the internal representation of a value (the bit string) can be formatted to yield different external representations. For example, the 32 1s stored in variable n can be printed as a negative decimal value using the formatter %i (integer) or %d (decimal). Recall that in 2’s complement, a value is negative if its high-order (leftmost) bit is a 1; hence, the %i formatter for signed values treats the 32 1s as the negative value -1: the high-order bit is the sign bit 1 (negative), and the remaining bits are the magnitude bits. By contrast, the %u formatter for unsigned treats all of the bits as magnitude bits, which yields the value of the symbolic constant UINT_MAX (4,294,967,295) in decimal.
Comparing expressions of mixed data types is risky because the compiler coerces one of the types to the other, following rules that may not be obvious. In this example, the value -1 stored in the signed variable small is converted to unsigned so that the comparison is apples to apples rather than apples to oranges. As noted earlier, -1 is all 1s in binary; hence, as unsigned, this value is UINT_MAX, far greater than the 98,765 stored in big.
Signed values are converted to unsigned ones.
Smaller value types are converted to larger ones. For example, if a 2-byte short is compared to a 4-byte int, then the short value is converted to an int value for the comparison.
When floating-point values occur in expressions with integer values, the compiler converts the integer values into floating-point ones.
In assembly code, an instruction such as cmpl would be used to compare two integer values. The l in cmpl determines the number of bits compared: in this case, 32 because l is for longword, a 32-bit word in the Intel architecture. Were two 64-bit values being compared, then the instruction would be cmpq instead, as the q stands for quadword, a 64-bit word in this same architecture. At the assembly level, as at the machine level, the size of a data type is built into the instruction’s opcode, in this example cmpl.
An earlier example showed that C’s signed char and unsigned char are likewise integer types. As the name char indicates, the char type is designed to store single-byte character codes (e.g., ASCII and Latin-1); the more recent wchar_t type also is an integer type, but one designed for multibyte character codes (e.g., Unicode). For historical reasons, the char type is shorthand for either signed char or unsigned char; which of the two applies is platform dependent. For the remaining types, this is not the case. For example, short is definitely an abbreviation for signed short.
2.2.1 A Caution on the 2’s Complement Representation
For readability, the binary representation has been broken into four 8-bit chunks. The rightmost (least significant) bit is a 0, which makes the value (-2,147,483,648) even rather than odd. The leftmost (most significant) bit is the sign bit: 1 for negative as in this case and 0 for nonnegative. There are similar constants for other integer types (for instance, SHRT_MIN and LONG_MIN).
1. Invert the 1s in -1, which yields all 0s: 00000…000.
2. Add 1, which yields 00000…001 or 1 in binary and decimal, the absolute value of -1 in decimal.
The same recipe yields -1 from 1: invert the bits in 1 (yielding 11111…0) and then add 1 (yielding 11111…1), which again is all 1s in binary and -1 in decimal.
1. Invert the bits, which transforms INT_MIN to 01111111 11111111 11111111 11111111.
2. Add 1 to yield 10000000 00000000 00000000 00000000, which is INT_MIN again.
A modern C compiler does issue a warning when encountering the expression -INT_MIN, cautioning that the expression causes an overflow because of the addition operation. By the way, no other int value is equal to its negation under the 2’s complement representation.
2.2.2 Integer Overflow
Integer overflow
The overflow program (see Listing 2-3) initializes int variable n to 81 and then loops. In each loop iteration, n is multiplied by itself as long as the resulting value is greater than zero. The trace shows that the loop iterates three times, and on the third iteration, the new value of n becomes negative. As the hex output shows, the leftmost (most significant) four bits are hex digit e, which is 1110 in binary: the leftmost 1 is now the sign bit for negative. In this example, the overflow could be delayed, but not prevented, by using a long instead of an int.
There is no compiler warning in the overflow program that overflow may result. It is up to the programmer to safeguard against this possibility. There are libraries that support arbitrary-precision arithmetic in C, including the GMP library (GNU Multiple Precision Arithmetic Library at https://gmplib.org). A later code example uses embedded assembly code to check for overflow.
2.3 Floating-Point Types
C has the floating-point types appropriate in a modern, general-purpose language. Computers as a rule implement the IEEE 754 specification (https://standards.ieee.org/ieee/754/6210/) in their floating-point hardware, so C implementations follow this specification as well.
Basic floating-point data types
Type | Byte size | Range | Precision |
---|---|---|---|
float | 4 | 1.2E-38 to 3.4E+38 | 6 places |
double | 8 | 2.3E-308 to 1.7E+308 | 15 places |
long double | 16 | 3.4E-4932 to 1.1E+4932 | 19 places |
2.3.1 Floating-Point Challenges
In the printf statement, the formatter %.24f specifies a precision of 24 decimal places. As a later example illustrates, unexpected rounding up can occur when a particular decimal value does not have an exact binary representation. Even this short code segment underscores that floating-point types should not be used in financial, engineering, and other applications that require exactness and precision. In such applications, there are libraries such as GMP (http://gmplib.org), mentioned earlier, to support arbitrary-precision arithmetic.
The two differ in the least significant digit: hex a is 1010 in binary, whereas hex b is 1011 in binary. The two values differ ever so slightly, in the least significant (rightmost) bit of their binary representations. In close-to-the-metal C, the equality operator compares bits; at the bit level, the two expressions differ.
Approximate equality
The comp code segment (see Listing 2-4) shows how a comparison can be made using FLT_EPSILON. The library function fabs returns the absolute value of the difference between f1 and f2. This value is less than FLT_EPSILON; hence, the two values might be considered equal because their difference is less than FLT_EPSILON.
Issues with floating-point data types
The rounding program (see Listing 2-5) initializes a variable to 1.01 and then increments this variable by that amount in a loop that iterates ten times. The rounding up becomes evident in the seventh loop iteration: the expected value is 7.070000, but the printed value is 7.070001. Note that the formatter is %12f rather than %.12f. The latter would print 12 decimal places, whereas %12f prints the default number of places, which happens to be six. The 12 in %12f instead sets the field width, which right-justifies the output to make it more readable.
More examples of decimal-to-binary conversion
The d2bconvert program (see Listing 2-6) shows yet again how information may be lost in converting from decimal to binary. In these isolated examples, of course, no harm is done; but these cases underscore that floating-point types such as float and double are not suited for applications involving, for instance, currency.
2.3.2 IEEE 754 Floating-Point Types
For reference, the written exponent comprises the 8 bits depicted previously. In the discussion that follows, the written exponent is contrasted with the actual exponent. Also, the written magnitude comprises the 23 bits shown previously and is contrasted with the actual magnitude.
If the written exponent field contains a mix of 0s and 1s, the value is normalized.
If the written exponent field contains only 0s, the value is denormalized.
If the written exponent field contains only 1s, the value is special.
For the sample value -1110110.101 (-118.625 in decimal), the implicit leading 1 is obtained by moving the binary point six places to the left, which yields -1.110110101 × 2^6. The written magnitude is then the fractional part 110110101.
In summary, the decimal value -118.625 has a written exponent of 133 in IEEE 754 (the actual exponent plus the 32-bit format’s bias of 127), but an actual exponent of 6.
The middle field alone, the 8-bit exponent, indicates that this value is indeed normalized: the written exponent contains a mix of 0s and 1s.
The IEEE representation of zero is intuitive in that every bit—except, perhaps, the sign bit—is a 0. A denormalized value does not have an implicit leading 1, and the actual exponent has a fixed value of -126 in the 32-bit case. The written exponent is always all 0s.
Positive denormalized and normalized values
Binary | Decimal |
---|---|
0 00000001 00000000000000000000000 | 1.175494350822e-38 |
0 00000000 11111111111111111111111 | 1.175494210692e-38 |
0 00000000 00000000000000000000001 | 1.401298464325e-45 |
The value in the middle row has all 0s in the exponent, which makes the value denormalized. This value is the largest denormalized value in 32 bits, but this value is still smaller than the very small normalized value above it. The smallest denormalized value, the bottom row in the table, has a single 1 as the least significant bit: all the rest are 0s. Between the smallest and the largest denormalized values are many more, all differing in the bit pattern of the written magnitude. Although the denormalized values shown so far are positive, there are negative ones as well: the sign bit is 1 for such values.
In summary, denormalized values cover the two representations of zero, as well as evenly spaced values that are close to zero. The preceding examples show that the gap between the smallest positive normalized value and positive zero is considerable and filled with denormalized values.
Special values under the IEEE 754 specification
The floating-point units (FPUs) of modern computers commonly follow the IEEE specification; modern languages, including C, do so in any case. There are heated discussions within the computing community on the merits of the IEEE specification, but there is little doubt that this specification is now a de facto standard across programming languages and systems.
In the flag -lm (lowercase L followed by m), the -l stands for link, and the m identifies the standard mathematics library libm, which resides in a file such as libm.so on the compiler/linker search path (e.g., in a directory such as /usr/lib or /usr/local/lib). Note that the prefix lib and the file extension so fall away in a link specification, leaving only the m for the mathematics library.
The linking is needed because the specVal program calls the sqrt function from the mathematics library. A compilation command may contain several explicit link flags in the same style shown previously: -l followed by the name of the library without the prefix lib and without the library extension such as so.
During compilation, libraries such as the standard C library and the input/output library are linked in automatically. Other libraries, such as the mathematics and cryptography libraries, must be linked in explicitly. In Chapter 8, the section on building libraries goes into more detail on linking.
2.4 Arithmetic, Bitwise, and Boolean Operators
C has the usual arithmetic, bitwise, and boolean (relational) operators. Recall that even the character types char and wchar_t, and the makeshift-boolean type (zero for false, nonzero for true), are fundamentally arithmetic types. However, some operators are ill-suited for some types. For example, floating-point values should not be bit-shifted, left or right.
The second line uses a cast operation, which is an explicit type-conversion operation; in this case, the floating-point value of variable f is converted to an int value so that the compiler does not complain. (The syntax of casts is covered in the following sidebar.) In the shift operation, << represents a left shift, and >> represents a right shift. To the left of the shift operator is the value (in this case, variable f) to be shifted, and to the right is the number of bit places to shift. On left shifts, the vacated positions are filled with 0s.
If the preceding example were to omit the cast operation, the compiler would complain, with an error rather than just a warning, that the left operand to << should be an int, not a float. To get by the compiler, the code segment thus includes the cast operation.
It should be emphasized that a cast operation is not an assignment operation. In this example, the original value 123.456 is still stored in variable f; the cast produces a temporary int value and leaves f unchanged. The salient point is that floating-point values, in general, should not be shifted at all. The shift operation is intended for integer values only, and even then caution is in order, as later examples illustrate.
For convenience, the following subsections divide the operators into the traditional categories of arithmetic, bitwise, and boolean (relational). Miscellaneous operators such as sizeof and the cast will continue to be clarified as needed.
2.4.1 Arithmetic Operators
Binary arithmetic operators
Operation | C | Example |
---|---|---|
Addition | + | 12 + 3 |
Subtraction | - | 12 - 3 |
Multiplication | * | 12 * 3 |
Division | / | 12 / 3 |
Modulus | % | 12 % 3 |
Operator association and precedence
The assoc program (see Listing 2-8) shows how expressions can be parenthesized in order to get the desired association when mixed operations are in play. The use of parentheses seems easier than trying to recall precedence details, and parenthesized expressions are, in any case, easier to read.
2.4.2 Boolean Operators
In a conjunction such as (3 < 2) && (4 > 2), the second conjunct (4 > 2) is not evaluated: a conjunction is true only if each of its conjuncts is true, and the first conjunct (3 < 2) is false, thereby making the entire expression false.
Richer examples are yet to come.
2.4.3 Bitwise Operators
The bit-level representation of n starts out 01110..., with the leftmost bit as the sign bit 0 for nonnegative. The 1-bit left shift moves a 1 into the sign position, which accounts for the change in sign from 1,879,048,192 to -536,870,912. Recall that, in left shifts, the vacated bit positions are filled with 0s.
Reversing the endian-ness of a multibyte data item
Recall that each hex digit is 4 bits. Accordingly, the leftmost byte in variable n is 12, and the rightmost is cd.
Systems such as Linux provide a header file endian.h, which is not part of the C standard, that declares various functions for transforming little-endian formats to big-endian formats, and vice versa. These functions specify the bit sizes on which they work: 16 (2 bytes), 32 (4 bytes), and 64 (8 bytes).
The variable n is the symbolic name of a memory location or CPU register, and the expression n is thus an lvalue: it designates storage to which a value can be assigned.
2.5 What’s Next?
The pointer variable str identifies a collection (in this case, an array) of five characters: the ones shown and the null terminator. The expression str[0] refers to the first of the variables that hold a character, lowercase a in this example. Pointer str thus identifies an aggregate rather than just a single variable.
Arrays and structures are the primary aggregates in C. Pointers also deserve a closer look because they dominate in efficient, production-grade programming. The next chapter focuses on aggregates and pointers.
C is a small, strictly procedural or imperative language. C++ is a large language that can be used in procedural style but also includes object-oriented features (e.g., classes, inheritance, and polymorphism) not found in C. C++, unlike C, has generic collection types. A C++ program can include orthodox C code, but much depends on the compiler; further, header files and the corresponding libraries may differ in name and location between the two languages. The two languages share history and features but are distinct.