Appendix A: Computer Arithmetic
I have deliberately kept discussion of number bases and arithmetic to a minimum in the chapters of this book. However, it’s important to have some understanding of this, so I’m summarizing the subject in this appendix. If you feel confident in your math skills, this review will be old hat for you. If you think the math parts are going to be tough, then this section should show you how easy it really is.
Binary Numbers
First, let’s consider exactly what you intend when you write a common, everyday decimal number, such as 324 or 911. Obviously, what you mean is “300 and 24” or “900 and 11.” These are a shorthand way of saying “300” plus “two 10s” plus “4” and “900” plus “one 10” plus “1.” Put more numerically and more precisely, you really mean
324 is 3 × 10² + 2 × 10¹ + 4 × 10⁰, which is 3 × 10 × 10 + 2 × 10 + 4
911 is 9 × 10² + 1 × 10¹ + 1 × 10⁰, which is 9 × 10 × 10 + 1 × 10 + 1
We call this decimal notation because it’s built around powers of 10. (This is derived from the Latin decimalis, meaning “of tithes,” which was a tax of 10 percent. Ah, those were the days!) We also say that we are representing numbers to base 10 here because each digit position corresponds to a power of 10.
Representing numbers in this way is very handy for people with ten fingers and ten toes, or indeed ten of any kind of appendage. However, your PC is rather less handy, being built mainly of switches that are either on or off. It’s okay for counting up to two, but not spectacular at counting to ten. I’m sure you’re aware that this is the primary reason why your computer represents numbers using base 2 rather than base 10. Representing numbers using base 2 is called the binary system of counting. With numbers expressed using base 10, digits can be from zero to nine inclusive. In general, if you are representing numbers in an arbitrary base, n, the digit in each position in a number can be from 0 to n–1. Thus, with binary numbers, digits can only be zero or one, which is ideal when you only have on and off switches to represent them. In an exact analogy to the system of counting in base 10, the binary number 1101, for example, breaks down like this:

1 × 2³ + 1 × 2² + 0 × 2¹ + 1 × 2⁰, which is 8 + 4 + 0 + 1
This amounts to 13 in the decimal system. In Table A-1, you can see the decimal equivalents of all the possible numbers you can represent using eight binary digits (a binary digit is more commonly known as a bit).
Table A-1. Decimal Equivalents of 8-Bit Binary Values
Binary | Decimal | Binary | Decimal |
---|---|---|---|
0000 0000 | 0 | 1000 0000 | 128 |
0000 0001 | 1 | 1000 0001 | 129 |
0000 0010 | 2 | 1000 0010 | 130 |
. . . | . . . | . . . | . . . |
0001 0000 | 16 | 1001 0000 | 144 |
0001 0001 | 17 | 1001 0001 | 145 |
. . . | . . . | . . . | . . . |
0111 1100 | 124 | 1111 1100 | 252 |
0111 1101 | 125 | 1111 1101 | 253 |
0111 1110 | 126 | 1111 1110 | 254 |
0111 1111 | 127 | 1111 1111 | 255 |
Notice that using the first 7 bits, you can represent numbers from 0 to 127, which is a total of 128 numbers, and that using all 8 bits, you get 256 or 2⁸ numbers. In general, if you have n bits available, you can represent 2ⁿ integers, with values from 0 to 2ⁿ–1.
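If you’d like to check the 2ⁿ rule on your own machine, the following short program is a minimal sketch (my own illustration, not part of the original text). The << operator shifts a one left by n places, which multiplies it by 2ⁿ:

#include <stdio.h>

int main(void)
{
  // With n bits you can represent 2 to the power n values, from 0 up to (2 to the n) - 1.
  for (unsigned int n = 1; n <= 8; n++)
  {
    unsigned int count = 1u << n;                       // 2 to the power n
    printf("%u bits: %3u values, 0 to %3u\n", n, count, count - 1u);
  }
  return 0;
}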
Adding binary numbers inside your computer is a piece of cake, because the “carry” from adding corresponding digits can only be zero or one. This means that very simple circuitry can handle the process. Figure A-1 shows how the addition of two 8-bit binary values would work.
The addition operation starts with adding the rightmost bits in the numbers. Figure A-1 shows that there is a “carry” of 1 to the next bit position for each of the first six bit positions. This is because each digit can only be zero or one. When you add 1 + 1, the result cannot be stored in the current bit position and is equivalent to one in the next bit position to the left.
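You can also watch the carries ripple for yourself in a short program. The following sketch uses two example values of my own choosing (not necessarily those in Figure A-1); adding 0011 1111 and 0000 0001 produces a carry out of each of the first six bit positions:

#include <stdio.h>

// Print the low 8 bits of a value, most significant bit first.
static void print_bits(unsigned char byte)
{
  for (int i = 7; i >= 0; i--)
    putchar(((byte >> i) & 1) ? '1' : '0');
  putchar('\n');
}

int main(void)
{
  unsigned char a = 0x3F;                  // 0011 1111, decimal 63
  unsigned char b = 0x01;                  // 0000 0001, decimal 1
  print_bits(a);
  print_bits(b);
  print_bits((unsigned char)(a + b));      // 0100 0000, decimal 64
  return 0;
}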
Hexadecimal Numbers
When you start dealing with larger binary numbers, a small problem arises when you want to write them down. Look at this one:

1111 0101 1011 1001 1110 0001

Binary notation here starts to be more than a little cumbersome for practical use, particularly when you consider that if you work out what this is in decimal, it’s only 16,103,905—a miserable eight decimal digits. You can sit more angels on the head of a pin than that! Clearly, you need a more economical way of writing this, but decimal isn’t always appropriate. Sometimes you might need to be able to specify that the 10th and 24th bits from the right are set to one, but without the overhead of writing out all the bits in binary notation. To figure out the decimal integer required to do this sort of thing is hard work, and there’s a good chance you’ll get it wrong anyway. A much easier solution is to use hexadecimal notation, in which the numbers are represented using base 16.
Arithmetic to base 16 is a much more convenient option, and it fits rather well with binary. Each hexadecimal digit can have values from 0 to 15, and the digits from 10 to 15 are represented by the letters A–F (or a–f), as shown in Table A-2. Values from 0 to 15 happen to correspond nicely with the range of values that four binary digits can represent.
Table A-2. Hexadecimal Digits and Their Values in Decimal and Binary
Hexadecimal | Decimal | Binary |
---|---|---|
0 | 0 | 0000 |
1 | 1 | 0001 |
2 | 2 | 0010 |
3 | 3 | 0011 |
4 | 4 | 0100 |
5 | 5 | 0101 |
6 | 6 | 0110 |
7 | 7 | 0111 |
8 | 8 | 1000 |
9 | 9 | 1001 |
A | 10 | 1010 |
B | 11 | 1011 |
C | 12 | 1100 |
D | 13 | 1101 |
E | 14 | 1110 |
F | 15 | 1111 |
Because a hexadecimal digit corresponds to four binary digits, you can represent any large binary number as a hexadecimal number simply by taking groups of four binary digits, starting from the right, and writing the equivalent hexadecimal digit for each group. Look at the following binary number:

1111 0101 1011 1001 1110 0001

Taking each group of 4 bits in turn and replacing it with the corresponding hexadecimal digit from the table, this number expressed in hexadecimal notation will come out as follows:

F5B9E1

You have six hexadecimal digits corresponding to the six groups of four binary digits. Just to prove that it all works out with no cheating, you can convert this number directly from hexadecimal to decimal by again using the analogy with the meaning of a decimal number. The value of this hexadecimal number therefore works out as follows.

F5B9E1 as a decimal value is given by

15 × 16⁵ + 5 × 16⁴ + 11 × 16³ + 9 × 16² + 14 × 16¹ + 1 × 16⁰

This turns out to be

15,728,640 + 327,680 + 45,056 + 2,304 + 224 + 1

Thankfully, this adds up to the same number you got when converting the equivalent binary number to a decimal value: 16,103,905.
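If you want to confirm this with the compiler rather than by hand, here is a minimal sketch (my own illustration) that prints the same value in hexadecimal and in decimal; the %X and %u conversion specifiers of printf() handle the conversions for you:

#include <stdio.h>

int main(void)
{
  unsigned int n = 0xF5B9E1;                 // the hexadecimal value from the text
  printf("%X is %u in decimal\n", n, n);     // prints F5B9E1 is 16103905
  return 0;
}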
The other very handy coincidence with hexadecimal numbers is that modern computers store integers in words that are even numbers of bytes, typically 2, 4, 8, or 16 bytes. A byte is 8 bits, which is exactly two hexadecimal digits, so any binary integer word in memory always corresponds to an exact number of hexadecimal digits.
Negative Binary Numbers
There’s another aspect to binary arithmetic that you need to understand: negative numbers. So far, you’ve assumed that everything is positive—the optimist’s view, if you will—and so the glass is still half full. But you can’t avoid the negative side of life—the pessimist’s perspective—that the glass is already half empty. How can a negative number be represented inside a computer? Well, you have only binary digits at your disposal, so the solution has to be to use at least one of those to indicate whether the number is negative or positive.
For numbers that you want to allow to have negative values (referred to as signed numbers), you must first decide on a fixed length (in other words, the number of binary digits) and then designate the leftmost binary digit as a sign bit. You have to fix the length to avoid any confusion about which bit is the sign bit.
As you know, your computer’s memory consists of 8-bit bytes, so the binary numbers are going to be stored in some multiple (usually a power of 2) of 8 bits. Thus, you can have some numbers with 8 bits, some with 16 bits, and some with 32 bits or whatever. As long as you know what the length is in each case, you can find the sign bit—it’s just the leftmost bit. If the sign bit is zero, the number is positive, and if it’s one, the number is negative.
This seems to solve the problem, and in some computers it does. Each number consists of a sign bit that is zero for positive values and one for negative values, plus a given number of bits that specify the absolute value of the number—unsigned, in other words. Changing +6 to –6 then just involves flipping the sign bit from zero to one. Unfortunately, this representation carries a lot of overhead with it in terms of the complexity of the circuits that are needed to perform arithmetic. For this reason, most computers take a different approach. You can get the idea of how this approach works by considering how the computer would handle arithmetic with positive and negative values so that operations are as simple as possible.
Ideally, when two integers are added, you don’t want the computer to be searching about, checking whether either or both of the numbers are negative. You just want to use simple “add” circuitry regardless of the signs of the operands. The add operation will combine corresponding binary digits to produce the appropriate bit as a result, with a carry to the next digit along where this is necessary. If you add –8 in binary to +12, you would really like to get the answer +4 using the same circuitry that would apply if you were adding +3 and +8.
If you try this with the simplistic solution, which is just to set the sign bit of the positive value to one to make it negative, and then perform the arithmetic with conventional carries, it doesn’t quite work:
12 in binary is | 0000 1100 |
–8 in binary (you suppose) is | 1000 1000 |
If you now add these together, you get | 1001 0100 |
This seems to be –20, which isn’t what you wanted at all. It’s definitely not +4, which you know is 0000 0100. “Ah,” I hear you say, “you can’t treat a sign just like another digit.” But that is just what you do want to do.
You can see how the computer would like to represent –8 by subtracting +12 from +4 and seeing what the result is:
+4 in binary is | 0000 0100 |
+12 in binary is | 0000 1100 |
Subtract the latter from the former and you get | 1111 1000 |
For each digit from the fourth from the right onward, you had to “borrow” 1 to do the subtraction, just as you would when performing ordinary decimal arithmetic. This result is supposed to be –8, and even though it doesn’t look like it, that’s exactly what it is. Just try adding it to +12 or +15 in binary, and you’ll see that it works! Of course, if you want to produce –8, you can always do so by subtracting +8 from 0.
What exactly did you get when you subtracted 12 from 4, or +8 from 0 for that matter? It turns out that what you have here is called the two’s complement representation of a negative binary number, and you can produce this from any positive binary number by a simple procedure you can perform in your head. At this point, I need to ask for a little faith on your part and avoid getting into explanations of why it works. I’ll just show you how the two’s complement form of a negative number can be constructed from a positive value, and you can prove to yourself that it does work. Let’s return to the previous example, in which you need the two’s complement binary representation of –8.
You start with +8 in binary:

0000 1000

You now “flip” each binary digit, changing 0s to 1s and vice versa:

1111 0111

This is called the one’s complement form, and if you now add 1 to this, you’ll get the two’s complement form:

1111 1000
This is exactly the same as the representation of –8 you got by subtracting +12 from +4. Just to make absolutely sure, let’s try the original sum of adding –8 to +12:
+12 in binary is | 0000 1100 |
Your version of –8 is | 1111 1000 |
If you add these together, you get | 0000 0100 |
The answer is 4—magic. It works! The “carry” propagates through all the leftmost ones, setting them back to zero. One fell off the end, but you shouldn’t worry about that—it’s probably compensating for the one you borrowed from the end in the subtraction sum you did to get –8. In fact, what’s happening is that you’re implicitly assuming that the sign bit, one or zero, repeats forever to the left. Try a few examples of your own; you’ll find it always works, automatically. The really great thing about the two’s complement representation of negative numbers is that it makes arithmetic very easy (and fast) for your computer.
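If you want to see two’s complement at work in C, here is a minimal sketch, assuming (as is true of essentially every modern machine) that signed integers are stored in two’s complement form. Flipping the bits with the ~ operator and adding 1 negates the value:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  int8_t x = 8;
  int8_t minus_x = (int8_t)(~x + 1);   // flip every bit, then add 1
  printf("%d\n", minus_x);             // prints -8
  printf("%d\n", 12 + minus_x);        // prints 4, using ordinary addition
  return 0;
}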
Big-Endian and Little-Endian Systems
As I have discussed, integers generally are stored in memory as binary values in a contiguous sequence of bytes, commonly groups of 2, 4, 8, or 16 bytes. The question of the sequence in which the bytes appear can be very important—it’s one of those things that doesn’t matter until it matters, and then it really matters.
Let’s consider the decimal value 262,657 stored as a 4-byte binary value. I chose this value because in binary it happens to be

0000 0000 0000 0100 0000 0010 0000 0001

So each byte has a pattern of bits that is easily distinguished from the others.
If you’re using a PC with an Intel processor, the number will be stored as follows:
Byte address: | 00 | 01 | 02 | 03 |
Data bits: | 0000 0001 | 0000 0010 | 0000 0100 | 0000 0000 |
As you can see, the most significant 8 bits of the value—the one that’s all zeros—are stored in the byte with the highest address (last, in other words), and the least significant 8 bits are stored in the byte with the lowest address, which is the leftmost byte. This arrangement is described as a little-endian system.
If you’re using a mainframe computer, a workstation, or a Mac based on a Motorola processor (more recent Intel-based Macs are little endian), or if the data follow network byte order (the ICMP, TCP, and UDP protocols commonly transmit multibyte values big end first), the same data are likely to be arranged in memory like this:
Byte address: | 00 | 01 | 02 | 03 |
Data bits: | 0000 0000 | 0000 0100 | 0000 0010 | 0000 0001 |
Now the bytes are in reverse sequence with the most significant 8 bits stored in the leftmost byte, which is the one with the lowest address. This arrangement is described as a big-endian system.
This is all very interesting, you might say, but why should it matter? Most of the time, it doesn’t. More often than not, you can happily write your C program without knowing whether the computer on which the code will execute is big endian or little endian. It does matter, however, when you’re processing binary data that come from another machine. Binary data will be written to a file or transmitted over a network as a sequence of bytes. It’s up to you how you interpret these data. If the source of the data is a machine with a different endianness from the machine on which your code is running, you must reverse the order of the bytes in each binary value. If you don’t, you will have garbage.
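As a sketch of what that reversal involves (my own illustration, not from the original text), the following function flips the byte order of a 32-bit value using shifts and masks; for network data, library functions such as ntohl() and htonl() perform the same kind of conversion:

#include <stdio.h>
#include <stdint.h>

// Reverse the order of the 4 bytes in a 32-bit value.
uint32_t swap_bytes(uint32_t v)
{
  return ((v & 0x000000FFu) << 24) |
         ((v & 0x0000FF00u) <<  8) |
         ((v & 0x00FF0000u) >>  8) |
         ((v & 0xFF000000u) >> 24);
}

int main(void)
{
  uint32_t n = 0x00040201u;                          // 262,657 from the text
  printf("%#010x becomes %#010x\n", (unsigned)n, (unsigned)swap_bytes(n));
  return 0;                                          // 0x00040201 becomes 0x01020400
}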
For those who collect curious background information, the terms big endian and little endian are drawn from the book Gulliver’s Travels by Jonathan Swift. In the story, the emperor of Lilliput commanded all his subjects to always crack their eggs at the smaller end. This was a consequence of the emperor’s son having cut his finger following the traditional approach of cracking his egg at the big end. Ordinary, law-abiding Lilliputian subjects who cracked their eggs at the smaller end were described as Little Endians. The Big Endians were a rebellious group of traditionalists in the Lilliputian kingdom who insisted on continuing to crack their eggs at the big end. Many were put to death as a result.
Continuing with the number 262,657 and its little- and big-endian representations, we can test it with the following code. The program walks through the 4 bytes of the number in memory and prints each one. It also uses a second integer (int e) with the value 1 to detect the architecture’s endianness: if the first byte of e holds a one, the machine is little endian; if it holds a zero, it is big endian:
// Program A.1 Checking endianness
#include <stdio.h>

int main(void)
{
  int n = 0x40201;                  // 0x40201 = 262,657
  char *p = (char*)&n;              // points to the first byte of n
  int e = 0x1;
  char *q = (char*)&e;              // points to the first byte of e

  // Assuming type int occupies 4 bytes:
  for (int i = 0; i < 4; i++)
  {
    printf("memory address: %p: value: %d\n", (void*)(p + i), p[i]);
  }

  if (q[0] == 1)                    // checking endianness
  {
    printf("\nIt's Little-Endian.\n");
  }
  else
  {
    printf("\nIt's Big-Endian.\n");
  }
  return 0;
}
Here’s an example of some output from this program on my machine:
memory address: 00000000002df785: value: 1
memory address: 00000000002df786: value: 2
memory address: 00000000002df787: value: 4
memory address: 00000000002df788: value: 0
Floating-Point Numbers
We often have to deal with very large numbers—the number of protons in the universe, for example—which need around 79 decimal digits. Clearly there are lots of situations in which you’ll need more than the ten decimal digits you get from a 4-byte binary number. Equally, there are lots of very small numbers, for example, the amount of time in minutes it takes the typical car salesperson to accept your generous offer on a 2001 Honda (which has covered only 480,000 miles…). A mechanism for handling both these kinds of numbers is, as you may have guessed, floating-point numbers.
A floating-point representation of a number in decimal notation is a decimal value, called the mantissa, which is greater than or equal to 0.0 and less than 1.0 with a fixed number of digits, with this value multiplied by a power of 10 to produce the actual value. This power of 10 is called the exponent. It’s easier to demonstrate this than to describe it, so let’s look at some examples. The number 365 in normal decimal notation could be written in floating-point form as follows:

0.3650000E+03

The E stands for “exponent” and precedes the power of 10 that the 0.3650000 (the mantissa) part is multiplied by to get the required value. That is

0.3650000 × 10³, which is 0.3650000 × 1,000

This is clearly 365.
The mantissa in the number here has seven decimal digits. The number of digits of precision in a floating-point number will depend on how much memory it is allocated. A single-precision floating-point value occupies 4 bytes where the 32 bits are allocated like this: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. Thus, a single-precision floating-point value will provide approximately seven decimal digits’ accuracy. I say “approximately” because a binary fraction with 23 bits doesn’t exactly correspond to a decimal fraction with seven decimal digits.
A double-precision binary floating-point value occupies 8 bytes with the exponent using 11 bits after the sign bit and the mantissa occupying the remaining 52 bits with an implied leading binary digit that is 1.
Now let’s look at a small number:

0.3650000E-04

This is evaluated as 0.365 × 10⁻⁴, which is 0.0000365—exactly the time in minutes required by the car salesperson to accept your cash.
Suppose you have a large number such as 2,134,311,179. How does this look as a floating-point number? Well, it would look like this:

0.2134311E+10

It’s not quite the same. You’ve lost three low-order digits, and you’ve approximated your original value as 2,134,311,000. This is a small price to pay for being able to handle such a vast range of numbers, typically from 10⁻³⁸ to 10⁺³⁸ either positive or negative, as well as having an extended representation that goes from a minute 10⁻³⁰⁸ to a mighty 10⁺³⁰⁸. They’re called floating-point numbers for the fairly obvious reason that the decimal point “floats” and its position depends on the exponent value.
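The actual limits on your machine are recorded in the standard float.h header; this short sketch prints the range and decimal precision of float and double on whatever implementation you compile it with:

#include <stdio.h>
#include <float.h>

int main(void)
{
  printf("float:  %g to %g with %d decimal digits\n", FLT_MIN, FLT_MAX, FLT_DIG);
  printf("double: %g to %g with %d decimal digits\n", DBL_MIN, DBL_MAX, DBL_DIG);
  return 0;
}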
Aside from the fixed precision limitation in terms of accuracy, there’s another aspect of which you may need to be conscious. You need to take great care when adding or subtracting numbers of significantly different magnitudes. A simple example will demonstrate the problem. You can first consider adding 0.365E-3 to 0.365E+7. You can write this as a decimal sum:

3,650,000 + 0.000365

This produces this result:

3,650,000.000365

When converted back to a floating-point value with seven digits of precision, this becomes

0.3650000E+07

Adding 0.365E-3 to 0.365E+7 has had no effect whatsoever, so you might as well not have bothered. The problem lies directly with the fact that you carry only six or seven digits of precision. The digits of the larger number aren’t affected by any of the digits of the smaller number because they’re all farther to the right. Oddly enough, you must also take care when the numbers are nearly equal. If you compute the difference between such numbers, you may end up with a result that has only one or two digits of precision. It’s quite easy in such circumstances to end up computing with numbers that are totally garbage.
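You can demonstrate both problems in a few lines of C. This is a minimal sketch (my own illustration) using float, which typically carries about seven decimal digits of precision:

#include <stdio.h>

int main(void)
{
  float big   = 0.365E+7f;                  // 3,650,000
  float small = 0.365E-3f;                  // 0.000365
  printf("big + small == big? %s\n", (big + small == big) ? "yes" : "no");  // yes

  // Subtracting nearly equal values: the true difference is 0.0000001,
  // but most of the digits of precision have been lost.
  float a = 1.0000001f;
  float b = 1.0000000f;
  printf("a - b = %.10g\n", a - b);         // prints roughly 1.192092896e-07
  return 0;
}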
While floating-point numbers enable you to carry out calculations that would be impossible without them, you must always keep their limitations in mind if you want to be sure your results are valid. This means considering the range of values you are likely to be working with and their relative values.