B

Computer Arithmetic

In the chapters of this book, I have deliberately kept discussion of binary arithmetic to a minimum. However, it is important overall, and fundamental to understanding how some operators work, so I’m including a summary of the subject in this appendix. If you feel confident about your math knowledge, this is all old hat to you and you need read no further. If you find the math parts tough, then this appendix should show you how easy it really is.

BINARY NUMBERS

First let’s consider what you mean when you write a common everyday number such as 321 or 747. Put more precisely, you mean

321 is:

3 × (10 × 10) + 2 × (10) + 1

and 747 is:

7 × (10 × 10) + 4 × (10) + 7

Because it is built around powers of ten, you call this the decimal system (derived from the Latin decimalis, meaning of tithes, which was a tax of 10 percent — ah, those were the days . . .).

Representing numbers in this way is very handy for people with ten fingers and ten toes, or creatures with ten of any kind of appendage for that matter. However, your PC is quite unhandy in this context, being built mainly of switches that are either on or off. This is okay for counting up to two, but not spectacular at counting to ten. For this reason your computer represents numbers to base 2 rather than base 10. This is called the binary system of counting, analogous to the bicycle (two wheels), but nothing whatever to do with bibacity, which means an inclination to drink a lot. With the decimal system, to base 10, the digits used can be from 0 to 9. In the binary system, to base 2, the digits can only be 0 or 1, ideal when you have only on/off switches to represent them; off is usually 0, and on is 1 — simple. Each digit in the binary system is called a bit, which is a contraction of binary digit. In an exact analogy to the usual base 10 counting, the binary number 1101 is therefore:

1 × (2 × 2 × 2) + 1 × (2 × 2) + 0 × (2) + 1

which, if you work it out, amounts to 13 in the decimal system. In Table B-1, you can see the decimal equivalents of 8-bit binary numbers illustrated.

TABLE B-1: Decimal Equivalents of 8-bit Binary Numbers

[Table B-1 not reproduced here]

Note that by using just 7 bits, you can represent all the decimal numbers from 0 to 127, which is a total of 2⁷, or 128 numbers; and using all 8 bits you get 256, which corresponds to 2⁸ numbers. In general, if you have n bits available, you can represent 2ⁿ positive integers with values from 0 to 2ⁿ − 1. Of course, I am only talking in the context of positive numbers so far. If you also need the same number of negative numbers, you need more bits. I get to negative numbers in a moment.
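If you want to check calculations like these for yourself, the standard Java library will do the conversions for you. The little program below is just a sketch to experiment with: Integer.parseInt with a radix of 2 and Integer.toBinaryString are standard library methods, and shifting 1 left by n positions is a quick way of computing 2 to the power n.

public class BinaryCheck {
    public static void main(String[] args) {
        // The binary digits 1101 interpreted as a base 2 number
        System.out.println(Integer.parseInt("1101", 2));     // prints 13

        // Going the other way: the binary digits for decimal 13
        System.out.println(Integer.toBinaryString(13));      // prints 1101

        // An n-bit field holds 1 << n values, the largest being (1 << n) - 1
        int n = 8;
        System.out.println((1 << n) + " values, 0 to " + ((1 << n) - 1));   // prints 256 values, 0 to 255
    }
}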

Hexadecimal Numbers

When you get to work with larger binary numbers — for example, numbers with 24 bits:

1111 0101 1011 1001 1110 0001

the notation starts to be a little cumbersome, particularly when you consider that if you apply the method you saw in the previous section to work out what this is in decimal notation, it’s only 16,103,905, a miserable 8 decimal digits. You can sit more angels on a pinhead than that. Well, as it happens, you have an excellent alternative.

Arithmetic to base 16 is a very convenient option. Numbers to base 16 are hexadecimal numbers. Each digit can have values from 0 to 15 (the digits from 10 to 15 being represented by the letters A to F as shown in Table B-2, or by a to f if you’re averse to capitalization) and values from 0 to 15 correspond quite nicely with the range of values that four binary digits can represent.

TABLE B-2: Base 16 Conversion Table

HEXADECIMAL DECIMAL BINARY
0 0 0000
1 1 0001
2 2 0010
3 3 0011
4 4 0100
5 5 0101
6 6 0110
7 7 0111
8 8 1000
9 9 1001
A 10 1010
B 11 1011
C 12 1100
D 13 1101
E 14 1110
F 15 1111

Because a hexadecimal digit corresponds exactly to four binary bits, you can represent a binary number as a hexadecimal number just by taking successive groups of four binary digits starting from the right, and writing the equivalent base 16 digit for each group. Look at this binary number:

1111 0101 1011 1001 1110 0001

If you replace each group of four bits with the equivalent hexadecimal digit, you get:

F5B9E1

You have six hexadecimal digits corresponding to the six groups of four binary digits. Just to show it all really works out with no cheating, you can convert this number directly from hexadecimal to decimal, by again using the analogy with the meaning of a decimal number, as follows:

F5B9E1 is:

15 × (16 × 16 × 16 × 16 × 16) +

5 × (16 × 16 × 16 × 16) +

11 × (16 × 16 × 16) +

9 × (16 × 16) + 14 × (16) + 1

This in turn works out to be:

15,728,640 + 327,680 + 45,056 + 2,304 + 224 + 1

which fortunately totals to the same number that you got when you converted the original binary number to a decimal value.
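You can also let Java confirm the arithmetic. Dropped into a main() method, this fragment is a simple sketch using the standard Integer.parseInt, Integer.toHexString, and Integer.toBinaryString methods; the 0x prefix marks a hexadecimal literal:

// The hexadecimal digits F5B9E1 interpreted as a base 16 number
System.out.println(Integer.parseInt("F5B9E1", 16));    // prints 16103905

// A hexadecimal literal gives the same value directly
int value = 0xF5B9E1;
System.out.println(Integer.toHexString(value));        // prints f5b9e1
System.out.println(Integer.toBinaryString(value));     // prints 111101011011100111100001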

Octal Numbers

In Chapter 2 you read that you can define octal integer literals, which are integers to base 8. These were invented by Charles XII of Sweden in 1717, along with the idea of representing numbers to base 64, but this is not the reason they appear in Java. The octal representation is there because historically computers stored binary integers in units that were multiples of 3 bits, the early-1960’s vintage IBM 7090 with 36-bit words to store integers being just one example where 12 octal digits could be used to specify a 36-bit binary value. Some of the programming languages of the time offered the economy of writing binary values as octal. As programming languages evolved, the octal representation was carried along down through the years from one language to the next bigger, better, and more powerful language, right up to today. There they are, octal numbers, still lurking in C, C++, and now Java, even though they have very little purpose.

Octal digits can have the values from 0 to 7 and computation is exactly analogous to hexadecimal arithmetic, with 8 instead of 16 as the base. Octal literals are easily confused with decimal integers. The only differentiation is the leading 0 (zero) in the representation of octal literals in Java. Unless you have discovered a situation where they are essential, or you enjoy a perverse approach to programming, octal literals are best avoided.
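As a sketch of how easily the confusion arises, compare these two literals inside a main() method; the only visible difference is the leading zero:

int decimal = 10;    // an ordinary decimal literal: ten
int octal = 010;     // the leading zero makes this an octal literal: eight
System.out.println(decimal + " and " + octal);    // prints 10 and 8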

Negative Binary Numbers

Another aspect to binary arithmetic that you need to understand is how negative numbers are represented. So far you have assumed everything is positive — the optimist’s view, if you will — that your glass is still half full. But you can’t avoid the negative side of life forever — the pessimist’s perspective that your glass is already half empty. How do you indicate a negative number? Well, you have only binary digits at your disposal, so they must contain the solution.

For numbers where you want to allow the possibility of negative values (referred to as signed numbers) you must first decide on a fixed length (in other words, fix the number of binary digits in a number) and then designate the leftmost binary digit as a sign bit. You have to fix the length to avoid any confusion about which bit is the sign bit as opposed to other bits that are digits. A single bit is quite capable of representing the sign of a number because a number can be either positive — corresponding to a sign bit being 0, or negative — indicated by the sign bit being 1.

Of course, you can have some numbers with 8 bits, and some with 16 bits, or whatever number of bits you like, as long as you know what the length is in each case. If the sign bit is 0 the number is positive, and if it is 1, the number is negative. This would seem to solve the problem, but not quite. If you add −8 in binary to +12 you would really like to get the answer +4. If you carry out that operation simplistically, just putting the sign bit of the positive value to 1 to make it negative, and then doing the arithmetic with conventional carries from one bit position to the next on the left, it doesn’t quite work:

12 in binary is 0000 1100
−8 in binary you suppose is 1000 1000

Because +8 is 0000 1000, the binary representation for −8 is the same, but with the leftmost bit set to 1. If we now add these together we get:

12 + (−8) is 1001 0100

The value 1001 0100 seems to be −20 according to the rules, which is not what you wanted at all. It’s definitely not +4, which you know is 0000 0100. “Ah,” I hear you say, “you can’t treat a sign just like another digit.” But that is just what you do have to do when dealing with computers because, dumb things that they are, they have trouble coping with anything else. So you really need a different representation for negative numbers if the same process for addition is to work regardless of the sign of the operands. Well, because the same process for arithmetic operations should work regardless of the signs of the operands, you could try subtracting +12 from +4 and see what you get. Whatever the result is should be −8:

+4 is 0000 0100
Take away +12 0000 1100
and you get 1111 1000

For each digit from the fourth from the right onward you had to borrow 1 to do the sum, analogously to our usual decimal method for subtraction. This supposedly is −8, and even though it doesn’t look much like it, it really is. Just try adding it to +12 or +15 in binary and you see that it works. So what is it? It turns out that the answer is what is called the 2’s complement representation of negative binary numbers.

Now here I am going to demand a little faith on your part and avoid getting into explanations of why it works. I’m just showing you how the 2’s complement form of a negative number can be constructed from a positive value and that it does work, so you can prove it to yourself. Let’s return to the previous example where you need the 2’s complement representation of −8. We start with +8 in binary:

0000 1000

You now flip each binary digit — if it is one make it zero, and vice versa:

1111 0111

This is called the 1’s complement form, and if you now add 1 to this, you get the 2’s complement form:

1111 0111
Add one to this 0000 0001
and we get: 1111 1000

Now this looks pretty similar to the representation of −8 you got from subtracting +12 from +4. So just to be sure, let’s try the original sum of adding −8 to +12:

+12 is 0000 1100
Our version of −8 is 1111 1000
and you get: 0000 0100

So the answer is 4 — magic! It works! The carry propagates through all the leftmost 1’s, setting them back to zero. One fell off the end, but you shouldn’t worry about that. It’s probably the one you borrowed from off the end in the subtraction sum you did earlier to get −8. In fact, you are making the implicit assumption that the sign bit, 1 or 0, repeats forever to the left. If you try a few examples of your own, you’ll find it always works quite automatically. The really great thing is, it makes arithmetic very easy (and fast) for your computer.
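You can watch Java use exactly this representation. In the fragment below (a sketch for a main() method), the ~ operator flips every bit to give the 1’s complement, adding 1 gives the 2’s complement, and masking with 0xFF shows just the low 8 bits of the result:

byte plus8 = 8;
byte minus8 = (byte) (~plus8 + 1);                          // flip the bits, then add 1
System.out.println(minus8);                                 // prints -8
System.out.println(Integer.toBinaryString(minus8 & 0xFF));  // prints 11111000
System.out.println(12 + minus8);                            // prints 4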

FLOATING-POINT NUMBERS

You often have to deal with very large numbers: the number of protons in the universe, for example, which needs around 79 decimal digits. Clearly there are lots of situations where you need more than the 10 decimal digits you get from a 4-byte binary number. Equally, there are lots of very small numbers: the amount of time in minutes, for example, that it takes the typical car salesman to accept your cash offer on his lightly used 1999 Honda Accord (only 387,604 miles, and keyless entry because the key got lost along the way. . .). A mechanism for handling both these kinds of numbers is — as you might have guessed from the title of this section — floating-point numbers.

A floating-point representation of a number is a decimal point followed by a fixed number of digits, multiplied by a power of 10 to get the number you want. It’s easier to demonstrate than explain, so let’s take some examples. The number 365 in normal decimal notation would be written in floating-point form as the following:

0.365E03

Here, the E stands for exponent and is the power of ten that the 0.365 bit (the mantissa) is multiplied by, to get the required value. That is:

0.365 × (10 × 10 × 10)

which is clearly 365.

Now let’s look at a smallish number:

.365E−04

This is evaluated as .365 × 10⁻⁴, which is .0000365 — exactly the time in minutes required by the car salesman to accept your money and put up the “Closed” sign.
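Java’s floating-point literals use the same E notation, so you can write these values directly in a program. A quick sketch for a main() method:

double big = 0.365E03;       // 0.365 x 10 x 10 x 10 = 365.0
double small = 0.365E-04;    // 0.365 divided by 10 four times = 0.0000365
System.out.println(big);     // prints 365.0
System.out.println(small);   // prints 3.65E-5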

The number of digits in the mantissa of a floating-point number is called the precision, and depends on the type of the floating-point number that you are using. The Java type float provides the equivalent of approximately 7 decimal digits, and the type double provides around 15 decimal digits. The number of decimal digits is approximate because the mantissa is binary, not decimal, and there’s not an exact mapping between binary and decimal digits.

Suppose you have a large number such as 2,134,311,179. How does this look as a floating-point number? Well, as a decimal representation of a number of type float it looks like:

0.2134311E10

It’s not quite the same. You have lost three low-order digits, so you have approximated your original value as 2,134,311,000. This is a small price to pay for being able to handle such a vast range of numbers, typically from 10⁻³⁸ to 10⁺³⁸ either positive or negative, as well as having an extended representation that goes from a minute 10⁻³⁰⁸ to a mighty 10⁺³⁰⁸. As you can see, they are called floating-point numbers for the fairly obvious reason that the decimal point “floats,” depending on the exponent value.
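Here is a small sketch of that loss in Java. Converting the int value to float throws away the low-order part, and converting back lets you see what actually got stored; with standard IEEE 754 rounding the difference works out to 11, but the point is simply that it is no longer zero:

int exact = 2134311179;
float approx = exact;                  // float keeps only about 7 significant decimal digits
long stored = (long) approx;           // convert back to see the value actually held
System.out.println(exact - stored);    // prints the small, non-zero amount that was lost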

Aside from the fixed precision limitation in terms of accuracy, there is another aspect you may need to be conscious of. You need to take great care when adding or subtracting numbers of significantly different magnitudes. A simple example demonstrates the kind of problem that can arise. You can first consider adding .365E−3 to .365E+7. You can write this as a decimal sum:

.000365 + 3,650,000

This produces the result:

3,650,000.000365

which when converted back to floating-point with seven-digit accuracy becomes

.3650000E+7

So you might as well not have bothered. The problem lies directly with the fact that you carry only seven-digit precision. The seven digits of the larger number are not affected by any of the digits of the smaller number because they are all farther to the right. Oddly enough, you must also take care when the numbers are very nearly equal. If you compute the difference between such numbers you may end up with a result that has only one or two digits’ precision. It is quite easy in such circumstances to end up computing with numbers that are total garbage.
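The same effect is easy to reproduce with Java’s float type and its roughly seven digits of precision. In this sketch (again for a main() method) the smaller operand is simply too small to register in the result:

float tiny = .365E-3f;            // 0.000365
float huge = .365E+7f;            // 3,650,000
float sum = tiny + huge;
System.out.println(sum == huge);  // prints true: adding tiny changed nothing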

One final point about using floating-point values — many values that have an exact representation as a decimal value cannot be represented exactly in binary floating-point form. For example, 0.2 as a decimal value cannot be represented exactly as a binary floating-point value. This means that when you are working with such values, you have tiny errors in your values right from the start. One effect of this is that accumulating the sum of 100 values that are all 0.2 will not produce 20 as the result. If you try this out in Java, the result is 20.000004, slightly more than you bargained for. (Unfortunately the banks do not use floating-point values for your deposits — it could have been a sure-fire way to make money without working.)
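If you want to try it, the loop below accumulates 0.2 one hundred times using the float type; because each 0.2 is already slightly off, the total drifts away from 20:

float sum = 0.0f;
for (int i = 0; i < 100; i++) {
    sum += 0.2f;                  // 0.2 has no exact binary representation
}
System.out.println(sum);          // prints 20.000004 rather than exactly 20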

You can conclude from this that although floating-point numbers are a powerful way of representing a very wide range of values in your programs, you must always keep in mind their limitations. If you are conscious of the range of values that you are likely to be working with, you can usually adopt an approach to performing the calculations that you need that avoids the sorts of problems I have described. In other words, if you keep the pitfalls in mind when working with floating-point values, you have a reasonable chance of stepping around or over them.
