Chapter 19. Floating Point

1 is equal to 2 for sufficiently large values of 1.

Anonymous

Computers handle integers very well. The arithmetic is simple, exact, and fast. Floating point is the opposite. Computers do floating-point arithmetic only with great difficulty.

This chapter discusses some of the problems that can occur with floating point. To address the principles involved in floating-point arithmetic, we have defined a simple decimal floating-point format. We suggest you put aside your computer and work through these problems using pencil and paper so you can see firsthand the problems and pitfalls that occur.

The format used by computers is very similar to the one defined in this chapter, except that instead of using base 10, computers use base 2, 8, or 16. However, all the problems demonstrated here on paper can occur in a computer.

Floating-Point Format

Floating-point numbers consist of three parts: a sign, a fraction, and an exponent. Our fraction is expressed as a four-digit decimal. The exponent is a single-decimal digit. So our format is:

±f.fff × 10^±e

where:

±

is the sign (plus or minus).

f.fff

is the four-digit fraction.

±e

is the single-digit exponent.

Zero is +0.000 × 10^+0. We represent these numbers in “E” format: ±f.fff E±e. This format is similar to the floating-point format used in many computers. The IEEE has defined a floating-point standard (IEEE 754), but not all machines use it.

Table 19-1 shows some typical floating-point numbers.

Table 19-1. Floating-point examples

    Notation     Number
    +1.000E+0    1.0
    +3.300E+5    330000.0
    -8.223E-3    -0.008223
    +0.000E+0    0.0

The floating-point operations defined in this chapter follow a rigid set of rules. To minimize errors, we make use of a guard digit: an extra digit added to the end of the fraction during computation. Many computers use a guard digit in their floating-point units.

Floating Addition/Subtraction

To add two numbers, such as 2.0 and 0.3, the computer must perform the following steps:

  1. Start with the numbers.

    +2.000E+0	The number is 2.0.
    +3.000E-1	The number is 0.3.
  2. Add guard digits to both numbers.

    +2.0000E+0	The number is 2.0.
    +3.0000E-1	The number is 0.3.
  3. Shift the number with the smaller exponent to the right one digit and increment its exponent. Continue until the exponents of the two numbers match.

    +2.0000E+0	The number is 2.0.
    +0.3000E+0	The number is 0.3.
  4. Add the two fractions. The result has the same exponent as the two numbers.

    +2.0000E+0	The number is 2.0.
    +0.3000E+0	The number is 0.3.
    _________________________________
    +2.3000E+0	The result is 2.3.
  5. Normalize the number by shifting it left or right until there is just one nonzero digit to the left of the decimal point. Adjust the exponent accordingly. A number like +0.1234E+0 would be normalized to +1.2340E-1. Because the number +2.3000E+0 is already normalized, we do nothing.

  6. Finally, if the guard digit is greater than or equal to 5, round the next digit up. Otherwise, truncate the number.

    +2.3000E+0	Round last digit.
    +2.300E+0	The result is 2.3.

To subtract a number:

  1. Change the sign of the second operand.

  2. Add.

Multiplication and Division

When we want to multiply two numbers, such as 0.12 × 11.0, the following rules apply:

  1. Start with the numbers:

    +1.200E-1	The number is 0.12.
    +1.100E+1	The number is 11.0.
  2. Add the guard digit.

    +1.2000E-1	The number is 0.12.
    +1.1000E+1	The number is 11.0.
  3. Multiply the two fractions and add the exponents (1.2 × 1.1 = 1.32, -1 + 1 = 0).

    +1.2000E-1	The number is 0.12.
    +1.1000E+1	The number is 11.0.
    __________________________________
    +1.3200E+0	The result is 1.32.
  4. Normalize the result.

    +1.3200E+0	The number is 1.32.
  5. If the guard digit is greater than or equal to 5, round the next digit up. Otherwise, truncate the number.

    +1.320E+0	The number is 1.32

Notice that multiplication requires none of the aligning shifts that addition does. As far as hardware designers are concerned, the rules for multiplication are much shorter than those for addition. Integer multiplication is much slower than integer addition, but in floating point, multiplication speed is much closer to that of addition.

To divide numbers like 100.0 by 30.0, we must perform the following steps:

  1. Start with the numbers.

    +1.000E+2	The number is 100.0.
    +3.000E+1	The number is 30.0.
  2. Add the guard digit.

    +1.0000E+2	The number is 100.0.
    +3.0000E+1	The number is 30.0.
  3. Divide the fractions, and subtract the exponents.

    +1.0000E+2	The number is 100.0.
    +3.0000E+1	The number is 30.0.
    ___________________________________
    +0.3333E+1	The result is 3.333.
  4. Normalize the result.

    +3.3330E+0	The result is 3.333.
  5. If the guard digit is greater than or equal to 5, round the next digit up. Otherwise, truncate the number.

    +3.333E+0	The result is 3.333.

Overflow and Underflow

There are limits to the size of the number a computer can handle. What is the result of the following calculation?

9.000E+9 × 9.000E+9

Multiplying it out, we get:

8.1 × 10^19

However, we are limited to a single-digit exponent, too small to hold 19. This is an example of overflow (sometimes called exponent overflow). Some computers generate a trap when this occurs, thus interrupting the program and causing an error message to be printed. Others are not so nice and generate a wrong answer (like 8.100E+9). Computers that follow the IEEE floating-point standard generate a special value called +Infinity.

Underflow occurs when the numbers become too small for the computer to handle. Example:

1.000E-9 × 1.000E-9

The result is:

1.0 × 10^-18

Because -18 is too small to fit into one digit, we have underflow. Again, like overflow, the results of underflow are system-dependent.

Roundoff Error

Floating point is not exact. Everyone knows that 1 + 1 is 2, but did you know that 1/3 + 1/3 does not equal 2/3? This can be shown by the following floating-point calculations:

2/3 as floating point is 6.667E-1

1/3 as floating point is 3.333E-1

+3.333E-1 
+3.333E-1 
_______________________
+6.666E-1, or 0.6666

which is not:

+6.667E-1

Every computer has a similar problem with doing floating-point calculations. For example, the number 0.2 has no exact representation in binary floating point.

Floating point should never be used for money. Because we are used to dealing with dollars and cents, it is tempting to define the amount $1.98 as:

float amount = 1.98;

However, the more calculations you do with floating point, the bigger the roundoff error. Banks, credit cards, and the IRS tend to be very fussy about money. Giving the IRS a check that’s almost right is not going to make them happy. Money should be stored as an integer number of pennies.

Accuracy

How many digits of the fraction are accurate? At first glance you might be tempted to say all four digits. Those of you who have read the previous section on roundoff error might be tempted to change your answer to three.

The answer is: the accuracy depends on the calculation. Certain operations, such as subtracting two numbers that are close to each other, generate inexact results. Consider the following equation:

1 - 1/3 - 1/3 - 1/3

In floating-point notation this is:

 1.000E+0
-0.333E+0
-0.333E+0
-0.333E+0
_______________________
 0.0010E+0, or 1.000E-3

The correct answer is 0.000E+0 and we got 1.000E-3. The very first digit of the fraction is wrong. This is an example of the problem called roundoff error that can occur during floating-point operations.

Minimizing Roundoff Error

There are many techniques for minimizing roundoff error. Guard digits have already been discussed. Another trick is to use double instead of float. This gives you approximately twice the accuracy as well as a much greater range, and it pushes roundoff problems that much further away. But roundoff errors still can creep in.

Advanced techniques for limiting the problems caused by floating point can be found in books on numerical analysis. They are beyond the scope of this text. The purpose of this chapter is to give you some idea of what sort of problems can be encountered.

Floating point by its very nature is not exact. People tend to think of computers as very accurate machines. They can be, but they also can give wildly wrong results. You should be aware of the places where errors can slip into your program.

Determining Accuracy

There is a simple way of determining how accurate your floating point is (for simple calculations). The method used in the following program is to add 1.0 + 0.1, 1.0 + 0.01, 1.0 + 0.001, and so on until the second number gets so small that it makes no difference in the result.

The old C language specified that all floating-point arithmetic was to be done in double. C++ removed that restriction, but because many C++ compilers are really front ends to a C compiler, C++ arithmetic is frequently still done in double. This means that if number1 and number2 are declared as float, the expression:

while (number1 + number2 != number1)

is equivalent to:

while (double(number1) + double(number2) != double(number1))

If you use the 1 + 0.001 trick, the automatic conversion of float to double may give a distorted picture of the accuracy of your machine. (In one case, 84 bits of accuracy were reported for a 32-bit format.) Example 19-1 computes the accuracy of both floating point as used in equations and floating point as stored in memory. Note the trick used to determine the accuracy of the floating-point numbers in storage.

Example 19-1. float/float.cpp
#include <iostream>
#include <iomanip>

int main()
{
    // two numbers to work with
    float number1, number2;
    float result;               // result of calculation
    int   counter;              // loop counter and accuracy check

    number1 = 1.0;
    number2 = 1.0;
    counter = 0;

    while (number1 + number2 != number1) {
        ++counter;
        number2 = number2 / 10.0;
    }
    std::cout << std::setw(2) << counter <<
        " digits accuracy in calculations\n";

    number2 = 1.0;
    counter = 0;

    while (true) {
        result = number1 + number2;
        if (result == number1)
            break;
        ++counter;
        number2 = number2 / 10.0;
    }
    std::cout << std::setw(2) << counter <<
        " digits accuracy in storage\n";
    return (0);
}

The results are as follows:

20 digits accuracy in calculations 
8 digits accuracy in storage

This program gives only an approximation of the precision of the floating-point arithmetic. A more precise definition can be found in the standard include file float.h.

Precision and Speed

A variable of type double has about twice the precision of a normal float variable. Most people assume that double-precision arithmetic takes longer than single-precision. This is not always the case. Let’s assume we have one of the older compilers that does everything in double.

For the equation:

float answer, number1, number2; 

answer = number1 + number2;

C++ must perform the following steps:

  1. Convert number1 from single to double precision.

  2. Convert number2 from single to double precision.

  3. Double-precision add.

  4. Convert result into single precision.

  5. Store the result in answer.

If the variables were of type double, C++ would have to perform only the following steps:

  1. Double-precision add.

  2. Store result in answer.

As you can see, the second form is a lot simpler, requiring three fewer conversions. In some cases, converting a program from single precision to double precision makes it run faster.

Tip

Because C++ specifies that floating point can be done in double or float, you can’t be sure of anything. Changing all floats into doubles may make the program run faster, slower, or the same. The only thing you can be sure of when using floating point is that the results are unpredictable.

Many computers, including the PC and Sun/3 series machines, have a special chip called a floating-point processor that does all the floating-point arithmetic. Actual tests using the Motorola 68881 floating-point chip (which is used in the Sun/3) and floating point on the PC show that single precision and double precision run at the same speed.

Power Series

Many trigonometry functions are computed using a power series. For example, the series for sine is:

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

The question is, how many terms do we need to get four-digit accuracy? Table 19-2 contains the terms for sin(π/2).

Table 19-2. Terms for sin(π/2)

    Term            Value      Total
    1   x           1.571E+0
    2   x^3/3!      6.462E-1   9.248E-1
    3   x^5/5!      7.974E-2   1.005E+0
    4   x^7/7!      4.686E-3   9.998E-1
    5   x^9/9!      1.606E-4   1.000E+0
    6   x^11/11!    3.604E-6   1.000E+0

From this we conclude that five terms are needed. However, if we try to compute sin(π), we get Table 19-3.

Table 19-3. Terms for sin(π)

    Term            Value      Total
    1   x           3.142E+0
    2   x^3/3!      5.170E+0   -2.028E+0
    3   x^5/5!      2.552E+0   5.241E-1
    4   x^7/7!      5.998E-1   -7.570E-2
    5   x^9/9!      8.224E-2   6.542E-3
    6   x^11/11!    7.381E-3   -8.388E-4
    7   x^13/13!    4.671E-4   -3.717E-4
    8   x^15/15!    2.196E-5   -3.937E-4
    9   x^17/17!    7.970E-7   -3.929E-4
    10  x^19/19!    2.300E-8   -3.929E-4

sin(π) needs nine terms, so different angles require different numbers of terms. (A program for computing the sine to four-digit accuracy, showing the intermediate terms, is included in Appendix D.)

Compiler designers have a dilemma when it comes to designing a sine function. If they know ahead of time the number of terms to use, they can optimize their algorithms for that number of terms. However, they lose accuracy for some angles. So a compromise must be struck between speed and accuracy.

Don’t assume that because the number came from the computer, it is accurate. The library functions can generate bad answers—especially when working with excessively large or small values. Most of the time you will not have any problems with these functions, but you should be aware of their limitations.

Finally, there is the question of what sin(1,000,000) is. Our floating-point format is good for only four digits. The sine function is cyclical; that is, sin(0) = sin(2π) = sin(4π). Therefore, sin(1,000,000) is the same as sin(1,000,000 mod 2π).

Because our floating-point format is good to only four digits, sin(1,000,000) is actually sin(1,000,xxx), where xxx represents unknown digits. But the sine function is periodic with period 2π; it goes through its full range every 2π. Because the uncertainty (up to 1,000) is far bigger than 2π, the error renders the result of the sine meaningless.

Programming Exercises

Exercise 19-1: Write a class that uses strings to represent floating-point numbers in the format used in this chapter. The class should have functions to read, write, add, subtract, multiply, and divide floating-point numbers.

Exercise 19-2: Create a class to handle fixed-point numbers. A fixed-point number has a constant (fixed) number of digits to the right of the decimal point.
