Chapter 6. Floating-Point Optimization

Chapter Objectives

It would seem intuitive that computers, being machines for handling numbers, should excel at floating-point arithmetic. In practice, floating-point optimization turns out to be more complex than you might expect. Floating-point mathematics is covered by the IEEE-754 standard, and optimizing floating-point arithmetic relaxes the requirement that the compiler conform to this standard (the Sun compiler adheres to the standard unless flags are used that relax this constraint).

By the end of this chapter, the reader will understand the optimizations that can be applied to floating-point computation and the impact these optimizations will have on the accuracy of the generated results.

Floating-Point Optimization Flags

Mathematical Optimizations in -fast

Table 6.1 shows the floating-point optimization flags that are included in the -fast compiler flag for the C, C++, and Fortran languages. These flags represent optimizations that have been found to be generally useful for floating-point applications. In this section, I will describe the flags and present the trade-offs that you need to consider when using the flags.

Table 6.1. Floating-Point Optimization Flags Included in -fast

Flag            Comment                                                C    C++  Fortran
-fns            Floating-point nonstandard mode                        Y    Y    Y
-fsimple=2      Aggressive floating-point optimizations                Y    Y    Y
-ftrap=%none    Don’t trap on IEEE exceptions                          Y    Y
-ftrap=common   Trap only on “common” IEEE exceptions                            Y
-xlibmil        Use inline templates for math functions                Y    Y    Y
-fsingle        Float expressions are evaluated in single-precision    Y
-xlibmopt       Use optimized math library                             Y    Y    Y
-xvector        Generate calls to vector math library                            Y
-fnostore       Do not convert temporary values into shorter formats   Y    Y    Y
                (x86 only)

The other option enabled in -fast for C is -D__MATHERR_ERRNO_DONTCARE, which tells the compiler to assume that the math functions defined in math.h have no side effects (such as setting the error reporting variable errno). I discuss this in more detail in Section 6.2.15.

IEEE-754 and Floating Point

The IEEE-754 standard determines how floating-point arithmetic should function on a “standard” computer, which is the default mode for the compiler. Adherence to the standard means calculations have to be performed in a particular order, and the compiler cannot take shortcuts. This may mean that the calculations take longer. Using the -fns and -fsimple compiler flags allows the compiler to produce code that might obtain results faster; however, using the flags means the resulting code no longer adheres to the standard.

Even using IEEE-754 mathematics does not guarantee “correct” results. The intention of the standard is to make code portable so that you can run the code on different platforms, and expect the same support and conventions. It does not guarantee that the results will be identical on the two platforms.

The IEEE-754 standard describes two commonly occurring storage formats for numbers: single- and double-precision. These take four and eight bytes, respectively. It is important to realize that precision does not mean accuracy; accuracy is whether a number is correct, whereas precision is how many decimal places are specified in the number. It is worth observing that using double precision can, in many cases, improve accuracy. Single precision holds about 6–9 significant figures, and double precision holds about 15–17.

When floating-point numbers are stored, they are stored as the nearest number that can be represented in binary format. So, some numbers (such as 1/3) cannot be exactly represented in this format. As such, by its very nature, floating-point arithmetic on a computer has some degree of inaccuracy, and in this section, I will provide examples of how this manifests itself.
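
As a minimal sketch of this inexactness (this snippet is illustrative and not from the original text), the following program prints the stored values of one third in single and double precision; printing more digits than each format actually holds exposes the approximation.

#include <stdio.h>

int main()
{
  float  fthird = 1.0f / 3.0f;   /* nearest single-precision value to 1/3 */
  double dthird = 1.0  / 3.0;    /* nearest double-precision value to 1/3 */
  /* Print more digits than either format can represent */
  printf("single: %.20f\n", fthird);
  printf("double: %.20f\n", dthird);
  return 0;
}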

Given that storage of floating-point numbers is an approximation, there will always be some error for any application. One technique that improves this situation is the use of interval arithmetic. Briefly, intervals are a way for the computer to calculate the lower and upper bounds for a value. When a calculation is performed on an interval, the output is also an interval. If the calculation is well behaved, the output will have an upper and lower bound that are in close agreement. If the algorithm is less well behaved, the difference between the upper and lower bounds could be substantial. Although the Sun Studio compilers do support interval arithmetic, further discussion of the topic is beyond the scope of this book.

Vectorizing Floating-Point Computation (-xvector)

The -xvector flag asks the compiler to recognize situations in which multiple calls to a mathematical function (e.g., log, sin, cos, etc.) can be replaced with a single call to a function that works on a vector of values.

Because the calculation of the values of these functions involves many steps, and all the steps have to be completed serially, it turns out that in some cases doing several calculations at once actually takes about the same time as doing one. Consequently, it is possible to see a speed increase just using this flag.

Example 6.1 shows a code snippet that could be vectorized; one array is produced by calculating the sine of every element in another array. If this code is compiled with both the -xvector and -xbuiltin flags, the call to sin will be replaced by a call to the vector sin routine, contained in the vector math library. This is demonstrated in Example 6.2.

Example 6.1. Example of Vectorizable Code

#include <math.h>
extern double x[100],y[100];
void calc()
{
  int i;
  for (i=0; i<100; i++)
    x[i]=sin(y[i]);
}

Example 6.2. Enabling Compiler to Insert Calls to Vector Library

$ cc -xO3 -S ex6.1.c
$ grep sin ex6.1.s
/* 0x001c          7 */         call    sin
$ cc -xO3 -xbuiltin -xvector -S ex6.1.c
$ grep sin ex6.1.s
/* 0x0020            */         call    __vsin_

Example 6.3 shows an example of code that calls the vector library directly. You can find further details about the vector library under man libmvec, and details of a similar library for complex vector operations under man clibmvec.

Example 6.3. Calling the Vector Library Directly

#include <math.h>
extern double x[100],y[100];
void vsin_(int *n, double *x, int *stridex, double *y, int *stridey);

void calc()
{
  int i;
  int stride=1;
  int length=100;
  vsin_(&length,x,&stride,y,&stride);
}

There are a few things to observe about calling libmvec.

  • No header files for the vector math library are included with the compiler, so it is necessary to extract the required headers from the man pages.

  • To call the vector routine, it is also necessary to link in the vector library. You can do this by including -xvector on the compile line.

  • The vectors passed into the library must not alias or overlap.

Vectorizing Computation Using SIMD Instructions (-xvector=simd) (x64 Only)

On x64 platforms, the -xvector flag also has support for recognizing opportunities to use Single Instruction, Multiple Data (SIMD) instructions. These instructions simultaneously perform the same operation on multiple items of data, reducing the total number of instructions needed and increasing performance. To do this the compiler needs the -xvector=simd flag and the appropriate architecture setting of -xarch=sse2.

Example 6.4 shows an example of this optimization. In the example, the mulps instruction is used to multiply four single-precision pairs of numbers; the surrounding move instructions are responsible for loading the SSE2 registers with the data, and then storing the result back to memory.

Example 6.4. Generating SIMD Instructions

% more ex6.4.c
void calc (float * restrict a, float * restrict b, int count)
{
  for (int i=0; i<count; i++) {a[i]=a[i]*b[i];}
}
$ cc -fast -xarch=sse2 -xvector=simd -S ex6.4.c
$ more ex6.4.s
...
.LU16.124:
        movlps     (%edx),%xmm0                         ;/ line : 3
        movhps     8(%edx),%xmm0                        ;/ line : 3
        movlps     (%ecx),%xmm1                         ;/ line : 3
        movhps     8(%ecx),%xmm1                        ;/ line : 3
        mulps      %xmm1,%xmm0                          ;/ line : 3
        movlps     %xmm0,(%edx)                         ;/ line : 3
        movhps     %xmm0,8(%edx)                        ;/ line : 3
        addl       $16,%edx                             ;/ line : 3
        addl       $16,%ecx                             ;/ line : 3
        addl       $4,%eax                              ;/ line : 3
        cmpl       %edi,%eax                            ;/ line : 3
        jle        .LU16.124                            ;/ line : 3
....

Subnormal Numbers

Subnormal numbers are floating-point numbers that are very close to zero. The idea of supporting them is that they form a gradual underflow between the smallest floating-point number that can be represented in the “normal” way, and zero.

Floating-point numbers are represented as x*2^y, where x is called the mantissa and y is called the exponent. Using this notation, it is possible to represent some values in multiple ways. For example, consider representing a half in base 10. A half can be represented as 0.5*10^0 or as 5.0*10^-1. A way to normalize this is to say that all numbers should have a single nonzero digit before the decimal point. This would make the second representation of a half the appropriate one in decimal. In binary, this translates to always storing numbers with the first bit set to 1.

A floating-point number has a certain number of bits to represent the mantissa and a certain number of bits to represent the exponent. If the first bit of the mantissa is always set to be one, the only way to make a number smaller is to make the exponent more negative. Subnormal numbers fill the range between the value zero, and the smallest number that can be represented by the largest negative exponent and a mantissa with a leading one. For subnormal numbers, rather than representing the number using a leading one, the mantissa has a leading zero. Subnormal numbers have a reduced precision (because fewer bits are available to hold the mantissa), so they are trading precision for smoothing the transition between small nonzero numbers and zero.
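
As a rough sketch of where the subnormal range begins (this snippet is illustrative, not from the original text), the following program prints the smallest normal double, DBL_MIN from <float.h>, and a value below it; dividing DBL_MIN by two produces a subnormal with reduced precision.

#include <stdio.h>
#include <float.h>

int main()
{
  double smallest_normal = DBL_MIN;        /* smallest positive normal double */
  double subnormal = DBL_MIN / 2.0;        /* falls into the subnormal range  */
  printf("smallest normal = %e\n", smallest_normal);
  printf("subnormal       = %e\n", subnormal);
  return 0;
}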

Calculations on subnormal numbers usually have a longer latency than operations on normal numbers. The UltraSPARC family of processors handles subnormal numbers by trapping to software to complete the calculation. The Opteron processors handle operations on subnormal numbers using microcoded instructions; these take significantly longer to execute than normal operations, but less time than taking a trap. If a significant proportion of the calculations involve subnormal numbers, the processor might spend a considerable amount of time handling them. A program that encounters many subnormal numbers is performing many reduced-accuracy calculations on values close to zero; this indicates that it is working at the limit of the range of numbers that can be represented in floating-point registers, and consequently its output may be inaccurate.

Example 6.5 shows some code that generates subnormal numbers. The program takes a floating-point number and keeps dividing by two until it becomes zero. The value starts at 1.0, in the range of the “normal” floating-point numbers, and then goes into the subnormal numbers.

Example 6.5. Example of Code That Generates Subnormal Numbers

#include <stdio.h>
#include <sys/time.h>

void main ()
{
  double f=(double)1.0;
  hrtime_t start,end;
  while (f>0)
  {
    start = gethrtime();
    f=f * (double)0.5;
    end = gethrtime();
    printf("f=%e time=%lld
",f,end-start);
  }
}

Example 6.6 shows this code being run (the call to gethrtime seems to be accurate only to within 200ns on this system). The critical things to note are that initially each calculation is taking <200ns, but at the end each calculation is taking about 6,000ns. Also notice that the program keeps running until the f variable is about 4*10^-324, at which point the value becomes zero when divided by two.

Example 6.6. Timing of Subnormal Numbers

$ cc -xO3 ex6.5.c
$ a.out
...
f=5.000000e-01 time=600
f=2.500000e-01 time=200
f=1.250000e-01 time=200
f=6.250000e-02 time=200
f=3.125000e-02 time=0
f=1.562500e-02 time=200
...
f=1.976263e-323 time=6400
f=9.881313e-324 time=6200
f=4.940656e-324 time=6200
f=0.000000e+00 time=7400

Flushing Subnormal Numbers to Zero (-fns)

The -fns flag enables floating-point nonstandard mode. In this mode, subnormal numbers may be flushed to zero. Doing this flush to zero will mean that the computation in the program no longer meets the IEEE-754 standard, but for some codes it may result in a speed increase and for many codes there may be no difference in the output. Example 6.7 shows the same code as in Example 6.5, but this time compiled with the -fns flag.

Example 6.7. Tail of Output of Program Compiled with -fns

$ cc -xO3 -fns ex6.5.c
$ a.out
...
f=1.780059e-307 time=200
f=8.900295e-308 time=0
f=4.450148e-308 time=200
f=2.225074e-308 time=200
f=0.000000e+00 time=0

Example 6.7 shows that the performance of the program remains <200ns for all iterations, but that the value at the final iteration is about 2*10^-308 rather than 4*10^-324.

The -fns flag causes the compiler to include code that changes the behavior of the processor so that it may flush subnormal numbers to zero. This means the flag is effective only when compiling the main program.

Handling Values That Are Not-a-Number

Under IEEE-754, a value for Not-a-Number (NaN) is defined. Calculations that would normally produce errors (e.g., zero divided by zero) can produce the result NaN (because the answer is not representable as a number) instead of causing the program to terminate. Such an operation would also raise an exception. Handling this exception enables a program to handle such calculations gracefully, rather than dumping core.

It is also the case that NaNs propagate, meaning that a calculation in which one operand is a NaN produces NaN as a result. This way, a calculation will produce either a “valid” (i.e., numerical) result, or a NaN, meaning that some part of the calculation was invalid. Example 6.8 shows an example in which a NaN is generated as output.

Example 6.8. Example of a Calculation Generating a NaN

$ cat ex6.8.c
#include <stdio.h>
void main()
{
  double a=0;
  double b=a/a;
  printf("b=%f
",b);
}
$ cc ex6.8.c
$ a.out
b=NaN

One interesting property of NaNs is that they fail the equality test—they are unorderable, they are neither bigger nor smaller than numbers, and one NaN does not equal another NaN. The code in Example 6.9 illustrates that NaNs fail the equality test: A NaN is generated when zero is divided by zero. Even though the NaN is compared to itself, the test fails. In fact, this is the test for NaNs; they are values that are not equal to themselves.

Example 6.9. NaNs Fail the Equality Test

$ cat ex6.9.c
#include <stdio.h>
void main()
{
  double a=0;
  double b=a/a;
  if (b==b) printf("Equal
");
  if (b!=b) printf("Not equal
");
}
$ cc ex6.9.c
$ a.out
Not equal

One further classification of NaNs is that they have two types: signaling NaNs and quiet NaNs. Operations on signaling NaNs will generate a floating-point exception (which can then be caught), whereas operations on quiet NaNs do not generate exceptions (except when they are used in ordered comparisons).
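
Where code needs to detect NaNs explicitly, the C99 isnan macro from <math.h> makes the intent clearer than relying on the self-inequality test. This is a minimal sketch, not from the original text:

#include <stdio.h>
#include <math.h>

int main()
{
  double a = 0.0;
  double b = a / a;                    /* generates a quiet NaN */
  if (isnan(b)) { printf("b is NaN\n"); }
  else          { printf("b is a number\n"); }
  return 0;
}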

Enabling Floating-Point Expression Simplification (-fsimple)

The -fsimple flag enables floating-point simplification. It allows the compiler to reorder floating-point expressions, replace long-latency floating-point operations with algebraically equivalent but faster versions, or omit some kinds of operations. One way to imagine what the flag can do is to think of it as allowing the compiler to assume that the rules of algebra carry over into floating-point arithmetic performed on a computer. In fact, floating-point math has more complex rules than normal algebra. Here are a couple of examples of situations in which algebra and floating-point computation do not agree.

  • A floating-point variable may not always equal itself. Consider the statement (x==x). Normally this would be expected to always be true. However, as described in Section 6.2.7, this is not true for NaNs.

  • In algebra, division by a value is equivalent to multiplication by the reciprocal of the value. In other words, a/b is equal to a*(1/b). Unfortunately, because the multiplication by the reciprocal involves two floating-point operations, whereas the division is only a single operation, the results of the two are rarely identical (in the least significant bits). Also, calculating the reciprocal of b may generate floating-point overflow, which would not occur in the calculation of a/b.

Generally, these kinds of simplifications can have a performance impact on floating-point applications, because they enable the compiler to do things such as reorder additions or replace divides with multiplication by the reciprocal.
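
As a rough way to see the reciprocal issue, the following sketch (hypothetical, not from the original text) compares division by a constant with multiplication by a precomputed reciprocal over a range of numerators and counts how many results differ; depending on the divisor chosen, this may be none or a small fraction, and any differences are in the least significant bits.

#include <stdio.h>

int main()
{
  double b = 3.88;               /* arbitrary divisor, as in Example 6.22 */
  double recip = 1.0 / b;        /* reciprocal computed once */
  int i, differences = 0;
  for (i = 1; i <= 1000; i++)
  {
    double a = (double)i;
    if (a / b != a * recip) { differences++; }
  }
  printf("%d of 1000 results differ\n", differences);
  return 0;
}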

Table 6.2 shows the three settings for -fsimple.

Table 6.2. Settings for -fsimple

-fsimple Setting  Comment
0                 No floating-point simplification allowed
1                 There are no NaNs in the data, so tests such as (x==x) can be
                  replaced with TRUE. Allow generally appropriate floating-point
                  optimizations.
2                 Allow aggressive floating-point optimizations, such as hoisting
                  of divides, and reordering of floating-point expressions.

The default setting for -fsimple is zero, which means that the compiler adheres to the IEEE-754 standard. However, -fast includes the -fsimple=2 flag, which allows aggressive floating-point reordering (and hence, potentially higher performance), but no longer adheres to IEEE-754. Aside from adherence to the standard, you should use -fsimple=0 in calculations in which operations on NaNs are important, and when interval arithmetic is used.

The next few sections will look at various transformations that are possible under the -fsimple flag.

Elimination of Comparisons

In the code in Example 6.9, the comparison of a NaN with itself produces the result that it is not equal. In Example 6.10, the code is recompiled with the -fsimple=2 flag (although the same behavior would result with -fsimple=1). The flag enables the compiler to assume that the b variable is always equal to itself; hence, the program always prints “Equal”.

Example 6.10. Equality Testing Under -fsimple=2

$ cat ex6.10.c
#include <stdio.h>
void main()
{
  double a=0;
  double b=a/a;
  if (b==b) printf("Equal
");
  if (b!=b) printf("Not equal
");
}
$ cc -xO3 ex6.10.c
$ a.out
Not equal
$ cc -xO3 -fsimple=2 ex6.10.c
$ a.out
Equal

Elimination of Unnecessary Calculation

For floating-point math, it is sometimes important to perform calculations to observe their side effects. For example, you might perform a calculation to check for overflow; the results of the calculation might not be useful, but the fact that the calculation succeeded without generating an overflow might be important.

As a consequence of this, it is not possible for the compiler to remove floating-point calculations, even when the results of the calculations are not used, except under the control of the -fsimple=1 flag.

Example 6.11 shows a floating-point calculation in which the result is stored in a local variable, but never used. Even though the variable is never used, the compiler will still have to perform the calculation. If this code is compiled without -fsimple, the divide operation is performed. If -fsimple=1 is specified, the compiler has the freedom to eliminate the unused divide operation.

Example 6.11. Example of Redundant Floating-Point Calculation

void test(float a, float b)
{
  float c=a/b;
}

Reordering of Calculations

The language standards dictate that operations are completed in the order specified. This has two parts: first, parentheses are honored; second, calculations are carried out in the order that the program specifies. Consider the code shown in Example 6.12, in which a sum is calculated.

Example 6.12. Calculation of a Sum of a Vector

int i;
double a[LEN];
double total = 0;
for (i=0; i<LEN; i++)
  total += a[i];

There is a performance issue with this code in that each addition to the variable total has to complete before the next addition can start. This means that each iteration of the loop takes at least as long as a single addition takes.

There is a faster way to do this summation, and that is to have multiple summation variables, with each variable totaling part of the final result. Example 6.13 shows the transformed code.

Example 6.13. Summation Code Restructured to Use Four Temporary Variables

int i;
double a[LEN];
double total = 0, total1 = 0, total2 = 0, total3 = 0, total4 = 0;
for (i=0; i<LEN-4; i+=4)
{
  total1 += a[i];
  total2 += a[i+1];
  total3 += a[i+2];
  total4 += a[i+3];
}
for ( ; i<LEN; i++)
  total += a[i];

total = total + total1 + total2 + total3 + total4;

Obviously, it is very painful to manually do this kind of transformation for all the places in the code where summations occur, and in fact, this is the kind of transformation the -fsimple=2 flag enables the compiler to do. Example 6.14 shows a full example of the two code snippets. I discuss the timing harness, defined in timing.h and used for all the examples in this book, in Section 7.4.1 of Chapter 7.

Example 6.14. Timing Loop Unrolling

#include "timing.h"

#define SIZE 6000
#define RPT 100
static float f[SIZE],g[SIZE];

int main()
{
  int index,count;
  float totalf=0.0,tf1,tf2,tf3,tf4;
  for (index=0; index<SIZE; index++) f[index]=g[index]=1.0;
  printf("(a+b)                  ");
  starttime();
  for (count=0; count<RPT; count++)
    {
    for (index=0;index<SIZE;index++)
      totalf+=(f[index]+g[index]);
    totalf=totalf*1.7;
    }
  endtime(SIZE*RPT);
  printf("(a+b) unrolled         ");
  starttime();
  for (count=0; count<RPT; count++)
    {
    tf1=tf2=tf3=tf4=0;
    for (index=0;index<SIZE-4;index+=4)
    {
      tf1+=(g[index]  +f[index]  );
      tf2+=(g[index+1]+f[index+1]);
      tf3+=(g[index+2]+f[index+2]);
      tf4+=(g[index+3]+f[index+3]);
    }
    totalf+=tf1+tf2+tf3+tf4;
    totalf=totalf*1.7;
    }
  endtime(SIZE*RPT);
}

Running the code shown in Example 6.14 with and without -fsimple=2 produces the results shown in Table 6.3.

Table 6.3. Performance Gains from Using -fsimple=2 on Vector Summation

Code            Single-Precision    Single-Precision with -fsimple=2
(a+b)           5.45ns              3.40ns
(a+b) unrolled  3.09ns              2.96ns

There are some possible problems with doing this kind of optimization. Consider the sequence shown in Example 6.15.

Example 6.15. Problem Number Sequence for Summation

10000, -10000, 20000, -20000, 30000, -30000 ....

In Example 6.15, the summation will produce the value zero, because each term is canceled by the next one. However, if the summation is split into four (each summation dealing with every fourth value in the vector), the individual summations will not cancel out, and there is a chance that precision will be lost as the numbers grow larger. On the other hand, look at the sequence shown in Example 6.16; in this case, using four temporary variables may well increase the precision of the result, because all the small numbers will be added together in one total and all the large values in another.

Example 6.16. Different Problem Sequence for Summations

10000, 0.1, 20000, 0.2, 30000, 0.3, ...

From these examples, it may appear that accurately adding a sequence of numbers is very hard to do. Of course, the sequences shown in Examples 6.15 and 6.16 represent extremes, and most code falls between these two. Some algorithms, such as the Kahan Summation Formula, reduce these kinds of problems. It is also helpful to use double precision, which increases both the number of significant figures and the range of numbers that can be held in a variable.

On the other hand, it is conceivable that the numbers occurring in the code do not have the mix of small and large, or positive and negative, values that amplifies this kind of issue. It is also possible that the difference in the results obtained from the program is only in the least significant bits, and this may well be dwarfed by the accuracy of the data fed into the program.

Kahan Summation Formula

Section 6.2.11 introduces the issues with reordering floating-point calculations. The Kahan Summation Formula represents one way to produce more accurate results in a given precision (e.g., single or double precision). Example 6.17 shows four methods of computing a summation: two “traditional” ways of summing a series of numbers, one using single precision and one using double precision; and the Kahan formulation both in single and double precision.

Example 6.17. Summation Formulae

float fsum(float * array, int n)
{
  float total=0.0;
  for (int i=0; i<n; i++) {total+=array[i];}
  return total;
}

float kfsum(float * array, int n)
{
  float total, temp1, temp2 , carry;
  carry=0.0;
  total = array[0];
  for (int i=1; i<n; i++)
  {
    temp1 =array[i] - carry;
    temp2 =total + temp1;
    carry = (temp2 - total) - temp1;
    total = temp2;
  }
  return total;
}
double dsum(float * array, int n)
{
  double total=0.0;
  for (int i=0; i<n; i++) {total+=array[i];}
  return total;
}

double kdsum(float * array, int n)
{
  double total, temp1, temp2 , carry;
  carry=0.0;
  total = array[0];
  for (int i=1; i<n; i++)
  {
   temp1 =array[i] - carry;
   temp2 =total + temp1;
   carry = (temp2 - total) - temp1;
   total = temp2;
  }
  return total;
}

You can test the various approaches using the test harness shown in Example 6.18. This test harness prepares an array with alternating large and small values.

Example 6.19 shows the results of compiling and running this test program without floating-point simplification enabled.

The single-precision summation provides a result that is correct to about four significant figures. The Kahan formula improves this to seven significant figures, still using just single-precision variables. Double precision obtains 12 significant figures, whereas double precision using the Kahan formula provides the most accurate result with about 15 significant figures.

Example 6.18. Summation Test Harness

#include <stdio.h>

void setarray(float * array,int n)
{
  for (int i=0; i<n; i+=2)
  {
    array[i]=1/100001.0;
  }
  for (int i=1; i<n; i+=2)
  {
    array[i]=100001.0;
  }
}

void main()
{
  float array[100000];
  setarray(array,100000);
  printf(" fsum = %12.8f
", fsum(array,100000));
  printf("kfsum = %12.8f
",kfsum(array,100000));
  printf(" dsum = %12.8f
", dsum(array,100000));
  printf("kdsum = %12.8f
",kdsum(array,100000));
}

Example 6.19. Results of Summation Code

% cc -O ex6.17.c
% a.out
 fsum = 5000756736.00000000
kfsum = 5000050176.00000000
 dsum = 5000050000.49729729
kdsum = 5000050000.49999523

Hoisting of Divides

Division is an operation that takes a processor a large number of cycles to complete. Consider that addition and multiplication each might take about four cycles, whereas division might take 20 or more cycles. Given this fact, it is a good plan to avoid divisions wherever possible. Under -fsimple=2, the compiler will, where possible, replace divisions with multiplication by the reciprocal, or even delay them until later in program execution.

Often, divides by a constant value appear within a loop, and under this optimization, the divide can be done once before the loop (called hoisting the divide out of the loop), and the loop can progress with the much cheaper multiply operation. Example 6.20 shows code that demonstrates the potential for this optimization.

Example 6.20. Code in Which the Divide Can Be Hoisted

for (i=0; i<LEN; i++)
  total += a[i]/b;

You can transform the code shown in Example 6.20 into the faster-running code shown in Example 6.21.

Example 6.21. Alternative Sequence of Faster Running Code

for (i=0; i<LEN; i++)
  total += a[i];
total = total /b;

Note that for the code shown in Example 6.21, the compiler would also split the total variable into four components, but for clarity this optimization is not shown here.

Example 6.22 shows some code that can demonstrate the performance gains you can obtain by hoisting the division operation out of the critical loop.

Example 6.22. Timing Code for Divide Operations

#include <stdio.h>
#include "timing.h"

#define SIZE 6000
#define RPT 100
static float f[SIZE],g[SIZE];

int main()
{
  int index,count;
  float totalf=0.0;
  for (index=0; index<SIZE; index++) f[index]=g[index]=1.0;
  printf("(a+b)/const            ");
  starttime();
  for (count=0; count<RPT; count++)
    {
    for (index=0;index<SIZE;index++)
      totalf+=(f[index]+g[index])/3.88;
    totalf=totalf*1.7;
    }
  endtime(SIZE*RPT);
}

Example 6.23 shows the performance gains from running the code in Example 6.22 with and without -fsimple=2.

Example 6.23. Timing Running with and without -fsimple=2

$ cc -xO3 ex6.22.c
$ a.out
(a+b)/const            Time per iteration 23.01 ns
$ cc -xO3 -fsimple=2 ex6.22.c
$ a.out
(a+b)/const            Time per iteration 16.23 ns

The hoisting of division operations has similar problems to the summation seen in Section 6.2.10. The summation may overflow before the divide operation is performed. It also has the rounding problems associated with replacing a one-step division by a two-step process.

Honoring of Parentheses at Levels of Floating-Point Simplification

At -fsimple=0, the compiler will honor the parentheses placed around calculations; at -fsimple=2, the compiler will maintain the same algebraic formula, but may replace it with something that can be executed more efficiently.

For example, Example 6.24 shows two algebraically equivalent floating-point expressions.

Example 6.24. Example of Simplification of a Floating-Point Expression

(a+b)*c - b*c = a*c

The Kahan Summation Formula shown in Example 6.17 is sensitive to the order in which expressions are calculated. Examination of the formula indicates that by removing parentheses, much of the calculation can be eliminated, and it essentially resolves to a straightforward summation of all the values in an array. Hence, compiling the Kahan Summation Formula is one situation that is incompatible with the use of the -fsimple flag.

Effect of -fast on errno

The expansion of -fast for C includes the definition of the preprocessor variable __MATHERR_ERRNO_DONTCARE. This variable changes the way some of the functions in the math.h header file are defined. Example 6.25 shows a snippet from the header file.

Example 6.25. Snippet of math.h

#if defined(__MATHERR_ERRNO_DONTCARE)
#pragma does_not_read_global_data(erf, erfc, hypot)
#pragma does_not_write_global_data(erf, erfc, hypot)
#pragma no_side_effect(erf, erfc, hypot)
#endif

The pragmas that are enabled are as follows.

  • does_not_read_global_data, which means the compiler does not have to store variables back to memory before the call to the function (see Section 5.12.3 of Chapter 5).

  • does_not_write_global_data, which means variables do not need to be reloaded after the function call.

  • no_side_effect, which allows the compiler to eliminate the call if the result of the call is not used (see Section 5.12.4 of Chapter 5).

The presence of these pragmas gives the compiler the freedom to ignore changes to the errno variable for these mathematical routines, but does not force the compiler to do so.

The errno variable is also disrupted by the use of the -xbuiltin flag, which enables the compiler to replace calls to known library routines defined in math.h and stdio.h with intrinsic functions; -xlibmil, which replaces some mathematical function calls with equivalent inline templates (e.g., fabs and sqrt); and -xlibmopt, which uses optimized versions of some mathematical functions.

When the -mt compiler flag is specified to generate a multithreaded application, errno becomes a function call that manipulates a thread-local copy of the errno variable.

Specifying Which Floating-Point Events Cause Traps (-ftrap)

The -ftrap flag specifies which IEEE floating-point events will cause a trap. The default depends on language. Table 6.4 lists the various trapping modes.

Table 6.4. Options for the -ftrap Flag

Trapping Mode    Comment
%all             Enable all trapping modes
%none            Disable all trapping modes
common           Invalid, division by zero, and overflow enabled
[no%]invalid     Trap if the invalid operation exception is raised
[no%]overflow    Trap if the number is too large to fit into the size of variable
                 used to hold the result
[no%]underflow   Trap if the result of an operation is too small to fit into the
                 size of variable used to hold the result
[no%]division    Enable division-by-zero trap
[no%]inexact     Trap if the value of an operation is different from the exact
                 result of the same operation. Most operations raise this exception.

Note that all files in a program must be compiled with the same trapping mode for the program to give the correct behavior.

The Floating-Point Exception Flags

When a floating-point exception occurs—for example, a division by zero is encountered during the run of the program—this event is recorded in a set of floating-point exception flags. You can query the state of the flags through the ieee_flags function, as shown in Example 6.26.

Example 6.26. Accessing the Floating-Point Exception Flags

int i=ieee_flags(char* action, char* mode, char* in, char** out);

The action parameter is a string containing one of get, set, clear, or clearall. The mode parameter will typically be a string containing the word exception. The in parameter is either the name of a particular exception, or empty. If the in parameter names an exception, the out parameter will be written with the name of that exception if it has occurred. If the in parameter is empty, the out parameter will contain the name of the highest-priority exception that has occurred. Example 6.27 shows code that demonstrates how to clear and read the exception flags.

Example 6.27. Reading the Floating-Point Exception Flags

#include <sunmath.h>
#include <stdio.h>

void main()
{
  char *text;
  double a=5.55;
  ieee_flags("clear","exception","inexact",&text);
  ieee_flags("get","exception","inexact",&text);
  printf("Inexact flag %s
",text);
  a=a*1.77;
  ieee_flags("get","exception","inexact",&text);
  printf("Inexact flag %s
",text);
}

Example 6.28 shows the results of compiling and running the program from Example 6.27. The program needs to be linked with the Sun Math Library (-lsunmath).

Example 6.28. Compiling and Running Program to Read Floating-Point Exception Flags

% cc -O ex6.27.c -lsunmath
% a.out
Inexact flag
Inexact flag inexact

If the program is compiled so that traps are taken on floating-point exceptions, the application will call handlers for the exceptions. These handlers are installed using the ieee_handler function, as shown in Example 6.29.

Example 6.29. Function to Install a Floating-Point Exception Handler

ieee_handler(char* action, char* exception, sig_fpe_handler_type handler);

Depending on whether the action is get, clear, or set, this function will either return the current handler for a given type of exception, remove the current handler and disable the trap, or install a new handler for that type of exception. Example 6.30 shows an example of setting a handler.

Example 6.30. Handler for Division-by-Zero Exception

#include <sunmath.h>
#include <stdio.h>
#include <stdlib.h>
void handler(int signal)
{
  printf("Division by zero error
");
  exit(1);
}

void main()
{
  char *text;
  double a=0.0;
  ieee_handler("set","division",&handler);
  a=1/a;
}

Example 6.31 shows the results of compiling and running the code that has a handler for division-by-zero exceptions.

Example 6.31. Compiling and Running Code Containing Floating-Point Exception Handler

% cc -O ex6.30.c -lsunmath
% a.out
Division by zero error

Floating-Point Exceptions in C99

C99 defines an improved set of routines to manipulate the exception flags. You should not mix these routines with the routines outlined in Section 6.2.17. These routines, together with Sun-specific extensions, provide a much richer interface. For example, it is easily possible for the exception-handling routine to determine what operation caused the exception and what the values were, and even to write a new result back into the calculation.

To use these routines on Solaris 9 it is necessary to link with the C99 floating-point library, using the -lm9x compiler flag. You also need to specify the path to the libraries. On Solaris 10, the functionality is included in the default math library (-lm). Example 6.32 shows example code that uses this interface to set and read the exception flags.

Example 6.32. C99 Functions to Access Floating-Point Exception Flags

#include <fenv.h>
#include <stdio.h>
void handler (int ex, fex_info_t *info)
{
  printf("In handler
");
}

void main()
{
  double a=5.55;
  feclearexcept(FE_INEXACT);
  fex_set_handling(FEX_INEXACT,FEX_CUSTOM, &handler);
  if (fetestexcept(FE_INEXACT) & FE_INEXACT) {printf("Inexact flag set\n");}
  else {printf("Inexact flag clear\n");}
  a=a*1.77;
  if (fetestexcept(FE_INEXACT) & FE_INEXACT) {printf("Inexact flag set\n");}
  else {printf("Inexact flag clear\n");}
}

Example 6.33 shows the results of compiling and running this code.

Example 6.33. Compiling and Running C99 Floating-Point Exception Code

$ cc -O ex6.32.c -L/opt/SUNWspro/lib -R/opt/SUNWspro/lib -lm9x
$ a.out
Inexact flag clear
In handler
Inexact flag set

Using Inline Template Versions of Floating-Point Functions (-xlibmil)

The -xlibmil option specifies that the compiler should use inline templates for some common mathematical functions. The inline template versions of the code do not set errno or respect user-specified matherr, but will raise the appropriate floating-point exceptions, and the results do conform to IEEE-754. The advantage of using inline templates is that the call overhead is avoided, and the code can probably be better scheduled.

Example 6.34 shows some source code that has the potential to replace the call to the function fabs with an equivalent inline template. In fact, in this case SPARC already has a fabs instruction (fabss), so the inline template is a single instruction. Example 6.35 shows the result of compiling with the -xlibmil flag.

Example 6.34. Source Code with Potential for Inlining a Math Function

#include <math.h>

float test(float a)
{
  return fabs(a);
}

Example 6.35. Inline Template for fabs Used with -xlibmil

$ cc -O -xlibmil -S ex6.34.c
$ more ex6.34.s
...
/* 000000          4 */        st        %o0,[%sp+68]
/* 0x0004          5 */        ld        [%sp+68],%f2
/* 0x0008            */        retl      ! Result = %f0
/* 0x000c            */        fabss     %f2,%f0

The -xlibmil flag is closely linked to the -xbuiltin compiler flag discussed in Section 5.10.1 of Chapter 5. Often, the two flags will be used together to ensure that the compiler substitutes higher-performing versions of all the routines that it is able to.

Using the Optimized Math Library (-xlibmopt)

The optimized math library (libmopt) contains versions of the common mathematical functions that have improved speed, while raising floating-point exceptions and producing results that conform to IEEE-754. However, they do not set errno or respect user-specified matherr. Example 6.36 shows an example program.

Example 6.36. Program That Calls the Sine and Cosine Functions

#include "timing.h"
#include <math.h>

void main()
{
  int i;
  double d;
  starttime();
  for (i=0; i<100000; i++)
  {
    d=sin(1.5)+cos(1.5);
  }
  endtime(100000);
}

Example 6.37 shows the results of compiling the program with and without the optimized math library.

Example 6.37. Difference in Performance When Compiled with libmopt

% cc -O ex6.36.c -lm
% a.out
Time per iteration 492.94 ns
% cc -O ex6.36.c -xlibmopt
% a.out
Time per iteration 148.12 ns

Do Not Promote Single-Precision Values to Double Precision (-fsingle for C)

The -fsingle flag allows the compiler to keep single-precision floating-point values as single precision and not promote them to double precision. This is important only for -Xt and -Xs compiler modes (which are not the default). These modes favor the K&R standard rather than the ISO C standard, and as a result, in these modes a float variable would normally be promoted to a double. Usually, double-precision mathematics is used because it is more accurate, but avoiding the conversion improves performance.

Storing Floating-Point Constants in Single Precision (-xsfpconst for C)

The reason for the -xsfpconst flag is that the C standard specifies that floating-point constants that are not explicitly suffixed or cast are to be treated as double precision. Example 6.38 shows some code samples that demonstrate this.

Example 6.38. Sample Code to Demonstrate Promotion of Constants to Double Precision

float test1(float i){return 1.3*i;}
float test2(float i){return (double)1.5*i;}
float test3(float i){return (float)1.7*i;}
double test4(double i) {return 1.9*i;}

Example 6.39 shows the assembly code generated by the compiler for the important part of the test1 routine. After the i variable is loaded, it is converted to a double-precision value (by the instruction fstod) to be multiplied by the double-precision value 1.3, and then converted back into a single-precision value (by the instruction fdtos) to be returned by the function.

Example 6.39. Assembly Code for Critical Part of test1 Routine

/* 0x000c        */        ld        [%sp+68],%f2
/* 0x0010        */        fstod     %f2,%f0
/* 0x0014        */        fmuld     %f0,%f4,%f6
/* 0x0018        */        retl      ! Result = %f0
/* 0x001c        */        fdtos     %f6,%f0

If the code were compiled with the -xsfpconst flag, the 1.3 would be considered a single-precision value, and the conversions of the i variable and the result would be avoided. You can see the downside of compiling with the -xsfpconst flag in the disassembly for the test4 routine, when compiled with the flag.

Example 6.40 shows the disassembly code for the test4 routine, when compiled with the -xsfpconst flag. In this case, the constant is stored as a single-precision value and has to be converted to double precision before it can be used in the multiplication. The other issue here is that the constant has been stored with less precision, and this may impact the result of the calculation. Notice also that the code shown has been compiled without -dalign, so it takes two loads to load the double-precision value of i.

Example 6.40. Assembly Code for the test4 Routine, Compiled with -xsfpconst

/* 0x000c        */        ld     [%o5+%lo(___const_val_1.3)],%f2
/* 0x0010        */        fstod  %f2,%f6
/* 0x0014        */        ld     [%sp+68],%f4
/* 0x0018        */        ld     [%sp+72],%f5
/* 0x001c        */        retl   ! Result = %f0
/* 0x0020        */        fmuld  %f6,%f4,%f0

A recommendation is to always specify whether a constant is double-precision or single-precision. Obviously, this can be very tedious on a large program, and the -xsfpconst flag may help in these situations. However, it is really only necessary to specify the precision for the critical (in terms of performance or functionality) regions of code.
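
As a brief sketch of what that explicit specification looks like in source (these hypothetical variants of the routines from Example 6.38 are not from the original text), the f suffix marks a constant as single precision, avoiding the conversions without needing -xsfpconst:

/* Constants with explicit precision; no flag needed */
float  test1f(float i)  {return 1.3f * i;}   /* single-precision constant */
double test4d(double i) {return 1.9  * i;}   /* unsuffixed constant stays double */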

Floating-Point Multiply Accumulate Instructions

The SPARC64 VI processor has the capability to execute fused floating-point multiply accumulate instructions (support for these instructions was first available in the Sun Studio 12 compiler). These instructions complete the operation

d = (a×b)+c

in the same time it takes to complete either a multiply or an addition. The fused instruction also, theoretically, has greater accuracy.

When a floating-point instruction is executed, the computation is typically completed with more bits of precision than can be held in a register. At the end of the operation, the result is rounded so that it fits into a register. When a multiply instruction is followed by an addition instruction there are two rounding operations. If the multiply and addition are combined into a single instruction, there is only a single rounding operation at the end of the computation. When the multiply and addition are combined in this way, avoiding the intermediate rounding operation, the instruction is called a fused multiply accumulate. It is also possible to have unfused multiply accumulate instructions which provide the same result as performing the two operations separately.
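
The effect of the single rounding step can be seen with the C99 fma function from <math.h> (the chapter shows the related fmaf call on Solaris 10; availability of fma is assumed here). This hypothetical snippet compares a fused and an unfused computation of a*a - 1 for a value of a chosen so that the intermediate product does not fit exactly in a double:

#include <stdio.h>
#include <math.h>

int main()
{
  double a = 1.0 + 1.0/(1 << 30);        /* a = 1 + 2^-30, exactly representable */
  double unfused = a * a - 1.0;          /* product rounded, then subtraction    */
  double fused   = fma(a, a, -1.0);      /* single rounding at the end           */
  printf("unfused = %.20e\n", unfused);
  printf("fused   = %.20e\n", fused);
  return 0;
}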

Performing only a single rounding operation may cause an application to produce different results when compiled to use fused multiply accumulates than when the same application is compiled to use two instructions, and hence two rounding operations. For this reason, the compiler does not use these instructions by default. To use them, two compiler flags are necessary: first, the appropriate architecture must be specified using -xarch=sparcfmaf; second, the compiler needs permission to generate the fused instruction, using the -fma=fused flag. Example 6.41 shows the flags necessary for this, together with the assembly language that results from using these instructions.

Example 6.41. Compiler Flags Necessary to Generate Floating-Point Multiply Accumulates

$ more ex6.41.c
double add_mul(double a, double b)
{
  return a+a*b;
}
$ cc -S -O -xarch=sparcfmaf -fma=fused ex6.41.c
$ more ex6.41.s
...
/* 0x0020         */        retl    ! Result = %f0
/* 0x0024         */        fmaddd    %f4,%f2,%f4,%f0
...

On Solaris 10, which supports C99, it is also possible to call the C99 fmaf function to calculate a fused floating-point multiply accumulate, as shown in Example 6.42.

Example 6.42. Calling the C99 fmaf Function

#include <stdio.h>
#include <math.h>
void main ()
{
  printf("%f\n",fmaf(1.0,2.0,3.0));
}

Integer Math

Integer math is not affected by the use of the -fsimple compiler flag, but that does not mean that it is any less important. Both integer multiplication and division are long latency instructions, and you should avoid them if at all possible. Example 6.43 shows code that demonstrates the performance difference between integer and floating-point division.

Example 6.43. Code to Demonstrate the Performance of Integer and Floating-Point Divides

#include "timing.h"

#define SIZE 6000
#define RPT 100
static float f[SIZE],g[SIZE];
static int fi[SIZE],gi[SIZE];

int main()
{
  int index,count,total=0;
  float totalf=0.0;
  for (index=0; index<SIZE; index++)
  {
    f[index]=g[index]=(float)1.0;
    fi[index]=gi[index]=1;
  }
  printf("fp            ");
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0;index<SIZE;index++)
      totalf+=(f[index]/g[index]);
    totalf=totalf*2;
  }
  endtime(SIZE*RPT);
  printf("int           ");
  starttime();
  for (count=0; count<RPT; count++)
  {
    for (index=0;index<SIZE;index++)
      total+=(fi[index]/gi[index]);
    total=total*2;
  }
  endtime(SIZE*RPT);
  if ((total==0)||(totalf==0.0)) return 1;
}

Example 6.44 shows the results of compiling and running the code in Example 6.43.

Example 6.44. Performance of Integer and Floating-Point Divides

% cc -O ex6.43.c
% a.out
fp            Time per iteration 18.96 ns
int           Time per iteration 57.86 ns

Note that integer divide is much longer latency than integer multiply, so if it is possible to restructure the code to avoid the divide, performance may be improved.

Example 6.45 shows some code that tests the ratio of two numbers. This code will result in an integer divide operation. However, it is possible to replace the divide with a multiply. This has two general benefits. Principally, the multiply operation is faster than the divide operation, but in this case, the multiply is by a constant of three, and the compiler can replace the multiply by three with a faster sequence of operations.

Example 6.45. Example of Integer Divide

int test(int a, int b)
{
  if (a/b==3) return 1;
  return 0;
}

The alternative coding shown in Example 6.46 will not work for all values of b. In some cases, it is possible that multiplying b by three will cause an overflow, whereas the divide operation would work. More seriously, the code will not work for many situations where both a and b are negative.

Example 6.46. Alternative Coding Avoiding the Divide Operation

int test(int a, int b)
{
  if ((a>=3*b) && (a<3*b+b) )return 1;
  return 0;
}
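
As a quick way to see this limitation, a hypothetical test such as the following (not from the original text) compares the two versions over a range of values that includes negatives and reports any disagreements; for example, a=-7 and b=-2 gives a/b==3 but fails the multiply-based test.

#include <stdio.h>

/* Original version using the integer divide, from Example 6.45 */
static int test_div(int a, int b)  { return (a/b==3) ? 1 : 0; }

/* Alternative version using the multiply, from Example 6.46 */
static int test_mul(int a, int b)  { return ((a>=3*b) && (a<3*b+b)) ? 1 : 0; }

int main()
{
  int a, b, mismatches = 0;
  for (a = -50; a <= 50; a++)
    for (b = -50; b <= 50; b++)
    {
      if (b == 0) { continue; }                       /* avoid division by zero */
      if (test_div(a,b) != test_mul(a,b)) { mismatches++; }
    }
  printf("versions disagree for %d (a,b) pairs\n", mismatches);
  return 0;
}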

It is possible (in most cases) to replace integer divide with floating-point divide, but it is not permissible (due to rounding) to replace integer division with a floating-point multiplication by a reciprocal. The code in Example 6.47 shows such a situation.

Example 6.47. Example of Code That Is Sensitive to -fsimple

#include <stdio.h>

void main()
{
  int i,j;
  for (j=1;j<100; j++)
   for (i=1; i<100; i++)
    if (i/j != (int)((float)i/(float)j) )
      printf("i=%d j=%d i/j=%d fi/fj=%d
",i,j,
                i/j,(int)((float)i/(float)j) );
}

The code in Example 6.47 will not indicate any problem values when compiled without -fsimple=2; this indicates that it is a legitimate optimization to replace the integer divide with the floating-point divide. However, if this code is recompiled with -fsimple=2, the compiler will observe that, for the inner loop, the j variable is invariant. So, it is possible to calculate the reciprocal of the j variable before the inner loop, and multiply by that inside the inner loop. Compiling with -fsimple=2 produces the output shown in Example 6.48. The output shows a number of (i,j) pairs where the floating-point division produces a different result to the integer division.

Example 6.48. Output of Code When Compiled with -fsimple=2

% cc -O -fsimple=2 -xtarget=ultra3 ex6.47.c
% a.out
i=41 j=41 i/j=1 fi/fj=0
i=82 j=41 i/j=2 fi/fj=1
i=47 j=47 i/j=1 fi/fj=0
i=94 j=47 i/j=2 fi/fj=1
i=55 j=55 i/j=1 fi/fj=0
i=61 j=61 i/j=1 fi/fj=0
i=82 j=82 i/j=1 fi/fj=0
i=83 j=83 i/j=1 fi/fj=0
i=94 j=94 i/j=1 fi/fj=0
i=97 j=97 i/j=1 fi/fj=0

Other Integer Math Opportunities

Calculations can often be performed more quickly if one of the operands is a power of two. Example 6.49 shows an example of this. The modulo operation can be greatly simplified, to an AND operation, if the divisor is a power of two.

Example 6.49. Example of Powers of Two Improving Performance

#include "timing.h"

void test1(int a)
{
  if (a%100==0) {printf("Index 0");}
}

void test2(int a)
{
  if (a%128==0) {printf("Index 0");}
}

void main()
{
  int i;
  starttime();
  for (i=0; i<100000; i++) {test1(5);}
  endtime(100000);
  starttime();
  for (i=0; i<100000; i++) {test2(5);}
  endtime(100000);
}

Example 6.50 shows the results of compiling and running the code from Example 6.49.

Example 6.50. Performance Difference with Careful Choice of Divisor

$ cc -O ex6.49.c
$ a.out
Time per iteration 25.82 ns
Time per iteration 14.49 ns
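
What the compiler is effectively doing for the power-of-two case can be sketched as follows (a hypothetical illustration for unsigned values; the signed case needs an extra correction, which is why the remainder operator should still be written in the source and the transformation left to the compiler):

/* For unsigned x and a power-of-two divisor N, x % N equals x & (N-1) */
unsigned int mod128(unsigned int x)
{
  return x & 127;     /* equivalent to x % 128 for unsigned x */
}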

Note that it is relatively easy to inadvertently introduce integer math into code. The code in Example 6.51 looks correct at first glance, but the abs function takes an integer value, so the compiler has to convert the array element into an integer, call the abs function to get its absolute value, and finally convert that integer back into a floating-point number to add it onto total. In this case, the function that should be called is fabs.

Example 6.51. Using an Integer Function on Floating-Point Data

  float total,array[SIZE];
  for (index=0; index<SIZE;index++)
    total+=abs(array[index]);

Floating-Point Parameter Passing with SPARC V8 Code

When an application is compiled for SPARC V8 (32-bit code), floating-point function parameters are passed in the integer registers if they occur in the first six 32-bit parameters passed to the function (note that passing a 64-bit value requires two 32-bit registers). In the V9 instruction set architecture (ISA), the floating-point parameters are passed through the floating-point registers. Example 6.52 shows an example of where a double-precision parameter is passed into a routine using the V8 calling convention.

To do the V8 parameter passing, it is necessary for the values to be moved from the floating-point registers into the integer registers, to be passed to the function, and then back again inside the body of the function. Moving from the integer to the floating-point registers (and back) necessitates storing and loading the values to and from the stack.

However, you can avoid some of these stores and loads if the floating-point parameters are placed after the first six 32-bit parameters passed to the function. In this situation, the values are passed on the stack, so the calling code has to store them to the stack, and the called code just has to reload them from the stack. Although this is still not as fast as passing them in registers, it does cut down the number of stores and loads that are necessary.

Example 6.52. Passing Floating-Point Parameters in V8 Code

! FILE ex6.52.c

!    1                 !double fp(double a)
!    2                 !{
!    3                 ! return a*2.0;
!    4                 !}
!
! SUBROUTINE fp
!
! OFFSET    SOURCE LINE LABEL  INSTRUCTION
                        fp:
/* 000000          2 */        st      %o0,[%sp+68]
/* 0x0004          3 */        sethi   %hi(___const_seg_900000102),%o5
/* 0x0008          2 */        st      %o1,[%sp+72]
/* 0x000c          3 */        ldd     [%o5+%lo(___const_seg_900000102)],%f4
/* 0x0010            */        ld      [%sp+68],%f2
/* 0x0014            */        ld      [%sp+72],%f3
/* 0x0018            */        retl    ! Result = %f0
/* 0x001c            */        fmuld   %f2,%f4,%f0

The code shown in Example 6.53 is an example of the differences between the two ways of passing parameters.

Example 6.53. Two Different Ways of Passing Floating-Point Parameters in V8 Code

#include "timing.h"

double f1(double fp1, double fp2, double fp3,
          double fp4, double fp5, double fp6)
{ return fp1+fp2+fp3; }

double f2(double fp1, double fp2, double fp3,
          double fp4, double fp5, double fp6)
{ return fp4+fp5+fp6; }

void main()
{
  int count;
  double value;

  starttime();
  for(count=0; count<1024*1024; count++)
  { value=f1(1.0,1.0,1.0,1.0,1.0,1.0); }
  endtime(1024*1024);

  starttime();
  for(count=0; count<1024*1024; count++)
  { value=f2(1.0,1.0,1.0,1.0,1.0,1.0); }
  endtime(1024*1024);
}

Example 6.54 shows the performance difference between the two approaches. In the f1 routine the first three parameters are used. These parameters are double-precision floating-point values, so they are passed in the first six integer registers. To get the values into the floating-point registers, they need to be stored to the stack and then reloaded. The f2 routine uses parameters that are passed after the six integer registers have been used, so the values are stored to the stack by the calling routine, and then loaded by the called routine. Because the first three double-precision parameters are not used, the compiler does not have to generate code that moves them into the floating-point registers. The only cost of obtaining the floating-point values is that of loading them from the stack, hence the second routine runs substantially faster.

Example 6.54. Performance Difference from Different Ordering of Floating-Point Parameters

$ cc -O ex6.53.c
$ a.out
Time per iteration 44.26 ns
Time per iteration 25.48 ns