© Stephen Smith 2020
S. Smith, Programming with 64-Bit ARM Assembly Language, https://doi.org/10.1007/978-1-4842-5881-1_12

12. Floating-Point Operations

Stephen Smith, Gibsons, BC, Canada

In this chapter, we’ll look at what the floating-point unit (FPU) does. Some ARM documentation refers to this as the vector floating-point (VFP) coprocessor to promote the fact that it can do some limited vector processing. Any vector processing in the FPU is now replaced by the much better parallel processing provided by the NEON coprocessor, which we will study in Chapter 13, “Neon Coprocessor.” Regardless, the FPU provides several useful instructions for performing floating-point mathematics.

We’ll review what floating-point numbers are, how they’re represented in memory, and how to insert them into our Assembly Language programs. We’ll see how to transfer data between the FPU and the ARM’s regular registers and memory. We’ll also perform basic arithmetic operations, comparisons, and conversions.

About Floating-Point Numbers

Floating-point numbers are a way to represent numbers in scientific notation on the computer, which represents numbers something like this:
  • 1.456354 x 10^16

There’s a fractional part and an exponent that lets you move the decimal point to the right if it’s positive and to the left if it’s negative. The ARM CPU deals with half-precision floating-point numbers that are 16 bits in size, single-precision floating-point numbers that are 32 bits in size, and double-precision floating-point numbers that are 64 bits in size.

Note

Only newer ARM processors based on ARMv8.2 support half-precision 16-bit floating-point numbers. Older processors such as that in the Raspberry Pi 4 do not. Half-precision numbers are typically used in AI applications where speed and memory size are more important than accuracy. If you plan to use them, make sure you check if your device supports them. You may need to add -march="armv8.2-a+fp16" to the as or gcc command lines to enable support for half-precision.

The ARM CPU uses the IEEE 754 standard for floating-point numbers. Each number contains a sign bit to indicate if it’s positive or negative, a field of bits for the exponent, and a string of digits for the fractional part. Table 12-1 lists the number of bits for the parts of each format.
Table 12-1

Bits of a floating-point number

Name     Precision   Sign   Fractional   Exponent   Decimal Digits
Half     16 bits     1      10           5          3
Single   32 bits     1      23           8          7
Double   64 bits     1      52           11         16

The decimal digits column of Table 12-1 is the approximate number of decimal digits that the format can represent, or the decimal precision.
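To make the single-precision row of Table 12-1 concrete, here is a minimal sketch that builds the bit pattern for 1.0 by hand and compares it with what the Assembler generates. The labels checkbits, onefp, and onebits are made up for this illustration, and the sketch assumes the standard IEEE 754 exponent bias of 127 for single precision.

```asm
//
// Sketch: returns W0 = 1 if the Assembler's encoding of 1.0
// matches the fields we assembled by hand, else 0.
//
.global checkbits
checkbits:
      LDR    X1, =onefp
      LDR    W2, [X1]        // raw 32 bits of the .single 1.0
      LDR    W3, [X1, #4]    // our hand-built pattern
      CMP    W2, W3
      CSET   W0, EQ          // W0 = 1 when the patterns match
      RET
.data
onefp:   .single 1.0
// sign = 0 (bit 31), exponent = 127 = 0x7F (bits 30-23), fraction = 0
onebits: .word   (0 << 31) | (0x7F << 23) | 0
```

Both words should hold 0x3F800000: the 1 sign bit, 8 exponent bits, and 23 fractional bits add up to the 32 bits of the format.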

About Normalization and NaNs

In the integers we’ve seen so far, every combination of bits represents a valid, unique number. No two different bit patterns produce the same number; however, this isn’t the case in floating point. First of all, we have the concept of Not a Number (NaN). NaNs are produced by invalid operations like dividing zero by zero or taking the square root of a negative number. (Dividing a nonzero number by zero produces an infinity rather than a NaN.) These special values allow the error to quietly propagate through the calculation without crashing a program. In the IEEE 754 specification, a NaN is represented by an exponent of all one bits, for example, 11111 for half precision, together with a nonzero fractional part; an all-ones exponent with a zero fractional part represents infinity.
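As a small illustration of how a NaN appears and how a program can detect one, the following sketch takes the square root of -1.0 and then compares the result with itself. The function name sqrtnan is hypothetical. It relies on two facts: with the default FPU settings, the invalid operation doesn't trap, and a comparison involving a NaN is "unordered," which FCMP reports by setting the V flag.

```asm
//
// Sketch: returns W0 = 1 if sqrt(-1.0) produced a NaN, else 0.
//
.global sqrtnan
sqrtnan:
      FMOV   S0, #-1.0      // -1.0 fits the FMOV immediate encoding
      FSQRT  S1, S0         // invalid operation: the result is a NaN
      FCMP   S1, S1         // a NaN is unordered, even against itself
      CSET   W0, VS         // V set means the compare was unordered
      RET
```

Comparing a register with itself is a common NaN test, because every ordinary number compares equal to itself.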

A normalized floating-point number means the first digit in the fractional part is nonzero. A problem with floating-point numbers is that numbers can often be represented in multiple ways. For instance, a fractional part of 0 with either sign bit and any exponent is zero. Consider a representation of 1:
  • 1E0 = 0.1E1 = 0.01E2 = 0.001E3

All of these represent 1, but we call the first one with no leading zeros the normalized form. The ARM FPU tries to keep floating-point numbers in normal form, but will break this rule for small numbers, where the exponent is already as negative as it can go. To try to avoid underflow errors, the FPU then gives up on normalization so it can represent numbers a bit smaller than it could otherwise. IEEE 754 calls these denormalized or subnormal numbers.

Recognizing Rounding Errors

If we take a number like 1/3 = 0.33333..., and represent it in floating point, then we only keep seven or so digits for single precision. This introduces rounding errors. If these are a problem, usually going to double precision solves the problem, but some calculations are prone to magnifying rounding errors, such as subtracting two numbers that have a minute difference.

Note

Floating-point numbers are represented in base two, so the fractions that lead to repeating patterns of digits are different from those in base 10. It comes as a surprise to many people that 0.1 is a repeating binary fraction, 0.00011001100110011…, meaning that adding dollars and cents in floating point will introduce rounding error over enough calculations.

For financial calculations, most applications use fixed point arithmetic that is built on integer arithmetic to avoid rounding errors in addition and subtraction.
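A minimal sketch of the fixed point idea, assuming amounts are stored as unsigned whole cents in integer registers: the addition is ordinary integer arithmetic, and we only split the total into dollars and cents when we need to display it. The function name addcents and its register usage are made up for this illustration.

```asm
//
// Sketch: add two amounts held as whole cents and split the total.
// Inputs:
//    W0, W1 - amounts in cents (unsigned)
// Outputs:
//    W0 - whole dollars, W1 - remaining cents
//
.global addcents
addcents:
      ADD    W2, W0, W1        // integer add is exact (barring overflow)
      MOV    W3, #100
      UDIV   W0, W2, W3        // W0 = total / 100
      MSUB   W1, W0, W3, W2    // W1 = total - W0 * 100
      RET
```

For example, adding 1234 cents and 66 cents gives 1300 cents, which splits into 13 dollars and 0 cents, with no rounding involved.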

Defining Floating-Point Numbers

The GNU Assembler has directives for defining storage for both single- and double-precision floating-point numbers. These are .single and .double, for example:
.single    1.343, 4.343e20, -0.4343, -0.4444e-10
.double    -4.24322322332e-10, 3.141592653589793

These directives always take base 10 numbers.

Note

The GNU Assembler doesn’t have a directive for 16-bit half-precision floating-point numbers, so we need to load a single- or double-precision value and then convert it to half precision.
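Here is a sketch of that load-and-convert approach, using the FCVT instruction covered later in this chapter. The names tohalf, sval, and hval are hypothetical. The sketch assumes that converting and storing a half-precision value needs only the base floating-point support, and that only half-precision arithmetic needs the ARMv8.2 extension.

```asm
//
// Sketch: convert a single-precision constant to half precision.
// Outputs:
//    W0 - the raw 16 bits of the half-precision value
//
.global tohalf
tohalf:
      LDR    X1, =sval
      LDR    S0, [X1]        // load the single-precision value
      FCVT   H1, S0          // round it to half precision
      STR    H1, [X1, #4]    // store the 16-bit result in hval
      LDRH   W0, [X1, #4]    // return the raw bits in W0
      RET
.data
sval:  .single 1.5
hval:  .hword  0             // receives the half encoding of 1.5
```

For 1.5, the half-precision encoding works out to 0x3E00: a sign bit of 0, a 5-bit exponent of 01111, and a 10-bit fractional part of 1000000000.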

About FPU Registers

The ARM FPU and the NEON coprocessor share a set of registers. There are 32 128-bit registers referred to as V0, …, V31. In the same way that a W register is half an X register, we have 32 double-precision floating-point registers D0, …, D31. In this case D0 is the lower 64 bits of V0, D1 is the lower 64 bits of V1, and so on. We can refer to the lower 32 bits of each of these registers using S0, …, S31 and then the lower 16 bits of each register using H0, …, H31. Figure 12-1 shows this configuration of registers.
Figure 12-1

A single ARM FPU register; the format of the data depends on how you reference the register

Note

The register H1 is the lower 16 bits of register S1 which is the lower 32 bits of register D1 which is the lower 64 bits of the 128-bit register V1.

The floating-point unit can only process values up to 64 bits in size. We’ll see how the full 128 bits are used by the NEON Coprocessor in Chapter 13, “Neon Coprocessor.” We still need to be aware of the full 128 bits, since we may need to save a register to the stack as part of the function calling protocol. The NEON Coprocessor can place integers in these registers as well, and for 128-bit integers it labels the registers Q0, …, Q31. This matters in this chapter because some instructions use the Q name to refer to the whole 128 bits; as we will see in the next section, we need to refer to the registers as Q registers to push and pop them to and from the stack.

Defining the Function Call Protocol

In Chapter 6, “Functions and the Stack,” we gave the protocol for who saves which registers when calling functions. With these floating-point registers, we must add them to our protocol.
  • Callee saved: The function is responsible for saving registers V8-V15. They need to be saved by a function, if the function uses them.

  • Caller saved: All other registers don’t need to be saved by a function, so they must be saved by the caller if they are required to be preserved. This includes V0-V7 which are used to pass parameters.

Many of the Assembly instructions that we have seen will take floating-point registers as well as W and X integer registers. For instance, we can use STP, STR, LDP, and LDR to load and save these registers to and from memory. In the context here, we can continue to use these to push and pop values to and from the stack. We need to keep in mind that the Q registers are 128 bits or 16 bytes in size. Thus, the following are examples of pushing and popping floating-point registers:
STP   Q8, Q9, [SP, #-32]!
STR   Q10, [SP, #-16]!
LDR   Q10, [SP], #16
LDP   Q8, Q9, [SP], #32
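To show the protocol at work, here is a sketch of a function that keeps a value in a callee-saved register across a call. The functions sqplusx and square are hypothetical, and the sketch assumes the parameter and the return value both travel in S0.

```asm
//
// Sketch: returns square(x) + x for the single passed in S0.
//
.global sqplusx
sqplusx:
      STR    LR, [SP, #-16]!
      STR    Q8, [SP, #-16]!   // V8 is callee saved, preserve it first
      FMOV   S8, S0            // keep x where a callee must not disturb it
      BL     square            // free to overwrite V0-V7 and V16-V31
      FADD   S0, S0, S8        // S0 = square(x) + x
      LDR    Q8, [SP], #16     // pop in the reverse order of the pushes
      LDR    LR, [SP], #16
      RET
square:
      FMUL   S0, S0, S0        // S0 = x * x
      RET
```

Each push moves the stack pointer by 16 bytes, so the stack stays 16-byte aligned throughout.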

Loading and Saving FPU Registers

In Chapter 5, “Thanks for the Memories,” we covered the LDR and STR instructions to load registers from memory, then store them back to memory. The FPU registers can all be used in these instructions, for example:
      LDR    X1, =fp1
      LDR    S4, [X1]
      LDR    D5, [X1, #4]
      STR    S4, [X1]
      STR    D5, [X1, #4]
      ...
.data
fp1:   .single    3.14159
fp2:   .double    4.3341
fp3:   .single    0.0
fp4:   .double    0.0
We can also move data between the CPU’s integer registers and the FPU with the FMOV instruction. This instruction also lets you move data between FPU registers. Generally, the registers should be the same size, but for half-precision H registers, you can copy them into larger integer registers, for example:
  • FMOV    H1, W2

  • FMOV    W2, H1

  • FMOV    S1, W2

  • FMOV    X1, D2

  • FMOV    D2, D3

Note

The FMOV instruction copies the bits unmodified. It doesn’t perform any sort of conversion.
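The difference between copying bits and converting a value is easy to see with the integer 1. In this sketch, which uses the made-up function name movvscvt, FMOV places the integer’s bit pattern in an S register, while SCVTF (covered later in this chapter) produces the floating-point value 1.0.

```asm
//
// Sketch: contrast a bit copy with a numeric conversion.
// Outputs:
//    W0 - bits of S0 after FMOV, W1 - bits of S1 after SCVTF
//
.global movvscvt
movvscvt:
      MOV    W2, #1
      FMOV   S0, W2      // S0 = bit pattern 0x00000001, a tiny
                         // subnormal value, not 1.0
      SCVTF  S1, W2      // S1 = 1.0, a true conversion
      FMOV   W0, S0      // W0 = 0x00000001
      FMOV   W1, S1      // W1 = 0x3F800000, the encoding of 1.0
      RET
```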

Performing Basic Arithmetic

The FPU includes the four basic arithmetic operations, along with a few extensions like multiply and accumulate. There are some specialty functions like square root and quite a few variations that affect the sign, such as negated versions of functions.

Each of these functions can operate on either H, S, or D registers. Here’s a selection of the instructions. We list the three forms of the FADD instruction with each floating-point type, then list the rest with just the D registers to save space:
  • FADD    Hd, Hn, Hm    // Hd = Hn + Hm

  • FADD    Sd, Sn, Sm    // Sd = Sn + Sm

  • FADD    Dd, Dn, Dm    // Dd = Dn + Dm

  • FSUB    Dd, Dn, Dm    // Dd = Dn - Dm

  • FMUL    Dd, Dn, Dm    // Dd = Dn * Dm

  • FDIV    Dd, Dn, Dm    // Dd = Dn / Dm

  • FMADD    Dd, Dn, Dm, Da    // Dd = Da + Dm * Dn

  • FMSUB    Dd, Dn, Dm, Da    // Dd = Da - Dm * Dn

  • FNEG    Dd, Dn        // Dd = - Dn

  • FABS    Dd, Dn        // Dd = Absolute Value( Dn )

  • FMAX    Dd, Dn, Dm    // Dd = Max( Dn, Dm )

  • FMIN    Dd, Dn, Dm    // Dd = Min( Dn, Dm )

  • FSQRT    Dd, Dn        // Dd = Square Root( Dn )

These functions are all fairly simple, so let’s move on to an example using floating-point functions.
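Before the larger example, here is a short sketch of FMADD in use. It evaluates the quadratic c2*x^2 + c1*x + c0 by rewriting it as c0 + (c1 + c2*x)*x, which needs just two multiply-and-accumulate instructions. The function name horner2 and the layout of the coefficient array are assumptions made for this illustration.

```asm
//
// Sketch: evaluate c2*x^2 + c1*x + c0 in double precision.
// Inputs:
//    D0 - x
//    X0 - pointer to the 3 coefficients c2, c1, c0
// Outputs:
//    D0 - the value of the polynomial
//
.global horner2
horner2:
      LDP    D1, D2, [X0]        // D1 = c2, D2 = c1
      LDR    D3, [X0, #16]       // D3 = c0
      FMADD  D1, D1, D0, D2      // D1 = c1 + c2 * x
      FMADD  D0, D1, D0, D3      // D0 = c0 + (c1 + c2*x) * x
      RET
```

FMADD performs the multiply and the add with a single rounding at the end, so it is usually a little more accurate than a separate FMUL followed by an FADD.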

Calculating Distance Between Points

If we have two points (x1, y1) and (x2, y2), then the distance between them is given by the formula
  • d = sqrt( (y2-y1)^2 + (x2-x1)^2 )

Let’s write a function to calculate this for any two points with single-precision floating-point coordinates. We’ll use the C runtime’s printf function to print out our results. First of all, copy the distance function from Listing 12-1 to the file distance.s.
//
// Example function to calculate the distance
// between two points in single precision
// floating-point.
//
// Inputs:
//    X0 - pointer to the 4 FP numbers
//           they are x1, y1, x2, y2
// Outputs:
//    W0 - the distance (as single precision FP)
.global distance // Allow function to be called by others
//
distance:
      // save LR to be safe; this function calls nothing
      // else, so the push isn't strictly needed.
      STR    LR, [SP, #-16]!
      // load all 4 numbers at once
      LDP    S0, S1, [X0], #8
      LDP    S2, S3, [X0]
      // calc s4 = x2 - x1
      FSUB   S4, S2, S0
      // calc s5 = y2 - y1
      FSUB   S5, S3, S1
      // calc s4 = S4 * S4 (x2-X1)^2
      FMUL   S4, S4, S4
      // calc s5 = s5 * s5 (Y2-Y1)^2
      FMUL   S5, S5, S5
      // calc S4 = S4 + S5
      FADD   S4, S4, S5
      // calc sqrt(S4)
      FSQRT  S4, S4
      // move result to W0 to be returned
      FMOV   W0, S4
      // restore what we preserved.
      LDR    LR, [SP], #16
      RET
Listing 12-1

Function to calculate the distance between two points

Place the code from Listing 12-2 in main.s; it calls distance three times with three different pairs of points and prints out the distance for each one.
//
// Main program to test our distance function
//
// W19 - loop counter
// X20 - address to current set of points
.global main // Provide program starting address to linker
//
      .equ  N, 3   // Number of pairs of points.
main:
      STP   X19, X20, [SP, #-16]!
      STR   LR, [SP, #-16]!
      LDR   X20, =points   // pointer to current points
      MOV   W19, #N        // number of loop iterations
loop: MOV   X0, X20        // move pointer to parameter 1 (X0)
      BL    distance       // call distance function
// need to take the single precision return value
// and convert it to a double, because the C printf
// function can only print doubles.
      FMOV  S2, W0         // move back to fpu for conversion
      FCVT  D0, S2         // convert single to double
      FMOV  X1, D0         // return double to X1
      LDR   X0, =prtstr    // load print string
      BL    printf         // print the distance
      ADD   X20, X20, #(4*4)      // 4 coordinates each 4 bytes
      SUBS  W19, W19, #1   // decrement loop counter
      B.NE  loop           // loop if more points
      MOV   X0, #0         // return code
      LDR   LR, [SP], #16
      LDP   X19, X20, [SP], #16
      RET
.data
points:     .single        0.0, 0.0, 3.0, 4.0
            .single        1.3, 5.4, 3.1, -1.5
            .single 1.323e10, -1.2e-4, 34.55, 5454.234
prtstr:     .asciz "Distance = %f\n"
Listing 12-2

Main program to call the distance function three times

The makefile is in Listing 12-3.
distance: distance.s main.s
      gcc -g -o distance distance.s main.s
Listing 12-3

Makefile for the distance program

If we build and run the program, we get
smist08@kali:~/asm64/Chapter 12$ make
gcc -g -o distance distance.s main.s
smist08@kali:~/asm64/Chapter 12$ ./distance
Distance = 5.000000
Distance = 7.130919
Distance = 13230000128.000000
smist08@kali:~/asm64/Chapter 12$

We constructed the data, so the first set of points comprise a 3-4-5 triangle, which is why we get the exact answer of 5 for the first distance.

The distance function is straightforward. It loads the four numbers in two LDP instructions, then calls the various floating-point arithmetic functions to perform the calculation. This function operates on single-precision 32-bit floating-point numbers using the S versions of the registers.

The part of the main routine that loops and calls the distance routine is straightforward. The part that calls printf has a couple of new complexities. The problem is that the C printf routine only supports printing doubles. In C this isn’t much of a problem, since you can just cast the argument to force a conversion. In Assembly Language, we need to convert our single-precision result to a double-precision number, so we can print it.

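To do the conversion, we FMOV the result back to the FPU. We use the FCVT instruction to convert from single to double precision. This function is the topic of the next section. We then FMOV the freshly constructed double to register X1.

When we call printf, the first parameter, the printf format string, goes in X0. Under the standard Linux calling convention, floating-point arguments are passed in the floating-point registers, even to a variadic function like printf, so printf reads the double from D0, where FCVT left it. The copy in X1 is harmless, but it is not what gets printed.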
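The following stand-alone sketch prints a single-precision constant and passes the converted double to printf in D0 only. It assumes the 64-bit ARM Linux calling convention, where floating-point arguments to variadic functions travel in the V registers; other platforms may differ. The labels val and fmt are made up for this illustration.

```asm
//
// Sketch: print a single-precision value with printf,
// passing the converted double in D0 only.
//
.global main
main:
      STR    LR, [SP, #-16]!
      LDR    X1, =val
      LDR    S0, [X1]       // load the single
      FCVT   D0, S0         // printf expects a double, in D0
      LDR    X0, =fmt       // format string is the first parameter
      BL     printf
      MOV    X0, #0         // return code
      LDR    LR, [SP], #16
      RET
.data
val:  .single 2.5
fmt:  .asciz "value = %f\n"
```

Built with gcc on a Linux system, this should print value = 2.500000, even though nothing is ever placed in X1.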

Note

If you are debugging the program with gdb, and you want to see the contents of the FPU registers at any point, use the “info all-registers” command that will exhaustively list all the coprocessor registers.

Performing Floating-Point Conversions

In the last example, we had our first look at the conversion instruction FCVT. The FPU supports a family of conversion instructions: besides conversions between single- and double-precision floating-point numbers, it supports conversions to and from integers. It also supports conversion to fixed-point numbers (integers with an implied binary point), and it supports several rounding methods. The most used versions of FCVT are
  • FCVT    Dd, Sm

  • FCVT    Sd, Dm

  • FCVT    Sd, Hm

  • FCVT    Hd, Sm

These convert between single and double precision, and between half and single precision.

To convert from an integer to a floating-point number, we have
  • SCVTF    Dd, Xm        // Dd = signed integer from Xm

  • UCVTF    Sd, Wm    // Sd = unsigned integer from Wm

To convert from floating point to integer, we have several choices for how we want rounding handled:
  • FCVTAS    Wd, Hn    // signed, round to nearest, ties away from zero

  • FCVTAU    Wd, Sn    // unsigned, round to nearest, ties away from zero

  • FCVTMS    Xd, Dn    // signed, round towards minus infinity

  • FCVTMU    Xd, Dn    // unsigned, round towards minus infinity

  • FCVTPS    Xd, Dn    // signed, round towards positive infinity

  • FCVTPU    Xd, Dn    // unsigned, round towards positive infinity

  • FCVTZS    Xd, Dn    // signed, round towards zero

  • FCVTZU    Xd, Dn    // unsigned, round towards zero
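The rounding modes are easiest to tell apart on a value that sits exactly halfway between two integers. This sketch converts -2.5 with several of the instructions above; the function name roundmodes is hypothetical. It also uses FCVTNS, which isn’t in the list above and rounds to nearest with ties going to the even integer.

```asm
//
// Sketch: convert -2.5 to an integer with different rounding.
// Outputs:
//    X0 to X4 - the converted values
//
.global roundmodes
roundmodes:
      FMOV    D0, #-2.5
      FCVTZS  X0, D0       // towards zero: -2
      FCVTMS  X1, D0       // towards minus infinity: -3
      FCVTPS  X2, D0       // towards positive infinity: -2
      FCVTAS  X3, D0       // nearest, ties away from zero: -3
      FCVTNS  X4, D0       // nearest, ties to even: -2
      RET
```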

Comparing Floating-Point Numbers

Most floating-point instructions don’t have “S” versions and therefore don’t update the condition flags. The main instruction that updates these flags is the FCMP instruction. Here are its forms:
  • FCMP    Hd, Hm

  • FCMP    Hd, #0.0

  • FCMP    Sd, Sm

  • FCMP    Sd, #0.0

  • FCMP    Dd, Dm

  • FCMP    Dd, #0.0

It can compare two half-precision registers, two single-precision registers, or two double-precision registers. It allows one immediate value, namely, zero, so it can compare half-, single-, or double-precision register to zero. This is needed since there is no floating-point zero register.

The FCMP instruction updates the condition flags based on subtracting the operands, like the CMP instruction we studied in Chapter 4, “Controlling Program Flow.”
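As a short sketch of FCMP in use, the following function replaces a negative input with zero. The function name clampzero is made up. The sketch uses FCSEL, the floating-point conditional select instruction, which isn’t covered in this chapter; a conditional branch would work as well.

```asm
//
// Sketch: returns S0 if it is greater than zero, else 0.0.
// Inputs:
//    S0 - the value to clamp
// Outputs:
//    S0 - the clamped value
//
.global clampzero
clampzero:
      FMOV   S1, WZR            // S1 = +0.0
      FCMP   S0, #0.0           // set flags from S0 compared with zero
      FCSEL  S0, S0, S1, GT     // keep S0 if S0 > 0.0, else take S1
      RET
```

If S0 holds a NaN, the comparison is unordered, the GT condition is false, and the function returns 0.0.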

Testing for equality of floating-point numbers is problematic: because of rounding error, numbers are often close but not exactly equal. The solution is to decide on a tolerance, then consider numbers equal if they are within the tolerance of each other. For instance, we might define e = 0.000001 and then consider two registers equal if
  • abs(S1 - S2) < e

where abs() is a function to calculate the absolute value.

Example

Create a routine to test if two floating-point numbers are equal using this technique. We’ll first add 100 cents, then test if they exactly equal $1.00 (spoiler alert, they won’t). Then we’ll compare the sum using our fpcomp routine that tests them within a supplied tolerance (usually referred to as epsilon).

Start with our floating-point comparison routine, placing the contents of Listing 12-4 into fpcomp.s.
//
// Function to compare two floating-point numbers
// the parameters are a pointer to the two numbers
// and an error epsilon.
//
// Inputs:
//    X0 - pointer to the 3 FP numbers
//           they are x1, x2, e
// Outputs:
//    X0 - 1 if they are equal, else 0
.global fpcomp // Allow function to be called by others
fpcomp:      // load the 3 numbers
      LDP    S0, S1, [X0], #8
      LDR    S2, [X0]
      // calc s3 = x2 - x1
      FSUB   S3, S1, S0
      FABS   S3, S3
      FCMP   S3, S2
      B.GE          notequal   // not equal if abs(x2 - x1) >= e
      MOV           X0, #1
      B             done
notequal:MOV        X0, #0
done: RET
Listing 12-4

Routine to compare two floating-point numbers within a tolerance

Now the main program maincomp.s contains Listing 12-5.
//
// Main program to test our comparison function
//
// W19 - loop counter
.global main // Provide program starting address
      .equ   N, 100        // Number of additions.
main:
      STP    X19, X20, [SP, #-16]!
      STR    LR, [SP, #-16]!
// Add up one hundred cents and test if they equal $1.00
      MOV    W19, #N       // number of loop iterations
// load cents, running sum and real sum to FPU
      LDR    X0, =cent
      LDP    S0, S1, [X0], #8
      LDR    S2, [X0]
loop:
      // add cent to running sum
      FADD   S1, S1, S0
      SUBS   W19, W19, #1  // decrement loop counter
      B.NE   loop          // loop if more additions
      // store the running sum back to memory, so fpcomp
      // reads the real total rather than the initial 0.0
      LDR    X0, =runsum
      STR    S1, [X0]
      // compare running sum to real sum
      FCMP   S1, S2
      // print if the numbers are equal or not
      B.EQ   equal
      LDR    X0, =notequalstr
      BL     printf
      B      next
equal:  LDR  X0, =equalstr
      BL     printf
next:
// load pointer to running sum, real sum and epsilon
      LDR    X0, =runsum
// call comparison function
      BL     fpcomp        // call comparison function
// compare return code to 1 and print if the numbers
// are equal or not (within epsilon).
      CMP    X0, #1
      B.EQ   equal2
      LDR    X0, =notequalstr
      BL     printf
      B      done
equal2: LDR  X0, =equalstr
      BL     printf
done: MOV    X0, #0        // return code
      LDR    LR, [SP], #16
      LDP    X19, X20, [SP], #16
      RET
.data
cent:        .single 0.01
runsum:      .single 0.0
sum:         .single 1.00
epsilon:     .single 0.00001
equalstr:    .asciz "equal\n"
notequalstr: .asciz "not equal\n"
Listing 12-5

Main program to add up 100 cents and compare to $1.00

The makefile, in Listing 12-6, is as we would expect.
fpcomp: fpcomp.s maincomp.s
      gcc -g -o fpcomp fpcomp.s maincomp.s
Listing 12-6

The makefile for the floating-point comparison example

If we build and run the program, we get
smist08@kali:~/asm64/Chapter 12$ make
gcc -g -o fpcomp fpcomp.s maincomp.s
smist08@kali:~/asm64/Chapter 12$ ./fpcomp
not equal
equal
smist08@kali:~/asm64/Chapter 12$
If we run the program under gdb, we can examine the sum of 100 cents. We see
s0  {f = 0x0, u = 0x3c23d70a, s = 0x3c23d70a} {f = 0.00999999978, u = 1008981770, s = 1008981770}
s1  {f = 0x0, u = 0x3f7ffff5, s = 0x3f7ffff5} {f = 0.999999344, u = 1065353205, s = 1065353205}
s2  {f = 0x1, u = 0x3f800000, s = 0x3f800000} {f = 1, u = 1065353216, s = 1065353216}

S0 contains a cent, $0.01, and we see from gdb that this hasn’t been represented exactly and this is where rounding error will come in. The sum of 100 cents ends up being in register S1 as 0.999999344, which doesn’t equal our expected sum of 1 contained in register S2.

Then we call our fpcomp routine that determines if the numbers are within the provided tolerance and hence considers them equal.

It didn’t take that many additions to start introducing rounding errors into our sums. Be careful when using floating point for this reason.

Summary

In this chapter, we learned the following:

  • What floating-point numbers are and how they are represented

  • Normalization, NaNs, and rounding error

  • How to create floating-point numbers in our .data section

  • The bank of floating-point registers and how half-, single-, and double-precision values are contained in them

  • How to load data into the floating-point registers and how to perform mathematical operations and save them back to memory

  • How to convert between different floating-point types, compare floating-point numbers, and copy the result back to the ARM CPU, and the effect rounding errors have on these comparisons

In Chapter 13, “Neon Coprocessor,” we’ll look at how to perform multiple floating-point operations in parallel.

Exercises

  1. Create a program to load and add the following numbers:

     2.343 + 5.3

     3.5343425445 + 1.534443455

     3.14e12 + 5.55e-10

     How accurate are the results?

  2. Integer division by 0 resulted in the incorrect answer of 0. Create a program to perform a floating-point division by 0 and see what the result is.

  3. The ARM FPU has a square root function, but no trigonometric functions. Write a function to calculate the sine of an angle in radians using the approximate formula:

     sin x = x − x^3/3! + x^5/5! − x^7/7!

     where ! stands for factorial and is calculated as 3! = 3 * 2 * 1. Write a main program to call this function with several test values.