S. Smith, Programming with 64-Bit ARM Assembly Language, https://doi.org/10.1007/978-1-4842-5881-1_12

12. Floating-Point Operations

Stephen Smith, Gibsons, BC, Canada

In this chapter, we’ll look at what the floating-point unit (FPU) does. Some ARM documentation refers to this as the vector floating-point (VFP) coprocessor to promote the fact that it can do some limited vector processing. Any vector processing in the FPU is now replaced by the much better parallel processing provided by the NEON coprocessor, which we will study in Chapter 13, “Neon Coprocessor.” Regardless, the FPU provides several useful instructions for performing floating-point mathematics.

We’ll review what floating-point numbers are, how they’re represented in memory, and how to insert them into our Assembly Language programs. We’ll see how to transfer data between the FPU and the ARM’s regular registers and memory. We’ll also perform basic arithmetic operations, comparisons, and conversions.

About Floating-Point Numbers

Floating-point numbers are the computer's way of representing numbers in scientific notation, which looks something like this:

1.456354 x 10¹⁶

There’s a fractional part and an exponent that lets you move the decimal point to the right if it’s positive and to the left if it’s negative. The ARM CPU deals with half-precision floating-point numbers that are 16 bits in size, single-precision floating-point numbers that are 32 bits in size, and double-precision floating-point numbers that are 64 bits in size.

Note

Only newer ARM processors based on ARMv8.2 support half-precision 16-bit floating-point numbers. Older processors such as that in the Raspberry Pi 4 do not. Half-precision numbers are typically used in AI applications where speed and memory size are more important than accuracy. If you plan to use them, make sure you check if your device supports them. You may need to add -march=armv8.2-a+fp16 to the as or gcc command lines to enable support for half-precision.

The ARM CPU uses the IEEE 754 standard for floating-point numbers. Each number contains a sign bit to indicate if it’s positive or negative, a field of bits for the exponent, and a field of bits for the fractional part. Table 12-1 lists the number of bits for the parts of each format.

Table 12-1

Bits of a floating-point number

Name	Precision	Sign	Fractional	Exponent	Decimal Digits
Half	16 bits	1	10	5	3
Single	32 bits	1	23	8	7
Double	64 bits	1	52	11	16

The decimal digits column of Table 12-1 is the approximate number of decimal digits that the format can represent, or the decimal precision.
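To make the layout in Table 12-1 concrete, here is a minimal C sketch that splits a single-precision value into its three fields. C is used here only because it runs on any machine; the names single_fields and decode_single are hypothetical, and the sketch assumes the host stores float as a 32-bit IEEE 754 value. The exponent field is stored with a bias of 127, so 1.0 has an exponent field of 127.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit IEEE 754 single-precision number. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored with a bias of 127 */
    uint32_t fraction;  /* 23 bits */
} single_fields;

/* Hypothetical helper: split a float into the fields of Table 12-1. */
single_fields decode_single(float value)
{
    uint32_t bits;
    single_fields f;

    memcpy(&bits, &value, sizeof bits);  /* copy the raw bits, no conversion */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For example, decode_single(-2.0f) yields a sign of 1, an exponent field of 128, and a fraction of 0.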

About Normalization and NaNs

In the integers we’ve seen so far, all combinations of the bits provide a valid unique number. No two different patterns of bits produce the same number; however, this isn’t the case in floating point. First of all, we have the concept of Not a Number (NaN). NaNs are produced from invalid operations like dividing zero by zero or taking the square root of a negative number. (Dividing a nonzero number by zero produces an infinity rather than a NaN.) NaNs allow the error to quietly propagate through the calculation without crashing a program. In the IEEE 754 specification, a NaN is represented by an exponent of all one bits, for example, 11111, depending on the size of the exponent, together with a nonzero fractional part. An exponent of all one bits with a zero fractional part represents infinity.

A normalized floating-point number means the first digit in the fractional part is nonzero. A problem with floating-point numbers is that numbers can often be represented in multiple ways. For instance, a fractional part of 0 with either sign bit and any exponent is zero. Consider a representation of 1:

1E0 = 0.1E1 = 0.01E2 = 0.001E3

All of these represent 1, but we call the first one, with no leading zeros, the normalized form. The ARM FPU tries to keep floating-point numbers in normal form but breaks this rule for very small numbers. When the exponent is already as negative as it can go, the FPU gives up on normalization to avoid underflow, which lets it represent numbers a bit smaller than it otherwise could.
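These special cases can be observed from C, which follows the same IEEE 754 rules on typical hardware. The sketch below is illustrative only; the helper names are hypothetical, and it assumes the compiler is not run with options such as -ffast-math that relax NaN handling. The volatile variables keep the compiler from evaluating the divisions at build time.

```c
#include <float.h>
#include <math.h>

/* 0.0 / 0.0 is an invalid operation and produces a NaN. */
float zero_over_zero(void)
{
    volatile float zero = 0.0f;
    return zero / zero;
}

/* A nonzero number divided by zero produces an infinity, not a NaN. */
float one_over_zero(void)
{
    volatile float zero = 0.0f;
    return 1.0f / zero;
}

/* A NaN is the only value that does not compare equal to itself. */
int is_nan(float x)
{
    return x != x;
}

/* True for nonzero values too small to be stored in normalized form. */
int is_subnormal(float x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

FLT_MIN is the smallest normalized single-precision value, so is_subnormal(FLT_MIN / 2.0f) is true while is_subnormal(FLT_MIN) is false.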

Recognizing Rounding Errors

If we take a number like 1/3 = 0.33333..., and represent it in floating point, then we only keep seven or so digits for single precision. This introduces rounding errors. If these are a problem, going to double precision usually solves them, but some calculations are prone to magnifying rounding errors, such as subtracting two numbers that have a minute difference.

Note

Floating-point numbers are represented in base two, so the fractions that expand into repeating patterns of digits differ from those in base 10. It comes as a surprise to many people that 0.1 is a repeating binary fraction, 0.00011001100110011…, meaning that adding dollars and cents in floating point will introduce rounding error over enough calculations.

For financial calculations, most applications use fixed point arithmetic that is built on integer arithmetic to avoid rounding errors in addition and subtraction.
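The difference between the two approaches shows up after only a hundred additions. The following C sketch is an illustration rather than production code; the function names are hypothetical. It adds one cent repeatedly, once in single precision and once as an integer count of cents.

```c
#include <stdint.h>

/* Add one cent n times in single-precision floating point. */
float sum_cents_float(int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++)
        total += 0.01f;      /* 0.01 has no exact binary representation */
    return total;
}

/* Add one cent n times in fixed point: an integer number of cents. */
int64_t sum_cents_fixed(int n)
{
    int64_t total_cents = 0;
    for (int i = 0; i < n; i++)
        total_cents += 1;    /* exactly one cent, no rounding */
    return total_cents;
}
```

With n = 100, the fixed-point total is exactly 100 cents, while the floating-point total is close to, but not equal to, 1.0.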

Defining Floating-Point Numbers

The GNU Assembler has directives for defining storage for both single- and double-precision floating-point numbers. These are .single and .double, for example:

.single 1.343, 4.343e20, -0.4343, -0.4444e-10
.double -4.24322322332e-10, 3.141592653589793

These directives always take base 10 numbers.

Note

The GNU Assembler doesn’t have a directive for 16-bit half-precision floating-point numbers, so we need to load a single- or double-precision number and then convert it.

About FPU Registers

The ARM FPU and the NEON coprocessor share a set of registers. There are 32 128-bit registers referred to as V0, …, V31. In the same way that a W register is half an X register, we have 32 double-precision floating-point registers D0, …, D31. In this case D0 is the lower 64 bits of V0, D1 is the lower 64 bits of V1, and so on. We can refer to the lower 32 bits of each of these registers using S0, …, S31 and then the lower 16 bits of each register using H0, …, H31. Figure 12-1 shows this configuration of registers.

Figure 12-1. A single ARM FPU register; the format of the data depends on how you reference the register

Note

The register H1 is the lower 16 bits of register S1 which is the lower 32 bits of register D1 which is the lower 64 bits of the 128-bit register V1.

The floating-point unit can only process values up to 64 bits in size. We’ll see how the full 128 bits are used by the NEON coprocessor in Chapter 13, “Neon Coprocessor.” We still need to be aware of the full 128 bits, since we may need to save a register to the stack as part of the function calling protocol. The NEON coprocessor can place integers in these registers as well, and for 128-bit integers it labels the registers Q0, …, Q31. This matters in this chapter only because some instructions use the Q name to refer to the whole 128 bits; as we will see in the next section, we refer to the registers as Q registers to push and pop them to and from the stack.

Defining the Function Call Protocol

In Chapter 6, “Functions and the Stack,” we gave the protocol for who saves which registers when calling functions. With these floating-point registers, we must add them to our protocol.

Callee saved: The function is responsible for saving registers V8–V15. They need to be saved by a function, if the function uses them.
Caller saved: All other registers don’t need to be saved by a function, so they must be saved by the caller if they are required to be preserved. This includes V0–V7 which are used to pass parameters.

Many of the Assembly instructions that we have seen will take floating-point registers as well as W and X integer registers. For instance, we can use STP, STR, LDP, and LDR to load and save these registers to and from memory. In the context here, we can continue to use these to push and pop values to and from the stack. We need to keep in mind that the Q registers are 128 bits or 16 bytes in size. Thus, the following are examples of pushing and popping floating-point registers:

STP Q8, Q9, [SP, #-32]!   // push Q8 and Q9
STR Q10, [SP, #-16]!      // push Q10
LDR Q10, [SP], #16        // pop Q10 (last pushed, first popped)
LDP Q8, Q9, [SP], #32     // pop Q8 and Q9

Loading and Saving FPU Registers

In Chapter 5, “Thanks for the Memories,” we covered the LDR and STR instructions to load registers from memory, then store them back to memory. The FPU registers can all be used in these instructions, for example:

	LDR X1, =fp1
	LDR S4, [X1]
	LDR D5, [X1, #4]
	STR S4, [X1]
	STR D5, [X1, #4]
	...
.data
fp1: .single 3.14159
fp2: .double 4.3341
fp3: .single 0.0
fp4: .double 0.0

We can also move data between the CPU’s integer registers and the FPU with the FMOV instruction. This instruction also lets you move data between FPU registers. Generally, the registers should be the same size, but for half-precision H registers, you can copy them into larger integer registers, for example:

FMOV H1, W2
FMOV W2, H1
FMOV S1, W2
FMOV X1, D2
FMOV D2, D3

Note

The FMOV instruction copies the bits unmodified. It doesn’t perform any sort of conversion.
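The distinction between copying bits and converting a value is easy to see in C. In this sketch, memcpy plays the role of FMOV and a cast plays the role of a conversion instruction; the function names are hypothetical, and the sketch assumes float is a 32-bit IEEE 754 value.

```c
#include <stdint.h>
#include <string.h>

/* Like FMOV S1, W2: reinterpret 32 integer bits as a float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Like FMOV W2, S1: expose the raw bits of a float. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* A true conversion: the integer's value becomes the nearest float. */
float value_to_float(uint32_t value)
{
    return (float)value;
}
```

The integer 1 converted by value is 1.0, whose bit pattern is 0x3F800000. The integer 1 copied bit for bit is instead a tiny denormalized number, nowhere near 1.0.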

Performing Basic Arithmetic

The FPU includes the four basic arithmetic operations, along with a few extensions like multiply and accumulate. There are some specialty functions like square root and quite a few variations that affect the sign—negate versions of functions.

Each of these functions can operate on either H, S, or D registers. Here’s a selection of the instructions. We list the three forms of the FADD instruction with each floating-point type, then list the rest with just the D registers to save space:

FADD Hd, Hn, Hm // Hd = Hn + Hm
FADD Sd, Sn, Sm // Sd = Sn + Sm
FADD Dd, Dn, Dm // Dd = Dn + Dm
FSUB Dd, Dn, Dm // Dd = Dn - Dm
FMUL Dd, Dn, Dm // Dd = Dn * Dm
FDIV Dd, Dn, Dm // Dd = Dn / Dm
FMADD Dd, Dn, Dm, Da // Dd = Da + Dn * Dm
FMSUB Dd, Dn, Dm, Da // Dd = Da - Dn * Dm
FNEG Dd, Dn // Dd = - Dn
FABS Dd, Dn // Dd = Absolute Value( Dn )
FMAX Dd, Dn, Dm // Dd = Max( Dn, Dm )
FMIN Dd, Dn, Dm // Dd = Min( Dn, Dm )
FSQRT Dd, Dn // Dd = Square Root( Dn )

These functions are all fairly simple, so let’s move on to an example using floating-point functions.

Calculating Distance Between Points

If we have two points (x₁, y₁) and (x₂, y₂), then the distance between them is given by the formula

d = sqrt( (y₂-y₁)² + (x₂-x₁)² )

Let’s write a function to calculate this for any two points with single-precision floating-point coordinates. We’ll use the C runtime’s printf function to print out our results. First of all, copy the distance function from Listing 12-1 to the file distance.s.

// Example function to calculate the distance
// between two points in single-precision
// floating point.
// Inputs:
//	X0 - pointer to the 4 FP numbers
//	     they are x1, y1, x2, y2
// Outputs:
//	W0 - the length (as single-precision FP)
.global distance // Allow function to be called by others
distance:
	// save LR to be safe; we don't really need to,
	// since this function doesn't call anything.
	STR LR, [SP, #-16]!
	// load all 4 numbers with two LDP instructions
	LDP S0, S1, [X0], #8
	LDP S2, S3, [X0]
	// calc S4 = x2 - x1
	FSUB S4, S2, S0
	// calc S5 = y2 - y1
	FSUB S5, S3, S1
	// calc S4 = S4 * S4 (x2-x1)^2
	FMUL S4, S4, S4
	// calc S5 = S5 * S5 (y2-y1)^2
	FMUL S5, S5, S5
	// calc S4 = S4 + S5
	FADD S4, S4, S5
	// calc sqrt(S4)
	FSQRT S4, S4
	// move result to W0 to be returned
	FMOV W0, S4
	// restore what we preserved.
	LDR LR, [SP], #16
	RET

Listing 12-1

Function to calculate the distance between two points

Place the code from Listing 12-2 in main.s. It calls distance three times with three different pairs of points and prints out the distance for each one.

// Main program to test our distance function
// W19 - loop counter
// X20 - address to current set of points
.global main // Provide program starting address to linker
.equ N, 3 // Number of pairs of points.
main:
	STP X19, X20, [SP, #-16]!
	STR LR, [SP, #-16]!
	LDR X20, =points // pointer to current points
	MOV W19, #N // number of loop iterations
loop:	MOV X0, X20 // move pointer to parameter 1 (X0)
	BL distance // call distance function
	// need to take the single-precision return value
	// and convert it to a double, because the C printf
	// function can only print doubles.
	FMOV S2, W0 // move back to fpu for conversion
	FCVT D0, S2 // convert single to double
	FMOV X1, D0 // return double to X1
	LDR X0, =prtstr // load print string
	BL printf // print the distance
	ADD X20, X20, #(4*4) // 4 numbers, each 4 bytes
	SUBS W19, W19, #1 // decrement loop counter
	B.NE loop // loop if more points
	MOV X0, #0 // return code
	LDR LR, [SP], #16
	LDP X19, X20, [SP], #16
	RET
.data
points:	.single 0.0, 0.0, 3.0, 4.0
	.single 1.3, 5.4, 3.1, -1.5
	.single 1.323e10, -1.2e-4, 34.55, 5454.234
prtstr:	.asciz "Distance = %f\n"

Listing 12-2

Main program to call the distance function three times

The makefile is in Listing 12-3.

distance: distance.s main.s
	gcc -g -o distance distance.s main.s

Listing 12-3

Makefile for the distance program

If we build and run the program, we get

smist08@kali:~/asm64/Chapter 12$ make
gcc -g -o distance distance.s main.s
smist08@kali:~/asm64/Chapter 12$ ./distance
Distance = 5.000000
Distance = 7.130919
Distance = 13230000128.000000
smist08@kali:~/asm64/Chapter 12$

We constructed the data, so the first set of points comprise a 3-4-5 triangle, which is why we get the exact answer of 5 for the first distance.

The distance function is straightforward. It loads the four numbers in two LDP instructions, then calls the various floating-point arithmetic functions to perform the calculation. This function operates on single-precision 32-bit floating-point numbers using the S versions of the registers.

The part of the main routine that loops and calls the distance routine is straightforward. The part that calls printf has a couple of new complexities. The problem is that the C printf routine only has support to print doubles. In C this isn’t much of a problem, since the compiler automatically promotes a float argument to a double when it is passed to printf. In Assembly Language, we need to convert our single-precision result to a double-precision number ourselves, so we can print it.

To do the conversion, we FMOV the result back to the FPU. We use the FCVT instruction to convert from single to double precision. This function is the topic of the next section. We then FMOV the freshly constructed double back to register X1.

When we call printf, the first parameter, the printf format string, goes in X0. The double to print is the next parameter. Under the standard 64-bit ARM Linux calling convention, floating-point arguments are passed in the FPU registers, so printf reads the double from D0, which the FCVT instruction has already filled. The copy that the listing places in X1 does no harm.

Note

If you are debugging the program with gdb, and you want to see the contents of the FPU registers at any point, use the “info all-registers” command that will exhaustively list all the coprocessor registers.

Performing Floating-Point Conversions

In the last example, we had our first look at the conversion instruction FCVT. The FPU supports a variety of versions of this function; not only does it support conversions between single- and double-precision floating-point numbers, but it supports conversions to and from integers. It also supports conversion to fixed point decimal numbers (integers with an implied decimal). It supports several rounding methods as well. The most used versions of this function are

FCVT Dd, Sm
FCVT Sd, Dm
FCVT Sd, Hm
FCVT Hd, Sm

These convert single to double precision, double to single precision, half to single precision, and single to half precision.

To convert from an integer to a floating-point number, we have

SCVTF Dd, Xm // Dd = signed integer from Xm
UCVTF Sd, Wm // Sd = unsigned integer from Wm
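Converting an integer to floating point can itself round, because a single-precision number has only 24 significant bits. The C casts below perform the same kind of conversion as SCVTF; the function names are hypothetical, and the sketch assumes the default round-to-nearest mode.

```c
#include <stdint.h>

/* Like SCVTF Sd, Wm: signed 32-bit integer to single precision. */
float int_to_single(int32_t n)
{
    return (float)n;
}

/* Like SCVTF Dd, Wm: signed 32-bit integer to double precision. */
double int_to_double(int32_t n)
{
    return (double)n;
}
```

The integer 16777217 (2^24 + 1) does not fit in 24 bits, so int_to_single returns 16777216.0, while int_to_double represents it exactly.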

To convert from floating point to integer, we have several choices for how we want rounding handled:

FCVTAS Wd, Hn // signed, round to nearest, ties away from zero
FCVTAU Wd, Sn // unsigned, round to nearest, ties away from zero
FCVTMS Xd, Dn // signed, round towards minus infinity
FCVTMU Xd, Dn // unsigned, round towards minus infinity
FCVTPS Xd, Dn // signed, round towards positive infinity
FCVTPU Xd, Dn // unsigned, round towards positive infinity
FCVTZS Xd, Dn // signed, round towards zero
FCVTZU Xd, Dn // unsigned, round towards zero

Comparing Floating-Point Numbers

Most of the floating-point instructions don’t have “S” versions, so they don’t update the condition flags. The main instruction that updates these flags is the FCMP instruction. Here are its forms:

FCMP Hn, Hm
FCMP Hn, #0.0
FCMP Sn, Sm
FCMP Sn, #0.0
FCMP Dn, Dm
FCMP Dn, #0.0

It can compare two half-precision registers, two single-precision registers, or two double-precision registers. It allows one immediate value, namely, zero, so it can compare a half-, single-, or double-precision register to zero. This is needed since there is no floating-point zero register.

The FCMP instruction updates the condition flags based on subtracting the operands, like the CMP instruction we studied in Chapter 4, “Controlling Program Flow.”

Testing for equality of floating-point numbers is problematic because, due to rounding error, numbers are often close but not exactly equal. The solution is to decide on a tolerance, then consider numbers equal if they are within the tolerance of each other. For instance, we might define e = 0.000001 and then consider two registers equal if

abs(S1 - S2) < e

where abs() is a function to calculate the absolute value.
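In C, the same test is a one-line function. This is a minimal sketch with a hypothetical name, nearly_equal, using an absolute tolerance exactly as in the formula above.

```c
#include <math.h>

/* Returns 1 if a and b differ by less than epsilon, else 0. */
int nearly_equal(float a, float b, float epsilon)
{
    return fabsf(a - b) < epsilon;
}
```

An absolute tolerance like this suits values of roughly known size, such as sums of money near one dollar. For values that span many orders of magnitude, the tolerance usually needs to scale with the numbers being compared.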

Example

Create a routine to test if two floating-point numbers are equal using this technique. We’ll first add 100 cents, then test if they exactly equal $1.00 (spoiler alert, they won’t). Then we’ll compare the sum using our fpcomp routine that tests them within a supplied tolerance (usually referred to as epsilon).

Start with our floating-point comparison routine, placing the contents of Listing 12-4 into fpcomp.s.

// Function to compare two floating-point numbers
// the parameters are a pointer to the two numbers
// and an error epsilon.
// Inputs:
//	X0 - pointer to the 3 FP numbers
//	     they are x1, x2, e
// Outputs:
//	X0 - 1 if they are equal, else 0
.global fpcomp // Allow function to be called by others
fpcomp:	// load the 3 numbers
	LDP S0, S1, [X0], #8
	LDR S2, [X0]
	// calc S3 = abs(x2 - x1)
	FSUB S3, S1, S0
	FABS S3, S3
	// not equal if abs(x2 - x1) >= e
	FCMP S3, S2
	B.GE notequal
	MOV X0, #1
	B done
notequal: MOV X0, #0
done:	RET

Listing 12-4

Routine to compare two floating-point numbers within a tolerance

Now the main program maincomp.s contains Listing 12-5.

// Main program to test our fpcomp function
// W19 - loop counter
.global main // Provide program starting address
.equ N, 100 // Number of additions.
main:
	STP X19, X20, [SP, #-16]!
	STR LR, [SP, #-16]!
	// Add up one hundred cents and test if they equal $1.00
	MOV W19, #N // number of loop iterations
	// load cents, running sum and real sum to FPU
	LDR X0, =cent
	LDP S0, S1, [X0], #8
	LDR S2, [X0]
loop:
	// add cent to running sum
	FADD S1, S1, S0
	SUBS W19, W19, #1 // decrement loop counter
	B.NE loop // loop if more additions
	// store the running sum back to memory for fpcomp,
	// since printf is free to overwrite S0-S7
	LDR X0, =runsum
	STR S1, [X0]
	// compare running sum to real sum
	FCMP S1, S2
	// print if the numbers are equal or not
	B.EQ equal
	LDR X0, =notequalstr
	BL printf
	B next
equal:	LDR X0, =equalstr
	BL printf
next:	// load pointer to running sum, real sum and epsilon
	LDR X0, =runsum
	BL fpcomp // call comparison function
	// compare return code to 1 and print if the numbers
	// are equal or not (within epsilon).
	CMP X0, #1
	B.EQ equal2
	LDR X0, =notequalstr
	BL printf
	B done
equal2:	LDR X0, =equalstr
	BL printf
done:	MOV X0, #0 // return code
	LDR LR, [SP], #16
	LDP X19, X20, [SP], #16
	RET
.data
cent:	.single 0.01
runsum:	.single 0.0
sum:	.single 1.00
epsilon: .single 0.00001
equalstr: .asciz "equal\n"
notequalstr: .asciz "not equal\n"

Listing 12-5

Main program to add up 100 cents and compare to $1.00

The makefile, in Listing 12-6, is as we would expect.

fpcomp: fpcomp.s maincomp.s
	gcc -g -o fpcomp fpcomp.s maincomp.s

Listing 12-6

The makefile for the floating-point comparison example

If we build and run the program, we get

smist08@kali:~/asm64/Chapter 12$ make
gcc -g -o fpcomp fpcomp.s maincomp.s
smist08@kali:~/asm64/Chapter 12$ ./fpcomp
not equal
equal
smist08@kali:~/asm64/Chapter 12$

If we run the program under gdb, we can examine the sum of 100 cents. We see

s0 {f = 0x0, u = 0x3c23d70a, s = 0x3c23d70a} {f = 0.00999999978, u = 1008981770, s = 1008981770}
s1 {f = 0x0, u = 0x3f7ffff5, s = 0x3f7ffff5} {f = 0.999999344, u = 1065353205, s = 1065353205}
s2 {f = 0x1, u = 0x3f800000, s = 0x3f800000} {f = 1, u = 1065353216, s = 1065353216}

S0 contains a cent, $0.01, and we see from gdb that this hasn’t been represented exactly and this is where rounding error will come in. The sum of 100 cents ends up being in register S1 as 0.999999344, which doesn’t equal our expected sum of 1 contained in register S2.

Then we call our fpcomp routine that determines if the numbers are within the provided tolerance and hence considers them equal.

It didn’t take that many additions to start introducing rounding errors into our sums. Be careful when using floating point for this reason.

Summary

In this chapter, we learned the following:

What floating-point numbers are and how they are represented
Normalization, NaNs, and rounding error
How to create floating-point numbers in our .data section
The bank of floating-point registers and how half-, single-, and double-precision values are contained in them
How to load data into the floating-point registers, perform mathematical operations, and save the results back to memory
How to convert between different floating-point types, compare floating-point numbers, and copy the result back to the ARM CPU, and the effect rounding errors have on these comparisons

In Chapter 13, “Neon Coprocessor,” we’ll look at how to perform multiple floating-point operations in parallel.

Exercises

1. Create a program to load and add the following numbers:
   2.343 + 5.3
   3.5343425445 + 1.534443455
   3.14e12 + 5.55e-10
   How accurate are the results?
2. Integer division by 0 resulted in the incorrect answer of 0. Create a program to perform a floating-point division by 0 and see what the result is.
3. The ARM FPU has a square root function, but no trigonometric functions. Write a function to calculate the sine of an angle in radians using the approximate formula:
   sin x = x − x³/3! + x⁵/5! − x⁷/7!
   where ! stands for factorial and is calculated as 3! = 3 * 2 * 1. Write a main program to call this function with several test values.
