In this chapter, we’ll look at what the floating-point unit (FPU) does. Some ARM documentation refers to it as the vector floating-point (VFP) coprocessor, reflecting the fact that it can do some limited vector processing. That vector processing has since been superseded by the much more capable parallel processing provided by the NEON coprocessor, which we will study in Chapter 13, “Neon Coprocessor.” Regardless, the FPU provides several useful instructions for performing floating-point mathematics.
We’ll review what floating-point numbers are, how they’re represented in memory, and how to insert them into our Assembly Language programs. We’ll see how to transfer data between the FPU and the ARM’s regular registers and memory. We’ll also perform basic arithmetic operations, comparisons, and conversions.
About Floating-Point Numbers
Floating-point numbers are written as a fractional part scaled by a power of ten, for example:

1.456354 × 10¹⁶
There’s a fractional part and an exponent that lets you move the decimal place to the right if it’s positive and to the left if it’s negative. The ARM CPU deals with half-precision floating-point numbers that are 16 bits in size, single-precision floating-point numbers that are 32 bits in size, and double-precision floating-point numbers that are 64 bits in size.
Only newer ARM processors based on ARMv8.2 support half-precision 16-bit floating-point numbers; older processors, such as the one in the Raspberry Pi 4, do not. Half-precision numbers are typically used in AI applications where speed and memory size are more important than accuracy. If you plan to use them, make sure you check that your device supports them. You may need to add -march=armv8.2-a+fp16 to the as or gcc command lines to enable support for half-precision.
Table 12-1. Bits of a floating-point number
| Name | Precision | Sign Bits | Fraction Bits | Exponent Bits | Decimal Digits |
|---|---|---|---|---|---|
| Half | 16 bits | 1 | 10 | 5 | 3 |
| Single | 32 bits | 1 | 23 | 8 | 7 |
| Double | 64 bits | 1 | 52 | 11 | 16 |
The decimal digits column of Table 12-1 is the approximate number of decimal digits that the format can represent, or the decimal precision.
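As a concrete illustration of the single-precision layout in Table 12-1, the value 1.0 has a sign bit of 0, a biased exponent of 127 (01111111), and an all-zero fraction, giving the bit pattern 0x3F800000. These two definitions (the labels are ours, for illustration) therefore produce identical bytes in memory:

one_float: .single 1.0        // the assembler encodes the IEEE 754 bits
one_bits:  .word   0x3F800000 // the same value, written out by hand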
About Normalization and NaNs
In the integers we’ve seen so far, every combination of bits is a valid, unique number. No two different bit patterns produce the same number; however, this isn’t the case in floating point. First of all, we have the concept of Not a Number (NaN). NaNs are produced by invalid operations like dividing zero by zero or taking the square root of a negative number. They allow the error to quietly propagate through a calculation without crashing the program. In the IEEE 754 specification, a NaN is represented by an exponent of all one bits (for example, 11111 for half precision, depending on the size of the exponent) together with a nonzero fraction; an all-ones exponent with a zero fraction represents infinity instead.
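For example, taking the square root of a negative number produces the default NaN:

FMOV  D0, #-1.0    // -1.0 is one of the encodable FMOV immediates
FSQRT D0, D0       // invalid operation: D0 is now NaN (0x7FF8000000000000)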
Second, many different bit patterns can represent the same value. Consider:

1E0 = 0.1E1 = 0.01E2 = 0.001E3
All of these represent 1, but we call the first one, with no leading zeros, the normalized form. The ARM FPU tries to keep floating-point numbers in normalized form, but it will break this rule for very small numbers where the exponent is already as negative as it can go. To avoid underflow errors, the FPU then gives up on normalization and uses these denormalized (subnormal) numbers to represent values a bit smaller than it otherwise could.
Recognizing Rounding Errors
If we take a number like 1/3 = 0.33333… and represent it in floating point, we only keep seven or so significant digits in single precision. This introduces rounding errors. If these are a problem, moving to double precision usually solves it, but some calculations are prone to magnifying rounding errors, such as subtracting two numbers that are nearly equal.
Floating-point numbers are represented in base two, so which fractions produce repeating patterns of digits is different than in base 10. It comes as a surprise to many people that 0.1 is a repeating binary fraction, 0.00011001100110011…, meaning that adding dollars and cents in floating point will introduce rounding errors over enough calculations.
For financial calculations, most applications use fixed-point arithmetic, built on integer arithmetic, to avoid rounding errors in addition and subtraction.
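As a minimal sketch of the fixed-point idea, we can keep amounts in integer cents, where addition is always exact:

        MOV   X0, #0        // running total, in cents
        MOV   X1, #100      // number of cents to add
loop:   ADD   X0, X0, #1    // add one cent exactly, with no rounding
        SUBS  X1, X1, #1    // decrement the counter, setting the flags
        B.NE  loop          // when done, X0 is exactly 100, i.e., $1.00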
Defining Floating-Point Numbers
The GNU Assembler provides the .single (or equivalently .float) directive for single-precision numbers and the .double directive for double-precision numbers. These directives always take base 10 numbers.
The GNU Assembler doesn’t have a directive for 16-bit half-precision floating-point numbers, so we need to load one of these and then do a conversion.
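Here is a brief sketch of defining and loading floating-point data; the labels and register choices are ours for illustration, and the registers themselves are covered in the next section:

.data
pi:     .single 3.14159           // 32-bit single precision, base 10
euler:  .double 2.718281828459045 // 64-bit double precision

.text
        LDR   X0, =pi             // address of the single
        LDR   S0, [X0]            // load the single into S0
        FCVT  H1, S0              // convert it to half precision
        LDR   X0, =euler          // address of the double
        LDR   D2, [X0]            // load the double into D2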
About FPU Registers
The register H1 is the lower 16 bits of register S1, which is the lower 32 bits of register D1, which in turn is the lower 64 bits of the 128-bit register V1.
The floating-point unit can only process values up to 64 bits in size. We’ll see how the full 128 bits are used by the NEON processor in Chapter 13, “Neon Coprocessor.” We still need to be aware of the full 128 bits, since we may need to save whole registers to the stack as part of the function calling protocol. The NEON coprocessor can place integers in these registers as well, and for 128-bit values it labels the registers Q0, …, Q31. This matters in this chapter because some instructions use the Q names to refer to the whole 128 bits; as we will see in the next section, we refer to the registers as Q registers to push and pop them to and from the stack.
Defining the Function Call Protocol
Callee saved: The called function is responsible for saving registers V8–V15. A function must save these registers before using them and restore them before returning; a sketch follows this list.
Caller saved: All other registers don’t need to be saved by a function, so the caller must save them if it needs their values preserved across the call. This includes V0–V7, which are used to pass parameters.
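Here is a minimal sketch of the callee-saved rule in action; the label and the choice of Q8 and Q9 are just examples. We push the registers using their Q names so all 128 bits are preserved:

myfunc: STP   Q8, Q9, [SP, #-32]!   // push V8 and V9 as full Q registers
        // ... function body that uses V8 and V9 ...
        LDP   Q8, Q9, [SP], #32     // restore them before returning
        RET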
Loading and Saving FPU Registers
FMOV H1, W2 // copy the low 16 bits of W2 into H1
FMOV W2, H1 // copy H1 into W2, zeroing the upper bits
FMOV S1, W2 // copy the 32 bits of W2 into S1
FMOV X1, D2 // copy the 64 bits of D2 into X1
FMOV D2, D3 // copy one FPU register to another
The FMOV instruction copies the bits unmodified. It doesn’t perform any sort of conversion.
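A quick illustration of this point: moving the integer 1 into an S register does not produce the floating-point value 1.0, because the bits travel unchanged:

MOV   W1, #1      // W1 = integer 1, bit pattern 0x00000001
FMOV  S1, W1      // S1 holds a tiny denormal number, not 1.0
FMOV  W2, S1      // W2 = 0x00000001 again; the round trip is exact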
Performing Basic Arithmetic
The FPU includes the four basic arithmetic operations, along with a few extensions like multiply and accumulate. There are also some specialty instructions, like square root, and quite a few variations that affect the sign, such as negate versions of instructions.
FADD Hd, Hn, Hm // Hd = Hn + Hm
FADD Sd, Sn, Sm // Sd = Sn + Sm
FADD Dd, Dn, Dm // Dd = Dn + Dm
FSUB Dd, Dn, Dm // Dd = Dn - Dm
FMUL Dd, Dn, Dm // Dd = Dn * Dm
FDIV Dd, Dn, Dm // Dd = Dn / Dm
FMADD Dd, Dn, Dm, Da // Dd = Da + Dm * Dn
FMSUB Dd, Dn, Dm, Da // Dd = Da - Dm * Dn
FNEG Dd, Dn // Dd = - Dn
FABS Dd, Dn // Dd = Absolute Value( Dn )
FMAX Dd, Dn, Dm // Dd = Max( Dn, Dm )
FMIN Dd, Dn, Dm // Dd = Min( Dn, Dm )
FSQRT Dd, Dn // Dd = Square Root( Dn )
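One point worth noting: FMADD and FMSUB are fused operations, performing the multiply and the add or subtract with a single rounding step, which is slightly more accurate than a separate FMUL followed by FADD or FSUB:

FMADD D0, D1, D2, D3 // D0 = D3 + D1 * D2, rounded only once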
These functions are all fairly simple, so let’s move on to an example using floating-point functions.
Calculating Distance Between Points
d = sqrt( (y2-y1)² + (x2-x1)² )
Function to calculate the distance between two points
Main program to call the distance function three times
Makefile for the distance program
We constructed the data so that the first set of points forms a 3-4-5 triangle, which is why we get the exact answer of 5 for the first distance.
The distance function is straightforward. It loads the four numbers in two LDP instructions, then calls the various floating-point arithmetic functions to perform the calculation. This function operates on single-precision 32-bit floating-point numbers using the S versions of the registers.
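The listing itself isn’t reproduced here, but based on that description, a minimal sketch of such a function might look like the following; the distance label and the choice of passing the two points as pointers in X0 and X1 are our assumptions:

distance:
        LDP   S0, S1, [X0]       // S0 = x1, S1 = y1
        LDP   S2, S3, [X1]       // S2 = x2, S3 = y2
        FSUB  S4, S2, S0         // S4 = x2 - x1
        FSUB  S5, S3, S1         // S5 = y2 - y1
        FMUL  S4, S4, S4         // S4 = (x2 - x1)²
        FMADD S4, S5, S5, S4     // S4 = S4 + (y2 - y1)²
        FSQRT S0, S4             // S0 = the distance
        RET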
The part of the main routine that loops and calls the distance routine is straightforward. The part that calls printf has a couple of new complexities. The problem is that the C printf routine only supports printing doubles. In C this isn’t much of a problem, since you can just cast the argument to force a conversion. In Assembly Language, we need to convert our single-precision sum to a double-precision number before we can print it.
To do the conversion, we FMOV the sum back to the FPU and use the FCVT instruction to convert it from single to double precision. FCVT is the topic of the next section.
When we call printf, the first parameter, the printf format string, goes in X0. Under the AAPCS64 calling convention used by Linux, floating-point parameters to printf are passed in the FPU registers, so the double to print goes in D0, the first floating-point argument register.
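A minimal sketch of this call sequence, assuming the single-precision sum is in S0 and prtstr is a label (ours, for illustration) on a format string like "distance = %f\n":

        FCVT  D0, S0        // widen the single-precision sum to a double
        LDR   X0, =prtstr   // first parameter: address of the format string
        BL    printf        // printf reads the double from D0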
If you are debugging the program with gdb and you want to see the contents of the FPU registers at any point, use the "info all-registers" command, which exhaustively lists all the coprocessor registers.
Performing Floating-Point Conversions
FCVT Dd, Sm // convert single to double precision
FCVT Sd, Dm // convert double to single precision
FCVT Sd, Hm // convert half to single precision
FCVT Hd, Sm // convert single to half precision
These convert between the half-, single-, and double-precision formats; the register types of the operands determine which conversion is performed. A separate family of instructions converts between integers and floating-point values:
SCVTF Dd, Xm // Dd = Xm, a signed integer, converted to floating point
UCVTF Sd, Wm // Sd = Wm, an unsigned integer, converted to floating point
FCVTAS Wd, Hn // signed, round to nearest, ties away from zero
FCVTAU Wd, Sn // unsigned, round to nearest, ties away from zero
FCVTMS Xd, Dn // signed, round towards minus infinity
FCVTMU Xd, Dn // unsigned, round towards minus infinity
FCVTPS Xd, Dn // signed, round towards positive infinity
FCVTPU Xd, Dn // unsigned, round towards positive infinity
FCVTZS Xd, Dn // signed, round towards zero
FCVTZU Xd, Dn // unsigned, round towards zero
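For example, truncating a double to a signed integer and converting it back discards the fractional part:

FCVTZS X0, D0     // X0 = integer part of D0, rounded towards zero
SCVTF  D1, X0     // D1 = that integer as a double; the fraction is gone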
Comparing Floating-Point Numbers
FCMP Hd, Hm
FCMP Hd, #0.0
FCMP Sd, Sm
FCMP Sd, #0.0
FCMP Dd, Dm
FCMP Dd, #0.0
It can compare two half-precision registers, two single-precision registers, or two double-precision registers. It allows one immediate value, namely zero, so it can also compare a half-, single-, or double-precision register to zero. This is needed since there is no floating-point zero register.
The FCMP instruction updates the condition flags based on subtracting the operands, like the CMP instruction we studied in Chapter 4, “Controlling Program Flow.”
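This means we can follow FCMP with any of the usual conditional instructions; for example (smaller is a hypothetical branch target):

FCMP  S0, S1      // compare S0 with S1, setting the condition flags
B.MI  smaller     // branch if S0 < S1 (MI is not taken for NaNs)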
Because of rounding errors, testing two floating-point numbers for exact equality is rarely what we want. Instead, we test whether they are close enough:

abs(S1 - S2) < e

where abs() is a function to calculate the absolute value and e is a small tolerance.
Example
Let’s create a routine to test if two floating-point numbers are equal using this technique. We’ll first add up 100 cents, then test if the sum exactly equals $1.00 (spoiler alert: it won’t). Then we’ll compare the sum using our fpcomp routine, which tests whether the numbers are equal within a supplied tolerance (usually referred to as epsilon).
Routine to compare two floating-point numbers within a tolerance
Main program to add up 100 cents and compare to $1.00
The makefile for the floating-point comparison example
Register S0 contains one cent, $0.01, and we can see from gdb that this value isn’t represented exactly; this is where the rounding error comes in. The sum of 100 cents ends up in register S1 as 0.999999344, which doesn’t equal our expected sum of 1 contained in register S2.
Then we call our fpcomp routine that determines if the numbers are within the provided tolerance and hence considers them equal.
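The fpcomp listing isn’t reproduced here, but a minimal sketch of this technique might look like the following; the register conventions are our assumptions:

// fpcomp: W0 = 1 if S0 and S1 differ by less than the tolerance in S2
fpcomp: FSUB  S3, S0, S1    // S3 = the difference of the two numbers
        FABS  S3, S3        // S3 = abs(S0 - S1)
        FCMP  S3, S2        // compare the difference to epsilon
        CSET  W0, MI        // W0 = 1 if the difference is smaller
        RET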
It didn’t take that many additions to start introducing rounding errors into our sums. Be careful when using floating point for this reason.
Summary
In this chapter, we learned the following:
- What floating-point numbers are and how they are represented
- Normalization, NaNs, and rounding errors
- How to create floating-point numbers in our .data section
- The bank of floating-point registers and how half-, single-, and double-precision values are contained in them
- How to load data into the floating-point registers, perform mathematical operations, and save the results back to memory
- How to convert between different floating-point types, compare floating-point numbers, and copy results back to the ARM CPU, as well as the effect rounding errors have on these comparisons
In Chapter 13, “Neon Coprocessor,” we’ll look at how to perform multiple floating-point operations in parallel.
Exercises
- 1.
Create a program to load and add the following numbers:
2.343 + 5.3
3.5343425445 + 1.534443455
3.14e12 + 5.55e-10
How accurate are the results?
- 2.
Integer division by 0 resulted in the incorrect answer of 0. Create a program to perform a floating-point division by 0 and see what the result is.
- 3.
The ARM FPU has a square root function, but no trigonometric functions. Write a function to calculate the sine of an angle in radians using the approximate formula:
sin x = x − x³/3! + x⁵/5! − x⁷/7!
where ! stands for factorial and is calculated as 3! = 3 * 2 * 1. Write a main program to call this function with several test values.