In this chapter, we return to mathematics. We’ve covered addition, subtraction, and a collection of bit operations on our 32-bit registers. Now we will cover multiplication and division. The ARM processor has a surplus of multiply instructions, then a dearth of division operations.
We will cover multiply with accumulate instructions. We will provide some background on why the ARM processor has so much circuitry dedicated to performing this operation. This will get us into the mechanics of vector and matrix multiplication.
Multiplication
In Chapter 7, “Linux Operating System Services,” we discussed why there are so many Linux service calls and how part of the reason was for compatibility when they needed new functionality; they added a new call, so the old call is preserved. The ARM multiply instructions have a similar history. Multiply has been in the ARM architecture for a long time, but the original instructions were inadequate and new instructions were added while keeping the old instructions for software compatibility.
Rd is the lower 32 bits of the product. The upper 32 bits are discarded.
The MULS version of the instruction only sets the N and Z flags; it does not set the C or V flags, so you don’t know if it overflowed.
There aren’t separate signed and unsigned versions; multiplication isn’t like addition where the two's complement makes the operations the same.
All the operands are registers; immediate operands are not allowed.
Rd cannot be the same as Rn.
SMULL{S} RdLo, RdHi, Rn, Rm
UMULL{S} RdLo, RdHi, Rn, Rm
SMMUL{R} {Rd}, Rn, Rm
SMUL<x><y> {Rd}, Rn, Rm
SMULW<y> {Rd}, Rn, Rm
The first SMULL instruction will perform signed 32-bit multiplication putting the 64-bit result in two registers. The second UMULL instruction is the unsigned version of this. SMMUL complements the original MUL instruction by providing the upper 32 bits of the product and discarding the lower 32 bits.
<x> is either B or T. B means use the bottom half (bits [15:0]) of Rn; T means use the top half (bits [31:16]) of Rn.
<y> is either B or T. B means use the bottom half (bits [15:0]) of Rm; T means use the top half (bits [31:16]) of Rm.
SMULW is an intermediate version that multiplies a 32-bit value by a 16-bit value, then only keeps the upper 32 bits of the 48-bit product. The <y> modifier is the same as for SMUL. When I’ve seen this instruction used, one of the operands has usually been shifted so that the product ends up in the upper 32 bits.
All these instructions have the same performance. The ability to detect when a multiply is done (remaining digits are 0) was added to the ARM processor some time ago, so the need for shorter versions of multiply, in my opinion, doesn’t exist anymore. I would recommend always using SMULL and UMULL as then there are less things to go wrong if your numbers change over time.
Examples
Examples of the various multiply instructions
Multiply is straightforward, especially using SMULL and UMULL.
Division
Integer division is a much more recent addition to the ARM processor. In fact, the Raspberry Pi 1 and Zero have no integer division instruction. The second generation of the Raspberry Pi 2 uses ARM Cortex-A53 processors, which introduce integer division to the Pi world. The Raspberry Pi 4 includes newer Cortex-A72 processors.
If you are targeting Raspberry Pi Zero or 1, then you will need to either implement your own division algorithm in code, call some C code, or use the floating-point coprocessor. We’ll cover the floating-point coprocessor in Chapter 11, “Floating-Point Operations.”
SDIV {Rd}, Rn, Rm
UDIV {Rd}, Rn, Rm
Rd is the destination register.
Rn is the register holding the numerator.
Rm is a register holding the denominator.
There is no “S” option of this instruction, as it doesn’t set CPSR at all.
Dividing by 0 should throw an exception; with these instructions, it returns 0 which can be very misleading.
These instructions aren’t the inverses of SMULL and UMULL. For this Rn needs to be a register pair, so the value to be divided can be 64 bits. To divide a 64-bit value, we need to either go to the floating-point processor or roll our own code.
The instruction only returns the quotient, not the remainder. Many algorithms require the remainder, and you must calculate it as remainder = numerator - (quotient * denominator).
Example
Examples of the SDIV and UDIV instructions
then the program will compile and run. The -march parameter is for machine architecture, and “arm8-a” is the correct one for the Raspberry Pi 4. We could have used one to match a Raspberry Pi 3, but we’ll want to explore some new features in the Pi 4 later.
Multiply and Accumulate
The multiply and accumulate operation multiplies two numbers, then adds them to a third. As we go through the next few chapters, we will see this operation reappear again and again. The ARM processor is RISC, if the instruction set is reduced, then why do we find so many instructions, and hence so much circuitry, dedicated to performing multiply and accumulate. The answer goes back to our favorite first year university math course on linear algebra. Most science students are forced to take this course, learn to work with vectors and matrices, then hope they never see these concepts again. Unfortunately, they form the foundation for both graphics and machine learning. Before delving into the ARM instructions for multiply and accumulate, let’s review a bit of linear algebra.
Vectors and Matrices
A ⋅ B = a1*b1 + a2* b1 + … + an * bn
If we want to calculate this dot product, then a loop performing multiply and accumulate instructions should be quite efficient. A matrix is a 2D table of numbers such as
Matrix multiplication is a complicated process that drives first-year linear algebra students nuts. When you multiply matrix A times matrix B, then each element on the resulting matrix is the dot product of a row of matrix A with a column of matrix B.
If these were 3x3 matrices, then there would be nine dot products each with nine terms. We can also multiply a matrix by a vector the same way.
In 3D graphics, if we represent a point as a 4D vector [x, y, z, 1], then the affine transformations of scale, rotate, shear, and reflection can be represented as 4x4 matrices. Any number of these transformations can be combined into a single matrix. Thus, to transform an object into a scene requires a matrix multiplication applied to each of the object’s vertex points. The faster we can do this, the faster we can render a frame in a video game.
In neural networks, the calculation for each layer of neurons is calculated by a matrix multiplication, followed by the application of a nonlinear function. The bulk of the work is the matrix multiplication. Most neural networks have many layers of neurons, each requiring a matrix multiplication. The matrix size corresponds to the number of variables and the number of neurons; hence, the matrix dimensions are often in the thousands. How quickly we perform object recognition or speech translation is dependent on how fast we can multiply matrices, that is dependent on how fast we can do multiply with accumulate.
These important applications are why the ARM processor dedicates so much silicon to multiply and accumulate. We’ll keep returning to how to speed up this process as we explore the Raspberry Pi’s FPU and NEON coprocessors in the following chapters.
Accumulate Instructions
MLA{S} Rd, Rn, Rm, Ra
SMLAL{S} RdLo, RdHi, Rn, Rm
SMLA<x><y> Rd, Rn, Rm, Ra
SMLAD{X} Rd, Rn, Rm, Ra
SMLALD{X} RdLo, RdHi, Rn, Rm
SMLAL<x><y> RdLo, RdHi, Rn, Rm
SMLAW<y> Rd, Rn, Rm, Ra
SMLSD{X} Rd, Rn, Rm, Ra
SMLSD{X} RdLo, RdHi, Rn, Rm
SMMLA{R} Rd, Rn, Rm, Ra
SMMLS{R} Rd, Rn, Rm, Ra
SMUAD{X} {Rd}, Rn, Rm
UMAAL RdLo, RdHi, Rn, Rm
UMLAL{S} RdLo, RdHi, Rn, Rm
That is a lot of instructions, so we won’t cover each in detail, but we can recognize that there is a multiply with accumulate for each regular multiply instruction. Let’s look at what leads to a further proliferation of instructions.
Note
Rd can be the same as Ra for calculating a running sum.
This second form tends to be for instructions with 64-bit results, so the sum needs to be 64 bits, therefore, can’t be a single register.
Dual Multiply with Accumulate
The instructions that end in D are dual. They do two multiply and accumulates in a single step. They multiply the top 16 bits of Rn and Rm and multiply the bottom 16 bits of Rn and Rm, then add both products to the accumulator.
If the accuracy works for you and you can encode all the data this way, then you can double your throughput using these instructions. We’ll look at this in Example 2.
Example 1
We’ve talked about how multiply and accumulate is ideal for multiplying matrices, so for an example, let's multiply two 3x3 matrices.
Pseudo-code for our matrix multiplication program
Basically, the row and column loops go through each cell of the output matrix and calculate the correct dot product for that cell in the innermost loop.
3x3 matrix multiplication in Assembly
Accessing Matrix Elements
We store the three matrices in memory, in row order. They are arranged in the .word directives so that you can see the matrix structure. In the pseudo-code, we refer to the matrix elements using 2D arrays. There are no instructions or operand formats to specify 2D array access, so we must do it ourselves. To Assembly, each array is just a nine-word sequence of memory. Now that we know how to multiply, we can do something like
A[i, j] = A[i*N + j]
where N is the dimension of the array. We don’t do this though; in Assembly it pays to notice that we access the array elements in order and can go from one element in a row to the next by adding the size of an element—the size of a word, or four. We can go from an element in a column to the next one by adding the size of a row. Therefore, we use the constant N * WDSIZE so often in the code. This way, we go through the array incrementally and never have to multiply array indexes. Generally, multiplication and division are expensive operations, and we should try to avoid them as much as possible.
which stores our computed dot product into C, then increments the pointer into C by 4 bytes. We see it again when we print the C matrix at the end.
Multiply with Accumulate
This instruction accumulates a 64-bit sum, but we only take the lower 32 bits in R7. We don’t check for overflow; if at the end R8 isn’t 0, we are going to give an incorrect result.
Register Usage
We nearly use all the registers; we are lucky we can keep track of all our loop indexes and pointers in registers and don’t have to move them in and out of memory. If we needed to do this, we would have allocated space on the stack to hold any needed variables.
Example 2
When we discussed the multiply with accumulate instructions, we mentioned the dual instructions that will do two steps in one instruction. The main problem is packing two numbers that need processing in each 32-bit register. We can create 16-bit integers easily enough using the .short Assembler directive. Processing the rows is easy since the cells are next to each other, but for the columns, each element is a row away. How can we easily load two column elements into one 32-bit register?
What we can do is take the transpose of the second matrix. This means making the rows columns and the columns rows, basically switching B[i, j] with B[j, i]. If we do that, then the column elements are next to each other and easy to load into a single 32-bit register.
3x3 matrix multiplication using a dual multiply/accumulate
If our matrix had an even dimension, we would have saved more. For our 3x3 example, the dot product loop still has two elements. But then if we were doing two 4x4 matrices, it would also be two times through this loop. Notice that we had to add a 0 to the end of each row of both matrices, since the dual instruction is going to process an even number of entries.
which multiplies the high part of R9 by the high part of R10 and at the same time the low part of R9 by the low part of R10, then adds both to R7 and puts the new sum into R7. Notice it’s okay to have Rd=Ra, which is what you mostly want.
We still use LDR to load the registers from the matrices. This will load 32 bits; since we specified each element to take 16 bits, it will load two at a time enhancing our performance.
Summary
We covered the various forms of the multiply instruction supported in the ARM 32-bit instruction set. We covered the division instructions included in newer versions of the ARM processors, like those in the Raspberry Pi 3 and 4. For older processors we can use the FPU, write our own routine, or call some C code.
We then covered the concept of multiply and accumulate and why these instructions are so important to modern applications in graphics and machine learning. We reviewed the many variations of these instructions and then presented two versions of matrix multiplication to show them in action.
In Chapter 11, “Floating-Point Operations,” we will look at more math, but this time in scientific notation allowing fractions and exponents, going beyond integers for the first time.