© Stephen Smith 2020
S. Smith, Programming with 64-Bit ARM Assembly Language, https://doi.org/10.1007/978-1-4842-5881-1_11

11. Multiply, Divide, and Accumulate

Stephen Smith1 
(1)
Gibsons, BC, Canada
 

In this chapter, we return to using mathematics. We’ve already covered addition, subtraction, and a collection of bit operations on our 64-bit registers. Now, we will learn multiplication and division.

We will also program the multiply with accumulate instructions. But first, we will provide some background on why the ARM processor has so much circuitry dedicated to performing this operation. This will get us into the mechanics of vector and matrix multiplication.

Multiplication

The multiply instruction is
MUL Xd, Xn, Xm
This instruction computes Xd = Xn ∗ Xm. Looks good, but people familiar with multiplication might immediately ask: These are all 64-bit registers, so when you multiply two 64-bit numbers, don’t you get a 128-bit product? That is true, and that is the most obvious limitation on this instruction. Here are some notes on this instruction:
  • Xd is the lower 64 bits of the product. The upper 64 bits are discarded.

  • There is no “S” version of this instruction, so no condition flags can be set. Therefore, you can’t detect an overflow.

  • There aren’t separate signed and unsigned versions; multiplication isn’t like addition where the two’s complement, as discussed in Chapter 2, “Loading and Adding,” makes the operations the same.

  • All the operands are registers; immediate operands aren’t allowed, but remember you can use left shift to multiply by powers of two, such as two, four, and eight.

  • If you multiply two 32-bit W registers with MUL, the destination must also be a W register, so you lose the upper half of the product. To keep all 64 bits of the result, use the SMULL or UMULL instructions described next.
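To see the truncation concretely, here is a small Python model (our own illustration; the masking mimics a 64-bit register) of what MUL computes, along with the shift-as-multiply trick from the earlier bullet:

```python
MASK64 = (1 << 64) - 1  # keep only the lower 64 bits, like an X register

def mul(xn, xm):
    """Model of MUL Xd, Xn, Xm: the upper 64 bits of the product are discarded."""
    return (xn * xm) & MASK64

print(mul(25, 4))                         # 100
print(hex(mul(0xFFFFFFFFFFFFFFFF, 2)))    # 0xfffffffffffffffe - upper bits lost
print(mul(25, 4) == (25 << 2) & MASK64)   # True: multiply by 4 == shift left by 2
```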

To overcome some of these limitations, there are a few additional multiply instructions, as follows:
  • SMULH      Xd, Xn, Xm

  • SMULL      Xd, Wn, Wm

  • UMULH      Xd, Xn, Xm

  • UMULL      Xd, Wn, Wm

SMULL and UMULL allow us to multiply two 32-bit registers and get the full result in a 64-bit register.
  • SMULL is for signed integers.

  • UMULL for unsigned integers.

SMULH and UMULH complement MUL by giving us the upper 64 bits of the product of two 64-bit numbers:
  • Combining SMULH with MUL gives the complete 128-bit product for signed integers.

  • Combining UMULH with MUL gives the complete 128-bit product for unsigned integers.

See Exercise 1 in this chapter to confirm how MUL works with both cases.
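As a sketch of how the pieces fit together, here is a Python model (our own illustration, not ARM code) of UMULH and SMULH. The only difference in the signed case is reinterpreting each 64-bit pattern as a two's complement value before multiplying:

```python
MASK64 = (1 << 64) - 1

def umulh(xn, xm):
    """Model of UMULH: upper 64 bits of the unsigned 128-bit product."""
    return (xn * xm) >> 64

def smulh(xn, xm):
    """Model of SMULH: upper 64 bits of the signed 128-bit product."""
    def signed(x):  # reinterpret a 64-bit pattern as two's complement
        return x - (1 << 64) if x >> 63 else x
    return ((signed(xn) * signed(xm)) >> 64) & MASK64

print(umulh(1 << 63, 2))        # 1: the product 2^64 has a high half of 1
print(hex(smulh(MASK64, 1)))    # 0xffffffffffffffff: -1 * 1 sign-extends
```

Pairing umulh with the low half from mul reconstructs the full 128-bit product, which is exactly how the two instructions are meant to be used together.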

All these instructions have the same performance and work in a similar manner to how we learned to multiply in grade school, by multiplying each digit in a loop and adding the results (with shifts) together. The ability to detect when a multiplication is complete (remaining leftmost digits are 0) was added to the ARM processor some time ago, so you aren’t penalized for multiplying small numbers (the loop knows to stop early).
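The grade-school idea can be sketched in Python for nonnegative values (a conceptual model only; it says nothing about how the actual circuitry is built):

```python
def shift_add_mul(a, b):
    """Multiply by shifting and adding: for each set bit of b, add a
    correspondingly shifted copy of a; stop once no bits of b remain."""
    product = 0
    while b != 0:        # early exit when the remaining leftmost bits are 0
        if b & 1:
            product += a
        a <<= 1          # shift the partial product left one digit
        b >>= 1          # consume the next bit of the multiplier
    return product

print(shift_add_mul(25, 4))   # 100
```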

There is a set of similar instructions that calculate the negative of the multiplication; these are
  • MNEG      Xd, Xn, Xm

  • SMNEGL      Xd, Wn, Wm

  • UMNEGL      Xd, Wn, Wm

MNEG calculates –(Xn ∗ Xm) and places the result in Xd; SMNEGL and UMNEGL do the same for their 32-bit operands, producing a 64-bit result. With only a limited number of operands possible in the 32 bits of an instruction, these may seem like a strange addition to the instruction set, but we’ll see where they come from later in this chapter.

Examples

Listing 11-1 has some code to demonstrate all the various multiply instructions. We use our debug.s file from Chapter 9, “Interacting with C and Python,” meaning our program must be organized with the C runtime in mind.
//
// Example of 32 & 64-Bit Multiplication
//
.include "debug.s"
.global main // Provide program starting address
// Load the registers with some data
// Use small positive numbers that will work for all
// multiply instructions.
main:
      MOV    X2, #25
      MOV    X3, #4
      printStr "Inputs:"
      printReg 2
      printReg 3
      MUL    X4, X2, X3
      printStr "MUL X4=X2*X3:"
      printReg 4
      MNEG   X4, X2, X3
      printStr "MNEG X4=-X2*X3:"
      printReg 4
      SMULL  X4, W2, W3
      printStr "SMULL X4=W2*W3:"
      printReg 4
      SMNEGL X4, W2, W3
      printStr "SMNEGL X4=-W2*W3:"
      printReg 4
      UMULL  X4, W2, W3
      printStr "UMULL X4=W2*W3:"
      printReg 4
      UMNEGL X4, W2, W3
      printStr "UMNEGL X4=-W2*W3:"
      printReg 4
      LDR    X2, =A
      LDR    X2, [X2]
      LDR    X3, =B
      LDR    X3, [X3]
      printStr "Inputs:"
      printReg 2
      printReg 3
      MUL    X4, X2, X3
      printStr "MUL X4 = bottom 64 bits of X2*X3:"
      printReg 4
      SMULH  X4, X2, X3
      printStr "SMULH X4 = top 64 bits of X2*X3 (signed):"
      printReg 4
      UMULH  X4, X2, X3
      printStr "UMULH X4 = top 64 bits of X2*X3 (unsigned):"
      printReg 4
      MOV    X0, #0            // return code
      RET
.data
A:    .dword       0x7812345678
B:    .dword       0xFABCD12345678901
Listing 11-1

Examples of the various multiply instructions

The makefile is as expected. The output is
smist08@kali:~/asm64/Chapter 11$ make
gcc -o mulexamp mulexamp.s
smist08@kali:~/asm64/Chapter 11$ ./mulexamp
Inputs:
X2 =                               25, 0x0000000000000019
X3 =                                4, 0x0000000000000004
MUL X4=X2*X3:
X4 =                              100, 0x0000000000000064
MNEG X4=-X2*X3:
X4 =                             -100, 0xffffffffffffff9c
SMULL X4=W2*W3:
X4 =                              100, 0x0000000000000064
SMNEGL X4=-W2*W3:
X4 =                             -100, 0xffffffffffffff9c
UMULL X4=W2*W3:
X4 =                              100, 0x0000000000000064
UMNEGL X4=-W2*W3:
X4 =                             -100, 0xffffffffffffff9c
Inputs:
X2 =                     515701495416, 0x0000007812345678
X3 =              -379198319187490559, 0xfabcd12345678901
MUL X4 = bottom 64 bits of X2*X3:
X4 =              8455362044785495672, 0x75577afb36c28e78
SMULH X4 = top 64 bits of X2*X3 (signed):
X4 =                     -10600956976, 0xfffffffd88223bd0
UMULH X4 = top 64 bits of X2*X3 (unsigned):
X4 =                     505100538440, 0x000000759a569248
smist08@kali:~/asm64/Chapter 11$

To demonstrate SMULH and UMULH, we load some large numbers whose product overflows a 64-bit result, so we see nonzero values in the upper 64 bits. Notice the difference between the signed and unsigned computations.

Multiply is straightforward, so let’s move on to division.

Division

Integer division is standard in all 64-bit ARM processors. This gives us some standardization, unlike the 32-bit ARM world where some processors contain an integer division instruction and some don’t.

The division instructions are
  • SDIV      Xd, Xn, Xm

  • UDIV      Xd, Xn, Xm

where
  • Xd is the destination register

  • Xn is the register holding the numerator

  • Xm is the register holding the denominator

The registers can be all X or all W registers.

There are a few problems or technical notes on these instructions:
  • There are no “S” versions of these instructions; they don’t set the condition flags.

  • Dividing by 0 should throw an exception; instead, these instructions return 0, which can be very misleading.

  • These instructions aren’t the inverses of MUL combined with SMULH or UMULH. For that, Xn would need to be a register pair so the value being divided could be 128 bits. To divide a 128-bit value, we need to either go to the floating-point processor or roll our own code.

  • The instructions only return the quotient, not the remainder. Many algorithms require the remainder, and you must calculate it as remainder = numerator - (quotient ∗ denominator).
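These rules are easy to model in Python (our own sketch; note that SDIV truncates toward zero, while Python’s // operator rounds toward negative infinity, so the model adjusts for that):

```python
def sdiv(n, d):
    """Model of SDIV: truncate toward zero, and return 0 on division
    by zero, matching the instruction instead of raising an exception."""
    if d == 0:
        return 0
    q = abs(n) // abs(d)
    return -q if (n < 0) != (d < 0) else q

def remainder(n, d):
    """remainder = numerator - (quotient * denominator)."""
    return n - sdiv(n, d) * d

print(sdiv(100, 4), remainder(100, 7))   # 25 2
print(sdiv(5, 0))                        # 0 - the misleading no-error case
```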

Example

The code to execute the divide instructions is simple. Listing 11-2 is an example like the one we did for multiplication.
//
// Examples of 64-Bit Integer Division
//
.include "debug.s"
.global main // Provide program starting address
// Load the registers with some data
// Perform various division instructions
main:
      MOV    X2, #100
      MOV    X3, #4
      printStr "Inputs:"
      printReg 2
      printReg 3
      SDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      UDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      // Division by zero
      printStr "Division by zero:"
      MOV    X3, #0
      SDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      MOV    X0, #0        // return code
      RET
Listing 11-2

Examples of the SDIV and UDIV instructions

The makefile is as expected; if we build and run this program, we get
smist08@kali:~/asm64/Chapter 11$ make
gcc -o divexamp divexamp.s
smist08@kali:~/asm64/Chapter 11$ ./divexamp
Inputs:
X2 =                              100, 0x0000000000000064
X3 =                                4, 0x0000000000000004
Outputs:
X4 =                               25, 0x0000000000000019
Outputs:
X4 =                               25, 0x0000000000000019
Division by zero:
Outputs:
X4 =                                0, 0x0000000000000000
smist08@kali:~/asm64/Chapter 11$
Note

Dividing by 0 should trigger an error, but it doesn’t; the instruction silently returns 0. Thus, we need to check for division by 0 in our code.

Next, we look at combining multiplication and addition, so we can optimize loops operating on vectors.

Multiply and Accumulate

The multiply and accumulate operation multiplies two numbers, then adds the product to a third. As we go through the next few chapters, we will see this operation reappear again and again. The ARM processor is RISC; if the instruction set is reduced, then why do we find so many instructions, and as a result so much circuitry, dedicated to performing multiply and accumulate?

The answer goes back to our favorite first year university math course on linear algebra. Most science students are forced to take this course, learn to work with vectors and matrices, and then hope they never see these concepts again. Unfortunately, they form the foundation for both graphics and machine learning. Before delving into the ARM instructions for multiply and accumulate, let’s review a bit of linear algebra.

Vectors and Matrices

A vector is an ordered list of numbers. For instance, in 3D graphics it might represent your location in 3D space where [x, y, z] are your coordinates. Vectors have a dimension which is the number of elements they contain. It turns out a useful computation with vectors is something called a dot product. If A = [a1, a2, …, an] is one vector and B = [b1, b2, …, bn] is another vector, then their dot product is defined as
A ∙ B = a1*b1 + a2*b2 + ... + an*bn

If we want to calculate this dot product, then a loop performing multiply and accumulate instructions will be quite efficient.
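In Python, such a loop looks like this (an illustration of the computation, with each iteration corresponding to one multiply and accumulate instruction):

```python
def dot(a, b):
    """Dot product: accumulate the products of corresponding elements."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y   # acc = acc + x*y, exactly the multiply-accumulate step
    return acc

print(dot([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```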

A matrix is a two-dimensional table of numbers such as
[Figure: an example matrix, written as a rectangular table of numbers]
Matrix multiplication is a complicated process that drives first year linear algebra students nuts. When you multiply matrix A times matrix B, each element of the resulting matrix is the dot product of a row of matrix A with a column of matrix B.
[Figure: matrix multiplication, showing each element of the product as the dot product of a row of A with a column of B]
If these were 3x3 matrices, then there would be nine dot products, each with three terms. We can also multiply a matrix by a vector the same way.
[Figure: multiplying a matrix by a vector]

In 3D graphics, if we represent a point as a 4D vector [x, y, z, 1], then the affine transformations of scale, rotate, shear, and reflection can be represented as 4x4 matrices. Any number of these transformations can be combined into a single matrix. Thus, to transform an object into a scene requires a matrix multiplication applied to each of the object’s vertex points. The faster we can do this, the faster we can render a frame in a video game.

In neural networks, the calculation for each layer of neurons is calculated by a matrix multiplication followed by the application of a nonlinear function. The bulk of the work is the matrix multiplication. Most neural networks have many layers of neurons, each requiring a matrix multiplication. The matrix size corresponds to the number of variables and the number of neurons; consequently, the matrices’ dimensions are often in the thousands. How quickly we perform object recognition or speech translation depends on how fast we can multiply matrices, which depends on how fast we can do multiply with accumulate.

These important applications are why the ARM processor dedicates so much silicon to multiply and accumulate. We’ll keep returning to how to speed up this process as we explore the ARM CPU’s floating-point unit (FPU) and Neon coprocessors in the following chapters.

Accumulate Instructions

Here are the multiply with accumulate instructions:
  • MADD      Xd, Xn, Xm, Xa

  • MSUB      Xd, Xn, Xm, Xa

  • SMADDL      Xd, Wn, Wm, Xa

  • UMADDL      Xd, Wn, Wm, Xa

  • SMSUBL      Xd, Wn, Wm, Xa

  • UMSUBL      Xd, Wn, Wm, Xa

The multiply with accumulate instructions map closely to the multiply instructions we’ve already discussed. In fact, most of the multiply instructions are aliases of these instructions with the zero register as Xa.

We either add or subtract the product from the running accumulator. The calculation is
Xd = Xa + Xn * Xm
or
Xd = Xa – Xn * Xm
Note

Xd can be the same as Xa, for calculating a running sum.

In the subtract case, we see that if Xa is the zero register, we get all the multiply negative operations from the last section.

For the versions that multiply two 32-bit registers to get a 64-bit result, the sum needs to be a 64-bit X register.
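A quick Python model (our own sketch, with the same 64-bit masking as before) shows the calculation and how the plain multiply instructions fall out when Xa is the zero register:

```python
MASK64 = (1 << 64) - 1

def madd(xn, xm, xa):
    """Model of MADD: Xd = Xa + Xn*Xm, truncated to 64 bits."""
    return (xa + xn * xm) & MASK64

def msub(xn, xm, xa):
    """Model of MSUB: Xd = Xa - Xn*Xm, truncated to 64 bits."""
    return (xa - xn * xm) & MASK64

XZR = 0                         # the zero register
print(madd(25, 4, XZR))         # 100: MUL is MADD with Xa = XZR
print(hex(msub(25, 4, XZR)))    # 0xffffffffffffff9c: MNEG is MSUB with Xa = XZR
print(madd(2, 3, 10))           # 16: a running sum picks up the product
```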

Example 1

We’ve talked about how multiply and accumulate is ideal for multiplying matrices, so for an example, let’s multiply two 3x3 matrices.

The algorithm we are implementing is shown in Listing 11-3.
FOR row = 1 to 3
      FOR col = 1 to 3
            acum = 0
            FOR i = 1 to 3
                  acum = acum + A[row, i]*B[i, col]
            NEXT i
            C[row, col] = acum
      NEXT col
NEXT row
Listing 11-3

Pseudo-code for our matrix multiplication program

The row and column loops go through each cell of the output matrix and calculate the correct dot product for that cell in the innermost loop.
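The pseudo-code translates directly into Python, which is a handy way to check the expected output before writing the Assembly version:

```python
def matmul(A, B, n=3):
    """Triple loop matching the pseudo-code: one dot product per output cell."""
    C = [[0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            acum = 0
            for i in range(n):
                acum += A[row][i] * B[i][col]   # multiply and accumulate
            C[row][col] = acum
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
for row in matmul(A, B):
    print(row)   # [30, 24, 18], then [84, 69, 54], then [138, 114, 90]
```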

Listing 11-4 shows the implementation in Assembly.
//
// Multiply 2 3x3 integer matrices
//
// Registers:
//    W1 - Row index
//    W2 - Column index
//    X4 - Address of row
//    X5 - Address of column
//    X7 - 64 bit accumulated sum
//    W9 - Cell of A
//    W10 - Cell of B
//    X19 - Position in C
//    X20 - Loop counter for printing
//    X12 - row in dotloop
//    X6 - col in dotloop
.global main // Provide program starting address
      .equ    N, 3   // Matrix dimensions
      .equ    WDSIZE, 4 // Size of element
main:
      STR     LR, [SP, #-16]!          // Save required regs
      STP     X19, X20, [SP, #-16]!    // Save required regs
      MOV     W1, #N       // Row index
      LDR     X4, =A       // Address of current row
      LDR     X19, =C      // Address of results matrix
rowloop:
      LDR     X5, =B       // first column in B
      MOV     W2, #N       // Column index (will count down to 0)
colloop:
      // Zero accumulator registers
      MOV     X7, #0
      MOV     W0, #N       // dot product loop counter
      MOV     X12, X4      // row for dot product
      MOV     X6, X5       // column for dot product
dotloop:
      // Do dot product of a row of A with column of B
      LDR     W9, [X12], #WDSIZE    // load A[row, i] and incr
      LDR     W10, [X6], #(N*WDSIZE)       // load B[i, col]
      SMADDL  X7, W9, W10, X7       // Do multiply and accumulate
      SUBS    W0, W0, #1            // Dec loop counter
      B.NE    dotloop               // If not zero loop
      STR     W7, [X19], #4         // C[row, col] = dotprod
      ADD     X5, X5, #WDSIZE       // Inc current col
      SUBS    W2, W2, #1            // Dec col loop counter
      B.NE    colloop               // If not zero loop
      ADD     X4, X4, #(N*WDSIZE)   // Increment to next row
      SUBS    W1, W1, #1            // Dec row loop counter
      B.NE    rowloop               // If not zero loop
// Print out matrix C
// Loop through 3 rows printing 3 cols each time.
      MOV     W20, #3               // Print 3 rows
      LDR     X19, =C               // Addr of results matrix
printloop:
      LDR     X0, =prtstr           // printf format string
      LDR     W1, [X19], #WDSIZE    // first element in current row
      LDR     W2, [X19], #WDSIZE    // second element in current row
      LDR     W3, [X19], #WDSIZE    // third element in current row
      BL      printf                // Call printf
      SUBS    W20, W20, #1          // Dec loop counter
      B.NE    printloop             // If not zero loop
      MOV     X0, #0                // return code
      LDP     X19, X20, [SP], #16   // Restore Regs
      LDR     LR, [SP], #16         // Restore LR
      RET
.data
// First matrix
A:    .word   1, 2, 3
      .word   4, 5, 6
      .word   7, 8, 9
// Second matrix
B:    .word   9, 8, 7
      .word   6, 5, 4
      .word   3, 2, 1
// Result matrix
C:    .fill   9, 4, 0
prtstr: .asciz  "%3d  %3d  %3d "
Listing 11-4

3x3 matrix multiplication in Assembly

After compiling and running this program, we get
smist08@kali:~/asm64/Chapter 11$ make
gcc -g -o matrixmult matrixmult.s
smist08@kali:~/asm64/Chapter 11$ ./matrixmult
 30   24   18
 84   69   54
138  114   90
smist08@kali:~/asm64/Chapter 11$

Accessing Matrix Elements

We store the three matrices in memory, in row order. They are arranged in the .word directives so that you can see the matrix structure. In the pseudo-code, we refer to the matrix elements using two-dimensional arrays. There are no instructions or operand formats to specify two-dimensional array access, so we must do it ourselves. To Assembly, each array is just a sequence of nine words in memory. Now that we know how to multiply, we could do something like

A[i, j] = A[i∗N + j]

where N is the dimension of the array. We don’t do this, though; in Assembly it pays to notice that we access the array elements in order, so we can go from one element in a row to the next by adding the size of an element (a word, or 4 bytes), and from an element in a column to the next one by adding the size of a row. This is why the constant N ∗ WDSIZE appears so often in the code. We step through the array incrementally and never have to multiply array indexes. Generally, multiplication and division are expensive operations, and we should avoid them where we can.
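A small Python sketch makes the two access strategies concrete (the name flat_index is our own, chosen to match the N and WDSIZE constants in the listing):

```python
N, WDSIZE = 3, 4   # matrix dimension and element size in bytes

def flat_index(row, col):
    """General two-dimensional access over row-major storage: A[row*N + col]."""
    return row * N + col

# The Assembly code instead walks the flat storage with byte offsets:
# add WDSIZE to reach the next element in a row,
# add N*WDSIZE to reach the next element down a column.
A = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # matrix A stored in row order
print(A[flat_index(1, 2)])          # row 1, col 2 -> 6
print(flat_index(1, 0) * WDSIZE)    # 12: byte offset of row 1 (one row = N*WDSIZE)
```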

We can use post-indexing techniques to access elements and increment pointers to the next element. We use post-indexing to store the result of each computation in the array C. We see this in the following:
STR  W7, [X19], #4   // C[row, col] = dotprod

which stores our computed dot product into C and then increments the pointer into C by 4 bytes. We see it again when we print the C matrix at the end.

Multiply with Accumulate

The core of the algorithm relies on the SMADDL instruction to multiply an element of A by an element of B and add that to the running sum for the dot product:
SMADDL X7, W9, W10, X7

This instruction accumulates a 64-bit sum, though we only take the lower 32 bits when we store it into the result matrix C. We don’t check for overflow, but as long as the numbers in A and B are small, we won’t have a problem.

Register Usage

We use quite a few registers, so we’re lucky we can keep track of all our loop indexes and pointers in registers, without having to move them in and out of memory. If we had to do this, we would have allocated space on the stack to hold any needed variables.

Notice that we use registers X19 and X20 in the loop that does the printing. That is because the printf function can change any of registers X0 to X18 on us. Elsewhere we mostly use registers X0 to X18, since we don’t need to preserve them for our caller. However, we do need to preserve X19 and X20, so we push and pop them to and from the stack along with LR.

Summary

We introduced the various forms of the multiply and division instructions supported in the ARM 64-bit instruction set.

We then explained the concept of multiply and accumulate and why these instructions are so important to modern applications in graphics and machine learning. We reviewed the many variations of these instructions and then presented an example matrix multiplication program to show them in action.

In Chapter 12, “Floating-Point Operations,” we will look at more math, but this time in scientific notation allowing fractions and exponents, going beyond integers for the first time.

Exercises

  1. To multiply two 64-bit numbers resulting in a 128-bit product, we used the MUL instruction to obtain the lower 64 bits of the product for both the signed and unsigned integer cases. To prove that this works, let’s work a small example multiplying two 4-bit numbers to get an 8-bit product. Multiply 0xf by 2. In the signed case, 0xf is -1 and the product is -2; in the unsigned case, 0xf is 15 and the product is 30. Manually perform the calculation to ensure the correct result is obtained in both cases.

  2. Write a signed 64-bit integer division routine that checks whether the denominator is zero before performing the division. Print an error if zero is encountered.

  3. Write a routine to compute a dot product of dimension six. Put the numbers to calculate in the .data section and print the result.

  4. Change your program in Exercise 3 to use multiply and subtract from the accumulator, instead of adding.

  5. Change the matrices calculated in the example and check that the result is correct.