© Stephen Smith 2020
S. Smith, Programming with 64-Bit ARM Assembly Language, https://doi.org/10.1007/978-1-4842-5881-1_11

11. Multiply, Divide, and Accumulate

Stephen Smith1 
(1)
Gibsons, BC, Canada
 

In this chapter, we return to using mathematics. We’ve already covered addition, subtraction, and a collection of bit operations on our 64-bit registers. Now, we will learn multiplication and division.

We will also program the multiply with accumulate instructions. But first, we will provide some background on why the ARM processor has so much circuitry dedicated to performing this operation. This will get us into the mechanics of vector and matrix multiplication.

Multiplication

The multiply instruction is
MUL Xd, Xn, Xm
This instruction computes Xd = Xn ∗ Xm. Looks good, but people familiar with multiplication might immediately ask: These are all 64-bit registers, so when you multiply two 64-bit numbers, don’t you get a 128-bit product? That is true, and that is the most obvious limitation on this instruction. Here are some notes on this instruction:
  • Xd is the lower 64 bits of the product. The upper 64 bits are discarded.

  • There is no “S” version of this instruction, so no condition flags can be set. Therefore, you can’t detect an overflow.

  • There aren’t separate signed and unsigned versions; multiplication isn’t like addition where the two’s complement, as discussed in Chapter 2, “Loading and Adding,” makes the operations the same.

  • All the operands are registers; immediate operands aren’t allowed, but remember you can use left shift to multiply by powers of two, such as two, four, and eight.

  • If you multiply two 32-bit W registers with MUL, the destination must also be a W register, so you lose the upper half of the product. To keep all 64 bits of the result, use the SMULL or UMULL instructions described next.
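To see the truncation concretely, here is a small Python model (our own illustration; the masking mimics a 64-bit register) of what MUL computes, along with the shift-as-multiply trick from the earlier bullet:

```python
MASK64 = (1 << 64) - 1  # keep only the lower 64 bits, like an X register

def mul(xn, xm):
    """Model of MUL Xd, Xn, Xm: the upper 64 bits of the product are discarded."""
    return (xn * xm) & MASK64

print(mul(25, 4))                         # 100
print(hex(mul(0xFFFFFFFFFFFFFFFF, 2)))    # 0xfffffffffffffffe - upper bits lost
print(mul(25, 4) == (25 << 2) & MASK64)   # True: multiply by 4 == shift left by 2
```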

To overcome some of these limitations, there are a few additional multiply instructions, as follows:
  • SMULH      Xd, Xn, Xm

  • SMULL      Xd, Wn, Wm

  • UMULH      Xd, Xn, Xm

  • UMULL      Xd, Wn, Wm

SMULL and UMULL allow us to multiply two 32-bit registers and get the full result in a 64-bit register.
  • SMULL is for signed integers.

  • UMULL for unsigned integers.

SMULH and UMULH complement MUL by giving us the upper 64 bits of the product of two 64-bit numbers:
  • Combining SMULH with MUL gives the complete 128-bit product for signed integers.

  • Combining UMULH with MUL gives the complete 128-bit product for unsigned integers.

See Exercise 1 in this chapter to confirm how MUL works with both cases.
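As a sketch of how the pieces fit together, here is a Python model (our own illustration, not ARM code) of UMULH and SMULH. The only difference in the signed case is reinterpreting each 64-bit pattern as a two's complement value before multiplying:

```python
MASK64 = (1 << 64) - 1

def umulh(xn, xm):
    """Model of UMULH: upper 64 bits of the unsigned 128-bit product."""
    return (xn * xm) >> 64

def smulh(xn, xm):
    """Model of SMULH: upper 64 bits of the signed 128-bit product."""
    def signed(x):  # reinterpret a 64-bit pattern as two's complement
        return x - (1 << 64) if x >> 63 else x
    return ((signed(xn) * signed(xm)) >> 64) & MASK64

print(umulh(1 << 63, 2))        # 1: the product 2^64 has a high half of 1
print(hex(smulh(MASK64, 1)))    # 0xffffffffffffffff: -1 * 1 sign-extends
```

Pairing umulh with the low half from mul reconstructs the full 128-bit product, which is exactly how the two instructions are meant to be used together.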

All these instructions have the same performance and work in a similar manner to how we learned to multiply in grade school, by multiplying each digit in a loop and adding the results (with shifts) together. The ability to detect when a multiplication is complete (remaining leftmost digits are 0) was added to the ARM processor some time ago, so you aren’t penalized for multiplying small numbers (the loop knows to stop early).
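The grade-school idea can be sketched in Python for nonnegative values (a conceptual model only; it says nothing about how the actual circuitry is built):

```python
def shift_add_mul(a, b):
    """Multiply by shifting and adding: for each set bit of b, add a
    correspondingly shifted copy of a; stop once no bits of b remain."""
    product = 0
    while b != 0:        # early exit when the remaining leftmost bits are 0
        if b & 1:
            product += a
        a <<= 1          # shift the partial product left one digit
        b >>= 1          # consume the next bit of the multiplier
    return product

print(shift_add_mul(25, 4))   # 100
```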

There is a set of similar instructions that calculate the negative of the multiplication; these are
  • MNEG      Xd, Xn, Xm

  • SMNEGL      Xd, Wn, Wm

  • UMNEGL      Xd, Wn, Wm

MNEG calculates –(Xn ∗ Xm) and places the result in Xd; SMNEGL and UMNEGL do the same for their 32-bit operands, producing a 64-bit result. With only a limited number of operands possible in the 32 bits of an instruction, these may seem like a strange addition to the instruction set, but we’ll see where they come from later in this chapter.

Examples

Listing 11-1 has some code to demonstrate all the various multiply instructions. We use our debug.s file from Chapter 9, “Interacting with C and Python,” meaning our program must be organized with the C runtime in mind.
//
// Example of 32 & 64-Bit Multiplication
//
.include "debug.s"
.global main // Provide program starting address
// Load the registers with some data
// Use small positive numbers that will work for all
// multiply instructions.
main:
      MOV    X2, #25
      MOV    X3, #4
      printStr "Inputs:"
      printReg 2
      printReg 3
      MUL    X4, X2, X3
      printStr "MUL X4=X2*X3:"
      printReg 4
      MNEG   X4, X2, X3
      printStr "MNEG X4=-X2*X3:"
      printReg 4
      SMULL  X4, W2, W3
      printStr "SMULL X4=W2*W3:"
      printReg 4
      SMNEGL X4, W2, W3
      printStr "SMNEGL X4=-W2*W3:"
      printReg 4
      UMULL  X4, W2, W3
      printStr "UMULL X4=W2*W3:"
      printReg 4
      UMNEGL X4, W2, W3
      printStr "UMNEGL X4=-W2*W3:"
      printReg 4
      LDR    X2, =A
      LDR    X2, [X2]
      LDR    X3, =B
      LDR    X3, [X3]
      printStr "Inputs:"
      printReg 2
      printReg 3
      MUL    X4, X2, X3
      printStr "MUL X4 = bottom 64 bits of X2*X3:"
      printReg 4
      SMULH  X4, X2, X3
      printStr "SMULH X4 = top 64 bits of X2*X3 (signed):"
      printReg 4
      UMULH  X4, X2, X3
      printStr "UMULH X4 = top 64 bits of X2*X3 (unsigned):"
      printReg 4
      MOV    X0, #0            // return code
      RET
.data
A:    .dword       0x7812345678
B:    .dword       0xFABCD12345678901
Listing 11-1

Examples of the various multiply instructions

The makefile is as expected. The output is
smist08@kali:~/asm64/Chapter 11$ make
gcc -o mulexamp mulexamp.s
smist08@kali:~/asm64/Chapter 11$ ./mulexamp
Inputs:
X2 =                               25, 0x0000000000000019
X3 =                                4, 0x0000000000000004
MUL X4=X2*X3:
X4 =                              100, 0x0000000000000064
MNEG X4=-X2*X3:
X4 =                             -100, 0xffffffffffffff9c
SMULL X4=W2*W3:
X4 =                              100, 0x0000000000000064
SMNEGL X4=-W2*W3:
X4 =                             -100, 0xffffffffffffff9c
UMULL X4=W2*W3:
X4 =                              100, 0x0000000000000064
UMNEGL X4=-W2*W3:
X4 =                             -100, 0xffffffffffffff9c
Inputs:
X2 =                     515701495416, 0x0000007812345678
X3 =              -379198319187490559, 0xfabcd12345678901
MUL X4 = bottom 64 bits of X2*X3:
X4 =              8455362044785495672, 0x75577afb36c28e78
SMULH X4 = top 64 bits of X2*X3 (signed):
X4 =                     -10600956976, 0xfffffffd88223bd0
UMULH X4 = top 64 bits of X2*X3 (unsigned):
X4 =                     505100538440, 0x000000759a569248
smist08@kali:~/asm64/Chapter 11$

To demonstrate SMULH and UMULH, we load some large numbers whose product overflows a 64-bit result, so we see nonzero values in the upper 64 bits. Notice the difference between the signed and unsigned computations.

Multiply is straightforward, so let’s move on to division.

Division

Integer division is standard in all 64-bit ARM processors. This gives us some standardization, unlike the 32-bit ARM world where some processors contain an integer division instruction and some don’t.

The division instructions are
  • SDIV      Xd, Xn, Xm

  • UDIV      Xd, Xn, Xm

where
  • Xd is the destination register

  • Xn is the register holding the numerator

  • Xm is the register holding the denominator

The registers can be all X or all W registers.

There are a few problems or technical notes on these instructions:
  • There are no “S” versions of these instructions; they don’t set the condition flags.

  • Dividing by 0 should throw an exception; instead, these instructions return 0, which can be very misleading.

  • These instructions aren’t the inverses of MUL combined with SMULH or UMULH. For that, Xn would need to be a register pair so the value being divided could be 128 bits. To divide a 128-bit value, we need to either go to the floating-point processor or roll our own code.

  • The instructions only return the quotient, not the remainder. Many algorithms require the remainder, and you must calculate it as remainder = numerator - (quotient ∗ denominator).
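These rules are easy to model in Python (our own sketch; note that SDIV truncates toward zero, while Python’s // operator rounds toward negative infinity, so the model adjusts for that):

```python
def sdiv(n, d):
    """Model of SDIV: truncate toward zero, and return 0 on division
    by zero, matching the instruction instead of raising an exception."""
    if d == 0:
        return 0
    q = abs(n) // abs(d)
    return -q if (n < 0) != (d < 0) else q

def remainder(n, d):
    """remainder = numerator - (quotient * denominator)."""
    return n - sdiv(n, d) * d

print(sdiv(100, 4), remainder(100, 7))   # 25 2
print(sdiv(5, 0))                        # 0 - the misleading no-error case
```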

Example

The code to execute the divide instructions is simple. Listing 11-2 is an example like the one we did for multiplication.
//
// Examples of 64-Bit Integer Division
//
.include "debug.s"
.global main // Provide program starting address
// Load the registers with some data
// Perform various division instructions
main:
      MOV    X2, #100
      MOV    X3, #4
      printStr "Inputs:"
      printReg 2
      printReg 3
      SDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      UDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      // Division by zero
      printStr "Division by zero:"
      MOV    X3, #0
      SDIV   X4, X2, X3
      printStr "Outputs:"
      printReg 4
      MOV    X0, #0        // return code
      RET
Listing 11-2

Examples of the SDIV and UDIV instructions

The makefile is as expected; if we build and run this program, we get
smist08@kali:~/asm64/Chapter 11$ make
gcc -o divexamp divexamp.s
smist08@kali:~/asm64/Chapter 11$ ./divexamp
Inputs:
X2 =                              100, 0x0000000000000064
X3 =                                4, 0x0000000000000004
Outputs:
X4 =                               25, 0x0000000000000019
Outputs:
X4 =                               25, 0x0000000000000019
Division by zero:
Outputs:
X4 =                                0, 0x0000000000000000
smist08@kali:~/asm64/Chapter 11$
Note

Dividing by 0 should trigger an error, but it doesn’t; the instruction silently returns 0. Thus, we need to check for division by 0 in our code.

Next, we look at combining multiplication and addition, so we can optimize loops operating on vectors.

Multiply and Accumulate

The multiply and accumulate operation multiplies two numbers, then adds the product to a third. As we go through the next few chapters, we will see this operation reappear again and again. The ARM processor is RISC; if the instruction set is reduced, then why do we find so many instructions, and as a result so much circuitry, dedicated to performing multiply and accumulate?

The answer goes back to our favorite first year university math course on linear algebra. Most science students are forced to take this course, learn to work with vectors and matrices, and then hope they never see these concepts again. Unfortunately, they form the foundation for both graphics and machine learning. Before delving into the ARM instructions for multiply and accumulate, let’s review a bit of linear algebra.

Vectors and Matrices

A vector is an ordered list of numbers. For instance, in 3D graphics it might represent your location in 3D space where [x, y, z] are your coordinates. Vectors have a dimension which is the number of elements they contain. It turns out a useful computation with vectors is something called a dot product. If A = [a1, a2, …, an] is one vector and B = [b1, b2, …, bn] is another vector, then their dot product is defined as
A ∙ B = a1*b1 + a2*b2 + ... + an*bn

If we want to calculate this dot product, then a loop performing multiply and accumulate instructions will be quite efficient.
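In Python, such a loop looks like this (an illustration of the computation, with each iteration corresponding to one multiply and accumulate instruction):

```python
def dot(a, b):
    """Dot product: accumulate the products of corresponding elements."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y   # acc = acc + x*y, exactly the multiply-accumulate step
    return acc

print(dot([1, 2, 3], [4, 5, 6]))   # 1*4 + 2*5 + 3*6 = 32
```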

A matrix is a two-dimensional table of numbers such as
[Figure: an example matrix, written as a rectangular table of numbers]
Matrix multiplication is a complicated process that drives first year linear algebra students nuts. When you multiply matrix A times matrix B, each element of the resulting matrix is the dot product of a row of matrix A with a column of matrix B.
[Figure: matrix multiplication, showing each element of the product as the dot product of a row of A with a column of B]
If these were 3x3 matrices, then there would be nine dot products, each with three terms. We can also multiply a matrix by a vector the same way.
[Figure: multiplying a matrix by a vector]

In 3D graphics, if we represent a point as a 4D vector [x, y, z, 1], then the affine transformations of scale, rotate, shear, and reflection can be represented as 4x4 matrices. Any number of these transformations can be combined into a single matrix. Thus, to transform an object into a scene requires a matrix multiplication applied to each of the object’s vertex points. The faster we can do this, the faster we can render a frame in a video game.

In neural networks, the calculation for each layer of neurons is calculated by a matrix multiplication followed by the application of a nonlinear function. The bulk of the work is the matrix multiplication. Most neural networks have many layers of neurons, each requiring a matrix multiplication. The matrix size corresponds to the number of variables and the number of neurons; consequently, the matrices’ dimensions are often in the thousands. How quickly we perform object recognition or speech translation depends on how fast we can multiply matrices, which depends on how fast we can do multiply with accumulate.

These important applications are why the ARM processor dedicates so much silicon to multiply and accumulate. We’ll keep returning to how to speed up this process as we explore the ARM CPU’s floating-point unit (FPU) and Neon coprocessors in the following chapters.

Accumulate Instructions

Here are the multiply with accumulate instructions:
  • MADD      Xd, Xn, Xm, Xa

  • MSUB      Xd, Xn, Xm, Xa

  • SMADDL      Xd, Wn, Wm, Xa

  • UMADDL      Xd, Wn, Wm, Xa

  • SMSUBL      Xd, Wn, Wm, Xa

  • UMSUBL      Xd, Wn, Wm, Xa

The multiply with accumulate instructions map closely to the multiply instructions we’ve already discussed. In fact, most of the multiply instructions are aliases of these instructions with the zero register as Xa.

We either add or subtract the product from the running accumulator. The calculation is
Xd = Xa + Xn * Xm
or
Xd = Xa – Xn * Xm
Note

Xd can be the same as Xa, for calculating a running sum.

In the subtract case, we see that if Xa is the zero register, we get all the multiply negative operations from the last section.

For the versions that multiply two 32-bit registers to get a 64-bit result, the sum needs to be a 64-bit X register.
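A quick Python model (our own sketch, with the same 64-bit masking as before) shows the calculation and how the plain multiply instructions fall out when Xa is the zero register:

```python
MASK64 = (1 << 64) - 1

def madd(xn, xm, xa):
    """Model of MADD: Xd = Xa + Xn*Xm, truncated to 64 bits."""
    return (xa + xn * xm) & MASK64

def msub(xn, xm, xa):
    """Model of MSUB: Xd = Xa - Xn*Xm, truncated to 64 bits."""
    return (xa - xn * xm) & MASK64

XZR = 0                         # the zero register
print(madd(25, 4, XZR))         # 100: MUL is MADD with Xa = XZR
print(hex(msub(25, 4, XZR)))    # 0xffffffffffffff9c: MNEG is MSUB with Xa = XZR
print(madd(2, 3, 10))           # 16: a running sum picks up the product
```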

Example 1

We’ve talked about how multiply and accumulate is ideal for multiplying matrices, so for an example, let’s multiply two 3x3 matrices.

The algorithm we are implementing is shown in Listing 11-3.
FOR row = 1 to 3
      FOR col = 1 to 3
            acum = 0
            FOR i = 1 to 3
                  acum = acum + A[row, i]*B[i, col]
            NEXT i
            C[row, col] = acum
      NEXT col
NEXT row
Listing 11-3

Pseudo-code for our matrix multiplication program

The row and column loops go through each cell of the output matrix and calculate the correct dot product for that cell in the innermost loop.
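The pseudo-code translates directly into Python, which is a handy way to check the expected output before writing the Assembly version:

```python
def matmul(A, B, n=3):
    """Triple loop matching the pseudo-code: one dot product per output cell."""
    C = [[0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            acum = 0
            for i in range(n):
                acum += A[row][i] * B[i][col]   # multiply and accumulate
            C[row][col] = acum
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
for row in matmul(A, B):
    print(row)   # [30, 24, 18], then [84, 69, 54], then [138, 114, 90]
```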

Listing 11-4 shows the implementation in Assembly.
//
// Multiply 2 3x3 integer matrices
//
// Registers:
//    W1 - Row index
//    W2 - Column index
//    X4 - Address of row
//    X5 - Address of column
//    X7 - 64 bit accumulated sum
//    W9 - Cell of A
//    W10 - Cell of B
//    X19 - Position in C
//    X20 - Loop counter for printing
//    X12 - row in dotloop
//    X6 - col in dotloop
.global main // Provide program starting address
      .equ    N, 3   // Matrix dimensions
      .equ    WDSIZE, 4 // Size of element
main:
      STR     LR, [SP, #-16]!          // Save required regs
      STP     X19, X20, [SP, #-16]!    // Save required regs
      MOV     W1, #N       // Row index
      LDR     X4, =A       // Address of current row
      LDR     X19, =C      // Address of results matrix
rowloop:
      LDR     X5, =B       // first column in B
      MOV     W2, #N       // Column index (will count down to 0)
colloop:
      // Zero accumulator registers
      MOV     X7, #0
      MOV     W0, #N       // dot product loop counter
      MOV     X12, X4      // row for dot product
      MOV     X6, X5       // column for dot product
dotloop:
      // Do dot product of a row of A with column of B
      LDR     W9, [X12], #WDSIZE    // load A[row, i] and incr
      LDR     W10, [X6], #(N*WDSIZE)       // load B[i, col]
      SMADDL  X7, W9, W10, X7       // Do multiply and accumulate
      SUBS    W0, W0, #1            // Dec loop counter
      B.NE    dotloop               // If not zero loop
      STR     W7, [X19], #4         // C[row, col] = dotprod
      ADD     X5, X5, #WDSIZE       // Inc current col
      SUBS    W2, W2, #1            // Dec col loop counter
      B.NE    colloop               // If not zero loop
      ADD     X4, X4, #(N*WDSIZE)   // Increment to next row
      SUBS    W1, W1, #1            // Dec row loop counter
      B.NE    rowloop               // If not zero loop
// Print out matrix C
// Loop through 3 rows printing 3 cols each time.
      MOV     W20, #3               // Print 3 rows
      LDR     X19, =C               // Addr of results matrix
printloop:
      LDR     X0, =prtstr           // printf format string
      LDR     W1, [X19], #WDSIZE    // first element in current row
      LDR     W2, [X19], #WDSIZE    // second element in current row
      LDR     W3, [X19], #WDSIZE    // third element in current row
      BL      printf                // Call printf
      SUBS    W20, W20, #1          // Dec loop counter
      B.NE    printloop             // If not zero loop
      MOV     X0, #0                // return code
      LDP     X19, X20, [SP], #16   // Restore Regs
      LDR     LR, [SP], #16         // Restore LR
      RET
.data
// First matrix
A:    .word   1, 2, 3
      .word   4, 5, 6
      .word   7, 8, 9
// Second matrix
B:    .word   9, 8, 7
      .word   6, 5, 4
      .word   3, 2, 1
// Result matrix
C:    .fill   9, 4, 0
prtstr: .asciz  "%3d  %3d  %3d "
Listing 11-4

3x3 matrix multiplication in Assembly

After compiling and running this program, we get
smist08@kali:~/asm64/Chapter 11$ make
gcc -g -o matrixmult matrixmult.s
smist08@kali:~/asm64/Chapter 11$ ./matrixmult
 30   24   18
 84   69   54
138  114   90
smist08@kali:~/asm64/Chapter 11$

Accessing Matrix Elements

We store the three matrices in memory, in row order. They are arranged in the .word directives so that you can see the matrix structure. In the pseudo-code, we refer to the matrix elements using two-dimensional arrays. There are no instructions or operand formats to specify two-dimensional array access, so we must do it ourselves. To Assembly, each array is just a sequence of nine words in memory. Now that we know how to multiply, we could do something like

A[i, j] = A[i∗N + j]

where N is the dimension of the array. We don’t do this, though; in Assembly it pays to notice that we access the array elements in order, so we can go from one element in a row to the next by adding the size of an element (a word, or 4 bytes), and from an element in a column to the next one by adding the size of a row. This is why the constant N ∗ WDSIZE appears so often in the code. We step through the array incrementally and never have to multiply array indexes. Generally, multiplication and division are expensive operations, and we should avoid them where we can.
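A small Python sketch makes the two access strategies concrete (the name flat_index is our own, chosen to match the N and WDSIZE constants in the listing):

```python
N, WDSIZE = 3, 4   # matrix dimension and element size in bytes

def flat_index(row, col):
    """General two-dimensional access over row-major storage: A[row*N + col]."""
    return row * N + col

# The Assembly code instead walks the flat storage with byte offsets:
# add WDSIZE to reach the next element in a row,
# add N*WDSIZE to reach the next element down a column.
A = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # matrix A stored in row order
print(A[flat_index(1, 2)])          # row 1, col 2 -> 6
print(flat_index(1, 0) * WDSIZE)    # 12: byte offset of row 1 (one row = N*WDSIZE)
```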

We can use post-indexing techniques to access elements and increment pointers to the next element. We use post-indexing to store the result of each computation in the array C. We see this in the following:
STR  W7, [X19], #4   // C[row, col] = dotprod

which stores our computed dot product into C and then increments the pointer into C by 4 bytes. We see it again when we print the C matrix at the end.

Multiply with Accumulate

The core of the algorithm relies on the SMADDL instruction to multiply an element of A by an element of B and add that to the running sum for the dot product:
SMADDL X7, W9, W10, X7

This instruction accumulates a 64-bit sum, though we only take the lower 32 bits when we store it into the result matrix C. We don’t check for overflow, but as long as the numbers in A and B are small, we won’t have a problem.

Register Usage

We use quite a few registers, so we’re lucky we can keep track of all our loop indexes and pointers in registers, without having to move them in and out of memory. If we had to do this, we would have allocated space on the stack to hold any needed variables.

Notice that we use registers X19 and X20 in the loop that does the printing. That is because the printf function can change any of registers X0 to X18 on us. Elsewhere we mostly use registers X0 to X18, since we don’t need to preserve them for our caller. However, we do need to preserve X19 and X20, so we push and pop them to and from the stack along with LR.

Summary

We introduced the various forms of the multiply and division instructions supported in the ARM 64-bit instruction set.

We then explained the concept of multiply and accumulate and why these instructions are so important to modern applications in graphics and machine learning. We reviewed the many variations of these instructions and then presented an example matrix multiplication program to show them in action.

In Chapter 12, “Floating-Point Operations,” we will look at more math, but this time in scientific notation allowing fractions and exponents, going beyond integers for the first time.

Exercises

  1. To multiply two 64-bit numbers resulting in a 128-bit product, we used the MUL instruction to obtain the lower 64 bits of the product for both the signed and unsigned integer cases. To prove that this works, let’s work a small example multiplying two 4-bit numbers to get an 8-bit product. Multiply 0xf by 2. In the signed case, 0xf is -1 and the product is -2; in the unsigned case, 0xf is 15 and the product is 30. Manually perform the calculation to ensure the correct result is obtained in both cases.

  2. Write a signed 64-bit integer division routine that checks whether the denominator is zero before performing the division. Print an error if zero is encountered.

  3. Write a routine to compute a dot product of dimension six. Put the numbers to calculate in the .data section and print the result.

  4. Change your program in Exercise 3 to use multiply and subtract from the accumulator, instead of adding.

  5. Change the matrices calculated in the example and check that the result is correct.