S. Smith, Raspberry Pi Assembly Language Programming, https://doi.org/10.1007/978-1-4842-5287-1_10

10. Multiply, Divide, and Accumulate

Stephen Smith (Gibsons, BC, Canada)

In this chapter, we return to mathematics. We’ve covered addition, subtraction, and a collection of bit operations on our 32-bit registers. Now we will cover multiplication and division. The ARM processor has a surplus of multiply instructions but a dearth of division operations.

We will cover multiply with accumulate instructions. We will provide some background on why the ARM processor has so much circuitry dedicated to performing this operation. This will get us into the mechanics of vector and matrix multiplication.

Multiplication

In Chapter 7, “Linux Operating System Services,” we discussed why there are so many Linux service calls. Part of the reason is compatibility: when new functionality was needed, a new call was added and the old call was preserved. The ARM multiply instructions have a similar history. Multiply has been in the ARM architecture for a long time, but the original instructions were inadequate, so new instructions were added while the old ones were kept for software compatibility.

The original 32-bit instruction is

MUL{S} Rd, Rn, Rm

This instruction computes Rd = Rn * Rm. It looks good, but people familiar with multiplication might immediately ask “These are all 32-bit registers, so when you multiply two 32-bit numbers, don’t you get a 64-bit product?” That is true, and that is the most obvious limitation on this instruction. Here are some notes on this instruction:

Rd is the lower 32 bits of the product. The upper 32 bits are discarded.
The MULS version of the instruction only sets the N and Z flags; it does not set the C or V flags, so you don’t know if it overflowed.
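There aren’t separate signed and unsigned versions, and none are needed: the lower 32 bits of the product are the same whether the operands are read as signed or unsigned two's complement values. The two interpretations only differ in the upper 32 bits, which this instruction discards.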
All the operands are registers; immediate operands are not allowed.
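Rd cannot be the same as Rn on architectures before ARMv6. Every Raspberry Pi is ARMv6 or later, so this restriction does not apply there.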
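The first few notes can be checked with a small model. The following is a Python sketch, not ARM code; the helper names `mul` and `to_signed32` are made up for illustration. It shows the truncation to 32 bits and that the low word does not depend on whether the inputs are read as signed or unsigned.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mul(rn, rm):
    """Model of MUL Rd, Rn, Rm: keep only the low 32 bits of the product."""
    return (rn * rm) & MASK32


# 0x10000 * 0x10000 = 2**32 needs 33 bits, so the low word wraps to 0.
wrapped = mul(0x10000, 0x10000)

# The register pattern 0xFFFFFFFE is -2 when signed, 4294967294 when unsigned.
# Either reading gives the same low 32 bits when multiplied by 3.
signed_view = mul(-2 & MASK32, 3)
unsigned_view = (0xFFFFFFFE * 3) & MASK32

print(hex(wrapped), hex(signed_view), to_signed32(signed_view))
# prints: 0x0 0xfffffffa -6
```

The wrapped result is the case MULS cannot report, since it leaves the C and V flags alone.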

To overcome some of these limitations, later versions of the ARM processor added an abundance of multiply instructions:

SMULL{S} RdLo, RdHi, Rn, Rm
UMULL{S} RdLo, RdHi, Rn, Rm
SMMUL{R} {Rd}, Rn, Rm
SMUL<x><y> {Rd}, Rn, Rm
SMULW<y> {Rd}, Rn, Rm

The first SMULL instruction will perform signed 32-bit multiplication putting the 64-bit result in two registers. The second UMULL instruction is the unsigned version of this. SMMUL complements the original MUL instruction by providing the upper 32 bits of the product and discarding the lower 32 bits.
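To see where the signed and unsigned forms diverge, here is a Python sketch of the three instructions. The function names mirror the mnemonics but are otherwise hypothetical, and each returns its result registers as a tuple in operand order (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def umull(rn, rm):
    """Model of UMULL RdLo, RdHi, Rn, Rm: returns (RdLo, RdHi)."""
    product = (rn & MASK32) * (rm & MASK32)
    return product & MASK32, (product >> 32) & MASK32


def smull(rn, rm):
    """Model of SMULL RdLo, RdHi, Rn, Rm: returns (RdLo, RdHi)."""
    product = to_signed32(rn) * to_signed32(rm)
    return product & MASK32, (product >> 32) & MASK32


def smmul(rn, rm):
    """Model of SMMUL Rd, Rn, Rm: upper 32 bits of the signed product."""
    return smull(rn, rm)[1]


all_ones = 0xFFFFFFFF  # -1 if signed, 4294967295 if unsigned
print([hex(w) for w in umull(all_ones, 2)])  # ['0xfffffffe', '0x1']
print([hex(w) for w in smull(all_ones, 2)])  # ['0xfffffffe', '0xffffffff']
```

The low words agree and only the high words differ, which is why the 64-bit multiplies need separate signed and unsigned versions.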

Multiplication is an expensive operation, so there is some merit in multiplying small numbers quickly. SMUL provides this; it multiplies two 16-bit quantities to provide a 32-bit quantity. The <x> and <y> modifiers specify which 16 bits of the operand registers are used:

<x> is either B or T. B means use the bottom half (bits [15:0]) of Rn; T means use the top half (bits [31:16]) of Rn.
<y> is either B or T. B means use the bottom half (bits [15:0]) of Rm; T means use the top half (bits [31:16]) of Rm.

SMULW is an intermediate version that multiplies a 32-bit value by a 16-bit value, then only keeps the upper 32 bits of the 48-bit product. The <y> modifier is the same as for SMUL. When I’ve seen this instruction used, one of the operands has usually been shifted so that the product ends up in the upper 32 bits.
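The halfword selection is easier to follow in a model. This Python sketch uses hypothetical helpers (`half`, `smul`, `smulw`) that take the B or T modifier as a string; it illustrates the arithmetic only.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def half(reg, which):
    """Select the signed bottom ('B') or top ('T') halfword of a register."""
    raw = reg & 0xFFFF if which == "B" else (reg >> 16) & 0xFFFF
    return to_signed(raw, 16)


def smul(x, y, rn, rm):
    """Model of SMUL<x><y> Rd, Rn, Rm: 16-bit by 16-bit signed product."""
    return (half(rn, x) * half(rm, y)) & 0xFFFFFFFF


def smulw(y, rn, rm):
    """Model of SMULW<y> Rd, Rn, Rm: top 32 bits of the 48-bit product."""
    return ((to_signed(rn, 32) * half(rm, y)) >> 16) & 0xFFFFFFFF


print(smul("B", "B", 25, 4))          # 100
print(smul("T", "B", 0x00030007, 5))  # 15: top half of Rn is 3
print(smulw("B", 25, 4))              # 0: 100 sits entirely in the low 16 bits
print(smulw("B", 25 << 16, 4))        # 100: pre-shifting Rn keeps the product
```

The last two lines show why one operand is usually shifted before SMULW is used.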

All these instructions have the same performance. The ability to detect when a multiply is done (remaining digits are 0) was added to the ARM processor some time ago, so the need for shorter versions of multiply, in my opinion, doesn’t exist anymore. I would recommend always using SMULL and UMULL, as then there are fewer things to go wrong if your numbers change over time.

Examples

Listing 10-1 is some code to demonstrate all the various multiply instructions. We use our debug.s file from Chapter 9, “Interacting with C and Python,” which means our program must be organized with the C runtime in mind.

@ Example of 16 & 32-bit Multiplication

.include "debug.s"

.global main @ Provide program starting address to linker

@ Load the registers with some data
@ Use small positive numbers that will work for all
@ multiply instructions.
main:
        push {R4-R12, LR}
        MOV R2, #25
        MOV R3, #4
        printStr "Inputs:"
        printReg 2
        printReg 3
        MUL R4, R2, R3
        printStr "MUL R4=R2*R3:"
        printReg 4
        SMULL R4, R5, R2, R3
        printStr "SMULL R5, R4=R2*R3:"
        printReg 4
        printReg 5
        UMULL R4, R5, R2, R3
        printStr "UMULL R5, R4=R2*R3:"
        printReg 4
        printReg 5
        SMMUL R4, R2, R3
        printStr "SMMUL R4 = top 32 bits of R2*R3:"
        printReg 4
        SMULBB R4, R2, R3
        printStr "SMULBB R4 = R2*R3:"
        printReg 4
        SMULWB R4, R2, R3
        printStr "SMULWB R4 = upper 32 bits of R2*R3:"
        printReg 4
        mov r0, #0 @ return code
        pop {R4-R12, PC}

Listing 10-1. Examples of the various multiply instructions

The makefile is as we would expect. The output is

pi@raspberrypi:~/asm/Chapter 10 $ make
gcc -o mulexamp mulexamp.s
pi@raspberrypi:~/asm/Chapter 10 $ ./mulexamp
Inputs:
R2 = 25, 0x00000019
R3 = 4, 0x00000004
MUL R4=R2*R3:
R4 = 100, 0x00000064
SMULL R5, R4=R2*R3:
R4 = 100, 0x00000064
R5 = 0, 0x00000000
UMULL R5, R4=R2*R3:
R4 = 100, 0x00000064
R5 = 0, 0x00000000
SMMUL R4 = top 32 bits of R2*R3:
R4 = 0, 0x00000000
SMULBB R4 = R2*R3:
R4 = 100, 0x00000064
SMULWB R4 = upper 32 bits of R2*R3:
R4 = 0, 0x00000000
pi@raspberrypi:~/asm/Chapter 10 $

Multiply is straightforward, especially using SMULL and UMULL.

Division

Integer division is a much more recent addition to the ARM processor. In fact, the Raspberry Pi 1 and Zero have no integer division instruction. The Raspberry Pi 2 uses ARM Cortex-A7 processors (Cortex-A53 on the later revision 1.2 boards), which introduce integer division to the Pi world. The Raspberry Pi 3 uses Cortex-A53 processors, and the Raspberry Pi 4 includes newer Cortex-A72 processors.

If you are targeting Raspberry Pi Zero or 1, then you will need to either implement your own division algorithm in code, call some C code, or use the floating-point coprocessor. We’ll cover the floating-point coprocessor in Chapter 11, “Floating-Point Operations.”

The Raspberry Pi 2, 3, and 4’s division instructions are

SDIV {Rd}, Rn, Rm
UDIV {Rd}, Rn, Rm

where

Rd is the destination register.
Rn is the register holding the numerator.
Rm is a register holding the denominator.

There are a few problems or technical notes on these instructions:

There is no “S” option of this instruction, as it doesn’t set CPSR at all.
Dividing by 0 should throw an exception; with these instructions, it returns 0 which can be very misleading.
These instructions aren’t the inverses of SMULL and UMULL. To be inverses, Rn would need to be a register pair so that the value being divided could be 64 bits. To divide a 64-bit value, we need to either go to the floating-point processor or roll our own code.
The instruction only returns the quotient, not the remainder. Many algorithms require the remainder, and you must calculate it as remainder = numerator - (quotient * denominator).
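The notes above can be summarized in a small model. This is a Python sketch with hypothetical helper names; it assumes the behavior described in the list (a zero divisor yields 0) and that SDIV rounds toward zero, and it computes the remainder with the formula from the last note.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def udiv(rn, rm):
    """Model of UDIV Rd, Rn, Rm; a zero divisor yields 0."""
    rn &= MASK32
    rm &= MASK32
    return 0 if rm == 0 else rn // rm


def sdiv(rn, rm):
    """Model of SDIV Rd, Rn, Rm; rounds toward zero, zero divisor yields 0."""
    n, d = to_signed32(rn), to_signed32(rm)
    if d == 0:
        return 0
    quotient = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        quotient = -quotient
    return quotient & MASK32


def remainder(rn, rm, quotient):
    """remainder = numerator - (quotient * denominator), in 32 bits."""
    return (rn - quotient * rm) & MASK32


q = sdiv(-7 & MASK32, 2)
r = remainder(-7 & MASK32, 2, q)
print(to_signed32(q), to_signed32(r))  # -3 -1
print(udiv(5, 0))                      # 0, with no exception raised
```

Because a zero divisor silently produces 0, a caller that cares has to test the denominator before dividing.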

Example

The code to execute the divide instructions is simple; Listing 10-2 is an example like we did for multiplication.

@ Examples of 32-bit Integer Division

.include "debug.s"

.global main @ Provide program starting address to linker

@ Load the registers with some data
@ Perform various division instructions
main:
        push {R4-R12, LR}
        MOV R2, #100
        MOV R3, #4
        printStr "Inputs:"
        printReg 2
        printReg 3
        SDIV R4, R2, R3
        printStr "Outputs:"
        printReg 4
        UDIV R4, R2, R3
        printStr "Outputs:"
        printReg 4
        mov r0, #0 @ return code
        pop {R4-R12, PC}

Listing 10-2. Examples of the SDIV and UDIV instructions

If we try to build this in the same way we did for the multiplication example, we will get the error

pi@raspberrypi:~/asm/Chapter 10 $ make -B
gcc -o divexamp divexamp.s
divexamp.s: Assembler messages:
divexamp.s:21: Error: selected processor does not support `sdiv R4,R2,R3' in ARM mode
make: *** [makefile:15: divexamp] Error 1
pi@raspberrypi:~/asm/Chapter 10 $

This is run on a Raspberry Pi 4. Didn’t we say it supports the SDIV instruction? The reason is that the Raspberry Pi foundation goes to great pains to ensure all their software runs on all Raspberry Pi no matter how old. The default configuration of the GNU Compiler Collection in Raspbian is to target the lowest common denominator. If we change the makefile to the following

divexamp: divexamp.s debug.s
	gcc -march="armv8-a" -o divexamp divexamp.s

then the program will compile and run. The -march parameter is for machine architecture, and “armv8-a” is the correct one for the Raspberry Pi 4. We could have used one to match a Raspberry Pi 3, but we’ll want to explore some new features in the Pi 4 later.

With this in place, the program runs and we get the expected results:

pi@raspberrypi:~/asm/Chapter 10 $ make
gcc -march="armv8-a" -o divexamp divexamp.s
pi@raspberrypi:~/asm/Chapter 10 $ ./divexamp
Inputs:
R2 = 100, 0x00000064
R3 = 4, 0x00000004
Outputs:
R4 = 25, 0x00000019
Outputs:
R4 = 25, 0x00000019
pi@raspberrypi:~/asm/Chapter 10 $

Multiply and Accumulate

The multiply and accumulate operation multiplies two numbers, then adds the product to a third. As we go through the next few chapters, we will see this operation reappear again and again. The ARM processor is RISC; if the instruction set is reduced, then why do we find so many instructions, and hence so much circuitry, dedicated to performing multiply and accumulate? The answer goes back to our favorite first-year university math course on linear algebra. Most science students are forced to take this course, learn to work with vectors and matrices, then hope they never see these concepts again. Unfortunately, they form the foundation for both graphics and machine learning. Before delving into the ARM instructions for multiply and accumulate, let’s review a bit of linear algebra.

Vectors and Matrices

A vector is an ordered list of numbers. For instance, in 3D graphics, it might represent your location in 3D space where [x, y, z] are your coordinates. Vectors have a dimension which is the number of elements they contain. It turns out a useful computation with vectors is something called a dot product. If A = [a₁, a₂, … , a_n] is one vector and B = [b₁, b₂, … , b_n] is another vector, then their dot product is defined as

A ⋅ B = a₁*b₁ + a₂*b₂ + … + a_n*b_n
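As a quick illustration of the definition, here is a Python sketch of the dot product written as a running multiply and accumulate. The function name `dot` and the sample vectors are chosen for this example only.

```python
def dot(a, b):
    """Dot product as a running multiply and accumulate."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    acc = 0
    for x, y in zip(a, b):
        acc += x * y  # one multiply-and-accumulate step per element pair
    return acc


print(dot([1, 2, 3], [9, 6, 3]))  # 1*9 + 2*6 + 3*3 = 30
```

Each pass through the loop body is one multiply followed by one add into the same accumulator, which is the pattern the hardware instructions fuse into a single step.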

If we want to calculate this dot product, then a loop performing multiply and accumulate instructions should be quite efficient. A matrix is a 2D table of numbers such as

[Figure: an example matrix]

Matrix multiplication is a complicated process that drives first-year linear algebra students nuts. When you multiply matrix A times matrix B, then each element on the resulting matrix is the dot product of a row of matrix A with a column of matrix B.

[Figure: the product of matrix A and matrix B, with each element computed as a row-by-column dot product]

If these were 3x3 matrices, then there would be nine dot products, each with three terms. We can also multiply a matrix by a vector the same way.

[Figure: a matrix multiplied by a vector]

In 3D graphics, if we represent a point as a 4D vector [x, y, z, 1], then the affine transformations of scale, rotate, shear, and reflection can be represented as 4x4 matrices. Any number of these transformations can be combined into a single matrix. Thus, to transform an object into a scene requires a matrix multiplication applied to each of the object’s vertex points. The faster we can do this, the faster we can render a frame in a video game.
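A small Python sketch makes the idea concrete. The two matrices below (a uniform scale by 2 and a reflection that negates x) are illustrative choices, not taken from the text, and `mat_vec` and `mat_mul` are hypothetical helper names.

```python
def mat_vec(m, v):
    """Multiply matrix m by vector v: one dot product per row of m."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def mat_mul(a, b):
    """Multiply matrices a and b: rows of a dotted with columns of b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Assumed example transformations on points written as [x, y, z, 1].
SCALE_BY_2 = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 1]]
REFLECT_X = [[-1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

point = [1, 2, 3, 1]

# Applying the transformations one after another ...
step_by_step = mat_vec(REFLECT_X, mat_vec(SCALE_BY_2, point))
# ... matches applying the single combined matrix once.
combined = mat_mul(REFLECT_X, SCALE_BY_2)
print(step_by_step, mat_vec(combined, point))  # [-2, 4, 6, 1] [-2, 4, 6, 1]
```

Combining the matrices once and reusing the result means each vertex costs a single 4x4 matrix-by-vector multiply, that is, four dot products of four terms.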

In neural networks, the output of each layer of neurons is computed by a matrix multiplication, followed by the application of a nonlinear function. The bulk of the work is the matrix multiplication. Most neural networks have many layers of neurons, each requiring a matrix multiplication. The matrix size corresponds to the number of variables and the number of neurons; hence, the matrix dimensions are often in the thousands. How quickly we perform object recognition or speech translation depends on how fast we can multiply matrices, which in turn depends on how fast we can do multiply with accumulate.

These important applications are why the ARM processor dedicates so much silicon to multiply and accumulate. We’ll keep returning to how to speed up this process as we explore the Raspberry Pi’s FPU and NEON coprocessors in the following chapters.

Accumulate Instructions

As we saw with multiplication, there has been quite a proliferation of multiply with accumulate instructions. Fortunately, we’ve covered most of the details in the “Multiplication” section. Here they are:

MLA{S} Rd, Rn, Rm, Ra
SMLAL{S} RdLo, RdHi, Rn, Rm
SMLA<x><y> Rd, Rn, Rm, Ra
SMLAD{X} Rd, Rn, Rm, Ra
SMLALD{X} RdLo, RdHi, Rn, Rm
SMLAL<x><y> RdLo, RdHi, Rn, Rm
SMLAW<y> Rd, Rn, Rm, Ra
SMLSD{X} Rd, Rn, Rm, Ra
SMLSLD{X} RdLo, RdHi, Rn, Rm
SMMLA{R} Rd, Rn, Rm, Ra
SMMLS{R} Rd, Rn, Rm, Ra
SMUAD{X} {Rd}, Rn, Rm
UMAAL RdLo, RdHi, Rn, Rm
UMLAL{S} RdLo, RdHi, Rn, Rm

That is a lot of instructions, so we won’t cover each in detail, but we can recognize that there is a multiply with accumulate for each regular multiply instruction. Let’s look at what leads to a further proliferation of instructions.

If there is an Ra operand, then the calculation is

Rd = Rn * Rm + Ra

Note

Rd can be the same as Ra for calculating a running sum.

If there isn’t an Ra operand, then the calculation is

Rd = Rd + Rn * Rm

This second form tends to be for instructions with 64-bit results: the sum needs to be 64 bits, so it can’t fit in a single register and is held in the RdLo and RdHi pair instead.
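The two forms can be compared side by side in a model. This Python sketch uses MLA for the four-operand form and SMLAL for the 64-bit form; the function names follow the mnemonics but are hypothetical, and `smlal` returns its registers as (RdLo, RdHi).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mla(rn, rm, ra):
    """Model of MLA Rd, Rn, Rm, Ra: Rd = Rn * Rm + Ra, low 32 bits only."""
    return (rn * rm + ra) & MASK32


def smlal(rd_lo, rd_hi, rn, rm):
    """Model of SMLAL RdLo, RdHi, Rn, Rm: add the signed 64-bit product
    to the 64-bit value held in RdHi:RdLo. Returns (RdLo, RdHi)."""
    acc = ((rd_hi & MASK32) << 32) | (rd_lo & MASK32)
    acc = (acc + to_signed32(rn) * to_signed32(rm)) & MASK64
    return acc & MASK32, acc >> 32


# Running sum with the accumulator passed back in each time.
lo, hi = 0, 0
for a, b in [(1, 9), (2, 6), (3, 3)]:
    lo, hi = smlal(lo, hi, a, b)
print(lo, hi)                                       # 30 0
print([hex(w) for w in smlal(0, 0, 0x7FFFFFFF, 4)])  # ['0xfffffffc', '0x1']
```

The second call shows a product that spills into the high word, which is the case a single 32-bit destination cannot hold.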

Dual Multiply with Accumulate

The instructions that end in D are dual. They do two multiply and accumulates in a single step. They multiply the top 16 bits of Rn and Rm and multiply the bottom 16 bits of Rn and Rm, then add both products to the accumulator.

If there is an S in the instruction instead of an A, then it subtracts the two products (bottom minus top) before adding the result to the accumulator.

Rd = Ra + (bottom Rn * bottom Rm - top Rn * top Rm)

If the accuracy works for you and you can encode all the data this way, then you can double your throughput using these instructions. We’ll look at this in Example 2.
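Here is a Python sketch of the dual forms, assuming the non-X variants (no halfword swap). The helpers `pack`, `half`, `smlad`, and `smlsd` are hypothetical names for this illustration.

```python
MASK32 = 0xFFFFFFFF


def half(reg, which):
    """Signed bottom ('B') or top ('T') 16 bits of a 32-bit register."""
    raw = reg & 0xFFFF if which == "B" else (reg >> 16) & 0xFFFF
    return raw - 0x10000 if raw & 0x8000 else raw


def pack(bottom, top):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((top & 0xFFFF) << 16) | (bottom & 0xFFFF)


def smlad(rn, rm, ra):
    """Model of SMLAD Rd, Rn, Rm, Ra: Ra + bottom product + top product."""
    bottom = half(rn, "B") * half(rm, "B")
    top = half(rn, "T") * half(rm, "T")
    return (ra + bottom + top) & MASK32


def smlsd(rn, rm, ra):
    """Model of SMLSD Rd, Rn, Rm, Ra: Ra + (bottom product - top product)."""
    bottom = half(rn, "B") * half(rm, "B")
    top = half(rn, "T") * half(rm, "T")
    return (ra + bottom - top) & MASK32


rn = pack(1, 2)  # bottom = 1, top = 2
rm = pack(9, 6)  # bottom = 9, top = 6
print(smlad(rn, rm, 0))       # 1*9 + 2*6 = 21
print(hex(smlsd(rn, rm, 0)))  # 1*9 - 2*6 = -3, i.e. 0xfffffffd
```

Each input value has to fit in 16 bits, which is the accuracy trade-off mentioned above.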

Example 1

We’ve talked about how multiply and accumulate is ideal for multiplying matrices, so for an example, let's multiply two 3x3 matrices.

The algorithm we are implementing is shown in Listing 10-3.

FOR row = 1 to 3
    FOR col = 1 to 3
        acum = 0
        FOR i = 1 to 3
            acum = acum + A[row, i]*B[i, col]
        NEXT i
        C[row, col] = acum
    NEXT col
NEXT row

Listing 10-3. Pseudo-code for our matrix multiplication program

Basically, the row and column loops go through each cell of the output matrix and calculate the correct dot product for that cell in the innermost loop.
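For readers who want to run the pseudo-code before reading the Assembly, here is a direct Python transcription. It is a sketch with zero-based indexes instead of the one-based indexes of the pseudo-code, and the function name `matmul` is made up; the two input matrices are the ones used in this chapter's example.

```python
N = 3

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

B = [[9, 8, 7],
     [6, 5, 4],
     [3, 2, 1]]


def matmul(a, b, n=N):
    """Triple loop: one dot product per cell of the result."""
    c = [[0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            acum = 0
            for i in range(n):
                acum = acum + a[row][i] * b[i][col]
            c[row][col] = acum
    return c


for line in matmul(A, B):
    print("%3d %3d %3d" % tuple(line))
```

Running it prints the rows 30 24 18, then 84 69 54, then 138 114 90.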

Listing 10-4 shows our implementation in Assembly.

@ Multiply 2 3x3 integer matrices
@ Registers:
@ R0 - Dot product loop counter
@ R1 - Row index
@ R2 - Column index
@ R4 - Address of row
@ R5 - Address of column
@ R7 - 64 bit accumulated sum (low word)
@ R8 - 64 bit accumulated sum (high word)
@ R9 - Cell of A
@ R10 - Cell of B
@ R11 - Position in C
@ R12 - row in dotloop
@ R6 - col in dotloop

.global main @ Provide program starting address

        .equ N, 3 @ Matrix dimensions
        .equ WDSIZE, 4 @ Size of element

main:
        push {R4-R12, LR} @ Save required regs
        MOV R1, #N @ Row index
        LDR R4, =A @ Address of current row
        LDR R11, =C @ Addr of results matrix
rowloop:
        LDR R5, =B @ first column in B
        MOV R2, #N @ Colindex (will count down)
colloop:
        @ Zero accumulator registers
        MOV R7, #0
        MOV R8, #0
        MOV R0, #N @ dot product loop counter
        MOV R12, R4 @ row for dot product
        MOV R6, R5 @ column for dot product
dotloop:
        @ Do dot product of a row of A with column of B
        LDR R9, [R12], #WDSIZE @ load A[row, i] and incr
        LDR R10, [R6], #(N*WDSIZE) @ load B[i, col] and step down a row
        SMLAL R7, R8, R9, R10 @ Do multiply and accumulate
        SUBS R0, #1 @ Dec loop counter
        BNE dotloop @ If not zero loop
        STR R7, [R11], #4 @ C[row, col] = dotprod
        ADD R5, #WDSIZE @ Increment current col
        SUBS R2, #1 @ Dec col loop counter
        BNE colloop @ If not zero loop
        ADD R4, #(N*WDSIZE) @ Increment to next row
        SUBS R1, #1 @ Dec row loop counter
        BNE rowloop @ If not zero loop

        @ Print out matrix C
        @ Loop through 3 rows printing 3 cols each time.
        MOV R5, #3 @ Print 3 rows
        LDR R11, =C @ Addr of results matrix
printloop:
        LDR R0, =prtstr @ printf format string
        LDR R1, [R11], #WDSIZE @ first element in current row
        LDR R2, [R11], #WDSIZE @ second element in current row
        LDR R3, [R11], #WDSIZE @ third element in current row
        BL printf @ Call printf
        SUBS R5, #1 @ Dec loop counter
        BNE printloop @ If not zero loop
        mov r0, #0 @ return code
        pop {R4-R12, PC} @ Restore regs and return

.data
@ First matrix
A:      .word 1, 2, 3
        .word 4, 5, 6
        .word 7, 8, 9
@ Second matrix
B:      .word 9, 8, 7
        .word 6, 5, 4
        .word 3, 2, 1
@ Result matrix
C:      .fill 9, 4, 0

prtstr: .asciz "%3d %3d %3d\n"

Listing 10-4. 3x3 matrix multiplication in Assembly

Compiling and running this program, we get

pi@raspberrypi:~/asm/Chapter 10 $ make
gcc -o matrixmult matrixmult.s
pi@raspberrypi:~/asm/Chapter 10 $ ./matrixmult
 30  24  18
 84  69  54
138 114  90
pi@raspberrypi:~/asm/Chapter 10 $

Accessing Matrix Elements

We store the three matrices in memory, in row order. They are arranged in the .word directives so that you can see the matrix structure. In the pseudo-code, we refer to the matrix elements using 2D arrays. There are no instructions or operand formats to specify 2D array access, so we must do it ourselves. To Assembly, each array is just a nine-word sequence of memory. Now that we know how to multiply, we can do something like

A[i, j] = A[i*N + j]

where N is the dimension of the array. We don’t do this though; in Assembly it pays to notice that we access the array elements in order and can go from one element in a row to the next by adding the size of an element (the size of a word, or four). We can go from an element in a column to the next one by adding the size of a row. This is why the constant N * WDSIZE appears so often in the code. This way, we go through the array incrementally and never have to multiply array indexes. Generally, multiplication and division are expensive operations, and we should try to avoid them as much as possible.
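The difference between the two access styles can be sketched in Python, using a flat list in place of the nine-word block of memory. The names `element` and `column_by_stepping` are hypothetical, and positions are counted in elements rather than bytes, so the row stride is N instead of N * WDSIZE.

```python
N = 3

# Matrix A as it sits in memory: nine values in row order.
A_FLAT = [1, 2, 3,
          4, 5, 6,
          7, 8, 9]


def element(flat, i, j, n=N):
    """Indexed access: needs a multiplication for every element."""
    return flat[i * n + j]


def column_by_stepping(flat, col, n=N):
    """Pointer-style access: start at the top of the column and add the
    row stride each time, with no multiplication at all."""
    values = []
    pos = col
    for _ in range(n):
        values.append(flat[pos])
        pos += n  # step down one row
    return values


print(column_by_stepping(A_FLAT, 1))               # [2, 5, 8]
print([element(A_FLAT, i, 1) for i in range(N)])   # [2, 5, 8]
```

Both calls visit the same cells; the stepping version only ever adds a constant to the position.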

We can use post-indexing to access an element and increment the pointer to the next element in a single instruction. We use post-indexing to store the result of each computation in the array C. We see this in

STR R7, [R11], #4

which stores our computed dot product into C, then increments the pointer into C by 4 bytes. We see it again when we print the C matrix at the end.

Multiply with Accumulate

The core of the algorithm relies on the SMLAL instruction to multiply an element of A by an element of B and add that to the running sum for the dot product.

SMLAL R7, R8, R9, R10

This instruction accumulates a 64-bit sum, but we only take the lower 32 bits in R7. We don’t check for overflow; if at the end R8 isn’t 0, we are going to give an incorrect result.
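If we did want to check, one possible test is sketched below in Python. It assumes the result is meant as a signed 32-bit value (it is printed with %d), in which case the high word must be the sign extension of the low word; the function name `fits_in_signed_32` is made up for this example.

```python
MASK32 = 0xFFFFFFFF


def fits_in_signed_32(lo, hi):
    """True if the 64-bit value hi:lo is representable as a signed 32-bit
    integer, i.e. hi is all copies of lo's sign bit."""
    expected_hi = MASK32 if lo & 0x80000000 else 0
    return (hi & MASK32) == expected_hi


print(fits_in_signed_32(30, 0))                   # True
print(fits_in_signed_32(0xFFFFFFFC, 1))           # False: needs 34 bits
print(fits_in_signed_32(0xFFFFFFFD, 0xFFFFFFFF))  # True: this is -3
print(fits_in_signed_32(0x80000000, 0))           # False: +2**31 is too big
```

For the small positive inputs in this example the simpler test of the high word being 0 gives the same answer.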

Register Usage

We use nearly all the registers. We are lucky that we can keep track of all our loop indexes and pointers in registers and don’t have to move them in and out of memory. If we couldn’t, we would have had to allocate space on the stack to hold the extra variables.

Example 2

When we discussed the multiply with accumulate instructions, we mentioned the dual instructions that will do two steps in one instruction. The main problem is packing two numbers that need processing in each 32-bit register. We can create 16-bit integers easily enough using the .short Assembler directive. Processing the rows is easy since the cells are next to each other, but for the columns, each element is a row away. How can we easily load two column elements into one 32-bit register?

What we can do is take the transpose of the second matrix. This means making the rows columns and the columns rows, basically switching B[i, j] with B[j, i]. If we do that, then the column elements are next to each other and easy to load into a single 32-bit register.

Listing 10-5 is the code to do this.

@ Multiply 2 3x3 integer matrices
@ Uses a dual multiply/accumulate instruction
@ so processes two elements in the dot product
@ per loop.
@ Registers:
@ R0 - Dot product loop counter
@ R1 - Row index
@ R2 - Column index
@ R4 - Address of row
@ R5 - Address of column
@ R7 - 32 bit accumulated sum
@ R8 - Zeroed but not used by SMLAD
@ R9 - Two cells of A
@ R10 - Two cells of B
@ R11 - Position in C
@ R12 - row in dotloop
@ R6 - col in dotloop

.global main @ Provide program starting address to linker

        .equ N, 3 @ Matrix dimensions
        .equ ELSIZE, 2 @ Size of element

main:
        push {R4-R12, LR} @ Save required regs
        MOV R1, #N @ Row index
        LDR R4, =A @ Address of current row
        LDR R11, =C @ Address of results matrix
rowloop:
        LDR R5, =B @ first column in B
        MOV R2, #N @ Column index (will count down to 0)
colloop:
        @ Zero accumulator registers
        MOV R7, #0
        MOV R8, #0
        MOV R0, #((N+1)/2) @ dot product loop counter
        MOV R12, R4 @ row for dot product
        MOV R6, R5 @ column for dot product
dotloop:
        @ Do dot product of a row of A with column of B
        LDR R9, [R12], #(ELSIZE*2) @ load two cells of A's row and incr
        LDR R10, [R6], #(ELSIZE*2) @ load two cells of B's column and incr
        SMLAD R7, R9, R10, R7 @ Do dual multiply and accumulate
        SUBS R0, #1 @ Dec loop counter
        BNE dotloop @ If not zero loop
        STR R7, [R11], #4 @ C[row, col] = dotprod
        ADD R5, #((N+1)*ELSIZE) @ Increment current col
        SUBS R2, #1 @ Dec col loop counter
        BNE colloop @ If not zero loop
        ADD R4, #((N+1)*ELSIZE) @ Increment to next row
        SUBS R1, #1 @ Dec row loop counter
        BNE rowloop @ If not zero loop

        @ Print out matrix C
        @ Loop through 3 rows printing 3 cols each time.
        MOV R5, #3 @ Print 3 rows
        LDR R11, =C @ Addr of results matrix
printloop:
        LDR R0, =prtstr @ printf format string
        LDR R1, [R11], #4 @ first element in current row
        LDR R2, [R11], #4 @ second element in current row
        LDR R3, [R11], #4 @ third element in current row
        BL printf @ Call printf
        SUBS R5, #1 @ Dec loop counter
        BNE printloop @ If not zero loop
        mov r0, #0 @ return code
        pop {R4-R12, PC} @ Restore regs and return

.data
@ First matrix (each row padded to an even length)
A:      .short 1, 2, 3, 0
        .short 4, 5, 6, 0
        .short 7, 8, 9, 0
@ Second matrix (stored transposed, each row padded)
B:      .short 9, 6, 3, 0
        .short 8, 5, 2, 0
        .short 7, 4, 1, 0
@ Result matrix
C:      .fill 9, 4, 0

prtstr: .asciz "%3d %3d %3d\n"

Listing 10-5. 3x3 matrix multiplication using a dual multiply/accumulate

The saving in instructions is in reducing the inner loop that computes the dot product.

MOV R0, #((N+1)/2) @ dot product loop counter

If our matrix had an even dimension, we would have saved more. For our 3x3 example, the dot product loop still runs twice. But if we were multiplying two 4x4 matrices, it would also take only two passes through this loop. Notice that we had to add a 0 to the end of each row of both matrices, since the dual instruction always processes an even number of entries.

The real workhorse of this program is

SMLAD R7, R9, R10, R7

which multiplies the high part of R9 by the high part of R10 and at the same time the low part of R9 by the low part of R10, then adds both to R7 and puts the new sum into R7. Notice it’s okay to have Rd=Ra, which is what you mostly want.

We still use LDR to load the registers from the matrices. This will load 32 bits; since we specified each element to take 16 bits, it will load two at a time enhancing our performance.
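The data layout and inner loop of this example can be modeled end to end. The following Python sketch assumes little-endian loads (so the first 16-bit cell lands in the low half of the register) and uses hypothetical helper names throughout; it transposes B, pads rows to an even length, packs pairs of cells into 32-bit words, and then runs the dual multiply and accumulate.

```python
MASK32 = 0xFFFFFFFF


def transpose(m):
    """Swap rows and columns, so m[i][j] becomes m[j][i]."""
    return [list(col) for col in zip(*m)]


def pad_even(row):
    """Append a 0 if the row has an odd number of cells."""
    return row + [0] * (len(row) % 2)


def pack_pairs(row):
    """Model a 32-bit LDR over 16-bit cells: first cell in the low half."""
    return [((row[i + 1] & 0xFFFF) << 16) | (row[i] & 0xFFFF)
            for i in range(0, len(row), 2)]


def signed16(value):
    """Interpret the low 16 bits of value as two's complement."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def smlad(rn, rm, ra):
    """Model of SMLAD Rd, Rn, Rm, Ra: Ra + low*low + high*high."""
    bottom = signed16(rn) * signed16(rm)
    top = signed16(rn >> 16) * signed16(rm >> 16)
    return (ra + bottom + top) & MASK32


def dual_matmul(a, b):
    """Matrix product using two cells per multiply-accumulate step."""
    a_words = [pack_pairs(pad_even(row)) for row in a]
    b_words = [pack_pairs(pad_even(col)) for col in transpose(b)]
    result = []
    for row in a_words:
        out = []
        for col in b_words:
            acc = 0
            for word_a, word_b in zip(row, col):
                acc = smlad(word_a, word_b, acc)
            out.append(acc)
        result.append(out)
    return result


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(pad_even(transpose(B)[0]))  # [9, 6, 3, 0]
print(dual_matmul(A, B))
```

The padded, transposed rows match the .short data in the listing, and the product is the same matrix as in Example 1: 30 24 18, then 84 69 54, then 138 114 90.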

Summary

We covered the various forms of the multiply instruction supported in the ARM 32-bit instruction set. We covered the division instructions included in newer versions of the ARM processors, like those in the Raspberry Pi 2, 3, and 4. For older processors we can use the FPU, write our own routine, or call some C code.

We then covered the concept of multiply and accumulate and why these instructions are so important to modern applications in graphics and machine learning. We reviewed the many variations of these instructions and then presented two versions of matrix multiplication to show them in action.

In Chapter 11, “Floating-Point Operations,” we will look at more math, but this time in scientific notation allowing fractions and exponents, going beyond integers for the first time.
