© Stephen Smith 2020
S. Smith, Programming with 64-Bit ARM Assembly Language, https://doi.org/10.1007/978-1-4842-5881-1_13

13. Neon Coprocessor

Stephen Smith1 
(1)
Gibsons, BC, Canada
 

In this chapter, we will perform true parallel computing. The Neon Coprocessor shares a lot of functionality with the FPU from Chapter 12, “Floating-Point Operations,” but can perform several operations at once. For example, you can perform four 32-bit floating-point operations with a single instruction. The type of parallel processing performed by the Neon Coprocessor is single instruction, multiple data (SIMD). In SIMD processing, each instruction issued executes on multiple data items in parallel.

We’ll examine how to arrange data so we can operate on it in parallel and study the instructions that do so. We’ll then update our vector distance and 3x3 matrix multiplication programs to use the Neon processor to see how much of the work we can do in parallel.

The Neon Coprocessor shares the same register file we examined in Chapter 12, “Floating-Point Operations,” except that it can operate on all 128 bits of each register. We’ll learn how the bank of coprocessor registers is intended to be used with Neon. Let’s look in more detail at the NEON registers.

About the NEON Registers

The NEON Coprocessor can operate on the 64-bit registers that we studied in the previous chapter and a set of 128-bit registers that are new for this chapter. Having 128-bit registers doesn’t mean the NEON processor performs 128-bit arithmetic. Rather, the Neon Coprocessor treats each large register as holding multiple smaller values at once. For instance, one 128-bit register can fit four 32-bit single-precision floating-point numbers. If we multiply two such registers, all four pairs of 32-bit numbers are multiplied at the same time, resulting in another 128-bit register containing the four results.

The Neon Coprocessor operates on both integers and floating-point numbers. The greatest parallelism is obtained using 8-bit integers where 16 operations can happen at once.

The Neon coprocessor can operate on 64-bit D or 128-bit V registers; of course, if you use 64-bit D registers, you only have half the amount of parallelism. In all instructions, we refer to the V register, but the number of elements multiplied by the size of the element must always be either 64 bits or 128 bits.

Table 13-1 shows the number of elements that fit in each register type. Next, we’ll see how we perform arithmetic on these elements.
Table 13-1

Number of elements in each register type by size

              8-Bit Elements   16-Bit Elements   32-Bit Elements
  64 bits     8                4                 2
  128 bits    16               8                 4
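As a quick sanity check, the lane counts in Table 13-1 are just the register width divided by the element size. A small Python sketch (illustrative only, not NEON code; the lanes helper is hypothetical):

```python
# Number of lanes = register width / element width, in bits.
def lanes(register_bits, element_bits):
    assert register_bits in (64, 128)
    return register_bits // element_bits

# Reproduce Table 13-1: rows are 64- and 128-bit registers,
# columns are 8-, 16-, and 32-bit elements.
for reg in (64, 128):
    print(reg, [lanes(reg, e) for e in (8, 16, 32)])
```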

Stay in Your Lane

The NEON coprocessor uses the concept of lanes for all of its computations. When you choose your data type, the processor considers the register divided into the number of lanes—one lane for each data element. If we work on 32-bit integers and use a 128-bit V register, then the register is considered divided into four lanes, one for each integer. We designate the lane configuration by specifying the number of lanes and the size of the data contained there. Even though these lane designators appear to match floating-point registers, they only specify the size. The data could be either integer or floating point. The size multiplied by the number of lanes must be either 64 or 128 bits. Table 13-2 shows the lane designators we use and their sizes.
Table 13-2

Designator and size for lanes

  Designator   Size
  D            64 bits
  S            32 bits
  H            16 bits
  B            8 bits

Figure 13-1 shows how register V1 can be divided into lanes of various sizes and how we specify them as arguments to instructions.
Figure 13-1

How register V1 can be divided into lanes. These lanes just specify the size and number of lanes, not the data type contained in them

Figure 13-2 shows how the V registers are divided into four lanes, one for each 32-bit integer, and then how the arithmetic operation is applied to each lane independently. This way we accomplish four additions in one instruction, and the NEON coprocessor performs them all at the same time—in parallel.
Figure 13-2

Example of the four lanes involved in doing 32-bit integer addition
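The lane-by-lane behavior of Figure 13-2 can be modeled in a few lines of Python (a sketch of the semantics, not NEON code; add_4s is a hypothetical helper). Note that each lane wraps independently, with no carry spilling into the neighboring lane:

```python
# Simulate a 4-lane 32-bit integer ADD V0.4S, V1.4S, V2.4S.
# Each lane adds independently, wrapping modulo 2**32.
MASK32 = 0xFFFFFFFF

def add_4s(v1, v2):
    return [(a + b) & MASK32 for a, b in zip(v1, v2)]

v1 = [10, 20, 0xFFFFFFFF, 40]
v2 = [1, 2, 1, 4]
print(add_4s(v1, v2))  # lane 2 wraps around to 0; no carry into lane 3
```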

Performing Arithmetic Operations

There are two forms of the add instruction, one for integer addition and one for floating-point addition:
  • ADD Vd.T, Vn.T, Vm.T    // Integer addition

  • FADD Vd.T, Vn.T, Vm.T    // floating-point addition

T must be
  • For ADD: 8B, 16B, 4H, 8H, 2S, 4S or 2D

  • For FADD: 4H, 8H, 2S, 4S or 2D

Note

We use the same instructions as we used for scalar integer and floating-point arithmetic. The Assembler knows to create code for the NEON Coprocessor due to the use of V registers and the inclusion of the T specifier.

The trick to using NEON is arranging your code so that all the lanes keep doing useful work.

Since the NEON Processor supports integer operations, it supports all the logical operations like AND, BIC, and ORR. There is also a selection of comparison operations.

A look at the list of NEON instructions shows a lot of specialty instructions provided to help with specific algorithms. For example, there’s direct support for polynomials over the binary ring to support certain classes of cryptographic algorithms.
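As an aside, those binary polynomials are multiplied carry-lessly, with XOR taking the place of addition. A minimal Python model of an 8-bit carry-less multiply (illustrative only; pmul8 is a hypothetical helper, not a NEON instruction):

```python
# Carry-less (polynomial) multiply of two 8-bit values over GF(2):
# XOR replaces addition, so no carries propagate between bits.
def pmul8(a, b):
    result = 0
    for i in range(8):
        if (b >> i) & 1:       # if bit i of b is set,
            result ^= a << i   # XOR in a shifted copy of a
    return result

# (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 0b11 * 0b11 = 0b101
print(pmul8(0b11, 0b11))  # 5, not 9 as in ordinary arithmetic
```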

We will show you how to use a few of the instructions in working examples. This will give you enough knowledge to apply the general principles of operations for the NEON Coprocessor; then you can peruse all the instructions in the ARM Instruction Set Reference Guide.

Calculating 4D Vector Distance

Let’s expand the distance calculation example from Chapter 12, “Floating-Point Operations,” to calculate the distance between two four-dimensional (4D) vectors. The formula generalizes to any number of dimensions by adding the extra squares of the differences for the additional dimensions under the square root.
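As a reference point for the assembly version, here is the same formula in Python (a sketch; distance4 is a hypothetical helper):

```python
import math

# 4D Euclidean distance:
# sqrt((x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 + (x4-y4)^2)
def distance4(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# First pair of test points used by the main program.
print(distance4((0.0, 0.0, 0.0, 0.0), (17.0, 4.0, 2.0, 1.0)))
```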

First, distance.s is shown in Listing 13-1, using the NEON Coprocessor.
//
// Example function to calculate the distance
// between two 4D points in single precision
// floating-point using the NEON Processor
//
// Inputs:
//    X0 - pointer to the 8 FP numbers
//           they are (x1, x2, x3, x4),
//                   (y1, y2, y3, y4)
// Outputs:
//    W0 - the length (as single precision FP)
.global distance // Allow function to be called by others
//
distance:
      // load all 8 numbers at once
      LDP   Q2, Q3, [X0]
      // calc V1 = V2 - V3
      FSUB  V1.4S, V2.4S, V3.4S
      // calc V1 = V1 * V1 = (xi-yi)^2
      FMUL  V1.4S, V1.4S, V1.4S
      // calc S0 = sum of the 4 lanes of V1
      FADDP V0.4S, V1.4S, V1.4S
      FADDP V0.4S, V0.4S, V0.4S
      // calc sqrt(S0)
      FSQRT S4, S0
      // move result to W0 to be returned
      FMOV  W0, S4
      RET
Listing 13-1

Routine to calculate the distance between two 4D vectors using the NEON Coprocessor.

Next, main.s is shown in Listing 13-2, to test the routine.
//
// Main program to test our distance function
//
// W19 - loop counter
// X20 - address to current set of points
.global main // Provide program starting address to linker
//
      .equ   N, 3   // Number of points.
main:
      STP    X19, X20, [SP, #-16]!
      STR    LR, [SP, #-16]!
      LDR    X20, =points // pointer to current points
      MOV    W19, #N      // number of loop iterations
loop:    MOV    X0, X20   // move pointer to parameter 1 (r0)
      BL     distance     // call distance function
// need to take the single precision return value
// and convert it to a double, because the C printf
// function can only print doubles.
      FMOV   S2, W0      // move back to fpu for conversion
      FCVT   D0, S2      // convert single to double
      FMOV   X1, D0      // return double to r2, r3
      LDR    X0, =prtstr // load print string
      BL     printf      // print the distance
      ADD    X20, X20, #(8*4) // 8 elements each 4 bytes
      SUBS   W19, W19, #1 // decrement loop counter
      B.NE   loop         // loop if more points
      MOV    X0, #0       // return code
      LDR    LR, [SP], #16
      LDP    X19, X20, [SP], #16
      RET
.data
points: .single 0.0, 0.0, 0.0, 0.0, 17.0, 4.0, 2.0, 1.0
        .single 1.3, 5.4, 3.1, -1.5, -2.4, 0.323, 3.4, -0.232
        .single 1.323e10, -1.2e-4, 34.55, 5454.234, 10.9, -3.6, 4.2, 1.3
prtstr: .asciz "Distance = %f\n"
Listing 13-2

The main program to test the 4D distance function.

The makefile is in Listing 13-3.
distance: distance.s main.s
       gcc -g -o distance distance.s main.s
Listing 13-3

The makefile for the distance program

If we build and run the program, we see
smist08@kali:~/asm64/Chapter 13$ make
gcc -g -o distance distance.s main.s
smist08@kali:~/asm64/Chapter 13$ ./distance
Distance = 17.606817
Distance = 6.415898
Distance = 13230000128.000000
smist08@kali:~/asm64/Chapter 13$
Here is how the code works:
  1. We load one vector into V2 and the other into V3. Each vector consists of four 32-bit floating-point numbers, so each one fits in a 128-bit V register treated as four lanes.

  2. Subtract all four components at once using a single FSUB instruction, then calculate the squares all at once using an FMUL instruction. Both instructions operate on all four lanes in parallel.

  3. Add up the squares, which are all in V1. This means all the numbers are in different lanes and we can’t add them in parallel. This is a common situation to get into; fortunately, the NEON instruction set does give us some help. It won’t add up all the lanes in a register, but it will do pairwise additions in parallel. The following instruction

     FADDP V0.4S, V1.4S, V1.4S

     will pairwise add each pair of 32-bit floating-point numbers in the two arguments, putting all the sums in V0. Since the results have half the number of elements as the arguments, the four pairwise sums fill the destination register. We only need the first two sums, so we ignore the results from the second operand. This accomplishes two of the additions we need.

  4. Perform the third addition using another FADDP instruction. This leaves the result we want in lane 0, which happens to overlap the regular floating-point register S0.

  5. Once the numbers are added, use the FPU’s square root instruction to calculate the final distance.
Figure 13-3 shows how these operations flow through the lanes in our registers and how we build up our result with each step.
Figure 13-3

Flow of the calculations through the registers showing the lanes. The last two lines aren’t to scale and only show a single lane
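The pairwise reduction in steps 3 and 4 can also be modeled in Python (a sketch of FADDP’s semantics, not NEON code; faddp is a hypothetical helper):

```python
# FADDP concatenates its two source vectors, then adds adjacent pairs.
def faddp(vn, vm):
    src = list(vn) + list(vm)
    return [src[i] + src[i + 1] for i in range(0, len(src), 2)]

v1 = [1.0, 2.0, 3.0, 4.0]   # stand-ins for the four squared differences
v0 = faddp(v1, v1)          # [3.0, 7.0, 3.0, 7.0]
v0 = faddp(v0, v0)          # lane 0 now holds 1+2+3+4 = 10.0
print(v0[0])
```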

This shows a nice feature of having the NEON and FPU sharing registers, allowing intermixing of FPU and NEON instructions without needing to move data around.

The only changes to the main program are making the vectors 4D and adjusting the loop to use the new vector size.

Optimizing 3x3 Matrix Multiplication

Let’s optimize the 3x3 matrix multiplication example program from Chapter 11, “Multiply, Divide, and Accumulate,” by using the parallel processing abilities of the NEON Coprocessor.

The NEON Coprocessor has a dot product function SDOT, but sadly it only operates on integers and isn’t available on all processors. Hence, we won’t use it. As we saw in the last example, adding within one register is a problem, and similarly there are problems with carrying out multiply with accumulates.

The recommended solution is to reverse two of our loops from the previous program. This way we do the multiply with accumulates as separate instructions, but we do it on three vectors at a time. The result is we eliminate one of our loops from the previous program and achieve some level of parallel operation.

The trick is to notice that one 3x3 matrix multiplication is really three matrix-by-vector calculations, namely:
  • Ccol1 = A ∗ Bcol1

  • Ccol2 = A ∗ Bcol2

  • Ccol3 = A ∗ Bcol3

If we look at one of these matrices times a vector, for example:

    | a  b  c |   | x |
    | d  e  f | * | y |
    | g  h  i |   | z |

we see the calculation is

    | a*x + b*y + c*z |
    | d*x + e*y + f*z |
    | g*x + h*y + i*z |

which we can regroup as x*(a, d, g) + y*(b, e, h) + z*(c, f, i).
If we put a, d, and g in a register in separate lanes; b, e, and h in another register; and c, f, and i in a third register in the matching lanes, we can calculate a column in the result matrix, as shown in Figure 13-4.
Figure 13-4

Showing how the calculations flow through the lanes

This is the recommended algorithm for matrix multiplication on the NEON coprocessor. We will use short integers to demonstrate integer arithmetic this time. Since four 16-bit short integers fit into 64 bits and we only need three, we will use this lane configuration.

What we did above computes one column of the result matrix; we then need to do this for all the columns. We will place this logic in a macro to repeat the calculation three times. Since the goal is the fastest matrix multiplication possible, it is worth removing the loops, since that saves the loop-control logic. It also makes the program look much simpler.
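The column-combination scheme can be sketched in Python before committing it to assembly (illustrative only; mat_mul_columns is a hypothetical helper, and matrices are stored as lists of columns, matching the listing’s column major layout):

```python
# Column j of C = A_col0*B[0][j] + A_col1*B[1][j] + A_col2*B[2][j],
# mirroring the MUL/MLA sequence in the NEON macro.
def mat_mul_columns(a_cols, b_cols):
    c_cols = []
    for bcol in b_cols:
        col = [0, 0, 0]
        for acol, scalar in zip(a_cols, bcol):
            # scale one column of A and accumulate (the MLA step)
            col = [c + x * scalar for c, x in zip(col, acol)]
        c_cols.append(col)
    return c_cols

# A and B from the listing, as lists of columns (padding omitted).
A = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
B = [[9, 6, 3], [8, 5, 2], [7, 4, 1]]
print(mat_mul_columns(A, B))
```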

Listing 13-4 is the code for our NEON-enabled matrix multiplication.
//
// Multiply 2 3x3 integer matrices
// Uses the NEON Coprocessor to do
// some operations in parallel.
//
// Registers:
//    D0 - first column of matrix A
//    D1 - second column of matrix A
//    D2 - third column of matrix A
//    D3 - first column of matrix B
//    D4 - second column of matrix B
//    D5 - third column of matrix B
//    D6 - first column of matrix C
//    D7 - second column of matrix C
//    D8 - third column of matrix C
.global main // Provide program starting address to linker
main:
      STP    X19, X20, [SP, #-16]!
      STR    LR, [SP, #-16]!
// load matrix A into Neon registers D0, D1, D2
      LDR    X0, =A        // Address of A
      LDP    D0, D1, [X0], #16
      LDR    D2, [X0]
// load matrix B into Neon registers D3, D4, D5
      LDR    X0, =B        // Address of B
      LDP    D3, D4, [X0], #16
      LDR    D5, [X0]
.macro mulcol ccol bcol
      MUL    \ccol().4H, V0.4H, \bcol().4H[0]
      MLA    \ccol().4H, V1.4H, \bcol().4H[1]
      MLA    \ccol().4H, V2.4H, \bcol().4H[2]
.endm
      mulcol V6, V3        // process first column
      mulcol V7, V4        // process second column
      mulcol V8, V5        // process third column
      LDR    X1, =C        // Address of C
      STP    D6, D7, [X1], #16
      STR    D8, [X1]
// Print out matrix C
// Loop through 3 rows printing 3 cols each time.
      MOV    W19, #3              // Print 3 rows
      LDR    X20, =C              // Addr of results matrix
printloop:
      LDR    X0, =prtstr    // printf format string
// print transpose so matrix is in usual row-column order.
// first LDRH post-indexes by 2 for the next row,
// so the second LDRH adds 6 and is ahead by 2+6=8 = row size;
// similarly the third LDRH is ahead by 2+14=16 = 2 x row size
      LDRH   W1, [X20], #2  // first element in current row
      LDRH   W2, [X20,#6]   // second element in current row
      LDRH   W3, [X20,#14]  // third element in current row
      BL     printf         // Call printf
      SUBS   W19, W19, #1   // Dec loop counter
      B.NE   printloop      // If not zero loop
      MOV    X0, #0         // return code
      LDR    LR, [SP], #16
      LDP    X19, X20, [SP], #16
      RET
.data
// First matrix in column major order
A:    .short 1, 4, 7, 0
      .short 2, 5, 8, 0
      .short 3, 6, 9, 0
// Second matrix in column major order
B:    .short 9, 6, 3, 0
      .short 8, 5, 2, 0
      .short 7, 4, 1, 0
// Result matrix in column major order
C:    .fill  12, 2, 0
prtstr: .asciz  "%3d  %3d  %3d\n"
Listing 13-4

Neon-enabled 3x3 matrix multiplication example

We store both matrices in column major order and the C matrix is produced in column major order. This is to make setting up the calculations easier, since everything is aligned properly to bulk load into our NEON registers. We changed the print loop, so that it prints out the results matrix in our usual row order form, basically doing a matrix transpose as it loops through the C matrix.

In the macro, we do the scalar multiplication:
MUL \ccol().4H, V0.4H, \bcol().4H[0]
which translates to something like the following:
MUL V6.4H, V0.4H, V3.4H[0]

This is multiplying each lane in V0 by the scalar contained in a specific lane of V3. This shows how we typically access a value in a specific lane by appending [lane number] to the end of the register specifier—counting lanes from zero.
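The by-element multiply can be modeled in Python (a sketch of the semantics for 16-bit lanes, not NEON code; mul_by_lane is a hypothetical helper):

```python
# Model MUL V6.4H, V0.4H, V3.4H[0]: every lane of V0 is multiplied by
# the single scalar sitting in the given lane of V3 (16-bit lanes,
# wrapping modulo 2**16).
MASK16 = 0xFFFF

def mul_by_lane(vn, vm, lane):
    scalar = vm[lane]
    return [(x * scalar) & MASK16 for x in vn]

# First column of A times B's element [0][0], as in the macro's MUL step.
print(mul_by_lane([1, 4, 7, 0], [9, 6, 3, 0], 0))  # [9, 36, 63, 0]
```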

Note

We added () after the parameter name, since otherwise the .4H would be treated as part of the parameter name and the parameter wouldn’t expand correctly. The () is just a null expression that separates the macro parameter name from the characters that follow. Note that macro parameters are referenced with a leading backslash, as in \ccol.

Summary

This chapter is a quick overview of how the NEON Coprocessor works and how to write programs for it. We explained how NEON uses lanes to perform parallel computations and a selection of the instructions available for computations. We gave two examples, one to calculate the distance between two 4D vectors and one to perform 3x3 matrix multiplication to demonstrate how you can easily harness the power of the NEON Coprocessor.

In Chapter 14, “Optimizing Code,” we’ll look at specialized instructions to optimize conditional logic and show how to optimize our upper-case routine.

Exercises

  1. Compute the absolute value of a 4D vector. A 4D vector v, given by (a, b, c, d), has an absolute value of sqrt(a² + b² + c² + d²).

  2. The length of a vector is its distance from the origin, the vector of all zeros. A normalized vector is a vector with length 1. Normalize a vector by dividing each of its components by its length. Modify the distance program to compute the normalized form of a vector.

  3. Write a routine to calculate the dot product of two 4D vectors.

  4. Alter the 3x3 matrix program to multiply 4x4 matrices. Make sure you verify your result is correct.