D. Kusswurm, Modern Arm Assembly Language Programming, https://doi.org/10.1007/978-1-4842-6267-2_4

4. Armv8-32 Core Programming – Part 3

Daniel Kusswurm, Geneva, IL, USA

The content of this and the previous two chapters can be regarded as a trilogy of Armv8-32 assembly language fundamentals. In Chapters 2 and 3, you learned how to perform integer arithmetic, carry out data load and store operations, manipulate the stack, and programmatically exploit the NZCV condition flags. You also acquired useful knowledge about the GNU C++ calling convention and the GNU assembler.

The chapter you are about to read imparts additional Armv8-32 assembly language programming concepts that complete the trilogy. It begins with a section that elucidates array use in an assembly language function. This is followed by a section that covers matrices and the programming techniques necessary to properly access the elements of a matrix. The final section of Chapter 4 explicates additional load and store instructions. It also explains how to reference the members of a C++ structure in an assembly language function.

Integer Arrays

Arrays are an indispensable data construct in virtually all programming languages. In C++, there is an inherent connection between arrays and pointers since the name of an array is essentially a pointer to its first element. Moreover, whenever an array variable name is used as a C++ function parameter, a pointer is passed instead of duplicating the array on the stack. In C++, one-dimensional arrays are stored in a contiguous block of memory that can be statically allocated at compile time or dynamically allocated during program execution. The elements of a C++ array are accessed using zero-based indexing, which means that valid indices for an array of size N range from 0 to N - 1.
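The relationship between an array name and a pointer to its first element can be checked directly in C++. The following is a minimal sketch; the helper names SumViaPointer and NameIsPointerToFirst are hypothetical and do not appear in the book's source code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: receives only a pointer, not a copy of the array.
int SumViaPointer(const int* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    int sum = 0;
    for (std::size_t i = 0; i < n; i++)   // valid indices are 0 .. n - 1
        sum += p[i];                      // p[i] is the same as *(p + i)
    return sum;
}

// Hypothetical helper: true if the array name converts to &a[0].
bool NameIsPointerToFirst()
{
    int a[4] {10, 20, 30, 40};
    const int* p = a;                     // array-to-pointer conversion
    return p == &a[0] && *(p + 3) == a[3];
}
```

Because only an address and an element count cross the function boundary, an assembly language function can receive the same two values in registers and walk the array itself.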

The source code examples in this section present assembly language code that processes arrays. The first example explains how to perform simple arithmetic using the elements of an integer array. The second example demonstrates arithmetic using elements from multiple arrays.

Array Arithmetic

Listing 4-1 shows the source code for example Ch04_01. This example illustrates how to access the elements of an integer array. It also explains additional forms of the ldr instruction.

//-------------------------------------------------

// Ch04_01.cpp

//-------------------------------------------------

#include <iostream>

#include <iomanip>

#include <cstdint>

using namespace std;

extern "C" int CalcSumA_(const int* x, int n);

extern "C" uint64_t CalcSumB_(const uint32_t* x, uint32_t n);

int CalcSumA(const int* x, int n)

{

int sum = 0;

for (int i = 0; i < n; i++)

sum += *x++;

return sum;

}

uint64_t CalcSumB(const uint32_t* y, uint32_t n)

{

uint64_t sum = 0;

for (uint32_t i = 0; i < n; i++)

sum += y[i];

return sum;

}

int main()

{

const char nl = '\n';

int x[] {3, 17, -13, 25, -2, 9, -6, 12, 88, -19};

int nx = sizeof(x) / sizeof(int);

uint32_t y[] = {0x10000000, 0x20000000, 0x40000000, 0x80000000,

0x50000000, 0x70000000, 0x90000000, 0xC0000000};

uint32_t ny = sizeof(y) / sizeof(uint32_t);

// Calculate sum of elements in array x

cout << "Results for CalcSumA" << nl;

for (int i = 0; i < nx; i++)

cout << "x[" << i << "] = " << x[i] << nl;

int sum_x1 = CalcSumA(x, nx);

int sum_x2 = CalcSumA_(x, nx);

cout << "sum_x1 = " << sum_x1 << nl;

cout << "sum_x2 = " << sum_x2 << nl << nl;

// Calculate sum of elements in array y

cout << "Results for CalcSumB" << nl;

for (uint32_t i = 0; i < ny; i++)

cout << "y[" << i << "] = " << y[i] << nl;

uint64_t sum_y1 = CalcSumB(y, ny);

uint64_t sum_y2 = CalcSumB_(y, ny);

cout << "sum_y1 = " << sum_y1 << nl;

cout << "sum_y2 = " << sum_y2 << nl << nl;

return 0 ;

}

//-------------------------------------------------

// Ch04_01_.s

//-------------------------------------------------

// extern "C" int CalcSumA_(const int* x, int n);

.text

.global CalcSumA_

CalcSumA_:

mov r2,#0 // sum = 0

cmp r1,#0 // is n <= 0?

ble DoneA // jump if n <= 0

LoopA: ldr r3,[r0],#4 // r3 = *r0; r0 += 4

add r2,r2,r3 // add current x to sum

subs r1,r1,#1 // n -= 1

bne LoopA // jump if more data

DoneA: mov r0,r2 // r0 = final sum

bx lr

// extern "C" uint64_t CalcSumB_(const uint32_t* x, uint32_t n);

.global CalcSumB_

CalcSumB_:

push {r4,r5}

mov r2,#0

mov r3,#0 // sum (r2:r3) = 0

cmp r1,#0 // is n == 0?

beq DoneB // jump if n == 0

mov r4,#0 // i = 0

LoopB: ldr r5,[r0,r4,lsl #2] // r5 = x[i]

adds r2,r2,r5

adc r3,r3,#0 // sum += x[i]

add r4,#1 // i += 1

cmp r4,r1 // is i == n?

bne LoopB // jump if more data

DoneB: mov r0,r2

mov r1,r3 // r0:r1 = final 64-bit sum

pop {r4,r5}

bx lr

Listing 4-1. Example Ch04_01

Near the top of the C++ code are the now-familiar declarations for the assembly language functions CalcSumA_ and CalcSumB_. Both functions sum the elements of an array. Note that the declaration of CalcSumB_ uses the fixed-sized unsigned integer types uint64_t and uint32_t that are declared in the header file <cstdint> instead of the normal unsigned long long and unsigned int. Some assembly language programmers (including me) prefer to use fixed-sized integer types for assembly language function declarations since it accentuates the exact size of the argument.

The function CalcSumA_ begins with a mov r2,#0 instruction that initializes sum to zero. The cmp r1,#0 and ble DoneA instructions prevent execution of for-loop LoopA if n <= 0 is true. Sweeping through the array to sum the elements requires only four instructions. The ldr r3,[r0],#4 instruction loads the array element pointed to by R0 into R3. It then adds 4 to R0, which points it to the next array element. This is an example of post-indexed addressing (see Table 1-5). The next instruction, add r2,r2,r3, adds the current array element to sum in R2. The subs r1,r1,#1 instruction subtracts one from n and also sets the NZCV condition flags, which allows the ensuing bne LoopA instruction to terminate LoopA when n equals zero.

The function CalcSumB_ sums the elements of a uint32_t array and returns a result of type uint64_t. This function starts by setting registers R2 and R3 to zero. Function CalcSumB_ uses this register pair to hold an intermediate 64-bit sum. The number of array elements n is then tested to make sure it is not equal to zero. The mov r4,#0 instruction then sets the array index variable i to zero.

CalcSumB_ uses a different technique than CalcSumA_ to sum the elements of the target array. The first instruction of for-loop LoopB, ldr r5,[r0,r4,lsl #2], loads array element x[i] into R5. In this instruction, the address of source operand x[i] is R0 + (R4 << 2) (R0 contains the address of array x and R4 contains index variable i). Register R4 is left shifted by 2 bits since the size of each element of array x is 4 bytes. Note that this form of the ldr instruction modifies neither R0 nor R4.
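The address arithmetic R0 + (R4 << 2) can be modeled in C++ and compared against the addresses that the compiler itself generates. This is only a sketch: ScaledAddress and AddressesMatch are hypothetical names, and the code assumes a platform with a flat address space where converting a pointer to uintptr_t yields its byte address.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the effective address formed by
// ldr r5,[r0,r4,lsl #2]: base + (index << 2).
std::uintptr_t ScaledAddress(std::uintptr_t base, std::uint32_t index)
{
    return base + (static_cast<std::uintptr_t>(index) << 2);
}

// Returns true if the modeled address matches &x[i] for every element.
// Assumes a flat address space (see the lead-in paragraph).
bool AddressesMatch(const std::uint32_t* x, std::uint32_t n)
{
    assert(x != nullptr);
    const auto base = reinterpret_cast<std::uintptr_t>(x);
    for (std::uint32_t i = 0; i < n; i++)
    {
        if (ScaledAddress(base, i) != reinterpret_cast<std::uintptr_t>(&x[i]))
            return false;
    }
    return true;
}
```

The shift count of 2 is simply log2 of the 4-byte element size; an array of 2-byte elements would use a shift count of 1 instead.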

The next instruction, adds r2,r2,r5, adds x[i] to the low-order 32 bits of the intermediate sum that is maintained in register pair R2:R3. The adds instruction also sets the C condition flag to one if an unsigned overflow occurs when adding x[i] to the running sum; otherwise, C is set to zero. The ensuing adc r3,r3,#0 (add with carry) instruction adds the value of the C condition flag to the high-order 32 bits of the 64-bit running sum. The adds/adc instruction pair is often used to perform 64-bit integer addition as demonstrated in this function.
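The adds/adc pair can be mimicked in C++ by keeping the 64-bit sum as two 32-bit halves and propagating the carry by hand. The sketch below is a model for illustration only; the type Sum64 and the functions AddWithCarry and SumArray are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the register pair R2:R3 (low word : high word).
struct Sum64
{
    std::uint32_t lo = 0;
    std::uint32_t hi = 0;
};

// Models adds r2,r2,r5 followed by adc r3,r3,#0.
void AddWithCarry(Sum64& s, std::uint32_t value)
{
    std::uint32_t old_lo = s.lo;
    s.lo += value;                                   // adds: wraps on unsigned overflow
    std::uint32_t carry = (s.lo < old_lo) ? 1 : 0;   // value of the C flag
    s.hi += carry;                                   // adc: add C to the high word
}

std::uint64_t SumArray(const std::uint32_t* x, std::uint32_t n)
{
    assert(x != nullptr || n == 0);
    Sum64 s;
    for (std::uint32_t i = 0; i < n; i++)
        AddWithCarry(s, x[i]);
    return (static_cast<std::uint64_t>(s.hi) << 32) | s.lo;
}
```

Applied to the array y from Listing 4-1, this model carries into the high word three times, which is why the final sum exceeds the range of a 32-bit integer.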

Following the 64-bit addition, CalcSumB_ uses an add r4,#1 instruction, which adds one to the array index i that is maintained in R4. The next two instructions, cmp r4,r1 and bne LoopB, test i and repeat LoopB if i != n is true. Following the summing loop, the final 64-bit sum in register pair R2:R3 is copied to R0:R1 so that it can be passed back to the calling function. Here is the output for source code example Ch04_01:

Results for CalcSumA

x[0] = 3

x[1] = 17

x[2] = -13

x[3] = 25

x[4] = -2

x[5] = 9

x[6] = -6

x[7] = 12

x[8] = 88

x[9] = -19

sum_x1 = 114

sum_x2 = 114

Results for CalcSumB

y[0] = 268435456

y[1] = 536870912

y[2] = 1073741824

y[3] = 2147483648

y[4] = 1342177280

y[5] = 1879048192

y[6] = 2415919104

y[7] = 3221225472

sum_y1 = 12884901888

sum_y2 = 12884901888

Array Arithmetic Using Multiple Arrays

Listing 4-2 shows the source code for example Ch04_02. This example demonstrates how to carry out calculations using elements from multiple arrays. It also illustrates how to reference and use a C++ global variable in an assembly language function.

//-------------------------------------------------

// Ch04_02.cpp

//-------------------------------------------------

#include <iostream>

#include <iomanip>

#include <random>

using namespace std;

int32_t g_Val1 = 2;

int32_t g_Val2 = 100;

extern "C" int32_t CalcZ_(int32_t* z, const int8_t* x, const int16_t* y, int32_t n);

void Init(int8_t* x, int16_t* y, int32_t n)

{

unsigned int seed = 7;

uniform_int_distribution<> dist {-128, 127};

mt19937 rng {seed};

for (int32_t i = 0; i < n; i++)

{

x[i] = (int8_t)dist(rng);

y[i] = (int16_t)dist(rng);

}

}

int32_t CalcZ(int32_t* z, const int8_t* x, const int16_t* y, int32_t n)

{

int32_t sum = 0;

for (int32_t i = 0; i < n; i++)

{

int32_t temp;

if (x[i] < 0)

temp = y[i] * g_Val1;

else

temp = y[i] * g_Val2;

sum += temp ;

z[i] = temp;

}

return sum;

}

int main()

{

const int32_t n = 12;

int8_t x[n];

int16_t y[n];

int32_t z1[n], z2[n];

Init(x, y, n);

int32_t sum_z1 = CalcZ(z1, x, y, n);

int32_t sum_z2 = CalcZ_(z2, x, y, n);

const char nl = '\n';

const char* sep = " ";

for (int32_t i = 0; i < n; i++)

{

cout << "i: " << setw(2) << i << sep;

cout << "x: " << setw(5) << (int)x[i] << sep;

cout << "y: " << setw(5) << y[i] << sep;

cout << "z1: " << setw(7) << z1[i] << sep;

cout << "z2: " << setw(7) << z2[i] << nl;

}

cout << nl;

cout << "sum_z1 = " << sum_z1 << nl;

cout << "sum_z2 = " << sum_z2 << nl;

return 0 ;

}

//-------------------------------------------------

// Ch04_02_.s

//-------------------------------------------------

// extern "C" int32_t CalcZ_(int32_t* z, const int8_t* x, const int16_t* y, int32_t n);

.text

.global CalcZ_

CalcZ_: push {r4-r9}

mov r4,#0 // sum = 0

cmp r3,#0

ble Done // jump if n <= 0

ldr r5,=g_Val1

ldr r5,[r5] // r5 = g_Val1

ldr r6,=g_Val2

ldr r6,[r6] // r6 = g_Val2

// Main processing loop

Loop1: ldrsb r7,[r1],#1 // r7 = x[i]

ldrsh r8,[r2],#2 // r8 = y[i]

cmp r7,#0 // is x[i] < 0?

mullt r9,r8,r5 // temp = y[i] * g_Val1

// (if x[i] < 0)

mulge r9,r8,r6 // temp = y[i] * g_Val2

// (if x[i] >= 0)

add r4,r4,r9 // sum += temp

str r9,[r0],#4 // save result z[i]

subs r3,#1 // n -= 1

bne Loop1 // repeat until done

Done: mov r0,r4 // r0 = final sum

pop {r4-r9}

bx lr

Listing 4-2. Example Ch04_02

The C++ code in Listing 4-2 starts with the definition of global variables g_Val1 and g_Val2. These values are used in functions CalcZ and CalcZ_. Following the declaration of CalcZ_ is a function named Init, which initializes the test arrays for this example using random numbers. This function uses the C++ Standard Template Library (STL) classes uniform_int_distribution and mt19937 to generate random values for the array. Appendix B contains a list of references that you can consult if you are interested in learning more about these classes. The definition of function CalcZ is next. This function performs some admittedly contrived arithmetic for demonstration purposes. Note that different integer types are used for the arrays x, y, and z. The remaining C++ code performs test case initialization, exercises the functions CalcZ and CalcZ_, and displays the results.

The first nonprologue instruction of CalcZ_ is a mov r4,#0, which initializes sum to zero. The value of n is then tested to make sure it is greater than zero. The next instruction, ldr r5,=g_Val1, loads the address of g_Val1 into R5. This is followed by a ldr r5,[r5] instruction that loads g_Val1 into R5. Function CalcZ_ uses a similar sequence of instructions to load g_Val2 into R6.

Each iteration of for-loop Loop1 begins with a ldrsb r7,[r1],#1 instruction that loads x[i] into R7. Note that a post-indexed offset value of one is used since array x is of type int8_t. The ldrsh r8,[r2],#2 instruction loads y[i] into R8. This instruction uses a post-indexed offset value of two since array y is of type int16_t. The ensuing cmp r7,#0 sets the NZCV condition flags. The next instruction, mullt r9,r8,r5, calculates temp = y[i] * g_Val1 only if x[i] < 0 is true. Otherwise, no operation is performed. The mullt instruction is an example of an A32 conditional instruction that was discussed in Chapter 3. Following the mullt instruction is another conditionally executed instruction, mulge r9,r8,r6, which calculates temp = y[i] * g_Val2 only if x[i] >= 0 is true.
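The sign-extending loads and the mullt/mulge pair can be modeled in C++ as follows. This is a sketch under stated assumptions: the constants kVal1 and kVal2 are hypothetical stand-ins for the globals g_Val1 and g_Val2, and CalcTemp and CalcZModel are hypothetical names rather than functions from the book.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for g_Val1 and g_Val2 of Listing 4-2.
constexpr std::int32_t kVal1 = 2;
constexpr std::int32_t kVal2 = 100;

// Models the loads and conditional multiplies of one Loop1 iteration.
std::int32_t CalcTemp(std::int8_t xi, std::int16_t yi)
{
    std::int32_t r7 = xi;    // ldrsb: sign-extend 8 bits to 32 bits
    std::int32_t r8 = yi;    // ldrsh: sign-extend 16 bits to 32 bits
    std::int32_t r9;
    if (r7 < 0)
        r9 = r8 * kVal1;     // mullt executes, mulge is skipped
    else
        r9 = r8 * kVal2;     // mulge executes, mullt is skipped
    return r9;
}

// Models the whole loop, including the n <= 0 guard.
std::int32_t CalcZModel(std::int32_t* z, const std::int8_t* x,
                        const std::int16_t* y, std::int32_t n)
{
    assert(n <= 0 || (z != nullptr && x != nullptr && y != nullptr));
    std::int32_t sum = 0;
    for (std::int32_t i = 0; i < n; i++)
    {
        z[i] = CalcTemp(x[i], y[i]);
        sum += z[i];
    }
    return sum;
}
```

Exactly one of the two multiplies takes effect for any value of x[i], because the LT and GE conditions are complementary. That is why the assembly language code needs no branch inside the loop body.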

The add r4,r4,r9 instruction updates the sum that is maintained in R4. This is followed by a str r9,[r0],#4 instruction that saves temp to z[i]. This instruction uses a post-indexed offset value of four since array z is of type int32_t. The processing for-loop Loop1 repeats until all elements have been examined. Here is the output for source code example Ch04_02:

i: 0 x: -109 y: -70 z1: -140 z2: -140

i: 1 x: 71 y: -47 z1: -4700 z2: -4700

i: 2 x: -16 y: 122 z1: 244 z2: 244

i: 3 x: 57 y: -12 z1: -1200 z2: -1200

i: 4 x: 122 y: -50 z1: -5000 z2: -5000

i: 5 x: 9 y: -61 z1: -6100 z2: -6100

i: 6 x: 0 y: -106 z1: -10600 z2: -10600

i: 7 x: -110 y: -21 z1: -42 z2: -42

i: 8 x: -60 y: -124 z1: -248 z2: -248

i: 9 x: -1 y: 7 z1: 14 z2: 14

i: 10 x: 45 y: 94 z1: 9400 z2: 9400

i: 11 x: 77 y: -44 z1: -4400 z2: -4400

sum_z1 = -22772

sum_z2 = -22772

Integer Matrices

C++ also uses a contiguous block of memory to implement a two-dimensional array or matrix. The elements of a C++ matrix in memory are organized using row-major ordering. Row-major ordering arranges the elements of a matrix first by row and then by column. For example, elements of the matrix int x[3][2] are stored in consecutive memory locations as follows: x[0][0], x[0][1], x[1][0], x[1][1], x[2][0], and x[2][1]. Figure 4-1 illustrates this memory ordering scheme. In order to access a specific element in the matrix, a function (or a compiler) must know the starting address of the matrix (i.e., the address of its first element), the row and column indices, the total number of columns, and the size in bytes of each element. Using this information, a function can use simple addition and multiplication to access a specific element in a matrix as exemplified by the source code examples in this section.

Figure 4-1. Row-major ordering for matrix int x[3][2]
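The row-major address calculation can be written out and checked against the layout that the compiler uses for int x[3][2]. The following sketch uses the hypothetical names RowMajorOffset and LayoutMatches, and it assumes a flat address space where a pointer converted to uintptr_t yields its byte address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: offset of element [row][col], counted in elements.
std::size_t RowMajorOffset(std::size_t row, std::size_t col, std::size_t ncols)
{
    assert(col < ncols);
    return row * ncols + col;
}

// Checks start address + offset * element size against &x[i][j].
// Assumes a flat address space (see the lead-in paragraph).
bool LayoutMatches()
{
    int x[3][2] {{0, 1}, {2, 3}, {4, 5}};
    const auto base = reinterpret_cast<std::uintptr_t>(&x[0][0]);
    for (std::size_t i = 0; i < 3; i++)
    {
        for (std::size_t j = 0; j < 2; j++)
        {
            std::uintptr_t expected = base + RowMajorOffset(i, j, 2) * sizeof(int);
            if (reinterpret_cast<std::uintptr_t>(&x[i][j]) != expected)
                return false;
        }
    }
    return true;
}
```

Note that the number of rows never enters the calculation; only the column count and the element size are needed.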

Accessing Matrix Elements

Listing 4-3 shows the source code for example Ch04_03, which demonstrates how to use assembly language to access the elements of a matrix. In this example, the functions CalcMatrixSquares and CalcMatrixSquares_ perform the following matrix calculation: y[i][j] = x[j][i] * x[j][i]. Note that in this expression, the indices i and j for matrix x are intentionally reversed to make the code in this example a little more interesting.

//-------------------------------------------------

// Ch04_03.cpp

//-------------------------------------------------

#include <iostream>

#include <iomanip>

using namespace std;

extern "C" void CalcMatrixSquares_(int* y, const int* x, int m, int n);

void CalcMatrixSquares(int* y, const int* x, int m, int n)

{

for (int i = 0; i < m; i++)

{

for (int j = 0; j < n; j++)

{

int kx = j * m + i;

int ky = i * n + j;

y[ky] = x[kx] * x[kx];

}

}

}

int main()

{

const int m = 6;

const int n = 3;

int y1[m][n], y2[m][n];

int x[n][m] {{ 1, 2, 3, 4, 5, 6 },

{ 7, 8, 9, 10, 11, 12 },

{ 13, 14, 15, 16, 17, 18 }};

CalcMatrixSquares(&y1[0][0], &x[0][0], m, n);

CalcMatrixSquares_(&y2[0][0], &x[0][0], m, n);

for (int i = 0; i < m; i++)

{

for (int j = 0; j < n; j++)

{

cout << "y1[" << setw(2) << i << "][" << setw(2) << j << "] = ";

cout << setw(6) << y1[i][j] << ' ' ;

cout << "y2[" << setw(2) << i << "][" << setw(2) << j << "] = ";

cout << setw(6) << y2[i][j] << ' ';

cout << "x[" << setw(2) << j << "][" << setw(2) << i << "] = ";

cout << setw(6) << x[j][i] << '\n';

if (y1[i][j] != y2[i][j])

cout << "Compare failed\n";

}

}

return 0;

}

//-------------------------------------------------

// Ch04_03_.s

//-------------------------------------------------

// extern "C" void CalcMatrixSquares_(int* y, const int* x, int m, int n);

.text

.global CalcMatrixSquares_

CalcMatrixSquares_:

push {r4-r8}

cmp r2,#0

ble Done // jump if m <= 0

cmp r3,#0

ble Done // jump if n <= 0

mov r4,#0 // i = 0

Loop1: mov r5,#0 // j = 0

Loop2: mov r6,r5 // r6 = j

mul r6,r6,r2 // r6 = j * m

add r6,r6,r4 // kx = j * m + i

ldr r7,[r1,r6,lsl #2] // r7 = x[kx] (x[j][i])

mul r7,r7,r7 // r7 = x[j][i] * x[j][i]

mov r8,r4 // r8 = i

mul r8,r8,r3 // r8 = i * n

add r8,r8,r5 // ky = i * n + j

str r7,[r0,r8,lsl #2] // save y[ky] (y[i][j])

add r5,#1 // j += 1

cmp r5,r3

blt Loop2 // jump if j < n

add r4,#1 // i += 1

cmp r4,r2

blt Loop1 // jump if i < m

Done: pop {r4-r8}

bx lr

Listing 4-3. Example Ch04_03

The function CalcMatrixSquares illustrates how to access an element in a C++ matrix using explicit arithmetic. At entry to this function, arguments x and y point to the memory blocks that contain their respective matrices. Inside the second for-loop, the expression kx = j * m + i calculates the offset necessary to access element x[j][i]. Similarly, the expression ky = i * n + j calculates the offset for element y[i][j]. Note that the code employed in CalcMatrixSquares to calculate kx and ky requires x to be a matrix of size n × m and y to be a matrix of size m × n.
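The two offset expressions can be isolated as small helpers to make the transposed indexing easier to follow. This is a sketch only; OffsetX and OffsetY are hypothetical names that mirror the kx and ky calculations.

```cpp
#include <cassert>

// Hypothetical helper: offset of x[j][i], where x has n rows and m columns.
int OffsetX(int i, int j, int m)
{
    assert(i >= 0 && i < m && j >= 0);
    return j * m + i;          // kx
}

// Hypothetical helper: offset of y[i][j], where y has m rows and n columns.
int OffsetY(int i, int j, int n)
{
    assert(j >= 0 && j < n && i >= 0);
    return i * n + j;          // ky
}
```

With m = 6 and n = 3 as in Listing 4-3, the pair i = 0, j = 1 gives kx = 6 and ky = 1. Element 6 of the test matrix x is 7, so y[0][1] receives 49.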

The assembly language function CalcMatrixSquares_ uses the same technique as the C++ code to access elements in matrices x and y. This function begins its execution by checking argument values m and n to make sure they are greater than zero. A mov r4,#0 instruction is then used to initialize index i to zero. Each iteration of for-loop Loop1 starts with a mov r5,#0 instruction that sets index j to zero. The ensuing mov r6,r5, mul r6,r6,r2, and add r6,r6,r4 instructions calculate kx = j * m + i. This is followed by a ldr r7,[r1,r6,lsl #2] instruction that loads x[j][i] into R7. The mul r7,r7,r7 instruction calculates x[j][i] * x[j][i].

Function CalcMatrixSquares_ employs a similar sequence of instructions to calculate the address of y[i][j]. Variable ky is calculated using the instruction pair mul r8,r8,r3 and add r8,r8,r5. The str r7,[r0,r8,lsl #2] instruction then saves the previously calculated squared result to y[i][j]. Like the corresponding C++ code, the nested for-loops in CalcMatrixSquares_ continue to execute until the index counters i and j (registers R4 and R5) reach their respective termination values. Here is the output for source code example Ch04_03:

y1[ 0][ 0] = 1 y2[ 0][ 0] = 1 x[ 0][ 0] = 1

y1[ 0][ 1] = 49 y2[ 0][ 1] = 49 x[ 1][ 0] = 7

y1[ 0][ 2] = 169 y2[ 0][ 2] = 169 x[ 2][ 0] = 13

y1[ 1][ 0] = 4 y2[ 1][ 0] = 4 x[ 0][ 1] = 2

y1[ 1][ 1] = 64 y2[ 1][ 1] = 64 x[ 1][ 1] = 8

y1[ 1][ 2] = 196 y2[ 1][ 2] = 196 x[ 2][ 1] = 14

y1[ 2][ 0] = 9 y2[ 2][ 0] = 9 x[ 0][ 2] = 3

y1[ 2][ 1] = 81 y2[ 2][ 1] = 81 x[ 1][ 2] = 9

y1[ 2][ 2] = 225 y2[ 2][ 2] = 225 x[ 2][ 2] = 15

y1[ 3][ 0] = 16 y2[ 3][ 0] = 16 x[ 0][ 3] = 4

y1[ 3][ 1] = 100 y2[ 3][ 1] = 100 x[ 1][ 3] = 10

y1[ 3][ 2] = 256 y2[ 3][ 2] = 256 x[ 2][ 3] = 16

y1[ 4][ 0] = 25 y2[ 4][ 0] = 25 x[ 0][ 4] = 5

y1[ 4][ 1] = 121 y2[ 4][ 1] = 121 x[ 1][ 4] = 11

y1[ 4][ 2] = 289 y2[ 4][ 2] = 289 x[ 2][ 4] = 17

y1[ 5][ 0] = 36 y2[ 5][ 0] = 36 x[ 0][ 5] = 6

y1[ 5][ 1] = 144 y2[ 5][ 1] = 144 x[ 1][ 5] = 12

y1[ 5][ 2] = 324 y2[ 5][ 2] = 324 x[ 2][ 5] = 18

Row-Column Sums

Listing 4-4 shows the source code for example Ch04_04, which demonstrates how to sum the rows and columns of an integer matrix. The C++ code for this example begins with a function named Init that initializes a test matrix with random values. The function CalcMatrixRowColSums is a C++ implementation of the row-column summing algorithm. This function sweeps through matrix x using a set of nested for-loops. During each inner loop iteration, CalcMatrixRowColSums adds matrix element x[i][j] to col_sums[j]. The outer for-loop updates row_sums[i]. Function CalcMatrixRowColSums also uses the same matrix element offset arithmetic that you saw in the previous example.

//-------------------------------------------------

// Ch04_04.cpp

//-------------------------------------------------

#include <iostream>

#include <random>

using namespace std;

// Ch04_04_.s

extern "C" bool CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int nrows, int ncols);

// Ch04_04_Misc.cpp

extern void PrintResult(const char* msg, const int* row_sums, const int* col_sums, const int* x, int nrows, int ncols);

void Init(int* x, int nrows, int ncols)

{

unsigned int seed = 13;

uniform_int_distribution<> d {1, 200};

mt19937 rng {seed};

for (int i = 0; i < nrows * ncols; i++)

x[i] = d(rng);

}

bool CalcMatrixRowColSums(int* row_sums, int* col_sums, const int* x, int nrows, int ncols)

{

if (nrows <= 0 || ncols <= 0)

return false;

for (int j = 0; j < ncols; j++)

col_sums[j] = 0;

for (int i = 0; i < nrows; i++)

{

int row_sums_temp = 0 ;

int k = i * ncols;

for (int j = 0; j < ncols; j++)

{

int temp = x[k + j];

row_sums_temp += temp;

col_sums[j] += temp;

}

row_sums[i] = row_sums_temp;

}

return true;

}

int main()

{

const int nrows = 8;

const int ncols = 6;

int x[nrows][ncols];

Init((int*)x, nrows, ncols);

int row_sums1[nrows], col_sums1[ncols];

int row_sums2[nrows], col_sums2[ncols];

const char* msg1 = "Results for CalcMatrixRowColSums";

const char* msg2 = "Results for CalcMatrixRowColSums_";

bool rc1 = CalcMatrixRowColSums(row_sums1, col_sums1, (int*)x, nrows, ncols);

bool rc2 = CalcMatrixRowColSums_(row_sums2, col_sums2, (int*)x, nrows, ncols);

if (!rc1)

cout << "\nCalcMatrixRowSums failed\n";

else

PrintResult(msg1, row_sums1, col_sums1, (int*)x, nrows, ncols);

if (!rc2)

cout << "\nCalcMatrixRowSums_ failed\n";

else

PrintResult(msg2, row_sums2, col_sums2, (int*)x, nrows, ncols);

return 0 ;

}

//-------------------------------------------------

// Ch04_04_.s

//-------------------------------------------------

// extern "C" bool CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int nrows, int ncols);

.text

.global CalcMatrixRowColSums_

.equ ARG_NCOLS,32

CalcMatrixRowColSums_:

push {r4-r11}

cmp r3,#0

movle r0,#0 // set error return code

ble Done // jump if nrows <= 0

ldr r4,[sp,#ARG_NCOLS]

cmp r4,#0

movle r0,#0 // set error return code

ble Done // jump if ncols <= 0

// Set elements of col_sums to zero

mov r5,r1 // r5 = col_sums

mov r6,r4 // r6 = ncols

mov r7,#0

Loop0: str r7,[r5],#4 // col_sums[j] = 0

subs r6,#1 // j -= 1

bne Loop0 // jump if j != 0

// Main processing loops

mov r5,#0 // i = 0

Loop1: mov r6,#0 // j = 0

mov r12,#0 // row_sums_temp = 0

mul r7,r5,r4 // r7 = i * ncols

Loop2: add r8,r7,r6 // r8 = i * ncols + j

ldr r9,[r2,r8,lsl #2] // r9 = x[i][j]

// Update row_sums and col_sums using current x[i][j]

add r12,r12,r9 // row_sums_temp += x[i][j]

add r10,r1,r6,lsl #2 // r10 = ptr to col_sums[j]

ldr r11,[r10] // r11 = col_sums[j]

add r11,r11,r9 // col_sums[j] += x[i][j]

str r11,[r10] // save col_sums[j]

add r6,r6,#1 // j += 1

cmp r6,r4

blt Loop2 // jump if j < ncols

str r12,[r0],#4 // save row_sums[i]

add r5,r5,#1 // i += 1

cmp r5,r3

blt Loop1 // jump if i < nrows

mov r0,#1 // set success return code

Done: pop {r4-r11}

bx lr

Listing 4-4. Example Ch04_04

The assembly language function CalcMatrixRowColSums_ implements the same algorithm as its C++ counterpart. Following preservation of non-volatile registers R4–R11 on the stack, arguments nrows and ncols are tested for validity. Note that ncols was passed via the stack. Also note the two uses of the movle r0,#0 instruction, which load R0 with the correct return code if either ncols or nrows is invalid. For-loop Loop0 then initializes each element in col_sums to zero.
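The value 32 assigned to ARG_NCOLS follows from the size of the prologue's register save area. The sketch below shows the arithmetic; FirstStackArgOffset is a hypothetical name, and the code assumes that the fifth integer argument sits at the address held in SP when the function is entered, which is where the Arm 32-bit procedure call standard places it after R0 through R3 are used.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte offset from SP to the first stack-passed argument
// after the prologue has pushed saved_regs 32-bit registers.
std::size_t FirstStackArgOffset(std::size_t saved_regs)
{
    assert(saved_regs <= 16);      // an A32 register list names at most 16 registers
    return saved_regs * 4;         // each saved register occupies 4 bytes
}
```

For push {r4-r11}, eight registers are saved, so the helper yields 32, which matches the .equ ARG_NCOLS,32 directive. If the register list in the push instruction changes, the constant must change with it.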

Prior to the start of for-loop Loop1, a mov r5,#0 instruction sets index i equal to zero. Each Loop1 iteration begins with the instruction pair mov r6,#0 and mov r12,#0. These instructions initialize both index j (R6) and row_sums_temp (R12) to zero. The next instruction, mul r7,r5,r4, calculates i * ncols. At the start of for-loop Loop2, an add r8,r7,r6 instruction calculates the offset i * ncols + j for matrix element x[i][j]. The ensuing ldr r9,[r2,r8,lsl #2] instruction loads x[i][j] into R9 as illustrated in Figure 4-2. This is followed by an add r12,r12,r9 instruction that calculates row_sums_temp += x[i][j].

Figure 4-2. Update instructions for row_sums and col_sums in function CalcMatrixRowColSums_

The add r10,r1,r6,lsl #2 instruction computes the address of col_sums[j]. In this instruction, which calculates R10 = R1 + (R6 << 2), R1 contains a pointer to the array col_sums and R6 contains index j. The ensuing instruction triplet, ldr r11,[r10], add r11,r11,r9, and str r11,[r10], adds x[i][j] to col_sums[j]. For-loop Loop2 continues to repeat so long as j < ncols is true. Following each Loop2 execution cycle, the str r12,[r0],#4 instruction saves row_sums_temp (R12) to row_sums[i]. Index i is then updated and Loop1 repeats while i < nrows is true. Here is the output for source code example Ch04_04:

Results for CalcMatrixRowColSums

------------------------------------------------

156 122 48 172 165 179 842

194 36 195 152 91 151 819

122 122 156 159 129 78 766

145 69 8 133 60 182 597

12 145 172 95 75 100 599

136 90 52 107 70 199 654

2 123 72 37 190 145 569

44 188 64 108 184 95 683

811 895 767 963 964 1129

Results for CalcMatrixRowColSums_

------------------------------------------------

156 122 48 172 165 179 842

194 36 195 152 91 151 819

122 122 156 159 129 78 766

145 69 8 133 60 182 597

12 145 172 95 75 100 599

136 90 52 107 70 199 654

2 123 72 37 190 145 569

44 188 64 108 184 95 683

811 895 767 963 964 1129

Advanced Programming

The source code examples in this section highlight a few advanced programming techniques. The first example introduces additional load and store instructions that you can use to access the elements of an array. The second example explains how to reference the members of a C++ structure in an assembly language function.

Array Reversal

Source code example Ch04_05 demonstrates a couple of array reversal techniques that use the ldmda (load multiple decrement after), ldmia (load multiple increment after), stmda (store multiple decrement after), and stmia (store multiple increment after) instructions. These instructions load or store multiple registers. Listing 4-5 shows the C++ and assembly language source code for example Ch04_05.

//-------------------------------------------------

// Ch04_05.cpp

//-------------------------------------------------

#include <iostream>

#include <iomanip>

#include <random>

using namespace std;

extern "C" void ReverseArrayA_(int* y, const int* x, int n);

extern "C" void ReverseArrayB_(int* x, int n);

void Init(int* x, int n, unsigned int seed)

{

uniform_int_distribution<> d {1, 1000};

mt19937 rng {seed};

for (int i = 0; i < n; i++)

x[i] = d(rng);

}

void PrintArray(const char* msg, const int* x, int n)

{

const char nl = '\n';

cout << nl << msg << nl;

for (int i = 0; i < n; i++)

{

cout << setw(5) << x[i];

if ((i + 1) % 10 == 0)

cout << nl;

}

cout << nl;

}

void ReverseArrayA(void)

{

const int n = 25 ;

int x[n], y[n];

Init(x, n, 32);

PrintArray("ReverseArrayA - original array x", x, n);

ReverseArrayA_(y, x, n);

PrintArray("ReverseArrayA - reversed array y", y, n);

}

void ReverseArrayB(void)

{

const int n = 25;

int x[n];

Init(x, n, 32);

PrintArray("ReverseArrayB - array x before reversal", x, n);

ReverseArrayB_(x, n);

PrintArray("ReverseArrayB - array x after reversal", x, n);

}

int main()

{

ReverseArrayA();

ReverseArrayB();

return 0;

}

//-------------------------------------------------

// Ch04_05_.s

//-------------------------------------------------

// extern "C" void ReverseArrayA_(int* y, const int* x, int n);

.text

.global ReverseArrayA_

ReverseArrayA_:

push {r4-r11}

// Initialize

add r1,r1,r2,lsl #2

sub r1,#4 // r1 points to x[n - 1]

cmp r2,#4

blt SkipLoopA // jump if n < 4

// Main loop

LoopA: ldmda r1!,{r4-r7} // r4 = *(r1 - 12)

// r5 = *(r1 - 8)

// r6 = *(r1 - 4)

// r7 = *r1

// r1 -= 16

mov r8,r7 // reorder elements in

mov r9,r6 // r4 - r7 for use with

mov r10,r5 // stmia instruction

mov r11,r4

stmia r0!,{r8-r11} // *r0 = r8

// *(r0 + 4) = r9

// *(r0 + 8) = r10

// *(r0 + 12) = r11

// r0 += 16

sub r2,#4 // n -= 4

cmp r2,#4

bge LoopA // jump if n >= 4

// Process remaining (0 - 3) array elements

SkipLoopA: cmp r2,#0

ble DoneA // jump if no more elements

ldr r4,[r1],#-4 // load single element from x

str r4,[r0],#4 // save element to y

subs r2,#1 // n -= 1

beq DoneA // jump if n == 0

ldr r4,[r1],#-4 // load single element from x

str r4,[r0],#4 // save element to y

subs r2,#1 // n -= 1

beq DoneA // jump if n == 0

ldr r4,[r1] // load final element from x

str r4,[r0] // save final element to y

DoneA: pop {r4-r11}

bx lr

// extern "C" void ReverseArrayB_(int* x, int n);

.global ReverseArrayB_

ReverseArrayB_:

push {r4-r11}

// Initialize

mov r2,r1 // r2 = n

add r1,r0,r2,lsl #2

sub r1,#4 // r1 points to x[n - 1]

cmp r2,#4

blt SkipLoopB // jump if n < 4

LoopB: ldmia r0,{r4,r5} // r4 = *r0, r5 = *(r0 + 4)

ldmda r1,{r6,r7} // r6 = *(r1 - 4), r7 = *r1

mov r8,r7 // reorder elements

mov r9,r6 // for use with stmia and

mov r10,r5 // stmda instructions

mov r11,r4

stmia r0!,{r8,r9} // *r0 = r8, *(r0 + 4) = r9, r0 += 8

stmda r1!,{r10,r11} // *(r1 - 4) = r10, *r1 = r11, r1 -= 8

sub r2,#4 // n -= 4

cmp r2,#4

bge LoopB // jump if n >= 4

// Process remaining (0 - 3) array elements

SkipLoopB: cmp r2,#1

ble DoneB // jump if done

ldr r4,[r0] // load final element

ldr r5,[r1] // pair into r4:r5

str r4,[r1] // save elements

str r5,[r0]

DoneB: pop {r4-r11}

bx lr

Listing 4-5. Example Ch04_05

The function ReverseArrayA_ copies elements from a source array to a destination array in reverse order. This function requires three parameters: a pointer to destination array y, a pointer to source array x, and the number of elements n. During its initialization phase, ReverseArrayA_ uses the instructions add r1,r1,r2,lsl #2 and sub r1,#4 to calculate the address of the last element in array x. It then checks the value of n to see if it is less than four. If n < 4 is true, ReverseArrayA_ skips over for-loop LoopA. The reason for this is that for-loop LoopA processes four elements during each iteration.

Figure 4-3 illustrates the ldmda/stmia instruction sequence that is used in for-loop LoopA. The first instruction of LoopA, ldmda r1!,{r4-r7}, loads four array elements into registers R4, R5, R6, and R7. This instruction also updates R1 so that it points to the preceding group of four elements. Next is a series of mov instructions that rearrange the elements in registers R4–R7 for use with the stmia instruction. Following the four mov instructions, R8, R9, R10, and R11 contain y[i], y[i+1], y[i+2], and y[i+3], respectively. The ensuing stmia r0!,{r8-r11} saves R8, R9, R10, and R11 to y[i:i+3]. This instruction also updates R0 so that it points to element y[i+4].

Figure 4-3. First execution of ldmda r1!,{r4-r7} and stmia r0!,{r8-r11} instructions in function ReverseArrayA_

Following execution of the stmia instruction, n is decremented by four and LoopA repeats until n < 4 is true. The block of code that follows LoopA reverses the final few elements of array x using ldr and str instructions. Note that after each of these elements is copied, n is decremented by one and tested to see if it is equal to zero.
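The control flow described for ReverseArrayA_ can be modeled in C++. What follows is a minimal sketch rather than the book's code: the function name ReverseCopySketch and the index variables src and dst are hypothetical, and the locals r4 through r7 merely echo the registers named in the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of the approach described for ReverseArrayA_:
// copy x into y in reverse order, four elements per iteration, then
// finish the remaining 0 - 3 elements one at a time.
void ReverseCopySketch(int32_t* y, const int32_t* x, int32_t n)
{
    assert(n >= 0);

    int32_t src = n;    // x[src - 1] is the next source element (end of x)
    int32_t dst = 0;    // y[dst] is the next destination element

    while (src >= 4)
    {
        // Models a four-register load from the end of x: the lowest
        // register receives the word at the lowest address.
        int32_t r4 = x[src - 4];
        int32_t r5 = x[src - 3];
        int32_t r6 = x[src - 2];
        int32_t r7 = x[src - 1];

        // Models the reordering moves plus a four-register store to y.
        y[dst]     = r7;
        y[dst + 1] = r6;
        y[dst + 2] = r5;
        y[dst + 3] = r4;

        src -= 4;
        dst += 4;
    }

    // Remaining elements, one per iteration.
    while (src > 0)
        y[dst++] = x[--src];
}
```

Calling ReverseCopySketch(y, x, 7) with x holding 1 through 7 fills y with 7 down to 1. The first loop handles four elements and the second loop handles the last three.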

Function ReverseArrayB_ performs an in-place reversal of an integer array. This function begins its execution by initializing R0 and R1 as pointers to the first and last elements of the input array x, respectively. The first instruction of for-loop LoopB, ldmia r0,{r4,r5}, loads two elements from the beginning of array x. The ldmda r1,{r6,r7} instruction that follows loads two elements from the end of array x. Following a series of mov instructions that rearrange these elements, ReverseArrayB_ uses the instructions stmia r0!,{r8,r9} and stmda r1!,{r10,r11} to finalize a four-element reversal as shown in Figure 4-4.

Figure 4-4. Four-element reversal using ldmia, ldmda, stmia, and stmda instructions

The block of code that follows LoopB performs a final two-element reversal if one is required. Here is the output for source code example Ch04_05:

ReverseArrayA - original array x

859 60 373 422 556 331 956 754 737 732

817 354 102 456 929 286 610 766 597 828

92 39 346 314 663

ReverseArrayA - reversed array y

663 314 346 39 92 828 597 766 610 286

929 456 102 354 817 732 737 754 956 331

556 422 373 60 859

ReverseArrayB - array x before reversal

859 60 373 422 556 331 956 754 737 732

817 354 102 456 929 286 610 766 597 828

92 39 346 314 663

ReverseArrayB - array x after reversal

663 314 346 39 92 828 597 766 610 286

929 456 102 354 817 732 737 754 956 331

556 422 373 60 859
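The in-place reversal performed by ReverseArrayB_ can also be modeled in C++. This is a hedged sketch under stated assumptions: the name ReverseInPlaceSketch and the indices lo and hi are hypothetical, and the comments map each statement to the instruction it is meant to imitate.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of ReverseArrayB_: swap two elements from the
// front with two elements from the back until fewer than four remain.
void ReverseInPlaceSketch(int32_t* x, int32_t n)
{
    assert(n >= 0);

    int32_t lo = 0;        // plays the role of r0 (front of the array)
    int32_t hi = n - 1;    // plays the role of r1 (last element)

    while (n >= 4)
    {
        int32_t r4 = x[lo],     r5 = x[lo + 1];   // ldmia r0,{r4,r5}
        int32_t r6 = x[hi - 1], r7 = x[hi];       // ldmda r1,{r6,r7}

        x[lo]     = r7;                           // stmia r0!,{r8,r9}
        x[lo + 1] = r6;
        x[hi - 1] = r5;                           // stmda r1!,{r10,r11}
        x[hi]     = r4;

        lo += 2;
        hi -= 2;
        n  -= 4;
    }

    // Two or three elements left: swap the outer pair. With three left,
    // the middle element is already in its final position.
    if (n > 1)
    {
        int32_t temp = x[lo];
        x[lo] = x[hi];
        x[hi] = temp;
    }
}
```

With seven elements, one loop iteration reverses the outer four, the final swap handles two of the remaining three, and the middle element stays put.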

Structures

A structure is a programming language construct that facilitates the definition of new data types using one or more existing data types. In C++, a structure is essentially the same as a class. When a data type is defined using the keyword struct instead of class, all members are public by default. A C++ struct that is declared sans any member functions or operators is analogous to a C-style structure such as typedef struct { ... } MyStruct;. C++ structure declarations are usually placed in a header (.h) file so they can be easily referenced by multiple C++ files.

The address of a structure member is simply the starting address of the structure in memory plus the member’s offset in bytes. During compilation, most C++ compilers align structure members to their natural boundary, which means that structures frequently contain extra padding bytes. It is not possible to define a structure in a header file and include this file in both C++ and assembly language source code files. However, a simple solution to this dilemma is to use the C++ offsetof macro to determine the offset for each structure member and then use .equ directives in the assembly language file. You will learn how to do this shortly.
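As a small illustration of padding and of the offsetof approach, the sketch below uses a hypothetical three-member structure named PaddingDemo (it is not part of the book's examples). The helper MakeEquDirectives, also hypothetical, formats each offset as a .equ line that could be pasted into an assembly language file. The offsets noted in the comments assume a typical ABI in which int32_t is aligned on a 4-byte boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical structure used only to illustrate padding.
struct PaddingDemo
{
    int8_t  A;    // offset 0, typically followed by 3 padding bytes
    int32_t B;    // typically offset 4
    int16_t C;    // typically offset 8, followed by 2 tail padding bytes
};

// Natural alignment: each member's offset is a multiple of its alignment.
static_assert(offsetof(PaddingDemo, B) % alignof(int32_t) == 0, "B is aligned");
static_assert(offsetof(PaddingDemo, C) % alignof(int16_t) == 0, "C is aligned");

// Builds one .equ directive per member from the compiler-reported offsets.
std::string MakeEquDirectives()
{
    std::ostringstream oss;
    oss << ".equ S_A," << offsetof(PaddingDemo, A) << '\n';
    oss << ".equ S_B," << offsetof(PaddingDemo, B) << '\n';
    oss << ".equ S_C," << offsetof(PaddingDemo, C) << '\n';
    return oss.str();
}
```

Generating the directives from offsetof, instead of computing offsets by hand, keeps the assembly language constants in step with whatever padding the compiler inserts.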

Listing 4-6 shows the C++ and assembly language source code for example Ch04_06. In the C++ code, a simple structure named TestStruct is defined. This structure uses sized integer types instead of the more common C++ types to highlight the exact size of each member.

//-------------------------------------------------

// Ch04_06.cpp

//-------------------------------------------------

#include <iostream>

#include <iomanip>

#include <cstdint>

#include <cstddef>

using namespace std;

struct TestStruct

{

int8_t ValA;

int8_t ValB;

int32_t ValC;

int16_t ValD;

int32_t ValE;

uint8_t ValF;

uint16_t ValG;

};

extern "C" int32_t CalcTestStructSum_(const TestStruct* ts);

void PrintTestStructOffsets(void)

{

const char nl = '\n';

cout << "offsetof(ts.ValA) = " << offsetof(TestStruct, ValA) << nl;

cout << "offsetof(ts.ValB) = " << offsetof(TestStruct, ValB) << nl;

cout << "offsetof(ts.ValC) = " << offsetof(TestStruct, ValC) << nl;

cout << "offsetof(ts.ValD) = " << offsetof(TestStruct, ValD) << nl;

cout << "offsetof(ts.ValE) = " << offsetof(TestStruct, ValE) << nl;

cout << "offsetof(ts.ValF) = " << offsetof(TestStruct, ValF) << nl;

cout << "offsetof(ts.ValG) = " << offsetof(TestStruct, ValG) << nl;

}

int32_t CalcTestStructSum(const TestStruct* ts)

{

int32_t temp1 = ts->ValA + ts->ValB + ts->ValC + ts->ValD;

int32_t temp2 = ts->ValE + ts->ValF + ts->ValG;

return temp1 + temp2;

}

int main()

{

const char nl = '\n';

PrintTestStructOffsets();

TestStruct ts;

ts.ValA = -100;

ts.ValB = 75;

ts.ValC = 1000000;

ts.ValD = -3000;

ts.ValE = 400000;

ts.ValF = 200;

ts.ValG = 50000;

int32_t sum1 = CalcTestStructSum(&ts);

int32_t sum2 = CalcTestStructSum_(&ts);

cout << nl << "Results for CalcTestStructSum" << nl;

cout << "ts1.ValA = " << (int)ts.ValA << nl;

cout << "ts1.ValB = " << (int)ts.ValB << nl;

cout << "ts1.ValC = " << ts.ValC << nl;

cout << "ts1.ValD = " << ts.ValD << nl;

cout << "ts1.ValE = " << ts.ValE << nl;

cout << "ts1.ValF = " << (int)ts.ValF << nl;

cout << "ts1.ValG = " << ts.ValG << nl;

cout << "sum1 = " << sum1 << nl;

cout << "sum2 = " << sum2 << nl;

if (sum1 != sum2)

cout << "Compare error!" << nl;

return 0;

}

//-------------------------------------------------

// Ch04_06_.s

//-------------------------------------------------

// extern "C" int32_t CalcTestStructSum_(const TestStruct* ts);

// Offsets for TestStruct

.equ S_ValA,0 // int8_t

.equ S_ValB,1 // int8_t

.equ S_ValC,4 // int32_t

.equ S_ValD,8 // int16_t

.equ S_ValE,12 // int32_t

.equ S_ValF,16 // uint8_t

.equ S_ValG,18 // uint16_t

.text

.global CalcTestStructSum_

CalcTestStructSum_:

// Sum the elements of TestStruct

ldrsb r1,[r0,#S_ValA] // r1 = ValA (sign-extended)

ldrsb r2,[r0,#S_ValB] // r2 = ValB (sign-extended)

add r1,r1,r2

ldr r2,[r0,#S_ValC] // r2 = ValC

add r1,r1,r2

ldrsh r2,[r0,#S_ValD] // r2 = ValD (sign-extended)

add r1,r1,r2

ldr r2,[r0,#S_ValE] // r2 = ValE

add r1,r1,r2

ldrb r2,[r0,#S_ValF] // r2 = ValF (zero-extended)

add r1,r1,r2

ldrh r2,[r0,#S_ValG] // r2 = ValG (zero-extended)

add r1,r1,r2

mov r0,r1

bx lr

Listing 4-6.

Example Ch04_06

Following the definition of TestStruct is a function named PrintTestStructOffsets. The function main calls this function, which prints the offset in bytes of each member in TestStruct. These results were then used to define .equ directives in the assembly language file for the members in TestStruct. The remaining code in main initializes an instance of TestStruct, calls CalcTestStructSum and CalcTestStructSum_, and displays results. The functions CalcTestStructSum and CalcTestStructSum_ both sum the members in TestStruct.

The assembly language code in Listing 4-6 begins with the aforementioned .equ directives that define offsets for each structure member. The sole argument value for CalcTestStructSum_ is a pointer to the caller’s TestStruct. The calculating code in CalcTestStructSum_ loads each structure member into a register using the form of the ldr instruction that matches the member’s size and signedness: ldrsb and ldrsh sign-extend the signed 8- and 16-bit members, ldrb and ldrh zero-extend the unsigned 8- and 16-bit members, and ldr loads the 32-bit members. Note that each load instruction uses simple offset addressing. Here is the output for source code example Ch04_06:

offsetof(ts.ValA) = 0

offsetof(ts.ValB) = 1

offsetof(ts.ValC) = 4

offsetof(ts.ValD) = 8

offsetof(ts.ValE) = 12

offsetof(ts.ValF) = 16

offsetof(ts.ValG) = 18

Results for CalcTestStructSum

ts1.ValA = -100

ts1.ValB = 75

ts1.ValC = 1000000

ts1.ValD = -3000

ts1.ValE = 400000

ts1.ValF = 200

ts1.ValG = 50000

sum1 = 1447175

sum2 = 1447175
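The effect of the sign-extending and zero-extending loads in CalcTestStructSum_ can be imitated in C++ by reading raw bytes at the same offsets. The sketch below is an illustration under assumptions, not the book's code: the helper LoadAs and the function CalcSumFromBytes are hypothetical names, and the offsets are copied from the .equ directives in Listing 4-6.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Offsets copied from the .equ directives in Ch04_06_.s
constexpr std::size_t S_ValA = 0;     // int8_t
constexpr std::size_t S_ValB = 1;     // int8_t
constexpr std::size_t S_ValC = 4;     // int32_t
constexpr std::size_t S_ValD = 8;     // int16_t
constexpr std::size_t S_ValE = 12;    // int32_t
constexpr std::size_t S_ValF = 16;    // uint8_t
constexpr std::size_t S_ValG = 18;    // uint16_t

// Hypothetical helper: reads a T at base + offset and widens it to 32 bits.
// A signed T sign-extends and an unsigned T zero-extends.
template <typename T>
int32_t LoadAs(const uint8_t* base, std::size_t offset)
{
    T value;
    std::memcpy(&value, base + offset, sizeof(T));
    return static_cast<int32_t>(value);
}

// Sums the members the way the assembly language function does.
int32_t CalcSumFromBytes(const uint8_t* base)
{
    int32_t sum = LoadAs<int8_t>(base, S_ValA);    // like ldrsb
    sum += LoadAs<int8_t>(base, S_ValB);           // like ldrsb
    sum += LoadAs<int32_t>(base, S_ValC);          // like ldr
    sum += LoadAs<int16_t>(base, S_ValD);          // like ldrsh
    sum += LoadAs<int32_t>(base, S_ValE);          // like ldr
    sum += LoadAs<uint8_t>(base, S_ValF);          // like ldrb
    sum += LoadAs<uint16_t>(base, S_ValG);         // like ldrh
    return sum;
}
```

The choice of load matters for ValF: the byte 0xC8 reads as 200 when zero-extended but as -56 when sign-extended, so using the wrong load would change the sum.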

You will see other examples of assembly language structure use later in this book.

Summary

Here are the key learning points for Chapter 4:

The address of an element in a one-dimensional array can be calculated using the base address (i.e., the address of the first element) of the array, the index of the element, and the size in bytes of each element. The address of an element in a two-dimensional array can be calculated using the base address of the array, the row and column indices, the number of columns, and the size in bytes of each element.
Post-indexed addressing (e.g., ldr r1,[r0],#4) is often used to implement a for-loop that processes the elements of an array that contains 32-bit wide integers. Post-indexed addressing can also be used for arrays containing 8- and 16-bit wide integers.
A function can use the lsl operator in a ldr instruction (e.g., ldr r2,[r0,r1,lsl #2]) to load array element x[i] into a register. In this example, R0 contains the address of array x and R1 contains the index i.
A function can use the instruction pair ldr r0,=VarName and ldr r0,[r0] to load the value of C++ global variable VarName into register R0.
Functions can use the ldmda, ldmia, stmda, and stmia instructions to load multiple elements from or store multiple elements to an array.
Assembly language load and store instructions can reference members of a structure in memory using .equ directives and the output of the C++ offsetof macro.
