© Daniel Kusswurm 2020
D. Kusswurm, Modern Arm Assembly Language Programming, https://doi.org/10.1007/978-1-4842-6267-2_4

4. Armv8-32 Core Programming – Part 3

Daniel Kusswurm, Geneva, IL, USA

The content of this and the previous two chapters can be regarded as a trilogy of Armv8-32 assembly language fundamentals. In Chapters 2 and 3, you learned how to perform integer arithmetic, carry out data load and store operations, manipulate the stack, and programmatically exploit the NZCV condition flags. You also acquired useful knowledge about the GNU C++ calling convention and the GNU assembler.

The chapter you are about to read imparts additional Armv8-32 assembly language programming concepts that complete the trilogy. It begins with a section that elucidates array use in an assembly language function. This is followed by a section that covers matrices and the programming techniques necessary to properly access the elements of a matrix. The final section of Chapter 4 explicates additional load and store instructions. It also explains how to reference the members of a C++ structure in an assembly language function.

Integer Arrays

Arrays are an indispensable data construct in virtually all programming languages. In C++, there is an inherent connection between arrays and pointers since the name of an array is essentially a pointer to its first element. Moreover, whenever an array variable name is used as a C++ function parameter, a pointer is passed instead of duplicating the array on the stack. In C++, one-dimensional arrays are stored in a contiguous block of memory that can be statically allocated at compile time or dynamically allocated during program execution. The elements of a C++ array are accessed using zero-based indexing, which means that valid indices for an array of size N range from 0 to N - 1.
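The relationship between an array name and a pointer can be checked with a short C++ sketch. The function names SumViaPointer and NameDecaysToFirstElement, along with the sample values, are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: receives the array as a pointer to its first element
// (the array itself is not copied) and sums elements 0 through n - 1.
int SumViaPointer(const int* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    int sum = 0;
    for (std::size_t i = 0; i < n; i++)
        sum += p[i];            // p[i] is equivalent to *(p + i)
    return sum;
}

// Returns true if the array name, once converted to a pointer,
// refers to the same address as element 0.
bool NameDecaysToFirstElement()
{
    int a[4] {10, 20, 30, 40};
    const int* p = a;           // array-to-pointer conversion
    return p == &a[0] && *(p + 3) == a[3];
}
```

Calling SumViaPointer(a, 4) with an array a of four elements visits indices 0 through 3, which matches the zero-based indexing rule stated above.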

The source code examples in this section illustrate assembly language code that processes arrays. The first example explains how to perform simple arithmetic using the elements of an integer array. The second example demonstrates arithmetic using elements from multiple arrays.

Array Arithmetic

Listing 4-1 shows the source code for example Ch04_01. This example illustrates how to access the elements of an integer array. It also explains additional forms of the ldr instruction.
//-------------------------------------------------
//               Ch04_01.cpp
//-------------------------------------------------
#include <iostream>
#include <iomanip>
#include <cstdint>
using namespace std;
extern "C" int CalcSumA_(const int* x, int n);
extern "C" uint64_t CalcSumB_(const uint32_t* x, uint32_t n);
int CalcSumA(const int* x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += *x++;
    return sum;
}
uint64_t CalcSumB(const uint32_t* y, uint32_t n)
{
    uint64_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        sum += y[i];
    return sum;
}
int main()
{
    const char nl = '\n';
    int x[] {3, 17, -13, 25, -2, 9, -6, 12, 88, -19};
    int nx = sizeof(x) / sizeof(int);
    uint32_t y[] = {0x10000000, 0x20000000, 0x40000000, 0x80000000,
                    0x50000000, 0x70000000, 0x90000000, 0xC0000000};
    uint32_t ny = sizeof(y) / sizeof(uint32_t);
    // Calculate sum of elements in array x
    cout << "Results for CalcSumA" << nl;
    for (int i = 0; i < nx; i++)
        cout << "x[" << i << "] = " << x[i] << nl;
    int sum_x1 = CalcSumA(x, nx);
    int sum_x2 = CalcSumA_(x, nx);
    cout << "sum_x1 = " << sum_x1 << nl;
    cout << "sum_x2 = " << sum_x2 << nl << nl;
    // Calculate sum of elements in array y
    cout << "Results for CalcSumB" << nl;
    for (uint32_t i = 0; i < ny; i++)
        cout << "y[" << i << "] = " << y[i] << nl;
    uint64_t sum_y1 = CalcSumB(y, ny);
    uint64_t sum_y2 = CalcSumB_(y, ny);
    cout << "sum_y1 = " << sum_y1 << nl;
    cout << "sum_y2 = " << sum_y2 << nl << nl;
    return 0;
}
//-------------------------------------------------
//               Ch04_01_.s
//-------------------------------------------------
// extern "C" int CalcSumA_(const int* x, int n);
            .text
            .global CalcSumA_
CalcSumA_:
            mov r2,#0                           // sum = 0
            cmp r1,#0                           // is n <= 0?
            ble DoneA                           // jump if n <= 0
LoopA:      ldr r3,[r0],#4                      // r3 = *r0; r0 += 4
            add r2,r2,r3                        // add current x to sum
            subs r1,r1,#1                       // n -= 1
            bne LoopA                           // jump if more data
DoneA:      mov r0,r2                           // r0 = final sum
            bx lr
// extern "C" uint64_t CalcSumB_(const uint32_t* x, uint32_t n);
            .global CalcSumB_
CalcSumB_:
            push {r4,r5}
            mov r2,#0
            mov r3,#0                           // sum (r2:r3) = 0
            cmp r1,#0                           // is n == 0?
            beq DoneB                           // jump if n == 0
            mov r4,#0                           // i = 0
LoopB:      ldr r5,[r0,r4,lsl #2]               // r5 = x[i]
            adds r2,r2,r5
            adc r3,r3,#0                        // sum += x[i]
            add r4,#1                           // i += 1
            cmp r4,r1                           // is i == n?
            bne LoopB                           // jump if more data
DoneB:      mov r0,r2
            mov r1,r3                           // r0:r1 = final 64-bit sum
            pop {r4,r5}
            bx lr
Listing 4-1. Example Ch04_01

Near the top of the C++ code are the now-familiar declarations for the assembly language functions CalcSumA_ and CalcSumB_. Both functions sum the elements of an array. Note that the declaration of CalcSumB_ uses the fixed-size unsigned integer types uint64_t and uint32_t that are declared in the header file <cstdint> instead of the normal unsigned long long and unsigned int. Some assembly language programmers (including me) prefer to use fixed-size integer types for assembly language function declarations since doing so accentuates the exact size of each argument.

The function CalcSumA_ begins with a mov r2,#0 instruction that initializes sum to zero. The cmp r1,#0 and ble DoneA instructions prevent execution of for-loop LoopA if n <= 0 is true. Sweeping through the array to sum the elements requires only four instructions. The ldr r3,[r0],#4 instruction loads the array element pointed to by R0 into R3. It then adds 4 to R0, which points it to the next array element. This is an example of post-indexed addressing (see Table 1-5). The next instruction, add r2,r2,r3, adds the current array element to sum in R2. The subs r1,r1,#1 instruction subtracts one from n and also sets the NZCV condition flags, which allows the ensuing bne LoopA instruction to terminate LoopA when n equals zero.

The function CalcSumB_ sums the elements of a uint32_t array and returns a result of type uint64_t. This function starts by setting registers R2 and R3 to zero. Function CalcSumB_ uses this register pair to hold an intermediate 64-bit sum. The number of array elements n is then tested to make sure it is not equal to zero. The mov r4,#0 instruction then sets the array index variable i to zero.

CalcSumB_ uses a different technique than CalcSumA_ to sum the elements of the target array. The first instruction of for-loop LoopB, ldr r5,[r0,r4,lsl #2], loads array element x[i] into R5. In this instruction, the address of source operand x[i] is R0 + (R4 << 2) (R0 contains the address of array x and R4 contains index variable i). Register R4 is left shifted by 2 bits since the size of each element of array x is 4 bytes. Note that this form of the ldr instruction does not modify the value in either R0 or R4.
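The address arithmetic R0 + (R4 << 2) can be modeled in C++ as a rough sketch. The function names ScaledAddress and LoadElement are hypothetical, and the pointer-to-integer round trip relies on implementation-defined behavior that holds on mainstream platforms.

```cpp
#include <cassert>
#include <cstdint>

// Models the effective address used by ldr r5,[r0,r4,lsl #2]:
// base address plus (index shifted left by 2 bits).
std::uintptr_t ScaledAddress(std::uintptr_t base, std::uint32_t index)
{
    return base + (static_cast<std::uintptr_t>(index) << 2);
}

// Loads x[i] through the computed address. Neither the base pointer nor
// the index is modified, which mirrors the behavior of this ldr form.
std::uint32_t LoadElement(const std::uint32_t* x, std::uint32_t i)
{
    assert(x != nullptr);
    auto base = reinterpret_cast<std::uintptr_t>(x);
    auto addr = ScaledAddress(base, i);
    return *reinterpret_cast<const std::uint32_t*>(addr);
}
```

For a 4-byte element type, ScaledAddress(base, i) yields the same address as &x[i], which is why a shift count of 2 is used.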

The next instruction, adds r2,r2,r5, adds x[i] to the low-order 32 bits of the intermediate sum that is maintained in register pair R2:R3. The adds instruction also sets the C condition flag to one if an unsigned overflow occurs when adding x[i] to the running sum; otherwise, C is set to zero. The ensuing adc r3,r3,#0 (add with carry) instruction adds the value of the C condition flag to the high-order 32 bits of the 64-bit running sum. The adds/adc instruction pair is often used to perform 64-bit integer addition as demonstrated in this function.
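The adds/adc pairing can be imitated in C++ by splitting a 64-bit accumulator into two 32-bit halves. This is only a sketch of the idea; the type Sum64 and the functions AddWithCarry, ToUint64, and SumArray are hypothetical names that do not appear in the example code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 64-bit accumulator held in two 32-bit halves,
// analogous to register pair R2 (low) and R3 (high) in CalcSumB_.
struct Sum64
{
    std::uint32_t lo;
    std::uint32_t hi;
};

// Adds a 32-bit value to the 64-bit accumulator in two steps.
void AddWithCarry(Sum64& sum, std::uint32_t value)
{
    std::uint32_t old_lo = sum.lo;
    sum.lo += value;                                    // like adds: wraps modulo 2^32
    std::uint32_t carry = (sum.lo < old_lo) ? 1 : 0;    // models the C flag
    sum.hi += carry;                                    // like adc r3,r3,#0
}

// Combines the two halves into a single 64-bit value.
std::uint64_t ToUint64(const Sum64& sum)
{
    return (static_cast<std::uint64_t>(sum.hi) << 32) | sum.lo;
}

// Sums n unsigned 32-bit elements using the two-step addition above.
std::uint64_t SumArray(const std::uint32_t* x, std::uint32_t n)
{
    assert(x != nullptr || n == 0);
    Sum64 sum {0, 0};
    for (std::uint32_t i = 0; i < n; i++)
        AddWithCarry(sum, x[i]);
    return ToUint64(sum);
}
```

The wraparound test sum.lo < old_lo is true exactly when the 32-bit addition produced a carry out, which is the condition that sets the C flag after adds.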

Following the 64-bit addition, CalcSumB_ uses an add r4,#1 instruction, which adds one to the array index i that is maintained in R4. The next two instructions, cmp r4,r1 and bne LoopB, test i and repeat LoopB if i != n is true. Following the summing loop, the final 64-bit sum in register pair R2:R3 is copied to R0:R1 so that it can be passed back to the calling function. Here is the output for source code example Ch04_01:
Results for CalcSumA
x[0] = 3
x[1] = 17
x[2] = -13
x[3] = 25
x[4] = -2
x[5] = 9
x[6] = -6
x[7] = 12
x[8] = 88
x[9] = -19
sum_x1 = 114
sum_x2 = 114
Results for CalcSumB
y[0] = 268435456
y[1] = 536870912
y[2] = 1073741824
y[3] = 2147483648
y[4] = 1342177280
y[5] = 1879048192
y[6] = 2415919104
y[7] = 3221225472
sum_y1 = 12884901888
sum_y2 = 12884901888

Array Arithmetic Using Multiple Arrays

Listing 4-2 shows the source code for example Ch04_02. This example demonstrates how to carry out calculations using elements from multiple arrays. It also illustrates how to reference and use a C++ global variable in an assembly language function.
//-------------------------------------------------
//               Ch04_02.cpp
//-------------------------------------------------
#include <iostream>
#include <iomanip>
#include <random>
#include <cstdint>
using namespace std;
int32_t g_Val1 = 2;
int32_t g_Val2 = 100;
extern "C" int32_t CalcZ_(int32_t* z, const int8_t* x, const int16_t* y, int32_t n);
void Init(int8_t* x, int16_t* y, int32_t n)
{
    unsigned int seed = 7;
    uniform_int_distribution<> dist {-128, 127};
    mt19937 rng {seed};
    for (int32_t i = 0; i < n; i++)
    {
        x[i] = (int8_t)dist(rng);
        y[i] = (int16_t)dist(rng);
    }
}
int32_t CalcZ(int32_t* z, const int8_t* x, const int16_t* y, int32_t n)
{
    int32_t sum = 0;
    for (int32_t i = 0; i < n; i++)
    {
        int32_t temp;
        if (x[i] < 0)
            temp = y[i] * g_Val1;
        else
            temp = y[i] * g_Val2;
        sum += temp;
        z[i] = temp;
    }
    return sum;
}
int main()
{
    const int32_t n = 12;
    int8_t x[n];
    int16_t y[n];
    int32_t z1[n], z2[n];
    Init(x, y, n);
    int32_t sum_z1 = CalcZ(z1, x, y, n);
    int32_t sum_z2 = CalcZ_(z2, x, y, n);
    const char nl = '\n';
    const char* sep = "  ";
    for (int32_t i = 0; i < n; i++)
    {
        cout << "i: " << setw(2) << i << sep;
        cout << "x: " << setw(5) << (int)x[i] << sep;
        cout << "y: " << setw(5) << y[i] << sep;
        cout << "z1: " << setw(7) << z1[i] << sep;
        cout << "z2: " << setw(7) << z2[i] << nl;
    }
    cout << nl;
    cout << "sum_z1 = " << sum_z1 << nl;
    cout << "sum_z2 = " << sum_z2 << nl;
    return 0;
}
//-------------------------------------------------
//               Ch04_02_.s
//-------------------------------------------------
// extern "C" int32_t CalcZ_(int32_t* z, const int8_t* x, const int16_t* y, int32_t n);
            .text
            .global CalcZ_
CalcZ_:     push {r4-r9}
            mov r4,#0                           // sum = 0
            cmp r3,#0
            ble Done                            // jump if n <= 0
            ldr r5,=g_Val1
            ldr r5,[r5]                         // r5 = g_Val1
            ldr r6,=g_Val2
            ldr r6,[r6]                         // r6 = g_Val2
// Main processing loop
Loop1:      ldrsb r7,[r1],#1                    // r7 = x[i]
            ldrsh r8,[r2],#2                    // r8 = y[i]
            cmp r7,#0                           // is x[i] < 0?
            mullt r9,r8,r5                      // temp = y[i] * g_Val1
                                                // (if x[i] < 0)
            mulge r9,r8,r6                      // temp = y[i] * g_Val2
                                                // (if x[i] >= 0)
            add r4,r4,r9                        // sum += temp
            str r9,[r0],#4                      // save result z[i]
            subs r3,#1                          // n -= 1
            bne Loop1                           // repeat until done
Done:       mov r0,r4                           // r0 = final sum
            pop {r4-r9}
            bx lr
Listing 4-2. Example Ch04_02

The C++ code in Listing 4-2 starts with the definition of global variables g_Val1 and g_Val2. These values are used in functions CalcZ and CalcZ_. Following the declaration of CalcZ_ is a function named Init, which initializes the test arrays for this example using random numbers. This function uses the C++ Standard Template Library (STL) classes uniform_int_distribution and mt19937 to generate random values for the array. Appendix B contains a list of references that you can consult if you are interested in learning more about these classes. The definition of function CalcZ is next. This function performs some admittedly contrived arithmetic for demonstration purposes. Note that different integer types are used for the arrays x, y, and z. The remaining C++ code performs test case initialization, exercises the functions CalcZ and CalcZ_, and displays the results.

The first nonprologue instruction of CalcZ_ is a mov r4,#0, which initializes sum to zero. The value of n is then tested to make sure it is greater than zero. The next instruction, ldr r5,=g_Val1, loads the address of g_Val1 into R5. This is followed by a ldr r5,[r5] instruction that loads g_Val1 into R5. Function CalcZ_ uses a similar sequence of instructions to load g_Val2 into R6.

Each iteration of for-loop Loop1 begins with a ldrsb r7,[r1],#1 instruction that loads x[i] into R7. Note that a post-indexed offset value of one is used since array x is of type int8_t. The ldrsh r8,[r2],#2 instruction loads y[i] into R8. This instruction uses a post-indexed offset value of two since array y is of type int16_t. The ensuing cmp r7,#0 sets the NZCV condition flags. The next instruction, mullt r9,r8,r5, calculates temp = y[i] * g_Val1 only if x[i] < 0 is true. Otherwise, no operation is performed. The mullt instruction is an example of an A32 conditional instruction that was discussed in Chapter 3. Following the mullt instruction is another conditionally executed instruction, mulge r9,r8,r6, which calculates temp = y[i] * g_Val2 only if x[i] >= 0 is true.
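The sign-extending loads and the conditionally executed multiplications can be approximated in C++ as follows. This is a sketch under the assumption that a plain integer conversion adequately models ldrsb and ldrsh; the function names LoadSignedByte, LoadSignedHalf, and SelectProduct are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models ldrsb: reads one byte and sign-extends it to 32 bits.
std::int32_t LoadSignedByte(const std::int8_t* p)
{
    assert(p != nullptr);
    return static_cast<std::int32_t>(*p);
}

// Models ldrsh: reads one halfword and sign-extends it to 32 bits.
std::int32_t LoadSignedHalf(const std::int16_t* p)
{
    assert(p != nullptr);
    return static_cast<std::int32_t>(*p);
}

// Models the cmp / mullt / mulge sequence: exactly one of the two
// multiplications takes effect, selected by the sign of x.
std::int32_t SelectProduct(std::int32_t x, std::int32_t y,
                           std::int32_t val1, std::int32_t val2)
{
    return (x < 0) ? y * val1 : y * val2;
}
```

Since the lt and ge conditions are complementary, one of the two mul instructions always writes R9, so no branch is needed to implement the if-else statement of CalcZ.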

The add r4,r4,r9 instruction updates the sum that is maintained in R4. This is followed by a str r9,[r0],#4 instruction that saves temp to z[i]. This instruction uses a post-indexed offset value of four since array z is of type int32_t. The processing for-loop Loop1 repeats until all elements have been examined. Here is the output for source code example Ch04_02:
i:  0  x:  -109  y:   -70  z1:    -140  z2:    -140
i:  1  x:    71  y:   -47  z1:   -4700  z2:   -4700
i:  2  x:   -16  y:   122  z1:     244  z2:     244
i:  3  x:    57  y:   -12  z1:   -1200  z2:   -1200
i:  4  x:   122  y:   -50  z1:   -5000  z2:   -5000
i:  5  x:     9  y:   -61  z1:   -6100  z2:   -6100
i:  6  x:     0  y:  -106  z1:  -10600  z2:  -10600
i:  7  x:  -110  y:   -21  z1:     -42  z2:     -42
i:  8  x:   -60  y:  -124  z1:    -248  z2:    -248
i:  9  x:    -1  y:     7  z1:      14  z2:      14
i: 10  x:    45  y:    94  z1:    9400  z2:    9400
i: 11  x:    77  y:   -44  z1:   -4400  z2:   -4400
sum_z1 = -22772
sum_z2 = -22772

Integer Matrices

C++ also uses a contiguous block of memory to implement a two-dimensional array or matrix. The elements of a C++ matrix in memory are organized using row-major ordering. Row-major ordering arranges the elements of a matrix first by row and then by column. For example, elements of the matrix int x[3][2] are stored in consecutive memory locations as follows: x[0][0], x[0][1], x[1][0], x[1][1], x[2][0], and x[2][1]. Figure 4-1 illustrates this memory ordering scheme. In order to access a specific element in the matrix, a function (or a compiler) must know the starting address of the matrix (i.e., the address of its first element), the row and column indices, the total number of columns, and the size in bytes of each element. Using this information, a function can use simple addition and multiplication to access a specific element in a matrix as exemplified by the source code examples in this section.
Figure 4-1. Row-major ordering for matrix int x[3][2]
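The row-major offset arithmetic can be expressed in a few lines of C++. The following is a minimal sketch; the function names RowMajorIndex, RowMajorByteOffset, and GetElement are hypothetical and are not part of the example code in this chapter.

```cpp
#include <cassert>
#include <cstddef>

// Element index of x[row][col] in a row-major matrix with ncols columns.
std::size_t RowMajorIndex(std::size_t row, std::size_t col, std::size_t ncols)
{
    assert(col < ncols);
    return row * ncols + col;
}

// Byte offset of the same element from the start of the matrix.
std::size_t RowMajorByteOffset(std::size_t row, std::size_t col,
                               std::size_t ncols, std::size_t elem_size)
{
    return RowMajorIndex(row, col, ncols) * elem_size;
}

// Reads x[row][col] from a matrix passed as a pointer to its first element.
int GetElement(const int* x, std::size_t row, std::size_t col, std::size_t ncols)
{
    assert(x != nullptr);
    return x[RowMajorIndex(row, col, ncols)];
}
```

For the matrix int x[3][2], element x[2][1] has index 2 * 2 + 1 = 5, which makes it the sixth and final element in memory. Its byte offset is 5 * sizeof(int).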

Accessing Matrix Elements

Listing 4-3 shows the source code for example Ch04_03, which demonstrates how to use assembly language to access the elements of a matrix. In this example, the functions CalcMatrixSquares and CalcMatrixSquares_ perform the following matrix calculation: y[i][j] = x[j][i] * x[j][i]. Note that in this expression, the indices i and j for matrix x are intentionally reversed to make the code in this example a little more interesting.
//-------------------------------------------------
//               Ch04_03.cpp
//-------------------------------------------------
#include <iostream>
#include <iomanip>
using namespace std;
extern "C" void CalcMatrixSquares_(int* y, const int* x, int m, int n);
void CalcMatrixSquares(int* y, const int* x, int m, int n)
{
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < n; j++)
        {
            int kx = j * m + i;
            int ky = i * n + j;
            y[ky] = x[kx] * x[kx];
        }
    }
}
int main()
{
    const int m = 6;
    const int n = 3;
    int y1[m][n], y2[m][n];
    int x[n][m] {{ 1, 2, 3, 4, 5, 6 },
                  { 7, 8, 9, 10, 11, 12 },
                  { 13, 14, 15, 16, 17, 18 }};
    CalcMatrixSquares(&y1[0][0], &x[0][0], m, n);
    CalcMatrixSquares_(&y2[0][0], &x[0][0], m, n);
    for (int i = 0; i < m; i++)
    {
        for (int j = 0; j < n; j++)
        {
            cout << "y1[" << setw(2) << i << "][" << setw(2) << j << "] = ";
            cout << setw(6) << y1[i][j] << ' ';
            cout << "y2[" << setw(2) << i << "][" << setw(2) << j << "] = ";
            cout << setw(6) << y2[i][j] << ' ';
            cout << "x[" << setw(2) << j << "][" << setw(2) << i << "] = ";
            cout << setw(6) << x[j][i] << '\n';
            if (y1[i][j] != y2[i][j])
               cout << "Compare failed\n";
        }
    }
    return 0;
}
//-------------------------------------------------
//               Ch04_03_.s
//-------------------------------------------------
// extern "C" void CalcMatrixSquares_(int* y, const int* x, int m, int n);
            .text
            .global CalcMatrixSquares_
CalcMatrixSquares_:
            push {r4-r8}
            cmp r2,#0
            ble Done                            // jump if m <= 0
            cmp r3,#0
            ble Done                            // jump if n <= 0
            mov r4,#0                           // i = 0
Loop1:      mov r5,#0                           // j = 0
Loop2:      mov r6,r5                           // r6 = j
            mul r6,r6,r2                        // r6 = j * m
            add r6,r6,r4                        // kx = j * m + i
            ldr r7,[r1,r6,lsl #2]               // r7 = x[kx] (x[j][i])
            mul r7,r7,r7                        // r7 = x[j][i] * x[j][i]
            mov r8,r4                           // r8 = i
            mul r8,r8,r3                        // r8 = i * n
            add r8,r8,r5                        // ky = i * n + j
            str r7,[r0,r8,lsl #2]               // save y[ky] (y[i][j])
            add r5,#1                           // j += 1
            cmp r5,r3
            blt Loop2                           // jump if j < n
            add r4,#1                           // i += 1
            cmp r4,r2
            blt Loop1                           // jump if i < m
Done:       pop {r4-r8}
            bx lr
Listing 4-3. Example Ch04_03

The function CalcMatrixSquares illustrates how to access an element in a C++ matrix using explicit arithmetic. At entry to this function, arguments x and y point to the memory blocks that contain their respective matrices. Inside the second for-loop, the expression kx = j * m + i calculates the offset necessary to access element x[j][i]. Similarly, the expression ky = i * n + j calculates the offset for element y[i][j]. Note that the code employed in CalcMatrixSquares to calculate kx and ky requires x to be a matrix of size n × m and y to be a matrix of size m × n.

The assembly language function CalcMatrixSquares_ uses the same technique as the C++ code to access elements in matrices x and y. This function begins its execution by checking argument values m and n to make sure they are greater than zero. A mov r4,#0 instruction is then used to initialize index i to zero. Each iteration of for-loop Loop1 starts with a mov r5,#0 instruction that sets index j to zero. The ensuing mov r6,r5, mul r6,r6,r2, and add r6,r6,r4 instructions calculate kx = j * m + i. This is followed by a ldr r7,[r1,r6,lsl #2] instruction that loads x[j][i] into R7. The mul r7,r7,r7 instruction calculates x[j][i] * x[j][i].

Function CalcMatrixSquares_ employs a similar sequence of instructions to calculate the address of y[i][j]. Variable ky is calculated using the instructions mov r8,r4, mul r8,r8,r3, and add r8,r8,r5. The str r7,[r0,r8,lsl #2] instruction then saves the previously calculated squared result to y[i][j]. Like the corresponding C++ code, the nested for-loops in CalcMatrixSquares_ continue to execute until the index counters i and j (registers R4 and R5) reach their respective termination values. Here is the output for source code example Ch04_03:
y1[ 0][ 0] =      1 y2[ 0][ 0] =      1 x[ 0][ 0] =      1
y1[ 0][ 1] =     49 y2[ 0][ 1] =     49 x[ 1][ 0] =      7
y1[ 0][ 2] =    169 y2[ 0][ 2] =    169 x[ 2][ 0] =     13
y1[ 1][ 0] =      4 y2[ 1][ 0] =      4 x[ 0][ 1] =      2
y1[ 1][ 1] =     64 y2[ 1][ 1] =     64 x[ 1][ 1] =      8
y1[ 1][ 2] =    196 y2[ 1][ 2] =    196 x[ 2][ 1] =     14
y1[ 2][ 0] =      9 y2[ 2][ 0] =      9 x[ 0][ 2] =      3
y1[ 2][ 1] =     81 y2[ 2][ 1] =     81 x[ 1][ 2] =      9
y1[ 2][ 2] =    225 y2[ 2][ 2] =    225 x[ 2][ 2] =     15
y1[ 3][ 0] =     16 y2[ 3][ 0] =     16 x[ 0][ 3] =      4
y1[ 3][ 1] =    100 y2[ 3][ 1] =    100 x[ 1][ 3] =     10
y1[ 3][ 2] =    256 y2[ 3][ 2] =    256 x[ 2][ 3] =     16
y1[ 4][ 0] =     25 y2[ 4][ 0] =     25 x[ 0][ 4] =      5
y1[ 4][ 1] =    121 y2[ 4][ 1] =    121 x[ 1][ 4] =     11
y1[ 4][ 2] =    289 y2[ 4][ 2] =    289 x[ 2][ 4] =     17
y1[ 5][ 0] =     36 y2[ 5][ 0] =     36 x[ 0][ 5] =      6
y1[ 5][ 1] =    144 y2[ 5][ 1] =    144 x[ 1][ 5] =     12
y1[ 5][ 2] =    324 y2[ 5][ 2] =    324 x[ 2][ 5] =     18

Row-Column Sums

Listing 4-4 shows the source code for example Ch04_04, which demonstrates how to sum the rows and columns of an integer matrix. The C++ code for this example begins with a function named Init that initializes a test matrix with random values. The function CalcMatrixRowColSums is a C++ implementation of the row-column summing algorithm. This function sweeps through matrix x using a set of nested for-loops. During each inner loop iteration, CalcMatrixRowColSums adds matrix element x[i][j] to both row_sums_temp and col_sums[j]. The outer for-loop then saves row_sums_temp to row_sums[i]. Function CalcMatrixRowColSums also uses the same matrix element offset arithmetic that you saw in the previous example.
//-------------------------------------------------
//               Ch04_04.cpp
//-------------------------------------------------
#include <iostream>
#include <random>
using namespace std;
// Ch04_04_.s
extern "C" bool CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int nrows, int ncols);
// Ch04_04_Misc.cpp
extern void PrintResult(const char* msg, const int* row_sums, const int* col_sums, const int* x, int nrows, int ncols);
void Init(int* x, int nrows, int ncols)
{
    unsigned int seed = 13;
    uniform_int_distribution<> d {1, 200};
    mt19937 rng {seed};
    for (int i = 0; i < nrows * ncols; i++)
        x[i] = d(rng);
}
bool CalcMatrixRowColSums(int* row_sums, int* col_sums, const int* x, int nrows, int ncols)
{
    if (nrows <= 0 || ncols <= 0)
        return false;
    for (int j = 0; j < ncols; j++)
        col_sums[j] = 0;
    for (int i = 0; i < nrows; i++)
    {
        int row_sums_temp = 0;
        int k = i * ncols;
        for (int j = 0; j < ncols; j++)
        {
            int temp = x[k + j];
            row_sums_temp += temp;
            col_sums[j] += temp;
        }
        row_sums[i] = row_sums_temp;
    }
    return true;
}
int main()
{
    const int nrows = 8;
    const int ncols = 6;
    int x[nrows][ncols];
    Init((int*)x, nrows, ncols);
    int row_sums1[nrows], col_sums1[ncols];
    int row_sums2[nrows], col_sums2[ncols];
    const char* msg1 = "Results for CalcMatrixRowColSums";
    const char* msg2 = "Results for CalcMatrixRowColSums_";
    bool rc1 = CalcMatrixRowColSums(row_sums1, col_sums1, (int*)x, nrows, ncols);
    bool rc2 = CalcMatrixRowColSums_(row_sums2, col_sums2, (int*)x, nrows, ncols);
    if (!rc1)
        cout << "\nCalcMatrixRowSums failed\n";
    else
        PrintResult(msg1, row_sums1, col_sums1, (int*)x, nrows, ncols);
    if (!rc2)
        cout << "\nCalcMatrixRowSums_ failed\n";
    else
        PrintResult(msg2, row_sums2, col_sums2, (int*)x, nrows, ncols);
    return 0;
}
//-------------------------------------------------
//               Ch04_04_.s
//-------------------------------------------------
// extern "C" bool CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int nrows, int ncols);
            .text
            .global CalcMatrixRowColSums_
            .equ ARG_NCOLS,32
CalcMatrixRowColSums_:
            push {r4-r11}
            cmp r3,#0
            movle r0,#0                         // set error return code
            ble Done                            // jump if nrows <= 0
            ldr r4,[sp,#ARG_NCOLS]
            cmp r4,#0
            movle r0,#0                         // set error return code
            ble Done                            // jump if ncols <= 0
// Set elements of col_sums to zero
            mov r5,r1                           // r5 = col_sums
            mov r6,r4                           // r6 = ncols
            mov r7,#0
Loop0:      str r7,[r5],#4                      // col_sums[j] = 0
            subs r6,#1                          // j -= 1
            bne Loop0                           // jump if j != 0
// Main processing loops
            mov r5,#0                           // i = 0
Loop1:      mov r6,#0                           // j = 0
            mov r12,#0                          // row_sums_temp = 0
            mul r7,r5,r4                        // r7 = i * ncols
Loop2:      add r8,r7,r6                        // r8 = i * ncols + j
            ldr r9,[r2,r8,lsl #2]               // r9 = x[i][j]
// Update row_sums and col_sums using current x[i][j]
            add r12,r12,r9                      // row_sums_temp += x[i][j]
            add r10,r1,r6,lsl #2                // r10 = ptr to col_sums[j]
            ldr r11,[r10]                       // r11 = col_sums[j]
            add r11,r11,r9                      // col_sums[j] += x[i][j]
            str r11,[r10]                       // save col_sums[j]
            add r6,r6,#1                        // j += 1
            cmp r6,r4
            blt Loop2                           // jump if j < ncols
            str r12,[r0],#4                     // save row_sums[i]
            add r5,r5,#1                        // i += 1
            cmp r5,r3
            blt Loop1                           // jump if i < nrows
            mov r0,#1                           // set success return code
Done:       pop {r4-r11}
            bx lr
Listing 4-4. Example Ch04_04

The assembly language function CalcMatrixRowColSums_ implements the same algorithm as its C++ counterpart. Following preservation of non-volatile registers R4–R11 on the stack, arguments nrows and ncols are tested for validity. Note that ncols, the fifth argument, was passed via the stack. Since the push {r4-r11} instruction stores eight 4-byte registers, ncols is located 32 bytes above SP, which is the value of ARG_NCOLS. Also note the two uses of the movle r0,#0 instruction, which load R0 with the correct return code if either nrows or ncols is invalid. For-loop Loop0 then initializes each element in col_sums to zero.

Prior to the start of for-loop Loop1, a mov r5,#0 instruction sets index i equal to zero. Each Loop1 iteration begins with the instruction pair mov r6,#0 and mov r12,#0. These instructions initialize both index j (R6) and row_sums_temp (R12) to zero. The next instruction, mul r7,r5,r4, calculates i * ncols. At the start of for-loop Loop2, an add r8,r7,r6 instruction calculates the offset i * ncols + j for matrix element x[i][j]. The ensuing ldr r9,[r2,r8,lsl #2] instruction loads x[i][j] into R9 as illustrated in Figure 4-2. This is followed by an add r12,r12,r9 instruction that calculates row_sums_temp += x[i][j].
Figure 4-2. Update instructions for row_sums and col_sums in function CalcMatrixRowColSums_

The add r10,r1,r6,lsl #2 instruction computes the address of col_sums[j]. In this instruction, which calculates R10 = R1 + (R6 << 2), R1 contains a pointer to the array col_sums and R6 contains index j. The ensuing instruction triplet, ldr r11,[r10], add r11,r11,r9, and str r11,[r10], adds x[i][j] to col_sums[j]. For-loop Loop2 continues to repeat so long as j < ncols is true. Following each Loop2 execution cycle, the str r12,[r0],#4 instruction saves row_sums_temp (R12) to row_sums[i]. Index i is then updated and Loop1 repeats while i < nrows is true. Here is the output for source code example Ch04_04:
Results for CalcMatrixRowColSums
------------------------------------------------
   156   122    48   172   165   179     842
   194    36   195   152    91   151     819
   122   122   156   159   129    78     766
   145    69     8   133    60   182     597
    12   145   172    95    75   100     599
   136    90    52   107    70   199     654
     2   123    72    37   190   145     569
    44   188    64   108   184    95     683
   811   895   767   963   964  1129
Results for CalcMatrixRowColSums_
------------------------------------------------
   156   122    48   172   165   179     842
   194    36   195   152    91   151     819
   122   122   156   159   129    78     766
   145    69     8   133    60   182     597
    12   145   172    95    75   100     599
   136    90    52   107    70   199     654
     2   123    72    37   190   145     569
    44   188    64   108   184    95     683
   811   895   767   963   964  1129

Advanced Programming

The source code examples in this section highlight a few advanced programming techniques. The first example introduces additional load and store instructions that you can use to access the elements of an array. The second example explains how to reference the members of a C++ structure in an assembly language function.

Array Reversal

Source code example Ch04_05 demonstrates a couple of array reversal techniques that use the ldmda (load multiple decrement after), ldmia (load multiple increment after), stmda (store multiple decrement after), and stmia (store multiple increment after) instructions. These instructions load or store multiple registers. Listing 4-5 shows the C++ and assembly language source code for example Ch04_05.
//-------------------------------------------------
//               Ch04_05.cpp
//-------------------------------------------------
#include <iostream>
#include <iomanip>
#include <random>
using namespace std;
extern "C" void ReverseArrayA_(int* y, const int* x, int n);
extern "C" void ReverseArrayB_(int* x, int n);
void Init(int* x, int n, unsigned int seed)
{
    uniform_int_distribution<> d {1, 1000};
    mt19937 rng {seed};
    for (int i = 0; i < n; i++)
        x[i] = d(rng);
}
void PrintArray(const char* msg, const int* x, int n)
{
    const char nl = '\n';
    cout << nl << msg << nl;
    for (int i = 0; i < n; i++)
    {
        cout << setw(5) << x[i];
        if ((i + 1) % 10 == 0)
            cout << nl;
    }
    cout << nl;
}
void ReverseArrayA(void)
{
    const int n = 25;
    int x[n], y[n];
    Init(x, n, 32);
    PrintArray("ReverseArrayA - original array x", x, n);
    ReverseArrayA_(y, x, n);
    PrintArray("ReverseArrayA - reversed array y", y, n);
}
void ReverseArrayB(void)
{
    const int n = 25;
    int x[n];
    Init(x, n, 32);
    PrintArray("ReverseArrayB - array x before reversal", x, n);
    ReverseArrayB_(x, n);
    PrintArray("ReverseArrayB - array x after reversal", x, n);
}
int main()
{
    ReverseArrayA();
    ReverseArrayB();
    return 0;
}
//-------------------------------------------------
//               Ch04_05_.s
//-------------------------------------------------
// extern "C" void ReverseArrayA_(int* y, const int* x, int n);
            .text
            .global ReverseArrayA_
ReverseArrayA_:
            push {r4-r11}
// Initialize
            add r1,r1,r2,lsl #2
            sub r1,#4                           // r1 points to x[n - 1]
            cmp r2,#4
            blt SkipLoopA                       // jump if n < 4
// Main loop
LoopA:      ldmda r1!,{r4-r7}                   // r4 = *(r1 - 12)
                                                // r5 = *(r1 - 8)
                                                // r6 = *(r1 - 4)
                                                // r7 = *r1
                                                // r1 -= 16
            mov r8,r7                           // reorder elements in
            mov r9,r6                           // r4 - r7 for use with
            mov r10,r5                          // stmia instruction
            mov r11,r4
            stmia r0!,{r8-r11}                  // *r0 = r8
                                                // *(r0 + 4) = r9
                                                // *(r0 + 8) = r10
                                                // *(r0 + 12) = r11
                                                // r0 += 16
            sub r2,#4                           // n -= 4
            cmp r2,#4
            bge LoopA                           // jump if n >= 4
// Process remaining (0 - 3) array elements
SkipLoopA:  cmp r2,#0
            ble DoneA                           // jump if no more elements
            ldr r4,[r1],#-4                     // load single element from x
            str r4,[r0],#4                      // save element to y
            subs r2,#1                          // n -= 1
            beq DoneA                           // jump if n == 0
            ldr r4,[r1],#-4                     // load single element from x
            str r4,[r0],#4                      // save element to y
            subs r2,#1                          // n -= 1
            beq DoneA                           // jump if n == 0
            ldr r4,[r1]                         // load final element from x
            str r4,[r0]                         // save final element to y
DoneA:      pop {r4-r11}
            bx lr
// extern "C" void ReverseArrayB_(int* x, int n);
            .global ReverseArrayB_
ReverseArrayB_:
            push {r4-r11}
// Initialize
            mov r2,r1                           // r2 = n
            add r1,r0,r2,lsl #2
            sub r1,#4                           // r1 points to x[n - 1]
            cmp r2,#4
            blt SkipLoopB                       // jump if n < 4
LoopB:      ldmia r0,{r4,r5}                    // r4 = *r0, r5 = *(r0 + 4)
            ldmda r1,{r6,r7}                    // r6 = *(r1 - 4), r7 = *r1
            mov r8,r7                           // reorder elements
            mov r9,r6                           // for use with stmia and
            mov r10,r5                          // stmda instructions
            mov r11,r4
            stmia r0!,{r8,r9}                   // *r0 = r8, *(r0 + 4) = r9, r0 += 8
            stmda r1!,{r10,r11}                 // *(r1 - 4) = r10, *r1 = r11, r1 -= 8
            sub r2,#4                           // n -= 4
            cmp r2,#4
            bge LoopB                           // jump if n >= 4
// Process remaining (0 - 3) array elements
SkipLoopB:  cmp r2,#1
            ble DoneB                           // jump if done
            ldr r4,[r0]                         // load final element
            ldr r5,[r1]                         // pair into r4:r5
            str r4,[r1]                         // save elements
            str r5,[r0]
DoneB:      pop {r4-r11}
            bx lr
Listing 4-5. Example Ch04_05

The function ReverseArrayA_ copies elements from a source array to a destination array in reverse order. This function requires three parameters: a pointer to destination array y, a pointer to source array x, and the number of elements n. During its initialization phase, ReverseArrayA_ uses the instructions add r1,r1,r2,lsl #2 and sub r1,#4 to calculate the address of the last element in array x. It then checks the value of n to see if it is less than four. If n < 4 is true, ReverseArrayA_ skips over for-loop LoopA. The reason for this is that for-loop LoopA processes four elements during each iteration.

Figure 4-3 illustrates the ldmda/stmia instruction sequence that is used in for-loop LoopA. The first instruction of LoopA, ldmda r1!,{r4-r7}, loads four array elements into registers R4, R5, R6, and R7. This instruction also updates R1 so that it points to the preceding group of four elements. Next is a series of mov instructions that rearrange the elements in registers R4–R7 for use with the stmia instruction. Following the four mov instructions, R8, R9, R10, and R11 contain y[i], y[i+1], y[i+2], and y[i+3], respectively. The ensuing stmia r0!,{r8-r11} saves R8, R9, R10, and R11 to y[i:i+3]. This instruction also updates R0 so that it points to element y[i+4].
Figure 4-3. First execution of ldmda r1!,{r4-r7} and stmia r0!,{r8-r11} instructions in function ReverseArrayA_
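The register-to-address pairing of these two instructions is easy to misread, so a minimal C++ sketch that models it may help. The helper names ModelLdmda4 and ModelStmia4 are hypothetical and exist only for this illustration. The sketch assumes the Arm rule that a load or store multiple instruction always pairs the lowest-numbered register with the lowest memory address, and it uses an element index in place of a byte address.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of "ldmda rn!,{r4-r7}" on an array of 32-bit words.
// 'p' plays the role of Rn, expressed as an element index.
// regs[0..3] stand for R4..R7; the lowest register gets the lowest address.
std::array<int32_t, 4> ModelLdmda4(const int32_t* mem, int& p)
{
    assert(p >= 3);                 // four elements must exist at or below p
    std::array<int32_t, 4> regs { mem[p - 3], mem[p - 2], mem[p - 1], mem[p] };
    p -= 4;                         // writeback: Rn -= 16 bytes
    return regs;
}

// Hypothetical model of "stmia rn!,{r8-r11}".
// regs[0..3] stand for R8..R11 and are stored at ascending addresses.
void ModelStmia4(int32_t* mem, int& p, const std::array<int32_t, 4>& regs)
{
    for (int i = 0; i < 4; i++)
        mem[p + i] = regs[i];
    p += 4;                         // writeback: Rn += 16 bytes
}
```

With this model, swapping the order of the four loaded values before the store (as the mov instructions do) places the highest-addressed source element at the lowest destination address.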

Following execution of the stmia instruction, n is decremented by four and LoopA repeats until n < 4 is true. The block of code that follows LoopA reverses the final few elements of array x using ldr and str instructions. Note that after an element is reversed, n is decremented by one and tested to see if it is equal to zero.
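The overall control flow of ReverseArrayA_ can be sketched in C++ as follows. This is only an illustrative rendering, not the book's code: the function name ReverseArrayA_Sketch is hypothetical, and array indices are used in place of the post-updated pointers in R0 and R1.

```cpp
#include <cassert>

// Hypothetical C++ rendering of the control flow in ReverseArrayA_.
void ReverseArrayA_Sketch(int* y, const int* x, int n)
{
    assert(n >= 0);

    int src = n - 1;            // index of the next source element (end of x)
    int des = 0;                // index of the next destination element
    int remaining = n;

    // Main loop: four elements per iteration, like the ldmda/stmia pair
    while (remaining >= 4)
    {
        y[des]     = x[src];
        y[des + 1] = x[src - 1];
        y[des + 2] = x[src - 2];
        y[des + 3] = x[src - 3];
        src -= 4;
        des += 4;
        remaining -= 4;
    }

    // Remaining 0 - 3 elements: one at a time, like the ldr/str pairs
    while (remaining > 0)
    {
        y[des++] = x[src--];
        remaining--;
    }
}
```

Processing four elements per iteration reduces the number of loop-control instructions executed, which is why the assembly language version needs the separate cleanup code for the last few elements.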

Function ReverseArrayB_ performs an in-place reversal of an integer array. This function begins its execution by initializing R0 and R1 as pointers to the first and last elements of the input array x, respectively. The first instruction of for-loop LoopB, ldmia r0,{r4,r5}, loads two elements from the beginning of array x. The ldmda r1,{r6,r7} instruction that follows loads two elements from the end of array x. Following a series of mov instructions that rearrange these elements, ReverseArrayB_ uses the instructions stmia r0!,{r8,r9} and stmda r1!,{r10,r11} to finalize a four-element reversal as shown in Figure 4-4.
Figure 4-4. Four-element reversal using ldmia, ldmda, stmia, and stmda instructions

The block of code that follows LoopB performs a final two-element reversal if one is required. Here is the output for source code example Ch04_05:
ReverseArrayA - original array x
  859   60  373  422  556  331  956  754  737  732
  817  354  102  456  929  286  610  766  597  828
   92   39  346  314  663
ReverseArrayA - reversed array y
  663  314  346   39   92  828  597  766  610  286
  929  456  102  354  817  732  737  754  956  331
  556  422  373   60  859
ReverseArrayB - array x before reversal
  859   60  373  422  556  331  956  754  737  732
  817  354  102  456  929  286  610  766  597  828
   92   39  346  314  663
ReverseArrayB - array x after reversal
  663  314  346   39   92  828  597  766  610  286
  929  456  102  354  817  732  737  754  956  331
  556  422  373   60  859
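As a cross-check on the in-place algorithm used by ReverseArrayB_, here is a minimal C++ sketch of the same idea. The function name ReverseArrayB_Sketch is hypothetical, and indices lo and hi stand in for the pointers held in R0 and R1. Note that both pairs are loaded before anything is stored, which is what keeps the reversal correct even when the two pairs are adjacent.

```cpp
#include <cassert>
#include <utility>

// Hypothetical C++ rendering of the in-place reversal in ReverseArrayB_.
void ReverseArrayB_Sketch(int* x, int n)
{
    assert(n >= 0);

    int lo = 0;                 // index of first unprocessed element
    int hi = n - 1;             // index of last unprocessed element
    int remaining = n;

    while (remaining >= 4)
    {
        int a0 = x[lo], a1 = x[lo + 1];     // like ldmia r0,{r4,r5}
        int b0 = x[hi - 1], b1 = x[hi];     // like ldmda r1,{r6,r7}

        x[lo] = b1;                         // like stmia r0!,{r8,r9}
        x[lo + 1] = b0;
        x[hi - 1] = a1;                     // like stmda r1!,{r10,r11}
        x[hi] = a0;

        lo += 2;
        hi -= 2;
        remaining -= 4;
    }

    // With 2 or 3 elements left, swap the outer pair; a middle element stays put
    if (remaining > 1)
        std::swap(x[lo], x[hi]);
}
```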

Structures

A structure is a programming language construct that facilitates the definition of new data types using one or more existing data types. In C++, a structure is essentially the same as a class. When a data type is defined using the keyword struct instead of class, all members are public by default. A C++ struct that is declared sans any member functions or operators is analogous to a C-style structure such as typedef struct { ... } MyStruct;. C++ structure declarations are usually placed in a header (.h) file so they can be easily referenced by multiple C++ files.

The address of a structure member is simply the starting address of the structure in memory plus the member’s offset in bytes. During compilation, most C++ compilers align structure members to their natural boundary, which means that structures frequently contain extra padding bytes. It is not possible to define a structure in a header file and include this file in both C++ and assembly language source code files. However, a simple solution to this dilemma is to use the C++ offsetof macro to determine the offset for each structure member and then use .equ directives in the assembly language file. You will learn how to do this shortly.
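The relationship between a structure's base address, a member's offset, and padding can be illustrated with a short C++ sketch. The structure Sample and the function MemberAddress are hypothetical names introduced only for this illustration; the exact amount of padding depends on the compiler and target, so the sketch queries offsetof rather than assuming specific values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical structure used only for this illustration.
struct Sample
{
    int8_t  A;
    int32_t B;      // compilers typically insert padding bytes before B
    int16_t C;
};

// Returns the address of a member given the structure's base address
// and the member's offset in bytes (base + offset).
const void* MemberAddress(const Sample* s, std::size_t offset)
{
    assert(offset < sizeof(Sample));
    return reinterpret_cast<const unsigned char*>(s) + offset;
}
```

For example, MemberAddress(&s, offsetof(Sample, B)) yields the same address as &s.B, which is the calculation an assembly language load instruction performs when it adds an immediate offset to a base register.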

Listing 4-6 shows the C++ and assembly language source code for example Ch04_06. In the C++ code, a simple structure named TestStruct is defined. This structure uses sized integer types instead of the more common C++ types to highlight the exact size of each member.
//-------------------------------------------------
//               Ch04_06.cpp
//-------------------------------------------------
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <cstddef>
using namespace std;
struct TestStruct
{
    int8_t ValA;
    int8_t ValB;
    int32_t ValC;
    int16_t ValD;
    int32_t ValE;
    uint8_t ValF;
    uint16_t ValG;
};
extern "C" int32_t CalcTestStructSum_(const TestStruct* ts);
void PrintTestStructOffsets(void)
{
    const char nl = '\n';
    cout << "offsetof(ts.ValA) = " << offsetof(TestStruct, ValA) << nl;
    cout << "offsetof(ts.ValB) = " << offsetof(TestStruct, ValB) << nl;
    cout << "offsetof(ts.ValC) = " << offsetof(TestStruct, ValC) << nl;
    cout << "offsetof(ts.ValD) = " << offsetof(TestStruct, ValD) << nl;
    cout << "offsetof(ts.ValE) = " << offsetof(TestStruct, ValE) << nl;
    cout << "offsetof(ts.ValF) = " << offsetof(TestStruct, ValF) << nl;
    cout << "offsetof(ts.ValG) = " << offsetof(TestStruct, ValG) << nl;
}
int32_t CalcTestStructSum(const TestStruct* ts)
{
    int32_t temp1 = ts->ValA + ts->ValB + ts->ValC + ts->ValD;
    int32_t temp2 = ts->ValE + ts->ValF + ts->ValG;
    return temp1 + temp2;
}
int main()
{
    const char nl = '\n';
    PrintTestStructOffsets();
    TestStruct ts;
    ts.ValA = -100;
    ts.ValB = 75;
    ts.ValC = 1000000;
    ts.ValD = -3000;
    ts.ValE = 400000;
    ts.ValF = 200;
    ts.ValG = 50000;
    int32_t sum1 = CalcTestStructSum(&ts);
    int32_t sum2 = CalcTestStructSum_(&ts);
    cout << nl << "Results for CalcTestStructSum" << nl;
    cout << "ts1.ValA = " << (int)ts.ValA << nl;
    cout << "ts1.ValB = " << (int)ts.ValB << nl;
    cout << "ts1.ValC = " << ts.ValC << nl;
    cout << "ts1.ValD = " << ts.ValD << nl;
    cout << "ts1.ValE = " << ts.ValE << nl;
    cout << "ts1.ValF = " << (int)ts.ValF << nl;
    cout << "ts1.ValG = " << ts.ValG << nl;
    cout << "sum1 =     " << sum1 << nl;
    cout << "sum2 =     " << sum2 << nl;
    if (sum1 != sum2)
        cout << "Compare error!" << nl;
    return 0;
}
//-------------------------------------------------
//               Ch04_06_.s
//-------------------------------------------------
// extern "C" int32_t CalcTestStructSum_(const TestStruct* ts);
// Offsets for TestStruct
            .equ S_ValA,0                       // int8_t
            .equ S_ValB,1                       // int8_t
            .equ S_ValC,4                       // int32_t
            .equ S_ValD,8                       // int16_t
            .equ S_ValE,12                      // int32_t
            .equ S_ValF,16                      // uint8_t
            .equ S_ValG,18                      // uint16_t
            .text
            .global CalcTestStructSum_
CalcTestStructSum_:
// Sum the elements of TestStruct
            ldrsb r1,[r0,#S_ValA]               // r1 = ValA (sign-extended)
            ldrsb r2,[r0,#S_ValB]               // r2 = ValB (sign-extended)
            add r1,r1,r2
            ldr r2,[r0,#S_ValC]                 // r2 = ValC
            add r1,r1,r2
            ldrsh r2,[r0,#S_ValD]               // r2 = ValD (sign-extended)
            add r1,r1,r2
            ldr r2,[r0,#S_ValE]                 // r2 = ValE
            add r1,r1,r2
            ldrb r2,[r0,#S_ValF]                // r2 = ValF (zero-extended)
            add r1,r1,r2
            ldrh r2,[r0,#S_ValG]                // r2 = ValG (zero-extended)
            add r1,r1,r2
            mov r0,r1
            bx lr
Listing 4-6. Example Ch04_06

Following the definition of TestStruct is a function named PrintTestStructOffsets. The function main calls this function, which prints the offset in bytes of each member in TestStruct. These results were then used to define .equ directives in the assembly language file for the members in TestStruct. The remaining code in main initializes an instance of TestStruct, calls CalcTestStructSum and CalcTestStructSum_, and displays results. The functions CalcTestStructSum and CalcTestStructSum_ both sum the members in TestStruct.

The assembly language code in Listing 4-6 begins with the aforementioned .equ directives that define offsets for each structure member. The sole argument value for CalcTestStructSum_ is a pointer to the caller’s TestStruct. The calculating code in CalcTestStructSum_ uses several forms of the ldr instruction, all of which you have already seen, to load each structure member into a register: ldrsb and ldrsh sign-extend the signed 8- and 16-bit members, ldrb and ldrh zero-extend the unsigned members, and ldr loads the 32-bit members. Note that each load instruction uses simple offset addressing. Here is the output for source code example Ch04_06:
offsetof(ts.ValA) = 0
offsetof(ts.ValB) = 1
offsetof(ts.ValC) = 4
offsetof(ts.ValD) = 8
offsetof(ts.ValE) = 12
offsetof(ts.ValF) = 16
offsetof(ts.ValG) = 18
Results for CalcTestStructSum
ts1.ValA = -100
ts1.ValB = 75
ts1.ValC = 1000000
ts1.ValD = -3000
ts1.ValE = 400000
ts1.ValF = 200
ts1.ValG = 50000
sum1 =     1447175
sum2 =     1447175
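To make the offset-based loads more concrete, the following C++ sketch mimics what the assembly language function does: it reads each member from the structure's raw bytes at its offset and widens the value to 32 bits. This is an illustration under stated assumptions, not the book's implementation. The names LoadAs and CalcTestStructSumByOffset are hypothetical, and the sketch uses offsetof instead of hard-coded constants so that it does not depend on one particular compiler's layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct TestStruct
{
    int8_t ValA;
    int8_t ValB;
    int32_t ValC;
    int16_t ValD;
    int32_t ValE;
    uint8_t ValF;
    uint16_t ValG;
};

// Hypothetical helper: reads a T from base + offset and widens it to 32 bits.
// A signed T is sign-extended (like ldrsb/ldrsh); an unsigned T is
// zero-extended (like ldrb/ldrh).
template <typename T>
int32_t LoadAs(const unsigned char* base, std::size_t offset)
{
    T value;
    std::memcpy(&value, base + offset, sizeof(T));
    return static_cast<int32_t>(value);
}

// Hypothetical C++ counterpart of CalcTestStructSum_.
int32_t CalcTestStructSumByOffset(const TestStruct* ts)
{
    assert(ts != nullptr);
    const unsigned char* base = reinterpret_cast<const unsigned char*>(ts);

    int32_t sum = LoadAs<int8_t>(base, offsetof(TestStruct, ValA));
    sum += LoadAs<int8_t>(base, offsetof(TestStruct, ValB));
    sum += LoadAs<int32_t>(base, offsetof(TestStruct, ValC));
    sum += LoadAs<int16_t>(base, offsetof(TestStruct, ValD));
    sum += LoadAs<int32_t>(base, offsetof(TestStruct, ValE));
    sum += LoadAs<uint8_t>(base, offsetof(TestStruct, ValF));
    sum += LoadAs<uint16_t>(base, offsetof(TestStruct, ValG));
    return sum;
}
```

Choosing the wrong extension matters: if ValA (-100) were zero-extended instead of sign-extended, it would contribute 156 to the sum rather than -100.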

You will see other examples of assembly language structure use later in this book.

Summary

Here are the key learning points for Chapter 4:
  • The address of an element in a one-dimensional array can be calculated using the base address (i.e., the address of the first element) of the array, the index of the element, and the size in bytes of each element. The address of an element in a two-dimensional array can be calculated using the base address of the array, the row and column indices, the number of columns, and the size in bytes of each element.

  • Post-indexed addressing (e.g., ldr r1,[r0],#4) is often used to implement a for-loop that processes the elements of an array that contains 32-bit wide integers. Post-indexed addressing can also be used for arrays containing 8- and 16-bit wide integers.

  • A function can use the lsl operator in a ldr instruction (e.g., ldr r2,[r0,r1,lsl #2]) to load array element x[i] into a register. In this example, R0 contains the address of array x and R1 contains the index i.

  • A function can use the instruction pair ldr r0,=VarName and ldr r0,[r0] to load the value of C++ global variable VarName into register R0.

  • Functions can use the ldmda, ldmia, stmda, and stmia instructions to load multiple elements from or store multiple elements to an array.

  • Assembly language load and store instructions can reference members of a structure in memory using .equ directives and the output of the C++ offsetof macro.
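The address calculations named in the first learning point can be written out as a small C++ sketch. The function names ElementOffset1D and ElementOffset2D are hypothetical, and the two-dimensional version assumes the row-major ordering that C++ uses for built-in arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset of x[i] from the base address of a one-dimensional array.
std::size_t ElementOffset1D(std::size_t i, std::size_t elem_size)
{
    return i * elem_size;
}

// Byte offset of x[row][col] from the base address of a two-dimensional
// array stored in row-major order with ncols columns.
std::size_t ElementOffset2D(std::size_t row, std::size_t col,
                            std::size_t ncols, std::size_t elem_size)
{
    assert(col < ncols);
    return (row * ncols + col) * elem_size;
}
```

For 32-bit integers the element size is four bytes, which is why the assembly language code in this chapter scales an index with lsl #2 before adding it to the base address.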
