The content of this and the previous two chapters can be regarded as a trilogy of Arm8-32 assembly language fundamentals. In Chapters 2 and 3, you learned how to perform integer arithmetic, carry out data load and store operations, manipulate the stack, and programmatically exploit the NZCV condition flags. You also acquired useful knowledge about the GNU C++ calling convention and the GNU assembler.
This chapter you are about to read imparts additional Armv8-32 assembly language programming concepts that complete the trilogy. It begins with section that elucidates array use in an assembly language function. This is followed by a section that covers matrices and the programming techniques necessary to properly access the elements of a matrix. The final section of Chapter 4 explicates additional load and store instructions. It also explains how to reference the members of a C++ structure in an assembly language function.
Integer Arrays
Arrays are an indispensable data construct in virtually all programming languages. In C++, there is an inherent connection between arrays and pointers since the name of an array is essentially a pointer to its first element. Moreover, whenever an array variable name is used as a C++ function parameter, a pointer is passed instead of duplicating the array on the stack. In C++, one-dimensional arrays are stored in a contiguous block of memory that can be statically allocated at compile time or dynamically allocated during program execution. The elements of a C++ array are accessed using zero-based indexing, which means that valid indices for an array of size N range from 0 to N - 1.
The source code in this section discusses assembly language code that processes arrays. The first source code example explains how to perform simple arithmetic using the elements of an integer array. The second source code example demonstrates arithmetic using elements from multiple arrays.
Array Arithmetic
Example Ch04_01
Near the top of the C++ code are the now-familiar declarations for the assembly language functions CalcSumA _ and CalcSumB_. Both functions sum the elements of an array. Note that the declaration of CalcSumB_ uses the fixed-sized unsigned integer types uint64_t and uint32_t that are declared in the header file <cstdint> instead of the normal unsigned long long and unsigned int. Some assembly language programmers (including me) prefer to use fixed-sized integer types for assembly language function declarations since it accentuates the exact size of the argument.
The function CalcSumA_ begins with a mov r2,#0 instruction that initializes sum to zero. The cmp r1,#0 and ble DoneA instructions prevent execution of for-loop LoopA if n <= 0 is true. Sweeping through the array to sum the elements requires only four instructions. The ldr r3,[r0],#4 instruction loads the array element pointed to by R0 into R3. It then adds 4 to R0, which points it to the next array element. This is an example of post-indexed addressing (see Table 1-5). The next instruction, add r2,r2,r3, adds the current array element to sum in R2. The subs r1,r1,#1 instruction subtracts one from n and also sets the NZCV condition flags, which allows the ensuing bne LoopA instruction to terminate LoopA when n equals zero.
The function CalcSumB_ sums the elements of a uint32_t array and returns a result of type uint64_t. This function starts by setting registers R2 and R3 to zero. Function CalcSumB_ uses this register pair to hold an intermediate 64-bit sum. The number of array elements n is then tested to make sure it is not equal to zero. The mov r4,#0 instruction then sets the array index variable i to zero.
CalcSumB_ uses a different technique than CalcSumA_ to sum the elements of the target array. The first instruction of for-loop LoopB, ldr r5,[r0,r4,lsl #2], loads array element x[i] into R5. In this instruction, the address of source operand x[i] is R0 + (R4 << 2) (R0 contains the address of array x and R4 contains index variable i). Register R4 is left shifted by 2 bits since the size of each element of array x is 4 bytes. Note that this form of the ldr instruction does not modify the values in both R0 and R4.
The next instruction, adds r2,r2,r5, adds x[i] to the low-order 32 bits of the intermediate sum that is maintained in register pair R2:R3. The adds instruction also sets the C condition flag to one if an unsigned overflow occurs when adding x[i] to the running sum; otherwise, C is set to zero. The ensuing adc r3,r3,#0 (add with carry) instruction adds the value of the C condition flag to the high-order 32 bits of the 64-bit running sum. The adds/adc instruction pair is often used to perform 64-bit integer addition as demonstrated in this function.
Array Arithmetic Using Multiple Arrays
Example Ch04_02
The C++ code in Listing 4-2 starts with the definition of global variables g_Val1 and g_Val2. These values are used in functions CalcZ and CalcZ_. Following the declaration of CalcZ_ is a function named Init, which initializes the test arrays for this example using random numbers. This function uses the C++ Standard Template Library (STL) classes uniform_int_distribution and mt19937 to generate random values for the array. Appendix B contains a list of references that you can consult if you are interested in learning more about these classes. The definition of function CalcZ is next. This function performs some admittedly contrived arithmetic for demonstration purposes. Note that different integer types are used for the arrays x, y, and z. The remaining C++ code performs test case initialization, exercises the functions CalcZ and CalcZ_, and displays the results.
The first nonprologue instruction of CalcZ_ is a mov r4,#0, which initializes sum to zero. The value of n is then tested to make sure it is greater than zero. The next instruction, ldr r5,=g_Val1, loads the address of g_Val1 into R5. This is followed by a ldr r5,[r5] instruction that loads g_Val1 into R5. Function CalcZ_ uses a similar sequence of instructions to load g_Val2 into R6.
Each iteration of for-loop Loop1 begins with a ldrsb r7,[r1],#1 instruction that loads x[i] into R7. Note that a post-indexed offset value of one is used since array x is of type int8_t. The ldrsh r8,[r2],#2 instruction loads y[i] into R8. This instruction uses a post-indexed offset value of two since array y is of type int16_t. The ensuing cmp r7,#0 sets the NZCV condition flags. The next instruction, mullt r9,r8,r5, calculates temp = y[i] * g_Val1 only if x[i] < 0 is true. Otherwise, no operation is performed. The mullt instruction is an example of an A32 conditional instruction that was discussed in Chapter 3. Following the mullt instruction is another conditionally executed instruction mulge r9,r8,r6, which calculates temp y[i] * g_Val2 only if x[i] >= 0 is true.
Integer Matrices
Accessing Matrix Elements
Example Ch04_03
The function CalcMatrixSquares illustrates how to access an element in a C++ matrix using explicit arithmetic. At entry to this function, arguments x and y point to the memory blocks that contain their respective matrices. Inside the second for-loop, the expression kx = j * m + i calculates the offset necessary to access element x[j][i]. Similarly, the expression ky = i * n + j calculates the offset for element y[i][j]. Note that the code employed in CalcMatrixSquares to calculate kx and ky requires x to be a matrix of size n × m and y to be a matrix of size m × n.
The assembly language function CalcMatrixSquares_ uses the same technique as the C++ code to access elements in matrices x and y. This function begins its execution by checking argument values m and n to make sure they are greater than zero. A mov r4,#0 instruction is then used to initialize index i to zero. Each iteration of for-loop Loop1 starts with a mov r5,#0 instruction that sets index j to zero. The ensuing mov r6,r5, mul r6,r6,r2, and add r6,r6,r4 instructions calculate kx = j * m + i. This is followed by a ldr r7,[r1,r6,lsl #2] instruction that loads x[j][i] into R7. The mul r7,r7,r7 instruction calculates x[j][i] * x[j][i].
Row-Column Sums
Example Ch04_04
The assembly language function CalcMatrixRowColSums_ implements the same algorithm as its C++ counterpart. Following preservation of non-volatile registers R4–R11 on the stack, arguments nrows and ncols are tested for validity. Note that ncols was passed via the stack. Also note the two uses of the movle r0,#0 instruction, which load R0 with the correct return code if either ncols or nrows is invalid. For-loop Loop0 then initializes each element in col_sums to zero.
Advanced Programming
The source code examples in this section highlight a few advanced programming techniques. The first example introduces additional load and store instructions that you can use to access the elements of an array. The second example explains how to reference the members of a C++ structure in an assembly language function.
Array Reversal
Example Ch04_05
The function ReverseArrayA_ copies elements from a source array to a destination array in reverse order. This function requires three parameters: a pointer to destination array y, a pointer to source array x, and the number of elements n. During its initialization phase, ReverseArrayA_ uses the instructions add r1,r1,r2,lsl #2 and sub r1,#4 to calculate the address of the last element in array x. It then checks the value of n to see if it is less than four. If n < 4 is true, ReverseArrayA_ skips over for-loop LoopA. The reason for this is that for-loop LoopA processes four elements during each iteration.
Following execution of the stmia instruction, n is decremented by four and Loop1 repeats until n < 4 is true. The block of code that follows Loop1 reverses the final few elements of array x using ldr and str instructions. Note that after an element is reversed, n is decremented by one and tested to see if it is equal to zero.
Structures
A structure is a programming language construct that facilitates the definition of new data types using one or more existing data types. In C++, a structure is essentially the same as a class. When a data type is defined using the keyword struct instead of class, all members are public by default. A C++ struct that is declared sans any member functions or operators is analogous to a C-style structure such as typedef struct { ... } MyStruct;. C++ structure declarations are usually placed in a header (.h) file so they can be easily referenced by multiple C++ files.
The address of a structure member is simply the starting address of the structure in memory plus the member’s offset in bytes. During compilation, most C++ compilers align structure members to their natural boundary, which means that structures frequently contain extra padding bytes. It is not possible to define a structure in a header file and include this file in both C++ and assembly language source code files. However, a simple solution to this dilemma is to use the C++ offsetof macro to determine the offset for each structure member and then use .equ directives in the assembly language file. You will learn how to do this shortly.
Example Ch04_06
Following the definition of TestStruct is a function named PrintTestStructOffsets. The function main calls this function, which prints the offset in bytes of each member in TestStruct. These results were then used to define .equ directives in the assembly language file for the members in TestStruct. The remaining code in main initializes an instance of TestStruct, calls CalcTestStructSum and CalcTestStructSum_, and displays results. The functions CalcTestStructSum and CalcTestStructSum_ both sum the members in TestStruct.
You will see other examples of assembly language structure use later in this book.
Summary
The address of an element in a one-dimensional array can be calculated using the base address (i.e., the address of the first element) of the array, the index of the element, and the size in bytes of each element. The address of an element in a two-dimensional array can be calculated using the base address of the array, the row and column indices, the number of columns, and the size in bytes of each element.
Post-indexed addressing (e.g., ldr r1,[r0],#4) is often used to implement a for-loop that processes the elements of an array that contains 32-bit wide integers. Post-indexed addressing can also be used for arrays containing 8- and 16-bit wide integers.
A function can use the lsl operator in a ldr instruction (e.g., ldr r2,[r0,r1,lsl #2]) to load array element x[i] into a register. In this example, R0 contains the address of array x and R1 contains the index i.
A function can use the instruction pair ldr r0,=VarName and ldr r0,[r0] to load the value of C++ global variable VarName into register R0.
Functions can use the ldmdb, ldmia, stmdb, and stmia instructions to load multiple elements from or store multiple elements to an array.
Assembly language load and store instructions can reference members of a structure in memory using .equ directives and the output of the C++ offsetof operator.