© Daniel Kusswurm 2020
D. Kusswurm, Modern Arm Assembly Language Programming, https://doi.org/10.1007/978-1-4842-6267-2_11

11. Armv8-64 Core Programming – Part 1

Daniel Kusswurm, Geneva, IL, USA

Chapter 11 introduces Armv8-64 core programming. It begins with a section that illustrates the use of basic integer arithmetic instructions including addition, subtraction, multiplication, and division. The section that follows covers data loads and stores, shift and rotate operations, and bitwise logical manipulations. This second section is especially important since it accentuates notable differences between A32 and A64 assembly language programming.

This chapter also covers details about the semantics and syntax of an A64 assembly language source code file. You will learn the basics of passing arguments and return values between functions written in C++ and A64 assembly language. The subsequent discussions and source code examples are intended to complement the material presented in Chapter 10.

Like the introductory A32 programming chapters, the primary purpose of the source code presented in this (and the next) chapter is to elucidate proper use of the A64 instruction set and basic assembly language programming techniques. The source code that is described in later A64 programming chapters places more emphasis on efficient coding techniques. Appendix A contains additional information on how to build and run the A64 source code examples. Depending on your personal preference, you may want to set up a test system before proceeding with the discussions in this chapter.

Integer Arithmetic

In this section, you will learn the basics of A64 assembly language programming. It begins with a simple program that demonstrates how to perform integer addition and subtraction. This is followed by a source code example that illustrates integer multiplication. The final example explains integer division. Besides common arithmetic operations, the source code examples in this section also explain how to pass argument and return values between a C++ function and an assembly language function. They also show how to use common assembler directives.

Addition and Subtraction

Listing 11-1 shows the source code for example Ch11_01. This example demonstrates how to use the A64 assembly language instructions add (integer add) and sub (integer subtract). It also illustrates some basic A64 assembly language programming concepts including passing arguments, returning values, and using assembler directives.
//-------------------------------------------------
//               Ch11_01.cpp
//-------------------------------------------------
#include <iostream>
using namespace std;
extern "C" int IntegerAddSubA_(int a, int b, int c);
extern "C" long IntegerAddSubB_(long a, long b, long c);
template <typename T>
void PrintResult(const char* msg, T a, T b, T c, T result)
{
    const char nl = '\n';
    cout << msg << nl;
    cout << "a = " << a << nl;
    cout << "b = " << b << nl;
    cout << "c = " << c << nl;
    cout << "result (a + b - c) = " << result << nl;
    cout << nl;
}
int main(int argc, char** argv)
{
    int a1 = 100, b1 = 200, c1 = -50, result1;
    result1 = IntegerAddSubA_(a1, b1, c1);
    PrintResult("IntegerAddSubA_", a1, b1, c1, result1);
    long a2 = 1000, b2 = -2000, c2 = 500, result2;
    result2 = IntegerAddSubB_(a2, b2, c2);
    PrintResult("IntegerAddSubB_", a2, b2, c2, result2);
}
//-------------------------------------------------
//               Ch11_01_.s
//-------------------------------------------------
// extern "C" int IntegerAddSubA_(int a, int b, int c);
            .text
            .global IntegerAddSubA_
IntegerAddSubA_:
// Calculate a + b - c
            add w3,w0,w1                            // w3 = a + b
            sub w0,w3,w2                            // w0 = a + b - c
            ret                                     // return to caller
// extern "C" long IntegerAddSubB_(long a, long b, long c);
            .global IntegerAddSubB_
IntegerAddSubB_:
// Calculate a + b - c
            add x3,x0,x1                            // x3 = a + b
            sub x0,x3,x2                            // x0 = a + b - c
            ret                                     // return to caller
Listing 11-1. Example Ch11_01

The C++ code in Listing 11-1 begins with the declarations of the assembly language functions IntegerAddSubA_ and IntegerAddSubB_. These functions carry out simple integer addition and subtraction operations using int (32-bit) and long (64-bit) argument values. The "C" modifier that is used in the function declarations instructs the C++ compiler to use a C-style function name instead of a C++ decorated name (recall that a C++ decorated name contains extra prefix and suffix characters to facilitate function overloading). Also included in the C++ code is a template function named PrintResult, which streams results to cout. The C++ function main includes code that exercises the assembly language functions IntegerAddSubA_ and IntegerAddSubB_.

The A64 assembly language code for example Ch11_01 is shown in Listing 11-1 immediately after the C++ code. The first thing to notice is the // symbol. Like the GNU C++ compiler, the GNU assembler treats any text on a line that follows a // as an appended comment. Unlike A32 code, the @ symbol cannot be used for appended comments in A64 assembly language source files. A64 assembly language source code files can also use block comments using the /* and */ symbols.

The .text statement is an assembler directive that defines the start of an assembly language code section. As explained in Chapter 2, an assembler directive is a statement that instructs the assembler to perform a specific action during assembly of the source code. The next statement, .global IntegerAddSubA_, is another directive that tells the assembler to treat the function IntegerAddSubA_ as a global function. A global function can be called by functions that are defined in other source code modules. The ensuing IntegerAddSubA_: statement defines the entry point (or start address) for function IntegerAddSubA_. The text that precedes the : symbol is called a label. Besides designating function entry points, labels are also employed to define assembly language variable names and targets for branch instructions.

The assembly language function IntegerAddSubA_ calculates a + b - c and returns this value to the calling function. It begins with an add w3,w0,w1 instruction that adds the values in registers W0 (argument value a) and W1 (argument value b); the result is then saved in register W3. The use of registers W0 and W1 for argument values a and b is mandated by the GNU C++ calling convention for Armv8-64. According to this convention, the first eight integer (or pointer) arguments are passed in registers W0/X0–W7/X7. Any remaining arguments are passed via the stack. You will learn more about the GNU C++ calling convention later in this chapter and in subsequent chapters.

The next instruction in IntegerAddSubA_, sub w0,w3,w2, subtracts W2 (c) from W3 (a + b) and saves the result in register W0. This completes the calculation of a + b - c. An A64 assembly language function must use register W0 to return a single 32-bit integer (or C++ int) value to its calling function. In the current example, no additional instructions are necessary to achieve this requirement since W0 already contains the correct return value. The final instruction of IntegerAddSubA_ is a ret (return from subroutine). This instruction returns program control back to the calling function. More specifically, the ret instruction performs an unconditional branch (or jump) to the address in register X30 (or link register). Unlike Armv8-32, Armv8-64 defines an explicit instruction mnemonic for function returns. This enables the processor to make better branch predictions since it can now differentiate between a function return and an ordinary branch operation. You will learn more about branch predictions in Chapter 17.

Listing 11-1 also includes the assembly language function IntegerAddSubB_. This function is the 64-bit counterpart of function IntegerAddSubA_. Function IntegerAddSubB_ begins with an add x3,x0,x1 instruction that calculates a + b. Function IntegerAddSubB_ uses X registers since argument values a, b, and c are long (64-bit) integers. The next instruction, sub x0,x3,x2, completes the calculation of a + b - c. The GNU C++ calling convention for Armv8-64 specifies that a function must use register X0 for a 64-bit return value. The final instruction of IntegerAddSubB_, ret, returns program control back to the caller. Here is the output for source code example Ch11_01:
IntegerAddSubA_
a = 100
b = 200
c = -50
result (a + b - c) = 350
IntegerAddSubB_
a = 1000
b = -2000
c = 500
result (a + b - c) = -1500
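
The register-level behavior described above can be sketched in C++. The following reference functions are hypothetical (the names IntegerAddSubA_Ref and IntegerAddSubB_Ref are not part of example Ch11_01) and are only a rough model: they use unsigned arithmetic internally so that a result that does not fit in the destination register simply wraps around, which is how the add and sub instructions behave.

```cpp
#include <cstdint>

// Hypothetical C++ reference models; these are not part of example Ch11_01.
// Unsigned arithmetic is used so that wraparound is well defined, mirroring
// how add and sub discard any carry out of the destination register.

// Models IntegerAddSubA_ (32-bit W registers)
int32_t IntegerAddSubA_Ref(int32_t a, int32_t b, int32_t c)
{
    uint32_t w3 = static_cast<uint32_t>(a) + static_cast<uint32_t>(b);  // add w3,w0,w1
    uint32_t w0 = w3 - static_cast<uint32_t>(c);                        // sub w0,w3,w2
    return static_cast<int32_t>(w0);                                    // value returned in W0
}

// Models IntegerAddSubB_ (64-bit X registers)
int64_t IntegerAddSubB_Ref(int64_t a, int64_t b, int64_t c)
{
    uint64_t x3 = static_cast<uint64_t>(a) + static_cast<uint64_t>(b);  // add x3,x0,x1
    uint64_t x0 = x3 - static_cast<uint64_t>(c);                        // sub x0,x3,x2
    return static_cast<int64_t>(x0);                                    // value returned in X0
}
```

Calling IntegerAddSubA_Ref(100, 200, -50) and IntegerAddSubB_Ref(1000, -2000, 500) should reproduce the results shown in the program output above.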

Multiplication

The next source code example, Ch11_02, illustrates the use of common A64 multiplication instructions. Listing 11-2 shows the source code for this example. Like the previous example, the C++ code begins with the declarations of the assembly language functions IntegerMulA_, IntegerMulB_, IntegerMulC_, and IntegerMulD_. Note that these functions use assorted data types for parameters and return values. The template function PrintResult contains code that displays results. The function main contains code that initializes test case data and exercises the assembly language multiplication functions.
//-------------------------------------------------
//               Ch11_02.cpp
//-------------------------------------------------
#include <iostream>
using namespace std;
extern "C" int IntegerMulA_(int a, int b);
extern "C" long IntegerMulB_(long a, long b);
extern "C" long IntegerMulC_(int a, int b);
extern "C" unsigned long IntegerMulD_(unsigned int a, unsigned int b);
template <typename T1, typename T2>
void PrintResult(const char* msg, T1 a, T1 b, T2 result)
{
    const char nl = '\n';
    cout << msg << nl;
    cout << "a = " << a << ", b = " << b;
    cout << " result = " << result << nl << nl;
}
int main(int argc, char** argv)
{
    int a1 = 50;
    int b1 = 25;
    int result1 = IntegerMulA_(a1, b1);
    PrintResult("IntegerMulA_", a1, b1, result1);
    long a2 = -3000000000;
    long b2 = 7;
    long result2 = IntegerMulB_(a2, b2);
    PrintResult("IntegerMulB_", a2, b2, result2);
    int a3 = 4000;
    int b3 = 0x80000000;
    long result3 = IntegerMulC_(a3, b3);
    PrintResult("IntegerMulC_", a3, b3, result3);
    unsigned int a4 = 4000;
    unsigned int b4 = 0x80000000;
    unsigned long result4 = IntegerMulD_(a4, b4);
    PrintResult("IntegerMulD_", a4, b4, result4);
    return 0;
}
//-------------------------------------------------
//               Ch11_02_.s
//-------------------------------------------------
// extern "C" int IntegerMulA_(int a, int b);
            .text
            .global IntegerMulA_
IntegerMulA_:
// Calculate a * b and save result
            mul w0,w0,w1                        // a * b (32-bit)
            ret
// extern "C" long IntegerMulB_(long a, long b);
            .global IntegerMulB_
IntegerMulB_:
// Calculate a * b and save result
            mul x0,x0,x1                        // a * b (64-bit)
            ret
// extern "C" long IntegerMulC_(int a, int b);
            .global IntegerMulC_
IntegerMulC_:
// Calculate a * b and save result
            smull x0,w0,w1                      // signed 64-bit
            ret
// extern "C" unsigned long IntegerMulD_(unsigned int a, unsigned int b);
            .global IntegerMulD_
IntegerMulD_:
// Calculate a * b and save result
            umull x0,w0,w1                      // unsigned 64-bit
            ret
Listing 11-2. Example Ch11_02

The assembly language functions IntegerMulA_, IntegerMulB_, IntegerMulC_, and IntegerMulD_ illustrate the use of various A64 multiplication instructions. In function IntegerMulA_, the mul w0,w0,w1 (multiply) instruction multiplies registers W0 (argument value a) by W1 (argument value b). It then truncates the result to 32 bits and saves this value in register W0. The mul instruction is an alias instruction. Recall that an alias instruction is a distinct mnemonic that provides a more expressive description of the operation that is being performed. In the current example, the mul w0,w0,w1 instruction is an alias of madd w0,w0,w1,wzr (multiply-add). You will learn how to use the madd instruction in Chapter 12. Alias instructions are generally unimportant when writing A64 code. However, it is something that you need to be aware of when using a debugger or viewing a listing of disassembled code.

The assembly language function IntegerMulB_ uses a mul x0,x0,x1 instruction to multiply two 64-bit wide integers. It saves the low-order 64 bits of the 128-bit product in register X0. The mul instruction can be used with either signed or unsigned integer operands. Function IntegerMulC_ uses the smull x0,w0,w1 (signed multiply long) instruction. This instruction multiplies the 32-bit signed integer values in registers W0 and W1. It then saves the complete 64-bit signed integer product in register X0. The smull instruction is an alias instruction of smaddl (signed multiply-add long). The final integer multiplication function, IntegerMulD_, uses the umull (unsigned multiply long) instruction to perform unsigned integer multiplication. The umull instruction, which is an alias of umaddl (unsigned multiply-add long), calculates a 64-bit unsigned integer product using two unsigned 32-bit integer operands. Here are the results for source code example Ch11_02:
IntegerMulA_
a = 50, b = 25 result = 1250
IntegerMulB_
a = -3000000000, b = 7 result = -21000000000
IntegerMulC_
a = 4000, b = -2147483648 result = -8589934592000
IntegerMulD_
a = 4000, b = 2147483648 result = 8589934592000
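
The difference between the truncating mul instruction and the widening smull and umull instructions can be sketched in C++. The function names below are hypothetical and are not part of example Ch11_02; the code is only an approximate model of the arithmetic, assuming a platform where int is 32 bits wide.

```cpp
#include <cstdint>

// Hypothetical C++ models of the multiplication instructions in Ch11_02.

// mul w0,w0,w1: keeps only the low-order 32 bits of the product.
// Unsigned operands are used so that the truncation is well defined.
int32_t MulLow32(int32_t a, int32_t b)
{
    uint32_t product = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
    return static_cast<int32_t>(product);
}

// smull x0,w0,w1: sign-extends both 32-bit operands, full 64-bit product.
int64_t SignedMulLong(int32_t a, int32_t b)
{
    return static_cast<int64_t>(a) * static_cast<int64_t>(b);
}

// umull x0,w0,w1: zero-extends both 32-bit operands, full 64-bit product.
uint64_t UnsignedMulLong(uint32_t a, uint32_t b)
{
    return static_cast<uint64_t>(a) * static_cast<uint64_t>(b);
}
```

Note that the same bit pattern, 0x80000000, yields a negative product in SignedMulLong and a positive product in UnsignedMulLong, which matches the results for IntegerMulC_ and IntegerMulD_ shown above.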

Division

Listing 11-3 shows the source code for example Ch11_03. This source code example illustrates integer division. It also describes the use of the str (store) instruction. The C++ code in Listing 11-3 resembles the previous two examples in that it performs test case initialization and streams its results to cout. Note that the declarations for assembly language functions CalcQuoRemA_ and CalcQuoRemB_ include a mixture of integer and pointer arguments.
//-------------------------------------------------
//               Ch11_03.cpp
//-------------------------------------------------
#include <iostream>
using namespace std;
extern "C" void CalcQuoRemA_(int a, int b, int* quo, int* rem);
extern "C" void CalcQuoRemB_(long a, long b, long* quo, long* rem);
template <typename T>
void PrintResult(const char* msg, T a, T b, T quo, T rem)
{
    const char nl = '\n';
    cout << msg << nl;
    cout << "a = " << a << nl;
    cout << "b = " << b << nl;
    cout << "quotient = " << quo << nl;
    cout << "remainder = " << rem << nl;
    cout << nl;
}
int main(int argc, char** argv)
{
    int a1 = 100, b1 = 7, quo1, rem1;
    CalcQuoRemA_(a1, b1, &quo1, &rem1);
    PrintResult("CalcQuoRemA_", a1, b1, quo1, rem1);
    long a2 = -2000000000, b2 = 11, quo2, rem2;
    CalcQuoRemB_(a2, b2, &quo2, &rem2);
    PrintResult("CalcQuoRemB_", a2, b2, quo2, rem2);
}
//-------------------------------------------------
//               Ch11_03_.s
//-------------------------------------------------
// extern "C" void CalcQuoRemA_(int a, int b, int* quo, int* rem);
            .text
            .global CalcQuoRemA_
CalcQuoRemA_:
// Calculate quotient and remainder
            sdiv w4,w0,w1                           // a / b
            str w4,[x2]                             // save quotient
            mul w5,w4,w1                            // quotient * b
            sub w6,w0,w5                            // a - quotient * b
            str w6,[x3]                             // save remainder
            ret                                     // return to caller
// extern "C" void CalcQuoRemB_(long a, long b, long* quo, long* rem);
            .global CalcQuoRemB_
CalcQuoRemB_:
// Calculate quotient and remainder
            sdiv x4,x0,x1                           // a / b
            str x4,[x2]                             // save quotient
            mul x5,x4,x1                            // quotient * b
            sub x6,x0,x5                            // a - quotient * b
            str x6,[x3]                             // save remainder
            ret                                     // return to caller
Listing 11-3. Example Ch11_03

The assembly language function CalcQuoRemA_ begins its execution with a sdiv w4,w0,w1 (signed divide) instruction. This instruction divides the value in register W0 (argument value a) by W1 (argument value b) and saves the quotient in register W4. The next instruction, str w4,[x2], saves the 32-bit signed integer quotient to the memory location pointed to by register X2 (argument value quo). The ensuing mul w5,w4,w1 and sub w6,w0,w5 instructions calculate the remainder as a - quotient * b (the sdiv instruction does not return a remainder). The final instruction of CalcQuoRemA_, str w6,[x3], saves the remainder to the memory location pointed to by register X3 (argument value rem).

Function CalcQuoRemB_ is identical to function CalcQuoRemA_ except that it uses 64-bit signed integer values. The A64 instruction set also includes a udiv (unsigned divide) instruction, which performs division using 32- or 64-bit wide unsigned integers. Here are the results for source code example Ch11_03:
CalcQuoRemA_
a = 100
b = 7
quotient = 14
remainder = 2
CalcQuoRemB_
a = -2000000000
b = 11
quotient = -181818181
remainder = -9
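
The quotient and remainder calculation can also be sketched in C++. The structure QuoRem and function CalcQuoRemRef below are hypothetical and are not part of example Ch11_03. One assumption goes beyond the text of this chapter: the Arm architecture defines sdiv to return zero for a zero divisor and to wrap in the single overflow case (most negative value divided by -1), whereas both cases are undefined behavior for the C++ / operator, so the sketch handles them explicitly.

```cpp
#include <cstdint>

// Hypothetical result type; not part of example Ch11_03.
struct QuoRem
{
    int64_t quo;
    int64_t rem;
};

// Hypothetical C++ model of CalcQuoRemB_ (sdiv, mul, sub sequence).
QuoRem CalcQuoRemRef(int64_t a, int64_t b)
{
    int64_t quo;

    if (b == 0)
        quo = 0;                    // assumption: sdiv returns 0, no exception
    else if (a == INT64_MIN && b == -1)
        quo = INT64_MIN;            // assumption: the overflow case wraps
    else
        quo = a / b;                // sdiv x4,x0,x1 (truncates toward zero)

    // Unsigned arithmetic keeps the wraparound cases well defined in C++.
    uint64_t prod = static_cast<uint64_t>(quo) * static_cast<uint64_t>(b);  // mul x5,x4,x1
    uint64_t rem = static_cast<uint64_t>(a) - prod;                         // sub x6,x0,x5

    return { quo, static_cast<int64_t>(rem) };
}
```

Because the quotient is truncated toward zero, the remainder has the same sign as the dividend, which is why the second test case above produces a remainder of -9.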

Integer Operations

The source code examples of this section explain how to use common integer load, store, move, shift, and bitwise logical instructions. It is important to master these instructions given their frequency of use. As in A32 assembly language programming, it is sometimes necessary to use multiple instructions or pseudo instructions when writing A64 code, especially for load and move operations, as you will soon see.

Load and Store Instructions

Listing 11-4 shows the source code for example Ch11_04, which explains how to use the ldr (load register) instruction. It also illustrates the use of several Armv8-64 memory addressing modes.
//-------------------------------------------------
//               Ch11_04.cpp
//-------------------------------------------------
#include <iostream>
using namespace std;
extern "C" int TestLDR1_(unsigned int i, unsigned long j);
extern "C" long TestLDR2_(unsigned int i, unsigned long j);
extern "C" short TestLDR3_(unsigned int i, unsigned long j);
void TestLDR1(void)
{
    const char nl = '\n';
    unsigned int i = 3;
    unsigned long j = 6;
    int test_ldr1 = TestLDR1_(i, j);
    cout << "TestLDR1_(" << i << ", " << j << ") = " << test_ldr1 << nl;
}
void TestLDR2(void)
{
    const char nl = '\n';
    unsigned int i = 2;
    unsigned long j = 7;
    long test_ldr2 = TestLDR2_(i, j);
    cout << "TestLDR2_(" << i << ", " << j << ") = " << test_ldr2 << nl;
}
void TestLDR3(void)
{
    const char nl = '\n';
    unsigned int i = 5;
    unsigned long j = 1;
    short test_ldr3 = TestLDR3_(i, j);
    cout << "TestLDR3_(" << i << ", " << j << ") = " << test_ldr3 << nl;
}
int main(int argc, char** argv)
{
    TestLDR1();
    TestLDR2();
    TestLDR3();
}
//-------------------------------------------------
//               Ch11_04_.s
//-------------------------------------------------
// Test arrays
            .data
A1:         .word 1, 2, 3, 4, 5, 6, 7, 8
A2:         .quad 10, -20, 30, -40, 50, -60, 70, -80
            .text
A3:         .short 100, 200, -300, 400, 500, -600, 700, 800
// extern "C" int TestLDR1_(unsigned int i, unsigned long j);
            .global TestLDR1_
TestLDR1_:  ldr x2,=A1                              // x2 = ptr to A1
            ldr w3,[x2,w0,uxtw 2]                   // w3 = A1[i]
            ldr w4,[x2,x1,lsl 2]                    // w4 = A1[j]
            add w0,w3,w4                            // w0 = A1[i] + A1[j]
            ret
// extern "C" long TestLDR2_(unsigned int i, unsigned long j);
            .global TestLDR2_
TestLDR2_:  ldr x2,=A2                              // x2 = ptr to A2
            ldr x3,[x2,w0,uxtw 3]                   // x3 = A2[i]
            ldr x4,[x2,x1,lsl 3]                    // x4 = A2[j]
            add x0,x3,x4                            // x0 = A2[i] + A2[j]
            ret
// extern "C" short TestLDR3_(unsigned int i, unsigned long j);
            .global TestLDR3_
TestLDR3_:  adr x2,A3                               // x2 = ptr to A3
            ldrsh w3,[x2,w0,uxtw 1]                 // w3 = A3[i]
            ldrsh w4,[x2,x1,lsl 1]                  // w4 = A3[j]
            add w0,w3,w4                            // w0 = A3[i] + A3[j]
            ret
Listing 11-4. Example Ch11_04

The C++ code begins with the declarations of the assembly language functions TestLDR1_, TestLDR2_, and TestLDR3_. These functions require two integer arguments, which are used as indices to access elements in a small test array. They also return integer values of varying sizes. The C++ functions TestLDR1, TestLDR2, and TestLDR3 contain code that exercises the aforementioned assembly language functions and displays results.

The assembly language code in Listing 11-4 starts with a .data directive. This directive signifies the beginning of a section in memory that contains read-write data. The line that starts with the label A1: initializes an eight-element array of .word (32-bit) integers. This is followed by an eight-element array of .quad (64-bit) integers named A2. The final test array, A3, contains eight .short (16-bit) integer elements. Note that the definition of array A3 follows the .text directive, which means that it is a read-only array since the elements are allocated in a code section. Table 2-1 (see Chapter 2) summarizes the GNU assembler directives that are used to allocate storage space and initialize data values.

The assembly language function TestLDR1_ begins its execution with a ldr x2,=A1 instruction that loads the address of array A1 into register X2. Recall from the discussions in Chapter 2 that this form of the ldr instruction is a pseudo instruction. The assembler replaces the ldr x2,=A1 instruction with a ldr x2,offset instruction that loads the address of array A1 from a literal pool. Figure 11-1 illustrates this in greater detail. This figure contains output from the GNU debugger (with minor edits to improve readability) that shows the machine code for the TestLDRx_ functions. Note that the GNU debugger output displays runtime addresses for the ldr pseudo instructions instead of the offsets that are embedded in the instruction encodings.
Figure 11-1. Machine code for TestLDRx_ functions

The next instruction in TestLDR1_, ldr w3,[x2,w0,uxtw 2], uses extended register memory addressing (see Chapter 10, Table 10-3) to load element A1[i] into register W3. This instruction zero-extends (as specified by the extended operator uxtw) the word value in W0 (argument value i) to 64 bits, left shifts this 64-bit intermediate result by two, and adds X2 (address of array A1) to calculate the address of element A1[i]. Extended register addressing for load and store instructions also supports the sxtw operator, which sign-extends the 32-bit index register operand (the byte and halfword extension operators sxtb, sxth, uxtb, and uxth are available only for arithmetic instructions such as add and sub). The ensuing ldr w4,[x2,x1,lsl 2] instruction loads A1[j] into register W4. This instruction employs a lsl operator to calculate the required address since X1 (argument value j) is already a 64-bit wide integer.

Function TestLDR2_ uses a similar sequence of instructions to load elements from array A2, which contains 64-bit (.quad) values instead of 32-bit word values. Note that the uxtw and lsl operators shift the array indices in registers W0 and X1 by three instead of two bits since each element of the target array is eight bytes wide.

Function TestLDR3_ uses an adr x2,A3 (form PC relative address) instruction to load the address of array A3 into register X2. An adr instruction can be used here since array A3 is defined in the same .text section (just before function TestLDR1_) as the executable code. Note that in Figure 11-1, there is no literal pool entry for array A3 since the adr instruction uses PC relative offsets instead of literal pools. The next instruction, ldrsh w3,[x2,w0,uxtw 1] (load register signed halfword), loads A3[i] into register W3. The ensuing ldrsh w4,[x2,x1,lsl 1] instruction then loads A3[j] into register W4. Note that in these ldrsh instructions, the shift count is 1 since halfword values are being loaded. Here is the output for source code example Ch11_04:
TestLDR1_(3, 6) = 11
TestLDR2_(2, 7) = -50
TestLDR3_(5, 1) = -400
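
The address arithmetic performed by the extended register and shifted register addressing modes can be modeled in C++. The functions below are hypothetical (they are not part of example Ch11_04) and only sketch the calculations; the array A3 is a copy of the test data from Listing 11-4.

```cpp
#include <cstdint>

// Hypothetical C++ models of the addressing modes used in Ch11_04.

// [x2,w0,uxtw shift]: zero-extend the 32-bit index, shift, then add the base.
uint64_t ExtRegAddress(uint64_t base, uint32_t index, unsigned shift)
{
    uint64_t index64 = index;                   // uxtw (zero-extend to 64 bits)
    return base + (index64 << shift);
}

// [x2,x1,lsl shift]: the index is already 64 bits wide, so only shift and add.
uint64_t ShiftedRegAddress(uint64_t base, uint64_t index, unsigned shift)
{
    return base + (index << shift);
}

// Copy of test array A3 from Listing 11-4.
static const int16_t A3[8] = { 100, 200, -300, 400, 500, -600, 700, 800 };

// Models TestLDR3_: each ldrsh sign-extends a 16-bit element to 32 bits.
int16_t TestLDR3Ref(uint32_t i, uint64_t j)
{
    int32_t w3 = A3[i];                         // ldrsh w3,[x2,w0,uxtw 1]
    int32_t w4 = A3[j];                         // ldrsh w4,[x2,x1,lsl 1]
    return static_cast<int16_t>(w3 + w4);       // add w0,w3,w4
}
```

The shift count equals log2 of the element size in bytes: 1 for halfwords, 2 for words, and 3 for 64-bit elements. For instance, with a hypothetical base address of 0x1000, ExtRegAddress(0x1000, 3, 2) yields 0x100C, the address of the fourth element of a word array.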

Move Instructions

Recall from the discussions in Chapter 10 that the A64 instruction set uses 32-bit wide fixed-length encodings for all instructions. This means that it is sometimes necessary to use multiple instructions or pseudo instructions to load an integer constant into a register. Listing 11-5 shows the source code for example Ch11_05. This example explains how to use various A64 move instructions to load integer constants.
//-------------------------------------------------
//               Ch11_05.cpp
//-------------------------------------------------
#include <iostream>
#include <cstdint>
using namespace std;
extern "C" void MoveA_(int32_t& a0, int32_t& a1, int32_t& a2, int32_t& a3);
extern "C" void MoveB_(int64_t& b0, int64_t& b1, int64_t& b2, int64_t& b3);
extern "C" void MoveC_(int32_t& c0, int32_t& c1);
extern "C" void MoveD_(int64_t& d0, int64_t& d1, int64_t& d2);
int main(int argc, char** argv)
{
    const char nl = '\n';
    int32_t a0, a1, a2, a3;
    MoveA_(a0, a1, a2, a3);
    cout << "\nResults for MoveA_" << nl;
    cout << "a0 = " << a0 << nl;
    cout << "a1 = " << a1 << nl;
    cout << "a2 = " << a2 << nl;
    cout << "a3 = " << a3 << nl;
    int64_t b0, b1, b2, b3;
    MoveB_(b0, b1, b2, b3);
    cout << "\nResults for MoveB_" << nl;
    cout << "b0 = " << b0 << nl;
    cout << "b1 = " << b1 << nl;
    cout << "b2 = " << b2 << nl;
    cout << "b3 = " << b3 << nl;
    int32_t c0, c1;
    MoveC_(c0, c1);
    cout << "\nResults for MoveC_" << nl;
    cout << "c0 = " << c0 << nl;
    cout << "c1 = " << c1 << nl;
    int64_t d0, d1, d2;
    MoveD_(d0, d1, d2);
    cout << "\nResults for MoveD_" << nl;
    cout << "d0 = " << d0 << nl;
    cout << "d1 = " << d1 << nl;
    cout << "d2 = " << d2 << nl;
    return 0;
}
//-------------------------------------------------
//               Ch11_05_.s
//-------------------------------------------------
// extern "C" void MoveA_(int32_t& a0, int32_t& a1, int32_t& a2, int32_t& a3);
            .text
            .global MoveA_
MoveA_:     mov w7,1000                         // w7 = 1000
            str w7,[x0]
            mov w7,65536000                     // w7 = 65536000
            str w7,[x1]
            movz w7,1000,lsl 16                 // w7 = 65536000
            str w7,[x2]
            mov w7,-131072                      // w7 = -131072
            str w7,[x3]
            ret
// extern "C" void MoveB_(int64_t& b0, int64_t& b1, int64_t& b2, int64_t& b3);
            .global MoveB_
MoveB_:     mov x7,131072000                    // x7 = 131072000
            str x7,[x0]
            movz x7,2000,lsl 16                 // x7 = 131072000
            str x7,[x1]
            mov x7,429496729600                 // x7 = 429496729600
            str x7,[x2]
            movz x7,100,lsl 32                  // x7 = 429496729600
            str x7,[x3]
            ret
// extern "C" void MoveC_(int32_t& c0, int32_t& c1);
            .equ VAL1,2000000
            .equ VAL1_LO16,(VAL1 & 0xffff)
            .equ VAL1_HI16,((VAL1 & 0xffff0000) >> 16)
            .global MoveC_
MoveC_:
//          mov w7,VAL1                         // invalid value
            mov w7,VAL1_LO16                    // w7 = 33920
            movk w7,VAL1_HI16,lsl 16            // w7 = 2000000
            str w7,[x0]
            ldr w7,=VAL1                        // w7 = 2000000
            str w7,[x1]
            ret
// extern "C" void MoveD_(int64_t& d0, int64_t& d1, int64_t& d2);
            .equ VAL2,-1000000000000000
            .equ VAL2_00,(VAL2 & 0xffff)
            .equ VAL2_16,(VAL2 & 0xffff0000) >> 16
            .equ VAL2_32,(VAL2 & 0xffff00000000) >> 32
            .equ VAL2_48,(VAL2 & 0xffff000000000000) >> 48
            .equ VAL3,0x100000064               // (2**32 + 100)
            .equ VAL3_00,(VAL3 & 0xffff)
            .equ VAL3_32,(VAL3 & 0xffff00000000) >> 32
            .global MoveD_
MoveD_:
//          mov x7,VAL2                         // invalid value
            mov x7,VAL2_00
            movk x7,VAL2_16,lsl 16
            movk x7,VAL2_32,lsl 32
            movk x7,VAL2_48,lsl 48              // x7 = VAL2
            str x7,[x0]
            ldr x7,=VAL2                        // x7 = VAL2
            str x7,[x1]
            mov x7,VAL3_00
            movk x7,VAL3_32,lsl 32              // x7 = 0x100000064
            str x7,[x2]
            ret
Listing 11-5. Example Ch11_05

The C++ code in Listing 11-5 is straightforward. It begins with the declarations of the assembly language functions MoveA_, MoveB_, MoveC_, and MoveD_. These functions contain code that illustrates the loading of constant values using a variety of A64 move instructions. The remaining C++ code exercises the assembly language move functions and displays results.

Assembly language function MoveA_ begins with a mov w7,1000 (move wide immediate) instruction that loads 1000 into register W7. This instruction (and all other A64 instructions that use a W register destination operand) also sets the upper 32 bits of the corresponding X register (X7 in this case) to zero. Like its A32 counterpart, the A64 mov instruction can be used to load a subset of all possible 32-bit wide integer constants into a W register. Following the str w7,[x0] instruction is a mov w7,65536000 instruction that loads 65536000 into register W7. The mov instruction is an alias of the movz (move wide with zero) instruction. This instruction moves an optionally shifted 16-bit constant value into a register.

The ensuing movz w7,1000,lsl 16 instruction also loads 65536000 into register W7. When loading a 32-bit constant into a W register, the alias instruction mov should be employed when possible instead of a movz instruction since the former is easier to read and type. The final move instruction example in MoveA_, mov w7,-131072, loads a negative value into W7.

Function MoveB_ illustrates the use of the mov and movz instructions using 64-bit constants and X register destination operands. Note that when a movz instruction uses an X register destination operand, the lsl operator can use a shift bit count of 0 (the default), 16, 32, or 48 bits.

Just prior to the start of function MoveC_ are three .equ directives. The .equ VAL1,2000000 directive defines VAL1 as a symbolic name for the constant 2000000. The next two directive statements, .equ VAL1_LO16,(VAL1 & 0xffff) and .equ VAL1_HI16,((VAL1 & 0xffff0000) >> 16), define symbolic names for the low- and high-order 16 bits of VAL1, respectively. The first instruction of MoveC_, mov w7,VAL1, is commented out. If you remove the comment and build the project using make, the GNU assembler will generate an “immediate cannot be moved by a single instruction” error message. The next two instructions, mov w7,VAL1_LO16 and movk w7,VAL1_HI16,lsl 16 (move wide with keep), illustrate an instruction sequence that loads 2000000 into register W7. The mov instruction loads VAL1_LO16 into register W7. The ensuing movk instruction loads VAL1_HI16 into bit positions 31:16 of register W7 and leaves bits 15:0 unchanged. Function MoveC_ also contains an ldr w7,=VAL1 instruction that loads VAL1 into W7. This instruction form is easier to read but is also slower since it requires an extra memory read cycle to load VAL1 from a literal pool.

The final move function is named MoveD_. This function illustrates how to use the movk instruction to load a 64-bit constant into an X register. The .equ directives that precede the start of MoveD_ include expressions that split the constant VAL2 into four 16-bit wide values. The constant VAL3 is also split into two 16-bit wide values. Removing the comment from the mov x7,VAL2 instruction and running make will generate another GNU assembler error message. To load VAL2 into register X7, a series of mov and movk instructions is required. The first instruction of the sequence, mov x7,VAL2_00, loads VAL2_00 into register X7. The ensuing movk x7,VAL2_16,lsl 16 instruction loads VAL2_16 into bit positions 31:16 of register X7 and leaves all other bits unchanged. The movk x7,VAL2_32,lsl 32 and movk x7,VAL2_48,lsl 48 instructions load bit positions 47:32 and 63:48, respectively. When necessary, this four-instruction sequence is the recommended method for loading a 64-bit wide constant since the AArch64 execution state is optimized for this type of sequence. It is often reasonable to use a ldr x7,=VAL2 instruction for a one-time initialization since it is easier to read and type, but this approach should be avoided inside a for-loop.

In many cases, it is not necessary to use a quartet of mov and movk instructions to load a 64-bit wide constant. Function MoveD_ employs the instruction sequence mov x7,VAL3_00 and movk x7,VAL3_32,lsl 32 to load VAL3 into register X7. The reason only two instructions are required here is that bit positions 63:48 and 31:16 of the 64-bit constant VAL3 are all zeros. Execution of the mov x7,VAL3_00 instruction has already set these bits in register X7 to zero. Here are the results for source code example Ch11_05:
Results for MoveA_
a0 = 1000
a1 = 65536000
a2 = 65536000
a3 = -131072
Results for MoveB_
b0 = 131072000
b1 = 131072000
b2 = 429496729600
b3 = 429496729600
Results for MoveC_
c0 = 2000000
c1 = 2000000
Results for MoveD_
d0 = -1000000000000000
d1 = -1000000000000000
d2 = 4294967396
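
The reason d2 needs only two instructions can be expressed as a count of non-zero 16-bit chunks. The sketch below uses a hypothetical helper, MovSequenceLength, and covers only the mov/movk approach described in this section; it assumes no other encoding is used, even though the assembler can sometimes load a constant with a single instruction through a different mov alias.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of mov/movk instructions needed when a
// 64-bit constant is built one 16-bit chunk at a time. Chunks that are
// zero need no instruction because the initial mov clears them.
constexpr unsigned MovSequenceLength(uint64_t val)
{
    unsigned count = 0;
    for (unsigned shift = 0; shift < 64; shift += 16)
    {
        if (((val >> shift) & 0xffff) != 0)
            ++count;                         // chunk needs its own mov or movk
    }
    return (count == 0) ? 1 : count;         // the constant zero still takes one mov
}

// 4294967396 is 0x0000000100000064: only chunks 15:0 and 47:32 are non-zero.
static_assert(MovSequenceLength(4294967396ull) == 2, "two chunks are non-zero");
```

Applied to the other values in the output above, the helper reports one instruction for 1000, two for 2000000, and four for -1000000000000000.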

Shift Instructions

Listing 11-6 contains the code for source code example Ch11_06. This example illuminates the use of the asr (arithmetic shift right), lsl (logical shift left), lsr (logical shift right), and ror (rotate right) instructions. The C++ code begins with the requisite function declarations. Note that the source code for function PrintResult is not shown in Listing 11-6 but is included with the downloadable software package. The remaining C++ code performs test case initialization and exercises the assembly language functions ShiftA_ and ShiftB_.
//-------------------------------------------------
//               Ch11_06.cpp
//-------------------------------------------------
#include <iostream>
#include <cstdint>
using namespace std;
// Ch11_06_Misc.cpp
extern void PrintResult(const char* msg, const uint32_t* x, uint32_t a,
    size_t n, int count = -1);
// Ch11_06_.s
extern "C" void ShiftA_(uint32_t* x, uint32_t a);
extern "C" void ShiftB_(uint32_t* x, uint32_t a, uint32_t count);
void ShiftA(void)
{
    const size_t n = 4;
    uint32_t a = 0xC1234561;
    uint32_t x[4];
    ShiftA_(x, a);
    PrintResult("ShiftA_", x, a, n);
}
void ShiftB(void)
{
    const size_t n = 4;
    uint32_t a = 0xC1234561;
    uint32_t x[4];
    uint32_t count = 8;
    ShiftB_(x, a, count);
    PrintResult("ShiftB_", x, a, n, (int)count);
}
int main(int argc, char** argv)
{
    ShiftA();
    ShiftB();
    return 0;
}
//-------------------------------------------------
//               Ch11_06_.s
//-------------------------------------------------
// extern "C" void ShiftA_(uint32_t* x, uint32_t a);
            .text
            .global ShiftA_
ShiftA_:    asr w2,w1,2                         // arithmetic shift right - 2 bits
            lsl w3,w1,4                         // logical shift left - 4 bits
            lsr w4,w1,5                         // logical shift right - 5 bits
            ror w5,w1,3                         // rotate right - 3 bits
            str w2,[x0]                         // save asr result to x[0]
            str w3,[x0,4]                       // save lsl result to x[1]
            str w4,[x0,8]                       // save lsr result to x[2]
            str w5,[x0,12]                      // save ror result to x[3]
            ret
// extern "C" void ShiftB_(uint32_t* x, uint32_t a, uint32_t count);
            .global ShiftB_
ShiftB_:    asr w3,w1,w2                        // arithmetic shift right
            lsl w4,w1,w2                        // logical shift left
            lsr w5,w1,w2                        // logical shift right
            ror w6,w1,w2                        // rotate right
            str w3,[x0]                         // save asr result to x[0]
            str w4,[x0,4]                       // save lsl result to x[1]
            str w5,[x0,8]                       // save lsr result to x[2]
            str w6,[x0,12]                      // save ror result to x[3]
            ret
Listing 11-6.

Example Ch11_06

Assembly language function ShiftA_ demonstrates the use of the asr, lsl, lsr, and ror instructions using immediate operand bit counts. Following execution of these instructions, ShiftA_ uses a series of str instructions to save the result of each shift/rotate operation to array x. Note that the immediate bit count forms of the asr, lsl, lsr, and ror instructions are aliases for sbfm (signed bitfield move), ubfm (unsigned bitfield move), ubfm, and extr (extract register), respectively. This is something to keep in mind when using a debugger, since a disassembly may display the underlying instruction instead of the alias.
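
To see why a rotate can be expressed as an extract, consider the following C++ sketch. It models extr as taking 32 bits out of the 64-bit concatenation of two registers; when both source registers are the same, the result is a rotate right. The function names Extr32 and RorImm32 are hypothetical and are not part of the Ch11_06 source code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of "extr wd,wn,wm,lsb": concatenates wn (high half)
// and wm (low half), then extracts 32 bits starting at bit position lsb.
constexpr uint32_t Extr32(uint32_t hi, uint32_t lo, unsigned lsb)
{
    const uint64_t pair = (static_cast<uint64_t>(hi) << 32) | lo;
    return static_cast<uint32_t>(pair >> (lsb & 31));
}

// "ror wd,ws,n" with an immediate count behaves like extr wd,ws,ws,n.
constexpr uint32_t RorImm32(uint32_t a, unsigned n)
{
    return Extr32(a, a, n);
}

// Same operand and bit count as the ror instruction in ShiftA_.
static_assert(RorImm32(0xC1234561u, 3) == 0x382468ACu, "ror by 3 bits");
```

The widening to 64 bits avoids the special case that a plain (a >> n) | (a << (32 - n)) expression needs when n is zero.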

Function ShiftB_ resembles ShiftA_ in that it employs the same shift and rotate instructions. However, ShiftB_ uses the variable bit count forms of the asr, lsl, lsr, and ror instructions. Note that the second source operand of these instructions is a register that contains the shift/rotate bit count. The variable bit count forms of the asr, lsl, lsr, and ror instructions are aliases for asrv (arithmetic shift right variable), lslv (logical shift left variable), lsrv (logical shift right variable), and rorv (rotate right variable), respectively.
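
The variable bit count forms can be approximated in portable C++ as shown below. This is a sketch with hypothetical function names; it relies on the A64 rule that the variable shift instructions use the count register modulo the register size (32 for W registers), which is why each helper reduces count before shifting.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical models of the 32-bit variable shift/rotate instructions.
// Each one uses only the low five bits of the count (count modulo 32).
constexpr uint32_t Lslv32(uint32_t a, uint32_t count)
{
    return a << (count % 32);
}

constexpr uint32_t Lsrv32(uint32_t a, uint32_t count)
{
    return a >> (count % 32);
}

constexpr uint32_t Asrv32(uint32_t a, uint32_t count)
{
    const unsigned n = count % 32;
    const uint32_t shifted = a >> n;
    // Fill the vacated high bits with copies of the sign bit.
    const uint32_t sign_fill = (a & 0x80000000u) ? ~(0xffffffffu >> n) : 0u;
    return shifted | sign_fill;
}

constexpr uint32_t Rorv32(uint32_t a, uint32_t count)
{
    const unsigned n = count % 32;
    return (n == 0) ? a : ((a >> n) | (a << (32 - n)));
}

// Same operand and count that function ShiftB passes to ShiftB_.
static_assert(Asrv32(0xC1234561u, 8) == 0xFFC12345u, "asr by 8 bits");
static_assert(Rorv32(0xC1234561u, 8) == 0x61C12345u, "ror by 8 bits");
```

Unlike these helpers, a plain C++ shift by a count of 32 or more is undefined behavior, so the explicit modulo is needed to mirror what the instructions do.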

In this source code example, functions ShiftA_ and ShiftB_ both used 32-bit W register operands. The asr, lsl, lsr, and ror instructions can also be used with 64-bit wide X register operands. Finally, unlike the A32 instruction set, the A64 instruction set does not include a rrx (rotate right with extend) instruction. Here is the output for source code example Ch11_06:
ShiftA_
a:    0xc1234561 | 1100 0001 0010 0011 0100 0101 0110 0001
x[0]: 0xf048d158 | 1111 0000 0100 1000 1101 0001 0101 1000  | asr #2
x[1]: 0x12345610 | 0001 0010 0011 0100 0101 0110 0001 0000  | lsl #4
x[2]: 0x06091a2b | 0000 0110 0000 1001 0001 1010 0010 1011  | lsr #5
x[3]: 0x382468ac | 0011 1000 0010 0100 0110 1000 1010 1100  | ror #3
ShiftB_ - count = 8
a:    0xc1234561 | 1100 0001 0010 0011 0100 0101 0110 0001
x[0]: 0xffc12345 | 1111 1111 1100 0001 0010 0011 0100 0101  | asr
x[1]: 0x23456100 | 0010 0011 0100 0101 0110 0001 0000 0000  | lsl
x[2]: 0x00c12345 | 0000 0000 1100 0001 0010 0011 0100 0101  | lsr
x[3]: 0x61c12345 | 0110 0001 1100 0001 0010 0011 0100 0101  | ror

Bitwise Logical Operations

The final source code example of this chapter is called Ch11_07. This example demonstrates how to carry out bitwise logical operations including AND, OR, and exclusive OR. Listing 11-7 shows the source code for example Ch11_07.
//-------------------------------------------------
//               Ch11_07.cpp
//-------------------------------------------------
#include <iostream>
#include <cstdint>
using namespace std;
// Ch11_07_Misc.cpp
extern void PrintResultA(const char* msg, const uint32_t* x, uint32_t a, uint32_t b, size_t n);
extern void PrintResultB(const char* msg, const uint32_t* x, uint32_t a, size_t n);
// Ch11_07_.s
extern "C" void BitwiseOpsA_(uint32_t* x, uint32_t a, uint32_t b);
extern "C" void BitwiseOpsB_(uint32_t* x, uint32_t a);
void BitwiseOpsA(void)
{
    const size_t n = 3;
    uint32_t a, b, x[n];
    a = 0x12345678;
    b = 0xaa55aa55;
    BitwiseOpsA_(x, a, b);
    PrintResultA("BitwiseOpsA_ Test #1", x, a, b, n);
    a = 0x12345678;
    b = 0x00ffc384;
    BitwiseOpsA_(x, a, b);
    PrintResultA("BitwiseOpsA_ Test #2", x, a, b, n);
}
void BitwiseOpsB(void)
{
    const size_t n = 4;
    uint32_t a, x[n];
    a = 0x12345678;
    BitwiseOpsB_(x, a);
    PrintResultB("BitwiseOpsB_ Test #1", x, a, n);
    a = 0xaa55aa55;
    BitwiseOpsB_(x, a);
    PrintResultB("BitwiseOpsB_ Test #2", x, a, n);
}
int main(int argc, char** argv)
{
    BitwiseOpsA();
    cout << "\n";
    BitwiseOpsB();
    return 0;
}
//-------------------------------------------------
//               Ch11_07_.s
//-------------------------------------------------
// extern "C" void BitwiseOpsA_(uint32_t* x, uint32_t a, uint32_t b);
            .text
            .global BitwiseOpsA_
BitwiseOpsA_:
// Perform various bitwise logical operations
            and w3,w1,w2                        // a AND b
            str w3,[x0]
            orr w3,w1,w2                        // a OR b
            str w3,[x0,4]
            eor w3,w1,w2                        // a EOR b
            str w3,[x0,8]
            ret
// extern "C" void BitwiseOpsB_(uint32_t* x, uint32_t a);
            .global BitwiseOpsB_
BitwiseOpsB_:
            and w2,w1,0x0000ff00                // a AND 0x0000ff00
            str w2,[x0]
            orr w2,w1,0x00ff0000                // a OR 0x00ff0000
            str w2,[x0,4]
            eor w2,w1,0xff000000                // a EOR 0xff000000
            str w2,[x0,8]
//          and w2,w1,0xcc00ff00                // invalid imm. operand
            mov w2,0xff00
            movk w2,0xcc00,lsl 16               // w2 = 0xcc00ff00
            and w2,w1,w2                        // a AND 0xcc00ff00
            str w2,[x0,12]
            ret
Listing 11-7.

Example Ch11_07

The C++ code in Listing 11-7 performs straightforward test case initialization and displays results. Function BitwiseOpsA_ illustrates the use of the and (bitwise AND), orr (bitwise OR), and eor (bitwise exclusive OR) instructions using W register operands. These instructions can also be used with X register operands.

Function BitwiseOpsB_ highlights the use of the and, orr, and eor instructions with immediate operands. Note that the and w2,w1,0xcc00ff00 instruction is commented out. Removing this comment and running make will cause the GNU assembler to generate an “immediate out of range” error message. The reason for this is that the machine language bit pattern for the and instruction does not support encoding of the constant 0xCC00FF00. In cases like this, the required constant must be loaded into a register using one or more mov, movz, or movk instructions. BitwiseOpsB_ uses the instruction pair mov w2,0xff00 and movk w2,0xcc00,lsl 16 to load 0xCC00FF00 into register W2. Other calculating instructions that use immediate constants, including add and sub, have comparable encoding limitations, although the set of encodable constants differs from instruction to instruction. Here are the results for source code example Ch11_07:
BitwiseOpsA_ Test #1
a:    0x12345678 | 0001 0010 0011 0100 0101 0110 0111 1000
b:    0xaa55aa55 | 1010 1010 0101 0101 1010 1010 0101 0101
x[0]: 0x02140250 | 0000 0010 0001 0100 0000 0010 0101 0000  | a AND b
x[1]: 0xba75fe7d | 1011 1010 0111 0101 1111 1110 0111 1101  | a OR  b
x[2]: 0xb861fc2d | 1011 1000 0110 0001 1111 1100 0010 1101  | a EOR b
BitwiseOpsA_ Test #2
a:    0x12345678 | 0001 0010 0011 0100 0101 0110 0111 1000
b:    0x00ffc384 | 0000 0000 1111 1111 1100 0011 1000 0100
x[0]: 0x00344200 | 0000 0000 0011 0100 0100 0010 0000 0000  | a AND b
x[1]: 0x12ffd7fc | 0001 0010 1111 1111 1101 0111 1111 1100  | a OR  b
x[2]: 0x12cb95fc | 0001 0010 1100 1011 1001 0101 1111 1100  | a EOR b
BitwiseOpsB_ Test #1
a:    0x12345678 | 0001 0010 0011 0100 0101 0110 0111 1000
x[0]: 0x00005600 | 0000 0000 0000 0000 0101 0110 0000 0000  | a AND 0x0000ff00
x[1]: 0x12ff5678 | 0001 0010 1111 1111 0101 0110 0111 1000  | a OR  0x00ff0000
x[2]: 0xed345678 | 1110 1101 0011 0100 0101 0110 0111 1000  | a EOR 0xff000000
x[3]: 0x00005600 | 0000 0000 0000 0000 0101 0110 0000 0000  | a AND 0xcc00ff00
BitwiseOpsB_ Test #2
a:    0xaa55aa55 | 1010 1010 0101 0101 1010 1010 0101 0101
x[0]: 0x0000aa00 | 0000 0000 0000 0000 1010 1010 0000 0000  | a AND 0x0000ff00
x[1]: 0xaaffaa55 | 1010 1010 1111 1111 1010 1010 0101 0101  | a OR  0x00ff0000
x[2]: 0x5555aa55 | 0101 0101 0101 0101 1010 1010 0101 0101  | a EOR 0xff000000
x[3]: 0x8800aa00 | 1000 1000 0000 0000 1010 1010 0000 0000  | a AND 0xcc00ff00
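
Whether a constant is accepted as an immediate operand by and, orr, or eor can be predicted with a short check. The sketch below assumes the usual A64 bitmask immediate rule: the 32-bit value must consist of an element of 2, 4, 8, 16, or 32 bits repeated across the register, where the element is a single run of consecutive 1 bits (possibly rotated) and is neither all zeros nor all ones. The function names are hypothetical and the code is an illustration, not an assembler implementation.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if the e-bit value elem is one contiguous run
// of 1 bits (wrap-around allowed), excluding all zeros and all ones.
bool IsRotatedRunOfOnes(uint32_t elem, unsigned e)
{
    const uint32_t mask = (e == 32) ? 0xffffffffu : ((1u << e) - 1u);
    if (elem == 0 || elem == mask)
        return false;

    // Rotate right by one bit within e bits. XOR with the original marks
    // every 0/1 boundary; a single run has exactly two boundaries.
    const uint32_t rot = ((elem >> 1) | (elem << (e - 1))) & mask;
    return std::bitset<32>(elem ^ rot).count() == 2;
}

// Hypothetical helper: true if v can be a 32-bit logical immediate.
bool IsLogicalImm32(uint32_t v)
{
    for (unsigned e = 2; e <= 32; e *= 2)
    {
        const uint32_t mask = (e == 32) ? 0xffffffffu : ((1u << e) - 1u);
        const uint32_t elem = v & mask;

        // The element must repeat unchanged across all 32 bits.
        bool replicated = true;
        for (unsigned pos = 0; pos < 32; pos += e)
        {
            if (((v >> pos) & mask) != elem)
                replicated = false;
        }

        if (replicated && IsRotatedRunOfOnes(elem, e))
            return true;
    }
    return false;
}
```

With this check, the three constants used directly in BitwiseOpsB_ (0x0000ff00, 0x00ff0000, and 0xff000000) are reported as encodable, while 0xcc00ff00 is not because it contains more than one run of 1 bits and does not repeat at a smaller element size.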

Summary

Here are the key learning points for Chapter 11:
  • The add and sub instructions perform signed and unsigned integer addition and subtraction using 32- or 64-bit wide operands.

  • The mul instruction performs signed and unsigned integer multiplication; it saves the low-order 32/64 bits of the resultant 64-/128-bit product. The smull and umull instructions carry out multiplication using 32-bit signed or unsigned integers. The full 64-bit wide product is saved.

  • The sdiv and udiv instructions perform signed and unsigned integer division, respectively. These instructions calculate only the quotient.

  • Extended register addressing can be used to load elements from an array using byte, halfword, or word indices.

  • The movz instruction loads a 16-bit immediate constant (with optional shift) into a W or X register. The mov instruction is an alias of movz and is often used instead of movz to improve code readability.

  • The movk instruction loads a 16-bit immediate constant (with optional shift) into a register without altering other bits.

  • The and, orr, and eor instructions carry out bitwise logical AND, OR, and exclusive OR operations.

  • The GNU C++ calling convention for Armv8-64 uses registers X0/W0–X7/W7 to pass integer or pointer arguments to a function. A function must use register X0 or W0 to return a 64-bit or 32-bit wide integer value to its caller.
