10 Modern Processor Architectures and Instruction Sets

Most modern personal computers contain a processor supporting either the Intel or AMD version of the x86 32-bit and x64 64-bit architectures. In contrast, almost all smartphones, smartwatches, tablets, and many embedded systems contain ARM 32-bit or 64-bit processors. This chapter takes a detailed look at the registers and instruction sets of these processor families.

After completing this chapter, you will understand the high-level architectures and unique attributes of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, assembly languages, and key aspects of legacy features supported in these architectures.

This chapter covers the following topics:

x86 architecture and instruction set
x64 architecture and instruction set
32-bit ARM architecture and instruction set
64-bit ARM architecture and instruction set

Technical requirements

The files for this chapter, including the answers to the exercises, are available at https://github.com/PacktPublishing/Modern-Computer-Architecture-and-Organization-Second-Edition.

x86 architecture and instruction set

For the purposes of this discussion, the term x86 refers to the 16-bit and 32-bit instruction set architecture of the series of processors that began with the Intel 8086, introduced in 1978. The 8088, released in 1979, is functionally very similar to the 8086, except it has an 8-bit data bus instead of the 16-bit bus of the 8086. The 8088 was the central processor in the original IBM PC.

Subsequent generations of this processor series were named 80186, 80286, 80386, and 80486, leading to the term “x86” as shorthand for members of the family. Later generations dropped the numeric naming convention and received the names Pentium, Core, i Series, Celeron, and Xeon.

Advanced Micro Devices (AMD), a semiconductor manufacturing company that competes with Intel, has been producing x86-compatible processors since 1982. Some recent AMD x86 processor generations have been named Ryzen, Opteron, Athlon, Turion, Phenom, and Sempron.

Code execution compatibility between Intel and AMD processors is good in most respects. There are, however, some key differences between processors from the two vendors, including the chip pin configuration and chipset compatibility.

In general, Intel processors only work in motherboards and with chipsets designed for Intel chips, and AMD processors only work in motherboards and with chipsets designed for AMD chips. We will highlight some other differences between Intel and AMD processors later in this section.

The 8086 and 8088 are 16-bit processors, despite the 8-bit data bus of the 8088. Internal registers in these processors are 16 bits wide and the instruction set operates on 16-bit data values. The 8088 transparently executes two bus cycles to transfer each 16-bit value between the processor and memory.

The 8086 and 8088 do not support the more sophisticated features of modern processors such as paged virtual memory and protection rings. These early processors also have only 20 address lines, limiting the addressable memory to 1 MB. A 20-bit address cannot fit in a 16-bit register, so it is necessary to use a somewhat complicated system of segment registers and offsets to access the full 1 MB address space.
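To make the segment and offset scheme concrete, here is a minimal Python sketch of how a real mode address is formed. The function name `real_mode_physical_address` is our own invention for illustration. The model assumes the standard 8086 behavior: the 16-bit segment value is shifted left by 4 bits, the 16-bit offset is added, and the sum wraps at the 1 MB limit imposed by the 20 address lines.

```python
def real_mode_physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # Shift the segment left 4 bits (multiply by 16), add the offset,
    # and keep only the 20 bits that the address lines can carry.
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(real_mode_physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(real_mode_physical_address(0x1235, 0x0000)))  # 0x12350
```

Note that many different segment and offset pairs refer to the same physical location, which is part of what makes this addressing scheme somewhat complicated to work with.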

In 1985, Intel released the 80386 with enhancements that mitigate many of these limitations. The 80386 introduced these features:

32-bit architecture: Addresses, registers, and the ALU are 32 bits wide and instructions operate natively on operands up to 32 bits wide.
Protected mode: This mode implements the multilevel privilege mechanism consisting of ring numbers 0 to 3 that we examined in Chapter 9, Specialized Processor Extensions. In Windows and Linux, ring 0 is kernel mode, and ring 3 is user mode. Rings 1 and 2 are not used in these operating systems.
On-chip MMU: The 80386 MMU supports a flat memory model enabling any location in the 4 GB space to be accessed with a 32-bit address. Manipulation of segment registers and offsets is no longer required. The MMU supports paged virtual memory.
3-stage instruction pipeline: The pipeline accelerates instruction execution, as discussed in Chapter 8, Performance-Enhancing Techniques.
Hardware debug registers: The debug registers support setting up to four breakpoints that stop code execution at a specified virtual address when the address is accessed and a selected condition is satisfied. The available break conditions are code execution, data write, and data read or write. These registers are only available for use by code running in ring 0.

Modern x86 processors boot into the 16-bit operating mode of the original 8086, which is now called real mode. This mode retains compatibility with software written for the 8086/8088 environment, such as the MS-DOS operating system.

In most modern systems running on x86 processors, a transition to protected mode occurs during system startup. Once in protected mode, the operating system remains in protected mode until the computer shuts down.

MS-DOS ON A MODERN PC

Although the x86 processor in a modern PC is compatible at the instruction level with the original 8088, running an old copy of MS-DOS on a modern computer system is unlikely to be a straightforward process. The peripheral devices and their interfaces in modern PCs are not compatible with the corresponding interfaces in PCs from the 1980s. MS-DOS would need a driver that understands how to interact with the USB-connected keyboard of a modern motherboard, for example.

These days, the primary use for 16-bit mode in x86 processors is to serve as a bootloader for a protected mode operating system. Because most developers of computerized devices and the software that runs on them are unlikely to be involved in implementing such a capability, the remainder of our x86 discussion in this chapter will address protected mode and the associated 32-bit flat memory model.

The x86 architecture supports unsigned and signed two’s complement integer data types with widths of 8, 16, 32, 64, and 128 bits. The names assigned to these data types are as follows:

Byte: 8 bits
Word: 16 bits
Doubleword: 32 bits
Quadword: 64 bits
Double quadword: 128 bits

In most cases, the x86 architecture does not mandate the storage of these data types on natural boundaries. The natural boundary of a data type is any address evenly divisible by the size of the data type in bytes.

Storing any of the multi-byte types at unaligned boundaries is allowed but is discouraged because it causes a negative performance impact: instructions operating on unaligned data consume additional clock cycles. A few instructions that operate on double quadwords require naturally aligned storage and will generate a general protection fault if unaligned access is attempted.
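The natural boundary rule can be expressed as a one-line check. The following Python sketch uses a hypothetical helper name, `is_naturally_aligned`, and simply tests whether an address is evenly divisible by the size of the data type in bytes:

```python
def is_naturally_aligned(address: int, size_in_bytes: int) -> bool:
    """Return True if address lies on a natural boundary for the data size."""
    return address % size_in_bytes == 0


# A doubleword (4 bytes) at 0x1000 is aligned; at 0x1002 it is not.
print(is_naturally_aligned(0x1000, 4))   # True
print(is_naturally_aligned(0x1002, 4))   # False
# A double quadword (16 bytes) must start at a multiple of 16.
print(is_naturally_aligned(0x1010, 16))  # True
```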

x86 natively supports floating-point data types in widths of 16, 32, 64, and 80 bits. The 32-bit, 64-bit, and 80-bit formats are those presented in Chapter 9, Specialized Processor Extensions. The 16-bit format is called half-precision floating-point and has a sign bit, a 5-bit exponent, and a 10-bit stored mantissa that, together with an implied leading 1 bit, provides 11 bits of precision. The half-precision floating-point format is used extensively in GPU processing.
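As an illustration of the half-precision layout, the sketch below uses Python's standard `struct` module, whose `e` format code packs a value as a 16-bit half-precision float. The helper name `half_precision_fields` is our own. It splits the 16 bits into a sign bit, a 5-bit exponent, and 10 stored mantissa bits; the implied leading 1 bit is not stored, which is how 10 stored bits provide 11 bits of precision.

```python
import struct


def half_precision_fields(value: float) -> tuple:
    """Return (sign, exponent, mantissa) fields of a half-precision float."""
    # Pack as half precision, then reinterpret the 2 bytes as an integer.
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15               # 1 bit
    exponent = (bits >> 10) & 0x1F  # 5 bits, biased by 15
    mantissa = bits & 0x3FF         # 10 stored bits
    return sign, exponent, mantissa


print(half_precision_fields(1.0))   # (0, 15, 0)
print(half_precision_fields(-2.5))  # (1, 16, 256)
```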

In the next section, we will look at the x86 register set in detail.

The x86 register set

The x86 architecture protected mode has eight 32-bit wide general-purpose registers, a flags register, and an instruction pointer. There are also six segment registers and additional processor model-specific configuration registers. The segment registers and model-specific registers are configured by system software during startup and are, in general, not relevant to the developers of applications and device drivers. For these reasons, we will not discuss the segment registers and model-specific registers further.

The 16-bit general-purpose registers in the original 8086 architecture are named AX, CX, DX, BX, SP, BP, SI, and DI. The first four registers are listed in this non-alphabetic order because this is the sequence in which these eight registers are pushed onto the stack by a pushad (push all registers) instruction.

In the transition to the 32-bit architecture of the 80386, each register grew to 32 bits. The 32-bit version of a register’s name is prefixed with the letter “E” to indicate this extension.

It is possible to access portions of 32-bit registers in smaller bit widths. For example, the lower 16 bits of the 32-bit EAX register are referenced as AX. The AX register can be further accessed as individual bytes using the names AH (high-order byte) and AL (low-order byte). The following diagram shows the register names and the subsets of each:

Figure 10.1: Register names and subsets

Writing to a portion of a 32-bit register, for example, the AL register, affects only the bits in that portion. In the case of AL, loading an 8-bit value modifies the lowest 8 bits of EAX, leaving the other 24 bits unaffected.
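The effect of a partial register write can be modeled with bit masks. In this Python sketch, the function names `write_al`, `write_ah`, and `write_ax` are hypothetical helpers that take the current 32-bit EAX value and return the new EAX value after the named subregister is written:

```python
def write_al(eax: int, value: int) -> int:
    """Replace bits 0-7 of EAX, leaving the upper 24 bits unchanged."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


def write_ah(eax: int, value: int) -> int:
    """Replace bits 8-15 of EAX, leaving all other bits unchanged."""
    return (eax & 0xFFFF00FF) | ((value & 0xFF) << 8)


def write_ax(eax: int, value: int) -> int:
    """Replace bits 0-15 of EAX, leaving the upper 16 bits unchanged."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)


print(hex(write_al(0x12345678, 0xAB)))    # 0x123456ab
print(hex(write_ah(0x12345678, 0xAB)))    # 0x1234ab78
print(hex(write_ax(0x12345678, 0xBEEF)))  # 0x1234beef
```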

In keeping with the x86’s CISC architecture, several functions associated with various instructions are tied to specific registers. Table 10.1 provides a description of the functions associated with each of the x86 general-purpose registers:

Register	Name	Function
`EAX`	Accumulator	Arithmetic operations
`ECX`	Counter	Loop counter and shift/rotate counter
`EDX`	Data	Arithmetic and I/O operations
`EBX`	Base	Pointer to data
`ESP`	Stack pointer	Pointer to the top of the stack
`EBP`	Base pointer	Pointer to the stack base within a function
`ESI`	Source index	Pointer to the source location in array operations
`EDI`	Destination index	Pointer to the destination location in array operations

Table 10.1: x86 general-purpose registers and associated functions

These register-specific functions contrast with the architectures of many RISC processors, which tend to provide a greater number of general-purpose registers. Registers within a RISC processor are, for the most part, functionally equivalent to one another.

The x86 flags register, EFLAGS, contains the processor status bits described in Table 10.2:

Bit	Name	Function
0	`CF`	Carry flag: Indicates if addition produced a carry or subtraction produced a borrow. Used as input by addition and subtraction instructions.
2	`PF`	Parity flag: Set if the low 8 bits of the result contain an even number of 1 bits.
4	`AF`	Adjust flag: Indicates if addition produced a carry or subtraction produced a borrow from the lower 4 bits. Used in BCD arithmetic.
6	`ZF`	Zero flag: Set if the result of an operation is zero.
7	`SF`	Sign flag: Set if the result of an operation is negative.
8	`TF`	Trap flag: Used in single-step debugging.
9	`IF`	Interrupt enable flag: Setting this bit enables hardware interrupts.
10	`DF`	Direction flag: Controls the direction of string processing. When clear, the order is lowest to highest addresses. When set, the order is highest to lowest addresses.
11	`OF`	Overflow flag: Set if an operation resulted in a signed overflow.
12-13	`IOPL`	I/O privilege level: The least-privileged ring allowed unrestricted use of I/O instructions in the current task. Ring 0 is kernel mode, and ring 3 is user mode.
14	`NT`	Nested task flag: Controls the chaining of interrupts.
16	`RF`	Resume flag: Used for processing exceptions during debugging.
17	`VM`	Virtual 8086 mode flag: If set, 8086 compatibility mode is active. This mode allows some MS-DOS applications to be run in the context of a protected mode operating system.
18	`AC`	Alignment check flag: If set, memory alignment checking is active. For example, if the AC flag is set, storing a 16-bit value to an odd address triggers an Alignment Check exception. x86 processors can perform unaligned memory accesses when this flag is not set, but the number of instruction cycles required may increase.
19	`VIF`	Virtual interrupt flag: Virtual version of the `IF` flag in virtual 8086 mode.
20	`VIP`	Virtual interrupt pending flag: Set when an interrupt is pending in virtual 8086 mode.
21	`ID`	ID flag: If this bit can be set, the `cpuid` instruction is supported. `cpuid` returns processor identification and feature information.

Table 10.2: x86 flags register bits

All bits in the EFLAGS register that are not listed in Table 10.2 are reserved and are unused.
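To see how the arithmetic status flags in Table 10.2 relate to a result, the following Python sketch models an 8-bit addition and derives CF, PF, AF, ZF, SF, and OF from it. The function name `add8_flags` is our own, and the model is a simplified illustration rather than a description of the processor's internal logic.

```python
def add8_flags(a: int, b: int) -> dict:
    """Model an 8-bit add and return the result with the status flags."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    return {
        "result": result,
        # Carry out of bit 7
        "CF": int(total > 0xFF),
        # Even number of 1 bits in the low 8 bits of the result
        "PF": int(bin(result).count("1") % 2 == 0),
        # Carry out of the lower 4 bits
        "AF": int((a & 0xF) + (b & 0xF) > 0xF),
        "ZF": int(result == 0),
        # Copy of the most significant bit of the result
        "SF": result >> 7,
        # Signed overflow: both operands differ in sign from the result
        "OF": int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }


print(add8_flags(0x7F, 0x01))  # Signed overflow: 127 + 1
print(add8_flags(0xFF, 0x01))  # Unsigned carry: 255 + 1
```

The first call sets SF, AF, and OF because adding 1 to the largest positive signed byte yields a negative result. The second call sets CF, ZF, AF, and PF because the unsigned sum wraps around to zero.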

The 32-bit instruction pointer, EIP, contains the address of the next instruction to execute, unless a branch is taken. When a branch is taken, the address of the branch destination is loaded into EIP and execution continues from there.

The x86 architecture is little-endian, meaning multi-byte values are stored in memory with the least significant byte at the lowest address and the most significant byte at the highest address.
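A short Python sketch shows the resulting byte order. It uses the built-in `int.to_bytes` and `int.from_bytes` methods to mimic how a 32-bit value would be laid out in memory, starting at the lowest address:

```python
value = 0x12345678

# Bytes in order of increasing memory address on a little-endian machine
stored = value.to_bytes(4, "little")
print(" ".join(f"{byte:02x}" for byte in stored))  # 78 56 34 12

# Reading the same bytes back as little-endian recovers the value
recovered = int.from_bytes(stored, "little")
print(hex(recovered))  # 0x12345678
```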

x86 addressing modes

As one would expect for a CISC architecture, x86 supports a variety of addressing modes. There are several rules associated with addressing source and destination operands that must be followed to create valid instructions. For instance, the sizes of the source and destination operands of a mov instruction must be equal. The assembler will attempt to select a suitable size for an operand that has an ambiguous size (for example, an immediate value of 7) to match the width of a destination location (such as the 32-bit register EAX). In cases where the size of an operand cannot be inferred, size keywords such as byte ptr must be provided.

The assembly language in these examples uses Intel syntax, which places the operands in destination-source order. Intel syntax is used primarily in the Windows and MS-DOS contexts. An alternative notation, known as AT&T syntax, places operands in source-destination order. AT&T syntax is used in Unix-based operating systems. All examples in this book will use the Intel syntax.

The x86 architecture supports a variety of addressing modes, which we will look at next. Comments in assembly code begin with a semicolon and continue to the end of the line.

Implied addressing

In this addressing mode, the register is implied by the instruction opcode. For example:

clc ; Clear the carry flag (CF in the EFLAGS register)

Register addressing

One or both source and destination registers are encoded in the instruction:

mov eax, ecx ; Copy the contents of register ECX to EAX

Registers may be used as the first operand, the second operand, or both operands.

Immediate addressing

An immediate value is provided as an instruction operand:

mov eax, 7 ; Move the 32-bit value 7 into EAX
mov ax, 7 ; Move the 16-bit value 7 into AX (the lower 16 bits of EAX)

When using Intel syntax, it is not necessary to prefix immediate values with the # character.

Direct memory addressing

The address of the value is provided as an instruction operand:

mov eax, [078bch] ; Copy the 32-bit value at hex address 78BC to EAX

In x86 assembly code, square brackets surrounding an expression indicate the expression is an address. When moves or other operations are performed on square-bracketed operands, the value being operated upon is the data at the specified address. The exception to this rule is the LEA (load effective address) instruction, which we’ll examine later.

Register indirect addressing

The operand is a register containing the address of the data value:

mov eax, [esi] ; Copy the 32-bit value at the address contained in ESI to
               ; EAX

This mode is equivalent to using a pointer to reference a variable in C or C++.

Indexed addressing

The operand indicates a register plus offset that combine to provide the address of the data value:

mov eax, [esi + 0bh] ; Copy the 32-bit value at the address (ESI + 0bh) to
                     ; EAX

This mode is useful for accessing the elements of a data structure. In this scenario, the ESI register contains the address of the structure, and the added constant is the byte offset of the element from the beginning of the structure.

Based indexed addressing

The operand indicates a base register, an index register, and an offset that sum together to calculate the address of the data value:

mov eax, [ebx + esi + 10] ; Copy the 32-bit value starting at the address
                          ; (EBX + ESI + 10) to EAX

This mode is useful for accessing individual data elements within an array of data structures. In this example, the EBX register contains the address of the beginning of the structure array, ESI contains the offset of the referenced structure within the array, and the constant value (10) is the offset of the desired element from the beginning of the selected structure.

Based indexed addressing with scaling

The operand is composed of a base register, an index register multiplied by a scale factor, and an offset that sum together to calculate the address of the data value:

mov eax, [ebx + esi*4 + 10] ; Copy the 32-bit value starting at the
                            ; address (EBX + ESI*4 + 10) to EAX

In this addressing mode, the value in the index register can be multiplied by 1 (the default), 2, 4, or 8 before being summed with the other components of the operand address. There is no performance penalty associated with using the scaling multiplier. This feature is helpful when iterating over arrays containing elements with sizes of 2, 4, or 8 bytes.
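The address calculation performed in the based indexed modes can be sketched in Python as follows. The function name `effective_address` and its parameter names are our own choices for illustration. The result is masked to 32 bits to model the width of an address in the flat memory model.

```python
def effective_address(base: int = 0, index: int = 0,
                      scale: int = 1, displacement: int = 0) -> int:
    """Compute base + index*scale + displacement as a 32-bit address."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Models the operand [ebx + esi*4 + 10] with EBX = 0x1000 and ESI = 3
print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=10)))
# 0x1016
```

With a scale of 4, stepping ESI through 0, 1, 2, and so on visits consecutive doubleword elements without any separate multiply instruction.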

Most of the general-purpose registers can be used as the base or index register in the based addressing modes.

The following diagram shows the possible combinations of register usage and scaling in the based addressing modes:

Figure 10.2: Based addressing mode

All eight general-purpose registers are available for use as the base register. Of those eight, only ESP is unavailable for use as the index register.

x86 instruction categories

The x86 instruction set was introduced with the Intel 8086 and has been extended several times over the years. Some of the most significant changes relate to the extension of the architecture from 16 to 32 bits, which added protected mode and paged virtual memory. In almost all cases, the new capabilities have been added while retaining full backward compatibility.

The full x86 instruction set contains several hundred instructions. We will not discuss all of them in this chapter. This section will provide brief summaries of the more important and commonly encountered instructions applicable to user-mode applications and device drivers.

This subset of x86 instructions can be divided into a few general categories: data movement; stack manipulation; arithmetic and logic; conversions; control flow; string and flag manipulation; input/output; and protected mode. We will also cover some miscellaneous instructions that do not fall into any specific category.

Data movement

Data movement instructions do not affect the processor flags. The following instructions perform data movement:

mov: Copies the data value referenced by the second operand to the location provided as the first operand.
cmovcc: Conditionally moves the second operand’s data to the register provided as the first operand if the cc condition is true. The condition is determined from one or more of the following processor flags: CF, ZF, SF, OF, and PF. The condition codes are e (equal), ne (not equal), g (greater), ge (greater or equal), a (above), ae (above or equal), l (less), le (less or equal), b (below), be (below or equal), o (overflow), no (no overflow), z (zero), nz (not zero), s (SF=1), and ns (SF=0). Two further conditions, cxz (register CX is zero) and ecxz (the ECX register is zero), are available only with the jcc conditional jump instructions, not with cmovcc.
movsx, movzx: These are variants of the mov instruction performing sign extension and zero extension, respectively. The source operand must be a smaller size than the destination.
lea: Computes the address provided by the second operand and stores it at the location given in the first operand. The second operand is surrounded by square brackets. Unlike the other data movement instructions, the computed address is stored in the destination rather than the data value located at that address.

Stack manipulation

Stack manipulation instructions do not affect the processor flags. These instructions are:

push: Decrements ESP by 4, and then places the 32-bit operand into the stack location pointed to by ESP.
pop: Copies the 32-bit data value pointed to by ESP to the operand location (a register or memory address), and then increments ESP by 4.
pushfd, popfd: Pushes or pops the EFLAGS register.
pushad, popad: Pushes or pops the EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI registers, in that order.
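The behavior of push and pop can be modeled with a small Python class. The class name `Stack32` and the starting ESP value are hypothetical, and memory is represented as a dictionary of 32-bit values keyed by address to keep the sketch short.

```python
class Stack32:
    """Simplified model of the x86 32-bit stack, which grows downward."""

    def __init__(self, esp: int = 0x1000):
        self.esp = esp
        self.memory = {}  # address -> 32-bit value

    def push(self, value: int) -> None:
        # Decrement ESP by 4, then store the operand at the new ESP
        self.esp = (self.esp - 4) & 0xFFFFFFFF
        self.memory[self.esp] = value & 0xFFFFFFFF

    def pop(self) -> int:
        # Read the value at ESP, then increment ESP by 4
        value = self.memory[self.esp]
        self.esp = (self.esp + 4) & 0xFFFFFFFF
        return value


stack = Stack32()
stack.push(0x11)
stack.push(0x22)
print(hex(stack.esp))    # 0xff8
print(hex(stack.pop()))  # 0x22
print(hex(stack.pop()))  # 0x11
print(hex(stack.esp))    # 0x1000
```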

Arithmetic and logic

The arithmetic and logic instructions modify the processor flags. The following instructions perform arithmetic and logic operations:

add, sub: Performs integer addition or subtraction. When subtracting, the second operand is subtracted from the first. Both operands can be registers, or one operand can be a memory location and the other a register. The second operand can also be a constant.
adc, sbb: Performs integer addition or subtraction using the CF flag as a carry input (for addition) or as a borrow input (for subtraction).
cmp: Subtracts the second operand from the first and discards the result while updating the OF, SF, ZF, AF, PF, and CF flags based on the result.
neg: Negates the operand.
inc, dec: Increments or decrements the operand by one.
mul: Performs unsigned integer multiplication. The size of the product depends on the size of the operand. A byte operand is multiplied by AL and the result is placed in AX. A word operand is multiplied by AX and the result is placed in DX:AX, with the upper 16 bits in DX. A doubleword is multiplied by EAX and the result is placed in EDX:EAX.
imul: Performs signed integer multiplication. The first operand must be a register and receives the result of the operation. There may be a total of two or three operands. In the two-operand form, the first operand multiplies the second operand, and the result is stored in the first operand (a register). In the three-operand form, the second operand multiplies the third operand, and the result is stored in the first operand register. In the three-operand form, the third operand must be an immediate value.
div, idiv: Performs unsigned (div) or signed (idiv) division. The size of the result depends on the size of the operand. A byte operand is divided into AX, the quotient is placed in AL, and the remainder is placed in AH. A word operand is divided into DX:AX, the quotient is placed in AX, and the remainder is placed in DX. A doubleword is divided into EDX:EAX, the quotient is placed in EAX, and the remainder is placed in EDX.
and, or, xor: Performs the corresponding logical operation on the two operands and stores the result in the destination operand location.
not: Performs a logical NOT (bit inversion) operation on a single operand.
sal, shl, sar, shr: Performs a logical (shl and shr) or arithmetic (sal and sar) shift of the byte, word, or doubleword argument left or right by 1 to 31 bit positions. sal and shl place the last bit shifted out into the carry flag and insert zeros into the vacated least significant bits. shr places the last bit shifted out into the carry flag and inserts zeros into the vacated most significant bits. sar differs from shr by propagating the sign bit into the vacated most significant bits.
rol, rcl, ror, rcr: Performs a left or right rotation by 0 to 31 bits, optionally through the carry flag. rcl and rcr rotate through the carry flag, while rol and ror do not.
bts, btr, btc: Reads a specified bit number (provided as the second operand) within the bits of the first operand into the carry flag, then either sets (bts), resets (btr), or complements (btc) that bit. These instructions may be preceded by the lock keyword to make the operation atomic.
test: Performs a logical AND operation of two operands and updates the SF, ZF, and PF flags based on the result.
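The use of the EDX:EAX register pair by the doubleword forms of mul and div can be illustrated with the Python sketch below. The function names `mul32` and `div32` are our own. The model raises Python exceptions where the processor would raise a divide error, which occurs when the divisor is zero or the quotient does not fit in EAX.

```python
MASK32 = 0xFFFFFFFF


def mul32(eax: int, operand: int) -> tuple:
    """Model the doubleword mul: return (EDX, EAX) holding the 64-bit product."""
    product = (eax & MASK32) * (operand & MASK32)
    return product >> 32, product & MASK32


def div32(edx: int, eax: int, operand: int) -> tuple:
    """Model the doubleword div: return (EAX, EDX) as (quotient, remainder)."""
    if operand == 0:
        raise ZeroDivisionError("divide error: divisor is zero")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, operand & MASK32)
    if quotient > MASK32:
        raise OverflowError("divide error: quotient does not fit in EAX")
    return quotient, remainder


edx, eax = mul32(0xFFFFFFFF, 2)
print(hex(edx), hex(eax))       # 0x1 0xfffffffe
print(div32(edx, eax, 2))       # (4294967295, 0)
print(div32(0, 7, 2))           # (3, 1)
```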

Conversions

Conversion instructions sign-extend a smaller data size to a larger size. These instructions are:

cbw: Converts a byte (AL register) into a word (AX).
cwd: Converts a word (AX register) into a doubleword (DX:AX).
cwde: Converts a word (AX register) into a doubleword (EAX).
cdq: Converts a doubleword (EAX register) into a quadword (EDX:EAX).
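These instructions widen a signed value by copying its sign bit into all of the added upper bits. The Python sketch below shows the idea with a general helper, `sign_extend`, and a model of cdq built on top of it. Both function names are our own.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Widen a two's complement value by replicating its sign bit."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        # Negative value: set every bit above the original width
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def cdq(eax: int) -> tuple:
    """Model cdq: sign-extend EAX into EDX:EAX, returned as (EDX, EAX)."""
    extended = sign_extend(eax, 32, 64)
    return extended >> 32, extended & 0xFFFFFFFF


print(hex(sign_extend(0x80, 8, 16)))  # 0xff80, as cbw would leave in AX
print(hex(sign_extend(0x7F, 8, 16)))  # 0x7f
print([hex(r) for r in cdq(0xFFFFFFFE)])  # ['0xffffffff', '0xfffffffe']
```

Compilers commonly emit cdq immediately before idiv to prepare the 64-bit dividend in EDX:EAX from a 32-bit signed value in EAX.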

Control flow

Control flow instructions conditionally or unconditionally transfer execution to an address. The control flow instructions are:

jmp: Transfers control to the instruction at the address provided as the operand.
jcc: Transfers control to the instruction at the address provided as the operand if the condition cc is true. The condition codes were described previously in the cmovcc instruction description. The condition is determined from one or more of the following processor flags: CF, ZF, SF, OF, and PF.
call: Pushes the current value of EIP onto the stack and transfers control to the instruction at the address provided as the operand.
ret: Pops the top-of-stack value and stores it in EIP. If an operand is provided, it pops the given number of bytes from the stack to clear parameters.
loop: Decrements the loop counter in ECX and, if not zero, transfers control to the instruction at the address provided as the operand.

String manipulation

String manipulation instructions may be prefixed by the rep keyword to repeat the operation the number of times given by the ECX register, incrementing or decrementing the source and destination location on each iteration, depending on the state of the DF flag. The operand size processed on each iteration can be a byte, word, or doubleword. The source address of each string element is given by the ESI register and the destination by the EDI register. These instructions are:

movs: Moves a string element
cmps: Compares elements at corresponding locations in two strings
scas: Compares a string element to the value in EAX, AX, or AL, depending on the operand size
lods: Loads a string element into EAX, AX, or AL, depending on the operand size
stos: Stores EAX, AX, or AL, depending on the operand size, to the address in EDI
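The interaction of the rep prefix, the ECX counter, the ESI and EDI pointers, and the DF flag can be sketched in Python. The function `rep_movsb` below is a hypothetical model of a repeated byte-sized string move, with memory represented as a `bytearray` and addresses as indexes into it.

```python
def rep_movsb(memory: bytearray, esi: int, edi: int, ecx: int, df: int = 0):
    """Model a repeated byte move; return the final (ESI, EDI, ECX)."""
    # DF clear: addresses increase. DF set: addresses decrease.
    step = -1 if df else 1
    while ecx != 0:
        memory[edi] = memory[esi]  # Copy one string element
        esi += step
        edi += step
        ecx -= 1
    return esi, edi, ecx


forward = bytearray(b"HELLO.....")
print(rep_movsb(forward, esi=0, edi=5, ecx=5))         # (5, 10, 0)
print(forward)                                         # bytearray(b'HELLOHELLO')

backward = bytearray(b"xABC...")
print(rep_movsb(backward, esi=3, edi=6, ecx=3, df=1))  # (0, 3, 0)
print(backward)                                        # bytearray(b'xABCABC')
```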

Flag manipulation

Flag manipulation instructions modify bits in the EFLAGS register. The flag manipulation instructions are:

stc, clc, cmc: Sets, clears, or complements the carry flag, CF
std, cld: Sets or clears the direction flag, DF
sti, cli: Sets or clears the interrupt flag, IF

Input/output

Input/output instructions read data from or write data to peripheral devices. The input/output instructions are:

in, out: Moves 1, 2, or 4 bytes between AL, AX, or EAX and an I/O port, depending on the operand size
ins, outs: Moves a data element between memory and an I/O port in the same manner as the string instructions
rep ins, rep outs: Moves blocks of data between memory and an I/O port in the same manner as the string instructions

Protected mode

The following instructions access the features of protected mode:

sysenter, sysexit: Transfers control from ring 3 to ring 0 (sysenter) or from ring 0 to ring 3 (sysexit) in Intel processors.
syscall, sysret: Transfers control from ring 3 to ring 0 (syscall) or from ring 0 to ring 3 (sysret) in AMD processors. In x86 (32-bit) mode, AMD processors also support sysenter and sysexit.

Miscellaneous instructions

These instructions do not fit into the categories previously listed:

int: Initiates a software interrupt. The operand is the interrupt vector number.
nop: No operation.
cpuid: Provides information about the processor model and its capabilities.

Other instruction categories

The instructions listed in the preceding sections are some of the more common instructions you will come across in x86 applications and device drivers. The x86 architecture contains a wide variety of additional instruction categories, including the following:

Floating-point instructions: These instructions are executed by the x87 floating-point unit.
SIMD instructions: This category includes the MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, and AVX-512 instructions. Some of the instruction sets in this category were introduced in the SIMD processing section of Chapter 8, Performance-Enhancing Techniques.
AES instructions: These instructions support encryption and decryption using the Advanced Encryption Standard (AES).
MPX instructions: The memory protection extensions (MPX) enhance memory integrity by preventing errors such as buffer overruns.
SMX instructions: The safer mode extensions (SMX) improve system security in the presence of user trust decisions.
TSX instructions: The transactional synchronization extensions (TSX) enhance the performance of multithreaded execution using shared resources.
VMX instructions: The virtual machine extensions (VMX) support the secure and efficient execution of virtualized operating systems.

Additional processor registers are provided for use by the floating-point and SIMD instructions.

There are even more categories of x86 instructions beyond those listed here, a few of which have been retired in later generations of the architecture.

Common instruction patterns

Listed below are some examples of instruction usage patterns you will come across frequently in compiled code. The techniques used in these examples produce the desired result while minimizing code size and the number of clock cycles required:

xor reg, reg ; Set reg to zero
test reg, reg ; Test if reg contains zero
add reg, reg ; Shift reg left by one bit

x86 instruction formats

Individual x86 instructions are of variable length and can range in size from 1 to 15 bytes. The components of a single instruction, including any optional bytes, are laid out in memory in the following sequence:

Prefix bytes: One or more optional prefix bytes provide auxiliary opcode execution information. For example, the lock prefix performs bus locking in a multiprocessor system to enable atomic test-and-set type operations. rep and related prefixes enable string instructions to perform repeated operations on string elements in a single instruction. Other prefixes are available to provide hints for conditional branch instructions or to override the default size of an address or operand.
Opcode bytes: An x86 opcode, consisting of 1 to 3 bytes, follows any prefix bytes. For some opcodes, an additional 3 opcode bits are encoded in a ModR/M byte following the opcode.
ModR/M byte: Not all instructions require this byte. The ModR/M byte contains three information fields providing an address mode and operand register information. The upper two bits of this byte (the Mod field) and the lower three bits (the R/M field) combine to form a 5-bit field with 32 possible values. Of these, 8 values identify register operands, and the other 24 values specify addressing modes. The remaining 3 bits (the reg/opcode field) either indicate a register or provide three additional opcode bits, depending on the instruction.
Address displacement bytes: 0, 1, 2, or 4 bytes provide an address displacement used in computing the operand address.
Immediate value bytes: If the instruction includes an immediate value, it is in the last 1, 2, or 4 bytes of the instruction.
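As an illustration of the ModR/M layout, the Python sketch below splits a ModR/M byte into its three fields. The function name `decode_modrm` and the `REG32` lookup list are our own. The list uses the standard 32-bit register encoding order, which matches the order in which the registers were listed earlier in this chapter.

```python
# 32-bit register names indexed by their 3-bit encoding
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_modrm(byte: int) -> tuple:
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11   # Upper 2 bits
    reg = (byte >> 3) & 0b111  # Middle 3 bits: register or extra opcode bits
    rm = byte & 0b111          # Lower 3 bits
    return mod, reg, rm


# One encoding of "mov eax, ecx" is the two bytes 89 C8: opcode 89
# followed by ModR/M byte C8. A mod value of 3 means r/m is a register.
mod, reg, rm = decode_modrm(0xC8)
print(mod, REG32[reg], REG32[rm])  # 3 ecx eax
```

For opcode 89, the reg field names the source register and the r/m field names the destination, so the byte C8 selects ECX as the source and EAX as the destination.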

The variable-length nature of x86 instructions makes the process of instruction decoding quite complex. It is also challenging for debugging tools to disassemble a sequence of instructions in reverse order, perhaps to display the code leading up to a breakpoint.

This difficulty arises because it is possible for a trailing subset of bytes within a lengthy instruction to form a complete, valid instruction. This complexity is a notable difference from the more regular instruction formats used in RISC architectures.

x86 assembly language

It is possible to develop programs of any level of complexity in assembly language.

Most modern applications, however, are largely or entirely developed in high-level languages. Assembly language tends to be used in cases where the use of specialized instructions is required, or a level of extreme optimization is necessary that is unachievable with an optimizing compiler.

Regardless of the language used in application development, all code must ultimately execute as processor instructions. To fully understand how code executes on a computer system, there is no substitute for examining the state of the system following the execution of each individual instruction. A good way to learn to understand and operate in this environment is to write some assembly code.

The x86 assembly language example in the following listing is a complete x86 application that runs in a Windows command console, printing a text string and then exiting:

.386
.model FLAT,C
.stack 400h
.code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
main proc
    ; Print the message
    push    offset message
    call    printf
    
    ; Exit the program with status 0
    push    0
    call    exit
main endp
.data
message db "Hello, Computer Architect!",0
end

A description of the contents of this assembly language file follows:

The .386 directive indicates the instructions in this file should be interpreted as applying to 80386 and later-generation processors.
The .model FLAT,C directive specifies a 32-bit flat memory model and the use of C language function calling conventions.
The .stack 400h directive specifies a stack size of 400h (1,024) bytes.
The .code directive indicates the start of executable code.
The includelib and extern directives reference system-provided libraries and the functions within them to be used by the program.
The public directive indicates that the function name, main, is an externally visible symbol.
The lines between main proc and main endp are the assembly language instructions making up the main function.
The .data directive indicates the start of data memory. The message db statement defines the message string as a sequence of bytes, followed by a zero byte.
The end directive marks the end of the program.

This file, named hello_x86.asm, can be assembled and linked to form the executable hello_x86.exe program with the following command, which runs the Microsoft Macro Assembler:

ml /Fl /Zi /Zd hello_x86.asm

The components of this command are:

ml runs the assembler (ml.exe)
/Fl creates a listing file
/Zi includes symbolic debugging information in the executable file
/Zd includes line number debugging information in the executable file
hello_x86.asm is the name of the assembly language source file

This is a portion of the hello_x86.lst listing file generated by the assembler:

                         .386
                         .model FLAT,C
                         .stack 400h
 00000000                .code
                         includelib libcmt.lib
                         includelib legacy_stdio_definitions.lib
                         extern printf:near
                         extern exit:near
                         public main
 00000000                main proc
                             ; Print the message
 00000000  68 00000000 R     push    offset message
 00000005  E8 00000000 E     call    printf
                             
                             ; Exit the program with status 0
 0000000A  6A 00             push    0
 0000000C  E8 00000000 E     call    exit
 00000011                main endp
 00000000                .data
 00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0
          2C 20 43 6F 6D
          70 75 74 65 72
          20 41 72 63 68
          69 74 65 63 74
          21 00

The left column of this listing displays address offsets from the start of the current segment (.code or .data). On lines containing instructions, the encoded instruction bytes follow the address offset. Address references in the code (for example, offset message) are displayed as 00000000 in the listing because these values are determined during linking, not during assembly, which is when this listing is generated.
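
The relationship between the offsets and the instruction lengths in the listing above can be checked with a short Python sketch. The offsets and byte counts are copied from the listing; the names instructions and next_offsets are hypothetical and exist only for this illustration:

```python
# (offset, encoded length in bytes) for each instruction in the listing
instructions = [
    (0x00, 5),  # push offset message: 68 plus a 4-byte address
    (0x05, 5),  # call printf:         E8 plus a 4-byte displacement
    (0x0A, 2),  # push 0:              6A 00
    (0x0C, 5),  # call exit:           E8 plus a 4-byte displacement
]


def next_offsets(entries):
    """Return the offset that follows each instruction."""
    return [offset + length for offset, length in entries]


# Each result matches the offset on the next line of the listing,
# ending with 11h at main endp
print([hex(value) for value in next_offsets(instructions)])
```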

This is the output displayed when running this program:

C:\>hello_x86.exe
Hello, Computer Architect!

Next, we will look at the extension of the 32-bit x86 architecture to the 64-bit x64 architecture.

x64 architecture and instruction set

The original specification for a processor architecture extending the x86 processor and instruction set to 64 bits, named AMD64, was introduced by AMD in 2000. The first AMD64 processor, the Opteron, was released in 2003. Intel found itself following AMD’s lead and developed an AMD64-compatible architecture, eventually given the name Intel 64. The first Intel processor that implemented the 64-bit architecture was the Xeon, introduced in 2004. The name of the architecture shared by AMD and Intel came to be called x86-64, reflecting the evolution of x86 to 64 bits, and, in popular usage, this term has been shortened to x64.

The first Linux version supporting the x64 architecture was released in 2001, well before the first x64 processors were even available. Windows began supporting the x64 architecture in 2005.

Processors implementing the AMD64 and Intel 64 architectures are largely compatible at the instruction set level of user-mode programs. There are a few differences between the architectures, the most significant of which is the difference in support of the sysenter/sysexit Intel instructions and the syscall/sysret AMD instructions we saw earlier.

In general, operating systems and programming language compilers manage these differences, making them rarely an issue of concern to software and system developers. Developers of kernel software, drivers, and assembly code must take these differences into account.

The principal features of the x64 architecture are:

x64 is a mostly compatible 64-bit extension of the 32-bit x86 architecture. Most software written for the 32-bit environment, particularly user-mode applications, should execute without modification on a processor running in 64-bit mode. 64-bit mode is also referred to as long mode.
The eight 32-bit general-purpose registers of x86 are extended to 64 bits in x64. The register name prefix R indicates 64-bit registers. For example, in x64, the extended x86 EAX register is called RAX. The x86 register subcomponents EAX, AX, AH, and AL continue to be available in x64.
The instruction pointer, RIP, is now 64 bits. The flags register, RFLAGS, also extends to 64 bits, though the upper 32 bits are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS in the x86 architecture.
Eight 64-bit general-purpose registers have been added, named R8 through R15.
64-bit integers are supported as a native data type.
x64 processors retain the option of running in x86 compatibility mode. This mode enables the use of 32-bit operating systems and allows any application built for x86 to run on x64 processors. In 32-bit compatibility mode, the 64-bit extensions are unavailable.

Virtual addresses in the x64 architecture are 64 bits wide, supporting an address space of 16 exabytes (EB), equivalent to 2⁶⁴ bytes. Current processors from AMD and Intel, however, support only 48 bits of virtual address space. This restriction reduces processor hardware complexity while still supporting up to 256 terabytes (TB) of virtual address space. Current-generation processors also support a maximum of 48 bits of physical address space. This permits a processor to address 256 TB of physical RAM, though modern motherboards do not support the number of DRAM devices such a system would require.
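
The address space sizes quoted above follow directly from the number of address bits. The Python sketch below reproduces the arithmetic, assuming the terabyte and exabyte figures are binary multiples (2⁴⁰ and 2⁶⁰ bytes). The helper name address_space_bytes is hypothetical:

```python
TIB = 1 << 40  # bytes in one terabyte, taken as a binary multiple
EIB = 1 << 60  # bytes in one exabyte, taken as a binary multiple


def address_space_bytes(address_bits):
    """Number of distinct byte addresses for the given address width."""
    return 1 << address_bits


print(address_space_bytes(48) // TIB)  # 256
print(address_space_bytes(64) // EIB)  # 16
```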

The x64 register set

In the x64 architecture, the extension of x86 register lengths to 64 bits and the addition of registers R8 through R15 results in the register map shown in Figure 10.3:

Figure 10.3: x64 registers

In Figure 10.3, the x86 registers described in the preceding section (and present in x64) are shaded. The x86 registers have the same names and are the same sizes when operating in 64-bit mode.

The 64-bit extended versions of the x86 registers have names starting with the letter R. The new 64-bit registers (R8 through R15) can be accessed in smaller widths using the appropriate suffix letter:

Suffix D accesses the lower 32 bits of the register: R11D
Suffix W accesses the lower 16 bits of the register: R11W
Suffix B accesses the lower 8 bits of the register: R11B
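
The effect of reading a register through one of these suffixes can be modeled as a simple bit mask. This Python sketch is only an illustration; the sub_register function and the sample register value are assumptions made for the example:

```python
WIDTHS = {"D": 32, "W": 16, "B": 8}  # suffix letter -> bits accessed


def sub_register(value64, suffix):
    """Return the low-order bits of a 64-bit value selected by the suffix."""
    return value64 & ((1 << WIDTHS[suffix]) - 1)


r11 = 0x1122334455667788             # sample contents of R11
print(hex(sub_register(r11, "D")))   # 0x55667788 (R11D)
print(hex(sub_register(r11, "W")))   # 0x7788     (R11W)
print(hex(sub_register(r11, "B")))   # 0x88       (R11B)
```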

Unlike the x86 registers, the new registers in the x64 architecture are truly general purpose and do not perform any special functions at the processor instruction level.

x64 instruction categories and formats

The x64 architecture implements essentially the same instruction set as x86, with 64-bit extensions. When operating in 64-bit mode, the x64 architecture uses a default address size of 64 bits and a default operand size of 32 bits. A new opcode prefix byte, rex, specifies the use of 64-bit operands.

The format of x64 instructions in memory matches that of the x86 architecture, with some exceptions that, for our purposes, are minor. The addition of support for the rex prefix byte is the most significant variation from the x86 instruction format. Address displacements and immediate values within some instructions can be 64 bits wide, in addition to all the bit widths supported in x86.
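
To make the role of the rex prefix more concrete, the following Python sketch decodes a rex byte. A rex prefix has the binary form 0100WRXB: the W bit selects a 64-bit operand size, while R, X, and B supply the extra register-number bit needed to reach R8 through R15. The function name decode_rex is hypothetical. The sample byte 48h is the prefix that begins several instructions in the x64 listing later in this chapter:

```python
def decode_rex(byte):
    """Return the W, R, X, and B bits of a rex prefix, or None if not rex."""
    if (byte & 0xF0) != 0x40:
        return None  # upper four bits of a rex prefix are always 0100
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # extends the ModR/M reg field
        "X": (byte >> 1) & 1,  # extends the index register field
        "B": byte & 1,         # extends the ModR/M R/M or base field
    }


print(decode_rex(0x48))  # {'W': 1, 'R': 0, 'X': 0, 'B': 0}
```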

Although it is possible to define instructions that are longer than 15 bytes, the processor instruction decoder will raise a general protection fault if an attempt is made to decode an instruction longer than 15 bytes.

x64 assembly language

The x64 assembly language source file for the hello program is like the x86 version of this code, with some notable differences:

There is no directive specifying a memory model because there is only one x64 memory model.
The Windows x64 application programming interface (API) uses a calling convention that stores the first four arguments to a called function in the RCX, RDX, R8, and R9 registers, in that order. This differs from the default x86 calling convention, which pushes parameters onto the stack. Both library functions called by this program (printf and exit) take a single argument, passed in RCX.
The calling convention requires the caller of a function to allocate stack space to hold at least the number of arguments passed to the called function, with a minimum reservation of space for four arguments, even if fewer are being passed. Because the stack grows downward in memory, this requires a subtraction from the stack pointer. The sub rsp, 40 instruction performs this stack allocation: 32 bytes hold four 8-byte arguments, and the additional 8 bytes keep the stack pointer aligned on a 16-byte boundary when each call instruction executes. Normally, after the called function returns, it would be necessary to adjust the stack pointer to remove this allocation. Our program calls the exit function, terminating program execution, which makes this step unnecessary.
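
The size of the stack reservation can be worked out with a little arithmetic. The Python sketch below is an illustration, not part of the hello program. It assumes the function was entered through a call instruction from code that kept the stack aligned on a 16-byte boundary, so the 8-byte return address leaves the stack pointer 8 bytes off that boundary on entry, and it assumes the function pushes nothing else. The function name stack_reservation is hypothetical:

```python
def stack_reservation(max_args):
    """Bytes to subtract from rsp for calls passing up to max_args arguments."""
    slots = max(4, max_args)  # space for at least four arguments
    size = slots * 8          # each argument slot is 8 bytes
    # Assumption: rsp is 8 bytes past a 16-byte boundary on entry, so the
    # reservation plus those 8 bytes must be a multiple of 16
    if (size + 8) % 16 != 0:
        size += 8
    return size


print(stack_reservation(1))  # 40, as used by sub rsp, 40
```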

The code for the 64-bit version of the hello program is as follows:

.code
includelib libcmt.lib
includelib legacy_stdio_definitions.lib
extern printf:near
extern exit:near
public main
main proc
    ; Reserve stack space
    sub     rsp, 40
    
    ; Print the message
    lea     rcx, message
    call    printf
    
    ; Exit the program with status 0
    xor     rcx, rcx
    call    exit
main endp
.data
message db "Hello, Computer Architect!",0
end

This file, named hello_x64.asm, is assembled and linked to form the executable hello_x64.exe program with the following call to the Microsoft Macro Assembler (x64 version):

ml64 /Fl /Zi /Zd hello_x64.asm

The components of this command are:

ml64 runs the 64-bit assembler
/Fl creates a listing file
/Zi includes symbolic debugging information in the executable file
/Zd includes line number debugging information in the executable file
hello_x64.asm is the name of the assembly language source file

This is a portion of the hello_x64.lst listing file generated by the assembler command:

 00000000                .code
                         includelib libcmt.lib
                         includelib legacy_stdio_definitions.lib
                         extern printf:near
                         extern exit:near
                         public main
 00000000                main proc
                             ; Reserve stack space
 00000000  48/ 83 EC 28            sub     rsp, 40
                             
                             ; Print the message
 00000004  48/ 8D 0D                lea     rcx, message
           00000000 R
 0000000B  E8 00000000 E     call    printf
                             
                             ; Exit the program with status 0
 00000010  48/ 33 C9                xor     rcx, rcx
 00000013  E8 00000000 E     call    exit
 00000018                main endp
 00000000                .data
 00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0
          2C 20 43 6F 6D
          70 75 74 65 72
          20 41 72 63 68
          69 74 65 63 74
          21 00

The output of running this program is as follows:

C:\>hello_x64.exe
Hello, Computer Architect!

This completes our brief introduction to the x86 and x64 architectures. There is a great deal more to be learned, and indeed the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volumes 1 through 4, contains nearly 5,000 pages of detailed documentation on these architectures. We have just scratched the surface in this chapter.

Next, we will take a similar top-level tour of the ARM 32-bit and 64-bit architectures.

32-bit ARM architecture and instruction set

The ARM architectures define a family of RISC processors suitable for use in a wide variety of applications. Processors based on ARM architectures are preferred in designs where a combination of high performance, low power consumption, and small physical size is needed.

ARM Holdings, a British semiconductor and software company, developed the ARM architectures and licenses them to other companies who implement processors in silicon. Many applications of the ARM architectures are system-on-chip (SoC) designs combining a processor with specialized hardware to support functions such as cellular radio communications in smartphones.

ARM processors are employed in a broad spectrum of applications, from tiny battery-powered devices to supercomputers. ARM processors serve as embedded processors in safety-critical systems such as automotive anti-lock brakes and as general-purpose processors in smartwatches, portable phones, tablets, laptop computers, desktop computers, and servers. As of 2021, over 180 billion ARM processors have been manufactured.

ARM processors are true RISC systems with a large set of general-purpose registers and single-cycle execution of most instructions. Standard ARM instructions have a fixed width of 32 bits, though a separate variable-length instruction set named T32 (formerly called Thumb) is available for applications where memory is at a premium. The T32 instruction set uses a mixture of 16- and 32-bit instructions.

Current-generation ARM processors support both the ARM and T32 instruction sets and can switch between the two sets on the fly. Most operating systems and applications prefer to use the T32 instruction set over the ARM set because code density is improved.

ARM is a load/store architecture, requiring data to be loaded from memory to a register before any processing such as an ALU operation can take place with it. A subsequent instruction stores the result back to memory. While this might seem like a step back from the x86 and x64 architectures, which operate directly on operands in memory in a single instruction, in practice, the load/store approach permits several sequential operations to be performed at high speed on an operand once it has been loaded into one of the many processor registers.

ARM processors are bi-endian. A configuration setting is available to select the little-endian or big-endian byte order for multi-byte values. The default setting is little-endian, which is the configuration commonly used by operating systems.
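
The difference between the two byte orders can be seen by laying out the same 32-bit word both ways. This Python sketch uses the standard struct module; the sample value is arbitrary:

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)  # least significant byte at lowest address
big = struct.pack(">I", value)     # most significant byte at lowest address

print(little.hex())  # 78563412
print(big.hex())     # 12345678
```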

The ARM architecture natively supports these data types:

Byte: 8 bits
Halfword: 16 bits
Word: 32 bits
Doubleword: 64 bits

WHAT’S IN A WORD?

There is a potentially confusing difference between the data type names of the ARM architecture and those of the x86 and x64 architectures: in x86 and x64, a word is 16 bits and a doubleword is 32 bits. In ARM, a word is 32 bits and a doubleword is 64 bits.

ARM processors support eight distinct execution privilege levels. These levels, and their abbreviations, are as follows:

User (USR)
Supervisor (SVC)
Fast interrupt request (FIQ)
Interrupt request (IRQ)
Monitor (MON)
Abort (ABT)
Undefined (UND)
System (SYS)

For the purposes of operating systems and user applications, the most important privilege levels are USR and SVC. The two interrupt request modes, FIQ and IRQ, are used by device drivers for processing interrupts.

In most operating systems running on ARM, including Windows and Linux, the kernel mode runs in ARM SVC mode, equivalent to ring 0 on x86/x64. ARM USR mode is equivalent to ring 3 on x86/x64. Applications running under Linux on ARM processors use software interrupts to request kernel services, which involves a transition from USR mode to SVC mode.

The ARM architecture provides system capabilities beyond those of the main processor via the concept of coprocessors. Each coprocessor implements a specialized category of functionality in support of the main processor. Up to 16 coprocessors can be implemented in a system, with predefined functions assigned to four of them.

Coprocessor 15 implements the MMU and other system functions. If present, coprocessor 15 must support the instruction opcodes, register set, and behaviors specified for the MMU. Coprocessors 10 and 11 combine to provide floating-point functionality in processors equipped with that feature. Coprocessor 14 provides debugging functions.

The ARM architectures have evolved through several versions over the years. The architectural variant currently in wide use is ARMv8-A. ARMv8-A supports 32-bit and 64-bit operating systems and applications. 32-bit applications can run under a 64-bit ARMv8-A operating system.

Virtually all high-end smartphones and portable electronic devices produced since 2016 are designed around processors or SoCs based on the ARMv8-A architecture. The description that follows will focus on ARMv8-A 32-bit mode. We will look at the differences in ARMv8-A 64-bit mode in a later section in this chapter.

The ARM register set

In USR mode, the ARM architecture has 16 general-purpose 32-bit registers named R0 through R15. The first 13 registers are truly general-purpose, while the last three have the following defined functions:

R13 is the stack pointer, also named SP in assembly code. This register points to the top of the stack.
R14 is the link register, also named LR. This register holds the return address while in a called function. The use of a link register differs from x86/x64, which pushes the return address onto the stack.
A register holds the return address because it is significantly faster to resume execution at the address in LR at the end of a function than it is to pop the return address from the stack and resume execution at that address.
R15 is the program counter, also named PC. Due to pipelining, the value contained in PC is usually two instructions ahead of the currently executing instruction. Unlike x86/x64, it is possible for user code to directly read and write the PC register. Writing an address to PC causes execution to immediately jump to the newly written address.

The current program status register (CPSR) contains status and mode control bits, similar to EFLAGS/RFLAGS in the x86/x64 architectures.

Bit	Name	Function
0-4	M	Mode: The current execution privilege level (USR, SVC, and so on).
5	T	Thumb: Set if the T32 (Thumb) instruction set is active. If clear, the ARM instruction set is active. User code can set and clear this bit.
9	E	Endianness: Setting this bit enables big-endian mode. If clear, little-endian mode is active. Most code uses little-endian mode.
27	Q	Cumulative saturation flag: Set if, at some point in a series of operations, an overflow or saturation occurred.
28	V	Overflow flag: Set if the operation resulted in a signed overflow.
29	C	Carry flag: Indicates whether addition produced a carry, or subtraction produced a borrow.
30	Z	Zero flag: Set if the result of an operation is zero.
31	N	Negative flag: Set if the result of an operation is negative.

Table 10.3: Selected CPSR bits

CPSR bits not listed in Table 10.3 are either reserved or represent functions not discussed in this chapter.

By default, most instructions do not affect the flags. The S suffix must be used with, for example, an addition instruction (adds) to cause the result to affect the flags. Comparison instructions are the exception to this rule; they update the flags automatically.
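
To show how a flag-setting addition such as adds could derive the N, Z, C, and V flags, here is a minimal Python model of a 32-bit add. It is a sketch for illustration only; the function name flags_after_add is hypothetical, and subtraction is left out:

```python
MASK32 = 0xFFFFFFFF


def flags_after_add(a, b):
    """Return (N, Z, C, V) as set by a 32-bit adds of a and b."""
    a &= MASK32
    b &= MASK32
    full = a + b                 # unbounded sum, used to detect carry-out
    result = full & MASK32       # the 32-bit value written to the register
    n = result >> 31             # N: bit 31 of the result
    z = int(result == 0)         # Z: result is zero
    c = int(full > MASK32)       # C: carry out of bit 31
    # V: both operands have the same sign, and the result's sign differs
    v = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    return n, z, c, v


print(flags_after_add(0xFFFFFFFF, 1))  # (0, 1, 1, 0)
print(flags_after_add(0x7FFFFFFF, 1))  # (1, 0, 0, 1)
```

The first call wraps around to zero, setting Z and C. The second overflows the signed range, setting N and V.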

ARM addressing modes

In true RISC fashion, the only ARM instructions that can access system memory are those that perform register loads and stores.

The ldr instruction loads a register from memory, while str stores a register to memory. A separate instruction, mov, transfers the contents of one register to another or moves an immediate value into a register.

When computing the target address for a load or store operation, ARM starts with a base address provided in a register and adds an increment to arrive at the target memory address. There are three methods for determining the increment that will be added to the base register in register load and store instructions:

Offset: A signed constant is added to the base register. The offset is stored as part of the instruction. For example, ldr r0, [r1, #10] loads r0 with the word at the address r1+10. As shown in the following addressing mode examples, pre- or post-indexing can optionally update the base register to the target address before or after the memory location is accessed.
Register: An unsigned increment stored in a register can be added to or subtracted from the value in a base register. For example, ldr r0, [r1, r2] loads r0 with the word at the address r1+r2. Either of the registers can be thought of as the base register.
Scaled register: An increment in a register is shifted left or right by a specified number of bit positions before being added to or subtracted from the base register value. For example, ldr r0, [r1, r2, lsl #3] loads r0 with the word at the address r1+(r2×8). The shift can be a logical left or right shift, lsl or lsr, inserting zero bits in the vacated bit positions, or an arithmetic right shift, asr, that replicates the sign bit in the vacated positions.
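
The three ways of forming the increment can be summarized in a small Python model of the address calculation. This is a sketch under simplifying assumptions: addresses wrap at 32 bits, and the function names shift and effective_address and their parameters are hypothetical:

```python
MASK32 = 0xFFFFFFFF


def shift(value, kind, amount):
    """Apply an lsl, lsr, or asr shift to a 32-bit value."""
    value &= MASK32
    if kind == "lsl":
        return (value << amount) & MASK32
    if kind == "lsr":
        return value >> amount          # zero bits enter from the left
    if kind == "asr":
        if value & 0x80000000:          # reinterpret as a negative number
            value -= 1 << 32
        return (value >> amount) & MASK32  # sign bit is replicated
    raise ValueError("unknown shift: " + kind)


def effective_address(base, offset=0, index=None, kind="lsl", amount=0,
                      subtract=False):
    """Compute base plus or minus a constant, register, or scaled register."""
    increment = offset if index is None else shift(index, kind, amount)
    if subtract:
        increment = -increment
    return (base + increment) & MASK32


r1, r2 = 0x1000, 4
print(hex(effective_address(r1, offset=10)))                      # 0x100a
print(hex(effective_address(r1, index=r2)))                       # 0x1004
print(hex(effective_address(r1, index=r2, kind="lsl", amount=3)))  # 0x1020
```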

The addressing modes available for specifying source and destination operands in ARM instructions are presented in the following sections.

Immediate

An immediate value is provided as part of the instruction. The possible immediate values consist of an 8-bit value, coded in the instruction, rotated through an even number of bit positions. A full 32-bit value cannot be specified because the instruction itself is, at most, 32 bits wide. To load an arbitrary 32-bit value into a register, the ldr instruction must be used instead to load the value from memory:

mov r0, #10 // Load the 32-bit value 10 decimal into r0
mov r0, #0xFF000000 // Load the 32-bit value FF000000h into r0

The second example contains the 8-bit value FFh in the instruction opcode. During execution, it is rotated left by 24 bit positions into the most significant 8 bits of the word.
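
One way to check whether a constant fits this immediate format is to try every even rotation and see whether the value reduces to 8 bits. The Python sketch below does this for the 32-bit ARM instruction encoding, in which the instruction stores the 8-bit value together with a right-rotation amount. The function names are hypothetical, and the sketch does not cover the different immediate scheme used by T32:

```python
def rotate_left(value, count):
    """Rotate a 32-bit value left by count bit positions."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF


def encode_immediate(value):
    """Return (imm8, right_rotation) if value is encodable, else None."""
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):         # only even rotations exist
        imm8 = rotate_left(value, rotation)  # undo a right rotation
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


print(encode_immediate(10))          # (10, 0)
print(encode_immediate(0xFF000000))  # (255, 8)
print(encode_immediate(0x12345678))  # None
```

A right rotation of 8 bit positions is the same as a left rotation of 24, which matches the description of the second mov example. A value that returns None must be loaded from memory with ldr.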

Register direct

This mode copies one register to another:

mov r0, r1 // Copy r1 to r0
mvn r0, r1 // Copy NOT(r1) to r0

Register indirect

The address of the operand is provided in a register. The register containing the address is surrounded by square brackets:

ldr r0, [r1] // Load the 32-bit value at the address given in r1 to r0
str r0, [r3] // Store r0 to the address in r3

Unlike most instructions, str uses the first operand as the source and the second as the destination.

Register indirect with offset

The address of the operand is computed by adding an offset to the base register:

ldr r0, [r1, #32] // Load r0 with the value at the address [r1+32]
str r0, [r1, #4] // Store r0 to the address [r1+4]

Register indirect with offset, pre-incremented

The address of the value is determined by adding an offset to the base register. The base register is updated to the computed address and this address is used to load the destination register:

ldr r0, [r1, #32]! // Load r0 with [r1+32] and update r1 to (r1+32)
str r0, [r1, #4]! // Store r0 to [r1+4] and update r1 to (r1+4)

Register indirect with offset, post-incremented

The base address is first used to access the memory location. The base register is then updated to the computed address:

ldr r0, [r1], #32 // Load [r1] to r0, then update r1 to (r1+32)
str r0, [r1], #4 // Store r0 to [r1], then update r1 to (r1+4)
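
The difference between the plain offset, pre-incremented, and post-incremented forms lies in which address is accessed and whether the base register changes. The following Python sketch models the three ldr variants with a dictionary standing in for memory. All names and memory contents are hypothetical:

```python
memory = {0x1000: 111, 0x1020: 222}  # address -> stored word


def load_offset(regs, base, offset):
    """ldr rd, [base, #offset]: access base+offset, base unchanged."""
    return memory[regs[base] + offset]


def load_pre_indexed(regs, base, offset):
    """ldr rd, [base, #offset]!: update base first, then access it."""
    regs[base] += offset
    return memory[regs[base]]


def load_post_indexed(regs, base, offset):
    """ldr rd, [base], #offset: access base first, then update it."""
    value = memory[regs[base]]
    regs[base] += offset
    return value


regs = {"r1": 0x1000}
print(load_offset(regs, "r1", 32), hex(regs["r1"]))        # 222 0x1000
print(load_pre_indexed(regs, "r1", 32), hex(regs["r1"]))   # 222 0x1020

regs = {"r1": 0x1000}
print(load_post_indexed(regs, "r1", 32), hex(regs["r1"]))  # 111 0x1020
```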

Double register indirect

The address of the operand is the sum of a base register and an increment register. The register names are surrounded by square brackets:

ldr r0, [r1, r2] // Load r0 with [r1+r2]
str r0, [r1, r2] // Store r0 to [r1+r2]

Double register indirect with scaling

The address of the operand is the sum of a base register and an increment register shifted left or right by the given number of bits. The register names and the shift information are surrounded by square brackets:

ldr r0, [r1, r2, lsl #5] // Load r0 with [r1+(r2*32)]
str r0, [r1, r2, lsr #2] // Store r0 to [r1+(r2/4)]

The next section introduces the general categories of ARM instructions.

ARM instruction categories

The instructions described in this section are from the T32 instruction set.

Load/store

These instructions move data between registers and memory:

ldr, str: Copies an 8-bit (suffix b for byte), 16-bit (suffix h for halfword), or 32-bit value between a register and a memory location. ldr copies the value from memory to a register, while str copies a register to memory. ldrb copies 1 byte into the lower 8 bits of a register.
ldm, stm: Loads or stores multiple registers. Copies 1 to 16 registers to or from memory. For example, the instruction ldm r1, {r0, r2, r4-r11} loads registers r0, r2, and r4 through r11 from contiguous memory beginning at the address provided in r1. Any subset of registers can be loaded from, or stored to, memory using these instructions.

Stack manipulation

These instructions store data to, and retrieve data from, the stack:

push, pop: Pushes or pops any subset of the registers to or from the stack, for example, push {r0, r2, r4-r11}. These instructions are variants of the ldm and stm instructions.

Register movement

These instructions transfer data between registers:

mov, mvn: Moves a register (mov), or its bit-inversion (mvn), to the destination register.

Arithmetic and logic

These instructions mostly have one destination register and two source operands. The first source operand is a register, while the second can be a register, a shifted register, or an immediate value.

Including the s suffix causes these instructions to set the condition flags. For example, adds performs addition and sets the condition flags:

add, sub: Adds or subtracts two numbers. For example, add r0, r1, r2, lsl #3 is equivalent to the expression r0 = r1 + (r2 × 2³). The lsl operator performs a logical shift left of the second operand, r2.
adc, sbc: Adds or subtracts two numbers with carry or borrow.
neg: Negates a number.
and, orr, eor: Performs logical AND, OR, or XOR operations.
orn, eon: Performs logical OR or XOR operations between the first operand and the bitwise-inverted second operand.
bic: Clears selected bits in a register.
mul: Multiplies two numbers.
mla: Multiplies two numbers and accumulates the result. This instruction has an additional operand to specify the accumulator register.
sdiv, udiv: Signed and unsigned division, respectively.

Comparisons

These instructions compare two values and set the condition flags based on the result of the comparison. The s suffix is not needed with these instructions to set the condition codes:

cmp: Subtracts two numbers, discards the result, and sets the condition flags. This is equivalent to a subs instruction, except the result is discarded.
cmn: Adds two numbers, discards the result, and sets the condition flags. This is equivalent to an adds instruction, except the result is discarded.
tst: Performs a bitwise AND, discards the result, and sets the condition flags. This is equivalent to an ands instruction, except the result is discarded.

Control flow

These instructions transfer control conditionally or unconditionally to a target address:

b: Performs an unconditional branch to the target address.
bcc: Branches based on one of these condition codes as cc: eq (equal), ne (not equal), gt (greater than), lt (less than), ge (greater or equal), le (less or equal), cs (carry set), cc (carry clear), mi (minus: N flag = 1), pl (plus: N flag = 0), vs (V flag set), vc (V flag clear), hi (higher: C flag set and Z flag clear), or ls (lower or same: C flag clear or Z flag set).
bl: Branches to the specified address and stores the address of the next instruction in the link register (r14, also called lr). The called function returns to the calling code with the mov pc, lr instruction.
bx: Branches and selects the instruction set. If bit 0 of the target address is 1, T32 mode is entered. If bit 0 is clear, ARM mode is entered. Bit 0 of instruction addresses must always be zero due to ARM’s address alignment requirements. This frees bit 0 to select the instruction set.
blx: Branches with a link and selects the instruction set. This instruction combines the functions of the bl and bx instructions.
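
The condition codes used by conditional branches are each a test of the N, Z, C, and V flags. The Python sketch below lists those tests as they are defined by the ARM architecture. The names CONDITIONS and condition_passes are hypothetical:

```python
# Each entry maps a condition code to a test of the (n, z, c, v) flags
CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "cs": lambda n, z, c, v: c == 1,
    "cc": lambda n, z, c, v: c == 0,
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,  # unsigned higher
    "ls": lambda n, z, c, v: c == 0 or z == 1,   # unsigned lower or same
    "ge": lambda n, z, c, v: n == v,             # signed greater or equal
    "lt": lambda n, z, c, v: n != v,             # signed less than
    "gt": lambda n, z, c, v: z == 0 and n == v,  # signed greater than
    "le": lambda n, z, c, v: z == 1 or n != v,   # signed less or equal
}


def condition_passes(code, n, z, c, v):
    """Return True if a branch with this condition code would be taken."""
    return CONDITIONS[code](n, z, c, v)


# Flags left by comparing two equal values: Z and C set, N and V clear
print(condition_passes("eq", 0, 1, 1, 0))  # True
print(condition_passes("hi", 0, 1, 1, 0))  # False
print(condition_passes("ls", 0, 1, 1, 0))  # True
```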

Supervisor mode

This instruction allows user-mode code to initiate a call to supervisor mode:

svc (supervisor call): Initiates a software interrupt that causes the supervisor mode exception handler to process a system service request.

Breakpoint

This instruction is used by debuggers during software development:

bkpt (trigger a breakpoint): This instruction takes a 16-bit operand for use by debugging software to identify the breakpoint.

Conditional execution

Many ARM instructions support conditional execution, which uses the same condition codes as the branch instructions to determine whether individual instructions are executed. If an instruction’s condition evaluates false, the instruction is processed as a no-op. The condition code is appended to the instruction mnemonic. This conditional execution mechanism is formally known as predication.

For example, this function converts a nibble (the lower 4 bits of a byte) into an ASCII character version of the nibble:

// Convert the low 4 bits of r0 to an ASCII character in r0
nibble2ascii:
and r0, #0xF
cmp r0, #10
addpl r0, r0, #('A' - 10)
addmi r0, r0, #'0'
mov pc, lr

The cmp instruction subtracts 10 from the nibble in r0 and sets the N flag if r0 is less than 10. If r0 is greater than or equal to 10, the N flag is clear.

If N is clear, the addpl instruction executes (pl means “plus,” as in “not negative”), and the addmi instruction does not execute. If N is set, the addpl instruction does not execute and the addmi instruction executes. After this sequence completes, r0 contains a character in the range 0-9 or A-F.

The use of conditional instruction execution helps keep the instruction pipeline flowing efficiently by avoiding branches.
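
For readers who want to trace the logic of nibble2ascii without an ARM system at hand, this Python sketch mirrors the assembly routine one instruction at a time. It models only the N flag, which is all the routine depends on, and it returns the character rather than leaving its code in r0:

```python
def nibble2ascii(r0):
    """Model of the nibble2ascii routine using a simulated N flag."""
    r0 &= 0xF                   # and r0, #0xF
    n_flag = (r0 - 10) < 0      # cmp r0, #10: N is set when r0 < 10
    if not n_flag:              # addpl executes only when N is clear
        r0 += ord("A") - 10
    if n_flag:                  # addmi executes only when N is set
        r0 += ord("0")
    return chr(r0)              # mov pc, lr


print("".join(nibble2ascii(value) for value in range(16)))
# 0123456789ABCDEF
```

Both conditional additions are always reached, just as both conditional instructions flow through the pipeline, but only one of them changes r0.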

Other instruction categories

ARM processors optionally support a range of SIMD and floating-point instructions. Additional instructions are provided that are generally only used during system configuration.

32-bit ARM assembly language

The ARM assembly example in this section uses the syntax of the GNU Assembler, provided with the Android Studio integrated development environment (IDE). Other assemblers may use a different syntax. As with the Intel syntax for the x86 and x64 assembly languages, the operand order for most instructions is the destination followed by the source.

The ARM assembly language source file for the hello program is as follows:

.text
.global _start
_start:
    mov     r0, #1       // int fd 1 (stdout)
    ldr     r1, =message // const void *buf
    mov     r2, #count   // size_t count
    mov     r7, #4       // syscall 4 (sys_write)
    svc     0
    mov     r0, #0       // int status (0=OK)
    mov     r7, #1       // syscall 1 (sys_exit)
    svc     0
        
.data
message:
    .ascii      "Hello, Computer Architect!"
count = . - message

This file, named hello_arm.s, is assembled and linked to form the executable program hello_arm with the following commands. These commands use the development tools provided with the Android Studio Native Development Kit (NDK). The commands assume the Windows PATH environment variable has been set to include the NDK tools directory:

arm-linux-androideabi-as -al=hello_arm.lst -o hello_arm.o hello_arm.s
arm-linux-androideabi-ld -o hello_arm hello_arm.o

The components of these commands are:

arm-linux-androideabi-as runs the assembler
-al=hello_arm.lst creates a listing file named hello_arm.lst
-o hello_arm.o creates an object file named hello_arm.o
hello_arm.s is the name of the assembly language source file
arm-linux-androideabi-ld runs the linker
-o hello_arm creates an executable file named hello_arm
hello_arm.o is the name of the object file provided as input to the linker

This is a portion of the hello_arm.lst listing file generated by the assembler command:

   1              	.text
   2              	.global _start
   3              	
   4              	_start:
   5 0000 0100A0E3           mov     r0, #1       // int fd 1 (stdout)
   6 0004 14109FE5           ldr     r1, =message // const void *buf
   7 0008 1A20A0E3           mov     r2, #count   // size_t count
   8 000c 0470A0E3           mov     r7, #4       // syscall 4 (sys_write)
   9 0010 000000EF           svc     0
  10              	
  11 0014 0000A0E3           mov     r0, #0       // int status (0=OK)
  12 0018 0170A0E3           mov     r7, #1       // syscall 1 (sys_exit)
  13 001c 000000EF           svc     0
  14              	        
  15              	.data
  16              	message:
  17 0000 48656C6C           .ascii      "Hello, Computer Architect!"
  17      6F2C2043 
  17      6F6D7075 
  17      74657220 
  17      41726368 
  18              	count = . - message
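
The hexadecimal columns in the listing can be checked by hand. The following Python sketch, offered as an illustration rather than as a general disassembler, decodes the mov instructions with immediate operands using the standard A32 data-processing field layout and confirms that count equals the message length. The function name decode_mov_imm is hypothetical.

```python
import struct


def decode_mov_imm(listing_hex: str):
    """Decode an A32 'mov Rd, #imm' word from the listing's little-endian hex bytes."""
    (word,) = struct.unpack("<I", bytes.fromhex(listing_hex))
    cond = word >> 28             # bits 31-28: condition code (0xE means always)
    kind = (word >> 26) & 0x3     # bits 27-26: 00 for data-processing instructions
    is_imm = (word >> 25) & 1     # bit 25: operand 2 is an immediate value
    opcode = (word >> 21) & 0xF   # bits 24-21: 0b1101 selects MOV
    rd = (word >> 12) & 0xF       # bits 15-12: destination register
    rotate = (word >> 8) & 0xF    # bits 11-8: half of the rotate-right amount
    imm8 = word & 0xFF            # bits 7-0: 8-bit immediate
    if cond != 0xE or kind != 0 or is_imm != 1 or opcode != 0b1101:
        raise ValueError("not an unconditional mov with an immediate operand")
    shift = 2 * rotate
    imm = ((imm8 >> shift) | (imm8 << (32 - shift))) & 0xFFFFFFFF
    return rd, imm


# The five mov instructions as they appear in the listing.
for hex_bytes in ("0100A0E3", "1A20A0E3", "0470A0E3", "0000A0E3", "0170A0E3"):
    rd, imm = decode_mov_imm(hex_bytes)
    print(f"{hex_bytes}: mov r{rd}, #{imm}")

# The data bytes shown in the listing are the first 20 characters of the message.
print(bytes.fromhex("48656C6C 6F2C2043 6F6D7075 74657220 41726368"))

# count = . - message is the length of the string in bytes.
print(len(b"Hello, Computer Architect!"))
```

The second instruction decodes to mov r2, #26, matching the 26-byte message length (0x1A) that the assembler computed for count.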

You can run this program on an Android device with Developer options enabled. We won’t go into the procedure for enabling those options here, but you can learn more about that topic with an internet search.

This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:

C:\>adb push hello_arm /data/local/tmp/hello_arm
C:\>adb shell chmod +x /data/local/tmp/hello_arm
C:\>adb shell /data/local/tmp/hello_arm
Hello, Computer Architect!

These commands use the Android Debug Bridge (adb) tool included with Android Studio. Although the hello_arm program runs on the Android device, output from the program is sent back to the PC and appears in the command window.

The next section introduces the 64-bit ARM architecture, an extension of the 32-bit ARM architecture.

64-bit ARM architecture and instruction set

The 64-bit version of the ARM architecture, named AArch64, was announced in 2011. This architecture has 31 general-purpose 64-bit registers, 64-bit addressing, a 48-bit virtual address space, and a new instruction set named A64.

A64 is a separate instruction set rather than a superset of the 32-bit instruction set. Processors that also implement the 32-bit execution state, named AArch32, can run existing 32-bit code unmodified.

Instructions are 32 bits wide, and most operands are 32 or 64 bits. The A64 register functions differ in some respects from 32-bit mode: the program counter is no longer directly accessible as a register, and a zero register (named xzr, or wzr in its 32-bit form) is provided that always reads as zero.

At the user privilege level, most A64 instructions have the same mnemonics as the corresponding 32-bit instructions. The assembler determines whether an instruction operates on 64-bit or 32-bit data based on the operands provided. The following rules determine the operand length and register size used by an instruction:

64-bit register names begin with the letter X; for example, x0
32-bit register names begin with the letter W; for example, w1
32-bit registers occupy the lower 32 bits of the 64-bit register with the same number; for example, w1 is the lower half of x1

When working with 32-bit registers, the following rules apply:

Register operations such as right shifts behave the same as in the 32-bit architecture. A 32-bit arithmetic right shift uses bit 31 as the sign bit, not bit 63.
Condition codes for 32-bit operations are set based on the result in the lower 32 bits.
Writes to a W register set the upper 32 bits of the corresponding X register to zero.
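
These rules can be modeled with a few lines of Python. The sketch below is illustrative only: Python integers stand in for register contents, and the function names write_w, asr32, and asr64 are hypothetical. Shift counts are assumed to be within the register width.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write_w(x_old: int, value: int) -> int:
    """Write a W register: the X register's upper 32 bits become zero."""
    # The previous contents of the X register (x_old) are not preserved.
    return value & MASK32


def asr32(w: int, shift: int) -> int:
    """32-bit arithmetic shift right: bit 31 is treated as the sign bit."""
    w &= MASK32
    signed = w - (1 << 32) if w & 0x80000000 else w
    return (signed >> shift) & MASK32


def asr64(x: int, shift: int) -> int:
    """64-bit arithmetic shift right: bit 63 is treated as the sign bit."""
    x &= MASK64
    signed = x - (1 << 64) if x & 0x8000000000000000 else x
    return (signed >> shift) & MASK64


# Writing w0 clears the upper half of x0, whatever it held before.
print(hex(write_w(0xFFFFFFFFFFFFFFFF, 0x12345678)))

# The same bit pattern shifts differently as a 32-bit and a 64-bit value.
print(hex(asr32(0x80000000, 4)))
print(hex(asr64(0x80000000, 4)))
```

The last two lines show why the register name matters: as a W register, 0x80000000 is negative and the shift fills with ones, while as an X register the same value is positive and the shift fills with zeros.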

A64 is a load/store architecture with the same instruction mnemonics for memory operations (ldr and str) as 32-bit mode. There are some differences and limitations in comparison to the 32-bit load and store instructions:

The base register must be an X (64-bit) register.
An address offset can be any of the same types as in 32-bit mode, as well as an X register. A 32-bit offset can be zero-extended or sign-extended to 64 bits.
Indexed addressing modes can only use immediate values as an offset.
A64 does not support the ldm or stm instructions for loading or storing multiple registers in a single instruction. Instead, A64 adds the ldp and stp instructions for loading or storing a pair of registers in a single instruction.
A64 only supports conditional execution for a small subset of instructions.

Stack operations are significantly different in A64. Perhaps the biggest difference in this area is that the stack pointer must maintain 16-byte alignment when accessing data.
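
One common way to satisfy the alignment requirement is to round every stack frame size up to a multiple of 16 bytes before adjusting the stack pointer. The following Python sketch shows the arithmetic; the starting stack pointer value and the function name align_up_16 are assumptions chosen for illustration.

```python
def align_up_16(nbytes: int) -> int:
    """Round a non-negative byte count up to the next multiple of 16."""
    return (nbytes + 15) & ~15


# Hypothetical starting stack pointer, already 16-byte aligned.
sp = 0x7FFFFFF000

# A function needing 24 bytes of locals reserves 32 bytes so sp stays aligned.
frame_size = align_up_16(24)
sp -= frame_size

print(frame_size, hex(sp), sp % 16 == 0)
```

This also helps explain why ldp and stp are convenient in A64: two 64-bit registers occupy exactly 16 bytes, so saving registers in pairs keeps the stack pointer aligned.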

64-bit ARM assembly language

This is the 64-bit ARM assembly language source file for the hello program:

.text
.global _start
_start:
    // Print the message to file 1 (stdout) with syscall 64
    mov     x0, #1
    ldr     x1, =msg
    mov     x2, #msg_len
    mov     x8, #64
    svc     0
    // Exit the program with syscall 93, returning status 0
    mov     x0, #0
    mov     x8, #93
    svc     0
    
.data
msg:
    .ascii      "Hello, Computer Architect!"
msg_len = . - msg

This file, named hello_arm64.s, is assembled and linked to form the executable hello_arm64 program with the following commands. These commands use the 64-bit development tools provided with the Android Studio NDK. The use of these commands assumes the Windows PATH environment variable has been set to include the tools directory:

aarch64-linux-android-as -al=hello_arm64.lst -o hello_arm64.o hello_arm64.s
aarch64-linux-android-ld -o hello_arm64 hello_arm64.o

The components of these commands are:

aarch64-linux-android-as runs the assembler
-al=hello_arm64.lst creates a listing file named hello_arm64.lst
-o hello_arm64.o creates an object file named hello_arm64.o
hello_arm64.s is the name of the assembly language source file
aarch64-linux-android-ld runs the linker
-o hello_arm64 creates an executable file named hello_arm64
hello_arm64.o is the name of the object file provided as input to the linker

This is a portion of the hello_arm64.lst listing file generated by the assembler:

   1              	.text
   2              	.global _start
   3              	
   4              	_start:
   5              	    // Print the message to file 1 (stdout) with syscall 64
   6 0000 200080D2           mov     x0, #1
   7 0004 E1000058           ldr     x1, =msg
   8 0008 420380D2           mov     x2, #msg_len
   9 000c 080880D2           mov     x8, #64
  10 0010 010000D4           svc     0
  11              	
  12              	    // Exit the program with syscall 93, returning status 0
  13 0014 000080D2           mov     x0, #0
  14 0018 A80B80D2           mov     x8, #93
  15 001c 010000D4           svc     0
  16              	    
  17              	.data
  18              	msg:
  19 0000 48656C6C           .ascii      "Hello, Computer Architect!"
  19      6F2C2043 
  19      6F6D7075 
  19      74657220 
  19      41726368 
  20              	msg_len = . - msg
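
As a cross-check of the listing, the following Python sketch decodes the mov instructions, which the assembler encodes as the A64 MOVZ (move wide with zero) instruction. It is a minimal illustration that handles only the 64-bit MOVZ form; the function name decode_movz64 is hypothetical.

```python
import struct


def decode_movz64(listing_hex: str):
    """Decode a 64-bit A64 MOVZ word from the listing's little-endian hex bytes."""
    (word,) = struct.unpack("<I", bytes.fromhex(listing_hex))
    # Bits 31-23: sf=1 (64-bit), opc=10 (MOVZ), followed by the fixed bits 100101.
    if (word >> 23) != 0b110100101:
        raise ValueError("not a 64-bit MOVZ instruction")
    hw = (word >> 21) & 0x3        # bits 22-21: shift the immediate left by 16 * hw
    imm16 = (word >> 5) & 0xFFFF   # bits 20-5: 16-bit immediate
    rd = word & 0x1F               # bits 4-0: destination register
    return rd, imm16 << (16 * hw)


# The five mov instructions as they appear in the listing.
for hex_bytes in ("200080D2", "420380D2", "080880D2", "000080D2", "A80B80D2"):
    rd, imm = decode_movz64(hex_bytes)
    print(f"{hex_bytes}: mov x{rd}, #{imm}")
```

The decoded values match the source: x2 receives 26 (the message length), and x8 receives the system call numbers 64 and 93.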

You can run this program on an Android device with Developer options enabled, as described earlier. This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:

C:\>adb push hello_arm64 /data/local/tmp/hello_arm64
C:\>adb shell chmod +x /data/local/tmp/hello_arm64
C:\>adb shell /data/local/tmp/hello_arm64
Hello, Computer Architect!

This completes our introduction to the 32-bit and 64-bit ARM architectures.

Summary

Having completed this chapter, you should have a good understanding of the high-level architectures and features of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, and assembly languages.

The x86 and x64 architectures represent a mostly CISC approach to processor design, using variable-length instructions that can take many cycles to execute, a lengthy pipeline, and (in x86) a limited number of processor registers.

The ARM architectures, on the other hand, implement RISC processors with mostly single-cycle instruction execution, a large register set, and (somewhat) fixed-length instructions. Early versions of ARM had pipelines as short as three stages, though later generations have considerably more stages.

Is one of these architectures better than the other, in a general sense? It may be that each is better in some ways, and system designers must make their selection of processor architecture based on the specific needs of the system under development. Of course, there is a great deal of inertia behind the use of x86/x64 processors in personal computing, business computing, and server applications. Similarly, there is much history behind the dominance of ARM processors in smart personal devices and embedded systems. Many factors beyond raw performance must be considered in the processor selection process when designing a new computer or smart device.

In the next chapter, we’ll look at the RISC-V architecture. RISC-V was developed from a clean sheet, incorporating lessons learned from the history of processor development and without any of the baggage required to maintain support for decades-old legacy designs.

Exercises

Install the free Visual Studio Community edition, available at https://visualstudio.microsoft.com/vs/community/, on a Windows PC. Once installation is complete, open the Visual Studio IDE and select Get Tools and Features… under the Tools menu. Install the Desktop development with C++ workload.
In the Windows search box in the taskbar, begin typing Developer Command Prompt for VS 2022. When the app appears in the search menu, select it to open Command Prompt.

Create a file named hello_x86.asm with the content shown in the source listing in the x86 assembly language section of this chapter.

Build the program using the command shown in the x86 assembly language section of this chapter and run it. Verify that the output Hello, Computer Architect! appears on the screen.
Write an x86 assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print 1 byte as two hex digits.
In the Windows search box in the taskbar, begin typing x64 Native Tools Command Prompt for VS 2022. When the app appears in the search menu, select it to open Command Prompt.
Create a file named hello_x64.asm with the content shown in the source listing in the x64 assembly language section of this chapter.

Build the program using the command shown in the x64 assembly language section of this chapter and run it. Verify that the output Hello, Computer Architect! appears on the screen.
Write an x64 assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print 1 byte as two hex digits.
Install the free Android Studio IDE, available at https://developer.android.com/studio/. Once installation is complete, open the Android Studio IDE, create a new project, and select SDK Manager under the Tools menu. Select the SDK Tools tab and check the NDK option, which may say NDK (Side by side). Complete the installation of the NDK.
Locate the following files under the SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: arm-linux-androideabi-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
```
set PATH=%PATH%;%LOCALAPPDATA%\Android\Sdk\ndk\23.0.7599858\toolchains\llvm\prebuilt\windows-x86_64\bin
```
Create a file named hello_arm.s with the content shown in the source listing in the 32-bit ARM assembly language section of this chapter.

Build the program using the commands shown in the 32-bit ARM assembly language section of this chapter.

Enable Developer Options on an Android phone or tablet. Search the internet for instructions on how to do this.

Connect your Android device to the computer with a USB cable.

Copy the program executable image to the phone using the commands shown in the 32-bit ARM assembly language section of this chapter and run the program. Verify that the output Hello, Computer Architect! appears on the host computer screen.

Disable Developer Options on your Android phone or tablet.
Write a 32-bit ARM assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print 1 byte as two hex digits.
Locate the following files under the Android SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: aarch64-linux-android-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
```
set PATH=%PATH%;%LOCALAPPDATA%\Android\Sdk\ndk\23.0.7599858\toolchains\llvm\prebuilt\windows-x86_64\bin;%LOCALAPPDATA%\Android\Sdk\platform-tools
```
Create a file named hello_arm64.s with the content shown in the source listing in the 64-bit ARM assembly language section of this chapter.

Build the program using the commands shown in the 64-bit ARM assembly language section of this chapter.

Enable Developer Options on an Android phone or tablet.

Connect your Android device to the computer with a USB cable.

Copy the program executable image to the phone using the commands shown in the 64-bit ARM assembly language section of this chapter and run the program. Verify that the output Hello, Computer Architect! appears on the host computer screen.

Disable Developer Options on your Android phone or tablet.
Write a 64-bit ARM assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print 1 byte as two hex digits.
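
When working on the expression exercises, it can help to know the expected output before debugging the assembly code. The following Python sketch is a reference check only, not a solution in any of the assembly languages; the function name byte_to_hex is hypothetical and mirrors the callable function the exercises ask for.

```python
def byte_to_hex(value: int) -> str:
    """Return the low 8 bits of value as two uppercase hex digits."""
    digits = "0123456789ABCDEF"
    return digits[(value >> 4) & 0xF] + digits[value & 0xF]


# [(129 - 66) x (445 + 136)] / 3, which divides evenly.
result = ((129 - 66) * (445 + 136)) // 3

# Print the 16-bit result one byte at a time, high byte first.
print(byte_to_hex(result >> 8) + byte_to_hex(result & 0xFF))
```

The sketch prints 2FA9, the hexadecimal form of 12201. The result does not fit in a single byte, so an assembly solution must call the byte-printing function at least twice.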

Join our community Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the author: https://discord.gg/7h8aNRhRuY
