Chapter 10: Modern Processor Architectures and Instruction Sets

Most modern personal computers contain a processor supporting either the Intel or AMD version of the x86 32-bit and x64 64-bit architectures. Almost all smartphones, smart watches, tablets, and many embedded systems, on the other hand, contain ARM 32-bit or 64-bit processors. This chapter takes a detailed look at the registers and instruction sets of all of these processor families.

After completing this chapter, you will understand the top-level architectures and unique attributes of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, assembly languages, and key aspects of legacy feature support in these architectures.

This chapter covers the following topics:

x86 architecture and instruction set
x64 architecture and instruction set
32-bit ARM architecture and instruction set
64-bit ARM architecture and instruction set

Technical requirements

The files for this chapter, including the answers to the exercises, are available at https://github.com/PacktPublishing/Modern-Computer-Architecture-and-Organization.

x86 architecture and instruction set

For the purpose of this discussion, the term x86 refers to the 16-bit and 32-bit instruction set architecture of the series of processors that began with the Intel 8086, introduced in 1978. The 8088, released in 1979, is functionally very similar to the 8086, except it has an 8-bit data bus instead of the 16-bit bus of the 8086. The 8088 was the central processor in the original IBM PC.

Subsequent generations of this processor series were named 80186, 80286, 80386, and 80486, leading to the term "x86" as shorthand for the family. Later generations dropped the numeric naming convention and were given the names Pentium, Core, i Series, Celeron, and Xeon.

Advanced Micro Devices (AMD), a semiconductor manufacturing company that competes with Intel, has been producing x86-compatible processors since 1982. Some recent AMD x86 processor generations have been named Ryzen, Opteron, Athlon, Turion, Phenom, and Sempron.

For the most part, compatibility is good between Intel and AMD processors. There are some key differences between processors from the two vendors, including the chip pin configuration and chipset compatibility. In general, Intel processors only work in motherboards and with chipsets designed for Intel chips, and AMD processors only work in motherboards and with chipsets designed for AMD chips. We will see some other differences between Intel and AMD processors later in this section.

The 8086 and 8088 are 16-bit processors, despite the 8-bit data bus of the 8088. Internal registers in these processors are 16 bits wide and the instruction set operates on 16-bit data values. The 8088 transparently executes two bus cycles to transfer each 16-bit value between the processor and memory.

The 8086 and 8088 do not support the more sophisticated features of modern processors, such as paged virtual memory and protection rings. These early processors also have only 20 address lines, limiting the addressable memory to 1 MB. A 20-bit address cannot fit in a 16-bit register, so it is necessary to use a somewhat complicated system of segment registers and offsets to access the full 1 MB address space.
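
The segment-and-offset scheme can be sketched in a few lines of Python. This is an illustrative model rather than code from the book's repository: it relies on the documented 8086 behavior of shifting the 16-bit segment value left by four bits and adding the 16-bit offset, and the function name physical_address is our own.

```python
# Illustrative model of 8086 real-mode address formation.
# The names here are hypothetical and are not part of the chapter's code files.

ADDRESS_MASK = 0xFFFFF  # 20 address lines give a 1 MB address space


def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # The segment is shifted left 4 bits (multiplied by 16), then the offset
    # is added. With only 20 address lines, the sum wraps around at 1 MB.
    return ((segment << 4) + offset) & ADDRESS_MASK
```

Because segments overlap every 16 bytes, many segment:offset pairs name the same byte. For example, 1234h:0010h and 1000h:2350h both refer to physical address 12350h.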

In 1985, Intel released the 80386 with enhancements to mitigate many of these limitations. The 80386 introduced these features:

32-bit architecture: Addresses, registers, and the ALU are 32 bits wide and instructions operate natively on operands up to 32 bits wide.
Protected mode: This mode enables the multi-level privilege mechanism, consisting of ring numbers 0 to 3. In Windows and Linux, ring 0 is kernel mode and ring 3 is user mode. Rings 1 and 2 are not used in these operating systems.
On-chip MMU: The 80386 MMU supports a flat memory model, enabling any location in the 4 GB space to be accessed with a 32-bit address. Manipulation of segment registers and offsets is no longer required. The MMU supports paged virtual memory.
3-stage instruction pipeline: The pipeline accelerates instruction execution, as discussed in Chapter 8, Performance-Enhancing Techniques.
Hardware debug registers: The debug registers support setting up to four breakpoints that stop code execution at a specified virtual address when the address is accessed and a selected condition is satisfied. The available break conditions are code execution, data write, and data read or write. These registers are only available to code running in ring 0.

Modern x86 processors boot into the 16-bit operating mode of the original 8086, which is now called real mode. This mode retains compatibility with software written for the 8086/8088 environment, such as the MS-DOS operating system. In most modern systems running on x86 processors, a transition to protected mode occurs during system startup. Once in protected mode, the operating system generally remains in protected mode until the computer shuts down.

MS-DOS on a modern PC

Although the x86 processor in a modern PC is compatible at the instruction level with the original 8088, running an old copy of MS-DOS on a modern computer system is unlikely to be a straightforward process. This is because peripheral devices and their interfaces in modern PCs are not compatible with the corresponding interfaces in PCs from the 1980s. MS-DOS would need a driver that understands how to interact with the USB-connected keyboard of a modern motherboard, for example.

These days, the main use for 16-bit mode in x86 processors is to serve as a bootloader for a protected mode operating system. Because most developers of computerized devices and the software that runs on them are unlikely to be involved in implementing such a capability, the remainder of our x86 discussion in this chapter will address protected mode and the associated 32-bit flat memory model.

The x86 architecture supports unsigned and signed two's complement integer data types with widths of 8, 16, 32, 64, and 128 bits. The names assigned to these data types are as follows:

Byte: 8 bits
Word: 16 bits
Doubleword: 32 bits
Quadword: 64 bits
Double quadword: 128 bits

In most cases, the x86 architecture does not mandate storage of these data types on natural boundaries. The natural boundary of a data type is any address evenly divisible by the size of the data type in bytes.

Storing any of the multi-byte types at unaligned boundaries is permitted, but is discouraged because it causes a negative performance impact: instructions operating on unaligned data require additional clock cycles. A few instructions that operate on double quadwords require naturally aligned storage and will generate a general protection fault if unaligned access is attempted.
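
The natural boundary rule reduces to a single divisibility test. The following Python sketch uses our own helper name, is_naturally_aligned, together with the data type sizes listed earlier:

```python
# Sizes, in bytes, of the x86 integer data types named in the text.
DATA_TYPE_SIZES = {
    "byte": 1,
    "word": 2,
    "doubleword": 4,
    "quadword": 8,
    "double quadword": 16,
}


def is_naturally_aligned(address: int, type_name: str) -> bool:
    """Return True if address is evenly divisible by the size of the type."""
    return address % DATA_TYPE_SIZES[type_name] == 0
```

An address such as 1002h is naturally aligned for a word but not for a doubleword, so a 32-bit access at that address is legal but may cost additional clock cycles.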

x86 natively supports floating-point data types in widths of 16, 32, 64, and 80 bits. The 32-bit, 64-bit, and 80-bit formats are those presented in Chapter 9, Specialized Processor Extensions. The 16-bit format is called half-precision floating-point and has a sign bit, a 5-bit exponent, and an 11-bit mantissa, of which 10 bits are stored and the leading 1 bit is implied. The half-precision floating-point format is used extensively in GPU processing.
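
To make the half-precision layout concrete, here is a minimal Python decoder. It assumes the IEEE 754 binary16 layout: 1 sign bit, 5 exponent bits with a bias of 15, and 10 stored fraction bits that, together with the implied leading 1, give 11 bits of precision. The function name decode_half is our own.

```python
import struct


def decode_half(bits: int) -> float:
    """Decode a 16-bit IEEE 754 half-precision bit pattern into a float."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0x1F:
        # All-ones exponent: infinity (fraction == 0) or NaN (fraction != 0)
        value = float("inf") if fraction == 0 else float("nan")
    elif exponent == 0:
        # Zero or subnormal: there is no implied leading 1 bit
        value = fraction * 2.0 ** -24
    else:
        # Normal number: implied leading 1 plus the 10 stored fraction bits
        value = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -value if sign else value


def decode_half_with_struct(bits: int) -> float:
    """Cross-check using the standard library's half-precision format code."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

The second function uses the struct module's "e" format code as an independent check on the hand-written decoder.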

The x86 register set

In protected mode, the x86 architecture has eight 32-bit wide general-purpose registers, a flags register, and an instruction pointer. There are also six segment registers and additional processor model-specific configuration registers. The segment registers and model-specific registers are configured by system software during startup and are, in general, not relevant to the developers of applications and device drivers. For these reasons, we will not discuss the segment registers and model-specific registers further.

The 16-bit general-purpose registers in the original 8086 architecture are named AX, CX, DX, BX, SP, BP, SI, and DI. The first four registers are listed in this non-alphabetic order because this is the sequence in which these eight registers are pushed onto the stack by a pusha (push all registers) instruction. The 32-bit form of this instruction, pushad, pushes the extended registers in the same order.

In the transition to the 32-bit architecture of the 80386, each register grew to 32 bits. The 32-bit version of a register's name is prefixed with the letter "E" to indicate this extension.

It is possible to access portions of 32-bit registers in smaller bit widths. For example, the lower 16 bits of the 32-bit EAX register are referenced as AX. The AX register can be further accessed as individual bytes using the names AH (high-order byte) and AL (low-order byte). The following diagram shows the register names and subsets of each:

Figure 10.1: Register names and subsets

Writing to a portion of a 32-bit register, for example, register AL, affects only the bits in that portion. In the case of AL, loading an 8-bit value modifies the lowest 8 bits of EAX, leaving the other 24 bits unaffected.
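
The effect of partial register writes can be modeled with bit masks. In this Python sketch (the helper names are ours), each function takes the current 32-bit EAX value and returns the new EAX value after a write to AL, AH, or AX:

```python
def write_al(eax: int, value: int) -> int:
    """Replace bits 0-7 of EAX, leaving the upper 24 bits unchanged."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


def write_ah(eax: int, value: int) -> int:
    """Replace bits 8-15 of EAX, leaving all other bits unchanged."""
    return (eax & 0xFFFF00FF) | ((value & 0xFF) << 8)


def write_ax(eax: int, value: int) -> int:
    """Replace bits 0-15 of EAX, leaving the upper 16 bits unchanged."""
    return (eax & 0xFFFF0000) | (value & 0xFFFF)


def read_ax(eax: int) -> int:
    """Return the lower 16 bits of EAX."""
    return eax & 0xFFFF
```

Starting from EAX = 12345678h, writing FFh to AL yields 123456FFh: only the lowest byte changes.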

In keeping with the x86's complex instruction set computer (CISC) architecture, several functions associated with various instructions are tied to specific registers. The following table provides a description of the functions associated with each of the x86 general-purpose registers:

Table 10.1: x86 general-purpose registers and associated functions

These register-specific functions contrast with the architectures of many reduced instruction set computer (RISC) processors, which tend to provide a greater number of general-purpose registers. Registers within a RISC processor are, for the most part, functionally equivalent to one another.

The x86 flags register, EFLAGS, contains the processor status bits described in the following table:

Table 10.2: x86 flags register status bits

All bits in the EFLAGS register that are not listed in the preceding table are reserved.

The 32-bit instruction pointer, EIP, contains the address of the next instruction to execute, unless a branch is taken.

The x86 architecture is little-endian, meaning multi-byte values are stored in memory with the least significant byte at the lowest address and the most significant byte at the highest address.
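
Little-endian byte ordering is easy to demonstrate in Python with the built-in integer conversion methods. The helper names below are our own:

```python
def store_little_endian(value: int, size_bytes: int) -> bytes:
    """Return the bytes of value as x86 would lay them out in memory."""
    return value.to_bytes(size_bytes, "little")


def load_little_endian(data: bytes) -> int:
    """Reassemble an integer from bytes stored lowest address first."""
    return int.from_bytes(data, "little")
```

Storing the doubleword 12345678h produces the byte sequence 78h, 56h, 34h, 12h at successively higher addresses.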

x86 addressing modes

As befits a CISC architecture, x86 supports a variety of addressing modes. There are several rules associated with addressing source and destination operands that must be followed to create valid instructions. For instance, the sizes of the source and destination operands of a mov instruction must be equal. The assembler will attempt to select a suitable size for an operand of ambiguous size (for example, an immediate value of 7) to match the width of a destination location (such as the 32-bit register EAX). In cases where the size of an operand cannot be inferred, size keywords such as byte ptr must be provided.

The assembly language in these examples uses the Intel syntax, which places the operands in destination-source order. The Intel syntax is used primarily in the Windows and MS-DOS contexts. An alternative notation, known as the AT&T syntax, places operands in source-destination order. The AT&T syntax is used in Unix-based operating systems. All examples in this book will use the Intel syntax.

The x86 architecture supports a variety of addressing modes, which we will look at now.

Implied addressing

The register is implied by the instruction opcode:

clc ; Clear the carry flag (This is CF in the EFLAGS register)

Register addressing

One or both source and destination registers are encoded in the instruction:

mov eax, ecx ; Copy the contents of register ECX to EAX

Registers may be used as the first operand, the second operand, or both operands.

Immediate addressing

An immediate value is provided as an instruction operand:

mov eax, 7 ; Move the 32-bit value 7 into EAX

mov ax, 7 ; Move the 16-bit value 7 into AX (the lower 16 bits of EAX)

When using the Intel syntax, immediate values are written without a prefix character. In the AT&T syntax, by contrast, immediate values are prefixed with the $ character.

Direct memory addressing

The address of the value is provided as an instruction operand:

mov eax, [078bch] ; Copy the 32-bit value at hex address 78BC to EAX

In x86 assembly code, square brackets around an expression indicate the expression is an address. When performing moves or other operations on square-bracketed operands, the value being operated upon is the data at the specified address. The exception to this rule is the LEA (load effective address) instruction, which we'll look at later.

Register indirect addressing

The operand identifies a register containing the address of the data value:

mov eax, [esi] ; Copy the 32-bit value at the address contained in ESI to EAX

This mode is equivalent to using a pointer to reference a variable in C or C++.

Indexed addressing

The operand indicates a register plus offset that calculates the address of the data value:

mov eax, [esi + 0bh] ; Copy the 32-bit value at the address (ESI + 0bh) to EAX

This mode is useful for accessing the elements of a data structure. In this scenario, the ESI register contains the address of the structure and the added constant is the byte offset of the element from the beginning of the structure.

Based indexed addressing

The operand indicates a base register, an index register, and an offset that sum together to calculate the address of the data value:

mov eax, [ebx + esi + 10] ; Copy the 32-bit value starting at the address (EBX + ESI + 10) to EAX

This mode is useful for accessing individual data elements within an array of structures. In this example, the EBX register contains the address of the beginning of the structure array, ESI contains the offset of the referenced structure within the array, and the constant value (10) is the offset of the desired element from the beginning of the selected structure.

Based indexed addressing with scaling

The operand is composed of a base register, an index register multiplied by a scale factor, and an offset that sum together to calculate the address of the data value:

mov eax, [ebx + esi*4 + 10] ; Copy the 32-bit value starting at the address (EBX + ESI*4 + 10) to EAX

In this mode, the value in the index register can be multiplied by 1 (the default), 2, 4, or 8 before being summed with the other components of the operand address. There is no performance penalty associated with using the scaling multiplier. This feature is helpful when iterating over arrays containing elements with sizes of 2, 4, or 8 bytes.

Most of the general-purpose registers can be used as the base or index register in the based addressing modes. The following diagram shows the possible combinations of register usage and scaling in the based addressing modes:

Figure 10.2: Based addressing mode

All eight general-purpose registers are available for use as the base register. Of those eight, only ESP is unavailable for use as the index register.
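
The address computation performed by the based indexed mode with scaling can be written out as a short Python sketch. The function name effective_address is our own, and the 32-bit wraparound reflects the flat 32-bit address space discussed in this section:

```python
MASK32 = 0xFFFFFFFF
VALID_SCALES = (1, 2, 4, 8)


def effective_address(base: int = 0, index: int = 0,
                      scale: int = 1, displacement: int = 0) -> int:
    """Compute base + index*scale + displacement as a 32-bit address."""
    if scale not in VALID_SCALES:
        raise ValueError("scale must be 1, 2, 4, or 8")
    # The sum is truncated to 32 bits, as in a 32-bit address calculation.
    return (base + index * scale + displacement) & MASK32
```

For the operand [ebx + esi*4 + 10] with EBX = 2000h and ESI = 3, the call effective_address(0x2000, 3, 4, 10) models the address from which the mov instruction reads. The simpler modes fall out as special cases: register indirect addressing uses only the base, and indexed addressing uses the base plus a displacement.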

x86 instruction categories

The x86 instruction set was introduced with the Intel 8086 and has been extended several times over the years. Some of the most significant changes relate to the extension of the architecture from 16 to 32 bits, which added protected mode and paged virtual memory. In almost all cases, the new capabilities were added while retaining full backward compatibility.

The x86 instruction set contains several hundred instructions. We will not discuss all of them here. This section will provide brief summaries of the more important and commonly encountered instructions applicable to user mode applications and device drivers. This subset of x86 instructions can be divided into a few general categories: data movement; stack manipulation; arithmetic and logic; conversions; control flow; string manipulation; flag manipulation; input/output; and protected mode. There are also some miscellaneous instructions that do not fall into any specific category.

Data movement

Data movement instructions do not affect the processor flags. The following instructions perform data movement:

mov: Copies the data value referenced by the second operand to the location provided as the first operand.
cmovcc: Conditionally moves the second operand's data to the register provided as the first operand if the cc condition is true. The condition is determined from one or more of the processor flags: CF, ZF, SF, OF, and PF. The condition codes are e (equal), ne (not equal), g (greater), ge (greater or equal), a (above), ae (above or equal), l (less), le (less or equal), b (below), be (below or equal), o (overflow), no (no overflow), z (zero), nz (not zero), s (SF=1), and ns (SF=0). The g, ge, l, and le codes apply to signed comparisons, while a, ae, b, and be apply to unsigned comparisons. Two further codes, cxz (register CX is zero) and ecxz (the ECX register is zero), test a register rather than the flags and are available only in the conditional jump instructions jcxz and jecxz, not in cmovcc.
movsx, movzx: Variants of the mov instruction performing sign extension and zero extension, respectively. The source operand must be a smaller size than the destination.
lea: Computes the address provided by the second operand and stores it at the location given in the first operand. The second operand is surrounded by square brackets. Unlike the other data movement instructions, the computed address is stored in the destination rather than the data value at that address.
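
The way condition codes map onto the processor flags can be illustrated with a small Python model. This sketch is our own: flags_after_cmp computes CF, ZF, SF, and OF as a 32-bit subtraction of the second value from the first would set them, and the CONDITIONS table encodes the standard flag tests for the flag-based codes (PF-based conditions are omitted).

```python
MASK32 = 0xFFFFFFFF


def flags_after_cmp(a: int, b: int) -> dict:
    """Return CF, ZF, SF, and OF as a 32-bit comparison (a - b) sets them."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "CF": int(a < b),                            # unsigned borrow
        "ZF": int(result == 0),                      # result is zero
        "SF": (result >> 31) & 1,                    # sign bit of the result
        "OF": (((a ^ b) & (a ^ result)) >> 31) & 1,  # signed overflow
    }


CONDITIONS = {
    "e":  lambda f: f["ZF"] == 1,
    "ne": lambda f: f["ZF"] == 0,
    "a":  lambda f: f["CF"] == 0 and f["ZF"] == 0,   # unsigned above
    "ae": lambda f: f["CF"] == 0,
    "b":  lambda f: f["CF"] == 1,                    # unsigned below
    "be": lambda f: f["CF"] == 1 or f["ZF"] == 1,
    "g":  lambda f: f["ZF"] == 0 and f["SF"] == f["OF"],  # signed greater
    "ge": lambda f: f["SF"] == f["OF"],
    "l":  lambda f: f["SF"] != f["OF"],              # signed less
    "le": lambda f: f["ZF"] == 1 or f["SF"] != f["OF"],
    "o":  lambda f: f["OF"] == 1,
    "no": lambda f: f["OF"] == 0,
    "s":  lambda f: f["SF"] == 1,
    "ns": lambda f: f["SF"] == 0,
}
CONDITIONS["z"] = CONDITIONS["e"]     # z is a synonym for e
CONDITIONS["nz"] = CONDITIONS["ne"]   # nz is a synonym for ne


def condition_true(cc: str, flags: dict) -> bool:
    """Evaluate condition code cc against a dictionary of flag values."""
    return CONDITIONS[cc](flags)
```

Comparing FFFFFFFFh with 1 shows why both signed and unsigned codes exist: as an unsigned value FFFFFFFFh is above 1, so the a condition is true, but as a signed value it is -1, so the l condition is true as well.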

Stack manipulation

Stack manipulation instructions do not affect the processor flags. These instructions are as follows:

push: Decrements ESP by 4, then places the 32-bit operand into the stack location pointed to by ESP.
pop: Copies the 32-bit data value pointed to by ESP to the operand location (a register or memory address), then increments ESP by 4.
pushfd, popfd: Pushes or pops the EFLAGS register.
pushad, popad: Pushes or pops the EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI registers, in that order.
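
The following Python sketch models the behavior of push and pop on a small block of memory. The class name Stack32 and the 64-byte default size are arbitrary choices for illustration. The stack grows toward lower addresses, and values are stored in little-endian order:

```python
class Stack32:
    """Toy model of the x86 stack using a bytearray as memory."""

    def __init__(self, size_bytes: int = 64):
        self.memory = bytearray(size_bytes)
        self.esp = size_bytes  # an empty stack points just past the top

    def push(self, value: int) -> None:
        """Decrement ESP by 4, then store the 32-bit value at [ESP]."""
        if self.esp < 4:
            raise OverflowError("stack overflow in toy model")
        self.esp -= 4
        data = (value & 0xFFFFFFFF).to_bytes(4, "little")
        self.memory[self.esp:self.esp + 4] = data

    def pop(self) -> int:
        """Load the 32-bit value at [ESP], then increment ESP by 4."""
        if self.esp > len(self.memory) - 4:
            raise IndexError("pop from empty stack in toy model")
        value = int.from_bytes(self.memory[self.esp:self.esp + 4], "little")
        self.esp += 4
        return value
```

Values come off the stack in the reverse of the order in which they were pushed, and ESP returns to its starting value once every push has been matched by a pop.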

Arithmetic and logic

Arithmetic and logic instructions modify the processor flags. The following instructions perform arithmetic and logic operations:

add, sub: Performs integer addition or subtraction. When subtracting, the second operand is subtracted from the first. Both operands can be registers, or one operand can be a memory location and the other a register. One operand can be a constant.
adc, sbb: Performs integer addition or subtraction using the CF flag as a carry input (for addition) or as a borrow input (for subtraction).
cmp: Subtracts the two operands and discards the result while updating the OF, SF, ZF, AF, PF, and CF flags based on the result.
neg: Negates the operand.
inc, dec: Increments or decrements the operand by one.
mul: Performs unsigned integer multiplication. The size of the product depends on the size of the operand. A byte operand is multiplied by AL and the result is placed in AX. A word operand is multiplied by AX and the result is placed in DX:AX, with the upper 16 bits in DX. A doubleword is multiplied by EAX and the result is placed in EDX:EAX.
imul: Performs signed integer multiplication. The instruction accepts one, two, or three operands. The one-operand form behaves like mul, multiplying the operand by AL, AX, or EAX and placing the double-width product in AX, DX:AX, or EDX:EAX. In the two- and three-operand forms, the first operand must be a register and receives the result of the operation. In the two-operand form, the first operand multiplies the second operand and the result is stored in the first operand (a register). In the three-operand form, the second operand multiplies the third operand and the result is stored in the first operand register. In the three-operand form, the third operand must be an immediate value.
div, idiv: Performs unsigned (div) or signed (idiv) division. The size of the result depends on the size of the operand. A byte operand is divided into AX, the quotient is placed in AL, and the remainder is placed in AH. A word operand is divided into DX:AX, the quotient is placed in AX, and the remainder is placed in DX. A doubleword is divided into EDX:EAX, the quotient is placed in EAX, and the remainder is placed in EDX.
and, or, xor: Performs the corresponding logical operation on the two operands and stores the result in the destination operand location.
not: Performs a logical NOT (bit inversion) operation on a single operand.
sal, shl, sar, shr: Performs a logical (shl and shr) or arithmetic (sal and sar) shift of the byte, word, or doubleword argument left or right by 1 to 31 bit positions. sal and shl place the last bit shifted out into the carry flag and insert zeros into the vacated least significant bits. shr places the last bit shifted out into the carry flag and inserts zeros into the vacated most significant bits. sar differs from shr by propagating the sign bit into the vacated most significant bits.
rol, rcl, ror, rcr: Performs a left or right rotation by 0 to 31 bits, optionally through the carry flag. rcl and rcr rotate through the carry flag while rol and ror do not.
bts, btr, btc: Reads a specified bit number (provided as the second operand) within the bits of the first operand into the carry flag, then either sets (bts), resets (btr), or complements (btc) that bit. These instructions may be preceded by the lock keyword to make the operation atomic.
test: Performs a logical AND of two operands and updates the SF, ZF, and PF flags based on the result.
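
The register pairing used by the doubleword forms of mul and div, and the difference between shr and sar, can be sketched in Python. The function names are our own, and the model covers only the 32-bit unsigned forms described above:

```python
MASK32 = 0xFFFFFFFF


def mul32(eax: int, operand: int) -> tuple:
    """Unsigned EAX * operand; returns (EDX, EAX) holding the 64-bit product."""
    product = (eax & MASK32) * (operand & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def div32(edx: int, eax: int, operand: int) -> tuple:
    """Unsigned EDX:EAX / operand; returns (EAX quotient, EDX remainder)."""
    if operand == 0:
        raise ZeroDivisionError("the processor raises a divide error")
    dividend = ((edx & MASK32) << 32) | (eax & MASK32)
    quotient, remainder = divmod(dividend, operand & MASK32)
    if quotient > MASK32:
        # A quotient too large for EAX also raises a divide error on hardware
        raise OverflowError("quotient does not fit in EAX")
    return quotient, remainder


def shr32(value: int, count: int) -> int:
    """Logical right shift: zeros enter the vacated high-order bits."""
    return (value & MASK32) >> count


def sar32(value: int, count: int) -> int:
    """Arithmetic right shift: the sign bit is copied into vacated bits."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32  # reinterpret as a negative Python integer
    return (value >> count) & MASK32
```

Shifting 80000000h right by four bit positions gives 08000000h with shr32 and F8000000h with sar32, showing how sar preserves the sign of a negative value.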

Conversions

Conversion instructions sign-extend a smaller data size to a larger size. These instructions are as follows:

cbw: Converts a byte (register AL) into a word (AX).
cwd: Converts a word (register AX) into a doubleword (DX:AX).
cwde: Converts a word (register AX) into a doubleword (EAX).
cdq: Converts a doubleword (register EAX) into a quadword (EDX:EAX).
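
Each of these conversions performs sign extension, which copies the sign bit of the source into all of the added high-order bits. A minimal Python sketch follows, with helper names chosen to mirror the instruction mnemonics:

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend the low from_bits of value to a to_bits-wide result."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        # Negative: set every bit above the original width
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def cbw(al: int) -> int:
    """AL -> AX"""
    return sign_extend(al, 8, 16)


def cwde(ax: int) -> int:
    """AX -> EAX"""
    return sign_extend(ax, 16, 32)


def cdq(eax: int) -> tuple:
    """EAX -> EDX:EAX, returned as the pair (EDX, EAX)."""
    wide = sign_extend(eax, 32, 64)
    return (wide >> 32) & 0xFFFFFFFF, wide & 0xFFFFFFFF
```

A common use of cdq is to prepare EDX:EAX before a signed division with idiv, since the dividend for a doubleword divisor is the 64-bit register pair.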

Control flow

Control flow instructions conditionally or unconditionally transfer execution to an address. The control flow instructions are as follows:

jmp: Transfers control to the instruction at the address provided as the operand.
jcc: Transfers control to the instruction at the address provided as the operand if the condition cc is true. The condition codes were described earlier, in the cmovcc instruction description. The condition is determined from one or more of the processor flags: CF, ZF, SF, OF, and PF.
call: Pushes the current value of EIP, which is the address of the instruction following the call (the return address), onto the stack and transfers control to the instruction at the address provided as the operand.
ret: Pops the top-of-stack value and stores it in EIP. If an operand is provided, it pops the given number of bytes from the stack to clear parameters.
loop: Decrements the loop counter in ECX and, if not zero, transfers control to the instruction at the address provided as the operand.

String manipulation

String manipulation instructions may be prefixed by the rep keyword to repeat the operation the number of times given by the ECX register, incrementing or decrementing the source and destination location on each iteration, depending on the state of the DF flag. The operand size processed on each iteration can be a byte, word, or doubleword. The source address of each string element is given by the ESI register and the destination is given by the EDI register. These instructions are as follows:

movs: Moves a string element.
cmps: Compares elements at corresponding locations in two strings.
scas: Compares a string element to the value in EAX, AX, or AL, depending on the operand size.
lods: Loads a string element into EAX, AX, or AL, depending on the operand size.
stos: Stores EAX, AX, or AL, depending on the operand size, to the address in EDI.
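
The interaction of the rep prefix with ESI, EDI, ECX, and the DF flag can be modeled for the byte-sized move as follows. This Python sketch is our own illustration. It treats memory as a bytearray and returns the final register values:

```python
def rep_movsb(memory: bytearray, esi: int, edi: int,
              ecx: int, df: int = 0) -> tuple:
    """Model rep movsb: copy ECX bytes from [ESI] to [EDI].

    With DF = 0 the addresses increment after each byte; with DF = 1
    they decrement. Returns the final (ESI, EDI, ECX) values.
    """
    step = -1 if df else 1
    while ecx != 0:
        memory[edi] = memory[esi]  # move one string element
        esi += step
        edi += step
        ecx -= 1                   # rep counts ECX down to zero
    return esi, edi, ecx
```

With DF set, the copy starts at the last byte of each string and works downward, which is how overlapping regions can be moved safely when the destination lies above the source.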

Flag manipulation

Flag manipulation instructions modify bits in the EFLAGS register. The flag manipulation instructions are as follows:

stc, clc, cmc: Sets, clears, or complements the carry flag, CF.
std, cld: Sets or clears the direction flag, DF.
sti, cli: Sets or clears the interrupt flag, IF.

Input/output

Input/output instructions read data from or write data to peripheral devices. The input/output instructions are as follows:

in, out: Moves 1, 2, or 4 bytes between EAX, AX, or AL and an I/O port, depending on the operand size.
ins, outs: Moves a data element between memory and an I/O port in the same manner as string instructions.
rep ins, rep outs: Moves blocks of data between memory and an I/O port in the same manner as string instructions.

Protected mode

The following instructions access the features of the protected mode:

sysenter, sysexit: Transfers control from ring 3 to ring 0 (sysenter) or from ring 0 to ring 3 (sysexit) in Intel processors.
syscall, sysret: Transfers control from ring 3 to ring 0 (syscall) or from ring 0 to ring 3 (sysret) in AMD processors. In x86 (32-bit) mode, AMD processors also support sysenter and sysexit.

Miscellaneous instructions

These instructions do not fit into the categories previously listed:

int: Initiates a software interrupt. The operand is the interrupt vector number.
nop: No operation.
cpuid: Provides information about the processor model and its capabilities.

Other instruction categories

The instructions listed in the preceding sections are some of the more common general-purpose instructions you will come across in x86 applications and device drivers. The x86 architecture contains a variety of additional instruction categories, including the following:

Floating-point instructions: These instructions are executed by the x87 floating-point unit.
SIMD instructions: This category includes the MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, and AVX-512 instructions.
AES instructions: These instructions support encryption and decryption using the Advanced Encryption Standard (AES).
MPX instructions: The memory protection extensions (MPX) enhance memory integrity by preventing errors such as buffer overruns.
SMX instructions: The safer mode extensions (SMX) improve system security in the presence of user trust decisions.
TSX instructions: The transactional synchronization extensions (TSX) enhance the performance of multithreaded execution using shared resources.
VMX instructions: The virtual machine extensions (VMX) support the secure and efficient execution of virtualized operating systems.

Additional processor registers are provided for use by the floating-point and SIMD instructions.

There are still other categories of x86 instructions, a few of which have been retired in later generations of the architecture.

Common instruction patterns

These are examples of instruction usage patterns you will come across frequently in compiled code. The techniques used in these examples produce the desired result while minimizing code size and the number of execution cycles required:

xor reg, reg ; Set reg to zero

test reg, reg ; Test if reg contains zero

add reg, reg ; Shift reg left by one bit
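
The arithmetic behind these three idioms can be checked with a short Python model of a 32-bit register. The function names are our own:

```python
MASK32 = 0xFFFFFFFF


def xor_self(reg: int) -> int:
    """xor reg, reg: any value XORed with itself is zero."""
    return (reg ^ reg) & MASK32


def zf_after_test_self(reg: int) -> int:
    """test reg, reg: ZF is set only when reg AND reg (reg itself) is zero."""
    return 1 if ((reg & reg) & MASK32) == 0 else 0


def add_self(reg: int) -> int:
    """add reg, reg: doubling a value equals shifting it left by one bit."""
    return (reg + reg) & MASK32
```

In each case the idiom avoids an immediate operand, which is one reason these forms tend to encode in fewer bytes than the more obvious alternatives.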

x86 instruction formats

Individual x86 instructions are of variable length, and can range in size from 1 to 15 bytes. The components of a single instruction, including any optional bytes, are laid out in memory in the following sequence:

Prefix bytes: One or more optional prefix bytes provide auxiliary opcode execution information. For example, the lock prefix performs bus locking in a multiprocessor system to enable atomic test-and-set type operations. rep and related prefixes enable string instructions to perform repeated operations on string elements in a single instruction. Other prefixes are available to provide hints for conditional branch instructions or to override the default size of an address or operand.
Opcode bytes: An x86 opcode, consisting of 1 to 3 bytes, follows any prefix bytes. For some opcodes, an additional 3 opcode bits are encoded in a ModR/M byte following the opcode.
ModR/M byte: Not all instructions require this byte. The ModR/M byte contains three information fields providing address mode and operand register information. The upper two bits of this byte (the Mod field) and the lower three bits (the R/M field) combine to form a 5-bit field with 32 possible values. Of these, 8 values identify register operands and the other 24 specify addressing modes. The remaining 3 bits (the reg/opcode field) either indicate a register or provide three additional opcode bits, depending on the instruction.
Address displacement bytes: Either 0, 1, 2, or 4 bytes provide an address displacement used in computing the operand address.
Immediate value bytes: If the instruction includes an immediate value, it is located in the last 1, 2, or 4 bytes of the instruction.
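
The three fields of the ModR/M byte can be separated with shifts and masks. The following Python sketch is our own illustration. The register table uses the standard x86 register numbering (EAX = 0 through EDI = 7), which follows the same AX, CX, DX, BX, SP, BP, SI, DI order noted earlier:

```python
REG32_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_modrm(byte: int) -> tuple:
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11   # upper two bits
    reg = (byte >> 3) & 0b111  # middle three bits (reg/opcode field)
    rm = byte & 0b111          # lower three bits
    return mod, reg, rm


def describe_register_form(byte: int) -> tuple:
    """For mod = 3 (register operand), name the reg and r/m registers."""
    mod, reg, rm = decode_modrm(byte)
    if mod != 0b11:
        raise ValueError("r/m field specifies a memory addressing mode")
    return REG32_NAMES[reg], REG32_NAMES[rm]
```

As an example, one encoding of mov eax, ecx is the two-byte sequence 8Bh C1h. The ModR/M byte C1h decodes to mod = 3, reg = 0 (EAX), and r/m = 1 (ECX).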

The variable-length nature of x86 instructions makes the process of instruction decoding quite complex. It is also challenging for debugging tools to disassemble a sequence of instructions in reverse order, perhaps to display the code leading up to a breakpoint. This difficulty arises because it is possible for a trailing subset of bytes within a lengthy instruction to form a complete, valid instruction. This complexity is a notable difference from the more regular instruction formats used in RISC architectures.

x86 assembly language

It is possible to develop programs of any level of complexity in assembly language. Most modern applications, however, are largely or entirely developed in high-level languages. Assembly language tends to be used in cases where the employment of specialized instructions is desirable, or a level of extreme optimization is required that is unachievable with an optimizing compiler.

Regardless of the language used in application development, all code must ultimately execute as processor instructions. To fully understand how code executes on a computer system, there is no substitute for examining the state of the system following the execution of each individual instruction. A good way to learn how to operate in this environment is to write some assembly code.

The x86 assembly language example in the following listing is a complete x86 application that runs in a Windows command console, printing a text string and then exiting:

.386

.model FLAT,C

.stack 400h

.code

includelib libcmt.lib

includelib legacy_stdio_definitions.lib

extern printf:near

extern exit:near

public main

main proc

; Print the message

push offset message

call printf

; Exit the program with status 0

push 0

call exit

main endp

.data

message db "Hello, Computer Architect!",0

end

A description of the contents of the assembly language file follows:

The .386 directive indicates that the instructions in this file should be interpreted as applying to 80386 and later-generation processors.
The .model FLAT,C directive specifies a 32-bit flat memory model and the use of C language function calling conventions.
The .stack 400h directive specifies a stack size of 400h (1,024) bytes.
The .code directive indicates the start of executable code.
The includelib and extern directives reference system-provided libraries and functions within them to be used by the program.
The public directive indicates the function name, main, is an externally visible symbol.
The lines between main proc and main endp are the assembly language instructions making up the main function.
The .data directive indicates the start of data memory. The message db statement defines the message string as a sequence of bytes, followed by a zero byte.
The end directive marks the end of the program.

This file, named hello_x86.asm, is assembled and linked to form the executable hello_x86.exe program with the following command, which runs the Microsoft Macro Assembler:

ml /Fl /Zi /Zd hello_x86.asm

The components of this command are as follows:

ml runs the assembler.
/Fl creates a listing file.
/Zi includes symbolic debugging information in the executable file.
/Zd includes line number debugging information in the executable file.
hello_x86.asm is the name of the assembly language source file.

This is a portion of the hello_x86.lst listing file generated by the assembler:

.386

.model FLAT,C

.stack 400h

00000000 .code

includelib libcmt.lib

includelib legacy_stdio_definitions.lib

extern printf:near

extern exit:near

public main

00000000 main proc

; Print the message

00000000 68 00000000 R push offset message

00000005 E8 00000000 E call printf

; Exit the program with status 0

0000000A 6A 00 push 0

0000000C E8 00000000 E call exit

00000011 main endp

00000000 .data

00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0

2C 20 43 6F 6D

70 75 74 65 72

20 41 72 63 68

69 74 65 63 74

21 00

The left column of the preceding listing displays address offsets measured from the start of each segment, which is why both the .code and .data segments begin at 00000000. On lines containing instructions, the machine code bytes follow the address offset. Address references in the code (for example, offset message) are displayed as 00000000 in the listing because these values are determined during linking, not during assembly, which is when this listing is generated.
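To make the role of those zero placeholders concrete, the following Python sketch computes the bytes that would replace them in a near call. It relies on the fact that the E8 form of call stores a 32-bit displacement relative to the address of the instruction that follows the call. The function name and the target address 0x100 are assumptions made for illustration; they are not the real location of printf in the linked program.

```python
import struct

CALL_LENGTH = 5  # E8 opcode byte plus a 4-byte relative displacement


def encode_near_call(call_address: int, target_address: int) -> bytes:
    """Return the 5 bytes of an x86 near call (opcode E8 + rel32)."""
    # The displacement is measured from the instruction after the call.
    displacement = target_address - (call_address + CALL_LENGTH)
    return b"\xE8" + struct.pack("<i", displacement)


# 'call printf' sits at offset 00000005 in the listing.
# The target 0x100 is a hypothetical address chosen for this example.
patched = encode_near_call(0x05, 0x100)
print(patched.hex(" "))
```

A target located before the call produces a negative displacement, which is stored in two's complement form.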

This is the output displayed when running this program:

C:>hello_x86.exe

Hello, Computer Architect!

Next, we will look at the extension of the 32-bit x86 architecture to the 64-bit x64 architecture.

x64 architecture and instruction set

The original specification for a processor architecture extending the x86 processor and instruction set to 64 bits, named AMD64, was introduced by AMD in 2000. The first AMD64 processor, the Opteron, was released in 2003. Intel found itself following AMD's lead and developed an AMD64-compatible architecture, eventually given the name Intel 64. The first Intel processor that implemented its 64-bit architecture was the Xeon, introduced in 2004. The architecture shared by AMD and Intel came to be called x86-64, reflecting the evolution of x86 to 64 bits, and in popular usage, this term has been shortened to x64.

The first Linux version supporting the x64 architecture was released in 2001, well before the first x64 processors were even available. Windows began supporting the x64 architecture in 2005.

Processors implementing the AMD64 and Intel 64 architectures are largely compatible at the instruction level of user mode programs. There are a few differences between the architectures, the most significant of which is the difference in support of the sysenter/sysexit Intel instructions and the syscall/sysret AMD instructions we saw earlier. In general, operating systems and programming language compilers manage these differences, making them rarely an issue of concern to software and system developers. Developers of kernel software, drivers, and assembly code must take these differences into account.

The principal features of the x64 architecture are as follows:

x64 is a mostly compatible 64-bit extension of the 32-bit x86 architecture. Most software, particularly user mode applications, written for the 32-bit environment should execute without modification in a processor running in 64-bit mode. 64-bit mode is also referred to as long mode.
The eight 32-bit general-purpose registers of x86 are extended to 64 bits in x64. The register name prefix R indicates 64-bit registers. For example, in x64, the extended x86 EAX register is called RAX. The x86 register subcomponents EAX, AX, AH, and AL continue to be available in x64.
The instruction pointer, RIP, is now 64 bits. The flags register, RFLAGS, also extends to 64 bits, though the upper 32 bits are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS in the x86 architecture.
Eight 64-bit general-purpose registers have been added, named R8 through R15.
64-bit integers are supported as a native data type.
x64 processors retain the option of running in x86 compatibility mode. This mode enables the use of 32-bit operating systems and allows any application built for x86 to run on x64 processors. In 32-bit compatibility mode, the 64-bit extensions are unavailable.

Virtual addresses in the x64 architecture are 64 bits wide, supporting an address space of 16 exabytes (EB), equivalent to 2^64 bytes. Current processors from AMD and Intel, however, support only 48 bits of virtual address space. This restriction reduces processor hardware complexity while still supporting up to 256 terabytes (TB) of virtual address space. Current-generation processors also support a maximum of 48 bits of physical address space. This permits a processor to address 256 TB of physical RAM, though modern motherboards do not support the quantity of DRAM devices such a system would require.
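The sizes quoted above follow directly from the address widths. This short Python sketch reproduces the arithmetic, using the binary interpretation of terabyte and exabyte (powers of 1,024) that the figures in the text imply. The function and constant names are illustrative only.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given width."""
    return 2 ** address_bits


TB = 2 ** 40  # one terabyte, in the binary sense used by the text
EB = 2 ** 60  # one exabyte, in the same binary sense

full_space = address_space_bytes(64)         # architectural address width
implemented_space = address_space_bytes(48)  # bits implemented in hardware

print(full_space // EB, "EB")
print(implemented_space // TB, "TB")
```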

The x64 register set

In the x64 architecture, the extension of x86 register lengths to 64 bits and the addition of registers R8 through R15 results in the register map shown in the following diagram:

Figure 10.3: x64 registers

The x86 registers described in the preceding section, and present in x64, appear in a darker shade. The x86 registers have the same names and are the same sizes when operating in 64-bit mode. The 64-bit extended versions of the x86 registers have names starting with the letter R. The new 64-bit registers (R8 through R15) can be accessed in smaller widths using the appropriate suffix letter:

Suffix D accesses the lower 32 bits of the register: R11D.
Suffix W accesses the lower 16 bits of the register: R11W.
Suffix B accesses the lower 8 bits of the register: R11B.

Unlike the x86 registers, the new registers in the x64 architecture are truly general purpose and do not perform any special functions at the processor instruction level.
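The D, W, and B suffixes described above each select a low-order slice of the same 64-bit register. As a rough model, the following Python sketch masks a 64-bit value to the width each suffix implies. The register contents and the helper name are hypothetical.

```python
# Width in bits selected by each register-name suffix ("" is the full register).
SUFFIX_WIDTHS = {"": 64, "D": 32, "W": 16, "B": 8}


def read_subregister(value_64: int, suffix: str) -> int:
    """Return the low-order bits selected by a register-name suffix."""
    width = SUFFIX_WIDTHS[suffix.upper()]
    return value_64 & ((1 << width) - 1)


r11 = 0x1122334455667788  # hypothetical contents of R11
for suffix in ("", "D", "W", "B"):
    print(f"R11{suffix}: {read_subregister(r11, suffix):#x}")
```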

x64 instruction categories and formats

The x64 architecture implements essentially the same instruction set as x86, with 64-bit extensions. When operating in 64-bit mode, the x64 architecture uses a default address size of 64 bits and a default operand size of 32 bits. A new opcode prefix byte, rex, is provided to specify the use of 64-bit operands.

The format of x64 instructions in memory matches that of the x86 architecture, with some exceptions that, for our purposes, are minor. The addition of support for the rex prefix byte is the most significant variation from the x86 instruction format. Address displacements and immediate values within some instructions can be 64 bits wide, in addition to all the bit widths supported in x86.
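For readers who want to see what the rex prefix looks like as a byte, here is a minimal Python sketch. It assumes the layout documented in the Intel and AMD manuals rather than anything stated in this chapter: a fixed high nibble of 0100 followed by four flag bits named W, R, X, and B. W selects a 64-bit operand size, while R, X, and B supply the extra bit needed to address registers R8 through R15. The function name is illustrative.

```python
def rex_prefix(w: bool = False, r: bool = False,
               x: bool = False, b: bool = False) -> int:
    """Build a REX prefix byte: fixed high nibble 0100, then W, R, X, B."""
    return 0x40 | (int(w) << 3) | (int(r) << 2) | (int(x) << 1) | int(b)


# REX with only W set requests a 64-bit operand size.
print(f"{rex_prefix(w=True):02X}")
```

The 48 byte that begins the sub rsp, 40 instruction in the hello_x64 listing later in this section is a rex prefix with only the W bit set.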

Although it is possible to define instructions longer than 15 bytes, the processor instruction decoder will raise a general protection fault if an attempt is made to decode an instruction longer than 15 bytes.

x64 assembly language

The x64 assembly language source file for the hello program is similar to the x86 version of this code, with some notable differences:

There is no directive specifying a memory model because there is a single x64 memory model.
The Windows x64 application programming interface uses a calling convention that stores the first four arguments to a called function in the RCX, RDX, R8, and R9 registers, in that order. This differs from the default x86 calling convention, which pushes parameters onto the stack. Both of the library functions this program calls (printf and exit) take a single argument, passed in RCX.
The calling convention requires the caller of a function to allocate stack space to hold at least the number of arguments passed to the called function, with a minimum reservation of space for four arguments, even if fewer are being passed. Because the stack grows downward in memory, this requires a subtraction from the stack pointer. The sub rsp, 40 instruction performs this stack allocation: 32 bytes provide the four 8-byte argument slots, and the remaining 8 bytes restore the 16-byte stack alignment that the calling convention requires at the point of each call. Normally, after the called function returns, it would be necessary to adjust the stack pointer to remove this allocation. Our program calls the exit function, terminating program execution, which makes this step unnecessary.
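The value 40 in sub rsp, 40 can be derived rather than memorized. The Python sketch below works through the arithmetic under two assumptions drawn from the Windows x64 calling convention: the stack must be 16-byte aligned at each call instruction, and the call that entered the current function has already pushed an 8-byte return address. It also assumes the function pushes nothing else onto the stack. The names are illustrative.

```python
SLOT = 8            # bytes per argument slot
MIN_SLOTS = 4       # slots always reserved for the register arguments
ALIGNMENT = 16      # required stack alignment at a call instruction
RETURN_ADDRESS = 8  # pushed by the call that entered this function


def stack_reservation(max_args: int) -> int:
    """Bytes to subtract from RSP before calling functions with max_args arguments."""
    size = max(max_args, MIN_SLOTS) * SLOT
    # Pad until return address + reservation is a multiple of 16.
    while (size + RETURN_ADDRESS) % ALIGNMENT != 0:
        size += SLOT
    return size


# printf and exit each take one argument in this program.
print(stack_reservation(1))
```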

The code for the 64-bit version of the hello program is as follows:

.code

includelib libcmt.lib

includelib legacy_stdio_definitions.lib

extern printf:near

extern exit:near

public main

main proc

; Reserve stack space

sub rsp, 40

; Print the message

lea rcx, message

call printf

; Exit the program with status 0

xor rcx, rcx

call exit

main endp

.data

message db "Hello, Computer Architect!",0

end

This file, named hello_x64.asm, is assembled and linked to form the executable hello_x64.exe program with the following call to the Microsoft Macro Assembler (x64):

ml64 /Fl /Zi /Zd hello_x64.asm

The components of this command are:

ml64 runs the 64-bit assembler.
/Fl creates a listing file.
/Zi includes symbolic debugging information in the executable file.
/Zd includes line number debugging information in the executable file.
hello_x64.asm is the name of the assembly language source file.

This is a portion of the hello_x64.lst listing file generated by the assembler command:

00000000 .code

includelib libcmt.lib

includelib legacy_stdio_definitions.lib

extern printf:near

extern exit:near

public main

00000000 main proc

; Reserve stack space

00000000 48/ 83 EC 28 sub rsp, 40

; Print the message

00000004 48/ 8D 0D lea rcx, message

00000000 R

0000000B E8 00000000 E call printf

; Exit the program with status 0

00000010 48/ 33 C9 xor rcx, rcx

00000013 E8 00000000 E call exit

00000018 main endp

00000000 .data

00000000 48 65 6C 6C 6F message db "Hello, Computer Architect!",0

2C 20 43 6F 6D

70 75 74 65 72

20 41 72 63 68

69 74 65 63 74

21 00

The output of running this program is as follows:

C:>hello_x64.exe

Hello, Computer Architect!

This completes our brief introduction to the x86 and x64 architectures. There is a great deal more to be learned, and indeed the Intel 64 and IA-32 Architectures Software Developer's Manual, Volumes 1 through 4, contain nearly 5,000 pages of detailed documentation on these architectures. We have clearly just scratched the surface in this chapter.

Next, we will take a similarly top-level tour of the ARM 32-bit and 64-bit architectures.

32-bit ARM architecture and instruction set

The ARM architectures define a family of RISC processors suitable for use in a wide variety of applications. Processors based on ARM architectures are preferred in designs where a combination of high performance, low power consumption, and small physical size is needed.

ARM Holdings, a British semiconductor and software company, developed the ARM architectures and licenses them to other companies who implement processors in silicon. Many applications of the ARM architectures are system-on-chip (SoC) designs combining a processor with specialized hardware to support functions such as cellular radio communications in smartphones.

ARM processors are used across a broad spectrum of applications, from tiny battery-powered devices to supercomputers. ARM processors serve as embedded processors in safety-critical systems, such as automotive anti-lock brakes, and as general-purpose processors in smart watches, portable phones, tablets, laptop computers, desktop computers, and servers. By 2017, over 100 billion ARM processors had been manufactured.

ARM processors are true RISC systems, with a large set of general-purpose registers and single-cycle execution of most instructions. Standard ARM instructions have a fixed width of 32 bits, though a separate instruction set named T32 (formerly called Thumb) is available for applications where memory is at a premium. The T32 instruction set consists of 16- and 32-bit wide instructions. Current-generation ARM processors support both the ARM and T32 instruction sets, and can switch between instruction sets on the fly. Most operating systems and applications prefer the use of the T32 instruction set over the ARM set because code density is improved.

ARM is a load/store architecture, requiring data to be loaded from memory to a register before any processing such as an ALU operation can take place upon it. A separate instruction then stores the result back to memory. While this might seem like a step back from the x86 and x64 architectures, which operate directly on operands in memory in a single instruction, in practice, the load/store approach permits several sequential operations to be performed at high speed on an operand once it has been loaded into one of the many available registers.

ARM processors are bi-endian. A configuration setting is available to select little-endian or big-endian byte order for multi-byte values. The default setting is little-endian, which is the configuration commonly used by operating systems.

The ARM architecture natively supports these data types:

Byte: 8 bits
Halfword: 16 bits
Word: 32 bits
Doubleword: 64 bits
What's in a word?
There is a confusing difference between the data type names of the ARM architecture and those of the x86 and x64 architectures: in x86 and x64, a word is 16 bits. In ARM, a word is 32 bits.

ARM processors support eight distinct processor modes, each with an associated execution privilege level. These modes, and their abbreviations, are as follows:

User (USR)
Supervisor (SVC)
Fast interrupt request (FIQ)
Interrupt request (IRQ)
Monitor (MON)
Abort (ABT)
Undefined (UND)
System (SYS)

For the purposes of operating systems and user applications, the most important privilege levels are USR and SVC. The two interrupt request modes, FIQ and IRQ, are used by device drivers for processing interrupts.

In most operating systems running on ARM, including Windows and Linux, kernel mode runs in ARM SVC mode, equivalent to Ring 0 on x86/x64. ARM USR mode is equivalent to Ring 3 on x86/x64. Applications running in Linux on ARM processors use software interrupts to request kernel services, which involves a transition from USR mode to SVC mode.

The ARM architecture supports system capabilities beyond those of the core processor via the concept of coprocessors. Up to 16 coprocessors can be implemented in a system, with predefined functions assigned to four of them. Coprocessor 15 implements the MMU and other system functions. If present, coprocessor 15 must support the instruction opcodes, register set, and behaviors specified for the MMU. Coprocessors 10 and 11 combine to provide floating-point functionality in processors equipped with that feature. Coprocessor 14 provides debugging functions.

The ARM architectures have evolved through several versions over the years. The architectural variant currently in common use is ARMv8-A. ARMv8-A supports 32-bit and 64-bit operating systems and applications. 32-bit applications can run under a 64-bit ARMv8-A operating system.

Virtually all high-end smartphones and portable electronic devices produced since 2016 are designed around processors or SoCs based on the ARMv8-A architecture. The description that follows will focus on ARMv8-A 32-bit mode. We will look at the differences in ARMv8-A 64-bit mode later in this chapter.

The ARM register set

In USR mode, the ARM architecture has 16 general-purpose 32-bit registers named R0 through R15. The first 13 registers are truly general-purpose, while the last three have the following defined functions:

R13 is the stack pointer, also named SP in assembly code. This register points to the top of the stack.
R14 is the link register, also named LR. This register holds the return address while in a called function. The use of a link register differs from x86/x64, which pushes the return address onto the stack. A register is used to hold the return address because it is significantly faster to resume execution at the address in LR at the end of a function than it is to pop the return address from the stack and resume execution at that address.
R15 is the program counter, also named PC. Due to pipelining, the value contained in PC is usually two instructions ahead of the currently executing instruction. Unlike x86/x64, it is possible for user code to directly read and write the PC register. Writing an address to PC causes execution to immediately jump to the newly written address.

The current program status register (CPSR) contains status and mode control bits, similar to EFLAGS/RFLAGS in the x86/x64 architectures.

Table 10.3: Selected CPSR bits

CPSR bits not listed in the preceding table are either reserved or represent functions not discussed in this chapter.

By default, most instructions do not affect the flags. The S suffix must be used with, for example, an addition instruction (adds) to cause the result to affect the flags. Comparison instructions are the exception to this rule; they update the flags automatically.

ARM addressing modes

In true RISC fashion, the only ARM instructions that can access system memory are those that perform register loads and stores. The ldr instruction loads a register from memory, while str stores a register to memory. A separate instruction, mov, transfers the contents of one register to another or moves an immediate value into a register.

When computing the target address for a load or store operation, ARM starts with a base address provided in a register and adds an increment to arrive at the target memory address. There are three techniques for determining the increment that will be added to the base register in register load and store instructions:

Offset: A signed constant is added to the base register. The offset is stored as part of the instruction. For example, ldr r0, [r1, #10] loads r0 with the word at the address r1+10. As shown in the following examples, pre- or post-indexing can optionally update the base register to the target address before or after the memory location is accessed.
Register: An unsigned increment stored in a register can be added to or subtracted from the value in a base register. For example, ldr r0, [r1, r2] loads r0 with the word at the address r1+r2. Either of the registers can be thought of as the base register.
Scaled register: An increment in a register is shifted left or right by a specified number of bit positions before being added to or subtracted from the base register value. For example, ldr r0, [r1, r2, lsl #3] loads r0 with the word at the address r1+(r2×8). The shift can be a logical left or right shift, lsl or lsr, inserting zero bits in the vacated bit positions, or an arithmetic right shift, asr, that replicates the sign bit in the vacated positions.
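The three increment techniques in the preceding list reduce to simple 32-bit arithmetic. The following Python sketch models the address calculation for each one. It is a simplified illustration, not a description of the hardware; the function names and sample register values are assumptions.

```python
MASK32 = 0xFFFFFFFF  # addresses wrap at 32 bits


def offset_address(base: int, offset: int) -> int:
    """[rn, #offset]: add a signed constant to the base register."""
    return (base + offset) & MASK32


def register_address(base: int, index: int, subtract: bool = False) -> int:
    """[rn, rm] or [rn, -rm]: add or subtract an index register."""
    return (base - index if subtract else base + index) & MASK32


def scaled_register_address(base: int, index: int, shift: str, amount: int) -> int:
    """[rn, rm, <shift> #amount]: shift the index before adding it."""
    index &= MASK32
    if shift == "lsl":
        scaled = (index << amount) & MASK32
    elif shift == "lsr":
        scaled = index >> amount          # zero bits fill the vacated positions
    elif shift == "asr":
        signed = index - (1 << 32) if index & 0x80000000 else index
        scaled = (signed >> amount) & MASK32  # sign bit is replicated
    else:
        raise ValueError(f"unsupported shift: {shift}")
    return (base + scaled) & MASK32


r1, r2 = 0x1000, 4  # hypothetical register contents
print(hex(offset_address(r1, 10)))                     # ldr r0, [r1, #10]
print(hex(register_address(r1, r2)))                   # ldr r0, [r1, r2]
print(hex(scaled_register_address(r1, r2, "lsl", 3)))  # ldr r0, [r1, r2, lsl #3]
```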

The addressing modes available for specifying source and destination operands in ARM instructions are presented in the following sections.

Immediate

An immediate value is provided as part of the instruction. The possible immediate values consist of an 8-bit value, coded in the instruction, rotated through an even number of bit positions. An arbitrary 32-bit value cannot be specified this way because the instruction itself is, at most, 32 bits wide; to load such a value into a register, the ldr instruction must be used instead to load the value from memory. The following mov instructions use immediate values that can be encoded directly:

mov r0, #10 // Load the 32-bit value 10 decimal into r0

mov r0, #0xFF000000 // Load the 32-bit value FF000000h into r0

The second example contains the 8-bit value FFh in the instruction opcode. During execution, it is rotated left by 24 bit positions into the most significant 8 bits of the word.
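Whether a particular constant can be used as an immediate operand can be checked mechanically. The Python sketch below searches for an 8-bit value and an even right-rotation that reproduce a given 32-bit constant, following the encoding rule described in this section for the 32-bit ARM instruction set (the T32 instruction set uses a different scheme for its immediates). The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotate_left(value: int, amount: int) -> int:
    """Rotate a 32-bit value left by amount bit positions."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32


def encode_immediate(value: int):
    """Return (imm8, rotation) so that imm8 rotated right by rotation equals value.

    Returns None when no 8-bit value and even rotation can produce value.
    """
    value &= MASK32
    for rotation in range(0, 32, 2):
        # Rotating left undoes the right rotation the processor will apply.
        imm8 = rotate_left(value, rotation)
        if imm8 <= 0xFF:
            return imm8, rotation
    return None


print(encode_immediate(10))
print(encode_immediate(0xFF000000))
print(encode_immediate(0x12345678))
```

For 0xFF000000, the search finds the 8-bit value FFh with a right rotation of 8 bit positions, which produces the same result as the left rotation by 24 described above. A constant such as 0x12345678 has no such encoding and must be loaded from memory with ldr.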

Register direct

This mode copies one register to another:

mov r0, r1 // Copy r1 to r0

mvn r0, r1 // Copy NOT(r1) to r0

Register indirect

The address of the operand is provided in a register. The register containing the address is surrounded by square brackets:

ldr r0, [r1] // Load the 32-bit value at the address given in r1 to r0

str r0, [r3] // Store r0 to the address in r3

Unlike most instructions, str uses the first operand as the source and the second as the destination.

Register indirect with offset

The address of the operand is computed by adding an offset to the base register:

ldr r0, [r1, #32] // Load r0 with the value at the address [r1+32]

str r0, [r1, #4] // Store r0 to the address [r1+4]

Register indirect with offset, pre-incremented

The address of the operand is determined by adding an offset to the base register. The base register is updated to the computed address, and this updated address is used for the load or store:

ldr r0, [r1, #32]! // Load r0 with [r1+32] and update r1 to (r1+32)

str r0, [r1, #4]! // Store r0 to [r1+4] and update r1 to (r1+4)

Register indirect with offset, post-incremented

The base address is first used to access the memory location. The base register is then updated to the computed address:

ldr r0, [r1], #32 // Load [r1] to r0, then update r1 to (r1+32)

str r0, [r1], #4 // Store r0 to [r1], then update r1 to (r1+4)
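The difference between the plain offset, pre-incremented, and post-incremented forms is easiest to see by tracking the base register. The following Python sketch models the three ldr variants with a dictionary standing in for the register file and another for memory. The helper names and the memory contents are hypothetical.

```python
def ldr_offset(regs, memory, rd, rn, offset):
    """ldr rd, [rn, #offset]: load from rn+offset; rn is unchanged."""
    regs[rd] = memory[regs[rn] + offset]


def ldr_pre_indexed(regs, memory, rd, rn, offset):
    """ldr rd, [rn, #offset]!: update rn first, then load from the new rn."""
    regs[rn] += offset
    regs[rd] = memory[regs[rn]]


def ldr_post_indexed(regs, memory, rd, rn, offset):
    """ldr rd, [rn], #offset: load from rn, then update rn."""
    regs[rd] = memory[regs[rn]]
    regs[rn] += offset


memory = {0x1000: 111, 0x1020: 222}  # hypothetical memory words

for name, operation in [("offset", ldr_offset),
                        ("pre-indexed", ldr_pre_indexed),
                        ("post-indexed", ldr_post_indexed)]:
    regs = {"r0": 0, "r1": 0x1000}
    operation(regs, memory, "r0", "r1", 32)
    print(f"{name}: r0={regs['r0']} r1={regs['r1']:#x}")
```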

Double register indirect

The address of the operand is the sum of a base register and an increment register. The register names are surrounded by square brackets:

ldr r0, [r1, r2] // Load r0 with [r1+r2]

str r0, [r1, r2] // Store r0 to [r1+r2]

Double register indirect with scaling

The address of the operand is the sum of a base register and an increment register shifted left or right by a number of bits. The register names and the shift information are surrounded by square brackets:

ldr r0, [r1, r2, lsl #5] // Load r0 with [r1+(r2*32)]

str r0, [r1, r2, lsr #2] // Store r0 to [r1+(r2/4)]

The next section introduces the general categories of ARM instructions.

ARM instruction categories

The instructions described in this section are from the T32 instruction set.

Load/store

These instructions move data between registers and memory:

ldr, str: Copies an 8-bit (suffix b for byte), 16-bit (suffix h for halfword), or 32-bit value between a register and a memory location. ldr copies the value from memory to a register, while str copies a register to memory. ldrb copies one byte into the lower 8 bits of a register.
ldm, stm: Loads or stores multiple registers. Copies 1 to 16 registers to or from memory. Any subset of registers can be loaded from or stored to a contiguous region of memory.

Stack manipulation

These instructions store data to and retrieve data from the stack.

push, pop: Pushes or pops any subset of the registers to or from the stack; for example, push {r0, r2, r3-r5}. These instructions are variants of the stm and ldm instructions, respectively.

Register movement

These instructions transfer data between registers.

mov, mvn: Moves a register (mov), or its bit-inversion (mvn), to the destination register.

Arithmetic and logic

These instructions mostly have one destination register and two source operands. The first source operand is a register, while the second can be a register, a shifted register, or an immediate value.

Including the s suffix causes these instructions to set the condition flags. For example, adds performs addition and sets the condition flags.

add, sub: Adds or subtracts two numbers. For example, add r0, r1, r2, lsl #3 is equivalent to the expression r0 = r1 + (r2 × 2^3). The lsl operator performs a logical shift left of the second operand, r2.
adc, sbc: Adds or subtracts two numbers with carry or borrow.
neg: Negates a number.
and, orr, eor: Performs logical AND, OR, or XOR.
orn, eon: Performs logical OR or XOR between the first operand and the bitwise-inverted second operand.
bic: Clears selected bits in a register.
mul: Multiplies two numbers.
mla: Multiplies two numbers and accumulates the result. This instruction has an additional operand to specify the accumulator register.
sdiv, udiv: Signed and unsigned division, respectively.

Comparisons

These instructions compare two values and set the condition flags based on the result of the comparison. The s suffix is not needed with these instructions to set the condition codes.

cmp: Subtracts two numbers, discards the result, and sets the condition flags. This is equivalent to a subs instruction, except the result is discarded.
cmn: Adds two numbers, discards the result, and sets the condition flags. This is equivalent to an adds instruction, except the result is discarded.
tst: Performs a bitwise AND, discards the result, and sets the condition flags. This is equivalent to an ands instruction, except the result is discarded.
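As an illustration of what "sets the condition flags" means for cmp, the following Python sketch computes the N, Z, C, and V flags for a 32-bit subtraction. It assumes the standard ARM convention that the C flag is set when the subtraction does not require a borrow. The function name is illustrative.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a: int, b: int):
    """Return (N, Z, C, V) as set by 'cmp a, b' on 32-bit values."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    n = result >> 31                     # sign bit of the result
    z = int(result == 0)                 # result is zero
    c = int(a >= b)                      # no borrow was needed (unsigned)
    # Signed overflow: operands differ in sign and the result's sign
    # differs from the sign of the first operand.
    v = ((a ^ b) & (a ^ result)) >> 31
    return n, z, c, v


print(cmp_flags(5, 10))
print(cmp_flags(10, 10))
```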

Control flow

These instructions transfer control conditionally or unconditionally to an address.

b: Performs an unconditional branch to the target address.
bcc: Branches if the condition specified by cc is true, where cc is one of these condition codes: eq (equal), ne (not equal), gt (greater than), lt (less than), ge (greater or equal), le (less or equal), cs (carry set), cc (carry clear), mi (minus: N flag = 1), pl (plus: N flag = 0), vs (V flag set), vc (V flag clear), hi (higher: C flag set and Z flag clear), ls (lower or same: C flag clear or Z flag set).
bl: Branches to the specified address and stores the address of the next instruction in the link register (r14, also called lr). The called function returns to the calling code with the mov pc, lr instruction.
bx: Branches and selects the instruction set. If bit 0 of the target address is 1, T32 mode is entered. If bit 0 is clear, ARM mode is entered. Bit 0 of instruction addresses must always be zero due to ARM's address alignment requirements. This frees bit 0 to select the instruction set.
blx: Branches with link and selects the instruction set. This instruction combines the functions of the bl and bx instructions.
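Each condition code used by the conditional branch is a test on the N, Z, C, and V flags. The Python sketch below spells out those tests as a lookup table. The flag expressions for gt, lt, ge, and le are not given in this chapter; they are the standard ARM definitions for signed comparisons and are included here as an addition. The table and function names are illustrative.

```python
# Each entry maps a condition code to a test on the N, Z, C, and V flags.
CONDITIONS = {
    "eq": lambda n, z, c, v: z == 1,
    "ne": lambda n, z, c, v: z == 0,
    "cs": lambda n, z, c, v: c == 1,
    "cc": lambda n, z, c, v: c == 0,
    "mi": lambda n, z, c, v: n == 1,
    "pl": lambda n, z, c, v: n == 0,
    "vs": lambda n, z, c, v: v == 1,
    "vc": lambda n, z, c, v: v == 0,
    "hi": lambda n, z, c, v: c == 1 and z == 0,  # unsigned higher
    "ls": lambda n, z, c, v: c == 0 or z == 1,   # unsigned lower or same
    "ge": lambda n, z, c, v: n == v,             # signed >=
    "lt": lambda n, z, c, v: n != v,             # signed <
    "gt": lambda n, z, c, v: z == 0 and n == v,  # signed >
    "le": lambda n, z, c, v: z == 1 or n != v,   # signed <=
}


def condition_passed(cc: str, n: int, z: int, c: int, v: int) -> bool:
    """Return True if a branch with condition cc would be taken."""
    return CONDITIONS[cc](n, z, c, v)


# Comparing 5 with 10 computes 5 - 10, leaving N=1, Z=0, C=0, V=0.
taken = [cc for cc in CONDITIONS if condition_passed(cc, 1, 0, 0, 0)]
print(taken)
```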

Supervisor mode

This instruction allows user mode code to initiate a call to supervisor mode:

svc (Supervisor call): Initiates a software interrupt that causes the supervisor mode exception handler to process a system service request.

Miscellaneous

This instruction does not fit into the categories listed:

bkpt (Trigger a breakpoint): This instruction takes a 16-bit operand for use by debugging software to identify the breakpoint.

Conditional execution

Many ARM instructions support conditional execution, which uses the same condition codes as the branch instructions to determine whether individual instructions are executed. If an instruction's condition evaluates false, the instruction is processed as a no-op. The condition code is appended to the instruction mnemonic. This technique is formally known as predication.

For example, this function converts a nibble (the lower 4 bits of a byte) into an ASCII character version of the nibble:

// Convert the low 4 bits of r0 to an ascii character in r0

nibble2ascii:

and r0, #0xF

cmp r0, #10

addpl r0, r0, #('A' - 10)

addmi r0, r0, #'0'

mov pc, lr

The cmp instruction subtracts 10 from the nibble in r0 and sets the N flag if r0 is less than 10. Otherwise, the N flag is clear, indicating the value in r0 is 10 or greater.

If N is clear, the addpl instruction executes (pl means "plus," as in "not negative"), and the addmi instruction does not execute. If N is set, the addpl instruction does not execute and the addmi instruction executes. After this sequence completes, r0 contains a character in the range '0'-'9' or 'A'-'F'.

The use of conditional instruction execution keeps the instruction pipeline flowing by avoiding branches.
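The behavior of the nibble2ascii routine can be checked with a line-by-line model. The Python sketch below mirrors each assembly instruction, with the N flag represented as a Boolean. It is a model of the logic only and says nothing about pipeline timing.

```python
def nibble2ascii(r0: int) -> str:
    """Model of the nibble2ascii routine; returns the resulting character."""
    r0 &= 0xF                  # and r0, #0xF
    n_flag = (r0 - 10) < 0     # cmp r0, #10 sets N when r0 is less than 10
    if not n_flag:             # addpl executes only when N is clear
        r0 += ord("A") - 10
    if n_flag:                 # addmi executes only when N is set
        r0 += ord("0")
    return chr(r0)             # mov pc, lr returns with the result in r0


print("".join(nibble2ascii(value) for value in range(16)))
```

Because neither add instruction uses the s suffix, the flags set by cmp are still intact when addmi is evaluated, which is why the two tests in the model can share one flag value.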

Other instruction categories

ARM processors optionally support a range of SIMD and floating-point instructions. Other instructions are provided that are generally only used during system configuration.

ARM assembly language

The ARM assembly example in this section uses the syntax of the GNU Assembler, provided with the Android Studio integrated development environment (IDE). Other assemblers may use a different syntax. As with the Intel syntax for the x86 and x64 assembly languages, the operand order for most instructions is destination first, then source.

The ARM assembly language source file for the hello program is as follows:

.text

.global _start

_start:

// Print the message to file 1 (stdout) with syscall 4

mov r0, #1

ldr r1, =msg

mov r2, #msg_len

mov r7, #4

svc 0

// Exit the program with syscall 1, returning status 0

mov r0, #0

mov r7, #1

svc 0

.data

msg:

.ascii "Hello, Computer Architect!"

msg_len = . - msg

This file, named hello_arm.s, is assembled and linked to form the executable program hello_arm with the following commands. These commands use the development tools provided with the Android Studio Native Development Kit (NDK). The commands assume the Windows PATH environment variable has been set to include the NDK tools directory:

arm-linux-androideabi-as -al=hello_arm.lst -o hello_arm.o hello_arm.s

arm-linux-androideabi-ld -o hello_arm hello_arm.o

The components of these commands are as follows:

arm-linux-androideabi-as runs the assembler.
-al=hello_arm.lst creates a listing file named hello_arm.lst.
-o hello_arm.o creates an object file named hello_arm.o.
hello_arm.s is the name of the assembly language source file.
arm-linux-androideabi-ld runs the linker.
-o hello_arm creates an executable file named hello_arm.
hello_arm.o is the name of the object file provided as input to the linker.

This is a portion of the hello_arm.lst listing file generated by the assembler command:

1 .text

2 .global _start

4 _start:

5 // Print the message to file 1 (stdout) with syscall 4

6 0000 0100A0E3 mov r0, #1

7 0004 14109FE5 ldr r1, =msg

8 0008 1A20A0E3 mov r2, #msg_len

9 000c 0470A0E3 mov r7, #4

10 0010 000000EF svc 0

12 // Exit the program with syscall 1, returning status 0

13 0014 0000A0E3 mov r0, #0

14 0018 0170A0E3 mov r7, #1

15 001c 000000EF svc 0

17 .data

18 msg:

19 0000 48656C6C .ascii "Hello, Computer Architect!"

19 6F2C2043

19 6F6D7075

19 74657220

19 41726368

20 msg_len = . - msg
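The machine code column of this listing can be decoded by hand. The 8-digit values appear in memory byte order, so in the default little-endian configuration the bytes must be reversed to obtain the 32-bit instruction word. The Python sketch below decodes the mov instructions with immediate operands, assuming the standard 32-bit ARM data-processing layout: condition code in bits 31-28, an immediate flag in bit 25, the opcode in bits 24-21, the destination register in bits 15-12, a 4-bit rotation field in bits 11-8, and the 8-bit immediate in bits 7-0. The function name is illustrative.

```python
MOV_OPCODE = 0b1101  # data-processing opcode for mov


def decode_mov_immediate(listing_hex: str):
    """Decode a 'mov rd, #imm' word printed in memory byte order.

    Returns (condition, destination register number, immediate value).
    """
    word = int.from_bytes(bytes.fromhex(listing_hex), "little")
    is_data_processing = (word >> 26) & 0b11 == 0
    is_immediate = (word >> 25) & 1 == 1
    opcode = (word >> 21) & 0xF
    if not (is_data_processing and is_immediate and opcode == MOV_OPCODE):
        raise ValueError("not a mov instruction with an immediate operand")
    condition = word >> 28
    rd = (word >> 12) & 0xF
    rotation = ((word >> 8) & 0xF) * 2  # field holds half the rotation
    imm8 = word & 0xFF
    value = ((imm8 >> rotation) | (imm8 << (32 - rotation))) & 0xFFFFFFFF
    return condition, rd, value


for text in ("0100A0E3", "1A20A0E3", "0470A0E3"):
    condition, rd, value = decode_mov_immediate(text)
    print(f"{text}: cond={condition:#x} mov r{rd}, #{value}")
```

The second word decodes to an immediate value of 26, which matches the length of the message string computed by msg_len. The condition value Eh in each word means the instruction always executes.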

You can run this program on an Android device with Developer options enabled. We won't go into the procedure for enabling those options here, but you can easily learn more about it with an Internet search.

This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:

C:>adb push hello_arm /data/local/tmp/hello_arm

C:>adb shell chmod +x /data/local/tmp/hello_arm

C:>adb shell /data/local/tmp/hello_arm

Hello, Computer Architect!

These commands use the Android Debug Bridge (adb) tool included with Android Studio. Although the hello_arm program runs on the Android device, output from the program is sent back to the PC and displayed in the command window.

The next section introduces the 64-bit ARM architecture, an extension of the 32-bit ARM architecture.

64-bit ARM architecture and instruction set

The 64-bit version of the ARM architecture, named AArch64, was announced in 2011. This architecture has 31 general-purpose 64-bit registers, 64-bit addressing, a 48-bit virtual address space, and a new instruction set named A64. A64 is a distinct instruction set rather than a superset of the 32-bit instruction sets; ARMv8-A processors that also implement the 32-bit execution state can run existing 32-bit code unmodified.

Instructions are 32 bits wide and most operands are 32 or 64 bits. The A64 register functions differ in some respects from 32-bit mode: the program counter is no longer directly accessible as a register and an additional register is provided that always returns an operand value of zero.

At the user privilege level, most A64 instructions have the same mnemonics as the corresponding 32-bit instructions. The assembler determines whether an instruction operates on 64-bit or 32-bit data based on the operands provided. The following rules determine the operand length and register size used by an instruction:

64-bit register names begin with the letter X; for example, x0.
32-bit register names begin with the letter W; for example, w1.
32-bit registers occupy the lower 32 bits of the corresponding 64-bit register number.

When working with 32-bit registers, the following rules apply:

Register operations such as right shifts behave the same as in the 32-bit architecture. A 32-bit arithmetic right shift uses bit 31 as the sign bit, not bit 63.
Condition codes for 32-bit operations are set based on the result in the lower 32 bits.
Writes to a W register set the upper 32 bits of the corresponding X register to zero.
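The rules for 32-bit register operations can be modeled with a few masks. The following Python sketch shows the W view of an X register, the zeroing of the upper 32 bits on a W write, and the difference between a 32-bit and a 64-bit arithmetic right shift. The function names and register contents are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def read_w(x_value: int) -> int:
    """The W register is the lower 32 bits of the X register."""
    return x_value & MASK32


def write_w(old_x: int, new_w: int) -> int:
    """Return the X register value after writing new_w to its W view."""
    # old_x is deliberately unused: its upper 32 bits are replaced by zeros.
    return new_w & MASK32


def asr(value: int, amount: int, width: int) -> int:
    """Arithmetic shift right using bit (width - 1) as the sign bit."""
    mask = (1 << width) - 1
    value &= mask
    signed = value - (1 << width) if value >> (width - 1) else value
    return (signed >> amount) & mask


x0 = 0x1122334455667788  # hypothetical contents of x0
print(hex(read_w(x0)))
print(hex(write_w(x0, 0x1234)))
print(hex(asr(0x80000000, 4, 32)))  # bit 31 is the sign bit
print(hex(asr(0x80000000, 4, 64)))  # bit 63 is the sign bit, and it is clear
```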

A64 is a load/store architecture with the same instruction mnemonics for memory operations (ldr and str) as 32-bit mode. There are some differences and limitations in comparison to the 32-bit load and store instructions:

The base register must be an X (64-bit) register.
An address offset can be any of the same types as in 32-bit mode, as well as an X register. A 32-bit offset can be zero-extended or sign-extended to 64 bits.
Indexed addressing modes can only use immediate values as an offset.
A64 does not support the ldm or stm instructions for loading or storing multiple registers in a single instruction. Instead, A64 adds the ldp and stp instructions for loading or storing a pair of registers in a single instruction.
A64 only supports conditional execution for a small subset of instructions.

Stack operations are significantly different in A64. Perhaps the biggest difference in this area is that the stack pointer must maintain 16-byte alignment when accessing data.
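
One practical consequence of the 16-byte alignment rule is that stack frame sizes are rounded up to a multiple of 16 before the stack pointer is adjusted. The following Python sketch shows the arithmetic; the function names and the sample stack pointer value are hypothetical and chosen only for illustration.

```python
def align_up_16(size):
    """Round a byte count up to the next multiple of 16."""
    return (size + 15) & ~15


def is_sp_aligned(sp):
    """True if the stack pointer value is a multiple of 16."""
    return (sp & 0xF) == 0


def allocate_frame(sp, size):
    """Return the new stack pointer after reserving an aligned frame.

    The stack grows toward lower addresses, so the frame size is subtracted.
    """
    return sp - align_up_16(size)
```

For instance, reserving 24 bytes of local storage from a stack pointer of 0x7FFFFFF000 actually moves the stack pointer down by 32 bytes, to 0x7FFFFFEFE0, which keeps it 16-byte aligned.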

64-bit ARM assembly language

This is the 64-bit ARM assembly language source file for the hello program:

.text
.global _start

_start:
    // Print the message to file 1 (stdout) with syscall 64
    mov x0, #1
    ldr x1, =msg
    mov x2, #msg_len
    mov x8, #64
    svc 0

    // Exit the program with syscall 93, returning status 0
    mov x0, #0
    mov x8, #93
    svc 0

.data
msg:
    .ascii "Hello, Computer Architect!"
msg_len = . - msg
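
For readers more familiar with high-level code, the first system call in this program can be mirrored in Python. On 64-bit ARM Linux, syscall number 64 is write and syscall number 93 is exit; registers x0 through x2 carry the arguments and x8 carries the syscall number. The sketch below is only an analogy: the names MSG and hello are hypothetical, and Python's os.write stands in for the raw svc instruction.

```python
import os

# Same bytes as the msg label; len(MSG) plays the role of msg_len.
MSG = b"Hello, Computer Architect!"


def hello(fd=1):
    """Equivalent of the write syscall: x0 = fd, x1 = buffer, x2 = length.

    Returns the number of bytes written, as the kernel does in x0.
    """
    return os.write(fd, MSG)
```

Calling hello() writes the 26-byte message to stdout. The second system call in the assembly program corresponds to ending the process with status 0, which in Python would be sys.exit(0).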

This file, named hello_arm64.s, is assembled and linked to form the executable hello_arm64 program with the following commands. These commands use the 64-bit development tools provided with the Android Studio NDK. The use of these commands assumes the Windows PATH environment variable has been set to include the tools directory:

aarch64-linux-android-as -al=hello_arm64.lst -o hello_arm64.o hello_arm64.s

aarch64-linux-android-ld -o hello_arm64 hello_arm64.o

The components of these commands are as follows:

aarch64-linux-android-as runs the assembler.
-al=hello_arm64.lst creates a listing file named hello_arm64.lst.
-o hello_arm64.o creates an object file named hello_arm64.o.
hello_arm64.s is the name of the assembly language source file.
aarch64-linux-android-ld runs the linker.
-o hello_arm64 creates an executable file named hello_arm64.
hello_arm64.o is the name of the object file provided as input to the linker.

This is a portion of the hello_arm64.lst listing file generated by the assembler:

 1               .text
 2               .global _start
 4               _start:
 5               // Print the message to file 1 (stdout) with syscall 64
 6 0000 200080D2 mov x0, #1
 7 0004 E1000058 ldr x1, =msg
 8 0008 420380D2 mov x2, #msg_len
 9 000c 080880D2 mov x8, #64
10 0010 010000D4 svc 0
12               // Exit the program with syscall 93, returning status 0
13 0014 000080D2 mov x0, #0
14 0018 A80B80D2 mov x8, #93
15 001c 010000D4 svc 0
17               .data
18               msg:
19 0000 48656C6C .ascii "Hello, Computer Architect!"
19      6F2C2043
19      6F6D7075
19      74657220
19      41726368
20               msg_len = . - msg
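
The listing shows each instruction as four bytes in memory order, which is little-endian, so the 32-bit instruction word is the byte string reversed. As an illustrative sketch, the Python below converts a listing entry to its instruction word and splits the mov immediate instructions (which the assembler encodes as the move wide with zero instruction, MOVZ) into fields. The function names are hypothetical; the field positions follow the A64 MOVZ encoding.

```python
def listing_word(hex_bytes):
    """Convert listing bytes (memory order) to a 32-bit instruction word."""
    return int.from_bytes(bytes.fromhex(hex_bytes), "little")


def is_movz(word):
    """True if bits 30:23 match the MOVZ pattern (opc = 10, then 100101)."""
    return (word & 0x7F800000) == 0x52800000


def decode_movz(word):
    """Split a MOVZ instruction word into its fields."""
    return {
        "sf": (word >> 31) & 1,          # 1 = 64-bit (X) destination
        "hw": (word >> 21) & 0b11,       # shift of the immediate, in 16-bit steps
        "imm16": (word >> 5) & 0xFFFF,   # 16-bit immediate value
        "rd": word & 0x1F,               # destination register number
    }
```

Applied to line 14 of the listing, listing_word("A80B80D2") gives 0xD2800BA8, which decodes to a 64-bit destination, register 8, and an immediate of 93, matching mov x8, #93. Line 8 decodes to an immediate of 26, the length of the message.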

You can run this program on an Android device with Developer options enabled, as described earlier. This is the output displayed when running this program on an Android ARM device connected to the host PC with a USB cable:

C:\>adb push hello_arm64 /data/local/tmp/hello_arm64
C:\>adb shell chmod +x /data/local/tmp/hello_arm64
C:\>adb shell /data/local/tmp/hello_arm64
Hello, Computer Architect!

This completes our introduction to the 32-bit and 64-bit ARM architectures.

Summary

Having completed this chapter, you should have a good understanding of the top-level architectures and features of the x86, x64, 32-bit ARM, and 64-bit ARM registers, instruction sets, and assembly languages.

The x86 and x64 architectures represent a mostly CISC approach to processor design, with variable-length instructions that can take many cycles to execute, a lengthy pipeline, and (in x86) a limited number of processor registers.

The ARM architectures, on the other hand, represent a RISC approach, with mostly single-cycle instruction execution, a large register set, and (somewhat) fixed-length instructions. Early versions of ARM had pipelines as short as three stages, though later versions have considerably more stages.

Is one of these architectures better than the other, in a general sense? It may be that each is better in some ways, and system designers must make their selection of processor architecture based on the specific needs of the system under development. Of course, there is a great deal of inertia behind the use of x86/x64 processors in personal computing, business computing, and server applications. Similarly, there is a lot of history behind the domination of ARM processors in smart personal devices and embedded systems. Many factors go into the processor selection process when designing a new computer or smart device.

In the next chapter, we'll look at the RISC-V architecture. RISC-V was developed from a clean sheet, incorporating lessons learned from the history of processor development and without any of the baggage needed to maintain support for decades-old legacy designs.

Exercises

Install the free Visual Studio Community edition, available at https://visualstudio.microsoft.com/vs/community/, on a Windows PC. After installation is complete, open the Visual Studio IDE and select Get Tools and Features… under the Tools menu. Install the Desktop development with C++ workload.
In the Windows search box in the Task Bar, begin typing x86 Native Tools Command Prompt for VS 2019. When the app appears in the search menu, select it to open a command prompt.
Create a file named hello_x86.asm with the content shown in the source listing in the x86 assembly language section of this chapter.
Build the program using the command shown in the x86 assembly language section of this chapter and run it. Verify the output Hello, Computer Architect! appears on the screen.
Write an x86 assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print one byte as two hex digits.
In the Windows search box in the Task Bar, begin typing x64 Native Tools Command Prompt for VS 2019. When the app appears in the search menu, select it to open a command prompt.
Create a file named hello_x64.asm with the content shown in the source listing in the x64 assembly language section of this chapter.
Build the program using the command shown in the x64 assembly language section of this chapter and run it. Verify that the output Hello, Computer Architect! appears on the screen.
Write an x64 assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print one byte as two hex digits.
Install the free Android Studio IDE, available at https://developer.android.com/studio/. After installation is complete, open the Android Studio IDE and select SDK Manager under the Tools menu. In the Settings for New Projects dialog, select the SDK Tools tab and check the NDK option, which may say NDK (Side by side). Complete the installation of the native development kit (NDK).
Locate the following files under the SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: arm-linux-androideabi-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
set PATH=%PATH%;%LOCALAPPDATA%\Android\Sdk\ndk\20.1.5948944\toolchains\arm-linux-androideabi-4.9\prebuilt\windows-x86_64\bin;%LOCALAPPDATA%\Android\Sdk\platform-tools
Create a file named hello_arm.s with the content shown in the source listing in the 32-bit ARM assembly language section of this chapter.
Build the program using the commands shown in the 32-bit ARM assembly language section of this chapter.
Enable Developer Options on an Android phone or tablet. Search the Internet for instructions on how to do this.
Connect your Android device to the computer with a USB cable.
Copy the program executable image to the phone using the commands shown in the 32-bit ARM assembly language section of this chapter and run the program. Verify the output Hello, Computer Architect! appears on the host computer screen.
Write a 32-bit ARM assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print one byte as two hex digits.
Locate the following files under the Android SDK installation directory (the default location is under %LOCALAPPDATA%\Android) and add their directories to your PATH environment variable: aarch64-linux-android-as.exe and adb.exe. Hint: The following command works for one version of Android Studio (your path may vary):
set PATH=%PATH%;%LOCALAPPDATA%\Android\sdk\ndk-bundle\toolchains\arm-linux-androideabi-4.9\prebuilt\windows-x86_64\bin;%LOCALAPPDATA%\Android\Sdk\platform-tools
Create a file named hello_arm64.s with the content shown in the source listing in the 64-bit ARM assembly language section of this chapter.
Build the program using the commands shown in the 64-bit ARM assembly language section of this chapter.
Enable Developer Options on an Android phone or tablet.
Connect your Android device to the computer with a USB cable.
Copy the program executable image to the phone using the commands shown in the 64-bit ARM assembly language section of this chapter and run the program. Verify the output Hello, Computer Architect! appears on the host computer screen.
Write a 64-bit ARM assembly language program that computes the following expression and prints the result as a hexadecimal number: [(129 – 66) × (445 + 136)] ÷ 3. As part of this program, create a callable function to print one byte as two hex digits.
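
When working on the expression exercises, it can help to have a host-side reference to check the assembly program's output against. The Python sketch below is one possible reference, not a solution in assembly: the function names are hypothetical, and it mirrors the suggested structure of a routine that prints one byte as two hex digits, called once per byte of the result.

```python
HEX_DIGITS = "0123456789ABCDEF"


def hex_byte(value):
    """Format one byte (0 to 255) as two hex digits, high nibble first."""
    return HEX_DIGITS[(value >> 4) & 0xF] + HEX_DIGITS[value & 0xF]


def expression():
    """Evaluate [(129 - 66) x (445 + 136)] / 3 using integer arithmetic."""
    return ((129 - 66) * (445 + 136)) // 3


def hex16(value):
    """Format a 16-bit value by calling hex_byte on each of its two bytes."""
    return hex_byte((value >> 8) & 0xFF) + hex_byte(value & 0xFF)
```

The result does not fit in a single byte, so the assembly program needs to call its byte-printing function at least twice, starting with the most significant byte.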
