© Jo Van Hoey 2019
J. Van Hoey, Beginning x64 Assembly Programming, https://doi.org/10.1007/978-1-4842-5076-1_1

1. Your First Program

Jo Van Hoey (Hamme, Belgium)

Generations of programmers have started their programming careers by learning how to display hello, world on a computer screen. It is a tradition that was started in the seventies by Brian W. Kernighan in the book he wrote with Dennis Ritchie, The C Programming Language. Ritchie developed the C programming language at Bell Labs. Since then, the C language has changed a lot but has remained the language that every self-respecting programmer should be familiar with. The majority of “modern” and “fancy” programming languages have their roots in C. C is sometimes called a portable assembly language, and as an aspiring assembly programmer, you should get familiar with C. To honor the tradition, we will start with an assembly program that puts hello, world on your screen. Listing 1-1 shows the source code for an assembly language version of the hello, world program, which we will analyze in this chapter.
;hello.asm
section .data
    msg    db      "hello, world",0
section .bss
section .text
    global main
main:
    mov    rax, 1       ; 1 = write
    mov    rdi, 1       ; 1 = to stdout
    mov    rsi, msg     ; string to display in rsi
    mov    rdx, 12      ; length of the string, without 0
    syscall             ; display the string
    mov    rax, 60      ; 60 = exit
    mov    rdi, 0       ; 0 = success exit code
    syscall             ; quit
Listing 1-1. hello.asm

Edit, Assemble, Link, and Run (or Debug)

There are many good text editors on the market, both free and commercial. Look for one that supports syntax highlighting for NASM 64-bit. In most cases, you will have to download some kind of plugin or package to have syntax highlighting.

Note

In this book, we will write code for the Netwide Assembler (NASM). There are other assemblers such as YASM, FASM, GAS, or MASM from Microsoft. And as with everything in the computer world, there are sometimes heated discussions about which assembler is the best. We will use NASM in this book because it is available on Linux, Windows, and macOS and because there is a large community using NASM. You can find the manual at www.nasm.us.

We use gedit with an assembly language syntax file installed. Gedit is a standard editor available in Linux; we use Ubuntu Desktop 18.04.2 LTS. You can find a syntax highlighting file at https://wiki.gnome.org/action/show/Projects/GtkSourceView/LanguageDefinitions. Download the file asm-intel.lang, copy it to /usr/share/gtksourceview*.0/language-specs/, and replace the asterisk (*) with the version installed on your system. When you open gedit, you can choose your programming language, here Assembler (Intel), at the bottom of the gedit window.

On our gedit screen, the hello.asm file shown in Listing 1-1 looks like Figure 1-1.
Figure 1-1. hello.asm in gedit

We think you will agree that syntax highlighting makes the assembler code a little bit easier to read.

When we write assembly programs, we have two windows open on our screen—a window with gedit containing our assembler source code and a window with a command prompt in the project directory—so that we can easily switch between editing and manipulating the project files (assembling and running the program, debugging, and so on). We agree that for more complex and larger projects, this is not feasible; you will need an integrated development environment (IDE). But for now, working with a simple text editor and the command line (in other words, the CLI) will do. This process has the benefit that we can concentrate on the assembler instead of the bells and whistles of an IDE. In later chapters, we will discuss useful tools and utilities, some of them with graphical user interfaces and some of them CLI oriented. But explaining and using IDEs is beyond the scope of this book.

For every exercise in this book, we use a separate project directory that will contain all the files needed and generated for the project.

Of course, in addition to a text editor, you have to check that you have a number of other tools installed, such as GCC, GDB, make, and NASM. First we need GCC, the default Linux compiler and linker.

GCC stands for GNU Compiler Collection and is the standard compiler and linker tool on Linux. (GNU stands for GNU's Not Unix; it is a recursive acronym. Using recursive acronyms for naming things is an insider programmer joke that was started in the seventies by LISP programmers. Yes, a lame old joke....)

Type gcc -v at the CLI. GCC will respond with a number of messages if it is installed. If it is not installed, install it by typing the following at the CLI:
sudo apt install gcc

Do the same with gdb -v and make -v. If you don’t understand these instructions, brush up on your Linux knowledge before continuing.

You need to install NASM and build-essential, which contains a number of tools we will use. To do so in Ubuntu Desktop 18.04, use this:
sudo apt install build-essential nasm

Type nasm -v at the CLI, and nasm will respond with a version number if it is properly installed. If you have these programs installed, you are ready for your first assembly program.

Type the hello, world program shown in Listing 1-1 into your favorite editor and save it with the name hello.asm. As mentioned, use a separate directory for saving the files of this first project. We will explain every line of code later in this chapter; note the following characteristics of assembly source code (the “source code” is the hello.asm file with the program instructions you just typed):
  • In your code, you can use tabs, spaces, and new lines to make the code more readable.

  • Use one instruction per line.

  • The text following a semicolon is a comment, in other words, an explanation for the benefit of humans. Computers happily ignore comments.

With your text editor, create another file containing the lines in Listing 1-2.
#makefile for hello.asm
hello: hello.o
	gcc -o hello hello.o -no-pie
hello.o: hello.asm
	nasm -f elf64 -g -F dwarf hello.asm -l hello.lst
Listing 1-2. makefile for hello.asm

Figure 1-2 shows what we have in gedit.
Figure 1-2. makefile in gedit

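Save this file as makefile in the same directory as hello.asm and quit the editor. Make sure that each command line (the lines starting with gcc and nasm) begins with a tab character, not spaces; otherwise, make will stop with a missing separator error.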

A makefile will be used by make to automate the building of our program. Building a program means checking your source code for errors, adding all necessary services from the operating system, and converting your code into a sequence of machine-readable instructions. In this book, we will use simple makefiles. If you want to know more about makefiles, here is the manual:

https://www.gnu.org/software/make/manual/make.html

Here is a tutorial:

https://www.tutorialspoint.com/makefile/

You read the makefile from the bottom up to see what it is doing. Here is a simplified explanation: the make utility works with a dependency tree. It notes that hello depends on hello.o. It then sees that hello.o depends on hello.asm and that hello.asm depends on nothing else. make compares the last modification date of hello.asm with that of hello.o, and if hello.asm is more recent (or hello.o does not exist yet), make executes the command on the line below hello.o: hello.asm, which assembles hello.asm. Then make finds that the modification date of hello.o is more recent than that of hello. So, it executes the command on the line below hello: hello.o, which links hello.o into the executable hello.

In the bottom line of our makefile, NASM is used as the assembler. The -f is followed by the output format, in our case elf64, which means Executable and Linkable Format for 64-bit. The -g means that we want to include debug information in a debug format specified after the -F option. We use the dwarf debug format. The software geeks who invented this format seemed to like The Hobbit and Lord of the Rings written by J.R.R. Tolkien, so maybe that is why they decided that DWARF would be a nice complement to ELF…just in case you were wondering. Seriously, DWARF stands for Debug With Arbitrary Record Format.

STABS is another debug format, which has nothing to do with all the stabbing in Tolkien’s novels; the name comes from Symbol Table Strings. We will not use STABS here, so you won’t get hurt.

The -l tells NASM to generate a .lst file. We will use .lst files to examine the result of the assembly. NASM will create an object file with an .o extension. That object file will next be used by a linker.

Note

Often it will happen that NASM complains with a number of cryptic messages and refuses to give you an object file. Sometimes NASM will complain so often that it will drive you almost insane. In those cases, it is essential to keep calm, have another coffee, and review your code, because you did something wrong. As you program more and more in assembly, you will catch mistakes faster.

When you have finally convinced NASM to give you an object file, this object file is then linked with a linker. A linker takes your object code and searches the system for other files that are needed, typically system services or other object files. These files are combined with your generated object code by the linker, and an executable file is produced. Of course, the linker will take every possible occasion to complain to you about missing things and so on. If that is the case, have another coffee and check your source code and makefile.

In our case, we use the linking functionality of GCC (repeated here for reference):
hello: hello.o
	gcc -o hello hello.o -no-pie

The recent GCC linker and compiler generate position-independent executables (PIEs) by default. This is to prevent hackers from investigating how memory is used by a program and eventually interfering with program execution. At this point, we will not build position-independent executables; it would really complicate the analysis of our program (on purpose, for security reasons). So, we add the parameter -no-pie in the makefile.

Finally, you can insert comments in your makefile by preceding them with the pound symbol, #.
#makefile for hello.asm

We use GCC because of the ease of accessing C standard library functions from within assembler code. To make life easy, we will use C language functions from time to time to simplify the example assembly code. Just so you know, another popular linker on Linux is ld, the GNU linker.
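
As a preview of what linking with GCC makes possible, here is a minimal sketch of hello, world that calls the C library function printf instead of using a system call. The file name hello_printf.asm is our own choice for illustration, and we assume a makefile like Listing 1-2 with the file names adapted. The instructions push, pop, call, and ret are not explained in this chapter; treat them as a glimpse of things to come.

```nasm
; hello_printf.asm (hypothetical file name): hello, world through printf
extern printf                   ; printf lives in the C standard library
section .data
    msg    db      "hello, world",10,0   ; 10 = new line, 0 = end of string
section .text
    global main
main:
    push   rbp                  ; set up a stack frame (keeps the stack aligned)
    mov    rbp, rsp
    mov    rdi, msg             ; first argument: address of the string
    mov    rax, 0               ; no floating-point arguments are passed
    call   printf               ; let the C library do the printing
    mov    rsp, rbp             ; restore the stack
    pop    rbp
    mov    rax, 0               ; return value of main, 0 = success
    ret                         ; return to the C startup code
```

Because GCC links the C library and its startup code automatically, no extra linker options should be needed for this sketch.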

If the previous paragraphs do not make sense to you, do not worry; have a coffee and carry on. It is just background information and not important at this stage. Just remember that the makefile is your friend and does a lot of work for you; the only thing you have to worry about at this time is making no errors.

At the command prompt, go to the directory where you saved your hello.asm file and your makefile. Type make to assemble and build the program and then run the program by typing ./hello at the command prompt. If you see the message hello, world displayed in front of the command prompt, then everything worked out fine. Otherwise, you made some typing or other error, and you need to review your source code or makefile. Refill your cup of coffee and happy debugging!

Figure 1-3 shows an example of the output we have on our screen.
Figure 1-3. hello, world output

Structure of an Assembly Program

This first program illustrates the basic structure of an assembly program. The following are the main parts of an assembly program:
  • section .data

  • section .bss

  • section .text

section .data

In section .data, initialized data is declared and defined in the following format:
      <variable name>       <type>       <value>

When a variable is included in section .data, memory is allocated for that variable when the source code is assembled and linked to an executable. Variable names are symbolic names for memory locations, and a variable can occupy one or more memory locations. The variable name refers to the start address of the variable in memory.

Variable names start with a letter, followed by letters or numbers or special characters. Table 1-1 lists the possible datatypes.
Table 1-1. Datatypes

Type    Length     Name
db      8 bits     Byte
dw      16 bits    Word
dd      32 bits    Double word
dq      64 bits    Quadword

In the example program, section .data contains one variable, msg, which is a symbolic name pointing to the memory address of 'h', which is the first byte of the string "hello, world",0. So, msg points to the letter 'h', msg+1 points to the letter 'e', and so on. This variable is called a string, which is a contiguous list (or array) of characters in memory. In fact, any contiguous list of bytes in memory can be considered a string; the characters can be human readable or not, and the string can be meaningful to humans or not.
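
To see that msg is just a start address, consider the following sketch, a variation on Listing 1-1 that is our own and not part of the original listing. It passes msg+7 instead of msg to the write system call, so the output should start at the letter 'w'. The file name world.asm is a hypothetical choice.

```nasm
; world.asm (hypothetical file name): start printing at msg+7
section .data
    msg    db      "hello, world",0
section .text
    global main
main:
    mov    rax, 1           ; 1 = write
    mov    rdi, 1           ; 1 = to stdout
    mov    rsi, msg+7       ; address of 'w', 7 bytes past 'h'
    mov    rdx, 5           ; length of "world"
    syscall                 ; display the 5 bytes starting at msg+7
    mov    rax, 60          ; 60 = exit
    mov    rdi, 0           ; 0 = success exit code
    syscall                 ; quit
```

Counting from zero, the bytes at msg+5 and msg+6 are the comma and the space, which is why 'w' sits at msg+7.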

It is convenient to have a zero indicating the end of a human-readable string. You can omit the terminating zero at your own peril. The terminating 0 we are referring to is not an ASCII 0; it is a numeric zero, and the memory place at the 0 contains eight 0 bits. If you frowned at the acronym ASCII, do some Googling. Having a grasp of what ASCII means is important in programming. Here is the short explanation: characters for use by humans have a special code in computers. Capital A has code 65, B has code 66, and so on. A line feed or new line has code 10, and NULL has code 0. Thus, we terminate a string with NULL. When you type man ascii at the CLI, Linux will show you an ASCII table.
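
ASCII codes can be written directly in a db definition. The following sketch, again our own variation on Listing 1-1 with a hypothetical file name, appends code 10 (new line) to the string so that the command prompt should reappear on its own line after the message. Note that the length in rdx grows by one, while the terminating 0 is still not counted.

```nasm
; hello_nl.asm (hypothetical file name): hello, world followed by a new line
section .data
    msg    db      "hello, world",10,0   ; 10 = new line, 0 = terminating zero
section .text
    global main
main:
    mov    rax, 1           ; 1 = write
    mov    rdi, 1           ; 1 = to stdout
    mov    rsi, msg         ; string to display in rsi
    mov    rdx, 13          ; 12 characters plus the new line, without 0
    syscall                 ; display the string
    mov    rax, 60          ; 60 = exit
    mov    rdi, 0           ; 0 = success exit code
    syscall                 ; quit
```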

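section .data can also contain constants, which are values that cannot be changed in the program. They are declared in the following format:
      <constant name>      equ      <value>
Here’s an example (NASM expects an integer value after equ; a floating-point value such as 3.1416 has to be defined as a variable, for example with dq):
      msglen equ 12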
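
A constant can also be computed by the assembler. The sketch below is our own variation on Listing 1-1; the constant name len is a hypothetical choice. It relies on NASM's $ symbol, which stands for the address of the current position, so that $ - msg is the number of bytes defined since msg. That way, the length no longer has to be counted by hand.

```nasm
; hello_len.asm (hypothetical file name): string length as a constant
section .data
    msg    db      "hello, world",0
    len    equ     $ - msg - 1  ; 13 bytes defined so far, minus the terminating 0
section .text
    global main
main:
    mov    rax, 1           ; 1 = write
    mov    rdi, 1           ; 1 = to stdout
    mov    rsi, msg         ; string to display in rsi
    mov    rdx, len         ; 12, computed by the assembler
    syscall                 ; display the string
    mov    rax, 60          ; 60 = exit
    mov    rdi, 0           ; 0 = success exit code
    syscall                 ; quit
```

A constant defined with equ takes no memory in the program; the assembler simply substitutes its value wherever the name is used.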

section .bss

The acronym bss stands for Block Started by Symbol, and its history goes back to the fifties, when it was part of the assembly language developed for the IBM 704. This section holds the uninitialized variables. Space for them is declared in the following format:
      <variable name>      <type>      <number>
Here <number> is the number of elements of the given type to reserve. Table 1-2 shows the possible bss datatypes.
Table 1-2. bss Datatypes

Type    Length     Name
resb    8 bits     Byte
resw    16 bits    Word
resd    32 bits    Double word
resq    64 bits    Quadword

For example, the following declares space for an array of 20 double words:
      dArray resd 20

The variables in section .bss are not given values in the source code; the values will be assigned later, at execution time. Memory for them is not reserved at assembly time but at execution time. In future examples, we will show the use of section .bss. When your program starts executing, it asks the operating system for the needed memory, which is allocated to the variables in section .bss and initialized to zeros. If there is not enough memory available for the .bss variables at execution time, the program will crash.
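
As a small illustration, the following sketch (our own, with the hypothetical names bss_demo.asm and buf) reserves two bytes in section .bss, fills them at execution time, and then displays them with the write system call used in Listing 1-1. We assume it is built like hello.asm, with -no-pie.

```nasm
; bss_demo.asm (hypothetical file name): fill a .bss variable at execution time
section .data
section .bss
    buf    resb    2            ; space for two bytes, no value given here
section .text
    global main
main:
    mov    byte [buf], 65       ; ASCII code 65 = 'A'
    mov    byte [buf+1], 10     ; ASCII code 10 = new line
    mov    rax, 1               ; 1 = write
    mov    rdi, 1               ; 1 = to stdout
    mov    rsi, buf             ; address of the two bytes
    mov    rdx, 2               ; number of bytes to display
    syscall                     ; display 'A' and a new line
    mov    rax, 60              ; 60 = exit
    mov    rdi, 0               ; 0 = success exit code
    syscall                     ; quit
```

The square brackets mean "the memory at this address", and the keyword byte tells NASM how many bytes to write there.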

section .text

section .text is where all the action is. This section contains the program code and starts with the following:
    global main
main:

The main: part is called a label. When you have a label on a line without anything following it, the word is best followed by a colon; otherwise, the assembler will send you a warning. And you should not ignore warnings! When a label is followed by other instructions, there is no need for a colon, but it is best to make it a habit to end all labels with a colon. Doing so will increase the readability of your code.

In our hello.asm code, after the main: label, registers such as rdi, rsi, and rax are prepared for outputting a message on the screen. We will see more information about registers in Chapter 2. Here, we will display a string on the screen using a system call. That is, we will ask the operating system to do the work for us.
  • The system call code 1 is put into the register rax, which means “write.”

  • To put some value into a register, we use the instruction mov. In reality, this instruction does not move anything; it makes a copy from the source to the destination. The format is as follows:

mov destination, source
  • The instruction mov can be used as follows:
    • mov register, immediate value

    • mov register, register

    • mov register, memory

    • mov memory, register

    • illegal: mov memory, memory

  • In our code, the output destination for writing is stored into the register rdi, and 1 means standard output (in this case, output to your screen).

  • The address of the string to be displayed is put into register rsi.

  • In register rdx, we place the message length. Count the characters of hello, world. Do not count the quotes of the string or the terminating 0. If you count the terminating 0, the program will try to display a NULL byte, which is a bit senseless.

  • Then the system call, syscall, is executed, and the string, msg, will be displayed on the standard output. A syscall is a call to functionality provided by the operating system.

  • To avoid error messages when the program finishes, a clean program exit is needed. We start with writing 60 into rax, which indicates “exit.” The “success” exit code 0 goes into rdi, and then a system call is executed. The program exits without complaining.
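
The mov forms and the exit system call described above can be tried out in a short sketch. This is our own example, not one of the book's listings; the names mov_demo.asm and var are hypothetical. It copies the value 42 from an immediate value to a register, from the register to memory, and from memory to rdi, where it serves as the exit code.

```nasm
; mov_demo.asm (hypothetical file name): the different uses of mov
section .data
    var    dq      0            ; one quadword, initialized to 0
section .bss
section .text
    global main
main:
    mov    rax, 42              ; mov register, immediate value
    mov    [var], rax           ; mov memory, register
    mov    rdi, [var]           ; mov register, memory: rdi is now 42
    ; mov  [var], [var]         ; illegal: mov memory, memory
    mov    rax, 60              ; 60 = exit
    syscall                     ; quit, the exit code is the value in rdi
```

The program displays nothing, but typing echo $? at the CLI right after running it should show the exit code 42.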

System calls are used to ask the operating system to do specific actions. Every operating system has a different list of system call parameters, and the system calls for Linux are different from those for Windows or macOS. We use the Linux system calls for x64 in this book; you can find more details at http://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/.

Be aware that 32-bit system calls differ from 64-bit system calls. When you read code, always verify if the code is written for 32-bit or 64-bit systems.

Go to the operating system CLI and look for the file hello.lst. This file was generated during assembling, before linking, as specified in the makefile. Open hello.lst in your editor, and you will see your assembly code listing; in the leftmost column, you’ll see the relative address of your code, and in the next column, you’ll see your code translated into machine language (in hexadecimal). Figure 1-4 shows our hello.lst.
Figure 1-4. hello.lst

You have a column with the line numbers and then a column with eight digits. This column represents memory locations. When the assembler built the object file, it did not yet know what memory locations would be used. So, it started at location 0 for each of the sections. The section .bss part takes up no memory, because it is empty.

In the column after the addresses, we see the result of the conversion of the assembly instructions into hexadecimal code. For example, mov rax is converted to B8 and mov rdi to BF. These are the hexadecimal representations of the machine instructions. Note also the conversion of the msg string to hexadecimal ASCII characters. Later you’ll learn more about hexadecimal notation. The first instruction to be executed starts at address 00000000 and takes five bytes: B8 01 00 00 00. The bytes 01 00 00 00 after B8 are the value 1, stored as a 32-bit number with the least significant byte first; the double zeros are the upper bytes of that number. Assemblers and compilers can also pad code and data to align them in memory, a feature used to optimize code. You can give assemblers and compilers different flags to obtain the smallest possible size of the executable, the fastest code, or a combination. In later chapters, we will discuss optimization, with the purpose of increasing execution speed.

The next instruction starts at address 00000005, and so on. The addresses in the listing have eight hexadecimal digits; each hexadecimal digit represents 4 bits, so these are 32-bit offsets from the start of the section. Now look at how msg is referenced. Because the memory location of msg is not known yet, it is referred to as [0000000000000000]. That placeholder has 16 hexadecimal digits, or 64 bits; indeed, we are using a 64-bit assembler.

You will agree that assembler mnemonics and symbolic names for memory addresses are quite a bit easier to remember than hexadecimal values, knowing that there are hundreds of mnemonics, with a multitude of operands, each resulting in even more hexadecimal instructions. In the early days of computers, programmers used machine language, the first-generation programming language. Assembly language, with its “easier to remember” mnemonics, is a second-generation programming language.

Summary

In this chapter, you learned about the following:
  • The basic structure of an assembly program, with the different sections

  • Memory, with symbolic names for addresses

  • Registers

  • An assembly instruction: mov

  • How to use a syscall

  • The difference between machine code and assembly code
