hello.asm
Edit, Assemble, Link, and Run (or Debug)
There are many good text editors on the market, both free and commercial. Look for one that supports syntax highlighting for NASM 64-bit. In most cases, you will have to download some kind of plugin or package to have syntax highlighting.
Note
In this book, we will write code for the Netwide Assembler (NASM). There are other assemblers such as YASM, FASM, GAS, or MASM from Microsoft. And as with everything in the computer world, there are sometimes heavy discussions about which assembler is the best. We will use NASM in this book because it is available on Linux, Windows, and macOS and because there is a large community using NASM. You can find the manual at www.nasm.us.
We use gedit with an assembler language syntax file installed. Gedit is a standard editor available in Linux; we use Ubuntu Desktop 18.04.2 LTS. You can find a syntax highlighting file at https://wiki.gnome.org/action/show/Projects/GtkSourceView/LanguageDefinitions . Download the file asm-intel.lang and copy it to /usr/share/gtksourceview*.0/language-specs/, replacing the asterisk (*) with the version installed on your system. When you open gedit, you can choose your programming language, here Assembler (Intel), at the bottom of the gedit window.
We think you will agree that syntax highlighting makes the assembler code a little bit easier to read.
When we write assembly programs, we have two windows open on our screen—a window with gedit containing our assembler source code and a window with a command prompt in the project directory—so that we can easily switch between editing and manipulating the project files (assembling and running the program, debugging, and so on). We agree that for more complex and larger projects, this is not feasible; you will need an integrated development environment (IDE). But for now, working with a simple text editor and the command line (in other words, the CLI) will do. This process has the benefit that we can concentrate on the assembler instead of the bells and whistles of an IDE. In later chapters, we will discuss useful tools and utilities, some of them with graphical user interfaces and some of them CLI oriented. But explaining and using IDEs is beyond the scope of this book.
For every exercise in this book, we use a separate project directory that will contain all the files needed and generated for the project.
Of course, in addition to a text editor, you have to check that you have a number of other tools installed, such as GCC, GDB, make, and NASM. First we need GCC, the default Linux compiler and linker.
GCC stands for GNU Compiler Collection and is the standard compiler and linker tool on Linux. (GNU stands for GNU's Not Unix; it is a recursive acronym. Using recursive acronyms for naming things is an insider programmer joke that was started in the seventies by LISP programmers. Yes, a lame old joke....)
Type gcc -v at the CLI, and GCC will respond with version information if it is installed. Do the same with gdb -v and make -v. If you don’t understand these instructions, brush up on your Linux knowledge before continuing.
Type nasm -v at the CLI, and nasm will respond with a version number if it is properly installed. If you have these programs installed, you are ready for your first assembly program.
In your code, you can use tabs, spaces, and new lines to make the code more readable.
Use one instruction per line.
The text following a semicolon is a comment, in other words, an explanation for the benefit of humans. Computers happily ignore comments.
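These conventions can be seen in a minimal sketch of hello.asm. It is reconstructed from the description given in this chapter (the variable msg, the write and exit system calls, and the build options), so treat it as an illustration and not as the definitive listing. The build commands in the leading comments are the ones this chapter discusses for the makefile; the file names hello.o, hello.lst, and hello are assumed.

```nasm
; hello.asm: prints "hello, world" and exits
; assemble: nasm -f elf64 -g -F dwarf hello.asm -l hello.lst
; link:     gcc -o hello hello.o -no-pie
section .data
    msg db  "hello, world",0    ; the string, followed by a terminating 0
section .bss                    ; no uninitialized variables in this program
section .text
    global main                 ; make the label main visible to the linker
main:
    mov rax, 1                  ; 1 = write
    mov rdi, 1                  ; 1 = standard output
    mov rsi, msg                ; address of the string to display
    mov rdx, 12                 ; length of the string, without the 0
    syscall                     ; ask the operating system to write
    mov rax, 60                 ; 60 = exit
    mov rdi, 0                  ; 0 = success exit code
    syscall                     ; ask the operating system to exit
```

Each instruction sits on its own line, and everything after a semicolon is ignored by the assembler.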
makefile for hello.asm
Save this file as makefile in the same directory as hello.asm and quit the editor.
A makefile will be used by make to automate the building of our program. Building a program means checking your source code for errors, adding all necessary services from the operating system, and converting your code into a sequence of machine-readable instructions. In this book, we will use simple makefiles. If you want to know more about makefiles, here is the manual:
https://www.gnu.org/software/make/manual/make.html
Here is a tutorial:
https://www.tutorialspoint.com/makefile/
Read the makefile from the bottom up to see what it is doing. Here is a simplified explanation: the make utility works with a dependency tree. It notes that hello depends on hello.o. It then sees that hello.o depends on hello.asm and that hello.asm depends on nothing else. make compares the last modification time of hello.asm with that of hello.o, and if hello.asm is more recent, make executes the command on the line below hello.o: hello.asm, which runs the assembler. make then finds that hello.o is more recent than hello, so it executes the command on the line below hello: hello.o, which runs the linker.
In the bottom line of our makefile, NASM is used as the assembler. The -f is followed by the output format, in our case elf64, which means Executable and Linkable Format for 64-bit. The -g means that we want to include debug information in a debug format specified after the -F option. We use the dwarf debug format. The software geeks who invented this format seemed to like The Hobbit and The Lord of the Rings, written by J.R.R. Tolkien, so maybe that is why they decided that DWARF would be a nice complement to ELF, just in case you were wondering. Seriously, DWARF is usually expanded as Debugging With Arbitrary Record Formats, a name that was proposed after the fact.
STABS is another debug format, which has nothing to do with all the stabbing in Tolkien’s novels; the name comes from Symbol Table Strings. We will not use STABS here, so you won’t get hurt.
The -l tells NASM to generate a .lst file. We will use .lst files to examine the result of the assembly. NASM will create an object file with an .o extension. That object file will next be used by a linker.
Note
Often it will happen that NASM complains with a number of cryptic messages and refuses to give you an object file. Sometimes NASM will complain so often that it will drive you almost insane. In those cases, it is essential to keep calm, have another coffee, and review your code, because you did something wrong. As you program more and more in assembly, you will catch mistakes faster.
When you have finally convinced NASM to give you an object file, this object file is then linked with a linker. A linker takes your object code and searches the system for other files that are needed, typically system services or other object files. These files are combined with your generated object code by the linker, and an executable file is produced. Of course, the linker will take every possible occasion to complain to you about missing things and so on. If that is the case, have another coffee and check your source code and makefile.
The recent GCC linker and compiler generate position-independent executables (PIEs) by default. This is to prevent hackers from investigating how memory is used by a program and eventually interfering with program execution. At this point, we will not build position-independent executables; it would really complicate the analysis of our program (on purpose, for security reasons). So, we add the parameter -no-pie in the makefile.
We use GCC because of the ease of accessing C standard library functions from within assembler code. To make life easy, we will use C language functions from time to time to simplify the example assembly code. Just so you know, another popular linker on Linux is ld, the GNU linker.
If the previous paragraphs do not make sense to you, do not worry; have a coffee and carry on. It is just background information and not important at this stage. Just remember that the makefile is your friend and does a lot of work for you; the only thing you have to worry about at this time is making no errors.
At the command prompt, go to the directory where you saved your hello.asm file and your makefile. Type make to assemble and build the program and then run the program by typing ./hello at the command prompt. If you see the message hello, world displayed in front of the command prompt, then everything worked out fine. Otherwise, you made some typing or other error, and you need to review your source code or makefile. Refill your cup of coffee and happy debugging!
Structure of an Assembly Program
section .data
section .bss
section .text
section .data
When a variable is included in section .data, memory is allocated for that variable when the source code is assembled and linked into an executable. Variable names are symbolic names for memory locations, and a variable can occupy one or more memory locations. The variable name refers to the start address of the variable in memory.
Datatypes
Type | Length | Name |
---|---|---|
db | 8 bits | Byte |
dw | 16 bits | Word |
dd | 32 bits | Double word |
dq | 64 bits | Quadword |
In the example program, section .data contains one variable, msg, which is a symbolic name pointing to the memory address of 'h', which is the first byte of the string "hello, world",0. So, msg points to the letter 'h', msg+1 points to the letter 'e', and so on. This variable is called a string, which is a contiguous list of characters. A string is a “list” or “array” of characters in memory. In fact, any contiguous list in memory can be considered a string; the characters can be human readable or not, and the string can be meaningful to humans or not.
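As a small sketch of these ideas, the program below declares one variable of each type from the table and displays only the second character of msg by passing the address msg+1 to the write system call. The names bval, wval, dval, and qval and their values are hypothetical and chosen only for illustration.

```nasm
; data.asm: one variable per data type, and addressing into a string
section .data
    msg  db "hello, world",0    ; 13 bytes: 12 characters and the 0
    bval db 65                  ; byte (hypothetical name)
    wval dw 1000                ; word (hypothetical name)
    dval dd 100000              ; double word (hypothetical name)
    qval dq 10000000000         ; quadword (hypothetical name)
section .text
    global main
main:
    mov rax, 1                  ; 1 = write
    mov rdi, 1                  ; 1 = standard output
    mov rsi, msg+1              ; address of the letter 'e'
    mov rdx, 1                  ; display one character only
    syscall
    mov rax, 60                 ; 60 = exit
    mov rdi, 0
    syscall
```

Assembled and linked in the same way as hello.asm, this should display the single letter e.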
It is convenient to have a zero indicating the end of a human-readable string. You can omit the terminating zero at your own peril. The terminating 0 we are referring to is not an ASCII 0; it is a numeric zero, and the memory place at the 0 contains eight 0 bits. If you frowned at the acronym ASCII, do some Googling. Having a grasp of what ASCII means is important in programming. Here is the short explanation: characters for use by humans have a special code in computers. Capital A has code 65, B has code 66, and so on. A line feed or new line has code 10, and NULL has code 0. Thus, we terminate a string with NULL. When you type man ascii at the CLI, Linux will show you an ASCII table.
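To make the ASCII codes concrete, here is a minimal sketch in which a string is written as numeric codes instead of quoted characters. The variable name letters is an assumption made for this example.

```nasm
; ascii.asm: a string declared with numeric ASCII codes
section .data
    letters db 65,66,10,0       ; 'A', 'B', line feed, terminating 0
section .text
    global main
main:
    mov rax, 1                  ; 1 = write
    mov rdi, 1                  ; 1 = standard output
    mov rsi, letters            ; address of the first byte (65, 'A')
    mov rdx, 3                  ; A, B, and the line feed, not the 0
    syscall
    mov rax, 60                 ; 60 = exit
    mov rdi, 0
    syscall
```

To the assembler, the line letters db "AB",10,0 would produce exactly the same four bytes.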
section .bss
bss Datatypes
Type | Length | Name |
---|---|---|
resb | 8 bits | Byte |
resw | 16 bits | Word |
resd | 32 bits | Double word |
resq | 64 bits | Quadword |
The variables in section .bss do not contain any initial values; values are assigned later, at execution time. Memory is not reserved at assembly time but at execution time. In future examples, we will show the use of section .bss. When your program starts executing, it asks the operating system for the needed memory, which is allocated to the variables in section .bss and initialized to zeros. If there is not enough memory available for the .bss variables at execution time, the program will crash.
section .text
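The following sketch shows the idea under a few assumptions: the names buf and numbers are hypothetical, and the instruction that stores a value directly into memory (mov byte [buf], 65) uses square brackets for memory access, which is explained in more detail later in the book.

```nasm
; bss.asm: reserve memory in .bss and fill it at execution time
section .bss
    buf     resb 1              ; reserve 1 byte (hypothetical name)
    numbers resq 3              ; reserve 3 quadwords, 24 bytes in total
section .text
    global main
main:
    mov byte [buf], 65          ; store ASCII 'A' in buf at run time
    mov rax, 1                  ; 1 = write
    mov rdi, 1                  ; 1 = standard output
    mov rsi, buf                ; address of the reserved byte
    mov rdx, 1                  ; display one character
    syscall
    mov rax, 60                 ; 60 = exit
    mov rdi, 0
    syscall
```

Note that the number after resb or resq is a count of items to reserve, not a value to store.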
The main: part is called a label. When you have a label on a line without anything following it, the word is best followed by a colon; otherwise, the assembler will send you a warning. And you should not ignore warnings! When a label is followed by other instructions, there is no need for a colon, but it is best to make it a habit to end all labels with a colon. Doing so will increase the readability of your code.
The system call code 1, which means “write,” is put into the register rax.
To put some value into a register, we use the instruction mov. In reality, this instruction does not move anything; it makes a copy from the source to the destination. The format is as follows:
mov destination, source
The instruction mov can be used as follows:
- mov register, immediate value
- mov register, memory
- mov memory, register
- illegal: mov memory, memory
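The forms listed above can be tried in a short sketch. The variable names val and copy are hypothetical, and the square brackets, which mean "the content at this address," are covered in more detail later. The program copies 42 from one memory location to another through a register and uses it as the exit code.

```nasm
; mov.asm: the legal forms of mov, and the illegal one as a comment
section .data
    val  dq 42                  ; initialized quadword (hypothetical name)
section .bss
    copy resq 1                 ; uninitialized quadword (hypothetical name)
section .text
    global main
main:
    mov rax, 60                 ; register, immediate value (60 = exit)
    mov rbx, [val]              ; register, memory
    mov [copy], rbx             ; memory, register
    ; mov [copy], [val]         ; illegal: memory, memory
    mov rdi, [copy]             ; exit code = the copied value
    syscall
```

After running the program, typing echo $? at the CLI should show the exit code 42. A memory-to-memory copy always needs a register in between, as rbx is used here.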
In our code, the output destination for writing is stored into the register rdi, and 1 means standard output (in this case, output to your screen).
The address of the string to be displayed is put into register rsi.
In register rdx, we place the message length. Count the characters of hello, world. Do not count the quotes of the string or the terminating 0. If you count the terminating 0, the program will try to display a NULL byte, which is a bit senseless.
Then the system call, syscall, is executed, and the string, msg, will be displayed on the standard output. A syscall is a call to functionality provided by the operating system.
To avoid error messages when the program finishes, a clean program exit is needed. We start with writing 60 into rax, which indicates “exit.” The “success” exit code 0 goes into rdi, and then a system call is executed. The program exits without complaining.
System calls are used to ask the operating system to do specific actions. Every operating system has its own list of system calls and parameters, and the system calls for Linux are different from those of Windows or macOS. We use the Linux system calls for x64 in this book; you can find more details at http://blog.rchapman.org/posts/Linux_System_Call_Table_for_x86_64/.
Be aware that 32-bit system calls differ from 64-bit system calls. When you read code, always verify if the code is written for 32-bit or 64-bit systems.
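As an illustration of how much the two conventions differ, the sketch below exits a 64-bit program and shows the 32-bit Linux equivalent as comments only. The 32-bit details (the registers eax and ebx and the instruction int 0x80) are not covered in this chapter and are included here only for comparison.

```nasm
; exit64.asm: a 64-bit exit, with the 32-bit version as comments
section .text
    global main
main:
    ; 64-bit Linux: code in rax, first argument in rdi, then syscall
    mov rax, 60                 ; 60 = exit on 64-bit Linux
    mov rdi, 0                  ; exit code 0
    syscall
    ; 32-bit Linux equivalent, for comparison only:
    ;   mov eax, 1              ; 1 = exit on 32-bit Linux
    ;   mov ebx, 0              ; exit code 0
    ;   int 0x80
```

The code 1 means “exit” in the 32-bit list but “write” in the 64-bit list, so mixing up the two can produce confusing results.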
Now look at the .lst file that NASM generated. You have a column with the line numbers and then a column with eight digits. This column represents memory locations. When the assembler built the object file, it didn’t know yet what memory locations would be used. So, it started at location 0 for each of the different sections. The section .bss part has no memory locations listed.
In the next column we see the result of the conversion of the assembly instructions into hexadecimal code. For example, mov rax is converted to B8 and mov rdi to BF. These are the hexadecimal representations of the machine instructions. Note also the conversion of the msg string to hexadecimal ASCII characters. Later you’ll learn more about hexadecimal notation. The first instruction to be executed starts at address 00000000 and takes five bytes: B8 01 00 00 00. The byte B8 is the instruction code, and 01 00 00 00 is the value 1 stored as a 32-bit number with the lowest byte first; the zeros are simply the unused upper bytes of that value. Assemblers and compilers can also insert padding bytes to align code and data in memory, which is one way to optimize code. You can give assemblers and compilers different flags to obtain the smallest possible size of the executable, the fastest code, or a combination. In later chapters, we will discuss optimization, with the purpose of increasing execution speed.
The next instruction starts at address 00000005, and so on. The memory addresses in the listing have eight hexadecimal digits; each hexadecimal digit represents 4 bits, so these offsets are 32 bits (4 bytes) wide. The addresses used by the running program are 64 bits wide; indeed, we are using a 64-bit assembler. Look at how msg is referenced. Because the memory location of msg is not known yet, it is referred to as [0000000000000000], which is 16 hexadecimal digits, or 64 bits.
You will agree that assembler mnemonics and symbolic names for memory addresses are quite a bit easier to remember than hexadecimal values, knowing that there are hundreds of mnemonics, with a multitude of operands, each resulting in even more hexadecimal instructions. In the early days of computers, programmers used machine language, the first-generation programming language. Assembly language, with its “easier to remember” mnemonics, is a second-generation programming language.
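To get a feeling for what programming in machine language looks like, here is a sketch of a program that only exits, written as raw bytes with db instead of mnemonics. The codes B8 and BF are the ones mentioned above; 3C is the hexadecimal notation for 60, and the two bytes 0F 05 for syscall are not given in this chapter and are stated here as an addition.

```nasm
; bytes.asm: an exit program written as machine code bytes
section .text
    global main
main:
    db 0xB8, 0x3C, 0x00, 0x00, 0x00     ; mov eax, 60  (60 = exit)
    db 0xBF, 0x00, 0x00, 0x00, 0x00     ; mov edi, 0   (exit code 0)
    db 0x0F, 0x05                       ; syscall
```

The program behaves like the last three instructions of hello.asm, but without the comments it would be nearly impossible to read, which is exactly why mnemonics were invented.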
Summary
The basic structure of an assembly program, with the different sections
Memory, with symbolic names for addresses
Registers
An assembly instruction: mov
How to use a syscall
The difference between machine code and assembly code