We’ve now learned quite a bit of ARM 64-bit Assembly Language; one of the things we can do is read another programmer’s code. Reading another programmer’s code is a great way to not only add to our toolkit of tips and tricks but also improve our own coding. We’ll review some places where you can find Assembly source code for the ARM processor. We’ll examine one of the Assembly Language routines from the Linux kernel to learn some new optimization techniques. Then we’ll look at how the GNU C compiler writes Assembly code and how we can analyze it. We’ll look at the NSA’s Ghidra hacking tool that converts Assembly Language code back into C code—at least approximately.
We’ll use our upper-case program to see how the C compiler writes Assembly Language code and then examine how Ghidra can take that code and reconstitute the C code.
Browsing Linux and GCC Code
Linux kernel: https://github.com/torvalds/linux
GCC source code: https://github.com/gcc-mirror/gcc
- Linux kernel
arch/arm64/lib
arch/arm64/kernel
arch/arm64/crypto
- GCC
libgcc/config/aarch64
The arch/arm64/crypto has several cryptographic routines implemented on the NEON Coprocessor cryptographic extensions that won’t be implemented on all processors.
The Assembly source code for these is in *.S files (note the upper-case S). This is so they can include C header files and utilize C preprocessor directives.
We can learn a lot by studying this code. For example, we’ll now look at how the Linux kernel copies pages of memory around.
Copying a Page of Memory
The Linux kernel contains machine-specific code to handle things like the initialization of the CPU, handling interrupts and performing multitasking. It also contains Assembly Language versions of many C runtime functions and other specialty functions that optimize the Linux kernel’s performance.
The Linux kernel does not use the C runtime library. That’s because the C runtime library must be initialized once Linux is running; rather the Linux kernel has copies of some key runtime functions. Furthermore, special machine-specific, highly optimized versions are contained in the arch/arm64/lib folder. There is a lot we can learn from these functions.
The Linux kernel’s virtual memory manager deals with allocating memory to processes in 4K pages. Manipulating these pages efficiently is key to the Linux kernel performing well. We will look at the kernel’s implementation of copying a page from one location to another. This will teach us a little of how Linux kernel functions are implemented and learn a couple of new optimization techniques in the process. This particular function was implemented by ARM Holdings and donated to the Linux kernel, since it is in ARM’s interest that Linux runs well on their processors.
The Linux kernel’s copy page function
I suspect this implementation isn’t what you’d expect to implement. So, let’s go through how this function works and why it’s implemented the way it is.
About the Algorithm
- 1.
Loop unrolling: The code only loops 31 times. This reduces the number of times the loop-related instructions execute.
- 2.
Parallel processing: Notice that the code does all the LDP instruction ahead of the STP instructions. This way the instruction pipeline can execute quite a few of these instructions in parallel, since the data isn’t used until much later. If your particular ARM processor has a deep instruction pipeline, this can greatly help.
- 1.
It adds 256 right away to X0; the destination pointer then must dereference the values with negative offsets. This is necessary since the starting address is on a page boundary, and the test would abort the loop right away if we didn’t add something first. It adds 256 rather than 128, since the first set of LDPs are done before the loop and the last set of STPs are done after the loop. This gives the correct number of iterations.
- 2.
This routine uses all the corruptible registers, except for one. This means it doesn’t need to push or pop any registers to or from the stack. Register X19 is still available to use and this could store the original address, so we can test with CMP to see when we hit the end of the page, or it could be used as a regular counter. Perhaps, this would lead to more readable code, without requiring extra overhead.
- 3.
The TST is a long distance in the code from the B.NE that uses the result. This can be confusing, since when you see the B.NE, it isn’t obvious who is setting the condition flags for it.
- 4.
It relies on the pointers being page aligned (which they’re specified to be).
- 5.
It uses 1b as the label rather than something more descriptive. Perhaps, this was a macro at one time, but currently this is a function, so a descriptive label is okay.
I think the loop would’ve been better accomplished in a more typical fashion.
Macros and Kernel Options
The macros SYM_FUNC_START, SYM_FUNC_END, and EXPORT_SYMBOL are defined in include/linux/linkage.h. They contain the GNU Assembler directives to ensure the routine is aligned properly and the function name is global.
The macros alternative_if and alternative_else_nop_endif are defined in arch/arm64/include/asm/alternative.h. They provide a configurable mechanism to configure the Linux kernel depending on the exact features that a given processor contains. The folder arch/arm64/include/asm has several interesting Assembly Language include files that are worth looking at.
- 1.
pld: Preload the data.
- 2.
l1: Load into the L1 cache.
- 3.
strm: Stream the data starting at the specified address. It also implies the data will only be used once; then it can be discarded.
Similarly, the routine uses STNP to store the register pair. This instruction is the same as STP; the N is a non-temporal hint that we’re done with the cache value. The processor can also use this as a hint that nearby memory addresses will be saved shortly, and it can batch the memory operations together if that helps performance.
We’ve spent quite a bit of time writing our own Assembly Language; let’s have a look at how the GNU C compiler writes Assembly code.
Code Created by GCC
We’ll code our upper-case routine in C and compare the generated code to what we wrote. For this example, we want gcc to do as good a job as possible, so we’ll use the -O3 option for maximal optimization.
C implementation of the mytoupper routine
Assembly code generated by the C compiler for our upper-case function
The compiler automatically inlined the mytoupper function like our macro version. The mytoupper function is elsewhere in the listing, in case it’s called from another file.
The compiler knows about the range optimization and shifted the range, so it only does one comparison. The shift is performed by
sub w3, w0, #0x61The compiler sets up a stack frame, but doesn’t use it, because all the variables fit in the corruptible registers. As a result, it only saves and restores the LR and FP registers.
The compiler uses the ADRP instruction to load the values of pointers. We covered ADR in Chapter 5, “Thanks for the Memories”; ADRP works like ADR, except that it loads to a 4K page boundary. This means that it has a greater range than ADR, but for humans it’s harder to use. The compiler must set it to a page boundary, which in this case points to C runtime data and then uses cumbersome offsets to get to the correct data. This is good for compilers, not so good for humans to code.
The compiler uses the CBNZ instruction , which we’ll discuss shortly.
There are a few occurrences of
and w3, w3, #0xff
For compiler accesses outstr via register X4. Strangely, it adds one to this first, then references it with a -1 offset. This results in an unnecessary ADD instruction.
The compiler always performs the case conversion with
sub w6, w0, #0x20
Then based on the comparison, it either saves W6 or W0 depending upon whether the conversion is required or not.
Overall, the compiler did a reasonable job of compiling our code, but there are a few instructions that can be removed. We can certainly see how some hand optimization will help.
This is why many Assembly Language programmers start with C code and then remove any extra instructions. The C code becomes less efficient once it can’t fit all the variables in registers and must start swapping data to and from the stack. This usually happens when the complexity is higher and the need for speed is greater.
In Chapter 8, “Programming GPIO Pins,” we looked at programming the GPIO pins using the GPIO controller’s memory registers. This sort of code confuses the optimizer. Often it needs to be turned off, or it optimizes away the code that accesses these locations. This is because we write to memory locations and never read them and also read memory we never set. There are keywords to help the optimizer; however, Assembly Language can result in quite a bit better code, because you’re working against the C optimizer that doesn’t know what the GPIO controller is doing with this memory.
The listing used the CBNZ instruction that we haven’t seen before; let’s have a look at this along with the matching CBZ instruction.
Using the CBNZ and CBZ Instructions
CBNZ is compare and branch on nonzero. It compares W1 to 0, and if it isn’t 0 yet, then it branches. Not all instructions have an S version like SUBS, and this instruction can be used in those cases. CBZ is the reverse and will branch when the register is 0. These are the only choices; there aren’t versions for any other condition flags.
The compiler doesn’t seem to use SUBS instructions when it generates code. It could have eliminated the CMP instruction by putting an S on the end of one of the SUB instructions.
Reverse Engineering and Ghidra
In the Linux world, most of the programs you encounter are open source, from which you can easily download the source code and study it. There is documentation on how it works, and you are actively encouraged to contribute to the program, perhaps fix bugs or add a new feature.
Suppose we encounter a program that we don’t have the source code for, and we want to know how it works. Perhaps, we want to study it, to see if it contains malware. It might be the case that we are worried about privacy concerns and want to know what information the program sends on the Internet. Maybe, it's a game, and we want to know if there is a secret code we can enter to go into God mode. What is the best way to go about this?
We can examine the Assembly code of any Linux executable using objdump or gdb . We know enough about Assembly that we can make sense of the instructions we encounter. However, this doesn’t help us form a big picture of how the program is structured and it’s time consuming.
There are tools to help with this. Until recently there were only expensive commercial products available; however, the National Security Agency (NSA), yes, that NSA, released a version of the tool that their hackers use to analyze code. It is called Ghidra, named after the three-headed monster that Godzilla fights. This tool lets you analyze compiled programs and includes the ability to decompile a program back into C code. It includes tools to show you the graphs of function calls and the ability to make annotations as you learn things.
You can download Ghidra from https://ghidra-sre.org/. To install it, you unzip it, then run the ghidraRun script if you are on Linux. Ghidra requires the Java runtime; if you don’t have this already installed, you will need to install it for your operating system.
Ghidra requires the 64-bit version of Oracle Java. Some Linux distributions, even though they are 64 bits, have the 32-bit version of Java installed. If you run Ghidra under 32-bit Java, it will work until you try to disassemble some code, at which point the disassembler will fail to run. There’s currently no 64-bit version of Java for ARM, so you need to do this on an Intel or AMD-based computer.
- 1.Create a project in Ghidra, import our upper program, and we get an information dialog shown in Figure 15-1.
- 2.
Another information window with more detailed data. Click OK to get the main Window.
- 3.Right-click the upper executable and select “Open with default tool”. This opens the code analysis window. Click Yes when asked if you want the code analyzed and accept the defaults at the next prompt. Figure 15-2 is the resulting code analysis window. You need to click main in the symbol tree to get the source code to appear.
C code created by Ghidra for our upper C program
- 4.
Run the program. The expected output is
smist08@kali:~/asm64/Chapter 15$ makegcc -O3 -o upperghidra upperghidra.csmist08@kali:~/asm64/Chapter 15$ ./upperghidraInput: This is a test!Output: THIS IS A TEST!smist08@kali:~/asm64/Chapter 15$
The code produced isn’t pretty. The variable names are generated. It knows tstStr and outStr , because these are global variables. The logic is in smaller steps, often each C statement being the equivalent of a single Assembly instruction. When trying to figure out a program you don’t have the source code for, having a couple of different viewpoints is a great help.
This technique only works for true compiled languages like C, Fortran, or C++. It does not work for interpreted languages like Python or JavaScript; it also doesn’t work for partially compiled languages that use a virtual machine architecture like Java or C#. There are other tools for these and often these are much more effective, since the compile step doesn’t do as much.
Summary
In this chapter, we reviewed where we can find some sample Assembly source code in the Linux kernel and the GCC runtime library. We looked at the Linux kernels copy_page function to see how that works. We wrote a C version of our upper-case program, so we could compare the Assembly code that the C compiler produces and compare it to what we have written.
We then looked at the sophisticated Ghidra program for decompiling programs to reverse the process and see what it produces. Although it produces working C code from Assembly code, it isn’t that easy to read.
In Chapter 16, “Hacking Code,” we’ll look how hackers use Assembly Language knowledge to hack our code and take control of our computers.
Exercises
- 1.
Manually execute the instructions in Listing 15-1 that perform the loop to ensure you understand how it works and that it performs the correct number of iterations.
- 2.
Have a look at the Linux kernel library function memchr.S located in arch/arm64/lib. Can you easily follow this code?
- 3.
The copy_page routine was simpler, because the pages were guaranteed to be aligned. Look at the memcmp.S file in arch/arm64/lib. This routine is more complicated because it doesn’t assume alignment, but wants to use the same efficiencies alignment gives you. It needs to handle the first non-aligned bytes, then the main block that’s aligned, and any leftover bytes. Understanding this routine is more challenging.
- 4.
Rewrite the loop in one of the versions of the upper-case routine to use a CBNZ or CBZ instruction for its main loop.
- 5.
Compile the C code generated by the Ghidra disassembler in Listing 15-4. Then run objdump on the output and compare it to the original Assembly code in Listing 15-3. Is this what you expected?
- 6.
Examine one of the smaller executables from /usr/bin, such as head, in Ghidra. Can you figure out how it works and find the main block of code?