We’ve now learned quite a bit of ARM 32-bit Assembly language, and one of the things we can do with it is read another programmer’s code. Reading other programmers’ code is a great way to add to our toolkit of tips and tricks and improve our own coding. We’ll review some places where you can find Assembly source code for the ARM32. Then we’ll look at how the GNU C compiler writes Assembly code and how we can analyze it. Finally, we’ll look at the NSA’s Ghidra hacking tool, which can convert Assembly code back into C code, at least approximately.
We’ll use our uppercase program to see how the C compiler writes Assembly code and then examine how Ghidra can take that code and reconstitute the C code. We’ll also look at how the C compiler deals with the lack of an integer division instruction in older ARM processors.
Raspbian and GCC
One of the many nice things about working with the Raspberry Pi and GNU Compiler Collection is that they are open source. That means you can browse through the source code and peruse the Assembly parts contained there.
- Raspbian Linux kernel: https://github.com/raspberrypi/linux
- GCC source code: https://github.com/gcc-mirror/gcc
The folders in these repositories that contain ARM 32-bit Assembly source code are
- Raspbian Linux kernel: arch/arm/common, arch/arm/kernel, arch/arm/crypto
- GCC: libgcc/config/arm
Note
The arch/arm/crypto folder has several cryptographic routines implemented on the NEON coprocessor.
The Assembly source code for these is in *.S files (note the uppercase S). Raspbian is based on Debian Linux. Both Debian Linux and GCC support dozens of processor architectures, so when looking for Assembly source code, make sure you look for *.S files in an arm folder. If you are interested, you could compare the ARM 32-bit Assembly files to the files for other processors.
The source code for these uses both GNU Assembler directives like .MACRO and C preprocessor directives like #define and #ifdef. If you are going to read this source code, it helps to brush up on the C preprocessor.
The GNU compiler supports ARM processors older than those in any Raspberry Pi, as well as configurations of the ARM processor that the Raspberry Pi Foundation never used. For instance, there is a library to implement IEEE 754 floating-point for ARM processors without an FPU. However, all Raspberry Pis do have an FPU, so this isn’t used.
Division Revisited
Simple C program that divides two numbers
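The original listing is not reproduced here. As a minimal sketch of the kind of program being described, assuming the values 100 and 25 (chosen only because the quotient is 4) and a hypothetical function name `divide_example`:

```c
#include <stdio.h>

/* Hypothetical stand-in for the chapter's division program.
   The variable names x, y, z and the values are assumptions. */
int divide_example(void)
{
    int x = 100;
    int y = 25;
    int z;

    /* Built without optimization for an ARM core that lacks SDIV,
       this division is compiled as a call into the runtime library. */
    z = x / y;

    printf("%d / %d = %d\n", x, y, z);
    return z;
}
```

Calling `divide_example()` from `main` and compiling without any -O option keeps the division in the generated code, so it can be inspected with objdump or gdb.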
Note
We can’t use any of the -O flag options, because any optimization will remove the expression and the compiler will just plug 4 in for z.
and repeated 32 times. What’s going on here? Since we can download the source code for gcc and all its libraries, we can look for the answer there. If we search for the definition of __divsi3, we will find it in libgcc/config/arm/lib1funcs.S. This source code is confusing, because it contains versions of its routines for different generations of ARM, as well as versions that use Thumb code. We’ll cover Thumb code in Chapter 15, “Thumb Code,” but until then we can ignore those parts.
Main part of the libgcc division routine
The routine starts by checking for division by 0, which is an error. It then looks for the easy cases of division by 1 or -1, then the other cases of dividing by a power of 2. It also saves the sign bits so the answer can be set properly at the end.
Main body of the division routine
which generates the repetitive code we see. This is a form of optimization called loop unrolling, where if a loop executes a fixed number of times, we just duplicate the code that many times. This saves us an expensive branch instruction, as well as the arithmetic calculating the loop index. Division will be used often enough that we want the code as fast as possible, and we can spare the extra code space to achieve this.
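To make loop unrolling concrete, here is a small C sketch that writes the same division step once as a loop and once unrolled. It works on 8-bit operands to stay short; the names `DIV_STEP`, `udiv8_loop`, and `udiv8_unrolled` are hypothetical, and this is an illustration of the idea rather than the library's code.

```c
#include <stdint.h>

/* One long-division step: if (divisor << shift) still fits in the
   remainder, subtract it and set the matching quotient bit. */
#define DIV_STEP(shift)                                  \
    do {                                                 \
        if (rem >= ((uint32_t)divisor << (shift))) {     \
            rem -= (uint32_t)divisor << (shift);         \
            quot |= 1u << (shift);                       \
        }                                                \
    } while (0)

/* Looped form: every pass also updates a counter and branches.
   The divisor is assumed to be nonzero. */
uint8_t udiv8_loop(uint8_t dividend, uint8_t divisor)
{
    uint32_t rem = dividend, quot = 0;
    for (int shift = 7; shift >= 0; shift--)
        DIV_STEP(shift);
    return (uint8_t)quot;
}

/* Unrolled form: the same eight steps written out, with no
   loop counter and no backward branch. */
uint8_t udiv8_unrolled(uint8_t dividend, uint8_t divisor)
{
    uint32_t rem = dividend, quot = 0;
    DIV_STEP(7); DIV_STEP(6); DIV_STEP(5); DIV_STEP(4);
    DIV_STEP(3); DIV_STEP(2); DIV_STEP(1); DIV_STEP(0);
    return (uint8_t)quot;
}
```

Both functions return the same quotient; the unrolled one trades code size for the removal of the loop bookkeeping.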
The algorithm for this division is basically the same long division algorithm you learned in elementary school. It is just a bit simpler in binary, since there are only two possible answers at each step: put a 1 in the result or put a 0.
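The steps described above (check for a zero divisor, remember the sign, then do binary long division on the magnitudes) can be sketched in C. This is an assumption-laden illustration, not the library routine: the name `long_divide` is hypothetical, it returns 0 for a zero divisor instead of raising an error, and it omits the shortcuts for powers of 2.

```c
#include <stdint.h>

/* Hypothetical helper: signed 32-bit division by binary long division.
   Returns 0 when divisor is 0 (a real routine reports an error). */
int32_t long_divide(int32_t dividend, int32_t divisor)
{
    if (divisor == 0)
        return 0;

    /* Save the sign of the result, then work on magnitudes. */
    int negative = (dividend < 0) != (divisor < 0);
    uint32_t rem = dividend < 0 ? 0u - (uint32_t)dividend : (uint32_t)dividend;
    uint32_t div = divisor < 0 ? 0u - (uint32_t)divisor : (uint32_t)divisor;
    uint32_t quot = 0;

    for (int shift = 31; shift >= 0; shift--) {
        /* Comparing (rem >> shift) avoids overflowing div << shift. */
        if ((rem >> shift) >= div) {
            rem -= div << shift;     /* cannot overflow: it is <= rem */
            quot |= 1u << shift;     /* put a 1 in this result bit */
        }
    }

    /* Apply the saved sign. Converting an out-of-range unsigned value
       to int32_t is implementation-defined; GCC wraps modulo 2^32. */
    return negative ? (int32_t)(0u - quot) : (int32_t)quot;
}
```

Like the C division operator, this truncates toward zero, so dividing -7 by 2 gives -3.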
Note
If we included the -march="armv8-a" compiler switch, then the compiler would use an SDIV instruction instead of this function call. GCC will use advanced ARM features if it knows they are available.
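One way to see which choice the compiler has made is to test a predefined macro. The sketch below assumes the ARM C Language Extensions macro `__ARM_FEATURE_IDIV`, which compilers following that specification define when the selected target has hardware integer divide; the function name `has_hardware_divide` is hypothetical.

```c
/* Returns 1 when the compiler was told that the target has hardware
   integer divide (SDIV/UDIV), and 0 otherwise. On non-ARM targets the
   macro is simply not defined, so this returns 0. */
int has_hardware_divide(void)
{
#if defined(__ARM_FEATURE_IDIV)
    return 1;
#else
    return 0;
#endif
}
```

The result depends on the -march and -mcpu options used for the build, which is why no fixed output is shown here.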
Sadly, the Assembly source code contained in gcc and Linux isn’t always as well documented as we would like, but it does give us quite a bit of source code to ponder and learn from.
You might want to look at ieee754-sf.S and ieee754-df.S in the same folder as lib1funcs.S, gcc/libgcc/config/arm. These are the implementations of floating-point in single and double precision for ARM processors that don’t have an FPU. It’s interesting to see all the work the FPU does for us.
Code Created by GCC
In the last section, we looked at some code generated by gcc to see how it handles the lack of an SDIV instruction. Now let’s look at how gcc would write our own code. We’ll code our uppercase routine in C and compare the generated code to what we wrote. For this example, we want gcc to do as good a job as possible, so we will use the -O3 option to get maximal optimization.
C implementation of our mytoupper routine
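The original listing is not reproduced here. The following is a minimal sketch of what such a routine could look like. Only the names mytoupper, tstStr, and outStr come from the text; the signature, the test string, and the buffer size are assumptions.

```c
/* Hypothetical test data; only the variable names come from the text. */
const char *tstStr = "This is a test!";
char outStr[32];

/* Copies in to out, converting lowercase ASCII letters to uppercase.
   Returns the number of bytes written, including the terminating NUL. */
int mytoupper(const char *in, char *out)
{
    char *start = out;
    char cur;

    do {
        cur = *in++;
        if (cur >= 'a' && cur <= 'z')
            cur = (char)(cur - ('a' - 'A'));
        *out++ = cur;
    } while (cur != '\0');

    return (int)(out - start);
}
```

Compiling a file containing this with gcc and the -O3 and -S options writes the generated Assembly code to a .s file for inspection.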
Assembly code generated by the C compiler for our uppercase function
- The compiler automatically inlined the mytoupper function, like our macro version.
- The compiler knows about the range optimization and shifted the range, so it only does one comparison.
- The compiler made good use of the registers and didn’t create a stack frame. It uses only five registers, so R4 is the only register it needs to push and pop.
- The compiler knows how to use conditional instructions.
- The compiler took a slightly different approach to the conditional instruction, putting it on a store instruction, so the converted character is only stored if the character is lowercase. It then jumps to loop, since it knows that a lowercase character can’t be the NUL terminator. Otherwise, it falls through, stores the unconverted character, checks for the NUL terminator, and loops if it hasn’t been reached.
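The range optimization mentioned above can be written directly in C. This sketch uses hypothetical function names and assumes ASCII characters: subtracting 'a' shifts the range a..z down to 0..25, and an unsigned comparison then rejects both ends with a single test.

```c
/* Straightforward range test: two comparisons. */
int is_lower_two_tests(unsigned char c)
{
    return c >= 'a' && c <= 'z';
}

/* Shifted range test: one comparison. Characters below 'a' give a
   negative difference, which wraps to a huge unsigned value and so
   fails the same "< 26" test that rejects characters above 'z'. */
int is_lower_one_test(unsigned char c)
{
    return (unsigned)(c - 'a') < 26u;
}

/* Uppercase a single character using the one-comparison test. */
char upper_char(char c)
{
    return is_lower_one_test((unsigned char)c) ? (char)(c - ('a' - 'A')) : c;
}
```

Both tests agree for every character value; the second simply maps onto one subtract and one unsigned compare.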
Overall, the compiler did a good job of compiling our code, just taking a couple of extra instructions over what we wrote in the last chapter. GCC has supported the ARM processor for 20 years now. ARM Holdings has made major contributions to GCC to improve the ARM support. All the work over this time has led to a robust and performant system, and the best part is that it is all open source.
This is why many Assembly language programmers start with C code, then only recode in Assembly if the C code isn’t efficient. This usually happens when the complexity is higher and the need for speed is greater, such as the code in libgcc for floating-point arithmetic and division, where speed is crucial, and pure Assembler is better at bit-level manipulations than C.
In Chapter 8, “Programming GPIO Pins,” we looked at programming the GPIO pins using the GPIO controller’s memory registers. This sort of code can confuse the optimizer. Often the optimizer needs to be turned off, or it optimizes away the code that accesses these locations, because we write to memory locations we never read and read memory we never set. There are keywords to help the optimizer, but in the end, Assembler can result in quite a bit better code, because otherwise you are working against a C optimizer that doesn’t know what the GPIO controller is doing with this memory.
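The C keyword most often used for this purpose is volatile, which tells the compiler that every read and write of an object must actually happen. The sketch below is an assumption-heavy illustration: the register indices, the function names, and the `fake_regs` array (an ordinary array standing in for memory-mapped registers so the code runs anywhere) are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical register layout, not the real GPIO controller map. */
enum { REG_SET = 0, REG_CLR = 1, REG_LEV = 2 };

/* Ordinary memory standing in for the controller's registers. */
uint32_t fake_regs[3];

/* Writes that the program never reads back. Because regs is volatile,
   the compiler must still perform both stores, in this order. */
void pulse_pin(volatile uint32_t *regs, unsigned pin)
{
    regs[REG_SET] = 1u << pin;
    regs[REG_CLR] = 1u << pin;
}

/* Reads of memory the program never wrote. Because regs is volatile,
   the compiler must reload the register on every pass of the loop. */
int poll_level(volatile uint32_t *regs, uint32_t mask, unsigned max_polls)
{
    while (max_polls--) {
        if (regs[REG_LEV] & mask)
            return 1;
    }
    return 0;
}
```

On real hardware the pointer would refer to the mapped GPIO registers rather than `fake_regs`, and how that pointer is obtained is outside the scope of this sketch.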
Reverse Engineering and Ghidra
In the Raspbian world, most of the programs you encounter are open source, so you can easily download the source code and study it. There is documentation on how each program works, and you are actively encouraged to contribute to it, perhaps by fixing bugs or adding a new feature.
Suppose we encounter a program that we don’t have the source code for, and we want to know how it works. Perhaps we want to study it to see if it contains malware. It might be the case that we are worried about privacy concerns and want to know what information the program sends on the Internet. Maybe it's a game, and we want to know if there is a secret code we can enter to go into God mode. What is the best way to go about this?
We can examine the Assembly code of any Linux executable using objdump or gdb. We know enough about Assembly that we can make sense of the instructions we encounter. However, this doesn’t help us form a big picture of how the program is structured and it’s time-consuming.
There are tools to help with this. Until recently there were only expensive commercial products available; however, the NSA, yes, that NSA, released a version of the tool that their hackers use to analyze code. It is called Ghidra, named after the three-headed monster that Godzilla fights. This tool lets you analyze compiled programs and includes the ability to decompile a program back into C code. It includes tools to show you the graphs of function calls and the ability to make annotations as you learn things.
Sadly, Ghidra doesn’t run properly on the Raspberry Pi anymore, even though it is written in Java. The NSA states that Ghidra won’t be supported running on 32-bit operating systems anymore. However, Ghidra still supports analyzing 32-bit programs. It also has full support for the ARM processor. This means we need to transfer our executable file to a computer running a 64-bit operating system, whether it is Linux, macOS, or Windows.
You can download Ghidra from https://ghidra-sre.org/. To install it, you unzip it, then run the ghidraRun script if you are on Linux. Ghidra requires the Java runtime; if you don’t have this already installed, you will need to install it for your operating system.
Decompiling an optimized C program is difficult. As we saw in the last section, the GCC optimizer does some major rewriting of our original code as part of converting it to Assembly language. Let’s take the upper program that we compiled from C in the last section, give it to Ghidra to decompile, and see whether the result is like our starting source code.
C code created by Ghidra for our upper C program
The code produced isn’t pretty. The variable names are generated, except for tstStr and outStr, which Ghidra knows because they are global variables. The logic is in smaller steps, with each C statement often being the equivalent of a single Assembly instruction. Still, when trying to figure out a program you don’t have the source code for, having a couple of different viewpoints is a great help.
Note
This technique only works for true compiled languages like C, Fortran, or C++. It does not work for interpreted languages like Python or JavaScript; it also doesn’t work for partially compiled languages that use a virtual machine architecture like Java or C#. There are other tools for these and often these are much more effective, since the compile step doesn’t do as much.
Summary
In this chapter, we reviewed where we can find some sample Assembly source code in the Raspbian Linux kernel and the GCC runtime library. We looked at how GCC compiles the division operator from C and what happens when the ARM processor doesn’t support a division instruction. We wrote a C version of our uppercase program, so we could compare the Assembly code that the C compiler produces to what we have written.
We then looked at the sophisticated Ghidra program for decompiling programs to reverse the process and see what it produces. Although it produces working C code from Assembly code, it isn’t that easy to read.
In Chapter 15, “Thumb Code,” we’ll look at Thumb code, where we reduce the Assembly instruction size from 32 bits to 16 bits.