Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

S. SmithProgramming with 64-Bit ARM Assembly Languagehttps://doi.org/10.1007/978-1-4842-5881-1_15

15. Reading and Understanding Code

Stephen Smith¹

(1)

Gibsons, BC, Canada

We’ve now learned quite a bit of ARM 64-bit Assembly Language; one of the things we can do is read another programmer’s code. Reading another programmer’s code is a great way to not only add to our toolkit of tips and tricks but also improve our own coding. We’ll review some places where you can find Assembly source code for the ARM processor. We’ll examine one of the Assembly Language routines from the Linux kernel to learn some new optimization techniques. Then we’ll look at how the GNU C compiler writes Assembly code and how we can analyze it. We’ll look at the NSA’s Ghidra hacking tool that converts Assembly Language code back into C code—at least approximately.

We’ll use our upper-case program to see how the C compiler writes Assembly Language code and then examine how Ghidra can take that code and reconstitute the C code.

Browsing Linux and GCC Code

One of the many nice things about working with Linux and the GNU Compiler Collection is that they are open source. That means you can browse through the source code and peruse the Assembly parts contained there. They are available in the following GitHub repositories:

Linux kernel: https://github.com/torvalds/linux
GCC source code: https://github.com/gcc-mirror/gcc

Clicking the “Clone or download” button and choosing “Download ZIP” is the easiest way to obtain them. Within all this source code, a couple of good folders to review ARM 64-bit Assembly Language source code are

Linux kernel
- arch/arm64/lib
- arch/arm64/kernel
- arch/arm64/crypto
GCC
- libgcc/config/aarch64

Note

The arch/arm64/crypto has several cryptographic routines implemented on the NEON Coprocessor cryptographic extensions that won’t be implemented on all processors.

The Assembly source code for these is in *.S files (note the upper-case S). This is so they can include C header files and utilize C preprocessor directives.

We can learn a lot by studying this code. For example, we’ll now look at how the Linux kernel copies pages of memory around.

Copying a Page of Memory

The Linux kernel contains machine-specific code to handle things like the initialization of the CPU, handling interrupts and performing multitasking. It also contains Assembly Language versions of many C runtime functions and other specialty functions that optimize the Linux kernel’s performance.

The Linux kernel does not use the C runtime library. That’s because the C runtime library must be initialized once Linux is running; rather the Linux kernel has copies of some key runtime functions. Furthermore, special machine-specific, highly optimized versions are contained in the arch/arm64/lib folder. There is a lot we can learn from these functions.

The Linux kernel’s virtual memory manager deals with allocating memory to processes in 4K pages. Manipulating these pages efficiently is key to the Linux kernel performing well. We will look at the kernel’s implementation of copying a page from one location to another. This will teach us a little of how Linux kernel functions are implemented and learn a couple of new optimization techniques in the process. This particular function was implemented by ARM Holdings and donated to the Linux kernel, since it is in ARM’s interest that Linux runs well on their processors.

Listing 15-1 is the source code from the Linux 5.6 kernel currently under development; the file is arch/arm64/lib/copy_page.S. Linux kernel source code uses both C and Assembler macros; this routine contains fewer than most, so this code should be largely familiar. Before you read the following code, think of how you might implement an Assembly Language function to copy 4K of data from one place to another.

/* SPDX-License-Identifier: GPL-2.0-only */

#include <linux/linkage.h>

#include <linux/const.h>

#include <asm/assembler.h>

#include <asm/page.h>

#include <asm/cpufeature.h>

#include <asm/alternative.h>

* Copy a page from src to dest (both are page aligned)

* Parameters:

* x0 - dest

* x1 - src

SYM_FUNC_START(copy_page)

alternative_if ARM64_HAS_NO_HW_PREFETCH

// Prefetch three cache lines ahead.

prfm pldl1strm, [x1, #128]

prfm pldl1strm, [x1, #256]

prfm pldl1strm, [x1, #384]

alternative_else_nop_endif

ldp x2, x3, [x1]

ldp x4, x5, [x1, #16]

ldp x6, x7, [x1, #32]

ldp x8, x9, [x1, #48]

ldp x10, x11, [x1, #64]

ldp x12, x13, [x1, #80]

ldp x14, x15, [x1, #96]

ldp x16, x17, [x1, #112]

add x0, x0, #256

add x1, x1, #128

tst x0, #(PAGE_SIZE - 1)

alternative_if ARM64_HAS_NO_HW_PREFETCH

prfm pldl1strm, [x1, #384]

alternative_else_nop_endif

stnp x2, x3, [x0, #-256]

ldp x2, x3, [x1]

stnp x4, x5, [x0, #16 - 256]

ldp x4, x5, [x1, #16]

stnp x6, x7, [x0, #32 - 256]

ldp x6, x7, [x1, #32]

stnp x8, x9, [x0, #48 - 256]

ldp x8, x9, [x1, #48]

stnp x10, x11, [x0, #64 - 256]

ldp x10, x11, [x1, #64]

stnp x12, x13, [x0, #80 - 256]

ldp x12, x13, [x1, #80]

stnp x14, x15, [x0, #96 - 256]

ldp x14, x15, [x1, #96]

stnp x16, x17, [x0, #112 - 256]

ldp x16, x17, [x1, #112]

add x0, x0, #128

add x1, x1, #128

b.ne 1b

stnp x2, x3, [x0, #-256]

stnp x4, x5, [x0, #16 - 256]

stnp x6, x7, [x0, #32 - 256]

stnp x8, x9, [x0, #48 - 256]

stnp x10, x11, [x0, #64 - 256]

stnp x12, x13, [x0, #80 - 256]

stnp x14, x15, [x0, #96 - 256]

stnp x16, x17, [x0, #112 - 256]

ret

SYM_FUNC_END(copy_page)

EXPORT_SYMBOL(copy_page)

Listing 15-1

The Linux kernel’s copy page function

I suspect this implementation isn’t what you’d expect to implement. So, let’s go through how this function works and why it’s implemented the way it is.

About the Algorithm

This routine copies 128 bytes at a time by loading 16 64-bit (8-byte) registers with data. Why does it do this? Why not just copy 16 bytes at a time using repeated LDP/STP instructions? There are two reasons for this:

1.
Loop unrolling: The code only loops 31 times. This reduces the number of times the loop-related instructions execute.
2.
Parallel processing: Notice that the code does all the LDP instruction ahead of the STP instructions. This way the instruction pipeline can execute quite a few of these instructions in parallel, since the data isn’t used until much later. If your particular ARM processor has a deep instruction pipeline, this can greatly help.

The loop is a bit strange. It uses a TST instruction rather than a CMP instruction to test if we’re done. TST is just like CMP, except it uses ANDS rather than SUBS to do the comparison. Is this being clever for the sake of being clever? Here are a few points about this loop:

1.
It adds 256 right away to X0; the destination pointer then must dereference the values with negative offsets. This is necessary since the starting address is on a page boundary, and the test would abort the loop right away if we didn’t add something first. It adds 256 rather than 128, since the first set of LDPs are done before the loop and the last set of STPs are done after the loop. This gives the correct number of iterations.
2.
This routine uses all the corruptible registers, except for one. This means it doesn’t need to push or pop any registers to or from the stack. Register X19 is still available to use and this could store the original address, so we can test with CMP to see when we hit the end of the page, or it could be used as a regular counter. Perhaps, this would lead to more readable code, without requiring extra overhead.
3.
The TST is a long distance in the code from the B.NE that uses the result. This can be confusing, since when you see the B.NE, it isn’t obvious who is setting the condition flags for it.
4.
It relies on the pointers being page aligned (which they’re specified to be).
5.
It uses 1b as the label rather than something more descriptive. Perhaps, this was a macro at one time, but currently this is a function, so a descriptive label is okay.

I think the loop would’ve been better accomplished in a more typical fashion.

Macros and Kernel Options

The macros SYM_FUNC_START, SYM_FUNC_END, and EXPORT_SYMBOL are defined in include/linux/linkage.h. They contain the GNU Assembler directives to ensure the routine is aligned properly and the function name is global.

The macros alternative_if and alternative_else_nop_endif are defined in arch/arm64/include/asm/alternative.h. They provide a configurable mechanism to configure the Linux kernel depending on the exact features that a given processor contains. The folder arch/arm64/include/asm has several interesting Assembly Language include files that are worth looking at.

In this case if the ARM processor has memory prefetch, then we include instructions like

prfm pldl1strm, [x1, #128]

The preceding instruction asks the processor to load the data stored at this address into the L1 cache. The intent being that when we get to the LDP instructions, the data will already be in the cache and execute faster. The string pldl1strm means

1.
pld: Preload the data.
2.
l1: Load into the L1 cache.
3.
strm: Stream the data starting at the specified address. It also implies the data will only be used once; then it can be discarded.

Similarly, the routine uses STNP to store the register pair. This instruction is the same as STP; the N is a non-temporal hint that we’re done with the cache value. The processor can also use this as a hint that nearby memory addresses will be saved shortly, and it can batch the memory operations together if that helps performance.

We’ve spent quite a bit of time writing our own Assembly Language; let’s have a look at how the GNU C compiler writes Assembly code.

Code Created by GCC

We’ll code our upper-case routine in C and compare the generated code to what we wrote. For this example, we want gcc to do as good a job as possible, so we’ll use the -O3 option for maximal optimization.

We create upper.c from Listing 15-2.

#include <stdio.h>

int mytoupper(char *instr, char *outstr)

{

char cur;

char *orig_outstr = outstr;

{

cur = *instr;

if ((cur >= 'a') && (cur <='z'))

{

cur = cur - ('a'-'A');

}

*outstr++ = cur;

instr++;

} while (cur != '');

return( outstr - orig_outstr );

}

#define BUFFERSIZE 250

char *tstStr = "This is a test!";

char outStr[BUFFERSIZE];

int main()

{

mytoupper(tstStr, outStr);

printf("Input: %s Output: %s ", tstStr, outStr);

return(0);

}

Listing 15-2

C implementation of the mytoupper routine

We can compile this with

gcc -O3 -o upper upper.c

and then run objdump to see the generated code:

objdump -d upper >od.txt

We get Listing 15-3.

0000000000000690 <main>:

690: a9bf7bfd stp x29, x30, [sp, #-16]!

694: b0000080 adrp x0, 11000 <__cxa_finalize@GLIBC_2.17>

698: 90000082 adrp x2, 10000 <__FRAME_END__+0xf588>

69c: 910003fd mov x29, sp

6a0: f9401c01 ldr x1, [x0, #56]

6a4: f947dc44 ldr x4, [x2, #4024]

6a8: aa0103e5 mov x5, x1

6ac: 384014a0 ldrb w0, [x5], #1

6b0: 91000484 add x4, x4, #0x1

6b4: 51018403 sub w3, w0, #0x61

6b8: 51008006 sub w6, w0, #0x20

6bc: 12001c63 and w3, w3, #0xff

6c0: 7100647f cmp w3, #0x19

6c4: 54000128 b.hi 6e8 <main+0x58> // b.pmore

6c8: 381ff086 sturb w6, [x4, #-1]

6cc: 91000484 add x4, x4, #0x1

6d0: 384014a0 ldrb w0, [x5], #1

6d4: 51018403 sub w3, w0, #0x61

6d8: 51008006 sub w6, w0, #0x20

6dc: 12001c63 and w3, w3, #0xff

6e0: 7100647f cmp w3, #0x19

6e4: 54ffff29 b.ls 6c8 <main+0x38> // b.plast

6e8: 381ff080 sturb w0, [x4, #-1]

6ec: 35fffe00 cbnz w0, 6ac <main+0x1c>

6f0: f947dc42 ldr x2, [x2, #4024]

6f4: 90000000 adrp x0, 0 <_init-0x600>

6f8: 91242000 add x0, x0, #0x908

6fc: 97ffffe1 bl 680 <printf@plt>

700: 52800000 mov w0, #0x0 // #0

704: a8c17bfd ldp x29, x30, [sp], #16

708: d65f03c0 ret

Listing 15-3

Assembly code generated by the C compiler for our upper-case function

A few things to notice about this listing:

The compiler automatically inlined the mytoupper function like our macro version. The mytoupper function is elsewhere in the listing, in case it’s called from another file.
The compiler knows about the range optimization and shifted the range, so it only does one comparison. The shift is performed by

sub w3, w0, #0x61
The compiler sets up a stack frame, but doesn’t use it, because all the variables fit in the corruptible registers. As a result, it only saves and restores the LR and FP registers.
The compiler uses the ADRP instruction to load the values of pointers. We covered ADR in Chapter 5, “Thanks for the Memories”; ADRP works like ADR, except that it loads to a 4K page boundary. This means that it has a greater range than ADR, but for humans it’s harder to use. The compiler must set it to a page boundary, which in this case points to C runtime data and then uses cumbersome offsets to get to the correct data. This is good for compilers, not so good for humans to code.
The compiler uses the CBNZ instruction , which we’ll discuss shortly.
There are a few occurrences of

and w3, w3, #0xff

This is to maintain type correctness in C. A C char data type is an unsigned 8-bit number. When we subtract, it could go negative, resulting in the upper 8 bits of W3 being set to 1. This corrects it back to an unsigned quantity. We never did this, because we knew we’d only ever save this as 8 bits using STRB; therefore, we knew the upper bits would be ignored whatever they are.

For compiler accesses outstr via register X4. Strangely, it adds one to this first, then references it with a -1 offset. This results in an unnecessary ADD instruction.
The compiler always performs the case conversion with

sub w6, w0, #0x20

Then based on the comparison, it either saves W6 or W0 depending upon whether the conversion is required or not.

Overall, the compiler did a reasonable job of compiling our code, but there are a few instructions that can be removed. We can certainly see how some hand optimization will help.

This is why many Assembly Language programmers start with C code and then remove any extra instructions. The C code becomes less efficient once it can’t fit all the variables in registers and must start swapping data to and from the stack. This usually happens when the complexity is higher and the need for speed is greater.

In Chapter 8, “Programming GPIO Pins,” we looked at programming the GPIO pins using the GPIO controller’s memory registers. This sort of code confuses the optimizer. Often it needs to be turned off, or it optimizes away the code that accesses these locations. This is because we write to memory locations and never read them and also read memory we never set. There are keywords to help the optimizer; however, Assembly Language can result in quite a bit better code, because you’re working against the C optimizer that doesn’t know what the GPIO controller is doing with this memory.

The listing used the CBNZ instruction that we haven’t seen before; let’s have a look at this along with the matching CBZ instruction.

Using the CBNZ and CBZ Instructions

Consider the set of instructions:

SUB W1, W1, #1

CMP W1, #0

B.NE mylabel

This is typical code in many loops. We have been eliminating the CMP instruction by using SUBS:

SUBS W1, W1, #1

B.NE mylabel

Another way to optimize this is with

SUB W1, W1, #1

CBNZ W1, mylabel

CBNZ is compare and branch on nonzero. It compares W1 to 0, and if it isn’t 0 yet, then it branches. Not all instructions have an S version like SUBS, and this instruction can be used in those cases. CBZ is the reverse and will branch when the register is 0. These are the only choices; there aren’t versions for any other condition flags.

The compiler doesn’t seem to use SUBS instructions when it generates code. It could have eliminated the CMP instruction by putting an S on the end of one of the SUB instructions.

Reverse Engineering and Ghidra

In the Linux world, most of the programs you encounter are open source, from which you can easily download the source code and study it. There is documentation on how it works, and you are actively encouraged to contribute to the program, perhaps fix bugs or add a new feature.

Suppose we encounter a program that we don’t have the source code for, and we want to know how it works. Perhaps, we want to study it, to see if it contains malware. It might be the case that we are worried about privacy concerns and want to know what information the program sends on the Internet. Maybe, it's a game, and we want to know if there is a secret code we can enter to go into God mode. What is the best way to go about this?

We can examine the Assembly code of any Linux executable using objdump or gdb . We know enough about Assembly that we can make sense of the instructions we encounter. However, this doesn’t help us form a big picture of how the program is structured and it’s time consuming.

There are tools to help with this. Until recently there were only expensive commercial products available; however, the National Security Agency (NSA), yes, that NSA, released a version of the tool that their hackers use to analyze code. It is called Ghidra, named after the three-headed monster that Godzilla fights. This tool lets you analyze compiled programs and includes the ability to decompile a program back into C code. It includes tools to show you the graphs of function calls and the ability to make annotations as you learn things.

You can download Ghidra from https://ghidra-sre.org/. To install it, you unzip it, then run the ghidraRun script if you are on Linux. Ghidra requires the Java runtime; if you don’t have this already installed, you will need to install it for your operating system.

Note

Ghidra requires the 64-bit version of Oracle Java. Some Linux distributions, even though they are 64 bits, have the 32-bit version of Java installed. If you run Ghidra under 32-bit Java, it will work until you try to disassemble some code, at which point the disassembler will fail to run. There’s currently no 64-bit version of Java for ARM, so you need to do this on an Intel or AMD-based computer.

Decompiling an optimized C program is difficult. As we saw in the last section, the GCC optimizer does some major rewriting of our original code as part of converting it to Assembly Language. Let’s take the upper program that we compiled from C in the last section, give it to Ghidra to decompile, and see whether the result is like our starting source code.

1.
Create a project in Ghidra, import our upper program, and we get an information dialog shown in Figure 15-1.
Figure 15-1
High-level information on the upper executable
2.
Another information window with more detailed data. Click OK to get the main Window.
3.
Right-click the upper executable and select “Open with default tool”. This opens the code analysis window. Click Yes when asked if you want the code analyzed and accept the defaults at the next prompt. Figure 15-2 is the resulting code analysis window. You need to click main in the symbol tree to get the source code to appear.
Figure 15-2
Ghidra analyzing our upper program

Listing 15-4 is the C code that Ghidra generates. The lines above the definition of the main routine were added, so the program will compile and run.

#include <stdio.h>

#define BUFFERSIZE 250

char *tstStr = "This is a test!";

char outStr[BUFFERSIZE];

typedef unsigned char byte;

#define true 1

int main(void)

{

char cVar1;

char *pcVar2;

char *pcVar3;

char *pcVar4;

char *pcVar5;

pcVar2 = tstStr;

pcVar3 = outStr;

pcVar5 = tstStr;

do {

cVar1 = *pcVar5;

pcVar4 = pcVar3;

while( true ) {

pcVar3 = pcVar4 + 1;

pcVar5 = pcVar5 + 1;

if (0x19 < (byte)(cVar1 + 0x9fU)) break;

*pcVar4 = cVar1 + -0x20 ;

cVar1 = *pcVar5;

pcVar4 = pcVar3;

}

*pcVar4 = cVar1;

} while (cVar1 != '');

printf("Input: %s Output: %s ",pcVar2,outStr);

return 0;

}

Listing 15-4

C code created by Ghidra for our upper C program

4.
Run the program. The expected output is

smist08@kali:~/asm64/Chapter 15$ make
gcc -O3 -o upperghidra upperghidra.c
smist08@kali:~/asm64/Chapter 15$ ./upperghidra
Input: This is a test!
Output: THIS IS A TEST!
smist08@kali:~/asm64/Chapter 15$

The code produced isn’t pretty. The variable names are generated. It knows tstStr and outStr , because these are global variables. The logic is in smaller steps, often each C statement being the equivalent of a single Assembly instruction. When trying to figure out a program you don’t have the source code for, having a couple of different viewpoints is a great help.

Note

This technique only works for true compiled languages like C, Fortran, or C++. It does not work for interpreted languages like Python or JavaScript; it also doesn’t work for partially compiled languages that use a virtual machine architecture like Java or C#. There are other tools for these and often these are much more effective, since the compile step doesn’t do as much.

Summary

In this chapter, we reviewed where we can find some sample Assembly source code in the Linux kernel and the GCC runtime library. We looked at the Linux kernels copy_page function to see how that works. We wrote a C version of our upper-case program, so we could compare the Assembly code that the C compiler produces and compare it to what we have written.

We then looked at the sophisticated Ghidra program for decompiling programs to reverse the process and see what it produces. Although it produces working C code from Assembly code, it isn’t that easy to read.

In Chapter 16, “Hacking Code,” we’ll look how hackers use Assembly Language knowledge to hack our code and take control of our computers.

Exercises

1.
Manually execute the instructions in Listing 15-1 that perform the loop to ensure you understand how it works and that it performs the correct number of iterations.
2.
Have a look at the Linux kernel library function memchr.S located in arch/arm64/lib. Can you easily follow this code?
3.
The copy_page routine was simpler, because the pages were guaranteed to be aligned. Look at the memcmp.S file in arch/arm64/lib. This routine is more complicated because it doesn’t assume alignment, but wants to use the same efficiencies alignment gives you. It needs to handle the first non-aligned bytes, then the main block that’s aligned, and any leftover bytes. Understanding this routine is more challenging.
4.
Rewrite the loop in one of the versions of the upper-case routine to use a CBNZ or CBZ instruction for its main loop.
5.
Compile the C code generated by the Ghidra disassembler in Listing 15-4. Then run objdump on the output and compare it to the original Assembly code in Listing 15-3. Is this what you expected?
6.
Examine one of the smaller executables from /usr/bin, such as head, in Ghidra. Can you figure out how it works and find the main block of code?

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 15. Reading and Understanding Code

Create new playlist

Sign In

Sign Up

15. Reading and Understanding Code

Browsing Linux and GCC Code

Copying a Page of Memory

About the Algorithm

Macros and Kernel Options

Code Created by GCC

Using the CBNZ and CBZ Instructions

Reverse Engineering and Ghidra

Summary

Exercises

Table of Contents for
15. Reading and Understanding Code