© Stephen Smith 2020
S. SmithProgramming with 64-Bit ARM Assembly Languagehttps://doi.org/10.1007/978-1-4842-5881-1_15

15. Reading and Understanding Code

Stephen Smith1 
(1)
Gibsons, BC, Canada
 

We’ve now learned quite a bit of ARM 64-bit Assembly Language; one of the things we can do is read another programmer’s code. Reading another programmer’s code is a great way to not only add to our toolkit of tips and tricks but also improve our own coding. We’ll review some places where you can find Assembly source code for the ARM processor. We’ll examine one of the Assembly Language routines from the Linux kernel to learn some new optimization techniques. Then we’ll look at how the GNU C compiler writes Assembly code and how we can analyze it. We’ll look at the NSA’s Ghidra hacking tool that converts Assembly Language code back into C code—at least approximately.

We’ll use our upper-case program to see how the C compiler writes Assembly Language code and then examine how Ghidra can take that code and reconstitute the C code.

Browsing Linux and GCC Code

One of the many nice things about working with Linux and the GNU Compiler Collection is that they are open source. That means you can browse through the source code and peruse the Assembly parts contained there. They are available in the following GitHub repositories:
Clicking the “Clone or download” button and choosing “Download ZIP” is the easiest way to obtain them. Within all this source code, a couple of good folders to review ARM 64-bit Assembly Language source code are
  • Linux kernel
    • arch/arm64/lib

    • arch/arm64/kernel

    • arch/arm64/crypto

  • GCC
    • libgcc/config/aarch64

Note

The arch/arm64/crypto has several cryptographic routines implemented on the NEON Coprocessor cryptographic extensions that won’t be implemented on all processors.

The Assembly source code for these is in *.S files (note the upper-case S). This is so they can include C header files and utilize C preprocessor directives.

We can learn a lot by studying this code. For example, we’ll now look at how the Linux kernel copies pages of memory around.

Copying a Page of Memory

The Linux kernel contains machine-specific code to handle things like the initialization of the CPU, handling interrupts and performing multitasking. It also contains Assembly Language versions of many C runtime functions and other specialty functions that optimize the Linux kernel’s performance.

The Linux kernel does not use the C runtime library. That’s because the C runtime library must be initialized once Linux is running; rather the Linux kernel has copies of some key runtime functions. Furthermore, special machine-specific, highly optimized versions are contained in the arch/arm64/lib folder. There is a lot we can learn from these functions.

The Linux kernel’s virtual memory manager deals with allocating memory to processes in 4K pages. Manipulating these pages efficiently is key to the Linux kernel performing well. We will look at the kernel’s implementation of copying a page from one location to another. This will teach us a little of how Linux kernel functions are implemented and learn a couple of new optimization techniques in the process. This particular function was implemented by ARM Holdings and donated to the Linux kernel, since it is in ARM’s interest that Linux runs well on their processors.

Listing 15-1 is the source code from the Linux 5.6 kernel currently under development; the file is arch/arm64/lib/copy_page.S. Linux kernel source code uses both C and Assembler macros; this routine contains fewer than most, so this code should be largely familiar. Before you read the following code, think of how you might implement an Assembly Language function to copy 4K of data from one place to another.
/* SPDX-License-Identifier: GPL-2.0-only */
/*
 * Copyright (C) 2012 ARM Ltd.
 */
#include <linux/linkage.h>
#include <linux/const.h>
#include <asm/assembler.h>
#include <asm/page.h>
#include <asm/cpufeature.h>
#include <asm/alternative.h>
/*
 * Copy a page from src to dest (both are page aligned)
 *
 * Parameters:
 *     x0 - dest
 *     x1 - src
 */
SYM_FUNC_START(copy_page)
alternative_if ARM64_HAS_NO_HW_PREFETCH
       // Prefetch three cache lines ahead.
       prfm   pldl1strm, [x1, #128]
       prfm   pldl1strm, [x1, #256]
       prfm   pldl1strm, [x1, #384]
alternative_else_nop_endif
       ldp    x2, x3, [x1]
       ldp    x4, x5, [x1, #16]
       ldp    x6, x7, [x1, #32]
       ldp    x8, x9, [x1, #48]
       ldp    x10, x11, [x1, #64]
       ldp    x12, x13, [x1, #80]
       ldp    x14, x15, [x1, #96]
       ldp    x16, x17, [x1, #112]
       add    x0, x0, #256
       add    x1, x1, #128
1:
       tst    x0, #(PAGE_SIZE - 1)
alternative_if ARM64_HAS_NO_HW_PREFETCH
       prfm   pldl1strm, [x1, #384]
alternative_else_nop_endif
       stnp   x2, x3, [x0, #-256]
       ldp    x2, x3, [x1]
       stnp   x4, x5, [x0, #16 - 256]
       ldp    x4, x5, [x1, #16]
       stnp   x6, x7, [x0, #32 - 256]
       ldp    x6, x7, [x1, #32]
       stnp   x8, x9, [x0, #48 - 256]
       ldp    x8, x9, [x1, #48]
       stnp   x10, x11, [x0, #64 - 256]
       ldp    x10, x11, [x1, #64]
       stnp   x12, x13, [x0, #80 - 256]
       ldp    x12, x13, [x1, #80]
       stnp   x14, x15, [x0, #96 - 256]
       ldp    x14, x15, [x1, #96]
       stnp   x16, x17, [x0, #112 - 256]
       ldp    x16, x17, [x1, #112]
       add    x0, x0, #128
       add    x1, x1, #128
       b.ne   1b
       stnp   x2, x3, [x0, #-256]
       stnp   x4, x5, [x0, #16 - 256]
       stnp   x6, x7, [x0, #32 - 256]
       stnp   x8, x9, [x0, #48 - 256]
       stnp   x10, x11, [x0, #64 - 256]
       stnp   x12, x13, [x0, #80 - 256]
       stnp   x14, x15, [x0, #96 - 256]
       stnp   x16, x17, [x0, #112 - 256]
       ret
SYM_FUNC_END(copy_page)
EXPORT_SYMBOL(copy_page)
Listing 15-1

The Linux kernel’s copy page function

I suspect this implementation isn’t what you’d expect to implement. So, let’s go through how this function works and why it’s implemented the way it is.

About the Algorithm

This routine copies 128 bytes at a time by loading 16 64-bit (8-byte) registers with data. Why does it do this? Why not just copy 16 bytes at a time using repeated LDP/STP instructions? There are two reasons for this:
  1. 1.

    Loop unrolling: The code only loops 31 times. This reduces the number of times the loop-related instructions execute.

     
  2. 2.

    Parallel processing: Notice that the code does all the LDP instruction ahead of the STP instructions. This way the instruction pipeline can execute quite a few of these instructions in parallel, since the data isn’t used until much later. If your particular ARM processor has a deep instruction pipeline, this can greatly help.

     
The loop is a bit strange. It uses a TST instruction rather than a CMP instruction to test if we’re done. TST is just like CMP, except it uses ANDS rather than SUBS to do the comparison. Is this being clever for the sake of being clever? Here are a few points about this loop:
  1. 1.

    It adds 256 right away to X0; the destination pointer then must dereference the values with negative offsets. This is necessary since the starting address is on a page boundary, and the test would abort the loop right away if we didn’t add something first. It adds 256 rather than 128, since the first set of LDPs are done before the loop and the last set of STPs are done after the loop. This gives the correct number of iterations.

     
  2. 2.

    This routine uses all the corruptible registers, except for one. This means it doesn’t need to push or pop any registers to or from the stack. Register X19 is still available to use and this could store the original address, so we can test with CMP to see when we hit the end of the page, or it could be used as a regular counter. Perhaps, this would lead to more readable code, without requiring extra overhead.

     
  3. 3.

    The TST is a long distance in the code from the B.NE that uses the result. This can be confusing, since when you see the B.NE, it isn’t obvious who is setting the condition flags for it.

     
  4. 4.

    It relies on the pointers being page aligned (which they’re specified to be).

     
  5. 5.

    It uses 1b as the label rather than something more descriptive. Perhaps, this was a macro at one time, but currently this is a function, so a descriptive label is okay.

     

I think the loop would’ve been better accomplished in a more typical fashion.

Macros and Kernel Options

The macros SYM_FUNC_START, SYM_FUNC_END, and EXPORT_SYMBOL are defined in include/linux/linkage.h. They contain the GNU Assembler directives to ensure the routine is aligned properly and the function name is global.

The macros alternative_if and alternative_else_nop_endif are defined in arch/arm64/include/asm/alternative.h. They provide a configurable mechanism to configure the Linux kernel depending on the exact features that a given processor contains. The folder arch/arm64/include/asm has several interesting Assembly Language include files that are worth looking at.

In this case if the ARM processor has memory prefetch, then we include instructions like
prfm   pldl1strm, [x1, #128]
The preceding instruction asks the processor to load the data stored at this address into the L1 cache. The intent being that when we get to the LDP instructions, the data will already be in the cache and execute faster. The string pldl1strm means
  1. 1.

    pld: Preload the data.

     
  2. 2.

    l1: Load into the L1 cache.

     
  3. 3.

    strm: Stream the data starting at the specified address. It also implies the data will only be used once; then it can be discarded.

     

Similarly, the routine uses STNP to store the register pair. This instruction is the same as STP; the N is a non-temporal hint that we’re done with the cache value. The processor can also use this as a hint that nearby memory addresses will be saved shortly, and it can batch the memory operations together if that helps performance.

We’ve spent quite a bit of time writing our own Assembly Language; let’s have a look at how the GNU C compiler writes Assembly code.

Code Created by GCC

We’ll code our upper-case routine in C and compare the generated code to what we wrote. For this example, we want gcc to do as good a job as possible, so we’ll use the -O3 option for maximal optimization.

We create upper.c from Listing 15-2.
#include <stdio.h>
int mytoupper(char *instr, char *outstr)
{
      char cur;
      char *orig_outstr = outstr;
      do
      {
            cur = *instr;
            if ((cur >= 'a') && (cur <='z'))
            {
                 cur = cur - ('a'-'A');
            }
            *outstr++ = cur;
            instr++;
      } while (cur != '');
      return( outstr - orig_outstr );
}
#define BUFFERSIZE 250
char *tstStr = "This is a test!";
char outStr[BUFFERSIZE];
int main()
{
      mytoupper(tstStr, outStr);
      printf("Input: %s Output: %s ", tstStr, outStr);
      return(0);
}
Listing 15-2

C implementation of the mytoupper routine

We can compile this with
gcc -O3 -o upper upper.c
and then run objdump to see the generated code:
objdump -d upper >od.txt
We get Listing 15-3.
0000000000000690 <main>:
 690:    a9bf7bfd    stp    x29, x30, [sp, #-16]!
 694:    b0000080    adrp   x0, 11000 <__cxa_finalize@GLIBC_2.17>
 698:    90000082    adrp   x2, 10000 <__FRAME_END__+0xf588>
 69c:    910003fd    mov    x29, sp
 6a0:    f9401c01    ldr    x1, [x0, #56]
 6a4:    f947dc44    ldr    x4, [x2, #4024]
 6a8:    aa0103e5    mov    x5, x1
 6ac:    384014a0    ldrb   w0, [x5], #1
 6b0:    91000484    add    x4, x4, #0x1
 6b4:    51018403    sub    w3, w0, #0x61
 6b8:    51008006    sub    w6, w0, #0x20
 6bc:    12001c63    and    w3, w3, #0xff
 6c0:    7100647f    cmp    w3, #0x19
 6c4:    54000128    b.hi   6e8 <main+0x58>  // b.pmore
 6c8:    381ff086    sturb  w6, [x4, #-1]
 6cc:    91000484    add    x4, x4, #0x1
 6d0:    384014a0    ldrb   w0, [x5], #1
 6d4:    51018403    sub    w3, w0, #0x61
 6d8:    51008006    sub    w6, w0, #0x20
 6dc:    12001c63    and    w3, w3, #0xff
 6e0:    7100647f    cmp    w3, #0x19
 6e4:    54ffff29    b.ls   6c8 <main+0x38>  // b.plast
 6e8:    381ff080    sturb  w0, [x4, #-1]
 6ec:    35fffe00    cbnz   w0, 6ac <main+0x1c>
 6f0:    f947dc42    ldr    x2, [x2, #4024]
 6f4:    90000000    adrp   x0, 0 <_init-0x600>
 6f8:    91242000    add    x0, x0, #0x908
 6fc:    97ffffe1    bl     680 <printf@plt>
 700:    52800000    mov    w0, #0x0         // #0
 704:    a8c17bfd    ldp    x29, x30, [sp], #16
 708:    d65f03c0    ret
Listing 15-3

Assembly code generated by the C compiler for our upper-case function

A few things to notice about this listing:
  • The compiler automatically inlined the mytoupper function like our macro version. The mytoupper function is elsewhere in the listing, in case it’s called from another file.

  • The compiler knows about the range optimization and shifted the range, so it only does one comparison. The shift is performed by

    sub   w3, w0, #0x61
  • The compiler sets up a stack frame, but doesn’t use it, because all the variables fit in the corruptible registers. As a result, it only saves and restores the LR and FP registers.

  • The compiler uses the ADRP instruction to load the values of pointers. We covered ADR in Chapter 5, “Thanks for the Memories”; ADRP works like ADR, except that it loads to a 4K page boundary. This means that it has a greater range than ADR, but for humans it’s harder to use. The compiler must set it to a page boundary, which in this case points to C runtime data and then uses cumbersome offsets to get to the correct data. This is good for compilers, not so good for humans to code.

  • The compiler uses the CBNZ instruction , which we’ll discuss shortly.

  • There are a few occurrences of

    and   w3, w3, #0xff
This is to maintain type correctness in C. A C char data type is an unsigned 8-bit number. When we subtract, it could go negative, resulting in the upper 8 bits of W3 being set to 1. This corrects it back to an unsigned quantity. We never did this, because we knew we’d only ever save this as 8 bits using STRB; therefore, we knew the upper bits would be ignored whatever they are.
  • For compiler accesses outstr via register X4. Strangely, it adds one to this first, then references it with a -1 offset. This results in an unnecessary ADD instruction.

  • The compiler always performs the case conversion with

    sub   w6, w0, #0x20

Then based on the comparison, it either saves W6 or W0 depending upon whether the conversion is required or not.

Overall, the compiler did a reasonable job of compiling our code, but there are a few instructions that can be removed. We can certainly see how some hand optimization will help.

This is why many Assembly Language programmers start with C code and then remove any extra instructions. The C code becomes less efficient once it can’t fit all the variables in registers and must start swapping data to and from the stack. This usually happens when the complexity is higher and the need for speed is greater.

In Chapter 8, “Programming GPIO Pins,” we looked at programming the GPIO pins using the GPIO controller’s memory registers. This sort of code confuses the optimizer. Often it needs to be turned off, or it optimizes away the code that accesses these locations. This is because we write to memory locations and never read them and also read memory we never set. There are keywords to help the optimizer; however, Assembly Language can result in quite a bit better code, because you’re working against the C optimizer that doesn’t know what the GPIO controller is doing with this memory.

The listing used the CBNZ instruction that we haven’t seen before; let’s have a look at this along with the matching CBZ instruction.

Using the CBNZ and CBZ Instructions

Consider the set of instructions:
SUB   W1, W1, #1
CMP   W1, #0
B.NE  mylabel
This is typical code in many loops. We have been eliminating the CMP instruction by using SUBS:
SUBS   W1, W1, #1
B.NE   mylabel
Another way to optimize this is with
SUB   W1, W1, #1
CBNZ  W1, mylabel

CBNZ is compare and branch on nonzero. It compares W1 to 0, and if it isn’t 0 yet, then it branches. Not all instructions have an S version like SUBS, and this instruction can be used in those cases. CBZ is the reverse and will branch when the register is 0. These are the only choices; there aren’t versions for any other condition flags.

The compiler doesn’t seem to use SUBS instructions when it generates code. It could have eliminated the CMP instruction by putting an S on the end of one of the SUB instructions.

Reverse Engineering and Ghidra

In the Linux world, most of the programs you encounter are open source, from which you can easily download the source code and study it. There is documentation on how it works, and you are actively encouraged to contribute to the program, perhaps fix bugs or add a new feature.

Suppose we encounter a program that we don’t have the source code for, and we want to know how it works. Perhaps, we want to study it, to see if it contains malware. It might be the case that we are worried about privacy concerns and want to know what information the program sends on the Internet. Maybe, it's a game, and we want to know if there is a secret code we can enter to go into God mode. What is the best way to go about this?

We can examine the Assembly code of any Linux executable using objdump or gdb . We know enough about Assembly that we can make sense of the instructions we encounter. However, this doesn’t help us form a big picture of how the program is structured and it’s time consuming.

There are tools to help with this. Until recently there were only expensive commercial products available; however, the National Security Agency (NSA), yes, that NSA, released a version of the tool that their hackers use to analyze code. It is called Ghidra, named after the three-headed monster that Godzilla fights. This tool lets you analyze compiled programs and includes the ability to decompile a program back into C code. It includes tools to show you the graphs of function calls and the ability to make annotations as you learn things.

You can download Ghidra from https://ghidra-sre.org/. To install it, you unzip it, then run the ghidraRun script if you are on Linux. Ghidra requires the Java runtime; if you don’t have this already installed, you will need to install it for your operating system.

Note

Ghidra requires the 64-bit version of Oracle Java. Some Linux distributions, even though they are 64 bits, have the 32-bit version of Java installed. If you run Ghidra under 32-bit Java, it will work until you try to disassemble some code, at which point the disassembler will fail to run. There’s currently no 64-bit version of Java for ARM, so you need to do this on an Intel or AMD-based computer.

Decompiling an optimized C program is difficult. As we saw in the last section, the GCC optimizer does some major rewriting of our original code as part of converting it to Assembly Language. Let’s take the upper program that we compiled from C in the last section, give it to Ghidra to decompile, and see whether the result is like our starting source code.
  1. 1.
    Create a project in Ghidra, import our upper program, and we get an information dialog shown in Figure 15-1.
    ../images/494415_1_En_15_Chapter/494415_1_En_15_Fig1_HTML.jpg
    Figure 15-1

    High-level information on the upper executable

     
  2. 2.

    Another information window with more detailed data. Click OK to get the main Window.

     
  3. 3.
    Right-click the upper executable and select “Open with default tool”. This opens the code analysis window. Click Yes when asked if you want the code analyzed and accept the defaults at the next prompt. Figure 15-2 is the resulting code analysis window. You need to click main in the symbol tree to get the source code to appear.
    ../images/494415_1_En_15_Chapter/494415_1_En_15_Fig2_HTML.jpg
    Figure 15-2

    Ghidra analyzing our upper program

     
Listing 15-4 is the C code that Ghidra generates. The lines above the definition of the main routine were added, so the program will compile and run.
#include <stdio.h>
#define BUFFERSIZE 250
char *tstStr = "This is a test!";
char outStr[BUFFERSIZE];
typedef unsigned char byte;
#define true 1
int main(void)
{
  char cVar1;
  char *pcVar2;
  char *pcVar3;
  char *pcVar4;
  char *pcVar5;
  pcVar2 = tstStr;
  pcVar3 = outStr;
  pcVar5 = tstStr;
  do {
    cVar1 = *pcVar5;
    pcVar4 = pcVar3;
    while( true ) {
      pcVar3 = pcVar4 + 1;
      pcVar5 = pcVar5 + 1;
      if (0x19 < (byte)(cVar1 + 0x9fU)) break;
      *pcVar4 = cVar1 + -0x20 ;
      cVar1 = *pcVar5;
      pcVar4 = pcVar3;
    }
    *pcVar4 = cVar1;
  } while (cVar1 != '');
  printf("Input: %s Output: %s ",pcVar2,outStr);
  return 0;
}
Listing 15-4

C code created by Ghidra for our upper C program

  1. 4.

    Run the program. The expected output is

    smist08@kali:~/asm64/Chapter 15$ make
    gcc -O3 -o upperghidra upperghidra.c
    smist08@kali:~/asm64/Chapter 15$ ./upperghidra
    Input: This is a test!
    Output: THIS IS A TEST!
    smist08@kali:~/asm64/Chapter 15$
     

The code produced isn’t pretty. The variable names are generated. It knows tstStr and outStr , because these are global variables. The logic is in smaller steps, often each C statement being the equivalent of a single Assembly instruction. When trying to figure out a program you don’t have the source code for, having a couple of different viewpoints is a great help.

Note

This technique only works for true compiled languages like C, Fortran, or C++. It does not work for interpreted languages like Python or JavaScript; it also doesn’t work for partially compiled languages that use a virtual machine architecture like Java or C#. There are other tools for these and often these are much more effective, since the compile step doesn’t do as much.

Summary

In this chapter, we reviewed where we can find some sample Assembly source code in the Linux kernel and the GCC runtime library. We looked at the Linux kernels copy_page function to see how that works. We wrote a C version of our upper-case program, so we could compare the Assembly code that the C compiler produces and compare it to what we have written.

We then looked at the sophisticated Ghidra program for decompiling programs to reverse the process and see what it produces. Although it produces working C code from Assembly code, it isn’t that easy to read.

In Chapter 16, “Hacking Code,” we’ll look how hackers use Assembly Language knowledge to hack our code and take control of our computers.

Exercises

  1. 1.

    Manually execute the instructions in Listing 15-1 that perform the loop to ensure you understand how it works and that it performs the correct number of iterations.

     
  2. 2.

    Have a look at the Linux kernel library function memchr.S located in arch/arm64/lib. Can you easily follow this code?

     
  3. 3.

    The copy_page routine was simpler, because the pages were guaranteed to be aligned. Look at the memcmp.S file in arch/arm64/lib. This routine is more complicated because it doesn’t assume alignment, but wants to use the same efficiencies alignment gives you. It needs to handle the first non-aligned bytes, then the main block that’s aligned, and any leftover bytes. Understanding this routine is more challenging.

     
  4. 4.

    Rewrite the loop in one of the versions of the upper-case routine to use a CBNZ or CBZ instruction for its main loop.

     
  5. 5.

    Compile the C code generated by the Ghidra disassembler in Listing 15-4. Then run objdump on the output and compare it to the original Assembly code in Listing 15-3. Is this what you expected?

     
  6. 6.

    Examine one of the smaller executables from /usr/bin, such as head, in Ghidra. Can you figure out how it works and find the main block of code?

     
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.17.154.171