© Stephen Smith 2019
S. Smith, Raspberry Pi Assembly Language Programming, https://doi.org/10.1007/978-1-4842-5287-1_14

14. Reading and Understanding Code

Stephen Smith, Gibsons, BC, Canada

We’ve now learned quite a bit of ARM 32-bit Assembly language, and one of the things we can now do is read another programmer’s code. Reading other programmers’ code is a great way to add to our toolkit of tips and tricks and improve our own coding. We’ll review some places where you can find Assembly source code for the ARM32. Then we’ll look at how the GNU C compiler writes Assembly code and how we can analyze it. Finally, we’ll look at the NSA’s Ghidra hacking tool, which can convert Assembly code back into C code, at least approximately.

We’ll use our uppercase program to see how the C compiler writes Assembly code and then examine how Ghidra can take that code and reconstitute the C code. We’ll also look at how the C compiler deals with the lack of an integer division instruction in older ARM processors.

Raspbian and GCC

One of the many nice things about working with the Raspberry Pi and GNU Compiler Collection is that they are open source. That means you can browse through the source code and peruse the Assembly parts contained there.

They are available in the following GitHub repositories:
Clicking the “Clone or download” button and choosing “Download ZIP” is the easiest way to obtain it. Within all this source code, a couple of good folders to peruse ARM 32-bit Assembly source code are
  • Raspbian Linux kernel:
    • arch/arm/common

    • arch/arm/kernel

    • arch/arm/crypto

  • GCC:
    • libgcc/config/arm

Note

The arch/arm/crypto folder has several cryptographic routines implemented on the NEON coprocessor.

The Assembly source code for these is in *.S files (note the uppercase S). Raspbian is based on Debian Linux. Both Debian Linux and GCC support dozens of processor architectures, so when looking for Assembly source code, make sure you look for *.S files in an arm folder. If you are interested, you could compare the ARM 32-bit Assembly files to the files for other processors.

This source code uses both GNU Assembler directives like .MACRO and C preprocessor directives like #define and #ifdef. If you are going to read this source code, it helps to brush up on the C preprocessor.

The GNU compiler supports ARM processors older than those in any Raspberry Pi, as well as configurations of the ARM processor that the Raspberry Pi Foundation never used. For instance, there is a library to implement IEEE 754 floating-point for ARM processors without an FPU. However, all Raspberry Pis do have an FPU, so this isn’t used.

Division Revisited

In Chapter 10, “Multiply, Divide, and Accumulate,” we assumed we had a newer Raspberry Pi and used the newer ARM processor’s SDIV or UDIV instructions. We just left a comment that if you wanted to divide on an older Pi, you should use the FPU or roll your own. We never covered how to roll our own. Another approach is to see what the C compiler does. Consider the simple C program in Listing 14-1.
#include <stdio.h>
int main()
{
      int x = 100;
      int y = 25;
      int z;
      z = x / y;
      printf("%d / %d = %d\n", x, y, z);
      return(0);
}
Listing 14-1

Simple C program that divides two numbers

We can compile this with
gcc -o div div.c

Note

We can’t use any of the -O flag options, because any optimization will remove the expression and the compiler will just plug 4 in for z.

We can look at the generated Assembly code with
objdump -d div
Because we didn’t compile with an -O option, there is a lot of code, but in the middle of the main routine, we see
   10454:  e51b100c   ldr   r1, [fp, #-12]
   10458:  e51b0008   ldr   r0, [fp, #-8]
   1045c:  eb00000b   bl    10490 <__divsi3>
   10460:  e1a03000   mov   r3, r0
which sets up and calls a division routine called __divsi3. The Assembly for the __divsi3 routine is also present in the output from objdump. It is very long and contains code like
   104e0:  e1530f81   cmp   r3, r1, lsl #31
   104e4:  e0a00000   adc   r0, r0, r0
   104e8:  20433f81   subcs r3, r3, r1, lsl #31

repeated 32 times, with the shift amount decreasing by one each time. What’s going on here? Since we can download the source code for gcc and all its libraries, we can look at the source code. If we search for the definition of __divsi3, we will find it in libgcc/config/arm/lib1funcs.S. This source code is confusing, because it contains versions of its routines for different generations of ARM, as well as having versions that use thumb code. We’ll cover thumb code in Chapter 15, “Thumb Code,” but until then we can ignore those parts.

Listing 14-2 is the main part of the division routine.
      ARM_FUNC_START divsi3
      ARM_FUNC_ALIAS aeabi_idiv divsi3
      cmp   r1, #0
      beq   LSYM(Ldiv0)
LSYM(divsi3_skip_div0_test):
      eor   ip, r0, r1  @ save the sign of the result.
      do_it mi
      rsbmi r1, r1, #0  @ loops below use unsigned.
      subs  r2, r1, #1  @ division by 1 or -1 ?
      beq   10f
      movs  r3, r0
      do_it mi
      rsbmi r3, r0, #0          @ positive dividend value
      cmp   r3, r1
      bls   11f
      tst   r1, r2              @ divisor is power of 2?
      beq   12f
      ARM_DIV_BODY r3, r1, r0, r2
      cmp   ip, #0
      do_it mi
      rsbmi r0, r0, #0
      RET
Listing 14-2

Main part of the libgcc division routine

The routine starts by checking for division by 0, which is an error. It then looks for the easy cases of division by 1 or –1, then the other cases of dividing by a power of 2. It also saves the sign bits so the answer can be set properly at the end.
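The checks described above can be sketched in C. This is only an illustration of the control flow in Listing 14-2, not the libgcc source. The function name sdiv32_sketch is made up, the value returned for division by zero is a placeholder, the handling behind labels 10, 11, and 12 (which the listing does not show) is assumed from the comments and branch conditions, and the C / operator stands in for the ARM_DIV_BODY macro.

```c
#include <stdint.h>

/* Hypothetical C sketch of the control flow in Listing 14-2. */
int32_t sdiv32_sketch(int32_t dividend, int32_t divisor)
{
    if (divisor == 0)
        return 0;   /* placeholder: the real routine branches to Ldiv0 */

    /* eor ip, r0, r1: the result is negative if the signs differ. */
    int negative = (dividend ^ divisor) < 0;

    /* The loops work on unsigned magnitudes (rsbmi ..., #0). */
    uint32_t n = dividend < 0 ? 0u - (uint32_t)dividend : (uint32_t)dividend;
    uint32_t d = divisor < 0 ? 0u - (uint32_t)divisor : (uint32_t)divisor;
    uint32_t q;

    if (d == 1) {                       /* division by 1 or -1 */
        q = n;
    } else if (n < d) {                 /* cmp r3, r1 / bls: quotient is 0 */
        q = 0;
    } else if (n == d) {                /* same magnitude: quotient is 1 */
        q = 1;
    } else if ((d & (d - 1)) == 0) {    /* tst r1, r2: power of 2 */
        unsigned shift = 0;
        while ((d >> shift) != 1u)
            shift++;
        q = n >> shift;                 /* a shift replaces the division */
    } else {
        q = n / d;                      /* stand-in for ARM_DIV_BODY */
    }

    /* Reapply the saved sign at the end (cmp ip, #0 / rsbmi r0, r0, #0). */
    return negative ? (int32_t)(0u - q) : (int32_t)q;
}
```

Calling sdiv32_sketch(-64, 8) takes the power-of-2 path, while sdiv32_sketch(100, 25) falls through to the general case.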

There are a lot of macros used in this code. The one that generates the actual division is ARM_DIV_BODY, shown in Listing 14-3.
.macro ARM_DIV_BODY dividend, divisor, result, curbit
      clz   \curbit, \dividend
      clz   \result, \divisor
      sub   \curbit, \result, \curbit
      rsbs  \curbit, \curbit, #31
      addne \curbit, \curbit, \curbit, lsl #1
      mov   \result, #0
      addne pc, pc, \curbit, lsl #2
      nop
      .set  shift, 32
      .rept 32
      .set  shift, shift - 1
      cmp   \dividend, \divisor, lsl #shift
      adc   \result, \result, \result
      subcs \dividend, \dividend, \divisor, lsl #shift
      .endr
.endm
Listing 14-3

Main body of the division routine

Within this macro is
      .set  shift, 32
      .rept 32
      .set  shift, shift - 1
      cmp   \dividend, \divisor, lsl #shift
      adc   \result, \result, \result
      subcs \dividend, \dividend, \divisor, lsl #shift
      .endr

which generates the repetitive code we see. This is a form of optimization called loop unrolling, where if a loop executes a fixed number of times, we just duplicate the code that many times. This saves us an expensive branch instruction, as well as the arithmetic calculating the loop index. Division will be used often enough that we want the code as fast as possible, and we can spare the extra code space to achieve this.

The algorithm for this division is basically the same long division algorithm you learned in elementary school. It is just a bit simpler in binary since there can only be two answers at each step, whether to put a 1 in the result or not.
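To make the long division idea concrete, here is a minimal C sketch of the same shift-and-subtract scheme, written as an ordinary loop rather than unrolled. The function name udiv32 and its remainder parameter are hypothetical and do not come from libgcc. The sketch also widens the shifted divisor to 64 bits so it cannot overflow, which is a simplification of this example and not how the Assembly routine works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: unsigned 32-bit binary long division.
 * Each pass mirrors one cmp / adc / subcs triple: test whether the
 * divisor shifted left by `shift` still fits into what remains of the
 * dividend, shift a 1 or a 0 into the quotient, and subtract if it fit. */
uint32_t udiv32(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    assert(divisor != 0);   /* the real routine handles this case separately */

    uint32_t quotient = 0;
    for (int shift = 31; shift >= 0; shift--) {
        /* Widened to 64 bits so the shifted divisor never overflows. */
        uint64_t shifted = (uint64_t)divisor << shift;

        quotient <<= 1;                     /* make room for the next bit */
        if (shifted <= dividend) {
            dividend -= (uint32_t)shifted;  /* it fits: subtract ...      */
            quotient |= 1;                  /* ... and record a 1         */
        }
    }
    if (remainder != NULL)
        *remainder = dividend;              /* what is left over */
    return quotient;
}
```

For 100 divided by 25, the first 29 passes record a 0, the pass with a shift of 2 subtracts 100 and records a 1, and the last two passes record 0s, giving binary 100, which is 4.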

Note

If we included the -march=armv8-a compiler switch, then the compiler would use an SDIV instruction instead of this function call. GCC will use advanced ARM features if it knows they are available.

Sadly, the Assembly source code contained in gcc and Linux isn’t always as well documented as we would like, but it does give us quite a bit of source code to ponder and learn from.

You might want to look at ieee754-sf.S and ieee754-df.S in the same folder as lib1funcs.S, gcc/libgcc/config/arm. These are the implementations of floating-point in single and double precision for ARM processors that don’t have an FPU. It’s interesting to see all the work the FPU does for us.

Code Created by GCC

In the last section, we looked at some code generated by gcc to see how it handles the lack of an SDIV instruction. Now let’s look at how gcc would write our own code. We’ll code our uppercase routine in C and compare the generated code to what we wrote. For this example, we want gcc to do as good a job as possible, so we will use the -O3 option to get maximal optimization.

We create upper.c from Listing 14-4.
#include <stdio.h>
int mytoupper(char *instr, char *outstr)
{
      char cur;
      char *orig_outstr = outstr;
      do
      {
            cur = *instr;
            if ((cur >= 'a') && (cur <='z'))
            {
                  cur = cur - ('a'-'A');
            }
            *outstr++ = cur;
            instr++;
      } while (cur != '\0');
      return( outstr - orig_outstr );
}
#define BUFFERSIZE 250
char *tstStr = "This is a test!";
char outStr[BUFFERSIZE];
int main()
{
      mytoupper(tstStr, outStr);
      printf("Input: %s\nOutput: %s\n", tstStr, outStr);
      return(0);
}
Listing 14-4

C implementation of our mytoupper routine

We can compile this with
      gcc -O3 -o upper upper.c
then run objdump to see the generated code
      objdump -d upper >od.txt
We get Listing 14-5.
00010318 <main>:
   10318:  e59f2048 ldr    r2, [pc, #72]  ; 10368 <main+0x50>
   1031c:  e59f3048 ldr    r3, [pc, #72]  ; 1036c <main+0x54>
   10320:  e92d4010 push   {r4, lr}
   10324:  e5921000 ldr    r1, [r2]
   10328:  e1a02001 mov    r2, r1
   1032c:  e4d24001 ldrb   r4, [r2], #1
   10330:  e2833001 add    r3, r3, #1
   10334:  e2440061 sub    r0, r4, #97    ; 0x61
   10338:  e3500019 cmp    r0, #25
   1033c:  e2440020 sub    r0, r4, #32
   10340:  95430001 strbls r0, [r3, #-1]
   10344:  9afffff8 bls    1032c <main+0x14>
   10348:  e3540000 cmp    r4, #0
   1034c:  e5434001 strb   r4, [r3, #-1]
   10350:  1afffff5 bne    1032c <main+0x14>
   10354:  e59f2010 ldr    r2, [pc, #16]  ; 1036c <main+0x54>
   10358:  e59f0010 ldr    r0, [pc, #16]  ; 10370 <main+0x58>
   1035c:  ebffffe1 bl     102e8 <printf@plt>
   10360:  e1a00004 mov    r0, r4
   10364:  e8bd8010 pop    {r4, pc}
   10368:  00021028 .word  0x00021028
   1036c:  00021030 .word  0x00021030
   10370:  0001050c .word  0x0001050c
Listing 14-5

Assembly code generated by the C compiler for our uppercase function

A few things to notice about this listing are as follows:
  • The compiler automatically inlined the mytoupper function like our macro version.

  • The compiler knows about the range optimization and shifted the range, so it only does one comparison.

  • The compiler made good use of the registers and didn’t create a stack frame. It only uses five registers, so it only needs to push/pop R4.

  • The compiler knows how to use conditional instructions.

  • The compiler took a slightly different approach to adding the conditional, putting it on a store instruction, so the converted character is only stored if the character is lowercase. It then jumps to loop since it knows if it’s lowercase, it can’t be NULL. Otherwise, it falls through, stores the unconverted character, checks for NULL, and loops if it isn’t.
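The range optimization mentioned in the list can be written out in C. The sketch below is an illustration with made-up function names, and it assumes an ASCII character set. Subtracting 'a' and comparing as unsigned folds the two tests into one, because any character below 'a' wraps around to a very large unsigned value. This corresponds to the sub r0, r4, #97 and cmp r0, #25 pair in Listing 14-5.

```c
/* Hypothetical helpers illustrating the range trick; assumes ASCII. */

/* The test as written in Listing 14-4: two comparisons. */
int is_lower_two_tests(unsigned char c)
{
    return (c >= 'a') && (c <= 'z');
}

/* The test as the compiler rewrote it: shift the range down to start
 * at zero, then one unsigned comparison covers both bounds. */
int is_lower_one_test(unsigned char c)
{
    return ((unsigned)c - 'a') <= (unsigned)('z' - 'a');
}

/* Uppercase a single character using the one-comparison test. */
unsigned char to_upper_sketch(unsigned char c)
{
    if (is_lower_one_test(c))
        c = (unsigned char)(c - ('a' - 'A'));
    return c;
}
```

Both tests agree for every possible byte value, so the compiler is free to pick the cheaper one.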

Overall, the compiler did a good job of compiling our code, taking just a couple of extra instructions over what we wrote in the last chapter. GCC has supported the ARM processor for 20 years now. ARM Holdings has made major contributions to GCC to improve the ARM support. All the work over this time has led to a robust and performant system, and the best part is that it is all open source.

This is why many Assembly language programmers start with C code, then only recode in Assembly if the C code isn’t efficient. This usually happens when the complexity is higher and the need for speed is greater, such as the code in libgcc for floating-point arithmetic and division, where speed is crucial, and pure Assembler is better at bit-level manipulations than C.

In Chapter 8, “Programming GPIO Pins,” we looked at programming the GPIO pins using the GPIO controller’s memory registers. This sort of code will confuse the optimizer. Often the optimizer needs to be turned off, or it optimizes away the code that accesses these locations. This is because we write to memory locations and never read them, and read memory we never set. There are keywords to help the optimizer, but in the end, Assembler can result in quite a bit better code, because in C you are working against the optimizer, which doesn’t know what the GPIO controller is doing with this memory.
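One such keyword in standard C is volatile, which tells the compiler that every read and write of an object is observable and must not be removed or merged. The sketch below is a hedged illustration and not code from Chapter 8. The function names are hypothetical, and the caller is assumed to supply a pointer to an already mapped register, or to an ordinary variable when experimenting.

```c
#include <stdint.h>

/* Hypothetical accessors: the volatile qualifier keeps the optimizer
 * from dropping a store that is never read back, or a load of memory
 * the program itself never wrote. */
void reg_write(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}

uint32_t reg_read(volatile uint32_t *reg)
{
    return *reg;
}

/* Write a single 1 bit for the given pin number (0 to 31) to a
 * "set" style register supplied by the caller. */
void set_pin_sketch(volatile uint32_t *set_reg, unsigned pin)
{
    reg_write(set_reg, UINT32_C(1) << pin);
}
```

With an ordinary uint32_t variable standing in for the register, set_pin_sketch(&fake_reg, 17) leaves bit 17 set, and the store survives even at -O3.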

Reverse Engineering and Ghidra

In the Raspbian world, most of the programs you encounter are open source, so you can easily download the source code and study it. There is documentation on how it works, and you are actively encouraged to contribute to the program, perhaps to fix bugs or add a new feature.

Suppose we encounter a program that we don’t have the source code for, and we want to know how it works. Perhaps we want to study it to see if it contains malware. It might be the case that we are worried about privacy concerns and want to know what information the program sends on the Internet. Maybe it's a game, and we want to know if there is a secret code we can enter to go into God mode. What is the best way to go about this?

We can examine the Assembly code of any Linux executable using objdump or gdb. We know enough about Assembly that we can make sense of the instructions we encounter. However, this doesn’t help us form a big picture of how the program is structured and it’s time-consuming.

There are tools to help with this. Until recently there were only expensive commercial products available; however, the NSA, yes, that NSA, released a version of the tool that their hackers use to analyze code. It is called Ghidra, named after the three-headed monster that Godzilla fights. This tool lets you analyze compiled programs and includes the ability to decompile a program back into C code. It includes tools to show you the graphs of function calls and the ability to make annotations as you learn things.

Sadly, Ghidra doesn’t run properly on the Raspberry Pi anymore, even though it is written in Java. The NSA states that Ghidra won’t be supported running on 32-bit operating systems anymore. However, Ghidra still supports analyzing 32-bit programs. It also has full support for the ARM processor. This means we need to transfer our executable file to a computer running a 64-bit operating system, whether it is Linux, macOS, or Windows.

You can download Ghidra from https://ghidra-sre.org/ . To install it, you unzip it, then run the ghidraRun script if you are on Linux. Ghidra requires the Java runtime; if you don’t have this already installed, you will need to install it for your operating system.

Decompiling an optimized C program is difficult. As we saw in the last section, the GCC optimizer does some major rewriting of our original code as part of converting it to Assembly language. Let’s take the upper program that we compiled from C in the last section, give it to Ghidra to decompile, and see whether the result is like our starting source code.

If we create a project in Ghidra, import our upper program, then run the code browser, we get the window shown in Figure 14-1.
Figure 14-1

Ghidra analyzing our upper program

Listing 14-6 is the C code that Ghidra generated. I added the lines above the definition of the main routine, so the program will compile and run.
#include <stdio.h>
#define BUFFERSIZE 250
char *tstStr = "This is a test!";
char outStr[BUFFERSIZE];
typedef unsigned int uint;
typedef unsigned char byte;
typedef void undefined;
#define true 1
uint main(void)
{
  byte bVar1;
  undefined *puVar2;
  byte *pbVar3;
  byte *pbVar4;
  puVar2 = tstStr;
  pbVar3 = tstStr;
  pbVar4 = outStr;
  do {
    while( true ) {
      bVar1 = *pbVar3;
      if (0x19 < (uint)bVar1 - 0x61) break;
      *pbVar4 = bVar1 - 0x20;
      pbVar3 = pbVar3 + 1;
      pbVar4 = pbVar4 + 1;
    }
    *pbVar4 = bVar1;
    pbVar3 = pbVar3 + 1;
    pbVar4 = pbVar4 + 1;
  } while (bVar1 != 0);
  printf("Input: %s\nOutput: %s\n",puVar2,outStr);
  return (uint)bVar1;
}
Listing 14-6

C code created by Ghidra for our upper C program

If we run the program, we get the expected output:
pi@raspberrypi:~/asm/Chapter 14 $ make
gcc -O3 -o upperghidra upperghidra.c
pi@raspberrypi:~/asm/Chapter 14 $ ./upperghidra
Input: This is a test!
Output: THIS IS A TEST!
pi@raspberrypi:~/asm/Chapter 14 $

The code produced isn’t pretty. The variable names are generated. It knows tstStr and outStr, because these are global variables. The logic is in smaller steps, often each C statement being the equivalent of a single Assembly instruction. When trying to figure out a program you don’t have the source code for, having a couple of different viewpoints is a great help.
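As an example of adding our own viewpoint, here is how the loop from Listing 14-6 might look after an analyst renames the generated variables and folds the two nested loops back into one. This is a hand-written sketch, not Ghidra output. The function name and variable names are our own guesses at intent, and the returned length is added only so the behavior is easy to check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-annotated version of the Ghidra loop.
 * src plays the role of pbVar3, dst of pbVar4, and ch of bVar1. */
size_t upper_loop_annotated(const unsigned char *src, unsigned char *dst)
{
    assert(src != NULL && dst != NULL);

    unsigned char *dst_start = dst;
    unsigned char ch;

    do {
        ch = *src++;
        /* 0x61 is 'a' and 0x19 is 25, so this is the a-to-z range test. */
        if ((unsigned)ch - 0x61 <= 0x19)
            ch = (unsigned char)(ch - 0x20);    /* 0x20 is 'a' - 'A' */
        *dst++ = ch;
    } while (ch != 0);      /* the terminating NUL is copied as well */

    return (size_t)(dst - dst_start);           /* bytes written */
}
```

Once the constants are recognized as 'a', 25, and the case offset, the structure of the original mytoupper routine becomes visible again.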

Note

This technique only works for true compiled languages like C, Fortran, or C++. It does not work for interpreted languages like Python or JavaScript; it also doesn’t work for partially compiled languages that use a virtual machine architecture like Java or C#. There are other tools for these and often these are much more effective, since the compile step doesn’t do as much.

Summary

In this chapter, we reviewed where we can find some sample Assembly source code in the Raspbian Linux kernel and the GCC runtime library. We looked at how GCC compiles the division operator from C and what happens when the ARM processor doesn’t support a division instruction. We wrote a C version of our uppercase program, so we could compare the Assembly code that the C compiler produces with what we have written.

We then looked at the sophisticated Ghidra program for decompiling programs to reverse the process and see what it produces. Although it produces working C code from Assembly code, it isn’t that easy to read.

In Chapter 15, “Thumb Code,” we’ll look at thumb code where we reduce the Assembly instruction size from 32 bits to 16 bits.
