© Stephen Smith 2019
S. Smith, Raspberry Pi Assembly Language Programming, https://doi.org/10.1007/978-1-4842-5287-1_15

15. Thumb Code

Stephen Smith, Gibsons, BC, Canada

The Assembly code we’ve been developing is compact compared to high-level languages because it needs no runtime and each instruction takes only 32 bits. However, in the early days of the ARM processor, there were many complaints that even this was too large. People used ARM processors in small embedded devices with very limited RAM and needed more compact programs. Others built systems with a 16-bit memory bus that allowed 64K of memory, tiny by today’s standards, and that took two memory cycles to load each 32-bit instruction, slowing down the processor.

ARM took these concerns and applications seriously and developed a 16-bit version of the instruction set, called Thumb code. The original Thumb code was later expanded, and we’ll be looking at the slightly newer Thumb-2 code available on the Raspberry Pi. Even the smallest Raspberry Pi has 512 MB of memory and a 32-bit bus, so it doesn’t face these constraints. However, there is a lot of Thumb code around; it is supported by GCC and produces smaller programs.

Thumb code is implemented in the ARM processor as part of the instruction load and decode part of the pipeline. The ARM instruction decoder converts each 16-bit instruction into a 32-bit counterpart in the CPU, so the execution unit doesn’t know the difference.

In this chapter, we will look at the basics of Thumb-2 code, how we get useful 16-bit instructions, and how we can interoperate between Thumb and normal code.

Note

The 64-bit instruction set has no similar concept: there is no Thumb mode, and all instructions are 32 bits long without exception.

16-Bit Instruction Format

We’ve battled with how ARM packs information into 32 bits, giving us problems loading registers with immediate values; we often need two instructions to load a 32-bit value. Won’t this just get worse in 16-bit instructions? The main ways the 16-bit format saves instruction bits are
  • Eliminate conditional instructions; this saves 4 bits. There is a way to do conditional instructions in some cases using the IT instruction.

  • Only allow access to the lower eight registers. This reduces each register encoding from 4 to 3 bits.

  • Reduce the number of registers in an instruction.

  • Reduce the size of immediate constants, usually to whatever is left over; it can be as small as 3 bits.

  • Eliminate all the pre- and post-indexing addressing modes. You must do this in separate instructions.

  • Remove the optional S suffix; whether an instruction updates the CPSR is fixed either on or off.

Let’s look at three forms of the 16-bit ADD instruction:
  • ADDS Rd, Rn, #imm @ imm can be 0–7

  • ADDS Rd, #imm @ imm can be 0–255

  • ADDS Rd, Rn, Rm

In the first form, we add an immediate to a register and put the result in a separate destination register. That leaves only 3 bits for the immediate, so it must be in the range 0–7.

The second form adds an immediate to a register in place; since there is one less register to encode, there are more bits available for the immediate operand, allowing it to be in the range 0–255.

The registers in all three forms have to be in the range R0–R7, though there are forms of the ADD instruction for adding to SP and adding an immediate constant to PC.

Note

All three examples have the S flag set; it is not optional.
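To make the bit budget concrete, here is a minimal Python sketch that packs the three 16-bit ADDS forms into their bit fields. Python is used only for the arithmetic, and the function names are our own illustrations, not part of any toolchain. It reproduces 1c8a, the encoding that objdump reports in this chapter for ADDS R2, R1, #2.

```python
# Sketch: packing the three 16-bit ADDS forms into their bit fields.
# Function names are hypothetical; only R0-R7 fit in a 3-bit register field.

def _low_reg(r):
    if not 0 <= r <= 7:
        raise ValueError("16-bit ADDS forms only reach R0-R7")
    return r

def adds_imm3(rd, rn, imm):
    """ADDS Rd, Rn, #imm  ->  0001110 imm3 Rn Rd"""
    if not 0 <= imm <= 7:
        raise ValueError("immediate must fit in 3 bits (0-7)")
    return 0x1C00 | (imm << 6) | (_low_reg(rn) << 3) | _low_reg(rd)

def adds_imm8(rdn, imm):
    """ADDS Rdn, #imm  ->  00110 Rdn imm8"""
    if not 0 <= imm <= 255:
        raise ValueError("immediate must fit in 8 bits (0-255)")
    return 0x3000 | (_low_reg(rdn) << 8) | imm

def adds_reg(rd, rn, rm):
    """ADDS Rd, Rn, Rm  ->  0001100 Rm Rn Rd"""
    return 0x1800 | (_low_reg(rm) << 6) | (_low_reg(rn) << 3) | _low_reg(rd)

print(hex(adds_imm3(2, 1, 2)))  # ADDS R2, R1, #2 -> 0x1c8a
print(hex(adds_imm8(0, 1)))     # ADDS R0, #1     -> 0x3001
```

Asking adds_imm3 for an immediate such as 97 raises an error, which mirrors the Assembler having to fall back to a 32-bit encoding when the constant doesn’t fit.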

Calling Thumb Code

In Chapter 4, “Controlling Program Flow,” we noted that the CPSR contained a bit that indicates if the processor is running in Thumb mode. The ARM processor supports running some code in Thumb mode and some as the normal ARM 32-bit instructions we’ve been studying up until now.

In Chapter 6, “Functions and the Stack,” we mentioned that the BX instruction can switch between processor states when it executes. If we want to return from a function written with Thumb instructions to one that isn’t, then we must use the BX instruction; we can’t just POP the return address into PC. If we do, we’ll get an “Illegal Instruction” exception.

There is a matching BLX instruction to call between ARM32 and Thumb code. Both these instructions can go either way between Thumb and ARM32 instructions.

How do the BLX and BX instructions know whether they are branching to Thumb or ARM32 code? The ARM processor uses a trick. All ARM32 instructions must be word aligned, and all Thumb instructions have to be aligned to a 16-bit boundary. That means any address pointing to an instruction must be even, which means the low-order bit isn’t used. The ARM processor uses the low-order bit of an instruction address to indicate if the pointer is to an ARM32 or a Thumb instruction.

This means if you are going to call BLX to call Thumb code, you need to add one to the address. When you do this, LR will be set with the correct address for BX to do the right thing when it returns. This is a bit of a hack, but the ARM processor works hard to get functionality out of every bit.

This holds when you pass the target address to these instructions in a register. If you use the form of BLX where you pass a label, then BLX will always change modes, whether from Thumb to ARM32 or vice versa. This is partly because the label is represented by an offset from the PC in words, so the even/odd trick won’t work.
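The low-order-bit convention for register targets can be sketched in a few lines of Python. The helper names are hypothetical and only illustrate the arithmetic; this is not a model of the full processor behavior.

```python
# Sketch: how the low-order bit of a BX/BLX register operand selects the mode.
# Helper names are hypothetical.

def thumb_pointer(address):
    """Mark an even instruction address as pointing to Thumb code."""
    return address | 1

def split_interworking_address(value):
    """Return (instruction_address, is_thumb) for a BX/BLX register operand."""
    return value & ~1, bool(value & 1)

target, is_thumb = split_interworking_address(0x00010069)
print(hex(target), is_thumb)  # 0x10068 True
```

An even value such as 0x10054 comes back unchanged with is_thumb set to False, so the same pointer format serves both instruction sets.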

To see how the Assembler helps us, consider the following code:
@ ARM Code
_start:
l1:  LDR   R0, =myfunc
     BLX   R0
...
.thumb_func
myfunc:
L2:  ADDS  R2, R1, #2
...
The ARM code will compile as
00010054 <_start>:
   10054:  e59f001c    ldr   r0, [pc, #28]   ; 10078 <L4+0x6>
   10058:  e12fff30    blx   r0
...
00010068 <myfunc>:
   10068:  1c8a        adds  r2, r1, #2
...
   10078:  00010069    .word 0x00010069

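We see that the LDR instruction loads 0x00010069 from the location pc+28 (0x10078). In ARM mode the PC reads as the address of the current instruction plus 8, so 0x10054 + 8 + 28 = 0x10078. The value loaded is the address of myfunc (0x00010068) plus 1.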
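As a cross-check of the address arithmetic, the following Python sketch decodes the offset field of an ARM-mode PC-relative LDR and computes the location it reads. The function name is our own, and the sketch assumes the instruction really is an LDR with PC as the base register.

```python
# Sketch: where an ARM-mode "LDR Rt, [PC, #imm]" reads its literal from.
# Assumes the encoding is a PC-relative LDR; the function name is hypothetical.

def ldr_literal_address(instr_addr, encoding):
    imm12 = encoding & 0xFFF       # low 12 bits hold the byte offset
    add = (encoding >> 23) & 1     # U bit: 1 = add the offset, 0 = subtract
    pc = instr_addr + 8            # in ARM mode, PC reads two instructions ahead
    return pc + imm12 if add else pc - imm12

print(hex(ldr_literal_address(0x10054, 0xE59F001C)))  # 0x10078
```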

Thumb-2 Is More than 16 Bits

The original Thumb instruction set was limited to 16-bit instructions, with only a handful of exceptions. The newer Thumb-2 variant allows many 32-bit instructions, so you can do much more in Thumb mode. It also adds a new IT instruction, which provides limited conditional execution.

Within Thumb code if we want to force an instruction to be 32 bits, we can add a .W suffix, for wide, or if we want to force the instruction to be 16 bits, we can add a .N suffix, for narrow. There are still limitations on these .W instructions compared to what we have done, like no conditional instructions without an IT instruction.

To enable this syntax, we start our source file with the Assembler directive
.syntax unified

This tells the Assembler this file is using all the Thumb-2 features. If we wanted only the old Thumb-1 instructions, then we would start the file with a .thumb directive.

IT Blocks

Thumb code doesn’t support conditional execution; however, with Thumb-2 it was considered important enough to add a new instruction, If-Then (IT), which makes the following instruction conditional. For example:
IT    EQ
ADDEQ R2, R1

Instructions in Thumb-2 are only allowed condition codes when following an IT instruction, and the conditions in the two instructions must be the same.
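For the single-instruction form of IT used in this chapter, the encoding is simple enough to sketch in Python. The table and function names are our own; the sketch covers only an IT that guards exactly one instruction. It reproduces bf98, which is what objdump shows for IT LS in the listings in this chapter.

```python
# Sketch: encoding an IT instruction that covers exactly one instruction.
# Layout: 1011 1111 <cond> 1000. Names here are hypothetical.

CONDITION_CODES = {
    "EQ": 0, "NE": 1, "CS": 2, "CC": 3, "MI": 4, "PL": 5, "VS": 6,
    "VC": 7, "HI": 8, "LS": 9, "GE": 10, "LT": 11, "GT": 12, "LE": 13,
}

def encode_single_it(cond):
    """Return the 16-bit encoding of IT <cond> for one following instruction."""
    return 0xBF00 | (CONDITION_CODES[cond] << 4) | 0b1000

print(hex(encode_single_it("LS")))  # 0xbf98
print(hex(encode_single_it("EQ")))  # 0xbf08
```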

Note

Originally IT supported IF-THEN-ELSE and allowed up to four following instructions. This functionality is deprecated, meaning it may not be supported in future generations of the ARM processor, so we won’t mention it.

The 16-bit version of the ADD instruction is either ADDS or ADD<condition code>. Other versions will generate a 32-bit instruction.

Uppercase in Thumb-2

How this all works will become clearer with an example. Let’s convert our upper2.s file from Chapter 13, “Conditional Instructions and Optimizing Code,” to Thumb code. We do this by adding Assembler directives to the file: “.syntax unified” at the top, then “.thumb_func” after the .global directive. The “.thumb_func” directive tells the Assembler that the following function is in Thumb code, so it should be assembled accordingly. It also handles the details of switching between Thumb-2 and ARM32 mode, so we don’t have to.

If we do this to the original upper2.s and compile, we get the error message
pi@raspberrypi:~/asm/Chapter 15 $ make
as -march="armv8-a" -mfpu=neon-vfpv4   upper2.s -o upper2.o
upper2.s: Assembler messages:
upper2.s:27: Error: thumb conditional instruction should be in IT block -- `subls R5,#(97-65)'
make: *** [makefile:14: upper2.o] Error 1
pi@raspberrypi:~/asm/Chapter 15 $
This is expected since we know Thumb code doesn’t support conditional execution. If we add
      IT   LS
before the SUBLS instruction, then it will compile. Listing 15-1 is our first attempt at Thumb code.
@
@ Assembler program to convert a string to
@ all uppercase.
@
@ R1 - address of output string
@ R0 - address of input string
@ R4 - original output string for length calc.
@ R5 - current character being processed
@ R6 - character minus 'a' to compare < 26.
@
.syntax unified
.global toupper    @ Allow other files to call this
.thumb_func
toupper:    PUSH  {R4-R6}    @ Save the registers
      MOV   R4, R1
@ Loop until the byte just processed is zero (null)
loop: LDRB  R5, [R0], #1     @ load character
@ Want to know if 'a' <= R5 <= 'z'
@ First subtract 'a'
      SUB   R6, R5, #'a'
@ Now want to know if R6 <= 25
      CMP   R6, #25     @ chars are 0-25 after shift
@ If R6 <= 25 (unsigned) then the letter is
@ lowercase, so convert it.
      IT    LS
      SUBLS R5, #('a'-'A')
      STRB  R5, [R1], #1    @ store character
      CMP   R5, #0          @ stop on hitting a null
      BNE   loop       @ loop if character isn't null
      SUB   R0, R1, R4  @ get the length
      POP   {R4-R6}         @ Restore the registers
      BX    LR         @ Return to caller
Listing 15-1

Our first attempt at converting upper2.s to Thumb code
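The unsigned-comparison trick in the listing above can be modelled in Python to see why one CMP handles both ends of the range. This is a sketch of the logic only; the function name is hypothetical, and the register names in the comments map each line back to the listing.

```python
# Sketch: a Python model of the toupper loop; one unsigned compare tests 'a'..'z'.

def toupper_model(text):
    """Return (converted string, length including the null terminator)."""
    out = bytearray()
    for r5 in text.encode("ascii") + b"\0":    # LDRB R5, [R0], #1
        r6 = (r5 - ord("a")) & 0xFFFFFFFF      # SUB: wraps modulo 2**32
        if r6 <= 25:                           # CMP R6, #25; LS is unsigned <=
            r5 -= ord("a") - ord("A")          # SUBLS R5, #('a'-'A')
        out.append(r5)                         # STRB R5, [R1], #1
        if r5 == 0:                            # CMP R5, #0 / BNE loop
            break
    return out[:-1].decode("ascii"), len(out)  # SUB R0, R1, R4

print(toupper_model("aAzZ@[`{"))  # ('AAZZ@[`{', 9)
```

Characters below 'a' wrap around to very large unsigned values after the subtraction, so they fail the comparison just as characters above 'z' do.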

We have to make one modification to main.s; we have to change
      BL   toupper
to
      BLX  toupper

Because we placed “.thumb_func” in front of the function definition, the call will be handled correctly by the Assembler.

Now we can compile and run the program and get the expected output:
pi@raspberrypi:~/asm/Chapter 15 $ make
as -march="armv8-a" -mfpu=neon-vfpv4   upper2.s -o upper2.o
ld -o upper2 main.o upper2.o
pi@raspberrypi:~/asm/Chapter 15 $ ./upper2
THIS IS OUR TEST STRING THAT WE WILL CONVERT. AAZZ@[`{
pi@raspberrypi:~/asm/Chapter 15 $
That was too easy. Listing 15-2 is the generated Assembly code using objdump.
Disassembly of section .text:
00010074 <_start>:
   10074:  e59f002c    ldr   r0, [pc, #44]   ; 100a8 <_start+0x34>
   10078:  e59f102c    ldr   r1, [pc, #44]   ; 100ac <_start+0x38>
   1007c:  e3a0400c    mov   r4, #12
   10080:  e3a0500d    mov   r5, #13
   10084:  fa000009    blx   100b0 <toupper>
   10088:  e1a02000    mov   r2, r0
   1008c:  e3a00001    mov   r0, #1
   10090:  e59f1014    ldr   r1, [pc, #20]   ; 100ac <_start+0x38>
   10094:  e3a07004    mov   r7, #4
   10098:  ef000000    svc   0x00000000
   1009c:  e3a00000    mov   r0, #0
   100a0:  e3a07001    mov   r7, #1
   100a4:  ef000000    svc   0x00000000
   100a8:  000200e0    .word 0x000200e0
   100ac:  00020120    .word 0x00020120
000100b0 <toupper>:
   100b0:  b470        push  {r4, r5, r6}
   100b2:  460c        mov   r4, r1
000100b4 <loop>:
   100b4:  f810 5b01   ldrb.w    r5, [r0], #1
   100b8:  f1a5 0661   sub.w r6, r5, #97    ; 0x61
   100bc:  2e19        cmp   r6, #25
   100be:  bf98        it    ls
   100c0:  3d20        subls r5, #32
   100c2:  f801 5b01   strb.w     r5, [r1], #1
   100c6:  2d00        cmp   r5, #0
   100c8:  d1f4        bne.n 100b4 <loop>
   100ca:  eba1 0004   sub.w r0, r1, r4
   100ce:  bc70        pop   {r4, r5, r6}
   100d0:  4770        bx    lr
Listing 15-2

Objdump output of our uppercase program

We see the main program at _start contains normal 32-bit code. The only change from the Chapter 13 version is calling BLX instead of BL. The call to BLX will change the processor from ARM32 mode to Thumb mode.
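The label form of BLX stores a PC-relative offset rather than an address, which we can confirm by decoding the fa000009 encoding from the listing. The Python sketch below assumes the ARM-mode BLX (label) layout of 1111 101 H imm24, where the H bit supplies the extra halfword of offset; the function name is our own.

```python
# Sketch: decoding the target of an ARM-mode "BLX label" instruction.
# Assumed layout: 1111 101 H imm24; the function name is hypothetical.

def arm_blx_target(instr_addr, encoding):
    if (encoding >> 25) != 0b1111101:
        raise ValueError("not a BLX (label) encoding")
    h = (encoding >> 24) & 1          # extra halfword, lets Thumb targets be 2-byte aligned
    imm24 = encoding & 0xFFFFFF
    if imm24 & 0x800000:              # sign-extend the 24-bit word offset
        imm24 -= 1 << 24
    offset = (imm24 << 2) + (h << 1)  # words to bytes, plus the halfword
    return instr_addr + 8 + offset    # PC reads as instruction address + 8

print(hex(arm_blx_target(0x10084, 0xFA000009)))  # 0x100b0
```

No low-order bit is needed here, since this form of the instruction always switches to Thumb mode.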

If we look at the toupper part of the program, we see that nine instructions are 16 bits, but four instructions are 32 bits. As a result, we saved 14 bytes over the Chapter 13 version (34 bytes instead of 48), but it seems we can do better.
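We can tally the sizes directly from the encodings in Listing 15-2. The Python sketch below copies the hex column for toupper and counts halfwords; the function name is hypothetical. The ARM32 size is derived on the assumption that the original routine has the same instructions minus the IT, at 4 bytes each.

```python
# Sketch: tallying instruction widths from the toupper encodings in Listing 15-2.

ENCODINGS = [
    "b470", "460c", "f810 5b01", "f1a5 0661", "2e19", "bf98", "3d20",
    "f801 5b01", "2d00", "d1f4", "eba1 0004", "bc70", "4770",
]

def tally(encodings):
    """Return (count of 16-bit, count of 32-bit, total bytes)."""
    sizes = [2 * len(e.split()) for e in encodings]  # each hex group is 2 bytes
    return sizes.count(2), sizes.count(4), sum(sizes)

narrow, wide, total = tally(ENCODINGS)
arm32_total = 4 * (len(ENCODINGS) - 1)  # assumption: same code without the IT
print(narrow, wide, total, arm32_total - total)  # 9 4 34 14
```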

There are two SUB instructions that are 32 bits; they look simple enough, so why are they 32 bits? The reason is that the 16-bit forms of the ADD and SUB instructions must either have the S suffix or be part of an IT block. If we add the S to these instructions, they will become 16 bits, and this won’t affect the operation of this routine.

The LDRB and STRB instructions are wide because the 16-bit Thumb instructions don’t support post-index updates. We have to move the pointer updates into separate ADDS instructions. The result is two 16-bit instructions rather than one 32-bit instruction, so we go from one instruction to two instructions, but use the same space. We will make this change to show we can make toupper all 16 bits. When we go to force
      SUB   R6, R5, #'a'
to be 16 bits, we run into the problem that the immediate constant is limited to 3 bits (0–7), so ‘a’ (97) doesn’t fit. To get around this, we add
      MOVS  R7, #'a'

near the top and subtract R7 instead. Since we had to break this instruction into two, we don’t save any space here. The S is required to keep this MOV instruction 16 bits.

If we make these changes, we get upper3.s, shown in Listing 15-3.
@
@ Assembler program to convert a string to
@ all uppercase.
@
@ R1 - address of output string
@ R0 - address of input string
@ R4 - original output string for length calc.
@ R5 - current character being processed
@ R6 - character minus 'a' to compare < 26.
@
.syntax unified
.global toupper      @ Allow main.s to call.
.thumb_func
toupper:    PUSH  {R4-R7}    @ Save the registers
      MOV   R4, R1
      MOVS  R7, #'a'
@ Loop until the byte just processed is zero (null)
loop: LDRB  R5, [R0]  @ load character
      ADDS  R0, #1    @ increment pointer
@ Want to know if 'a' <= R5 <= 'z'
@ First subtract 'a'
      SUBS  R6, R5, R7
@ Now want to know if R6 <= 25
      CMP   R6, #25     @ chars are 0-25 after shift
@ If R6 <= 25 (unsigned) then the letter is
@ lowercase, so convert it.
      IT    LS
      SUBLS R5, #('a'-'A')
      STRB  R5, [R1]    @ store character to output str
      ADDS  R1, #1      @ increment output pointer
      CMP   R5, #0      @ stop on hitting a null
      BNE   loop        @ loop if character isn't null
      SUBS  R0, R1, R4  @ get the length
      POP   {R4-R7}     @ Restore the registers we use.
      BX    LR          @ Return to caller
Listing 15-3

Modified toupper routine that is all 16-bit instructions

To prove it is all 16-bit instructions, we run objdump to get Listing 15-4.
000100b0 <toupper>:
   100b0:  b4f0      push  {r4, r5, r6, r7}
   100b2:  460c      mov   r4, r1
   100b4:  2761      movs  r7, #97    ; 0x61
000100b6 <loop>:
   100b6:  7805      ldrb  r5, [r0, #0]
   100b8:  3001      adds  r0, #1
   100ba:  1bee      subs  r6, r5, r7
   100bc:  2e19      cmp   r6, #25
   100be:  bf98      it    ls
   100c0:  3d20      subls r5, #32
   100c2:  700d      strb  r5, [r1, #0]
   100c4:  3101      adds  r1, #1
   100c6:  2d00      cmp   r5, #0
   100c8:  d1f5      bne.n 100b6 <loop>
   100ca:  1b08      subs  r0, r1, r4
   100cc:  bcf0      pop   {r4, r5, r6, r7}
   100ce:  4770      bx    lr
Listing 15-4

Objdump output of our fully 16-bit toupper function

In summary, the sizes of our various toupper functions are given in Table 15-1.
Table 15-1

Comparison of the sizes of our three toupper routines

Function version    Size (bytes)
Original 32 bits    48
Quick port          34
All 16 bits         32

Overall, we made the routine about a third smaller, which is what you typically attain using Thumb mode code.

Use the C Compiler

The GNU C compiler can generate Thumb code. There is a switch:
-mthumb
to generate thumb code when compiling. If you switch this on, you will get an error message because the C runtime uses the FPU by default and Thumb-1 instructions don’t have the ability to access the FPU. We need to add the switch
-march="armv8-a"
or at least ARMv6T2, the first architecture version with Thumb-2, to have the ability to use Thumb-2 instructions. When we do this, we can compile our C program from Listing 14-4 and compare the code sizes. The code generated by the C compiler differs based on the optimization level. Table 15-2 is a comparison of the code size of the toupper routine under different compiler options: no optimization, optimized for speed, and optimized for size.
Table 15-2

Sizes of toupper routine generated by the C compiler

Instruction set    Optimization    Size (bytes)
ARM                None            148
ARM                -O3             56
ARM                -Os             48
Thumb-2            None            78
Thumb-2            -O3             44
Thumb-2            -Os             36

We see that the Thumb code saves us memory. In the Thumb-2 optimized-for-size case, the compiler could have saved another 2 bytes; it generates the following:
subs   r3, #32
uxtb   r3, r3

UXTB is the zero-extend byte instruction. The compiler allows for the SUBS instruction producing a negative number, so it zeros the upper 3 bytes in R3 to keep it as an unsigned byte. However, this can’t happen, since we only execute the subtraction if R3 is between ‘a’ and ‘z’.
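A quick Python check supports the claim that the UXTB is redundant here. The sketch models UXTB as masking to the low 8 bits and lists any lowercase letter for which the mask would change the result of subtracting 32; the function names are our own.

```python
# Sketch: checking that UXTB cannot change the result for 'a'..'z'.

def uxtb(value):
    """Model UXTB: keep the low byte and zero the upper 3 bytes."""
    return value & 0xFF

def letters_changed_by_uxtb():
    """Lowercase letters where zero-extension alters (letter - 32)."""
    return [chr(c) for c in range(ord("a"), ord("z") + 1)
            if uxtb(c - 32) != c - 32]

print(letters_changed_by_uxtb())  # []
```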

The code generation is interesting. Unoptimized, almost all the Thumb instructions are 16 bits, but as you turn up the optimization level, more 32-bit instructions creep in. I won’t include the generated Assembly code here, but you can easily change the compile options on the Chapter 14 code to see the results.

Summary

This chapter was a quick overview of the ARM processor’s Thumb mode. This mode allows extremely compact code for devices with limited memory. The Raspberry Pi has lots of memory compared to embedded devices; still, saving memory is always worthwhile. You can generate Thumb code from either Assembly or C source code. The new Thumb-2 instruction set lets you do almost anything you can do in ARM32 code.

Keep in mind that most instructions execute in one cycle whether they are 16 or 32 bits. Each 16-bit instruction takes less memory but uses the same processing time as a matching 32-bit instruction, and the 32-bit instruction can often do more in a single instruction.
