S. Smith, Raspberry Pi Assembly Language Programming, https://doi.org/10.1007/978-1-4842-5287-1_15

15. Thumb Code

Stephen Smith, Gibsons, BC, Canada

The Assembly code we’ve been developing is compact compared to high-level languages, because it needs no runtime and each instruction takes only 32 bits. However, in the early days of the ARM processor, there were a lot of complaints that this was too large. People used ARMs in small embedded devices with very limited RAM and needed more compact programs. Others created systems with a 16-bit memory bus that allowed 64K of memory, tiny by today’s standards, and that took two memory cycles to load each 32-bit instruction, slowing down the processor.

ARM took these concerns and applications seriously and developed a 16-bit version of the instruction set, called Thumb code. The original Thumb code was later expanded, and we’ll be looking at the slightly newer Thumb-2 code available on the Raspberry Pi. Even the smallest Raspberry Pi has 512 MB of memory and a 32-bit bus, so it doesn’t face these constraints. However, there is a lot of Thumb code around; it is supported by GCC and produces smaller programs.

Thumb code is implemented in the ARM processor as part of the instruction load and decode part of the pipeline. The ARM instruction decoder converts each 16-bit instruction into a 32-bit counterpart in the CPU, so the execution unit doesn’t know the difference.

In this chapter, we will look at the basics of Thumb-2 code, how we get useful 16-bit instructions, and how we can interoperate between Thumb and normal code.

Note

In the 64-bit instruction world, there is no similar concept. There is no 32-bit Thumb mode, and all instructions are 32 bits long without exception.

16-Bit Instruction Format

We’ve battled with how ARM packs information into 32 bits, giving us problems loading registers with immediate values; we often need two instructions to load a 32-bit value. Won’t this just get worse in 16-bit instructions? The main ways the 16-bit format reduces the number of instruction bits are:

Eliminate conditional instructions; this saves 4 bits. There is a way to do conditional instructions in some cases using the IT instruction.
Allow access to only the lower eight registers (R0–R7). This reduces each register encoding from 4 to 3 bits.
Reduce the number of registers in an instruction.
Reduce the size of immediate constants, usually to whatever is left over; it can be as small as 3 bits.
Eliminate all the pre- and post-indexing addressing modes. You must do this in separate instructions.
Fix the S suffix, which says whether an instruction updates the CPSR, either on or off for each instruction form.

Let’s look at three forms of the 16-bit ADD instruction:

ADDS Rd, Rn, #imm @ imm can be 0–7
ADDS Rd, #imm @ imm can be 0–255
ADDS Rd, Rn, Rm

In the first example, we add an immediate to a register and put the result in a separate destination register. Two registers take 6 bits, so there are only 3 bits left for the immediate value, which must be in the range 0–7.

The second example adds an immediate to a register in place; since there is one less register to encode, there are more bits available for the immediate operand, allowing it to be in the range 0–255.

The registers in all three examples have to be in the range R0–R7, though there are forms of the ADD instruction for adding to SP and adding an immediate constant to PC.

Note

All three examples have the S flag set; it is not optional.
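To make the bit budget concrete, here is a minimal Python sketch that packs the three ADDS forms into 16-bit values. The function names are hypothetical, invented for this illustration; the field positions are the ones used by the 16-bit Thumb encodings, and the first result can be compared with the `1c8a adds r2, r1, #2` line in the objdump output shown in the next section.

```python
# Illustrative encoders for the three 16-bit ADDS forms.
# Function names are hypothetical; only the bit layouts matter.

def _low_reg(r):
    """16-bit forms can only name R0-R7 (3-bit register fields)."""
    if not 0 <= r <= 7:
        raise ValueError("16-bit forms only reach R0-R7")
    return r

def adds_imm3(rd, rn, imm):
    """ADDS Rd, Rn, #imm : two registers leave 3 bits for imm (0-7)."""
    if not 0 <= imm <= 7:
        raise ValueError("immediate must fit in 3 bits")
    return 0x1C00 | (imm << 6) | (_low_reg(rn) << 3) | _low_reg(rd)

def adds_imm8(rdn, imm):
    """ADDS Rd, #imm : one register leaves 8 bits for imm (0-255)."""
    if not 0 <= imm <= 255:
        raise ValueError("immediate must fit in 8 bits")
    return 0x3000 | (_low_reg(rdn) << 8) | imm

def adds_reg(rd, rn, rm):
    """ADDS Rd, Rn, Rm : three 3-bit register fields, no immediate."""
    return 0x1800 | (_low_reg(rm) << 6) | (_low_reg(rn) << 3) | _low_reg(rd)

if __name__ == "__main__":
    print(hex(adds_imm3(2, 1, 2)))   # ADDS R2, R1, #2 -> 0x1c8a
    print(hex(adds_imm8(0, 1)))      # ADDS R0, #1     -> 0x3001
```

Asking for `adds_imm3(2, 1, 8)` or a register above R7 raises an error, which mirrors the Assembler falling back to a 32-bit encoding or rejecting the instruction.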

Calling Thumb Code

In Chapter 4, “Controlling Program Flow,” we noted that the CPSR contained a bit that indicates if the processor is running in Thumb mode. The ARM processor supports running some code in Thumb mode and some as the normal ARM 32-bit instructions we’ve been studying up until now.

In Chapter 6, “Functions and the Stack,” we mentioned that the BX instruction can switch between processor states when it executes. If we want to return from a function written with Thumb instructions to one that isn’t, then we must use the BX instruction; we can’t just POP the return address into PC. If we do, we’ll get an “Illegal Instruction” exception.

There is a matching BLX instruction to call between ARM32 and Thumb code. Both these instructions can go either way between Thumb and ARM32 instructions.

How do the BLX and BX instructions know whether they are branching to Thumb or ARM32 code? The ARM processor uses a trick. All ARM32 instructions must be word aligned, and all Thumb instructions have to be aligned to a 16-bit boundary. That means any address pointing to an instruction must be even, which means the low-order bit isn’t used. The ARM processor uses the low-order bit of an instruction address to indicate if the pointer is to an ARM32 or a Thumb instruction.

This means if you are going to call BLX to call Thumb code, you need to add one to the address. When you do this, LR will be set with the correct address for BX to do the right thing when it returns. This is a bit of a hack, but the ARM processor works hard to get functionality out of every bit.

This holds when you pass the target address to these instructions in a register. If you use the form of BLX where you pass a label, then BLX will always change modes, whether from Thumb to ARM32 or vice versa. This is partly because the label is represented by an offset from the PC in words, so the even/odd trick won’t work.
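The low-order-bit convention for register targets can be sketched in a few lines of Python. This is only a model of the rule described above, not of the processor itself, and the helper names are made up for the illustration.

```python
# Model of how BX/BLX interpret a register operand.
# Helper names are hypothetical.

def thumb_pointer(address):
    """Mark an (even) code address as pointing to Thumb code."""
    return address | 1

def decode_branch_target(value):
    """Split a BX/BLX register operand into (address, instruction set)."""
    is_thumb = bool(value & 1)      # bit 0 selects the instruction set
    address = value & ~1            # the real target always has bit 0 clear
    return address, "Thumb" if is_thumb else "ARM32"

if __name__ == "__main__":
    ptr = thumb_pointer(0x10068)
    print(hex(ptr))                          # 0x10069
    print(decode_branch_target(ptr))         # (65640, 'Thumb')
```

The same function applied to an even value such as 0x10054 reports ARM32 and leaves the address unchanged.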

To see how the Assembler helps us, consider the following code:

@ ARM Code
_start:
l1:     LDR R0, =myfunc
        BLX R0
        ...
        .thumb_func
myfunc:
L2:     ADDS R2, R1, #2
        ...

This code assembles to:

00010054 <_start>:
   10054: e59f001c ldr r0, [pc, #28] ; 10078 <L4+0x6>
   10058: e12fff30 blx r0
   ...
00010068 <myfunc>:
   10068: 1c8a adds r2, r1, #2
   ...
   10078: 00010069 .word 0x00010069

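We see that the LDR instruction loads the word 0x00010069 from the location pc+28. In ARM32 code, PC reads as the address of the current instruction plus 8, so this is 0x10054 + 8 + 28 = 0x10078. The value stored there is the address of myfunc (0x00010068) plus 1; the Assembler set the low-order bit for us because myfunc is marked with .thumb_func.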
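The address arithmetic can be checked with a short Python sketch. It assumes the usual ARM32 rule that PC reads as the instruction address plus 8, and it only handles the PC-relative immediate form of LDR seen in this listing; the function name is hypothetical.

```python
# Where does an ARM32 "LDR Rt, [PC, #imm]" read its literal from?
# Hypothetical helper covering only the PC-relative immediate form.

def ldr_literal_address(instr_addr, encoding):
    imm12 = encoding & 0xFFF          # 12-bit byte offset
    add = (encoding >> 23) & 1        # U bit: 1 = add offset, 0 = subtract
    pc = instr_addr + 8               # PC reads two instructions ahead
    return pc + imm12 if add else pc - imm12

if __name__ == "__main__":
    # 10054: e59f001c  ldr r0, [pc, #28]
    print(hex(ldr_literal_address(0x10054, 0xE59F001C)))   # 0x10078
```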

Thumb-2 Is More than 16 Bits

The original Thumb instruction set was limited to 16-bit instructions except for a handful of exceptions. The newer Thumb-2 variant allows many 32-bit instructions, so you can do much more in Thumb mode. It also adds a new IT instruction, which provides limited conditional execution.

Within Thumb code, if we want to force an instruction to be 32 bits, we can add a .W suffix, for wide; if we want to force the instruction to be 16 bits, we can add a .N suffix, for narrow. There are still limitations on these .W instructions compared to the ARM32 instructions we have used so far, such as no conditional instructions without an IT instruction.

To enable this syntax, we start our source file with the Assembler directive

.syntax unified

This tells the Assembler this file is using all the Thumb-2 features. If we wanted only the old Thumb-1 instructions, then we would start the file with a .thumb directive.

IT Blocks

Thumb code doesn’t support conditional execution; however, with Thumb-2 it was considered important enough to add a new instruction If-Then (IT) to make the following instruction conditional, for example:

IT EQ

ADDEQ R2, R1

Instructions in Thumb-2 are only allowed condition codes when following an IT instruction, and the conditions in the two instructions must be the same.

Note

Originally IT supported IF-THEN-ELSE and allowed up to four following instructions. This functionality is deprecated, meaning it may not be supported in future generations of the ARM processor, so we won’t cover it.

The 16-bit version of the ADD instruction is either ADDS (outside an IT block) or ADD<condition code> (inside an IT block). Other versions will generate a 32-bit instruction.
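As a rough illustration of the single-instruction IT form used in this chapter, the following Python sketch builds the 16-bit IT encoding and checks that the instruction after it carries the same condition. The helper names are hypothetical, and only the one-instruction case (no ELSE, no extra THENs) is modeled.

```python
# Single-instruction IT blocks: encoding and the "same condition" rule.
# Helper names are hypothetical; multi-instruction IT blocks are not modeled.

COND = {"EQ": 0, "NE": 1, "CS": 2, "CC": 3, "MI": 4, "PL": 5, "VS": 6,
        "VC": 7, "HI": 8, "LS": 9, "GE": 10, "LT": 11, "GT": 12, "LE": 13}

def it_single(cond):
    """Encode 'IT <cond>' covering exactly one following instruction."""
    # 0xBF00 is the IT opcode, bits 7-4 hold the condition,
    # and a mask of 0b1000 means "one instruction in the block".
    return 0xBF00 | (COND[cond.upper()] << 4) | 0b1000

def matches_it(it_cond, mnemonic):
    """The conditional instruction must use the same condition as IT."""
    return mnemonic.upper().endswith(it_cond.upper())

if __name__ == "__main__":
    print(hex(it_single("EQ")))          # 0xbf08
    print(matches_it("EQ", "ADDEQ"))     # True
    print(matches_it("EQ", "ADDNE"))     # False
```

The value for `IT LS`, 0xbf98, is the same halfword that objdump shows for the `it ls` line in the listings later in this chapter.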

Uppercase in Thumb-2

How this all works will become clearer with an example. Let’s convert our upper2.s file from Chapter 13, “Conditional Instructions and Optimizing Code,” to Thumb code. We do this by adding two Assembler directives near the top of the file: “.syntax unified” at the start, then “.thumb_func” after the .global directive. The “.thumb_func” directive tells the Assembler that the following function is in Thumb code, so it should assemble it accordingly. It also handles the details of switching between Thumb-2 and ARM32 mode, so we don’t have to.

If we do this to the original upper2.s and compile, we get the error message

pi@raspberrypi:~/asm/Chapter 15 $ make
as -march="armv8-a" -mfpu=neon-vfpv4 upper2.s -o upper2.o
upper2.s: Assembler messages:
upper2.s:27: Error: thumb conditional instruction should be in IT block -- `subls R5,#(97-65)'
make: *** [makefile:14: upper2.o] Error 1
pi@raspberrypi:~/asm/Chapter 15 $

This is expected since we know Thumb code doesn’t support conditional execution. If we add

IT LS

before the SUBLS instruction, then it will compile. Listing 15-1 is our first attempt at Thumb code.

@ Assembler program to convert a string to
@ all uppercase.
@
@ R1 - address of output string
@ R0 - address of input string
@ R4 - original output string for length calc.
@ R5 - current character being processed
@ R6 - R5 minus 'a', to compare < 26.
        .syntax unified
        .global toupper         @ Allow other files to call this
        .thumb_func
toupper: PUSH {R4-R6}           @ Save the registers
        MOV R4, R1
@ Loop until the byte just copied is zero (the null terminator)
loop:   LDRB R5, [R0], #1       @ load character
@ Want to know if 'a' <= R5 <= 'z'
@ First subtract 'a'
        SUB R6, R5, #'a'
@ Now want to know if R6 <= 25
        CMP R6, #25             @ chars are 0-25 after shift
@ If R6 <= 25 (unsigned), the letter is
@ lowercase, so convert it.
        IT LS
        SUBLS R5, #('a'-'A')
        STRB R5, [R1], #1       @ store character
        CMP R5, #0              @ stop on hitting a null
        BNE loop                @ loop if character isn't null
        SUB R0, R1, R4          @ get the length
        POP {R4-R6}             @ Restore the registers
        BX LR                   @ Return to caller

Listing 15-1

Our first attempt at converting upper2.s to Thumb code
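The range check at the heart of this routine (subtract 'a', then do one unsigned compare against 25) may be easier to follow in a high-level sketch. The Python below is only a model of that logic, assuming 32-bit wraparound for the subtraction; it is not a translation of the calling convention, and it does not model the null terminator or the returned length.

```python
# Model of the toupper range check: one subtraction plus one
# unsigned compare replaces the two tests 'a' <= c and c <= 'z'.

MASK32 = 0xFFFFFFFF

def toupper(text: bytes) -> bytes:
    out = bytearray()
    for c in text:
        r6 = (c - ord("a")) & MASK32     # SUB: wraps like a 32-bit register
        if r6 <= 25:                     # CMP R6, #25 then condition LS
            c -= ord("a") - ord("A")     # SUBLS R5, #('a'-'A')
        out.append(c)                    # STRB R5, [R1], #1
    return bytes(out)

if __name__ == "__main__":
    print(toupper(b"AaZz@[`{"))          # b'AAZZ@[`{'
```

Characters below 'a' wrap around to very large unsigned values, so they fail the `<= 25` test just as characters above 'z' do.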

We have to make one modification to main.s; we have to change

BL toupper

to

BLX toupper

Because we placed “.thumb_func” in front of the function definition, the call will be handled correctly by the Assembler.

Now we can compile and run the program, then get the expected output

pi@raspberrypi:~/asm/Chapter 15 $ make
as -march="armv8-a" -mfpu=neon-vfpv4 upper2.s -o upper2.o
ld -o upper2 main.o upper2.o
pi@raspberrypi:~/asm/Chapter 15 $ ./upper2
THIS IS OUR TEST STRING THAT WE WILL CONVERT. AAZZ@[`{
pi@raspberrypi:~/asm/Chapter 15 $

That was too easy. Listing 15-2 is the disassembly of the generated code, produced using objdump.

Disassembly of section .text:

00010074 <_start>:
   10074: e59f002c ldr r0, [pc, #44] ; 100a8 <_start+0x34>
   10078: e59f102c ldr r1, [pc, #44] ; 100ac <_start+0x38>
   1007c: e3a0400c mov r4, #12
   10080: e3a0500d mov r5, #13
   10084: fa000009 blx 100b0 <toupper>
   10088: e1a02000 mov r2, r0
   1008c: e3a00001 mov r0, #1
   10090: e59f1014 ldr r1, [pc, #20] ; 100ac <_start+0x38>
   10094: e3a07004 mov r7, #4
   10098: ef000000 svc 0x00000000
   1009c: e3a00000 mov r0, #0
   100a0: e3a07001 mov r7, #1
   100a4: ef000000 svc 0x00000000
   100a8: 000200e0 .word 0x000200e0
   100ac: 00020120 .word 0x00020120

000100b0 <toupper>:
   100b0: b470 push {r4, r5, r6}
   100b2: 460c mov r4, r1

000100b4 <loop>:
   100b4: f810 5b01 ldrb.w r5, [r0], #1
   100b8: f1a5 0661 sub.w r6, r5, #97 ; 0x61
   100bc: 2e19 cmp r6, #25
   100be: bf98 it ls
   100c0: 3d20 subls r5, #32
   100c2: f801 5b01 strb.w r5, [r1], #1
   100c6: 2d00 cmp r5, #0
   100c8: d1f4 bne.n 100b4 <loop>
   100ca: eba1 0004 sub.w r0, r1, r4
   100ce: bc70 pop {r4, r5, r6}
   100d0: 4770 bx lr

Listing 15-2

Objdump output of our uppercase program

We see the main program at _start contains normal 32-bit code. The only change from the Chapter 13 version is calling BLX instead of BL. The call to BLX will change the processor from ARM32 mode to Thumb mode.
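The `blx 100b0 <toupper>` line is the label form of BLX, where the target is stored as an offset from PC rather than as a pointer with a low-order flag bit. As a sanity check, this Python sketch decodes the word `fa000009` from the listing; the function name is hypothetical, and it assumes the ARM32 convention that PC reads as the instruction address plus 8.

```python
# Decode the target of an ARM32 "BLX label" instruction.
# Hypothetical helper; assumes PC = instruction address + 8.

def blx_label_target(instr_addr, encoding):
    imm24 = encoding & 0x00FFFFFF
    if imm24 & 0x00800000:            # sign-extend the 24-bit word offset
        imm24 -= 1 << 24
    h = (encoding >> 24) & 1          # extra halfword bit for Thumb targets
    pc = instr_addr + 8
    return pc + (imm24 << 2) + (h << 1)

if __name__ == "__main__":
    # 10084: fa000009  blx 100b0 <toupper>
    print(hex(blx_label_target(0x10084, 0xFA000009)))   # 0x100b0
```

The offset counts words, with a single extra bit to reach halfword-aligned Thumb targets, so there is no spare bit to carry the ARM32/Thumb flag; that is why this form always switches modes.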

If we look at the toupper part of the program, we see that nine instructions are 16 bits, but four instructions are 32 bits, for a total of 34 bytes. As a result, we saved 14 bytes over the 48-byte Chapter 13 version, but it seems we can do better.

There are two SUB instructions that are 32 bits; they look simple enough, so why are they 32 bits? The reason is that the 16-bit encodings of ADD and SUB must either have the S suffix or be part of an IT block. If we add the S to these instructions, they become 16 bits, and the change won’t affect the operation of this routine because nothing depends on the flags they set.

The LDRB and STRB instructions are wide because the 16-bit Thumb encodings don’t support post-index updates. We have to move the pointer increments into separate ADDS instructions. The result is two 16-bit instructions rather than one 32-bit instruction, so we go from one instruction to two instructions, but use the same space. We will make this change to show we can make toupper all 16 bits. When we go to force

SUB R6, R5, #'a'

to be 16 bits, we run into the problem that the immediate constant is limited to 3 bits so ‘a’ doesn’t fit. To get around this, we add

MOVS R7, #'a'

near the top and subtract R7 instead. Since we had to break this instruction into two, we don’t save any space here. The S is required to keep this MOV instruction 16 bits.

If we make these changes, we get upper3.s, shown in Listing 15-3.

@ Assembler program to convert a string to
@ all uppercase.
@
@ R1 - address of output string
@ R0 - address of input string
@ R4 - original output string for length calc.
@ R5 - current character being processed
@ R6 - R5 minus 'a', to compare < 26.
@ R7 - the constant 'a'
        .syntax unified
        .global toupper         @ Allow main.s to call.
        .thumb_func
toupper: PUSH {R4-R7}           @ Save the registers
        MOV R4, R1
        MOVS R7, #'a'
@ Loop until the byte just copied is zero (the null terminator)
loop:   LDRB R5, [R0]           @ load character
        ADDS R0, #1             @ increment pointer
@ Want to know if 'a' <= R5 <= 'z'
@ First subtract 'a'
        SUBS R6, R5, R7
@ Now want to know if R6 <= 25
        CMP R6, #25             @ chars are 0-25 after shift
@ If R6 <= 25 (unsigned), the letter is
@ lowercase, so convert it.
        IT LS
        SUBLS R5, #('a'-'A')
        STRB R5, [R1]           @ store character to output str
        ADDS R1, #1             @ increment output pointer
        CMP R5, #0              @ stop on hitting a null
        BNE loop                @ loop if character isn't null
        SUBS R0, R1, R4         @ get the length
        POP {R4-R7}             @ Restore the registers we use.
        BX LR                   @ Return to caller

Listing 15-3

Modified toupper routine that is all 16-bit instructions

To prove it is all 16-bit instructions, we run objdump to get Listing 15-4.

000100b0 <toupper>:
   100b0: b4f0 push {r4, r5, r6, r7}
   100b2: 460c mov r4, r1
   100b4: 2761 movs r7, #97 ; 0x61

000100b6 <loop>:
   100b6: 7805 ldrb r5, [r0, #0]
   100b8: 3001 adds r0, #1
   100ba: 1bee subs r6, r5, r7
   100bc: 2e19 cmp r6, #25
   100be: bf98 it ls
   100c0: 3d20 subls r5, #32
   100c2: 700d strb r5, [r1, #0]
   100c4: 3101 adds r1, #1
   100c6: 2d00 cmp r5, #0
   100c8: d1f5 bne.n 100b6 <loop>
   100ca: 1b08 subs r0, r1, r4
   100cc: bcf0 pop {r4, r5, r6, r7}
   100ce: 4770 bx lr

Listing 15-4

Objdump output of our fully 16-bit toupper function

In summary, the sizes of our various toupper functions are given in Table 15-1.

Table 15-1

Comparison of the sizes of our three toupper routines

Function version	Size (bytes)
Original 32 bits	48
Quick port	34
All 16 bits	32

Overall, we made the routine about a third smaller, which is what you typically attain using Thumb mode code.
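The sizes in Table 15-1 follow directly from the instruction widths in the listings. The Python sketch below redoes that arithmetic; the width lists for the two Thumb versions are read off Listings 15-2 and 15-4, while the count of twelve 32-bit instructions for the original is an assumption inferred from its 48-byte size.

```python
# Recompute Table 15-1 from instruction widths (in bits).

def code_size(widths_in_bits):
    """Total size in bytes of a sequence of instructions."""
    return sum(widths_in_bits) // 8

# Assumption: the ARM32 original is 12 instructions (48 bytes / 4).
original = [32] * 12
# Listing 15-2: nine 16-bit and four 32-bit instructions, in order.
quick_port = [16, 16, 32, 32, 16, 16, 16, 32, 16, 16, 32, 16, 16]
# Listing 15-4: sixteen instructions, all 16 bits.
all_16 = [16] * 16

if __name__ == "__main__":
    for name, widths in [("Original 32 bits", original),
                         ("Quick port", quick_port),
                         ("All 16 bits", all_16)]:
        print(name, code_size(widths))
    saving = 1 - code_size(all_16) / code_size(original)
    print(f"saving: {saving:.0%}")       # saving: 33%
```

Note that the all-16-bit version has more instructions (16 versus 13) than the quick port but is only 2 bytes smaller, which is why the last round of changes bought so little.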

Use the C Compiler

The GNU C compiler can generate Thumb code. There is a switch:

-mthumb

to generate thumb code when compiling. If you switch this on, you will get an error message because the C runtime uses the FPU by default and Thumb-1 instructions don’t have the ability to access the FPU. We need to add the switch

-march="armv8-a"

or at least v6 to have the ability to use Thumb-2 instructions. When we do this, we can compile our C program from Listing 14-4 and compare the code sizes. The code generated by the C compiler differs with the optimization level. Table 15-2 compares the code size of the toupper routine under different compiler options: no optimization, optimized for speed (-O3), and optimized for size (-Os).

Table 15-2

Sizes of toupper routine generated by the C compiler

Instruction set	Optimization	Size (bytes)
ARM	None	148
ARM	-O3	56
ARM	-Os	48
Thumb-2	None	78
Thumb-2	-O3	44
Thumb-2	-Os	36

We see that the Thumb code saves us memory. In the Thumb-2 optimized-for-size case, the compiler could have saved another 2 bytes; it generates the following:

subs r3, #32

uxtb r3, r3

UXTB is zero extend byte. The compiler is worried the SUBS instruction results in a negative number, so it zeros the upper 3 bytes in R3 to keep it as an unsigned byte. However, this can’t happen since we only execute the subtraction if R3 is between ‘a’ and ‘z’.
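A small Python model shows why the UXTB is redundant here. It treats registers as 32-bit unsigned values and UXTB as keeping only the low byte; the helper names are invented for the illustration.

```python
# Model of SUBS followed by UXTB on 32-bit registers.
# Helper names are hypothetical.

MASK32 = 0xFFFFFFFF

def subs32(a, b):
    """32-bit subtraction with wraparound, as in a register."""
    return (a - b) & MASK32

def uxtb(value):
    """Zero extend byte: keep bits 7-0, clear the upper 3 bytes."""
    return value & 0xFF

def uxtb_is_redundant():
    """True if UXTB never changes the result for 'a' through 'z'."""
    return all(uxtb(subs32(c, 32)) == subs32(c, 32)
               for c in range(ord("a"), ord("z") + 1))

if __name__ == "__main__":
    print(uxtb_is_redundant())           # True
    # Only a value below 32 would go negative and need the fix-up:
    print(hex(subs32(10, 32)), hex(uxtb(subs32(10, 32))))  # 0xffffffea 0xea
```

For every lowercase letter the subtraction stays within 65 to 90, so the upper 3 bytes are already zero and the extra instruction changes nothing.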

The code generation is interesting. Unoptimized, almost all the Thumb instructions are 16 bits, but as you turn up the optimization level, more 32-bit instructions creep in. I won’t include the generated Assembly code here, but you can easily change the compile options on the Chapter 14 code to see the results.

Summary

This chapter was a quick overview of the ARM processor’s Thumb mode. This mode allows extremely compact code for devices with limited memory. The Raspberry Pi has lots of memory compared to embedded devices; still, saving memory is always worthwhile. You can generate Thumb code from either Assembly or C source code. The new Thumb-2 instruction set lets you do almost anything you can do in ARM32 code.

Keep in mind that most instructions execute in one cycle whether they are 16 or 32 bits. This means each 16-bit instruction takes less memory but uses the same processing time as the matching 32-bit instruction, and a 32-bit instruction can often do more in a single instruction.
