The Assembly code we’ve been developing produces compact programs compared to high-level languages, since it doesn’t need a runtime and each instruction takes only 32 bits. However, in the early days of the ARM processor, there were many complaints that this was still too large. People used ARMs in small embedded devices with very limited RAM and needed more compact programs. Others built systems with a 16-bit memory bus and as little as 64K of memory, tiny by today’s standards. On those systems, loading each 32-bit instruction took two memory cycles, which slowed down the processor.
ARM took these concerns and applications seriously and developed a 16-bit version of the instruction set, called Thumb code. ARM later expanded the original Thumb instructions, and we’ll be looking at this slightly newer Thumb-2 code, which is available on the Raspberry Pi 2 and later models. The smallest Raspberry Pi has 512 MB of memory and a 32-bit bus, so memory size isn’t a pressing concern. However, there is a lot of Thumb code around; it is supported by GCC and produces smaller programs.
Thumb code is implemented in the instruction fetch and decode stages of the ARM processor’s pipeline. The instruction decoder converts each 16-bit instruction into its 32-bit counterpart, so the execution unit doesn’t know the difference.
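As a rough mental model only (not a description of the actual decoder hardware), here is a C sketch of that expansion for one instruction form. It takes the 16-bit Thumb encoding of ADDS Rd, Rn, #imm3 and produces the equivalent 32-bit ARM32 ADDS with the always (AL) condition. The bit layouts follow my reading of the ARM Architecture Reference Manual, and the function name expand_adds_imm3 is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: if t is the 16-bit Thumb encoding
 * ADDS Rd, Rn, #imm3 (bits 15-9 = 0001110), store the equivalent
 * ARM32 instruction in *out and return true; otherwise return false. */
bool expand_adds_imm3(uint16_t t, uint32_t *out)
{
    if ((t >> 9) != 0x0E)               /* opcode 0001110 */
        return false;

    uint32_t imm3 = (t >> 6) & 0x7u;    /* 3-bit immediate, 0-7 */
    uint32_t rn   = (t >> 3) & 0x7u;    /* source register, R0-R7 */
    uint32_t rd   = t & 0x7u;           /* destination register, R0-R7 */

    *out = (0xEu << 28)                 /* cond = AL: assumes no enclosing IT block */
         | (1u << 25)                   /* I = 1: operand 2 is an immediate */
         | (0x4u << 21)                 /* data-processing opcode 0100 = ADD */
         | (1u << 20)                   /* S = 1: outside IT, this form sets flags */
         | (rn << 16)                   /* Rn field */
         | (rd << 12)                   /* Rd field */
         | imm3;                        /* rotate = 0, imm8 = imm3 */
    return true;
}
```

For example, the 16-bit encoding 0x1C48 (ADDS R0, R1, #1) expands to 0xE2910001, which is how the same instruction is encoded in ARM32 mode.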
In this chapter, we will look at the basics of Thumb-2 code, how we get useful 16-bit instructions, and how we can interoperate between Thumb and normal code.
Note
The 64-bit ARM instruction set has no similar concept: there is no compact Thumb mode, and every instruction is exactly 32 bits long.
16-Bit Instruction Format
To fit instructions into 16 bits, ARM made the following compromises:
Eliminate conditional instructions; this saves 4 bits. In some cases, you can still execute instructions conditionally using the IT instruction.
Allow access only to the lower eight registers, R0–R7. This reduces each register field from 4 to 3 bits.
Reduce the number of registers in an instruction.
Reduce the size of immediate constants, usually to whatever bits are left over; this can be as small as 3 bits.
Eliminate all the pre- and post-indexing addressing modes; you must update the pointer with a separate instruction.
Fix the S suffix, which says whether an instruction updates the CPSR, either on or off for each instruction.
Consider these three forms of the 16-bit ADDS instruction:
ADDS Rd, Rn, #imm @ imm can be 0–7
ADDS Rd, #imm @ imm can be 0–255
ADDS Rd, Rn, Rm
The first example adds an immediate to a register and puts the result in a separate destination register. After encoding the opcode and two registers, only 3 bits are left for the immediate, so it must be in the range 0–7.
The second example adds an immediate to a register in place. Since there is one fewer register field, more bits are available for the immediate operand, allowing it to be in the range 0–255.
In all three examples, the registers must be in the range R0–R7, though there are forms of the ADD instruction for adding to SP and for adding an immediate constant to PC.
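To see where those ranges come from, here is a C sketch that builds the three 16-bit ADDS encodings. The field layouts are my reading of the ARM Architecture Reference Manual’s ADD encodings, and the helper names are hypothetical. The point is the bit budget: a 7-bit opcode plus three 3-bit fields, or a 5-bit opcode plus one register and an 8-bit immediate, each add up to exactly 16 bits.

```c
#include <stdint.h>

/* Only R0-R7 fit in a 3-bit register field. */
static int is_low_reg(unsigned r) { return r <= 7; }

/* ADDS Rd, Rn, #imm3: 0001110 | imm3 | Rn | Rd  (7+3+3+3 = 16 bits) */
int32_t enc_adds_imm3(unsigned rd, unsigned rn, unsigned imm)
{
    if (!is_low_reg(rd) || !is_low_reg(rn) || imm > 7)
        return -1;                      /* doesn't fit the 16-bit form */
    return (int32_t)((0x0Eu << 9) | (imm << 6) | (rn << 3) | rd);
}

/* ADDS Rdn, #imm8: 00110 | Rdn | imm8  (5+3+8 = 16 bits) */
int32_t enc_adds_imm8(unsigned rdn, unsigned imm)
{
    if (!is_low_reg(rdn) || imm > 255)
        return -1;
    return (int32_t)((0x06u << 11) | (rdn << 8) | imm);
}

/* ADDS Rd, Rn, Rm: 0001100 | Rm | Rn | Rd  (7+3+3+3 = 16 bits) */
int32_t enc_adds_reg(unsigned rd, unsigned rn, unsigned rm)
{
    if (!is_low_reg(rd) || !is_low_reg(rn) || !is_low_reg(rm))
        return -1;
    return (int32_t)((0x0Cu << 9) | (rm << 6) | (rn << 3) | rd);
}
```

For instance, enc_adds_imm3(0, 1, 1) gives 0x1C48 (ADDS R0, R1, #1), while an immediate of 8 or a register above R7 returns -1 because the value won’t fit its field.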
Note
All three examples have the S suffix; outside an IT block, it is not optional for these 16-bit forms.
Calling Thumb Code
In Chapter 4, “Controlling Program Flow,” we noted that the CPSR contained a bit that indicates if the processor is running in Thumb mode. The ARM processor supports running some code in Thumb mode and some as the normal ARM 32-bit instructions we’ve been studying up until now.
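In Chapter 6, “Functions and the Stack,” we mentioned that the BX instruction can switch between processor states when it executes. If we want to return from a function written in Thumb instructions to a caller that isn’t, BX LR is the standard way to do it. On ARMv4T, the first architecture with Thumb, popping the return address directly into PC doesn’t switch states, so the caller’s ARM32 code would be executed as Thumb instructions. Later architectures, including those in every Raspberry Pi, do switch states on a POP into PC, but BX makes the switch explicit.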
There is a matching BLX instruction to call between ARM32 and Thumb code. Both these instructions can go either way between Thumb and ARM32 instructions.
How do the BLX and BX instructions know whether they are branching to Thumb or ARM32 code? The ARM processor uses a trick. All ARM32 instructions must be word aligned, and all Thumb instructions have to be aligned to a 16-bit boundary. That means any address pointing to an instruction must be even, which means the low-order bit isn’t used. The ARM processor uses the low-order bit of an instruction address to indicate if the pointer is to an ARM32 or a Thumb instruction.
This means that if you use BLX to call Thumb code through a register, you need to add one to the address. BLX also sets LR so that the matching BX LR returns in the correct mode. This is a bit of a hack, but the ARM processor works hard to get functionality out of every bit.
This applies when you pass the target address to these instructions in a register. If you use the form of BLX that takes a label, BLX always changes modes, from Thumb to ARM32 or vice versa. This is partly because the label is encoded as an offset from the PC, so the even/odd trick doesn’t work.
In the objdump output for our example, we see that the LDR instruction loads 0x00010069 from location pc+28 (0x10078), which is the address of myfunc (0x00010068) plus 1.
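Here is that address arithmetic as a small C sketch. It models what BX and BLX do with a register operand: bit 0 selects Thumb or ARM32, and it’s cleared to form the real branch address. The struct and function names are hypothetical, and the model ignores corner cases such as an ARM32 target that isn’t word aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of how BX/BLX treat a register operand. */
typedef struct {
    bool     thumb;  /* true: execute the target as Thumb code */
    uint32_t addr;   /* address execution actually continues at */
} branch_target;

branch_target decode_bx_target(uint32_t reg)
{
    branch_target t;
    t.thumb = (reg & 1u) != 0;   /* low-order bit selects the state */
    t.addr  = reg & ~1u;         /* cleared: instructions never sit at odd addresses */
    return t;
}

/* The value to load into a register before BLX to call a Thumb function
 * at the (even) address func: set the low-order bit, i.e. add one. */
uint32_t thumb_call_address(uint32_t func)
{
    return func | 1u;
}
```

With the addresses above, thumb_call_address(0x00010068) gives 0x00010069, and decoding 0x00010069 yields Thumb state at 0x00010068.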
Thumb-2 Is More than 16 Bits
The original Thumb instruction set was limited to 16-bit instructions, with a handful of exceptions. The newer Thumb-2 variant allows many 32-bit instructions, so you can do much more in Thumb mode. It also adds a new IT instruction, which provides limited conditional execution.
Within Thumb code, if we want to force an instruction to be 32 bits, we can add a .W suffix, for wide; to force it to be 16 bits, we can add a .N suffix, for narrow. The .W instructions still have limitations compared to the ARM32 instructions we’ve been using; for example, they can’t be conditional unless they follow an IT instruction.
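Since Thumb-2 mixes 16- and 32-bit instructions in the same stream, the decoder has to tell them apart from the first halfword. As I understand the encoding rules in the ARM Architecture Reference Manual, a halfword whose top five bits are 0b11101, 0b11110, or 0b11111 starts a 32-bit instruction, and anything else is a complete 16-bit instruction. This C sketch applies that rule; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the Thumb-2 instruction starting with this halfword. */
int thumb_insn_size(uint16_t first_halfword)
{
    unsigned top5 = first_halfword >> 11;   /* bits 15-11 */
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}

/* Walk a buffer of halfwords and count the instructions it holds. */
size_t count_thumb_insns(const uint16_t *code, size_t halfwords)
{
    size_t count = 0;
    for (size_t i = 0; i < halfwords; i += thumb_insn_size(code[i]) / 2)
        count++;                            /* one instruction per step */
    return count;
}
```

This is also why .N and .W matter: they decide which of the two lengths the Assembler emits for an instruction that has both forms.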
To use Thumb-2, we start the source file with the .syntax unified Assembler directive.
This tells the Assembler this file is using all the Thumb-2 features. If we wanted only the old Thumb-1 instructions, then we would start the file with a .Thumb directive.
IT Blocks
In Thumb-2, an instruction can have a condition code only when it follows an IT instruction, and the condition in the IT instruction and the conditional instruction must match.
Note
Originally IT supported IF-THEN-ELSE and allowed up to four following instructions. This functionality is deprecated, meaning it may not be supported in future generations of the ARM processor, so we won’t mention it.
The 16-bit version of the ADD instruction is either ADDS or, inside an IT block, ADD<condition code>. Other versions generate a 32-bit instruction.
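The IT instruction is itself a 16-bit instruction. As a sketch based on my reading of the ARM Architecture Reference Manual, a single-instruction IT block is encoded as 0xBF, then the 4-bit condition code, then a mask of 0b1000 meaning “one instruction follows.” The enum and function names here are hypothetical; check the results against objdump output from your own code.

```c
#include <stdint.h>

/* ARM condition-code field values (AL omitted). */
enum cond {
    EQ = 0x0, NE = 0x1, CS = 0x2, CC = 0x3, MI = 0x4, PL = 0x5, VS = 0x6,
    VC = 0x7, HI = 0x8, LS = 0x9, GE = 0xA, LT = 0xB, GT = 0xC, LE = 0xD
};

/* Encode "IT <cond>" covering exactly one following instruction:
 * 10111111 | firstcond (4 bits) | mask (4 bits), where mask 1000
 * means the block holds just one instruction. */
uint16_t encode_it(enum cond c)
{
    return (uint16_t)(0xBF00u | ((unsigned)c << 4) | 0x8u);
}
```

For example, IT LE encodes as 0xBFD8 and IT EQ as 0xBF08.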
Uppercase in Thumb-2
How this all works will become clearer with an example. Let’s convert our upper2.s file from Chapter 13, “Conditional Instructions and Optimizing Code,” to Thumb code. To do this, we add Assembly directives to the file: “.syntax unified” at the top, then “.thumb_func” after the .global directive. The “.thumb_func” directive tells the Assembler that the following function is Thumb code, so it assembles it accordingly. It also handles the details of switching between Thumb-2 and ARM32 mode, so we don’t have to.
Our first attempt at converting upper2.s to Thumb code
Because we placed “.thumb_func” in front of the function definition, the Assembler handles calls to it correctly.
Objdump output of our uppercase program
We see that the main program at _start contains normal 32-bit code. The only change from the Chapter 13 version is that it calls toupper with BLX instead of BL. The BLX call switches the processor from ARM32 mode to Thumb mode.
If we look at the toupper part of the program, we see that nine instructions are 16 bits and four are 32 bits, for a total of 34 bytes. As a result, we saved 14 bytes over the 48-byte Chapter 13 version, but it seems we can do better.
Two of the 32-bit instructions are SUB instructions; they look simple enough, so why are they 32 bits? The reason is that the 16-bit forms of ADD and SUB must either have the S suffix or be part of an IT block. If we add the S to these instructions, they become 16 bits, and the extra flag updates don’t affect the operation of this routine.
near the top and subtract R7 instead. Since we had to break this instruction into two, we don’t save any space here. The S is required to keep this MOV instruction 16 bits.
Modified toupper routine that is all 16-bit instructions
Objdump output of our fully 16-bit toupper function
Comparison of the sizes of our three toupper routines
Function version | Size (bytes) |
---|---|
Original 32 bits | 48 |
Quick port | 34 |
All 16 bits | 32 |
Overall, we made the routine a third smaller (from 48 to 32 bytes), which is typical of what you can attain with Thumb code.
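The sizes in the table follow directly from instruction counts: every ARM32 instruction is 4 bytes, and each Thumb-2 instruction is 2 or 4 bytes. This small C sketch (the helper names are hypothetical) reproduces the arithmetic: 48 bytes of ARM32 code is 12 instructions, nine 16-bit plus four 32-bit instructions make 34 bytes, and the all-16-bit version’s 32 bytes corresponds to 16 instructions.

```c
/* Size in bytes of code with n16 16-bit and n32 32-bit instructions. */
unsigned code_size(unsigned n16, unsigned n32)
{
    return 2 * n16 + 4 * n32;
}

/* Percentage saved going from `before` to `after` bytes, rounded down.
 * Assumes after <= before. */
unsigned percent_saved(unsigned before, unsigned after)
{
    return 100 * (before - after) / before;
}
```

Per the table, percent_saved(48, 32) gives 33, the one-third reduction, while the quick port saved 48 − 34 = 14 bytes, or percent_saved(48, 34) = 29 percent.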
Use the C Compiler
Sizes of toupper routine generated by the C compiler
Instruction set | Optimization | Size (bytes) |
---|---|---|
ARM | None | 148 |
ARM | -O3 | 56 |
ARM | -Os | 48 |
Thumb-2 | None | 78 |
Thumb-2 | -O3 | 44 |
Thumb-2 | -Os | 36 |
UXTB is the zero-extend byte instruction. The compiler is worried that the SUBS instruction could produce a negative number, so it zeros the upper 3 bytes of R3 to keep the result an unsigned byte. However, this can’t happen, since we only execute the subtraction if R3 is between ‘a’ and ‘z’.
The code generation is interesting. Unoptimized, almost all the Thumb instructions are 16 bits, but as you turn up the optimization level, more 32-bit instructions creep in. I won’t include the generated Assembly code here, but you can easily change the compile options on the Chapter 14 code to see the results.
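The Chapter 14 C source isn’t reproduced here, so as a hypothetical stand-in, here is a C function with the same logic as our Assembly toupper: copy the string, subtract (‘a’ − ‘A’) from lowercase letters, and return the number of bytes written, including the terminating NUL. The name mytoupper avoids a clash with the C library’s toupper.

```c
/* Copy instr to outstr, converting 'a'-'z' to uppercase.
 * Returns the number of bytes written, including the NUL terminator. */
int mytoupper(const char *instr, char *outstr)
{
    char *start = outstr;
    char cur;

    do {
        cur = *instr++;
        /* Only 'a'..'z' reach the subtraction, so the result is always
         * 'A'..'Z' and never negative; a zero-extend here is redundant. */
        if (cur >= 'a' && cur <= 'z')
            cur -= ('a' - 'A');
        *outstr++ = cur;
    } while (cur != '\0');

    return (int)(outstr - start);
}
```

To compare instruction sets, you could compile it with GCC’s -marm or -mthumb option together with -O0, -O3, or -Os, for example gcc -mthumb -Os -c mytoupper.c, and then inspect the result with objdump -d mytoupper.o. Exact sizes depend on your compiler version and target, so don’t expect them to match the table byte for byte.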
Summary
This chapter was a quick overview of the ARM processor’s Thumb mode. This mode allows extremely compact code for devices with limited memory. Raspberry Pis have lots of memory compared to embedded devices; still, saving memory is always worthwhile. You can generate Thumb code from either Assembly or C source code. The newer Thumb-2 instruction set lets you do almost anything you can do in ARM32 code.
Keep in mind that most instructions execute in one cycle, whether they are 16 or 32 bits. So a 16-bit instruction takes less memory but the same processing time as its 32-bit counterpart, which can often do more in a single instruction.