At this point, if we have done our job properly, you now possess the essential skills to use IDA effectively and, more important, to bend it to your will. The next step, young grasshopper, is to learn to react to the ninja stars that binaries (as opposed to IDA) will throw at you. Depending on your motives for staring at assembly language, you may be very familiar with what you are looking at, or you may never know what you are going to be faced with. If you happen to spend all of your time examining code that was compiled using gcc on a Linux platform, you may become quite familiar with the style of code that it generates. On the other hand, if someone dropped a debug version of a program compiled using Microsoft Visual C++ (VC++) in your lap, you might be completely baffled. Malware analysts in particular are faced with a wide variety of code to examine. Setting aside the topic of obfuscation for the moment, malware analysts are likely to see code created using Visual Basic, Delphi, and Visual C/C++; machine language blobs embedded within documents; and more all in the same afternoon.
In this chapter we will take a brief look at some of the ways that compilers differ as viewed through the IDA looking glass. The intent is not to delve into why compilers differ; rather, we hope to cover some of the ways that those differences manifest themselves in disassembly listings and how you may resolve those differences. Among other things, the compiler and associated options used to build a particular piece of software constitute one data point in profiling the author of that software.
While a wide variety of compilers are available for a wide variety of languages, in this chapter we will primarily utilize compiled C code for our examples, as a large number of C compilers are available for a large number of platforms.
The C switch
statement is a frequent target for compiler optimizations. The goal of these optimizations is to match the switch variable to a valid case label in the most efficient manner possible. The means by which this is achieved typically depends on the nature of the switch
statement’s case labels. When the case labels are widely spread, as in the following example, most compilers generate code to perform a binary search [140]to match the switch variable against one of the cases.
switch (value) { case 1: //code executed when value == 1 break; case 211: //code executed when value == 211 break; case 295: //code executed when value == 295 break; case 462: //code executed when value == 462 break; case 1093: //code executed when value == 1093 break; case 1839: //code executed when value == 1839 break; }
When case labels are closely clustered, preferably sequentially as shown here, compilers generally resolve the switch variable by performing a table lookup[141] to match the switch variable to the address of its associated case.
switch (value) { case 1: //code executed when value == 1 break; case 2: //code executed when value == 2 break; case 3: //code executed when value == 3 break; case 4: //code executed when value == 4 break; case 5: //code executed when value == 5 break; case 6: //code executed when value == 6 break; }
A compiled example of a switch
statement that matches the switch variable against the consecutive cases 1 through 12 is shown here:
.text:00401155 mov edx, [ebp+arg_0] .text:00401158 cmp edx, 0Ch ; switch 13 cases .text:0040115B ja loc_4011F1 ; default .text:0040115B ; jumptable 00401161 case 0 .text:00401161 jmp ds:off_401168[edx*4] ; switch jump .text:00401161 ; --------------------------------------------------------------- .text:00401168 off_401168 dd offset loc_4011F1 ; DATA XREF: sub_401150+11↑r .text:00401168 dd offset loc_40119C ; jump table for switch statement .text:00401168 dd offset loc_4011A1 .text:00401168 dd offset loc_4011A6 .text:00401168 dd offset loc_4011AB .text:00401168 dd offset loc_4011B3 .text:00401168 dd offset loc_4011BB .text:00401168 dd offset loc_4011C3 .text:00401168 dd offset loc_4011CB .text:00401168 dd offset loc_4011D3 .text:00401168 dd offset loc_4011DB .text:00401168 dd offset loc_4011E3 .text:00401168 dd offset loc_4011EB .text:0040119C ; --------------------------------------------------------------- .text:0040119C .text:0040119C loc_40119C: ; CODE XREF: sub_401150+11↑j .text:0040119C ; DATA XREF: sub_401150:off_401168↑o .text:0040119C mov eax, [ebp+arg_4] ; jumptable 00401161 case 1
This example was compiled using the Borland command-line compiler, which IDA well understands. The comments, which IDA inserted during the analysis phase, demonstrate that IDA has a clear understanding that this is a switch
statement. In this example we note that IDA recognizes the switch test , the jump table , and individual cases by value within the code.
As a side note on the use of jump tables to resolve switch cases, note that the table in the previous example contains 13 entries, while the switch
statement is known to test cases 1 through 12 only. In this case, the compiler elected to include an entry for case 0 rather than treating 0 as a special case. The destination for case 0 is the same as the destination for every other value outside the range of 1 to 12 .
A final implementation note concerns the nature of the test performed on the switch variable. For readers less familiar with the x86 instruction set, the test and the associated jump in the succeeding line may appear only to exclude values larger than 12 while failing to account for negative values. If true, this could be disastrous, as using a negative index into the jump table might lead to unintended consequences. Fortunately, the ja
(jump above) instruction treats comparisons as if they were performed on unsigned values; thus −1
(0xFFFFFFFF
) would be seen as 4294967295
, which is much larger than 12 and therefore excluded from the valid range for indexing the jump table.
The same source code compiled using Microsoft Visual C++ results in the disassembly listing shown here:
.text:004013D5 mov ecx, [ebp+var_8] .text:004013D8 sub ecx, 1 .text:004013DB mov [ebp+var_8], ecx .text:004013DE cmp [ebp+var_8], 0Bh ; switch 12 cases .text:004013E2 ja loc_40146E ; jumptable 004013EB default case .text:004013E8 mov edx, [ebp+var_8] .text:004013EB jmp ds:off_401478[edx*4] ; switch jump .text:004013F2 .text:004013F2 loc_4013F2: ; DATA XREF: .text:off_401478?o .text:004013F2 mov eax, [ebp+arg_4] ; jumptable 004013EB case 0 ... ; REMAINDER OF FUNCTION EXCLUDED FOR BREVITY .text:00401477 retn .text:00401477 sub_4013B0 endp .text:00401477 ; ------------------------------------------------------------- .text:00401478 off_401478 dd offset loc_4013F2 ; DATA XREF: sub_4013B0+3B↓r .text:00401478 dd offset loc_4013FA ; jump table for switch statement .text:00401478 dd offset loc_401402 .text:00401478 dd offset loc_40140A .text:00401478 dd offset loc_401415 .text:00401478 dd offset loc_401420 .text:00401478 dd offset loc_40142B .text:00401478 dd offset loc_401436 .text:00401478 dd offset loc_401441 .text:00401478 dd offset loc_40144C .text:00401478 dd offset loc_401458 .text:00401478 dd offset loc_401464
Several differences are apparent when comparing this code with the code generated by the Borland compiler. One obvious difference is that the jump table has been relocated to space immediately following the function containing the switch
statement (as opposed to being embedded within the function itself in the case of the Borland code). Other than providing a cleaner separation of code and data, relocating the jump table in this manner has little effect on the behavior of the program. Despite the different layout of the code, IDA remains capable of annotating the key features of the switch
statement, including the number of cases and the code blocks associated with each case.
A few of the implementation details of the switch
statement include the fact that the switch variable (var_8
in this case) is decremented to shift the range of valid values to 0 through 11 , allowing the variable to be used directly as an index into the jump table without the need to create a dummy slot for the unused case 0. As a result, the first entry (or zero index entry) in the jump table actually refers to the code for switch case 1.
Rounding out our comparison of switch
statements is the following code generated by gcc:
.text:004011FA cmp [ebp+arg_0], 0Ch ; switch 13 cases .text:004011FE ja loc_40129D ; jumptable 00401210 case 0 .text:00401204 mov eax, [ebp+arg_0] .text:00401207 shl eax, 2 .text:0040120A mov eax, ds:off_402010[eax] .text:00401210 jmp eax ; switch jump .text:00401212 .text:00401212 loc_401212: ; DATA XREF: .rdata:off_402010 o .text:00401212 mov eax, [ebp+arg_4] ; jumptable 00401210 case 1 ... ; REMAINDER OF .text SECTION EXCLUDED FOR BREVITY .rdata:00402010 off_402010 dd offset loc_40129D ; DATA XREF: sub_4011ED+1D↑r .rdata:00402010 dd offset loc_401212 ; jump table for switch statement .rdata:00402010 dd offset loc_40121D .rdata:00402010 dd offset loc_401225 .rdata:00402010 dd offset loc_40122D .rdata:00402010 dd offset loc_40123C .rdata:00402010 dd offset loc_40124B .rdata:00402010 dd offset loc_40125A .rdata:00402010 dd offset loc_401265 .rdata:00402010 dd offset loc_401270 .rdata:00402010 dd offset loc_40127B .rdata:00402010 dd offset loc_401287 .rdata:00402010 dd offset loc_401293
This code bears some similarities to the Borland code as seen by the comparison to 12 , the jump table that contains 13 entries, and the use of a pointer to the default case in the case 0 slot of the jump table. As in the Borland code, the address for the case 1 handler can be found at index 1 into the jump table. Notable differences between the gcc code and previous examples include a different style of executing the jump and the fact that the jump table is stored in the read-only data (.rdata
) section of the binary, providing a logical separation between the code associated with the switch
statement and the data required to implement the switch
statement. As in the other two examples, IDA is able to locate and annotate the key elements of the switch statement.
One of the points we are making here is that there is no single correct way to compile source to assembly. Familiarity with code generated by a specific compiler in no way guarantees that you will recognize high-level constructs compiled using an entirely different compiler (or even different versions of the same compiler family). More important, do not assume that something is not a switch
statement simply because IDA fails to add comments to that effect. Like you, IDA is more familiar with the output of some compilers than others. Rather than relying entirely on IDA’s analysis capabilities to recognize commonly used code and data constructs, you should always be prepared to utilize your own skills—your familiarity with a given assembly language, your knowledge of compilers, and your research skills—to properly interpret a disassembly.
[140] For you algorithmic analysis fans, this means that the switch variable is matched after at most log-2N operations, where N is the number of cases contained in the switch
statement.
[141] Again for those analyzing algorithms at home, the use of a table lookup allows the target case to be found in a single operation, which you may recall from your algorithms class is also called constant time or O(1).
3.133.138.14