At this point it is probably useful to see some examples of scripts that perform specific tasks. For the remainder of the chapter we present some fairly common situations in which a script can be used to answer a question about a database.
Many scripts operate on individual functions. Examples include generating the call tree rooted at a specific function, generating the control flow graph of a function, or analyzing the stack frames of every function in a database. Example 15-1 iterates through every function in a database and prints basic information about each function, including the start and end addresses of the function, the size of the function’s arguments, and the size of the function’s local variables. All output is sent to the output window.
Example 15-1. Function enumeration script
#include <idc.idc>
static main() {
   auto addr, end, args, locals, frame, firstArg, name, ret;
   addr = 0;
   for (addr = NextFunction(addr); addr != BADADDR; addr = NextFunction(addr)) {
      name = Name(addr);
      end = GetFunctionAttr(addr, FUNCATTR_END);
      locals = GetFunctionAttr(addr, FUNCATTR_FRSIZE);
      frame = GetFrame(addr);     // retrieve a handle to the function's stack frame
      ret = GetMemberOffset(frame, " r");  // " r" is the name of the return address
      if (ret == -1) continue;    // no saved return address, skip this function
      firstArg = ret + 4;
      args = GetStrucSize(frame) - firstArg;
      Message("Function: %s, starts at %x, ends at %x\n", name, addr, end);
      Message("   Local variable area is %d bytes\n", locals);
      Message("   Arguments occupy %d bytes (%d args)\n", args, args / 4);
   }
}
This script uses some of IDC’s structure-manipulation functions to obtain a handle to each function’s stack frame (GetFrame), determine the size of the stack frame (GetStrucSize), and determine the offset of the saved return address within the frame (GetMemberOffset). The first argument to the function lies 4 bytes beyond the saved return address, which assumes a 32-bit x86 target. The size of the function’s argument area is computed as the space between the first argument and the end of the stack frame. Since IDA can’t generate stack frames for imported functions, the script tests whether each function’s stack frame contains a saved return address as a simple means of identifying, and skipping, imported functions.
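The argument-area arithmetic can be checked outside IDA. The following is a minimal sketch in plain Python rather than IDC; the function name argument_area and the sample frame values are hypothetical, and it assumes 4-byte stack slots and a 4-byte saved return address, as on 32-bit x86.

```python
def argument_area(frame_size, ret_offset, slot_size=4):
    """Return (bytes, count) for the argument area of a stack frame.

    frame_size mirrors GetStrucSize(frame) and ret_offset mirrors the offset
    of the saved return address (" r"). The values used below are made up.
    """
    first_arg = ret_offset + slot_size   # arguments start just past " r"
    arg_bytes = frame_size - first_arg   # everything up to the end of the frame
    return arg_bytes, arg_bytes // slot_size


# Hypothetical frame: 0x10 bytes of locals, saved ebp, " r", then two arguments.
size_bytes, count = argument_area(frame_size=0x20, ret_offset=0x14)
print("Arguments occupy %d bytes (%d args)" % (size_bytes, count))
```

A frame that ends immediately after the saved return address yields an argument area of zero bytes, which is what the script reports for functions that take no stack arguments.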
Within a given function, you may want to enumerate every instruction. Example 15-2 counts the number of instructions contained in the function identified by the current cursor position:
Example 15-2. Instruction enumeration script
#include <idc.idc>
static main() {
   auto func, end, count, inst;
   func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   if (func != -1) {
      end = GetFunctionAttr(func, FUNCATTR_END);
      count = 0;
      inst = func;
      while (inst < end) {
         count++;
         inst = FindCode(inst, SEARCH_DOWN | SEARCH_NEXT);
      }
      Warning("%s contains %d instructions\n", Name(func), count);
   }
   else {
      Warning("No function found at location %x", ScreenEA());
   }
}
The function begins by using GetFunctionAttr to determine the start address of the function containing the cursor address (ScreenEA()). If the beginning of a function is found, the next step is to determine the end address for the function, once again using the GetFunctionAttr function. Once the function has been bounded, a loop is executed to step through successive instructions in the function by using the search functionality of the FindCode function. In this example, the Warning function is used to display results, since only a single line of output will be generated by the function and output displayed in a Warning dialog is much more obvious than output generated in the message window. Note that this example assumes that all of the instructions within the given function are contiguous. An alternative approach might replace the use of FindCode with logic to iterate over all of the code cross-references for each instruction within the function. Properly written, this second approach would handle noncontiguous, also known as “chunked,” functions.
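The flow-driven alternative amounts to a worklist traversal. The sketch below is plain Python, not IDC, and runs against a toy table instead of an IDA database: the name count_instructions, the flows dictionary, and every address in it are hypothetical stand-ins for the code cross-references that IDA records for each instruction.

```python
def count_instructions(entry, flows):
    """Count the instructions reachable from entry by following code flows.

    flows maps an instruction address to the addresses that control may
    reach next (ordinary flow and jumps, but not calls).
    """
    seen = set()
    worklist = [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen:
            continue                      # already counted, avoid looping forever
        seen.add(addr)
        worklist.extend(flows.get(addr, []))
    return len(seen)


# A toy function whose second chunk (0x2000) is not contiguous with the first.
toy_flows = {
    0x1000: [0x1002],
    0x1002: [0x1005, 0x2000],   # conditional jump into a distant chunk
    0x1005: [],                 # return instruction, no successors
    0x2000: [0x2003],
    0x2003: [0x1005],           # jump back into the first chunk
}
print(count_instructions(0x1000, toy_flows))
```

Because the traversal follows flows and ignores address order, the instructions at 0x2000 and 0x2003 are counted even though they lie outside the address range of the first chunk.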
Iterating through cross-references can be confusing because of the number of functions available for accessing cross-reference data and the fact that code cross-references are bidirectional. In order to get the data you want, you need to make sure you are accessing the proper type of cross-reference for your situation. In our first cross-reference example, shown in Example 15-3, we derive the list of all function calls made within a function by iterating through each instruction in the function to determine if the instruction calls another function. One method of doing this might be to parse the results of GetMnem to look for call instructions. This approach has two drawbacks. First, it would not be very portable, because the instruction used to call a function varies among CPU types. Second, additional parsing would be required to determine exactly which function was being called. Cross-references avoid both difficulties because they are CPU-independent and directly inform us about the target of the cross-reference.
Example 15-3. Enumerating function calls
#include <idc.idc>
static main() {
   auto func, end, target, inst, name, flags, xref;
   flags = SEARCH_DOWN | SEARCH_NEXT;
   func = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   if (func != -1) {
      name = Name(func);
      end = GetFunctionAttr(func, FUNCATTR_END);
      for (inst = func; inst < end; inst = FindCode(inst, flags)) {
         for (target = Rfirst(inst); target != BADADDR; target = Rnext(inst, target)) {
            xref = XrefType();
            if (xref == fl_CN || xref == fl_CF) {
               Message("%s calls %s from 0x%x\n", name, Name(target), inst);
            }
         }
      }
   }
   else {
      Warning("No function found at location %x", ScreenEA());
   }
}
In this example, we must iterate through each instruction in the function. For each instruction, we must then iterate through each cross-reference from the instruction. We are interested only in cross-references that call other functions, so we must test the return value of XrefType looking for fl_CN or fl_CF-type cross-references. Here again, this particular solution handles only functions whose instructions happen to be contiguous. Given that the script is already iterating over the cross-references from each instruction, it would not take many changes to produce a flow-driven analysis instead of the address-driven analysis seen here.
Another use for cross-references is to determine every location that references a particular location. For example, if we wanted to create a low-budget security analyzer, we might be interested in highlighting all calls to functions such as strcpy and sprintf.
In Example 15-4, we work in reverse to iterate across all of the cross-references to (as opposed to from in the preceding example) a particular symbol:
Example 15-4. Enumerating a function’s callers
#include <idc.idc>
static list_callers(bad_func) {
   auto func, addr, xref, source;
   func = LocByName(bad_func);
   if (func == BADADDR) {
      Warning("Sorry, %s not found in database", bad_func);
   }
   else {
      for (addr = RfirstB(func); addr != BADADDR; addr = RnextB(func, addr)) {
         xref = XrefType();
         if (xref == fl_CN || xref == fl_CF) {
            source = GetFunctionName(addr);
            Message("%s is called from 0x%x in %s\n", bad_func, addr, source);
         }
      }
   }
}
static main() {
   list_callers("_strcpy");
   list_callers("_sprintf");
}
In this example, the LocByName function is used to find the address of a given (by name) bad function. If the function’s address is found, a loop is executed in order to process all cross-references to the bad function. For each cross-reference, if the cross-reference type is determined to be a call-type cross-reference, the calling function’s name is determined and is displayed to the user.
It is important to note that some modifications may be required to perform a proper lookup of the name of an imported function. In ELF executables in particular, which combine a procedure linkage table (PLT) with a global offset table (GOT) to handle the details of linking to shared libraries, the names that IDA assigns to imported functions may be less than clear. For example, a PLT entry may appear to be named _memcpy, when in fact it is named .memcpy and IDA has replaced the dot with an underscore because IDA considers dots invalid characters within names. Further complicating matters is the fact that IDA may actually create a symbol named memcpy that resides in a section that IDA names extern. When attempting to enumerate cross-references to memcpy, we are interested in the PLT version of the symbol because this is the version that is called from other functions in the program and thus the version to which all cross-references would refer.
In Chapter 13 we discussed the use of idsutils to generate .ids files that describe the contents of shared libraries. Recall that the first step in generating a .ids file involves generating a .idt file, which is a text file containing descriptions of each exported function contained in the library. IDC contains functions for iterating through the functions that are exported by a shared library. The script shown in Example 15-5 can be run to generate an .idt file after opening a shared library with IDA:
Example 15-5. A script to generate .idt files
#include <idc.idc>
static main() {
   auto entryPoints, i, ord, addr, name, purged, file, fd;
   file = AskFile(1, "*.idt", "Select IDT save file");
   fd = fopen(file, "w");
   entryPoints = GetEntryPointQty();
   fprintf(fd, "ALIGNMENT 4\n");
   fprintf(fd, "0 Name=%s\n", GetInputFile());
   for (i = 0; i < entryPoints; i++) {
      ord = GetEntryOrdinal(i);
      if (ord == 0) continue;
      addr = GetEntryPoint(ord);
      if (ord == addr) {
         continue;   //entry point has no ordinal
      }
      name = Name(addr);
      fprintf(fd, "%d Name=%s", ord, name);
      purged = GetFunctionAttr(addr, FUNCATTR_ARGSIZE);
      if (purged > 0) {
         fprintf(fd, " Pascal=%d", purged);
      }
      fprintf(fd, "\n");
   }
   fclose(fd);   //flush and close the output file
}
The output of the script is saved to a file chosen by the user. New functions introduced in this script include GetEntryPointQty, which returns the number of symbols exported by the library; GetEntryOrdinal, which returns an ordinal number (an index into the library’s export table); GetEntryPoint, which returns the address associated with an exported function that has been identified by ordinal number; and GetInputFile, which returns the name of the file that was loaded into IDA.
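To make the layout of the generated file concrete, the sketch below rebuilds the same text in plain Python, outside IDA. The function name format_idt, the library name demo.dll, and the export tuples are hypothetical sample data; inside IDA the values would come from GetEntryOrdinal, Name, and GetFunctionAttr, as in Example 15-5.

```python
def format_idt(library, exports):
    """Build .idt text from (ordinal, name, purged_bytes) tuples."""
    lines = ["ALIGNMENT 4", "0 Name=%s" % library]
    for ordinal, name, purged in exports:
        if ordinal == 0:
            continue                          # mirrors the script's ord == 0 check
        line = "%d Name=%s" % (ordinal, name)
        if purged > 0:
            line += " Pascal=%d" % purged     # bytes the callee removes from the stack
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical exports: Foo purges 8 bytes of arguments, Bar purges none.
print(format_idt("demo.dll", [(1, "Foo", 8), (2, "Bar", 0)]), end="")
```

Each export becomes one line, and the Pascal field appears only for functions that clean up their own stack arguments.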
Versions of GCC later than 3.4 use mov statements rather than push statements in x86 binaries to place function arguments into the stack before calling a function. Occasionally this causes some analysis problems for IDA (newer versions of IDA handle this situation better), because the analysis engine relies on finding push statements to pinpoint locations at which arguments are pushed for a function call. The following listing shows an IDA disassembly when parameters are pushed onto the stack:
.text:08048894                 push    0               ; protocol
.text:08048896                 push    1               ; type
.text:08048898                 push    2               ; domain
.text:0804889A                 call    _socket
Note the comments that IDA has placed in the right margin. Such commenting is possible only when IDA recognizes that parameters are being pushed and when IDA knows the signature of the function being called. When mov statements are used to place parameters onto the stack, the resulting disassembly is somewhat less informative, as shown here:
.text:080487AD                 mov     [esp+8], 0
.text:080487B5                 mov     [esp+4], 1
.text:080487BD                 mov     [esp], 2
.text:080487C4                 call    _socket
In this case, IDA has failed to recognize that the three mov statements preceding the call are being used to set up the parameters for the function call. As a result, we get less assistance from IDA in the form of automatic comments in the disassembly.
Here we have a situation where a script might be able to restore some of the information that we are accustomed to seeing in our disassemblies. Example 15-6 is a first effort at automatically recognizing instructions that are setting up parameters for function calls:
Example 15-6. Automating parameter recognition
#include <idc.idc>
static main() {
   auto addr, op, end, idx;
   auto func_flags, type, val, search;
   search = SEARCH_DOWN | SEARCH_NEXT;
   addr = GetFunctionAttr(ScreenEA(), FUNCATTR_START);
   func_flags = GetFunctionFlags(addr);
   if (func_flags & FUNC_FRAME) {   //Is this an ebp-based frame?
      end = GetFunctionAttr(addr, FUNCATTR_END);
      for (; addr < end && addr != BADADDR; addr = FindCode(addr, search)) {
         type = GetOpType(addr, 0);
         if (type == 3) {   //Is this a register indirect operand?
            if (GetOperandValue(addr, 0) == 4) {   //Is the register esp?
               MakeComm(addr, "arg_0");   //[esp] equates to arg_0
            }
         }
         else if (type == 4) {   //Is this a register + displacement operand?
            idx = strstr(GetOpnd(addr, 0), "[esp");   //Is the register esp?
            if (idx != -1) {
               val = GetOperandValue(addr, 0);   //get the displacement
               MakeComm(addr, form("arg_%d", val));   //add a comment
            }
         }
      }
   }
}
The script works only on EBP-based frames and relies on the fact that when parameters are moved into the stack prior to a function call, GCC generates memory references relative to esp. The script iterates through all instructions in a function; for each instruction that writes to a memory location using esp as a base register, the script determines the depth within the stack and adds a comment indicating which parameter is being moved. The GetFunctionFlags function offers access to various flags associated with a function, such as whether the function uses an EBP-based stack frame. Running the script in Example 15-6 yields the annotated disassembly shown here:
.text:080487AD                 mov     [esp+8], 0      ; arg_8
.text:080487B5                 mov     [esp+4], 1      ; arg_4
.text:080487BD                 mov     [esp], 2        ; arg_0
.text:080487C4                 call    _socket
The comments aren’t particularly informative. However, we can now tell at a glance that the three mov statements are used to place parameters onto the stack, which is a step in the right direction. By extending the script a bit further and exploring some more of IDC’s capabilities, we can come up with a script that provides almost as much information as IDA does when it properly recognizes parameters. The output of the final product is shown here:
.text:080487AD                 mov     [esp+8], 0      ; int protocol
.text:080487B5                 mov     [esp+4], 1      ; int type
.text:080487BD                 mov     [esp], 2        ; int domain
.text:080487C4                 call    _socket
The extended version of the script in Example 15-6, which is capable of incorporating data from function signatures into comments, is available on this book’s website.[103]
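The mapping from an esp-relative operand to a named parameter can be sketched without IDA. The following plain Python is not the extended script mentioned above; it is only an illustration under stated assumptions: the helper label_for and the regular expression are hypothetical, every parameter is assumed to occupy one 4-byte stack slot, and the parameter list is typed in by hand from the prototype of socket rather than read from IDA’s type information.

```python
import re

# Matches "[esp]" or "[esp+<hex>]", with an optional trailing "h" on the offset.
ESP_OPERAND = re.compile(r"\[esp(?:\+([0-9A-Fa-f]+)h?)?\]")


def label_for(operand, params, slot_size=4):
    """Map an operand such as "[esp+8]" to a parameter description, or None."""
    match = ESP_OPERAND.fullmatch(operand)
    if match is None:
        return None                       # not an esp-relative memory operand
    offset = int(match.group(1), 16) if match.group(1) else 0
    index, remainder = divmod(offset, slot_size)
    if remainder or index >= len(params):
        return None                       # misaligned, or beyond the known parameters
    return params[index]


# int socket(int domain, int type, int protocol)
socket_params = ["int domain", "int type", "int protocol"]
for operand in ("[esp+8]", "[esp+4]", "[esp]"):
    print(operand, ";", label_for(operand, socket_params))
```

The first parameter sits at [esp] at the time of the call, so dividing the displacement by the slot size gives the parameter’s position in the prototype.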
There are a number of reasons why you might need to write a script that emulates the behavior of a program you are analyzing. For example, the program you are studying may be self-modifying, as many malware programs are, or the program may contain some encoded data that gets decoded when it is needed at runtime. Without running the program and pulling the modified data out of the running process’s memory, how can you understand the behavior of the program? The answer may lie with an IDC script. If the decoding process is not terribly complex, you may be able to quickly write an IDC script that performs the same actions that are performed by the program when it runs. Using a script to decode data in this way eliminates the need to run a program when you don’t know what the program does or you don’t have access to a platform on which you can run the program. An example of the latter case might occur if you were examining a MIPS binary with your Windows version of IDA. Without any MIPS hardware, you would not be able to execute the MIPS binary and observe any data decoding it might perform. You could, however, write an IDC script to mimic the behavior of the binary and make the required changes within the IDA database, all with no need for a MIPS execution environment.
The following x86 code was extracted from a DEFCON[104] Capture the Flag binary.[105]
.text:08049EDE                 mov     [ebp+var_4], 0
.text:08049EE5
.text:08049EE5 loc_8049EE5:
.text:08049EE5                 cmp     [ebp+var_4], 3C1h
.text:08049EEC                 ja      short locret_8049F0D
.text:08049EEE                 mov     edx, [ebp+var_4]
.text:08049EF1                 add     edx, 804B880h
.text:08049EF7                 mov     eax, [ebp+var_4]
.text:08049EFA                 add     eax, 804B880h
.text:08049EFF                 mov     al, [eax]
.text:08049F01                 xor     eax, 4Bh
.text:08049F04                 mov     [edx], al
.text:08049F06                 lea     eax, [ebp+var_4]
.text:08049F09                 inc     dword ptr [eax]
.text:08049F0B                 jmp     short loc_8049EE5
This code decodes a private key that has been embedded within the program binary. Using the IDC script shown in Example 15-7, we can extract the private key without running the program:
Example 15-7. Emulating assembly language with IDC
auto var_4, edx, eax, al;
var_4 = 0;
while (var_4 <= 0x3C1) {
   edx = var_4;
   edx = edx + 0x804B880;
   eax = var_4;
   eax = eax + 0x804B880;
   al = Byte(eax);
   al = al ^ 0x4B;
   PatchByte(edx, al);
   var_4++;
}
Example 15-7 is a fairly literal translation of the preceding assembly language sequence generated according to the following rather mechanical rules.
For each stack variable and register used in the assembly code, declare an IDC variable.
For each assembly language statement, write an IDC statement that mimics its behavior.
Reading and writing stack variables is emulated by reading and writing the corresponding variable declared in your IDC script.
Reading from a nonstack location is accomplished using the Byte, Word, or Dword function, depending on the amount of data being read (1, 2, or 4 bytes).
Writing to a nonstack location is accomplished using the PatchByte, PatchWord, or PatchDword function, depending on the amount of data being written.
In general, if the code appears to contain a loop for which the termination condition is not immediately obvious, it is easiest to begin with an infinite loop such as while (1) {} and then insert a break statement when you encounter statements that cause the loop to terminate.
When the assembly code calls functions, things get complicated. In order to properly simulate the behavior of the assembly code, you must find a way to mimic the behavior of the function that has been called, including providing a return value that makes sense within the context of the code being simulated. This fact alone may preclude the use of IDC as a tool for emulating the behavior of an assembly language sequence.
The important thing to understand when developing scripts such as the previous one is that it is not absolutely necessary to fully understand how the code you are emulating behaves on a global scale. It is often sufficient to understand only one or two instructions at a time and generate correct IDC translations for those instructions. If each instruction has been correctly translated into IDC, then the script as a whole should properly mimic the complete functionality of the original assembly code. We can delay further study of the assembly language algorithm until after the IDC script has been completed, at which point we can use the IDC script to enhance our understanding of the underlying assembly. Once we spend some time considering how our example algorithm works, we might shorten the preceding IDC script to the following:
auto var_4, addr;
for (var_4 = 0; var_4 <= 0x3C1; var_4++) {
   addr = 0x804B880 + var_4;
   PatchByte(addr, Byte(addr) ^ 0x4B);
}
As an alternative, if we did not wish to modify the database in any way, we could replace the call to PatchByte with a call to Message if we were dealing with ASCII data, or we could write the data to a file if we were dealing with binary data.
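The non-destructive variant can be sketched in plain Python, working on a copy of the bytes instead of the database. Everything here is an assumption made for illustration: the names xor_decode and show_or_save are hypothetical, the sample buffer stands in for the 0x3C2 bytes (offsets 0 through 0x3C1) that the loop visits starting at 0x804B880, and the output path is whatever the caller chooses.

```python
def xor_decode(data, key=0x4B):
    """Return a decoded copy of data; the input is left untouched."""
    return bytes(b ^ key for b in data)


def show_or_save(decoded, path):
    """Print ASCII results, otherwise write the raw bytes to path."""
    try:
        print(decoded.decode("ascii"))
        return "message"
    except UnicodeDecodeError:
        with open(path, "wb") as out:     # binary data goes to a file instead
            out.write(decoded)
        return "file"


# Made-up sample: encode a short ASCII string, then decode it again.
encoded = xor_decode(b"-----BEGIN SAMPLE-----")
print(show_or_save(xor_decode(encoded), "decoded.bin"))
```

Because XOR with a fixed key is its own inverse, the same function both encodes the sample and recovers it.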
[104] See http://www.defcon.org/.
[105] Courtesy of Kenshoto, the organizers of CTF at DEFCON 15. Capture the Flag is an annual hacking competition held at DEFCON.