Before we get too far into this chapter, we need to make one thing clear: IDA is not a vulnerability discovery tool. There, we said it; what a relief! IDA seems to have attained mystical qualities in some people’s minds. All too often people seem to have the impression that merely opening a binary with IDA will reveal all the secrets of the universe, that the behavior of a piece of malware will be fully explained to them in comments automatically generated by IDA, that vulnerabilities will be highlighted in red, and that IDA will automatically generate exploit code if you right-click while standing on one foot in some obscure Easter egg–activation sequence.
While IDA is certainly a very capable tool, without a clever user sitting at the keyboard (and perhaps a handy collection of scripts and plug-ins), it is really only a disassembler/debugger. As a static-analysis tool, it can only facilitate your attempts to locate software vulnerabilities. Ultimately, it is up to your skills and how you apply them as to whether IDA makes your search for vulnerabilities easier. Based on our experience, IDA is not the optimal tool for locating new vulnerabilities,[186] but when used in conjunction with a debugger, it is one of the best tools available for assisting in exploit development once a vulnerability has been discovered.
Over the past several years, IDA has taken on a new role in discovering existing vulnerabilities. Initially, it may seem unusual to search for known vulnerabilities until we stop to consider exactly what is known about these vulnerabilities and exactly who knows it. In the closed-source, binary-only software world, vendors frequently release software patches without disclosing exactly what has been patched and why. By performing differential analysis between new patched versions of a piece of software and old unpatched versions of the same software, it is possible to isolate the areas that have changed within a binary. Under the assumption that these changes were made for a reason, such differential-analysis techniques help to shine a spotlight on what were formerly vulnerable code sequences. With the search thus narrowed, anyone with the requisite skills can develop an exploit for use against unpatched systems. In fact, given Microsoft’s well-known Patch Tuesday cycle of publishing updates, large numbers of security researchers prepare to sit down and do just that once every month.
Considering that entire books exist on the topic,[187] there is no way that we can do justice to vulnerability analysis in a single chapter in a book dedicated to IDA. What we will do is assume that the reader is familiar with some of the basic concepts of software vulnerabilities, such as buffer overflows, and discuss some of the ways that IDA may be used to hunt down, analyze, and ultimately develop exploits for those vulnerabilities.
Vulnerability researchers take many different approaches to discovering new vulnerabilities in software. When source code is available, it may be possible to utilize any of a growing number of automated source code–auditing tools to highlight potential problem areas within a program. In many cases, such automated tools will only point out the low-hanging fruit, while discovery of deeper vulnerabilities may require extensive manual auditing.
Tools for performing automated auditing of binaries offer many of the same reporting capabilities offered by automated source-auditing tools. A clear advantage of automated binary analysis is that no access to the application source code is required. Therefore, it is possible to perform automated analysis of closed-source, binary-only programs. Veracode[188] is an example of a company that offers a subscription-based service in which users may submit binary files for analysis by Veracode’s proprietary binary-analysis tools. While there is no guarantee that such tools can find any or all vulnerabilities within a binary, these technologies bring binary analysis within reach of the average person seeking some measure of confidence that the software she uses is free from vulnerabilities.
Whether auditing at the source or binary level, basic static-analysis techniques include auditing for the use of problematic functions such as strcpy and sprintf, auditing the use of buffers returned by dynamic memory-allocation routines such as malloc and VirtualAlloc, and auditing the handling of user-supplied input received via functions such as recv, read, fgets, and many other similar functions. Locating such calls within a database is not difficult. For example, to track down all calls to strcpy, we could perform the following steps:

1. Find the strcpy function.
2. Display all cross-references to the strcpy function by positioning the cursor on the strcpy label and then choosing View ▸ Open Subviews ▸ Cross References.
3. Visit each cross-reference and analyze the parameters provided to strcpy to determine whether a buffer overflow may be possible.

Step 3 may require a substantial amount of code and data-flow analysis to understand all potential inputs to the function call. Hopefully, the complexity of such a task is clear. Step 1, although it seems straightforward, may require a little effort on your part. Locating strcpy may be as easy as using the Jump ▸ Jump to Address command (G) and entering strcpy as the address to jump to. In Windows PE binaries or statically linked ELF binaries, this is usually all that is needed. However, with other binaries, extra steps may be required. In a dynamically linked ELF binary, using the Jump command may not take you directly to the desired function. Instead, it is likely to take you to an entry in the extern section (which is involved in the dynamic-linking process). An IDA representation of the strcpy entry in an extern section is shown here:

    extern:804DECC                 extrn strcpy:near       ; CODE XREF: _strcpy↑j
    extern:804DECC                                         ; DATA XREF: .got:off_804D5E4↑o

To confuse matters, this location does not appear to be named strcpy at all (it is, but the name is indented), and the only code cross-reference to the location is a jump cross-reference from a function that appears to be named _strcpy, while a data cross-reference is also made to this location from the .got section. The referencing function is actually named .strcpy, which is not at all obvious from the display. In this case, IDA has replaced the dot character with an underscore because IDA does not consider dots to be valid identifier characters by default. Double-clicking the code cross-reference takes us to the program’s procedure linkage table (.plt) entry for strcpy, as shown here:

    .plt:08049E90 _strcpy         proc near               ; CODE XREF: decode+5F↓p
    .plt:08049E90                                         ; extract_int_argument+24↓p ...
    .plt:08049E90                 jmp     ds:off_804D5E4
    .plt:08049E90 _strcpy         endp

If instead we follow the data cross-reference, we end up at the corresponding .got entry for strcpy shown here:

    .got:0804D5E4 off_804D5E4     dd offset strcpy        ; DATA XREF: _strcpy↑r

In the .got entry, we encounter another data cross-reference to the .strcpy function in the .plt section. In practice, following the data cross-references is the most reliable means of navigating from the extern section to the .plt section. In dynamically linked ELF binaries, functions are called indirectly through the procedure linkage table. Now that we have reached the .plt, we can bring up the cross-references to _strcpy (actually .strcpy) and begin to audit each call (of which there are at least two in this example).
This process can become tedious when we have a list of several common functions whose calls we wish to locate and audit. At this point it may be useful to develop a script that can automatically locate and comment all interesting function calls for us. With comments in place, we can perform simple searches to move from one audit location to another. The foundation for such a script is a function that can reliably locate another function so that we can locate all cross-references to that function. With the understanding of ELF binaries gained in the preceding discussion, the IDC function in Example 22-1 takes a function name as an input argument and returns an address suitable for cross-reference iteration.
Example 22-1. Finding a function’s callable address

    static getFuncAddr(fname) {
       auto func = LocByName(fname);
       if (func != BADADDR) {
          auto seg = SegName(func);
          //what segment did we find it in?
          if (seg == "extern") { //Likely an ELF if we are in "extern"
             //First (and only) data xref should be from got
             func = DfirstB(func);
             if (func != BADADDR) {
                seg = SegName(func);
                if (seg != ".got") return BADADDR;
                //Now, first (and only) data xref should be from plt
                func = DfirstB(func);
                if (func != BADADDR) {
                   seg = SegName(func);
                   if (seg != ".plt") return BADADDR;
                }
             }
          }
          else if (seg != ".text") {
             //otherwise, if the name was not in the .text section, then we
             // don't have an algorithm for finding it automatically
             func = BADADDR;
          }
       }
       return func;
    }

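The extern → .got → .plt walk that this function performs can also be modelled outside IDA. The following Python sketch is only an illustration: the three lookup tables are hypothetical stand-ins for IDA's name, segment, and data cross-reference information, populated with the addresses from the strcpy listings above, and the helper names are our own rather than part of any IDA API.

```python
BADADDR = 0xFFFFFFFF

# Toy database (hypothetical stand-ins for what IDA tracks internally),
# filled in with the addresses shown in the strcpy listings.
NAMES = {"strcpy": 0x804DECC}
SEGMENTS = {0x804DECC: "extern", 0x804D5E4: ".got", 0x8049E90: ".plt"}
DATA_XREFS_TO = {0x804DECC: [0x804D5E4], 0x804D5E4: [0x8049E90]}

def first_data_xref_to(addr):
    """Return the first address holding a data reference to addr."""
    refs = DATA_XREFS_TO.get(addr, [])
    return refs[0] if refs else BADADDR

def get_func_addr(name):
    """Resolve a function name to the address that callers reference."""
    func = NAMES.get(name, BADADDR)
    if func == BADADDR:
        return BADADDR
    seg = SEGMENTS.get(func)
    if seg == "extern":
        # Follow the first data xref twice: extern -> .got -> .plt
        for expected in (".got", ".plt"):
            func = first_data_xref_to(func)
            if func == BADADDR or SEGMENTS.get(func) != expected:
                return BADADDR
        return func
    if seg != ".text":
        return BADADDR      # no strategy for names found elsewhere
    return func

print(hex(get_func_addr("strcpy")))   # the .plt stub for strcpy
```

Looking up a name that is absent from the toy tables returns BADADDR, mirroring the failure path of the IDC version.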
Using the address returned by getFuncAddr, it is now possible to track down all of the references to any function whose use we want to audit. The IDC function in Example 22-2 leverages the getFuncAddr function from Example 22-1 to obtain a function address and add comments at all calls to the function.
Example 22-2. Flagging calls to a designated function

    static flagCalls(fname) {
       auto func, xref;
       //get the callable address of the named function
       func = getFuncAddr(fname);
       if (func != BADADDR) {
          //Iterate through calls to the named function, and add a comment
          //at each call
          for (xref = RfirstB(func); xref != BADADDR; xref = RnextB(func, xref)) {
             if (XrefType() == fl_CN || XrefType() == fl_CF) {
                MakeComm(xref, "*** AUDIT HERE ***");
             }
          }
          //Iterate through data references to the named function, and add a
          //comment at each reference
          for (xref = DfirstB(func); xref != BADADDR; xref = DnextB(func, xref)) {
             if (XrefType() == dr_O) {
                MakeComm(xref, "*** AUDIT HERE ***");
             }
          }
       }
    }

Once the desired function’s address has been located, two loops are used to iterate over cross-references to the function. In the first loop, a comment is inserted at each location that calls the function of interest. In the second loop, additional comments are inserted at each location that takes the address of the function (use of an offset cross-reference type). The second loop is required in order to track down calls of the following style:

    .text:000194EA                 mov     esi, ds:strcpy
    .text:000194F0                 push    offset loc_40A006
    .text:000194F5                 add     edi, 160h
    .text:000194FB                 push    edi
    .text:000194FC                 call    esi

In this example, the compiler has cached the address of the strcpy function in the ESI register in order to make use of a faster means of calling strcpy later in the program. The call instruction shown here is faster to execute because it is both smaller (2 bytes) and requires no additional operations to resolve the target of the call, since the address is already held in the ESI register. A compiler may choose to generate this type of code when one function makes several calls to another function.

Given the indirect nature of the call in this example, the flagCalls function in our example may see only the data cross-reference to strcpy while failing to see the call to strcpy because the call instruction does not reference strcpy directly. In practice, however, IDA possesses the capability to perform some limited data-flow analysis in cases such as these and is likely to generate the disassembly shown here:

    .text:000194EA                 mov     esi, ds:strcpy
    .text:000194F0                 push    offset loc_40A006
    .text:000194F5                 add     edi, 160h
    .text:000194FB                 push    edi
    .text:000194FC                 call    esi ; strcpy

Note that the call instruction has been annotated with a comment indicating which function IDA believes is being called. In addition to inserting the comment, IDA adds a code cross-reference from the point of the call to the function being called. This benefits the flagCalls function, because in this case the call instruction will be found and annotated via a code cross-reference.

To finish up our example script, we need a main function that invokes flagCalls for all of the functions that we are interested in auditing. A simple example to annotate calls to some of the functions mentioned earlier in this section is shown here:

    static main() {
       flagCalls("strcpy");
       flagCalls("strcat");
       flagCalls("sprintf");
       flagCalls("gets");
    }

After running this script, we can move from one interesting call to the next by searching for the inserted comment text, *** AUDIT HERE ***. Of course, this still leaves a lot of work to be done from an analysis perspective, since the mere fact that a program calls strcpy does not make that program exploitable. This is where data-flow analysis comes into play. In order to understand whether a particular call to strcpy is exploitable, you must determine what parameters are being passed to strcpy and evaluate whether those parameters can be manipulated to your advantage.
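The final decision at each call site can be reduced to a simple comparison once the hard part, recovering the sizes, has been done. The following Python sketch assumes that analysis has already produced the capacity of the destination buffer and an upper bound on the length of the source string; the function name and parameters are hypothetical and serve only to illustrate the check.

```python
def strcpy_may_overflow(dest_size, max_src_len):
    """Decide whether a strcpy call could write past its destination.

    dest_size   -- capacity of the destination buffer in bytes
    max_src_len -- longest source string (excluding the terminating NUL)
                   that can reach the call, or None if no bound is known
    """
    if max_src_len is None:
        return True                      # unbounded input: always suspicious
    # strcpy copies the string plus one terminating NUL byte
    return max_src_len + 1 > dest_size

print(strcpy_may_overflow(64, 63))    # False: 63 characters + NUL fit exactly
print(strcpy_may_overflow(64, 64))    # True: the NUL lands one byte too far
print(strcpy_may_overflow(64, None))  # True: source length is unconstrained
```

A result of True only marks the call as worth a closer look; whether the input is actually attacker-controlled still has to be established separately.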
Data-flow analysis is a far more complex task than simply finding calls to problem functions. In order to track the flow of data in a static-analysis environment, a thorough understanding of the instruction set being used is required. Your static-analysis tools need to understand where registers may have been assigned values and how those values may have changed and propagated to other registers. Further, your tools need a means for determining the sizes of source and destination buffers being referenced within the program, which in turn requires the ability to understand the layout of stack frames and global variables as well as the ability to deduce the size of dynamically allocated memory blocks. And, of course, all of this is being attempted without actually running the program.
An interesting example of what can be accomplished with creative scripting comes in the form of the BugScam[189] scripts created by Halvar Flake. BugScam utilizes techniques similar to the preceding examples to locate calls to problematic functions and takes the additional step of performing rudimentary data-flow analysis at each function call. The result of BugScam’s analysis is an HTML report of potential problems in a binary. A sample report table generated as a result of a sprintf analysis is shown here:
Address | Severity | Description |
---|---|---|
8048c03 | 5 | The maximum expansion of the data appears to be larger than the target buffer; this might be the cause of a buffer overrun! Maximum Expansion: 1053. Target Size: 1036. |
In this case, BugScam was able to determine the size of the input and output buffers, which, when combined with the format specifiers contained in the format string, were used to determine the maximum size of the generated output.
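To make the idea of a maximum expansion concrete, the following Python sketch computes a worst-case output size for a sprintf format string. It is not BugScam's algorithm; it is a simplified illustration that handles only bare %s, %d, %u, %x, %c, and %% conversions (no width or precision fields), assumes a 32-bit target, and takes the sizes of the %s argument buffers as given. The function name, the width table, and the sample values are all assumptions made for this example.

```python
import re

# Worst-case character counts for bare conversions on a 32-bit target
# (assumed values: "-2147483648", "4294967295", "ffffffff", one char).
FIXED_WIDTHS = {"d": 11, "u": 10, "x": 8, "c": 1}

SPEC = re.compile(r"%(%|[duxcs])")

def max_expansion(fmt, string_arg_sizes):
    """Upper bound on the bytes sprintf writes, including the final NUL.

    string_arg_sizes lists the buffer size of each %s argument in order;
    a buffer of n bytes can hold a string of at most n - 1 characters.
    """
    sizes = iter(string_arg_sizes)
    total = 0
    pos = 0
    for match in SPEC.finditer(fmt):
        total += match.start() - pos     # literal text before the conversion
        conv = match.group(1)
        if conv == "%":
            total += 1                   # "%%" emits a single percent sign
        elif conv == "s":
            total += next(sizes) - 1
        else:
            total += FIXED_WIDTHS[conv]
        pos = match.end()
    total += len(fmt) - pos              # literal text after the last one
    return total + 1                     # terminating NUL

target_size = 1036                       # hypothetical destination buffer
needed = max_expansion("user=%s id=%d", [1024])
print(needed, needed > target_size)
```

For the sample format string the bound is 5 + 1023 + 4 + 11 + 1 = 1044 bytes, which exceeds the hypothetical 1,036-byte target and would therefore be reported.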
Developing scripts of this nature requires an in-depth understanding of various exploit classes in order to develop an algorithm that can be applied generically across a large body of binaries. Lacking such knowledge, we can still develop scripts (or plug-ins) that answer simple questions for us faster than we can find the answers manually.
As a final example, consider the task of locating all functions that contain stack-allocated buffers, since these are the functions that might be susceptible to stack-based buffer-overflow attacks. Rather than manually scrolling through a database, we can develop a script to analyze the stack frame of each function, looking for variables that occupy large amounts of space. The Python function in Example 22-3 iterates through the defined members of a given function’s stack frame in search of variables whose size is larger than a specified minimum size.
Example 22-3. Scanning for stack-allocated buffers

    def findStackBuffers(func_addr, minsize):
       prev_idx = -1
       frame = GetFrame(func_addr)
       if frame == -1: return  #bad function
       idx = 0
       prev = None
       while idx < GetStrucSize(frame):
          member = GetMemberName(frame, idx)
          if member is not None:
             if prev_idx != -1:
                #compute distance from previous field to current field
                delta = idx - prev_idx
                if delta >= minsize:
                   Message("%s: possible buffer %s: %d bytes\n" %
                           (GetFunctionName(func_addr), prev, delta))
             prev_idx = idx
             prev = member
             idx = idx + GetMemberSize(frame, idx)
          else:
             idx = idx + 1

This function locates all the variables in a stack frame using repeated calls to GetMemberName for all valid offsets within the stack frame. The size of a variable is computed as the difference between the starting offsets of two successive variables. If the size exceeds a threshold size (minsize), then the variable is reported as a possible stack buffer. The index into the structure is moved along by either 1 byte when no member is defined at the current offset or by the size of any member found at the current offset. The GetMemberSize function may seem like a more suitable choice for computing the size of each stack variable; however, this is true only if the variable has been sized properly by either IDA or the user. Consider the following stack frame:

    .text:08048B38 sub_8048B38     proc near
    .text:08048B38
    .text:08048B38 var_818         = byte ptr -818h
    .text:08048B38 var_418         = byte ptr -418h
    .text:08048B38 var_C           = dword ptr -0Ch
    .text:08048B38 arg_0           = dword ptr  8

Using the displayed byte offsets, we can compute that there are 1,024 bytes from the start of var_818 to the start of var_418 (818h - 418h = 400h) and 1,036 bytes between the start of var_418 and the start of var_C (418h - 0Ch = 40Ch). However, the stack frame might be expanded to show the following layout:

    -00000818 var_818         db ?
    -00000817                 db ? ; undefined
    -00000816                 db ? ; undefined
    ...
    -0000041A                 db ? ; undefined
    -00000419                 db ? ; undefined
    -00000418 var_418         db 1036 dup(?)
    -0000000C var_C           dd ?

Here, var_418 has been collapsed into an array, while var_818 appears to be only a single byte (with 1,023 undefined bytes filling the space between var_818 and var_418). For this stack layout, GetMemberSize will report 1 byte for var_818 and 1,036 bytes for var_418, which is an undesirable result. A call to findStackBuffers(0x08048B38, 16) produces the following output, regardless of whether var_818 is defined as a single byte or an array of 1,024 bytes:

    sub_8048B38: possible buffer var_818: 1024 bytes
    sub_8048B38: possible buffer var_418: 1036 bytes

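The claim that the result does not depend on how var_818 has been sized can be checked with a small model that runs outside IDA. In the Python sketch below, the frame is represented as a hypothetical dictionary mapping frame offsets (counted from the lowest-addressed local) to member names and defined sizes. The saved-register members " s" and " r" and the overall frame size are assumptions based on a conventional 32-bit x86 frame; the scanning logic follows Example 22-3.

```python
def find_stack_buffers(members, frame_size, minsize):
    """Report (name, size) for members whose extent is at least minsize.

    members maps frame offset -> (name, defined_size). A member's extent
    is the distance to the next defined member, not its defined size.
    """
    found = []
    prev_idx, prev = -1, None
    idx = 0
    while idx < frame_size:
        if idx in members:
            name, size = members[idx]
            if prev_idx != -1 and idx - prev_idx >= minsize:
                found.append((prev, idx - prev_idx))
            prev_idx, prev = idx, name
            idx += size                  # skip over the defined member
        else:
            idx += 1                     # undefined byte: advance by one
    return found

def make_frame(var_818_size):
    # Offsets are relative to var_818, the lowest address in the frame.
    return {
        0x000: ("var_818", var_818_size),
        0x400: ("var_418", 1036),
        0x80C: ("var_C", 4),
        0x818: (" s", 4),                # saved frame pointer (assumed)
        0x81C: (" r", 4),                # return address (assumed)
        0x820: ("arg_0", 4),
    }

as_byte = find_stack_buffers(make_frame(1), 0x824, 16)
as_array = find_stack_buffers(make_frame(1024), 0x824, 16)
print(as_byte)
print(as_byte == as_array)
```

Both runs report var_818 as 1,024 bytes and var_418 as 1,036 bytes, because the extent of each variable is derived from the distance to its successor rather than from its defined size.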
Creating a main function that iterates through all functions in a database (see Chapter 15) and calls findStackBuffers for each function yields a script that quickly points out the use of stack buffers within a program. Of course, determining whether any of those buffers can be overflowed requires additional (usually manual) study of each function. The tedious nature of static analysis is precisely the reason that fuzz testing is so popular.
[186] In general, far more vulnerabilities are discovered through fuzz testing than through static analysis.
[187] For example, see Jon Erickson’s Hacking: The Art of Exploitation, 2nd Edition (http://nostarch.com/hacking2.htm).
[188] See http://www.veracode.com/.