If you were fortunate enough to have source code available for a C/C++ program that you wanted to analyze, a good place to begin your analysis might be the main function, as this is where execution notionally begins. When faced with analyzing a binary, this is not a bad strategy to follow. However, as we know, it is complicated by the fact that compilers, linkers, and libraries add code that executes before main is reached. Thus it would often be incorrect to assume that the entry point of a binary corresponds to the main function written by the program’s author.
In fact, the notion that all programs have a main function is a C/C++ compiler convention rather than a hard-and-fast rule for writing programs. If you have ever written a Windows GUI application, then you may be familiar with the WinMain variation on main. Once you step away from C/C++, you will find that other languages use other names for their primary entry-point function. Regardless of what it may be called, we will refer to this function generically as the main function.
Chapter 12 covered the concept of IDA signature files, their generation, and their application. IDA utilizes special startup signatures to attempt to identify a program’s main function. When IDA is able to match a binary’s startup sequence against one of the startup sequences in its signature files, IDA can locate a program’s main function based on its understanding of the behavior of the matched startup routine. This works well until IDA fails to match the startup sequence in a binary to any of its known signatures. In general, a program’s startup code is closely tied to both the compiler used to generate the code and the platform for which the code was built.
Recall from Chapter 12 that startup signatures are grouped together and stored in signature files specific to binary file types. For example, startup signatures for use with the PE loader are stored in pe.sig, while startup signatures for use with the MS-DOS loader are stored in exe.sig. The existence of a signature file for a given binary file type does not guarantee that IDA will be able to identify a program’s main function 100 percent of the time. There are too many compilers, and startup sequences are too much of a moving target, for IDA to ship with every possible signature.
For many file types, such as ELF and Mach-O, IDA does not include any startup signatures at all. The net result is that IDA can’t use signatures to locate a main function within an ELF binary (though the function will be found if it is named main).
The point of this discussion is to prepare you for the fact that, on occasion, you will be on your own when it comes to locating the main function of a program. In such cases it is useful to have some strategies for understanding how the program itself prepares for the call to main. As an example, consider a binary that has been obfuscated to some degree. In this case, IDA will certainly fail to match a startup signature because the startup routine itself has been obfuscated. If you manage to de-obfuscate the binary somehow (the topic of Chapter 21), you will probably need to locate not only main on your own but the original start routine as well.
For C and C++ programs with a traditional main function,[144] one of the responsibilities of the startup code is to set up the stack arguments required by main: the integer argc (a count of the number of command-line arguments), the character pointer array argv (an array of pointers to strings containing the command-line arguments), and the character pointer array envp (an array of pointers to strings containing the environment variables that were set at program invocation). The following excerpt from a FreeBSD 8.0 dynamically linked, stripped binary demonstrates how gcc-generated startup code calls main on a FreeBSD system:
.text:08048365 mov dword ptr [esp], offset _term_proc ; func
.text:0804836C call _atexit
.text:08048371 call _init_proc
.text:08048376 lea eax, [ebp+arg_0]
.text:08048379 mov [esp+8], esi
.text:0804837D mov [esp+4], eax
.text:08048381 mov [esp], ebx
.text:08048384 call sub_8048400
.text:08048389 mov [esp], eax ; status
.text:0804838C call _exit
In this case, the call to sub_8048400 turns out to be the call to main. This code is typical of many startup sequences in that there are calls to initialization functions (_atexit and _init_proc) preceding the call to main and a call to _exit following the return from main. The call to _exit ensures that the program terminates cleanly in the event that main performs a return rather than calling _exit itself. Note that the parameter passed to _exit is the value returned by main in EAX; thus the exit code of the program is the return value of main.
If the previous program were statically linked and stripped, the start routine would have the same structure as the preceding example; however, none of the library functions would have useful names. In that case, the main function would continue to stand out as the only function that is called with three parameters. Of course, applying FLIRT signatures as early as possible would also help to restore many of the library function names and make main stand out, as it does in the preceding example.
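The "only function called with three parameters" observation can be sketched as a small script. This is an illustrative sketch only, not part of IDA: the helper name stack_args_per_call is hypothetical, the input is assumed to be a list of instruction strings copied from a listing, and the script simply treats every [esp+N] slot written since the previous call as an outgoing argument.

```python
import re

# Matches stores such as "mov [esp+8], esi" or "mov dword ptr [esp], offset x".
STORE_RE = re.compile(r"^mov\s+(?:dword ptr\s+)?\[esp(?:\+([0-9A-Fa-f]+)h?)?\],")
CALL_RE = re.compile(r"^call\s+(\S+)")


def stack_args_per_call(instructions):
    """Return (call_target, distinct_esp_slots_written) pairs.

    Hypothetical helper: slots written since the previous call are assumed
    to be the outgoing arguments of the next call.
    """
    results = []
    slots = set()
    for insn in instructions:
        store = STORE_RE.match(insn)
        call = CALL_RE.match(insn)
        if store:
            # A missing displacement means the store targets [esp] itself.
            slots.add(int(store.group(1) or "0", 16))
        elif call:
            results.append((call.group(1), len(slots)))
            slots = set()
    return results


# Instructions taken from the FreeBSD start excerpt (addresses and comments removed).
freebsd_start = [
    "mov dword ptr [esp], offset _term_proc",
    "call _atexit",
    "call _init_proc",
    "lea eax, [ebp+arg_0]",
    "mov [esp+8], esi",
    "mov [esp+4], eax",
    "mov [esp], ebx",
    "call sub_8048400",
    "mov [esp], eax",
    "call _exit",
]

candidates = [target for target, count in stack_args_per_call(freebsd_start) if count == 3]
print(candidates)
```

Because the script looks only at the shape of the argument setup, it does not depend on any of the called functions having meaningful names, which is the situation you face with a statically linked, stripped binary.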
In order to demonstrate that the same compiler may generate a completely different style of code when running on a different platform, consider the following example, also created using gcc, of a dynamically linked, stripped binary taken from a Linux system:
.text:080482B0 start proc near
.text:080482B0 xor ebp, ebp
.text:080482B2 pop esi
.text:080482B3 mov ecx, esp
.text:080482B5 and esp, 0FFFFFFF0h
.text:080482B8 push eax
.text:080482B9 push esp
.text:080482BA push edx
.text:080482BB push offset sub_80483C0
.text:080482C0 push offset sub_80483D0
.text:080482C5 push ecx
.text:080482C6 push esi
.text:080482C7 push offset loc_8048384
.text:080482CC call ___libc_start_main
.text:080482D1 hlt
.text:080482D1 start endp
In this example, start makes a single function call to ___libc_start_main. The purpose of ___libc_start_main is to perform all of the same types of tasks that were performed in the preceding FreeBSD example, including calling main and ultimately exit. Since ___libc_start_main is a library function, the only way it can know where main actually resides is to be told via one of its parameters (of which there appear to be eight). Clearly two of the parameters (sub_80483C0 and sub_80483D0) are pointers to functions, while a third (loc_8048384) is a pointer to a location within the .text section. There are few clues in the previous listing as to which function might be main, so you might need to analyze the code at the three potential locations in order to correctly locate main. This might be a useful exercise; however, you may prefer simply to remember that the first argument (topmost on the stack and therefore last pushed) to ___libc_start_main is in fact a pointer to main.
There are two factors that combine to prevent IDA from identifying loc_8048384 as a function (which would have been named sub_8048384). The first is that the function is never called directly, so loc_8048384 never appears as the target of a call instruction. The second is that although IDA contains heuristics to recognize functions based on their prologues (which is why sub_80483C0 and sub_80483D0 are identified as functions even though they too are never called directly), the function at loc_8048384 (main) does not use a prologue recognized by IDA. The offending prologue (with comments) is shown here:
.text:08048384 loc_8048384: ; DATA XREF: start+17↑o
.text:08048384 lea ecx, [esp+4] ; address of arg_0 into ecx
.text:08048388 and esp, 0FFFFFFF0h ; 16 byte align esp
.text:0804838B push dword ptr [ecx-4] ; push copy of return address
.text:0804838E push ebp ; save caller's ebp
.text:0804838F mov ebp, esp ; initialize our frame pointer
.text:08048391 push ecx ; save ecx
.text:08048392 sub esp, 24h ; allocate locals
This prologue clearly contains the elements of a traditional prologue for a function that uses EBP as a frame pointer. The caller’s frame pointer is saved before setting the frame pointer for the current function and finally allocating space for local variables. The problem for IDA is that these actions do not occur as the first actions within the function, and thus IDA’s heuristics fail. It is a simple enough matter to manually create a function (Edit ▸ Functions ▸ Create Function) at this point, but you should take care to monitor IDA’s behavior. Just as it failed to identify the function in the first place, it may fail to recognize the fact that the function uses EBP as a frame pointer. In such a case, you would need to edit the function (ALT-P) to force IDA to believe that the function has a BP-based frame as well as to make adjustments to the number of stack bytes dedicated to saved registers and local variables.
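To see why the position of the frame setup matters, consider the following toy check. It is only a stand-in for illustration and makes no claim about how IDA's real heuristics are implemented; the function name has_ebp_frame_setup and the window parameter are assumptions introduced here.

```python
def has_ebp_frame_setup(instructions, window):
    """Return True if 'push ebp' directly followed by 'mov ebp, esp'
    begins within the first `window` instructions."""
    for i in range(min(window, len(instructions) - 1)):
        if instructions[i] == "push ebp" and instructions[i + 1] == "mov ebp, esp":
            return True
    return False


# Instructions taken from the prologue at loc_8048384 (addresses and comments removed).
main_prologue = [
    "lea ecx, [esp+4]",
    "and esp, 0FFFFFFF0h",
    "push dword ptr [ecx-4]",
    "push ebp",
    "mov ebp, esp",
    "push ecx",
    "sub esp, 24h",
]

# A check that insists the frame setup comes first misses this function.
strict = has_ebp_frame_setup(main_prologue, window=1)
# A check that tolerates a few leading instructions finds the frame setup.
relaxed = has_ebp_frame_setup(main_prologue, window=6)
print(strict, relaxed)
```

The strict check fails because three stack-alignment instructions precede the familiar push ebp / mov ebp, esp pair, which is the same reason you may need to create the function by hand.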
As in the case of the FreeBSD binary, if the preceding Linux example happened to be both statically linked and stripped, the start routine would not change at all other than the fact that the name for ___libc_start_main would be missing. You could still locate main by remembering that gcc’s Linux start routine makes only one function call and that the first parameter to that function is the address of main.
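The rule of thumb for gcc's Linux start routine (one call, first argument is main) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper name first_argument_of_call is hypothetical, the input is a list of instruction strings copied from the listing, and arguments are assumed to be passed with push instructions as in the example above.

```python
def first_argument_of_call(instructions, callee=None):
    """Return the operand of the last push before a call.

    If `callee` is None, the first call encountered is used, which suits a
    stripped, statically linked binary where the callee has no useful name.
    """
    last_push = None
    for insn in instructions:
        mnemonic, _, operand = insn.partition(" ")
        operand = operand.strip()
        if mnemonic == "push":
            # The last value pushed sits topmost on the stack: the first argument.
            last_push = operand
        elif mnemonic == "call" and (callee is None or operand == callee):
            return last_push
    return None


# Instructions taken from the Linux start routine (addresses removed).
linux_start = [
    "xor ebp, ebp",
    "pop esi",
    "mov ecx, esp",
    "and esp, 0FFFFFFF0h",
    "push eax",
    "push esp",
    "push edx",
    "push offset sub_80483C0",
    "push offset sub_80483D0",
    "push ecx",
    "push esi",
    "push offset loc_8048384",
    "call ___libc_start_main",
    "hlt",
]

main_operand = first_argument_of_call(linux_start)
print(main_operand)
```

Leaving callee unset mirrors the stripped case described above, in which the name ___libc_start_main is not available and only the single call itself identifies the library routine.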
On the Windows side of the house, the number of C/C++ compilers (and therefore the number of startup routines) in use is somewhat higher. Perhaps not surprisingly, in the case of gcc on Windows, it is possible to leverage some of the knowledge gained by studying gcc’s behavior on other platforms. The startup routine shown here is from a gcc/Cygwin binary:
.text:00401000 start proc near
.text:00401000
.text:00401000 var_28 = dword ptr -28h
.text:00401000 var_24 = dword ptr -24h
.text:00401000 var_20 = dword ptr -20h
.text:00401000 var_2 = word ptr -2
.text:00401000
.text:00401000 push ebp
.text:00401001 mov ebp, esp
.text:00401003 sub esp, 28h
.text:00401006 and esp, 0FFFFFFF0h
.text:00401009 fnstcw [ebp+var_2]
.text:0040100C movzx eax, [ebp+var_2]
.text:00401010 and ax, 0F0C0h
.text:00401014 mov [ebp+var_2], ax
.text:00401018 movzx eax, [ebp+var_2]
.text:0040101C or ax, 33Fh
.text:00401020 mov [ebp+var_2], ax
.text:00401024 fldcw [ebp+var_2]
.text:00401027 mov [esp+28h+var_28], offset sub_4010B0
.text:0040102E call sub_401120
Clearly this code does not map cleanly to the previous Linux-based example. However, there is one striking similarity: only one function is called (sub_401120), and that function takes a function pointer as its parameter (sub_4010B0). In this case sub_401120 serves much the same purpose as ___libc_start_main, while sub_4010B0 turns out to be the main function of the program.
Windows binaries compiled using gcc/MinGW make use of yet another style of start function, as shown here:
.text:00401280 start proc near
.text:00401280
.text:00401280 var_8 = dword ptr -8
.text:00401280
.text:00401280 push ebp
.text:00401281 mov ebp, esp
.text:00401283 sub esp, 8
.text:00401286 mov [esp+8+var_8], 1
.text:0040128D call ds:__set_app_type
.text:00401293 call sub_401150
.text:00401293 start endp
This is another case in which IDA will fail to identify the program’s main function. The preceding code offers few clues as to the location of main, as there is only one nonlibrary function called (sub_401150) and that function does not appear to take any arguments (as main should). In this instance, the best course of action is to continue the search for main within sub_401150. A portion of sub_401150 is shown here:
.text:0040122A call __p__environ
.text:0040122F mov eax, [eax]
.text:00401231 mov [esp+8], eax
.text:00401235 mov eax, ds:dword_404000
.text:0040123A mov [esp+4], eax
.text:0040123E mov eax, ds:dword_404004
.text:00401243 mov [esp], eax
.text:00401246 call sub_401395
.text:0040124B mov ebx, eax
.text:0040124D call _cexit
.text:00401252 mov [esp], ebx
.text:00401255 call ExitProcess
In this example, the function turns out to have many similarities with the start function associated with FreeBSD that we saw earlier. Process of elimination points to sub_401395 as the likely candidate for main, as it is the only nonlibrary function that is called with three arguments. Also, the third argument is related to the return value of the __p__environ library function, which correlates well with the fact that main’s third argument is expected to be a pointer to the environment strings array. The example code is also preceded by a call to the __getmainargs library function (not shown), which is called to set up the argc and argv parameters prior to actually calling main. This helps to reinforce the notion that main is about to be called.
The start routine for Visual C/C++ code is short and sweet, as seen here:
.text:0040134B start proc near
.text:0040134B call ___security_init_cookie
.text:00401350 jmp ___tmainCRTStartup
.text:00401350 start endp
IDA has actually recognized the library routines referenced in the two instructions through the application of startup signatures rather than by the fact that the program is linked to a dynamic library containing the given symbols. IDA’s startup signatures provide easy location of the initial call to main, as shown here:
.text:004012D8 mov eax, envp
.text:004012DD mov dword_40ACF4, eax
.text:004012E2 push eax ; envp
.text:004012E3 push argv ; argv
.text:004012E9 push argc ; argc
.text:004012EF call _main
.text:004012F4 add esp, 0Ch
.text:004012F7 mov [ebp+var_1C], eax
.text:004012FA cmp [ebp+var_20], 0
.text:004012FE jnz short $LN35
.text:00401300 push eax ; uExitCode
.text:00401301 call $LN27
.text:00401306 $LN35: ; CODE XREF: ___tmainCRTStartup+169↑j
.text:00401306 call __cexit
.text:0040130B jmp short loc_40133B
Within the entire body of ___tmainCRTStartup, _main is the only function called with exactly three arguments. Further analysis would reveal that the call to _main is preceded by a call to the GetCommandLine library function, which is yet another indication that a program’s main function may be called shortly. As a final note concerning the use of startup signatures, it is important to understand that, in this example, IDA has generated the name _main entirely on its own as a result of matching a startup signature. The ASCII string main appeared nowhere in the binary used in this example. Thus, you can expect main to be found and labeled anytime a startup signature is matched, even when a binary has been stripped of its symbols.
The last startup routine that we will examine for a C compiler is generated by Borland’s free command-line compiler.[145] The last few lines of Borland’s start routine are shown here:
.text:00401041 push offset off_4090B8
.text:00401046 push 0 ; lpModuleName
.text:00401048 call GetModuleHandleA
.text:0040104D mov dword_409117, eax
.text:00401052 push 0 ; fake return value
.text:00401054 jmp __startup
The pointer value pushed on the stack refers to a structure that in turn contains a pointer to main. Within __startup, the setup to call main is shown here:
.text:00406997 mov edx, dword_40BBFC
.text:0040699D push edx
.text:0040699E mov ecx, dword_40BBF8
.text:004069A4 push ecx
.text:004069A5 mov eax, dword_40BBF4
.text:004069AA push eax
.text:004069AB call dword ptr [esi+18h]
.text:004069AE add esp, 0Ch
.text:004069B1 push eax ; status
.text:004069B2 call _exit
Again, this example bears many similarities to previous examples in that the call to main takes three arguments (it is the only function called within __startup to do so) and the return value is passed directly to _exit to terminate the program. Additional analysis of __startup would reveal calls to the Windows API functions GetEnvironmentStrings and GetCommandLine, which are often precursors to the invocation of main.
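In push-based code such as the Visual C/C++ and Borland excerpts, the stack cleanup that follows a call offers a second way to count arguments: under the cdecl convention the caller removes the arguments itself, so add esp, 0Ch after a call implies three 4-byte arguments. The sketch below illustrates the idea; the helper name cdecl_arg_counts is hypothetical, and the script assumes 32-bit code in which every argument occupies one 4-byte slot.

```python
import re

# Matches caller cleanup such as "add esp, 0Ch" or "add esp, 8".
# Values printed without an 'h' suffix are single digits, so reading them
# as hexadecimal gives the same result as reading them as decimal.
CLEANUP_RE = re.compile(r"^add\s+esp,\s*([0-9A-Fa-f]+)h?$")


def cdecl_arg_counts(instructions, slot_size=4):
    """Return (call operand, argument count) for each call that is
    immediately followed by an 'add esp, N' cleanup instruction."""
    counts = []
    for insn, following in zip(instructions, instructions[1:]):
        if insn.startswith("call "):
            cleanup = CLEANUP_RE.match(following)
            if cleanup:
                cleaned_bytes = int(cleanup.group(1), 16)
                counts.append((insn[5:].strip(), cleaned_bytes // slot_size))
    return counts


# Instructions taken from the Borland __startup excerpt (addresses and comments removed).
borland_startup = [
    "mov edx, dword_40BBFC",
    "push edx",
    "mov ecx, dword_40BBF8",
    "push ecx",
    "mov eax, dword_40BBF4",
    "push eax",
    "call dword ptr [esi+18h]",
    "add esp, 0Ch",
    "push eax",
    "call _exit",
]

print(cdecl_arg_counts(borland_startup))
```

Note that the call to main here is indirect (call dword ptr [esi+18h]), so the argument count is one of the few clues available without resolving the pointer stored in the structure.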
Finally, in order to demonstrate that tracking down a program’s main function is not a problem specific to C programs, consider the following startup code from a compiled Visual Basic 6.0 program:
.text:004018A4 start:
.text:004018A4 push offset dword_401994
.text:004018A9 call ThunRTMain
The ThunRTMain library function performs a function similar to the Linux ___libc_start_main function in that its job is to perform any initialization required prior to invoking the actual main function of the program. In order to transfer control to the main function, Visual Basic utilizes a mechanism very similar to that in the Borland code in the earlier examples. ThunRTMain takes a single argument, which is a pointer to a structure containing additional information required for program initialization, including the address of the main function. The content of this structure is shown here:
.text:00401994 dword_401994 dd 21354256h, 2A1FF0h, 3 dup(0) ; DATA XREF: .text:start↑o
.text:004019A8 dd 7Eh, 2 dup(0)
.text:004019B4 dd 0A0000h, 409h, 0
.text:004019C0 dd offset sub_4045D0
.text:004019C4 dd offset dword_401A1C
.text:004019C8 dd 30F012h, 0FFFFFF00h, 8, 2 dup(1), 0E9h, 401944h, 4018ECh
.text:004019C8 dd 4018B0h, 78h, 7Dh, 82h, 83h, 4 dup(0)
Within this data structure, there is only one item that appears to reference code at all, the pointer to sub_4045D0, which turns out to be the main function for the program.
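Picking the code reference out of such a structure can be sketched as a simple filter over the listing text. The sketch assumes that IDA has already named the target with its default sub_ prefix, as it has here; the helper name code_pointers is hypothetical.

```python
import re

# Matches IDA-style "offset <name>" operands in data declarations.
OFFSET_RE = re.compile(r"offset\s+(\w+)")


def code_pointers(data_lines):
    """Return names referenced via 'offset' that carry IDA's default
    function prefix (sub_), ignoring data references such as dword_*."""
    names = []
    for line in data_lines:
        names.extend(name for name in OFFSET_RE.findall(line) if name.startswith("sub_"))
    return names


# Data declarations taken from the structure passed to ThunRTMain.
vb_header = [
    "dd 21354256h, 2A1FF0h, 3 dup(0)",
    "dd 7Eh, 2 dup(0)",
    "dd 0A0000h, 409h, 0",
    "dd offset sub_4045D0",
    "dd offset dword_401A1C",
    "dd 30F012h, 0FFFFFF00h, 8, 2 dup(1), 0E9h, 401944h, 4018ECh",
    "dd 4018B0h, 78h, 7Dh, 82h, 83h, 4 dup(0)",
]

print(code_pointers(vb_header))
```

A filter like this finds only pointers that IDA has already recognized as functions; raw values that happen to fall inside the .text section would still need to be examined by hand.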
In the end, learning how to find main is a matter of understanding how executable files are built. In cases where you are experiencing difficulties, it may be beneficial to build some simple executables (with a reference to an easily identifiable string in main, for example) with the same tools used to build the binary you are analyzing. By studying your test cases, you will gain an understanding of the basic structure of binaries built using a specific set of tools that may assist you in further analyzing more complex binaries built with the same set of tools.
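As a rough sketch of the test-case approach, the following script locates a marker string inside the raw bytes of a test executable. Everything specific here is an assumption made for illustration: the marker text FIND_ME_IN_MAIN, the helper name find_marker_offsets, and the sample bytes, which merely stand in for the contents of a compiled program.

```python
def find_marker_offsets(data, marker):
    """Return every offset at which `marker` occurs in `data` (both bytes)."""
    offsets = []
    start = data.find(marker)
    while start != -1:
        offsets.append(start)
        start = data.find(marker, start + 1)
    return offsets


# Stand-in for the bytes of a test program whose main references the marker.
sample = b"\x7fELF" + b"\x00" * 12 + b"FIND_ME_IN_MAIN\x00" + b"\x90" * 4

offsets = find_marker_offsets(sample, b"FIND_ME_IN_MAIN")
print(offsets)
```

For a real test binary you would read the file contents in binary mode and pass them as data. Once the string has been located, following its cross-references in IDA leads back to the function that uses it, which in a test program of this kind is main.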
[144] Windows GUI applications require a WinMain function instead of main. Documentation regarding WinMain can be found here: http://msdn2.microsoft.com/en-us/library/ms633559.aspx.