Locating main

If you were fortunate enough to have source code available for a C/C++ program that you wanted to analyze, a good place to begin your analysis might be the main function, as this is where execution notionally begins. When faced with analyzing a binary, this is not a bad strategy to follow. However, as we know, it is complicated by the fact that compilers/linkers (and the use of libraries) add additional code that executes before main is reached. Thus it would often be incorrect to assume that the entry point of a binary corresponds to the main function written by the program’s author.

In fact, the notion that all programs have a main function is a C/C++ compiler convention rather than a hard-and-fast rule for writing programs. If you have ever written a Windows GUI application, then you may be familiar with the WinMain variation on main. Once you step away from C/C++, you will find that other languages use other names for their primary entry-point function. Regardless of what it may be called, we will refer to this function generically as the main function.

Chapter 12 covered the concept of IDA signature files, their generation, and their application. IDA utilizes special startup signatures to attempt to identify a program’s main function. When IDA is able to match a binary’s startup sequence against one of the startup sequences in its signature files, IDA can locate a program’s main function based on its understanding of the behavior of the matched startup routine. This works great until IDA fails to match the startup sequence in a binary to any of its known signatures. In general, a program’s startup code is closely tied to both the compiler used to generate the code and the platform for which the code was built.

Recall from Chapter 12 that startup signatures are grouped together and stored in signature files specific to binary file types. For example, startup signatures for use with the PE loader are stored in pe.sig, while startup signatures for use with the MS-DOS loader are stored in exe.sig. The existence of a signature file for a given binary file type does not guarantee that IDA will be able to identify a program’s main function 100 percent of the time. There are too many compilers, and startup sequences are too much of a moving target for IDA to ship with every possible signature.

For many file types, such as ELF and Mach-O, IDA does not include any startup signatures at all. The net result is that IDA can’t use signatures to locate a main function within an ELF binary (though the function will be found if it is named main).

The point of this discussion is to prepare you for the fact that, on occasion, you will be on your own when it comes to locating the main function of a program. In such cases it is useful to have some strategies for understanding how the program itself prepares for the call to main. As an example, consider a binary that has been obfuscated to some degree. In this case, IDA will certainly fail to match a startup signature because the startup routine itself has been obfuscated. If you manage to de-obfuscate the binary somehow (the topic of Chapter 21), you will probably need to locate not only main on your own but the original start routine as well.

For C and C++ programs with a traditional main function,[144] one of the responsibilities of the startup code is to set up the stack arguments required by main: the integer argc (a count of the number of command-line arguments), the character pointer array argv (an array of pointers to strings containing the command-line arguments), and the character pointer array envp (an array of pointers to strings containing the environment variables that were set at program invocation). The following excerpt from a dynamically linked, stripped FreeBSD 8.0 binary demonstrates how gcc-generated startup code calls main on a FreeBSD system:

.text:08048365          mov     dword ptr [esp], offset _term_proc ; func
.text:0804836C          call    _atexit
.text:08048371          call    _init_proc
.text:08048376          lea     eax, [ebp+arg_0]
.text:08048379          mov     [esp+8], esi
.text:0804837D          mov     [esp+4], eax
.text:08048381          mov     [esp], ebx
.text:08048384          call    sub_8048400
.text:08048389          mov     [esp], eax      ; status
.text:0804838C          call    _exit

In this case, the call to sub_8048400 turns out to be the call to main. This code is typical of many startup sequences in that there are calls to initialization functions (_atexit and _init_proc) preceding the call to main and a call to _exit following the return from main. The call to _exit ensures that the program terminates cleanly in the event that main performs a return rather than calling _exit itself. Note that the parameter passed to _exit is the value returned by main in EAX; thus the exit code of the program is the return value of main.

If the previous program was statically linked and stripped, the start routine would have the same structure as the preceding example; however, none of the library functions would have useful names. In that case, the main function would continue to stand out as the only function that is called with three parameters. Of course, applying FLIRT signatures as early as possible would also help to restore many of the library function names and make main stand out, as it does in the preceding example.
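The three-parameter observation can be turned into a rough text-based check. The following is a minimal sketch rather than an IDA feature: it assumes the disassembly has been copied out as plain text with addresses and comments trimmed, and that arguments are written with mov into [esp], [esp+4], and [esp+8] as in the FreeBSD excerpt. The function name calls_with_arg_count is our own invention.

```python
import re

# Listing text copied from the FreeBSD excerpt (addresses and comments trimmed).
LISTING = """\
mov     dword ptr [esp], offset _term_proc
call    _atexit
call    _init_proc
lea     eax, [ebp+arg_0]
mov     [esp+8], esi
mov     [esp+4], eax
mov     [esp], ebx
call    sub_8048400
mov     [esp], eax
call    _exit
"""

# Matches a store to an outgoing-argument slot: [esp], [esp+4], [esp+8], ...
SLOT_STORE = re.compile(r"^mov\s+(?:dword ptr\s+)?\[esp(?:\+([0-9A-Fa-f]+)h?)?\],")


def calls_with_arg_count(listing, wanted=3):
    """Return targets of calls preceded by exactly `wanted` distinct slot stores."""
    hits = []
    slots = set()
    for line in listing.splitlines():
        line = line.strip()
        store = SLOT_STORE.match(line)
        if store:
            # A missing displacement means the slot at [esp] itself.
            slots.add(int(store.group(1) or "0", 16))
        elif line.startswith("call"):
            if len(slots) == wanted:
                hits.append(line.split(None, 1)[1])
            slots = set()  # treat the arguments as consumed by this call
    return hits


print(calls_with_arg_count(LISTING))  # ['sub_8048400']
```

A check like this only narrows the field. A start routine that happens to call another three-argument function would produce more than one candidate, and each would still need to be inspected by hand.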

In order to demonstrate that the same compiler may generate a completely different style of code when running on a different platform, consider the following example, also created using gcc, of a dynamically linked, stripped binary taken from a Linux system:

.text:080482B0 start           proc near
.text:080482B0                 xor     ebp, ebp
.text:080482B2                 pop     esi
.text:080482B3                 mov     ecx, esp
.text:080482B5                 and     esp, 0FFFFFFF0h
.text:080482B8                 push    eax
.text:080482B9                 push    esp
.text:080482BA                 push    edx
.text:080482BB                 push    offset sub_80483C0
.text:080482C0                 push    offset sub_80483D0
.text:080482C5                 push    ecx
.text:080482C6                 push    esi
.text:080482C7                 push    offset loc_8048384
.text:080482CC                 call    ___libc_start_main
.text:080482D1                 hlt
.text:080482D1 start           endp

In this example, start makes a single function call to ___libc_start_main. The purpose of ___libc_start_main is to perform all of the same types of tasks that were performed in the preceding FreeBSD example, including calling main and ultimately exit. Since ___libc_start_main is a library function, the only way it can know where main actually resides is to be told via one of its parameters (of which there appear to be eight). Clearly two of the parameters (sub_80483C0 and sub_80483D0) are pointers to functions, while a third (loc_8048384) is a pointer to a location within the .text section. There are few clues in the previous listing as to which of these might be main, so you might need to analyze the code at the three potential locations in order to correctly locate main. This might be a useful exercise; however, you may prefer simply to remember that the first argument (topmost on the stack and therefore last pushed) to ___libc_start_main is in fact a pointer to main.

There are two factors that combine to prevent IDA from identifying loc_8048384 as a function (which would have been named sub_8048384). The first is that the function is never called directly, so loc_8048384 never appears as the target of a call instruction. The second is that although IDA contains heuristics to recognize functions based on their prologues (which is why sub_80483C0 and sub_80483D0 are identified as functions even though they too are never called directly), the function at loc_8048384 (main) does not use a prologue recognized by IDA. The offending prologue (with comments) is shown here:

.text:08048384 loc_8048384:                       ; DATA XREF: start+17↑o
.text:08048384          lea     ecx, [esp+4]       ; address of arg_0 into ecx
.text:08048388          and     esp, 0FFFFFFF0h    ; 16 byte align esp
.text:0804838B          push    dword ptr [ecx-4]  ; push copy of return address
.text:0804838E          push    ebp                ; save caller's ebp
.text:0804838F          mov     ebp, esp           ; initialize our frame pointer
.text:08048391          push    ecx                ; save ecx
.text:08048392          sub     esp, 24h           ; allocate locals

This prologue clearly contains the elements of a traditional prologue for a function that uses EBP as a frame pointer. The caller’s frame pointer is saved (push ebp) before setting the frame pointer for the current function (mov ebp, esp) and finally allocating space for local variables (sub esp, 24h). The problem for IDA is that these actions do not occur as the first actions within the function, and thus IDA’s heuristics fail. It is a simple enough matter to manually create a function (Edit ▸ Functions ▸ Create Function) at this point, but you should take care to monitor IDA’s behavior. Just as it failed to identify the function in the first place, it may fail to recognize the fact that the function uses EBP as a frame pointer. In such a case, you would need to edit the function (alt-P) to force IDA to believe that the function has a BP-based frame as well as to make adjustments to the number of stack bytes dedicated to saved registers and local variables.
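If you suspect that other functions in a binary begin with the same stack-alignment sequence, one option is to search the raw bytes of the code section for its encoding. The sketch below is illustrative only. It assumes 32-bit x86 and the exact instruction forms shown in the preceding listing, the byte values are the standard encodings of those three instructions, and the buffer is a made-up stand-in for the contents of a .text section. The name find_aligned_prologues is hypothetical.

```python
# Encodings of the first three instructions of the prologue (32-bit x86):
#   lea  ecx, [esp+4]        -> 8D 4C 24 04
#   and  esp, 0FFFFFFF0h     -> 83 E4 F0
#   push dword ptr [ecx-4]   -> FF 71 FC
ALIGNED_PROLOGUE = bytes.fromhex("8D4C240483E4F0FF71FC")


def find_aligned_prologues(code, base):
    """Yield the address of every occurrence of the pattern within `code`."""
    offset = code.find(ALIGNED_PROLOGUE)
    while offset != -1:
        yield base + offset
        offset = code.find(ALIGNED_PROLOGUE, offset + 1)


# Hypothetical buffer: four filler bytes, then the prologue bytes through
# `sub esp, 24h` (push ebp = 55, mov ebp, esp = 89 E5, push ecx = 51).
text = bytes.fromhex("90909090" "8D4C240483E4F0FF71FC" "5589E551" "83EC24")

print([hex(address) for address in find_aligned_prologues(text, 0x8048380)])
# ['0x8048384']
```

A match is only a hint that a function may start at that address. Other compiler versions or options may emit a different sequence, so the pattern should be checked against a test binary built with the same tools.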

As in the case of the FreeBSD binary, if the preceding Linux example happened to be both statically linked and stripped, the start routine would not change at all other than the fact that the name for ___libc_start_main would be missing. You could still locate main by remembering that gcc’s Linux start routine makes only one function call and that the first parameter to that function is the address of main.
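The "one call, first parameter" rule is simple enough to express as a short script. This is a sketch under stated assumptions: it works on listing text copied from the Linux start routine shown earlier (addresses trimmed), it assumes arguments are passed with push instructions, and first_argument_of_only_call is a name we made up for illustration.

```python
# Text of the gcc/Linux start routine (addresses trimmed).
START = """\
xor     ebp, ebp
pop     esi
mov     ecx, esp
and     esp, 0FFFFFFF0h
push    eax
push    esp
push    edx
push    offset sub_80483C0
push    offset sub_80483D0
push    ecx
push    esi
push    offset loc_8048384
call    ___libc_start_main
hlt
"""


def first_argument_of_only_call(listing):
    """Return the operand of the last push before the routine's single call.

    Returns None when the routine does not contain exactly one call or when
    no push precedes that call.
    """
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    call_indexes = [i for i, line in enumerate(lines) if line.startswith("call")]
    if len(call_indexes) != 1:
        return None
    # The last value pushed is topmost on the stack, hence the first argument.
    for line in reversed(lines[:call_indexes[0]]):
        if line.startswith("push"):
            operand = line.split(None, 1)[1]
            return operand.replace("offset ", "", 1)
    return None


print(first_argument_of_only_call(START))  # loc_8048384
```

Because the check does not depend on the name of the called function, it behaves the same way when the name ___libc_start_main is missing from a stripped, statically linked binary.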

On the Windows side of the house, the number of C/C++ compilers (and therefore the number of startup routines) in use is somewhat higher. Perhaps not surprisingly, in the case of gcc on Windows, it is possible to leverage some of the knowledge gained by studying gcc’s behavior on other platforms. The startup routine shown here is from a gcc/Cygwin binary:

.text:00401000 start     proc near
.text:00401000
.text:00401000 var_28    = dword ptr -28h
.text:00401000 var_24    = dword ptr -24h
.text:00401000 var_20    = dword ptr -20h
.text:00401000 var_2     = word ptr -2
.text:00401000
.text:00401000           push    ebp
.text:00401001           mov     ebp, esp
.text:00401003           sub     esp, 28h
.text:00401006           and     esp, 0FFFFFFF0h
.text:00401009           fnstcw  [ebp+var_2]
.text:0040100C           movzx   eax, [ebp+var_2]
.text:00401010           and     ax, 0F0C0h
.text:00401014           mov     [ebp+var_2], ax
.text:00401018           movzx   eax, [ebp+var_2]
.text:0040101C           or      ax, 33Fh
.text:00401020           mov     [ebp+var_2], ax
.text:00401024           fldcw   [ebp+var_2]
.text:00401027           mov     [esp+28h+var_28], offset sub_4010B0
.text:0040102E           call    sub_401120

Clearly this code does not map cleanly to the previous Linux-based example. However, there is one striking similarity: only one function is called (sub_401120), and that function takes a function pointer (sub_4010B0) as its parameter. In this case sub_401120 serves much the same purpose as ___libc_start_main, while sub_4010B0 turns out to be the main function of the program.

Windows binaries compiled using gcc/MinGW make use of yet another style of start function, as shown here:

.text:00401280 start           proc near
.text:00401280
.text:00401280 var_8           = dword ptr -8
.text:00401280
.text:00401280                 push    ebp
.text:00401281                 mov     ebp, esp
.text:00401283                 sub     esp, 8
.text:00401286                 mov     [esp+8+var_8], 1
.text:0040128D                 call    ds:__set_app_type
.text:00401293                 call    sub_401150
.text:00401293 start           endp

This is another case in which IDA will fail to identify the program’s main function. The preceding code offers few clues as to the location of main, as there is only one nonlibrary function called (sub_401150) and that function does not appear to take any arguments (as main should). In this instance, the best course of action is to continue the search for main within sub_401150. A portion of sub_401150 is shown here:

.text:0040122A                 call    __p__environ
.text:0040122F                 mov     eax, [eax]
.text:00401231                 mov     [esp+8], eax
.text:00401235                 mov     eax, ds:dword_404000
.text:0040123A                 mov     [esp+4], eax
.text:0040123E                 mov     eax, ds:dword_404004
.text:00401243                 mov     [esp], eax
.text:00401246                 call    sub_401395
.text:0040124B                 mov     ebx, eax
.text:0040124D                 call    _cexit
.text:00401252                 mov     [esp], ebx
.text:00401255                 call    ExitProcess

In this example, the function turns out to have many similarities with the start function associated with FreeBSD that we saw earlier. Process of elimination points to sub_401395 as the likely candidate for main, as it is the only non-library function that is called with three arguments (the values stored at [esp], [esp+4], and [esp+8]). Also, the third argument is related to the return value of the __p__environ library function, which correlates well with the fact that main’s third argument is expected to be a pointer to the environment strings array. The example code is also preceded by a call to the __getmainargs library function (not shown), which is called to set up the argc and argv parameters prior to actually calling main. This helps to reinforce the notion that main is about to be called.

The start routine for Visual C/C++ code is short and sweet, as seen here:

.text:0040134B start           proc near
.text:0040134B                 call    ___security_init_cookie
.text:00401350                 jmp     ___tmainCRTStartup
.text:00401350 start           endp

IDA has actually recognized the library routines referenced in the two instructions through the application of startup signatures rather than by the fact that the program is linked to a dynamic library containing the given symbols. IDA’s startup signatures provide easy location of the initial call to main, as shown here:

.text:004012D8                 mov     eax, envp
.text:004012DD                 mov     dword_40ACF4, eax
.text:004012E2                 push    eax             ; envp
.text:004012E3                 push    argv            ; argv
.text:004012E9                 push    argc            ; argc
.text:004012EF                 call    _main
.text:004012F4                 add     esp, 0Ch
.text:004012F7                 mov     [ebp+var_1C], eax
.text:004012FA                 cmp     [ebp+var_20], 0
.text:004012FE                 jnz     short $LN35
.text:00401300                 push    eax             ; uExitCode
.text:00401301                 call    $LN27
.text:00401306 $LN35:                      ; CODE XREF: ___tmainCRTStartup+169↑j
.text:00401306                 call    __cexit
.text:0040130B                 jmp     short loc_40133B

Within the entire body of ___tmainCRTStartup, _main is the only function called with exactly three arguments. Further analysis would reveal that the call to _main is preceded by a call to the GetCommandLine library function, which is yet another indication that a program’s main function may be called shortly. As a final note concerning the use of startup signatures, it is important to understand that, in this example, IDA has generated the name _main entirely on its own as a result of matching a startup signature. The ASCII string main appeared nowhere in the binary used in this example. Thus, you can expect main to be found and labeled anytime a startup signature is matched, even when a binary has been stripped of its symbols.

The last startup routine that we will examine for a C compiler is generated by Borland’s free command-line compiler.[145] The last few lines of Borland’s start routine are shown here:

.text:00401041                 push    offset off_4090B8
.text:00401046                 push    0               ; lpModuleName
.text:00401048                 call    GetModuleHandleA
.text:0040104D                 mov     dword_409117, eax
.text:00401052                 push    0          ; fake return value
.text:00401054                 jmp     __startup

The pointer value pushed on the stack refers to a structure that in turn contains a pointer to main. Within __startup, the setup to call main is shown here:

.text:00406997                 mov     edx, dword_40BBFC
.text:0040699D                 push    edx
.text:0040699E                 mov     ecx, dword_40BBF8
.text:004069A4                 push    ecx
.text:004069A5                 mov     eax, dword_40BBF4
.text:004069AA                 push    eax
.text:004069AB                 call    dword ptr [esi+18h]
.text:004069AE                 add     esp, 0Ch
.text:004069B1                 push    eax             ; status
.text:004069B2                 call    _exit

Again, this example bears many similarities to previous examples in that the call to main takes three arguments, pushed from EDX, ECX, and EAX (the only function called within __startup to do so), and the return value is passed directly to _exit to terminate the program. Additional analysis of __startup would reveal calls to the Windows API functions GetEnvironmentStrings and GetCommandLine, which are often precursors to the invocation of main.
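The second clue, that the return value of main flows straight into the exit routine, can also be checked mechanically when the exit routine has a recognizable name. The following sketch is our own illustration, not something IDA provides. It works on listing text with addresses and comments trimmed, the set of exit names is an assumption, and only the two argument-passing forms seen in this chapter (push eax and mov [esp], eax) are recognized.

```python
# Tail of Borland's __startup (addresses and comments trimmed).
STARTUP_TAIL = """\
mov     edx, dword_40BBFC
push    edx
mov     ecx, dword_40BBF8
push    ecx
mov     eax, dword_40BBF4
push    eax
call    dword ptr [esi+18h]
add     esp, 0Ch
push    eax
call    _exit
"""

# Assumption: names under which the exit routine may appear in a listing.
EXIT_NAMES = {"_exit", "exit"}


def normalize(listing):
    """Collapse runs of whitespace so lines can be compared literally."""
    return [" ".join(line.split()) for line in listing.splitlines() if line.strip()]


def call_feeding_exit(listing):
    """Return the target of the call whose EAX result is handed to exit."""
    lines = normalize(listing)
    for i, line in enumerate(lines):
        if not (line.startswith("call ") and line[5:] in EXIT_NAMES and i >= 1):
            continue
        if lines[i - 1] not in ("push eax", "mov [esp], eax"):
            continue
        j = i - 2
        # Skip caller-side stack cleanup between the two calls.
        while j >= 0 and lines[j].startswith("add esp,"):
            j -= 1
        if j >= 0 and lines[j].startswith("call "):
            return lines[j][5:]
    return None


print(call_feeding_exit(STARTUP_TAIL))  # dword ptr [esi+18h]
```

The sketch deliberately ignores cases in which the return value is parked in another register first, as in the MinGW example, where it is saved in EBX across a call to _cexit before being passed to ExitProcess.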

Finally, in order to demonstrate that tracking down a program’s main function is not a problem specific to C programs, consider the following startup code from a compiled Visual Basic 6.0 program:

.text:004018A4 start:
.text:004018A4               push    offset dword_401994
.text:004018A9               call    ThunRTMain

The ThunRTMain library function performs a function similar to the Linux ___libc_start_main function in that its job is to perform any initialization required prior to invoking the actual main function of the program. In order to transfer control to the main function, Visual Basic utilizes a mechanism very similar to that in the Borland code in the earlier examples. ThunRTMain takes a single argument (offset dword_401994), which is a pointer to a structure containing additional information required for program initialization, including the address of the main function. The content of this structure is shown here:

.text:00401994 dword_401994    dd 21354256h, 2A1FF0h, 3 dup(0) ; DATA XREF: .text:start↑o
.text:004019A8                 dd 7Eh, 2 dup(0)
.text:004019B4                 dd 0A0000h, 409h, 0
.text:004019C0                 dd offset sub_4045D0
.text:004019C4                 dd offset dword_401A1C
.text:004019C8                 dd 30F012h, 0FFFFFF00h, 8, 2 dup(1), 0E9h, 401944h, 4018ECh
.text:004019C8                 dd 4018B0h, 78h, 7Dh, 82h, 83h, 4 dup(0)

Within this data structure, there is only one item that appears to reference code at all, the pointer to sub_4045D0, which turns out to be the main function for the program.
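To make the layout concrete, the following sketch rebuilds the first 13 dwords of the structure from the listing and reads the pointer back out. The offset used is simply the distance between the two addresses in this listing (0x4019C0 - 0x401994 = 0x2C); whether that offset holds for other Visual Basic binaries is an assumption you would need to verify. The name main_from_header is hypothetical.

```python
import struct

BASE = 0x401994  # address of dword_401994 in the listing

# First 13 dwords of the structure, transcribed from the listing.
dwords = [
    0x21354256, 0x2A1FF0, 0, 0, 0,  # 0x401994
    0x7E, 0, 0,                     # 0x4019A8
    0xA0000, 0x409, 0,              # 0x4019B4
    0x4045D0,                       # 0x4019C0: offset sub_4045D0
    0x401A1C,                       # 0x4019C4: offset dword_401A1C
]
header = struct.pack("<%dI" % len(dwords), *dwords)

# Distance from the start of the structure to the code pointer (0x2C here).
MAIN_POINTER_OFFSET = 0x4019C0 - BASE


def main_from_header(blob):
    """Read the little-endian code pointer stored in the structure."""
    (address,) = struct.unpack_from("<I", blob, MAIN_POINTER_OFFSET)
    return address


print(hex(main_from_header(header)))  # 0x4045d0
print(header[:4])                     # b'VB5!'
```

Read as bytes, the first dword of the structure spells the ASCII text VB5!, which can serve as a convenient anchor when looking for the structure in an unfamiliar binary.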

In the end, learning how to find main is a matter of understanding how executable files are built. In cases where you are experiencing difficulties, it may be beneficial to build some simple executables (with a reference to an easily identifiable string in main, for example) with the same tools used to build the binary you are analyzing. By studying your test cases, you will gain an understanding of the basic structure of binaries built using a specific set of tools that may assist you in further analyzing more complex binaries built with the same set of tools.



[144] Windows GUI applications require a WinMain function instead of main. Documentation regarding WinMain can be found here: http://msdn2.microsoft.com/en-us/library/ms633559.aspx.
