Chapter 8. Datatypes and Data Structures


The low-hanging fruit in understanding the behavior of binary programs lies in cataloging the library functions that the program calls. A C program that calls the connect function is creating a network connection. A Windows program that calls RegOpenKey is accessing the Windows registry. Additional analysis is required, however, to gain an understanding of how and why these functions are called.

Discovering how a function is called requires learning what parameters are passed to the function. In the case of a connect call, beyond the simple fact that the function is being called, it is important to know exactly what network address the program is connecting to. Understanding the data that is being passed into functions is the key to reverse engineering a function’s signature (the number, type, and sequence of parameters required by the function) and, as such, points out the importance of understanding how datatypes and data structures are manipulated at the assembly language level.

In this chapter we will examine how IDA conveys datatype information to the user, how data structures are stored in memory, and how data within those data structures is accessed. The simplest method for associating a specific datatype with a variable is to observe the use of the variable as a parameter to a function that we know something about. During its analysis phase, IDA makes every effort to annotate datatypes when they can be deduced based on a variable’s use with a function for which IDA possesses a prototype. When possible, IDA will go as far as using a formal parameter name lifted from a function prototype rather than generating a default dummy name for the variable. This can be seen in the following disassembly of a call to connect:

.text:004010F3                 push    10h             ; namelen
.text:004010F5                 lea     ecx, [ebp+name]
.text:004010F8                 push    ecx             ; name
.text:004010F9                 mov     edx, [ebp+s]
.text:004010FF                 push    edx             ; s
.text:00401100                 call    connect

In this listing we can see that each push has been commented with the name of the parameter that is being pushed (taken from IDA’s knowledge of the function prototype). In addition, two local stack variables have been named for the parameters that they correspond to. In most cases, these names will be far more informative than the dummy names that IDA would otherwise generate.

IDA’s ability to propagate type information from function prototypes is not limited to library functions contained in IDA’s type libraries. IDA can propagate formal parameter names and datatypes from any function in your database as long as you have explicitly set the function’s type information. Upon initial analysis, IDA assigns dummy names and the generic type int to all function arguments, unless through type propagation it has reason to do otherwise. To set a function’s type yourself, use the Edit ▸ Functions ▸ Set Function Type command, right-click the function name and choose Set Function Type on the context menu, or use the Y hotkey. For the function shown below, this results in the dialog shown in Figure 8-1, in which you may enter the function’s correct prototype.

.text:00401050 ; ======== S U B R O U T I N E =========================
.text:00401050
.text:00401050 ; Attributes: bp-based frame
.text:00401050
.text:00401050 foo     proc near      ; CODE XREF: demo_stackframe+2A↓p
.text:00401050
.text:00401050 arg_0   = dword ptr  8
.text:00401050 arg_4   = dword ptr  0Ch
.text:00401050
.text:00401050         push    ebp
.text:00401051         mov     ebp, esp

As shown below, IDA assumes an int return type, correctly deduces that the cdecl calling convention is used based on the type of ret instruction used, incorporates the name of the function as we have modified it, and assumes all parameters are of type int. Because we have not yet modified the argument names, IDA displays only their types.

Figure 8-1. Setting a function’s type

If we modify the prototype to read int __cdecl foo(float f, char *ptr), IDA will automatically insert a prototype comment for the function and change the argument names in the disassembly as shown below.

.text:00401050 ; ======== S U B R O U T I N E =========================
.text:00401050
.text:00401050 ; Attributes: bp-based frame
.text:00401050
.text:00401050 ; int __cdecl foo(float f, char *ptr)
.text:00401050 foo     proc near      ; CODE XREF: demo_stackframe+2A↓p
.text:00401050
.text:00401050 f       = dword ptr  8
.text:00401050 ptr     = dword ptr  0Ch
.text:00401050
.text:00401050         push    ebp
.text:00401051         mov     ebp, esp

Finally, IDA propagates this information to all callers of the newly modified function, resulting in improved annotation of all related function calls as shown here. Note that the argument names f and ptr have been propagated out as comments in the calling function and used to rename variables that formerly used dummy names.

.text:004010AD         mov     eax, [ebp+ptr]
.text:004010B0         mov     [esp+4], eax    ; ptr
.text:004010B4         mov     eax, [ebp+f]
.text:004010B7         mov     [esp], eax      ; f
.text:004010BA         call    foo

Returning to imported library functions, it is often the case that IDA will already know the prototype of the function. In such cases, you can easily view the prototype by holding the mouse over the function name.[44] When IDA has no knowledge of a function’s parameter sequence, it should, at a minimum, know the name of the library from which the function was imported (see the Imports window). When this happens, your best resources for learning the behavior of the function are any associated man pages or other available API documentation (such as MSDN online[45]). When all else fails, remember the adage: Google is your friend.

For the remainder of this chapter, we will be discussing how to recognize when data structures are being used in a program, how to decipher the organizational layout of such structures, and how to use IDA to improve the readability of a disassembly when such structures are in use. Since C++ classes are a complex extension of C structures, the chapter concludes with a discussion of reverse engineering compiled C++ programs.

Recognizing Data Structure Use

While primitive datatypes are often a natural fit with the size of a CPU’s registers or instruction operands, composite datatypes such as arrays and structures typically require more complex instruction sequences in order to access the individual data items that they contain. Before we can discuss IDA’s feature for improving the readability of code that utilizes complex datatypes, we need to review what that code looks like.

Array Member Access

Arrays are the simplest composite data structure in terms of memory layout. Traditionally, arrays are contiguous blocks of memory that contain consecutive elements of the same datatype. The size of an array is easy to compute, as it is the product of the number of elements in the array and the size of each element. Using C notation, the minimum number of bytes consumed by the following array

int array_demo[100];

is computed as

int bytes = 100 * sizeof(int);

Individual array elements are accessed by supplying an index value, which may be a variable or a constant, as shown in these array references:

array_demo[20] = 15;        // fixed index into the array
for (int i = 0; i < 100; i++) {
   array_demo[i] = i;       // varying index into the array
}

Assuming, for the sake of example, that sizeof(int) is 4 bytes, the first array access (array_demo[20]) touches the integer value that lies 80 bytes into the array, while the second array access (array_demo[i]) touches successive integers at offsets 0, 4, 8, ..., 396 bytes into the array. The offset for the first array access can be computed at compile time as 20 * 4. In most cases, the offset for the second array access must be computed at runtime because the value of the loop counter, i, is not fixed at compile time. Thus for each pass through the loop, the product i * 4 must be computed to determine the exact offset into the array. Ultimately, the manner in which an array element is accessed depends not only on the type of index used but also on where the array happens to be allocated within the program’s memory space.
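The offset arithmetic described above can be sketched in C. The helper names element_offset and measured_offset are hypothetical and exist only for illustration; the second helper measures the offset on a real array, so it makes no assumption about sizeof(int).

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of element `index` in an array whose
   elements are `elem_size` bytes each. */
size_t element_offset(size_t index, size_t elem_size) {
    return index * elem_size;
}

/* The same offset measured on a real array: the distance in bytes
   between the start of the array and the chosen element. */
size_t measured_offset(const int *array, size_t index) {
    return (size_t)((const char *)&array[index] - (const char *)array);
}
```

With 4-byte integers, element_offset(20, 4) yields the 80-byte offset of array_demo[20], and the last element of a 100-element array sits at element_offset(99, 4), or 396 bytes.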

Globally Allocated Arrays

When an array is allocated within the global data area of a program (within the .data or .bss section, for example), the base address of the array is known to the compiler at compile time. The fixed base address makes it possible for the compiler to compute fixed addresses for any array element that is accessed using a fixed index. Consider the following trivial program that accesses a global array using both fixed and variable offsets:

int global_array[3];

int main() {
   int idx = 2;
   global_array[0] = 10;
   global_array[1] = 20;
   global_array[2] = 30;
   global_array[idx] = 40;
}

This program disassembles to the following:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 idx             = dword ptr -4
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 push    ecx
.text:00401004                 mov     [ebp+idx], 2
.text:0040100B                 mov     dword_40B720, 10
.text:00401015                 mov     dword_40B724, 20
.text:0040101F                 mov     dword_40B728, 30
.text:00401029                 mov     eax, [ebp+idx]
.text:0040102C                 mov     dword_40B720[eax*4], 40
.text:00401037                 xor     eax, eax
.text:00401039                 mov     esp, ebp
.text:0040103B                 pop     ebp
.text:0040103C                 retn
.text:0040103C _main           endp

While this program has only one global variable, the disassembly lines at 0040100B, 00401015, and 0040101F seem to indicate that there are three global variables. The computation of an offset (eax * 4) at 0040102C is the only thing that seems to hint at the presence of a global array named dword_40B720, yet this is the same name as the global variable found at 0040100B.

Based on the dummy names assigned by IDA, we know that the global array is made up of the 12 bytes beginning at address 0040B720. During the compilation process, the compiler has used the fixed indexes (0, 1, 2) to compute the actual addresses of the corresponding elements in the array (0040B720, 0040B724, and 0040B728), which are referenced using the global variables at 0040100B, 00401015, and 0040101F. Using IDA’s array-formatting operations discussed in the last chapter (Edit ▸ Array), dword_40B720 can be formatted as a three-element array yielding the alternate disassembly lines shown in the following listing. Note that this particular formatting highlights the use of offsets into the array:

.text:0040100B                 mov     dword_40B720, 10
.text:00401015                 mov     dword_40B720+4, 20
.text:0040101F                 mov     dword_40B720+8, 30

There are two points to note in this example. First, when constant indexes are used to access global arrays, the corresponding array elements will appear as global variables in the corresponding disassembly. In other words, the disassembly will offer essentially no evidence that an array exists. The second point is that the use of variable index values leads us to the start of the array because the base address will be revealed (as at 0040102C) when the computed offset is added to it to compute the actual array location to be accessed. The computation at 0040102C offers one additional piece of significant information about the array. By observing the amount by which the array index is multiplied (4 in this case), we learn the size (though not the type) of an individual element in the array.
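As a rough illustration of the second point, the address formed by an operand such as dword_40B720[eax*4] is simply a fixed base plus a scaled index. The function name scaled_address is hypothetical; the base address and scale factor are the ones seen in the listing above.

```c
#include <stdint.h>

/* Address formed by an operand such as dword_40B720[eax*4]:
   a fixed base plus the index register times a scale factor. */
uint32_t scaled_address(uint32_t base, uint32_t index, uint32_t scale) {
    return base + index * scale;
}
```

With idx equal to 2, scaled_address(0x40B720, 2, 4) lands on 0x40B728, the same location that the constant-index store to global_array[2] referenced as dword_40B728.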

Stack-Allocated Arrays

How does array access differ if the array is allocated as a stack variable instead? Instinctively, we might think that it must be different since the compiler can’t know an absolute address at compile time, so surely even accesses that use constant indexes must require some computation at runtime. In practice, however, compilers treat stack-allocated arrays almost identically to globally allocated arrays.

Consider the following program that makes use of a small stack-allocated array:

int main() {
   int stack_array[3];
   int idx = 2;
   stack_array[0] = 10;
   stack_array[1] = 20;
   stack_array[2] = 30;
   stack_array[idx] = 40;
}

The address at which stack_array will be allocated is unknown at compile time, so it is not possible for the compiler to precompute the address of stack_array[1] at compile time as it did in the global array example. By examining the disassembly listing for this function, we gain insight into how stack-allocated arrays are accessed:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 var_10          = dword ptr -10h
.text:00401000 var_C           = dword ptr -0Ch
.text:00401000 var_8           = dword ptr -8
.text:00401000 idx             = dword ptr -4
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 sub     esp, 10h
.text:00401006                 mov     [ebp+idx], 2
.text:0040100D                 mov     [ebp+var_10], 10
.text:00401014                 mov     [ebp+var_C], 20
.text:0040101B                 mov     [ebp+var_8], 30
.text:00401022                 mov     eax, [ebp+idx]
.text:00401025                 mov     [ebp+eax*4+var_10], 40
.text:0040102D                 xor     eax, eax
.text:0040102F                 mov     esp, ebp
.text:00401031                 pop     ebp
.text:00401032                 retn
.text:00401032 _main           endp

As with the global array example, this function appears to have three variables (var_10, var_C, and var_8) rather than an array of three integers. Based on the constant operands used at 0040100D, 00401014, and 0040101B, we know that what appear to be local variable references are actually references to the three elements of stack_array, whose first element must reside at var_10, the local variable with the lowest memory address.

To understand how the compiler resolved the references to the other elements of the array, consider what the compiler goes through when dealing with the reference to stack_array[1], which lies 4 bytes into the array, or 4 bytes beyond the location of var_10. Within the stack frame, the compiler has elected to allocate stack_array at ebp - 0x10. The compiler understands that stack_array[1] lies at ebp - 0x10 + 4, which simplifies to ebp - 0x0C. The result is that IDA displays this as a local variable reference. The net effect is that, similar to globally allocated arrays, the use of constant index values tends to hide the presence of a stack-allocated array. Only the array access at 00401025 hints at the fact that var_10 is the first element in the array rather than a simple integer variable. In addition, the disassembly line at 00401025 also helps us conclude that the size of individual elements in the array is 4 bytes.
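The folding of a constant index into a frame offset can be sketched as follows. The function name frame_offset is hypothetical; offsets are relative to ebp and are negative for local variables, as in the listing above.

```c
/* Frame-pointer-relative offset of element `index` for an array that
   starts `array_base` bytes from ebp (negative for locals). */
int frame_offset(int array_base, int index, int elem_size) {
    return array_base + index * elem_size;
}
```

For the array at ebp - 0x10, frame_offset(-0x10, 1, 4) gives -0x0C and frame_offset(-0x10, 2, 4) gives -8, which is why the stores to stack_array[1] and stack_array[2] show up as var_C and var_8.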

Stack-allocated arrays and globally allocated arrays are thus treated very similarly by compilers. However, there is an extra piece of information that we can attempt to extract from the disassembly of the stack example. Based on the location of idx within the stack, it is possible to conclude that the array that begins with var_10 contains no more than three elements (otherwise, it would overwrite idx). If you are an exploit developer, this can be very useful in determining exactly how much data you can fit into an array before you overflow it and begin to corrupt the data that follows.
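A minimal sketch of that bound, assuming the array ends no later than where the next variable above it in the frame begins. The function name max_elements is hypothetical.

```c
/* Upper bound on the element count of a stack array, assuming the array
   ends no later than the start of the next variable above it.
   Both offsets are relative to the frame pointer. */
int max_elements(int array_off, int next_var_off, int elem_size) {
    return (next_var_off - array_off) / elem_size;
}
```

Here the array starts at -0x10 and idx sits at -4, so max_elements(-0x10, -4, 4) evaluates to 3. This is only an upper bound: a compiler is free to leave unused space between variables.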

Heap-Allocated Arrays

Heap-allocated arrays are allocated using a dynamic memory allocation function such as malloc (C) or new (C++). From the compiler’s perspective, the primary difference in dealing with a heap-allocated array is that the compiler must generate all references into the array based on the address value returned from the memory allocation function. For the sake of comparison, we now take a look at the following function, which allocates a small array in the program heap:

#include <stdlib.h>

int main() {
   int *heap_array = (int*)malloc(3 * sizeof(int));
   int idx = 2;
   heap_array[0] = 10;
   heap_array[1] = 20;
   heap_array[2] = 30;
   heap_array[idx] = 40;
}

In studying the corresponding disassembly that follows, you should notice a few similarities and differences with the two previous disassemblies:

.text:00401000 _main      proc near
.text:00401000
.text:00401000 heap_array      = dword ptr -8
.text:00401000 idx             = dword ptr -4
.text:00401000
.text:00401000            push    ebp
.text:00401001            mov     ebp, esp
.text:00401003            sub     esp, 8
.text:00401006            push    0Ch             ; size_t
.text:00401008            call    _malloc
.text:0040100D            add     esp, 4
.text:00401010            mov     [ebp+heap_array], eax
.text:00401013            mov     [ebp+idx], 2
.text:0040101A            mov     eax, [ebp+heap_array]
.text:0040101D            mov     dword ptr [eax], 10
.text:00401023            mov     ecx, [ebp+heap_array]
.text:00401026            mov     dword ptr [ecx+4], 20
.text:0040102D            mov     edx, [ebp+heap_array]
.text:00401030            mov     dword ptr [edx+8], 30
.text:00401037            mov     eax, [ebp+idx]
.text:0040103A            mov     ecx, [ebp+heap_array]
.text:0040103D            mov     dword ptr [ecx+eax*4], 40
.text:00401044            xor     eax, eax
.text:00401046            mov     esp, ebp
.text:00401048            pop     ebp
.text:00401049            retn
.text:00401049 _main      endp

The starting address of the array (returned from malloc in the EAX register) is stored in the local variable heap_array. In this example, unlike the previous examples, every access to the array begins with reading the contents of heap_array to obtain the array’s base address before an offset value can be added to compute the address of the correct element within the array. The references to heap_array[0], heap_array[1], and heap_array[2] require offsets of 0, 4, and 8 bytes, respectively, as seen at 0040101D, 00401026, and 00401030. The operation that most closely resembles the previous examples is the reference to heap_array[idx] at 0040103D, in which the offset into the array continues to be computed by multiplying the array index by the size of an array element.

Heap-allocated arrays have one particularly nice feature. When both the total size of the array and the size of each element can be determined, it is easy to compute the number of elements allocated to the array. For heap-allocated arrays, the parameter passed to the memory allocation function (0x0C passed to malloc at 00401006) represents the total number of bytes allocated to the array. Dividing this by the size of an element (4 bytes in this example, as observed from the offsets at 0040101D, 00401026, and 00401030) tells us the number of elements in the array. In the previous example, a three-element array was allocated.
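The division described above is simple enough to sketch directly. The function name element_count is hypothetical, and returning 0 for inconsistent inputs is a choice made for this sketch only.

```c
#include <stddef.h>

/* Number of elements in a heap array, given the byte count passed to
   the allocator and the element size inferred from the array accesses.
   Returns 0 when the inputs are inconsistent. */
size_t element_count(size_t alloc_bytes, size_t elem_size) {
    if (elem_size == 0 || alloc_bytes % elem_size != 0) {
        return 0;
    }
    return alloc_bytes / elem_size;
}
```

For the listing above, element_count(0x0C, 4) evaluates to 3. A remainder would suggest that the element size was misjudged or that the block holds something other than a plain array.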

The only firm conclusion we can draw regarding the use of arrays is that they are easiest to recognize when a variable is used as an index into the array. The array-access operation requires the index to be scaled by the size of an array element before adding the resulting offset to the base address of the array. Unfortunately, as we will show in the next section, when constant index values are used to access array elements, they do little to suggest the presence of an array and look remarkably similar to code used to access structure members.

Structure Member Access

C-style structs, referred to here generically as structures, are heterogeneous collections of data that allow grouping of items of dissimilar datatypes into a single composite datatype. A major distinguishing feature of structures is that the data fields within a structure are accessed by name rather than by index, as is done with arrays. Unfortunately, field names are converted to numeric offsets by the compiler, so by the time you are looking at a disassembly, structure field access looks remarkably similar to accessing array elements using constant indexes.

When a compiler encounters a structure definition, the compiler maintains a running total of the number of bytes consumed by the fields of the structure in order to determine the offset at which each field resides within the structure. The following structure definition will be used with the upcoming examples:

struct ch8_struct {   //Size     Minimum offset     Default offset
   int field1;        //  4             0                  0
   short field2;      //  2             4                  4
   char field3;       //  1             6                  6
   int field4;        //  4             7                  8
   double field5;     //  8             11                 16
};                //Minimum total size: 19   Default size: 24

The minimum required space to allocate a structure is determined by the sum of the space required to allocate each field within the structure. However, you should never assume that a compiler utilizes the minimum required space to allocate a structure. By default, compilers seek to align structure fields to memory addresses that allow for the most efficient reading and writing of those fields. For example, 4-byte integer fields will be aligned to offsets that are divisible by 4, while 8-byte doubles will be aligned to offsets that are divisible by 8. Depending on the composition of the structure, meeting alignment requirements may require the insertion of padding bytes, causing the actual size of a structure to be larger than the sum of its component fields. The default offsets and resulting structure size for the example structure shown previously can be seen in the Default offset column.
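The running-total procedure can be sketched as a small layout calculator. This is an illustrative model rather than any particular compiler’s algorithm; the names align_up and layout_struct are hypothetical, and the per-field alignments are supplied by the caller.

```c
#include <stddef.h>

/* Round `offset` up to the next multiple of `align` (align > 0). */
static size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) / align * align;
}

/* Compute field offsets for `count` fields, each aligned to aligns[i].
   Writes the offsets to `offsets` and returns the total structure size,
   padded to the largest alignment seen. */
size_t layout_struct(const size_t *sizes, const size_t *aligns,
                     size_t count, size_t *offsets) {
    size_t offset = 0;
    size_t max_align = 1;
    for (size_t i = 0; i < count; i++) {
        offset = align_up(offset, aligns[i]);  /* insert padding if needed */
        offsets[i] = offset;
        offset += sizes[i];
        if (aligns[i] > max_align) {
            max_align = aligns[i];
        }
    }
    return align_up(offset, max_align);
}
```

Feeding in the field sizes {4, 2, 1, 4, 8} with alignments equal to the sizes reproduces the Default offset column (0, 4, 6, 8, 16, total 24). Using an alignment of 1 for every field reproduces the Minimum offset column (0, 4, 6, 7, 11, total 19).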

Structures can be packed into the minimum required space by using compiler options to request specific member alignments. Microsoft Visual C/C++ and GNU gcc/g++ both recognize the pack pragma as a means of controlling structure field alignment. The GNU compilers additionally recognize the packed attribute as a means of controlling structure alignment on a per-structure basis. Requesting 1-byte alignment for structure fields causes compilers to squeeze the structure into the minimum required space. For our example structure, this yields the offsets and structure size found in the Minimum offset column. Note that some CPUs perform better when data is aligned according to its type, while other CPUs may generate exceptions if data is not aligned on specific boundaries.
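As a hedged example of the pack pragma, the following declares a 1-byte-aligned copy of the example structure and reports the offsets the compiler actually chose. The structure name ch8_struct_packed and the three helper functions are hypothetical, and the expected values assume a 4-byte int, a 2-byte short, and an 8-byte double.

```c
#include <stddef.h>

#pragma pack(push, 1)   /* request 1-byte member alignment */
struct ch8_struct_packed {
    int field1;
    short field2;
    char field3;
    int field4;
    double field5;
};
#pragma pack(pop)       /* restore the previous alignment setting */

/* Offsets and size as the compiler actually laid them out. */
size_t packed_field4_offset(void) {
    return offsetof(struct ch8_struct_packed, field4);
}

size_t packed_field5_offset(void) {
    return offsetof(struct ch8_struct_packed, field5);
}

size_t packed_size(void) {
    return sizeof(struct ch8_struct_packed);
}
```

Under those assumptions the helpers report 7, 11, and 19, matching the Minimum offset column.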

With these facts in mind, we can begin our look at how structures are treated in compiled code. For the sake of comparison, it is worth observing that, as with arrays, access to structure members is performed by adding the base address of the structure to the offset of the desired member. However, while array offsets can be computed at runtime from a provided index value (because each item in an array has the same size), structure offsets must be precomputed and will turn up in compiled code as fixed offsets into the structure, looking nearly identical to array references that make use of constant indexes.

Globally Allocated Structures

As with globally allocated arrays, the address of globally allocated structures is known at compile time. This allows the compiler to compute the address of each member of the structure at compile time and eliminates the need to do any math at runtime. Consider the following program that accesses a globally allocated structure:

struct ch8_struct global_struct;

int main() {
   global_struct.field1 = 10;
   global_struct.field2 = 20;
   global_struct.field3 = 30;
   global_struct.field4 = 40;
   global_struct.field5 = 50.0;
}

If this program is compiled with default structure alignment options, we can expect to see something like the following when we disassemble it:

.text:00401000 _main           proc near
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 mov     dword_40EA60, 10
.text:0040100D                 mov     word_40EA64, 20
.text:00401016                 mov     byte_40EA66, 30
.text:0040101D                 mov     dword_40EA68, 40
.text:00401027                 fld     ds:dbl_40B128
.text:0040102D                 fstp    dbl_40EA70
.text:00401033                 xor     eax, eax
.text:00401035                 pop     ebp
.text:00401036                 retn
.text:00401036 _main           endp

This disassembly contains no math whatsoever to access the members of the structure, and, in the absence of source code, it would not be possible to state with any certainty that a structure is being used at all. Because the compiler has performed all of the offset computations at compile time, this program appears to reference five global variables rather than five fields within a single structure. You should be able to note the similarities with the previous example regarding globally allocated arrays using constant index values.
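To see how the five apparent globals relate to a single structure, the following sketch adds the default-alignment field offsets to the structure’s base address. The names ch8_offsets and member_address are hypothetical; the base address 0x40EA60 is taken from the listing above.

```c
#include <stdint.h>

/* Default-alignment offsets of field1..field5 in ch8_struct. */
static const uint32_t ch8_offsets[5] = {0, 4, 6, 8, 16};

/* Address the compiler would embed for field number `field` (1..5) of a
   structure located at `base`. */
uint32_t member_address(uint32_t base, int field) {
    return base + ch8_offsets[field - 1];
}
```

Starting from 0x40EA60, the five results are 0x40EA60, 0x40EA64, 0x40EA66, 0x40EA68, and 0x40EA70, which are exactly the addresses behind dword_40EA60, word_40EA64, byte_40EA66, dword_40EA68, and dbl_40EA70.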

Stack-Allocated Structures

Like stack-allocated arrays (see Stack-Allocated Arrays), stack-allocated structures are equally difficult to recognize based on stack layout alone. Modifying the preceding program to use a stack-allocated structure, declared in main, yields the following disassembly:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 var_18          = dword ptr -18h
.text:00401000 var_14          = word ptr -14h
.text:00401000 var_12          = byte ptr -12h
.text:00401000 var_10          = dword ptr -10h
.text:00401000 var_8           = qword ptr -8
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 sub     esp, 18h
.text:00401006                 mov     [ebp+var_18], 10
.text:0040100D                 mov     [ebp+var_14], 20
.text:00401013                 mov     [ebp+var_12], 30
.text:00401017                 mov     [ebp+var_10], 40
.text:0040101E                 fld     ds:dbl_40B128
.text:00401024                 fstp    [ebp+var_8]
.text:00401027                 xor     eax, eax
.text:00401029                 mov     esp, ebp
.text:0040102B                 pop     ebp
.text:0040102C                 retn
.text:0040102C _main           endp

Again, no math is performed to access the structure’s fields since the compiler can determine the relative offsets for each field within the stack frame at compile time. In this case, we are left with the same, potentially misleading picture that five individual variables are being used rather than a single variable that happens to contain five distinct fields. In reality, var_18 should be the start of a 24-byte structure, and each of the other variables should somehow be formatted to reflect the fact that they are fields within the structure.

Heap-Allocated Structures

Heap-allocated structures turn out to be much more revealing regarding the size of the structure and the layout of its fields. When a structure is allocated in the program heap, the compiler has no choice but to generate code to compute the proper offset into the structure whenever a field is accessed. This is a result of the structure’s address being unknown at compile time. For globally allocated structures, the compiler is able to compute a fixed starting address. For stack-allocated structures, the compiler can compute a fixed relationship between the start of the structure and the frame pointer for the enclosing stack frame. When a structure has been allocated in the heap, the only reference to the structure available to the compiler is the pointer to the structure’s starting address.
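In C terms, a field access through a pointer amounts to adding the field’s offset to the pointer and reading or writing at the resulting address. The sketch below spells that out for field4; the function name set_field4_by_offset is hypothetical, and offsetof is used so that the sketch does not hard-code any particular alignment.

```c
#include <stddef.h>
#include <string.h>

struct ch8_struct {
    int field1;
    short field2;
    char field3;
    int field4;
    double field5;
};

/* Store to field4 the way compiled code does: start from the pointer,
   add the field's offset, and write through the resulting address. */
void set_field4_by_offset(struct ch8_struct *p, int value) {
    char *base = (char *)p;
    memcpy(base + offsetof(struct ch8_struct, field4), &value, sizeof value);
}
```

Calling set_field4_by_offset(p, 40) has the same effect as p->field4 = 40. In a disassembly only the pointer-plus-offset form survives.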

Modifying our structure example once again to make use of a heap-allocated structure results in the following disassembly. Similar to the heap-allocated array example (see Heap-Allocated Arrays), we declare a pointer within main and assign it the address of a block of memory large enough to hold our structure:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 heap_struct     = dword ptr -4
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 push    ecx
.text:00401004                 push    24              ; size_t
.text:00401006                 call    _malloc
.text:0040100B                 add     esp, 4
.text:0040100E                 mov     [ebp+heap_struct], eax
.text:00401011                 mov     eax, [ebp+heap_struct]
.text:00401014                 mov     dword ptr [eax], 10
.text:0040101A                 mov     ecx, [ebp+heap_struct]
.text:0040101D                 mov     word ptr [ecx+4], 20
.text:00401023                 mov     edx, [ebp+heap_struct]
.text:00401026                 mov     byte ptr [edx+6], 30
.text:0040102A                 mov     eax, [ebp+heap_struct]
.text:0040102D                 mov     dword ptr [eax+8], 40
.text:00401034                 mov     ecx, [ebp+heap_struct]
.text:00401037                 fld     ds:dbl_40B128
.text:0040103D                 fstp    qword ptr [ecx+10h]
.text:00401040                 xor     eax, eax
.text:00401042                 mov     esp, ebp
.text:00401044                 pop     ebp
.text:00401045                 retn
.text:00401045 _main           endp

In this example, unlike the global and stack-allocated structure examples, we are able to discern the exact size and layout of the structure. The structure size can be inferred to be 24 bytes based on the amount of memory requested from malloc at 00401004. The structure contains the following fields at the indicated offsets:

  • A 4-byte (dword) field at offset 0

  • A 2-byte (word) field at offset 4

  • A 1-byte field at offset 6

  • A 4-byte (dword) field at offset 8

  • An 8-byte (qword) field at offset 16 (10h)

Based on the use of floating point instructions, we can further deduce that the qword field is actually a double. The same program compiled to pack structures with a 1-byte alignment yields the following disassembly:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 heap_struct     = dword ptr -4
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 push    ecx
.text:00401004                 push    19              ; size_t
.text:00401006                 call    _malloc
.text:0040100B                 add     esp, 4
.text:0040100E                 mov     [ebp+heap_struct], eax
.text:00401011                 mov     eax, [ebp+heap_struct]
.text:00401014                 mov     dword ptr [eax], 10
.text:0040101A                 mov     ecx, [ebp+heap_struct]
.text:0040101D                 mov     word ptr [ecx+4], 20
.text:00401023                 mov     edx, [ebp+heap_struct]
.text:00401026                 mov     byte ptr [edx+6], 30
.text:0040102A                 mov     eax, [ebp+heap_struct]
.text:0040102D                 mov     dword ptr [eax+7], 40
.text:00401034                 mov     ecx, [ebp+heap_struct]
.text:00401037                 fld     ds:dbl_40B128
.text:0040103D                 fstp    qword ptr [ecx+0Bh]
.text:00401040                 xor     eax, eax
.text:00401042                 mov     esp, ebp
.text:00401044                 pop     ebp
.text:00401045                 retn
.text:00401045 _main           endp

The only changes to the program are the smaller size of the structure (now 19 bytes) and the adjusted offsets to account for the realignment of each structure field.

Regardless of the alignment used when compiling a program, finding structures allocated and manipulated in the program heap is the fastest way to determine the size and layout of a given data structure. However, keep in mind that many functions will not do you the favor of immediately accessing every member of a structure to help you understand the structure’s layout. Instead, you may need to follow the use of the pointer to the structure and make note of the offsets used whenever that pointer is dereferenced. In this manner, you will eventually be able to piece together the complete layout of the structure.
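Once the offsets have been collected, they can be written down as a C declaration. The sketch below records the layout recovered from the default-alignment listing; every name in it is a placeholder, and the padding is spelled out explicitly so that the offsets do not depend on the alignment rules of whichever compiler builds it. It assumes an 8-byte double.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout recovered from the default-alignment listing. Field names are
   placeholders derived from the observed offsets. */
struct recovered_struct {
    int32_t field_0;      /* dword ptr [reg]      */
    int16_t field_4;      /* word ptr [reg+4]     */
    int8_t  field_6;      /* byte ptr [reg+6]     */
    uint8_t pad_7;        /* never referenced     */
    int32_t field_8;      /* dword ptr [reg+8]    */
    uint8_t pad_C[4];     /* never referenced     */
    double  field_10;     /* qword ptr [reg+10h]  */
};

/* Returns 1 when the declaration matches every observed offset and the
   24-byte allocation size, 0 otherwise. */
int recovered_layout_matches(void) {
    return offsetof(struct recovered_struct, field_4)  == 4
        && offsetof(struct recovered_struct, field_6)  == 6
        && offsetof(struct recovered_struct, field_8)  == 8
        && offsetof(struct recovered_struct, field_10) == 16
        && sizeof(struct recovered_struct) == 24;
}
```

Bytes that are never referenced are kept as explicit padding members. They may be compiler-inserted padding, or fields that the observed code simply never touches.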

Arrays of Structures

Some programmers would say that the beauty of composite data structures is that they allow you to build arbitrarily complex structures by nesting smaller structures within larger structures. Among other possibilities, this capability allows for arrays of structures, structures within structures, and structures that contain arrays as members. The preceding discussions regarding arrays and structures apply just as well when dealing with nested types such as these. As an example, consider an array of structures like the following simple program in which heap_struct points to an array of five ch8_struct items:

#include <stdlib.h>

/* struct ch8_struct is the structure defined earlier in this chapter */

int main() {
   int idx = 1;
   struct ch8_struct *heap_struct;
   heap_struct = (struct ch8_struct*)malloc(sizeof(struct ch8_struct) * 5);
   heap_struct[idx].field1 = 10;
}

The operations required to access field1 in the final assignment include multiplying the index value by the size of an array element, in this case the size of the structure, and then adding the offset to the desired field. The corresponding disassembly is shown here:

.text:00401000 _main           proc near
.text:00401000
.text:00401000 idx             = dword ptr -8
.text:00401000 heap_struct     = dword ptr -4
.text:00401000
.text:00401000                 push    ebp
.text:00401001                 mov     ebp, esp
.text:00401003                 sub     esp, 8
.text:00401006                 mov     [ebp+idx], 1
.text:0040100D                 push    120             ; size_t
.text:0040100F                 call    _malloc
.text:00401014                 add     esp, 4
.text:00401017                 mov     [ebp+heap_struct], eax
.text:0040101A                 mov     eax, [ebp+idx]
.text:0040101D                 imul    eax, 24
.text:00401020                 mov     ecx, [ebp+heap_struct]
.text:00401023                 mov     dword ptr [ecx+eax], 10
.text:0040102A                 xor     eax, eax
.text:0040102C                 mov     esp, ebp
.text:0040102E                 pop     ebp
.text:0040102F                 retn
.text:0040102F _main           endp

The disassembly reveals 120 bytes being requested from the heap (at 0040100D). The array index is multiplied by 24 at 0040101D before being added to the start address for the array at 00401023. No additional offset is required in order to generate the final address for the reference at 00401023. From these facts we can deduce the size of an array item (24), the number of items in the array (120 / 24 = 5), and the fact that there is a 4-byte (dword) field at offset 0 within each array element. This short listing does not offer enough information to draw any conclusions about how the remaining 20 bytes within each structure are allocated to additional fields.
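The address computation for an array of structures combines both earlier patterns: a scaled index selects the element, and a fixed offset selects the field. A minimal sketch follows; the function name struct_array_field is hypothetical, and the base address used in the usage note is made up, since the real value is whatever malloc returns at runtime.

```c
#include <stdint.h>

/* Address of a field inside element `index` of an array of structures:
   base + index * element size + field offset. */
uint32_t struct_array_field(uint32_t base, uint32_t index,
                            uint32_t elem_size, uint32_t field_off) {
    return base + index * elem_size + field_off;
}
```

With a made-up base of 0x1000, struct_array_field(0x1000, 1, 24, 0) gives 0x1018 for heap_struct[1].field1. Because field1 sits at offset 0, no extra displacement appears in the listing; an access to a field at a nonzero offset would presumably carry that offset as an additional constant in the memory operand.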



[44] Holding the mouse over any name in the IDA display causes a tool tip–style pop-up window to be displayed that shows up to 10 lines of disassembly at the target location. In the case of library function names, this often includes the prototype for calling the library function.
