C++ Reversing Primer

C++ classes are the object-oriented extensions of C structs, so it is somewhat logical to wrap up our discussion of data structures with a review of the features of compiled C++ code. C++ is sufficiently complex that detailed coverage of the topic is beyond the scope of this book. Here we attempt to cover the highlights and a few of the differences between Microsoft’s Visual C++ and GNU’s g++.

An important point to remember is that a solid, fundamental understanding of the C++ language will assist you greatly in understanding compiled C++. Object-oriented concepts such as inheritance and polymorphism are difficult enough to learn well at the source level. Attempting to dive into these concepts at the assembly level without understanding them at the source level will certainly be an exercise in frustration.

The this Pointer

The this pointer is a pointer available in all nonstatic C++ member functions. Whenever such a function is called, this is initialized to point to the object used to invoke the function. Consider the following function calls:

//object1, object2, and *p_obj are all the same type.
object1.member_func();
object2.member_func();
p_obj->member_func();

In the three calls to member_func, this takes on the values &object1, &object2, and p_obj, respectively. It is easiest to view this as a hidden first parameter passed in to all nonstatic member functions. As discussed in Chapter 6, Microsoft Visual C++ utilizes the thiscall calling convention and passes this in the ECX register. The GNU g++ compiler treats this exactly as if it were the first (leftmost) parameter to nonstatic member functions and pushes the address of the object used to invoke the function as the topmost item on the stack prior to calling the function.
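The hidden-parameter view can be sketched at the source level. The Counter class and the free function add_explicit below are hypothetical names invented for illustration; the sketch models only the idea of an implicit first argument and says nothing about which register or stack slot a given compiler uses.

```cpp
#include <cassert>

// Hypothetical class used only for illustration.
struct Counter {
    int total;
    // Nonstatic member: the compiler supplies `this` implicitly.
    int add(int n) { total += n; return total; }
    // Reports the address the call was made through.
    const Counter *self() const { return this; }
};

// Roughly what add() looks like once `this` is written out as an
// explicit first parameter.
int add_explicit(Counter *self, int n) {
    self->total += n;
    return self->total;
}

// Exercises both forms; returns true if they behave identically.
bool this_demo() {
    Counter object1{0};
    Counter object2{0};
    Counter *p_obj = &object2;
    assert(object1.self() == &object1);  // this == &object1
    assert(p_obj->self() == p_obj);      // this == p_obj
    int a = object1.add(5);              // implicit this
    int b = add_explicit(p_obj, 5);      // explicit equivalent
    return a == b;
}
```

Calling this_demo() returns true: the member call and the explicit-parameter call update their objects in the same way.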

From a reverse engineering point of view, the moving of an address into the ECX register immediately prior to a function call is a probable indicator of two things. First, the file was compiled using Visual C++. Second, the function is a member function. When the same address is passed to two or more functions, we can conclude that those functions all belong to the same class hierarchy.

Within a function, the use of ECX prior to initializing it implies that the caller must have initialized ECX and is a possible sign that the function is a member function (though the function may simply use the fastcall calling convention). Further, when a member function is observed to pass this to additional functions, those functions can be inferred to be members of the same class as well.

For code compiled using g++, calls to member functions stand out somewhat less. However, any function that does not take a pointer as its first argument can certainly be ruled out as a nonstatic member function.

Virtual Functions and Vtables

Virtual functions provide the means for polymorphic behavior in C++ programs. For each class (or subclass through inheritance) that contains virtual functions, the compiler generates a table containing pointers to each virtual function in the class. Such tables are called vtables. Furthermore, every class that contains virtual functions is given an additional data member whose purpose is to point to the appropriate vtable at runtime. This member is typically referred to as a vtable pointer and is allocated as the first data member within the class. When an object is created at runtime, its vtable pointer is set to point at the appropriate vtable. When that object invokes a virtual function, the correct function is selected by performing a lookup in the object’s vtable. Thus, vtables are the underlying mechanism that facilitates runtime resolution of calls to virtual functions.

A few examples may help to clarify the use of vtables. Consider the following C++ class definitions:

class BaseClass {
public:
   BaseClass();
   virtual void vfunc1() = 0;
   virtual void vfunc2();
   virtual void vfunc3();
   virtual void vfunc4();
private:
   int x;
   int y;
};

class SubClass : public BaseClass {
public:
   SubClass();
   virtual void vfunc1();
   virtual void vfunc3();
   virtual void vfunc5();
private:
   int z;
};

In this case, SubClass inherits from BaseClass. BaseClass contains four virtual functions, while SubClass contains five (four from BaseClass plus the new vfunc5). Within BaseClass, vfunc1 is a pure virtual function by virtue of the use of = 0 in its declaration. Pure virtual functions have no implementation in their declaring class and must be overridden in a subclass before the class is considered concrete. In other words, there is no function named BaseClass::vfunc1, and until a subclass provides an implementation, no objects can be instantiated. SubClass provides such an implementation, so SubClass objects can be created.
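The abstract-versus-concrete distinction can be checked at compile time. The following is a minimal sketch using trimmed-down versions of the two classes; the virtual destructor, the int return type, and the helper function are additions made for this example and are not part of the original definitions.

```cpp
#include <cassert>
#include <type_traits>

// Trimmed-down versions of the classes above (assumed for illustration).
class BaseClass {
public:
    virtual ~BaseClass() {}
    virtual int vfunc1() = 0;              // pure virtual: no body in BaseClass
    virtual int vfunc2() { return 2; }
};

class SubClass : public BaseClass {
public:
    int vfunc1() override { return 1; }    // supplies the missing implementation
};

static_assert(std::is_abstract<BaseClass>::value, "BaseClass is abstract");
static_assert(!std::is_abstract<SubClass>::value, "SubClass is concrete");

// Creates a SubClass and calls the overridden function through a base pointer.
int call_through_base() {
    SubClass s;            // allowed: SubClass is concrete
    // BaseClass b;        // would not compile: BaseClass is abstract
    BaseClass *p = &s;
    int result = p->vfunc1();
    assert(result == 1);
    return result;
}
```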

At first glance BaseClass appears to contain two data members and SubClass three data members. Recall, however, that any class that contains virtual functions, either explicitly or because they are inherited, also contains a vtable pointer. As a result, instantiated BaseClass objects actually have three data members, while instantiated SubClass objects have four data members. In each case, the first data member is the vtable pointer. Within SubClass, the vtable pointer is actually inherited from BaseClass rather than being introduced specifically for SubClass. Figure 8-14 shows a simplified memory layout in which a single SubClass object has been dynamically allocated. During the creation of the object, the compiler ensures that the new object’s vtable pointer points to the correct vtable (SubClass’s in this case).

Figure 8-14. A simple vtable layout

Note that the vtable for SubClass contains two pointers to functions belonging to BaseClass (BaseClass::vfunc2 and BaseClass::vfunc4). This is because SubClass does not override either of these functions and instead inherits them from BaseClass. Also shown is the typical handling of pure virtual function entries. Because there is no implementation for the pure virtual function BaseClass::vfunc1, no address is available to store in the BaseClass vtable slot for vfunc1. In such cases, compilers insert the address of an error handling function, often dubbed purecall, which in theory should never be called but which will usually abort the program in the event that it somehow is called.
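The two tables described above can be modeled by hand with arrays of function pointers. This is only a sketch: the Object struct, the function names, the return values, and the purecall stand-in are all hypothetical, and real compilers emit these tables themselves.

```cpp
#include <cstdlib>

struct Object;                       // forward declaration
using VFunc = int (*)(Object *);     // every slot takes an explicit `this`

struct Object {
    const VFunc *vptr;               // vtable pointer: first data member
    int x, y;
};

// Stand-in for the compiler's purecall handler; it should never run.
int purecall(Object *) { std::abort(); }

int Base_vfunc2(Object *) { return 2; }
int Base_vfunc3(Object *) { return 3; }
int Base_vfunc4(Object *) { return 4; }
int Sub_vfunc1(Object *)  { return 101; }
int Sub_vfunc3(Object *)  { return 103; }
int Sub_vfunc5(Object *)  { return 105; }

// One table per class, indexed by slot number.
const VFunc BaseClass_vtable[4] = { purecall, Base_vfunc2, Base_vfunc3, Base_vfunc4 };
const VFunc SubClass_vtable[5]  = { Sub_vfunc1, Base_vfunc2, Sub_vfunc3, Base_vfunc4, Sub_vfunc5 };

// Counts the slots in which both tables hold the same address.
int shared_slots() {
    int n = 0;
    for (int i = 0; i < 4; ++i)
        if (BaseClass_vtable[i] == SubClass_vtable[i]) ++n;
    return n;
}
```

Here shared_slots() returns 2, matching the two inherited entries (vfunc2 and vfunc4), while slot 0 of the BaseClass table holds the purecall stand-in.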

One consequence of the presence of a vtable pointer is that you must account for it when you manipulate the class within IDA. Recall that C++ classes are an extension of C structures. Therefore, you may choose to make use of IDA’s structure definition features to define the layout of C++ classes. In the case of classes that contain virtual functions, you must remember to include a vtable pointer as the first field within the class. Vtable pointers must also be accounted for in the total size of an object. This is most apparent when observing the dynamic allocation of an object using the new[48] operator, where the size value passed to new includes the space consumed by all explicitly declared fields in the class (and any superclasses) as well as any space required for a vtable pointer.
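The size cost of the vtable pointer is easy to observe with sizeof. The two structs below are hypothetical and differ only in the presence of a virtual function; the exact numbers depend on pointer size and alignment, so this sketch reports only the difference.

```cpp
#include <cstddef>

struct PlainPair {        // no virtual functions: just two ints
    int x;
    int y;
};

struct VirtualPair {      // same fields plus one virtual function
    virtual void vfunc() {}
    int x;
    int y;
};

// Bytes added by making the class polymorphic: the hidden vtable
// pointer plus any alignment padding the compiler inserts.
std::size_t vptr_overhead() {
    return sizeof(VirtualPair) - sizeof(PlainPair);
}
```

On a typical 32-bit build with 4-byte ints and pointers, the overhead would be 4 bytes, and the size passed to new for a SubClass object would be 16 (the vtable pointer plus x, y, and z).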

In the following example a SubClass object is created dynamically, and its address saved in a BaseClass pointer. The pointer is then passed to a function (call_vfunc), which uses the pointer to call vfunc3.

void call_vfunc(BaseClass *b) {
   b->vfunc3();
}

int main() {
   BaseClass *bc = new SubClass();
   call_vfunc(bc);
}

Since vfunc3 is a virtual function, the compiler must ensure that SubClass::vfunc3 is called in this case because the pointer points to a SubClass object. The following disassembled version of call_vfunc demonstrates how the virtual function call is resolved:

.text:004010A0 call_vfunc      proc near
.text:004010A0
.text:004010A0 b               = dword ptr  8
.text:004010A0
.text:004010A0                 push    ebp
.text:004010A1                 mov     ebp, esp
.text:004010A3                 mov     eax, [ebp+b]
.text:004010A6                 mov     edx, [eax]
.text:004010A8                 mov     ecx, [ebp+b]
.text:004010AB                 mov     eax, [edx+8]
.text:004010AE                 call    eax
.text:004010B0                 pop     ebp
.text:004010B1                 retn
.text:004010B1 call_vfunc      endp

The vtable pointer is read from the structure at 004010A6 and saved in the EDX register. Since the parameter b points to a SubClass object, this will be the address of SubClass’s vtable. At 004010AB, the vtable is indexed to read the third pointer (the address of SubClass::vfunc3 in this case) into the EAX register. Finally, at 004010AE, the virtual function is called.

Note that the vtable indexing operation at 004010AB looks very much like a structure reference operation. In fact, it is no different, and it is possible to define a structure to represent the layout of a class’s vtable and then use the defined structure to make the disassembly more readable, as shown here:

00000000 SubClass_vtable struc ; (sizeof=0x14)
00000000 vfunc1          dd ?
00000004 vfunc2          dd ?
00000008 vfunc3          dd ?
0000000C vfunc4          dd ?
00000010 vfunc5          dd ?
00000014 SubClass_vtable ends

This structure allows the vtable reference operation to be reformatted as follows:

.text:004010AB                 mov     eax, [edx+SubClass_vtable.vfunc3]
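The same lookup can be written out by hand in C++ to show why a vtable reference reads like a structure reference. Every name in this sketch (SubClassModel, SubClassVtableModel, the impl functions) is hypothetical; the comments map each statement to the corresponding instruction in call_vfunc, where the offset 8 assumes 4-byte pointers.

```cpp
#include <cassert>
#include <cstddef>

struct SubClassModel;                           // stand-in for SubClass
using Method = int (*)(SubClassModel *self);

// Mirrors the SubClass_vtable structure: one pointer per virtual function.
struct SubClassVtableModel {
    Method vfunc1, vfunc2, vfunc3, vfunc4, vfunc5;
};

struct SubClassModel {
    const SubClassVtableModel *vptr;            // offset 0: vtable pointer
    int x, y, z;
};

int impl1(SubClassModel *) { return 1; }
int impl2(SubClassModel *) { return 2; }
int impl3(SubClassModel *) { return 3; }
int impl4(SubClassModel *) { return 4; }
int impl5(SubClassModel *) { return 5; }

const SubClassVtableModel model_table = { impl1, impl2, impl3, impl4, impl5 };

// Hand-written equivalent of b->vfunc3().
int call_vfunc_model(SubClassModel *b) {
    const SubClassVtableModel *vt = b->vptr;    // mov edx, [eax]
    Method target = vt->vfunc3;                 // mov eax, [edx+8]
    return target(b);                           // call eax, with this = b
}

int dispatch_demo() {
    SubClassModel obj = { &model_table, 0, 0, 0 };
    int result = call_vfunc_model(&obj);
    assert(result == 3);                        // third slot was selected
    return result;
}
```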

The Object Life Cycle

An understanding of the mechanism by which objects are created and destroyed can help to reveal object hierarchies and nested object relationships as well as quickly identify class constructor and destructor functions.[49]

For global and statically allocated objects, constructors are called during program startup and prior to entry into the main function. Constructors for stack-allocated objects are invoked at the point the object comes into scope within the function in which it is declared. In many cases, this will be immediately upon entry to the function in which it is declared. However, when an object is declared within a block statement, its constructor is not invoked until that block is entered, if it is entered at all. When an object is allocated dynamically in the program heap, its creation is a two-step process. In the first step, the new operator is invoked to allocate the object’s memory. In the second step, the constructor is invoked to initialize the object. A major difference between Microsoft’s Visual C++ and GNU’s g++ is that Visual C++ ensures that the result of new is not null prior to invoking the constructor.
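The two-step nature of heap allocation can be made explicit in source code. The sketch below separates allocation from construction using the nothrow form of operator new and placement new; the Widget class and helper names are hypothetical, and the null check is written by hand here rather than generated by the compiler.

```cpp
#include <cassert>
#include <new>

struct Widget {                 // hypothetical class
    int value;
    Widget() : value(42) {}
};

// Spells out what `new Widget()` does as two explicit steps.
Widget *make_widget() {
    void *raw = ::operator new(sizeof(Widget), std::nothrow);  // step 1: allocate
    if (raw == nullptr)         // check the result before constructing
        return nullptr;
    return new (raw) Widget();  // step 2: run the constructor in place
}

// The matching teardown: destructor first, then release the memory.
void destroy_widget(Widget *w) {
    if (w == nullptr) return;
    w->~Widget();
    ::operator delete(w);
}

int new_demo() {
    Widget *w = make_widget();
    assert(w != nullptr);
    int v = w->value;           // set by the constructor in step 2
    destroy_widget(w);
    return v;
}
```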

When a constructor executes, the following sequence of actions takes place:

  1. If the class has a superclass, the superclass constructor is invoked.

  2. If the class has any virtual functions, the vtable pointer is initialized to point to the class’s vtable. Note that this may overwrite a vtable pointer that was initialized in the superclass, which is exactly the desired behavior.

  3. If the class has any data members that are themselves objects, then the constructor for each such data member is invoked.

  4. Finally, the code-specific constructor is executed. This is the code representing the C++ behavior of the constructor specified by the programmer.
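The observable parts of this sequence can be logged from source code. In the sketch below, all class names are hypothetical; the virtual call made inside each constructor shows which vtable is in effect at that moment. The language fixes the order superclass, then members, then constructor body; exactly where the compiler places the vtable pointer store is not visible from this example.

```cpp
#include <string>

std::string ctor_log;   // records the order of events

struct Member {
    Member() { ctor_log += "member;"; }
};

struct Base {
    Base() {
        ctor_log += "base;";
        // The vtable pointer refers to Base's vtable here, so this
        // virtual call resolves to Base::name, not Derived::name.
        ctor_log += name();
    }
    virtual ~Base() {}
    virtual const char *name() { return "as-Base;"; }
};

struct Derived : Base {
    Member m;
    Derived() {
        ctor_log += "derived-body;";
        ctor_log += name();   // vtable pointer now refers to Derived's vtable
    }
    const char *name() override { return "as-Derived;"; }
};

// Constructs one Derived object and returns the recorded sequence.
std::string construction_order() {
    ctor_log.clear();
    Derived d;
    return ctor_log;
}
```

construction_order() returns "base;as-Base;member;derived-body;as-Derived;".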

Constructors do not specify a return type; however, constructors generated by Microsoft Visual C++ actually return this in the EAX register. Regardless, this is a Visual C++ implementation detail and does not permit C++ programmers to access the returned value.

Destructors are called in essentially the reverse order. For global and static objects, destructors are called by cleanup code that is executed after the main function terminates. Destructors for stack-allocated objects are invoked as the objects go out of scope. Destructors for heap-allocated objects are invoked via the delete operator immediately before the memory allocated to the object is released.

The actions performed by destructors mimic those performed by constructors, with the exception that they are performed in roughly reverse order.

  1. If the class has any virtual functions, the vtable pointer for the object is restored to point to the vtable for the associated class. This is required in case a subclass had overwritten the vtable pointer as part of its creation process.

  2. The programmer-specified code for the destructor executes.

  3. If the class has any data members that are themselves objects, the destructor for each such member is executed.

  4. Finally, if the object has a superclass, the superclass destructor is called.
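The destructor sequence can be logged in the same way. All names in this sketch are hypothetical; the virtual call inside the superclass destructor resolves to the superclass version, which is the observable effect of the vtable pointer being restored.

```cpp
#include <string>

std::string dtor_log;   // records the order of events

struct Part {
    ~Part() { dtor_log += "member;"; }
};

struct Parent {
    virtual ~Parent() {
        dtor_log += "base;";
        dtor_log += name();   // resolves to Parent::name at this point
    }
    virtual const char *name() { return "as-Base;"; }
};

struct Child : Parent {
    Part p;
    ~Child() override {
        dtor_log += "derived-body;";
        dtor_log += name();   // still Child::name inside Child's destructor
    }
    const char *name() override { return "as-Derived;"; }
};

// Destroys one Child object and returns the recorded sequence.
std::string destruction_order() {
    dtor_log.clear();
    {
        Child c;              // destroyed at the end of this block
    }
    return dtor_log;
}
```

destruction_order() returns "derived-body;as-Derived;member;base;as-Base;".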

By understanding when superclass constructors and destructors are called, it is possible to trace an object’s inheritance hierarchy through the chain of calls to its related superclass functions. A final point regarding vtables relates to how they are referenced within programs. There are only two circumstances in which a class’s vtable is referenced directly: within the class constructor(s) and within the destructor. When you locate a vtable, you can utilize IDA’s data cross-referencing capabilities (see Chapter 9) to quickly locate all constructors and destructors for the associated class.

Name Mangling

Also called name decoration, name mangling is the mechanism C++ compilers use to distinguish among overloaded[50] versions of a function. In order to generate unique names for overloaded functions, compilers decorate the function name with additional characters used to encode various pieces of information about the function. Encoded information typically describes the return type of the function, the class to which the function belongs, and the parameter sequence (type and order) required to call the function.
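For names produced by g++ 3.x and later, the runtime library itself can reverse the encoding. The sketch below assumes a GCC-compatible toolchain that provides abi::__cxa_demangle in <cxxabi.h>; the demangle wrapper is a hypothetical helper, and on other compilers it simply returns its input unchanged. It does not handle Visual C++ names.

```cpp
#include <cstdlib>
#include <string>
#if defined(__GNUC__)
#include <cxxabi.h>
#endif

// Returns the demangled form of a g++-style mangled name, or the
// input unchanged if demangling is unavailable or fails.
std::string demangle(const char *mangled) {
#if defined(__GNUC__)
    int status = 0;
    char *out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status == 0 && out != nullptr) {
        std::string result(out);
        std::free(out);       // the returned buffer is malloc-allocated
        return result;
    }
    std::free(out);           // harmless when out is null
#endif
    return mangled;
}
```

With this helper, demangle("_ZN8SubClass6vfunc3Ev") yields "SubClass::vfunc3()" and demangle("_Z3fooid") yields "foo(int, double)", showing how the class name and parameter types are recovered from the decorated name.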

Name mangling is a compiler implementation detail for C++ programs and as such is not part of the C++ language specification. Not unexpectedly, compiler vendors have developed their own, often-incompatible conventions for name mangling. Fortunately, IDA understands the name-mangling conventions employed by Microsoft Visual C++ and GNU g++ as well as a few other compilers. By default, when a mangled name is encountered within a program, IDA displays the demangled equivalent as a comment anywhere the name appears in the disassembly. IDA’s name-demangling options are selected using the dialog shown in Figure 8-15, which is accessed using Options ▸ Demangled Names.

Figure 8-15. Demangled name display options

The three principal options control whether demangled names are displayed as comments, whether the names themselves are demangled, or whether no demangling is performed at all. Displaying demangled names as comments results in a display similar to the following:

.text:00401050 ; protected: __thiscall SubClass::SubClass(void)
.text:00401050 ??0SubClass@@IAE@XZ  proc near
...
.text:004010DC                 call    ??0SubClass@@IAE@XZ  ; SubClass::SubClass(void)

Likewise, displaying demangled names as names results in the following:

.text:00401050 protected: __thiscall SubClass::SubClass(void) proc near
...
.text:004010DC                 call    SubClass::SubClass(void)

where the line at 00401050 is representative of the first line of a disassembled function and the line at 004010DC is representative of a call to that function.

The Assume GCC v3.x names checkbox is used to distinguish between the mangling scheme used in g++ version 2.9.x and that used in g++ versions 3.x and later. Under normal circumstances, IDA should automatically detect the naming conventions in use in g++-compiled code. The Setup short names and Setup long names buttons offer fine-grained control over the formatting of demangled names with a substantial number of options that are documented in IDA’s help system.

Because mangled names carry so much information regarding the signature of each function, they reduce the time required to understand the number and types of parameters passed into a function. When mangled names are available within a binary, IDA’s demangling capability instantly reveals the parameter types and return types for all functions whose names are mangled. In contrast, for any function that does not utilize a mangled name, you must conduct time-consuming analysis of the data flowing into and out of the function in order to determine the signature of the function.

Runtime Type Identification

C++ provides operators that allow for runtime determination (typeid) and checking (dynamic_cast) of an object’s datatype. To facilitate these operations, C++ compilers must embed type information within a program binary and implement procedures whereby the type of a polymorphic object can be determined with certainty regardless of the type of the pointer that may be dereferenced to access the object. Unfortunately, as with name mangling, Runtime Type Identification (RTTI) is a compiler implementation detail rather than a language issue, and there is no standard means by which compilers implement RTTI capabilities.

We will take a brief look at the similarities and differences between the RTTI implementations of Microsoft Visual C++ and GNU g++. Specifically, the only details presented here concern how to locate RTTI information and, from there, how to learn the name of the class to which that information pertains. Readers desiring more detailed discussion of Microsoft’s RTTI implementation should consult the references listed at the end of this chapter. In particular, the references detail how to traverse a class’s inheritance hierarchy, including how to trace that hierarchy when multiple inheritance is being used.

Consider the following simple program, which makes use of polymorphism:

#include <iostream>
#include <typeinfo>
using namespace std;

class abstract_class {
public:
   virtual int vfunc() = 0;
};

class concrete_class : public abstract_class {
public:
   concrete_class();
   int vfunc();
};

void print_type(abstract_class *p) {
   cout << typeid(*p).name() << endl;
}

int main() {
   abstract_class *sc = new concrete_class();
   print_type(sc);
}

The print_type function must correctly print the type of the object being pointed to by the pointer p. In this case, it is trivial to realize that “concrete_class” must be printed based on the fact that a concrete_class object is created in the main function. The question we answer here is: How does print_type, and more specifically typeid, know what type of object p is pointing to?

The answer is surprisingly simple. Since every polymorphic object contains a pointer to a vtable, compilers leverage that fact by co-locating class-type information with the class vtable. Specifically, the compiler places a pointer immediately prior to the class vtable. This pointer points to a structure that contains information used to determine the name of the class that owns the vtable. In g++ code, this pointer points to a type_info structure, which contains a pointer to the name of the class. In Visual C++, the pointer points to a Microsoft RTTICompleteObjectLocator structure, which in turn contains a pointer to a TypeDescriptor structure. The TypeDescriptor structure contains a character array that specifies the name of the polymorphic class.
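Both halves of this explanation can be probed from source code. The sketch below first uses typeid in the portable way, then peeks at the slot just before the vtable. The second part is an assumption-laden illustration: it relies on the object layout used by g++ (vtable pointer at offset 0, type_info pointer immediately before the address the vtable pointer holds) and is compiled only for GCC-compatible, non-Windows toolchains. The virtual destructor and helper function names are additions for this example.

```cpp
#include <cstring>
#include <typeinfo>

class abstract_class {
public:
    virtual ~abstract_class() {}
    virtual int vfunc() = 0;
};

class concrete_class : public abstract_class {
public:
    int vfunc() override { return 0; }
};

// Portable check: typeid sees through the base-class pointer.
bool typeid_finds_dynamic_type(abstract_class *p) {
    return typeid(*p) == typeid(concrete_class);
}

// Non-portable peek, assuming the g++ layout described above.
bool slot_before_vtable_is_type_info(abstract_class *p) {
#if defined(__GNUC__) && !defined(_WIN32)
    const void *const *vtable;
    std::memcpy(&vtable, p, sizeof vtable);   // read the vtable pointer
    const std::type_info *ti =
        static_cast<const std::type_info *>(vtable[-1]);
    return *ti == typeid(concrete_class);
#else
    (void)p;
    return true;   // different layout (for example Visual C++): not checked
#endif
}

bool rtti_demo() {
    concrete_class obj;
    return typeid_finds_dynamic_type(&obj) &&
           slot_before_vtable_is_type_info(&obj);
}
```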

It is important to realize that RTTI information is required only in C++ programs that use the typeid or dynamic_cast operator. Most compilers provide options to disable the generation of RTTI in binaries that do not require it; therefore, you should not be surprised if RTTI information ever happens to be missing.

Inheritance Relationships

If you dig deep enough into some RTTI implementations, you will find that it is possible to unravel inheritance relationships, though you must understand the compiler’s particular implementation of RTTI in order to do so. Also, RTTI may not be present when a program does not utilize the typeid or dynamic_cast operators. Lacking RTTI information, what techniques can be employed to determine inheritance relationships among C++ classes?

The simplest method of determining an inheritance hierarchy is to observe the chain of calls to superclass constructors that are called when an object is created. The single biggest hindrance to this technique is the use of inline[51] constructors, which leaves no call instruction to show that a superclass constructor has in fact been invoked.

An alternative means for determining inheritance relationships involves the analysis and comparison of vtables. For example, in comparing the vtables shown in Figure 8-14, we note that the vtable for SubClass contains two of the same pointers that appear in the vtable for BaseClass. We can easily conclude that BaseClass and SubClass must be related in some way, but which one is the base class and which one is the subclass? In such cases we can apply the following guidelines, singly or in combination, in an attempt to understand the nature of their relationship.

  • When two vtables contain the same number of entries, the two corresponding classes may be involved in an inheritance relationship.

  • When the vtable for class X contains more entries than the vtable for class Y, class X may be a subclass of class Y.

  • When the vtable for class X contains entries that are also found in the vtable for class Y, then one of the following relationships must exist: X is a subclass of Y, Y is a subclass of X, or X and Y are both subclasses of a common superclass Z.

  • When the vtable for class X contains entries that are also found in the vtable for class Y and the vtable for class X contains at least one purecall entry that is not also present in the corresponding vtable entry for class Y, then class Y is a subclass of class X.

While the list above is by no means all-inclusive, we can use these guidelines to deduce the relationship between BaseClass and SubClass in Figure 8-14. In this case, the last three rules all apply, but the last rule specifically leads us to conclude, based on vtable analysis alone, that SubClass inherits from BaseClass.
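The guidelines above lend themselves to a small comparison routine. The following is a rough sketch, not a complete analysis: the relate function, its result strings, and all addresses are hypothetical, entries are compared slot by slot, and the purecall address is assumed to have been identified in the binary beforehand.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Vtable = std::vector<std::uint32_t>;   // recovered slot addresses

// Applies the guidelines to two recovered vtables, X and Y.
std::string relate(const Vtable &x, const Vtable &y, std::uint32_t purecall) {
    std::size_t common = std::min(x.size(), y.size());
    bool shared = false, x_pure_only = false, y_pure_only = false;
    for (std::size_t i = 0; i < common; ++i) {
        if (x[i] == y[i] && x[i] != purecall) shared = true;
        if (x[i] == purecall && y[i] != purecall) x_pure_only = true;
        if (y[i] == purecall && x[i] != purecall) y_pure_only = true;
    }
    if (!shared) return "no evidence of a relationship";
    if (x_pure_only && !y_pure_only) return "Y is probably a subclass of X";
    if (y_pure_only && !x_pure_only) return "X is probably a subclass of Y";
    if (x.size() > y.size()) return "X may be a subclass of Y";
    if (y.size() > x.size()) return "Y may be a subclass of X";
    return "related, direction unknown";
}

// Models the two vtables of Figure 8-14 with made-up addresses.
std::string figure_example() {
    const std::uint32_t purecall = 0x00401500;
    Vtable base = { purecall, 0x00401020, 0x00401030, 0x00401040 };
    Vtable sub  = { 0x00401110, 0x00401020, 0x00401130, 0x00401040, 0x00401150 };
    return relate(base, sub, purecall);
}
```

With X as the BaseClass table and Y as the SubClass table, figure_example() returns "Y is probably a subclass of X", which agrees with the conclusion reached by the last guideline.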

C++ Reverse Engineering References

For further reading on the topic of reverse engineering compiled C++, check out these excellent references:

While many of the details in each of these articles apply specifically to programs compiled using Microsoft Visual C++, many of the concepts apply equally to programs compiled using other C++ compilers.



[48] The new operator is used for dynamic memory allocation in C++ in much the same way that malloc is used in C (though new is built into the C++ language, where malloc is merely a standard library function).

[49] A class constructor function is an initialization function that is invoked automatically when an object is created. A corresponding destructor is optional and is called when an object goes out of scope or is otherwise destroyed.

[50] In C++, function overloading allows programmers to use the same name for several functions. The only requirement is that each version of an overloaded function must differ from every other version in the sequence and/or quantity of parameter types that the function receives. In other words, each function prototype must be unique.

[51] In C/C++ programs a function declared as inline is a candidate for inline expansion: much as with a macro, the compiler may expand the code for the function in place of an explicit function call. Since the presence of an assembly language call statement is a dead giveaway that a function is being called, the use of inline functions tends to hide the fact that a function is being used.
