Chapter 2. Dancing with the Dead

While many malware analysis tasks involve pattern recognition and investigation of an existing binary disassembly, your comfort in performing them will be directly proportional to your ability to think and write in assembly code. How a compiler translates and arranges source text into object code (lexical analysis, tokenizing, data flow analysis, and control flow analysis) is a very different process from a human expressing ideas in text form using English-like code constructs. Furthermore, it is the linker (which is invoked by modern compilers) that actually builds the final executable binary from various libraries, object files, and resources. If assembly code such as the following does not make sense, this chapter could be of help:

 mov eax,dword ptr[0x402500]
 cdq
 sar eax,4 
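
For readers who find the three lines above opaque, their effect can be modelled in a few lines of Python. This is only a sketch: the run_snippet and to_signed32 helpers are hypothetical names, the registers are modelled as a plain dictionary, and the value passed in stands in for whatever 32-bit value lives at address 0x402500.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def run_snippet(memory_dword):
    """Model mov/cdq/sar; memory_dword stands in for the dword at 0x402500."""
    regs = {"eax": 0, "edx": 0}
    # mov eax, dword ptr [0x402500]: copy 32 bits from memory into EAX
    regs["eax"] = memory_dword & MASK32
    # cdq: sign-extend EAX into EDX:EAX (EDX is all ones if EAX is negative)
    regs["edx"] = MASK32 if regs["eax"] & 0x80000000 else 0
    # sar eax, 4: arithmetic shift right by 4, replicating the sign bit
    regs["eax"] = (to_signed32(regs["eax"]) >> 4) & MASK32
    return regs

result = run_snippet(0xFFFFFF00)  # -256 when read as a signed 32-bit value
print(hex(result["edx"]), to_signed32(result["eax"]))  # 0xffffffff -16
```

In this short fragment the EDX value produced by cdq is not consumed by the sar that follows; in compiled code, cdq usually prepares for a signed division or a rounding adjustment.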

Our focus for the current chapter will be the following:

  • x86/x64 assembly programming concepts using VC++ and MASM32
  • x86 disassembly and an analysis of binaries in VC++ 2008 Express
  • Various ways to do assembly programming in the VC++ environment

Motivation

To be clear from the outset, it is memory management, not the instruction sequences themselves, that takes up the bulk of the work in assembly programming; the instructions can be taken as enablers or the core vocabulary. Each instruction is atomic and very linear, like a symbol that has a singular meaning and purpose. The short text name of each instruction in the preceding listing (mov, cdq, sar) is called a mnemonic, and each assembly instruction can be taken as a function with certain inputs and an output.

Each assembly line maps directly to an opcode sequence: a byte pattern that is unique to a particular architecture, which for our purposes is the Intel 80x86 family of microprocessors. This mapping is done by an assembler (the word is used both for the language and for the software that generates the machine object code), which creates object code from assembly text. The linker then processes the object code to produce the final executable.
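
As a rough illustration of this mapping, the following Python sketch hand-encodes the three instructions from the start of the chapter into their 32-bit mode byte patterns. The function names are hypothetical, and only one encoding per instruction is shown; a real assembler may pick a different but equivalent encoding for the same line.

```python
import struct

def encode_mov_eax_moffs32(address):
    # mov eax, dword ptr [address]: opcode A1, then the address in little-endian order
    return b"\xA1" + struct.pack("<I", address)

def encode_cdq():
    # cdq is a single-byte instruction
    return b"\x99"

def encode_sar_eax_imm8(count):
    # sar eax, imm8: opcode C1 /7 ib; ModRM F8 = mod 11, reg 111 (/7), rm 000 (EAX)
    return b"\xC1\xF8" + struct.pack("<B", count)

machine_code = (
    encode_mov_eax_moffs32(0x402500)
    + encode_cdq()
    + encode_sar_eax_imm8(4)
)
print(" ".join(f"{byte:02x}" for byte in machine_code))
# a1 00 25 40 00 99 c1 f8 04
```

The three instructions occupy 5, 1, and 3 bytes respectively, which also shows that x86 instructions vary in length.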

Assembly code is, by definition, not portable, as it varies with each microprocessor design. However, market share and the standards established over the years have made this a minor concern for Windows software analysis, as the operating system runs mainly on Intel and AMD microprocessors. Other operating systems also run on the x86/x64 instruction set, and thus the Intel instruction set has become a convention. To summarize, the benefit of learning assembly is that all software on a platform eventually has to run in the form of microprocessor instructions, much like the popular saying that "all roads lead to Rome." This puts immense power in your hands, as any software can be deconstructed to a good approximation, given enough time and resources. However, compilation to a binary creates difficulties: the symbols and identifiers used to denote things such as variable names and function names become generic memory addresses, and it takes some effort to create an approximate representation of the original design.

The Intel 64 and IA-32 Architectures Software Developer's Manual (combined volumes 1, 2A, 2B, 2C, 3A, 3B, and 3C) is the best reference for the IA-32 instruction set and for system programming on Intel chips; you can find it at https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf.

The Intel 80x86 microprocessor family is often described as a Complex Instruction Set Computer (CISC) design. The instruction opcodes are of variable length, and a single opcode sequence (instruction) can perform a range of tasks depending on how it is invoked. This is unlike Reduced Instruction Set Computer (RISC) machines, where the opcode lengths are fixed and each instruction focuses on one particular task, so more instructions are needed than on a CISC machine to complete a similar task. Parallel processing is feasible on both designs, and the debate continues over which architecture is the better one. Hyper-Threading Technology, Intel's implementation of simultaneous multithreading, lets a single physical core present itself as two logical processors that share its execution resources. Such features, together with the need for software backward compatibility, may hold the future for CISC as a design decision.

The two important operating modes are real mode (DOS) and protected mode (Windows). Real mode uses 16-bit segment and offset values that combine into a 20-bit address space (1 MB), whereas 32-bit protected mode provides a 32-bit address space (4 GB). Real mode is present for backward compatibility and is active at the start of the boot cycle, after which modern operating systems such as 32-bit Windows switch to protected mode; 64-bit Windows versions then move on to long mode, the 64-bit extension of protected mode.
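
The 1 MB and 4 GB figures follow directly from how addresses are formed. In real mode, a 16-bit segment is shifted left by four bits and added to a 16-bit offset, which yields a 20-bit address. The sketch below assumes the original 8086 behaviour, where the result wraps around at 1 MB; real_mode_linear is a hypothetical helper name.

```python
def real_mode_linear(segment, offset):
    """Linear address for a real-mode segment:offset pair (8086-style 20-bit wrap)."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF

# B800:0000 is the conventional segment for colour text-mode video memory
print(hex(real_mode_linear(0xB800, 0x0000)))          # 0xb8000
print(2**20 // 1024, "KB reachable with 20 address bits")   # 1024 KB
print(2**32 // 1024**3, "GB reachable with 32 address bits")  # 4 GB
```

Note that many different segment:offset pairs name the same linear address, for example 0x1234:0x5678 and 0x179B:0x0008.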

Looking at the assembly code and the disassembly of the native code, some things are quite evident:

  • Data movement instructions facilitate communication between the processor, memory, and I/O components, and within the processor's own facilities such as the general-purpose/FPU registers and flags.
  • Conditional constructs are built from elementary logical decisions. These, in turn, drive program control flow.
  • Basic arithmetic and number representation instructions, together with instructions for Boolean logic, give the processor a mathematical brain.
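
To make the link between arithmetic, flags, and control flow concrete, here is a small Python model of how a cmp instruction sets the status flags and how two conditional jumps read them. It is a sketch limited to 32-bit operands and four flags; cmp_flags, jl_taken, and jb_taken are hypothetical helper names, not part of any tool.

```python
MASK32 = 0xFFFFFFFF

def sign_bit(value):
    return (value >> 31) & 1

def cmp_flags(a, b):
    """Flags produced by `cmp a, b`, which computes a - b and discards the result."""
    a &= MASK32
    b &= MASK32
    result = (a - b) & MASK32
    return {
        "ZF": int(result == 0),          # zero flag: operands were equal
        "SF": sign_bit(result),          # sign flag: top bit of the result
        "CF": int(a < b),                # carry flag: unsigned borrow
        "OF": int(sign_bit(a) != sign_bit(b) and sign_bit(result) != sign_bit(a)),
    }

def jl_taken(flags):
    # jl (jump if less, signed) is taken when SF != OF
    return flags["SF"] != flags["OF"]

def jb_taken(flags):
    # jb (jump if below, unsigned) is taken when CF = 1
    return flags["CF"] == 1

flags = cmp_flags(0xFFFFFFFF, 1)  # -1 versus 1 if signed, 4294967295 versus 1 if unsigned
print(jl_taken(flags), jb_taken(flags))  # True False
```

The same bit pattern compares as "less" under the signed jump and "not below" under the unsigned one, which is why the choice of conditional jump in a disassembly hints at the signedness of the original variables.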

64-bit programming is an extension of 32-bit programming, so it is essential that the 32-bit concepts are fully understood.

There are 8 essential general-purpose registers in an Intel microprocessor:

[Figure: the eight 32-bit general-purpose registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP]

Further, there are 8 additional registers for 64-bit programming:

[Figure: the eight additional 64-bit general-purpose registers: R8 through R15]

Note that for 8-bit access, only the low-order byte of these additional registers is available (R8B through R15B); there is no high-order byte counterpart to AH, BH, CH, or DH. They can also be accessed as 16-bit (R8W), 32-bit (R8D), and full 64-bit regions.
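
The sub-register views can be pictured as bit masks over a single 64-bit value. The Python sketch below uses R8 as the example; views, write_r8d, and write_r8b are hypothetical helper names. It also models one behaviour of 64-bit mode worth remembering: a 32-bit write clears the upper half of the register, while an 8-bit write leaves the remaining bits untouched.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def views(value):
    """Return the 64-, 32-, 16-, and 8-bit views of a register value."""
    value &= MASK64
    return {
        "r8": value,
        "r8d": value & 0xFFFFFFFF,  # low 32 bits
        "r8w": value & 0xFFFF,      # low 16 bits
        "r8b": value & 0xFF,        # low 8 bits (no high-byte view exists)
    }

def write_r8d(old, new32):
    # In 64-bit mode, a 32-bit write zero-extends into the full register
    return new32 & 0xFFFFFFFF

def write_r8b(old, new8):
    # An 8-bit write replaces only the low byte and keeps the upper 56 bits
    return (old & MASK64 & ~0xFF) | (new8 & 0xFF)

r8 = 0x1122334455667788
print({name: hex(v) for name, v in views(r8).items()})
print(hex(write_r8d(r8, 0xDEADBEEF)))  # 0xdeadbeef
print(hex(write_r8b(r8, 0xAA)))        # 0x11223344556677aa
```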
