CHAPTER 6

Decompiler Implementation

We are now at the point where you will learn to actually deal with the individual bytecodes, decompile the opcodes into partial statements and expressions and, ultimately (well that’s the plan anyway), back into complete blocks of source code.

If I’m gauging my audience correctly, this chapter, and possibly Chapter 5, will appeal to a significant cross section of readers. We’re now at the nub of the problem of how to implement a decompiler using using Java versions of Lex and Yacc, namely JLex and CUP.

To keep this as practical as possible, I’ll use a test suite of ten simple programs, each with a different language construct. For each program, I’ll reconstruct the original source gradually, building the decompiler as I go. Each program is first compiled, then disassembled and converted into XML using ClassToXML. I’ll then look at the Java source and the corresponding method bytecode and create a CUP specification for each example to convert the bytecode back into source.

I’m taking an easy, yet powerful, approach to dealing with the classfile: by having the decompiler pretend to be the Java Virtual Machine (JVM). The decompiler will simulate stack operations as the assembly directs, but it will be working with variable names rather than actual data.

And because the classfile is more than method bytecode, I’ll also need to be able to incorporate the remaining information in the classfile to recover import statements, package names, and variable names from the constant pool. So to begin with, I’ll take a look at the format of the ClassToXML output, that is, the input for our decompiler that does not deal with recovering the actual code from the bytecodes. This is the supporting cast for the code recovery section. Then I’ll expand the section that deals with recovering the expressions and code structure while looking at the test suite.

ClassToXML Output: An Overview

You have seen the ClassToXML format in earlier chapters, where we used it to create a simple obfuscator. It is now going to be used as input for the decompiler. To parse ClassToXML, I first create a JLex specification and then, using the terminal symbols, I create a skeleton CUP file for the classfile grammar.

From the point of view of the decompiler, every ClassToXML output file has the same format:

  • A start token, (<root>), to allow initialization of variables.
  • The contents of the constant pool, which contain all the metadata for the class.
  • Divider tokens (<Fields>, <Methods>, etc.), to separate data and to allow processing between individual sections of the ClassToXML file.
  • The class data, which includes properties and the constant pool index, followed by a divider token.
  • The field data, which includes properties and constant pool indices, followed by a divider token. If no fields are used within the class, this may be blank.
  • The method data, which includes properties, constant pool indices, and assembly code. Methods are separated by semicolons, which can be thought of as subdivider tokens.
  • An end token (</root>).

Let’s take a look at each section of ClassToXML listed above in more detail.

Constant Pool Overview

Constant pool entries consist of the following:

  • Index
  • Type (which can be Integer, Long, Float, Double, UTF-8, String, Class, NameAndType, FieldRef, MethodRef, or InterfaceMethodRef)
  • Data (which can be either a constant value or an index—or a pair of indices—to another constant pool entry)

For example, a MethodRef type to the println function links to its Class and NameAndType entries, which each link to strings, as shown in Listing 6-1.

Obviously, you’ll have to load the constant pool into memory and resolve cross-references between elements in order to create any sort of a useful decompiler. Otherwise, you’ll have code without any of the original names of variables and methods.

Class Data Overview

The second section of ClassToXML is the class data. In this limited decompiler, the class data you use consists of merely the class’s access flags and ThisClass. You just parse and discard SuperClass and the interface data. The class data for a Hello World program might consist of the following:

  <AccessFlags>public</AccessFlags> 
  <ThisClass>15</ThisClass>
  <SuperClass>16</SuperClass>

Field Data Overview

The third section of ClassToXML is the field data. Field entries consist of the following:

  • Access information (public, private, etc.) and properties (static, volatile, etc.)
  • Constant pool index of Name
  • Constant pool index of Description
  • Attribute count
  • Attributes

Again, for this decompiler, you need to assume that no attributes belong to any fields. For example, you can see the ClassToXML field data for Listing 6-2 in Listing 6-3.

Method Data Overview

The method data is the fourth section of ClassToXML. Method data entries of ClassToXML consist of the following:

  • Access information (e.g., whether it’s public or not) and properties (e.g., whether it’s static or volatile)
  • Constant pool index of Name
  • Constant pool index of Parameters

All other data included in the XML file is discarded.

For example, method data XML for the recurse method shown in Listing 6-4 can be seen in Listing 6-5.

The constant pool entries shown in Listing 6-6 provide the necessary information.

The parameter description <ConstantPool_Index>20</ConstantPool_Index> is the most important thing here, and I will cover it in greater detail later (see the discussion in the “CUP Specification” section). The type codes inside parentheses, for example (I) within the <Value></Value> node in Listing 6-6, are the types of data passed into the method, while those outside are the return types of the method.

JLex Specification

To begin the decompiler discussion, I should probably show you the complete lexical specification for breaking down the bytecode into tokens because that will be unchanged from start to finish.

In Listing 6-7, each of the 200 or so bytecodes is broken down into a token along with all the classfile’s associated labels and numbers. As I’ve already mentioned, I’m using the output from the disassembler, ClassToXML, as the input for the JLex scanner.

The JLex specification tokenizes or scans the input. But every scanner needs a corresponding parser. The next stage in the deompiler is to create a CUP specification to turn these tokens back into the original source.

CUP Specification

Listing 6-8 is the beginning of our CUP specification. The aim of this chapter is to turn our CUP specification into a full-blown decompiler, but for now, I’ll just show the skeleton of the decompiler specification (Listing 6-8) so that we have something that can be compiled together with our JLex scanner.

Although the decompiler isn’t nearly complete at this stage, I have included much of the action code in the following listings that is crucial for creating our application—to resolve the constant pool references, for instance.

resolveConstant recursively resolves the constant pool entry associated with constantPoolIndex and returns it as a string, as shown in Listing 6-9.

We use a number of stacks to help resolve the constant pool; oStackDebug and finalStackDebug dump the contents of each of those stacks for ease of debugging, as shown in Listing 6-10.

Next, let us look at the terminal and non-terminal symbol declarations. The terminals are the same as the tokens coming in from scanner and the non-terminals are what we resolve the terminals into, to gradually turn the tokens or terminals into source code. First are the XML tag terminals, which you can see Listing 6-11. There’s nothing surprising here; refer to the JLex specification (Listing 6-7) to match the terminal name to the tag.

Next Listing 6-12 shows the terminals for the each of the individual opcodes:

Finally, the non-terminals are shown in Listing 6-13. These break up the XML file into bite-sized pieces so that they can be dealt with more easily and can be ultimately converted into code. In CUP and Yacc grammars, general terminals always resolve into non-terminals as part of the parsing process.

I’ll now look at each non-terminal in a little more detail. In all CUP parsers, well all Yacc parsers to be exact, we convert the scanned tokens or terminals into the final output by converting terminals and non-terminals into other non-terminals until we reach the final non-terminal and the parser exists.

CUP is not unlike the JVM itself, acting like a simple stack machine where each terminal and non-terminal is popped onto a stack. Then when a rule in the parser (specified by the non-terminal) is satisfied, the terminals and non-terminals are popped off the stack and replaced with the new non-terminal. The parser ultimately exists when there are no more terminals to parse and the final rule has been resolved, which in our case, is the file non-terminal, and then the source code is output.

The file Non-Terminal

Select the file non-terminal as the start symbol:

start with file;

The file non-terminal definition describes the entire higher-level schema of the ClassToXML output, as just described. Because you’re parsing most of the “useless” class information (such as field and method counts) in file, the non-terminal definition is very unwieldy, as you can see in Listing 6-14.

Remember, the code following these definitions is executed only once the symbol has been parsed. Once the whole class is decompiled, you will output the final parenthesis as using a simple System.out.println.

Note  From here on, until we get to the test suite, that is, I will only show the terminals and not the code that manages the terminals. You can find and download the remaining code from the Apress web site (http://www.apress.com).

The startfile Non-Terminal

In Listing 6-15, you use the startfile symbol so that you can set up the constant pool arrays before parsing begins. Null is added at the start of the arrays to match the 1-based indexing used in the bytecode.

The constantpool Non-Terminal

Now you can actually read in the constant pool. First set up the superstructure for parsing—the constantpool non-terminal, as shown in Listing 6-16. As you saw earlier, this allows the parser to read in every constant pool element.

The constantelement Non-Terminal

The constantelement non-terminal is the key to the problem. You just have to break it down into the various types of constant element, as shown in Listing 6-17.

First, you will need to read in the primitive types: the UTF-8 char array, int, long, float, and double. The eight-byte constants (double and long) require two constant pool entries, but you have already dealt with this in ClassToXML by merely filling both entries with the whole eight-byte constant value. Negative numbers must include the NEGATIVE terminal. You need to do some tricky string manipulation to correctly read in spaces with their escape characters and to replace &lt; and &gt;, the XML codes for < and >.

Next, you have the more complex constant data types: String, Class, NameAndType, Field, Methodref, and InterfaceMethodref. These types consist of index values to other elements in the constant pool. Once we deal with the constant data types, the constant pool will be ready to use. Decompiling the bytecode without using the constant pool information will not make the source code very readable.

The classname Non-Terminal

Now you need to prepare the class structure by reading in the access properties and the constant pool index of the class name and also by parsing the superclass constant pool index. But, since you don’t resolve the superclass, you can discard it.

classname ::= ACCFLAGS access:a XACCFLAGS THISCL number:classnum 
                XTHISCL SUPERCL number XSUPERCL
             ;

The interfaces Non-Terminal

Now read in the (empty) headers for the interfaces. As stated earlier, I am assuming that there are no interfaces in the test suite.

interfaces ::=  INTCNT number XINTCNT INTERFACES XINTERFACES
            ;

The fields Non-Terminal

Next, you must read in the fields. In Listing 6-18, I am using the fields non-terminal to collect them all. The parameter symbols given in the constant pool (I, B, L, J, etc.) are outlined by the Java specification. J, for instance, is used for long variables because L is already reserved for class references.

The methods Non-Terminal

Next read in the methods using the methods non-terminal to collect them all.

methods ::=  methods METHOD method XMETHOD 
        | METHOD method XMETHOD
        ;

For your purposes, each method consists of an access flag (public, private, protected, etc.), an indefinite number of properties (static, volatile, etc.), and constant pool indices for the name and passed/returned parameter types. However, reading in the relevant data requires reading in a lot of data you do not use first, which leads to extremely long non-terminal declarations. To minimize this, you’ll need to break the method non-terminal into pieces. In Listing 6-19, the codeattribs non-terminal will precede stmts, which is the assembly code for the method; endcodeattribs will follow it.

The line number table is an often-present attribute of methods; it matches pieces of the bytecode to assembly code line numbers. Again, you just parse and forget it. By doing so, you accumulate an unknown number of line-number mappings with the linenumtable non-terminal and individual mappings with linenummapping, as shown in Listing 6-20.

And, finally, it’s time to deal with the exception table (see Listing 6-21). You need to consider that the exception table may be empty, and so you’ll need to add a blank definition. I’ve included this for completeness’s sake because I don’t implement the try/catch/finally decompilation in this version, which is the primary use of the exception table.

You then get the method data (access flags, definition parts, and code attributes) using the method non-terminal. The program code is then picked up iteratively using the stmts non-terminal as shown in Listing 6-22. Note that the return type and parameter types are resolved in this non-terminal:

In non-static methods, the local variable 0 is used to hold the this reference; you need to use a global variable, staticAdjustment, to correct for static methods. I also use the level variable to denote the number of conditional statement levels. We won’t be addressing conditional statements for some time, however.

Because all the decompiled code is in finalStack backwards, you need to pop objects and add them to the front of outString. Finally, you should add the method information you collected at the start of this function and output it. Note that I skip the <init> and <clinit> methods, because they only exist to initialize field objects; they are dealt with separately.

definitionparts, see Listing 6-23, is the non-terminal you use to set the field and method properties (static, final, etc.), name, and parameters. You must separate it from the field and method non-terminals so that you know the current object’s name while you are analyzing the assembly code.

stmts, shown in Listing 6-24, is the accumulating non-terminal for assembly code. If a new line of code is being added to a preexisting stack, you need to check to see whether any conditional statements end at that point in the program. Conditional resolution is one of the most difficult parts of the program.

Miscellaneous Non-Terminals

At this point, I’m not yet adding instructions, so I’ll just define the built-in error non-terminal as the only member of expr_part.

expr_part ::= error ;

Here is the number non-terminal, a very simple non-terminal that is used repeatedly.

number ::= NUMBER:n  ;

access is similarly simple, but some methods do not have access flags. To provide for these, you must include a “blank” definition.

access ::=  | ACCESS:a ;

properties is similarly primitive. This is where you set the staticAdjustment variable if necessary.

properties ::=   properties property | property ;
property ::= | PROPERTY:p ;

Finally, the type non-terminal provides type strings to the various instructions that need them.

type   ::= TYPE:t ;

Test Suite

To complete the remainder of the decompiler code, in particular to build up the expr_part non-terminal, I use a test suite of programs. There are ten programs in the test suite, each demonstrating a different language construct.

  • HelloWorld.java
  • Basics.java
  • MathOps.java
  • DoWhile.java and IfTest.java
  • Recurses.java
  • WhileLoop.java
  • ForLoop.java
  • ArrayTest.java
  • ArrayInit.java

For each of the programs, I begin with the original source and then show the important sections of ClassToXML—that is, the decompiler input. I then discuss the high-level grammar changes we need to make to accommodate this new construct. Next, I show the complete code for those of you who are interested in the details. Finally, I display the decompiler output.

HelloWorld.java

It’s now time to start looking at the decompiler XMLToSource in earnest. There is no better place to begin than with a simple HelloWorld example (Listing 6-25). In this program, you’ll see several important basic operations including constant loading, method invocation and return, and most importantly, resolution of constant pool elements.

Input

The decompiler currently recognizes the ClassToXML output; it will read in the constant pool and the class information as shown in the previous section. The additional input consists of two generated methods in the classfile separately: a dummy constructor (Listing 6-26) and the main method (Listing 6-27).

The second method (Listing 6-27) is our main function.

No local variables are used in this method although local0 is reserved as a reference to the String array of arguments Java requires of main functions. In this case, local0 resolves to public static void main(String[] args).

Grammar

The grammar is as follows:

expr_part -> error | return | store | load  | invoke | object ;
return  -> number RETURN  | number type RETURN;
store   -> number type STORE  number;
load    -> number type LOAD number  | number LDC number ;
invoke -> number INVOKE number;
object   -> number NEW number | number GETSTATIC number
type -> TYPE;

The return Non-Terminal

There are two return productions: line number and RETURN terminal; and line number, type of value to return, and RETURN terminal.

The return non-terminal is fairly simple. It has two possible variations in the assembly: return is used to return void, whereas ireturn, dreturn, and so on, push a value onto the oStack and then return to the calling method. Note that non-terminal definitions must include the line number of the instruction so that you have immediate access to it.

The store Non-Terminal

The store production consists of the line number, the local variable type, the STORE terminal, and the index of the local variable being used.

The store non-terminal is more complex. Every variable declaration creates a store instruction: primitives in their primitive type, higher-level objects as pointers (astore). To correctly parse an instruction, you have to check its type, verify whether the local variable it references has already been declared, and then store the top element of the oStack in the appropriate local variable.

As with return, different primitive types are identified by a single-character prefix to the store instruction. It’s very easy to identify the int, long, float, and double primitives, but what of the class variables stored using astore (that is, stored by address)?

This implementation of the type non-terminal loads the primitive types into a global string type, but ignores the “a” type. The higher-level objects must be addressed differently: the type can be set to String when a string constant is loaded from the constant pool. For other classes, you can use the new instruction to get the variable type.

The load Non-Terminal

The load non-terminal contains two productions: one for the load instruction, which consists of the line number, the local variable type, the LOAD terminal, and the index of the local variable being loaded; and one for ldc, the “load constant” instruction, which consists of the line number, the LDC terminal, and the constant pool index of the constant being loaded.

The ldc instruction is easy to deal with: it gets the constant pool element specified by the index and pushes it onto the oStack. The definition for the load instruction is nearly as simple. First check to see whether the local variable whose value is being pushed onto the oStack has been used before. If it hasn’t, it is a parameter of the method and you set the corresponding bit in varsInUse.

              if (!varsInUse.get(new Integer(n.toString()).intValue()))
                     varsInUse.set(new Integer(n.toString()).intValue());

Since it’s being loaded onto the oStack, the type of the variable doesn’t matter and “local”+index (that is, localN) can be pushed onto the oStack.

The invoke Non-Terminal

The invoke production is deceptively simple, consisting of only the line number, the INVOKE terminal, and the constant pool index of the MethodRef for the method being invoked.

The invoke non-terminal is the meatiest you’ll encounter in this example: it is the trickiest, the most basic, and the most important. When dealing with invocations, you must be concerned with whether the method being invoked is static (since it will not affect the current reference), virtual, or “special”—although you don’t need to worry too much about the last, as it rarely comes up.

The code first resolves the method reference (given by constant pool index), then isolates the number of passed parameters and pops that number of elements from the oStack.

Once the constant pool data for the method has been loaded, the oStack must be checked. If the top element on the oStack is an invocation containing the typeCheck type, the method being invoked belongs to that reference and can be attached to it—for instance, a string concatenation, (StringBuffer.append()), may be attached to another if multiple strings are appended to a StringBuffer.

Next, the decompiler must check whether the method being called is <init>. If it is, this instantiates an object. If the object constructor takes a parameter, you must add it in. Then you need to check to see if the class method being invoked belongs to is the current reference. If it is, you must attach the previous chain of invocations (i.e., tempString.substring(0, tempString.indexOf(“.”)).You use the outstandingType global string for this. You must check three things: whether outstandingType has been initialized, whether typeCheck (the class to which the method being invoked belongs) starts with it, and whether the oStack is empty. You’ll use this in successive cases, too.

The next condition is a special case, the case of appending strings. If the method being invoked is .append(“somevalue”), and the top element of the oStack begins with new StringBuffer().append, you can insert + “somevalue” into the argument of append. This returns you to a familiar and efficient style of string concatenation—for example (“value1” + value1 + “.”).

If none of these criteria apply, you need to check those same three criteria discussed above: outstandingType, typeCheck, and whether the oStack is empty. If the method returns void, you can push it to the finalStack; otherwise, you must push it to the oStack.

If the outstandingType check fails, you check the return type.

The object Non-Terminal

At present, there are two productions for object; more will be added later. The first, for the new instruction, consists of the line number, the NEW terminal, and the constant pool index of the object being instantiated. The second, for the getstatic instruction, consists of the line number, the GETSTATIC terminal, and the constant pool index of the ClassRef being loaded.

The object non-terminal is the last remaining, and it is fairly simple. The related opcodes putstatic, getfield, and putfield will not become important for some time. The new instruction just gets the class name from the constant pool, stores it in the global outstandingType string, and pushes the new instance of the class onto the oStack. The getstatic instruction is no more difficult: again, just resolve the constant pool reference and push it onto the oStack.

Code

The complete code to handle HelloWorld.java is shown in Listing 6-28.

Output

Once you recompile the CUP and JLex files and run the decompiler, the results you get are indistinguishable from the original (see Listing 6-29).

Basics.java

In this example, the decompiler is extended slightly to support a few more common operations. In this program, you’ll see several important basic operations: primitive type assignation, recasting, initialization of variables, and non-void returns. We begin with the original code, shown in Listing 6-30.

Input

The first method in ClassToXML’s output will of course be the constructor, which can be disregarded. The second is the main function (shown in Listing 6-31).

There is an interesting and important difference between static and non-static methods within the bytecode. Since the main function is a non-static method, local0 is used to store this. Change both functions to public static, recompile the program, and disassemble it with ClassToXML, and you will see that the assembly of the main function is quite different (Listing 6-32).

No longer is this loaded before the type conversion method is invoked. Instead, only the contents of local1 (formerly local2) are loaded and the method is invoked.

Last is the tiny type conversion method (Listing 6-33). Little is done within it, as type conversion between primitives requires only a single opcode.

Also note that the type of the return opcode reflects the return type of the method—this helps the JVM ensure security. Though there can be multiple return statements in a method, they all must return the same type.

Grammar

The grammar is as follows:

expr_part -> error | return | store | load  | invoke | object | 
                              const | ipush | conv;
ipush -> number IPUSH number;
const   -> number type CONST number | number type CONST M1
                | number type CONST NULL;
conv-> number type T2T type;

The ipush Non-Terminal

The ipush production consists of the line number, the IPUSH terminal, and the byte or short value to convert to load to the oStack.

Both bipush and sipush sign-extend a byte or a short, respectively, to int, then push it to the oStack. With the abstracted approach the decompiler uses, this is very easy to deal with: it just pushes the number.

The Const Non-Terminal

The const production consists of the line number, the type of constant to load, the CONST terminal, and the constant value to load.

Compared to ipush, const has a wider range of data types but a smaller range of values. Const can push data of type int, long, float, and double. iconst can push integers 0–5, as well as –1 (iconst_m1), onto the oStack; lconst longs 0 and 1; fconst float 0.0, 1.0, and 2.0; and dconst double 0.0 and 1.0. There is also an aconst instruction that can push a null object reference. Again, this is easy to parse, requiring only that M1 and NULL terminals be defined.

The conv Non-Terminal

The conv production consists of the line number, the type of the value to be converted, the T2T terminal, and the type to which it’s being converted.

There’s one more immensely useful little instruction to address: type-to-type conversion (t2t). T2T has various forms (i2l, f2d) can convert most primitives to one another. This will take care of recasting values. You don’t need to worry about what type a piece of data is being converted from, just what it will be converted to. This and the way types are processed make things very simple:

                    oStack.push(“(“ + type + “) “ + oStack.pop());

Code

The complete code to handle Basics.java is shown in Listing 6-34.

Output

Once the CUP specification is updated and recompiled, the decompiler produces exactly the result we want (Listing 6-35), although I must admit it’s very unlikely that you’d choose local1, local2, and local3 for your variable names.

MathOps.java

The next step is to have programs that change the data in the oStack—to start with, a simple arithmetic program. You will see that all the arithmetic operations (add, sub, mul, div, and rem) work the same way. Only the iinc increment instruction needs to be handled differently. The original code is shown in Listing 6-36.

Input

The main function from ClassToXML is shown in Listing 6-37. The JVM is completely stack-based, so the arithmetic operations work with the top two pieces of data on the oStack. For example, lines 19–23 of the assembly are equivalent to

local1 += (double) local3;

Grammar

The grammar is as follows:

expr_part -> error | return | store | load  | invoke | object | const | bipush | 
                                    conv | arith | iinc;
arith -> number type NEG | number type REM | number type ADD
                            | number type SUB | number type MUL | number type DIV;
iinc   -> number IINC number number
                | number IINC number NEGATIVE number;

The arith Non-Terminal

Each of the arith productions consists of the line number, the type of the operand(s), and the appropriate opcode name (ADD, SUB, MUL, DIV, REM, or NEG).

Parsing the arithmetic expressions is not difficult: just pop the top two values in the oStack and perform the appropriate arithmetic function; then push the combined instruction back onto the oStack. For lack of a better place to put it, the NEG opcode is also included in arith; this pops the top item on the oStack, switches the sign (whether it’s a signed value or an arithmetic expression), and pushes it back.

Unfortunately, arithmetic is a bit more complicated than it looks. If expressions are chained, parentheses are necessary to ensure proper results. Addition and subtraction do not depend on the order of operations, while multiplication and subtraction do. Thus, the multiplication and division non-terminals must check for arithmetic signs as the NEG non-terminal does.

The iinc Non-Terminal

The iinc production consists of the line number, the IINC terminal, the index of the local variable to increment, and the integer increment value (which may be preceded by a negative sign).

Integer increment (iinc) adds or subtracts given values from the value atop the oStack. This requires a new terminal definition, since iinc can take a negative value as the increment. As negative numbers are not defined in the non-terminal expression number, two different productions are used for iinc: an increment and a decrement (IINC number NEGATIVE number). In addition, increment values of one can be caught and the format switched to “i++” from the clunkier “i+=1”.

Code

The complete code to handle MathOps.java is shown in Listing 6-38.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.35.72