CHAPTER 5

Decompiler Design

For the remainder of this book, I’m going to focus on how you can create your own decompiler, which is, in fact, a cross-compiler that translates bytecode to source code. Although I will be covering the theory behind the relevant design decisions as they arise, my intention is to give you enough background information to get you going rather than to give you a full-blown chapter on compiler theory.

Don’t expect your decompiler (ClassToSource in this chapter) to be more comprehensive or better than anything currently on the market; to be honest, it is probably closer to Mocha than JAD or SourceAgain. Like most development, the first 80 to 90 percent of our decompiler is the easiest and the last 10 to 20 percent takes much longer to complete. But ClassToSource will show you the basic steps of how to write a simple Java decompiler that can reverse engineer the majority of code you will come across.

I cover the general design of your ClassToSource decompiler in this chapter and delve into its implementation in Chapter 6. I’ll round off the book with a number of case studies and a look at what the future may have in store for Java decompilers, obfuscators, and bytecode rearrangers.

The tone of the next two chapters will be as practical as possible, and I’ll try not to burden you with too much theory. It’s not that I couldn’t pad out the book with endless pages of compiler theory; it’s just that there are too many other good books on the subject. Compilers: Principles, Techniques and Tools, by Aho, Sethi, and Ullman (Pearson Higher Education, 1985), otherwise known as the Dragon book (so called because of its cover design), is just one of the better examples that quickly springs to mind. Appel’s Modern Compiler Implementation in Java (Cambridge University Press, 2002) is another highly recommended tome. I’m going more for the style of Crafting a Compiler with C, by Fischer and LeBlanc (Addison-Wesley, 1991). Having said that, if I think there are some theoretical considerations that you need to know about, then I’ll discuss them as necessary.

Introduction

As I mentioned earlier, writing a decompiler is pretty similar to writing a compiler or cross-compiler because both translate data from one format to another. The essential difference between a decompiler and a compiler is that they go in opposite directions. In a standard compiler, source code is converted to tokens and then it is parsed and analyzed to finally produce a binary executable.

As it happens, decompiling is very similar to compiling, only this time, the back end of the compiler is changing the intermediary symbols back into source code rather than into Assembler. Because of the binary format of a Java classfile, you can quickly transform the binary into bytecode and then treat the bytecode as just another language. It might help to think of your decompiler as a cross-compiler or source code translator that transforms bytecode to Java. And remember, with ClassToXML, you’ve already got a way of converting the binary format of a classfile into XML. You’ll be using this XML as the input to your decompiler.

An abundance of other source code translators translate between different languages—for example, you can translate from COBOL to C, or even from Java to Ada or C, which will give you plenty of places to look for ideas.

Note  In case you’re confused about the difference between opcodes and bytecodes, an opcode is a single instruction, such as iload, that may or may not be followed by a data value or operand. Opcodes and operands together are generally referred to as bytecodes.

Overall, the task of writing a decompiler is simpler than that of writing a compiler; there are several significant reasons for this.

Limited Number of Opcodes

You only need to understand a limited number of possible opcodes. You don’t need to worry too much about syntax errors because the Java bytecode language is machine generated. This makes it tightly defined and very predictable. Sure, bytecode rearrangers or optimizers may cause problems, but rearrangers are a very special case. In general, you won’t come across them because they have an awful habit of not being able to pass Java Verifiers or, worse still, of not working on some obscure JVM implementation.

No Registers

If James Gosling hadn’t turned the hardware clock back 20 years and decided to go for a stack-based architecture, it’s very unlikely that this book would ever have seen the light of day. Most hardware architectures use a minimum of three registers for typical computations, such as the simple assignment c = a + b, where the values a and b are stored in the first two registers and the result c is stored in a third register or used as part of a much larger computation.

But a Java Virtual Machine (JVM) is a stack-based machine: it pushes the two operands onto the stack and then adds them by popping the stack twice, performing the addition, and pushing the result back onto the stack. No doubt this design decision was made because a virtual stack machine is easy to write and implement on any number of different platforms, from phones to Sun Fire E15K servers. But this portability is also the JVM’s Achilles’ heel: it forced the JVM designers to adopt a lowest-common-denominator approach to cater for all these diverse architectures, which simplifies the design of the JVM tremendously.
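To make the contrast concrete, here is a minimal sketch of my own; the class and variable names are hypothetical and don’t come from any of this book’s listings.

public class AddExample {
    int add(int a, int b) {
        int c = a + b;   // no registers in sight at the bytecode level
        return c;
    }
}

Compiled with a typical javac, the body of add() usually comes out as something like iload_1, iload_2, iadd, istore_3, iload_3, ireturn: slot 0 of the local variable array holds this, slots 1 and 2 hold the two parameters, iadd pops the top two stack elements and pushes their sum, and the result is stored back into slot 3. A register machine would instead allocate a and b to two registers and write the sum into a third.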

Data and Instruction Separation

A further reason for the JVM’s simple design is Java’s security model. The separation of a stack machine’s data and instructions allows the JVM to quickly scan the bytecodes and verify that each classfile is not going to misbehave.

We know the format of the bytecode at all times on all platforms. In most modern compilers, the original source is translated into several intermediary formats. Typically, a series of transformations moves the code from a higher level, such as C or C++, down to a much lower level, such as assembly language. Normally these internal languages are rarely, if ever, published, but as luck would have it, the specification for Java bytecode is published. Thanks to Java’s portability and the JVM specification, for the most part, we know what source produces what combinations of opcodes and operands on any number of platforms.

Outside Java, any intermediate languages are much harder to decipher because their specifications are rarely published. Decompiler writers have to recognize, by a process of trial and error, that particular sequences of low-level operators represent some high-level functionality, with data and instructions scattered everywhere. Even the people behind decompiling early Visual Basic interpreted code had a harder time than their Java counterparts because, unlike Sun, Microsoft didn’t publish a specification for VB’s intermediate p-code language.

Only a few papers or books cover the whole area of decompilation. One of the best resources is Dr. Cristina Cifuentes’ thesis1 on Reverse Compilation Techniques, available from the Queensland University of Technology in Australia. Dr. Cifuentes takes on the much more arduous task of turning executables, rather than our partly compiled bytecode, back into the original higher-level language. In terms of difficulty, there really is no comparison between converting a classfile back into Java source and converting an arbitrary executable back into C code. The compiler’s output format isn’t defined anywhere, the data and instructions are not partitioned in any way, and different platforms and even different compilers on the same machine can produce completely different output. For example, a case statement compiled into binary by Visual C++ is a whole lot different from the output produced by a Watcom compiler. And as I suggested earlier, registers exponentially complicate what happens at run time.

JIT Optimization

In compiled, noninterpreted languages, the compiler does most of the hard work, such as optimization, during the compilation phase. Interpreted languages such as Java leave that work until the code is finally executed. The technologies for increasing Java’s speed—JIT compilation and inlining or HotSpot technology being the obvious examples—all focus on speeding up the interpreting phase. They completely ignore the initial compilation in an attempt to overcome the limitations of the stack-based virtual machine design. Interpreted languages are by and large much easier to decompile because you’re always looking at an intermediate format with all the required information still intact in its original form.

Fragile Super Class

The next reason why Java is particularly susceptible to decompilation comes from Java’s object-oriented (OO) nature and is called the fragile super class problem. Because of the nature of other OO languages such as C++, it is often necessary to recompile or relink an application when a class in another library is changed, whether the implementation of a given method changes or new methods are added. Java gets around this by adopting a dynamic-linking architecture: all classes are linked on demand, as and when they are needed during execution. Or, in English, all classes, including the core classes, are only linked at runtime.

For this to work, all the symbolic names and types need to be in the classfile. And for simplicity’s sake, Java’s original designers put all the methods and fields in the classfile as well as providing pointers to all externally referenced fields and methods in the classfile’s constant pool. This separates the data and instructions in any given classfile and gives us a complete map of the program.
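As a quick, hypothetical illustration of just how much of that map survives, consider a call into another class; the class below is mine and not one of the book’s listings.

public class Caller {
    String describe(java.util.Date d) {
        return d.toString();   // compiles to an invokevirtual with a constant pool index
    }
}

The invokevirtual opcode itself carries nothing but an index into the constant pool, but the Methodref entry at that index points to the class name java/util/Date, the method name toString, and its type descriptor, so a decompiler can recover the call almost verbatim.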

Simple Control Structures

Almost as important, from our point of view, is the simplicity of the control structure in Java. Program flow in Java bytecode is all based around conditional and unconditional goto statements, which is one of the reasons why many decompilers find it hard to tell the difference between a while loop and a for loop. But it also means that we don’t need to handle complex flows such as indirect calls. It’s also unlikely that this will change in the near future because the JVM’s design and security restrictions leave only very limited scope for any real optimization of the possible bytecode combinations, especially for any code that has to pass the Java Verifier.

So where do we go from here? I’ll begin with a recap of the overall problem domain. Then I’ll quickly review what tools are available to help you solve the problem. In addition, I’ll show you how to investigate some of the strategies that others before us have used to create a decompiler. Finally I’ll outline the design and conclude with some simple decompiler examples.

Defining the Problem

Perhaps now is a good time to recap what you already know about the classfile and the JVM before I get into the design of your decompiler.

Although the main area of focus in this chapter is the bytecode in the class methods where most of the decompilation will take place, you will still need to use much of the other information in the remainder of the classfile, such as the constant pool, to create good decompiled code.

If you quickly review the classfile, you can break it down into the following parts.

  • Magic number
  • Major and minor versions
  • Constant pool
  • Access flags
  • Interfaces
  • Class fields
  • Class methods
  • Class attributes

If you leave out the attributes for simplicity’s sake, you know that the constant pool follows the 0xCAFEBABE magic number and the major and minor version numbers of the classfile format. The constant pool contains all the symbolic information and constants used in the rest of the classfile. The symbolic information appears as short strings in a modified UTF-8 encoding, which detail method signatures as well as parameter and type information for each field and method.
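Those signature strings use the JVM’s compact descriptor notation. The method below is purely hypothetical and is shown only to make the encoding concrete.

public class DescriptorDemo {
    // Recorded in the constant pool roughly as the name "indexOf" plus the
    // descriptor "(Ljava/lang/String;I)I": an object parameter, an int
    // parameter, and an int return type.
    int indexOf(String s, int from) {
        return s.indexOf("x", from);
    }
}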

The constant pool is followed by the access flags, which tell you whether the classfile describes a class or an interface and whether it’s public, final, abstract, and so on; these are followed by pointers into the constant pool giving the name of the class itself and of its superclass.

The remaining three elements of the classfile—interfaces, class fields, and class methods—are all arrays. The interfaces array lists any interfaces that the class implements; class fields contain a series of pointers, or indices, into the constant pool giving the name and type information for each field; and finally, class methods can be either Java methods or native methods. Note that you cannot decompile native methods because they are essentially C or C++ dynamically linked libraries.

From Chapter 2, you know that the JVM is an abstract stack processor and that each method maintains its own operand stack, stack pointer, program counter, and array of local variables that act as general-purpose registers. Likewise, in Chapter 2, you also learned the format of the classfile and how to read it in from a file. Figure 5-1 shows a diagram of the JVM’s architecture.


Figure 5-1 Java Virtual Machine architecture

The execution frame uses both a stack and a series of local variables that act like general-purpose registers. Whereas lv_0 always holds the reference to this, the current object (at least for instance methods), lv_1, lv_2… lv_n hold the parameters and local variables created as the bytecode is interpreted. The stack is used as a temporary storage area, and the stack pointer and program counter enable the JVM to know which opcodes need to be executed and where to find their operands on the stack.

The JVM’s instruction set of 200-odd opcodes has the usual mathematical, logical, and stack operators as well as a number of high-level opcodes that perform dynamic memory allocation (new), function invocation (invokevirtual), array indexing (aaload), exception processing (athrow), type checking (checkcast), and monitors (monitorenter, monitorexit). tableswitch and lookupswitch allow for multiway branching, and the more standard conditional and unconditional jumps are handled by goto and instructions such as if_icmple (branch to a label if less than or equal).
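To see where the branch opcodes come from, take a simple counted loop. The class below is an illustration of mine rather than one of this chapter’s listings, and the bytecode summarized in the comments is only the typical javac output.

public class LoopDemo {
    int sum(int n) {
        int total = 0;
        for (int i = 0; i < n; i++) {   // usually a conditional branch such as
            total += i;                 // if_icmpge plus a goto back to the test
        }
        return total;
    }
}

Nothing in the compiled method says for: the loop is just a comparison, a conditional branch, and a goto, which is exactly why a decompiler has to work hard to tell a for loop from an equivalent while loop.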

What you now need is a bytecode grammar that you can use to match the opcodes and convert the data back into Java source. You also need to be able to mirror the control flow and any explicit transfers, such as goto or jsr (jump subroutine) statements, as well as being able to handle any corresponding labels.

From the JVM specification, bytecodes can be broken down into the following types.

  • Load and save instructions
  • Arithmetic instructions
  • Type conversion instructions
  • Object creation and manipulation
  • Operand stack management instructions
  • Control transfer instructions
  • Method invocation and return instructions
  • Handling exceptions
  • Implementing finally
  • Synchronization

Each and every opcode has a defined behavior that you use in the parser to re-create the original Java. Both Sun’s original JVM specification and the Java Virtual Machine (O’Reilly and Associates, 1997) are very good at describing opcodes in what can only be termed Technicolor detail. You can use this information in your decompiler’s grammar.

(De)Compiler Tools

All the constraints described in this chapter’s introduction enforce a type of computational grammar on your bytecode. The stack-based last in first out (LIFO) architecture is one of the simplest you can encounter. You can codify these actions in a parser so that you can fairly easily reconstruct the source code from these series of opcodes.

You need to make a number of choices before you write your decompiler. You could code the entire decompiler by hand, and several Java decompilers almost certainly take that approach, or you could look at a number of tools that would make your job a whole lot easier.

Tree-building tools such as Java Tree Builder (JTB)2 or JJTree3—which you won’t be using—generate syntax trees that can help you figure out where and why the code branches. You can use compiler-compiler tools such as Lex and Yacc to scan and parse the bytecode; many developers have used them for far more complex tasks. Lex and Yacc allow you to codify the grammar into a series of production rules that pattern match the bytecode and whose associated actions generate the source code fragments.

You’ll also get to look at as many other solutions as possible. I suspect that the high-level code transformations used by Jive, Mocha, and even SourceAgain are hard-coded transformations looking for specific constructs rather than taking a more general approach; it’s unlikely that any of these programs tries to interpret, rather than predict, all combinations of high-level structures.

I will focus on the simpler approach where the high-level structures are hard coded, because it fits in well with your parser methodology. But you will also look at some advanced strategies where the decompiler infers all complicated high-level structures from the series of goto statements in the bytecode.

Compiler-Compiler Tools

Although you will be looking at the different strategies that have been used to decompile Java, your core parser will be a Lex/Yacc-style parser written in Java using JLex and CUP, where the bytecode is recovered from the bottom up rather than from the top down. You don’t want to reinvent the wheel, and wherever possible, you’ll use existing tools like JLex and CUP to implement the functionality you need in your language translator. You’ll see that decompiling the basic code blocks is straightforward, with the exception of a few rather awkward bytecodes.

If you’ve never come across Lex and Yacc before, they operate on textual input files. The Lex input file defines how the byte stream is to be tokenized using pattern matching. The Yacc input file consists of a series of production rules for the tokens. These define the grammar, and the corresponding actions generate a user-defined output.

After running these input files through Lex and Yacc, the generated output in C or even Java becomes the source code for your decompiler engine. I have two principal reasons for using compiler-compiler tools: firstly, these tools dramatically reduce the number of lines of code, which makes it a whole lot easier for readers to understand the concepts; and secondly, they cut development time in half. JLex and CUP are available on any platform that can support a JVM.

On the negative side, once compiled, the generated code can be a lot slower than what you can achieve by handcrafting a compiler front end. However, making the code easy to follow is a prerequisite of this book. Nobody—especially yours truly—wants to read through reams of code to understand just what is happening, so I’ll be sticking to Lex and Yacc.

By a bizarre twist, using compiler-compiler tools such as JLex and CUP is a great way of hiding program logic and defeating decompilers because the rules and logic are at a much higher level. Even if someone decompiles the classfiles associated with ClassToSource, they would need to take another step to recover the original JLex and CUP files from the generated Java. Unfortunately this only has limited applications in the real world.

Now dozens of different compiler-compiler tools exist for different operating systems and target languages. Lex and Yacc are the most commonly known, and they come in Windows as well as the original Unix flavors. There are both free and commercial versions, which output Pascal, C, C++, and Java code. No doubt many other varieties exist that are too numerous to mention (see the comp.compilers FAQ4 for more information).

Apart from one or two notable exceptions, explanations of Lex and Yacc are usually too theoretical and lack the practical dimension of how to put it all together. A number of the steps are simple once they are explained, but more often than not they aren’t explained, which leads to a great deal of confusion.

Note  The first example of Lex and Yacc that I ever came across was a simple calculator. In fact, the second and almost every other example that I’ve subsequently found were also for calculators of varying degrees of difficulty. You might say that the calculator is the “Hello, World” of the compiler-compiler tool community.

A myriad of alternatives exist that are not based on Lex and Yacc. If you take Java as the target language and start with JLex and CUP as the Lex and Yacc variants, you’ll also come across ANother Tool for Language Recognition (ANTLR),5 the compiler-compiler tool formerly known as PCCTS. And, of course, you’ll come across Jack (Get it? It rhymes with Yacc.), which became JavaCC.6

Lex

Lex uses regular expressions to break up the input stream into tokens, and Yacc tries to take these tokens and match them to a number of production rules using a shift/reduce mechanism. Most production rules are associated with an action, and it’s these context-sensitive actions that output, in this case, Java source code.

Tokens are also known as terminals, and production rules are identified by a single non-terminal. Each non-terminal is made up of a series of terminals and other non-terminals. A common analogy is to think of the terminals (tokens) as the leaves of the parse tree and the non-terminals as the branches.

Yacc

CUP, like Yacc, generates a standard, bottom-up LALR(1) parser (LALR stands for lookahead, left-to-right, rightmost derivation). By bottom-up, I mean that the parse tree is constructed from the leaves, whereas a top-down parser tries to construct the tree from the root. LALR(1) means that the parser processes the tokens supplied by the scanner from left to right (the L), traces the rightmost derivation in reverse (the R), and can look ahead one token (the LA and the 1). An LALR parser is the result of merging those LR parser states whose items are identical except for their lookahead sets; as a result, LALR(1) parsers handle very similar grammars to LR(1) parsers, but their parse tables are typically much smaller.

LALR(1) parser generators are the de facto standard in the rest of the computing world. However, parser generators written in Java are more likely to fall into the LL(k) category. LL(k) parsers are top-down parsers: they scan from left to right (the first L), use the leftmost derivation (the second L)—which is where the top-down construction comes from—and can look ahead k tokens.

Although LL(k) parsers are allegedly easier to write and the user-definable number of lookahead tokens is a major attraction, all my experience is with LALR(1) parsers, so personally, I feel more at home with LALR(1) parsers such as CUP. Many of the standard compiler construction tomes also heavily feature Lex and Yacc rather than any other LL(k) alternatives, so there are plenty of examples to get you going. Not surprisingly then, I’m going to use JLex and CUP in ClassToSource.

Hopefully, some of you will already be familiar with Lex and Yacc, which will help cut down the learning curve. Feel free to play around with the other parsers—it comes down to personal preference and what you feel most comfortable with.7

JLex

Elliot Berk originally developed JLex at Princeton University, and it is now maintained by Andrew Appel, also at Princeton, the author of Modern Compiler Implementation in Java/ML/C (Cambridge University Press, 1997). Like all versions of Lex, JLex allows you to use regular expressions to break up the input stream and turn it into tokens. You’ll use it in conjunction with CUP to define your decompiler bytecode grammar, but first I’ll show you how to use JLex on its own as a simple scanner (see Listing 5-1 in the “Regular Expression Rules” section).

Lex, whether it’s running on Unix or DOS, generating C or Java, is a preprocessor that transforms the specification, or rules, into the target language. A C-language specification becomes lex.yy.c, and a Java specification becomes filename.lex.java after it’s run through the Lex program. You then need to compile this output like any other C or Java program. Lex is normally used in conjunction with Yacc, but you can also use it on its own for simple tasks such as removing comments from source code. However, if you need to attach any logic to the program, then you’ll almost certainly need to hook it up to some sort of parser, such as Yacc or, in your case, CUP.

During the introduction to this section, I mentioned how Lex and Yacc have been used for many years by compiler developers in the Unix community. If you’re more used to Lex, then you’ll find that JLex does differ in a number of ways. A JLex file is split into three sections:

  • User code
  • JLex directives
  • Regular expression rules

Although the structure (shown in Listing 5-1) is different from the Unix version of Lex—typically compiled using C instead of Java—and the definitions and macros are quite different too, thankfully, the regular expression rules use standard regular expressions. So if you’re already familiar with Lex, or even vi or Perl, then it won’t seem like you’ve strayed too far from familiar ground.

User Code

User code is everything that precedes the first %%. It is copied “as-is” into the generated Java file. Typically, this is a series of import statements. And, as you’re going to use JLex in conjunction with CUP, your user code will consist of the following:

import java_cup.runtime.Symbol;

JLex Directives

The directives section is next, beginning after the first %% and ending with another %%. These directives, or flags, tell JLex how to behave. For example, if you use the %notunix operating-system compatibility directive, then JLex will expect a new line to be represented by \r\n rather than \n as it is in the Unix world. The remaining directives, listed here, allow you to enter your own code into various parts of the generated file or change the default name of the generated Lex class, function, or type (e.g., from yylex to scanner).

  • Internal code
  • Init class code
  • End of File class
  • Macro definitions
  • State declarations
  • Character counting
  • Line counting
  • Java CUP compatibility
  • Component titles directive
  • Default Token Type directive
  • End of File directive
  • Operating system compatibility
  • Character sets
  • Format to and from file
  • Exceptions code
  • End-of-File return value
  • Interface to implement
  • Making the Generated class public

You are only interested in a few of the directives, such as the %cup or CUP compatibility directive. For your purposes, the directives section will be something as simple as the following:

%%

%cup

digit = [0-9]
whitespace = [ \t\r\n]
%%

Regular Expression Rules

The regular expressions section is where the real scanning takes place. The rules are a collection of regular expressions that break up the incoming stream into tokens so the parser can do its job. If you’re familiar with regular expressions in Perl or vi, then you really shouldn’t have any problems. If you haven’t come across regular expressions before, then the JLex manual8 is a great place to start.

Let’s take a simple example to put this all together. Listing 5-1 adds line numbers to any file input from the command line.

Install JLex by obtaining a copy of Main.java from the URL provided. Copy it into a directory called JLex and compile it using your favorite Java compiler. Save the Num.lex file (see Listing 5-1), and compile it as follows:

java JLex.Main Num.lex
mv Num.yylex.java Num.java
javac Num.java

Now you can add line numbers to your file by typing the following:

java Num < Num.java > Num_withlineno.java

Normally, in a scanner/parser combination, the scanner’s output acts as the parser’s input. In this first example, you didn’t even generate any tokens, so you have nothing to pass to CUP, your Java parser. You’ll see later that JLex and CUP can interoperate with some small modifications. Lex generates a yylex() function that eats the input and passes tokens on to yyparse(), which is generated by Yacc. You’ll rename these functions, or methods, to scanner() and parse(), but the idea is the same.
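If you want to see in plain Java roughly what the generated scanner does for you, the little class below hand-tokenizes a single javap-style line. It is only a sketch of the idea; the real work is done by the JLex specification, and the token names here simply mirror the ones used later in this chapter.

import java.util.ArrayList;
import java.util.List;

public class ToyScanner {
    // Turn one javap-style line into the kind of token stream the parser expects.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        for (String word : line.trim().split("\\s+")) {
            if (word.matches("\\d+")) {
                tokens.add("NUMBER");            // offsets and operands
            } else if (word.matches("#\\d+")) {
                tokens.add("NUMBER");            // constant pool index, also just a number
            } else if (word.matches("[a-z_]+")) {
                tokens.add(word.toUpperCase());  // an opcode keyword such as LDC
            } else {
                tokens.add("LITERAL");           // anything else, e.g. <String ...>
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("3 ldc #1"));   // prints [NUMBER, LDC, NUMBER]
    }
}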

I doubt if many commercial compilers are built around Lex and Yacc because they have limited functionality and cannot deal with quirky aspects of some programming languages. Fortran, for example, is a nightmare to tokenize, because it is completely oblivious to whitespace. But for your purposes, Lex and Yacc are excellent utilities, because JVM bytecode is so tightly defined and because they offer a simple and neat way to see how bytecode fits together.

CUP

Yacc, as you may or may not know, stands for “Yet another compiler-compiler” and is essentially a parser generator. Don’t worry if up until now you thought Yacc could just as easily have stood for “Yamanashi area communication community” because I’ll try to explain in the simplest possible terms exactly what I’m talking about and why it is such a useful tool.

Simply put, Yacc allows you to define grammar rules for parsing incoming lexical tokens, and hopefully, it produces the desired output as defined by your grammar. CUP9—also known as JavaCUP—is a public domain Java variant of Yacc, which, because of Java’s portability, will compile on any machine that has a JVM and JDK. Like Yacc, CUP is almost certainly a contrived acronym as it comes from mildly cumbersome Construction of Useful Parsers.

Stephen Johnson at the AT&T Bell Laboratories in New Jersey wrote the original version of Yacc. Lex and Yacc as well as sed and awk were almost always included in every Unix implementation since the early days of Berkeley in the 1980s. sed and awk would typically be used for simple command-line parsing tools with Lex and Yacc being reserved for more complicated parsers. Unix system administrators and developers typically use some or all of these tools from time to time in an effort to transform or translate an input file into some other format. These days, Perl has largely taken over from all such utilities, with Lex and Yacc being reserved for only the most difficult of tasks, if at all.

Yacc as well as Lex have been copied many times and are available on many platforms. Commercial and public domain variants of Lex and Yacc are available on Windows and DOS, for example, from MKS and from GNU (Flex/Bison). Be warned, however, CUP does not have exactly the same functionality or format as a Yacc grammar written for any C compiler, but it does behave in a somewhat similar fashion. CUP can be compiled on any operating system that supports JDK.

CUP, being a Yacc-style parser generator, produces an LALR(1) parser and is just one of a number of Yacc-like parser generators written for the Java language; Byacc and Jell are two other examples. If you’re happier with an LL parser and don’t want to use an LALR grammar, then you might want to look at ANTLR or JavaCC from Sun. But for our purposes, I’m going to focus primarily on CUP.

To install CUP, copy the source files from the URL provided. An installation script (INSTALL) is provided for Unix along with a makefile for Windows 95/NT (nmake10). However, it is very straightforward to compile by typing the following in the CUP root directory:

javac java_cup/*.java java_cup/runtime/*.java

CUP files are made up of the following four sections:

  • Preamble or declarations sections
  • User routines
  • List of symbols or tokens
  • Grammar rules

Declarations

The declarations section consists of a series of Java package and import statements that vary depending on what other packages or classes you want to import. Assuming that the CUP classes are in your classpath, add the following line of code to include the CUP runtime classes:

import java_cup.runtime.*;

All other imports or package references are optional. A start declaration tells the parser which production rule to begin with if you want it to start with something other than the default. The default is to use the first production rule in the grammar, so in most grammars you’ll find that the start declaration is redundant.

User Routines

Four possible user routines are allowed in CUP (see Listing 5-2): action and parser, which are used to insert new code into the generated classes and to override default scanner code and variables; init, for any parser initialization; and finally scan, which the parser uses to fetch the next token. All four user routines are optional.

Both init and scan are commonly used even if it is only to use scan to change the name of the scanner/lexer to something more meaningful than yylex(). Any routines within init are executed before the first token is requested.

Most parsers will have a series of actions defined in the grammar section. CUP puts all these actions in a single class file. The action user routine allows you to define variables and add extra code, for example, symbol table manipulation routines that can be referenced in the nonpublic action class. parser routines are for adding extra code into the generated parser class—don’t expect to use this very often, if at all, except maybe for better error handling.

Symbols

CUP, like the original Yacc, acts like a stack machine. Every time a token is read it is converted into a symbol and placed or shifted onto the stack. These tokens are defined in the “Symbols” section of the parser.

Symbols can be either terminal or non-terminal. Unreduced symbols are called terminal symbols and symbols that have been reduced into some sort of rule or expression are called non-terminal symbols. Or to put it another way, terminal symbols are the symbols/tokens used by the JLex Scanner, and non-terminal symbols are what the terminals become after they satisfy one of the patterns or rules in the “Grammar” section of the parser. Listing 5-3 shows a good example of both terminal and non-terminal tokens.

Symbols can have associated Integer, Float, or String values, which are propagated up the grammar rules until the group of tokens either satisfy a rule and can be reduced or crash the parser if no rule is ever satisfied.

A parser’s functionality depends on its ability to interpret a stream of input tokens and how it turns these tokens into the desired output as defined by the language’s grammar. In order for the parser to have any chance of success, it needs to know every symbol along with its type before any shift/reduce cycles can take place. CUP generates a symbol table from the list of terminals and non-terminals and outputs it as a sym.java file, which needs to be imported into the JLex scanner for JLex and CUP to work together.
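There is nothing mysterious about the generated symbol table; it is essentially a class of integer constants, one per symbol. The fragment below is a hand-written approximation of the sort of file CUP produces; the actual names and values depend entirely on your grammar.

public class sym {
    // One integer constant per symbol declared in the grammar; EOF and
    // error are always present.
    public static final int EOF = 0;
    public static final int error = 1;
    public static final int NUMBER = 2;
    public static final int LDC = 3;
    public static final int LABR = 4;
    public static final int RABR = 5;
    public static final int BRSTRING = 6;
}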

Grammar

CUP is an LALR(1) machine, meaning that it can look ahead one token or symbol to try and satisfy a grammar rule. If a production rule is satisfied, then the symbols are popped off the stack and reduced with the production rule. The aim of every parser is to convert these input symbols into a series of reduced symbols right back to the start symbol or token.

In layman’s terms, given a string of tokens and a number of rules, the goal is to trace the rightmost derivation in reverse by starting with the input string and working back to the start symbol. You reduce your series of terminals to non-terminals using bottom-up parsing. All input tokens are terminal symbols that are subsequently combined into non-terminals, or into intermediate non-terminals, using this shift/reduce principle. As each group of symbols is matched to a production rule, it ultimately kicks off an action that generates some sort of output defined in that production rule’s action. You’ll see how this works in a simple example at the end of this chapter in Listing 5-15.

You’ll see that under certain circumstances, it’s possible for input tokens or intermediate symbols to satisfy multiple production rules; this is what is known as an ambiguous grammar. The precedence keyword mentioned in the “Symbols” section allows the parser to decide which symbol takes the higher precedence. For example, the symbols for multiplication and division might take precedence over the addition or subtraction symbols.

It’s worth mentioning that CUP will allow you to dump the shift/reduction table for debugging purposes. Listing 5-4 shows the command that produces a human-readable dump of the symbols and grammar, the parse state machine, the parse tables, and the complete transitions. Some of the output is also shown.

Now that you’ve seen what tools you’re going to use, you should begin to think about what steps you need to take to create your decompiler. First, you have to find some way to break up the classfile into an appropriate syntax so that it can be tokenized, similar to javap output. ClassToXML will suffice for your purposes. Next you need to write the JLex Scanner to tokenize the output from ClassToXML and finally you’ll need to define the CUP grammar and associated actions to convert these tokens back into source.

Strategy

A significant part of the problem with building a decompiler is making it general enough to deal with arbitrary cases. When Mocha comes across an unexpected language idiom, it either aborts or shoots out illegal gotos. Ideally, you should be able to code a general solution decompiler rather than one that is little more than a series of standard routines and an awful lot of exception cases. You don’t want ClassToSource to fail on any construct, so a general solution is very attractive.

Before you take that approach, though, you need to know whether it has any disadvantages and whether you will gain a more general solution at the expense of outputting illegible code that looks nothing like the original source. Or worse still, whether it will take an inordinate amount of time to get there. You could replace all the control structures in a program with a program counter and a single while loop, but that would destroy the mapping and cause you to lose structural or syntactical equivalence, which is definitely not your goal even if it is a general solution.

From our discussions, you know that, unlike with languages such as C++ that compile straight to native code, you don’t have the headache of separating data from instructions, because all of your data is in the constant pool. In the remainder of this chapter and in the next, you’ll also see that recovering source-level expressions is relatively easy. So it seems that your main problem, and any corresponding strategy you use, is going to mainly concern handling the bytecode’s control flow.

After the previous “CUP” section, you may be asking why you need a strategy at all: why can’t you just build a grammar in Lex and Yacc and see what comes out the other side? Well, unfortunately, the parser can recognize only sequential instruction sequences, so you might not be able to parse all JVM instructions in a single bottom-up pass, because bytecode has this awful habit of branching. The stack is used as a temporary storage area, and you need to be able to control what happens to a partially recognized sequence when the code branches. On their own, Lex and Yacc just don’t offer that level of functionality, so you need to figure out what approach to take to store these partially recognized sequences.

This section looks at a couple of different strategies you can try to help overcome this problem of synthesizing high-level control constructs from goto-like primitives. As I said, the ideal general solution would be where you could decompile every possible if, then, else, or for loop combination without needing any exception cases and while still keeping the source as close as possible to the original. The alternative is to attempt to anticipate all high-level control idioms.

The first choice is to use the techniques based on Cristina Cifuentes’ work as described in the paper “A Methodology for Decompilation.”11 This describes dcc, Cifuentes’ decompiler for C programs on Intel boxes. Although dcc recovers C and not Java code, a great deal of the discussion and design of Cifuentes’ Universal Decompiler is directly applicable to the task at hand.

The second choice is the more general approach—where you transform goto statements into equivalent forms. It would be so much simpler if you could just fire off a Lex/Yacc scanner and parser at the classfile and decompile the code in a single pass or, at the very least, dispense with any control flow analysis. Well that’s what Todd Proebsting and Scott Watterson attempt to do in their “Krakatoa: Decompilation in Java” paper.12 Krakatoa, an early decompiler that is now part of the Sumatra/Toba project, uses Ramshaw’s algorithm13 to transform gotos into loops and multilevel breaks. It claims to offer a neat one-pass solution while still keeping the original structure. The Krakatoa approach is tempting, because it is less likely to fail due to any control flow analysis problems.

The third choice comes from Daniel Ford of IBM Research and was part of possibly the very first decompiler, Jive, which I believe never even made it out of IBM Research. Daniel, in his paper “Jive: A Java Decompiler”—which unfortunately is no longer available—puts forward a truly multipass decompiler that “integrates its parse state information with the sequence of machine instructions it is parsing.” Jive decompiles by reducing tokens as more and more information becomes available with each successive pass.

Your fourth choice would be to take the truly simple approach. You’d look at a single pass using Lex/Yacc in which you’d use a lot of buffers to maintain the state of each reduction before a branch, and then you’d continue where you left off after the function returns control back to where it was before the jump. Or you could go one step further and use peephole optimization as a second pass parser. Peephole optimization is where you hard code combinations of instructions and simply hope you don’t come across any new idioms.

Perhaps you could go for something more esoteric, such as some sort of pattern-matching AI parser. It’s conceivable that a neural network could be trained to output source after being trained on numerous applets and applications. Bytecode strings are passed in as tokens at one end of the neural network, and a Java representation of the code comes out at the other end. You then train the neural network to turn bytecode phrases into Java phrases, much as the more conventional approaches do. Additional code would then take this output and recover the complete structure. The neural network, like your previous choices, would still be matching patterns because, after all, no matter what tools you use, you are always going to be using pattern matching to some degree. Even if this approach does seem a bit far-fetched, it would have one advantage over the other techniques: because of the fuzzy nature of the solution, it would be better at coping with new combinations of bytecodes, which might make it possible to handle bytecode rearranging techniques.

Universal Decompiler

First a word of warning from Cristina Cifuentes in “A Methodology for Decompilation.” “A naive approach to decompilation attempts to enumerate all valid phrases of an arbitrary attribute grammar and then to perform a reverse match of these phrases to their original source code.” Or to put it another way, if you were thinking that you could simply use Lex and Yacc to recover code, then you better think again.

I have already mentioned Cristina Cifuentes’ decompiler dcc several times in this book. Essentially dcc is an incomplete decompiler written at the Queensland University of Technology during 1993 and 1994. dcc was different from other earlier decompilers because it recovered source code from binaries rather than from partially compiled objects.

Although it seems that Cifuentes and her colleagues hit a number of roadblocks, they did introduce some interesting concepts that you can take advantage of, especially because you’re dealing with bytecode that is more analogous to objects than Intel binaries.

dcc consists of three modules: the front-end loader/parser, the universal decompiling machine (UDM), and the back-end code generator. The front and rear ends are platform dependent, whereas the UDM is machine and language independent. ClassToSource’s front end will always be reading in the same stream of Java bytecodes, but there is no reason why the back end couldn’t output any language such as Ada, which can already be compiled to bytecode. And, of course, ClassToXML could always be rewritten to handle other input streams such as Visual Basic 3 and 4 or even VB.NET objects.

Front End

The front end simulates an operating system loader, parses the incoming stream, and produces a higher level intermediate code. dcc’s front-end parser starts at the entry point determined by the virtual loader and follows every path sequentially, creating a new node at every jump until the final instruction has been mapped. Because of the lack of separation between data and instructions, dcc’s loader needs to be a much more complicated module than you need.

The intermediate code generator works closely with the parser, giving each instruction an intermediate code representation as well as associating any used or defined registers. An optimization phase eliminates any redundant instructions by finding obvious combinations or language idioms. This is one of the reasons why the front end cannot be used for more than one language—different languages have different idioms. After this optimization phase, the condensed tokenized version of the input is passed onto the UDM.

Universal Decompiler Machine (UDM)

The UDM performs the flow analysis of the program, constructing a control flow graph (CFG), not to be confused with a context-free grammar, and splitting the program into a series of basic blocks.

Cifuentes defines a basic block as being “a sequence of instructions that has a single point of entry and single point of exit.” The UDM builds a linked list, or tree, of these basic blocks for every branch, whether it’s a loop, a procedure call, or an end of program.

This is followed by a control flow analysis phase, which reduces the CFG into likely control structures, if-then-else, while, and for loops as well as case statements. Irreducible graphs are also converted into reducible versions at this stage, or in extreme circumstances, labeled and output as Assembler.

A structuring algorithm interprets this information and outputs a high-level version of the code, and a data flow analysis module analyzes the defined and used data information to understand the type and scope of the variables.

Back-End Processing

The back end takes the high-level language and maps it onto the target language. Not surprisingly in the case of dcc, the target language is C. Global variables are declared, and then code is generated on a procedure-by-procedure basis from the blocks defined in the UDM, with local variables being declared as necessary.

Ramshaw’s Algorithm

The full title of Lyle Ramshaw’s paper is “Eliminating go to’s While Preserving Program Structure.”14 The paper originally came out of some work that Ramshaw was doing during the early 1980s while he was trying to convert Donald E. Knuth’s TeX source from Pascal to a programming language called Mesa. You’re looking at the algorithm because it can be, and has been, used to convert goto-laden bytecode into goto-free Java source code.

Ramshaw’s idea is straightforward: replace all goto statements and their corresponding labels with exit statements and appropriately labeled repeat-endloop pairs. Listing 5-5 shows a pseudocode example of putting the theory into practice.

Control flow in Java bytecode is largely dependent on the simple goto statement. You can use Ramshaw’s ideas to replace the goto bytecode statements with the equivalent for(;;), do-while loops, and break commands.
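To get a feel for what that rewriting looks like at the source level, here is a small hand-written sketch of mine (it does not correspond to Listing 5-5): a bytecode goto that jumps straight out of two nested loops becomes a labeled break on a labeled loop.

public class GotoFree {
    // The labeled break plays the part of Ramshaw's exit statement, and the
    // labeled loop plays the part of the repeat-endloop pair.
    static int firstNegative(int[][] rows) {
        int found = -1;
        search:
        for (int r = 0; r < rows.length; r++) {
            for (int c = 0; c < rows[r].length; c++) {
                if (rows[r][c] < 0) {
                    found = rows[r][c];
                    break search;   // replaces a goto past the end of both loops
                }
            }
        }
        return found;
    }
}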

Ramshaw went to a great deal of trouble to keep the Mesa and Pascal code structurally equivalent, which is exactly what you are trying to accomplish. The problem, however, is that it is the bytecode that remains structurally equivalent, not the original source: Ramshaw’s algorithm changes the order of the code and makes it more difficult for you to recover the original source.

For this reason, Ramshaw’s work is not going to be of much use in your quest for a good general decompiler. It may be very useful to an optimizing compiler, where the source is not as important and finding a good general solution becomes the main criterion. However, there may be a single case where the algorithm is very useful.

Normally in bytecode you don’t have to worry about gotos with crossing scope, but you may come across some obfuscators that can introduce this sort of problem. Ramshaw deals with several categories of this irreducible graph problem in his paper.

Although a Pascal to Mesa translator already existed, Ramshaw found that he needed to extend the translator to compensate for the fact that Mesa had only a limited version of the goto statement.

Listing 5-6 shows a classic example of how not to write goto statements. The scopes of the gotos cross and create an irreducible graph that cannot be recovered unless the code is rewritten. Ramshaw describes a mechanism called stretching loops, where he uses goto graphs to get around the simplest irreducible flow problems.

Mesa’s author was influenced by Knuth’s article “Structured Programming with go to Statements”15 in which Knuth recommended not using goto statements that pass control into the center of a loop. So it is mildly ironic that almost a decade later, Ramshaw found himself converting those very goto statements that Knuth and many others so vehemently opposed, in code written by none other than Donald E. Knuth. Having said that, Ramshaw does say that there were few or no occurrences of the offending goto statements and that they were quickly removed from later versions of TeX.

Jive

Jive was probably the first Java decompiler. It was written by Daniel Ford, a manager in the Web Technologies department at the IBM Almaden Research Center, in San Jose, California. The original paper is not available online, but you might be lucky if you request a paper copy by going to http://www.research.ibm.com/.

In this approach to decompilation, you use a multipass scanner. After each successive pass, the bytecode is gradually transformed into something that resembles the original Java source code. The bytecode is converted into statements and idioms to make it easier to recognize and recover the building blocks of the original Java code. Listing 5-7 demonstrates this.

The main method in Listing 5-7 has the bytecode in Listing 5-8.

This is then converted into Java in a series of steps; each step adds more and more information until the entire class file can be assembled into source. You then create a series of tokens in your parser that would produce the output shown in Listing 5-9.

Although you didn’t have any real reductions in the first pass, it soon becomes apparent where you’re going at the end of the second pass (see Listing 5-10), because the number of lines is halved.

The tokens generated in the first pass are used to create the building blocks in the second pass. Note the introduction of the 1-> tag, which tells the next pass where the if branch ends. In the next and final pass (see Listing 5-11), the code can now be easily resolved into something resembling your original Java code in Listing 5-7.

Jive does decompile more complicated structures than your if statement, using the same approach of breaking the bytecode into statements, but it uses a lot more passes to do so. To date, I have not been able to track down the original Jive decompiler.

Single Pass Parser

Stack machines, virtual or otherwise, directly mimic the operation of a parser. The parser’s constant shifts and reductions are analogous to a stack machine’s pop and push actions. This leads us nicely to the fourth and final strategy.

You know it’s possible to create a scanner that can handle every possible bytecode instruction. However, you may not have realized that it is also possible to code the parser to cover every eventuality it could possibly come across. The parser grammar would not need to handle nonreducible bytecode, which is such a rare occurrence that you can safely ignore it, but it would still need to be at least as good as Mocha.

The main problem here is not how the parser handles simple lines of code such as initializations or assignments. The real difficulty would be how a single-pass parser could handle control flow. It would, by default, not be a general solution and you will have to throw a test suite of Java classfiles at it to make sure it can handle most eventualities.

Strategy Conclusion

You’ve briefly looked at four strategies for recovering source code from the bytecodes in a classfile. The first option was to take Dr. Cifuentes’ design and reuse the existing code, as described in her thesis. Although dcc is primarily used to recover C source from DOS executables, Dr. Cifuentes’ thesis describes a more general approach of using it to decompile any binary to a higher-level language, so it would be conceivable to use her UDM for your purposes. It might be tempting to build such a Babel Fish decompiler for all computer languages; however, because your task is so much simpler than other decompilation problems, doing so would really be over-engineering the solution.

Alternatively, you can take Lyle Ramshaw’s approach and rewrite the goto statements into a series of repeat- and endloop-like conditions. Or you could take the Jive approach and code the decompiler in a series of passes/decompilation phases.

The final option is to use JLex and CUP to create a single-pass decompiler. I deliberately didn’t go into this in too much detail, because this option—the simplest option—is the one that you’re going to take. In the “Parser Design” section, I’ll explain how to use the single-pass solution to decompile a simple Hello World example. In Chapter 6, you’re going to see how you can extend the CUP grammar to handle a much more complicated suite of examples.

Parser Design

Human and computer languages share a lot of common elements, namely keywords and a grammar or a sequence of rules. Different languages use completely different sets of keywords and rules and sometimes even a different alphabet. Subconsciously your mind processes these keywords or tokens and applies them to your version of the English language. My version of the rules of the English language will be slightly different from your version, unless you grew up in the same area of Dublin, Ireland. Somehow, I doubt if anyone will ever read this in any other language than English (oh, the humility). But if you are reading this in French or German then someone must have translated the original syntax into yet another set of keywords and grammar rules, namely your language or native tongue.

The main difference between understanding a conversation and a compiler generating machine code is that the compiler requires a lot fewer keywords and rules to produce output that the computer can understand—what is called its native format. If compiling a computer program is a smaller subset of understanding a human language, then decompiling Java is a smaller subset yet again. Let’s take a look at Listing 5-11 and Listing 5-12 to demonstrate.

The number of keywords and the limited number of grammar rules allow you to easily tokenize the input and subsequently parse the tokens into Java phrases. Turning this back into code requires some further analysis, but I’ll get to that a little later. What you need to do is turn the input data stream into tokens, as shown in Listing 5-12.

Ultimately, you’ll be using the XML output from ClassToXML as the input file for your parser. But for the moment, you’re going to use the output of javap to get started. Looking at the main method in Listing 5-12, which javap generated from Listing 5-11, you can see that the tokens can be split into the following types.

  • identifiers
  • integers
  • keywords
  • whitespace

Bytecode identifiers are typically constant pool references that take the following form:

1 invokespecial #3

Integers are usually the data or numbers that follow the opcode to make up a complete bytecode statement. Good examples are numbers that get placed on the stack or labels for a goto statement.

Keywords are the 200 or so opcodes that make up the bytecode language—you’ll be looking at these and their constructs in the next chapter.

Finally we need to account for whitespace, which of course includes tabs, blank spaces, new lines, and carriage returns if you’re using DOS or Windows. Most decompilers would not be encountering a lot of whitespace in a real classfile, but you need to deal with it in your javap and ClassToXML output.

All these tokens are crucial in order for the parser to be able to do its job and to try to match them with the predefined grammatical rules. You could write your own tokenizer and parser by hand, which would search for the different tokens within the bytecode; however, there are plenty of tools to help you along, so why not take advantage of them? A handcrafted parser would be quicker,16 but you’re looking for ease of use rather than an optimized production quality application.

JLex is the first part of your decompilation process. JLex reads, or scans, input from a data stream and turns it into valid tokens that are passed on to the CUP parser, which interprets them. The generated yylex scanner reads each incoming character of the input bytecode, using regular expressions to break the input into strings and tokens. The scanner’s primary function is to keep track of the current position in the input stream and to generate the tokens for the parser. Listing 5-13 shows some sample bytecode input using the classfile compiled from Listing 5-11.

So without much further ado, let’s create a scanner to tokenize the main method in Listing 5-13. First, you’ll isolate the javap bytecode so that you’re only dealing with the information in Listing 5-14.

As yet, this is only a very simple lexical analyzer because it is only a partial specification of the possible opcodes, but it should point you in the right direction.

Gradually, as each character is read, it becomes obvious which regular expression is going to be satisfied and what token will be generated. This assumes that your specification is complete, which at this stage, it most certainly is not, but it will work for this simple example. Tokens are defined by regular expressions in the scanner and the set of these lexical tokens defines the elements of the bytecode language. If there is a syntax error in the input stream or, in other words, if the input isn’t syntactically correct, then the scanner will jam. You should also make sure that your regular expressions are not ambiguous, because only the first match will be chosen when two conditions are satisfied, and as a result, the wrong token may be generated.

JLex, and indeed all Lex programs, convert these regular expressions (regexps) into two types of finite automata. First, the regexps are converted to a nondeterministic finite automaton (NFA), in which it is still possible to match a given input in more than one way. The aim is ultimately to turn the regexps into a deterministic finite automaton (DFA), in which every transition from one state to the next is uniquely defined, by, amongst other things, removing all redundant transitions. Or, in other words, each token can be generated in only one particular way. A DFA is essentially the set of regular expressions mapped onto a large sparse matrix. It has the disadvantage of being slower to generate, but the significant advantage of being much quicker to run.

Ultimately what we are trying to do is create a parse tree using CUP. JLex will generate tokens for the longest sequence of characters from the starting position to the current position that matches a specific regular expression. The parser requests a token from the scanner and then tries to construct a parse tree using a series of shift and reduce operations. Let’s take a look at what your first attempt at the CUP parser looks like, in Listing 5-15. You won’t define any actions just yet; you just want to make sure it parses.

You can now compile decompiler.lex and decompiler.cup as follows:

java JLex.Main decompiler.lex
mv decompiler.lex.java Yylex.java
java java_cup.Main decompiler.cup
javac -d . parser.java sym.java Yylex.java

It’s nothing to write home about just yet, but it does scan and parse each line from the main method.

Take the second line:

3 ldc #1 <String “Hello World”>

This scanner breaks the line into tokens, where LABR is a left angle bracket:

NUMBER LDC NUMBER LABR BRSTRING BRSTRING RABR

This is then reduced into the non-terminals as follows:

LABR BRSTRING BRSTRING RABR -> vstack
NUMBER -> number
NUMBER -> number
number LDC number vstack -> expr_part

You need all three lines if you’re going to reduce all the way back to expr_list. It might also help if you add the actions to see how you get back to the original code. To do this, you’ll first need to add some action code so that you can have a variable stack and a type stack for some temporary storage space while you’re decompiling the code.

The complete parser for your very simple example is in Listing 5-16.

If you run the code now, the information that you need to extract out of the bytecode is stored on the vStack and you pop off the information when the complete expression has been recovered. If you now compile and run the decompiler, you’ll get the following output:

System.out.println(“Hello, World”);
return;

If you’re paying attention, you’ll notice that System wasn’t exactly derived from the bytecode. However, because this snippet of bytecode doesn’t have any access to the constant pool, you have to add this little kludge to get the example to work.

In Chapter 6, you’ll be pulling whatever information you need from wherever you need it in the classfile. You first need to preprocess the classfile by converting it to XML using ClassToXML. Next you’ll load the XML and decompile each of the individual methods, resolving the constant pool references as you go.

Conclusion

So far I’ve talked about the tools you can use to create a small working decompiler. You’ve looked at the different strategies that you might or might not employ, and finally, I created a toy example to show you how your single-pass decompiler will function.

By the end of Chapter 6, you’ll have a working decompiler that will be able to handle the majority of Java classes. Chapter 6 will also look at the different internal structures and gradually create a more effective decompiler that can handle a lot more than the “Hello World” example.

____________________

1http://www.itee.uq.edu.au/~cristina/dcc.html#thesis

2http://www.cs.purdue.edu/jtb/

3http://www.j-paine.org/jjtree.html

4http://compilers.iecc.com/faq.txt

5http://www.antlr.org/

6https://javacc.dev.java.net/

7See http://dinosaur.compilertools.net/ for more information and links to some excellent resources.

8http://www.cs.princeton.edu/~appel/modern/java/JLex/

9http://www.cs.princeton.edu/~appel/modern/java/CUP/

10Available with Visual C++

11www.itee.uq.edu.au/~cristina/clei1.ps

12www.usenix.org/publications/library/proceedings/coots97/full_papers/proebsting2/proebsting2.pdf

13Lyle Ramshaw’s paper “Eliminating go to’s While Preserving Program Structure” was written at Digital’s System Research Center in Palo Alto in 1985.

14http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/SRC-004.pdf

15D. E. Knuth. “Structured programming with go to statements.” Computing Surveys, pages 261–302, Dec 1974.

16This may not be 100 percent true because the performance section in the JLex manual mentions a lexical analyzer or scanner that soundly outperformed a handwritten scanner.
