Interpreting the Dalvik bytecode

You may know by now that the Dalvik VM differs from the Java VM in structure and operation; its file and instruction formats are different. The Java VM is stack-based, meaning its bytecode (so named because each opcode is one byte long) works by pushing and popping operands on and off a stack. The Dalvik VM is register-based instead: its bytecode is designed to approximately imitate common real architectures, and it uses a somewhat C-style calling convention. You'll see in a moment how each calling method is responsible for setting up the arguments before making calls to another method. For more details on the design and general caveats of the Dalvik code format, refer to the entry named General Design - Bytecode for the Dalvik VM, Android Open Source project in the See also section.

Interpreting bytecode means actually being able to understand how the instruction format works. This section is dedicated to providing you with the references and tools you need to understand the Dalvik bytecode. Let's dig into the bytecode format and find out how it works and what it all means.

Understanding the Dalvik bytecode

Before jumping into bytecode specifics, it's important to establish some context. We need to understand a little about how bytecode is executed. This will help you understand the attributes of the Dalvik bytecode and tell the difference between knowing what a piece of bytecode is and knowing what it means in a given context of execution, which is a very valuable skill.

The Dalvik machine executes methods one by one, branching between methods where necessary, for instance, when one method invokes another. Each method can then be thought of as an independent instance of the Dalvik VM's execution. Each method has a private space of memory called a frame that holds just enough space to accommodate the data needed for the method's execution. Each frame also holds a reference to the DEX file; naturally, the method needs this reference in order to reference TypeIds and object definitions. It also holds a reference to an instance of the program counter, which is a register that controls the flow of execution and can be used to branch off into other execution flows. For instance, while executing an "if" statement, the method may need to jump in and out of different portions of code, depending on the result of a comparison. Frames also hold areas called registers, which are used to perform operations such as adding, multiplying, and moving values around, which may sometimes mean passing arguments to other methods, such as object constructors.

A bytecode consists of a collection of operators and operands, with each operator performing a specific action on the operands supplied to it. Some of the operators also summarize complex operations, such as invoking methods. The simple and atomic nature of these operators is the reason they are so robust, easy to read and understand, and supportive of a complex high-level language such as Java.

An important thing to note about the Dalvik bytecode, as with any intermediate code representation, is the order of the operands. The destination of the operation always appears before the source for the relevant operators. For instance, take an operation such as the following:

move vA,vB

This means that the contents of register B will be placed in register A. A popular jargon for this order is "Destination-then-Source"; this means the destination of the result of the operation appears first, followed by the operand that specifies the source.
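To make the destination-then-source order concrete, here is a minimal sketch of a toy frame written in Python. It is not how the Dalvik VM is implemented; the execute function, the list used as a register file, and the values in it are all made up for illustration, and only the move and const/4 operators are modeled.

```python
# Toy model of one method frame: a list of registers and nothing more.
# The function name, register count, and values are hypothetical.

def execute(frame, instruction):
    """Apply a single 'move' or 'const/4' instruction to the frame in place."""
    opcode, operands = instruction.split(None, 1)
    dest, src = [part.strip() for part in operands.split(",")]
    dest_index = int(dest[1:])              # "v0" -> 0
    if opcode == "move":
        # Destination first: copy the SOURCE register into the DESTINATION.
        frame[dest_index] = frame[int(src[1:])]
    elif opcode == "const/4":
        # Here the source is a literal such as 0x0 rather than a register.
        frame[dest_index] = int(src, 16)
    else:
        raise ValueError("unsupported opcode: " + opcode)
    return frame


registers = [None, 7]                       # v0 is empty, v1 holds 7
execute(registers, "move v0, v1")           # v0 <- v1
print(registers)                            # [7, 7]
```

Reading the operands left to right as "where the result goes, then where it comes from" is the habit worth building before you look at real smali output.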

Operands can be registers; each method, as an independent instance of execution, has its own collection of them. Operands may also be literal values (signed/unsigned integers of a specified size) or instances of a given type. For non-primitive values, the bytecode refers to an entry in the relevant section of the DEX file, for example, the TypeIds section for types and the StringIds section for string constants.

There are a number of instruction formats that dictate how many registers and how many type instances can be used as arguments for a given opcode. You can find these specifics at http://source.android.com/devices/tech/dalvik/instruction-formats.html. It's well worth your time to read through these definitions, because each opcode in the Dalvik instruction set is merely an implementation of one of the instruction formats. Try to understand the format IDs because they make for very useful shorthand while reading the instruction formats.
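As a rough aid for reading format IDs, the following Python sketch decodes the common three-character IDs: the first digit is the number of 16-bit code units, the second is the maximum number of registers (or r for a range), and the final letter says what extra data the format carries. The describe_format helper is hypothetical, the letter table is transcribed from the instruction formats page as I read it, and the opcode-to-format pairs should be checked against the opcode table before you rely on them. Longer IDs with two trailing letters are not handled.

```python
# Meaning of the trailing letter in a format ID (assumed transcription of
# the table on the instruction formats page; verify against the original).
TYPE_CODES = {
    "b": "immediate signed byte",
    "c": "constant pool index",
    "f": "interface constants",
    "h": "immediate signed hat",
    "i": "immediate signed int or 32-bit float",
    "l": "immediate signed long or 64-bit double",
    "m": "method constants",
    "n": "immediate signed nibble",
    "s": "immediate signed short",
    "t": "branch target",
    "x": "no additional data",
}


def describe_format(format_id):
    """Decode a three-character format ID such as '35c' (hypothetical helper)."""
    if len(format_id) != 3:
        raise ValueError("only three-character format IDs are handled here")
    units, registers, kind = format_id
    return {
        "code_units": int(units),
        "bits": int(units) * 16,            # each code unit is 16 bits wide
        "registers": "range" if registers == "r" else int(registers),
        "kind": TYPE_CODES[kind],
    }


# Formats used by the opcodes discussed in this section (my reading of the
# opcode table, not something the toolchain reports).
OPCODE_FORMATS = {
    "move": "12x",
    "const/4": "11n",
    "return-void": "10x",
    "sget-object": "21c",
    "const-string": "21c",
    "new-array": "22c",
    "invoke-direct": "35c",
    "invoke-virtual": "35c",
}

for opcode, format_id in OPCODE_FORMATS.items():
    print(opcode, format_id, describe_format(format_id))
```

For instance, 35c reads as three code units, up to five registers, and a constant pool index, which is why an invoke instruction can list up to five argument registers followed by a method reference.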

Having covered some of the basics, and trusting that you've at least skimmed the opcodes and opcode formats, we can move on to dumping some bytecode in a form that is easier to read.

Getting ready

Before we start, you will need the Smali decompiler, which is called baksmali. As an added convenience, we will now go over how to set up your PATH variable so that you can use the baksmali script and JAR file from anywhere on your machine without typing their full path every single time. Here's how you set it up:

  1. Grab a copy of the baksmali JAR file at https://code.google.com/p/smali/downloads/list, or from the newer repository at https://bitbucket.org/JesusFreke/smali/download. Look specifically for the baksmali[version].jar file—where [version] is the latest available version.
  2. Save it in a conveniently named directory; keeping the two files you need to download in the same directory makes things a whole lot easier.
  3. Download the baksmali wrapper script; it allows you to avoid invoking the java -jar command explicitly every time you need to run the baksmali JAR. You can grab a copy of the script at https://code.google.com/p/smali/downloads/list, or from the newer repository at https://bitbucket.org/JesusFreke/smali/downloads. Save it in the same directory as the baksmali JAR file. This step does not apply to Windows users, since it's a bash script file.
  4. Change the name of the baksmali JAR file to baksmali.jar, omitting the version number so that the wrapper script you've downloaded in step 3 will be able to find it. You can change the name using the following command on a Linux or Unix machine:
    mv baksmali-[version-number].jar baksmali.jar
    

    You can also do this using whatever file manager your operating system provides; as long as you change the name to baksmali.jar, you're doing it right!

  5. You then need to make sure that the baksmali script is executable. You can do this by issuing the following command if you're using a Unix or Linux operating system:
    chmod +x baksmali
    
  6. Add the current folder to your default PATH variable.

    And you're all done! You can now decompile the DEX files! See the following section to find out how.
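If you'd like to confirm that the setup worked before moving on, a small check such as the following Python sketch can help. It only assumes that the wrapper script is named baksmali, as in the steps above; the find_tool helper is hypothetical and simply wraps the standard library's shutil.which.

```python
import os
import shutil


def find_tool(name, search_path=None):
    """Return the full path of an executable called `name`, or None.

    search_path defaults to the PATH environment variable.
    """
    return shutil.which(name, path=search_path)


location = find_tool("baksmali")
if location is None:
    print("baksmali is not on your PATH yet; revisit steps 5 and 6")
else:
    print("baksmali found at", location)
    print("executable:", os.access(location, os.X_OK))
```

If the script reports that baksmali is missing, the usual causes are a skipped chmod or a PATH change that has not been picked up by the current terminal session.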

How to do it...

So, you've got baksmali all downloaded and set up, and you'd like to decompile some DEX files into the nice semantic syntax of smali; here's how you do that.

Execute the following command from your terminal or command prompt:

baksmali [Dex filename].dex

This command will output the contents of the DEX file as though it were an inflated JAR file, but instead of class files, all of the files will be .smali files written in smali, a human-readable dialect of the Dalvik bytecode.
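To see what was produced, you can walk the output directory and list the .smali files in it. The Python sketch below assumes the output landed in a directory named out, which is what the baksmali versions I have seen use by default; treat that name as an assumption and pass whatever directory your version actually wrote to. The list_smali_files helper is hypothetical.

```python
import os


def list_smali_files(output_dir):
    """Return the .smali files under output_dir, relative to it and sorted."""
    found = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name.endswith(".smali"):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, output_dir))
    return sorted(found)


# "out" is an assumed default; nothing is printed if it does not exist.
if os.path.isdir("out"):
    for path in list_smali_files("out"):
        print(path)
```

The directory layout mirrors the package structure, so a class in a package shows up a few directories deep, while a class in the default package sits at the top level.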

Let's take a look at the smali file generated by baksmali and walk through what each bytecode instruction means. The code is as follows:

.class public LExample;
.super Ljava/lang/Object;
.source "Example.java"


# direct methods
.method public constructor <init>()V
    .registers 1

    .prologue
    .line 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V

    return-void
.end method

.method public static main([Ljava/lang/String;)V
    .registers 4

    .prologue
    .line 3
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;

    const-string v1, "Hello World!\n"

    const/4 v2, 0x0

    new-array v2, v2, [Ljava/lang/Object;

    invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;

    .line 4
    return-void
.end method

Please note that because baksmali, the Android Dalvik VM, and the Java language are constantly being improved, you may see slightly different results from the previous code sample. Don't panic if you do; the preceding sample code is merely intended to be an example for you to learn from. You will still be able to apply the information in this chapter to the code your baksmali generates. The first few lines of the sample are as follows:

.class public LExample;
.super Ljava/lang/Object;
.source "Example.java"

These lines are merely metadata about the class being decompiled; they mention the class name, the source file, and the super class (the class that this class inherits from). You may notice from the code of Example.java that we never explicitly inherit from another class, though when decompiled, Example.java seems to have a parent: how is this possible? It is possible because all Java classes implicitly inherit from java.lang.Object.

Moving on, the next bunch of lines are a little more interesting. They are the smali code for the constructor of Example.java:

# direct methods
.method public constructor <init>()V
    .registers 1

    .prologue
    .line 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V

    return-void
.end method

The first line, .method public constructor <init>()V, is a declaration of the method to follow. It says that the method called <init>, which is the name given to constructors, takes no arguments, returns a void type, and has public access flags.

The next line contains the following piece of code:

.registers 1

This says that the method makes use of only one register. The number of registers a method needs is decided before it is run, so it is recorded as part of the method's definition. The one register needed here, p0, shows up shortly. Following this is a line that looks like the following code:

.prologue

This is a piece of debug information emitted by baksmali; it marks the end of the method prologue, which is the point where a debugger can sensibly place a method-entry breakpoint, and it has no effect on execution. The next line invokes another method called <init>, because every Java constructor begins, explicitly or implicitly, by calling the constructor of its super class:

invoke-direct {p0}, Ljava/lang/Object;-><init>()V

But this time the <init> method is dereferenced from the java.lang.Object class. The invoke-direct opcode here accepts two arguments: the p0 register, which in a non-static method holds the object the method was invoked on, and a reference to the method that needs to be called. The latter is indicated by the Ljava/lang/Object;-><init>()V label. The description of the invoke-direct opcode is stated as follows:

"invoke-direct is used to invoke a non-static direct method (an instance method that is non-overridable by nature and is either a private instance method or a constructor)."

So in summary, all it's doing is calling a non-static direct method that is the constructor of the java.lang.Object class.
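A method reference such as Ljava/lang/Object;-><init>()V always has the same three parts: the owning class, the method name, and the signature. The following Python sketch pulls them apart; split_method_reference is a hypothetical helper written for this walkthrough, not part of baksmali.

```python
def split_method_reference(reference):
    """Split 'Lowner;->name(params)return' into (owner, name, signature)."""
    owner, _arrow, rest = reference.partition("->")
    name, _paren, signature = rest.partition("(")
    # partition() consumed the opening parenthesis, so put it back.
    return owner, name, "(" + signature


owner, name, signature = split_method_reference("Ljava/lang/Object;-><init>()V")
print(owner)        # Ljava/lang/Object;
print(name)         # <init>
print(signature)    # ()V
```

Splitting the label this way makes it easier to see that the constructor being called belongs to java.lang.Object rather than to the Example class.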

Let's move on to the next line of the smali code:

return-void

It does exactly what it seems to: it returns nothing and exits the current method, handing the flow of execution back to whichever method invoked it.

The definition of this opcode as per the official website is "Return from a void method."

Nothing really complex about that. The next line, as with other lines beginning with the period (".") character, is a piece of metadata, or a footnote added by the smali decompiler, to help add some semantic information about the code. The .end method line marks the end of this method.

The code for the main method follows. Here, you will see some code forms that will appear over and over again, namely, the code generated when arguments are passed to the methods and when they are invoked. Since Java is object-oriented, a lot of what you're doing when your code is calling another object's methods is passing arguments and converting from one object type to another. So, a good idea would be to learn to identify when this is happening by decompiling some Java code that does this to the smali code. The code for the main method is as follows:

.method public static main([Ljava/lang/String;)V
    .registers 4

    .prologue
    .line 3
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;

    const-string v1, "Hello World!\n"

    const/4 v2, 0x0

    new-array v2, v2, [Ljava/lang/Object;

    invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;

    .line 4
    return-void
.end method

According to the first line .method public static main([Ljava/lang/String;)V, the method accepts an array of the type java.lang.String and returns a void, indicated by the following:

([Ljava/lang/String;)V

Preceding the method name, the declaration also says that the main method is static and has public access flags.
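Signatures such as ([Ljava/lang/String;)V become much easier to read once you translate them back into Java type names. Here is a minimal Python sketch of that translation, assuming the usual descriptor letters (V for void, I for int, L...; for a class, and a leading [ for each array dimension). The read_type and decode_signature helpers are hypothetical names introduced for illustration.

```python
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def read_type(descriptor, pos):
    """Decode one type starting at pos; return (java_name, next_pos)."""
    dimensions = 0
    while descriptor[pos] == "[":           # one '[' per array dimension
        dimensions += 1
        pos += 1
    if descriptor[pos] == "L":              # class type: Lpackage/Name;
        end = descriptor.index(";", pos)
        name = descriptor[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                                   # single-letter primitive
        name = PRIMITIVES[descriptor[pos]]
        pos += 1
    return name + "[]" * dimensions, pos


def decode_signature(descriptor):
    """Split '(params)return' into (list of parameter types, return type)."""
    if not descriptor.startswith("("):
        raise ValueError("a method signature must start with '('")
    pos = 1
    parameters = []
    while descriptor[pos] != ")":
        name, pos = read_type(descriptor, pos)
        parameters.append(name)
    return_type, _ = read_type(descriptor, pos + 1)
    return parameters, return_type


print(decode_signature("([Ljava/lang/String;)V"))
# (['java.lang.String[]'], 'void')
```

The same helper works on the longer signatures that appear later in the main method, so it is a handy thing to keep around while you practice reading smali.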

After the method header, we see the following piece of code, which shows an sget-object operation being performed:

sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;

The description of this opcode as per the official website is "Perform the identified object static field operation with the identified static field, loading or storing into the value register."

According to the official documentation, the sget-object operation accepts two arguments:

  • A register that Dalvik will use to store the result of the operation
  • A reference to the static field whose value is stored in the mentioned register

So, what this really does is fetch an instance of an object and store it in a register. Here, this register is the first register called v0. The next line looks as follows:

const-string v1, "Hello World!\n"

The previous code shows the const-string instruction in action. What it does is fetch a string and save it in the register indicated by the first argument. This register is the second register in the main method's frame called v1. The definition of the const-string opcode as per the official website is "Move a reference to the string specified by the given index into the specified register."

If it's not obvious enough, the string being fetched here is "Hello World!\n".

Moving on, the next line is also part of the const opcode family and is being used here to move a 0 value into the third register named v2:

const/4 v2, 0x0

This may seem a little random, but in the next line you'll see why it needs the 0 value in the v2 register. The code for the next line is as follows:

new-array v2, v2, [Ljava/lang/Object;

What the new-array opcode does is construct an array of a given type and size and save it in the first register from the left. Here this register is v2, so after this opcode has been executed, v2 will hold an array of type java.lang.Object which has a size of 0; the size is taken from the value of the v2 register in the second argument of the opcode. This also explains the previous operation, which moved a 0 value into v2 before the execution of this opcode. The definition of this opcode, as per the official website, is "Construct a new array of the indicated type and size. The type must be an array type."

The next line contains a very common opcode; make sure you know how this family of opcodes works because you're going to see a lot of it. The line is as follows:

invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;

The definition of the invoke-virtual opcode as per the official website is "invoke-virtual is used to invoke a normal virtual method (a method that is not private, static, or final, and is also not a constructor)."

The arguments for the invoke-virtual opcode work as follows:

invoke-kind {vC, vD, vE, vF, vG}, meth@BBBB

Where vC, vD, vE, vF, and vG are the argument registers used to pass arguments to the method being invoked, which is dereferenced by the last argument meth@BBBB. This means it accepts a 16-bit method reference, since each B indicates a field of size 4 bits. In summary, what this opcode does in terms of our code for Example.smali is invoke a method called java.io.PrintStream.printf on the object held in v0, passing it a java.lang.String object (v1) and an array of the type java.lang.Object (v2); the method returns an object of the type java.io.PrintStream.
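To tie the main method together, the following Python sketch replays its instructions one at a time against a small list of registers. It is only an analogy: run_main is a hypothetical function, a Python file object stands in for System.out, a Python list stands in for the Object array, and the % operator stands in for printf.

```python
import sys


def run_main(out=sys.stdout):
    """Replay the instructions of main() on a toy register file."""
    # .registers 4: v0..v2 are locals; the fourth register holds the
    # String[] argument and is never touched by this method.
    v = [None, None, None]

    v[0] = out                      # sget-object v0, System.out
    v[1] = "Hello World!\n"         # const-string v1, "Hello World!\n"
    v[2] = 0                        # const/4 v2, 0x0
    v[2] = [None] * v[2]            # new-array v2, v2, [Ljava/lang/Object;

    # invoke-virtual {v0, v1, v2}: v0 is the object the method is called
    # on, v1 is the format string, and v2 is the (empty) argument array.
    v[0].write(v[1] % tuple(v[2]))
    return v


run_main()                          # prints: Hello World!
```

Notice how v2 is reused: it first holds the array size and then the array itself, which is exactly the kind of register reuse you will see in real smali code.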

And that's it! You've just interpreted some smali code. It takes a bit of practice to get used to reading smali code. If you'd like to know more, check out the references in the See also section.

See also
