You may know by now that the Dalvik VM differs from the Java VM in structure and operation; its file and instruction formats are different. The Java VM is stack-based, meaning its bytecode (so named because each opcode is a byte long) works by pushing and popping values on and off a stack. Dalvik, by contrast, is register-based: its bytecode is designed to loosely resemble the instruction sets of real architectures such as x86, and it uses a somewhat C-style calling convention. You'll see in a moment how each calling method is responsible for setting up the arguments before making calls to another method. For more details on the design and general caveats of the Dalvik code format, refer to the entry named General Design—Bytecode for the Dalvik VM, Android Open Source project in the See also section.
Interpreting bytecode means actually being able to understand how the instruction format works. This section is dedicated to providing you with the references and tools you need to understand the Dalvik bytecode. Let's dig into the bytecode format and find out how it works and what it all means.
Before jumping into bytecode specifics, it's important to establish some context. We need to understand a little about how bytecode is executed. This will help you understand the attributes of the Dalvik bytecode and tell the difference between knowing what a piece of bytecode is and knowing what it means in a given context of execution, which is a very valuable skill.
The Dalvik machine executes methods one by one, branching between methods where necessary, for instance, when one method invokes another. Each method can then be thought of as an independent instance of the Dalvik VM's execution. Each method has a private space of memory called a frame that holds just enough space to accommodate the data needed for the method's execution. Each frame also holds a reference to the DEX file; naturally, the method needs this reference in order to look up TypeIds and object definitions. It also holds a reference to an instance of the program counter, which is a register that controls the flow of execution and can be used to branch off into other execution flows. For instance, while executing an "if" statement, the method may need to jump in and out of different portions of code, depending on the result of a comparison. Frames also hold areas called registers, which are used to perform operations such as adding, multiplying, and moving values around, which may sometimes mean passing arguments to other methods, such as object constructors.
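The interplay between a frame's registers and its program counter can be sketched in a few lines of Python. This is only a toy model, not how the Dalvik VM is implemented: the Frame class, the run function, and the tuple encoding of instructions are all hypothetical, and branch targets are plain list indices here rather than the offsets real bytecode uses. Only the mnemonics (const, if-eqz, return) are borrowed from the Dalvik instruction set.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Toy per-method frame: a register file plus a program counter."""
    registers: list
    pc: int = 0  # index of the next instruction to execute


def run(frame, program):
    """Execute (opcode, operands...) tuples until a 'return' is reached."""
    while True:
        op, *args = program[frame.pc]
        frame.pc += 1                      # default: fall through to the next instruction
        if op == "const":                  # const vA, literal
            frame.registers[args[0]] = args[1]
        elif op == "if-eqz":               # if-eqz vA, target: branch when vA == 0
            if frame.registers[args[0]] == 0:
                frame.pc = args[1]         # overwrite the program counter to branch
        elif op == "return":               # return vA
            return frame.registers[args[0]]
        else:
            raise ValueError(f"unknown opcode: {op}")


# Hypothetical program: the branch at index 1 skips instructions 2 and 3.
program = [
    ("const", 0, 0),     # 0: v0 = 0
    ("if-eqz", 0, 4),    # 1: v0 is zero, so jump to index 4
    ("const", 1, 111),   # 2: skipped
    ("return", 1),       # 3: skipped
    ("const", 1, 222),   # 4: v1 = 222
    ("return", 1),       # 5: hand v1 back to the caller
]
result = run(Frame(registers=[None, None]), program)
print(result)
```

The point of the sketch is that a branch is nothing more than a write to the program counter; everything else is reads and writes on the frame's registers.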
Bytecode consists of a collection of operators and operands, with each operator performing a specific action on the operands supplied to it. Some of the operators also summarize complex operations, such as invoking methods. The simple and atomic nature of these operators is what makes them robust, easy to read and understand, and able to support a complex high-level language such as Java.
An important thing to note about Dalvik bytecode is the order of its operands. For the operators where it applies, the destination of the operation always appears before the source. Take, for instance, an operation such as the following:
move vA,vB
This means that the contents of register B will be placed in register A. This order is commonly called "Destination-then-Source": the destination of the result of the operation appears first, followed by the operand that specifies the source.
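As a quick illustration of the operand order, the following Python sketch applies a textual move instruction to a dictionary standing in for a method's registers. The execute_move helper and the register contents are hypothetical and exist only to show which side is written and which is read.

```python
def execute_move(line, registers):
    """Apply 'move vA, vB' to a dict of registers: vA receives the value of vB."""
    mnemonic, operands = line.split(None, 1)
    if mnemonic != "move":
        raise ValueError(f"not a move instruction: {line!r}")
    dest, src = [name.strip() for name in operands.split(",")]
    registers[dest] = registers[src]   # destination first, source second


regs = {"v0": 7, "v1": 42}
execute_move("move v0, v1", regs)
print(regs)
```

After the call, both registers hold the value that was in v1; the old contents of v0 are gone, and v1 itself is unchanged.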
Operands can be registers; each method, being an independent instance of execution, has its own collection of them. Operands may also be literal values (signed or unsigned integers of a specified size) or instances of a given type. For non-primitive types such as strings, the bytecode dereferences a type defined in the TypeIds section.
There are a number of instruction formats that dictate how many registers and type instances can be used as arguments for a given opcode. You can find these specifics at http://source.android.com/devices/tech/dalvik/instruction-formats.html. It's well worth your time to read through these definitions, because each opcode in the Dalvik instruction set is merely an implementation of one of these formats. Try to understand the format IDs because they make for very useful shorthand while reading the instruction formats.
Having covered some of the basics, and trusting that you've at least skimmed the opcodes and opcode formats, we can move on to dumping some bytecode in a form that is easy to read.
Before we start, you will need the smali decompiler, which is called baksmali. As an added convenience, we will now go over how to set up your PATH variable so that you can use the baksmali script and JAR file from anywhere on your machine without referencing them by their full path every single time. Here's how you set it up:
1. Download the baksmali[version].jar file, where [version] is the latest available version.
2. Download the baksmali wrapper script, which saves you from typing the java -jar command explicitly every time you need to run the baksmali JAR. You can grab a copy of the script at https://code.google.com/p/smali/downloads/list, or from the newer repository at https://bitbucket.org/JesusFreke/smali/downloads. Save it in the same directory as the baksmali JAR file. This step does not apply to Windows users, since it's a bash script file.
3. Rename the JAR file to baksmali.jar, omitting the version number so that the wrapper script you've downloaded in step 2 will be able to find it. You can change the name using the following command on a Linux or Unix machine:
mv baksmali-[version-number].jar baksmali.jar
You can also do this using whatever file manager your operating system provides; as long as you change the name to baksmali.jar, you're doing it right!
4. Make the baksmali script executable. On a Linux or Unix machine, you can do this with the following command:
chmod +x baksmali
5. Add the directory that holds both files to your PATH variable.
And you're all done! You can now decompile the DEX files! See the following section to find out how.
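If you want to confirm that the setup worked, a small check such as the following Python sketch can help. It relies only on the standard library's shutil.which, which searches the directories listed in PATH; the find_tool helper name is hypothetical, and the sketch assumes the wrapper script is called baksmali, as in the steps above.

```python
import shutil


def find_tool(name="baksmali"):
    """Return the full path of `name` if it can be found on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path:
        print(f"baksmali found at {path}")
    else:
        print("baksmali not found on PATH; recheck the setup steps")
```

A result of None usually means that either the directory was not added to PATH in the current shell session or the script was not made executable.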
So, you've got baksmali all downloaded and set up, and you'd like to decompile some DEX files into the nice semantic syntax of smali; here's how you do that.
Execute the following command from your terminal or command prompt:
baksmali [Dex filename].dex
This command will output the contents of the DEX file as though it were an inflated JAR file, but instead of class files, all of the source files will be .smali files containing a slight translation or dialect of the semantic Dalvik bytecode called smali.
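Once baksmali has run, it can be handy to get an overview of what it produced. The following Python sketch walks an output directory and lists every .smali file it contains. The list_smali_files helper is hypothetical, and the directory you pass in is whatever location your baksmali version wrote its output to; no particular directory name is assumed here.

```python
from pathlib import Path


def list_smali_files(output_dir):
    """Return every .smali file under output_dir as a sorted list of relative paths."""
    root = Path(output_dir)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.smali"))


if __name__ == "__main__":
    import sys
    # Pass the baksmali output directory as the first command-line argument.
    if len(sys.argv) > 1:
        for name in list_smali_files(sys.argv[1]):
            print(name)
```

Because the output mirrors the package layout, a class such as com.example.Main would be expected to show up as com/example/Main.smali in this listing.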
Let's take a look at the smali file generated by baksmali and walk through what each bytecode instruction means. The code is as follows:
.class public LExample;
.super Ljava/lang/Object;
.source "Example.java"
# direct methods
.method public constructor <init>()V
    .registers 1
    .prologue
    .line 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
    return-void
.end method
.method public static main([Ljava/lang/String;)V
    .registers 4
    .prologue
    .line 3
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
    const-string v1, "Hello World! "
    const/4 v2, 0x0
    new-array v2, v2, [Ljava/lang/Object;
    invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;
    .line 4
    return-void
.end method
Please note that because baksmali, the Android Dalvik VM, and the Java language are constantly being improved, you may see slightly different results from the previous code sample. Don't panic if you do; the sample is merely intended as an example for you to learn from, and you will still be able to apply the information in this chapter to the code your baksmali generates. The first few lines of the sample are as follows:
.class public LExample;
.super Ljava/lang/Object;
.source "Example.java"
These lines are merely metadata about the class being decompiled; they mention the class name, the source file, and the superclass (the class that this class inherits from). You may notice from the code of Example.java that we never explicitly inherit from another class, yet when decompiled, Example.java seems to have a parent. How is this possible? All Java classes implicitly inherit from java.lang.Object.
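When you are triaging many .smali files, pulling these three directives out programmatically can save time. The Python sketch below is a minimal, hypothetical parser (parse_class_header is not part of baksmali) that assumes each directive sits on its own line, as in the sample, and that the source filename contains no spaces.

```python
def parse_class_header(smali_text):
    """Collect the .class, .super and .source directives from smali text."""
    header = {}
    for line in smali_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == ".class":
            header["class"] = parts[-1]        # last token is the type descriptor
            header["access"] = parts[1:-1]     # anything in between, e.g. ['public']
        elif parts[0] == ".super":
            header["super"] = parts[1]
        elif parts[0] == ".source":
            header["source"] = parts[1].strip('"')
    return header


sample = """.class public LExample;
.super Ljava/lang/Object;
.source "Example.java"
"""
header = parse_class_header(sample)
print(header)
```

For the sample class, the sketch reports LExample; as the class, Ljava/lang/Object; as the superclass, and Example.java as the source file.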
Moving on, the next bunch of lines are a little more interesting. They are the smali code for the constructor of Example.java:
# direct methods
.method public constructor <init>()V
    .registers 1
    .prologue
    .line 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
    return-void
.end method
The first line, .method public constructor <init>()V, is a declaration of the method to follow. It says that the method called <init>, which is the name constructors carry in bytecode, takes no arguments, returns a void type, and has public access flags.
The next line is as follows:
.registers 1
It says that this method makes use of only one register. The method knows this because the number of registers it needs is decided before it is run. We'll come to the one register it needs, p0, in a moment. Following this is a line that looks like the following code:
.prologue
This directive is debugging metadata that marks the end of the method prologue, that is, the point at which the body of the method begins. The body of a Java constructor starts by calling the inherited form of the method, which explains why the next line, containing the following code, seems to invoke another method called <init>:
invoke-direct {p0}, Ljava/lang/Object;-><init>()V
This time, however, <init> is dereferenced from the java.lang.Object class. The invoke-direct instruction here accepts two arguments: the p0 register and a reference to the method that needs to be called, which is given by Ljava/lang/Object;-><init>()V. The description of the invoke-direct opcode is stated as follows:
"invoke-direct is used to invoke a non-static direct method (an instance method that is non-overridable by nature and is either a private instance method or a constructor)."
The full description is available at http://source.android.com/devices/tech/dalvik/dalvik-bytecode.html.
So in summary, all it's doing is calling a non-static direct method that is the constructor of the java.lang.Object class.
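You may wonder how p0 relates to the v registers seen elsewhere in the listing. In smali, p names are aliases for parameter registers, and by convention the parameters occupy the last registers of the frame; for a non-static method, p0 holds the object the method was called on. The Python sketch below works out the alias under that convention. The param_to_v helper is hypothetical, and it takes the number of registers occupied by parameters as an input rather than deriving it from the method signature.

```python
def param_to_v(total_registers, param_registers, p_index):
    """Map a parameter register pN to its vN number.

    total_registers: the value of the .registers directive
    param_registers: how many registers the parameters occupy
                     (including the implicit 'this' for non-static methods)
    """
    if not 0 <= p_index < param_registers <= total_registers:
        raise ValueError("parameter index out of range")
    first_param = total_registers - param_registers   # parameters sit at the end
    return first_param + p_index


# Constructor: .registers 1, one parameter register (the implicit 'this').
print("constructor p0 -> v%d" % param_to_v(1, 1, 0))
# main: .registers 4, one parameter register (the String array), no 'this'.
print("main p0 -> v%d" % param_to_v(4, 1, 0))
```

Under this convention, the constructor's p0 is simply v0, the only register it has, while in main the argument array would live in v3, leaving v0 to v2 free for local use.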
Let's move on to the next line of the smali code:
return-void
It does exactly what it seems to: it exits the current method without returning a value and hands the flow of execution back to whichever method invoked it.
The definition of this opcode as per the official website is "Return from a void method."
Nothing really complex about that. The next line, as with other lines beginning with the period (".") character, is a piece of metadata, or a footnote added by the smali decompiler, to help add some semantic information about the code. The .end method line marks the end of this method.
The code for the main method follows. Here, you will see some code forms that will appear over and over again, namely, the code generated when arguments are passed to methods and when methods are invoked. Since Java is object-oriented, a lot of what you're doing when your code calls another object's methods is passing arguments and converting from one object type to another. So, a good idea would be to learn to identify when this is happening by decompiling some Java code that does this into smali code. The code for the main method is as follows:
.method public static main([Ljava/lang/String;)V
    .registers 4
    .prologue
    .line 3
    sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
    const-string v1, "Hello World! "
    const/4 v2, 0x0
    new-array v2, v2, [Ljava/lang/Object;
    invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;
    .line 4
    return-void
.end method
According to the first line, .method public static main([Ljava/lang/String;)V, the method accepts an array of the type java.lang.String and returns void, as indicated by the following:
([Ljava/lang/String;)V
The keywords preceding the method name also say that the main method is static and has public access flags.
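Reading descriptors such as ([Ljava/lang/String;)V gets easier with practice, and writing a small decoder is one way to practice. The Python sketch below translates a method descriptor into Java-style type names. The function names are hypothetical; the single-letter codes for primitive types (V for void, I for int, and so on), the L...; form for classes, and the leading [ for arrays follow the standard descriptor notation.

```python
PRIMITIVES = {
    "V": "void", "Z": "boolean", "B": "byte", "S": "short", "C": "char",
    "I": "int", "J": "long", "F": "float", "D": "double",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (java_name, next_pos)."""
    dims = 0
    while desc[pos] == "[":                 # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":                    # class type: Lpackage/Name;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                                   # single-letter primitive type
        name = PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_method_descriptor(desc):
    """Split '(params)return' into (list_of_param_types, return_type)."""
    if not desc.startswith("("):
        raise ValueError(f"not a method descriptor: {desc!r}")
    close = desc.index(")")
    params, pos = [], 1
    while pos < close:
        name, pos = _read_type(desc, pos)
        params.append(name)
    return_type, _ = _read_type(desc, close + 1)
    return params, return_type


print(decode_method_descriptor("([Ljava/lang/String;)V"))
```

For the main method, this yields a single java.lang.String[] parameter and a void return type, which matches the familiar Java signature.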
After the method header, we see the following piece of code, which shows an sget-object operation being performed:
sget-object v0, Ljava/lang/System;->out:Ljava/io/PrintStream;
The description of this opcode as per the official website is "Perform the identified object static field operation with the identified static field, loading or storing into the value register."
According to the official documentation, the sget-object operation accepts two arguments: a value register (v0 here) and a reference to a static field (Ljava/lang/System;->out:Ljava/io/PrintStream; here).
So, what this really does is fetch the object held in a static field and store it in a register. Here, this register is the first register, called v0. The next line looks as follows:
const-string v1, "Hello World! "
The previous code shows the const-string instruction in action. What it does is fetch a string and save it in the register indicated by the first argument. This register is the second register in the main method's frame, called v1. The definition of the const-string opcode as per the official website is "Move a reference to the string specified by the given index into the specified register."
If it's not obvious enough, the string being fetched here is "Hello World! ".
Moving on, the next line is also part of the const opcode family and is being used here to move a 0 value into the third register, named v2:
const/4 v2, 0x0
This may seem a little random, but in the next line you'll see why it needs the 0 value in the v2 register. The code for the next line is as follows:
new-array v2, v2, [Ljava/lang/Object;
What new-array does is construct an array of a given type and size and save it in the first register from the left. Here this register is v2, so after this opcode has been executed, v2 will hold an array of type java.lang.Object with a size of 0, which is the value the v2 register held when it was read as the second argument of the opcode. This also explains the previous operation, which moved a 0 value into v2 before the execution of this opcode. The definition of this opcode as per the official website is "Construct a new array of the indicated type and size. The type must be an array type."
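The way v2 is first used as a size and then overwritten with the array can be traced with a small Python sketch. Everything here is a hypothetical model: registers are a dictionary, and the array is represented as a plain dictionary carrying its type descriptor. The one detail borrowed from the instruction set is that const/4 carries a signed 4-bit literal, so it can only express values from -8 to 7.

```python
def sign_extend_4bit(nibble):
    """Interpret a 4-bit value (0..15) as a signed literal (-8..7)."""
    return nibble - 16 if nibble & 0x8 else nibble


def const_4(registers, dest, nibble):
    """const/4 vA, #+B: store the sign-extended 4-bit literal in vA."""
    registers[dest] = sign_extend_4bit(nibble)


def new_array(registers, dest, size_reg, type_desc):
    """new-array vA, vB, type: vA receives a new array whose length is vB."""
    size = registers[size_reg]             # read the size before dest is overwritten
    if size < 0:
        raise ValueError("negative array size")
    registers[dest] = {"type": type_desc, "elements": [None] * size}


regs = {}
const_4(regs, "v2", 0x0)                               # const/4 v2, 0x0
new_array(regs, "v2", "v2", "[Ljava/lang/Object;")     # new-array v2, v2, [Ljava/lang/Object;
print(regs["v2"])
```

Reusing v2 as both source and destination is safe because the size is read before the register is overwritten, which is why the compiler can get away with so few registers here.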
The next line contains a very common opcode; make sure you know how this family of opcodes works because you're going to see a lot of it. The line is as follows:
invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;
The definition of the invoke-virtual opcode as per the official website is "invoke-virtual is used to invoke a normal virtual method (a method that is not private, static, or final, and is also not a constructor)."
The arguments for the invoke-virtual instruction work as follows:
invoke-kind {vC, vD, vE, vF, vG}, meth@BBBB
Here vC, vD, vE, vF, and vG are the argument registers used to pass arguments to the method being invoked, which is dereferenced by the last argument, meth@BBBB. This means it accepts a 16-bit method reference, since each B represents a field 4 bits in size. In summary, what this opcode does in terms of our code for Example.smali is invoke a method called java.io.PrintStream.printf on the object in v0; the method accepts a java.lang.String object and an array of the type java.lang.Object, and returns an object of the type java.io.PrintStream.
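Since invoke instructions are so common, it helps to be able to split one into its parts at a glance. The Python sketch below does this with a regular expression; parse_invoke and its pattern are hypothetical and only cover the simple textual form shown in this section (a register list in braces followed by a class, a method name, and a descriptor). It also spells out the arithmetic behind meth@BBBB.

```python
import re

# Four hex digits of 4 bits each give a 16-bit method index.
METHOD_INDEX_BITS = 4 * len("BBBB")
MAX_METHOD_REFS = 2 ** METHOD_INDEX_BITS

INVOKE_RE = re.compile(
    r"^(invoke-[\w/]+)\s*\{([^}]*)\},\s*(L[^;]+;)->([^(]+)(\(.*\).+)$"
)


def parse_invoke(line):
    """Split an invoke-* line into opcode, registers, class, method, descriptor."""
    match = INVOKE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised invoke instruction: {line!r}")
    opcode, regs, owner, name, descriptor = match.groups()
    return {
        "opcode": opcode,
        "registers": [r.strip() for r in regs.split(",") if r.strip()],
        "class": owner,
        "method": name,
        "descriptor": descriptor,
    }


parsed = parse_invoke(
    "invoke-virtual {v0, v1, v2}, Ljava/io/PrintStream;->"
    "printf(Ljava/lang/String;[Ljava/lang/Object;)Ljava/io/PrintStream;"
)
print(parsed["registers"], parsed["method"], MAX_METHOD_REFS)
```

In the parsed result for this line, the first register (v0) holds the PrintStream the method is called on, and the remaining registers (v1 and v2) carry the String and the Object array that printf receives.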
And that's it! You've just interpreted some smali code. It takes a bit of practice to get used to reading smali code. If you'd like to know more, check out the references in the See also section.