The main focus of most profiling tools is to provide a profile of method calls. This gives you a good idea of where the bottlenecks in your code are and is probably the most important way to pinpoint where to target your efforts. By showing which methods and lines take the most time, a good profiling tool can save you time and effort in locating bottlenecks.
Most method profilers work by sampling the call stack at regular
intervals and recording the methods on the stack.[8] This
regular snapshot identifies the method currently being executed (the
method at the top of the stack) and all the methods below, to the
program’s entry point. By accumulating the number of hits on
each method, the resulting profile usually identifies where the
program is spending most of its time. This profiling technique
assumes that the sampled methods are representative, i.e., if 10% of
sampled stacks show method foo( ) at the top of the stack, then the
assumption is that method foo( ) takes 10% of the running time.
However, this is a sampling technique, and so it is not foolproof:
methods can be missed altogether or have their weighting misrecorded
if some of their execution calls are missed. But usually only the
shortest tests are skewed. Any reasonably long test (i.e., over
seconds, rather than milliseconds) will normally give correct
results.
This sampling technique can be difficult to get right. It is not enough to simply sample the stack. The profiler must also ensure that it has a coherent stack state, so the call must be synchronized across the stack activities, possibly by temporarily stopping the thread. The profiler also needs to make sure that multiple threads are treated consistently, and that the timing involved in its activities is accounted for without distorting the regular sample time. Also, too short a sample interval causes the program to become extremely slow, while too long an interval results in many method calls being missed and hence misrepresentative profile results being generated.
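The basic mechanism is simple enough to sketch. The following minimal sampler is my own illustration, using the modern Thread.getStackTrace( ) API (not something available to the profilers described in this chapter): it periodically records the method at the top of a target thread's stack and accumulates hit counts per method.

```java
import java.util.HashMap;
import java.util.Map;

public class MiniSampler {
    // Take 'samples' snapshots of the target thread's stack, at the
    // given interval, counting hits on the top-of-stack method.
    public static Map<String, Integer> sample(Thread target, int samples,
                                              long intervalMillis)
            throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                // The top frame is the method currently executing.
                String top = stack[0].getClassName() + "."
                           + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            Thread.sleep(intervalMillis);  // the sampling interval
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted())
                x += Math.sqrt(x + 1);   // busy work to be sampled
        });
        worker.setDaemon(true);
        worker.start();
        Map<String, Integer> profile = sample(worker, 50, 2);
        worker.interrupt();
        System.out.println(profile);  // hit counts per method
    }
}
```

A real profiler must additionally pause threads to get coherent stacks, treat multiple threads consistently, and subtract its own overhead without distorting the sample interval, exactly the difficulties described above.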
The JDK comes with a minimal profiler, obtained by running a program
using the java
executable with the
-Xrunhprof
option
(-prof
before JDK 1.2,
-Xprof
with HotSpot). The result of running with
this option is a file with the profile data in it. The default name
of the file is java.hprof.txt
(java.prof
before 1.2). This filename can be
specified by using the modified option,
-Xrunhprof:file=<filename>
(-prof:<filename>
before 1.2). The
output using these options is discussed in detail shortly.
When using a method profiler, the most useful technique is to target the top five to ten methods and choose the quickest to fix. The reason for this is that once you make one change, the profile tends to be different the next time, sometimes markedly so. This way, you can get the quickest speedup for a given effort.
However, it is also important to consider what you are changing, so you know what your results are. If you select a method that is taking up 10% of the execution time, then if you halve the time that method takes, you have speeded up your application by 5%. On the other hand, targeting a method that takes up only 1% of execution time is going to give you a maximum of only 1% speedup to the application, no matter how much effort you put in to speed up that method.
Similarly, if you have a method that takes 10% of the time but is called a huge number of times so that each individual method call is quite short, you are less likely to speed up that method. On the other hand, if you can eliminate some significant fraction of the calling methods (the methods that call the method that takes 10% of the time), you might gain a good speedup in that way.
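One common way to eliminate such calls, sketched here under the assumption that the same values recur (the class and method names are my own, for illustration), is to cache the result of the expensive method so that repeat callers never reach it:

```java
import java.util.HashMap;
import java.util.Map;

public class CachedConversion {
    // Cache of previously converted values: repeat callers skip the
    // expensive Double.toString( ) machinery entirely.
    private static final Map<Double, String> CACHE = new HashMap<>();

    public static String toString(double d) {
        Double key = d;              // autoboxed lookup key
        String s = CACHE.get(key);
        if (s == null) {
            s = Double.toString(d);  // the expensive conversion, done once
            CACHE.put(key, s);
        }
        return s;
    }

    public static void main(String[] args) {
        // Converting the same value repeatedly now costs one real
        // conversion plus cheap hash lookups.
        for (int i = 0; i < 5; i++)
            System.out.println(toString(3.14));
    }
}
```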
Let’s look at the profile output from a short program that repeatedly converts some numbers to strings and also inserts them into a hash table:
package tuning.profile;

import java.util.*;

public class ProfileTest
{
  public static void main(String[] args)
  {
    //Repeat the loop this many times
    int repeat = 2000;

    //Two arrays of numbers, eight doubles and ten longs
    double[] ds = {Double.MAX_VALUE, -3.14e-200D,
        Double.NEGATIVE_INFINITY, 567.89023D, 123e199D,
        -0.000456D, -1.234D, 1e55D};
    long[] ls = {2283911683699007717L, -8007630872066909262L,
        4536503365853551745L, 548519563869L, 45L, Long.MAX_VALUE,
        1L, -9999L, 7661314123L, 0L};

    //initializations
    long time;
    StringBuffer s = new StringBuffer( );
    Hashtable h = new Hashtable( );
    System.out.println("Starting test");
    time = System.currentTimeMillis( );

    //Repeatedly add all the numbers to a stringbuffer,
    //and also put them into a hash table
    for (int i = repeat; i > 0; i--)
    {
      s.setLength(0);
      for (int j = ds.length-1; j >= 0; j--)
      {
        s.append(ds[j]);
        h.put(new Double(ds[j]), Boolean.TRUE);
      }
      for (int j = ls.length-1; j >= 0; j--)
      {
        s.append(ls[j]);
        h.put(new Long(ls[j]), Boolean.FALSE);
      }
    }
    time = System.currentTimeMillis( ) - time;
    System.out.println("  The test took " + time + " milliseconds");
  }
}
The relevant output from running this program with the JDK 1.2 method profiling option follows. (See Section 2.3.2 for a detailed explanation of the 1.2 profiling option and its output.)
CPU SAMPLES BEGIN (total = 15813) Wed Jan 12 11:26:47 2000
rank   self   accum   count  trace  method
   1  54.79%  54.79%   8664    204  java/lang/FloatingDecimal.dtoa
   2  11.67%  66.46%   1846    215  java/lang/Double.equals
   3  10.18%  76.64%   1609    214  java/lang/FloatingDecimal.dtoa
   4   3.10%  79.74%    490    151  java/lang/FloatingDecimal.dtoa
   5   2.90%  82.63%    458    150  java/lang/FloatingDecimal.<init>
   6   2.11%  84.74%    333    213  java/lang/FloatingDecimal.<init>
   7   1.23%  85.97%    194    216  java/lang/Double.doubleToLongBits
   8   0.97%  86.94%    154    134  sun/io/CharToByteConverter.convertAny
   9   0.94%  87.88%    148    218  java/lang/FloatingDecimal.<init>
  10   0.82%  88.69%    129    198  java/lang/Double.toString
  11   0.78%  89.47%    123    200  java/lang/Double.hashCode
  12   0.70%  90.17%    110    221  java/lang/FloatingDecimal.dtoa
  13   0.66%  90.83%    105    155  java/lang/FloatingDecimal.multPow52
  14   0.62%  91.45%     98    220  java/lang/Double.equals
  15   0.52%  91.97%     83    157  java/lang/FloatingDecimal.big5pow
  16   0.46%  92.44%     73    158  java/lang/FloatingDecimal.constructPow52
  17   0.46%  92.89%     72    133  java/io/OutputStreamWriter.write
In this example, I have extracted only the top few lines from the profile summary table. The methods are ranked according to the percentage of time they take. Note that the trace does not identify actual method signatures, only method names. The top three methods take, respectively, 54.79%, 11.67%, and 10.18% of the time taken to run the full program.[9] The fourth method in the list takes 3.10% of the time, so clearly you need look no further than the top three methods to optimize the program. The methods ranked first, third, and fourth are the same method, possibly called in different ways. Obtaining the traces for these three entries from the relevant section of the profile output (trace 204 for the first entry, and traces 214 and 151 for the third and fourth entries), you get:
TRACE 204:
 java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
 java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
 java/lang/Double.toString(Double.java:Compiled method)
 java/lang/String.valueOf(String.java:Compiled method)
TRACE 214:
 java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
TRACE 151:
 java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
 java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
 java/lang/Double.toString(Double.java:132)
 java/lang/String.valueOf(String.java:2065)
In fact, both traces 204 and 151 are the same stack, but trace 151 provides line numbers for two of the methods. Trace 214 is a truncated entry, and is probably the same stack as the other two (these differences are one of the limitations of the JDK profiler, i.e., that information is sometimes lost).
So all three entries refer to the same stack: an inferred call from
the StringBuffer to append a double, which calls
String.valueOf( ), which calls Double.toString( ), which in turn
creates a FloatingDecimal object. (<init> is the standard way to
write a constructor call; <clinit> is the standard way to show a
class initializer being executed. These are also the actual names
for constructors and static initializers in the class file.)
FloatingDecimal
is a class that is private to the
java.lang
package, which handles most of the
logic involved in converting floating-point numbers.
FloatingDecimal.dtoa( )
is the method called by the
FloatingDecimal
constructor that converts the
binary floating-point representation of a number into its various
parts of digits before the decimal point, after the decimal point,
and the exponent. FloatingDecimal
stores the
digits of the floating-point number as an array of chars when the
FloatingDecimal is created; no strings are created until the
floating-point number is converted to a string.
Since this stack includes a call to a constructor, it is worth
checking the object-creation profile to see whether you are
generating an excessive number of objects: object
creation
is expensive, and a method that
generates many new objects is often a performance bottleneck. (I show
the object-creation profile and how to generate it in Section 2.4.) The
object-creation profile shows that a large number of extra objects
are being created, including a large number of
FDBigInt
objects that are created by the new
FloatingDecimal
objects.
Clearly, FloatingDecimal.dtoa( )
is the primary
method to try to optimize in this case. Almost any improvement in
this one method translates directly to a similar improvement in the
overall program. However, normally only Sun can modify this method,
and even if you want to modify it, it is long and complicated and
takes an excessive amount of time to optimize unless you are already
familiar with both floating-point binary representation and
converting that representation to a string format.
Normally when tuning, the first alternative to optimizing
FloatingDecimal.dtoa( )
is to examine the other
significant bottleneck method, Double.equals( )
,
which came second in the summary. Even though this entry takes up
only 11.67% compared to over 68% for the
FloatingDecimal.dtoa( )
method, it may be an
easier optimization target. But note that while a small 10%
improvement in the FloatingDecimal.dtoa( )
method
translates into a 6% improvement for the program as a whole, the
Double.equals( )
method needs to be speeded up to
be more than twice as fast to get a similar 6% improvement for the
full program.
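The arithmetic behind these trade-offs is worth making explicit: the overall gain is the method's share of the runtime multiplied by the fractional improvement to that method (a simple Amdahl's-law-style calculation; the percentages below come from the profile shown earlier, and the helper class is my own):

```java
public class SpeedupMath {
    // Fraction of total runtime saved when a method occupying
    // 'share' of the runtime is made 'localGain' fraction faster.
    public static double overallGain(double share, double localGain) {
        return share * localGain;
    }

    public static void main(String[] args) {
        // FloatingDecimal.dtoa( ) stacks: ~68% of the runtime.
        // A 10% local improvement yields ~6.8% overall.
        System.out.println(overallGain(0.68, 0.10));
        // Double.equals( ): 11.67% of the runtime. Even halving it
        // yields only ~5.8% overall.
        System.out.println(overallGain(0.1167, 0.50));
    }
}
```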
The trace corresponding to this second entry in the summary example
turns out to be another truncated trace, but the example shows the
same method in 14th position, and the trace for that entry identifies
the Double.equals( )
call as coming from the
Hashtable.put( )
call. Unfortunately for tuning
purposes, the Double.equals( )
method itself is
already quite fast and cannot be optimized further.
When methods cannot be directly optimized, the next best choice is to
reduce the number of times they are called or even avoid the methods
altogether. (In fact, eliminating method calls is actually the better
tuning choice, but is often considerably more difficult to achieve
and so is not a first-choice tactic for optimization.) The
object-creation profile and the method profile together point to the
FloatingDecimal
class as being a huge bottleneck, so
avoiding this class is the obvious tuning tactic here. In Chapter 5, I employ this technique, avoiding the default
call through the FloatingDecimal
class for the
case of converting floating-point numbers to
String
s, and I obtain an order-of-magnitude
improvement. Basically, the strategy is to create a more efficient
routine to run the equivalent conversion functionality, and then
replace the calls to the underperforming FloatingDecimal methods
with calls to the more efficient optimized methods.
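To give a flavor of what such a replacement routine can look like (this is my own simplified sketch, not the Chapter 5 implementation), a converter restricted to a fixed number of decimal places can work through long arithmetic and bypass FloatingDecimal completely for in-range values:

```java
public class FastDoubleAppend {
    // Append 'value' to 'buf' with exactly two decimal places,
    // avoiding Double.toString( )/FloatingDecimal. Only valid for
    // values whose scaled form fits comfortably in a long.
    public static void appendFixed2(StringBuffer buf, double value) {
        long scaled = Math.round(value * 100.0);  // e.g., 3.25 -> 325
        if (scaled < 0) {
            buf.append('-');
            scaled = -scaled;
        }
        buf.append(scaled / 100);    // integer part
        buf.append('.');
        long frac = scaled % 100;    // two fractional digits
        if (frac < 10)
            buf.append('0');         // pad so 3.05 is not printed as 3.5
        buf.append(frac);
    }

    public static void main(String[] args) {
        StringBuffer s = new StringBuffer();
        appendFixed2(s, 3.25);
        s.append(' ');
        appendFixed2(s, -1.5);
        System.out.println(s);  // prints "3.25 -1.50"
    }
}
```

The point is not this particular routine but the tactic: a limited, application-specific conversion can be far cheaper than the fully general one.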
The best way to avoid the Double.equals( )
method
is to replace the hash table with another implementation that stores
double primitive data types directly rather than requiring the
double
s to be wrapped in a
Double
object. This allows the == operator to make
the comparison in the put( )
method, thus
completely avoiding the Double.equals( )
call:
this is another standard tuning tactic, where a data structure is
replaced with a more appropriate and faster one for the
task.
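A minimal sketch of such a replacement structure follows (my own illustration; no such class exists in the JDK). It stores the raw bit patterns of the doubles in an open-addressed table and compares them with ==, so no Double wrapper is created and Double.equals( ) is never called:

```java
public class DoubleSet {
    // Open-addressed hash set of double primitives. Keys are compared
    // by their raw bit patterns with ==, never via Double.equals( ).
    private final long[] bits;
    private final boolean[] used;

    public DoubleSet(int capacity) {
        bits = new long[capacity * 2];   // keep load factor <= 0.5
        used = new boolean[capacity * 2];
    }

    public boolean add(double d) {
        long key = Double.doubleToLongBits(d);  // canonical bit pattern
        int i = indexOf(key);
        if (used[i])
            return false;    // already present
        used[i] = true;
        bits[i] = key;
        return true;
    }

    public boolean contains(double d) {
        return used[indexOf(Double.doubleToLongBits(d))];
    }

    // Linear probing; stops at the key's slot or the first free slot.
    private int indexOf(long key) {
        int i = (int) ((key ^ (key >>> 32)) & 0x7fffffff) % bits.length;
        while (used[i] && bits[i] != key)
            i = (i + 1) % bits.length;
        return i;
    }

    public static void main(String[] args) {
        DoubleSet set = new DoubleSet(16);
        set.add(567.89023D);
        set.add(-0.000456D);
        System.out.println(set.contains(567.89023D));  // true
        System.out.println(set.contains(1.234D));      // false
    }
}
```

Because Double.doubleToLongBits( ) canonicalizes NaN values, == on the stored bit patterns gives the same membership semantics that Double.equals( ) would. (This sketch omits resizing, so it must be sized for its data.)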
The 1.1 profiling output is quite different and much less like a standard profiler’s output. Running the 1.1 profiler with this program (details of this output are given in Section 2.3.4) gives:
count callee caller time
21 java/lang/System.gc( )V java/lang/FloatingDecimal.dtoa(IJI)V 760
8 java/lang/System.gc( )V java/lang/Double.equals(Ljava/lang/Object;)Z 295
2 java/lang/Double.doubleToLongBits(D)J java/lang/Double.equals(Ljava/lang/Object;)Z 0
I have shown only the top four lines from the output. This output
actually identifies both the FloatingDecimal.dtoa()
and the Double.equals( )
methods as
taking the vast majority of the time, and the percentages (given by
the reported times) are listed as around 70% and 25% of the total
program time for the two methods, respectively. Since the
“callee” for these methods is listed as
System.gc( )
, this also identifies that the methods
are significantly involved in memory creation and suggests that the
next tuning step might be to analyze the object-creation output for
this program.
The
default profile output gained from executing with
-Xrunhprof
in Java 2 is not useful for method profiling. The default output
generates object-creation statistics from the heap as the dump
(output) occurs. By default, the dump occurs when the application
terminates; you can modify the dump time by typing Ctrl-\ on Solaris
and other Unix systems, or Ctrl-Break on Win32.
To get a useful method profile, you need to
modify the profiler options to specify method profiling. A typical
call to achieve this is:
java -Xrunhprof:cpu=samples,thread=y <classname>
(Note that in a Windows command-line prompt, you need to surround the option with double quotes because the equals sign is considered a meta character.)
Note that -Xrunhprof
has an “h” in it. There seems to be an undocumented feature of the VM in which the option -Xrun<something>
makes the VM try to load a shared library called <something>
, e.g., using -Xrunprof
results in the VM trying to load a shared library called “prof.” This can be quite confusing if you are not expecting it. In fact, -Xrunhprof
loads the “hprof” shared library.
The profiling option in JDK 1.2/1.3 can be pretty flaky. Several of the options can cause the runtime to crash (core dump). The output is a large file, since huge amounts of trace data are written rather than summarized. Since the profile option is essentially a Sun engineering tool, it has had limited resources applied to it, especially as Sun has a separate (not free) profile tool that Sun engineers would normally use. Another tool that Sun provides to analyze the output of the profiler is the Heap Analysis Tool (search http://www.java.sun.com for "HAT"). But this tool analyzes only the object-creation statistics output gained with the default profile output, and so is not that useful for method profiling (see Section 2.4 for slightly more about this tool).
Nevertheless, I expect the free profiling option to stabilize and be
more useful in future versions. The output when run with the options
already listed (cpu=samples, thread=y) results in fairly usable
information. This profiling mode operates by periodically sampling
the stack. Each unique stack trace provides a TRACE entry in the
second section of the file, describing the method calls on the stack
for that trace. Multiple identical samples are not listed; instead,
the number of their "hits" is summarized in the third
section of the file. The profile output file in this mode has three
sections:
A standard header section describing possible monitored entries in the file. For example:
WARNING!  This file format is under development, and is subject to
change without notice.
This file contains the following types of records:

THREAD START
THREAD END      mark the lifetime of Java threads

TRACE           represents a Java stack trace.  Each trace consists
                of a series of stack frames.  Other records refer to
                TRACEs to identify (1) where object allocations have
                taken place, (2) the frames in which GC roots were
                found, and (3) frequently executed methods.
Individual entries describing monitored events, i.e., threads starting and terminating, but mainly sampled stack traces. For example:
THREAD START (obj=8c2640, id = 6, name="Thread-0", group="main")
THREAD END (id = 6)
TRACE 1:
 <empty>
TRACE 964:
 java/io/ObjectInputStream.readObject(ObjectInputStream.java:Compiled method)
 java/io/ObjectInputStream.inputObject(ObjectInputStream.java:Compiled method)
 java/io/ObjectInputStream.readObject(ObjectInputStream.java:Compiled method)
 java/io/ObjectInputStream.inputArray(ObjectInputStream.java:Compiled method)
TRACE 1074:
 java/io/BufferedInputStream.fill(BufferedInputStream.java:Compiled method)
 java/io/BufferedInputStream.read1(BufferedInputStream.java:Compiled method)
 java/io/BufferedInputStream.read(BufferedInputStream.java:Compiled method)
 java/io/ObjectInputStream.read(ObjectInputStream.java:Compiled method)
A summary table of methods ranked by the number of times the unique stack trace for that method appears. For example:
CPU SAMPLES BEGIN (total = 512371) Thu Aug 26 18:37:08 1999
rank   self   accum   count  trace  method
   1  16.09%  16.09%  82426   1121  java/io/FileInputStream.read
   2   6.62%  22.71%  33926    881  java/io/ObjectInputStream.allocateNewObject
   3   5.11%  27.82%  26185    918  java/io/ObjectInputStream.inputClassFields
   4   4.42%  32.24%  22671    887  java/io/ObjectInputStream.inputObject
   5   3.20%  35.44%  16392    922  java/lang/reflect/Field.set
Section 3 is the place to start when analyzing this profile output.
It consists of a table with six fields, headed rank, self, accum,
count, trace, and method, as shown.
These fields are used as follows:
rank
This column simply counts the entries in the table, starting with 1 at the top, and incrementing by 1 for each entry.
self
The self
field is usually interpreted as a
percentage of the total running time spent in this method. More
accurately, this field reports the percentage of samples that have
the stack given by the trace
field. Here’s a
one-line example:
rank   self   accum   count  trace  method
   1  11.55%  11.55%  18382    545  java/lang/FloatingDecimal.dtoa
This example shows that stack trace 545 occurred in 18,382 of the sampled stack traces, and this is 11.55% of the total number of stack trace samples made. It indicates that this method was probably executing for about 11.55% of the application execution time, because the samples are at regular intervals. You can identify the precise trace from the second section of the profile output by searching for the trace with identifier 545. For the previous example, this trace was:
TRACE 545: (thread=1)
 java/lang/FloatingDecimal.dtoa(FloatingDecimal.java:Compiled method)
 java/lang/FloatingDecimal.<init>(FloatingDecimal.java:Compiled method)
 java/lang/Double.toString(Double.java:Compiled method)
 java/lang/String.valueOf(String.java:Compiled method)
This TRACE entry clearly identifies the exact method and its caller.
Note that the stack is reported to a depth of four methods. This is
the default depth: the depth can be changed using the depth parameter
to the -Xrunhprof option, e.g.,
-Xrunhprof:depth=6,cpu=samples,...
accum
This field is a running additive total of all the
self
field percentages as you go down the table:
for the Section 3 example shown previously, the third line lists
27.82% for the accum
field, indicating that the
sum total of the first three lines of the self
field is 27.82%.
count
This field indicates how many times the unique stack trace that gave rise to this entry was sampled while the program ran.
trace
This field shows the unique trace identifier from the second section
of profile output that generated this entry. The trace is recorded
only once in the second section no matter how many times it is
sampled; the number of times that this trace has been sampled is
listed in the count
field.
method
This field shows the method name from the top line of the stack trace
referred to from the trace
field, i.e., the method
that was running when the stack was sampled.
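The arithmetic relating the count, self, and accum fields can be checked directly; the numbers below are the counts from the first summary table shown earlier (total = 15813 samples), and the helper class is my own:

```java
public class SampleMath {
    // self% is simply the trace's sample count over the total samples.
    public static double selfPercent(int count, int totalSamples) {
        return 100.0 * count / totalSamples;
    }

    public static void main(String[] args) {
        // 8664 of 15813 samples -> 54.79% (rank 1, FloatingDecimal.dtoa)
        System.out.printf("%.2f%%%n", selfPercent(8664, 15813));
        // 1846 of 15813 samples -> 11.67% (rank 2, Double.equals)
        System.out.printf("%.2f%%%n", selfPercent(1846, 15813));
        // accum for rank 2 is the running total of self%: 66.46%
        System.out.printf("%.2f%%%n", selfPercent(8664 + 1846, 15813));
    }
}
```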
This summary table lists only the method name and not its argument
types. Therefore, it is frequently necessary to refer to the stack
itself to determine the exact method, if the method is an overloaded
method with several possible argument types. (The stack is given by
the trace identifier in the trace
field, which in
turn references the trace from the second section of the profile
output.) If a method is called in different ways, it may also give
rise to different stack traces. Sometimes the same method call can be
listed in different stack traces due to lost information. Each of
these different stack traces results in a different entry in the
third section of the profiler’s output, even though the
method
field is the same. For example, it is
perfectly possible to see several lines with the same
method
field, as in the following table segment:
rank   self   accum  count  trace  method
  95   1.1%  51.55%    110    699  java/lang/StringBuffer.append
 110   1.0%  67.35%    100    711  java/lang/StringBuffer.append
 128   1.0%  85.35%     99    332  java/lang/StringBuffer.append
When traces 699, 711, and 332 are analyzed, one trace might be
StringBuffer.append(boolean)
, while the other two
traces could both be StringBuffer.append(int)
, but
called from two different methods (and so giving rise to two
different stack traces and consequently two different lines in the
summary example). Note that the trace does not identify actual method
signatures, only method names. Line numbers are given if the class
was compiled so that line numbers remain. This ambiguity can be a
nuisance at times.
The profiler in this mode (cpu=samples
) is useful
enough to suffice when you have no better alternative. It does have
an effect on real measured times, slowing down operations by variable
amounts even within one application run. But it normally indicates
major bottlenecks, although sometimes a little extra work is
necessary to sort out multiple identical method-name references.
Using the alternative cpu=times
mode, the profile
output gives a different view of application execution. In this mode,
the method times are measured from method entry to method exit,
including the time spent in all other calls the method makes. This
profile of an application gives a tree-like view of where the
application is spending its time. Some developers are more
comfortable with this mode for profiling the application, but I find
that it does not directly identify bottlenecks in the code.
HotSpot does not support the standard Java 2 profiler detailed in the
previous section; it supports a separate profiler using the
-Xprof
option. JDK 1.3
supports the HotSpot profiler as well as the standard Java 2 profiler
detailed in the previous section. The HotSpot profiler has no further
options available to modify its behavior; it works by sampling the
stack every 10 milliseconds.
The output, printed to standard out, consists of a number of sections. Each section lists entries in order of the number of ticks counted while the method was executed. The various sections include methods executing in interpreted and compiled modes, and VM runtime costs as well:
One-line header, for example:
Flat profile of 7.55 secs (736 total ticks): main
A list of methods sampled while running in interpreted mode. The methods are listed in order of the total number of ticks counted while the method was at the top of the stack. For example:
     Interpreted + native   Method
 3.7%    23  +     4    tuning.profile.ProfileTest.main
 2.4%     4  +    14    java.lang.FloatingDecimal.dtoa
 1.4%     3  +     7    java.lang.FDBigInt.<init>
A list of methods sampled while running in compiled mode. The methods are listed in order of the total number of ticks counted while the method was at the top of the stack. For example:
     Compiled + native   Method
13.5%    99  +     0    java.lang.FDBigInt.quoRemIteration
 9.8%    71  +     1    java.lang.FDBigInt.mult
 9.1%    67  +     0    java.lang.FDBigInt.add
A list of external (non-Java) method stubs, defined using the
native
keyword. Listed in order of the total
number of ticks counted while the method was at the top of the stack.
For example:
     Stub + native   Method
 2.6%    11  +     8    java.lang.Double.doubleToLongBits
 0.7%     2  +     3    java.lang.StrictMath.floor
 0.5%     3  +     1    java.lang.Double.longBitsToDouble
A list of internal VM function calls. Listed in order of the total number of ticks counted while the method was at the top of the stack. Not tuneable. For example:
     Runtime stub + native   Method
 0.1%     1  +     0    interpreter_entries
 0.1%     1  +     0    Total runtime stubs
Other miscellaneous entries not included in the previous sections:
Thread-local ticks:
  1.4%    10    classloader
  0.1%     1    Interpreter
 11.7%    86    Unknown code
A global summary of ticks recorded. This includes ticks from the garbage collector, thread-locking overheads, and other miscellaneous entries:
Global summary of 7.57 seconds:
 100.0%   754    Received ticks
   1.9%    14    Received GC ticks
   0.3%     2    Other VM operations
The entries at the top of Section 3 are the methods that probably need tuning. Any method listed near the top of Section 2 should have been targeted by the HotSpot optimizer and may be listed lower down in Section 3. Such methods may still need to be optimized, but it is more likely that the methods at the top of Section 3 are what need optimizing. The ticks for the two sections are the same, so you can easily compare the time taken up by the top methods in the different sections and decide which to target.
The JDK 1.1.x method-profiling output, obtained by running with the
-prof
option, is quite different from the normal 1.2 output. This output
format is supported in Java 2, using the cpu=old
variation of the -Xrunhprof
option. This output
file consists of four sections:
The method profile table showing cumulative times spent in each
method executed. The table is sorted on the first field,
count; for example:
count callee caller time
29 java/lang/System.gc( )V java/io/FileInputStream.read([B)I 10263
1 java/io/FileOutputStream.writeBytes([BII)V java/io/FileOutputStream.write([BII)V 0
One line describing high-water gross memory usage. For example:
handles_used: 1174, handles_free: 339046, heap-used: 113960, heap-free: 21794720
The line reports the number of handles and the number of bytes used by the heap memory storage over the application’s lifetime. A handle is an object reference. The number of handles used is the maximum number of objects that existed at any one time in the application (handles are recycled by the garbage collector, so over its lifetime the application could have used many more objects than are listed). The heap measurements are in bytes.
Reports the number of primitive data type arrays left at the end of the process, just before process termination. For example:
sig    count    bytes    indx
[C       174    19060       5
[B         5    19200       8
This section has four fields. The first field is the primitive data
type (array dimensions and data type given by letter codes listed
shortly), the second field is the number of arrays, and the third is
the total number of bytes used by all the arrays. This example shows
174 char arrays taking a combined space of 19,060 bytes, and 5 byte
arrays taking a combined space of 19,200 bytes.
The reported data does not include any arrays that may have been
garbage collected before the end of the process. For this reason, the
section is of limited use. You could use the
-noasyncgc
option to try to eliminate garbage
collection (if you have enough memory; you may also need
-mx
with a large number to boost the maximum
memory available). If you do, also use -verbosegc
so that if garbage collection is forced, you at least know
that garbage collection has occurred and can get the basic number of
objects and bytes reclaimed.
The fourth section of the profile output is the per-object memory dump. Again, this includes only objects left at the end of the process just before termination, not objects that may have been garbage-collected before the end of the process. For example:
*** tab[267] p=4bba378 cb=1873248 cnt=219 ac=3 al=1103
    Ljava/util/HashtableEntry;  219  3504
    [Ljava/util/HashtableEntry;  3  4412
This dump is a snapshot of the actual object table. The fields in the first line of an entry are:
***tab[<index>]
The entry location as listed in the object table. The index is of no use for performance tuning.

p=<hex value>
Internal memory locations for the instance and class; of no use for performance tuning.

cb=<hex value>
Internal memory locations for the instance and class; of no use for performance tuning.

cnt=<integer>
The number of instances of the class reported on the next line.

ac=<integer>
The number of instances of arrays of the class reported on the next line.

al=<integer>
The total number of array elements for all the arrays counted in the previous (ac) field.
This first line of the example is followed by lines consisting of
three fields: first, the class name prefixed by the array dimension
if the line refers to the array data; next, the number of instances
of that class (or array class); and last, the total amount of space
used by all the instances, in bytes. So the example reports that
there are 219 HashtableEntry
instances taking a
total of 3504 bytes between them,[10] and three HashtableEntry
arrays having
1103 array indexes between them (which amounts to 4412 bytes between
them, since each entry is a 4-byte object handle).
The last two sections, Sections 3 and 4, give snapshots of the object table memory and can be used in an interesting way: to run a garbage collection just before termination of your application. That leaves in the object table all the objects that are rooted[11] by the system and by your application (from static variables). If this snapshot shows significantly more objects than you expect, you may be referencing more objects than you realized.
The first section of the profile output is the most useful,
consisting of multiple lines, each of which specifies a method and
its caller, together with the total cumulative time spent in that
method and the total number of times it was called from that caller.
The first line of this section specifies the four fields in the
profile table: count, callee, caller, and time. They are detailed
here:
count
The total number of times the callee
method was
called from the caller
method, accumulating
multiple executions of the caller
method. For
example, if foo1( )
calls foo2( )
10 times every time foo1( )
is executed, and
foo1( )
was itself called three times during the
execution of the program, the count
field should
hold the value 30 for the callee-caller pair foo2( )-foo1()
. The line in the table should look like this:
30 x/y/Z.foo2( )V x/y/Z.foo1( )V 1263
(assuming the foo*( )
methods are in class
x.y.Z
and they both have a void return). The
actual reported numbers may be less than the true number of calls:
the profiler can miss calls.
callee
The method that was called count
times in total
from the caller
method. The callee can be listed
in other entries as the callee
method for
different caller
methods.
caller
The method that called the callee
method
count
times in total.
time
The cumulative time (in milliseconds) spent in the
callee
method, including time when the
callee
method was calling other methods (i.e.,
when the callee
method was in the stack but not at
the top, and so was not the currently executing method).
If each of the count
calls in one line took
exactly the same amount of time, then one call from caller to callee
took time
divided by count
milliseconds.
This first section is normally sorted into count order. However, for this profiler, the time spent in methods tends to be more useful. Because the times in the time field include the total time that the callee method was anywhere on the stack, interpreting the output of complex programs can be difficult without processing the table to subtract subcall times. This format is different from the 1.2 output with cpu=samples specified, and is more equivalent to a 1.2 profile with cpu=times specified.
The lines in the profile output are unique for each callee-caller pair, but any one callee method and any one caller method can (and normally do) appear in multiple lines. This is because any particular method can call many other methods, so it registers as the caller in multiple callee-caller pairs; and any particular method can itself be called by many other methods, so it also registers as the callee in multiple callee-caller pairs.
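Because one callee is spread across many lines, its total call count has to be summed over all of its callee-caller pairs. A minimal sketch of that aggregation (the class name is mine; it assumes each data line is whitespace-separated as count callee caller time, and that signatures contain no internal spaces, i.e., foo2()V rather than the typographic "foo2( )V"):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfAggregator {
    // Sums the count field per callee across all of its callee-caller pairs.
    static Map<String, Long> totalCallsPerCallee(String[] lines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length != 4) continue;          // skip the header or malformed lines
            totals.merge(f[1], Long.parseLong(f[0]), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] sample = {
            "30 x/y/Z.foo2()V x/y/Z.foo1()V 1263",
            "12 x/y/Z.foo2()V x/y/Z.bar()V 501",  // same callee, different caller
        };
        System.out.println(totalCallsPerCallee(sample)); // prints {x/y/Z.foo2()V=42}
    }
}
```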
The methods are written out using the internal Java syntax listed in Table 2-1.
Table 2-1. Internal Java Syntax for -prof Output Format

Internal Symbol | Java Meaning
---|---
/ | Replaces the . character in package names (e.g., java/lang/String stands for java.lang.String)
B | byte
C | char
D | double
F | float
I | int
J | long
S | short
V | void
Z | boolean
[ | One array dimension (e.g., [[B stands for a two-dimensional array of bytes, such as new byte[3][4])
L<classname>; | A class (e.g., Ljava/lang/String; stands for java.lang.String)
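The descriptor syntax in Table 2-1 can be decoded mechanically. A sketch (the class and method names are mine, and only single type descriptors are handled, not full method signatures):

```java
public class SigDecoder {
    // Decodes one internal type descriptor into Java source syntax.
    static String decode(String sig) {
        int dims = 0;
        while (sig.startsWith("[")) {            // each [ adds one array dimension
            dims++;
            sig = sig.substring(1);
        }
        String base;
        switch (sig.charAt(0)) {
            case 'B': base = "byte";    break;
            case 'C': base = "char";    break;
            case 'D': base = "double";  break;
            case 'F': base = "float";   break;
            case 'I': base = "int";     break;
            case 'J': base = "long";    break;
            case 'S': base = "short";   break;
            case 'V': base = "void";    break;
            case 'Z': base = "boolean"; break;
            case 'L':                            // class type: L<classname>; with / for .
                base = sig.substring(1, sig.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("unknown descriptor: " + sig);
        }
        StringBuilder out = new StringBuilder(base);
        for (int i = 0; i < dims; i++) out.append("[]");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("[[B"));                // prints byte[][]
        System.out.println(decode("Ljava/lang/String;")); // prints java.lang.String
    }
}
```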
There are free viewers, with source code available, for files in this format:
Vladimir Bulatov’s HyperProf (search for HyperProf on the Web)
Greg White’s ProfileViewer (search for ProfileViewer on the Web)
My own viewer (see ProfileStack: A Profile Viewer for Java 1.1)
The biggest drawback to the 1.1 profile output is that threads are not indicated at all. This means that it is possible to get time values for method calls that are longer than the total time spent in running the application, since all the call times from multiple threads are added together. It also means that you cannot determine from which thread a particular method call was made. Nevertheless, after re-sorting the section on the time field, rather than the count field, the profile data is useful enough to suffice as a method profiler when you have no better alternative.
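That re-sort on the time field is a one-liner once the lines are parsed. A sketch (the class name is mine; it assumes whitespace-separated count callee caller time lines, with the time field fourth):

```java
import java.util.Arrays;
import java.util.Comparator;

public class ProfSorter {
    // Re-sorts profile data lines on the time field (fourth column), descending,
    // instead of the count order the profiler emits.
    static String[] sortByTime(String[] lines) {
        String[] out = lines.clone();
        Arrays.sort(out, Comparator.comparingLong(
                (String line) -> Long.parseLong(line.trim().split("\\s+")[3])).reversed());
        return out;
    }

    public static void main(String[] args) {
        String[] sample = {
            "500 x/y/Z.tick()V x/y/Z.loop()V 80",   // high count, low time
            "30 x/y/Z.foo2()V x/y/Z.foo1()V 1263",
        };
        for (String line : sortByTime(sample)) System.out.println(line);
        // the 1263 ms line now sorts first despite its lower count
    }
}
```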
One problem I’ve encountered is the limited size of the list of methods that can be held by the internal profiler. Technically, this limitation is 10,001 entries in the profile table, and there is presumably one entry per method. There are four methods that help you avoid the limitation by profiling only a small section of your code:
sun.misc.VM.suspendJavaMonitor()
sun.misc.VM.resumeJavaMonitor()
sun.misc.VM.resetJavaMonitor()
sun.misc.VM.writeJavaMonitorReport()
These methods also allow you some control over which parts of your application are profiled and when to dump the results.
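These are internal 1.1-only methods, so code that names them directly will not compile on later JDKs. One way to bracket a section of code with them while remaining portable is reflection; a sketch (the class and method names below are mine, and on any JVM without these methods the calls simply report failure):

```java
import java.lang.reflect.Method;

public class ProfControl {
    // Invokes one of the sun.misc.VM monitor-control methods by reflection so the
    // code still compiles and runs on JVMs where these internal 1.1 methods no
    // longer exist. Returns true only if the call actually succeeded.
    static boolean invokeMonitorControl(String methodName) {
        try {
            Class<?> vm = Class.forName("sun.misc.VM");
            Method m = vm.getMethod(methodName);
            m.invoke(null);
            return true;
        } catch (Exception e) {
            return false; // class or method absent: not running the 1.1 profiler
        }
    }

    public static void main(String[] args) {
        // Profile only the section you care about, to stay under the table limit.
        boolean on = invokeMonitorControl("resumeJavaMonitor");
        // ... code to profile would run here ...
        invokeMonitorControl("suspendJavaMonitor");
        System.out.println(on ? "profiling controlled" : "1.1 profiler not present");
    }
}
```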
[8] A variety of profiling metrics, including the way different metrics can be used, are reported in the paper “A unifying approach to performance analysis in the Java environment,” by Alexander, Berry, Levine, and Urquhart, in the IBM Systems Journal, Vol. 39, No. 1. This paper can be found at http://www.research.ibm.com/journal/sj/391/alexander.html. Specifically, see Table 4 in this paper.
[9] The samples that count towards a particular method’s execution time are those where the method itself is executing at the time of the sample. If method foo() was calling another method when the sample was taken, that other method would be at the top of the stack instead of foo(). So you do not need to worry about the distinction between foo()’s execution time and the time spent executing foo()’s callees. Only the method at the top of the stack is tallied.
[10] A HashtableEntry has one int and three object handle instance variables, each of which takes 4 bytes, so each HashtableEntry is 16 bytes.
[11] Objects rooted by the system are objects the JVM runtime keeps alive as part of its runtime system. Rooted objects are generally objects that cannot be garbage collected because they are referenced in some way from other objects that cannot be garbage collected. The roots of these non-garbage-collectable objects are normally objects referenced from the stack, objects referenced from static variables of classes, and special objects the runtime system ensures are kept alive.