Chapter 10. Dynamic Tracing

Contributed by Jon Haslam

Solaris 10 delivered a revolutionary new subsystem called the Solaris Dynamic Tracing Framework (or DTrace for short). DTrace is an observability technology that allows us, for the first time, to answer virtually every question we ever wanted to ask about the behavior of our systems and applications.

Introduction to DTrace

Before Solaris 10, the Solaris observational toolset was already quite rich; many examples in this book use tools such as truss(1), pmap(1), pstack(1), vmstat(1), iostat(1), and others. However, as rich as each individual tool is, it still provides only limited and fixed insight into one specific area of a system. Not only that, but each of the tools is disjoint in its operation. It’s therefore difficult to accurately correlate the events reported by a tool, such as iostat, and the applications that are driving the behavior the tool reports. In addition, all these tools present data in different formats and frequently have very different interfaces. All this conspires to make observing and explaining systemwide behavioral characteristics very difficult indeed.

Solaris dynamic tracing makes these issues a thing of the past. With one subsystem we can observe, quite literally, any part of system and application behavior, ranging from every instruction in an application to the depths of the kernel. A single interface to this vast array of information means that, for the first time ever, subsystem boundaries can be crossed seamlessly, allowing easy observation of cause and effect across an entire system. For example, requests such as “show me the applications that caused writes to a given device” or “display the kernel code path that was executed as a result of a given application function call” are now trivial to fulfill. With DTrace we can ask almost any question we can think of.

With DTrace we can create custom programs that contain arbitrary questions and then dynamically modify application and kernel code to provide immediate answers to these questions. All this can be done on live production environments in complete safety, and by default the subsystem is available only to the superuser (uid 0). When not explicitly enabled, DTrace has zero probe effect and the system acts as if DTrace were not present at all.

DTrace has its own scripting language with which we can express the questions we want to ask; this language is called “D.” It provides most of the richness of “C” plus some tracing-specific additions.

The aim of this chapter is not to go into great detail on the language and architecture but to highlight the essential elements that you need to understand when reading this book. For a thorough treatment of the subject, read the Solaris Dynamic Tracing Guide available at http://docs.sun.com.

The Basics

As an introduction to DTrace and the D language, let’s start with a simple example.

The truss(1) utility, a widely used observational tool, provides a powerful means to observe system and library call activity. However, it has several drawbacks: It operates on one process at a time, with no systemwide capability; it is verbose, with a fixed output format; and it offers its users only a limited choice of questions. Moreover, because of the way it works, truss can reduce application performance. Every time a thread in a process makes a system call, truss stops the thread through procfs, records the arguments for the system call, and then restarts the thread. When the system call returns, truss again stops the thread, records the return code, and then restarts it. It’s not hard to see how this can have quite an impact on performance. DTrace, however, operates completely in the kernel, collecting relevant data at the source. Because the application is no longer controlled through procfs, the impact on the application is greatly reduced.

With DTrace we can surpass the power of truss with our first script, which in itself is almost the simplest script that can be written. Here’s a D script, truss.d, that lets us observe all global system call activity.

#!/usr/sbin/dtrace -s

syscall:::entry
{
}

There are a few important things to note from the above example. The first line of the program is as follows:

#!/usr/sbin/dtrace -s

This specifies that the dtrace(1M) program is to be used as the interpreter, and the -s argument tells dtrace that what follows is a D program that it should execute. Note: The interpreter line for all the examples in this chapter is omitted for the sake of brevity, but it is still very much required.

Next follows a description of the events we are interested in looking at. Here we are interested in what happens every time a system call is made.

syscall:::entry

This is an example of a probe description. In DTrace, a probe is a place in the system where we want to ask a question and record some pertinent data. Such data might include function arguments, stack traces, timestamps, file names, function names, and the like.

The braces that follow the probe specification contain the actions that are to be executed when the associated probe is encountered. Actions are generally focused on recording items of data; we’ll see examples of these shortly. This example contains no actions, so the default behavior is to just print the name of the probe that has been hit (or fired in tracing parlance) as well as the CPU it executed on and a numerical ID for the probe.

Let’s run our simple script.

sol10# ./truss.d
dtrace: script './truss.d' matched 225 probes
CPU     ID                    FUNCTION:NAME
  0     13                      write:entry
  0    103                      ioctl:entry
  0    317                    pollsys:entry
  0     13                      write:entry
  0    103                      ioctl:entry
  0    317                    pollsys:entry
^C

As you can see from the preceding output, the syscall:::entry probe description enabled 225 different probes in this instance; this is the number of system calls currently available on this system. We don’t go into the details now of exactly what this means, but be aware that, when the script is executed, the kernel is instrumented according to our script. When we stop the script, the instrumentation is removed and the system acts in the same way as a system without DTrace installed.

The final thing to note here is that the execution of the script was terminated with a Control-C sequence (as shown with the ^C in the above output). A script can itself issue an explicit exit() call to terminate; in the absence of this, the user will have to type Control-C.
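For example, a script can bound its own run time with a timer probe from the profile provider (described later in this chapter); the 10-second interval in this sketch is arbitrary:

profile:::tick-10s
{
        exit(0);
}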

The preceding script gives a global view of all system call activity. To focus our attention on a single process, we can modify the script to use a predicate. A predicate is associated with a probe description and is a set of conditions placed between forward slashes (“/”). For example:

#pragma D option quiet

syscall:::entry
/pid == 660/
{
        printf("%-15s: %8x %8x %8x %8x %8x %8x\n",
            probefunc, arg0, arg1, arg2, arg3, arg4, arg5);
}

If the expressions within the predicate evaluate to true, then we are interested in recording some data and the associated actions are executed. However, if they evaluate to false, then we choose not to record anything and return. In this case, we want to execute the actions only if the thread making the system call belongs to pid 660.

We made a couple of additions to the D script. The #pragma just tells DTrace not to print anything unless it’s explicitly asked to do so (the -q option to dtrace(1M) does the same thing). Second, we added some output formatting to printf() to display the name of the system call that was made and its first six arguments, whether the system call has them or not. We look more at output formatting and arguments later. Here is some example output from our script.

sol10# ./truss.d
write          :       16  841b548        8        0 831de790         8
read           :       16 f942dcc0       20        0 831de790        20
write          :       16  841b548        8        0 831de790         8
read           :       16 f942dcc0       20        0 831de790        20
pollsys        : f942dce0        1 f942dbf0        0 831de790  f942dbf0
write          :        5 feab36b1        1        e 81b31250         1
pollsys        :  8046ef0        2  8046f88        0 81b31250   8046f88

With a few lines of D we have created the functional equivalent of truss -p.

Now that we’ve seen a simple example, let’s look at some of the basic building blocks of DTrace.

D Program Structure

D is a block-structured language similar in layout to awk. A program consists of one or more clauses that take the following form:

probe
/ optional predicates /
{
        optional action statements;
}

Each clause describes one or more probes to enable, an optional predicate, and any actions to associate with the probe specification. When a D program contains several clauses that enable the same probe, the clauses are executed in the order in which they appear in the program. For example:

syscall::read:entry
{
        printf("A");
}

syscall::read:entry
{
        printf(" B");
        exit(1);
}

The above script contains two clauses; each clause enables the read(2) system call entry probe. When this script is executed, the system is modified dynamically to insert our tracing actions into the read() system call. When any application next makes a read() call, the first clause is executed, causing the character “A” to be displayed. The next clause is executed immediately after the first, and the sequence “B” is also displayed. The exit(1) call terminates the tracing session, an action that in turn causes the enabled probes and their actions to be removed. The system then returns to its default state. Executing the script we see this:

sol10# ./read.d
A B

The preceding explanation is a huge simplification of what actually happens when we execute a D script. The important thing to note here is the dynamic nature of the modifications that are made when a D script is executed. The modifications made to the system (the “instrumentation”) exist just for the lifetime of the script. When no DTrace scripts are running, the system acts just as if DTrace were not installed.

Providers and Probes

By default, DTrace provides tens of thousands of probes that you can enable to gain unparalleled insight into the behavior of a system (use dtrace -l to list them all). Each probe can be referred to by a unique numerical ID or by a more commonly used human-readable one that consists of four colon-separated fields. These are defined as follows:

provider:module:function:name
  • Provider. The name of the DTrace provider that created this probe. A provider is essentially a kernel module that creates groups of probes that are related in some way (for example, kernel functions, an application’s functions, system calls, timers).

  • Module. The name of the module to which this probe belongs if the probe is associated with a program location. For kernel probes, it is the name of the module (for example, ufs); for applications, it is a library name (for example, libc.so).

  • Function. The name of the function that this probe is associated with if it belongs to a program location. Kernel examples are ufs_write() and clock(); a userland (a program running in user-mode) example is the printf() function of libc.

  • Name. The name component of the probe. It generally gives an idea of its meaning. Examples include entry or return for kernel function calls, start for an I/O probe, and on-cpu for a scheduling probe.

Note two key facts about probe specifications:

  • If any field in a probe specification is empty, that field matches any value (that is, it acts like a wildcard).

  • sh(1)-like pattern matching is supported.

Table 10.1 lists examples of valid probe descriptions.

Table 10.1. Examples of DTrace Probe Descriptions

Probe Description

Meaning

fbt:ufs:ufs_write:entry

The ufs_write() kernel function’s entry point

fbt:nfs::

All the probes in the kernel nfs module

syscall::write:entry

The write() system call entry point

syscall::*read*:entry

The entry probes of all system calls matching *read*: read, readlink, readv, pread, and pread64

syscall:::

All system call entry and return probes

io:::start

All the places in the kernel from which a physical I/O can occur

sched:::off-cpu

All the places in kernel where a currently executing thread is taken off the CPU

Although it isn’t necessary to specify all the fields in a probe, the examples in this book do so in order to remove any ambiguity about which probes are being enabled. Also note that a comma-separated list of probes can be used to associate multiple probes with the same predicate and actions.
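For example, the following sketch (with "bash" chosen arbitrarily) attaches one predicate and one action to both the read(2) and write(2) entry probes:

syscall::read:entry,
syscall::write:entry
/execname == "bash"/
{
        trace(arg0);    /* trace the first argument: the file descriptor */
}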

In previous examples we saw the syscall provider being used to ask questions concerning system call usage. Exactly what is a provider and what is its relationship to a probe? A provider creates the probes that are essentially the individual system points at which we ask questions. There are a number of providers, each able to instrument a different part of the system.

The following providers are of special interest to us:

  • fbt. The Function Boundary Tracing provider places probes at the entry and return point of virtually every kernel function. This provider illuminates the operation of the Solaris kernel and is used extensively in this book. Its full power is realized when it is used in conjunction with the Solaris source code.

  • pid. This provider probes for userland processes at function entry, function return, and even down to the instruction level.

  • syscall. This provider probes at the entry and return point of every system call.

  • profile. This provider gives us timer-driven probes. The timers can be specified at any resolution from nanoseconds to days and can interrupt all CPUs or just one.

  • sdt. The Statically Defined Tracing provider enables programmers to place probes at arbitrary locations in their code and to choose probe names that convey specific meaning. (For example, a probe named transmit-start means more to most observers than the function name in which it sits.)

The following providers leverage the sdt provider to grant powerful observability into key Solaris functional areas:

  • sched. This provider affords a group of probes for scheduling-related events. Such events include a thread being placed on the CPU, taken off the CPU, put to sleep, or woken up.

  • io. This provider probes for I/O-related events. Such events include I/O starts, I/O completion, and I/O waits.

  • proc. The probes of the proc provider examine process creation and life cycle events. Such events include fork, exec, thread creation, and signal send and receive.

  • vminfo. The vminfo provider is layered on top of the kstat updates to the vm kstat. Every time an update is made to a member of the vm kstat, a probe is fired.

  • sysinfo. The sysinfo provider is also layered on top of the kstat updates, in this case, to the sys kstat. Every time an update is made to a member of the sys kstat, a probe is fired.
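As a brief illustration of one of these layered providers, the following sketch uses the proc provider to report each exec on the system; the exec probe's args[0] is the path of the new process image:

proc:::exec
{
        printf("%s (pid %d) exec'd %s\n", execname, pid, args[0]);
}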

Aggregations

The syscall example used earlier is simple and powerful. However, the output quickly becomes voluminous and overwhelming with thousands of lines generated in seconds. It rapidly becomes difficult to discern patterns of activity in the data, such as might be perceived in a view of all system calls sorted by count. Historically, we would have generated our data and post-processed this by using tools such as awk(1) or perl(1), but that approach is laborious and time wasting. DTrace enables us to succinctly specify how to group vast amounts of data so that we can easily observe such patterns. The mechanism that does this is termed an aggregation. We use aggregations to refine our initial script.

syscall:::entry
{
        @sys[probefunc] = count();
}

And here is the output now.

sol10# ./truss.d
dtrace: script './truss.d' matched 225 probes
^C

  <output elided>
  xstat                                                           1113
  lwp_park                                                        2767
  setcontext                                                      4593
  lwp_sigmask                                                     4599
  write                                                           7429
  setitimer                                                       8234
  writev                                                          8444
  ioctl                                                          17718
  pollsys                                                       135603
  read                                                          141379

Instead of seeing every system call as it is made, we are now presented with a table of system calls sorted by count: over 330,000 system calls presented in several lines!

The concept of an aggregation is simple. We want to associate the value of a function with an arbitrary element in an array. In our example, every time a system call probe is fired, the name of the system call is used (using the probefunc built-in variable) to index an associative array. The result of the count() function is then stored in this element of the array (this simply adds 1 to an internal variable for the index in the array and so effectively keeps a running total of the number of times this system call has been entered). In that way, we do not focus on data at individual probe sites but succinctly collate large volumes of data.

An aggregation can be split into two basic components: on the left side, a named associative array that is preceded by the @ symbol; on the right side, an aggregating function.

@name [ keys ] = function();

An aggregating function has the special property that it produces the same result when applied to a set of data as when applied to subsets of that data and then again to that set of results. A simple example of this is finding the minimum value of the set [5, 12, 4, 7, 18]. Applying the min() function to the whole set gives the result of 4. Equally, computing the minimum value of two subsets [5, 12] and [4, 7, 18] produces 5 and 4. Applying min() again to [5, 4] yields 4.

Several aggregating functions in DTrace and their results are listed below.

  • count. Returns the number of times called.

  • avg. Returns the mean of its arguments. The following example displays the average write size that each process makes. The third argument to the write(2) system call is the size of the write being made. Since arguments are indexed from 0, arg2 is therefore the size of the write.

    syscall::write:entry
    {
            @sys[execname] = avg(arg2);
    }
    
    sol10# ./avg.d
    dtrace: script './avg.d' matched 1 probe
    ^C
    
      ls                                                               101
      egrep                                                            162
      gnome-panel                                                      169
      gnome-terminal                                                   290
      soffice.bin                                                      309
      metacity                                                         334
      battstat-applet-                                                 366
      init                                                             412
      mozilla-bin                                                     1612
      gconfd-2                                                       27763
    
  • sum. Returns the total value of its arguments.

  • max. Returns the maximum value of its arguments.

  • min. Returns the minimum value of its arguments.

  • quantize. Stores the specified argument in the appropriate bucket in a power-of-2 series.

The following example stores in the appropriate bucket the size of the memory requested in the call to malloc().

pid$1:libc:malloc:entry
{
        @["malloc sizes"] = quantize(arg0);
}

sol10# ./malloc.d 658
dtrace: script './malloc.d' matched 1 probe
^C

  malloc sizes
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@                                       405
               8 |@@@@                                     886
              16 |@@@@@@@                                  1673
              32 |@                                        205
              64 |@@@@@@                                   1262
             128 |@@@@@@@                                  1600
             256 |@@@@@@@                                  1632
             512 |                                         3
            1024 |                                         5
            2048 |@@@@                                     866
            4096 |                                         10
            8192 |@@@                                      586
           16384 |                                         0

The example shows that 1673 memory allocations between 16 and 31 bytes in size were requested. The rows of @ characters indicate the relative number of entries in each bucket.

  • lquantize. Linear quantizations are frequently used to drill down on buckets of interest when the quantize() function has previously been used. This time we use a linear range of buckets that goes between two sizes with a specified step size. The example below specifies that calls to malloc() between 4 and 7 bytes in size go in their own bucket.

pid$1:libc:malloc:entry
{
        @["malloc sizes"] = lquantize(arg0,4,8,1);
}

sol10# ./lmalloc.d 658
dtrace: script './lmalloc.d' matched 1 probe
^C

  malloc sizes
           value  ------------- Distribution -------------  count
             < 4 |                                          6
               4 |@                                         423
               5 |                                          0
               6 |@                                         400
               7 |                                          0
            >= 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    18452

Variables

Having looked at aggregations, we now come to the two basic data types provided by D: associative arrays and scalar variables. An associative array stores data elements that can be accessed with an arbitrary name, known as a key or an index. This differs from normal, fixed-size arrays in a number of ways:

  • There are no predefined limits on the number of elements in the array.

  • The elements can be indexed with an arbitrary key and not just with integer keys.

  • The storage for the array is not preallocated or contained in consecutive storage locations.

Associative arrays in D commonly keep a history of events that have occurred in the past to use in controlling flow in scripts. The following example uses an associative array, arr, to keep track of the largest writes made by applications.

syscall::write:entry
/arr[execname] < arg2/
{
        printf("%d byte write: %s\n", arg2, execname);
        arr[execname] = arg2;
}

The actions of the clause are executed if the write size, stored in arg2, is larger than that stored in the associative array arr for a given application. If the predicate evaluates to true, then this is the largest write seen for this application. The actions record this by first printing the size of the write and then by updating the element in the array with the new maximum write size.

D is similar to languages such as C in its implementation of scalar variables, but a few differences need to be highlighted. The first thing to note is that in the D language, variables do not have to be declared in advance of their use, much the same as in awk(1) or perl(1). A variable comes into existence when it first has a value assigned to it; its type is inferred from the assigned value (you are allowed to declare variables in advance, but doing so isn't necessary). There is no explicit memory management in D, much as in the Java programming language. The storage for a variable is allocated when the variable is first assigned a value and deallocated when the value 0 is assigned to it.

The D language provides three types of variable scope: global, thread-local, and clause-local. Thread-local variables provide separate storage for each thread for a given variable and are referenced with the self-> prefix.

fbt:ufs:ufs_write:entry
{
        self->in = timestamp;
}

In the clause above, every different thread that executes the ufs_write() function has its own copy of a variable named in. Its type is the same as the timestamp built-in variable, and it holds the value that the timestamp built-in variable had when the thread started executing the actions in the clause. This is a nanosecond value since an arbitrary time in the past.
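The third scope, clause-local, uses the this-> prefix; a clause-local variable is valid only for the duration of a single probe firing, which makes it a convenient scratch variable. A minimal sketch:

syscall::read:entry
{
        this->fd = arg0;        /* valid only while this probe's clauses execute */
        printf("read on fd %d\n", this->fd);
}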

A common use of thread-local variables is to highlight a sequence of interest for a given thread and also to associate data with a thread during that sequence. The following example uses the sched provider to record, by application, all the time that a specified user (UID 1003) spent executing.

sched:::on-cpu
/uid == 1003/
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @time[execname] = sum(timestamp - self->ts);
        self->ts = 0;
}

The above D script contains two clauses. The first one uses the sched:::on-cpu probe to enable a probe at every point in the kernel where a thread can be placed onto a processor and run. The predicate attached to this probe specifies that the actions are only to be executed if the uid of the thread is 1003. The action merely stores the current timestamp in nanoseconds by assigning the timestamp built-in variable to a thread-local variable, self->ts.

The second clause uses the sched:::off-cpu probe to enable a probe at every location in the kernel where a thread can be taken off the CPU. The self->ts variable in the predicate ensures that only threads owned by uid 1003 that have already been through the sched:::on-cpu probe shall execute the following actions. Why couldn’t we just predicate on uid == 1003 as in the first clause? Well, we want to ensure that any thread executing the following actions has already been through the first clause so that its self->ts variable is set. If it hasn't been set, we will end up storing a huge value in the @time aggregation because self->ts will be 0! Using a thread-local variable in predicates like this to control flow in a D script is a common technique that we frequently use in this book.

sol10# ./sched.d
dtrace: script './sched.d' matched 6 probes
^C

  cat                                                          1247134
  xterm                                                        1830446
  ksh                                                          3909045
  stdlibfilt                                                   5499630
  make                                                        60092218
  sh                                                         158049977
  sed                                                        162507340
  CC                                                         304925644
  ir2hf                                                      678855289
  sched                                                     2600916929
  ube_ipa                                                   2851647754
  ccfe                                                      5879225939
  ube                                                       7942433397

The preceding example can be enhanced with the profile provider to produce output at a given periodic rate. To produce output every 5 seconds, we can just add the following clause:

profile:::tick-5s
{
              printa(@time);
              trunc(@time);
}

The profile provider sets up a probe that fires every 5 seconds on a single CPU. The two actions used here are commonly used when periodically displaying aggregation data:

  • printa(). This function prints aggregation data. This example uses the default formatting, but we can control output by using modifiers in much the same way as with printf(). Note that we refer to the aggregation result (that is, the value returned from the aggregation function) by using the @ formatting character with the appropriate modifier. The above printa() could be rewritten with

    printa("%-56s %@10d\n", @time);
    
  • trunc(). This function truncates an aggregation or removes its current contents altogether. If the optional second argument is omitted, trunc() deletes all the keys and aggregation results. Specifying a second argument, n, removes all but the top n keys and aggregation values.
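Combining the two actions, the periodic clause shown earlier could be restricted to the ten busiest applications; this sketch assumes the @time aggregation from the sched example above:

profile:::tick-5s
{
        trunc(@time, 10);       /* keep only the top 10 entries */
        printa(@time);
        trunc(@time);           /* then empty the aggregation */
}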

Probe Arguments

In DTrace, probe arguments are made available through one of two mechanisms, depending on which provider is responsible for the probe:

  • args[]. The args[] array presents a typed array of arguments for the current probe. args[0] is the first argument, args[1] the second, and so on. The providers whose probe arguments are presented through the args[] array include fbt, sched, io, and proc.

  • arg0 ... arg9. The argn built-in variables are accessible by all probes. They are raw 64-bit integer quantities and, as such, must be cast to the appropriate type.

For an example of argument usage, let’s look at a script based on the fbt provider. The Solaris kernel, like any other program, is made up of many functions that offer well-defined interfaces to perform specific operations. We often want to ask pertinent questions upon entry to a function, such as, What was the value of its third argument? or upon exit from a function, What was the return value? For example:

fbt:ufs:ufs_read:entry
/uid == 1003/
{
        self->path = stringof(args[0]->v_path);
        self->ts = timestamp;
}

fbt:ufs:ufs_read:return
/self->path != NULL/
{
        @[self->path] = max(timestamp - self->ts);
        self->path = 0;
        self->ts = 0;
}

This example looks at all the reads performed through ufs file systems by a particular user (UID 1003) and, for each file, records the maximum time taken to carry out the read call. A few new things require further explanation.

The name of the file being read from is stored in the thread-local variable, self->path, with the following statement:

self->path = stringof(args[0]->v_path);

The main point to note here is the use of the args[] array to reference the first argument (args[0]) of the ufs_read function. Using MDB, we can inspect the arguments of ufs_read:

sol10# mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs ip sctp usba uhci fctl s1394 nca
lofs audiosup nfs random sppp crypto ptm ipc ]
> ufs_read::nm -f ctype
C Type
int (*)(struct vnode *, struct uio *, int, struct cred *, struct caller_context *)

The first argument to ufs_read() is a pointer to a vnode structure (struct vnode *). The path name of the file that is represented by that vnode is stored in the v_path member of the vnode structure and can be accessed through args[0]->v_path. Using MDB again, we inspect the type of the v_path member variable.

> ::print -t struct vnode v_path
char *v_path

The v_path member is a character pointer and needs to be converted to DTrace’s native string type. In DTrace a string is a built-in data type. The stringof() action is one of many features that allow easy manipulation of strings. It converts the char * representation of v_path into the DTrace string type.

If the arg0 built-in variable had been used, a cast would be required and would be written as this:

self->path = stringof(((struct vnode *)arg0)->v_path);

The predicate associated with the ufs_read:return probe ensures that its actions are only executed for files with a non-NULL path name. The action then uses the path name stored in the self->path variable to index an aggregation, and the max() aggregating function tracks the maximum time taken for reads against this particular file. For example:

sol10# ./ufs.d
dtrace: script './ufs.d' matched 2 probes
^C

  /lib/ld.so.1                                                    3840
  /usr/share/lib/zoneinfo/GB                                      5523
  <output elided>
  /usr/share/man/man1/ls.1                                     2599836
  /./usr/bin/more                                              3941344
  /./usr/bin/tbl                                               3988087
  /usr/share/lib/pub/eqnchar                                   4397573
  /usr/share/lib/tmac/an                                       5054675
  /./usr/bin/nroff                                             7004599
  /./usr/bin/neqn                                              7021088
  /./usr/bin/col                                               9989462
  /usr/share/man/windex                                       13742938
  /./usr/bin/man                                              17179129

Now let’s look at a syscall-based example of return probe use. The following script exploits the fact that syscall return probes have the call’s return value stored in arg0 and its error code in the errno built-in variable.

syscall::open*:entry
{
        self->path = arg0;
}
syscall::open*:return
/self->path != NULL && (int)arg0 == -1 && errno == EACCES/
{
        printf("UID %d permission denied to open %s\n",
            uid, copyinstr(self->path));
        self->path = 0;
}

The first clause enables probes for the open(2) and open64(2) system calls. It then stores the address of the buffer, which contains the file name to open, in the thread-local variable self->path.

The second clause enables the corresponding syscall return probes. The conditions of interest are laid out in the predicate:

  • The stored file name buffer isn’t a NULL pointer (self->path != NULL).

  • The open failed (arg0 == -1).

  • The open failed owing to insufficient permissions (errno == EACCES).

If the above conditions are all true, then a message is printed specifying the UID that induced the condition and the file for which permissions were lacking.

sol10# ./open.d
UID 39079 permission denied to open /etc/shadow

Finally, a note regarding the copyinstr() action used in the second clause above: All probes, predicates, and associated actions are executed in the kernel, and therefore any data that originates in userland must be copied into the kernel to be used. The buffer that contains the file name to be opened in our example is a buffer that resides in a userland application. For the contents to be printed, the buffer must be copied to the kernel address space and converted into a DTrace string type; this is what copyinstr() does.
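As another small sketch of the same idea (the probe choice here is illustrative, not from the open example), the file name passed to chdir(2) also arrives as a userland pointer in arg0 and must be copied in before it can be printed:

```d
/* Illustrative sketch: print each chdir(2) target. The path lives in
 * userland, so copyinstr() copies it into the kernel as a D string.
 */
syscall::chdir:entry
{
        printf("%s called chdir(\"%s\")\n", execname, copyinstr(arg0));
}
```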

Mixing Providers

DTrace gives us the freedom to observe interactions across many different subsystems. The following slightly larger script demonstrates how we can follow all the work done in userland and the kernel by a given application function. We can use dtrace -p to attach to and instrument a running process. For example, we can use a script that looks at the function getgr_lookup() in the name services cache daemon. The getgr_lookup() function is called to translate group IDs and group names. Note that here we are interested in the principle of examining a particular function; the actual program and function chosen here are irrelevant.

#pragma D option flowindent

pid$target:a.out:getgr_lookup:entry
{
        self->in = 1;
}

pid$target:::entry,
pid$target:::return
/self->in/
{
        printf("(pid)\n");
}

fbt:::entry,
fbt:::return
/self->in/
{
        printf("(fbt)\n");
}

pid$target:a.out:getgr_lookup:return
/self->in/
{
        self->in = 0;
        exit(0);
}

The #pragma D option flowindent directive at the start of the script means that indentation will be increased on entry to a function and reduced on the same function’s return. Showing function calls in a nested manner like this makes the output much more readable.

The pid provider instruments userland applications. The process to be instrumented is specified with the $target macro argument, which always expands to the PID of the process being traced when we attach to the process by using the -p option to dtrace(1M).
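As a minimal sketch of the pid provider and the $target macro (the script and process names here are illustrative, not part of the example above), the following clause counts malloc() calls in a running process by requested size:

```d
/* Illustrative sketch: aggregate malloc() calls in the target process.
 * For the pid provider's entry probes, arg0 is the function's first
 * argument -- here, the allocation size.
 */
pid$target:libc.so.1:malloc:entry
{
        @sizes[arg0] = count();
}
```

This could be run with a command such as dtrace -s malloc.d -p `pgrep -n myapp`, where malloc.d and myapp are hypothetical names.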

The second clause enables all the entry and return probes in the nscd process, and the third clause enables every entry and return probe in the kernel. The predicate in both of these clauses specifies that we are only interested in executing the actions if the thread-local self->in variable is set. This variable is set to 1 when nscd’s getgr_lookup() function is entered and set to 0 on exit from this function (that is, when the return probe is fired). For example:

sol10# dtrace -s ./nscd.d -p `pgrep nscd`
dtrace: script './nscd.d' matched 43924 probes
CPU FUNCTION
  0  -> getgr_lookup                           (pid)
  0    -> mutex_lock                           (pid)
  0      -> mutex_lock_impl                    (pid)
  0      <- mutex_lock_impl                    (pid)
  0    <- mutex_lock                           (pid)
  0    -> _xstat                               (pid)
  0      -> copyout                            (fbt)
  0      <- kcopy                              (fbt)
  0      -> syscall_mstate                     (fbt)
  0        -> gethrtime_unscaled               (fbt)
  0        <- gethrtime_unscaled               (fbt)
  0      <- syscall_mstate                     (fbt)
  0      -> syscall_entry                      (fbt)
 <output elided>
  0      <- syscall_exit                       (fbt)
  0      -> syscall_mstate                     (fbt)
  0        -> gethrtime_unscaled               (fbt)
  0        <- gethrtime_unscaled               (fbt)
  0      <- syscall_mstate                     (fbt)
  0    <- _xstat                               (pid)
  0    -> get_hash                             (pid)
  0      -> abs                                (pid)
  0        -> copyout                          (fbt)
  0        <- kcopy                            (fbt)
  0      <- abs                                (pid)
  0    <- get_hash                             (pid)
  0    -> memcpy                               (pid)
  0      -> copyout                            (fbt)
  0      <- kcopy                              (fbt)
  0    <- memcpy                               (pid)
  0    -> mutex_unlock                         (pid)
  0    <- mutex_unlock                         (pid)
  0   | getgr_lookup:return                    (pid)
  0  <- getgr_lookup

Accessing Global Kernel Data

DTrace provides a very useful feature by which we can access symbols defined in the Solaris kernel from within a D script. We can use the backquote character (`) to refer to kernel symbols, and this information can be used to great advantage when we are exploring the behavior of a Solaris kernel. For example, a variable named mpid is declared in the Solaris kernel source to keep track of the last PID that was allocated. It is declared in uts/common/os/pid.c as follows:

static pid_t mpid;

The following script uses this variable to calculate the rate of process creation on the system and to output a message if it exceeds a given amount (10 processes per second in this case):

dtrace:::BEGIN
{
        cnt = `mpid;
}

profile:::tick-1s
/`mpid < cnt + 10/
{
        cnt = `mpid;
}

profile:::tick-1s
/`mpid >= cnt + 10/
{
        printf("High process creation rate: %d/sec\n", `mpid - cnt);
        cnt = `mpid;
}

The first clause uses the BEGIN probe from the dtrace provider to initialize a global variable (cnt) to the current value of the mpid kernel variable.

The BEGIN, END, and ERROR probes are special probes that belong to the dtrace provider. These probes are essentially virtual probes in that they aren’t associated with any code location or timer source. The BEGIN probe fires before any other probes when we start the tracing session and allows us to perform tasks such as data initialization. The END probe is called when the tracing session is terminated either with a Control-C or an explicit call to the exit() action. Its main function is to print data collected during the execution of the script. The ERROR probe is less commonly used; it is called upon abnormal termination of the script.
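The typical pairing of BEGIN and END can be sketched as follows (this script is illustrative and not part of the example above): BEGIN records a starting timestamp, and END reports how long the session ran before printing the collected aggregation:

```d
dtrace:::BEGIN
{
        start = timestamp;      /* nanoseconds since boot */
}

syscall:::entry
{
        @calls[probefunc] = count();
}

dtrace:::END
{
        /* Fires on Control-C; summarize what was collected. */
        printf("Traced for %d seconds\n", (timestamp - start) / 1000000000);
        printa(@calls);
}
```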

The next two clauses in the previous example both enable the profile:::tick-1s probe. The probe fires every second, and the two clauses are executed in the order specified in the script. The important thing to note is that the predicates in the two clauses contain mutually exclusive logic, which ensures that only one of them is true at any one time: either ten or more processes have been created in the last second or they haven’t!

The predicate in the first profile:::tick-1s clause specifies that its actions should be executed only if fewer than ten processes have been created in the last second (the `mpid variable is within ten of its value one second ago, as stored in the cnt variable). If so, the cnt variable is updated with the current value of `mpid.

The actions in the second clause are executed when ten or more processes have been created in the last second (note that if cnt was just updated by the first clause, this predicate cannot also be true). A message is printed with the creation rate, and the cnt variable is updated. For example:

sol10# ./scope.d
High process creation rate: 30/sec
High process creation rate: 31/sec
High process creation rate: 35/sec
High process creation rate: 35/sec
High process creation rate: 44/sec
High process creation rate: 44/sec
High process creation rate: 20/sec

Assorted Actions of Interest

DTrace defines numerous actions, only a small percentage of which are used in this book. Actions that you may see used include normalize(), stack(), and ustack().

  • normalize(). This action effectively divides the values in the aggregation by a supplied normalization factor. A simple example is the use of a tick-5s probe to display data that you want displayed as a per-second rate:

    syscall::read:entry
    {
            @reads[probefunc] = count();
    }
    
    tick-5s
    {
            printa("%s (non normalized) %50@d\n\n", @reads);
            normalize(@reads, 5);
            printa("%s (  normalized  ) %50@d\n", @reads);
            trunc(@reads);
    }
    

    The above example uses a single aggregation, @reads, to store the number of read system calls made. Every 5 seconds the contents of the aggregation are displayed by printa() and then divided by 5 to give a per-second value with the normalize() action. The normalized aggregation is then printed and its contents are deleted with the trunc() action. For example,

    sol10# ./norm.d
    read (non normalized)                                                5012
    
    read (  normalized  )                                                1002
    
  • stack(). This action records the kernel thread’s stack trace at the time the probe fires. It is commonly used as an aggregation key to determine the most frequent call stacks leading to a given probe. It can also be an invaluable tool for learning how code flows through the kernel, because it gives a ready view of the call sequence up to a given point. The following script and output show the most frequently executed call sequence through the ufs file system over a given trace period.

    fbt:ufs::entry
    {
            @ufs[stack()] = count();
    }
    
    END
    {
            trunc(@ufs, 1);
    }
    
    sol10# ./stack.d
    dtrace: script './stack.d' matched 419 probes
    ^C
    CPU     ID                    FUNCTION:NAME
      0      2                             :END
    
    
                  genunix`fop_lookup+0x18
                  genunix`lookuppnvp+0x371
                  genunix`lookuppnat+0x11a
                  genunix`lookupnameat+0x8b
                  genunix`cstatat_getvp+0x16d
                  genunix`cstatat64_32+0x48
                  genunix`lstat64_32+0x25
                  unix`sys_syscall32+0x101
                18650
    
  • ustack(). This action is the equivalent of stack() for userland applications. The following script and output display the stack trace of the userland application that is generating most of the work in the ufs code.

    #!/usr/sbin/dtrace -s
    
    fbt:ufs::entry
    {
            @ufs[ustack()] = count();
    }
    
    END
    {
            trunc(@ufs, 1);
    }
    
    sol10# ./ustack.d
    dtrace: script './ustack.d' matched 419 probes
    ^C
    CPU     ID                    FUNCTION:NAME
      0      2                             :END
    
    
                  libc.so.1`lstat64+0x7
                  libc.so.1`walk+0x44b
                  libc.so.1`walk+0x44b
                  libc.so.1`walk+0x44b
                  libc.so.1`walk+0x44b
                  libc.so.1`walk+0x44b
                  libc.so.1`nftw64+0x185
                  find`main+0x2b6
                  find`0x80513d2
               231489
    

The find(1) application is at the top of the list here. The walk() routine is listed multiple times because it is recursively called to walk a file tree.

Inspecting Java Applications with DTrace

This section presents two sample applications that demonstrate the interaction of the Mustang Java HotSpot Virtual Machine and the Solaris 10 DTrace Framework. The first example, Java2Demo, is bundled with the Mustang release and will already be familiar to most developers. Because the hotspot provider is built into the Mustang VM itself, running the application is all that is required to trigger probe activity. The second example is a custom debugging scenario that uses DTrace to find a troublesome line of native code in a Java Native Interface (JNI) application.

The following script, written in the D programming language, defines the set of probes that DTrace will listen to while the Java2Demo application is running. In this case, the only probes of interest are those related to garbage collection.

#!/usr/sbin/dtrace -Zs

#pragma D option quiet

self int cnt;

dtrace:::BEGIN
{
        self->cnt = 0;
        printf("Ready..\n");
}

hotspot$1:::gc-begin
/self->cnt == 0/
{
        self->tid = tid;
        self->cnt++;
}

hotspot$1:::*
/self->cnt != 0 /
{
        printf("  tid: %d, Probe: %s\n", tid, probename);
}

hotspot$1:::gc-end
{
        printf("  tid: %d, D-script exited\n", tid);
        exit(0);
}

To run this example:

  1. Start the sample application: java -jar Java2Demo.jar.

  2. Note the application’s PID (11201 for this example).

  3. Start the D script, passing in the PID as its only argument: hotspot_gc.d 11201.

The following output shows that DTrace prints the thread ID and probe name as each probe fires in response to garbage collection activity in the VM:

Ready..
 tid: 4, Probe: gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-begin
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: mem-pool-gc-end
 tid: 4, Probe: gc-end
 tid: 4, D-script exited

The next script shows the thread ID (tid) and probe name in all probes; the class name, method name, and signature in the method-compile-begin probe; and the method name and signature in the compiled-method-load probe:

#!/usr/sbin/dtrace -Zs

#pragma D option quiet

self int cnt;

dtrace:::BEGIN
{
        self->cnt = 0;
        printf("Ready..\n");
}
hotspot$1:::method-compile-begin
/self->cnt == 0/
{
        self->tid = tid;
        self->cnt++;
        printf(" tid: %d, %21s, %s.%s %s\n", tid, probename,
            copyinstr(arg2), copyinstr(arg4), copyinstr(arg6));
}

hotspot$1:::method-compile-end
/self->cnt > 0/
{
        printf(" tid: %d, %21s\n", tid, probename);
}

hotspot$1:::compiled-method-load
/self->cnt > 0/
{
        printf(" tid: %d, %21s, %s %s\n", tid, probename,
            copyinstr(arg2), copyinstr(arg4));
}

hotspot$1:::vm-shutdown
{
        printf(" tid: %d, %21s\n", tid, probename);
        printf(" tid: %d, D-script exited\n", tid);
        exit(0);
}

hotspot$1:::*
/self->cnt > 0/
{
        printf(" tid: %d, %21s, %s %s\n", tid, probename,
            copyinstr(arg2), copyinstr(arg4));
}

Its output shows:

Ready..
tid: 9,  method-compile-begin, sun/java2d/SunGraphics2D.setFont (Ljava/awt/Font;)V
tid: 9,  compiled-method-load, setFont (Ljava/awt/Font;)V
tid: 9,    method-compile-end
tid: 9,  method-compile-begin, sun/java2d/SunGraphics2D validateCompClip
tid: 9,  compiled-method-load, validateCompClip ()V
tid: 9,    method-compile-end
tid: 8,  method-compile-begin, javax/swing/RepaintManager addDirtyRegion0
tid: 8,  compiled-method-load, addDirtyRegion0 (Ljava/awt/Container;IIII)V
tid: 8,    method-compile-end
tid: 9,  method-compile-begin, java/io/BufferedInputStream read
tid: 9,  compiled-method-load, read ()I
tid: 9,    method-compile-end
tid: 8,  method-compile-begin, java/awt/geom/AffineTransform translate
tid: 8,  compiled-method-load, translate (DD)V
tid: 8,    method-compile-end
tid: 9,  method-compile-begin, sun/awt/X11/Native getInt
tid: 9,  compiled-method-load, getInt (J)I
tid: 9,    method-compile-end
tid: 8,  method-compile-begin, sun/java2d/SunGraphics2D setColor
tid: 8,  compiled-method-load, setColor (Ljava/awt/Color;)V
tid: 8,    method-compile-end
tid: 9,  method-compile-begin, sun/reflect/GeneratedMethodAccessor1 invoke
tid: 9,  compiled-method-load, invoke (Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;
tid: 9,    method-compile-end
tid: 9,  method-compile-begin, sun/java2d/SunGraphics2D constrain
tid: 9,  compiled-method-load, constrain (IIII)V
tid: 9,    method-compile-end
tid: 8,  method-compile-begin, java/awt/Rectangle setLocation
tid: 8,  compiled-method-load, setLocation (II)V
tid: 8,    method-compile-end
tid: 9,  method-compile-begin, java/awt/Rectangle move
tid: 9,  compiled-method-load, move (II)V
tid: 9,    method-compile-end
tid: 8,  method-compile-begin, java/lang/Number <init>
tid: 8,  compiled-method-load, <init> ()V
tid: 8,    method-compile-end
tid: 8,  method-compile-begin, sun/awt/X11/XToolkit getAWTLock
tid: 8,  compiled-method-load, getAWTLock ()Ljava/lang/Object;
tid: 8,    method-compile-end
tid: 17,           vm-shutdown
tid: 17, D-script exited

The next example demonstrates a debugging session with the hotspot_jni provider. Consider an application suspected of calling Java Native Interface (JNI) functions from within a critical region. A JNI critical region is the span between calls to the JNI methods GetPrimitiveArrayCritical and ReleasePrimitiveArrayCritical, and there are important rules about what is allowed within that span. Chapter 4 of the JNI 5.0 Specification makes it clear that within this region, “Native code should not run for an extended period of time before it calls ReleasePrimitiveArrayCritical.” In addition, “Native code must not call other JNI functions, or any system call that may cause the current thread to block and wait for another Java thread.”

The following D script will inspect a JNI application for this kind of violation:

#!/usr/sbin/dtrace -Zs

#pragma D option quiet

self int in_critical_section;

dtrace:::BEGIN
{
        printf("ready..\n");
}

hotspot_jni$1:::ReleasePrimitiveArrayCritical_entry
{
        self->in_critical_section = 0;
}
hotspot_jni$1:::GetPrimitiveArrayCritical_entry
{
        self->in_critical_section = 0;
}

hotspot_jni$1:::*
/self->in_critical_section == 1/
{
        printf("JNI call %s made from JNI critical region\n", probename);
}

hotspot_jni$1:::GetPrimitiveArrayCritical_return
{
        self->in_critical_section = 1;
}

syscall:::entry
/pid == $1 && self->in_critical_section == 1/
{
        printf("system call %s made in JNI critical region\n", probefunc);
}

Output:

system call brk made in JNI critical region
system call brk made in JNI critical region
system call ioctl made in JNI critical region
system call fstat64 made in JNI critical region
JNI call FindClass_entry made from JNI critical region
JNI call FindClass_return made from JNI critical region

From this DTrace output, we can see that the probes FindClass_entry and FindClass_return have fired due to a JNI function call within a critical region. The output also shows some system calls related to calling printf() in the JNI critical region. The native code for this application shows the guilty function:

#include "t.h"
/*
 * Class:     t
 * Method:    func
 * Signature: ([I)V
 */
JNIEXPORT void JNICALL Java_t_func
  (JNIEnv *env, jclass clazz, jintArray array) {
   int* value = (int*)env->GetPrimitiveArrayCritical(array, NULL);
   printf("hello world");
   env->FindClass("java/lang/Object");
   env->ReleasePrimitiveArrayCritical(array, value, JNI_ABORT);
}

Inspecting Applications with the DTrace jstack Action

Mustang is the first release to contain built-in DTrace probes, but support for the DTrace jstack() action was actually first introduced in the Java 2 Platform Standard Edition 5.0 Update Release 1. The DTrace jstack() action prints mixed-mode stack traces, including both Java method and native function names. As an example of its use, consider the following application, which periodically sleeps to mimic hanging behavior:

public class dtest{
    int method3(int stop){
        try{Thread.sleep(500);}
        catch(Exception ex){}
        return stop++;
    }
    int method2(int stop){
        int result = method3(stop);
        return result + 1;
    }
    int method1(int arg){
        int stop=0;
        for(int i=1; i>0; i++){
            if(i>arg){stop=i=1;}
            stop=method2(stop);
        }
        return stop;
    }
    public static void main(String[] args) {
        new dtest().method1(10000);
    }
}

To find the cause of the hang, the user would want to know the chain of native and Java method calls in the currently executing thread. The expected chain would be something like:

<chain of initial VM functions> ->
dtest.main -> dtest.method1 -> dtest.method2 -> dtest.method3 ->
java/lang/Thread.sleep -> <chain of VM sleep functions> ->
<kernel poll functions>

The following D script (usejstack.d) uses the DTrace jstack() action to print the stack trace:

#!/usr/sbin/dtrace -s

BEGIN { cnt = 0; }

syscall::pollsys:entry
/pid == $1 && tid == 1/
{
   cnt++;
   printf("\n\tTID: %d", tid);
   jstack(50);
}

syscall:::entry
/cnt == 1/
{
   exit(0);
}

And the stack trace itself appears as follows:

$ usejstack.d 1344 | c++filt
CPU     ID                    FUNCTION:NAME
 0    316                    pollsys:entry
   TID: 1
   libc.so.1`__pollsys+0xa
   libc.so.1`poll+0x52
   libjvm.so`int os_sleep(long long,int)+0xb4
   libjvm.so`int os::sleep(Thread*,long long,int)+0x1ce
   libjvm.so`JVM_Sleep+0x1bc
   java/lang/Thread.sleep
   dtest.method3
   dtest.method2
   dtest.method1
   dtest.main
   StubRoutines (1)
   libjvm.so`void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArguments*,Thread*)+0x1b5
   libjvm.so`void os::os_exception_wrapper(void(*)(JavaValue*,methodHandle*,JavaCallArguments*,Thread*),JavaValue*,methodHandle*,JavaCallArguments*,Thread*)+0x18
   libjvm.so`void JavaCalls::call(JavaValue*,methodHandle,JavaCallArguments*,Thread*)+0x2d
   libjvm.so`void jni_invoke_static(JNIEnv_*,JavaValue*,_jobject*,JNICallType,_jmethodID*,JNI_ArgumentPusher*,Thread*)+0x214
   libjvm.so`jni_CallStaticVoidMethod+0x244
   java`main+0x642
   StubRoutines (1)

The command line shows that the output from this script was piped to the c++filt utility, which demangles C++ names, making the output easier to read. The DTrace header output shows that the CPU number is 0, the probe ID is 316, the thread ID (TID) is 1, and the probe name is pollsys:entry, where pollsys is the name of the system call. The stack trace frames appear from top to bottom in the following order: two system call frames, three VM frames, and five Java method frames; the remaining frames are VM frames.

It is also worth noting that the DTrace jstack action will run on older releases, such as the Java 2 Platform, Standard Edition version 1.4.2, but hexadecimal addresses will appear instead of Java method names. Such addresses are of little use to application developers.

Adding Probes to Pre-Mustang Releases

In addition to the jstack() action, it is also possible for pre-Mustang users to add DTrace probes to their release with the help of VM agents. A VM agent is a shared library that is dynamically loaded into the VM at startup.

VM agents are available for the following releases:

  • For The Java 2 Platform, Standard Edition, version 1.4.2, there is a dvmpi agent that uses the Java Virtual Machine Profiler Interface (JVMPI).

  • For The Java 2 Platform Standard Edition 5.0, there is a dvmti agent that uses the JVM Tool Interface (JVM TI).

To obtain the agents, visit the DVM java.net project website at

https://solaris10-dtrace-vm-agents.dev.java.net/

and follow the “Documents and Files” link. The file dvm.zip contains both binary and source code versions of the agent libraries.

The following diagram shows an abbreviated view of the resulting directory structure once dvm.zip has been extracted:

    build    make    src    test
      |
      +--------+--------+---------+
      |        |        |         |
    amd64    i386     sparc    sparcv9
      |        |        |         |
     lib      lib      lib       lib

Each lib directory contains the pre-built binaries dvmti.jar, libdvmpi.so, and libdvmti.so. If you prefer to compile the libraries yourself, the included README file contains all necessary instructions.

Once the archive is unzipped, the VM must be able to find the native libraries on the file system. This can be accomplished either by copying the libraries into the release alongside the other shared libraries, or by using a platform-specific mechanism, such as LD_LIBRARY_PATH, to help the process find them. In addition, the agent library itself must be able to find all the external symbols that it needs; the ldd utility can be used to verify that a native library can resolve all of its required externals.

Both agents accept options to limit the probes that are available, and default to the least possible performance impact. To enable the agents for use in your own applications, run the java command with one of the following additional options:

  • -Xrundvmpi

  • -Xrundvmti (for defaults)

  • -Xrundvmpi:all

  • -Xrundvmti:all (for all probes)

For additional options, consult the DVM agent README. Both agents have their limitations, but dvmpi has more, and we recommend using the Java Standard Edition 5.0 Development Kit (JDK 5.0) and the dvmti agent if possible.

When using the agent-based approach, keep in mind that:

  • The dvmpi agent uses JVMPI and only works with one collector. JVMPI has historically been an unstable, experimental interface, and there is a performance penalty associated with using it. JVMPI only works with JDK 5.0 and earlier.

  • The dvmti agent uses JVM TI and only works with JDK 5.0 and later. It works with all collectors, has little performance impact for most probes, and is a formal and much more stable interface.

  • Both agents have some performance penalty for method entry/exit and object alloc/free, less so with the dvmti agent.

  • The dvmti agent uses BCI (byte code instrumentation), and therefore adds bytecodes to methods (if method entry/exit or object alloc/free probes are active).

  • Enabling the allocation event for the JVMTI agent creates an overhead even when DTrace is not attached, and the JVMPI agent severely impacts performance and limits deployment to the serial collector.

Section C.1 provides a D script for testing DVM probes. The DVM agent provider interface, shown in Section C.2, lists all probes provided by dvmpi and dvmti.

DTrace Architecture

Although DTrace instruments code at both user and kernel level, the majority of the instrumentation and probe-processing activity takes place in the Solaris kernel. This section looks at the basic architecture of DTrace, provides a high-level overview of the process of instrumentation, and examines what happens when this instrumentation is activated.

Figure 10.1 presents the architecture of the DTrace subsystem.

DTrace Architecture

Figure 10.1. DTrace Architecture

Processes, known as consumers, communicate with the DTrace kernel subsystem through the interfaces provided in the DTrace library, libdtrace(3LIB). Data is transferred between consumers and the kernel by ioctl(2) calls on the dtrace pseudo-device provided by the dtrace(7d) device driver. Solaris 10 includes several consumers, such as lockstat(1M), plockstat(1M), and intrstat(1M), but generalized access to the DTrace facility is provided by the dtrace(1M) consumer. A consumer’s basic jobs are to communicate tracing specifications to the DTrace kernel subsystem and to process the data resulting from those specifications.

A key component of libdtrace is the D compiler. The role of a compiler is to transform a high-level language into the native machine language of the target processor, the high-level language in this case being D. However, DTrace implements its own virtual machine with its own machine-independent instruction set called DIF (D Intermediate Format), which is the target language for compilation. The tracing scripts we specify are transformed into the DIF language and emulated in the kernel when a probe fires, in much the same way as a Java virtual machine interprets Java bytecodes. One of the most important properties of DTrace is its ability to execute arbitrary code safely on production systems without inducing failure. The use of a runtime emulation environment ensures that errors such as dereferencing null pointers can be caught and dealt with safely.

The basic architecture and flow of the D compiler is shown in Figure 10.2.

DTrace Architecture Flow

Figure 10.2. DTrace Architecture Flow

The input D script is split up into tokens by the lexical analyzer; the tokens are used by the parser to build a parse tree. The code generator then makes several passes over the nodes in the parse tree and generates the DIF code for each of the nodes. The assembler then builds DIF Objects (DIFO) for the generated DIF. A DIFO stores the return type of the D expression encoded by this piece of DIF along with its string and variable tables. All the individual pieces of DIFO that constitute a D program are put together into a file. The format of this file is known as the DTrace Object Format (DOF). This DOF is then injected into the kernel and the system is instrumented.

Take as an example the following D clause:

syscall::write:entry
/execname == "foo" && uid == 1001/
{
        self->me = 1;
}

This clause contains two DIF objects, one for the predicate and one for the single action. We can use the -S option to dtrace to look at the DIF instructions generated when the clauses are compiled. Three DIF instructions are generated for the single action shown above.

OFF OPCODE      INSTRUCTION
00: 25000001    setx DT_INTEGER[0], %r1         ! 0x1
01: 2d050001    stts %r1, DT_VAR(1280)          ! DT_VAR(1280) = "me"
02: 23000001    ret  %r1

The DIF virtual machine is a simple RISC-like environment with a limited set of registers and a small instruction set. The first instruction loads register r1 with the first value in a DIFO-specific array of integer constants. The second instruction stores the value that is now in register r1 into the thread-specific variable me, which is referenced through the DIFO-specific variable table. The third instruction returns the value stored in register r1.

The encodings for DIF instructions are called opcodes; it is these that are stored in the DIFO. Each instruction is a fixed 4 bytes, so this DIFO contains 12 bytes of encoded DIF.

The DOF generated by the compilation process is sent to the DTrace kernel subsystem, and the system is instrumented accordingly. When a probe is enabled, an enabling control block (ECB) is created and associated with the probe (see Figure 10.3). An ECB holds some consumer-specific state and also the DIFOs for this probe enabling. If it is the first enabling for this probe, then the framework calls the appropriate provider, instructing it to enable this probe. Each ECB contains the DIFO for the predicates and actions associated with this enabling of the probe. All the enablings for a probe, whether by one or multiple consumers, are represented by ECBs that are chained together and processed in order when the probe fires. The order is dictated by the sequence in which the enablings appear in a D script and by the time at which the instrumentation occurs (new ECBs are appended to the end of the existing chain).

Figure 10.3. Enabling Control Blocks (ECBs)

The majority of the DTrace subsystem is implemented as a series of kernel modules, with the core framework implemented in dtrace(7D). The framework itself performs no actual instrumentation; that is the responsibility of loadable kernel modules called providers. The providers have intimate knowledge of specific subsystems: how they are instrumented and exactly what can be instrumented (each individual site being identified by a probe). When a consumer instructs a provider to enable a probe, the provider modifies the system appropriately. The modifications are specific to the provider, but all instrumentation methods achieve the same goal of transferring control into the DTrace framework to carry out the tracing directives for the given probe. This is achieved by execution of the dtrace_probe() function.

As an example of instrumentation, let’s look at how the entry point to the ufs_write() kernel function is instrumented by the fbt provider on the SPARC platform. A function begins with a well-known sequence of instructions, which the fbt provider looks for and modifies.

sol10 # mdb -k
Loading modules: [ unix krtld genunix dtrace specfs ufs ip sctp usba uhci fctl s1394 nca
lofs audiosup nfs random sppp crypto ptm ipc ]
> ufs_write::dis -n 1
ufs_write:                      save      %sp, -0x110, %sp
ufs_write+4:                    stx       %i4, [%sp + 0x8af]

The save instruction on the SPARC machine allocates stack space for the function to use, and most functions begin with this. If we enable fbt::ufs_write:entry in another window, ufs_write() now looks like this:

> ufs_write::dis -n 1
ufs_write:                      ba,a      +0x2bb388     <dt=0x3d96>
ufs_write+4:                    stx       %i4, [%sp + 0x8af]

The save instruction has been replaced with a branch to a different location. In this case, the location is the address of the first instruction in ufs_write + 0x2bb388. So, looking at the contents of that location, we see the following:

> ufs_write+0x2bb388::dis
0x14b36ec:                      save       %sp, -0x110, %sp
0x14b36f0:                      sethi      %hi(0x3c00), %o0
0x14b36f4:                      or         %o0, 0x196, %o0
0x14b36f8:                      mov        %i0, %o1
0x14b36fc:                      mov        %i1, %o2
0x14b3700:                      mov        %i2, %o3
0x14b3704:                      mov        %i3, %o4
0x14b3708:                      mov        %i4, %o5
0x14b370c:                      sethi      %hi(0x11f8000), %g1
0x14b3710:                      call       -0xe7720      <dtrace_probe>
0x14b3714:                      or         %g1, 0x360, %o7

The save instruction that was replaced is executed first. The next seven instructions set up the input arguments for the call to dtrace_probe(), which transfers control to the DTrace framework. The first argument loaded into register o0 is the probe ID for ufs_write, which is used to find the ECBs to be executed for this probe. The next five mov instructions copy the five input arguments for ufs_write so that they appear as arguments to dtrace_probe(). They can then be used when probe processing occurs.

This example illustrates how a kernel function’s entry point is instrumented. Instrumenting, for example, a system call entry point requires a very different instrumentation method. Placing the domain-specific knowledge in provider modules makes DTrace easily extensible in terms of instrumenting different software subsystems and different hardware architectures.

When a probe is fired, the instrumentation inserted by the provider transfers control into the DTrace framework and we are now in what is termed “probe context.” Interrupts are disabled for the executing CPU. The ECBs that are registered for the firing probe are iterated over, and each DIF instruction in each DIFO is interpreted. Data generated from the ECB processing is buffered in a set of per-consumer, per-CPU buffers that are read periodically by the consumer.

When a tracing session is terminated, all instrumentation carried out by providers is removed and the system returns to its original state.

Summary

DTrace is a revolutionary framework for instrumenting and observing the behavior of systems and the applications they run. What can be learned with DTrace is bounded only by the user's knowledge of the system and its applications, yet it is not necessary to be an operating systems expert or software developer to use it effectively. The usability of DTrace allows users at any level to gain insight into performance and general application behavior.

Probe Reference

The I/O Provider

The io probes are listed in Table 10.2, and the arguments are described in Sections 10.6.1.1 through 10.6.1.3.

Table 10.2. io Probes

Probe

Description

start

Probe that fires when an I/O request is about to be made either to a peripheral device or to an NFS server. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The devinfo_t of the device to which the I/O is being issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Note that file information availability depends on the filesystem making the I/O request. See fileinfo_t for more information.

done

Probe that fires after an I/O request has been fulfilled. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The done probe fires after the I/O completes, but before completion processing has been performed on the buffer. As a result B_DONE is not set in b_flags at the time the done probe fires. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2].

wait-start

Probe that fires immediately before a thread begins to wait pending completion of a given I/O request. The buf(9S) structure corresponding to the I/O request for which the thread will wait is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Some time after the wait-start probe fires, the wait-done probe will fire in the same thread.

wait-done

Probe that fires when a thread is done waiting for the completion of a given I/O request. The bufinfo_t corresponding to the I/O request for which the thread will wait is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. The wait-done probe fires only after the wait-start probe has fired in the same thread.
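
As a sketch of how the start and done probes can be paired, the following D script (illustrative, not from the original text; the array name start_ts is hypothetical) measures per-request I/O latency by recording a timestamp keyed on the device and block number of each request:

```d
/*
 * Sketch: time each I/O from start to done. b_edev and b_blkno are
 * members of the bufinfo_t pointed to by args[0].
 */
io:::start
{
        start_ts[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start_ts[args[0]->b_edev, args[0]->b_blkno]/
{
        @latency["I/O latency (ns)"] =
            quantize(timestamp - start_ts[args[0]->b_edev, args[0]->b_blkno]);
        start_ts[args[0]->b_edev, args[0]->b_blkno] = 0;
}
```

Because done fires in a different thread than start for most drivers, a global associative array (rather than a self-> variable) is used to match the two firings.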

bufinfo_t structure

The bufinfo_t structure is the abstraction that describes an I/O request. The buffer corresponding to an I/O request is pointed to by args[0] in the start, done, wait-start, and wait-done probes. The bufinfo_t structure definition is as follows:

typedef struct bufinfo {
              int b_flags;                    /* flags */
              size_t b_bcount;                /* number of bytes */
              caddr_t b_addr;                 /* buffer address */
              uint64_t b_blkno;               /* expanded block # on device */
              uint64_t b_lblkno;              /* block # on device */
              size_t b_resid;                 /* # of bytes not transferred */
              size_t b_bufsize;               /* size of allocated buffer */
              caddr_t b_iodone;               /* I/O completion routine */
              int b_error;                    /* expanded error field */
              dev_t b_edev;                   /* extended device */
} bufinfo_t;
                                                               See /usr/lib/dtrace/io.d

The b_flags member indicates the state of the I/O buffer, and consists of a bitwise-or of different state values. The valid state values are in Table 10.3.

Table 10.3. b_flags Values

Flag

Description

B_DONE

Indicates that the data transfer has completed.

B_ERROR

Indicates an I/O transfer error. It is set in conjunction with the b_error field.

B_PAGEIO

Indicates that the buffer is being used in a paged I/O request. See the description of the b_addr field for more information.

B_PHYS

Indicates that the buffer is being used for physical (direct) I/O to a user data area.

B_READ

Indicates that data is to be read from the peripheral device into main memory.

B_WRITE

Indicates that the data is to be transferred from main memory to the peripheral device.

B_ASYNC

The I/O request is asynchronous, and will not be waited upon. The wait-start and wait-done probes don't fire for asynchronous I/O requests. Note that some I/Os directed to be asynchronous might not have B_ASYNC set: the asynchronous I/O subsystem might implement the asynchronous request by having a separate worker thread perform a synchronous I/O operation.

The structure members are as follows:

  • b_bcount is the number of bytes to be transferred as part of the I/O request.

  • b_addr is the virtual address of the I/O request, unless B_PAGEIO is set. The address is a kernel virtual address unless B_PHYS is set, in which case it is a user virtual address. If B_PAGEIO is set, the b_addr field contains kernel private data. B_PHYS and B_PAGEIO are mutually exclusive: at most one of the two flags is set.

  • b_lblkno identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (such as the cylinder, track, and so on) is defined by the device.

  • b_resid is set to the number of bytes not transferred because of an error.

  • b_bufsize contains the size of the allocated buffer.

  • b_iodone identifies a specific routine in the kernel that is called when the I/O is complete.

  • b_error may hold an error code returned from the driver in the event of an I/O error. b_error is set in conjunction with the B_ERROR bit set in the b_flags member.

  • b_edev contains the major and minor device numbers of the device accessed. Consumers may use the D subroutines getmajor() and getminor() to extract the major and minor device numbers from the b_edev field.
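
As a brief sketch tying these members together (illustrative, not from the original text; the aggregation name @sizes is hypothetical), the following D clause aggregates transfer sizes by I/O direction and by the major and minor numbers extracted from b_edev:

```d
/* Distribution of I/O sizes by direction and device numbers. */
io:::start
{
        @sizes[args[0]->b_flags & B_READ ? "read" : "write",
            getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] =
            quantize(args[0]->b_bcount);
}
```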

devinfo_t

The devinfo_t structure provides information about a device. The devinfo_t structure corresponding to the destination device of an I/O is pointed to by args[1] in the start, done, wait-start, and wait-done probes. The members of devinfo_t are as follows:

typedef struct devinfo {
              int dev_major;                  /* major number */
              int dev_minor;                  /* minor number */
              int dev_instance;               /* instance number */
              string dev_name;                /* name of device */
              string dev_statname;            /* name of device + instance/minor */
              string dev_pathname;            /* pathname of device */
} devinfo_t;
                                                               See /usr/lib/dtrace/io.d
  • dev_major. The major number of the device. See getmajor(9F) for more information.

  • dev_minor. The minor number of the device. See getminor(9F) for more information.

  • dev_instance. The instance number of the device. The instance of a device is different from the minor number. The minor number is an abstraction managed by the device driver. The instance number is a property of the device node. You can display device node instance numbers with prtconf(1M).

  • dev_name. The name of the device driver that manages the device. You can display device driver names with the -D option to prtconf(1M).

  • dev_statname. The name of the device as reported by iostat(1M). This name also corresponds to the name of a kernel statistic as reported by kstat(1M). This field is provided so that aberrant iostat or kstat output can be quickly correlated to actual I/O activity.

  • dev_pathname. The full path of the device. This path may be specified as an argument to prtconf(1M) to obtain detailed device information. The path specified by dev_pathname includes components expressing the device node, the instance number, and the minor node. However, all three of these elements aren’t necessarily expressed in the statistics name. For some devices, the statistics name consists of the device name and the instance number. For other devices, the name consists of the device name and the number of the minor node. As a result, two devices that have the same dev_statname may differ in dev_pathname.
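
For example, a minimal D clause (a sketch, not from the original text; the aggregation name @ios is hypothetical) that correlates I/O activity with iostat output via dev_statname might look like:

```d
/* Count I/O requests per iostat device name and full device path. */
io:::start
{
        @ios[args[1]->dev_statname, args[1]->dev_pathname] = count();
}
```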

fileinfo_t

The fileinfo_t structure provides information about a file. The file to which an I/O corresponds is pointed to by args[2] in the start, done, wait-start, and wait-done probes. The presence of file information is contingent upon the filesystem providing this information when dispatching I/O requests. Some filesystems, especially third-party filesystems, might not provide this information. Also, I/O requests might emanate from a filesystem for which no file information exists. For example, any I/O to filesystem metadata will not be associated with any one file. Finally, some highly optimized filesystems might aggregate I/O from disjoint files into a single I/O request. In this case, the filesystem might provide the file information either for the file that represents the majority of the I/O or for the file that represents some of the I/O. Alternately, the filesystem might provide no file information at all in this case.

The definition of the fileinfo_t structure is as follows:

typedef struct fileinfo {
              string fi_name;                 /* name (basename of fi_pathname) */
              string fi_dirname;              /* directory (dirname of fi_pathname) */
              string fi_pathname;             /* full pathname */
              offset_t fi_offset;             /* offset within file */
              string fi_fs;                   /* filesystem */
              string fi_mount;                /* mount point of file system */
} fileinfo_t;
                                                               See /usr/lib/dtrace/io.d
  • fi_name. Contains the name of the file but does not include any directory components. If no file information is associated with an I/O, the fi_name field will be set to the string <none>. In some rare cases, the pathname associated with a file might be unknown. In this case, the fi_name field will be set to the string <unknown>.

  • fi_dirname. Contains only the directory component of the file name. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.

  • fi_pathname. Contains the full pathname to the file. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.

  • fi_offset. Contains the offset within the file, or -1 if either file information is not present or if the offset is otherwise unspecified by the filesystem.
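
Putting these fields to use, the following sketch (hypothetical; the aggregation name @bytes is mine) sums bytes transferred per process and file, skipping I/O for which no file information is present, as indicated by the <none> sentinel described above:

```d
/* Bytes of I/O per process and file pathname. */
io:::start
/args[2]->fi_name != "<none>"/
{
        @bytes[execname, args[2]->fi_pathname] = sum(args[0]->b_bcount);
}
```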

Virtual Memory Provider Probes

The vminfo provider probes correspond to the fields in the “vm” named kstat: a probe provided by vminfo fires immediately before the corresponding vm value is incremented. Table 10.4 lists the probes available from the VM provider. Each probe takes the following arguments:

  • arg0. The value by which the statistic is to be incremented. For most probes, this argument is always 1, but for some it may take other values; these probes are noted in Table 10.4.

  • arg1. A pointer to the current value of the statistic to be incremented. This value is a 64-bit quantity that is incremented by the value in arg0. Dereferencing this pointer allows consumers to determine the current count of the statistic corresponding to the probe.

For example, if you saw the following paging activity with vmstat, indicating page-ins from the swap device, you could drill down with DTrace to investigate.

# vmstat -p 3
     memory           page          executable      anonymous      filesystem
   swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
 1512488 837792 160 20 12   0   0    0    0    0 8102    0    0   12   12   12
 1715812 985116 7  82   0   0   0    0    0    0 7501    0    0   45    0    0
 1715784 983984 0   2   0   0   0    0    0    0 1231    0    0   53    0    0
 1715780 987644 0   0   0   0   0    0    0    0 2451    0    0   33    0    0

$ dtrace -n anonpgin'{@[execname] = count()}'
dtrace: description 'anonpgin' matched 1 probe
  svc.startd                                                       1
  sshd                                                             2
  ssh                                                              3
  dtrace                                                           6
  vmstat                                                          28
  filebench                                                      913

The VM probes are described in Table 10.4.

Table 10.4. DTrace VM Provider Probes and Descriptions

Probe Name

Description

anonfree

Fires whenever an unmodified anonymous page is freed as part of paging activity. Anonymous pages are those that are not associated with a file; memory containing such pages includes heap memory, stack memory, or memory obtained by explicitly mapping zero(7D).

anonpgin

Fires whenever an anonymous page is paged in from a swap device.

anonpgout

Fires whenever a modified anonymous page is paged out to a swap device.

as_fault

Fires whenever a fault is taken on a page and the fault is neither a protection fault nor a copy-on-write fault.

cow_fault

Fires whenever a copy-on-write fault is taken on a page. arg0 contains the number of pages that are created as a result of the copy-on-write.

dfree

Fires whenever a page is freed as a result of paging activity. Whenever dfree fires, exactly one of anonfree, execfree, or fsfree will also subsequently fire.

execfree

Fires whenever an unmodified executable page is freed as a result of paging activity.

execpgin

Fires whenever an executable page is paged in from the backing store.

execpgout

Fires whenever a modified executable page is paged out to the backing store. Most paging of executable pages occurs in terms of execfree; execpgout can fire only if an executable page is modified in memory, an uncommon occurrence in most systems.

fsfree

Fires whenever an unmodified file system data page is freed as part of paging activity.

fspgin

Fires whenever a file system page is paged in from the backing store.

fspgout

Fires whenever a modified file system page is paged out to the backing store.

kernel_asflt

Fires whenever a page fault is taken by the kernel on a page in its own address space. Whenever kernel_asflt fires, it will be immediately preceded by a firing of the as_fault probe.

maj_fault

Fires whenever a page fault is taken that results in I/O from a backing store or swap device. Whenever maj_fault fires, it will be immediately preceded by a firing of the pgin probe.

pgfrec

Fires whenever a page is reclaimed off of the free page list.
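
The page-in probes above can be combined in a single clause. The following sketch (illustrative; the aggregation name @pgin is hypothetical) attributes anonymous, executable, and file system page-ins to the processes causing them, summing arg0, the increment applied to each statistic:

```d
/* Page-in activity by process, split by page type. */
vminfo:::anonpgin,
vminfo:::execpgin,
vminfo:::fspgin
{
        @pgin[execname, probename] = sum(arg0);
}
```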

The Sched Provider

The sched probes are described in Table 10.5.

Table 10.5. sched Probes

Probe

Description

change-pri

Probe that fires whenever a thread’s priority is about to be changed. The lwpsinfo_t of the thread is pointed to by args[0]. The thread's current priority is in the pr_pri field of this structure. The psinfo_t of the process containing the thread is pointed to by args[1]. The thread’s new priority is contained in args[2].

dequeue

Probe that fires immediately before a runnable thread is dequeued from a run queue. The lwpsinfo_t of the thread being dequeued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU from which the thread is being dequeued is pointed to by args[2]. If the thread is being dequeued from a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1.

enqueue

Probe that fires immediately before a runnable thread is enqueued to a run queue. The lwpsinfo_t of the thread being enqueued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU to which the thread is being enqueued is pointed to by args[2]. If the thread is being enqueued to a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1. The value in args[3] is a boolean indicating whether the thread will be enqueued to the front of the run queue. The value is non-zero if the thread will be enqueued at the front of the run queue, and zero if the thread will be enqueued at the back of the run queue.

off-cpu

Probe that fires when the current CPU is about to end execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is ending execution. The curpsinfo variable describes the process containing the current thread. The lwpsinfo_t structure of the thread that the current CPU will next execute is pointed to by args[0]. The psinfo_t of the process containing the next thread is pointed to by args[1].

on-cpu

Probe that fires when a CPU has just begun execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is beginning execution. The curpsinfo variable describes the process containing the current thread.

preempt

Probe that fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the off-cpu probe will fire for the current thread. In some cases, a thread on one CPU will be preempted, but the preempting thread will run on another CPU in the meantime. In this situation, the preempt probe will fire, but the dispatcher will be unable to find a higher priority thread to run and the remain-cpu probe will fire instead of the off-cpu probe.

remain-cpu

Probe that fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is beginning execution. The curpsinfo variable describes the process containing the current thread.

schedctl-nopreempt

Probe that fires when a thread is preempted and then re-enqueued at the front of the run queue due to a preemption control request. See schedctl_init(3C) for details on preemption control. As with preempt, either off-cpu or remain-cpu will fire after schedctl-nopreempt. Because schedctl-nopreempt denotes a re-enqueuing of the current thread at the front of the run queue, remain-cpu is more likely to fire after schedctl-nopreempt than off-cpu. The lwpsinfo_t of the thread being preempted is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1].

schedctl-preempt

Probe that fires when a thread that is using preemption control is nonetheless preempted and re-enqueued at the back of the run queue. See schedctl_init(3C) for details on preemption control. As with preempt, either off-cpu or remain-cpu will fire after schedctl-preempt. Like preempt (and unlike schedctl-nopreempt), schedctl-preempt denotes a reenqueuing of the current thread at the back of the run queue. As a result, off-cpu is more likely to fire after schedctl-preempt than remain-cpu. The lwpsinfo_t of the thread being preempted is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1].

schedctl-yield

Probe that fires when a thread that had its time slice artificially extended through preemption control executes code to voluntarily yield the CPU to other threads.

sleep

Probe that fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t pointed to by curlwpsinfo. The address of the synchronization object is contained in the pr_wchan member of the lwpsinfo_t pointed to by curlwpsinfo. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.

surrender

Probe that fires when a CPU has been instructed by another CPU to make a scheduling decision—often because a higher-priority thread has become runnable.

tick

Probe that fires as a part of clock tick-based accounting. In clock tick-based accounting, CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. The lwpsinfo_t that corresponds to the thread that is being assigned CPU time is pointed to by args[0]. The psinfo_t that corresponds to the process that contains the thread is pointed to by args[1].

wakeup

Probe that fires immediately before the current thread wakes a thread sleeping on a synchronization object. The lwpsinfo_t of the sleeping thread is pointed to by args[0]. The psinfo_t of the process containing the sleeping thread is pointed to by args[1]. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t of the sleeping thread. The address of the synchronization object is contained in the pr_wchan member of the lwpsinfo_t of the sleeping thread. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.
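
A common use of the on-cpu and off-cpu pair is to measure how long threads run before giving up the CPU. The following sketch (illustrative, not from the original text; the aggregation name @oncpu is hypothetical) records a timestamp when a thread goes on CPU and computes the distribution of on-CPU times per executable when it comes off:

```d
/* Distribution of on-CPU intervals, in nanoseconds, per executable. */
sched:::on-cpu
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @oncpu[execname] = quantize(timestamp - self->ts);
        self->ts = 0;
}
```

The thread-local variable self->ts ensures each off-cpu firing is matched with the on-cpu firing in the same thread.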

Arguments

The argument types for the sched probes are listed in Table 10.6; the probes themselves are described in Table 10.5.

Table 10.6. sched Probe Arguments

Probe                   args[0]          args[1]        args[2]        args[3]

change-pri              lwpsinfo_t *     psinfo_t *     pri_t
dequeue                 lwpsinfo_t *     psinfo_t *     cpuinfo_t *
enqueue                 lwpsinfo_t *     psinfo_t *     cpuinfo_t *    int
off-cpu                 lwpsinfo_t *     psinfo_t *
on-cpu
preempt
remain-cpu
schedctl-nopreempt      lwpsinfo_t *     psinfo_t *
schedctl-preempt        lwpsinfo_t *     psinfo_t *
schedctl-yield          lwpsinfo_t *     psinfo_t *
sleep
surrender               lwpsinfo_t *     psinfo_t *
tick                    lwpsinfo_t *     psinfo_t *
wakeup                  lwpsinfo_t *     psinfo_t *

As Table 10.6 indicates, many sched probes have arguments consisting of a pointer to an lwpsinfo_t and a pointer to a psinfo_t, indicating a thread and the process containing the thread, respectively. These structures are described in detail in lwpsinfo_t and psinfo_t, respectively.

The cpuinfo_t structure defines a CPU. As Table 10.6 indicates, arguments to both the enqueue and dequeue probes include a pointer to a cpuinfo_t. Additionally, the cpuinfo_t corresponding to the current CPU is pointed to by the curcpu variable.

typedef struct cpuinfo {
              processorid_t cpu_id;           /* CPU identifier */
              psetid_t cpu_pset;              /* processor set identifier */
              chipid_t cpu_chip;              /* chip identifier */
              lgrp_id_t cpu_lgrp;             /* locality group identifier */
              processor_info_t cpu_info;      /* CPU information */
} cpuinfo_t;

The members of the cpuinfo_t structure are as follows:

  • cpu_id. The processor identifier, as returned by psrinfo(1M) and p_online(2).

  • cpu_pset. The processor set that contains the CPU, if any. See psrset(1M) for more details on processor sets.

  • cpu_chip. The identifier of the physical chip. Physical chips may contain several CPUs. See psrinfo(1M) for more information.

  • cpu_lgrp. The identifier of the latency group associated with the CPU. See liblgrp(3LIB) for details on latency groups.

  • cpu_info. The processor_info_t structure associated with the CPU, as returned by processor_info(2).
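
As a sketch of how the cpuinfo_t argument is used (illustrative; the aggregation name @runq is hypothetical), the following clause counts how often threads are enqueued on each CPU's run queue. A key of -1 indicates a run queue not associated with a particular CPU:

```d
/* Count enqueue operations per CPU run queue (-1 = no specific CPU). */
sched:::enqueue
{
        @runq[args[2]->cpu_id] = count();
}
```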

DTrace Lockstat Provider

The lockstat provider makes available probes that can be used to discern lock contention statistics or to understand virtually any aspect of locking behavior. The lockstat(1M) command is actually a DTrace consumer that uses the lockstat provider to gather its raw data.

The lockstat provider makes available two kinds of probes: contention-event probes and hold-event probes.

  • Contention-event probes. Correspond to contention on a synchronization primitive; they fire when a thread is forced to wait for a resource to become available. Solaris is generally optimized for the noncontention case, so prolonged contention is not expected. These probes should be used to understand those cases where contention does arise. Because contention is relatively rare, enabling contention-event probes generally doesn’t substantially affect performance.

  • Hold-event probes. Correspond to acquiring, releasing, or otherwise manipulating a synchronization primitive. These probes can be used to answer arbitrary questions about the way synchronization primitives are manipulated. Because Solaris acquires and releases synchronization primitives very often (on the order of millions of times per second per CPU on a busy system), enabling hold-event probes has a much higher probe effect than does enabling contention-event probes. While the probe effect induced by enabling them can be substantial, it is not pathological; they may still be enabled with confidence on production systems.

The lockstat provider makes available probes that correspond to the different synchronization primitives in Solaris; these primitives and the probes that correspond to them are discussed in the remainder of this chapter.

Adaptive Lock Probes

The four lockstat probes pertaining to adaptive locks are in Table 10.7. For each probe, arg0 contains a pointer to the kmutex_t structure that represents the adaptive lock.

Table 10.7. Adaptive Lock Probes

Probe Name

Description

adaptive-acquire

Hold-event probe that fires immediately after an adaptive lock is acquired.

adaptive-block

Contention-event probe that fires after a thread that has blocked on a held adaptive mutex has reawakened and has acquired the mutex. If both probes are enabled, adaptive-block fires before adaptive-acquire. At most one of adaptive-block and adaptive-spin fires for a single lock acquisition. arg1 for adaptive-block contains the sleep time in nanoseconds.

adaptive-spin

Contention-event probe that fires after a thread that has spun on a held adaptive mutex has successfully acquired the mutex. If both are enabled, adaptive-spin fires before adaptive-acquire. At most one of adaptive-spin and adaptive-block fires for a single lock acquisition. arg1 for adaptive-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

adaptive-release

Hold-event probe that fires immediately after an adaptive lock is released.
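
Since arg1 of adaptive-block carries the sleep time in nanoseconds, contention can be attributed to the blocking code paths. The following sketch (illustrative; the aggregation name @blocked is hypothetical) sums time spent blocked on adaptive mutexes by the top five frames of the kernel stack:

```d
/* Total nanoseconds blocked on adaptive mutexes, by kernel stack. */
lockstat:::adaptive-block
{
        @blocked[stack(5)] = sum(arg1);
}
```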

Spin Lock Probes

The three probes pertaining to spin locks are in Table 10.8.

Table 10.8. Spin Lock Probes

Probe Name

Description

spin-acquire

Hold-event probe that fires immediately after a spin lock is acquired.

spin-spin

Contention-event probe that fires after a thread that has spun on a held spin lock has successfully acquired the spin lock. If both are enabled, spin-spin fires before spin-acquire. arg1 for spin-spin contains the spin count: the number of iterations that were taken through the spin loop before the lock was acquired. The spin count has little meaning on its own but can be used to compare spin times.

spin-release

Hold-event probe that fires immediately after a spin lock is released.
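Similarly, a hedged one-clause sketch can show the distribution of spin counts (arg1 from spin-spin) by a shallow kernel stack:

```d
#!/usr/sbin/dtrace -s

/* arg1 is the spin count; quantize() shows its distribution
   per acquiring code path. */
lockstat:::spin-spin
{
	@spins[stack(5)] = quantize(arg1);
}
```

As the table notes, the spin count has little absolute meaning, but comparing its distribution across code paths identifies the hottest spin locks.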

Thread Locks

Thread lock hold events are available as spin lock hold-event probes (that is, spin-acquire and spin-release), but contention events have their own probe specific to thread locks. The thread lock contention-event probe is described in Table 10.9.

Table 10.9. Thread Lock Probes

Probe Name

Description

thread-spin

Contention-event probe that fires after a thread has spun on a thread lock. Like other contention-event probes, if both the contention-event probe and the hold-event probe are enabled, thread-spin fires before spin-acquire. Unlike other contention-event probes, however, thread-spin fires before the lock is actually acquired. As a result, multiple thread-spin probe firings may correspond to a single spin-acquire probe firing.

Readers/Writer Lock Probes

The probes pertaining to readers/writer locks are in Table 10.10. For each probe, arg0 contains a pointer to the krwlock_t structure that represents the readers/writer lock.

Table 10.10. Readers/Writer Lock Probes

Probe Name

Description

rw-acquire

Hold-event probe that fires immediately after a readers/writer lock is acquired. arg1 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer.

rw-block

Contention-event probe that fires after a thread that has blocked on a held readers/writer lock has reawakened and has acquired the lock. arg1 contains the length of time (in nanoseconds) that the current thread had to sleep to acquire the lock. arg2 contains the constant RW_READER if the lock was acquired as a reader, and RW_WRITER if the lock was acquired as a writer. arg3 and arg4 contain more information on the reason for blocking. arg3 is nonzero if and only if the lock was held as a writer when the current thread blocked. arg4 contains the readers count when the current thread blocked. If both the rw-block and rw-acquire probes are enabled, rw-block fires before rw-acquire.

rw-upgrade

Hold-event probe that fires after a thread has successfully upgraded a readers/writer lock from a reader to a writer. Upgrades do not have an associated contention event because they are only possible through a nonblocking interface, rw_tryupgrade(9F).

rw-downgrade

Hold-event probe that fires after a thread has downgraded its ownership of a readers/writer lock from writer to reader. Downgrades do not have an associated contention event because they always succeed without contention.

rw-release

Hold-event probe that fires immediately after a readers/writer lock is released. arg1 contains the constant RW_READER if the released lock was held as a reader, and RW_WRITER if the released lock was held as a writer. Due to upgrades and downgrades, the lock may not have been released as it was acquired.
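As a sketch, the following script sums readers/writer sleep time (arg1 from rw-block), keyed by arg2 so reader and writer acquisitions can be compared against the kernel stack that blocked:

```d
#!/usr/sbin/dtrace -s

/* arg2 distinguishes RW_READER from RW_WRITER acquisitions;
   it is left as a raw key here rather than assuming the
   values of those constants. */
lockstat:::rw-block
{
	@blocked[arg2, stack(3)] = sum(arg1);
}
```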

The Java Virtual Machine Provider

The following section lists all probes published by the hotspot provider.

VM Life Cycle Probes

Three probes are available related to the VM life cycle, as shown in Table 10.11.

Table 10.11. VM Life Cycle Probes

Probe

Description

vm-init-begin

This probe fires just as the VM initialization begins. It occurs just after JNI_CreateJavaVM() is called, as the VM is initializing.

vm-init-end

This probe fires when the VM initialization finishes, and the VM is ready to start running application code.

vm-shutdown

Probe that fires as the VM is shutting down due to program termination or error.
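As a minimal sketch, the vm-init-begin and vm-init-end probes can bracket VM startup time. The script and target command names below are hypothetical:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::vm-init-begin
{
	start = timestamp;
}

hotspot$target:::vm-init-end
/start/
{
	printf("VM initialization took %d ms\n",
	    (timestamp - start) / 1000000);
}
```

It would be run against a new VM with something like dtrace -s vminit.d -c 'java MyApp', where -c binds $target to the launched process.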

Thread Life Cycle Probes

Two probes are available for tracking thread start and stop events, as shown in Table 10.12.

Table 10.12. Thread Life Cycle Probes

Probe

Description

thread-start

Probe that fires as a thread is started

thread-stop

Probe that fires when the thread has completed

Each of these probes has the arguments shown in Table 10.13.

Table 10.13. Thread Life Cycle Probe Arguments

Argument

Description

args[0]

A pointer to mUTF-8 string data which contains the thread name

args[1]

The length of the thread name (in bytes)

args[2]

The Java thread ID. This is the value that will match other hotspot probes that contain a thread argument.

args[3]

The native/OS thread ID. This is the ID assigned by the host operating system.

args[4]

A boolean value that indicates if this thread is a daemon or not. A value of 0 indicates a non-daemon thread.
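Because args[0] points at mUTF-8 string data in the traced process’s address space, it must be copied into the kernel with copyin() before use. A sketch using the usual copy-and-terminate idiom (assuming the byte after the string is mapped):

```d
#!/usr/sbin/dtrace -s

hotspot$target:::thread-start
{
	/* Copy the thread name (args[0], args[1]) from userland
	   and null-terminate it in scratch memory. */
	this->name = (char *)copyin(arg0, arg1 + 1);
	this->name[arg1] = '\0';
	printf("started thread %d (%s), daemon=%d\n",
	    arg2, stringof(this->name), arg4);
}
```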

Class-Loading Probes

Two probes are available for tracking class loading and unloading activity, as shown in Table 10.14.

Table 10.14. Class-Loading Probes

Probe

Description

class-loaded

Probe that fires after the class has been loaded

class-unloaded

Probe that fires after the class has been unloaded from the system

Each of these probes has the arguments shown in Table 10.15.

Table 10.15. Class-Loading Probe Arguments

Argument

Description

args[0]

A pointer to mUTF-8 string data which contains the name of the class being loaded

args[1]

The length of the class name (in bytes)

args[2]

The class loader ID, which is a unique identifier for a class loader in the VM. This is the class loader that has loaded or is loading the class.

args[3]

A boolean value which indicates if the class is a shared class (if the class was loaded from the shared archive)
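As a sketch, the same copyin() idiom lets class-loaded count loads by class name:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::class-loaded
{
	/* args[0]/args[1] are the class name and its length. */
	this->class = (char *)copyin(arg0, arg1 + 1);
	this->class[arg1] = '\0';
	@loads[stringof(this->class)] = count();
}
```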

Garbage Collection Probes

The following probes measure the duration of a system-wide garbage collection cycle (for those garbage collectors that have a defined begin and end), and each memory pool can be tracked independently. The probes for individual pools pass the memory manager’s name, the pool name, and pool usage information at both the beginning and end of pool collection.

The provider’s GC-related probes are shown in Table 10.16.

Table 10.16. Garbage Collection Probes

Probe

Description

gc-begin

Probe that fires when system-wide collection is about to start. Its one argument (args[0]) is a boolean value that indicates whether this is to be a Full GC.

gc-end

Probe that fires when system-wide collection has completed. No arguments.

mem-pool-gc-begin

Probe that fires when an individual memory pool is about to be collected. Provides the arguments listed in Table 10.17.

mem-pool-gc-end

Probe that fires after an individual memory pool has been collected.

The memory pool probe arguments are as follows:

Table 10.17. Garbage Collection Probe Arguments

Argument

Description

args[0]

A pointer to mUTF-8 string data that contains the name of the manager which manages this memory pool

args[1]

The length of the manager name (in bytes)

args[2]

A pointer to mUTF-8 string data that contains the name of the memory pool

args[3]

The length of the memory pool name (in bytes)

args[4]

The initial size of the memory pool (in bytes)

args[5]

The amount of memory in use in the memory pool (in bytes)

args[6]

The number of committed pages in the memory pool

args[7]

The maximum size of the memory pool
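Since the same arguments are passed at both mem-pool-gc-begin and mem-pool-gc-end, collection time can be summed per pool. This sketch assumes that begin/end pairs for a pool fire on the same thread:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::mem-pool-gc-begin
{
	self->start = timestamp;
}

hotspot$target:::mem-pool-gc-end
/self->start/
{
	/* Copy the pool name (args[2], args[3]) at the -end probe. */
	this->pool = (char *)copyin(arg2, arg3 + 1);
	this->pool[arg3] = '\0';
	@gctime[stringof(this->pool)] = sum(timestamp - self->start);
	self->start = 0;
}
```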

Method Compilation Probes

The following probes indicate which methods are being compiled and by which compiler. Once compilation has completed, the compiled method can be loaded and possibly later unloaded; probes are available to track these events as they occur.

Probes that mark the begin and end of method compilation are shown in Table 10.18.

Table 10.18. Method Compilation Probes

Probe

Description

method-compile-begin

Probe that fires as method compilation begins. Provides the arguments listed in Table 10.19.

method-compile-end

Probe that fires when method compilation completes. In addition to the arguments listed in Table 10.19, args[8] is a boolean value that indicates whether the compilation was successful.

Method compilation probe arguments are shown in Table 10.19.

Table 10.19. Method Compilation Probe Arguments

Argument

Description

args[0]

A pointer to mUTF-8 string data which contains the name of the compiler which is compiling this method

args[1]

The length of the compiler name (in bytes)

args[2]

A pointer to mUTF-8 string data which contains the name of the class of the method being compiled

args[3]

The length of the class name (in bytes)

args[4]

A pointer to mUTF-8 string data which contains the name of the method being compiled

args[5]

The length of the method name (in bytes)

args[6]

A pointer to mUTF-8 string data which contains the signature of the method being compiled

args[7]

The length of the signature (in bytes)
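As a sketch, pairing method-compile-begin with method-compile-end attributes compilation time to each method (assuming the pair fires on the same compiler thread):

```d
#!/usr/sbin/dtrace -s

hotspot$target:::method-compile-begin
{
	self->cstart = timestamp;
}

hotspot$target:::method-compile-end
/self->cstart/
{
	/* args[4]/args[5] are the method name and its length. */
	this->m = (char *)copyin(arg4, arg5 + 1);
	this->m[arg5] = '\0';
	@ctime[stringof(this->m)] = sum(timestamp - self->cstart);
	self->cstart = 0;
}
```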

When compiled methods are installed for execution, the probes shown in Table 10.20 are fired.

Table 10.20. Compiled Method Install Probes

Probe

Description

compiled-method-load

Probe that fires when a compiled method is installed. In addition to the arguments listed in Table 10.21, args[6] contains a pointer to the compiled code, and args[7] is the size of the compiled code.

compiled-method-unload

Probe that fires when a compiled method is uninstalled. Provides the arguments listed in Table 10.21.

Compiled method loading probe arguments are as follows:

Table 10.21. Compiled Method Install Probe Arguments

Argument

Description

args[0]

A pointer to mUTF-8 string data which contains the name of the class of the method being installed

args[1]

The length of the class name (in bytes)

args[2]

A pointer to mUTF-8 string data which contains the name of the method being installed

args[3]

The length of the method name (in bytes)

args[4]

A pointer to mUTF-8 string data which contains the signature of the method being installed

args[5]

The length of the signature (in bytes)

Monitor Probes

As an application runs, threads will enter and exit monitors, wait on objects, and perform notifications. Probes are available for all wait and notification events, as well as for contended monitor entry and exit events. A contended monitor entry occurs when a thread attempts to enter a monitor while another thread is already in the monitor. A contended monitor exit occurs when a thread leaves a monitor while other threads are waiting to enter it. Thus, contended enter and contended exit events may not match up with each other for a given thread, though a contended exit from one thread is expected to match up with a contended enter on another thread (the thread waiting for the monitor).

All monitor events provide the thread ID, a monitor ID, and the type of the class of the object as arguments. It is expected that the thread and the class will help map back to the program, while the monitor ID can provide matching information between probe firings.

Since the existence of these probes in the VM causes performance degradation, they fire only if the VM has been started with the command-line option -XX:+ExtendedDtraceProbes. By default they are present in any listing of the probes in the VM, but are dormant without the flag. It is intended that this restriction be removed in future releases of the VM, where these probes will be enabled all the time with no impact on performance.

The available probes are shown in Table 10.22.

Table 10.22. Monitor Probes

Probe

Description

monitor-contended-enter

Probe that fires as a thread attempts to enter a contended monitor.

monitor-contended-entered

Probe that fires when the thread successfully enters the contended monitor.

monitor-contended-exit

Probe that fires when the thread leaves a monitor and other threads are waiting to enter.

monitor-wait

Probe that fires as a thread begins a wait on an object via Object.wait(). The probe has an additional argument, args[4], a long value that indicates the timeout being used.

monitor-waited

Probe that fires when the thread completes an Object.wait() and has either been notified or has timed out.

monitor-notify

Probe that fires when a thread calls Object.notify() to notify waiters on a monitor.

monitor-notifyAll

Probe that fires when a thread calls Object.notifyAll() to notify waiters on a monitor.

Monitor probe arguments are shown in Table 10.23.

Table 10.23. Monitor Probe Arguments

Argument

Description

args[0]

The Java thread identifier for the thread performing the monitor operation

args[1]

A unique but opaque identifier for the specific monitor that the action is performed upon

args[2]

A pointer to mUTF-8 string data which contains the name of the class of the object being acted upon

args[3]

The length of the class name (in bytes)
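As a sketch (the VM must have been started with -XX:+ExtendedDtraceProbes), the opaque monitor ID in args[1] can key a thread-local timestamp so that each contended-enter is matched with its contended-entered:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::monitor-contended-enter
{
	self->ts[arg1] = timestamp;
}

hotspot$target:::monitor-contended-entered
/self->ts[arg1]/
{
	/* Distribution of time spent waiting for contended monitors. */
	@contention = quantize(timestamp - self->ts[arg1]);
	self->ts[arg1] = 0;
}
```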

Application Tracking Probes

A few probes are provided to allow fine-grained examination of Java thread execution. These consist of probes that fire any time a method is entered or returned from, as well as a probe that fires whenever a Java object has been allocated.

Since the existence of these probes in the VM causes performance degradation, they fire only if the VM has been started with the command-line option -XX:+ExtendedDtraceProbes. By default they are present in any listing of the probes in the VM, but are dormant without the flag. It is intended that this restriction be removed in future releases of the VM, where these probes will be enabled all the time with no impact on performance.

The method entry and return probes are shown in Table 10.24.

Table 10.24. Application Tracking Probes

Probe

Description

method-entry

Probe that fires when a method is being entered. Only fires if the VM was created with the ExtendedDtraceProbes command-line argument.

method-return

Probe that fires when a method returns normally or due to an exception. Only fires if the VM was created with the ExtendedDtraceProbes command-line argument.

Method probe arguments are shown in Table 10.25.

Table 10.25. Application Tracking Probe Arguments

Argument

Description

args[0]

The Java thread ID of the thread that is entering or leaving the method

args[1]

A pointer to mUTF-8 string data which contains the name of the class of the method

args[2]

The length of the class name (in bytes)

args[3]

A pointer to mUTF-8 string data which contains the name of the method

args[4]

The length of the method name (in bytes)

args[5]

A pointer to mUTF-8 string data which contains the signature of the method

args[6]

The length of the signature (in bytes)
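As a sketch, method-entry can count calls by class name. Note that this probe fires on every method entry, so the probe effect is substantial even by the standards of ExtendedDtraceProbes:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::method-entry
{
	/* args[1]/args[2] are the class name and its length. */
	this->class = (char *)copyin(arg1, arg2 + 1);
	this->class[arg2] = '\0';
	@calls[stringof(this->class)] = count();
}
```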

The available allocation probe is shown in Table 10.26.

Table 10.26. Allocation Probe

Probe

Description

object-alloc

Probe that fires when any object is allocated, provided that the VM was created with the ExtendedDtraceProbes command-line argument.

The object allocation probe has the arguments shown in Table 10.27.

Table 10.27. Allocation Probe Arguments

Argument

Description

args[0]

The Java thread ID of the thread that is allocating the object

args[1]

A pointer to mUTF-8 string data which contains the name of the class of the object being allocated

args[2]

The length of the class name (in bytes)

args[3]

The size of the object being allocated
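As a sketch (again requiring -XX:+ExtendedDtraceProbes), object-alloc can attribute allocated bytes to each class:

```d
#!/usr/sbin/dtrace -s

hotspot$target:::object-alloc
{
	this->class = (char *)copyin(arg1, arg2 + 1);
	this->class[arg2] = '\0';
	/* args[3] is the size of the object being allocated. */
	@bytes[stringof(this->class)] = sum(arg3);
}
```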

The hotspot_jni Provider

The JNI provides a number of methods for invoking code written in the Java programming language and for examining the state of the VM. DTrace probes are provided at the entry and return points of each of these methods. The probes are provided by the hotspot_jni provider. The name of the probe is the name of the JNI method, appended with “_entry” for entry probes and “_return” for return probes. The arguments available at each entry probe are the arguments that were provided to the function (with the exception of the Invoke* methods, which omit the arguments that are passed to the Java method). The return probes have the return value of the method as an argument (if available).
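Because every JNI method has its own probe, a wildcard enabling that keys on probename gives a quick profile of a process’s JNI activity without naming individual methods:

```d
#!/usr/sbin/dtrace -s

/* Count every hotspot_jni probe firing (both entry and return
   probes) by probe name. */
hotspot_jni$target:::
{
	@jni[probename] = count();
}
```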

MDB Reference

Table 10.28. MDB Reference for DTrace

dcmd or walker

Description

dcmd difinstr

Disassemble a DIF instruction

dcmd difo

Print a DIF object

dcmd dof_actdesc

Print a DOF actdesc

dcmd dof_ecbdesc

Print a DOF ecbdesc

dcmd dof_hdr

Print a DOF header

dcmd dof_probedesc

Print a DOF probedesc

dcmd dof_relodesc

Print a DOF relodesc

dcmd dof_relohdr

Print a DOF relocation header

dcmd dof_sec

Print a DOF section header

dcmd dofdump

Dump DOF

dcmd dtrace

Print dtrace(1M)-like output

dcmd dtrace_aggstat

Print DTrace aggregation hash statistics

dcmd dtrace_dynstat

Print DTrace dynamic variable hash statistics

dcmd dtrace_errhash

Print DTrace error hash

dcmd dtrace_helptrace

Print DTrace helper trace

dcmd dtrace_state

Print active DTrace consumers

dcmd id2probe

Translate a dtrace_id_t to a dtrace_probe_t

walk dof_sec

Walk DOF section header table given header address

walk dtrace_aggkey

Walk DTrace aggregation keys

walk dtrace_dynvar

Walk DTrace dynamic variables

walk dtrace_errhash

Walk hash of DTrace error messages

walk dtrace_helptrace

Walk DTrace helper trace entries

walk dtrace_state

Walk DTrace per-consumer softstate
