Tutorial: DTrace by Example

By Rickey C. Weisner, September 2009

DTrace, or Dynamic Tracing, is a powerful diagnostic tool introduced in the Solaris 10 OS. Since its introduction, it has been implemented in other operating systems, the most noteworthy being FreeBSD and Mac OS X. This tutorial uses DTrace to analyze several applications. The tutorial emphasizes practical real-world use of DTrace, as opposed to an introduction to language syntax. Using mpstat and prstat information, the tutorial illustrates which questions one should ask and how one can use DTrace to answer these questions. These DTrace-inspired answers lead to increased performance.

Introduction to DTrace

DTrace, or dynamic tracing, was first available in the Solaris 10 OS in 2004. Since then, DTrace has been made available in FreeBSD and Mac OS X. DTrace differs from traditional tools in that code is instrumented dynamically.

To see this, disassemble a kernel function, enable a probe on it, and then disassemble it again:

  1. First, disassemble the kernel function.

     arwen: mdb -k
     Loading modules: [ unix krtld genunix specfs ufs ip sctp usba uhci s1394 nca fctl lofs nfs audiosup sppp random crypto ptm ipc ]
     > ufs_delete::dis
     ufs_delete:                     pushq  %rbp
     ufs_delete+1:                   movq   %rsp,%rbp
     ufs_delete+4:                   subq   $0x70,%rsp
     ufs_delete+8:                   movq   %rbx,-0x28(%rbp)
     ufs_delete+0xc:                 movq   %r13,-0x18(%rbp)
     ^D
  2. Initiate a DTrace probe.

     arwen: dtrace -f ufs_delete
     dtrace: description 'ufs_delete' matched 2 probes
     ^Z[1] + Stopped (SIGTSTP)        dtrace -f ufs_delete
                          
  3. Repeat the disassembly.

     arwen: mdb -k
     Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba uhci s1394 nca fctl lofs nfs audiosup sppp random crypto ptm ipc ]
     > ufs_delete::dis
     ufs_delete:                     int    $0x3
     ufs_delete+1:                   movq   %rsp,%rbp
     ufs_delete+4:                   subq   $0x70,%rsp
     ufs_delete+8:                   movq   %rbx,-0x28(%rbp)
     ufs_delete+0xc:                 movq   %r13,-0x18(%rbp)
                          

DTrace has inserted the breakpoint trap dynamically; when the probe is disabled, the original instruction is restored.
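The "matched 2 probes" message above corresponds to the fbt provider's entry and return probes for the function. As a quick sketch, they can be listed without being enabled:

     # dtrace -l -f ufs_delete        (lists, but does not enable, the probes for ufs_delete)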

DTrace uses the following format to identify a probe: provider:module:function:name

Or, from the command line:

-P     provider
-m     module
-f     function
-n     name
-s     script
-p     pid
-c     run command
-l     list
-v     verbose
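
For example, a complete probe description and action can be given directly on the command line with -n. A minimal one-liner of this kind counts read(2) system calls by process name until you press Ctrl-C:

# dtrace -n 'syscall::read:entry { @[execname] = count(); }'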

So a simple DTrace script would look like this:

syscall:::entry                     // probe description
/pid == $1/                         // predicate
{
        @[probefunc] = count();     // action
}

In the preceding script, syscall is the provider. The module designation is omitted, so the script matches any module. The function designation is likewise omitted, so the script matches any function. In this example, we are interested only in the entry to the function, as opposed to the return. The predicate (/pid == $1/) determines whether the action (@[probefunc] = count();) is performed. The action occurs only when the pid (process ID) matches the first parameter passed on the command line.

The action is an aggregation (designated by the @) that contains the count by function name of the system calls executed. By default, the contents of the aggregation are printed upon exit from the script.
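
For instance, the script above could be saved to a file (count_syscalls.d is just a placeholder name) and run against a process ID; the aggregation prints when the script is ended with Ctrl-C:

# dtrace -s count_syscalls.d `pgrep myapp`     (myapp is a placeholder executable name)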

The DTrace documentation has hundreds of pages, and yes, I read them all initially. However, if you are not using a new language daily, the details soon become hazy. When I need or want to use DTrace, I "borrow" someone else's DTrace script and make a few modifications to suit my needs and environment. Two of my favorite sources for sample DTrace scripts are /usr/demo/dtrace and the DTrace Toolkit.

Examples in /usr/demo/dtrace

In the Solaris OS, the examples from the DTrace user guide are found in /usr/demo/dtrace.

$ cd /usr/demo/dtrace
$ ls | head
applicat.d
badopen.d
begin.d
callout.d
clause.d
clear.d
countdown.d
...
                  

Each script contains a reference to the chapter from whence it came.

For example, from applicat.d:

...
 * The full text of this chapter may be found here:
 * http://wikis.sun.com/display/DTrace/io+Provider
...

Or you can use a browser to examine index.html:

Chapter                       Script
Introduction                  countdown.d
                              counter.d
                              hello.d
...
Variables                     clause.d
                              rtime.d
Structs and Unions            kstat.d
                              ksyms.d
                              rwinfo.d
Aggregations                  clear.d
                              dateprof.d
...

Brendan Gregg's Toolkit

One of the co-authors of Solaris Performance and Tools [McDougall2007V2], Brendan Gregg, created a toolkit, the DTraceToolkit. It contains numerous D scripts as well as Perl scripts that preprocess and postprocess DTrace output. One nice feature of the toolkit is the examples directory, which contains sample output from each of the available programs.

Examples:

...Copyright
Readme
anonpgpid_example.txt
bitesize_example.txt
connections_example.txt
cpudists_example.txt
cputimes_example.txt
cputypes_example.txt
cpuwalk_example.txt
crash_example.txt
...

For example, hotuser_example.txt demonstrates the hotuser DTrace program.

Here, hotuser is run on a test program named dofuncs, which is hard-coded to spend half its time in delta(), a third in beta(), and a sixth in alpha().

# ./hotuser -c ./dofuncs
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                 COUNT   PCNT
dofuncs`alpha              511  16.5%
dofuncs`beta              1029  33.3%
dofuncs`delta             1552  50.2%
...
                  

mpstat Program

DTrace is most useful for answering a question suggested by other performance metrics. The mpstat program, which presents a high-level view of kernel activity, is one of the programs that frequently indicates which questions to ask.

mpstat 10 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0   17   0    2   428  225  240    7   25    2    0   407    3   0   0  97
1   15   0    1   149  141  225   10   26    2    0   308    2   2   0  96
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    3   0    0   428  219  245  103    3   11    0   732   99   1   0   0
1   21   0    0   242  140  425  196    3    9    0  1320   98   2   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   432  227  240  100    3   12    0   751   99   1   0   0
1    0   0    0   243  140  443  207    4    9    0  1324   98   2   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
0    0   0    0   455  248  264  112    4    9    0   734   99   1   0   0
1    0   0    0   246  140  448  205    4   12    0  1336   98   2   0   0

In the preceding mpstat example:

minf   page faults satisfied from memory
mjf    page faults satisfied from disk
xcal   cross calls
intr   interrupts
ithr   interrupt threads
csw    context switches
icsw   involuntary context switches
migr   thread migrations
smtx   spins on mutex locks
srw    spins on reader/writer locks
syscl  system calls
usr    CPU time in user mode
sys    CPU time in kernel mode
wt     wait on IO
idl    idle

The above measurements are per-second values.

prstat -mL Tool

The prstat tool also provides useful first-level analysis metrics, presenting information similar to mpstat's but at the thread level.

  PID USERNAME  USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 3040 rw8775   0.8 0.2 0.0 0.0 0.0 0.0  99 0.0  19   7 750   0 madplay/1
3030 root     0.1 0.2 0.0 0.0 0.0 0.0 100 0.0  30   1 400   0 nx/1
1311 rw8775   0.3 0.0 0.0 0.0 0.0 0.0 100 0.0  78  40 273   0 iobar.i386/1
700 rw8775    0.2 0.0 0.0 0.0 0.0 0.0 100 0.0 321   7  1K 129 Xorg/1
303 rw8775    0.1 0.0 0.0 0.0 0.0 0.0 100 0.0  20   4 170   0 xterm/1
310 rw8775    0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 117  20 273   0 netbar.i386/1
2972 rw8775   0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 167  16 673   0 soffice.bin/1
3042 rw8775   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  24   2 386   0 prstat/1
1305 rw8775   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 174  57 552   0 perfbar/1
1296 rw8775   0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   2   0  31   0 xterm/1
Total: 105 processes, 231 lwps, load averages: 0.06, 0.06, 0.04

where:

USR   Percent CPU in user mode, per thread
SYS   Percent CPU in system (kernel) mode, per thread
TRP   Percent of time spent processing traps
TFL   Percent of time spent servicing text page faults
DFL   Percent of time spent servicing data page faults
LCK   Percent of time spent waiting for user locks
SLP   Percent of time spent sleeping (for example, waiting on I/O)
LAT   Percent of time spent waiting on a run queue for a CPU
VCX   Number of voluntary context switches per interval
ICX   Number of involuntary context switches per interval
SCL   Number of system calls per interval
SIG   Number of signals per interval

Example 1: Memory Management

It is not uncommon for applications to use a threaded implementation in order to scale on modern multithreaded and multicore hardware architectures. As the thread count increases, these applications can run into lock contention.

Consider the following prstat -mL:

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
19037 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/14
19037 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/12
19037 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/11
19037 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/8
19037 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/3
19037 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 malloc_hog/15
19037 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 malloc_hog/13
19037 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 malloc_hog/10
...
Total: 1 processes, 21 lwps, load averages: 3.76, 1.20, 0.45

Here we see serious lock contention (notice LCK).

Because these are userland locks, we can use a DTrace-based program named plockstat to check for hot locks:

$ plockstat -C -e 10 -p `pgrep malloc_hog`

      -C -> check for contention events
      -e -> sample for 10 seconds
      -p -> attach to the process with this pid

Mutex block
Count     nsec Lock                         Caller
-------------------------------------------------------------------------------
    4 699547535 libc.so.1`libc_malloc_lock  malloc_hog`malloc_thread+0xc
    1    102140 libc.so.1`libc_malloc_lock  malloc_hog`malloc_thread+0x30

plockstat, a DTrace-based program, enables us to drill down to the root cause of the lock contention. The threads are blocked on a lock inside the malloc memory management routine. Fortunately, with the Solaris OS there are several MT-hot memory allocators to choose from: the operating system comes with libumem and libmtmalloc, and there is also the freely available libhoard (see www.hoard.org).
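
A minimal sketch of how one might do the preload for the test program above (the exact syntax depends on your shell): run LD_PRELOAD=libumem.so.1 ./malloc_hog, so that the program's malloc and free calls resolve to libumem.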

By LD preloading libumem we get the following:

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/16
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/15
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/13
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/11
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/20
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/19
19054 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  25   0   0 malloc_hog/17
...

No more lock contention!

Example 2: Condition Variables

Consider mpstat 10:

CPU minf mjf  xcal  intr ithr  csw  icsw migr  smtx  srw syscl  usr sys  wt idl
  0    0   0  8259 14363   60 18143  823 2683  7533    0 15433   15  38   0  47
1    0   0  7371 11574    2 17084  692 2269  6567    0 14318   14  34   0  52
2    0   0  6844  9808    0 15798  577 1882  6072    0 13287   13  31   0  55
3    0   0  6307  8481    0 14765  493 1621  5532    0 12330   12  29   0  59
4    0   0  6495  8268    0 15149  520 1635  5798    0 12776   12  29   0  60
5    0   0  5837  7344    0 13849  415 1333  5204    0 11547   11  26   0  63
6    0   0  5498  6638    0 13056  355 1141  4870    0 10881   10  24   0  65
7    0   0  5300  6187    0 12639  329 1017  4671    0 10502   10  24   0  67
8    0   0  5554  6303    0 13199  349 1114  4916    0 11111   10  25   0  66
...
22    0   0  2528  2601    0  6015  129  395  2264    0  5002    4  12   0  84
23    0   0  1951  2034    0  4746  104  320  1729    0  3916    3  10   0  87
24    0   0  5078  1143    0  2601   69  209  4640    0 17470    8  14   0  78
25    0   0     0     2    0     0    0    0     0    0     0    0   0   0 100
26    0   0   234   132    0   288   11   31   204    0   630    0   1   0  99
27    0   0  1415   795    0  1818   44  133  1279    0  4002    2   5   0  93
28    0   0 20586  1343    0  2994   90  262 18910    0 76303   35  51   0  15
29    0   0   219   198    0   447   11   32   194    0   466    0   1   0  99
30    0   0     0     1    0     0    0    0     0    0     0    0   0   0 100
31    0   0     0     1    0     0    0    0     0    0     0    0   0   0 100

Let's try to find out what is causing the cross calls.

Consider a modified xcallsbypid.d from the DTrace Toolkit:

#!/usr/sbin/dtrace -s

sysinfo:::xcalls
/ execname == "what" /            // we are only interested in executables named "what"
{
        @[ustack(10)] = count();  // 10-deep userland stack
}

tick-10sec                        // print every 10 seconds
{
        printf("%Y \n", walltimestamp);
        printa(@);
        clear(@);
        trunc(@,0);
}

When executed, this script yields the following:

              libc.so.1`__lwp_park+0x10
              libc.so.1`cond_wait_queue+0x4c
              libc.so.1`cond_wait+0x10
              what`consumer+0x5c
              libc.so.1`_lwp_start
              160350

              libc.so.1`__lwp_unpark+0xc
              what`consumer+0xd0
              libc.so.1`_lwp_start
              269825

With DTrace we thus discover the root cause of the xcalls: they are triggered by lwp_park and lwp_unpark. Looking at the source from opensolaris.org, we see that lwp_unpark is used to wake up an LWP sleeping on a queue. Next, we examine the application source for calls that wake up a thread. We discover cond_broadcast(&thecv), which wakes up all threads waiting on a condition variable. An alternative is to wake up a single thread sleeping on the condition variable (cond_signal). If we change cond_broadcast to cond_signal, we see the following (ptime monitors execution time using microstate accounting):

ptime what 30     (cond_broadcast)          ptime what_new 30 (cond_signal)
real      14.547                            real       11.659
user      37.466                            user       12.655
sys     1:03.826                            sys        14.172

The ptime result shows that the CPU time consumed using cond_signal is roughly 25% of that consumed using cond_broadcast!
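
As an aside, DTrace can also locate the cond_broadcast call sites directly, without reading source. A minimal sketch using the pid provider (attach to the running what process with -p; the script file name is up to you):

pid$target:libc.so.1:cond_broadcast:entry
{
        @[ustack(5)] = count();         // which code paths broadcast, and how often
}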

Example 3: memset

Consider the following prstat output.

  PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROC./LWPID
 4293 rw8775    99 0.0 0.0 0.0 0.0 0.0 0.0 0.5   0 566   0   0 test_memset/1

This measurement shows 99% user CPU utilization. Before DTrace, 100% CPU in USR meant there was nothing more to be done at the system level. With DTrace, however, it is useful and practical to do a poor man's profile.

For the poor man's profile, we use hotuser from the DTraceToolkit.

hotuser -p `pgrep test_memset`
Sampling... Hit Ctrl-C to end.
^C
FUNCTION                 COUNT   PCNT
test_memset`0x8050a8c      210   3.1%
test_memset`main          1704  25.3%
libc.so.1`memset          4815  71.6%

We can see that the program spends 72% of its execution time in a single function. This example may seem extreme; however, I have observed real-world applications spending 50% of their time in memset, so this example is less unusual than it seems. The next question is: how large are these memsets? To determine this, we use the pid provider.

#!/usr/sbin/dtrace -s

pid$1::memset:entry                     // the pid is the first passed parameter
{
        @["memset"] = quantize(arg2);   // aggregate on the number of bytes set
}

tick-10sec
{
        printf("\n memset size count Date: %Y ", walltimestamp);
        printa(@);
        clear(@);
        trunc(@,0);
}

tick-100sec
{
        exit(0);
}

Every time memset is called in the pid of interest, we shall record the size of the memset. Every 10 seconds the aggregation is printed in the form of a histogram. After 100 seconds the script exits.

memset_profile.d `pgrep test_memset`
dtrace: script 'memset_profile.d' matched 4 probes
CPU     ID                    FUNCTION:NAME
  1  80324                      :tick-10sec
 memset size count Date: 2008 Aug 25 17:55:44
  memset
           value  ------------- Distribution ------------- count
               2 |                                         0
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 6608471
               8 |                                         0

So, in our test case, we see memsets of about 4 bytes. A set of 4 bytes can be done with an assignment statement and does not need a memset. Where are these memset calls occurring? We can modify our script slightly in order to get a stack.

#!/usr/sbin/dtrace -s

pid$1::memset:entry
/ arg2 < 5 /
{
        @[ustack(2), arg2] = count();   // two-level-deep stack
}

tick-10sec
{
        printf("\n memset stack, size count Date: %Y ", walltimestamp);
        printa(@);
        clear(@);
        trunc(@,0);
}

When the size for the memset is less than 5 bytes, we record a two-deep stack. It is usually best to start with shallow stack traces and increase the depth only when necessary, as stack traces are expensive. We then see the following:

memset_stack.d `pgrep test_memset`
dtrace: script 'memset_stack.d' matched 4 probes
CPU     ID                    FUNCTION:NAME
  1  80324                      :tick-10sec
 memset stack, size count Date: 2008 Aug 25 17:59:12

              libc.so.1`memset
              test_memset`main+0x72
                4          6052925

The example shows us specifically where the memset call occurs. I once had an engagement where memset took 20 percent of the execution time, and 90% of that came from a single library doing memsets of less than 4 bytes. Replacing those memsets with appropriate assignment statements made a substantial difference.

For example, change:

memset(c,0,4);

to

c[0]=0; c[1]=0; c[2]=0; c[3]=0;

I do it one byte at a time to prevent SIGBUS errors in case c is not aligned for a 4-byte store.

Comparing the two approaches we see the following:

ptime test_memset 100000000          ptime test_assignment 100000000
real        4.559                    real        0.363
user        4.539                    user        0.353
sys         0.011                    sys         0.009

Using DTrace, we drilled down to the root cause of the excessive memsets and discovered a program change that led to a more than tenfold performance improvement.

Example 4: FIFO

Consider mpstat 10:

CPU minf mjf xcal intr ithr  csw icsw migr   smtx srw syscl  usr sys wt idl
  0    0   0    0   527  254 1318  270  186 223997   0 431715  11  88  0   1
1    0   0    0   246  140 1427  347  180 239895   0 421440  12  88  0   0
CPU minf mjf xcal intr ithr  csw icsw migr   smtx srw syscl  usr sys wt idl
0    0   0    0   451  227 2161  840  128 244244   0 419694  12  86  0   2
1    0   0    0   195  141 2276  800  134 218031   0 429392  11  89  0   0

The first question is: what is causing the spins on mutex locks (smtx)? We use lockstat, a DTrace-based tool, to identify the cause of the lock contention.

$ lockstat -C sleep 10

Adaptive mutex spin: 4966398 events in 10.041 seconds (494628 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------
2484949  50%  50% 0.00     1425 0xffffff0191702a00     fifo_write
2480887  50% 100% 0.00      870 0xffffff0191702a00     fifo_read

Lockstat tells us that the spins are due to FIFO activity. To determine which system calls are being executed, we use /usr/demo/dtrace/syscall.d.

Running the following:

syscall:::entry
/pid == $target/                // the command is executed (-c) or the pid is passed (-p)
{
        @[probefunc] = count(); // aggregate on the function name, storing the count
}

Output:

dtrace -s /usr/demo/dtrace/syscall.d -p `pgrep ptest_write`
^C
write                     8636675

Now that DTrace has told us which system call is being executed, we can consider how to improve it. One way to reduce the CPU utilization of any system call is to do less of it. For writes, this means doing fewer writes by doing larger writes. Changing the program to use larger writes, we see the following:

using a 1024-byte write size:         megabytes/sec = 445.227011
with an 8K write size:                megabytes/sec = 984.639115

Root-cause analysis with DTrace led to a modification yielding twice the throughput!
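
As an aside, the write sizes themselves can be confirmed with DTrace before making such a change. A minimal sketch (assumes the syscall provider and a target process supplied with -p or -c; for write(2), arg2 is the byte count):

syscall::write:entry
/pid == $target/
{
        @["write size (bytes)"] = quantize(arg2);       // histogram of bytes per write
}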

Example 5: User Error

My customer was receiving an ENOMEM error while starting his application. The problem was explained to me in the following manner: "When I insert records into the database through my application one at a time, I can insert 30,000 records with no issues. But when I restart the database, the application runs out of address space."

Being a skeptic, I wanted to observe the database doing a one-at-a-time insertion. However, this exercise was going to take 7 days! Not having 7 days, I agreed to have the user manually load the database with the 30,000 records. We started the application (it was a 32-bit application, therefore a 4-gigabyte address space), and it did run out of address space.

As my first measurement, I decided to measure the address space size upon entering and exiting the function that loaded records from the database. The pmap tool is useful for this purpose, but first I had to catch the execution of the application as it started up:

proc:::exec-success
/ execname == "foobar" /
{
        printf("stopping %d\n", pid);
        stop();
        system("/var/tmp/rcw/bin/my_d/pid_foo.d %d >> /var/tmp/rcw/statdir/dlog 2>&1 &", pid);
        system("/bin/prun %d", pid);
        exit(0);
}

So whenever an exec system call was successful, I checked the executable name. If it was foobar, I stopped the process, used a call to system() to start a DTrace script against it, and then restarted it with prun.
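
Note that stop() and system() are destructive actions, so a script like this must be run with dtrace -w or with #pragma D option destructive set.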

The DTrace script that monitored the address space resembled the following:

pid$1::foo_insert:entry
{
        printf("%s, tid = %d \n", probefunc, tid);
        stop();                                 // stop the process
        system("/bin/pmap -x %d", pid);         // check its address space
        system("/bin/prun %d", pid);            // restart the process
}

And, of course, we did something similar for return.

The pmap tool showed 640 megabytes upon entry and 740 megabytes upon exit. I was at a loss as to where the additional address space was being allocated.

At exit from the critical function:

28700:  foobar
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
00010000   56168   45128       -       - r-x--  foobar
036F8000    6728    6448    5248       - rwx--  foobar
03D8A000      24      24      24       - rwx--    [ anon ]
03D90000  641792  640384  640384       - rwx--    [ heap ]
...
total Kb  742920  724000  647392       -

At program termination:

28700:  foobar
 Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
00010000   56168   45128       -       - r-x--  foobar
036F8000    6728    6448    5248       - rwx--  foobar
03D8A000      24      24      24       - rwx--     [ anon ]
03D90000 4051904 4050496 4050496       - rwx--     [ heap ]
total Kb 4165688 4146768 4057512       -

First, we decided to check out allocation sizes.

pid$1:*:*malloc*:entry,
pid$1:*:*memalign*:entry,
pid$1:*:*valloc*:entry
{
        @[probemod, probefunc, arg0] = count();  // probemod is the module name, e.g., libc.so.1;
                                                 // probefunc is the function name
}

pid$1:*:*realloc*:entry
{
        @[probemod, probefunc, arg1] = count();  // aggregate on the size,
                                                 // count the number of requests
}

tick-10sec                                       // dump data every ten seconds
{
        printf("\n Pid = %d Date: %Y \n", $1, walltimestamp);                   // print a header
        printa("probemod = %s, probefunc = %s size = %d, count = %@d \n", @);   // print the aggregation
        clear(@);
        trunc(@,0);
}

The output resembled the following:
probemod = libumem.so.1 probefunc = malloc size = 2048 count = 6
probemod = libumem.so.1 probefunc = malloc size = 2748 count = 6
probemod = libc.so.1 probefunc = lmalloc size = 20 count = 19
probemod = libumem.so.1 probefunc = malloc size = 4 count = 35
probemod = libumem.so.1 probefunc = malloc size = 4613 count = 38
probemod = libumem.so.1 probefunc = malloc size = 1280 count = 46
probemod = libumem.so.1 probefunc = malloc size = 1024 count = 249
probemod = libumem.so.1 probefunc = malloc size = 12 count = 302
probemod = libumem.so.1 probefunc = malloc size = 112501 count = 1466 // suspicious
probemod = libumem.so.1 probefunc = malloc size = 24 count = 1499
                  

We immediately noticed over 2 GB of roughly 110-KB allocations. We used the following script to determine where the allocations were occurring:

pid$1:*:*malloc*:entry,
pid$1:*:*memalign*:entry,
pid$1:*:*valloc*:entry
/ arg0 > 100000 /                       // we are only interested in the large allocations
{
        @[ustack(10), arg0] = count();  // now we aggregate on size and location (stack)
}

pid$1:*:*realloc*:entry
{
        @[ustack(10), arg1] = count();
}

tick-10sec
{
        printf("\n Pid = %d Date: %Y \n", $1, walltimestamp);
        printa("stack = %k , size = %d, count = %@d \n", @);
        clear(@);
        trunc(@,0);
}

Our output resembled the following:

    libumem.so.1`malloc
libCrun.so.1`__1c2n6FI_pv_+0x2c
a.out`__1cXfoobarHlrExcpCacheNode2t5B6MXXb_v_+0x4c
a.out`__1cVfoobarHlrExcpCachingSinitExceptionCache6MpnUCpCallmMobileDbCache__b_+0x334
a.out`__1cMfoobarDbpSmVprocDbpWaitForInitCmd6MrnPCmnPrSdlProcess_iiipv_n0AGstates__+0x15fc
a.out`__1cMfoobarDbpSmGdbpSdl6FrnPCmnPrSdlProcess_iiipv_nNCmnPrSdlError__+0xbc
a.out`__1cPfooSignal6Miipv_nNCmnPrSdlError__+0x24
a.out`__1cQfooignal6MrnJSdlSignal_b_nNCmnPrSdlError__+0xf0
a.out`__1cQfooInternalSignalLoop6M_nNCmnPrSdlError__+0x4c
a.out`__1cQfooExternalSignal6Miiipvp0i_nNCmnPrSdlError__+0xe0
  112501             1442
                  

Thus, we determined the root cause: we were allocating gigabytes of exception cache.

The developer, when confronted with this data, was in a total state of denial. He looked into the issue and discovered that the tester's SQL to create the 30,000 database entries was incorrect! The tester was initializing a field to all 0s when it should have been 0xffffff.

Useful Tip: Listing All Probes

If you want to create your own script, it is useful to know that dtrace -l lists all available probes.

Here I ask for all entry probes in the tcp module.

 $ dtrace -l -n tcp::entry        (or: dtrace -l -m tcp)
ID   PROVIDER            MODULE                          FUNCTION NAME
57791        fbt               tcp                       inet_attach entry
57793        fbt               tcp                       inet_detach entry
57795        fbt               tcp                       inet_info entry
57797        fbt               tcp                       _info entry
                  

Then use -v to get the probe arguments.

 $  dtrace -lv -n fbt:tcp:inet_info:entry
ID   PROVIDER            MODULE                          FUNCTION NAME
57795        fbt               tcp                         inet_info entry
Probe Description Attributes
Identifier Names: Private
Data Semantics:   Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Private
Data Semantics:   Private
Dependency Class: ISA
Argument Types
args[0]: dev_info_t *
args[1]: ddi_info_cmd_t
args[2]: void *
args[3]: void **
                  

To go further, you need to use the include files to understand the arguments.

One of my favorite providers is the profile provider. It is the core provider for the hotuser script mentioned earlier. It fires on every CPU at a specified interval, making it a lightweight way to do a poor man's profile.

profile:::profile-101                       /* run 101 times per second */
/ cpu == $1 && arg0 != 0 /                  /* only a certain CPU, and only in the kernel */
{
        @calls[curthread, stack(5)] = count();      /* aggregate on thread pointer, 5-deep kernel stack */
}
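
The CPU of interest is passed as the first argument, for example dtrace -s kernel_profile.d 28 (the script name here is just a placeholder). An off-beat rate such as 101 Hz is commonly chosen so that sampling does not run in lockstep with periodic system activity.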

Conclusion

DTrace does not replace any of the existing performance-measuring tools that a practitioner typically employs to characterize performance. But where that characterization once led to experiments, conjectures, static instrumentation of code, and so on, DTrace now allows the analyst to measure directly in live production environments, and thus to avoid experiments, static instrumentation, and hidden assumptions. With DTrace we can get to the root cause quickly, directly, and without ambiguity.

For More Information

Credits and Thanks

Thanks to Darryl Gove and Jim Fiori of Sun Microsystems for their reviews and comments.

About the Author

Rickey C. Weisner is a Senior Engineer working for the Solaris 10 Adoption group in the US Systems Practice. Rick has 18 years of experience at Sun. For the past 12 years he has been doing performance analysis and supporting software development. Prior to joining Sun he had 10 years of experience in research and development, commercial programming, and nuclear engineering.