DTrace, or Dynamic Tracing, is a powerful diagnostic tool introduced in the Solaris 10 OS. Since its introduction, it has been implemented in other operating systems, most notably FreeBSD and Mac OS X. This tutorial uses DTrace to analyze several applications, emphasizing practical, real-world use of DTrace rather than an introduction to language syntax. Starting from mpstat and prstat information, the tutorial illustrates which questions one should ask and how one can use DTrace to answer them. These DTrace-inspired answers lead to increased performance.
Contents
/usr/demo/dtrace
mpstat Program
prstat -mL Tool
memset
Introduction to DTrace
DTrace, or dynamic tracing, was first available in the Solaris 10 OS in 2004. Since then, DTrace has been made available in FreeBSD and Mac OS X. DTrace differs from traditional tools in that code is instrumented dynamically.
To disassemble a kernel function and initiate a probe:
arwen: mdb -k
Loading modules: [ unix krtld genunix specfs ufs ip sctp usba uhci s1394 nca fctl lofs nfs audiosup sppp random crypto ptm ipc ]
> ufs_delete::dis
ufs_delete: pushq %rbp
ufs_delete+1: movq %rsp,%rbp
ufs_delete+4: subq $0x70,%rsp
ufs_delete+8: movq %rbx,-0x28(%rbp)
ufs_delete+0xc: movq %r13,-0x18(%rbp)
^D
arwen: dtrace -f ufs_delete
dtrace: description 'ufs_delete' matched 2 probes
^Z[1] + Stopped (SIGTSTP) dtrace -f ufs_delete
arwen: mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ip sctp usba uhci s1394 nca fctl lofs nfs audiosup sppp random crypto ptm ipc ]
> ufs_delete::dis
ufs_delete: int $0x3
ufs_delete+1: movq %rsp,%rbp
ufs_delete+4: subq $0x70,%rsp
ufs_delete+8: movq %rbx,-0x28(%rbp)
ufs_delete+0xc: movq %r13,-0x18(%rbp)
DTrace inserts the interrupt dynamically.
DTrace uses the following format to identify a probe: provider:module:function:name
Or, from the command line:
-P provider
-m module
-f function
-n name
-s script
-p pid
-c run command
-l list
-v verbose
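For example, the ufs_delete experiment above used -f to match on the function name alone; the same entry probe could also be named with a full provider:module:function:name description. A hedged illustration, not part of the original session:
# dtrace -n fbt:ufs:ufs_delete:entry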
So a simple DTrace script would look like this:
syscall:::entry    // probe description
/pid == $1/        // predicate
{
@[probefunc] = count(); // action
}
In the preceding script, syscall is the provider. The module designation is omitted so the script matches on any module. The function designation is likewise omitted, so the script matches on any function. In this example, we are interested only in the entry to the function as opposed to the return. The predicate ( /pid == $1/ ) determines whether the action ( @[probefunc] = count(); ) is performed. The action occurs only when the pid (process id) matches the first parameter passed on the command line.
The action is an aggregation (designated by the @) that contains the count by function name of the system calls executed. By default, the contents of the aggregation are printed upon exit from the script.
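If the script above is saved to a file, say syscount.d (an illustrative name), it can be run by passing the process id of interest as the first argument, which becomes $1:
# dtrace -s ./syscount.d `pgrep -x myapp`
Here myapp stands in for whatever executable you want to trace.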
The DTrace documentation runs to hundreds of pages, and yes, I read them all initially. However, if you are not using a new language daily, details soon become hazy. When I need or want to use DTrace, I "borrow" someone else's DTrace and make a few modifications to suit my needs and environment. Two of my favorite sources for sample DTrace scripts are /usr/demo/dtrace and the DTrace Toolkit.
Examples in /usr/demo/dtrace
In the Solaris OS, the examples from the DTrace user guide are found in /usr/demo/dtrace.
$ cd /usr/demo/dtrace
$ ls | head
applicat.d
badopen.d
begin.d
callout.d
clause.d
clear.d
countdown.d
...
Each script contains a reference to the chapter from whence it came.
For example, from applicat.d:
...* The full text of this chapter may be found here:
* http://wikis.sun.com/display/DTrace/io+Provider
...
Or you can use a browser to examine index.html:
Chapter               Script
Introduction          countdown.d
                      counter.d
                      hello.d
...
Variables             clause.d
                      rtime.d
Structs and Unions    kstat.d
                      ksyms.d
                      rwinfo.d
Aggregations          clear.d
                      dateprof.d
...
Brendan Gregg's Toolkit
Brendan Gregg, one of the co-authors of Solaris Performance and Tools [McDougall2007V2], created a toolkit. It contains numerous D scripts as well as Perl scripts that preprocess and postprocess DTrace output. One nice feature of the toolkit is the examples directory, which contains output from each of the available programs.
Examples:
...Copyright
Readme
anonpgpid_example.txt
bitesize_example.txt
connections_example.txt
cpudists_example.txt
cputimes_example.txt
cputypes_example.txt
cpuwalk_example.txt
crash_example.txt
...
From hotuser_example.txt, the following are demonstrations of the hotuser DTrace program.
Here, hotuser is run on a test program named dofuncs, which is hard-coded to spend half its time in delta(), a third in beta(), and a sixth in alpha().
# ./hotuser -c ./dofuncs
Sampling... Hit Ctrl-C to end.
^C
FUNCTION COUNT PCNT
dofuncs`alpha 511 16.5%
dofuncs`beta 1029 33.3%
dofuncs`delta 1552 50.2%
...
mpstat Program
DTrace is most useful when asking a question suggested by other performance metrics. The mpstat program is one that frequently indicates which questions to ask; it presents a high-level view of kernel activity.
mpstat 10 5
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 17 0 2 428 225 240 7 25 2 0 407 3 0 0 97
1 15 0 1 149 141 225 10 26 2 0 308 2 2 0 96
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 3 0 0 428 219 245 103 3 11 0 732 99 1 0 0
1 21 0 0 242 140 425 196 3 9 0 1320 98 2 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 432 227 240 100 3 12 0 751 99 1 0 0
1 0 0 0 243 140 443 207 4 9 0 1324 98 2 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 455 248 264 112 4 9 0 734 99 1 0 0
1 0 0 0 246 140 448 205 4 12 0 1336 98 2 0 0
In the preceding mpstat example:
minf page faults satisfied from memory
mjf page faults satisfied from disk
xcal cross calls
intr interrupts
ithr interrupts threads
csw context switches
icsw involuntary context switches
migr thread migrations
smtx spins on mutex locks
srw spins on read/writer locks
syscl system calls
usr CPU time in user mode
sys CPU time in kernel mode
wt wait on IO
idl idle
The above measurements are in values per second.
prstat -mL Tool
The prstat tool also provides useful first-level analysis metrics, supplying information similar to mpstat but at the thread level.
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
3040 rw8775 0.8 0.2 0.0 0.0 0.0 0.0 99 0.0 19 7 750 0 madplay/1
3030 root 0.1 0.2 0.0 0.0 0.0 0.0 100 0.0 30 1 400 0 nx/1
1311 rw8775 0.3 0.0 0.0 0.0 0.0 0.0 100 0.0 78 40 273 0 iobar.i386/1
700 rw8775 0.2 0.0 0.0 0.0 0.0 0.0 100 0.0 321 7 1K 129 Xorg/1
303 rw8775 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 20 4 170 0 xterm/1
310 rw8775 0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 117 20 273 0 netbar.i386/1
2972 rw8775 0.1 0.0 0.0 0.0 0.0 0.0 100 0.0 167 16 673 0 soffice.bin/1
3042 rw8775 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 24 2 386 0 prstat/1
1305 rw8775 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 174 57 552 0 perfbar/1
1296 rw8775 0.0 0.0 0.0 0.0 0.0 0.0 100 0.0 2 0 31 0 xterm/1
Total: 105 processes, 231 lwps, load averages: 0.06, 0.06, 0.04
where:
USR Percent CPU in USR per thread
SYS Percent CPU in SYS per thread
TRP Percent CPU spent trapping into the kernel
TFL Percent CPU spent servicing Text page faults
DFL Percent CPU spent servicing Data page faults
LCK Percent CPU spent waiting for locks
SLP Percent of time spent sleeping (for example, waiting for I/O)
LAT Percent of time spent waiting on a run queue
VCX Number of voluntary context switches per interval
ICX Number of involuntary context switches per interval
SCL Number of system calls per interval
SIG Number of signals per interval
Example 1: Memory Management
It is not uncommon for applications to use a threaded implementation in order to scale on modern multithreaded and multicore hardware architectures. As the thread count increases, these applications can run into lock contention.
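The malloc_hog test program traced below is not shown in the article. As a rough idea of the kind of code that produces this pattern, here is a hypothetical sketch in which many threads hammer the default libc allocator, and therefore its single lock (the thread count and allocation size are arbitrary):

#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 20

/* Each thread repeatedly allocates and frees; every call takes the allocator lock. */
static void *
malloc_thread(void *arg)
{
        long i;

        for (i = 0; i < 100000000L; i++) {
                void *p = malloc(64);
                free(p);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, malloc_thread, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return (0);
}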
Consider the following prstat -mL:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
19037 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/14
19037 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/12
19037 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/11
19037 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/8
19037 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/3
19037 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 malloc_hog/15
19037 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 malloc_hog/13
19037 root 0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 0 0 0 0 malloc_hog/10
...
Total: 1 processes, 21 lwps, load averages: 3.76, 1.20, 0.45
Here we see serious lock contention (notice LCK).
Because these are userland locks, we can use a DTrace-based program named plockstat to check for hot locks:
$ plockstat -C -e 10 -p `pgrep malloc_hog`
    -C -> check for contention events
-e -> sample for 10 sec
-p -> check pid
Mutex block
Count nsec Lock Caller
-------------------------------------------------------------------------------
4 699547535 libc.so.1`libc_malloc_lock malloc_hog`malloc_thread+0xc
1 102140 libc.so.1`libc_malloc_lock malloc_hog`malloc_thread+0x30
The DTrace-based program plockstat enables us to drill down to the root cause of the lock contention. The threads are blocked on a lock related to the malloc memory management routine. Fortunately, with the Solaris OS, there are several MT-hot memory allocators to choose from. The operating system comes with libumem and libmtmalloc, and of course there is the freely available libhoard (see www.hoard.org).
By preloading libumem with LD_PRELOAD, we get the following:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/16
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/15
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/13
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/11
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/20
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/19
19054 root 100 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 25 0 0 malloc_hog/17
...
No more lock contention!
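For reference, no rebuild is needed to switch allocators; the alternate library is simply preloaded from the shell before the program starts. A minimal sketch, using the program name from the example above:
$ LD_PRELOAD=libumem.so.1 ./malloc_hog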
Example 2: Condition Variables
Consider mpstat 10:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 8259 14363 60 18143 823 2683 7533 0 15433 15 38 0 47
1 0 0 7371 11574 2 17084 692 2269 6567 0 14318 14 34 0 52
2 0 0 6844 9808 0 15798 577 1882 6072 0 13287 13 31 0 55
3 0 0 6307 8481 0 14765 493 1621 5532 0 12330 12 29 0 59
4 0 0 6495 8268 0 15149 520 1635 5798 0 12776 12 29 0 60
5 0 0 5837 7344 0 13849 415 1333 5204 0 11547 11 26 0 63
6 0 0 5498 6638 0 13056 355 1141 4870 0 10881 10 24 0 65
7 0 0 5300 6187 0 12639 329 1017 4671 0 10502 10 24 0 67
8 0 0 5554 6303 0 13199 349 1114 4916 0 11111 10 25 0 66
...
22 0 0 2528 2601 0 6015 129 395 2264 0 5002 4 12 0 84
23 0 0 1951 2034 0 4746 104 320 1729 0 3916 3 10 0 87
24 0 0 5078 1143 0 2601 69 209 4640 0 17470 8 14 0 78
25 0 0 0 2 0 0 0 0 0 0 0 0 0 0 100
26 0 0 234 132 0 288 11 31 204 0 630 0 1 0 99
27 0 0 1415 795 0 1818 44 133 1279 0 4002 2 5 0 93
28 0 0 20586 1343 0 2994 90 262 18910 0 76303 35 51 0 15
29 0 0 219 198 0 447 11 32 194 0 466 0 1 0 99
30 0 0 0 1 0 0 0 0 0 0 0 0 0 0 100
31 0 0 0 1 0 0 0 0 0 0 0 0 0 0 100
Let's try to find out what is causing the cross calls.
Consider a modified xcallsbypid.d from the DTrace Toolkit:
#!/usr/sbin/dtrace -s
sysinfo:::xcalls
/ execname == "what" / // we are only interested in executables named what
{
@[ustack(10)]=count(); // 10 deep userland stack
}
tick-10sec // print every 10 sec
{
printf("%Y \n",walltimestamp);
printa(@);
clear(@);
trunc(@,0);
}
When executed, this script yields the following:
libc.so.1`__lwp_park+0x10
libc.so.1`cond_wait_queue+0x4c
libc.so.1`cond_wait+0x10
what`consumer+0x5c
libc.so.1`_lwp_start
160350
libc.so.1`__lwp_unpark+0xc
what`consumer+0xd0
libc.so.1`_lwp_start
269825
With DTrace we thus discover the root cause of the xcalls: they are triggered by lwp_unpark and lwp_park. Looking at the source on opensolaris.org, we see that lwp_unpark is used to wake up an LWP sleeping on a queue. Next, we examine the application source for calls that wake up a thread. We discover cond_broadcast(&thecv), which wakes up all threads waiting on a condition variable. An alternative is to wake up a single thread sleeping on the condition (cond_signal). If we change cond_broadcast to cond_signal, we see the following (ptime monitors execution time using microstate accounting):
ptime what 30 (cond_broadcast)     ptime what_new 30 (cond_signal)
real       14.547                  real       11.659
user       37.466                  user       12.655
sys      1:03.826                  sys        14.172
The ptime result shows that the CPU execution time using cond_signal is about 25% of that using cond_broadcast!
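The customer's code is not shown, but the pattern is the standard producer/consumer queue. The following hypothetical sketch, written with the POSIX calls (the Solaris cond_signal()/cond_broadcast() interfaces behave the same way), illustrates the one-line nature of the change:

#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  thecv = PTHREAD_COND_INITIALIZER;
static int items;                       /* queued work items */

void
produce(void)
{
        pthread_mutex_lock(&lock);
        items++;
        /*
         * pthread_cond_broadcast(&thecv) would wake every sleeping
         * consumer; only one of them can take the item, and the rest
         * are woken (often via cross calls to other CPUs) just to find
         * nothing to do and go back to sleep.
         */
        pthread_cond_signal(&thecv);    /* wake exactly one consumer */
        pthread_mutex_unlock(&lock);
}

void
consume(void)
{
        pthread_mutex_lock(&lock);
        while (items == 0)
                pthread_cond_wait(&thecv, &lock);
        items--;
        pthread_mutex_unlock(&lock);
}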
Example 3: memset
Consider the following prstat output.
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROC./LWPID
4293 rw8775 99 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0 566 0 0 test_memset/1
This measurement shows 99% user CPU utilization. Before DTrace, 100% CPU in USR meant there was nothing to be done at the system level. Yet, with DTrace, it is useful and practical to do a poor man's profile.
For the poor man's profile, we use hotuser from the DTrace Toolkit.
hotuser -p `pgrep test_memset`
Sampling... Hit Ctrl-C to end.
^C
FUNCTION COUNT PCNT
test_memset`0x8050a8c 210 3.1%
test_memset`main 1704 25.3%
libc.so.1`memset 4815 71.6%
We can see that the program spends 72% of its execution time in a single function. This example may seem extreme; however, I have observed real-world applications spending 50% of their time in memset, so it is less contrived than it appears. The next question to answer is: how large are these memsets? To determine this, we use the pid provider.
#!/usr/sbin/dtrace -s
pid$1::memset:entry          // the pid is the first passed parameter
{
@["memset"] = quantize(arg2);// aggregate on the number of bytes set
}
tick-10sec
{
printf("\n memset size count Date: %Y ", walltimestamp);
printa(@);
clear(@);
trunc(@,0);
}
tick-100sec
{exit(0);}
Every time memset is called in the pid of interest, we record the size of the memset. Every 10 seconds the aggregation is printed in the form of a histogram. After 100 seconds the script exits.
memset_profile.d `pgrep test_memset`
dtrace: script 'memset_profile.d' matched 4 probes
CPU ID FUNCTION:NAME
1 80324 :tick-10sec
memset size count
Date: 2008 Aug 25 17:55:44
memset
value ------------- Distribution ------------- count
2 | 0
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 6608471
8 | 0
So, in our test case, we see memsets of 4 bytes or less. A 4-byte store can be done with an assignment statement and does not need a memset. Where are these memset calls occurring? We can modify our script slightly in order to get a stack.
#!/usr/sbin/dtrace -s
pid$1::memset:entry
/ arg2 < 5 /
{
@[ustack(2),arg2] = count(); // two-level-deep stack
}
tick-10sec
{
printf("\n memset stack, size count Date: %Y ", walltimestamp);
printa(@);
clear(@);
trunc(@,0);
}
When the size for the memset is less than 5 bytes, we record a two-deep stack. It is usually best to start with shallow stack traces and increase the depth only when necessary, as stack traces are expensive. We then see the following:
memset_stack.d `pgrep test_memset`
dtrace: script 'memset_stack.d' matched 4 probes
CPU ID FUNCTION:NAME
1 80324 :tick-10sec
memset stack, size
count Date: 2008 Aug 25 17:59:12
libc.so.1`memset
test_memset`main+0x72
4 6052925
The example shows us specifically where the memset call occurs. I once had an engagement where memset took 20 percent of the execution time, and 90% of that came from a single library doing memsets of less than 4 bytes. Replacing those memsets with appropriate assignment statements made a significant difference.
For example, change:
memset(c,0,4);
to
c[0]=0; c[1]=0; c[2]=0; c[3]=0;
(I do it one byte at a time to prevent SIGBUS errors.)
Comparing the two approaches we see the following:
ptime test_memset 100000000        ptime test_assignment 100000000
real        4.559                  real        0.363
user        4.539                  user        0.353
sys         0.011                  sys         0.009
Using DTrace, we drilled down to the root cause of the excessive memsets, and discovered a program change that led to a factor-of-ten performance improvement.
Example 4: FIFO
Consider mpstat 10:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 527 254 1318 270 186 223997 0 431715 11 88 0 1
1 0 0 0 246 140 1427 347 180 239895 0 421440 12 88 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 451 227 2161 840 128 244244 0 419694 12 86 0 2
1 0 0 0 195 141 2276 800 134 218031 0 429392 11 89 0 0
The first question is: what is causing the spins on mutex locks (smtx)? We use lockstat, a DTrace-based application, to identify the cause of the lock contention.
$ lockstat -C sleep 10
Adaptive mutex spin: 4966398 events in 10.041 seconds (494628 events/sec)
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------
2484949 50% 50% 0.00 1425 0xffffff0191702a00 fifo_write
2480887 50% 100% 0.00 870 0xffffff0191702a00 fifo_read
Lockstat tells us that the spins are due to fifo activity. To determine which system calls are being executed, we use /usr/demo/dtrace/syscall.d.
Running the following:
syscall:::entry
/pid == $target/        // command is executed (-c)
                        // or pid passed (-p)
{
@[probefunc] = count(); // aggregate on the function name,
// storing the count
}
Output:
dtrace -s /usr/demo/dtrace/syscall.d -p `pgrep ptest_write`
^C
write 8636675
Now that DTrace has told us which system call is being executed, we can consider how to improve it. One way to reduce the CPU utilization of any system call is to do less of it. For writes, this means doing fewer writes by doing larger writes. Changing the program to use larger writes, we see the following:
with a 1024-byte write size: megabytes/sec = 445.227011
with an 8K write size:       megabytes/sec = 984.639115
DTrace-enabled root-cause analysis led to a modification yielding twice the throughput!
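The change itself amounts to accumulating small records in a larger buffer and issuing one write(2) per buffer rather than one per record. A hypothetical sketch (the record and buffer sizes are illustrative, and the final flush at shutdown is omitted):

#include <string.h>
#include <unistd.h>

#define RECSZ   1024            /* original write size */
#define BUFSZ   (8 * 1024)      /* new write size */

static char   buf[BUFSZ];
static size_t used;

/* Queue one fixed-size record; flush with a single large write when the buffer is full. */
void
buffered_write(int fd, const char *rec)
{
        if (used + RECSZ > BUFSZ) {
                (void) write(fd, buf, used);
                used = 0;
        }
        memcpy(buf + used, rec, RECSZ);
        used += RECSZ;
}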
Example 5: User Error
My customer was receiving an "ENOMEM" while starting his application. The problem was explained in the following manner: "When I insert records into the database through my application one at a time, I can insert 30,000 records with no issues. But when I restart the database, the application runs out of address space."
Being a skeptic, I wanted to observe the database doing a one-at-a-time insertion. However, this exercise was going to take 7 days! Not having 7 days, I agreed to have the user manually load the database with the 30,000 records. We started the application (it was a 32-bit application, therefore a 4-gigabyte address space), and it did run out of address space.
As my first measurement, I decided to measure the address space size upon entering and exiting the function that loaded records from the database. The pmap tool is useful for this purpose, but first I had to catch the execution of the application as it started up:
proc:::exec-success
/ execname == "foobar" /
{
printf("stopping %d\n", pid );
stop();
system("/var/tmp/rcw/bin/my_d/pid_foo.d %d >> /var/tmp/rcw/statdir/dlog 2>&1 &", pid);
system("/bin/prun %d", pid);
exit(0);
}
So whenever the exec system call was successful, I checked the executable name. If it was foobar, I stopped it. I used a call to system to start a dtrace script against foobar. Then I restarted foobar.
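Note that stop() and system() are destructive actions, so the script has to be run with destructive actions enabled, for example (the script file name is illustrative):
# dtrace -w -s catch_foobar.d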
The dtrace script to monitor the address space resembled the following:
pid$1::foo_insert:entry
{
printf("%s, tid = %d \n",probefunc, tid);
stop(); // stop the process
system("/bin/pmap -x %d", pid); // check its address space
system("/bin/prun %d", pid); // restart the process
}
And, of course, we did something similar for return.
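That return clause is not reproduced here, but it presumably differed only in the probe name; a sketch:
pid$1::foo_insert:return
{
printf("%s return, tid = %d \n",probefunc, tid);
stop(); // stop the process
system("/bin/pmap -x %d", pid); // check its address space
system("/bin/prun %d", pid); // restart the process
}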
The pmap tool showed 640 megabytes upon entry and 740 megabytes upon exit. I was at a loss as to where the additional address space was being allocated.
At exit from the critical function:
28700: foobar
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 56168 45128 - - r-x-- foobar
036F8000 6728 6448 5248 - rwx-- foobar
03D8A000 24 24 24 - rwx-- [ anon ]
03D90000 641792 640384 640384 - rwx-- [ heap ]
...
total Kb 742920 724000 647392 -
At program termination:
28700: foobar
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 56168 45128 - - r-x-- foobar
036F8000 6728 6448 5248 - rwx-- foobar
03D8A000 24 24 24 - rwx-- [ anon ]
03D90000 4051904 4050496 4050496 - rwx-- [ heap ]
total Kb 4165688 4146768 4057512 -
First, we decided to check out allocation sizes.
pid$1:*:*malloc*:entry,
pid$1:*:*memalign*:entry,
pid$1:*:*valloc*:entry
{
@[probemod,probefunc,arg0]= count(); // probemod is the
// module name, e.g. libc.so.1
}
// probefunc is the function name
pid$1:*:*realloc*:entry
{
@[probemod,probefunc,arg1]= count(); // aggregate on the size,
// count the number of requests
}
tick-10sec // dump data every ten seconds
{
printf("\n Pid = %d Date: %Y \n", $1, walltimestamp); // print a header
printa("probemod = %s, probefunc = %s size = %d, count = %@d \n",@);
//print the aggregation
printa(@);
clear(@);
trunc(@,0);
}
probemod = libumem.so.1 probefunc = malloc size = 2048 count = 6
probemod = libumem.so.1 probefunc = malloc size = 2748 count = 6
probemod = libc.so.1 probefunc = lmalloc size = 20 count = 19
probemod = libumem.so.1 probefunc = malloc size = 4 count = 35
probemod = libumem.so.1 probefunc = malloc size = 4613 count = 38
probemod = libumem.so.1 probefunc = malloc size = 1280 count = 46
probemod = libumem.so.1 probefunc = malloc size = 1024 count = 249
probemod = libumem.so.1 probefunc = malloc size = 12 count = 302
probemod = libumem.so.1 probefunc = malloc size = 112501 count = 1466 // suspicious
probemod = libumem.so.1 probefunc = malloc size = 24 count = 1499
We immediately noticed over 2 GB of 100k sized allocations. We used the following to determine where the allocations were occurring:
pid$1:*:*malloc*:entry,
pid$1:*:*memalign*:entry,
pid$1:*:*valloc*:entry
/ arg0 > 100000 / // we are only interested
// in the large allocations
{
@[ustack(10),arg0]= count(); // now we aggregate on size
// and location(stack)
}
pid$1:*:*realloc*:entry
{
@[ustack(10),arg1]= count();
}
tick-10sec
{
printf("\n Pid = %d Date: %Y \n", $1, walltimestamp);
printa("stack = %k , size = %d, count = %@d \n",@);
clear(@);
trunc(@,0);
}
Our output resembled the following:
libumem.so.1`malloc
libCrun.so.1`__1c2n6FI_pv_+0x2c
a.out`__1cXfoobarHlrExcpCacheNode2t5B6MXXb_v_+0x4c
a.out`__1cVfoobarHlrExcpCachingSinitExceptionCache6MpnUCpCallmMobileDbCache__b_+0x334
a.out`__1cMfoobarDbpSmVprocDbpWaitForInitCmd6MrnPCmnPrSdlProcess_iiipv_n0AGstates__+0x15fc
a.out`__1cMfoobarDbpSmGdbpSdl6FrnPCmnPrSdlProcess_iiipv_nNCmnPrSdlError__+0xbc
a.out`__1cPfooSignal6Miipv_nNCmnPrSdlError__+0x24
a.out`__1cQfooignal6MrnJSdlSignal_b_nNCmnPrSdlError__+0xf0
a.out`__1cQfooInternalSignalLoop6M_nNCmnPrSdlError__+0x4c
a.out`__1cQfooExternalSignal6Miiipvp0i_nNCmnPrSdlError__+0xe0
12501 1442
Thus, we determined the root cause: we were allocating gigabytes of exception cache.
The developer, when confronted with this data, was in a total state of denial. He looked into the issue and discovered that the tester's SQL to create the 30000 database entries was incorrect! The tester was initializing a field to all 0s when it should have been 0xffffff.
Useful Tip: Listing All Probes
If you do want to create your own script, it is useful to note that dtrace -l lists all available probes.
Here I ask for all entry probes in the tcp module.
$ dtrace -l -n tcp::entry    (or: dtrace -l -m tcp)
ID PROVIDER MODULE FUNCTION NAME
57791 fbt tcp inet_attach entry
57793 fbt tcp inet_detach entry
57795 fbt tcp inet_info entry
57797 fbt tcp _info entry
Then use -v to get the probe arguments.
$ dtrace -lv -n fbt:tcp:inet_info:entry
ID PROVIDER MODULE FUNCTION NAME
57795 fbt tcp inet_info entry
Probe Description Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: Unknown
Argument Attributes
Identifier Names: Private
Data Semantics: Private
Dependency Class: ISA
Argument Types
args[0]: dev_info_t *
args[1]: ddi_info_cmd_t
args[2]: void *
args[3]: void **
To go farther, you need to use the include files to understand the arguments.
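Once the types are known, the typed args[] array can be used directly in a clause. A hedged sketch based on the listing above (the format string is my own, not from the article):
fbt:tcp:inet_info:entry
{
printf("info cmd = %d\n", args[1]); /* args[1] is the ddi_info_cmd_t */
}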
One of my favorite providers is the profile provider. It is the core provider for the hotuser script mentioned earlier. It runs upon every CPU at specified intervals. It is a lightweight way to do a poor man's profile.
profile:::profile-101            /* run 101 times per sec */
/ cpu == $1 && arg0 != 0 /       /* only interested in a certain CPU in kernel */
{
@calls[curthread , stack(5)]=count(); /* aggregate on thread ptr, 5 deep stack*/
}
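To use it, pass the CPU of interest as the first macro argument and end with Ctrl-C to print the aggregation; for example (the script name and CPU number are illustrative):
# dtrace -s profile_cpu.d 28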
Conclusion
DTrace does not replace any of the existing performance measuring tools that a practitioner typically employs to characterize performance. Where once the characterization led to experiments, conjectures, static instrumentation of code, and so on, DTrace now allows the analyst to measure directly in live production environments, and thus to avoid experiments, static instrumentation, and hidden assumptions. With DTrace we can get to the root cause quickly, directly, and without ambiguity.
Credits and Thanks
To Darryl Gove and Jim Fiori of Sun Microsystems for their reviews and comments.
About the Author
Rickey C. Weisner is a Senior Engineer working for the Solaris 10 Adoption group in the US Systems Practice. Rick has 18 years of experience at Sun. For the past 12 years he has been doing performance analysis and supporting software development. Prior to joining Sun he had 10 years of experience in research and development, commercial programming, and nuclear engineering.