With my explanation of a demo gone wrong, several people have asked me the more general question: how does one demo DTrace? This question doesn’t have a single answer, especially given that DTrace is best demonstrated with ad hoc queries of the system. Indeed, the best demos are when someone in the audience shouts out their own question that they want to see answered: answering such a question instantly with DTrace nearly always blows away the questioner — who has presumably suffered in the past trying to answer similar questions. Of course, saying “why, there are many ways to demo to DTrace!” is a useless answer to the question of how one demos DTrace; while there are many ways that one can demo DTrace, it’s useful to get a flavor of how a typical demo might go. So with the substantial caveat that this is not the way to demo DTrace, but merely a way, let me walk you through an example demo.
It all starts by running dtrace(1M) without any arguments:
# dtrace
Usage: dtrace [-32|-64] [-aACeEFGHlqSvVwZ] [-b bufsz] [-c cmd] [-D name[=def]]
[-I path] [-o output] [-p pid] [-s script] [-U name]
[-x opt[=val]] [-X a|c|s|t]
[-P provider [[ predicate ] action ]]
[-m [ provider: ] module [[ predicate ] action ]]
[-f [[ provider: ] module: ] func [[ predicate ] action ]]
[-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]
[-i probe-id [[ predicate ] action ]] [ args ... ]
predicate -> '/' D-expression '/'
action -> '{' D-statements '}'
-32 generate 32-bit D programs and ELF files
-64 generate 64-bit D programs and ELF files
-a claim anonymous tracing state
-A generate driver.conf(4) directives for anonymous tracing
-b set trace buffer size
-c run specified command and exit upon its completion
-C run cpp(1) preprocessor on script files
-D define symbol when invoking preprocessor
-e exit after compiling request but prior to enabling probes
-E exit after enabling probes but prior to tracing data
-f enable or list probes matching the specified function name
-F coalesce trace output by function
-G generate an ELF file containing embedded dtrace program
-H print included files when invoking preprocessor
-i enable or list probes matching the specified probe id
-I add include directory to preprocessor search path
-l list probes matching specified criteria
-m enable or list probes matching the specified module name
-n enable or list probes matching the specified probe name
-o set output file
-p grab specified process-ID and cache its symbol tables
-P enable or list probes matching the specified provider name
-q set quiet mode (only output explicitly traced data)
-s enable or list probes according to the specified D script
-S print D compiler intermediate code
-U undefine symbol when invoking preprocessor
-v set verbose mode (report program stability attributes)
-V report DTrace API version
-w permit destructive actions
-x enable or modify compiler and tracing options
-X specify ISO C conformance settings for preprocessor
-Z permit probe descriptions that match zero probes
I usually just point out this just gives your typical Unix-ish help
message — implicitly making the point that dipping into DTrace doesn’t
require much more than typing its name on the command line.
From the output of the usage message,
I highlight the “-l” option, noting that it lists all the probes in the
system. I then run with that option — being sure to pipe the output to
more(1):
# dtrace -l | more
ID PROVIDER MODULE FUNCTION NAME
1 dtrace BEGIN
2 dtrace END
3 dtrace ERROR
4 lockstat genunix mutex_enter adaptive-acquire
5 lockstat genunix mutex_enter adaptive-block
6 lockstat genunix mutex_enter adaptive-spin
7 lockstat genunix mutex_exit adaptive-release
8 lockstat genunix mutex_destroy adaptive-release
9 lockstat genunix mutex_tryenter adaptive-acquire
10 lockstat genunix lock_set spin-acquire
11 lockstat genunix lock_set spin-spin
12 lockstat genunix lock_set_spl spin-acquire
13 lockstat genunix lock_set_spl spin-spin
14 lockstat genunix lock_try spin-acquire
15 lockstat genunix lock_clear spin-release
16 lockstat genunix lock_clear_splx spin-release
17 lockstat genunix CLOCK_UNLOCK spin-release
18 lockstat genunix rw_enter rw-acquire
19 lockstat genunix rw_enter rw-block
20 lockstat genunix rw_exit rw-release
21 lockstat genunix rw_tryenter rw-acquire
22 lockstat genunix rw_tryupgrade rw-upgrade
23 lockstat genunix rw_downgrade rw-downgrade
24 lockstat genunix thread_lock thread-spin
25 lockstat genunix thread_lock_high thread-spin
26 fasttrap fasttrap fasttrap
27 syscall nosys entry
28 syscall nosys return
29 syscall rexit entry
30 syscall rexit return
--More--
I point out that each of these rows is a point of instrumentation. I then explain each of the columns, explaining that the PROVIDER column denotes the DTrace notion of instrumentation providers which allows for disjoint instrumentation techniques. If the audience is a particularly savvy one, I may point out that the lockstat provider is a recasting of the instrumentation technique used by lockstat(1M), and that lockstat has been made to be a DTrace consumer. I mention that the MODULE column contains the name of the kernel module (in the case of a kernel probe) or shared object name (in the case of a user-level probe), that the FUNCTION column contains the function that the probe is in, and that the NAME column contains the name of the probe. I explain that a each probe is uniquely identified by the probe tuple: provider, module, function and name.
Scrolling down through the output, I point out the probes from syscall provider, describing its ability to instrument every system call entry and return. (For some audiences, I may pause here to remind them of the definition of a system call.)
I then continue to scroll down my probe listing until I get to output that looks something like this:
...
1268 fbt unix dvma_release entry
1269 fbt unix dvma_release return
1270 fbt unix softlevel1 entry
1271 fbt unix softlevel1 return
1272 fbt unix freq_notsc entry
1273 fbt unix freq_notsc return
1274 fbt unix psm_unmap entry
1275 fbt unix psm_unmap return
1276 fbt unix cpu_visibility_online entry
1277 fbt unix cpu_visibility_online return
1278 fbt unix setbackdq entry
1279 fbt unix setbackdq return
1280 fbt unix lgrp_choose entry
1281 fbt unix lgrp_choose return
1282 fbt unix cpu_intr_swtch_enter entry
1283 fbt unix cpu_intr_swtch_enter return
1284 fbt unix page_upgrade entry
1285 fbt unix page_upgrade return
1286 fbt unix page_lock_es entry
1287 fbt unix page_lock_es return
1288 fbt unix lgrp_shm_policy_set entry
1289 fbt unix lgrp_shm_policy_set return
1290 fbt unix segkmem_gc entry
1291 fbt unix segkmem_gc return
1292 fbt unix disp_anywork entry
1293 fbt unix disp_anywork return
1294 fbt unix prgetprxregsize entry
1295 fbt unix prgetprxregsize return
1296 fbt unix cpuid_pass1 entry
1297 fbt unix cpuid_pass1 return
1298 fbt unix cpuid_pass2 entry
1299 fbt unix cpuid_pass2 return
1300 fbt unix cpuid_pass3 entry
1301 fbt unix cpuid_pass3 return
1302 fbt unix cpuid_pass4 entry
1303 fbt unix cpuid_pass4 return
1304 fbt unix release_bootstrap entry
1305 fbt unix release_bootstrap return
1306 fbt unix i_ddi_rnumber_to_regspec entry
1307 fbt unix i_ddi_rnumber_to_regspec return
1308 fbt unix page_mem_avail entry
1309 fbt unix page_mem_avail return
1310 fbt unix page_pp_lock entry
1311 fbt unix page_pp_lock return
...
I pause here to explain that the fbt provider can instrument every function entry and return in the kernel. Making the observation that there are quite a few functions in the kernel, I then quit out of the output, and re-run the command — this time piping the output through wc. Usually, the output is something like this:
# dtrace -l | wc -l
32521
Anyone who has dealt with traditional instrumentation frameworks will typically express some shock at this — that there are (for example) 32,520 points of instrumentation on an optimized, stock system. Occasionally, someone who has perhaps heard of DTrace may pipe up: “Is that where that 30,000 number comes from that I’ve seen in in the press?” The quick answer is that this is indeed where that number comes from — but the longer answer is that this really only the beginning because with the pid provider, DTrace can instrument every instruction in every running app. For perhaps obvious reasons, we create these pid probes lazily — we don’t want to clog up the system with millions of unused probes.
After having listed all probes and counted them, my next invocation of dtrace is to list the probes yet again — but this time listing only probes that match the probe tuple “syscall:::entry". I explain that this means I want to match probes from the syscall provider named entry — and that I don’t care about the module and function. This output is simply a shorter version of the previous listing:
# dtrace -l -n syscall:::entry
ID PROVIDER MODULE FUNCTION NAME
27 syscall nosys entry
29 syscall rexit entry
31 syscall forkall entry
33 syscall read entry
35 syscall write entry
37 syscall open entry
39 syscall close entry
41 syscall wait entry
...
I then explain that to enable these probes (instead of just listing them), I need to merely not include the “-l“:
# dtrace -n syscall:::entry
dtrace: description 'syscall:::entry' matched 226 probes
CPU ID FUNCTION:NAME
0 125 ioctl:entry
0 125 ioctl:entry
0 261 sysconfig:entry
0 261 sysconfig:entry
0 195 sigaction:entry
0 195 sigaction:entry
0 125 ioctl:entry
0 261 sysconfig:entry
0 61 brk:entry
0 61 brk:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 403 fstat64:entry
0 61 brk:entry
0 61 brk:entry
0 403 fstat64:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 35 write:entry
0 125 ioctl:entry
0 403 fstat64:entry
0 35 write:entry
0 125 ioctl:entry
0 125 ioctl:entry
0 339 pollsys:entry
0 33 read:entry
...
I let this run on the screen for a while, explaining that we’ve gone from an optimized, uninstrumented system call table to an instrumented one — and that we’re now seeing every system call as it happens on the box. I explain that (on the one hand) this is novel information: in Solaris, we haven’t had a way to get system call information for the entire system. (I often remind the audience that this may look similar to truss , but it’s really quite different: truss operates only on a single process, and has an enormous probe effect from constantly stopping and running the target process.)
But while this information may be novel (at least for Solaris), it’s actually not that valuable — merely knowing that we executed these systems calls is unlikely to be terrible helpful. Rather, we may like to know which applications are inducing these system calls. To do this, we add a clause in the DTrace control language, D. The clause we need to add is quite simple; we’re going to use the trace action to record the current application name, which is stored in execname variable. So I ^C the previous enabling, and run the following:
# dtrace -n syscall:::entry'{trace(execname)}'
dtrace: description 'syscall:::entry' matched 226 probes
...
0 125 ioctl:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 339 pollsys:entry xterm
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 33 read:entry Xsun
0 233 writev:entry Xsun
0 125 ioctl:entry xterm
0 33 read:entry xterm
0 339 pollsys:entry xterm
0 339 pollsys:entry xterm
0 125 ioctl:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 125 ioctl:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 339 pollsys:entry xterm
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 313 lwp_sigmask:entry Xsun
0 339 pollsys:entry Xsun
0 35 write:entry dtrace
0 125 ioctl:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 33 read:entry xterm
0 125 ioctl:entry xterm
0 35 write:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 125 ioctl:entry xterm
0 33 read:entry xterm
0 35 write:entry xterm
0 125 ioctl:entry xterm
0 125 ioctl:entry xterm
0 339 pollsys:entry xterm
0 339 pollsys:entry xterm
0 313 lwp_sigmask:entry Xsun
...
This is clearly more useful: we can see which apps are performing system calls. But seeing this also leads to a very important observation, one that some heckler in a savvy audience may have already made by now: what we’re seeing in the above output is the system calls to write all of this gunk out to the terminal. That is, if we were to take this output and cut it, sort it, uniq it, and sort it again, we would come to the conclusion that dtrace, xterm and Xsun were the problem — regardless of what was being executed on the machine.
This is a serious problem, and the DTrace solution represents a substantial difference from virtually all earler work in this domain: in DTrace, the aggregation of data is a first-class citizen. Instead of aggregating data postmortem with traditional Unix tools, DTrace allows data to be aggregated in situ. This aggregation can reduce the data flow by up to a factor of the number of data points — a tremendous savings of both space and time.
So to recast the example in terms of aggregations, we want to aggregate on the application name, with our aggregating action being to count:
# dtrace -n syscall:::entry'{@[execname] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes
When we run the above, we don’t see any output — just that we matched some number of probes. I explain that behind the scenes, we’re just keeping a table of application name, and the count of system calls. Whenever we ^C the above, we get the answer:
^C
utmpd 4
automountd 8
inetd 9
svc.configd 9
syslogd 16
FvwmAuto 39
in.routed 48
svc.startd 97
MozillaFirebird- 125
sendmail 183
fvwm2 621
dhcpagent 1192
dtrace 2195
xclock 2894
vim 7760
xterm 11768
Xsun 24916
#
I often point out that in DTrace — as in good cocktail conversation — the answer to one question typically provokes the next question. Looking at the above output, we may have all sorts of questions. For example, we may well ask: what the hell is dhcpagent doing? To answer this question, we can add a predicate to our clause:
# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[probefunc] = count()}'
The predicate is contained between the forward slashes; the expression in the curly braces is only evaluated if the predicate evaluates to a non-zero value. And note that we have changed the tuple on which we are aggregating: instead of aggregating on execname (which, thanks to the predicate, will always be “dhcpagent“), we’re aggregating on the probefunc. For the syscall provider, the probefunc is the name of the system call.
Running the above for ten seconds or so, and then ^C‘ing yields the following (at least on my laptop):
^C
pollsys 3
close 3
open 3
lwp_sigmask 6
fstat64 6
read 12
ioctl 12
llseek 15
Now, we may be interested in drilling down more into one of these system calls. Let’s say we want to know what dhcpagent is doing when it performs an open. We can effect this by narrowing our enabling a bit: instead of enabling all syscall probes named entry, we want to enable only those probes that are also from the function named open. We want to keep our predicate that the execname is dhcpagent, but this time we don’t want to aggregate on the probefunc (which, thanks to the narrowed enabling, will always be “open“); we want to aggregate on the name of the file that is being opened — which is the first argument to the open system call. To do this, we’ll aggregate on arg0. For slightly arcane reasons, we need to use the copyinstr action to effect this. The enabling:
# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[copyinstr(arg0)] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe
Again, running the above for about ten seconds:
^C
/etc/default/dhcpagent 4
This tells us that there were four calls to the open system call from dhcpagent — and that all four were to the same file. Just on the face of it, this is suspicious: why is dhcpagent opening the file “/etc/default/dhcpagent” with such frequency? Hasn’t whoever wrote this code ever heard of stat?1 To answer the former question, we’re going to change our enabling to aggregate on dhcpagent‘s stack backtrace when it performs an open. This is done with the ustack action. The new enabling:
# dtrace -n syscall::open:entry'/execname == "dhcpagent"/{@[ustack()] = count()}'
dtrace: description 'syscall::open:entry' matched 1 probe
Running the above for ten seconds or so and then ^C‘ing yields the following:
^C
libc.so.1`__open+0xc
libc.so.1`_endopen+0x92
libc.so.1`fopen+0x26
libcmd.so.1`defopen+0x40
dhcpagent`df_get_string+0x27
dhcpagent`df_get_int+0x1c
dhcpagent`dhcp_requesting+0x248
libinetutil.so.1`iu_expire_timers+0x8a
libinetutil.so.1`iu_handle_events+0x95
dhcpagent`main+0x300
805314a
4
This tells us that all four of the open calls were from the above stack trace.
From here we have many options, and our course of action would depend on how we wanted to investigate this. Up until this point, we have been instrumenting only the system call layer — the shim layer between applications and the kernel. We may wish to now instrument the application itself. To do this, we first need to know how many dhcpagent processes are contributing to the output we’re seeing above. There are lots of ways to do this; a simple way is to aggregate on pid:
# dtrace -n syscall:::entry'/execname == "dhcpagent"/{@[pid] = count()}'
dtrace: description 'syscall:::entry' matched 226 probes
^C
43 80
This indicates that in the time between when we started the above, and the time that we ^C‘d, we saw 80 system calls from pid 43 — and that is was the only dhcpagent process that made any system calls. Now that we have the pid that we’re intrested in, we can instrument the application. For starters, let’s just instrument the defopen function in libcmd.so.1. (We know that it’s being called because we saw it in the stack backtrace of the open system call.) To do this:
# dtrace -n pid43::defopen:entry
dtrace: description 'pid43::defopen:entry' matched 1 probe
CPU ID FUNCTION:NAME
0 32622 defopen:entry
0 32622 defopen:entry
0 32622 defopen:entry
0 32622 defopen:entry
^C
With this we have instrumented the running app. When we ^C‘d, the app went back to being its usual optimized self. (Or at least, as optimized as dhcpagent ever is.)
If we wanted to trace both entry to and return from the defopen function, we could add another enabling to the mix:
# dtrace -n pid43::defopen:entry,pid43::defopen:return
dtrace: description 'pid43::defopen:entry,pid43::defopen:return' matched 2 probes
CPU ID FUNCTION:NAME
0 32622 defopen:entry
0 32623 defopen:return
0 32622 defopen:entry
0 32623 defopen:return
0 32622 defopen:entry
0 32623 defopen:return
0 32622 defopen:entry
0 32623 defopen:return
^C
But we may want to know more than this: we may want to know everything called from defopen. Answering this is a bit more sophisticated — and it’s a little too much for the command line. To answer this question, we head to an editor and type in the following DTrace script:
#!/usr/sbin/dtrace -s
pid43::defopen:entry
{
self->follow = 1;
}
pid43:::entry,
pid43:::return
/self->follow/
{}
pid43::defopen:return
/self->follow/
{
self->follow = 0;
exit(0);
}
So what is this script doing? We’re enabling defopen again, but this time we’re setting a thread-local variable — which we named “follow” to 1. We then enabled every function entry and return in the process, with the predicate that we’re only interested if our thread-local variable is non-zero. This has the effect of only tracing function entry and return if the function call was induced by defopen. We’re only interested in capturing one of these traces, which is the reason why we explicitly exit in the pid43::defopen:return clause.
Running the above is going to instrument the hell out of dhcpagent. We’re going to see that we matched many, many probes. Then we’ll see a bunch of output as defopen is called. Finally, we’ll be dropped back onto the shell prompt as we hit the exit action. By the time we see the shell again, we have restored the dhcpagent process to its uninstrumented self. Running the above, assuming it’s in dhcp.d:
# chmod +x dhcp.d
# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU ID FUNCTION:NAME
0 32622 defopen:entry
0 33328 _get_thr_data:entry
0 37823 thr_getspecific:entry
0 43272 thr_getspecific:return
0 38784 _get_thr_data:return
0 37176 fopen:entry
0 37161 _findiop:entry
0 37525 pthread_rwlock_wrlock:entry
0 37524 rw_wrlock_impl:entry
0 37513 rw_read_held:entry
0 42963 rw_read_held:return
0 37514 rw_write_held:entry
0 42964 rw_write_held:return
0 37518 rwlock_lock:entry
0 37789 sigon:entry
0 43238 sigon:return
0 42968 rwlock_lock:return
0 42974 rw_wrlock_impl:return
0 42975 pthread_rwlock_wrlock:return
0 37174 getiop:entry
0 37674 mutex_trylock:entry
0 43123 mutex_trylock:return
0 37676 mutex_unlock:entry
0 43125 mutex_unlock:return
0 42625 getiop:return
0 37174 getiop:entry
0 37674 mutex_trylock:entry
0 43123 mutex_trylock:return
0 37676 mutex_unlock:entry
0 43125 mutex_unlock:return
0 42625 getiop:return
0 37174 getiop:entry
0 37674 mutex_trylock:entry
0 43123 mutex_trylock:return
0 37676 mutex_unlock:entry
0 43125 mutex_unlock:return
0 42625 getiop:return
0 37174 getiop:entry
0 37674 mutex_trylock:entry
0 43123 mutex_trylock:return
0 35744 memset:entry
0 41198 memset:return
0 37676 mutex_unlock:entry
0 43125 mutex_unlock:return
0 42625 getiop:return
0 37531 pthread_rwlock_unlock:entry
0 37514 rw_write_held:entry
0 37680 mutex_held:entry
0 43129 mutex_held:return
0 42964 rw_write_held:return
0 37789 sigon:entry
0 43238 sigon:return
0 42981 pthread_rwlock_unlock:return
0 42612 _findiop:return
0 37123 _endopen:entry
0 37309 _open:entry
0 37951 __open:entry
0 43397 __open:return
0 42760 _open:return
0 37152 _flockget:entry
0 37670 mutex_lock:entry
0 37669 mutex_lock_impl:entry
0 43118 mutex_lock_impl:return
0 43119 mutex_lock:return
0 42603 _flockget:return
0 37676 mutex_unlock:entry
0 43125 mutex_unlock:return
0 42574 _endopen:return
0 42627 fopen:return
0 32623 defopen:return
0 32623 defopen:return
#
This is interesting, but a little hard to sift through. Go back to dhcp.d, and add the following line:
#pragma D option flowindent
This sets a simple option that indents a code flow: probes named entry cause indentation to increase two spaces, and probes named return cause indentation to decrease two spaces. Running the new dhcp.d:
# ./dhcp.d
dtrace: script './dhcp.d' matched 10909 probes
CPU FUNCTION
0 -> defopen
0 -> _get_thr_data
0 -> thr_getspecific
0 <- thr_getspecific
0 <- _get_thr_data
0 -> fopen
0 -> _findiop
0 -> pthread_rwlock_wrlock
0 -> rw_wrlock_impl
0 -> rw_read_held
0 <- rw_read_held
0 -> rw_write_held
0 <- rw_write_held
0 -> rwlock_lock
0 -> sigon
0 <- sigon
0 <- rwlock_lock
0 <- rw_wrlock_impl
0 <- pthread_rwlock_wrlock
0 -> getiop
0 -> mutex_trylock
0 <- mutex_trylock
0 -> mutex_unlock
0 <- mutex_unlock
0 <- getiop
0 -> getiop
0 -> mutex_trylock
0 <- mutex_trylock
0 -> mutex_unlock
0 <- mutex_unlock
0 <- getiop
0 -> getiop
0 -> mutex_trylock
0 <- mutex_trylock
0 -> mutex_unlock
0 <- mutex_unlock
0 <- getiop
0 -> getiop
0 -> mutex_trylock
0 <- mutex_trylock
0 -> memset
0 <- memset
0 -> mutex_unlock
0 <- mutex_unlock
0 <- getiop
0 -> pthread_rwlock_unlock
0 -> rw_write_held
0 -> mutex_held
0 <- mutex_held
0 <- rw_write_held
0 -> sigon
0 <- sigon
0 <- pthread_rwlock_unlock
0 <- _findiop
0 -> _endopen
0 -> _open
0 -> __open
0 <- __open
0 <- _open
0 -> _flockget
0 -> mutex_lock
0 -> mutex_lock_impl
0 <- mutex_lock_impl
0 <- mutex_lock
0 <- _flockget
0 -> mutex_unlock
0 <- mutex_unlock
0 <- _endopen
0 <- fopen
0 | defopen:return
0 <- defopen
#
Now it's much easier to see what exactly is going on. One thing I may be interested in doing is tracing a timestamp at each of these points -- to get an idea of how much time we're spending in each of these functions. To do this, copy dhcp.d to dhcptime.d, and modify it to look like this:
#!/usr/sbin/dtrace -s
#pragma D option flowindent
pid43::defopen:entry
{
self->start = vtimestamp;
}
pid43:::entry,
pid43:::return
/self->start/
{
trace(vtimestamp - self->start);
}
pid43::defopen:return
/self->start/
{
self->start = 0;
exit(0);
}
Running dhcptime.d:
# ./dhcptime.d
dtrace: script './dhcptime.d' matched 10909 probes
CPU FUNCTION
0 -> defopen 0
0 -> _get_thr_data 2276
0 -> thr_getspecific 6214
0 <- thr_getspecific 8445
0 <- _get_thr_data 10512
0 -> fopen 14296
0 -> _findiop 34498
0 -> pthread_rwlock_wrlock 37822
0 -> rw_wrlock_impl 39265
0 -> rw_read_held 42514
0 <- rw_read_held 61148
0 -> rw_write_held 64573
0 <- rw_write_held 66630
0 -> rwlock_lock 69802
0 -> sigon 89495
0 <- sigon 92150
0 <- rwlock_lock 94463
0 <- rw_wrlock_impl 96615
0 <- pthread_rwlock_wrlock 114732
0 -> getiop 117720
0 -> mutex_trylock 121825
0 <- mutex_trylock 141821
0 -> mutex_unlock 144477
0 <- mutex_unlock 146819
0 <- getiop 149181
0 -> getiop 150408
0 -> mutex_trylock 168143
0 <- mutex_trylock 169478
0 -> mutex_unlock 170679
0 <- mutex_unlock 171952
0 <- getiop 173196
0 -> getiop 174371
0 -> mutex_trylock 175493
0 <- mutex_trylock 176733
0 -> mutex_unlock 177897
0 <- mutex_unlock 179068
0 <- getiop 180403
0 -> getiop 181585
0 -> mutex_trylock 182721
0 <- mutex_trylock 184085
0 -> memset 185393
0 <- memset 186869
0 -> mutex_unlock 188002
0 <- mutex_unlock 189275
0 <- getiop 190758
0 -> pthread_rwlock_unlock 211137
0 -> rw_write_held 212521
0 -> mutex_held 214849
0 <- mutex_held 217061
0 <- rw_write_held 218361
0 -> sigon 237062
0 <- sigon 238458
0 <- pthread_rwlock_unlock 241227
0 <- _findiop 243451
0 -> _endopen 263709
0 -> _open 267861
0 -> __open 271268
0 <- __open 334013
0 <- _open 336267
0 -> _flockget 338543
0 -> mutex_lock 340668
0 -> mutex_lock_impl 342788
0 <- mutex_lock_impl 345111
0 <- mutex_lock 363767
0 <- _flockget 365602
0 -> mutex_unlock 366861
0 <- mutex_unlock 368156
0 <- _endopen 370564
0 <- fopen 389523
0 | defopen:return 391539
0 <- defopen
#
The times in the above are all in nanoseconds. That is, the call to _get_thr_data was 2,276 nanoseconds after we called defopen, the first call to thr_getspecific was 6,214 nanoseconds, and so on. Note the big jump in __open above: we spent over sixty microseconds (334,013 - 271,268 > 60,000) in that one function. What happened there? We went into the kernel of course. (Remember, we found defopen by aggregating on the user stack from the open system call.) We may wish to know everything -- kernel and user -- that happens from defopen. To do this, go back to the original dhcp.d and change this:
pid43:::entry,
pid43:::return
/self->follow/
{}
To this:
pid43:::entry,
pid43:::return,
fbt:::
/self->follow/
{
printf(" (%s)", probeprov);
}
The "fbt:::" line enables every probe from the fbt provider. Recall that this provider makes available a probe upon entry to and return from every function in the kernel. I've added an additional line that records the name of the current provider (the "probeprov" variable) to make clear if a probe is in the kernel or is in user-space. This script is really going to instrument the snot out of the system; running the new dhcp.d:
# ./dhcp.d
dtrace: script './dhcp.d' matched 42092 probes
CPU FUNCTION
0 -> defopen (pid43)
0 -> _get_thr_data (pid43)
0 -> thr_getspecific (pid43)
0 <- thr_getspecific (pid43)
0 <- _get_thr_data (pid43)
0 -> fopen (pid43)
0 -> _findiop (pid43)
0 -> pthread_rwlock_wrlock (pid43)
0 -> rw_wrlock_impl (pid43)
0 -> rw_read_held (pid43)
0 <- rw_read_held (pid43)
0 -> rw_write_held (pid43)
0 <- rw_write_held (pid43)
0 -> rwlock_lock (pid43)
0 -> sigon (pid43)
0 <- sigon (pid43)
0 <- rwlock_lock (pid43)
0 <- rw_wrlock_impl (pid43)
0 <- pthread_rwlock_wrlock (pid43)
0 -> getiop (pid43)
0 -> mutex_trylock (pid43)
0 <- mutex_trylock (pid43)
0 -> mutex_unlock (pid43)
0 <- mutex_unlock (pid43)
0 <- getiop (pid43)
0 -> getiop (pid43)
0 -> mutex_trylock (pid43)
0 <- mutex_trylock (pid43)
0 -> mutex_unlock (pid43)
0 <- mutex_unlock (pid43)
0 <- getiop (pid43)
0 -> getiop (pid43)
0 -> mutex_trylock (pid43)
0 <- mutex_trylock (pid43)
0 -> mutex_unlock (pid43)
0 <- mutex_unlock (pid43)
0 <- getiop (pid43)
0 -> getiop (pid43)
0 -> mutex_trylock (pid43)
0 <- mutex_trylock (pid43)
0 -> memset (pid43)
0 <- memset (pid43)
0 -> mutex_unlock (pid43)
0 <- mutex_unlock (pid43)
0 <- getiop (pid43)
0 -> pthread_rwlock_unlock (pid43)
0 -> rw_write_held (pid43)
0 -> mutex_held (pid43)
0 <- mutex_held (pid43)
0 <- rw_write_held (pid43)
0 -> sigon (pid43)
0 <- sigon (pid43)
0 <- pthread_rwlock_unlock (pid43)
0 <- _findiop (pid43)
0 -> _endopen (pid43)
0 -> _open (pid43)
0 -> __open (pid43)
0 -> syscall_mstate (fbt)
0 -> tsc_gethrtimeunscaled (fbt)
0 <- tsc_gethrtimeunscaled (fbt)
0 <- syscall_mstate (fbt)
0 -> open (fbt)
0 -> copen (fbt)
0 -> falloc (fbt)
0 -> ufalloc (fbt)
0 -> ufalloc_file (fbt)
0 -> fd_find (fbt)
0 <- fd_find (fbt)
0 -> fd_reserve (fbt)
0 <- fd_reserve (fbt)
0 <- ufalloc_file (fbt)
0 <- ufalloc (fbt)
0 -> kmem_cache_alloc (fbt)
0 <- kmem_cache_alloc (fbt)
0 -> crhold (fbt)
0 <- crhold (fbt)
0 <- falloc (fbt)
0 -> vn_openat (fbt)
0 -> lookupnameat (fbt)
0 -> pn_get_buf (fbt)
0 <- pn_get_buf (fbt)
0 -> lookuppnat (fbt)
0 -> lookuppnvp (fbt)
0 -> pn_fixslash (fbt)
0 <- pn_fixslash (fbt)
0 -> pn_getcomponent (fbt)
0 <- pn_getcomponent (fbt)
0 -> fop_lookup (fbt)
0 -> ufs_lookup (fbt)
0 -> dnlc_lookup (fbt)
0 <- dnlc_lookup (fbt)
0 -> ufs_iaccess (fbt)
0 -> crgetuid (fbt)
0 <- crgetuid (fbt)
0 <- ufs_iaccess (fbt)
0 <- ufs_lookup (fbt)
0 <- fop_lookup (fbt)
0 -> vn_mountedvfs (fbt)
0 <- vn_mountedvfs (fbt)
0 -> vn_rele (fbt)
0 <- vn_rele (fbt)
0 -> pn_getcomponent (fbt)
0 <- pn_getcomponent (fbt)
0 -> fop_lookup (fbt)
0 -> ufs_lookup (fbt)
0 -> dnlc_lookup (fbt)
0 <- dnlc_lookup (fbt)
0 -> ufs_iaccess (fbt)
0 -> crgetuid (fbt)
0 <- crgetuid (fbt)
0 <- ufs_iaccess (fbt)
0 <- ufs_lookup (fbt)
0 <- fop_lookup (fbt)
0 -> vn_mountedvfs (fbt)
0 <- vn_mountedvfs (fbt)
0 -> vn_rele (fbt)
0 <- vn_rele (fbt)
0 -> pn_getcomponent (fbt)
0 <- pn_getcomponent (fbt)
0 -> fop_lookup (fbt)
0 -> ufs_lookup (fbt)
0 -> dnlc_lookup (fbt)
0 <- dnlc_lookup (fbt)
0 -> ufs_iaccess (fbt)
0 -> crgetuid (fbt)
0 <- crgetuid (fbt)
0 <- ufs_iaccess (fbt)
0 <- ufs_lookup (fbt)
0 <- fop_lookup (fbt)
0 -> vn_mountedvfs (fbt)
0 <- vn_mountedvfs (fbt)
0 -> vn_rele (fbt)
0 <- vn_rele (fbt)
0 -> pn_setlast (fbt)
0 <- pn_setlast (fbt)
0 -> vn_path (fbt)
0 <- vn_path (fbt)
0 <- lookuppnvp (fbt)
0 <- lookuppnat (fbt)
0 <- lookupnameat (fbt)
0 -> fop_getattr (fbt)
0 -> ufs_getattr (fbt)
0 <- ufs_getattr (fbt)
0 <- fop_getattr (fbt)
0 -> fop_access (fbt)
0 -> ufs_access (fbt)
0 -> ufs_iaccess (fbt)
0 -> crgetuid (fbt)
0 <- crgetuid (fbt)
0 <- ufs_iaccess (fbt)
0 <- ufs_access (fbt)
0 <- fop_access (fbt)
0 -> fop_open (fbt)
0 -> ufs_open (fbt)
0 <- ufs_open (fbt)
0 -> vn_rele (fbt)
0 <- vn_rele (fbt)
0 <- fop_open (fbt)
0 <- vn_openat (fbt)
0 -> setf (fbt)
0 -> cv_broadcast (fbt)
0 <- cv_broadcast (fbt)
0 <- setf (fbt)
0 <- copen (fbt)
0 <- open (fbt)
0 -> syscall_mstate (fbt)
0 -> tsc_gethrtimeunscaled (fbt)
0 <- tsc_gethrtimeunscaled (fbt)
0 <- syscall_mstate (fbt)
0 <- __open (pid43)
0 <- _open (pid43)
0 -> _flockget (pid43)
0 -> mutex_lock (pid43)
0 -> mutex_lock_impl (pid43)
0 <- mutex_lock_impl (pid43)
0 <- mutex_lock (pid43)
0 <- _flockget (pid43)
0 -> mutex_unlock (pid43)
0 <- mutex_unlock (pid43)
0 <- _endopen (pid43)
0 <- fopen (pid43)
0 | defopen:return (pid43)
0 <- defopen
#
As the transition from pid43 to fbt in the above indicates, this script shows us flow control across the user/kernel boundary. (At the risk of sounding too much like Ron Popeil, the above output is the clearest display of code flow across a protection boundary from any tool -- research or otherwise.)
By this point in the demo, the audience is usually convinced that DTrace can instrument huge tracts of the system that were never before observable; the focus usually shifts from "can this do anything?" to "can this do anything for me?" The short answer is "yes" (of course), but a longer answer is merited. Unfortunately, this blog entry has already gone on long enough however; I'll save the longer answer for another day...
1 For whatever it's worth, a bug has been filed on this issue. Once it's fixed, I will sadly have to find some other sub-optimal software with which to demonstrate DTrace. The good news is that such software isn't terribly hard to find.