Demo Perils
One of the downsides of being an operating systems developer is that the demos of the technology that you develop often suck. (“Look, it boots! And hey, we can even run programs and it doesn’t crash!”) So it’s been a pleasant change to develop DTrace, a technology that packs a jaw-dropping demo. In demonstrating DTrace for customers around the world, I have had the distinct (and rare) pleasure of impressing the most technically adept (and often jaded) audiences. My typical demonstration is on my Solaris x86 laptop, where I use DTrace to instrument the running system – exploring with the audience the peculiarities that exist even on an idle laptop. (This usually involves discovering and understanding the unnecessary work being done by acroread, dhcpagent, sendmail, etc.) This ad hoc demo shows DTrace as it’s meant to be used: dynamically answering questions that themselves were formed on-the-fly.
And when I demonstrate DTrace, I always do so on the absolute latest Solaris 10 build. Our mantra in Solaris Kernel Development is “FCS Quality All the Time” – we believe that the product should always be ready to be run in production. And if we’re going to tell a customer that it’s ready to be run in production, we damn well better run it in production ourselves. This has the added advantage that we tend to run into bugs before our customers do, allowing us to ship a final product that is that much more solid. Over the past year, I have given hundreds of DTrace demonstrations on the latest bits in front of customers, and before last week, they had always gone off without a hitch…[1]
Last week, I had the opportunity to give a DTrace demonstration for a highly technical – and highly influential – audience at a Fortune 100 company. When I demonstrate DTrace, I typically do a couple of invocations on the command line before things become sufficiently complicated to merit writing a DTrace script. And it was when I went to run the first such script (a script that explored the activity of xclock) that it happened:
# dtrace -s ./xclock.d
Segmentation Fault (core dumped)
#
If you’ve never had it, there’s no feeling quite like having a demo blow up on you: it’s as if you peed your pants, failed an exam and were punched in the gut – all at the same horrifying instant. It’s a feeling that every software developer should have exactly once in their lives: that unique rush of shock, and then humiliation and then despair, followed by the adrenal surge of a fight-or-flight reaction. In the time it takes a single process to dump core, you go from an (over)confident technologist to a frightened woodland creature, transfixed by the light of an oncoming freight train. For the woodland creature, at least it all ends mercifully quickly; the creature is spared the suffering of trying to explain away its foolishness. The hapless technologist, on the other hand, is left with several options:
- Pretend that you didn’t write the software: “Boy, will you get a load of those fancy-pants software engineers? Overpriced underworked morons, every last one!”
- Explain that this is demo software and isn’t expected to work: “Well, that’s why we haven’t shipped it yet! I mean, what fool would run this stuff anyway? Other than me, that is.”
- Make light of it: “Hey, knock knock! Who’s there? Not my software, that’s for sure! Wocka wocka wocka!”
- Suck it up: “That’s a serious problem. If you can excuse me for a second, let me get a handle on what we’ve got here that we can demo.”
I always aim for this last option, but on the rare occasion that this has happened to me (and this is – honest – probably the worst that a customer-facing demo has gone for me) I usually end up with some combination of the last three, often with plenty of stuttering, some mild swearing (“Damn! Damn!”) and profuse sweating.
In my particular case, the worst part was not knowing the exact pathology of the bug that I had just run into. Was there something basic that was broken or toxic about my machine? Would all scripts that I tried to run dump core? And if this was broken, what else was broken? Would I panic the machine or crash a target app if I continued? (Much more serious problems, both.) In an effort to get a handle on it, I did a quick pstack on the core file:
0804718f ???????? (8046604, 2)
d137c839 dt_instr_size (82d051a, 8067320, 223, d1380fe2) + 59
d137c0c2 dt_pid_create_return_probe (81651b8, 8067320, 8046af0, 8047170, 80472d
d137370d dt_pid_per_sym (80472ac, 8047170, d087b02c) + 15b
d13739ae dt_pid_sym_filt (80472ac, 8047170, d087b02c, 804715c) + 7c
d13152ca Psymbol_iter_com (81651b8, ffffffff, 8069060, 1, 407, 1) + 1e0
d13153ae Psymbol_iter_by_addr (81651b8, 8069060, 1, 407, d1373932, 80472ac) + 1
d1373b81 dt_pid_per_mod (80472ac, 82cf600, 8069060) + 191
d1373d56 dt_pid_mod_filt (80472ac, 82cf600, 8069060) + a3
d1314fe4 Pobject_iter (81651b8, d1373cb3, 80472ac) + 4f
d13740b4 dt_pid_create_probes (82cafa0, 8067320) + 344
d1353af8 dt_setcontext (8067320, 82cafa0) + 42
d13537d4 dt_compile_one_clause (8067320, 82be430, 82cdae0) + 32
d1353a9c dt_compile_clause (8067320, 82be430) + 26
d1354d66 dt_compile (8067320, 16a, 3, 0, 80, 1) + 3d9
d1355263 dtrace_program_strcompile (8067320, 8047ec2, 3, 80, 1, 8066848) + 23
080526ef ???????? (8066e48)
0805370e main (3, 8047df8, 8047e08) + 8fc
0805177a ???????? (3, 8047eb8, 8047ebf, 8047ec2, 0, 8047edf)
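For anyone who hasn't stared at many of these: pstack prints the innermost frame first, so the call chain reads from the bottom up. As a minimal sketch (in Python, and not something I had on hand during the demo), the named frames can be pulled out of output like this and printed in call order. The names `FRAME_RE`, `call_chain` and `SAMPLE` are hypothetical, and the sample is just a few frames copied from the stack above.

```python
import re

# Matches pstack-style frames: "<hex address> <symbol> (<args...".
# The closing paren and "+ offset" are not required, since long lines
# may be truncated by the terminal.
FRAME_RE = re.compile(r"^\s*([0-9a-f]+)\s+(\S+)\s+\(")


def call_chain(pstack_text):
    """Return named functions ordered from outermost caller to innermost frame.

    pstack lists the innermost frame first, so the collected names are
    reversed. Frames whose symbol could not be resolved (printed as a run
    of question marks) are skipped.
    """
    names = []
    for line in pstack_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue
        symbol = match.group(2)
        if set(symbol) == {"?"}:
            continue
        names.append(symbol)
    return list(reversed(names))


# A few frames copied from the core file's stack, innermost first.
SAMPLE = """\
0804718f ???????? (8046604, 2)
d137c839 dt_instr_size (82d051a, 8067320, 223, d1380fe2) + 59
d137c0c2 dt_pid_create_return_probe (81651b8, 8067320, 8046af0, 8047170, 80472d
d137370d dt_pid_per_sym (80472ac, 8047170, d087b02c) + 15b
0805370e main (3, 8047df8, 8047e08) + 8fc
"""

if __name__ == "__main__":
    print(" -> ".join(call_chain(SAMPLE)))
```

On the sample frames, the chain ends in `dt_pid_create_return_probe` calling `dt_instr_size`, which is where the unresolved top frame took over.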
The dtrace command was dying in the code that analyzes a target binary as part of creating pid provider probes. There was at least a chance that this problem was localized to something specific about the xclock program text – it was worth trying a similar script on a different process. Fortunately, I was able to stave off total panic long enough to write such a script and – even better – this one worked. The problem did indeed seem to be localized to something specific in xclock. And thanks to my coreadm settings, the core file from the seg faulting dtrace had been stashed away for later analysis; the best thing I could do at that point was drive on with the rest of the demo.
And this is what I did. The rest of the demo went well, and the audience was ultimately impressed with the technology. And while I never quite regained my stride (in part because my mind was racing about which change to DTrace could have introduced the problem), I was at least sufficiently effective – we achieved the goals of the meeting.[2] On the plane back home, I root-caused the problem and developed a fix. The next day, I integrated the fix into Solaris – and I don’t think I’ve ever been so relieved to put latest bits on my laptop!
In the end, having the demo blow up certainly wasn’t a pleasant experience – but I wouldn’t change my decision to demo on the latest bits. Not only did we discover a serious bug, we discovered the hole in our test suite that prevented us from finding the bug before it integrated. So who am I to get upset about a little personal humiliation if the upshot is a better product? ;)
[1] This is a slight exaggeration. I had actually run into DTrace bugs in front of customers, but they were always sufficiently small that only a trained eye would realize that something was amiss – things like slightly incorrect error messages.

[2] The primary goal of such a demo is often to get the customer sufficiently excited about Solaris 10 to download Solaris Express (usually for x86) and start playing around with the technology themselves. We are nearly always successful in this – and I have even had a few customers start downloading Solaris Express before the end of the meeting!