Happy 5th Birthday, DTrace!
It’s hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers – and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling…
Our project had been running for nearly two years, but it was not until mid-morning on September 2nd – the day before we were slated to integrate – that it was discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, “failed to boot” can meet a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel – and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements – or start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I starting ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another…
Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can’t remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one – and I remain grateful to this day that Eric didn’t quit on the spot…
Back on our problem, Mike – through process of elimination – had made the key breakthrough: it wasn’t actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes – this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined. We hadn’t seen this until now because these drivers were only used on ancient machines – machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG – and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)
To celebrate DTrace’s birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:
-
The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn’t yet figured out some very basic notions – like that we needed our own language.
-
Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation – and a prototype existed and was demonstrated. But a key direction for the technology – the ability to instrument user-level generally and in semantically relevant ways in particular – was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I’ll find a way to make it available.)
-
The e-mail we sent out after integration, September 3, 2003 – five years ago today.
We said it then, and it’s even truer today: it’s been quite a ride. Happy 5th Birthday, DTrace – and thanks again to everyone in the DTrace community for making it what it has become!