The Observation Deck

Category: Solaris

In February 1996, I came out to Sun Microsystems to interview for a job knowing only two things: that I wanted to do operating systems kernel development — and that I didn’t particularly want to work for Sun. I was right on the first count, but knew I was wrong on the second just moments into my first conversation with Jeff. He was emphatic that I should join him in forging the future, sharing both my enthusiasm for what was possible and my disdain for the broken, busted and boogered-up. Fourteen years later, I don’t for a moment regret my decision to join Jeff and Sun: we fostered an environment where the OS was viewed not as a regrettable drag on progress, but rather as a nexus of innovation — incubating technologies that today make a real difference in people’s lives.

In 2006, itching to try something new, Mike and I talked the company into taking the risk of allowing several of us to start Fishworks. That Sun supported our endeavor so enthusiastically was the company at its finest: empowering engineers to tackle hard problems, and inspiring them to bring innovative solutions to market. And with the budding success of the 7000 Series, I would like to believe that we made good on the company’s faith in us — and more generally on its belief in innovation as differentiator.

Now the time has come for me to venture again into something new — but this time it is to be beyond the company’s walls. This is obviously with mixed emotion; while I am excited about the future, it is very difficult for me personally to leave a company in which I have had such close relationships with so many. One of Sun’s greatest strengths was that we technologists were never discouraged from interacting directly and candidly with our customers and users, and many of our most important innovations came from these relationships. This symbiosis was critically important at several junctures of my own career, and I owe many of you a profound debt of gratitude — both for your counsel over the years, and for your willingness to bet your own business and livelihood on the technologies that I helped develop. You, like us, are innovators who love nothing more than great technology, and your steadfast faith in us means more to me than I can express; thank you.

As for my virtual address, it too is changing. This post will be my last at blogs.sun.com; in the future, you can find my blog at its new (permanent) home: http://dtrace.org/blogs/bmc (where comments on this entry will be open). As for e-mail, you can find me at the first letter of my first name concatenated with my last name at acm.org.

Thank you again for everything; take care — and stay in touch!

For as long as I’ve been in computing, the subject of concurrency has always induced a kind of thinking man’s hysteria. When I was coming up, the name of the apocalypse was symmetric multiprocessing — and its arrival was to be the Day of Reckoning for software. There seemed to be no end of doomsayers, even among those who putatively had the best understanding of concurrency. (Of note was a famous software engineer who — despite substantial experience in SMP systems at several different computer companies — confidently asserted to me in 1995 that it was “simply impossible” for an SMP kernel to “ever” scale beyond 8 CPUs. Needless to say, several of his past employers have since proved him wrong…)

There also seemed to be no end of concurrency hucksters and shysters, each eager to peddle their own quack cure for the miasma. Of these, the one that stuck in my craw was the two-level scheduling model, whereby many user-level threads are multiplexed on fewer kernel-level (schedulable) entities. (To paraphrase what has been oft said of little-known computer architectures, you haven’t heard of it for a reason.) The rationale for the model — that it allowed for cheaper synchronization and lightweight thread creation — seemed to me at the time to be long on assertions and short on data. So working with my undergraduate advisor, I developed a project to explore this model both quantitatively and dynamically, work that I undertook in the first half of my senior year. And early on in that work, it became clear that — in part due to intractable attributes of the model — the two-level thread scheduling model was delivering deeply suboptimal performance…

Several months after starting the investigation, I came to interview for a job at Sun with Jeff, and he (naturally) asked me to describe my undergraduate work. I wanted to be careful here: Sun was the major proponent of the two-level model, and while I felt that I had the hard data to assert that the model was essentially garbage, I also didn’t want to make a potential employer unnecessarily upset. So I stepped gingerly: “As you may know,” I began, “the two-level threading model is very… intricate.” “Intricate?!” Jeff exclaimed, “I’d say it’s completely busted!” (That may well have been the moment that I decided to come work with Jeff and for Sun: the fact that an engineer could speak so honestly spoke volumes for both the engineer and the company. And despite Sun’s faults, this engineering integrity remains at Sun’s core to this day — and remains a draw to so many of us who have stayed here through the ups and downs.) With that, the dam had burst: Jeff and I proceeded to gush about how flawed we each thought the model to be — and how dogmatic its presentation. So paradoxically, I ended up getting a job at Sun in part by telling them that their technology was unsound!

Back at school, I completed my thesis. Like much undergraduate work, it’s terribly crude in retrospect — but I stand behind its fundamental conclusion that the unintended consequences of the two-level scheduling model make it essentially impossible to achieve optimal performance. Upon arriving at Sun, I developed an early proof-of-concept of the (much simpler) single-level model. Roger Faulkner did the significant work of productizing this as an alternative threading model in Solaris 8 — and he eliminated the two-level scheduling model entirely in Solaris 9, thus ending that ill-begotten experiment somewhere shy of its tenth birthday. (Roger gave me the honor of approving his request to integrate this work, an honor that I accepted with gusto.)
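
To make the burden of the two-level model concrete, here is a minimal sketch, mine rather than code from the thesis or from libthread: with many user-level threads multiplexed onto fewer LWPs, the programmer had to hint at an appropriate concurrency level via pthread_setconcurrency(3C) or risk runnable threads waiting for a kernel-schedulable entity. pthread_setconcurrency() is a real (if now largely vestigial) interface; everything else below (work(), NTHREADS) is purely illustrative. Under the single-level model, every pthread is an LWP and the hint is effectively a no-op.

    /*
     * Illustrative only: the kind of concurrency hint that a two-level
     * threading implementation effectively demanded of the programmer.
     */
    #include <pthread.h>

    #define NTHREADS        16

    static void *
    work(void *arg)
    {
            /* ... CPU-bound work ... */
            return (arg);
    }

    int
    main(void)
    {
            pthread_t tids[NTHREADS];
            int i;

            /*
             * Under a two-level implementation, ask for roughly NTHREADS
             * kernel-schedulable entities to back our user-level threads;
             * a single-level implementation is free to ignore the hint.
             */
            (void) pthread_setconcurrency(NTHREADS);

            for (i = 0; i < NTHREADS; i++)
                    (void) pthread_create(&tids[i], NULL, work, NULL);

            for (i = 0; i < NTHREADS; i++)
                    (void) pthread_join(tids[i], NULL);

            return (0);
    }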

So why this meandering walk through a regrettable misadventure in the history of software systems? Because over a decade later, concurrency is still being used to instill panic in the uninformed. This time, it is chip-level multiprocessing (CMP) instead of SMP that promises to be the End of Days — and the shysters have taken a new guise in the form of transactional memory. The proponents of this new magic tonic are in some ways darker than their forebears: it is no longer enough to warn of Judgement Day — they must also conjure up notions of Original Sin to motivate their perverted salvation. “The heart of the problem is, perhaps, that no one really knows how to organize and maintain large systems that rely on locking,” admonished Nir Shavit recently in CACM. (Which gives rise to the natural follow-up question: is the Solaris kernel not large, does it not rely on locking, or do we not know how to organize and maintain it? Or is it that we do not exist at all?) Shavit continues: “Locks are not modular and do not compose, and the association between locks and data is established mostly by convention.” Again, no data, no qualifiers, no study, no rationale, no evidence of experience trying to develop such systems — just a naked assertion used as a prop for a complicated and dubious solution. Are there elements of truth in Shavit’s claims? Of course: one can write sloppy, lock-based programs that become a galactic, unmaintainable mess. But does it mean that such monstrosities are inevitable? No, of course not.

So fine, the problem statement is (deeply) flawed. Does that mean that the solution is invalid? Not necessarily — but experience has taught me to be wary of crooked problem statements. And in this case (perhaps not surprisingly) I take umbrage at the solution as well. Even if one assumes that writing a transaction is conceptually easier than acquiring a lock, and even if one further assumes that transaction-based pathologies like livelock are easier on the brain than lock-based pathologies like deadlock, there remains a fatal flaw with transactional memory: much system software can never be in a transaction because it does not merely operate on memory. That is, system software frequently takes action outside of its own memory, requesting services from software or hardware operating on a disjoint memory (the operating system kernel, an I/O device, a hypervisor, firmware, another process — or any of these on a remote machine). In much system software, the in-memory state that corresponds to these services is protected by a lock — and the manipulation of such state will never be representable in a transaction. So for me at least, transactional memory is an unacceptable solution to a non-problem.
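
To make the objection concrete, consider a minimal sketch, my own construction rather than code from any particular system, of the pattern described above: a lock protects in-memory state that must stay consistent with an action taken beyond our memory, in this case a pwrite(2) to a log device. A memory transaction could roll back the offset on abort, but it has no way to un-issue the write. The names (log_append, log_lock, log_offset) are mine; only the mutex and pwrite() interfaces are real.

    /*
     * Lock-protected state coupled to an action beyond our own memory.
     * The in-memory offset and the on-device log must advance together;
     * the mutex protects both.  The pwrite() is a side effect that cannot
     * be undone on abort, so this critical section is not expressible as
     * a memory transaction.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t log_offset;     /* mirrors the on-device append point */

    void
    log_append(int fd, const void *buf, size_t len)
    {
            (void) pthread_mutex_lock(&log_lock);

            /* Issue the I/O and advance our view of the log together. */
            (void) pwrite(fd, buf, len, (off_t)log_offset);
            log_offset += len;

            (void) pthread_mutex_unlock(&log_lock);
    }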

As it turns out, I am not alone in my skepticism. When we on the Editorial Advisory Board of ACM Queue sought to put together an issue on concurrency, the consensus was twofold: to find someone who could provide what we felt was much-needed dissent on TM (and in particular on its most egregious outgrowth, software transactional memory), and to have someone speak from experience on the rise of CMP and what it would mean for practitioners.

For the first article, we were lucky enough to find Calin Cascaval and colleagues, who ended up writing a must-read article on STM in November’s CACM. Their conclusions are unavoidable: STM is a dog. (Or as Cascaval et al. more delicately put it: “Based on our results, we believe that the road for STM is quite challenging.”) Their work is quantitative and analytical and (best of all, in my opinion) the authors never lose sight of the problem that transactional memory was meant to solve: to make parallel programming easier. This is important, because while many of the leaks in the TM vessel can ultimately be patched, the patches themselves add layer upon layer of complexity. Cascaval et al. conclude:

And because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.

And while their language is tighter (and the subject of their work a weightier and more active research topic), the conclusions of Cascaval et al. are eerily similar to my final verdict on the two-level scheduling model, over a decade ago:

The dominating trait of the [two-level] scheduling model is its complexity. Unfortunately, virtually all of its complexity is exported to the programmer. The net result is that programmers must have a complete understanding of the model and the inner workings of its implementation in order to be able to successfully tap its strengths and avoid its pitfalls.

So TM advocates: if Roger Faulkner knocks on your software’s door bearing a scythe, you would be well-advised to not let him in…

For the second article, we brainstormed potential authors — but as we dug up nothing but dry holes, I found myself coming to an inescapable conclusion: Jeff and I should write this, if nothing else as a professional service to prevent the latest concurrency hysteria from reaching epidemic proportions. The resulting article appears in full in the September issue of Queue, and in substantially excerpted form in the November issue of CACM. Writing the article was a gratifying experience, and gave us the opportunity to write down much of what we learned the hard way in the 1990s. In particular, it was cathartic to explore the history of concurrency. Having been at concurrency’s epicenter for nearly a decade, I felt that the rise of CMP had recently been misrepresented as a failure of hardware creativity — and it was vindicating to read CMP’s true origins in the original DEC Piranha paper: that given concurrent databases and operating systems, implementing multiple cores on the die was simply the best way to deliver OLTP performance. That is, it was the success of concurrent software — and not a failure of imagination on the part of hardware designers — that gave rise to the early CMP implementations. Hopefully practitioners will enjoy reading the article as much as we enjoyed writing it — and here’s hoping that we live to see a day when concurrency doesn’t attract so many schemers and dreamers!

It’s hard to believe, but DTrace is five years old today: it was on September 3, 2003 that DTrace integrated into Solaris. DTrace was a project that extended all three of us to our absolute limit as software engineers — and the 24 hours before integration was then (and remains now) the most harrowing of my career. As it will hopefully remain my most stressful experience as an engineer, the story of that final day merits a retelling…

Our project had been running for nearly two years, but it was not until mid-morning on September 2nd — the day before we were slated to integrate — that we discovered that the DTrace prototype failed to boot on some very old hardware (the UltraSPARC-I, the oldest hardware still supported at that time). Now, “failed to boot” can mean a bunch of different things, but this was about as awful as it gets: a hard hang after the banner message. That is, booting mysteriously stopped making progress soon after control transferred to the kernel — and one could not break in with the kernel debugger. This is an awful failure mode because with no debugger and no fatal error, one has no place to start other than to start adding print statements — or to start ripping out the code that is the difference between the working system and the busted one. This was a terrifying position to be in less than 24 hours before integration! Strangely, it was only the non-DEBUG variant that failed to boot: the DEBUG version laden with assertions worked fine. Our only lucky break was that we were able to find two machines that exhibited the problem, enabling us to bifurcate our efforts: I started ripping out DTrace-specific code in one workspace, while Mike started frenetically adding print statements in another…

Meanwhile, while we were scrambling to save our project, Eric was having his first day at Sun. My office door was closed, and with our integration pending and me making frequent (and rapid) trips back and forth to the lab, the message to my coworkers was clear: stay the hell back. Eric was blissfully unaware of these implicit signals, however, and he cheerfully poked his head in my office to say hello (Eric had worked the previous summer in our group as an intern). I can’t remember exactly what I said to Eric when he opened my office door, but suffice it to say that the implicit signals were replaced with a very explicit one — and I remain grateful to this day that Eric didn’t quit on the spot…

Back on our problem, Mike — by process of elimination — had made the key breakthrough: it wasn’t actually an instruction set architecture (ISA) issue, but rather it seemed to be a host bus adapter (HBA) issue. This was an incredibly important discovery: while we had a bevy of architectural changes that could conceivably be invalid on an ancient CPU, we had no such HBA-specific changes — this was more likely to be something marring the surface of our work rather than cracking its foundation. Mike further observed that running a DEBUG variant of these ancient HBA drivers (esp and fas) would boot on an otherwise non-DEBUG kernel. At that, I remembered that we actually did have some cosmetic changes to these drivers, and on carefully reviewing the diffs, we found a deadly problem: in folding some old tracing code under a DEBUG-only #define, a critical line (the one that actually initiates the I/O) became compiled in only when DEBUG was defined. We hadn’t seen this until now because these drivers were only used on ancient machines — machines on which we had never tested non-DEBUG. We fixed the problem, and all of our machines booted DEBUG and non-DEBUG — and we felt like we were breathing again for the first time in the more than six hours that we had been working on the problem. (Here is the mail that I sent out explaining the problem.)
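
For those who have never been bitten by this particular species of bug, here is a hypothetical, simplified reconstruction; the function names (trace_cmd, start_io, issue_cmd) are stand-ins, not the actual esp or fas code. The intent was to compile out only the old tracing, but the line that actually initiates the operation was swept under the DEBUG-only #ifdef along with it, leaving the non-DEBUG build to silently do nothing.

    /*
     * A hypothetical reconstruction of the bug class described above:
     * the tracing was meant to be DEBUG-only, but the call that starts
     * the I/O was accidentally folded under the same #ifdef.
     */
    #include <stdio.h>

    static void
    trace_cmd(int cmd)
    {
            (void) fprintf(stderr, "TRACE: issuing cmd %d\n", cmd);
    }

    static void
    start_io(int cmd)
    {
            (void) printf("I/O initiated for cmd %d\n", cmd);
    }

    static void
    issue_cmd(int cmd)
    {
    #ifdef DEBUG
            trace_cmd(cmd);
            start_io(cmd);          /* bug: compiled in only under DEBUG */
    #endif
    }

    int
    main(void)
    {
            issue_cmd(42);          /* a non-DEBUG build silently does nothing */
            return (0);
    }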

To celebrate DTrace’s birthday beyond just recounting the terror of its integration, I wanted to make a couple of documents public that we have not previously shared:

  • The primordial presentation on DTrace, developed in late 1999. Some of the core ideas are present here (in particular, production instrumentation and zero disabled probe effect), but we hadn’t yet figured out some very basic notions — like that we needed our own language.
  • Our first real internal presentation on DTrace, presented March 12, 2002 as a Kernel Technical Discussion. Here the thinking is much better fleshed out around kernel-level instrumentation — and a prototype existed and was demonstrated. But a key direction for the technology — the ability to instrument user-level generally and in semantically relevant ways in particular — was still to come when Adam joined the team shortly after this presentation. (A video of this presentation also exists; in the unlikely event that anyone wants to actually relive three hours of largely outmoded thinking, I’ll find a way to make it available.)
  • The e-mail we sent out after integration, September 3, 2003 — five years ago today.

We said it then, and it’s even truer today: it’s been quite a ride. Happy 5th Birthday, DTrace — and thanks again to everyone in the DTrace community for making it what it has become!

In general, I don’t believe in drawing attention to bugs in the software of others: any significant body of software is likely to have bugs, and I think one can too easily draw overly broad inferences by looking at software through the lens of its defects (a pathology that I have previously discussed at some length). However — and as you might imagine from the preamble — I’m about to make an exception to that gentlemanly rule…

I, along with zillions of others, read the breathless hype about the new would-be Google slayer, cuil. When a new search engine pops up, my egotistical reflex is to first search for “dtrace”, and the results of searching for “dtrace” on cuil were very, um, interesting. The search results themselves were fine; more creative were the images that cuil decided to associate with them. If you look at that screenshot, you will be able to find an image of a quilt, a strip mall, what could pass for a program from an August Wilson play, and — strangest of all — a sign that reads “Welcome to Palisades Interstate Parkway”. I can’t say for certain that I’ve never travelled on the Palisades Interstate Parkway, but with all deference to that 38.25-mile stretch of tarmac, I do believe that I can say that it played no role in DTrace — or DTrace in it. Indeed, I can say with absolute confidence that searching for “palisades interstate parkway dtrace” will, in short order, yield only this blog entry — provided, that is, that one doesn’t perform said search on cuil… 😉

As I have discussed before, I strongly believe that to understand systems, you must understand their pathologies — systems are most instructive when they fail. Unfortunately, we in computing systems do not have a strong history of studying pathology: despite the fact that failure in our domain can be every bit as expensive (if not more so) than in traditional engineering domains, our failures do not (usually) involve loss of life or physical property and there is thus little public demand for us to study them — and a tremendous industrial bias for us to forget them as much and as quickly as possible. The result is that our many failures go largely unstudied — and the rich veins of wisdom that these failures generate live on only in oral tradition passed down by the perps (occasionally) and the victims (more often).

A counterexample to this — and one of my favorite systems papers of all time — is Robert Colwell‘s brilliant Performance Effects of Architectural Complexity in the Intel 432. This paper, which dissects the abysmal performance of Intel’s infamous 432, practically drips with wisdom, and is just as relevant today as it was when the paper was originally published nearly twenty years ago.

For those who have never heard of the Intel 432, it was a microprocessor conceived of in the mid-1970s to be the dawn of a new era in computing, incorporating many of the latest notions of the day. But despite its lofty ambitions, the 432 was an unmitigated disaster both from an engineering perspective (the performance was absolutely atrocious) and from a commercial perspective (it did not sell — a fact presumably not unrelated to its terrible performance). To add insult to injury, the 432 became a sort of punching bag for researchers, becoming, as Colwell described, “the favorite target for whatever point a researcher wanted to make.”

But as Colwell et al. reveal, the truth behind the 432 is a little more complicated than trendy ideas gone awry; the microprocessor suffered not only from untested ideas, but also from terrible execution. For example, one of the core ideas of the 432 was that it be a capability-based system, implemented with a rich hardware-based object model. This model had many ramifications for the hardware, but it also introduced a dangerous dependency on software: the hardware was implicitly dependent on system software (namely, the compiler) for efficient management of protected object contexts (“environments” in 432 parlance). As it happened, the needed compiler work was not done, and the Ada compiler as delivered was pessimal: every function was implemented in its own environment, meaning that every function was in its own context, and that every function call was therefore a context switch! As Colwell explains, this software failing was the greatest single inhibitor to performance, costing some 25-35 percent on the benchmarks that he examined.

If the story ended there, the tale of the 432 would be plenty instructive — but the story takes another series of interesting twists: because the object model consumed a bunch of chip real estate (and presumably a proportional amount of brain power and department budget), other (more traditional) microprocessor features were either pruned or eliminated. The mortally wounded features included a data cache (!), an instruction cache (!!) and registers (!!!). Yes, you read correctly: this machine had no data cache, no instruction cache and no registers — it was exclusively memory-memory. And if that weren’t enough to assure awful performance: despite having 200 instructions (and about a zillion addressing modes), the 432 had no notion of immediate values other than 0 or 1. Stunningly, Intel designers believed that 0 and 1 “would cover nearly all the need for constants”, a conclusion that Colwell (generously) describes as “almost certainly in error.” The upshot of these decisions is that you have more code (because you have no immediates) accessing more memory (because you have no registers) that is dog-slow (because you have no data cache) that itself is not cached (because you have no instruction cache). Yee haw!

Colwell’s work builds to a crescendo as it methodically takes apart each of these architectural issues — and then attempts to model what the microprocessor would look like were it properly implemented. The conclusion he comes to is that the object model — long thought to be the 432’s singular flaw — was only one part of a more complicated picture, and that its performance was “dominated, in large part, by artifacts and not by concepts.” If there’s one imperfection with Colwell’s work, it’s that he doesn’t realize how convincingly he’s made the case that these artifacts were induced by a rigid and foolish adherence to the concepts.

So what is the relevance of Colwell’s paper now, 20 years later? One of the principal problems that Colwell describes is the disconnect between innovation at the hardware and software levels. This disconnect continues to be a theme, and can be seen in current controversies in networking (TOE or no?), in virtualization (just how much microprocessor support do we want/need — and at what price?), and (most clearly, in my opinion) in hardware transactional memory. Indeed, like an apparition from beyond the grave, the Intel 432 story should serve as a chilling warning to those working on transactional memory today: like the 432 object model, hardware transactional memory requires both novel microprocessor architecture and significant new system software. And like the 432 object model, hardware transactional memory has been touted more for its putative programmer productivity than for its potential performance gains. This is not to say that hardware transactional memory is not an appropriate direction for a microprocessor, just that its advocates should not so stubbornly adhere to their novelty that they lose sight of the larger system. To me, that is the lesson of the Intel 432 — and thanks to Colwell’s work, that lesson is available to all who wish to learn it.

The interest in DTrace on Linux is heating up again — this time in an inferno on the Linux 2008 Kernel Summit discussion list. Under discussion is SystemTap, the Linux-born DTrace-knockoff, with people like Ted Ts’o explaining why they find SystemTap generally unusable (“Do you really expect system administrators to use this tool?”) and in stark contrast to DTrace (“it just works”).

While the comparison is clearly flattering, I find it a bit disappointing that no one in the discussion seems to realize that DTrace “just works” not merely by its implementation, but also by its design. Over and over again, we made architectural and technical design decisions that would yield an instrumentation framework that would be not just safe, powerful and flexible, but also usable. The subtle bit here is that many of those decisions were not at the surface of the system (where the discussion on the Linux list seems to be currently mired), but in its guts. To phrase it more concretely, innovations like CTF, DOF and provider-specified stability may seem like mind-numbing, arcane implementation detail (and okay, they probably are that too), but they are the foundation upon which the usability of DTrace is built. If you don’t solve the problems that they solve, you won’t have a system anywhere near as usable as DTrace.

So does SystemTap appreciate either the importance of these problems or the scope of their solutions? Almost certainly not — for if they did, they would come to the same conclusion that technologists at Apple, QNX, and the FreeBSD project have come to: the only way to have a system at parity with DTrace is to port DTrace.

Fortunately for Linux users, there are some in the community who have made this realization. In particular, Paul Fox has a nascent port of DTrace to Linux. Paul still has a long way to go (and I’m sure he could use whatever help Linux folks are willing to offer) but it’s impossible to believe that Paul isn’t on a shorter and more realistic path than SystemTap to achieving safe, powerful, flexible — and usable! — dynamic Linux instrumentation. Good luck to you Paul; we continue to be available to help where we can — and may the Linux community realize the value of your work sooner rather than later!
