As is known but perhaps not widely reported, all three of us on Team DTrace are products of Brown University Computer Science. More specifically, we were all students in (and later TAs for) Brown’s operating systems course, CS169. This course has been taught by the same professor, Tom Doeppner, over its thirty-year lifetime, and has become something of a legend in Silicon Valley, having produced some of the top engineers at major companies like NetApp, SGI, Adobe, and VMware — not to mention tons of smaller companies. And at Sun, CS169 has cast a particularly long shadow, with seven CS169 alums (Adam, Dan, Dave, Eric, Matt, Mike, and me) having together played major roles in developing many of the revolutionary technologies in Solaris 10 (specifically, DTrace, ZFS, SMF, FMA, and Zones).
I mention the Brown connection because this past Thursday, Brown hosted a symposium to honor both the DTrace team in particular and the contributions of former CS169 undergraduate TAs more generally. We were each invited to give a presentation on a topic of our choosing, and seizing the opportunity for intellectual indulgence, I chose to reflect on a broad topic: the inculcation of systems thinking. My thoughts on this topic deserve their own lengthy blog entry, but this presentation will have to suffice for now — albeit stripped of the references to the Tupolev Tu-144, LBJ, Ray Kurzweil, the 737 rudder reversal and Ruby stack backtraces that peppered (or perhaps polluted?) the actual talk…
8 Responses
I look forward to the blog post, as I didn’t quite follow the deck. (I would love an example of a pathological system.) Then again, I couldn’t even get professor “Detner’s” name right, so maybe there is no hope for me.
BTW, I never much trusted Brown Grads. Bunch of hippie communists.
A pathological system is any system that is malfunctioning at some systemic level — so anything from a cancerous cell to an economic recession to an unhandled software exception represents a pathological system at some level. And as mentioned in the deck, while system pathologies can be fatal, they need not be fatal to be pathological. Indeed, non-fatal pathologies are often the more difficult to diagnose — it’s easier to perform an autopsy than to give a prognosis…
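To make that a bit more concrete, here is a minimal, hypothetical DTrace sketch (an illustration of my own, not something from the talk): a process that quietly retries failing system calls will never crash, but it is pathological all the same, and you can only catch it by observing the living system. Assuming a DTrace-capable machine, a one-liner along these lines counts system calls that return with errno set, keyed by executable, system call, and error:

    /*
     * Hypothetical illustration: surface a process that is quietly
     * failing without ever crashing.  Count system calls that return
     * with errno set, keyed by executable, system call, and error.
     */
    syscall:::return
    /errno != 0/
    {
            @[execname, probefunc, errno] = count();
    }

Run it with dtrace -s, let it soak for a while, and the aggregation printed when you interrupt it will often point straight at the process that is misbehaving without ever dying.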
Bryan,
The best author I’ve come across for exploring exactly these sorts of systems is Petroski. His main thesis is that we learn from failures rather than from successes.
Interestingly, we don’t seem to focus much on system failures in computer science and computer engineering classes.
One problem with the focus on failures is that we tend to pick case studies that are too new, and hence can be analyzed too closely with modern tools and techniques. As a result, we don’t learn the design and engineering lessons; we simply try to re-solve the problem.
A key to learning from failures, then, is sometimes to have the distance of history (and I’m paraphrasing Petroski here), so that we don’t get bogged down in the particulars of the failure itself and can focus instead on the methodology, assumptions, and climate that led to it.
A problem with applying this approach in the computer world is that we don’t have the distance of history yet for many of the failures.
Andy, agreed about Petroski, but I might differ on the necessity of distance: in Petroski’s domain (civil engineering), failures are exceedingly rare, and often take several years to completely understand. In our domain (software), failure is, to put it euphemistically, much more common, and a single failure can often be completely understood within minutes or hours. Indeed, a single bug that consumes (say) more than a week of a single engineer’s time remains quite rare (and nearly always makes for a great story when it does happen!). I very much agree with you that one wants to abstract away from the details of the solution to the larger issues around design and methodology — I would just contend that one does not need much distance from a software problem to be able to do just that.
Bryan,
One counter to that is that despite lots of engineering effort, good testing processes, etc., we still end up with certain bugs that defy easy detection. A nice example would be the recent ANI bug at Microsoft, and Michael Howard’s excellent writeup.
http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx
The interesting lesson here concerns complicated failures that are hard to detect.
I’ll concede that this isn’t necessarily an example of a pathological system; yet at the same time, even isolated bugs like that can be exceedingly difficult to detect despite people’s excellent efforts. And this isn’t even one that relies on weird hardware timings, clock skew, etc.
Bryan, I was researching some DTrace topics and came across this entry in your blog. Not only did it consume 10 minutes of my life (very satisfactorily, I might add), it also gave me some good ideas about how to teach DTrace to the uninitiated. I wholeheartedly agree that we can learn more from observing mistakes than we can from observing perfection. As an instructor who has been teaching IT for more than 20 years, I am also heartened by your admiration for Professor Doeppner. If he’s been inspiring people with his teaching for 30 years, then perhaps we older IT professionals still have something to offer to the youth of IT?
Andy, you’re exactly right: no amount of engineering effort will eliminate all pathology, and the remaining failures can be very difficult to understand — which is why studying failure is so critically important. When we honor pathology, we naturally develop systems in which we can better diagnose it. Thanks for the pointer to the ANI bug writeup — this is exactly what we collectively need much more of.
Jeff, glad you enjoyed it! And yes, I have great reverence for the wisdom that only time can give you — and therefore for the IT professionals who have been around long enough to acquire it. 😉 And if you know Tom, you know that one of his greatest strengths is his fascination with history — a subject that is (frankly) sorely neglected in our domain…