DTrace in the zone

When the zones facility was originally developed many years ago, our focus on integrating it with respect to DTrace was firmly on the global zone: we wanted to be sure that we could use DTrace from the global zone to instrument the entire system, across all zones. This was (and remains) unique functionality – no other implementation of OS-based virtualization has the whole-system dynamic instrumentation to go with it. But as that focus implies, the ability to use DTrace in anything but the global zone was not a consideration. Fortunately, in 2006, Dan Price on the zones team took it upon himself to allow DTrace to be used in the non-global zone. This was a tricky body of work that had to carefully leverage the DTrace privilege model developed by Adam, and it allowed root users in the non-global zone to have access to the syscall provider and any user-level providers in their zone (both USDT-provided and pid-provided).

As great as this initial work was, when I started at Joyent nearly two years ago, DTrace support in the zone hadn’t really advanced beyond it – and there were still significant shortcomings. In particular, out of (reasonable) paranoia, nothing in the kernel could be instrumented or recorded by an enabling in the non-global zone. The problem is that many commonly used D variables – most glaringly, cpu, curlwpsinfo, curpsinfo, and fds[] – rely on access to kernel structures. Worse, because of our success in abstracting away the kernel with these variables, users didn’t even know that they were trying to record illegal kernel data – which meant that the error message when it failed was entirely unintuitive:

  [my-non-global-zone ~]# dtrace -n BEGIN'{trace(curpsinfo->pr_psargs)}'
  dtrace: description 'BEGIN' matched 1 probe
  dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid kernel access in action #1 at DIF offset 44

The upshot of this was, as one Joyent customer put it to me bluntly, “whenever I type in a script from Brendan’s book, it doesn’t work.” And just to make sure I got the message, the customer sharpened the point considerably: “I’m deployed on SmartOS because I want to be able to use DTrace, but you won’t let me have it!” At that point, I was pretty much weeping and begging for forgiveness – we clearly had to do better.

As I considered this problem, though, a somewhat obvious solution to at least a portion of the problem became clear. The most painful missing variables – curpsinfo and curlwpsinfo – are abstractions on the kernel’s proc_t and kthread_t structures, respectively. There are two important observations to be made about these structures. First, there is nothing about one’s own proc_t or kthread_t structures that one can use for privilege escalation: it’s either stuff you already can figure out about yourself via /proc, or kernel addresses that you can’t do anything with anyway. Second (and essentially), we know that these structures cannot disappear or otherwise change identity while in probe context: these structures are only freed after the process or thread that they represent has exited – and we know by virtue of our execution in probe context that we have not, in fact, exited. So we could solve a good portion of the problem just by allowing access to these structures for processes for which one has the dtrace_proc privilege. This change was straightforward (and the highlight of it for me personally was that it reminded me of some clever code that Adam had written that dynamically instruments DIF to restrict loads as needed). So this took care of curpsinfo, curlwpsinfo, and cpu – but it also brought into stark relief an even tougher problem: fds[].
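To make the shape of that check concrete, here is a minimal user-level sketch in C. It is not the illumos code: sketch_proc_t, sketch_thread_t, sketch_inrange() and sketch_canload() are hypothetical names, and the structures are tiny stand-ins for proc_t and kthread_t. The idea it illustrates is only that a load from a reduced-privilege enabling is permitted when the whole address range falls inside the current thread's own thread or process structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's proc_t and kthread_t. */
typedef struct sketch_proc {
    int pid;
    char psargs[80];
} sketch_proc_t;

typedef struct sketch_thread {
    int lwpid;
    sketch_proc_t *procp;
} sketch_thread_t;

/* True if [addr, addr + size) lies entirely within [base, base + len). */
static int
sketch_inrange(uintptr_t addr, size_t size, uintptr_t base, size_t len)
{
    /* Written to avoid overflow in addr + size. */
    return (addr >= base && size <= len && addr - base <= len - size);
}

/*
 * May a reduced-privilege enabling load `size` bytes at `addr`?
 * Only with the (modelled) dtrace_proc privilege, and only if the range
 * sits inside the current thread's own thread or process structure.
 */
static int
sketch_canload(uintptr_t addr, size_t size, const sketch_thread_t *curthread,
    int has_dtrace_proc)
{
    if (!has_dtrace_proc)
        return (0);

    if (sketch_inrange(addr, size, (uintptr_t)curthread,
        sizeof (*curthread)))
        return (1);

    if (sketch_inrange(addr, size, (uintptr_t)curthread->procp,
        sizeof (sketch_proc_t)))
        return (1);

    return (0);
}
```

In this sketch, a load of one's own psargs passes, while a load from another process's structure, or one that runs past the end of one's own, is refused.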

The use of fds[] in the non-global zone has long been a problem, and it’s a brutal one, as fds[] requires grovelling around in kernel data structures. For context, here is the definition of fds[] for illumos, as defined in io.d:

inline fileinfo_t fds[int fd] = xlate <fileinfo_t> (
    fd >= 0 && fd < curthread->t_procp->p_user.u_finfo.fi_nfiles ?
    curthread->t_procp->p_user.u_finfo.fi_list[fd].uf_file : NULL);

Clearly, one cannot simply allow such arbitrary kernel pointer chasing from the non-global zone without risk of privilege escalation. At dtrace.conf, I implied (mistakenly, as it turns out) that this problem depends on (or would be otherwise solved by) dynamic translators. As I thought about this problem after the conference, I realized that I was wrong: even if the logic to walk those structures were in-kernel (as it would be, for example, with the addition of a new subroutine), it did not change the fundamental problem that the structures that needed to be referenced could themselves disappear (and worse, change identity) during the execution of an enabling in probe context. As I considered the problem, I realized that I had been too hung up on making this arbitrary – and not focussed enough on the specifics of getting fds[] to work.

With the focus sharpening to fds[], I realized that some details of the kernel’s implementation could actually make this reasonable to implement. In particular, Jeff’s implementation of allocating slots in fi_list means that an fi_list never goes away (it is doubled but the old fi_list is not freed) and that fi_nfiles is never greater than the number of slots in the memory referred to by fi_list. (This technique – memory retiring – can be used to implement a highly concurrent table; the interested reader is pointed to both the ACM Queue article that Jeff and I wrote that describes it in detail and to the implementation itself.) So as long as one is only able to get the file_t for one’s own file descriptor, we could know that the array of file_t pointers would not itself be freed over the course of probe context. That brings us to the file_t itself: unlike the fi_list, the file_t can be freed while one is in probe context because one thread could be closing a file while another is in probe context referencing fds[]. Solving this problem required modifying the kernel, if only slightly: an added hook to the closef() path that DTrace can optionally use to issue a dtrace_sync(). (A dtrace_sync() issues a synchronous cross-call to all CPUs; because DTrace disables interrupts in probe context, this can be used as a barrier with respect to probe context.)
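For readers who have not seen memory retiring before, a minimal single-threaded sketch in C may help. This is not the illumos implementation: table_t, retired_t and the table_*() functions are hypothetical names, and the locking and memory barriers that a concurrent version would need are deliberately omitted. What it shows is just the property relied on above: growing the table doubles it and copies the slots, but the old array is parked on a retired list instead of being freed, so a reader still holding the old pointer is looking at valid memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* An old slot array that has been replaced but not freed. */
typedef struct retired {
    void **list;
    struct retired *next;
} retired_t;

typedef struct table {
    void **list;        /* current slot array */
    size_t nslots;      /* never exceeds the size of the array in `list` */
    retired_t *retired; /* old arrays, kept until the table is destroyed */
} table_t;

/* Initialize a table; this sketch assumes nslots >= 1. */
static int
table_init(table_t *t, size_t nslots)
{
    t->list = calloc(nslots, sizeof (void *));
    t->nslots = (t->list != NULL) ? nslots : 0;
    t->retired = NULL;
    return (t->list != NULL ? 0 : -1);
}

/* Double the table, retiring (not freeing) the old array. */
static int
table_grow(table_t *t)
{
    size_t nnew = t->nslots * 2;
    void **nlist = calloc(nnew, sizeof (void *));
    retired_t *r = malloc(sizeof (*r));

    if (nlist == NULL || r == NULL) {
        free(nlist);
        free(r);
        return (-1);
    }

    memcpy(nlist, t->list, t->nslots * sizeof (void *));

    /* Park the old array where stale readers can still safely see it. */
    r->list = t->list;
    r->next = t->retired;
    t->retired = r;

    /* Switch to the larger array before advertising the larger count. */
    t->list = nlist;
    t->nslots = nnew;
    return (0);
}

/* Only when the whole table goes away are the retired arrays freed. */
static void
table_destroy(table_t *t)
{
    while (t->retired != NULL) {
        retired_t *r = t->retired;

        t->retired = r->next;
        free(r->list);
        free(r);
    }
    free(t->list);
    t->list = NULL;
    t->nslots = 0;
}
```

The cost of the technique is memory: because each array is double the size of the last, the retired arrays together are smaller than the current one, which is what makes never freeing them tolerable.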

Adding a hook of this nature requires an understanding of the degree to which the underlying code path is performance-critical. That is, to contemplate adding this hook, we needed to ask: how hot is closef(), anyway? Historically in my career as a software engineer, this kind of question would be answered with a combination of ass scratching and hand waving. Post-DTrace, of course, we can answer those questions directly – but only if we have a machine that has a representative workload (which we only had to a very limited degree at Sun). But one of my favorite things about working at Joyent is that we don’t have to guess when we have a question like this: we can just fire up DTrace on every node in the public cloud for a couple of seconds and get the actual data from (many) actual production machines. And in this case, the data was at least a little bit surprising: we had (many) nodes that were calling closef() thousands of times per second. (The hottest node saw nearly 5,000 closef()’s per second over a ten second period – and there were many that saw on the order of 2,000 per second.) This is way too hot to always be executing a dtrace_sync() when DTrace is loaded (as it effectively always is on our systems), so we had to be sure that we only executed the dtrace_sync() in the presence of an enabling that actually used fds[] with reduced privileges (e.g., from a non-global zone).
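The shape of such a conditional hook can be sketched in a few lines of user-level C. Every name here (sketch_closef_hook, sketch_getf_enablings, sketch_sync() and friends) is hypothetical and the counter stands in for a real cross-call; the point is only that the close path pays for a pointer test in the common case and for the expensive barrier only while at least one relevant enabling exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hook consulted on every close; NULL unless a relevant enabling exists. */
static void (*sketch_closef_hook)(void) = NULL;

/* Number of reduced-privilege enablings that use the fd lookup. */
static unsigned int sketch_getf_enablings = 0;

/* Counts how often the (expensive) barrier would have run. */
static unsigned long sketch_syncs = 0;

/* Stand-in for dtrace_sync(): here it only counts invocations. */
static void
sketch_sync(void)
{
    sketch_syncs++;
}

/* The first relevant enabling installs the hook. */
static void
sketch_enabling_create(void)
{
    if (sketch_getf_enablings++ == 0)
        sketch_closef_hook = sketch_sync;
}

/* The last relevant enabling to go away removes the hook. */
static void
sketch_enabling_destroy(void)
{
    assert(sketch_getf_enablings > 0);
    if (--sketch_getf_enablings == 0)
        sketch_closef_hook = NULL;
}

/* Model of the close path: a pointer test unless the hook is set. */
static void
sketch_closef(void)
{
    void (*hook)(void) = sketch_closef_hook;

    if (hook != NULL)
        hook();

    /* ... the file would actually be released here ... */
}
```

With no enabling present, thousands of calls to sketch_closef() trigger no syncs at all; the barrier runs only between sketch_enabling_create() and the matching sketch_enabling_destroy().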

This meant that fds[] needed to be implemented in terms of a subroutine such that we could know when one of these is enabled. That is, we needed a D subroutine to translate from a file descriptor to a file_t pointer within the current context. Given this, the name of the subroutine was obvious: it had to be getf() – the Unix routine that does exactly this in the kernel, and has since Fourth Edition Unix, circa 1973. (Aside: one of the reasons I love illumos is because of this heritage: compare the block comment above the Seventh Edition implementation of getf() to that above getf() currently in illumos – yes, we still have comments written by Ken!) So the change to allow fds[] in the non-global zone ended up being more involved, but in the end, it wasn’t too bad – especially given the functionality that it unlocked.
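As a rough illustration of what such a lookup has to do, here is a user-level sketch in C that mirrors the bounds check in the io.d definition of fds[] shown earlier. The types and the function name (sketch_file_t, sketch_finfo_t, sketch_getf()) are hypothetical; this is neither the kernel's getf() nor the D subroutine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's file_t. */
typedef struct sketch_file {
    const char *path;
} sketch_file_t;

/* Hypothetical stand-in for the per-process open file table. */
typedef struct sketch_finfo {
    sketch_file_t **list; /* one slot per file descriptor */
    int nfiles;           /* number of slots in `list` */
} sketch_finfo_t;

/*
 * Translate a file descriptor into a file pointer for the given process,
 * returning NULL for descriptors that are out of range or not open.
 */
static sketch_file_t *
sketch_getf(const sketch_finfo_t *fip, int fd)
{
    if (fd < 0 || fd >= fip->nfiles)
        return (NULL);

    return (fip->list[fd]);
}
```

Because the caller can only name a descriptor in its own table, and out-of-range descriptors yield NULL, there is no arbitrary pointer for the caller to supply.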

With these two changes made, Brendan and I brainstormed about what remained out of reach for the non-global zone – and the only other thing that we could come up with as commonly wanted in the non-global zone was access to the sched and proc providers. Allowing use of these in-kernel providers would require allowing enabling of the probes but not firing them when outside the context of one’s own zone (similar to the syscall provider) and further explicitly forbidding getting anything about privileged context (specifically, register state). Fortunately, I had had to do exactly this to fix a somewhat nasty zones-related issue with the profile provider, and it seemed that it wouldn’t be too bad to extend this with these slightly new semantics. This proved to be the case, and the change to allow the sched, proc, vminfo and sysinfo providers in the non-global zone ended up being very modest.
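The firing rule described here is simple enough to sketch in C, with the caveat that the names (sketch_enabling_t, sketch_should_fire(), SKETCH_GLOBAL_ZONEID) are hypothetical and that the sketch covers only the zone filter, not the additional restriction on register state.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for this sketch: the global zone has ID 0. */
#define SKETCH_GLOBAL_ZONEID 0

/* An enabling remembers the zone of the consumer that created it. */
typedef struct sketch_enabling {
    int zoneid;
} sketch_enabling_t;

/*
 * The probe may be enabled from any zone, but it fires for a given
 * enabling only if that enabling came from the global zone or the
 * probe is firing in the context of the enabling's own zone.
 */
static int
sketch_should_fire(const sketch_enabling_t *e, int cur_zoneid)
{
    return (e->zoneid == SKETCH_GLOBAL_ZONEID || e->zoneid == cur_zoneid);
}
```

An enabling made in zone 3 therefore sees activity in zone 3 and nothing from zone 4 or from the global zone, while an enabling made in the global zone still sees everything.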

So with these three changes, I am relieved to report that DTrace is now completely usable in the non-global zone – and all without sacrificing the security model of zones! If you are a Joyent cloud customer, we will be rolling out a platform with this modification across the cloud (it necessitates a reboot, so don’t expect it before your next scheduled maintenance window); if you are a SmartOS user, look for this in our next SmartOS release; and if you are using another illumos-based distro (e.g., OpenIndiana or OmniOS) look for it in an upcoming release – we will be integrating these changes into illumos, so you can expect them to be in downstream distros soon. And here’s to DTrace in the zone!