We had an interesting issue come up the other day, one that ended up being a pretty nasty bug in DTrace. It merits a detailed explanation, but first, a cautionary note: we’re headed into rough country; in the immortal words of Sixth Edition, you are not expected to understand this.
The problem we noticed was this: when running node and instrumenting certain hot functions (memcpy:entry was noted in particular), after some period of activity, the node process would die on a strange segmentation fault. Looking at the core dumps, we were (usually) lost in some code path in memcpy(), with registers or a stack that were internally inconsistent. In debugging this, it was noted (with DTrace, natch) that the node process was under somewhat intense signal processing activity (we’ll get to why in a bit), and that the instrumented function activity and the signal activity, in particular, seemed to be interacting “shortly” before the fatal failure.
Before wading any deeper, it’s worth taking an aside to explain how we deal with signals in DTrace, a discussion that in turn requires some background on the way lossless user-level tracing works in DTrace. One of the important constraints that we placed upon ourselves (albeit one that we didn’t talk about too much) is that DTrace is lossless: if multiple threads on multiple CPUs hit the same point of instrumentation, no thread will see an uninstrumented code path. This may seem obvious, but it’s not the way traditional debuggers work: normally, a debugger stops a process (and all of its threads) when it hits a breakpoint, which it induces by modifying program text with a synchronous trap instruction (e.g., an “int 3” on x86). Then the program text is modified to contain the original instruction, the thread that hit the breakpoint (and only that thread) is single-stepped (often using hardware support) over the original instruction, and the original instruction is again replaced with a breakpoint instruction. For always-on tracing that cannot afford to stop the world, the window in which the original instruction is back in place is a window in which other threads can sail through uninstrumented. We felt this to be unacceptable for DTrace, and imposed a constraint that tracing always be lossless.
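To make the contrast concrete, here is a minimal sketch of that breakpoint dance, using Linux ptrace(2) purely for illustration (the function and its arguments are hypothetical, and error handling is omitted); note the window in which the breakpoint is disarmed:

    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <sys/wait.h>
    #include <stdint.h>

    /*
     * Step the stopped thread `child` over a breakpoint at `addr`,
     * given the original text word saved when the breakpoint was set.
     */
    void
    step_over_breakpoint(pid_t child, uintptr_t addr, long orig_word)
    {
            struct user_regs_struct regs;

            /* 1. Disarm: put the original instruction back. */
            ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)orig_word);

            /* 2. Rewind the PC (the trap left it just past the int 3). */
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            regs.rip = addr;
            ptrace(PTRACE_SETREGS, child, NULL, &regs);

            /* 3. Single-step this thread (and only this thread) over it. */
            ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
            waitpid(child, NULL, 0);

            /* 4. Re-arm: restore the breakpoint (int 3 is 0xcc). */
            long bp_word = (orig_word & ~0xffL) | 0xcc;
            ptrace(PTRACE_POKETEXT, child, (void *)addr, (void *)bp_word);
    }

Between steps 1 and 4, any other thread that runs through addr executes the original instruction unobserved; that window is precisely the lossiness that DTrace refuses to tolerate.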
For user-level instrumentation (that is, the pid provider), magic is required: we need to execute the instrumented instruction out-of-line. For an overview of the mechanism, see Adam’s block comment in fasttrap_isa.c. The summary is that we have a per-thread region of memory at user-level to which we copy both the instrumented instruction and a control transfer instruction that gets us back past the point of instrumentation. Then, when we hit the probe (and after we’ve done in-kernel probe processing), we point the instruction pointer into the per-thread region and return to user-level; back in user-land, we execute the instruction and transfer control back to the rest of the program.
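As a concrete (and hypothetical) illustration of the first half of such a region, here is a sketch of mine, not the actual fasttrap code, of how the displaced instruction and the control transfer might be laid down on x86-64; the authoritative layout is the one in the fasttrap_isa.c block comment:

    #include <stdint.h>
    #include <string.h>

    /*
     * Copy the instrumented instruction (`instr`, `isz` bytes) into the
     * per-thread scratch region and follow it with a jmp back to `npc`,
     * the address of the next instruction past the probe site. Assumes
     * the scratch region is within rel32 range (+/-2GB) of npc.
     */
    void
    build_trampoline(uint8_t *scratch, const uint8_t *instr, size_t isz,
        uintptr_t npc)
    {
            /* The displaced instruction, executed out-of-line. */
            memcpy(scratch, instr, isz);

            /* "jmp rel32" back past the point of instrumentation. */
            int32_t rel = (int32_t)(npc - ((uintptr_t)scratch + isz + 5));
            scratch[isz] = 0xe9;
            memcpy(&scratch[isz + 1], &rel, sizeof (rel));
    }

(Reality is considerably hairier: %rip-relative instructions must be rewritten in 64-bit mode, and the control transfer varies with the instruction being displaced, all of which is covered in Adam’s block comment.)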
Okay, so easy, right? Well, not so fast; enter signals: if we take a signal while we’re executing this little user-level trampoline, we could corrupt program state. At first, it may not be obvious how this is the case: if we take a signal before we execute the instruction in the trampoline, we will simply execute the signal handler (whatever it may be), return to the trampoline, finish that off, and then get back to the business of executing the program. Where’s the problem? The problem arises if the signal handler itself executes something instrumented with DTrace: in that case, we’ll hit another probe, and that probe will copy its instruction to the same trampoline. This is bad: it means that upon return from the signal, we will return to the trampoline and execute the wrong instruction. Only the one wrong instruction, of course, which, if you’re lucky, will induce an immediate crash. More likely, though, the program will execute tens, hundreds or even thousands of instructions before dying on some “impossible” problem. Sound nasty to debug? It is; next time you see Adam, show some respect.
So how do we solve this problem? For this, Adam made an important observation: the kernel is in charge of signal delivery, and it knows about all of this insanity, so can’t it prevent this from happening? Adam solved this by extending the trampoline: after the control transfer instruction comes the original instruction (again), followed by an explicit trap into the kernel. If the kernel detects that we’re about to deliver a signal while in this window (that is, with the program counter on the original instruction in the first half of the trampoline), it moves the program counter to the second half of the trampoline and returns to user-level with the signal otherwise unhandled. This leaves the signal pending, but allows the program to execute only the one instruction before bringing it back downtown on the explicit trap in the second half of the trampoline. There, we will see that we are in this situation, update the trapped program counter to be the instruction after the instrumented instruction (telling the program a small but necessary lie about where signal delivery actually occurred), and bounce back to user-level; the kernel will see the bits set on the thread structure indicating that a signal is pending, and the Right Thing will happen.
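In code, the deferral looks something like the following sketch, modeled loosely on dtrace_safe_defer_signal() (the types and function here are simplified stand-ins of mine, though the t_dtrace_* field names match the code quoted below):

    #include <stdint.h>

    typedef struct {
            uintptr_t t_dtrace_scrpc;  /* 1st half: instruction copy + jmp */
            uintptr_t t_dtrace_astpc;  /* 2nd half: instruction copy + trap */
            uintptr_t t_dtrace_npc;    /* instruction after the probe site */
    } thread_sketch_t;

    typedef struct {
            uintptr_t r_pc;            /* user PC at the point of delivery */
    } regs_sketch_t;

    /* Returns 1 to leave the signal pending, 0 to deliver it now. */
    int
    defer_signal_sketch(thread_sketch_t *t, regs_sketch_t *rp)
    {
            if (rp->r_pc == t->t_dtrace_scrpc) {
                    /*
                     * We're about to deliver a signal with the PC on the
                     * copied instruction in the first half: slide the PC
                     * to the second half, which executes that single
                     * instruction and then traps back into the kernel,
                     * where the still-pending signal can be delivered
                     * safely.
                     */
                    rp->r_pc = t->t_dtrace_astpc;
                    return (1);
            }
            return (0);
    }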
And it all works! Well, almost, which brings us to the bug. After some intensive DTrace/mdb sessions and a long Jabber conversation with Adam, it became clear that while signals were involved, the basic mechanism was actually functioning as designed. More experimenting (and more DTrace, of course) revealed that the problem was even nastier than we had been thinking: yes, we were taking a signal in the trampoline, but that logic was working as designed, with the program counter being moved correctly into the second half of the trampoline. The problem was that we were then taking a second asynchronous signal, this time in the second half of the trampoline, which brought us to this logic in dtrace_safe_defer_signal:
    /*
     * If we've executed the original instruction, but haven't performed
     * the jmp back to t->t_dtrace_npc or the clean up of any registers
     * used to emulate %rip-relative instructions in 64-bit mode, do that
     * here and take the signal right away. We detect this condition by
     * seeing if the program counter is the range [scrpc + isz, astpc).
     */
    if (t->t_dtrace_astpc - rp->r_pc <
        t->t_dtrace_astpc - t->t_dtrace_scrpc - isz) {
            ...
            rp->r_pc = t->t_dtrace_npc;
            t->t_dtrace_ft = 0;
            return (0);
    }
Now, the reason for this clause (the meat of which I’ve elided) is the subject of its own discussion, but it is not correct as written: it works (thanks to some implicit use of unsigned math) when the program counter is in the first half of the trampoline, but it fails if the program counter is on the instrumented instruction in the second half of the trampoline (denoted by t_dtrace_astpc). If we hit this case (first one signal on the first instruction of the trampoline, then another signal on the first instruction of the second half of the trampoline), we will redirect control to the instruction after the point of instrumentation (t_dtrace_npc) without having executed the instrumented instruction at all. This, if it needs to be said, is bad, and leads to exactly the kind of mysterious death that we were seeing.
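To see the failure concretely, here is a worked example with hypothetical addresses (an 8-byte slot per trampoline half, a 1-byte instrumented instruction): when the program counter sits exactly on t_dtrace_astpc, the left-hand side of the comparison is zero, so the clause fires even though the PC is outside the intended half-open range:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
            uintptr_t scrpc = 0x1000;  /* first half of the trampoline */
            uintptr_t astpc = 0x1008;  /* second half of the trampoline */
            uintptr_t isz = 1;         /* size of the copied instruction */
            uintptr_t pc = astpc;      /* where the second signal lands */

            /* The clause as written: prints 1 (0 < 7), wrongly matching. */
            printf("as written: %d\n",
                astpc - pc < astpc - scrpc - isz);

            /* An explicit test of [scrpc + isz, astpc): prints 0. */
            printf("intended:   %d\n",
                pc >= scrpc + isz && pc < astpc);
            return (0);
    }

An explicit range comparison of the sort in the second printf, matching the range the block comment describes, is one natural way to express the intended check.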
Fortunately, the fix is simple; here’s the patch. All of this leaves just one more mystery: why are we seeing all of these signals on node, anyway? They are all SIGPROF signals, and when I pinged Ryan about this yesterday, he was as surprised as I was that node seemed to be generating them when otherwise idle. A little digging on both of our parts revealed that the signals we were seeing are an artifact of some V8 magic: the use of SIGPROF is part of the feedback-driven optimization in the new Crankshaft compilation infrastructure for V8. So the signals are goodness, and the fact that DTrace can now correctly handle them in the presence of arbitrary instrumentation, that much more so!