Back out all the things! Oh wait..

It's one of those things that everybody knows. When something breaks, you revert recent changes (or bisect) until it works. You keep going until you solve the problem. The identified code is removed from upstream and everyone moves on. That's the way it works, right?

But wait, whatever happened to figuring out the actual problem? Reverting changes is a useful tool, but that's just it: it is a diagnostic tool to get some clues. You might have bought yourself some time, but if backing out change-X makes the problem go away, that doesn't necessarily mean that change-X was the actual problem. You don't have an actual diagnosis.

We had one of those today in the cluster. Every month we rebuild everything to make sure that we still can, and to make sure we find problems before you do.

It started innocently enough:

# /usr/sbin/ntpd
# ntpq -p
ntpq: connection refused  

And sure enough, a segfault was recorded on the console. A ktrace exercise showed that it crashed right after some mmap(2) calls related to malloc(3). After a check of the source control change logs it was very apparent that a major revision of malloc(3) had landed over the last few weeks. Could that be it? It was easy enough to test by restoring the old C library. Voila! Problem solved. It must be the malloc(3) revamp, right? Some of the other developers seemed sure of it.

Not so fast. The important question hadn't been answered. Why?

Something didn't feel right. I began to wonder:

  1. Why were we the first people to hit this in the three weeks it was live?
  2. Why did it crash on roughly 1/10th of our servers?
  3. Why did it crash differently on different servers?
  4. Why did it crash 100% of the time on one machine, about 50% of the time on another, and maybe 1% of the time on some others?
  5. When looking at the ktrace logs of the different machines and behaviors, what was that suspicious looking mlockall(2) doing right around there?

The last one was troubling and triggered the spidey-sense. Could mlockall(2) be related to this or was it just a coincidence? It seemed to be the difference between traces that worked and traces that didn't. But mlockall(2) should have absolutely no effect on the validity of memory mappings - it is a hint to the VM system that our memory should not be paged out, albeit a very strong hint.

So I disabled mlockall(2) from ntpd to gather more clues. Voila! It restored 100% reliability just like reverting the malloc(3) changes did. So much for that conclusive assignment of blame to the malloc(3) work. But we still were not finished: we still did not know "Why?". If there was one coincidence already, why not two? Or perhaps more?

Over the next few hours of chasing the "Why?" question, the group finally converged on an answer. The root cause was a bug in the kernel's page fault handler that dated all the way back to 1997. The only involvement that malloc(3) had in this was a change in the arena layout. The different memory access pattern made a race with mlockall(2) in another thread trigger a kernel bug. By a few microseconds. On a couple of machines.

If we hadn't tracked down the root cause there would have been accusations made against the malloc(3) work, wasted time and effort hunting a bug that wasn't there, hurt feelings, and ultimately some red faces.

Lessons learned?

  1. Backing out is a diagnostic tool, not a solution.
  2. Backing out could not have exposed this particular bug.
  3. Don't give up when you have the luxury of time; you will likely learn something in the process of figuring out "Why?"
  4. Computers are hard after all.

Docs: malloc(3) mlockall(2)

PS: It turns out other people have hit this periodically over the years but dismissed it as a networking problem or "Huh, that was odd.."

Update: The fix was committed here: