How a failed disk led to a kernel bug in the network stack.

Sometimes, you should just stay in bed. Today was one of those days.

Over my first coffee (of many) I checked in with the other freebsd.org admin folks to see if there had been any fires while I was asleep. I was reminded that one of our mail servers had a failed disk. It needed to be relocated to one of the other facilities anyway.

This led to pondering whether IP address reputation might be a problem. I was reminded that gmail absolutely hates the FreeBSD Foundation and marks everything sent directly from the servers in the freebsd.org cluster as spam, so it was clearly a cause for concern.

We decided to send some test messages to see if we could figure out other reasons why gmail might be rejecting our mail.

Whoops. I use KDE's kmail and it was not working. I still don't know why - something is very angry between akonadi and D-Bus and kmail. I could use akonadiconsole to read email, so the imap connection was working. Oh well, a rebuild of world/kernel/packages was due anyway. For now, it was time to ignore that and continue.

After using a fall-back webmail I noticed a harmless looking spamassassin flag: T_SPF_TEMPFAIL. I went to have a look at the SPF record, and host/dig/drill all said there was no SPF record for freebsdfoundation.org. Glen asserted that this was most definitely not the case.

After chasing that thread I discovered that our OpenDNSSEC instance had a corrupt signature for only the TXT records in its cache. The reason my debugging attempts showed there was no SPF record was because DNSSEC was correctly flagging the records as fake.

I purged the signature cache by changing the TXT records to include the text "this computer stuff will never work" and watched the zone transfers commence. I attempted to monitor the propagation of the change on our internal resolvers. I was treated to some unsettling pauses.

A little voice whispered to me "Uh oh". We had been seeing strange timeouts for days, including with kerberos. It was time to get to the bottom of it.

After some extended confusion with tcpdump on the firewall I was wondering why I was only seeing part of the packet flow. Eventually I realized that I was looking at it completely reversed - the firewall shouldn't be seeing ANY traffic as it was local within the same vlan. Why was the default gateway being used?
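The "why is the default gateway involved at all?" question can be illustrated with a minimal sketch. It assumes the vlan maps to a single /64 prefix, and the addresses are hypothetical ones from the 2001:db8::/32 documentation range, not our real hosts. The helper name `is_on_link` is made up for this example.

```python
import ipaddress


def is_on_link(local_interface: str, destination: str) -> bool:
    """Return True if destination falls inside the local interface's prefix.

    local_interface is an address with prefix length, e.g. "2001:db8:1::10/64".
    An on-link destination should be reached directly via neighbour discovery,
    so the default gateway (the firewall here) should never see that traffic.
    """
    interface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(destination) in interface.network


# Hypothetical resolver and client on the same vlan: should be delivered directly.
print(is_on_link("2001:db8:1::10/64", "2001:db8:1::20"))

# Hypothetical host in another prefix: this one legitimately goes via the gateway.
print(is_on_link("2001:db8:1::10/64", "2001:db8:2::20"))
```

If the first case is true and the packets still turn up at the gateway, the frame is being addressed to the wrong ethernet MAC, which is roughly what I was looking at.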

This led to the discovery that some IPv6 packets in the network stack were being sent to the wrong ethernet MAC address. I hadn't encountered this in previous cluster refreshes, so I did some rollbacks to see if it was environmental or a FreeBSD problem. Sure enough, it was a new problem that had been introduced within the last 8 weeks.

I tried switching a machine to FreeBSD-11. It worked fine there, so the problem must have been introduced after the recent 11 / 12 branch point.

Several of us scanned the change logs. Nothing made sense: no change stood out that had not also been merged into FreeBSD-11 (which worked). I started a straight-up binary search.
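For anyone who hasn't done one by hand, a binary search over revisions looks roughly like the sketch below. It is an illustration, not what I actually ran: the revision numbers are made up, and `is_bad` stands in for "build that revision, boot it, and see whether the packets go astray". It assumes the oldest revision in the list is known good and the newest is known bad.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[int], is_bad: Callable[[int], bool]) -> int:
    """Return the earliest revision for which is_bad() is True.

    Assumes revisions are ordered oldest to newest, revisions[0] is known
    good, revisions[-1] is known bad, and the breakage persists once it
    has been introduced.
    """
    good, bad = 0, len(revisions) - 1  # indexes of a known-good / known-bad revision
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(revisions[mid]):
            bad = mid   # breakage is at or before mid
        else:
            good = mid  # breakage is after mid
    return revisions[bad]


# Hypothetical range of 100 revisions where the (made up) culprit is 157.
revisions = list(range(100, 200))
culprit = first_bad(revisions, lambda rev: rev >= 157)
print(culprit)
```

Each test halves the window, so even a few hundred candidate revisions only need a handful of rebuild-and-reboot cycles.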

The window was narrowed substantially and one particular change looked plausible - except we had already excluded it, because the same change was in FreeBSD-11, which worked.

Sure enough, it was that specific change. Reverting just that revision made everything fine again. I filed a quick bugzilla ticket.

There must be another factor that affects this but I do not see it. It is now 14 hours after the "disk failed, now what?" and we're looking at bizarre kernel bugs. I can't think straight. The quick fix was to merely back out the change locally for now and worry about it later.

At that point I was referred to my earlier post, "Back out all the things! Oh wait..", and had to promise to pick it up again tomorrow, not merely later.

And since I was on the subject of the blog, the thought occurred "This might make an amusing war story" so I figured I should make some quick notes. Guess what my reward was?

MOZILLA_PKIX_ERROR_OCSP_RESPONSE_FOR_CERT_MISSING  

Aargh. I should definitely have stayed in bed.

Summary:

Disk failed ->  
IP reputation ->  
Mail configuration checking ->  
kmail not working ->  
dbus not working ->  
rebuild world + kernel ->  
rebuild packages ->  
notice spamassassin flag ->  
missing DNS SPF record ->  
DNSSEC invalid signature ->  
corrupt signature cache ->  
dig/drill/host timeouts ->  
how does tcpdump work anyway? ->  
wait, how does networking work? ->  
kernel bug ->  
works on FreeBSD-11, doesn't work on FreeBSD-12 ->  
kernel change review ->  
binary search after excluding changes also in FreeBSD-11 ->  
discover culprit change is in FreeBSD-11 anyway and works there ->  
Write a bugzilla ticket ->  
Contemplate writing a blog post ->  
MOZILLA_PKIX_ERROR_OCSP_RESPONSE_FOR_CERT_MISSING ->  
Resolve to quit computers and get a nice outdoor job.  


PS: If you read Sci-Fi, check out Michael Lucas' recent book: Hydrogen Sleets, because reasons.