Site migration fun

As many people in the FreeBSD community know, we're in the process of moving core parts of the FreeBSD.org cluster from one site to another.

As part of the process we're taking the opportunity to pay off some of the tech debt and clean things up. In general, we're trying to re-cast things as portable components that are safe enough to move around on the internet/cloud/whatever. Most of this is easy, but there are some really gnarly problems to be solved.

This evening I hit a tricky problem in another skunkworks project. I figured I'd just move one of the easy components - it was supposed to be a distraction while my mind churned in the background on the first problem. There aren't many components in our cluster that are easier to move than the Kerberos bits.

This turned into more of an adventure than I expected. The timeline:

  • /var/heimdal/* database copied.
  • /etc/krb5.conf updated.
  • /etc/krb5.keytab - pre-shared secrets primed for the encrypted node-to-node transfers.
  • crontab and rc.conf updated.
  • Test, ensure everything is in sync.

Easy, right? So far, so good. Time to make it live:

  • old jail stopped
  • new jail started
  • dns records updated
  • pf firewall rules updated
  • wait for dns to be signed and propagate
  • and waited...
  • re-flushed caches and waited some more...

Then the dreaded "Oops, something went wrong!" alert arrived:

ods-signer sign freebsd.org  
Unable to connect to engine: connect() failed: No such file or directory  
*** Error code 1

Uh oh. Not good. After a manual restart attempt:

Oct 13 05:36:17 ns0 ods-enforcerd: ERROR: database version number incompatible  
with software; require 4, found 3. Please run the migration scripts  

Damn it, this was self-inflicted - I broke the golden rule by making two changes at once. I should know better by now, but I had to deal with this RIGHT NOW.

A voyage of discovery commenced. The first puzzle was: what migration scripts?! The other residents in my home got to hear some newly invented profanity. I eventually found a MIGRATION document, but it was as clear as mud. It appeared that I was going to have to extract a file from the distribution tarball, but fortunately that did not turn out to be the case. After some guesswork I figured out what "the database" referred to and ran some commands that didn't report errors.

# sqlite3 /usr/local/var/opendnssec/kasp.db < /usr/local/share/opendnssec/migrate_1_4_8.sqlite3
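For illustration only, a more defensive variant of that one-liner would copy the database aside before touching it, so a failed migration can be rolled back. This is a Python sketch, not what was actually run: the function name `apply_migration` and the `.bak` suffix are made up for the example, and it assumes the migration file is plain SQL that SQLite can execute as a script.

```python
import shutil
import sqlite3


def apply_migration(db_path, script_path):
    """Apply a SQL migration script to a SQLite file, keeping a backup.

    Hypothetical helper: copies db_path to db_path + ".bak" first and
    restores that copy if the script raises an SQLite error.
    Returns the path of the backup file.
    """
    backup_path = str(db_path) + ".bak"
    shutil.copy2(db_path, backup_path)  # keep the pre-migration state

    with open(script_path, encoding="utf-8") as handle:
        script = handle.read()

    try:
        conn = sqlite3.connect(str(db_path))
        try:
            conn.executescript(script)
            conn.commit()
        finally:
            conn.close()
    except sqlite3.Error:
        # Part of the script may already have been applied, so put the
        # untouched copy back before reporting the failure.
        shutil.copy2(backup_path, db_path)
        raise
    return backup_path
```

With the paths from the command above, the call would look like `apply_migration("/usr/local/var/opendnssec/kasp.db", "/usr/local/share/opendnssec/migrate_1_4_8.sqlite3")`, run with the daemons stopped.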

The server finally started up.

Except, now:

Oct 13 06:17:14 ns0 ods-signerd: [backup] bad ixfr journal: first SOA wrong serial (was 2015101305, expected 2015101300)  
Oct 13 06:17:14 ns0 ods-signerd: [zone] corrupted journal file zone XXXX.in-addr.arpa, skipping (General error)  

I'm still not sure what I was supposed to do, but I forced a re-sign of all zones and it stopped yelling at me. I have encountered this before and I am still mystified as to whether action is required. The SOA numbers are correct. It looks like it recovered, but it didn't say. I really don't like not knowing.

At the end of this I am left shaking my head. DNS and DNSSEC already worry and confuse people and this sort of thing does not help. Between BIND and error messages like this, it is no wonder so many are afraid of it.

When something like DNSSEC is involved, it pays to be crystal clear so there are no misunderstandings:

  • With upgrade instructions, give examples too. Remember that the person who encountered the message is staring at a broken system and they may not have encountered your migration process before.
  • Better yet, if there's a logical thing to do (e.g. apply trivial migration patches automatically) then just do it - particularly when you initialized the database and schema yourself.
  • If you throw a scary error message, be clear about whether action is required or not, and whether the software recovered on its own.
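The second point can be sketched in a few lines. This is a hypothetical Python example, not how OpenDNSSEC works: it assumes the schema version is kept in SQLite's `PRAGMA user_version`, and the `MIGRATIONS` table, the `zones` table and the function name are all invented for illustration. Instead of stopping at "require 4, found 3", the software applies the missing steps itself and reports which ones it ran.

```python
import sqlite3

# Hypothetical upgrade steps: MIGRATIONS[n] upgrades schema version n to n + 1.
MIGRATIONS = {
    1: "ALTER TABLE zones ADD COLUMN policy TEXT;",
    2: "ALTER TABLE zones ADD COLUMN signed_at TEXT;",
    3: "CREATE INDEX zones_name ON zones (name);",
}
REQUIRED_VERSION = 4


def upgrade_schema(conn, required=REQUIRED_VERSION):
    """Bring the database up to `required`, one version step at a time.

    Returns the list of versions that were upgraded from, so the caller
    can log exactly what happened.
    """
    applied = []
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    if current > required:
        raise RuntimeError(
            f"database version {current} is newer than software version {required}"
        )
    while current < required:
        if current not in MIGRATIONS:
            # Nothing sensible to do automatically: say so clearly.
            raise RuntimeError(
                f"no migration from version {current}; manual action required"
            )
        conn.executescript(MIGRATIONS[current])
        applied.append(current)
        current += 1
        # PRAGMA values cannot be bound as parameters; current is an int.
        conn.execute(f"PRAGMA user_version = {current:d}")
        conn.commit()
    return applied
```

A start-up path could call `upgrade_schema(conn)` and log the returned list, which also covers the third point: the operator sees what was done and that no further action is needed.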

Make no mistake, I do like OpenDNSSEC. Compared to what we used on FreeBSD.org before (dnssec-tools with BIND) it is a vast improvement. I'm just glad it was me who encountered it rather than one of the other admins - they are already gun-shy over this.

After all the DNS drama was solved and propagated, the actual migration of Kerberos and the replication system Just Worked(TM) - as expected. That part was as easy as it was supposed to have been. At least something went according to plan tonight.
