Using Let's Encrypt within FreeBSD.org

I decided to give Let's Encrypt certificates a shot on my personal web servers earlier this year after a disaster with StartSSL. I'd like to share what I've learned.

The biggest gotcha is that people tend to develop bad habits when they only have to deal with certificates once a year or so. The beginning part of the process is manual and the deployment of certificates somehow never quite gets automated, or things get left out.

That all changes with Let's Encrypt certificates. Instead of 1-5 year lifetime certificates the Let's Encrypt certificates are only valid for 90 days. Most people will be wanting to renew every 60-80 days. This forces the issue - you really need to automate and make it robust.

The Let's Encrypt folks provide tools to do this for you for the common cases. You run it on the actual machine, it manages the certificates and adjusts the server configuration files for you. Their goal is to provide a baseline shake-n-bake solution.

I was not willing to give that level of control to a third party tool for my own servers - and it was absolutely out of the question for the FreeBSD.org cluster.

The Let's Encrypt / acme ecosystem is developing a rich set of components. There are a lot of options if the default tool does not suit you.

I should probably mention that we do things on the FreeBSD.org cluster that many people would find a bit strange. The biggest problem that we have to deal with is that the traditional model of a firewall/bastion between "us" and "them" does not apply. We design for the assumption that hostile users are already on the "inside" of the network. The cluster is spread over 8 distinct sites with naked internet and no vpn between them. There is actually very little trust between the systems in this network - eg: ssh is for people only - no headless users can ssh. There are no passwords. Sudo can't be used. The command and control systems use signing. We don't trust anything by IPv4/IPv6 address because we have to assume MITM is a thing. And so on. In general, things are constructed to be trigger / polling / pull based.

One thing we very much try to do is slave configurations directly from source control. eg: nginx.conf on the various sites comes directly out of our admin repository. DNS zone files are the same. We aim for automation of commit-to-production.

The downside is that this makes automation and integration of Let's Encrypt clients interesting. If server configuration files can't be modified; and replicated web infrastructure is literally read-only (via jails/nullfs); and DNS zone files are static; and headless users can't ssh and therefore cannot do commits, how do you do the verification tokens in an automated fashion? Interesting, indeed.

My first encounter with the Let's Encrypt ecosystem was with the widely used third party client now known as dehydrated. This is a solid choice, but we ended up switching away from it at FreeBSD.org. There were a number of things I didn't like about it. I didn't like the way hooks worked at the time or the way it handled transient problems. My biggest complaint (admittedly rather petty of me) was the requirement to bring bash and its support footprint into the jails. We now use acme.sh instead.

acme.sh is nice and simple, works on straight up /bin/sh and had just the right hook mechanism that I could use for dns-01 validation.

The acme process is fairly simple at face value.

  • create account
  • verify your administrative control of domains by placing a random cookie on the http server or in dns.
  • make a signing request, receive a certificate.

We wanted to be able to use certificates on things like ldap and smtp servers. You can't do http file verification on those so we had to use dns validation of domains. Now we come back to the static zone file issue I hinted at above. Here's how I solved it for FreeBSD.org. While it is specific to our situation, folks may find it amusing or get some ideas. I'm going to try and keep the details on the lean side to avoid overload.

First, we have a jail that has nothing except acme.sh and a few support tools inside. (BTW: We love jails in the FreeBSD cluster. We use them everywhere. You should too!) There is a stub DNS api hook that looks loosely like this:

#Usage: dns_fbsd_add   _acme-challenge.www.domain.com   "XKrxpRBosdIKFzxW_CT3KLZNf6q0HG9i01zxXp5CPBs"
dns_fbsd_add() {  
  fulldomain="$1"
  txtvalue="$2"
  _info "Adding DNS ${fulldomain} ${txtvalue}"
  echo "${fulldomain}. 60 IN TXT \"${txtvalue}\"" > "/home/certbot/dnsextra/${fulldomain}"
  return $?
}

There is a corresponding removal counterpart. The hook script creates /home/certbot/dnsextra/_acme-challenge.foo.freebsd.org and simply waits for a few minutes.
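The removal side isn't shown here. A minimal sketch, assuming it only needs to delete the file that the add hook wrote, could look like the following. DNSEXTRA_DIR and the fallback _info are illustrative additions so the fragment runs on its own; acme.sh normally supplies _info itself.

```shell
# Sketch only: DNSEXTRA_DIR is a hypothetical override for the path used above.
DNSEXTRA_DIR="${DNSEXTRA_DIR:-/home/certbot/dnsextra}"

# acme.sh provides _info; this stand-in just lets the sketch run standalone.
command -v _info >/dev/null 2>&1 || _info() { echo "$@"; }

#Usage: dns_fbsd_rm   _acme-challenge.www.domain.com   "XKrxpRBosdIKFzxW_CT3KLZNf6q0HG9i01zxXp5CPBs"
dns_fbsd_rm() {
  fulldomain="$1"
  _info "Removing DNS ${fulldomain}"
  # -f: a token file that is already gone is not treated as an error
  rm -f "${DNSEXTRA_DIR}/${fulldomain}"
}
```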

That /home/certbot/dnsextra/ directory is nullfs mounted read-only into another jail. The cron task in the other jail that polls the subversion server for updates to the DNS zone files also checks this nullfs directory for changes.
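How the cron task notices changes isn't spelled out above. One simple way to do it, sketched here as an assumption rather than the actual script, is to checksum the directory and compare against the previous run. The function name and the state file path are made up for illustration.

```shell
# Sketch: exit 0 when the contents of a directory differ from the last run,
# exit 1 when nothing changed. The state file records the previous checksum.
dnsextra_changed() {
  dir="$1"
  state="$2"
  # Checksum the file names plus their contents (an empty directory is fine).
  new="$( { ls "$dir"; cat "$dir"/* 2>/dev/null || true; } | cksum )"
  old="$(cat "$state" 2>/dev/null || true)"
  if [ "$new" = "$old" ]; then
    return 1
  fi
  echo "$new" > "$state"
  return 0
}

# Example use from a polling cron job (paths are illustrative):
#   if dnsextra_changed /dnsextra /var/db/dnsextra.cksum; then make -C /zones; fi
```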

The additions are white-list filtered by record type and sanity checked. This mechanism can only insert TXT records. This isn't the actual script, but the gist is:

cat dnsextra/*.freebsd.org | grep TXT > extra/freebsd.org.txt  
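A stricter variant of that filter might match the whole record rather than just the word TXT. This is only a sketch of the idea: the function name is made up, and the pattern assumes the record layout written by the hook above and base64url-style tokens.

```shell
# Sketch: pass through only well-formed _acme-challenge TXT records.
# Expected shape: _acme-challenge.<labels>.freebsd.org. <ttl> IN TXT "<token>"
filter_txt() {
  grep -E '^_acme-challenge\.([a-z0-9-]+\.)*freebsd\.org\. [0-9]+ IN TXT "[A-Za-z0-9_-]+"$'
}

# Example use:
#   cat dnsextra/*.freebsd.org | filter_txt > extra/freebsd.org.txt
```

Anything that is not a single, complete TXT record for an _acme-challenge name is dropped, including lines with trailing junk after the closing quote.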

The Makefile that compiles the zone files has a fragment that looks something like this:

cat soa/freebsd.org primary/freebsd.org extra/freebsd.org.txt > unsigned/freebsd.org  

It is rude and crude, but the new unsigned/freebsd.org is signed and published via opendnssec/nsd.

The Let's Encrypt validator can see the _acme-challenge token within about 2 minutes. This is how we prove that we have administrative control of the domain. Once verified, the Let's Encrypt CA will issue a certificate.
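The hook described above just sleeps for a fixed time. An alternative, sketched here as an assumption and not what the cluster does, is to poll until the token is visible. lookup_txt and wait_for_txt are made-up names; the default lookup uses drill(1) from ldns, and asking the authoritative server directly would avoid resolver caching.

```shell
# Default lookup uses drill(1) from ldns; swap in dig(1) or similar as needed.
lookup_txt() {
  drill "$1" TXT
}

# wait_for_txt NAME TOKEN [TRIES] [DELAY_SECONDS]
# Exit 0 as soon as the quoted token shows up in the lookup output,
# exit 1 if it never appears within TRIES attempts.
wait_for_txt() {
  name="$1"
  want="$2"
  tries="${3:-30}"
  delay="${4:-10}"
  while [ "$tries" -gt 0 ]; do
    if lookup_txt "$name" 2>/dev/null | grep -qF "\"$want\""; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}
```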

We export the fullchain files into a publication location. There is another jail that can read the fullchain certificates via nullfs and they are published with our non-secrets update mechanism.

Since we are using DNSSEC, here is a good opportunity to maintain signed TLSA fingerprints. We have a map file in subversion that tells the builder which TLSA fingerprints come from what certificate files.

The catch with TLSA record updates is managing the update event horizon. You are supposed to have both fingerprints listed across the update cycle.

We use 'TLSA 3 1 1' records to avoid issues with propagation delays for now. TLSA 3 0 1 changes with every renewal, while 3 1 1 only changes when you generate a new private key.
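To see why, here is a rough sketch of how the two fingerprints can be derived with openssl(1). The function names are made up; the point is only that 3 1 1 hashes the public key while 3 0 1 hashes the entire certificate.

```shell
# TLSA "3 1 1": SHA-256 over the DER-encoded public key (SubjectPublicKeyInfo).
# Stays the same across renewals as long as the private key is reused.
tlsa_311() {
  openssl x509 -in "$1" -noout -pubkey |
    openssl pkey -pubin -outform DER |
    openssl dgst -sha256 -hex |
    awk '{print $NF}'
}

# TLSA "3 0 1": SHA-256 over the whole DER-encoded certificate.
# Changes with every newly issued certificate.
tlsa_301() {
  openssl x509 -in "$1" -outform DER |
    openssl dgst -sha256 -hex |
    awk '{print $NF}'
}
```

Given a fullchain file, openssl x509 reads the first certificate in it, which is normally the server certificate.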

In our dns Makefile, we have something like:

... foreach $cert, $domain, $port
  ldns-dane -c certfile/$cert $domain $port 3 1 1 > extra/_$port._tcp.$domain

This creates files like extra/_443._tcp.foo.freebsd.org using ports/dns/ldns.

Remember those concatenation script rules above? They can be made to conveniently insert TLSA records into the zone files as well.

While here, I also automated the SSHFP record signing and publication too. Another jail collects fingerprints from our phonehome mechanism and makes them available to the zone compiler, via nullfs again.

What about deployment? The fingerprints are published and the certificates magically flow to the back-end jails via our internal command and control management system.

Here is a major gotcha! If your nginx.conf has something like

ssl_certificate /etc/ssl/foo.freebsd.org.crt;  

and you do a 'service nginx reload', what happens?

If you said "it reloads the certificate", you would be dead wrong. The majority of TLS/SSL servers require a full restart to re-load the certificates if the filename is unchanged. I found out the hard way.

If you did something creative like giving the certificate filenames a timestamp in your distribution/deployment mechanism, that generally does cause a reload of the certificates when you reload the servers. In the FreeBSD.org cluster the files (and hence certificate pathnames) are static. For now, for us, it means restarts every 60 days.
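A quick way to catch this kind of surprise is to compare what is on disk with what the server actually presents. The following is a sketch using openssl(1); the function names, host and path are illustrative.

```shell
# SHA-256 fingerprint of a certificate file on disk.
cert_fp_file() {
  openssl x509 -in "$1" -noout -fingerprint -sha256
}

# SHA-256 fingerprint of the certificate a live server presents.
# $1 = host name (also sent as SNI), $2 = port (default 443)
cert_fp_live() {
  openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -fingerprint -sha256
}

# Example (host and path are illustrative):
#   [ "$(cert_fp_file /etc/ssl/foo.freebsd.org.crt)" = "$(cert_fp_live foo.freebsd.org)" ] \
#     || echo "server is still using the old certificate"
```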

We have restart triggers for postfix, nginx, nghttpx, ldap, and apache. They're fairly simple and look a bit like this (butchered for space):

#! /bin/sh
# postfix-reload
.....
certfiles=$(postconf -n | awk -F " = " '$1 ~ /(cert|key)_file/ {print $2}' | sort -u)  
.....
reload=false  
for f in $certfiles; do  
  if [ -f "$f" ]; then
    if [ /var/spool/postfix/pid/master.pid -ot "$f" ]; then
      reload=true
    fi
  fi
done  
if $reload; then  
  echo "postfix master.pid file older than certificates; restart required!"
  service postfix restart
fi  

Again, it's crude; you should add your own sanity checks and make sure you're not exposed to user-provided input. The script fragment above simply restarts postfix if the certificates or private keys are updated.
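The same idea carries over to the other servers. As a sketch only (not the actual cluster scripts), an nginx flavoured version could pull the certificate paths out of the config and reuse the pid-file comparison. The function names and the paths in the example are assumptions.

```shell
# Exit 0 if any of the given files is newer than the pid file.
# needs_restart PIDFILE FILE...
needs_restart() {
  pidfile="$1"
  shift
  for f in "$@"; do
    if [ -f "$f" ] && [ "$pidfile" -ot "$f" ]; then
      return 0
    fi
  done
  return 1
}

# Print the ssl_certificate / ssl_certificate_key paths found in a config file.
# Only handles the simple "directive path;" form on one line.
nginx_cert_files() {
  awk '$1 ~ /^ssl_certificate(_key)?$/ { sub(/;$/, "", $2); print $2 }' "$1" | sort -u
}

# Example (paths are illustrative):
#   if needs_restart /var/run/nginx.pid $(nginx_cert_files /usr/local/etc/nginx/nginx.conf); then
#     service nginx restart
#   fi
```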

What about private keys? In our current system, we don't have a server-specific secrets distribution mechanism. For now, we have acme.sh reuse the private key and we manually transport the private key to the destination server as a one-time bootstrap. dehydrated regenerates private keys each time by default but it can be configured to not do this. If you are running everything on the same server this sort of thing generally isn't an issue.

acme.sh is a bit obscure to configure. A fragment of our account.conf file:

We tweak some internal settings via this file; I'm sure I will be punished for this. You are supposed to set things like the dns api on the command line, but I read the script and subverted it.

Here's an example of issuing a certificate for the first time at home.

% acme.sh --issue -d example.wemm.org 
[Mon Nov 21 19:19:40 PST 2016] Creating domain key
[Mon Nov 21 19:19:40 PST 2016] Single domain='example.wemm.org'
[Mon Nov 21 19:19:40 PST 2016] Getting domain auth token for each domain
[Mon Nov 21 19:19:40 PST 2016] Getting webroot for domain='example.wemm.org'
[Mon Nov 21 19:19:40 PST 2016] _w='dns_myapi'
[Mon Nov 21 19:19:40 PST 2016] Getting new-authz for domain='example.wemm.org'
[Mon Nov 21 19:19:40 PST 2016] Try new-authz for the 0 time.
[Mon Nov 21 19:19:45 PST 2016] The new-authz request is ok.
[Mon Nov 21 19:19:45 PST 2016] Found domain api file: /home/certbot/.acme.sh/dnsapi/dns_myapi.sh
[Mon Nov 21 19:19:45 PST 2016] Adding DNS _acme-challenge.example.wemm.org xKyfkr_Vt1jySkWEasJE_mI7IEkKQ-CnJIrVTqcldVA
[Mon Nov 21 19:19:46 PST 2016] Sleep 20 seconds for the txt records to take effect
[Mon Nov 21 19:20:07 PST 2016] Verifying:example.wemm.org
[Mon Nov 21 19:20:13 PST 2016] Success
[Mon Nov 21 19:20:13 PST 2016] Removing DNS _acme-challenge.example.wemm.org
[Mon Nov 21 19:20:14 PST 2016] Verify finished, start to sign.
[Mon Nov 21 19:20:15 PST 2016] Cert success.
-----BEGIN CERTIFICATE-----
[.. deleted ..]
7wOW/oW9h7U=  
-----END CERTIFICATE-----
[Mon Nov 21 19:20:15 PST 2016] Your cert is in  /home/certbot/.acme-certs/example.wemm.org/example.wemm.org.cer 
[Mon Nov 21 19:20:15 PST 2016] Your cert key is in  /home/certbot/.acme-certs/example.wemm.org/example.wemm.org.key 
[Mon Nov 21 19:20:15 PST 2016] The intermediate CA cert is in  /home/certbot/.acme-certs/example.wemm.org/ca.cer 
[Mon Nov 21 19:20:15 PST 2016] And the full chain certs is there:  /home/certbot/.acme-certs/example.wemm.org/fullchain.cer 

If you are counting, that is 15 seconds plus DNS propagation time (20 seconds at home - I use a different dns push mechanism).

You can see the authentication, domain validation and certificate phases.

This is a forced renewal:

% acme.sh --renew -d photos.wemm.org --force
[Mon Nov 21 19:12:34 PST 2016] Renew: 'photos.wemm.org'
[Mon Nov 21 19:12:34 PST 2016] Single domain='photos.wemm.org'
[Mon Nov 21 19:12:34 PST 2016] Getting domain auth token for each domain
[Mon Nov 21 19:12:34 PST 2016] Getting webroot for domain='photos.wemm.org'
[Mon Nov 21 19:12:34 PST 2016] _w='dns_myapi'
[Mon Nov 21 19:12:34 PST 2016] Getting new-authz for domain='photos.wemm.org'
[Mon Nov 21 19:12:34 PST 2016] Try new-authz for the 0 time.
[Mon Nov 21 19:12:39 PST 2016] The new-authz request is ok.
[Mon Nov 21 19:12:40 PST 2016] photos.wemm.org is already verified, skip.
[Mon Nov 21 19:12:40 PST 2016] photos.wemm.org is already verified, skip dns-01.
[Mon Nov 21 19:12:40 PST 2016] Verify finished, start to sign.
[Mon Nov 21 19:12:40 PST 2016] Cert success.
-----BEGIN CERTIFICATE-----
[.. deleted ..]
FxxgGA/lG6utV/n7GNRVsbiZ/6JFP/mGWg==  
-----END CERTIFICATE-----
[Mon Nov 21 19:12:40 PST 2016] Your cert is in  /home/certbot/.acme-certs/photos.wemm.org/photos.wemm.org.cer 
[Mon Nov 21 19:12:40 PST 2016] Your cert key is in  /home/certbot/.acme-certs/photos.wemm.org/photos.wemm.org.key 
[Mon Nov 21 19:12:41 PST 2016] The intermediate CA cert is in  /home/certbot/.acme-certs/photos.wemm.org/ca.cer 
[Mon Nov 21 19:12:41 PST 2016] And the full chain certs is there:  /home/certbot/.acme-certs/photos.wemm.org/fullchain.cer 
% 

This example took 7 seconds to renew the certificate. It doesn't need to do the dns validation again.

Once you have things set up, your cron entry is typically something like:

27 11 * * * acme.sh --cron >/dev/null

It generates error text if there's a problem; otherwise no news is good news.

If you're inserting TLSA records like I do, you can check that the signed fingerprint matches the certificate:

% ldns-dane verify www.freebsd.org 443
8.8.178.110 dane-validated successfully  
2001:1900:2254:206a::50:0 dane-validated successfully  

Some caveats that I'd like to mention:

  • Don't enable ocsp-must-staple if you are serving with nginx. This sets a flag in the certificate that you are promising to always have stapling enabled in your server. The catch is nginx is broken - it doesn't actually do ocsp stapling until after a few queries. Firefox will (correctly) complain bitterly about this after every server restart.

  • I cannot stress this enough: MAKE SURE that you do a full hands-off test of renewals and see that the full end-to-end flow works. Do a forced early renewal like 'acme.sh --renew -d foo.freebsd.org --force' and watch. You don't want to find out in two months at 4am that your reload script takes your site down. The point of the article is to talk about strategies for automating it.

  • Watch your rate limits! They are not kidding about this. Test with staging certificates first. https://letsencrypt.org/docs/rate-limits/ confused me at first. While it says "renewals are exempted" they still count against the weekly certificate limit. You can bulk renew 30 certificates just fine, but you can't do anything else until 7 days later.

  • acme.sh moves fast and is under very active development. I'm cautious about when to update.

  • This stuff isn't hard. There are always solutions, even to self-inflicted problems. It is quite easy if you're using something with proper DNS API access. If not then think of it as an opportunity to exercise some creativity. The http-based domain validation process is easiest of all if that works for you.

  • If you have a monitoring system, use it. Really. Don't set yourself up for surprises.

  • Use rotating backups of the key / cert material, particularly with something like acme.sh / dehydrated. They allow you to directly access the work areas and it is not hard to "oops" something if you are not paying attention.
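On the monitoring point, a check as small as the following sketch is enough to feed most monitoring systems. It relies on the -checkend option of openssl x509; the function name, path and threshold are illustrative. With renewals happening somewhere around day 60 of 90, a certificate with less than, say, 20 days left suggests the automation has stalled.

```shell
# cert_expires_within FILE DAYS
# Exit 0 if the certificate in FILE expires within DAYS days (or cannot be
# read at all), exit 1 if it is still valid beyond that point.
cert_expires_within() {
  if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1; then
    return 1   # still valid DAYS days from now
  fi
  return 0
}

# Example (path and threshold are illustrative):
#   if cert_expires_within /etc/ssl/foo.freebsd.org.crt 20; then
#     echo "foo.freebsd.org certificate expires in under 20 days"
#   fi
```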

Some final thoughts.

Why free certificates over commercial ones? Well, why not? Certificates are a lowest common denominator thing. We've only used Domain Validated certificates for FreeBSD.org, and provided that the signing chain is widely trusted, there is no practical difference to a browser between a free or commercial Domain Validated certificate these days. The 90 day lifetime is solvable with automation. https://slashdot.org uses Let's Encrypt certificates. It is quite ready for prime time.

Secondly.. I've had brain surgery just a few weeks ago. I set this up at home one week after surgery, and for the FreeBSD.org cluster exactly two weeks after. If people give me hell for crappy scripting, I have an excuse ready to go. :-)
