Certificate expiration problem and postmortem

By Bjarni Rúnar 2011-12-28, 22:35

Update: The problem has been resolved. See below for details.

There is currently a problem with the PageKite service related to expired SSL certificates. This appears to prevent some people from making the initial connection to a front-end.

We are working to fix this as quickly as possible, but in the meantime it is possible to work around the problem by disabling encryption:

$ pagekite --nossl ...

We will update this post when we have more news.

Updated at 23:00: Thanks to StartCom's extremely fast turn-around time, and our well-worn automated deployment scripts, everything appears to be back in order now. We'll be monitoring the system for any anomalies for the next couple of hours, just in case. Please accept our apologies for any inconvenience.

Detailed postmortem

Yesterday at 15:27 GMT the certificate for frontends.b5p.us expired. This is the certificate our front-end servers use to prove their authenticity to the PageKite back-end software. As a result, most users running PageKite in the default (secure) configuration got a "could not connect to front-end" error when trying to launch their kites. Users who had already established a connection were not affected until around 22:30 GMT, when we were notified of the problem and restarted the Icelandic front-end as part of the troubleshooting process. By 23:00 a new certificate had been generated, signed and deployed to all front-end servers, restoring service to full capacity.

The total impact of this outage was about 7.5 hours of the service being unavailable for new connections (15:27 to 23:00 GMT), overlapping with about 30 minutes of complete unavailability for our Icelandic users.

The root cause of this event was simply human error: we were aware that our certificates were expiring and had begun work on renewing them, but being somewhat distracted by the holidays, we didn't read our e-mail carefully enough and overlooked the fact that the front-end certificate was set to expire a couple of days earlier than the others.

To prevent this problem from recurring, we are taking the following steps:

  • Reducing the number of certificates in use by the service, to simplify management
  • Extending our automated monitoring to track certificate expiration
  • Extending our automated monitoring to check end-to-end service availability
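
For the curious, a certificate expiration check can be quite small. The following is only a rough sketch in Python using the standard library, not the monitoring we actually run: the function names, the 30-day warning threshold and the default port are all assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jan 11 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400.0


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True if the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter.

    Note: with the default context, an already expired (or otherwise
    invalid) certificate makes the handshake raise an ssl.SSLError,
    which a monitor should treat as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call fetch_not_after() for each host that serves a certificate, pass the result to needs_renewal() together with datetime.now(timezone.utc), and send an alert whenever it returns True. Checking every certificate separately matters here, since our mistake was assuming they all expired on the same day.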

We aim to learn from our mistakes. :-)
