During the first few days of 2014, there was a DNS-related outage that affected many PageKite users, causing both difficulty flying new kites and difficulty connecting to the kites that were already flying.
This is the most significant outage we have experienced since launching the PageKite.net service and we are very sorry about any inconvenience it may have caused. This post is a full "postmortem", a document which explains what happened, what the impact was, how the issue was resolved and what steps were and will be taken to prevent the problem from happening again.
Rough outage timeline
On January 1, the VPS server which was the master DNS for the pagekite.net zone had its IP address changed by our upstream provider.
Around January 2nd or 3rd, the DNS provider which ran the "slaves" for our zone stopped serving results for our zone, most likely as a result of being unable to make contact with the master. The exact timing of this event is unknown, due to insufficient monitoring and DNS caching effects.
On January 3rd, the configuration of the slave servers was updated, restoring partial availability.
On January 4th, all DNS services for the pagekite.net domain were moved to gandi.net, fully resolving the issue.
The domain pagekite.net and all subdomains were completely unresolvable for at least 12 hours around January 3rd, and service was degraded (DNS responses were slower or would fail) for the first four days of the year. This impacted the following user-facing PageKite services:
- This website, due to its use of the pagekite.net domain name
- Dynamic DNS updates on up.pagekite.net, including those of white-label customers using CNAME aliases pointing to it
- The pagekite.me domain (and all subdomains) became unresolvable, preventing access to already flying kites
- The b5p.us domain (and all subdomains) became unresolvable, preventing discovery of available front-end relays
The last two issues resulted in almost total service unavailability for a significant number of PageKite service users until they were resolved, as flying kites couldn't be resolved and new kites couldn't find relays to connect to.
Some white-label customers use .net or .com domains for their kites. The name servers for those top-level domains do serve glue records for nsX.pagekite.net, and thus kept those kites visible and available during the outage.
Depending on configuration, these customers may still have been unable to update their dynamic DNS records or discover new front-end relays, which would have caused kite unavailability in some cases.
Analysis and lessons learned
Our monitors were insufficient to detect and report how serious the outage was. This was largely due to the effects of DNS caching - the monitors had access to cached information about the affected zones long after the problems had become visible to the wider Internet.
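One way to sidestep the caching effect described above is to send queries straight to each authoritative server with recursion disabled, so that no resolver cache sits in between. The following is only a minimal sketch of that idea, not the monitoring code we actually run: the function names are made up for illustration, it uses UDP only, and it inspects just the DNS header of the reply.

```python
import random
import socket
import struct


def build_query(name, qtype=1, query_id=None):
    """Build a minimal DNS query packet with the RD (recursion desired) bit cleared."""
    if query_id is None:
        query_id = random.randint(0, 0xFFFF)
    # Header: ID, flags (0x0000 = standard query, no recursion), QDCOUNT=1, others 0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # QTYPE (1 = A), QCLASS 1 = IN
    return header + question


def parse_header(packet):
    """Decode the 12-byte DNS header of a reply into a small dict."""
    query_id, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", packet[:12])
    return {
        "id": query_id,
        "authoritative": bool(flags & 0x0400),  # AA bit
        "rcode": flags & 0x000F,                # 0 = no error
        "answers": an,
        "authority": ns,
        "additional": ar,
    }


def check_server(server_ip, name, timeout=3.0):
    """Return True if server_ip gives an authoritative, error-free answer for name."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeouts and network errors both count as a failed check
            return False
    if len(reply) < 12:
        return False
    info = parse_header(reply)
    sent_id = struct.unpack("!H", query[:2])[0]
    return (
        info["id"] == sent_id
        and info["authoritative"]
        and info["rcode"] == 0
        and info["answers"] > 0
    )
```

A monitor would call something like check_server("192.0.2.1", "pagekite.net") once per authoritative server and alert on any False result; the address here is a documentation placeholder, not one of our servers.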
The slave DNS service we were relying on has been deemed unfit for use.
The pagekite.me and b5p.us problems arose because the name servers for the .me and .us top-level domains do not serve glue records for the authoritative PageKite dynamic name servers (ns1, ns2 and ns3.pagekite.net). Further testing has revealed that the .is top-level domain shares this weakness. This is markedly different from the behavior of the name servers for .com and .net, and our assumptions about how these systems behaved were incorrect. This has implications not just for reliability but also for performance, as looking up names under these domains requires more DNS requests to complete.
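To make the glue issue concrete: a referral from a top-level domain lists the NS host names for a zone and may carry their addresses in the additional section. Any NS name without an address there has to be resolved separately, which is where the dependency on pagekite.net comes from. The sketch below only illustrates that check on already-parsed data; the function name and the sample addresses (documentation placeholders) are assumptions, not real records.

```python
def nameservers_without_glue(ns_names, additional_addresses):
    """Return the NS host names from a referral that came without glue.

    ns_names: list of name server host names from the authority section.
    additional_addresses: dict mapping host name -> list of IP addresses
        found in the additional section of the same referral.
    """
    # Normalise names: DNS names are case-insensitive, trailing dot optional
    with_glue = {
        name.rstrip(".").lower()
        for name, addresses in additional_addresses.items()
        if addresses
    }
    return [ns for ns in ns_names if ns.rstrip(".").lower() not in with_glue]


# Illustrative data only: a referral with no glue at all, as described
# above for pagekite.me, versus one that carries glue for every server.
ns_names = ["ns1.pagekite.net.", "ns2.pagekite.net.", "ns3.pagekite.net."]
referral_without_glue = {}
referral_with_glue = {
    "ns1.pagekite.net.": ["192.0.2.1"],
    "ns2.pagekite.net.": ["192.0.2.2"],
    "ns3.pagekite.net.": ["192.0.2.3"],
}

print(nameservers_without_glue(ns_names, referral_without_glue))
print(nameservers_without_glue(ns_names, referral_with_glue))
```

Every name returned by the first call costs the resolver at least one extra lookup in another zone, and the whole resolution fails if that zone is down.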
Our reliance on CNAME records in various white-label configurations needs to be reconsidered, as this increases the risk that DNS issues impacting one domain will impact others.
- We have retired the old master DNS server and the third-party slave service and moved the pagekite.net zone to Gandi's infrastructure, which should be significantly more reliable.
- Direct, un-cached monitoring of the authoritative DNS servers for pagekite.net has been enabled.
- The monitoring server itself was moved to a new location.
- The use of CNAMEs in infrastructure-related DNS records is being reconsidered and will be phased out wherever possible.
- The TTL (caching lifetime) policy for infrastructure-related DNS records is being reviewed.
Once again, we thank you for your patience and understanding.
We hope the steps above will suffice to prevent this problem from recurring, and that by sharing this document we can help others avoid making the same mistakes.
Happy New Year! :-)