"Yesterday an unannounced DNS change apparently made our mail server go incognito to the rest of the world. The consequences of this came sneaking over night as the changes propagated through the DNS network. Whammy.

On top of this our upstream internet provider late last night PST (early morning CET) experienced a failure that prevented our servers from reaching external destinations. Web access was not affected but email, widget, targets, basically everything that relied on communication from our servers to the outside world were. Double whammy.

It took too long time to realize that we had two separate issues at hand. We kept focusing on the former as root cause for the latter. And it took unacceptably long to determine that we had a network outage."

How well does such an informal and simple postmortem stack up against the postmortem best practices? Let's find out:
- Admit failure - Yes, no question.
- Sound like a human - Yes, very much so.
- Have a communication channel - Yes, both the blog and the Twitter account.
- Above all else, be authentic - Yes, extremely authentic.
- Start time and end time of the incident - No.
- Who/what was impacted - Yes, though more detail would have been nice.
- What went wrong - Yes, well done.
- Lessons learned - Not much.
- Details on the technologies involved - No.
- Answers to the Five Whys - No.
- Human elements - Some.
- What others can learn from this experience - Some.
The meat was definitely there. The biggest missing piece is insight into what lessons were learned and what is being done to improve for the future. Mikkel says that "We've learned an important lesson and will do our best to ensure that no 3rd parties can take us down like this again", but the specifics are lacking. The exact start and end times of the incident would also have been useful for any companies wondering whether this outage explains the issues they saw that day.
It's always impressive to see the CEO of a company put himself out there like this and admit failure. It is (naively) easier to pretend that everything is OK and hope the downtime blows over. In reality, getting out in front of the problem and communicating transparently, both during the downtime (in this case over Twitter) and after the event is over (in this postmortem), is the best thing you can do to turn a disaster into an opportunity to increase customer trust.
As it happens, I will be speaking at the upcoming Velocity 2010 conference about this very topic!
Update: Zendesk has put out a more in-depth review of what happened, which includes everything that was missing from the original post (which, as the CEO pointed out in the comments, was meant to be a quick update of what they knew at the time). The new post includes the time frame of the incident, details on what exactly went wrong with the technology, and, most importantly, the lessons and takeaways that will improve things in the future. Well done.