Monday, February 8, 2010

Downtime post-mortems, and a look at oneforty.com

Transparency happens one incident at a time, one process at a time. True transparency, the kind that benefits both the customer and the company, doesn't come easily. It requires pre-incident planning, intra-incident collaboration, and post-incident communication. I plan to blog about this holistic framework in the near future, but today I'd like to use the downtime post-mortem posted by Mike Champion, describing the recent downtime of oneforty.com, to build a basic template for how to handle post-incident communication.

The Incident
A few weeks ago we rolled out an alpha version of our ecommerce platform and the news was covered on a few blogs, including TechCrunch. At roughly the same time (it seemed) there were alerts about the amount of swap space on one or more of our servers. The alerts would typically flap between a warning and then return to normal levels. I figured the two events were related and that the alerts were due to increased traffic, but not a serious issue.
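
A quick aside on the technical side of that story: alerts that "flap" like this usually come from a bare threshold check, and when swap usage hovers right around the warning line the check oscillates between WARNING and OK on every run. Here's a minimal sketch of that kind of check, using hypothetical thresholds and the psutil library; this is purely my own illustration, not oneforty's or Engine Yard's actual monitoring:

  import psutil  # assumed dependency for reading swap usage

  # Hypothetical thresholds; a real monitoring system defines its own.
  WARN_PCT = 80.0
  CRIT_PCT = 95.0

  def check_swap():
      """Return a Nagios-style status for current swap usage."""
      pct = psutil.swap_memory().percent
      if pct >= CRIT_PCT:
          return "CRITICAL", pct
      if pct >= WARN_PCT:
          return "WARNING", pct
      return "OK", pct

  if __name__ == "__main__":
      status, pct = check_swap()
      print("swap %s: %.1f%% used" % (status, pct))

When usage sits just under the warning threshold, repeated runs of a check like this flip back and forth between WARNING and OK, which is exactly the kind of noise that's easy to write off as "increased traffic, but not a serious issue."
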
The post goes on to describe (in detail) what went wrong, actions taken during the event, and lessons learned. An excellent post-mortem by any standard. Simply posting a post-mortem publicly is (sadly) a huge achievement. What can we learn from this post, and what should your post-mortems include? Let me propose a rough guideline...

A guideline for post-mortem communication
Prerequisites:
  1. Admit failure - Hiding downtime is no longer an option (thanks to Twitter).
  2. Sound like a human - Do not use a standard template, do not apologize for "inconveniencing" us.
  3. Have a communication channel - Ideally you've set up a process to handle incidents before the event, and communicated publicly during the event. Customers will need to know where to find your updates.
  4. Above all else, be authentic.
Requirements (made concrete in the sketch after these lists):
  1. Start time and end time of the incident.
  2. Who/what was impacted.
  3. What went wrong, with insight into the root cause analysis process.
  4. What's being done to improve the situation, lessons learned.
Nice-to-haves:
  1. Details on the technologies involved.
  2. Answers to the Five Whys.
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc.
  4. What others can learn from this experience.
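
To make the requirements and nice-to-haves above concrete, here is a minimal sketch of the fields a post-mortem write-up should be able to fill in, expressed as a simple Python record. The field names are my own illustration, not a standard and not anything oneforty uses:

  from dataclasses import dataclass, field
  from datetime import datetime
  from typing import List

  @dataclass
  class PostMortem:
      # Requirements
      start_time: datetime        # when the incident began
      end_time: datetime          # when service was fully restored
      impact: str                 # who and what was affected
      root_cause: str             # what went wrong, and how you found out
      remediation: List[str]      # what's being done, lessons learned
      # Nice-to-haves
      technologies: List[str] = field(default_factory=list)
      five_whys: List[str] = field(default_factory=list)  # each "why" in the chain
      human_elements: str = ""    # the story: heroics, coincidences, teamwork

If you can honestly fill in every required field before you hit publish, you've met the bar above; the optional fields are where a post like Mike's really shines.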

How did oneforty.com do?
Prerequisites:
  1. Admit failure: Yes, using their Twitter account.
  2. Sound like a human: Yes. Very non-generic and highly detailed post.
  3. Have a communication channel: Partial. The Twitter account exists, and the blog by Mike Champion exists, but as a user I would be hard pressed to find these two venues when the service is down and I need to know what's going on.
  4. Be authentic: Yes.
Requirements:
  1. Start/end time: No. I can't find it anywhere; you have to infer it from the first tweet to the last.
  2. Who/what was impacted: No. I have to assume the entire site and all visitors were impacted, but there was no mention of this.
  3. What went wrong: Yes. A lot of detail; the post takes us through the entire root cause analysis experience.
  4. Lessons learned: Yes. Extremely solid.
Nice-to-haves:
  1. Technologies involved: Yes.
  2. Answers to the Five Whys: No.
  3. Human elements: Yes. An engaging story.
  4. What others can learn from this experience: Yes, a lot to take away if you are an Engine Yard customer.

Conclusion
The basic aim of a post-mortem is to reassure your customers that you recognize there was a problem, that you have resolved it, and that you are going to learn from the experience. Holistically, oneforty.com accomplished this. When thinking about your own post-mortem postings, just imagine an angry customer who has lost faith in your service and needs to be reassured that you know what you are doing. If you can accomplish that, you've succeeded.
