Thursday, February 11, 2010

Google using social proof to expose slow networks

On the heels of their recent foray into the broadband game, Google just launched a public "YouTube Video Speed History" feature (screenshot above), providing YouTube users with insight into how their video performance compares with other users in their area, other ISPs, and globally. It also subtly (and importantly) shows you the performance of other ISPs near you.

With this passive transparency into your YouTube video performance, Google is using social proof and basic psychology to create a grassroots push for improved performance at the last mile. Buffering and low-quality videos are much easier to ignore than the knowledge that your neighbor is getting a better experience. The genius lies in the fact that no one can complain about Google simply making this data transparent.

The message is simple. Google wants to make the web experience faster, and is using every tool in its tool belt to make this happen. This has already included the launch of Chrome (to push other browsers to get faster), the release of Page Speed and an associated set of best practices (to help developers optimize their page performance), the proposed changes to HTTP (to improve the underlying protocols), a proposal to extend the DNS protocol and the launch of its own recursive DNS servers (to improve DNS lookup times), and recently the announcement of an experimental fiber network (to own the last mile). Until it has that control, it can make a lot of headway by convincing users that they are getting a raw deal.

GitHub System Status page, the most fun status page yet


What they are doing well:
  1. Big green banner clearly telling you that things are OK (and I assume a red banner when they are not)
  2. Easy to find (http://status.github.com/)
  3. Linked to their Twitter account, which is being used effectively for real-time updates.
  4. Ability to easily notify GitHub of an issue from the status page.
  5. Fun, well designed.
What they could improve:
  1. Add automated real-time health status; the page currently relies on manual updates.
  2. More detail on which parts of the system are up/down; the status is currently all or nothing.
  3. A Google search for "GitHub Status" returns a link to an old destination as the first result, which doesn't point users to this new site.
  4. Link from github.com to the status page is buried at the bottom. Make sure users can find it when they need to.
  5. Add an RSS feed.
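
To make the first two improvements concrete, here is a minimal sketch of a per-component status rollup. It is not how GitHub's status page works; the component names ("web", "api", "git") and the three health levels are assumptions made purely for illustration. The idea is that automated checks report a health value per component, and the banner shows the worst one along with the parts affected.

```python
from enum import IntEnum


class Health(IntEnum):
    # Hypothetical health levels, ordered so that a higher value is worse.
    OK = 0
    DEGRADED = 1
    DOWN = 2


def overall_status(components):
    """Return the worst health across all components (OK if none are given)."""
    return max(components.values(), default=Health.OK)


def banner(components):
    """Build the banner text: all-clear, or the worst level plus affected parts."""
    status = overall_status(components)
    if status == Health.OK:
        return "All systems operational"
    affected = sorted(name for name, h in components.items() if h != Health.OK)
    return f"{status.name}: {', '.join(affected)}"


# Example with hypothetical component names.
print(banner({"web": Health.DOWN, "api": Health.DEGRADED, "git": Health.OK}))
```

With this shape, the big green banner is simply the case where every component reports OK, and anything else names the parts that are in trouble instead of an all-or-nothing message.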

Monday, February 8, 2010

Downtime post-mortems, and a look at oneforty.com

Transparency happens one incident at a time, one process at a time. True transparency, the kind that benefits both the customer and the company, doesn't come easy. It requires pre-incident planning, intra-incident collaboration, and post-incident communication. I plan to blog about this holistic framework in the near future, but today I'd like to use the downtime postmortem posted by Mike Champion, describing the recent downtime of oneforty.com, to build a basic template for how to handle post-incident communication.

The Incident
A few weeks ago we rolled out an alpha version of our ecommerce platform and the news was covered on a few blogs, including TechCrunch. At roughly the same time (it seemed) there were alerts about the amount of swap space on one or more of our servers. The alerts would typically flap between a warning and then return to normal levels. I figured the two events were related and that the alerts were due to increased traffic, but not a serious issue.
The post goes on to describe (in detail) what went wrong, actions taken during the event, and lessons learned. An excellent post-mortem by any standard. Simply posting a post-mortem publicly is (sadly) a huge achievement. What can we learn from this post, and what should your post-mortems include? Let me propose a rough guideline...

A guideline for post-mortem communication
Prerequisites:
  1. Admit failure - Hiding downtime is no longer an option (thanks to Twitter)
  2. Sound like a human - Do not use a standard template, do not apologize for "inconveniencing" us.
  3. Have a communication channel - Ideally you've set up a process to handle incidents before the event, and communicated publicly during the event. Customers will need to know where to find your updates.
  4. Above all else, be authentic
Requirements:
  1. Start time and end time of the incident.
  2. Who/what was impacted.
  3. What went wrong, with insight into the root cause analysis process.
  4. What's being done to improve the situation, lessons learned.
Nice-to-haves:
  1. Details on the technologies involved.
  2. Answers to the Five Whys.
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc.
  4. What others can learn from this experience.

How did oneforty.com do?
Prerequisites:
  1. Admit failure: Yes, using their Twitter account.
  2. Sound like a human: Yes. Very non-generic and highly detailed post.
  3. Have a communication channel: Partial. The Twitter account exists, and the blog by Mike Champion exists, but as a user I would be hard pressed to find these two venues when the service is down and I need to know what's going on.
  4. Be authentic: Yes.
Requirements:
  1. Start/end time: No. Can't find that anywhere, and have to extrapolate it from the first tweet to the last.
  2. Who/what was impacted: No. I have to assume the entire site and all visitors were impacted, but there was no mention of this.
  3. What went wrong: Yes. A lot of detail, takes us through the entire experience of root cause analysis.
  4. Lessons learned: Yes. Extremely solid.
Nice-to-haves:
  1. Technologies involved: Yes.
  2. Answers to the Five Whys: No.
  3. Human elements: Yes. An engaging story.
  4. What others can learn from this experience: Yes, a lot to take away if you are an Engine Yard customer.
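
If you want to apply the same guideline to your own post-mortems, the scorecard above can be captured as plain data and tallied. This is only a sketch: the "yes"/"partial"/"no" encoding and the helper names `tally` and `gaps` are my own assumptions, and the scores are transcribed from the review above.

```python
from collections import Counter

# Scores transcribed from the oneforty.com review above.
SCORECARD = {
    "Prerequisites": {
        "Admit failure": "yes",
        "Sound like a human": "yes",
        "Have a communication channel": "partial",
        "Be authentic": "yes",
    },
    "Requirements": {
        "Start/end time": "no",
        "Who/what was impacted": "no",
        "What went wrong": "yes",
        "Lessons learned": "yes",
    },
    "Nice-to-haves": {
        "Technologies involved": "yes",
        "Answers to the Five Whys": "no",
        "Human elements": "yes",
        "What others can learn": "yes",
    },
}


def tally(scorecard):
    """Count how many items received each score across all sections."""
    return Counter(
        score for items in scorecard.values() for score in items.values()
    )


def gaps(scorecard, section):
    """List the items in a section that did not earn a full 'yes'."""
    return [item for item, score in scorecard[section].items() if score != "yes"]


print(tally(SCORECARD))
print(gaps(SCORECARD, "Requirements"))
```

Listing the gaps in the Requirements section is the most useful part, since those are the items a post-mortem should not ship without.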

Conclusion
The basic aim of a post-mortem is to reassure your customers that you recognize there was a problem, that you have resolved it, and that you are going to learn from the experience. Holistically, oneforty.com accomplished this. When thinking about your own post-mortem postings, just imagine an angry customer who has lost faith in your service and needs to be reassured that you know what you are doing. If you can accomplish that, you've succeeded.