Friday, March 5, 2010

Google App Engine downtime postmortem, nearly a perfect model for others

Google posted one of the most detailed and well-thought-out postmortems I've seen, explaining what happened around their 2/24/10 App Engine downtime. Let's run it through the gauntlet:

Prerequisites:
  1. Admit failure - Yes
  2. Sound like a human - Yes, more in some sections than others
  3. Have a communication channel - Yes, the Google App Engine Downtime Notify group. Ideally it would have been linked to from the App Engine System Status Dashboard as well.
  4. Above all else, be authentic - Yes
Requirements:
  1. Start time and end time of the incident - Yes, including GMT time, and a highly detailed timeline of the entire event
  2. Who/what was impacted - Partly, and also partly covered during the actual incident
  3. What went wrong - Yes, yes, and yes! Incredible amount of detail here.
  4. Lessons learned - Yes! Not only are there five specific action items, they are also introducing new architectural changes and customer choice as a result.
Bonus:
  1. Details on the technologies involved - Somewhat
  2. Answers to the Five Whys - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Partial
  4. What others can learn from this experience - Partial
Takeaways and thoughts:
  1. The vast majority of the issues were training-related. This is an important lesson: all of the technology and process in the world won't help you if your on-call team doesn't know what to do. This is especially true under the stress of a large incident. Follow Google's advice and run regular on-call drills (including for rare issues), keep documentation updated, and give on-call people the authority to make decisions on the spot.
  2. Extremely impressed with the decision to use this opportunity to improve their service by giving their customers the choice between datastore performance and reliability. This is a perfect example of turning downtime into a positive.
  3. Interesting insight into their process, from detecting an incident to communicating externally. Thirteen minutes from the start of the incident at 7:48am to the first external communication at 8:01am is not too shabby. It's less clear why it took until 8:36am to post the update to the downtime forum and the health status dashboard.
  4. The amount of time and thought that went into this postmortem shows how much Google cares about their service, and about perceptions of its reliability.
What could be improved:
  1. External communication could be faster. There's no reason not to post something as soon as the investigation begins, and the forum dedicated to downtime notifications and the health status dashboard should be updated immediately. When the incident started, the dashboard had very limited data; that data should be populated automatically and in real time.
  2. A link to this postmortem from the health status dashboard would make it much easier to find. I didn't see it until someone sent it to me.
  3. Timelines and concrete deliverables on the changes (e.g. on-call training sessions, documentation updates, new datastore feature release) would give us more confidence that things will actually change.
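The "automatic and real-time" dashboard point above is worth making concrete. A status dashboard shouldn't wait for a human to flip a switch; it can derive a coarse status directly from monitoring data. Here's a minimal sketch of that idea, using entirely hypothetical names and thresholds (nothing here reflects Google's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class MetricSample:
    """A single monitoring reading for one service (hypothetical schema)."""
    service: str
    error_rate: float   # fraction of requests that failed, 0.0-1.0
    latency_ms: float   # median request latency

def derive_status(sample: MetricSample,
                  error_threshold: float = 0.05,
                  latency_threshold_ms: float = 500.0) -> str:
    """Map a raw monitoring sample to a coarse dashboard status.

    Thresholds are illustrative assumptions, not real App Engine values.
    """
    if sample.error_rate >= 0.5:
        return "outage"        # majority of requests failing
    if sample.error_rate >= error_threshold:
        return "degraded"      # elevated but partial failures
    if sample.latency_ms >= latency_threshold_ms:
        return "degraded"      # healthy responses, but slow
    return "ok"

# Example: a datastore sample during a serious incident
print(derive_status(MetricSample("datastore", error_rate=0.6, latency_ms=120.0)))
```

Publishing whatever `derive_status` returns, on a timer, means the dashboard reflects reality within minutes of an incident starting, rather than waiting 48 minutes for someone to update it by hand.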

2 comments:

  1. Google was unable to post to the dashboard because it was down, and I imagine they had issues with its admin interface. The live health data stopped appearing for the majority of parameters, even though the App Engine Status Dashboard page itself sometimes loaded promptly during the incident.

    The Status Dashboard needs to be the most reliable component during issues like this, but it was not, due to its dependency on App Engine infrastructure.

    ReplyDelete
  2. 1. goof up with help from a power outage
    2. publish details of goof up, such that fandom hype up over this act itself, and everyone forgets about the goof up
    3. ...
    4. profit

    ReplyDelete