Friday, March 5, 2010

Google App Engine downtime postmortem, nearly a perfect model for others

Google posted one of the most detailed and well-thought-out postmortems I've seen to explain what happened during their App Engine downtime on 2/24/10. Let's run it through the gauntlet:

  1. Admit failure - Yes
  2. Sound like a human - Yes, more in some sections than others
  3. Have a communication channel - Yes, the Google App Engine Downtime Notify group. Ideally it would have been linked to from the App Engine System Status Dashboard as well.
  4. Above all else, be authentic - Yes
  1. Start time and end time of the incident - Yes, including GMT time, and a highly detailed timeline of the entire event
  2. Who/what was impacted - Partly, and also partly covered during the actual incident
  3. What went wrong - Yes, yes, and yes! Incredible amount of detail here.
  4. Lessons learned - Yes! Not only are there five specific action items, they are also introducing new architectural changes and customer choice as a result.
  1. Details on the technologies involved - Somewhat
  2. Answers to the Five Whys - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Partial
  4. What others can learn from this experience - Partial
Takeaways and thoughts:
  1. The vast majority of the issues were training related. This is an important lesson: all of the technology and process in the world won't help you if your on-call team doesn't know what to do. This is especially true during the stress of a large incident. Follow Google's advice and run regular on-call drills (including for rare issues), keep documentation updated, and give on-call people the authority to make decisions on the spot.
  2. Extremely impressed with the decision to use this opportunity to improve their service by giving their customers the choice between datastore performance and reliability. This is a perfect example of turning downtime into a positive.
  3. Interesting insight into their process, from detecting an incident to communicating externally. Going from the start of the incident at 7:48am to the first external communication at 8:01am is not too shabby. Not sure why it took so long to post the update to the downtime forum and the health status dashboard (8:36am).
  4. The amount of time and thought that went into this postmortem shows how much Google is concerned about their service, and impressions around its reliability.
What could be improved:
  1. External communication could be faster. There is no reason not to post something as soon as the investigation begins, and to post immediately to the forum dedicated to downtime notifications and to the health status dashboard. When the incident started, the dashboard had very limited data; that data should be automatic and real-time.
  2. A link to this postmortem from the health status dashboard would make it a lot easier to find. I didn't see it until someone sent it to me.
  3. Timelines and concrete deliverables on the changes (e.g. on-call training sessions, documentation updates, new datastore feature release) would give us more confidence that things will actually change.


  1. Google was unable to post to the dashboard because it was down, and I imagine they had issues with its admin interface. The live health data stopped appearing for the majority of parameters, and the AppEngine Status Dashboard itself only sometimes loaded promptly during the downtime incident.

    The Status Dashboard has to be the most reliable component during incidents like this, but it was not, because it depended on the AppEngine infrastructure.

  2. 1. goof up with help from a power outage
    2. publish details of the goof up, so that the fandom hypes up the act itself and everyone forgets about the goof up
    3. ...
    4. profit
