Transparent Uptime: Google raising the bar in post-mortem transparency

Thursday, July 9, 2009

In the most detailed post-mortem I've ever seen come out of a cloud provider, Google chronicles the minute-by-minute timeline of its App Engine outage, reviews what went wrong, and commits to fixing the root cause at several levels:

What are we doing to fix it?
1. The underlying bug in GFS has already been addressed and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug has been live for at least a year, so the
risk of recurrence should be low. Site reliability engineers are aware
of this issue and can quickly fix it if it should recur before then.
2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.
3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.
4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.