Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer
  • Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric which appears to test the overall service externally.
  • I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
  • Conclusion: Met!
Rule #2: Data must be accurate and timely
  • Hard to say until an event occurs or we hear feedback about this from users.
  • The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
  • The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
  • In addition, the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or hidden from view. That's bad news, though it does fit with the SLA questions that came up recently.
  • Conclusion: Time will tell (but promising)
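To make the concern about "minor" issues concrete, here is a minimal sketch of how a two-level rollup could hide a status change. The three labels come from the status page as described above; the `Status` enum, the `HEALTHY` set, the function names, and the choice to treat "Investigating" as not healthy are all assumptions for illustration, not Google's actual implementation.

```python
from enum import Enum


class Status(Enum):
    # Labels mentioned in this review; the real dashboard may define others.
    NO_ISSUES = "No issues"
    MINOR_PERFORMANCE_ISSUES = "Minor performance issues"
    INVESTIGATING = "Investigating"


# Assumption: both of these labels roll up to "healthy", as the review observes.
HEALTHY = {Status.NO_ISSUES, Status.MINOR_PERFORMANCE_ISSUES}


def rollup(status: Status) -> str:
    """Collapse a detailed status into the coarse healthy/unhealthy view."""
    return "healthy" if status in HEALTHY else "unhealthy"


def change_is_visible(before: Status, after: Status) -> bool:
    """Return True if a reader of the coarse view would notice the change."""
    return rollup(before) != rollup(after)
```

Under these assumptions, a move from "No issues" to "Minor performance issues" leaves the coarse view unchanged, which is exactly the transparency gap noted above.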
Rule #3: Must be easy to find
  • If I were experiencing a problem with App Engine, I would first go to the App Engine homepage. Unfortunately I don't see any link to the system status page there. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
  • The URL of the system status page (http://code.google.com/status/appengine/) is not easy to remember. Since Google doesn't seem to own appengine.com, this may not be easy to fix, but that doesn't matter to a user who is in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would think that it will rise to #1 very soon.
  • Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).
Rule #4: Must provide details for events in real time
  • Again, hard to say until we see an issue occur.
  • What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
  • Conclusion: Time will tell.
Rule #5: Provide historical uptime and performance data
  • Great job with this. I dare say they've jumped ahead of every other cloud service in the amount and detail of performance data they provide.
  • Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
  • Conclusion: Met!
Rule #6: Provide a way to be notified of status changes
Rule #7: Provide details on how the data is gathered
  • Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", there is no real information on how this data is collected, how often it is updated, or where the monitoring happens from.
  • Conclusion: Not met.
Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a very complete and extremely useful central place for their customers to come in times of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever-growing need.
