"We're building a dashboard to provide you with system status information. This dashboard, which we aim to make available in a few months, will enable us to share the following information during an outage:Who knew! Glad to see they aren't rushing this out the door and are (hopefully) putting in the time to do it right. Judging by the amount of thought they've already put into this email, I'm already impressed.
- A description of the problem, with emphasis on user impact. Our belief is during the course of an outage, we should be singularly focused on solving the problem. Solving production problems involves an investigative process that's iterative. Until the problem is solved, we don't have accurate information around root cause, much less corrective action, that will be particularly useful to you. Given this practical reality, we believe that informing you that a problem exists and assuring you that we're working on resolving it is the useful thing to do.
- A continuously updated estimated time-to-resolution. Many of you have told us that it's important to let you know when the problem will be solved. Once again, the answer is not always immediately known. In this case, we'll provide regular updates to you as we progress through the troubleshooting process."
Google will have a challenge presenting its myriad services in an easy-to-understand dashboard. Its online services are complex, highly GUI-oriented, and distributed in such a way that a small fraction of users could be down while the rest of the world is fine. My three big questions for Google are:
- How will you be collecting your uptime and performance data? Can we rely on it to be accurate and unbiased?
- What will be considered "downtime" for a complex app like Google Docs or Gmail?
- How will you communicate with your users during the downtime event? Will that communication channel be easy to find?
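
To make the second question concrete, here is a minimal sketch of one way a dashboard could turn probe results into a status when only a fraction of users is affected. Everything in it is an assumption for illustration: the `ProbeResult` type, the `classify_status` function, and the 1% and 50% thresholds are hypothetical, and nothing here reflects how Google actually measures or defines downtime.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    region: str      # hypothetical label for where the probe ran
    succeeded: bool  # True if the probe's request completed correctly


def classify_status(results, degraded_at=0.01, outage_at=0.5):
    """Map the share of failed probes to a coarse dashboard status.

    The thresholds are illustrative assumptions, not published values.
    """
    if not results:
        # No data is different from "everything is fine".
        return "unknown"
    failed = sum(1 for r in results if not r.succeeded)
    failure_rate = failed / len(results)
    if failure_rate >= outage_at:
        return "outage"
    if failure_rate >= degraded_at:
        return "degraded"
    return "ok"


# One failing probe out of twenty: most of the world is fine, some users are not.
sample = [ProbeResult("region-a", True)] * 19 + [ProbeResult("region-b", False)]
print(classify_status(sample))
```

The point of the sketch is that the answer depends entirely on where the thresholds sit and who runs the probes, which is exactly why the first two questions matter.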