Transparent Uptime: 11/01/2008

Sunday, November 23, 2008

Transparency case study, courtesy of ylastic

ylastic (a company that provides tools to help manage AWS services) kept their users in the loop during an outage by communicating status updates over Twitter:

You can find the entire set of updates at ylastic's twitter page.

I keep coming back to the same question. Do your users know where to go during a downtime event? ylastic has their web site, their blog, their forums, and their twitter feed. As a user, how do I know where to look when I'm having a problem and want to know what's going on with the service (which is generally an emergency)? As the company, how do I keep users from clogging my support email box in spite of my efforts to get status updates out to the world? In this case it looks like the only place that had any information was the twitter feed. If users weren't aware it existed, both sides would be out of luck.

What every SaaS service needs is a clear central place, that their users can easily find, that provides real time updates on downtime or performance events. It's great that you're willing to communicate during the event, but if no one can find those updates, what's the point? Don't get me started on falling trees.

On another note, kudos to ylastic for their transparency on the following fronts:

Providing insight into their product roadmap. Very much what SaaS providers must do to build the trust relationship with their users (which is critical to the success of any online hosted application).
Their upcoming iPhone app that among other things gives you the AWS Service Health status on the go.
Simply giving status updates on Twitter.

Tuesday, November 18, 2008

Google offers up new SLA's for paying customers

Two noteworthy items from the news of Google announcing new SLA's for paying customers:

Google measures uptime as "average uptime per user based on server-side error rates".
They claim 99.9% uptime for the last year, in spite of the outage in August.

Does anyone else see a problem here? It blows my mind that not only do we have to rely on Google to tell us what their uptime was, but we allow them to tell us what uptime means. To me, downtime could include Google's services being completely down and not in a position to return any kind of server-side error. To Google, that doesn't count. And even if it did, do we have to rely on twitter and public outrage to keep Google honest when downtime does occur?
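To make the gap concrete, here is a rough sketch in Python. Every number in it is made up for illustration, and the function names are my own; it shows one possible reading of an error-rate-based uptime figure, not how Google actually computes theirs. It first works out how much downtime a 99.9% SLA still allows per year, then compares an error-rate view of a month against a simple time-based view of the same month.

```python
# Hypothetical numbers for illustration only; none of them come from Google.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(sla_percent: float) -> float:
    """Minutes of downtime per year that an SLA percentage still allows."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


def error_rate_uptime(requests_served: int, server_errors: int) -> float:
    """One possible provider-side view: share of served requests without a server-side error."""
    return 100 * (1 - server_errors / requests_served)


def time_based_uptime(total_minutes: int, unreachable_minutes: int) -> float:
    """A user-side view: share of minutes in which the service was reachable at all."""
    return 100 * (1 - unreachable_minutes / total_minutes)


if __name__ == "__main__":
    # A 99.9% SLA still leaves roughly 8.76 hours of downtime per year.
    print(round(downtime_budget_minutes(99.9) / 60, 2), "hours per year")

    # Assumed scenario: a 30-day month with a 120-minute total outage.
    # Requests that never reach a server produce no server-side error,
    # so the outage does not show up in the error-rate figure.
    print(round(error_rate_uptime(1_000_000, 500), 2), "% by error rate")
    print(round(time_based_uptime(30 * 24 * 60, 120), 2), "% by wall clock")
```

In this made-up month the error-rate figure clears 99.9% while the wall-clock figure does not, which is exactly why the definition of uptime matters as much as the number.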

Two simple things we need to fight for in order to keep SLA's and their associated advantages worth something:

Define uptime based on any issue under the control of the provider that keeps you from using the service at any time (planned or unplanned).
Demand third party validation of the SLA results.

This same request applies to AWS, Salesforce, and any other SaaS service with SLA's (which should be everyone!).

To end on a more positive note, the closing notes from Google's announcement:

"More than 1 million businesses have selected Google Apps to run their business, and tens of millions of people use Gmail every day. With this type of adoption, a disruption of any size — even a minor one affecting fewer than 0.003% of Google Apps Premier Edition users, like the one a few weeks ago — attracts a disproportional amount of attention. We've made a series of commitments to improve our communications with customers during any outages, and we have an unwavering commitment to make all issues visible and transparent through our open user groups.

Google is one of the 1 million businesses that run on Google Apps, and any service interruption affects our users and our business; our engineers are also some of our most demanding customers. We understand the importance of delivering on the cloud's promise of greater security, reliability and capability at lower cost. We are hugely thankful to our customers who drive us to become better every day."

P.S. I've added a comment to my previous "Online Users Bill of Rights" post to take this issue into account.

Amazon launches the Festivus of CDN's...a CDN for the rest of us

You can read about it here, here, and here. A CDN-as-a-Service (CaaS?). Does this put the nail in the coffin and make the CDN business a commodity? Probably.

What impresses me most is not the technology or the pricing or the ease of use of the new service (called CloudFront btw). What really gets me hot is the fact that the AWS Service Health Dashboard already has the performance and uptime of CloudFront up and live! That, my friends, is a sign of true commitment to transparency and, more importantly, a well-functioning process.

Saturday, November 15, 2008

Comparing Amazon Web Services, SalesForce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on its way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have mistakenly misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking at today?

What is the URL of each status page (and are they easy to remember in times of need)?

Amazon Web Services: http://status.aws.amazon.com/ (Somewhat easy to remember)
Salesforce: http://trust.salesforce.com/ (Easy to remember)
Zoho: http://status.zoho.com/ (Extremely easy to remember)

What are these status pages called?

Amazon Web Services: "AWS Service Health Dashboard"
Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com" brand)
Zoho: "Zoho Service Health Status"

What services' health are reported on?

Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.

What health information is provided?

Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be one of "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response from six geographical locations is also given.

Where does the uptime and performance data come from?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: Their own "Site 24x7" monitoring service.

What is considered downtime and what is considered a performance issue?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: No clue.
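Since none of the three publish a definition, here is a minimal sketch of what an explicit one could look like. The thresholds, the `Probe` type, and the `classify` function are all my own assumptions for illustration; nothing here reflects how Amazon, Salesforce, or Zoho actually decide on a status. The status labels are borrowed from the dashboards described above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed thresholds; a real provider would publish (and justify) its own.
SLOW_MS = 2000.0      # a response slower than this counts as a performance issue
DOWN_FRACTION = 0.5   # this share of failed probes counts as a disruption


@dataclass
class Probe:
    """One check of the service from one monitoring location (hypothetical type)."""
    location: str
    response_ms: Optional[float]  # None means the probe got no valid response


def classify(probes: List[Probe]) -> str:
    """Turn a round of probes into a dashboard status using the assumed thresholds."""
    if not probes:
        return "Status not available"
    failed = sum(1 for p in probes if p.response_ms is None)
    if failed / len(probes) >= DOWN_FRACTION:
        return "Service disruption"
    slow = any(p.response_ms is not None and p.response_ms > SLOW_MS for p in probes)
    if failed or slow:
        return "Performance issues"
    return "Service is operating normally"


if __name__ == "__main__":
    round_of_probes = [Probe("us-east", 180.0), Probe("eu", 3200.0), Probe("asia", 240.0)]
    print(classify(round_of_probes))
```

The specific numbers matter less than the fact that they are written down. Once the rule is public, a user can tell what a yellow or red light actually means.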

Are real time updates provided during downtime events? Is it easy to find?

Amazon Web Services: Yes, but unclear how consistently and how easy it is to find that information.
Salesforce: Yes, right underneath the current status.
Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.

Is information provided on past downtime events?

Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
Zoho: No. Unless they are described in the blog.

Is there a way to easily report problems users are having?

Amazon Web Services: Yes, clicking the "Report an Issue" link.
Salesforce: No, other than using the standard support channels.
Zoho: No, other than using the standard support channels.

How can you get notified of problems (without watching this page 24/7)?

Amazon Web Services: Ability to subscribe to RSS feeds for change in status of each service.
Salesforce: No.
Zoho: No.
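For the RSS route, a small script can do the watching for you. Below is a minimal sketch using only the Python standard library. The feed content is a made-up RSS 2.0 sample that I embedded in the script (it is not a real AWS feed), and `new_status_items` is a hypothetical helper name. In practice you would fetch the feed for the service you care about on a schedule and pass its text to the same function.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample for illustration; not taken from any real status feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Service Status</title>
    <item>
      <title>Service is operating normally</title>
      <guid>example-2</guid>
      <pubDate>Sat, 15 Nov 2008 10:30:00 GMT</pubDate>
    </item>
    <item>
      <title>Performance issues</title>
      <guid>example-1</guid>
      <pubDate>Sat, 15 Nov 2008 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def new_status_items(feed_xml, seen_guids):
    """Return (pubDate, title) for items not seen before, and remember their guids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid is None or guid in seen_guids:
            continue
        seen_guids.add(guid)
        fresh.append((item.findtext("pubDate", default=""),
                      item.findtext("title", default="")))
    return fresh


if __name__ == "__main__":
    seen = set()
    # First poll: everything in the feed is new.
    for published, title in new_status_items(SAMPLE_FEED, seen):
        print(published, "-", title)
    # Second poll of an unchanged feed: nothing to report.
    print(new_status_items(SAMPLE_FEED, seen))
```

Tracking the `guid` of each item is what keeps the script from alerting twice on the same update. Where the alert goes (email, SMS, chat) is up to you.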

Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy to understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into which customers may be affected by an event, but does little in offering insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesforce offer no insight into how they collect the health information, and none of the three gives any information on what is meant by downtime or performance problems.

Closing questions for each provider:

Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to and why not cover the API's for the other services?
Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
To all: How is the health actually monitored (especially for the GUI-focused Salesforce and Zoho services)? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.

Notable mentions: The following services also offer a health dashboard page, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more than happy to. Just leave a note in the comments.

Friday, November 14, 2008

Google Status Dashboard been brewing since August?

According to MoonWatcher back on August 27th, an email from Google explains:

"We're building a dashboard to provide you with system status information. This dashboard, which we aim to make available in a few months, will enable us to share the following information during an outage:
A description of the problem, with emphasis on user impact. Our belief is during the course of an outage, we should be singularly focused on solving the problem. Solving production problems involves an investigative process that's iterative. Until the problem is solved, we don't have accurate information around root cause, much less corrective action, that will be particularly useful to you. Given this practical reality, we believe that informing you that a problem exists and assuring you that we're working on resolving it is the useful thing to do.
A continuously updated estimated time-to-resolution. Many of you have told us that it's important to let you know when the problem will be solved. Once again, the answer is not always immediately known. In this case, we'll provide regular updates to you as we progress through the troubleshooting process."

Who knew! Glad to see they aren't rushing this out the door and are (hopefully) putting in the time to do it right. Judging by the amount of thought they've already put into this email, I'm already impressed.

Google will have a challenge in presenting their myriad of services in an easy to understand dashboard. Their online services are complex, highly GUI oriented, and distributed in such a way that a small fraction of users could be down while the rest of the world is fine. My three big questions for Google are:

How will you be collecting your uptime and performance data? Can we rely on it to be accurate and unbiased?
What will be considered "downtime" for a complex app like Google Docs or Gmail?
How will you communicate with your users during the downtime event? Will that communication channel be easy to find?

I'm looking forward to seeing what Google comes up with. I have no doubt the release of a Google Health Status Dashboard will not only be huge news for the tech industry, but will be the tipping point that drives all serious SaaS services to offer their own status dashboard. Transparency is so close I can taste it!

Wednesday, November 12, 2008

Zoho opens up their kimono and the blogosphere applauds

Big news from Zoho (provider of online applications and services that compete with Google and Microsoft). As of yesterday, they have reached the next level in uptime transparency by launching Zoho Service Health Status! As Raju Vegesna states in their big announcement, "This initiative is yet another step to being more open and transparent with our users." Kudos to Zoho for recognizing the need and delivering. They join Amazon AWS and SalesForce as the three large service providers offering a very public health dashboard of their services.

I'm really excited to see that the blogosphere gets the significance of this move:

Mashable:

"One frequent problem with web applications and services - and thus the whole web 2.0 phenomenon - is lack of communication when something goes wrong. Sure, it’s nice to have your online e-mail client available from every computer, but what happens when it goes down? Often, it’s just you in the dark, waiting for problems to be resolved, with little or no official information on what’s happening to ease your mind...

This is a great idea. If something goes wrong with any one of Zoho’s applications, you can quickly check out if the problem is on your side or theirs. Of course, I’m sure that the folks at Zoho will continue to inform their customers about problems, updates, downtime and similar issues via blog posts, but being able to see what’s wrong for yourself, at any given time, is an advantage Zoho’s customers will certainly enjoy. All other web startups take notice: this is the level of transparency we’d like to see from everyone, not just Zoho."

WebWorkerDaily:

"After taking a look, I’d say that all applications hosted online could benefit from this level of kimono-opening."

CNET:

"Web application specialist Zoho has joined the growing ranks of companies willing to share detailed information on how well their online services are holding up.
This move toward transparency is increasingly important as potential customers consider relying on such services...
Publishing the performance measurements for online services is catching on as cloud computing grows more serious. Going hand in hand with that is offering service level agreements (SLAs) with specific uptime commitments."

Who's next to open up? I'm looking at you Google!

Sunday, November 2, 2008

Kudos to Microsoft for showing humanity and transparency in a recent Entourage regression bug fix

From Microsoft's Mac Office blog:

"We’ve been working hard for the last week and a half to bring Entourage users today’s 12.1.4 update. It’s incredibly frustrating when we get through a release process and a new issue is introduced by an update. When we start to hear feedback and customer reports about issues with an update, I simply cringe because so much work goes into preventing that from happening. Unfortunately, the recent Office for Mac 2008 12.1.3 update introduced a bug that prevented some Entourage users from sending meeting invites to others. We’re sorry.

However, we also believe it’s better to pair an apology with a solution. With that in mind, with the just released 12.1.4 update, meeting invites will be working as expected once again for all Entourage users.

A lot of work goes into every update release to make sure that we are improving the product's quality. With every update, each and every change that goes into the release is under tight review. There are multi-developer code reviews, focused test passes, and verifications with targeted customers who reported the issue we are targeting to fix. Even then, sometimes things do not work out. With 12.1.3, we did all of the above, yet two cases slipped-through..."

Read the details on the release here. Why don't we see this kind of honesty more often?
