Transparent Uptime: 12/01/2008

Monday, December 22, 2008

Comprehensive review of SaaS SLAs - A sad state of affairs

A recent story about the holes in Google's SLA got me wondering about the state of service level agreements in the SaaS space. The importance of SLAs in the enterprise online world is obvious. I'm sad to report that the state of the union is not good. Of the handful of major SaaS players, most have no SLAs at all. Of those that do, the coverage is extremely loose, and the penalty for missing the SLAs is weak. To make my point, I've put together an exhaustive (yet pointedly short) list of the SLAs that do exist. I've extracted the key elements and removed the legal mumbo-jumbo (for easy consumption). Enjoy!

Comparing the SLAs of the major SaaS players

Google Apps:

What: "web interface will be operational and available for GMail, Google Calendar, Google Talk, Google Docs, and Google Sites"
Uptime guarantee: 99.9%
Time period: any calendar month
Penalty: 3, 7, or 15 days of service at no charge, depending on the monthly uptime percentage
Important caveats:

"Downtime" means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate.
"Downtime Period" means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods.

Amazon S3:

What: Amazon Simple Storage Service
Uptime guarantee: 99.9%
Time period: "any monthly billing cycle"
Penalty: 10-25% of total charges paid by customer for a billing cycle, based on the monthly uptime percentage
Important caveats:

“Error Rate” means: (i) the total number of internal server errors returned by Amazon S3 as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests during that five minute period. We will calculate the Error Rate for each Amazon S3 account as a percentage for each five minute period in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions (as defined below).
“Monthly Uptime Percentage” is calculated by subtracting from 100% the average of the Error Rates from each five minute period in the monthly billing cycle.
"We will apply any Service Credits only against future Amazon S3 payments otherwise due from you"

Amazon EC2:

What: Amazon Elastic Compute Cloud service
Uptime guarantee: 99.95%
Time period: "the preceding 365 days from the date of an SLA claim"
Penalty: "a Service Credit equal to 10% of their bill for the Eligible Credit Period"
Important caveats:

“Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability. Any downtime occurring prior to a successful Service Credit claim cannot be used for future claims. Annual Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below).
“Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.
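For a sense of scale, here is a small Python sketch of the annual calculation as I read it. The names are hypothetical, and the fixed 365-day denominator follows from the rule that days before you started using the service are deemed 100% available.

```python
PERIODS_PER_SERVICE_YEAR = 365 * 24 * 12  # 105,120 five-minute periods


def annual_uptime_percentage(unavailable_periods: int) -> float:
    """100% minus the share of five-minute periods spent "Region Unavailable".

    The denominator is always a full Service Year, even for new customers.
    """
    return 100.0 - 100.0 * unavailable_periods / PERIODS_PER_SERVICE_YEAR


def max_unavailable_periods(guarantee_pct: float = 99.95) -> int:
    """Largest whole number of unavailable periods that still meets the guarantee."""
    return int(PERIODS_PER_SERVICE_YEAR * (100.0 - guarantee_pct) / 100.0)


budget = max_unavailable_periods()
print(budget, "periods, or", budget * 5, "minutes per year")
```

In other words, the region can be unavailable for 52 five-minute periods (4 hours 20 minutes) over the year before the 99.95% guarantee is missed, and only periods in which all of your instances are unreachable and you cannot launch replacements count at all.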

...that's it!

Notable Exceptions (a.k.a. lack of an SLA)

Salesforce.com (are you serious??)
Google App Engine (youth will only be an excuse for so long)
Zoho
Quickbase
OpenDNS
OpenSRS

Conclusions
There's no question that for the enterprise market to get on board with SaaS in any meaningful way, accountability is key. Public health dashboards are one piece of the puzzle. SLAs are the other. The longer we delay in demanding these from our key service providers (I'm looking at you, Salesforce), the longer and more difficult the move into the cloud will end up being. The incentive in the short term for a not-so-major SaaS player should be to take the initiative and focus on building a strong sense of accountability and trust. As it begins to take business away from the more established (and less trustworthy) services, the bar will rise and customers will begin to demand these vital services from all of their providers. The days of weak or nonexistent SLAs for SaaS providers are numbered.

Disclaimer: If I've misrepresented anything above, or if your SaaS service has a strong SLA, please let us know in the comments. I really hope someone out there is working to raise the bar on this sad state.

Wednesday, December 17, 2008

Steve Souders - The State of Performance 2008 (and a look ahead to 2009)

Mr. Web Site Performance, Steve Souders, put together a really nice "review of what happened in 2008 with regard to web performance, and [his] predictions and hopes for what we’ll see in 2009." Check it out at Steve Souders' blog. He references a lot of tools and services you may have missed over the course of the year, and gives some good advice for web developers (and online businesses).

Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer

Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric which appears to test the overall service externally.
I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
Conclusion: Met!

Rule #2: Data must be accurate and timely

Hard to say until an event occurs or we hear feedback about this from users.
The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
In addition, the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or non-transparent. That's bad news, though it does fit with the questions about their SLA that came up recently.
Conclusion: Time will tell (but promising)

Rule #3: Must be easy to find

If I were experiencing a problem with App Engine, I would first go to the homepage here. Unfortunately I don't see any link to the system status page. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
The URL of the system status page (http://code.google.com/status/appengine/) is not easy to remember. Since Google doesn't seem to own appengine.com, this may not be easy to fix, but that doesn't matter to a user who is in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would think that it will rise to #1 very soon.
Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).

Rule #4: Must provide details for events in real time

Again, hard to say until we see an issue occur.
What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
Conclusion: Time will tell.

Rule #5: Provide historical uptime and performance data

Great job with this. I dare say they've jumped ahead of every other cloud service in the amount and detail of performance data they provide.
Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
Conclusion: Met!

Rule #6: Provide a way to be notified of status changes

Nada here, beyond pointing people to the Downtime Notify Google Group.
Conclusion: Not met.

Rule #7: Provide details on how the data is gathered

Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", no real information on how this data is collected, how often it is updated, or where the monitoring happens from.
Conclusion: Not met.

Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a very complete and extremely useful central place for their customers to come to in time of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever-growing need.

Google launches System Status Dashboard for AppEngine

Google has finally launched a health dashboard for their AppEngine service!

From the announcement:

"The new System Status Site provides a detailed view into the performance of various App Engine components using some of the same raw monitoring data that our engineering team uses internally. This includes:
up-to-the-minute overview of our system status with real-time, unedited data
daily overall serving status for each of our APIs, including any outages or downtime
detailed historical latency and error-rate graphs for the App Engine Datastore, Images, Mail, Memcache, Serving, URL Fetch, and Users components

In addition to the Downtime Notify Google Group, we'll use this dashboard to announce scheduled downtime and explain any issues that affect App Engine applications. You'll be able to see real data behind any issues that we experience along with explanations from our team.

We'll continue to tune this dashboard to make sure we're providing useful and accurate information about App Engine's uptime."

My 10 second first impression is that overall they did a great job, especially the details you can get when drilling down on a specific service and day (clicking on a checkmark). Time will tell how many of the rules for a successful dashboard they meet. I plan to dive a little deeper in the next day or two, but for now...kudos to Google for making this a reality!

Sunday, December 14, 2008

Visionaries

The 100 oldest registered .com domains: http://www.iwhois.com/oldest/

Monday, December 1, 2008

7 keys to a successful public health dashboard

Let's first define what makes an online health dashboard "successful", and in the process explain why you (as a SaaS provider) should have one:

Your support costs go down as your users are able to self-identify system wide problems without calling or emailing your support department. Users will no longer have to guess whether their issues are local or global, and can more quickly get to the root of the problem before complaining to you.
You are better able to communicate with your users during downtime events, taking advantage of the broadcast nature of the Internet versus the one-to-one nature of email and the phone. You spend less time communicating the same thing over and over and more time resolving the issue.
You create a single and obvious place for your users to come to when they are experiencing downtime. You save your users' time currently spent searching forums, Twitter, or your blog.
Trust is the cornerstone of successful SaaS adoption. Your customers are betting their business and their livelihoods on your service or platform. Both current and prospective customers require confidence in your service. Both need to know they won't be left in the dark, alone and uninformed, when you run into trouble. Real time insight into unexpected events is the best way to build this trust. Keeping them in the dark and alone is no longer an option.
It's only a matter of time before every serious SaaS provider will be offering a public health dashboard. Your users will demand it.

With that out of the way, let's move on to detailing what exactly it takes to create a successful public health dashboard. Generally I would suggest looking to your users to tell you what they need. I still strongly recommend you do this, especially if your users are technically savvy. However, as this industry is still so young, and most companies are still unsure of what their users will demand, I humbly submit my 7 rules for public health dashboard success:

The Rules

First things first

Before we get into the rules, I'd like to mention a few public "system status" pages that don't quite meet the label of "health dashboard" but do give us a starting point for providing public health information. There's no reason any SaaS provider today should not be offering at least a basic chronological list of potential issues, downtime events, and resolution details similar to one of the following: craigslist system status, 37signals System Status, Twitter Status, GitHub Status, Mosso System Status. Now...on to the rules for creating a successful online public health dashboard!

The Basics

Rule #1: Must show the current status for each "service" you offer

A status light or short description that visitors can use to quickly identify how the service(s) they are interested in are doing right now. Example #1, Example #2.
Most health dashboards do this well. Keep it simple. Skype's Heartbeat tries to be clever, but I fear that the first impression that the big thumping red hearts give visitors is that something is wrong.
Don't forget to identify what the status icons and messages actually mean. Example #1 (legend at bottom right), Example #2 (descriptions below the table at bottom). Bad Example (no information on what the possible states are).
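As a rough illustration of Rule #1, here is a minimal Python sketch of a per-service status table that always ships with its legend. The service names, the three states, and their descriptions are all hypothetical.

```python
from enum import Enum


class Status(Enum):
    # The value of each state doubles as its legend text.
    OK = "All checks passing"
    DEGRADED = "Reachable, but slower or returning more errors than normal"
    DISRUPTED = "Unavailable for some or all users"


# Hypothetical services and states, for illustration only.
current_status = {
    "API": Status.OK,
    "Web interface": Status.DEGRADED,
    "Email delivery": Status.OK,
}


def render_dashboard(statuses):
    """Return plain-text lines: one per service, followed by a legend."""
    lines = [f"{name}: {status.name}" for name, status in sorted(statuses.items())]
    lines.append("")
    lines.append("Legend:")
    lines.extend(f"  {status.name} = {status.value}" for status in Status)
    return lines


print("\n".join(render_dashboard(current_status)))
```

Deriving the legend from the same definition as the states means the two cannot drift apart, which is one simple way to avoid the "Bad Example" above.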

Rule #2: Data must be accurate and timely

This should go without saying, but some comments in an online forum show us that it isn't as obvious as it should be.
The data should be based on real time monitoring, not manual updates that require a human.
The entire benefit is lost if your users cannot trust this data, or it arrives too late.

Rule #3: Must be easy to find

It is worthless to provide a public health dashboard if your users are unaware it exists, or are unable to find it in time of need.
Anticipate where your users go when they experience downtime, and create a clear path to the status page. Ideally there will be a link from your home page, and at the minimum from your main support page. Example #1 (footer of each page), Example #2 (top right of every page). Many examples of support page links.
Also consider making the URL as easy to remember as possible. "status.yourdomain.com" or "yourdomain.com/status" seem to be the preferred method.

Rule #4: Must provide details for events in real time

You must go beyond simply noting that something is wrong. You must provide insight into what is going on, what services are affected, and if possible an ETA on resolution. Users will be OK with a big red light for only so long.
This can be as simple as a timestamped message noting that you are investigating, with regular updates about the investigation and projected resolution times.
The key here is to keep your users from having to contact your support department, defeating much of the gain in having a public health dashboard. Use this as an opportunity to build a trust relationship with your customer by being transparent throughout the process.
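A timestamped event log does not need to be complicated. The Python sketch below is one possible shape for it. The class and field names are hypothetical, and a real dashboard would persist and publish these updates rather than hold them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Update:
    posted_at: datetime
    message: str
    eta: Optional[datetime] = None  # projected resolution time, if known


@dataclass
class Incident:
    affected_services: List[str]
    updates: List[Update] = field(default_factory=list)

    def post(self, message: str, eta: Optional[datetime] = None,
             now: Optional[datetime] = None) -> None:
        """Append a timestamped update; `now` can be injected for testing."""
        posted_at = now or datetime.now(timezone.utc)
        self.updates.append(Update(posted_at, message, eta))

    def latest(self) -> Optional[Update]:
        """Return the most recent update, or None if nothing was posted yet."""
        return self.updates[-1] if self.updates else None


incident = Incident(affected_services=["Email delivery"])
incident.post("Investigating delayed email delivery.")
incident.post(
    "Cause identified, fix rolling out.",
    eta=datetime(2008, 12, 1, 9, 30, tzinfo=timezone.utc),
)
print(incident.latest().message)
```

The first update carries no ETA because none is known yet. Saying "we are investigating" right away is still better than a silent red light.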

Beyond the basics

Rule #5: Provide historical uptime and performance data

Make sure to provide root cause analysis for each downtime event. The more detail the better. Example #1, Example #2 (click on any event in the past).
This will be important to your prospects as they evaluate your transparency. Don't be afraid of problems you've had in the past. Owning up to problems strengthens trust, which should be one of your main goals.
This will be important to your customers as they do post mortem analysis for their superiors.
Provide at least one week of historical data, ideally at least a month. Example #1, Example #2, Example #3, Example #4 (notice each service has an archive link).

Rule #6: Provide a way to be notified of status changes

RSS/email/SMS/Twitter/APIs/etc. It's too early to know how users will want to consume this information, but my opinion is that the two most useful options would be to allow email alerts on downtime, and an API that allows users to build their applications to work around the downtime automatically.
Currently many status dashboards provide RSS feeds. Example #1 (even provides email and Twitter alerts!), Example #2.
Along these same lines, providing advance notice of upcoming maintenance windows is extremely useful. I would hope these are announced in other mediums as well (e.g. email).
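To make the "work around the downtime automatically" idea concrete, here is a hedged Python sketch of a client consuming a status feed. The JSON layout, the status strings, and the backend names are all invented for illustration. No provider I know of publishes exactly this format.

```python
import json


def services_to_avoid(payload: str) -> set:
    """Return the names of services the feed reports as anything other than "ok"."""
    feed = json.loads(payload)
    return {s["name"] for s in feed["services"] if s["status"] != "ok"}


def pick_backend(preferred: str, fallback: str, payload: str) -> str:
    """Use the fallback whenever the preferred service is reported unhealthy."""
    return fallback if preferred in services_to_avoid(payload) else preferred


# Hypothetical payload, as a status API might return it.
example_payload = json.dumps({
    "services": [
        {"name": "storage", "status": "ok"},
        {"name": "queue", "status": "disrupted"},
    ]
})

print(pick_backend("queue", "local-queue", example_payload))
```

Fetching the payload is deliberately left out. The point is only that a machine-readable status lets an application make this decision without a human reading a web page first.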

Rule #7: Provide details on how the data is gathered

What is the uptime and performance data worth if we have no insight into where it comes from? Currently with most health dashboards we have to assume either the provider built their own monitoring platform, or that they are making status updates manually (Zoho is the one exception, since they have their own monitoring service).
Beyond simply knowing where the data comes from, what exactly does "Performance issues" mean? What are the thresholds that determine that a service is considered to be "disrupted"? From what locations is the monitoring done? Am I out of luck if I live in Asia and the monitoring is done from New York?
It would be extremely useful to have the data validated by a third party, especially as this gets into the world of SLAs. We can't have the fox watching the hen house when it comes to money.
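One way to answer the "what does it mean" question is to publish the thresholds themselves. The Python sketch below shows the idea. Every number, state name, and probe location in it is made up for illustration.

```python
# Published alongside the dashboard so users can see how a state is decided.
# Assumed values: (max error rate in percent, max median latency in ms).
THRESHOLDS = {
    "ok": (1.0, 500),
    "performance issues": (5.0, 2000),
}

# Assumed monitoring locations; a real page would list its actual probes.
PROBE_LOCATIONS = ["New York", "London", "Singapore"]


def classify(error_rate_pct: float, latency_ms: float) -> str:
    """Map one measurement to a published state, checking the strictest first."""
    for status in ("ok", "performance issues"):
        max_error_rate, max_latency = THRESHOLDS[status]
        if error_rate_pct <= max_error_rate and latency_ms <= max_latency:
            return status
    return "disrupted"


print(classify(error_rate_pct=3.0, latency_ms=300))
```

With the table public, a user in Asia can at least tell whether "ok" was measured anywhere near them, and a third party can check the same numbers independently.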

The future

For those seeking to truly be ahead of the curve and open up the kimono further, I suggest the following rules as well:

Provide geographical uptime and performance data (Zoho is ahead of the game on this one). The more information you provide to your users publicly, the fewer questions you'll have to deal with privately.
The status page should be hosted externally, at a different location from your primary data center. This should be obvious, but I doubt many companies consider this a problem. The last thing you want when your primary data center goes down is to have to field calls that could be avoided if your status page were still up.
Break out each individual service and function as much as possible. Similar to how Flickr opened up their API to match up with practically every internal function call, allow your users to have insight into the very specific functionality they need.
Connect your downtime events to your SLAs. Allow your users to easily track how you're doing compared to what you promised. The days of hoping that your users forget about this are over.
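Tying downtime events to the SLA is mostly arithmetic. Here is a minimal Python sketch, assuming outages are recorded as non-overlapping (start, end) pairs and that the promise is a simple uptime percentage over a calendar month. The function names and the 99.9% figure are placeholders.

```python
from datetime import datetime


def uptime_percentage(outages, window_start, window_end):
    """Percentage of the window not covered by outages.

    outages is a list of (start, end) datetimes, assumed not to overlap.
    Outages are clipped to the window before being counted.
    """
    window_seconds = (window_end - window_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (1 - down_seconds / window_seconds)


def meets_sla(outages, window_start, window_end, promised_pct=99.9):
    """True if measured uptime is at or above the promised percentage."""
    return uptime_percentage(outages, window_start, window_end) >= promised_pct


december = (datetime(2008, 12, 1), datetime(2009, 1, 1))
outages = [(datetime(2008, 12, 10, 3, 0), datetime(2008, 12, 10, 3, 45))]
print(round(uptime_percentage(outages, *december), 3), meets_sla(outages, *december))
```

A 31-day month has 44,640 minutes, so a 99.9% promise leaves about 44.6 minutes of slack. A single 45-minute outage is enough to miss it, and a dashboard that shows this running total saves customers from doing the math themselves.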

I hope the above advice provides value to companies out there considering their own health dashboards. I would love to hear from SaaS providers already providing health dashboards, especially those I haven't already linked to. I'd especially love to hear some feedback from companies on the benefits they've seen in providing a health dashboard, in customer feedback, reduced costs, or competitive advantages.

I'm excited to see over the next few months how things change in this space, what rules become most important to users, and how online service providers respond to the oncoming demand for transparency. Time will tell, but I do know this is only the beginning.

For reference, some of the public health dashboards I referenced in this post:
