Monday, December 22, 2008

Comprehensive review of SaaS SLAs - A sad state of affairs

A recent story about the holes in Google's SLA got me wondering about the state of service level agreements in the SaaS space. The importance of SLAs in the enterprise online world is obvious. I'm sad to report that the state of the union is not good. Of the handful of major SaaS players, most have no SLAs at all. Of those that do, the coverage is extremely loose, and the penalty for missing the SLAs is weak. To make my point, I've put together an exhaustive (yet pointedly short) list of the SLAs that do exist. I've extracted the key elements and removed the legal mumbo-jumbo (for easy consumption). Enjoy!

Comparing the SLAs of the major SaaS players

Google Apps:
  • What: "web interface will be operational and available for GMail, Google Calendar, Google Talk, Google Docs, and Google Sites"
  • Uptime guarantee: 99.9%
  • Time period: any calendar month
  • Penalty: 3, 7, or 15 days of service at no charge, depending on the monthly uptime percentage
  • Important caveats:
  1. "Downtime" means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate.
  2. "Downtime Period" means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods.
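To see how much room the two Google caveats leave, here is a rough Python sketch of the counting rule as I read it. The per-minute error-rate list, the helper `month_with_outages`, and the choice to count a qualifying run of N bad minutes as N minutes of downtime are my own assumptions, not wording from the SLA.

```python
from typing import List

ERROR_RATE_THRESHOLD = 0.05  # "more than a five percent user error rate"
MIN_PERIOD_MINUTES = 10      # "ten consecutive minutes of Downtime"


def counted_downtime_minutes(error_rates: List[float]) -> int:
    """Return the minutes that count toward Downtime Periods.

    error_rates holds one server-side error rate (0.0 to 1.0) per minute.
    Assumption: a run of N >= 10 consecutive bad minutes counts as N minutes,
    and any shorter run counts as zero.
    """
    counted = 0
    run = 0
    for rate in error_rates:
        if rate > ERROR_RATE_THRESHOLD:
            run += 1
        else:
            if run >= MIN_PERIOD_MINUTES:
                counted += run
            run = 0
    if run >= MIN_PERIOD_MINUTES:  # a run that lasts until the end of the month
        counted += run
    return counted


def monthly_uptime_percentage(error_rates: List[float]) -> float:
    """Uptime for the month, using only the downtime the rule above counts."""
    total = len(error_rates)
    return 100.0 * (total - counted_downtime_minutes(error_rates)) / total


def month_with_outages(outage_lengths: List[int],
                       total_minutes: int = 30 * 24 * 60) -> List[float]:
    """Hypothetical helper: a 30-day month with total outages an hour apart."""
    minutes = [0.0] * total_minutes
    position = 0
    for length in outage_lengths:
        for i in range(position, position + length):
            minutes[i] = 1.0  # every request fails during the outage
        position += length + 60
    return minutes
```

Under these assumptions, `month_with_outages([9] * 5)` (five separate nine-minute outages, 45 minutes in total) still reports 100% uptime, while a single 45-minute outage drops the month just below 99.9%.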
Amazon S3:
  • What: Amazon Simple Storage Service
  • Uptime guarantee: 99.9%
  • Time period: "any monthly billing cycle"
  • Penalty: 10-25% of total charges paid by customer for a billing cycle, based on the monthly uptime percentage
  • Important caveats:
  1. “Error Rate” means: (i) the total number of internal server errors returned by Amazon S3 as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests during that five minute period. We will calculate the Error Rate for each Amazon S3 account as a percentage for each five minute period in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions (as defined below).
  2. “Monthly Uptime Percentage” is calculated by subtracting from 100% the average of the Error Rates from each five minute period in the monthly billing cycle.
  3. "We will apply any Service Credits only against future Amazon S3 payments otherwise due from you"
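The S3 formula is simple enough to sketch in a few lines of Python. This is only my reading of the two quoted definitions: the `(errors, requests)` tuples, the `month` helper, and the treatment of a period with no requests as a 0% error rate are assumptions for illustration.

```python
from typing import List, Tuple

PERIODS_PER_MONTH = 30 * 24 * 60 // 5  # 8640 five-minute periods in 30 days


def error_rate(errors: int, requests: int) -> float:
    """Error Rate for one five-minute period, as a percentage."""
    if requests == 0:
        return 0.0  # assumption: a period with no requests counts as 0% errors
    return 100.0 * errors / requests


def s3_monthly_uptime_percentage(periods: List[Tuple[int, int]]) -> float:
    """100% minus the average of the per-period Error Rates."""
    rates = [error_rate(errors, requests) for errors, requests in periods]
    return 100.0 - sum(rates) / len(rates)


def month(outage_periods: int, failed_per_period: int,
          requests_per_period: int = 1000) -> List[Tuple[int, int]]:
    """Hypothetical helper: a 30-day month with some bad five-minute periods."""
    bad = [(failed_per_period, requests_per_period)] * outage_periods
    good = [(0, requests_per_period)] * (PERIODS_PER_MONTH - outage_periods)
    return bad + good
```

With these numbers, an hour in which every request fails (`month(12, 1000)`) works out to about 99.86%, which misses the guarantee. An hour in which half of all requests fail (`month(12, 500)`) works out to about 99.93%, which does not.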
Amazon EC2:
  • What: Amazon Elastic Compute Cloud service
  • Uptime guarantee: 99.95%
  • Time period: "the preceding 365 days from the date of an SLA claim"
  • Penalty: "a Service Credit equal to 10% of their bill for the Eligible Credit Period"
  • Important caveats:
  1. “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability. Any downtime occurring prior to a successful Service Credit claim cannot be used for future claims. Annual Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below).
  2. “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.
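To put the EC2 number in perspective, here is a small Python sketch of the Annual Uptime Percentage as described in caveat 1. The function names are mine, and the sketch ignores the SLA Exclusions entirely.

```python
PERIOD_MINUTES = 5
PERIODS_PER_YEAR = 365 * 24 * 60 // PERIOD_MINUTES  # 105120 periods


def annual_uptime_percentage(unavailable_periods: int) -> float:
    """100% minus the share of five-minute periods spent "Region Unavailable".

    The denominator is always a full 365-day Service Year, because days
    before you started using the service are deemed 100% available.
    """
    return 100.0 - 100.0 * unavailable_periods / PERIODS_PER_YEAR


def allowed_unavailable_minutes(guarantee_percent: float) -> float:
    """Minutes per Service Year that can be unavailable within the guarantee."""
    allowed_fraction = (100.0 - guarantee_percent) / 100.0
    return allowed_fraction * PERIODS_PER_YEAR * PERIOD_MINUTES
```

By this arithmetic, 99.95% leaves room for roughly 262.8 minutes (about 4.4 hours) of "Region Unavailable" per Service Year before any credit is owed, and that only counts periods in which all of your instances were unreachable and you could not launch replacements.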
...that's it!

Notable Exceptions (a.k.a. lack of an SLA)
  • Salesforce.com (are you serious??)
  • Google App Engine (youth will only be an excuse for so long)
  • Zoho
  • Quickbase
  • OpenDNS
  • OpenSRS
Conclusions
There's no question that for the enterprise market to get on board with SaaS in any meaningful way, accountability is key. Public health dashboards are one piece of the puzzle. SLAs are the other. The longer we delay in demanding these from our key service providers (I'm looking at you, Salesforce), the longer and more difficult the move into the cloud will end up being. The incentive in the short term for a not-so-major SaaS player should be to take the initiative and focus on building a strong sense of accountability and trust. As it begins to take business away from the more established (and less trustworthy) services, the bar will rise and customers will begin to demand these vital services from all of their providers. The days of weak or non-existent SLAs for SaaS providers are numbered.

Disclaimer: If I've misrepresented anything above, or if your SaaS service has a strong SLA, please let us know in the comments. I really hope someone out there is working to raise the bar on this sad state.

Wednesday, December 17, 2008

Steve Souders - The State of Performance 2008 (and a look ahead to 2009)

Mr. Web Site Performance, Steve Souders, put together a really nice "review of what happened in 2008 with regard to web performance, and [his] predictions and hopes for what we’ll see in 2009." Check it out at Steve Souders' blog. He references a lot of tools and services you may have missed over the course of the year, and gives some good advice for web developers (and online businesses).

Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer
  • Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric which appears to test the overall service externally.
  • I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
  • Conclusion: Met!
Rule #2: Data must be accurate and timely
  • Hard to say until an event occurs or we hear feedback about this from users.
  • The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
  • The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
  • In addition, the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or non-transparent. That's bad news, though it does fit with the questions about their SLA that came up recently.
  • Conclusion: Time will tell (but promising)
Rule #3: Must be easy to find
  • If I were experiencing a problem with App Engine, I would first go to the homepage here. Unfortunately I don't see any link to the system status page. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
  • The URL of the system status page (http://code.google.com/status/appengine/) is not easy to remember. Since Google doesn't seem to own appengine.com, this may not be easy to fix, but that doesn't matter to a user who's in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would think that it will rise to #1 very soon.
  • Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).
Rule #4: Must provide details for events in real time
  • Again, hard to say until we see an issue occur.
  • What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
  • Conclusion: Time will tell.
Rule #5: Provide historical uptime and performance data
  • Great job with this. I dare say they've jumped ahead of every other cloud service in the amount and detail of performance data they provide.
  • Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
  • Conclusion: Met!
Rule #6: Provide a way to be notified of status changes
Rule #7: Provide details on how the data is gathered
  • Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", no real information on how this data is collected, how often it is updated, or where the monitoring happens from.
  • Conclusion: Not met.
Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a very complete and extremely useful central place for their customers to come in time of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever growing need.

Google launches System Status Dashboard for AppEngine

Google has finally launched a health dashboard for their AppEngine service!



From the announcement:
"The new System Status Site provides a detailed view into the performance of various App Engine components using some of the same raw monitoring data that our engineering team uses internally. This includes:
  • up-to-the-minute overview of our system status with real-time, unedited data
  • daily overall serving status for each of our APIs, including any outages or downtime
  • detailed historical latency and error-rate graphs for the App Engine Datastore, Images, Mail, Memcache, Serving, URL Fetch, and Users components

In addition to the Downtime Notify Google Group, we'll use this dashboard to announce scheduled downtime and explain any issues that affect App Engine applications. You'll be able to see real data behind any issues that we experience along with explanations from our team.

We'll continue to tune this dashboard to make sure we're providing useful and accurate information about App Engine's uptime."

My 10 second first impression is that overall they did a great job, especially the details you can get when drilling down on a specific service and day (clicking on a checkmark). Time will tell how many of the rules of successful dashboards they meet. I plan to dive a little deeper in the next day or two, but for now...kudos to Google for making this a reality!

Sunday, December 14, 2008

Visionaries

The 100 oldest registered .com domains: http://www.iwhois.com/oldest/

Monday, December 1, 2008

7 keys to a successful public health dashboard

Let's first define what makes an online health dashboard "successful", and in the process explain why you (as a SaaS provider) should have one:
  1. Your support costs go down as your users are able to self-identify system wide problems without calling or emailing your support department. Users will no longer have to guess whether their issues are local or global, and can more quickly get to the root of the problem before complaining to you.
  2. You are better able to communicate with your users during downtime events, taking advantage of the broadcast nature of the Internet versus the one-to-one nature of email and the phone. You spend less time communicating the same thing over and over and more time resolving the issue.
  3. You create a single and obvious place for your users to come to when they are experiencing downtime. You save your users' time currently spent searching forums, Twitter, or your blog.
  4. Trust is the cornerstone of successful SaaS adoption. Your customers are betting their business and their livelihoods on your service or platform. Both current and prospective customers require confidence in your service. Both need to know they won't be left in the dark, alone and uninformed, when you run into trouble. Real time insight into unexpected events is the best way to build this trust. Keeping them in the dark and alone is no longer an option.
  5. It's only a matter of time before every serious SaaS provider will be offering a public health dashboard. Your users will demand it.
With that out of the way, let's move on to detailing what exactly it takes to create a successful public health dashboard. Generally I would suggest looking to your users to tell you what they need. I still strongly recommend you do this, especially if your users are technically savvy. However, as this industry is still so young, and most companies are still unsure of what their users will demand, I humbly submit my 7 rules for public health dashboard success:

The Rules

First things first

Before we get into the rules, I'd like to mention a few public "system status" pages that don't quite meet the label of "health dashboard" but do give us a starting point for providing public health information. There's no reason any SaaS provider today should not be offering at least a basic chronological list of potential issues, downtime events, and resolution details similar to one of the following: craigslist system status, 37signals System Status, Twitter Status, GitHub Status, Mosso System Status. Now...on to the rules for creating a successful online public health dashboard!

The Basics

Rule #1: Must show the current status for each "service" you offer
  • A status light or short description that visitors can use to quickly identify how the service(s) they are interested in are doing right now. Example #1, Example #2.
  • Most health dashboards do this well. Keep it simple. Skype's Heartbeat tries to be clever, but I fear that the first impression that the big thumping red hearts give visitors is that something is wrong.
  • Don't forget to identify what the status icons and messages actually mean. Example #1 (legend at bottom right), Example #2 (descriptions below the table at bottom). Bad Example (no information on what the possible states are).
Rule #2: Data must be accurate and timely
  • This should go without saying, but some comments in an online forum show us that it isn't as obvious as it should be.
  • The data should be based on real time monitoring, not manual updates that require a human.
  • The entire benefit is lost if your users cannot trust this data, or it arrives too late.
Rule #3: Must be easy to find
  • It is worthless to provide a public health dashboard if your users are unaware it exists, or are unable to find it in time of need.
  • Anticipate where your users go when they experience downtime, and create a clear path to the status page. Ideally there will be a link from your home page, and at the minimum from your main support page. Example #1 (footer of each page), Example #2 (top right of every page). Many examples of support page links.
  • Also consider making the URL as easy to remember as possible. "status.yourdomain.com" or "yourdomain.com/status" seem to be the preferred method.
Rule #4: Must provide details for events in real time
  • You must go beyond simply noting that something is wrong. You must provide insight into what is going on, what services are affected, and if possible an ETA on resolution. Users will be OK with a big red light for only so long.
  • This can be as simple as a timestamped message noting that you are investigating, with regular updates about the investigation and projected resolution times.
  • The key here is to keep your users from having to contact your support department, defeating much of the gain in having a public health dashboard. Use this as an opportunity to build a trust relationship with your customer by being transparent throughout the process.
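The four basic rules boil down to a small data model: a current status per service, plus a running log of timestamped updates. Here is a minimal Python sketch of that idea. The class and field names are hypothetical, and the status labels are borrowed from the AWS dashboard wording rather than from any standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OK = "Service is operating normally"
    DEGRADED = "Performance issues"
    DOWN = "Service disruption"


@dataclass
class Update:
    timestamp: datetime
    message: str
    eta: Optional[datetime] = None  # projected resolution time, if known


@dataclass
class Service:
    name: str
    status: Status = Status.OK
    updates: List[Update] = field(default_factory=list)

    def report(self, status: Status, message: str, when: datetime,
               eta: Optional[datetime] = None) -> None:
        """Record a timestamped update and set the current status (Rule #4)."""
        self.status = status
        self.updates.append(Update(when, message, eta))

    def summary(self) -> str:
        """One line a visitor can scan: current status plus latest update."""
        line = f"{self.name}: {self.status.value}"
        if self.updates:
            latest = self.updates[-1]
            line += f" ({latest.timestamp:%Y-%m-%d %H:%M} UTC: {latest.message})"
        return line


mail = Service("Mail")
mail.report(Status.DOWN, "Investigating elevated error rates",
            datetime(2008, 12, 1, 14, 5, tzinfo=timezone.utc))
print(mail.summary())
```

In a real dashboard the `report` calls would be driven by monitoring rather than typed in by a human (Rule #2), but the shape of the data would probably look similar.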
Beyond the basics

Rule #5: Provide historical uptime and performance data
  • Make sure to provide root cause analysis for each downtime event. The more detail the better. Example #1, Example #2 (click on any event in the past).
  • This will be important to your prospects as they evaluate your transparency. Don't be afraid of problems you've had in the past. Owning up to problems strengthens trust, which should be one of your main goals.
  • This will be important to your customers as they do post mortem analysis for their superiors.
  • Provide at least one week of historical data, ideally at least a month. Example #1, Example #2, Example #3, Example #4 (notice each service has an archive link).
Rule #6: Provide a way to be notified of status changes
  • RSS/email/SMS/Twitter/APIs/etc. It's too early to know how users will want to consume this information, but my opinion is that the two most useful options would be to allow email alerts on downtime, and an API that allows users to build their applications to work around the downtime automatically.
  • Currently many status dashboards provide RSS feeds. Example #1 (even provides email and Twitter alerts!), Example #2.
  • Along these same lines, providing advanced notice of upcoming maintenance windows is extremely useful. I would hope these are announced in other mediums as well (e.g. email).
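To make the "work around the downtime automatically" idea concrete, here is a rough Python sketch of what a client might do with a machine-readable status feed. No provider I know of publishes this exact format. The JSON layout, the status strings, and the service names are all made up for illustration.

```python
import json
from typing import Dict

HEALTHY_STATUSES = {"ok"}  # hypothetical status values in a hypothetical feed


def parse_status_feed(payload: str) -> Dict[str, str]:
    """Turn a (made-up) JSON status feed into a {service: status} mapping."""
    document = json.loads(payload)
    return {entry["service"]: entry["status"] for entry in document["services"]}


def choose_backend(statuses: Dict[str, str], primary: str, fallback: str) -> str:
    """Use the primary service only when the feed says it is healthy.

    A service that is missing from the feed is treated as unhealthy.
    """
    if statuses.get(primary) in HEALTHY_STATUSES:
        return primary
    return fallback


sample_feed = """
{"services": [
    {"service": "storage", "status": "ok"},
    {"service": "queue", "status": "disruption"}
]}
"""
statuses = parse_status_feed(sample_feed)
print(choose_backend(statuses, "queue", "local-queue"))
```

An application would fetch the feed on a timer and call something like `choose_backend` before each batch of work, so that a provider outage degrades the application instead of taking it down.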
Rule #7: Provide details on how the data is gathered
  • What is the uptime and performance data worth if we have no insight into where it comes from? Currently with most health dashboards we have to assume either the provider built their own monitoring platform, or that they are making status updates manually (Zoho is the one exception, since they have their own monitoring service).
  • Beyond simply knowing where the data comes from, what exactly does "Performance issues" mean? What are the thresholds that determine that a service is considered to be "disrupted"? From what locations is the monitoring done? Am I out of luck if I live in Asia and the monitoring is done from New York?
  • It would be extremely useful to have the data validated by a third party, especially as this gets into the world of SLAs. We can't have the fox watching the hen house when it comes to money.
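Publishing the thresholds does not have to be complicated. Here is a minimal Python sketch of status rules that are explicit and evaluated per monitoring location. The two threshold values and the location names are invented for illustration, and the labels are borrowed from existing dashboards.

```python
from typing import Dict, List, Optional

# Hypothetical, published thresholds. The point is that they are written down.
SLOW_MS = 2000.0                   # mean response time above this is "slow"
DISRUPTION_FAILURE_FRACTION = 0.5  # this share of failed checks is "disrupted"


def classify(samples: List[Optional[float]]) -> str:
    """Classify one location's recent checks.

    Each sample is a response time in milliseconds, or None for a failed check.
    """
    if not samples:
        return "Status not available"
    failures = sum(1 for sample in samples if sample is None)
    if failures / len(samples) >= DISRUPTION_FAILURE_FRACTION:
        return "Service disruption"
    successes = [sample for sample in samples if sample is not None]
    mean_ms = sum(successes) / len(successes)
    if failures > 0 or mean_ms > SLOW_MS:
        return "Performance issues"
    return "Service is operating normally"


def classify_by_location(
    samples_by_location: Dict[str, List[Optional[float]]]
) -> Dict[str, str]:
    """Report a separate status for every monitoring location."""
    return {location: classify(samples)
            for location, samples in samples_by_location.items()}


print(classify_by_location({
    "New York": [120.0, 150.0, 130.0, 140.0],
    "Singapore": [None, None, 900.0, None],
}))
```

With rules like these in the open, a user in Asia can see both that their region is being measured and what a red light there actually means.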
The future

For those seeking to truly be ahead of the curve and open up the kimono further, I suggest the following rules as well:
  • Provide geographical uptime and performance data (Zoho is ahead of the game on this one). The more information you provide to your users publicly, the fewer questions you'll have to deal with privately.
  • The status page should be hosted externally, at a different location from your primary data center. This should be obvious, but I doubt many companies consider this a problem. The last thing you want when your primary data center goes down is to have to field calls that could be avoided if your status page was still up.
  • Break out each individual service and function as much as possible. Similar to how Flickr opened up their API to match up with practically every internal function call, allow your users to have insight into the very specific functionality they need.
  • Connect your downtime events to your SLAs. Allow your users to easily track how you're doing compared to what you promised. The days of hoping that your users forget about this are over.
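Connecting downtime events to an SLA is mostly bookkeeping. Here is a rough Python sketch, assuming the dashboard already records a start and end time for every event and that events do not overlap. The names and the 99.9% promise are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Union


@dataclass
class DowntimeEvent:
    start: datetime
    end: datetime


def uptime_percentage(events: List[DowntimeEvent],
                      period_start: datetime, period_end: datetime) -> float:
    """Uptime over the period, counting only the part of each event inside it.

    Assumption: events do not overlap each other.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for event in events:
        overlap_start = max(event.start, period_start)
        overlap_end = min(event.end, period_end)
        if overlap_end > overlap_start:
            down_seconds += (overlap_end - overlap_start).total_seconds()
    return 100.0 * (period_seconds - down_seconds) / period_seconds


def sla_report(events: List[DowntimeEvent], period_start: datetime,
               period_end: datetime,
               promised: float) -> Dict[str, Union[float, bool]]:
    """Compare what was delivered with what was promised."""
    actual = uptime_percentage(events, period_start, period_end)
    return {"promised": promised, "actual": actual, "met": actual >= promised}


december = (datetime(2008, 12, 1), datetime(2009, 1, 1))
outage = DowntimeEvent(datetime(2008, 12, 10, 3, 0), datetime(2008, 12, 10, 4, 30))
print(sla_report([outage], december[0], december[1], promised=99.9))
```

A single 90-minute outage in a 31-day month comes out to roughly 99.80%, so a 99.9% promise would show as missed right on the status page.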
I hope the above advice provides value to companies out there considering their own health dashboards. I would love to hear from SaaS providers already providing health dashboards, especially those I haven't already linked to. I'd especially love to hear some feedback from companies on the benefits they've seen in providing a health dashboard, in customer feedback, reduced costs, or competitive advantages.

I'm excited to see over the next few months how things change in this space, what rules become most important to users, and how online service providers respond to the oncoming demand for transparency. Time will tell, but I do know this is only the beginning.

For reference, some of the public health dashboards I referenced in this post:

Sunday, November 23, 2008

Transparency case study, courtesy of ylastic

ylastic (a company that provides tools to help manage AWS services) kept their users in the loop during an outage by communicating status updates over Twitter:


You can find the entire set of updates at ylastic's twitter page.

I keep coming back to the same question. Do your users know where to go during a downtime event? ylastic has their web site, their blog, their forums, and their twitter feed. As a user, how do I know where to look when I'm having a problem and want to know what's going on with the service (which is generally an emergency)? As the company, how do I keep users from clogging my support email box in spite of my efforts to get status updates out to the world? In this case it looks like the only place that had any information was the twitter feed. If users weren't aware it existed, both sides would be out of luck.

What every SaaS service needs is a clear central place, that their users can easily find, that provides real time updates on downtime or performance events. It's great that you're willing to communicate during the event, but if no one can find those updates, what's the point? Don't get me started on falling trees.

On another note, kudos to ylastic for their transparency on the following fronts:
  • Providing insight into their product roadmap. Very much what SaaS providers must do to build the trust relationship with their users (which is critical to the success of any online hosted application).
  • Their upcoming iPhone app that among other things gives you the AWS Service Health status on the go.
  • Simply giving status updates on Twitter.

Tuesday, November 18, 2008

Google offers up new SLAs for paying customers

Two noteworthy items from the news of Google announcing new SLAs for paying customers:
  1. Google measures uptime as "average uptime per user based on server-side error rates".
  2. They claim 99.9% uptime for the last year, in spite of the outage in August.
Does anyone else see a problem here? It blows my mind that not only do we have to rely on Google to tell us what their uptime was, but we allow them to tell us what uptime means. To me, downtime could include Google's services being completely down and not in a position to return any kind of server-side error. To Google, that doesn't count. And even if it did, do we have to rely on twitter and public outrage to keep Google honest when downtime does occur?

Two simple things we need to fight for in order to keep SLAs and their associated advantages worth something:
  1. Define uptime based on any issue under the control of the provider that keeps you from using the service at any time (planned or unplanned).
  2. Demand third party validation of the SLA results.
This same request applies to AWS, Salesforce, and any other SaaS service with SLAs (which should be everyone!).
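Here is a small Python sketch of why the definition matters so much. It compares uptime computed from server-side error rates with uptime seen by an outside probe, using invented numbers. I don't know how Google actually computes its figure beyond the quoted phrase, so treat this purely as an illustration of the concern.

```python
from typing import List, NamedTuple


class Minute(NamedTuple):
    requests_served: int  # requests that actually reached the servers
    server_errors: int    # errors the servers themselves logged
    probe_ok: bool        # did an external check succeed this minute?


def server_side_uptime(minutes: List[Minute]) -> float:
    """Uptime from server-side error rates: requests never received are invisible."""
    total_requests = sum(minute.requests_served for minute in minutes)
    total_errors = sum(minute.server_errors for minute in minutes)
    if total_requests == 0:
        return 100.0  # nothing was logged, so nothing looks wrong
    return 100.0 * (1 - total_errors / total_requests)


def externally_observed_uptime(minutes: List[Minute]) -> float:
    """Uptime as seen by a third party probing from outside."""
    successful = sum(1 for minute in minutes if minute.probe_ok)
    return 100.0 * successful / len(minutes)


# One day in which the service is completely unreachable for an hour:
# no requests reach the servers, so no server-side errors are recorded.
day = [Minute(1000, 0, True)] * 1380 + [Minute(0, 0, False)] * 60
print(server_side_uptime(day), externally_observed_uptime(day))
```

In this made-up day the server-side number is a perfect 100%, while the outside view is a little under 96%. That gap is exactly why the definition and the third party validation both matter.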

To end on a more positive note, the closing notes from Google's announcement:
"More than 1 million businesses have selected Google Apps to run their business, and tens of millions of people use Gmail every day. With this type of adoption, a disruption of any size — even a minor one affecting fewer than 0.003% of Google Apps Premier Edition users, like the one a few weeks ago — attracts a disproportional amount of attention. We've made a series of commitments to improve our communications with customers during any outages, and we have an unwavering commitment to make all issues visible and transparent through our open user groups.

Google is one of the 1 million businesses that run on Google Apps, and any service interruption affects our users and our business; our engineers are also some of our most demanding customers. We understand the importance of delivering on the cloud's promise of greater security, reliability and capability at lower cost. We are hugely thankful to our customers who drive us to become better every day."

P.S. I've added a comment to my previous "Online Users Bill of Rights" post to take this issue into account.

Amazon launches the Festivus of CDN's...a CDN for the rest of us

You can read about it here, here, and here. A CDN-as-a-Service (CaaS?). Does this put the nail in the coffin and make the CDN business a commodity? Probably.


What impresses me most is not the technology or the pricing or the ease of use of the new service (called CloudFront btw). What really gets me hot is the fact that the AWS Service Health Dashboard already has the performance and uptime of CloudFront up and live! That, my friends, is a sign of true commitment to transparency, and more importantly a well functioning process.

Saturday, November 15, 2008

Comparing Amazon Web Services, SalesForce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on its way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have mistakenly misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking at today?
What is the URL of each status page (and are they easy to remember in times of need)?
What are these status pages called?
  • Amazon Web Services: "AWS Service Health Dashboard"
  • Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com" brand)
  • Zoho: "Zoho Service Health Status"
What services' health are reported on?
  • Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
  • Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
  • Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.
What health information is provided?
  • Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
  • Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be either "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
  • Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response from six geographical locations is also given.
Where does the uptime and performance data come from?
  • Amazon Web Services: No clue.
  • Salesforce: No clue.
  • Zoho: Their own "Site 24x7" monitoring service.
What is considered downtime and what is considered a performance issue?
  • Amazon Web Services: No clue.
  • Salesforce: No clue.
  • Zoho: No clue.
Are real time updates provided during downtime events? Is it easy to find?
  • Amazon Web Services: Yes, but unclear how consistently and how easy it is to find that information.
  • Salesforce: Yes, right underneath the current status.
  • Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.
Is information provided on past downtime events?
  • Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
  • Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
  • Zoho: No. Unless they are described in the blog.
Is there a way to easily report problems users are having?
  • Amazon Web Services: Yes, clicking the "Report an Issue" link.
  • Salesforce: No, other than using the standard support channels.
  • Zoho: No, other than using the standard support channels.
How can you get notified of problems (without watching this page 24/7)?
  • Amazon Web Services: Ability to subscribe to RSS feeds for change in status of each service.
  • Salesforce: No.
  • Zoho: No.
Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy to understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into what customers may be affected by an event, but does little in offering insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesforce completely lack insight into how they collect the health information, and all three give no information on what is meant by downtime or performance problems.

Closing questions for each provider:
  • Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to and why not cover the APIs for the other services?
  • Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
  • Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
  • To all: How is the health actually monitored (especially for the GUI-focused Salesforce and Zoho services)? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.
Notable mentions: The following services also offer health dashboard pages, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more than happy to. Just leave a note in the comments.

Friday, November 14, 2008

Google Status Dashboard been brewing since August?

According to MoonWatcher back on August 27th, an email from Google explains:
"We're building a dashboard to provide you with system status information. This dashboard, which we aim to make available in a few months, will enable us to share the following information during an outage:
  1. A description of the problem, with emphasis on user impact. Our belief is during the course of an outage, we should be singularly focused on solving the problem. Solving production problems involves an investigative process that's iterative. Until the problem is solved, we don't have accurate information around root cause, much less corrective action, that will be particularly useful to you. Given this practical reality, we believe that informing you that a problem exists and assuring you that we're working on resolving it is the useful thing to do.
  2. A continuously updated estimated time-to-resolution. Many of you have told us that it's important to let you know when the problem will be solved. Once again, the answer is not always immediately known. In this case, we'll provide regular updates to you as we progress through the troubleshooting process."
Who knew! Glad to see they aren't rushing this out the door and are (hopefully) putting in the time to do it right. Judging by the amount of thought they've already put into this email, I'm already impressed.

Google will have a challenge in presenting their myriad of services in an easy to understand dashboard. Their online services are complex, highly GUI oriented, and distributed in such a way that a small fraction of users could be down while the rest of the world is fine. My three big questions for Google are:
  1. How will you be collecting your uptime and performance data? Can we rely on it to be accurate and unbiased?
  2. What will be considered "downtime" for a complex app like Google Docs or Gmail?
  3. How will you communicate with your users during the downtime event? Will that communication channel be easy to find?
I'm looking forward to seeing what Google comes up with. I have no doubt the release of a Google Health Status Dashboard will not only be huge news for the tech industry, but will be the tipping point that drives all serious SaaS services to offer their own status dashboard. Transparency is so close I can taste it!

Wednesday, November 12, 2008

Zoho opens up their kimono and the blogosphere applauds

Big news from Zoho (provider of online applications and services that compete with Google and Microsoft). As of yesterday, they have reached the next level in uptime transparency by launching Zoho Service Health Status! As Raju Vegesna states in their big announcement, "This initiative is yet another step to being more open and transparent with our users." Kudos to Zoho for recognizing the need and delivering. They join Amazon AWS and SalesForce as the three large service providers offering a very public health dashboard of their services.

I'm really excited to see that the blogosphere gets the significance of this move:

Mashable:
"One frequent problem with web applications and services - and thus the whole web 2.0 phenomenon - is lack of communication when something goes wrong. Sure, it’s nice to have your online e-mail client available from every computer, but what happens when it goes down? Often, it’s just you in the dark, waiting for problems to be resolved, with little or no official information on what’s happening to ease your mind...

This is a great idea. If something goes wrong with any one of Zoho’s applications, you can quickly check out if the problem is on your side or theirs. Of course, I’m sure that the folks at Zoho will continue to inform their customers about problems, updates, downtime and similar issues via blog posts, but being able to see what’s wrong for yourself, at any given time, is an advantage Zoho’s customers will certainly enjoy. All other web startups take notice: this is the level of transparency we’d like to see from everyone, not just Zoho."

WebWorkerDaily:
"After taking a look, I’d say that all applications hosted online could benefit from this level of kimono-opening."

CNET:
"Web application specialist Zoho has joined the growing ranks of companies willing to share detailed information on how well their online services are holding up.

This move toward transparency is increasingly important as potential customers consider relying on such services...

Publishing the performance measurements for online services is catching on as cloud computing grows more serious. Going hand in hand with that is offering service level agreements (SLAs) with specific uptime commitments."


Who's next to open up? I'm looking at you Google!

Sunday, November 2, 2008

Kudos to Microsoft for showing humanity and transparency in a recent Entourage regression bug fix

From Microsoft's Mac Office blog:
"We’ve been working hard for the last week and a half to bring Entourage users today’s 12.1.4 update. It’s incredibly frustrating when we get through a release process and a new issue is introduced by an update. When we start to hear feedback and customer reports about issues with an update, I simply cringe because so much work goes into preventing that from happening. Unfortunately, the recent Office for Mac 2008 12.1.3 update introduced a bug that prevented some Entourage users from sending meeting invites to others. We’re sorry.

However, we also believe it’s better to pair an apology with a solution. With that in mind, with the just released 12.1.4 update, meeting invites will be working as expected once again for all Entourage users.

A lot of work goes into every update release to make sure that we are improving the product's quality. With every update, each and every change that goes into the release is under tight review. There are multi-developer code reviews, focused test passes, and verifications with targeted customers who reported the issue we are targeting to fix. Even then, sometimes things do not work out. With 12.1.3, we did all of the above, yet two cases slipped-through..."
Read the details on the release here. Why don't we see this kind of honesty more often?

Saturday, October 25, 2008

Google G1's email service is down, but does anyone know where to get help?

As reported by Gizmodo, T-Mobile G1's POP3 and IMAP email is down right now. The big question is whether or not anyone knows that the only place to get information on the status of the downtime, and the only way T-Mobile is communicating with its users, is through this thread on the T-Mobile Forum. How many support calls and emails is T-Mobile getting right now?

I'm not a T-Mobile customer, but if I was experiencing this problem, I would go to http://www.tmobile.com/, click Support, and end up here. And then I'd be stuck. If I ended up digging through all of this help, I would spend at least 10 minutes figuring out where to go, or end up calling the support department.

The solution is not complicated. Your customers need a single obvious place to go find out what's happening right now. The fewer calls into support the better, both for you and for the customer. Or put another way, broadcasting broad updates to all of your customers is much more efficient both for your customers and for your staff.

In this specific case, I would suggest a big link or button off of the Support page taking the user to a page similar to our Hall of Fame AWS Service Health Dashboard, with a running feed of user updates and, more importantly, responses from the T-Mobile support people as often as possible.

Saturday, October 4, 2008

A boat load of links about cloud computing

Enjoy!

P.S. Is "Cloud Computing" the new "Web 2.0"?

Cloud Computing Bill of Rights

I'm glad to see someone else felt there was a need for an online users bill of rights:
"Before you architect your application systems for the cloud, you have to set some ground rules on what to expect from the cloud vendors you either directly or indirectly leverage. It is important that you walk into these relationships with certain expectations, in both the short and long term, and both those that protect you and those that protect the vendor.

This post is an attempt to capture many of the core rights that both customers and vendors of the cloud should come to expect, with the goal of setting that baseline for future Cloud Oriented Architecture discussions."
This is focused specifically on cloud computing providers. First there was version 0.1, followed by version 0.2, followed by a Wikified version. I especially like the wiki version (clean, straightforward, to the point). I see a lot of overlap between this and a general "Online Users Bill of Rights".

The big question is what it will take for online service providers to give a damn and agree to any sort of bill of rights. It's hard enough to get a consistent and worthwhile SLA in place. Maybe it's too early to push this on companies. Maybe we need to wait for some major downtime that disrupts business in a serious way for users to realize the importance of something like this. But isn't that far too late?

Google Video gave it the old college try.

I just came across Google Video's blog. Looking at the history of the posts, it looks like they had a pretty good run at transparency for about a year, and are now pooped out (the last post being on June 25th, 2008). Good effort!

A good analysis of how ecosystem monitoring will make your life easier

My colleague over at Webmetrics posted a really good discussion of ecosystem monitoring, and how it aims to solve the problems described in my previous post. Check it out: The Benefits of Eco-System Management.

Sunday, September 21, 2008

How Google's outages hurt your business

As reported by TechCrunch, Google's Custom Search service was down for over 12 hours last week. Notice it took Google over 12 hours to even respond to the complaints! Then it took about 2 more hours to resolve the issue.

Let's skip over the lack of transparency from Google during the event, except to say that it's pretty sad that it took so long to at least admit to the problem. In their defense, they claim it affected a small number of clients. And Google is generally open about their problems, so we'll give them a pass on this one.

Why does it matter to you?

Unlike downtime at GMail and Google Reader, a SaaS service like Custom Search being down is a big deal. Why? Because if you were using Custom Search, to your visitors it looks like YOU are down. Imagine being a customer of Smug Mug, visiting their help page and ending up with a really slow or broken search. Would you blame Google or Smug Mug? Sure, many customers would probably blame themselves, but just the possibility that your perceived uptime and user experience is dependent on a third party (that you have no control over or insight into) should give you pause. How are you supposed to even know that these services are down? Imagine if it was something more critical to your business, like your ad network or the payment processing system?

Is SaaS doomed?

In spite of these dangers, the benefit of using SaaS solutions is very strong. Why bother building and hosting something outside your core competency when a service out there does it for you? You can read about the benefits of SaaS here, here, here, here, and here. I doubt I have to convince you of that. So the question is how you can continue to reap the benefits of SaaS while minimizing your exposure to problems you can't control. Is there a solution?

A solution

The key to a successful SaaS implementation is having real time access to the uptime and performance of the SaaS solutions your business relies on. If you knew right away that Google's Custom Search solution was down, at the least you could react and put up a friendly message for your visitors ("Don't blame us, it's Google's fault!"). Even better, you'd have a fail-over plan in place to switch to another solution. Same thing if this was an ad network or a payment system that went down. You would have some control over your users' experience, and would no longer have to pray that all of your solution providers are up 100% of the time (good luck!). Without this knowledge, you're either assuming these services never go down, or you don't realize that your visitors have no idea that the issues aren't your fault.
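To make the fail-over idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the provider names, the health-check callables, and the wording of the message are placeholders, and a real check would probe the provider over the network. It only shows the shape of the logic: try each provider in order, and fall back to a friendly message when none of them is healthy.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a (name, health_check) pair. The health check is any
# callable returning True when the service looks usable. Both are
# hypothetical stand-ins for real monitoring data.
Provider = Tuple[str, Callable[[], bool]]


def pick_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the name of the first provider whose health check passes."""
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A health check that raises is treated like a failed check.
            continue
    return None


def search_widget_message(providers: Sequence[Provider]) -> str:
    """Pick a provider, or explain the problem if every provider is down."""
    chosen = pick_provider(providers)
    if chosen is None:
        return "Search is temporarily unavailable (our search provider is having trouble)."
    return f"Search powered by {chosen}"
```

The list order doubles as the fail-over priority, so switching the preferred provider is a one-line change.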

The company I work for recently launched a solution that deals with this very need. It's all about working together with your SaaS providers, sharing performance and uptime data, and being able to see the same data your providers are seeing. As with most problems, it often times boils down to opening up the communication lines.

As more businesses come to rely on SaaS solutions, the more exposure these businesses will have to this kind of "perceived" downtime. The naive solution is to expect 100% uptime. The real solution is to know when that downtime does occur, and to have a plan of action.

Nice to see some talk of transparency in the blogosphere

A post by Steve Rubel of Micro Persuasion titled Radical Transparency: Three Lessons Apple Can Learn from Google:
"Google isn't exactly known as the most transparent company in the world, but they're light years ahead of Apple - a company that in some ways they share a kinship with when it comes to their reputation for innovation. Apple (or for that matter any big company) can learn a lot about radical transparency, customer service and PR from Google, even though they're hardly perfect here."
He goes on to review the various places that Google and Apple make public their bugs and known issues. What's missing here obviously is any mention of transparency in uptime and performance. But to fill in the gaps, as we've seen previously, Google does a much better job here as well.

Saturday, September 20, 2008

Robert Scoble hosting a webinar on scalability

Didn't see this one coming:

Avoiding the Fail Whale - Thursday, October 9th 1pm EST

Building a server environment that’s scalable and reliable can be tough, especially when your traffic goes “nuts” virtually overnight. Fast Company Live presents a special one-hour live webinar, moderated by Robert Scoble and featuring a panel of tech leaders from companies big to small who are facing these very issues.

Confirmed guests include:

  • Matt Mullenweg: Founder of Automattic, the company behind WordPress.
  • Paul Buchheit: One of the founders of FriendFeed and the creator of Gmail.
  • Nat Brown: CTO of iLike, a music community service that had one of the first Facebook apps.

The discussion will cover architectural choices, growth hurdles and how the panelists overcame them. The first half-hour will be devoted to the panel discussion, while the second half-hour will be open to live questions from registered webinar attendees.

I'm certainly registering. Will be interesting to get their thoughts on load testing, and how they plan for such large amounts of unpredictable load.

Sunday, September 7, 2008

Gamer's Bill of Rights?

What's more important, a Gamer's Bill of Rights, or an Online Users' Bill of Rights? Do you even have to think about it?

Twitter showing improved uptime

Both Read Write Web and TechCrunch point out that Twitter has seen much improved uptime in the past couple months, reaching 99.88% uptime this past month.

TechCrunch quotes co-founder Biz Stone:

"Twitter has been making great progress in terms of uptime and reliability. Fail Whale sightings are far less frequent these days thanks to our efforts but we still have a long journey ahead. Last month we saw 99.88% uptime and so far this month we are at 99.96%. Our engineering and operations teams have been taking a very methodical approach to improving Twitter. We’re using the word “craftsmanship” to characterize our work here at the office. Reliability and dependability continue to be top on or list of key goals."

What I like most is the details that Twitter provides on their blog describing where the issues stem from:
"I've always respected a good sense of pacing. It's easy to be fast and loose, but it takes a certain discipline, foresight, and patience to guide something through the right way. For most of Twitter's early days, pacing could be considered an unattainable luxury. Our effort started with a bang and quickly accelerated to a disconcerting velocity that never let up. We found ourselves reacting to situations instead of crafting solutions and features we wanted to make.

With nearly two years at full speed, thousands of successes (with as many mistakes), and countless lessons learned, we've finally discovered our rhythm as a team. By carefully regrouping all aspects of our work, breaking the problem down into smaller parts, and iterating rapidly, Twitter, Inc. is poised to bring a new kind of communication to every part of the world."
Kudos to Twitter for not only the improved uptime, but for keeping its users in the loop on things that generally are discussed only behind closed doors.
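To put those percentages in perspective, here is a small Python sketch that converts an uptime percentage into minutes of downtime. The function name and the 30-day default are my own choices for illustration. Assuming the 99.88% figure covers a full 31-day month, it works out to roughly 54 minutes of downtime.

```python
def downtime_minutes(uptime_percent: float, days_in_period: int = 30) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Assumption: the quoted 99.88% covers a 31-day month.
print(round(downtime_minutes(99.88, days_in_period=31), 1))
```

The same function shows why each extra "nine" matters: 99.9% over a 30-day month still allows a bit more than 43 minutes of downtime.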

Saturday, August 23, 2008

What if the cloud disappeared tomorrow? Thoughts on an "Online Users Bill of Rights"

NPR did a story on the (often unexpected) risks involved in storing your data in the cloud. What would you do if Gmail, Flickr, or Yahoo decided they no longer cared to store your massive amount of free data and ran a large "rm -rf"? Sure, they'd get some pretty bad PR, but if you look at their EULAs, I'm betting they have the right to do this. Can we ever trust that our data is really safe in the cloud?

What's needed here is an "Online Users Bill of Rights". This would define specific standards that protect users and give them insight into decisions currently made behind closed doors. Here's a start:

1. Files, documents, or anything else that the user has created and saved online cannot be removed or be made inaccessible without 30 days' advance notice.

2. The service must be accessible 95% of the time each month. Specifically, users must be able to access their data, be able to delete or retrieve existing data, with availability of at least 95% in each month long period. It is also highly encouraged to make public a tighter uptime commitment, including the consequences of not meeting that commitment.

3. During downtime events, the service must make a best effort to provide status updates, estimates as to when service will be restored, and an explanation of what led to the downtime after the event. It is also highly encouraged to make known a central location to distribute this information.

4. The service will provide a performance SLA describing the average page load time they expect to see, and the consequences of not meeting that average in any given month. This is especially important for APIs and services like AWS.

5. The service must give at least 30 days' notice prior to making any "major" changes in the functionality or level of service provided up to that point (including API interfaces). It is also highly encouraged to involve the users in the decision making process prior to these changes.
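As a rough illustration of how the 95% commitment in item 2 could be checked, here is a Python sketch that computes monthly availability from a log of outages. The function names, the outage log format, and the assumption that outages do not overlap are all mine, not part of any real service's reporting. For scale, 95% of a 30-day month leaves room for 36 hours of downtime.

```python
from datetime import datetime
from typing import Iterable, Tuple

Outage = Tuple[datetime, datetime]  # (start, end) of one downtime event


def monthly_availability(outages: Iterable[Outage],
                         period_start: datetime,
                         period_end: datetime) -> float:
    """Percent of the period the service was up.

    Assumes the outages do not overlap each other.
    """
    period_seconds = (period_end - period_start).total_seconds()
    down_seconds = 0.0
    for start, end in outages:
        # Clip each outage to the reporting period before counting it.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down_seconds += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (period_seconds - down_seconds) / period_seconds


def meets_commitment(availability_percent: float, floor: float = 95.0) -> bool:
    """True when availability is at or above the committed floor."""
    return availability_percent >= floor
```

A single two-hour outage in a month comes out well above the 95% floor, which is why the draft also encourages services to publish a tighter commitment.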

This Bill of Rights would need to be signed off on by any online service that stores data for users (Google, Yahoo, Flickr) or provides online service that other businesses rely on (Amazon AWS, Salesforce, API providers). I'd like to see the day when users simply do not trust online services that aren't willing to sign off on this.

The above is just a first draft, and I'd love to get some input on this. I would purposefully keep the list somewhat open to interpretation, staying away from legalese, and focusing on the spirit of the idea of transparency and user rights (similar to the concept of a B Corporation).

What do you think?

Friday, August 22, 2008

Microsoft celebrates its downtime

After experiencing downtime in its launch of Photosynth this past week, Microsoft admits its projections were a little off:
"We have been abolsutely overwhlemed by demand, and have turned Photosynth.com into a special static/read-only mode for the moment. The team is hard at work adding capacity and getting the full site back online. We've been under incredible demand since we released just over 12 hours ago. With everyone waking up around the world traffic has been on a steady ramp up since that release and has far exceeded even our most optimistic expectations.

Getting ready for the launch we did massive amounts of performance testing, built capacity model after capacity model, and yet with all of that, you threw so much traffic our way that we need to add more capacity. We are adding that extra horsepower right now and should be back up shortly.

Thank you for the incredible reception! "

Nice to see some visibility into their thinking up to launch, and the preparation they (unsuccessfully) went through. The next best thing to real time downtime status is a well formed explanation after launch (assuming the downtime is not prolonged), and something this personal coming out of Microsoft is a good sign.

Monday, August 18, 2008

Apple makes up for their downtime with 60 days of free service

From http://support.apple.com/kb/HT2826:

Why is Apple granting a 60-day subscription extension?
The transition from .Mac to MobileMe was rockier than we had hoped. While we are making a lot of improvements, the MobileMe service is still not up to our standards. We are extending subscriptions 60-days free of charge to express appreciation for our members’ patience as we continue to improve the service.

Am I eligible for the 60-day extension?
You are eligible if you are a MobileMe member whose account was active as of August 19, 2008 at 0:00 Pacific Daylight Time.

That's one way to deal with downtime!

Saturday, August 16, 2008

Do You Trust the Cloud?

Quoting a lifehacker post with the same name:
"While web-based applications promise gigabytes of storage, anywhere-access, easy backup, and no software requirements beyond your browser to use them, becoming dependent on webapps can leave you high and dry when those services go out."
Referencing the recent downtime of Gmail, MobileMe, and Amazon S3 got me thinking...what does it take to actually "trust the cloud"? What would give users confidence in choosing these services, and sticking with them through the inevitable issues? The simple answer is transparency!

How good are these specific services at providing transparency into their downtime? Let's review:

Gmail downtime (2 hours) on 8/11/08
The Gmail team kept users updated during the downtime using a Google Group thread, with surprisingly frequent updates and details (over the 2 hour downtime period). After the event was over and they were able to get their thoughts in order, they then posted a message on their Gmail blog. A big red flag however shows in the spike in Twitter posts and the huge spike in searches for "gmail down". We can tell that users are unsure where to go to get the official word on what's going on, which means that all of the work the Gmail team is putting into keeping users up to date falls on deaf ears. If you post an update and no one sees it, does your update exist?
Conclusion: Very good transparency, but needs some work on making known the forum they are using to spread information. Kudos for the rarity of downtime this service has experienced in its history (too easy to overlook).

MobileMe downtime (2 hours) on 8/11/08
With their handy dandy System Status Recent History page, the MobileMe team documented the downtime. However, the only way users were able to know anything was wrong DURING the event was a big fail when attempting to use the service. As CNET reports, "the same thing happened in mid July with enough blowback to cause Apple to offer a 30-day extension to both free trial and paying users." In a valiant yet fruitless effort to keep users in the know, Apple created the MobileMe Status blog, which as of now still has no news of the recent downtime. On the plus side they have created a MobileMe Mail Chat page for users to get personal support when issues arise. On the downside, according to one comment "even the support guy didn't know that the service outage was going on".
Conclusion: Unacceptable job keeping users in the loop, passable job documenting the events after the fact, and far too many random glitches to make this OK. Let's hope they get their act together soon and open up about their issues (won't be holding my breath...this is Apple after all).

Amazon S3 downtime (6 hours) on 7/20/08
By far the most critical of these online services, which means it should be held to a higher standard. During the event, the AWS team posted outage messages and their Service Health Dashboard clearly showed they were having issues. After the event a detailed explanation went up on their site.
Conclusion: There's a reason I have Amazon AWS in the "Transparency Hall of Fame" (top right of this blog). They've been at this a while, and their users have forced them to make this process as transparent as possible. They could get better at giving specific details during the actual event, and 6 hours of downtime is no laughing matter, but they did a good job and they continue to set the bar for transparency in online services.

Yesterday CNET posted the "10 Worst Web glitches of 2008 (so far)", which includes the above events, among others. What does this tell us? Clearly downtime across the broad spectrum of online services, from Amazon to Netflix to Google, is not going away. We need to learn to live with unreliable online services. The long term success of these services will be determined by how users perceive the reliability of these services, contrasted with the advantages of building in the cloud. That perception of reliability requires complete and utter transparency in the goings on of that service, especially during downtime events. We still have a long way to go until we can really "trust the cloud".

Thursday, August 14, 2008

Twitter lacking transparency in its own functionality

Slightly off topic, but still relevant to the concept of transparency in the online world, Twitter recently changed its limits on the number of followers any one person can have (to help curtail spam).

The problem, as described by WebWorkerDaily:
"Though the blog post says there’s no magic number, quite a few Twitterers - including some heavy participants - report hitting the limit at 2000. Some have been trying to get the attention of Twitter management to discuss this for days, with little or no result. It’s reminiscent of Twitter’s attitude towards making money from the service, which amounts to “we have a plan, but we won’t tell you,” or to fixing issues, which seems to be “we’re working on it, leave us alone.” In a world of Web 2.0 openness, Twitter seems to be carrying traditional business values of secrecy a bit further than most."
It's all too easy to keep your users in the dark. It takes real vision to open up and keep your users in the know.

Tuesday, August 12, 2008

First post!

The notorious first post. Who ever actually reads the very first post on a blog anyway? Someone's got to I guess. You're reading it...so that means it's time to get to business. My god the pressure!

My goal for this blog is to focus on the idea of transparency in the uptime and performance of web sites and services. What does that mean? Let me tell you. My argument is that if you run an online application (e.g. a plain jane web site, a web service, an API, or anything else that sits online) and your users rely on it, you MUST be as open as possible about its downtime events, performance problems, and anything else that could affect the quality of service for your users. Gone are the days when you could hide behind the white glow of anonymity in the online space, and hope that no one notices your application is down for half the day (I'm looking at you Twitter). Not only is this going to help your business, and make your users happy, but it's only a matter of time before your users demand it.

Some examples:
  1. http://trust.salesforce.com/
  2. http://status.aws.amazon.com/
  3. http://status.twitter.com/
All three of these services (SalesForce.com, Amazon, and Twitter) have come to realize, after much prodding from their users, and numerous downtime events, that making this information public is a really good idea!

Originally I was inspired to pursue this idea thanks to a great article in Wired magazine titled "The See-Through CEO". Definitely check it out.

My goals for this blog at this point are the following:
  1. Document examples of really great transparency, or the lack thereof.
  2. Develop a guideline of transparency that you can use in your own professional life.
  3. Help you and maybe your business reap the benefits of being transparent, and get ahead of the curve on your competition.
I would guess that 95% of all blogs die within the first 3 months. My goal is to be posting a 1 year anniversary story, re-evaluating the state of the industry a year from now, and hopefully helping you become more successful along the way!