Tuesday, January 13, 2009

The bulls**t of outage language

Since this blog is often dedicated to pointing out downtime events and offering advice on how best to communicate before, during, and after the (inevitable) event, I thought this post by 37signals could come in handy the next time you have to write an apology email to your customers.

Service operators generally suck at saying they’re sorry. I should know, I’ve had to do it plenty of times and it’s always hard. There’s really never a great way to say it, but there sure are plenty of terrible ways.

One of the worst stock dummies that even I have resorted to in a moment of weakness is this terrible non-apology: “We apologize for any inconvenience this may have caused”. Oh please. Let’s break down why it’s bad...

I'll let you read the advice yourself, but I will point out a few of the visitor comments that speak to the message I've been harping on over the past few months:
Josh Catone:
Serious question: What WOULD be a better way to communicate with customers after downtime in your opinion? You didn’t offer any alternatives. I know you said stock responses should never be used… but I’d love to see some examples of what you think works.
and

Dan Gebhardt:
I’d recommend using a website monitoring service (we use Pingdom [editor's note: *cough* Webmetrics *cough*]) to provide public accountability for your uptime. This not only proves that uptime is as important to you as it is to your customers, it can also help customers see any particular outage in the context of your overall service record.

and
Mark Weiss:
While a personal, well-thought-out apology is nice, as a user I want to know when things are going to be working again. I want to know if I should go for a quick walk in the park or if I have time for some food, drinks, and then possibly a nap.

Just keep me informed so I know how to manage my time.

I think Flickr holds top honors for the best down time strategy and message. http://blogs.zdnet.com/Burnette/?p=147

and
Itinerant Networker:
Empathy’s not enough. Service providers should reveal details about why an outage happened, what they’re doing to make it not happen again, and should clearly communicate with customers (frequently) on the ETA of the outage. The most frustrating thing I hear is “we don’t have an ETR [estimated time to recovery]”. That is not acceptable in a service business – give me an ETR and then an estimate of how reliable the ETR is. This goes for even the lowest cable modem user calling $provider – the tier 1 guys should have at least some clue.

The bottom line is that what matters most is not that you never go down, but how you deal with that downtime. All your customers need is some form of honest communication during the event, some transparency into the severity of the problem, and a human explanation of what went wrong afterward. It really isn't very hard.

Saturday, January 10, 2009

How transparency can help your business

When looking to gain the benefits of transparency (into your downtime and performance issues), you first need to understand the use cases (or more accurately, the user stories) that describe the problems that transparency can solve. It's easy to put something out there looking for the press and marketing benefits. It's a lot more challenging (and beneficial) to understand what transparency can do for your business, and then actually solve those problems.

Transparency user stories

As an end user/customer:
  1. Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.
  2. I know your service is down, and I want to know when it'll be back up.
  3. I want some kind of explanation of why you went down.
As a business customer using your service as part of my own service offering:
  1. Before betting my business on your service/platform, I need to know how reliable it has been.
  2. My own customers are reporting that my service is down, but everything looks fine on my end. I need to know if your service is down, and if so I need information to keep my customers up to date.
  3. I want to find out right away which link in my ecosystem of external services is broken or slow.
  4. One of my customers reported a problem in the past, and I'd like to correlate it with hiccups your service may have had in the past.
  5. I need to know well in advance of any upcoming maintenance windows.
  6. I need to know well in advance if you plan to change any features that are critical to me, or if the performance of the service will change.
As a SaaS provider:
  1. I want my customers (and my prospects) to trust my service. I don't want my customers to lose that trust if I ever go down.
  2. My support department gets flooded with calls and emails during a downtime event.
  3. I want to understand the uptime and performance of my services at all times, from around the world, both for internal reasons and to help my customers diagnose the issues they report.
  4. I want to differentiate from my competition based on reliability and customer support.
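Before getting into solutions, here's one way to make the end-user stories concrete: a minimal sketch of the kind of machine-readable status payload they imply. This is my own illustration, not any particular provider's format, and every field name below is hypothetical.

    import json
    from datetime import datetime, timezone

    # Hypothetical status payload; field names are illustrative only.
    status = {
        "service": "ExampleApp",                           # your service
        "overall": "degraded",                             # "up" | "degraded" | "down"  -> end-user story 1
        "estimated_recovery": "2009-01-13T22:30:00Z",      # when it'll be back          -> end-user story 2
        "incident_summary": "Search indexing is delayed; core APIs are unaffected.",
        "post_mortem_url": None,                           # explanation, posted later   -> end-user story 3
        "uptime_last_90_days": 99.92,                      # historical record           -> business story 1
        "next_maintenance_window": "2009-01-18T08:00:00Z", # advance notice              -> business story 5
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }

    print(json.dumps(status, indent=2))

Something along these lines, published at a stable URL and kept honest during an incident, addresses most of the stories above.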
In the next post, I will dive into ways to attack each of these user stories. Stay tuned.

Thursday, January 8, 2009

Salesforce.com down for over 30 minutes, and what we can learn from it

See what the blogosphere was saying...and see what more traditional media was saying.

Update: Again, Twitter ends up being the best place to confirm a problem and get updates from across the world.

Update 2: Salesforce has posted an explanation of what led to the downtime (from trust.salesforce.com):
"6:51 pm PST : Service disruption for all instances - resolved
Starting at 20:39 UTC, a core network device failed due to memory allocation errors. The failure caused it to stop passing data but did not properly trigger a graceful failover to the redundant system, as the memory allocation errors were present on the failover system as well. This resulted in a full service failure for all instances. Salesforce.com had to initiate manual recovery steps to bring the service back up.
The manual recovery steps were completed at 21:17 UTC, restoring most services except for AP0 and NA3 search indexing. Search of existing data would work but new data would not be indexed for searching.
Emergency maintenance was performed at 23:24 UTC to restore search indexing for AP0 and NA3 and the implementation of a work-around for the memory allocation error.
While we are confident the root cause has been addressed by the work-around, the Salesforce.com technology team will continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed.
Further updates will be available as the work progresses."

Update 3: Lots of coverage of this event all over the web. All of the coverage focuses on the downtime itself, how unacceptable it is, and how bad it makes the cloud look. That's all crap. Everything fails; in-house apps more so than anything. We can't avoid downtime. What we can control is the communication during and after the event, which is how you avoid situations like this:
"Salesforce, the 800-pound gorilla in the software-as-a-service jungle, was unreachable for the better part of an hour, beginning around noon California time. Customers who tried to access their accounts alternately were unable to reach the site at all or received an error message when trying to log in.

Even the company's highly touted public health dashboard was also out of commission. That prompted a flurry of tweets on Twitter from customers wondering if they were the only ones unable to reach the site."

That's where SaaS providers need to focus! Create lines of communication, open the kimono, and let the rays of transparency shine through. It's completely in your control.
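One practical lesson from that quote: the status page has to live on infrastructure that shares nothing with the product it reports on, or it goes down with everything else. Here's a minimal sketch of a standalone status endpoint you could run on a completely separate host and DNS zone; this is my own illustration, not how Salesforce (or anyone else) actually does it.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical status message, updated by hand or by a script during an incident.
    CURRENT_STATUS = {
        "status": "major outage",
        "message": "Core network device failure; manual recovery in progress.",
        "next_update": "15 minutes",
    }

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(CURRENT_STATUS).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # Run this somewhere your main service's failures can't reach.
        HTTPServer(("", 8080), StatusHandler).serve_forever()

The point isn't the code; it's the isolation.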

Sunday, January 4, 2009

A comprehensive list of SaaS public health dashboards

To anyone looking to build a public health dashboard for their own online service, the following list should give you a head start in understanding what's out there. I also keep an up-to-date list in my delicious account that you can reference at any time. I would suggest reviewing the examples below when coming up with your own design, potentially combining the various approaches to create something truly useful to your customers.

Note: This list is divided up into three tiers. The tiers are determined by a rough combination of company size, service popularity, importance to the general public, and quality of the end result.

Tier One
Tier Two
Tier Three
Non-dashboard system status pages
Don't forget to also review the seven keys to a successful health dashboard, especially since not one public dashboard I've come across meets all of the rules.

Again, the full list can always be found here. If I missed any public dashboards, I'd love to know...simply point to them in the comments and I'll make sure to add them to the list.

Monday, December 22, 2008

Comprehensive review of SaaS SLAs - A sad state of affairs

A recent story about the holes in Google's SLA got me wondering about the state of service level agreements in the SaaS space. The importance of SLAs in the enterprise online world is obvious. I'm sad to report that the state of the union is not good. Of the handful of major SaaS players, most have no SLAs at all. Of those that do, the coverage is extremely loose, and the penalty for missing the SLAs is weak. To make my point, I've put together an exhaustive (yet pointedly short) list of the SLAs that do exist. I've extracted the key elements and removed the legal mumbo-jumbo (for easy consumption). Enjoy!

Comparing the SLAs of the major SaaS players

Google Apps:
  • What: "web interface will be operational and available for GMail, Google Calendar, Google Talk, Google Docs, and Google Sites"
  • Uptime guarantee: 99.9%
  • Time period: any calendar month
  • Penalty: 3, 7, or 15 days of service at no charge, depending on the monthly uptime percentage
  • Important caveats:
  1. "Downtime" means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate.
  2. "Downtime Period" means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods.
Amazon S3:
  • What: Amazon Simple Storage Service
  • Uptime guarantee: 99.9%
  • Time period: "any monthly billing cycle"
  • Penalty: 10-25% of total charges paid by customer for a billing cycle, based on the monthly uptime percentage
  • Important caveats:
  1. “Error Rate” means: (i) the total number of internal server errors returned by Amazon S3 as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests during that five minute period. We will calculate the Error Rate for each Amazon S3 account as a percentage for each five minute period in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions (as defined below).
  2. “Monthly Uptime Percentage” is calculated by subtracting from 100% the average of the Error Rates from each five minute period in the monthly billing cycle.
  3. "We will apply any Service Credits only against future Amazon S3 payments otherwise due from you""
Amazon EC2:
  • What: Amazon Elastic Compute Cloud service
  • Uptime guarantee: 99.95%
  • Time period: "the preceding 365 days from the date of an SLA claim"
  • Penalty: "a Service Credit equal to 10% of their bill for the Eligible Credit Period"
  • Important caveats:
  1. “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability. Any downtime occurring prior to a successful Service Credit claim cannot be used for future claims. Annual Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below).
  2. “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.
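And the EC2 version, which looks back over a full year of five-minute periods (again, the outage figure is invented):

    periods_per_year = 365 * 24 * 12               # 105,120 five-minute periods
    unavailable_periods = 60                       # five hours of "Region Unavailable"

    annual_uptime = 100.0 - 100.0 * (unavailable_periods / periods_per_year)
    print(round(annual_uptime, 4))                  # ~99.9429 -- below 99.95%, eligible for the 10% credit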
...that's it!

Notable exceptions (a.k.a. lack of an SLA)
  • Salesforce.com (are you serious??)
  • Google App Engine (youth will only be an excuse for so long)
  • Zoho
  • Quickbase
  • OpenDNS
  • OpenSRS
Conclusions
There's no question that for the enterprise market to get on board with SaaS in any meaningful way, accountability is key. Public health dashboards are one piece of the puzzle. SLAs are the other. The longer we delay in demanding these from our key service providers (I'm looking at you, Salesforce), the longer and more difficult the move into the cloud will be. The incentive in the short term for a not-so-major SaaS player should be to take the initiative and focus on building a strong sense of accountability and trust. As it begins to take business away from the more established (and less trustworthy) services, the bar will rise and customers will begin to demand these vital assurances from all of their providers. The days of weak or non-existent SLAs for SaaS providers are numbered.

Disclaimer: If I've misrepresented anything above, or if your SaaS service has a strong SLA, please let us know in the comments. I really hope someone out there is working to raise the bar on this sad state.

Wednesday, December 17, 2008

Steve Souders - The State of Performance 2008 (and a look ahead to 2009)

Mr. Web Site Performance, Steve Souders, put together a really nice "review of what happened in 2008 with regard to web performance, and [his] predictions and hopes for what we’ll see in 2009." Check it out at Steve Souders' blog. He references a lot of tools and services you may have missed over the course of the year, and gives some good advice for web developers (and online businesses).

Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer
  • Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric, which appears to test the overall service externally.
  • I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
  • Conclusion: Met!
Rule #2: Data must be accurate and timely
  • Hard to say until an event occurs or we hear feedback about this from users.
  • The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
  • The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
  • In addition, the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or non-transparent. That's bad news, though it does fit with the SLA questions that came up recently.
  • Conclusion: Time will tell (but promising)
Rule #3: Must be easy to find
  • If I were experiencing a problem with App Engine, I would first go to the homepage here. Unfortunately I don't see any link to the system status page. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
  • The URL of the system status page (http://code.google.com/status/appengine/) is not easy to remember. Since Google doesn't seem to own appengine.com, this may not be easy to fix, but that doesn't matter to a user who's in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would expect it to rise to #1 very soon.
  • Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).
Rule #4: Must provide details for events in real time
  • Again, hard to say until we see an issue occur.
  • What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
  • Conclusion: Time will tell.
Rule #5: Provide historical uptime and performance data
  • Great job with this. I dare say they've jumped ahead of every other cloud service in the amount and detail of performance data they provide.
  • Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
  • Conclusion: Met!
Rule #6: Provide a way to be notified of status changes
Rule #7: Provide details on how the data is gathered
  • Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", there's no real information on how this data is collected, how often it is updated, or where the monitoring happens from.
  • Conclusion: Not met.
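For contrast, here is about the simplest external check a provider could run (and then describe on the dashboard): fetch a page from a few locations and record success and response time. The URL below is the real status page; everything else is my own sketch, not how Google actually gathers its dashboard data.

    import time
    import urllib.request

    STATUS_URL = "http://code.google.com/status/appengine/"

    def check_once(url, timeout=10):
        """Return (reachable, seconds elapsed) for a single fetch."""
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = (resp.status == 200)
        except Exception:
            ok = False
        return ok, time.time() - start

    ok, elapsed = check_once(STATUS_URL)
    print("up" if ok else "down", round(elapsed, 2), "s")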
Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a complete and extremely useful central place for their customers to come to in times of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever-growing need.