Saturday, January 10, 2009

How transparency can help your business

When looking to gain the benefits of transparency (into your downtime and performance issues), you first need to understand the use cases (or more accurately, the user stories) that describe the problems that transparency can solve. It's easy to put something out there looking for the press and marketing benefits. It's a lot more challenging (and beneficial) to understand what transparency can do for your business, and then actually solve those problems.

Transparency user stories

As an end user/customer:
  1. Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.
  2. I know your service is down, and I want to know when it'll be back up.
  3. I want some kind of explanation of why you went down.
As a business customer using your service as part of my own service offering:
  1. Before betting my business on your service/platform, I need to know how reliable it has been.
  2. My own customers are reporting that my service is down, but everything looks fine on my end. I need to know if your service is down, and if so I need information to keep my customers up to date.
  3. I want to find which link in my ecosystem of external services is broken or slow right away.
  4. One of my customers reported a problem in the past, and I'd like to correlate it with hiccups your service may have had in the past.
  5. I need to know well in advance of any upcoming maintenance windows.
  6. I need to know well in advance if you plan to change any features that are critical to me, or if the performance of the service will change.
As a SaaS provider:
  1. I want my customers (and my prospects) to trust my service. I don't want my customers to lose that trust if I ever go down.
  2. My support department gets flooded with calls and emails during a downtime event.
  3. I want to understand what the uptime and performance of my services are at all times from around the world. Both for internal reasons, and to help my customers diagnose issues they are reporting.
  4. I want to differentiate from my competition based on reliability and customer support.
In the next post, I will dive into ways to attack each of these user stories. Stay tuned.

Thursday, January 8, 2009 down for over 30 minutes, and what we can learn from it

See what the blogosphere was saying...and see what more traditional media was saying.

Update: Again, Twitter ends up being the best place to confirm a problem and get updates across the world:

Update 2: Salesforce has posted an explanation of what led to the downtime (
"6:51 pm PST : Service disruption for all instances - resolved
Starting at 20:39 UTC, a core network device failed due to memory allocation errors. The failure caused it to stop passing data but did not properly trigger a graceful fail over to the redundant system as the memory allocation errors where present on the failover system as well. This resulted in a full service failure for all instances. had to initiate manual recovery steps to bring the service back up.
The manual recovery steps was completed at 21:17 UTC restoring most services except for AP0 and NA3 search indexing. Search of existing data would work but new data would not be indexed for searching.
Emergency maintenance was performed at 23:24 UTC to restore search indexing for AP0 and NA3 and the implementation of a work-around for the memory allocation error.
While we are confident the root cause has been addressed by the work-around the technology team will continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed.
Further updates will be available as the work progresses."

Update 3: Lots of coverage of this event all over the web. All of the coverage focuses on the downtime itself, how unacceptable it is, and bad this makes the cloud look. That's all crap. Everything fails. In-house apps more-so then anything. We can't avoid downtime. What we can avoid is the communication during and after the event, to avoid situations like this:
"Salesforce, the 800-pound gorilla in the software-as-a-service jungle, was unreachable for the better part of an hour, beginning around noon California time. Customers who tried to access their accounts alternately were unable to reach the site at all or received an error message when trying to log in.

Even the company's highly touted public health dashboard was also out of commission. That prompted a flurry of tweets on Twitter from customers wondering if they were the only ones unable to reach the site."

That's where SaaS providers need to focus! Create lines of communication, open the kimono, and let the rays of transparency shine through. It's completely in your control.

Sunday, January 4, 2009

A comprehensive list of SaaS public health dashboards

To anyone looking to build a public health dashboard for their own online service, the following list should give you a head start in understanding what's out there. I also keep an up-to-date list in my delicious account that you can reference at any time. I would suggest reviewing the examples below when coming up with your own design, potentially combining the various approaches to create something truly useful to your customers.

Note: This list is divided up into three tiers. The tiers are determined by a rough combination of company size, service popularity, importance to the general public, and quality of the end result.

Tier One
Tier Two
Tier Three
Non-dashboard system status pages
Don't forget to also review the seven keys to a successful health dashboard, especially since not one public dashboard I've come across meets all of the rules.

Again, the full list can always be found here. If I missed any public dashboards, I'd love to know...simply point to them in the comments and I'll make sure to add them to the list.