- Your support costs go down as your users are able to self-identify system-wide problems without calling or emailing your support department. Users will no longer have to guess whether their issues are local or global, and can more quickly get to the root of the problem before complaining to you.
- You are better able to communicate with your users during downtime events, taking advantage of the broadcast nature of the Internet versus the one-to-one nature of email and the phone. You spend less time communicating the same thing over and over and more time resolving the issue.
- You create a single and obvious place for your users to come to when they are experiencing downtime. You save your users' time currently spent searching forums, Twitter, or your blog.
- Trust is the cornerstone of successful SaaS adoption. Your customers are betting their business and their livelihoods on your service or platform. Both current and prospective customers require confidence in your service. Both need to know they won't be left in the dark, alone and uninformed, when you run into trouble. Real time insight into unexpected events is the best way to build this trust. Keeping them in the dark and alone is no longer an option.
- It's only a matter of time before every serious SaaS provider will be offering a public health dashboard. Your users will demand it.
First things first
Before we get into the rules, I'd like to mention a few public "system status" pages that don't quite meet the label of "health dashboard" but do give us a starting point for providing public health information. There's no reason any SaaS provider today should not be offering at least a basic chronological list of potential issues, downtime events, and resolution details similar to one of the following: craigslist system status, 37signals System Status, Twitter Status, GitHub Status, Mosso System Status. Now...on to the rules for creating a successful online public health dashboard!
Rule #1: Must show the current status for each "service" you offer
- A status light or short description that visitors can use to quickly identify how the service(s) they are interested in are doing right now. Example #1, Example #2.
- Most health dashboards do this well. Keep it simple. Skype's Heartbeat tries to be clever, but I fear that the first impression that the big thumping red hearts give visitors is that something is wrong.
- Don't forget to identify what the status icons and messages actually mean. Example #1 (legend at bottom right), Example #2 (descriptions below the table at bottom). Bad Example (no information on what the possible states are).
- This should go unsaid, but some comments in an online forum show us that it isn't as obvious as it should be.
- The data should be based on real time monitoring, not manual updates that require a human.
- The entire benefit is lost if your users cannot trust this data, or it arrives too late.
- It is worthless to provide a public health dashboard if your users are unaware it exists, or are unable to find it in time of need.
- Anticipate where your users go when they experience downtime, and create a clear path to the status page. Ideally there will be a link from your home page, and at the minimum from your main support page. Example #1 (footer of each page), Example #2 (top right of every page). Many examples of support page links.
- Also consider making the URL as easy to remember as possible. "status.yourdomain.com" or "yourdomain.com/status" seem to be the preferred method.
- You must go beyond simply noting that something is wrong. You must provide insight into what is going on, what services are affected, and if possible an ETA on resolution. Users will be OK with a big red light for only so long.
- This can be as simple as a timestamped message noting that you are investigating, with regular updates about the investigation and projected resolution times.
- The key here is to keep your users from having to contact your support department, defeating much of the gain in having a public health dashboard. Use this as an opportunity to build a trust relationship with your customer by being transparent throughout the process.
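To make the "timestamped message with regular updates" idea concrete, here is a minimal sketch of an incident log that a status page could render. The `Incident` and `Update` classes, their fields, and the output format are all hypothetical names chosen for illustration; they are not taken from any of the dashboards mentioned in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Update:
    posted_at: datetime              # when the message was published
    message: str                     # what is going on, in plain language
    eta: Optional[datetime] = None   # projected resolution time, if known


@dataclass
class Incident:
    service: str                     # which service is affected
    updates: List[Update] = field(default_factory=list)
    resolved: bool = False

    def post(self, message, posted_at, eta=None, resolved=False):
        """Append a timestamped update and record whether it closes the incident."""
        self.updates.append(Update(posted_at, message, eta))
        self.resolved = resolved

    def render(self):
        """Return the incident as text, newest update first."""
        state = "resolved" if self.resolved else "ongoing"
        lines = [f"{self.service}: {state}"]
        for u in reversed(self.updates):
            line = f"[{u.posted_at:%Y-%m-%d %H:%M} UTC] {u.message}"
            if u.eta is not None:
                line += f" (ETA {u.eta:%H:%M} UTC)"
            lines.append(line)
        return "\n".join(lines)
```

Each call to `post` adds one line to the public record, so a visitor can see both the latest status and the history of what was tried without having to contact support.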
Rule #5: Provide historical uptime and performance data
- Make sure to provide root cause analysis for each downtime event. The more detail the better. Example #1, Example #2 (click on any event in the past).
- This will be important to your prospects as they evaluate your transparency. Don't be afraid of problems you've had in the past. Owning up to problems strengthens trust, which should be one of your main goals.
- This will be important to your customers as they do post mortem analysis for their superiors.
- Provide at least one week of historical data, ideally at least a month. Example #1, Example #2, Example #3, Example #4 (notice each service has an archive link).
- RSS/email/SMS/Twitter/APIs/etc. It's too early to know how users will want to consume this information, but my opinion is that the two most useful options would be to allow email alerts on downtime, and an API that allows users to build their applications to work around the downtime automatically.
- Currently many status dashboards provide RSS feeds. Example #1 (even provides email and Twitter alerts!), Example #2.
- Along these same lines, providing advance notice of upcoming maintenance windows is extremely useful. I would hope these are announced in other mediums as well (e.g. email).
- What is the uptime and performance data worth if we have no insight into where it comes from? Currently with most health dashboards we have to assume either the provider built their own monitoring platform, or that they are making status updates manually (Zoho is the one exception, since they have their own monitoring service).
- Beyond simply knowing where the data comes from, what exactly does "Performance issues" mean? What are the thresholds that determine that a service is considered to be "disrupted"? From what locations is the monitoring done? Am I out of luck if I live in Asia and the monitoring is done from New York?
- It would be extremely useful to have the data validated by a third party, especially as this gets into the world of SLAs. We can't have the fox watching the hen house when it comes to money.
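One way to answer the "what does disrupted mean?" question is to publish the thresholds alongside the status itself. The sketch below is only an illustration under stated assumptions: the threshold values, the state names, the `classify` and `status_by_location` functions, and the sample format are all hypothetical, and a real provider would pick numbers that fit their own service.

```python
from statistics import median

# Hypothetical published thresholds; real values would depend on the service.
THRESHOLDS = {
    "slow_ms": 2000,               # median response time above this: "performance issues"
    "disrupted_error_rate": 0.5,   # share of failed checks at or above this: "disrupted"
}


def classify(samples):
    """Map check results from one monitoring location to a status label.

    samples: list of (ok, response_ms) tuples, one per automated check.
    """
    if not samples:
        return "unknown"
    failures = sum(1 for ok, _ in samples if not ok)
    if failures / len(samples) >= THRESHOLDS["disrupted_error_rate"]:
        return "disrupted"
    # Fewer than half the checks failed, so at least one succeeded.
    ok_times = [ms for ok, ms in samples if ok]
    if median(ok_times) > THRESHOLDS["slow_ms"]:
        return "performance issues"
    return "ok"


def status_by_location(samples_by_location):
    """Classify each monitoring location separately."""
    return {loc: classify(s) for loc, s in samples_by_location.items()}
```

Reporting per location rather than one global label is what lets a user in Asia see that their region is slow even when the checks from New York look fine.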
For those seeking to truly be ahead of the curve and open up the kimono further, I suggest the following rules as well:
- Provide geographical uptime and performance data (Zoho is ahead of the game on this one). The more information you provide to your users publicly, the fewer questions you'll have to deal with privately.
- The status page should be hosted externally, at a different location from your primary data center. This should be obvious, but I doubt many companies consider this a problem. The last thing you want when your primary data center goes down is to have to field calls that could be avoided if your status page were still up.
- Break out each individual service and function as much as possible. Similar to how Flickr opened up their API to match up with practically every internal function call, allow your users to have insight into the very specific functionality they need.
- Connect your downtime events to your SLAs. Allow your users to easily track how you're doing compared to what you promised. The days of hoping that your users forget about this are over.
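Connecting downtime events to an SLA comes down to simple arithmetic over the recorded events. Here is a rough sketch, assuming downtime is stored as non-overlapping (start, end) pairs and the SLA is expressed as a promised uptime percentage for a reporting period; the function names and the report format are hypothetical.

```python
from datetime import datetime, timedelta


def uptime_percent(period_start, period_end, downtime_events):
    """Percentage of the period during which the service was up.

    downtime_events: list of (start, end) datetimes, assumed non-overlapping.
    """
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in downtime_events:
        # Clip each event to the reporting period before counting it.
        s = max(start, period_start)
        e = min(end, period_end)
        if e > s:
            down += (e - s).total_seconds()
    return 100.0 * (period - down) / period


def sla_report(period_start, period_end, downtime_events, promised_percent):
    """Compare measured uptime against the promised figure."""
    actual = uptime_percent(period_start, period_end, downtime_events)
    return {
        "promised": promised_percent,
        "actual": round(actual, 3),
        "met": actual >= promised_percent,
    }
```

For a sense of scale: over a 30-day period, a single 45-minute outage works out to roughly 99.896% uptime, which would miss a 99.9% promise but comfortably meet a 99.5% one.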
I'm excited to see over the next few months how things change in this space, what rules become most important to users, and how online service providers respond to the oncoming demand for transparency. Time will tell, but I do know this is only the beginning.
For reference, some of the public health dashboards I referenced in this post: