Transparent Uptime: dashboard


Tuesday, February 24, 2009

I spy with my little eye...Mosso working on a health status dashboard

The transparency that Twitter brings is awesome:

I'm looking forward to seeing how many of the rules of successful health status dashboards they follow!

Wednesday, February 11, 2009

Transparency User Story #1: Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.

Note: This post is the first in a series of at least a dozen posts where I attempt to drill into the transparency user stories described in an earlier post.

Let's assume that you've decided that you want to make your service or organization more transparent, specifically when it comes to its uptime and performance. You've convinced your management team, you've got the engineering and marketing resources, and you're raring to go. You want to get something done and can't wait to make things happen. Stop right there. Do not pass go, do not collect $200. You first need to figure out what it is you're solving for. What problems (and opportunities) do you want to tackle in your drive for transparency?

Glad you asked. There are about a dozen user stories that I've listed in a previous post that describe the most common problems transparency can solve. In this post, I will dive into the first user story:

As an end user or customer, it looks to me like your service is down. I'd like to know if it's down for everyone or if it's just me.

Very straightforward, and very common. So common there are even a couple of simple free web services out there that help people figure this out. Let's assume for this exercise that your site is up the entire time.

Examples of the problem in action

Your customer's Internet connection is down. He loads up www.yoursite.com. It cannot load. He thinks you are down, and calls your support department demanding service.
Your customer's DNS servers are acting up, and are unable to resolve www.yoursite.com inside his network. He finds that google.com is loading fine, and is sure your site is down. He sends you an irate email.
Your customer's network routes are unstable, causing inconsistent connectivity to your site. He loads www.yourcompetitor.com fine, while www.yoursite.com fails. He Twitters all about it.
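The three cases above can be roughly told apart from the customer's side with a few ordered checks. The following is only a minimal sketch of that idea, not a tool referenced in this post: the function names, the reference URL, and the wording of the results are all assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def _fetch(url, timeout=5):
    """True if the URL answers with a successful HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP error statuses, timeouts and refused connections
        # all count as a failure in this sketch.
        return False


def _resolve(hostname):
    """True if the local resolver can look up the hostname."""
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:
        return False


def diagnose(site_url, reference_url="http://example.com/",
             fetch=_fetch, resolve=_resolve):
    """Guess why site_url fails from this machine (hypothetical helper).

    reference_url is an assumed "known good" site used to test the
    local connection; fetch and resolve can be swapped out for testing.
    """
    if fetch(site_url):
        return "site reachable from here"
    if not fetch(reference_url):
        return "local connection looks down"
    if not resolve(urlparse(site_url).hostname):
        return "local DNS cannot resolve the site"
    return "site unreachable from here; check an offsite dashboard or third party"
```

Note that the last result cannot separate a real outage from unstable routes on the customer's network, which is exactly why an offsite view of the service is needed.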

Why this hurts your business

Unnecessary support calls.
Unnecessary support emails.
Negative word of mouth that is completely unfounded.

How to solve this problem

An offsite public health dashboard
A known third party, such as this and this or eventually this
A constant presence across social media (Twitter especially) watching for false reports
Keeping a running blog noting any downtime events, which tells your users that unless something is posted, nothing is wrong. You must be diligent about posting when there is actually something wrong, however.
Share your real time performance with your large customers. Your customers may even want to be alerted when you go down.

Example solutions in the real world

Tuesday, January 27, 2009

Most insane dashboard...ever

I don't know if words can describe this dashboard created by Sprint to promote their mobile Internet card. All I can say is that it's awesome in its uselessness. Which I think is the point. Which should be the exact opposite of your goals in creating a public health dashboard for your own customers.

Thursday, January 8, 2009

Salesforce.com down for over 30 minutes, and what we can learn from it

See what the blogosphere was saying...and see what more traditional media was saying.

Update: Again, Twitter ends up being the best place to confirm a problem and get updates across the world:

Update 2: Salesforce has posted an explanation of what led to the downtime (from trust.salesforce.com):

"6:51 pm PST : Service disruption for all instances - resolved
Starting at 20:39 UTC, a core network device failed due to memory allocation errors. The failure caused it to stop passing data but did not properly trigger a graceful fail over to the redundant system as the memory allocation errors where present on the failover system as well. This resulted in a full service failure for all instances. Salesforce.com had to initiate manual recovery steps to bring the service back up.
The manual recovery steps was completed at 21:17 UTC restoring most services except for AP0 and NA3 search indexing. Search of existing data would work but new data would not be indexed for searching.
Emergency maintenance was performed at 23:24 UTC to restore search indexing for AP0 and NA3 and the implementation of a work-around for the memory allocation error.
While we are confident the root cause has been addressed by the work-around the Salesforce.com technology team will continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed.
Further updates will be available as the work progresses."

Update 3: Lots of coverage of this event all over the web. All of the coverage focuses on the downtime itself, how unacceptable it is, and how bad this makes the cloud look. That's all crap. Everything fails. In-house apps more so than anything. We can't avoid downtime. What we can control is the communication during and after the event, to avoid situations like this:

"Salesforce, the 800-pound gorilla in the software-as-a-service jungle, was unreachable for the better part of an hour, beginning around noon California time. Customers who tried to access their accounts alternately were unable to reach the site at all or received an error message when trying to log in.
Even the company's highly touted public health dashboard was also out of commission. That prompted a flurry of tweets on Twitter from customers wondering if they were the only ones unable to reach the site."

That's where SaaS providers need to focus! Create lines of communication, open the kimono, and let the rays of transparency shine through. It's completely in your control.

Sunday, January 4, 2009

A comprehensive list of SaaS public health dashboards

To anyone looking to build a public health dashboard for their own online service, the following list should give you a head start in understanding what's out there. I also keep an up-to-date list in my delicious account that you can reference at any time. I would suggest reviewing the examples below when coming up with your own design, potentially combining the various approaches to create something truly useful to your customers.

Note: This list is divided up into three tiers. The tiers are determined by a rough combination of company size, service popularity, importance to the general public, and quality of the end result.

Tier One

AWS Service Health Dashboard (http://status.aws.amazon.com/)
Trust.salesforce.com - System Status (http://trust.salesforce.com/trust/status/)
Zoho Service Health Status (http://status.zoho.com/)
OpenDNS System Status (http://system.opendns.com/)
OpenSRS System Status (http://status.opensrs.com/)
Google App Engine System Status (http://code.google.com/status/appengine)

Tier Two

QuickBase Service Status (http://service.quickbase.com/updates.aspx)
NetSuite System Status (http://status.netsuite.com/status.html)
Mogulus Service Health (http://www.mogulus.com/support/servicehealth)
Skype Heartbeat (http://heartbeat.skype.com/)
BlueTie Real Time Status Center (http://support.bluetie.com/?q=node/819)

Tier Three

Twitter Status (http://status.twitter.com/)
SAP System Status (http://www.sytecpa.org/technical/systemStatus.asp)
University of Florida Service Monitoring (http://open-systems.ufl.edu/status/)
Capitalserv - Current service status (http://www.capitalserv.com/current_service_status.aspx)
Everyone.net Email Service Status (http://www.everyone.net/main/scripts/status.cgi)
MSN Messenger Service Status (http://messenger.msn.com/Status.aspx)
Federal Reserve Financial Services Service Status (http://www.frbservices.org/app/status/serviceStatus.do)
Boardhost Service Status (http://status.boardhost.com/)
Primus System Status (http://systemstatus.iprimus.com.au/)

Non-dashboard system status pages

World of Warcraft Service Status (http://forums.worldofwarcraft.com/board.html?forumId=11113&sid=1)
Second Life Grid Status Reports (http://status.secondlifegrid.net/)
GitHub Status (http://github.wordpress.com/)
The WELL System Status (http://www.well.com/status.html)
MODIS Rapid Response System - System Status (http://rapidfire.sci.gsfc.nasa.gov/status/)

Don't forget to also review the seven keys to a successful health dashboard, especially since not one public dashboard I've come across meets all of the rules.

Again, the full list can always be found here. If I missed any public dashboards, I'd love to know...simply point to them in the comments and I'll make sure to add them to the list.

Monday, December 1, 2008

7 keys to a successful public health dashboard

Let's first define what makes an online health dashboard "successful", and in the process explain why you (as a SaaS provider) should have one:

Your support costs go down as your users are able to self-identify system wide problems without calling or emailing your support department. Users will no longer have to guess whether their issues are local or global, and can more quickly get to the root of the problem before complaining to you.
You are better able to communicate with your users during downtime events, taking advantage of the broadcast nature of the Internet versus the one-to-one nature of email and the phone. You spend less time communicating the same thing over and over and more time resolving the issue.
You create a single and obvious place for your users to come to when they are experiencing downtime. You save your users' time currently spent searching forums, Twitter, or your blog.
Trust is the cornerstone of successful SaaS adoption. Your customers are betting their business and their livelihoods on your service or platform. Both current and prospective customers require confidence in your service. Both need to know they won't be left in the dark, alone and uninformed, when you run into trouble. Real time insight into unexpected events is the best way to build this trust. Keeping them in the dark and alone is no longer an option.
It's only a matter of time before every serious SaaS provider will be offering a public health dashboard. Your users will demand it.

With that out of the way, let's move on to detailing what exactly it takes to create a successful public health dashboard. Generally I would suggest looking to your users to tell you what they need. I still strongly recommend you do this, especially if your users are technically savvy. However, as this industry is still so young, and most companies are still unsure of what their users will demand, I humbly submit my 7 rules for public health dashboard success:

The Rules

First things first

Before we get into the rules, I'd like to mention a few public "system status" pages that don't quite meet the label of "health dashboard" but do give us a starting point for providing public health information. There's no reason any SaaS provider today should not be offering at least a basic chronological list of potential issues, downtime events, and resolution details similar to one of the following: craigslist system status, 37signals System Status, Twitter Status, GitHub Status, Mosso System Status. Now...on to the rules for creating a successful online public health dashboard!

The Basics

Rule #1: Must show the current status for each "service" you offer

A status light or short description that visitors can use to quickly identify how the service(s) they are interested in are doing right now. Example #1, Example #2.
Most health dashboards do this well. Keep it simple. Skype's Heartbeat tries to be clever, but I fear that the first impression that the big thumping red hearts give visitors is that something is wrong.
Don't forget to identify what the status icons and messages actually mean. Example #1 (legend at bottom right), Example #2 (descriptions below the table at bottom). Bad Example (no information on what the possible states are).
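One way to keep the status and its legend from drifting apart is to define both in the same place. The sketch below is an assumption-laden illustration: the three state labels are borrowed from the wording some existing dashboards use, while the legend text, service names, and function name are invented here.

```python
from enum import Enum


class Status(Enum):
    OK = "Service is operating normally"
    DEGRADED = "Performance issues"
    DOWN = "Service disruption"


# Hypothetical legend text; the real definitions are up to the provider.
LEGEND = {
    Status.OK: "All monitoring checks are passing.",
    Status.DEGRADED: "The service responds, but slower than usual.",
    Status.DOWN: "The service is not responding to monitoring checks.",
}


def render(statuses):
    """Render one line per service, followed by a legend of every state."""
    lines = [f"{name}: {status.value}" for name, status in sorted(statuses.items())]
    lines.append("")
    lines.append("Legend:")
    lines += [f"  {state.value} = {LEGEND[state]}" for state in Status]
    return "\n".join(lines)


print(render({"Web app": Status.OK, "API": Status.DEGRADED}))
```

Because the legend is generated from the same enumeration as the statuses, a newly added state cannot appear on the page without an explanation.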

Rule #2: Data must be accurate and timely

This should go unsaid, but some comments in an online forum show us that it isn't as obvious as it should be.
The data should be based on real time monitoring, not manual updates that require a human.
The entire benefit is lost if your users cannot trust this data, or it arrives too late.
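A dashboard can guard against showing stale data by refusing to display a status older than some freshness budget. This is a minimal sketch under assumed values: the five minute limit and the function name are hypothetical, and the fallback label simply mirrors the "Status not available" state some dashboards already use.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)  # assumed freshness budget, tune to your monitoring interval


def displayed_status(last_status, last_checked, now=None, max_age=MAX_AGE):
    """Return the status to show, or a fallback if the last check is too old."""
    now = now or datetime.now(timezone.utc)
    if now - last_checked > max_age:
        # Better to admit the data is missing than to show a stale green light.
        return "Status not available"
    return last_status
```

Showing an explicit "not available" state keeps users from trusting a green light that nobody has verified recently.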

Rule #3: Must be easy to find

It is worthless to provide a public health dashboard if your users are unaware it exists, or are unable to find it in time of need.
Anticipate where your users go when they experience downtime, and create a clear path to the status page. Ideally there will be a link from your home page, and at the minimum from your main support page. Example #1 (footer of each page), Example #2 (top right of every page). Many examples of support page links.
Also consider making the URL as easy to remember as possible. "status.yourdomain.com" or "yourdomain.com/status" seem to be the preferred method.

Rule #4: Must provide details for events in real time

You must go beyond simply noting that something is wrong. You must provide insight into what is going on, what services are affected, and if possible an ETA on resolution. Users will be OK with a big red light for only so long.
This can be as simple as a timestamped message noting that you are investigating, with regular updates about the investigation and projected resolution times.
The key here is to keep your users from having to contact your support department, defeating much of the gain in having a public health dashboard. Use this as an opportunity to build a trust relationship with your customer by being transparent throughout the process.
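The timestamped updates described above need very little structure. The following sketch shows one possible shape for an incident log; the class, field names, and output format are all hypothetical choices, not taken from any dashboard mentioned here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    title: str
    affected_services: list
    updates: list = field(default_factory=list)

    def post(self, message, eta=None, when=None):
        """Append a timestamped update, optionally with an estimated resolution."""
        when = when or datetime.now(timezone.utc)
        self.updates.append((when, message, eta))

    def render(self):
        """Return the incident as a chronological, human-readable log."""
        lines = [f"{self.title} (affects: {', '.join(self.affected_services)})"]
        for when, message, eta in self.updates:
            stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
            suffix = f" Estimated resolution: {eta}." if eta else ""
            lines.append(f"[{stamp}] {message}{suffix}")
        return "\n".join(lines)


incident = Incident("Login errors", ["Web app", "API"])
incident.post("We are investigating.")
incident.post("A fix is rolling out.", eta="30 minutes")
print(incident.render())
```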

Beyond the basics

Rule #5: Provide historical uptime and performance data

Make sure to provide root cause analysis for each downtime event. The more detail the better. Example #1, Example #2 (click on any event in the past).
This will be important to your prospects as they evaluate your transparency. Don't be afraid of problems you've had in the past. Owning up to problems strengthens trust, which should be one of your main goals.
This will be important to your customers as they do post mortem analysis for their superiors.
Provide at least one week of historical data, ideally at least a month. Example #1, Example #2, Example #3, Example #4 (notice each service has an archive link).
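Historical uptime for a week or a month can be derived directly from the recorded downtime events. The sketch below assumes outages are stored as non-overlapping (start, end) pairs; the function name and that storage format are illustrative assumptions.

```python
from datetime import datetime, timedelta


def uptime_percent(outages, window_start, window_end):
    """Percentage of the window during which the service was up.

    outages is a list of (start, end) datetimes, assumed not to overlap.
    Outages are clipped to the window so partial overlaps count correctly.
    """
    total = (window_end - window_start).total_seconds()
    down = 0.0
    for start, end in outages:
        overlap_start = max(start, window_start)
        overlap_end = min(end, window_end)
        if overlap_end > overlap_start:
            down += (overlap_end - overlap_start).total_seconds()
    return 100.0 * (total - down) / total


# Example: a 30 day window containing a single 30 minute outage.
end = datetime(2008, 12, 1)
start = end - timedelta(days=30)
outage = (datetime(2008, 11, 15, 9, 0), datetime(2008, 11, 15, 9, 30))
print(f"{uptime_percent([outage], start, end):.3f}%")
```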

Rule #6: Provide a way to be notified of status changes

RSS/email/SMS/Twitter/APIs/etc. It's too early to know how users will want to consume this information, but my opinion is that the two most useful options would be to allow email alerts on downtime, and an API that allows users to build their applications to work around the downtime automatically.
Currently many status dashboards provide RSS feeds. Example #1 (even provides email and Twitter alerts!), Example #2.
Along these same lines, providing advanced notice of upcoming maintenance windows is extremely useful. I would hope these are announced in other mediums as well (e.g. email).
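To make the idea of an application working around downtime concrete, here is a sketch of a client that consults a status payload before calling the service and queues its work otherwise. Everything about the payload is an assumption: no provider discussed here is known to expose this JSON format, and the service key, field names, and helper functions are hypothetical.

```python
import json


def should_use_service(status_json, service):
    """True if the (hypothetical) status payload reports the service as up."""
    payload = json.loads(status_json)
    state = payload.get("services", {}).get(service, "unknown")
    return state == "up"


def save_record(record, status_json, send, queue):
    """Send the record if the service is up, otherwise queue it for later."""
    if should_use_service(status_json, "api"):
        send(record)
        return "sent"
    queue.append(record)
    return "queued"


sent, pending = [], []
status = '{"services": {"api": "down", "web": "up"}}'
print(save_record({"id": 1}, status, sent.append, pending))  # queued
```

Treating an unknown or missing state as "not up" is a deliberately cautious choice in this sketch; a real client might prefer to try the call anyway.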

Rule #7: Provide details on how the data is gathered

What is the uptime and performance data worth if we have no insight into where it comes from? Currently with most health dashboards we have to assume either the provider built their own monitoring platform, or that they are making status updates manually (Zoho is the one exception, since they have their own monitoring service).
Beyond simply knowing where the data comes from, what exactly does "Performance issues" mean? What are the thresholds that determine that a service is considered to be "disrupted"? From what locations is the monitoring done? Am I out of luck if I live in Asia and the monitoring is done from New York?
It would be extremely useful to have the data validated by a third party, especially as this gets into the world of SLAs. We can't have the fox watching the hen house when it comes to money.
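Publishing the thresholds can be as simple as keeping them in one documented place that the dashboard both uses and displays. The numbers, locations, and function name below are purely illustrative assumptions; no provider discussed here has published rules like these.

```python
# Assumed thresholds, shown only to illustrate making the rules explicit.
SLOW_SECONDS = 2.0           # a response slower than this counts as slow
MIN_LOCATIONS_FAILING = 2    # failures needed before declaring a disruption


def classify(samples):
    """Map per-location results to a published status label.

    samples maps a monitoring location to its response time in seconds,
    or to None if the check from that location failed outright.
    """
    failing = [loc for loc, t in samples.items() if t is None]
    slow = [loc for loc, t in samples.items() if t is not None and t > SLOW_SECONDS]
    if len(failing) >= MIN_LOCATIONS_FAILING:
        return "Service disruption"
    if failing or slow:
        return "Performance issues"
    return "Service is operating normally"


print(classify({"New York": 0.4, "London": 0.5, "Tokyo": 3.5}))  # Performance issues
```

With the rules written down like this, a user in Asia can at least see whether any monitoring location near them feeds into the status they are shown.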

The future

For those seeking to truly be ahead of the curve and open up the kimono further, I suggest the following rules as well:

Provide geographical uptime and performance data (Zoho is ahead of the game on this one). The more information you provide to your users publicly, the fewer questions you'll have to deal with privately.
The status page should be hosted externally, at a different location from your primary data center. This should be obvious, but I doubt many companies consider this a problem. The last thing you want when your primary data center goes down is to have to field calls that could be avoided if your status page were still up.
Break out each individual service and function as much as possible. Similar to how Flickr opened up their API to match up with practically every internal function call, allow your users to have insight into the very specific functionality they need.
Connect your downtime events to your SLAs. Allow your users to easily track how you're doing compared to what you promised. The days of hoping that your users forget about this are over.
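Tracking downtime against an SLA comes down to simple arithmetic: a promised uptime percentage implies a downtime allowance for the period. The sketch below assumes a 99.9% promise and a 30 day period purely as example inputs; the function and key names are hypothetical.

```python
def sla_report(promised_percent, downtime_minutes, period_minutes):
    """Compare measured downtime with the allowance implied by the SLA."""
    allowed = period_minutes * (100.0 - promised_percent) / 100.0
    actual = 100.0 * (period_minutes - downtime_minutes) / period_minutes
    return {
        "allowed_downtime_minutes": allowed,
        "actual_uptime_percent": actual,
        "sla_met": downtime_minutes <= allowed,
    }


# Example inputs (assumed): a 99.9% promise over 30 days with 60 minutes of downtime.
report = sla_report(99.9, downtime_minutes=60, period_minutes=30 * 24 * 60)
print(report)
```

A 99.9% promise over 30 days leaves roughly 43 minutes of allowed downtime, so a single hour-long outage is enough to miss it.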

I hope the above advice provides value to companies out there considering their own health dashboards. I would love to hear from SaaS providers already providing health dashboards, especially those I haven't already linked to. I'd especially love to hear some feedback from companies on the benefits they've seen in providing a health dashboard, in customer feedback, reduced costs, or competitive advantages.

I'm excited to see over the next few months how things change in this space, what rules become most important to users, and how online service providers respond to the oncoming demand for transparency. Time will tell, but I do know this is only the beginning.

For reference, some of the public health dashboards I referenced in this post:

Saturday, November 15, 2008

Comparing Amazon Web Services, Salesforce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on its way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking at today?

What is the URL of each status page (and are they easy to remember in times of need)?

Amazon Web Services: http://status.aws.amazon.com/ (Somewhat easy to remember)
Salesforce: http://trust.salesforce.com/ (Easy to remember)
Zoho: http://status.zoho.com/ (Extremely easy to remember)

What are these status pages called?

Amazon Web Services: "AWS Service Health Dashboard"
Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com" brand)
Zoho: "Zoho Service Health Status"

What services' health are reported on?

Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.

What health information is provided?

Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be either "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response from six geographical locations is also given.

Where does the uptime and performance data come from?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: Their own "Site 24x7" monitoring service.

What is considered downtime and what is considered a performance issue?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: No clue.

Are real time updates provided during downtime events? Is it easy to find?

Amazon Web Services: Yes, but unclear how consistently and how easy it is to find that information.
Salesforce: Yes, right underneath the current status.
Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.

Is information provided on past downtime events?

Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
Zoho: No. Unless they are described in the blog.

Is there a way to easily report problems users are having?

Amazon Web Services: Yes, clicking the "Report an Issue" link.
Salesforce: No, other than using the standard support channels.
Zoho: No, other than using the standard support channels.

How can you get notified of problems (without watching this page 24/7)?

Amazon Web Services: Ability to subscribe to RSS feeds for change in status of each service.
Salesforce: No.
Zoho: No.

Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy to understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into what customers may be affected by an event, but does little in offering insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesforce completely lack insight into how they collect the health information, and all three give no information on what is meant by downtime or performance problems.

Closing questions for each provider:

Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to and why not cover the APIs for the other services?
Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
To all: How is the health actually monitored (especially for the GUI-focused Salesforce and Zoho services)? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.

Notable mentions: The following services also offer up a health dashboard page, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more than happy to. Just leave a note in the comments.
