
Thursday, February 11, 2010

GitHub System Status page, the most fun status page yet


What they are doing well:
  1. Big green banner clearly telling you that things are OK (and I assume a red banner when they are not)
  2. Easy to find (http://status.github.com/)
  3. Linked to their twitter account, which is being used effectively for real time updates.
  4. Ability to easily notify github of an issue from the status page.
  5. Fun, well designed.
What they could improve:
  1. Add automated real-time health status; it currently relies on manual updates (see the sketch after this list).
  2. More detail on which parts of the system are up or down; currently it's all or nothing.
  3. A Google search for "GitHub Status" returns a link to an old destination as the first result, which doesn't point users to this new site.
  4. The link from github.com to the status page is buried at the bottom. Make sure users can find it when they need it.
  5. Add an RSS feed.
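For what it's worth, improvement #1 doesn't need to be complicated. Here's a minimal sketch of an external checker whose output a status page (or an RSS generator) could consume automatically; the endpoints and the status.json file are placeholders I picked for illustration, not anything GitHub actually exposes:

    # Hypothetical external checker: probes a couple of public endpoints and
    # writes a machine-readable status file a dashboard could poll.
    import json, time, urllib.request

    ENDPOINTS = {
        "web": "https://github.com/",        # placeholder checks, not GitHub's
        "api": "https://api.github.com/",    # real monitoring endpoints
    }

    def check(url, timeout=5):
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = 200 <= resp.status < 400
        except Exception:
            ok = False
        return {"ok": ok, "latency_ms": round((time.time() - start) * 1000)}

    status = {name: check(url) for name, url in ENDPOINTS.items()}
    status["checked_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

    # A status page could read this file and flip the big banner to red on
    # its own, and a feed generator could turn state changes into RSS items.
    with open("status.json", "w") as f:
        json.dump(status, f, indent=2)

Run something like this every minute from a box outside your own infrastructure and you've covered both the automation and the RSS items on the list above.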

Saturday, July 4, 2009

Cloud and SaaS SLAs

Daniel Druker over at the SaaS 2.0 blog recently posted an extremely thorough description of what we should be expecting from Cloud and SaaS services when it comes to SLAs:
In my experience, there are four key areas to consider in your SLA:

First is addressing control: The service level agreement must guarantee the quality and performance of operational functions like availability, reliability, performance, maintenance, backup, disaster recovery, etc that used to be under the control of the in-house IT function when the applications were running on-premises and managed by internal IT, but are now under the vendor's control since the applications are running in the cloud and managed by the vendor.

Second is addressing operational risk: The service level agreement should also address perceived risks around security, privacy and data ownership - I say perceived because most SaaS vendors are actually far better at these things than nearly all of their clients are. Guaranteed commitments to undergoing regular SAS70 Type II audits and external security evaluations are also important parts of mitigating operational risk.

Third is addressing business risk: As cloud computing companies become more comfortable with their ability to deliver value and success, more of them will start to include business success guarantees in the SLA - such as guarantees around successful and timely implementations, the quality of technical support, business value received and even to money back guarantees - if a client isn't satisfied, they get their money back. Cloud/SaaS vendors can rationally consider offering business risk guarantees because their track record of successful implementations is typically vastly higher than their enterprise software counterparts.

Last is penalties, rewards and transparency: The service level agreement must have real financial penalties / teeth when an SLA violation occurs. If there isn't any pain for the vendor when they fail to meet their SLA, the SLA doesn't mean anything. Similarly, the buyer should also be willing to pay a reward for extraordinary service level achievements that deliver real benefits - if 100% availability is an important goal for you, consider paying the vendor a bonus when they achieve it. Transparency is also important - the vendor should also maintain a public website with continuous updates as to how the vendor is performing against their SLA, and should publish their SLA and their privacy policies. The best cloud vendors realize that their excellence in operations and their SLAs are real selling points, so they aren't afraid to open their kimonos in public.
Considering the sad state of affairs in existing SLAs, I'm hoping to see some progress here from the big boys, if nothing else as a competitive advantage as they try to differentiate themselves.
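To make the "real financial penalties" point concrete, here's a quick sketch of how an uptime-based service credit might be computed. The 99.9%/99.0% tiers and the credit percentages are invented for illustration; a real SLA defines its own schedule:

    # Hypothetical SLA credit calculation; the tiers and percentages are
    # made up for this example.
    def monthly_uptime(downtime_minutes, days_in_month=30):
        total_minutes = days_in_month * 24 * 60
        return 100.0 * (total_minutes - downtime_minutes) / total_minutes

    def service_credit(uptime_pct):
        if uptime_pct >= 99.9:
            return 0    # SLA met, no credit
        if uptime_pct >= 99.0:
            return 10   # 10% of the monthly fee
        return 25       # 25% of the monthly fee

    uptime = monthly_uptime(downtime_minutes=130)   # about 2 hours 10 minutes down
    print(f"uptime {uptime:.2f}% -> {service_credit(uptime)}% credit")

With 130 minutes of downtime in a 30-day month that works out to roughly 99.70% uptime and, under this made-up schedule, a 10% credit. The exact numbers matter far less than the fact that missing the target costs the vendor something.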

Tuesday, February 24, 2009

I spy with my little eye... Mosso working on a health status dashboard

The transparency that Twitter brings is awesome:


I'm looking forward to seeing how many of the rules of successful health status dashboards they follow!

Wednesday, February 11, 2009

Transparency User Story #1: Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.

Note: This post is the first in a series of at least a dozen posts where I attempt to drill into the transparency user stories described in an earlier post.

Let's assume that you've decided you want to make your service or organization more transparent, specifically when it comes to its uptime and performance. You've convinced your management team, you've got the engineering and marketing resources, and you're raring to go. You want to get something done and can't wait to make things happen. Stop right there. Do not pass go, do not collect $200. You first need to figure out what it is you're solving for. What problems (and opportunities) do you want to tackle in your drive for transparency?

Glad you asked. There are about a dozen user stories that I've listed in a previous post that describe the most common problems transparency can solve. In this post, I will dive into the first user story:
As an end user or customer, it looks to me like your service is down. I'd like to know if it's down for everyone or if it's just me.
Very straightforward, and very common. So common that there are even a couple of simple free web services out there that help people figure this out. Let's assume for this exercise that your site is up the entire time.

Examples of the problem in action
  1. Your customer's Internet connection is down. He loads up www.yoursite.com and it won't load. He thinks you are down, and calls your support department demanding service.
  2. Your customer's DNS servers are acting up, and are unable to resolve www.yoursite.com inside his network. He finds that google.com is loading fine, and is sure your site is down. He sends you an irate email.
  3. Your customer's network routes are unstable, causing inconsistent connectivity to your site. He loads www.yourcompetitor.com fine, while www.yoursite.com fails. He Twitters all about it.
Why this hurts your business
  1. Unnecessary support calls.
  2. Unnecessary support emails.
  3. Negative word of mouth that is completely unfounded.
How to solve this problem
  1. An offsite public health dashboard (a bare-bones version is sketched after this list)
  2. A known third party, such as this and this or eventually this
  3. A constant presence across social media (Twitter especially) watching for false reports
  4. Keeping a running blog noting any downtime events, which tells your users that unless something is posted, nothing is wrong. You must be diligent about posting when something actually is wrong, however.
  5. Share your real-time performance data with your large customers. Your customers may even want to be alerted when you go down.
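As a bare-bones illustration of the offsite check behind item 1, here's a sketch of a probe meant to run outside your own network and datacenter; www.yoursite.example is a placeholder standing in for the hypothetical site in the scenarios above:

    # Hypothetical offsite probe: run this from a vantage point you don't
    # host yourself, and publish the result where customers can see it.
    import datetime, urllib.request

    SITE = "https://www.yoursite.example/"   # placeholder URL

    def probe(url, timeout=5):
        # DNS lookup, TCP connect, and HTTP request in one call; any failure
        # counts as "down from this vantage point".
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 500
        except Exception:
            return False

    stamp = datetime.datetime.utcnow().isoformat() + "Z"
    state = "UP" if probe(SITE) else "DOWN"
    print(f"{stamp} {SITE} is {state} from this vantage point")

The point is simply that the check answers the same question your customer is asking: is it down for everyone, or just for me?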
Example solutions in the real world
  1. Many public health dashboards
  2. The QuickBooks team notifying users that their service was back up
  3. Sharing your monitoring data in real time with your serious customers
  4. Searching Twitter for outage discussion

Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer
  • Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric, which seems to test the overall service externally.
  • I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
  • Conclusion: Met!
Rule #2: Data must be accurate and timely
  • Hard to say until an event occurs or we hear feedback about this from users.
  • The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
  • The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
  • In addition, the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or left undisclosed. That's bad news. Though it does fit with their SLA questions that came up recently.
  • Conclusion: Time will tell (but promising)
Rule #3: Must be easy to find
  • If I were experiencing a problem with App Engine, I would first go to the homepage here. Unfortunately I don't see any link to the system status page. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
  • The URL to the system status page (http://code.google.com/status/appengine/) is not easy to remember. Since Google doesn't seem to own appengine.com, this may not be easy to fix, but that doesn't matter to a user who's in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would think that it will rise to #1 very soon.
  • Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).
Rule #4: Must provide details for events in real time
  • Again, hard to say until we see an issue occur.
  • What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
  • Conclusion: Time will tell.
Rule #5: Provide historical uptime and performance data
  • Great job with this. I dare say they've jumped ahead of every other cloud service in the amount and detail of performance data they provide.
  • Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
  • Conclusion: Met!
Rule #6: Provide a way to be notified of status changes
Rule #7: Provide details on how the data is gathered
  • Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", there's no real information on how this data is collected, how often it is updated, or where the monitoring happens from (a rough sketch of one possible approach follows this section).
  • Conclusion: Not met.
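Since Google doesn't say how the data is gathered, here's a hedged sketch of what one external collector for a per-service dashboard could look like. The service names, probe functions, and SQLite storage are all my own placeholders, not how App Engine's dashboard actually works:

    # Illustrative collector only; the probes below are stand-ins for real checks.
    import sqlite3, time

    def probe_serving():    # placeholder check for the overall "Serving" metric
        return True

    def probe_datastore():  # placeholder check for a single API
        return True

    PROBES = {"Serving": probe_serving, "Datastore": probe_datastore}

    db = sqlite3.connect("status_history.db")
    db.execute("CREATE TABLE IF NOT EXISTS samples (ts INTEGER, service TEXT, ok INTEGER)")

    now = int(time.time())
    for name, probe in PROBES.items():
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        db.execute("INSERT INTO samples VALUES (?, ?, ?)", (now, name, int(ok)))
    db.commit()

    # Historical uptime per service (Rule #5) falls out of the same table.
    for name in PROBES:
        avg = db.execute("SELECT AVG(ok) FROM samples WHERE service = ?", (name,)).fetchone()[0]
        print(name, f"{100 * (avg or 0):.2f}% of samples healthy")

Even a paragraph describing something at this level of detail (what is probed, from where, and how often) would be enough to satisfy Rule #7.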
Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a very complete and extremely useful central place for their customers to come to in time of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever-growing need.