Transparent Uptime: uptime

Showing posts with label uptime. Show all posts

Tuesday, February 16, 2010

The Tao of Web Performance and Uptime

Who cares about fast web pages. Who cares about uptime. I mean really.

Does it truly matter to you whether a page loads a couple seconds faster? Are we wasting our lives keeping servers up 99.999%? Are we making an impact on the world in a meaningful way? Does it actually matter in the scheme of things?

I think the answer is yes. It does matter. I matters a lot. But it isn't for the reasons you think.

What would most people answer when asked "Why are you spending your life making web pages faster and keeping servers up?" Here are my guesses:

It's my job

I'll get fired. I'll let my peers down. I'll hurt the company. My boss will be mad at me if I don't do what I'm told. All good reasons (unless you've read Seth Godin's new book). But do these reasons make you honestly care about web page performance? Does it make you happy to spend your precious lifetime keeping servers running all hours of the night? The real question is not whether performance and uptime matter. The question that you should be asking is: Does performance and uptime matter to you, as a human being? If your answer to this relies on this being your job, that caring about it provides security and comfort, then the answer is no. You won't find fulfillment in your work if your motivation is being a good cog.

It's my company

Being the person making money off of the cogs (as they improve page performance and keep the system stable) changes the equation. No doubt, keeping your servers up is critical to the success of your online business (usually). Furthermore, the ROI of page performance is fairly conclusive. Clearly, uptime and performance lead to more revenue (or at least less lost revenue). But what do you actually care about in this equation? How quickly the pages load, or how much money you're making as a result of those faster pages? You don't want to spend your time optimizing pages all day. You want to get done with it as quickly as possible and get back to doing business. You certainly care about page performance and uptime, but only as a tool, a necessary evil, that helps you optimize your true passion (whatever that may be).

It's fun

You look at tuning performance or building highly reliable systems as a puzzle. You enjoy the work because you are good at it, or you want to accomplish something no one else has in the past. Your motivation is either the thrill of the problem or personal brand building in your group/company/industry. You enjoy performance and uptime for the opportunity that it offers, and the feeling of accomplishment that it brings when you improve performance by 23% or keep the system up during a marketing blitz. This explanation gets close to being a good reason, but it lacks something. It's selfish. It focuses on you. It doesn't give you a purpose, or impact the world in a meaningful way. Fun will take you so far, but at some point you'll wonder "what's the point?" and move on to the next challenge.

It makes people happy

This is it. This is why it matters. This is why it is worth spending your time making web pages faster and keeping servers up. To put it simply, it make people happier. To quote Matt Mullenweg, founder of Wordpress:

"That's why [performance] is important and why we should be obsessed and not be discouraged when it doesn't change the funnel. My theory here is when an interface is faster, you feel good. And ultimately what that comes down to is you feel in control. The web app isn't controlling me, I'm controlling it. Ultimately that feeling of control translates to happiness in everyone. In order to increase the happiness in the world, we all have to keep working on this. "

How can we quantify this? We have data showing that a slower google.com and bing.com results in less searches, and more importantly that user satisfaction goes down with each additional performance decrease. AOL shows us that page views drop off as page load times increase. Optimizations to Google Maps increasing user interaction with the site significantly. The faster the site, the more you want to use it. Let's delve into more evidence...

Flow

If you haven't yet come across the concept of flow:

"Flow is the mental state of operation in which the person is fully immersed in what he or she is doing by a feeling of energized focus, full involvement, and success in the process of the activity. Proposed by Mihály Csíkszentmihályi, the positive psychology concept has been widely referenced across a variety of fields.

According to Csíkszentmihályi, flow is completely focused motivation. It is a single-minded immersion and represents perhaps the ultimate in harnessing the emotions in the service of performing and learning. In flow the emotions are not just contained and channeled, but positive, energized, and aligned with the task at hand. To be caught in the ennui of depression or the agitation of anxiety is to be barred from flow. The hallmark of flow is a feeling of spontaneous joy, even rapture, while performing a task."

How does performance and uptime relate to flow? Researchers asked this very question and found some unsurprising results:

"Hoffman, Novak, and Yung found that the speed of interaction had a“direct positive influence on flow” on feelings of challenge and arousal (which directly influence flow), and on importance. Skill, control, and time distortion also had a direct influence on flow.

The researchers then applied their model to consumer behavior on the web. They tested web applications (chat, newsgroups, and so on) and web shopping, asking subjects to specify which features were most important when shopping on the web.

They found that speed had the greatest effect on the amount of time spent online and on frequency of visits for web applications. For repeat visits, the most important factors were skill/control, length of time on the web, importance, and speed.

So to make your site compelling enough to return to, make sure that it offers a perceived level of control by matching challenges to user skills, important content, and fast response times."

When asked about the importance of speed on flow, Csikszentmihalyi offers:

"If you mean the speed at which the program loads, the screens change, the commands are carried out—then indeed speed should correlate with flow. If you are playing a fantasy game, for instance, and it takes time to move from one level to the next, then the interruption allows you to get distracted, to lose the concentration on the alternate reality. You have time to think: “Why am I wasting time on this? Shouldn’t I be taking the dog for a walk, or studying?”— and the game is over, psychologically speaking."

Clearly, speed plays a key role in attaining flow. If you believe (as I do) that flow is a good thing, and brings on happiness, then giving your visitors the chance to enter a flow state is a worthwhile pursuit.

Usability

From the guru of web usability, Jakob Nielsen:

"Every web usability study I have conducted since 1994 has shown the same thing: Users beg us to speed up the page downloads. In the beginning my reaction was along the lines of "Let's just give them better design, and they will be happy to wait for it." I have since become a reformed sinner believing that fast response times are the most important design criterion for web pages; even my skull isn't thick enough to withstand consistent user pleas year after year."

Users are begging us to speed up page load times!

User Psychology

A good number of studies have further connected slow web pages (and unreliable web applications) with frustration and higher blood pressure.

From "Determining causes and severity of end-user frustration":

"slow response time generated higher ratings of frustration and impatience"
"Frustration occurs at an interruption or inhibition of the goal-attainment process, where a barrier or conflict is put in the path of an individual"
"Slow websites inhibit users from reaching their goals, causing frustration"

From "Visitors’ ﬂow experience while browsing a Web site: its measurement, contributing factors and consequences":

"It was found that in the context of human–computer interactions while browsing a Web site, ﬂow experience was characterized by time distortion, enjoyment, and telepresence."

A study done by Forrester Consulting (on behalf of Akamai):

"finds that website performance has a direct impact on revenues, profits and satisfaction."
"The findings indicate that website performance is second only to security in user expectations"

Drive

Daniel Pink's new book Drive argues that "the biggest motivator at work is making progress" (link). Anything that gets in the way of you making progress makes you less happy. As the web becomes a bigger part of where work is done (be it SaaS, the cloud, or Twitter), the more important the speed and reliability of those web sites becomes. Progress, motivation, and happiness will be increasingly tied to the performance and stability of the web.

Still not convinced?

Let's look at the flip side. Slow and unreliable sites make people very upset:

Downtime creates pain and frustration. Slow web applications piss people off. Ironically the more popular your site, and the more useful it is to your users, more unhappiness you can cause.

This unhappiness does not end at your firewall. I don't have to tell you how stressful downtime is internally. Your peers have to work nights, your boss has to explain what happened to their boss. If you are dogfooding your applications, internal productivity is affected from both downtime and bad performance. Simply put, the absence of uptime and bad web performance creates a lot of unhappy people.

Where does this leave us?

Let's ask the same question we asked earlier:

"Does it truly matter to you whether a page loads a couple seconds faster? Are we wasting our lives keeping servers up 99.999%? Are we making an impact on the world in a meaningful way? Does it actually matter in the scheme of things?"

Hopefully I've convinced you that there is a strong link between performance/stability and happy users. That's all well and good, but does this matter in the scheme of things? Two stats stand out to me:

The number of Internet users worldwide: 1,733,993,741
The amount of time spent online per week: 13 hours

If we as an industry can impact the happiness of almost 2 billion people by making the web a little bit faster, or a little more stable, I say that this does indeed impacts the world in a meaningful way. By plugging away at our little problems, our minor tweaks, our tools and tricks, we are helping our users, and the world at large, become a happier place.

"Happiness is the meaning and the purpose of life, the whole aim and end of human existence" -- Aristotle

Saturday, July 4, 2009

Cloud and SaaS SLA's

Daniel Druker over at the SaaS 2.0 blog recently posted an extremely thorough description of what we should be expecting from Cloud and SaaS services when it comes to SLA agreements:

In my experience, there are four key areas to consider in your SLA:

First is addressing control: The service level agreement must guarantee the quality and performance of operational functions like availability, reliability, performance, maintenance, backup, disaster recovery, etc that used to be under the control of the in-house IT function when the applications were running on-premises and managed by internal IT, but are now under the vendor's control since the applications are running in the cloud and managed by the vendor.

Second is addressing operational risk: The service level agreement should also address perceived risks around security, privacy and data ownership - I say perceived because most SaaS vendors are actually far better at these things than nearly all of their clients are. Guaranteed commitments to undergoing regular SAS70 Type II audits and external security evaluations are also important parts of mitigating operational risk.

Third is addressing business risk: As cloud computing companies become more comfortable with their ability to deliver value and success, more of them will start to include business success guarantees in the SLA - such as guarantees around successful and timely implementations, the quality of technical support, business value received and even to money back guarantees - if a client isn't satisfied, they get their money back. Cloud/SaaS vendor can rationally consider offering business risk guarantees because their track record of successful implementations is typically vastly higher than their enterprise software counterparts.

Last is penalties, rewards and transparency: The service level agreement must have real financial penalties / teeth when an SLA violation occurs. If there isn't any pain for the vendor when they fail to meet their SLA, the SLA doesn't mean anything. Similarly, the buyer should also be willing to pay a reward for extraordinary service level achievements that deliver real benefits - if 100% availability is an important goal for you, consider paying the vendor a bonus when they achieve it. Transparency is also important - the vendor should also maintain a public website with continuous updates as to how the vendor is performing against their SLA, and should publish their SLA and their privacy policies. The best cloud vendors realize that their excellence in operations and their SLAs are real selling points, so they aren't afraid to open their kimonos in public.

Considering the sad state of affairs in existing SLA's, I'm hoping to see some progress here from the big boys, if nothing else as a competitive advantage as they try to differentiate themselves.

Saturday, January 10, 2009

How transparency can help your business

When looking to gain the benefits of transparency (into your downtime and performance issues), you first need to understand the use cases (or more accurately, the user st o ries) that describe the problems that transparency can solve. It's easy to put something out there looking for the press and marketing benefits. It's a lot more challenging (and beneficial) to understand what transparency can do for your business, and then actually solve those problems.

Transparency user stories

As an end user/customer:

Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.
I know your service is down, and I want to know when it'll be back up.
I want some kind of explanation of why you went down.

As a business customer using your service as part of my own service offering:

Before betting my business on your service/platform, I need to know how reliable it has been.
My own customers are reporting that my service is down, but everything looks fine on my end. I need to know if your service is down, and if so I need information to keep my customers up to date.
I want to find which link in my ecosystem of external services is broken or slow right away.
One of my customers reported a problem in the past, and I'd like to correlate it with hiccups your service may have had in the past.
I need to know well in advance of any upcoming maintenance windows.
I need to know well in advance if you plan to change any features that are critical to me, or if the performance of the service will change.

As a SaaS provider:

I want my customers (and my prospects) to trust my service. I don't want my customers to lose that trust if I ever go down.
My support department gets flooded with calls and emails during a downtime event.
I want to understand what the uptime and performance of my services are at all times from around the world. Both for internal reasons, and to help my customers diagnose issues they are reporting.
I want to differentiate from my competition based on reliability and customer support.

In the next post, I will dive into ways to attack each of these user stories. Stay tuned.

Sunday, January 4, 2009

A comprehensive list of SaaS public health dashboards

To anyone looking to build a public health dashboard for their own online service, the following list should give you a head start in understanding what's out there. I also keep an up-to-date list in my delicious account that you can reference at any time. I would suggest reviewing the examples below when coming up with your own design, potentially combining the various approaches to create something truly useful to your customers.

Note: This list is divided up into three tiers. The tiers are determined by a rough combination of company size, service popularity, importance to the general public, and quality of the end result.

Tier One

AWS Service Health Dashboard (http://status.aws.amazon.com/)
Trust.salesforce.com - System Status (http://trust.salesforce.com/trust/status/)
Zoho Service Health Status (http://status.zoho.com/)
OpenDNS System Status (http://system.opendns.com/)
OpenSRS System Status (http://status.opensrs.com/)
Google App Engine System Status (http://code.google.com/status/appengine)

Tier Two

QuickBase Service Status (http://service.quickbase.com/updates.aspx)
NetSuite System Status (http://status.netsuite.com/status.html)
Mogulus Service Health (http://www.mogulus.com/support/servicehealth)
Skype Heartbeat (http://heartbeat.skype.com/)
BlueTie Real Time Status Center (http://support.bluetie.com/?q=node/819)

Tier Three

Twitter Status (http://status.twitter.com/)
SAP System Status (http://www.sytecpa.org/technical/systemStatus.asp)
University of Florida Service Monitoring (http://open-systems.ufl.edu/status/)
Capitalserv - Current service status (http://www.capitalserv.com/current_service_status.aspx)
Everyone.net Email Service Status (http://www.everyone.net/main/scripts/status.cgi)
MSN Messenger Service Status (http://messenger.msn.com/Status.aspx)
Federal Reserve Financial Services Service Status (http://www.frbservices.org/app/status/serviceStatus.do)
Boardhost Service Status (http://status.boardhost.com/)
Primus System Status (http://systemstatus.iprimus.com.au/)

Non-dashboard system status pages

World of Warcraft Service Status (http://forums.worldofwarcraft.com/board.html?forumId=11113&sid=1)
Second Life Grid Status Reports (http://status.secondlifegrid.net/)
GitHub Status (http://github.wordpress.com/)
The WELL System Status (http://www.well.com/status.html)
MODIS Rapid Response System - System Status (http://rapidfire.sci.gsfc.nasa.gov/status/)

Don't forget to also review the seven keys to a successful health dashboard, especially since not one public dashboard I've come across meets all of the rules.

Again, the full list can always be found here. If I missed any public dashboards, I'd love to know...simply point to them in the comments and I'll make sure to add them to the list.

Monday, December 22, 2008

Comprehensive review of SaaS SLAs - A sad state of affairs

A recent story about the holes in Google's SLA got me wondering about the state of service level agreements in the SaaS space. The importance of SLA's in the enterprise online world are obvious. I'm sad to report that of the state of the union is not good. Of the handful of major SaaS players, most have no SLAs at all. Of those that do, the coverage is extremely loose, and the penalty for missing the SLAs is weak. To make my point, I've put together an exhaustive (yet pointedly short) list of the SLAs that do exist. I've extracted the key elements and removed the legal mumbo-jumbo (for easy consumption). Enjoy!

Comparing the SLAs of the major SaaS players

Google Apps:

What: "web interface will be operational and available for GMail, Google Calendar, Google Talk, Google Docs, and Google Sites"
Uptime guarantee: 99.9%
Time period: any calendar month
Penalty: 3, 7, or 15 days of service at no charge, depending on the monthly uptime percentage
Important caveats:

"Downtime" means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate.
"Downtime Period" means, for a domain, a period of ten consecutive minutes of Downtime. Intermittent Downtime for a period of less than ten minutes will not be counted towards any Downtime Periods.

Amazon S3:

What: Amazon Simple Storage Service
Uptime guarantee: 99.9%
Time period: "any monthly billing cycle"
Penalty: 10-25% of total charges paid by customer for a billing cycle, based on the monthly uptime percentage
Important caveats:

“Error Rate” means: (i) the total number of internal server errors returned by Amazon S3 as error status “InternalError” or “ServiceUnavailable” divided by (ii) the total number of requests during that five minute period. We will calculate the Error Rate for each Amazon S3 account as a percentage for each five minute period in the monthly billing cycle. The calculation of the number of internal server errors will not include errors that arise directly or indirectly as a result of any of the Amazon S3 SLA Exclusions (as defined below).
“Monthly Uptime Percentage” is calculated by subtracting from 100% the average of the Error Rates from each five minute period in the monthly billing cycle.
"We will apply any Service Credits only against future Amazon S3 payments otherwise due from you""

Amazon EC2:

What: Amazon Elastic Compute Cloud service
Uptime guarantee: 99.95%
Time period: "the preceding 365 days from the date of an SLA claim"
Penalty: "a Service Credit equal to 10% of their bill for the Eligible Credit Period"
Important caveats:

“Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability. Any downtime occurring prior to a successful Service Credit claim cannot be used for future claims. Annual Uptime Percentage measurements exclude downtime resulting directly or indirectly from any Amazon EC2 SLA Exclusion (defined below).
“Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.

...that's it!

Notable Exceptions (a.k.a. lack of an SLA)

Salesforce.com (are you serious??)
Google App Engine (youth will only be an excuse for so long)
Zoho
Quickbase
OpenDNS
OpenSRS

Conclusions
There's no question that for the enterprise market to get on board with SaaS in any meaningful way accountability is key. Public health dashboards are one piece of the puzzle. SLAs are the other. The longer we delay in demanding these from our key service providers (I'm looking at you Salesforce), the longer and more difficult the move into the cloud will end up being. The incentive in the short term for a not-so-major SaaS player should be to take the initiave and focus on building a strong sense of accountability and trust. As it begins to take business away from the more established (and less trustworthy) services, the bar will rise and customers will begin to demand these vital services from all of their providers. The day's of weak or non-existant SLAs for SaaS providers are numbered.

Disclaimer: If I've misrepresented anything above, or if your SaaS service has a strong SLA, please let us know in the comments. I really hope someone out there is working to raise the bar on this sad state.

Tuesday, December 16, 2008

Google App Engine System Status - A Review

Building off of the rules for a successful public health dashboard, let's see what Google did well, what they can improve, and what questions remain:

Rule #1: Must show the current status for each "service" you offer

Considering this is meant to cover only the App Engine service, and not any other Google service, I would say they accomplished their goal. Every API they offer appears to be covered, in addition to the "Serving" metric which appears to test the overall service externally.
I appreciate the alphabetic sorting of services, but I would suggest making the "Serving" status a bit more prominent as that would seem to be by far the most important metric.
Conclusion: Met!

Rule #2: Data must be accurate and timely

Hard to say until an event occurs or we hear feedback about this from users.
The announcement does claim the data is an "up-to-the-minute overview of our system status with real-time, unedited data." If this is true, this is excellent news.
The fact that an "Investigating" status is an option tells me that the status may not always be real-time or unedited. Or I may just be a bit too paranoid :)
In addition the fact that "No issues" and "Minor performance issues" are both considered healthy tells us that issues Google considers "minor" will be ignored or non-transparent. That's bad news. Though it does fit with their SLA questions that came up recently.
Conclusion: Time will tell (but promising)

Rule #3: Must be easy to find

If I were experiencing a problem with App Engine, I would first go to the homepage here. Unfortunately I don't see any link to the system status page. A user would either have to stumble upon the blog post announcing this page, or work through the forum...defeating the purpose of the system status page!
The URL to the system status (http://code.google.com/status/appengine/) page is not easy to remember. Since Google doesn't seem to own appengine.com, this is may not be easy to fix, but that doesn't matter to a user that's in the middle of an emergency and needs to figure out what's going on. The good news is that at the time of this writing, a Google search for "google app engine status" has the status page as the third result, and I would think that it will raise to #1 very soon.
Conclusion: Not met (but easy to fix by adding a link from the App Engine homepage).

Rule #4: Must provide details for events in real time

Again, hard to say until we see an issue occur.
What I'm most interested in is how much detail they provide when an event does occur, and whether they send users over to the forums or to the blog, or simply provide the information on the status page.
Conclusion: Time will tell.

Rule #5: Provide historical uptime and performance data

Great job with this. I dare say they've jumped head of every other cloud service in the amount and detail on performance data they provide.
Still unclear how much historical data will be maintained, but even 7 days is enough to satisfy me.
Conclusion: Met!

Rule #6: Provide a way to be notified of status changes

Nada here, beyond pointing people to the Downtime Notify Google Group.
Conclusion: Not met.

Rule #7: Provide details on how the data is gathered

Beyond the mention that they are "using some of the same raw monitoring data that our engineering team uses internally", no real information on how this data is collected, how often it is updated, or where the monitoring happens from.
Conclusion: Not met.

Overall, in spite of more rules being missed than met, the more difficult requirements are looking great, and the pieces are in place to create a very complete and extremely useful central place for their customers to come in time of need. I'm excited to see where Google takes this dashboard from here, and how other cloud services respond to this ever growing need.

Monday, December 1, 2008

7 keys to a successful public health dashboard

Lets first define what makes an online health dashboard "successful", and in the process explain why you (as a SaaS provider) should have one:

Your support costs go down as your users are able to self-identify system wide problems without calling or emailing your support department. Users will no longer have to guess whether their issues are local or global, and can more quickly get to the root of the problem before complaining to you.
You are better able to communicate with your users during downtime events, taking advantage of the broadcast nature of the Internet versus the one-to-one nature of email and the phone. You spend less time communicating the same thing over and over and more time resolving the issue.
You create a single and obvious place for your users to come to when they are experiencing downtime. You save your users' time currently spent searching forums, Twitter, or your blog.
Trust is the cornerstone of successful SaaS adoption. Your customers are betting their business and their livelihoods on your service or platform. Both current and prospective customers require confidence in your service. Both need to know they won't be left in the dark, alone and uninformed, when you run into trouble. Real time insight into unexpected events is the best way to build this trust. Keeping them in the dark and alone is no longer an option.
It's only a matter of time before every serious SaaS provider will be offering a public health dashboard. Your users will demand it.

With that out of the way, let's move on to detailing what exactly it takes to create a successful public health dashboard. Generally I would suggest looking to your users to tell you what they need. I still strongly recommend you do this, especially if your users are technically savvy. However, as this industry is still so young, and most companies are still unsure of what their users will demand, I humbly submit my 7 rules for public health dashboard success:

The Rules

First things first

Before we get into the rules, I'd like to mention a few public "system status" pages that don't quite meet the label of "health dashboard" but do give us a starting point for providing public health information. There's no reason any SaaS provider today should not be offering at least a basic chronological list of potential issues, downtime events, and resolution details similar to one of the following: craigslist system status, 37signals System Status, Twitter Status, GitHub Status, Mosso System Status. Now...on to the rules for creating a successful online public health dashboard!

The Basics

Rule #1: Must show the current status for each "service" you offer

A status light or short description that visitors can use to quickly identify how the service(s) they are interested in are doing right now. Example #1, Example #2.
Most health dashboards do this well. Keep it simple. Skype's Heartbeat tries to be clever, but I fear that the first impression that the big thumping red hearts give visitors is that something is wrong.
Don't forget to identify what the status icons and messages actually mean. Example #1 (legend at bottom right), Example #2 (descriptions below the table at bottom). Bad Example (no information on what the possible states are).

Rule #2: Data must be accurate and timely

This should go unsaid, but some comments in an online forum show us that it isn't as obvious as it should be.
The data should be based on real time monitoring, not manual updates that require a human.
The entire benefit is lost if your users cannot trust this data, or it arrives too late.

Rule #3: Must be easy to find

It is worthless to provide a public health dashboard if your users are unaware it exists, or are unable to find it in time of need.
Anticipate where your users go when they experience downtime, and create a clear path to the status page. Ideally there will be a link from your home page, and at the minimum from your main support page. Example #1 (footer of each page), Example #2 (top right of every page). Many examples of support page links.
Also consider making the URL as easy to remember as possible. "status.yourdomain.com" or "yourdomain.com/status" seem to be the preferred method.

Rule #4: Must provide details for events in real time

You must go beyond simply noting that something is wrong. Your must provide insight into what is going on, what services are affected, and if possible an ETA on resolution. Users will be OK with a big red light for only so long.
This can be as simple as a timestamped message noting that you are investigating, with regular updates about the investigate and projected resolution times.
The key here is to keep your users from having to contact your support department, defeating much of the gain in having a public health dashboard. Use this as an opportunity to build a trust relationship with your customer by being transparent throughout the process.

Beyond the basics

Rule #5: Provide historical uptime and performance data

Make sure to provide root cause analysis for each downtime event. The more detail the better. Example #1, Example #2 (click on any event in the past).
This will be important to your prospects as they evaluate your transparency. Don't be afraid of problems you've had in the past. Owning up to problems strengthens trust, which should be one of your main goals.
This will be important to your customers as they do post mortem analysis for their superiors.
Provide at least one week of historical data, ideally at least a month. Example #1, Example #2, Example #3, Example #4 (notice each service has an archive link).

Rule #6: Provide a way to be notified of status changes

RSS/email/SMS/Twitter/API's/etc. It's too early to know how users will want to consume this information, but my opinion is that the two most useful options would be to allow email alerts on downtime, and an API that allows users to build their applications to work around the downtime automatically.
Currently many status dashboard provide RSS feeds. Example #1 (even provides email and Twitter alerts!), Example #2.
Along these same lines, providing advanced notice of upcoming maintenance windows is extremely useful. I would hope these are announced in other mediums as well (e.g. email).

Rule #7: Provide details on how the data is gathered

What is the uptime and performance data worth if we have no insight into where it comes from? Currently with most health dashboards we have to assume either the provider built their own monitoring platform, or that they are making status updates manually (Zoho is the one exception, since they have their own monitoring service).
Beyond simply knowing where the data comes from, what exactly does "Performance issues" mean? What are the thresholds that determine that a service is considered to be "disrupted"? From what location's is the monitoring done from? Am I out of luck if I live in Asia and the monitoring is done from New York?
It would be extremely useful to have the data validated by a third party, especially as this gets into the world of SLA's. We can't have the fox watching the hen house when it comes to money.

The future

For those seeking to truly be ahead of the curve and open up the kimono further, I suggest the following rules as well:

Provide geographical uptime and performance data (Zoho is ahead of the game on this one). The more information you provide to your users publicly, the less questions you'll have to deal with privately.
The status page should be hosted externally, at a different location from your primary data center. This should be obvious, but I doubt many companies consider this as a problem. The last thing you want when your primary data center goes down is to have to field calls that could be avoided if your status page was still up.
Break out each individual service and function as much as possible. Similar to how Flickr opened up their API to match up with practically every internal function call, allow your users to have insight into the very specific functionality they need.
Connect your downtime events to your SLA's. Allow your users to easily track how you're doing compared to what you promised. The day's of hoping that your users forget about this are over.

I hope the above advice provides value to companies out there considering their own health dashboards. I would love to hear from SaaS providers already providing health dashboards, especially those I haven't already linked to. I'd especially love to hear some feedback from companies on the benefits they've seen in providing a health dashboard, in customer feedback, reduced costs, or competitive advantages.

I'm excited see over the next few months how things change in this space, what rules become most important to users, and how online service providers respond to the oncoming demand for transparency. Time will tell, but I do know this is only the beginning.

For reference, some of the public health dashboards I referenced in this post:

Tuesday, November 18, 2008

Amazon launches the Festivus of CDN's...a CDN for the rest of us

You can read about it here, here, and here. A CDN-as-a-Service (CaaS?). Does this put the nail in the coffin and make the CDN business a commodity? Probably.

What impresses me most is not the technology or the pricing or the ease of use of the new service (called CloudFront btw). What really gets me hot is the fact that the AWS Service Health Dashboard already has the performance and uptime of CloudFront up and live! That my friends is sign of true commitment to transparency, and more importantly a well functioning process.

Saturday, November 15, 2008

Comparing Amazon Web Services, SalesForce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on it's way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have mistakenly misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking about today?

What is the URL of each status page (and are they easy to remember in times of need)?

Amazon Web Services: http://status.aws.amazon.com/ (Somewhat easy to remember)
Salesforce: http://trust.salesforce.com/ (Easy to remember)
Zoho: http://status.zoho.com/ (Extremely easy to remember)

What are these status pages called?

Amazon Web Services: "AWS Service Health Dashboard"
Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com brand")
Zoho: "Zoho Service Health Status"

What services' health are reported on?

Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.

What health information is provided?

Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be either "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response from six geographical locations is also given.

Where does the uptime and performance data come from?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: Their own "Site 24x7" monitoring service.

What is considered downtime and what is considered a performance issue?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: No clue.

Are real time updates provided during downtime events? Is it easy to find?

Amazon Web Services: Yes, but unclear how consistently and how easy it is to find that information.
Salesforce: Yes, right underneath the current status.
Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.

Is information provided on past downtime events?

Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
Zoho: No. Unless they are described in the blog.

Is there a way to easily report problems users are having?

Amazon Web Services: Yes, clicking the "Report an Issue" link.
Salesforce: No, other then using the standard support channels.
Zoho: No, other then using the standard support channels.

How can you get notified of problems (without watching this page 24/7)?

Amazon Web Services: Ability to subscribe to RSS feeds for change in status of each service.
Salesforce: No.
Zoho: No.

Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy to understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into what customers may be affected by an event, but does little in offering insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesfroce completely lack insight into how that they collect the health information, and all three give no information on what is meant by downtime or performance problems.

A closing questions for each provider:

Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to and why not cover the API's for the other services?
Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
To all: How is the health actually monitored (especially for the GUI focused Salesforce and Zoho services? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.

Notable mentions: The following services also offer up health dashboard page, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more then happy to. Just leave a note in the comments

Wednesday, November 12, 2008

Zoho opens up their kimono and the blogosphere applauds

Big news from Zoho (provider of online applications and services that compete with Google and Microsoft). As of yesterday, they have reached the next level in uptime transparency by launching Zoho Service Health Status! As Raju Vegesna states in their big announcement, "This initiative is yet another step to being more open and transparent with our users." Kudos to Zoho for recognizing the need and delivering. They join Amazon AWS and SalesForce as the three large service providers offering a very public health dashboard of their services.

I'm really excited to see that the blogosphere gets the significance of this move:

Mashable:

"One frequent problem with web applications and services - and thus the whole web 2.0 phenomenon - is lack of communication when something goes wrong. Sure, it’s nice to have your online e-mail client available from every computer, but what happens when it goes down? Often, it’s just you in the dark, waiting for problems to be resolved, with little or no official information on what’s happening to ease your mind...

This is a great idea. If something goes wrong with any one of Zoho’s applications, you can quickly check out if the problem is on your side or theirs. Of course, I’m sure that the folks at Zoho will continue to inform their customers about problems, updates, downtime and similar issues via blog posts, but being able to see what’s wrong for yourself, at any given time, is an advantage Zoho’s customers will certainly enjoy. All other web startups take notice: this is the level of transparency we’d like to see from everyone, not just Zoho."

WebWorkerDaily:

"After taking a look, I’d say that all applications hosted online could benefit from this level of kimono-opening."

CNET:

"Web application specialist Zoho has joined the growing ranks of companies willing to share detailed information on how well their online services are holding up.
This move toward transparency is increasingly important as potential customers consider relying on such services...
Publishing the performance measurements for online services is catching on as cloud computing grows more serious. Going hand in hand with that is offering service level agreements (SLAs) with specific uptime commitments."

Who's next to open up? I'm looking at you Google!

Transparent Uptime