Transparent Uptime: amazon


Wednesday, June 30, 2010

Quote in WSJ

In today's issue of the Wall Street Journal:

Lenny Rachitsky, the head of research and development for the website monitoring company Webmetrics.com, said companies can take advantage of unexpected outages by communicating with customers about what is going on—something Amazon didn't do during the outage, beyond its note to sellers. "Customers don't expect you to be perfect, as long as they feel that they can trust you," he said. "All it takes is to give your users some sense of control."

A similar sentiment was expressed by Eric Savitz over at Barron's:

So, here’s the thing: it seems to me that Amazon actually made a bad situation worse by failing to communicate the details of the situation with its customers. My little post Tuesday afternoon on the technical troubles triggered 149 comments, and counting. The company’s customers did not like having the site go down, and even more, they did not like being left in the dark. And so far, the company still has not come clean on what went wrong. Some of the people who commented on my previous post were worried that their personal data might have been compromised. I have no real reason to think that was the case, but it certainly seems odd to me that Amazon has taken what appear to be a defensive and closed-mouth stance on an issue so basic to its customers: the ability to simply use the site. Jeff Bezos, your customers deserve better.

Tuesday, June 29, 2010

Amazon.com goes down, good case study of consumer-facing transparency (or lack thereof)

One of the questions I received from the audience after my talk last week was about how B2C companies should handle downtime and transparency. Today we have a great case study, as Amazon.com was down/degraded for about three hours:

You often hear about Amazon Web Services having some downtime issues, but it’s rare to see Amazon.com itself have major issues. In fact, I can’t ever remember it happening the past couple of years. But that’s very much the case today as for the past couple of hours the service has been switching back and forth between being totally down and being up, but showing no products. (source)

The telling quote, reflecting an impression that appears to be prevalent across Twitter and the blogs that have picked up the story, is this:

Obviously, Twitter is abuzz about this — though there’s no word from Amazon on Twitter yet about the downtime. Amazon Web Services, meanwhile, all seem to be a go, according to their dashboard. The mobile apps on the iPhone, iPad and Android devices are sort of working, but it doesn’t appear you can go to actual product pages.

Let's think about this from the perspective of the customer. They visit Amazon.com and see this:

They wonder what's going on. They question whether something is wrong with their computer. If they are technical enough, they may visit Amazon's Twitter account to see if there is anything going on (a whole lot of nothing):

Maybe the visitor is even more technical, and knows about the public health dashboard that Amazon offers for their AWS clients. Well, that again gives us the wrong impression (all green lights):

At this point the user is frustrated. She may hop on Twitter and search for something like "amazon down", which would show her that a lot of other people are also having the same problem. This would at least make her feel better. Otherwise she would be stuck, wondering what is going on, how long it'll last, and whether to try shopping someplace else.

It turns out that Amazon did in fact put out an update about what was going on...in the well hidden Amazon services seller forum:

Realistically, Amazon doesn't go down very often, and for most people this is more of an annoyance than anything. I don't see Amazon customers losing trust in Amazon as a result of this incident. As Jesse Robbins put it:

The key here is that Amazon now has a lot less room for error. One more major downtime like this, especially within the year, will begin to eat away at the trust that customers have built for the service. To be proactive in avoiding that problem, and to give themselves more room for error, I would strongly advise Amazon to do the following:

Put some sort of communication out within 24 hours acknowledging the issues.
Put out a detailed postmortem, explaining what happened, and what they are doing to improve for the future.
Improve your process around updating the public about amazon.com downtime. The Twitter account is a good start, and it's very promising that you put out a communication to the public. The problem is that users saw nothing in the places they looked for updates, and very few of them would ever think to check the forum you posted to. I would launch a new public health dashboard focused on overall Amazon.com health (and make sure to host this outside of your infrastructure!), which would include the AWS health as a subset (or simply a link), along with other increasingly important elements of your company: Kindle download health, shipping health, etc.
Implement the improvements discussed in the postmortem.
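To make the dashboard suggestion concrete, here is a minimal Python sketch of how an overall status could be rolled up from per-component statuses. It is only an illustration, not how Amazon does or should implement it: the three status levels mirror the ones the AWS dashboard uses, while the component names and the "worst status wins" rule are my own assumptions.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that a higher value means a worse state.
    NORMAL = 0
    PERFORMANCE_ISSUES = 1
    DISRUPTION = 2


def roll_up(components):
    """Return the worst status among all components (NORMAL if there are none)."""
    return max(components.values(), default=Status.NORMAL)


# Hypothetical component names, loosely following the list above.
components = {
    "Retail site": Status.DISRUPTION,
    "AWS": Status.NORMAL,
    "Kindle downloads": Status.NORMAL,
    "Shipping": Status.PERFORMANCE_ISSUES,
}
overall = roll_up(components)
print(overall.name)
```

With a roll-up like this, an all-green AWS section could no longer give the impression that everything at Amazon is fine while the retail site is down.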

Other takeaways

My feeling is that transparency in the B2C world is rarely as critical as in B2B relationships. There are certainly cases where consumers are just as inconvenienced and frustrated when their services are down, but in terms of impact and revenue loss, the bar has to be much higher for B2B businesses. I also believe that consumers are much more forgiving of downtime, and won't require as much from a company when it goes down. This will change, however, as consumers become more dependent on the cloud for their everyday lives.
Amazon set the bar high for their AWS transparency. Users of those services automatically checked the existing communication channels, which is what you would want. Unfortunately Amazon did not set up a process to connect those two parts of the company.
This also exposed the problem with having different processes and tools for different parts of your organization. Ideally there would be a central place for status across the entire amazon.com property. It's understandable that AWS is doing things a bit differently, but the consequence, as we saw, was that users wasted time looking in the wrong place. This is something Rackspace has trouble with as well.

Monday, January 18, 2010

Real-time timeline of cloud health status updates

Putting the combined cloud health status feed mentioned in the previous post on a timeline offers us a convenient way to tell what's been going on with each of the cloud providers, and with the cloud in general:

It may be useful to break out each cloud into a separate feed/timeline as well, if you're hosted on a single cloud. If there is interest, I'll post those as well.

Thursday, January 14, 2010

Cloud health RSS feed

You may have noticed I've added a new module to the right side of this blog. This is a new feed I created (using Yahoo Pipes) combining the status update feeds of all of the major cloud providers (who offer feeds), in effect creating a single stream of cloud health status. Feel free to play with the pipe and add anything I may have missed.

Update: Improved on the original pipe by prepending each cloud feed with its name, to more easily tell which cloud is reporting the problem:
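For anyone who would rather not depend on Yahoo Pipes, the same merge-and-prefix idea can be sketched in a few lines of Python. This is not the pipe itself, just a rough equivalent under some assumptions: the feeds are RSS 2.0, every item has a pubDate with a timezone, and the provider names and sample feeds below are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def merge_feeds(feeds):
    """Merge RSS 2.0 documents into one newest-first list of (date, title).

    `feeds` maps a provider name to that provider's RSS XML as a string.
    Each title is prefixed with the provider name.
    Assumes every item carries a pubDate that includes a timezone.
    """
    merged = []
    for provider, xml_text in feeds.items():
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            title = item.findtext("title", default="").strip()
            published = parsedate_to_datetime(item.findtext("pubDate"))
            merged.append((published, f"[{provider}] {title}"))
    merged.sort(key=lambda entry: entry[0], reverse=True)
    return merged


# Made-up sample feeds, for illustration only.
SAMPLE_A = """<rss version="2.0"><channel>
<item><title>Elevated error rates</title><pubDate>Thu, 14 Jan 2010 10:00:00 GMT</pubDate></item>
</channel></rss>"""
SAMPLE_B = """<rss version="2.0"><channel>
<item><title>Service restored</title><pubDate>Thu, 14 Jan 2010 12:30:00 GMT</pubDate></item>
</channel></rss>"""

for published, title in merge_feeds({"Cloud A": SAMPLE_A, "Cloud B": SAMPLE_B}):
    print(published.isoformat(), title)
```

In practice you would fetch each provider's feed over HTTP and pass the response text in; the fetching is left out here to keep the sketch self-contained.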

Amazon responds to scalability concerns

Thanks to the fine reporting by Data Center Knowledge, we now have an official response from Amazon regarding the claims against their cloud capacity:

Amazon says that if customers are experiencing performance problems, it isn’t because EC2 is overloaded. “We do not have over-capacity issues,” said Amazon spokesperson Kay Kinton. “When customers report a problem they are having, we take it very seriously. Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance.”

Only time will tell whether this is a recurring issue (and something competitors will exploit), or an unsubstantiated fluke.

Update: The debate continues...

Tuesday, November 18, 2008

Amazon launches the Festivus of CDN's...a CDN for the rest of us

You can read about it here, here, and here. A CDN-as-a-Service (CaaS?). Does this put the nail in the coffin and make the CDN business a commodity? Probably.

What impresses me most is not the technology or the pricing or the ease of use of the new service (called CloudFront btw). What really gets me hot is the fact that the AWS Service Health Dashboard already has the performance and uptime of CloudFront up and live! That, my friends, is a sign of true commitment to transparency and, more importantly, of a well-functioning process.

Saturday, November 15, 2008

Comparing Amazon Web Services, SalesForce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on its way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking at today?

What is the URL of each status page (and are they easy to remember in times of need)?

Amazon Web Services: http://status.aws.amazon.com/ (Somewhat easy to remember)
Salesforce: http://trust.salesforce.com/ (Easy to remember)
Zoho: http://status.zoho.com/ (Extremely easy to remember)

What are these status pages called?

Amazon Web Services: "AWS Service Health Dashboard"
Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com" brand)
Zoho: "Zoho Service Health Status"

What services' health are reported on?

Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.

What health information is provided?

Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be one of "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response time from six geographical locations are also given.
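To illustrate what sits behind numbers like Zoho's uptime and response time figures, here is a small Python sketch that turns a series of monitoring checks into an uptime percentage and a mean response time. It is not how any of these providers calculate their figures (none of them say). I am assuming that uptime is simply the share of successful checks and that failed checks are excluded from the response time average. The `Check` record and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Check:
    ok: bool            # did the probe succeed?
    response_ms: float  # response time; only meaningful when ok is True


def summarize(checks):
    """Return (uptime_percent, mean_response_ms) for a list of checks."""
    if not checks:
        raise ValueError("no checks to summarize")
    successes = [c for c in checks if c.ok]
    uptime = 100.0 * len(successes) / len(checks)
    # Failed probes have no meaningful response time, so leave them out.
    avg_ms = mean(c.response_ms for c in successes) if successes else None
    return uptime, avg_ms


# Made-up sample: three successful probes and one failure.
checks = [Check(True, 200.0), Check(True, 300.0), Check(False, 0.0), Check(True, 400.0)]
uptime, avg_ms = summarize(checks)
print(f"uptime {uptime:.1f}%, mean response {avg_ms:.0f} ms")
```

Even this toy version forces a decision about what counts as "down", which is exactly the definition that is missing from all three dashboards.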

Where does the uptime and performance data come from?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: Their own "Site 24x7" monitoring service.

What is considered downtime and what is considered a performance issue?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: No clue.

Are real time updates provided during downtime events? Is it easy to find?

Amazon Web Services: Yes, but unclear how consistently and how easy it is to find that information.
Salesforce: Yes, right underneath the current status.
Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.

Is information provided on past downtime events?

Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
Zoho: No. Unless they are described in the blog.

Is there a way to easily report problems users are having?

Amazon Web Services: Yes, clicking the "Report an Issue" link.
Salesforce: No, other than using the standard support channels.
Zoho: No, other than using the standard support channels.

How can you get notified of problems (without watching this page 24/7)?

Amazon Web Services: Ability to subscribe to RSS feeds for change in status of each service.
Salesforce: No.
Zoho: No.

Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy to understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into what customers may be affected by an event, but does little in offering insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesforce offer no insight into how they collect the health information, and all three give no information on what is meant by downtime or performance problems.

Closing questions for each provider:

Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to, and why not cover the APIs for the other services?
Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
To all: How is the health actually monitored (especially for the GUI-focused Salesforce and Zoho services)? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.

Notable mentions: The following services also offer health dashboard pages, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more than happy to. Just leave a note in the comments.
