Transparent Uptime

Tuesday, July 20, 2010

Why Transparency Works

We've talked about the benefits of transparency. We've talked about implementing transparency. We've talked about transparency in action. What we haven't yet talked about is...why the heck does transparency work? Why does transparency make your users happier? Why do customers trust you more when you are transparent? Why do we want to know what's going on? What allows us to be OK with major problems by simply knowing what is going on? My theory is simple: Transparency gives us a sense of control, and control is required for happiness. Allow me to elaborate.

Downtime and learned helplessness

The concept of learned helplessness was developed in the 1960s and 1970s by Martin Seligman at the University of Pennsylvania. He found that animals receiving electric shocks, which they had no ability to prevent or avoid, were unable to act in subsequent situations where avoidance or escape was possible. Extending the ramifications of these findings to humans, Seligman and his colleagues found that human motivation [...] is undermined by a lack of control over one's surroundings. (source)

Learned helplessness was discovered by accident when Seligman was researching Pavlovian conditioning. His experiment was set up to associate a tone with a (harmless) shock, to test whether the animal would learn to run away from just the sound of the tone. In the now famous experiment, one group of dogs was restrained and unable to escape the shock for a period of time (i.e. this group had no control over its situation). Later this group was placed into an area that now allowed them to escape the shock; unexpectedly the dogs stayed put. The shocks continued to come, yet the dogs simply curled up in the corner and whimpered. These dogs exhibited depression, and in a sense gave up on life, because these negative events were seemingly random. Seligman concluded that "the strongest predictor of a depressive response was lack of control over the negative stimulus." What is downtime if not a lack of control over a negative stimulus?

The Cloud and loss of control

Many concerns come up when businesses consider the cloud, but as the survey by IDC below shows, the overriding concern is rooted in a loss of control:

You give up a lot of control in exchange for reduced cost, higher efficiency, and increased flexibility. Yet that desire for control persists, and the remaining bits of control you maintain become even more valuable.

Downtime kills our sense of control

Downtime is quite simply a negative event over which you have almost no control. Especially when using SaaS/cloud services, your remaining semblance of control vanishes as soon as the service goes down and you have no insight into what is going on. We are the dogs trapped in the shock machine, whimpering in the corner.

As I described in my talk, downtime is inevitable. Thanks to things like risk homeostasis, black swan events, unknown unknowns, and our own nature, there is no way to avoid failure. All we can do is prepare for it, and communicate/explain what is going on. And that is the key to keeping us from the fate of a depressed canine. Transparency gives us a sense of control over the uncontrollable.

How transparency gives us back the sense of control

Imagine walking through the park, the sun shining, the birds singing. All of a sudden you notice a strong pain in your arm. Your mind jumps to the worst. Are you having a heart attack, did something just bite you, are you getting older and sicker? Then a split second later you remember...your buddy jokingly punched you earlier in the day! The punch must have been harder than you remember, but it explains the pain. Instantly you feel better. Though the pain is the same, you understand the source. You have an explanation for the pain. Transparency delivers that explanation.

When Amazon goes down, or Gmail isn't loading, users feel pain. Part of that pain comes from the inconvenience of not being able to do what you want to get done, or the lost revenue that comes with downtime. But just as painful is the sense of fatalistic helplessness, especially if someone is breathing down your neck expecting you to fix the problem. Without insight into what is happening with the service, you are completely without control. If on the other hand the service provides an explanation, through a public health dashboard or a status blog or a simple tweet, your fatalistic reaction turns to concrete concern. Your mind goes from assuming the worst (e.g. this service is terrible, they don't know what they are doing, it always fails) to focusing on a real and specific problem (e.g. some hard drive in the datacenter failed, they had some user error, this'll be gone soon). A specific problem is fixable, an unexplained pain is not. Transparency brings the pain down to a specific and knowable problem, while also holding the provider accountable for their issues (which indirectly gives you even more control). Or better said:

Seligman believes it is possible to change people's explanatory styles to replace learned helplessness with "learned optimism." To combat (or even prevent) learned helplessness in both adults and children, he has successfully used techniques similar to those used in cognitive therapy with persons suffering from depression. These include identifying negative interpretations of events, evaluating their accuracy, generating more accurate interpretations, and decatastrophizing (countering the tendency to imagine the worst possible consequences for an event). (source)

By providing a sense of control, transparency is one of the keys to keeping us happy, productive, and sane in an increasingly uncontrollable world.

Thursday, July 8, 2010

Facebook and transparency

As some of you may know, I use Facebook as an example of how not-to-do-transparency in my talk. Immediately following my talk at Velocity, I received the following comment from Bret Taylor (CTO of Facebook):

The "Platform Live Status" page that is mentioned is such:

There's some really good stuff here (e.g. it exists, it looks up-to-date, and it has some great features). There is also a lot of room for improvement. Putting aside the fact that this wasn't meant to be a fully-featured dashboard, and is far better than nothing, let's run their status page through the rules for a successful public health dashboard and see what we can advise for the next evolution of Facebook's transparency initiative:

Rule #1: Must show the current status for each "service" you offer

Today the status page only gives the status of various services through plain text. For example, at the time of this writing, the current status reads "hello, active and total user counts are currently missing from both public profile pages and the API." The two graphs to the right show API response time and error rate across all API functions, not per-API or per-function area. Showing a graph and/or status light for each API/function would add tremendous value for developers that use specific parts of the application and only need to know about those specific areas. It would also make it easier to automate functionality, and to decide which components can be relied on in your architecture.

Recommendation: A graph and status light for each specific API function and end-point that developers may use. See Google's health dashboard for ideas.
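To make the "status light per function" idea concrete, here is a minimal sketch of how recent error counts could be mapped to a green/yellow/red light for each endpoint. The endpoint names, the thresholds, and the EndpointStats type are all assumptions made up for illustration; nothing here reflects how Facebook or Google actually compute their dashboards.

```python
from dataclasses import dataclass

# Hypothetical thresholds; a real dashboard would tune these per endpoint.
DEGRADED_ERROR_RATE = 0.01  # 1% of calls failing -> yellow
DOWN_ERROR_RATE = 0.20      # 20% of calls failing -> red


@dataclass
class EndpointStats:
    name: str    # illustrative endpoint name, not a real API method
    calls: int   # calls observed in the recent window
    errors: int  # failed calls in the same window


def status_light(stats: EndpointStats) -> str:
    """Map one endpoint's recent error rate to a status light."""
    if stats.calls == 0:
        return "unknown"  # no traffic, so no basis for a light
    rate = stats.errors / stats.calls
    if rate >= DOWN_ERROR_RATE:
        return "red"
    if rate >= DEGRADED_ERROR_RATE:
        return "yellow"
    return "green"


def dashboard(all_stats):
    """Build a {endpoint name: light} view, one entry per endpoint."""
    return {s.name: status_light(s) for s in all_stats}
```

A developer who only depends on one endpoint can then look at that single entry instead of an aggregate graph.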

Rule #2: Data must be accurate and timely

From the outside this appears to be solid. My big worry is that updates are currently very manual, which isn't going to scale. I haven't watched the site long enough to gauge how timely the updates are, but let's give them the benefit of the doubt. The main reason for this rule, requiring that your data be accurate and timely, comes down to trust. If your users get a hint of inaccuracy or delays in updates, they lose faith in the tool and stop using it. Your users will resort back to emailing/tweeting/complaining, which defeats the entire purpose.

Recommendation: Automate status updates as much as possible. Set up regular monitoring that posts status changes automatically. Create a formal process that requires someone to post a detailed update within a Minimum-Time-To-Communicate.
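As a rough sketch of this recommendation, the snippet below has a monitor post status changes automatically and flags when no human has followed up within the Minimum-Time-To-Communicate. The StatusFeed class, the 15-minute value, and the "monitor"/"human" author labels are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Assumed value; pick whatever your process commits to.
MIN_TIME_TO_COMMUNICATE = timedelta(minutes=15)


class StatusFeed:
    """Hypothetical in-memory status feed with automated and human entries."""

    def __init__(self):
        self.entries = []            # (timestamp, author, message)
        self.current = "ok"
        self.incident_started = None

    def record_check(self, now, healthy):
        """Called by monitoring; posts an entry only when the status changes."""
        new_status = "ok" if healthy else "degraded"
        if new_status != self.current:
            self.current = new_status
            self.incident_started = None if healthy else now
            self.entries.append((now, "monitor", f"Status changed to {new_status}"))

    def post_human_update(self, now, message):
        """A person adds the detailed explanation."""
        self.entries.append((now, "human", message))

    def communication_overdue(self, now):
        """True if an incident is open and nobody has explained it in time."""
        if self.incident_started is None:
            return False
        explained = any(
            author == "human" and ts >= self.incident_started
            for ts, author, _ in self.entries
        )
        return not explained and now - self.incident_started > MIN_TIME_TO_COMMUNICATE
```

The automated entry covers timeliness, while the overdue check nudges someone to add the detail that automation cannot provide.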

Rule #3: Must be easy to find

This may be the biggest problem today with Facebook's status page. I've been collecting public health blogs/dashboards for a couple years now, and I've never come across it. Googling for "facebook uptime" or "facebook status" does not help. There are over 100 links to the page, but most are from deep within developer forums. If Facebook is serious about using transparency to their advantage, this page needs to be linked to from the first place that developers would go when they experience issues with the API.

Recommendation: Link to the status page from here and here. Not being a Facebook developer, I'm not the best judge of this, but I'm sure Facebook has plenty of data to figure this out.

Rule #4: Must provide details for events in real time

We discussed this already, but it's very important, especially for API-based developers. The error rate graph is very useful for this, which appears to be real-time. I would do more with it.

Recommendations: Show error rate per API/function (including the types of issues seen), and show historical information to give an impression of what's "normal". Developers mostly need to know who is at fault. If you simply let them know that something is up on your end, they'll feel a lot better and be able to go on with their day. See trust.salesforce.com for an easy way to integrate basic updates into a dashboard (click on an error icon).
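One simple way to express what "normal" means is to compare the current error rate for a function against its own history. The sketch below is only one possible approach, assuming you already collect a list of past error rates per function; the three-sigma rule and the min_delta floor are illustrative choices, not anything the dashboards mentioned here are known to use.

```python
from statistics import mean, pstdev


def is_abnormal(history, current, sigmas=3.0, min_delta=0.005):
    """Flag the current error rate if it sits well above the historical norm.

    history   -- past error rates for one API function (fractions, e.g. 0.01)
    current   -- the error rate observed right now
    sigmas    -- how many standard deviations above the mean counts as abnormal
    min_delta -- floor on the margin, so a perfectly flat history does not
                 flag every tiny increase
    """
    if len(history) < 2:
        return False  # not enough history to say what "normal" is
    baseline = mean(history)
    margin = max(sigmas * pstdev(history), min_delta)
    return current > baseline + margin
```

Shown next to the graph, a flag like this tells a developer at a glance whether the provider is having an unusual day.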

Rule #5: Provide historical uptime and performance data

Mostly lacking in this area. The graphs only go back to the start of the current day, and the text status-updates go back about 2 weeks. A historical perspective gives new developers a baseline to go by, and gives existing developers a chance to correlate issues they saw on their end.

Recommendation: See OpenSRS's dashboard for a simple way to do historical uptime/performance by service/API. Clicking on the "archive" link shows you past updates for every service.
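For a sense of how little is needed to publish historical uptime by service, here is a small sketch that turns raw monitoring checks into a per-service, per-day uptime percentage. The (service, day, ok) tuple format is an assumption about what your monitoring records look like.

```python
from collections import defaultdict


def uptime_by_service(checks):
    """Summarize checks into {service: {day: uptime percent}}.

    checks -- iterable of (service, day, ok) tuples, where ok is True
              when the check succeeded (a hypothetical record format)
    """
    totals = defaultdict(lambda: [0, 0])  # (service, day) -> [checks, successes]
    for service, day, ok in checks:
        bucket = totals[(service, day)]
        bucket[0] += 1
        if ok:
            bucket[1] += 1

    report = defaultdict(dict)
    for (service, day), (count, successes) in totals.items():
        report[service][day] = round(100.0 * successes / count, 2)
    return dict(report)
```

Keeping the daily numbers around, rather than only today's graph, is what lets developers correlate past issues on their end.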

Rule #6: Provide a way to be notified of status changes

Facebook is actually doing a great job here. They have both an RSS feed and an email option, which is extremely rare and extremely awesome. This allows developers to be pushed updates, and to integrate the updates into your internal dashboards. Great job here.

Recommendations: None!
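On the consuming side, pulling a status RSS feed into an internal dashboard can be quite small. The sketch below parses a standard RSS 2.0 document with Python's built-in XML parser; the sample feed content is invented for illustration, and fetching the real feed over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Invented sample in standard RSS 2.0 shape; a real feed would be fetched over HTTP.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example Platform Status</title>
<item><title>API latency elevated</title><pubDate>Thu, 08 Jul 2010 17:00:00 GMT</pubDate></item>
<item><title>All systems normal</title><pubDate>Thu, 08 Jul 2010 18:30:00 GMT</pubDate></item>
</channel></rss>"""


def latest_updates(feed_xml, limit=5):
    """Return up to `limit` (title, pubDate) pairs in feed order."""
    root = ET.fromstring(feed_xml)
    items = root.findall("./channel/item")
    return [
        (item.findtext("title", default=""), item.findtext("pubDate", default=""))
        for item in items[:limit]
    ]
```

The returned pairs can be rendered alongside your own monitoring so the team sees the provider's view and yours in one place.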

Rule #7: Provide details on how the data is gathered

Currently customers have no insight into how the API response/errors are measured, and what the policy around status updates is. Is it ad-hoc, is it comprehensive, is it automated? It's hard to rely on this data today without insight into those policies and processes.

Recommendation: Add an explanation to the bottom of the page, or as a link off of the page, going into some of these details. You don't have to reveal your special sauce, just give us confidence that we can rely on this data.

Bonus

The list of top bugs along the left side is a GREAT idea. This takes transparency to another level, and I would highly encourage other sites to adopt this practice. Developers are the target audience for both health issues and outstanding bugs, so why not combine them (along with the "Developer Updates" feed) into a single dashboard? Brilliant.
I like how the "Current Status" is broken out into a big yellow box at the top, making it clear what the situation is right now. This is much better than the default approach of showing the latest status as simply the top news item in the chronological list. A nice touch.

Conclusion

The most important takeaway is that Facebook has taken the hardest step toward transparency: getting a status blog/dashboard online. If they were to implement some of the recommendations above, they would see more of the benefits that come with transparency, and set a great example for other development platforms.

Thursday, July 1, 2010

Benefits of Transparency

I thought it would be helpful to consolidate a list of the primary benefits of web sites/services being transparent online. If there are any I missed, please leave a comment and I'll update the list:

Benefits of Transparency (for online websites and services)

1. Build trust with your users
2. Increase loyalty, reduce churn
3. Improve perception of your reliability
4. Reduce support costs
5. Control the message
6. Gain a competitive advantage
7. More time to focus on the actual problem
8. Reduce stress
9. Learn

See below for more detail...

1. Build trust with your users

Your users have a pretty low bar for how they expect to be treated. They basically expect you to screw them, hide information from them, and do the bare minimum to take their money. If you do something good for them, something unexpected like proactively admitting that you have problems and showing your humanity, your users will develop a sense of trust for your service and your company. I believe that trust may be the most important asset you can earn on the web, especially if you deal with things that are really important to your customers (e.g. money, email, photos, etc.).

Example: If a car company does a recall as soon as there is a hint of a problem, you trust them a lot more than if they are forced to do a recall after a number of deaths.

The more times you are proactive and admit to problems before you are caught, the stronger the sense of trust gets. If you are instead forced to admit your problems, or your customers complain before you tell them that you are aware of the problem, the harder it gets to convince them that you know what you are doing and that you care about the quality of the service.

2. Increase loyalty, reduce churn

Your users don't expect you to be perfect. They will forgive you when you have a problem. But only if they feel that they can trust you, that you know what you are doing, and that things are improving. Your users will stick with you if they feel like you know what you are doing, that you feel their pain, that you are taking these issues seriously. Apologizing and explaining after the fact is much more difficult. It is hard to convince your customers that you know what you are doing and that you care about their issues if you avoid the problem, or worse pretend that it doesn't exist.

Example: Atlassian's security breach a few months ago...they could have lost a lot of concerned customers questioning their trustworthiness. Instead they increased loyalty and trust by being up front about the situation, explaining what they are doing about it, and improving for the future. If instead the issue had been exposed independently, they would have seen a mass exodus.

A major downtime event is innately going to lead to unhappy customers. You may as well try to turn it around into something worthwhile, and try to keep as many customers as you can. A nice side benefit is that the more your users learn to trust you, the more loyal and forgiving they become. It's a powerful loop that you want to get on the right side of.

3. Improve perception of your reliability

When users run into a problem with your service, whether it's their fault or yours, they'll often assume the fault is on your end. If you instead show them exactly when you are actually having problems, and if you do this reliably and consistently, they'll know when you really have problems, and end up seeing that you aren't down as often as they thought. It's ironic that the more open you are about how often you have a problem, the less often your users will think you really are down.

Example: Consider a complex web application made up of many components, say using Google App Engine, the Foursquare API, and Google ads. You get alerted about a timeout issue...will you assume that Google is at fault, or one of the other components? A quick visit to Google's public dashboard would show you that they are perfectly fine, and that the problem lies with one of the other services (which need their own public dashboards).

4. Reduce support costs

During a downtime incident your support department gets flooded with the same type of question..."I'm seeing a problem, what's going on?" and "Is the site down or is it just me?". If you can allow your customers to serve themselves, or make it easy for your support department to point complaints to a single succinct explanation, they can operate much more efficiently, and focus on higher level issues.

Also, a lot of times support doesn't even know what's going on during a downtime event, and having something to check themselves gives them more insight into the health of the system.

Example: Amazon Web Services barely has support. They have a paid support service, and their forums, but otherwise there is very little real-time support. They can do this because they have a real-time public health dashboard that addresses 90% of the questions users are going to have in their day-to-day use of the service.

5. Control the message

If you don't tell your users what's going on during an event, they are going to speculate and assume the worst. They'll assume you aren't aware of the problem, that it'll last a long time, and that you're not taking it seriously. Even a simple update telling users that you are aware of the problem and are working on it gives them confidence that this isn't going to be the end of the company, and that you feel their pain.

Example: Users of Twitter experience on-and-off issues, but they can always tell how healthy the service is as a whole by visiting their public dashboard and status blog. They don't have to wonder how far-reaching the downtime is, or how long it'll last.

6. Gain a competitive advantage

All else being equal, when prospects are comparing your service to a competitor, especially when your service is critical to their own life/business, being able to tell a story about being transparent and open is a powerful differentiator. It gives your prospect a feeling of control, that they won't be left in the dark when the sh** hits the fan and their boss is breathing down their neck.

7. More time to focus on the actual problem

Especially for a small company, you can spend more time dealing with resolving the issue and less time fielding calls/emails. The better your process, the less you have to worry about beyond fixing the actual problem.

8. Reduce stress

With a defined process, ideally one that is procedural, you keep people from freaking out and having to scramble at the worst possible time. The last thing you want to be doing during a downtime event is figuring out who can say what, and how to actually contact your entire customer base about a potential problem.

9. Learn

As noted in a comment by Heather Leson on the original post, disasters are an opportunity for both customers and company staff to share in the learning process. The more open you are about your issues, the more opportunity you'll have to learn from customers that may have had similar experiences, and the more your customers will learn from your experience. You aren't alone. Your customers have a vested interest in helping you succeed. You may be surprised by how forthcoming they are with advice and recommendations for your situation. Google App Engine ended up adding new features after a major downtime event, no doubt based on customer feedback. Amazon added their public health dashboard after one too many outages. As Heather put it, "Mutual success is one of the cornerstones of open source/open web organizations."

Wednesday, June 30, 2010

Quote in WSJ

In today's issue of the Wall Street Journal:

Lenny Rachitsky, the head of research and development for the website monitoring company Webmetrics.com, said companies can take advantage of unexpected outages by communicating with customers about what is going on—something Amazon didn't do during the outage, beyond its note to sellers. "Customers don't expect you to be perfect, as long as they feel that they can trust you," he said. "All it takes is to give your users some sense of control."

A similar sentiment was posited by Eric Savitz over at Barrons:

So, here’s the thing: it seems to me that Amazon actually made a bad situation worse by failing to communicate the details of the situation with its customers. My little post Tuesday afternoon on the technical troubles triggered 149 comments, and counting. The company’s customers did not like having the site go down, and even more, they did not like being left in the dark. And so far, the company still has not come clean on what went wrong. Some of the people who commented on my previous post were worried that their personal data might have been compromised. I have no real reason to think that was the case, but it certainly seems odd to me that Amazon has taken what appear to be a defensive and closed-mouth stance on an issue so basic to its customers: the ability to simply use the site. Jeff Bezos, your customers deserve better.

Tuesday, June 29, 2010

Amazon.com goes down, good case study of consumer-facing transparency (or lack thereof)

One of the questions I received from the audience after my talk last week was about how B2C companies should handle downtime and transparency. Today we have a great case study, as Amazon.com was down/degraded for about three hours:

You often hear about Amazon Web Services having some downtime issues, but it’s rare to see Amazon.com itself have major issues. In fact, I can’t ever remember it happening the past couple of years. But that’s very much the case today as for the past couple of hours the service has been switching back and forth between being totally down and being up, but showing no products. (source)

The telling quote, and impression that appears to be prevalent across Twitter and other blogs that have picked up the story is this:

Obviously, Twitter is abuzz about this — though there’s no word from Amazon on Twitter yet about the downtime. Amazon Web Services, meanwhile, all seem to be a go, according to their dashboard. The mobile apps on the iPhone, iPad and Android devices are sort of working, but it doesn’t appear you can go to actual product pages.

Let's think about this from the perspective of the customer. They visit Amazon.com and see this:

They wonder what's going on. They question whether something is wrong with their computer. If they are technical enough they may visit Amazon's Twitter account to see if there is anything going on (a whole lot of nothing):

Maybe the visitor is even more technical, and knows about the public health dashboard that Amazon offers for their AWS clients. Well, that again gives us the wrong impression (all green lights):

At this point the user is frustrated. She may hop on Twitter and search for something like "amazon down", which would show her that a lot of other people are also having the same problem. This would at least make her feel better. Otherwise she would be stuck, wondering what is going on, how long it'll last, and whether to try shopping someplace else.

It turns out that Amazon did in fact put out an update about what was going on...in the well hidden Amazon services seller forum:

Realistically, Amazon doesn't go down very often, and for most people this is more of an annoyance than anything. I don't see Amazon customers losing trust in Amazon as a result of this incident. As Jesse Robbins put it:

The key here is that now Amazon has a lot less room for error. One more major downtime like this, especially within the year, will begin to eat away at the trust that customers have built for the service. To be proactive in avoiding that problem, and to give themselves more room for error, I would strongly advise Amazon to do the following:

Put some sort of communication out within 24 hours acknowledging the issues.
Put out a detailed postmortem, explaining what happened, and what they are doing to improve for the future.
Improve your process around updating the public about amazon.com downtime. The Twitter account is a good start, and it's very promising that you put out a communication to the public. The problem is that your users saw nothing in the places they looked for updates, and very few users would ever think to check the forum you posted to. I would launch a new public health dashboard focused on overall Amazon.com health (and make sure to host this outside of your infrastructure!), which would include the AWS health as a subset (or simply a link), along with other increasingly important elements of your company: Kindle download health, shipping health, etc.
Implement the improvements discussed in the postmortem.
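To illustrate the single company-wide dashboard suggested in the list above, here is a minimal sketch that rolls several component lights up into one overall status. The component names and the green/yellow/red scheme are assumptions for the example, not a description of anything Amazon runs.

```python
# Ordering of lights from best to worst; an assumed three-level scheme.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall_status(components):
    """Return the worst light among all components, or "unknown" if empty."""
    if not components:
        return "unknown"
    return max(components.values(), key=lambda light: SEVERITY[light])


def affected(components):
    """List the components that are not fully healthy, sorted by name."""
    return sorted(name for name, light in components.items() if light != "green")
```

With a roll-up like this, a shopper landing on the top-level page sees red for the retail site even when every AWS light is green.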

Other takeaways

I'm feeling that transparency in the B2C world is rarely as critical as in B2B relationships. There are certainly cases where consumers are just as inconvenienced and frustrated when their services are down, but in terms of impact and revenue loss, the bar has to be much higher for B2B businesses. I also believe that consumers are much more forgiving of downtime, and won't require as much from a company when they go down. This will change however as consumers become more dependent on the cloud for their everyday lives.
Amazon set the bar high for their AWS transparency. Users of those services automatically checked the existing communication channels, which is what you would want. Unfortunately Amazon did not set up a process to connect those two parts of the company.
This also exposed the problem with having different processes and tools for different parts of your organization. Ideally there would be a central place for status across the entire amazon.com property. It's understandable that AWS is doing things a bit differently, but the consequence as we saw was that users waste time looking at the wrong place. This is something Rackspace has trouble with as well.

Monday, June 28, 2010

Video of my talk (Upside of Downtime) at Velocity 2010

Video of my talk has been posted (below), though watching it and listening to myself feels pretty damn weird. I've been blown away by the response I've gotten to this talk. I know of a handful of companies circulating these slides/notes internally and working to make their companies more transparent. I've personally heard from a number of people at the conference who were discussing the ideas with their coworkers and thinking about the best approach to take action. Even Facebook (the example I used of how not to handle downtime) has found resonance with the talk, and pointed me to a little known status page.

I'm hoping to start a conversation around the framework and continue to evolve it. I'm going to expand on the ideas in this blog, so if there is anything specific you would like me to explore (e.g. hard ROI, B2C examples, cultural differences, etc), please let me know.

Enjoy the video:

The slides can be found here: http://www.slideshare.net/lennysan/the-upside-of-downtime-velocity-2010-4564992

Wednesday, June 23, 2010

The Upside of Downtime (Velocity 2010)

Here is the full deck from my talk at Velocity, including two bonus sections at the end:

The Upside of Downtime (Velocity 2010)

Also, here is the "Upside of Downtime Framework" cheat-sheet (click through to download):