Transparent Uptime: 07/01/2010

Wednesday, July 28, 2010

Transparency in action at OpenSRS

OpenSRS has long been a company that "gets it", so I was excited to have the opportunity to interview Ken Schafer, who leads the transparency efforts at OpenSRS and Tucows. OpenSRS has an excellent public health dashboard, and continues to put a lot of effort into transparency. Heather Leson, who works with Ken, has done a lot to raise the bar in the online transparency community. My hope is that the more transparent we all get about our own transparency efforts (too much?) the more we all benefit. Below, Ken tells us how he got the company to accept the need for transparency, what hurdles they had to overcome, and what benefits they've seen. Enjoy the interview, and if you have any questions for Ken, please post them as comments below.

Q. Can you briefly explain your role and how you got involved in your company’s transparency initiative?
My formal title is Executive Vice President of Product & Marketing. That means I'm on the overall Tucows exec team and I'm also responsible for the product strategy and marketing of OpenSRS, our wholesale Internet services group.

Tucows is one of the original Internet companies - founded in 1993. We've moved well beyond the original software download site and now the company makes most of its money providing easy-to-use Internet services.

OpenSRS provides end users with over 10 million domain names, millions of mailboxes, and tens of thousands of digital certificates through over 10,000 resellers in over 100 countries. Our resellers are primarily web hosts, ISPs, web developers and designers, and IT consultants.

Q. What has your group done to create transparency for your organization?
Given the technical adeptness of our resellers we've always tried not to talk down to them and to provide as much information as we can. Our success and the success of our resellers are highly dependent on each other so we're very open to sharing and in fact since the beginning of OpenSRS in 1999 we've run mailing lists, blogs, forums, wiki, status pages and a host of other ways for us to communicate better with our resellers.

Transparency is kind of in the nature of the business at this point.

Right now we provide transparency into what we're doing through a blog, a reseller forum, our Status site and our activity on a host of social networks.

Q. What was the biggest hurdle you had to get over to push this through?
The biggest challenge is really whether your commitment to transparency can survive the bad times. Being transparent when you've got a status board full of green tick marks isn't that hard. When everything starts turning red and staying that way, THAT'S a lot harder.

We're generally proud of our uptime and the quality of our services but a few years ago we struggled with scaling some of our applications and, frankly, our communication around the problems we were facing suffered as a result. People here were just too embarrassed to tell our resellers that we'd messed stuff up and in particular to admit to our fellow geeks HOW we'd messed up.

But when we pushed and DID share information and admitted our mistakes and talked about what we could do to make it better what we found was that our resellers were appreciative AND very sympathetic. They'd all been there too and knew it was hard to fess up to our errors in judgment and they really appreciated it.

One thing we STILL struggle with is how we communicate around network attacks. Our services run a big chunk of the Internet and as such we're under pretty much constant attack of one sort or another. We handle most of these without anyone noticing. Our operations and security teams do an amazing job of keeping things running smoothly in the face of these attacks but every once in a while something new - in scope, scale or technique - happens that puts pressure on our systems until we can adjust to the new threat.

In those cases we've tended to put our desire for transparency aside and give minimal information so as not to show our hand to the bad guys. It's a struggle between what we share so customers understand what is happening and not showing potential vulnerabilities that others could exploit.

I guess "sharing what is exploitable" is where I draw the line when it comes to transparency.

Q. What benefits have you seen as a result of your transparency?
One of the biggest benefits is in the overall quality of the service. When you say that EVERY problem is going to get publicly and permanently posted to a status page it REALLY focusses the organization on quality of service!

Q. Can you give us some insight into the processes around your transparency? Specifically who manages the communication, who is responsible for maintaining the dashboard, and what the general process looks like before/during/after a big event.
Our communications team (Marketing) is responsible for the OpenSRS Status page. We generally hire marketers that are technically comfortable so they can write to be understood and understand what they're writing about.

We have someone from Marketing on call 24/7/365 and whenever an issue cannot be resolved in an agreed-to period of time (generally 15 minutes) our Network Operations Center (also 24/7/365) informs Marketing and we post to Status.

Our Status page is a heavily customized version of Wordpress plus an email notification system and auto-updates to our Twitter feed.

Marketing and NOC then stay in touch until the issue is resolved, posting updates as material changes occur or at two hour intervals if the issue is ongoing.

You'll notice this is a largely manual system. We decided against posting our internal monitoring tools publicly because of the complexity of our operations. Multiple services each composed of multiple sub-systems running in data centers around the word mean that the raw data isn't as useful to resellers as it may be for some less complex environments.

In the event of a serious problem we also have an escalation process - once again managed by Marketing - that brings in additional levels of communications and executives. For major issues we also have a "War Room" procedure that is put in place until the issue is resolved.

Q. What would you say to other organizations that are considering transparency as a strategic initiative?
The days of hiding are over. You now have a choice of whether you want to tell the story or have others misrepresent the story on your behalf. It seems scary to admit you have problems but you gain so much by being open and honest that the stress of taking a new approach to communications is easily outweighed.

Tuesday, July 27, 2010

I'm doing an O'Reilly Webcast this Thursday!

The folks at O'Reilly asked me to do a webcast of my talk, and I was happy to oblige. This talk will be very similar to the one I did at Velocity. I don't think I'll be doing this talk for much longer, so this may be your last chance to hear it live. I'd love to have you there and to hear any feedback you may have about the message. The webcast will begin at 10am PST this coming Thursday, and you can register here.

Tuesday, July 20, 2010

Why Transparency Works

We've talked about the benefits of transparency. We've talked about implementing transparency. We've talked about transparency in action. What we haven't yet talked about is...why the heck does transparency work? Why does transparency make your users happier? Why do customers trust you more when you are transparent? Why do we want to know what's going on? What allows us to be OK with major problems by simply knowing what is going on? My theory is simple: Transparency gives us a sense of control, and control is required for happiness. Allow me to elaborate.

Downtime and learned helplessness

The concept of learned helplessness was developed in the 1960s and 1970s by Martin Seligman at the University of Pennsylvania. He found that animals receiving electric shocks, which they had no ability to prevent or avoid, were unable to act in subsequent situations where avoidance or escape was possible. Extending the ramifications of these findings to humans, Seligman and his colleagues found that human motivation [...] is undermined by a lack of control over one's surroundings. (source)

Learned helplessness was discovered by accident when Seligman was researching Pavlovian conditioning. His experiment was set up to associate a tone with a (harmless) shock, to test whether the animal would learn to run away from just the sound of the tone. In the now famous experiment, one group of dogs was restrained and unable to escape the shock for a period of time (i.e. this group had no control over its situation). Later this group was placed into an area that now allowed them to escape the shock; unexpectedly the dogs stayed put. The shocks continued to come, yet the dogs simply curled up in the corner and whimpered. These dogs exhibited depression, and in a sense gave up on life, because these negative events were seemingly random. Seligman concluded that "the strongest predictor of a depressive response was lack of control over the negative stimulus." What is downtime if not a lack of control over a negative stimulus?

The Cloud and loss of control

Many concerns come up when businesses consider the cloud, but as the survey by IDC below shows the overriding concern is rooted in a loss of control:

You give up a lot of control in exchange for reduced cost, higher efficiency, and increased flexibility. Yet that that desire for control persists, and the remaining bits of control you maintain become even more valuable.

Downtime kills our sense of control

Downtime is quite simply a negative event over which you have almost no control. Especially when using SaaS/cloud services your remaining semblance of control vanishes as soon as service goes down and you have no insight into what is going on. We are the dogs trapped in the shock machine, whimpering in the corner.

As I described in my talk, downtime is inevitable. Thanks to things like risk homeostasis, black swan events, unknown unknowns, and our own nature, there is no way to avoid failure. All we can do is prepare for it, and communicate/explain what is going on. And that is the key to keeping us from the fate of a depressed canine. Transparency gives us a sense of control over the uncontrollable.

How transparency gives us back the sense of control

Imagine walking through the park, the sun shining, the birds singing. All of a sudden you notice a strong pain in your arm. Your mind jumps to the worst. Are you having a heart attack, did something just bite you, are you getting older and sicker? Then a split second later you remember...your buddy jokingly punched you earlier in the day! The punch must have been harder than you remember, but it explains the pain. Instantly you feel better. Though the pain is the same, you understand the source. You have an explanation for the pain. Transparency delivers that explanation.

When Amazon goes down, or Gmail isn't loading, users feel pain. Part of that pain comes from the inconvenience of not being able to do what you want to get done, or the lost revenue that comes with downtime. But just as painful is the sense of fatalistic helplessness, especially if someone is breathing down your neck expecting you to fix the problem. Without insight into what is happening with the service, you are completely without control. If on the other hand the service provides an explanation, through a public health dashboard or a status blog or a simple tweet, your fatalistic reaction turns to concrete concern. Your mind goes from assuming the worst (e.g. this service is terrible, they don't know what they are doing, it always fails) to focusing on a real and specific problem (e.g. some hard drive in the datacenter failed, they had some user error, this'll be gone soon). A specific problem is fixable, an unexplained pain is not. Transparency brings the pain down to a specific and knowable problem, while also holding the provider accountable for their issues (which indirectly gives you even more control). Or better said:

Seligman believes it is possible to change people's explanatory styles to replace learned helplessness with "learned optimism." To combat (or even prevent) learned helplessness in both adults and children, he has successfully used techniques similar to those used in cognitive therapy with persons suffering from depression. These include identifying negative interpretations of events, evaluating their accuracy, generating more accurate interpretations, and decatastrophizing (countering the tendency to imagine the worst possible consequences for an event). (source)

By providing a sense of control, transparency is one of the keys to keeping us happy, productive, and sane in an increasingly uncontrollable world.

Thursday, July 8, 2010

Facebook and transparency

As some of you may know, I use Facebook as an example of how not-to-do-transparency in my talk. Immediately following my talk at Velocity, I received the following comment from Bret Taylor (CTO of Facebook):

The "Platform Live Status" page that is mentioned is such:

There's some really good stuff here (e.g. it exists, it looks up-to-date, and it has some great features). There is also a lot of room for improvement. Putting aside the fact that this wasn't meant to be a fully-featured dashboard, and is far better then nothing, lets run their status page through the rules for a successful public health dashboard and see what we can advise for the next evolution of Facebook's transparency initiative:

Rule #1: Must show the current status for each "service" you offer

Today the status page only gives the status of various services through plain text. For example, at the time of this writing, the "hello, active and total user counts are currently missing from both public profile pages and the API." The two graphs to the right show API response time and error rate across all API functions, not per-API or per-function area. Showing a graph and/or status light for each API/function would add tremendous value for developers that use specific parts of the application and only need to know about those specific areas. It would also make it easier to automate functionality, and to decide which components can be relied on in your architecture.

Recommendation: A graph and status light for each specific API function and end-point that developers may use. See Google's health dashboard for ideas.

Rule #2: Data must be accurate and timely

From the outside this appears to be solid. My big worry is that updates are currently very manual, which isn't going to scale. I haven't watched the site long enough to gauge how timely the updates are, but let's give them the benefit of the doubt. The main reason for this rule, requiring that your data be accurate and timely, comes down to trust. If your users get a hint of inaccuracy or delays in updates, they lose faith in the tool and stop using it. Your users will resort back to emailing/tweeting/complaining, which defeats the entire purpose.

Recommendation: Automate status updates as much as possible. Set up regular monitoring that posts status changes automatically. Create a formal process that requires someone to post a detailed update within a Minimum-Time-To-Communicate.

Rule #3: Must be easy to find

This may be the biggest problem today with Facebook's status page. I've been collecting public heath blogs/dashboards for a couple years now, and I've never come across it. Google'ing for "facebook uptime" or "facebook status" does not help. There are over 100 links to the page, but most are from deep within developer forums. If Facebook is serious about using transparency to their advantage, this page needs to be linked to from the first place that developers would go when they experience issues with the API.

Recommendation: Link to the status page from here and here. Not being a Facebook developer, I'm not the best judge of this, but I'm sure Facebook has plenty of data to figure this out.

Rule #4: Must provide details for events in real time

We discussed this already, but it's very important, especially for API-based developers. The error rate graph is very useful for this, which appears to be real-time. I would do more with it.

Recommendations: Show error rate per API/function (including the types of issues seen), and show historical information to give an impression of what's "normal". Developer mostly need to know who is at fault. If you simply let them know that something is up on your end, they'll feel a lot better and be able to go on with their day. See trust.salesforce.com for an easy way so integrate basic updates into dashboard (click on an error icon).

Rule #5: Provide historical uptime and performance data

Mostly lacking in this area. The graphs only go back to the start of the current day, and the text status-updates go back about 2 weeks. A historical perspective gives new developers a baseline to go by, and gives existing developers a chance to correlate issues they saw on their end.

Recommendation: See OpenSRS's dashboard for a simple way to do historical uptime/performance by service/API. Clicking on the "archive" link shows you past updates for every service.

Rule #6: Provide a way to be notified of status changes

Facebook is actually doing a great job here. They have both an RSS feed and an email option, which is extremely rare and extremely awesome. This allows developers to be pushed updates, and to integrate the updates into your internal dashboards. Great job here.

Recommendations: None!

Rule #7: Provide details on how the data is gathered

Currently customers have no insight into how the API response/errors are measured, and what the policy around status updates is. Is it ad-hoc, is it comprehensive, is it automated? It's hard to rely on this data today without insight into those policies and processes.

Recommendation: Add an explanation to the bottom of the page, or as a link off of the page, going into some of these details. You don't have to reveal your special sauce, just give us confidence that we can rely on this data.

Bonus

The list of top bugs along the left side is a GREAT idea. This takes transparency to another level, and I would highly encourage other sites to adopt this practice. Developers are the target audience for both health issue and outstanding bugs, so why not combine them (along with the "Developer Updates" feed) into a single dashboard ? Brilliant.
I like how the "Current Status" is broken out into a big yellow box at the top, making it clear what the situation is right now. This is much better than the default approach of showing the latest status as simply the top news item in the chronological list. A nice touch.

Conclusion

The most important takeaway is that Facebook has taken the hardest step toward transparency: getting a status blog/dashboard online. If they were to implement some of the recommendations above, they would see more of the benefits that come with transparency, and set a great example for other development platforms.

Thursday, July 1, 2010

Benefits of Transparency

I thought it would be helpful to consolidate a list of the primary benefits of web sites/services being transparent online. If there are any I missed, please leave a comment and I'll update the list:

Benefits of Transparency (for online websites and services)

1. Build trust with your users

2. Increase loyalty, reduce churn
3. Improve perception of your reliability
4. Reduce support costs
5. Control the message
6. Gain a competitive advantage
7. More time to focus on the actual problem
8. Reduce stress
9. Learn

See below for more detail...

1. Build trust with your users

Your users have a pretty low bar for how they expect to be treated. They basically expect you to screw them, hide information from them, and do the bare minimum to take their money. If you do something good for them, something unexpected like admit that you have problems proactively, and show your humanity, your users will develop a sense of trust for your service and your company. I believe that trust may be the most important asset you can earn on the web, especially if you deal things that are really important to your customers (e.g. money, email, photos, etc.).

Example: If the car company does a recall as soon as there is a hint of a problem, you trust them a lot more then if they are forced to do a recall after a number of deaths.

The more times you are proactive and admit to problems before you are caught, the stronger the sense of trust gets. If you are instead forced to admit your problems, or your customers complain before you tell them that you are aware of the problem, the harder it gets to convince them that you know what you are doing and that you care about the quality of the service.

2. Increase loyalty, reduce churn

Your users don't expect you to be perfect. They will forgive you when you have a problem. But only if they feel that they can trust you, that you know what you are doing, and that things are improving. Your users will stick with you if they feel like you know what you are doing, that you feel their pain, that you are taking these issues seriously. Apologizing and explaining after the fact is much more difficult. It is hard to convince your customers that you know what you are doing and that you care about their issues if you avoid the problem, or worse pretend that it doesn't exist.

Example: Atlassian's security breach a few months ago...they could have lost a lot of concerned customers questioning their is trustworthy. Instead they increased loyalty and trust by being up front about the situation, explaining what they are doing about it, and improving for the future. If instead the issue was exposed independently, they would have seen a mass exodus.

A major downtime event is innately going to lead to unhappy customers. You may as well try to turn it around into something worthwhile, and try to keep as many customers as you can. A nice side benefit is that the more your users learn to trust you, the more loyal and forgiving they become. It's a powerful loop that you want to get on the right side of.

3. Improve perception of your reliability

When users run into a problem with your service, whether it's their fault or yours, they'll often assume the wrong is on your end. If you instead show them exactly when you are actually having problems, and if you do this reliably and consistently, they'll know when you really have problems, and end up seeing that you aren't down as often as they thought. It's ironic that the more open you are about how often you have a problem, the less often your users will think you really are down.

Example: A complex web applications made up of many components, say using Google App Engine, the Foursquare API, and Google ads. You get alerted about a timeout issue...will you assume that Google is at fault or one of the other components. A quick visit to Google's public dashboard would show you that they are perfectly fine, and that the problem lies with one of the other services (which need their own public dashboards).

4. Reduce support costs

During a downtime incident your support department gets flooded with the same type of question..."I'm seeing a problem, what's going on?" and "Is the site down or is it just me?". If you can allow your customers to serve themselves, or make it easy for your support department to point complaints to a single succinct explanation, they can operate much more efficiently, and focus on higher level issues.

Also, a lot of times support doesn't even know what's going on during a downtime event, and having something to check themselves gives them more insight into the health of the system

Example: Amazon Web Services barely has support. They have a paid support service, and their forums, but otherwise there is very little real-time support. They can do this because they have a real-time public health dashboard that addresses 90% of the questions users are going to have in their day-to-day use of the service.

5. Control the message

If you don't tell your users what's going on during an event, they are going to speculate and assume the worst. They'll assume you aren't aware of the problem, that it'll last a long time, and that you're not taking it seriously. Even a simple update telling users that you are aware of the problem and are working on it gives them confidence that this isn't going to be the end of the company, and that you feel their pain.

Example: Users of Twitter experience on-and-off issues, but they can always tell how healthy the service is as a whole by visiting their public dashboard and status blog. They don't have to wonder how far-reaching the downtime is, or how long it'll last.

6. Gain a competitive advantage

All else being equal, when prospects are comparing your service to a competitor, especially when your service is critical to their own life/business, being able to tell a story about being transparent and open is a powerful differentiator. It gives your prospect a feeling of control, that they won't be left in the dark when the sh** hits the fan and their boss is breathing down their neck.

7. More time to focus on the actual problem

Especially for a small company, you can spend more time dealing with resolving the issue and less time fielding calls/emails. The better your process, the less you have to worry about beyond fixing the actual problem.

8. Reduce stress

With a defined process, ideally one that is procedural, you keeps people from freaking out and having to scramble at the worst possible time. The last thing you want to be doing during a downtime event is figuring out who can say what, and how to actually contact your entire customer base about a potential problem.

9. Learn
As noted by a comment by Heather Leson in the original post, disasters are an opportunity to help both customers and company staff share in the learning process. The more open you are about your issues, the more opportunity you'll have in both learning from your customers that may have had similar experiences, and the more your customers will learn from your experience. You aren't alone. Your customers have a vested interest in helping you succeed. You may be surprised by how forthcoming they are with advice and recommendations for your situation. Google App Engine ended up adding new features after a major downtime event, no doubt based on customer feedback. Amazon added their public health dashboard after one too many outages. As Heather put it, "Mutual success is one of the cornerstones of open source/open web organizations."

Transparent Uptime

Wednesday, July 28, 2010

Transparency in action at OpenSRS

Tuesday, July 27, 2010

I'm doing an O'Reilly Webcast this Thursday!

Tuesday, July 20, 2010

Why Transparency Works

Thursday, July 8, 2010

Facebook and transparency

Thursday, July 1, 2010

Benefits of Transparency

About Me

Resources

Cloud Health Status Updates

Blog Archive

Disclaimer