Transparent Uptime: 2010

Wednesday, November 10, 2010

All good things...must come to an end

After nearly two and a half years, over one hundred posts, a presentation at Velocity 2010, a quote in the Wall Street Journal, an O'Reilly webinar, and immeasurable friendships, connections, and opportunities that have come as a result of this blog, I am (sadly) putting the blog on indefinite hiatus. As of next week, I will be leaving my job (of nearly 10 years) and pursuing my dream of starting my own company. In that new world I do not foresee having the time necessary to give this blog the attention it deserves, and so to avoid leaving it in a state of perpetual uncertainty, this will be my final post.

It saddens me to bring this site to an end. I've gotten more out of it than I could have ever hoped. Reading over my first post (as painful as that is), I am happy to see that I have met the goals I set out for myself. Things have come a long way since those days, but there is still far more to do. My biggest hope is that you as a reader have gained some nugget of useful knowledge out of my writings, and that you continue to push forward on the basic ideas of transparency, openness, and simply helping your company act more human.

With regard to my startup, I don't have a lot of details to share just yet, but if you are interested in staying up to date please follow me on Twitter or LinkedIn. I can also give you my personal email address if you would like to contact me for any reason. All I can say at this point is that I will be moving to the one and only city of Montreal to work with the wonderful folks at Year One Labs. Mysterious eh?

Below is a list of my favorite (and most popular) posts from the past 2+ years:

Note: If you are doing anything similar that you think readers of this blog would find useful, please let me know in the comments and I'll update this post.

Signing off,
Lenny Rachitsky (@lennysan, LinkedIn)

Friday, October 8, 2010

Etsy.com opens the kimono and talks frankly about outages

You know when John Allspaw (VP of Ops at Etsy, Manager of Operations at Flickr, Infrastructure Architect at Friendster) is involved, you're going to get a unique perspective on things. A few weeks ago Etsy.com was down. John (and his operations department) decided it would be a good opportunity to take what I'll call an "outage bankruptcy" and basically reset expectations. In an extremely detailed and well thought out post (titled "Frank Talk about Site Outages") he goes on to describe the entire end-to-end process that goes into managing uptime at Etsy. I would recommend reading the entire post, but I thought it would be useful to point out the things that we can all take away from the experience of one of the most well respected operations people in the industry:

Metrics

"Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers. Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 metrics will alert someone on our operations staff (we have an on-call rotation) to wake up in the middle of the night to fix a problem."

Takeaway: Capture data on every part of your infrastructure, and later decide which metrics are leading indicators of problems. He goes on to talk about the importance of external monitoring (outside of your firewall) to measure the actual end-user experience.
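To make that takeaway a bit more concrete, here is a minimal Python sketch of capturing every metric while paging on only a small subset. The `MetricStore` class, the metric names, and the threshold are all hypothetical; nothing here comes from Etsy's post or describes their actual tooling.

```python
from collections import defaultdict


class MetricStore:
    """Keeps every sample; only metrics with a threshold can page someone."""

    def __init__(self, alert_thresholds):
        # alert_thresholds maps a metric name to its maximum acceptable value.
        # Both the names and the limits are made up for illustration.
        self.alert_thresholds = dict(alert_thresholds)
        self.samples = defaultdict(list)

    def record(self, name, value):
        """Store the sample and return an alert message, or None."""
        self.samples[name].append(value)
        limit = self.alert_thresholds.get(name)
        if limit is not None and value > limit:
            return f"ALERT {name}={value} exceeds {limit}"
        return None


store = MetricStore({"checkout_error_rate": 0.05})
quiet = store.record("cpu_usage", 0.97)            # captured, but never pages
paged = store.record("checkout_error_rate", 0.12)  # captured and pages
```

The point of the split is that adding a new metric costs nothing in on-call noise; promoting it to an alert is a separate, deliberate decision.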

Communication

"When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post to http://etsystatus.com to alert the community. Changes that are made to mitigate the outage are largely done in a one-at-a-time fashion, and we track both our time-to-detect as well as our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, our average time-to-detect is on the order of 2 minutes for any outages or major site issues in the past year. This is mostly due to continually tuning our alerting system."

Takeaway: Two important points here. First, communication and collaboration are key to successfully managing issues. Second, and even more interesting, is the need for two teams...one to address the problem and one to communicate status updates both internally and externally. This is often a missing piece for companies, where no updates go out because everyone is busy fixing the problem.
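If you want to track time-to-detect and time-to-resolve the way the quote describes, a rough Python sketch could look like the following. The `Incident` class and the timestamps are invented for illustration, and measuring time-to-resolve from the start of the incident (rather than from detection) is my assumption, not something the Etsy post specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when the site was declared stable again

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from the start of the incident, not detection.
        return self.resolved - self.started


# Made-up timestamps, purely for illustration.
incident = Incident(
    started=datetime(2010, 10, 1, 3, 0),
    detected=datetime(2010, 10, 1, 3, 2),
    resolved=datetime(2010, 10, 1, 3, 47),
)
```

Recording the three timestamps during the incident means the numbers are already there when the postmortem meeting starts.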

Post-Mortems

"After any outage, we meet to gather information about the incident. We reconstruct the time-line of events; when we knew of the outage, what we did to fix it, when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress."

Takeaway: Fixing the problem and getting back online is not enough. Make it an automatic habit to schedule a postmortem to do a deep dive into the root cause(s) of the problem, and address not only the immediate bugs but also the deeper issues that led to the root cause. The Five Why's can help here, as can the Lean methodology of investing a proportional number of hours into the most problematic parts of the infrastructure.

Single Point of Failure Reduction

"As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason we have for this evolution is to avoid depending on single pieces of hardware to be up and running all of the time. Servers can fail at any time, and Etsy.com should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep them in sync, and make sure our code can route around any individual failures.

So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality."

Takeaway: Reduce single points of failure incrementally. Do what you can in the time you have.
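As a small illustration of code that can "route around any individual failures", here is a hedged Python sketch that tries each copy of the data in turn. The function name, the replica callables, and the sample data are hypothetical; this is not how Etsy implements it.

```python
def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read.

    `replicas` is a list of callables taking a key; each one stands in
    for a connection to one copy of the data (a hypothetical interface).
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next copy
    raise ConnectionError(
        f"all {len(replicas)} replicas failed for {key!r}"
    ) from last_error


def dead_replica(key):
    raise ConnectionError("primary is down")


def healthy_replica(key):
    return {"listing:42": "hand-knit scarf"}[key]


value = read_with_failover([dead_replica, healthy_replica], "listing:42")
```

A single dead server costs one failed attempt instead of an outage, which is the whole idea; keeping the copies in sync is the hard part and is not shown here.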

Change Management and Risk

"For every type of technical change, we have answers to questions like:

What problem does the change solve?
Has this kind of change happened before? Is there a successful history?
When is the change going to start? When is it expected to end?
What is the expected effect of this change on the Etsy community? Is a downtime required for the change?
What is the rollback plan, if something goes wrong?
What test is needed to make sure that the change succeeded?

As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.

Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping."

Takeaway: Be prepared for failure by anticipating worst-case scenarios for every change. Be ready to roll back and respond. More importantly, make sure to track when things go right to have a realistic measure of risk.
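The checklist questions and the habit of tracking results fit naturally into a small record type. The sketch below is one possible Python shape for it; the `ChangeRequest` class, its field names, and the sample values are my own invention, not Etsy's process.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ChangeRequest:
    # One field per checklist question (field names are hypothetical).
    problem_solved: str
    prior_history: str
    schedule: str
    community_impact: str
    rollback_plan: str
    success_test: str
    succeeded: Optional[bool] = None  # filled in after the change has run

    def unanswered(self) -> List[str]:
        """Names of checklist questions that were left blank."""
        return [
            f.name
            for f in fields(self)
            if f.name != "succeeded" and not getattr(self, f.name).strip()
        ]


def success_rate(changes: List[ChangeRequest]) -> Optional[float]:
    """Fraction of completed changes that succeeded, or None if none ran."""
    finished = [c for c in changes if c.succeeded is not None]
    if not finished:
        return None
    return sum(1 for c in finished if c.succeeded) / len(finished)


risky = ChangeRequest(
    problem_solved="Search index is out of disk space",
    prior_history="Done twice before without incident",
    schedule="Tuesday 02:00 to 02:30",
    community_impact="No downtime expected",
    rollback_plan="",  # left blank on purpose
    success_test="Run a sample search query",
)
history = [
    ChangeRequest("a", "b", "c", "d", "e", "f", succeeded=True),
    ChangeRequest("a", "b", "c", "d", "e", "f", succeeded=True),
    ChangeRequest("a", "b", "c", "d", "e", "f", succeeded=False),
    risky,
]
```

Here `risky.unanswered()` flags the missing rollback plan before the change goes out, and `success_rate(history)` ignores changes that have not run yet.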

Other takeaways:

Declaring "outage bankruptcy" is not the ideal approach. But it is better than simply going along without any authentic communication with your customers throughout a period of instability. Your customers will understand, if you act human.
Etsy has been doing a great job keeping customers up to date at http://etsystatus.com/.
A glance at the comments on the page shows a few upset customers, but a generally positive response.

Wednesday, October 6, 2010

Foursquare gets transparency

Early Monday morning of this week, Foursquare went down hard:

11 hours later, the #caseofthemondays was over and they were back online. Throughout those 11 hours, users had one of the following experiences:

1. When visiting foursquare.com, they saw:

2. When using the iPhone/Android/Blackberry app, they saw an error telling them the service is down and to try again later.

3. When checking Twitter (the now default source of downtime information), they saw a lot of people complaining and the following tweets from the official @foursquare account (if they thought of checking the @foursquare account):

Those were the only options available to a user of Foursquare for those 11 hours. An important question we need to answer is whether anyone seriously cared. Are users of consumer services like Foursquare legitimately concerned with Foursquare's downtime? Are they going to leave for competing services or just quit the whole check-in game? I'd like to believe that 11 hours of downtime matters, but honestly it's too early to tell. This will be a great test of the stickiness and Whuffie that Foursquare has built up.

The way I see it is that this is one strike against Foursquare (which includes the continued instability they've seen since Monday). They probably won't see a significant impact to their user base. However, if this happens again, and again, and again, the story changes. And as I've argued, downtime is inevitable. Foursquare will certainly go down again. The key is not reducing downtime to zero, but how you handle that downtime to avoid giving your competition an opening and, even more importantly, using that downtime to build trust and loyalty with your users. How do you accomplish this? Transparency.

We've talked about the benefits of transparency, why transparency works, and how to implement it. We saw above how Foursquare handled the pre- and intra- downtime steps (not well), so let's take a look at how they did in the post-downtime phase by reviewing the public postmortem (both of them) they published. As always, let's run it through the gauntlet.

Prerequisites:

Admit failure - Excellent. The entire first paragraph describes the downtime, and how painful it was to users.
Sound like a human - Very much. This has never been a problem for Foursquare. The tone is very trustworthy.
Have a communication channel - Prior to the event, all they had were their twitter accounts and their API developer forums. As a result of this incident, they have since launched http://status.foursquare.com/, and have promised to update @4sqsupport on a regular basis throughout the incident.
Above all else, be authentic - This may be the biggest thing going for them.

Requirements:

Start time and end time of the incident - Missing. All we know is that they were down for 11 hours. I don't see this as being critical in this case, but it would have been nice to have.
Who/what was impacted - A bit vague, but the impression was that everyone was impacted.
What went wrong - Extremely well done. I feel very informed, and can sympathize with the situation.
Lessons learned - Again, extremely well done. I love the structure they used: What happened, What we’ll be doing differently – technically speaking, What we’re doing differently – in terms of process. Very effective.

Bonus:

Details on the technologies involved - Yes!
Answers to the Five Why's - No :(
Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Yes!
What others can learn from this experience - Yes!
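Since I keep running postmortems through the same guideline, here is a rough Python sketch of that guideline as a checklist. The item names come from the lists above; the `missing_items` helper is hypothetical, and treating each item as simply covered or not covered is a simplification of the judgment calls in the review.

```python
GUIDELINE = {
    "Prerequisites": [
        "Admit failure",
        "Sound like a human",
        "Have a communication channel",
        "Above all else, be authentic",
    ],
    "Requirements": [
        "Start time and end time of the incident",
        "Who/what was impacted",
        "What went wrong",
        "Lessons learned",
    ],
    "Bonus": [
        "Details on the technologies involved",
        "Answers to the Five Why's",
        "Human elements",
        "What others can learn from this experience",
    ],
}


def missing_items(covered):
    """Return, per section, the guideline items a postmortem did not cover."""
    return {
        section: [item for item in items if item not in covered]
        for section, items in GUIDELINE.items()
    }


# The Foursquare review above: everything except the two items marked
# "Missing" and "No" is treated as covered.
all_items = [item for items in GUIDELINE.values() for item in items]
foursquare_covered = set(all_items) - {
    "Start time and end time of the incident",
    "Answers to the Five Why's",
}
gaps = missing_items(foursquare_covered)
```

For Foursquare, `gaps` comes back with one missing requirement (the start and end time) and one missing bonus item (the Five Why's), which matches the review.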

Other takeaways:

Foursquare launched a public health status feed! Check it out at http://status.foursquare.com/.
I really like the structure used in this postmortem. It has inspired me to want to create a basic template for postmortems. Stay tuned...
Could this be Foursquare's Friendster moment? I hope not. My personal project relies completely on Foursquare.
I've come to realize that in most cases, downtime is less impactful to the long term success of a business than site performance. Users understand downtime and just try again later. Slowness eats away at you: you start to hate using the service and jump on an opportunity to use something more fun/fast/pleasant.

Going forward, the big question will be whether Foursquare maintains their new processes, keeps the status blog up to date, and can fix their scalability issues. I for one am rooting for them.

Wednesday, September 29, 2010

Case Study: Facebook outage

I'm a bit late to the story (something called a day job getting in the way!) but I can't pass up an opportunity to discuss how Facebook handled the "worst outage [they've] had in over four years". I blogged about the intra-incident communication the day they had the outage, so let's review the postmortem that came out after they had recovered, and how they handled the downtime as a whole.

Using the "Upside of Downtime" framework (above) as a guide:

Prepare: Much room for improvement. The health status feed is hard to find for the average user/developer, and the information was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again the information was limited.
Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels they had in place (status feed, Twitter) more effectively by communicating throughout the incident, with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity. That is exactly what you want to strive to avoid.
Explain: Let's find out by running the postmortem through our guideline for postmortem communication...

Prerequisites:

Admit failure - Excellent, almost a textbook admittance without hedging or blaming.

Sound like a human - Well done. Posted from Director of Engineering at Facebook Robert Johnson's personal account, the tone and style was personal and effective.

Have a communication channel - Can be improved greatly. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.

Above all else, be authentic - No issues here.

Requirements:

Start time and end time of the incident - Missing.

Who/what was impacted - Partial. I can understand this being difficult in the case of Facebook, but I would have liked to see more specifics around how many users were affected. On one hand this is a global consumer service that may not be critical to people's lives. On the other hand though, if you treat your users with respect, they'll reward you for it.

What went wrong - Well done, maybe the best part of the postmortem.

Lessons learned - Partial. It sounds like many lessons were certainly learned, but they weren't directly shared. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like.

Bonus:

Details on the technologies involved - No

Answers to the Five Why's - No

Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - No

What others can learn from this experience - Marginal

Biggest lesson for us to take away: Preparation is key to successfully managing outages, and using them to build trust with your users.

Transparency in action at Twitter

Enjoyed that tweet from the other day. As you may know, Twitter ran into a very public cross-site scripting (XSS) vulnerability recently:

"The short story: This morning at 2:54 am PDT Twitter was notified of a security exploit that surfaced about a half hour before that, and we immediately went to work on fixing it. By 7:00 am PDT, the primary issue was solved. And, by 9:15 am PDT, a more minor but related issue tied to hovercards was also fixed."

News of the vulnerability exploded, but very quickly Twitter came out with a fix and, just as importantly, a detailed explanation of what happened, what they did about it, and where they are going from here:

The security exploit that caused problems this morning Pacific time was caused by cross-site scripting (XSS). Cross-site scripting is the practice of placing code from an untrusted website into another one. In this case, users submitted javascript code as plain text into a Tweet that could be executed in the browser of another user.

We discovered and patched this issue last month. However, a recent site update (unrelated to new Twitter) unknowingly resurfaced it.

Early this morning, a user noticed the security hole and took advantage of it on Twitter.com. First, someone created an account that exploited the issue by turning tweets different colors and causing a pop-up box with text to appear when someone hovered over the link in the Tweet. This is why folks are referring to this an “onMouseOver” flaw -- the exploit occurred when someone moused over a link.

Other users took this one step further and added code that caused people to retweet the original Tweet without their knowledge.

This exploit affected Twitter.com and did not impact our mobile web site or our mobile applications. The vast majority of exploits related to this incident fell under the prank or promotional categories. Users may still see strange retweets in their timelines caused by the exploit. However, we are not aware of any issues related to it that would cause harm to computers or their accounts. And, there is no need to change passwords because user account information was not compromised through this exploit.

We’re not only focused on quickly resolving exploits when they surface but also on identifying possible vulnerabilities beforehand. This issue is now resolved. We apologize to those who may have encountered it.

Well done.
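For readers less familiar with this class of bug, the general defense is to escape user-supplied text before placing it in a page. The Python sketch below illustrates that idea using the standard library's `html.escape`; the `render_tweet` function and the sample payload are hypothetical, and this is not a description of Twitter's actual fix.

```python
import html


def render_tweet(text: str) -> str:
    """Escape user-supplied text before embedding it in HTML."""
    # quote=True also escapes " and ', which matters inside attributes.
    return '<p class="tweet">' + html.escape(text, quote=True) + "</p>"


# A made-up payload in the spirit of the "onMouseOver" flaw.
malicious = '<a href="#" onmouseover="alert(1)">hover me</a>'
safe = render_tweet(malicious)
```

After escaping, the browser shows the markup as literal text instead of treating it as a link with an event handler.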

Thursday, September 23, 2010

Facebook downtime

Facebook has been experiencing some major downtime today in various locations around the world:

"After issues at a third-party networking provider took down Facebook for some users on Wednesday, the social networking site is once again struggling to stay online.

The company reports latency issues with its API on its developer site, but the problem is clearly broader than that with thousands of users tweeting about the outage.

On our end when we attempt to access Facebook, we’re seeing the message: “Internal Server Error – The server encountered an internal error or misconfiguration and was unable to complete your request.” Facebook “Like” buttons also appear to be down on our site and across the Web"

Details are still sketchy (there's speculation Akamai is at fault). And that's the problem. It's almost all speculation right now. The official word from facebook is simply:

"We are currently experiencing latency issues with the API, and we are actively investigating. We will provide an update when either the issue is resolved or we have an ETA for resolution."

That's not going to cut it when you have 500+ million users and countless developers (Zynga must be freaking out right now). I'm seeing about 400 tweets/second complaining about the downtime. Outages will happen. The problem isn't the downtime itself. Where Facebook is missing the boat is in using this opportunity to build increased trust with their user and developer community by simply opening up the curtains a bit and telling us something useful. I've seen some movement from Facebook on this front before. But there's much more they can do, and I'm hoping this experience pushes them in the right direction. Give us back a sense of control and we'll be happy.

P.S. You can watch for updates here, here, and here.

Wednesday, September 22, 2010

BP portraying Deepwater Horizon explosion as a "Normal Accident"...unknowingly calls for end of drilling

While reading last week's issue of Time magazine, I came across this explanation of BP's pitch attempting to explain the recent accident in the Gulf:

"Following a four-month investigation, BP released a report Sept. 8 that tried to divert blame from itself to other companies -- including contractors like Transocean -- for the April 20 explosion that sank the Deepwater Horizon rig, killing 11 people and resulting in the worst oil spill in U.S. history. A team of investigators cited 'a complex and interlinked series of mechanical failures, human judgement' and 'engineering design' as the ultimate cause of the accident."

Though to some it may come off as a naive "it's not our fault" strategy, the reality (and consequence) is a lot more interesting. I've spoken before about the concept of a "Normal Accident", but let's define it again:

Normal Accident Theory: When a technology has become sufficiently complex and tightly coupled, accidents are inevitable and therefore in a sense 'normal'.

Accidents such as Three Mile Island and a number of others, all began with a mechanical or other technical mishap and then spun out of control through a series of technical cause-effect chains because the operators involved could not stop the cascade or unwittingly did things that made it worse. Apparently trivial errors suddenly cascade through the system in unpredictable ways and cause disastrous results.

What BP is saying is that their systems are so "complex and interlinked" that they were unable to avert the disaster. In a sense, they are arguing that disaster was inevitable. If "Normal Accident Theory" can be believed, BP is indirectly suggesting deep water oil drilling should be abandoned:

"This way of analysing technology has normative consequences: If potentially disastrous technologies, such as nuclear power or biotechnology, cannot be made entirely 'disaster proof', we must consider abandoning them altogether."

Charles Perrow, the author of Normal Accident Theory, came to the conclusion that "some technologies, such as nuclear power, should simply be abandoned because they are not worth the risk".

Where do I sign?

Thursday, September 16, 2010

Chase.com goes down due to third party DB issues, apologizes...eventually

From Data Center Knowledge:

"The Chase.com online banking portal is back online and processing customer bill payments that were delayed during lengthy outages Tuesday and Wednesday, the company said this morning.

The Chase web site crashed Monday evening when a third party vendor’s database software corrupted the log-in process, the bank told the Wall Street Journal. Chase said no customer data was at risk and that its telephone banking and ATMs functioned as usual throughout the outage."

Unfortunately there was no communication during the event. Four days after the first outage, Chase finally got a message out to customers who visited the website:

The "we're sorry" message is well done, but overall...not good.

Monday, September 13, 2010

Domino's using transparency as a competitive advantage

From the NY Times:

Domino’s Pizza is extending its campaign that promises customers transparency along with tasty, value-priced pizza.

The campaign, by Crispin Porter & Bogusky, part of MDC Partners, began with a reformulation of pizza recipes and continued recently with a pledge to show actual products in advertising rather than enhanced versions lovingly tended to by professional food artists.

The vow to be more real was accompanied by a request to send Domino’s photographs of the company’s pizzas as they arrive at customers’ homes. A Web site, showusyourpizza.com, was set up to receive the photos.

A commercial scheduled to begin running on Monday will feature Patrick Doyle, the chief executive of Domino’s, pointing to one of the photographs that was uploaded to the Web site. The photo shows a miserable mess of a delivered pizza; the toppings and a lot of the cheese are stuck to the inside of the box.

“This is not acceptable,” Mr. Doyle says in the spot, addressing someone he identifies as “Bryce in Minnesota.”

“You shouldn’t have to get this from Domino’s,” Mr. Doyle continues. “We’re better than this.” He goes on to say that such subpar pizza “really gets me upset” and promises: “We’re going to learn; we’re going to get better. I guarantee it.”

Friday, August 13, 2010

How to Prevent Downtime Due to Human Error

Great post today over at Data Center Knowledge, citing the fact that "70 percent of the problems that plague data centers" are caused by human error. Below are the best practices to avoid data center failure by human error:

1. Shielding Emergency OFF Buttons – Emergency Power Off (EPO) buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled, and are mistakenly shut off during an emergency, which shuts down power to the entire data center. Labeling and covering EPO buttons can prevent someone from accidentally pushing the button. See Averting Disaster with the EPO Button and Best Label Ever for an EPO Button for more on this topic.

2. Documented Method of Procedure - A documented step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Don’t limit the procedure to one vendor, and ensure back-up plans are included in case of unforeseen events.

3. Correct Component Labeling - To correctly and safely operate a power system, all switching devices must be labeled correctly, as well as the facility one-line diagram to ensure correct sequence of operation. Procedures should be in place to double check device labeling.

4. Consistent Operating Practices – Sometimes data center managers get too comfortable and don’t follow procedures, forget or skip steps, or perform the procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up to date and follow the instructions to operate the system.

5. Ongoing Personnel Training – Ensure all individuals with access to the data center, including IT, emergency, security and facility personnel, have basic knowledge of equipment so that it’s not shut down by mistake.

6. Secure Access Policies – Organizations without data center sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.

7. Enforcing Food/Drinks Policies – Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center’s food/drink policy is to post a sign outside the door that states what the policy is, and how vigorously the policy is enforced.

8. Avoiding Contaminants – Poor indoor air quality can cause unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties, or by placing a mat outside the data center. This includes packing and unpacking equipment outside the data center. Moving equipment inside the data center increases the chances that fibers from boxes and skids will end up in server racks and other IT infrastructure.

Thursday, August 12, 2010

Downtime, downtime, downtime - DNS Made Easy, Posterous, Evernote

It's been a busy week on the interwebs. Either downtime incidents are becoming more common, or I'm just finding out about more of them. One nice thing about this blog is that readers send me downtime events that they come across. I don't know if I want to be the first person that people think of when they see downtime, but I'll take it. In the spirit of this blog, let's take a look at the recent downtime events to see what they did right, what they can improve, and what we can all learn from their experience.

DNS Made Easy
On Saturday August 7th, DNS Made Easy was the target of a massive DDoS attack:

"The firm said it experienced 1.5 hours of actual downtime during the attack, which lasted eight hours. Carriers including Level3, GlobalCrossing, Tinet, Tata, and Deutsche Telekom assisted in blocking the attack, which due to its size flooded network backbones with junk."

Prerequisites:

Admit failure - Through a series of customer email communications and tweets, there was a clear admittance of failure early and often.
Sound like a human - Yes, the communications all sounded genuine and human.
Have a communication channel - Marginal. The communication channels were Twitter and email, which are not as powerful as a health status dashboard.
Above all else, be authentic - Great job here. All of the communication I saw sounded authentic and heartfelt, including the final postmortem. Well done.

Requirements:

Start time and end time of the incident - Yes, final postmortem email communication included the official start and end times (8:00 UTC - 14:00 UTC).
Who/what was impacted - The postmortem addressed this directly, but didn't spell out a completely clear picture of who was affected and who wasn't. This is probably because there isn't a clear distinction between sites that were and weren't affected. To address this, they recommended customers review their DNS query traffic to see how they were affected.
What went wrong - A good amount of detail on this, and I hope there is more coming. DDoS attacks are a great example of where sharing knowledge and experience helps the community as a whole, so I hope to see more detail come out about this.
Lessons learned - The postmortem included some lessons learned, but nothing very specific. I would have liked to see more here.

Bonus:

Details on the technologies involved - Some.
Answers to the Five Why's - Nope.
Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Some.
What others can learn from this experience - Some.

Other notes:

The communication throughout the incident was excellent, though they could have benefited from a public dashboard or status blog that went beyond twitter and private customer emails.
I don't think this is the right way to address the question of whether SLA credits will be issued: "Yes it will be. With thousands paying companies we obviously do not want every organization to submit an SLA form."

Posterous

Starting Wednesday 8/4, Posterous began to experience various stability issues:

"As you’re no doubt aware, Posterous has had a rocky six days.

On Wednesday and Friday, our servers were hit by massive Denial of Service (DoS) attacks. We responded quickly and got back online within an hour, but it didn’t matter; the site went down and our users couldn’t post.

On Friday night, our team worked around the clock to move to new data centers, better capable of handling the onslaught. It wasn’t easy. Throughout the weekend we were fixing issues, optimizing the site, some things going smoothly, others less so.

Just at the moments we thought the worst was behind us, we’d run up against another challenge. It tested not only our technical abilities, but our stamina, patience, and we lost more than a few hairs in the process."

Posterous continued to update their users on their blog, and on twitter. They also sent out an email communication to all of their customers to let everyone know about the issues.

Prerequisites:

Admit failure - Clearly yes, both on the blog and on Twitter.
Sound like a human - Very much so.
Have a communication channel - A combination of blog and Twitter. Again, not ideal, as customers may not think about visiting the blog or checking Twitter, especially when the blog is inaccessible during the downtime and they may not be aware of the Twitter account. One of the keys to a communication channel is to host it offsite, which would have been important in this case.
Above all else, be authentic - No issues here, well done.

Requirements:

Start time and end time of the incident - A bit vague in the postmortem, but can be calculated from the Twitter communication. Can be improved.
Who/what was impacted - The initial post described this fairly well: all customers hosted on Posterous.com were affected, including custom domains.
What went wrong - A series of things went wrong in this case, and I believe the issues were described fairly well.
Lessons learned - Much room for improvement here. Things were put in place to avoid these issues in the future, such as moving to a new datacenter and adding hardware, but I don't see any real lessons learned in the postmortem posts or other communications.

Bonus:

Details on the technologies involved - Very little.
Answers to the Five Why's - No.
Human elements - Yes, in the final postmortem, well done.
What others can learn from this experience - Not a lot here.

Evernote

From their blog:

"[...] Evernote servers. We immediately contacted all affected users via email and our support team walked them through the recovery process. We automatically upgraded all potentially affected users to Evernote Premium (or added a year of Premium to anyone who had already upgraded) because we wanted to make sure that they had access to priority tech support if they needed help recovering their notes and as a partial apology for the inconvenience."

Prerequisites:

Admit failure - Extremely solid, far beyond the bare minimum.
Sound like a human - Yes.
Have a communication channel - A simple health status blog (which according to the comments is not easy to find), a blog, and a Twitter channel. Biggest area of improvement here is to make the status blog easier to find. I have no idea how to get to that from the site or the application, and that defeats its purpose.
Above all else, be authentic - The only communication I saw was the final postmortem, and in that post (and the comments) I think they were very authentic.

Requirements:

Start time and end time of the incident - Rough timeframe, would have liked to see more detail.
Who/what was impacted - First time I've seen an exact figure like "6,323" users. Impressive.
What went wrong - Yes, at the end of the postmortem.
Lessons learned - Marginal. A bit vague and hand-wavy.

Bonus:

Details on the technologies involved - Not bad.
Answers to the Five Why's - No.
Human elements - No.
What others can learn from this experience - Not a lot here.

Conclusion

Overall, I'm impressed with how these companies are handling downtime. Each communicated early and often. Each admitted failure immediately, and kept their users up to date. Each put out a solid postmortem that detailed the key information. It's interesting to see how Twitter is becoming the de-facto communication channel during an incident. I still wonder how effective it is in getting news out to all of your users, and how many users are aware of it. Overall, well done guys.

Update: DNS Made Easy just launched a public health dashboard!

Tuesday, August 10, 2010

Transparency in action at Twilio

When Twilio launched an open-source public health dashboard tool a couple of weeks ago, I knew I had to learn more about Twilio. I connected with John Britton (Developer Evangelist at Twilio) to get some insight into Twilio's transparency story. Enjoy...

Q. What motivated Twilio to launch a public health dashboard and to put resources into transparency?
Twilio's goal is to bring the simplicity and transparency common in the world of web technologies to the opaque world of telephony and communications. Just as Amazon AWS and other web infrastructure providers give customers direct and immediate information on service availability, Stashboard allows Twilio to provide a dedicated status portal that our customers can visit anytime to get up-to-the-minute information on system health. During the development of Stashboard, we realized how many other companies and businesses could use a simple, scalable status page, so we open sourced it! You can download the source code or fork your own version.

Q. What roadblocks did you encounter on the way to launching the public dashboard, and how did you overcome them?
The most difficult part of building and launching Stashboard was creating a robust set of APIs that would encompass Twilio's services as well as other services from companies interested in running an instance of Stashboard themselves. We looked at existing status dashboards for inspiration, including the Amazon AWS Status Page and the Google Apps Status Page, and settled on a very general design independent from Twilio's product. The result is a dashboard that can be utilized to track a variety of APIs and services. For example, a few days after the release of Stashboard, MongoHQ, a hosted MongoDB database provider, launched their own instance of Stashboard to give their customers API status information.
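To make the idea of a "very general design independent from Twilio's product" a bit more concrete, here is a minimal sketch of what a product-independent status model could look like. This is not Stashboard's actual API or data model; the names `Level`, `Event`, `Service`, and `overall` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Level(Enum):
    # Hypothetical severity levels, ordered from best to worst.
    UP = 0
    WARNING = 1
    DOWN = 2


@dataclass
class Event:
    # One status change for a service, with a human-readable message.
    level: Level
    message: str
    at: datetime


@dataclass
class Service:
    # Any named API or service; nothing here is specific to one product.
    name: str
    events: list = field(default_factory=list)

    def report(self, level, message, at=None):
        # Record a status change; defaults to the current UTC time.
        at = at or datetime.now(timezone.utc)
        self.events.append(Event(level, message, at))

    def current_level(self):
        # A service with no recorded events is assumed to be healthy.
        if not self.events:
            return Level.UP
        return max(self.events, key=lambda event: event.at).level


def overall(services):
    # The dashboard headline is the worst current level across all services.
    return max(
        (service.current_level() for service in services),
        key=lambda level: level.value,
        default=Level.UP,
    )
```

Because a service is just a name plus a list of events, the same structure could describe a telephony API, a hosted database, or anything else a company wants to report on.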

Q. What benefits have you seen as a result of your transparency initiatives?
Twilio's rapid growth is a great example of how developers at both small and large companies have responded to Twilio's simple open approach. The Twilio developer community has grown to more than 15,000 strong and we see more and more applications and developers on the platform every day. Twilio was founded by developers who have a strong background in web services and distributed systems. This is reflected in our adoption of open standards like HTTP and operational transparency with services like http://status.twilio.com. Another example is the community that has grown up around OpenVBX, a downloadable phone system for small business that Twilio developed and open sourced a few weeks ago. We opened OpenVBX to provide developers the simplest way to hack, skin, and integrate it with their own systems.

Q. What is your hope with the open source dashboard framework?
The main goal of Stashboard is to give back to the community. We use open source software extensively inside Twilio and we hope that by opening up Stashboard it will help other hosted services and improve the whole web services ecosystem.

Q. What would you say to companies considering transparency in their uptime/performance?
Openness and transparency are key to building trust with customers. Take the telecom industry as an example. They are known for being completely closed. Customers rarely love or trust their telecom providers. In contrast, Twilio brings the open approach of the web to telecom and the response has been truly amazing. When customers know they can depend on a company to provide accurate data concerning performance and reliability, they are more willing to give that company their business and recommend it to their peers. Twilio's commitment to transparency and openness has been a huge driver of our success and Stashboard and projects like OpenVBX are just the beginning.

Wednesday, July 28, 2010

Transparency in action at OpenSRS

OpenSRS has long been a company that "gets it", so I was excited to have the opportunity to interview Ken Schafer, who leads the transparency efforts at OpenSRS and Tucows. OpenSRS has an excellent public health dashboard, and continues to put a lot of effort into transparency. Heather Leson, who works with Ken, has done a lot to raise the bar in the online transparency community. My hope is that the more transparent we all get about our own transparency efforts (too much?) the more we all benefit. Below, Ken tells us how he got the company to accept the need for transparency, what hurdles they had to overcome, and what benefits they've seen. Enjoy the interview, and if you have any questions for Ken, please post them as comments below.

Q. Can you briefly explain your role and how you got involved in your company’s transparency initiative?
My formal title is Executive Vice President of Product & Marketing. That means I'm on the overall Tucows exec team and I'm also responsible for the product strategy and marketing of OpenSRS, our wholesale Internet services group.

Tucows is one of the original Internet companies - founded in 1993. We've moved well beyond the original software download site and now the company makes most of its money providing easy-to-use Internet services.

OpenSRS provides end users with over 10 million domain names, millions of mailboxes, and tens of thousands of digital certificates through over 10,000 resellers in over 100 countries. Our resellers are primarily web hosts, ISPs, web developers and designers, and IT consultants.

Q. What has your group done to create transparency for your organization?
Given the technical adeptness of our resellers we've always tried not to talk down to them and to provide as much information as we can. Our success and the success of our resellers are highly dependent on each other so we're very open to sharing and in fact since the beginning of OpenSRS in 1999 we've run mailing lists, blogs, forums, wiki, status pages and a host of other ways for us to communicate better with our resellers.

Transparency is kind of in the nature of the business at this point.

Right now we provide transparency into what we're doing through a blog, a reseller forum, our Status site and our activity on a host of social networks.

Q. What was the biggest hurdle you had to get over to push this through?
The biggest challenge is really whether your commitment to transparency can survive the bad times. Being transparent when you've got a status board full of green tick marks isn't that hard. When everything starts turning red and staying that way, THAT'S a lot harder.

We're generally proud of our uptime and the quality of our services but a few years ago we struggled with scaling some of our applications and, frankly, our communication around the problems we were facing suffered as a result. People here were just too embarrassed to tell our resellers that we'd messed stuff up and in particular to admit to our fellow geeks HOW we'd messed up.

But when we pushed and DID share information and admitted our mistakes and talked about what we could do to make it better what we found was that our resellers were appreciative AND very sympathetic. They'd all been there too and knew it was hard to fess up to our errors in judgment and they really appreciated it.

One thing we STILL struggle with is how we communicate around network attacks. Our services run a big chunk of the Internet and as such we're under pretty much constant attack of one sort or another. We handle most of these without anyone noticing. Our operations and security teams do an amazing job of keeping things running smoothly in the face of these attacks but every once in a while something new - in scope, scale or technique - happens that puts pressure on our systems until we can adjust to the new threat.

In those cases we've tended to put our desire for transparency aside and give minimal information so as not to show our hand to the bad guys. It's a struggle between what we share so customers understand what is happening and not showing potential vulnerabilities that others could exploit.

I guess "sharing what is exploitable" is where I draw the line when it comes to transparency.

Q. What benefits have you seen as a result of your transparency?
One of the biggest benefits is in the overall quality of the service. When you say that EVERY problem is going to get publicly and permanently posted to a status page it REALLY focusses the organization on quality of service!

Q. Can you give us some insight into the processes around your transparency? Specifically who manages the communication, who is responsible for maintaining the dashboard, and what the general process looks like before/during/after a big event.
Our communications team (Marketing) is responsible for the OpenSRS Status page. We generally hire marketers that are technically comfortable so they can write to be understood and understand what they're writing about.

We have someone from Marketing on call 24/7/365 and whenever an issue cannot be resolved in an agreed-to period of time (generally 15 minutes) our Network Operations Center (also 24/7/365) informs Marketing and we post to Status.

Our Status page is a heavily customized version of Wordpress plus an email notification system and auto-updates to our Twitter feed.

Marketing and NOC then stay in touch until the issue is resolved, posting updates as material changes occur or at two hour intervals if the issue is ongoing.

You'll notice this is a largely manual system. We decided against posting our internal monitoring tools publicly because of the complexity of our operations. Multiple services each composed of multiple sub-systems running in data centers around the world mean that the raw data isn't as useful to resellers as it may be for some less complex environments.

In the event of a serious problem we also have an escalation process - once again managed by Marketing - that brings in additional levels of communications and executives. For major issues we also have a "War Room" procedure that is put in place until the issue is resolved.
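The timing rules Ken describes above (a first post once an issue has gone unresolved for roughly 15 minutes, then updates on material changes or every two hours) are simple enough to sketch in a few lines. This is only an illustration of the policy as stated in the interview, not OpenSRS's actual tooling, which Ken describes as largely manual; the function name `should_post` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

# Thresholds taken from the interview; the constant names are made up here.
ESCALATE_AFTER = timedelta(minutes=15)   # unresolved this long -> first post
UPDATE_INTERVAL = timedelta(hours=2)     # maximum gap between updates


def should_post(now, issue_start, last_post, material_change):
    """Decide whether a status update is due for an ongoing issue.

    last_post is None until the first public post has been made.
    """
    if last_post is None:
        # Nothing posted yet: post once the issue has outlived the threshold.
        return now - issue_start >= ESCALATE_AFTER
    # Already public: post on any material change, or when two hours pass.
    return material_change or (now - last_post >= UPDATE_INTERVAL)
```

For example, with an issue that began at 9:00, `should_post` is false at 9:10 and true at 9:15; after that first post, a quiet period only triggers another update once two hours have elapsed.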

Q. What would you say to other organizations that are considering transparency as a strategic initiative?
The days of hiding are over. You now have a choice of whether you want to tell the story or have others misrepresent the story on your behalf. It seems scary to admit you have problems but you gain so much by being open and honest that the stress of taking a new approach to communications is easily outweighed.

Tuesday, July 27, 2010

I'm doing an O'Reilly Webcast this Thursday!

The folks at O'Reilly asked me to do a webcast of my talk, and I was happy to oblige. This talk will be very similar to the one I did at Velocity. I don't think I'll be doing this talk for much longer, so this may be your last chance to hear it live. I'd love to have you there and to hear any feedback you may have about the message. The webcast will begin at 10am PST this coming Thursday, and you can register here.

Tuesday, July 20, 2010

Why Transparency Works

We've talked about the benefits of transparency. We've talked about implementing transparency. We've talked about transparency in action. What we haven't yet talked about is...why the heck does transparency work? Why does transparency make your users happier? Why do customers trust you more when you are transparent? Why do we want to know what's going on? What allows us to be OK with major problems by simply knowing what is going on? My theory is simple: Transparency gives us a sense of control, and control is required for happiness. Allow me to elaborate.

Downtime and learned helplessness

The concept of learned helplessness was developed in the 1960s and 1970s by Martin Seligman at the University of Pennsylvania. He found that animals receiving electric shocks, which they had no ability to prevent or avoid, were unable to act in subsequent situations where avoidance or escape was possible. Extending the ramifications of these findings to humans, Seligman and his colleagues found that human motivation [...] is undermined by a lack of control over one's surroundings. (source)

Learned helplessness was discovered by accident when Seligman was researching Pavlovian conditioning. His experiment was set up to associate a tone with a (harmless) shock, to test whether the animal would learn to run away from just the sound of the tone. In the now famous experiment, one group of dogs was restrained and unable to escape the shock for a period of time (i.e. this group had no control over its situation). Later this group was placed into an area that now allowed them to escape the shock; unexpectedly the dogs stayed put. The shocks continued to come, yet the dogs simply curled up in the corner and whimpered. These dogs exhibited depression, and in a sense gave up on life, because these negative events were seemingly random. Seligman concluded that "the strongest predictor of a depressive response was lack of control over the negative stimulus." What is downtime if not a lack of control over a negative stimulus?

The Cloud and loss of control

Many concerns come up when businesses consider the cloud, but as the survey by IDC below shows the overriding concern is rooted in a loss of control:

You give up a lot of control in exchange for reduced cost, higher efficiency, and increased flexibility. Yet that desire for control persists, and the remaining bits of control you maintain become even more valuable.

Downtime kills our sense of control

Downtime is quite simply a negative event over which you have almost no control. Especially when using SaaS/cloud services your remaining semblance of control vanishes as soon as service goes down and you have no insight into what is going on. We are the dogs trapped in the shock machine, whimpering in the corner.

As I described in my talk, downtime is inevitable. Thanks to things like risk homeostasis, black swan events, unknown unknowns, and our own nature, there is no way to avoid failure. All we can do is prepare for it, and communicate/explain what is going on. And that is the key to keeping us from the fate of a depressed canine. Transparency gives us a sense of control over the uncontrollable.

How transparency gives us back the sense of control

Imagine walking through the park, the sun shining, the birds singing. All of a sudden you notice a strong pain in your arm. Your mind jumps to the worst. Are you having a heart attack, did something just bite you, are you getting older and sicker? Then a split second later you remember...your buddy jokingly punched you earlier in the day! The punch must have been harder than you remember, but it explains the pain. Instantly you feel better. Though the pain is the same, you understand the source. You have an explanation for the pain. Transparency delivers that explanation.

When Amazon goes down, or Gmail isn't loading, users feel pain. Part of that pain comes from the inconvenience of not being able to do what you want to get done, or the lost revenue that comes with downtime. But just as painful is the sense of fatalistic helplessness, especially if someone is breathing down your neck expecting you to fix the problem. Without insight into what is happening with the service, you are completely without control. If on the other hand the service provides an explanation, through a public health dashboard or a status blog or a simple tweet, your fatalistic reaction turns to concrete concern. Your mind goes from assuming the worst (e.g. this service is terrible, they don't know what they are doing, it always fails) to focusing on a real and specific problem (e.g. some hard drive in the datacenter failed, they had some user error, this'll be gone soon). A specific problem is fixable, an unexplained pain is not. Transparency brings the pain down to a specific and knowable problem, while also holding the provider accountable for their issues (which indirectly gives you even more control). Or better said:

Seligman believes it is possible to change people's explanatory styles to replace learned helplessness with "learned optimism." To combat (or even prevent) learned helplessness in both adults and children, he has successfully used techniques similar to those used in cognitive therapy with persons suffering from depression. These include identifying negative interpretations of events, evaluating their accuracy, generating more accurate interpretations, and decatastrophizing (countering the tendency to imagine the worst possible consequences for an event). (source)

By providing a sense of control, transparency is one of the keys to keeping us happy, productive, and sane in an increasingly uncontrollable world.
