Wednesday, June 15, 2011

Passing the Transparency Torch

The torch has been passed! The following is re-posted from a new blog Transparent Performance:
"Almost three years ago, I started blogging about the importance of transparency at my blog Transparent Uptime. I chronicled the benefits of being transparent in your company's handling of downtime and performance. I did case studies on transparency done right, and transparency done wrong. I spoke at conferences preaching the gospel of performance. What was a strange idea back then is now becoming obvious, especially to young companies. Just as things were ramping up, I was forced to put my blog on hiatus.
In the time since, I’ve been looking for someone to hand off the torch of transparency to. I’m really excited to be a part of the launch of Transparent Performance, a new online consortium that will bring together experts from around the industry to continue the transparency movement. Transparency isn’t about altruism, or being “good”. Transparency is a good business decision. My hope is that Transparent Performance can help the industry cross the chasm, and make it both obvious and trivial to be transparent in your uptime and performance. If anyone can do it, these guys can.
And if you’re interested in becoming a contributing editor too, do drop them a line or comment below."
Everyone who still follows this blog, or stumbles across this post, do yourself a favor and go to Transparent Performance.

Wednesday, November 10, 2010

All good things...must come to an end

After nearly two and a half years, over one hundred posts, a presentation at Velocity 2010, a quote in the Wall Street Journal, an O'Reilly webinar, and the immeasurable friendships, connections, and opportunities that have come as a result of this blog, I am (sadly) putting the blog on indefinite hiatus. As of next week, I will be leaving my job (of nearly 10 years) and pursuing my dream of starting my own company. In that new world I do not foresee having the time this blog deserves, and so to avoid leaving it in a state of perpetual uncertainty, this will be my final post.

It saddens me to bring this site to an end. I've gotten more out of it than I could have ever hoped. Reading over my first post (as painful as that is), I am happy to see that I have met the goals I set out for myself. Things have come a long way since those days, but there is still far more to do. My biggest hope is that you as a reader have gained some nugget of useful knowledge out of my writings, and that you continue to push forward on the basic ideas of transparency, openness, and simply helping your company act more human.

In regards to my startup, I don't have a lot of details to share just yet, but if you are interested in staying up to date please follow me on Twitter or LinkedIn. I can also give you my personal email address if you would like to contact me for any reason. All I can say at this point is that I will be moving to the one and only city of Montreal to work with the wonderful folks at Year One Labs. Mysterious, eh?

Below is a list of my favorite (and most popular) posts from the past 2+ years:

Note: If you are doing anything similar that you think readers of this blog would find useful, please let me know in the comments and I'll update this post.

Signing off,
Lenny Rachitsky (@lennysan, LinkedIn)

Friday, October 8, 2010

Etsy opens the kimono and talks frankly about outages

You know when John Allspaw (VP of Ops at Etsy, Manager of Operations at Flickr, Infrastructure Architect at Friendster) is involved, you're going to get a unique perspective on things. A few weeks ago Etsy was down. John and his operations department decided it would be a good opportunity to take what I'll call an "outage bankruptcy" and basically reset expectations. In an extremely detailed and well thought out post (titled "Frank Talk about Etsy.com Site Outages") he goes on to describe the entire end-to-end process that goes into managing uptime at Etsy. I recommend reading the entire post, but I thought it would be useful to point out the things that we can all take away from the experience of one of the most well respected operations people in the industry:

"Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers. Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 metrics will alert someone on our operations staff (we have an on-call rotation) to wake up in the middle of the night to fix a problem."
Takeaway: Capture data on every part of your infrastructure, and later decide which metrics are leading indicators of problems. He goes on to talk about the importance of external monitoring (outside of your firewall) to measure the actual end-user experience.
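The alerting half of that takeaway can be sketched in a few lines: collect everything, then define thresholds only for the metrics that have proven to be leading indicators. The metric names and limits below are hypothetical examples, not Etsy's actual configuration.

```python
# Minimal sketch of threshold-based alerting over collected metrics.
# Metric names and thresholds are invented for illustration.

THRESHOLDS = {
    "cpu_usage_pct": 90.0,       # page on-call if CPU usage exceeds 90%
    "error_rate_per_min": 5.0,   # page on-call if errors exceed 5/minute
}

def check_metrics(samples):
    """Return the metrics that should wake someone up.

    `samples` maps metric name -> latest value. Metrics without a
    configured threshold are collected but never alert, mirroring the
    "gather 30,000 metrics, alert on ~2,000" split Etsy describes.
    """
    alerts = []
    for name, value in samples.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            alerts.append((name, value, limit))
    return alerts
```

The point of the split is that collection is cheap and retroactively useful, while every alert has a human cost, so the alerting set should stay small and deliberate.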
"When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post status updates to alert the community. Changes that are made to mitigate the outage are largely done in a one-at-a-time fashion, and we track both our time-to-detect as well as our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, our average time-to-detect is on the order of 2 minutes for any outages or major site issues in the past year. This is mostly due to continually tuning our alerting system."
Takeaway: Two important points here. First, communication and collaboration are key to successfully managing issues. Second, and even more interesting, is the split of roles: some people address the problem while others communicate status updates both internally and externally. This is often the missing piece for companies, where no updates go out because everyone is busy fixing the problem.
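Tracking time-to-detect and time-to-resolve, as Etsy describes, only requires recording three timestamps per incident. A minimal sketch of the arithmetic (field names and timestamps are my own invention):

```python
from datetime import datetime

def incident_durations(started, detected, resolved):
    """Compute time-to-detect and time-to-resolve, in minutes.

    Arguments are ISO-8601 timestamp strings, e.g. "2010-10-08T14:00:00".
    Both durations are measured from when the incident actually began,
    so a slow detection inflates time-to-resolve as well.
    """
    t0 = datetime.fromisoformat(started)
    t1 = datetime.fromisoformat(detected)
    t2 = datetime.fromisoformat(resolved)
    ttd = (t1 - t0).total_seconds() / 60
    ttr = (t2 - t0).total_seconds() / 60
    return ttd, ttr
```

Averaging these numbers across incidents is what lets a team make claims like Etsy's "average time-to-detect on the order of 2 minutes", and what shows whether alert tuning is actually paying off.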
"After any outage, we meet to gather information about the incident. We reconstruct the time-line of events; when we knew of the outage, what we did to fix it, when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress."
Takeaway: Fixing the problem and getting back online is not enough. Make it an automatic habit to schedule a postmortem to do a deep dive into the root cause(s) of the problem, and address not only the immediate bugs but also the deeper issues that led to the root cause. The Five Whys can help here, as can the Lean methodology of investing a proportional number of hours into the most problematic parts of the infrastructure.
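The Five Whys mentioned above is just a discipline of asking "why?" repeatedly until you reach a cause worth fixing. A toy illustration, with an entirely invented outage chain:

```python
# Toy Five Whys walk: record each answer to "why?", and treat the
# deepest answer reached as the root-cause candidate to remediate.
# This outage chain is invented for illustration.
whys = [
    "Site returned 500 errors",                 # what happened
    "App servers lost their DB connections",    # why #1
    "The database failed over to a replica",    # why #2
    "The replica was hours behind the master",  # why #3
    "Replication lag had no monitoring alert",  # why #4: root cause
]

def root_cause(chain):
    """The last answer in the chain is the deepest cause found."""
    return chain[-1]
```

Note how the root cause here is a process gap (a missing alert), not the immediate bug; remediating only the first item in the chain would leave the outage free to recur.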
Single Point of Failure Reduction
"As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason we have for this evolution is to avoid depending on single pieces of hardware to be up and running all of the time. Servers can fail at any time, and the site should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep them in sync, and make sure our code can route around any individual failures.
So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality."
Takeaway: Reduce single points of failure incrementally. Do what you can in the time you have.
Change Management and Risk
"For every type of technical change, we have answers to questions like:
  • What problem does the change solve?
  • Has this kind of change happened before? Is there a successful history?
  • When is the change going to start? When is it expected to end?
  • What is the expected effect of this change on the Etsy community? Is a downtime required for the change?
  • What is the rollback plan, if something goes wrong?
  • What test is needed to make sure that the change succeeded?
As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.
Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping."
Takeaway: Be prepared for failure by anticipating worst-case scenarios for every change. Be ready to roll back and respond. Just as importantly, track when things go right, so that you have a realistic measure of risk.
Other takeaways:
  • Declaring "outage bankruptcy" is not the ideal approach. But it is better than simply going along without any authentic communication with your customers throughout a period of instability. Your customers will understand, if you act human.
  • Etsy has been doing a great job keeping customers up to date on their status blog.
  • A glance at the comments on the page shows a few upset customers, but a generally positive response.

Wednesday, October 6, 2010

Foursquare gets transparency

Early Monday morning of this week, Foursquare went down hard.

11 hours later, the #caseofthemondays was over and they were back online. Throughout those 11 hours, users had one of the following experiences:

1. When visiting the website, they saw an error page.

2. When using the iPhone/Android/Blackberry app, they saw an error telling them the service is down and to try again later.

3. When checking Twitter (the de facto source of downtime information), they saw a lot of people complaining, along with a few updates from the official @foursquare account (if they thought to check it).

Those were the only options available to a Foursquare user for those 11 hours. An important question we need to answer is whether anyone seriously cared. Are users of consumer services like Foursquare legitimately concerned with Foursquare's downtime? Are they going to leave for competing services, or just quit the whole check-in game? I'd like to believe that 11 hours of downtime matters, but honestly it's too early to tell. This will be a great test of the stickiness and Whuffie that Foursquare has built up.

The way I see it, this is one strike against Foursquare (which includes the continued instability they've seen since Monday). They probably won't see a significant impact to their user base. However, if this happens again, and again, and again, the story changes. And as I've argued, downtime is inevitable. Foursquare will certainly go down again. The key is not reducing downtime to zero, but handling that downtime in a way that avoids giving your competition an opening and, even more importantly, uses it to build trust and loyalty with your users. How do you accomplish this? Transparency.

We've talked about the benefits of transparency, why transparency works, and how to implement it. We saw above how Foursquare handled the pre- and intra-downtime steps (not well), so let's take a look at how they did in the post-downtime phase by reviewing the public postmortem (both of them) they published. As always, let's run it through the gauntlet:

  1. Admit failure - Excellent. The entire first paragraph describes the downtime, and how painful it was to users.
  2. Sound like a human - Very much. This has never been a problem for Foursquare. The tone is very trustworthy.
  3. Have a communication channel - Prior to the event, all they had were their Twitter accounts and their API developer forums. As a result of this incident, they have since launched a status blog, and have promised to update @4sqsupport on a regular basis throughout incidents.
  4. Above all else, be authentic - This may be the biggest thing going for them. 
  1. Start time and end time of the incident - Missing. All we know is that they were down for 11 hours. I don't see this as being critical in this case, but it would have been nice to have.
  2. Who/what was impacted - A bit vague, but the impression was that everyone was impacted.
  3. What went wrong - Extremely well done. I feel very informed, and can sympathize with the situation.
  4. Lessons learned - Again, extremely well done. I love the structure they used: What happened, What we’ll be doing differently – technically speaking, What we’re doing differently – in terms of process. Very effective.
  1. Details on the technologies involved - Yes!
  2. Answers to the Five Why's - No :(
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Yes!
  4. What others can learn from this experience - Yes!

Other takeaways:

  • Foursquare launched a public health status feed! Check it out.
  • I really like the structure used in this postmortem. It has inspired me to want to create a basic template for postmortems. Stay tuned...
  • Could this be Foursquare's Friendster moment? I hope not. My personal project relies completely on Foursquare.
  • I've come to realize that in most cases, downtime is less impactful to the long-term success of a business than site performance. Downtime users understand; they just try again later. Slowness eats away at you: you start to hate using the service, and you jump at any opportunity to use something more fun/fast/pleasant.
Going forward, the big question will be whether Foursquare maintains their new processes, keeps the status blog up to date, and can fix their scalability issues. I for one am rooting for them.

Wednesday, September 29, 2010

Case Study: Facebook outage

I'm a bit late to the story (something called a day job getting in the way!) but I can't pass up an opportunity to discuss how Facebook handled the "worst outage [they've] had in over four years".  I blogged about the intra-incident communication the day they had the outage, so let's review the postmortem that came out after they had recovered, and how they handled the downtime as a whole.

Using the "Upside of Downtime" framework as a guide:
  1. Prepare: Much room for improvement. The health status feed is hard to find for the average user/developer, and the information was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again the information was limited.
  2. Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels they had in place (status feed, Twitter) more effectively by communicating throughout the incident with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity. That is exactly what you want to strive to avoid.
  3. Explain: Let's find out by running the postmortem through our guideline for postmortem communication...

  1. Admit failure - Excellent, almost a textbook admittance without hedging or blaming.
  2. Sound like a human - Well done. Posted from Facebook Director of Engineering Robert Johnson's personal account, the tone and style were personal and effective.
  3. Have a communication channel - Can be improved greatly. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.
  4. Above all else, be authentic - No issues here.
  1. Start time and end time of the incident - Missing.
  2. Who/what was impacted - Partial. I can understand this being difficult in the case of Facebook, but I would have liked to see more specifics around how many users were affected. On one hand this is a global consumer service that may not be critical to people's lives. On the other hand, if you treat your users with respect, they'll reward you for it.
  3. What went wrong - Well done, maybe the best part of the postmortem.
  4. Lessons learned - Partial. It sounds like many lessons were certainly learned, but they weren't directly shared. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like.
  1. Details on the technologies involved - No
  2. Answers to the Five Why's - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - No
  4. What others can learn from this experience - Marginal

Biggest lesson for us to take away: Preparation is key to successfully managing outages, and using them to build trust with your users.

Transparency in action at Twitter

Enjoyed this one from the other day. As you may know, Twitter ran into a very public cross-site scripting (XSS) vulnerability recently:
"The short story: This morning at 2:54 am PDT Twitter was notified of a security exploit that surfaced about a half hour before that, and we immediately went to work on fixing it. By 7:00 am PDT, the primary issue was solved. And, by 9:15 am PDT, a more minor but related issue tied to hovercards was also fixed."
News of the vulnerability exploded, but very quickly Twitter came out with a fix and, just as importantly, a detailed explanation of what happened, what they did about it, and where they are going from here:
 The security exploit that caused problems this morning Pacific time was caused by cross-site scripting (XSS). Cross-site scripting is the practice of placing code from an untrusted website into another one. In this case, users submitted javascript code as plain text into a Tweet that could be executed in the browser of another user. 
We discovered and patched this issue last month. However, a recent site update (unrelated to new Twitter) unknowingly resurfaced it. 
Early this morning, a user noticed the security hole and took advantage of it. First, someone created an account that exploited the issue by turning tweets different colors and causing a pop-up box with text to appear when someone hovered over the link in the Tweet. This is why folks are referring to this as an “onMouseOver” flaw -- the exploit occurred when someone moused over a link.
Other users took this one step further and added code that caused people to retweet the original Tweet without their knowledge. 
This exploit affected the main website and did not impact our mobile web site or our mobile applications. The vast majority of exploits related to this incident fell under the prank or promotional categories. Users may still see strange retweets in their timelines caused by the exploit. However, we are not aware of any issues related to it that would cause harm to computers or their accounts. And, there is no need to change passwords because user account information was not compromised through this exploit.
We’re not only focused on quickly resolving exploits when they surface but also on identifying possible vulnerabilities beforehand. This issue is now resolved. We apologize to those who may have encountered it.
Well done.

Thursday, September 23, 2010

Facebook downtime

Facebook has been experiencing some major downtime today in various locations around the world:

"After issues at a third-party networking provider took down Facebook for some users on Wednesday, the social networking site is once again struggling to stay online.
The company reports latency issues with its API on its developer site, but the problem is clearly broader than that, with thousands of users tweeting about the outage.
On our end, when we attempt to access Facebook, we’re seeing the message: “Internal Server Error – The server encountered an internal error or misconfiguration and was unable to complete your request.” Facebook “Like” buttons also appear to be down on our site and across the Web."
Details are still sketchy (there's speculation Akamai is at fault). And that's the problem. It's almost all speculation right now. The official word from Facebook is simply:
"We are currently experiencing latency issues with the API, and we are actively investigating. We will provide an update when either the issue is resolved or we have an ETA for resolution."
That's not going to cut it when you have 500+ million users, and countless developers (Zynga must be freaking out right now). I'm seeing about 400 tweets/second complaining about the downtime. Outages will happen. The problem isn't the downtime itself. Where Facebook is missing the boat is in failing to use this opportunity to build increased trust with their user and developer community by simply opening up the curtains a bit and telling us something useful. I've seen some movement from Facebook on this front before. But there's much more they can do, and I'm hoping this experience pushes them in the right direction. Give us back a sense of control and we'll be happy.
P.S. You can watch for updates here, here, and here.