Wednesday, September 29, 2010

Case Study: Facebook outage

I'm a bit late to the story (something called a day job getting in the way!) but I can't pass up an opportunity to discuss how Facebook handled the "worst outage [they've] had in over four years".  I blogged about the intra-incident communication the day they had the outage, so let's review the postmortem that came out after they had recovered, and how they handled the downtime as a whole.

[Figure: the "Upside of Downtime" framework]

Using the "Upside of Downtime" framework (above) as a guide:
  1. Prepare: Much room for improvement. The health status feed is hard for the average user or developer to find, and the information it carried was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again with limited detail.
  2. Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels they had in place (status feed, Twitter) more effectively by communicating throughout the incident with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity. That is exactly what you want to strive to avoid.
  3. Explain: Let's find out by running the postmortem through our guideline for postmortem communication...


Prerequisites:
  1. Admit failure - Excellent, almost a textbook admission of failure, with no hedging or blaming.
  2. Sound like a human - Well done. Posted from the personal account of Robert Johnson, Facebook's Director of Engineering, the tone and style were personal and effective.
  3. Have a communication channel - Can be improved greatly. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.
  4. Above all else, be authentic - No issues here.
Requirements:
  1. Start time and end time of the incident - Missing.
  2. Who/what was impacted - Partial. I can understand this being difficult in Facebook's case, but I would have liked to see more specifics about how many users were affected. On one hand, this is a global consumer service that may not be critical to people's lives. On the other hand, if you treat your users with respect, they'll reward you for it.
  3. What went wrong - Well done, maybe the best part of the postmortem.
  4. Lessons learned - Partial. It sounds like many lessons were certainly learned, but they weren't directly shared. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like (one common such pattern is sketched after these lists).
Bonus:
  1. Details on the technologies involved - No
  2. Answers to the Five Whys - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - No
  4. What others can learn from this experience - Marginal
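
An aside on that "feedback loops and transient spikes" point: one widely used pattern for keeping a flood of failing clients from hammering a struggling backend is retry with exponential backoff and jitter. The sketch below is purely illustrative (the function and parameter names are mine, not Facebook's), but it shows the general shape of the idea.

// Illustrative only: retry with exponential backoff and full jitter, a common
// way to dampen the client side of a feedback loop. Names are hypothetical,
// not Facebook's actual code.
async function fetchWithBackoff<T>(
  fetchFn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fetchFn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      // Full jitter: each client waits a random time up to an exponentially
      // growing cap, so clients that failed together do not retry together.
      const capMs = baseDelayMs * 2 ** attempt;
      const delayMs = Math.random() * capMs;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

The point is less the specific code than the behavior: when the backend is overloaded, clients spread out and slow down their retries instead of amplifying the spike.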


Biggest lesson for us to take away: preparation is key to successfully managing outages and to using them to build trust with your users.

Transparency in action at Twitter

[Embedded tweet]

Enjoyed that tweet from the other day. As you may know, Twitter ran into a very public cross-site scripting (XSS) vulnerability recently:
"The short story: This morning at 2:54 am PDT Twitter was notified of a security exploit that surfaced about a half hour before that, and we immediately went to work on fixing it. By 7:00 am PDT, the primary issue was solved. And, by 9:15 am PDT, a more minor but related issue tied to hovercards was also fixed."
News of the vulnerability exploded, but very quickly Twitter came out with a fix and, just as importantly, a detailed explanation of what happened, what they did about it, and where they are going from here:
 The security exploit that caused problems this morning Pacific time was caused by cross-site scripting (XSS). Cross-site scripting is the practice of placing code from an untrusted website into another one. In this case, users submitted javascript code as plain text into a Tweet that could be executed in the browser of another user. 
We discovered and patched this issue last month. However, a recent site update (unrelated to new Twitter) unknowingly resurfaced it. 
Early this morning, a user noticed the security hole and took advantage of it on Twitter.com. First, someone created an account that exploited the issue by turning tweets different colors and causing a pop-up box with text to appear when someone hovered over the link in the Tweet. This is why folks are referring to this as an “onMouseOver” flaw -- the exploit occurred when someone moused over a link. 
Other users took this one step further and added code that caused people to retweet the original Tweet without their knowledge. 
This exploit affected Twitter.com and did not impact our mobile web site or our mobile applications. The vast majority of exploits related to this incident fell under the prank or promotional categories. Users may still see strange retweets in their timelines caused by the exploit. However, we are not aware of any issues related to it that would cause harm to computers or their accounts. And, there is no need to change passwords because user account information was not compromised through this exploit.
We’re not only focused on quickly resolving exploits when they surface but also on identifying possible vulnerabilities beforehand. This issue is now resolved. We apologize to those who may have encountered it.
Well done.
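
For anyone curious what this class of bug looks like in practice, here is a minimal, hypothetical sketch (not Twitter's actual code) of how an attribute-injection XSS payload works and how escaping output neutralizes it:

// Hypothetical illustration of the exploit class described above.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// A tweet crafted so that, if rendered unescaped inside an href attribute,
// the quote breaks out and attaches an onmouseover handler.
const maliciousTweet = 'http://example.com/" onmouseover="alert(document.cookie)';

// Unsafe rendering: the payload becomes live markup.
const unsafeHtml = `<a href="${maliciousTweet}">link</a>`;

// Safe rendering: the payload stays inert text.
const safeHtml = `<a href="${escapeHtml(maliciousTweet)}">link</a>`;

Escaping on every output path (or, better, a templating layer that escapes by default) is what keeps a previously patched hole from resurfacing when an unrelated site update changes how a field is rendered.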