Wednesday, September 29, 2010

Case Study: Facebook outage

I'm a bit late to the story (something called a day job getting in the way!) but I can't pass up an opportunity to discuss how Facebook handled the "worst outage [they've] had in over four years".  I blogged about the intra-incident communication the day they had the outage, so let's review the postmortem that came out after they had recovered, and how they handled the downtime as a whole.






Using the "Upside of Downtime" framework (above) as a guide:
  1. Prepare: Much room for improvement. The health status feed is hard to find for the average user/developer, and the information was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again the information was limited.
  2. Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels they had in place (status feed, Twitter) more effectively by communicating throughout the incident with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity, which is exactly what you want to avoid.
  3. Explain: Let's find out by running the postmortem through our guideline for postmortem communication...


Prerequisites:
  1. Admit failure - Excellent, almost a textbook admission without hedging or blaming.
  2. Sound like a human - Well done. The post came from the personal account of Robert Johnson, Director of Engineering at Facebook, and its tone and style were personal and effective.
  3. Have a communication channel - Can be improved greatly. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.
  4. Above all else, be authentic - No issues here.
Requirements:
  1. Start time and end time of the incident - Missing.
  2. Who/what was impacted - Partial. I can understand this being difficult in the case of Facebook, but I would have liked to see more specifics around how many users were affected. On one hand, this is a global consumer service that may not be critical to people's lives. On the other hand, if you treat your users with respect, they'll reward you for it.
  3. What went wrong - Well done, maybe the best part of the postmortem.
  4. Lessons learned - Partial. It sounds like many lessons were certainly learned, but they weren't directly shared. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like.
Bonus:
  1. Details on the technologies involved - No
  2. Answers to the Five Whys - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - No
  4. What others can learn from this experience - Marginal
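
For anyone who wants to reuse the guideline on their own postmortems, the scorecard above can be written down as a small checklist. This is only a rough sketch: the verdicts are copied from this review, but the `Item` class, the `MET` set (which verdicts count as "met"), and the `summarize` and `gaps` helpers are names and assumptions made up for illustration, not part of the framework itself.

```python
from dataclasses import dataclass

# Verdicts that this sketch counts as "met". This mapping is an assumption
# made for illustration; the review does not define a scoring scheme.
MET = {"Excellent", "Well done", "No issues here"}


@dataclass(frozen=True)
class Item:
    section: str   # "Prerequisites", "Requirements", or "Bonus"
    name: str      # the guideline item
    verdict: str   # the verdict given in the review


# Verdicts transcribed from the review of the Facebook postmortem.
REVIEW = [
    Item("Prerequisites", "Admit failure", "Excellent"),
    Item("Prerequisites", "Sound like a human", "Well done"),
    Item("Prerequisites", "Have a communication channel", "Can be improved greatly"),
    Item("Prerequisites", "Above all else, be authentic", "No issues here"),
    Item("Requirements", "Start time and end time of the incident", "Missing"),
    Item("Requirements", "Who/what was impacted", "Partial"),
    Item("Requirements", "What went wrong", "Well done"),
    Item("Requirements", "Lessons learned", "Partial"),
    Item("Bonus", "Details on the technologies involved", "No"),
    Item("Bonus", "Answers to the Five Whys", "No"),
    Item("Bonus", "Human elements", "No"),
    Item("Bonus", "What others can learn from this experience", "Marginal"),
]


def summarize(items):
    """Return {section: (met, total)}, keeping sections in first-seen order."""
    totals = {}
    for item in items:
        met, total = totals.get(item.section, (0, 0))
        totals[item.section] = (met + (item.verdict in MET), total + 1)
    return totals


def gaps(items):
    """Return the names of items whose verdict is not counted as met."""
    return [item.name for item in items if item.verdict not in MET]


if __name__ == "__main__":
    for section, (met, total) in summarize(REVIEW).items():
        print(f"{section}: {met}/{total} met")
    print("Needs work:", ", ".join(gaps(REVIEW)))
```

Filling in the same list for your own postmortem before publishing makes it easy to spot which required items are still missing.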


Biggest lesson for us to take away: Preparation is key to successfully managing outages, and using them to build trust with your users.
