Wednesday, September 29, 2010

Case Study: Facebook outage

I'm a bit late to the story (something called a day job getting in the way!) but I can't pass up an opportunity to discuss how Facebook handled the "worst outage [they've] had in over four years".  I blogged about the intra-incident communication the day they had the outage, so let's review the postmortem that came out after they had recovered, and how they handled the downtime as a whole.






[Image: the "Upside of Downtime" framework diagram]

Using the "Upside of Downtime" framework (above) as a guide:
  1. Prepare: Much room for improvement. The health status feed is hard to find for the average user/developer, and the information was limited. On the plus side, it exists. Twitter was also used to communicate updates, but again the information was limited.
  2. Communicate: Without a strong foundation created by the Prepare step, you don't have much opportunity to excel at the Communicate step. There was an opportunity to use the basic communication channels they had in place (status feed, Twitter) more effectively by communicating throughout the incident with more actionable information, but alas this was not the case. Instead, there was mass speculation about the root cause and the severity. That is exactly what you want to strive to avoid.
  3. Explain: Let's find out by running the postmortem through our guideline for postmortem communication...


Prerequisites:
  1. Admit failure - Excellent, almost a textbook admission without hedging or blaming.
  2. Sound like a human - Well done. Posted from the personal account of Robert Johnson, Facebook's Director of Engineering, the tone and style were personal and effective.
  3. Have a communication channel - Can be improved greatly. Making the existing health status page easier to find, more public, and more useful would help in all future incidents. I've covered how Facebook can improve this page in a previous post.
  4. Above all else, be authentic - No issues here.
Requirements:
  1. Start time and end time of the incident - Missing.
  2. Who/what was impacted - Partial. I can understand this being difficult in the case of Facebook, but I would have liked to see more specifics around how many users were affected. On one hand, this is a global consumer service that may not be critical to people's lives. On the other hand, if you treat your users with respect, they'll reward you for it.
  3. What went wrong - Well done, maybe the best part of the postmortem.
  4. Lessons learned - Partial. It sounds like many lessons were certainly learned, but they weren't directly shared. I'd love to know what the "design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes" look like (one common pattern is sketched below, after this checklist).
Bonus:
  1. Details on the technologies involved - No
  2. Answers to the Five Whys - No
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - No
  4. What others can learn from this experience - Marginal
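On that lessons-learned point: one well-known way to deal more gracefully with feedback loops and transient spikes is to have clients back off exponentially, with jitter, instead of retrying in lockstep against a struggling service. Purely as an illustrative sketch of that pattern (this is not anything from Facebook's postmortem):

    import random
    import time

    def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Call fetch(), retrying failures with exponential backoff plus jitter.

        The randomized sleep keeps thousands of clients from retrying in
        lockstep, which is the feedback loop that turns a transient spike
        into a sustained outage.
        """
        for attempt in range(max_attempts):
            try:
                return fetch()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Exponential backoff capped at max_delay, with full jitter.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))

The specific numbers don't matter; what matters is that clients stop amplifying the problem the moment the service starts to struggle.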


Biggest lesson for us to take away: Preparation is key to successfully managing outages and to using them to build trust with your users.

Transparency in action at Twitter



Enjoyed that tweet from the other day. As you may know, Twitter ran into a very public cross-site scripting (XSS) vulnerability recently:
"The short story: This morning at 2:54 am PDT Twitter was notified of a security exploit that surfaced about a half hour before that, and we immediately went to work on fixing it. By 7:00 am PDT, the primary issue was solved. And, by 9:15 am PDT, a more minor but related issue tied to hovercards was also fixed."
News of the vulnerability exploded, but very quickly Twitter came out with a fix and, just as importantly, a detailed explanation of what happened, what they did about it, and where they are going from here:
 The security exploit that caused problems this morning Pacific time was caused by cross-site scripting (XSS). Cross-site scripting is the practice of placing code from an untrusted website into another one. In this case, users submitted javascript code as plain text into a Tweet that could be executed in the browser of another user. 
We discovered and patched this issue last month. However, a recent site update (unrelated to new Twitter) unknowingly resurfaced it. 
Early this morning, a user noticed the security hole and took advantage of it on Twitter.com. First, someone created an account that exploited the issue by turning tweets different colors and causing a pop-up box with text to appear when someone hovered over the link in the Tweet. This is why folks are referring to this as an “onMouseOver” flaw -- the exploit occurred when someone moused over a link. 
Other users took this one step further and added code that caused people to retweet the original Tweet without their knowledge. 
This exploit affected Twitter.com and did not impact our mobile web site or our mobile applications. The vast majority of exploits related to this incident fell under the prank or promotional categories. Users may still see strange retweets in their timelines caused by the exploit. However, we are not aware of any issues related to it that would cause harm to computers or their accounts. And, there is no need to change passwords because user account information was not compromised through this exploit.
We’re not only focused on quickly resolving exploits when they surface but also on identifying possible vulnerabilities beforehand. This issue is now resolved. We apologize to those who may have encountered it.
Well done.
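Incidentally, the class of bug Twitter describes comes down to dropping user-supplied text into markup without escaping it; the onMouseOver trick worked by breaking out of an HTML attribute. The fix is unglamorous: escape everything from the user before it touches the page. A minimal, hypothetical sketch of the difference (this is not Twitter's actual rendering code, and the payload is made up):

    from html import escape

    def render_link_unsafe(text):
        # Vulnerable: user text lands inside the href attribute, so a value
        # like the payload below closes the attribute and adds its own
        # onmouseover handler.
        return '<a href="' + text + '">' + text + '</a>'

    def render_link_safe(text):
        # Escaping quotes and angle brackets keeps the user text inert.
        return '<a href="' + escape(text, quote=True) + '">' + escape(text) + '</a>'

    payload = 'http://example.com/x" onmouseover="alert(document.cookie)'
    print(render_link_unsafe(payload))  # handler fires when someone hovers the link
    print(render_link_safe(payload))    # rendered as a harmless, odd-looking URL

The "we discovered and patched this issue last month" line is the other lesson: escaping needs to live in one well-tested place, or a later site update can quietly reintroduce the hole.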

Thursday, September 23, 2010

Facebook downtime

Facebook has been experiencing some major downtime today in various locations around the world:

"After issues at a third-party networking provider took down Facebook for some users on Wednesday, the social networking site is once again struggling to stay online.
The company reports latency issues with its API on its developer site, but the problem is clearly broader than that with thousands of users tweeting about the outage.
On our end when we attempt to access Facebook, we’re seeing the message: “Internal Server Error – The server encountered an internal error or misconfiguration and was unable to complete your request.” Facebook “Like” buttons also appear to be down on our site and across the Web."
Details are still sketchy (there's speculation Akamai is at fault). And that's the problem. It's almost all speculation right now. The official word from Facebook is simply:
"We are currently experiencing latency issues with the API, and we are actively investigating. We will provide an update when either the issue is resolved or we have an ETA for resolution."
That's not going to cut it when you have 500+ million users and countless developers (Zynga must be freaking out right now). I'm seeing about 400 tweets/second complaining about the downtime. Outages will happen. The problem isn't the downtime itself. Where Facebook is missing the boat is in not using this opportunity to build increased trust with their user and developer community by simply opening up the curtains a bit and telling us something useful. I've seen some movement from Facebook on this front before. But there's much more they can do, and I'm hoping this experience pushes them in the right direction. Give us back a sense of control and we'll be happy.
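For what it's worth, "telling us something useful" doesn't mean publishing a root cause mid-incident. Even a small, machine-readable status entry that developers could poll would beat a one-line "we are investigating". A hypothetical sketch of what one might contain (these field names are mine, not Facebook's):

    import json
    from datetime import datetime, timedelta, timezone

    # Illustrative status entry a platform could publish during an incident.
    now = datetime.now(timezone.utc)
    status_update = {
        "incident_id": "api-latency-2010-09-23",
        "status": "investigating",  # investigating | identified | monitoring | resolved
        "affected": ["API", "Like buttons", "facebook.com"],
        "user_impact": "Errors and timeouts for a subset of users and developers",
        "posted_at": now.isoformat(),
        "next_update_by": (now + timedelta(minutes=30)).isoformat(),
    }

    print(json.dumps(status_update, indent=2))

A committed "next update by" time alone would give back much of that sense of control.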
P.S. You can watch for updates here, here, and here.

Wednesday, September 22, 2010

BP portraying Deepwater Horizon explosion as a "Normal Accident"...unknowingly calls for end of drilling

While reading last week's issue of Time magazine, I came across this explanation of BP's pitch attempting to explain the recent accident in the Gulf:
"Following a four-month investigation, BP released a report Sept. 8 that tried to divert blame from itself to other companies -- including contractors like Transocean -- for the April 20 explosion that sank the Deepwater Horizon rig, killing 11 people and resulting in the worst oil spill in U.S. history. A team of investigators cited 'a complex and interlinked series of mechanical failures, human judgement' and 'engineering design' as the ultimate cause of the accident."
Though to some it may come off as a naive "it's not our fault" strategy, the reality (and consequence) is a lot more interesting. I've spoken before about the concept of a "Normal Accident", but let's define it again:
Normal Accident Theory: When a technology has become sufficiently complex and tightly coupled, accidents are inevitable and therefore in a sense 'normal'. 
Accidents such as Three Mile Island and a number of others all began with a mechanical or other technical mishap and then spun out of control through a series of technical cause-effect chains because the operators involved could not stop the cascade or unwittingly did things that made it worse. Apparently trivial errors suddenly cascade through the system in unpredictable ways and cause disastrous results.
What BP is saying is that their systems are so "complex and interlinked" that they were unable to avert the disaster. In a sense, they are arguing that disaster was inevitable. If "Normal Accident Theory" can be believed, BP is indirectly suggesting deep water oil drilling should be abandoned:
"This way of analysing technology has normative consequences: If potentially disastrous technologies, such as nuclear power or biotechnology, cannot be made entirely 'disaster proof', we must consider abandoning them altogether. 
Charles Perrow, the author of Normal Accident Theory, came to the conclusion that "some technologies, such as nuclear power, should simply be abandoned because they are not worth the risk".
Where do I sign?

Thursday, September 16, 2010

Chase.com goes down due to third party DB issues, apologizes...eventually

From Data Center Knowledge:
"The Chase.com online banking portal is back online and processing customer bill payments that were delayed during lengthy outages Tuesday and Wednesday, the company said this morning.
The Chase web site crashed Monday evening when a third party vendor’s database software corrupted the log-in process, the bank told the Wall Street Journal. Chase said no customer data was at risk and that its telephone banking and ATMs functioned as usual throughout the outage."
Unfortunately there was no communication during the event; Chase finally got a message out to customers who visited the website four days after the first outage:



The "we're sorry" message is well done, but overall...not good.

Monday, September 13, 2010

Domino's using transparency as a competitive advantage

From the NY Times:
Domino’s Pizza is extending its campaign that promises customers transparency along with tasty, value-priced pizza.
The campaign, by Crispin Porter & Bogusky, part of MDC Partners, began with a reformulation of pizza recipes and continued recently with a pledge to show actual products in advertising rather than enhanced versions lovingly tended to by professional food artists.
The vow to be more real was accompanied by a request to send Domino’s photographs of the company’s pizzas as they arrive at customers’ homes. A Web site, showusyourpizza.com, was set up to receive the photos.
A commercial scheduled to begin running on Monday will feature Patrick Doyle, the chief executive of Domino’s, pointing to one of the photographs that was uploaded to the Web site. The photo shows a miserable mess of a delivered pizza; the toppings and a lot of the cheese are stuck to the inside of the box.
“This is not acceptable,” Mr. Doyle says in the spot, addressing someone he identifies as “Bryce in Minnesota.”
“You shouldn’t have to get this from Domino’s,” Mr. Doyle continues. “We’re better than this.” He goes on to say that such subpar pizza “really gets me upset” and promises: “We’re going to learn; we’re going to get better. I guarantee it.”