Thursday, August 12, 2010

Downtime, downtime, downtime - DNS Made Easy, Posterous, Evernote

It's been a busy week on the interwebs. Either downtime incidents are becoming more common, or I'm just finding out about more of them. One nice thing about this blog is that readers send me downtime events that they come across. I don't know if I want to be the first person that people think of when they see downtime, but I'll take it. In the spirit of this blog, let's take a look at the recent downtime events to see what they did right, what they can improve, and what we can all learn from their experience.


DNS Made Easy
On Saturday, August 7th, DNS Made Easy was the target of a massive DDoS attack:
"The firm said it experienced 1.5 hours of actual downtime during the attack, which lasted eight hours. Carriers including Level3, GlobalCrossing, Tinet, Tata, and Deutsche Telekom assisted in blocking the attack, which due to its size flooded network backbones with junk."

Prerequisites:
  1. Admit failure - Through a series of customer email communications and tweets, there was a clear admission of failure early and often.
  2. Sound like a human - Yes, the communications all sounded genuine and human.
  3. Have a communication channel - Marginal. The communication channels were Twitter and email, which are not as powerful as a health status dashboard.
  4. Above all else, be authentic - Great job here. All of the communication I saw sounded authentic and heartfelt, including the final postmortem. Well done.
Requirements:
  1. Start time and end time of the incident - Yes, the final postmortem email included the official start and end times (8:00 UTC - 14:00 UTC).
  2. Who/what was impacted - The postmortem addressed this directly, but didn't paint a completely clear picture of who was affected and who wasn't. This is probably because there isn't a clear distinction between sites that were and weren't affected. To address this, they recommended customers review their DNS query traffic to see how they were affected.
  3. What went wrong - A good amount of detail on this, and I hope there is more coming. DDoS attacks are a great example of where sharing knowledge and experience helps the community as a whole.
  4. Lessons learned - The postmortem included some lessons learned, but nothing very specific. I would have liked to see more here.
Bonus:
  1. Details on the technologies involved - Some.
  2. Answers to the Five Why's - Nope.
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Some.
  4. What others can learn from this experience - Some.
Other notes:
  • The communication throughout the incident was excellent, though they could have benefited from a public dashboard or status blog that went beyond Twitter and private customer emails.
  • I don't think this is the right way to address the question of whether SLA credits will be issued: "Yes it will be. With thousands paying companies we obviously do not want every organization to submit an SLA form."
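
For a rough sense of what is at stake in an SLA discussion like this, here is a back-of-the-envelope sketch of what 1.5 hours of downtime does to a monthly availability number. It assumes the 1.5 hours quoted above was the only downtime in August (31 days). The function name is my own, and I don't know what threshold DNS Made Easy's actual SLA uses, so none is checked here.

```python
def availability_pct(downtime_hours: float, days_in_month: int) -> float:
    """Percentage of the month the service was reachable."""
    total_hours = days_in_month * 24
    return 100.0 * (1.0 - downtime_hours / total_hours)


# 1.5 hours is the downtime figure quoted above; August has 31 days.
# Assumption: no other downtime occurred during the month.
august = availability_pct(1.5, 31)
print(f"{august:.3f}%")  # prints "99.798%"
```

In other words, a single 1.5 hour outage is enough to miss "three nines" for the month, which is why the question of SLA credits comes up at all.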

Posterous
"As you’re no doubt aware, Posterous has had a rocky six days.
On Wednesday and Friday, our servers were hit by massive Denial of Service (DoS) attacks. We responded quickly and got back online within an hour, but it didn’t matter; the site went down and our users couldn’t post.
On Friday night, our team worked around the clock to move to new data centers, better capable of handling the onslaught. It wasn’t easy. Throughout the weekend we were fixing issues, optimizing the site, some things going smoothly, others less so.
Just at the moments we thought the worst was behind us, we’d run up against another challenge. It tested not only our technical abilities, but our stamina, patience, and we lost more than a few hairs in the process."
Posterous continued to update their users on their blog and on Twitter. They also sent an email to all of their customers to let everyone know about the issues.

Prerequisites:
  1. Admit failure - Clearly yes, both on the blog and on Twitter.
  2. Sound like a human - Very much so.
  3. Have a communication channel - A combination of blog and Twitter. Again, not ideal, as customers may not think about visiting the blog or checking Twitter, especially when the blog is inaccessible during the downtime and they may not be aware of the Twitter account. One of the keys to a communication channel is to host it offsite, which would have been important in this case.
  4. Above all else, be authentic - No issues here, well done.
Requirements:
  1. Start time and end time of the incident - A bit vague in the postmortem, but can be calculated from the Twitter communication. Can be improved.
  2. Who/what was impacted - The initial post described this fairly well: all customers hosted on Posterous.com were affected, including custom domains.
  3. What went wrong - A series of things went wrong in this case, and I believe the issues were described fairly well.
  4. Lessons learned - Much room for improvement here. Things were put in place to avoid the issues in the future, such as moving to a new datacenter and adding hardware, but I don't see any real lessons learned in the postmortem posts or other communications.
Bonus:
  1. Details on the technologies involved - Very little.
  2. Answers to the Five Why's - No.
  3. Human elements - Yes, in the final postmortem, well done.
  4. What others can learn from this experience - Not a lot here.

Evernote
From their blog:
"EvernoteEvernote servers. We immediately contacted all affected users via email and our support team walked them through the recovery process. We automatically upgraded all potentially affected users to Evernote Premium (or added a year of Premium to anyone who had already upgraded) because we wanted to make sure that they had access to priority tech support if they needed help recovering their notes and as a partial apology for the inconvenience."
Prerequisites:
  1. Admit failure - Extremely solid, far beyond the bare minimum.
  2. Sound like a human - Yes.
  3. Have a communication channel - A simple health status blog (which according to the comments is not easy to find), a blog, and a Twitter channel. Biggest area of improvement here is to make the status blog easier to find. I have no idea how to get to that from the site or the application, and that defeats its purpose.
  4. Above all else, be authentic - The only communication I saw was the final postmortem, and in that post (and the comments) I think they were very authentic.
Requirements:
  1. Start time and end time of the incident - Rough timeframe, would have liked to see more detail.
  2. Who/what was impacted - First time I've seen an exact figure like "6,323" users. Impressive.
  3. What went wrong - Yes, at the end of the postmortem.
  4. Lessons learned - Marginal. A bit vague and hand-wavy. 
Bonus:
  1. Details on the technologies involved - Not bad.
  2. Answers to the Five Why's - No.
  3. Human elements - No.
  4. What others can learn from this experience - Not a lot here.

Conclusion
Overall, I'm impressed with how these companies are handling downtime. Each communicated early and often. Each admitted failure immediately, and kept their users up to date. Each put out a solid postmortem that detailed the key information. It's interesting to see how Twitter is becoming the de facto communication channel during an incident. I still wonder how effective it is in getting news out to all of your users, and how many users are aware of it. Overall, well done guys.

Update: DNS Made Easy just launched a public health dashboard!
