Friday, August 13, 2010

How to Prevent Downtime Due to Human Error

Great post today over at Datacenter Knowledge, citing the fact that "70 percent of the problems that plague data centers" are caused by human error. Below are the best practices for avoiding data center failures caused by human error:

1. Shielding Emergency OFF Buttons – Emergency Power Off (EPO) buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled, and are mistakenly pressed during an emergency, shutting down power to the entire data center. Labeling and covering EPO buttons can prevent someone from accidentally pushing one. See Averting Disaster with the EPO Button and Best Label Ever for an EPO Button for more on this topic.
2. Documented Method of Procedure - A documented step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Don’t limit the procedure to one vendor, and ensure back-up plans are included in case of unforeseen events.
3. Correct Component Labeling - To operate a power system correctly and safely, all switching devices must be labeled correctly, and the facility's one-line diagram must match, so that the correct sequence of operation can be followed. Procedures should be in place to double-check device labeling.
4. Consistent Operating Practices – Sometimes data center managers get too comfortable and don’t follow procedures, forget or skip steps, or perform the procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up to date and follow the instructions to operate the system.
5. Ongoing Personnel Training – Ensure all individuals with access to the data center, including IT, emergency, security and facility personnel, have basic knowledge of equipment so that it’s not shut down by mistake.
6. Secure Access Policies – Organizations without data center sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.
7. Enforcing Food/Drinks Policies – Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center’s food/drink policy is to post a sign outside the door that states what the policy is, and how vigorously the policy is enforced.
8. Avoiding Contaminants – Poor indoor air quality can cause unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties, or by placing a mat outside the data center. Equipment should also be packed and unpacked outside the data center; moving boxes and skids inside increases the chances that fibers from the packaging will end up in server racks and other IT infrastructure.

Thursday, August 12, 2010

Downtime, downtime, downtime - DNS Made Easy, Posterous, Evernote

It's been a busy week on the interwebs. Either downtime incidents are becoming more common, or I'm just finding out about more of them. One nice thing about this blog is that readers send me downtime events they come across. I don't know if I want to be the first person people think of when they see downtime, but I'll take it. In the spirit of this blog, let's take a look at these recent downtime events to see what each company did right, what they can improve, and what we can all learn from their experience.


DNS Made Easy
On Saturday, August 7th, DNS Made Easy was hit by a massive DDoS attack:
"The firm said it experienced 1.5 hours of actual downtime during the attack, which lasted eight hours. Carriers including Level3, GlobalCrossing, Tinet, Tata, and Deutsche Telekom assisted in blocking the attack, which due to its size flooded network backbones with junk."

Prerequisites:
  1. Admit failure - Through a series of customer emails and tweets, there was a clear admission of failure early and often.
  2. Sound like a human - Yes, the communications all sounded genuine and human.
  3. Have a communication channel - Marginal. The communication channels were Twitter and email, which are not as powerful as a health status dashboard.
  4. Above all else, be authentic - Great job here. All of the communication I saw sounded authentic and heartfelt, including the final postmortem. Well done.
Requirements:
  1. Start time and end time of the incident - Yes, final postmortem email communication included the official start and end times (8:00 UTC - 14:00 UTC).
  2. Who/what was impacted - The postmortem addressed this directly, but didn't spell out a completely clear picture of who was affected and who wasn't. This is probably because there isn't a clear distinction between sites that were and weren't affected. To address this, they recommended customers review their DNS query traffic to see how they were affected.
  3. What went wrong - A good amount of detail on this, and I hope there is more coming. DDoS attacks are a great example of how sharing knowledge and experience helps the community as a whole, so I hope to see more detail come out about this.
  4. Lessons learned - The postmortem included some lessons learned, but nothing very specific. I would have liked to see more here.
Bonus:
  1. Details on the technologies involved - Some.
  2. Answers to the Five Whys - Nope.
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Some.
  4. What others can learn from this experience - Some.
Other notes:
  • The communication throughout the incident was excellent, though they could have benefited from a public dashboard or status blog that went beyond Twitter and private customer emails (a rough sketch of what a minimal status endpoint could look like follows this list).
  • I don't think this is the right way to address the question of whether SLA credits will be issued: "Yes it will be. With thousands paying companies we obviously do not want every organization to submit an SLA form."
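
Aside: since I keep harping on public dashboards, here is a rough sketch of what a minimal offsite status endpoint could look like, written in Python with Flask. Every name and status value in it is hypothetical, and a real dashboard would be fed by monitoring rather than hard-coded data; the point is only to show how small the core of the idea is.

```python
# Minimal sketch of a public status endpoint. All service names and statuses
# are hypothetical and hard-coded for illustration; a real dashboard would be
# fed by monitoring and hosted on infrastructure separate from the service
# it reports on.
from flask import Flask, jsonify

app = Flask(__name__)

SERVICES = {
    "dns-anycast": {"status": "up", "message": "Operating normally"},
    "control-panel": {"status": "degraded", "message": "Investigating slow logins"},
}

@app.route("/status")
def all_statuses():
    """Return the current status of every tracked service as JSON."""
    return jsonify(SERVICES)

@app.route("/status/<name>")
def one_status(name):
    """Return the status of a single service, or a 404 if it is unknown."""
    svc = SERVICES.get(name)
    if svc is None:
        return jsonify({"error": "unknown service"}), 404
    return jsonify(svc)

if __name__ == "__main__":
    app.run(port=8080)
```

Whatever the implementation, the key property is that the status page lives somewhere an outage of the main service can't take down, and that customers can find it without digging through Twitter or their inbox.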

Posterous
"As you’re no doubt aware, Posterous has had a rocky six days.
On Wednesday and Friday, our servers were hit by massive Denial of Service (DoS) attacks. We responded quickly and got back online within an hour, but it didn’t matter; the site went down and our users couldn’t post.
On Friday night, our team worked around the clock to move to new data centers, better capable of handling the onslaught. It wasn’t easy. Throughout the weekend we were fixing issues, optimizing the site, some things going smoothly, others less so.
Just at the moments we thought the worst was behind us, we’d run up against another challenge. It tested not only our technical abilities, but our stamina, patience, and we lost more than a few hairs in the process."
Posterous continued to update their users on their blog and on Twitter. They also sent an email to all of their customers to let everyone know about the issues.

Prerequisites:
  1. Admit failure - Clearly yes, both on the blog and on Twitter.
  2. Sound like a human - Very much so.
  3. Have a communication channel - A combination of the blog and Twitter. Again, not ideal, as customers may not think to visit the blog or check Twitter, especially when the blog is inaccessible during the downtime and they may not be aware of the Twitter account. One of the keys to a communication channel is to host it offsite, which would have been important in this case.
  4. Above all else, be authentic - No issues here, well done.
Requirements:
  1. Start time and end time of the incident - A bit vague in the postmortem, but can be calculated from the Twitter communication. Can be improved.
  2. Who/what was impacted - The initial post described this fairly well, that all customers hosted on Posterous.com are affected, including custom domains.
  3. What went wrong - A series of things went wrong in this case, and I believe the issues were described fairly well.
  4. Lessons learned - Much room for improvement here. I don't see any real lessons learned in the postmortem posts or other communications. Things were put in place to avoid the issues in the future, such as moving to a new datacenter and adding hardware, but nothing was framed as a lesson learned from this downtime.
Bonus:
  1. Details on the technologies involved - Very little.
  2. Answers to the Five Whys - No.
  3. Human elements - Yes, in the final postmortem, well done.
  4. What others can learn from this experience - Not a lot here.

Evernote
From their blog:
"EvernoteEvernote servers. We immediately contacted all affected users via email and our support team walked them through the recovery process. We automatically upgraded all potentially affected users to Evernote Premium (or added a year of Premium to anyone who had already upgraded) because we wanted to make sure that they had access to priority tech support if they needed help recovering their notes and as a partial apology for the inconvenience."
Prerequisites:
  1. Admit failure - Extremely solid, far beyond the bare minimum.
  2. Sound like a human - Yes.
  3. Have a communication channel - A simple health status blog (which, according to the comments, is not easy to find), a blog, and a Twitter channel. The biggest area for improvement here is to make the status blog easier to find; I have no idea how to get to it from the site or the application, and that defeats its purpose.
  4. Above all else, be authentic - The only communication I saw was the final postmortem, and I think in that post (and the comments) they were very authentic.
Requirements:
  1. Start time and end time of the incident - Rough timeframe, would have liked to see more detail.
  2. Who/what was impacted - First time I've seen an exact figure like "6,323" users. Impressive.
  3. What went wrong - Yes, at the end of the postmortem.
  4. Lessons learned - Marginal. A bit vague and hand-wavy. 
Bonus:
  1. Details on the technologies involved - Not bad.
  2. Answers to the Five Whys - No.
  3. Human elements - No.
  4. What others can learn from this experience - Not a lot here.

Conclusion
Overall, I'm impressed with how these companies are handling downtime. Each communicated early and often. Each admitted failure immediately and kept their users up to date. Each put out a solid postmortem that detailed the key information. It's interesting to see how Twitter is becoming the de facto communication channel during an incident. I still wonder how effective it is in getting news out to all of your users, and how many users are aware of it. All in all, well done.

Update: DNS Made Easy just launched a public health dashboard!

Tuesday, August 10, 2010

Transparency in action at Twilio

When Twilio launched an open-source public health dashboard tool a couple of weeks ago, I knew I had to learn more about Twilio. I connected with John Britton (Developer Evangelist at Twilio) to get some insight into Twilio's transparency story. Enjoy...

Q. What motivated Twilio to launch a public health dashboard and to put resources into transparency?
Twilio's goal is to bring the simplicity and transparency common in the world of web technologies to the opaque world of telephony and communications.  Just as Amazon AWS and other web infrastructure providers give customers direct and immediate information on service availability, Stashboard allows Twilio to provide a dedicated status portal that our customers can visit anytime to get up-to-the-minute information on system health.  During the development of Stashboard, we realized how many other companies and businesses could use a simple, scalable status page, so we open sourced it!  You can download the source code or fork your own version.
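
For readers wondering what consuming a dashboard like this might look like, here is a small sketch that polls a Stashboard-style status API over HTTP and prints the JSON it returns. The base URL is made up, and the /api/v1/services path and response shape are my assumptions for illustration; check the Stashboard documentation for the actual API.

```python
# Sketch of polling a Stashboard-style status API. The base URL is
# hypothetical and the endpoint path is an assumption; consult the
# Stashboard documentation for the real API.
import json
import urllib.request

BASE_URL = "http://status.example.com"  # hypothetical Stashboard instance

def fetch_services(base_url=BASE_URL):
    """Fetch the tracked services resource and decode it from JSON."""
    with urllib.request.urlopen(base_url + "/api/v1/services") as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(json.dumps(fetch_services(), indent=2))
```

Because everything is plain HTTP and JSON, customers can wire a check like this into their own monitoring instead of waiting on email or Twitter.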

Q. What roadblocks did you encounter on the way to launching the public dashboard, and how did you overcome them?
The most difficult part of building and launching Stashboard was creating a robust set of APIs that would encompass Twilio's services as well as other services from companies interested in running an instance of Stashboard themselves. We looked at existing status dashboards for inspiration, including the Amazon AWS Status Page and the Google Apps Status Page, and settled on a very general design independent of Twilio's product. The result is a dashboard that can be used to track a variety of APIs and services.  For example, a few days after the release of Stashboard, MongoHQ, a hosted MongoDB database provider, launched their own instance of Stashboard to give their customers API status information.

Q. What benefits have you seen as a result of your transparency initiatives?
Twilio's rapid growth is a great example of how developers at both small and large companies have responded to Twilio's simple open approach.  The Twilio developer community has grown to more than 15,000 strong and we see more and more applications and developers on the platform every day.  Twilio was founded by developers who have a strong background in web services and distributed systems.  This is reflected in our adoption of open standards like HTTP and operational transparency with services like http://status.twilio.com.  Another example is the community that has grown up around OpenVBX, a downloadable phone system for small businesses that Twilio developed and open sourced a few weeks ago.  We opened up OpenVBX to provide developers with the simplest way to hack, skin, and integrate it with their own systems.

Q. What is your hope with the open source dashboard framework?
The main goal of Stashboard is to give back to the community.  We use open source software extensively inside Twilio and we hope that by opening up Stashboard it will help other hosted services and improve the whole web services ecosystem.

Q. What would you say to companies considering transparency in their uptime/performance?
Openness and transparency are key to building trust with customers.  Take the telecom industry as an example.  They are known for being completely closed.  Customers rarely love or trust their telecom providers.  In contrast, Twilio brings the open approach of the web to telecom, and the response has been truly amazing.  When customers know they can depend on a company to provide accurate data concerning performance and reliability, they are more willing to give that company their business and recommend it to their peers.  Twilio's commitment to transparency and openness has been a huge driver of our success, and Stashboard and projects like OpenVBX are just the beginning.