Friday, August 13, 2010

How to Prevent Downtime Due to Human Error

Great post today over at Datacenter Knowledge, citing the fact that "70 percent of the problems that plague data centers" are caused by human error. Below are the best practices to avoid data center failure by human error:

1. Shielding Emergency OFF Buttons – Emergency Power Off (EPO) buttons are generally located near doorways in the data center. Often, these buttons are not covered or labeled, and are mistakenly shut off during an emergency, which shuts down power to the entire data center. Labeling and covering EPO buttons can prevent someone from accidentally pushing the button. See Averting Disaster with the EPO Button and Best Label Ever for an EPO Button for more on this topic.
2. Documented Method of Procedure - A documented step-by-step, task-oriented procedure mitigates or eliminates the risk associated with performing maintenance. Don’t limit the procedure to one vendor, and ensure back-up plans are included in case of unforeseen events.
3. Correct Component Labeling - To correctly and safely operate a power system, all switching devices must be labeled correctly, as well as the facility one-line diagram to ensure correct sequence of operation. Procedures should be in place to double check device labeling.
4. Consistent Operating Practices – Sometimes data center managers get too comfortable and don’t follow procedures, forget or skip steps, or perform the procedure from memory and inadvertently shut down the wrong equipment. It is critical to keep all operational procedures up to date and follow the instructions to operate the system.
5. Ongoing Personnel Training – Ensure all individuals with access to the data center, including IT, emergency, security and facility personnel, have basic knowledge of equipment so that it’s not shut down by mistake.
6. Secure Access Policies – Organizations without data center sign-in policies run the risk of security breaches. Having a sign-in policy that requires an escort for visitors, such as vendors, will enable data center managers to know who is entering and exiting the facility at all times.
7. Enforcing Food/Drinks Policies – Liquids pose the greatest risk for shorting out critical computer components. The best way to communicate your data center’s food/drink policy is to post a sign outside the door that states what the policy is, and how vigorously the policy is enforced.
8. Avoiding Contaminants – Poor indoor air quality can cause unwanted dust particles and debris to enter servers and other IT infrastructure. Much of the problem can be alleviated by having all personnel who access the data center wear antistatic booties, or by placing a mat outside the data center. This includes packing and unpacking equipment outside the data center. Moving equipment inside the data center increases the chances that fibers from boxes and skids will end up in server racks and other IT infrastructure.

4 comments:

  1. In a previous gig, I had a sign I would post in the computer lab that was taken from an episode of The Tick. It read:

    Do Not:
    - Eat in lab
    - Set lab on fire
    - Innovate unnecessarily

    Good advice for all.

    ReplyDelete
  2. This is also a very good post which I really enjoyed reading. It is not everyday that I have the possibility to see something

    http://catmario.games

    ReplyDelete
  3. Spot on with this write-up, I absolutely believe this amazing site needs far more attention. I’ll probably be back again to see more, thanks for the info! best rooting apps , rooting apps apk

    ReplyDelete
  4. Wow. This is some thing very amazing. I was in search of such a beautiful writing. Awesome plot and ends very in very simple way. Thank you.
    henrystickmangames.com
    Friv.Pro

    ReplyDelete

Note: Only a member of this blog may post a comment.