Thursday, January 15, 2009

GoDaddy down intermittently today. What lessons can we take away?

Though not widely reported, it appears that GoDaddy saw some intermittent downtime today:
A distributed denial-of-service attack turned dark at least several thousand Web sites hosted by GoDaddy Wednesday morning. The outage was intermittent over several hours, according to Nick Fuller, GoDaddy communications manager.

What caught my eye was some insight on how GoDaddy handled the communication during the event:

To add to the consternation of Web site owners, GoDaddy's voice mail system pointed to its support page for more information about the outage and when it would be corrected. No such information was posted there.

Luckily this didn't blow up into anything major for GoDaddy, but I'd like to offer up a few suggestions:
  1. If you're pointing your customers to the default support page, make sure it has some kind of call-out link referencing the event. Otherwise customers will search through your support forums, get more frustrated, and end up tying up your support lines (or Twitter'ing their hearts out).
  2. Offer your customers an easy-to-find public health dashboard (e.g. a link off of the support page). There are numerous benefits that come along with such an offering, but this specific situation would be a perfect use case for one.
  3. Provide a few details on the problem in both the voice mail message and in whichever online forum you choose to communicate (e.g. health dashboard, blog, Twitter, forums, etc.). At a minimum, provide an estimated time to recovery and some details on the scope of the problem.
A little bit of transparency can go a long way. I would venture to say that if any of the above advice were implemented in the future, it would pay off substantially, both in the immediate customer reaction and in long-term benefits.
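
To make the third suggestion a bit more concrete, here is a minimal sketch of a single outage notice that can be read out on the voice mail message and posted on the support page or dashboard, so every channel says the same thing. The StatusUpdate class, its field names, the 30-minute follow-up default, and the sample values are all hypothetical; none of this describes how GoDaddy or any other provider actually does it.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StatusUpdate:
    """One outage notice, reused word for word on every channel (hypothetical)."""
    summary: str                                   # what is wrong, in plain language
    scope: str                                     # who is affected
    posted_at: datetime                            # when this notice was written
    estimated_recovery: Optional[datetime] = None  # ETR, if one is known
    next_update_minutes: int = 30                  # assumed follow-up interval

    def render(self) -> str:
        # Always say something about recovery, even when no estimate exists yet.
        if self.estimated_recovery is None:
            etr = (f"No estimate yet; next update within "
                   f"{self.next_update_minutes} minutes.")
        else:
            etr = f"Estimated recovery: {self.estimated_recovery.strftime('%H:%M')}."
        stamp = self.posted_at.strftime("%Y-%m-%d %H:%M")
        return f"[{stamp}] {self.summary} Scope: {self.scope}. {etr}"


# Made-up sample values, purely for illustration.
update = StatusUpdate(
    summary="Some hosted sites are intermittently unreachable.",
    scope="a subset of hosting customers",
    posted_at=datetime(2024, 1, 1, 9, 30),
    estimated_recovery=datetime(2024, 1, 1, 11, 0),
)
print(update.render())
```

The design point is simply that the notice is written once and carries scope and an estimated time to recovery by default, so the phone message can never point at a page that says nothing.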

Update: A bit of insight provided by GoDaddy’s Communications Manager Nick Fuller.

Tuesday, January 13, 2009

The bulls**t of outage language

As this blog is often dedicated to pointing out downtime events, and offering advice on how to best communicate before/during/after the (inevitable) event, I thought this post by 37signals could come in handy next time you have to write an apology email to your customers.

Service operators generally suck at saying they’re sorry. I should know, I’ve had to do it plenty of times and it’s always hard. There’s really never a great way to say it, but there sure are plenty of terrible ways.

One of the worst stock dummies that even I have resorted to in a moment of weakness is this terrible non-apology: “We apologize for any inconvenience this may have caused”. Oh please. Let’s break down why it’s bad...

I'll let you read the advice yourself, but I will point out a few of the visitor comments that speak to the message I've been harping on over the past few months:
Josh Catone:
Serious question: What WOULD be a better way to communicate with customers after downtime in your opinion? You didn’t offer any alternatives. I know you said stock responses should never be used… but I’d love to see some examples of what you think works.

Dan Gebhardt:
I’d recommend using a website monitoring service (we use Pingdom [editor's note: *cough* Webmetrics *cough*]) to provide public accountability for your uptime. This not only proves that uptime is as important to you as it is to your customers, it can also help customers see any particular outage in the context of your overall service record.

Mark Weiss:
While a personal well thought out apology is nice. As a user I want to know when things are going to be working again. I want to know if I should go for a quick walk in the park or if I have time for some food, drinks, and then possibly a nap.

Just keep me informed so I know how to manage my time.

I think Flickr holds top honors for the best down time strategy and message.

Itinerant Networker:
Empathy’s not enough. Service providers should reveal details about why an outage happened, what they’re doing to make it not happen again, and should clearly communicate with customers (frequently) on the ETA of the outage. The most frustrating thing I hear is “we don’t have an ETR [estimated time to recovery]”. That is not acceptable in a service business – give me an ETR and then an estimate of how reliable the ETR is. This goes for even the lowest cable modem user calling $provider – the tier 1 guys should have at least some clue.

The bottom line is that what matters most is not that you never go down, but how you deal with that downtime. All your customers need is some form of honest communication during the event, some transparency into the severity of the problem, and a human explanation of what went wrong afterward. It really isn't very hard.
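
As a rough illustration of the "overall service record" idea from the comments, here is a small sketch that turns a list of recorded outages into an uptime percentage for a reporting window. The function name, the 30-day window, and the sample outage are made up for the example, and it assumes the outage intervals do not overlap; a monitoring service would compute this from its own check data.

```python
from datetime import datetime, timedelta


def uptime_percent(window_start, window_end, outages):
    """Percent of the window with no recorded outage.

    outages: list of (start, end) datetime pairs, assumed non-overlapping.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return 100.0 * (total - down) / total


# Made-up example: one three-hour outage in a 30-day window.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=30)
outages = [(datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 12, 0))]
print(f"{uptime_percent(window_start, window_end, outages):.2f}% uptime")
```

Publishing a number like this next to the incident notice lets customers judge a single bad morning against the whole record.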