Thursday, February 12, 2009

An overview of big downtime events over the past year

A relatively good review of the major downtime events in the recent past, with a solid conclusion at the end:
The bigger Web commerce gets, the bigger the opportunities to mess it up become. Outages and downtimes are inevitable; the trick is minimizing the pain they cause.
As we've seen over the past few months, the simplest way to minimize that pain is by letting your customers know what's going on. Before, during, and after. A little transparency goes a long way.

The transparent business plan by Mark Cuban

Big idea from Mark Cuban:

You must post your business plan here on my blog where I expect other people can and will comment on it. I also expect that other people will steal the idea and use it elsewhere. That is the idea. Call this an open source funding environment.

If its a good idea and worth funding, we want it replicated elsewhere. The idea is not just to help you, but to figure out how to help the economy through hard work and ingenuity. If you come up with the idea and get funding, you have a head start. If you execute better than others, you could possibly make money at it. As you will see from the rules below, these are going to be businesses that are mostly driven by sweat equity.

Read more here. What will you come up with?

Update: Seth Godin's crew gives away 999 potential business ideas. Ideas are a dime a dozen, as they say. It's all about the execution.

Wednesday, February 11, 2009

Transparency User Story #1: Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.

Note: This post is the first in a series of at least a dozen posts where I attempt to drill into the transparency user stories described in an earlier post.

Let's assume that you've decided that you want to make your service or organization more transparent, specifically when it comes to its uptime and performance. You've convinced your management team, you've got the engineering and marketing resources, and you're raring to go. You want to get something done and can't wait to make things happen. Stop right there. Do not pass go, do not collect $200. You first need to figure out what it is you're solving for. What problems (and opportunities) do you want to tackle in your drive for transparency?

Glad you asked. There are about a dozen user stories that I've listed in a previous post that describe the most common problems transparency can solve. In this post, I will dive into the first user story:
As an end user or customer, it looks to me like your service is down. I'd like to know if it's down for everyone or if it's just me.
Very straightforward, and very common. So common that there are even a couple of simple free web services out there that help people figure this out. Let's assume for this exercise that your site is up the entire time.

Examples of the problem in action
  1. Your customer's Internet connection is down. He tries to load your site. It cannot load. He thinks you are down, and calls your support department demanding service.
  2. Your customer's DNS servers are acting up, and are unable to resolve your domain inside his network. He finds that another site is loading fine, and is sure your site is down. He sends you an irate email.
  3. Your customer's network routes are unstable, causing inconsistent connectivity to your site. He loads another site fine, while yours fails. He Twitters all about it.
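The three scenarios above differ only in which of a few yes/no checks fail, so a support tool could sort them out mechanically. Here is a minimal Python sketch of that idea. The probe names (`reference_ok`, `dns_ok`, `site_ok`, `offsite_ok`) and the `Verdict` labels are my own hypothetical choices, and how you actually gather each answer (a phone call, a script, a third-party checker) is left open:

```python
from enum import Enum


class Verdict(Enum):
    ALL_GOOD = "site is reachable"
    SITE_DOWN = "site looks down for everyone"
    CUSTOMER_OFFLINE = "customer's Internet connection looks down"
    CUSTOMER_DNS = "customer's DNS cannot resolve the site"
    CUSTOMER_ROUTING = "site fails from the customer's network only"


def diagnose(reference_ok, dns_ok, site_ok, offsite_ok):
    """Classify a 'your site is down' report from four yes/no probes.

    reference_ok: an unrelated, well-known site loads for the customer
    dns_ok:       the customer's resolver returns an address for your site
    site_ok:      your site loads for the customer
    offsite_ok:   an independent third party can load your site
    """
    if site_ok:
        return Verdict.ALL_GOOD
    if not offsite_ok:
        # Nobody outside can reach you either: this one is on you.
        return Verdict.SITE_DOWN
    if not reference_ok:
        # Scenario 1: nothing loads for the customer.
        return Verdict.CUSTOMER_OFFLINE
    if not dns_ok:
        # Scenario 2: other sites load, but your name does not resolve.
        return Verdict.CUSTOMER_DNS
    # Scenario 3: the name resolves and other sites load, yet yours fails.
    return Verdict.CUSTOMER_ROUTING


print(diagnose(reference_ok=True, dns_ok=False, site_ok=False, offsite_ok=True).value)
```

The offsite check is evaluated first on purpose: if an independent party cannot reach you either, nothing about the customer's own network matters.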
Why this hurts your business
  1. Unnecessary support calls
  2. Unnecessary support emails.
  3. Negative word of mouth that is completely unfounded.
How to solve this problem
  1. An offsite public health dashboard
  2. A known third party, such as this and this or eventually this
  3. A constant presence across social media (Twitter especially) watching for false reports
  4. Keeping a running blog noting any downtime events, which tells your users that unless something is posted, nothing is wrong. You must be diligent about posting when there is actually something wrong, however.
  5. Share your real time performance with your large customers. Your customers may even want to be alerted when you go down.
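The dashboard and the running blog both rest on one rule: if nothing is posted, nothing is wrong. As a rough illustration only, here is how the headline of such a status page might be derived from a list of incidents. The `Incident` record and `status_summary` function are hypothetical names invented for this sketch, not part of any real dashboard product:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Incident:
    title: str
    started: str                    # free-form timestamp, e.g. "09:00"
    resolved: Optional[str] = None  # stays None while the incident is open


def status_summary(incidents: List[Incident]) -> str:
    """Return the headline a public status page would show."""
    open_incidents = [i for i in incidents if i.resolved is None]
    if not open_incidents:
        # No open incidents means the page says everything is fine.
        return "All systems operational"
    titles = "; ".join(i.title for i in open_incidents)
    return f"Investigating {len(open_incidents)} open incident(s): {titles}"


print(status_summary([Incident("Login errors", "09:00")]))
```

The rule only earns trust if every real incident gets an entry, which is the diligence point made in item 4.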
Example solutions in the real world
  1. Many public health dashboards
  2. The QuickBooks team notifying users that their service was back up
  3. Sharing your monitoring data in real time with your serious customers
  4. Searching Twitter for outage discussion

Tuesday, February 10, 2009

Differentiate yourself through honesty

A great post over at "A Smart Bear" focusing on being honest with your users. Some of my favorite recommendations:
  • Admit when you're wrong, quickly and genuinely.
  • As soon as something isn't going to live up to your customer's expectation -- or even your own internal expectations -- tell them. Explain why there's a problem and what you're doing about it.
  • Instead of pretending your new software has no bugs and every feature you could possibly want, actively engage customers in new feature discussions and turn around bug fixes in under 24 hours.
  • Send emails from real people, not from
Honesty is a prerequisite to transparency. Opening up to your customers forces you to be honest. Why not use it as a competitive advantage?

Seth Godin provides his own perspective on the best approach:

Can you succeed financially by acting in an ethical way?

I think the Net has opened both ends of the curve. On one hand, black hat tactics, scams, deceit and misdirection are far easier than ever to imagine and to scale. There are certainly people quietly banking millions of dollars as they lie and cheat their way to traffic and clicks.

On the other hand, there's far bigger growth associated with transparency. When your Facebook profile shows years of real connections and outreach and help for your friends, it's a lot more likely you'll get that great job.

When your customer service policies delight rather than enrage, word of mouth more than pays your costs. When past investors blog about how successful and ethical you were, it's a lot easier to attract new investors.

Slashdot hammers itself into submission

As reported by the site itself within hours of the incident, Slashdot was unreachable for about 75 minutes yesterday:
What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. The network came back up around 10:10 PM EST.
Great to see such a detailed explanation of the incident, walking its users from the initial alert through the investigation and to the final resolution. Even though this is a very techie crowd, I like the narrative nature of the apology, versus a generic "We apologize for the inconvenience and promise never to do it again." The key is to infuse your apology with humanity.

P.S. My favorite part of this incident is the comments from the Slashdot crowd. Some of my favorites:
"In Soviet Russia, Slashdot slashdots Slashdot!"

"1. Meme Very Tired. No Longer Wired.
2. 'Soviet Russia' ceased to exist last century.
3. Profit!!!"

"The worst thing about this? 5,000,000 people who think they know what happened, posting 'helpful' suggestions or analysis"

"I think the switch was trying to get first post."