Transparent Uptime

Thursday, February 12, 2009

An overview of big downtime events over the past year

A relatively good review of the major downtime events in the recent past, with a solid conclusion at the end:

The bigger Web commerce gets, the bigger the opportunities to mess it up become. Outages and downtimes are inevitable; the trick is minimizing the pain they cause.

As we've seen over the past few months, the simplest way to minimize that pain is by letting your customers know what's going on. Before, during, and after. A little transparency goes a long way.

The transparent business plan by Mark Cuban

Big idea from Mark Cuban:

You must post your business plan here on my blog where I expect other people can and will comment on it. I also expect that other people will steal the idea and use it elsewhere. That is the idea. Call this an open source funding environment.

If its a good idea and worth funding, we want it replicated elsewhere. The idea is not just to help you, but to figure out how to help the economy through hard work and ingenuity. If you come up with the idea and get funding, you have a head start. If you execute better than others, you could possibly make money at it. As you will see from the rules below, these are going to be businesses that are mostly driven by sweat equity.

Read more here. What will you come up with?

Update: Seth Godin's crew gives away 999 potential business ideas. Ideas are a dime a dozen as they say. It's all about the execution.

Wednesday, February 11, 2009

Transparency User Story #1: Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.

Note: This post is the first in a series of at least a dozen posts where I attempt to drill into the transparency user stories described in an earlier post.

Let's assume that you've decided that you want to make your service or organization more transparent, specifically when it comes to its uptime and performance. You've convinced your management team, you've got the engineering and marketing resources, and you're raring to go. You want to get something done and can't wait to make things happen. Stop right there. Do not pass go, do not collect $200. You first need to figure out what it is you're solving for. What problems (and opportunities) do you want to tackle in your drive for transparency?

Glad you asked. There are about a dozen user stories that I've listed in a previous post that describe the most common problems transparency can solve. In this post, I will dive into the first user story:

As an end user or customer, it looks to me like your service is down. I'd like to know if it's down for everyone or if it's just me.

Very straightforward, and very common. So common that there are even a couple of simple, free web services out there that help people figure this out. Let's assume for this exercise that your site is up the entire time.

Examples of the problem in action

Your customer's Internet connection is down. He loads up www.yoursite.com. It cannot load. He thinks you are down, and calls your support department demanding service.
Your customer's DNS servers are acting up, and are unable to resolve www.yoursite.com inside his network. He finds that google.com is loading fine, and is sure your site is down. He sends you an irate email.
Your customer's network routes are unstable, causing inconsistent connectivity to your site. He loads www.yourcompetitor.com fine, while www.yoursite.com fails. He Twitters all about it.
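
To make the three scenarios above concrete, here is a minimal Python sketch of the kind of check a customer-side tool could run. The names probe and diagnose are hypothetical, as are the port, the timeout, and the choice of google.com as the reference site. It only separates "my whole connection is broken" from "DNS fails for this one site" from "this one site won't connect"; it cannot tell whether the site is really down for everyone, which is exactly why the customer ends up guessing.

```python
import socket


def probe(host, port=80, timeout=3.0,
          resolve=socket.gethostbyname,
          connect=socket.create_connection):
    """Check one host from this machine: DNS lookup first, then a TCP connect.

    resolve and connect are injectable so the function can be exercised
    without touching the network (an assumption made for this sketch).
    """
    try:
        ip = resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns_failure"
    try:
        conn = connect((ip, port), timeout=timeout)
    except OSError:  # refused, unreachable, or timed out
        return "connect_failure"
    conn.close()
    return "reachable"


def diagnose(site, reference="google.com", probe_fn=probe):
    """Classify a failed page load as seen from the customer's side."""
    site_result = probe_fn(site)
    if site_result == "reachable":
        return "reachable"
    # If a well-known reference site also fails, the problem is most
    # likely the customer's own connection (the first scenario above).
    if probe_fn(reference) != "reachable":
        return "local_connection_problem"
    # Otherwise only this site fails from here: a DNS problem (second
    # scenario) or a connectivity problem (third scenario).
    return site_result
```

A support script could call diagnose("www.yoursite.com") and use the answer to decide which question to ask the customer next.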

Why this hurts your business

Unnecessary support calls.
Unnecessary support emails.
Negative word of mouth that is completely unfounded.

How to solve this problem

An offsite public health dashboard
A known third party, such as this and this or eventually this
A constant presence across social media (Twitter especially) watching for false reports
Keeping a running blog noting any downtime events, which tells your users that unless something is posted, nothing is wrong. However, you must be diligent about posting when there actually is something wrong.
Share your real time performance with your large customers. Your customers may even want to be alerted when you go down.
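
As a rough illustration of what an offsite dashboard or third-party checker does behind the scenes, the sketch below combines checks made from several outside vantage points with the customer's own experience. The function name, the vantage-point names, and the verdict strings are all hypothetical, and a real service would need to decide how many vantage points it trusts.

```python
def verdict(external_checks, user_can_reach):
    """Answer "is it down for everyone or just me?".

    external_checks: dict mapping a vantage-point name to True if the
                     site loaded from there (names are made up).
    user_can_reach:  True if the site loads for the customer asking.
    """
    if not external_checks:
        return "unknown"  # no outside view, so nothing to compare against
    up = sum(1 for ok in external_checks.values() if ok)
    if up == 0:
        return "down_for_everyone"
    if up < len(external_checks):
        return "partial_outage"
    # Every outside vantage point can load the site.
    return "up" if user_can_reach else "just_you"


# Hypothetical usage: the site is fine, the customer's path to it is not.
checks = {"us-east": True, "eu-west": True, "asia": True}
print(verdict(checks, user_can_reach=False))
```

The "just_you" answer is the one that saves the support call: it tells the customer the problem sits between them and the site, not with the service itself.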

Example solutions in the real world

Tuesday, February 10, 2009

Differentiate yourself through honesty

A great post over at "A Smart Bear" focusing on being honest with your users. Some of my favorite recommendations:

Admit when you're wrong, quickly and genuinely.
As soon as something isn't going to live up to your customer's expectation -- or even your own internal expectations -- tell them. Explain why there's a problem and what you're doing about it.
Instead of pretending your new software has no bugs and every feature you could possibly want, actively engage customers in new feature discussions and turn around bug fixes in under 24 hours.
Send emails from real people, not from info@company.com.

Honesty is a prerequisite to transparency. Opening up to your customers forces you to be honest. Why not use it as a competitive advantage?

Seth Godin provides his own perspective on the best approach:

Can you succeed financially by acting in an ethical way?

I think the Net has opened both ends of the curve. On one hand, black hat tactics, scams, deceit and misdirection are far easier than ever to imagine and to scale. There are certainly people quietly banking millions of dollars as they lie and cheat their way to traffic and clicks.

On the other hand, there's far bigger growth associated with transparency. When your Facebook profile shows years of real connections and outreach and help for your friends, it's a lot more likely you'll get that great job.

When your customer service policies delight rather than enrage, word of mouth more than pays your costs. When past investors blog about how successful and ethical you were, it's a lot easier to attract new investors.

Slashdot.org hammers itself into submission

As reported by the site itself within hours of the incident, Slashdot.org was unreachable for about 75 minutes yesterday:

What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. The network came back up around 10:10 PM EST.

Great to see such a detailed explanation of the incident, walking users from the initial alert through the investigation and to the final resolution. Even though this is a very techie crowd, I like the narrative nature of the apology, versus a generic "We apologize for the inconvenience and promise never to do it again." The key is to infuse your apology with humanity.

P.S. My favorite part of this incident is the comments from the Slashdot crowd. A few of the best:

"In Soviet Russia, Slashdot slashdots Slashdot!"

"1. Meme Very Tired. No Longer Wired.
2. 'Soviet Russia' ceased to exist last century.
3. Profit!!!"

"The worst thing about this? 5,000,000 people who think they know what happened, posting 'helpful' suggestions or analysis"

"I think the switch was trying to get first post."

Wednesday, February 4, 2009

Downtime all over the place (Denny's, Vizio, and QuickBooks)

The Super Bowl led to Dennys.com and Vizio.com falling down, while QuickBooks went offline for a number of hours. Denny's and Vizio we can live with, but QuickBooks is a different story. According to the CNet report:

Affected customers of Quickbooks Online were left without access to their financial records. While users of the software version of Quickbooks could access their records, those relying on Intuit for credit-card processing had to do authorizations manually over the phone--a slower, more expensive process.
All online services have outages from time to time, but this one appears to have been lengthy for a line-of-business service, and for many users, unsatisfactorily managed. We received complaints from users that communications from Intuit gave neither a reason for the outages nor an estimate on when the service would return.

Hopefully a lesson learned here. Watching the Twitter traffic at that time, I was impressed to see some signs of life.

Update: Jack in the Box had its fair share of problems as well.

Saturday, January 31, 2009

The underreported half of Google's new Measurement Lab, and how it can help your online business

In describing the aims of the newly launched Measurement Lab, Vint Cerf puts it this way:

The tools will not only allow broadband customers to test their Internet connections, but also allow security and other researchers to work on ways to improve the Internet.

It's clear that the press and blogosphere are going nuts about the latter half, specifically the effect this will have on Net Neutrality and the shady practices of certain telcos. As powerful as this will be in the long run, I want to focus on the other half of the description: the tools to "allow broadband customers to test their Internet connections." I promise this isn't as geeky as it sounds, and it applies directly to helping your online business save time and money.

Imagine one of your customers sitting at home ready to use your service. She opens up her web browser, types in your URL, and presses Enter. The browser starts to load the page, the status bar shows "Connecting to yoursite.com...", then "Waiting for yoursite.com...". It sits like this for about 15 seconds with a blank page the entire time. She starts to get annoyed. Just to see what happens, she presses refresh and starts the process over. Again, a blank screen, the browser sitting there waiting for your site to begin loading. She then checks that her Internet connection is working by visiting google.com, which loads fine. At this point, if you are lucky, she decides to call your support department up or shoot an email over asking whether there's something wrong. If you are unlucky, she asks around on Twitter, or blogs about it, or just gives up with the new thought in the back of her mind that this service is just plain unreliable. Now, imagine that this scenario took place while your site was perfectly healthy, with no actual downtime anywhere.

Your site is up, but your customer thinks your service is down. The problem lies somewhere along the way between the client's browser and your company's firewalls. The Tubes are clogged just for this specific customer, but how is she supposed to know?

There are a few levels to this problem (followed by the solution):

Level 1: The effect this has on your customer(s)
Online users are still more than likely to give you a few chances before they draw a conclusion; however, every incident like this adds to the incorrect negative impression. This is especially true if the problem manifests itself as a performance issue, slowing or interrupting your customers' connections, versus simply keeping them from connecting at all. Your user begins to dread using your service, and looks for alternatives every chance they get.

Level 2: The dollar cost to your business
How many calls do you get to your support department from customers claiming they cannot connect to your site, or that your service is broken, or that it's really slow for them? How often does the problem end up being on their end, or completely unreproducible? It may be a relief for your support people, and it may be something your company is happy with, as it confirms that your site is working just fine. Unfortunately, each of these calls costs you money and time. Worse yet, these types of calls generally take the longest to diagnose, as they are vague and require long periods of debugging to get to the root cause. I haven't even mentioned the lost revenue from the missed traffic (if that affects your revenue).

Level 3: The "perception" cost to your business
As described in Level 1, any perceived downtime is just as real as actual downtime in the eyes of your customers. Word of mouth is powerful, especially with today's social media tools, in spreading negative news, unfounded as it may be. The more you can do to keep the invalid negative perception from forming, the better.

Level 4: The unknown cost
How often does this happen to your customers? No one has any idea. I said earlier you're "lucky" if your customer decides to pick up the phone and call you about the perceived downtime. More often than not, your customer will simply give up. At worst, they give up on your service entirely. How can you capture this type of information, and help your customers at the same time?

The Solution
Provide a tool that your customers and your support department can use to quickly diagnose where the problem lies. The simplest of these would be to offer a public health dashboard. The more powerful route is to offer tools like these:

Network Diagnostic Tool - provides a sophisticated speed and diagnostic test. An NDT test reports more than just the upload and download speeds--it also attempts to determine what, if any, problems limited these speeds, differentiating between computer configuration and network infrastructure problems.

Network Path and Application Diagnosis - diagnoses some of the common problems affecting the last network mile and end-users' systems. These are the most common causes of all performance problems on wide area network paths.

And what do you know? These are two of the tools that have launched on The Measurement Lab!

Clearly these are still very raw, and not for the everyday user. But I see tools like these becoming extremely important for online businesses, both in reducing costs and in controlling perception. I see this becoming a part of the public health dashboard (which I hope you're hosting separate from your primary site!), allowing users to diagnose problems they are seeing that are not reflected in the Internet at large.

I'm going to be watching the development of these tools very closely over the next few months. Most interesting will be noting which other companies support the research, and end up using these tools. Will the focus stay on Net Neutrality and BitTorrent, or will companies realize the potential of these other tools? We'll find out soon enough!