Wednesday, July 15, 2009

SLAs as an insurance policy? Think again.

From Benjamin Black:

"if SLAs were insurance policies, vendors would quickly be out of business.

given this, the question remains: how do you achieve confidence in the availability of the services on which your business relies? the answer is to use multiple vendors for the same services. this is already common practice in other areas: internet connection multihoming, multiple CDN vendors, multiple ad networks, etc. the cloud does not change this. if you want high availability, you’re going to have to work for it."

Well put. As Werner Vogels continues to preach, everything fails. Build your infrastructure so that SLAs are a bonus, not a requirement.
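
To make the multi-vendor idea concrete, here is a minimal sketch of failing over between providers of the same service. The function name and the shape of the vendor list are my own assumptions for illustration, not anything from Benjamin's post; each "vendor" is just a callable that either returns a result or raises.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def call_with_failover(vendors: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each vendor in order and return (vendor_name, result) from the first that succeeds.

    `vendors` is a hypothetical list of (name, callable) pairs; the callables
    stand in for whatever client code talks to each provider.
    """
    errors = []
    for name, call in vendors:
        try:
            return name, call()
        except Exception as exc:  # any failure from this vendor triggers failover
            errors.append(f"{name}: {exc}")
    # Every vendor failed; surface all the errors so the outage is visible.
    raise RuntimeError("all vendors failed: " + "; ".join(errors))
```

In practice you would likely add timeouts and health checks per vendor, but the core point stands: the availability comes from your own code path, not from the SLA.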

Thursday, July 9, 2009

Google raising the bar in post-mortem transparency

In the most detailed post-mortem I've ever seen come out of a cloud provider, Google chronicles the minute by minute timeline of their App Engine downtime event, reviews what went wrong, and commits to fixing the root cause at many levels:

What are we doing to fix it?

1. The underlying bug in GFS has already been addressed and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug has been live for at least a year, so the
risk of recurrence should be low. Site reliability engineers are aware
of this issue and can quickly fix it if it should recur before then.

2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.

3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.

4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.

Read the entire post for the full effect. Looks like The Register should think about taking back some of the things they said?

Saturday, July 4, 2009

Cloud and SaaS SLAs

Daniel Druker over at the SaaS 2.0 blog recently posted an extremely thorough description of what we should be expecting from Cloud and SaaS services when it comes to SLAs:
In my experience, there are four key areas to consider in your SLA:

First is addressing control: The service level agreement must guarantee the quality and performance of operational functions like availability, reliability, performance, maintenance, backup, disaster recovery, etc that used to be under the control of the in-house IT function when the applications were running on-premises and managed by internal IT, but are now under the vendor's control since the applications are running in the cloud and managed by the vendor.

Second is addressing operational risk: The service level agreement should also address perceived risks around security, privacy and data ownership - I say perceived because most SaaS vendors are actually far better at these things than nearly all of their clients are. Guaranteed commitments to undergoing regular SAS70 Type II audits and external security evaluations are also important parts of mitigating operational risk.

Third is addressing business risk: As cloud computing companies become more comfortable with their ability to deliver value and success, more of them will start to include business success guarantees in the SLA - such as guarantees around successful and timely implementations, the quality of technical support, business value received and even to money back guarantees - if a client isn't satisfied, they get their money back. Cloud/SaaS vendor can rationally consider offering business risk guarantees because their track record of successful implementations is typically vastly higher than their enterprise software counterparts.

Last is penalties, rewards and transparency: The service level agreement must have real financial penalties / teeth when an SLA violation occurs. If there isn't any pain for the vendor when they fail to meet their SLA, the SLA doesn't mean anything. Similarly, the buyer should also be willing to pay a reward for extraordinary service level achievements that deliver real benefits - if 100% availability is an important goal for you, consider paying the vendor a bonus when they achieve it. Transparency is also important - the vendor should also maintain a public website with continuous updates as to how the vendor is performing against their SLA, and should publish their SLA and their privacy policies. The best cloud vendors realize that their excellence in operations and their SLAs are real selling points, so they aren't afraid to open their kimonos in public.
Considering the sad state of affairs in existing SLAs, I'm hoping to see some progress here from the big boys, if nothing else as a competitive advantage as they try to differentiate themselves.

Wednesday, March 18, 2009

Microsoft showing us how it's done, coming clean about Azure downtime

Following up on yesterday's Windows Azure downtime event, Microsoft posted an excellent explanation of what happened:

The Windows Azure Malfunction This Weekend

First things first: we're sorry. As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime. Windows Azure storage was unaffected.

In the rest of this post, I'd like to explain what went wrong, who was affected, and what corrections we're making.

What Happened?

During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail.

Once these servers failed, our monitoring system alerted the team. At the same time, the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time. Because this serial process was taking much too long, we decided to pursue a parallel update process, which successfully restored all applications.

What Was Affected?

Any application running only a single instance went down when its server went down. Very few applications running multiple instances went down, although some were degraded due to one instance being down.

In addition, the ability to perform management tasks from the web portal appeared unavailable for many applications due to the Fabric Controller being backed up with work during the serialized recovery process.

How Will We Prevent This in the Future?

We have learned a lot from this experience. We are addressing the network issues and we will be refining and tuning our recovery algorithm to ensure that it can handle malfunctions quickly and gracefully.

For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We'll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role.

This is a solid template to use in coming clean about your own downtime events. Apologize (in a human, non-boilerplate way), explain what happened, who was affected, and what is being done to prevent this in the future. Well done, Microsoft.
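
Microsoft's recommendation to run at least two instances of each role can be put in rough numbers. The sketch below is my own back-of-the-envelope calculation, not something from their post, and it assumes instances fail independently of each other. That assumption clearly did not hold in this incident, where a single networking issue took out many servers at once, so treat the result as an upper bound.

```python
def combined_availability(per_instance: float, instances: int) -> float:
    """Probability that at least one instance is up, assuming independent failures.

    per_instance: availability of a single instance, between 0.0 and 1.0
    instances:    number of redundant instances (at least 1)
    """
    if not 0.0 <= per_instance <= 1.0:
        raise ValueError("per_instance must be between 0.0 and 1.0")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The service is down only when every instance is down at the same time.
    return 1.0 - (1.0 - per_instance) ** instances
```

Under that assumption, two instances that are each up 99% of the time give roughly 99.99% combined availability, which is why going from one instance to two is the cheapest improvement on offer.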

Tuesday, March 10, 2009

Thursday, March 5, 2009

Google App Engine transparency quick check in

Keep it up Google!

Wednesday, February 25, 2009

Google launches health status dashboard for Google Apps!

Announced here and you can see it here. No time to review it today, but I'll be all over this in the next couple days. Kudos to Google for getting this out!

Update: VERY cool to see an extremely detailed post-mortem from Google on the recent Gmail downtime.

The best marketing advice I've collected over the years

This is completely off topic, but a co-worker recently asked me for some advice on how to market his new online game. I scoured my delicious bookmarks, found the cream of the crop, and came up with the list below.

I wouldn't recommend trying to plow through these links in a matter of minutes. There is some really meaty stuff here, and you won't get much out of it unless you take some time to digest the advice. On the other hand, most of these provide very specific action items, so you should be able to act on the posts right away. Enough talk, enjoy the fruits of my labor!

Best blogging advice, for new and old: (awesome example to follow)

Best advice on making the most of Twitter:

Awesome general purpose marketing advice:

Tuesday, February 24, 2009

I spy with my little eye...Mosso working on a health status dashboard

The transparency that Twitter brings is awesome:

I'm looking forward to seeing how many of the rules of successful health status dashboards they follow!

Gmail goes down, world survives (barely)

As widely reported by the blogosphere, Gmail was down earlier today for anywhere from 2 to 4 hours. Panic did not ensue...except on Twitter. Plenty has been said about the downtime event (and the demise of the cloud thanks to events like this). I want to focus on my favorite question: how transparent was Google, and did they use this opportunity to build longer term trust in their service? Let's read a few select quotes that I found most illuminating:

What's more disturbing than the Gmail outage is Google's lack of transparency about it. The most recent post on Google's official blog declares the problem over, apologizes for the inconvenience, and explains why some users had to prove to Google that they were human beings before being allowed to log in to their Gmail accounts. But it provides no explanation whatever of what went wrong or what had been done to fix it or prevent its recurrence.

Amazon, by contrast, maintains a Service Health Dashboard for its Amazon Web Services with both a report on the current status of each service and a 35-day history of any problems (I can't tell you how good the reports are because the current time frame shows no incidents.) At a minimum, Google should maintain a similar site for the folks who have come to depend on its services.
Google has apologized and says it isn’t yet sure what happened: I’d love to see the company follow up with a post discussing the outage, its cause, and the company’s response. I’m curious, for instance, whether there’s a single explanation for the multiple problems that the service has had in the past few months.
Finally, it may not hurt to have a few links to the Google Blog and Gmail Blog on your Intranet so that they can find out if something catastrophic is happening. One of my users was smart enough to do this and alert the office.
Almost everyone I follow on Twitter seems to use Gmail. At all points during the outage, almost my entire stream was consumed with tweets about Gmail being down. And Twitter Search, perhaps the ultimate search engine for what people are complaining about in real time, not only had the term “Gmail” as a trending topic of discussions within minutes of Gmail failing, but it also saw “IMAP” and “Gfail” rise into the top terms as well.
Conclusion: Not enough transparency. Twitter again is the only means users had to share what was going on. Google's blog post was nice, but not enough to sate most people. I'm hoping Google comes out with a more detailed analysis, if nothing else to show that they are really trying.

Lessons learned: Provide more to your users than a single "We know there's a problem, and we're sorry" type blog post. This is the bare minimum, something the little guys should be doing. A service as prevalent as Gmail must be more transparent. A simple health status dashboard would be a good start. Communicating status updates (at least once an hour) over Twitter would be powerful. Having an obvious place for your users to find status updates would be a start.

To close on a positive note, I think it was put best by Seeking Alpha:
I remember a few years back when my company’s email went down - for days, not hours. It would come back and then go away again as the IT team worked to troubleshoot and fix the problem. The folks working on that IT team weren’t necessarily e-mail experts, though. They were charged with doing everything from upgrading software to configuring network settings. Troubleshooting email was just another job duty.

I still maintain that a cloud-based solution - whether Google’s or anyone else’s - is a more efficient way of running a business. Don’t let one outage - no matter how widespread - tarnish your opinion of a cloud solution. Outages happen both in the cloud and at the local client level. And having been through a days-long outage, I’d say that this restore time was pretty quick.

One final thought: who out there communicates by e-mail alone these days? Speaking for myself, I’m reachable on Twitter, Facebook, SMS text, and Yahoo IM - among other services. Increasingly, e-mail isn’t as business critical as it once was. If you need to communicate with people to get the job done, I’m sure you can think of at least one other way to keep those communications alive beyond just e-mail.

Yes, the outage was bad. But it wasn’t the end of the world.

Wednesday, February 18, 2009

Transparency as a Stimulus

A bit off topic, but I just wanted to share a great article over at Wired about the transparency side benefits that may come along as a result of the stimulus package:
With President Obama's signing of the “American Recovery and Reinvestment Act,” better known as our national Hail Mary stimulus bill, billions will be ladled for infrastructure projects ranging from roads to mass transit to rural broadband.

But the law also contains a measure promoting a less-noted type of economic infrastructure: government data. In the name of transparency, all the Fed’s stimulus-spending data will be posted at a new government site,

That step may be more than a minor victory for the democracy. It could be a stimulus in and of itself.

The reason, open government advocates argue, is that accessible government information—particularly databases released in machine-readable formats, like RSS, XML, and KML—spawn new business and grease the wheels of the economy. "The data is the infrastructure," in the words of Sean Gorman, the CEO of FortiusOne, a company that builds layered maps around open-source geographic information. For every spreadsheet squirreled away on a federal agency server, there are entrepreneurs like Gorman ready to turn a profit by reorganizing, parsing, and displaying it.


The more obvious economic benefits, however, will come from innovations that pop up around freely available data itself. Robinson and three Princeton colleagues argue in a recent Yale Journal of Law and Technology article that the federal government should focus on making as much data available as RSS feeds and XML data dumps, in lieu of spending resources to display the data themselves. “Private actors,” they write, “are better suited to deliver government information to citizens and can constantly create and reshape the tools individuals use to find and leverage public data.”

Check out and to follow this story.

Monday, February 16, 2009

Media Temple goes down, provides a nice case study for downtime transparency

Earlier today we saw Media Temple experience intermittent downtime over the course of an hour. The first tweet showed up around 8am PST noting the downtime. At 9:06am Media Temple provided a short message confirming the outage:

At ~8:30AM Pacific Time we started experiencing networking issues at our El Segundo Data Center. We are working closely with them to determine the cause of these issues and will report any findings as they become available.

At this time we appear to be back fully. The tardiness of this update is a direct result of these networking issues.

So far, not too bad. Though note the broken rule in hosting your status page in the same location as your service. Lesson #1: Host your status page offsite. Let's keep moving with the timeline....

About the same time the blog post went up, a Twitter message by @mt_monitor pointed to the official status update. Great to see that they actually use Twitter to communicate with their users, and judging by the 360 followers, I think this was a smart way to spread the news. On the other hand, this was the only Twitter update from Media Temple throughout the entire incident, which is strange. And it looks like some users were still in the dark for a bit too long. I was also surprised that the @mediatemple feed made no mention of this. Maybe they have a good reason to keep these separate? Looking at the conversation on Twitter, feels like most people by default use the @mediatemple label. Lesson #2: Don't confuse your users by splitting your Twitter identity.

From this point till about 9:40am PST, users were stuck wondering what was going on:

A few select tweets show us what users were thinking. The conversation on Twitter goes on for about 30 pages, or over 450 tweets from users wondering what the heck was going on.

Finally at 9:40am, Media Temple released their findings:

Our engineers have spoken with the engineers at our El Segundo Data Center (EL-IDC3). Here are their findings:

ASN number 47868 was broadcasting invalid BGP data that caused our routers, and a lot of other routers on the internet, to reboot. This invalid BGP data exploited a software bug in our routers. We have applied filters to prevent us from receiving this invalid data.

At this time they are in contact with their vendors to see if there is a firmware update that will address this. You can expect to see network delays and small outages across the internet as other providers try to address this same issue.

Now that everything is back up and users are "happy", what else can we learn from this experience?

  1. Host your status page offsite. (covered above)
  2. Don't confuse your users by splitting your Twitter identity. (covered above)
  3. Some transparency is better than no transparency. The basic status message helped calm people down and reduce support calls.
  4. There was a huge opportunity here for Media Temple to use the tools of social media (e.g. Twitter/Blogging) as a two-way communication channel. Instead, Media Temple used both their blog and Twitter as a broadcast mechanism. I guarantee that if there were just a few more updates throughout the downtime period the tone of the conversation on Twitter would have been much more positive. Moreover, the trust in the service would have been damaged less severely if users were not in the dark for so long.
  5. A health status dashboard would have been very effective in providing information to the public beyond the basic "we are looking into it" status update. Without any work on the part of Media Temple during the event, its users would have been able to better understand the scope of the event, and know instantly whether or not it was still a problem. It would have been extremely powerful when combined with lesson 4, if a representative on Twitter simply pointed users complaining of the downtime to the status page.
  6. The power of Twitter as a mechanism for determining whether a service is down (or whether it is just you), and in spreading news across the world in a matter of minutes, again proves itself.

What every online service can learn from Ma.gnolia's experience

A lot has been said about the problem of trust in the Cloud. Most recently, Ma.gnolia, a social bookmarking service, lost all of its customers' data and is in the process of rebuilding (both the service and the data). The naive take-away from this event is to use this as further evidence that the Cloud cannot be trusted. That we're setting ourselves up for disaster down the road with every SaaS service out there. I see this differently. I see this as a key opportunity for the industry to learn from this experience, and to mature. Both through technology (the obvious stuff) and through transparency (the not-so-obvious stuff). Ma.gnolia must be doing something right if the community has been working diligently and collaboratively in restoring the lost data, while waiting for the service to come back online and to use it again.

What can we learn from Ma.gnolia's experience?

Watching the founder Larry Halff explain the situation provides us with some clear technologically oriented lessons:
  1. Test your backups
  2. Have a good version based backup system in place
  3. Outsource your IT infrastructure as much as possible (e.g. AWS, AppEngine, etc.)
This is where most of the attention has been focused, and I have no doubt Larry is sick of hearing what he should have done to have avoided this from ever happening. Let's assume this will happen again and again with online services, just as it has happened with behind-the-firewall services or local services in times past. Chris Messina (@factoryjoe) and Larry hit the nail on the head in pointing to transparency and trust as the only long term solution to keep your service alive in spite of unexpected downtime issues (skip to the 18:25 mark):

For those that aren't interested in watching 12 minutes of video, here are the main points:
  • Disclose what your infrastructure is and let users decide if they trust it
  • Provide insight into your backup system
  • Create a personal relationship with your users where possible
  • Don't wait for your service to have to go through this experience, learn from events like this
  • Not mentioned, but clearly communicate with your community openly and honestly.
There's no question that this kind of event is a disaster and could very easily mean the end of Ma.gnolia. I'm not arguing that simply blogging about your weekly crashes and yearly data loss is going to save your business. The point is that everything fails, and black swan events will happen. What matters most is not aiming for 100% uptime but aiming for 100% trust between your service and your customers.

Thursday, February 12, 2009

An overview of big downtime events over the past year

A relatively good review of the major downtime events in the recent past, with a solid conclusion at the end:
The bigger Web commerce gets, the bigger the opportunities to mess it up become. Outages and downtimes are inevitable; the trick is minimizing the pain they cause.
As we've seen over the past few months, the simplest way to minimize that pain is by letting your customers know what's going on. Before, during, and after. A little transparency goes a long way.

The transparent business plan by Mark Cuban

Big idea from Mark Cuban:

You must post your business plan here on my blog where I expect other people can and will comment on it. I also expect that other people will steal the idea and use it elsewhere. That is the idea. Call this an open source funding environment.

If its a good idea and worth funding, we want it replicated elsewhere. The idea is not just to help you, but to figure out how to help the economy through hard work and ingenuity. If you come up with the idea and get funding, you have a head start. If you execute better than others, you could possibly make money at it. As you will see from the rules below, these are going to be businesses that are mostly driven by sweat equity.

Read more here. What will you come up with?

Update: Seth Godin's crew gives away 999 potential business ideas. Ideas are a dime a dozen as they say. It's all about the execution.

Wednesday, February 11, 2009

Transparency User Story #1: Your service seems to be down. I'd like to know if it's down for anyone else or if it's just me.

Note: This post is the first in a series of at least a dozen posts where I attempt to drill into the transparency user stories described in an earlier post.

Let's assume that you've decided that you want to make your service or organization more transparent, specifically when it comes to its uptime and performance. You've convinced your management team, you've got the engineering and marketing resources, and you're raring to go. You want to get something done and can't wait to make things happen. Stop right there. Do not pass go, do not collect $200. You first need to figure out what it is you're solving for. What problems (and opportunities) do you want to tackle in your drive for transparency?

Glad you asked. There are about a dozen user stories that I've listed in a previous post that describe the most common problems transparency can solve. In this post, I will dive into the first user story:
As an end user or customer, it looks to me like your service is down. I'd like to know if it's down for everyone or if it's just me.
Very straightforward, and very common. So common there are even a couple simple free web services out there that help people figure this out. Let's assume for this exercise that your site is up the entire time.

Examples of the problem in action
  1. Your customer's Internet connection is down. He loads up It cannot load. He thinks you are down, and calls your support department demanding service.
  2. Your customer's DNS servers are acting up, and are unable to resolve inside his network. He finds that is loading fine, and is sure your site is down. He sends you an irate email.
  3. Your customer's network routes are unstable, causing inconsistent connectivity to your site. He loads fine, while fails. He Twitters all about it.
Why this hurts your business
  1. Unnecessary support calls
  2. Unnecessary support emails.
  3. Negative word of mouth that is completely unfounded.
How to solve this problem
  1. An offsite public health dashboard
  2. A known third party, such as this and this or eventually this
  3. A constant presence across social media (Twitter especially) watching for false reports
  4. Keeping a running blog noting any downtime events, which tells your users that unless something is posted, nothing is wrong. You must be diligent about posting when there is actually something wrong however.
  5. Share your real time performance with your large customers. Your customers may even want to be alerted when you go down.
Example solutions in the real world
  1. Many public health dashboards
  2. The QuickBooks team notifying users that their service was back up
  3. Sharing your monitoring data in real time with your serious customers
  4. Searching Twitter for outage discussion
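
The logic behind those "is it down for everyone or just me" services is simple enough to sketch. The snippet below is only an illustration under my own assumptions: `probe` and `classify` are hypothetical names, and the remote results are assumed to come from checks run at offsite vantage points (a third-party monitor, for example), which this sketch does not implement.

```python
import urllib.request
from typing import List


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers DNS failures, refused connections, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError are both OSError subclasses).
        return False


def classify(local_ok: bool, remote_ok: List[bool]) -> str:
    """Combine the user's own check with checks from other locations."""
    if local_ok:
        return "up for you"
    if not remote_ok:
        return "unknown"  # no outside view, so the user cannot tell
    if all(remote_ok):
        return "just you"  # likely the customer's connection, DNS, or routes
    if not any(remote_ok):
        return "down for everyone"
    return "partial outage"
```

The "unknown" branch is the whole reason an offsite dashboard matters: without a second point of view, a customer with a broken connection has no way to tell your outage from theirs.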

Tuesday, February 10, 2009

Differentiate yourself through honesty

A great post over at "A Smart Bear" focusing on being honest with your users. Some of my favorite recommendations:
  • Admit when you're wrong, quickly and genuinely.
  • As soon as something isn't going to live up to your customer's expectation -- or even your own internal expectations -- tell them. Explain why there's a problem and what you're doing about it.
  • Instead of pretending your new software has no bugs and every feature you could possibly want, actively engage customers in new feature discussions and turn around bug fixes in under 24 hours.
  • Send emails from real people, not from
Honesty is a prerequisite to transparency. Opening up to your customers forces you to be honest. Why not use it as a competitive advantage?

Seth Godin provides his own perspective on the best approach:

Can you succeed financially by acting in an ethical way?

I think the Net has opened both ends of the curve. On one hand, black hat tactics, scams, deceit and misdirection are far easier than ever to imagine and to scale. There are certainly people quietly banking millions of dollars as they lie and cheat their way to traffic and clicks.

On the other hand, there's far bigger growth associated with transparency. When your Facebook profile shows years of real connections and outreach and help for your friends, it's a lot more likely you'll get that great job.

When your customer service policies delight rather than enrage, word of mouth more than pays your costs. When past investors blog about how successful and ethical you were, it's a lot easier to attract new investors.

hammers itself into submission

As reported by the site itself within hours of the incident, was unreachable for about 75 minutes yesterday:
What we had was indeed a DoS, however it was not externally originating. At 8:55 PM EST I received a call saying things were horked, at the same time I had also noticed things were not happy. After fighting with our external management servers to login I finally was able to get in and start looking at traffic. What I saw was a massive amount of traffic going across the core switches; by massive I mean 40 Gbit/sec. After further investigation, I was able to eliminate anything outside our network as the cause, as the incoming ports from Savvis showed very little traffic. So I started poking around on the internal switch ports. While I was doing that I kept having timeouts and problems with the core switches. After looking at the logs on each of the core switches they were complaining about being out of CPU, the error message was actually something to do with multicast. As a precautionary measure I rebooted each core just to make sure it wasn't anything silly. After the cores came back online they instantly went back to 100% fabric CPU usage and started shedding connections again. So slowly I started going through all the switch ports on the cores, trying to isolate where the traffic was originating. The problem was all the cabinet switches were showing 10 Gbit/sec of traffic, making it very hard to isolate. Through the process of elimination I was finally able to isolate the problem down to a pair of switches... After shutting the downlink ports to those switches off, the network recovered and everything came back. I fully believe the switches in that cabinet are still sitting there attempting to send 20Gbit/sec of traffic out trying to do something — I just don't know what yet. Luckily we don't have any machines deployed on [that row in that cabinet] yet so no machines are offline. The network came back up around 10:10 PM EST.
Great to see such a detailed explanation of the incident, walking its users from the initial alert through the investigation and to the final resolution. Even though this is a very techie crowd, I like the narrative nature of the apology, versus a generic "We apologize for the inconvenience and promise never to do it again." The key is to infuse your apology with humanity.

P.S. My favorite part of this incident is comments from the Slashdot crowd. Some of my favorites:
"In Soviet Russia, Slashdot slashdots Slashdot!"

"1. Meme Very Tired. No Longer Wired.
2. 'Soviet Russia' ceased to exist last century.
3. Profit!!!"

"The worst thing about this? 5,000,000 people who think they know what happened, posting 'helpful' suggestions or analysis"

"I think the switch was trying to get first post."

Wednesday, February 4, 2009

Downtime all over the place (Denny's, Vizio, and QuickBooks)

The Super Bowl led to and falling down, while QuickBooks went offline for a number of hours. Denny's and Vizio we can live with, but QuickBooks is a different story. According to the CNet report:

Affected customers of Quickbooks Online were left without access to their financial records. While users of the software version of Quickbooks could access their records, those relying on Intuit for credit-card processing had to do authorizations manually over the phone--a slower, more expensive process.

All online services have outages from time to time, but this one appears to have been lengthy for a line-of-business service, and for many users, unsatisfactorily managed. We received complaints from users that communications from Intuit gave neither a reason for the outages nor an estimate on when the service would return.

Hopefully a lesson learned here. Watching the Twitter traffic at that time, I was impressed to see some signs of life.

Update: Jack in the Box had its fair share of problems as well.

Saturday, January 31, 2009

The underreported half of Google's new Measurement Lab, and how it can help your online business

In describing the aims of the newly launched Measurement Lab, Vint Cerf puts it this way:
The tools will not only allow broadband customers to test their Internet connections, but also allow security and other researchers to work on ways to improve the Internet.
It's clear that the press and blogosphere are going nuts about the latter half, specifically the effect this will have on Net Neutrality and the shady practices of certain telcos. As powerful as this will be in the long run, I want to focus on the other half of the description: the tools to "allow broadband customers to test their Internet connections." I promise this isn't as geeky as it sounds, and it applies directly to helping your online business save time and money.

Imagine one of your customers sitting at home ready to use your service. She opens up her web browser, types in your URL, and presses Enter. The browser starts to load the page, the status bar shows "Connecting to", then "Waiting for". It sits like this for about 15 seconds with a blank page the entire time. She starts to get annoyed. Just to see what happens, she presses refresh and starts the process over. Again, a blank screen, the browser sitting there waiting for your site to begin loading. She then checks that her Internet connection is working by visiting another site, which loads fine. At this point, if you are lucky, she decides to call your support department up or shoot an email over asking whether there's something wrong. If you are unlucky, she asks around on Twitter, or blogs about it, or just gives up with the new thought in the back of her mind that this service is just plain unreliable. Now, imagine that this scenario took place while your site was perfectly healthy, with no actual downtime anywhere.

Your site is up, but your customer thinks your service is down. The problem lies somewhere along the way between the client's browser and your company's firewalls. The Tubes are clogged just for this specific customer, but how is she supposed to know?

There are a few levels to this problem (followed by the solution):

Level 1: The effect this has on your customer(s)
Online users are still more than likely to give you a few chances before they draw a conclusion; however, every incident like this adds to the incorrect negative impression, especially if the problem manifests itself as a performance issue, slowing or interrupting your customers' connections, versus simply keeping them from connecting at all. Your user begins to dread using your service, and looks for alternatives every chance they get.

Level 2: The dollar cost to your business
How many calls do you get to your support department from customers claiming they cannot connect to your site, or that your service is broken, or that it's really slow for them? How often does the problem end up being on their end, or completely unreproducible? It may be a relief for your support people, and it may be something your company is happy with, as it confirms that your site is working just fine. Unfortunately, each of these calls costs you money and time. Worse yet, these types of calls generally take the longest to diagnose, as they are vague and require long periods of debugging to get to the root cause. I haven't even mentioned the lost revenue from the missed traffic (if that affects your revenue).

Level 3: The "perception" cost to your business
As described in Level 1, any perceived downtime is just as real as actual downtime in the eyes of your customers. Word of mouth is powerful, especially with today's social media tools, in spreading negative news, unfounded as it may be. The more you can do to keep the invalid negative perception from forming, the better.

Level 4: The unknown cost
How often does this happen to your customers? No one has any idea. I said earlier you're "lucky" if your customer decides to pick up the phone and call you about the perceived downtime. More often than not, your customer will simply give up. At worst, they give up on your service entirely. How can you capture this type of information, and help your customers at the same time?

The Solution
Provide a tool that your customers and your support department can use to quickly diagnose where the problem lies. The simplest of these would be to offer a public health dashboard. The more powerful route is to offer tools like these:
Network Diagnostic Tool - provides a sophisticated speed and diagnostic test. An NDT test reports more than just the upload and download speeds--it also attempts to determine what, if any, problems limited these speeds, differentiating between computer configuration and network infrastructure problems.

Network Path and Application Diagnosis - diagnoses some of the common problems affecting the last network mile and end-users' systems. These are the most common causes of all performance problems on wide area network paths.
And what do you know? These are two of the tools that have launched on the Measurement Lab!

Clearly these are still very raw, and not for the everyday user. But I see tools like these becoming extremely important for online businesses, both in reducing costs, and in controlling perception. I see this becoming a part of the public health dashboard (which I hope you're hosting separately from your primary site!), allowing users to diagnose problems they are seeing that are not reflected in the Internet at large.
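To make the idea a little more concrete, here is a rough sketch of the kind of first-pass check a customer-facing diagnostic could run. This is not how NDT or NPAD work; it is a toy that only times DNS resolution and the TCP connection, and the function name `diagnose_connection` and the shape of its result are my own assumptions for illustration.

```python
import socket
import time


def diagnose_connection(host, port=80, timeout=5.0):
    """Time DNS lookup and TCP connect to host:port; report the first stage that fails."""
    result = {"dns_seconds": None, "connect_seconds": None, "failed_stage": None}

    # Stage 1: can the customer's machine resolve the hostname at all?
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        result["failed_stage"] = "dns"
        return result
    result["dns_seconds"] = time.monotonic() - start

    # Stage 2: can a TCP connection be opened to the first resolved address?
    family, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError:  # covers refused connections and timeouts
        result["failed_stage"] = "connect"
        return result
    result["connect_seconds"] = time.monotonic() - start
    return result
```

A support rep could ask the customer to run something like `diagnose_connection("www.example.com")` (a placeholder hostname) and read back which stage failed or how long each one took. Even that much narrows down whether the trouble is name resolution, the network path, or something further along.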

I'm going to be watching the development of these tools very closely over the next few months. Most interesting will be noting which other companies support the research, and end up using these tools. Will the focus stay on Net Neutrality and BitTorrent, or will companies realize the potential of these other tools? We'll find out soon enough!

Google claims entire internet "harmful"

Between 6:30 a.m. PST and 7:25 a.m. PST this morning, every search on Google resulted in a message claiming each and every link in the results "may harm your computer". As usual, Twitter was all over it. This likely cost Google a lot of money in lost ad revenue, and led to much undue stress for some poor sap, but what I'm most interested in is how transparently they communicated about this event. I'm happy to report that within 30 minutes of the problem being identified, a resolution was in place, and a couple hours later, Marissa Mayer, VP, Search Products & User Experience (who is only the fourth most powerful person at Google) clearly explained the situation on their company blog:
What happened? Very simply, human error. Google flags search results with the message "This site may harm your computer" if the site is known to install malicious software in the background or otherwise surreptitiously. We do this to protect our users against visiting sites that could harm their computers. We work with a non-profit called StopBadware.org to get our list of URLs. StopBadware carefully researches each consumer complaint to decide fairly whether that URL belongs on the list. Since each case needs to be individually researched, this list is maintained by humans, not algorithms.

We periodically receive updates to that list and received one such update to release on the site this morning. Unfortunately (and here's the human error), the URL of '/' was mistakenly checked in as a value to the file and '/' expands to all URLs. Fortunately, our on-call site reliability team found the problem quickly and reverted the file. Since we push these updates in a staggered and rolling fashion, the errors began appearing between 6:27 a.m. and 6:40 a.m. and began disappearing between 7:10 and 7:25 a.m., so the duration of the problem for any particular user was approximately 40 minutes.

Thanks to our team for their quick work in finding this. And again, our apologies to any of you who were inconvenienced this morning, and to site owners whose pages were incorrectly labelled. We will carefully investigate this incident and put more robust file checks in place to prevent it from happening again.

Thanks for your understanding.
Well handled, and hopefully this does not have negative repercussions for the company long term.
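Google doesn't say what its "more robust file checks" will look like, so the following is only my guess at the simplest useful version: before a list goes out, test every entry against a handful of URLs you know are safe, and refuse to push if anything matches. The one-pattern-per-line format, the substring matching, and the names `KNOWN_GOOD_URLS` and `overly_broad_entries` are all assumptions for the sake of the sketch.

```python
# Hypothetical canary URLs that must never be flagged as harmful.
KNOWN_GOOD_URLS = [
    "http://www.example.com/",
    "http://www.example.org/index.html",
]


def overly_broad_entries(entries, known_good=KNOWN_GOOD_URLS):
    """Return the entries that would flag at least one known-good URL.

    Assumes an entry flags a URL when it appears anywhere inside it.
    """
    flagged = []
    for raw in entries:
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        if any(entry in url for url in known_good):
            flagged.append(entry)
    return flagged


candidate_list = ["badsite.example/malware", "/"]
problems = overly_broad_entries(candidate_list)
if problems:
    print("Refusing to push list; overly broad entries:", problems)
```

With the sample list above, the lone `/` is caught because it appears in every canary URL, while the specific entry passes. A check like this is cheap and catches exactly the class of mistake described in the post, though a real system would need to mirror whatever matching rules the production code actually uses.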

Update: StopBadware clarifies the situation a bit further, placing the blame back in Google's court:

[Update 12:31] Google has posted an update on their official blog that erroneously states that Google gets its list of URLs from us. This is not accurate. Google generates its own list of badware URLs, and no data that we generate is supposed to affect the warnings in Google’s search listings. We are attempting to work with Google to clarify their statement.

[Update 12:41] Google is working on an updated statement. Meanwhile, to clarify some false press reports, it does not appear to be the case that Google has taken down the warnings for legitimately bad sites. We have spot checked a couple known bad sites, and Google is still flagging those sites as bad. i.e., the problem appears to be corrected on their end.

Wednesday, January 28, 2009

More on how Google's support for M-Lab is a big deal for transparency

In support of Google's own words in launching Measurement Lab:
At Google, we care deeply about sustaining the Internet as an open platform for consumer choice and innovation. No matter your views on net neutrality and ISP network management practices, everyone can agree that Internet users deserve to be well-informed about what they're getting when they sign up for broadband, and good data is the bedrock of sound policy. Transparency has always been crucial to the success of the Internet, and, by advancing network research in this area, M-Lab aims to help sustain a healthy, innovative Internet.
...a few select quotes from around the web commenting on the power of transparency:
"For years, ISPs have been notoriously shady about what they're throttling or blocking. The industry needs a healthy dose of transparency. Right now, we're just a bunch of pissed-off users complaining about our Skype calls getting dropped and our YouTube videos sputtering to a halt. But when it comes to placing blame, most of us are in the dark."

"M-Lab aims to bring more transparency to network activity by allowing researchers to deploy Internet measurement tools and share data. The platform launched Wednesday with three Google servers dedicated to the project, and within six months, Google will provide researchers with 36 servers in 12 locations around the globe. All the data collected will be made publicly available."

"A number of ISPs have lately started to clamp down on peer-to-peer networks and are actively restricting heavy usage of 'unlimited' connections. For users, however, there is very little transparency in this process and it can be very hard to figure out if an ISP is actually actively throttling a connection or preventing certain applications from working properly. In reaction to this, Google, together with the New America Foundation's Open Technology Institute and the PlanetLab Consortium announced the Measurement Lab, an open platform for researchers and a set of tools for users that can be used to examine the state of your broadband connection."

This looks like an initiative at least partly created to deal with net neutrality issues, by providing more transparency to users. This seems to be in Google’s political interest; it would be interesting to see the same transparency be provided with issues like Google’s censorship in countries like China or Germany. Say, a Measurement Lab tool that registers which domains are censored in, collecting those into a public database for research purposes.

Google working to make the Internet more transparent

A big (if you're a geek) announcement from Google today:
When an Internet application doesn't work as expected or your connection seems flaky, how can you tell whether there is a problem caused by your broadband ISP, the application, your PC, or something else? It can be difficult for experts, let alone average Internet users, to address this sort of question today.

Last year we asked a small group of academics about ways to advance network research and provide users with tools to test their broadband connections. Today Google, the New America Foundation's Open Technology Institute, the PlanetLab Consortium, and academic researchers are taking the wraps off of Measurement Lab (M-Lab), an open platform that researchers can use to deploy Internet measurement tools.
Basically, Google is working to help the general public diagnose the hidden problems that creep up with the Internet network. I plan to dive into this a lot further, but for now here are some of the tools that have already been made public as a result of this effort:

Update: I'm even more impressed with how much attention this is getting in the blogosphere. Most people are focusing on the BitTorrent aspects of this, but still, a lot of press for the transparency movement:

Tuesday, January 27, 2009

Seth Godin on transparency

One of my favorite bloggers/writers/speakers/personalities/gurus Seth Godin recently had some thoughts on the power of transparency:

Can you succeed financially by acting in an ethical way?

I think the Net has opened both ends of the curve. On one hand, black hat tactics, scams, deceit and misdirection are far easier than ever to imagine and to scale. There are certainly people quietly banking millions of dollars as they lie and cheat their way to traffic and clicks.

On the other hand, there's far bigger growth associated with transparency. When your Facebook profile shows years of real connections and outreach and help for your friends, it's a lot more likely you'll get that great job.

When your customer service policies delight rather than enrage, word of mouth more than pays your costs. When past investors blog about how successful and ethical you were, it's a lot easier to attract new investors.

The Net enlarges the public sphere and shrinks the private one. And black hats require the private sphere to exist and thrive. More light = more success for the ethical players.

In a competitive world, then, one with increasing light, the way to win is not to shave more corners or hide more behavior, because you're going against the grain, fighting the tide of increasing light. In fact, the opposite is true. Individuals and organizations that can compete on generosity and fairness repeatedly defeat those that only do it grudgingly.

FeedBurner woes, and how a little transparency can go a long way

As noted by TechCrunch, FeedBurner is having some tough times:
Complaints about Feedburner, a service that helps websites manage their RSS feeds, have been around as long as the company itself. But you’d think that when Google spent $100 million to buy the company, they’d get it together.

But things haven’t gotten better. Instead, the service is becoming unreliable. Feedburner problems plague website owners far more than they should. And while Google is notoriously slow in absorbing its acquisitions, it’s far past time for them to get their act together and turn Feedburner into a grown up service.

Michael Arrington then points to the lack of transparency, which went from bad to worse:

The main Feedburner blog was shut down in December 2008 and everyone was told to head over to an advertising-focused blog for Feedburner news. I think it’s great that Google wants to do a better job of inserting ads into feeds to make money for publishers. But they have to focus on the quality of the service, too, or the ecosystem won’t work. The message they’re sending to everyone is that the service doesn’t deserve a blog, just the advertising they bolt onto it. Imagine if they did the same thing with search.

Feedburner also has a known issues page that shows what’s currently wrong with the service. It’s clear from that page that the team is having a lot of problems just keeping the lights on. The fact that this most recent issue, broken stats, isn’t reported there yet even though its days old is another red flag.


If Google wants to continue to manage our feeds, we need assurances from them that they want our business. Right now, I don’t believe they do. The people working on Feedburner clearly care about the product and their customers, but they either don’t have enough people or enough resources to take care of business.

Visit the known issues page today and you'll be directed to the FeedBurner Status Blog. As minor as this new blog is, I see it as a big step forward. There have been a couple of updates in just the past few days, which I hope is a sign that they are trying to up the transparency game and get to building that trust relationship they are so sorely missing.

Update: I just found a more detailed explanation from FeedBurner of what their plans are, and their renewed focus on creating a reliable and transparent service:
As many of you know, since becoming a part of Google in June of 2007, the FeedBurner team has been hard at work transforming FeedBurner into a service that uses the same underlying architecture as many other Google applications, running in the same high-volume datacenters. As a team, we chose this path for one reason: our highest priority is making sure your feed is served as fast as possible after you update your content, and is as close as technically possible to being available 100% of the time.


To help communicate these issues and resolutions much more effectively, we have created a new blog and feed that you can subscribe to during this transition period. We plan to keep these around as long as necessary. We may also add features to the site that allow you to report your own feed issue details.

The extended team — including both original team members of FeedBurner, newer team members that joined us since we've been at Google, and the rest of Google — is excited about our future on this new integrated-with-Google platform that all publishers will be on at the conclusion of this account transfer process. We are excited because we see the potential for scale and innovation on this platform that will make for a true next generation feed management solution. Most of all, however, we are excited about getting publishers excited for these possibilities as we reveal what we have in store.
Kudos to the FeedBurner team for recognizing the opportunity here and acting on it. Only time will tell how successful this effort will be, but judging by the comments, I'm optimistic.

Most insane dashboard...ever

I don't know if words can describe this dashboard created by Sprint to promote their mobile Internet card. All I can say is that it's awesome in its uselessness. Which I think is the point. Which should be the exact opposite of your goals in creating a public health dashboard for your own customers.

Tuesday, January 20, 2009

Change has come to America: Being Open

A good post by Mitch Joel about being "Open":
Today was a very special day. It was more than simply watching history in the making. Change became more than just a saying on a button. In many ways we are not just ushering in a new President of the United States of America, but we are ushering in a new way of thinking and of doing business. It's time for everything to be more open.


Facebook became more open.

While Facebook is, without question, the leading online social network, the general flow of it is very linear. Unlike the quick-bite snippets you can grab in a Twitter feed, the Facebook status updates seemed pretty pale in comparison. However, their deal with CNN to broadcast the big day by having the status updates run alongside the live streaming video from Washington was a game-changer. The ability to see what your "friends" were saying and being able to switch to see what everyone else was saying enabled us all to get beyond the fishbowl. It was an amazing blend of traditional mass media reporting and everyone's individual point-of-view collected in one location. Opinions, emotions and even contrary perspectives were public, available and accessible. Plus, if you had something more to add (relevant, idiotic or different), all opinions were equal.

CNN became more open.

Along with allowing all Facebook commentary to run alongside of their broadcasting, they even demonstrated Microsoft's amazing photosynth technology (if you have never seen photosynth in action, you can see it here: TED - Blaise Aguera y Arcas - Jaw-dropping Photosynth demo). People who attended the inauguration were encouraged to send in their photos. The 11,000-plus pictures were dumped into photosynth to give an entirely new photographic representation of this special moment in time. CNN did not stop there. Throughout the day, there were constant references to not only the online channel and conversation that was taking place online, but also the many ways in which the public could share this moment with the world.

The White House website became more open.

The big news on both Twitter and Facebook was that as President Obama was being sworn in, the White House website had already been updated. Even more interesting is how prominent The White House Blog is on the website (granted, there's not that much there just yet). Companies still grapple with whether or not they can handle having a Blog and being that "open." The answer is: if the White House is trying it, why can't you?

Open is good for business.

Maybe your business is still struggling to understand these many new channels. Many businesses still have a more traditional work ethic. If I saw one thing today, it was that all of these more traditional institutions either tried to open up just a little bit more or partnered with someone who would help them open up. Guess what? It worked. People liked how CNN flowed. They were thrilled to see a new White House website. It was memorable to share this moment with your Facebook friends from around the world. It was nothing complex. In fact, it was pretty simple.

If we really want change to happen, opening up just a little bit may well be one of the better ways to see what happens. How will things change? Well, there's word that the President will keep his BlackBerry to stay more "connected to the people." One might argue that this also makes him more accessible... more open. What if you opened up a little bit more? What would happen to your marketing? What would happen to your business?

Obama takes office...and Facebook/CNN flourish

Some stats from ReadWriteWeb:

In the end, not only did Facebook Connect provide an interactive look into the thoughts and feelings of all those watching CNN's coverage via the web - it did so without crashing. According to the statistics, there were 200,000+ status updates, which equaled out to 3,000 people commenting on the Facebook/CNN feed per minute. Right before Obama spoke, that number grew to 8500. Additionally, Obama's Facebook Fan Page has more than 4 million fans and more than 500,000 wall posts. (We wonder if anyone on his staff will ever read all those!).

CNN didn't do too badly either. They broke their total daily streaming record, set earlier on Election Day, and delivered 5.3 million streams. Did you have trouble catching a stream? We didn't hear of any issues, but if you missed out, you can watch it again later today.
The blogosphere, myself included, often points only at the problems and ignores the times when everything works as expected. This looks to be very much the latter. Kudos to Facebook and CNN for putting together such a powerful service, on such a powerful day, without issue.

Update: Spoke too soon :(

Thursday, January 15, 2009

GoDaddy down intermittently today. What lessons can we take away?

Though not widely reported, it appears that GoDaddy saw some intermittent downtime today:
A distributed denial-of-service attack turned dark at least several thousand Web sites hosted by GoDaddy Wednesday morning. The outage was intermittent over several hours, according to Nick Fuller, GoDaddy communications manager.

What caught my eye was some insight on how GoDaddy handled the communication during the event:

To add to the consternation of Web site owners, GoDaddy's voice mail system pointed to its support page for more information about the outage and when it would be corrected. No such information was posted there.

Luckily this didn't blow up into anything major for GoDaddy, but I'd like to offer up a few suggestions:
  1. If you're pointing your customers to the default support page, make sure to have some kind of call-out link referencing this event. Otherwise customers will be searching through your support forums, getting more frustrated, and end up tying up your support lines (or Twitter'ing their hearts out).
  2. Offer your customers an easy to find public health dashboard (e.g. a link off of the support page). There are numerous benefits that come along with such an offering, but this specific situation would be a perfect use case for one.
  3. Provide a few details on the problem in both the voice mail message, and in whichever online forum you choose to communicate (e.g. health dashboard, blog, twitter, forums, etc.). At the minimum, provide an estimated time to recovery and some details on the scope of the problem.
A little bit of transparency can go a long way. I would venture to say that if any of the above advice were implemented in the future, it would pay off substantially in both customer reaction and long-term benefits.

Update: A bit of insight provided by GoDaddy’s Communications Manager Nick Fuller.

Tuesday, January 13, 2009

The bulls**t of outage language

As this blog is often dedicated to pointing out downtime events, and offering advice on how to best communicate before/during/after the (inevitable) event, I thought this post by 37signals could come in handy next time you have to write an apology email to your customers.

Service operators generally suck at saying they’re sorry. I should know, I’ve had to do it plenty of times and it’s always hard. There’s really never a great way to say it, but there sure are plenty of terrible ways.

One of the worst stock dummies that even I have resorted to in a moment of weakness is this terrible non-apology: “We apologize for any inconvenience this may have caused”. Oh please. Let’s break down why it’s bad...

I'll let you read the advice yourself, but I will point out a few of the visitor comments that speak to the message I've been harping on over the past few months:
Josh Catone:
Serious question: What WOULD be a better way to communicate with customers after downtime in your opinion? You didn’t offer and alternatives. I know you said stock responses should never be used… but I’d love to see some examples of what you think works..

Dan Gebhardt:
I’d recommend using a website monitoring service (we use Pingdom [editor's note: *cough* Webmetrics *cough*]) to provide public accountability for your uptime. This not only proves that uptime is as important to you as it is to your customers, it can also help customers see any particular outage in the context of your overall service record.

Mark Weiss:
While a personal well thought out apology is nice. As a user I want to know when things are going to be working again. I want to know if I should go for a quick walk in the park or if I have time for some food, drinks, and then possibly a nap.

Just keep me informed so I know how to manage my time.

I think Flickr holds top honors for the best down time strategy and message.

Itinerant Networker:
Empathy’s not enough. Service providers should reveal details about why an outage happened, what they’re doing to make it not happen again, and should clearly communicate with customers (frequently) on the ETA of the outage. The most frustrating thing I hear is “we don’t have an ETR [estimated time to recovery]”. That is not acceptable in a service business – give me an ETR and then an estimate of how reliable the ETR is. This goes for even the lowest cable modem user calling $provider – the tier 1 guys should have at least some clue.

The bottom line is that what matters most is not that you never go down, but how you deal with that downtime. All your customers need is some form of honest communication during the event, some transparency into the severity of the problem, and a human explanation of what went wrong afterward. It really isn't very hard.
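Dan Gebhardt's point about putting an outage in the context of your overall record is easy to show with a little arithmetic. The numbers here are made up for illustration (a single 75-minute outage in a 30-day month), and `uptime_percent` is just a helper name I picked.

```python
def uptime_percent(outage_minutes, days_in_period=30):
    """Percentage of the period during which the service was up."""
    total_minutes = days_in_period * 24 * 60
    return 100.0 * (total_minutes - outage_minutes) / total_minutes


# One hypothetical 75-minute outage in a 30-day month.
print(f"{uptime_percent(75):.3f}%")  # prints 99.826%
```

An hour and a quarter of downtime feels enormous while it is happening, but shown next to a 99.8% month it reads very differently. That is the kind of context a public uptime report gives your customers.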