Wednesday, March 18, 2009

Microsoft showing us how it's done, coming clean about Azure downtime

Following up on yesterday's Windows Azure downtime event, Microsoft posted an excellent explanation of what happened:

The Windows Azure Malfunction This Weekend

First things first: we're sorry. As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime. Windows Azure storage was unaffected.

In the rest of this post, I'd like to explain what went wrong, who was affected, and what corrections we're making.

What Happened?

During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail.

Once these servers failed, our monitoring system alerted the team. At the same time, the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time. Because this serial process was taking much too long, we decided to pursue a parallel update process, which successfully restored all applications.

What Was Affected?

Any application running only a single instance went down when its server went down. Very few applications running multiple instances went down, although some were degraded due to one instance being down.

In addition, the ability to perform management tasks from the web portal appeared unavailable for many applications due to the Fabric Controller being backed up with work during the serialized recovery process.

How Will We Prevent This in the Future?

We have learned a lot from this experience. We are addressing the network issues and we will be refining and tuning our recovery algorithm to ensure that it can handle malfunctions quickly and gracefully.

For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We'll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role.
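
One practical aside from me (this part is not from Microsoft's post): in the Azure CTP, running two instances of a role is a one-line change in your service configuration file. Here's a sketch from memory of what the relevant ServiceConfiguration.cscfg section looks like; the service name and role name are placeholders, and the schema namespace may differ slightly in the current CTP:

    <?xml version="1.0"?>
    <ServiceConfiguration serviceName="MyService"
        xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="WebRole">
        <!-- Two instances: one can be upgraded or fail while the other keeps serving. -->
        <Instances count="2" />
      </Role>
    </ServiceConfiguration>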

This is a solid template to use in coming clean about your own downtime events. Apologize (in a human, non-boilerplate way), explain what happened, who was affected, and what is being done to prevent it in the future. Well done, Microsoft.
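
The serial-versus-parallel recovery tradeoff Microsoft describes is worth generalizing. Here's a minimal sketch of the idea in Python, with a made-up recover_app() standing in for whatever your platform's per-application recovery step is; this illustrates the concept only, not the Fabric Controller's actual algorithm:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def recover_app(app):
        # Placeholder for a real per-application recovery step,
        # e.g. redeploying the app to a healthy server.
        time.sleep(1)
        return app

    apps = ["app-%02d" % i for i in range(20)]

    # Cautious serial recovery: simple and safe, but total time grows
    # linearly with the number of failed applications (about 20 seconds here).
    for app in apps:
        recover_app(app)

    # Bounded parallel recovery: a small worker pool still caps how much
    # churn happens at once, but total time drops to roughly
    # len(apps) / max_workers (about 4 seconds here).
    with ThreadPoolExecutor(max_workers=5) as pool:
        list(pool.map(recover_app, apps))

The bounded pool is the key design point: capping concurrency preserves the "very cautious" property Microsoft wanted, without being stuck recovering strictly one application at a time.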

Thursday, March 5, 2009

Google App Engine transparency quick check-in


Keep it up Google!

Wednesday, February 25, 2009

Google launches health status dashboard for Google Apps!


Announced here and you can see it here. No time to review it today, but I'll be all over this in the next couple of days. Kudos to Google for getting this out!

Update: VERY cool to see an extremely detailed post-mortem from Google on the recent Gmail downtime.

The best marketing advice I've collected over the years

This is completely off topic, but a co-worker recently asked me for some advice on how to market his new online game. I scoured my Delicious bookmarks, found the cream of the crop, and came up with the list below.

I wouldn't recommend trying to plow through these links in a matter of minutes. There is some really meaty stuff here, and you won't get much out of it unless you take some time to digest the advice. On the other hand, most of these provide very specific action items, so you should be able to act on the posts right away. Enough talk, enjoy the fruits of my labor!

Best blogging advice, for bloggers new and old:
http://www.chrisbrogan.com/if-i-started-today/
http://www.chrisbrogan.com/27-blogging-secrets-to-power-your-community/
http://sethgodin.typepad.com/seths_blog/2008/11/the-number-one.html
http://www.conversationagent.com/2008/11/why-start-a-blog-and-25-tips-to-make-it-work.html
http://www.copyblogger.com/how-to-write-a-story/
http://www.copyblogger.com/hot-content/
http://www.chrisbrogan.com/the-subtle-art-of-linkbaiting/
http://www.chrisbrogan.com/my-best-advice-about-blogging/
http://www.copyblogger.com/10-sure-fire-headline-formulas-that-work/
http://www.balsamiq.com/blog/ (awesome example to follow)

Best advice on making the most of Twitter:
http://www.chrisbrogan.com/50-ideas-on-using-twitter-for-business/
http://blog.guykawasaki.com/2008/11/looking-for-m-1.html
http://blog.guykawasaki.com/2008/12/how-to-use-twit.html
http://www.ozonesem.com/social-media-marketing/how-to-get-retweeted.html
http://www.copyblogger.com/grow-business-twitter/
http://www.financialaidpodcast.com/2008/12/24/the-twitter-power-guide-ebook/

Awesome general purpose marketing advice:
http://sethgodin.typepad.com/seths_blog/2008/05/avoiding-the-pa.html
http://www.micropersuasion.com/2009/02/leo-babauta-on-the-tao-of-marketing.html
http://www.chrisbrogan.com/50-ways-marketers-can-use-social-media-to-improve-their-marketing/
http://www.copyblogger.com/word-of-mouth-marketing/
http://www.salesforce.com/community/crm-best-practices/marketing-professionals/market-feedback/2008-mktg-exec-social-media-mktg.jsp
http://news.ycombinator.com/item?id=104627

Tuesday, February 24, 2009

I spy with my little eye...Mosso working on a health status dashboard

The transparency that Twitter brings is awesome:


I'm looking forward to seeing how many of the rules of successful health status dashboards they follow!

Gmail goes down, world survives (barely)

As widely reported by the blogosphere, Gmail was down earlier today for anywhere from 2 to 4 hours. Panic did not ensue...except on Twitter. Plenty has been said about the downtime event (and the demise of the cloud thanks to events like this). I want to focus on my favorite topic...how transparent was Google, and did they use this opportunity to build longer term trust in their service? Let's read a few select quotes that I found most illuminating:

BusinessWeek:
What's more disturbing than the Gmail outage is Google's lack of transparency about it. The most recent post on Google's official blog declares the problem over, apologizes for the inconvenience, and explains why some users had to prove to Google that they were human beings before being allowed to log in to their Gmail accounts. But it provides no explanation whatever of what went wrong or what had been done to fix it or prevent its recurrence.

Amazon, by contrast, maintains a Service Health Dashboard for its Amazon Web Services with both a report on the current status of each service and a 35-day history of any problems (I can't tell you how good the reports are because the current time frame shows no incidents.) At a minimum, Google should maintain a similar site for the folks who have come to depend on its services.
Technologizer:
Google has apologized and says it isn’t yet sure what happened: I’d love to see the company follow up with a post discussing the outage, its cause, and the company’s response. I’m curious, for instance, whether there’s a single explanation for the multiple problems that the service has had in the past few months.
ComputerWorld:
Finally, it may not hurt to have a few links to the Google Blog and Gmail Blog on your Intranet so that they can find out if something catastrophic is happening. One of my users was smart enough to do this and alert the office.
VentureBeat:
Almost everyone I follow on Twitter seems to use Gmail. At all points during the outage, almost my entire stream was consumed with tweets about Gmail being down. And Twitter Search, perhaps the ultimate search engine for what people are complaining about in real time, not only had the term “Gmail” as a trending topic of discussions within minutes of Gmail failing, but it also saw “IMAP” and “Gfail” rise into the top terms as well.
Conclusion: Not enough transparency. Twitter again was the only means users had to share what was going on. Google's blog post was nice, but not enough to sate most people. I'm hoping Google comes out with a more detailed analysis, if nothing else to show that they are really trying.

Lessons learned: Provide more to your users than a single "We know there's a problem, and we're sorry" blog post. That is the bare minimum, something the little guys should be doing. A service as prevalent as Gmail must be more transparent. A simple health status dashboard would be a good start, communicating status updates (at least once an hour) over Twitter would be powerful, and above all, your users need one obvious place to find those updates.
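
The hourly-updates-over-Twitter idea is also cheap to automate. A minimal sketch against the basic-auth statuses/update endpoint Twitter exposes today; the account name, password, and status text are placeholders, and if you're reading this later Twitter will likely require OAuth instead:

    import urllib.parse
    import urllib.request

    # Placeholders: use a dedicated status account, not your main one.
    USER, PASSWORD = "yourstatusaccount", "secret"
    API = "http://twitter.com/statuses/update.xml"

    def post_status(text):
        # Posts one status update; call this from your monitoring
        # system's alert hook, or from an hourly cron job during an event.
        password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, API, USER, PASSWORD)
        opener = urllib.request.build_opener(
            urllib.request.HTTPBasicAuthHandler(password_mgr))
        data = urllib.parse.urlencode({"status": text[:140]}).encode()  # 140-char limit
        return opener.open(API, data)

    post_status("Investigating elevated error rates; next update within the hour.")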

To close on a positive note, I think it was put best by Seeking Alpha:
I remember a few years back when my company’s email went down - for days, not hours. It would come back and then go away again as the IT team worked to troubleshoot and fix the problem. The folks working on that IT team weren’t necessarily e-mail experts, though. They were charged with doing everything from upgrading software to configuring network settings. Troubleshooting email was just another job duty.

I still maintain that a cloud-based solution - whether Google’s or anyone else’s - is a more efficient way of running a business. Don’t let one outage - no matter how widespread - tarnish your opinion of a cloud solution. Outages happen both in the cloud and at the local client level. And having been through a days-long outage, I’d say that this restore time was pretty quick.

One final thought: who out there communicates by e-mail alone these days? Speaking for myself, I’m reachable on Twitter, Facebook, SMS text, and Yahoo IM - among other services. Increasingly, e-mail isn’t as business critical as it once was. If you need to communicate with people to get the job done, I’m sure you can think of at least one other way to keep those communications alive beyond just e-mail.

Yes, the outage was bad. But it wasn’t the end of the world.