Monday, February 16, 2009

What every online service can learn from Ma.gnolia's experience

A lot has been said about the problem of trust in the Cloud. Most recently, Ma.gnolia, a social bookmarking service, lost all of its customers' data and is in the process of rebuilding both the service and the data. The naive take-away is to treat this as further evidence that the Cloud cannot be trusted, and that we're setting ourselves up for disaster down the road with every SaaS offering out there. I see this differently. I see this as a key opportunity for the industry to learn from this experience and to mature, both through technology (the obvious stuff) and through transparency (the not-so-obvious stuff). Ma.gnolia must be doing something right if the community has been working diligently and collaboratively to restore the lost data while waiting for the service to come back online so they can use it again.

What can we learn from Ma.gnolia's experience?

Watching the founder Larry Halff explain the situation provides us with some clear technologically oriented lessons:
  1. Test your backups
  2. Have a good version-based backup system in place
  3. Outsource your IT infrastructure as much as possible (e.g. AWS, AppEngine, etc.)
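To make the first lesson concrete, here is a minimal sketch of what "testing a backup" could look like. It is not how Ma.gnolia did (or should have done) things; the function names and file layout are hypothetical, and the "restore" is just a file copy standing in for whatever real restore procedure your backup tool provides. The idea is simply to restore into a scratch location and confirm the result matches the live data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_restores_cleanly(original: Path, backup: Path) -> bool:
    """Restore the backup into a scratch directory and compare checksums.

    Assumption: copying the backup file is a stand-in for a real restore
    step (for example, loading a database dump into a test instance).
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / original.name
        shutil.copy2(backup, restored)
        return sha256_of(restored) == sha256_of(original)
```

Run on a schedule, a check like this turns "we have backups" into "we have backups that we know can be restored", which is the distinction that matters.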
This is where most of the attention has been focused, and I have no doubt Larry is sick of hearing what he should have done to prevent this from ever happening. Let's assume this will happen again and again with online services, just as it has happened with behind-the-firewall services or local services in times past. Chris Messina (@factoryjoe) and Larry hit the nail on the head in pointing to transparency and trust as the only long-term solution for keeping your service alive in spite of unexpected downtime (skip to the 18:25 mark):



For those who aren't interested in watching 12 minutes of video, here are the main points:
  • Disclose what your infrastructure is and let users decide if they trust it
  • Provide insight into your backup system
  • Create a personal relationship with your users where possible
  • Don't wait for your service to go through this experience; learn from events like this
  • Not mentioned explicitly, but clearly important: communicate with your community openly and honestly
There's no question that this kind of event is a disaster and could very easily mean the end of Ma.gnolia. I'm not arguing that simply blogging about your weekly crashes and yearly data loss is going to save your business. The point is that everything fails, and black swan events will happen. What matters most is not aiming for 100% uptime but aiming for 100% trust between your service and your customers.
