Transparent Uptime: 08/01/2008

Saturday, August 23, 2008

What if the cloud disappeared tomorrow? Thoughts on an "Online Users Bill of Rights"

NPR did a story on the (often unexpected) risks involved in storing your data in the cloud. What would you do if Gmail, Flickr, or Yahoo decided they no longer cared to store your massive amount of free data and ran a large "rm -rf"? Sure, they'd get some pretty bad PR, but if you look at their EULAs, I'm betting they have the right to do this. Can we ever trust that our data is really safe in the cloud?

What's needed here is an "Online Users Bill of Rights". This would define specific standards that protect users and give them insight into decisions currently made behind closed doors. Here's a start:

1. Files, documents, or anything else that the user has created and saved online cannot be removed or made inaccessible without 30 days' advance notice.

2. The service must be accessible 95% of the time each month. Specifically, users must be able to access, retrieve, and delete their existing data, with availability of at least 95% in each month-long period. It is also highly encouraged to make public a tighter uptime commitment, including the consequences of not meeting that commitment.

3. During downtime events, the service must make a best effort to provide status updates, estimates as to when service will be restored, and an explanation of what led to the downtime after the event. It is also highly encouraged to make known a central location to distribute this information.

4. The service will provide a performance SLA describing the average page load time they expect to see, and the consequences of not meeting that average in any given month. This is especially important for APIs and services like AWS.

5. The service must give at least 30 days' notice prior to making any "major" changes in the functionality or level of service provided up to that point (including API interfaces). It is also highly encouraged to involve the users in the decision-making process prior to these changes.
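To make rule 2 concrete, here is a minimal Python sketch of how much downtime a 95% monthly commitment actually leaves room for. The function names are made up for illustration, and the 30-day month is an assumption; nothing here comes from a real provider's SLA.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, days_in_month: int = 30) -> timedelta:
    """Maximum total downtime in one month that still meets the availability target.

    days_in_month defaults to 30 purely for illustration.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    month = timedelta(days=days_in_month)
    # The share of the month NOT covered by the commitment is the downtime budget.
    return month * (100 - availability_pct) / 100


def meets_commitment(downtime: timedelta, availability_pct: float,
                     days_in_month: int = 30) -> bool:
    """True if the observed downtime fits inside the month's budget."""
    return downtime <= downtime_budget(availability_pct, days_in_month)


if __name__ == "__main__":
    budget = downtime_budget(95)
    print(budget.total_seconds() / 3600)  # 36.0 hours in a 30-day month
    print(meets_commitment(timedelta(hours=6), 95))  # True
```

In other words, a 95% floor tolerates a day and a half of downtime every month, which is why the rule also encourages services to publish a tighter commitment.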

This Bill of Rights would need to be signed off on by any online service that stores data for users (Google, Yahoo, Flickr) or provides an online service that other businesses rely on (Amazon AWS, Salesforce, API providers). I'd like to see the day when users simply do not trust online services that aren't willing to sign off on this.

The above is just a first draft, and I'd love to get some input on this. I would purposefully keep the list somewhat open to interpretation, staying away from legalese, and focusing on the spirit of the idea of transparency and user rights (similar to the concept of a B Corporation).

What do you think?

Friday, August 22, 2008

Microsoft celebrates its downtime

After experiencing downtime in its launch of Photosynth this past week, Microsoft admits its projections were a little off:

"We have been absolutely overwhelmed by demand, and have turned Photosynth.com into a special static/read-only mode for the moment. The team is hard at work adding capacity and getting the full site back online. We've been under incredible demand since we released just over 12 hours ago. With everyone waking up around the world traffic has been on a steady ramp up since that release and has far exceeded even our most optimistic expectations.
Getting ready for the launch we did massive amounts of performance testing, built capacity model after capacity model, and yet with all of that, you threw so much traffic our way that we need to add more capacity. We are adding that extra horsepower right now and should be back up shortly.
Thank you for the incredible reception!"

Nice to see some visibility into their thinking leading up to launch, and the preparation they (unsuccessfully) went through. The next best thing to real-time downtime status is a well-formed explanation after launch (assuming the downtime is not prolonged), and something this personal coming out of Microsoft is a good sign.

Monday, August 18, 2008

Apple makes up for their downtime with 60 days of free service

From http://support.apple.com/kb/HT2826:

Why is Apple granting a 60-day subscription extension?
The transition from .Mac to MobileMe was rockier than we had hoped. While we are making a lot of improvements, the MobileMe service is still not up to our standards. We are extending subscriptions 60-days free of charge to express appreciation for our members’ patience as we continue to improve the service.
Am I eligible for the 60-day extension?
You are eligible if you are a MobileMe member whose account was active as of August 19, 2008 at 0:00 Pacific Daylight Time.

That's one way to deal with downtime!

Saturday, August 16, 2008

Do You Trust the Cloud?

Quoting a Lifehacker post with the same name:

"While web-based applications promise gigabytes of storage, anywhere-access, easy backup, and no software requirements beyond your browser to use them, becoming dependent on webapps can leave you high and dry when those services go out."

The post references the recent downtime of Gmail, MobileMe, and Amazon S3, which got me thinking: what does it take to actually "trust the cloud"? What would give users confidence in choosing these services, and sticking with them through the inevitable issues? The simple answer is transparency!

How good are these specific services at providing transparency into their downtime? Let's review:

Gmail downtime (2 hours) on 8/11/08
The Gmail team kept users updated during the downtime using a Google Group thread, with surprisingly frequent updates and details (over the 2 hour downtime period). After the event was over and they were able to get their thoughts in order, they posted a message on their Gmail blog. A big red flag, however, shows up in the spike in Twitter posts and the huge spike in searches for "gmail down". We can tell that users are unsure where to go to get the official word on what's going on, which means that all of the work the Gmail team is putting into keeping users up to date falls on deaf ears. If you post an update and no one sees it, does your update exist?
Conclusion: Very good transparency, but needs some work on making known the forum they are using to spread information. Kudos for the rarity of downtime this service has experienced in its history (too easy to overlook).

MobileMe downtime (2 hours) on 8/11/08
With their handy dandy System Status Recent History page, the MobileMe team documented the downtime. However, the only way users were able to know anything was wrong DURING the event was a big fail when attempting to use the service. As CNET reports, "the same thing happened in mid July with enough blowback to cause Apple to offer a 30-day extension to both free trial and paying users." In a valiant yet fruitless effort to keep users in the know, Apple created the MobileMe Status blog, which as of now still has no news of the recent downtime. On the plus side, they have created a MobileMe Mail Chat page for users to get personal support when issues arise. On the downside, according to one comment, "even the support guy didn't know that the service outage was going on".
Conclusion: Unacceptable job keeping users in the loop, passable job documenting the events after the fact, and far too many random glitches to make this OK. Let's hope they get their act together soon and open up about their issues (won't be holding my breath...this is Apple after all).

Amazon S3 downtime (6 hours) on 7/20/08
S3 is by far the most critical of these online services, which means it should be held to a higher standard. During the event, the AWS team posted outage messages and their Service Health Dashboard clearly showed they were having issues. After the event, a detailed explanation went up on their site.
Conclusion: There's a reason I have Amazon AWS in the "Transparency Hall of Fame" (top right of this blog). They've been at this a while, and their users have forced them to make this process as transparent as possible. They could get better at giving specific details during the actual event, and 6 hours of downtime is no laughing matter, but they did a good job and they continue to set the bar for transparency in online services.

Yesterday CNET posted the "10 Worst Web glitches of 2008 (so far)", which includes the above events, among others. What does this tell us? Clearly downtime across the broad spectrum of online services, from Amazon to Netflix to Google, is not going away. We need to learn to live with unreliable online services. The long-term success of these services will be determined by how users perceive their reliability, contrasted with the advantages of building in the cloud. That perception of reliability requires complete and utter transparency in the goings-on of that service, especially during downtime events. We still have a long way to go until we can really "trust the cloud".

Thursday, August 14, 2008

Twitter lacking transparency in its own functionality

Slightly off topic, but still relevant to the concept of transparency in the online world: Twitter recently changed its limits on the number of people any one user can follow (to help curtail spam).

The problem, as described by WebWorkerDaily:

"Though the blog post says there’s no magic number, quite a few Twitterers - including some heavy participants - report hitting the limit at 2000. Some have been trying to get the attention of Twitter management to discuss this for days, with little or no result. It’s reminiscent of Twitter’s attitude towards making money from the service, which amounts to “we have a plan, but we won’t tell you,” or to fixing issues, which seems to be “we’re working on it, leave us alone.” In a world of Web 2.0 openness, Twitter seems to be carrying traditional business values of secrecy a bit further than most."

It's all too easy to keep your users in the dark. It takes real vision to open up and keep your users in the know.

Tuesday, August 12, 2008

First post!

The notorious first post. Who ever actually reads the very first post on a blog anyway? Someone's got to, I guess. You're reading it...so that means it's time to get down to business. My god, the pressure!

My goal for this blog is to focus on the idea of transparency in the uptime and performance of web sites and services. What does that mean? Let me tell you. My argument is that if you run an online application (e.g. a plain jane web site, a web service, an API, or anything else that sits online) and your users rely on it, you MUST be as open as possible about its downtime events, performance problems, and anything else that could affect the quality of service for your users. Gone are the days when you could hide behind the white glow of anonymity in the online space, and hope that no one notices your application is down for half the day (I'm looking at you, Twitter). Not only is this going to help your business and make your users happy, but it's only a matter of time before your users demand it.

Some examples:

All three of these services (SalesForce.com, Amazon, and Twitter) have come to realize, after much prodding from their users, and numerous downtime events, that making this information public is a really good idea!

Originally I was inspired to pursue this idea by a great article in Wired magazine titled "The See-Through CEO". Definitely check it out.

My goals for this blog at this point are the following:

Document examples of really great transparency, or the lack thereof.
Develop a guideline of transparency that you can use in your own professional life.
Help you, and maybe your business, reap the benefits of being transparent, and get ahead of the curve on your competition.

I would guess that 95% of all blogs die within the first 3 months. My goal is to be posting a one-year anniversary story, re-evaluating the state of the industry a year from now, and hopefully helping you become more successful along the way!
