Transparent Uptime

Sunday, September 7, 2008

Gamer's Bill of Rights?

What's more important, a Gamer's Bill of Rights or an Online Users Bill of Rights? Do you even have to think about it?

Twitter showing improved uptime

Both ReadWriteWeb and TechCrunch point out that Twitter's uptime has improved considerably over the past couple of months, reaching 99.88% this past month.
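To put those percentages in perspective, here's a rough sketch that converts an uptime figure into the downtime it implies. The function name `implied_downtime` and the 30-day month are my own assumptions for illustration; nothing here reflects how Twitter actually measures its uptime.

```python
from datetime import timedelta


def implied_downtime(uptime_percent: float,
                     period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime implied by an uptime percentage over a period.

    Assumes a 30-day month by default (an assumption, not Twitter's definition).
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    down_seconds = period.total_seconds() * (100.0 - uptime_percent) / 100.0
    return timedelta(seconds=down_seconds)


if __name__ == "__main__":
    # The two figures quoted for Twitter: last month and this month so far.
    for pct in (99.88, 99.96):
        minutes = implied_downtime(pct).total_seconds() / 60
        print(f"{pct}% uptime is roughly {minutes:.0f} minutes down per 30 days")
```

Under that 30-day assumption, 99.88% works out to a bit under an hour of downtime in the month, and 99.96% to well under half an hour.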

TechCrunch quotes co-founder Biz Stone:

"Twitter has been making great progress in terms of uptime and reliability. Fail Whale sightings are far less frequent these days thanks to our efforts but we still have a long journey ahead. Last month we saw 99.88% uptime and so far this month we are at 99.96%. Our engineering and operations teams have been taking a very methodical approach to improving Twitter. We’re using the word “craftsmanship” to characterize our work here at the office. Reliability and dependability continue to be top on or list of key goals."

What I like most is the detail Twitter provides on their blog describing where the issues stem from:

"I've always respected a good sense of pacing. It's easy to be fast and loose, but it takes a certain discipline, foresight, and patience to guide something through the right way. For most of Twitter's early days, pacing could be considered an unattainable luxury. Our effort started with a bang and quickly accelerated to a disconcerting velocity that never let up. We found ourselves reacting to situations instead of crafting solutions and features we wanted to make.

With nearly two years at full speed, thousands of successes (with as many mistakes), and countless lessons learned, we've finally discovered our rhythm as a team. By carefully regrouping all aspects of our work, breaking the problem down into smaller parts, and iterating rapidly, Twitter, Inc. is poised to bring a new kind of communication to every part of the world."

Kudos to Twitter not only for the improved uptime, but also for keeping its users in the loop on things that are generally discussed only behind closed doors.

Saturday, August 23, 2008

What if the cloud disappeared tomorrow? Thoughts on an "Online Users Bill of Rights"

NPR did a story on the (often unexpected) risks involved in storing your data in the cloud. What would you do if Gmail, Flickr, or Yahoo decided they no longer cared to store your massive amount of free data and ran a large "rm -rf"? Sure, they'd get some pretty bad PR, but if you look at their EULAs, I'm betting they have the right to do this. Can we ever trust that our data is really safe in the cloud?

What's needed here is an "Online Users Bill of Rights". This would define specific standards that protect users and give them insight into decisions currently made behind closed doors. Here's a start:

1. Files, documents, or anything else that the user has created and saved online cannot be removed or made inaccessible without 30 days' advance notice.

2. The service must be accessible 95% of the time each month. Specifically, users must be able to access, retrieve, and delete their existing data at least 95% of the time in each month-long period. Services are also highly encouraged to make public a tighter uptime commitment, including the consequences of not meeting that commitment.

3. During downtime events, the service must make a best effort to provide status updates, estimates as to when service will be restored, and an explanation of what led to the downtime after the event. It is also highly encouraged to make known a central location to distribute this information.

4. The service will provide a performance SLA describing the average page load time it expects to see, and the consequences of not meeting that average in any given month. This is especially important for APIs and services like AWS.

5. The service must give at least 30 days' notice prior to making any "major" changes in the functionality or level of service provided up to that point (including APIs). It is also highly encouraged to involve users in the decision-making process prior to these changes.
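To make item 2 concrete, here's a minimal sketch of how a month's availability could be checked against the 95% floor, given a list of outage durations. The names `monthly_availability` and `meets_commitment`, the 30-day month, and the sample outages are all hypothetical; a real service would have to define what counts as "down" much more carefully.

```python
from datetime import timedelta

MIN_AVAILABILITY = 95.0  # percent, the floor proposed in item 2


def monthly_availability(outages, period=timedelta(days=30)):
    """Return availability as a percentage, given outage durations in a period."""
    down = sum(outages, timedelta())
    if down > period:
        raise ValueError("total downtime exceeds the measurement period")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1.0 - down / period)


def meets_commitment(outages, floor=MIN_AVAILABILITY, period=timedelta(days=30)):
    """True if the measured availability is at or above the floor."""
    return monthly_availability(outages, period) >= floor


if __name__ == "__main__":
    # Hypothetical month with a 2-hour and a 6-hour outage.
    sample = [timedelta(hours=2), timedelta(hours=6)]
    print(f"{monthly_availability(sample):.2f}% available")
    print("meets 95% floor:", meets_commitment(sample))
```

Note how loose the 95% floor is: over a 30-day month it allows up to 36 hours of downtime, which is why the tighter public commitment encouraged in item 2 matters.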

This Bill of Rights would need to be signed off on by any online service that stores data for users (Google, Yahoo, Flickr) or provides an online service that other businesses rely on (Amazon AWS, Salesforce, API providers). I'd like to see the day when users simply do not trust online services that aren't willing to sign off on this.

The above is just a first draft, and I'd love to get some input on this. I would purposefully keep the list somewhat open to interpretation, staying away from legalese, and focusing on the spirit of the idea of transparency and user rights (similar to the concept of a B Corporation).

What do you think?

Friday, August 22, 2008

Microsoft celebrates its downtime

After experiencing downtime during its launch of Photosynth this past week, Microsoft admits its projections were a little off:

"We have been abolsutely overwhlemed by demand, and have turned Photosynth.com into a special static/read-only mode for the moment. The team is hard at work adding capacity and getting the full site back online. We've been under incredible demand since we released just over 12 hours ago. With everyone waking up around the world traffic has been on a steady ramp up since that release and has far exceeded even our most optimistic expectations.
Getting ready for the launch we did massive amounts of performance testing, built capacity model after capacity model, and yet with all of that, you threw so much traffic our way that we need to add more capacity. We are adding that extra horsepower right now and should be back up shortly.
Thank you for the incredible reception! "

Nice to see some visibility into their thinking up to launch, and the preparation they (unsuccessfully) went through. The next best thing to real-time downtime status is a well-formed explanation after the fact (assuming the downtime is not prolonged), and something this personal coming out of Microsoft is a good sign.

Monday, August 18, 2008

Apple makes up for their downtime with 60 days of free service

From http://support.apple.com/kb/HT2826:

Why is Apple granting a 60-day subscription extension?
The transition from .Mac to MobileMe was rockier than we had hoped. While we are making a lot of improvements, the MobileMe service is still not up to our standards. We are extending subscriptions 60-days free of charge to express appreciation for our members’ patience as we continue to improve the service.
Am I eligible for the 60-day extension?
You are eligible if you are a MobileMe member whose account was active as of August 19, 2008 at 0:00 Pacific Daylight Time.

That's one way to deal with downtime!

Saturday, August 16, 2008

Do You Trust the Cloud?

Quoting a Lifehacker post of the same name:

"While web-based applications promise gigabytes of storage, anywhere-access, easy backup, and no software requirements beyond your browser to use them, becoming dependent on webapps can leave you high and dry when those services go out."

The post's references to the recent downtime of Gmail, MobileMe, and Amazon S3 got me thinking: what does it take to actually "trust the cloud"? What would give users confidence in choosing these services, and sticking with them through the inevitable issues? The simple answer is transparency!

How good are these specific services at providing transparency into their downtime? Let's review:

Gmail downtime (2 hours) on 8/11/08
The Gmail team kept users updated during the downtime using a Google Group thread, with surprisingly frequent updates and details (over the 2 hour downtime period). After the event was over and they were able to get their thoughts in order, they posted a message on their Gmail blog. A big red flag, however, shows up in the spike in Twitter posts and the huge spike in searches for "gmail down". We can tell that users are unsure where to go to get the official word on what's going on, which means that all of the work the Gmail team is putting into keeping users up to date goes unseen. If you post an update and no one sees it, does your update exist?
Conclusion: Very good transparency, but they need to do some work on making known the forum they are using to spread information. Kudos for how rarely this service has experienced downtime in its history (too easy to overlook).

MobileMe downtime (2 hours) on 8/11/08
With their handy dandy System Status Recent History page, the MobileMe team documented the downtime. However, the only way users were able to know anything was wrong DURING the event was a big failure when attempting to use the service. As CNET reports, "the same thing happened in mid July with enough blowback to cause Apple to offer a 30-day extension to both free trial and paying users." In a valiant yet fruitless effort to keep users in the know, Apple created the MobileMe Status blog, which as of now still has no news of the recent downtime. On the plus side, they have created a MobileMe Mail Chat page for users to get personal support when issues arise. On the downside, according to one comment, "even the support guy didn't know that the service outage was going on".
Conclusion: Unacceptable job keeping users in the loop, passable job documenting the events after the fact, and far too many random glitches to make this OK. Let's hope they get their act together soon and open up about their issues (won't be holding my breath...this is Apple after all).

Amazon S3 downtime (6 hours) on 7/20/08
As by far the most critical of these online services, S3 should be held to a higher standard. During the event, the AWS team posted outage messages and their Service Health Dashboard clearly showed they were having issues. After the event, a detailed explanation went up on their site.
Conclusion: There's a reason I have Amazon AWS in the "Transparency Hall of Fame" (top right of this blog). They've been at this a while, and their users have forced them to make this process as transparent as possible. They could get better at giving specific details during the actual event, and 6 hours of downtime is no laughing matter, but they did a good job and they continue to set the bar for transparency in online services.

Yesterday CNET posted the "10 Worst Web glitches of 2008 (so far)", which includes the above events, among others. What does this tell us? Clearly downtime across the broad spectrum of online services, from Amazon to Netflix to Google, is not going away. We need to learn to live with unreliable online services. The long term success of these services will be determined by how users perceive their reliability, weighed against the advantages of building in the cloud. That perception of reliability requires complete and utter transparency in the goings on of that service, especially during downtime events. We still have a long way to go until we can really "trust the cloud".

Thursday, August 14, 2008

Twitter lacking transparency in its own functionality

Slightly off topic, but still relevant to the concept of transparency in the online world: Twitter recently changed its limits on the number of followers any one person can have (to help curtail spam).

The problem, as described by WebWorkerDaily:

"Though the blog post says there’s no magic number, quite a few Twitterers - including some heavy participants - report hitting the limit at 2000. Some have been trying to get the attention of Twitter management to discuss this for days, with little or no result. It’s reminiscent of Twitter’s attitude towards making money from the service, which amounts to “we have a plan, but we won’t tell you,” or to fixing issues, which seems to be “we’re working on it, leave us alone.” In a world of Web 2.0 openness, Twitter seems to be carrying traditional business values of secrecy a bit further than most."

It's all too easy to keep your users in the dark. It takes real vision to open up and keep your users in the know.