Thursday, June 10, 2010

Quick update (and Velocity preview)

Alas, this blog has been quiet for too long. My pathetic excuse is that I'm channeling the efforts that would normally go into this blog into my upcoming talk at Velocity. To make up for my negligence, here is a sneak peek at the talk:

I will post the entire slide deck here on the blog immediately following the talk. Stay tuned!

Friday, April 30, 2010

A proposal for a new community focused on web performance

I've been really impressed with the StackExchange platform (http://www.stackexchange.com, made by the same people who run stackoverflow.com), and I feel it could be an extremely effective platform for hosting a community focused on web performance. They built the platform from scratch to address the innate flaws of regular threaded discussion boards (e.g. Yahoo forums, Google Groups, phpBB, vBulletin, etc.). More importantly, the platform walks the line between incentivizing quick answers (for immediate feedback) and keeping answers from becoming obsolete over time.

My hope is that this site becomes an evolving source of definitive answers on web performance best practices, tips, tool tricks, book recommendations, data exchange, etc.

The process to make this a reality is:
1. Submit a proposal for peer review.
2. If there is enough support (votes), it moves on to the next stage.
3. People who would like to participate in the community (and help manage it) sign up.
4. The details of the community get ironed out (moderators, name, tags, etc.).
5. It goes public.

I've gone ahead and submitted the initial proposal (step 1):
http://meta.stackexchange.com/questions/5821/proposal-for-stackexchange-site-focused-on-web-site-performance

I'm just here to get the initial ball rolling; from here on out it's going to be all about the greater community. This next stage, where everyone votes on the proposals, is going to make or break the concept. It has already received a good number of votes, but it's going to take a lot more support to push it forward. If you think this has legs and can see the value, vote it up!

Tuesday, April 20, 2010

Transparent censorship

Google has decided to fight censorship with transparency:
...it's no surprise that Google, like other technology and telecommunications companies, regularly receives demands from government agencies to remove content from our services. Of course many of these requests are entirely legitimate, such as requests for the removal of child pornography. We also regularly receive requests from law enforcement agencies to hand over private user data. Again, the vast majority of these requests are valid and the information needed is for legitimate criminal investigations. However, data about these activities historically has not been broadly available. We believe that greater transparency will lead to less censorship.
To this end, they have launched a Government Requests tool:
We are today launching a new Government Requests tool to give people information about the requests for user data or content removal we receive from government agencies around the world. For this launch, we are using data from July-December, 2009, and we plan to update the data in 6-month increments. Read this post to learn more about our principles surrounding free expression and controversial content on the web.

Takeaway: If you are forced to do something bad, tell everyone about it.

Monday, April 19, 2010

A basic introduction to the Cloud

I've been getting requests recently to give a high-level overview of "The Cloud": what it means, why people are excited, and why they should care. I decided to put together a basic introduction to the Cloud (embedded below). Feel free to use parts of this in your own presentations, and I'd love to hear feedback:

Tuesday, April 13, 2010

Atlassian has a security breach, responds with transparency, sees benefits

This past Sunday, Atlassian (makers of Confluence, JIRA, and other popular collaboration tools) experienced a security breach:
Around 9pm U.S. PST Sunday evening, Atlassian detected a security breach on one of our internal systems. The breach potentially exposed passwords for customers who purchased Atlassian products before July 2008. During July 2008, we migrated our customer database into Atlassian Crowd, our identity management product, and all customer passwords were encrypted. However, the old database table was not taken offline or deleted, and it is this database table that we believe could have been exposed during the breach.
Instead of keeping the break-in private and hoping the incident would blow over quickly, they emailed their entire customer base the very next day with the gory details:


It turned out that this email alarmed a number of customers who had no reason to worry (as their accounts were unaffected), which led to another email:


Along with this email, Atlassian went further and posted an extremely detailed postmortem of the entire event, detailing who was impacted, the actions you need to take as a customer, lessons learned, and the next steps they are taking to improve for the future. Incidentally, this postmortem would fare very well if run through our postmortem best practices (even though the incident is completely different from a downtime event, for which those best practices were written).

The Payoff
Normally, an incident like this would create a large number of very unhappy customers. Instead, thanks to the quick, honest, and transparent response, we see the following reaction:

...and if you think I'm just picking out the positive reactions, compare a Twitter search for "Atlassian" with "positive" and "negative" sentiment. At the time of this writing, there are over three pages of positive results and less than one page of negative (and many of the negative results are unrelated to this incident). And this is after a major security breach.

Clearly this is a case of transparency turning a disaster into an opportunity, and of taking advantage of that opportunity by being open and honest with your users.

Friday, April 9, 2010

Your site's performance now affects your Google search ranking

Today Google officially followed up on a promise they made last year:
"Speeding up websites is important — not just to site owners, but to all Internet users. Faster sites create happy users and we've seen in our internal studies that when a site responds slowly, visitors spend less time there. But faster sites don't just improve user experience; recent data shows that improving site speed also reduces operating costs. Like us, our users place a lot of value in speed — that's why we've decided to take site speed into account in our search rankings. We use a variety of sources to determine the speed of a site relative to other sites."
If the ROI of page performance wasn't clear enough, we now have a big new reason to focus on optimizing it. The big question is what Google considers "slow" and how search rankings are affected (e.g. are you boosted up if you are really fast, pushed down if you are really slow, or both?). When are you done optimizing? Google has a big opportunity to set the bar and give sites a clear target. Without that, the impact of this move may not be as beneficial to the speed of the web as they hope.

What we know
  1. Site speed is taken "into account" in search rankings.
  2. "While site speed is a new signal, it doesn't carry as much weight as the relevance of a page".
  3. "Signal for site speed only applies for visitors searching in English on Google.com at this point".
  4. Google is tracking site speed using both the Googlebot crawler and the Google Toolbar's passive performance stats (a rough sketch of that kind of client-side measurement follows after these lists).
  5. You can see what performance Google is recording for your site (from Google Toolbar data only) in Webmaster Tools, under "Labs".
  6. In the "Performance overview" graph, Google considers a load time over 1.5 seconds "slow".
  7. Google is taking speed very seriously. The faster the web gets, the better for them.
What we don't know
  1. What "slow" means, and at what point you are penalized (or rewarded).
  2. How much weight is given to the Googlebot stats versus the Google Toolbar stats.
  3. What Google considers "done" when a page loads (e.g. Load event, DOMComplete event, HTML download, above-the-fold load, etc.). Does Googlebot load images/objects, and if so does it use a realistic browser engine?
  4. How much historical data it looks at to determine your site speed, and how often it updates that data.
  5. Whether there will be any transparency into the penalties/rewards.
What I think
  1. Site performance is only going to play a factor when your site is extremely slow.
  2. Extremely slow sites will be pushed down in the rankings, but fast sites probably won't see a rise in the rankings.
  3. "Slow" is probably a high number, something like 10-20 seconds, and plays a bigger role in the final rankings as the speed gets slower. Regular sites won't be affected, even if they are subjectively slow.
  4. This is probably just the beginning, and we should expect tweaking of these metrics as we become more comfortable with them. We'll probably be seeing new metrics along the same lines in the coming years (e.g. geographical performance, Time-to-Interact versus onLoad, consistency versus average, reliability, etc.).
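
To make the Toolbar-style measurement a bit more concrete, here is a minimal client-side sketch of passive load-time collection. It is only a rough approximation under my own assumptions: the /perf-beacon endpoint is hypothetical, and the real definition of "done" that Google uses is one of the unknowns listed above.

```typescript
// A rough sketch of passive load-time measurement, similar in spirit to
// toolbar-style instrumentation. The /perf-beacon endpoint is hypothetical.

// Capture a timestamp as early as possible, ideally inline in the <head>.
const navStart: number = Date.now();

window.addEventListener("load", () => {
  const loadTimeMs = Date.now() - navStart;

  // 1.5 seconds is the threshold the "Performance overview" graph labels
  // as "slow" (see "What we know" above).
  const isSlow = loadTimeMs > 1500;

  // Report the measurement to our own analytics endpoint via an image beacon.
  const beacon = new Image();
  beacon.src = "/perf-beacon?t=" + loadTimeMs + "&slow=" + (isSlow ? 1 : 0);
});
```

Note that this times from script execution to the onload event, which may or may not line up with whatever point Google actually treats as "done".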

Tuesday, April 6, 2010

Zendesk - Transparency in action

A colleague pointed me to a simple postmortem written by the CEO of Zendesk, Mikkel Svane:
"Yesterday an unannounced DNS change apparently made our mail server go incognito to the rest of the world. The consequences of this came sneaking over night as the changes propagated through the DNS network. Whammy.

On top of this our upstream internet provider late last night PST (early morning CET) experienced a failure that prevented our servers from reaching external destinations. Web access was not affected but email, widget, targets, basically everything that relied on communication from our servers to the outside world were. Double whammy.

It took too long time to realize that we had two separate issues at hand. We kept focusing on the former as root cause for the latter. And it took unacceptably long to determine that we had a network outage."
How well does such an informal and simple postmortem stack up against the postmortem best practices? Let's find out:

Prerequisites:
  1. Admit failure - Yes, no question.
  2. Sound like a human - Yes, very much so.
  3. Have a communication channel - Yes, both the blog and the Twitter account.
  4. Above all else, be authentic - Yes, extremely authentic.
Requirements:
  1. Start time and end time of the incident - No.
  2. Who/what was impacted - Yes, though more detail would have been nice.
  3. What went wrong - Yes, well done.
  4. Lessons learned - Not much.
Bonus:
  1. Details on the technologies involved - No.
  2. Answers to the Five Whys - No.
  3. Human elements - Some.
  4. What others can learn from this experience - Some.
Conclusion:

The meat was definitely there. The biggest missing piece is insight into what lessons were learned and what is being done to improve for the future. Mikkel says that "We've learned an important lesson and will do our best to ensure that no 3rd parties can take us down like this again", but the specifics are lacking. The exact start and end times of the event would have been useful as well, for companies wondering whether this explains issues they saw that day.
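
To make that concrete, one specific lesson a fuller postmortem could spell out is automated monitoring for unexpected DNS changes, since an unannounced change is exactly what took the mail server offline. Here is a minimal sketch of such a check; the hostnames are hypothetical placeholders, not Zendesk's actual configuration.

```typescript
// A rough sketch (Node.js) of a DNS sanity check: alert if the domain's MX
// records no longer include the mail host we expect. Hostnames are
// hypothetical placeholders.
import { promises as dns } from "dns";

const DOMAIN = "example.com";            // hypothetical domain
const EXPECTED_MX = "mail.example.com";  // hypothetical expected mail host

async function checkMxRecords(): Promise<void> {
  const records = await dns.resolveMx(DOMAIN);
  const hosts = records.map((record) => record.exchange);

  if (!hosts.includes(EXPECTED_MX)) {
    // In a real setup this would page someone rather than just log.
    console.error(`Unexpected MX records for ${DOMAIN}: ${hosts.join(", ")}`);
    process.exitCode = 1;
  } else {
    console.log(`MX records for ${DOMAIN} look fine: ${hosts.join(", ")}`);
  }
}

checkMxRecords().catch((err) => {
  console.error(`MX lookup for ${DOMAIN} failed:`, err);
  process.exitCode = 1;
});
```

Run something like this on a schedule and the team hears about a surprise DNS change before customers do.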

It's always impressive to see the CEO of a company put himself out there like this and admit failure. It is (naively) easier to pretend that everything is OK and hope the downtime blows over. In reality, getting out in front of the problem, being transparent, and communicating both during the downtime (in this case over Twitter) and after the event is over (in this postmortem) are the best things you can do to turn a disaster into an opportunity to increase customer trust.

As it happens, I will be speaking at the upcoming Velocity 2010 conference about this very topic!

Update: Zendesk has put out a more in-depth review of what happened, which includes everything that was missing from the original post (which, as the CEO pointed out in the comments, was meant to be a quick update of what they knew at the time). This new post includes the time frame of the incident, details on what exactly went wrong with the technology, and, most importantly, lessons and takeaways to improve things for the future. Well done.