Transparent Uptime: 4/4/10

Friday, April 9, 2010

Your site's performance now affects your Google search ranking

Today Google officially followed up on a promise they made last year:

"Speeding up websites is important — not just to site owners, but to all Internet users. Faster sites create happy users and we've seen in our internal studies that when a site responds slowly, visitors spend less time there. But faster sites don't just improve user experience; recent data shows that improving site speed also reduces operating costs. Like us, our users place a lot of value in speed — that's why we've decided to take site speed into account in our search rankings. We use a variety of sources to determine the speed of a site relative to other sites."

If the ROI of page performance wasn't clear enough, we now have a big new reason to focus on optimizing performance. The big question is what Google considers "slow", and how search rankings are affected (e.g. are you boosted up if you are really fast, or are you pushed down if you are really slow, or both?). When are you done optimizing? Google has a big opportunity to set the bar, and give sites a clear target. Without that, the impact of this move may not be as beneficial to the speed of the web as they hope.

What we know

Site speed is taken "into account" in search rankings.
"While site speed is a new signal, it doesn't carry as much weight as the relevance of a page".
"Signal for site speed only applies for visitors searching in English on Google.com at this point".
Google is tracking site speed using both the Googlebot crawler and the Google Toolbar passive performance stats.
You can see what performance Google is recording for your site (only from Google Toolbar data) in the Webmaster Tools, under "Labs".
In the "Performance overview" graph, Google considers a load time over 1.5 seconds "slow".
Google is taking speed very seriously. The faster the web gets, the better for them.
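
To get a rough feel for where a page sits relative to the 1.5 second line in the "Performance overview" graph, a minimal sketch like the one below can help. It rests on assumptions: the function names are hypothetical, and it only times the raw HTML download from wherever you run it. That is almost certainly not what Google measures (the Toolbar data reflects full page loads in real browsers), so treat the result as a sanity check and nothing more.

```python
import time
import urllib.request

# The "slow" line shown in the Webmaster Tools "Performance overview" graph.
SLOW_THRESHOLD_SECONDS = 1.5


def classify_load_time(seconds, threshold=SLOW_THRESHOLD_SECONDS):
    """Label a load time 'slow' if it is over the threshold, otherwise 'fast'."""
    return "slow" if seconds > threshold else "fast"


def time_html_download(url, timeout=30):
    """Return the seconds taken to download the HTML at url.

    Assumption: this fetches only the HTML document. It does not load
    images, scripts, or stylesheets, and it does not render anything.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


if __name__ == "__main__":
    # example.com is a placeholder; substitute your own page.
    elapsed = time_html_download("https://example.com/")
    print(f"{elapsed:.2f}s -> {classify_load_time(elapsed)}")
```

Running it several times, and from more than one location, gives a better picture than a single sample, since one request says little about what real visitors experience.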

What we don't know

What "slow" means, and at what point you are penalized (or rewarded).
How much weight is given to the Googlebot stats versus the Google Toolbar stats.
What Google considers "done" when a page loads (e.g. Load event, DOMComplete event, HTML download, above-the-fold load, etc.). Does Googlebot load images/objects, and if so does it use a realistic browser engine?
How much historical data it looks at to determine your site speed, and how often it updates that data.
Whether there will be any transparency into the penalties/rewards.

What I think

Site performance is only going to be a factor when your site is extremely slow.
Extremely slow sites will be pushed down in the rankings, but fast sites probably won't see a rise in the rankings.
"Slow" is probably a high number, something like 10-20 seconds, and plays a bigger role in the final rankings as the speed gets slower. Regular sites won't be affected, even if they are subjectively slow.
This is probably just the beginning, and we should expect tweaking of these metrics as we become more comfortable with them. We'll probably be seeing new metrics along the same lines in the coming years (e.g. geographical performance, Time-to-Interact versus onLoad, consistency versus average, reliability, etc.).

Tuesday, April 6, 2010

Zendesk - Transparency in action

A colleague pointed me to a simple postmortem written by the CEO of Zendesk, Mikkel Svane:

"Yesterday an unannounced DNS change apparently made our mail server go incognito to the rest of the world. The consequences of this came sneaking over night as the changes propagated through the DNS network. Whammy.

On top of this our upstream internet provider late last night PST (early morning CET) experienced a failure that prevented our servers from reaching external destinations. Web access was not affected but email, widget, targets, basically everything that relied on communication from our servers to the outside world were. Double whammy.

It took too long time to realize that we had two separate issues at hand. We kept focusing on the former as root cause for the latter. And it took unacceptably long to determine that we had a network outage."

How well does such an informal and simple postmortem stack up against the postmortem best practices? Let's find out:

Prerequisites:

Admit failure - Yes, no question.
Sound like a human - Yes, very much so.
Have a communication channel - Yes, both the blog and the Twitter account.
Above all else, be authentic - Yes, extremely authentic.

Requirements:

Start time and end time of the incident - No.
Who/what was impacted - Yes, though more detail would have been nice.
What went wrong - Yes, well done.
Lessons learned - Not much.

Bonus:

Details on the technologies involved - No
Answers to the Five Whys - No
Human elements - Some
What others can learn from this experience - Some

Conclusion:

The meat was definitely there. The biggest missing piece is insight into what lessons were learned and what is being done to improve for the future. Mikkel says that "We've learned an important lesson and will do our best to ensure that no 3rd parties can take us down like this again", but the specifics are lacking. The exact time of the start and end of the event would have been useful as well, for those companies wondering whether this explains their issues that day.

It's always impressive to see the CEO of a company put himself out there like this and admit failure. It is (naively) easier to pretend like everything is OK and hope the downtime blows over. In reality, getting out in front of the problem and being transparent, communicating during the downtime (in this case over Twitter), and after the event is over (in this postmortem), are the best things you can do to turn your disaster into an opportunity to increase customer trust.

As it happens, I will be speaking at the upcoming Velocity 2010 conference about this very topic!

Update: Zendesk has put out a more in-depth review of what happened, which includes everything that was missing from the original post (which, as the CEO pointed out in the comments, was meant to be a quick update of what they knew at the time). This new post includes the time frame of the incident, details on what exactly went wrong with the technology, and most importantly lessons and takeaways to improve things for the future. Well done.
