Monday, June 28, 2010

Video of my talk (Upside of Downtime) at Velocity 2010

Video of my talk has been posted (below), though watching it and listening to myself feels pretty damn weird. I've been blown away by the response I've gotten to this talk. I know of a handful of companies circulating these slides/notes internally and working to make their organizations more transparent, and I've personally heard from a number of people at the conference who were discussing the ideas with their coworkers and thinking about the best way to take action. The talk has even resonated with Facebook (the example I used of how not to handle downtime), who pointed me to a little-known status page.

I'm hoping to start a conversation around the framework and continue to evolve it. I'm going to expand on the ideas in this blog, so if there is anything specific you would like me to explore (e.g., hard ROI, B2C examples, cultural differences), please let me know.

Enjoy the video:


The slides can be found here: http://www.slideshare.net/lennysan/the-upside-of-downtime-velocity-2010-4564992

Wednesday, June 23, 2010

The Upside of Downtime (Velocity 2010)

Here is the full deck from my talk at Velocity, including two bonus sections at the end:
The Upside of Downtime (Velocity 2010)


Also, here is the "Upside of Downtime Framework" cheat-sheet (click through to download):

Monday, June 21, 2010

See you at Velocity 2010!

Tonight I leave for the (sold-out) O'Reilly Velocity conference in Santa Clara, CA. I'll be presenting "The Upside of Downtime: How to Turn a Disaster into an Opportunity" on Wednesday at 4:35pm. If you're a reader of this blog and are at the conference, I'd love to meet up! Tweet me @lennysan or simply leave a comment here.

As soon as my talk ends, I will be posting the full slide-deck right here on this blog. Stay tuned!

P.S. If you're reading this post during my talk, here are some of the links I may or may not be referencing:

Tuesday, April 6, 2010

Zendesk - Transparency in action

A colleague pointed me to a simple postmortem written by the CEO of Zendesk, Mikkel Svane:
"Yesterday an unannounced DNS change apparently made our mail server go incognito to the rest of the world. The consequences of this came sneaking over night as the changes propagated through the DNS network. Whammy.

On top of this our upstream internet provider late last night PST (early morning CET) experienced a failure that prevented our servers from reaching external destinations. Web access was not affected but email, widget, targets, basically everything that relied on communication from our servers to the outside world were. Double whammy.

It took too long time to realize that we had two separate issues at hand. We kept focusing on the former as root cause for the latter. And it took unacceptably long to determine that we had a network outage."
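As an aside, the "sneaking" failure Mikkel describes is classic DNS propagation: resolvers around the world only pick up a changed record as their cached copies expire, so a bad change surfaces gradually rather than all at once. Here is a minimal sketch (my own illustration, not anything from Zendesk's writeup; it assumes the third-party dnspython package and uses a hypothetical hostname) of how you might watch a record from a few public resolvers:

    # Query the same name against several public resolvers to see which ones
    # have picked up a DNS change yet. Hypothetical example, not Zendesk's code.
    # Requires dnspython: pip install dnspython
    import dns.resolver

    RESOLVERS = {
        "Google": "8.8.8.8",
        "OpenDNS": "208.67.222.222",
    }
    HOSTNAME = "mail.example.com"  # hypothetical; substitute the host you changed

    for name, ip in RESOLVERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        try:
            answers = resolver.resolve(HOSTNAME, "A")
            print(f"{name}: {[rr.address for rr in answers]}")
        except Exception as exc:  # NXDOMAIN, timeout, SERVFAIL, etc.
            print(f"{name}: lookup failed ({exc})")

Running something like this during a DNS change makes it obvious when different parts of the internet still see the old record.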
How well does such an informal and simple postmortem stack up against postmortem best practices? Let's find out:

Prerequisites:
  1. Admit failure - Yes, no question.
  2. Sound like a human - Yes, very much so.
  3. Have a communication channel - Yes, both the blog and the Twitter account.
  4. Above all else, be authentic - Yes, extremely authentic.
Requirements:
  1. Start time and end time of the incident - No.
  2. Who/what was impacted - Yes, though more detail would have been nice.
  3. What went wrong - Yes, well done.
  4. Lessons learned - Not much.
Bonus:
  1. Details on the technologies involved - No.
  2. Answers to the Five Whys - No.
  3. Human elements - Some.
  4. What others can learn from this experience - Some.
Conclusion:

The meat was definitely there. The biggest missing piece is insight into what lessons were learned and what is being done to improve for the future. Mikkel says, "We've learned an important lesson and will do our best to ensure that no 3rd parties can take us down like this again," but the specifics are lacking. The exact start and end times of the event would have been useful as well, for those companies wondering whether this outage explains their own issues that day.

It's always impressive to see the CEO of a company put himself out there like this and admit failure. It is (naively) easier to pretend that everything is OK and hope the downtime blows over. In reality, getting out in front of the problem and communicating transparently, both during the downtime (in this case over Twitter) and after the event is over (in this postmortem), is the best thing you can do to turn a disaster into an opportunity to increase customer trust.

As it happens, I will be speaking at the upcoming Velocity 2010 conference about this very topic!

Update: Zendesk has put out a more in-depth review of what happened, which includes everything that was missing from the original post (which, as the CEO pointed out in the comments, was meant to be a quick update on what they knew at the time). The new post includes the time frame of the incident, details on what exactly went wrong with the technology, and, most importantly, lessons and takeaways to improve things for the future. Well done.