
Tuesday, March 16, 2010
How to Trust the Cloud
Next week I head off to Beijing, China, to speak at Cloud Computing Congress China. Below is a sneak peek at the talk I plan to give (let me know what you think!):
Friday, March 5, 2010
Google App Engine downtime postmortem, nearly a perfect model for others
Google posted one of the most detailed and well-thought-out postmortems I've seen to explain what happened around their 2/24/10 App Engine downtime. Let's run it through the gauntlet:
Prerequisites:
- Admit failure - Yes
- Sound like a human - Yes, more in some sections than others
- Have a communication channel - Yes, the Google App Engine Downtime Notify group. Ideally it would have been linked to from the App Engine System Status Dashboard as well.
- Above all else, be authentic - Yes
Requirements:
- Start time and end time of the incident - Yes, including GMT times and a highly detailed timeline of the entire event
- Who/what was impacted - Partly, and it was also only partly covered during the actual incident
- What went wrong - Yes, yes, and yes! Incredible amount of detail here.
- Lessons learned - Yes! Not only are there five specific action items, but Google is also introducing new architectural changes and customer choice as a result.
Bonus:
- Details on the technologies involved - Somewhat
- Answers to the Five Whys - No
- Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc - Partial
- What others can learn from this experience - Partial
Takeaways and thoughts:
- The vast majority of the issues were training-related. This is an important lesson: all of the technology and process in the world won't help you if your on-call team doesn't know what to do, especially under the stress of a large incident. Follow Google's advice: run regular on-call drills (including rare issues), keep documentation up to date, and give on-call people the authority to make decisions on the spot.
- Extremely impressed with the decision to use this opportunity to improve their service by giving their customers the choice between datastore performance and reliability. This is a perfect example of turning downtime into a positive.
- Interesting insight into their process for going from detecting an incident to communicating externally. Thirteen minutes from the start of the incident at 7:48am to the first external communication at 8:01am is not too shabby. It's less clear why it took until 8:36am to post the update to the downtime forum and the health status dashboard.
- The amount of time and thought that went into this postmortem shows how much Google cares about the service and about perceptions of its reliability.
What could be improved:
- External communication could be faster. There's no reason not to post something as soon as the investigation begins, and the forum dedicated to downtime notifications and the health status dashboard should be updated immediately. When the incident started the dashboard showed very limited data; dashboard updates should be automatic and real-time (see the sketch after this list).
- A link to this postmortem from the health status dashboard would make it a lot easier to find. I didn't see it until someone sent it to me.
- Timelines and concrete deliverables on the changes (e.g. on-call training sessions, documentation updates, new datastore feature release) would give us more confidence that things will actually change.
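To make the "automatic and real-time" point concrete, here is a minimal sketch (in Python) of how an alerting pipeline could open a public incident entry on a status dashboard the moment an investigation begins, leaving humans to add detail afterward. The endpoint URL and payload fields are hypothetical placeholders, not any particular vendor's API; any real status tool will have its own interface.

```python
# Minimal sketch: push an incident notice to a public status dashboard as soon
# as monitoring detects a problem. The endpoint and payload fields below are
# hypothetical; adapt them to whatever status tool you actually use.
import json
import urllib.request
from datetime import datetime, timezone

STATUS_API = "https://status.example.com/api/incidents"  # hypothetical endpoint


def open_incident(component: str, summary: str) -> None:
    """Create a public incident entry the moment an investigation begins."""
    payload = {
        "component": component,
        "status": "investigating",
        "summary": summary,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
    request = urllib.request.Request(
        STATUS_API,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Fire-and-forget from the alerting pipeline; a human adds detail later.
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


# Example: called automatically by the monitoring system, not by a person.
# open_incident("datastore", "Elevated error rates; investigation underway.")
```

The specific API doesn't matter; the design point is that the first public signal comes from automation, so the dashboard never lags behind the investigation.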
Wednesday, March 3, 2010
Cloud photos for presentations and documents
I've been working on a talk I'm giving at Cloud Computing Congress in China and (I'm not ashamed to admit) I spent a lot of time looking for just the right cloud photo. I thought I'd share the best photos I came across, in case you ever need to give a talk about The Cloud (or just love photos of clouds).
Note: These photos are all licensed under Creative Commons, and came from a combination of Flickr Creative Commons search and Google Images (manually excluding non-Creative Commons images).
The photos are divided into four sections:
Simple clouds
Dark clouds
Powerful clouds
Unique clouds
You can find all of these cloud photos here. If you have other favorites and are willing to share, please add a link to them in the comments.
Labels: cloud, cloud computing, presentations, slides
Tuesday, March 2, 2010
A guideline for postmortem communication
Building on previous posts, the following is a proposed Guideline for Postmortem Communication:
Prerequisites:
- Admit failure - Hiding downtime is no longer an option
- Sound like a human - Do not use a standard template, do not apologize for "inconveniencing" us
- Have a communication channel - Set up a process to handle incidents prior to the event (e.g. public health dashboard, status blog, Twitter account, etc.)
- Above all else, be authentic - You must be believed to be heard
Requirements:
- Start time and end time of the incident
- Who/what was impacted - Should I be worried about this incident?
- What went wrong - What broke and how you fixed it (with insight into the root cause analysis process)
- Lessons learned - What's being done to improve the situation for the future, in technology, process, and communication
Bonus:
- Details on the technologies involved
- Answers to the Five Whys
- Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc
- What others can learn from this experience
Friday, February 26, 2010
The top 7 most overused cloud metaphors, sorted by weather pattern
Along with innovation, agility, and efficiency, cloud computing has brought us a bevy of clever metaphors around the cloud concept. Unfortunately, these play-on-words headlines are quickly becoming clichés. Below is a list of the most common metaphors I've come across, along with headlines that use them. I hope this list will force us all to be a bit more original. (Note: sorted in the chronological order of a weather event.)
Partly Cloudy
Dark Lining
Dark Side
Bursting
Rain
Clearing the Air
Silver Lining
- Not Every Cloud has a Silver Lining
- Is Integration Cloud Computing's Silver Lining?
- Not every cloud has a silver lining
- Is Google Chrome OS cloud computing's silver lining?
- Cloud Computing Is Silver Lining in Tough Economy
Bonus: aaS
Have you come across any others? Let me know and I'll update this post.
Friday, February 19, 2010
As Wordpress goes down, a chance to analyze another postmortem arises
If you recall, we put together a proposed guideline for postmortem communication in a previous post:
How did Wordpress do in their postmortem on 2/19/10?
Prerequisites
- Admit failure - Hiding downtime is no longer an option (thanks to Twitter)
- Sound like a human - Do not use a standard template, do not apologize for "inconveniencing" us.
- Have a communication channel - Ideally you've set up a process to handle incidents before the event, and communicated publicly during the event. Customers will need to know where to find your updates.
- Above all else, be authentic
Requirements:
- Start time and end time of the incident.
- Who/what was impacted.
- What went wrong, with insight into the root cause analysis process.
- What's being done to improve the situation, lessons learned.
Nice-to-haves:
- Details on the technologies involved.
- Answers to the Five Whys.
- Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc.
- What others can learn from this experience.
Prerequisites:
- Admit failure: Yes. The very first paragraph makes it clear they screwed up.
- Sound like a human: Yes. Extremely personal post.
- Have a communication channel: Yes, but not ideal. A combination of their general Twitter account and the founder's blog. Could be improved, but overall OK.
- Be authentic: Yes. 110% authentic!
Requirements:
- Start/end time: No. Only the duration is given.
- Who/what was impacted: Yes. States that 10.2 million blogs were affected for 110 minutes, costing about 5.5 million pageviews.
- What went wrong: Yes. Router issues, though the investigation is continuing.
- Lessons learned: Partial. Mostly a promise to share the results of the investigation.
Nice-to-haves:
- Technologies involved: No.
- Answers to the Five Whys: No.
- Human elements: Yes. "the entire team was on pins and needles trying to get your blogs back as soon as possible"
- What others can learn: No.
Conclusion
The intent of the post was to communicate quickly that they are aware of the severity of the issue and are taking it seriously. The details are lacking, mostly because it was posted so quickly. Still, this kind of post is extremely powerful, which makes me wonder whether a pre-postmortem (a quick admission of the issue, in an authentic voice and with some detail) is a necessary step in the pre/during/post-event communication process.