Friday, February 19, 2010

As Wordpress goes down, a chance to analyze another postmortem arises

If you recall, we put together a proposed guideline for postmortem communication in a previous post:

  1. Admit failure - Hiding downtime is no longer an option (thanks to Twitter)
  2. Sound like a human - Do not use a standard template, do not apologize for "inconveniencing" us.
  3. Have a communication channel - Ideally you've set up a process to handle incidents before the event, and communicated publicly during the event. Customers will need to know where to find your updates.
  4. Above all else, be authentic

  1. Start time and end time of the incident.
  2. Who/what was impacted.
  3. What went wrong, with insight into the root cause analysis process.
  4. What's being done to improve the situation, lessons learned.

  1. Details on the technologies involved.
  2. Answers to the Five Why's.
  3. Human elements - heroic efforts, unfortunate coincidences, effective teamwork, etc.
  4. What others can learn from this experience.

How did Wordpress do in their postmortem on 2/19/10?

  1. Admit failure: Yes. The very first paragraph makes it clear they screwed up.
  2. Sound like a human: Yes. Extremely personal post.
  3. Have a communication channel: Yes, but not ideal. A combination of their general Twitter account and the founder's blog. Could be improved, but overall OK.
  4. Be authentic: Yes. 110% authentic!

  1. Start/end time: No. Only focused on duration.
  2. Who/what was impacted: Yes. Describes that 10.2 million blogs were affected, for 110 minutes, taking away 5.5 million pageviews.
  3. What went wrong: Yes. Router issues, though investigation is continuing.
  4. Lessons learned: Partial. Mostly a promise to share the results of the investigation.

  1. Technologies involved: No.
  2. Answers to the Five Why's: No.
  3. Human elements: Yes. "the entire team was on pins and needles trying to get your blogs back as soon as possible"
  4. What others can learn: No.
The intent of the blog post was to communicate quickly that they are aware of the severity of the issue and are taking it seriously. The details are lacking, mostly because it was posted so quickly. Still, a post like this is extremely powerful, which makes me wonder whether a pre-postmortem, a simple admission of the issue in an authentic voice and with some detail, is a necessary step in the pre/during/post event communication process.
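
To put the impact numbers from the postmortem in perspective, here is a quick back-of-the-envelope sketch. It uses only the three figures quoted above (10.2 million blogs, 110 minutes, 5.5 million pageviews); the variable names are my own, and the per-minute and per-blog rates are simple averages, not anything Wordpress reported.

```python
# Figures quoted in the Wordpress postmortem, as summarized above.
blogs_affected = 10_200_000
outage_minutes = 110
pageviews_lost = 5_500_000

# Simple averages (an assumption: the real traffic was surely not uniform).
pageviews_lost_per_minute = pageviews_lost / outage_minutes
pageviews_lost_per_blog = pageviews_lost / blogs_affected

print(f"{pageviews_lost_per_minute:,.0f} pageviews lost per minute")
print(f"{pageviews_lost_per_blog:.2f} pageviews lost per affected blog")
```

On average that works out to roughly 50,000 lost pageviews for every minute of downtime, or about half a pageview per affected blog.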

Tuesday, February 16, 2010

The Tao of Web Performance and Uptime

Who cares about fast web pages. Who cares about uptime. I mean really.

Does it truly matter to you whether a page loads a couple seconds faster? Are we wasting our lives keeping servers up 99.999%? Are we making an impact on the world in a meaningful way? Does it actually matter in the scheme of things?

I think the answer is yes. It does matter. It matters a lot. But it isn't for the reasons you think.

What would most people answer when asked "Why are you spending your life making web pages faster and keeping servers up?" Here are my guesses:

It's my job
I'll get fired. I'll let my peers down. I'll hurt the company. My boss will be mad at me if I don't do what I'm told. All good reasons (unless you've read Seth Godin's new book). But do these reasons make you honestly care about web page performance? Does it make you happy to spend your precious lifetime keeping servers running all hours of the night? The real question is not whether performance and uptime matter. The question that you should be asking is: Do performance and uptime matter to you, as a human being? If your answer relies on this being your job, on caring about it because it provides security and comfort, then the answer is no. You won't find fulfillment in your work if your motivation is being a good cog.

It's my company
Being the person making money off of the cogs (as they improve page performance and keep the system stable) changes the equation. No doubt, keeping your servers up is critical to the success of your online business (usually). Furthermore, the ROI of page performance is fairly conclusive. Clearly, uptime and performance lead to more revenue (or at least less lost revenue). But what do you actually care about in this equation? How quickly the pages load, or how much money you're making as a result of those faster pages? You don't want to spend your time optimizing pages all day. You want to get done with it as quickly as possible and get back to doing business. You certainly care about page performance and uptime, but only as a tool, a necessary evil, that helps you optimize your true passion (whatever that may be).

It's fun
You look at tuning performance or building highly reliable systems as a puzzle. You enjoy the work because you are good at it, or you want to accomplish something no one else has in the past. Your motivation is either the thrill of the problem or personal brand building in your group/company/industry. You enjoy performance and uptime for the opportunity that it offers, and the feeling of accomplishment that it brings when you improve performance by 23% or keep the system up during a marketing blitz. This explanation gets close to being a good reason, but it lacks something. It's selfish. It focuses on you. It doesn't give you a purpose, or impact the world in a meaningful way. Fun will take you so far, but at some point you'll wonder "what's the point?" and move on to the next challenge.

It makes people happy
This is it. This is why it matters. This is why it is worth spending your time making web pages faster and keeping servers up. To put it simply, it makes people happier. To quote Matt Mullenweg, founder of Wordpress:
"That's why [performance] is important and why we should be obsessed and not be discouraged when it doesn't change the funnel. My theory here is when an interface is faster, you feel good. And ultimately what that comes down to is you feel in control. The web app isn't controlling me, I'm controlling it. Ultimately that feeling of control translates to happiness in everyone. In order to increase the happiness in the world, we all have to keep working on this. "
How can we quantify this? We have data showing that a slower page results in fewer searches, and more importantly that user satisfaction goes down with each additional performance decrease. AOL shows us that page views drop off as page load times increase. Optimizations to Google Maps increased user interaction with the site significantly. The faster the site, the more you want to use it. Let's delve into more evidence...

If you haven't yet come across the concept of flow:
"Flow is the mental state of operation in which the person is fully immersed in what he or she is doing by a feeling of energized focus, full involvement, and success in the process of the activity. Proposed by Mihály Csíkszentmihályi, the positive psychology concept has been widely referenced across a variety of fields.

According to Csíkszentmihályi, flow is completely focused motivation. It is a single-minded immersion and represents perhaps the ultimate in harnessing the emotions in the service of performing and learning. In flow the emotions are not just contained and channeled, but positive, energized, and aligned with the task at hand. To be caught in the ennui of depression or the agitation of anxiety is to be barred from flow. The hallmark of flow is a feeling of spontaneous joy, even rapture, while performing a task."
How do performance and uptime relate to flow? Researchers asked this very question and found some unsurprising results:
"Hoffman, Novak, and Yung found that the speed of interaction had a “direct positive influence on flow” on feelings of challenge and arousal (which directly influence flow), and on importance. Skill, control, and time distortion also had a direct influence on flow.

The researchers then applied their model to consumer behavior on the web. They tested web applications (chat, newsgroups, and so on) and web shopping, asking subjects to specify which features were most important when shopping on the web.

They found that speed had the greatest effect on the amount of time spent online and on frequency of visits for web applications. For repeat visits, the most important factors were skill/control, length of time on the web, importance, and speed.

So to make your site compelling enough to return to, make sure that it offers a perceived level of control by matching challenges to user skills, important content, and fast response times."
When asked about the importance of speed on flow, Csíkszentmihályi offers:
"If you mean the speed at which the program loads, the screens change, the commands are carried out—then indeed speed should correlate with flow. If you are playing a fantasy game, for instance, and it takes time to move from one level to the next, then the interruption allows you to get distracted, to lose the concentration on the alternate reality. You have time to think: “Why am I wasting time on this? Shouldn’t I be taking the dog for a walk, or studying?”— and the game is over, psychologically speaking."
Clearly, speed plays a key role in attaining flow. If you believe (as I do) that flow is a good thing, and brings on happiness, then giving your visitors the chance to enter a flow state is a worthwhile pursuit.

From the guru of web usability, Jakob Nielsen:
"Every web usability study I have conducted since 1994 has shown the same thing: Users beg us to speed up the page downloads. In the beginning my reaction was along the lines of "Let's just give them better design, and they will be happy to wait for it." I have since become a reformed sinner believing that fast response times are the most important design criterion for web pages; even my skull isn't thick enough to withstand consistent user pleas year after year."
Users are begging us to speed up page load times!

User Psychology
A good number of studies have further connected slow web pages (and unreliable web applications) with frustration and higher blood pressure.

  • "slow response time generated higher ratings of frustration and impatience"
  • "Frustration occurs at an interruption or inhibition of the goal-attainment process, where a barrier or conflict is put in the path of an individual"
  • "Slow websites inhibit users from reaching their goals, causing frustration"
  • "It was found that in the context of human–computer interactions while browsing a Web site, flow experience was characterized by time distortion, enjoyment, and telepresence."
A study done by Forrester Consulting (on behalf of Akamai):
  • "finds that website performance has a direct impact on revenues, profits and satisfaction."
  • "The findings indicate that website performance is second only to security in user expectations"

Daniel Pink's new book Drive argues that "the biggest motivator at work is making progress". Anything that gets in the way of you making progress makes you less happy. As the web becomes a bigger part of where work is done (be it SaaS, the cloud, or Twitter), the speed and reliability of those web sites become more important. Progress, motivation, and happiness will be increasingly tied to the performance and stability of the web.

Still not convinced?
Let's look at the flip side. Slow and unreliable sites make people very upset.

Downtime creates pain and frustration. Slow web applications piss people off. Ironically, the more popular your site, and the more useful it is to your users, the more unhappiness you can cause.

This unhappiness does not end at your firewall. I don't have to tell you how stressful downtime is internally. Your peers have to work nights, your boss has to explain what happened to their boss. If you are dogfooding your applications, internal productivity is affected by both downtime and bad performance. Simply put, downtime and bad web performance create a lot of unhappy people.

Where does this leave us?
Let's ask the same question we asked earlier:
"Does it truly matter to you whether a page loads a couple seconds faster? Are we wasting our lives keeping servers up 99.999%? Are we making an impact on the world in a meaningful way? Does it actually matter in the scheme of things?"
Hopefully I've convinced you that there is a strong link between performance/stability and happy users. That's all well and good, but does this matter in the scheme of things? Two stats stand out to me:
  1. The number of Internet users worldwide: 1,733,993,741
  2. The amount of time spent online per week: 13 hours
If we as an industry can impact the happiness of almost 2 billion people by making the web a little bit faster, or a little more stable, I say that this does indeed impact the world in a meaningful way. By plugging away at our little problems, our minor tweaks, our tools and tricks, we are helping our users, and the world at large, become a happier place.
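
For a rough sense of scale, the two stats above can be multiplied together. This is only a sketch: the user count and the 13 hours per week come from the stats quoted above, while the function name and the 1% figure in the example are purely hypothetical assumptions of mine, not measurements.

```python
# The two stats quoted above.
internet_users = 1_733_993_741
hours_online_per_week = 13

# Total time the world spends online each week, in hours.
total_hours_per_week = internet_users * hours_online_per_week


def hours_saved_per_week(fraction_saved: float) -> float:
    """Hours saved worldwide per week if a given fraction of online time
    were saved. The fraction is a hypothetical input, not a measured value."""
    return total_hours_per_week * fraction_saved


print(f"{total_hours_per_week:,} hours spent online per week")
# Hypothetical example: assume a faster web saves 1% of that time.
print(f"{hours_saved_per_week(0.01):,.0f} hours saved per week at 1%")
```

That is more than 22.5 billion hours online every week, so even a tiny improvement adds up across that many people.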

"Happiness is the meaning and the purpose of life, the whole aim and end of human existence" -- Aristotle