"Today, we gather a little over 30,000 metrics, on everything from CPU usage, to network bandwidth, to the rate of listings and re-listings done by Etsy sellers. Some of those metrics are gathered every 20 seconds, 24 hours a day, 365 days a year. About 2,000 metrics will alert someone on our operations staff (we have an on-call rotation) to wake up in the middle of the night to fix a problem."
Takeaway: Capture data on every part of your infrastructure, and later decide which metrics are leading indicators of problems. He goes on to talk about the importance of external monitoring (outside of your firewall) to measure the actual end-user experience.Communication
"When we have an outage or issue that affects a measurable portion of the site’s functionality, we quickly group together to coordinate our response. We follow the same basic approach as most incident response teams. We assign some people to address the problem and others to update the rest of the staff and post to http://etsystatus.com to alert the community. Changes that are made to mitigate the outage are largely done in a one-at-a-time fashion, and we track both our time-to-detect as well as our time-to-resolve, for use in a follow-up meeting after the outage, called a “post-mortem” meeting. Thankfully, our average time-to-detect is on the order of 2 minutes for any outages or major site issues in the past year. This is mostly due to continually tuning our alerting system."
Takeaway: Two important points here. First, communication and collaboration are key to successfully managing issues. Second, and even more interesting, is the need for two teams...one to address the problem and one to communicate status updates both internally and externally. This is often a missing piece for companies, where no updates go out because everyone is busy fixing the problem.Post-Mortems
"After any outage, we meet to gather information about the incident. We reconstruct the time-line of events; when we knew of the outage, what we did to fix it, when we declared the site to be stable again. We do a root cause analysis to characterize why the outage happened in the first place. We make a list of remediation tasks to be done shortly thereafter, focused on preventing the root cause from happening again. These tasks can be as simple as fixing a bug, or as complex as putting in new infrastructure to increase the fault-tolerance of the site. We document this process, for use as a reference point in measuring our progress."
Takeaway: Fixing the problem and getting back online is not enough. Make it a an automatic habit to schedule a postmortem to do a deep dive into the root cause(s) of the problem, and address not only the immediate bugs but also the deeper issues that led to the root cause. The Five Why's can help here, as can the Lean methodology of investing a proportional number of hours into the most problematic parts of the infrastructure.Single Point of Failure Reduction
"As Etsy has grown from a tiny little start-up to the mission-critical service it is today, we’ve had to outgrow some of our infrastructure. One reason we have for this evolution is to avoid depending on single pieces of hardware to be up and running all of the time. Servers can fail at any time, and Etsy.com should be able to keep working if a single server dies. To do that, we have to put our data in multiple places, keep them in sync, and make sure our code can route around any individual failures.
So we’ve been working a lot this year to reduce those “single points of failure,” and to put in redundancy as fast as we safely can. Some of this means being very careful (paranoid) as we migrate data from the single instances to multiple or replicated instances. As you can imagine, it’s a bit of a feat to move that volume of data around while still seeing a peak of 15 new listings per second, all the while not interrupting the site’s functionality."
Takeaway: Reduce single points of failure incrementally. Do what you can in the time you have.Change Management and Risk
"For every type of technical change, we have answers to questions like:
What problem does the change solve?
Has this kind of change happened before? Is there a successful history?
When is the change going to start? When is it expected to end?
What is the expected effect of this change on the Etsy community? Is a downtime required for the change?
What is the rollback plan, if something goes wrong?
What test is needed to make sure that the change succeeded?
As with all change, the risk involved and the answers to these questions are largely dependent on the judgment of the person at the helm. At Etsy, we believe that if we understand the likely failures, and if there’s a plan in place to fix any unexpected issues, we’ll make progress.
Just as important, we also track the results of changes. We have an excellent history with respect to the number of successful changes. This is a good record that we plan on keeping."
Takeway: Be prepared for failure by anticipating worst-case scenario's for every change. Be ready to roll back and respond. More importantly, make sure to track when things go right to have a realistic measure of risk.Other takeaways:
- Declaring "outage bankruptcy" is not the ideal approach. But it is better than simply going along without any authentic communication with your customers throughout a period of instability. Your customers will understand, if you act human.
- Etsy has been doing a great job keeping customers up to date at http://etsystatus.com/.
- A glance at the comments on the page shows a few upset customers, but a generally positive response.