Wednesday, July 15, 2009

SLAs as an insurance policy? Think again.

From Benjamin Black:

"if SLAs were insurance policies, vendors would quickly be out of business.

given this, the question remains: how do you achieve confidence in the availability of the services on which your business relies? the answer is to use multiple vendors for the same services. this is already common practice in other areas: internet connection multihoming, multiple CDN vendors, multiple ad networks, etc. the cloud does not change this. if you want high availability, you’re going to have to work for it."

Well put. As Werner Vogels continues to preach, everything fails. Build your infrastructure so that SLAs are a bonus, not a requirement.
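
To make that concrete, here is a minimal sketch of the multi-vendor idea in Python: try the primary provider and fail over to a secondary when it errors or times out. The vendor endpoints and timeout are illustrative assumptions, not real services.

```python
import urllib.request

# Hypothetical endpoints for the same logical service, hosted with two
# independent vendors. Order expresses preference, not trust.
SERVICE_ENDPOINTS = [
    "https://api.vendor-a.example.com/v1/resource",   # primary vendor
    "https://api.vendor-b.example.com/v1/resource",   # secondary vendor
]

def fetch_with_failover(endpoints, timeout=2.0):
    """Try each vendor in turn and return the first successful response body.

    Any single vendor's SLA is treated as a bonus: availability comes from
    having more than one of them.
    """
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:   # URLError, HTTPError and timeouts all land here
            last_error = exc     # remember the failure and try the next vendor
    raise RuntimeError(f"all vendors failed; last error: {last_error}")

if __name__ == "__main__":
    try:
        print(f"got {len(fetch_with_failover(SERVICE_ENDPOINTS))} bytes")
    except RuntimeError as err:
        print(err)
```

The same pattern generalizes to CDNs and ad networks: the failover logic lives in your code, not in any one vendor's SLA.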

Thursday, July 9, 2009

Google raising the bar in post-mortem transparency

In the most detailed post-mortem I've ever seen come out of a cloud provider, Google chronicles the minute-by-minute timeline of their App Engine downtime event, reviews what went wrong, and commits to fixing the root cause at many levels:

What are we doing to fix it?

1. The underlying bug in GFS has already been addressed and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug has been live for at least a year, so the
risk of recurrence should be low. Site reliability engineers are aware
of this issue and can quickly fix it if it should recur before then.

2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.

3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.

4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.

Read the entire post for the full effect. Looks like The Register should think about taking back some of the things they said?
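
Point 3 in particular — applications coping "reasonably with long-term catastrophic outages" of the persistence layer — is the one application developers can act on themselves. Here is a minimal sketch of that idea, assuming a hypothetical datastore client: fall back to serving possibly stale reads from a local cache and refuse writes, rather than failing outright. This is my illustration, not a description of how App Engine actually implements it.

```python
import time

class DatastoreUnavailable(Exception):
    """Raised by the hypothetical datastore client when the backend is down."""

class DegradedReadThroughCache:
    """Serve reads from a local cache while the primary datastore is unavailable.

    Writes are rejected (read-only mode) rather than silently dropped.
    """

    def __init__(self, datastore):
        self.datastore = datastore   # hypothetical client exposing get()/put()
        self.cache = {}              # key -> (value, fetched_at)
        self.read_only = False

    def get(self, key):
        try:
            value = self.datastore.get(key)
        except DatastoreUnavailable:
            self.read_only = True
            if key in self.cache:
                value, _fetched_at = self.cache[key]   # possibly stale, but available
                return value
            raise
        self.cache[key] = (value, time.time())
        self.read_only = False       # a successful read clears read-only mode
        return value

    def put(self, key, value):
        if self.read_only:
            raise RuntimeError("datastore outage: service is in read-only mode")
        self.datastore.put(key, value)
        self.cache[key] = (value, time.time())
```

Serving stale data and rejecting writes is a trade-off, but for many applications it beats a complete outage.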

Saturday, July 4, 2009

Cloud and SaaS SLAs

Daniel Druker over at the SaaS 2.0 blog recently posted an extremely thorough description of what we should expect from Cloud and SaaS services when it comes to SLAs:

In my experience, there are four key areas to consider in your SLA:

First is addressing control: The service level agreement must guarantee the quality and performance of operational functions like availability, reliability, performance, maintenance, backup, disaster recovery, etc that used to be under the control of the in-house IT function when the applications were running on-premises and managed by internal IT, but are now under the vendor's control since the applications are running in the cloud and managed by the vendor.

Second is addressing operational risk: The service level agreement should also address perceived risks around security, privacy and data ownership - I say perceived because most SaaS vendors are actually far better at these things than nearly all of their clients are. Guaranteed commitments to undergoing regular SAS70 Type II audits and external security evaluations are also important parts of mitigating operational risk.

Third is addressing business risk: As cloud computing companies become more comfortable with their ability to deliver value and success, more of them will start to include business success guarantees in the SLA - such as guarantees around successful and timely implementations, the quality of technical support, business value received and even to money back guarantees - if a client isn't satisfied, they get their money back. Cloud/SaaS vendor can rationally consider offering business risk guarantees because their track record of successful implementations is typically vastly higher than their enterprise software counterparts.

Last is penalties, rewards and transparency: The service level agreement must have real financial penalties / teeth when an SLA violation occurs. If there isn't any pain for the vendor when they fail to meet their SLA, the SLA doesn't mean anything. Similarly, the buyer should also be willing to pay a reward for extraordinary service level achievements that deliver real benefits - if 100% availability is an important goal for you, consider paying the vendor a bonus when they achieve it. Transparency is also important - the vendor should also maintain a public website with continuous updates as to how the vendor is performing against their SLA, and should publish their SLA and their privacy policies. The best cloud vendors realize that their excellence in operations and their SLAs are real selling points, so they aren't afraid to open their kimonos in public.

Considering the sad state of affairs in existing SLAs, I'm hoping to see some progress here from the big boys, if nothing else, as a competitive advantage as they try to differentiate themselves.
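
To put some numbers on the "real financial penalties" point, here is a small sketch of how a buyer might compute monthly availability and the resulting service credit. The credit tiers are made-up examples, not any particular vendor's terms.

```python
# Example credit schedule: (minimum availability %, credit % of monthly fee).
# These tiers are illustrative only -- real SLAs define their own thresholds.
CREDIT_TIERS = [
    (99.9, 0),    # met the SLA: no credit
    (99.0, 10),   # below 99.9% but at least 99.0%: 10% credit
    (95.0, 25),   # below 99.0% but at least 95.0%: 25% credit
    (0.0, 50),    # below 95.0%: 50% credit
]

def availability_percent(total_minutes, downtime_minutes):
    """Availability over a billing period, as a percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

def service_credit(total_minutes, downtime_minutes, monthly_fee):
    """Return (availability, credit owed) for a billing period."""
    availability = availability_percent(total_minutes, downtime_minutes)
    for threshold, credit_pct in CREDIT_TIERS:
        if availability >= threshold:
            return availability, monthly_fee * credit_pct / 100.0
    return availability, 0.0

if __name__ == "__main__":
    # A 30-day month with 90 minutes of downtime on a $1,000/month service.
    minutes_in_month = 30 * 24 * 60
    availability, credit = service_credit(minutes_in_month, 90, 1000.0)
    print(f"availability: {availability:.3f}%  credit owed: ${credit:.2f}")
```

Under these example tiers, 90 minutes of downtime in a 30-day month works out to roughly 99.79% availability and a 10% credit. The point is that both sides can compute the number from the same published data, which is exactly where the transparency requirement comes in.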

Wednesday, March 18, 2009

Microsoft showing us how it's done, coming clean about Azure downtime

Following up on yesterday's Windows Azure downtime event, Microsoft posted an excellent explanation of what happened:

The Windows Azure Malfunction This Weekend

First things first: we're sorry. As a result of a malfunction in Windows Azure, many participants in our Community Technology Preview (CTP) experienced degraded service or downtime. Windows Azure storage was unaffected.

In the rest of this post, I'd like to explain what went wrong, who was affected, and what corrections we're making.

What Happened?

During a routine operating system upgrade on Friday (March 13th), the deployment service within Windows Azure began to slow down due to networking issues. This caused a large number of servers to time out and fail.

Once these servers failed, our monitoring system alerted the team. At the same time, the Fabric Controller automatically initiated steps to recover affected applications by moving them to different servers. The Fabric Controller is designed to be very cautious about taking broad recovery steps, so it began recovery a few applications at a time. Because this serial process was taking much too long, we decided to pursue a parallel update process, which successfully restored all applications.

What Was Affected?

Any application running only a single instance went down when its server went down. Very few applications running multiple instances went down, although some were degraded due to one instance being down.

In addition, the ability to perform management tasks from the web portal appeared unavailable for many applications due to the Fabric Controller being backed up with work during the serialized recovery process.

How Will We Prevent This in the Future?

We have learned a lot from this experience. We are addressing the network issues and we will be refining and tuning our recovery algorithm to ensure that it can handle malfunctions quickly and gracefully.

For continued availability during upgrades, we recommend that application owners deploy their application with multiple instances of each role. We'll make two the default in our project templates and samples. We will not count the second instance against quota limits, so CTP participants can feel comfortable running two instances of each application role.

This is a solid template to use in coming clean about your own downtime events. Apologize (in a human, non-boilerplate way), explain what happened, who was affected, and what is being done to prevent it in the future. Well done, Microsoft.
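
The serial-versus-parallel recovery trade-off Microsoft describes is easy to see in a toy sketch. Assuming a hypothetical recover_application() step that takes a fixed amount of time per application, serial recovery scales linearly with the number of affected applications, while a bounded parallel pass finishes in a fraction of the time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def recover_application(app_id, recovery_seconds=0.1):
    """Stand-in for moving one application to a healthy server."""
    time.sleep(recovery_seconds)   # simulate the per-application recovery work
    return app_id

def recover_serially(app_ids):
    start = time.time()
    for app_id in app_ids:
        recover_application(app_id)
    return time.time() - start

def recover_in_parallel(app_ids, workers=16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(recover_application, app_ids))
    return time.time() - start

if __name__ == "__main__":
    apps = [f"app-{i}" for i in range(100)]
    print(f"serial:   {recover_serially(apps):.1f}s")    # ~100 * 0.1s
    print(f"parallel: {recover_in_parallel(apps):.1f}s")  # ~(100 / 16) * 0.1s
```

The caution is still warranted — a fabric controller that recovers everything at once can make a bad situation worse — but bounding the parallelism, as in the sketch, is a middle ground between "a few at a time" and "all at once."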

Thursday, March 5, 2009

Google App Engine transparency quick check-in

Keep it up, Google!

Wednesday, February 25, 2009

Google launches health status dashboard for Google Apps!

Announced here and you can see it here. No time to review it today, but I'll be all over this in the next couple of days. Kudos to Google for getting this out!

Update: VERY cool to see an extremely detailed post-mortem from Google on the recent Gmail downtime.