Thursday, January 14, 2010

Cloud health RSS feed

You may have noticed I've added a new module to the right side of this blog. This is a new feed I created (using Yahoo Pipes) combining the status update feeds of all of the major cloud providers that offer them, in effect creating a single stream of cloud health status. Feel free to play with the pipe and add anything I may have missed.


Update: Improved on the original pipe by prepending each cloud feed with its name, to more easily tell which cloud is reporting the problem:
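For anyone who'd rather script this than use Pipes, the same idea is a few lines of Python: parse each provider's RSS channel, prefix every item title with the channel's name, and merge. The sample feeds below are made-up stand-ins for the real provider status feeds:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feeds standing in for real provider status feeds.
AWS_FEED = """<rss version="2.0"><channel><title>AWS Status</title>
<item><title>EC2 elevated error rates</title></item>
</channel></rss>"""
GAE_FEED = """<rss version="2.0"><channel><title>App Engine Status</title>
<item><title>Datastore latency</title></item>
</channel></rss>"""

def merged_items(feeds):
    """Combine items from several RSS feeds, prefixing each title with
    the feed's channel title so the source cloud is obvious."""
    items = []
    for xml in feeds:
        channel = ET.fromstring(xml).find("channel")
        source = channel.findtext("title")
        for item in channel.iter("item"):
            items.append("[%s] %s" % (source, item.findtext("title")))
    return items

print(merged_items([AWS_FEED, GAE_FEED]))
```

A real version would fetch each feed URL on a schedule and re-publish the merged list as a new feed, which is essentially what the pipe does.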



Amazon responds to scalability concerns

Thanks to the fine reporting by Data Center Knowledge, we now have an official response from Amazon regarding the claims against their cloud capacity:

Amazon says that if customers are experiencing performance problems, it isn’t because EC2 is overloaded. “We do not have over-capacity issues,” said Amazon spokesperson Kay Kinton. “When customers report a problem they are having, we take it very seriously. Sometimes this means working with customers to tweak their configurations or it could mean making modifications in our services to assure maximum performance.”

Only time will tell whether this is a recurring issue (and something competitors will exploit), or an unsubstantiated fluke.

Update: The debate continues...

Wednesday, January 13, 2010

EC2 cloud beginning to burst?


There is talk of EC2 starting to hit some real world limits:
Amazon in the early days was fantastic. Instances started up within a couple of minutes, they rarely had any problems and even their SMALL INSTANCE was strong enough to power even the moderately used MySQL database. For a good 20 months, all was well in the Amazon world, with really no need for concern or complaint.
...
As time went on, and our load increased, the real usefulness of the SMALL instances soon disappeared, with us pretty much writing off any real production use of them. This is a shame, as many of our web servers are not CPU intensive, just I/O intensive. Moving up to the "High-CPU Medium Instance" as our base image has given us some of that early-pioneer feeling that we are indeed getting the intended throughput that we expect from an instance. Feel somewhat cheated here, as Amazon is forcing us to go to a higher priced instance just because they can't seem to cope with the volume of Small instances.
The post goes on to show evidence of the slowdown. In addition, Cloudkick posted its own evidence:


Amazon's response, or lack thereof, will be telling. Competition with Rackspace is heating up, and this could develop into a major problem for AWS if not handled well.

Update: Some reaction from the community, re-affirming the importance of communication and transparency:
"It really bugs me that Amazon rarely admits the many faults that go on within their cloud. The status page just shows all green with no notes, even when you see multiple major sites drop off the web, report problems on Twitter, etc.

Numerous times I've had EC2 and EBS go out of contact (or simply have huge network latency, essentially the same). Since my instances run off RAID arrays of EBS volumes, essentially everything dies until EBS reappears." -- dangrossman (HN)


"You leave out one of the biggest problems with AWS, the super lack of documentation and well maintained community sites. This is really where much of the problem lies. If people could more effectively communicate, it would make everybody's life easier on there." -- Adam Nelson (Comments)


Tuesday, January 12, 2010

How to compare website monitoring services

When looking for an external website performance monitoring service, it's often unclear what exactly you should be comparing when reviewing the various offerings. The list below is an effort to break down the key components of any monitoring service, to help you make a more informed decision. This guide applies to any level of service, from your Pingdoms to your Webmetrics:

Monitoring platform
  1. Browser technology - is it emulated or is it using a real browser? Real browser monitoring is a must, unless you are on a tight budget or real-user performance is not a concern.
  2. URL versus transaction - does the service handle transactional monitoring, or only hard-coded URLs? The importance of this depends on what you are monitoring.
  3. Robustness of the scripting language - how does it handle redirects, errors, and minor changes to the site? Does it hard-code URLs and values, or does it use the DOM to navigate the site? Script a few transactions, see how easy the process is, and how reliably the script plays back.
  4. Ability to measure images/objects - does it download images/objects, or just the HTML? Does it download them in parallel, similar to a browser, or serially? Real browser monitoring should have this functionality built-in.
  5. Frequency - how often is your site/application tested? Generally, every 1-5 minutes is what you're going to want.
Alerting
  1. Timeliness - how quickly are you notified of an event?
  2. Details - how much does the alert tell you? Does it help you understand and solve the problem? Look for traceroutes, a screenshot, and details on where the problem happened.
  3. Accuracy - are you falsely alerted? Is the service missing events?
Reporting
  1. Clarity - are the reports easy to understand? Will others "get" them when you send them around?
  2. Flexibility - are you able to get the data you need out of the standard reports? Can you customize the reports to show you only what matters to you?
  3. Ad-hoc and scheduled - can you generate reports at will, and schedule them to be emailed to you and other parties?
  4. Speed - do you have to wait hours for your reports to generate?
  5. Details - can you see the holistic picture and also drill down into individual samples?
  6. Historical data - how long is your data kept in the system? Can you compare quarter-over-quarter performance? Year-over-year?
Network
  1. Reach - how many different locations around the world can you monitor from? Make sure you cover the areas that most of your customers are coming from.
  2. Flexibility - how many locations can you actually use? Does it cost extra to get what you need?
  3. Reliability - can you get data from the locations you are interested in consistently? Watch for missing samples from the locations that matter most to you.
Web portal
  1. Flexibility - Can you configure your monitoring settings (e.g. URL/script, timeout thresholds, alert contacts, monitoring locations, maintenance windows, keyword matching) quickly and easily? Make sure you can update all of the settings that may change without having to call anyone.
  2. User experience - is it pleasant to use the portal? Can you navigate around and find what you need easily? Is it fast? Is it reliable?
APIs
  1. Existence - does the service offer APIs?
  2. Power - how much can you do with the APIs? Can you take your data out of the system, both the raw data and the processed/averaged data? Can you control your settings, such as turning monitoring off and updating alert contacts, through the API?
  3. Diagnostics - can you run diagnostic tests, such as pings and traceroutes, using the API?
Diagnostics
  1. Traceroute - can you run traceroutes ad-hoc from any of the monitoring locations at any time?
  2. Real-time test - can you test your site from any of the locations ad-hoc, and get the results in real time?
Price
  1. Coverage - how much will it cost you to monitor all of your critical web sites/apps, on a reasonable interval, from the locations you require?
  2. Balance - find the balance between cost and quality of coverage. This may be the most difficult part of your decision. Look at the cost of downtime (e.g. lost revenue, negative customer reaction, marketing) to determine your ROI.
Other
  1. Root-cause - how much does the service help you in determining the root cause of an event, reducing your MTTR?
  2. Internal monitoring - does the service allow you to monitor your behind-the-firewall sites, or from inside your office?
  3. Load testing - can you use the same monitoring scripts to run load tests against your site?
  4. Professional services - does the service offer consulting to help you with broader performance and reliability concerns?
  5. Customer service - is the company pleasant to deal with? Will they be your partner, or the bane of your existence? Look for someone you can rely on, because monitoring is all about trust.
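To make the platform and alerting criteria above concrete, here's a minimal Python sketch of what a single monitoring sample involves: time the fetch, check the response, match a keyword, and classify the result. The `fetch` callable and thresholds are stand-ins for illustration, not any particular vendor's API:

```python
import time

def check_site(fetch, keyword, timeout=10.0):
    """One external monitoring sample: time a page fetch, verify a
    keyword appears in the body, and classify the result.
    `fetch` is any callable returning (status_code, body) -- a
    stand-in for a real HTTP client hitting your site."""
    start = time.monotonic()
    try:
        status, body = fetch()
    except Exception as exc:
        return {"ok": False, "reason": "error: %s" % exc}
    elapsed = time.monotonic() - start
    if elapsed > timeout:
        return {"ok": False, "reason": "timeout", "seconds": elapsed}
    if status != 200:
        return {"ok": False, "reason": "status %d" % status}
    if keyword not in body:
        return {"ok": False, "reason": "keyword missing"}
    return {"ok": True, "seconds": elapsed}

print(check_site(lambda: (200, "<html>Welcome</html>"), "Welcome"))
```

A real service runs checks like this every 1-5 minutes from many locations, records each sample for reporting, and fires an alert (with a traceroute and other diagnostics attached) when a sample fails.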

Of this list, the most important elements of any monitoring solution (the points of comparison that should weigh heaviest on your decision) are:
  1. Browser platform - is it a real browser or not?
  2. Frequency of monitoring - cost is generally the main factor here
  3. Locations - the more locations to choose from the better
  4. Alerting - is it accurate, is it useful?
  5. Reporting - a big reason to use a monitoring service

Wednesday, July 15, 2009

SLAs as an insurance policy? Think again.

From Benjamin Black:

"if SLAs were insurance policies, vendors would quickly be out of business.

given this, the question remains: how do you achieve confidence in the availability of the services on which your business relies? the answer is to use multiple vendors for the same services. this is already common practice in other areas: internet connection multihoming, multiple CDN vendors, multiple ad networks, etc. the cloud does not change this. if you want high availability, you’re going to have to work for it."

Well put. As Werner Vogels continues to preach, everything fails. Build your infrastructure such that SLAs are a bonus, not a requirement.
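Black's multi-vendor point is easy to quantify. Assuming the vendors fail independently (a real assumption: shared upstream dependencies break it), the chance that all of them are down at once is the product of their individual downtimes:

```python
def combined_availability(availabilities):
    """Probability that at least one of several independent services
    is up: 1 minus the chance they all fail at the same time."""
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down

# Two vendors at "three nines" each yields roughly six nines combined:
print(round(combined_availability([0.999, 0.999]), 6))  # 0.999999
```

This is why multihoming works: you don't need any single vendor to be perfect, only for them not to fail together.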

Thursday, July 9, 2009

Google raising the bar in post-mortem transparency

In the most detailed post-mortem I've ever seen come out of a cloud provider, Google chronicles the minute-by-minute timeline of their App Engine downtime event, reviews what went wrong, and commits to fixing the root cause at many levels:

What are we doing to fix it?

1. The underlying bug in GFS has already been addressed and the fix
will be pushed to all datacenters as soon as possible. It has also
been determined that the bug has been live for at least a year, so the
risk of recurrence should be low. Site reliability engineers are aware
of this issue and can quickly fix it if it should recur before then.

2. The App Engine team is accelerating its schedule to release the new
clustering system that was already under development. When this system
is in place, it will greatly reduce the likelihood of a complete
outage like this one.

3. The App Engine team is actively investigating new solutions to cope
with long-term unavailability of the primary persistence layer. These
solutions will be designed to ensure that applications can cope
reasonably with long-term catastrophic outages, no matter how rare.

4. Changes will be made to the Status Site configuration to ensure
that the Status Site is properly available during outages.

Read the entire post for the full effect. Looks like The Register should think about taking back some of the things they said?

Saturday, July 4, 2009

Cloud and SaaS SLAs

Daniel Druker over at the SaaS 2.0 blog recently posted an extremely thorough description of what we should be expecting from cloud and SaaS services when it comes to SLAs:
In my experience, there are four key areas to consider in your SLA:

First is addressing control: The service level agreement must guarantee the quality and performance of operational functions like availability, reliability, performance, maintenance, backup, disaster recovery, etc that used to be under the control of the in-house IT function when the applications were running on-premises and managed by internal IT, but are now under the vendor's control since the applications are running in the cloud and managed by the vendor.

Second is addressing operational risk: The service level agreement should also address perceived risks around security, privacy and data ownership - I say perceived because most SaaS vendors are actually far better at these things than nearly all of their clients are. Guaranteed commitments to undergoing regular SAS70 Type II audits and external security evaluations are also important parts of mitigating operational risk.

Third is addressing business risk: As cloud computing companies become more comfortable with their ability to deliver value and success, more of them will start to include business success guarantees in the SLA - such as guarantees around successful and timely implementations, the quality of technical support, business value received and even to money back guarantees - if a client isn't satisfied, they get their money back. Cloud/SaaS vendor can rationally consider offering business risk guarantees because their track record of successful implementations is typically vastly higher than their enterprise software counterparts.

Last is penalties, rewards and transparency: The service level agreement must have real financial penalties / teeth when an SLA violation occurs. If there isn't any pain for the vendor when they fail to meet their SLA, the SLA doesn't mean anything. Similarly, the buyer should also be willing to pay a reward for extraordinary service level achievements that deliver real benefits - if 100% availability is an important goal for you, consider paying the vendor a bonus when they achieve it. Transparency is also important - the vendor should also maintain a public website with continuous updates as to how the vendor is performing against their SLA, and should publish their SLA and their privacy policies. The best cloud vendors realize that their excellence in operations and their SLAs are real selling points, so they aren't afraid to open their kimonos in public.
Considering the sad state of affairs in existing SLAs, I'm hoping to see some progress here from the big boys, if nothing else as a competitive advantage as they try to differentiate themselves.
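To make Druker's "teeth" point concrete, here's a hypothetical sketch of how a tiered service-credit clause might compute the penalty owed for a month's measured uptime. The thresholds and credit percentages are invented for illustration, not taken from any real vendor's SLA:

```python
def sla_credit(uptime_pct, tiers):
    """Return the service-credit percentage owed for a month's
    measured uptime. `tiers` maps an uptime threshold to the credit
    owed when uptime falls below it; the deepest miss wins."""
    owed = 0.0
    for threshold, credit in tiers.items():
        if uptime_pct < threshold:
            owed = max(owed, credit)
    return owed

# Hypothetical schedule: below 99.9% owes a 10% credit,
# below 99.0% owes a 25% credit.
tiers = {99.9: 10.0, 99.0: 25.0}
print(sla_credit(99.5, tiers))  # 10.0
print(sla_credit(98.0, tiers))  # 25.0
```

An SLA with a schedule like this at least costs the vendor something when they miss; without it, the agreement is just marketing copy.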