Transparent Uptime: monitoring

Tuesday, January 12, 2010

How to compare website monitoring services

When looking for an external website performance monitoring service, it's often unclear what exactly you should be comparing when reviewing the various offerings. The list below is an effort to break down the key components of any monitoring service, to help you make a more informed decision. This guide applies to any level of service, from your Pingdoms to your Webmetrics:

Monitoring platform

Browser technology - is it emulated or is it using a real browser? Real browser monitoring is a must, unless you are on a tight budget or real-user performance is not a concern.
URL versus transaction - does the service handle transactional monitoring, or only hard-coded URLs? The importance of this depends on what you are monitoring.
Robustness of the scripting language - how does it handle redirects, errors, and minor changes to the site? Does it hard-code URLs and values, or does it use the DOM to navigate the site? Script a few transactions, and see how easy the process is and how reliably the script plays back.
Ability to measure image/objects - does it download image/objects, or just the HTML? Does it download them in parallel, similar to a browser, or serially? Real browser monitoring should have this functionality built-in.
Frequency - how often is your site/application tested? Generally, every 1-5 minutes is what you're going to want.

Alerting

Timeliness - how quickly are you notified of an event?
Details - how much does the alert tell you? Does it help you understand and solve the problem? Look for traceroutes, a screenshot, and details on where the problem happened.
Accuracy - are you falsely alerted? Is the service missing events?
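One way to put numbers on accuracy during a trial is to compare the alerts a service sent against the incidents you know actually happened. The sketch below is only an illustration, assuming you have logged both by hand; the function name, the incident identifiers, and the sample data are all hypothetical.

```python
def alert_accuracy(alerts, real_incidents):
    """Compare alerted incident IDs against incidents known to be real.

    Both arguments are sets of identifiers you assign yourself during a
    trial (hypothetical data, not something a service exports as-is).
    """
    true_alerts = alerts & real_incidents    # alerted, and something was wrong
    false_alerts = alerts - real_incidents   # alerted, but nothing was wrong
    missed = real_incidents - alerts         # something was wrong, no alert

    return {
        "true_alerts": len(true_alerts),
        "false_alert_rate": len(false_alerts) / len(alerts) if alerts else 0.0,
        "missed_rate": len(missed) / len(real_incidents) if real_incidents else 0.0,
    }


# Hypothetical trial log: five alerts received, four incidents confirmed.
alerts = {"a1", "a2", "a3", "a4", "a5"}
real_incidents = {"a1", "a2", "a3", "b9"}
result = alert_accuracy(alerts, real_incidents)
print(result)
```

A high false alert rate means you are being paged for nothing; a high missed rate means the service is not seeing events that matter.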

Reporting

Clarity - are the reports easy to understand? Will others "get" them when you send them around?
Flexibility - are you able to get the data you need out of the standard reports? Can you customize the reports to show you only what matters to you?
Ad-hoc and scheduled - can you generate reports at-will, and schedule them to be emailed to you and other parties?
Speed - do you have to wait hours for your reports to generate?
Details - can you see the holistic picture and also drill down into individual samples?
Historical data - how long is your data kept in the system? Can you compare quarter-over-quarter performance? Year-over-year?

Network

Reach - how many different locations around the world can you monitor from? Make sure you cover the areas that most of your customers are coming from.
Flexibility - how many locations can you actually use? Does it cost extra to get what you need?
Reliability - can you get data from the locations you are interested in consistently? Watch for missing samples from the locations that matter most to you.
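To watch for missing samples, you can compare how many samples each location should have produced with how many actually show up in your data. This is a rough sketch under the assumption that you can count received samples per location; the location names and counts below are made up.

```python
def sample_completeness(interval_minutes, hours, received_by_location):
    """Return the fraction of expected samples received per location."""
    expected = int(hours * 60 / interval_minutes)
    return {
        location: received / expected
        for location, received in received_by_location.items()
    }


# Hypothetical: 5-minute checks over 24 hours means 288 expected samples each.
received = {"New York": 288, "London": 274, "Tokyo": 216}
completeness = sample_completeness(5, 24, received)
for location, ratio in completeness.items():
    print(f"{location}: {ratio:.1%} of expected samples")
```

A location that regularly falls well short of its expected count is not giving you the coverage you are paying for.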

Web portal

Flexibility - Can you configure your monitoring settings (e.g. URL/script, timeout thresholds, alert contacts, monitoring locations, maintenance windows, keyword matching) quickly and easily? Make sure you can update all of the settings that may change without having to call anyone.
User experience - is it pleasant to use the portal? Can you navigate around and find what you need easily? Is it fast? Is it reliable?

APIs

Existence - does the service offer APIs?
Power - how much can you do with the APIs? Can you take your data out of the system, both the raw data and the processed/averaged data? Can you control your settings, such as turning monitoring off and updating alert contacts, through the API?
Diagnostics - can you run diagnostic tests, such as pings and traceroutes, using the API?

Diagnostics

Traceroute - can you run traceroutes ad-hoc from any of the monitoring locations at any time?
Real-time test - can you test your site from any of the locations ad-hoc, and get the results in real time?

Price

Coverage - how much will it cost you to monitor all of your critical web sites/apps, on a reasonable interval, from the locations you require?
Balance - find the balance between cost and quality of coverage. This may be the most difficult part of your decision. Look at the cost of downtime (e.g. lost revenue, negative customer reaction, marketing) to determine your ROI.
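The ROI arithmetic can be sketched in a few lines. Every figure below is a hypothetical placeholder, and lost revenue is only one part of the cost of downtime; customer reaction and marketing impact are harder to quantify and are left out of this simplified version.

```python
def monitoring_roi(revenue_per_hour, downtime_hours_avoided, annual_service_cost):
    """Return (savings, roi), where roi = (savings - cost) / cost."""
    savings = revenue_per_hour * downtime_hours_avoided
    roi = (savings - annual_service_cost) / annual_service_cost
    return savings, roi


# Hypothetical inputs: replace with your own estimates.
savings, roi = monitoring_roi(
    revenue_per_hour=2_000,       # revenue lost per hour of downtime
    downtime_hours_avoided=6,     # hours per year saved by faster detection
    annual_service_cost=4_000,    # yearly price of the monitoring service
)
print(f"Estimated savings: ${savings:,}  ROI: {roi:.0%}")
```

With these made-up inputs the service pays for itself three times over; the useful part of the exercise is seeing how sensitive the result is to your estimate of downtime avoided.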

Other

Root-cause - how much does the service help you determine the root cause of an event, reducing your MTTR?
Internal monitoring - does the service allow you to monitor your behind-the-firewall sites, or from inside your office?
Load testing - can you use the same monitoring scripts to run load tests against your site?
Professional services - does the service offer consulting to help you with broader performance and reliability concerns?
Customer service - is the company pleasant to deal with? Will they be your partner, or the bane of your existence? Look for someone you can rely on, because monitoring is all about trust.

Of this list, the most important elements of any monitoring solution (the points of comparison that should weigh most heavily on your decision) are:

Browser platform - is it a real browser or not?
Frequency of monitoring - cost is generally the main factor here
Locations - the more locations to choose from the better
Alerting - is it accurate, is it useful?
Reporting - a big reason to use a monitoring service
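If you want to turn these five points into a side-by-side comparison, a simple weighted score is one option. The weights, the 1-5 ratings, and the two services below are all hypothetical; this is a sketch of the approach, not a recommended set of numbers.

```python
# Weights reflect how heavily each criterion counts (hypothetical values
# that sum to 1.0); adjust them to your own priorities.
WEIGHTS = {
    "real_browser": 0.30,
    "frequency": 0.20,
    "locations": 0.15,
    "alerting": 0.20,
    "reporting": 0.15,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Combine 1-5 ratings into a single weighted score."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[name] * ratings[name] for name in weights)


# Hypothetical ratings (1 = poor, 5 = excellent) for two made-up services.
services = {
    "Service A": {"real_browser": 5, "frequency": 3, "locations": 4,
                  "alerting": 4, "reporting": 3},
    "Service B": {"real_browser": 1, "frequency": 5, "locations": 5,
                  "alerting": 3, "reporting": 4},
}
scores = {name: weighted_score(ratings) for name, ratings in services.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

The score is only as good as the ratings behind it, so base them on a trial of each service rather than on the feature list.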

Saturday, November 15, 2008

Comparing Amazon Web Services, SalesForce, and Zoho's online health dashboards

Now that there are three major SaaS players offering online service health dashboards, and one from Google on its way, I thought it would be a useful exercise to compare the offerings from Amazon Web Services, Salesforce, and Zoho. This will hopefully be helpful for anyone planning to launch their own health dashboard, and to the general online community in making sense of what is important to understand about these dashboards.

Disclaimer: If I have misrepresented anything, or if I missed any information, PLEASE let me know in the comments below.

What providers are we looking at today?

What is the URL of each status page (and are they easy to remember in times of need)?

Amazon Web Services: http://status.aws.amazon.com/ (Somewhat easy to remember)
Salesforce: http://trust.salesforce.com/ (Easy to remember)
Zoho: http://status.zoho.com/ (Extremely easy to remember)

What are these status pages called?

Amazon Web Services: "AWS Service Health Dashboard"
Salesforce: "Trust.salesforce.com - System Status" (Note: salesforce.com goes beyond simply providing system status by also providing security notices, both under their "Trust.salesforce.com" brand)
Zoho: "Zoho Service Health Status"

What services' health are reported on?

Amazon Web Services: All four core services (EC2, S3, SQS, SimpleDB), plus Mechanical Turk and FlexPay. They also break out the two S3 datacenter locations (EU and US), the two ends of a Mechanical Turk transaction (Requester and Worker), plus the EC2 API.
Salesforce: Only the core salesforce.com services across 12 individual systems (based on geographic location and purpose).
Zoho: All 23 Zoho services are covered, plus their mobile site and their single sign-on system.

What health information is provided?

Amazon Web Services: Current status, plus about 30 days of historical status. Status is determined to be one of "Service is operating normally", "Performance Issues", or "Service disruption". "Information messages" are occasionally provided.
Salesforce: Current status, plus exactly 30 days of historical status. Status is determined to be one of "Instances available", "Performance Issues", "Service disruption", or "Status not available". "Informational messages" are also provided on occasion.
Zoho: Current status and the response time for the past hour, in addition to historical uptime for the past week. Also provided are two graphs representing uptime and response time for the past seven days. If that wasn't enough, current uptime and response time from six geographic locations are also given.

Where does the uptime and performance data come from?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: Their own "Site 24x7" monitoring service.

What is considered downtime and what is considered a performance issue?

Amazon Web Services: No clue.
Salesforce: No clue.
Zoho: No clue.

Are real time updates provided during downtime events? Is it easy to find?

Amazon Web Services: Yes, but it is unclear how consistently, and how easy that information is to find.
Salesforce: Yes, right underneath the current status.
Zoho: Does not appear so, but if the issue is big enough they may update customers through their blog.

Is information provided on past downtime events?

Amazon Web Services: Yes. Mousing over a past performance or downtime event brings up a chronological log of events that took place, from detection to resolution. In addition, major downtime events are explained.
Salesforce: Yes. Clicking on any past event brings up a window giving the time of the event, a detailed description of the problem, and a root cause analysis.
Zoho: No. Unless they are described in the blog.

Is there a way to easily report problems users are having?

Amazon Web Services: Yes, clicking the "Report an Issue" link.
Salesforce: No, other than using the standard support channels.
Zoho: No, other than using the standard support channels.

How can you get notified of problems (without watching this page 24/7)?

Amazon Web Services: Yes. You can subscribe to an RSS feed of status changes for each service.
Salesforce: No.
Zoho: No.
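For a service that does publish an RSS feed, a small script can watch it for you. The sketch below parses a made-up RSS 2.0 sample rather than a real feed; the item titles and the assumption that a healthy entry is titled "Service is operating normally" are illustrative only, and a real feed's wording and layout may differ.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed's titles and wording will differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Service Status</title>
    <item>
      <title>Service is operating normally</title>
      <pubDate>Sat, 15 Nov 2008 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Performance issues</title>
      <pubDate>Sat, 15 Nov 2008 09:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def problem_items(feed_xml, ok_title="Service is operating normally"):
    """Return (pubDate, title) for every item whose title is not ok_title."""
    root = ET.fromstring(feed_xml)
    problems = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        published = item.findtext("pubDate", default="").strip()
        if title != ok_title:
            problems.append((published, title))
    return problems


problems = problem_items(SAMPLE_FEED)
for published, title in problems:
    print(f"{published}: {title}")
```

In practice you would fetch the feed on a schedule and send yourself a message whenever a new problem item appears.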

Conclusions: The best practices for online service health dashboards are still being formed, and it's clear that each service provider has approached the need for transparency differently. Amazon Web Services provides a simple and easy-to-understand overview of the health of each service, but provides little insight into who is impacted and what specific functionality is down. Salesforce provides clear insight into which customers may be affected by an event, but offers little insight into specific functionality that may be down or slow. Zoho provides the most data by far for each service they provide, but does not have a system in place to communicate details about specific downtime events beyond the company blog. Amazon and Salesforce give no insight into how they collect the health information, and all three give no information on what is meant by downtime or performance problems.

Closing questions for each provider:

Amazon Web Services: What does "EC2 API" actually mean? Which API is this referring to, and why not cover the APIs for the other services?
Salesforce: Does each server status cover every application level and API on that server? Can you offer more insight into specific services?
Zoho: Do you expect to add details about current and past downtime events to the health dashboard? What do you expect your customers to do when they see a red light? If you answer "Email Support", you don't get the power of this status page.
To all: How is the health actually monitored (especially for the GUI-focused Salesforce and Zoho services)? Working at a (the best) web monitoring company, I know how hard it is to monitor complex web applications.

Notable mentions: The following services also offer health dashboard pages, but to keep the comparison from getting overly complex I decided to leave them out. If anyone would like me to review these, or any other service that I missed, I'd be more than happy to. Just leave a note in the comments.
