Thursday, January 8, 2009

Salesforce.com down for over 30 minutes, and what we can learn from it

See what the blogosphere was saying...and see what more traditional media was saying.

Update: Again, Twitter ends up being the best place to confirm a problem and get updates across the world:

Update 2: Salesforce has posted an explanation of what led to the downtime (
from trust.salesforce.com):
"6:51 pm PST : Service disruption for all instances - resolved
Starting at 20:39 UTC, a core network device failed due to memory allocation errors. The failure caused it to stop passing data but did not properly trigger a graceful fail over to the redundant system as the memory allocation errors where present on the failover system as well. This resulted in a full service failure for all instances. Salesforce.com had to initiate manual recovery steps to bring the service back up.
The manual recovery steps was completed at 21:17 UTC restoring most services except for AP0 and NA3 search indexing. Search of existing data would work but new data would not be indexed for searching.
Emergency maintenance was performed at 23:24 UTC to restore search indexing for AP0 and NA3 and the implementation of a work-around for the memory allocation error.
While we are confident the root cause has been addressed by the work-around the Salesforce.com technology team will continue to work with hardware vendors to fully detail the root cause and identify if further patching or fixes will be needed.
Further updates will be available as the work progresses."

Update 3: Lots of coverage of this event all over the web. All of the coverage focuses on the downtime itself, how unacceptable it is, and bad this makes the cloud look. That's all crap. Everything fails. In-house apps more-so then anything. We can't avoid downtime. What we can avoid is the communication during and after the event, to avoid situations like this:
"Salesforce, the 800-pound gorilla in the software-as-a-service jungle, was unreachable for the better part of an hour, beginning around noon California time. Customers who tried to access their accounts alternately were unable to reach the site at all or received an error message when trying to log in.

Even the company's highly touted public health dashboard was also out of commission. That prompted a flurry of tweets on Twitter from customers wondering if they were the only ones unable to reach the site."

That's where SaaS providers need to focus! Create lines of communication, open the kimono, and let the rays of transparency shine through. It's completely in your control.

No comments:

Post a Comment