On September 15, RingCentral experienced a system-wide service outage that affected its entire customer base for hours.
RingCentral explained the outage on its Facebook page as follows: “This morning we experienced a failure of our primary database system. This failure also unexpectedly impacted our backup database. We implemented our disaster recovery system in order to restore your connectivity. The root cause of the failure is currently under investigation here at RingCentral, as well as escalated at Oracle.”
This prompted us to write an article on reliability, redundancy, and fault tolerance, and what they mean in practice.
Be sure to keep an eye out for it. You’ll be seeing it very soon!
UPDATE 9/17/2010: Late yesterday, RingCentral released further information about its 9/15/2010 outage. Here is what it posted:
“The outage that occurred on Wednesday from 7:00am – 8:50am PDT was the result of a software error that caused the primary & redundant databases to fail. Systems were restored by activating our standby database systems. Customers experienced a loss of inbound calls and web access to their account. Once the service came back up, it took a number of hours to work through the fax backlog. For most all customers dialtone & outbound calling remained available throughout. It is RingCentral’s policy to do everything possible to prevent any interruption of service to our customers. We understand that RingCentral services are vital to your business and we apologize for the inconvenience. Please know that RingCentral is committed to providing exceptional service and we appreciate your business.”