Service Disruptions on 13-14 November 2012

As you may know, dotMailer experienced a series of service disruptions this week, including periods of full and partial unavailability. The first disruption began at 12:50 GMT on 13 November 2012, and the system was fully restored following the last disruption at 01:05 GMT on 15 November 2012.

We are very sorry for the impact this exceptional incident has had, and continue to work with our suppliers to increase the robustness and resilience of our network.

Cause

The cause of this service disruption is suspected to be a faulty switch in our data centre. Although this switch was part of a redundant pair, it did not fail over as expected. This pair of switches is responsible for key communications between different systems in our network.

Impact

During three periods of total service disruption (Tuesday 12:50 to 15:50, Wednesday 08:50 to 12:00 and 18:00 to 23:55), dotMailer, dotSurvey, our websites and our email were not accessible. These outages had knock-on effects on our image serving systems; as a result, some of our users experienced problems uploading images and saving campaigns during the periods between outages.

Although an emergency maintenance message was put up, our default message "Down for planned maintenance" returned after the second outage. We corrected it as soon as we became aware of it, and we apologise for the inaccurate description.

Remediation

The entire faulty switch cluster was replaced with spare hardware that we keep in the data centre.

At the time of posting this message, the root cause of the failure has not been determined. The pair of switches has been removed from service and taken to the laboratory for testing; we are continuing to work with our suppliers to prevent further issues and ensure our long-term availability.

Postmortem

A full postmortem is currently underway; specific items noted so far include:

  • Finish the root cause analysis of the outage and follow up as appropriate
  • Identify why the failover did not occur as expected and follow up as appropriate

During the incident we kept users updated via our support desk and our Twitter feed. If you are not subscribed to these, we encourage you to subscribe in order to get the most current information about service issues. Following feedback, we are reviewing whether to deliver this information through other channels in future.

We have recently invested over £250,000 in our new storage area network hardware for our London data centre and are actively investing in new hardware for this and other sites to ensure we regain our usual >99.99% uptime and prepare for increasing demand.

If you have any questions about this incident or would like to discuss the matter further, please contact your account manager.

Skip Fidura
Client Services Director, dotMailer
