RFO: XN5 + XN6 Outage 08/01/2011

Dear All,

I can now confirm that service has been restored to users on XN5 and XN6, and we are now checking each VM individually to ensure everyone is up. If your virtual machine is not yet online, please allow another 10 minutes; if it is still offline after that, reboot it via SolusVM. If you are still having issues, please open a ticket.

Yesterday, XN5 and XN6 crashed due to high load. We investigated the condition of both systems and removed the VMs that were using large amounts of CPU in order to restore stability.

Today at UK peak time the same thing happened, again on both nodes at the same time. We used several methods to restore service, including starting VMs in small batches while observing load. On multiple occasions the systems hung and began showing signs of packet loss.
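
For readers who want to picture the batch-start approach, here is a minimal sketch in Python. It is an illustration only, not the tooling we used: the `start_vm` callback, the batch size, and the load threshold are all hypothetical values chosen for the example.

```python
import os
import time


def start_in_batches(vm_names, start_vm, get_load=lambda: os.getloadavg()[0],
                     batch_size=5, max_load=8.0, poll_seconds=10.0,
                     sleep=time.sleep):
    """Start VMs a few at a time, pausing while the host load is too high.

    start_vm is a hypothetical callback that boots one VM by name.
    batch_size and max_load are illustrative defaults, not tuned values.
    """
    names = list(vm_names)
    started = []
    for i in range(0, len(names), batch_size):
        # Wait until the 1-minute load average drops below the threshold
        # before launching the next batch.
        while get_load() > max_load:
            sleep(poll_seconds)
        for name in names[i:i + batch_size]:
            start_vm(name)
            started.append(name)
    return started
```

The load check and the sleep function are passed in as parameters so the routine can be exercised without touching a real host. By default it reads the 1-minute load average via `os.getloadavg()`, which is available on Unix-like systems only.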

On further investigation, we found that under stress the RAID card on these systems was briefly going offline, despite the absence of any I/O errors from the kernel. This has been identified as a driver issue; it had not shown itself before now because the load on these systems had been relatively consistent.

To resolve this, we have rolled out new kernels on the host machines with a newer aacraid driver (the driver for our Adaptec RAID cards). Since then we have been able to start VMs in a much shorter timeframe, and the systems are showing a clear performance improvement.

We are very sorry for any inconvenience this has caused you this evening. The outage lasted far, far longer than we would like, or than you would expect from xenSmart.

We are implementing additional monitoring on these nodes and will be putting together a rollout plan, both to ensure this does not happen to any other nodes and to confirm that the solution we have implemented remains stable.

Given the duration of this outage, we are prepared to extend our network-based SLA to cover it. Any customers who would like to be reimbursed for this outage should follow the procedure in our Service Level Agreement (at the bottom of the xenSmart website), and the claim will be honored.

Regards,

Chris Elliott
Technical Director

Update: We have found the same bug affecting other nodes:

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?

We will be updating these in due course.
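
As a rough illustration of how the messages above can be spotted on a node, here is a small Python sketch that counts aacraid abort and reset requests in kernel log lines. The pattern is built only from the three lines quoted above; where the lines come from (for example `dmesg` output or a kernel log file) is left to the caller, and the function name is our own invention for this example.

```python
import re
from collections import Counter

# Matches the two message types quoted in this update.
AACRAID_PATTERN = re.compile(r"aacraid: Host adapter (abort|reset) request")


def count_aacraid_events(lines):
    """Return a Counter of 'abort' and 'reset' requests found in lines."""
    counts = Counter()
    for line in lines:
        match = AACRAID_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Any non-zero count on a node running the older driver would suggest it is exposed to the same problem, although this sketch does not say anything about how severe the condition is.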