[Resolved] xn9 Issue

xn9 is currently having issues. Although the system and VPS are responding to ping, it looks like an I/O or RAID-related problem. We are just waiting on the DC to attach a KVM now.

Update 1: We have confirmed this is a RAID failure and we are currently working to restore the array.

Update 2: Waiting on DC again at the moment, sorry for the delay.

Update 3: Unfortunately it appears the RAID array has collapsed entirely and is unusable, rather than simply being degraded due to a bad disk. We are working through our options to restore the array.

Update 4: We have now re-assembled both sides of the RAID set and the system has booted off a degraded RAID-10 volume; VPS are currently starting up. We will now inspect the integrity of the array and hot-swap any suspect disks.

Update 5: Disks p1 and p2 make up the first half of the RAID-10 array (together, a RAID-1 set). Disk p2 is bad, and a read error from it caused the system to hang. The system is currently running off the bad disk, p2, while p1 rebuilds. Once the rebuild has completed we will immediately take p2 offline and replace the disk. The array is fragile at this point, but we have no reason to believe the current rebuild will not complete successfully. Many thanks for your patience.
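For context on what we watch during a rebuild like this: on a Linux software RAID (mdadm) host, array state and rebuild progress are exposed in /proc/mdstat. A minimal sketch of reading it, shown here against sample output (the device names and figures are illustrative, not xn9's actual layout):

```shell
#!/bin/sh
# Sketch: what a degraded RAID-10 rebuild looks like in /proc/mdstat.
# On the host itself you would read /proc/mdstat directly; a sample is
# embedded here so the parsing can be demonstrated anywhere.
mdstat_sample='md0 : active raid10 sdd1[3] sdc1[2] sdb1[1](F) sda1[0]
      976772992 blocks 64K chunks 2 near-copies [4/3] [U_UU]
      [=====>...............]  recovery = 27.5% (134320128/488386496) finish=42.1min speed=140160K/sec'

# [4/3] [U_UU] = three of four members up (degraded); (F) marks the failed
# disk; the recovery line shows rebuild progress toward full redundancy.
printf '%s\n' "$mdstat_sample" | grep -E '\[[0-9]+/[0-9]+\]|recovery'
```

Once the recovery line disappears and the markers read `[4/4] [UUUU]`, redundancy is fully restored.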

Update 6: Full redundancy and performance have now been restored to the array.

[Resolved] DDoS Attack

We are currently facing a 1+ GB/s DDoS attack on our systems. We are working to mitigate this as quickly as possible, and all updates will be provided here.

Update 1: This appears to be under control; we have null-routed the IPs in question and are monitoring the situation very closely.

Update 2: Traffic is now flowing normally again and we now consider this resolved.
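Null-routing (blackholing) an IP tells the router to drop all traffic destined for it, sacrificing that one address to protect the rest of the network. A minimal sketch of how this is commonly done with iproute2 on a Linux router (the address is from the RFC 5737 documentation range, not a real target; the DRY_RUN wrapper is our addition so the commands can be previewed without root):

```shell
#!/bin/sh
# Sketch: blackhole an attacked IP so its traffic is dropped before it
# saturates the uplink. DRY_RUN=1 prints the commands instead of running them.
TARGET_IP="203.0.113.50"          # illustrative address (RFC 5737 range)
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run ip route add blackhole "${TARGET_IP}/32"   # drop all traffic to the IP
run ip route show "${TARGET_IP}/32"            # verify the route is present
```

The trade-off is that the targeted IP goes dark for legitimate traffic too, which is why this is a last-resort mitigation.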

CP1 MySQL Issue

We are aware of a MySQL issue on CP1, our shared/reseller server, which occurs when the daily backups run at around 4AM. It is normal for MySQL database tables to lock while a particular database is being backed up; however, we have received reports of MySQL being inaccessible for extended periods during the backup cycle.

We are working diligently to resolve this, and have a Senior Admin monitoring the situation again tonight around the time this has been occurring.
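For anyone hitting the same problem elsewhere: the usual culprit is that mysqldump holds table locks for the duration of each dump. A hedged sketch of one common workaround, dumping InnoDB databases with --single-transaction so reads and writes continue during the backup (the database name and paths are made up; MyISAM tables still lock regardless of this flag):

```shell
#!/bin/sh
# Sketch: a nightly dump that avoids long table locks for InnoDB databases.
# --single-transaction takes a consistent snapshot without locking InnoDB
# tables; --quick streams rows instead of buffering whole tables in memory.
# DRY_RUN=1 previews the command instead of running it.
DB="example_db"                                # hypothetical database name
DEST="/backup/${DB}-$(date +%F).sql.gz"        # illustrative backup path
DRY_RUN="${DRY_RUN:-1}"

CMD="mysqldump --single-transaction --quick ${DB}"
if [ "$DRY_RUN" -eq 1 ]; then
    echo "would run: ${CMD} | gzip > ${DEST}"
else
    $CMD | gzip > "$DEST"
fi
```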

Thanks

[Complete] Rolling Restarts 11/01/2011

Tonight at 8PM we are performing maintenance on XN1, the last of our servers affected by the kernel driver bug.

Updates will be provided here as usual.

XN1 (Complete)
8.00PM – Going down for reboot now
8.04PM – System is back
8.16PM – All VPS have been restarted and are booting up now

The maintenance has now been completed. Many thanks for your patience.

[Complete] Rolling Restarts 10/01/2011

Tonight we are continuing with maintenance on XN12 and XN3.

Updates will be provided here throughout.

XN12 (Complete)
8.00PM – Going down for reboot now
8.03PM – System is up and starting rolling restarts of VPS
8.10PM – All VPS have been restarted

XN3 (Complete)
8.24PM – Going down for reboot now
8.26PM – System has failed to boot into the new kernel, investigating now
8.37PM – This server appears to have a bootloader (grub) issue which we are working to resolve
8.53PM – Looks like this isn’t such a simple issue. We are still working on it and will restore service ASAP.
9.04PM – Server is now booting up on the old kernel and we are investigating why the new kernel failed to boot. It’s the same hardware as our other servers.
9.22PM – Issue resolved and the system is now up on the new kernel. Restarts of VPS to follow
9.38PM – All VPS have been restarted

Today’s maintenance is now complete. Apologies for the delay on XN3; however, this work was absolutely mandatory to ensure the machine is not vulnerable to a driver bug which affected two other systems earlier this week.

[Complete] Rolling Restarts 09/01/2011

We are rebooting XN11 and XN14 tonight, as per the earlier email, to patch against a driver issue that affected two servers yesterday.

Updates will be provided here throughout.

XN11 (Complete)
8.00PM – Going down for restart now
8.04PM – Small issue related to the BIOS on the machine, sorting now
8.06PM – Server is booting now
8.10PM – Server is up and starting rolling restarts of VPS now
8.24PM – All VPS have been restarted and should be up. Please be aware the RAID array on this machine is now doing a verification, and I/O wait will be a little higher than usual until this completes. This shouldn’t take more than 30 minutes.

XN14
8.35PM – Going down for restart now
8.39PM – Server is up and starting rolling restarts of VPS now
8.49PM – All VPS are up and we are going through them individually to ensure they are on the correct kernel and that the matching kernel modules are copied inside each VPS.
9.15PM – There are about 5 VPS left to restart
9.20PM – All VPS have been restarted and should be up. Please be aware the RAID array on this machine is now doing a verification, and I/O wait will be a little higher than usual until this completes. As this is a larger server, it will take a few hours for the load to settle and the RAID verification to complete.

RFO: XN5 + XN6 Outage 08/01/2011

Dear All

I can now confirm that service has been restored to users on XN5 + XN6, and we are now going through VMs individually to ensure everyone is up. If your virtual machine is not already online, please allow another 10 minutes; if it is still down after that, reboot it via SolusVM. If you are still having issues, please open a ticket.

Yesterday, XN5 and XN6 crashed due to high load. We investigated the condition of both systems and removed the VMs that were using high amounts of CPU in order to restore stability.

Today at UK peak time the same thing happened, this time to both nodes at once. We used several methods to restore service, including starting VMs in small batches while observing load. On multiple occasions the systems hung and began showing signs of packet loss.

Upon investigating further, we found that under stress the RAID card in these systems was briefly going offline, despite the absence of any I/O errors from the kernel. This has been identified as a driver issue; it had not shown itself to date because the load on these systems has been relatively consistent.

To resolve this, we have rolled out new kernels on the host machines with a newer aacraid driver – the driver for our Adaptec RAID cards. We have since been able to start up VMs in a much shorter timeframe, and the systems are showing a clear performance improvement.
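After a rollout like this, each node can be sanity-checked to confirm it is running the intended kernel and driver build. A small sketch (this assumes the aacraid module declares a version string for modinfo to report; the fallback message is ours):

```shell
#!/bin/sh
# Sketch: report the running kernel and the aacraid driver version so a
# rebooted node can be checked against the patched build.
echo "kernel:  $(uname -r)"
echo "aacraid: $(modinfo -F version aacraid 2>/dev/null || echo 'module info unavailable')"
```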

We are very sorry for any inconvenience this has caused you this evening; the outage lasted far, far longer than we would like, or than you would expect from xenSmart.

We are implementing additional monitoring on these nodes and will be putting together a rollout plan, both to ensure this does not happen to any other nodes and to confirm that the solution we have implemented remains stable.

Based on the duration of this outage, we are prepared to extend our network-based SLA to cover it. Any customers who would like to be reimbursed for this outage, please follow the procedure in our Service Level Agreement (at the bottom of the xenSmart website) and it will be honored.

Regards,

Chris Elliott
Technical Director

Update: We have found the same bug affecting other nodes:

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?

We will be updating these in due course.
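Until then, nodes can at least be watched for early signs of the bug by scanning the kernel log for these messages. A minimal sketch (the log path is an assumption and varies by distribution; wiring the result into an alerting system is left out):

```shell
#!/bin/sh
# Sketch: count aacraid abort/reset messages in the kernel log and flag
# the node if any are present. KLOG is overridable so the check can be
# exercised against sample text.
KLOG="${KLOG:-/var/log/kern.log}"

hits=$(grep -c 'aacraid: Host adapter' "$KLOG" 2>/dev/null || true)
if [ "${hits:-0}" -gt 0 ]; then
    echo "WARNING: ${hits} aacraid controller events in ${KLOG}"
else
    echo "OK: no aacraid controller events in ${KLOG}"
fi
```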

[Resolved] xn5/xn6 outage

We are now seeing the same issue just encountered on xn5 occurring on xn6, and we are getting KVM attached to it now.

Update: We have hit a number of issues, including what appears to be a driver issue for our Adaptec RAID cards causing them to fall over under load. We are now deploying new kernels and will be booting up VPS in due course.

XN5 Status = Fully Restored. If your VPS is not up yet, please be patient; it is booting.

XN6 Status = Fully Restored. If your VPS is not up yet, please be patient; it is booting.