RFO: XN3 Outage

This morning the RAID controller in XN3 failed, causing the machine to reboot and then fail to start because no boot device was found. This is quite unfortunate as this server is only a few months old.

A datacentre engineer was assigned to check the status of the server, and was then instructed to remove the server from the rack and re-seat the card. This restored service, though the server crashed again shortly afterwards. A quick check of our spares inventory showed that there was no suitable replacement onsite, though we had a spare at the office.

A senior member of staff (myself) then set off as quickly as possible and arrived at the datacentre just after 1PM with a replacement card. On arrival at the rack I had great difficulty removing the server, and with the help of a datacentre engineer we managed to remove it by very carefully detaching the rails from the chassis while it was still in the rack.

It was apparent that the runners on the right-hand rail were slightly bent and catching on one of the adjustment screws, making it impossible to remove the server with the rails fitted, as designed.

Once the server was removed I replaced the card, adjusted the rails and powered the machine up. A quick check by our support team confirmed the problem had been resolved. Just after 2PM we saw VPS starting to come back up.

We are very sorry for the amount of time it took to resolve this issue and will be taking measures to avoid such incidents in future. An audit of our spares inventory has already been done and we will be adding a few additional items. We will also look at getting a larger spare diskless chassis onsite to enable a faster resolution to such issues.

Chris

[Resolved] xn1 down

xn1 is currently down (affecting some VPS in the 95.154.246.xx range) and we are waiting for the DC to attach a KVM. Apologies for the delay.

Further updates will follow.

Update: Sorry about this folks, it has taken over 2 hours for the datacentre to attach a KVM to the machine. We will be raising a complaint once this has been resolved.

Update 2: We have heard back from the datacentre: the server has no power. We are currently getting the PSU swapped out with a spare and service should be restored shortly. Thanks.

Update 3: We are still waiting for the DC to swap out the PSU.

Update 4: We are still waiting for the DC to swap out the PSU. Apparently there is no part onsite, when there definitely is one. They are re-checking, and if it cannot be found I will drive up there now. The support manager has also been notified about the response times.

Update 5: The PSU has been swapped and VMs are booting.