Categories
Outages

[Resolved] DDoS Attack

We are currently facing a 1+ GB/s DDoS attack on our systems. We are working to mitigate this as quickly as possible, and all updates will be provided here.

Update 2: Traffic is now flowing normally again, and we now consider this resolved.

Update: This appears to be under control; we have null-routed the IPs in question and are monitoring the situation very closely.
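For reference, null-routing an address on a Linux box is typically a one-line blackhole route; a minimal sketch, with a documentation placeholder address standing in for the attacked IP:

```shell
# Add a blackhole (null) route so all traffic to the attacked address
# is silently dropped. 203.0.113.45 is a placeholder (TEST-NET-3).
ip route add blackhole 203.0.113.45/32

# Confirm the route is installed
ip route show | grep blackhole

# Remove the route once the attack subsides
ip route del blackhole 203.0.113.45/32
```

Providers with BGP upstreams often announce the same /32 with a blackhole community instead, so the traffic is dropped before it ever reaches their port.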

[Resolved] vz2 outage

VZ2 is showing 90% packet loss and is being checked.

Update: Looks like a VPS was being attacked, and has been nullrouted now.
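Packet-loss figures like the 90% above are usually confirmed from an outside vantage point; a quick sketch with standard tools (the target address is a placeholder):

```shell
# Summary loss statistics: send 100 pings quietly, print totals only
ping -c 100 -q 192.0.2.10

# Per-hop loss report, useful for telling node-local loss from
# upstream loss (requires mtr to be installed)
mtr --report --report-cycles 50 192.0.2.10
```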

RFO: XN5 + XN6 Outage 08/01/2011

Dear All,

I can now confirm that service has been restored to users on XN5 + XN6, and we are now going through VMs individually to ensure everyone is up. If your Virtual Machine is not already online, please allow another 10 minutes and, if it is still down, reboot it via SolusVM. If you are still having issues, please open a ticket.

Yesterday, XN5 and XN6 crashed due to high load. We investigated the condition of both systems and removed those VMs which were using high amounts of CPU in order to restore stability.

Today at UK peak time the same thing happened, ironically to both nodes at the same time. We used several methods to restore service, including starting VMs in small batches while observing load. On multiple occasions the systems hung and began showing signs of packet loss.

Upon investigating further, we found that when under stress the RAID card in these systems was briefly going offline, despite the lack of any I/O errors from the kernel. This has been identified as a driver issue, and it has not shown itself to date because the load on these systems has been relatively consistent.

In order to resolve this, we have rolled out new kernels on the host machines with a newer aacraid driver (the driver for our Adaptec RAID cards). We have been able to start up VMs in a much smaller timeframe, and the systems are showing a clear performance improvement.
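As a sanity check after a rollout like this, the loaded aacraid driver version and any lingering error messages can be inspected from the shell; a minimal sketch:

```shell
# Report the version of the aacraid module the running kernel provides
modinfo aacraid | grep -i '^version'

# Scan the kernel ring buffer for the abort/reset messages that
# indicated the fault
dmesg | grep -iE 'aacraid: Host adapter (abort|reset) request'
```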

We are very sorry for any inconvenience this has caused you this evening; the outage has lasted far, far longer than we would like, or than you would expect from xenSmart.

We are implementing additional monitoring on these nodes and will be putting together a rollout plan, both to ensure this does not happen to any other nodes and to confirm that the solution we have implemented remains stable.

Based on the duration of this outage, we are prepared to extend our network-based SLA to cover it. Any customers who would like to be reimbursed for this outage should follow the procedure in our Service Level Agreement (at the bottom of the xenSmart website), and it will be honored.

Regards,

Chris Elliott
Technical Director

Update: We have found the same bug affecting other nodes:

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?

We will be updating these in due course.
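A monitoring hook for these messages can be as small as a cron-run grep over the kernel log; a hypothetical sketch (the alert address and mail delivery are placeholders, not our actual paging setup):

```shell
#!/bin/sh
# Hypothetical watchdog: raise an alert if the aacraid abort/reset
# messages appear in the kernel ring buffer.
if dmesg | grep -qE 'aacraid: Host adapter (abort|reset) request'; then
    # mail(1) stands in for whatever paging system is in use
    echo "aacraid abort/reset seen on $(hostname)" \
        | mail -s "RAID driver alert: $(hostname)" noc@example.com
fi
```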

[Resolved] xn5/xn6 outage

We are now seeing on xn6 the same issue just encountered on xn5, and we are getting KVM access to it now.

Update: We have hit a number of issues, including what appears to be a driver issue for our Adaptec RAID cards causing them to fall over under load. We are now deploying new kernels and will be booting up VPSes in due course.

XN5 Status = Fully Restored. If your VPS is not up yet, be patient; it is booting.

XN6 Status = Fully Restored. If your VPS is not up yet, be patient; it is booting.

[Resolved] Network Issues

Some customers may be experiencing intermittent packet loss reaching our network. We have confirmed this is not on our side and are waiting on further details from our upstream.

Once we have further information it will be posted here. Apologies for the inconvenience.

Update: This is now resolved. The temporary packet loss was caused by an issue with our connection to LINX via Sovereign House in London, which has been resolved.

RFO – Network Issues 18/10/2010

Hi

This email is the official RFO for today's outage, which caused downtime to all of our UK infrastructure.

At approximately 9 AM we became aware of a network issue affecting our racks, starting with packet loss and intermittent connectivity. Initial checks showed that our servers and switching equipment were running and operational, leading us to conclude this was an issue with the network upstream.

Our datacentre determined that the issue was affecting both the primary and secondary Cisco 6500 core routers, which are configured as a VSS-1440 redundant cluster. They executed their emergency procedures to identify the problem, but all tests completed within normal parameters.

A case was raised with Cisco TAC (Technical Assistance Centre) at 10:10 AM. A Cisco engineer logged into the routers to try to identify the problem, but after 3 hours was unable to provide a resolution. At this point our datacentre took the decision to reboot the routers; it was either a hardware fault or a software bug within them.

During the reboot, the primary router failed to boot up normally. The secondary router booted normally, restoring service.

Our datacentre is continuing to investigate this, both due to the unsatisfactory response time from Cisco on the TAC request and the fact that traffic did not automatically fail over to the secondary router, as per the design of the VSS-1440 redundant cluster.

We sincerely apologise for the downtime caused as a result of this, especially with it being so close to our recent planned power maintenance.

If you have any questions / queries about this, please submit a ticket and request it to be escalated to management.

Kind Regards,

The PCSmart Team

[Resolved] Network Issues

We are currently experiencing some intermittent network issues. Senior staff are aware of the situation, and we are awaiting further information from the datacentre.

Thanks for your patience.

Update 3: All systems are online; an RFO will be sent in due course.

Update 2: Please see the latest Update from the DataCentre:

The initial problem with a peering point that occurred earlier this morning has led to a problem within the VSS routing cluster. Our network team quickly eliminated as many causes as possible. The issue was then escalated to Cisco, and we are currently involved in a joint investigation to try to discern the underlying problem. As soon as we have any progress from this work we will inform you immediately.

Update: We have confirmed this is an issue with our upstream and will send a full RFO in due course.

[Resolved] VZ1 issues

Hi,

We are currently experiencing issues with our VZ1 node. We are looking into this now; however, it is more than likely that this machine will need to be rebooted, as it appears to have hung on the console.

More updates will be posted here as and when we have them.

Update @ 11:12 – The machine has been rebooted and has successfully booted back up. VPSes are now starting up one by one.

[Resolved] vz1 issues

vz1 is currently experiencing high load. We are working to bring it under control now.

Update: Load is now coming down.

[Resolved] xn11 I/O Issues

xn11 is currently suffering from some I/O issues again; this is being worked on. Further updates will follow.

Update: The system has been rebooted. The RAID is in a degraded state, but there are no signs of any disk issues; the array is now rebuilding, and the degraded disk has been noted as it may require replacement. Unlikely as it is, the drives are around the same age, so this could be a second faulty disk in a month. We will continue to keep a close eye on this server.
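For Adaptec controllers like these, array state and rebuild progress can be checked from the command line; a sketch assuming Adaptec's arcconf utility is installed and the card is controller 1:

```shell
# Logical-device health: look for "Optimal" vs "Degraded"
arcconf getconfig 1 ld | grep -i 'Status of logical device'

# Progress of any running task (e.g. a rebuild) on controller 1
arcconf getstatus 1
```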