Categories
Planned Maintenance

[Complete] Rolling Restarts 09/01/2011

We are rebooting XN11 and XN14 tonight, as per the earlier email, to patch against the driver issue that affected two servers yesterday.

Updates will be provided here throughout.

XN11 (Done)
8.00PM – Going down for restart now
8.04PM – Small issue related to the BIOS on the machine, sorting now
8.06PM – Server is booting now
8.10PM – Server is up and starting rolling restarts of VPS now
8.24PM – All VPS have been restarted and should be up. Please be aware the RAID array on this machine is now doing a verification, so I/O wait will be a little higher than usual until this completes. This shouldn’t take more than 30 minutes.
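For reference, on nodes with Adaptec controllers the verification progress can usually be checked with Adaptec's arcconf utility. A minimal sketch follows; the install path and controller number 1 are assumptions, not details of our actual setup:

```shell
# Sketch: check whether the Adaptec RAID controller is still running
# a background task such as a verification. The arcconf path and the
# controller number are assumptions; adjust both for your install.
ARCCONF="${ARCCONF:-/usr/StorMan/arcconf}"
CONTROLLER=1

if [ -x "${ARCCONF}" ]; then
    # GETSTATUS reports any running background task (verify, rebuild, etc.)
    "${ARCCONF}" GETSTATUS "${CONTROLLER}"
else
    echo "arcconf not found at ${ARCCONF}; adjust the path for your install"
fi
```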

XN14
8.35PM – Going down for restart now
8.39PM – Server is up and starting rolling restarts of VPS now
8.49PM – All VPS are up and we are going through them individually to ensure they are on the correct kernel and that the matching kernel modules have been copied inside each VPS.
9.15PM – There are about 5 VPS left to restart
9.20PM – All VPS have been restarted and should be up. Please be aware the RAID array on this machine is now doing a verification and I/O wait will be a little higher than usual until this completes. With this being a larger server it will take a few hours for the load to settle and the RAID verification to complete.
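The per-VPS kernel check described above can be sketched roughly as follows. The guest mount point and paths are hypothetical, not our actual tooling:

```shell
# Sketch: ensure the modules matching the host's running kernel are
# present inside a guest's filesystem. GUEST_ROOT is a hypothetical
# mount point, not a real path from our environment.
HOST_KERNEL="$(uname -r)"
GUEST_ROOT="${GUEST_ROOT:-/mnt/vps101}"

if [ -d "${GUEST_ROOT}/lib/modules/${HOST_KERNEL}" ]; then
    echo "ok: modules for ${HOST_KERNEL} already present"
elif [ -d "${GUEST_ROOT}/lib/modules" ]; then
    # Copy the host's module tree for the running kernel into the guest
    cp -a "/lib/modules/${HOST_KERNEL}" "${GUEST_ROOT}/lib/modules/"
    echo "copied modules for ${HOST_KERNEL} into guest"
else
    echo "guest root ${GUEST_ROOT} not mounted; skipping"
fi
```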

Categories
Outages

[Resolved] vz2 outage

VZ2 is showing 90% packet loss and is being checked.

Update: It looks like a VPS was under attack; it has now been null-routed.
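For context, null-routing a single IP on a Linux router is typically done with a blackhole route. A sketch, where 203.0.113.5 is a documentation address standing in for the attacked VPS, and the command is only printed unless APPLY=1 and the script runs as root:

```shell
# Sketch: null-route one attacked IP. 203.0.113.5 is a placeholder
# documentation address (RFC 5737), not the real VPS.
TARGET="203.0.113.5"
CMD="ip route add blackhole ${TARGET}/32"

if [ "${APPLY:-0}" = "1" ] && [ "$(id -u)" -eq 0 ]; then
    # Actually install the blackhole route (requires root)
    ${CMD}
else
    echo "dry run: ${CMD}"
fi
```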

Categories
Outages

RFO: XN5 + XN6 Outage 08/01/2011

Dear All

I can now confirm that service has been restored to users on XN5 + XN6, and we are now going through VMs individually to ensure everyone is up. If your Virtual Machine is not already online, please allow another 10 minutes; if it is still down, reboot it via SolusVM. If you are still having issues, please open a ticket.

Yesterday, XN5 and XN6 crashed due to high load. We investigated the condition of both systems and removed those VMs which were using high amounts of CPU in order to restore stability.

Today at UK peak time the same thing happened, again to both nodes at the same time. We used several methods to restore service, including starting VMs in small batches while observing load. On multiple occasions the systems hung and began showing signs of packet loss.

Upon investigating this further, we found that when under stress the RAID card on these systems was briefly going offline, despite the absence of any I/O errors from the kernel. This has been identified as a driver issue; it had not shown itself to date because the load on these systems has been relatively consistent.

In order to resolve this, we have rolled out new kernels on the host machines with a newer aacraid driver (the driver for our Adaptec RAID cards). We have been able to start up VMs in a much smaller timeframe and the systems are showing a clear performance improvement.
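A rollout like this is usually verified by checking the running kernel and driver version on each node. A sketch; the version string below is a placeholder, not the version we actually deployed:

```shell
# Sketch: report the running kernel and aacraid driver version.
# KNOWN_GOOD_KERNEL is a placeholder, not the real deployed version.
KNOWN_GOOD_KERNEL="2.6.18"
RUNNING="$(uname -r)"
DRIVER_VER="$(modinfo -F version aacraid 2>/dev/null || echo unknown)"

echo "kernel: ${RUNNING}, aacraid driver: ${DRIVER_VER}"
case "${RUNNING}" in
    "${KNOWN_GOOD_KERNEL}"*) echo "kernel matches the rollout target" ;;
    *) echo "kernel may still need the rollout" ;;
esac
```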

We are very sorry for any inconvenience this has caused you this evening; the outage has lasted far longer than we would like, or than you would expect from xenSmart.

We are implementing additional monitoring on these nodes and will be putting together a rollout plan, both to ensure this does not happen to any other nodes and to confirm that the solution we have implemented remains stable.

Based on the duration of this outage, we are prepared to extend our network-based SLA to cover it. Any customers who would like to be reimbursed for this outage, please follow the procedure in our Service Level Agreement (at the bottom of the xenSmart website) and it will be honoured.

Regards,

Chris Elliott
Technical Director

Update: We have found the same bug affecting other nodes:

aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter abort request (0,0,0,0)
aacraid: Host adapter reset request. SCSI hang ?

We will be updating these in due course.
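Nodes exhibiting the bug can be found by counting these messages in the kernel log. A minimal sketch; the log source is an assumption, so adjust it for your distribution:

```shell
# Sketch: count aacraid abort/reset events in the kernel ring buffer.
# A nonzero count suggests the node is hitting the driver bug.
LOG="$(dmesg 2>/dev/null || true)"
HITS="$(printf '%s\n' "${LOG}" | grep -c 'aacraid: Host adapter' || true)"

if [ "${HITS}" -gt 0 ]; then
    echo "WARNING: ${HITS} aacraid abort/reset events seen"
else
    echo "no aacraid abort/reset events found"
fi
```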

Categories
Outages

[Resolved] xn5/xn6 outage

We are now seeing on xn6 the same issue just encountered on xn5, and we are getting KVM access to it now.

Update: We have hit a number of issues, including what appears to be a driver issue for our Adaptec RAID cards causing them to fall over under load. We are now deploying new kernels and will be booting VPS back up in due course.

XN5 Status = Fully Restored. If your VPS is not up yet, please be patient; it is booting.

XN6 Status = Fully Restored. If your VPS is not up yet, please be patient; it is booting.

Categories
Uncategorized

[Resolved] Outage xn4 + xn5 + xn6

We currently have several nodes offline / with packet loss.

Xn4 and XN6 are restored.

Update: xn5 is now restored. Please be aware of some I/O wait until the system settles down.

Categories
Outages

[Resolved] Network Issues

Some customers may be experiencing intermittent packet loss to our network. We have confirmed this is not on our side and are waiting on further details from our upstream.

Once we have further information it will be posted here; apologies for the inconvenience.

Update: This is now resolved. The temporary packet loss was caused by an issue with our connection to LINX via Sovereign House in London.

Categories
Outages

RFO – Network Issues 18/10/2010

Hi

This email is the official RFO for the outage today, causing downtime to all of our UK infrastructure.

At approximately 9AM we became aware of a network issue affecting our racks, starting with packet loss and intermittent connectivity. Initial checks showed that our servers and switching equipment were running and operational, leading us to conclude this was an issue with the network upstream.

Our datacentre determined the issue was affecting both the primary and secondary Cisco 6500 core routers, which are configured in a VSS-1440 redundant cluster. They executed their emergency procedures to identify the problem, but all tests completed within normal parameters.

A case was raised with Cisco TAC (Technical Assistance Centre) at 10.10AM. A Cisco engineer logged into the routers to try to identify the problem, but after 3 hours was unable to provide a resolution. At this point our datacentre took the decision to reboot the routers; it was either going to be a hardware fault or a software bug within the routers.

During the reboot, the primary router failed to boot up normally. The secondary router booted normally, restoring service.

Our datacentre is continuing to investigate this, both due to the unsatisfactory response time from Cisco on the TAC request, and because the traffic did not automatically fail over to the secondary router as per the design of the VSS-1440 redundant cluster.

We sincerely apologise for the downtime caused as a result of this, especially with it being so close to our recent planned power maintenance.

If you have any questions / queries about this, please submit a ticket and request it to be escalated to management.

Kind Regards,

The PCSmart Team

Categories
Outages

[Resolved] Network Issues

We are currently experiencing some intermittent network issues. Senior staff are aware of the situation and we are awaiting further information from the datacentre.

Thanks for your patience.

Update 3: All systems online, an RFO will be sent in due course.

Update 2: Please see the latest Update from the DataCentre:

The initial problem with a peering point that occurred earlier this morning has led to a problem within the VSS routing cluster. Our network team quickly eliminated as many causes as possible. This issue was then escalated to Cisco and we are currently involved in joint investigation to try and discern the underlying problem. As soon as we have any progress from this work we will inform you immediately.

Update: We have confirmed this is an issue with our upstream and will send a full RFO in due course.

Categories
Planned Maintenance

[Completed] UK DC Maintenance Progress

Realtime updates (as close as possible) on the UK datacentre maintenance will be posted below. Watch this space!

19:00 01/10/09 – Servers are being prepared for power-down and are having the latest updates installed, including a kernel released today. Don’t worry: we are also rolling out automated kernel module updates, so you won’t need to copy them in manually.

00:00 02/10/09 – Servers are now being cleanly shut down. If you have a dedicated server and have not provided us with your login details, we strongly advise powering down your machine now.

01:00 02/10/09 – All servers and power are now down.

05:14 02/10/09 – The following servers are now online: VZ1, XN1, XN2, XN7, XN10

05:39 02/10/09 – All servers are now online, we are working through each node to ensure VPS come up ok.

06:05 02/10/09 – We have issued the all-clear. All of our VPS nodes and customer servers are online and have passed sanity checks. Some VPS nodes are doing routine quota checks and RAID verifications, so you may find your VPS sluggish for a few hours until things settle, or your VPS may not be online yet.

If your VPS is not up, please allow at least 1 hour from now for it to sort itself out. If it still doesn’t come up, you can reboot it via SolusVM or open a support ticket.

Categories
Planned Maintenance

Planned maintenance to Power Systems in UK Datacentre 02/10/2010

This is a copy of an email we sent out on 09/09/10 regarding mandatory power maintenance on 02/10/2010 at the UK-based datacentre we use.

Maintenance Task:

Controlled power outage to the Spectrum House facility, allowing for essential repairs to the main power systems.

Scheduled For:

Saturday 02/10/2010 between the hours of 00:00 and 08:00 GMT. We expect, and hope, to be back up sooner than 08:00 GMT, and senior staff will be in-office for the duration.

Further Details:

The datacentre recently installed a fourth 500kVA UPS into a modular system, to add additional resilience and capacity for further growth. As the system was designed to be modular, the appropriate connections were already available on the UPS systems. During installation, however, a fault was identified with the panel, which meant the additional UPS could not be connected.

In order to resolve this fault, the panel needs to be electrically isolated, as it cannot be worked on while live for safety reasons. The work is being carried out by a team of engineers from the manufacturer, and the datacentre will be manned with additional staff for the duration, with our senior staff also on standby throughout.

It goes without saying that we sincerely apologise for any inconvenience this may cause you, and we are equally frustrated about the circumstances of a power outage of this length to our equipment.

All servers will be powered down cleanly. If you are a dedicated server customer, please contact us so we can store your login details and initiate a graceful shutdown prior to the power loss. Although the window for the maintenance is 00:00 to 08:00, every effort is being taken to reduce the downtime to a minimum.

We will keep you updated as much as possible via our offsite status blog, accessed at http://pcsmarthosting.net during the maintainence.

Please don’t hesitate to ask if you have any further questions.