This email is the official RFO for today's outage, which caused downtime across all of our UK infrastructure.
At approximately 9 AM we became aware of a network issue affecting our racks, starting with packet loss and intermittent connectivity. Initial checks showed that our servers and switching equipment were running and operational, leading us to conclude that this was an issue with the upstream network.
Our datacentre determined the issue was affecting both the primary and secondary Cisco 6500 core routers, which are configured in a VSS-1440 redundant cluster. They executed their emergency procedures to identify the problem, but all tests completed within normal parameters.
A case was raised with Cisco TAC (Technical Assistance Centre) at 10:10 AM. A Cisco engineer logged into the routers to try to identify the problem, but after 3 hours was unable to provide a resolution. At this point our datacentre took the decision to reboot the routers, as the fault had to be either a hardware fault or a software bug within the routers.
During the reboot, the primary router failed to boot normally. The secondary router booted normally, restoring service.
Our datacentre is continuing to investigate this, both due to the unsatisfactory response time from Cisco on the TAC request, and the fact that traffic did not automatically fail over to the secondary router as per the design of the VSS-1440 redundant cluster.
We sincerely apologise for the downtime caused as a result of this, especially coming so close to our recent planned power maintenance.
If you have any questions or queries about this, please submit a ticket and request that it be escalated to management.
The PCSmart Team
We are currently experiencing some intermittent network issues. Senior staff are aware of the situation and we are awaiting further information from the datacentre.
Thanks for your patience.
Update 3: All systems online, an RFO will be sent in due course.
Update 2: Please see the latest update from the datacentre:
The initial problem with a peering point that occurred earlier this morning has led to a problem within the VSS routing cluster. Our network team quickly eliminated as many causes as possible. The issue was then escalated to Cisco and we are currently involved in a joint investigation to try to discern the underlying problem. As soon as we have any progress from this work we will inform you immediately.
Update: We have confirmed this is an issue with our upstream and will send a full RFO in due course.
Realtime updates (as close as possible) on the UK datacentre maintenance will be posted below. Watch this space!
19:00 01/10/09 – Servers are being prepared for powerdown and are having the latest updates installed, including a kernel released today. Don't worry, we are also rolling out automated kernel module updates, so you won't need to copy them in manually.
00:00 02/10/09 – Servers are now being cleanly shut down. If you have a dedicated server and have not provided us with your login details, we strongly advise powering down your machine now.
01:00 02/10/09 – All servers + power are now down.
05:14 02/10/09 – The following servers are now online: VZ1, XN1, XN2, XN7, XN10
05:39 02/10/09 – All servers are now online. We are working through each node to ensure every VPS comes up OK.
06:05 02/10/09 – We have issued the all-clear. All of our VPS nodes + customer servers are online and have passed sanity checks. Some VPS nodes are doing routine quota checks and RAID verifications, so you may find your VPS sluggish for a few hours until things settle, or that your VPS is not online yet.
If your VPS is not up, please allow at least 1 hour from now for it to sort itself out. If it still doesn't come up, you can reboot it via SolusVM or open a support ticket.