
RFO – Network Issues 18/10/2010

Hi

This email is the official RFO (Reason for Outage) for today's outage, which caused downtime across all of our UK infrastructure.

At approximately 9 AM we became aware of a network issue affecting our racks, starting with packet loss and intermittent connectivity. Initial checks showed that our servers and switching equipment were running and operational, leading us to conclude that this was an issue with the upstream network.

Our datacentre determined that the issue was affecting both the primary and secondary Cisco 6500 core routers, which are configured in a VSS-1440 redundant cluster. They executed their emergency procedures to identify the problem, but all tests completed within normal parameters.

A case was raised with Cisco TAC (Technical Assistance Centre) at 10:10 AM. A Cisco engineer logged into the routers to try to identify the problem, but after 3 hours was unable to provide a resolution. At this point our datacentre took the decision to reboot the routers, on the basis that the cause had to be either a hardware fault or a software bug within the routers.

During the reboot, the primary router failed to boot up normally. The secondary router booted normally, restoring service.

Our datacentre is continuing to investigate this, both because of the unsatisfactory response time from Cisco on the TAC request and because traffic did not automatically fail over to the secondary router, as the VSS-1440 redundant cluster is designed to do.

We sincerely apologise for the downtime this caused, especially as it came so soon after our recent planned power maintenance.

If you have any questions about this, please submit a ticket and request that it be escalated to management.

Kind Regards,

The PCSmart Team