RFO: Network Outage 18/07/12

In the early hours of this morning we became aware of a network issue affecting all servers in one of our racks, including our main website and support desk. We received reports of intermittent packet loss and of possible routing issues, as IPs were responding from some locations and not others.

A ticket was immediately logged with our datacentre by the member of staff online at the time, and after some basic checks were made on our hardware, the ticket was placed on hold for the attention of a network engineer.

We use a Cisco HSRP setup which provides the switch in each rack with two redundant uplinks; should one of those uplinks fail, the other should take over. Although both uplinks were reported as online at either end, our switch was dropping packets on the primary uplink. Because the interface never went down, the switch did not disable it and move to the secondary uplink, which caused the intermittent connectivity issues.
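
To illustrate the failure mode described above: a check based only on whether the link is up never triggers when the link stays up but drops traffic, whereas a check based on measured packet loss would. The following is a minimal sketch in Python under stated assumptions, not what runs on our switch. The class name UplinkLossMonitor, the window size and the loss threshold are all hypothetical, and the probe results are assumed to come from something like periodic pings across the uplink.

```python
from collections import deque


class UplinkLossMonitor:
    """Hypothetical helper: tracks recent probe results for one uplink.

    Flags the uplink as unhealthy when measured packet loss exceeds a
    threshold, even if the interface itself still reports as up.
    """

    def __init__(self, window=20, max_loss=0.2):
        # Keep only the most recent `window` probe results.
        self.results = deque(maxlen=window)
        # Fraction of lost probes tolerated before failing over (assumed value).
        self.max_loss = max_loss

    def record(self, reply_received):
        """Store one probe result: True if a reply came back, False if lost."""
        self.results.append(bool(reply_received))

    def loss_ratio(self):
        """Return the fraction of probes lost in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_fail_over(self, link_up=True):
        """Decide whether traffic should move to the secondary uplink."""
        # A link that is down always triggers failover.
        if not link_up:
            return True
        # A link that is up only triggers failover once the window is full
        # and the measured loss is above the threshold.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.loss_ratio() > self.max_loss
```

With a full window in which half of the probes are lost, should_fail_over(link_up=True) returns True, which is the case a purely link-state check misses. Waiting for a full window before acting is a deliberate choice in this sketch, so that one or two lost probes do not cause the uplinks to flap.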

Having double and triple checked everything, we reloaded the configuration and restarted the switch, which restored full connectivity to all systems.

Prior to this incident, this rack and its switch had well over a year of uptime. From what we have seen today, we can only conclude that either:

A) This was a one-off glitch (we prefer answers, but technology isn’t perfect!)

B) This was a bug in the firmware on the switch. We will check this with Cisco, though we installed the latest Cisco IOS before deployment.

We don’t expect any further issues at this point, but we will continue to monitor the switch closely and investigate further to prevent such an outage from happening again.

We sincerely apologise to all customers affected by this incident, and we will be honouring any SLA credit claims made via the procedures outlined on our website.

Chris