Incident Report - Outage edge-george-1
On Tuesday the 21st of June 2016, Spectrum engineers observed alarms on traffic flow performance through the edge router edge-george-1.
edge-george-1 is an MPLS edge router responsible for a large number of customers and network functions within the George Street data center. Customers not connected to edge-george-1, or customers that have a redundant path to other Spectrum data centers, would not have been affected by this outage.
After running network diagnostics and consulting with the hardware vendor, we were advised that there was a memory fault with the active supervisor engine on edge-george-1. We were advised that its failure was imminent and that we should manually switch to the redundant hot spare supervisor. This switchover was not expected to cause any downtime.
20:15 Staff initiated the transfer to the hot spare.
20:17 Staff could access edge-george-1 again and services were restored to the router with the exception of one 802.1AD interface.
20:26 Staff observed that the down interface was administratively down and re-enabled it. At this point the interface came up, but traffic would not flow through it.
20:51 Staff noticed the MAC address associated with the interface had been set to a null value.
20:56 Staff reconfigured the MAC address on the interface and all services were restored to edge-george-1.
After the outage, staff determined that the loss of connectivity was due to the MAC address on a critical interface being set to null by the system during the transfer from the running supervisor to the standby supervisor.
Staff understand that this is likely a bug within the router's operating system and have reported it to the hardware vendor. Spectrum staff have initiated a workaround by manually setting the MAC address on all interfaces that use 802.3AD encapsulation as part of our standard operating procedure. We do not expect a recurrence of this type of failure. Prior to this outage edge-george-1 had a system uptime of over 4 years.
We will roll out the new standard of static MAC addresses on 802.3AD interfaces throughout the network, under a planned hazard, once risk analysis is complete.
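As a rough illustration of the kind of check that could accompany this workaround, the sketch below flags interfaces whose MAC address is missing or all zeros after a supervisor switchover. It is only a sketch under stated assumptions: the interface names, the sample MAC values, the input format, and the function names are hypothetical and are not taken from the router's operating system or from Spectrum's tooling.

```python
import re

# Values treated as a "null" MAC once separators are stripped (assumption:
# a null MAC shows up either as an empty value or as all zeros).
NULL_MACS = {"", "000000000000"}


def normalize_mac(mac):
    """Strip separators and lowercase so different notations compare equal."""
    if mac is None:
        return ""
    return re.sub(r"[^0-9a-fA-F]", "", mac).lower()


def find_null_mac_interfaces(interfaces):
    """Return the sorted names of interfaces whose MAC is missing or all zeros."""
    return sorted(
        name
        for name, mac in interfaces.items()
        if normalize_mac(mac) in NULL_MACS
    )


if __name__ == "__main__":
    # Hypothetical inventory collected after a switchover; not real device output.
    sample = {
        "Port-channel1": "0000.0000.0000",
        "Port-channel2": "00:1b:54:aa:bb:cc",
        "Port-channel3": None,
    }
    for name in find_null_mac_interfaces(sample):
        print(f"{name}: MAC address is null, static MAC required")
```

A check of this kind would only detect the condition seen at 20:51; the static MAC configuration described above is what is intended to prevent it.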
We do not expect a recurrence of this outage.
You are welcome to call our help desk on 1300133299 if you would like to discuss this further.