Incident Report - Outage edge-george-1

Post by matt » Wed Jun 22, 2016 3:17 pm

On Tuesday the 21st of June 2016, Spectrum engineers observed alarms on traffic flow performance through the edge router edge-george-1.
edge-george-1 is an MPLS edge router responsible for a large number of customers and network functions within the George Street data center. Customers not connected to edge-george-1, or customers that have a redundant path to other Spectrum data centers, would not have been affected by this outage.

After running network diagnostics and consulting with the hardware vendor, we were advised that there was a memory fault in the active supervisor engine on edge-george-1. We were advised that its failure was imminent and that we should manually switch to the redundant hot-spare supervisor. No downtime was expected from this switchover.

20:15 Staff initiated the transfer to the hot spare.
20:17 Staff could access edge-george-1 again and services were restored to the router with the exception of one 802.1AD interface.
20:26 Staff observed that the down interface was administratively down and re-enabled it. At this point the interface came up, but traffic would not flow through it.
20:51 Staff noticed the MAC address associated with the interface had been set to a null value.
20:56 Staff reconfigured the MAC address on the interface and all services were restored to edge-george-1.
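
For readers who want to check the timeline arithmetic, here is a minimal Python sketch. It assumes all five timestamps above fall on the same evening and are in the same time zone; the event labels are paraphrased from the timeline, and the function names are made up for this illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline above; labels are paraphrased.
EVENTS = [
    ("20:15", "transfer to hot spare initiated"),
    ("20:17", "router reachable, one interface still down"),
    ("20:26", "interface re-enabled, no traffic flowing"),
    ("20:51", "null MAC address noticed"),
    ("20:56", "MAC address reconfigured, all services restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def step_durations(events):
    """Pair each event with the minutes elapsed since the previous event."""
    return [
        (later[1], minutes_between(earlier[0], later[0]))
        for earlier, later in zip(events, events[1:])
    ]


total = minutes_between(EVENTS[0][0], EVENTS[-1][0])
print(f"Total: {total} minutes")
for label, minutes in step_durations(EVENTS):
    print(f"+{minutes:>2} min  {label}")
```

Under these assumptions the span from 20:15 to 20:56 is 41 minutes, and the longest single gap (25 minutes) is between re-enabling the interface and noticing the null MAC address.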

After the outage, staff determined that the loss of connectivity was caused by the system setting the MAC address on a critical interface to null during the transfer from the running supervisor to the standby supervisor.

Staff understand that this is likely a bug within the router's operating system and have reported it to the hardware vendor. Spectrum staff have initiated a workaround by manually setting the MAC address on all interfaces that use 802.3AD encapsulation as part of our standard operating procedure. We do not expect a recurrence of this type of failure. Prior to this outage edge-george-1 had a system uptime of over 4 years.

Once risk analysis is complete, we will roll out the new standard of static MAC addresses on 802.3AD interfaces throughout the network under a planned hazard.
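
As an illustration only, a post-switchover check for this failure mode could look something like the Python sketch below. It is not Spectrum's actual tooling. The interface names, the inventory dictionary and the choice of which values count as "null" (empty or all zeros) are all assumptions made for the example; how the MAC addresses would be collected from a real router is left out.

```python
import re

# Assumption: "null" means a missing/empty value or an all-zero address.
# The report only says the MAC was "set to a null value".
NULL_MACS = {"", "000000000000"}


def normalize_mac(mac):
    """Strip separators and lowercase so different notations compare equal."""
    return re.sub(r"[^0-9a-fA-F]", "", mac or "").lower()


def interfaces_with_null_mac(inventory):
    """Return the sorted names of interfaces whose MAC is missing or all zeros."""
    return sorted(
        name
        for name, mac in inventory.items()
        if normalize_mac(mac) in NULL_MACS
    )


# Hypothetical inventory: names and addresses are invented for this example.
inventory = {
    "agg-1": "0000.0000.0000",
    "agg-2": "02:00:5e:10:00:01",
    "agg-3": None,
}
print(interfaces_with_null_mac(inventory))
```

With this made-up inventory the check flags agg-1 and agg-3, which is the kind of result that would have pointed at the affected interface sooner.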

We do not expect a recurrence of this outage.

You are welcome to give our help desk a call on 1300133299 if you want to discuss this further.

Re: Incident Report - Outage edge-george-1

Post by matt » Tue Jul 12, 2016 8:10 pm

On July 12th at 17:02, edge-george-1 suffered an SRAM parity error.

The error caused the router's bus to become unstable, and 6 of the 12 routing sub-processor cards failed. As a result, 60% of the traffic in and out of the router was forced onto a single interface that was designed for management traffic only, and became extremely unstable.

Engineering staff attempted to reset the interface cards and move critical traffic off the router in an attempt to regain full utilization of the failed cards. However, the router would not respond to configuration commands or diagnostic input in the expected way. The cards would not accept commands or talk on the bus.

At 17:35 the decision was made to reset the supervisor and bus of edge-george-1 (a power cycle of the cabinet). Before this was carried out, critical DNS infrastructure was moved to edge-george-3. The router was rebooted at 17:38.

At 17:49 the reboot was completed and all affected customers were restored.

A case has been re-opened with the hardware vendor. However, as this problem occurred on a different supervisor from the June 21st fault, the hardware vendor believes the two faults are not related.

The hardware vendor suggested that "As with all parity errors, a one-time occurrence may indicate only a transient issue. Monitor for a repeat occurrence; if multiple parity errors occur, a hardware replacement may be necessary."
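
The vendor's advice amounts to a simple rule: one parity error is watched, more than one prompts a hardware review. A minimal Python sketch of that rule follows. The 30-day window, the function name and the second timestamp are assumptions made for illustration; the vendor's advice gives no threshold, and this is not the vendor's or Spectrum's monitoring logic.

```python
from datetime import datetime, timedelta

# Assumption: errors more than 30 days before the latest one are ignored.
DEFAULT_WINDOW = timedelta(days=30)


def classify_parity_errors(timestamps, window=DEFAULT_WINDOW):
    """Return 'none', 'single' or 'repeat' for errors near the latest one."""
    if not timestamps:
        return "none"
    latest = max(timestamps)
    # Count only errors that fall within the window ending at the latest error.
    recent = [t for t in timestamps if latest - t <= window]
    return "repeat" if len(recent) > 1 else "single"


# The only parity error stated in this report (12 July 2016, 17:02).
observed = [datetime(2016, 7, 12, 17, 2)]
print(classify_parity_errors(observed))

# Hypothetical second error, invented to show the 'repeat' case.
with_repeat = observed + [datetime(2016, 7, 20, 9, 30)]
print(classify_parity_errors(with_repeat))
```

A "single" result corresponds to the vendor's "monitor for a repeat occurrence", and "repeat" to the point where a hardware replacement may be necessary.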

As the router performed without error for 5 years until the first parity errors started occurring, and as the error does not seem to trigger an automatic switchover to the standby processor, Spectrum staff have decided that the safest option is to decommission edge-george-1 and move all its routing functions to the newly commissioned edge-george-3.

edge-george-3 not only has over 250 times the switching capability of edge-george-1 but also uses much more modern switching technology. We expect all users to be migrated over one by one within the next 8 weeks. In the meantime we will continue to monitor edge-george-1 as per the vendor's recommendation.

If you would like to discuss this further or arrange a quicker migration to edge-george-3, please contact our support staff during business hours on 1300133299.

We would also like to remind all customers that the quickest way to access information during an outage is to check https://twitter.com/spectrumnet. A dedicated Spectrum employee updates customers with information as soon as our staff are aware of it. Twitter can often be accessed on your smartphone via your mobile carrier.

We thank you for your understanding.