Network outages plague campus

By Daniel Silverman

March 18, 2005

Problems with a core network switch in Feldberg caused several hours of intermittent network outages Thursday morning, according to Chief Information Officer Perry Hanson. In a letter to the community, Hanson blamed a software fault for the outages, which left phone and data services unavailable for periods between 1:30 a.m. and noon.

ITS staff worked from the early hours of the morning to identify and fix the malfunction. Senior Data Network Engineer Mike Fitzgerald was the first to be informed of the problem.

"My pager goes off automatically on loss of connectivity," he told the Hoot.

When he and other ITS system administrators could not troubleshoot the problem remotely, Fitzgerald traveled to campus to diagnose it. He arrived at around 2:30 a.m. to find a malfunctioning switch.

Network switches are used to route data traffic between computers. Generally, when a switch fails, the self-healing network automatically routes around the damage to maintain data service. However, due to the nature of the failure, that did not happen, according to Fitzgerald.

"We've got a lot of redundancy in the network," he said. "If it had just gone bad, the network would have been able to reconfigure for that."

However, the switch continuously attempted to fix itself and come back online, confusing other systems.

"If it was a hard failure it would have been fine, but because it kept transitioning good to bad to good to bad, the network had to try to keep compensating for it," Fitzgerald told the Hoot.

The problem was identified and fixed by the time CIO Perry Hanson arrived on campus at 4 a.m., according to Fitzgerald. However, another problem became apparent approximately three hours later.

"Because of the interruption of the network the phones were trying to talk to the data side of the network instead of the voice side of the network," said Fitzgerald.

Data and voice services are segregated onto different virtual networks running over the same wiring. When the phones could not connect to the central phone servers, called Call Managers, they attempted instead to register with the servers that handle data traffic. Once this problem was corrected, every phone continuously attempted to re-register, causing the campus switches to become overloaded.

Associate CIO Anna Tomecka called the chain of events a rare occurrence that would have been difficult to anticipate.

"If we had ten people on around the clock we would have been able to recover a bit sooner, but unless someone was watching for this specifically, they would not have been able to prevent it," she told the Hoot.

The problem was identified more quickly because of special network traffic analysis equipment installed after the phone outages 18 months ago, according to Fitzgerald and System Services Manager John Turner. These systems, called ARP detectors, watch for the specific type of traffic generated by registration requests.

In order to stop the flood of registration requests, known as an ARP storm, ITS staff had to manually cut power to network ports to force the phones to reset completely. This was done over time to different parts of the network, which is why students and staff may have noticed their phones rebooting. The resets were done in segments to ensure that the Call Managers were not overburdened with new registration requests, which could cause the entire process to repeat itself.

The problems were completely resolved before noon, according to Fitzgerald, and in the process valuable lessons were learned about how the network behaves and ways to improve it. Still, the root cause of the problem — why the core switch malfunctioned — remains unknown. Cisco is looking into the problem.

This outage was unrelated to minor intermittent problems with voicemail and other functions experienced by some phone users after routine security maintenance performed on Tuesday.

"The Tuesday outage ended up eating up a lot of our day, but it was an underlying issue that fortunately did not affect very many people," said Turner. That problem was quickly found and corrected after communication with Cisco.

The Thursday outage was the first major phone outage in 18 months and lasted substantially less time than the previous problems, which took place over three days.

