Routing problem during the night.


oles@ovh.net
03-28-2012, 06:50 PM
Hello,
We had a routing problem during the night (03-28-2012) due to a software bug affecting two core routers in Roubaix. These Cisco ASR 9010 routers aggregate the bandwidth of the Roubaix datacentres (RBX1, RBX2, RBX3, RBX4, RBX5) and provide the connections to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the core routing in Roubaix.

This bug is known and is linked to the new cards we installed in late January (24x10G per slot). For some reason the router detects ECC RAM errors and no longer routes packets. Nevertheless, the card does not report itself as "down" and remains in the router as if it were still operational. Other routers continue to send packets, but there is no working next hop. The packets are then dropped, causing a problem on the network. Worst case scenario: the network goes down.
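To illustrate the failure mode described above, here is a minimal Python sketch. It is a toy model only, not Cisco's actual behaviour or our configuration: the class, field and card names are hypothetical, and next-hop selection is reduced to "use the first card that reports itself as up".

```python
from dataclasses import dataclass


@dataclass
class LineCard:
    """Toy model of a line card; all names here are hypothetical."""
    name: str
    reports_up: bool   # state the card advertises to the rest of the network
    forwarding: bool   # whether the card actually forwards packets


def send(packets: int, next_hops: list[LineCard]) -> dict[str, int]:
    """Send packets via the first next hop that reports itself as up.

    Neighbours only see `reports_up`, so a card that is "up" but not
    forwarding silently drops everything sent to it.
    """
    result = {"delivered": 0, "dropped": 0}
    for hop in next_hops:
        if hop.reports_up:
            key = "delivered" if hop.forwarding else "dropped"
            result[key] += packets
            return result
    # No next hop reports up: nothing can be sent.
    result["dropped"] += packets
    return result


healthy = LineCard("card-a", reports_up=True, forwarding=True)
stuck = LineCard("card-b", reports_up=True, forwarding=False)    # the bug
failed = LineCard("card-c", reports_up=False, forwarding=False)  # clean failure

# Clean failure: the card reports down, so traffic moves to the healthy card.
clean_failure = send(100, [failed, healthy])
# Buggy card: it still looks up, so traffic is never moved and is lost.
silent_failure = send(100, [stuck, healthy])
```

In this model a card that fails cleanly costs nothing, because the traffic is rerouted, while a card that stays "up" without forwarding loses all the traffic sent to it.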

That night, three 24x10G cards in the ASR 9010 routers hit this bug almost simultaneously. By dropping the packets in Roubaix, this broke the network into 3 pieces: United States / London / Amsterdam / Warsaw; Roubaix; and Paris / Frankfurt / Madrid / Milan. Usually the traffic would have been rerouted, but it was blocked in Roubaix.

As a result, we were not able to use the network administration facility to retrieve logs from all the routers in order to establish the cause of the problem. We switched to the emergency external routers to check which of the backbone routers was causing the problem. This took some time because two routers were affected: the problem was not only on rbx-g2-a9 but also on rbx-g1-a9, which delayed the diagnosis. Once restarted, the 3 cards were operational within 5 minutes.

For the last 3 weeks, we have had a ticket open with Cisco regarding the ECC RAM problem. Cisco have been working on it and have provided a software patch to fix it. We will apply the patch tonight. We are also looking at how we can improve the management of our routers in the case where the whole backbone is down for some reason. We are able to handle this case, but it is very, very slow.

So, the outage lasted 1:22 (1 hour 22 minutes), which is more than our 99.9% commitment permits: we are allowed "a maximum" of 43 minutes of downtime per month. Therefore, there are penalties for exceeding the maximum time. For example, for OVH dedicated servers (SD) the penalty is 5% per hour of downtime. We have created a page where you can apply for credit under your SLA:
http://travaux.ovh.com/?do=details&id=6533
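For those who want to check the arithmetic behind the 43 minutes, here is a short Python sketch. It assumes a 30-day month (the exact reference period is not stated in this message), and the function and variable names are only illustrative. It does not compute the credit itself, since how partial hours are counted is not described here.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def allowed_downtime_minutes(availability: float) -> float:
    """Downtime budget per month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - availability)


allowed = allowed_downtime_minutes(0.999)  # about 43.2 minutes
outage = 1 * 60 + 22                       # 1:22 expressed in minutes
excess = outage - allowed                  # minutes beyond the budget
achieved = 1 - outage / MINUTES_PER_MONTH  # availability actually delivered

print(f"allowed: {allowed:.1f} min, outage: {outage} min, "
      f"excess: {excess:.1f} min, achieved: {achieved:.4%}")
```

Under that assumption the outage exceeds the monthly budget by roughly 39 minutes, and the availability for the month falls below 99.9%.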

It's never pleasant to write this kind of email but when our service is not up to standard, we must apologise.

Sorry again.

Regards
Octave