Dallas, TX - Network Outage

Incident Report for LevelOneServers

Postmortem

A week ago, we decommissioned an old core switch in preparation for the upgrade scheduled for this week. This left us with less redundancy in our core environment.

Today, during a work order, a facility tech accidentally unplugged one of the fiber cables connected to one of the remaining core switches. This caused the OSPF backbone area to go down, and as a result the default route was withdrawn from all the TORs.

Once the issue was identified, we immediately added a static route to all the TORs to resolve the outage and restore connectivity.
Afterwards, we had the facility reseat the SFP on the other core switch, and the backbone area was restored.
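For readers who want to see why withdrawing the default route cut off connectivity, and why a static route restored it, here is a minimal Python sketch of a route lookup on a single TOR. It is an illustration only, not our switch configuration: the `RouteTable` class, the source labels, and the documentation-range addresses (192.0.2.0/24, 198.51.100.1, 203.0.113.10) are all hypothetical, and the model covers only longest-prefix matching.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv4Network
    next_hop: str
    source: str  # hypothetical labels: "connected", "ospf", or "static"


class RouteTable:
    """Toy routing table for one TOR; longest-prefix match only."""

    def __init__(self):
        self.routes = []

    def add(self, prefix, next_hop, source):
        self.routes.append(Route(ipaddress.ip_network(prefix), next_hop, source))

    def withdraw(self, source):
        # Drop every route learned from the given source, as happens to
        # OSPF-learned routes when the adjacency is lost.
        self.routes = [r for r in self.routes if r.source != source]

    def lookup(self, dst):
        addr = ipaddress.ip_address(dst)
        matches = [r for r in self.routes if addr in r.prefix]
        if not matches:
            return None  # no matching route: the packet is dropped
        return max(matches, key=lambda r: r.prefix.prefixlen)


# Assumed addresses from the documentation ranges, not real ones.
tor = RouteTable()
tor.add("192.0.2.0/24", "local", "connected")
tor.add("0.0.0.0/0", "198.51.100.1", "ospf")

before = tor.lookup("203.0.113.10")   # resolved via the OSPF default route

tor.withdraw("ospf")                  # backbone area goes down
during = tor.lookup("203.0.113.10")   # no route to external destinations
local = tor.lookup("192.0.2.5")       # directly connected hosts still resolve

tor.add("0.0.0.0/0", "198.51.100.1", "static")  # the mitigation
after = tor.lookup("203.0.113.10")    # resolved via the static default route
```

In this model, traffic within the rack subnet keeps working throughout, while anything that depends on the default route fails until the static entry is added.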

In our new deployments, we usually force the backbone area to stay up even when the rest of the links go down. Unfortunately, this old switch does not support that. It will also be upgraded soon.

Posted Jan 21, 2025 - 20:50 UTC

Resolved

This incident has been resolved.
Posted Jan 21, 2025 - 20:30 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 21, 2025 - 20:19 UTC

Update

We are continuing to work on a fix for this issue.
Posted Jan 21, 2025 - 20:10 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jan 21, 2025 - 20:09 UTC

Investigating

We are currently investigating this issue.
Posted Jan 21, 2025 - 19:48 UTC
This incident affected: dal.c1 (r01.dal.c1, r02.dal.c1, r03.dal.c1, r05.dal.c1, r06.dal.c1, r07.dal.c1).