Dallas - Outage
Incident Report for LevelOneServers
Postmortem

We’ll be conducting our own internal investigation into how recovery went on our side and improving our processes to reduce recovery time for similar events.

The datacenter has shared their RFO (Reason for Outage) with us:

Full Outage Summary

Utility power was lost and both Gen A and Gen B started running. Gen A then tripped off immediately after load transfer for an undetermined reason. Gen A continued to try to restart, completely draining and killing its battery even though it had recently passed annual PM. UPS 1A and 2A fully discharged their batteries. Once that happened, STS 1A and 2A switched to their secondary source, the Catcher UPS, which was being fed from Gen B. This overloaded the Catcher UPS and forced it into bypass without fully discharging its batteries. The additional loading on Gen B caused significant voltage fluctuations, which led UPS 1B and 2B to declare the source unavailable; they stayed on battery until fully discharged and then went offline. The load on blocks 1A and 2A remained powered through the Catcher UPS in bypass, fed from Gen B. When utility power returned, the open transition between Gen B and utility caused the remaining online equipment to trip offline, leaving all of the PDU main input breakers tripped. This is why customer load was not immediately restored when utility power returned.

Timeline

5:57 AM – Utility power to the facility was lost.

- Gen A starts but eventually fails and drains its battery trying to restart.

- Gen B starts and continues to operate until utility power returns, with significant voltage and frequency instability due to overloading.

5:58 AM – Overall System Status

- UPS 1A / 2A discharging due to no generator power available.

- UPS 1B / 2B discharging due to ATS transition.

- Mechanical systems fed from Gen B back online (half capacity).

5:58 AM – ATS-1B and ATS-2B Load transferred to Generator.

6:01 AM – UPS-1A goes offline; STS-1A transfers to Source 2 (Catcher UPS).

6:03 AM – UPS-2A goes offline; STS-2A transfers to Source 2 (Catcher UPS).

6:04 AM – Load on the Catcher UPS exceeds its capacity, forcing the Catcher UPS into bypass, fed from Gen B.

6:05 AM – Gen B starts to experience significant voltage and frequency fluctuations due to overloading.

6:06 AM – UPS-1B and UPS-2B declare input power substandard due to poor power quality from the generator and resume discharging from batteries.

6:15 AM – UPS-1B, with input power still considered bad, fully discharges its batteries and downstream load is lost. Downstream STS-1B is unable to switch to Source 2 (Catcher) due to power quality and shuts down. Downstream PDUs open their main input breakers.

6:21 AM – UPS-2B, with input power still considered bad, fully discharges its batteries and downstream load is lost. Downstream STS-2B is unable to switch to Source 2 (Catcher) due to power quality and shuts down. Downstream PDUs open their main input breakers.

6:28 AM – Utility power returns and all CRACs come back online. When ATS-B performs the open transition from generator to utility, the remaining PDUs operating on Gen B lose power, as no remaining UPSs have battery capacity, opening the remaining PDU main input breakers.

7:15 AM – Technician identifies that UPS 1A, 1B, and 2A are all offline. UPS 2B is online in bypass with no load.

8:20 AM – All UPSs reset and brought back online in normal operation.

8:50 AM – All tripped PDU main input breakers reset, and customer load restored.

Posted Jun 01, 2024 - 07:25 UTC

Resolved
This is resolved. We will release a postmortem once we have information from the DC.
Posted May 29, 2024 - 05:42 UTC
Monitoring
We are going server by server, making sure everything is online.
Posted May 28, 2024 - 04:30 UTC
Update
The PDU was swapped a few hours ago and power was fully restored.
Posted May 28, 2024 - 04:29 UTC
Update
It seems one of the banks of a PDU in r02 failed.
The DC is now replacing the faulty PDU with a new one.
Posted May 27, 2024 - 21:42 UTC
Update
It appears one of the PDUs on r02 is still down; we're investigating.
Posted May 27, 2024 - 19:56 UTC
Update
All racks should be powered on now; we are going rack by rack to ensure servers are up.
Posted May 27, 2024 - 11:43 UTC
Update
We are working with the DC to resolve the issues

Current Status:
- r01 - fully up
- r02 - fully up, main switch unreachable but still forwarding traffic
- r03 - fully up
- r04 - fully up
- r05 - fully up
- r06 - one pdu down
- r07 - fully up
Posted May 27, 2024 - 10:28 UTC
Update
We are working with the DC to resolve the issues

Current Status:
- r01 - fully up
- r02 - fully up, main switch unreachable but still forwarding traffic
- r03 - fully up
- r04 - one pdu down
- r05 - fully up
- r06 - one pdu down
- r07 - fully up
Posted May 27, 2024 - 04:33 UTC
Update
It appears we lost power in r04 again.
We are working with the DC to resolve the issues

Current Status:
- r01 - fully up
- r02 - one pdu down
- r03 - fully up
- r04 - down
- r05 - fully up
- r06 - one pdu down
- r07 - fully up
Posted May 26, 2024 - 22:52 UTC
Update
We are working with the DC to resolve the issues

Current Status:
- r01 - fully up
- r02 - one pdu down
- r03 - fully up
- r04 - one pdu down
- r05 - fully up
- r06 - one pdu down
- r07 - fully up
Posted May 26, 2024 - 22:26 UTC
Update
We are working with the DC to resolve the issues

Current Status:
- r01 - fully up
- r02 - down
- r03 - fully up
- r04 - one pdu down + no IPMI network
- r05 - fully up
- r06 - one pdu down
- r07 - fully up
Posted May 26, 2024 - 19:12 UTC
Update
While power is coming back up, we are working to restore all networking equipment and servers; this may take time to ensure the breakers don't pop.
Posted May 26, 2024 - 17:56 UTC
Update
While power is coming back up, we are working to restore all networking equipment and servers; this may take time to ensure the breakers don't pop.
Posted May 26, 2024 - 16:02 UTC
Update
While power is coming back up, we are working to restore all networking equipment and servers; this may take time to ensure the breakers don't pop.
Posted May 26, 2024 - 14:54 UTC
Update
While power is coming back up, we are working to restore all networking equipment and servers; this may take time to ensure the breakers don't pop.
Posted May 26, 2024 - 14:52 UTC
Identified
Power is being slowly restored
Posted May 26, 2024 - 14:17 UTC
Update
Indications show there is a power outage in the facility. We've reached out and are awaiting status updates.
Posted May 26, 2024 - 11:41 UTC
Investigating
We are currently investigating this issue.
Posted May 26, 2024 - 11:20 UTC
This incident affected: dal.c1 (r01.dal.c1, r02.dal.c1, r03.dal.c1, r04.dal.c1, r05.dal.c1, r06.dal.c1, r07.dal.c1).