By Anonymous

Summary:
An unplanned power outage caused by a rain storm affected a fraction of the Linux Farm resources at the SDCC.

Duration:
5/2/2025 2:45 am - 5/2/2025 5:25 am

Group Responsible:
IT Fabric

Affected Area:
Linux Farm

Expected User Impact:
Temporary loss of CPU resources

Maintenance Type:
Unplanned/Outage
    
Submitted By:
SDCC Announcements

Description:
At approximately 2:45 am on Friday, May 2nd, the SDCC experienced a brief, unplanned power outage that cut electrical power to a limited fraction of the Linux Farm cluster, which provides resources to several SDCC-supported experiments. ATLAS Tier-1 and Tier-3, as well as the Alma9 and SL7 shared pool used by Belle-II, DUNE, EIC, PHENIX and STAR, were affected. Staff responded to the alarms and began recovery at 4 am. Affected systems were fully restored at 5:25 am. After recovery, it was determined that the outage was caused by a brief but significant rain storm in the early hours of May 2nd, coupled with the partial failure of a UPS sub-system whose job is to provide electrical power continuity during utility power instabilities. The rest of the UPS sub-systems performed as expected, and the majority of our resources (sPHENIX Linux Farm cluster, HPSS tape system, disk storage, file systems, collaborative tools and services, etc.) were not affected. An investigation by BNL's Utilities Division is underway. It will be followed by repair of the faulty sub-system and then by implementation of enhanced electrical system monitoring to prevent a recurrence of unplanned power outages.