Tue Jun 15 16:41:34 EDT 2021
A major cooling failure occurred in the CDCE room of SDCC's datacenter earlier today (6/15), starting around 12:30 PM EDT, due to a problem with the building's chilled water system. Temperatures rose quickly, and around 1:00 PM our automated monitoring software shut down the compute nodes in that room to avoid equipment damage. This affected all ATLAS T1 compute nodes and a large portion of the shared pool (all spool0XYZ systems). Parts of our RHEV system were also affected. The building's chilled water circulation was repaired by approximately 3:00 PM. The farm equipment was then powered back on and reopened to jobs after the room temperature stabilized at 3:30 PM.
At this time we believe all affected services have been restored. If you continue to experience issues, please submit a ticket to RT.
Chris Hollowell (firstname.lastname@example.org)