  • UPDATE, recovered - BNL SDCC datacenter power loss March 7, 2025

    By Anonymous |

    Duration:
    3/7/2025 9:52 am — 3/7/2025 6:30 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    Many SDCC storage, compute, and service systems

    Expected Impact:
    Impact on multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The bulk of services was restored as of ~18:30 EST on 3/7.

  • BNL SDCC datacenter power loss March 7, 2025

    By Anonymous |

    Duration:
    3/7/2025 9:42 am — 3/7/2025 12:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    Many SDCC storage, compute, and service systems

    Expected Impact:
    Impact on multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The BNL SCDF (SDCC) B725 datacenter experienced a power loss on at least one of its power systems.
    Recovery is underway.
    A postmortem will be conducted.

  • BNLBox Service Recovered

    By Anonymous |

    Duration:
    3/3/2025 10:25 pm — 3/4/2025 3:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service recovered

    Maintenance Type:
    Information

    Description:
    The backend Lustre storage server has been fixed and BNLBox service has been restored. Please report any residual issues via email to RT-RACF-StorageManagement@bnl.gov.

  • BNLBox Outage

    By Anonymous |

    Duration:
    3/3/2025 9:25 pm — 3/4/2025 5:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service unavailable

    Maintenance Type:
    Unplanned/Outage

    Description:
    The failure of a RAID controller on the Lustre storage server that holds the BNLBox data directory caused the service to become unavailable at about 9:25 pm. The server will remain down overnight pending a response from Dell support. Data on the client side will remain unaffected, but any changes will not be synced to the storage service until it recovers. We will send out an update once the service is back in production.

  • Jupyter Infrastructure Migration

    By Anonymous |

    Duration:
    2/18/2025 11:00 am — 2/18/2025 2:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    jupyter.sdcc.bnl.gov

    Expected Impact:
    Service Unavailable

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The jupyter.sdcc.bnl.gov infrastructure will be migrated to Alma 9 on Tuesday, starting at 11:00 am.
    This will terminate any running jobs you may have.

    Please be aware that the service will be unavailable to users during this time.

    After the migration, HTC and HPC jobs will be spawned in the Alma 9 environment.

    Estimated duration: 3 hours.

  • SDCC Globus Server update

    By Anonymous |

    Duration:
    1/16/2025 10:00 am — 1/16/2025 11:00 am

    Group Responsible:
    IT Services

    Affected Area:
    Globus Web access

    Expected Impact:
    Access will be unavailable and current sessions will be ended

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    Tomorrow, 1/16/2025, the Globus server for SDCC will be updated. During this time, access to the Globus website (https://app.globus.org/) will be unavailable.

  • Mattermost Maintenance

    By Anonymous |

    Duration:
    1/9/2025 9:00 am — 1/9/2025 10:00 am

    Group Responsible:
    IT Services

    Affected Area:
    Mattermost CHAT

    Expected Impact:
    Access will be unavailable and current sessions will be ended

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    Mattermost will be down for maintenance and unavailable on Thursday, 01/09/2025, between 9:00 AM and 10:00 AM.

  • SDCC fully operational after electrical maintenance work

    By Anonymous |

    Duration:
    12/30/2024 10:00 am — 12/30/2024 4:45 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    All SDCC services

    Expected Impact:
    No access to SDCC resources (computing, storage and services)

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The SDCC is fully operational again, following the electrical maintenance work today. If you experience
    any problems accessing facility services or resources, please create an RT ticket
    (https://www.sdcc.bnl.gov/help/reporting-problems) and report it.

  • Update on the BNL electrical grid activity on 12/30

    By Anonymous |

    Duration:
    12/30/2024 9:00 am — 12/30/2024 6:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    All SDCC services

    Expected Impact:
    No access to SDCC resources (computing, storage and services)

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    A critical maintenance/replacement procedure on the BNL main electrical grid scheduled for Monday, Dec. 30th was announced
    to the SDCC on very short notice last week. This procedure is planned to start around 12 noon and last approximately 4 hours.
    We recognize this procedure is happening during the BNL-declared "quiet period", but a postponement
    would incur increased costs to the Lab and potentially place this must-do procedure during the start-up period for RHIC run 25,
    which is deemed even less desirable than the current plan. BNL management has decided to go ahead with the Dec. 30th
    procedure, as planned.

    This procedure requires transferring the power source from the electrical utility to the back-up generator, with a UPS to
    bridge the time gap (a few seconds) between utility and generator power, and then remaining on generator power for the duration
    of this procedure. Because there is a small risk of failure during the transfer process and in generator operations and because of
    reduced staff availability during the BNL quiet period, the SDCC management has decided to quiet down the facility resources
    to minimize the chances of data corruption, service disruptions and hardware failures, in the unlikely event that an unplanned
    power outage occurs.

    Quieting down means: 1) draining batch jobs (HTCondor and Slurm), holding new ones from starting, and stopping interactive
    access to SDCC CPU resources on SUNDAY (DEC. 29TH) AT 3 PM ET and 2) stopping all data read/write and movement activities
    (disk and tape) on MONDAY (DEC. 30TH) AT 9 AM ET.

    Announcements to SDCC Liaisons and program/experimental PoCs will be made when SDCC resources are fully available again.

  • Work on BNL electrical grid on 12/30 and impact on SDCC resources

    By Anonymous |

    Duration:
    12/30/2024 9:00 am — 12/30/2024 6:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    All SDCC services

    Expected Impact:
    No access to SDCC resources (computing, storage and services)

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    A critical maintenance/replacement procedure on the BNL main electrical grid scheduled for Monday, Dec. 30th was announced
    to the SDCC on very short notice last week. This procedure is planned to start around 12 noon and last approximately 4 hours.
    We recognize this procedure is happening during the BNL-declared "quiet period", but a postponement would incur increased
    costs to the Lab and potentially place this must-do procedure during the start-up period for RHIC run 25, which is deemed even
    less desirable than the current plan. BNL management has decided to go ahead with the Dec. 30th procedure, as planned.

    This procedure requires transferring the power source from the electrical utility to the back-up generator, with a UPS to
    bridge the time gap (a few seconds) between utility and generator power, and then remaining on generator power for the duration
    of this procedure. Because there is a small risk of failure during the transfer process and generator operations and because of
    reduced staff availability during the "quiet period", the SDCC management has decided to quiet down the facility resources to
    minimize the chances of data corruption, service disruptions and hardware failures, in the unlikely event that an unplanned
    power outage occurs.

    Quieting down means: 1) draining batch jobs (HTCondor and Slurm), holding new ones from starting, and stopping interactive
    access to SDCC CPU resources on Friday (Dec. 27th) evening and 2) stopping all data read/write and movement activities
    (disk and tape) on Monday (Dec. 30th) early morning.

    Announcements to SDCC Liaisons and program/experimental PoCs will be made when SDCC resources are fully available again.