• Reduced SL7 shared pool resources

    Duration:
    3/24/2025 3:25 am — 3/24/2025 12:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    SL7 HTCondor shared pool

    Expected Impact:
    Delayed jobs; after recovery, some will likely need to be restarted

    Maintenance Type:
    Unplanned/Outage

    Description:
    The SL7 portion of the shared pool is suffering from reduced resources.
    Investigation under way ...

  • Portions of the Shared Condor pool are down

    Duration:
    3/22/2025 11:25 am — 3/22/2025 10:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    HTCondor Shared Pool

    Expected Impact:
    A portion of the compute farm is unavailable

    Maintenance Type:
    Unplanned/Outage

    Description:
    About half of the SL7 hosts on the shared HTCondor pool (~7K job slots) experienced an outage at approximately 11:25am today. The Alma 9 hosts were apparently unaffected.
    Experts are on site investigating and will update further as the situation evolves. Jobs submitted to SL7 hosts may be delayed due to limited resources until service is fully restored, and some may need to be restarted.

  • HTCondor Shared Pool resources restored

    Duration:
    3/22/2025 11:25 am — 3/22/2025 6:36 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    HTCondor Shared Pool

    Expected Impact:
    SL7 jobs should complete as before the outage

    Maintenance Type:
    Information

    Description:
    SL7 shared pool resource levels have been restored.

  • UPDATE, recovered - BNL SDCC datacenter power loss March 7, 2025

    Duration:
    3/7/2025 9:52 am — 3/7/2025 6:30 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    Many SDCC storage, compute, and service systems

    Expected Impact:
    impact to multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The bulk of services were restored as of ~18:30 EST on 3/7.

  • BNL SDCC datacenter power loss March 7, 2025

    Duration:
    3/7/2025 9:42 am — 3/7/2025 12:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    Many SDCC storage, compute, and service systems

    Expected Impact:
    impact to multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The BNL SCDF (SDCC) b725 datacenter experienced a power loss on at least one of its power systems.
    Recovery is underway.
    A postmortem will be done.

  • BNLBox Service Recovered

    Duration:
    3/3/2025 10:25 pm — 3/4/2025 3:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service recovered

    Maintenance Type:
    Information

    Description:
    The backend Lustre storage server has been fixed and BNLBox service has been restored. Please report any residual issues via email to RT-RACF-StorageManagement@bnl.gov.

  • BNLBox Outage

    Duration:
    3/3/2025 9:25 pm — 3/4/2025 5:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service unavailable

    Maintenance Type:
    Unplanned/Outage

    Description:
    Failure of a RAID controller on the Lustre storage server that holds the BNLBox data directory caused the service to become unavailable at about 9:25pm. The server will remain down overnight pending a response from Dell support. Data on the client side will remain unaffected, but any changes will not be synced to the storage service until it recovers. We will send out an update once the service is back in production.

  • Jupyter Infrastructure Migration

    Duration:
    2/18/2025 11:00 am — 2/18/2025 2:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    jupyter.sdcc.bnl.gov

    Expected Impact:
    Service Unavailable

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The jupyter.sdcc.bnl.gov infrastructure will be migrated to Alma 9 on Tuesday, starting at 11:00am.
    This will terminate any running jobs you may have.
    Please be aware that the service will be unavailable to users during this time.
    After the migration, HTC and HPC jobs will be spawned in the Alma 9 environment.
    Estimated duration: 3h.

  • SDCC Globus Server update

    Duration:
    1/16/2025 10:00 am — 1/16/2025 11:00 am

    Group Responsible:
    IT Services

    Affected Area:
    Globus Web access

    Expected Impact:
    Access will be unavailable and current sessions will be ended

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    Tomorrow, 1/16/2025, the Globus server for SDCC will be updated. During this time, access to the Globus website (https://app.globus.org/) will be unavailable.

  • Mattermost Maintenance

    Duration:
    1/9/2025 9:00 am — 1/9/2025 10:00 am

    Group Responsible:
    IT Services

    Affected Area:
    Mattermost CHAT

    Expected Impact:
    Access will be unavailable and current sessions will be ended

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    Mattermost will be down for maintenance on Thursday, 01/09/2025, between 9:00AM and 10:00AM.