  • RACF Gitea Upgrade

    By Anonymous |

    Duration:
    4/21/2025 8:00 am — 4/21/2025 8:30 am

    Group Responsible:
    SDCC Operations

    Affected Area:
    Git repositories hosted on git.racf.bnl.gov

    Expected Impact:
    Some personal access tokens may need to be regenerated

    Maintenance Type:
    Transparent Upgrade/Maintenance

    Description:
    The RACF Gitea instance (git.racf.bnl.gov/gitea) will be upgraded from version 1.19.1 to 1.23.7 on Monday, 4/21 at 8:00 AM EDT. Minimal disruption of service is expected. Version 1.20.0 changed the permissions for personal access tokens, so some tokens may need to be regenerated. More info on the blog: https://blog.gitea.com/release-of-1.20.0/#warning-refactored-scoped-tok…
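
    One way to find out whether a token still works after the upgrade is to call the Gitea API with it. The sketch below is illustrative only. It assumes the API is served under the usual /api/v1 path beneath the /gitea prefix, and that the current-user endpoint is a reasonable probe. The reading of HTTP 403 as a missing scope is likewise an assumption, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Assumption: the API root follows Gitea's usual /api/v1 layout under the
# /gitea prefix named in the announcement; confirm the path before relying on it.
API_ROOT = "https://git.racf.bnl.gov/gitea/api/v1"


def build_request(token: str, api_root: str = API_ROOT) -> urllib.request.Request:
    """Build an authenticated GET request for the current-user endpoint."""
    return urllib.request.Request(
        f"{api_root}/user",
        headers={"Authorization": f"token {token}", "Accept": "application/json"},
    )


def interpret_status(status: int) -> str:
    """Map an HTTP status code to a hint (the 403 reading is an assumption)."""
    if status == 200:
        return "token accepted"
    if status == 401:
        return "token rejected: regenerate it"
    if status == 403:
        return "token recognized but lacks a needed scope: regenerate with the right scopes"
    return f"unexpected HTTP status {status}"


def check_token(token: str, api_root: str = API_ROOT) -> str:
    """Send the probe request and describe the outcome."""
    try:
        with urllib.request.urlopen(build_request(token, api_root), timeout=10) as resp:
            return interpret_status(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is in err.code.
        return interpret_status(err.code)
```

    Calling check_token with a token read from an environment variable, rather than pasting it into a script, keeps the secret out of shell history and version control.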

  • sl7 shared pool resources restored

    By Anonymous |

    Duration:
    3/24/2025 11:57 am (no end time recorded)

    Group Responsible:
    IT Fabric

    Affected Area:
    sl7 portion of the condor shared pool

    Expected Impact:
    jobs should run within normal time, some may need restart

    Maintenance Type:
    Information

    Description:
    sl7 shared pool resource levels have been restored.

    Investigation into the root cause continues; the problem could recur until the cause is identified and resolved.

  • reduced sl7 shared pool resources

    By Anonymous |

    Duration:
    3/24/2025 3:25 am — 3/24/2025 12:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    sl7 condor shared pool

    Expected Impact:
    delayed jobs, after recovery some will likely need restart

    Maintenance Type:
    Unplanned/Outage

    Description:
    The sl7 portion of the shared pool is suffering from reduced resources.

    Investigation under way ...

  • Portions of the Shared Condor pool are down

    By Anonymous |

    Duration:
    3/22/2025 11:25 am — 3/22/2025 10:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    HTCondor Shared Pool

    Expected Impact:
    A portion of the compute farm is unavailable

    Maintenance Type:
    Unplanned/Outage

    Description:
    About half of the SL7 hosts on the shared HTCondor pool (~7K job slots) experienced an outage at approximately 11:25am today. The Alma 9 hosts were apparently unaffected.

    Experts are on site investigating and will update further as the situation evolves. Jobs submitted to SL7 hosts may be delayed due to limited resources until service is fully restored, and some may need to be restarted.

  • condor Shared Pool resources restored

    By Anonymous |

    Duration:
    3/22/2025 11:25 am — 3/22/2025 6:36 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    HTCondor Shared Pool

    Expected Impact:
    sl7 jobs should complete as before the outage

    Maintenance Type:
    Information

    Description:
    sl7 shared pool resource levels have been restored.

  • UPDATE, recovered - BNL SDCC datacenter power loss March 7, 2025

    By Anonymous |

    Duration:
    3/7/2025 9:52 am — 3/7/2025 6:30 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    many sdcc storage, compute, and services systems

    Expected Impact:
    impact to multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The bulk of services were restored as of ~18:30 EST 3/7.

  • BNL SDCC datacenter power loss March 7, 2025

    By Anonymous |

    Duration:
    3/7/2025 9:42 am — 3/7/2025 12:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    many sdcc storage, compute, and services systems

    Expected Impact:
    impact to multiple services and experiments

    Maintenance Type:
    Unplanned/Outage

    Description:
    The BNL SCDF (SDCC) b725 datacenter experienced a power loss on at least one of its power systems.

    Recovery is underway.

    A postmortem will be done.

  • BNLBox Service Recovered

    By Anonymous |

    Duration:
    3/3/2025 10:25 pm — 3/4/2025 3:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service recovered

    Maintenance Type:
    Information

    Description:
    The backend Lustre storage server has been fixed and BNLBox service has been restored. Please report any residual issues via email to RT-RACF-StorageManagement@bnl.gov.

  • BNLBox Outage

    By Anonymous |

    Duration:
    3/3/2025 9:25 pm — 3/4/2025 5:00 pm

    Group Responsible:
    IT Services

    Affected Area:
    BNLBox

    Expected Impact:
    Service unavailable

    Maintenance Type:
    Unplanned/Outage

    Description:
    Failure of a RAID controller on the Lustre storage server that holds the BNLBox data directory caused the service to become unavailable at about 9:25pm. The server will remain down overnight pending a response from Dell support. Data on the client side will remain unaffected, but any changes will not be synced to the storage service until it recovers. We will send out an update once the service is back in production.

  • Jupyter Infrastructure Migration

    By Anonymous |

    Duration:
    2/18/2025 11:00 am — 2/18/2025 2:00 pm

    Group Responsible:
    IT Fabric

    Affected Area:
    jupyter.sdcc.bnl.gov

    Expected Impact:
    Service Unavailable

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The jupyter.sdcc.bnl.gov infrastructure will be migrated to Alma 9 on Tuesday, starting at 11:00am.

    This will terminate any running jobs you may have.

    Please be aware that the service will be unavailable to users during this time.

    After the migration, HTC and HPC jobs will be spawned in the Alma 9 environment.

    Estimated duration: 3h.