  • Issue with RT notification in User Accounts queue

    Duration:
    7/24/2025 11:47 am

    Group Responsible:
    Services & Tools

    Affected Area:
    RT System

    Expected Impact:
    RT tickets are not being generated for new user requests

    Maintenance Type:
    Information

    Description:
    We have determined that requests submitted through the New User Account form on the User Accounts queue are not generating RT tickets or follow-up emails. Requests are being received at the useraccts@rcf.rhic.bnl.gov email address.

    We are investigating the cause of this issue and will update when we have more information.

  • RHIC AFS migration to OpenShift

    Duration:
    7/24/2025 9:15 am

    Group Responsible:
    IT Services

    Affected Area:
    RHIC AFS File System

    Expected Impact:
    Some limited access during the migration process

    Maintenance Type:
    Transparent Upgrade/Maintenance

    Description:
    RHIC AFS servers are being migrated from the Red Hat Virtual environment to OpenShift. Servers will be migrated one at a time, but some disruption in file access is expected as the migration proceeds.

  • NX Campus System Update

    Duration:
    6/25/2025 5:00 pm — 6/25/2025 5:30 pm

    Group Responsible:
    Services & Tools

    Affected Area:
    The NX service will not be available during this time

    Expected Impact:
    All NX sessions will be terminated; please save your work.

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The nx-campus01 server will be rebooted for system/OS updates today.

  • Mattermost Emergency Downtime

    Duration:
    6/20/2025 10:00 am — 6/20/2025 11:00 am

    Group Responsible:
    Services & Tools

    Affected Area:
    Mattermost

    Expected Impact:
    Service interruption

    Maintenance Type:
    Unplanned/Outage

    Description:
    Due to recent SCDF Mattermost issues, the service will go down for an emergency patch. Service should resume within 5 minutes, but potential issues may extend the downtime to approximately 1 hour.

  • Mattermost Update

    Duration:
    6/18/2025 7:00 pm — 6/18/2025 7:30 pm

    Group Responsible:
    Services & Tools

    Affected Area:
    SCDF Mattermost

    Expected Impact:
    Service Unavailable

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    The SCDF Mattermost service will be down from 7:00 PM to 7:30 PM EDT on 06/18 for scheduled updates. Users will be unable to use the service during this period. Service will resume normally once the update is complete.

  • RACF Gitea Maintenance

    Duration:
    5/15/2025 8:30 am — 5/15/2025 9:30 am

    Group Responsible:
    SDCC Operations

    Affected Area:
    Git repositories hosted on git.racf.bnl.gov

    Expected Impact:
    Git instance unavailable

    Maintenance Type:
    Planned Maintenance/Downtime

    Description:
    On Thursday, May 15, the RACF Gitea instance (git.racf.bnl.gov/gitea) will be undergoing maintenance from 8:30AM-9:30AM. Repositories hosted on this instance may be unavailable for all or part of this time.

  • Unplanned power outage on Friday, May 2nd

    Duration:
    5/2/2025 2:45 am — 5/2/2025 5:25 am

    Group Responsible:
    IT Fabric

    Affected Area:
    Linux Farm

    Expected Impact:
    Temporary loss of CPU resources

    Maintenance Type:
    Unplanned/Outage

    Description:
    At approximately 2:45 am on Friday, May 2nd, the SDCC experienced a brief, unplanned power outage that cut electrical power to a limited fraction of the Linux Farm cluster, which provides resources to several SDCC-supported experiments. ATLAS Tier-1 and Tier-3, as well as the Alma9 and SL7 shared pool used by Belle-II, DUNE, EIC, PHENIX and STAR, were affected. Staff responded to the alarms and began recovery at 4 am. Affected systems were fully restored at 5:25 am. Post-recovery, it was determined that the power outage was caused by a brief but significant rain storm in the early hours of May 2nd, coupled with the partial failure of a UPS sub-system whose job is to provide electrical power continuity during utility power instabilities. The rest of the UPS sub-systems performed as expected, and the majority of our resources (sPHENIX Linux Farm cluster, HPSS tape system, disk storage, file systems, collaborative tools and services, etc.) were not affected. An investigation by BNL's Utilities Division is underway, to be followed by repair of the faulty sub-system and then implementation of enhanced electrical system monitoring to prevent a recurrence of unplanned power outages.

  • RACF Gitea Upgrade

    Duration:
    4/21/2025 8:00 am — 4/21/2025 8:30 am

    Group Responsible:
    SDCC Operations

    Affected Area:
    Git repositories hosted on git.racf.bnl.gov

    Expected Impact:
    Some personal access tokens may need to be regenerated

    Maintenance Type:
    Transparent Upgrade/Maintenance

    Description:
    The RACF Gitea instance (git.racf.bnl.gov/gitea) will be upgraded on Monday, 4/21 at 8:00 AM EDT. Minimal disruption of service is expected. The version is being upgraded from 1.19.1 to 1.23.7. In version 1.20.0, permissions for personal access tokens were changed, and some tokens may need to be regenerated. More info on the blog: https://blog.gitea.com/release-of-1.20.0/#warning-refactored-scoped-tok…
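
    One way to find out whether an existing personal access token needs to be regenerated is to make an authenticated request against the Gitea REST API and look at the status code. The following is a minimal sketch, not an official procedure: the base URL is assumed from the hostname in this announcement, and build_token_check and interpret_status are hypothetical helper names. It relies only on Gitea's standard /api/v1/user endpoint and the "Authorization: token ..." header.

```python
import urllib.request

# Assumed base URL, derived from the hostname in the announcement.
BASE_URL = "https://git.racf.bnl.gov/gitea"


def build_token_check(token, base_url=BASE_URL):
    """Build (but do not send) a request for the authenticated user's profile."""
    return urllib.request.Request(
        f"{base_url}/api/v1/user",
        headers={"Authorization": f"token {token}"},
    )


def interpret_status(status):
    """Map an HTTP status code to a hint about the token (hypothetical helper)."""
    if status == 200:
        return "token accepted"
    if status in (401, 403):
        # Assumption: a rejected or insufficiently scoped token yields 401/403.
        return "token rejected or missing a required scope; consider regenerating it"
    return f"unexpected status {status}; check the service"
```

    To run the check, pass the request to urllib.request.urlopen and feed the response status to interpret_status. Note that urlopen raises urllib.error.HTTPError for 4xx responses; its code attribute carries the status.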

  • sl7 shared pool resources restored

    Duration:
    3/24/2025 11:57 am

    Group Responsible:
    IT Fabric

    Affected Area:
    sl7 portion of the condor shared pool

    Expected Impact:
    Jobs should run within normal time; some may need to be restarted

    Maintenance Type:
    Information

    Description:
    sl7 shared pool resource levels have been restored.

    Investigation into the root cause continues; the problem could recur until the cause is identified and resolved.