Fw: CRS batch software crashes

From: Flemming Videbaek (videbaek@sgs1.hirg.bnl.gov)
Date: Tue Nov 13 2001 - 14:28:58 EST

    Please supply us with feedback,
    in particular those of you working actively with CRS.
    
    
    ------------------------------------------------------
    Flemming Videbaek
    Physics Department
    Brookhaven National Laboratory
    
    tlf: 631-344-4106
    fax 631-344-1334
    e-mail: videbaek@bnl.gov
    ----- Original Message -----
    From: "Tony W. Chan" <tony@bnl.gov>
    To: <mcbreen@bnl.gov>; <videbaek@bnl.gov>; <chujo@bnl.gov>;
    <momchil@bnl.gov>; <burt@bnl.gov>; <nigel@bnl.gov>; <jeromel@bnl.gov>;
    <messer@bnl.gov>
    Cc: <throwe@bnl.gov>
    Sent: Tuesday, November 13, 2001 2:10 PM
    Subject: CRS batch software crashes
    
    
    > Hi,                                Nov. 13, 2001
    >
    > I think (not 100% sure) that the CRS batch software crashes
    > are in part due to the doubling of the number of CRS nodes
    > and the small amount (128 MB) of RAM on the CRS master node (rcrsfm).
    > A newer system has been successfully tested over the last 2
    > weeks. The new system has a CPU that is 3x more powerful,
    > 4x more memory and 2x more disk space, and we would like to
    > replace rcrsfm with the new machine, such that the new machine
    > preserves the name "rcrsfm", while the old machine will be used
    > as a back-up and test bench.
    >
    > Accomplishing this change will require 1 or 2 system reboots
    > of rcrsfm, which means that the users will experience 1 or 2
    > disruptions of their usage of the CRS batch software.
    >
    > Here are the 2 choices:
    > -----------------------
    >
    > 1) Move all 4 experiments to the new machine in one move.
    > Likely downtime is about 1-2 hours, with disruptions affecting
    > everyone simultaneously. Everyone has to agree on a time and
    > day for this to happen.
    >
    > 2) Move one experiment at a time to the new machine (temporarily
    > called "rcrsfm01") according to each experiment's schedule.
    > Likely downtime is 30-60 minutes per experiment. After all 4 are
    > on the new machine, one more downtime (30 minutes) is needed to
    > make the name change to rcrsfm.
    >
    > While 2) is more complicated, it will allow us to check if my
    > suspicions above are correct or not. Choice 1) is the simpler
    > move, but if my suspicions are wrong, we will not be solving
    > the crash problems completely, but we will probably improve
    > software performance.
    >
    > Date and Time:
    > --------------
    >
    > Wednesday/Thursday this week or Monday next week beginning
    > at 9 am, for either choice 1) or 2).
    >
    > Please let me know your choice soon, so we can start making
    > preparations. Thanks.
    >
    > Cheers,
    >
    > Tony
    >
    >
    >
    



    This archive was generated by hypermail 2b30 : Tue Nov 13 2001 - 14:39:45 EST