Re: Home on crs CRASH updates

From: Christian Holm Christensen (cholm@hehi03.nbi.dk)
Date: Sat Nov 17 2001 - 09:49:24 EST

    Hi Flemming et al, 
    
    On Sat, 17 Nov 2001 09:11:56 -0500
    "Flemming Videbaek" <videbaek@sgs1.hirg.bnl.gov> wrote
    concerning "Home on crs  CRASH updates":
    > If $HOME in fact points to different places on the crs nodes - and I
    > have no doubt it does - it means that the RCF philosophy is broken on
    > the farms. The intention is to have all machines look identical
    > in terms of software setup. Christian or someone else, can you give
    > specific cases, and we will talk to Tony - as well as discussing
    > the second (AFS) issue
    
    What I did was to submit the job 
    
       ~bramreco/tests/query_env.jsf
    
    which executes 
      
       ~bramreco/tests/query_env.sh 
    
    on a single node (whichever it might be).  The output is stored in
    four files:
    
       ~bramreco/tests/query_env.out     (standard out) 
       ~bramreco/tests/query_env.err     (standard error) 
       ~bramreco/tests/query_env.tree    (ls -R on /afs/rhic/opt/brahms/[pro|new]) 
       ~bramreco/tests/query_env.running (empty)
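
    For reference, the heart of query_env.sh is essentially the
    following - a sketch only, the file names match those above but
    check the actual script for the details:

       #!/bin/sh
       # Sketch of query_env.sh: print the environment and the AFS
       # sysname on standard out, dump the AFS tree to the .tree file,
       # and leave an (empty) marker file.
       echo "First the environment variables"
       echo "==============================="
       env
       echo ""
       echo "AFS Sysname: `fs sysname`"
       ls -R /afs/rhic/opt/brahms/pro /afs/rhic/opt/brahms/new \
           > ~bramreco/tests/query_env.tree
       touch ~bramreco/tests/query_env.running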
    
    Among other things, query_env.sh prints the environment on the
    node of execution.  Here's a snippet of the output (the rest can be
    found in the files above):
      
      First the environment variables
      ===============================
      LOGNAME=bramreco
      MACHTYPE=i386
      CRS_JOB_FILE=query_env.jsf
      HOSTTYPE=i386-linux
      PATH=/bin:/usr/bin:/usr/local/bin:/usr/bin/X11
      HOME=/home/bramreco
      SHELL=/bin/tcsh
      USER=bramreco
      HOST=rcrs0032.rcf.bnl.gov
      ...
    
      AFS Sysname: Current sysname is 'i386_redhat61'
      ...
    
    As you can see, HOME=/home/bramreco and _not_ /brahms/u/bramreco!  I
    know the intention is to have a homogeneous environment, but this
    really does not bother me too much, though - others may feel
    differently, of course.  
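
    If a job script really needs the canonical home directory, one
    defensive workaround - just a sketch, not something RCF endorses -
    is to set it explicitly at the top of the script (the nodes run
    tcsh, as the SHELL variable above shows):

       # tcsh, matching SHELL=/bin/tcsh above; force HOME to the
       # canonical BRAHMS home instead of the node-local /home/bramreco
       setenv HOME /brahms/u/bramreco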
    
    If you want to run the test, just execute 
    
       ~bramreco/tests/query_env.doit
    
    It will clean the old output files, and keep looping until the job has
    executed, printing the progress. 
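
    The polling logic of query_env.doit is roughly the following (again
    a sketch - the submission step is left out, and the real script may
    differ):

       #!/bin/sh
       # Remove output from any previous run.
       rm -f ~bramreco/tests/query_env.out  ~bramreco/tests/query_env.err
       rm -f ~bramreco/tests/query_env.tree ~bramreco/tests/query_env.running
       # ... submit query_env.jsf to CRS here ...
       # Poll until the job has written its standard out, printing dots.
       while [ ! -f ~bramreco/tests/query_env.out ] ; do
           echo -n "."
           sleep 30
       done
       echo " done"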
    
    On the AFS thing.  I think you're likely to get the response from Tony
    that the farm is not supposed to use AFS in the first place.  This is
    what I got from Tony on July 10, 2001, in reply to some other problem: 
    
      I found your "staging failed" job in the log files for
      the CRS jobs. It looks like some of your input files
      are being loaded from AFS. I told Flemming yesterday
      that only input files located in the following areas
      can run successfully:
    
        /brahms/u
        /brahms/data01, data02, data06
    
      The CRS software checks for the existence of the input
      files, and AFS files are not allowed. Can you move your 
      AFS input files to one of the above disks and try again?
    
    Now, /afs/rhic/opt/brahms/new/lib/bratmain.wrapper is not listed as an
    input stream in the JSF files, so it isn't checked for existence
    (otherwise we would get the error message), and I guess there is no
    check on whether the executable lives on AFS (otherwise, we couldn't
    have used the farm as we do now).  When I asked Tony whether we could
    allow input files from AFS disks, I got this reply (July 11, 2001): 
    
      The RCF model agreed by the experiments a long time ago is that
      input files come from either HPSS or from our central disk storage
      area (which is NFS-mounted on the Linux nodes). That's why we only
      check the origin of the input files. AFS binaries work (we don't
      stop them), but we don't recommend it. 
    
    Interlude:  there you have it from the horse's mouth - executables on
    AFS are OK. Tony goes on: 
    
      The CRS software is very complicated, and it depends on HPSS,
      network and NFS. By adding AFS-dependency, you are just making it
      more complicated and more prone to potential problems. I don't see
      how this is an improvement.  Our experience has shown that simpler
      systems are more reliable. 
    
    Interlude:  the reason we like to have the software on AFS is that we
    then only have to maintain _one_ installation for the rcas, rcrs, and
    pii machines, and for other potential users in the US (AFS over the
    Atlantic is a hassle).  
    
    As Tony says, simple systems are good - that's also really what we
    want: the simplicity of maintaining _one_ installation. Of course we
    could install all our software on an NFS disk, and mount that disk on
    all the rcas, rcrs, rmine, and pii machines, but people outside BNL
    would then not have those installations available, and we wouldn't
    have the benefit of the '@sys' feature of AFS (see the example
    below).  Since I'm across the Atlantic myself, and therefore not
    using the AFS installations anyway, I could definitely live with
    moving our stuff to NFS. 
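
    To illustrate the '@sys' feature: the AFS client expands the
    literal path component '@sys' to the local sysname (i386_redhat61
    on the farm nodes, as seen above), so a single link serves every
    platform.  The path layout here is only illustrative:

       # One symbolic link, resolved per-client by the AFS cache manager:
       ln -s /afs/rhic/opt/brahms/@sys/pro /opt/brahms/pro
       # On an rcrs node '@sys' expands to i386_redhat61, so the link
       # points at /afs/rhic/opt/brahms/i386_redhat61/pro.
       # NFS has no equivalent per-client expansion.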
    
    Another possibility, if technically feasible, would be to NFS-mount
    the AFS disks on the rcrs nodes, so that /afs/rhic/opt/brahms is in
    fact an NFS mount of the AFS volumes for i386_linux - without using
    AFS itself. 
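
    Such a mount might look like the line below in /etc/fstab on each
    rcrs node - 'afsgw' is a hypothetical machine that re-exports the
    AFS area over NFS, so take this purely as a sketch:

       # hypothetical NFS re-export of the BRAHMS AFS area (read-only)
       afsgw.rcf.bnl.gov:/afs/rhic/opt/brahms  /afs/rhic/opt/brahms  nfs  ro  0 0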
    
    Now back to Tony:
    
      In any case, the CRS software is the same for all 4 experiments  
      (BRAHMS, PHENIX, PHOBOS, STAR). We don't customize the software 
      for any experiment. Any changes you request have to be made by 
      your official RCF Liaison (Flemming or Betty) and presented to 
      the RCF and to the other experiments. 
    
    Interlude: Flemming and Betty, I guess this really puts the ball in
    your court :-) And Tony continues: 
    
      If no one objects, a new version is created, and it is extensively
      tested by everyone before it becomes the "production version". The
      coding and testing typically takes weeks, so I think the chance of
      your requested change becoming the production version for the
      current run is very low, although I wouldn't rule it out for the
      next run. 
    
    Which is in a week's time? I mean, the current run ends soon
    (disregarding the p+p run), so I guess there'll be time to make
    structural changes.  
    
    I hope this helps you resolve matters. 
    
    Yours, 
    
    Christian Holm Christensen -------------------------------------------
    Address: Sankt Hansgade 23, 1. th.           Phone:  (+45) 35 35 96 91 
             DK-2200 Copenhagen N                Cell:   (+45) 28 82 16 23
             Denmark                             Office: (+45) 353  25 305 
    Email:   cholm@nbi.dk                        Web:    www.nbi.dk/~cholm
    


