Re: use or misuse of LSF queues ??

From: Christian Holm Christensen (cholm@hehi03.nbi.dk)
Date: Tue Apr 02 2002 - 06:42:06 EST


    Hi All, 
    
    On Mon, 1 Apr 2002 14:37:45 -0500
    "Flemming Videbaek" <videbaek@sgs1.hirg.bnl.gov> wrote
    concerning "use or misuse of LSF queues ??":
    > It has been described recently how to use the LSF system to submit
    > jobs. This does not mean it is necessarily reasonable to fill the
    > queues with apparently CPU-intensive jobs. 
    > 
    > Let me explain why:
    > 
    > The brahms data+user disk is served by a single SUN that is capable
    > of delivering at most ~35 Mb/sec.  If the CAS nodes are all loaded
    > with jobs that attempt to read from the data disks at a high rate
    > (> 0.9 Mb/sec), one will get a very sluggish response (as is the
    > case presently). 
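    
    A back-of-the-envelope check of those figures, as a sketch only: it
    assumes every job reads at the quoted ~0.9 Mb/sec and that the server
    tops out at the quoted ~35 Mb/sec.  The function name is made up for
    illustration. 

```shell
#!/bin/sh
# Rough estimate: how many concurrent readers saturate the file server?
# jobs_to_saturate is a hypothetical helper, not part of any existing tool.
jobs_to_saturate() {
    # $1 = total server throughput, $2 = per-job read rate (same units)
    awk -v total="$1" -v per_job="$2" \
        'BEGIN { printf "%d\n", int(total / per_job) }'
}

# Figures quoted above: ~35 Mb/sec total, ~0.9 Mb/sec per job.
jobs_to_saturate 35 0.9    # prints: 38
```

    Under these assumptions, roughly forty read-heavy jobs use up
    everything the server can deliver. 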
    
    
    This is why it can be reasonable to copy files from /brahms/u and
    /brahms/data<x> to a local directory, like /home/<user>, and then copy
    the output back on completion of the job.  This is done on the CAS
    nodes.  Please refer to my previous email on this issue. 
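    
    A minimal sketch of that stage-in/stage-out pattern, assuming a POSIX
    shell.  The function name `run_staged', the SCRATCH_BASE variable and
    the demo file names are hypothetical; on the CAS nodes the scratch
    area would be a local directory such as /home/<user>, and the real
    input would live under /brahms/u or /brahms/data<x>. 

```shell
#!/bin/sh
# Stage-in / stage-out sketch.  run_staged and SCRATCH_BASE are made-up names.
run_staged() {
    src=$1     # input file on the shared (NFS) disk
    dest=$2    # directory on the shared disk that receives the output
    shift 2    # the remaining arguments are the program to run
    scratch=$(mktemp -d "${SCRATCH_BASE:-/tmp}/stage.XXXXXX") || return 1
    status=0
    # Stage in: one sequential read from the file server.
    cp "$src" "$scratch/input" || status=$?
    # Run the program against local copies only; it receives the local
    # input path and the local output path as its last two arguments.
    if [ "$status" -eq 0 ]; then
        "$@" "$scratch/input" "$scratch/output" || status=$?
    fi
    # Stage out: one sequential write back when the job is done.
    if [ "$status" -eq 0 ]; then
        cp "$scratch/output" "$dest/$(basename "$src").out" || status=$?
    fi
    rm -rf "$scratch"
    return "$status"
}

# Demo: a throwaway directory stands in for the shared disk and `cp'
# stands in for the analysis program.
shared=$(mktemp -d "/tmp/shared.XXXXXX")
printf 'raw data\n' > "$shared/run001.dat"
run_staged "$shared/run001.dat" "$shared" cp
cat "$shared/run001.dat.out"    # prints: raw data
rm -rf "$shared"
```

    The point of the design is that the server sees one read at the start
    and one write at the end, instead of a steady stream of small reads
    for the whole lifetime of the job. 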
    
    > It is conceivable we can get a separate server machine for
    > e.g. /brahms/u + a subset of disks but this will be a while (4-6
    > months) 
    
    That, or some other disk serving system with higher throughput, would
    be a good idea.  
    
    > Although the jobs have been niced, this does not help, and the time
    > to compile/link is about a factor 5-100 worse than when no reading
    > load is present. I started a linkage of brag about 20 minutes ago
    > and it is still not complete 
    
    Are you sure this had anything to do with jobs running in LSF?
    Saturday afternoon (CET) Mads and I were checking out BRAT, which took
    something like 1 hour(!), and then compiled it, again 1 hour+, and no
    jobs were running in LSF - my feeling is that AFS and NFS were
    hanging. `uptime' on rcas0022 gave a load of `0.0, 0.0, 0.0' :-!
    
    > I do not immediately know how to address this other than to look into 
    > a) dividing the pool of rcas into an LSF one and an interactive one 
    
    We already have that division - the CRS/CAS.  Unfortunately normal
    users cannot run jobs on CAS, and also the CAS has some `issues'.  It
    would be clever if something like LSF was used for job scheduling on
    the CAS, since it would allow each user to submit jobs there, leaving
    the CAS for interactive analysis. 
    
    However, the use of LSF has one advantage that is not to be
    overlooked:  Most people tend to start jobs (and long ones too) on the
    CAS, like one of 
    
      prompt% <program> 
      prompt% <program> &
    
    and some even do 
    
      prompt% nice <program> &
    
    But then, it often happens that more than two such jobs are started on
    the same node.  LSF will never allow more than 2 jobs on a node, and
    it does nice the processes (much more than a plain `nice'). 
    
    So at least LSF puts some additional constraints on what kind of
    behaviour is allowed, and makes sure your job is done. 
    
    I believe LSF also does some `accounting' - that is, users who haven't
    used LSF as much as others get priority over those other users,
    thereby ensuring that everybody gets a fair share. 
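    
    For comparison with the `nice <program> &' habit, routing the same
    program through LSF could look roughly like the sketch below.  `bsub'
    with `-q' (queue) and `-o' (output file, where `%J' expands to the
    job ID) are standard LSF options; the wrapper name `submit', the
    DRY_RUN switch, the queue name and the program name are placeholders,
    since the actual queue names are not given here. 

```shell
#!/bin/sh
# Hypothetical wrapper around bsub.  With DRY_RUN=1, or on a machine
# without LSF, it only prints the command it would have run.
submit() {
    queue=$1    # LSF queue name (placeholder; `bqueues' lists real ones)
    shift       # the remaining arguments are the program and its options
    if [ "${DRY_RUN:-0}" = 1 ] || ! command -v bsub >/dev/null 2>&1; then
        echo "bsub -q $queue -o lsf-%J.out $*"
    else
        bsub -q "$queue" -o "lsf-%J.out" "$@"
    fi
}

DRY_RUN=1
submit some_queue ./my_program arg1
# prints: bsub -q some_queue -o lsf-%J.out ./my_program arg1
```

    Once a job is submitted for real, `bjobs' shows its state in the
    queue. 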
    
    Finally, I'd like to remind you that all those outside of BNL will
    hardly ever use the RCF for real interactive work.  The most you'll do
    over an SSH connection is to 
    
      * Edit a few lines in a file (using `vi' or `emacs -nw') 
      * Compile libraries and programs 
      * Submit batch jobs
    
    What you mainly want to do at the CAS is second passes on reduced
    data files, since the files are sitting there locally, and copying
    some O(100) Gb over SCP is a pain in the ...
    
    It would be folly to look at histograms and similar in an interactive
    (brat)root session over an SSH connection through two firewalls.  [To
    those of you who do that: Stop.  Instead, make a directory for
    yourself in /afs/rhic/brahms/user and put the histogram files you
    need to browse there. Then use your local (brat)root installation to
    browse the files via AFS.  Or just SCP your files to your home
    machine.  Anything else is a waste of time and bandwidth.]
    
    > b) make additional queues with different characteristics like io
    >    (fast) small cpu time high bandwidth max one per machine (except
    >    for 0-4) cpu intensive (e.g. simulations) 
    
    I think queues like the ones you suggest would indeed be a good
    thing.  Perhaps a few words on what is to be considered one or the
    other (and some `typical' jobs) would help everyone decide what is the
    appropriate course of action. 
    
    > c) rcas005 will definitely go out of the LSF queues; it is the
    > database machine. 
    
    Oh yes!
     
    > Until then I will appeal to people's common sense not to load the
    > system completely - the impact on interactive use is too much.
    
    Yours, 
    
    Christian Holm Christensen -------------------------------------------
    Address: Sankt Hansgade 23, 1. th.           Phone:  (+45) 35 35 96 91 
             DK-2200 Copenhagen N                Cell:   (+45) 28 82 16 23
             Denmark                             Office: (+45) 353  25 305 
    Email:   cholm@nbi.dk                        Web:    www.nbi.dk/~cholm
    


