Re: use or misuse of LSF queues ??

From: Christian Holm Christensen (cholm@hehi03.nbi.dk)
Date: Tue Apr 02 2002 - 06:42:06 EST

Next message: Flemming Videbaek: "Re: use or misuse ofLSF queues ??"

Previous message: Flemming Videbaek: "use or misuse ofLSF queues ??"
In reply to: Flemming Videbaek: "use or misuse ofLSF queues ??"
Next in thread: Flemming Videbaek: "Re: use or misuse ofLSF queues ??"
Reply: Flemming Videbaek: "Re: use or misuse ofLSF queues ??"

Hi All, 

On Mon, 1 Apr 2002 14:37:45 -0500
"Flemming Videbaek" <videbaek@sgs1.hirg.bnl.gov> wrote
concerning "use or misuse of LSF queues ??":
> It has been described recently how to use the LSF system to submit
> jobs. This does not mean it is necessarily reasonable to fill the
> queues with apparently CPU-intensive jobs. 
> 
> Let me explain why:
> 
> The brahms data+user disk is served by a single SUN that is capable
> of delivering at most ~35 Mb/sec.  If the CAS nodes are all loaded
> with jobs that attempt to read from the data disks at a high rate
> (> 0.9 Mb/sec), one will get a very sluggish response (as is the
> case presently). 


This is why it can be reasonable to copy files from /brahms/u and
/brahms/data<x> to a local directory, like /home/<user>, and then copy
the output back on completion of the job.  This is done on the CAS
nodes.  Please refer to my previous email on this issue. 
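
A minimal sketch of that stage-in/stage-out pattern, assuming the job
is wrapped in a small Python script.  The name `run_staged' and all of
its parameters are made up for illustration; by default the scratch
directory lands in the system temporary directory, so pass something
like /home/<user> as `scratch_root' if that is where local space is.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_staged(inputs, command, output_name, dest_dir, scratch_root=None):
    """Copy inputs to node-local scratch, run command there, copy output back.

    inputs       -- files on the shared disk
    command      -- argument list, run with the scratch directory as cwd
    output_name  -- file the command is expected to write in its cwd
    dest_dir     -- shared directory that receives the output at the end
    scratch_root -- local directory to create the scratch area in (hypothetical)
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        scratch = Path(scratch)
        # Stage in: each input is read from the shared server exactly once.
        for src in inputs:
            shutil.copy2(src, scratch / Path(src).name)
        # While the job runs, all reads and writes hit the local disk only.
        subprocess.run(command, cwd=scratch, check=True)
        # Stage out: one write back to the shared disk when the job is done.
        dest = Path(dest_dir) / output_name
        shutil.copy2(scratch / output_name, dest)
    return dest
```

The point of the design is that the file server sees one burst of
reading at the start and one burst of writing at the end, instead of a
steady read load for the whole lifetime of the job.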

> It is conceivable we can get a separate server machine for
> e.g. /brahms/u + a subset of the disks but this will be a while (4-6
> months) 

That, or some other disk serving system with higher throughput, would
be a good idea.  
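
To put the numbers quoted above in perspective, here is a rough
back-of-the-envelope estimate, written as a few lines of Python.  It
only uses the two rates from Flemming's mail and the LSF limit of two
jobs per node mentioned further down; it assumes every job reads at
exactly the quoted rate, which real jobs of course will not.

```python
SERVER_RATE = 35.0   # quoted upper limit of the single SUN server (Mb/sec)
JOB_RATE = 0.9       # quoted per-job read rate where trouble starts (Mb/sec)
JOBS_PER_NODE = 2    # LSF per-node job limit


def jobs_to_saturate(server_rate, job_rate):
    """Number of concurrent readers that use up the whole server rate."""
    return server_rate / job_rate


jobs = jobs_to_saturate(SERVER_RATE, JOB_RATE)
nodes = jobs / JOBS_PER_NODE
print(f"{jobs:.0f} jobs, i.e. about {nodes:.0f} fully loaded nodes")
# prints: 39 jobs, i.e. about 19 fully loaded nodes
```

So, under these assumptions, a few tens of I/O-heavy jobs are enough
to take everything the server can deliver.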

> Although the jobs have been niced, this does not help, and the time
> to compile/link is about a factor 5-100 worse than when no reading
> load is present. I started linking brag about 20 minutes ago and
> it is still not complete 

Are you sure this had anything to do with jobs running in LSF?
Saturday afternoon (CET) Mads and I were checking out BRAT, which took
something like 1 hour(!), and then compiled it, again 1 hour+, and no
jobs were running in LSF - my feeling is that AFS and NFS were
hanging. `uptime' on rcas0022 gave a load of `0.0, 0.0, 0.0' :-!

> I do not immediately know how to address this other than to look into 
> a) divide the pool of rcas into an LSF one and an interactive one 

We already have that division - the CRS/CAS.  Unfortunately normal
users cannot run jobs on CAS, and also the CAS has some `issues'.  It
would be clever if something like LSF was used for job scheduling on
the CAS, since it would allow each user to submit jobs there, leaving
the CAS for interactive analysis. 

However, the use of LSF has one advantage that is not to be
overlooked:  Most people tend to start jobs (and long ones too) on the
CAS, like one of 

  prompt% <program> 
  prompt% <program> &

and some even do 

  prompt% nice <program> &

But then, it often happens that more than two such jobs are started on
the same node.  LSF will never allow more than 2 jobs on a node, and
it does nice the processes (much more than a plain `nice'). 
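
As an illustration of what such a per-node limit does, here is a toy
dispatcher in Python.  This is not how LSF is implemented, just a
sketch of the idea; `dispatch' and the node names in the example are
hypothetical.

```python
from collections import deque

MAX_JOBS_PER_NODE = 2  # the per-node limit described above


def dispatch(pending, nodes, running=None):
    """Start pending jobs on nodes with a free slot; the rest keep waiting.

    pending -- deque of job names, oldest first (modified in place)
    nodes   -- list of node names
    running -- dict node -> list of jobs already running there (optional)
    """
    # Copy the current state so the caller's dict is not modified.
    running = {n: list((running or {}).get(n, [])) for n in nodes}
    for node in nodes:
        # Fill free slots only; a full node never gets a third job.
        while pending and len(running[node]) < MAX_JOBS_PER_NODE:
            running[node].append(pending.popleft())
    return running


queue = deque(["a", "b", "c", "d", "e"])
print(dispatch(queue, ["node1", "node2"]))
# prints: {'node1': ['a', 'b'], 'node2': ['c', 'd']}
print(list(queue))
# prints: ['e']
```

Job `e' simply waits until a slot frees up, whereas with a plain
`<program> &' it would have been started as a third job on some node.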

So at least LSF puts some additional constraints on what kind of
behaviour is allowed, and makes sure your job is done. 

I believe LSF also does some `accounting' - that is, users who haven't
used LSF as much as others get priority over those other users,
thereby ensuring that everybody gets a fair share. 
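
The idea can be sketched in a few lines of Python.  I do not know the
details of how LSF actually computes priorities, so take this purely
as an illustration of the principle; the function name, the user names
and the usage numbers are all made up.

```python
def fair_share_order(pending, usage):
    """Order pending jobs so that owners with less past usage go first.

    pending -- list of (job_name, owner) in submission order
    usage   -- dict owner -> CPU hours already consumed (missing means 0)
    """
    # sorted() is stable, so jobs whose owners have the same usage keep
    # their submission order.
    return sorted(pending, key=lambda job: usage.get(job[1], 0.0))


pending = [("sim1", "alice"), ("reco1", "bob"),
           ("sim2", "alice"), ("ana1", "carol")]
usage = {"alice": 120.0, "bob": 3.5}   # carol has not used the system yet
print(fair_share_order(pending, usage))
# prints: [('ana1', 'carol'), ('reco1', 'bob'), ('sim1', 'alice'), ('sim2', 'alice')]
```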

Finally, I'd like to remind you that all those outside of BNL will
hardly ever use the RCF for real interactive work.  The most you'll do
over an SSH connection is to 

  * Edit a few lines in a file (using `vi' or `emacs -nw') 
  * Compile libraries and programs 
  * Submit batch jobs

What you mainly want to do at the CAS, is to do second passes on
reduced data files, since the files are sitting there locally, and
copying some O(100)Gb over SCP is a pain in the ...

It would be folly to look at histograms and similar in an interactive
(brat)root session over an SSH connection through two firewalls.  [To
those of you who do that: Stop.  Instead, make a directory for
yourself in /afs/rhic/brahms/user and put the histogram files you
need to browse there. Then use your local (brat)root installation to
browse the files via AFS.  Or just SCP your files to your home
machine.  Anything else is a waste of time and bandwidth.]

> b) make additional queues with different characteristics like io
>    (fast): small cpu time, high bandwidth, max one per machine
>    (except for 0-4); cpu intensive (e.g. simulations) 

I think queues like the ones you suggest would indeed be a good
thing.  Perhaps a few words on what is to be considered one or the
other (and some `typical' jobs) would help everyone decide what is the
appropriate course of action. 

> c) rcas005 will definitely go out of the LSF queues; it is the
> database machine. 

Oh yes!
 
> Until then I will appeal to people's common sense not to load the
> system completely - the impact on interactive use is too much.

Yours, 

Christian Holm Christensen -------------------------------------------
Address: Sankt Hansgade 23, 1. th.           Phone:  (+45) 35 35 96 91 
         DK-2200 Copenhagen N                Cell:   (+45) 28 82 16 23
         Denmark                             Office: (+45) 353  25 305 
Email:   cholm@nbi.dk                        Web:    www.nbi.dk/~cholm


This archive was generated by hypermail 2b30 : Tue Apr 02 2002 - 06:43:37 EST