Hi All,

On Mon, 1 Apr 2002 14:37:45 -0500
"Flemming Videbaek" <videbaek@sgs1.hirg.bnl.gov> wrote
concerning "use or misuse of LSF queues ??":

> It has been described recently how to use the LSF system to submit
> jobs. This does not mean it necessarily is reasonable to fill the
> queues with apparently CPU-intensive jobs.
>
> Let me explain why:
>
> The brahms data+user disk is served by a single SUN that is capable
> of delivering at most ~35 Mb/sec. If the CAS nodes are all loaded
> with jobs that attempt to read from the data disks at a high rate
> (> .9 Mb/sec), one will get a very sluggish response (as is the case
> presently).

This is why it can be reasonable to copy files from /brahms/u and
brahms/data<x> to a local directory, like /home/<user>, and then copy
the output back on completion of the job. This is done on the CAS
nodes. Please refer to my previous email on this issue.

> It is conceivable we can get a separate server machine for
> e.g. /brahms/u + a subset of disks but this will be a while (4-6
> months)

That, or some other disk serving system with higher throughput, would
be a good idea.

> Despite the jobs having been niced, this does not help, and the time
> to compile/link is about a factor 5-100 worse than when no reading
> load is present. I started a linkage of brag about 20 minutes ago and
> it is still not complete

Are you sure this had anything to do with jobs running in LSF?
Saturday afternoon (CET) Mads and I were checking out BRAT, which took
something like 1 hour!, then compiled it, again 1 hour+, and no jobs
were running in LSF - my feeling is that AFS and NFS were hanging.
`uptime' on rcas0022 gave a load of `0.0, 0.0, 0.0' :-!

> I do not immediately know how to address this other than look into
> a) divide the pool of rcas into LSF and interactive one

We already have that division - the CRS/CAS. Unfortunately normal
users cannot run jobs on CAS, and also the CAS has some `issues'.
It would be clever if something like LSF were used for job scheduling
on the CAS, since it would allow each user to submit jobs there,
leaving the CAS for interactive analysis.

However, the use of LSF has one advantage that is not to be
overlooked: Most people tend to start jobs (and long ones too) on the
CAS, like one of

  prompt% <program>
  prompt% <program> &

and some even do

  prompt% nice <program> &

But then, it often happens that more than two such jobs are started on
the same node. The LSF will never allow more than 2 jobs on a node,
and it does nice the processes (much more than a plain `nice'). So at
least LSF puts some additional constraints on what kind of behaviour
is allowed, and makes sure your job is done.

I believe LSF also does some `accounting' - that is, users that
haven't used LSF as much as others get priority over those other
users, thereby ensuring that everybody gets a fair share.

Finally, I'd like to remind you that all those outside of BNL will
hardly ever use the RCF for real interactive work. The most you'll do
over an SSH connection is to

  * Edit a few lines in a file (using `vi' or `emacs -nw')
  * Compile libraries and programs
  * Submit batch jobs

What you mainly want to do at the CAS is to do second passes on
reduced data files, since the files are sitting there locally, and
copying some O(100)Gb over SCP is a pain in the ... It would be folly
to look at histograms and similar in an interactive (brat)root session
over an SSH connection through two firewalls.

[To those of you who do that: Stop. Instead, make a directory for
yourself in /afs/rhic/brahms/user and put the histogram files you need
to browse there. Then use your local (brat)root installation to browse
the files via AFS. Or just SCP your files to your home machine.
Anything else is a waste of time and bandwidth.]

> b) make additional queues with different characteristics like io
> (fast) small cpu time high bandwidth max one per machine (except
> for 0-4) cpu intensive (e.g. simulations)

I think queues like the ones you suggest would indeed be a good thing.
Perhaps a few words on what is to be considered one or the other (and
some `typical' jobs) would help everyone decide what is the
appropriate course of action.

> c) rcas005 will definitely go out of LSF queues it is the database
> machine.

Oh yes!

> Until then I will appeal to people's common sense not to load the
> system completely - the impact on interactive use is too much.

Yours,

Christian Holm Christensen -------------------------------------------
Address: Sankt Hansgade 23, 1. th.  Phone:  (+45) 35 35 96 91
         DK-2200 Copenhagen N       Cell:   (+45) 28 82 16 23
         Denmark                    Office: (+45) 353 25 305
Email:   cholm@nbi.dk               Web:    www.nbi.dk/~cholm
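
P.S. A minimal sketch of the copy-in, run, copy-out pattern mentioned
above (copy the input from the shared disk to a local directory, work
there, copy the output back on completion). The function name
`run_staged', its parameters and the directory layout are illustrative
assumptions only, not part of any BRAHMS or LSF tool:

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def run_staged(inputs: Iterable[Path], scratch_root: Path, output_dir: Path,
               job: Callable[[Path], Iterable[Path]]) -> List[Path]:
    """Copy inputs to node-local scratch, run the job there, copy results back.

    `job` (hypothetical callback) receives the scratch directory, with the
    inputs already inside it, and returns the output files it wrote there.
    """
    scratch = Path(tempfile.mkdtemp(prefix="stage_", dir=scratch_root))
    try:
        # One sequential read per input file from the shared server ...
        for src in inputs:
            shutil.copy2(src, scratch / Path(src).name)
        # ... after which the job only touches the local disk.
        outputs = list(job(scratch))
        # One sequential write per output file back to the shared disk.
        output_dir.mkdir(parents=True, exist_ok=True)
        return [Path(shutil.copy2(out, output_dir / Path(out).name))
                for out in outputs]
    finally:
        # Always free the local disk, even if the job failed.
        shutil.rmtree(scratch, ignore_errors=True)
```

The point of the design is that the shared server sees one bulk read
and one bulk write per job instead of a steady stream of small reads
for as long as the job runs.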
This archive was generated by hypermail 2b30 : Tue Apr 02 2002 - 06:43:37 EST