Dear BRAHMS people (especially those with strong opinions on computing)
I got a message from the RCF guys about what looks to be the most
likely candidate for the batch queueing software to be installed on the
present CRS hosts and on a new cluster of approximately 10 quad
PentiumPro systems which they are in the process of procuring. Following are
the comments they made about the system. There is also information at
http://nhse.cs.rice.edu/NHSEreview/CMS/ on batch queueing systems in
general and at http://www.scri.fsu.edu/~pasko/dqs.html for the DQS system
in particular. If you have any comments, positive or negative, about this
system, please send them to me.
Their message:
It would seem, from a probe of likely vendors of cluster management
packages (LSF, CODINE, NQS), that Solaris x86 versions are not currently
available. Locally, we have been able to build PBS and DQS on our machines.
DQS was very easy to build and seems very simple to administer relative
to PBS, which requires the ISODE libraries to compile and whose
flexibility comes at the cost of some programming in C or Tcl to define
scheduling policies.
Some DQS features which might be relevant in the RCF context:
+ subordinate queues -- A queue can be defined which is subordinate
to another so that if there are no jobs in the primary queue,
computations can be taking place in the subordinate queue. This
could be used to ensure that there were no needlessly idle processors.
If a job is started in the primary queue, the job in the subordinate
queue is stopped. Queues can also be defined in which jobs run at
a lower UNIX priority.
+ Delegation of authority -- There is a list of managers and a list of
operators associated with each queue, and the managers can add users
to the ACL for the queue without having to appeal to "root" to carry
out the task.
+ Arbitrary consumables -- A quantity of consumables may be defined
to be associated with a queue or complex of queues. This could be
software licences, memory, disk, or something even more abstract.
+ DQS cells can be set up so that jobs can be submitted from the
General computing environment hosts, and then logins can be disabled
to the CPU farm machines.
+ Support for various "paradigms" [sic!] of parallel execution (PVM,
MPI, p4, etc).
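To make the subordinate-queue and consumables behaviour described above concrete, here is a minimal toy sketch in Python. It is not DQS code and does not use DQS's configuration format; the class `Queue`, its fields, and the job and queue names are all hypothetical, chosen only to illustrate the semantics: a job starting in the primary queue stops the jobs in its subordinate queue, and a job only starts if the consumables it asks for are available.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Queue:
    """Toy model of a batch queue (hypothetical, not DQS's real structures)."""
    name: str
    consumables: Dict[str, int] = field(default_factory=dict)  # e.g. licences
    subordinate: Optional["Queue"] = None
    running: List[str] = field(default_factory=list)
    suspended: List[str] = field(default_factory=list)

    def start(self, job: str, needs: Optional[Dict[str, int]] = None) -> bool:
        """Start a job if every requested consumable is available."""
        needs = needs or {}
        if any(self.consumables.get(k, 0) < v for k, v in needs.items()):
            return False  # some consumable is exhausted; the job stays queued
        for k, v in needs.items():
            self.consumables[k] -= v
        self.running.append(job)
        # A job starting in the primary queue stops the subordinate's jobs.
        if self.subordinate is not None:
            self.subordinate.suspend_all()
        return True

    def suspend_all(self) -> None:
        """Move every running job to the suspended list."""
        self.suspended.extend(self.running)
        self.running.clear()


# Assumed example setup: one background queue subordinate to a primary queue
# that owns a single (hypothetical) software licence.
low = Queue("background")
high = Queue("primary", consumables={"licence": 1}, subordinate=low)

low.start("sim-1")                                # runs while primary is idle
high.start("reco-1", {"licence": 1})              # stops sim-1 in background
second = high.start("reco-2", {"licence": 1})     # no licence left
print(low.running, low.suspended, second)
```

In this sketch the background job is simply parked in a list; how a real system stops and resumes the job is outside what the sketch models.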
Drawbacks include:
- Once jobs are run, they cannot be moved and re-started (checkpointing
is not supported), though jobs can be moved from queue to queue before
they are run.
- The documentation needs a little work, especially for a package in its 3.0s.