Batch queueing system for RCF

From: HAGEL@comp.tamu.edu
Date: Thu May 15 1997 - 10:31:44 EDT


Dear BRAHMS people (especially those with strong opinions on computing)

  I got a message from the RCF guys about what looks to be the most
  likely candidate for the batch queueing software to be installed on the
  present CRS hosts and on a new cluster of approximately 10 quad
  PentiumPro systems which they are in the process of procuring. Following are
  the comments they made about the system. There is also information at
  http://nhse.cs.rice.edu/NHSEreview/CMS/ on batch queueing systems in
  general and at http://www.scri.fsu.edu/~pasko/dqs.html for the DQS system
  in particular. If you have any comments, positive or negative, about this
  system, please send them to me.

  Their message:

  It would seem, from a probe of likely vendors of cluster management
  packages (LSF, CODINE, NQS), that Solaris x86 versions are not currently
  available. Locally, we have been able to build PBS and DQS on our machines.

  DQS was very easy to build and seems very simple to administer relative
  to PBS. PBS requires the ISODE libraries to compile, and the price of
  its flexibility is some programming in C or TCL to define scheduling
  policies.

  Some DQS features which might be relevant in the RCF context:

    + subordinate queues -- A queue can be defined which is subordinate
      to another so that if there are no jobs in the primary queue,
      computations can take place in the subordinate queue. This
      could be used to ensure that there are no needlessly idle processors.
      If a job is started in the primary queue, the job in the subordinate
      queue is stopped. Queues can also be defined in which jobs run at
      a lower UNIX priority.

    + Delegation of authority -- There is a list of managers and a list of
      operators associated with each queue, and the managers can add users
      to the ACL for the queue without having to appeal to "root" to carry
      out the task.

    + Arbitrary consumables -- A quantity of consumables may be defined
      to be associated with a queue or complex of queues. This could be
      software licences, memory, disk, or something even more abstract.

    + DQS cells can be set up so that jobs can be submitted from the
      General computing environment hosts, and then logins can be disabled
      to the CPU farm machines.

    + Support for various "paradigms" [sic!] of parallel execution (PVM,
      MPI, p4, etc).
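
  The subordinate-queue behaviour described in the list above can be
  illustrated with a small toy model. This is only a sketch under stated
  assumptions, not DQS code or DQS configuration: the names Job, Queue and
  SubordinatePair are hypothetical, the model has a single processor slot,
  and it assumes that a stopped subordinate job resumes once the primary
  queue is empty (the message only says the job is stopped).

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    state: str = "queued"  # one of "queued", "running", "suspended"


@dataclass
class Queue:
    name: str
    jobs: list = field(default_factory=list)

    def running(self):
        """Return the jobs in this queue that are currently running."""
        return [job for job in self.jobs if job.state == "running"]


class SubordinatePair:
    """Toy model: a primary queue and a subordinate queue sharing one slot."""

    def __init__(self, primary, subordinate):
        self.primary = primary
        self.subordinate = subordinate

    def submit(self, queue, job):
        queue.jobs.append(job)
        self.reschedule()

    def finish(self, queue, job):
        queue.jobs.remove(job)
        self.reschedule()

    def reschedule(self):
        if self.primary.jobs:
            # Primary work exists: stop anything running in the subordinate queue.
            for job in self.subordinate.running():
                job.state = "suspended"
            if not self.primary.running():
                self.primary.jobs[0].state = "running"
        elif self.subordinate.jobs and not self.subordinate.running():
            # Primary queue is empty: use the otherwise idle slot.
            # Assumption: the oldest subordinate job starts or resumes.
            self.subordinate.jobs[0].state = "running"


if __name__ == "__main__":
    pair = SubordinatePair(Queue("primary"), Queue("subordinate"))
    low = Job("low-priority")
    high = Job("high-priority")
    pair.submit(pair.subordinate, low)   # slot is idle, so the job runs
    pair.submit(pair.primary, high)      # primary job arrives, subordinate stops
    print(low.state, high.state)
    pair.finish(pair.primary, high)      # primary queue drains
    print(low.state)
```

  In this model the subordinate job only ever occupies a slot that would
  otherwise sit idle, which is the point of the feature as described.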

  Drawbacks include

    - Once jobs are run, they cannot be moved and re-started (checkpointing
      is not supported), though jobs can be moved from queue to queue before
      they are run.

    - The documentation needs a little work, especially for a package in
      its 3.0s.


