By Costin Caramarcu

Slurm

Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Commands cheat sheet
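
A minimal cheat sheet of the Slurm commands used most often on the cluster is shown below; the job ID is a placeholder:

sbatch myscript.sh          # submit a batch script to the queue
squeue -u $USER             # list your pending and running jobs
scancel <jobid>             # cancel a queued or running job
sinfo                       # show partitions and node states
scontrol show job <jobid>   # show detailed information about a job
sacct -j <jobid>            # show accounting information for a finished job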

Batch Jobs

Batch jobs are jobs that run non-interactively under the control of a "batch script," which is a text file containing a number of job directives and Linux commands or utilities. Batch scripts are submitted to the "batch system," where they are queued awaiting free resources.

A simple SLURM batch script will look like this:

#!/bin/bash
#SBATCH -p long
#SBATCH -t 01:00:00
#SBATCH --account accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --qos normal
#SBATCH --gres=gpu:4
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./my_executable

or by using the long format of SLURM keywords, as below:

#!/bin/bash
#SBATCH --partition long
#SBATCH --time=01:00:00
#SBATCH -A accountname
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --qos normal
#SBATCH --gres=gpu:4
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./my_executable

The example above contains:

  • The Shell: SLURM batch scripts require users to specify which shell to use. In this case, we use bash by specifying #!/bin/bash. Batch scripts will not run without a shell being specified in the first line.
  • #SBATCH Directives: In the example, the directives tell the scheduler how many nodes to allocate for the job (2), for how long, in which partition, under which account, with which QOS, and how many cores to use. Directives can also specify things such as what to name the standard output files, whether to send email notification on job completion, how many tasks to run, how many tasks per node, and so on (see the sketch after this list).
  • The srun command starts execution of the application on the compute nodes.
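
As a sketch, the directives below extend the example above with a custom output file name, email notification, and an explicit task layout; the account name and email address are placeholders:

#!/bin/bash
#SBATCH -p long
#SBATCH -t 01:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH --ntasks-per-node=32
#SBATCH -o myjob_%j.out
#SBATCH -e myjob_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com
module load mpi/openmpi-1.10.2
srun ./my_executable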


Submitting Jobs

Once the batch script is ready, the submission can be done as follows:

% sbatch myscript.sh

sbatch directives can also be specified as command-line options at submission, but we recommend putting the directives in the script instead. That way the batch script keeps a record of the directives used, which is useful for record-keeping as well as for debugging should something go wrong.
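
For example, the same directives could also be passed on the command line (the partition and account names are placeholders):

% sbatch -p long -t 01:00:00 -A accountname -N 2 -n 64 myscript.sh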


Choosing a Partition

The cluster has several partitions to choose from. The main purpose of having different partitions is to control scheduling priorities and set limits on the number of jobs of varying sizes. Different partitions may have distinct charge rates. This somewhat complex partition structure strives to achieve an optimal balance among fairness, wait times, and run times.

When submitting a batch job, the most common partitions to choose are listed below, followed by a short example:

  • long: Use this for almost all production runs.
  • debug: Use this for small, short test runs.
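
As a quick sketch, sinfo lists the partitions and their limits, and a short test could then be submitted to the debug partition (the script name is a placeholder):

% sinfo                                     # list partitions, time limits, and node states
% sbatch -p debug -t 00:10:00 test_job.sh   # submit a short test job to the debug partition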


Job Output

Standard output (STDOUT) and standard error (STDERR) messages from jobs are written directly to the output or error file names specified in the batch script, or to the default output file (slurm-jobid.out) in the submit directory ($SLURM_SUBMIT_DIR). These files can be monitored while a job runs.

#SBATCH -o hostname_%j.out # File to which STDOUT will be written
#SBATCH -e hostname_%j.err # File to which STDERR will be written
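
For instance, with the directives above, a running job's output can be followed from the submit directory (the job ID 12345 is a placeholder):

% squeue -u $USER              # find the job ID and check the job state
% tail -f hostname_12345.out   # follow STDOUT as the job writes it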


Using Modules to Manage Environment

Modules are used to manage different software environments on the cluster. module avail lists the available modules on the cluster, and a module can be added to a job or session environment using module load $modulename.

As an example, to compile an OpenMPI program:

module load mpi/openmpi-x86_64
mpicc -o hello hello_world.c
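
The compiled program can then be run under Slurm; the sketch below reuses the mpi/openmpi-x86_64 module and a placeholder account name:

#!/bin/bash
#SBATCH -p debug
#SBATCH -t 00:10:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH --ntasks-per-node=32
module load mpi/openmpi-x86_64
srun ./hello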


Using GPUs

To run jobs using GPUs, please add the following option to the sbatch script or the srun command line:

--gres=gpu:4

gres stands for "generic resource"; gpu is the name of the resource and 4 is the count of GPUs to be used. The general format is name[[:type]:count].
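
The option works both as a batch directive and directly on the srun command line, as in this sketch (partition, account, and program names are placeholders):

#SBATCH --gres=gpu:4                                                # in a batch script
% srun -p long -A accountname -N 1 --gres=gpu:2 ./my_gpu_program    # on the command line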


Choosing GPU Type

There are two different types of GPUs in the Institutional Cluster. Each compute node has either 2 Tesla K80 or 2 Pascal P100 GPUs. Each K80 appears as 2 GPU devices, while each P100 appears as 1 GPU device.

To choose P100 nodes use: --constraint=pascal or -C pascal

To choose K80 nodes use: --constraint=tesla or -C tesla

Example job script using K80 GPUs:

#!/bin/bash
#SBATCH -p long
#SBATCH -t 1:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -C tesla
#SBATCH --gres=gpu:4
#SBATCH --qos=normal

module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./myexecutable

Example job script using P100 GPUs:

#!/bin/bash
#SBATCH -p long
#SBATCH -t 1:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -C pascal
#SBATCH --gres=gpu:2
#SBATCH --qos=normal

module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./myexecutable

Note that if the "-C" flag is not used and the job requests "--gres=gpu:1" or "--gres=gpu:2", it may land on K80 nodes, P100 nodes, or a mix of both.
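
To confirm which GPU type a job landed on, a quick check such as the following can be used, assuming nvidia-smi is available on the GPU nodes (partition and account names are placeholders):

% srun -p long -A accountname -C pascal --gres=gpu:1 nvidia-smi -L   # show the GPUs on the allocated node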


Working Directory

Slurm allows setting the working directory of the batch script before it is executed. The path can be specified as a full path or as a path relative to the directory where the command is executed. If no working directory is specified, the current directory is used.

--chdir=/hpcgpfs01/scratch/temp

Please note that input files might need to be copied there.
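
A minimal sketch, assuming the scratch path from the example above and a placeholder input file and account name:

#!/bin/bash
#SBATCH -p long
#SBATCH -t 01:00:00
#SBATCH -A accountname
#SBATCH -N 1
#SBATCH --chdir=/hpcgpfs01/scratch/temp
cp $SLURM_SUBMIT_DIR/input.dat .    # copy the input file into the working directory
srun ./my_executable input.dat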


#SBATCH Keywords

The following lists describe required and useful #SBATCH keywords; each entry gives the short format, long format, default (where applicable), and a description.

Required sbatch Options/Directives

  • -N count / --nodes=count (default: one node): Allocate count nodes to your job.
  • -t HH:MM:SS / --time=HH:MM:SS (default: 00:30:00): Always specify the maximum wallclock time for your job.
  • -p partition / --partition=partition (default: debug): Always specify your partition, which will usually be debug for testing and long for production runs. See "Queues and Policies."

 

Useful sbatch Options/Directives

  • --ntasks-per-node=count (default: 32): Run count MPI tasks per node.
  • -c count / --cpus-per-task=count (default: 2): Run count threads per MPI task (for an MPI/OpenMP hybrid code; to run in pure MPI mode, set OMP_NUM_THREADS to 1).
  • -J name / --job-name=name (default: the job script name): Job name, up to 15 printable, non-whitespace characters.
  • -A account / --account=account (default: your default account): Charge this job to the specified IC account (necessary only if you have more than one IC account).
  • -e filename / --error=filename: Write STDERR to filename (by default it goes to the same slurm-%j.out file as STDOUT).
  • -o filename / --output=filename: Write STDOUT to filename. By default both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is replaced with the job allocation number. See the -i option for the filename pattern syntax.
  • -i filename / --input=filename (default: /dev/null): Connect the batch script's standard input to the file specified by the filename pattern. By default, "/dev/null" is open on the batch script's standard input, and both standard output and standard error are directed to a file named "slurm-%j.out", where "%j" is the job ID. The filename pattern may contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g. %j). Supported replacement symbols are:
      • %j: the job allocation number.
      • %N: the node name. Only one file is created, so %N is replaced by the name of the first node in the job, which is the one that runs the script.
  • --mail-type=events, --mail-user=address: Send email notification on the specified job events to address. Valid event values are: BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of the time limit), TIME_LIMIT_80 (reached 80 percent of the time limit), and TIME_LIMIT_50 (reached 50 percent of the time limit). Multiple event values may be specified in a comma-separated list. The user to be notified is indicated with --mail-user. Mail notifications on job BEGIN, END, and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the array.
  • -D directory_name / --chdir=directory_name: Set the working directory of the batch script to directory_name before it is executed. The path can be specified as a full path or as a path relative to the directory where the command is executed.
  • --export=ALL (default): Export the current environment variables into the batch job environment. This is the default behavior.

 

All options may be specified either as sbatch command-line options or as #SBATCH directives in the batch script. Note: when both are used, command-line options override the corresponding options in the batch script.
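
For example, a time limit set in the script can be overridden at submission time (the values are placeholders):

% sbatch -t 02:00:00 myscript.sh   # overrides a "#SBATCH -t 01:00:00" directive in the script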

 
