Slurm
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
Batch Jobs
Batch jobs are jobs that run non-interactively under the control of a "batch script," which is a text file containing a number of job directives and Linux commands or utilities. Batch scripts are submitted to the "batch system," where they are queued awaiting free resources.
A simple SLURM batch script will look like this:
#!/bin/bash
#SBATCH -p long
#SBATCH -t 01:00:00
#SBATCH --account accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH --qos normal
#SBATCH --gres=gpu:4
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./my_executable
or by using the long format of SLURM keywords, as below:
#!/bin/bash
#SBATCH --partition long
#SBATCH --time=01:00:00
#SBATCH --account=accountname
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --qos normal
#SBATCH --gres=gpu:4
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./my_executable
The example above contains:
- The shell: SLURM batch scripts require users to specify which shell to use. In this case, we use bash by specifying #!/bin/bash on the first line. Batch scripts will not run without a shell being specified in the first line.
- #SBATCH directives: In the example, the directives tell the scheduler how many nodes to allocate, for how long, in which partition, under which account and QOS, and how many cores to use. Directives can also specify things such as what to name the standard output files, whether to notify by email on job completion, how many tasks to run, how many tasks per node, and so on (see the sketch after this list).
- The srun command is used to start execution of the application on the compute nodes.
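As a sketch of those additional directives, the fragment below adds a job name, explicit output and error file names, and email notification to the example above. The job name, file names, and email address are placeholders; the options shown are standard sbatch keywords (see the table at the end of this page).
#!/bin/bash
#SBATCH -p long
#SBATCH -t 01:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -J my_job_name                  # job name shown in the queue
#SBATCH -o my_job_%j.out                # file to which STDOUT will be written (%j = job ID)
#SBATCH -e my_job_%j.err                # file to which STDERR will be written
#SBATCH --mail-type=END,FAIL            # send email when the job ends or fails
#SBATCH --mail-user=user@example.com    # placeholder address
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./my_executable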
Submitting Jobs
Once the batch script is ready, the submission can be done as follows:
% sbatch myscript.sh
sbatch directives can also be specified as command-line options at submission time, but we recommend putting the directives in the script instead. That way the batch script keeps a record of the directives used, which is useful both for record-keeping and for debugging should something go wrong.
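For example, the following submission overrides the time limit and node count set by the #SBATCH directives in myscript.sh (command-line options take precedence over directives in the script, as noted at the end of this page):
% sbatch --time=02:00:00 --nodes=4 myscript.sh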
Choosing a Partition
The cluster has several partitions to choose from. The main purpose of having different partitions is to control scheduling priorities and set limits on the number of jobs of varying sizes. Different partitions may have distinct charge rates. This somewhat complex partition structure strives to achieve an optimal balance among fairness, wait times, and run times.
When submitting a batch job, the most common partitions to choose are:
- long: Use this for almost all production runs.
- debug: Use this for small, short test runs (see the example below).
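For example, a short test run might select the debug partition directly in its batch script; the 10-minute time limit here is only illustrative:
#SBATCH -p debug
#SBATCH -t 00:10:00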
Job Output
Standard output (STDOUT) and standard error (STDERR) messages from jobs are written directly to the output or error file names specified in the batch script, or to the default output file name (slurm-jobid.out) in the submit directory ($SLURM_SUBMIT_DIR). These files can be monitored during a job run (see the example below).
#SBATCH -o hostname_%j.out # File to which STDOUT will be written
#SBATCH -e hostname_%j.err # File to which STDERR will be written
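While a job is running, you can watch its output grow with a standard tool such as tail. The job ID 12345 is a placeholder, and the file name follows the -o pattern above (use slurm-12345.out if no -o was given):
% tail -f hostname_12345.out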
Using Modules to Manage Environment
Modules are used to manage different software environments on the cluster. Running module avail will list the available modules on the cluster, and a module can be added to your job or session environment using module load modulename.
As an example, to compile an OpenMPI program:
module load mpi/openmpi-x86_64
mpicc -o hello hello_world.c
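A minimal sequence for inspecting and loading modules might look like the following; the module name gcc/5.3.0 is just the version used in the examples above and will vary by cluster:
module avail              # list all modules available on the cluster
module load gcc/5.3.0     # add a module to the current environment
module list               # verify which modules are currently loaded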
Using GPUs
To run jobs using GPUs, please add the following line to your sbatch script or srun command line:
--gres=gpu:4
gres stands for generic resource, gpu is the name of the resource, and 4 is the number of GPUs to be used (the general form is name[[:type]:count]).
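The same request can be made either as a directive in the batch script or directly on the srun command line, for example:
#SBATCH --gres=gpu:4                   # in the batch script
srun --gres=gpu:4 ./my_executable      # or on the srun command line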
Choosing GPU Type
There are two different types of GPUs in the Institutional Cluster. Each compute node has either 2 Tesla K80 or 2 Pascal P100 GPUs. Each K80 appears as 2 GPU devices, while each P100 appears as 1 GPU device.
To choose P100 nodes use: --constraint=pascal or -C pascal
To choose K80 nodes use: --constraint=tesla or -C tesla
Example job script using K80 GPUs:
#SBATCH -p long
#SBATCH -t 1:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -C tesla
#SBATCH --gres=gpu:4
#SBATCH --qos=normal
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./myexecutable
Example job script using P100 GPUs:
#SBATCH -p long
#SBATCH -t 1:00:00
#SBATCH -A accountname
#SBATCH -N 2
#SBATCH -n 64
#SBATCH -C pascal
#SBATCH --gres=gpu:2
#SBATCH --qos=normal
module load mpi/openmpi-1.10.2
module load gcc/5.3.0
srun ./myexecutable
Note that if the -C flag is not used and you request --gres=gpu:1 or --gres=gpu:2, the job may land on K80 nodes, P100 nodes, or a mix of both.
Working Directory
Slurm allows setting the working directory of the batch script before it is executed. The path can be specified as a full path or as a path relative to the directory where the command is executed. If no working directory is specified, the current directory will be used.
--chdir=/hpcgpfs01/scratch/temp
Please note that input files might need to be copied to the new working directory.
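For example, assuming the scratch path used above and a hypothetical input file input.dat, you might stage the input and then submit:
% cp input.dat /hpcgpfs01/scratch/temp/
% sbatch myscript.sh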
#SBATCH Keywords
The following table lists recommended and useful #SBATCH keywords.
Short Format | Long Format | Default | Description
---|---|---|---
-N count | --nodes=count | One node will be used. | Allocate count nodes to your job.
-t HH:MM:SS | --time=HH:MM:SS | 00:30:00 | Always specify the maximum wallclock time for your job.
-p partition | --partition=partition | debug | Always specify your partition, which will usually be debug for testing and long for production runs. See "Queues and Policies."
N/A | --ntasks-per-node=count | 32 | Use count MPI tasks per node.
-c count | --cpus-per-task=count | 2 | Run count threads per MPI task (for an MPI/OpenMP hybrid code; to run in pure MPI mode, set OMP_NUM_THREADS to 1).
-J name | --job-name=name | Job script name | Job name: up to 15 printable, non-whitespace characters.
-A account | --account=account | Your default account | Charge this job to the IC account mXXX (necessary only if you have more than one IC account).
-e filename | --error=filename | slurm-<job_id>.out | Write STDERR to filename.
-o filename | --output=filename | slurm-<job_id>.out | Write STDOUT to filename. By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where "%j" is replaced with the job allocation number. See the -i option for filename pattern options.
-i filename | --input=filename | "/dev/null" is open on the batch script's standard input, and both standard output and standard error are directed to a file of the name "slurm-%j.out", where "%j" is the job ID. | Instruct SLURM to connect the batch script's standard input directly to the file name specified in the filename pattern. The pattern may contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g. %j for the job ID).
N/A | --mail-type=type | No mail is sent. | Valid event values are: BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL, REQUEUE, and STAGE_OUT), STAGE_OUT (burst buffer stage out completed), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit). Multiple type values may be specified in a comma-separated list. Mail notifications on job BEGIN, END, and FAIL apply to a job array as a whole rather than generating individual email messages for each task in the job array.
N/A | --mail-user=address | The submitting user | The user to be notified of the events selected with --mail-type.
-D directory_name | --chdir=directory_name | The directory from which the job was submitted | Set the working directory of the batch script to directory_name before it is executed. The path can be specified as a full path or as a path relative to the directory where the command is executed.
N/A | --export=ALL | This is on by default. | Export the current environment variables into the batch job environment. This is the default behavior.
All options may be specified either as sbatch command-line options or as #SBATCH directives in the batch script. Note: when both are used, command-line options override the corresponding options in the batch script.