Slurm Scheduling System
The Slurm system manages the department batch queue. Slurm runs jobs on the departmental and research compute nodes.
The CS Slurm setup organizes compute resources into partitions.
The compsci partition contains all the nodes with access to the department NFS filesystem, where most user home and project directories live. If unsure, use the compsci partition. The GPU hosts are in the compsci-gpu partition; please do not submit general (non-GPU) batch jobs there, as those nodes are reserved for GPU computing.
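For example, a job can be directed to a specific partition with the -p flag (job.sh is a placeholder for your own batch script):
sbatch -p compsci job.sh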
Additional partitions exist to hold computers owned by specific research groups.
Donald Lab users can use the grisman partition for priority access to the grisman cluster.
Sam Wiseman’s group can use the nlplab or wiseman partitions for priority access.
Michael Reiter’s group can use the skynet partition for priority access to the skynet cluster.
Lisa Will’s group can use the wills partition for priority access to the wills cluster.
Bhuwan Dhingra’s group can use the nlplab or bhuwan partitions for priority access.
Cynthia Rudin’s group can use the rudin partition for priority access.
In order to submit to the research partitions, users can specify the account to use:
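For example (the grisman account and partition below are only illustrative; substitute your own group's account and partition):
sbatch -A grisman -p grisman job.sh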
or switch their default account:
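A sketch using sacctmgr (the account name is again illustrative, and changing the default account may require assistance from the lab staff):
sacctmgr modify user where name=yournetid set DefaultAccount=grisman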
All interaction with the queuing system must be done from one of the cluster head nodes. To access the head nodes, ssh to login.cs.duke.edu using your NetID and NetID password.
For the basics of Slurm operation, please refer to the documentation on the Slurm website.
Job scripts
All jobs submitted to Slurm must be shell scripts, and must be submitted from one of the cluster head nodes. Slurm will scan the script text for option flags; the same flags can be given on the sbatch or srun command line or embedded in the script. Lines in the script beginning with #SBATCH will be interpreted as containing Slurm flags.
The following job runs the program hostname. The script passes Slurm the -D flag to run the job in the current working directory where sbatch was executed. This is the equivalent of running: sbatch -D . job.sh.
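A minimal sketch of such a script (the exact contents of the original example may differ):
#!/bin/bash
#SBATCH -D .
hostname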
For more examples, please see the cluster talk repository.
Defaults
By default, each job gets a time limit of 4 days and 30G of memory per node. If you need more, you will need to specify it in the parameters of the batch script.
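For example, the following #SBATCH lines request a longer limit and more memory (the values are illustrative; adjust them to your job's needs):
#SBATCH --time=7-00:00:00
#SBATCH --mem=64G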
SSH access to the nodes
If you have a running job on a node, you can ssh into that host. The login session will be adopted into the Slurm session, meaning that the resource limits specified in the job will also be in effect for the login session. See the Login page for more details.
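For example, find the node your job is running on with squeue and then ssh to it (the node name linux41 is illustrative):
squeue -u user
ssh linux41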
Examples
Below are some examples of common commands used with Slurm. For a more detailed set of commands, please refer to the Slurm command summary or the main Slurm documentation.
List all jobs in the queue
squeue
List jobs belonging to a user
squeue -u user
List a user's running jobs
squeue -u user -t RUNNING
List compute partitions
sinfo
List compute nodes
sinfo -N
Show a node’s resource attributes
sinfo -Nel
Submit a job
sbatch script.sh
Interactive session on a GPU host
srun -p compsci-gpu --gres=gpu:1 --pty bash -i
Detailed job information
scontrol show job -dd jobid
Using an OR constraint to select between multiple GPUs
sbatch -p compsci-gpu --constraint="2080rtx|k80" job.sh
Request a specific type and number of GPU(s)
sbatch -p compsci-gpu --gres=gpu:2080rtx:1 job.sh
Direct a job to node linux41, requesting 10G of memory per GPU, and set the GPU_SET environment variable to 2 for the program to use
export GPU_SET=2; sbatch -w linux41 --mem-per-gpu=10g -p compsci-gpu job.sh
Delete a job
scancel jobid
Here is a sample script that will run in the compsci partition.
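The following is a minimal sketch (the job name, output file, and resource values are illustrative):
#!/bin/bash
#SBATCH -p compsci
#SBATCH -D .
#SBATCH --job-name=sample
#SBATCH --output=sample-%j.out
#SBATCH --mem=4G
#SBATCH --time=01:00:00
hostname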