Choosing Requested Resources

When submitting a job to Slurm, you should pass arguments specifying the resources your job will need. These include, but are not limited to:

  • The number of tasks you need to run: --ntasks

  • The number of CPUs required for each task: --cpus-per-task

  • The number of requested nodes: --nodes (if your job is threaded, you should limit your requests to 1 node)

  • The amount of requested memory: --mem or --mem-per-cpu

  • The expected time needed to run the job: --time

  • The queue you wish to submit to: --partition

This is only a small subset of the available options! See https://slurm.schedmd.com/srun.html for a complete list.
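
For reference, a minimal batch script combining these options might look like the sketch below (the job name, resource amounts, and program name are placeholders, not recommendations):

    #!/bin/bash
    #SBATCH --job-name=example_job     # a hypothetical job name
    #SBATCH --partition=normal         # queue to submit to
    #SBATCH --nodes=1                  # number of requested nodes
    #SBATCH --ntasks=1                 # number of tasks to run
    #SBATCH --cpus-per-task=4          # CPUs for each task
    #SBATCH --mem=4G                   # total memory for the job
    #SBATCH --time=02:00:00            # expected run time (hh:mm:ss)

    ./my_program                       # placeholder for your actual command

Save this as a file (e.g. job.sh) and submit it with sbatch job.sh.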

Partitions / Job Queues

The nodes in the rcfcluster are grouped into several different “partitions” or “job queues”. Each partition has different time limits and permissions on who can submit jobs. The partitions available to all general cluster users are:

  • --partition=normal (Max Time Limit = 72 hours)

  • --partition=long (Max Time Limit = Unlimited). Please do not submit to the long queue unless your job actually requires > 72 hours to run

  • --partition=amd (Contains only nodes with AMD CPUs)

  • --partition=intel (Contains only nodes with Intel CPUs)

See the previous Choosing a Partition section for more info.
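
If you are unsure which partitions exist or what their time limits are, you can ask Slurm directly with sinfo; the output format string below is just one convenient choice:

    # Show each partition, its time limit, and the nodes it contains
    sinfo -o "%P %l %N"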

Software Constraints

Additionally, some software is only available on a subset of nodes. To restrict your job to nodes that satisfy these constraints, add the appropriate line to your batch script (a short example follows the list):

  • make: #SBATCH --constraint=devel

  • magma: #SBATCH --constraint=magma

  • Pari/GP: #SBATCH --constraint=pari-gp
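
For example, a job that needs magma might combine the constraint with its other requests, as in the sketch below (the resource values and command are placeholders):

    #!/bin/bash
    #SBATCH --constraint=magma     # run only on nodes that provide magma
    #SBATCH --ntasks=1
    #SBATCH --time=1:00:00

    magma my_script.m              # placeholder for your actual magma command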

Know your Code!

Unfortunately, there’s no obvious setting for what will work “best” for a given application. However, the better you understand what your code is doing, the better you can request the resources you’ll need to optimize its performance.

In general, you should first determine whether or not your job is “threaded”. A threaded application cannot run in parallel across multiple nodes: all of its CPUs must be on the same compute node. An example of such a process is one using Python’s “multiprocessing” module for parallelization. If you are using this module, your best bet is likely to request only 1 node, but multiple cores. You can specify that you need only 1 compute node with the --nodes=1 option, and that you want multiple CPUs with the --cpus-per-task=# option.
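
As a sketch, a request for such a threaded job might look like the following (the CPU count, memory, and script name are placeholders):

    #!/bin/bash
    #SBATCH --nodes=1              # threaded job: all CPUs on one node
    #SBATCH --ntasks=1             # a single task...
    #SBATCH --cpus-per-task=8      # ...with 8 CPUs available to it
    #SBATCH --mem=8G
    #SBATCH --time=4:00:00

    python my_script.py            # placeholder for your threaded program

Inside the script, the number of CPUs actually granted is available in the SLURM_CPUS_PER_TASK environment variable, which you can pass to multiprocessing rather than hard-coding a core count.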

Alternatively, if your job has the capability to communicate with CPUs on different nodes, you can omit the --nodes=1 option, or even use it to specify multiple nodes. For instance, --nodes=4 (or equivalently, -N 4) will request 4 compute nodes. An example of a non-threaded job is one that utilizes MPI.
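
For instance, an MPI job might request several nodes and launch its tasks with srun, as sketched below (the node and task counts and the program name are placeholders):

    #!/bin/bash
    #SBATCH --nodes=4              # spread the job across 4 compute nodes
    #SBATCH --ntasks-per-node=8    # 8 MPI ranks per node (32 in total)
    #SBATCH --time=12:00:00

    srun ./my_mpi_program          # srun launches one copy per task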

While it is not required to specify a time limit (with the --time option), not doing so will severely limit the scheduler’s ability to start queued jobs efficiently. Slurm uses a scheduling policy that generally executes jobs in the order they were submitted; however, it can also start a queued job early as long as doing so will not delay the expected start of an earlier-submitted job. Thus, specifying reasonable time limits increases scheduling efficiency for all users. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
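
For instance, all of the following are valid time requests (the values are arbitrary):

    #SBATCH --time=90              # 90 minutes
    #SBATCH --time=1:30:00         # 1 hour, 30 minutes
    #SBATCH --time=2-00:00:00      # 2 days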

Trial and Error

Feel free to experiment with your resource requests in order to find out what works best for your jobs (while not hogging all the resources from other users)! Of course, if your code is not written to utilize multiple threads, there is no use in requesting them; an overly demanding resource request will just delay your job’s start time while Slurm waits for resources to become available. However, for parameters such as the amount of memory, it is much harder to know from the start what your jobs will require. If you don’t specify how much memory you will require, with either the --mem or --mem-per-cpu flag, a default value of 500 MB per CPU will be assumed for your job; many jobs require less than this.
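
As a sketch, an explicit memory request looks like one of the following (the amounts are placeholders; use one flag or the other, not both):

    #SBATCH --mem=2G               # 2 GB total for the job
    #SBATCH --mem-per-cpu=500M     # or: 500 MB for each allocated CPU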

If you’re running multiple jobs of a similar type, it is recommended to submit the first one, monitor its performance on a node, and then use this information to tweak the requests for subsequent jobs.

For any job running on a single node, you can log on to the node while the job is running and use either the “top” or “ps” command to get a peek at the resources currently in use. Once a job has completed, you can also use the seff <jobID> command to get a nice summary of the resources that were actually used (vs. what was allocated). This can give you an idea of whether you were wasting/hogging resources, so you can correct it moving forward!
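
As an illustration, a typical monitoring workflow might look like this (the node name and job ID are hypothetical):

    # Find out which node your job is running on
    squeue -u $USER

    # Log on to that node and watch your processes live
    ssh node01
    top -u $USER

    # After the job finishes, compare used vs. allocated resources
    seff 123456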