SLURM

At the heart of the cluster is the software that manages the workload. ada.uea.ac.uk uses SLURM, an open-source queueing system. Essentially, it is a program that balances use across the available resources by allocating jobs to appropriate compute nodes.

Basic SLURM commands

Command                                       Description
sbatch <job.sub>                              Submit your batch job to the queue
squeue -u <userid>                            Shows a list of your jobs
squeue -u <userid> -l                         Lists your jobs in more detail
squeue -u <userid> -t RUNNING                 Lists your running jobs
squeue -u <userid> -t PENDING                 Lists your pending jobs
squeue -u <userid> --start                    Shows the estimated start time of your jobs (only available if Slurm is configured to use the backfill scheduling plugin)
squeue -j <JOBID>                             Shows detailed information about your job (JOBID = job number)
scancel <JOBID>                               Kill your job. You can kill multiple jobs by using a comma separated list
scancel -t PENDING -u <userid>                Kill all your pending jobs
scancel -u <userid>                           Kill all your jobs
sstat --jobs=<JOBID>                          Statistics about a running job. By default this gives a lot of information, but it can be limited to the fields you want,
                                              e.g. sstat --format=AveCPU,AvePages,AveRSS,AveVMSize,JobID --jobs=<JOBID>
sacct --jobs=<JOBID>                          Information about a past job from the current day
sacct --jobs=<JOBID> --starttime=2019-03-12   Information about a past job from an earlier date,
                                              e.g. sacct --jobs=<JOBID> --starttime=2019-03-12 --format=jobname,nnodes,ncpus,maxrss,elapsed
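
The commands above are often chained together: submit a job, note its ID, check the queue, then look up the accounting record once it has finished. The following is a minimal sketch of that workflow, not a site-supported tool. The function names are made up for illustration, job.sub is a placeholder for your own submission script, and the sketch assumes sbatch prints its usual "Submitted batch job <JOBID>" confirmation line.

```shell
#!/bin/sh
# Sketch only: job.sub is a placeholder name for your own submission script.

# sbatch normally confirms a submission with "Submitted batch job <JOBID>";
# the job ID is the last word of that line.
job_id_from_sbatch() {
    printf '%s\n' "${1##* }"
}

# Submit a script, then show its entry in the queue.
# Usage: submit_and_show job.sub
submit_and_show() {
    confirmation=$(sbatch "$1") || return 1
    jobid=$(job_id_from_sbatch "$confirmation")
    echo "Submitted job $jobid"
    squeue -j "$jobid"
}

# Summarise a job from the current day once it has finished.
# Usage: job_report <JOBID>
job_report() {
    sacct --jobs="$1" --format=jobname,nnodes,ncpus,maxrss,elapsed
}
```

After sourcing these definitions on a login node, "submit_and_show job.sub" submits the script and prints its queue entry, and "job_report <JOBID>" prints the same accounting fields as the sacct example above.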

 

SBATCH commands

The SBATCH commands listed below should be added to your submission scripts. If you are in doubt as to which directives to use, please email the HPC support team.

Directive                           Description
#SBATCH --mail-type=BEGIN           Email when job starts (you don't need to supply your email)
#SBATCH --mail-type=END             Email when job finishes (you don't need to supply your email); for both, use --mail-type=BEGIN,END
#SBATCH -o output.out               Write output to output.out (use %j to include the JOBID, e.g. "output-%j.out")
#SBATCH -e error.err                Write errors to error.err (use %j to include the JOBID, e.g. "error-%j.err")
#SBATCH --open-mode=append          Append to the output and error files if they exist
#SBATCH --open-mode=truncate        Overwrite the output and error files if they exist
#SBATCH --exclusive                 Grants your job exclusive access to the node (check with HPC admin prior to using)
#SBATCH -n 16                       Request a number of tasks (16 in this case)
#SBATCH -J Jobname                  Job name
#SBATCH --array=1-10                Array job with 10 elements
#SBATCH --mem=4000                  Request a specific amount of memory per node for your job (MB); this also sets the job's memory limit
#SBATCH --mail-user=email-address   You could use this option to send email to an external account if you so wished
#SBATCH -p long-eth                 Submit your job to a specific queue (partition); long-eth in this example
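
As an illustration of how such directives sit in a submission script, here is a minimal sketch using standard sbatch options. The job name, file names and resource values are placeholders rather than recommended settings, and no queue is selected because the right one depends on your job. The job_summary function is a hypothetical helper, included only so the script does something visible.

```shell
#!/bin/bash
# Job name (placeholder).
#SBATCH -J example_job
# Number of tasks.
#SBATCH -n 1
# Memory per node, in MB.
#SBATCH --mem=4000
# Output and error files; %j expands to the job ID.
#SBATCH -o example-%j.out
#SBATCH -e example-%j.err
# Email when the job finishes.
#SBATCH --mail-type=END

# Inside a job SLURM sets SLURM_JOB_ID and SLURM_NTASKS; the fallbacks
# only matter when the script is run by hand outside the scheduler.
job_summary() {
    printf 'job=%s tasks=%s\n' "${SLURM_JOB_ID:-none}" "${SLURM_NTASKS:-1}"
}

job_summary
# Replace the line above with the program you actually want to run.
```

Saved as job.sub, this would be submitted with "sbatch job.sub". All #SBATCH lines must come before the first executable command in the script, otherwise sbatch ignores them.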