Queues
As mentioned on the HPC basics page, Grace has a number of queues, which are defined by run length (short, medium, long) and by resource, i.e. whether or not they use Infiniband (ib). A full breakdown of each queue and its associated resources is given below.
Important
*Because jobs have been affected by rogue jobs consuming too much memory without declaring memory directives, a maximum memory usage limit of 5GB has been imposed on the ethernet queues: short, medium and long. If you wish to use more memory on those queues, please make an appropriate LSF memory resource request.
Please see the following page for details about using LSF memory resource directives.
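As an illustrative sketch only (the job name and executable are placeholders, and the memory units are assumed to be MB, consistent with the 190000 figure used for the 190GB huge-memory node below), a submission script requesting more than the 5GB default on an ethernet queue might look like this:

```shell
#!/bin/bash
# Hypothetical example: request 10GB of memory on the medium queue,
# exceeding the 5GB default cap by declaring it explicitly.
#BSUB -q medium              # ethernet queue with the 5GB default limit
#BSUB -J mem-demo            # job name (placeholder)
#BSUB -W 04:00               # run limit of 4 hours
#BSUB -R "rusage[mem=10000]" # reserve ~10GB (assumed MB units)
#BSUB -M 10000               # per-process memory limit (assumed MB units)

./my_program                 # placeholder for your executable
```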
Queue details
Infiniband Parallel Queues - Total Number of Slots
| Queue Name | Slots Available | Wall Time  | Slots Per Job | Priority |
|------------|-----------------|------------|---------------|----------|
| debug-ib   | 200             | 20 minutes | 168           | 20       |
| short-ib   | 844             | 24 hours   | 472           | 15       |
| medium-ib  | 740             | 24 hours   | 404           | 10       |
| long-ib    | 634             | 120 hours  | 336           | 5        |
Standard Ethernet Queues - Total Number of Slots
| Queue Name | Slots Available | Wall Time | Slots Per Job | Priority | Mem Limit |
|------------|-----------------|-----------|---------------|----------|-----------|
| short      | 1840            | 24 hours  | 1008          | 15       | 5GB*      |
| medium     | 1497            | 24 hours  | 874           | 10       | 5GB*      |
| long       | 1382            | 168 hours | 800           | 5        | 5GB*      |
Large Memory (48G) Ethernet Queues - Total Number of Slots
| Queue Name | Slots Available | Wall Time | Slots Per Job | Priority |
|------------|-----------------|-----------|---------------|----------|
| medium-lm  | 60              | 24 hours  | 60            | 10       |
| long-lm    | 60              | 168 hours | 48            | 5        |
The huge memory queue has 2 nodes associated with it: cn302, which has 128GB of memory, and cn169, which has 192GB of memory. If you wish to use the larger of the two nodes (cn169), you must include both `#BSUB -R "rusage[mem=190000]"` and `#BSUB -M 190000` in your submission script. This queue should always be used in exclusive mode, i.e. include `#BSUB -x` in your submission script.
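Putting those directives together, a minimal submission script targeting the larger huge-memory node (cn169) might look like this (the executable is a placeholder):

```shell
#!/bin/bash
#BSUB -q long-hm              # huge memory queue
#BSUB -x                      # exclusive mode, as required on this queue
#BSUB -R "rusage[mem=190000]" # reserve ~190GB so the job lands on cn169
#BSUB -M 190000               # memory limit matching the reservation

./my_program                  # placeholder for your executable
```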
Huge Memory Ethernet Queue - Total Number of Slots

| Queue Name | Slots Available | Wall Time | Slots Per Job | Priority |
|------------|-----------------|-----------|---------------|----------|
| long-hm    | 2               | 168 hours | 1             | 10       |
Interactive Queues - Total Number of Slots
| Queue Name     | Slots Available | Wall Time | Slots Per Job | Priority |
|----------------|-----------------|-----------|---------------|----------|
| interactive    | 230             | 12 hours  | 36            | 18       |
| interactive-lm | 12              | 12 hours  | 12            | 18       |
| interactive-hm | 16              | 12 hours  | 16            | 24       |
Starting a large or huge memory interactive session
- `Xinteractive -q interactive-lm`
- `interactive -q interactive-lm`
- `Xinteractive -q interactive-hm`
- `interactive -q interactive-hm`
Basic LSF commands relating to Queues and Jobs are listed below.
| Command         | Description                                                              |
|-----------------|--------------------------------------------------------------------------|
| bqueues         | Shows a list of available queues including jobs and slots (shown below)  |
| bjobs -u all    | Lists all of the jobs running on the cluster                             |
| bjobs -u all -x | Lists all of the jobs running on the cluster, expanded to show nodes     |
| bjobs           | Lists all of the jobs which you have running                             |
```
[username@login00 ~]$ bqueues
QUEUE_NAME      PRIO STATUS       MAX  JL/U JL/P JL/H NJOBS PEND  RUN  SUSP
interactive-lm  24   Open:Active  12   12   -    -    0     0     0    0
interactive-hm  24   Open:Active  16   16   -    -    0     0     0    0
interactive     22   Open:Active  230  36   -    -    1     0     1    0
debug-ib        20   Open:Active  200  168  -    -    0     0     0    0
short           20   Open:Active  1840 1008 -    -    0     0     0    0
short-ib        15   Open:Active  844  472  -    -    0     0     0    0
medium-lm       15   Open:Active  60   60   -    -    0     0     0    0
medium-ib       10   Open:Active  740  404  -    -    0     0     0    0
long-lm         10   Open:Active  60   48   -    -    1     0     1    0
long-hm         10   Open:Active  2    2    -    -    0     0     0    0
medium          10   Open:Active  1497 874  -    -    50    0     50   0
long-ib         5    Open:Active  634  336  -    -    64    0     64   0
long            5    Open:Active  1382 800  -    -    217   0     217  0
gpu             5    Open:Inact   12   2    -    -    0     0     0    0
```
Understanding bqueues output
- Open:Active - the queue is able to accept jobs
- Open:Inact - accepts new jobs but holds them; queued jobs are not started
- Closed:Active - does not accept new jobs, but continues to start jobs held in queue
- Closed:Inact - does not accept new jobs and does not start jobs held in queue
- PRIO - priority of the queue. The higher the value, the higher the priority
- MAX - maximum number of job slots available in the queue
- JL/U - maximum number of job slots each user can use for jobs in the queue
- NJOBS - total number of job slots currently held in the queue (pending, running and suspended)
- PEND - number of job slots used by pending jobs in the queue
- RUN - number of job slots used by running jobs in the queue
- SUSP - number of job slots used by suspended jobs in the queue
Queue priority
Each of the queues listed above has been given a priority to ensure fair-share scheduling. The shorter run-time queues have a higher priority than the longer run-time queues, so that shorter jobs complete sooner. For example, if one job is pending in the short queue and another in the medium queue, the job in the short queue will be scheduled to run first. So before submitting a job, make sure you select the most appropriate queue; choosing poorly can leave you waiting longer than expected and leads to inefficient scheduling.
Queue back fill and reserved jobs
For large parallel jobs, LSF reserves job slots until all the resources required for the job to start are available; these slots have the status "reserved" until the job enters the "run" state. If it takes a long time to accumulate all the required resources, computational resources can sit idle and under-utilised. The "backfill" feature improves utilisation: it allows smaller jobs to make use of reserved job slots, provided doing so will not delay the start of the large parallel job. Therefore, if you have a small job, ensure you provide its run limit with the -W switch. A word of warning: if your job exceeds this limit, it will be killed!
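For example, a small job that declares its run limit, making it a candidate for backfilling into reserved slots, might be submitted with a script like the following sketch (slot count, queue and executable are illustrative placeholders):

```shell
#!/bin/bash
#BSUB -q short   # short ethernet queue
#BSUB -n 4       # small slot count, suitable for backfill
#BSUB -W 00:30   # hard run limit of 30 minutes; lets the scheduler
                 # backfill this job, which is killed if it runs over

./my_program     # placeholder for your executable
```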