As mentioned on the HPC basics page, Grace has a number of queues, which are defined by run length (short, medium, long) and by resource, i.e. whether they use Infiniband (ib) or not. A full breakdown of each queue and its associated resources is given below.

Important

*As a result of jobs using excessive amounts of memory without specifying memory directives, and thereby affecting other users' jobs, a maximum memory limit of 5GB has been imposed on the Ethernet queues (short, medium and long). If you wish to use more memory on those queues, please use an appropriate LSF memory resource request.

Please see the following page for details about using LSF memory resource directives. 
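As an illustrative sketch (the queue name, memory values and program name are placeholders, and the memory units are assumed to be MB, which depends on the cluster's LSF configuration), a submission script requesting more than the 5GB default might contain:

```shell
#!/bin/bash
# Illustrative submission script requesting ~8GB of memory on an
# Ethernet queue. All values and names here are placeholders; the
# memory units (MB) are an assumption and depend on the cluster's
# LSF configuration (LSB_UNIT_FOR_LIMITS).
#BSUB -q medium                 # target queue
#BSUB -o output.%J              # output file (%J expands to the job ID)
#BSUB -R "rusage[mem=8000]"     # ask the scheduler to reserve ~8GB
#BSUB -M 8000                   # memory limit for the job

./my_program
```

The -R directive reserves the memory at scheduling time, while -M enforces a limit on the running job; using both together is the usual pattern.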

 

Queue details

Infiniband Parallel Queues - Total Number of Slots

Queue Name   Slots Available   Wall Time    Slots Per Job   Priority
debug-ib     200               20 minutes   168             20
short-ib     844               24 hours     472             15
medium-ib    740               24 hours     404             10
long-ib      634               120 hours    336             5

 

Standard Ethernet Queues - Total Number of Slots

Queue Name   Slots Available   Wall Time   Slots Per Job   Priority   Mem Limit
short        1840              24 hours    1008            15         5GB*
medium       1497              24 hours    874             10         5GB*
long         1382              168 hours   800             5          5GB*

 

Large Memory (48G) Ethernet Queues - Total Number of Slots

Queue Name   Slots Available   Wall Time   Slots Per Job   Priority
medium-lm    60                24 hours    60              10
long-lm      60                168 hours   48              5

 

The huge memory queue has two nodes associated with it: cn302, which has 128GB of memory, and cn169, which has 192GB of memory. If you wish to use the larger of the nodes (cn169), you must include the directives #BSUB -R "rusage[mem=190000]" and #BSUB -M 190000 in your submission script. This queue should always be used in exclusive mode, i.e. include #BSUB -x in your submission script.

Queue Name   Slots Available   Wall Time   Slots Per Job   Priority
long-hm      2                 168 hours   1               10
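Putting the directives above together, a huge memory submission script targeting cn169 might look like the following sketch (the job output file and program name are placeholders):

```shell
#!/bin/bash
# Illustrative huge memory submission script; the program and output
# file names are placeholders. The -R/-M values are those given above
# for the larger node, cn169.
#BSUB -q long-hm                  # huge memory queue
#BSUB -x                          # exclusive use of the node
#BSUB -R "rusage[mem=190000]"     # reserve ~190GB so the job is placed on cn169
#BSUB -M 190000                   # matching memory limit
#BSUB -o hm_output.%J             # output file (%J expands to the job ID)

./my_large_memory_program
```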

 

Interactive Queues - Total Number of Slots

Queue Name       Slots Available   Wall Time   Slots Per Job   Priority
interactive      230               12 hours    36              18
interactive-lm   12                12 hours    12              18
interactive-hm   16                12 hours    16              24

 

Starting a large or huge memory interactive session

  • Xinteractive -q interactive-lm
  • interactive -q interactive-lm
  • Xinteractive -q interactive-hm
  • interactive -q interactive-hm

 

Basic LSF commands relating to queues and jobs are listed below.

Command           Description
bqueues           Shows a list of available queues including jobs and slots (shown below)
bjobs -u all      Lists all of the jobs running on the cluster
bjobs -u all -x   Lists all of the jobs running on the cluster, expanded to show nodes
bjobs             Lists all of the jobs which you have running

 

[username@login00 ~]$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
interactive-lm   24  Open:Active      12   12    -    -     0     0     0     0
interactive-hm   24  Open:Active      16   16    -    -     0     0     0     0
interactive      22  Open:Active     230   36    -    -     1     0     1     0
debug-ib         20  Open:Active     200  168    -    -     0     0     0     0
short            20  Open:Active    1840 1008    -    -     0     0     0     0
short-ib         15  Open:Active     844  472    -    -     0     0     0     0
medium-lm        15  Open:Active      60   60    -    -     0     0     0     0
medium-ib        10  Open:Active     740  404    -    -     0     0     0     0
long-lm          10  Open:Active      60   48    -    -     1     0     1     0
long-hm          10  Open:Active       2    2    -    -     0     0     0     0
medium           10  Open:Active    1497  874    -    -    50     0    50     0
long-ib           5  Open:Active     634  336    -    -    64     0    64     0
long              5  Open:Active    1382  800    -    -   217     0   217     0
gpu               5  Open:Inact       12    2    -    -     0     0     0     0

Understanding bqueues output

  • Open:Active - the queue is able to accept jobs
  • Open:Inact - accepts new jobs but holds them in the queue without starting them
  • Closed:Active - does not accept new jobs, but continues to start jobs held in queue
  • Closed:Inact - does not accept new jobs and does not start jobs held in queue
  • PRIO - priority of the queue. The higher the value, the higher the priority
  • MAX - maximum number of job slots available in the queue
  • JL/U - maximum number of job slots each user can use for jobs in the queue
  • NJOBS - total number of job slots currently held by jobs in the queue
  • PEND - number of job slots used by pending jobs in the queue
  • RUN - number of job slots used by running jobs in the queue
  • SUSP - number of job slots used by suspended jobs in the queue

 

Queue priority

Each of the queues listed above has been given a priority to ensure fair share scheduling. The shorter run time queues have a higher priority than the longer run time queues so that shorter jobs complete more quickly. For example, if one job is pending in the short queue and another in the medium queue, the job in the short queue will be scheduled to run first. So before submitting jobs, ensure the most appropriate queue is selected; otherwise you could end up waiting longer than expected, as well as contributing to inefficient scheduling.
Queue backfill and reserved jobs

For large parallel jobs, LSF reserves job slots until all resources are available for the job to start, so those slots have the status "reserved" until the job moves to the "run" state. However, if it takes some time to reserve all the required resources for the large parallel job, this can leave computational resources under-utilised. The "backfill" feature improves utilisation by allowing smaller jobs to use reserved job slots, provided this will not delay the start of the large parallel job. Therefore, if you have a small job, ensure you provide its run limit with the -W switch. A word of warning: if your job exceeds this limit, it will be killed!
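For example, a small job might declare its run limit at submission time so the scheduler can consider it for backfill (the queue name, slot count and program name here are illustrative):

```shell
# Submit a small 4-slot job with an explicit 30-minute run limit so that
# LSF can consider it for backfill into reserved slots. Remember: if the
# job exceeds the -W limit, it will be killed.
bsub -q short -n 4 -W 00:30 -o backfill_job.%J ./my_small_job
```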