Each standard Ethernet node on the cluster has 64GB of memory installed, and each IB and GPU node has 128GB. Some of this memory is used by the operating system for general system tasks.
In principle, each slot on each node is allocated 2GB of memory. Memory requirements differ between individuals and jobs, which is why we ask you to be as specific as possible about memory usage when submitting a job.
We strongly recommend that, before submitting multiple jobs, you run a benchmarking job to measure approximate memory usage and use the result to inform the memory resource request for your subsequent jobs.
Memory usage and swap space
As well as physical memory, each node has a certain amount of virtual memory, or swap space, available. Swap space is an area of the hard disk drive that the operating system uses as additional memory; in normal circumstances only a small amount of it is used. When a job requires more memory than is physically available, it starts using swap space. Because swap space is on the hard disk drive, it is significantly slower than real memory. Generally, once a job starts swapping it slows to the point where it is no longer making progress, and it is best to terminate the job before the node becomes unstable.
First time runs (LM or -x)
When submitting jobs to LSF, we recommend being as specific as possible about the job's resource requirements, including memory. Memory usage can be difficult to estimate, because it depends on the application used, the problem size and the data type.
When running a job for the first time, you may think your task needs more than 2GB of memory but have no more accurate estimate (from calculating the expected requirements, or from comparison with a similar HPC task or a similar task running on your own desktop PC). In that case, we advise running your task in one of the following ways:
- Run in an exclusive interactive session, or use #BSUB -x in your job submission script (24G)
- Run on a standard queue, but include a conservative memory resource request and memory limit.
Reviewing memory requirements
One way of identifying the memory requirements of a task is to look at how much memory similar tasks have required. Job output logs include details of resource usage, including the maximum memory used during the job run. For example, the following output is taken from a job output log file:
Resource usage summary:
CPU time : 1921.10 sec.
Max Memory : 30598 MB
Max Swap : 37025 MB
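As a rough illustration of turning a previous run into a memory request, the sketch below reads the Max Memory figure from a log excerpt like the one above and adds some headroom. The function names, the 20% margin and the rounding to the next 100 MB are assumptions made for this example, not site policy.

```python
import math
import re


def max_memory_mb(log_text):
    """Return the 'Max Memory' figure in MB from a job output log, or None if absent."""
    match = re.search(r"Max Memory\s*:\s*(\d+(?:\.\d+)?)\s*MB", log_text)
    return float(match.group(1)) if match else None


def suggested_request_mb(peak_mb, headroom=0.2):
    """Add headroom (an assumed safety margin) and round up to the next 100 MB."""
    padded = peak_mb * (1 + headroom)
    return math.ceil(padded / 100) * 100


# Log excerpt copied from the example above.
sample_log = """Resource usage summary:
CPU time : 1921.10 sec.
Max Memory : 30598 MB
Max Swap : 37025 MB
"""

peak = max_memory_mb(sample_log)
print(f"peak {peak:.0f} MB, suggested request {suggested_request_mb(peak)} MB")
```

The suggested value is only a starting point; a job whose memory use varies with its input may need a larger margin.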
While a task is running, you can see how much memory it is using by running bjobs -l JobID and looking at the "Resource usage collected" section:
Tue Jan 29 14:10:40: Resource usage collected.
The CPU time used is 89 seconds.
MEM: 5 Gbytes; SWAP: 6 Gbytes; NTHREAD:6
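If you want to record these figures over time, the MEM and SWAP values can be pulled out of the text with a small script. This is a minimal sketch that assumes the output has the same layout as the excerpt above; the function name is hypothetical, and the units are returned exactly as reported rather than converted.

```python
import re


def running_usage(bjobs_text):
    """Extract MEM and SWAP as (value, unit) pairs from 'bjobs -l' style text.

    Fields that do not appear in the text are left out of the result.
    """
    usage = {}
    for field in ("MEM", "SWAP"):
        match = re.search(rf"\b{field}:\s*(\d+(?:\.\d+)?)\s*([KMG]bytes)", bjobs_text)
        if match:
            usage[field] = (float(match.group(1)), match.group(2))
    return usage


# Excerpt copied from the example above.
sample = """Tue Jan 29 14:10:40: Resource usage collected.
The CPU time used is 89 seconds.
MEM: 5 Gbytes; SWAP: 6 Gbytes; NTHREAD:6
"""

print(running_usage(sample))
```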
For specific software applications, there are some useful profiling tools:
- R - profvis, which is built into RStudio and helps users developing code to profile memory requirements: https://rstudio.github.io/profvis/
- Python - guppy for memory profiling
- Java - JConsole: https://docs.oracle.com/javase/8/docs/technotes/guides/management/jconsole.html
- MATLAB - the memory function is only available on Windows, but may be useful if you plan to port your code to HPC: https://uk.mathworks.com/help/matlab/ref/memory.html
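For Python code, a quick first estimate can also be obtained without installing anything. The sketch below uses tracemalloc from the Python standard library rather than guppy, and it only counts memory allocated through Python itself, so treat the figure as a lower bound on what the whole process will need. The helper and the build_table workload are hypothetical names used for illustration.

```python
import tracemalloc


def peak_python_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak MB allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


def build_table(n):
    # Hypothetical workload standing in for a real task.
    return [i * i for i in range(n)]


table, peak_mb = peak_python_memory_mb(build_table, 200_000)
print(f"{len(table)} items, peak about {peak_mb:.1f} MB")
```

Running a scaled-down version of a task this way can help pick a first memory request before benchmarking the full job.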
Requesting memory (rusage and -M)
Best practice when submitting a job is to request the amount of memory it requires, so that appropriate resources are available to it. With a memory resource request, the job scheduler can allocate your task to a compute node that has the memory available and ensure your job is not competing with another job for it.
To request the required memory for your job in a job submission script, use something similar to that given below, which will allocate 4000 MB (or 4 GB) of RAM for your job.
#BSUB -R "rusage[mem=4000]"
Or from the command line:
bsub -R "rusage[mem=4000]"
To ensure that your job does not exceed a specific memory usage and potentially cause instability for other jobs on the host, you can include a memory limit at which the job is automatically terminated. The example below kills the job if its memory usage reaches 4000 MB (4 GB). We recommend setting the memory limit equal to the memory resource request, so that the job is killed once its usage reaches the amount requested.
#BSUB -M 4000
Or from the command line:
bsub -M 4000
The following message will be in the job output file when a job is terminated for using too much memory:
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
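Because the request and the limit are normally set to the same value, it can be convenient to generate both directives from a single number. The helper below is a hypothetical sketch that simply formats the two #BSUB lines shown above; it assumes the value is given in MB, as in the examples in this section.

```python
def memory_directives(mem_mb):
    """Return #BSUB lines that request mem_mb and set the memory limit to the same value."""
    if mem_mb <= 0:
        raise ValueError("memory request must be a positive number of MB")
    return [
        f'#BSUB -R "rusage[mem={mem_mb}]"',  # memory resource request
        f"#BSUB -M {mem_mb}",                # memory limit at which the job is killed
    ]


print("\n".join(memory_directives(4000)))
```

The printed lines can be pasted near the top of a job submission script.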
Requesting additional memory for an interactive job
The following example starts an interactive session with a 6GB memory resource request and a 6GB memory limit, at which point the session is terminated:
interactive -R "rusage[mem=6000]" -M 6000
Requesting too much memory
It is important to make sure your memory resource request is appropriate; otherwise your job may not be allocated. For example, requesting too much memory (e.g. more than is available on a host) may leave your job pending for a long period because no node can satisfy the request. If you have included a memory resource request and your submitted job remains pending, check the pending reason with bjobs -l JobID; the following message indicates that the resource requirement cannot currently be met by any host.
Job requirements for reserving resource (mem) not satisfied: 1 host;
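A simple sanity check before submitting is to compare the request with the memory installed on each node type. The sketch below uses the 64GB and 128GB figures given at the start of this page; the node type labels are hypothetical, 1 GB is treated as 1000 MB to match the examples above, and the memory actually available will be lower because the operating system uses some of it.

```python
# Installed memory per node type in MB (labels are illustrative, not LSF host names).
NODE_MEMORY_MB = {
    "standard": 64_000,
    "ib_gpu": 128_000,
}


def node_types_that_could_fit(request_mb):
    """Return node types whose installed memory exceeds the request.

    A strict comparison is used because some memory is always taken by the OS.
    """
    return [name for name, total in NODE_MEMORY_MB.items() if request_mb < total]


for request in (4_000, 100_000, 200_000):
    fits = node_types_that_could_fit(request)
    print(request, "MB ->", fits or "no node type can satisfy this request")
```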
Parallel and Array
When requesting memory for parallel or array tasks, please consider the following points:
- In an array job, each element of the array is treated independently, which means the memory resource request and memory limit apply to each element separately. For example, if an array job of 10 elements is submitted with a memory limit of 4GB, each element continues running while its own memory usage is below 4GB. Any element that exceeds 4GB is terminated, leaving the other elements to continue running.
- In a parallel job, the memory resource request and memory limit apply to the task as a whole. For example, a parallel job with 8 tasks submitted with a memory limit of 4GB is terminated when the combined memory usage exceeds 4GB (for example, an average of 500MB per task).
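The arithmetic behind the parallel case can be sketched as follows, assuming the limit applies to the combined usage of all tasks as described above. Both function names are hypothetical.

```python
def per_task_budget_mb(limit_mb, tasks):
    """Average memory each task can use before a combined limit is reached."""
    return limit_mb / tasks


def parallel_limit_mb(per_task_mb, tasks):
    """Combined limit needed if every task may use up to per_task_mb."""
    return per_task_mb * tasks


# Parallel job: 8 tasks sharing a 4000 MB limit average 500 MB each.
print(per_task_budget_mb(4000, 8))

# To allow each of the 8 tasks 4000 MB, the combined limit must be 32000 MB.
print(parallel_limit_mb(4000, 8))
```

For an array job no such scaling is needed, since the limit is applied to each element on its own.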