17/02/2017

We are pleased to confirm that the HPC cluster is once again available.
 
Over the last 2.5 days we have carried out the following upgrades, which we feel were necessary from both a security and a performance perspective:
  • We have upgraded the current login node to a newer operating system and kernel, providing security and performance benefits
  • We have introduced a secondary login node to provide more HPC login resilience
  • We have upgraded the current 150 servers/compute nodes to a newer operating system and kernel, again providing security and performance benefits
  • We have upgraded 4 of the 8 GPU servers/nodes with a newer operating system and kernel, providing security and performance benefits
  • We have compiled a new cuda module to provide access to the latest cuda-8 toolkit/environment and updated the associated cuda driver (more details to follow) – Please don’t use this until advised!
  • We have installed, imaged, and configured an additional 104 nodes/servers into the HPC environment, split as follows:
a. 60 x Ethernet servers/nodes running on Broadwell CPU architecture, with 64GB of DDR4 memory and 16 CPU cores per node
b. 36 x Infiniband servers/nodes running on Broadwell CPU architecture, with 128GB of DDR4 memory and 24 CPU cores per node – these IB nodes will utilise the latest IB interconnect (FDR and associated fabric), providing 56Gb/s IB performance
c. Installed and configured a new Infiniband 56Gb/s FDR network fabric to provide high-speed interconnect/networking to the IB nodes
d. 2 x huge-memory servers/nodes running on Broadwell CPU architecture, with 512GB of DDR4 memory and 16 CPU cores per node
e. 2 x huge-memory servers/nodes running on Broadwell CPU architecture, with 512GB of DDR4 memory and 16 CPU cores per node, for a group of HPC/Bio users who invested in HPC equipment for their own dedicated usage
f. This upgrade provides you with an additional 1680 CPU cores
 
Note:
We have tested the above extensively over the last 24 hours or so, looking at many of the standard applications. However, given the number of operating system and kernel changes we have made across so many systems, it isn't beyond reason that there might be the odd issue with the odd application.

 

CUDA/GPU users

As you might know, we currently have 8 GPU systems, all of which rely on cuda-7.5. As part of this upgrade we have taken 4 of those GPU nodes (g0005..g0008) out of that queue and updated them (as mentioned above). We will likely contact some of you over the next few days to ask you to do some testing for us, utilising the newer O/S, kernel, and cuda-8 environment.

Note that submitting to the current GPU queue is absolutely fine, as we still have 4 GPU nodes (g0001..g0004) which we have left as they were until we finalise testing.
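
For those of you we approach to help test, a quick sanity check of the cuda-8 environment on the updated nodes might look something like the sketch below. This is an illustrative C program against the CUDA runtime API only; the build line is an assumption (it presumes the cuda-8 toolkit/module is on your path once we advise it is ready), not a confirmed recipe.

    /*
     * cuda_check.c - minimal, illustrative sanity check: report the CUDA
     * driver/runtime versions and the GPUs visible on the node.
     * Assumed build line (once advised the cuda-8 module is ready):
     *   nvcc -o cuda_check cuda_check.c
     */
    #include <stdio.h>
    #include <cuda_runtime_api.h>

    int main(void)
    {
        int driverVer = 0, runtimeVer = 0, count = 0;

        /* Versions actually present on the node (8000 = CUDA 8.0). */
        cudaDriverGetVersion(&driverVer);
        cudaRuntimeGetVersion(&runtimeVer);
        printf("CUDA driver version:  %d\n", driverVer);
        printf("CUDA runtime version: %d\n", runtimeVer);

        /* List the GPUs the runtime can see. */
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            printf("No CUDA-capable devices visible\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s (compute capability %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }
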
Infiniband users – for the new equipment (mentioned above)
Again, we will be working with you over the next week or so to begin the process of doing more user testing. With the influx of the new (FDR Mellanox 56Gb/s) fabric, some software will need recompiling; Open MPI-compiled software is an obvious example which springs to mind. Please note that this doesn't prevent you from running IB jobs as before; it just means that you might not take advantage of the new resource immediately.
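
As a concrete illustration of what "recompiling" means here, the trivial MPI program below is the sort of thing that would simply be rebuilt with the mpicc wrapper from an Open MPI installation built against the new FDR fabric; the source itself does not change, and the exact module/wrapper names are assumptions until we confirm them with you.

    /*
     * mpi_hello.c - trivial MPI program, used only to illustrate the
     * recompile step; the code is unchanged, it just needs rebuilding
     * against an Open MPI built for the new FDR interconnect.
     * Assumed rebuild/run lines (names to be confirmed):
     *   mpicc -o mpi_hello mpi_hello.c
     *   mpirun -np 24 ./mpi_hello
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank = 0, size = 0, namelen = 0;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &namelen);

        /* Each rank reports where it landed - a quick way to confirm
         * the job really is spread across the new IB nodes. */
        printf("rank %d of %d on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }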
 
Huge-memory nodes/queue (huge-mem)
Details will follow shortly about the usage of these nodes and the associated queue.

Thank you for your patience. If you have any questions, please email hpc.admin@uea.ac.uk.