What is a High Performance Cluster (HPC)?

A High Performance Cluster is a cluster of computers used to tackle large-scale analytical problems; essentially, it is a large collection of computing resources which can be utilised for scientific research. A diagram showing the various components of Grace can be seen on the following page.

Compute Nodes

Compute nodes can be thought of as powerful computers/servers which carry out the majority of the work on an HPC. Each compute node has its own Operating System (OS), Central Processing Unit (CPU), Memory (RAM) and Hard Drive (HD).

Head Node

The head node is essentially the brains behind the HPC. It is responsible for tracking and allocating computing resources to users.

Login Node

This is the node which controls user access and login.

Memory

Random Access Memory (RAM) is a form of temporary, high-speed data storage used in computers and servers.

Central Processing Unit (CPU)

This is the unit which executes software instructions. Grace currently uses Intel Xeon 6-core X5650 @ 2.66 GHz, Intel Xeon 8-core E5-2670 @ 2.60 GHz and Intel Xeon 10-core E5-2670 v2 @ 2.50 GHz CPUs.

Core/Slot

When multiple processing units are integrated into a single physical processor, each individual unit is referred to as a core/slot. Grace currently has a mixture of 6-, 8- and 10-core CPUs; each node has between 12 and 20 cores (2 CPUs per node).

Ethernet

Ethernet is a network communications protocol.

Infiniband (IB)

Quad data rate (QDR) InfiniBand is a low-latency interconnect and is the de facto standard for HPC installations. Grace's InfiniBand has a bandwidth of 40 gigabits per second. The other important aspect of HPC networking is latency, the time taken for a packet to travel from its source to its destination; InfiniBand has extremely low latency (about 1 microsecond).

Scheduling System

At the heart of the HPC is the software which manages the workload. Grace uses Platform LSF, part of the Platform HPC suite from IBM. Essentially, it is a program that attempts to balance utilisation across the available resources.

Queue

All LSF jobs run in queues. Queues have differing attributes which are matched to the jobs that run on them. These attributes can relate to things such as job run time, number of slots and amount of memory.

Job

A job is essentially your task, the code which you ask the HPC to execute.

Parallel

A parallel job is a task which makes use of more than one core/job slot simultaneously. Parallel jobs can be split into two further groups: Symmetric Multiprocessing (SMP), where memory on a single node is shared amongst the different threads of the job, and Message Passing Interface (MPI), which uses network communication to pass information between different portions of the job, allowing a job to run over multiple nodes.
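
To make the distinction concrete, below are two minimal sketches in C (the file names, variable names and values are illustrative, not taken from Grace's documentation). The first uses OpenMP threads which share the memory of a single node, the SMP model; the second uses MPI processes which exchange data explicitly over the network, which is what allows a job to span multiple nodes.

    /* SMP sketch: several threads share the memory of a single node.
     * Compile with, for example: gcc -fopenmp smp_sum.c -o smp_sum */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double sum = 0.0;

        /* All threads run on one node and combine their results
         * in the shared variable 'sum' via the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;

        printf("Partial harmonic sum (up to %d threads): %f\n",
               omp_get_max_threads(), sum);
        return 0;
    }

    /* MPI sketch: separate processes (ranks), possibly on different nodes,
     * exchange data explicitly over the network.
     * Compile with, for example: mpicc mpi_hello.c -o mpi_hello */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);               /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's ID         */
        MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */

        if (size >= 2) {
            if (rank == 0) {
                int value = 42;
                /* send a single integer to rank 1 across the interconnect */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                int value;
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("Rank 1 received %d from rank 0\n", value);
            }
        }

        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                       /* shut down the MPI runtime */
        return 0;
    }

In practice the number of OpenMP threads or MPI ranks is chosen when the job is submitted (for example via environment variables or the MPI launcher) rather than being hard-coded in the program.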

GPFS

General Parallel File System (GPFS) is a high-performance, shared-disk file system developed by IBM. It provides concurrent high-speed file access to applications executing on multiple nodes.