Software Development   

This page documents the various aspects of scientific software development, with an emphasis on code optimisation and parallel communication. The GRACE cluster is more powerful than previous clusters and more energy efficient (in terms of GFLOPS/W), but codes, particularly ones written by users, need to exploit this efficiently. Inefficient code has the potential to negate the efficiency gains made by the new cluster. The scientific computational infrastructure exists to run codes, so it is imperative that they are written with efficiency in mind.


Compiling codes

The GRACE cluster provides three popular compilers for code development:

  • GNU Compiler Collection (GCC), which includes C/C++ and Fortran compilers, versions 4.1.2, 4.5.2 and 4.6.1;
  • Intel C/C++ and Fortran compilers, version 11.1;
  • The Portland Group, Inc. (PGI) C/C++ and Fortran compilers, versions 7.0.7, 10.9 and 12.5 (use the latest version unless an older version is specifically required).

These three compilers were benchmarked with a wide range of codes. The Intel compiler outperformed the others, PGI came second, and the GNU compiler lagged considerably behind both. Please see this document for further information on the benchmarks. Compiler performance may well be code dependent, so please test all three compilers with your own code.

The compiler commands and module files for the GNU, PGI and Intel compilers are:

Compiler   Fortran 77 command   Fortran 90 command   C command   C++ command   Module file
GNU        g77                  gfortran             gcc         g++           gcc/4.6.1
PGI        pgf77                pgf90                pgcc        pgCC          pgi/12.5
Intel      ifort                ifort                icc         icpc          icc/intel/12.1

The Intel compiler is recommended, followed by the PGI compiler and finally GNU. The documentation for the Intel and PGI compilers is listed in the table below; please click on the relevant link to view the compiler documentation.











The documentation above also lists compiler switches that can be used for automatic optimisation.

To compile C code, use the command (for C++ code, use icpc in place of icc):

icc -I<path to include files> -c program.c -o program.o

To compile Fortran code, use the command:

ifort -c program.f -o program.o

To then link your code against a shared or static library (for C/C++ and Fortran programs), give the full path to the library file:

ifort program.o /usr/lib/<library file> -o program

Or, if you have a directory containing a set of libraries, you can specify the directory once and name each library with its own switch:

ifort program.o -L/usr/lib -llib1 -llib2 -o program

where the switch -L<dir> adds <dir> to the library search path and the switch -l<lib> links against the library <lib> (the file lib<lib>.so or lib<lib>.a). Please replace icc and ifort with the commands for your preferred compiler: gcc or gfortran for GNU, and pgcc or pgf90 for PGI.


Code optimisation

When developing code, it is important to consider performance. Code optimisation, i.e. mapping code as closely as possible to the underlying computational infrastructure, can yield significant reductions in run time, resulting in greater efficiency and a lower carbon footprint. Please click here for a presentation on code optimisation.
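As one illustration of mapping code to the hardware, the sketch below sums the same matrix in two loop orders. It assumes a square matrix stored row by row in a flat array; the function names sum_row_order and sum_column_order are hypothetical and chosen only for this example. Both functions return the same value, but the first reads memory contiguously and is therefore usually faster for large matrices.

```c
#include <stddef.h>

/* Sum an n x n matrix stored row by row (C's native layout).
 * The inner loop walks adjacent memory, so each cache line that is
 * fetched is fully used before the next one is needed. */
double sum_row_order(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            total += a[i * n + j];
    return total;
}

/* Same arithmetic, but the inner loop jumps n elements at a time.
 * For large n this usually causes far more cache misses. */
double sum_column_order(const double *a, size_t n)
{
    double total = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            total += a[i * n + j];
    return total;
}
```

Fortran stores arrays column by column, so the favourable loop order there is the reverse: the first index should vary fastest. Timing both versions on your own data is the only reliable way to see how much the ordering matters for a given code.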


Code debugging

The compilers listed above each provide their own debugger. For further information on using the debugger for your compiler, click on the links below.








Code profiling

Code profiling allows developers to analyse their codes dynamically, showing which parts are executed and how much time is spent in each block. This lets developers focus their optimisation efforts on the computationally intensive parts of their code. There are two approaches to code profiling:

  • Statistical profilers

  • Instrumenting profilers.

It is generally recommended that users start with statistical profiling and advance to the more complex instrumenting type for further analysis. Statistical profilers are less intrusive than instrumenting ones, but the latter provide more control and more information, such as TLB and cache misses. The GNU profiler, gprof, is fairly easy to use and is recommended for new users. PGI also provides its own profiler; unfortunately, the GRACE cluster does not have the Intel profiler, VTune. Please see the documentation below for further information on the two available profilers.

PGI profiler

GNU profiler



To use the PGI profiler, simply load the PGI module pgi/10.9. No module needs to be loaded for gprof.
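To make the idea of instrumentation concrete, the following is a minimal hand-written sketch rather than the output of either profiler. It assumes a hypothetical function, kernel, standing in for a hot spot in a real code, and wraps the call in timing probes using the standard C clock() function. An instrumenting profiler inserts comparable probes automatically and records far more detail; the names kernel and timed_kernel are illustrative only.

```c
#include <time.h>

/* Hypothetical compute kernel standing in for a hot spot in a real code:
 * the partial sum of 1/i^2 for i = 1..n. */
double kernel(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / ((double)i * (double)i);
    return s;
}

/* Run the kernel between two timing probes.
 * Stores the kernel's value in *result and returns the CPU time
 * consumed by the call, in seconds. */
double timed_kernel(long n, double *result)
{
    clock_t start = clock();
    *result = kernel(n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Probes like these perturb the code being measured, which is why instrumentation is described above as more intrusive than statistical sampling.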


Numerical libraries

Many numerical libraries are available, both open source and proprietary. As the GRACE cluster is based on the Intel Westmere architecture, the Intel Math Kernel Library (MKL) is available for use. This library contains the following set of optimised subroutines:


  • Sparse Solvers

  • Fast Fourier Transforms

  • Vector Maths

  • Statistical functions

Please click here for further documentation on using MKL in your codes. Before developing your own subroutine, please check whether one is already available. If it is, it is strongly recommended that you use it, as it will have been optimised for the cluster. To load MKL, type:

module load intel/mkl/11.1

The NAG Fortran libraries are also available on the GRACE cluster. The module files that are available for NAG are:


  • nag/fortran/gnu/22
  • nag/fortran/intel/22
  • nag/fortran/pgi/22
  • nag/fortran/gnu/23
  • nag/fortran/intel/23
  • nag/c/gnu/9
  • nag/c/intel/

The module file will depend on the compiler you are using, so ensure you load the correct module. To locate the shared library, type:

module show nag/fortran/pgi/22

and note the path under the LD_LIBRARY_PATH variable:

append-path LD_LIBRARY_PATH /gpfs/grace/nagfl-22/fll6a22dpl/lib

Then to link against the library, use the shared object file (in this example) as:


Documentation on NAG libraries can be obtained from:


The cluster also has the GNU Scientific Library (GSL) installed and the module files for it are:

  • gsl/gcc/1.9
  • gsl/intel/1.9
  • gsl/pgi/1.9
  • gsl/gcc/1.15

They have been built with the GNU, Intel and PGI compilers. The documentation for GSL can be found here.


Parallel communication library

Code and application performance has increased considerably thanks to many-core systems such as clusters. As CPU frequencies have stagnated for many years, greater performance now comes from clusters and multi-core architectures. To make use of the large number of cores, parallel communication libraries have been developed. The choice of parallel library depends on whether the code runs on:

  • shared memory machines;

  • distributed memory machines such as clusters.

If the computational experiment fits into a single GRACE node, which has 24 GB of RAM, then the shared memory model can be used. If it does not fit into a single node, then a distributed memory library will have to be used. The parallel libraries that are available on GRACE are tabulated below with links provided for further information.


Shared memory                             Distributed memory
OpenMP                                    Parallel Virtual Machine (PVM)
POSIX Threads                             Message Passing Interface (MPI)
Intel Threading Building Blocks (TBB)     Unified Parallel C (UPC)
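Whether a problem fits within a single node can be estimated before any code is run. The sketch below is a rough check only: it assumes the data consist of dense square matrices of doubles, treats the 24 GB of RAM as 24 GiB, and ignores memory used by the operating system and by the rest of the program. The names NODE_MEMORY_BYTES, matrix_bytes and fits_in_one_node are hypothetical.

```c
#include <stdbool.h>

/* Assumption: "24 GB" is treated as 24 GiB and none of it is reserved
 * for the operating system; real usable memory will be lower. */
#define NODE_MEMORY_BYTES (24ULL * 1024ULL * 1024ULL * 1024ULL)

/* Bytes needed to store `count` dense n x n matrices of doubles. */
unsigned long long matrix_bytes(unsigned long long n, unsigned long long count)
{
    return count * n * n * sizeof(double);
}

/* True if the matrices alone fit within one node's memory. */
bool fits_in_one_node(unsigned long long n, unsigned long long count)
{
    return matrix_bytes(n, count) <= NODE_MEMORY_BYTES;
}
```

For example, one 50000 x 50000 matrix of doubles needs 20 000 000 000 bytes and fits under this estimate, whereas two such matrices do not, which would point towards the distributed memory model.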

OpenMP is a much simpler, higher-level alternative to POSIX Threads and is controlled by compiler directives (pragmas in C/C++). If you want to develop multi-threaded code using the shared memory model, it is strongly recommended that you use OpenMP instead of POSIX Threads. Remember, however, that shared memory code is strictly limited to a single node (24 GB of RAM and 12 CPU cores). The compilers listed above all support OpenMP, which is enabled by a compiler switch (see the compiler documentation for further information). In addition to OpenMP, Intel Threading Building Blocks (TBB), a C++ template library for developing multi-threaded code, is also provided. Please click here for documentation on using TBB.
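The directive-based style of OpenMP can be sketched with a dot product. This is a minimal illustration, not code taken from the cluster documentation, and the function name dot is hypothetical. A single pragma asks the compiler to share the loop iterations among threads and to combine the per-thread partial sums.

```c
/* Dot product of two vectors of length n.
 * With OpenMP enabled, the loop iterations are divided among threads and
 * the reduction clause combines each thread's private partial sum.
 * Without the OpenMP switch the pragma is ignored and the loop runs
 * serially, giving the same result for this integer-valued test data. */
double dot(const double *x, const double *y, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

With GCC the switch that enables OpenMP is -fopenmp; the Intel and PGI compilers use their own switches, which are listed in their documentation. The number of threads is normally set through the OMP_NUM_THREADS environment variable, and on a GRACE node there is little to gain from exceeding the 12 available cores. Note that floating-point results from a parallel reduction can differ slightly from the serial result, because the additions are performed in a different order.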

To develop parallel code for distributed memory architectures, the MPI library is provided on the GRACE cluster. MPI has evolved over the years and has been accepted as the standard model for parallel code development, while usage of PVM is now minimal. Support for MPI is widespread and good documentation exists. Popular scientific codes use this library widely and, for this reason, it is provided on the GRACE cluster. Note that MPI can also be used for codes that run entirely within a single node. Development of UPC is still ongoing and will be monitored; should the project develop into an enterprise-grade library, it may be provided by the Research Computing Group.

There are a number of implementations of MPI; the one provided is Platform MPI version 8.1. Please click here for documentation on Platform MPI. See Chapter 6 for MPI performance optimisation. Platform MPI has been extensively benchmarked against other popular implementations (except Intel MPI) and outperformed all of them. The library provides impressive performance and scales well to a large number of nodes. There are three module files for Platform MPI, depending on the compiler you wish to use:

  • mpi/platform/gcc/8.1

  • mpi/platform/intel/8.1

  • mpi/platform/pgi/8.1

A second distributed memory library, also developed by Platform, is available: MPICH version 2. MPICH implements the MPI standard and provides efficient mechanisms for intra-node communication. There are three module files for MPICH, depending on the compiler you wish to use:

  • mpich2/platform/gcc/8.1

  • mpich2/platform/intel/8.1

  • mpich2/platform/pgi/8.1

GPU software development in CUDA

Please see the GPU computing section for developing GPU code.

If you have any further queries regarding the contents of this wiki, please email


Software version control

The Research Computing Services group also provides a software version control service. This is a centrally managed Subversion service; more information can be found at UEA Subversion.