Quick Start

HPC Midlands Plus (Athena) Quick-start Guide

Accessing Athena

Log in via ssh to athena.hpc-midlands-plus.ac.uk using your SAFE username and the password issued to you via SAFE for this service.
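
For example, if your SAFE username were jbloggs (an illustrative name only), you would connect with:

ssh jbloggs@athena.hpc-midlands-plus.ac.uk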

Loughborough University users: please log in to athena.lboro.ac.uk.

Storage

The location of your home directory is of the form /gpfs/home/site/username, where site is the name of your site (e.g. aston, lboro) and username is your username.

User areas are not backed up, so if you have critical data please ensure it is copied to other locations.

You will need to transfer your files to/from Athena via scp or sftp (e.g. using WinSCP on Windows).
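
For example, to copy a results file from your local machine to your home area on Athena (the username, site and file name below are illustrative):

scp results.dat jbloggs@athena.hpc-midlands-plus.ac.uk:/gpfs/home/lboro/jbloggs/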

Modules

Athena uses the environment modules system; to list the available modules, type

module avail
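
To load a module into your environment and confirm what is loaded, use module load and module list. The exact module names on Athena may differ from the example below, so check module avail first:

module load gcc/6.3.0
module list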

Job Submission

Athena uses the SLURM queueing system.

To submit a job use the sbatch command:

sbatch hello.sh

which responds with something like:

Submitted batch job 22465

Here is an example SLURM submission script:

#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --job-name=myjobname
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --account=a01
#SBATCH --mail-type=ALL
#SBATCH --mail-user=A.Bloggs@lboro.ac.uk
# Set up the software environment: start from a clean state and load what you need
module purge
module load some_modules
# Launch the MPI program across the allocated nodes and cores
mpirun ./hello_parallel

Please note that the x86 nodes on Athena all have 28 cores, so the example above requests 2 x 28 = 56 MPI ranks in total.

The working directory of the job defaults to the one in which you submitted the job.

Argument Meaning
--time=HH:MM:SS The walltime you are requesting, i.e. the maximum time the job will be allowed to run for. Specified in hours:minutes:seconds.
--job-name=some_name The name you have chosen to give the job. Defaults to the name of the job script.
--output=some_filename The name of the file in which the stdout output will be saved. Defaults to slurm-NNNNN.out, where NNNNN is the job ID.
--error=some_filename The name of the file in which the stderr output will be saved. Defaults to slurm-NNNNN.err, where NNNNN is the job ID.
--partition=partition_name The partition your job should run in. Most users should set this to compute.
--nodes=some_number The number of nodes you are requesting.
--account=account_code You should have been assigned an account code you can use here. Jobs will usually default to the correct account unless you are a member of multiple projects.
--workdir=some_directory Set the job working directory. Without this, the working directory defaults to the one the job was submitted from.
--mail-type=specification When the system should send you email. Options are BEGIN, END, FAIL, REQUEUE and ALL.
--mail-user=thing@somewhere.com Who to email the job status updates to. Set this to your full email address.
--requeue Tells SLURM it may restart the job if it is interrupted (e.g. due to a hardware fault).
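
As a further illustration, a minimal single-node script using the output, error and working-directory options might look like the sketch below. The file names, directory and program name are placeholders, and %j expands to the job ID:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --job-name=serialtest
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --output=serialtest-%j.out
#SBATCH --error=serialtest-%j.err
#SBATCH --workdir=/gpfs/home/lboro/jbloggs/serialtest
./my_serial_program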

Partitions

Jobs can be submitted to one of three partitions.

compute

  • x86 nodes with 28 cores and 128 GB memory.
  • There are 512 nodes in this partition.
  • The maximum job walltime allowed is 100 hours.
  • The maximum core count per job is 756 cores (= 27 nodes).
  • The maximum cores per user (i.e. all their running jobs) is 2000 cores.

openpower

  • OpenPower nodes with 20 cores and 1 TB memory.
  • There are 4 nodes in this partition.
  • The maximum job walltime allowed is 100 hours.

openpower-nvidia

  • OpenPower node with 20 cores, 1 TB memory and two Nvidia P100 GPGPUs.
  • There is one node in this partition.
  • The maximum job walltime allowed is 100 hours; an example submission script for this partition is sketched below.
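
A minimal sketch of a submission script for the GPU node follows. The program name is a placeholder, and whether the GPUs must also be requested explicitly (e.g. with --gres) depends on how the partition is configured, so check with support if in doubt:

#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH --job-name=gputest
#SBATCH --partition=openpower-nvidia
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
./my_gpu_program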

Job Monitoring and Control

The squeue command lists both queued and running jobs:

squeue
  JOBID PARTITION NAME   USER  ST TIME  NODES NODELIST(REASON)
  22163 all       jobnnt tumax PD 0:00  1     (AssociationJobLimit)
  22433 all       small  txdfq R  33:35 8     node[0146-0153]

If a job is running (R in the ST column) then squeue lists the nodes it is running on. If it is waiting to run it is usually shown as pending (PD in the ST column) and squeue lists the reason in the NODELIST(REASON) column.
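
To list only your own jobs rather than the whole queue, pass your username to squeue:

squeue -u $USER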

To show more details on a particular job use scontrol show job <JOBID>:

scontrol show job 22401
  JobId=22401 Name=tuhs
  UserId=tuhs(7890) GroupId=lboro_tu(5678)
  Priority=1 Account=tuhs98 QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
  RunTime=05:54:34 TimeLimit=20:00:00 TimeMin=N/A
  SubmitTime=2014-07-24T10:21:37 EligibleTime=2014-07-24T10:21:37
  StartTime=2014-07-24T10:21:37 EndTime=2014-07-25T06:21:37
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partition=all AllocNode:Sid=athena10:3606
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node[56-61]
  BatchHost=node56
  NumNodes=6 NumCPUs=96 CPUs/Task=1 ReqS:C:T=*:*:*
  MinCPUsNode=16 MinMemoryNode=0 MinTmpDiskNode=0
  Features=(null) Gres=(null) Reservation=(null)
  Shared=0 Contiguous=0 Licenses=(null) Network=(null)
  Command=/gpfs/home/loughborough/tu/tuhs/work/m77r80/submit.sh
  WorkDir=/gpfs/home/loughborough/tu/tuhs/work/m77r80

If you need to kill a job or remove it from the queue, use the scancel command:

scancel 22465
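
To remove all of your own jobs from the queue in one go, pass your username instead of a job ID:

scancel -u $USER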

Command Summary

Command Meaning
sbatch Submit a job. Equivalent of qsub.
squeue List the queue. Equivalent of qstat or showq.
scancel Kill a job. Equivalent of qdel.
scontrol show job jobid Show details of a job jobid. Equivalent to qstat -f.
smap Graphical display of where jobs are running. Equivalent to showstate.
squeue --start Show the expected start time for a job. Equivalent to showstart.
srun --export=PATH --pty bash Start an interactive job. Equivalent to qsub -I.
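
For example, to start a one-hour interactive session on a single compute node (the resource requests here are illustrative):

srun --partition=compute --nodes=1 --ntasks-per-node=1 --time=01:00:00 --export=PATH --pty bash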

Environment Variables

These environment variables may be useful in your submission scripts:

Variable Meaning
SLURM_SUBMIT_DIR The directory the job was submitted from. Equivalent to PBS_O_WORKDIR.
SLURM_JOB_ID Which job this is. Equivalent to PBS_JOBID.
SLURM_NODELIST The nodes assigned to the job. Equivalent to PBS_NODELIST.
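
For example, a submission script could use them to change to the submission directory and record where the job ran (a minimal sketch):

cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOB_ID is running on nodes: $SLURM_NODELIST"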

Further Information

Further detail on many of the commands is available using the man command, e.g. type man squeue once you have logged in to Athena. For long pages type a space to get the next page, or q to quit.

These manual pages are also available on the web.

Getting Help

Users at each member site should contact their local support.

University Support Contact / E-mail / Web site
Aston University M. Eberhard
University of Birmingham Service Desk
University of Leicester Rcs.support@le.ac.uk, Service Desk
Loughborough University research-computing@lboro.ac.uk
University of Nottingham Service Desk
Queen Mary, University of London its-research-support@qmul.ac.uk
University of Warwick Bugzilla

Installed Software

The cluster has a standard CentOS 7.3 Linux installation, with the Intel 2017 Compiler suite, MKL and MPI added.
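
To check which compiler and library versions are actually installed as modules, you can filter the module list; for example (module names are site-specific, so adjust the search strings as needed):

module avail intel
module avail gcc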

The initial set of user software will be built and installed in tranches, based upon the priorities agreed by the HPC Midlands Plus working group to support the pilot service.

First tranche – compilers, libraries and other supporting packages.

Note that this is not an exhaustive list.

Package Version(s)
perl 5.24.1
python 2.7.12, 3.6.0
arpack 96
arpack++ 1.2
blas (non-MKL) 3.6.0
openblas 0.2.18
boost 1.61.0
cblas 3.6.0
cuda 8
fftw 2.1.5, 3.3.5
hdf5 1.8.17
Lapack (non-MKL) 3.6.1
metis 5.1.0
netcdf 4.4.1
numexpr 2.6.2
numpy 1.11.2
openpyxl 2.4.5
pandas 0.19.2
parmetis 4.0.3
patsy 0.4.1
petsc 3.7.6
pillow 4.0.0
qhull 2015.2
qt 4.8.7
scalapack 2.0.2
scipy 0.18.1
suitesparse 4.5.3
superlu 5.2.1
git 2.9.3
subversion 1.9.5
gcc 4.9.3, 6.3.0
glpk 4.55
gsl 1.16
jre 1.8.0_121-b13
ant 1.10.0

Second tranche

Package Version(s)
castep 17.2
DLPOLY Classic 1.9
gromacs 5.1.4
lammps 17Nov16
R 3.3.2
gdal 2.1.2
namd 2.12

Third tranche

Package Version(s)
superlu 5.2.1
metis-mt 0.6.0
cblas 3.6.0
amber 16
DLPOLY 4.08
cp2k 4.1
gulp 4.3.x
ipython 5.1
matlab 2016b, 2017a
Matlab toolboxes
OpenFOAM V1612+, 2.4
paraview 5.2.0
siesta 3.2-pl-4
VASP 5.4.1.03Aug16, 5.4.4
vtk 8.0.0
cvodes 2.9.0
ffmpeg 2.8.6,3.3.2
hypre 2.11.1
ida 2.9.0
idas 1.3.0
kinsol 2.9.0
cvode 2.9.0
WRF 3.6.1, 3.8.1