ELF

ELF went online in January 2016. It is a Dell HPC cluster with the following configuration:

– 128 nodes based on Dell PowerEdge M1000E enclosures (16 nodes per enclosure) and PowerEdge M630 servers

– Dual Intel Xeon E5-2680 v3 2.5GHz, 12-core, 120W processors (i.e., 24 cores per node)

– 256GB of DDR4-2133 RAM per node

– Single 300GB 10K RPM local hard drive per node

– Dual 10 Gigabit Ethernet ports and a Mellanox FDR InfiniBand adapter

– Four (4) nodes provide a larger memory capacity (768GB)

– Eight (8) PowerEdge R730 nodes also provide an NVIDIA Tesla K40m GPU

 

[Figure: ELF I front view]

 

Network

– Mellanox InfiniBand FDR (56Gbps) fabric (2:1 oversubscribed)

– 10 Gigabit Ethernet network for connection to public networks

– Gigabit Ethernet management fabric

 

Storage System

– DDN GRIDScaler GPFS high-performance parallel filesystem

• 1PB of usable disk space

• 16x InfiniBand FDR connections

– Tape backup for RDI members ($HOME, $PROJECTS) is being deployed

– Additional storage systems to be deployed

 

ELF Initial Setup
Partition    Type                 Run jobs?   Data preserved?
/tmp         local disk (300GB)   YES         NO
$HOME        NFS                  NO          YES
$PROJECTS    NFS                  YES         YES
$SCRATCH     NFS                  YES         NO
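
A minimal sketch of how a job might use these partitions, assuming $SCRATCH and $PROJECTS are exported as environment variables (the variable names and the my_job paths below are illustrative, not documented names): work in $SCRATCH while the job runs, then copy anything worth keeping to $PROJECTS, since $SCRATCH data is not preserved.

    import os
    from pathlib import Path

    # Illustrative job layout: $SCRATCH is fast but not preserved, so stage
    # run-time files there and copy results worth keeping to $PROJECTS.
    # The environment variable names and "my_job" paths are assumptions.
    scratch = Path(os.environ.get("SCRATCH", "/tmp"))         # run jobs: YES, preserved: NO
    projects = Path(os.environ.get("PROJECTS", Path.home()))  # run jobs: YES, preserved: YES

    workdir = scratch / "my_job"
    workdir.mkdir(parents=True, exist_ok=True)

    # ... run the computation, writing intermediate files into workdir ...

    keep = projects / "my_job_results"
    keep.mkdir(parents=True, exist_ok=True)
    # copy the final outputs from workdir into keep here (e.g. with shutil.copy2)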

 

ELF queues

– Queue priority is based on the purpose of the queue

– Special arrangements are needed for the xlarge and xlong queues

– The queue configuration is still evolving

 

Initial Queue Setup
Queue Name    Max Runtime   Max Nodes/Cores   Purpose
development   4 hrs         2 / 48            Development nodes
gpu           2 days        8 / 96            GPU nodes
largemem      2 days        4 / 96            Large memory (768GB/node)
main          2 days        64 / 1536         Open access production
alloc         2 days        128 / 3072        Allocated access production
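
A minimal submission sketch for the queues above, assuming the queue names map directly to SLURM partitions; the node count, runtime, and the batch script name my_job.sh are placeholders chosen to stay within the limits in the table.

    import subprocess

    # Minimal SLURM submission sketch. Queue (partition) names and limits come
    # from the table above; "my_job.sh" is a placeholder batch script.
    cmd = [
        "sbatch",
        "--partition=main",   # open access production queue
        "--nodes=4",          # within the 64-node limit for main
        "--time=2-00:00:00",  # 2 days, the maximum runtime for main
        "my_job.sh",
    ]
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)  # requires sbatch on PATH, i.e. run on the cluster

The same limits can equally be expressed as #SBATCH directives inside the batch script itself; the point is that the partition, node, and time values come straight from the table.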

 

Software

– Red Hat Enterprise Linux derivative OS

– SLURM to manage and schedule jobs

– Parallel programming environments (MPI, CUDA, etc.); see the MPI example after this list

– Initial software stack based on open software

        • Application packages, compilers, libraries and tools
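
As an illustration of the MPI programming environment, here is a minimal "hello world" sketch using mpi4py; whether the Python MPI bindings are part of the installed stack is an assumption, and the srun line in the final comment shows one way to launch it under SLURM.

    # Minimal MPI "hello world" using mpi4py (assumed to be available in, or
    # added to, the open-source software stack).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the MPI job
    size = comm.Get_size()   # total number of MPI processes

    print(f"Hello from rank {rank} of {size}")

    # One way to launch under SLURM, one process per core on two 24-core nodes:
    #   srun --nodes=2 --ntasks-per-node=24 python hello_mpi.py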

 

Job accounting 

– Based on Service Units (SUs): 1 SU = 1 core-hour

– SUs billed = # nodes * 24 cores/node * wallclock time (in hours)

– largemem queue: SUs billed = # nodes * 24 cores/node * wallclock time (in hours) * 2
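
A worked example of the formulas above; the job sizes are hypothetical.

    # Worked example of the SU formulas above (job sizes are hypothetical).
    CORES_PER_NODE = 24

    def sus_billed(nodes, wallclock_hours, largemem=False):
        """SUs = nodes * 24 cores/node * wallclock hours, doubled on largemem."""
        multiplier = 2 if largemem else 1
        return nodes * CORES_PER_NODE * wallclock_hours * multiplier

    print(sus_billed(4, 12))                # 4 nodes for 12 hours        -> 1152 SUs
    print(sus_billed(2, 6, largemem=True))  # 2 largemem nodes for 6 hours -> 576 SUs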