  1. Application performance on the UK's new HECToR service
  Fiona Reid [1,2], Mike Ashworth [1,3], Thomas Edwards [2], Alan Gray [1,2], Joachim Hein [1,2], Alan Simpson [1,2], Peter Knight [4], Kevin Stratford [1,2], Michele Weiland [1,2]
  1 HPCx Consortium; 2 EPCC, The University of Edinburgh; 3 STFC Daresbury Laboratory; 4 EURATOM/UKAEA Fusion Association
  CUG, May 5-8th 2008

  2. Acknowledgements
  • STFC: Roderick Johnstone
  • Colin Roach (UKAEA) and Bill Dorland (University of Maryland) for assistance porting GS2 to HECToR and supplying the NEV02 benchmark
  • Jim Phillips and Gengbin Zheng (UIUC) for their assistance installing NAMD on HECToR

  3. Overview
  • System introductions
  • Synthetic benchmark results
  • Application benchmark results
  • Conclusions

  4. Systems for comparison
  • HPCx (Phase 3): 160 IBM e-Server p575 nodes
    – SMP cluster, 16 Power5 1.5 GHz cores per node
    – 32 GB of RAM per node (2 GB per core)
    – IBM HPS interconnect (aka Federation)
    – 12.9 TFLOP/s Linpack, No. 101 on the Top500
  • HECToR (Phase 1): Cray XT4
    – MPP, 5664 nodes, 2 Opteron 2.8 GHz cores per node
    – 6 GB of RAM per node (3 GB per core)
    – Cray SeaStar2 torus network
    – 54.6 TFLOP/s Linpack, No. 17 on the Top500
  • Also included in some plots:
    – HECToR Test and Development System (TDS): Cray XT4, 64 nodes, 2.6 GHz dual-core, 4 GB RAM per node

  5. System Comparison (cont.)
                    HPCx                      HECToR
  Chip              IBM Power5 (dual core)    AMD Opteron (dual core)
  Clock             1.5 GHz                   2.8 GHz
  FPUs              2 FMA                     1 multiply, 1 add
  Peak perf/core    6.0 GFlop/s               5.6 GFlop/s
  Cores             2560                      11328
  Peak perf         15.4 TFLOP/s              63.4 TFLOP/s
  Linpack           12.9 TFLOP/s              54.6 TFLOP/s
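
  For reference, the peak figures in the table follow directly from the FPU configuration and clock rate (a standard calculation, not spelled out on the slide):
  • Power5: 2 FMA units × 2 flops per FMA = 4 flops/cycle; 4 × 1.5 GHz = 6.0 GFlop/s per core; 2560 cores × 6.0 GFlop/s ≈ 15.4 TFLOP/s
  • Opteron: 1 multiply + 1 add = 2 flops/cycle; 2 × 2.8 GHz = 5.6 GFlop/s per core; 11328 cores × 5.6 GFlop/s ≈ 63.4 TFLOP/s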

  6. Synthetic Benchmarks
  • Memory Bandwidth – Streams
  • MPI Bandwidth – Intel MPI Benchmarks (IMB) PingPing

  7. Memory bandwidth – Streams
  [Chart: Streams bandwidth (load+store, MB/s) vs. array size (bytes, 1 KB to 1 GB); curves for TDS and HECToR with 2 and 1 cores per node, and HPCx with 16 and 8 cores per node]

  8. Memory bandwidth – Streams
  • The caches can clearly be seen
  • HECToR is better at L1 sizes and slightly better on main memory; HPCx has the advantage for intermediate array sizes
  • Underpopulating nodes (1 core per chip) gives improvements on both systems
    – memory bandwidth cannot sustain 2 cores per chip
    – HECToR is worse than HPCx in this respect, especially on main memory
    – of course, 1 core per chip means double the resource for the same number of tasks
  • The TDS has a lower clock rate than HECToR, but higher bandwidth from main memory!
    – the 4 GB (2+2) of RAM per TDS node is symmetric, so memory interleaving is possible
    – the 6 GB (4+2) per HECToR node only allows partial interleaving
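
  A minimal sketch of the kind of measurement Streams makes is given below: it times a simple copy kernel and counts load plus store traffic. This is an illustration only, not the actual STREAM benchmark source, and the array size and repeat count are arbitrary choices. Running one or two copies of it per dual-core node would mimic the underpopulated and fully populated cases compared above.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M doubles per array, ~128 MB each */
#define NTIMES 10     /* repeat and keep the best (fastest) time */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double best = 1e30;

    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    for (int k = 0; k < NTIMES; k++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)   /* copy kernel: one load, one store */
            a[i] = b[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double t = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        if (t < best) best = t;
    }

    /* load + store traffic: 2 arrays x N elements x 8 bytes */
    double mbytes = 2.0 * N * sizeof(double) / 1.0e6;
    printf("copy bandwidth: %.1f MB/s (check value %.1f)\n",
           mbytes / best, a[N / 2]);

    free(a);
    free(b);
    return 0;
}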

  9. MPI bandwidth – PingPing
  [Chart: IMB Multi-PingPing bandwidth per task (MB/s) vs. message size (bytes), HECToR with 2 cores per node vs. HPCx with 16 cores per node; annotated plateau bandwidths of 720 MB/s and 140 MB/s]
  • HPCx reaches its saturation point earlier; HECToR may scale better
  • On both systems the latency (via IMB PingPong) is ~5.5 µs
  • AlltoAll: HPCx has the advantage for small (<100 bytes) messages; HECToR outperforms HPCx for larger messages
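
  As an illustration of what the PingPing test measures, the sketch below has two MPI ranks each sending and receiving a message of the same size simultaneously, and reports the per-task bandwidth. It is not the Intel MPI Benchmarks source; the message size and repeat count are arbitrary.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nrep = 1000, nbytes = 1 << 20;   /* 1 MB messages */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *sbuf = malloc(nbytes);
    char *rbuf = malloc(nbytes);
    int other = 1 - rank;                      /* run with exactly 2 ranks */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < nrep; i++) {
        /* both ranks post a send and a receive at the same time */
        MPI_Request req;
        MPI_Isend(sbuf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(rbuf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    double t = (MPI_Wtime() - t0) / nrep;

    if (rank == 0)
        printf("PingPing: %d bytes, %.2f us, %.1f MB/s per task\n",
               nbytes, 1e6 * t, nbytes / t / 1e6);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}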

  10. Applications
  • Fluid Dynamics
    – PDNS3D
    – Ludwig
  • Fusion
    – Centori
    – GS2
  • Ocean Modelling
    – POLCOMS
  • Molecular Dynamics
    – DL_POLY
    – LAMMPS
    – GROMACS (see paper)
    – NAMD
    – AMBER (see paper)

  11. Fluid Dynamics: PDNS3D (PCHAN)
  • Finite difference code for turbulent flow – shock/boundary layer interaction (SBLI); a sketch of such a stencil follows this slide
  • Simulates the flow of fluids to study turbulence
  • T3 benchmark – involves a 360x360x360 grid
  • Developed by Neil Sandham, University of Southampton
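
  To illustrate why a finite-difference code of this kind is bandwidth-sensitive (as the later slides show), here is a minimal central-difference stencil sweep over a 360^3 grid. It is a sketch only: PDNS3D's actual kernels solve the compressible Navier-Stokes equations for SBLI and are far more involved.

#include <stdlib.h>

#define N 360                      /* the T3 benchmark uses a 360^3 grid */
#define IDX(i, j, k) (((size_t)(i) * N + (j)) * N + (k))

/* one second-order central-difference Laplacian sweep: du = lap(u) */
static void laplacian(const double *u, double *du, double inv_h2)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            for (int k = 1; k < N - 1; k++)
                du[IDX(i, j, k)] = inv_h2 *
                    (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)] +
                     u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)] +
                     u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)] -
                     6.0 * u[IDX(i, j, k)]);
}

int main(void)
{
    size_t n3 = (size_t)N * N * N;          /* two arrays need ~0.75 GB */
    double *u  = calloc(n3, sizeof(double));
    double *du = calloc(n3, sizeof(double));

    /* each grid point reads 7 neighbouring values and writes 1; there is
       little data reuse beyond what fits in cache, so the sweep is
       limited by memory bandwidth rather than floating-point peak */
    laplacian(u, du, 1.0);

    free(u);
    free(du);
    return 0;
}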

  12. PDNS3D – compilation optimization
  [Bar chart: PDNS3D T3 benchmark, HECToR, PGI compilation flag comparison on 64 cores; run time (s, axis range 1050-1500) for optimisation flags including -O0 to -O4, -fast, and -fast -Mipa=fast,inline combinations]

  13. PDNS3D – system comparison
  [Chart: PDNS3D T3 benchmark, system comparison; time × cores (s) vs. cores for HECToR and HPCx Phase 3]

  14. PDNS3D – memory bandwidth sensitivity
  [Charts: PDNS3D T3 benchmark, time × cores (s) vs. cores; HPCx Phase 3 fully populated vs. 8 cores per node, and HECToR and the TDS fully populated vs. 1 core per node]
  • Underpopulating nodes gives a huge improvement (in terms of performance per core) on HECToR, and a slight improvement on HPCx
  • The TDS outperforms HECToR
  • cf. the Streams results

  15. PDNS3D – Optimised version
  • The new optimised version is less sensitive to memory bandwidth
  [Chart: PDNS3D T3 benchmark, system comparison; time × cores (s) vs. cores for HECToR, HPCx Phase 3, the optimised version on each (HECToR Opt, HPCx Opt), and the optimised version on HECToR with 1 core per node]

  16. PDNS3D – Optimised version
  • The PathScale compiler gives a further 10-15% improvement
  [Chart: as on the previous slide, with an additional curve for the optimised version built with PathScale on HECToR]

  17. Fluid dynamics – Ludwig
  • Lattice Boltzmann code for solving the incompressible Navier-Stokes equations
  • Used to study complex fluids
  • The code uses a regular domain decomposition with local boundary exchanges between the subdomains (a sketch of such a halo exchange follows this slide)
  • Two problems considered: one with a binary fluid mixture, the other with shear flow
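
  To illustrate the "local boundary exchanges between subdomains" pattern, the sketch below swaps halo sites between neighbouring MPI ranks using a 1D decomposition. It is not Ludwig's implementation, which decomposes a 3D lattice and exchanges whole faces; the buffer size and periodic boundaries are arbitrary choices.

#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 1024                /* lattice sites owned by this rank */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local data plus one halo site at each end: f[0] and f[NLOCAL+1] */
    double *f = calloc(NLOCAL + 2, sizeof(double));
    int left  = (rank - 1 + size) % size;      /* periodic neighbours */
    int right = (rank + 1) % size;

    /* send rightmost owned site to the right, receive left halo,
       then the mirror image for the other direction */
    MPI_Sendrecv(&f[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &f[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&f[1],          1, MPI_DOUBLE, left,  1,
                 &f[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... local lattice Boltzmann collision/propagation would go here ... */

    free(f);
    MPI_Finalize();
    return 0;
}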

  18. Ludwig – 256x512x256 lattice
  [Chart: Ludwig, 256x512x256 lattice]

  19. Fusion
  • Centori – simulates the fluid flow inside a tokamak reactor; developed by UKAEA Fusion in collaboration with EPCC
  • GS2 – gyrokinetic simulations of low-frequency turbulence in tokamaks; developed by Bill Dorland et al.
  [Image: ITER tokamak reactor (www.iter.org)]

  20. CENTORI
  [Chart: Centori, 128x128x128 problem, system comparison; time × cores (s) vs. cores for HECToR, HPCx and the TDS]

  21. GS2
  [Chart: GS2 NEV02 benchmark, system comparison; time × cores (s) vs. cores for HECToR, HPCx, the TDS, and HECToR and the TDS with 1 core per node]

  22. Ocean Modelling: POLCOMS
  • Proudman Oceanographic Laboratory Coastal Ocean Modelling System (POLCOMS)
    – Simulation of the marine environment
    – Applications include coastal engineering, offshore industries, fisheries management, marine pollution monitoring, weather forecasting and climate research
    – Uses a 3-dimensional hydrodynamic model

  23. Ocean Modelling: POLCOMS
  [Figure: POLCOMS results]

  24. Molecular dynamics
  • DL_POLY – general purpose molecular dynamics package which can be used to simulate systems with very large numbers of atoms
  • LAMMPS – classical molecular dynamics code which can simulate a wide range of materials
  • NAMD – classical molecular dynamics code designed for high-performance simulation of large biomolecular systems
  • AMBER (protein dihydrofolate reductase) – general purpose biomolecular simulation package
  • GROMACS – general purpose MD package specialising in biochemical systems, e.g. proteins, lipids etc.

  25. DL_POLY – system comparison
  [Chart: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms), system comparison; timestep time × cores (s) vs. cores for HPCx Phase 3 and HECToR]

  26. DL_POLY – system comparison
  [Chart: DL_POLY 3.08, gramicidin A with water solvating (792,960 atoms), on HECToR; timestep time × cores (s) vs. cores, fully populated vs. 1 core per node]

  27. LAMMPS
  [Chart: LAMMPS rhodopsin benchmark, 4,096,000 atoms, system comparison; loop time × cores (s) vs. cores for HPCx Phase 3, HPCx Phase 3 with SMT, and HECToR]

  28. LAMMPS
  [Chart: LAMMPS rhodopsin benchmark, 4,096,000 atoms, on HECToR; loop time × cores (s) vs. cores for HECToR, HECToR with 1 core per node, the TDS, and the TDS with 1 core per node]

  29. LAMMPS
  • On HECToR we can run a problem with 500 million particles
  • On HPCx the limit is ~100 million particles
    – fewer cores available
    – less memory per core
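
  As a rough consistency check (using the machine figures from slide 4, not stated on this slide): HECToR has 5664 nodes × 6 GB ≈ 34 TB of memory in total, while HPCx has 160 nodes × 32 GB ≈ 5 TB, i.e. roughly 6-7 times less, which is in line with the ~5x smaller maximum problem size.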
