  1. The Need for Parallel I/O in Classical Molecular Dynamics
I.T. Todorov, I.J. Bush, A.R. Porter
Advanced Research Computing, CSE Department, STFC Daresbury Laboratory, Warrington WA4 4AD, UK

  2. The DL_POLY_3 MD Package
• General purpose MD simulation package
• Written in modularised, free-format FORTRAN90 (+ MPI2) with rigorous code syntax (FORCHECK and NAGWare verified) and no external library dependencies
• Generic parallelisation (for short-ranged interactions) based on spatial domain decomposition (DD) and linked cells (LC)
• Long-ranged Coulomb interactions handled by Smoothed Particle Mesh Ewald (SPME), employing 3D FFTs for the k-space evaluation, which limits use to about 2k CPUs
• Maximum particle load ≈ 2.1 × 10⁹ atoms
• Full force field and molecular description, but no rigid body description yet (as in DL_POLY_2)
• Free-format, semantically approached reading with some fail-safe features (though not fully fool-proofed)

  3. Domain Decomposition Parallelisation
[figure: the simulation cell split into equal domains A, B, C, D, one per processor]
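A minimal Fortran sketch of the DD + LC idea pictured above (an illustration with assumed variable names, not DL_POLY_3 source): for an orthorhombic box split into npx × npy × npz equal domains, it returns the MPI rank that owns a particle and the linked cell holding it inside that domain, with cells at least one short-range cutoff wide.

  subroutine locate_particle(r, box, npx, npy, npz, rcut, owner, icell)
    ! Illustrative domain decomposition + linked cells; coordinates are
    ! assumed already folded into [0, box).
    implicit none
    real(8), intent(in)  :: r(3), box(3), rcut   ! position, cell edges, cutoff
    integer, intent(in)  :: npx, npy, npz        ! domains per direction
    integer, intent(out) :: owner                ! MPI rank owning the particle
    integer, intent(out) :: icell(3)             ! linked cell within that domain
    integer :: np(3), id(3), ncell(3), k
    real(8) :: dl(3)

    np = (/ npx, npy, npz /)
    dl = box / np                                ! domain edge lengths

    do k = 1, 3
       id(k)    = min(int(r(k)/dl(k)), np(k)-1)                       ! 0-based domain index
       ncell(k) = max(1, int(dl(k)/rcut))                             ! cells of width >= rcut
       icell(k) = min(int((r(k) - id(k)*dl(k))*ncell(k)/dl(k)), ncell(k)-1)
    end do

    owner = id(1) + npx*(id(2) + npy*id(3))                           ! domain -> MPI rank map
  end subroutine locate_particle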

  4. Development
[chart: Published Lines of Code (in thousands, 35 to 65) versus Development Time (0 to 5 years), marking releases v01-Mar-03, v02-Mar-04, v03-Sep-04, v04-Mar-05, v05-Oct-05, v06-Mar-06, v07-Dec-06, v08-Sep-07 and v09-Jan-08]

  5. DL_POLY Licence Statistics
[chart: licence counts of 1420, 885 and 196]

  6. Performance Weak Scaling on IBM p575
[chart: speed gain versus processor count (up to 1024) against perfect parallelisation for Solid Ar (32,000 atoms per CPU), NaCl (27,000 ions per CPU) and SPC Water (20,736 ions per CPU); the largest runs reach 33, 28 and 21 million particles respectively, with maximum loads of 700,000 atoms, 220,000 ions and 210,000 ions per 1 GB/CPU]

  7. I/O Weak Scaling on IBM p575
[chart: time in seconds versus processor count (up to 1024) for Solid Ar, NaCl and SPC Water; solid lines show start-up times, dashed lines show shut-down times]

  8. Proof of Concept on IBM p575
A 300,763,000-ion NaCl system with full SPME electrostatics evaluation on 1024 CPU cores.
Start-up time ≈ 1 hour; timestep time ≈ 68 seconds; FFT evaluation ≈ 55 seconds.
In theory the system is large enough to be seen by eye, although you would need a very good microscope: the MD cell for this system is 2 μm along each side, while the wavelength of visible light is about 0.5 μm, so resolving it should be theoretically possible.

  9. Importance of I/O - I
Types of MD studies most dependent on I/O:
• Large length-scales (10⁹ particles), short time-scales, such as screw deformations
• Medium-to-large length-scales (10⁶–10⁸ particles), medium time-scales (ps–ns), such as radiation damage cascades
• Medium length-scales (10⁵–10⁶ particles), long time-scales (ns–µs), such as membrane and protein processes
Types of I/O (illustrated in the sketch below):
               portable   human readable   loss of precision   size
  ASCII            +             +                  –            –
  Binary           –             –                  +            +
  XDR Binary       +             –                  +            +
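The ASCII versus binary trade-off in the table above, in a minimal Fortran sketch (the file names and record format are illustrative assumptions, not a DL_POLY file layout): the same three coordinates written as formatted text and as native unformatted binary.

  program ascii_vs_binary
    implicit none
    real(8) :: r(3) = (/ 1.234567890123456d0, -2.0d0, 3.5d0 /)

    ! ASCII: portable and human readable, but about 55 bytes per line and
    ! only as many digits as the format carries (loss of precision)
    open(10, file='coords.txt', form='formatted')
    write(10, '(3f18.8)') r
    close(10)

    ! Native binary: 24 bytes at full precision, but neither human readable
    ! nor portable across machines (endianness, record conventions); XDR
    ! binary restores portability at a similar size
    open(11, file='coords.bin', form='unformatted', access='stream')
    write(11) r
    close(11)
  end program ascii_vs_binary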

  10. Importance of I/O - II
Example: a 15 million particle system simulated with 2048 MPI tasks.
MD time per timestep ≈ 0.7 (2.7) seconds on Cray XT4 (BG/L)
Configuration read ≈ 100 seconds (once during the simulation)
Configuration write ≈ 600 seconds for 1.1 GB with the fastest I/O method: MPI-I/O on the Cray XT4 (parallel direct access on the BG/L)
I/O in native binary is only 3 times faster and 3 times smaller.
Some unpopular solutions:
• Saving only the important fragments of the configuration
• Saving only fragments that have moved more than a given distance between two consecutive dumps
• Distributed dump: the configuration is split into a separate file for each MPI task, as in CFD (see the sketch below)
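A minimal Fortran + MPI sketch of the distributed-dump idea from the last bullet (the file naming and dummy data are assumptions, not a DL_POLY_3 option): each MPI task dumps its own particles to its own file, which removes the shared-file bottleneck but leaves reassembly to post-processing.

  program distributed_dump
    use mpi
    implicit none
    integer :: ierr, rank, i
    character(len=16) :: fname
    real(8) :: xyz(3,1000)                        ! this task's positions (dummy data)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    xyz = 0.0d0
    write(fname, '(a,i6.6)') 'CONFIG.', rank      ! one file per MPI task
    open(20, file=fname, form='formatted')
    do i = 1, size(xyz, 2)
       write(20, '(3f18.8)') xyz(:, i)
    end do
    close(20)

    call MPI_Finalize(ierr)
  end program distributed_dump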

  11. ASCII I/O Solutions in DL_POLY_3
1. Serial direct access write (SDAW): a single node, the master, does all the printing; all other nodes communicate their information to the master in turn while the master writes out the configuration of the time evolution.
2. Parallel direct access write (PDAW): all nodes print to the same file in an orderly manner, so no overlapping occurs, using Fortran direct access files. However, the behaviour of this method is not defined by the Fortran standard, and in particular we have experienced problems when the disk cache is not coherent with memory.
3. MPI-I/O write (MPIW): the same concept as PDAW, but performed using MPI-I/O rather than direct access (a sketch follows below).
4. Serial NetCDF write (SNCW): uses the NetCDF libraries for machine-independent formats of array-based scientific data (widely used by various scientific communities).
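A minimal sketch of the MPIW concept from item 3 (the 73-byte fixed-length record, the file name and the equal particle count per task are illustrative assumptions, not the actual DL_POLY_3 CONFIG layout or routine): every task computes the byte offset of its first record from the global particle index and writes with MPI_File_write_at, so records from different tasks never overlap.

  program mpiw_sketch
    use mpi
    implicit none
    integer, parameter :: reclen = 73             ! bytes per ASCII record (as quoted)
    integer :: ierr, rank, nprocs, fh, natms, i
    integer(kind=MPI_OFFSET_KIND) :: offset
    character(len=reclen) :: record
    real(8) :: xyz(3,1000)                        ! this task's positions (dummy data)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    natms = size(xyz, 2)                          ! equal counts per task assumed
    xyz   = 0.0d0

    call MPI_File_open(MPI_COMM_WORLD, 'CONFIG.out', &
         ior(MPI_MODE_WRONLY, MPI_MODE_CREATE), MPI_INFO_NULL, fh, ierr)

    offset = int(rank, MPI_OFFSET_KIND) * natms * reclen
    do i = 1, natms
       ! 8 + 10 + 3*18 + 1 = 73 bytes, terminated by a newline
       write(record, '(a8,i10,3f18.8,a1)') 'Ar      ', rank*natms + i, xyz(:, i), char(10)
       call MPI_File_write_at(fh, offset, record, reclen, MPI_CHARACTER, &
            MPI_STATUS_IGNORE, ierr)
       offset = offset + reclen
    end do

    call MPI_File_close(fh, ierr)
    call MPI_Finalize(ierr)
  end program mpiw_sketch

Because every record has the same length, the offset arithmetic is trivial; variable-length lines would first need a prefix sum (e.g. MPI_Exscan) over per-task byte counts.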

  12. Overall I/O Performance
[chart: write speed in MB/s (0 to 125) versus processor count (up to 2048) for BG/L (SDAW, PDAW, MPIW), BG/P (SDAW, PDAW), IBM p575 (SDAW, PDAW, MPIW, SNCW), Cray XT3 single-core (PDAW, MPIW to 512), Cray XT3 dual-core (PDAW, MPIW to 1024), Cray XT4 single-core (SDAW, MPIW, SNCW) and Cray XT4 dual-core (SDAW, MPIW, SNCW)]

  13. DL_POLY_3 I/O Conclusions
• PDAW performs markedly better than SDAW or MPIW, where the platform supports it, for this particular message size (73 bytes of ASCII per message). Improvements of an order of magnitude can be obtained, even though the I/O itself does not scale especially well.
• MPIW was optimised for, and performed consistently well on, the Cray XT3/4 architectures. It is much better than SDAW, but, as seen on the Cray XT3, still not as fast as PDAW. On the Cray XT4, MPIW can achieve a factor-of-two improvement, giving performance similar to PDAW on the Cray XT3, once the storage methodology (OST) is optimised for the dedicated I/O processing units.
• MPIW performs badly on IBM platforms. PDAW is not accessible on the Cray XT4. SDAW is extremely slow on the Cray XT3.
• While on the IBM p575 SNCW was on average only 1.5 times faster than SDAW, on the Cray XT4 it was 10 times faster, regardless of whether runs were in single- or dual-core mode. Despite these differences, SNCW performed 2.5 times faster on the IBM p575 than on the Cray XT4.

  14. Benchmarking BG/L Jülich
[chart: speed gain versus processor count (2000 to 16000) against perfect scaling for the total MD step, link cells, van der Waals, Ewald real-space and Ewald k-space terms; 14.6 million particle Gd₂Zr₂O₇ system]

  15. Benchmarking XT4 UK
[chart: speed gain versus processor count (1000 to 8000) against perfect scaling for the total MD step, link cells, van der Waals, Ewald real-space and Ewald k-space terms; 14.6 million particle Gd₂Zr₂O₇ system]

  16. Benchmarking Main Platforms - I
[chart: evaluations per second versus processor count (up to 2500) on Cray XT4 SC/DC, Cray XT3 SC/DC, 3 GHz Woodcrest DC, IBM p575, BG/L and BG/P; 3.8 million particle Gd₂Zr₂O₇ system]

  17. Benchmarking Main Platforms - II
[charts: the same platform comparison broken down into four panels versus processor count (up to 2500): link cells, Ewald k-space, van der Waals and Ewald real-space]

  18. Acknowledgements
Thanks to:
• Ian Bush (DL/NAG) for optimisation support
• Andy Porter (DL) for NetCDF work and support
• Martyn Foster (NAG) and David Quigley (UoW) for Cray XT4 optimisation
• Lucian Anton (NAG) for the first draft of the MPI-I/O writing routine
http://www.ccp5.ac.uk/DL_POLY/
