NAMD on BlueWaters


  1. NAMD on BlueWaters
    Presented by: Eric Bohm
    Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, LV Kale

  2. NSF/NCSA Blue Waters Project
    • Sustained-petaflops system funded by NSF, to be ready in 2011.
      − System expected to exceed 300,000 processor cores.
    • NSF acceptance test: 100-million-atom BAR domain simulation using NAMD.
    • NAMD PRAC "The Computational Microscope"
      − Systems from 10 to 100 million atoms
    • A recently submitted PRAC from an independent group wishes to use NAMD
      − 1 billion atoms!

  3. NAMD
    • Molecular dynamics simulation of biological systems
    • Uses the Charm++ idea:
      − Decompose the computation into a large number of objects
      − Have an intelligent runtime system (Charm++) assign objects to processors for dynamic load balancing
    • Hybrid of spatial and force decomposition (see the sketch below):
      − Spatial decomposition of atoms into cubes (called patches)
      − For every pair of interacting patches, create one object for calculating electrostatic interactions
    • Recent: Blue Matter, Desmond, etc. use this idea in some form
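
    As a concrete illustration of this hybrid decomposition, here is a minimal C++ sketch (hypothetical classes, not NAMD's actual Patch/Compute code): atoms are binned into cutoff-sized patches, and one compute object is created for each pair of neighboring patches. In NAMD the resulting objects are Charm++ objects that the runtime migrates for load balance.

    ```cpp
    // Hypothetical sketch of NAMD-style spatial/force decomposition.
    // Names (Atom, Patch, PairCompute) are illustrative, not NAMD's real classes.
    #include <cmath>
    #include <cstdio>
    #include <map>
    #include <tuple>
    #include <vector>

    struct Atom { double x, y, z; };
    struct Patch { std::vector<Atom> atoms; };             // one cutoff-sized cube
    struct PairCompute { std::tuple<int,int,int> a, b; };  // one object per interacting patch pair

    int main() {
        const double cutoff = 12.0;                        // Angstroms; patch edge >= cutoff
        std::map<std::tuple<int,int,int>, Patch> patches;

        // Spatial decomposition: bin each atom into the patch that contains it.
        std::vector<Atom> atoms = { {1.0, 2.0, 3.0}, {15.0, 2.0, 3.0}, {30.0, 40.0, 5.0} };
        for (const Atom& a : atoms) {
            auto key = std::make_tuple((int)std::floor(a.x / cutoff),
                                       (int)std::floor(a.y / cutoff),
                                       (int)std::floor(a.z / cutoff));
            patches[key].atoms.push_back(a);
        }

        // Force decomposition: one compute object per pair of neighboring patches
        // (including a patch with itself). The runtime would distribute these objects.
        std::vector<PairCompute> computes;
        for (const auto& p : patches) {
            auto [ix, iy, iz] = p.first;
            for (int dx = -1; dx <= 1; ++dx)
              for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    auto nb = std::make_tuple(ix + dx, iy + dy, iz + dz);
                    if (nb < p.first) continue;            // count each pair once
                    if (patches.count(nb)) computes.push_back({p.first, nb});
                }
        }
        std::printf("%zu patches, %zu pair computes\n", patches.size(), computes.size());
    }
    ```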

  4. BW Challenges and Opportunities
    • Support systems >= 100 million atoms
    • Performance requirements for 100 million atoms
    • Scale to over 300,000 cores
    • Power 7 hardware
      − PPC architecture
      − Wide node: at least 32 cores with 128 HT threads
    • BlueWaters Torrent interconnect
    • Doing research under NDA

  5. BlueWaters Architecture
    • IBM Power7
      − 8 cores/chip
      − 4 chips/MCM
      − 8 MCMs/drawer
      − 4 drawers/SuperNode
      − 1024 cores/SuperNode
    • System
      − Peak perf ~10 PF
      − Sustained ~1 PF
      − 300,000+ cores
      − 1.2+ PB memory
      − 18+ PB disk
      − Linux OS

  6. Power 7
    • 64-bit PowerPC, 3.7-4 GHz
    • 4-way SMT
    • Execution units: 2 fixed point, 2 load/store, 1 VMX, 1 decimal FP, 2 VSX
      − 4 FLOPs/cycle per VSX unit
    • Up to 8 FLOPs/cycle (see the estimate below)
    • 6-wide in-order, 8-wide out-of-order
    • 128-byte cache lines
    • 32 KB L1, 256 KB L2
    • 4 MB local slice of the shared 32 MB L3 cache
    • Prefetch of up to 12 data streams
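
    To connect these per-core numbers with the ~10 PF peak quoted on the previous slide, a small back-of-the-envelope estimate (assumptions: a 3.8 GHz clock within the stated 3.7-4 GHz range, and exactly 300,000 cores):

    ```cpp
    // Back-of-the-envelope peak-performance estimate for Blue Waters.
    // The 3.8 GHz clock is an assumed value within the 3.7-4 GHz range on the slide.
    #include <cstdio>

    int main() {
        const double flops_per_cycle = 8.0;     // 2 VSX units x 4 FLOPs/cycle each
        const double clock_ghz       = 3.8;     // assumed POWER7 clock
        const int    cores_per_chip  = 8;
        const int    chips_per_mcm   = 4;
        const long   total_cores     = 300000;  // "300,000+ cores" from the architecture slide

        double gflops_per_core = flops_per_cycle * clock_ghz;           // ~30 GFLOP/s
        double gflops_per_chip = gflops_per_core * cores_per_chip;      // ~243 GFLOP/s
        double tflops_per_mcm  = gflops_per_chip * chips_per_mcm / 1e3; // ~1 TFLOP/s
        double system_pflops   = gflops_per_core * total_cores / 1e6;   // ~9 PFLOP/s

        std::printf("per core: %.1f GF/s, per chip: %.1f GF/s, per MCM: %.2f TF/s, system: %.1f PF/s\n",
                    gflops_per_core, gflops_per_chip, tflops_per_mcm, system_pflops);
    }
    ```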

  7. Hub Chip Module
    • Connects 8 QCMs via L-local links (copper)
      − 24 GB/s
    • Connects 4 P7-IH drawers via L-remote links (optical)
      − 6 GB/s
    • Connects up to 512 SuperNodes via D links (optical)
      − 10 GB/s

  8. Availability
    • NCSA has the BlueDrop machine
      − Linux
      − IBM 780 (MR) POWER7, 3.8 GHz
      − Login node: 2x8-core processors
      − Compute node: 4x8 cores in 2 enclosures
    • BlueBioU
      − Linux
      − 18 IBM 750 (HV32) nodes, 3.55 GHz
      − InfiniBand 4x DDR (Galaxy)

  9. NAMD on BW
    • Use SMT=4 effectively
    • Use Power7 effectively
      − Shared-memory topology
      − Prefetch
      − Loop unrolling
      − SIMD VSX
    • Use Torrent effectively
      − LAPI/XMI

  10. Petascale Scalability Concerns
    • Centralized load balancer - solved
    • I/O
      − Unscalable file formats - solved
      − Input read at startup - solved
      − Sequential output - in progress
    • Fine-grain overhead - in progress
    • Non-bonded multicasts - being studied
    • Particle Mesh Ewald
      − Largest grid target <= 1024
      − Communication overhead primary issue (sketched below)
      − Considering Multilevel Summation as an alternative
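
    A rough sketch of why the PME grid bounds scalability (only the 1024 grid target comes from the slide; slab and pencil decompositions are standard 3D-FFT strategies, not NAMD-specific figures). Because the FFT transposes are all-to-all exchanges, communication rather than arithmetic dominates at several hundred thousand cores, which motivates looking at multilevel summation.

    ```cpp
    // Rough sketch of the parallelism available in the PME 3D FFT.
    // Only the 1024 grid target is taken from the slide; the decomposition
    // limits below are generic FFT properties, not measured NAMD numbers.
    #include <cstdio>

    int main() {
        const long grid = 1024;                  // largest PME grid target per dimension

        long slab_limit   = grid;                // 1D (slab) decomposition: one plane per PE
        long pencil_limit = grid * grid;         // 2D (pencil) decomposition: one column per PE
        long cores        = 300000;              // Blue Waters target core count

        std::printf("slab decomposition:   at most %ld PEs can hold FFT data\n", slab_limit);
        std::printf("pencil decomposition: at most %ld PEs can hold FFT data\n", pencil_limit);
        std::printf("system cores:         %ld\n", cores);
        // Even with pencils, each transpose step is an all-to-all exchange,
        // so communication overhead, not FLOPs, dominates PME at this scale.
    }
    ```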

  11. NAMD and SMT=4
    • P7 hardware threads are prioritized
      − 0, 1 highest
      − 2, 3 lowest
    • The Charm++ runtime measures processor performance
      − The load balancer operates accordingly (illustrated below)
    • NAMD with SMT=4 is 35% faster than with SMT=1
      − No new code required!
    • At the limit it requires 4x more decomposition
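
    A minimal illustration of how measured per-thread speed can drive placement (a hypothetical greedy scheme, not the actual Charm++ load-balancer code): object loads are scaled by each PE's measured speed, so the slower SMT threads 2-3 receive proportionally less work.

    ```cpp
    // Hypothetical greedy load balancer that accounts for measured PE speeds,
    // e.g. fast SMT threads 0-1 vs. slow threads 2-3 on POWER7.
    // Illustrative sketch only; the speed and load values are made up.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    int main() {
        std::vector<double> pe_speed = {1.0, 1.0, 0.6, 0.6};      // measured relative speeds (assumed)
        std::vector<double> obj_load = {4, 3, 3, 2, 2, 2, 1, 1};  // measured per-object work (assumed)

        // Min-heap of (predicted finish time, PE id).
        using Entry = std::pair<double, int>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (int p = 0; p < (int)pe_speed.size(); ++p) heap.push({0.0, p});

        // Greedy: place the largest remaining object on the PE that would finish soonest,
        // scaling its cost by that PE's measured speed.
        std::sort(obj_load.rbegin(), obj_load.rend());
        std::vector<double> finish(pe_speed.size(), 0.0);
        for (double load : obj_load) {
            auto [t, p] = heap.top(); heap.pop();
            finish[p] = t + load / pe_speed[p];
            heap.push({finish[p], p});
        }
        for (size_t p = 0; p < finish.size(); ++p)
            std::printf("PE %zu (speed %.1f): predicted time %.2f\n", p, pe_speed[p], finish[p]);
    }
    ```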

  12. NAMD on Power7 HV32 (AIX)
    [Chart: relative parallel efficiency of NAMD ApoA1 on Power7 HV32 (AIX) for 1-32 cores at SMT=1, 2, and 4]

  13. SIMD -> VSX
    • VSX adds double-precision support to VMX
    • SSE2 already in use in 2 NAMD functions
    • Translate SSE to VSX (example below)
    • Add VSX support to MD-SIMD
    • MD-SIMD implementation of the nonbonded MD benchmark available from Kunzman
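
    For context, a sketch of what an SSE-to-VSX translation of a double-precision kernel can look like. This is illustrative only: the kernel is not NAMD's inner loop, and the intrinsic spellings (vec_splats, vec_madd on __vector double) follow GCC/XL conventions and require compiling with VSX enabled (e.g. -mvsx).

    ```cpp
    // Illustrative VSX version of a tiny double-precision kernel: y[i] += s * x[i].
    // Not NAMD code; intrinsic availability varies by compiler and target.
    #include <altivec.h>
    #include <cstddef>

    void axpy_scalar(double s, const double* x, double* y, size_t n) {
        for (size_t i = 0; i < n; ++i) y[i] += s * x[i];
    }

    void axpy_vsx(double s, const double* x, double* y, size_t n) {
        // Assumes x and y are 16-byte aligned and n is even; a real kernel
        // would handle the remainder and unaligned cases.
        __vector double vs = vec_splats(s);
        for (size_t i = 0; i < n; i += 2) {
            __vector double vx = *(const __vector double*)(x + i); // 2 doubles per VSX register
            __vector double vy = *(__vector double*)(y + i);
            vy = vec_madd(vs, vx, vy);                             // fused multiply-add
            *(__vector double*)(y + i) = vy;
        }
    }
    ```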

  14. MD-SIMD performance

  15. Support for Large Molecular Systems
    • New compressed PSF file format
      − Supports >100 million atoms
      − Supports parallel startup
      − Supports the MEM_OPT molecule representation
    • The MEM_OPT molecule format reduces data replication through signatures (sketched below)
    • Parallelize reading of input at startup
      − Cannot support the legacy PDB format
      − Use binary coordinates format
    • Changes in VMD courtesy of John Stone
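
    A rough sketch of the signature idea (hypothetical data structures, not the actual MEM_OPT code): atoms with identical bonded topology and parameters share one signature record, so per-atom storage stays small even at 100 million atoms in a highly repetitive system such as a water box.

    ```cpp
    // Hypothetical illustration of signature-based topology compression.
    // Many atoms (e.g. every water oxygen) share identical bonded structure, so we
    // store that structure once and keep only a small index per atom.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <tuple>
    #include <vector>

    struct Signature {                    // bonded topology relative to the owning atom
        std::vector<int> bondOffsets;     // offsets to bonded partners
        std::vector<int> bondTypes;       // force-field parameter indices
        bool operator<(const Signature& o) const {
            return std::tie(bondOffsets, bondTypes) < std::tie(o.bondOffsets, o.bondTypes);
        }
    };

    struct CompactAtom {                  // per-atom record stays tiny
        int32_t sigIndex;                 // index into the shared signature table
    };

    int main() {
        std::map<Signature, int> sigTable;
        std::vector<Signature> signatures;
        std::vector<CompactAtom> atoms;

        // Pretend every water molecule produces the same two signatures (O and H).
        Signature waterO{{+1, +2}, {7, 7}}, waterH{{-1}, {7}};
        for (long i = 0; i < 1000000; ++i) {       // 1M waters -> 3M atoms, 2 signatures
            for (const Signature& s : {waterO, waterH, waterH}) {
                auto it = sigTable.find(s);
                if (it == sigTable.end()) {
                    it = sigTable.emplace(s, (int)signatures.size()).first;
                    signatures.push_back(s);
                }
                atoms.push_back({it->second});
            }
        }
        std::printf("%zu atoms share %zu signatures\n", atoms.size(), signatures.size());
    }
    ```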

  16. Parallel Startup

    Table 1: Parallel startup for 10-million-atom water on Blue Gene/P
      Nodes   Start (sec)   Memory (MB)
      1       N/A           4484.55*
      8       446.499       865.117
      16      424.765       456.487
      32      420.492       258.023
      64      435.366       235.949
      128     227.018       222.219
      256     122.296       218.285
      512     73.2571       218.449
      1024    76.1005       214.758

    Table 2: Parallel startup for the 116-million-atom BAR domain on Abe
      Nodes   Start (sec)   Memory (MB)
      1       3075.6*       75457.7*
      50      340.361       1008
      80      322.165       908
      120     323.561       710

  17. Fine grain overhead
    • End-user targets are all fixed-size problems
    • Strong-scaling performance dominates
      − Maximize the number of nanoseconds/day of simulation
    • Non-bonded cutoff distance determines patch size
      − A patch can be subdivided along the x, y, z dimensions
      − 2-away X, 2-away XY, 2-away XYZ
      − Theoretically k-away... (counted in the sketch below)
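
    To make the fine-grain trade-off concrete, a small sketch (assuming uniform cubic subdivision, not NAMD's exact bookkeeping) that counts pair computes as each cutoff-sized cell is split k ways per dimension. Finer decomposition exposes more parallelism, but the object count grows quickly, which is the overhead this and the following slides address.

    ```cpp
    // Count interacting patch pairs per original cutoff-sized cell when each cell
    // is split k ways along each dimension (k=1: 1-away, k=2: 2-away XYZ, ...).
    // A sketch under the assumption of uniform cubic subdivision.
    #include <cstdio>

    int main() {
        for (int k = 1; k <= 3; ++k) {
            long sub_patches = (long)k * k * k;          // sub-patches per original cell
            long neighbors   = (2L * k + 1) * (2L * k + 1) * (2L * k + 1);
            // Each sub-patch interacts with `neighbors` patches including itself;
            // counting each pair once gives (neighbors + 1) / 2 computes per sub-patch.
            long pairs_per_sub  = (neighbors + 1) / 2;
            long pairs_per_cell = sub_patches * pairs_per_sub;  // per original cell volume
            std::printf("k=%d: %ld sub-patches, %ld pair computes per sub-patch, %ld per original cell\n",
                        k, sub_patches, pairs_per_sub, pairs_per_cell);
        }
        // k=1 -> 14 computes per cell; k=2 -> 8*63 = 504; k=3 -> 27*172 = 4644.
    }
    ```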

  18. 1-away vs 2-away X

  19. Fine-grain overhead reduction
    • Distant computes have little or no interaction
      − Long diagonal opposites of 2-away XYZ are mostly outside the cutoff
    • Optimizations
      − Don't migrate tiny computes
      − Sort pairlists to truncate computation (sketched below)
      − Increase the margin and do not create redundant compute objects
    • Slight (<5%) reduction in step time
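
    As an illustration of the pairlist-sorting optimization (a simplified sketch, not NAMD's pairlist code): if pair entries are sorted by distance, the force loop can stop at the first out-of-range entry instead of testing every pair.

    ```cpp
    // Simplified illustration of sorting a pairlist by distance so the inner loop
    // can stop at the first pair beyond the cutoff. Not NAMD's actual pairlist code.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct PairEntry { int i, j; double dist2; };   // squared distance between atoms i and j

    int main() {
        const double cutoff2 = 12.0 * 12.0;
        std::vector<PairEntry> pairlist = {
            {0, 5, 80.0}, {1, 7, 200.0}, {2, 9, 100.0}, {3, 4, 260.0}, {0, 8, 40.0}
        };

        // Sort once when the pairlist is (re)built.
        std::sort(pairlist.begin(), pairlist.end(),
                  [](const PairEntry& a, const PairEntry& b) { return a.dist2 < b.dist2; });

        // Force loop: no per-pair cutoff branch needed after the first failure.
        int evaluated = 0;
        for (const PairEntry& p : pairlist) {
            if (p.dist2 > cutoff2) break;            // everything after this is also out of range
            ++evaluated;                             // ... compute the nonbonded interaction here
        }
        std::printf("evaluated %d of %zu pairs\n", evaluated, pairlist.size());
    }
    ```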

  20. Future work
    • Integrate parallel output into CVS NAMD
    • Consolidate small compute objects
    • Leverage the native communication API
    • Particle Mesh Ewald: improve or replace
    • Parallel I/O optimization study on multiple platforms
    • High-scaling (>16K cores) study on multiple platforms
