NAMD on BlueWaters
Presented by: Eric Bohm
Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, L. V. Kale
NSF/NCSA Blue Waters Project
Sustained-petaflops system funded by NSF, to be ready in 2011.
− System expected to exceed 300,000 processor cores.
NSF acceptance test: 100 million atom BAR domain simulation using NAMD.
NAMD PRAC: The Computational Microscope
− Systems from 10 to 100 million atoms
A recently submitted PRAC from an independent group wishes to use NAMD
− 1 billion atoms!
NAMD
Molecular dynamics simulation of biological systems.
Uses the Charm++ idea:
− Decompose the computation into a large number of objects
− Have an intelligent runtime system (Charm++) assign objects to processors for dynamic load balancing
Hybrid of spatial and force decomposition (sketched below):
• Spatial decomposition of atoms into cubes (called patches)
• For every pair of interacting patches, create one object for calculating electrostatic interactions
• Recent: Blue Matter, Desmond, etc. use this idea in some form
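A minimal sketch of that hybrid decomposition, assuming a cubic periodic box and using a hypothetical PatchPair type rather than the real NAMD classes: space is divided into cutoff-sized patches, and one compute object is created per pair of neighboring patches; a runtime such as Charm++ would then place those objects for load balancing.

```cpp
// Sketch of hybrid spatial + force decomposition (not the real NAMD code).
// Space is split into cutoff-sized cubes ("patches"); one compute object is
// created per pair of neighboring patches.
#include <cstdio>
#include <vector>

struct PatchPair { int a, b; };   // hypothetical stand-in for a compute object

int main() {
    const double boxLength = 100.0;   // Angstroms, assumed
    const double cutoff    = 12.0;    // non-bonded cutoff, Angstroms
    const int    n         = static_cast<int>(boxLength / cutoff);  // patches per dimension

    auto id = [n](int x, int y, int z) { return (x * n + y) * n + z; };

    std::vector<PatchPair> computes;
    // 1-away decomposition: each patch interacts with itself and its 26 neighbors;
    // emit each unordered pair exactly once.
    for (int x = 0; x < n; ++x)
      for (int y = 0; y < n; ++y)
        for (int z = 0; z < n; ++z)
          for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
              for (int dz = -1; dz <= 1; ++dz) {
                int nx = (x + dx + n) % n, ny = (y + dy + n) % n, nz = (z + dz + n) % n;
                int a = id(x, y, z), b = id(nx, ny, nz);
                if (a <= b) computes.push_back({a, b});
              }

    std::printf("%d patches, %zu pair-compute objects\n", n * n * n, computes.size());
    return 0;
}
```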
BW Challenges and Opportunities
Support systems of >= 100 million atoms
Meet performance requirements for the 100 million atom system
Scale to over 300,000 cores
Power7 hardware
− PPC architecture
− Wide node: at least 32 cores with 128 hardware (SMT) threads
Blue Waters Torrent interconnect
Doing research under NDA
BlueWaters Architecture
IBM Power7: 8 cores/chip, 4 chips/MCM, 8 MCMs/drawer, 4 drawers/SuperNode, 1024 cores/SuperNode
Peak performance ~10 PF, sustained ~1 PF
300,000+ cores
1.2+ PB memory
18+ PB disk
Linux OS
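As a quick consistency check of those building blocks, the per-SuperNode core count and the rough number of SuperNodes follow directly from the slide's numbers (a sketch; 300,000 is the slide's lower bound, not an official machine spec):

```cpp
// Worked arithmetic for the Blue Waters building blocks listed above.
#include <cstdio>

int main() {
    const int coresPerChip    = 8;
    const int chipsPerMCM     = 4;
    const int mcmsPerDrawer   = 8;
    const int drawersPerSNode = 4;

    const int coresPerSuperNode = coresPerChip * chipsPerMCM * mcmsPerDrawer * drawersPerSNode;
    std::printf("cores per SuperNode: %d\n", coresPerSuperNode);   // 8*4*8*4 = 1024, matching the slide

    const int totalCores = 300000;                                 // "300,000+ cores" from the slide
    const int superNodes = (totalCores + coresPerSuperNode - 1) / coresPerSuperNode;
    std::printf("SuperNodes needed for %d+ cores: about %d\n", totalCores, superNodes);  // ~293
    return 0;
}
```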
Power7
64-bit PowerPC, 3.7-4 GHz, 4-way SMT
Up to 8 FLOPs/cycle
Execution units: 2 fixed point, 2 load/store, 1 VMX, 1 decimal FP, 2 VSX
− 4 FLOPs/cycle per VSX unit
6-wide in-order dispatch, 8-wide out-of-order issue
128-byte cache lines
32 KB L1, 256 KB L2, 4 MB local region of the shared 32 MB L3 cache
Prefetch of up to 12 data streams
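A worked check of the peak-rate arithmetic implied by these numbers: 2 VSX units at 4 FLOPs/cycle give the 8 FLOPs/cycle figure, and at roughly 4 GHz across 300,000 cores that lands near the ~10 PF peak quoted earlier (values below are the slide's approximate numbers, not official specs):

```cpp
// Peak floating-point arithmetic implied by the Power7 numbers above.
#include <cstdio>

int main() {
    const double flopsPerCyclePerVSX = 4.0;   // from the slide: 4 FLOPs/cycle per VSX unit
    const double vsxUnitsPerCore     = 2.0;   // 2 VSX units per core
    const double flopsPerCycle       = flopsPerCyclePerVSX * vsxUnitsPerCore;  // 8 FLOPs/cycle

    const double clockHz   = 4.0e9;           // upper end of the 3.7-4 GHz range
    const double cores     = 300000.0;        // "300,000+ cores"
    const double peakFlops = flopsPerCycle * clockHz * cores;

    std::printf("FLOPs/cycle/core: %.0f\n", flopsPerCycle);
    std::printf("Peak: %.2f PF\n", peakFlops / 1e15);   // ~9.6 PF, consistent with the ~10 PF peak
    return 0;
}
```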
Hub Chip Module
Connects 8 QCMs via L-local links (copper)
− 24 GB/s
Connects 4 P7-IH drawers via L-remote links (optical)
− 6 GB/s
Connects up to 512 SuperNodes via D links (optical)
− 10 GB/s
Availability
NCSA has the BlueDrop machine
− Linux
− IBM 780 (MR) POWER7 at 3.8 GHz
− Login node: 2x8-core processors
− Compute node: 4x8 cores in 2 enclosures
BlueBioU
− Linux
− 18 IBM 750 (HV32) nodes at 3.55 GHz
− InfiniBand 4x DDR (Galaxy)
NAMD on BW
Use SMT=4 effectively
Use Power7 effectively
− Shared-memory topology
− Prefetch
− Loop unrolling
− SIMD VSX
Use Torrent effectively
− LAPI/XMI
Petascale Scalability Concerns
Centralized load balancer – solved
I/O
− Unscalable file formats – solved
− Input read at startup – solved
− Sequential output – in progress
Fine-grain overhead – in progress
Non-bonded multicasts – being studied
Particle Mesh Ewald (rough grid arithmetic below)
− Largest grid target <= 1024
− Communication overhead is the primary issue
− Considering Multilevel Summation as an alternative
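Rough arithmetic for why PME communication dominates at scale, assuming the largest target grid and double-precision grid values (the decomposition details are not from the slides): the charge grid must be spread, FFT-transposed, and gathered every step, so even a few gigabytes of grid data becomes a recurring per-step communication cost.

```cpp
// Back-of-envelope PME grid size for the largest target grid mentioned above.
#include <cstdio>

int main() {
    const long long gridDim = 1024;                        // largest grid target per dimension
    const long long points  = gridDim * gridDim * gridDim; // total grid points
    const double    bytes   = static_cast<double>(points) * 8.0;  // assuming 8-byte grid values

    std::printf("grid points: %lld (~1.07e9)\n", points);
    std::printf("grid size:   %.1f GB moved through FFT transposes each step\n", bytes / 1e9);  // ~8.6 GB
    return 0;
}
```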
NAMD and SMT=4
P7 hardware threads are prioritized
− 0, 1 highest
− 2, 3 lowest
The Charm++ runtime measures per-thread processor performance
− The load balancer operates accordingly
NAMD on SMT=4 is 35% faster than on SMT=1
− No new code required!
At the limit it requires 4x more decomposition
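A minimal sketch of the measurement-based idea, not the actual Charm++ load balancer: each hardware thread's observed speed feeds a greedy assignment, so the lower-priority SMT threads (2, 3) naturally receive proportionally less work without any NAMD code changes.

```cpp
// Greedy measurement-based load balancing sketch (illustrative only; the real
// Charm++ balancers are far more sophisticated).
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Measured relative speeds of the 4 SMT threads on one core: threads 0 and 1
    // run at higher hardware priority than threads 2 and 3 (values are made up).
    std::vector<double> speed = {1.0, 1.0, 0.6, 0.6};

    // Object loads measured by the runtime in previous steps (arbitrary units).
    std::vector<double> objLoad = {5, 4, 4, 3, 3, 2, 2, 1, 1, 1};
    std::sort(objLoad.rbegin(), objLoad.rend());   // largest first

    using Entry = std::pair<double, int>;          // (predicted finish time, thread id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;  // min-heap
    for (int t = 0; t < (int)speed.size(); ++t) pq.push({0.0, t});

    std::vector<double> assigned(speed.size(), 0.0);
    for (double w : objLoad) {
        int t = pq.top().second;                   // thread predicted to finish earliest
        pq.pop();
        assigned[t] += w;
        pq.push({assigned[t] / speed[t], t});      // finish time scaled by measured speed
    }
    for (int t = 0; t < (int)speed.size(); ++t)
        std::printf("thread %d: work %.1f, predicted time %.2f\n",
                    t, assigned[t], assigned[t] / speed[t]);
    return 0;
}
```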
NAMD on Power7 HV32 (AIX)
[Figure: relative parallel efficiency of NAMD ApoA1 on Power7 HV32 (AIX); efficiency axis from 0 to 1.8; series for 1, 2, 4, 8, 16, and 32 cores, with SMT=2 and SMT=4 variants for the 1-core and 8-core runs.]
SIMD -> VSX
VSX adds double-precision support to VMX
SSE2 already in use in 2 NAMD functions
Translate SSE to VSX (sketched below)
Add VSX support to MD-SIMD
− MD-SIMD implementation of the nonbonded MD benchmark available from Kunzman
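A hedged illustration of the SSE-to-VSX translation for a simple double-precision kernel. It assumes a compiler exposing <emmintrin.h> on x86 and <altivec.h> with VSX vector-double support on Power7; the real NAMD and MD-SIMD kernels are considerably more involved.

```cpp
// Same axpy-style kernel written with SSE2 intrinsics (x86) and with VSX
// vector doubles (Power7). Illustrative sketch only.
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>
// y[i] += a * x[i], two doubles per iteration using SSE2.
void axpy_sse2(double a, const double* x, double* y, std::size_t n) {
    __m128d va = _mm_set1_pd(a);
    for (std::size_t i = 0; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
    }
    for (std::size_t i = n & ~std::size_t(1); i < n; ++i) y[i] += a * x[i];  // scalar tail
}
#endif

#if defined(__VSX__)
#include <altivec.h>
// The VSX translation: VMX (AltiVec) alone has no double-precision vectors;
// VSX adds them, so the SSE2 pattern maps onto __vector double operations.
void axpy_vsx(double a, const double* x, double* y, std::size_t n) {
    __vector double va = vec_splats(a);
    for (std::size_t i = 0; i + 2 <= n; i += 2) {
        __vector double vx = vec_xl(0, x + i);      // unaligned VSX load
        __vector double vy = vec_xl(0, y + i);
        vec_xst(vec_madd(va, vx, vy), 0, y + i);    // fused multiply-add, then store
    }
    for (std::size_t i = n & ~std::size_t(1); i < n; ++i) y[i] += a * x[i];  // scalar tail
}
#endif
```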
MD-SIMD performance
Support for Large Molecular Systems
New compressed PSF file format
− Supports >100 million atoms
− Supports parallel startup
− Supports the MEM_OPT molecule representation
The MEM_OPT molecule format reduces data replication through signatures (sketched below)
Parallelized reading of input at startup
− Cannot support the legacy PDB format
− Use the binary coordinates format
Changes in VMD courtesy of John Stone
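A minimal sketch of the signature idea behind MEM_OPT, with hypothetical type names rather than the actual NAMD structures: atoms that share the same local bonded topology reference one shared signature, so per-atom storage shrinks to an index instead of replicated bond and angle lists.

```cpp
// Signature-based reduction of per-atom data replication (illustrative sketch;
// type names here are hypothetical, not the actual NAMD MEM_OPT structures).
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

// Bonded topology "shape" relative to an atom, expressed as index offsets.
struct Signature {
    std::vector<int> bondOffsets;      // e.g. {+1, +2}: bonded to the next two atoms
    std::vector<int> angleOffsets;     // offsets of angle partners, flattened
    bool operator<(const Signature& o) const {
        return std::tie(bondOffsets, angleOffsets) < std::tie(o.bondOffsets, o.angleOffsets);
    }
};

// Per-atom record keeps only a 32-bit signature index instead of full lists.
struct AtomRecord {
    float    charge;
    uint16_t typeId;
    uint32_t signatureId;
};

int main() {
    std::map<Signature, uint32_t> signatureIds;   // dedup table built at load time
    std::vector<Signature>        signatures;
    std::vector<AtomRecord>       atoms;

    auto intern = [&](const Signature& s) -> uint32_t {
        auto it = signatureIds.find(s);
        if (it != signatureIds.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(signatures.size());
        signatures.push_back(s);
        signatureIds.emplace(s, id);
        return id;
    };

    // In a huge water box, every oxygen shares one signature and every hydrogen
    // shares another, so the signature table stays tiny.
    Signature oxygen{{+1, +2}, {}};   // O bonded to its two hydrogens
    Signature hydrogen{{}, {}};       // H: bonds owned by the oxygen record
    for (int w = 0; w < 1000; ++w) {  // 1000 waters -> 3000 atoms, 2 signatures
        atoms.push_back({-0.834f, 0, intern(oxygen)});
        atoms.push_back({0.417f, 1, intern(hydrogen)});
        atoms.push_back({0.417f, 1, intern(hydrogen)});
    }
    std::printf("%zu atoms share %zu signatures\n", atoms.size(), signatures.size());
    return 0;
}
```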
Parallel Startup
Table 1: Parallel startup for 10 million atom water system on Blue Gene/P
Nodes   Start (sec)   Memory (MB)
1       NA            4484.55*
8       446.499       865.117
16      424.765       456.487
32      420.492       258.023
64      435.366       235.949
128     227.018       222.219
256     122.296       218.285
512     73.2571       218.449
1024    76.1005       214.758

Table 2: Parallel startup for 116 million atom BAR domain on Abe
Nodes   Start (sec)   Memory (MB)
1       3075.6*       75457.7*
50      340.361       1008
80      322.165       908
120     323.561       710
Fine-grain overhead
End-user targets are all fixed-size problems
Strong-scaling performance dominates
− Maximize the number of nanoseconds/day of simulation
Non-bonded cutoff distance determines patch size
− Patches can be subdivided along the x, y, z dimensions: 2-away X, 2-away XY, 2-away XYZ (worked out below)
− Theoretically k-away...
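A worked example of the k-away trade-off referenced above (a sketch; the margin value is assumed): splitting patches along more dimensions shrinks each patch but multiplies the number of interacting neighbors per patch, which is exactly where the fine-grain overhead comes from.

```cpp
// Patch-count / neighbor-count arithmetic for the k-away decompositions above.
#include <cstdio>

int main() {
    const double cutoff = 12.0, margin = 0.5;     // Angstroms (margin value assumed)
    const double base   = cutoff + margin;        // 1-away patch edge length

    struct Mode { const char* name; int splitX, splitY, splitZ; };
    const Mode modes[] = {
        {"1-away",     1, 1, 1},
        {"2-away X",   2, 1, 1},
        {"2-away XY",  2, 2, 1},
        {"2-away XYZ", 2, 2, 2},
    };

    for (const Mode& m : modes) {
        // Splitting a dimension halves the patch edge there, so a patch must
        // interact with patches up to 2 away in that dimension.
        int neighbors = (2 * m.splitX + 1) * (2 * m.splitY + 1) * (2 * m.splitZ + 1);
        int patchesPerOriginal = m.splitX * m.splitY * m.splitZ;
        std::printf("%-11s patch edge (x): %.2f A, patches x%d, interacting neighbors incl. self: %d\n",
                    m.name, base / m.splitX, patchesPerOriginal, neighbors);
    }
    return 0;   // prints 27, 45, 75, and 125 neighbors respectively
}
```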
1-away vs 2-away X
Fine-grain overhead reduction
Distant computes have little or no interaction
− Long-diagonal opposites in 2-away XYZ are mostly outside the cutoff
Optimizations
− Don't migrate tiny computes
− Sort pairlists to truncate computation (sketched below)
− Increase margin and do not create redundant compute objects
Slight (<5%) reduction in step time
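A minimal sketch of the pairlist-sorting optimization, not the NAMD implementation: with entries ordered by distance, the force loop can stop at the first pair beyond the cutoff rather than testing every candidate.

```cpp
// Sort a pairlist by distance so the force loop can break at the cutoff
// instead of testing every candidate pair (illustrative sketch).
#include <algorithm>
#include <cstdio>
#include <vector>

struct PairEntry { int i, j; double dist2; };   // squared distance, precomputed

double interact(const PairEntry& p) {
    // Stand-in for the real non-bonded kernel: a cheap inverse-square term.
    return 1.0 / p.dist2;
}

int main() {
    const double cutoff  = 12.0;
    const double cutoff2 = cutoff * cutoff;

    // Candidate pairs; many of the distant ones lie outside the cutoff.
    std::vector<PairEntry> pairlist;
    for (int k = 0; k < 1000; ++k)
        pairlist.push_back({k, k + 1, 1.0 + 2.0 * (k % 100)});   // synthetic squared distances

    std::sort(pairlist.begin(), pairlist.end(),
              [](const PairEntry& a, const PairEntry& b) { return a.dist2 < b.dist2; });

    double energy = 0.0;
    std::size_t evaluated = 0;
    for (const PairEntry& p : pairlist) {
        if (p.dist2 > cutoff2) break;    // everything after this is also outside the cutoff
        energy += interact(p);
        ++evaluated;
    }
    std::printf("evaluated %zu of %zu pairs, energy %.3f\n", evaluated, pairlist.size(), energy);
    return 0;
}
```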
Future work
Integrate parallel output into CVS NAMD
Consolidate small compute objects
Leverage the native communication API
Improve or replace Particle Mesh Ewald
Parallel I/O optimization study on multiple platforms
High-scale (>16k core) scaling study on multiple platforms