NAMD on BlueWaters
Presented by: Eric Bohm
Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, L. V. Kale
NSF/NCSA Blue Waters Project
Sustained-petaflops system funded by NSF, to be ready in 2011.
− System expected to exceed 300,000 processor cores.
NSF acceptance test: 100 million atom BAR domain simulation using NAMD.
NAMD PRAC: The Computational Microscope
− Systems from 10 to 100 million atoms
A recently submitted PRAC from an independent group wishes to use NAMD
− 1 billion atoms!
NAMD
Molecular dynamics simulation of biological systems.
Uses the Charm++ idea:
− Decompose the computation into a large number of objects
− Have an intelligent runtime system (Charm++) assign objects to processors for dynamic load balancing
Hybrid of spatial and force decomposition (sketched below):
• Spatial decomposition of atoms into cubes (called patches)
• For every pair of interacting patches, create one object for calculating electrostatic interactions
• Recent: Blue Matter, Desmond, etc. use this idea in some form
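A minimal sketch of that hybrid decomposition, assuming a cubic periodic box and using a hypothetical PatchPair type rather than the real NAMD classes: space is divided into cutoff-sized patches, and one compute object is created per pair of neighboring patches; a runtime such as Charm++ would then place those objects for load balancing.

```cpp
// Sketch of hybrid spatial + force decomposition (not the real NAMD code).
// Space is split into cutoff-sized cubes ("patches"); one compute object is
// created per pair of neighboring patches.
#include <cstdio>
#include <vector>

struct PatchPair { int a, b; };   // hypothetical stand-in for a compute object

int main() {
    const double boxLength = 100.0;   // Angstroms, assumed
    const double cutoff    = 12.0;    // non-bonded cutoff, Angstroms
    const int    n         = static_cast<int>(boxLength / cutoff);  // patches per dimension

    auto id = [n](int x, int y, int z) { return (x * n + y) * n + z; };

    std::vector<PatchPair> computes;
    // 1-away decomposition: each patch interacts with itself and its 26 neighbors;
    // emit each unordered pair exactly once.
    for (int x = 0; x < n; ++x)
      for (int y = 0; y < n; ++y)
        for (int z = 0; z < n; ++z)
          for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
              for (int dz = -1; dz <= 1; ++dz) {
                int nx = (x + dx + n) % n, ny = (y + dy + n) % n, nz = (z + dz + n) % n;
                int a = id(x, y, z), b = id(nx, ny, nz);
                if (a <= b) computes.push_back({a, b});
              }

    std::printf("%d patches, %zu pair-compute objects\n", n * n * n, computes.size());
    return 0;
}
```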
BW Challenges and Opportunities
Support systems of >= 100 million atoms
Meet performance requirements for the 100 million atom system
Scale to over 300,000 cores
Power7 hardware
− PPC architecture
− Wide node: at least 32 cores with 128 hardware (SMT) threads
Blue Waters Torrent interconnect
Doing research under NDA
BlueWaters Architecture
IBM Power7: 8 cores/chip, 4 chips/MCM, 8 MCMs/drawer, 4 drawers/SuperNode, 1024 cores/SuperNode
Peak performance ~10 PF, sustained ~1 PF
300,000+ cores
1.2+ PB memory
18+ PB disk
Linux OS
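As a quick consistency check of those building blocks, the per-SuperNode core count and the rough number of SuperNodes follow directly from the slide's numbers (a sketch; 300,000 is the slide's lower bound, not an official machine spec):

```cpp
// Worked arithmetic for the Blue Waters building blocks listed above.
#include <cstdio>

int main() {
    const int coresPerChip    = 8;
    const int chipsPerMCM     = 4;
    const int mcmsPerDrawer   = 8;
    const int drawersPerSNode = 4;

    const int coresPerSuperNode = coresPerChip * chipsPerMCM * mcmsPerDrawer * drawersPerSNode;
    std::printf("cores per SuperNode: %d\n", coresPerSuperNode);   // 8*4*8*4 = 1024, matching the slide

    const int totalCores = 300000;                                 // "300,000+ cores" from the slide
    const int superNodes = (totalCores + coresPerSuperNode - 1) / coresPerSuperNode;
    std::printf("SuperNodes needed for %d+ cores: about %d\n", totalCores, superNodes);  // ~293
    return 0;
}
```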
Power7
64-bit PowerPC, 3.7-4 GHz, 4-way SMT
Up to 8 FLOPs/cycle
Execution units: 2 fixed point, 2 load/store, 1 VMX, 1 decimal FP, 2 VSX
− 4 FLOPs/cycle per VSX unit
6-wide in-order dispatch, 8-wide out-of-order issue
128-byte cache lines
32 KB L1, 256 KB L2, 4 MB local region of the shared 32 MB L3 cache
Prefetch of up to 12 data streams
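A worked check of the peak-rate arithmetic implied by these numbers: 2 VSX units at 4 FLOPs/cycle give the 8 FLOPs/cycle figure, and at roughly 4 GHz across 300,000 cores that lands near the ~10 PF peak quoted earlier (values below are the slide's approximate numbers, not official specs):

```cpp
// Peak floating-point arithmetic implied by the Power7 numbers above.
#include <cstdio>

int main() {
    const double flopsPerCyclePerVSX = 4.0;   // from the slide: 4 FLOPs/cycle per VSX unit
    const double vsxUnitsPerCore     = 2.0;   // 2 VSX units per core
    const double flopsPerCycle       = flopsPerCyclePerVSX * vsxUnitsPerCore;  // 8 FLOPs/cycle

    const double clockHz   = 4.0e9;           // upper end of the 3.7-4 GHz range
    const double cores     = 300000.0;        // "300,000+ cores"
    const double peakFlops = flopsPerCycle * clockHz * cores;

    std::printf("FLOPs/cycle/core: %.0f\n", flopsPerCycle);
    std::printf("Peak: %.2f PF\n", peakFlops / 1e15);   // ~9.6 PF, consistent with the ~10 PF peak
    return 0;
}
```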
Hub Chip Module
Connects 8 QCMs via L-local links (copper)
− 24 GB/s
Connects 4 P7-IH drawers via L-remote links (optical)
− 6 GB/s
Connects up to 512 SuperNodes via D links (optical)
− 10 GB/s
Availability
NCSA has the BlueDrop machine
− Linux
− IBM 780 (MR) POWER7 at 3.8 GHz
− Login node: 2x8-core processors
− Compute node: 4x8 cores in 2 enclosures
BlueBioU
− Linux
− 18 IBM 750 (HV32) nodes at 3.55 GHz
− InfiniBand 4x DDR (Galaxy)
NAMD on BW
Use SMT=4 effectively
Use Power7 effectively
− Shared-memory topology
− Prefetch
− Loop unrolling
− SIMD VSX
Use Torrent effectively
− LAPI/XMI
Petascale Scalability Concerns
Centralized load balancer – solved
I/O
− Unscalable file formats – solved
− Input read at startup – solved
− Sequential output – in progress
Fine-grain overhead – in progress
Non-bonded multicasts – being studied
Particle Mesh Ewald (rough grid arithmetic below)
− Largest grid target <= 1024
− Communication overhead is the primary issue
− Considering Multilevel Summation as an alternative
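Rough arithmetic for why PME communication dominates at scale, assuming the largest target grid and double-precision grid values (the decomposition details are not from the slides): the charge grid must be spread, FFT-transposed, and gathered every step, so even a few gigabytes of grid data becomes a recurring per-step communication cost.

```cpp
// Back-of-envelope PME grid size for the largest target grid mentioned above.
#include <cstdio>

int main() {
    const long long gridDim = 1024;                        // largest grid target per dimension
    const long long points  = gridDim * gridDim * gridDim; // total grid points
    const double    bytes   = static_cast<double>(points) * 8.0;  // assuming 8-byte grid values

    std::printf("grid points: %lld (~1.07e9)\n", points);
    std::printf("grid size:   %.1f GB moved through FFT transposes each step\n", bytes / 1e9);  // ~8.6 GB
    return 0;
}
```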
NAMD and SMT=4
P7 hardware threads are prioritized
− 0, 1 highest
− 2, 3 lowest
The Charm++ runtime measures per-thread processor performance
− The load balancer operates accordingly
NAMD on SMT=4 is 35% faster than on SMT=1
− No new code required!
At the limit it requires 4x more decomposition
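A minimal sketch of the measurement-based idea, not the actual Charm++ load balancer: each hardware thread's observed speed feeds a greedy assignment, so the lower-priority SMT threads (2, 3) naturally receive proportionally less work without any NAMD code changes.

```cpp
// Greedy measurement-based load balancing sketch (illustrative only; the real
// Charm++ balancers are far more sophisticated).
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    // Measured relative speeds of the 4 SMT threads on one core: threads 0 and 1
    // run at higher hardware priority than threads 2 and 3 (values are made up).
    std::vector<double> speed = {1.0, 1.0, 0.6, 0.6};

    // Object loads measured by the runtime in previous steps (arbitrary units).
    std::vector<double> objLoad = {5, 4, 4, 3, 3, 2, 2, 1, 1, 1};
    std::sort(objLoad.rbegin(), objLoad.rend());   // largest first

    using Entry = std::pair<double, int>;          // (predicted finish time, thread id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;  // min-heap
    for (int t = 0; t < (int)speed.size(); ++t) pq.push({0.0, t});

    std::vector<double> assigned(speed.size(), 0.0);
    for (double w : objLoad) {
        int t = pq.top().second;                   // thread predicted to finish earliest
        pq.pop();
        assigned[t] += w;
        pq.push({assigned[t] / speed[t], t});      // finish time scaled by measured speed
    }
    for (int t = 0; t < (int)speed.size(); ++t)
        std::printf("thread %d: work %.1f, predicted time %.2f\n",
                    t, assigned[t], assigned[t] / speed[t]);
    return 0;
}
```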
NAMD on Power7 HV32 (AIX)
[Figure: relative parallel efficiency of NAMD ApoA1 on Power7 HV32 (AIX); efficiency axis from 0 to 1.8; series for 1, 2, 4, 8, 16, and 32 cores, with SMT=2 and SMT=4 variants for the 1-core and 8-core runs.]
SIMD -> VSX
VSX adds double-precision support to VMX
SSE2 already in use in 2 NAMD functions
Translate SSE to VSX (sketched below)
Add VSX support to MD-SIMD
− MD-SIMD implementation of the nonbonded MD benchmark available from Kunzman
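A hedged illustration of the SSE-to-VSX translation for a simple double-precision kernel. It assumes a compiler exposing <emmintrin.h> on x86 and <altivec.h> with VSX vector-double support on Power7; the real NAMD and MD-SIMD kernels are considerably more involved.

```cpp
// Same axpy-style kernel written with SSE2 intrinsics (x86) and with VSX
// vector doubles (Power7). Illustrative sketch only.
#include <cstddef>

#if defined(__SSE2__)
#include <emmintrin.h>
// y[i] += a * x[i], two doubles per iteration using SSE2.
void axpy_sse2(double a, const double* x, double* y, std::size_t n) {
    __m128d va = _mm_set1_pd(a);
    for (std::size_t i = 0; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        _mm_storeu_pd(y + i, _mm_add_pd(vy, _mm_mul_pd(va, vx)));
    }
    for (std::size_t i = n & ~std::size_t(1); i < n; ++i) y[i] += a * x[i];  // scalar tail
}
#endif

#if defined(__VSX__)
#include <altivec.h>
// The VSX translation: VMX (AltiVec) alone has no double-precision vectors;
// VSX adds them, so the SSE2 pattern maps onto __vector double operations.
void axpy_vsx(double a, const double* x, double* y, std::size_t n) {
    __vector double va = vec_splats(a);
    for (std::size_t i = 0; i + 2 <= n; i += 2) {
        __vector double vx = vec_xl(0, x + i);      // unaligned VSX load
        __vector double vy = vec_xl(0, y + i);
        vec_xst(vec_madd(va, vx, vy), 0, y + i);    // fused multiply-add, then store
    }
    for (std::size_t i = n & ~std::size_t(1); i < n; ++i) y[i] += a * x[i];  // scalar tail
}
#endif
```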
MD-SIMD performance
Support for Large Molecular Systems
New compressed PSF file format
− Supports >100 million atoms
− Supports parallel startup
− Supports the MEM_OPT molecule representation
The MEM_OPT molecule format reduces data replication through signatures (sketched below)
Parallelized reading of input at startup
− Cannot support the legacy PDB format
− Use the binary coordinates format
Changes in VMD courtesy of John Stone
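A minimal sketch of the signature idea behind MEM_OPT, with hypothetical type names rather than the actual NAMD structures: atoms that share the same local bonded topology reference one shared signature, so per-atom storage shrinks to an index instead of replicated bond and angle lists.

```cpp
// Signature-based reduction of per-atom data replication (illustrative sketch;
// type names here are hypothetical, not the actual NAMD MEM_OPT structures).
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

// Bonded topology "shape" relative to an atom, expressed as index offsets.
struct Signature {
    std::vector<int> bondOffsets;      // e.g. {+1, +2}: bonded to the next two atoms
    std::vector<int> angleOffsets;     // offsets of angle partners, flattened
    bool operator<(const Signature& o) const {
        return std::tie(bondOffsets, angleOffsets) < std::tie(o.bondOffsets, o.angleOffsets);
    }
};

// Per-atom record keeps only a 32-bit signature index instead of full lists.
struct AtomRecord {
    float    charge;
    uint16_t typeId;
    uint32_t signatureId;
};

int main() {
    std::map<Signature, uint32_t> signatureIds;   // dedup table built at load time
    std::vector<Signature>        signatures;
    std::vector<AtomRecord>       atoms;

    auto intern = [&](const Signature& s) -> uint32_t {
        auto it = signatureIds.find(s);
        if (it != signatureIds.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(signatures.size());
        signatures.push_back(s);
        signatureIds.emplace(s, id);
        return id;
    };

    // In a huge water box, every oxygen shares one signature and every hydrogen
    // shares another, so the signature table stays tiny.
    Signature oxygen{{+1, +2}, {}};   // O bonded to its two hydrogens
    Signature hydrogen{{}, {}};       // H: bonds owned by the oxygen record
    for (int w = 0; w < 1000; ++w) {  // 1000 waters -> 3000 atoms, 2 signatures
        atoms.push_back({-0.834f, 0, intern(oxygen)});
        atoms.push_back({0.417f, 1, intern(hydrogen)});
        atoms.push_back({0.417f, 1, intern(hydrogen)});
    }
    std::printf("%zu atoms share %zu signatures\n", atoms.size(), signatures.size());
    return 0;
}
```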
Parallel Startup
Table 1: Parallel startup for 10 million atom water system on Blue Gene/P
Nodes   Start (sec)   Memory (MB)
1       NA            4484.55*
8       446.499       865.117
16      424.765       456.487
32      420.492       258.023
64      435.366       235.949
128     227.018       222.219
256     122.296       218.285
512     73.2571       218.449
1024    76.1005       214.758

Table 2: Parallel startup for 116 million atom BAR domain on Abe
Nodes   Start (sec)   Memory (MB)
1       3075.6*       75457.7*
50      340.361       1008
80      322.165       908
120     323.561       710
Fine-grain overhead
End-user targets are all fixed-size problems
Strong-scaling performance dominates
− Maximize the number of nanoseconds/day of simulation
Non-bonded cutoff distance determines patch size
− Patches can be subdivided along the x, y, z dimensions: 2-away X, 2-away XY, 2-away XYZ (worked out below)
− Theoretically k-away...
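A worked example of the k-away trade-off referenced above (a sketch; the margin value is assumed): splitting patches along more dimensions shrinks each patch but multiplies the number of interacting neighbors per patch, which is exactly where the fine-grain overhead comes from.

```cpp
// Patch-count / neighbor-count arithmetic for the k-away decompositions above.
#include <cstdio>

int main() {
    const double cutoff = 12.0, margin = 0.5;     // Angstroms (margin value assumed)
    const double base   = cutoff + margin;        // 1-away patch edge length

    struct Mode { const char* name; int splitX, splitY, splitZ; };
    const Mode modes[] = {
        {"1-away",     1, 1, 1},
        {"2-away X",   2, 1, 1},
        {"2-away XY",  2, 2, 1},
        {"2-away XYZ", 2, 2, 2},
    };

    for (const Mode& m : modes) {
        // Splitting a dimension halves the patch edge there, so a patch must
        // interact with patches up to 2 away in that dimension.
        int neighbors = (2 * m.splitX + 1) * (2 * m.splitY + 1) * (2 * m.splitZ + 1);
        int patchesPerOriginal = m.splitX * m.splitY * m.splitZ;
        std::printf("%-11s patch edge (x): %.2f A, patches x%d, interacting neighbors incl. self: %d\n",
                    m.name, base / m.splitX, patchesPerOriginal, neighbors);
    }
    return 0;   // prints 27, 45, 75, and 125 neighbors respectively
}
```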
1-away vs 2-away X
Fine-grain overhead reduction
Distant computes have little or no interaction
− Long-diagonal opposites in 2-away XYZ are mostly outside the cutoff
Optimizations
− Don't migrate tiny computes
− Sort pairlists to truncate computation (sketched below)
− Increase margin and do not create redundant compute objects
Slight (<5%) reduction in step time
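A minimal sketch of the pairlist-sorting optimization, not the NAMD implementation: with entries ordered by distance, the force loop can stop at the first pair beyond the cutoff rather than testing every candidate.

```cpp
// Sort a pairlist by distance so the force loop can break at the cutoff
// instead of testing every candidate pair (illustrative sketch).
#include <algorithm>
#include <cstdio>
#include <vector>

struct PairEntry { int i, j; double dist2; };   // squared distance, precomputed

double interact(const PairEntry& p) {
    // Stand-in for the real non-bonded kernel: a cheap inverse-square term.
    return 1.0 / p.dist2;
}

int main() {
    const double cutoff  = 12.0;
    const double cutoff2 = cutoff * cutoff;

    // Candidate pairs; many of the distant ones lie outside the cutoff.
    std::vector<PairEntry> pairlist;
    for (int k = 0; k < 1000; ++k)
        pairlist.push_back({k, k + 1, 1.0 + 2.0 * (k % 100)});   // synthetic squared distances

    std::sort(pairlist.begin(), pairlist.end(),
              [](const PairEntry& a, const PairEntry& b) { return a.dist2 < b.dist2; });

    double energy = 0.0;
    std::size_t evaluated = 0;
    for (const PairEntry& p : pairlist) {
        if (p.dist2 > cutoff2) break;    // everything after this is also outside the cutoff
        energy += interact(p);
        ++evaluated;
    }
    std::printf("evaluated %zu of %zu pairs, energy %.3f\n", evaluated, pairlist.size(), energy);
    return 0;
}
```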
Future work
Integrate parallel output into CVS NAMD
Consolidate small compute objects
Leverage the native communication API
Improve or replace Particle Mesh Ewald
Parallel I/O optimization study on multiple platforms
High-scale (>16k core) scaling study on multiple platforms