

  1. Center for Information Services and High Performance Computing (ZIH): Analysis and Optimization of a Molecular Dynamics Code using PAPI and the Vampir Toolchain. May 2, 2012. Thomas William, Zellescher Weg 12, Willers-Bau A 34, +49 351 463 32446, Thomas.William@ZIH.TU-Dresden.de

  2. Overview
  1 Introduction
  2 Serial Analysis
  3 PAPI Measurements
  4 Source Code Analysis
  5 Source Code Optimization
  6 Tracing and Visualization
  7 Conclusion

  3. Overview: Introduction (Section 1)

  4. Introduction: IU, ZIH, FutureGrid, MD

  5. Introduction: MD Code
  - Classical molecular dynamics simulation of dense nuclear matter consisting of either fully ionized atoms or free neutrons and protons
  - Main targets are studies of the dense matter in white dwarfs and neutron stars
  - Interaction potentials between particles are treated as classical two-particle central potentials
  - No complicated particle geometry or orientation
  - Electrons can be treated as a uniform background charge (not explicitly modeled)

  6. MD Code Details
  - Particle-particle interactions have been implemented in a multitude of ways, located in different files: PP01, PP02, and PP03
  - The code blocks are selectable using preprocessor macros
  - PP01 is the original implementation, with no division into the Ax, Bx, or NBS blocks
  - PP02 implements the versions in use by the physicists today: 3 different implementations for the Ax block, 3 implementations for the Bx block, no manual blocking (NBS)
  - PP03 includes new routines: 3 Ax blocks, 8 Bx blocks; can be blocked using the NBS value

  7. MD Code Details
  Two sections of code each have two or more variations; one section is labelled A and the other B, and the variations are numbered. MD can be compiled as a serial, OpenMP, MPI, or MPI+OpenMP binary:

      make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS="NBSX=0"

  8. MD Workflow
  The structure of the algorithm is simple:
  - First reads in a parameter file, runmd.in, and an initial particle configuration file, md.in
  - The program then calculates the initial set of accelerations
  - It then enters a time-stepping loop, a triply nested do loop

  9. Runtime Parameters

      ! Parameters:
      sim_type = 'ion-mixture',  ! simulation type
      tstart = 0.00,             ! start time
      dt = 25.00,                ! time step (fm/c)
      ! Warmup:
      nwgroup = 2,               ! groups
      nwsteps = 50,              ! steps per group
      ! Measurement:
      ngroup = 2,                ! groups
      ntot = 2,                  ! per group
      nind = 25,                 ! steps between
      tnormalize = 50,           ! temp normal.
      ncom = 50,                 ! center-of-mass motion cancel.

  Figure: Runtime parameters, snippet from runmd.in

  10. The Main Loop

      do 100 ig = 1, ngroup
         initialize group ig statistics
         do 40 j = 1, ntot
            do i = 1, nind
               ! computes forces, updates x and v
               call newton
            enddo
            call vtot
   40    continue
         compute group ig statistics
  100 continue

  Figure: Simplified version of the main loop

  11. MD Implementation Details: the newton Module
  - Forces are calculated in the newton module in a pair of nested do-loops
  - The outer loop runs over target particles, the inner loop over source particles
  - Targets are assigned to MPI processes in a round-robin fashion
  - Within each MPI process, the work is shared among OpenMP threads

  12. Overview: Serial Analysis (Section 2)

  13. XRay: a Cray XT5m
  - Cray XT5m provided by the FutureGrid project
  - The XT5m is a 2D mesh of nodes
  - Each node has two sockets, each with four cores
  - AMD Opteron 23xx "Shanghai" (45 nm) running at 2.4 GHz
  - 84 compute nodes with a total of 672 cores
  - pgi/9.0.4 using the XT PE driver xtpe-shanghai

  14. Time Constraints
  - 5k particles: one measurement takes 5 minutes
  - 27k particles: one measurement takes 1 hour
  - 55k particles: one measurement takes 10 hours

  15. Overview: PP01, PP02, and PP03
  [Chart: runtime in seconds of all code-block combinations for the -O2, -O3, and FASTSSE builds]
  Figure: Overview of the runtimes for all (143) code-block combinations with an ion-mix input dataset of 55k particles. The naming scheme for the measurements is "source-code-file A-block B-block blocking-factor"

  16. PP02 Code Blocks
  [Chart: runtime in seconds of the 55k-particle run for each source code file and code block version (PP01/PP02, A0 through A2, B0 through B2) at -O2, -O3, and FASTSSE]

  17. Annotated Source for -O3

      ##  do 90 j=i+1,n-1
      ##  ! ------ Block A ------
      ##  #if defined(A0)
      ##  r2=0.0d0
              movsd   %xmm2, %xmm1
              movq    %r12, %rdx
              movq    %r15, %rcx
              movl    $8, %eax
              .align  16
      .LB2_555:
      ## lineno: 138

  18. Annotated Source for -fast

      ##  do 90 j=i+1,n-1
      ##  ! ------ Block A ------
      ##  #if defined(A0)
      ##  r2=0.0d0
      ##  do k=1,3
      ##  xx(k)=x(k,i)-x(k,j)
              movlpd  .BSS2+48(%rip), %xmm2
              movlpd  .C2_291(%rip), %xmm0
              mulsd   %xmm2, %xmm2
              addsd   %xmm1, %xmm2
              movlpd  %xmm2, 344(%rsp)
              sqrtsd  %xmm2, %xmm2
              movlpd  %xmm2, 448(%rsp)
              mulsd   md_globals_10_+120(%rip), %xmm2
              subsd   %xmm2, %xmm0
              .p2align 4,,1

  19. Overview: PAPI Measurements (Section 3)

  20. Floating Point Instructions
  [Charts: instruction counts per code block (A0_B0 through A2_B2) for the -O2, -O3, and FAST builds, one panel each for PAPI_FAD_INS, PAPI_FML_INS, and PAPI_FP_INS]

  21. FPU Idle Times
  [Chart: FPU idle time in % per code block for the -O2, -O3, and FAST builds]
  Figure: FPU idle times in percent of PAPI measured total cycles

  22. Branch Miss Predictions
  [Charts: PAPI_BR_INS and PAPI_BR_MSP counts per code block for the -O2, -O3, and FAST builds, plus the resulting branch prediction miss rate in %]

  23. Overview: Source Code Analysis (Section 4)
