efficient gpu parallelization of
play

Efficient GPU parallelization of the Fast Multipole Method with - PowerPoint PPT Presentation

Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions Bartosz Kohnke Fast Multipole Method (FMM) Calculation of long-ranged forces in the n-body problem (Greengard and Rokhlin, 1987) Tree based


  1. Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions Bartosz Kohnke

  2. Fast Multipole Method (FMM) Calculation of long-ranged forces in the n-body problem (Greengard and Rokhlin, 1987)  Tree – based approach  O ( n ) complexity n := number of particles  Multipole expansion of long-range interactions 08.05.2017 B.Kohnke 2

  3. Fast Multipole Method (FMM) 1/r Long-Range interactions Molecular Dynamics Astrophysics Plasma Physics 08.05.2017 B.Kohnke 3

  4. Molecular Dynamics Simulation Simulations on the atomistic level 08.05.2017 B.Kohnke 4

  5. Molecular Dynamics Simulation Describing the energy of the system Bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E R 08.05.2017 B.Kohnke 5

  6. Molecular Dynamics Simulation Describing the energy of the system Non-bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E R 08.05.2017 B.Kohnke 6

  7. Molecular Dynamics Simulation Describing the energy of the system Non-bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E N-body problem O ( n² ) + R periodic boundaries 08.05.2017 B.Kohnke 7

  8. Particle Mesh Ewald (PME) GROMACS Basic idea of PME  Splitting the computation of electrostatic potential in two parts  Direct  Compute the particle-particle interactions directly within a cutoff  Reciprocal (FFT based)  Extrapolate charges on the grid 𝐹 = 𝐹 𝑒𝑗𝑠𝑓𝑑𝑢 + 𝐹 𝑠𝑓𝑑𝑗𝑞𝑠𝑝𝑑𝑏𝑚  FFT of the charge grid  Computation O ( n log n )  Communication O ( nodes² ) 08.05.2017 B.Kohnke 8

  9. PME vs FMM Massive parallelization, 150000 Particles Parallel efficiency PME FMM (Near Field) 64 100% 128 32 80% 256 Efficiency 1 60% 16 4 40% replication factor 2 8 20% 0% 08.05.2017 B.Kohnke 9

  10. FMM Basic Idea Classical O ( n² ) approach 08.05.2017 B.Kohnke 10

  11. FMM Basic Idea Classical O ( n² ) approach Tree-code O ( n log n ) 08.05.2017 B.Kohnke 11

  12. FMM Basic Idea Classical O ( n² ) approach Tree-code O ( n log n ) FMM O ( n ) 08.05.2017 B.Kohnke 12

  13. FMM – Parameters Controling the accuracy of the approximation and performance d – depth of the FMM tree FMM – Parameters 08.05.2017 B.Kohnke 13

  14. FMM – Parameters Controling the accuracy of the approximation and performance ws – separation criterion FMM – Parameters 08.05.2017 B.Kohnke 14

  15. FMM – Parameters Controling the accuracy of the approximation and performance p – multipole order FMM – Parameters 𝒒 𝑚 𝑏 𝑚 1 (𝑚 − 𝑛)! 𝑄 𝑚𝑛 (cos 𝜄) 𝑓 −𝑗𝑛 𝛾−𝜚 𝑒 = 𝑚 + 𝑛 ! 𝑄 𝑚𝑛 (cos 𝛽) 𝑠 𝑚+1 𝑚=0 𝑛=−𝑚 08.05.2017 B.Kohnke 15

  16. FMM – Workflow Preprocessing Particle input – positions and charges (x, y, z, q) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Set algorithm parameters  ws – separation criterion  p – multipoleorder  d – tree depth 08.05.2017 B.Kohnke 16

  17. FMM – Workflow Preprocessing Clustering particles ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Build a tree according to chosen parameter d  ws – separation criterion  p – multipoleorder  d – tree depth 08.05.2017 B.Kohnke 17

  18. FMM – Workflow Preprocessing Particle to multipole (P2M) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Expand particle positions and charges to multipole moments 𝜕  O ( np 2 ) – operation  Fill boxes on the lowest level 08.05.2017 B.Kohnke 18

  19. FMM – Workflow Pass 1 Multipole to multipole (M2M) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Translate the multipole moments 𝜕 up the FMM-tree  O ( p 4 ) – operation  One operation per box  Vertical operator 𝑞 𝑚 𝑚 𝑘 𝜕 𝐛 + 𝐜 = 𝜕 𝑘𝑙 𝐛 𝑃 𝑚−𝑘,𝑛−𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 19

  20. FMM – Workflow Pass 2 Multipole to local (M2L) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Transform multipole moments 𝜕 into Taylor moments 𝜈  O ( p 4 ) – operation  189 operations per box (ws = 1)  Horizontal operator 𝑞 𝑚 𝑚 𝑘 𝜈 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 20

  21. FMM – Workflow Pass 3 Local to local (L2L) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Translate local moments 𝜈 down the tree  O ( p 4 ) – operation  One operation per box  Vertical operation 𝑞 𝑚 𝑞 𝑘 𝜈 𝐬 − 𝐜 = 𝜈 𝑘𝑙 𝐬 𝑁 𝑘−𝑚,𝑙−𝑛 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=𝑚 𝑙=−𝑘 08.05.2017 B.Kohnke 21

  22. FMM – Workflow Pass 4 and 5 Forces computing ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p  Far field contribution from 𝜈 and 𝜕  Compute Φ 𝐺𝐺 , 𝐆 𝐺𝐺 , 𝐅 𝐺𝐺  O ( np 2 ) – operation Near field contribution (particle to particle)  Compute Φ 𝑂𝐺 , 𝐆 𝑂𝐺 , 𝐅 𝑂𝐺 𝟑  O ( n cutoff ) – operation 08.05.2017 B.Kohnke 22

  23. FMM – Data Structures Far field Coefficient Matrix, Generalized Storage Type  Matrix size O ( p 2 ) 0,0  Triangular shape 1,0 1,-1 1,1  One complex value per element 2,-1 2,1 2,-2 2,0 2,2 Used as storage for  Multipole moments ω  Taylor moments µ 2,2 1,0  Operators M 2,0 0,0 1,1 2,1 Physical memory alignment 08.05.2017 B.Kohnke 23

  24. M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  For all boxes in the tree  Determine the interaction set  Children of direct neighbors of the own parent  Determine M2L operator  Compute one M2L operation  For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 24

  25. M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  For all boxes in the tree  Determine the interaction set  Children of direct neighbors of the own parent  Determine M2L operator  Compute one M2L operation  For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 25

  26. M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  For all boxes in the tree  Determine the interaction set  Children of direct neighbors of the own parent  Determine valid M2L operator  Compute one M2L operation  For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 26

  27. M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  For all boxes in the tree  Determine the interaction set  Children of direct neighbors of the own parent  Determine valid M2L operator  Compute one M2L operation  For each valid 𝜕, 𝜈 pair 𝑞 𝑚 𝑚 𝑘 𝜈 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 27

  28. M2L Operation – Internal Structure Translating multipole expansion to local expansion, p 4 loop structure Translation Operation (M2L) - O ( p 4 ) One M2L operation 𝑚 𝑘 𝜈 𝑚𝑛 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑘=0 𝑙=−𝑘   M for (int l = 0; l <= p; ++l) lm for (int m = 0; m <= l; ++m) { omega_l_m=0; for (int j = 0; j <= p; ++j) { for (int k = -j; k <= j; ++k) { omega_l_m += M[m_idx](l+j, m+k) * omega[o_idx](j,k); } } mu[mu_idx](l, m) += omega_l_m } 08.05.2017 B.Kohnke 28

  29. M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  Start one parent kernel for each 𝜕  Parent kernel  Computes the interaction set  Spawns the child kernels  Child kernel  Compute one p 4 operation ( p 2 threads) 08.05.2017 B.Kohnke 29

  30. M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  Start one parent kernel for each 𝜕  Parent kernel  Computes the interaction set  Spawns the child kernels  Child kernel  Compute one p 4 operation ( p 2 threads) parent_kernel<<<(1,1,1)(3,3,3)>>> 08.05.2017 B.Kohnke 30

  31. M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁  Start one parent kernel for each 𝜕  Parent kernel  Computes interaction set  Spawns the child kernels  Child kernel  Compute one p 4 operation ( p 2 threads) parent_kernel<<<(1,1,1)(3,3,3)>>> child_kernel<<<(2,2,2)(p,p,1)>>>  Blocksize too small for small p values  Small grids (streams help to utilize the GPU) 08.05.2017 B.Kohnke 31

Recommend


More recommend