Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions Bartosz Kohnke
Fast Multipole Method (FMM) Calculation of long-ranged forces in the n-body problem (Greengard and Rokhlin, 1987) Tree – based approach O ( n ) complexity n := number of particles Multipole expansion of long-range interactions 08.05.2017 B.Kohnke 2
Fast Multipole Method (FMM) 1/r Long-Range interactions Molecular Dynamics Astrophysics Plasma Physics 08.05.2017 B.Kohnke 3
Molecular Dynamics Simulation Simulations on the atomistic level 08.05.2017 B.Kohnke 4
Molecular Dynamics Simulation Describing the energy of the system Bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E R 08.05.2017 B.Kohnke 5
Molecular Dynamics Simulation Describing the energy of the system Non-bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E R 08.05.2017 B.Kohnke 6
Molecular Dynamics Simulation Describing the energy of the system Non-bonded interactions 𝐷𝑝𝑣𝑚. + 𝐹 𝑗𝑘 𝑤𝑒𝑋. 𝐹 = 𝐹 𝛽 + 𝐹 𝑗𝑘 𝛽 𝑗𝑘 E N-body problem O ( n² ) + R periodic boundaries 08.05.2017 B.Kohnke 7
Particle Mesh Ewald (PME) GROMACS Basic idea of PME Splitting the computation of electrostatic potential in two parts Direct Compute the particle-particle interactions directly within a cutoff Reciprocal (FFT based) Extrapolate charges on the grid 𝐹 = 𝐹 𝑒𝑗𝑠𝑓𝑑𝑢 + 𝐹 𝑠𝑓𝑑𝑗𝑞𝑠𝑝𝑑𝑏𝑚 FFT of the charge grid Computation O ( n log n ) Communication O ( nodes² ) 08.05.2017 B.Kohnke 8
PME vs FMM Massive parallelization, 150000 Particles Parallel efficiency PME FMM (Near Field) 64 100% 128 32 80% 256 Efficiency 1 60% 16 4 40% replication factor 2 8 20% 0% 08.05.2017 B.Kohnke 9
FMM Basic Idea Classical O ( n² ) approach 08.05.2017 B.Kohnke 10
FMM Basic Idea Classical O ( n² ) approach Tree-code O ( n log n ) 08.05.2017 B.Kohnke 11
FMM Basic Idea Classical O ( n² ) approach Tree-code O ( n log n ) FMM O ( n ) 08.05.2017 B.Kohnke 12
FMM – Parameters Controling the accuracy of the approximation and performance d – depth of the FMM tree FMM – Parameters 08.05.2017 B.Kohnke 13
FMM – Parameters Controling the accuracy of the approximation and performance ws – separation criterion FMM – Parameters 08.05.2017 B.Kohnke 14
FMM – Parameters Controling the accuracy of the approximation and performance p – multipole order FMM – Parameters 𝒒 𝑚 𝑏 𝑚 1 (𝑚 − 𝑛)! 𝑄 𝑚𝑛 (cos 𝜄) 𝑓 −𝑗𝑛 𝛾−𝜚 𝑒 = 𝑚 + 𝑛 ! 𝑄 𝑚𝑛 (cos 𝛽) 𝑠 𝑚+1 𝑚=0 𝑛=−𝑚 08.05.2017 B.Kohnke 15
FMM – Workflow Preprocessing Particle input – positions and charges (x, y, z, q) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Set algorithm parameters ws – separation criterion p – multipoleorder d – tree depth 08.05.2017 B.Kohnke 16
FMM – Workflow Preprocessing Clustering particles ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Build a tree according to chosen parameter d ws – separation criterion p – multipoleorder d – tree depth 08.05.2017 B.Kohnke 17
FMM – Workflow Preprocessing Particle to multipole (P2M) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Expand particle positions and charges to multipole moments 𝜕 O ( np 2 ) – operation Fill boxes on the lowest level 08.05.2017 B.Kohnke 18
FMM – Workflow Pass 1 Multipole to multipole (M2M) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Translate the multipole moments 𝜕 up the FMM-tree O ( p 4 ) – operation One operation per box Vertical operator 𝑞 𝑚 𝑚 𝑘 𝜕 𝐛 + 𝐜 = 𝜕 𝑘𝑙 𝐛 𝑃 𝑚−𝑘,𝑛−𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 19
FMM – Workflow Pass 2 Multipole to local (M2L) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Transform multipole moments 𝜕 into Taylor moments 𝜈 O ( p 4 ) – operation 189 operations per box (ws = 1) Horizontal operator 𝑞 𝑚 𝑚 𝑘 𝜈 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 20
FMM – Workflow Pass 3 Local to local (L2L) ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Translate local moments 𝜈 down the tree O ( p 4 ) – operation One operation per box Vertical operation 𝑞 𝑚 𝑞 𝑘 𝜈 𝐬 − 𝐜 = 𝜈 𝑘𝑙 𝐬 𝑁 𝑘−𝑚,𝑙−𝑛 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=𝑚 𝑙=−𝑘 08.05.2017 B.Kohnke 21
FMM – Workflow Pass 4 and 5 Forces computing ws E xyz Preprocessing Pass 1 Pass 2 Pass 3 Pass 4 Pass 5 d F q p Far field contribution from 𝜈 and 𝜕 Compute Φ 𝐺𝐺 , 𝐆 𝐺𝐺 , 𝐅 𝐺𝐺 O ( np 2 ) – operation Near field contribution (particle to particle) Compute Φ 𝑂𝐺 , 𝐆 𝑂𝐺 , 𝐅 𝑂𝐺 𝟑 O ( n cutoff ) – operation 08.05.2017 B.Kohnke 22
FMM – Data Structures Far field Coefficient Matrix, Generalized Storage Type Matrix size O ( p 2 ) 0,0 Triangular shape 1,0 1,-1 1,1 One complex value per element 2,-1 2,1 2,-2 2,0 2,2 Used as storage for Multipole moments ω Taylor moments µ 2,2 1,0 Operators M 2,0 0,0 1,1 2,1 Physical memory alignment 08.05.2017 B.Kohnke 23
M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 For all boxes in the tree Determine the interaction set Children of direct neighbors of the own parent Determine M2L operator Compute one M2L operation For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 24
M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 For all boxes in the tree Determine the interaction set Children of direct neighbors of the own parent Determine M2L operator Compute one M2L operation For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 25
M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 For all boxes in the tree Determine the interaction set Children of direct neighbors of the own parent Determine valid M2L operator Compute one M2L operation For each valid 𝜕, 𝜈 pair 08.05.2017 B.Kohnke 26
M2L Operation – Tree structure Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 For all boxes in the tree Determine the interaction set Children of direct neighbors of the own parent Determine valid M2L operator Compute one M2L operation For each valid 𝜕, 𝜈 pair 𝑞 𝑚 𝑚 𝑘 𝜈 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑚=0 𝑛=−𝑚 𝑘=0 𝑙=−𝑘 08.05.2017 B.Kohnke 27
M2L Operation – Internal Structure Translating multipole expansion to local expansion, p 4 loop structure Translation Operation (M2L) - O ( p 4 ) One M2L operation 𝑚 𝑘 𝜈 𝑚𝑛 𝐜 − 𝐛 = 𝜕 𝑘𝑙 𝐛 𝑁 𝑚+𝑘,𝑛+𝑙 (𝐜) 𝑘=0 𝑙=−𝑘 M for (int l = 0; l <= p; ++l) lm for (int m = 0; m <= l; ++m) { omega_l_m=0; for (int j = 0; j <= p; ++j) { for (int k = -j; k <= j; ++k) { omega_l_m += M[m_idx](l+j, m+k) * omega[o_idx](j,k); } } mu[mu_idx](l, m) += omega_l_m } 08.05.2017 B.Kohnke 28
M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 Start one parent kernel for each 𝜕 Parent kernel Computes the interaction set Spawns the child kernels Child kernel Compute one p 4 operation ( p 2 threads) 08.05.2017 B.Kohnke 29
M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 Start one parent kernel for each 𝜕 Parent kernel Computes the interaction set Spawns the child kernels Child kernel Compute one p 4 operation ( p 2 threads) parent_kernel<<<(1,1,1)(3,3,3)>>> 08.05.2017 B.Kohnke 30
M2L – GPU dynamic parallelism Tree loop and Box – Neighbor Structure, ws =1 M2L Operation 𝜈 Ω 𝜕 𝑁 Start one parent kernel for each 𝜕 Parent kernel Computes interaction set Spawns the child kernels Child kernel Compute one p 4 operation ( p 2 threads) parent_kernel<<<(1,1,1)(3,3,3)>>> child_kernel<<<(2,2,2)(p,p,1)>>> Blocksize too small for small p values Small grids (streams help to utilize the GPU) 08.05.2017 B.Kohnke 31
Recommend
More recommend