Solving Domain Wall Dirac Equation Using Multisplitting Preconditioned Conjugate Gradient

Jiqun Tu
Department of Physics, Columbia University

The 36th International Symposium on Lattice Field Theory, July 23, 2018 @ 16:10

Talk based on: Duo Guo, Robert D. Mawhinney, and Jiqun Tu, [arXiv:1804.08593].
Special thanks to Norman Christ, Chulwoo Jung, and Christopher Kelly.

The RBC & UKQCD collaborations

BNL and BNL/RBRC: Yasumichi Aoki (KEK), Mattia Bruno, Taku Izubuchi, Yong-Chull Jang, Chulwoo Jung, Christoph Lehner, Meifeng Lin, Aaron Meyer, Hiroshi Ohki, Shigemi Ohta (KEK), Amarjit Soni
UC Boulder: Oliver Witzel
Columbia University: Ziyuan Bai, Norman Christ, Duo Guo, Christopher Kelly, Bob Mawhinney, Masaaki Tomii, Jiqun Tu, Bigeng Wang, Tianle Wang, Evan Wickenden, Yidi Zhao
University of Connecticut: Tom Blum, Dan Hoying (BNL), Luchang Jin (RBRC), Cheng Tu
Edinburgh University: Peter Boyle, Guido Cossu, Luigi Del Debbio, Tadeusz Janowski, Richard Kenway, Julia Kettle, Fionn O'haigan, Brian Pendleton, Antonin Portelli, Tobias Tsang, Azusa Yamaguchi
KEK: Julien Frison
University of Liverpool: Nicolas Garron
MIT: David Murphy
Peking University: Xu Feng
University of Southampton: Jonathan Flynn, Vera Guelpers, James Harrison, Andreas Juettner, James Richings, Chris Sachrajda
Stony Brook University: Jun-Sik Yoo, Sergey Syritsyn (RBRC)
York University (Toronto): Renwick Hudspith
SUMMIT at ORNL

Figure 1: The New York Times's comment on SUMMIT becoming the world's most powerful supercomputer.
Scaling on SUMMIT at ORNL

Figure 2: Half precision Möbius domain wall fermion CG weak scaling with a local volume of 16 × 12³ × 12, plotted against the number of nodes (1 to 1000). Reference lines mark 94.2 (single Volta peak performance, projected) and 18.60 Tflops/node (single Volta MDWF dslash performance, projected); the plotted values fall from 8.85 at the small-node end to 1.55 at the large-node end. 6 NVIDIA Volta GPUs on each compute node. Numbers provided by Chulwoo Jung.
Motivation

• Inter-processor communication is the bottleneck for Dirac equation solving.
• For measurements there are many approaches available to improve the situation: Lanczos, EigCG, split-grid, multigrid, etc.
• This is not the case for evolution.
• We need a better algorithm to reduce the communication overhead and exploit the abundant local GPU flops.
• Do more work locally!
Previous Work

• Domain decomposition/multiplicative Schwarz [M. Lüscher 2004].
• Additive Schwarz [Y. Osaki 2000] and [R. Babich 2011].
Multisplitting Algorithm

For reference see [D. O'Leary 1985].

Split $Ax = b$ row-wise over the nodes. For the block of rows $b_s$ owned by one node,

$$A_l x_l + A_s x_s + A_r x_r = b_s,$$

where $A_s$ is the diagonal block coupling the locally owned part $x_s$ of the solution to itself, and $A_l$, $A_r$ couple it to the neighboring pieces $x_l$, $x_r$ (illustrated on the slide as the block row $[\,A_l\ A_s\ A_r\,]$ acting on $(x_l, x_s, x_r)$ inside the full system $A x = b$).
Multisplitting Algorithm

Solve $A_l x_l + A_s x_s + A_r x_r = b_s$ by rearranging it into an iterative form:

$$A_s x_s^{(k+1)} = b_s - A_l x_l^{(k)} - A_r x_r^{(k)} = b_s - \left( A x^{(k)} - A_s x_s^{(k)} \right)_s = r_s^{(k)} + A_s x_s^{(k)} \equiv \hat b_s^{(k)}.$$

For each cycle,
• use communication to calculate the right-hand side $\hat b_s^{(k)}$;
• solve $A_s x_s^{(k+1)} = \hat b_s^{(k)}$ locally;
• the updated solution $x_s^{(k+1)}$ is used to prepare the next cycle.

Get $A_s$ for each node by chopping off all off-block-diagonal terms, i.e. applying a zero Dirichlet boundary condition.
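As a concrete illustration (not the production Grid/Quda code), here is a minimal numpy sketch of the multisplitting cycle above on a toy dense system; the matrix, block layout, cycle count, and all names are invented for the example, and each block would live on its own node in the real solver.

import numpy as np

# Toy multisplitting iteration: each "node" owns a contiguous block of indices.
# A_s is the diagonal block (zero Dirichlet boundary: off-block couplings dropped);
# the off-block couplings play the role of A_l and A_r.
rng = np.random.default_rng(0)
n, nblocks = 12, 3
A = 4.0 * np.eye(n) + rng.normal(scale=0.3, size=(n, n))   # diagonally dominant, so the cycles converge
A = 0.5 * (A + A.T)
b = rng.normal(size=n)
blocks = np.array_split(np.arange(n), nblocks)

x = np.zeros(n)
for cycle in range(50):
    x_new = x.copy()
    for idx in blocks:                                      # each block is local to one node
        A_s = A[np.ix_(idx, idx)]
        # b_hat = b_s - (A x)_s + A_s x_s = b_s - A_l x_l - A_r x_r   (this step needs communication)
        b_hat = b[idx] - A[idx, :] @ x + A_s @ x[idx]
        x_new[idx] = np.linalg.solve(A_s, b_hat)            # local solve, no communication
    x = x_new

print("residual after 50 cycles:", np.linalg.norm(b - A @ x))

On a diagonally dominant toy matrix the cycles converge on their own; for the Dirac normal operator the convergence is slow, which is why the first cycle is later used as a preconditioner instead.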
Möbius Domain Wall Fermion

Even-odd preconditioning:

$$\begin{pmatrix} M_5 & -\kappa_b M^4_{eo} \\ -\kappa_b M^4_{oe} & M_5 \end{pmatrix} \begin{pmatrix} \psi_e \\ \psi_o \end{pmatrix} = \begin{pmatrix} \phi_e \\ \phi_o \end{pmatrix},$$

then instead we solve

$$D_{PC}\,\psi_e = \hat\phi_e, \qquad D_{PC} \equiv M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe}, \qquad M^4_{oe/eo} = D^w_{x,y}\,\big(b_5\,\delta_{s,t} + c_5\, D^5\big),$$

$$D^w_{x,y} = \sum_\mu \left[ (1+\gamma_\mu)\, U^\dagger_{x-\hat\mu,\mu}\, \delta_{x-\hat\mu,y} + (1-\gamma_\mu)\, U_{x,\mu}\, \delta_{x+\hat\mu,y} \right].$$

Using CG:

$$D_{PC}^\dagger D_{PC}\, \psi_e = D_{PC}^\dagger\, \hat\phi_e.$$
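To make the even-odd reduction above concrete, here is a small numpy check of the Schur-complement identity with random stand-ins for $M_5$, $M^4_{eo}$, $M^4_{oe}$ and $\kappa_b$ (these are generic blocks of matching shape, not the actual Möbius operators).

import numpy as np

rng = np.random.default_rng(1)
n = 6
kappa_b = 0.1
M5  = np.eye(n) + 0.1 * rng.normal(size=(n, n))
Meo = rng.normal(size=(n, n))
Moe = rng.normal(size=(n, n))
phi_e, phi_o = rng.normal(size=n), rng.normal(size=n)

# Full even-odd system:  [[M5, -kb*Meo], [-kb*Moe, M5]] (psi_e, psi_o) = (phi_e, phi_o)
full = np.block([[M5, -kappa_b * Meo], [-kappa_b * Moe, M5]])
psi  = np.linalg.solve(full, np.concatenate([phi_e, phi_o]))

# Schur-complement system on the even sites only:
#   D_PC psi_e = phi_e_hat,  with  D_PC = M5 - kb^2 Meo M5^{-1} Moe
D_PC      = M5 - kappa_b**2 * Meo @ np.linalg.solve(M5, Moe)
phi_e_hat = phi_e + kappa_b * Meo @ np.linalg.solve(M5, phi_o)
psi_e     = np.linalg.solve(D_PC, phi_e_hat)

print("even-site solutions agree:", np.allclose(psi_e, psi[:n]))

In the lattice solver $D_{PC}$ is of course applied matrix-free; CG is then run on the normal equation $D_{PC}^\dagger D_{PC}\,\psi_e = D_{PC}^\dagger\,\hat\phi_e$ as above.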
The Normal Operator

• There are 4 hopping terms in the normal operator:

$$A = D_{PC}^\dagger D_{PC} = \left( M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe} \right)^\dagger \left( M_5 - \kappa_b^2\, M^4_{eo} M_5^{-1} M^4_{oe} \right).$$

• This means we need to enforce the Dirichlet boundary condition on $D_{PC}^\dagger D_{PC}$ as a whole, instead of on the individual hopping terms $M^4_{eo/oe}$ ($D^w_{x,y}$).
• We need to include the snake terms: terms that hop out of the boundary and hop back in.
• This seems obvious but is not trivial to implement.
The Normal Operator

The snake terms (figure): hopping paths that leave the local block and return across the boundary.
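A small numpy check of why the snake terms matter: for a hopping-type matrix $D$ split into blocks, the local block of the normal operator $D^\dagger D$ is not simply (local block of $D$)$^\dagger$(local block of $D$); the difference is exactly the hop-out-and-back contribution through the boundary. The matrix below is a toy 1-D nearest-neighbour operator, not the actual preconditioned Möbius operator.

import numpy as np

n = 8                                   # 1-D toy lattice, two "nodes" of 4 sites each
rng = np.random.default_rng(2)
D = 2.0 * np.eye(n)
for x in range(n - 1):                  # nearest-neighbour hopping terms
    D[x, x + 1] = rng.normal()
    D[x + 1, x] = rng.normal()

idx = np.arange(4)                      # sites owned by the first node
A       = D.T @ D                       # normal operator (real toy matrix, so .T is the dagger)
A_local = A[np.ix_(idx, idx)]           # Dirichlet BC applied to D^dag D as a whole ...
naive   = D[np.ix_(idx, idx)].T @ D[np.ix_(idx, idx)]   # ... vs. BC applied to each hopping term

snake = A_local - naive
print("snake-term contribution:\n", np.round(snake, 3))
# Only the boundary site (index 3) is affected: the path hops to site 4 and back.

Dropping this contribution changes the local preconditioner block $A_s$, which is why the algorithm only converges once the snake terms are included.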
Dslash Implementation

Figure sequence (animation frames): snapshots of the local data before and after each of the four hopping terms of the normal operator — before/after the 1st hopping term, before/after the 2nd hopping term, and before the 3rd and 4th hopping terms.
Multisplitting Algorithm

• The algorithm converges once the snake terms are included.
• The convergence rate is slow, however.
• Similar to [M. Lüscher 2004], we use its first cycle with a zero initial guess as a preconditioner for CG.
• We use plain CG for the preconditioner solve. Instead of setting a precision-based stopping condition, we iterate a fixed number of times (the inner iteration count).
As a Preconditioner

$r_0 = b - A x_0$
$z_0 = M^{-1} r_0$
$p_0 = z_0$
$k = 0$
while not converged do
    $\alpha_k = \langle r_k, z_k \rangle / \langle p_k, A p_k \rangle$
    $x_{k+1} = x_k + \alpha_k p_k$
    $r_{k+1} = r_k - \alpha_k A p_k$
    $z_{k+1} = M^{-1} r_{k+1}$   ← multisplitting: $A_s x_s^{(k+1)} = r_s^{(k)} + A_s x_s^{(k)}$, only the first cycle, zero initial guess, iterate a fixed number of times
    $\beta_k = \langle z_{k+1}, r_{k+1} \rangle / \langle z_k, r_k \rangle$
    $p_{k+1} = z_{k+1} + \beta_k p_k$
    $k = k + 1$
end while
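For concreteness, here is a minimal numpy sketch of the preconditioned CG above with the preconditioner step $z = M^{-1} r$ realized as a fixed number of plain CG iterations on the block-diagonal (zero-Dirichlet) blocks $A_s$, started from a zero guess. The matrix, block layout, and iteration counts are toy choices, not the production setup.

import numpy as np

def cg_fixed_iters(A, b, iters):
    # Plain CG from a zero initial guess, stopped after a fixed iteration count.
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def mspcg(A, b, blocks, inner_iters=6, tol=1e-10, max_outer=1000):
    # Outer CG on A, preconditioned by sloppy local solves on the diagonal blocks A_s.
    def precond(r):
        z = np.zeros_like(r)
        for idx in blocks:                          # each block is local to one node
            s = np.ix_(idx, idx)
            z[idx] = cg_fixed_iters(A[s], r[idx], inner_iters)
        return z

    x = np.zeros_like(b)
    r = b - A @ x
    z = precond(r)
    p = z.copy()
    for k in range(max_outer):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol * np.linalg.norm(b):
            return x, k + 1
        z_new = precond(r_new)
        beta = (z_new @ r_new) / (z @ r)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x, max_outer

# Toy SPD problem with two local blocks.
rng = np.random.default_rng(3)
n = 40
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)
b = rng.normal(size=n)
blocks = np.array_split(np.arange(n), 2)
x, outer = mspcg(A, b, blocks)
print("outer iterations:", outer, " residual:", np.linalg.norm(b - A @ x))

In the real solver the block solves run entirely on-node (no communication), while each outer iteration still applies the full operator $A$ and therefore communicates; tuning the inner iteration count trades local flops against outer iterations, as discussed on the later slides.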
As a Preconditioner

Figure: the preconditioner $M$ is the block-diagonal part of $A$, built from the local blocks $A_s$.
As a Preconditioner

• Although it starts from a different origin, this is now effectively the same as additive Schwarz, provided one treats the Dirichlet boundary condition correctly.
• Inclusion of the snake terms is crucial.
• Naming issue: see [A Unified Representation and Theory of Algebraic Additive Schwarz and Multisplitting Methods, A. Frommer 1997].
• We call the method Multisplitting Preconditioned CG (MSPCG).
Result: 32³ × 64

Figure 3: MSPCG solve on a 32³ × 64 lattice (a⁻¹ = 1.37 GeV, ensemble 32x64x12ID) with physical pion mass: relative precision r²/r₀² versus outer iteration count for plain CG and for MSPCG with 3, 4, and 6 inner iterations. Test performed on CORI at NERSC on 128 KNL nodes.
Result: 32³ × 64

Figure 4: MSPCG solve on the same lattice: relative residual versus outer iteration count for plain CG and for MSPCG with 2 to 10 inner iterations. Test performed on 64 nodes at Piz Daint, solving D†Dx = b instead of D†Dx = D†b. Numbers from Kate Clark.
Result: 64³ × 128

Figure 5: MSPCG solve on a 64³ × 128 lattice (a⁻¹ = 2.36 GeV) with physical pion mass: number of outer iterations to converge versus number of inner iterations (up to 18), shown for 64, 128, 256, and 512 KNL nodes at CORI; the outer iteration count drops from roughly 10000 at small inner iteration counts to roughly 4200-4300 at 18 inner iterations. Plain CG takes 18092 iterations to converge to the same precision (10⁻¹⁰).
Result: 80² × 96 × 192

Figure 6: MSPCG solve on an 80² × 96 × 192 lattice (a⁻¹ = 3.00 GeV, ensemble 80x80x96x192DED) with physical pion mass: relative precision r²/r₀² versus outer iteration count for plain CG and for MSPCG with 6 inner iterations. Test performed on CORI at NERSC with 1024 KNL nodes.
Sloppy Preconditioner Solve

• We observe that the number of outer CG iterations is greatly reduced even if the inner preconditioner is solved sloppily, e.g. iterating only 3-6 times.
• This observation is supported by several theoretical works, e.g. [G. Golub 1999] and [V. Simoncini 2003].
• Thus the number of inner iterations in the preconditioner solve is a parameter that can be tuned to achieve the maximum speed-up.
SUMMIT

For a 16 × 12³ × 12 local volume on 4 Volta GPUs, the preconditioner dslash runs at 14.13 Tflops.
With the same local volume on 1024 6-Volta nodes, the outer dslash runs at 1.55 Tflops/GPU.
Assuming a factor of 3 reduction in the outer iteration count with 6 inner iterations, the speed-up from MSPCG is:

$$\frac{3}{\underbrace{\dfrac{\overbrace{6 \times 1.87}^{\text{more work from snake terms}} \times 1.55}{14.13 \times (6/4)}}_{\text{precon. cost}} + \underbrace{\dfrac{1.55}{1.55}}_{\text{outer cost}}} = 1.65$$
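A few lines reproducing the arithmetic above, purely as a sanity check of the projected speed-up model as reconstructed here (the performance numbers are the ones quoted on this slide).

# Projected MSPCG speed-up on SUMMIT, reproducing the slide's estimate.
precon_tflops  = 14.13      # preconditioner dslash on 4 Voltas, 16 x 12^3 x 12 local volume
outer_tflops   = 1.55       # outer dslash, Tflops per GPU at 1024 nodes
inner_iters    = 6
snake_factor   = 1.87       # extra work per inner dslash from the snake terms
iter_reduction = 3          # assumed reduction in the outer iteration count

precon_cost = inner_iters * snake_factor * outer_tflops / (precon_tflops * (6 / 4))
outer_cost  = outer_tflops / outer_tflops       # = 1 by construction
speedup = iter_reduction / (precon_cost + outer_cost)
print(f"projected speed-up: {speedup:.2f}")     # ~1.65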
Code Implementation

• First tested in CPS.
• Fully implemented in Grid [1] and Quda [2] with help from Qlattice [3].
• Great thanks to Kate Clark from NVIDIA.

[1] https://github.com/paboyle/Grid
[2] https://github.com/lattice/quda
[3] https://github.com/waterret/Qlattice
Conclusion

• The amount of inter-processor communication can be reduced, at the expense of more local floating point computation, by using the multisplitting algorithm as a preconditioner for CG.
• If the local floating point computation is cheap enough, this greatly speeds up solving the domain wall fermion Dirac equation.
Future Work

• Ongoing work in Quda: speed up the preconditioner dslash as much as possible.
• The same approach is expected to work for staggered fermions as well.
• Spectrum analysis of the matrix A and the preconditioner M.