Parallel Scaling of Teter's Minimization for Ab Initio Calculations


1. Parallel Scaling of Teter's Minimization for Ab Initio Calculations
Torsten Hoefler, Department of Computer Science, Technical University of Chemnitz
HPCNano Workshop 2006, Supercomputing'06, Tampa, FL, USA, November 13th, 2006

2. Outline
1 Introduction: Introduction to ABINIT; Teter's Conjugate Gradient Minimization
2 Parallelization: Already Implemented Parallelization; A New Proposal; Verifying This Proposal
3 Hunting the Overlap: Non-blocking Collectives


4. ABINIT Introduction
ABINIT solves the time-independent Schrödinger equation (effective one-particle case, using DFT): H_tot Φ = E_tot Φ ⇒ an eigenvalue problem.
Eigenvalues and eigenvectors are determined with CG minimization (Teter et al.).
The wavefunction Φ is written in a plane-wave basis set.

5. ABINIT Program Flow
Start → Initialization: choose coefficients, calculate trial potential.
SCF cycle: minimize electronic energy → calculate electron density → calculate potential → check convergence; if not converged, mix new density and repeat.
When converged: calculate total energy → Stop.

6. ABINIT Tracing
Call tree with per-routine runtime shares:
vtowfk (97.3%/4.3%)
  cgwf (83.6%/1.3%)
    fourwf (27.4%/0.0%) → sg_fftrisc (27.4%/5.7%) → sg_ffty (14.8%/14.8%), sg_fftpx (6.6%/6.6%)
    projbd (36.0%/36.0%)
    nonlop (21.5%/0.0%) → nonlop_pl (21.5%/0.1%) → opernl4a (11.6%/10.3%), opernl4b (9.8%/8.7%)
  orthon (5.7%/5.6%)
⇒ 83% of the runtime is spent in the Teter minimization


8. Conjugate Gradient Operations
Dot product and matrix-vector product:
  dot product: ⟨Φ_i|Φ_j⟩
  matrix-vector product: H·Φ, with H = E^e_kin + V^e_loc + V^e_nl
E^e_kin and V^e_loc are applied in reciprocal (k-) space, V^e_nl in real space
⇒ 3D-FFT to transform between real and reciprocal space


10. K-Point Parallelization
Bands have to be minimized for each k-point, but the minimization for each k-point is independent; all k-point data is needed only for the calculation of E_tot ⇒ straightforward parallelization.
ABINIT implementation:
  Good speedup :-)
  Uses only collective communication :-)
  Limited to nkpt processes :-(
  Uses MPI_COMM_WORLD :-(
  Uses MPI_BARRIER :-(

11. Band Parallelization
The Teter method allows a parallel CG, but the orthogonalization constraint forces a non-ideal solution ⇒ tricky parallelization.
ABINIT implementation:
  Speedup depends on the interconnect :-/
  Uses Send/Recv :-(
  Limited by nband / c (c not easily predictable)


13. G Parallelization: Vector Distribution
FFT ⇒ two parallelization schemes: distribute the plane-wave coefficients, or distribute the real-space FFT grid.
Goals: strict load balancing; minimize communication; possible to combine with band and k-point parallelization.
[Figure: a 15-entry coefficient vector distributed blockwise over PE0-PE3]

14. Real-Space Distribution
[Figure: 3D-FFT distribution. The FFT box is first distributed as z-planes for the 2D-FFTs, then redistributed as xy-lines for the remaining 1D-FFTs; each redistribution between the two layouts is an MPI_ALLTOALL.]

15. Implementation Issues
Necessary communication (complexity):
  dot products (O(1))
  computation of the kinetic energy (O(1))
  FFT transpose (O(natom))
Only collective communication:
  MPI_ALLREDUCE for reductions
  MPI_ALLTOALL for the FFT transpose
Principles: only collective communication; a separate communicator; simplification of the main code; heavy use of math libraries.

16. Benchmarking the Implementation of cgwf
[Figure: speedup vs. number of processors P (up to 64) for two systems, SiO2 with natom=43, nband=126, npw=48728 and SiO2 with natom=86, nband=251, npw=97624, compared against the linear-speedup line.]

17. Possible Reasons for Limited Scalability
Serial parts (Amdahl's law): allocations, scalar calculation, index reordering (packin/packout in the FFT).
Communication overhead: the latency of blocking collective operations limits scalability significantly; this overhead is modelled in the following.


19. The LogP Model
[Figure: LogP time diagram of a message from sender to receiver, showing the CPU send/receive overheads o_s and o_r, the network latency L, and the gap g.]
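For reference, the parameters in the diagram, as usually defined for LogP (with the send/receive overheads distinguished, as on this slide), give the cost of a single small message:

```latex
% Standard LogP parameters, as used on this slide:
%   L         upper bound on the network latency of a small message
%   o_s, o_r  CPU overhead to send / to receive a message
%   g         gap: minimum interval between consecutive messages
%   P         number of processors
% A single small point-to-point message then takes
t_{\text{msg}} = o_s + L + o_r
```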

20. Modelling MPI_ALLREDUCE
→ implemented as an MPI_REDUCE to node 0 followed by an MPI_BCAST (binomial trees over P0...P7 in the diagram), with f_s = max(o_s, g) and f_r = max(o_r, g):
t_red(P, size) = 2 · size · (2o + L + (⌈log₂ P⌉ − 1) · max{g, 2o + L})

21. Modelling MPI_ALLTOALL
→ each node has to send to all others.
[Figure: a single host sends its P−1 messages back to back, each send overhead o_s separated by the gap g; receivers incur o_r after the latency L.]
All hosts send; assuming full bisection bandwidth (FBB):
t_a2a(P, size) = size · ((2o + L) + (P − 1) · (g + o))
