Parallel Scaling of Teter's Minimization for Ab Initio Calculations
Torsten Hoefler
Department of Computer Science, Technical University of Chemnitz
HPCNano Workshop 2006, Supercomputing'06, Tampa, FL, USA
November 13th, 2006
Outline
1. Introduction
   - Introduction to ABINIT
   - Teter's Conjugate Gradient Minimization
2. Parallelization
   - Already Implemented Parallelization
   - A New Proposal
   - Verifying this Proposal
3. Hunting the Overlap
   - Non-blocking Collectives
ABINIT Introduction
- ABINIT solves the time-independent Schrödinger equation (effective one-particle case, using DFT):
  $\hat{H}_{tot} \Phi = E_{tot} \Phi$ ⇒ an eigenvalue problem
- eigenvalues and eigenvectors are determined with CG minimization (Teter et al.)
- the wavefunction $\Phi$ is written in a plane-wave basis set
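For reference, the plane-wave expansion behind the last bullet can be written out explicitly; this is the standard textbook form, not taken from the slides themselves:

```latex
% Standard plane-wave expansion of a Bloch state: the wavefunction at
% k-point k is a sum over reciprocal lattice vectors G, and the CG
% minimization operates on the coefficients c_{k,G}.
\Phi_{k}(\mathbf{r}) = \sum_{\mathbf{G}} c_{k,\mathbf{G}}\,
  e^{\,i(\mathbf{k}+\mathbf{G})\cdot\mathbf{r}}
```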
ABINIT Program Flow
[Flowchart: Start → Initialization (choose coefficients, calculate trial potential) → SCF cycle: minimize electronic energy → calculate electron density → calculate total energy → check convergence; while not converged: mix new density, calculate potential, repeat → converged → Stop]
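The flowchart translates into a simple iteration. The following is a minimal sketch of that control flow; every step function is a hypothetical placeholder, not an actual ABINIT routine:

```c
/* Minimal sketch of the SCF cycle from the flowchart above.
 * All step functions are hypothetical placeholders, not ABINIT routines. */
#include <stdbool.h>
#include <stdio.h>

static void choose_coefficients(void)        { puts("choose coefficients"); }
static void calculate_trial_potential(void)  { puts("calculate trial potential"); }
static void minimize_electronic_energy(void) { puts("minimize electronic energy"); }
static void calculate_electron_density(void) { puts("calculate electron density"); }
static void calculate_total_energy(void)     { puts("calculate total energy"); }
static void mix_new_density(void)            { puts("mix new density"); }
static void calculate_potential(void)        { puts("calculate potential"); }
static bool check_convergence(void)          { return true; /* placeholder */ }

int main(void) {
    choose_coefficients();              /* initialization */
    calculate_trial_potential();
    bool converged = false;
    while (!converged) {                /* SCF cycle */
        minimize_electronic_energy();   /* Teter CG: the dominant cost */
        calculate_electron_density();
        calculate_total_energy();
        converged = check_convergence();
        if (!converged) {
            mix_new_density();
            calculate_potential();
        }
    }
    return 0;
}
```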
ABINIT Tracing
Call tree with timing shares from a trace (percentages appear to be inclusive/exclusive time; the nesting is inferred from them):
- vtowfk (97.3% / 4.3%)
  - cgwf (83.6% / 1.3%)
    - projbd (36.0% / 36.0%)
    - fourwf (27.4% / 0.0%) → sg_fftrisc (27.4% / 5.7%) → sg_ffty (14.8% / 14.8%), sg_fftpx (6.6% / 6.6%)
    - nonlop (21.5% / 0.0%) → nonlop_pl (21.5% / 0.1%) → opernl4a (11.6% / 10.3%), opernl4b (9.8% / 8.7%)
  - orthon (5.7% / 5.6%)
⇒ 83% of the runtime is spent in the Teter minimization
Conjugate Gradient Operations
Dot and matrix-vector products:
- dot product: $\langle \Phi_i | \Phi_j \rangle$
- matrix-vector product: $\hat{H}\Phi$ with $\hat{H} = E_e^{kin} + V_e^{loc} + V_e^{nl}$
- $E_e^{kin}$ and $V_e^{loc}$ are applied in reciprocal (k-) space
- $V_e^{nl}$ is applied in real space
⇒ a 3D FFT transforms between real and reciprocal space
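When the plane-wave coefficients are distributed over processes (as in the G parallelization below), the dot product becomes a local partial sum followed by one global reduction. A minimal runnable sketch, simplified to real arithmetic (the actual coefficients are complex) and with invented names:

```c
/* Distributed dot product <Phi_i|Phi_j> over locally stored coefficients:
 * a local partial sum plus one O(1) global reduction. Real arithmetic
 * is assumed here for brevity. */
#include <mpi.h>
#include <stdio.h>

static double dot_product(const double *phi_i, const double *phi_j,
                          int npw_local, MPI_Comm comm) {
    double local = 0.0, global = 0.0;
    for (int g = 0; g < npw_local; g++)
        local += phi_i[g] * phi_j[g];         /* local partial sum */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    double d = dot_product(a, b, 4, MPI_COMM_WORLD);
    printf("dot = %f\n", d);                   /* 20 per participating rank */
    MPI_Finalize();
    return 0;
}
```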
K-Point Parallelization
- bands have to be minimized for each k-point
- the minimization for each k-point is independent
- data from all k-points is only needed for the calculation of $E_{tot}$
⇒ straightforward parallelization
ABINIT implementation:
- good speedup :-)
- uses only collective communication :-)
- limited to nkpt processes :-(
- uses MPI_COMM_WORLD :-(
- uses MPI_BARRIER :-(
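Since the per-k-point minimizations are independent, the distribution itself is simple. The sketch below shows one hypothetical round-robin assignment (not the actual ABINIT code) and makes the nkpt limit visible: once there are more ranks than k-points, the extra ranks get no work.

```c
/* Hypothetical round-robin distribution of independent k-points;
 * minimize_kpoint() is a placeholder for the per-k-point CG. */
#include <mpi.h>
#include <stdio.h>

static void minimize_kpoint(int ikpt) { printf("minimizing k-point %d\n", ikpt); }

int main(int argc, char **argv) {
    const int nkpt = 8;                       /* example value */
    int rank, nproc;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    /* each rank takes every nproc-th k-point; with nproc > nkpt some
     * ranks stay idle: the "limited to nkpt" drawback noted above */
    for (int ikpt = rank; ikpt < nkpt; ikpt += nproc)
        minimize_kpoint(ikpt);
    /* E_tot needs contributions from all k-points: a reduction follows here */
    MPI_Finalize();
    return 0;
}
```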
Band Parallelization
- the Teter method allows a parallel CG
- the orthogonalization constraint forces a non-ideal solution
⇒ tricky parallelization
ABINIT implementation:
- speedup depends on the interconnect :-/
- uses Send/Recv :-(
- limited to nband/c processes (c is not easily predictable)
G Parallelization
FFT ⇒ two parallelization schemes:
1. distribute the plane-wave coefficients
2. distribute the real-space FFT grid
Design goals (a balanced-block sketch follows below):
- strict load balancing
- minimize communication
- possible to combine with band and k-point parallelization
[Figure: a vector of 15 plane-wave coefficients distributed block-wise over PE0 to PE3]
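Strict load balancing of npw coefficients can be achieved with a near-equal block distribution; the following is a generic scheme sketched for illustration, not necessarily the exact distribution ABINIT uses:

```c
/* Near-equal block distribution of npw plane-wave coefficients over
 * nproc processes: the first (npw % nproc) ranks receive one extra
 * coefficient, so per-rank counts differ by at most one. */
#include <stdio.h>

static void block_range(int npw, int nproc, int rank,
                        int *first, int *count) {
    int base = npw / nproc, rem = npw % nproc;
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}

int main(void) {
    int first, count;
    for (int rank = 0; rank < 4; rank++) {    /* e.g. 15 coefficients, 4 PEs */
        block_range(15, 4, rank, &first, &count);
        printf("PE%d: coefficients %d..%d\n", rank, first, first + count - 1);
    }
    return 0;
}
```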
Real Space Distribution
[Figure: distribution of the 3D FFT over PE0 to PE3: the FFT box is first split into z-planes for 2D FFTs, then redistributed with MPI_ALLTOALL into xy-lines for the remaining 1D FFTs, and redistributed back with a second MPI_ALLTOALL]
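The transpose in the middle is a single collective in which every process exchanges one block with every other process. A minimal runnable sketch assuming equal block sizes; the packin/packout index reordering that surrounds the exchange in the real code is omitted:

```c
/* Transpose step of the distributed 3D FFT: every process exchanges
 * one equally sized block with every other process. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void fft_transpose(double *sendbuf, double *recvbuf,
                          int blocksize, MPI_Comm comm) {
    MPI_Alltoall(sendbuf, blocksize, MPI_DOUBLE,
                 recvbuf, blocksize, MPI_DOUBLE, comm);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nproc, rank, blocksize = 2;            /* toy block size */
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double *send = malloc(nproc * blocksize * sizeof *send);
    double *recv = malloc(nproc * blocksize * sizeof *recv);
    for (int i = 0; i < nproc * blocksize; i++) send[i] = rank;
    fft_transpose(send, recv, blocksize, MPI_COMM_WORLD);
    printf("rank %d received first value %g\n", rank, recv[0]);
    free(send); free(recv);
    MPI_Finalize();
    return 0;
}
```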
Implementation Issues
Necessary communication (complexity):
- dot products (O(1))
- computation of the kinetic energy (O(1))
- FFT transpose (O(natom))
Only collective communication:
- MPI_ALLREDUCE for reductions
- MPI_ALLTOALL for the FFT transpose
Principles:
- only collective communication
- a separate communicator
- simplification of the main code
- heavy usage of math libraries
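The "separate communicator" principle can be realized with MPI_Comm_dup, which yields a communicator with the same group but an isolated communication context, so the solver's collectives can never collide with traffic in the rest of the code. A minimal sketch (the function name is made up):

```c
/* Give the solver its own communicator: same process group as the
 * parent, but a private context for its collective operations. */
#include <mpi.h>

MPI_Comm make_solver_comm(MPI_Comm parent) {
    MPI_Comm solver_comm;
    MPI_Comm_dup(parent, &solver_comm);  /* same group, new context */
    return solver_comm;
}
```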
Benchmarking the Implementation of cgwf
[Plot: speedup vs. number of processors P, up to 64, compared against linear speedup, for two systems: SiO2 with natom=43, nband=126, npw=48728 and SiO2 with natom=86, nband=251, npw=97624]
Possible Reasons for Limited Scalability
Serial parts (Amdahl's law, quantified below):
- allocations
- scalar calculations
- index reordering (packin/packout in the FFT)
Communication overhead:
- the latency of blocking collective operations limits scalability significantly
- this overhead is modelled in the following
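Amdahl's law, referenced in the first item, makes the impact of the serial parts precise: with serial fraction $s$, the speedup on $P$ processors is bounded by $1/s$ no matter how large $P$ grows.

```latex
% Amdahl's law: the serial fraction s limits the achievable speedup.
S(P) = \frac{1}{\,s + \frac{1-s}{P}\,}
       \quad\Longrightarrow\quad
       \lim_{P \to \infty} S(P) = \frac{1}{s}
```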
The LogP Model
[Diagram: time lines of sender and receiver at the CPU and network level, showing the LogP parameters: send overhead $o_s$, receive overhead $o_r$, network latency $L$, and gap $g$]
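In this model, the end-to-end cost of a single small message is the send overhead plus the wire latency plus the receive overhead; with $o_s \approx o_r \approx o$ this gives the $2o + L$ term that recurs in both collective models below:

```latex
% LogP cost of one small point-to-point message.
t_{msg} = o_s + L + o_r \approx 2o + L
```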
Modelling the MPI_ALLREDUCE
→ an MPI_REDUCE to node 0 followed by an MPI_BCAST
[Diagram: binomial reduction and broadcast trees over P0 to P7, with per-stage send and receive costs $f_s = \max(o_s, g)$ and $f_r = \max(o_r, g)$]
$$t_{red}(P, size) = 2 \cdot size \cdot \Big( 2o + L + \big(\lceil \log_2 P \rceil - 1\big) \cdot \max\{g,\, 2o + L\} \Big)$$
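The model is straightforward to evaluate numerically; the sketch below plugs in illustrative LogP parameter values (the constants are invented for demonstration, not measured):

```c
/* Evaluate the MPI_ALLREDUCE model
 * t_red(P, s) = 2*s*(2o + L + (ceil(log2 P) - 1) * max(g, 2o + L)).
 * Parameters in microseconds; the values below are illustrative only. */
#include <math.h>
#include <stdio.h>

static double t_red(int P, double size, double L, double o, double g) {
    double stage = fmax(g, 2.0 * o + L);       /* per-tree-level cost */
    double depth = ceil(log2((double)P)) - 1.0;
    return 2.0 * size * (2.0 * o + L + depth * stage);
}

int main(void) {
    for (int P = 2; P <= 64; P *= 2)           /* L=5, o=2, g=1 (made up) */
        printf("P=%2d  t_red=%.1f us\n", P, t_red(P, 1.0, 5.0, 2.0, 1.0));
    return 0;
}
```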
Modelling the MPI_ALLTOALL
→ each node has to send to all others
[Diagram: a single host sends to $P-1$ peers, paying the send overhead $o_s$ and a gap $g$ between consecutive messages]
All hosts send; assuming full bisection bandwidth (FBB):
$$t_{a2a}(P, size) = size \cdot \big( (2o + L) + (P - 1) \cdot (g + o) \big)$$
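Coded the same way, the all-to-all model exposes the key contrast with the reduction: its cost grows linearly in $P$ rather than logarithmically (parameter values again invented for illustration):

```c
/* Evaluate the MPI_ALLTOALL model
 * t_a2a(P, s) = s * ((2o + L) + (P - 1) * (g + o));
 * linear in P, unlike the O(log P) reduction model above. */
#include <stdio.h>

static double t_a2a(int P, double size, double L, double o, double g) {
    return size * ((2.0 * o + L) + (P - 1) * (g + o));
}

int main(void) {
    for (int P = 2; P <= 64; P *= 2)           /* same made-up L=5, o=2, g=1 */
        printf("P=%2d  t_a2a=%.1f us\n", P, t_a2a(P, 1.0, 5.0, 2.0, 1.0));
    return 0;
}
```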