Scalable GW software for excited electrons using OpenAtom
Kavitha Chandrasekar, Eric Mikida, Eric Bohm and Laxmikant Kale, University of Illinois at Urbana-Champaign
Kayahan Saritas, Minjung Kim and Sohrab Ismail-Beigi, Yale University
Glenn Martyna, Pimpernel Science, Software and Information Technology
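Electronic structure calculations
• Time-independent Schrödinger equation for a many-body system: $\hat{H}\,\Psi(\{\mathbf{R}_i\},\{\mathbf{r}_j\}) = E\,\Psi(\{\mathbf{R}_i\},\{\mathbf{r}_j\})$, with many nuclear coordinates $\mathbf{R}_i$ and electronic coordinates $\mathbf{r}_j$
• Density functional theory (DFT) simplifies this to a one-body problem: solve for wavefunctions $\psi_n(\mathbf{r})$ and energies $\varepsilon_n$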
Comparison of the methods
[Figure: accuracy versus computational cost and number of atoms (1 to 10,000), ranging from tight binding up to the exact Schrödinger equation. Methods shown: FCI, CCSD(T), QMC, GW, HF, DFT, tight binding. Scaling labels shown: $O(N!)$, $O(N^7)$, $O(N^{3-4})$, $O(N^3)$. Accuracy labels: chemical accuracy, relative energies, transition states?]
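DFT problem with excitations
[Figure: DFT ground-state band diagram with a filled valence band topped by $\varepsilon_N$, an empty conduction band starting at $\varepsilon_{N+1}$, and the band gap $E_g$ between them]
Janak's theorem: $E_{gap} = \left.\frac{\partial E}{\partial n}\right|_{N+\delta} - \left.\frac{\partial E}{\partial n}\right|_{N-\delta} = \varepsilon_{N+1} - \varepsilon_N$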
DFT problem with excitations
Why are band gaps/excitations in a material important?
• Metallic, semiconducting or insulating?
• Light-matter interactions in general
• A lot of engineering implications: PV, lasers, luminescence …
Band gaps (eV)
Material   DFT   GW        Expt.
Diamond    3.9   5.6*      5.48
Si         0.5   1.3*      1.17
LiCl       6.0   9.1*      9.4
SrTiO3     2.0   3.4-3.8   3.25
GW method
Challenges
• Memory intensive
• Much larger number of conduction bands: huge number of FFTs
• Large and dense matrix multiplications
• Unfavorable scaling: $O(N^4)$
Goal
• Efficient and highly scalable GW software
• $O(N^3)$ scaling method
What is expensive in GW?
$P(\mathbf{r},\mathbf{r}') = -2\sum_{v}^{occ}\sum_{c}^{unocc} \frac{\psi_v(\mathbf{r})^*\,\psi_c(\mathbf{r})\,\psi_c(\mathbf{r}')^*\,\psi_v(\mathbf{r}')}{E_c - E_v}$, cost $\sim N_{occ} N_{unocc} N_r^2 = O(N^4)$
• Lots of FFTs to get the $\psi_n(\mathbf{r})$ functions: cost $\sim (N_{occ} + N_{unocc})\, N_r \ln N_r$
• However, $P_{r,r'}$ can converge using a small r-grid
* Kim et al. (2020), Phys. Rev. B 101, 035139
$O(N^3)$ algorithm (CTSP) for P
CTSP: Complex time shredded propagator
$P_{r,r'} = -2\sum_{v}^{N_{occ}}\sum_{c}^{N_{unocc}} \frac{\psi^*_{r,v}\,\psi_{r,c}\,\psi^*_{r',c}\,\psi_{r',v}}{E_c - E_v}$, cost $N_r^2 N_{unocc} N_{occ} \sim N^4$
Generic form: $\sum_{i}^{N_a}\sum_{j}^{N_b} \frac{A^i_{r,r'}\,B^j_{r,r'}}{x + a_i - b_j}$
(1) Laplace transform: $\frac{1}{E_c - E_v} = \int_0^\infty e^{-(E_c - E_v)\tau}\,d\tau = \int_0^\infty e^{-E_c\tau}\,e^{E_v\tau}\,d\tau$
(2) Gauss-Laguerre quadrature: $\int_0^\infty f(\tau)\,e^{-\tau}\,d\tau \approx \sum_{k}^{N_q} w_k\, f(\tau_k)$, cost $N_r^2 N_q (N_{unocc} + N_{occ}) \sim N^3$
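A minimal sketch of steps (1) and (2) for a single energy gap, assuming the standard tabulated 3-point Gauss-Laguerre rule. The function name `inverse_gap` and the `scale` parameter are hypothetical illustrations, not part of OpenAtom.

```python
import math

# Standard tabulated 3-point Gauss-Laguerre nodes/weights for the weight exp(-t).
NODES = (0.4157745568, 2.2942803603, 6.2899450829)
WEIGHTS = (0.7110930099, 0.2785177336, 0.0103892565)

def inverse_gap(gap, scale=1.0):
    """Approximate 1/gap = integral_0^inf exp(-gap*tau) dtau, with tau = scale*t.

    Requires gap > 0 (that is, E_c > E_v). `scale` is a hypothetical tuning
    knob; the rule is most accurate when gap*scale is close to 1.
    """
    if gap <= 0:
        raise ValueError("the Laplace transform needs a positive energy gap")
    x = gap * scale
    # exp(-x*t) = exp(-(x-1)*t) * exp(-t); the last factor is the quadrature weight.
    return scale * sum(w * math.exp(-(x - 1.0) * t) for t, w in zip(NODES, WEIGHTS))

approx = inverse_gap(1.5)   # quadrature estimate of 1/1.5
exact = 1.0 / 1.5
```

Because the integrand is split into a smooth part times `exp(-t)`, only three evaluations are needed when the scaled gap is near 1; larger spreads of gaps need more points or a different scale.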
$O(N^3)$ algorithm (CTSP) for P
$P_{r,r'} = -2\sum_{v}^{N_{occ}}\sum_{c}^{N_{unocc}} \psi^*_{r,v}\,\psi_{r',v}\,\psi_{r,c}\,\psi^*_{r',c} \sum_{k}^{N_q} w_k\, e^{-(E_c - E_v)\tau_k}$
$\phantom{P_{r,r'}} = -2\sum_{k}^{N_q} w_k \Big[\sum_{v}^{N_{occ}} \psi^*_{r,v}\,\psi_{r',v}\, e^{E_v\tau_k}\Big]\Big[\sum_{c}^{N_{unocc}} \psi_{r,c}\,\psi^*_{r',c}\, e^{-E_c\tau_k}\Big]$, cost $N_q (N_{unocc} + N_{occ}) N_r^2 \sim N^3$
(3) Energy windows: $P_{r,r'} = \sum_{l}^{N_{vw}}\sum_{m}^{N_{cw}} P^{lm}_{r,r'}$
[Figure: panels (a) and (b) illustrating the energy windows]
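The rearrangement above can be checked numerically on toy data. The sketch below assumes random complex states and a 3-point quadrature, and compares the quartic loop (quadrature applied to each denominator) against the factorized cubic loop. All names and array sizes are illustrative.

```python
import math
import random

random.seed(0)
N_R, N_OCC, N_UNOCC = 4, 2, 3

# 3-point Gauss-Laguerre rule folded into generalized weights, so that
# 1/gap ~ sum_k W[k] * exp(-gap * TAU[k]).
TAU = (0.4157745568, 2.2942803603, 6.2899450829)
W = tuple(w * math.exp(t)
          for w, t in zip((0.7110930099, 0.2785177336, 0.0103892565), TAU))

def random_state():
    return [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(N_R)]

psi_v = [random_state() for _ in range(N_OCC)]     # occupied states psi_v[v][r]
psi_c = [random_state() for _ in range(N_UNOCC)]   # unoccupied states psi_c[c][r]
e_v = [-1.0, -0.5]
e_c = [0.5, 1.0, 1.5]

def p_quartic():
    """Loop over (r, r', v, c): N_r^2 * N_occ * N_unocc terms."""
    P = [[0j] * N_R for _ in range(N_R)]
    for r in range(N_R):
        for rp in range(N_R):
            for v in range(N_OCC):
                for c in range(N_UNOCC):
                    inv_gap = sum(w * math.exp(-(e_c[c] - e_v[v]) * t)
                                  for w, t in zip(W, TAU))
                    P[r][rp] += (-2 * psi_v[v][r].conjugate() * psi_c[c][r]
                                 * psi_c[c][rp].conjugate() * psi_v[v][rp] * inv_gap)
    return P

def p_cubic():
    """Loop over (k, r, r') with separate v and c sums: N_q * N_r^2 * (N_occ + N_unocc)."""
    P = [[0j] * N_R for _ in range(N_R)]
    for w, t in zip(W, TAU):
        for r in range(N_R):
            for rp in range(N_R):
                occ = sum(psi_v[v][r].conjugate() * psi_v[v][rp] * math.exp(e_v[v] * t)
                          for v in range(N_OCC))
                unocc = sum(psi_c[c][r] * psi_c[c][rp].conjugate() * math.exp(-e_c[c] * t)
                            for c in range(N_UNOCC))
                P[r][rp] += -2 * w * occ * unocc
    return P

Pq, Pc = p_quartic(), p_cubic()
max_diff = max(abs(Pq[r][rp] - Pc[r][rp]) for r in range(N_R) for rp in range(N_R))
```

Both functions use the same quadrature, so they agree to rounding error; the only difference is the order of summation, which is what removes one power of N.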
Steps for typical GW calculations
Most expensive
• Real-space P
• $O(N^3)$ method
Also expensive: $O(N^4)$
$O(N^3)$ method for self-energy
$\Sigma^{\pm}(\omega)^{dyn}_{r,r'} = \sum_{p,n} \frac{B^p_{r,r'}\,\psi_{r,n}\,\psi^*_{r',n}}{\omega - E_n \pm \omega_p}$
$B^p_{r,r'}$: residues; $\omega_p$: energies of the poles of $W(\omega)_{r,r'}$
Generic form: $\sum_{i}^{N_a}\sum_{j}^{N_b} \frac{A^i_{r,r'}\,B^j_{r,r'}}{x + a_i - b_j}$
• $\omega - E_n \pm \omega_p = 0$ is possible: Gauss-Laguerre quadrature not applicable
• New quadrature is needed and was developed: Hermite-Gauss-Laguerre quadrature
$\frac{1}{\omega - E_n \pm \omega_p} = \mathrm{Im}\int_0^\infty d\tau\,(\text{damping factor in } \tau)\; e^{i(\omega - E_n \pm \omega_p)\tau}$
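A short note on why the Laplace-transform route used for P does not carry over, written for a generic real denominator $x$ (a symbol introduced only for this note):

```latex
\frac{1}{x} = \int_0^\infty e^{-x\tau}\,d\tau \quad \text{holds only for } x > 0,
\qquad
\int_0^\infty e^{-x\tau}\,d\tau \ \text{diverges for } x \le 0 .
```

For P the denominator is a gap between an unoccupied and an occupied state, so it stays positive in a gapped system. For the self-energy the denominator depends on $\omega$ and can change sign or vanish, which is why an oscillatory, damped integral is used instead.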
Results: Energy gap
• MgO crystal (16 atoms): number of bands 433, $N_{wv} = 1$, $N_{wc} = 4$
• Si crystal (16 atoms): number of bands 399, $N_{wv} = 1$, $N_{wc} = 4$
* Kim et al. (2020), Phys. Rev. B 101, 035139
Performance against other codes
• Si crystal (16 atoms)
• Number of bands: 399
• $N_{pw} = 15$, $N_{nw} = 30$
http://charm.cs.illinois.edu/OpenAtom/
* Kim et al. (2019), Comput. Phys. Commun. 244, 427-441
OpenAtom GW Parallel Scaling
OpenAtom Team
GW-BSE Parallelization
Phase                                            Serial     Parallel
1  Compute P in Rspace (N^4 and N^3 methods)     Complete   Complete
2  FFT P to GSpace                               Complete   Complete
3  Invert epsilon                                Complete   Complete
4  Plasmon pole                                  Complete   Future Work
5  COHSEX Self-energy                            Complete   Complete
6  Dynamic Self-energy                           Complete   Future Work
GW Phase-I P Matrix Computation (N^4 and N^3 method)
[Figure: Ψ vectors stored in a 1D chare array (L occupied, M unoccupied, each of length R); P matrix (R x R) stored as 2D tiles in a 2D chare array]
Parallel Decomposition: Input state vectors
Duplicate occupied and unoccupied states on each node
Computation of P matrix using N^3 method
• Outer loops are windows of occupied and unoccupied states
• Most expensive computation: ρ and ρ′ matrices
for l = 1:Nvw
  for m = 1:Ncw
    for j = 1:Nquad_lm
      calculate ρ (for window pair l, m and quadrature point j)
      calculate ρ′ (for window pair l, m and quadrature point j)
      P[r,r'] += ρ[r,r'] x ρ′[r,r']
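The loop nest above can be sketched end to end on toy data. The example assumes real-valued states, two windows per band group, a 3-point Gauss-Laguerre rule, and a per-window-pair time scale set by the smallest gap of that pair. That scale choice and all data are hypothetical, not the production algorithm.

```python
import math

# 3-point Gauss-Laguerre rule for the weight exp(-t) (standard tabulated values).
NODES = (0.4157745568, 2.2942803603, 6.2899450829)
WEIGHTS = (0.7110930099, 0.2785177336, 0.0103892565)

N_R = 3
# Toy real-valued states psi[state][r] and their energies.
psi_v = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6], [-0.7, 0.2, 0.5], [0.1, 0.9, 0.3]]
psi_c = [[0.2, 0.6, -0.4], [-0.5, 0.1, 0.8], [0.6, -0.3, 0.2], [0.4, 0.4, -0.1]]
e_v = [-2.0, -1.8, -0.6, -0.5]
e_c = [0.5, 0.6, 2.0, 2.2]
v_windows = [[0, 1], [2, 3]]   # Nvw = 2 windows of occupied states
c_windows = [[0, 1], [2, 3]]   # Ncw = 2 windows of unoccupied states

def p_windowed():
    P = [[0.0] * N_R for _ in range(N_R)]
    for vw in v_windows:                          # for l = 1:Nvw
        for cw in c_windows:                      # for m = 1:Ncw
            # Assumed choice: rescale tau by the smallest gap in this window pair.
            scale = 1.0 / (min(e_c[c] for c in cw) - max(e_v[v] for v in vw))
            for t, w in zip(NODES, WEIGHTS):      # for j = 1:Nquad_lm
                tau = scale * t
                wk = scale * w * math.exp(t)      # weight multiplying exp(-gap*tau)
                for r in range(N_R):
                    for rp in range(N_R):
                        rho = sum(psi_v[v][r] * psi_v[v][rp] * math.exp(e_v[v] * tau)
                                  for v in vw)
                        rho_p = sum(psi_c[c][r] * psi_c[c][rp] * math.exp(-e_c[c] * tau)
                                    for c in cw)
                        P[r][rp] += -2.0 * wk * rho * rho_p
    return P

def p_exact():
    """Direct sum over all (v, c) pairs with exact energy denominators."""
    P = [[0.0] * N_R for _ in range(N_R)]
    for r in range(N_R):
        for rp in range(N_R):
            for v in range(len(e_v)):
                for c in range(len(e_c)):
                    P[r][rp] += (-2.0 * psi_v[v][r] * psi_c[c][r] * psi_c[c][rp]
                                 * psi_v[v][rp] / (e_c[c] - e_v[v]))
    return P

Pw, Pe = p_windowed(), p_exact()
max_err = max(abs(Pw[r][rp] - Pe[r][rp]) for r in range(N_R) for rp in range(N_R))
```

Within each window pair the gaps span a narrow range, so a short quadrature rule suffices; this is the motivation for looping over windows rather than using one rule for all states.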
Computation of the ρ matrix (using occupied states)
• State vectors are represented with ψ
  – Number of occupied states = L, each state has N elements
  – All occupied states can be represented as a matrix ψ_V[1:L][1:N]
ρ -> Add elements of outer product of ψ_V[1:L]
for l=1:L
  for r=1:N
    for r'=1:N
      ρ[r,r'] += ψ_V[l]^T[r] x ψ_V[l][r']
ρ -> Same as ZGEMM of all ψ_V and all ψ_V^T: ZGEMM(ψ_V^T[1:N][1:L], ψ_V[1:L][1:N]) (i.e. matrix multiply)
for r=1:N
  for r'=1:N
    for l=1:L
      ρ[r,r'] += ψ_V^T[r][l] x ψ_V[l][r']
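The equivalence of the two loop orderings can be sketched in a few lines. This toy version uses plain Python lists in place of a BLAS ZGEMM call and follows the pseudocode in using a plain transpose; the function names and data are illustrative.

```python
def rho_outer(psi):
    """Accumulate rho[r][r'] state by state (sum of outer products)."""
    n = len(psi[0])
    rho = [[0j] * n for _ in range(n)]
    for state in psi:                       # for l = 1:L
        for r in range(n):
            for rp in range(n):
                rho[r][rp] += state[r] * state[rp]
    return rho

def rho_matmul(psi):
    """Same result written as one (N x L) times (L x N) matrix product."""
    num_states, n = len(psi), len(psi[0])
    psi_t = [[psi[l][r] for l in range(num_states)] for r in range(n)]  # transpose
    return [[sum(psi_t[r][l] * psi[l][rp] for l in range(num_states))
             for rp in range(n)]
            for r in range(n)]

# Toy data with L = 3 states of N = 4 elements each.
psi_v = [[complex(l + 1, r - l) for r in range(4)] for l in range(3)]
same = rho_outer(psi_v) == rho_matmul(psi_v)
```

Casting the accumulation as a single matrix product is what lets an optimized library routine such as ZGEMM do the heavy lifting.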
Computation of the ρ′ matrix (using unoccupied states)
• Number of unoccupied states = M, each state has N elements
• All unoccupied states can be represented as a matrix ψ_C[1:M][1:N]
ρ′ -> Add elements of outer product of ψ_C[1:M]
for m=1:M
  for r=1:N
    for r'=1:N
      ρ′[r,r'] += ψ_C[m]^T[r] x ψ_C[m][r']
ρ′ -> Same as ZGEMM of all ψ_C and all ψ_C^T: ZGEMM(ψ_C^T[1:N][1:M], ψ_C[1:M][1:N]) (i.e. matrix multiply)
for r=1:N
  for r'=1:N
    for m=1:M
      ρ′[r,r'] += ψ_C^T[r][m] x ψ_C[m][r']
Computation of P-matrix (tiled) (N^3)
[Figure: occupied states ψ_V(1:L) (L x N) and unoccupied states ψ_C(1:M) (M x N) each feed a ZGEMM that produces the N x N ρ and ρ′ matrices; the N x N P matrix is the element-wise multiply of the ρ and ρ′ matrices]
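A sketch of the tiled decomposition, assuming real-valued toy states and omitting the quadrature weights and exponential factors for brevity. Each tile needs only the rows and columns of the grid it covers, so tiles can be computed independently; `p_tile` is a hypothetical helper name.

```python
def p_tile(psi_v, psi_c, rows, cols):
    """Compute the P tile covering grid points rows x cols."""
    tile = []
    for r in rows:
        tile_row = []
        for rp in cols:
            rho = sum(s[r] * s[rp] for s in psi_v)     # block of the rho matrix
            rho_p = sum(s[r] * s[rp] for s in psi_c)   # block of the rho' matrix
            tile_row.append(rho * rho_p)               # element-wise multiply
        tile.append(tile_row)
    return tile

N, TILE = 4, 2
psi_v = [[1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 1.0, 2.0]]                        # L = 2
psi_c = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0], [2.0, 0.0, -1.0, 1.0]]  # M = 3

# Reference: the whole matrix computed as a single tile.
full = p_tile(psi_v, psi_c, range(N), range(N))

# Tiled: each (rows, cols) pair could be handled by its own chare.
blocks = [range(start, start + TILE) for start in range(0, N, TILE)]
assembled = [[0.0] * N for _ in range(N)]
for rows in blocks:
    for cols in blocks:
        tile = p_tile(psi_v, psi_c, rows, cols)
        for i, r in enumerate(rows):
            for j, rp in enumerate(cols):
                assembled[r][rp] = tile[i][j]
```

No tile reads another tile's output, which is the property that makes a 2D array of independent workers a natural fit.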
Performance of N^3 method
[Figure: execution time vs. node count (8, 16, 32, 64) for the N^4 and N^3 methods on Intel KNL nodes (Stampede2, 128 cores per node)]
[Figure: execution time vs. node count (8, 16, 32, 64) for the N^4 and N^3 methods on Intel Skylake nodes (Stampede2, 48 cores per node)]
• N^3 method is an order of magnitude faster than the N^4 method for the Si108 (108-atom) dataset
  – 20k x 20k output matrix size
• Scales well on Intel KNL and Skylake nodes
• Future work: scaling results for larger datasets
Questions?