Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing

  1. Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing
     Vikram Gavini, Department of Mechanical Engineering and Department of Materials Science and Engineering, University of Michigan, Ann Arbor
     Collaborators: Sambit Das (U. Mich); Phani Motamarri (U. Mich); Bruno Turcksin (ORNL); Ying Wai Li (ORNL/LANL); Brent Leback (Nvidia)
     Funding: DoE-BES, ARO, AFOSR, TRI, XSEDE, NERSC, ALCF, OLCF
     SMC 2019

  2. Impact of Density Functional Theory
     - Citations to the seminal work of Walter Kohn (1964, 1965). Data compiled from Web of Science.
     - 12 of the 100 most-cited papers in scientific literature pertain to DFT! (Nature 514, 550 (2014))

  3. DFT codes
     - ~100 available DFT codes developed since 1980
     - Relationship to HPC. Courtesy: Anubhav Jain. Data compiled from Web of Science.
     - Key issues:
       - Lack of good parallel scalability of existing DFT codes
       - Computational complexity of DFT calculations (O(N^3))

  4. Need for large scale DFT calculations
     - Chemical properties of nanoparticles
     - Biological systems
     - Rocksalt phase formation during lithiation of magnetite: He et al., Nature Comm. (2016)
     - Defects in materials
       - Edge dislocation: Iyer et al., J. Mech. Phys. Solids (2015)
       - Screw dislocation: Das & Gavini, J. Mech. Phys. Solids (2017)

  5. Technological challenge of low ductility in Mg
     - Magnesium is the lightest structural material, with a high strength-to-weight ratio.
     - 75% lighter than steel and 30% lighter than aluminum.
     - Every 10% reduction in the weight of a vehicle will result in a 6-8% increase in fuel efficiency.
     - Important implications for fuel efficiency and reducing the carbon footprint.
     - Low ductility is the key issue in the manufacturability of structural components, and the main limitation in the adoptability of Mg and Mg alloys in the automotive and aerospace sectors. (T.M. Pollock, Science 328, 986-987 (2010))
     - Current state of the art: hybrid steel and aluminum construction.
     Courtesy: https://www.audi-technology-portal.de/en/body; S. Sandlöbes et al., Scientific Reports 7, 10458 (2017).

  6. Technological challenge of low ductility in Mg
     - 4 slip planes in face-centered cubic crystals → higher ductility
     - Dislocations are energetically more favorable to reside on certain slip systems. (Energetics)
     - Dislocation glide occurs after the applied shear stress is greater than the Peierls barrier. (Activation barrier)
     - The more slip systems on which dislocations can glide easily, the higher the ductility.
     [Figure: slip systems labeled Basal, Prism I, Prism II, Pyramidal I, Pyramidal II]

  7. Density Functional Theory
     - Kohn-Sham eigenvalue problem
     - Self-consistent iteration (Kohn-Sham map)
     - Orbital occupancy
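
The equations behind the three headings on this slide are not present in the transcript. For orientation only, a standard textbook form of these quantities is sketched below. It is not necessarily the slide's own notation, and it assumes a spin-unpolarized system with Fermi-Dirac smearing at temperature $T$, with $\mu$ the Fermi energy and $N_e$ the number of electrons.

```latex
\begin{align}
\left(-\tfrac{1}{2}\nabla^2 + V_{\mathrm{eff}}(\rho,\mathbf{R})\right)\psi_i
  &= \epsilon_i\,\psi_i, \qquad i = 1, 2, \dots, N, \\
\rho(\mathbf{r})
  &= 2\sum_{i=1}^{N} f(\epsilon_i,\mu)\,\lvert\psi_i(\mathbf{r})\rvert^2, \\
f(\epsilon,\mu)
  &= \frac{1}{1+\exp\!\left(\frac{\epsilon-\mu}{k_B T}\right)},
  \qquad 2\sum_{i=1}^{N} f(\epsilon_i,\mu) = N_e .
\end{align}
```

Because $V_{\mathrm{eff}}$ depends on $\rho$, and $\rho$ depends on the eigenpairs, the problem is solved by iterating the map $\rho_{\mathrm{in}} \mapsto \rho_{\mathrm{out}}$ to a fixed point.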

  8. DFT – Finite Element discretization
     - Use a finite-element basis for the computation
     - Features of the FE basis:
       - Systematic convergence (element size, polynomial order)
       - Adaptive refinement
       - Complex geometries and boundary conditions
       - Potential for excellent parallel scalability
     - By changing the positioning of the nodes, the spatial resolution of the basis can be changed/adapted.
     [Figure: 1D shape functions N_1(r), N_2(r), N_3(r) over elements i = 1, 2, …]

  9. Higher (polynomial) order FE basis
     - I. Cu nanoparticle: 55 atoms
     - II. Mo periodic supercell w/ vacancy: 53 atoms
     - ~1000x advantage by using a higher-order FE basis!

  10. Spatial adaptivity of the FE basis (Motamarri et al., J. Comput. Phys. (2013); Motamarri et al., Comput. Phys. Commun. (2019))
      - Error analysis
      - Optimal FE mesh

      | System type (pyr II dislocation) | DoFs, uniform mesh | DoFs, adaptive mesh |
      |---|---|---|
      | 1848 atom Mg | 347,206,614 | 55,112,161 |
      | 6164 atom Mg | 892,047,315 | 179,034,231 |

  11. Eigen-space computation: Chebyshev acceleration (Zhou et al., J. Comput. Phys. 219 (2006); Motamarri et al., J. Comput. Phys. 253, 308-343 (2013))
      - Kohn-Sham eigenvalue problem for k = 1, 2, …, N (N ~ 1.1 N_e/2)
      - Chebyshev filtering
      [Figure: wanted and unwanted spectrum, shown for the eigenvalue problem and for Chebyshev filtering]
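
A minimal NumPy sketch of the kind of filter this slide refers to, assuming the scaled three-term Chebyshev recurrence commonly used for this purpose. The function name and the parameters `a`, `b` (bounds of the unwanted spectrum) and `a0` (lower bound of the wanted spectrum) are illustrative choices, not names taken from the slides or from any particular code.

```python
import numpy as np

def chebyshev_filter(H, X, degree, a, b, a0):
    """Apply a scaled Chebyshev polynomial of H (degree >= 1) to the block X.

    Components of X along eigenvectors with eigenvalues in [a, b] are damped;
    the polynomial is normalized to equal 1 at a0 to avoid overflow.
    """
    e = (b - a) / 2.0          # half-width of the unwanted interval
    c = (b + a) / 2.0          # center of the unwanted interval
    sigma = e / (a0 - c)
    sigma1 = sigma
    Y = (H @ X - c * X) * (sigma1 / e)
    for _ in range(2, degree + 1):
        sigma2 = 1.0 / (2.0 / sigma1 - sigma)
        Y_new = 2.0 * (sigma2 / e) * (H @ Y - c * Y) - (sigma * sigma2) * X
        X, Y = Y, Y_new        # shift the three-term recurrence
        sigma = sigma2
    return Y

# Toy check on a diagonal "Hamiltonian" with eigenvalues 0, 1, ..., 10,
# treating [2.5, 10] as the unwanted part of the spectrum.
H = np.diag(np.arange(11.0))
X0 = np.ones((11, 1))
Y = chebyshev_filter(H, X0, degree=20, a=2.5, b=10.0, a0=0.0)
print(np.abs(Y[:3, 0]))          # components below a (eigenvalues 0, 1, 2)
print(np.abs(Y[3:, 0]).max())    # largest component inside [a, b]
```

Only products of `H` with a block of vectors are needed, which is why the filter maps well onto blocked matrix-matrix kernels.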

  12. Numerical algorithm
      1. Start with an initial guess for the electron density and the initial wavefunctions.
      2. Compute the discrete Hamiltonian using the input electron density.
      3. Chebyshev filtering (CF).
      4. Orthonormalize the CF basis.
      5. Rayleigh-Ritz procedure:
         - Compute the projected Hamiltonian
         - Diagonalize
         - Subspace rotation
      6. Compute the electron density.
      7. If converged, EXIT; else, compute a new input electron density using a mixing scheme and go to (2).

  13. Chebyshev Filtering
      [Figure: FE cell-level layout of DoFs (DoF 1, DoF 2, …), annotated with the number of FE cells]

  14. Chebyshev Filtering: strided batched xGEMM

  15. Chebyshev Filtering
      - Atomic operations to avoid race conditions in addition
      - Assembly across processor boundaries: communication in FP32
      [Figure: additions into DoF 1, DoF 2, …; "Repeat for" label]
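
The three Chebyshev filtering slides above describe a gather, batched multiply, scatter-add pattern at the FE cell level. The sketch below mimics that pattern in NumPy on a toy 1D mesh. It is an illustration under stated assumptions, not the presenters' GPU implementation: `apply_cellwise`, `H_cell` and `cell_dofs` are hypothetical names, `np.matmul` on a 3D array stands in for a strided batched xGEMM call, and `np.add.at` stands in for atomic additions. The FP32 communication across processor boundaries is not modeled.

```python
import numpy as np

def apply_cellwise(H_cell, cell_dofs, X):
    """Compute Y = H X from FE cell-level matrices.

    H_cell:    (n_cells, n_loc, n_loc) cell-level Hamiltonian matrices
    cell_dofs: (n_cells, n_loc) global DoF indices of each cell
    X:         (n_dofs, n_vec) block of wavefunction vectors
    """
    X_cell = X[cell_dofs]               # gather: (n_cells, n_loc, n_vec)
    Y_cell = np.matmul(H_cell, X_cell)  # one small GEMM per cell, batched
    Y = np.zeros_like(X)
    np.add.at(Y, cell_dofs, Y_cell)     # scatter-add; shared DoFs accumulate
    return Y

# Toy 1D mesh: 4 linear elements, 5 nodes, neighbouring cells share a node.
cell_dofs = np.array([[0, 1], [1, 2], [2, 3], [3, 4]])
H_cell = np.tile(np.array([[1.0, -1.0], [-1.0, 1.0]]), (4, 1, 1))

# Dense reference matrix assembled the usual way, for comparison only.
H_dense = np.zeros((5, 5))
for dofs, h in zip(cell_dofs, H_cell):
    H_dense[np.ix_(dofs, dofs)] += h

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = apply_cellwise(H_cell, cell_dofs, X)
print(np.allclose(Y, H_dense @ X))
```

A plain `Y[cell_dofs] += Y_cell` would silently drop contributions at nodes shared by two cells, which is the race condition that the atomic additions on the slide guard against.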

  16. Performance of Chebyshev filtering (Summit)
      Case study: Mg 3x3x3 supercell with a vacancy (1070 electrons)
      - Fig: 14.7x GPU speedup for Chebyshev filtering. The CPU run used 2 Summit nodes with 42 MPI tasks per node, while the GPU run used 2 Summit nodes with 12 GPUs (3 MPI tasks per GPU).
      - Fig: Chebyshev filtering throughput on 2 Summit nodes using 12 GPUs (3 MPI tasks per GPU) for various block sizes. The FP64 peak of 2 Summit nodes is 87.6 TFLOPS.

  17. Orthogonalization: Cholesky Gram-Schmidt
      - Cholesky factorization of the overlap matrix
      - Orthonormal basis construction
      Blocked approach to reduce peak memory: copy block to CPU (if computation performed on GPU), MPI_Allreduce, fill ScaLAPACK parallelized S matrix
      Mixed precision computation for Chol-GS:
      1. Computation of the overlap matrix S
      2. Cholesky factorization of S in double precision
      3. Orthonormal basis construction
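
A small NumPy sketch of Cholesky Gram-Schmidt with a mixed precision overlap matrix. The transcript does not say which parts of the computation use reduced precision, so the split used here is an assumption made for illustration: diagonal blocks of S in FP64, off-diagonal blocks in FP32. The function names and the block size are likewise hypothetical, and the basis construction is kept entirely in FP64 for simplicity.

```python
import numpy as np

def overlap_mixed(X, block):
    """Build S = X^T X block by block.

    Assumed precision split: diagonal blocks in FP64, off-diagonal in FP32.
    """
    n = X.shape[1]
    S = np.empty((n, n), dtype=np.float64)
    X32 = X.astype(np.float32)
    for i in range(0, n, block):
        bi = slice(i, min(i + block, n))
        S[bi, bi] = X[:, bi].T @ X[:, bi]                 # FP64 diagonal block
        for j in range(i + block, n, block):
            bj = slice(j, min(j + block, n))
            S_ij = (X32[:, bi].T @ X32[:, bj]).astype(np.float64)  # FP32 block
            S[bi, bj] = S_ij
            S[bj, bi] = S_ij.T                            # S is symmetric
    return S

def cholesky_gram_schmidt(X, block=4):
    S = overlap_mixed(X, block)
    L = np.linalg.cholesky(S)             # S = L L^T, in double precision
    # X_o = X L^{-T}, obtained by solving L X_o^T = X^T
    return np.linalg.solve(L, X.T).T

# Toy input: a nearly orthonormal block, so off-diagonal entries of S are small.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((200, 8)))
X = Q + 1e-2 * rng.standard_normal((200, 8))
X_o = cholesky_gram_schmidt(X)
err = np.abs(X_o.T @ X_o - np.eye(8)).max()
print(err)    # deviation from orthonormality
```

The deviation from orthonormality is set by the FP32 rounding in the off-diagonal blocks, which is small when the input vectors are already close to orthogonal.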

  18. Orthogonalization: Cholesky Gram-Schmidt
      - Summit GPU cluster benchmark: performance improvement in computation of S due to mixed precision algorithm. Case study: 61,640-electron system using 1300 Summit nodes.
      - NERSC Cori CPU cluster benchmark: performance improvement in CholGS due to mixed precision algorithm. Case study: Mg 10x10x10 (39,990 electrons) and Mo 13x13x13 (61,502 electrons).

  19. Rayleigh-Ritz procedure
      - Compute the projected Hamiltonian
      - Diagonalization of the projected Hamiltonian
      - Subspace rotation step
      Mixed precision computation for RR:
      1. Compute the projected Hamiltonian
      [Figure labels: N_oc, N_fr]

  20. Rayleigh-Ritz procedure
      2. Diagonalization of the projected Hamiltonian in double precision
      3. Subspace rotation step
      Summit GPU cluster benchmark: performance improvement in the Rayleigh-Ritz computation due to the mixed precision algorithm. Case study: 61,640-electron system using 1300 Summit nodes.
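
The Rayleigh-Ritz steps listed on the two slides above (project, diagonalize, rotate) can be sketched in NumPy as follows. The slides show the labels N_oc and N_fr but the details of the precision split are not in the transcript; this sketch assumes that the block of the projected Hamiltonian coupling the first `n_oc` states with each other is computed in FP32 and everything else in FP64. The function name, `n_oc`, and the toy matrices are hypothetical.

```python
import numpy as np

def rayleigh_ritz_mixed(H, X, n_oc):
    """Rayleigh-Ritz step for an orthonormal block X of shape (n_dofs, n).

    Assumed precision split: the leading n_oc x n_oc block of the projected
    Hamiltonian comes from an FP32 product, the remaining blocks from FP64.
    """
    n = X.shape[1]
    HX = H @ X
    H_hat = np.empty((n, n))
    X_oc32 = X[:, :n_oc].astype(np.float32)
    HX_oc32 = HX[:, :n_oc].astype(np.float32)
    H_hat[:n_oc, :n_oc] = (X_oc32.T @ HX_oc32).astype(np.float64)
    H_hat[:, n_oc:] = X.T @ HX[:, n_oc:]          # FP64 blocks
    H_hat[n_oc:, :n_oc] = H_hat[:n_oc, n_oc:].T   # fill by symmetry
    H_hat = 0.5 * (H_hat + H_hat.T)               # remove rounding asymmetry
    evals, Q = np.linalg.eigh(H_hat)              # diagonalize in FP64
    return evals, X @ Q                           # subspace rotation

# Toy check: X spans the 6 lowest eigenvectors of H, mixed by a random rotation.
rng = np.random.default_rng(0)
H = np.diag(np.arange(20.0))
R, _ = np.linalg.qr(rng.standard_normal((6, 6)))
X = np.eye(20)[:, :6] @ R
evals, X_rot = rayleigh_ritz_mixed(H, X, n_oc=4)
residual = np.abs(H @ X_rot - X_rot * evals).max()
print(evals)
print(residual)
```

In this toy case the Ritz values recover the six lowest eigenvalues of `H` up to FP32 rounding in the occupied block.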

  21. Comparison with Quantum Espresso (Cori KNL) (Motamarri et al., Comput. Phys. Commun. (2019))
      Monovacancy in HCP Mg: periodic calculation; ONCV pseudopotential
      Accuracy for all calculations: <0.1 mHa/atom (~2.7 meV/atom)
      Time per SCF in node-hrs for various system sizes (NERSC Cori KNL):

      | System size | Q-Espresso (Ecut: 45 Ha) | DFT-FE (h_min: 0.46, p=4) |
      |---|---|---|
      | 255 atoms (N_e = 2550) | 0.1 | 0.3 |
      | 863 atoms (N_e = 8630) | 4.4 | 3.3 |
      | 2047 atoms (N_e = 20470) | 123.5 | 21.6 |
      | 3999 atoms (N_e = 39990) | - | 103.4 |

      [Figure: wall-time per SCF iteration (sec) versus number of electrons, DFT-FE and Quantum Espresso]

  22. Comparison with Quantum Espresso (Cori KNL)
      Cu nanoparticles: non-periodic calculation; ONCV pseudopotential
      Accuracy for all calculations: <0.1 mHa/atom (~2.7 meV/atom)
      Time per SCF in node-hrs for various system sizes (NERSC Cori KNL):

      | System size | Q-Espresso (Ecut: 50 Ha) | DFT-FE (h_min: 0.4, p=4) |
      |---|---|---|
      | 147 atoms (N_e = 2793) | 0.2 | 0.3 |
      | 309 atoms (N_e = 5871) | 5.5 | 1.7 |
      | 561 atoms (N_e = 10569) | 63.4 | 5.3 |
      | 923 atoms (N_e = 17537) | - | 12.7 |

  23. Technological challenge of low ductility in Mg
      - 12 slip systems in face-centered cubic crystals → higher ductility
      - Dislocations are energetically more favorable to reside on certain slip systems. (Energetics)
      - Dislocation glide occurs after the applied shear stress is greater than the Peierls barrier. (Activation barrier)
      - The more slip systems on which dislocations can glide easily, the higher the ductility.
      [Figure: slip systems labeled Basal, Prism I, Prism II, Pyramidal I, Pyramidal II]

  24. Mg Pyramidal dislocation systems
      Pyramidal I and II dislocation systems of various sizes: 728 Mg atoms, 1848 Mg atoms, 6164 Mg atoms, 10,508 Mg atoms

  25. Performance Benchmarks – Strong Scaling/time to solution
      Mg pyr II screw dislocation: 1,848 atoms (18,480 e-); 55.11 million FE DoFs
      - Theta: wall-time 1511 sec on 2048 MPI tasks; 104 sec on 65,536 MPI tasks
      - Summit GPUs (3 MPI tasks per GPU via MPS): wall-time 97.6 sec on 1260 MPI tasks; 13.99 sec on 20,160 MPI tasks
      [Figures: observed and ideal relative speedup versus number of MPI tasks, for Theta (2048 to 65,536 tasks) and Summit GPUs (1260 to 20,160 tasks)]

  26. Performance Benchmarks – Weak Scaling (Summit)
      Computational complexity:
      - Chebyshev filtering: O(MN)
      - Orthonormalization: O(MN^2)
      - Rayleigh-Ritz procedure: O(MN^2)
      Onset of cubic scaling significantly delayed!
      [Figure: percent weak scaling efficiency versus number of electrons (2500 to 100000); total MPI tasks 54, 180, 576, 3294, 12744, 38250 (3 MPI tasks per GPU; via MPS)]
