NEW GPU FUNCTIONALITY IN VASP WITH OPENACC AND CUDA LIBRARIES Stefan Maintz, 2019/12/18
AGENDA
• Introduction to VASP
• GPU Acceleration in VASP 5
• Prioritizing Use Cases for New Porting Efforts
• OpenACC in VASP 6 and Supported Features
• Comparative Benchmarking
INTRODUCTION TO VASP
Scientific Background
• A leading electronic structure program for solids, surfaces and interfaces
• Used to study chemical and physical properties, reaction paths, etc.
• Atomic-scale materials modeling from first principles, from 1 to 1000s of atoms
• Liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
• Solves the many-body Schrödinger equation
INTRODUCTION TO VASP
Quantum-mechanical methods
• Density Functional Theory (DFT)
• Enables solving sets of Kohn-Sham equations
• In a plane-wave based framework (PAW)
• Hybrid DFT adding (parts of) exact exchange (Hartree-Fock)
• and VASP can go even beyond!
INTRODUCTION TO VASP
12-25% of CPU cycles at supercomputing centers
Academia
• Material Sciences
• Chemical Engineering
• Physics & Physical Chemistry
Companies
• Large semiconductor companies
• Oil & gas
• Chemicals (bulk or fine)
• Materials (glass, rubber, ceramic, alloys, polymers and metals)
Top 5 HPC Applications (Source: Intersect360 2017 Site Census Mentions)
Rank  Application
1     GROMACS
2     ANSYS Fluent
3     Gaussian
4     VASP
5     NAMD
(charts: code usage at CSC, Finland (2012) and Archer, UK (2019/03); segment labels VASP, Gromacs, cp2k, NEMO, Other; labeled shares 18.2% and 9.3%)
Source: http://www.archer.ac.uk/status/codes/ 2019/03/28
INTRODUCTION TO VASP
Details on the code
• Developed by Prof. Kresse’s group at University of Vienna (and external contributors)
• Under development/refactoring for about 25 years
• 460K lines of Fortran 90, some FORTRAN 77
• MPI parallel, OpenMP recently added for multicore
• First endeavors on GPU acceleration date back to before 2011, with CUDA C
INTRODUCTION TO VASP
Computational characteristics
• Lots of small fast Fourier transforms (about 100x100x100 nodes)
• Matrix-matrix and matrix-vector multiplications
• Matrix diagonalizations
• AllToAll communications
• And of course some custom kernels
COLLABORATION ON CUDA C PORT OF VASP 5
Collaborators: U of Chicago
CUDA Port Project Scope
• Minimization algorithms to calculate electronic ground state: Blocked Davidson (ALGO = Normal & Fast) and RMM-DIIS (ALGO = VeryFast & Fast)
• Parallelization over k-points
• Exact-exchange calculations
Earlier Work
• Speeding up plane-wave electronic-structure calculations using graphics-processing units (Maintz, Eck, Dronskowski)
• VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron (Hutchinson, Widom)
• Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units (Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard)
INSTRUCTIONS TO COMPILE AND RUN VASP 5 ON GPUS
• NVIDIA offers step-by-step instructions to compile and run VASP 5 with the CUDA C port: https://www.nvidia.cn/data-center/gpu-accelerated-applications/vasp/
• Example benchmarks to test against expected performance
• Hints on tuning important run-time parameters
• English version at https://www.nvidia.com/vasp
CUDA SOURCE INTEGRATION IN VASP 5.4.4
• Original source tree (Fortran): davidson.F, …
• Accelerated call tree, drop-in replacements (Fortran): davidson_gpu.F
• Custom kernels and support code (CUDA C): davidson.cu, cuda_helpers.h, cuda_helpers.cu
(diagram: a makefile switch links the original source tree to the accelerated call tree)
CUDA C ACCELERATED VERSION OF VASP
Available today on NVIDIA Tesla GPUs with VASP 5.4.4
• All GPU acceleration with CUDA C
• Only some cases are ported to GPUs
• Different source trees for Fortran vs CUDA C
• CPU code gets continuously updated and enhanced, required for various platforms
• Challenge to keep CUDA C sources up-to-date
• Long development cycles to port new features
(diagram: the upper levels branch into a GPU call tree and a CPU call tree, each with its own routine A and routine B)
CUDA C ACCELERATED VERSION OF VASP
Available today on NVIDIA Tesla GPUs with VASP 5.4.4
Source code duplication for CUDA C in VASP led to:
• increased maintenance cost
• improvements in CPU code need replication
• long development cycles to port new solvers
Explore OpenACC as an improvement for GPU acceleration
(diagram: the upper levels branch into a GPU call tree and a CPU call tree, each with its own routine A and routine B)
FEATURES AVAILABLE AND ACCELERATED IN VASP 5
LEVELS OF THEORY
• Standard DFT
• Hybrid DFT (exact exchange)
• RPA (ACFDT, GW)
• Bethe-Salpeter Equations (BSE)
• …
SOLVERS / MAIN ALGORITHM
• Davidson
• RMM-DIIS
• Davidson+RMM-DIIS
• Direct optimizers (Damped, All)
• Linear response
• ...
PROJECTION SCHEME
• Real space
• Real space (automatic optimization)
• Reciprocal space
EXECUTABLE FLAVORS
• Standard variant
• Gamma-point simplification variant
• Non-collinear spin variant
Light green: GPU accelerated in VASP 5, black: not accelerated in VASP 5
EXAMPLE BENCHMARK: CuC_vdW
(diagram of the option categories that define a use case)
• Level of theory: Standard DFT, Hybrid DFT, RPA, BSE
• Solver: Davidson, RMM-DIIS, Dav.+RMM-DIIS, Damped
• Projection scheme: Real space, Reciprocal, Automatic
• Executable flavor: Standard, Gamma-point, Non-collinear
• Parallelization options: KPAR, NSIM, NCORE
PARALLELIZATION LAYERS IN VASP
Physical quantities          Parallelization feature
Wavefunction Ψ
Spins ↑ ↓
k-points                     KPAR>1
Bands/Orbitals               Default
Plane-wave coefficients      NCORE>1
PARALLELIZATION OPTIONS
KPAR
• Distributes k-points
• Highest level parallelism, more or less embarrassingly parallel
• Can help for smaller systems
• Not always possible
NSIM
• Blocking of orbitals
• Grouping (no distributing), influences caching and communication
• Ideal value differs between CPU and GPU
• Needs to be tuned
NCORE
• Distributes plane waves
• Lowest level parallelism, needs parallel 3D FFT, inserts lots of MPI msgs
• Can help with load balancing problems
• No support in CUDA port
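To make the interplay of KPAR and NCORE more concrete, the following is a minimal toy model in Python, not VASP's actual scheduling code. It assumes that the MPI ranks are split evenly into KPAR groups, that k-points are dealt to those groups round-robin, and that within a group NCORE ranks share one orbital. The function name `layout` and the returned dictionary keys are hypothetical, and NSIM is not modeled because it only blocks orbitals and does not distribute work.

```python
def layout(total_ranks, n_kpoints, kpar=1, ncore=1):
    """Toy model of how ranks might be grouped (illustrative assumption only)."""
    if total_ranks % kpar != 0:
        raise ValueError("total_ranks must be divisible by kpar")
    ranks_per_kpoint_group = total_ranks // kpar
    if ranks_per_kpoint_group % ncore != 0:
        raise ValueError("ranks per k-point group must be divisible by ncore")
    # Number of orbitals that one k-point group can work on at the same time.
    band_groups = ranks_per_kpoint_group // ncore
    # Assumed round-robin assignment of k-point indices to the KPAR groups.
    kpoints_per_group = [list(range(g, n_kpoints, kpar)) for g in range(kpar)]
    return {
        "ranks_per_kpoint_group": ranks_per_kpoint_group,
        "band_groups": band_groups,
        "kpoints_per_group": kpoints_per_group,
    }


example = layout(total_ranks=16, n_kpoints=6, kpar=2, ncore=4)
print(example)
```

In this sketch, 16 ranks with KPAR=2 and NCORE=4 give two k-point groups of 8 ranks each, and each group processes two orbitals at a time. A calculation with a single k-point gains nothing from KPAR>1, which is one reading of "not always possible".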
POSSIBLE USE CASES IN VASP
Each with a different computational profile
• Supports a plethora of run-time options that define the workload (use case)
• Those methodological options can be grouped into categories
• Some, but not all, are combinable
• Combination determines if GPU acceleration is supported and also how well
• Benchmarking the complete situation is tremendously complex
WHERE TO START
You cannot accelerate everything (at least soon)
• Ideally every use case would be ported
• Standard and Hybrid DFT alone give 72 use cases (ignoring parallelization options)!
• Need to select most important use cases
• Selection should be based on real-world or supercomputing-facility scenarios
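The figure of 72 can be reproduced by enumerating option combinations. The sketch below assumes the count is the product of two levels of theory, four solvers, three projection schemes and three executable flavors, as named on the feature slides. That decomposition is an assumption that happens to match the number, and it treats every combination as valid even though not all options are combinable in practice.

```python
from itertools import product

# Option categories as named on the feature slides (grouping is an assumption).
LEVELS_OF_THEORY = ["Standard DFT", "Hybrid DFT"]
SOLVERS = ["Davidson", "RMM-DIIS", "Davidson+RMM-DIIS", "Damped"]
PROJECTION_SCHEMES = ["Real space", "Real space (automatic)", "Reciprocal space"]
EXECUTABLE_FLAVORS = ["Standard", "Gamma-point", "Non-collinear"]

# Every combination of one choice per category counts as one use case here.
use_cases = list(
    product(LEVELS_OF_THEORY, SOLVERS, PROJECTION_SCHEMES, EXECUTABLE_FLAVORS)
)

print(len(use_cases))  # 2 * 4 * 3 * 3 = 72
```

Adding the parallelization options (KPAR, NSIM, NCORE) as further dimensions would multiply this count again, which is why a selection is needed.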
STATISTICS ON VASP USE CASES
NERSC job submission data 2014
• Zhengji Zhao (NERSC) collected such data (INCAR) for 30397 VASP jobs over nearly 2 months
• Data is based on job count, but has no timing information
• Includes 130 unique users on Edison (CPU-only system)
• No 1:1 mapping of parameters possible, expect large error margins
• Data does not include calculation sizes, but it’s a great start
VASP FEATURE USAGE AT NERSC
Levels of theory and main algorithms based on job count
(two pie charts: "Solver / main algorithm" and "Level of theory"; legend labels Davidson, Dav+RMM, RMM-DIIS, Direct Opt., standard DFT, hybrid DFT, RPA, BSE, Other; labeled shares 2% and 51%)
Source: based on data provided by Zhengji Zhao, NERSC, 2014
SUMMARY
Where to start
• Start with standard DFT, to accelerate most jobs
• RMM-DIIS and Davidson nearly equally important, share a lot of routines anyway
• Real-space projection scheme more important for large setups
• Gamma-point executable flavor as important as standard, so start with general one
• Support as many parallelization options as possible (KPAR, NSIM, NCORE)
• Communication is important, but scaling to large node counts is low priority (62% fit into 4 nodes, 95% used ≤ 12 nodes)
OPENACC DIRECTIVES
Data directives are designed to be optional

    !$acc data copyin(a,b) copyout(c)    ! Manage data movement
    ...
    !$acc parallel                       ! Initiate parallel execution
    !$acc loop gang vector               ! Optimize loop mappings
    do i=1, n
       c(i) = a(i) + b(i)
       ...
    enddo
    !$acc end parallel
    ...
    !$acc end data