A Physicist’s Guide to Parallelization at the IHPC
Viðar Guðmundsson, Science Institute, University of Iceland
vidar@hi.is, 2017
Transport of electrons through dots in a photon cavity
Exactly interacting electrons and photons, geometry
Equation of motion
Complicated construction through functional spaces
  ⇓ Non-interacting single-electron basis, Hilbert space
  ⇓ Interacting many-electron states, Fock space
  ⇓ Non-interacting electron-photon basis, Fock space
  ⇓ Photon-dressed electron states, Fock space
  ⇓ Open system → projection on the central system
  ⇓ Equation of motion, linear space of transitions, Liouville space
A lot of analytical work and planning
  ⇓ Heavy use of linear algebra
  ⇓ Truncation of infinite-dimensional spaces
  ⇓ Compact (dense) complex matrices
  ⇓ Intel MKL, BLAS, Lapack, OpenMP – cuBLAS, Magma
  ⇓ G-FORTRAN + ifort – ifort + CUDA
BLAS
Always use the (parallel f95-interface) BLAS for simple linear algebra tasks:
  Level I   : vector – vector operations
  Level II  : matrix – vector operations
  Level III : matrix – matrix operations
You cannot compete with it using your own code!
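As a minimal sketch of a Level III call through the f95 interface (assuming MKL's blas95 module and a Mod_Precision module defining the kind parameter dp, as in the wrappers below):

  PROGRAM gemm_demo
    USE Mod_Precision                  ! assumed: defines dp (double precision kind)
    USE blas95                         ! MKL Fortran 95 interface to BLAS
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 512
    COMPLEX(KIND=dp), DIMENSION(N,N) :: A, B, C
    A = (1.0_dp, 0.0_dp)
    B = (0.0_dp, 1.0_dp)
    C = (0.0_dp, 0.0_dp)
    CALL GEMM(A, B, C)                 ! Level III: C = A*B, threaded inside MKL
  END PROGRAM gemm_demo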
Module – wrapper
Subroutine → function – matrix multiplication

  MODULE PAB
  CONTAINS
    FUNCTION MATMULVG(Af, Bf)
      USE Mod_Precision
      USE lapack95
      USE blas95
      USE omp_lib
      IMPLICIT NONE
      INTEGER :: Nd1, Nd2, ierr
      COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:,:) :: MATMULVG, Af, Bf
      Nd1  = SIZE(Af,1)
      Nd2  = SIZE(Bf,2)
      ierr = 0
      ALLOCATE(MATMULVG(Nd1,Nd2), STAT=ierr)
      MATMULVG = (0.0_dp, 0.0_dp)
      CALL GEMM3M(Af, Bf, MATMULVG)    ! MATMULVG = Af*Bf (complex GEMM, 3M algorithm)
    END FUNCTION MATMULVG
  END MODULE PAB
Usage – readability

  ρ_vec(t) = { V exp(L t) U } ρ(0)

  SUBROUTINE NaZwa_gR_errCorr_I
    ...
    USE PAB
    USE PcAB
    USE PABz
    USE PAv
    USE PAvT
    ...
    rhotv = MATvecVG(MATMULVG(MATMULVG(vrV, expiLt), vlU), rho0v)
    ...
  END SUBROUTINE NaZwa_gR_errCorr_I
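The matrix-vector wrapper MATvecVG used above is not shown on these slides; a minimal sketch of what such a wrapper might look like, built on the Level II routine GEMV in the same style as MATMULVG (the module name PAv is taken from the USE list above, the body itself is an assumption):

  MODULE PAv
  CONTAINS
    FUNCTION MATvecVG(Af, xf)
      ! Sketch of a matrix-vector wrapper analogous to MATMULVG,
      ! assumed to wrap the Level II BLAS routine GEMV.
      USE Mod_Precision
      USE blas95
      IMPLICIT NONE
      COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:,:) :: Af
      COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:)   :: MATvecVG, xf
      INTEGER :: Nd1, ierr
      Nd1  = SIZE(Af,1)
      ierr = 0
      ALLOCATE(MATvecVG(Nd1), STAT=ierr)
      MATvecVG = (0.0_dp, 0.0_dp)
      CALL GEMV(Af, xf, MATvecVG)      ! MATvecVG = Af*xf
    END FUNCTION MATvecVG
  END MODULE PAv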
GPU-cards
We can also off-load matrix handling to GPU-cards

  MODULE Mod_MVG
  CONTAINS
    FUNCTION MVG(A, B)
      ! Input:  two complex N x N matrices A and B
      ! Output: complex N x N matrix A*B
      USE Mod_Precision
      USE mkl95_lapack
      USE blas95
      USE omp_lib
      IMPLICIT NONE
      INTEGER :: m, k, n, ierr
      COMPLEX(KIND=dp), ALLOCATABLE, DIMENSION(:,:) :: MVG, A, B
      m = SIZE(A,1)
      n = SIZE(B,2)
      k = SIZE(A,2)
      ierr = 0
      ALLOCATE(MVG(m,n), STAT=ierr)
      MVG = (0.0_dp, 0.0_dp)
      CALL matmulvgcuda(A, B, MVG, m, k, n)   ! off-load the product to the GPU
    END FUNCTION MVG
  END MODULE Mod_MVG
extern "C" void matmulvgcuda_ ( cuDoubleComplex *A, cuDoubleComplex *B, cuDoubleComplex *C, cudaSetDevice (0); cuDoubleComplex *d_A = 0; cuDoubleComplex *d_B = 0; cuDoubleComplex *d_C = 0; cuDoubleComplex alpha = {.x=1.0 , .y=0.0}; cuDoubleComplex beta = {.x=0.0 , .y=0.0}; int matrixSizeA = (*m)*(*k); int matrixSizeB = (*k)*(*n); int matrixSizeC = (*m)*(*n); cublasHandle_t handle; /* Initialize CUBLAS */ CUBLAS_HANDLE_ERROR ( cublasCreate (& handle) ); /* Allocate device memory for matrices */ CUDA_HANDLE_ERROR ( cudaMalloc (( void **)&d_A , matrixSizeA *sizeof( cuDoubleComplex ))); CUDA_HANDLE_ERROR ( cudaMalloc (( void **)&d_B , matrixSizeB *sizeof( cuDoubleComplex ))); CUDA_HANDLE_ERROR ( cudaMalloc (( void **)&d_C , matrixSizeC *sizeof( cuDoubleComplex ))); /* Copy host matrixes to device */ CUBLAS_HANDLE_ERROR ( cublasSetVector (matrixSizeA , sizeof( cuDoubleComplex ), A, 1, d_A , CUBLAS_HANDLE_ERROR ( cublasSetVector (matrixSizeB , sizeof( cuDoubleComplex ), B, 1, d_B , /* Perform multiplication with CUBLAS */ CUBLAS_HANDLE_ERROR ( cublasZgemm (handle , CUBLAS_OP_N , CUBLAS_OP_N , *m, *n, *k, &alpha /* Read result back */ CUBLAS_HANDLE_ERROR ( cublasGetVector (matrixSizeC , sizeof( cuDoubleComplex ), d_C , 1, C, /* Clean up device memory */ CUDA_HANDLE_ERROR ( cudaFree(d_A) ); CUDA_HANDLE_ERROR ( cudaFree(d_B) ); CUDA_HANDLE_ERROR ( cudaFree(d_C) ); /* Shutdown */ CUBLAS_HANDLE_ERROR ( cublasDestroy (handle) ); }
GPU – off-loading
  • Information on FORTRAN-CUDA modules: vidar@hi.is
  • Lapack: MAGMA: Matrix Algebra on GPU and Multicore Architectures, http://icl.cs.utk.edu/magma/
  • cuBLAS: https://developer.nvidia.com/cublas
OpenMP
Matrix elements:

  !$OMP PARALLEL DO PRIVATE(fs, sumLp, sumRp, FermiS, Cpnu) &
  !$OMP SHARED(TL, TR, TauLp, TauRp, gamma, Plus, i1, i2, j1, j2) SCHEDULE(DYNAMIC)
  DO nu = 1, N_mes
    DO mu = 1, N_mes
      sumLp = Czero
      sumRp = Czero
      DO ia = 1, N_ses
        FermiS = gamma(nu,ia)
        Cpnu   = Plus(nu,ia)
        IF(mu .EQ. Cpnu) THEN
          fs    = (-1)**FermiS
          sumLp = sumLp + TL(i1,j1,ia)*CMPLX(REAL(fs,dp), 0.0_dp, dp)
          sumRp = sumRp + TR(i2,j2,ia)*CMPLX(REAL(fs,dp), 0.0_dp, dp)
        END IF
      END DO
      TauLp(mu,nu) = sumLp
      TauRp(mu,nu) = sumRp
    END DO
  END DO
  !$OMP END PARALLEL DO
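When parallel regions like this are combined with threaded MKL BLAS/LAPACK calls elsewhere in the code, the two thread pools can be controlled separately; a minimal sketch (the thread counts here are arbitrary, mkl_set_num_threads is MKL's service routine):

  USE omp_lib
  ...
  CALL omp_set_num_threads(16)   ! threads available to the !$OMP regions
  CALL mkl_set_num_threads(16)   ! threads used inside MKL BLAS/LAPACK calls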
Many dimensions
  • OpenMP – shared RAM – up to 32 cores on Garðar
  • MPI – distributed RAM – images
  • OpenMP – depth of parallel tasks
  • BLAS – linear algebra
  • Modules – readability – easy usage
Importance of analytical + numerical work
Efficient determination of the Markovian time-evolution towards a steady-state of a complex open quantum system, Computer Physics Communications, in press (2017), https://doi.org/10.1016/j.cpc.2017.06.018
Conclusions
  • MKL, OpenMP, cuBLAS, MAGMA
  • Faster development than with MPI
  • Build modules (wrappers)
  • Get clever students
  • G-FORTRAN – ifort
Collaboration and support

Collaborators:
  • Þorsteinn Hjörtur Jónsson (UI)
  • Andrei Manolescu (RU)
  • Chi-Shung Tang (NUU)
  • Hsi-Sheng Goan (NTU)
  • Anna Sitek (UI)
  • Nzar Rauf Abdullah (KUS)
  • Maria Laura Bernodusson (ALUF)
  • Sigtryggur Hauksson (McGill)
  • Árni Johnsen (UI)
  • Elías Snorrason (UI)

Support:
  • University of Iceland Research Fund
  • The Icelandic Research Fund
  • The Taiwan Ministry of Technology
  • The Icelandic Infrastructure Fund