MultiGPU Made Easy by OmpSs + CUDA/OpenACC
Antonio J. Peña, Sr. Researcher & Activity Lead Manager, BSC/UPC
NVIDIA GCoE, San Jose, 2018
Introduction: Programming Models for GPU Computing
– CUDA (Compute Unified Device Architecture)
  • Runtime & Driver APIs (high-level / low-level)
  • Specific to NVIDIA GPUs: best performance & control
– OpenACC (Open Accelerators)
  • Open standard
  • Higher-level, pragma-based
  • Aims at portability across heterogeneous hardware
  • For NVIDIA GPUs, implemented on top of CUDA
– OpenCL (Open Computing Language)
  • Open standard
  • Low-level, similar to the CUDA Driver API
  • Multi-target, portable*
(Intentionally leaving out weird stuff like CG, OpenGL, …)
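To make the high-level/low-level contrast concrete, here is a rough sketch (not from the slides) of the same vector-scale operation expressed in CUDA and in OpenACC; the function names and block size are illustrative.

// CUDA: the programmer writes the kernel and the launch configuration explicitly
// (device memory allocation and host<->device transfers, not shown, are also explicit).
__global__ void scale_cuda(double *b, const double *c, double a, int n)
{
   int j = blockIdx.x * blockDim.x + threadIdx.x;
   if (j < n) b[j] = a * c[j];
}
// launched as: scale_cuda<<<(n + 127) / 128, 128>>>(d_b, d_c, a, n);

// OpenACC: the compiler derives the kernel and the data movement from pragmas.
void scale_acc(double *b, const double *c, double a, int n)
{
   #pragma acc parallel loop copyin(c[0:n]) copyout(b[0:n])
   for (int j = 0; j < n; j++)
      b[j] = a * c[j];
}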
Motivation: Coding Productivity & Performance

[Figure: relative coding productivity vs. performance for CUDA, OpenACC, OpenACC + CUDA, OmpSs + CUDA, OmpSs + OpenACC, and OmpSs + OpenACC + CUDA]

– Don't get me wrong: CUDA delivers awesome coding productivity w.r.t., e.g., OpenGL, but I only want to use 3 (easy) colors here. Please interpret colors as relative to each other.
– OpenACC may well deliver more than the performance you *need*; however, it offers the lowest control over performance w.r.t. the discussed alternatives.
EPEEC, an EU H2020 Project
European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing
– FETHPC, 3 years, ~4 M€, starting October 2018
  • Subtopic: “High productivity programming environments for exascale”
– 10 partners; coordinator: BSC (I’m the Technical Manager)
– High-level objectives:
  • Develop & deploy a production-ready parallel programming environment
  • Advance and integrate existing state-of-the-art European technology
  • High coding productivity, high performance, energy awareness
Proposed Methodology for Application Developers

[Flowchart: Start → Profile → satisfactory code patterns? (if not, update code patterns) → directive annotation → satisfactory performance? → if automatic optimisation is not possible, tune/insert low-level directives; if performance is still unsatisfactory, code accelerator kernels manually → Deploy]
OmpSs + CUDA / OpenACC
OmpSs Main Program
– Sequential control flow
  • Defines a single address space
  • Executes sequential code that
    - Can spawn/instantiate tasks that will be executed sometime in the future
    - Can stall/wait for tasks
– Tasks annotated with directionality clauses
  • in, out, inout
  • Used
    - To build dependences among tasks
    - For main to wait for data to be produced
  • Basis for memory management functionalities (replication, locality, movement, …)
    - Copy clauses
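A minimal sketch of this tasking model (illustrative, not from the slides; the function names are made up): two tasks whose ordering is derived from the out/in clauses, and a taskwait where main stalls.

// produce() writes *x, consume() reads it: the out/in clauses let the runtime
// order consume() after produce() without any explicit synchronization.
#pragma omp task out(*x)
void produce(double *x) { *x = 42.0; }

#pragma omp task in(*x) out(*y)
void consume(double *x, double *y) { *y = 2.0 * *x; }

int main(void)
{
   double a, b;
   produce(&a);          // task 1: writes a
   consume(&a, &b);      // task 2: runs only after task 1 (dependence through a)
   #pragma omp taskwait  // main waits here for both tasks to finish
   return 0;
}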
OmpSs: A Sequential Program …

[Figure: matrix A decomposed into NT × NT tiles of TS × TS elements each]

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k=0; k<NT; k++) {
      spotrf (A[k][k]);
      for (i=k+1; i<NT; i++) {
         strsm (A[k][k], A[k][i]);
      }
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++) {
            sgemm( A[k][i], A[k][j], A[j][i]);
         }
         ssyrk (A[k][i], A[i][i]);
      }
   }
}
OmpSs: … with Directionality Annotations

void Cholesky( float *A[NT][NT] )
{
   int i, j, k;
   for (k=0; k<NT; k++) {
      #pragma omp task inout (A[k][k])
      spotrf (A[k][k]);
      for (i=k+1; i<NT; i++) {
         #pragma omp task in (A[k][k]) inout (A[k][i])
         strsm (A[k][k], A[k][i]);
      }
      for (i=k+1; i<NT; i++) {
         for (j=k+1; j<i; j++) {
            #pragma omp task in (A[k][i], A[k][j]) inout (A[j][i])
            sgemm( A[k][i], A[k][j], A[j][i]);
         }
         #pragma omp task in (A[k][i]) inout (A[i][i])
         ssyrk (A[k][i], A[i][i]);
      }
   }
}
OmpSs: … that Happens to Execute in Parallel
– The very same annotated code of the previous slide: the runtime derives the task dependency graph from the in/out/inout clauses and runs independent tasks concurrently.
– Decouple how we write/think (sequential) from how it is executed.
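Looking ahead to the GPU part, one of these tasks could also be mapped to a CUDA kernel through an OmpSs target directive. The sketch below is an assumption for illustration only: the kernel strsm_gpu and the ndrange parameters are not from the slides.

// Sketch: declaring a CUDA implementation of the strsm task in OmpSs.
// copy_deps asks the runtime to move the data named in the dependence clauses
// to/from the device; ndrange(1, TS*TS, 128) requests a 1-D launch of TS*TS
// threads in blocks of 128. The kernel body would live in a separate .cu file.
#pragma omp target device (cuda) copy_deps ndrange(1, TS*TS, 128)
#pragma omp task in([TS*TS] Akk) inout([TS*TS] Aki)
__global__ void strsm_gpu(float *Akk, float *Aki);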
Memory Consistency (Getting Consistent Copies)
– Relaxed-consistency “shared-memory” model (OpenMP-like)

#pragma omp target device (cuda)
#pragma omp task out([N] b) in([N] c)
void scale_task_cuda (double *b, double *c, double a, int N)
{
   int j = blockIdx.x * blockDim.x + threadIdx.x;
   if (j < N) b[j] = a * c[j];
}

#pragma omp target device (smp)
#pragma omp task out([N] b) in([N] c)
void scale_task_host (double *b, double *c, double a, int N)
{
   for (int j=0; j < N; j++) b[j] = a*c[j];
}

void main(int argc, char *argv[]) {
   ...
   scale_task_cuda (B, A, 10.0, 1024); //T1
   scale_task_cuda (A, B, 0.01, 1024); //T2
   scale_task_host (C, B, 2.00, 1024); //T3
   #pragma omp taskwait
   // can access any of A, B, C
   ...
}

[Task dependency graph: T1 → T2 and T1 → T3; memory transfers shown between device and host memory]

– T1 needs a valid copy of array A in the device, so A is transferred host → device.
– It also allocates array B in the device (no copy needed) and invalidates the other copies of B.
Memory Consistency (Reusing Data in Place)
– Same code and task dependency graph as the previous slide.
– T2 can reuse arrays A and B in the device, since they have already been placed there by the previous task (T1): no transfer is needed.
– Additionally, it invalidates the other copies of A.
Memory Consistency (Copying Data Back on Demand)
– Same code and task dependency graph as the previous slide.
– T3 runs on the host, so it needs array B copied back from the device to the host.
– This copy does not invalidate the existing copy of B in the device.
Memory Consistency (Centralized Memory Consistency)
– Same code as the previous slides.
– The taskwait requires full memory consistency in the host: data that lives only in the device (here, array A written by T2) is copied back, so after the taskwait the main program can access any of A, B, C.
Memory Consistency (Avoiding taskwait Consistency)
– Same tasks T1–T3 as the previous slides, with a noflush taskwait and a fourth task added at the end of main:

   #pragma omp taskwait noflush
   // does not flush data dev -> host
   scale_task_cuda (B, C, 3.00, 1024); //T4
   #pragma omp taskwait
   // can access any of A, B, C

– A taskwait noflush waits for task finalization, but does not copy memory back to the host (nor invalidate the device copies).
– A later device task such as T4 can therefore keep reusing the data that already resides in the device.
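As a usage note (a sketch reusing the tasks above; ITERS is an illustrative constant, not from the slides), this is the pattern that keeps data resident on the device across iterations of a loop, with a single flushing taskwait at the end:

for (int it = 0; it < ITERS; it++) {
   scale_task_cuda (B, A, 10.0, 1024);
   scale_task_cuda (A, B, 0.01, 1024);
   #pragma omp taskwait noflush  // wait for the tasks, but leave A and B on the device
}
#pragma omp taskwait             // final flush: host copies become consistent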