GPU memory ain't no CPU memory
— GPU: accelerator / extension card → separate device from the CPU
— Separate memory, but UVA and UM
— Memory transfers need special consideration! Do as little as possible!
— Formerly: explicitly copy data to/from GPU
— Now: done automatically (performance…?)
Example values — P100: 16 GB HBM2 RAM, < 720 GB/s; NVLink ≈ 160 GB/s. V100: 16 GB HBM2 RAM, < 900 GB/s; NVLink ≈ 300 GB/s.
[Figure: host (Control, ALUs, Cache, DRAM) linked via NVLink to the GPU device (ALUs, HBM2 DRAM)]
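Not from the slides — a minimal sketch of the two data-movement styles mentioned above, in CUDA C. The kernel scale and both function names are illustrative; error checking is omitted.

    // Illustrative kernel: doubles each element.
    __global__ void scale(float *a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[i] *= 2.0f;
    }

    void explicit_copies(float *h_a, int n) {          // "Formerly": manual transfers
        float *d_a;
        cudaMalloc(&d_a, n * sizeof(float));
        cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d_a, n);
        cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_a);
    }

    void managed_memory(int n) {                       // "Now": Unified Memory
        float *a;
        cudaMallocManaged(&a, n * sizeof(float));      // accessible from host and device
        // fill a on the host ...
        scale<<<(n + 255) / 256, 256>>>(a, n);
        cudaDeviceSynchronize();                       // make results visible on the host
        cudaFree(a);
    }

With managed memory the runtime migrates pages on demand, which is convenient but can hide transfer costs — hence the "(performance…?)" above.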
Overview: GPU Architecture
Aim: hide latency — everything else follows
— Memory
— Asynchronicity
— SIMT
— High Throughput
Asynchronicity: following different streams
— Problem: memory transfer is comparably slow
— Solution: do something else in the meantime (computation)! → overlap tasks
— Copy and compute engines run separately (streams)
— GPU needs to be fed: schedule many computations
— CPU can do other work while the GPU computes; synchronization needed
— Also: fast switching of contexts to keep the GPU busy
[Figure: serial Copy → Compute → Copy → Compute timeline vs. overlapping copies and computes with async streams]
(A stream-based sketch follows below.)
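Not from the slides — a minimal sketch of copy/compute overlap with CUDA streams, processing the data in chunks. The kernel process and the function name are illustrative; for truly asynchronous copies the host buffer must be pinned (e.g. allocated with cudaMallocHost).

    __global__ void process(float *d, int n) { /* ... */ }

    // Split the work into chunks and alternate them over two streams so that
    // the copy of one chunk overlaps with the compute of another.
    void overlap(float *h_data, float *d_data, int n, int chunk) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        for (int off = 0; off < n; off += chunk) {
            cudaStream_t st = s[(off / chunk) % 2];
            int len = (n - off < chunk) ? n - off : chunk;
            cudaMemcpyAsync(d_data + off, h_data + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, st);
            process<<<(len + 255) / 256, 256, 0, st>>>(d_data + off, len);
            cudaMemcpyAsync(h_data + off, d_data + off, len * sizeof(float),
                            cudaMemcpyDeviceToHost, st);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }

While one stream computes chunk k, the other can already copy chunk k+1 — the overlap the diagram above illustrates.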
SIMT: of threads and warps
CPU:
— Single Instruction, Multiple Data (SIMD): a vector unit computes C0…C3 = A0…A3 + B0…B3 in one instruction
— Simultaneous Multithreading (SMT): multiple hardware threads share one core
GPU: Single Instruction, Multiple Threads (SIMT)
— CPU core ≊ GPU multiprocessor (SM)
— Working unit: set of threads (32, a warp)
— Fast switching of threads (large register file)
— Branching (if)
[Figure: Tesla V100 and one of its multiprocessors, including Tensor Cores — 120 TFLOP/s for Deep Learning. Graphics: Nvidia Corporation [5]]
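Not from the slides — a minimal sketch of what the branching bullet means in practice. All 32 threads of a warp share one instruction stream, so a data-dependent if that splits a warp makes it execute both paths one after the other, with the inactive threads masked (warp divergence). The kernel name and the even/odd split are illustrative.

    // Within every warp, half the threads take each branch, so the two
    // assignments run serially rather than in parallel.
    __global__ void diverge(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            v[i] = v[i] * 2.0f;   // even lanes
        else
            v[i] = v[i] + 1.0f;   // odd lanes
    }
    // Branching at warp granularity instead (e.g. on i / 32 or blockIdx.x)
    // keeps all 32 threads of a warp on the same path and avoids the
    // serialization.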
Low Latency vs. High Throughput: maybe the GPU's ultimate feature
— CPU core: low latency — minimizes the latency within each thread
— GPU streaming multiprocessor: high throughput — hides latency with computations from other thread warps
[Figure: timelines — a CPU core working through threads T1–T4 with little idle time per thread vs. a GPU SM cycling through warps W1–W4 (processing, context switch, ready, waiting)]
CPU vs. GPU: let's summarize this!
CPU — optimized for low latency:
 + Large main memory
 + Fast clock rate
 + Large caches
 + Branch prediction
 + Powerful ALU
 − Relatively low memory bandwidth
 − Cache misses costly
 − Low performance per watt
GPU — optimized for high throughput:
 + High-bandwidth main memory
 + Latency tolerant (parallelism)
 + More compute resources
 + High performance per watt
 − Limited memory capacity
 − Low per-thread performance
 − Extension card
Programming GPUs
Preface: CPU — a simple CPU program as reference
SAXPY: y = a·x + y (single precision), a BLAS Level 1 routine

    void saxpy(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    saxpy(n, a, x, y);
Libraries: the truth is out there!
Programming GPUs is easy: just don't!
Use applications & libraries!
— cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
(Wizard: Breazell [6])
cuBLAS: parallel algebra
— GPU-parallel BLAS (all 152 routines)
— Single, double, complex data types
— Multi-GPU support
— Constant competition with Intel's MKL
→ http://docs.nvidia.com/cuda/cublas — https://developer.nvidia.com/cublas
cuBLAS: code example

    int n = 10;
    float a = 42.;
    float x[n], y[n];
    // fill x, y

    cublasHandle_t handle;
    cublasCreate(&handle);                            // Initialize

    float *d_x, *d_y;
    cudaMallocManaged(&d_x, n * sizeof(x[0]));        // Allocate GPU memory
    cudaMallocManaged(&d_y, n * sizeof(y[0]));

    cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);   // Copy data to GPU
    cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);       // Call BLAS routine

    cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);   // Copy result to host

    cudaFree(d_x);                                    // Finalize
    cudaFree(d_y);
    cublasDestroy(handle);
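Not from the slides — the example above drops return codes for brevity; in real code every cuBLAS call returns a cublasStatus_t and every CUDA runtime call a cudaError_t. A minimal checking sketch (helper names are illustrative):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <cstdio>
    #include <cstdlib>

    // Abort with a message if a cuBLAS or CUDA runtime call fails.
    static void check_cublas(cublasStatus_t stat, const char *what) {
        if (stat != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "%s failed (status %d)\n", what, (int)stat);
            exit(EXIT_FAILURE);
        }
    }
    static void check_cuda(cudaError_t err, const char *what) {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Usage, wrapping the calls from the example above:
    //   check_cublas(cublasCreate(&handle), "cublasCreate");
    //   check_cuda(cudaMallocManaged(&d_x, n * sizeof(x[0])), "cudaMallocManaged");
    //   check_cublas(cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1), "cublasSaxpy");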
Thrust: iterators! Iterators everywhere!
— C++ template library — roughly the STL for CUDA
— Based on iterators
— Data-parallel primitives (scan(), sort(), reduce(), …)
— Fully compatible with plain CUDA C (comes with the CUDA Toolkit)
— Great with [](){} lambdas!
→ http://thrust.github.io/ — http://docs.nvidia.com/cuda/thrust/
Thrust: code example

    int a = 42;
    int n = 10;
    thrust::host_vector<float> x(n), y(n);
    // fill x, y

    thrust::device_vector<float> d_x = x, d_y = y;

    using namespace thrust::placeholders;
    thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(),
                      a * _1 + _2);

    y = d_y;
Thrust: code example with lambdas

    #include <thrust/execution_policy.h>
    #include <thrust/for_each.h>
    #include <thrust/iterator/counting_iterator.h>

    constexpr int gGpuThreshold = 10000;

    void saxpy(float *x, float *y, float a, int N) {
        auto r = thrust::counting_iterator<int>(0);

        auto lambda = [=] __host__ __device__ (int i) {
            y[i] = a * x[i] + y[i];
        };

        if (N > gGpuThreshold)
            thrust::for_each(thrust::device, r, r + N, lambda);
        else
            thrust::for_each(thrust::host, r, r + N, lambda);
    }
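Not from the slides — a minimal sketch of the other data-parallel primitives listed on the Thrust slide (reduce, sort, scan), applied to a small device_vector; the function name and values are illustrative.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <thrust/functional.h>

    void primitives() {
        thrust::device_vector<float> d(10, 1.0f);
        // Parallel reduction: sum of all elements on the GPU.
        float sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
        // Parallel sort on the GPU.
        thrust::sort(d.begin(), d.end());
        // Prefix sum (inclusive scan), in place.
        thrust::inclusive_scan(d.begin(), d.end(), d.begin());
        (void)sum;  // silence unused-variable warning in this sketch
    }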
Programming GPUs: Directives
GPU Programming with Directives: keepin' you portable
— Annotate usual source code by directives:
      #pragma acc loop
      for (int i = 0; i < 1; i++) {};
— Also: generalized API functions, e.g. acc_copy();
— Compiler interprets directives, creates according instructions
Pro:
— Portability: different target architectures from the same code; other compiler? No problem! To it, it's a serial program
— Easy to program
Con:
— Limited compiler support
— Raw power hidden
— Somewhat harder to debug
GPU Programming with Directives: the power of… two.
OpenMP — Standard for multithreaded programming on the CPU; GPU support since 4.0, better since 4.5

    #pragma omp target map(tofrom:y), map(to:x)
    #pragma omp teams num_teams(10) num_threads(10)
    #pragma omp distribute
    for ( ) {
        #pragma omp parallel for
        for ( ) {
            // …
        }
    }

OpenACC — Similar to OpenMP, but more specifically for GPUs; might eventually be re-merged into the OpenMP standard
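Not from the slides (which deliberately leave the loop headers blank) — a minimal, filled-in OpenMP 4.5-style saxpy sketch using the combined offload construct; the function name is illustrative.

    // Offload the loop to the device, copying x in and y back out.
    void saxpy_omp(int n, float a, float *x, float *y) {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }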
OpenACC: code example

    void saxpy_acc(int n, float a, float *x, float *y) {
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

or, with explicit data clauses:

    void saxpy_acc(int n, float a, float *x, float *y) {
        #pragma acc parallel loop copy(y) copyin(x)
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    saxpy_acc(n, a, x, y);

GPU tutorial this afternoon!
Programming GPUs: Languages
Programming GPUs Directly: finally…
Two solutions:
OpenCL — Open Computing Language, by the Khronos Group (Apple, IBM, NVIDIA, …), 2009
— Targets CPUs, GPUs, FPGAs, and other many-core machines
— Platform: programming language (OpenCL C/C++), API, and compiler
— Different compilers available, fully open source
CUDA — NVIDIA's GPU platform, 2007
— Only NVIDIA GPUs
— Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, …
— Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step
— Also: CUDA Fortran
Choose what flavor you like, what colleagues/collaboration is using.
Hardest: come up with a parallelized algorithm.
CUDA Threading Model: warp the kernel, it's a thread!
— Methods to exploit parallelism:
   — Threads → block
   — Blocks → grid
   — Threads and blocks in 3D (grid and block are each 3-dimensional)
— Execution entity: threads
   — Lightweight → fast switching!
   — 1000s of threads execute simultaneously → order non-deterministic!
— Parallel function: kernel
      __global__ void kernel(int a, float *b) { }
   — Access own ID by global variables: threadIdx.x, blockIdx.y, …
⇒ SAXPY!
CUDA SAXPY: with runtime-managed data transfers

    __global__ void saxpy_cuda(int n, float a, float *x, float *y) {  // Specify kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;                // ID variables
        if (i < n)                                                    // Guard against too many threads
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // Allocate GPU-capable memory
    cudaMallocManaged(&y, n * sizeof(float));
    // fill x, y

    saxpy_cuda<<<2, 5>>>(n, a, x, y);           // Call kernel: 2 blocks, 5 threads each

    cudaDeviceSynchronize();                    // Wait for kernel to finish
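Not from the slides — the example hard-codes <<<2, 5>>> because n = 10; for general n one usually picks a block size and derives the grid size from it. A minimal sketch, continuing with the variables of the example above:

    // Round the grid size up so that blocks * threadsPerBlock >= n;
    // the i < n guard in the kernel handles the surplus threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_cuda<<<blocks, threadsPerBlock>>>(n, a, x, y);
    cudaDeviceSynchronize();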
Programming GPUs: Abstraction Libraries/DSLs
Abstraction Libraries & DSLs: between Thrust, OpenACC, and CUDA
— Libraries with ready-programmed abstractions; partly a compiler/transpiler is necessary
— Have different backends to choose from for the targeted accelerator
— Examples: Kokkos, Alpaka, Futhark, HIP, C++AMP, …
An Alternative: Kokkos
— From Sandia National Laboratories
— C++ library for performance portability
— Data-parallel patterns, architecture-aware memory layouts, …
→ https://github.com/kokkos/kokkos/

    Kokkos::View<double*> x("X", length);
    Kokkos::View<double*> y("Y", length);
    double a = 2.0;
    // Fill x, y
    Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
        x(i) = a*x(i) + y(i);
    });
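Not from the slides — the fragment above needs the usual Kokkos setup around it to run. A minimal sketch embedding it in a program; length is an assumed value, and the fence is added so the host can safely use the result.

    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
        Kokkos::initialize(argc, argv);
        {   // scope so the Views are destroyed before finalize()
            const int length = 10;                    // assumed value
            Kokkos::View<double*> x("X", length);
            Kokkos::View<double*> y("Y", length);
            double a = 2.0;
            // Fill x, y ...
            Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
                x(i) = a * x(i) + y(i);
            });
            Kokkos::fence();                          // wait for the parallel_for
        }
        Kokkos::finalize();
        return 0;
    }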