Member of the Helmholtz Association

GPUs: Platform, Programming, Pitfalls
GridKa School 2016: Data Science on Modern Architectures
Andreas Herten, Forschungszentrum Jülich, 1 September 2016
About, Outline

Andreas Herten
— Physics in Aachen (Dipl. at CMS) and Jülich/Bochum (Dr. at PANDA)
— Optimizing scientific applications for/on GPUs
— Since then: NVIDIA Application Lab

Outline: Motivation — Platform — Hardware Features — Programming (Libraries, Directives, Languages) — Tools — Pitfalls

[Figure: Hough transform along isochrone (5° step size); axes x, y in cm⁻¹]
Status Quo: GPUs all around

— 1999: General computations with shaders of graphics hardware
— 2001: NVIDIA GeForce 3 with programmable shaders [1]
— 2003: DirectX 9 at ATI
— 2016: Top500: 1/10 of systems with GPUs; Green500: 70 % of top 50 with GPUs
Status Quo: GPUs all around

[Graphic: Rupp [2]]
Status Quo: GPUs all around

But why?! Let's find out!
Platform
CPU vs. GPU: A matter of specialties

Transporting one (CPU) vs. transporting many (GPU)
[Graphics: Lee [3] and Shearings Holidays [4]]
CPU vs. GPU: Chip

[Diagram: CPU chip (Control, Cache, few ALUs, DRAM) vs. GPU chip (many ALUs, DRAM)]
GPU Architecture: Overview

Aim: Hide latency — everything else follows
— Memory
— Asynchronicity
— SIMT
— High throughput
Memory

GPU: accelerator / extension card → separate device from the CPU, with separate memory (but UVA and UM)
GPU memory ain't no CPU memory
— Interconnect: PCIe (< 16 GB/s) or NVLink (≈ 80 GB/s)
— Device memory: values for P100: 16 GB HBM2 RAM, < 720 GB/s
— Formerly: explicitly copy data to/from GPU
— Now: done automatically (performance…?)
— Memory transfers need special consideration! Do as little as possible!
[Diagram: Host (Control, Cache, ALUs, DRAM) ↔ Device (many ALUs, DRAM)]
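The explicit-copy workflow mentioned above can be sketched with the CUDA runtime API. This is a minimal illustration (array size and values are made up); the kernel launch that would normally sit between the two copies is omitted:

```cuda
#include <stdio.h>

int main(void) {
    const int n = 10;
    float x[n];
    for (int i = 0; i < n; i++) x[i] = (float)i;

    // Allocate memory in the device's own DRAM
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    // Explicitly copy host data across the interconnect (PCIe/NVLink) ...
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);

    // ... (kernel launches would go here) ...

    // ... and copy results back to the host
    cudaMemcpy(x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    printf("x[9] = %f\n", x[9]);  // round trip preserves the data
    return 0;
}
```

Each `cudaMemcpy` crosses the comparatively slow interconnect, which is why the slide advises doing as few transfers as possible.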
Async: Following different streams

Problem: Memory transfer is comparably slow
Solution: Do something else in the meantime (computation)! → Overlap tasks
— Copy and compute engines run separately (streams)
— GPU needs to be fed: schedule many computations
— CPU can do other work while GPU computes; synchronization
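The overlap idea can be sketched with CUDA streams. A minimal, hypothetical example: the array is split in two halves, each processed in its own stream, so the copy engine works on one half while the compute engine works on the other:

```cuda
#include <stdio.h>

// Toy kernel standing in for "real" computation
__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    const int half = n / 2;
    float *h_a, *d_a;
    cudaMallocHost((void **)&h_a, n * sizeof(float));  // pinned host memory, required for async copies
    cudaMalloc((void **)&d_a, n * sizeof(float));
    for (int i = 0; i < n; i++) h_a[i] = 1.0f;

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each stream handles one half: a copy in one stream can overlap
    // with the kernel in the other (separate copy and compute engines)
    cudaMemcpyAsync(d_a, h_a, half * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(half + 255) / 256, 256, 0, s1>>>(d_a, half);
    cudaMemcpyAsync(h_a, d_a, half * sizeof(float), cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(d_a + half, h_a + half, half * sizeof(float), cudaMemcpyHostToDevice, s2);
    scale<<<(half + 255) / 256, 256, 0, s2>>>(d_a + half, half);
    cudaMemcpyAsync(h_a + half, d_a + half, half * sizeof(float), cudaMemcpyDeviceToHost, s2);

    // The CPU is free to do other work here; synchronize before using results
    cudaDeviceSynchronize();
    printf("h_a[0] = %.1f, h_a[n-1] = %.1f\n", h_a[0], h_a[n - 1]);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}
```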
SIMT: Of threads and warps

CPU:
— Single Instruction, Multiple Data (SIMD): vector operation, e.g. (A0, A1, A2, A3) + (B0, B1, B2, B3) = (C0, C1, C2, C3)
— Simultaneous Multithreading (SMT)
GPU: Single Instruction, Multiple Threads (SIMT)
— CPU core ≊ GPU multiprocessor (SM)
— Working unit: set of threads (32, a warp)
— Fast switching of threads (large register file)
— Branching (if)
[Graphics: Pascal GP100 multiprocessor; Nvidia Corporation [5]]
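In the SIMT model, each thread handles one data element, and threads are scheduled in warps of 32. A hypothetical CUDA version of SAXPY illustrates the per-thread indexing and the branch mentioned above:

```cuda
#include <stdio.h>

// SIMT: every thread executes the same instruction stream on its own element
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                  // branch: threads past the end idle (divergence within a warp)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 10;
    float x[n], y[n];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch enough blocks of 32 threads (one warp each) to cover n elements
    saxpy<<<(n + 31) / 32, 32>>>(n, 42.0f, d_x, d_y);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);  // 42 * 1 + 2 = 44

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```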
Latency Hiding: GPU's ultimate feature

— CPU core (low latency processor): minimizes latency within each thread (T1 T2 T3 T4)
— GPU streaming multiprocessor (high throughput processing): hides latency with computations from other thread groups (warps W1–W4: waiting, ready, context switch)
[Graphics: Meinke and Nvidia Corporation [6]]
Low latency vs. high throughput: CPU vs. GPU

CPU: Optimized for low latency
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt

GPU: Optimized for high throughput
+ High bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card
Programming
Preface: CPU

A simple CPU program: SAXPY, y⃗ = a x⃗ + y⃗, with single precision
Part of BLAS, Level 1

    void saxpy(int n, float a, float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    saxpy(n, a, x, y);
Libraries: The truth is out there!

Programming GPUs is easy: Just don't!
Use applications & libraries!
— cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, …
[Wizard: Breazell [7]]
cuBLAS: Parallel algebra

— GPU-parallel BLAS (all 152 routines)
— Single, double, complex data types
— Constant competition with Intel's MKL
— Multi-GPU support
→ http://docs.nvidia.com/cuda/cublas
→ https://developer.nvidia.com/cublas
cuBLAS: Code example

    int a = 42;
    int n = 10;
    float x[n], y[n];
    // fill x, y

    float *d_x, *d_y;
    cublasInit();
    cudaMalloc((void **)&d_x, n * sizeof(x[0]));
    cudaMalloc((void **)&d_y, n * sizeof(y[0]));
    cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
    cublasSaxpy(n, a, d_x, 1, d_y, 1);
    cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);
    cublasShutdown();
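The slide uses the legacy cuBLAS API (cublasInit/cublasShutdown). Current cuBLAS releases use a handle-based API (cublas_v2.h) in which alpha is passed by pointer; a sketch of the same SAXPY with that API, with illustrative fill values added:

```cuda
#include <stdio.h>
#include <cublas_v2.h>

int main(void) {
    const int n = 10;
    float a = 42.0f;
    float x[n], y[n];
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // fill x, y

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(x[0]));
    cudaMalloc((void **)&d_y, n * sizeof(y[0]));

    cublasHandle_t handle;
    cublasCreate(&handle);                       // replaces cublasInit()
    cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);  // alpha now passed by pointer
    cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);
    cublasDestroy(handle);                       // replaces cublasShutdown()

    cudaFree(d_x);
    cudaFree(d_y);
    printf("y[0] = %f\n", y[0]);  // 42 * 1 + 2 = 44
    return 0;
}
```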