EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA
Mark Harris, NVIDIA
@harrism #NVSC14
EXTENDING THE REACH OF CUDA
1 Machine Learning
2 Higher Performance
3 New Platforms
4 New Languages
GPUS: THE HOT MACHINE LEARNING PLATFORM
[Chart: GPU entries in the ImageNet Image Recognition Challenge (1.2M training images, 1,000 object categories) grew from 0 in 2010 and 2011 to 4 in 2012, 60 in 2013, and 110 in 2014.]
[Chart: Classification error rates fell from 28% in 2010 and 26% in 2011 to 16% in 2012, 12% in 2013, and 7% in 2014. Example labels from classified images: person, car, bird, helmet, frog, motorcycle, hammer, dog, flower pot, chair, power drill.]
GPU-ACCELERATED DEEP LEARNING (cuDNN)
High-performance routines for convolutional neural networks
Optimized for current and future NVIDIA GPUs
Integrated in major open-source frameworks: Caffe, Torch7, Theano
Flexible and easy-to-use API
Also available for ARM / Jetson TK1
https://developer.nvidia.com/cuDNN
[Chart: Caffe accelerated by cuDNN on a K40 is 14x faster than the CPU baseline; Caffe on the GPU without cuDNN is 11x faster; the 1x CPU baseline is a 24-core E5-2697v2 @ 2.4GHz with Intel MKL 11.1.3.]
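The slide does not show the API itself. As a rough illustration of the call pattern, here is a minimal convolution-forward sketch written against later (v7-era) cuDNN signatures; the descriptor-setter argument lists have changed across cuDNN releases, and the handle management, zero padding, unit strides, and the hardcoded IMPLICIT_GEMM algorithm choice here are assumptions, not from the slide:

    #include <cudnn.h>

    // Minimal sketch: y = conv(x, filt), float NCHW tensors already on the GPU.
    void convForward(cudnnHandle_t handle,
                     float *x, int n, int c, int h, int w,  // input tensor
                     float *filt, int k, int r, int s,      // k filters, each c x r x s
                     float *y)                              // preallocated output
    {
        cudnnTensorDescriptor_t xDesc, yDesc;
        cudnnFilterDescriptor_t wDesc;
        cudnnConvolutionDescriptor_t convDesc;

        cudnnCreateTensorDescriptor(&xDesc);
        cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   n, c, h, w);
        cudnnCreateFilterDescriptor(&wDesc);
        cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                                   k, c, r, s);
        cudnnCreateConvolutionDescriptor(&convDesc);
        cudnnSetConvolution2dDescriptor(convDesc, 0, 0, 1, 1, 1, 1,  // pad, stride, dilation
                                        CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

        // Let cuDNN compute the output shape, then describe the output tensor.
        int on, oc, oh, ow;
        cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc,
                                              &on, &oc, &oh, &ow);
        cudnnCreateTensorDescriptor(&yDesc);
        cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   on, oc, oh, ow);

        const float alpha = 1.0f, beta = 0.0f;
        cudnnConvolutionForward(handle, &alpha, xDesc, x, wDesc, filt, convDesc,
                                CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM,
                                nullptr, 0,  // this algorithm needs no workspace
                                &beta, yDesc, y);

        cudnnDestroyTensorDescriptor(xDesc);
        cudnnDestroyTensorDescriptor(yDesc);
        cudnnDestroyFilterDescriptor(wDesc);
        cudnnDestroyConvolutionDescriptor(convDesc);
    }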
EXTENDING THE REACH OF CUDA
1 Machine Learning
2 Higher Performance
3 New Platforms
4 New Languages
NVLINK: HIGH-SPEED GPU INTERCONNECT
[Diagram: In 2014, a Kepler GPU connects to x86, ARM64, and POWER CPUs over PCIe. In 2016, a Pascal GPU connects to peer GPUs and to the POWER CPU over NVLink, while still using PCIe for x86 and ARM64 CPUs.]
NVLINK UNLEASHES MULTI-GPU PERFORMANCE
Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe
[Chart: Speedup vs. a PCIe-based server, ranging from roughly 1.25x to 2.25x across ANSYS Fluent, multi-GPU sort, LQCD (QUDA), AMBER, and 3D FFT. Diagram: GPUs interconnected with NVLink, 5x faster than PCIe Gen3 x16.]
3D FFT and ANSYS use a 2-GPU configuration; all other apps compare a 4-GPU configuration. AMBER Cellulose (256x128x128), FFT problem size 256^3.
NVLINK + UNIFIED MEMORY: SIMPLER, FASTER
[Diagram: Two Tesla GPUs joined by an 80 GB/s NVLink, attached to the CPU through a PCIe switch; NVLink is 5x faster than PCIe Gen3 x16.]
Share data structures at CPU memory speeds, not PCIe speeds
Eliminate multi-GPU scaling bottlenecks
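Unified Memory itself shipped in CUDA 6. As a minimal sketch (not from the slides) of the single-pointer model that NVLink then accelerates, using the standard cudaMallocManaged API:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float *data;
        // One pointer, visible to both CPU and GPU; the driver migrates pages on demand.
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = 1.0f;     // CPU writes directly
        scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); // GPU touches the same pointer
        cudaDeviceSynchronize();                        // wait before the CPU reads again
        printf("data[0] = %f\n", data[0]);              // prints 2.0
        cudaFree(data);
        return 0;
    }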
EXTENDING THE REACH OF CUDA
1 Machine Learning
2 Higher Performance
3 New Platforms
4 New Languages
TESLA ACCELERATED COMPUTING PLATFORM
Development
- Programming Languages / Compiler: CUDA, LLVM, ...
- Development Tools (Profile and Debug): Debugging API, ...
- Software Solutions (Libraries): cuBLAS, ...
System Solutions
- GPU Accelerators: GPU Boost, ...
- System Management: NVML, ...
Data Center Infrastructure
- Communication (Interconnect Solutions): GPU Direct, NVLink, ...
COMMON PROGRAMMING APPROACHES
Across a variety of heterogeneous systems (x86 and other CPU architectures):
- Libraries (e.g., AmgX, cuBLAS)
- Compiler Directives
- Programming Languages
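As an illustration of the library approach (not from the slide), a minimal cuBLAS SAXPY sketch using the standard cublas_v2 API; managed memory keeps it short, but plain cudaMalloc plus cudaMemcpy works the same way:

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha * x + y, on the GPU
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // prints 4.0
        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }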
EXTENDING THE REACH OF CUDA
1 Machine Learning
2 Higher Performance
3 New Platforms
4 New Languages
MAINSTREAM PARALLEL PROGRAMMING
Enable more programmers to write parallel software
Give programmers the choice of language to use
Embrace and evolve key programming standards
C++ PARALLEL ALGORITHMS LIBRARY PROGRESS

    std::vector<int> vec = ...

    // previous standard sequential loop
    std::for_each(vec.begin(), vec.end(), f);

    // explicitly sequential loop
    std::for_each(std::seq, vec.begin(), vec.end(), f);

    // permitting parallel execution
    std::for_each(std::par, vec.begin(), vec.end(), f);

- Complete set of parallel primitives: for_each, sort, reduce, scan, etc.
- ISO C++ committee voted unanimously to accept as official technical specification working draft
N3960 Technical Specification Working Draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3960.pdf
Prototype: https://github.com/n3554/n3554
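For context beyond the slide: this Parallelism TS was later folded into C++17, where the policies moved into the std::execution namespace (std::execution::seq / std::execution::par). A minimal complete sketch against the standardized form, assuming a C++17 toolchain with parallel algorithms enabled (with libstdc++ this typically means linking TBB, e.g. -ltbb):

    #include <algorithm>
    #include <execution>
    #include <vector>
    #include <cstdio>

    int main() {
        std::vector<int> vec(1 << 20, 1);
        // Same algorithm, same iterators; the policy argument permits parallel execution.
        std::for_each(std::execution::par, vec.begin(), vec.end(),
                      [](int &x) { x *= 2; });
        std::printf("vec[0] = %d\n", vec[0]);  // prints 2
        return 0;
    }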
Linux GCC Compiler to Support GPU Accelerators
Open-source GCC effort by Mentor Embedded
Pervasive impact: free to all Linux users
On track to be mainstream in GCC 5, the most widely used HPC compiler
"Incorporating OpenACC into GCC is an excellent example of open source and open standards working together to make accelerated computing broadly accessible to all Linux developers."
Oscar Hernandez, Oak Ridge National Laboratory
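The slide shows no code; as a minimal OpenACC sketch (an illustration, not from the deck), here is a directive-annotated SAXPY loop that an OpenACC-capable compiler such as the GCC 5 effort above can offload (GCC enables OpenACC with -fopenacc; without such support the pragma is ignored and the loop runs sequentially on the CPU):

    // SAXPY with an OpenACC directive: the compiler generates the GPU kernel
    // and the host/device data movement described by the data clauses.
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }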
NUMBA PYTHON COMPILER
Free and open source JIT compiler for array-oriented Python
numba.cuda module integrates CUDA directly into Python

    @cuda.jit("void(float32[:], float32, float32[:], float32[:])")
    def saxpy(out, a, x, y):
        i = cuda.grid(1)
        if i < out.size:
            out[i] = a * x[i] + y[i]

    # Launch saxpy kernel
    saxpy[100, 512](out, a, x, y)

NumbaPro: commercial extension of Numba with Python interfaces to CUDA libraries
http://numba.pydata.org/
ACCELERATING JAVA 3 WAYS
1. Drop-In Acceleration: accelerate Java SE libraries with CUDA, e.g. java.util.Arrays.sort(int[] a)
2. CUDA4J: CUDA C++ programming via Java APIs
3. Accelerate Pure Java: Java 8 streams and lambda expressions
IBM Developer Kits for Java: ibm.com/java/jdk
CUDA4J: GPU PROGRAMMING IN A JAVA API
Access the CUDA programming model with Java best practices

    void add(int[] a, int[] b, int[] c) throws CudaException, IOException {
        // Manage CUDA devices and kernels
        CudaDevice device = new CudaDevice(0);
        CudaModule module = new CudaModule(device,
            getClass().getResourceAsStream("ArrayAdder"));
        CudaKernel kernel = new CudaKernel(module, "Cuda_cuda4j_samples_adder");
        CudaGrid grid = new CudaGrid(512, 512);

        // Easily transfer data between the Java heap and the CUDA device
        try (CudaBuffer aBuffer = new CudaBuffer(device, a.length * 4);
             CudaBuffer bBuffer = new CudaBuffer(device, b.length * 4)) {
            aBuffer.copyFrom(a, 0, a.length);
            bBuffer.copyFrom(b, 0, b.length);

            // Simple, flexible kernel launch
            kernel.launch(grid, new CudaKernel.Parameters(aBuffer, bBuffer, a.length));
            aBuffer.copyTo(c, 0, a.length);
        }
    }
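The module loaded above ("ArrayAdder") is a precompiled CUDA kernel that the slide does not show. A plausible minimal sketch of it in CUDA C++: the entry-point name comes from the slide, but the parameter layout and the in-place accumulation into the first buffer are assumptions inferred from the launch call and the copy-back from aBuffer:

    // Hypothetical device-side counterpart of the CUDA4J example above.
    extern "C" __global__ void Cuda_cuda4j_samples_adder(int *a, const int *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] += b[i];  // result lands in aBuffer, which the Java code copies into c
    }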
ACCELERATING PURE JAVA ON GPUS
Java 8 streams and lambda expressions express computation as aggregate parallel operations on data streams:

    IntStream.range(0, N).parallel().forEach(i -> c[i] = a[i] + b[i]);

Benefits:
- Standard Java idioms, so no code changes required
- No knowledge of the GPU programming model required
- No low-level device manipulation; the Java implementation has the controls
- Future JIT smarts do not require application code changes
JIT / GPU OPTIMIZATION OF A LAMBDA EXPRESSION
JIT-recognized Java matrix multiplication, sped up when run on a GPU-enabled host (IBM POWER8 with an NVIDIA K40m GPU):

    public void multiply() {
        IntStream.range(0, COLS * COLS).parallel().forEach(id -> {
            int i = id / COLS;
            int j = id % COLS;
            int sum = 0;
            for (int k = 0; k < COLS; k++) {
                sum += left[i * COLS + k] * right[k * COLS + j];
            }
            output[i * COLS + j] = sum;
        });
    }
COMMON PROGRAMMING APPROACHES
Across a variety of heterogeneous systems:
- Libraries (e.g., AmgX, cuBLAS)
- Compiler Directives
- Programming Languages