GPU Programming René Kloth Florian Wende Pro Seminar: Parallel Programming Freie Universität Berlin, WS 2012/13, Prof. Dr. M. Esponda

Presentation Outline
■ Current Hardware Accelerators
■ Nvidia Fermi GPU Architecture
■ The CUDA/OpenCL Programming Model
■ Matrix-Matrix-Multiplication on GPU and CPU
■ On Using the GPU’s Shared Memory
■ Tiling Techniques
■ Efficient Memory Access Patterns
■ OpenMP + Vectorization on CPU
■ Molecular Dynamics
■ Ray Tracing

Hardware Accelerators

Hardware accelerator: Computer hardware that performs certain functions faster than a standard CPU can do in software.

Specialized hardware acting as co-processor for the CPU
■ 1980: Intel 8087 x87 floating-point co-processor for 8086 CPUs.
■ 1985: Amiga 1000, co-processors for Video/Audio/DMA/...
■ 2000 and later: Single-thread performance stalls!

Instruction-level parallelism is limited by single-thread performance.
Increase of application performance through task/thread-level parallelism.
■ FPGA
■ ClearSpeed
■ Cell processor, IBM PowerXCell 8i
■ GPU
■ Intel Xeon Phi

Heterogeneous Computing

Modern computer systems consist of a combination of CPU(s) and hardware accelerator(s).

[Figure: block diagram with the labels CPU, Accelerator (GPU, FPGA, Cell, etc.), Memory, Shared Memory, Peripherals, and Connection Channel: PCI(e), HyperTransport, etc.]

Trend: Multiple CPUs + multiple accelerators per compute node.

Heterogeneous Computing

Top 500 Supercomputers, November 2012:
■ More than 12% of the systems use hardware accelerators: 80.6% Nvidia GPU, 11.3% Intel Xeon Phi, 4.8% AMD GPU.
■ 84.6% of the systems use processors with 6 or more cores, and 46.2% with 8 or more cores.

Green 500 Supercomputers, November 2012:
■ Most power-efficient system uses Intel Xeon Phi.
■ 33% of the top 100 systems use GPU and Xeon Phi hardware accelerators.

Heterogeneous computer systems are in the ascendant.
Challenging aspect: Make multiple/different compute devices work together.

GPU Computing

Since 2007 GPUs are increasingly used for non-graphics computations (GPGPU) due to
■ high compute performance, and
■ low acquisition and maintenance costs.

[Figure: Theoretical GFLOP/s over time (Sep-01 to Aug-12) for Nvidia GPUs and Intel CPUs, each in single and double precision. Annotated peaks: Tesla K20X 3.95 TFLOP/s (single precision), GeForce GTX 680 3.09 TFLOP/s (single precision), Tesla K20X 1.31 TFLOP/s (double precision). GPU data points: GeForce FX 5800, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 580, GeForce GTX 680, Tesla C1060, Tesla C2050, Tesla M2090, Tesla K20X. CPU data points: Pentium 4, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge.]
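The theoretical figures quoted for this chart can be approximated with a back-of-the-envelope formula: number of cores × clock rate × floating-point operations per core per cycle. The following is a minimal sketch only. The core counts and clock rates are assumptions taken from public spec sheets, not from the slides, and `peak_gflops` is a hypothetical helper name.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Theoretical peak in GFLOP/s.

    flops_per_cycle defaults to 2 because one fused multiply-add
    counts as two floating-point operations.
    """
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    # Assumed specs (not stated on the slides):
    #   GeForce GTX 680: 1536 cores at 1.006 GHz
    #   Tesla K20X:      2688 single-precision cores and
    #                    896 double-precision units at 0.732 GHz
    print(f"GTX 680 SP: {peak_gflops(1536, 1.006):.1f} GFLOP/s")  # 3090.4
    print(f"K20X    SP: {peak_gflops(2688, 0.732):.1f} GFLOP/s")  # 3935.2
    print(f"K20X    DP: {peak_gflops(896, 0.732):.1f} GFLOP/s")   # 1311.7
```

Under these assumptions the results land on the 3.09 and 1.31 TFLOP/s annotations and within about half a percent of the 3.95 TFLOP/s one. These are upper bounds that real applications rarely reach.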

GPU Computing

On the GPU we observe a strongly increasing number of compute cores, disproportionate to the increase in the memory bandwidth.

[Figure: Memory bandwidth in GB/s (axis from 0 to 250 GB/s) by year for Nvidia GPUs and Intel CPUs. GPU data points: GeForce FX 5900, GeForce 6800 GT, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 580, Tesla K20X. CPU data points: Pentium 4, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge.]
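One way to make "disproportionate" concrete is the ratio of peak arithmetic throughput to memory bandwidth. The sketch below uses the Tesla K20X peak figures quoted on the previous slide (3.95 and 1.31 TFLOP/s) and assumes a nominal memory bandwidth of 250 GB/s for that card, which is the value at the top of the chart. The function names are hypothetical.

```python
def flops_per_byte(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Floating-point operations the device can execute per byte it can load."""
    return peak_gflops / bandwidth_gb_s


def flops_per_operand(peak_gflops: float, bandwidth_gb_s: float,
                      operand_bytes: int) -> float:
    """Operations needed per loaded operand to keep the cores busy."""
    return flops_per_byte(peak_gflops, bandwidth_gb_s) * operand_bytes


if __name__ == "__main__":
    BANDWIDTH_GB_S = 250.0  # assumed nominal Tesla K20X bandwidth
    # Single precision: 3950 GFLOP/s, 4-byte operands
    print(f"SP: {flops_per_operand(3950.0, BANDWIDTH_GB_S, 4):.1f} FLOP per float")
    # Double precision: 1310 GFLOP/s, 8-byte operands
    print(f"DP: {flops_per_operand(1310.0, BANDWIDTH_GB_S, 8):.1f} FLOP per double")
```

Under these assumptions a kernel would need roughly 63 single-precision operations for every float it loads from device memory before arithmetic, rather than memory traffic, becomes the limit. That gap is one reason data reuse through shared memory and tiling appears in the outline.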

GPU Computing

Before the introduction of the unified-compute shader in 2007, programming GPUs for non-graphics applications was done by means of GLSL, Cg, OpenGL, DirectX: Complicated!

Current GPUs (based on the unified-shader architecture) can be ‘easily’ programmed using
■ CUDA: Nvidia proprietary parallel computing platform supporting Nvidia GPUs from the G80 series (2007) onwards.
■ OpenCL: Parallel programming platform for heterogeneous computer systems.
  ■ Apple, AMD, Intel, IBM, Nvidia: OpenCL 1.0 by the end of 2008.
  ■ More general than CUDA + available for any computer architecture supporting OpenCL.
  ■ Programming API similar to CUDA.

We focus on CUDA.
