GPU Programming
René Kloth, Florian Wende
Pro Seminar: Parallel Programming
Freie Universität Berlin, WS 2012/13, Prof. Dr. M. Esponda

Presentation Outline
■ Current Hardware Accelerators
■ Nvidia Fermi GPU Architecture
■ The CUDA/OpenCL Programming Model
■ Matrix-Matrix Multiplication on GPU and CPU
■ On Using the GPU’s Shared Memory
■ Tiling Techniques
■ Efficient Memory Access Patterns
■ OpenMP + Vectorization on CPU
■ Molecular Dynamics
■ Ray Tracing

Hardware Accelerators

Hardware accelerator: Computer hardware that performs certain functions faster than a standard CPU can do in software.

Specialized hardware acting as a co-processor for the CPU:
■ 1980: Intel 8087, x87 floating-point co-processor for 8086 CPUs.
■ 1985: Amiga 1000, co-processors for video/audio/DMA/...
■ 2000 and later: Single-thread performance stalls!
  Instruction-level parallelism is limited by single-thread performance.
  Increase of application performance through task/thread-level parallelism:
  ■ FPGA
  ■ ClearSpeed
  ■ Cell processor, IBM PowerXCell 8i
  ■ GPU
  ■ Intel Xeon Phi

Heterogeneous Computing

Modern computer systems consist of a combination of CPU(s) and hardware accelerator(s).

[Figure: block diagram of a heterogeneous system. Labels: CPU, Accelerator (GPU, FPGA, Cell, etc.), Memory, Shared Memory, Peripherals, Connection Channel: PCI(e), HyperTransport, etc.]

Trend: Multiple CPUs + multiple accelerators per compute node.

Heterogeneous Computing

Top 500 Supercomputers, November 2012:
■ More than 12% of the systems use hardware accelerators: 80.6% Nvidia GPU, 11.3% Intel Xeon Phi, 4.8% AMD GPU.
■ 84.6% of the systems use processors with 6 or more cores, and 46.2% with 8 or more cores.

Green 500 Supercomputers, November 2012:
■ The most power-efficient system uses Intel Xeon Phi.
■ 33% of the top 100 systems use GPU and Xeon Phi hardware accelerators.

Heterogeneous computer systems are in the ascendant.
Challenging aspect: Make multiple/different compute devices work together.

GPU Computing

Since 2007 GPUs are increasingly used for non-graphics computations (GPGPU) due to
■ high compute performance, and
■ low acquisition and maintenance costs.

[Figure: theoretical GFLOP/s over time (Sep-01 to Aug-12) for Nvidia GPUs and Intel CPUs, single and double precision. GPUs shown: GeForce FX 5800, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 580, GeForce GTX 680, Tesla C1060, Tesla C2050, Tesla M2090, Tesla K20X. CPUs shown: Pentium 4, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge. Labelled peaks: Tesla K20X 3.95 TFLOP/s (single precision), GeForce GTX 680 3.09 TFLOP/s (single precision), Tesla K20X 1.31 TFLOP/s (double precision).]

GPU Computing

On the GPU we observe a strongly increasing number of compute cores, disproportionate to the increase in the memory bandwidth.

[Figure: memory bandwidth in GB/s over time (2006 to 2012) for Nvidia GPUs and Intel CPUs. GPUs shown: GeForce FX 5900, GeForce 6800 GT, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX 480, GeForce GTX 580, Tesla K20X. CPUs shown: Pentium 4, Woodcrest, Harpertown, Bloomfield, Westmere, Sandy Bridge. Labelled peak: Tesla K20X 250 GB/s.]

GPU Computing

Before the introduction of the unified-compute shader in 2007, programming GPUs for non-graphics applications was done by means of GLSL, Cg, OpenGL, DirectX: Complicated!

Current GPUs (based on the unified-shader architecture) can be ‘easily’ programmed using
■ CUDA: Nvidia proprietary parallel computing platform supporting Nvidia GPUs from the G80 series (2007) onwards.
■ OpenCL: Parallel programming platform for heterogeneous computer systems.
  ■ Apple, AMD, Intel, IBM, Nvidia: OpenCL 1.0 by the end of 2008.
  ■ More general than CUDA + available for any computer architecture supporting OpenCL.
  ■ Programming API similar to CUDA.

We focus on CUDA.