GPU Computing
E. Carlinet, J. Chazalon {firstname.lastname@lrde.epita.fr}, Apr'20
EPITA Research & Development Laboratory (LRDE)


  1. GPU Computing. E. Carlinet, J. Chazalon {firstname.lastname@lrde.epita.fr}, Apr'20. EPITA Research & Development Laboratory (LRDE)

  2. Fifty shades of Parallelism

  3. How to get things done quicker
  1. Do less work
  2. Do some work better (i.e. the most time-consuming part)
  3. Do some work at the same time
  4. Distribute work between different workers
  • (1) Choose the best-suited algorithms, and avoid re-computing things
  • (2) Choose the best-suited data structures
  • (3, 4) Parallelism


  5. Why parallelism?
  • Moore's law: processors are no longer getting twice as powerful every 2 years
  • So the processor is getting smarter:
    • Out-of-order execution / dynamic register renaming
    • Speculative execution with branch prediction
  • And the processor is getting super-scalar:
    • The ISA gets vectorized instructions (more details in a few slides)

  6. Toward data-oriented programming
  • While the CPU clock rate got bounded…
  • …the quantity of data to process has shot up!
  We need another way of thinking about "speed".

  7. The burger factory assembly line. How to make several sandwiches as fast as possible?
  • Optimize for latency: time to get 1 sandwich done.
  • Optimize for throughput: number of sandwiches done during a given duration.


  9. Data-oriented programming parallelism: Flynn's Taxonomy

                | Single Instruction | Multiple Instruction
  Single Data   | SISD               | MISD
  Multiple Data | SIMD               | MIMD

  • SISD: no parallelism
  • SIMD: same instruction on a data group (vector)
  • MISD: rare, mostly used for fault-tolerant code
  • MIMD: the usual parallel mode

  10. Optimize for latency (MIMD with collaborative workers). 4 super-workers (4 CPU cores) collaborate to make 1 sandwich:
  • Manu gets the bread, cuts it, and waits for the others
  • Donald slices the salad
  • Angela slices the tomatoes
  • Kim slices the cheese
  Time to make 1 sandwich: t/4 (400% speed-up). This is optimized for latency (CPUs are good at that).

  11. Optimize for throughput (MIMD, horizontal: multiple jobs)
  • Manu makes sandwich 1
  • Donald makes sandwich 2
  • …
  Time to make 4 sandwiches: t (400% speed-up). This is optimized for throughput (GPUs are good at that).

  12. Optimize for throughput (MIMD, vertical: pipelining)
  • Manu cuts the bread
  • Donald slices the salad
  • Angela slices the tomatoes
  • …
  Time to make 4 sandwiches: t (400% speed-up)

  13. Optimize for throughput (SIMD: data-level parallelism). A worker has many arms and makes 4 sandwiches at a time. Time to make 4 sandwiches: t (400% speed-up)

  14. More cores is trendy. Data-oriented design has changed the way we make processors (even CPUs):
  • Lower clock rate
  • Larger vector size, more vector-oriented ISA
  • More cores (processing units)

             | 64-bit Intel Xeon   | Xeon 5100 series   | Xeon 5500 series   | Xeon 5600 series   | Xeon E5 2600 v2 series | Xeon Phi 7120P
  Freq       | 3.6 GHz             | 3.0 GHz            | 3.2 GHz            | 3.3 GHz            | 2.7 GHz                | 1.24 GHz
  Cores      | 1                   | 2                  | 4                  | 6                  | 12                     | 61
  Threads    | 2                   | 2                  | 8                  | 12                 | 24                     | 244
  SIMD width | 128 bits (2 clocks) | 128 bits (1 clock) | 128 bits (1 clock) | 128 bits (1 clock) | 256 bits (1 clock)     | 512 bits (1 clock)

  15. More cores is trendy. Peak performance per core is getting lower; global peak performance is getting higher (with more cores!)

  16. CPU vs GPU performance. And you see it with HPC apps: [Chart: HPC application performance, CPU vs GPU]

  17. Toward Heterogeneous Architectures
  What is the best speed-up attainable on a parallel machine for a program that is parallelizable at P% (i.e. must run sequentially for (1 − P))?
  If you have N processors, the speed-up is:
    Speed-up = t_old / t_new = 1 / ((1 − P) + P / N)
  • (1 − P): time to run the sequential part
  • P / N: time to run the parallel part
  Example: Parallelizable = 80%, Sequential = 20% ⇒ max speed-up = 5
  [Chart: speed-up vs. number of processors, flattening toward 5]
  But don't forget, you may need to optimize both latency and throughput.


  19. Toward Heterogeneous Architectures (1/2)
  Speed-up = t_old / t_new = 1 / ((1 − P) + P / N)
  • (1 − P): time to run the sequential part
  • P / N: time to run the parallel part
  • Latency-optimized (multi-core CPU): poor performance on parallel portions
  • Throughput-optimized (GPU): poor performance on sequential portions
  [Charts: execution-time breakdown for each architecture]

  20. Toward Heterogeneous Architectures (2/2)
  Heterogeneous (CPU + GPU):
  • Use the right tool for the right job
  • Allows aggressive optimization for latency or for throughput
  [Chart: execution-time breakdown for the heterogeneous case]

  21. Toward Heterogeneous Architectures
