Streamlining GPU Applications On the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Xipeng Shen
Department of Computer Science, College of William and Mary
eddy@cs.wm.edu
GPU Divergence
• GPU Features
  – Streaming multiprocessors
    • SIMD
    • Single instruction issue per SM
  – Warp / Half Warp
    • SIMD execution unit
• Divergence
  – Threads in a warp take different execution paths
Example of GPU Divergence
[Figure: a warp running instructions A, B, and C under a control-flow split; threads on different paths are serialized, so Inst. A, Inst. B, and Inst. C execute one after another over time.]
Impact of GPU Divergence
• Degrades GPU throughput
  – E.g., up to 15/16 degradation on Tesla 1060
• Impairs GPU usability
  – Esp. for code with non-trivial condition statements
Related Work
• Stream packing and unpacking [Popa: U. Waterloo C.S. master thesis '04]
  – Simulates hardware packing on the CPU
• Dynamic warp formation & scheduling [Fung+: MICRO '07]
  – Hardware solution
• Control structure splitting [Carrillo+: CF '09]
  – Reduces register pressure but removes no divergence
Basic Idea of Our Solution
• Swap the jobs of threads through thread-data remapping
• Example (a runnable sketch follows): the values in A[] decide which branch each thread takes

    if ( A[tid] ) {
        C[tid] += 1;    // "green" threads
    } else {
        C[tid] -= 1;    // "red" threads
    }

[Figure: before remapping, green and red threads are mixed within warp 1, warp 2, and warp 3; thread-data remapping groups same-branch threads into the same warps.]
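A minimal runnable sketch of the kind of divergent kernel the remapping targets (not the paper's code; the kernel name and bounds check are assumptions):

    // Hypothetical divergent kernel: threads in the same warp may take
    // different branches depending on the runtime values in A[].
    __global__ void divergent_kernel(const int *A, int *C, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (A[tid]) {        // "green" threads take this path
            C[tid] += 1;
        } else {             // "red" threads take this path
            C[tid] -= 1;
        }
    }

If the values in A[] are irregular, each 32-thread warp is likely to contain both branches and will serialize them.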
Challenges
• How to determine a desirable mapping?
  – Complexities: irregular accesses, complex indexing expressions, side effects on memory reference patterns, ...
• How to realize the new mapping?
  – Data movement or redirection of threads' data references
  – Limitations, effectiveness, and safety
• How to do it on the fly?
  – Large overhead vs. the need for runtime remapping
  – Dependence on runtime data values; minimizing and hiding overhead
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
GPU Divergence Causes
• Control flows in code
  – E.g., if, do, for, while, switch
• Input data dependence
  – Input data set --> execution path
  – Thread-data mapping --> amount of thread divergence
Define Divergence
• Control flow path vector for one thread
  Def: Pvector[tid] = < b1, b2, b3, ..., bn >
• Path vector example
  Condition statements:
    if ( A[tid] % 2 ) {...};
    if ( A[tid] < 10 ) {...};

    tid   A[tid]   Pvector
    0     2        <0,1>
    1     11       <1,0>
    2     14       <0,0>
    ...   ...      ...
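One possible way to collect per-thread path vectors for the two example conditions above; this is a sketch with assumed names (collect_pvectors, pvec), not the paper's runtime library:

    // Hypothetical helper kernel: record each thread's path vector
    // <b1, b2> for the two example conditions as a 2-bit value.
    __global__ void collect_pvectors(const int *A, unsigned char *pvec, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        unsigned char b1 = (A[tid] % 2)  ? 1 : 0;    // outcome of branch 1
        unsigned char b2 = (A[tid] < 10) ? 1 : 0;    // outcome of branch 2
        pvec[tid] = (unsigned char)((b1 << 1) | b2); // Pvector[tid] = <b1, b2>
    }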
Regroup Threads
• To satisfy the convergence condition:
  – Sort Pvector[0], Pvector[1], Pvector[2], ... for all threads (a sketch of this step follows)
  – E.g., after sorting, the grouping of threads:

    Path vector:   <0,0>       <0,1>     <1,1>
    Thread index:  0 12 8 11   9 4 6 7   2 5 10 3   1 13 14 15
                   (warp)      (warp)    (warp)     (warp)
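A host-side sketch of the sorting step, assuming the path vectors have been copied back to the CPU (function and variable names are illustrative):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Hypothetical CPU-side regrouping: produce a permutation of thread ids
    // such that threads with equal path vectors become contiguous and thus
    // fall into the same warps.
    std::vector<int> regroup_by_pvector(const std::vector<unsigned char> &pvec)
    {
        std::vector<int> order(pvec.size());
        std::iota(order.begin(), order.end(), 0);            // 0, 1, 2, ...
        std::stable_sort(order.begin(), order.end(),
                         [&](int a, int b) { return pvec[a] < pvec[b]; });
        return order;   // order[k] = old thread id placed at position k
    }

The returned permutation can then be realized either by redirecting references or by transforming the data layout, as the next slides show.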
Example of GPU Thread Divergence
[Figure: eight data elements (index j) with path vectors <0,0> and <1,0> mixed; under the identity mapping Thread[i] --> Data[j] with i == j, both WARP 1 and WARP 2 contain threads on different paths, so both warps diverge.]
Remapping by Reference Redirection
[Figure: the data stays in place; each thread instead accesses element j == IND[i] through an indirection array, so WARP 1's threads all follow path <0,0> and WARP 2's threads all follow path <1,0>. A kernel sketch follows.]
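A sketch of what reference redirection can look like inside the kernel; IND[] holds the remapped data index for each thread (the indirection-array name and kernel name are assumptions):

    // Hypothetical redirected kernel: thread i works on data element IND[i]
    // instead of element i, so threads on the same path share a warp.
    __global__ void redirected_kernel(const int *A, int *C, const int *IND, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int j = IND[tid];    // j == IND[i]: the data stays in place
        if (A[j]) {
            C[j] += 1;
        } else {
            C[j] -= 1;
        }
    }

The extra indirection adds one memory load per thread and may make accesses less coalesced, which is part of the effectiveness and safety trade-off mentioned earlier.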
Remapping by Data Transformation
[Figure: data elements with different path vectors are physically swapped in memory; the mapping remains Thread[i] --> Data[j] with i == j, but after the swap each warp's threads all follow the same path. A host-side sketch follows.]
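A host-side sketch of the data-transformation alternative: the data itself is permuted so the unmodified kernel (thread i works on element i) becomes divergence-free. All names are illustrative, and results would later need to be mapped back to the original order:

    #include <vector>

    // Hypothetical host-side layout transformation: rearrange A (and any other
    // per-thread data) according to the computed thread order.
    void transform_layout(const std::vector<int> &order,
                          std::vector<int> &A, std::vector<int> &C)
    {
        std::vector<int> newA(A.size()), newC(C.size());
        for (size_t k = 0; k < order.size(); ++k) {
            newA[k] = A[order[k]];   // element for the thread now at position k
            newC[k] = C[order[k]];
        }
        A.swap(newA);   // the original kernel now sees the permuted layout
        C.swap(newC);
    }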
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Overview of CPU-GPU Synergy
[Figure: the GPU collects branch info (path vectors) and sends it to the CPU; the CPU computes the best mapping with feedback control and sends the remapping info back; the GPU then realizes the desired thread-data mapping. The CPU and GPU parts are independent, protected, and pipelined.]
CPU-GPU Pipeline Scheme
• Without pipelining: speedup = T_div / (T_no-div + T_remap), which may be >= 1 or < 1 (possible slowdown)
• With pipelining: the remapping time is hidden by overlapping it with GPU execution, so speedup = T_div / T_no-div >= 1 (no slowdown!)
CPU-GPU Pipeline Example I
[Figure: GPU code timeline; a remapping thread on the CPU runs the remap function while the GPU kernel function executes, so the remapping overlaps with the GPU work.]
CPU-GPU Pipeline Example II
[Figure: pipelined timeline with multiple CPU remapping threads overlapping successive GPU kernel invocations; the threading is controllable and adaptive to available resources. A host-side sketch of the overlap follows.]
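A rough sketch of how the remapping work can overlap with GPU execution across iterations; since kernel launches are asynchronous, the host can prepare the next iteration's mapping while the GPU runs the current one. The helper functions are assumptions standing in for the paper's runtime library, not its actual API:

    // Assumed helpers (see the earlier sketches): build and upload the next mapping.
    void compute_next_mapping();                   // e.g., sort path vectors on the CPU
    void upload_next_mapping(int *d_IND, int n);   // copy the new IND[] to the GPU
    __global__ void redirected_kernel(const int *A, int *C, const int *IND, int n);

    // Hypothetical pipelined driver loop.
    void pipelined_driver(int iterations, int blocks, int threads,
                          int *d_A, int *d_C, int *d_IND, int n)
    {
        for (int t = 0; t < iterations; ++t) {
            // Asynchronous launch: control returns to the CPU immediately.
            redirected_kernel<<<blocks, threads>>>(d_A, d_C, d_IND, n);

            // Overlapped CPU work: compute the remapping for iteration t+1.
            compute_next_mapping();

            cudaDeviceSynchronize();        // wait for the current kernel
            upload_next_mapping(d_IND, n);  // install the new mapping
        }
    }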
Applicable Scenarios
• Loops
  – Multiple invocations of the same kernel function
• Input data partitioning for a kernel function
  – Creates multiple iterations
• Across different kernels
  – When idle CPU processing resources are available
LAM: Reduce Data Movements
• For data layout transformation: the LAM scheme
  – 3 steps: Label, Assign & Move (LAM)
  – Label: classify path vectors into multiple classes
    • Based on similarity; the number of classes is tunable
  – Assign: assign warps to the different classes
    • Based on occupation ratio, to increase the number of elements that need no move
  – Move: determine the destination of each remaining element
  (A much-simplified sketch follows.)
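A much-simplified host-side illustration of the three steps (not the paper's exact algorithm); a class here is just the path-vector value, each warp is assigned the class that already dominates it, and only the mismatched elements are given a new destination:

    #include <algorithm>
    #include <vector>

    // Returns dest[i]: the slot that element i should occupy after the move.
    std::vector<int> lam(const std::vector<unsigned char> &pvec, int warpSize)
    {
        int n = (int)pvec.size();
        int nWarps = (n + warpSize - 1) / warpSize;

        // Label + Assign: each warp gets the class (path vector) that already
        // occupies most of its slots, maximizing the no-need-to-move elements.
        std::vector<unsigned char> warpClass(nWarps);
        for (int w = 0; w < nWarps; ++w) {
            int count[256] = {0};
            int lo = w * warpSize, hi = std::min(n, lo + warpSize);
            for (int i = lo; i < hi; ++i) count[pvec[i]]++;
            warpClass[w] = (unsigned char)(std::max_element(count, count + 256) - count);
        }

        // Move: matching elements stay put; the rest go to free slots, preferring
        // a slot in a warp assigned to their own class.
        std::vector<int> dest(n, -1), freeSlots;
        for (int i = 0; i < n; ++i) {
            if (pvec[i] == warpClass[i / warpSize]) dest[i] = i;
            else freeSlots.push_back(i);           // this slot is also freed
        }
        std::vector<char> used(freeSlots.size(), 0);
        for (int i : freeSlots) {                  // pending elements == freed slots
            int chosen = -1;
            for (int s = 0; s < (int)freeSlots.size(); ++s) {
                if (used[s]) continue;
                if (warpClass[freeSlots[s] / warpSize] == pvec[i]) { chosen = s; break; }
                if (chosen == -1) chosen = s;      // fallback: first free slot
            }
            used[chosen] = 1;
            dest[i] = freeSlots[chosen];
        }
        return dest;
    }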
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Experiment Settings
• Host machine
  – Dual-socket quad-core Intel Xeon E5540
• GPU device
  – NVIDIA Tesla 1060
• Runtime library
  – Reference redirection & data transformation
  – Pipeline thread management
Benchmarks

  Program        Comments              Div. Source           Potential     Percent of
                                                             Div. Warps    Div. Reduction
  3D-LBM         LBM-based PDE solver  if statement          50-100%       50-100%
  GAFORT         genetic algorithm     if statement & loop   100%          44%
  MarchingCubes  graphics algorithm    if statement          100%          99%
  Reduction      parallel sum          if statement          100%          50%
  Blackscholes   option pricing        if statement          0%            0%
Evaluation: Data Transformation
• GAFORT
  – Divergence caused by mutation probabilities
  – Regular memory access
  – select_cld kernel
  – Remap scheme: data layout transformation
  – Efficiency control: LAM & pipelining

[Chart: speedup over the baseline for the No-LAM, +LAM, and +Pipeline configurations.]

  Performance comparison:
                 Before    After     Change
  Div. Ratio     100%      56%       44% reduced
  Time           67225     51325     1.31 speedup
Evaluation: Reference Redirection
• MarchingCubes
  – Divergence source: number of vertices that intersect the isosurface
  – Random memory access
  – generateTriangles2 kernel
  – Remap scheme: reference redirection
  – Efficiency control: pipelining

  Performance (time in micro-sec):
  BlockSize      32       64       256
  Org. Time      17414    16707    16673
  Opt. Time      12666    12371    12425
  Div. Reduct.   99%      99%      99%
  Speedup        1.37     1.35     1.34
Evaluation: All Benchmarks
[Chart: speedups of the original (Org), the optimized version without efficiency control, and the optimized version with efficiency control for 3D-LBM, GAFORT, MARCH, REDUCT, and BLACK; without efficiency control some programs lose performance, while with efficiency control there is no performance loss.]