Streamlining GPU Applications On the Fly: Thread Divergence Elimination through Runtime Thread-Data Remapping
Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Xipeng Shen
Department of Computer Science, College of William and Mary
eddy@cs.wm.edu
GPU Divergence
• GPU Features
  – Streaming multiprocessors
    • SIMD
    • Single instruction issue per SM
  – Warp / Half Warp
    • SIMD execution unit
• Divergence
  – Threads in a warp take different execution paths
Example of GPU Divergence
[Figure: a warp running instructions A, B, and C under a control-flow split; threads on different paths are serialized, so Inst. A, Inst. B, and Inst. C execute one after another over time.]
Impact of GPU Divergence
• Degrades GPU throughput
  – E.g., up to 15/16 degradation on Tesla 1060
• Impairs GPU usability
  – Esp. for code with non-trivial condition statements
Related Work
• Stream packing and unpacking [Popa: U. Waterloo C.S. master thesis '04]
  – Simulates hardware packing on the CPU
• Dynamic warp formation & scheduling [Fung+: MICRO '07]
  – Hardware solution
• Control structure splitting [Carrillo+: CF '09]
  – Reduces register pressure but removes no divergence
Basic Idea of Our Solution
• Swap the jobs of threads through thread-data remapping
• Example (a runnable sketch follows): the values in A[] decide which branch each thread takes

    if ( A[tid] ) {
        C[tid] += 1;    // "green" threads
    } else {
        C[tid] -= 1;    // "red" threads
    }

[Figure: before remapping, green and red threads are mixed within warp 1, warp 2, and warp 3; thread-data remapping groups same-branch threads into the same warps.]
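A minimal runnable sketch of the kind of divergent kernel the remapping targets (not the paper's code; the kernel name and bounds check are assumptions):

    // Hypothetical divergent kernel: threads in the same warp may take
    // different branches depending on the runtime values in A[].
    __global__ void divergent_kernel(const int *A, int *C, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        if (A[tid]) {        // "green" threads take this path
            C[tid] += 1;
        } else {             // "red" threads take this path
            C[tid] -= 1;
        }
    }

If the values in A[] are irregular, each 32-thread warp is likely to contain both branches and will serialize them.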
Challenges
• How to determine a desirable mapping?
  – Complexities: irregular accesses, complex indexing expressions, side effects on memory reference patterns, ...
• How to realize the new mapping?
  – Data movement or redirection of threads' data references
  – Limitations, effectiveness, and safety
• How to do it on the fly?
  – Large overhead vs. the need for runtime remapping
  – Dependence on runtime data values; minimizing and hiding overhead
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
GPU Divergence Causes
• Control flows in code
  – E.g., if, do, for, while, switch
• Input data dependence
  – Input data set --> execution path
  – Thread-data mapping --> amount of thread divergence
Define Divergence
• Control flow path vector for one thread
  Def: Pvector[tid] = < b1, b2, b3, ..., bn >
• Path vector example
  Condition statements:
    if ( A[tid] % 2 ) {...};
    if ( A[tid] < 10 ) {...};

    tid   A[tid]   Pvector
    0     2        <0,1>
    1     11       <1,0>
    2     14       <0,0>
    ...   ...      ...
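One possible way to collect per-thread path vectors for the two example conditions above; this is a sketch with assumed names (collect_pvectors, pvec), not the paper's runtime library:

    // Hypothetical helper kernel: record each thread's path vector
    // <b1, b2> for the two example conditions as a 2-bit value.
    __global__ void collect_pvectors(const int *A, unsigned char *pvec, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        unsigned char b1 = (A[tid] % 2)  ? 1 : 0;    // outcome of branch 1
        unsigned char b2 = (A[tid] < 10) ? 1 : 0;    // outcome of branch 2
        pvec[tid] = (unsigned char)((b1 << 1) | b2); // Pvector[tid] = <b1, b2>
    }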
Regroup Threads
• To satisfy the convergence condition:
  – Sort Pvector[0], Pvector[1], Pvector[2], ... for all threads (a sketch of this step follows)
  – E.g., after sorting, the grouping of threads:

    Path vector:   <0,0>       <0,1>     <1,1>
    Thread index:  0 12 8 11   9 4 6 7   2 5 10 3   1 13 14 15
                   (warp)      (warp)    (warp)     (warp)
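A host-side sketch of the sorting step, assuming the path vectors have been copied back to the CPU (function and variable names are illustrative):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Hypothetical CPU-side regrouping: produce a permutation of thread ids
    // such that threads with equal path vectors become contiguous and thus
    // fall into the same warps.
    std::vector<int> regroup_by_pvector(const std::vector<unsigned char> &pvec)
    {
        std::vector<int> order(pvec.size());
        std::iota(order.begin(), order.end(), 0);            // 0, 1, 2, ...
        std::stable_sort(order.begin(), order.end(),
                         [&](int a, int b) { return pvec[a] < pvec[b]; });
        return order;   // order[k] = old thread id placed at position k
    }

The returned permutation can then be realized either by redirecting references or by transforming the data layout, as the next slides show.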
Example of GPU Thread Divergence
[Figure: eight data elements (index j) with path vectors <0,0> and <1,0> mixed; under the identity mapping Thread[i] --> Data[j] with i == j, both WARP 1 and WARP 2 contain threads on different paths, so both warps diverge.]
Remapping by Reference Redirection
[Figure: the data stays in place; each thread instead accesses element j == IND[i] through an indirection array, so WARP 1's threads all follow path <0,0> and WARP 2's threads all follow path <1,0>. A kernel sketch follows.]
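A sketch of what reference redirection can look like inside the kernel; IND[] holds the remapped data index for each thread (the indirection-array name and kernel name are assumptions):

    // Hypothetical redirected kernel: thread i works on data element IND[i]
    // instead of element i, so threads on the same path share a warp.
    __global__ void redirected_kernel(const int *A, int *C, const int *IND, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int j = IND[tid];    // j == IND[i]: the data stays in place
        if (A[j]) {
            C[j] += 1;
        } else {
            C[j] -= 1;
        }
    }

The extra indirection adds one memory load per thread and may make accesses less coalesced, which is part of the effectiveness and safety trade-off mentioned earlier.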
Remapping by Data Transformation
[Figure: data elements with different path vectors are physically swapped in memory; the mapping remains Thread[i] --> Data[j] with i == j, but after the swap each warp's threads all follow the same path. A host-side sketch follows.]
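A host-side sketch of the data-transformation alternative: the data itself is permuted so the unmodified kernel (thread i works on element i) becomes divergence-free. All names are illustrative, and results would later need to be mapped back to the original order:

    #include <vector>

    // Hypothetical host-side layout transformation: rearrange A (and any other
    // per-thread data) according to the computed thread order.
    void transform_layout(const std::vector<int> &order,
                          std::vector<int> &A, std::vector<int> &C)
    {
        std::vector<int> newA(A.size()), newC(C.size());
        for (size_t k = 0; k < order.size(); ++k) {
            newA[k] = A[order[k]];   // element for the thread now at position k
            newC[k] = C[order[k]];
        }
        A.swap(newA);   // the original kernel now sees the permuted layout
        C.swap(newC);
    }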
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Overview of CPU-GPU Synergy
[Figure: the GPU collects branch info (path vectors) and sends it to the CPU; the CPU computes the best mapping with feedback control and sends the remapping info back; the GPU then realizes the desired thread-data mapping. The CPU and GPU parts are independent, protected, and pipelined.]
CPU-GPU Pipeline Scheme
• Without pipelining: speedup = T_div / (T_no-div + T_remap), which may be >= 1 or < 1 (possible slowdown)
• With pipelining: the remapping time is hidden by overlapping it with GPU execution, so speedup = T_div / T_no-div >= 1 (no slowdown!)
CPU-GPU Pipeline Example I
[Figure: GPU code timeline; a remapping thread on the CPU runs the remap function while the GPU kernel function executes, so the remapping overlaps with the GPU work.]
CPU-GPU Pipeline Example II
[Figure: pipelined timeline with multiple CPU remapping threads overlapping successive GPU kernel invocations; the threading is controllable and adaptive to available resources. A host-side sketch of the overlap follows.]
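A rough sketch of how the remapping work can overlap with GPU execution across iterations; since kernel launches are asynchronous, the host can prepare the next iteration's mapping while the GPU runs the current one. The helper functions are assumptions standing in for the paper's runtime library, not its actual API:

    // Assumed helpers (see the earlier sketches): build and upload the next mapping.
    void compute_next_mapping();                   // e.g., sort path vectors on the CPU
    void upload_next_mapping(int *d_IND, int n);   // copy the new IND[] to the GPU
    __global__ void redirected_kernel(const int *A, int *C, const int *IND, int n);

    // Hypothetical pipelined driver loop.
    void pipelined_driver(int iterations, int blocks, int threads,
                          int *d_A, int *d_C, int *d_IND, int n)
    {
        for (int t = 0; t < iterations; ++t) {
            // Asynchronous launch: control returns to the CPU immediately.
            redirected_kernel<<<blocks, threads>>>(d_A, d_C, d_IND, n);

            // Overlapped CPU work: compute the remapping for iteration t+1.
            compute_next_mapping();

            cudaDeviceSynchronize();        // wait for the current kernel
            upload_next_mapping(d_IND, n);  // install the new mapping
        }
    }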
Applicable Scenarios
• Loops
  – Multiple invocations of the same kernel function
• Input data partitioning for a kernel function
  – Creates multiple iterations
• Across different kernels
  – When idle CPU processing resources are available
LAM: Reduce Data Movements
• For data layout transformation: the LAM scheme
  – 3 steps: Label, Assign & Move (LAM)
  – Label: classify path vectors into multiple classes
    • Based on similarity; the number of classes is tunable
  – Assign: assign warps to the different classes
    • Based on occupation ratio, to increase the number of elements that need no move
  – Move: determine the destination of each remaining element
  (A much-simplified sketch follows.)
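A much-simplified host-side illustration of the three steps (not the paper's exact algorithm); a class here is just the path-vector value, each warp is assigned the class that already dominates it, and only the mismatched elements are given a new destination:

    #include <algorithm>
    #include <vector>

    // Returns dest[i]: the slot that element i should occupy after the move.
    std::vector<int> lam(const std::vector<unsigned char> &pvec, int warpSize)
    {
        int n = (int)pvec.size();
        int nWarps = (n + warpSize - 1) / warpSize;

        // Label + Assign: each warp gets the class (path vector) that already
        // occupies most of its slots, maximizing the no-need-to-move elements.
        std::vector<unsigned char> warpClass(nWarps);
        for (int w = 0; w < nWarps; ++w) {
            int count[256] = {0};
            int lo = w * warpSize, hi = std::min(n, lo + warpSize);
            for (int i = lo; i < hi; ++i) count[pvec[i]]++;
            warpClass[w] = (unsigned char)(std::max_element(count, count + 256) - count);
        }

        // Move: matching elements stay put; the rest go to free slots, preferring
        // a slot in a warp assigned to their own class.
        std::vector<int> dest(n, -1), freeSlots;
        for (int i = 0; i < n; ++i) {
            if (pvec[i] == warpClass[i / warpSize]) dest[i] = i;
            else freeSlots.push_back(i);           // this slot is also freed
        }
        std::vector<char> used(freeSlots.size(), 0);
        for (int i : freeSlots) {                  // pending elements == freed slots
            int chosen = -1;
            for (int s = 0; s < (int)freeSlots.size(); ++s) {
                if (used[s]) continue;
                if (warpClass[freeSlots[s] / warpSize] == pvec[i]) { chosen = s; break; }
                if (chosen == -1) chosen = s;      // fallback: first free slot
            }
            used[chosen] = 1;
            dest[i] = freeSlots[chosen];
        }
        return dest;
    }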
Outline
• Thread-Data Remapping
  – Concept & mechanisms
• Transformation on the Fly
  – CPU-GPU pipelining & LAM
• Evaluation
• Conclusion
Experiment Settings
• Host machine
  – Dual-socket quad-core Intel Xeon E5540
• GPU device
  – NVIDIA Tesla 1060
• Runtime library
  – Reference redirection & data transformation
  – Pipeline thread management
Benchmarks

  Program        Comments              Div. Source           Potential     Percent of
                                                             Div. Warps    Div. Reduction
  3D-LBM         LBM-based PDE solver  if statement          50-100%       50-100%
  GAFORT         genetic algorithm     if statement & loop   100%          44%
  MarchingCubes  graphics algorithm    if statement          100%          99%
  Reduction      parallel sum          if statement          100%          50%
  Blackscholes   option pricing        if statement          0%            0%
Evaluation: Data Transformation
• GAFORT
  – Divergence caused by mutation probabilities
  – Regular memory access
  – select_cld kernel
  – Remap scheme: data layout transformation
  – Efficiency control: LAM & pipelining

[Chart: speedup over the baseline for the No-LAM, +LAM, and +Pipeline configurations.]

  Performance comparison:
                 Before    After     Change
  Div. Ratio     100%      56%       44% reduced
  Time           67225     51325     1.31 speedup
Evaluation: Reference Redirection
• MarchingCubes
  – Divergence source: number of vertices that intersect the isosurface
  – Random memory access
  – generateTriangles2 kernel
  – Remap scheme: reference redirection
  – Efficiency control: pipelining

  Performance (time in micro-sec):
  BlockSize      32       64       256
  Org. Time      17414    16707    16673
  Opt. Time      12666    12371    12425
  Div. Reduct.   99%      99%      99%
  Speedup        1.37     1.35     1.34
Evaluation: All Benchmarks
[Chart: speedups of the original (Org), the optimized version without efficiency control, and the optimized version with efficiency control for 3D-LBM, GAFORT, MARCH, REDUCT, and BLACK; without efficiency control some programs lose performance, while with efficiency control there is no performance loss.]