Common Subexpression Convergence (CSC)
Sana Damani and Vivek Sarkar
Habanero Extreme Scale Software Research Lab, Georgia Institute of Technology
Short paper at LCPC ’19, Atlanta, GA
Agenda
• Motivation
• Common Subexpression Convergence Transformations
• Approach
• Preliminary Results and Discussion
Divergence in SIMT processors
• SIMT (Single Instruction Multiple Threads)
  • All threads in a warp execute the same instruction in parallel
• Divergence
  • A conditional branch dependent on thread-local values
  • Threads in the warp execute different paths
  • Serialized execution of a warp
[Figure: threads threadIdx.x 0 to 7 of a warp, with labels (1) to (5). Image credits: https://devblogs.nvidia.com/inside-volta]
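To make the cost of serialization concrete, here is a minimal Python sketch of a toy lockstep model. It assumes an 8-thread warp (matching the thread ids in the figure; real NVIDIA warps have 32 threads) and assumes that every path taken by at least one thread costs one issue slot per instruction. The function name `simt_efficiency` and the instruction counts are hypothetical, and real hardware scheduling is more involved.

```python
WARP_SIZE = 8  # assumption: matches threadIdx.x 0..7 in the slide's figure


def simt_efficiency(path_of, path_lengths):
    """Toy model: each path taken by at least one thread is issued serially.

    path_of:      function mapping a thread id to the name of the path it takes
    path_lengths: dict mapping a path name to its instruction count
    Returns (issue_slots, efficiency), where efficiency is the average
    fraction of the warp that is active per issued instruction.
    """
    active = {}
    for tid in range(WARP_SIZE):
        path = path_of(tid)
        active[path] = active.get(path, 0) + 1

    issue_slots = sum(path_lengths[p] for p in active)
    useful_work = sum(path_lengths[p] * n for p, n in active.items())
    return issue_slots, useful_work / (issue_slots * WARP_SIZE)


lengths = {"then": 4, "else": 4}  # hypothetical instruction counts

# Convergent: every thread takes the same path.
uniform = simt_efficiency(lambda tid: "then", lengths)

# Divergent: the branch depends on the thread id, so both paths are serialized.
divergent = simt_efficiency(lambda tid: "then" if tid % 2 == 0 else "else", lengths)

print(uniform)    # (4, 1.0)
print(divergent)  # (8, 0.5)
```

In this model the divergent warp needs twice the issue slots while only half of its threads are active at any time.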
Problem: Serialization of common code
• Divergent Code
• Warp Execution
Hoist
• Move common code to a convergent common ancestor
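A minimal sketch of Hoist, written in Python as executable pseudocode for per-thread code; `tid` stands in for `threadIdx.x`, and the function names and the computation itself are illustrative assumptions rather than code from the talk.

```python
def before(tid, b, c):
    # Both divergent paths start with the same computation.
    if tid % 2 == 0:
        a = b * c
        return a + tid
    else:
        a = b * c
        return a - tid


def after_hoist(tid, b, c):
    # b and c are defined before the branch, so a = b * c can move up
    # to the convergent ancestor, where the whole warp executes it together.
    a = b * c
    if tid % 2 == 0:
        return a + tid
    else:
        return a - tid


# The transformation must not change any thread's result.
assert all(before(t, 3, 5) == after_hoist(t, 3, 5) for t in range(32))
```

Hoisting is only possible here because the operands of the common code do not depend on anything computed inside the divergent region.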
Sink
• Move common code to a convergent common successor
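A minimal sketch of Sink under the same conventions (Python as executable pseudocode, `tid` standing in for `threadIdx.x`, all names and computations hypothetical). Here the common code uses a value defined differently on each path, so it cannot move up, but it can move down past the join.

```python
def before(tid, b):
    if tid % 2 == 0:
        x = tid * 2        # path-specific definition
        out = x * b + 1    # common code, uses x
    else:
        x = tid + 3        # path-specific definition
        out = x * b + 1    # common code, uses x
    return out


def after_sink(tid, b):
    if tid % 2 == 0:
        x = tid * 2
    else:
        x = tid + 3
    # The paths have rejoined, so the common code runs convergently.
    out = x * b + 1
    return out


assert all(before(t, 7) == after_sink(t, 7) for t in range(32))
```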
Split
• Move to new convergent join point
• Duplicate conditional branch
• Alternative solution: hoist defs/sink uses
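A minimal sketch of Split, again in Python as executable pseudocode with hypothetical names. The common code sits in the middle of each path: it depends on a divergent definition before it and feeds a divergent use after it, so neither Hoist nor Sink applies directly.

```python
def before(tid, b):
    if tid % 2 == 0:
        x = tid * 2    # divergent definition feeding the common code
        y = x * b      # common code
        return y + 1   # divergent use of the common result
    else:
        x = tid + 3
        y = x * b      # common code
        return y - 1


def after_split(tid, b):
    cond = tid % 2 == 0
    if cond:
        x = tid * 2
    else:
        x = tid + 3
    # New convergent join point: the common code runs once for the warp.
    y = x * b
    # Duplicated conditional branch for the remaining divergent code.
    if cond:
        return y + 1
    else:
        return y - 1


assert all(before(t, 7) == after_split(t, 7) for t in range(32))
```

The price of the transformation is the second branch on `cond`, which is why the number of branches appears as a cost when deciding whether to split.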
Operand Renaming
• Insert copy instructions then sink/split
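A minimal sketch of operand renaming followed by a sink, in Python as executable pseudocode. The names `p`, `q`, and `t` are hypothetical; the point is that the two paths perform the same operation on differently named operands, and a copy into a shared name makes the instructions identical.

```python
def before(tid, p, q, b):
    if tid % 2 == 0:
        y = p * b + tid   # same operation, operand p
    else:
        y = q * b + tid   # same operation, operand q
    return y


def after_rename_and_sink(tid, p, q, b):
    if tid % 2 == 0:
        t = p             # inserted copy
    else:
        t = q             # inserted copy
    # With a common operand name the instruction is identical on both
    # paths and can be sunk below the join.
    y = t * b + tid
    return y


assert all(
    before(t, 3, 5, 7) == after_rename_and_sink(t, 3, 5, 7) for t in range(32)
)
```

The copies remain divergent, so the transformation only pays off when the sunk code is more expensive than the copies it introduces.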
Branches
• Flatten branch, then sink/split
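Recursive CSC
• Bottom-up traversal through the CDG (control dependence graph)
[Figure: control flow graph. The entry block defines c = ... and b = ...; nested branches on tid%2 and tid%3 contain a=b*c on three paths.]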
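The bottom-up order can be sketched as follows, in Python as executable pseudocode. The exact shape of the slide's figure is not recoverable, so the nesting below (the `tid % 3` branch inside the even side of `tid % 2`) and the path-specific return values are assumptions. The innermost divergent region is handled first, which exposes a new common instruction one level up.

```python
def original(tid, b, c):
    if tid % 2 == 0:
        if tid % 3 == 0:
            a = b * c
            return a + 1
        else:
            a = b * c
            return a + 2
    else:
        a = b * c
        return a + 3


def after_inner(tid, b, c):
    # Step 1: hoist out of the innermost region (the tid % 3 branch).
    if tid % 2 == 0:
        a = b * c
        if tid % 3 == 0:
            return a + 1
        else:
            return a + 2
    else:
        a = b * c
        return a + 3


def after_outer(tid, b, c):
    # Step 2: both sides of the tid % 2 branch now begin with a = b * c,
    # so it can be hoisted again, up to the convergent entry region.
    a = b * c
    if tid % 2 == 0:
        if tid % 3 == 0:
            return a + 1
        else:
            return a + 2
    else:
        return a + 3


for t in range(32):
    assert original(t, 3, 5) == after_inner(t, 3, 5) == after_outer(t, 3, 5)
```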
Common Loops
• Loop distribution
• Index set splitting
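The slide names the two loop transformations without showing them, so the sketch below is one plausible reading, in Python as executable pseudocode with hypothetical names. Index set splitting peels off the iteration range shared by both paths; loop distribution separates a common statement from path-specific ones inside a loop (here with the scalar expanded into a list so the value survives the distribution).

```python
def loops_before(tid, data, n, m):
    acc = 0
    if tid % 2 == 0:
        for i in range(n):        # same body, different trip counts
            acc += data[i]
    else:
        for i in range(m):
            acc += data[i]
    return acc


def loops_after_index_split(tid, data, n, m):
    acc = 0
    k = min(n, m)
    for i in range(k):            # shared index range, now convergent
        acc += data[i]
    if tid % 2 == 0:
        for i in range(k, n):     # remainder stays divergent
            acc += data[i]
    else:
        for i in range(k, m):
            acc += data[i]
    return acc


def dist_before(tid, data, scale):
    total = 0
    if tid % 2 == 0:
        for v in data:
            s = v * scale         # common statement
            total += s + 1        # path-specific statement
    else:
        for v in data:
            s = v * scale         # common statement
            total += s - 1        # path-specific statement
    return total


def dist_after(tid, data, scale):
    # Distributed loop holding only the common statement, hoisted above
    # the branch; s is expanded into a list.
    scaled = [v * scale for v in data]
    total = 0
    if tid % 2 == 0:
        for s in scaled:
            total += s + 1
    else:
        for s in scaled:
            total += s - 1
    return total


data = list(range(10))
for t in range(8):
    assert loops_before(t, data, 4, 7) == loops_after_index_split(t, data, 4, 7)
    assert dist_before(t, data, 3) == dist_after(t, data, 3)
```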
Problem Statement
Given a GPU program, identify divergent common code and move it to a convergent region using Hoist/Sink/Split such that dependences are preserved and the benefit of code motion is maximized.
Algorithm
Identifying common code: Dynamic Programming
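The slide does not give the recurrence, so the following is only a sketch under an assumption: that common code between two divergent paths can be found with a longest-common-subsequence style alignment of their instruction sequences. The function name `common_instructions` and the opcode strings are hypothetical, and matching on opcodes alone ignores operands and dependences, which a real implementation would have to check separately.

```python
def common_instructions(then_block, else_block):
    """Return index pairs (i, j) of matching instructions, in program order."""
    n, m = len(then_block), len(else_block)
    # lcs[i][j] = length of the longest common subsequence of
    # then_block[i:] and else_block[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if then_block[i] == else_block[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if then_block[i] == else_block[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


then_block = ["load", "mul", "add", "store"]
else_block = ["load", "sub", "mul", "store"]
print(common_instructions(then_block, else_block))  # [(0, 0), (1, 2), (3, 3)]
```

Each matched pair is a candidate for Hoist, Sink, or Split; the unmatched instructions stay divergent.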
Profitability Heuristics
• Benefit:
  • Function Call > Memory Instructions > Math Instructions > Copy Instructions
  • Loop nest depth
• Cost:
  • Copy Instructions for Operand Renaming
  • Increase in register live range and/or stalls with hoist/sink
  • Increase in branches, smaller blocks, more barriers with Split
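One way such a heuristic could be turned into a score is sketched below. Only the ordering of the instruction kinds comes from the slide; the numeric weights, the function name `csc_score`, the factor of 10 per loop level, and the choice to scale the net value by loop depth are all assumptions made for illustration. Register pressure, stalls, and barriers are not modeled.

```python
# Hypothetical weights; only the ordering call > memory > math > copy
# is taken from the slide.
BENEFIT = {"call": 8, "memory": 4, "math": 2, "copy": 1}
COPY_COST = 1      # assumed cost of one copy inserted by operand renaming
BRANCH_COST = 2    # assumed cost of one branch duplicated by Split
LOOP_WEIGHT = 10   # assumed execution-count multiplier per loop nest level


def csc_score(kinds, loop_depth=0, extra_copies=0, extra_branches=0):
    """Positive score: moving this common code is estimated to pay off."""
    benefit = sum(BENEFIT[k] for k in kinds)
    cost = COPY_COST * extra_copies + BRANCH_COST * extra_branches
    # Code inside deeper loop nests runs more often, so it matters more.
    return (benefit - cost) * LOOP_WEIGHT ** loop_depth


print(csc_score(["memory", "math"], loop_depth=1))  # 60
print(csc_score(["call"], extra_branches=1))        # 6
print(csc_score(["copy"], extra_copies=2))          # -1
```

Under these assumed weights, sinking a single copy that needs two renaming copies scores negative and would be skipped.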
Experimental Setup
• CUDA
• NVPTX/LLVM
• Nvidia Volta V100
Preliminary Results: Microbenchmarks
[Figure: SIMT efficiency before and after CSC (axis 0% to 120%) and speedup (axis 0 to 12) across microbenchmarks; recoverable labels: Hoist, Sink, Split, Function, Nested, Switch.]
Note: nvprof shows major gains due to reduction in global reads of up to 27% with CSC (common address reads/coalesced accesses)
Preliminary Results: Bitonic Sort
[Figure: Run Time for Bitonic Sort. Min, max, and avg run time, before and after (axis 0 to 500; units not recoverable).]
[Figure: SIMT Efficiency for Bitonic Sort. Min, max, and avg efficiency, before and after (axis 0% to 120%).]
Discussion and Future Work
• Legality
• CSE and PRE
• Interprocedural analysis
• Opportunity in automatically parallelized programs
• Profile information for divergence, cost, bottlenecks