Common Subexpression Convergence (CSC)
Sana Damani and Vivek Sarkar
Habanero Extreme Scale Software Research Lab, Georgia Institute of Technology
Short paper at LCPC ’19, Atlanta, GA
Agenda
• Motivation
• Common Subexpression Convergence Transformations
• Approach
• Preliminary Results and Discussion
Divergence in SIMT processors
• SIMT (Single Instruction Multiple Threads)
  • All threads in a warp execute the same instruction in parallel
• Divergence
  • A conditional branch dependent on thread-local values
  • Threads in the warp execute different paths
  • Serialized execution of a warp
[Figure: divergent execution of the threads of a warp, indexed by threadIdx.x. Image credits: https://devblogs.nvidia.com/inside-volta]
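The serialization described above can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a model of any real GPU: the warp has 8 threads (real NVIDIA warps have 32), each taken path is issued once for the whole warp, and SIMT efficiency is taken to be the average fraction of active threads per issued instruction. The function name `simt_efficiency` and the path lengths are made up for illustration.

```python
# Toy model of SIMT execution: a warp runs each taken path serially,
# with only the threads that took that path active.
WARP_SIZE = 8  # assumption: small warp for illustration


def simt_efficiency(path_of_thread, path_lengths):
    """Return (issue_slots, efficiency) for one branch.

    path_of_thread: path label taken by each thread of the warp
    path_lengths:   number of instructions on each path
    """
    taken = set(path_of_thread)
    # Each taken path is issued once for the whole warp, one after another.
    issue_slots = sum(path_lengths[p] for p in taken)
    # Useful work: instructions executed by threads that are actually active.
    useful = sum(path_lengths[p] for p in path_of_thread)
    return issue_slots, useful / (issue_slots * len(path_of_thread))


lengths = {"T": 4, "F": 4}

# Divergent case: the branch depends on a thread-local value (tid % 2).
paths = ["T" if tid % 2 == 0 else "F" for tid in range(WARP_SIZE)]
slots, eff = simt_efficiency(paths, lengths)

# Convergent case: every thread takes the same path.
uniform_slots, uniform_eff = simt_efficiency(["T"] * WARP_SIZE, lengths)

print(slots, eff)                  # 8 0.5
print(uniform_slots, uniform_eff)  # 4 1.0
```

In this model the divergent branch doubles the issue slots and halves the efficiency, even though every thread executes the same number of instructions.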
Problem: Serialization of common code
• Divergent Code
• Warp Execution
Hoist
• Move to convergent common ancestor
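A minimal sketch of hoisting, assuming a hypothetical mini-IR in which a branch is a dict with a condition and two straight-line instruction lists, and instructions are plain strings compared textually. The function `hoist_common_prefix` is illustrative only. It skips the legality check a real compiler needs, for example that a hoisted instruction does not write a value the branch condition reads.

```python
def hoist_common_prefix(branch):
    """Move instructions that start both paths to before the branch.

    Returns (hoisted_instructions, remaining_branch).
    """
    then_path, else_path = list(branch["then"]), list(branch["else"])
    hoisted = []
    # Peel off leading instructions while both paths agree.
    while then_path and else_path and then_path[0] == else_path[0]:
        hoisted.append(then_path.pop(0))
        else_path.pop(0)
    rest = {"cond": branch["cond"], "then": then_path, "else": else_path}
    return hoisted, rest


branch = {
    "cond": "tid % 2 == 0",
    "then": ["a = b * c", "x = a + 1"],
    "else": ["a = b * c", "x = a - 1"],
}
hoisted, rest = hoist_common_prefix(branch)
print(hoisted)       # ['a = b * c']
print(rest["then"])  # ['x = a + 1']
print(rest["else"])  # ['x = a - 1']
```

After the transformation, `a = b * c` runs once in the convergent block above the branch instead of once per divergent path.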
Sink
• Move to convergent common successor
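Sinking can be sketched as a before/after pair. The kernels below are hypothetical per-thread functions written in Python rather than CUDA. The multiply cannot be hoisted because it uses `x`, which each path defines differently, so it moves down to the join point instead.

```python
def kernel_before(tid, b, c):
    if tid % 2 == 0:
        x = b + tid
        a = x * c  # common multiply, executed under divergence
    else:
        x = b - tid
        a = x * c  # same multiply, serialized after the other path
    return a


def kernel_after(tid, b, c):
    if tid % 2 == 0:
        x = b + tid
    else:
        x = b - tid
    a = x * c  # sunk to the convergent common successor
    return a


# Both versions compute the same value for every thread of a 32-thread warp.
same = all(kernel_before(t, 3, 5) == kernel_after(t, 3, 5) for t in range(32))
print(same)  # True
```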
Split
• Move to new convergent join point
• Duplicate conditional branch
• Alternative solution: hoist defs/sink uses
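A sketch of Split, using hypothetical per-thread kernels. Here the common instruction sits in the middle of both paths: it uses a value defined earlier on each path and feeds an instruction later on each path, so neither hoist nor sink applies directly. `expensive` is a made-up stand-in for the common instruction.

```python
def expensive(v):
    # Hypothetical stand-in for the common instruction (e.g. a function call).
    return v * v + 1


def kernel_before(tid, b):
    if tid % 2 == 0:
        x = b + tid
        y = expensive(x)  # common code in the middle of the path
        r = y + 1
    else:
        x = b - tid
        y = expensive(x)
        r = y - 1
    return r


def kernel_after(tid, b):
    cond = tid % 2 == 0  # condition evaluated once and reused
    if cond:
        x = b + tid
    else:
        x = b - tid
    y = expensive(x)     # new convergent join point
    if cond:             # duplicated conditional branch
        r = y + 1
    else:
        r = y - 1
    return r


same = all(kernel_before(t, 7) == kernel_after(t, 7) for t in range(32))
print(same)  # True
```

The price is a second branch and smaller basic blocks, which is why this transformation needs a profitability check.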
Operand Renaming
• Insert copy instructions then sink/split
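Operand renaming can be sketched on a hypothetical tuple IR where an instruction is `(dest, opcode, operands)`. The helper `rename_operands` below is illustrative: when two paths perform the same operation on different operands, it inserts copies into shared temporaries so that a single common instruction remains, which can then be sunk or split. The temporary names `t0`, `t1`, ... are assumed not to clash with existing names.

```python
def rename_operands(then_instr, else_instr):
    """Return (then_copies, else_copies, common_instr), or None if the
    two instructions are not the same operation."""
    dest_t, op_t, args_t = then_instr
    dest_e, op_e, args_e = else_instr
    if (dest_t, op_t, len(args_t)) != (dest_e, op_e, len(args_e)):
        return None
    then_copies, else_copies, merged = [], [], []
    for i, (x, y) in enumerate(zip(args_t, args_e)):
        if x == y:
            merged.append(x)  # operand already matches, no copy needed
        else:
            tmp = f"t{i}"     # assumed-fresh temporary name
            then_copies.append((tmp, "copy", (x,)))
            else_copies.append((tmp, "copy", (y,)))
            merged.append(tmp)
    return then_copies, else_copies, (dest_t, op_t, tuple(merged))


# then path: a = b * c      else path: a = d * c
result = rename_operands(("a", "mul", ("b", "c")), ("a", "mul", ("d", "c")))
then_copies, else_copies, common = result
print(then_copies)  # [('t0', 'copy', ('b',))]
print(else_copies)  # [('t0', 'copy', ('d',))]
print(common)       # ('a', 'mul', ('t0', 'c'))
```

Each inserted copy is extra work on the divergent paths, so renaming only pays off when the converged instruction is more expensive than the copies.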
Branches
• Flatten branch, then sink/split
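Recursive CSC
• Bottom-up traversal through the CDG (control dependence graph)
[Figure: control flow graph. The entry block defines c and b and branches on tid%2; one side branches again on tid%3; a=b*c appears on all three resulting paths.]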
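The bottom-up idea can be sketched on a hypothetical nested-branch IR: a block is a list whose items are instruction strings or branch dicts. The function `converge` below is an illustration, not the authors' algorithm. It only hoists textually identical leading instructions and ignores dependences. Recursing into the inner branch first lets `a = b * c` escape the tid%3 branch, after which it also matches across the tid%2 branch.

```python
def converge(block):
    """Bottom-up convergence pass over a nested-branch block."""
    out = []
    for item in block:
        if isinstance(item, str):
            out.append(item)
            continue
        # Recurse first, so inner branches are converged before outer ones.
        then_b = converge(item["then"])
        else_b = converge(item["else"])
        # Hoist instructions that now start both paths.
        while (then_b and else_b and isinstance(then_b[0], str)
               and then_b[0] == else_b[0]):
            out.append(then_b.pop(0))
            else_b.pop(0)
        if then_b or else_b:  # keep the branch only if something remains
            out.append({"cond": item["cond"], "then": then_b, "else": else_b})
    return out


inner = {"cond": "tid % 3", "then": ["a = b * c"], "else": ["a = b * c"]}
program = [
    "c = ...",
    "b = ...",
    {"cond": "tid % 2", "then": [inner], "else": ["a = b * c"]},
]
print(converge(program))  # ['c = ...', 'b = ...', 'a = b * c']
```

A single top-down pass would miss this case, because the outer branch sees a nested branch on one side and a plain instruction on the other.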
Common Loops
• Loop distribution
• Index set splitting
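One plausible reading of index set splitting in this setting, sketched with hypothetical per-thread kernels: when both paths run the same loop body over different trip counts, the iterations common to both can run convergently and only the remainder stays under the branch. The trip counts 4 and 6 and the body `work` are made up for illustration.

```python
def work(i):
    # Hypothetical loop body shared by both paths.
    return i * i


def kernel_before(tid):
    acc = 0
    if tid % 2 == 0:
        for i in range(4):
            acc += work(i)
    else:
        for i in range(6):
            acc += work(i)
    return acc


def kernel_after(tid):
    acc = 0
    for i in range(4):         # iterations common to both paths, now convergent
        acc += work(i)
    if tid % 2 != 0:
        for i in range(4, 6):  # remaining iterations stay divergent
            acc += work(i)
    return acc


same = all(kernel_before(t) == kernel_after(t) for t in range(32))
print(same)  # True
```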
Problem Statement
Given a GPU program, identify and move divergent common code to a convergent region using Hoist/Sink/Split such that dependences are preserved and the benefit of code motion is maximized.
Algorithm
Identifying common code: Dynamic Programming
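The slide does not give the formulation, so the following is an assumption: one natural dynamic program for matching two divergent paths is the longest common subsequence (LCS) of their instruction lists. The function `common_instructions` is illustrative. It compares instructions textually and does not check dependences, which a real pass must do before moving anything.

```python
def common_instructions(then_path, else_path):
    """Return index pairs (i, j) of matched instructions, in order."""
    n, m = len(then_path), len(else_path)
    # lcs[i][j] = LCS length of then_path[i:] and else_path[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if then_path[i] == else_path[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Walk the table from the start to recover the matched pairs.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if then_path[i] == else_path[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


then_path = ["x = b + tid", "y = f(x)", "r = y + 1", "store r"]
else_path = ["x = b - tid", "y = f(x)", "r = y - 1", "store r"]
pairs = common_instructions(then_path, else_path)
print(pairs)  # [(1, 1), (3, 3)]
```

Here `y = f(x)` is a candidate for Split and `store r` is a candidate for Sink.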
Profitability Heuristics
• Benefit:
  • Function Call > Memory Instructions > Math Instructions > Copy Instructions
  • Loop nest depth
• Cost:
  • Copy Instructions for Operand Renaming
  • Increase in register live range and/or stalls with hoist/sink
  • Increase in branches, smaller blocks, more barriers with Split
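A rough sketch of how such a heuristic might be scored. Only the ordering of instruction kinds comes from the list above. The numeric weights, the per-level loop multiplier, and the function `net_benefit` are placeholders invented for illustration, and the sketch leaves out register pressure, stalls, and the extra branches and barriers introduced by Split.

```python
# Hypothetical weights: only the ordering call > memory > math > copy
# is taken from the slide; the numbers themselves are placeholders.
BENEFIT = {"call": 8, "memory": 4, "math": 2, "copy": 1}
LOOP_FACTOR = 10  # assumed multiplier per loop nest level
COPY_COST = 1     # assumed cost of each copy inserted by operand renaming


def net_benefit(kind, loop_depth=0, copies_inserted=0):
    """Estimated gain from converging one common instruction."""
    scale = LOOP_FACTOR ** loop_depth  # deeper loops execute more often
    benefit = BENEFIT[kind] * scale
    cost = COPY_COST * copies_inserted * scale
    return benefit - cost


print(net_benefit("call"))                     # 8
print(net_benefit("math", loop_depth=1))       # 20
print(net_benefit("copy", copies_inserted=2))  # -1
```

With these placeholder numbers, converging a copy instruction at the cost of two new copies has negative benefit, so the transformation would be skipped.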
Experimental Setup
• CUDA
• NVPTX/LLVM
• Nvidia Volta V100
Preliminary Results: Microbenchmarks
[Figure: SIMT efficiency before and after CSC, and speedup, for the microbenchmarks Hoist, Sink, Split, Function, Nested, and Switch.]
Note: nvprof shows major gains due to reduction in global reads of up to 27% with CSC (common address reads/coalesced accesses)
Preliminary Results: Bitonic Sort
[Figure: Run Time for Bitonic Sort (min, max, and avg run time), before and after.]
[Figure: SIMT Efficiency for Bitonic Sort (min, max, and avg efficiency), before and after.]
Discussion and Future Work
• Legality
• CSE and PRE
• Interprocedural analysis
• Opportunity in automatically parallelized programs
• Profile information for divergence, cost, bottlenecks