Common Subexpression Convergence (CSC), Sana Damani and Vivek Sarkar

Common Subexpression Convergence (CSC). Sana Damani and Vivek Sarkar, Habanero Extreme Scale Software Research Lab, Georgia Institute of Technology. Short paper at LCPC ’19, Atlanta, GA.


  1. Common Subexpression Convergence (CSC)
     Sana Damani and Vivek Sarkar
     Habanero Extreme Scale Software Research Lab, Georgia Institute of Technology
     Short paper at LCPC ’19, Atlanta, GA

  2. Agenda
     • Motivation
     • Common Subexpression Convergence Transformations
     • Approach
     • Preliminary Results and Discussion

  4. Divergence in SIMT processors
     • SIMT (Single Instruction, Multiple Threads)
       • All threads in a warp execute the same instruction in parallel
     • Divergence
       • A conditional branch dependent on thread-local values
       • Threads in the warp execute different paths
       • Serialized execution of a warp
     [Figure: divergent execution of a warp, with threads labeled by threadIdx.x (0 to 7) and execution steps numbered (1) to (5). Image credits: https://devblogs.nvidia.com/inside-volta]

  5. Problem: Serialization of common code
     • Divergent Code
     • Warp Execution
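
A minimal Python sketch of the serialization problem on slides 4 and 5. The cost model is an assumption made for illustration, not a description of the hardware: threads that took the same path run together, distinct paths run one after another, and each instruction costs one issue slot. The name `simulate_warp` and the instruction strings are hypothetical.

```python
from collections import OrderedDict


def simulate_warp(paths):
    """paths[i] is the list of instructions thread i executes after a branch.

    Returns (issue_slots, simt_efficiency) under the toy model above.
    """
    # Group threads by the exact instruction path they follow.
    groups = OrderedDict()
    for tid, path in enumerate(paths):
        groups.setdefault(tuple(path), []).append(tid)

    warp_size = len(paths)
    issue_slots = 0        # instructions issued by the warp
    active_lane_slots = 0  # lanes doing useful work, summed over all slots
    for path, tids in groups.items():
        issue_slots += len(path)
        active_lane_slots += len(path) * len(tids)

    if issue_slots == 0:
        return 0, 1.0
    return issue_slots, active_lane_slots / (issue_slots * warp_size)


# Toy warp of 8 threads: even thread ids take one path, odd ids the other.
then_path = ["x = a[tid]", "y = x * 2"]
else_path = ["x = a[tid]", "y = x + 1"]
paths = [then_path if tid % 2 == 0 else else_path for tid in range(8)]

slots, eff = simulate_warp(paths)
print(slots, eff)
```

In this toy warp the load `x = a[tid]` appears on both paths, so it is issued twice with half the lanes active each time. Moving it to a convergent region would issue it once with all lanes active.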

  7. Hoist
     • Move to convergent common ancestor

  8. Sink
     • Move to convergent common successor
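
Hoist and sink (slides 7 and 8) can be sketched on a single if/else whose arms are straight-line instruction lists. This sketch only moves a common prefix up and a common suffix down, which sidesteps the dependence analysis a real pass would need. It also assumes the branch condition is computed before the hoisted code. The name `hoist_and_sink` and the instruction strings are hypothetical.

```python
def hoist_and_sink(then_block, else_block):
    """Return (hoisted, then_rest, else_rest, sunk).

    hoisted: common leading instructions, moved above the branch
    sunk:    common trailing instructions, moved below the join
    """
    # Hoist: longest common prefix of the two arms.
    n = 0
    while (n < min(len(then_block), len(else_block))
           and then_block[n] == else_block[n]):
        n += 1
    hoisted = then_block[:n]
    then_rest, else_rest = then_block[n:], else_block[n:]

    # Sink: longest common suffix of what is left.
    m = 0
    while (m < min(len(then_rest), len(else_rest))
           and then_rest[-1 - m] == else_rest[-1 - m]):
        m += 1
    sunk = then_rest[len(then_rest) - m:]
    then_rest = then_rest[:len(then_rest) - m]
    else_rest = else_rest[:len(else_rest) - m]
    return hoisted, then_rest, else_rest, sunk


then_block = ["t = a[tid]", "u = t * 2", "out[tid] = u"]
else_block = ["t = a[tid]", "u = t + 1", "out[tid] = u"]
hoisted, then_rest, else_rest, sunk = hoist_and_sink(then_block, else_block)
print(hoisted, then_rest, else_rest, sunk)
```

Only the two arithmetic instructions stay divergent here; the load and the store each run once for the whole warp.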

  9. Split
     • Move to new convergent join point
     • Duplicate conditional branch
     • Alternative solution: hoist defs/sink uses
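
A small sketch of the split transformation, assuming the common instruction sits in the middle of both arms: its operand is defined differently on each path (so it cannot be hoisted) and its result is used differently afterwards (so it cannot be sunk). The region before it and the region after it each get their own copy of the conditional branch. The name `split_at_common` and the instruction strings are hypothetical.

```python
def split_at_common(then_block, else_block, common):
    """Split an if/else around an instruction both arms share.

    Returns (first, common, second), where first and second are
    (then_part, else_part) pairs, each guarded by a copy of the branch.
    """
    i = then_block.index(common)
    j = else_block.index(common)
    first = (then_block[:i], else_block[:j])
    second = (then_block[i + 1:], else_block[j + 1:])
    return first, common, second


then_block = ["p = f(tid)", "v = load(p)", "r = v * 2"]
else_block = ["p = g(tid)", "v = load(p)", "r = v + 1"]
first, common, second = split_at_common(then_block, else_block, "v = load(p)")

# Layout after the split:
#   if (cond) first[0] else first[1]     (divergent)
#   common                               (new convergent join point)
#   if (cond) second[0] else second[1]   (divergent, duplicated branch)
slots_before = len(then_block) + len(else_block)
slots_after = (len(first[0]) + len(first[1]) + 1
               + len(second[0]) + len(second[1]))
print(first, common, second, slots_before, slots_after)
```

The slot counts ignore the cost of the duplicated branch itself, which is one of the costs the profitability slide lists for Split.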

  10. Operand Renaming
      • Insert copy instructions, then sink/split
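
One way to picture operand renaming: two instructions that differ only in operand names become identical once each arm first copies its operand into a shared temporary. The tuple encoding `(dest, op, operands)`, the function `rename_operands`, and the `tmp` names are all hypothetical and chosen only for this sketch.

```python
def rename_operands(then_inst, else_inst, tmp_prefix="tmp"):
    """Make two instructions identical by inserting copies.

    Each instruction is (dest, op, operands). Returns
    (then_copies, else_copies, common_inst), or None when the
    instructions differ in more than their operand names.
    """
    dest_t, op_t, args_t = then_inst
    dest_e, op_e, args_e = else_inst
    if (dest_t, op_t, len(args_t)) != (dest_e, op_e, len(args_e)):
        return None

    then_copies, else_copies, common_args = [], [], []
    for k, (a, b) in enumerate(zip(args_t, args_e)):
        if a == b:
            common_args.append(a)
        else:
            tmp = f"{tmp_prefix}{k}"
            then_copies.append((tmp, "copy", (a,)))  # stays in the then arm
            else_copies.append((tmp, "copy", (b,)))  # stays in the else arm
            common_args.append(tmp)
    return then_copies, else_copies, (dest_t, op_t, tuple(common_args))


result = rename_operands(("r", "load", ("p",)), ("r", "load", ("q",)))
print(result)
```

After renaming, the copies remain divergent but the load is common, so it becomes a candidate for sink or split.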

  11. Branches
      • Flatten branch, then sink/split

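  12. Recursive CSC
      • Bottom-Up Traversal Through CDG
      [Figure: control flow graph. The entry block defines c = ... and b = ..., then branches on tid%2. One arm branches again on tid%3 with a=b*c on both its F and T sides; the other arm of tid%2 also contains a=b*c.]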
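
The nested example on slide 12 can be sketched with a recursive, bottom-up pass over a tiny program tree. This sketch uses sinking of common trailing instructions only, and it deletes a branch whose arms become empty, which assumes the condition has no side effects. The tree encoding, the function `converge`, and which arm of `tid % 2` holds the nested branch are assumptions for illustration.

```python
def converge(region):
    """Bottom-up CSC sketch.

    A region is a list whose items are instruction strings or
    ("if", cond, then_region, else_region) tuples.
    """
    out = []
    for item in region:
        if not isinstance(item, tuple):
            out.append(item)
            continue
        _, cond, then_r, else_r = item
        # Handle inner branches first so their common code surfaces here.
        then_r, else_r = converge(then_r), converge(else_r)
        sunk = []
        while (then_r and else_r and isinstance(then_r[-1], str)
               and then_r[-1] == else_r[-1]):
            sunk.insert(0, then_r.pop())
            else_r.pop()
        if then_r or else_r:
            out.append(("if", cond, then_r, else_r))
        out.extend(sunk)
    return out


program = [
    "c = ...",
    "b = ...",
    ("if", "tid % 2",
     [("if", "tid % 3", ["a = b * c"], ["a = b * c"])],
     ["a = b * c"]),
]
print(converge(program))
```

The inner `tid % 3` branch is resolved first, which exposes `a = b * c` as common code of the outer `tid % 2` branch on the way back up.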

  13. Common Loops
      • Loop distribution
      • Index set splitting
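
The slide does not say how these loop transformations are applied, so the following is one possible reading of index set splitting, not necessarily the authors' formulation. Two divergent arms run the same loop body with different trip counts `n` and `m`. The first `min(n, m)` iterations are common and can run convergently, leaving only the remainder divergent. Loop distribution is not shown, and all names are hypothetical.

```python
def run_original(take_then, n, m, x):
    # Both arms run the same body, but for different trip counts.
    if take_then:
        for i in range(n):
            x = x * 2 + i
    else:
        for i in range(m):
            x = x * 2 + i
    return x


def run_split(take_then, n, m, x):
    common = min(n, m)
    # Convergent part: every thread runs these iterations together.
    for i in range(common):
        x = x * 2 + i
    # Divergent remainder: at most one arm has iterations left.
    if take_then:
        for i in range(common, n):
            x = x * 2 + i
    else:
        for i in range(common, m):
            x = x * 2 + i
    return x


print(run_original(True, 5, 3, 1), run_split(True, 5, 3, 1))
```

Each thread computes the same value either way; what changes is that the shared iterations are no longer serialized across the two arms.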

  15. Problem Statement
      Given a GPU program, identify and move divergent common code to a convergent region using Hoist/Sink/Split such that dependences are preserved and the benefit of code motion is maximized.

  16. Algorithm

  17. Identifying common code: Dynamic Programming
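
The slide names dynamic programming but does not give the recurrence. As a stand-in, here is the classic longest-common-subsequence DP applied to the instruction lists of two divergent arms. It is a substitute chosen for illustration, not the authors' algorithm, and it ignores dependences and profitability. The name `common_instructions` is hypothetical.

```python
def common_instructions(then_block, else_block):
    """Return index pairs (i, j) of matched instructions, in order."""
    n, m = len(then_block), len(else_block)
    # dp[i][j] = length of the LCS of then_block[i:] and else_block[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if then_block[i] == else_block[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table forward to recover one optimal matching.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if then_block[i] == else_block[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


then_block = ["load", "mul", "store"]
else_block = ["load", "add", "store"]
pairs = common_instructions(then_block, else_block)
print(pairs)
```

Each matched pair is a candidate for hoist, sink, or split; deciding which one applies needs the dependence and profitability checks that this sketch leaves out.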

  18. Profitability Heuristics
      • Benefit:
        • Function Call > Memory Instructions > Math Instructions > Copy Instructions
        • Loop nest depth
      • Cost:
        • Copy Instructions for Operand Renaming
        • Increase in register live range and/or stalls with hoist/sink
        • Increase in branches, smaller blocks, more barriers with Split
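
A toy scoring function for the heuristics above. Only the ordering call > memory > math > copy and the idea that loop nest depth raises the benefit come from the slide. Every numeric weight is a hypothetical placeholder, and the register live range and stall costs are not modeled.

```python
# Only the ordering of these weights comes from the slide; the values are made up.
BENEFIT_WEIGHT = {"call": 8, "memory": 4, "math": 2, "copy": 1}
COPY_COST = 1     # hypothetical cost per copy inserted by operand renaming
BRANCH_COST = 2   # hypothetical cost per branch duplicated by Split
LOOP_FACTOR = 10  # hypothetical weight per level of loop nesting


def net_benefit(kind, loop_depth=0, copies=0, extra_branches=0):
    """Score one candidate instruction; a positive score suggests moving it."""
    saved = BENEFIT_WEIGHT[kind]
    cost = copies * COPY_COST + extra_branches * BRANCH_COST
    return (saved - cost) * LOOP_FACTOR ** loop_depth


candidates = {
    "sink a call": net_benefit("call"),
    "split a load in a loop, one copy": net_benefit("memory", loop_depth=1, copies=1),
    "rename and sink a copy": net_benefit("copy", copies=1),
    "split a math op": net_benefit("math", extra_branches=1),
}
for name, score in candidates.items():
    print(f"{name}: {score}")
```

With these placeholder weights, moving a copy that itself needs a new copy scores zero, as does splitting around a single math instruction.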

  20. Experimental Setup
      CUDA, NVPTX/LLVM, Nvidia Volta V100

  21. Preliminary Results: Microbenchmarks
      [Figure: bar chart of SIMT efficiency before and after (axis 0% to 120%) and speedup (axis 0 to 12) for the microbenchmarks Hoist, Sink, Split, Function, Nested, and Switch.]
      Note: nvprof shows major gains due to a reduction in global reads of up to 27% with CSC (common address reads/coalesced accesses)

  22. Preliminary Results: Bitonic Sort
      [Figure: Run Time for Bitonic Sort. Min, max, and avg run time, before and after (axis 0 to 500).]
      [Figure: SIMT Efficiency for Bitonic Sort. Min, max, and avg efficiency, before and after (axis 0.00% to 120.00%).]

  23. Discussion and Future Work
      • Legality
      • CSE and PRE
      • Interprocedural analysis
      • Opportunity in automatically parallelized programs
      • Profile information for divergence, cost, bottlenecks
