

  1. Loop Selection for Thread-Level Speculation. Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew. Department of Computer Science & Engineering, University of Minnesota.

  2. Chip Multiprocessors (CMPs)
  • Examples: IBM Power5, Sun Niagara, Intel dual-core Xeon, AMD dual-core Opteron
  [Figure: a CMP with several processors sharing an on-chip cache]
  Goal: improve program performance with parallel threads.

  3. Thread-Level Speculation (TLS)
  Automatic parallelization is difficult:
  • Ambiguous data dependences
  • Complex control flow
  TLS facilitates automatic parallelization by:
  • Executing potentially dependent threads in parallel
  • Preserving data dependences via runtime checking
  Where do we find speculative parallel threads?

  4. Parallelizing Loops under TLS
  Loops are good candidates for parallelization:
  • Regular structure
  • Significant coverage of dynamic execution time
  General-purpose applications are complicated. Facts about SPECINT 2000:
  • Average number of loops per benchmark: 714
  • Average dynamic loop nesting depth: 8
  Loop selection: which loops should be parallelized?

  5. Potential of Loop Selection
  [Chart: program speedup per SPECINT 2000 benchmark when parallelizing outer loops, inner loops, or the best-selected loops]
  Carefully selected loops can improve performance significantly!

  6. Outline
  • Loop selection: algorithm (this section)
  • Parallel performance prediction
  • Dynamic loop behavior
  • Conclusions

  7. Loop Nesting
  Source code:

    main( ) {
      while ( condition1 ) {        /* main_loop1 */
        while ( condition2 ) {      /* main_loop2 */
          foo( );
          goo( );
        }
      }
    }

    foo( ) {
      while ( condition3 ) {        /* foo_loop1 */
        goo( );
      }
    }

    goo( ) {
      while ( condition4 ) {        /* goo_loop1 */
      }
    }

  Loop graph: each node is a static loop (main_loop1, main_loop2, foo_loop1, goo_loop1); each edge is a nesting relationship. main_loop1 encloses main_loop2, which reaches foo_loop1 and goo_loop1 through calls; foo_loop1 also reaches goo_loop1.
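  To make the nesting relationship concrete, here is a minimal C sketch of a loop graph for the program above. The node names mirror the slide, but the data layout (fixed-size child arrays, a recursive reachability check) is an illustrative assumption rather than the paper's implementation, and it assumes the graph is acyclic.

    /* Minimal sketch of a loop graph for the example program (illustrative, not the paper's data structure). */
    #include <stdbool.h>
    #include <stdio.h>

    enum { MAIN_LOOP1, MAIN_LOOP2, FOO_LOOP1, GOO_LOOP1, NUM_LOOPS };

    typedef struct {
        const char *name;
        int children[NUM_LOOPS];   /* loops nested directly inside this one (lexically or via a call) */
        int num_children;
    } LoopNode;

    static LoopNode loops[NUM_LOOPS] = {
        [MAIN_LOOP1] = { "main_loop1", { MAIN_LOOP2 }, 1 },
        [MAIN_LOOP2] = { "main_loop2", { FOO_LOOP1, GOO_LOOP1 }, 2 },
        [FOO_LOOP1]  = { "foo_loop1",  { GOO_LOOP1 }, 1 },
        [GOO_LOOP1]  = { "goo_loop1",  { 0 }, 0 },
    };

    /* True if loop `inner` is reachable from loop `outer` in the (acyclic) loop graph. */
    static bool encloses(int outer, int inner) {
        if (outer == inner) return true;
        for (int i = 0; i < loops[outer].num_children; i++)
            if (encloses(loops[outer].children[i], inner)) return true;
        return false;
    }

    /* Two loops have a nesting relationship if either encloses the other. */
    static bool nested(int a, int b) {
        return encloses(a, b) || encloses(b, a);
    }

    int main(void) {
        printf("main_loop1 / goo_loop1 nested? %s\n", nested(MAIN_LOOP1, GOO_LOOP1) ? "yes" : "no");
        return 0;
    }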

  8. Benefit of Parallelizing a Single Loop

  Loop          Coverage   Loop Speedup   Benefit
  main_loop1       80%         1.2          13%
  main_loop2       70%         1.4          20%
  foo_loop1        30%         1.2           5%
  goo_loop1        50%         1.6          18%

  benefit = % of program execution time saved
          = coverage × (1 – 1 / loop speedup)
  Program speedup = 1 / (1 – benefit)
  For example, selecting main_loop2 alone (benefit = 20%) gives a program speedup of 1 / (1 – 0.2) = 1.25.
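  The two formulas translate directly into code. The sketch below simply evaluates them for main_loop2's numbers from the table; the function names are ours, not the paper's.

    /* Minimal sketch of the benefit formula:
       benefit = coverage * (1 - 1/loop_speedup), program speedup = 1 / (1 - benefit). */
    #include <stdio.h>

    static double benefit(double coverage, double loop_speedup) {
        return coverage * (1.0 - 1.0 / loop_speedup);
    }

    static double program_speedup(double total_benefit) {
        return 1.0 / (1.0 - total_benefit);
    }

    int main(void) {
        double b = benefit(0.70, 1.4);   /* main_loop2: coverage 70%, loop speedup 1.4 -> benefit 0.20 */
        printf("benefit = %.2f, program speedup = %.2f\n", b, program_speedup(b));
        return 0;
    }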

  9. Loop Selection: Problem Definition
  Goal: select the set of loops that maximizes overall program performance when parallelized.
  Constraint: the set cannot contain loops that have a nesting relationship with each other.
  Loop selection is NP-complete: it corresponds to the weighted maximum independent set problem.

  10. Loop Selection: Algorithm
  • Exhaustive search (≤ 50 nodes): try all possible combinations of loops
  • Greedy algorithm (> 50 nodes): visit loops in descending order of benefit, check each for a nesting relationship with the loops already selected, and add it to the set only if there is none
  Average number of loops for SPECINT 2000: 714. A sketch of the greedy step follows below.
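  The sketch below implements the greedy step under stated assumptions: loops are pre-sorted by descending benefit, and a precomputed table records nesting relationships. It reuses the benefits from slide 8; since every pair of loops in that example is nested, the greedy picks only main_loop2 and reproduces the 1.25 program speedup.

    /* Greedy loop selection over the slide-8 example (illustrative data layout). */
    #include <stdbool.h>
    #include <stdio.h>

    #define N 4

    static const char  *name[N]    = { "main_loop2", "goo_loop1", "main_loop1", "foo_loop1" };
    static const double benefit[N] = { 0.20, 0.18, 0.13, 0.05 };   /* sorted in descending order */

    /* nested[i][j]: do loops i and j have a nesting relationship?
       In this example every pair of loops is nested. */
    static const bool nested[N][N] = {
        { true, true, true, true },
        { true, true, true, true },
        { true, true, true, true },
        { true, true, true, true },
    };

    int main(void) {
        bool selected[N] = { false };
        double total_benefit = 0.0;

        for (int i = 0; i < N; i++) {                    /* descending order of benefit */
            bool conflict = false;
            for (int j = 0; j < N && !conflict; j++)
                if (selected[j] && nested[i][j])
                    conflict = true;                     /* nests with an already-selected loop */
            if (!conflict) {
                selected[i] = true;
                total_benefit += benefit[i];
                printf("select %s (benefit %.2f)\n", name[i], benefit[i]);
            }
        }
        printf("program speedup = %.2f\n", 1.0 / (1.0 - total_benefit));
        return 0;
    }

  Exhaustive search would instead enumerate every subset of loops with no nesting conflicts and keep the one with the largest total benefit, which is only practical for small loop graphs.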

  11. Loop Pruning
  [Figure: an example loop graph of eight loops (loop1 through loop8) before and after pruning]
  With pruning, only gcc and parser still need the greedy algorithm; the remaining benchmarks fall within the reach of exhaustive search.

  12. Benefit of Parallelizing a Single Loop

  Loop          Coverage   Speedup   Benefit
  main_loop1       80%       1.2       13%
  main_loop2       70%       1.4       20%
  foo_loop1        30%       1.2        5%
  goo_loop1        50%       1.6       18%

  How can we estimate the loop speedup?

  13. Outline
  • Loop selection: algorithm
  • Parallel performance prediction (this section)
  • Dynamic loop behavior
  • Conclusions

  14. Estimating Parallel Performance
  Communicating values between speculative threads adds significant overhead to parallel execution:
  • Synchronization resolves frequently occurring data dependences
  • Speculation resolves infrequently occurring data dependences
  The compiler estimates these communication costs.
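  As a rough illustration of that split (not the paper's cost model), a compiler pass might classify each cross-thread dependence by how often it actually occurs; the probability field and the 0.2 threshold below are arbitrary assumptions.

    /* Illustrative choice between synchronization and speculation per dependence. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        double occurrence_prob;   /* how often the cross-thread dependence actually occurs */
    } DataDependence;

    static const char *handle(const DataDependence *d) {
        /* frequent dependences are synchronized, infrequent ones are speculated on */
        return d->occurrence_prob > 0.2 ? "synchronize" : "speculate";
    }

    int main(void) {
        DataDependence deps[] = { { "frequent update", 0.95 }, { "rare aliasing store", 0.03 } };
        for (int i = 0; i < 2; i++)
            printf("%s -> %s\n", deps[i].name, handle(&deps[i]));
        return 0;
    }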

  15. Cost of Mis-speculation
  [Figure: thread T2 speculatively loads a value that thread T1 later stores; the violation squashes T2, wasting the work it has done]
  Cost of mis-speculation = amount of work wasted × probability of mis-speculation

  16. Cost of Mis-speculation (continued)
  [Figure: the same scenario, additionally marking the sequential part of the execution]
  Cost of mis-speculation = amount of work wasted × probability of mis-speculation
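  A minimal sketch of this estimate, assuming the compiler has, for each potentially violated dependence, an estimate of the work wasted on a squash and the probability that the dependence occurs at run time. Summing per-dependence costs and the numbers used are illustrative assumptions, not the paper's exact model.

    /* Illustrative mis-speculation cost estimate. */
    #include <stdio.h>

    typedef struct {
        double work_wasted;      /* cycles of speculative work discarded on a squash */
        double prob_violation;   /* probability that the dependence occurs at run time */
    } Dependence;

    static double misspeculation_cost(const Dependence *deps, int n) {
        double cost = 0.0;
        for (int i = 0; i < n; i++)
            cost += deps[i].work_wasted * deps[i].prob_violation;   /* work wasted x probability */
        return cost;
    }

    int main(void) {
        Dependence deps[] = { { 400.0, 0.05 }, { 120.0, 0.30 } };   /* hypothetical numbers */
        printf("estimated mis-speculation cost: %.1f cycles\n", misspeculation_cost(deps, 2));
        return 0;
    }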

  17. Synchronization
  [Figure: thread T1 stores a value that thread T2 needs; T2's load waits until T1's store forwards the value]
  Synchronization serializes parallel execution.

  18. Cost of Synchronization
  Three ways to estimate the cost of synchronization:
  • Est. I: synchronization cost = number of dependent instructions
  • Est. II: synchronization cost = longest stall
  • Est. III: synchronization cost = longest stall, computed from the dependent instructions
  [Figure: for each estimate, a timeline of threads T1 and T2 showing loads in T2 stalling on stores in T1]
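  The sketch below contrasts the first two estimates under assumed semantics: Est. I counts the instructions involved in forwarded dependences, while Est. II takes the longest stall among them. The struct fields, the aggregation (a sum of counts versus a maximum stall), and the numbers are assumptions for illustration, not the paper's model.

    /* Illustrative comparison of two synchronization-cost estimates. */
    #include <stdio.h>

    typedef struct {
        int dependent_insts;   /* instructions on the value-forwarding path for this dependence */
        int stall_cycles;      /* cycles the consumer thread waits for the forwarded value */
    } SyncDep;

    static int est1_cost(const SyncDep *d, int n) {   /* Est. I: # of dependent instructions */
        int total = 0;
        for (int i = 0; i < n; i++) total += d[i].dependent_insts;
        return total;
    }

    static int est2_cost(const SyncDep *d, int n) {   /* Est. II: longest stall */
        int longest = 0;
        for (int i = 0; i < n; i++)
            if (d[i].stall_cycles > longest) longest = d[i].stall_cycles;
        return longest;
    }

    int main(void) {
        SyncDep deps[] = { { 12, 30 }, { 4, 45 } };   /* hypothetical numbers */
        printf("Est. I = %d, Est. II = %d\n", est1_cost(deps, 2), est2_cost(deps, 2));
        return 0;
    }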

  19. Experimental Framework
  Machine model:
  • 4 single-issue in-order processors
  • Private L1 data cache (32 KB, 2-way, 1-cycle latency)
  • Shared L2 data cache (2 MB, 4-way, 10-cycle latency)
  • Speculation support (write buffer, address buffer)
  • Synchronization support (communication buffer, 10-cycle latency)
  Compiler:
  • Optimizations using ORC 2.1
  • Instruction scheduling to improve parallelism

  20. Comparison: Speedup Estimation Techniques
  [Charts: program speedup and coverage per SPECINT 2000 benchmark (mcf, crafty, twolf, gzip, bzip2, vortex, vpr, parser, gap, gcc, perlbmk) for Est. I, Est. II, Est. III, and a perfect estimator]
  Average program speedup: 20%; average coverage: 70%.

  21. Outline
  • Loop selection: algorithm
  • Parallel performance prediction
  • Dynamic loop behavior (this section)
  • Conclusions

  22. Loop Behavior May Change
  Source code: the same program as on slide 7.
  Loop tree: unlike the loop graph, goo_loop1 appears as two nodes, goo_loop1_A and goo_loop1_B, one for each calling context in which it is reached (directly from main_loop2, or through foo_loop1).
  Calling context of a loop: the path from the root of the loop tree to that loop.

  23. Loop Selection in a Tree
  Benefits in the loop tree: main_loop1 13%, main_loop2 20%, foo_loop1 5%, goo_loop1_A -2%, goo_loop1_B 18%.
  Because one context has a negative benefit, goo_loop1 is parallelized only when it is reached from main_loop2. Loop cloning can be used to give each calling context its own copy of the loop, as sketched below.
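  A minimal sketch of what loop cloning could look like for the example program, assuming goo_loop1 is profitable only when reached from main_loop2. The clone names and the stubbed loop conditions are illustrative, and the speculative execution itself is left abstract.

    #include <stdbool.h>

    /* Stub loop conditions so the sketch compiles; in the real program they are
       the while-loop tests from slide 7. */
    static bool condition1(void) { return false; }
    static bool condition2(void) { return false; }
    static bool condition3(void) { return false; }
    static bool condition4(void) { return false; }

    /* Clone kept sequential for the context reached through foo_loop1 (negative benefit). */
    static void goo_sequential(void) {
        while (condition4()) {
            /* original loop body */
        }
    }

    /* Clone selected for TLS in the context reached from main_loop2 (positive benefit).
       Spawning of speculative threads is omitted. */
    static void goo_parallel(void) {
        while (condition4()) {
            /* loop body executed as speculative parallel threads */
        }
    }

    static void foo(void) {
        while (condition3()) {       /* foo_loop1 */
            goo_sequential();
        }
    }

    int main(void) {
        while (condition1()) {       /* main_loop1 */
            while (condition2()) {   /* main_loop2 */
                foo();
                goo_parallel();
            }
        }
        return 0;
    }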

  24. Loop Behavior May Change
  [Figure: successive invocations of foo_loop1 and the two contexts of goo_loop1 under main_loop1; only some invocations are worth parallelizing]
  A loop's behavior can also vary across invocations, so ideally the loop is parallelized only in the invocations where doing so pays off: exploit loop behavior dynamically.
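  One possible way to act on this at run time is sketched below: monitor recent speculative invocations of a loop and fall back to sequential execution when squashes are frequent. This adaptive policy, its threshold, and its names are assumptions for illustration, not the mechanism evaluated in the paper.

    /* Illustrative per-loop monitor for deciding whether to speculate on the next invocation. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        int speculative_runs;   /* invocations attempted under TLS */
        int squashed_runs;      /* invocations that suffered frequent violations */
    } LoopMonitor;

    /* Parallelize while history is short, then only if fewer than half of the runs were squashed. */
    static bool should_parallelize(const LoopMonitor *m) {
        if (m->speculative_runs < 8)
            return true;
        return m->squashed_runs * 2 < m->speculative_runs;
    }

    static void record_invocation(LoopMonitor *m, bool squashed) {
        m->speculative_runs++;
        if (squashed) m->squashed_runs++;
    }

    int main(void) {
        LoopMonitor m = { 0, 0 };
        for (int i = 0; i < 10; i++) {
            if (should_parallelize(&m))
                record_invocation(&m, i < 6);   /* pretend the first six speculative runs were squashed */
        }
        printf("parallelize next invocation? %s\n", should_parallelize(&m) ? "yes" : "no");
        return 0;
    }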

  25. Potential of Exploiting Dynamic Behavior
  [Charts: program speedup and coverage per SPECINT 2000 benchmark when loops are selected with no context, with calling context, and by an oracle]
  5 out of 11 benchmarks show performance potential.

  26. Conclusions
  Loop selection is important for TLS:
  • Compiler-based loop selection achieves a 20% program speedup with 70% coverage
  • Exploiting dynamic loop behavior offers further performance potential

  27. Thank You!
