TSLP Throttling Automatic Vectorization: When Less is More

Vasileios Porpodas and Timothy M. Jones, University of Cambridge
LLVM Developers Meeting 2015
www.cl.cam.ac.uk/~vp331/

Why SIMD Vectorization? Scalar Reg. File


  1. SLP not profitable for whole graph (slide 6 of 16)

     A[i]   = B[i]   + (C[2*i]*(D[2*i]+(E[2*i]*C[2*i])))

     A[i+1] = B[i+1] + (C[3*i]*(D[3*i]+(E[3*i]*C[3*i])))

     [Figure: the SLP graph, built up one instruction group at a time starting from the two stores (S). Cost per group: stores −1, adds −1, loads of B −1, multiplies −1, adds −1, multiplies −1, then three groups of loads (L) that each need two extra instructions (marked i, +1 apiece).]

     Running total: −1, −2, −3, −4, −5, −6, −4, −2, 0.

     Total Cost: 0. Unprofitable!

  10. TSLP removes unprofitable region (slide 7 of 16)

      SLP: the whole graph is vectorized. Total Cost: 0. Unprofitable!

      TSLP: the graph is cut (TSLP CUT) so that only the stores, the adds that feed them and the loads of B stay vectorized (−1 each). Everything above the cut stays scalar, and the scalar multiply results are brought into a vector with two extra instructions (marked i, +1 apiece). Total Cost: −1. Profitable!
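The arithmetic on slides 6 and 7 can be replayed with a small sketch. The per-group costs are read off the figures; the group labels, the helper names and the threshold of 0 are assumptions made for illustration only.

```python
# Per-group cost deltas read off slides 6 and 7 (negative = saving).
# The labels are made up for this sketch; only the numbers come from the slides.
FULL_GRAPH = [
    ("stores", -1),
    ("adds feeding the stores", -1),
    ("loads of B", -1),
    ("multiplies", -1),
    ("adds", -1),
    ("multiplies", -1),
    ("loads needing extra instructions", +2),
    ("loads needing extra instructions", +2),
    ("loads needing extra instructions", +2),
]

# After the TSLP cut only the three groups nearest the stores stay vectorized,
# and the scalar multiply results cost two extra instructions (+1 each).
THROTTLED_GRAPH = [
    ("stores", -1),
    ("adds feeding the stores", -1),
    ("loads of B", -1),
    ("extra instructions for scalar multiply results", +2),
]


def total_cost(groups):
    """Sum the per-group cost deltas."""
    return sum(cost for _, cost in groups)


def is_profitable(groups, threshold=0):
    """Assumed rule: vectorize only if the total cost is below the threshold."""
    return total_cost(groups) < threshold


print(total_cost(FULL_GRAPH), is_profitable(FULL_GRAPH))            # 0 False
print(total_cost(THROTTLED_GRAPH), is_profitable(THROTTLED_GRAPH))  # -1 True
```

Dropping the groups that add cost is what turns the break-even graph into a profitable one.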

  14. TSLP Algorithm (slide 8 of 16)

      Input: Scalar IR

      1. Find seed instructions for vectorization
      2. Generate the SLP graph
      3. Calculate all valid cuts
      4. Throttle (cut) the SLP graph
      5. Calculate cost of vectorization
      6. Save cut with best cost
      7. Tried all cuts? NO: try the next cut. YES: continue.
      8. cost < threshold? NO: DONE. YES: continue.
      9. Replace scalars with vectors

      DONE

      Annotations on the flowchart:

      • Extension to SLP
      • Try out many cuts
      • Keep best cut
      • Vanilla SLP
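The control flow of steps 4 to 8 can be sketched as a plain loop. This is a minimal illustration, not the LLVM implementation: the function name `throttle`, the representation of a cut as a frozenset of node names, the cost callback and the default threshold of 0 are all assumptions.

```python
from typing import Callable, Iterable, Optional, Tuple

Cut = frozenset  # assumption: a cut is the set of graph nodes kept vectorized


def throttle(
    cuts: Iterable[Cut],
    cost_of: Callable[[Cut], int],
    threshold: int = 0,
) -> Optional[Tuple[Cut, int]]:
    """Return the cheapest cut and its cost, or None if nothing is profitable."""
    best = None
    for cut in cuts:                       # steps 4 and 7: try every cut
        cost = cost_of(cut)                # step 5: cost of this cut
        if best is None or cost < best[1]:
            best = (cut, cost)             # step 6: remember the best so far
    if best is not None and best[1] < threshold:   # step 8
        return best                        # the caller would perform step 9
    return None                            # not profitable: code stays scalar


# Toy costs for three hypothetical cuts.
toy_costs = {
    frozenset({"S"}): 1,
    frozenset({"S", "+"}): -1,
    frozenset({"S", "+", "*"}): 0,
}
result = throttle(toy_costs, toy_costs.__getitem__)
```

With the toy costs above, `result` holds the cut `{"S", "+"}` with cost -1; if no cut had a cost below the threshold the function would return `None`.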

  25. Cost calculation example (slide 9 of 16)

      TotalCost = V + S + G − Scalar, where V counts the vector instructions, S the instructions left scalar, G the gather instructions, and Scalar is the cost of the all-scalar code (18 in this example).

      [Figure: the example SLP graph with the candidate cuts cut0 to cut6 drawn across it; SCALAR and VECTOR label the two sides of the TSLP cut.]

      no cut (SLP):  6 +  6 + 6 − 18 =  0
      cut6:          5 +  8 + 6 − 18 = +1
      cut5:          4 + 10 + 4 − 18 =  0
      cut4 (TSLP):   3 + 12 + 2 − 18 = −1
      cut3:          2 + 14 + 4 − 18 = +2
      cut2:          5 +  8 + 8 − 18 = +3
      cut1:          1 + 16 + 2 − 18 = +1
      cut0:          0 + 18 + 0 − 18 =  0
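The table on slide 9 can be recomputed directly from the formula TotalCost = V + S + G − Scalar. The numbers below are copied from the slide; the `CutCost` type and the function name are hypothetical helpers for this sketch.

```python
from typing import NamedTuple


class CutCost(NamedTuple):
    name: str
    vector: int   # V
    scalar: int   # S
    gather: int   # G


SCALAR_BASELINE = 18  # the "- Scalar" term on the slide

CUTS = [
    CutCost("cut0", 0, 18, 0),
    CutCost("cut1", 1, 16, 2),
    CutCost("cut2", 5, 8, 8),
    CutCost("cut3", 2, 14, 4),
    CutCost("cut4", 3, 12, 2),
    CutCost("cut5", 4, 10, 4),
    CutCost("cut6", 5, 8, 6),
    CutCost("no cut (SLP)", 6, 6, 6),
]


def cut_total_cost(cut: CutCost, baseline: int = SCALAR_BASELINE) -> int:
    """TotalCost = V + S + G - Scalar."""
    return cut.vector + cut.scalar + cut.gather - baseline


best = min(CUTS, key=cut_total_cost)
print(best.name, cut_total_cost(best))  # prints "cut4 -1"
```

Only cut4 ends up below zero, which is why the slide marks it as the TSLP choice while the uncut SLP graph merely breaks even.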

  36. Subgraph (Cuts) Generation Algorithm (slide 10 of 16)

      [Figure: starting from the root store (S), the candidate subgraphs are grown step by step by attaching neighboring nodes (+, *, L).]

      • Only connected subgraphs that include the root
      • Worst time complexity O(2^B x N) (N=Nodes, B=Neighbors)
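One way to enumerate every connected subgraph that includes the root is a breadth-first growth over node sets, sketched below. The toy graph, the node names and the function name are assumptions for illustration; the slide does not give the authors' actual procedure in this detail.

```python
from collections import deque

# Toy SLP-like graph: each node maps to its operand nodes.
# Node names are hypothetical; "S" is the root (store).
GRAPH = {
    "S": ["+"],
    "+": ["L1", "*"],
    "*": ["L2", "L3"],
    "L1": [],
    "L2": [],
    "L3": [],
}


def connected_subgraphs(graph, root):
    """Yield every connected node set that contains the root."""
    start = frozenset([root])
    seen = {start}
    queue = deque([start])
    while queue:
        sub = queue.popleft()
        yield sub
        # Neighbors: nodes reachable in one step that are not yet included.
        neighbors = {n for node in sub for n in graph[node]} - sub
        for n in sorted(neighbors):
            grown = sub | {n}          # attach one neighbor at a time
            if grown not in seen:
                seen.add(grown)
                queue.append(grown)


subgraphs = list(connected_subgraphs(GRAPH, "S"))
print(len(subgraphs))  # 11 for this six-node toy graph
```

Even this tiny graph yields 11 candidates, which hints at how quickly exhaustive enumeration grows on larger graphs.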

  51. Fast Subgraph (Cuts) Generation Algorithm (slide 11 of 16)

      [Figure: flowchart. A subgraph with neighbors X and Y is checked against "subgraphs > T ?". NO: several new subgraphs are generated from it (three shown). YES: a single new subgraph is generated.]

      • After T subgraphs, attach all neighbors
      • Complexity reduced to linear O(T + N)
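The thresholding idea can be sketched by capping the exhaustive enumeration: once more than T subgraphs exist, each subgraph is grown by all of its neighbors at once instead of one at a time. This is only an approximation of the slide under stated assumptions (the toy graph, the name `capped_subgraphs`, and counting generated subgraphs with a `seen` set are all hypothetical), and it makes no attempt to reproduce the O(T + N) bound exactly.

```python
from collections import deque

# Toy graph: each node maps to its operand nodes; "S" is the root.
GRAPH = {
    "S": ["+"],
    "+": ["L1", "*"],
    "*": ["L2", "L3"],
    "L1": [],
    "L2": [],
    "L3": [],
}


def capped_subgraphs(graph, root, limit):
    """Enumerate root-connected subgraphs, coarsening after `limit` (T) of them."""
    start = frozenset([root])
    seen = {start}
    queue = deque([start])
    result = []
    while queue:
        sub = queue.popleft()
        result.append(sub)
        neighbors = {n for node in sub for n in graph[node]} - sub
        if not neighbors:
            continue
        if len(seen) > limit:
            # Past the threshold: attach all neighbors in a single step.
            candidates = [sub | neighbors]
        else:
            # Below the threshold: attach one neighbor at a time.
            candidates = [sub | {n} for n in sorted(neighbors)]
        for grown in candidates:
            if grown not in seen:
                seen.add(grown)
                queue.append(grown)
    return result


few = capped_subgraphs(GRAPH, "S", limit=3)
everything = capped_subgraphs(GRAPH, "S", limit=100)
print(len(few), len(everything))  # 6 11
```

With a generous limit the result matches the exhaustive enumeration; with a small limit fewer candidates are produced, and the fully vectorized graph is still among them.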

  59. Experimental Setup (slide 12 of 16)

      • Implemented TSLP in the trunk version of the LLVM 3.6 compiler.
      • Target: Intel Core i5-4570 @ 3.2GHz
      • Compiler flags: -O3 -allow-partial-unroll -march=core-avx2 -mtune-core-i7
      • Kernels, SPEC 2006 and NPB2.3-C
      • We evaluated the following cases:
        1. All loop, SLP and TSLP vectorizers disabled (O3)
        2. O3 + SLP enabled (SLP)
        3. O3 + TSLP enabled (TSLP)
