
Polly’s Polyhedral Scheduling in the Presence of Reductions

Johannes Doerfert ⋆, Kevin Streit ⋆, Sebastian Hack ⋆, Zino Benaissa †
⋆ Saarland University, Saarbrücken, Germany
† Qualcomm Innovation Center, San Diego, USA

January 19, 2015

Reductions

    for (i = 0; i < 4 * N; i++)
      sum += A[i];

P. Jouvelot and B. Dehbonei. A unified semantic approach for the vectorization and parallelization of generalized reductions. In Proceedings of the 3rd International Conference on Supercomputing, ICS ’89, pages 186–194, New York, NY, USA, 1989. ACM.

Reductions

    tmp_sum[4] = {0,0,0,0};
    for (i = 0; i < 4 * N; i+=4)
      tmp_sum[0:3] += A[i:i+3];
    sum += tmp_sum[0] + tmp_sum[1]
         + tmp_sum[2] + tmp_sum[3];
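The privatized form above can be written as plain C for comparison with the scalar loop. This is a minimal sketch: the function names are made up for illustration, the inner loop over `l` stands in for a single vector addition, and the length is assumed to be a multiple of 4, matching the `4 * N` bound on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference: one accumulator, updated in iteration order. */
long sum_sequential(const int *A, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Privatized variant (hypothetical name): four partial sums that are
 * combined once after the loop. Assumes n is a multiple of 4. */
long sum_privatized4(const int *A, size_t n) {
    assert(n % 4 == 0);
    long tmp_sum[4] = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i += 4)
        for (size_t l = 0; l < 4; l++)   /* stands in for one vector add */
            tmp_sum[l] += A[i + l];
    return tmp_sum[0] + tmp_sum[1] + tmp_sum[2] + tmp_sum[3];
}
```

Both functions return the same value for integer data, because integer addition is associative and commutative as long as the accumulators do not overflow.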

Reductions

    for (i = 0; i < 4 * N; i++) {
      S(i);
      sum += A[i];
      P(i);
    }

B. Pottenger and R. Eigenmann. Idiom recognition in the Polaris parallelizing compiler. In Proceedings of the 9th International Conference on Supercomputing, ICS ’95, pages 444–448, New York, NY, USA, 1995. ACM.

Reductions

    tmp_sum[4] = {0,0,0,0};
    for (i = 0; i < 4 * N; i+=4) {
      vecS(i:i+3);
      tmp_sum[0:3] += A[i:i+3];
      vecP(i:i+3);
    }
    sum += tmp_sum[0] + tmp_sum[1]
         + tmp_sum[2] + tmp_sum[3];

Reductions

    for (i = 0; i < NX; i++) {
      for (j = 0; j < NY; j++) {
        q[i] = q[i] + A[i][j] * p[j];
        s[j] = s[j] + r[i] * A[i][j];
      }
    }

X. Redon and P. Feautrier. Detection of recurrences in sequential programs with loops. In Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe, PARLE ’93, pages 132–145, London, UK, 1993.
X. Redon and P. Feautrier. Scheduling reductions. In Proceedings of the 8th International Conference on Supercomputing, ICS ’94, pages 117–125, New York, NY, USA, 1994. ACM.
X. Redon and P. Feautrier. Detection of scans in the

Reductions

    for (i = 0; i <= N; i++)
      A[i] = i;
    for (i = N; i >= 0; i--)
      sum += A[i];

G. Gupta, S. Rajopadhye, and P. Quinton. Scheduling reductions on realistic machines. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’02, pages 117–126, New York, NY, USA, 2002. ACM.

Reductions

    for (i = 0; i <= N; i++)
      A[i] = i;
    sums[N+1] = sum;
    for (i = N; i >= 0; i--)
      sums[i] = sums[i+1] + A[i];
    sum = sums[0];
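The expansion of the scalar `sum` into an array `sums` can be checked against the scalar loop with a small C sketch. The function names and the heap allocation are assumptions made for illustration; the slide does not say how `sums` is stored. `A` is assumed to hold `N + 1` elements, as in the loops above.

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar form: every iteration updates the same cell `sum`. */
long sum_scalar(const int *A, int N, long sum) {
    for (int i = N; i >= 0; i--)
        sum += A[i];
    return sum;
}

/* Expanded form (hypothetical name): iteration i writes its own cell
 * sums[i], so each value is produced exactly once. */
long sum_expanded(const int *A, int N, long sum) {
    assert(N >= 0);
    long *sums = malloc(((size_t)N + 2) * sizeof *sums);
    if (sums == NULL)
        abort();
    sums[N + 1] = sum;
    for (int i = N; i >= 0; i--)
        sums[i] = sums[i + 1] + A[i];
    long result = sums[0];
    free(sums);
    return result;
}
```

The expanded version trades `N + 2` extra cells for the removal of the repeated writes to `sum`, which is the kind of memory overhead the practical considerations in this talk aim to keep small.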

Reductions

    sums[N+1] = sum;
    for (i = 0; i <= N; i++) {
      A[i] = i;
      sums[i] = sums[i+1] + A[i];
    }
    sum = sums[0];


Objectives & Challenges

Objectives
1) Detect general reduction computations
2) Parallelize/vectorize reductions efficiently
3) Interchange the order in which reductions are computed

Practical Considerations
a) Avoid runtime regressions
b) Minimize memory overhead
c) Minimize compile time overhead

Overview: Polly in LLVM


Reduction-like Computations

- Updates on the same memory cells
- Associative & commutative computations
- Locally not observed or intervened

Details are provided in the paper.

Reduction Dependences

- Loop carried self dependences
- Induced by reduction-like computations
- Inherit “associative” & “commutative” properties

W. Pugh and D. Wonnacott. Static analysis of upper and lower bounds on dependences and parallelism. ACM Trans. Program. Lang. Syst., 16(4):1248–1278,
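Because reduction dependences inherit associativity and commutativity, the iterations they connect may be executed in a different order without changing the result. The sketch below illustrates this for unsigned integer addition; the three function names are hypothetical and only the iteration order differs between them.

```c
#include <assert.h>
#include <stddef.h>

/* Original order: i = 0, 1, ..., n-1. */
unsigned reduce_forward(const unsigned *A, size_t n) {
    assert(A != NULL || n == 0);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Reversed order: i = n-1, ..., 1, 0. */
unsigned reduce_backward(const unsigned *A, size_t n) {
    assert(A != NULL || n == 0);
    unsigned sum = 0;
    for (size_t i = n; i > 0; i--)
        sum += A[i - 1];
    return sum;
}

/* Interleaved order: all even iterations first, then all odd ones. */
unsigned reduce_even_then_odd(const unsigned *A, size_t n) {
    assert(A != NULL || n == 0);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i += 2)
        sum += A[i];
    for (size_t i = 1; i < n; i += 2)
        sum += A[i];
    return sum;
}
```

Unsigned arithmetic is used so that the equality also holds on wrap-around. Floating-point addition is not associative, so the same reordering can change rounding there.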

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++)
    S:  { ...; sum += A[i]; ...; }
      return sum;
    }

Dependence Analysis
- Performed on statement level
- Computes value-based dependences

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++) {
    S:  ...;
    R:  sum += A[i];
    S:  ...;
      }
      return sum;
    }

Dependence Analysis
- Performed on statement level
- Computes value-based dependences

Reduction Dependence Analysis
- Isolates the load & store of reduction-like computations
- Performed both on access and statement level
- Identifies reuse of values by a reduction-like computation

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++)
    S:  sum += A[i];
      return sum;
    }

Dependences
    { Stmt_S[i0] → Stmt_S[1 + i0] : i0 >= 0 and i0 <= N − 1 }

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++)
    R:  sum += A[i];
      return sum;
    }

Dependences
    { }

Reduction Dependences
    { Stmt_R[i0] → Stmt_R[1 + i0] : i0 >= 0 and i0 <= N − 1 }

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++)
    S:  {
          A[i] = A[i] + A[i - 1];
          sum += i;
          A[i - 1] = A[i] + A[i - 2];
        }
      return sum;
    }

Dependences
    { Stmt_S[i0] → Stmt_S[1 + i0] : i0 >= 0 and i0 <= N − 1 }

Reduction Dependences

    int f(int *A, int N) {
      int sum = 0;
      for (int i = 0; i < N; i++) {
    S:  A[i] = A[i] + A[i - 1];
    R:  sum += i;
    S:  A[i - 1] = A[i] + A[i - 2];
      }
      return sum;
    }

Dependences
    { Stmt_S[i0] → Stmt_S[1 + i0] : i0 >= 0 and i0 <= N − 1 }

Reduction Dependences
    { Stmt_R[i0] → Stmt_R[1 + i0] : i0 >= 0 and i0 <= N − 1 }

Reduction Dependences

    void bicg(float q[NX], ...) {
      for (int i = 0; i < NX; i++) {
    S:  q[i] = 0;
        for (int j = 0; j < NY; j++) {
    T:    {
            q[i] = q[i] + A[i][j] * p[j];
            s[j] = s[j] + r[i] * A[i][j];
          }
        }
      }
    }

Dependences
    { Stmt_S[i0] → Stmt_T[i0, 0] : ...;
      Stmt_T[i0, i1] → Stmt_T[i0, 1 + i1] : ...;
      Stmt_T[i0, i1] → Stmt_T[1 + i0, i1] : ... }

Reduction Dependences

    void bicg(float q[NX], ...) {
      for (int i = 0; i < NX; i++) {
    S:  q[i] = 0;
        for (int j = 0; j < NY; j++) {
    R1:   q[i] = q[i] + A[i][j] * p[j];
    R2:   s[j] = s[j] + r[i] * A[i][j];
        }
      }
    }

Dependences
    { Stmt_S[i0] → Stmt_R1[i0, 0] : ... }

Reduction Dependences
    { Stmt_R1[i0, i1] → Stmt_R1[i0, 1 + i1] : ...;
      Stmt_R2[i0, i1] → Stmt_R2[1 + i0, i1] : ... }


Reduction Modeling

Reduction-enabled Code Generation
- Keep the polyhedral representation
- Perform parallelism check with and without reduction dependences

Reduction-enabled Scheduling
- Ignore reduction dependences during the scheduling
- May need additional privatization dependences

Reduction-aware Scheduling
- Let the scheduler make the parallelization decision based on the environment and the potential cost of privatization
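The parallelism check "with and without reduction dependences" can be pictured with a toy model. This is only a sketch under simplifying assumptions and not Polly's implementation: dependences are reduced to constant distance vectors of a two-deep loop nest, and the `Dep` type and `loop_is_parallel` function are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a dependence is a constant distance vector over a
 * two-deep loop nest, tagged as reduction dependence or not. */
typedef struct {
    int dist[2];
    bool is_reduction;
} Dep;

/* A loop at depth `dim` is parallel if no considered dependence is
 * carried by it, i.e. none has zero distance in all outer dimensions
 * and a non-zero distance at `dim`. */
bool loop_is_parallel(const Dep *deps, size_t n, int dim,
                      bool ignore_reductions) {
    assert(dim >= 0 && dim < 2);
    for (size_t k = 0; k < n; k++) {
        if (ignore_reductions && deps[k].is_reduction)
            continue;
        bool outer_zero = true;
        for (int d = 0; d < dim; d++)
            if (deps[k].dist[d] != 0)
                outer_zero = false;
        if (outer_zero && deps[k].dist[dim] != 0)
            return false;   /* carried by this loop */
    }
    return true;
}
```

For the two bicg reduction dependences, with distances (0, 1) and (1, 0), neither loop is parallel in the first check and both are in the second. A loop that is parallel only in the second check is one that would need privatization of the reduction location.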

Reduction-enabled Scheduling

    void bicg(float q[NX], ...) {
      for (int i = 0; i < NX; i++) {
    S:  q[i] = 0;
        for (int j = 0; j < NY; j++) {
    R1:   q[i] = q[i] + A[i][j] * p[j];
    R2:   s[j] = s[j] + r[i] * A[i][j];
        }
      }
    }

Dependences
    { Stmt_S[i0] → Stmt_R1[i0, 0] : i0 >= 0 and i0 <= NX }

Reduction Dependences
    { Stmt_R1[i0, i1] → Stmt_R1[i0, 1 + i1] : ... }
    { Stmt_R2[i0, i1] → Stmt_R2[1 + i0, i1] : ... }
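If the scheduler ignores the two reduction dependences, only the dependence from S to R1 constrains the order, so the i and j loops may, for example, be interchanged once S has run for every i. The sketch below shows one such schedule next to the original. It is an illustration chosen here, not a schedule taken from the talk. The function names and the flattened row-major `A` are assumptions, and integer data is used so the results can be compared exactly.

```c
#include <assert.h>

/* Original loop nest: i outer, j inner. A is nx * ny, row-major. */
void bicg_original(int nx, int ny, const int *A, const int *p,
                   const int *r, int *q, int *s) {
    assert(nx >= 0 && ny >= 0);
    for (int i = 0; i < nx; i++) {
        q[i] = 0;                                   /* S  */
        for (int j = 0; j < ny; j++) {
            q[i] = q[i] + A[i * ny + j] * p[j];     /* R1 */
            s[j] = s[j] + r[i] * A[i * ny + j];     /* R2 */
        }
    }
}

/* Interchanged nest (hypothetical schedule): S is executed for all i
 * first, which preserves S[i] -> R1[i, 0]; the orders along the
 * reduction dependences of R1 and R2 are changed. */
void bicg_interchanged(int nx, int ny, const int *A, const int *p,
                       const int *r, int *q, int *s) {
    assert(nx >= 0 && ny >= 0);
    for (int i = 0; i < nx; i++)
        q[i] = 0;                                   /* S  */
    for (int j = 0; j < ny; j++) {
        for (int i = 0; i < nx; i++) {
            q[i] = q[i] + A[i * ny + j] * p[j];     /* R1 */
            s[j] = s[j] + r[i] * A[i * ny + j];     /* R2 */
        }
    }
}
```

Both versions expect `s` to be initialized by the caller, as in the slide's kernel, where only `q` is reset inside the loop nest.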
