An alternative OpenMP Backend for Polly Michael Halkenhäuser 2019 European LLVM Developers’ Meeting 2019-04-08 | Embedded Systems and Applications Group | Michael Halkenhäuser | 1 / 19

Polly
◮ Polyhedral framework on LLVM-IR
  ◮ Efficient analyses and transformations
  ◮ Code generation
◮ Example transformations
  ◮ Loop interchange / fission / fusion
  ◮ Strip mining (Vectorization)
  ◮ Automatic parallelization

Polly – Sample Parallelization
◮ Automatic parallelization
  ◮ No need for manual annotation

Input:

    // "matvect" -- Sequential
    // (Simplified dependencies)
    for (i = 0; i <= n; i++) {
      for (j = 0; j <= n; j++)
        s[i] = s[i] + a[i][j] * x[j];
    }

Output:

    // "matvect" -- OpenMP parallelized
    // Equivalent to the LLVM-IR output
    #pragma omp parallel for [...] \
        schedule (dynamic, 1) num_threads(N)
    for (i = 0; i <= n; i++) {
      for (j = 0; j <= n; j++)
        s[i] = s[i] + a[i][j] * x[j];
    }

Polly – Parallelization Scheme
◮ Polly detects parallelizable code regions
  ◮ Moved into an outlined function
  ◮ Executed using OpenMP API
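The outlining step can be pictured at source level. The following is a minimal sketch only: Polly performs this transformation on LLVM-IR, and the names `MatvectArgs`, `matvect_subfn` and `run_matvect_parallel` are hypothetical. An OpenMP pragma stands in for the runtime calls a backend would emit; without OpenMP support the pragma is ignored and the loop runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bundle of the values the parallel region needs.
struct MatvectArgs {
  const std::vector<std::vector<double>> *a;
  const std::vector<double> *x;
  std::vector<double> *s;
};

// Outlined function: runs the original loop body for rows [lb, ub).
void matvect_subfn(const MatvectArgs &args, std::size_t lb, std::size_t ub) {
  for (std::size_t i = lb; i < ub; ++i)
    for (std::size_t j = 0; j < args.x->size(); ++j)
      (*args.s)[i] += (*args.a)[i][j] * (*args.x)[j];
}

// Stand-in for the runtime: hands out one row at a time, which
// corresponds to schedule(dynamic, 1) from the sample slide.
void run_matvect_parallel(const MatvectArgs &args) {
  assert(args.a->size() == args.s->size());
  const long rows = static_cast<long>(args.s->size());
#pragma omp parallel for schedule(dynamic, 1)
  for (long i = 0; i < rows; ++i)
    matvect_subfn(args, static_cast<std::size_t>(i),
                  static_cast<std::size_t>(i) + 1);
}
```

Each row writes only its own `s[i]`, so the rows can be processed by different threads without synchronization.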

Motivation for an alternative OpenMP Backend
◮ Limited influence on OpenMP execution
  ◮ Increase number of user options
  ◮ Improve fine-tuning possibilities
◮ Dependent on GNU OpenMP API
  ◮ Expand the scope of application
◮ LLVM OpenMP implementation available
  ◮ Enable direct use of LLVM’s OpenMP runtime
  ◮ Support automated testing

LLVM OpenMP Backend
◮ Extension of the preexisting backend
  ◮ Reused common functionalities
    ◮ Moved into abstract base class
  ◮ API-specific call creation and placement
    ◮ Implemented in derived class per backend
◮ User may choose backend
  ◮ Via command-line (CL) switch, similar to the number of threads
◮ Additional options
  ◮ Scheduling type
  ◮ Chunk size
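The base-class / derived-class split might look roughly like this. It is an illustrative sketch, not Polly's actual class layout: every class and function name is hypothetical. The two strings name one entry point from each runtime (`GOMP_parallel_loop_runtime_start` in GNU libgomp, `__kmpc_fork_call` in LLVM's OpenMP runtime) purely as examples of API-specific calls; the slides do not state which calls each backend emits.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical backend selector, standing in for the command-line switch.
enum class OmpBackend { GNU, LLVM };

// Shared steps live in the abstract base class.
class LoopGeneratorSketch {
public:
  virtual ~LoopGeneratorSketch() = default;

  // Common workflow: outline the loop, then place the API-specific call.
  std::string generate(const std::string &loopName) const {
    assert(!loopName.empty());
    return "outline(" + loopName + "); call " + forkCall();
  }

protected:
  // Each backend supplies its own runtime entry point.
  virtual std::string forkCall() const = 0;
};

class GnuLoopGeneratorSketch : public LoopGeneratorSketch {
protected:
  std::string forkCall() const override {
    return "GOMP_parallel_loop_runtime_start";
  }
};

class LlvmLoopGeneratorSketch : public LoopGeneratorSketch {
protected:
  std::string forkCall() const override { return "__kmpc_fork_call"; }
};

// Picks the derived class according to the user's backend choice.
std::unique_ptr<LoopGeneratorSketch> makeGenerator(OmpBackend backend) {
  if (backend == OmpBackend::LLVM)
    return std::make_unique<LlvmLoopGeneratorSketch>();
  return std::make_unique<GnuLoopGeneratorSketch>();
}
```

With this shape, adding another runtime only requires one more derived class and one more branch in the factory.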

LLVM OpenMP Backend – Options
◮ Scheduling type determines work distribution
  ◮ static: Predetermined, uniform distribution of iterations
  ◮ dynamic: Threads request work shares of chunk size
  ◮ guided: Hybrid scheduling of static and dynamic, using a minimum chunk size

LLVM OpenMP Backend – Options
◮ Scheduling type determines work distribution

                           static   dynamic   guided
  Load Balancing             –         +        ◦
  Organization Overhead      +         –        ◦

◮ static suited for constant computational demands
◮ dynamic suited for shifting computational demands
◮ guided suited for "both"
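To make the difference between the scheduling types concrete, here is a small simulation. It is a simplified model, not the behavior of any particular OpenMP runtime: `staticOwner` assumes chunks are dealt out round-robin, and `guidedChunks` assumes each request receives the remaining iterations divided by the thread count, never less than the minimum chunk size. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Static scheduling: the owning thread is known before the loop runs.
// Assumes chunks of the given size are assigned round-robin.
int staticOwner(int iteration, int chunk, int threads) {
  assert(iteration >= 0 && chunk > 0 && threads > 0);
  return (iteration / chunk) % threads;
}

// Guided scheduling (simplified): chunk sizes shrink as work runs out,
// but never drop below minChunk, except for a smaller final remainder.
std::vector<int> guidedChunks(int total, int threads, int minChunk) {
  assert(total >= 0 && threads > 0 && minChunk > 0);
  std::vector<int> chunks;
  int remaining = total;
  while (remaining > 0) {
    int proportional = (remaining + threads - 1) / threads; // ceiling
    int chunk = std::min(remaining, std::max(minChunk, proportional));
    chunks.push_back(chunk);
    remaining -= chunk;
  }
  return chunks;
}
```

Under this model, `guidedChunks(20, 4, 2)` yields chunks of 5, 4, 3, 2, 2, 2 and 2 iterations: large shares first to keep overhead low, small shares at the end to balance load.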

Experimental Methodology
◮ PolyBench [1]
  ◮ Provides multiple datasets
  ◮ Triggers auto-parallelization in 18 benchmarks
◮ Runtime results
  ◮ Average from 50 out of 60 runs (10% trimmed-mean)
  ◮ Utilized CPU: AMD R5 1600X
◮ Plots show relative speedup
  ◮ speedup = (runtime of baseline) / (runtime of competitor)

[1] https://sourceforge.net/projects/polybench/
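The two metrics can be written down as follows. This is a sketch under one assumption the slide does not spell out: that the 10 discarded runs are the 5 fastest and the 5 slowest, so `dropPerSide` would be 5 for 60 runs. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Trimmed mean: sort the runtimes, discard dropPerSide values at each
// end (assumed symmetric trimming), and average the rest.
double trimmedMean(std::vector<double> runtimes, std::size_t dropPerSide) {
  assert(runtimes.size() > 2 * dropPerSide);
  std::sort(runtimes.begin(), runtimes.end());
  auto first = runtimes.begin() + static_cast<std::ptrdiff_t>(dropPerSide);
  auto last = runtimes.end() - static_cast<std::ptrdiff_t>(dropPerSide);
  double sum = std::accumulate(first, last, 0.0);
  return sum / static_cast<double>(last - first);
}

// Relative speedup as defined on the slide: values above 1 mean the
// competitor ran faster than the baseline.
double speedup(double baselineRuntime, double competitorRuntime) {
  assert(competitorRuntime > 0.0);
  return baselineRuntime / competitorRuntime;
}
```

For example, a baseline of 8 s against a competitor of 4 s gives a speedup of 2.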

Performance Impact of chunk size
[Figure: LLVM OpenMP Chunk Size Comparison. Large Dataset · No Vectorization · Dynamic Scheduling · 12 Threads · Baseline: Chunk Size 1. Series: Chunk Size 2, 3, 4, 6. Y-axis: Achieved Speedup (0.0 to 4.0). X-axis: PolyBench-Benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm).]

Performance Impact of scheduling type
[Figure: LLVM OpenMP Scheduling Comparison. No Vectorization · 12 Threads · Baseline: Dynamic Scheduling. Series: Guided Scheduling · Large Dataset, Static Scheduling · Large Dataset. Y-axis: Achieved Speedup (0.0 to 10.0). X-axis: PolyBench-Benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm).]

Intermezzo – Customization Options
◮ Chunk size
  ◮ 1 is usually a reasonable choice
  ◮ Other chunk sizes can be very beneficial in particular cases
    ◮ More than 3× speedup possible
◮ Scheduling type
  ◮ Dynamic: Good overall performance
  ◮ Guided: Performs at least as well as dynamic
  ◮ Static: Problem-dependent
    ◮ May achieve 8× speedup compared to dynamic

Backend Comparison: LLVM versus GNU OpenMP Backend
[Figure: GNU & LLVM Backend Comparison. Large Dataset · No Vectorization · 4 Threads · Baseline: GNU Backend. Series: LLVM OpenMP · Best Result. Y-axis: Achieved Speedup (0.0 to 2.0). X-axis: PolyBench-Benchmark (adi, atax, bicg, cholesky, correlation, covariance, deriche, doitgen, gemver, gesummv, g-schmidt, lu, ludcmp, mvt, symm, syr2k, syrk, trmm).]
