many others LLVM Developers Meeting, San Jose, October 2017 - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth Scalable, Robust, and “Regression - Free” Loop Optimizations for Scientific Fortran and Modern C++ Michael Kruse, Tobias Grosser Albert Cohen, Sven Verdoolaege, Oleksandre Zinenko Polly Labs, ENS Paris Johannes Doerfert Uni. Saarbruecken Siddharth Bhat IIIT Hydrabad (Intern at ETH) Roman Gereev, Ural Federal University Hongin Zheng, Alexandre Isonard Xilinx Swiss Universities / PASC Qualcomm, ARM, Xilinx … many others LLVM Developers Meeting, San Jose, October 2017

spcl.inf.ethz.ch @spcl_eth Weather Graphics Physics Simulations Machine Learning 2

spcl.inf.ethz.ch @spcl_eth COSMO: Weather and Climate Model • 500.000 Lines of Fortran • 18.000 Loops • 19 Years of Knowledge • Used in Switzerland, Russia, Germany, Poland, Italy, Israel, Greece, Romania, … 3

spcl.inf.ethz.ch @spcl_eth COSMO – Climate Modeling Piz Daint, Lugano, Switzerland • Global (low-resolution Model) • Up to 5000 Nodes • Run ~monthly 4

spcl.inf.ethz.ch @spcl_eth COSMO – Weather Forecast • Regional model • High-resolution • Runs hourly (20 instances in parallel) • Today: 40 Nodes * 8 GPU • Manual Translation to GPUs • 3 year multi-person project Can LLVM do this automatically? 5

spcl.inf.ethz.ch @spcl_eth Polyhedral Model – In a nutshell Program Code Iteration Space i ≤ N = 4 5 i = 4, j = 0 i = 4, j = 1 i = 4, j = 2 i = 4, j = 3 i = 4, j = 4 for (i = 0; i <= N; i++) i 4 for (j = 0; j <= i; j++) i = 3, j = 0 i = 3, j = 1 i = 3, j = 2 i = 3, j = 3 3 S(i,j); i = 2, j = 0 i = 2, j = 1 i = 2, j = 2 0 ≤ j 2 i = 1, j = 0 i = 1, j = 1 j ≤ i 1 i = 0, j = 1 0 N = 4 0 ≤ i 0 1 2 3 4 5 j D = { (i,j ) | 0 ≤ i ≤ N ∧ 0 ≤ j ≤ i } Polly – Performing Polyhedral Optimizations on a Low-Level Intermediate Representation Tobias Grosser, Armin Groesslinger, and Christian Lengauer in Parallel Processing Letters (PPL), April, 2012 6

spcl.inf.ethz.ch @spcl_eth Statistics - COSMO  Number of Loops  18,093 Total  9,760 Static Control Loops (Modeled precisely by Polly)  15,245 Non-Affine Memory Accesses (Approximated by Polly) Siddharth Bhat  11.154 Loops after precise modeling, less e.g. due to: • Infeasible assumptions taken, or modeling timeouts  Largest set of loops: 72 loops  Reasons why loops cannot be modeled  Function calls with side-effects  Uncomputable loops bounds (data-dependent loop bounds?) 7

spcl.inf.ethz.ch @spcl_eth Interprocedural Loop Interchange for GPU Execution Pulled out parallel loop for #ifdef _OPENACC OpenACC Annotations !$acc parallel !$acc loop gang vector DO j1 = ki1sc, ki1ec CALL coe_th_gpu(pduh2oc (j1, ki3sc), pduh2of(j1, ki3sc), pduco2(j1, ki3sc), pduo3(j1, ki3sc), …, pa2f(j1 ), pa3c(j1), pa3f(j1)) ENDDO !$acc end parallel #else CALL coe_th (pduh2oc, pduh2of, pduco2, pduo3, palogp, palogt, podsc, podsf, podac, podaf, …, pa3c, pa3f ) #endif 8

spcl.inf.ethz.ch @spcl_eth Optical Effect on Solar Layer Outer loop is sequential DO j3 = ki3sc+1, ki3ec CALL coe_th (j3) { ! Determine effect of the layer in *coe_th* ! Optical depth of gases Inner loop is parallel DO j1 = ki1sc, ki1ec … IF (kco2 /= 0) THEN zodgf = zodgf + pduco2(j1 ,j3)* (cobi(kco2,kspec,2)* EXP ( coali(kco2,kspec,2) * palogp(j1 ,j3) -cobti(kco2,kspec,2) * palogt(j1 ,j3))) ENDIF Sequential Dependences … zeps=SQRT(zodgf*zodgf) … ENDDO } Inner loop is parallel DO j1 = ki1sc, ki1ec ! Set RHS … ENDDO Inner loop is parallel DO j1 = ki1sc, ki1ec ! Elimination and storage of utility variables … ENDDO ENDDO ! End of vertical loop over layers 9

spcl.inf.ethz.ch @spcl_eth Optical Effect on Solar Layer – After interchange !> Turn loop structure with multiple ip loops inside a !> single k loop into perfectly nested k-ip loop on GPU. #ifdef _OPENACC Outer loop is parallel !$acc parallel !$acc loop gang vector Inner loop is sequential DO j1 = ki1sc, ki1ec !$acc loop seq DO j3 = ki3sc+1, ki3ec ! Loop over vertical ! Determine effects of layer in *coe_so* CALL coe_so_gpu(pduh2oc (j1,j3) , pduh2of (j1,j3) , …, pa4c(j1 ), pa4f(j1), pa5c(j1), pa5f(j1)) ! Elimination … ztd1 = 1.0_dp/(1.0_dp-pa5f(j1)*(pca2(j1,j3)*ztu6(j1,j3-1)+pcc2(j1,j3)*ztu8(j1,j3-1))) ztu9(j1,j3) = pa5c(j1)*pcd1(j1,j3)+ztd6*ztu3(j1,j3) + ztd7*ztu5(j1,j3) ENDDO END DO ! Vertical loop !$acc end parallel 10

spcl.inf.ethz.ch @spcl_eth Life Range Reordering (IMPACT’16 Verdoolaege) Privatization needed for parallel execution parallel sequential sequential parallel False dependences prevent interchange Scalable Scheduling 11

spcl.inf.ethz.ch @spcl_eth Polly-ACC: Architecture Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler at International Conference of Supercomputing (ICS), June 2016, Istanbul 12

spcl.inf.ethz.ch @spcl_eth Polly-ACC: Architecture Intrinsics to model Better ways to link Multi-dimensional with NVIDIA strided arrays libdevice Scalable Modeling Scalable Scheduling Unified Memory Polly-ACC: Transparent Compilation to Heterogeneous Hardware Tobias Grosser, Torsten Hoefler at International Conference of Supercomputing (ICS), June 2016, Istanbul 13

spcl.inf.ethz.ch @spcl_eth Performance Headroom: - Kernel compilation COSMO (1.5s) - Register usage All important loop 1000 (2x) transformations - Block-size tuning performed - Unified-memory overhead? 22x 100 speedup 5x speedup (TTS) 10 4.3x speedup 1 Dragonegg + LLVM (CPU Cray (CPU only) Polly-ACC (P100) Manual OpenACC (P100) only) COSMO 14

spcl.inf.ethz.ch @spcl_eth Expression Templates (in a nutshell) class Vec : public VecExpression<Vec> { std::vector<double> elems; public: double operator[](size_t i) const { return elems[i]; } double &operator[](size_t i) { return elems[i]; } size_t size() const { return elems.size(); } Vec(size_t n) : elems(n) {} // A Vec can be constructed from any VecExpression, forcing its evaluation. template <typename E> Vec(VecExpression<E> const& vec) : elems(vec.size()) { for (size_t i = 0; i != vec.size(); ++i) { elems[i] = vec[i]; } } }; 15

spcl.inf.ethz.ch @spcl_eth Expression Templates (in a nutshell) class Vec : public VecExpression<Vec> { std::vector<double> elems; public: double operator[](size_t i) const { return elems[i]; } double &operator[](size_t i) { return elems[i]; } size_t size() const { return elems.size(); } Vec(size_t n) : elems(n) {} // A Vec can be constructed from any VecExpression, forcing its evaluation. template <typename E> Vec(VecExpression<E> const& vec) : elems(vec.size()) { for (size_t i = 0; i != vec.size(); ++i) { elems[i] = vec[i]; } } }; 16

spcl.inf.ethz.ch @spcl_eth Expression Templates (in a nutshell) - II template <typename E1, typename E2> class VecSum : public VecExpression<VecSum<E1, E2> > { E1 const& _u; Vec a, b, c; E2 const& _v; auto Sum = a + b + c; public: VecSum(E1 const& u, E2 const& v) : _u(u), _v(v) { assert(u.size() == v.size()); assert(typeof(sum) == } VecSum<VecSum<Vec, Vec>, Vec>) double operator[](size_t i) const { return _u[i] + _v[i]; } size_t size() const { return _v.size(); } // evaluation only happens on }; // assignment to type Vec template <typename E1, typename E2> VecSum<E1,E2> const Vec evaluate = Sum; operator+(E1 const& u, E2 const& v) { return VecSum<E1, E2>(u, v); } 17

spcl.inf.ethz.ch @spcl_eth “Modern C ++” -- boost::ublas and Expression Templates 1. Detect operations on tropical semi-rings • SGEMM/DGEMM (+, *) • Data-Layout Floyd-Warshall (min, +) Transformations in Polly Roman Gareev 2. Apply GOTO Algorithm (1) • L2 Tiling • Cache Transposed Submatrices • Register Tiling TargetData: - L1/L2 Cache Sizes 3. Chose optimal Cache and Register Tile Sizes (2) - L2/L2 Cache Latencies (1) High-performance implementation of the level-3 BLAS, Goto et al (2) A Analytical Modeling is Enough for High Performance BLIS, Tzem Low et al 18

spcl.inf.ethz.ch @spcl_eth DGEMM Performance Thanks @ARMHPC (Florian Hahn) for ARM codegen improvements! 19

spcl.inf.ethz.ch @spcl_eth 3MM Compile Time 4500 4000 3500 3000 2500 2000 1500 1000 500 0 clang -O3 clang -O3 -polly LLVM LLVM-Additonal IR - Generation AST Generation DeLICM Scheduling Instruction Forwarding 20

spcl.inf.ethz.ch @spcl_eth “Provable” Correct Types for Loop Transformations for ( int32 i = 1; i < N; i++) for ( int32 j = 1; j <= M; j++) Maximilian A(i,j) = A(i-1,j) + A(i,j-1) Falkenstein j i 21

spcl.inf.ethz.ch @spcl_eth “Provable” Correct Types for Loop Transformations for ( int32 i = 1; i < N; i++) for ( int32 j = 1; j <= M; j++) Maximilian A(i,j) = A(i-1,j) + A(i,j-1) Falkenstein j for ( intX c = 2; c < N+M; c++) i + j i #pragma simd for ( intX i = max(1, c-M); i <= min(N, c-1); i++) A(i,c-i) = A(i-1,c-1) + A(i,c-i-1) 22

spcl.inf.ethz.ch @spcl_eth “Provable” Correct Types for Loop Transformations for ( int32 i = 1; i < N; i++) for ( int32 j = 1; j <= M; j++) Maximilian A(i,j) = A(i-1,j) + A(i,j-1) Falkenstein j What is X? for ( intX c = 2; c < N+M; c++) i + j i #pragma simd for ( intX i = max(1, c-M); i <= min(N, c-1); i++) A(i,c-i) = A(i-1,c-1) + A(i,c-i-1) 23

many others LLVM Developers Meeting, San Jose, October 2017 - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth Scalable, Robust, and Regression - Free Loop Optimizations for Scientific Fortran and Modern C++ Michael Kruse, Tobias Grosser Albert Cohen, Sven Verdoolaege, Oleksandre Zinenko Polly Labs, ENS Paris Johannes

LLVM IR and the IoT Dvid Juhsz david.juhasz@imsystech.com 4/2/2018 1 FOSDEM 2018 LLVM

LLVM Binutils BoF 2019 EuroLLVM Developers' Meeting James Henderson (SN Systems) Jordan

Porting LLVM to a new OS Kai Nacke 31 January 2016 LLVM devroom @ FOSDEM16 Porting LLVM

Building, Testing and Debugging a Simple out-of-tree LLVM Pass October 29, 2015, LLVM

LLVM Numerics Improvements Michael C. Berg, Apple LLVM Developers Meeting, Brussels,

FTL WebKits LLVM based JIT Andrew Trick, Apple Juergen Ributzka, Apple LLVM Developers

MLIR Tutorial: Building a Compiler with MLIR LLVM Developers Meeting, Euro-LLVM 2019 Mehdi Amini

Wring an LLVM Pass: 101 LLVM 2019 tutorial Andrzej Warzyski arm October 2019 Andrzejs

LLVM Coroutines Bringing resumable functions to LLVM LLVM Dev Meeting 2016 Gor Nishanov

Jin Lin, Ernesto Su, Xinmin Tian Intel Corporation LLVM Developers Meeting 2018, October

LLVM/Clang Mouna Abidi & Manel Grichi 1 Plan What is LLVM? How will you be using it?

A Brief Introduction to Using LLVM Nick Sumner Spring 2013 What is LLVM? A compiler? What

LLVM Simone Campanoni simonec@eecs.northwestern.edu Problems with Canvas? Problems with slides?

LLVM Passes Nick Sumner (see also https://github.com/nsumner/llvm-demo) Matt Dwyer (see also

Hello, my name is Petr Hosek phosek@google.com #llvm/phosek LLVM Developers' Meeting.

NORTHERN CALIFORNIA REGION Local Policy Maker Group August 27, 2020 SAN FRANCISCO TO SAN JOSE

CSE 484 / CSE M 584 Computer Security: Buffer Overflows

A technical introduction to Bitcoin and crypto-currencies Organised by Steven Gordon Room RS410

File Systems: Fundamentals 1 Files What is a file? A named collection of related information

A Warning Propagation-Based Linear-Time-and-Space Algorithm for the Minimum Vertex Cover Problem

Static Analysis Basics II Trent Jaeger Systems and Internet Infrastructure Security (SIIS) Lab

12th International Satisfiability Modulo Theories Competition SMT-COMP 2017 Matthias Heizmann

Sources 1/2 Model 634 Deuterium source (side on, MgF2 window, 30 W) for vacuum ultraviolet

A tale of two schedulers Noah Evans, Richard Barrett, Stephen Olivier, George Stelle

many others LLVM Developers Meeting, San Jose, October 2017 - PowerPoint PPT Presentation

spcl.inf.ethz.ch @spcl_eth Scalable, Robust, and Regression - Free Loop Optimizations for Scientific Fortran and Modern C++ Michael Kruse, Tobias Grosser Albert Cohen, Sven Verdoolaege, Oleksandre Zinenko Polly Labs, ENS Paris Johannes

LLVM IR and the IoT Dvid Juhsz david.juhasz@imsystech.com 4/2/2018 1 FOSDEM 2018 LLVM

LLVM Binutils BoF 2019 EuroLLVM Developers' Meeting James Henderson (SN Systems) Jordan

Porting LLVM to a new OS Kai Nacke 31 January 2016 LLVM devroom @ FOSDEM16 Porting LLVM

Building, Testing and Debugging a Simple out-of-tree LLVM Pass October 29, 2015, LLVM

LLVM Numerics Improvements Michael C. Berg, Apple LLVM Developers Meeting, Brussels,

FTL WebKits LLVM based JIT Andrew Trick, Apple Juergen Ributzka, Apple LLVM Developers

MLIR Tutorial: Building a Compiler with MLIR LLVM Developers Meeting, Euro-LLVM 2019 Mehdi Amini

Wring an LLVM Pass: 101 LLVM 2019 tutorial Andrzej Warzyski arm October 2019 Andrzejs

LLVM Coroutines Bringing resumable functions to LLVM LLVM Dev Meeting 2016 Gor Nishanov

Jin Lin, Ernesto Su, Xinmin Tian Intel Corporation LLVM Developers Meeting 2018, October

LLVM/Clang Mouna Abidi &amp; Manel Grichi 1 Plan What is LLVM? How will you be using it?

A Brief Introduction to Using LLVM Nick Sumner Spring 2013 What is LLVM? A compiler? What

LLVM Simone Campanoni simonec@eecs.northwestern.edu Problems with Canvas? Problems with slides?

LLVM Passes Nick Sumner (see also https://github.com/nsumner/llvm-demo) Matt Dwyer (see also

Hello, my name is Petr Hosek phosek@google.com #llvm/phosek LLVM Developers' Meeting.

NORTHERN CALIFORNIA REGION Local Policy Maker Group August 27, 2020 SAN FRANCISCO TO SAN JOSE

CSE 484 / CSE M 584 Computer Security: Buffer Overflows

A technical introduction to Bitcoin and crypto-currencies Organised by Steven Gordon Room RS410

File Systems: Fundamentals 1 Files What is a file? A named collection of related information

A Warning Propagation-Based Linear-Time-and-Space Algorithm for the Minimum Vertex Cover Problem

Static Analysis Basics II Trent Jaeger Systems and Internet Infrastructure Security (SIIS) Lab

12th International Satisfiability Modulo Theories Competition SMT-COMP 2017 Matthias Heizmann

Sources 1/2 Model 634 Deuterium source (side on, MgF2 window, 30 W) for vacuum ultraviolet

A tale of two schedulers Noah Evans, Richard Barrett, Stephen Olivier, George Stelle

LLVM/Clang Mouna Abidi & Manel Grichi 1 Plan What is LLVM? How will you be using it?