DOE Proxy Apps – Clang/LLVM vs. the World!
Hal Finkel, Brian Homerding, Michael Kruse
EuroLLVM 2018
Argonne Leadership Computing Facility
High-Level Effects / Low-Level Effects
Many Good Stories Start with Some Source of Confusion...
Why do you think Clang/LLVM is doing better than I do?
Test Suite Analysis Methodology
• Collect 30 samples of test-suite execution time using lnt with both Clang 7 and GCC 7.3, using all threads including hyper-threading (112 on the Skylake run and 88 on the Broadwell run) (Noisy System)
• Compare with 99.5% confidence level using ministat
• Collect 30 additional samples for each compiler with only a single thread in use (Quiet System)
• Compare with 99.5% confidence level using ministat
• Look at the difference between compiler performance with different amounts of noise on the system
• Removed some outliers (Clang 20,000% faster on Shootout-C++-nestedloop)
Subset of DOE Proxies
[Chart: per-benchmark comparison, GCC faster vs. Clang faster]
Several of the DOE Proxy Apps are Interesting
• MiniAMR, RSBench and HPCCG jump the line and GCC begins to outperform
• PENNANT, MiniFE and CLAMR show GCC outperforming when there was no difference on a quiet system
• XSBench shows Clang outperforming on a quiet system and no difference on a noisy system (memory latency sensitive)
[Chart: statistical difference on 112 threads minus statistical difference on 1 thread; differences moving towards GCC vs. moving towards Clang]
What is causing the statistical difference?
• Instruction cache misses?
• Rerun the methodology collecting performance counters: 30 samples per compiler for both quiet and noisy systems
Instruction Cache Miss Data Added
[Chart: differences moving towards GCC vs. moving towards Clang, annotated with instruction cache miss data]
Top 12 tests where performance trends towards GCC on a noisy system
• Instruction cache misses do appear to explain some of the cases, but they are not the only relevant factor.
High-Level Effects / Low-Level Effects
RSBench Proxy Application
Significant amount of work in the math library
Generated Assembly
[Side-by-side assembly listings: Clang 7 vs. GCC 7.3]
For This, We Have A Plan: Modelling write-only errno
• Missed SimplifyLibCall
• Current limitations with representing write-only functions
• Write-only attribute in Clang
• Marking math functions as write-only
• Special case that sin and cos affect memory in the same way
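To illustrate the issue this plan addresses, here is a minimal sketch using a hypothetical kernel (not RSBench code): because sin() and cos() may set errno, a compiler that cannot mark them as write-only must treat each call as a write to unknown memory, which blocks optimizations such as reordering the calls or combining the pair into a single sincos computation.

#include <cmath>

// Hypothetical kernel for illustration only. With errno modelled as an
// arbitrary memory write, the compiler must assume the store to out_c[i]
// might be clobbered by the following std::sin() call, so it cannot freely
// reorder the calls or fuse them into one sincos computation. Building with
// -fno-math-errno (or teaching the optimizer that the calls only write
// errno) removes that constraint.
void rotate(double *out_c, double *out_s, const double *angle, int n) {
  for (int i = 0; i < n; ++i) {
    out_c[i] = std::cos(angle[i]);
    out_s[i] = std::sin(angle[i]);
  }
}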
High-Level Effects / Low-Level Effects
Compiler Specific Pragmas
• #pragma ivdep
• Not always just specific pragmas
• #pragma loop_count(15)
• #pragma vector nontemporal
• Clear mapping to Clang pragmas? (see the sketch below)
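As a rough illustration of the mapping question, here is a sketch with a hypothetical scatter-add kernel. Treating icc's #pragma ivdep as roughly equivalent to Clang's vectorize(assume_safety) is an assumption, not an exact correspondence; loop_count and vector nontemporal have no direct clang-loop-pragma counterparts.

// Hypothetical kernel, shown twice only to contrast the pragma spellings.
void scatter_add_icc(double *y, const double *x, const int *idx,
                     double a, int n) {
  // icc: ignore assumed loop-carried dependences for this loop.
  #pragma ivdep
  for (int i = 0; i < n; ++i)
    y[idx[i]] += a * x[i];
}

void scatter_add_clang(double *y, const double *x, const int *idx,
                       double a, int n) {
  // Approximate Clang spelling of a similar assertion: the vectorizer may
  // assume the memory accesses are safe to vectorize.
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < n; ++i)
    y[idx[i]] += a * x[i];
}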
MiniFE Proxy Application / openmp-opt
./miniFE.x -nx 420 -ny 420 -nz 420
• Compiler Specific Pragmas
Compiler Specific Pragmas
• The Intel compiler shows little to no performance gain from these #pragmas for the fully optimized applications investigated thus far (illustrated below):
  • #pragma loop_count(15)
  • #pragma ivdep
  • #pragma vector nontemporal
• Is there a potential benefit from this additional information that is not yet realized? Were the pragmas needed in a previous version and not now? Were they needed in the full application but not in the proxy?
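For reference, a sketch of how the other two pragmas from this slide look in source, on a hypothetical initialization kernel; Clang currently ignores both as unknown pragmas, so whatever information they carry is not yet consumed.

// Hypothetical stream-initialization loop, for illustration only.
void stream_init(double *y, double v, int n) {
  #pragma loop_count(15)       // icc hint: typical trip count is about 15
  #pragma vector nontemporal   // icc hint: use non-temporal (streaming) stores
  for (int i = 0; i < n; ++i)
    y[i] = v;
}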
LCALS "Livermore Compiler Analysis Loop Suite"
Subset A: loops representative of those found in application codes
Subset B: basic loops that help to illustrate compiler optimization issues
Subset C: loops extracted from "Livermore Loops coded in C" developed by Steve Langer, which were derived from the Fortran version by Frank McMahon
Google Benchmark Library
• Runs each micro-benchmark a variable number of times and reports the mean. The library controls the number of iterations.
• Provides additional support for specifying different inputs, controlling measurement units, minimum kernel runtime, etc.
• Did not match lit's one-test-to-one-result reporting
Expanding lit
• Expand the lit Result object to allow for a one-test-to-many-results model
Expanding lit
• The test suite can now use lit to report individual kernel timings based on the mean of many iterations of the kernel (test-suite/MicroBenchmarks)
LLVM Test Suite MicroBenchmarks
• Write benchmark code using the Google Benchmark Library (https://github.com/google/benchmark); a minimal sketch follows below
• Add test code into test-suite/MicroBenchmarks
• Link the executable in the test's CMakeLists to the benchmark library
• lit.local.cfg in test-suite/MicroBenchmarks will include the microBenchmark module from test-suite/litsupport
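A minimal sketch of what such a benchmark file might look like; the kernel and names are hypothetical (not an actual test-suite benchmark), but the registration macros are the standard Google Benchmark API.

#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

// Hypothetical daxpy-style kernel; real kernels live under
// test-suite/MicroBenchmarks.
static void BM_ExampleDaxpy(benchmark::State &state) {
  const size_t n = static_cast<size_t>(state.range(0));
  std::vector<double> x(n, 1.0), y(n, 2.0);
  for (auto _ : state) {
    for (size_t i = 0; i < n; ++i)
      y[i] += 3.0 * x[i];
    benchmark::DoNotOptimize(y.data());  // keep the loop from being removed
    benchmark::ClobberMemory();
  }
}
// Register the benchmark for a couple of problem sizes; the library decides
// how many iterations to run and reports the aggregated timing.
BENCHMARK(BM_ExampleDaxpy)->Arg(1 << 12)->Arg(1 << 16);

BENCHMARK_MAIN();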
High-Level Effects / Low-Level Effects
And Now To Talk About Loops and Directives...
Some plans for a new loop-transformation framework in LLVM...
EXISTING LOOP TRANSFORMATIONS
Loop Transformation #pragmas are Already All Around
gcc:    #pragma unroll 4 [also supported by clang, icc, xlc]
clang:  #pragma clang loop distribute(enable)
        #pragma clang loop vectorize_width(4)
        #pragma clang loop interleave(enable)
        #pragma clang loop vectorize(assume_safety) [undocumented]
icc:    #pragma ivdep
        #pragma distribute_point
msvc:   #pragma loop(hint_parallel(0))
xlc:    #pragma unrollandfuse
        #pragma loopid(myloopname)
        #pragma block_loop(50, myloopname)
OpenMP/OpenACC: #pragma omp parallel for
SYNTAX
Current syntax:
– #pragma clang loop transformation(option) transformation(option) ...
– Transformation order determined by the pass manager
– Each transformation may appear at most once
– LoopDistribution results in multiple loops; to which one do follow-ups apply?
Proposed syntax:
– #pragma clang loop transformation option option(arg) ...
– One #pragma per transformation
– Transformations stack up
– Can apply the same transformation multiple times
– Resembles OpenMP syntax
(a contrasting sketch follows below)
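A sketch contrasting the two spellings on a hypothetical kernel: the first form is what Clang accepts today, while the second uses the proposed syntax from this slide. The tile size(32) spelling is an invented instance of the transformation option(arg) form and is not implemented in Clang as of this talk.

// Current syntax: all options on one pragma, order chosen by the pass manager.
void scale_current(double *A, int n) {
  #pragma clang loop vectorize_width(4) interleave(enable)
  for (int i = 0; i < n; ++i)
    A[i] = 2.0 * A[i];
}

// Proposed syntax (not implemented): one pragma per transformation, and the
// transformations stack, e.g. tile the loop, then vectorize the inner loop.
void scale_proposed(double *A, int n) {
  #pragma clang loop vectorize width(4)
  #pragma clang loop tile size(32)
  for (int i = 0; i < n; ++i)
    A[i] = 2.0 * A[i];
}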
AVAILABLE TRANSFORMATIONS
Ideas, to be Implemented Incrementally
#pragma clang loop stripmine/tile/block
#pragma clang loop split/peel/concatenate [index domain]
#pragma clang loop specialize [loop versioning]
#pragma clang loop unswitch
#pragma clang loop shift/scale/skew [induction variable]
#pragma clang loop coalesce
#pragma clang loop distribute/fuse
#pragma clang loop reverse
#pragma clang loop move
#pragma clang loop interchange
#pragma clang loop parallelize_threads/parallelize_accelerator
#pragma clang loop ifconvert
#pragma clang loop zcurve
#pragma clang loop reschedule algorithm(pluto)
#pragma clang loop assume_parallel/assume_coincident/assume_min_depdist
#pragma clang loop assume_permutable
#pragma clang data localize [copy working set used in loop body]
...
LOOP NAMING
Ambiguity when Transformations Result in Multiple Loops

#pragma clang loop vectorize width(8)
#pragma clang loop distribute
for (int i = 1; i < n; i+=1) {
  A[i] = A[i-1] + A[i];
  B[i] = B[i] + 1;
}

[after loop distribution]

#pragma clang loop vectorize width(8)
for (int i = 1; i < n; i+=1)
  A[i] = A[i-1] + A[i];   [<= not vectorizable]

#pragma clang loop vectorize width(8)
for (int i = 1; i < n; i+=1)
  B[i] = B[i] + 1;
LOOP NAMING
Solution: Loop Names

#pragma clang loop(B) vectorize width(8)
#pragma clang loop distribute   [← applies implicitly on next loop]
for (int i = 1; i < n; i+=1) {
  #pragma clang block id(A)
  { A[i] = A[i-1] + A[i]; }
  #pragma clang block id(B)
  { B[i] = B[i] + 1; }
}

[after loop distribution]

#pragma clang loop id(A)   [← implicit name from loop distribution]
for (int i = 1; i < n; i+=1)
  A[i] = A[i-1] + A[i];

#pragma clang loop vectorize width(8)
#pragma clang loop id(B)   [← implicit name from loop distribution]
for (int i = 1; i < n; i+=1)
  B[i] = B[i] + 1;
OPEN QUESTIONS
Is #pragma clang loop parallelize_threads different enough from #pragma omp parallel for to justify its addition?
How to encode different parameters for different platforms?
Is it possible to use such #pragmas outside of the function the loop is in?
– Would like to put the source into a different file, which is then #included
Does the location of a #pragma with a loop name have a meaning?