
The Future of GPU/Accelerator Programming Models, LLVM HPC 2015



  1. The Future of GPU/Accelerator Programming Models LLVM HPC 2015 Michael Wong (IBM) michaelw@ca.ibm.com; http://wongmichael.com http://isocpp.org/wiki/faq/wg21:michael-wong IBM and Canadian C++ Standard Committee HoD; OpenMP CEO; Chair of WG21 SG5 Transactional Memory, SG14 Games/Low Latency; Director, Vice President of ISOCPP.org; Vice Chair, Standards Council of Canada Programming Languages

  2. Acknowledgement and Disclaimer • Numerous people internal and external to the original OpenMP group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. • I even lifted this acknowledgement and disclaimer from some of them. • But I claim all credit for errors and stupid mistakes. These are mine, all mine!

  3. Legal Disclaimer • This work represents the view of the author and does not necessarily represent the view of IBM. • IBM, PowerPC and the IBM logo are trademarks or registered trademarks of IBM or its subsidiaries in the United States and other countries. • Other company, product, and service names may be trademarks or service marks of others.

  4. Agenda • Clang/OpenMP multi-company collaboration • What Now? • SG14 • C++ Std GPU Accelerator Model

  5. OpenMP Mission Statement changed in 2013 • OpenMP's new mission statement: – "Standardize directive-based multi-language high-level parallelism that is performant, productive and portable" – Updated from "Standardize and unify shared memory, thread-level parallelism for HPC"

  6. OpenMP in Clang update • I chair the weekly OpenMP Clang review WG (Intel, IBM, AMD, TI, Micron) to help speed up OpenMP upstreaming into Clang: April 2015, ongoing – Joint code reviews, code refactoring – Delivered full OpenMP 3.1 into Clang 3.7 (default lib is still GCC OpenMP) – Added U of Houston OpenMP tests into Clang – IBM team delivered changes for the OpenMP RT for PPC; other teams added their platform/architecture – Released joint design of a multi-device target interface for LLVM to llvm-dev for comment – LLVM Developer Conf Oct 2015 talk: • http://llvm.org/devmtg/2015-10/slides/WongBataev-OpenMPGPUAcceleratorsComingOfAgeInClang.pdf • https://www.youtube.com/watch?v=dCdOaL3asx8&list=PL_R5A0lGi1AA4Lv2bBFSwhgDaHvvpVU21&index=18

  7. Many Participants/companies • Ajay Jayaraj, TI • Alexander Musman, Intel • Alex Eichenberger, IBM • Alexey Bataev, Intel • Andrey Bokhanko, Intel • Carlo Bertolli, IBM • Eric Stotzer, TI • Guansong Zhang, AMD • Hal Finkel, ANL • Ilia Verbyn, Intel • James Cownie, Intel • Kelvin Li, IBM • Kevin O'Brien, IBM • Kevin Smith, Intel • Melanie Ullmer, IBM • Michael Wong, IBM • Robert Ho, IBM • Samuel Antao, IBM • Sergey Ostanevich, Intel • Sunita Chandrasekaran, UH • Wael Yehia, IBM • Wang Chan, IBM • Ettore Tiotto, IBM • Yaoqing Gao, IBM

  8. The codebase • LLVM main repository (http://llvm.org): Version 3.7 (current version) has all OpenMP 3.1 features merged to Clang; Version 3.8 (trunk) is now merging OpenMP 4.0 offloading support • Clang-OMP repository (http://clang-omp.github.io): initial version based on a Clang/LLVM Version 3.5 snapshot; added OpenMP 4.0 and OpenMP 4.5 support • How to use it: – Grab the latest source files and install LLVM as usual – Use the right options to specify host and target machines, e.g.: $ clang -fopenmp -target powerpc64le-ibm-linux-gnu -mcpu pwr8 -omptargets=nvptx64sm_35-nvidia-cuda <source files>

  9. Offloading in OpenMP – Impl. components [Diagram: an input program is built by an OpenMP-enabled C/C++ compiler into a fat binary containing host code and device code; the target-agnostic host runtime library (host component) runs on the host machine and talks through a target API to the device component, which sits above the device runtime library, device driver, and device operating system.]

  10. Offloading in OpenMP – Impl. components [Same diagram, instantiated: Clang as the OpenMP-enabled compiler and an NVIDIA K40 as the device.]

  11. Clang with OpenMP • Compiler actions (inputs a.cpp, b.cpp): – The driver preprocesses input source files using the host/target preprocessor • Header files may be in different places; we may revisit this in the future – For each source file, the driver spawns a job using the host toolchain (host compile and assemble) and an additional job for each target specified by the user (target compile and assemble) • Flags inform the frontend that we are compiling code for a target, so only the relevant target regions are considered – The target linker creates a self-contained (no undefined symbols) image file – The target image file is embedded "as is" by the host linker into the host fat binary (FatBin, alongside the host RTL and device RTL) – The host linker is provided with information to generate the symbols required by the RTL

  12. Offloading in Clang: Current Status • Initial implementation available at https://github.com/clang-omp/clang_trunk • First patches are committed to trunk – Support for target constructs parsing/sema/codegen for host • Several patches are under review – Support for new driver option – Offloading descriptor registration and device codegen

  13. Heterogeneous device model • OpenMP 4.0 supports accelerators/coprocessors • Device model: – one host – multiple accelerators/coprocessors of the same kind

  14. Data mapping: shared or distributed memory [Diagram: with shared memory, processors X and Y access one memory holding variable A through their caches; with distributed memory, processor X has memory X and accelerator Y has memory Y, each holding its own copy of A.] • The corresponding variable in the device data environment may share storage with the original variable. • Writes to the corresponding variable may alter the value of the original variable.

  15. OpenMP 4.0 Device Constructs • Execute code on a target device – omp target [clause[[,] clause],…] structured-block – omp declare target [function-definitions-or-declarations] • Map variables to a target device – map ([map-type:] list) // map clause, map-type := alloc | tofrom | to | from – omp target data [clause[[,] clause],…] structured-block – omp target update [clause[[,] clause],…] – omp declare target [variable-definitions-or-declarations] • Workshare for acceleration – omp teams [clause[[,] clause],…] structured-block – omp distribute [clause[[,] clause],…] for-loops

  16. SAXPY: Serial (host)

  17. SAXPY: Serial (host)

  18. SAXPY: Coprocessor/Accelerator

  19. SAXPY: Coprocessor/Accelerator

  20. Building Fat Binary • Clang generates objects for each target • Target toolchains combine objects into target-dependent binaries • Host linker combines host + target-dependent binaries into an executable (Fat Binary) [Diagram: the fat binary contains data, LLVM-generated host code, Xeon Phi code, GPU code, and DSP code.] • New driver command-line option: -omptargets=T1,…,Tn, e.g.: clang -fopenmp -omptargets=nvptx64-nvidia-cuda,x86-pc-linux-gnu foo.c bar.c -o foobar.bin

  21. Heterogeneous Execution of Fat Binary [Diagram: at run time, libomptarget in the host code invokes per-device offload RTLs (a Xeon Phi offload RTL, a GPU offload RTL library, and a DSP offload RTL), which run the Xeon Phi, GPU, and DSP code sections of the fat binary on their respective devices.]

  22. Libomptarget and offload RTL • Source code available at https://github.com/clang-omp/libomptarget • Planned to be upstreamed • Supported platforms – libomptarget: platform-neutral implementation, tested on Linux for x86-64, PowerPC, and NVIDIA* (tested with CUDA* compilation tools V7.0.27) – Offload target RTL: x86-64, PowerPC, NVIDIA *Other names and brands may be claimed as the property of others.

  23. What did we learn? • Multi-Vendor/University collaboration works even outside of ISO • Support separate vendor-dependent target RTL to enable other programming models • Production compilers need support for L10N and I18N for multiple countries and languages

  24. Future plans • Clang 3.8 (~Feb 2016): trunk switches to the Clang OpenMP lib; upstream OpenMP 4.0 with focus on accelerator delivery; start code drops for OpenMP 4.5 • Clang 3.9 (~Aug 2016): complete OpenMP 4.0 and continue to add OpenMP 4.5 functionality • Clang 4.0 (~Feb 2017): Clang/LLVM becomes the reference compiler; follow OpenMP ratification with collaborated contribution?

  25. Clang 4.0 becomes OpenMP reference compiler and tracks OpenMP closely? Timeline, 2013 to 2017: – 11/12/2013: OpenMP 4.0 ratified/released – 5/31/2014: C++14 ratified – 8/31/2014: Clang 3.5 released – 9/3/2014: C++14 implemented in Clang 3.5 – 12/31/2014: C++14 released – 2/28/2015: Clang 3.6 released – 8/31/2015: Clang 3.7 released – 11/12/2015: OpenMP 4.5 ratified/released – 2/29/2016: Clang 3.8 released – 8/31/2016: Clang 3.9 release? – 2/28/2017: Clang 4.0 release? C++17 implemented in Clang 4.0? – 5/31/2017: C++17 ratify? – 11/12/2017: OpenMP 5.0 release? – 12/31/2017: C++17 released?
