
The Future of GPU/Accelerator Programming Models, LLVM HPC 2015



  1. The Future of GPU/Accelerator Programming Models LLVM HPC 2015 Michael Wong (IBM) michaelw@ca.ibm.com; http://wongmichael.com http://isocpp.org/wiki/faq/wg21:michael-wong IBM and Canadian C++ Standard Committee HoD; OpenMP CEO; Chair of WG21 SG5 Transactional Memory, SG14 Games/Low Latency; Director, Vice President of ISOCPP.org; Vice Chair, Standards Council of Canada Programming Languages

  2. Acknowledgement and Disclaimer • Numerous people internal and external to the original OpenMP group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. • I even lifted this acknowledgement and disclaimer from some of them. • But I claim all credit for errors and stupid mistakes. These are mine, all mine!

  3. Legal Disclaimer • This work represents the view of the author and does not necessarily represent the view of IBM. • IBM, PowerPC and the IBM logo are trademarks or registered trademarks of IBM or its subsidiaries in the United States and other countries. • Other company, product, and service names may be trademarks or service marks of others.

  4. Agenda • Clang/OpenMP multi-company collaboration • What Now? • SG14 • C++ Std GPU Accelerator Model

  5. OpenMP Mission Statement changed in 2013 • OpenMP's new mission statement: – "Standardize directive-based multi-language high-level parallelism that is performant, productive and portable" – Updated from "Standardize and unify shared memory, thread-level parallelism for HPC"

  6. OpenMP in Clang update • I chair the weekly OpenMP Clang review WG (Intel, IBM, AMD, TI, Micron) to help speed up OpenMP upstreaming into Clang: April 2015, ongoing – Joint code reviews, code refactoring – Delivered full OpenMP 3.1 into Clang 3.7 (default lib is still GCC OpenMP) – Added U of Houston OpenMP tests into Clang – IBM team delivered changes for the OpenMP RT for PPC; other teams added their platform/architecture – Released joint design of a multi-device target interface for LLVM to llvm-dev for comment – LLVM Developer Conf Oct 2015 talk: • http://llvm.org/devmtg/2015-10/slides/WongBataev-OpenMPGPUAcceleratorsComingOfAgeInClang.pdf • https://www.youtube.com/watch?v=dCdOaL3asx8&list=PL_R5A0lGi1AA4Lv2bBFSwhgDaHvvpVU21&index=18

  7. Many Participants/companies • Ajay Jayaraj, TI • Alexander Musman, Intel • Alex Eichenberger, IBM • Alexey Bataev, Intel • Andrey Bokhanko, Intel • Carlo Bertolli, IBM • Eric Stotzer, TI • Guansong Zhang, AMD • Hal Finkel, ANL • Ilia Verbyn, Intel • James Cownie, Intel • Kelvin Li, IBM • Kevin O'Brien, IBM • Kevin Smith, Intel • Melanie Ullmer, IBM • Michael Wong, IBM • Robert Ho, IBM • Samuel Antao, IBM • Sergey Ostanevich, Intel • Sunita Chandrasekaran, UH • Wael Yehia, IBM • Wang Chan, IBM • Ettore Tiotto, IBM • Yaoqing Gao, IBM

  8. The codebase • LLVM main repository (http://llvm.org): Version 3.7 (current version) has all OpenMP 3.1 features merged to Clang; Version 3.8 (trunk) is now merging OpenMP 4.0 offloading support • Clang-OMP repository (http://clang-omp.github.io): initial version based on a Clang/LLVM Version 3.5 snapshot; added OpenMP 4.0 and OpenMP 4.5 support • How to use it: – Grab the latest source files and install LLVM as usual – Use the right options to specify host and target machines, e.g.: $ clang -fopenmp -target powerpc64le-ibm-linux-gnu -mcpu pwr8 -omptargets=nvptx64sm_35-nvidia-cuda <source files>

  9. Offloading in OpenMP – Impl. components [Diagram: an input program is built by an OpenMP-enabled C/C++ compiler into a fat binary containing host code and device code; the target-agnostic host runtime library (host component) runs on the host machine and talks through a target API to the device component, which sits above the device runtime library, device driver, and device operating system.]

  10. Offloading in OpenMP – Impl. components [Same diagram, instantiated: Clang as the OpenMP-enabled compiler and an NVIDIA K40 as the device.]

  11. Clang with OpenMP • Compiler actions (inputs a.cpp, b.cpp): – The driver preprocesses input source files using the host/target preprocessor • Header files may be in different places; we may revisit this in the future – For each source file, the driver spawns a job using the host toolchain (host compile and assemble) and an additional job for each target specified by the user (target compile and assemble) • Flags inform the frontend that we are compiling code for a target, so only the relevant target regions are considered – The target linker creates a self-contained (no undefined symbols) image file – The target image file is embedded "as is" by the host linker into the host fat binary (FatBin, alongside the host RTL and device RTL) – The host linker is provided with information to generate the symbols required by the RTL

  12. Offloading in Clang: Current Status • Initial implementation available at https://github.com/clang-omp/clang_trunk • First patches are committed to trunk – Support for target constructs parsing/sema/codegen for host • Several patches are under review – Support for new driver option – Offloading descriptor registration and device codegen

  13. Heterogeneous device model • OpenMP 4.0 supports accelerators/coprocessors • Device model: – one host – multiple accelerators/coprocessors of the same kind

  14. Data mapping: shared or distributed memory [Diagram: with shared memory, processors X and Y access one memory holding variable A through their caches; with distributed memory, processor X has memory X and accelerator Y has memory Y, each holding its own copy of A.] • The corresponding variable in the device data environment may share storage with the original variable. • Writes to the corresponding variable may alter the value of the original variable.

  15. OpenMP 4.0 Device Constructs • Execute code on a target device – omp target [clause[[,] clause],…] structured-block – omp declare target [function-definitions-or-declarations] • Map variables to a target device – map ([map-type:] list) // map clause, map-type := alloc | tofrom | to | from – omp target data [clause[[,] clause],…] structured-block – omp target update [clause[[,] clause],…] – omp declare target [variable-definitions-or-declarations] • Workshare for acceleration – omp teams [clause[[,] clause],…] structured-block – omp distribute [clause[[,] clause],…] for-loops

  16. SAXPY: Serial (host)

  17. SAXPY: Serial (host)

  18. SAXPY: Coprocessor/Accelerator

  19. SAXPY: Coprocessor/Accelerator

  20. Building Fat Binary • Clang generates objects for each target • Target toolchains combine objects into target-dependent binaries • Host linker combines host + target-dependent binaries into an executable (Fat Binary) [Diagram: the fat binary contains data, LLVM-generated host code, Xeon Phi code, GPU code, and DSP code.] • New driver command-line option: -omptargets=T1,…,Tn, e.g.: clang -fopenmp -omptargets=nvptx64-nvidia-cuda,x86-pc-linux-gnu foo.c bar.c -o foobar.bin

  21. Heterogeneous Execution of Fat Binary [Diagram: at run time, libomptarget in the host code invokes per-device offload RTLs (a Xeon Phi offload RTL, a GPU offload RTL library, and a DSP offload RTL), which run the Xeon Phi, GPU, and DSP code sections of the fat binary on their respective devices.]

  22. Libomptarget and offload RTL • Source code available at https://github.com/clang-omp/libomptarget • Planned to be upstreamed • Supported platforms – libomptarget: platform-neutral implementation, tested on Linux for x86-64, PowerPC, and NVIDIA* (tested with CUDA* compilation tools V7.0.27) – Offload target RTL: x86-64, PowerPC, NVIDIA *Other names and brands may be claimed as the property of others.

  23. What did we learn? • Multi-Vendor/University collaboration works even outside of ISO • Support separate vendor-dependent target RTL to enable other programming models • Production compilers need support for L10N and I18N for multiple countries and languages

  24. Future plans • Clang 3.8 (~Feb 2016): trunk switches to the Clang OpenMP lib; upstream OpenMP 4.0 with focus on accelerator delivery; start code drops for OpenMP 4.5 • Clang 3.9 (~Aug 2016): complete OpenMP 4.0 and continue to add OpenMP 4.5 functionality • Clang 4.0 (~Feb 2017): Clang/LLVM becomes the reference compiler; follow OpenMP ratification with collaborated contribution?

  25. Clang 4.0 becomes OpenMP reference compiler and tracks OpenMP closely? Timeline, 2013 to 2017: – 11/12/2013: OpenMP 4.0 ratified/released – 5/31/2014: C++14 ratified – 8/31/2014: Clang 3.5 released – 9/3/2014: C++14 implemented in Clang 3.5 – 12/31/2014: C++14 released – 2/28/2015: Clang 3.6 released – 8/31/2015: Clang 3.7 released – 11/12/2015: OpenMP 4.5 ratified/released – 2/29/2016: Clang 3.8 released – 8/31/2016: Clang 3.9 release? – 2/28/2017: Clang 4.0 release? C++17 implemented in Clang 4.0? – 5/31/2017: C++17 ratify? – 11/12/2017: OpenMP 5.0 release? – 12/31/2017: C++17 released?
