Bringing Next Generation C++ to GPUs


  1. Bringing Next Generation C++ to GPUs
     Michael Haidl¹, Michel Steuwer², Lars Klein¹ and Sergei Gorlatch¹
     ¹University of Muenster, Germany   ²University of Edinburgh, UK

  2. The Problem: Dot Product

     std::vector<int> a(N), b(N), tmp(N);
     std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                    std::multiplies<int>());
     auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                                   std::plus<int>());

     • The STL is the C++ programmer's Swiss Army knife
     • STL containers, iterators and algorithms introduce a high level of abstraction
     • Since C++17 it is also parallel (see the sketch below)
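     (Aside) The last bullet refers to the C++17 parallel algorithms. Below is a minimal sketch
     of a parallel variant of the code above, assuming a standard library with execution-policy
     support; it uses std::reduce because std::accumulate has no parallel overload, and the
     example data is made up:

        #include <algorithm>
        #include <execution>
        #include <functional>
        #include <iostream>
        #include <numeric>
        #include <vector>

        int main() {
          const int N = 25600000;
          std::vector<int> a(N, 1), b(N, 2), tmp(N);   // example data
          // same structure as above, now with parallel execution policies
          std::transform(std::execution::par, a.begin(), a.end(), b.begin(), tmp.begin(),
                         std::multiplies<int>());
          // std::reduce is the parallel counterpart of std::accumulate
          auto result = std::reduce(std::execution::par, tmp.begin(), tmp.end(), 0,
                                    std::plus<int>());
          std::cout << result << '\n';
        }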

  3. The Problem: Dot Product

     std::vector<int> a(N), b(N), tmp(N);
     std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                    std::multiplies<int>());
     auto result = std::accumulate(tmp.begin(), tmp.end(), 0,
                                   std::plus<int>());

     // f1.hpp
     template <typename T>
     auto mult(const std::vector<T>& a, const std::vector<T>& b) {
       std::vector<T> tmp(a.size());
       std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                      std::multiplies<T>());
       return tmp;
     }

     // f2.hpp
     template <typename T>
     auto sum(const std::vector<T>& a) {
       return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
     }

  4. The Problem: Dot Product

     std::vector<int> a(N), b(N);
     auto result = sum(mult(a, b));

     // f1.hpp
     template <typename T>
     auto mult(const std::vector<T>& a, const std::vector<T>& b) {
       std::vector<T> tmp(a.size());
       std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                      std::multiplies<T>());
       return tmp;
     }

     // f2.hpp
     template <typename T>
     auto sum(const std::vector<T>& a) {
       return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
     }

  5. The Problem: Dot Product

     std::vector<int> a(N), b(N);
     auto result = sum(mult(a, b));

     Performance:
     • vectors of size 25600000
     • Clang/LLVM 5.0.0svn, -O3 optimized
     [Chart: runtime (ms, 0 to 40) of transform/accumulate, transform/accumulate* and inner_product;
      * = LLVM patched with extended D17386 (loop fusion)]
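     (Aside) The inner_product bar in the chart presumably corresponds to the fused standard
     algorithm, which computes the dot product in a single pass without the tmp vector; a
     minimal sketch with made-up example data:

        #include <numeric>
        #include <vector>

        int main() {
          const int N = 25600000;
          std::vector<int> a(N, 1), b(N, 2);   // example data
          // one fused pass: multiply element-wise and accumulate, no temporary vector
          auto result = std::inner_product(a.begin(), a.end(), b.begin(), 0);
          return result == 2 * N ? 0 : 1;      // expect 2*N for this data
        }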

  6. The Problem: Dot Product on GPUs

     thrust::device_vector<int> a(N), b(N), tmp(N);
     thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                       thrust::multiplies<int>());
     auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                                  thrust::plus<int>());

     • Highly tuned STL-like library for GPU programming
     • Thrust offers containers, iterators and algorithms
     • Based on CUDA

  7. The Problem: Dot Product on GPUs

     Same experiment:

     thrust::device_vector<int> a(N), b(N), tmp(N);
     thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                       thrust::multiplies<int>());
     auto result = thrust::reduce(tmp.begin(), tmp.end(), 0,
                                  thrust::plus<int>());

     • nvcc -O3 (from CUDA 8.0)
     [Chart: runtime (ms, 0 to 30) of transform/reduce and inner_product]
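     (Aside) Similarly, the inner_product bar here likely corresponds to Thrust's fused
     algorithm, which avoids the tmp vector and the second kernel launch; a minimal sketch
     with made-up example data:

        #include <thrust/device_vector.h>
        #include <thrust/inner_product.h>

        int main() {
          const int N = 25600000;
          thrust::device_vector<int> a(N, 1), b(N, 2);   // example data
          // one fused kernel: element-wise multiply and reduce
          int result = thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
          return result == 2 * N ? 0 : 1;                // expect 2*N for this data
        }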

  8. The Next Generation: Ranges for the STL

     std::vector<int> a(N), b(N);
     auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
     auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

     • range-v3 prototype implementation by E. Niebler
     • Proposed as N4560 for the C++ Standard

  9. The Next Generation: Ranges for the STL

     std::vector<int> a(N), b(N);
     auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
     auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

     Performance?
     • Clang/LLVM 5.0.0svn, -O3 optimized
     [Chart: runtime (ms, 0 to 40) of transform/accumulate and inner_product]

  10. The Next Generation: Ranges for the STL

      std::vector<int> a(N), b(N);
      auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
      auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

      • Views describe lazy, non-mutating operations on ranges
      • Evaluation happens inside an algorithm (e.g., accumulate)
      • Fusion is guaranteed by the implementation (see the pipe-syntax sketch below)
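      (Aside) The same lazy pipeline is often written with range-v3's pipe syntax; a minimal
      sketch, assuming a 2017-era range-v3 installation (the header choice and the check in
      main are mine, not from the slides):

         #include <range/v3/all.hpp>
         #include <vector>

         int main() {
           using namespace ranges;
           std::vector<int> a(1024, 1), b(1024, 2);   // example data
           auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
           // views are lazy: nothing is evaluated until accumulate pulls values through
           auto result = accumulate(view::zip(a, b) | view::transform(mult), 0);
           return result == 2 * 1024 ? 0 : 1;
         }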

  11. Ranges for GPUs

      std::vector<int> a(N), b(N);
      auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
      auto ga = gpu::copy(a);
      auto gb = gpu::copy(b);
      auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);

      • Extended range-v3 with a GPU-enabled container and algorithms
      • Original code of range-v3 remains unmodified

  12. Programming Accelerators with C++ (PACXX)

      • Based entirely on LLVM/Clang
      • Just-in-time compilation of LLVM IR for target accelerators
      • Supports C++14 for GPU programming

      [Architecture diagram: the offline stage (LLVM-based Clang frontend, LLVM libc++, PACXX
       offline compiler) embeds LLVM IR into the executable; the online stage (PACXX runtime
       with LLVM online compiler, NVPTX backend to PTX, LLVM-IR-to-SPIR backend) feeds the
       CUDA and OpenCL runtimes targeting Nvidia GPUs, AMD GPUs and Intel MIC. The input is
       ordinary C++, e.g. a std::fill implementation.]

  13. Multi-Stage Programming

      template <typename InRng, typename T, typename Fun>
      auto reduce(InRng&& in, T init, Fun&& fun) {
        ...                                                // 1. preparation of kernel call
        auto kernel = pacxx::kernel(                       // 2. create GPU kernel
            [fun](auto&& in, auto&& out, int size, auto init) {
              // 2a. stage elements per thread
              auto ept = stage([&] { return size / get_block_size(0); });
              // 2b. start reduction computation
              auto sum = init;
              for (int x = 0; x < ept; ++x) {
                sum = fun(sum, *(in + gid));
                gid += glbSize;
              }
              ...                                          // 2c. perform reduction in shared memory
              if (lid == 0) *(out + bid) = shared[0];      // 2d. write result back
            }, blocks, threads);
        kernel(in, out, distance(in), init);               // 3. execute kernel
        return std::accumulate(out, init, fun);            // 4. finish reduction on the CPU
      }

  14. MSP Integration into PACXX

      • The MSP Engine JIT compiles the MSP IR,
        • evaluates stage prior to a kernel launch, and
        • replaces the calls to stage in the kernel's IR with the results.
      • Enables more optimizations (e.g., loop unrolling) in the online stage (see the sketch below).

      [Architecture diagram: the PACXX offline compiler now emits kernel IR and MSP IR; at run
       time the PACXX MSP Engine feeds the evaluated stage results into the online compiler
       before the NVPTX / SPIR backends generate PTX or SPIR for the CUDA and OpenCL runtimes.]
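      (Aside) A hypothetical compile-time analogue of staging, to illustrate why a staged
      elements-per-thread value enables loop unrolling; this is plain C++, not the PACXX API:

         // EPT plays the role of the staged value that the MSP engine turns into a
         // constant before the kernel is JIT compiled.
         template <int EPT>
         int partial_sum(const int* in, int gid, int glbSize) {
           int sum = 0;
           for (int x = 0; x < EPT; ++x) {   // constant trip count: fully unrollable
             sum += in[gid];
             gid += glbSize;
           }
           return sum;
         }
         // With partial_sum<4>(...) the optimizer can emit straight-line code, whereas a
         // runtime ept parameter would keep the loop and its branch.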

  15. Performance Impact of MSP

      gpu::reduce on an Nvidia K20c: up to 35% better performance compared to the non-MSP version.

      [Chart: speedup (0.9 to 1.4) over input sizes 2^15 to 2^25 for Dot, Sum, Dot +MS and Sum +MS]

  16. Just-In-Time Compilation Overhead

      Comparing MSP in PACXX with Nvidia's nvrtc library: PACXX compiles 10 to 20 times faster,
      since the front-end actions are already performed in the offline stage.

      [Chart: compilation time (ms, 0 to 450) for the Dot and Sum kernels with CUDA 7.5,
       CUDA 8.0 RC and PACXX]

  17. range-v3 + PACXX vs. Nvidia's Thrust

      • Competitive performance with a composable GPU programming API
      • 1000 runs for each benchmark
      • 11 different input sizes
      • Evaluated on an Nvidia K20c GPU

      [Chart: speedup of PACXX over Thrust, roughly 0.5 to 2, for the benchmarks saxpy, sum,
       vadd, dot, Monte Carlo, Mandelbrot and Voronoi; bar labels 82.32%, 73.31%, 33.6%,
       8.37%, 1.61%, -3.13%, -7.38%]

  18. Going Native: Work in Progress

      • PACXX is extended by a native CPU backend
      • The kernel IR is modified to be runnable on a CPU
      • Kernels are vectorized by the Whole-Function Vectorizer (WFV) [1]
      • MCJIT compiles the kernels and TBB executes them in parallel (see the sketch below)

      [Architecture diagram: the online compiler gains a WFV [1] pass and an MCJIT path next to
       the NVPTX and SPIR backends, so kernels run via the CUDA and OpenCL runtimes on Nvidia
       GPUs, AMD GPUs, Intel MIC, and now natively on Intel / AMD / IBM CPUs.]

      [1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO'11, pp. 141-150.
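      (Aside) A conceptual sketch of the "TBB executes them in parallel" step, assuming the
      JIT-compiled kernel is exposed as a host function that processes one block; run_block
      and the launch code are illustrative names, not the PACXX interface:

         #include <cstdio>
         #include <tbb/parallel_for.h>

         // Stand-in for a kernel body that MCJIT has compiled to a host function.
         static void run_block(int block_id) {
           std::printf("block %d\n", block_id);
         }

         int main() {
           const int num_blocks = 8;
           // one TBB task per kernel block; TBB schedules the blocks across the CPU cores
           tbb::parallel_for(0, num_blocks, [](int block_id) { run_block(block_id); });
         }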

  19. range-v3 + PACXX vs. OpenCL on x86_64

      • Running on 2x Intel Xeon E5-2620 CPUs
      • Intel's auto-vectorizer optimizes the OpenCL C code

      [Chart: speedup of PACXX relative to Intel's OpenCL, roughly 0 to 2, for saxpy, sum, vadd,
       dot and Mandelbrot; bar labels -1.45%, -12.64%, -13.94%, -20.02%, -36.48%]

  20. range-v3 + PACXX vs. OpenCL on x86_64

      • Running on 2x Intel Xeon E5-2620 CPUs
      • AMD OpenCL SDK has no auto-vectorizer
      • Barriers are very expensive in AMD's OpenCL implementation (speedup up to 126x for sum)

      [Chart: speedup of PACXX relative to AMD's OpenCL, roughly 1 to 3, for saxpy, sum, vadd,
       dot and Mandelbrot; bar labels include 153.82%, 58.12% and 58.45%]
