Bringing Next Generation C++ to GPUs
Michael Haidl (1), Michel Steuwer (2), Lars Klein (1) and Sergei Gorlatch (1)
(1) University of Muenster, Germany   (2) University of Edinburgh, UK
1 The Problem: Dot Product

• The STL is the C++ programmer's Swiss Army knife
• STL containers, iterators and algorithms introduce a high level of abstraction
• Since C++17 it is also parallel (see the sketch below)

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0, std::plus<int>());
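A minimal sketch of the C++17 parallel version, assuming a standard library that ships the parallel algorithms (e.g., one backed by TBB); std::reduce stands in for std::accumulate because accumulate has no execution-policy overload:

  #include <algorithm>
  #include <cstddef>
  #include <execution>
  #include <functional>
  #include <numeric>
  #include <vector>

  int dot_par(std::size_t N) {
    std::vector<int> a(N), b(N), tmp(N);
    // Element-wise multiply in parallel, then a parallel reduction.
    std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                   tmp.begin(), std::multiplies<int>());
    return std::reduce(std::execution::par, tmp.begin(), tmp.end(), 0,
                       std::plus<int>());
  }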
1 The Problem: Dot Product

std::vector<int> a(N), b(N), tmp(N);
std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
               std::multiplies<int>());
auto result = std::accumulate(tmp.begin(), tmp.end(), 0, std::plus<int>());

f1.hpp:
template <typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

f2.hpp:
template <typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}
1 The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

f1.hpp:
template <typename T>
auto mult(const std::vector<T>& a, const std::vector<T>& b) {
  std::vector<T> tmp(a.size());
  std::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                 std::multiplies<T>());
  return tmp;
}

f2.hpp:
template <typename T>
auto sum(const std::vector<T>& a) {
  return std::accumulate(a.begin(), a.end(), T(), std::plus<T>());
}
1 The Problem: Dot Product

std::vector<int> a(N), b(N);
auto result = sum(mult(a, b));

Performance:
• vectors of size 25600000
• Clang/LLVM 5.0.0svn, -O3 optimized
• * LLVM patched with extended D17386 (loop fusion)

[Figure: runtime in ms (0 to 40) for three variants: transform/accumulate, transform/accumulate* with the loop-fusion patch, and inner_product]
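The inner_product bar in the figure is the hand-fused baseline, which avoids the temporary vector entirely; a minimal sketch of that variant:

  #include <numeric>
  #include <vector>

  int dot_fused(const std::vector<int>& a, const std::vector<int>& b) {
    // Single fused pass: multiplies element-wise and accumulates on the fly,
    // so no temporary vector is materialized.
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
  }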
2 The Problem: Dot Product on GPUs

• Highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0, thrust::plus<int>());
2 The Problem: Dot Product on GPUs

• Highly tuned STL-like library for GPU programming
• Thrust offers containers, iterators and algorithms
• Based on CUDA

thrust::device_vector<int> a(N), b(N), tmp(N);
thrust::transform(a.begin(), a.end(), b.begin(), tmp.begin(),
                  thrust::multiplies<int>());
auto result = thrust::reduce(tmp.begin(), tmp.end(), 0, thrust::plus<int>());

Same experiment:
• nvcc -O3 (from CUDA 8.0)

[Figure: runtime in ms (0 to 30) for transform/reduce vs. inner_product]
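Here, too, the inner_product bar is the fused Thrust baseline; a minimal sketch of that variant, which performs multiply and reduce in a single kernel launch without the tmp vector:

  #include <thrust/device_vector.h>
  #include <thrust/inner_product.h>

  int dot_fused(const thrust::device_vector<int>& a,
                const thrust::device_vector<int>& b) {
    // Fused multiply-and-reduce on the device in one launch.
    return thrust::inner_product(a.begin(), a.end(), b.begin(), 0);
  }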
3 The Next Generation: Ranges for the STL

• range-v3: prototype implementation by E. Niebler
• Proposed as N4560 for the C++ Standard

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);
3 The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

Performance?
• Clang/LLVM 5.0.0svn, -O3 optimized

[Figure: runtime in ms (0 to 40) for the ranges version compared with transform/accumulate and inner_product]
3 The Next Generation: Ranges for the STL

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };
auto result = accumulate(view::transform(view::zip(a, b), mult), 0);

• Views describe lazy, non-mutating operations on ranges
• Evaluation happens inside an algorithm (e.g., accumulate)
• Fusion is guaranteed by the implementation
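Because views only compose descriptions, the same computation can also be written in range-v3's pipeline style; a minimal sketch, assuming the range-v3 snapshot used here provides view::zip, view::transform and accumulate under the headers shown:

  #include <range/v3/all.hpp>
  #include <range/v3/numeric/accumulate.hpp>
  #include <vector>

  int dot_ranges(const std::vector<int>& a, const std::vector<int>& b) {
    using namespace ranges;
    // zip and transform are lazy views; the multiplication is fused into
    // the single traversal that accumulate performs.
    return accumulate(view::zip(a, b)
                        | view::transform([](auto tpl) {
                            return get<0>(tpl) * get<1>(tpl);
                          }),
                      0);
  }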
4 Ranges for GPUs

• Extended range-v3 with GPU-enabled container and algorithms
• Original code of range-v3 remains unmodified

std::vector<int> a(N), b(N);
auto mult = [](auto tpl) { return get<0>(tpl) * get<1>(tpl); };

auto ga = gpu::copy(a);
auto gb = gpu::copy(b);
auto result = gpu::reduce(view::transform(view::zip(ga, gb), mult), 0);
5 Programming Accelerators with C++ (PACXX)

• Based entirely on LLVM / Clang
• Just-in-time compilation of LLVM IR for target accelerators
• Supports C++14 for GPU programming

[Diagram: offline stage – the PACXX offline compiler (LLVM-based Clang frontend, LLVM libc++) compiles the C++ source (the example shown is std::fill) and embeds kernel LLVM IR into the executable; online stage – the PACXX runtime's LLVM online compiler lowers that IR either through the NVPTX backend to PTX for the CUDA runtime (Nvidia GPUs) or through the LLVM-IR-to-SPIR backend to SPIR for the OpenCL runtime (AMD GPUs, Intel MIC)]
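For illustration, a vector-add kernel written against the pacxx::kernel pattern that appears on the Multi-Stage Programming slide below. This is a hypothetical sketch only: the header, lambda signature and index intrinsic are assumptions, not the verbatim PACXX API.

  // Hypothetical sketch; names marked as assumed are not taken from the slides.
  #include <vector>

  void vadd(std::vector<int>& a, std::vector<int>& b, std::vector<int>& c,
            int blocks, int threads) {
    auto kernel = pacxx::kernel(
        [](auto&& a, auto&& b, auto&& c, int size) {
          auto gid = get_global_id(0);            // assumed index intrinsic
          if (gid < size) c[gid] = a[gid] + b[gid];
        },
        blocks, threads);                         // launch configuration as on slide 6
    kernel(a, b, c, static_cast<int>(a.size())); // IR is JIT-compiled and launched here
  }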
6 Multi-Stage Programming

template <typename InRng, typename T, typename Fun>
auto reduce(InRng&& in, T init, Fun&& fun) {
  ... // 1. preparation of kernel call
  // 2. create GPU kernel
  auto kernel = pacxx::kernel(
    [fun](auto&& in, auto&& out, int size, auto init) {
      // 2a. stage elements per thread
      auto ept = stage([&]{ return size / get_block_size(0); });
      // 2b. start reduction computation
      auto sum = init;
      for (int x = 0; x < ept; ++x) {
        sum = fun(sum, *(in + gid));
        gid += glbSize;
      }
      ... // 2c. perform reduction in shared memory
      // 2d. write result back
      if (lid == 0) *(out + bid) = shared[0];
    }, blocks, threads);
  // 3. execute kernel
  kernel(in, out, distance(in), init);
  // 4. finish reduction on the CPU
  return std::accumulate(out, init, fun);
}
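What staging buys: ept is evaluated on the host just before the launch and patched into the kernel IR as a constant, so the online compiler sees a fixed trip count. A minimal sketch of the effect in plain C++ (not the PACXX API); the concrete numbers are assumed example values:

  // Before staging: the trip count is only known at run time,
  // so the loop stays rolled.
  for (int x = 0; x < size / get_block_size(0); ++x) { /* ... */ }

  // After staging (conceptually): the runtime has substituted the value,
  // e.g. size = 2^20 and block size 128 give a constant 8192, so the
  // online compiler is free to unroll and vectorize the loop.
  constexpr int ept = 8192;   // assumed example value
  for (int x = 0; x < ept; ++x) { /* ... */ }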
7 MSP Integration into PACXX

• The MSP engine JIT compiles the MSP IR, evaluates stage prior to a kernel launch, and replaces the calls to stage in the kernel's IR with the results.
• Enables more optimizations (e.g., loop unrolling) in the online stage.

[Diagram: as before, but the PACXX offline compiler now embeds both the kernel IR and a separate MSP IR into the executable; at run time the PACXX MSP engine evaluates the staged expressions and feeds the results into the online compiler before PTX / SPIR code generation]
8 Performance Impact of MSP

• gpu::reduce on an Nvidia K20c
• Up to 35% better performance compared to the non-MSP version

[Figure: speedup (0.9 to 1.4) of Dot and Sum with MSP (+MS) over the non-MSP versions, for input sizes 2^15 to 2^25]
9 Just-In-Time Compilation Overhead

• Comparing MSP in PACXX with Nvidia's nvrtc library
• PACXX is 10 to 20 times faster because no front-end actions have to be performed at run time

[Figure: compilation time in ms (0 to 450) for Dot and Sum with CUDA 7.5, CUDA 8.0 RC and PACXX]
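For context, a sketch of a minimal nvrtc round trip (error handling omitted; the kernel source and compile option are illustrative): the CUDA bars pay for this at run time, since the kernel still enters as CUDA C++ source and must pass through the full front end before PTX is emitted.

  #include <nvrtc.h>
  #include <string>

  std::string compile_vadd_to_ptx() {
    const char* src =
        "extern \"C\" __global__ void vadd(int* a, int* b, int* c, int n) {"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;"
        "  if (i < n) c[i] = a[i] + b[i];"
        "}";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "vadd.cu", 0, nullptr, nullptr); // parse CUDA C++
    const char* opts[] = {"--gpu-architecture=compute_35"};          // K20c is sm_35
    nvrtcCompileProgram(prog, 1, opts);                              // front end + PTX generation
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);                                      // PTX to load via the CUDA driver API
    nvrtcDestroyProgram(&prog);
    return ptx;
  }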
10 range-v3 + PACXX vs. Nvidia's Thrust

• Evaluated on an Nvidia K20c GPU
• 11 different input sizes
• 1000 runs for each benchmark
• Competitive performance with a composable GPU programming API

[Figure: speedup (0.5 to 2) of PACXX over Thrust for the benchmarks saxpy, sum, vadd, dot, Monte Carlo, Mandelbrot and Voronoi; annotated differences: 82.32%, 73.31%, 33.6%, 8.37%, 1.61%, -3.13%, -7.38%]
11 Going Native: Work in Progress

• PACXX is extended by a native CPU backend
• The kernel IR is modified to be runnable on a CPU
• Kernels are vectorized by the Whole Function Vectorizer (WFV) [1]
• MCJIT compiles the kernels and TBB executes them in parallel

[Diagram: as before, with an additional online path in which the kernel IR is run through WFV [1] and MCJIT to produce native code for CPUs (Intel / AMD / IBM)]

[1] Karrenberg, Ralf, and Sebastian Hack. "Whole-Function Vectorization." CGO '11, pp. 141–150.
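A conceptual sketch of the execution side under these assumptions (not the actual PACXX runtime code; the function names and signatures are hypothetical): each work-group becomes a task that TBB schedules across cores, while WFV packs the work-items inside a group into SIMD lanes.

  #include <tbb/parallel_for.h>

  // Hypothetical: 'kernel' is the MCJIT-compiled, WFV-vectorized function,
  // invoked once per work-group.
  void run_on_cpu(void (*kernel)(int group), int numGroups) {
    tbb::parallel_for(0, numGroups, [&](int group) {
      kernel(group);   // one work-group per TBB task
    });
  }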
12 1 • Intel’s auto-vectorizer optimizes the OpenCL C code • Running on 2x Intel Xeon E5-2620 CPUs PACXX OpenCL (Intel) Speedup range-v3 + PACXX vs. OpenCL on x86_64 2 0 Benchmarks 1 . 5 − 1 . 45% − 12 . 64% − 13 . 94% − 20 . 02% − 36 . 48% 0 . 5 saxpy sum vadd dot Mandelbrot
13 range-v3 + PACXX vs. OpenCL on x86_64

• Running on 2x Intel Xeon E5-2620 CPUs
• The AMD OpenCL SDK has no auto-vectorizer
• Barriers are very expensive in AMD's OpenCL implementation

[Figure: speedup (1 to 3) of PACXX over AMD OpenCL for saxpy, sum, vadd, dot and Mandelbrot; annotated gains: 153.82%, 58.45%, 58.12%; sum reaches a speedup of up to 126x]