Automated OpenCL GPU kernel fusion for Stan Math
Tadej Ciglarič (presenter)*, Rok Češnovar, Erik Štrumbelj*
Stan
• State-of-the-art software for Bayesian statistics.
• Probabilistic programming language + Math library with auto-differentiation + inference algorithms.
• Some operations have an OpenCL implementation.
Overview
GPU development in the Stan Math library
• Hundreds of possible operations and distributions to implement for GPUs.
• Sequence of basic kernels: simple to develop, poor performance.
• Specialized kernels: good performance, slow development.
Kernel fusion
• Execution of multiple operations in a single kernel.
• Speedup: avoids kernel launch overhead and round trips between registers and global memory (see the sketch below).
• Can be automated.
• Data fusion.
• Parallel fusion.
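To make the saving concrete, here is a minimal hand-written sketch of fusing a scaled addition, the running example of the later slides. This is illustrative OpenCL, not output of the library, and the kernels are simplified to one dimension:

// Unfused: two launches, and tmp makes a round trip through global memory.
kernel void add(global double* tmp, global const double* a,
                global const double* b) {
  int i = get_global_id(0);
  tmp[i] = a[i] + b[i];
}

kernel void scale(global double* d, global const double* tmp, double c) {
  int i = get_global_id(0);
  d[i] = c * tmp[i];
}

// Fused: one launch; the intermediate sum stays in a register.
kernel void add_scale(global double* d, global const double* a,
                      global const double* b, double c) {
  int i = get_global_id(0);
  d[i] = c * (a[i] + b[i]);
}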
Implementation: interface
Lazy evaluation:
• operations are C++ objects,
• an expression is evaluated only when assigned to a result matrix.
Curiously Recurring Template Pattern:

// The derived class passes itself as a template argument to the base,
// so the base can call derived-class code without virtual dispatch.
template <typename T_a, typename T_b>
class addition_ : public binary_operation<addition_<T_a, T_b>, T_a, T_b> {
 public:
  addition_(T_a&& a, T_b&& b)
      : binary_operation<addition_<T_a, T_b>, T_a, T_b>(
            std::forward<T_a>(a), std::forward<T_b>(b), "+") {}
};

// operator+ only builds the expression object; no kernel runs yet.
template <typename T_a, typename T_b,
          typename = require_all_valid_expressions_t<T_a, T_b>>
inline addition_<as_operation_cl_t<T_a>, as_operation_cl_t<T_b>> operator+(
    T_a&& a, T_b&& b) {
  return {as_operation_cl(std::forward<T_a>(a)),
          as_operation_cl(std::forward<T_b>(b))};
}
Implementation: operation types
Example:

matrix_cl<double> a, b;
double c;
matrix_cl<double> d = c * (a + b);

Type of a + b:
addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>

Type of c * (a + b):
elewise_multiplication_<scalar_<double>,
    addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>

Assignment of an expression to a matrix generates, compiles and executes a kernel.
Implementation: generating kernel code
Operation objects generate code for their operation:

load_:
double [NAME] = 0;
if (!((!contains_nonzero([NAME]_view, LOWER) && j < i)
      || (!contains_nonzero([NAME]_view, UPPER) && j > i))) {
  [NAME] = [NAME]_global[i + [NAME]_rows * j];
}

addition_:
double var4 = var2 + var3;

load_ (as assignment target):
var5_global[i + var5_rows * j] = var4;
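Generation is a recursive walk of the expression tree: each node asks its children to emit their code, then appends one line computing its own value from theirs. A self-contained toy version of that walk (illustrative only; the triangular-view checks are omitted and names do not match Stan Math internals):

#include <iostream>
#include <string>

struct generator {
  std::string body;   // accumulated kernel body
  int counter = 0;
  std::string fresh() { return "var" + std::to_string(++counter); }
};

struct load_ {  // reads one element of a global-memory matrix
  std::string name;
  std::string generate(generator& g) const {
    std::string v = g.fresh();
    g.body += "double " + v + " = " + name + "_global[i + " + name
              + "_rows * j];\n";
    return v;  // name of the variable holding this node's value
  }
};

struct addition_ {  // adds the values produced by its two children
  load_ a, b;
  std::string generate(generator& g) const {
    std::string va = a.generate(g);
    std::string vb = b.generate(g);
    std::string v = g.fresh();
    g.body += "double " + v + " = " + va + " + " + vb + ";\n";
    return v;
  }
};

int main() {
  generator g;
  addition_ expr{{"a"}, {"b"}};
  std::string result = expr.generate(g);
  g.body += "res_global[i + res_rows * j] = " + result + ";\n";
  std::cout << g.body;  // prints a kernel body like the snippets above
}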
Complete kernel

kernel void calculate(double var1,
    __global double* var2_global, int var2_rows, int var2_view,
    __global double* var3_global, int var3_rows, int var3_view,
    __global double* var6_global, int var6_rows, int var6_view) {
  int i = get_global_id(0);
  int j = get_global_id(1);
  double var2 = 0;
  if (!((!contains_nonzero(var2_view, LOWER) && j < i)
        || (!contains_nonzero(var2_view, UPPER) && j > i))) {
    var2 = var2_global[i + var2_rows * j];
  }
  double var3 = 0;
  if (!((!contains_nonzero(var3_view, LOWER) && j < i)
        || (!contains_nonzero(var3_view, UPPER) && j > i))) {
    var3 = var3_global[i + var3_rows * j];
  }
  double var4 = var2 + var3;
  double var5 = var1 * var4;
  var6_global[i + var6_rows * j] = var5;
}
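For completeness, a hypothetical host-side launch of this kernel through the standard OpenCL C++ bindings. Stan Math performs the equivalent steps internally on assignment; launch(), the view encoding, and the buffer handling here are illustrative assumptions, and src must include the contains_nonzero helper that the library prepends to generated kernels:

#include <CL/opencl.hpp>
#include <string>

void launch(const std::string& src, double var1,
            cl::Buffer& a, cl::Buffer& b, cl::Buffer& result,
            int rows, int cols) {
  cl::Program program(cl::Context::getDefault(), src);
  program.build();  // the library caches compiled kernels per expression type
  cl::Kernel kernel(program, "calculate");
  int entire = 3;  // assumed view encoding: both triangles contain nonzeros
  kernel.setArg(0, var1);    // scalar operand, passed by value
  kernel.setArg(1, a);       kernel.setArg(2, rows); kernel.setArg(3, entire);
  kernel.setArg(4, b);       kernel.setArg(5, rows); kernel.setArg(6, entire);
  kernel.setArg(7, result);  kernel.setArg(8, rows); kernel.setArg(9, entire);
  // One work-item per element: i = get_global_id(0), j = get_global_id(1).
  cl::CommandQueue::getDefault().enqueueNDRangeKernel(
      kernel, cl::NullRange, cl::NDRange(rows, cols));
}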
Adding a new operation
• New class for the operation (derived from operation_cl or operation_cl_lhs).
• Must define:
  • Scalar,
  • generate,
  • view.
• Optional: generate_lhs, rows, cols.
• A function that constructs the object (sketch below).
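A hypothetical sketch of this checklist for an elementwise exp operation. The base-class template arguments and the exact generate() signature are simplified and need not match Stan Math's internals:

template <typename T>
class exp_ : public operation_cl<exp_<T>, double, T> {
 public:
  using Scalar = double;  // scalar type of the result
  using base = operation_cl<exp_<T>, double, T>;
  explicit exp_(T&& a) : base(std::forward<T>(a)) {}

  // Emit one kernel-source line computing this node from its argument's value.
  std::string generate(const std::string& result,
                       const std::string& arg) const {
    return "double " + result + " = exp(" + arg + ");\n";
  }

  // exp of a zero entry is nonzero, so the result is always a dense matrix.
  matrix_cl_view view() const { return matrix_cl_view::Entire; }
};

// Construction function, mirroring operator+ on the interface slide.
template <typename T, typename = require_all_valid_expressions_t<T>>
inline exp_<as_operation_cl_t<T>> exp(T&& a) {
  return exp_<as_operation_cl_t<T>>(as_operation_cl(std::forward<T>(a)));
}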
Empirical validation
• Comparison with a sequence of basic kernels.
• Comparison with a hand-crafted kernel.
• Comparison with VexCL, a similar library.
• On an NVIDIA GeForce GTX 1070 and an AMD Radeon VII.
Comparison with a sequence of basic kernels
• For a single operation, the fused kernel is comparable.
• For a sequence of operations, fusion is much faster.
• Matrix multiplication dominates run time, so speedups for expressions containing it are negligible.
• Fusion also avoids memory reallocations, which are slow on NVIDIA GPUs.
Comparison with a hand-crafted kernel
• On Bayesian linear regression.
• Comparable performance.
• Much simpler to use.
Comparison with VexCL
• Transposition and column-wise sum are much faster in our implementation.
• Row-wise sum is slightly slower.
• Other operations and multi-operation kernels are comparable.
• VexCL additionally supports general tensors and multiple OpenCL devices.
Conclusion
• Performance is comparable to hand-crafted kernels.
• As simple to use as calling pre-made kernels.
• Our work is similar to VexCL and TensorFlow XLA.