Automated GPU Kernel Fusion with XLA EuroLLVM'19, April 8 2019

Automated GPU Kernel Fusion with XLA
EuroLLVM'19, April 8 2019
Thomas Joerg, Google
Presenting work done by the XLA team

Outline ● TensorFlow ● Kernel fusion ● XLA compiler ● Automated kernel fusion

Example: ResNet block
ReLu := max(input, 0.0)
(Diagram nodes: Relu, Add [element-wise addition], Fused Batch Normalization, Convolution)

Fused Kernels ● Convenient ● Performant

// Compute a * x + y.
// a is a scalar, x and y are tensors.
tmp = tf.multiply(a, x)
out = tf.add(tmp, y)

// Compute a * x + y.
// a is a scalar, x and y are tensors.
tmp = tf.multiply(a, x)
out = tf.add(tmp, y)
Tensors read + written: 4
__global__ void Multiply(int n, float a, float* x) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = a * x[i];
}
__global__ void Add(int n, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] + y[i];
}

// Compute a * x + y.
// a is a scalar, x and y are tensors.
tmp = tf.multiply(a, x)
out = tf.add(tmp, y)
__global__ void FusedMulAdd(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = a * x[i] + y[i];
}

// Compute a * x + y.
// a is a scalar, x and y are tensors.
out = tf.fused_multiply_add(a, x, y)
Tensors read + written: 3 (25% reduction!)
__global__ void FusedMulAdd(int n, float a, float* x, float* y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = a * x[i] + y[i];
}
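
The difference between the two versions can be mimicked on the CPU. The following is a minimal C++ sketch, not the CUDA code from the talk: the function names `UnfusedMulAdd` and `FusedMulAdd` (as a host function) are chosen for this example, and each loop stands in for one kernel launch. The unfused version materializes the intermediate tensor `tmp`; the fused version never does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU stand-in for two separate kernels: tmp = a * x, then out = tmp + y.
// The intermediate tensor tmp is written to memory and read back.
std::vector<float> UnfusedMulAdd(float a, const std::vector<float>& x,
                                 const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> tmp(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = a * x[i];  // "Multiply"
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = tmp[i] + y[i];  // "Add"
  return out;
}

// CPU stand-in for the fused kernel: one pass, no intermediate tensor.
std::vector<float> FusedMulAdd(float a, const std::vector<float>& x,
                               const std::vector<float>& y) {
  assert(x.size() == y.size());
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = a * x[i] + y[i];
  return out;
}
```

Both functions return the same values for inputs that are exactly representable as floats; only the number of passes over memory differs.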

Fused Kernels ● Convenient ● Performant But ● Development cost ● Inflexible ● Hard to optimize

Submitter | Hardware              | Chip count | Software                  | ResNet-50 v1.5 *
NVIDIA    | DGX-1 (on premise)    | 8          | ngc18.11_MXNet, cuDNN 7.4 | 65.6
Google    | 8x Volta V100 (Cloud) | 8          | TF 1.12, cuDNN 7.4        | 64.1
* speedup relative to reference implementation
Full results: https://mlperf.org/results/

Example: ResNet block Relu Add

TensorFlow with XLA
TensorFlow Model → TensorFlow Graph → XLA Intermediate Representation: HLO → XLA target-independent & target-specific optimizations (HLO Fusion happens here!) → Target-specific code generation → CPU / GPU / TPU

HLO IR
Sample HLO ops
● Elementwise math ○ Add, Tanh, Map
● Specialized math for neural nets ○ Dot, Convolution, Reduce
● Re-organize data ○ Reshape, Broadcast, Concat, Tuple
● Control flow ○ While, Call, CustomCall
● Data transfer ○ Parameter, Constant
Sample data types
● Primitive types ○ PRED ○ F16 ○ F32
● Composite types ○ array: F32[2,3], F16[] ○ tuple: TUPLE(F32[16], F16)
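
The array notation above pairs an element type with a list of dimension sizes. As a rough illustration, the C++ sketch below prints shapes in that notation. `PrimitiveType`, `ArrayShape`, `ToString`, and `ElementCount` are hypothetical names made up for this example and are not XLA's actual shape classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the primitive types listed on the slide.
enum class PrimitiveType { PRED, F16, F32 };

// Hypothetical array shape: element type plus dimension sizes.
// An empty dimension list denotes a scalar, e.g. F16[].
struct ArrayShape {
  PrimitiveType element_type;
  std::vector<std::int64_t> dimensions;
};

std::string TypeName(PrimitiveType t) {
  switch (t) {
    case PrimitiveType::PRED: return "PRED";
    case PrimitiveType::F16:  return "F16";
    case PrimitiveType::F32:  return "F32";
  }
  return "UNKNOWN";
}

// Renders the shape in the slide's notation, e.g. "F32[2,3]".
std::string ToString(const ArrayShape& s) {
  std::string out = TypeName(s.element_type) + "[";
  for (std::size_t i = 0; i < s.dimensions.size(); ++i) {
    if (i > 0) out += ",";
    out += std::to_string(s.dimensions[i]);
  }
  return out + "]";
}

// Number of elements: product of the dimensions (1 for a scalar).
std::int64_t ElementCount(const ArrayShape& s) {
  std::int64_t count = 1;
  for (std::int64_t d : s.dimensions) {
    assert(d >= 0);
    count *= d;
  }
  return count;
}
```

For example, `ToString(ArrayShape{PrimitiveType::F32, {2, 3}})` yields the string `F32[2,3]`, and the same shape holds 6 elements.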

ReLu in HLO
(Figure labels: Operation, Type, Shape)

HLO Fusion

HLO Fusion ● Reduce memory bandwidth ● Compatible loop pattern ● Coalesced memory access

HLO Fusion
1) Fusion (with duplication)
2) Sibling fusion
3) Fusion with multiple outputs
(Diagram nodes: A, duplicated as A’ and A’’; B; C)
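
The three kinds of fusion can be illustrated on a tiny graph, assuming the diagram's A is a producer that feeds two consumers B and C. The C++ sketch below is a CPU stand-in: the element-wise functions `A`, `B`, and `C` are placeholders invented for this example, each loop stands for one GPU kernel, and the last function combines kinds 2 and 3 in the simplest way this toy graph allows.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Tensor = std::vector<float>;

// Placeholder element-wise ops (assumptions for this example).
inline float A(float v) { return v + 1.0f; }
inline float B(float v) { return v * 2.0f; }
inline float C(float v) { return v * v; }

// Unfused: three "kernels"; A's result is materialized and read twice.
std::pair<Tensor, Tensor> Unfused(const Tensor& x) {
  Tensor a(x.size()), b(x.size()), c(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) a[i] = A(x[i]);
  for (std::size_t i = 0; i < x.size(); ++i) b[i] = B(a[i]);
  for (std::size_t i = 0; i < x.size(); ++i) c[i] = C(a[i]);
  return {b, c};
}

// 1) Fusion with duplication: A is recomputed inside each consumer kernel,
//    so its result is never written to memory. Two kernels remain.
std::pair<Tensor, Tensor> FusedWithDuplication(const Tensor& x) {
  Tensor b(x.size()), c(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) b[i] = B(A(x[i]));
  for (std::size_t i = 0; i < x.size(); ++i) c[i] = C(A(x[i]));
  return {b, c};
}

// 2) + 3) Sibling fusion with multiple outputs: both consumers share one
//    kernel, which reads x once and writes two outputs.
std::pair<Tensor, Tensor> FusedMultiOutput(const Tensor& x) {
  Tensor b(x.size()), c(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float a = A(x[i]);
    b[i] = B(a);
    c[i] = C(a);
  }
  return {b, c};
}
```

All three functions return identical results; they differ in how many passes are made over memory and whether the intermediate tensor exists at all.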

Example: ResNet block Relu Add

Fused Add + ReLu
__global__ void fusion(float *lhs, float *rhs, float* output) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 128*512*28*28) {
    output[i] = ...
  }
}

std::function<llvm::Value*(const IrArray::Index&)> MakeElementGenerator(
    const HloInstruction* hlo,
    HloToElementGeneratorMap& operand_to_generator) {
  switch (hlo->opcode()) {
    case HloOpcode::kMaximum:
      return [...](const IrArray::Index& index) {
        // Generate both operands at this index, then select the larger one.
        llvm::Value* lhs = operand_to_generator.at(hlo->operand(0))(index);
        llvm::Value* rhs = operand_to_generator.at(hlo->operand(1))(index);
        auto cmp = b->CreateFCmpUGE(lhs, rhs);
        return b->CreateSelect(cmp, lhs, rhs);
      };
    ...
  }
}

Fused Add + ReLu
__global__ void fusion(float *lhs, float *rhs, float* output) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < 128*512*28*28) {
    output[i] = max(0.0, lhs[i] + rhs[i]);
  }
}
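
The element-generator idea can be mimicked without LLVM. In this hedged sketch, a generator returns a `float` for a flat index instead of emitting an `llvm::Value*`. The `Generator` alias and the helpers `Load`, `Constant`, `Add`, `Maximum`, `RunFused`, and `AddRelu` are hypothetical names for this example, not XLA APIs. Composing them yields the same per-element expression as the fused Add + ReLu kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A generator computes one output element for a given flat index.
using Generator = std::function<float(std::size_t)>;

// Leaf generator: reads element i of an input buffer.
// The buffer must outlive the generator (it is captured by pointer).
Generator Load(const std::vector<float>* buffer) {
  return [buffer](std::size_t i) { return (*buffer)[i]; };
}

Generator Constant(float value) {
  return [value](std::size_t) { return value; };
}

Generator Add(Generator lhs, Generator rhs) {
  return [lhs, rhs](std::size_t i) { return lhs(i) + rhs(i); };
}

Generator Maximum(Generator lhs, Generator rhs) {
  return [lhs, rhs](std::size_t i) { return std::max(lhs(i), rhs(i)); };
}

// Evaluates the fused expression once per element: a single loop,
// standing in for a single kernel launch.
std::vector<float> RunFused(const Generator& root, std::size_t n) {
  std::vector<float> output(n);
  for (std::size_t i = 0; i < n; ++i) output[i] = root(i);
  return output;
}

// Builds the generator for ReLu(lhs + rhs) = max(0.0, lhs + rhs).
Generator AddRelu(const std::vector<float>* lhs,
                  const std::vector<float>* rhs) {
  assert(lhs->size() == rhs->size());
  return Maximum(Constant(0.0f), Add(Load(lhs), Load(rhs)));
}
```

Because each operand is itself a generator, nesting calls builds the fused expression bottom-up, and no intermediate buffer for `lhs + rhs` is ever allocated.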

Reduction
(Figure: sum reduction over tiles of kTileSize rows)
i = blockIdx.x * blockDim.x + threadIdx.x;
y_in_tiles = i / width;
x = i % width;
// Each thread sums one tile of kTileSize rows in column x.
for (int j = 0; j < kTileSize; ++j) {
  y = y_in_tiles * kTileSize + j;
  if (y < height) {
    partial_sum += generator(y, x);
  }
}
// Combine the per-tile partial sums for column x.
atomicAdd(&output[x], partial_sum);
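
The indexing scheme of the reduction can be checked with a sequential CPU emulation. This is a sketch under assumptions: the name `ColumnReduce`, the tile size of 4, and the use of `std::function` for the generator are choices made for this example, and a plain `+=` replaces `atomicAdd` because the emulated threads run one after another.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

constexpr std::size_t kTileSize = 4;  // rows per thread; arbitrary for this sketch

// Sums each column of a height x width input. generator(y, x) yields the
// input element at row y, column x. Each iteration of the outer loop plays
// the role of one GPU thread.
std::vector<float> ColumnReduce(
    const std::function<float(std::size_t, std::size_t)>& generator,
    std::size_t height, std::size_t width) {
  assert(generator);  // the generator must be callable
  const std::size_t tiles = (height + kTileSize - 1) / kTileSize;
  std::vector<float> output(width, 0.0f);
  for (std::size_t i = 0; i < tiles * width; ++i) {  // i = global thread id
    const std::size_t y_in_tiles = i / width;
    const std::size_t x = i % width;
    float partial_sum = 0.0f;
    for (std::size_t j = 0; j < kTileSize; ++j) {
      const std::size_t y = y_in_tiles * kTileSize + j;
      if (y < height) partial_sum += generator(y, x);  // skip rows past the end
    }
    output[x] += partial_sum;  // atomicAdd on a GPU
  }
  return output;
}
```

With a height that is not a multiple of `kTileSize`, the bounds check `y < height` is what keeps the last tile from reading past the input.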

Multi-output fusion
i = blockIdx.x * blockDim.x + threadIdx.x;
y_in_tiles = i / width;
x = i % width;
for (int j = 0; j < kTileSize; ++j) {
  y = y_in_tiles * kTileSize + j;
  if (y < height) {
    // Two reductions share the same loop and index computation.
    partial_sum[0] += generator[0](y, x);
    partial_sum[1] += generator[1](y, x);
  }
}
atomicAdd(&output[0][x], partial_sum[0]);
atomicAdd(&output[1][x], partial_sum[1]);

i = blockIdx.x * blockDim.x + threadIdx.x;
y_in_tiles = i / width;
x = i % width;
for (int j = 0; j < kTileSize; ++j) {
  y = y_in_tiles * kTileSize + j;
  if (y < height) {
    // Three reductions and one element-wise output in the same loop.
    partial_sum[0] += generator[0](y, x);
    partial_sum[1] += generator[1](y, x);
    partial_sum[2] += generator[2](y, x);
    output[3][y, x] = generator[3](y, x);
  }
}
atomicAdd(&output[0][x], partial_sum[0]);
atomicAdd(&output[1][x], partial_sum[1]);
atomicAdd(&output[2][x], partial_sum[2]);
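
A sequential CPU emulation of this loop shows how several reductions and an element-wise output can share one traversal. The sketch rests on assumptions: `ElementGenerator`, `FusionResult`, and `MultiOutputFusion` are hypothetical names, the element-wise output is stored row-major, and `+=` stands in for `atomicAdd` because the emulated threads run sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using ElementGenerator = std::function<float(std::size_t, std::size_t)>;

struct FusionResult {
  std::vector<std::vector<float>> column_sums;  // one vector of sums per reduction
  std::vector<float> elementwise;               // height x width, row-major
};

// Emulates a multi-output fusion: several column reductions plus one
// element-wise output share a single traversal of the index space.
FusionResult MultiOutputFusion(const std::vector<ElementGenerator>& reductions,
                               const ElementGenerator& elementwise,
                               std::size_t height, std::size_t width,
                               std::size_t tile_size) {
  assert(tile_size > 0);
  FusionResult result;
  result.column_sums.assign(reductions.size(), std::vector<float>(width, 0.0f));
  result.elementwise.assign(height * width, 0.0f);
  const std::size_t tiles = (height + tile_size - 1) / tile_size;
  for (std::size_t i = 0; i < tiles * width; ++i) {  // one iteration per "thread"
    const std::size_t y_in_tiles = i / width;
    const std::size_t x = i % width;
    std::vector<float> partial_sum(reductions.size(), 0.0f);
    for (std::size_t j = 0; j < tile_size; ++j) {
      const std::size_t y = y_in_tiles * tile_size + j;
      if (y >= height) continue;  // last tile may extend past the input
      for (std::size_t r = 0; r < reductions.size(); ++r) {
        partial_sum[r] += reductions[r](y, x);
      }
      result.elementwise[y * width + x] = elementwise(y, x);
    }
    for (std::size_t r = 0; r < reductions.size(); ++r) {
      result.column_sums[r][x] += partial_sum[r];  // atomicAdd on a GPU
    }
  }
  return result;
}
```

The index arithmetic (`y_in_tiles`, `x`, `y`) is computed once per element and reused by every output, which is the saving the fused loop is after.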

Thank you! Questions?
XLA documentation: https://www.tensorflow.org/xla/overview
Public XLA mailing list: xla-dev@googlegroups.com
XLA on Github: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler
