Relay: A High-Level Differentiable IR
Jared Roesch, TVMConf, December 12th, 2018
This represents months of joint work with lots of great folks.
TVM Stack
[Figure: the TVM stack — a High-Level Differentiable IR (Relay) lowers to the Tensor Expression IR, then to LLVM, CUDA, Metal, and VTA; AutoTVM and AutoVTA provide optimization at the tensor-expression and VTA layers; hardware targets include edge FPGAs, cloud FPGAs, an FPGA fleet, and ASICs.]
How do we represent deep learning?
• Build parametric functions that approximate functions which are impossible or hard to program directly.
• In order to perform deep learning we need:
  • to represent computation,
  • to differentiate,
  • to optimize.
Existing Approach
[Figure: models (ResNet, DCGAN, LSTM) and a training loop lower to a computation graph, then to the Tensor Expression IR, then to LLVM, CUDA, Metal, and VTA, targeting edge FPGAs, cloud FPGAs, and ASICs.]
Existing Approach, Revised
[Figure: the same pipeline, with the computation graph replaced by a high-level differentiable IR.]
Python vs. Relay — the same loop, side by side:

Python:

    for i in range(…):
        inp, hs = …
        out, nhs = RNNCell(inp, hs)

Relay:

    for i in range(…):
        input, hs = …
        out, nhs = RNNCell(inp, hs)
Challenges
• How do we represent control flow, functional abstraction, and recursion?
• How do we represent and optimize training?
• How do we perform end-to-end, whole-model optimization?
Relay
• Relay is the high-level IR of the TVM stack.
• It generalizes computation graphs to differentiable programs.
• It enables whole-program optimization for deep learning.
• It is composed of a new IR, auto-diff, an optimizer, and backends.
• Relay is open source.
Initial Results
• Relay shows promising initial results when evaluated on inference tasks:
  • We are able to fully optimize models such as generative RNNs, outperforming PyTorch by up to 3x on model inference.
  • We demonstrate performance comparable to NNVM, and outperform TensorFlow and TensorFlow Lite.
  • We show that Relay can be executed on FPGAs, resulting in up to an 11x performance improvement over baseline.
Compiler Execution
[Figure: compiler pipeline — frontends (DSLs, language importers, a model importer, a text format, an on-disk representation) produce an AST, which flows through the optimizer to the execution backends (compiled operators with the graph runtime, a reference interpreter, and an ahead-of-time compiler), targeting FPGA, GPU, and CPU.]
IR
• A functional IR: an ML-like language (ReasonML, OCaml, SML, …) tailored to machine learning.
• Features closures, references, ADTs, and primitive operators; tensors are the primary value type.
• We can use this to represent full models, including a generative RNN and training loops.
• The functional style makes it possible to analyze and transform programs as pure data flow (see the sketch below).
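To make this concrete, here is a minimal sketch of building a small Relay function through TVM's Python API. The names used (relay.var, relay.nn.dense, tvm.IRModule) follow current TVM releases and may differ from the 2018 snapshot.

    import tvm
    from tvm import relay

    # Tensors are the primary value type; variables carry tensor types.
    x = relay.var("x", shape=(1, 64), dtype="float32")
    w = relay.var("w", shape=(64, 64), dtype="float32")

    # Operators are the primitive building blocks.
    y = relay.nn.relu(relay.nn.dense(x, w))

    # A Relay function is a first-class value that can be analyzed
    # and transformed as pure data flow.
    f = relay.Function([x, w], y)
    mod = tvm.IRModule.from_expr(f)
    print(mod)  # prints the textual IR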
RNN
[Figure: an unrolled RNN — inputs x0, x1, …, xN feed cells that carry hidden states s0, s1, …, sN+1 and emit outputs o1, o2, …, oN.]
A functional-style loop (n is the loop counter; i, h, … are the parameters):

    def @generate(n, i, h, …) {
      if (n == 0) {
        []
      } else {
        let (output, new_hidden) = @rnn_cell(i, h, …);
        output + @generate(n - 1, output, new_hidden, …)
      }
    }
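For intuition only, the same functional-style loop written as plain Python recursion; rnn_cell here is a hypothetical stand-in for the RNN cell above.

    def generate(n, inp, hidden, *params):
        # Base case: the loop counter has run out; no more outputs.
        if n == 0:
            return []
        # One step of the cell yields an output and a new hidden state.
        output, new_hidden = rnn_cell(inp, hidden, *params)
        # Recurse with the counter decremented, feeding the output back in.
        return [output] + generate(n - 1, output, new_hidden, *params)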
Typing
• Typing these programs introduces a few challenges:
  • We need static tensor shape information to match accelerator primitives, optimize aggressively, and provide better errors.
  • We must provide flexible typing for operators that relate input and output shapes, such as broadcast, flatten, concat, squeeze, and more.
    Tensor : (BaseType, Shape) -> Type
    Float  : (Width: Int, Lanes: Int) -> BaseType

    f32 = Float<32, 1>
    Tensor<f32, (32, 3, 32, 32)>    -- a 4-d tensor: N × Channels × Height × Width
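For reference, the same tensor type can be written with the tvm.relay Python API (current naming; a sketch):

    from tvm import relay

    # A 4-d float32 tensor type: (N, Channels, Height, Width).
    t = relay.TensorType((32, 3, 32, 32), "float32")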
Type Relation
• Operators, the primitive building blocks of machine learning, are hard to type check (e.g. preconditions must hold over input tensors).
• A call can carry a series of relations which must hold over the input types.
• This enables very flexible typing of operators.
• For example, we can implement variable arguments using relations (concat) and input/output shape relationships (broadcast).
For example, we can type broadcasting addition:

    add : forall (Lhs: Type, Rhs: Type, Out: Type),
          (Lhs, Rhs) -> Out where Broadcast(Lhs, Rhs, Out)

Broadcasting is a tricky rule often employed in machine learning:

    Broadcast(Tensor<f32, (3, 4, 5)>, Tensor<f32, (n, 3, 4, 5)>, Tensor<f32, (n, 3, 4, 5)>)
    Broadcast(Tensor<f32, (1, 5)>,    Tensor<f32, (n, 5)>,       Tensor<f32, (n, 5)>)
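As an illustration (not Relay's actual implementation), the shape computation the Broadcast relation encodes is the standard NumPy-style rule:

    from itertools import zip_longest

    def broadcast_shape(lhs, rhs):
        """Compute the broadcast output shape, or fail if incompatible."""
        out = []
        # Align the shapes from the right, padding the shorter with 1s.
        for l, r in zip_longest(reversed(lhs), reversed(rhs), fillvalue=1):
            if l == r or l == 1 or r == 1:
                out.append(max(l, r))
            else:
                raise TypeError(f"cannot broadcast {lhs} with {rhs}")
        return tuple(reversed(out))

    assert broadcast_shape((3, 4, 5), (2, 3, 4, 5)) == (2, 3, 4, 5)
    assert broadcast_shape((1, 5), (2, 5)) == (2, 5)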
Or more complex constraints, such as typing concat:

    concat : forall (Args: Type, Out: Type),
             (Args) -> Out where IsTuple(Args), Concat(Args, Out)
Optimizations
• We implement various optimizations over these programs (a pass-pipeline sketch follows), including:
  • Standard optimizations:
    • Fusion
    • Constant propagation
  • Accelerator-specific optimizations:
    • Quantization (see Ziheng's talk)
    • FoldScaleAxis
    • Data packing
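A minimal sketch of composing such passes with TVM's Python API; the names follow the current tvm.relay.transform module and may differ from the pass infrastructure as it existed in 2018.

    import tvm
    from tvm import relay

    # mod is assumed to be a Relay IRModule, e.g. from a frontend importer.
    seq = tvm.transform.Sequential([
        relay.transform.FoldConstant(),    # constant propagation
        relay.transform.FoldScaleAxis(),   # fold scaling into conv weights
        relay.transform.FuseOps(),         # operator fusion
    ])
    with tvm.transform.PassContext(opt_level=3):
        mod = seq(mod)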
Backends
[Figure: Relay targets three backends — the graph runtime, the interpreter, and the AoT compiler — which execute on FPGA, GPU, and CPU.]
Backends
• We implemented multiple execution backends to demonstrate the versatility of Relay as an IR.
• Each backend builds on TVM's existing low-level tensor IR (HalideIR).
• TVM is used for operators, but the rest of the program (e.g. allocation, control flow, recursion) must still be executed by each backend.
Operator Compilation
[Figure: a Relay function def @my_func(…) { … } is lowered through TVM into a shared library, operators.so.]
Graph Runtime + operators.so
• TVM's existing execution pipeline (GraphRTS); it can execute a subset of Relay programs (a usage sketch follows).
• Requires a graph, a shared library containing the operators, and parameters.
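A sketch of this path through the current Python API (tvm.relay.build plus the graph executor); the module names have shifted across TVM releases, so treat this as illustrative.

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # mod and params are assumed: a Relay IRModule and its weights.
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

    dev = tvm.cpu()
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("x", np.zeros((1, 64), dtype="float32"))
    m.run()
    out = m.get_output(0).numpy()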
Interpreter
• A reference interpreter for Relay.
• Implements the reference semantics.
• Uses naive recursive AST traversal for interpreting control flow.
• Uses JIT compilation for operators.
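A sketch of invoking the interpreter from Python; relay.create_executor is the current entry point and is assumed here to stand in for the 2018 interface.

    import numpy as np
    from tvm import relay

    # mod is assumed to be a Relay IRModule whose main takes x and w.
    ex = relay.create_executor("debug", mod=mod)
    out = ex.evaluate()(
        np.zeros((1, 64), dtype="float32"),
        np.zeros((64, 64), dtype="float32"),
    )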
AoT Compiler
• As a case study of what the Relay IR affords, we built a prototype compiler in less than 3 weeks.
• Generates code for CPU and GPU; FPGA support is planned.
• Removes interpretation overhead and enables optimization.
• Written as a pure Python library that uses Relay as a dependency.
Ahead-of-time compiler
[Figure: def @my_func(…) { … } goes through standard Relay optimization, is lowered by the AoT compiler to a small C++ subset ("LittleCpp"), then optimized and compiled by Clang into librelay_aot_my_func.so; usage: f = compile(my_func); f(…).]
VTA
• VTA is a target for Relay.
• We can compile high-level models written in frameworks such as MXNet directly to Relay.
• Generic compilation to VTA will be upstreamed soon after the conference.
[Figure: the VTA architecture — an instruction fetch module dispatching through command queues (CMD Q, LD→CMP Q, CMP→LD Q, CMP→ST Q, ST→CMP Q) to load, compute, and store modules; the compute module contains a register file, micro-op buffer, vector ALU, and tensor core, connected to input, weight, and store buffers and to DRAM.]
Evaluation
• Relay supports expressive models:
  • We demonstrate Relay's ability to optimize full models such as generative RNNs, beating PyTorch by up to 3x.
• Relay provides competitive performance:
  • We demonstrate better performance than TensorFlow and on-par performance with NNVM on a suite of models.
• Relay supports customized hardware:
  • We show how Relay and TVM can be used to execute on FPGA-based accelerators, bringing an 11x performance improvement over baseline.
[Figure: RNN inference benchmark — PyTorch vs. Relay-Compiled Cell, Relay-Compiled RNN, Relay-Interpreted Cell, and Relay-Interpreted RNN.]
CNN Results
[Figure: CNN inference results for Relay across several models.]
VTA Results
[Figure: VTA results plot.]
Future Work
• Evaluating Relay on training tasks.
• AutoRelay: applying ideas from AutoTVM to Relay.
• A high-level, fully differentiable programming-language frontend (e.g. a Python frontend, a Haskell DSL).
• Novel analyses and optimizations for DL (e.g. automatic differential privacy).
• Non-standard data types (e.g. unums, posits).
Lessons Learned
• Using a full-program representation we were able to:
  • Rephrase shape inference as type checking.
  • Use Relay as a platform to develop novel optimizations, such as automatic quantization.
  • Execute Relay programs via a variety of backends and hardware devices.
  • Demonstrate that an increase in expressiveness does not come at the cost of performance.
Conclusion
• Relay is a new intermediate representation for optimizing deep learning programs.
• We apply the straightforward insight that machine learning models are just programs.
• This generalization enables support for a greater range of programs, new optimizations, and the ability to target a wide range of devices.
• We are excited about production and research collaborations.

http://sampl.cs.washington.edu
http://tvm.ai