Towards Relay: a New IR for Machine Learning Frameworks
Jared Roesch, Steven Lyubomirsky, Logan Weber, Josh Pollock, Marisa Kirisame, Tianqi Chen, Zachary Tatlock
Tension between performance and flexibility 🐈 🧙
From OpenAI’s recent blog post: https://blog.openai.com/ai-and-compute/
“We believe the largest training runs today employ hardware that cost in the single digit millions of dollars to purchase (although the amortized cost is much lower).” (OpenAI blog)
Growing compute
• The community is addressing the need for cost-effective compute with new hardware designs.
• TPU, Trillium, A10 Bionic, Brainwave, etc.
• The hardware landscape is becoming very heterogeneous: a mix of CPUs, GPUs, and custom accelerators.
Growing compute
• Different operating environments; e.g., a model can be memory-hungry in the cloud but not on edge devices.
• Introducing new compute may increase runtime efficiency.
• But that doesn’t account for programming and porting costs, e.g., cloud FPGAs.
Leveraging diversity
• The current state of the art is to port and tweak models by hand for each hardware platform until they work.
• How do we write programs for many different devices and optimize for:
  • Memory
  • Quantization (see the sketch below)
  • New numeric representations
  • Model transforms
  • Layout change
  • Device scheduling
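To make one item on this list concrete, here is a minimal sketch of symmetric int8 weight quantization in NumPy. It is illustrative only: the helper names are made up and this is not Relay’s or TVM’s quantization pass.

import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: one scale per tensor, so matrix
    # multiplies can run on 8-bit integer hardware.
    scale = max(np.abs(w).max(), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small quantization error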
VTA
• Take our friend Thierry, who has been building new hardware accelerators for ML.
• How do we program the hardware?
• How do we port existing models?
• How do we adapt the software for different HW designs?
Portability + Flexibility
• We need models that can be effectively optimized and run on a variety of devices.
• We want generic models, but tuned implementations.
• Can we build custom hardware directly from model descriptions?
• “Write once, run everywhere”
TVM
• An end-to-end compiler stack for deep learning.
• Hierarchical intermediate representations, tightly integrated for tuning models for specific hardware targets.
• TVM is currently focused on producing high-performance operator implementations (see the sketch below).
• TVM is bottom-up.
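To make “operator description” and “schedule” concrete, here is a minimal vector-add sketch written against the TVM 0.x Python API of that era; the function name vadd and the array size are just for illustration.

import tvm
import numpy as np

# Tensor operator description: declare the computation symbolically.
n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.placeholder((n,), name="B")
C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule: decide how the computation is executed, then lower to a target.
s = tvm.create_schedule(C.op)
fadd = tvm.build(s, [A, B, C], target="llvm", name="vadd")

# Run the compiled operator.
ctx = tvm.cpu(0)
a = tvm.nd.array(np.random.rand(1024).astype("float32"), ctx)
b = tvm.nd.array(np.random.rand(1024).astype("float32"), ctx)
c = tvm.nd.array(np.zeros(1024, dtype="float32"), ctx)
fadd(a, b, c)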
Relay
• We contribute a new high-level IR for TVM named Relay.
• Generalize computation graphs to differentiable programs.
• Write Python (in the style of PyTorch) but apply end-to-end optimizations.
• Composed of a new front-end, IR, auto-diff, optimizer, backend, and runtime.
• Relay is top-down.
[Figure: the current TVM stack. Frameworks (CoreML, CNTK, ...) feed a Computational Graph, which goes through High-level Data-flow Rewriting, then Tensor Operator Descriptions and Schedules, and is compiled via LLVM / CUDA / Metal / OpenCL onto CPUs, GPUs, and accelerators, producing a Deployable Module. Example shown: an input image classified as “tabby, tabby cat”.]

Deploying a compiled module looks like:

graph, lib, params = t.compiler.build(graph, target, params)
module = runtime.create(graph, lib, t.cuda(0))
module.set_input(**params)
module.run(data=data_array)
output = t.nd.empty(out_shape, ctx=t.cuda(0))
module.get_output(0, output)

What Relay will replace: the Computational Graph and High-level Data-flow Rewriting layers of this same stack.
Why not current frameworks or IRs?
• We believe the key to optimizing programs effectively is a typed, whole-program representation of machine learning models.
• We will show how current frameworks’ IRs are lacking, then examine how Relay addresses these challenges.
DL Frameworks → Compilers
• We are at the dawn of the compiler age for deep learning.
• Framework designers realize performance is being left on the table, and frameworks are converging on compilation pipelines: XLA for TensorFlow, Glow for PyTorch, NNVM/TVM for MXNet.
• Other IRs are framework-first; we want to be IR-first!
• We need the “whole model” to do certain classes of optimization, analogous to “whole program” in traditional compilers (see the sketch below).
• But we want flexibility, portability, and performance!
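A minimal sketch of why the “whole model” matters, using a toy expression IR (these dataclasses are illustrative, not Relay’s actual data structures): constant folding can only fold across the call because the callee’s body is part of the program being optimized.

from dataclasses import dataclass
from typing import Any

@dataclass
class Const:
    value: float

@dataclass
class Add:
    lhs: Any
    rhs: Any

@dataclass
class Call:
    fn: str

# A "module" mapping function names to bodies: def bias(): return 1.0 + 2.0
FUNCS = {"bias": Add(Const(1.0), Const(2.0))}

def fold(e):
    # Fold constants, inlining calls as we go. This only works across the
    # call boundary because the whole module (callee bodies) is visible.
    if isinstance(e, Add):
        l, r = fold(e.lhs), fold(e.rhs)
        if isinstance(l, Const) and isinstance(r, Const):
            return Const(l.value + r.value)
        return Add(l, r)
    if isinstance(e, Call):
        return fold(FUNCS[e.fn])
    return e

print(fold(Add(Call("bias"), Const(4.0))))  # Const(value=7.0)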
Advantages:
+ Embedded domain specific language
+ Dataflow graph gives rise to straightforward execution and scheduling.
+ The graph is easy to optimize and compile, for example static memory planning.
+ XLA-style compilation is straightforward.

Disadvantages:
- Embedded domain specific language
- Users write programs to build a graph and later execute it.
- Staging can be complex and confusing.
- The IR is a computation graph (i.e., a dataflow graph) with embedded control and mutation.
- Ex. What does the gradient of an impure function mean?
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)
loss = tf.reduce_sum((y - y_pred) ** 2.0)
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

for _ in range(500):
    # Need to evaluate loss; sess.run executes the graph
    loss_value, _, _ = sess.run(
        [loss, new_w1, new_w2],
        feed_dict={x: x_value, y: y_value})

Adapted from: https://github.com/jcjohnson/pytorch-examples/blob/master/autograd/two_layer_net_autograd.py
Advantages:
+ Shallow embedding, users just interact with normal Python APIs.
+ Expressive, can use all of Python to interact with PyTorch, as it is the execution layer up to tensors.
+ Trace-based auto-diff over a subset of Python, can handle arbitrary control flow.
+ Can accelerate pieces using Glow and Tensor Comprehensions.

Disadvantages:
- Trace-based JIT and exporting only capture specific execution traces.
- Not “whole model”.
- Python is the “control plane”; C extensions are the “data plane”, so acceleration requires C extensions.
- Incredibly limited and brittle export functionality.
Tracing-based tools fail if traces change at all (i.e., essentially a static graph).
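A minimal sketch of that failure mode using torch.jit.trace (the function f and its inputs are made up for illustration): the trace bakes in whichever branch the example input happened to take, and PyTorch emits a TracerWarning here.

import torch

def f(x):
    # Data-dependent control flow: a trace records only one branch.
    if x.sum() > 0:
        return x + 1
    return x - 1

traced = torch.jit.trace(f, torch.ones(3))  # the example input takes the "+1" branch
print(f(-torch.ones(3)))       # eager Python: tensor([-2., -2., -2.])
print(traced(-torch.ones(3)))  # trace replays "+1": tensor([0., 0., 0.])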
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

for t in range(500):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    loss.backward()

    # Updates can be implemented in vanilla Python
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

Adapted from: https://github.com/jcjohnson/pytorch-examples/blob/master/autograd/two_layer_net_autograd.py
System Design
[Figure: the existing TVM stack. Frameworks (CoreML, CNTK, ...) → Computational Graph → High-level Data-flow Rewriting → Tensor Operator Description → Schedule → LLVM / CUDA / Metal / OpenCL → Accelerators.]
System Design
[Figure: the stack with Relay. Frameworks (CoreML, CNTK, ...) and a Relay Python decorator feed Relay, which performs fusion, layout change, partial evaluation, and traditional optimizations, and handles control operators; below sit the Tensor Operator Description and Schedule layers, the Relay runtime system, and the Hardware Implementation.]
Language
• Functional higher-order language
• Closures
• Tensors
• Control flow
• References
• Shape-dependent type system
• Differentiable
Some of these are old PL features you know and love; others pose new challenges. A toy encoding of the constructs is sketched below.
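The following is a toy Python encoding of the constructs above; it is illustrative only and does not match Relay’s actual AST or node names.

from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class TensorType:      # shape-dependent type: dtype plus a concrete shape
    dtype: str
    shape: Tuple[int, ...]

@dataclass
class Var:
    name: str
    ty: TensorType = None

@dataclass
class Function:        # first-class functions give closures for free
    params: List[Var]
    body: Any
    ret_ty: TensorType = None

@dataclass
class Call:
    op: Any            # a primitive operator name or a Function
    args: List[Any]

@dataclass
class If:              # control flow is an expression, not a graph hack
    cond: Any
    then_branch: Any
    else_branch: Any

@dataclass
class RefNew:          # references model mutable state (e.g. parameters)
    init: Any

# y_hat = a * x + b, written as an explicit expression tree
a, b, x = (Var(n, TensorType("float32", (1, 1))) for n in "abx")
y_hat = Call("broadcast_add", [Call("broadcast_mul", [a, x]), b])
linear = Function([a, b, x], y_hat, TensorType("float32", (1, 1)))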
Frontend
• Our current frontend is a subset of Python.
• We use AST rewriting to transform the Python program into our IR directly (a sketch of the idea follows the example).
• We can statically analyze this subset and type check it.
• We rely on MyPy’s infrastructure (annotations and typed_ast). 🧙

@relay
def linear_loss(a, b, x, y):
    y_hat = a * x + b
    return (y - y_hat)**2
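A minimal sketch of the AST-rewriting idea, not the actual Relay frontend: Python’s ast module rewrites arithmetic on tensors into explicit calls, using the relay.broadcast_* names from the desugared example that follows.

import ast

# Map Python binary operators to the broadcasting operator names.
OP_MAP = {ast.Add: "broadcast_add", ast.Mult: "broadcast_mul", ast.Sub: "broadcast_sub"}

class ToRelay(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)        # rewrite inner expressions first
        op_name = OP_MAP[type(node.op)]
        # a * x  ->  relay.broadcast_mul(a, x)
        return ast.Call(
            func=ast.Attribute(value=ast.Name(id="relay", ctx=ast.Load()),
                               attr=op_name, ctx=ast.Load()),
            args=[node.left, node.right],
            keywords=[])

tree = ToRelay().visit(ast.parse("y_hat = a * x + b"))
ast.fix_missing_locations(tree)         # fill in line/col info for new nodes
print(ast.unparse(tree))                # Python 3.9+:
# y_hat = relay.broadcast_add(relay.broadcast_mul(a, x), b)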
If we remove all syntactic sugar we can see a little more of what’s going on. We can use Python’s type annotations to provide type info.

@relay
def linear_loss(a: Tensor[Float, (1, 1)],
                b: Tensor[Float, (1, 1)],
                x: Tensor[Float, (1, 1)],
                y: Tensor[Float, (1, 1)]) -> Tensor[Float, (1, 1)]:
    y_hat = relay.broadcast_add(relay.broadcast_mul(a, x), b)
    diff = relay.broadcast_sub(y, y_hat)
    return relay.broadcast_mul(diff, diff)
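Those Tensor[Float, (1, 1)] annotations make shapes part of the function’s type, so a shape-dependent type checker needs at minimum a broadcasting rule for the broadcast_* operators. A minimal sketch of the NumPy-style rule, illustrative rather than Relay’s actual type inference:

def broadcast_shape(s1, s2):
    # Right-aligned broadcasting: each pair of dims must be equal or one
    # of them must be 1; the result takes the larger of the two.
    out = []
    for d1, d2 in zip((1,) * (len(s2) - len(s1)) + tuple(s1),
                      (1,) * (len(s1) - len(s2)) + tuple(s2)):
        if d1 != d2 and d1 != 1 and d2 != 1:
            raise TypeError(f"shape mismatch: {s1} vs {s2}")
        out.append(max(d1, d2))
    return tuple(out)

assert broadcast_shape((1, 1), (1, 1)) == (1, 1)   # linear_loss above
assert broadcast_shape((4, 1), (1, 3)) == (4, 3)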