
CS 744: TVM - Shivaram Venkataraman, Fall 2020



  1. CS 744: TVM - Shivaram Venkataraman, Fall 2020. (Notes: TVM = Tensor Virtual Machine; code generation targets LLVM.)

  2. ADMINISTRIVIA: Assignment; course project titles; project proposal, aka Introduction (due 10/16) - writeup covers Introduction, Related Work, and Timeline (with eval plan); Midterm: Oct 22.

  3. MACHINE LEARNING: STACK. Training vs. inference: training is commonly distributed; inference runs just the forward pass. The stack spans from ML models and frameworks down to hardware, and each layer's design interplays with the others to make distributed training and inference easy.

  4. MOTIVATION: PERFORMANCE PORTABILITY. A model (e.g. from PyTorch) is lowered to compute primitives such as matrix multiply. We want high performance across hardware back ends, but today that means dependence on vendor-specific libraries and only the operators they happen to provide. Models evolve fast and need new operators, and a new model is often not just a combination of existing operators in vendor libraries.

  5. Python code describes the ML model → TVM → a binary file that runs on the hardware.

  6. ¥÷÷÷¥÷÷÷÷÷ :* " OPTIMIZATION COMPUTATION GRAPHS [ I " ÷ . :÷÷÷ : Operator Fusion - T " operators " 1-1 → map , ↳ ( Spg - - reduction , scaling after Sum → - Data layout Major , column - Major Kow is represented .in/TEHtEi/:::IS...:: : , , Infest Blocked NN 2- layer g teat layout as

  7. TENSOR EXPRESSION LANGUAGE. Each operator is expressed in a tensor expression language (tensor math operations); it covers common arithmetic and math operations. The expression declares the shape of the output and the data accessed.
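A minimal sketch of the idea, assuming nothing about TVM's real API: a `compute`-style helper takes an output shape plus an index expression, so the "compiler" (here, a plain loop) knows the output shape and exactly which data each element reads.

```python
import numpy as np

def compute(shape, f):
    # Toy analogue of a tensor-expression `compute`: output shape and the
    # per-element index expression are declared up front.
    out = np.empty(shape)
    for idx in np.ndindex(*shape):
        out[idx] = f(*idx)
    return out

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(12, dtype=float).reshape(3, 4)

# C[i, j] = sum_k A[i, k] * B[k, j], written as an index expression.
C = compute((2, 4), lambda i, j: sum(A[i, k] * B[k, j] for k in range(3)))
assert np.allclose(C, A @ B)
```

Because the expression is declarative, the same definition can later be lowered with different schedules (loop orders, tilings) without changing the math.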

  8. CODE GENERATION: expression + schedule (in the style of Halide). Nested parallelism: a nested loop (for i, for j) is mapped to threads, as in OpenMP; threads can cooperate through shared memory, and loop iterators can be used as thread indices. Tensorization: map a loop nest onto a hardware instruction (load, store, add) from the target's instruction set; extensible - it allows you to register an operator as a new hardware intrinsic.
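One schedule transform mentioned throughout the lecture is loop tiling; a sketch in plain Python (illustrative only - a real schedule emits this loop nest rather than running it interpreted):

```python
import numpy as np

def matmul_tiled(A, B, tile=4):
    # Tiled loop nest: iterate over (tile x tile) blocks so a small working
    # set stays hot in cache; this is one transform a schedule can apply.
    n, k = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile])
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(matmul_tiled(A, B), A @ B)
```

The outer block loops are also natural targets for the nested parallelism above: each (i0, j0) block can go to a different thread.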

  9. LATENCY HIDING. What is the goal? Overlap communication and computation: find a schedule that keeps both the memory units (bandwidth) and the compute units busy.

  10. AUTOMATING OPTIMIZATION. Goal: create a specialized operator for a given input shape and layout. Challenge: choose the appropriate schedule optimizations and their parameters - tiling size, loop unrolling, etc. - from lots of different choices. So: automate the optimizer! The key question is which configurations to try.
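As a baseline for "which configurations to try", one could simply time each candidate; this sketch (a hypothetical brute-force tuner, not TVM's approach) tunes the tile-size knob by measurement, which is exactly what the cost model on the next slides is meant to avoid doing for every config:

```python
import timeit
import numpy as np

def matmul_tiled(A, B, tile):
    # Same tiled matmul as before; `tile` is the configuration knob.
    n, k = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile])
    return C

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
candidates = [4, 8, 16, 32]   # the search space for one knob
timings = {t: timeit.timeit(lambda: matmul_tiled(A, B, t), number=3)
           for t in candidates}
best = min(timings, key=timings.get)
assert best in candidates
```

With several knobs (tiling per loop, unroll factors, thread counts) the product of choices explodes, which is why exhaustive timing does not scale.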

  11. ML-BASED COST MODEL. Use a machine learning model to predict config performance. Design choices - Speed: prediction must be much faster than the time it takes to actually evaluate a config (generating code and running it takes seconds). Quality: use a rank objective to predict the relative order of runtimes rather than absolute times. Model: a gradient tree boosting model over features such as memory access count, the reuse ratio of each memory buffer at each loop level, and a one-hot encoding of loop annotations.
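The rank objective means the model is judged on ordering, not absolute runtimes. A small sketch of that evaluation criterion (hypothetical numbers, illustration only): a model whose scores are on a completely wrong scale still gets full marks if every pair of configs is ordered correctly.

```python
def rank_accuracy(predicted, measured):
    # Fraction of config pairs whose relative order the model gets right.
    n = len(measured)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((predicted[i] < predicted[j]) == (measured[i] < measured[j])
                for i, j in pairs)
    return agree / len(pairs)

measured  = [8.0, 20.0, 12.0, 5.0]   # runtimes (ms) on real hardware
predicted = [0.9, 2.5, 1.1, 0.4]     # model scores: wrong scale, right order
assert rank_accuracy(predicted, measured) == 1.0
```

Ordering is all the tuner needs, since it only has to pick which candidates to actually run.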

  12. ML-BASED COST MODEL: iteration loop. Use the model to predict config performance (e.g. c1 → 20ms, c2 → 8ms, so c2 is the better candidate). Each iteration: select a batch of candidate configs; run them on hardware to collect (config, runtime) data; use that data as training data to update the model; repeat. How to select candidates? Parallel simulated annealing: start from a random config, walk to a nearby config, accept the step if the model's predicted cost decreases, else reject; run many such walks in parallel (e.g. across a cluster).
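A compact sketch of the candidate walk, assuming a toy one-knob search space and a stand-in cost function (classic simulated annealing, which also accepts some worsening steps; the slide's greedy accept/reject is the zero-temperature special case):

```python
import math
import random

def anneal(cost, start, neighbors, steps=200, temp=1.0, cool=0.98):
    # Walk to nearby configs; always accept improvements, accept worse
    # moves with probability exp(-delta/temp) to escape local minima.
    # `cost` stands in for the learned cost model's prediction.
    cur, cur_cost = start, cost(start)
    for _ in range(steps):
        cand = random.choice(neighbors(cur))
        delta = cost(cand) - cur_cost
        if delta < 0 or random.random() < math.exp(-delta / temp):
            cur, cur_cost = cand, cost(cand)
        temp *= cool
    return cur

# Toy space: tile sizes 1..64, with a pretend cost minimized at 16.
cost = lambda t: abs(t - 16)
neighbors = lambda t: [max(1, t - 1), min(64, t + 1)]
best = anneal(cost, start=random.randint(1, 64), neighbors=neighbors)
assert 1 <= best <= 64
```

Running several such walks in parallel, seeded differently, yields the batch of candidates that gets profiled on real hardware each iteration.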

  13. DISTRIBUTED DEVICE POOL: a pool of devices to speed up profiling; an RPC interface to run a trial on a device; device pools can be shared across multiple graphs being tuned.

  14. SUMMARY. TVM: a compiler for ML inference models; supports high performance for a range of models and hardware devices. Key ideas → graph-level optimizations (operator fusion) → tensor expression language (code generation, latency hiding, etc.) → ML-based cost model for automation.

  15. DISCUSSION https://forms.gle/WiVgJ3abGXXgfBN99

  16. Consider that you are building an optimizer for Spark programs instead of ML inference. What would be some configuration knobs that you could similarly tune? What might be different from the TVM optimizer? Notes: similar logic applies to latency hiding (overlap communication with computation), operator fusion (e.g. fusing map operations), and data access patterns. Different: user-defined operators are challenging - can you automate over them? Partitioning (number of partitions, co-partitioning) gives a large config space that matters for performance; persistence/caching must be inserted manually.

  17. What is your takeaway from the following figure? [figure: performance comparison; details not legible in the transcript]

  18. NEXT STEPS. Next class: Ray. Course project: Oct 16 (introductions). Midterm: Oct 22. [board sketch: latency hiding in Spark? - overlap rdd1's map tasks with communication for rdd2 so tasks do not wait]
