TVM + AWS
Vin Sharma, Amazon SageMaker Neo
Amazon: vinarm@ | Twitter: ciphr@
How is AWS using TVM?
• As a back-end for Apache MXNet
  • To deploy easily onto edge devices
  • To improve performance on target hardware
• As an optimizer for Amazon AI services
  • Amazon Rekognition: to improve end-to-end latency
  • Amazon Alexa: to increase resource efficiency on Echo/Dot
• In a tool chain for Amazon Inferentia
We’re Hiring!
How is AWS enabling adoption of TVM?
In a new service called Amazon SageMaker Neo
• Framework
• Model input files: MXNet: .json & .params
• Name and shape of input node: {“data”:[1,3,227,277]}
• Target platform: cloud instance type | edge device
• Output location
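In practice, the inputs listed above are wired together in a Neo compilation job. A hedged sketch of the request body for the SageMaker CreateCompilationJob API (the job name, bucket paths, and role ARN are placeholders; the field names follow the public API):

```json
{
  "CompilationJobName": "my-mxnet-compile-job",
  "RoleArn": "arn:aws:iam::123456789012:role/NeoCompilationRole",
  "InputConfig": {
    "S3Uri": "s3://my-bucket/model.tar.gz",
    "DataInputConfig": "{\"data\":[1,3,227,277]}",
    "Framework": "MXNET"
  },
  "OutputConfig": {
    "S3OutputLocation": "s3://my-bucket/compiled/",
    "TargetDevice": "rasp3b"
  },
  "StoppingCondition": { "MaxRuntimeInSeconds": 900 }
}
```

`DataInputConfig` carries the input node name and shape from the slide; `TargetDevice` selects the cloud instance type or edge device (here a Raspberry Pi 3B as an example).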
How is AWS contributing to TVM?
Releasing all TVM modifications and enhancements in Neo to open source
• Frameworks: TensorFlow, MXNet, PyTorch, ONNX
• Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
• Operators: several new ops in NNVM/TVM
• Optimizations: node annotation, graph partitioning, ring buffer, NHWC, graph tuning
• Acceleration library: Nvidia TensorRT
• Hardware: cross-compilation to ARM, Intel, Nvidia; more coming soon
Chen Tian, Technical VP
TVM on Huawei’s AI portfolio
• Application Enablement — HiAI Service, ModelArts: full-pipeline services (ModelArts), hierarchical application APIs, and pre-integrated solutions
• Framework — MindSpore: unified training and inference framework for device, edge, and cloud (both standalone and cooperative); TensorFlow, PyTorch, PaddlePaddle, and others are also supported
• Chip Enablement — CANN (Compute Architecture for Neural Networks): chip operator library and highly automated operator development toolkit (Tensor Engine / TVM, CCE lib/extensions)
• IP & Chip — Ascend: AI chip series based on a unified, scalable architecture (Ascend-Nano, Ascend-Tiny, Ascend-Lite, Ascend-Mini, Ascend-Max)
• Deployment targets: consumer device, IoT device, edge computing, private cloud, industrial, public cloud
Huawei Confidential
How do we use TVM?
Frameworks → model conversion → third-party operators → TE/TVM model execution
During model conversion we use TE/TVM to customize operators for completeness and performance. 70+ operators are written with TVM, bringing us a ~3x improvement in development efficiency.
Successful Practice with Audi in Level 4 Autonomous Driving
~ A Complete City Commute Record ~
Driving in the evening · High-speed cruise · Traffic Jam Pilot (TJP) · Traffic light identification · Pedestrian identification · Automatic parking
The jointly developed autonomous driving algorithm achieves leading scores on the authoritative KITTI 2D/3D/BEV benchmarks.
TVM is working on Atlas series products
• Atlas 200 Developer Kit: 16 TOPS INT8@24 W; 1 USB Type-C, 2 CCM interfaces, 1 GE network port, 1 SD card slot; 8 GB memory
• Atlas 300 AI Accelerator Card: 64 TOPS INT8@75 W; 64-channel HD video real-time analysis and JPEG decoding; 32 GB memory, 204.8 GB/s memory bandwidth; PCIe 3.0 x16, half-height half-length card
• Atlas 500 AI Edge Station: capable of processing 16-channel HD videos in the size of a set-top box (STB); delivers 4x higher performance over counterparts
• Atlas 800 AI Appliance: provides an optimized AI environment based on the standard framework and programming environment; leverages high-performance GPU scheduling algorithms, improving resource utilization by over 15%
Application areas: Smart Transportation (traffic light tuning, intelligent traffic guiding), Smart Manufacturing (intelligent quality inspection and flexible manufacturing), Intelligent Care (kindergarten and elderly care)
Huawei’s Contributions to TVM
8 contributors: kun-zh, sgrechanik-h, libing4752, derisavi-huawei, solin319, ehsanmok, gaoxiong-1, jiacunjiang1215
4 reviewers: Srkreddy1238, PariksheetPinjari909, siju-Samuel, Xqdan
We are working on:
1. Huawei Ascend ASIC support
2. Front ends for Darknet and ONNX
3. Optimizations in AutoTVM; IR extensions
4. Tensorize, cache read/write, and the access_ptr API
In the future we will try:
1. Codegen for fused operators
2. NLP support
3. More optimizations
4. Training operators
Meghan Cowan
VGG11 on Raspberry Pi 3B
Trained binarized model; operators implemented with TVM
• TensorFlow Lite (32-bit fp): 66% top-1 ImageNet accuracy, 1.42 fps
• TVM (2-bit activation, 1-bit weight): 62% top-1 ImageNet accuracy, 4.67 fps
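The speedup comes from replacing floating-point multiply-accumulates with bitwise AND and popcount over packed bitplanes. A pure-Python sketch of the underlying arithmetic (illustrative only, not the actual TVM kernels):

```python
def binarized_dot(acts, weights):
    """Dot product of 2-bit unsigned activations with {-1,+1} binary weights,
    computed with bitwise AND + popcount over packed bitplanes."""
    # Pack the two activation bitplanes and the +1-weight positions into ints.
    a0 = sum(((a >> 0) & 1) << i for i, a in enumerate(acts))   # LSB plane
    a1 = sum(((a >> 1) & 1) << i for i, a in enumerate(acts))   # MSB plane
    w = sum(1 << i for i, wi in enumerate(weights) if wi == 1)  # +1 positions
    popcount = lambda x: bin(x).count("1")
    # sum(a_i * w_i) = 2 * (sum of a_i where w_i = +1) - (sum of all a_i)
    pos = popcount(a0 & w) + 2 * popcount(a1 & w)
    total = popcount(a0) + 2 * popcount(a1)
    return 2 * pos - total

acts = [3, 1, 0, 2]        # 2-bit activations in [0, 3]
weights = [1, -1, 1, -1]   # 1-bit weights in {-1, +1}
print(binarized_dot(acts, weights))               # 0
print(sum(a * w for a, w in zip(acts, weights)))  # 0 (naive reference)
```

On real hardware each AND/popcount covers 32 or 64 lanes at once, which is where the ~3x throughput gain over 32-bit float comes from despite the extra bitplane passes.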
Further down the stack…
Thierry Moreau
Open Source Stack Overview
Versatile Tensor Accelerator (VTA) stack:
• High-Level Differentiable IR
• Tensor Expression IR
• VTA Runtime & JIT Compiler
• VTA Hardware/Software Interface (ISA)
• VTA MicroArchitecture / VTA Simulator

VTA backends:
• Simulator: out-of-the-box testing to write compiler passes
• FPGA: fast design iteration, quick deployment, flexibility
• ASIC: industrial-strength efficiency
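Switching between these backends is a configuration change rather than a code change. A hedged sketch of the VTA configuration file (field names follow `vta_config.json` in the VTA repository; the exact set of fields varies by release):

```json
{
  "TARGET": "sim",
  "HW_VER": "0.0.1",
  "LOG_INP_WIDTH": 3,
  "LOG_WGT_WIDTH": 3,
  "LOG_ACC_WIDTH": 5,
  "LOG_BATCH": 0,
  "LOG_BLOCK": 4,
  "LOG_UOP_BUFF_SIZE": 15,
  "LOG_INP_BUFF_SIZE": 15,
  "LOG_WGT_BUFF_SIZE": 18,
  "LOG_ACC_BUFF_SIZE": 17
}
```

Setting `TARGET` to `"sim"` selects the cycle-accurate simulator; an FPGA target such as `"pynq"` retargets the same stack to hardware. The `LOG_*` fields are log2 sizes of data widths and on-chip buffers, i.e. the hardware knobs explored below.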
Hardware Exploration with VTA
HW/SW constraints:
• FPGA: # BRAMs, DRAM channels, logic resources
• Model: channel width, batch size, data types

Architecture knobs:
• GEMM intrinsic: e.g. (1,32) x (32,32) vs. (4,16) x (16,16)
• # of units in tensor ALU: e.g. 32 vs. 16
• BRAM allocation between buffers, register file, micro-op cache

Circuit knobs:
• Circuit pipelining: e.g. for the GEMM core, between 11 and 20 stages
• PLL frequency sweeps: e.g. 250 vs. 300 vs. 333 MHz

This yields 1000s of combinations, narrowed to ~10 VTA candidate designs that pass place & route and timing closure, e.g.:
#1 Design AAA @ 307 GOPs
#2 Design BBB @ 307 GOPs
#3 Design CCC @ 307 GOPs
#4 Design DDD @ 256 GOPs
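The funnel above can be sketched as brute-force enumeration: generate every knob combination, keep only those that fit the resource budget, then (in the real flow) run place & route on the survivors. A toy Python sketch; the knob values mirror the examples above, but the resource model and budget are made up:

```python
from itertools import product

# Illustrative knobs (resource costs below are invented for the sketch).
gemm_intrinsics = [((1, 32), (32, 32)), ((4, 16), (16, 16))]
alu_units = [16, 32]
pipeline_stages = range(11, 21)
pll_mhz = [250, 300, 333]

def bram_cost(intrinsic, alu):
    # Hypothetical resource model: BRAMs scale with GEMM tile size and ALU width.
    (batch, in_ch), (_, out_ch) = intrinsic
    return batch * in_ch * out_ch // 64 + alu // 4

candidates = []
for intr, alu, stages, mhz in product(gemm_intrinsics, alu_units,
                                      pipeline_stages, pll_mhz):
    if bram_cost(intr, alu) <= 140:  # FPGA BRAM budget (made up)
        # Peak GOPs = 2 * MACs-per-cycle * clock rate
        gops = 2 * intr[0][0] * intr[0][1] * intr[1][1] * mhz / 1000
        candidates.append((gops, intr, alu, stages, mhz))

candidates.sort(reverse=True)  # best peak throughput first
print(len(candidates), "designs fit the budget; top:", candidates[0])
```

The real flow then hands the top handful of designs to FPGA place & route and timing closure, which prunes the thousands of combinations down to roughly ten viable bitstreams.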
Schedule Exploration with VTA
For each VTA candidate design (e.g. AAA/BBB/CCC @ 307 GOPs, DDD @ 256 GOPs), AutoTVM tunes operator performance: measured throughput climbs over autotuning steps toward each design’s peak (307 GOPs vs. 256 GOPs).
Deliverable: model → graph optimizer → tuned operator library → custom VTA design (e.g. BBB) running on the FPGA.
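Schedule tuning is a search over schedule knobs (tile sizes, etc.) driven by on-device measurement. AutoTVM guides this search with a learned cost model, but a random-search sketch shows the shape of the loop; the cost function here is a fictitious stand-in for timing the schedule on real hardware:

```python
import random

random.seed(0)

def measure_gops(tile_x, tile_y):
    # Stand-in for compiling the schedule, running it on the device, and
    # timing it. Peaks (fictitiously) at tile_x=8, tile_y=16, with noise.
    base = 307 - 3 * abs(tile_x - 8) - 2 * abs(tile_y - 16)
    return base + random.uniform(-1, 1)

knobs = [(tx, ty) for tx in (2, 4, 8, 16) for ty in (4, 8, 16, 32)]
best, best_gops = None, float("-inf")
for _ in range(20):  # 20 autotuning trials
    cfg = random.choice(knobs)
    gops = measure_gops(*cfg)
    if gops > best_gops:
        best, best_gops = cfg, gops
print("best schedule:", best, "at about", round(best_gops), "GOPs")
```

Because the measured optimum differs per hardware design, this tuning loop is re-run for every VTA candidate, producing the per-design throughput curves on the slide.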
TVM+VTA Stack Goals
• Blueprint for a complete deep learning acceleration stack
• Experimentation framework for cross-stack deep learning optimizations
• Open-source community for industrial-strength deep learning acceleration