VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018
TVM Stack: Transparent End-to-End Deep Learning System Stack
• High-Level Differentiable IR
• Tensor Expression IR
• Backends: LLVM, CUDA, Metal, and VTA (Open Hardware Accelerator)
• VTA targets: Edge FPGA, Cloud FPGA, ASIC
TVM+VTA Stack Goals
• Blueprint for a complete deep learning acceleration stack
• Experimentation framework for cross-stack deep learning optimizations
• Open-source community for industrial-strength deep learning acceleration
VTA Overview
• Extensible hardware architecture
• Programmability across the stack
• Facilitates HW-SW co-design
VTA: General DL Architecture — configurable along four dimensions:
• Tensor intrinsic [figure: alternative tensor operation shapes, e.g. 8×1 vs. 8×8]
• Hardware datatype: e.g. <16 x i8> vs. <32 x i4>
• Memory subsystem [figure: alternative on-chip memory organizations]
• Operation support: e.g. {ADD, MUL, SHL, MAX} vs. {ADD, SHL, MAX}
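The tensor-intrinsic and datatype choices above can be sketched as a toy model. This is not the VTA RTL: the 8×8 block shape, the function name, and the int32 accumulator width are assumptions for illustration, chosen to show why narrow int8 inputs need a wider accumulator.

```python
import numpy as np

# Illustrative model of one invocation of an 8x8 GEMM tensor intrinsic.
# Inputs use the narrow "hardware datatype" (int8); partial sums accumulate
# into a wider int32 register file so repeated updates don't overflow.
BATCH, BLOCK_IN, BLOCK_OUT = 1, 8, 8

def gemm_intrinsic(acc, activations, weights):
    """acc: int32 (BATCH, BLOCK_OUT); activations: int8 (BATCH, BLOCK_IN);
    weights: int8 (BLOCK_OUT, BLOCK_IN). Returns the updated accumulator."""
    # Widen before multiplying so int8 products don't wrap around.
    prod = activations.astype(np.int32) @ weights.astype(np.int32).T
    return acc + prod

acc = np.zeros((BATCH, BLOCK_OUT), dtype=np.int32)
a = np.full((BATCH, BLOCK_IN), 2, dtype=np.int8)
w = np.full((BLOCK_OUT, BLOCK_IN), 3, dtype=np.int8)
acc = gemm_intrinsic(acc, a, w)
print(acc[0, 0])  # 2 * 3 summed over 8 inputs = 48
```

Changing BLOCK_IN/BLOCK_OUT (or the input dtype) here mirrors the hardware design space the slide enumerates.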
VTA Hardware Architecture
Philosophy: simple hardware; provide software-defined flexibility.
[diagram: an INSTRUCTION FETCH MODULE reads from DRAM and dispatches to LOAD, COMPUTE, and STORE command queues; the COMPUTE MODULE contains a register file, micro-op buffer, vector ALU, and tensor core; the LOAD and STORE modules move data through the input, weight, and store buffers; modules synchronize via dependency queues (LD→CMP, CMP→ST, CMP→LD, ST→CMP)]
Pipelining Tasks to Hide Memory Latency
[diagram: a monolithic design serializes LD, EX, …, ST; splitting work across load, execute, and store stages overlaps each tile's load with the previous tile's compute, yielding latency savings]
• LD: load, EX: compute, ST: store
• Low-level synchronization between tasks is explicitly managed by the software
Two-Level ISA Overview
Provides the right tradeoff between expressiveness and code compactness:
• CISC instructions perform multi-cycle tasks: DENSE, ALU, LOAD, STORE
• RISC micro-ops perform single-cycle tensor operations, e.g.:
  R0: R0 + GEMM(A8, W3)
  R2: MAX(R0, ZERO)
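The two micro-ops on this slide can be mimicked in a tiny interpreter sketch. The opcode names, buffer layout, and 8-wide block size are assumptions for illustration, not VTA's actual micro-op encoding.

```python
import numpy as np

# Sketch of a RISC micro-op interpreter: an int32 accumulator register file,
# an int8 activation buffer (A*), and an int8 weight buffer (W*).
N = 8  # assumed tensor block size
regfile = {0: np.zeros((N,), np.int32), 2: np.zeros((N,), np.int32)}
act = {8: np.ones((N,), np.int8)}        # A8
wgt = {3: np.full((N, N), 2, np.int8)}   # W3

def step(op, dst, *srcs):
    if op == "GEMM":   # R[dst] += W @ A — one single-cycle tensor op
        a, w = srcs
        regfile[dst] = regfile[dst] + wgt[w].astype(np.int32) @ act[a].astype(np.int32)
    elif op == "MAX":  # R[dst] = max(R[src], 0), i.e. a ReLU
        (src,) = srcs
        regfile[dst] = np.maximum(regfile[src], 0)

step("GEMM", 0, 8, 3)   # R0: R0 + GEMM(A8, W3)
step("MAX", 2, 0)       # R2: MAX(R0, ZERO)
print(regfile[2][0])    # 1 * 2 summed over 8 = 16
```

A CISC instruction like DENSE would replay a sequence of such steps over many tiles, which is why the micro-op level can stay this simple.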
VTA RISC Micro-Kernels
Multiple RISC instructions define a micro-kernel, which can be invoked by a CISC instruction:
• CONV2D: layout=NCHW, chan=128, kernel=(3,3), padding=(1,1), strides=(1,1)
• CONV2D: layout=NCHW, chan=256, kernel=(1,1), padding=(0,0), strides=(2,2)
• CONV2D_TRANSPOSE: ...
• GROUP_CONV2D: ...
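One way to picture the mechanism is a cache of micro-op sequences keyed by operator configuration; a CISC instruction then just names which cached micro-kernel to replay. The cache structure and the placeholder GEMM ops below are assumptions for illustration, not VTA's actual runtime.

```python
# Sketch: each distinct CONV2D configuration gets its own micro-kernel,
# i.e. an unrolled sequence of RISC micro-ops generated once and reused.
ukernel_cache = {}

def compile_conv2d(layout, chan, kernel, padding, strides):
    """Build (or reuse) the micro-kernel for one CONV2D configuration."""
    key = (layout, chan, kernel, padding, strides)
    if key not in ukernel_cache:
        kh, kw = kernel
        # Placeholder body: one GEMM micro-op per kernel tap.
        ukernel_cache[key] = [("GEMM", dy, dx) for dy in range(kh) for dx in range(kw)]
    return key

def cisc_conv2d(key):
    """The CISC instruction replays the cached micro-kernel over output tiles."""
    return ukernel_cache[key]

key = compile_conv2d("NCHW", 128, (3, 3), (1, 1), (1, 1))
print(len(cisc_conv2d(key)))  # 9 micro-ops: one GEMM per 3x3 kernel tap
```

This is what makes the flexibility software-defined: supporting CONV2D_TRANSPOSE or GROUP_CONV2D means emitting a new micro-op sequence, not changing the hardware.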
Micro-kernel programming gives us software-defined flexibility [figure: diverse workloads on the same hardware, e.g. DCGAN and ResNet50 on a “cat” image]
How is VTA Programmed?