

  1. VTA: Open & Flexible DL Acceleration. Thierry Moreau, TVM Conference, Dec 12th 2018

  2-8. TVM Stack: a High-Level Differentiable IR and a Tensor Expression IR lower onto LLVM, CUDA, and Metal backends, and onto VTA, an open hardware accelerator targeting edge FPGA, cloud FPGA, and ASIC. Together these layers form a transparent end-to-end deep learning system stack.

  9-12. TVM+VTA Stack Goals:
  • Blueprint for a complete deep learning acceleration stack
  • Experimentation framework for cross-stack deep learning optimizations
  • Open-source community for industrial-strength deep learning acceleration

  13-14. VTA Overview: Extensible Hardware Architecture; Programmability Across the Stack; Facilitates HW-SW Co-Design

  15-19. VTA: General DL Architecture. The architecture template is customizable along four axes: the tensor intrinsic (the shape of the matrix/vector compute unit), the hardware datatype (e.g. <16 x i8> vs. <32 x i4>), the memory subsystem (on-chip buffer organization), and the operation support (e.g. {ADD, MUL, SHL, MAX} vs. {ADD, SHL, MAX}).
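To make the four customization axes concrete, here is a hypothetical variant description in Python. The real VTA exposes similar knobs through a JSON configuration file (vta_config.json), but every name and value below is illustrative rather than VTA's actual schema.

```python
# Hypothetical parameterization sketch mirroring the slide's four
# customization axes; names and values are illustrative only.
vta_variant = {
    # Tensor intrinsic: GEMM core shape (batch x in-block x out-block)
    "GEMM_SHAPE": (1, 16, 16),
    # Hardware datatype: operand widths, e.g. <16 x i8> vs. <32 x i4>
    "INP_DTYPE": "int8",
    "WGT_DTYPE": "int8",
    "ACC_DTYPE": "int32",
    # Memory subsystem: on-chip buffer sizes in bytes
    "INP_BUFF_BYTES": 32 * 1024,
    "WGT_BUFF_BYTES": 256 * 1024,
    "ACC_BUFF_BYTES": 128 * 1024,
    # Operation support: which single-cycle ALU ops this variant provides
    "ALU_OPS": {"ADD", "MUL", "SHL", "MAX"},
}

def alu_supports(op: str) -> bool:
    """Hypothetical compiler-side check before lowering an op to this variant."""
    return op in vta_variant["ALU_OPS"]
```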

  20-22. VTA Hardware Architecture. Philosophy: simple hardware, provide software-defined flexibility. (Block diagram: an instruction fetch module dispatches work to load, compute, and store modules through per-module command queues; the modules synchronize through dependence queues LD→CMP, CMP→LD, CMP→ST, and ST→CMP; the compute module holds a register file, a micro-op buffer, a vector ALU, and a tensor core; the load and store modules move data between DRAM and the input, weight, and store buffers.)
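A small sketch of the dependence-queue topology recoverable from the block diagram; the queue names mirror the diagram's labels, while the validation helper is a hypothetical compiler-side check, not part of VTA.

```python
# Dependence-queue topology from the block diagram: synchronization
# tokens can only flow along these four producer -> consumer queues.
DEP_QUEUES = {
    ("LD", "CMP"), ("CMP", "LD"),   # load <-> compute handshake
    ("CMP", "ST"), ("ST", "CMP"),   # compute <-> store handshake
}

def can_sync(producer: str, consumer: str) -> bool:
    """Hypothetical check: does the hardware provide a token queue here?"""
    return (producer, consumer) in DEP_QUEUES

assert can_sync("LD", "CMP")        # load hands a tile to compute
assert not can_sync("LD", "ST")     # load and store never sync directly
```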

  23-26. Pipelining Tasks to Hide Memory Latency (LD: load, EX: compute, ST: store). A monolithic design runs tasks back-to-back: LD EX LD EX ... ST. Splitting work into load, execute, and store stages lets loads run ahead of computes, yielding latency savings; low-level synchronization between tasks is explicitly managed by the software, as the sketch below illustrates.
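A minimal runnable sketch of that software-managed synchronization, assuming a simple double-buffered load/execute pair. The queue names echo the LD→CMP and CMP→LD dependence queues; the tile count and buffer depth are made up for illustration.

```python
import queue
import threading

# Hypothetical sketch of VTA-style task pipelining: load and compute run
# concurrently and synchronize only through explicit dependence tokens.
NUM_TILES = 4
ld_to_cmp = queue.Queue()   # LD signals "tile ready" to EX
cmp_to_ld = queue.Queue()   # EX signals "buffer slot free" back to LD
DOUBLE_BUFFERS = 2          # two slots let LD run one tile ahead of EX

def load_stage():
    for i in range(NUM_TILES):
        cmp_to_ld.get()              # wait for a free buffer slot
        print(f"LD: tile {i} loaded")
        ld_to_cmp.put(i)             # token: tile i is ready to compute

def execute_stage():
    for _ in range(NUM_TILES):
        i = ld_to_cmp.get()          # wait for a loaded tile
        print(f"EX: tile {i} computed")
        cmp_to_ld.put(None)          # token: buffer slot freed

for _ in range(DOUBLE_BUFFERS):      # prime: both buffer slots start free
    cmp_to_ld.put(None)
t = threading.Thread(target=load_stage)
t.start()
execute_stage()
t.join()
```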

  27-31. Two-Level ISA Overview. Provides the right tradeoff between expressiveness and code compactness:
  • CISC instructions perform multi-cycle tasks: DENSE, ALU, LOAD, STORE
  • RISC micro-ops perform single-cycle tensor operations, e.g. R0: R0 + GEMM(A8, W3) and R2: MAX(R0, ZERO)
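The slide's running example can be made concrete with a small encoding sketch. The enum values, field names, and dictionary layout below are hypothetical, not VTA's actual instruction format.

```python
from enum import Enum

# Hypothetical two-level ISA sketch: a CISC instruction points at a
# sequence of single-cycle RISC micro-ops held in the micro-op buffer.
class UopOp(Enum):
    GEMM = 0
    MAX = 1

def uop(op, dst, src0, src1):
    """Pack one RISC micro-op: opcode plus register/buffer indices."""
    return {"op": op, "dst": dst, "src0": src0, "src1": src1}

# Micro-kernel for the slide's example:
#   R0 <- R0 + GEMM(A8, W3)   (accumulate a tensor product)
#   R2 <- MAX(R0, ZERO)       (fused ReLU)
micro_kernel = [
    uop(UopOp.GEMM, dst=0, src0=8, src1=3),    # acc reg 0 += A[8] x W[3]
    uop(UopOp.MAX,  dst=2, src0=0, src1=None), # src1=None: immediate zero
]

# A CISC DENSE instruction then just names the micro-op range, keeping
# the instruction stream compact.
dense_insn = {"opcode": "DENSE", "uop_begin": 0, "uop_end": len(micro_kernel)}
```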

  32-37. VTA RISC Micro-Kernels. Multiple RISC instructions define a micro-kernel, which can be invoked by a CISC instruction. Example micro-kernels:
  • CONV2D: layout=NCHW, chan=128, kernel=(3,3), padding=(1,1), strides=(1,1)
  • CONV2D: layout=NCHW, chan=256, kernel=(1,1), padding=(0,0), strides=(2,2)
  • CONV2D_TRANSPOSE: ...
  • GROUP_CONV2D: ...
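A sketch of why micro-kernels give software-defined flexibility: the compiler can emit a different RISC sequence for each operator configuration and load it into the micro-op buffer, so one hardware design serves many operators. The generator below and its indexing scheme are hypothetical.

```python
# Hypothetical micro-kernel generator: software emits a RISC micro-op
# sequence specialized to one operator's parameters, and the same
# hardware executes whichever kernel is loaded into the micro-op buffer.
def gen_conv2d_microkernel(kernel=(3, 3), in_blocks=8):
    """Emit GEMM micro-ops for one output tile of a conv2d.

    Each micro-op accumulates one (kernel position, input block) partial
    product into accumulator register 0. Indexing is illustrative.
    """
    uops = []
    for kh in range(kernel[0]):
        for kw in range(kernel[1]):
            for ic in range(in_blocks):
                idx = kh * kernel[1] * in_blocks + kw * in_blocks + ic
                uops.append(("GEMM", 0, idx, idx))  # acc0 += inp x wgt
    return uops

# A 3x3 conv and a 1x1 conv map to different micro-kernels on the same
# hardware: the flexibility is defined in software.
print(len(gen_conv2d_microkernel(kernel=(3, 3))))  # 72 micro-ops
print(len(gen_conv2d_microkernel(kernel=(1, 1))))  # 8 micro-ops
```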

  38. VTA RISC Micro-Kernels: micro-kernel programming gives us software-defined flexibility, demonstrated across workloads such as DCGAN and ResNet50.

  39. How is VTA Programmed?
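The deck answers this with the TVM schedule language. As a minimal sketch, assuming a TVM checkout with the vta Python package importable: the commented schedule calls follow the public VTA tutorials, but treat the exact signatures as approximate.

```python
import vta

# Hardware parameters come from the active VTA configuration.
env = vta.get_env()

# The tensor intrinsic shape that computations must be tiled to:
print(env.BATCH, env.BLOCK_IN, env.BLOCK_OUT)  # e.g. 1 16 16

# A TVM schedule then tags loops with VTA-specific hooks, roughly:
#   s[data_buf].set_scope(env.inp_scope)        # place in an on-chip buffer
#   s[data_buf].pragma(load_axis, env.dma_copy) # lower to a LOAD task
#   s[acc_buf].tensorize(inner_axis, env.gemm)  # map to the GEMM intrinsic
```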
