
Generating Fast Operators for Binarizable Networks, by Meghan Cowan


  1. Generating Fast Operators for Binarizable Networks Meghan Cowan

  2. Running Binarizable Networks?

  3. Running Binarizable Networks? Training in frameworks with no binarizable operators.

  4. Running Binarizable Networks? Training in frameworks with no binarizable operators. Can’t evaluate performance gains (speedup: ?).

  5. Running Binarizable Networks? Training in frameworks with no binarizable operators. Can’t evaluate performance gains (speedup: ?). Easy to introduce bugs.

  6. Running Binarizable Networks? Training in frameworks with no binarizable operators. Can’t evaluate performance gains (speedup: ?). Easy to introduce bugs. Need to generate binarizable operators ourselves!

  7. Baselines are incredibly well optimized. Without optimizations, low precision can’t compete. [Chart: speedup for Baseline, Unoptimized, and Goal]

  8. Want operators that are fast. Baselines are incredibly well optimized. Without optimizations, low precision can’t compete. [Chart: speedup for Baseline, Unoptimized, and Goal]

  9. Want operators that are fast. Baselines are incredibly well optimized. Without optimizations, low precision can’t compete. Need optimized operators for all workloads. Performance portability across different CPUs. [Chart: speedup for Baseline, Unoptimized, and Goal]

  10. Generating Fast Operators for Binarizable Networks. [Diagram: TVM stack. Optimization: AutoTVM, AutoVTA. High-Level Differentiable IR. Tensor Expression IR. Backends: LLVM, CUDA, Metal; VTA. Hardware: Edge FPGA, Cloud FPGA, ASIC, Hardware Fleet.]

  11. Generating Fast Operators for Binarizable Networks. [Diagram: same stack, Tensor Expression IR highlighted] Declare bitserial computation and CPU schedule describing an optimization space.

  12. Generating Fast Operators for Binarizable Networks. [Diagram: same stack, Tensor Expression IR and AutoTVM highlighted] Declare bitserial computation and CPU schedule describing an optimization space. Use AutoTVM to find schedule parameters for different operators and backends.

  13. Generating Fast Operators for Binarizable Networks. [Diagram: same stack, with tensorize() and the LLVM, CUDA, Metal backend highlighted, next to an ARM NEON microkernel listing: vcnt.8 q8, q8; vrev16.8 q5, q8; vadd.i8 q8, q8, q5; vorr q5, q8, q8; vuzp.8 q8, q5; vmovl.u8 q5, d16; vrev32.16 q5, q5; vaddw.u8 q8, q5, d16; vorr q5, q8, q8; vuzp.16 q8, q5] Declare bitserial computation and CPU schedule describing an optimization space. Use AutoTVM to find schedule parameters for different operators and backends. Overrule LLVM code generation with custom microkernel. Use tensorize primitive to replace inner-most loop of computation.

  14. Convolutions on Raspberry Pi. [Chart: relative speedup (axis 0 to 30) per ResNet 18 layer, layers 2 to 12 plus total, for 16-bit TVM, W1A1, W1A2, and W2A2] Can generate low precision convolutions 5.5x to 15.2x faster than optimized 16-bit integer.
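
The "bitserial computation" named on slides 11 to 13 is not spelled out in the deck. As a rough illustration only, the sketch below shows one common way a bitserial dot product can be computed: each operand is split into bitplanes, pairs of planes are combined with AND and popcount, and the counts are summed with a shift. It assumes unsigned (unipolar) values and reads labels such as W1A2 as 1-bit weights with 2-bit activations. The function names `pack_bitplanes` and `bitserial_dot` are hypothetical, and this plain-Python version is not the TVM operator or the NEON microkernel from the talk.

```python
def pack_bitplanes(values, bits):
    """Split unsigned integers into `bits` bitplanes.

    Plane b holds bit b of every value, packed into a single Python int
    (element `pos` of the input lands in bit `pos` of the plane).
    """
    planes = []
    for b in range(bits):
        plane = 0
        for pos, v in enumerate(values):
            if not 0 <= v < (1 << bits):
                raise ValueError(f"value {v} does not fit in {bits} bits")
            plane |= ((v >> b) & 1) << pos
        planes.append(plane)
    return planes


def bitserial_dot(activations, weights, a_bits, w_bits):
    """Dot product of two unsigned vectors using only AND, popcount, shift, add.

    Assumption: unipolar encoding, so
    dot(a, w) = sum over plane pairs (i, j) of popcount(a_i & w_j) << (i + j).
    """
    if len(activations) != len(weights):
        raise ValueError("vectors must have the same length")
    a_planes = pack_bitplanes(activations, a_bits)
    w_planes = pack_bitplanes(weights, w_bits)
    total = 0
    for i, a_plane in enumerate(a_planes):
        for j, w_plane in enumerate(w_planes):
            # popcount of the AND counts positions where both bits are set
            total += bin(a_plane & w_plane).count("1") << (i + j)
    return total


# W1A2-style example: 2-bit activations, 1-bit weights
acts = [3, 0, 2, 1]
wgts = [1, 1, 0, 1]
result = bitserial_dot(acts, wgts, a_bits=2, w_bits=1)
reference = sum(a * w for a, w in zip(acts, wgts))
```

Under these assumptions the number of AND/popcount passes grows with the product of the two bit widths (one pass for W1A1, two for W1A2, four for W2A2), which is one plausible reason the inner popcount loop is the part worth replacing with a hand-written microkernel.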
