Generating Fast Operators for Binarizable Networks
Meghan Cowan
Running Binarizable Networks?
- Training is done in frameworks with no binarizable operators.
- Speedup? Can't evaluate performance gains.
- Easy to introduce bugs.
- Need to generate binarizable operators ourselves!
Want operators that are fast
[Chart: speedup of Unoptimized vs. Baseline vs. Goal]
- Baselines are incredibly well optimized.
- Without optimizations, low precision can't compete.
- Need optimized operators for all workloads.
- Performance portability across different CPUs.
Generating Fast Operators for Binarizable Networks
[Diagram: the TVM stack — High-Level Differentiable IR, Tensor Expression IR with AutoTVM, code generation to LLVM/CUDA/Metal and to VTA with AutoVTA, targeting edge and cloud hardware (ASIC, FPGA, FPGA fleet)]
- Declare the bitserial computation and a CPU schedule describing an optimization space (first sketch below).
- Use AutoTVM to find schedule parameters for different operators and backends (second sketch below).
- Override LLVM code generation with a custom microkernel: use the tensorize primitive to replace the innermost loop of the computation (third sketch below).

Inner-loop ARM NEON microkernel (popcount accumulation):

```asm
vcnt.8    q8, q8        @ popcount each byte
vrev16.8  q5, q8        @ swap adjacent bytes within each 16-bit lane
vadd.i8   q8, q8, q5    @ pairwise-sum adjacent byte counts
vorr      q5, q8, q8    @ copy q8 into q5
vuzp.8    q8, q5        @ de-interleave bytes across q8/q5
vmovl.u8  q5, d16       @ widen u8 counts to u16
vrev32.16 q5, q5        @ swap adjacent 16-bit lanes within each 32-bit lane
vaddw.u8  q8, q5, d16   @ accumulate widened counts
vorr      q5, q8, q8    @ copy q8 into q5
vuzp.16   q8, q5        @ de-interleave 16-bit lanes across q8/q5
```
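The slides show only the generated assembly; as a minimal sketch, a bitserial dot product declared in TVM's tensor expression language might look like the following. Sizes and names here are illustrative (not the talk's actual code), and inputs are assumed already bit-packed into uint32 words:

```python
import tvm
from tvm import te

# Illustrative sizes: M output elements, K packed 32-bit words per row.
M, K = 64, 32
A = te.placeholder((M, K), dtype="uint32", name="A")  # bit-packed activations
W = te.placeholder((K,), dtype="uint32", name="W")    # bit-packed weights

k = te.reduce_axis((0, K), name="k")
# 1-bit dot product: AND the packed words, popcount, accumulate.
C = te.compute(
    (M,),
    lambda i: te.sum(tvm.tir.popcount(A[i, k] & W[k]).astype("int32"), axis=k),
    name="C",
)

# The schedule exposes an optimization space: tiling, parallelism, unrolling.
s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=8)
s[C].parallel(io)
s[C].unroll(ii)
```

Higher-precision variants (e.g. W2A2) combine several such popcount terms, weighted by bit position.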
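A sketch of turning that schedule into an AutoTVM search space; the template name and tuning budget are placeholders, not the talk's actual setup:

```python
import tvm
from tvm import te, autotvm

@autotvm.template("bitserial_dot_example")  # hypothetical template name
def bitserial_dot(M, K):
    A = te.placeholder((M, K), dtype="uint32", name="A")
    W = te.placeholder((K,), dtype="uint32", name="W")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute(
        (M,),
        lambda i: te.sum(tvm.tir.popcount(A[i, k] & W[k]).astype("int32"), axis=k),
        name="C",
    )
    s = te.create_schedule(C.op)

    # Let the tuner choose the tile factor instead of hard-coding it.
    cfg = autotvm.get_config()
    cfg.define_split("tile_i", C.op.axis[0], num_outputs=2)
    io, ii = cfg["tile_i"].apply(s, C, C.op.axis[0])
    s[C].parallel(io)
    s[C].unroll(ii)
    return s, [A, W, C]

# Tune locally; in practice the target string selects the backend
# (e.g. an ARM CPU target for the Raspberry Pi results below).
task = autotvm.task.create("bitserial_dot_example", args=(64, 32), target="llvm")
tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(
    n_trial=100,
    measure_option=autotvm.measure_option(
        builder=autotvm.LocalBuilder(), runner=autotvm.LocalRunner()
    ),
)
```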
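Finally, a minimal illustration of the tensorize mechanism itself, following the pattern of TVM's tensorize tutorial: the innermost tile is pattern-matched and replaced by a call to an external microkernel symbol. Here `popcount_dot_u32` is a hypothetical symbol standing in for NEON assembly like the listing above:

```python
import tvm
from tvm import te

def intrin_popcount_dot(mi, K):
    """Declare the inner-loop pattern that tensorize will replace."""
    a = te.placeholder((mi, K), dtype="uint32", name="a")
    w = te.placeholder((K,), dtype="uint32", name="w")
    k = te.reduce_axis((0, K), name="k")
    c = te.compute(
        (mi,),
        lambda i: te.sum(tvm.tir.popcount(a[i, k] & w[k]).astype("int32"), axis=k),
        name="c",
    )

    def intrin_func(ins, outs):
        aa, ww = ins
        cc = outs[0]
        ib = tvm.tir.ir_builder.create()
        # popcount_dot_u32 is a hypothetical external symbol; in practice it
        # would be the hand-written microkernel linked into the module.
        ib.emit(
            tvm.tir.call_extern(
                "int32", "popcount_dot_u32",
                cc.access_ptr("w"), aa.access_ptr("r"), ww.access_ptr("r"),
                mi, K,
            )
        )
        return ib.get()

    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, name="Ab", offset_factor=1)
    Wb = tvm.tir.decl_buffer(w.shape, w.dtype, name="Wb", offset_factor=1)
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, name="Cb", offset_factor=1)
    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, w: Wb, c: Cb})

# Usage with the earlier compute C and schedule s:
#   io, ii = s[C].split(C.op.axis[0], factor=8)
#   s[C].tensorize(ii, intrin_popcount_dot(8, K))
```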
Convolutions on Raspberry Pi
[Chart: relative speedup over 16-bit TVM for W1A1, W1A2, and W2A2 convolutions, per ResNet-18 layer (2 through 12) and in total]
Can generate low-precision convolutions 5.5x to 15.2x faster than optimized 16-bit integer.