

  1. Riptide: Fast End-to-End Binarized Neural Networks Josh Fromm, Meghan Cowan, Matthai Philipose, Luis Ceze, and Shwetak Patel

  2. Canziani et al., “An analysis of deep neural network models for practical applications.” 2016

  3. 1-bit Matrix Operations
     • Quantize floats to +/-1, e.g. 1.122 * -3.112 ==> 1 * -1
     • Notice the products: 1 * 1 = 1, 1 * -1 = -1, -1 * 1 = -1, -1 * -1 = 1
     • Replacing -1 with 0, 64 floats (1.2, 3.12, -11.2, 3.4, -2.12, -132.1, …, 0.2, -121.1, …) pack into 64 bits (0b110100…1, i.e. 0xD0…), and the dot product is just XNOR plus popcount:
       A[:64] . W[:64] == 2*popc(A[:64] XNOR W[:64]) - 64
     • Retrain the model to convergence
     Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016
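     A minimal sketch of the packing step described above, assuming the +1 -> bit 1, -1 -> bit 0 convention (the function name and bit layout are illustrative):

         #include <stdint.h>

         /* Pack 64 floats into one 64-bit word: bit i is 1 if x[i] is positive
          * (quantizes to +1) and 0 otherwise (-1). 64 floats become 8 bytes. */
         uint64_t pack64(const float *x) {
             uint64_t word = 0;
             for (int i = 0; i < 64; i++) {
                 if (x[i] > 0.0f)
                     word |= (uint64_t)1 << i;
             }
             return word;
         }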

  4. 1-bit Matrix Operations: Cost/Benefit
     Full precision (2N ops):
         float x[], y[], w[];
         ...
         for i in 1…N:
             y[j] += x[i] * w[i];
     Binarized (3N/64 ops, ~40x faster, 32x smaller):
         unsigned long x_b[], w_b[]; int y[];
         ...
         for i in 1…N/64:
             y[j] += 2*popc(not(x_b[i] xor w_b[i])) - 64;
     Typically, lose ~10% accuracy.
     Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016
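     For reference, a compilable version of the binary inner product might look like the following (a sketch that assumes both operands are already bit-packed as above and uses the GCC/Clang popcount builtin in place of a tuned kernel):

         #include <stdint.h>

         /* Dot product of two length-(n*64) vectors of +/-1 values stored as packed
          * bits (1 = +1, 0 = -1). Matching bits multiply to +1 and differing bits to
          * -1, so each word contributes 2*popcount(xnor) - 64. */
         int binary_dot(const uint64_t *x_b, const uint64_t *w_b, int n) {
             int acc = 0;
             for (int i = 0; i < n; i++) {
                 uint64_t xnor = ~(x_b[i] ^ w_b[i]);
                 acc += 2 * __builtin_popcountll(xnor) - 64;
             }
             return acc;
         }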


  6. 1-bit Matrix Operations: Cost/Benefit (chart: ~40x faster)

  7. 1-bit Matrix Operations: Cost/Benefit
     Runtime: 1904 ms for the full precision baseline vs. 380 ms for an unoptimized binary network

  8. Implementation Challenges
     • No optimized linear algebra libraries like BLAS to leverage: optimizations must be implemented from scratch and tuned for a specific CPU.
     • CPUs have no native support for low-bit data types (uint1, uint2): kernels need to work on packed data.
     • Baselines are incredibly well optimized: optimized linear algebra libraries and hardware support for conventional deep learning.

  9. Are Binary Networks Actually Fast?
     The majority of work on binarization is simulated.
     • Which binarization techniques can be implemented efficiently?
     • What are the runtime bottlenecks in a binary model?
     • How do I deploy a fast binary model on my platform?
     To address these questions we introduce Riptide.

  10. Riptide: a one-stop solution for training and deploying fast binary networks on a variety of hardware platforms.
      • Addresses implementation issues in mixed-polarity quantization
      • Introduces the Fused Glue operation, removing all floating-point arithmetic from binary models
      • Provides high-performance bitserial operators through TVM
      • Yields 4-12X speedups across various models and bitwidths while maintaining state-of-the-art accuracy
      • Available open-source today at github.com/jwfromm/Riptide

  11. Implementing Binary Layers (diagram): float kernels and float activations are combined by a multiply-accumulate into float features.

  12. Implementing Binary Layers (diagram): a quantize step (QA) converts the float features into an int array that feeds the next layer.

  13. Implementing Binary Layers (diagram): a quantize step (QW) likewise converts the float kernels into an int array.

  14. Implementing Binary Layers (diagram): with activations and kernels both quantized to int arrays, the multiply-accumulate becomes a bitserial accumulate.

  15. Quantization Polarity
      Bipolar quantization (values in {-1, 1}):
      • Quantization function: q(x) = sign(x)
      • Implemented with bitwise-xnor and popcount
      • Well-suited for weights, which represent correlation (1) or inverse-correlation (-1)
      Unipolar quantization (values in {0, 1}):
      • Quantization function: q(x) = (x > 0)
      • Implemented with bitwise-and and popcount
      • Well-suited for activations, which represent pattern-match (1) or no pattern-match (0)

  16. Quantization Polarity
      • XnorNet (all bipolar) -> 44.2% accuracy
      • DorefaNet (bipolar weights, unipolar activations) -> 50.0% accuracy
      A (unipolar):  1  1  0  1  0  1 …
      W (bipolar):   0  1  0  0  1  0 …
      Expected:      0  1  0 -1  0 -1 …
      The two different meanings of a 0 bit make mixed polarity unimplementable with a single xnor or and plus popcount.
      Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016

  17. Mixed Polarity Operation
      Count the bit multiplications whose output should be 1, then subtract the cases whose output should be -1 (see the sketch below).
      • Enables mixed-polarity binary networks
      • Doubles the amount of inner-loop compute but does not require additional memory operations
      • Mixed polarity may offer compelling points on the speedup-versus-accuracy curve compared to pure bipolar
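      A minimal sketch of that recipe, assuming unipolar activations and bipolar weights packed with 1 = +1 and 0 = -1:

          #include <stdint.h>

          /* Mixed-polarity dot product: activations a are unipolar bits {0, 1},
           * weights w are bipolar {-1, +1}. A position contributes +1 when a = 1 and
           * w = +1, -1 when a = 1 and w = -1, and 0 when a = 0. */
          int mixed_dot(const uint64_t *a, const uint64_t *w, int n) {
              int acc = 0;
              for (int i = 0; i < n; i++) {
                  acc += __builtin_popcountll(a[i] & w[i]);   /* outputs that should be +1 */
                  acc -= __builtin_popcountll(a[i] & ~w[i]);  /* outputs that should be -1 */
              }
              return acc;
          }

      The two popcounts per word are the doubled inner-loop compute; a[i] and w[i] are each still loaded only once.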

  18. Multibit Quantization
      • Translates naturally to an integer representation
      • Does not necessarily fit the activation distribution
      Quantization levels (2 bits): 0, .3, .6, 1
      Quantization function: q(x) = linear(x)
      Zhou et al., “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” 2016
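      For concreteness, a sketch of the linear quantizer (the [0, 1] clamp range is an assumption in the style of DoReFa activations):

          #include <math.h>

          /* Linear n-bit quantization onto the grid {0, 1/(2^n - 1), ..., 1}.
           * The integer code roundf(x * levels) is the value that gets bit-packed. */
          float quantize_linear(float x, int bits) {
              float levels = (float)((1 << bits) - 1);
              if (x < 0.0f) x = 0.0f;
              if (x > 1.0f) x = 1.0f;
              return roundf(x * levels) / levels;
          }

      With bits = 2 this reproduces the 0, 1/3 (≈.3), 2/3 (≈.6), 1 grid shown above.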

  19. Multibit Quantization
      • Better fit for a Gaussian activation distribution
      • Not implementable with popcount-based kernels
      Quantization levels: .2, .6, 1.1, 2.1
      Quantization function: q(x) = HWGQ(x)
      Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017

  20. Multibit Quantization
      For the levels .2, .6, 1.1, 2.1, each value corresponds to a unique bit pair (00, 01, 10, 11) rather than to a weighted combination of its bits (01 + 10 ≠ 11), so the information about which bits paired up is lost during popcount.
      Cai et al., “Deep Learning with Low Precision by Half-wave Gaussian Quantization.” 2017
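      Linear quantization survives popcount precisely because an n-bit linear code is a weighted sum of its bit-planes (v = Σ_k 2^k · b_k), so the dot product decomposes plane by plane. A sketch of that bitserial decomposition (the names and the unipolar/unipolar pairing are illustrative; Riptide's actual operators are generated through TVM):

          #include <stdint.h>

          /* Bitserial dot product of two multi-bit linearly quantized vectors, each
           * stored as `bits` packed bit-planes of `words` 64-bit words. Every pair of
           * bit-planes (m, n) contributes popcount(a_m & w_n) scaled by 2^(m+n). */
          int bitserial_dot(const uint64_t *a_planes, const uint64_t *w_planes,
                            int bits, int words) {
              int acc = 0;
              for (int m = 0; m < bits; m++)
                  for (int n = 0; n < bits; n++)
                      for (int i = 0; i < words; i++)
                          acc += __builtin_popcountll(a_planes[m * words + i] &
                                                      w_planes[n * words + i]) << (m + n);
              return acc;
          }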

  21. Implementing Binary Layers (diagram)
      • Kernels: 1-bit bipolar quantization (QW) into a packed int array
      • Activations: N-bit linear bipolar or unipolar quantization (QA) into a packed int array
      • Accumulation: bitwise accumulate via xnor-popcount or mixed-polarity popcount

  22. Implementing Binary Models (diagram): Conv -> QConv -> QConv -> QConv -> QConv -> QDense, with the first convolution kept in full precision and the remaining layers binary.

  23. Implementing Binary Models
      Between quantized layers sits a chain of full-precision glue: WeightScale -> Dequantize -> BatchNorm -> Activation -> Quantize -> Bitpack.
      Computational complexity: the QConv itself is on the order of NKKFHWC, while the glue costs roughly WeightScale 4HWF, Dequantize HWF, BatchNorm 4HWF, Activation HWF, Quantize 5HWF, Bitpack 3HWF.

  24. Estimated Impact of Glue Layers (chart)
      • The impact of glue layers is too high
      • We must derive binarized glue for decent end-to-end speedups

  25. Weight Scaling
      • Introduced in XnorNet
      • Allows the scale of the weights to be preserved: β_c = mean(|W_c|), and the QConv output a is rescaled as ŷ = β_c · a
      • Brought accuracy from 27% to 44%
      • Now used ubiquitously
      Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016

  26. Quantized Weight Scaling
      • Use the approximate power of 2 (AP2) of the scale β_c = mean(|W_c|)
      • Replaces the multiply ŷ = β_c · a with a bitwise shift
      • The shift amount is a constant at inference time
      • Requires only a single instruction
      Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks.” 2016
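      A sketch of what AP2 scaling amounts to (helper names are illustrative; the shift is computed once offline):

          #include <math.h>

          /* Approximate power of 2: round a positive scale to the nearest power of
           * two, so that multiplying by it becomes a shift by a constant exponent. */
          int ap2_exponent(float scale) {
              return (int)roundf(log2f(scale));
          }

          /* Apply the scale to an integer accumulator as a shift. Weight scales are
           * typically < 1, so the exponent is negative and this is a right shift
           * (sketch only; a real kernel would fold this into the output quantizer). */
          int apply_ap2(int y, int exponent) {
              return exponent >= 0 ? (y << exponent) : (y >> -exponent);
          }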

  27. BatchNormalization
      • Centers and scales output activations
      • Essential for quantization, used in all binary techniques
      • Must derive quantized versions of both centering and scaling
      μ_c = (1/N) Σ_i a_i,   σ_c² = (1/N) Σ_i (a_i - μ_c)²,   â = (a - μ_c) / σ_c
      Ioffe et al., “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” 2015

  28. Binary Scaling
      • We can simply compute the AP2 of the standard deviation, so dividing by σ_c becomes a bitwise shift

  29. Binary Center
      To add a constant to a binarized tensor, we must use fixed point quantization (FPQ) with the same number of bits and the same scale as the N-bit input to the next layer, extended with wb fractional bits (which encode 1/2 + 1/4 + 1/8 + ⋯):
      B = N + wb
      S = 1 + Σ_{i=1..wb} (1/2)^i = 2 - 2^(-wb)
      ν̂ = FPQ(ν, B, S)

  30. Fused Glue Operation
      B = N + wb,   S = 2 - 2^(-wb),   ν̂ = FPQ(ν, B, S)
      • This is the fused glue operation
      • All terms are constant at runtime except the activation a
      • Only requires two integer operations
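      Operationally, the fused glue might reduce to something like the following at inference time (a sketch under the assumptions above, with an AP2 shift standing in for the folded scale and a pre-quantized center; not the paper's exact kernel):

          /* Per-activation fused glue: one shift and one add on integers, followed by
           * a clip into the range of the next layer's N-bit quantizer. `shift` is the
           * AP2 exponent of the folded scale and `center` is the FPQ'd batch-norm
           * center; both are constants computed offline. */
          static inline int fused_glue(int acc, int shift, int center, int qmax) {
              int y = (acc >> shift) + center;   /* the two integer operations */
              if (y < 0)    y = 0;
              if (y > qmax) y = qmax;
              return y;
          }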

  31. Fully Binarized Network
      Traditional binary network glue (full precision): WeightScale, Dequantize, BatchNorm, Activation, Quantize, Bitpack
      Computational complexity: 4HWF + HWF + 4HWF + HWF + 5HWF + 3HWF, total = 18HWF
      Fully binarized network glue: Fused Glue, Clip, Bitpack
      Computational complexity: 2HWF + HWF + 3HWF, total = 6HWF
      • 3X fewer glue operations
      • No floating-point data
      • No multiplication or division

  32. FBN Accuracy
      • Our system is comparable to state-of-the-art techniques
      • Unipolar quantization yields higher accuracies, as expected
      • Effective across various models

  33. Measurement Platform: Raspberry Pi (ARM Cortex-A53)
      • Widely available and inexpensive
      • Representative of IoT devices (Qualcomm Snapdragons, Azure Sphere)
      • Resource constrained / in need of acceleration
