

  1. The Embedded Learning Library

  2. The Embedded Learning Library (ELL): a cross-compiler for AI pipelines, specialized for resource-constrained target platforms
     [Diagram: AI Pipeline -> ELL -> Target Machine Code]
     https://github.com/Microsoft/ELL

  3. The Embedded Learning Library
     • 3 years at Microsoft Research
     • compiler toolchain, tutorials, model gallery
     • focus: ARM CPUs -> embedded GPUs; vision on ARM Cortex-A53, keyword spotting on ARM Cortex-M4F

  4. Architecture
     [Diagram: pretrained models and datasets enter via model importers and ELL trainers, feeding a computation graph; an optimizer specializes the graph using target profiles; the ELL platform abstraction layer emits code through LLVM, OpenCL, BLAS, ... emitters for the target]

  5. AI compiler vs. AI runtime
     why AI compiler?
     • model-specific optimization
     • target-specific optimization
     • small executable
     why AI runtime?
     • portability
     • seamless migration from cloud to edge
     best of both worlds: a just-in-time AI compiler

  6. Evaluation
     small loss in accuracy -> large gain in cost
     compression techniques:
     • efficient architectures
     • pruning
     • low-precision math and quantization
     • low-rank matrix approximation

  7. January 2018: Architecture search
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz; Pareto-frontier models marked]

  8. January 2018: Architecture search
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  9. February 2018: Architecture search
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  10. March 2018: Architecture search
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  11. April 2018: Architecture search
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  12. Lossless acceleration
     • variety of convolution kernels
     • scheduling
     • engineering

  13. January 2019: Lossless acceleration
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  14. February 2019: Lossless acceleration
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  15. March 2019: Lossless acceleration
     [Chart: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3@700MHz]

  16. Lossy acceleration
     • mix and match compression techniques
     • engineering/ML co-design
     • during training vs. post-processing

  17. Quantization semantics
     scheme            bits  value
     binary            1     0 -> -1, 1 -> +1
     ternary           2     00 -> 0, 01 -> +1, 10 -> n/a, 11 -> -1
     linear            k     [0 ... 2^k - 1]
     exponential       k     [-(2^(k-1) - 1) ... 2^(k-1) - 1]
     lookup/clustered  k     value = lookup[bits]
     iterative sum     k     value = a ± b ± c ± ... ± n
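
To make the table concrete, here is a small illustrative C++ sketch of decoders for the binary, ternary, linear, and lookup schemes (hypothetical helper names, not ELL's API; the exponential and iterative-sum schemes are omitted):

```cpp
#include <cassert>
#include <cstdint>

// Binary: one bit, 0 -> -1 and 1 -> +1.
int decode_binary(uint8_t bit) { return bit ? +1 : -1; }

// Ternary: two bits, 00 -> 0, 01 -> +1, 11 -> -1 (10 is unused).
int decode_ternary(uint8_t bits) {
    switch (bits & 0x3) {
        case 0b00: return 0;
        case 0b01: return +1;
        case 0b11: return -1;
        default:   return 0;  // 0b10: n/a
    }
}

// Linear: a k-bit code is the value itself, in [0 ... 2^k - 1].
unsigned decode_linear(unsigned code) { return code; }

// Lookup/clustered: a k-bit code indexes a table of 2^k learned centers.
float decode_lookup(unsigned code, const float* codebook) {
    return codebook[code];
}

int main() {
    assert(decode_binary(0) == -1 && decode_binary(1) == +1);
    assert(decode_ternary(0b11) == -1 && decode_ternary(0b01) == +1);
    assert(decode_linear(5) == 5);
    return 0;
}
```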

  18. Quantization representation
     two layouts for four 4-bit values a, b, c, d:
     bit packed:  b3 b2 b1 b0  a3 a2 a1 a0  d3 d2 d1 d0  c3 c2 c1 c0   (each value's bits stored contiguously)
     bit planes:  d0 c0 b0 a0  d1 c1 b1 a1  d2 c2 b2 a2  d3 c3 b3 a3   (bit i of every value grouped into plane i)
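
A minimal C++ sketch of the two layouts, assuming four example values; the exact ordering of values within a word is a convention choice (the slide groups them as b, a, d, c, the sketch uses a, b, c, d):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Four 4-bit example values a, b, c, d (hypothetical data).
    uint8_t v[4] = {0x5, 0x1, 0x7, 0x6};

    // Bit packed: each value's four bits stay contiguous in the word.
    uint16_t packed = 0;
    for (int i = 0; i < 4; ++i)
        packed |= (uint16_t)((v[i] & 0xF) << (4 * i));

    // Bit planes: plane k collects bit k of every value, so one bitwise op
    // on plane k touches bit k of all four values at once.
    uint8_t plane[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; ++k)
        for (int i = 0; i < 4; ++i)
            plane[k] |= (uint8_t)(((v[i] >> k) & 1) << i);

    printf("bit packed: 0x%04x\n", (unsigned)packed);
    for (int k = 0; k < 4; ++k)
        printf("plane %d: 0x%x\n", k, (unsigned)plane[k]);
    return 0;
}
```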

  19. Quantization example
     ternary weights, 3-bit unsigned linear activations (bit plane representation)
     activations: 5  1  7  6  3  4  2  5
     weights:     1 -1  0 -1 -1 -1  1  0
     dot = 5*1 + 1*(-1) + 7*0 + 6*(-1) + 3*(-1) + 4*(-1) + 2*1 + 5*0 = -7

  20. Quantization example
     activations 5 1 7 6 3 4 2 5, decomposed into bit planes:
       bit 0: 1 1 1 0 1 0 0 1
       bit 1: 0 0 1 1 1 0 1 0
       bit 2: 1 0 1 1 0 1 0 1
     weights 1 -1 0 -1 -1 -1 1 0, decomposed into:
       sign:      0 1 0 1 1 1 0 0
       magnitude: 1 1 0 1 1 1 1 0

  21. Quantization example
     activation bit planes:
       bit 0: 1 1 1 0 1 0 0 1
       bit 1: 0 0 1 1 1 0 1 0
       bit 2: 1 0 1 1 0 1 0 1
     weight planes:
       sign:      0 1 0 1 1 1 0 0
       magnitude: 1 1 0 1 1 1 1 0

  22. Quantization example (bit plane 0: a = 11101001)
     absSum: o = a & m = 11101001 & 11011110 = 11001000
             absSum += popcount(o)  ->  absSum = 3
     negSum: o = a & s = 11101001 & 01011100 = 01001000
             negSum += popcount(o)  ->  negSum = 2

  23. Quantization example (bit plane 1: a = 00111010)
     absSum: o = a & m = 00111010 & 11011110 = 00011010
             absSum += popcount(o) << 1  ->  absSum = 3 + 2*3 = 9
     negSum: o = a & s = 00111010 & 01011100 = 00011000
             negSum += popcount(o) << 1  ->  negSum = 2 + 2*2 = 6

  24. Quantization example (bit plane 2: a = 10110101)
     absSum: o = a & m = 10110101 & 11011110 = 10010100
             absSum += popcount(o) << 2  ->  absSum = 9 + 4*3 = 21
     negSum: o = a & s = 10110101 & 01011100 = 00010100
             negSum += popcount(o) << 2  ->  negSum = 6 + 4*2 = 14
     total = absSum - 2*negSum = 21 - 2*14 = -7
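
Putting slides 20-24 together: a minimal C++ sketch of the technique, one 8-element word at a time, with element 0 in the most significant bit as on the slides. `__builtin_popcount` is the GCC/Clang bit-count intrinsic; this is an illustration, not ELL's generated code, which vectorizes the same loop over 128-bit NEON registers (see slide 25):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Dot product of k-bit unsigned activations with ternary weights, over one
// 8-element word. planes[i] holds bit i of all eight activations; sign and
// magnitude are the weight bit planes (sign: 1 = negative, magnitude: 1 = nonzero).
int quantized_dot(const uint8_t* planes, int kBits,
                  uint8_t sign, uint8_t magnitude) {
    int absSum = 0, negSum = 0;
    for (int i = 0; i < kBits; ++i) {
        absSum += __builtin_popcount(planes[i] & magnitude) << i;  // sum of a where w != 0
        negSum += __builtin_popcount(planes[i] & sign) << i;       // sum of a where w < 0
    }
    // positives - negatives = (absSum - negSum) - negSum
    return absSum - 2 * negSum;
}

int main() {
    // Activations 5 1 7 6 3 4 2 5 as bit planes (element 0 in the msb).
    uint8_t planes[3] = {0b11101001, 0b00111010, 0b10110101};
    // Weights 1 -1 0 -1 -1 -1 1 0 as sign and magnitude planes.
    uint8_t sign      = 0b01011100;
    uint8_t magnitude = 0b11011110;
    int dot = quantized_dot(planes, 3, sign, magnitude);
    printf("dot = %d\n", dot);  // prints dot = -7, matching slide 19
    assert(dot == -7);
    return 0;
}
```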

  25. Quantization example: instruction count
     8-bit words:
       instruction_count = 8 instructions per bit plane * 3 bit planes = 24 instructions
       vector size = 8 elements, so instructions per element = 24 / 8 = 3
     128-bit words (NEON):
       instruction_count = 8 * 3 + ~0.3 reduce ops = 24.3 instructions
       vector size = 128 elements, so instructions per element = 24.3 / 128 = 0.19  (about 5x faster than float)

  26. Quantization performance
     [Chart: speedup of quantized vs. full-precision inference on ARM1176 (y-axis 0-25x) for 1-bit, 2-bit, 3-bit, and 8-bit quantization]

  27. Quantized weight accuracy
     [Chart: accuracy relative to the original model (0-1) vs. proportion of zeros in the ternary weights (0-0.7); models with ternarized weights compared against a model with binary weights]

  28. Quantized activation accuracy
     [Chart: accuracy relative to real-valued activations (0-1) vs. quantized activation bit count (1-8), for ternary-weight and binary-weight models]

  29. Current focus areas
     • post-training lossy compression (pruning and quantization)
     • engineering/ML training co-design
     • infrastructure:
       - beating BLAS on embedded platforms
       - extending the platform abstraction layer to embedded GPUs
       - a global optimizer

  30. Questions?
     • https://microsoft.github.io/ELL/
     • Code: https://github.com/Microsoft/ELL
     • Model Gallery: https://microsoft.github.io/ELL/gallery/

  31. Not every model is a winner
