The Embedded Learning Library
The Embedded Learning Library (ELL)
A cross-compiler for AI pipelines, specialized for resource-constrained target platforms.
[Diagram: AI Pipeline → ELL → Target Machine Code]
https://github.com/Microsoft/ELL
The Embedded Learning Library
• 3 years at Microsoft Research
• compiler toolchain, tutorials, model gallery
• focus: ARM CPUs and embedded GPUs; vision on ARM Cortex-A53, keyword spotting on ARM Cortex-M4F
Architecture
[Diagram: datasets and pretrained models enter through importers and ELL trainers into a computation graph; an optimizer guided by target profiles lowers the graph through the ELL platform abstraction layer to target emitters (LLVM, OpenCL, BLAS, …)]
AI compiler vs. AI runtime
Why AI compiler?
• model-specific optimization
• target-specific optimization
• small executable
Why AI runtime?
• portability
• seamless migration from cloud to edge
Best of both worlds: a just-in-time AI compiler.
Evaluation
Small loss in accuracy, large gain in cost.
Compression techniques:
• efficient architectures
• pruning
• low-precision math and quantization
• low-rank matrix approximation
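The slides name these techniques without code; as one concrete illustration, here is a minimal magnitude-pruning sketch in C++. This is our sketch of the generic technique, not ELL's implementation, and the function name is ours:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Magnitude pruning: zero out the fraction `sparsity` of weights with the
// smallest absolute values. Ties at the threshold may prune a few extra.
void magnitudePrune(std::vector<float>& weights, float sparsity)
{
    if (weights.empty() || sparsity <= 0.0f) return;
    std::vector<float> mags(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i)
        mags[i] = std::fabs(weights[i]);
    std::sort(mags.begin(), mags.end());
    std::size_t cut = static_cast<std::size_t>(sparsity * weights.size());
    if (cut == 0) return;
    float threshold = mags[cut - 1];
    for (float& w : weights)
        if (std::fabs(w) <= threshold) w = 0.0f;
}
```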
Architecture search, January–April 2018
[Monthly scatter plots: ILSVRC2012 top-1 accuracy (30–70) vs. ms/image on RPi3 @ 700 MHz (0–1000), tracing the Pareto frontier of models as architecture search progresses]
Lossless acceleration
• variety of convolution kernels
• scheduling
• engineering
Lossless acceleration, January–March 2019
[Monthly scatter plots: ILSVRC2012 top-1 accuracy (30–70) vs. ms/image on RPi3 @ 700 MHz (0–1000), tracing the model Pareto frontier during the lossless acceleration work]
Lossy acceleration
• mix and match compression techniques
• engineering/ML co-design
• during training vs. post-processing
Quantization semantics

scheme            bits  value
binary            1     0 → -1, 1 → +1
ternary           2     00 → 0, 01 → +1, 11 → -1 (10 unused)
linear            0…k   [0 … 2^k - 1]
exponential       0…k   [-2^(b-1) - 1 … 2^(b-1) - 1]
lookup/clustered  0…k   lookup[value]
iterative sum     0…k   a ± b ± c ± … ± n
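To make the mappings concrete, a small C++ sketch of decoders for several of the schemes above (function names and the `scale` parameter are ours for illustration, not ELL's API):

```cpp
#include <cstdint>
#include <vector>

// binary: one bit, 0 -> -1, 1 -> +1.
int decodeBinary(uint32_t bit)
{
    return bit ? 1 : -1;
}

// ternary: two bits, 00 -> 0, 01 -> +1, 11 -> -1 (10 unused).
int decodeTernary(uint32_t code)
{
    switch (code & 0b11)
    {
        case 0b01: return 1;
        case 0b11: return -1;
        default:   return 0;  // 0b00, and the unassigned 0b10
    }
}

// linear: a k-bit code is an integer level in [0, 2^k - 1]; a scale
// factor maps levels back to real values.
float decodeLinear(uint32_t code, float scale)
{
    return scale * static_cast<float>(code);
}

// lookup/clustered: a k-bit code indexes a table of learned centroids.
float decodeLookup(uint32_t code, const std::vector<float>& table)
{
    return table[code];
}
```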
Quantization representation
bit packed: each value's bits stay together: b3 b2 b1 b0 a3 a2 a1 a0 | d3 d2 d1 d0 c3 c2 c1 c0
bit planes: bit i of every value is stored together: d0 c0 b0 a0 | d1 c1 b1 a1 | d2 c2 b2 a2 | d3 c3 b3 a3
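A minimal sketch of the bit-plane packing (our illustration, following the slide's left-to-right bit layout; not ELL's packing code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack n k-bit values (n <= 32) into k bit planes: plane p collects bit p
// of every value. Element 0 lands in the most significant bit, matching
// the left-to-right layout used on these slides.
std::vector<uint32_t> toBitPlanes(const std::vector<uint32_t>& values, int k)
{
    std::vector<uint32_t> planes(k, 0);
    for (std::size_t i = 0; i < values.size(); ++i)
        for (int p = 0; p < k; ++p)
            planes[p] |= ((values[i] >> p) & 1u) << (values.size() - 1 - i);
    return planes;
}

// toBitPlanes({5, 1, 7, 6, 3, 4, 2, 5}, 3) yields
// {0b11101001, 0b00111010, 0b10110101}: the planes used in the example below.
```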
Quantization example
ternary weights, 3-bit unsigned linear activations (bit planes)
activations: 5 1 7 6 3 4 2 5
weights: 1 -1 0 -1 -1 -1 1 0
dot = 5·1 + 1·(-1) + 7·0 + 6·(-1) + 3·(-1) + 4·(-1) + 2·1 + 5·0 = -7
Quantization example
activations 5 1 7 6 3 4 2 5 as bit planes:
bit 0: 1 1 1 0 1 0 0 1
bit 1: 0 0 1 1 1 0 1 0
bit 2: 1 0 1 1 0 1 0 1
weights 1 -1 0 -1 -1 -1 1 0 as masks:
sign: 0 1 0 1 1 1 0 0
magnitude: 1 1 0 1 1 1 1 0
Quantization example (bit plane 0: 1 1 1 0 1 0 0 1)
absSum: o = a & m = 11101001 & 11011110 = 11001000; absSum += popcount(o) → absSum = 3
negSum: o = a & s = 11101001 & 01011100 = 01001000; negSum += popcount(o) → negSum = 2
Quantization example (bit plane 1: 0 0 1 1 1 0 1 0)
absSum: o = a & m = 00111010 & 11011110 = 00011010; absSum += popcount(o) << 1 → absSum = 3 + 2·3 = 9
negSum: o = a & s = 00111010 & 01011100 = 00011000; negSum += popcount(o) << 1 → negSum = 2 + 2·2 = 6
Quantization example (bit plane 2: 1 0 1 1 0 1 0 1)
absSum: o = a & m = 10110101 & 11011110 = 10010100; absSum += popcount(o) << 2 → absSum = 9 + 4·3 = 21
negSum: o = a & s = 10110101 & 01011100 = 00010100; negSum += popcount(o) << 2 → negSum = 6 + 4·2 = 14
total = absSum - 2·negSum = 21 - 2·14 = -7
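The three plane passes above collapse into a short loop. A C++ sketch of the whole dot product (our illustration; `__builtin_popcount` is the GCC/Clang intrinsic, standing in for whatever popcount instruction the target provides):

```cpp
#include <cassert>
#include <cstdint>

// Ternary dot product over bit planes: weights are stored as a sign mask
// (1 where the weight is -1) and a magnitude mask (1 where nonzero);
// activations are k bit planes, element 0 in the most significant bit.
int ternaryDot(const uint32_t* activationPlanes, int k,
               uint32_t signMask, uint32_t magnitudeMask)
{
    int absSum = 0;  // sum of activations where the weight is nonzero
    int negSum = 0;  // sum of activations where the weight is negative
    for (int p = 0; p < k; ++p)
    {
        absSum += __builtin_popcount(activationPlanes[p] & magnitudeMask) << p;
        negSum += __builtin_popcount(activationPlanes[p] & signMask) << p;
    }
    return absSum - 2 * negSum;  // negative terms were added, so subtract twice
}

int main()
{
    // Activations 5 1 7 6 3 4 2 5 as 3 bit planes; weights 1 -1 0 -1 -1 -1 1 0.
    uint32_t planes[3] = { 0b11101001, 0b00111010, 0b10110101 };
    uint32_t sign      = 0b01011100;
    uint32_t magnitude = 0b11011110;
    assert(ternaryDot(planes, 3, sign, magnitude) == -7);  // matches the slide
    return 0;
}
```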
Quantization example: instruction count
8-bit words: instruction_count = 8 instructions × 3 bits = 24 instructions; vector size = 8 → 24 / 8 = 3 instructions per element
128-bit words (NEON): instruction_count = 8 instructions × 3 bits + 0.3 reduce ops = 24.3 instructions; vector size = 128 → 24.3 / 128 ≈ 0.19 instructions per element (5× faster than float)
Quantization performance
[Bar chart: speedup of quantized vs. full-precision inference on ARM1176, y axis 0–25×, for 1-bit, 2-bit, 3-bit, and 8-bit quantization]
Quantized weight accuracy
[Plot: accuracy relative to the original model (0–1) vs. proportion of zeros in the ternary weights (0–0.7); one point for the model with binary weights, a curve for models with ternarized weights]
Quantized activation accuracy
[Plot: accuracy relative to real-valued activations (0–1) vs. quantized activation bit count (1–8), for ternary and for binary weights]
Current focus areas
• post-training lossy compression (pruning and quantization)
• engineering/ML training co-design
• infrastructure:
  – beating BLAS on embedded platforms
  – extending the platform abstraction layer to embedded GPUs
  – global optimizer
Questions?
• https://microsoft.github.io/ELL/
• Code: https://github.com/Microsoft/ELL
• Model Gallery: https://microsoft.github.io/ELL/gallery/
Not every model is a winner