The Embedded Learning Library (ELL)
Cross-compiler for AI pipelines, specialized for resource-constrained target platforms
https://github.com/Microsoft/ELL
AI Pipeline → ELL → Target Machine Code
- 3 years at Microsoft Research
- compiler toolchain, tutorials, model gallery
- focus: ARM CPUs and embedded GPUs; vision on ARM Cortex-A53, keyword spotting on ARM Cortex-M4F
Architecture
[Diagram: importers bring in pretrained models, and ELL trainers consume target datasets; both produce an ELL computation graph, which an optimizer specializes using target profiles; a platform abstraction layer feeds LLVM and OpenCL emitters, built on LLVM, OpenCL, and BLAS]
AI compiler vs. AI runtime
Why an AI compiler?
- model-specific optimization
- target-specific optimization
- small executable
Why an AI runtime?
- portability
- seamless migration from cloud to edge
Best of both worlds: a just-in-time AI compiler
Compression techniques:
- efficient architectures
- pruning
- low-precision math and quantization
- low-rank matrix approximation
Evaluation
small loss in accuracy, large gain in cost
Architecture search
- model Pareto frontier
[Plot sequence, January 2018 through April 2018: ILSVRC2012 top-1 accuracy (30-70) vs. ms/image on Raspberry Pi 3 @ 700 MHz (100-1000); the Pareto frontier of models improves month over month]
Lossless acceleration
- variety of convolution kernels
- scheduling
- engineering
[Plot sequence, January 2019 through March 2019: ILSVRC2012 top-1 accuracy vs. ms/image on Raspberry Pi 3 @ 700 MHz; models move to lower latency month over month at unchanged accuracy]
- mix and match compression techniques
- engineering/ML co-design
- during training vs. post-processing
Lossy acceleration
Quantization semantics
- binary: 1 bit; 0 → -1, 1 → +1
- ternary: 2 bits; 00 → 0, 01 → +1, 11 → -1 (10 unused)
- linear: k bits; values in [0 ... 2^k - 1], or signed [-(2^(b-1) - 1) ... 2^(b-1) - 1]
- exponential: k bits
- lookup/clustered: k bits; value taken from a lookup table
- iterative sum: k bits; value of the form a ± b ± c ± ... ± n
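To make the decode rules concrete, here is a minimal C++ sketch of how a k-bit code could map back to a value under several of these semantics; the function names, step parameter, and codebook are illustrative assumptions, not ELL's API.

    #include <cstdint>
    #include <vector>

    // Illustrative decoders mapping quantized codes back to values.
    inline int decode_binary(uint32_t code) {        // 1 bit: 0 -> -1, 1 -> +1
        return code ? +1 : -1;
    }

    inline int decode_ternary(uint32_t code) {       // 2 bits: 00->0, 01->+1, 11->-1
        static const int table[4] = {0, +1, 0, -1};  // index 2 (10) is unused
        return table[code & 3u];
    }

    inline float decode_linear(uint32_t code, float step) {
        return step * static_cast<float>(code);      // k bits: step * [0 .. 2^k - 1]
    }

    inline float decode_lookup(uint32_t code, const std::vector<float>& codebook) {
        return codebook[code];                       // k bits index 2^k learned centroids
    }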
Quantization representation
- bit-packed: each value's bits stay together: [b3 b2 b1 b0 a3 a2 a1 a0], [d3 d2 d1 d0 c3 c2 c1 c0]
- bit planes: bit i of every value is stored together: [d0 c0 b0 a0], [d1 c1 b1 a1], [d2 c2 b2 a2], [d3 c3 b3 a3]
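A short sketch of converting eight k-bit values into the bit-plane layout; the lane order is chosen MSB-first so it matches the bit strings in the example below (illustrative code, not ELL's):

    #include <array>
    #include <cstdint>

    // Pack eight K-bit values into K bit planes: plane i collects bit i of
    // every value, one lane per value, leftmost lane = first value.
    template <int K>
    std::array<uint8_t, K> to_bit_planes(const uint8_t (&vals)[8]) {
        std::array<uint8_t, K> planes{};
        for (int i = 0; i < K; ++i)
            for (int j = 0; j < 8; ++j)
                planes[i] |= static_cast<uint8_t>(((vals[j] >> i) & 1u) << (7 - j));
        return planes;
    }

    // For vals = {5,1,7,6,3,4,2,5}, to_bit_planes<3> yields
    // {0b11101001, 0b00111010, 0b10110101} -- the planes used below.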
Quantization example
ternary weights, 3-bit unsigned linear activations, bit-plane representation:
  activations: 5  1 7  6  3  4 2 5
  weights:     1 -1 0 -1 -1 -1 1 0
dot = 5*1 + 1*(-1) + 7*0 + 6*(-1) + 3*(-1) + 4*(-1) + 2*1 + 5*0 = -7
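For reference, the same dot product computed directly on the decoded values (a trivial check; the numbers are from the slide):

    #include <cstdio>

    int main() {
        const int a[8] = {5, 1, 7, 6, 3, 4, 2, 5};     // 3-bit unsigned activations
        const int w[8] = {1, -1, 0, -1, -1, -1, 1, 0}; // ternary weights
        int dot = 0;
        for (int j = 0; j < 8; ++j) dot += a[j] * w[j];
        std::printf("dot = %d\n", dot);                // prints dot = -7
        return 0;
    }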
The weights are split into two 8-bit masks, a magnitude mask m marking nonzero weights and a sign mask s marking negative weights, and the activations are stored as three bit planes (one lane per activation, leftmost lane first):
  m = 11011110
  s = 01011100
  plane 0 (LSB): 11101001
  plane 1:       00111010
  plane 2 (MSB): 10110101
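A sketch of how the two masks could be derived from the ternary weights (illustrative only; lane order matches the bit strings above):

    #include <cstdint>

    // m gets a 1 for every nonzero weight, s a 1 for every negative weight;
    // s is a subset of m because every negative weight is also nonzero.
    void ternary_masks(const int w[8], uint8_t& m, uint8_t& s) {
        m = 0; s = 0;
        for (int j = 0; j < 8; ++j) {
            if (w[j] != 0) m |= static_cast<uint8_t>(1u << (7 - j));
            if (w[j] < 0)  s |= static_cast<uint8_t>(1u << (7 - j));
        }
    }
    // For w = {1,-1,0,-1,-1,-1,1,0}: m = 0b11011110, s = 0b01011100.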
Plane 0:
  o = plane0 & m = 11101001 & 11011110 = 11001000; absSum += popcount(o) → absSum = 3
  o = o & s = 11001000 & 01011100 = 01001000; negSum += popcount(o) → negSum = 2
(general step for plane i: o = a & m; absSum += popcount(o) << i; o &= s; negSum += popcount(o) << i)
Plane 1 (contributions weighted by 2, i.e. << 1):
  o = plane1 & m = 00111010 & 11011110 = 00011010; absSum += popcount(o) << 1 → 3 + 2*3 = 9
  o = o & s = 00011010 & 01011100 = 00011000; negSum += popcount(o) << 1 → 2 + 2*2 = 6
Plane 2 (contributions weighted by 4, i.e. << 2):
  o = plane2 & m = 10110101 & 11011110 = 10010100; absSum += popcount(o) << 2 → 9 + 4*3 = 21
  o = o & s = 10010100 & 01011100 = 00010100; negSum += popcount(o) << 2 → 6 + 4*2 = 14
total = absSum - 2*negSum = 21 - 2*14 = -7
(each activation under a negative weight was added into absSum with a + sign; subtracting it twice via negSum flips its sign)
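Putting the three plane steps into a loop gives the whole quantized dot product. A minimal C++20 sketch (std::popcount, which compilers lower to a population-count instruction on ARM); this is an illustration of the technique, not ELL's emitted code:

    #include <bit>      // std::popcount (C++20)
    #include <cstdint>
    #include <cstdio>

    // Bit-plane dot product: ternary weights as masks (m = nonzero, s = negative),
    // activations as bit planes. absSum accumulates a_j over nonzero weights,
    // negSum over negative weights; plane i contributes with weight 2^i.
    int ternary_dot(const uint8_t planes[3], uint8_t m, uint8_t s) {
        int absSum = 0, negSum = 0;
        for (int i = 0; i < 3; ++i) {
            uint8_t o = planes[i] & m;
            absSum += std::popcount(o) << i;
            o &= s;
            negSum += std::popcount(o) << i;
        }
        // Negative-weight terms entered absSum with a + sign; subtracting
        // them twice flips their sign, as in the worked total above.
        return absSum - 2 * negSum;
    }

    int main() {
        const uint8_t planes[3] = {0b11101001, 0b00111010, 0b10110101};
        std::printf("%d\n", ternary_dot(planes, 0b11011110, 0b01011100));  // -7
        return 0;
    }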
Instruction-count estimate for the per-plane loop:
  8-bit words: 8 instructions × 3 bit planes = 24 instructions; 8 lanes per word → 24 / 8 = 3 instructions per element
  128-bit words (NEON): 8 instructions × 3 bit planes + ~0.3 reduce ops = 24.3 instructions; 128 lanes per word → 24.3 / 128 ≈ 0.19 instructions per element (about 5x faster than float)
Quantization performance
[Plot: speedup of quantized vs. full-precision inference on ARM1176, for 1-, 2-, 3-, and 8-bit quantization; speedups on the order of 5x to 25x]
Quantized weight accuracy
[Plot: accuracy relative to the original model (0.1-1.0) vs. proportion of zeros in ternary weights (0.1-0.7); one model with binary weights, several models with ternarized weights]
Quantized activation accuracy
[Plot: accuracy relative to real-valued activations (0.1-1.0) vs. quantized activation bit count (1-8), for ternary and binary weights]
Current focus areas
- post-training lossy compression (pruning and quantization)
- engineering/ML training co-design
- infrastructure:
  - beating BLAS on embedded platforms
  - extending the platform abstraction layer to embedded GPUs
  - global optimizer
Questions?
- Website: https://microsoft.github.io/ELL/
- Code: https://github.com/Microsoft/ELL
- Model Gallery: https://microsoft.github.io/ELL/gallery/