The Embedded Learning Library (ELL)



SLIDE 1

The Embedded Learning Library

SLIDE 2

The Embedded Learning Library (ELL)

Cross-compiler for AI pipelines, specialized for resource-constrained target platforms

https://github.com/Microsoft/ELL

AI Pipeline → ELL → Target Machine Code

SLIDE 3
  • 3 years at Microsoft Research
  • compiler toolchain, tutorials, model gallery
  • focus: ARM CPUs and embedded GPUs; vision on ARM Cortex-A53, keyword spotting on ARM Cortex-M4F

The Embedded Learning Library

SLIDE 4

[Architecture diagram: Importers and ELL Trainers feed a Computation Graph Optimizer; an ELL Platform Abstraction Layer with Target Profiles drives LLVM and OpenCL Emitters; inputs include a Target, a Dataset, and a Pretrained Model; backends: LLVM, OpenCL, BLAS]

Architecture

SLIDE 5

AI compiler vs. AI runtime

why AI compiler?
  • model-specific optimization
  • target-specific optimization
  • small executable

why AI runtime?
  • portability
  • seamless migration from cloud to edge

best of both worlds: just-in-time AI compiler

SLIDE 6

compression techniques:

  • efficient architectures
  • pruning
  • low precision math and quantization
  • low rank matrix approximation

Evaluation

small loss in accuracy, large reduction in cost
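Of the techniques listed above, pruning is the easiest to sketch. A minimal magnitude-pruning pass (illustrative only; the function name and list-based representation are assumptions, not ELL's implementation):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(len(weights) * sparsity)          # how many weights to drop
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])                     # indices of the smallest weights
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

# half the weights zeroed, ideally with small effect on the output
print(magnitude_prune([0.5, -0.1, 0.9, 0.05], 0.5))  # [0.5, 0.0, 0.9, 0.0]
```

The other techniques (quantization, low-rank approximation) follow the same pattern: trade a small, measurable accuracy loss for a large drop in compute and memory cost.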

SLIDE 7

January 2018

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Architecture search

model Pareto frontier

SLIDE 8

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Architecture search

January 2018

SLIDE 9

February 2018

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Architecture search

SLIDE 10

March 2018

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Architecture search

SLIDE 11

April 2018

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Architecture search

SLIDE 12
  • variety of convolution kernels
  • scheduling
  • engineering

Lossless acceleration

SLIDE 13

January 2019

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Lossless acceleration

SLIDE 14

February 2019

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Lossless acceleration

SLIDE 15

March 2019

[Plot: ILSVRC2012 top-1 accuracy vs. ms/image on RPi3 @ 700 MHz]

Lossless acceleration


SLIDE 16

  • mix and match compression techniques
  • engineering/ML co-design
  • during training vs. post-processing

Lossy Acceleration

SLIDE 17

Quantization semantics

  • binary: 1 bit → value -1 or +1
  • ternary: 2 bits → -1 (00), 0 (01), +1 (10), n/a (11)
  • linear: k bits → values in [0 … 2^k - 1]
  • exponential: k bits → values in [-2^(b-1)-1 … 2^(b-1)-1]
  • lookup/clustered: k bits → value from a lookup table
  • iterative sum: k bits → value a ± b ± c ± … ± n
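As a sketch of the linear semantics above (uniform quantization over an assumed range [lo, hi]; illustrative only, not ELL's API):

```python
def linear_quantize(x, k, lo, hi):
    """Map a float in [lo, hi] to an unsigned k-bit code in [0, 2^k - 1]."""
    levels = (1 << k) - 1
    code = round((x - lo) / (hi - lo) * levels)
    return max(0, min(levels, code))   # clamp out-of-range inputs

def linear_dequantize(code, k, lo, hi):
    """Map a k-bit code back to its representative value."""
    levels = (1 << k) - 1
    return lo + code / levels * (hi - lo)
```

The other semantics swap only the code-to-value map: a sign flip for binary/ternary, a power for exponential, a table for lookup/clustered.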

SLIDE 18

Quantization representation

  • bit packed: b3 b2 b1 b0 | a3 a2 a1 a0 | d3 d2 d1 d0 | c3 c2 c1 c0 (all bits of one value stored contiguously)
  • bit planes: d0 c0 b0 a0 | d1 c1 b1 a1 | d2 c2 b2 a2 | d3 c3 b3 a3 (bit i of every value stored contiguously)
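The bitplane layout can be sketched in a few lines (element order and which end is most significant are assumptions for illustration), using the 3-bit activations 5 1 7 6 3 4 2 5 from the quantization example on the next slides:

```python
def to_bitplanes(values, bits):
    """Pack small unsigned ints into bitplanes: plane i holds bit i of
    every value, with the first value in the most significant position."""
    planes = []
    for i in range(bits):
        plane = 0
        for v in values:
            plane = (plane << 1) | ((v >> i) & 1)
        planes.append(plane)
    return planes

planes = to_bitplanes([5, 1, 7, 6, 3, 4, 2, 5], 3)
# planes == [0b11101001, 0b00111010, 0b10110101]
```

Eight 3-bit values now occupy three machine words, so one bitwise instruction touches all eight elements at once.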

SLIDE 19

Quantization example

ternary weights, 3-bit unsigned linear activations (bitplane representation):

activations: 5 1 7 6 3 4 2 5
weights: 1 -1 0 -1 -1 -1 1 0

dot = 5*1 + 1*(-1) + 7*0 + 6*(-1) + 3*(-1) + 4*(-1) + 2*1 + 5*0 = -7
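The dot product above can be checked with a plain reference implementation, before any bit tricks:

```python
activations = [5, 1, 7, 6, 3, 4, 2, 5]
weights = [1, -1, 0, -1, -1, -1, 1, 0]  # ternary: each weight is -1, 0, or +1
dot = sum(a * w for a, w in zip(activations, weights))
# dot == -7
```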

SLIDE 20

Quantization example

activations (3-bit): 5 1 7 6 3 4 2 5 → bitplanes a0 = 11101001, a1 = 00111010, a2 = 10110101
weights (ternary): 1 -1 0 -1 -1 -1 1 0 → magnitude m = 11011110, sign s = 01011100

SLIDE 21

Quantization example


SLIDE 22

Quantization example

bitplane 0 (a0 = 11101001):

absSum: o = a0 & m = 11101001 & 11011110 = 11001000; absSum += popcount(o) → absSum = 3
negSum: o = a0 & s = 11101001 & 01011100 = 01001000; negSum += popcount(o) → negSum = 2

per bitplane:
absSum: o = a & m; absSum += popcount(o)
negSum: o = a & s; negSum += popcount(o)

SLIDE 23

Quantization example

bitplane 1 (a1 = 00111010), counts weighted by 2:

absSum: o = a1 & m = 00111010 & 11011110 = 00011010; absSum += popcount(o) << 1 → absSum = 3 + 2*3 = 9
negSum: o = a1 & s = 00111010 & 01011100 = 00011000; negSum += popcount(o) << 1 → negSum = 2 + 2*2 = 6

SLIDE 24

Quantization example

bitplane 2 (a2 = 10110101), counts weighted by 4:

absSum: o = a2 & m = 10110101 & 11011110 = 10010100; absSum += popcount(o) << 2 → absSum = 9 + 4*3 = 21
negSum: o = a2 & s = 10110101 & 01011100 = 00010100; negSum += popcount(o) << 2 → negSum = 6 + 4*2 = 14

total = absSum - 2 * negSum = 21 - 2*14 = -7
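The three bitplane steps above can be put together in one sketch (pure Python standing in for the vectorized AND/popcount loop; the function name is an assumption, but absSum/negSum follow the slides):

```python
def ternary_bitplane_dot(activations, weights, bits=3):
    """Dot product of unsigned linear activations with ternary weights via
    bitplanes: AND each activation bitplane with the weight magnitude and
    sign masks, popcount, and weight the counts by the bit position."""
    # weight masks: magnitude marks nonzero weights, sign marks negative ones
    m = s = 0
    for w in weights:
        m = (m << 1) | (1 if w != 0 else 0)
        s = (s << 1) | (1 if w < 0 else 0)
    absSum = negSum = 0
    for i in range(bits):
        # bitplane i of the activations, first element in the MSB
        a = 0
        for v in activations:
            a = (a << 1) | ((v >> i) & 1)
        absSum += bin(a & m).count("1") << i
        negSum += bin(a & s).count("1") << i
    # negative weights were counted as +a in absSum; subtract twice to make them -a
    return absSum - 2 * negSum

print(ternary_bitplane_dot([5, 1, 7, 6, 3, 4, 2, 5], [1, -1, 0, -1, -1, -1, 1, 0]))  # -7
```

On real hardware the inner loop disappears: each `a & m` and popcount is one SIMD instruction over a whole word of packed elements.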

SLIDE 25

Quantization example

with 8-bit words:
  instruction count = 8 instructions * 3 bitplanes = 24 instructions
  vector size = 8 → instructions per element = 24 / 8 = 3

with 128-bit words (NEON):
  instruction count = 8 instructions * 3 bitplanes + ~0.3 reduce ops = 24.3 instructions
  vector size = 128 → instructions per element = 24.3 / 128 ≈ 0.19 (about 5x faster than float)

SLIDE 26

Quantization performance

[Chart: speedup of quantized vs. full-precision inference on ARM1176, for 1-bit, 2-bit, 3-bit, and 8-bit quantization]

SLIDE 27

[Plot: accuracy relative to the original model vs. proportion of zeros in ternary weights; series: model with binary weights, models with ternarized weights]

Quantized weight accuracy

SLIDE 28

Quantized activation accuracy

[Plot: accuracy relative to real-valued activations vs. quantized activation bit count (1-8); series: ternary weights, binary weights]

SLIDE 29
  • post-training lossy compression (pruning and quantization)
  • engineering/ML training co-design
  • infrastructure: beating BLAS on embedded platforms, extending the platform abstraction layer to embedded GPUs, a global optimizer

Current focus areas

SLIDE 30

Questions?

  • https://microsoft.github.io/ELL/
  • Code: https://github.com/Microsoft/ELL
  • Model Gallery: https://microsoft.github.io/ELL/gallery/
SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

Not every model is a winner

SLIDE 36