
My Current Dream: Managing Machine Learning on Modern Hardware, Ce Zhang - PowerPoint PPT Presentation



  1. My Current Dream: Managing Machine Learning on Modern Hardware. Ce Zhang.

  2. ML: a fancy UDF breaking into the traditional data workflow. [Figure: Input / Recovered / Ground Truth.] Requirements: (1) play well with existing data ecosystems (e.g., SciDB); (2) fast! (< 20 TB/s of data), which seems to be a hardware problem. (MNRAS'17)

  3. ML on Modern Hardware: how to manage this messy cross-product? APPLICATIONS (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Tree, Bayesian Models) x HARDWARE (FPGA, GPU, Xeon Phi, CPU). How? And will it actually help?

  4. Hasn’t Stochastic Gradient Descent already solved the whole thing?

  5. Hasn’t Stochastic Gradient Descent already solved the whole thing? Logically, not tooooooo far off (I can live with it, unhappily). Physically, things get sophisticated.

  6. Goal: how to manage this messy cross-product? APPLICATIONS (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Tree, Bayesian Models) x HARDWARE (FPGA, GPU, Xeon Phi, CPU). The applications differ in error tolerance, in $$$, and in performance expectations, and in scale (e.g., data A: 20 GB with model x: 4 MB vs. A: 4 MB with x: 240 GB); the hardware has very different system architectures. We need an “optimizer” for machine learning on modern hardware.

  7. No idea about the full answer.

  8. How many bits do you need to represent a single number in machine learning systems?

  9. Data Flow in Machine Learning Systems. Gradient: dot(A_r, x) A_r. Data flows from the data source (sensor, database) into a storage device (DRAM, CPU cache) holding the data A_r and the model x, and on to the computation device (FPGA, GPU, CPU); the figure labels three channels, 1, 2, and 3. Takeaway: you can do all three channels with low precision, with some care.
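
As a schematic of where the three channels sit in one SGD step (my own sketch; `quantize` stands in for the rounding schemes the next slides develop, and the channel-to-step mapping is my reading of the figure):

```python
# Schematic of the three low-precision channels in one SGD step (my own
# sketch; `quantize` stands in for the rounding schemes developed on the
# next slides, and the mapping of channels to steps is my reading).
import numpy as np

def sgd_step(a_row, b, x, gamma, quantize):
    a_q = quantize(a_row)             # channel: data read from storage
    x_q = quantize(x)                 # channel: model held in DRAM/cache
    grad = 2 * (a_q @ x_q - b) * a_q  # gradient 2 a (a.x - b), cf. slide 11
    return x - gamma * quantize(grad) # channel: gradient to/from compute

# With no-op quantization this reduces to plain SGD:
x_new = sgd_step(np.array([1.0, 2.0]), 0.5, np.zeros(2), 0.1, lambda v: v)
```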

  10. ZipML. Naive solution: nearest rounding ([... 0.7 ...] => 1) => converges to a different solution. Stochastic rounding: quantize 0.7 to 0 with probability 0.3 and to 1 with probability 0.7. Gradient: dot(A_r, x) A_r. The expectation matches => OK! (Over-simplified; one needs to be careful about variance!) (NIPS'15)
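
For concreteness, here is a minimal NumPy sketch of stochastic rounding onto a uniform grid (my own illustration; the function name, grid, and value range are assumptions, not the deck's implementation):

```python
# Stochastic rounding onto a uniform grid: a minimal sketch (my own
# illustration, not the deck's actual implementation).
import numpy as np

def stochastic_round(x, num_bits, lo=0.0, hi=1.0, rng=None):
    """Round x onto a uniform grid of 2**num_bits levels spanning [lo, hi],
    picking the upper neighbor with probability equal to the fractional
    position, so that E[stochastic_round(x)] == x."""
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** num_bits - 1                    # number of grid intervals
    scaled = (np.clip(x, lo, hi) - lo) / (hi - lo) * levels
    lower = np.floor(scaled)
    go_up = rng.random(np.shape(scaled)) < (scaled - lower)
    return (lower + go_up) / levels * (hi - lo) + lo

rng = np.random.default_rng(0)
vals = stochastic_round(np.full(100_000, 0.7), num_bits=1, rng=rng)
print(vals.mean())  # ~0.7: expectation matches, unlike nearest rounding (-> 1.0)
```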

  11. ZipML. Loss: (ax - b)^2; gradient: 2a(ax - b). Quantize the data stochastically, e.g., [... 0.7 ...] becomes 0 with p = 0.3 and 1 with p = 0.7. The expectation matches => OK!

  12. ZipML. Expectation matches => OK? NO!! [Plot: training loss vs. # iterations; naive sampling converges to a higher loss than 32-bit.] Why? The gradient 2a(ax - b) is not linear in a.
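
To see the bias concretely, a small Monte Carlo check (my own illustration, with 1-bit stochastic rounding of a = 0.7 in [0, 1]): reusing one quantized sample in both occurrences of a inflates the gradient by 2x Var(a_tilde), because E[a_tilde^2] = a^2 + Var(a_tilde).

```python
# Monte Carlo check that naive sampling biases 2a(ax - b): my own toy
# illustration with 1-bit stochastic rounding of a scalar a in [0, 1].
import numpy as np

rng = np.random.default_rng(0)
a, x, b, n = 0.7, 0.5, 0.2, 1_000_000
true_grad = 2 * a * (a * x - b)                       # 0.21

# One stochastic-rounding sample of `a`, reused in BOTH occurrences:
a_tilde = (rng.random(n) < a).astype(float)           # 1 w.p. 0.7, else 0
naive = (2 * a_tilde * (a_tilde * x - b)).mean()

# E[a_tilde**2] = a**2 + Var(a_tilde), so naive is off by 2*x*Var(a_tilde):
print(true_grad, naive, true_grad + 2 * x * a * (1 - a))  # 0.21, ~0.42, 0.42
```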

  13. ZipML: “Double Sampling”. How do we generate samples of a that give an unbiased estimator of 2a(ax - b)? Use TWO independent samples: 2 a_1 (a_2 x - b). How many more bits do we need to store the second sample? Not 2x overhead! 3 bits store the first sample; the second sample has only 3 choices relative to the first (up, down, or same) => 2 bits. We can do even better, since the samples are symmetric (order does not matter): 8 equal pairs plus 7 adjacent pairs = 15 distinct possibilities => 4 bits store both samples => only 1 bit of overhead. [Plot: training loss vs. # iterations for naive sampling vs. 32-bit.] (arXiv'16)
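
A matching sketch of the double-sampling estimator on the same toy problem (again my own illustration, not the deck's code):

```python
# Double sampling: two INDEPENDENT quantized samples of `a` make
# 2 * a1 * (a2 * x - b) unbiased, since E[a1 * a2] = E[a1] E[a2] = a**2.
# (Same toy 1-bit setup as the previous sketch; my own illustration.)
import numpy as np

rng = np.random.default_rng(1)
a, x, b, n = 0.7, 0.5, 0.2, 1_000_000
a1 = (rng.random(n) < a).astype(float)   # first sample of a
a2 = (rng.random(n) < a).astype(float)   # independent second sample
print((2 * a1 * (a2 * x - b)).mean())    # ~0.21 = 2a(ax - b): unbiased
```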

  14. It works!

  15. Experiments: tomographic reconstruction, i.e., linear regression with fancy regularization (but a 240 GB model). 32-bit floating point vs. 12-bit fixed point.

  16. Just to make it a little bit more “Swiss”: [side-by-side reconstructions at 32-bit and 8-bit].

  17. There is no magic; it is tradeoff, tradeoff, & TRADEOFF. Fixed step size, RMSE vs. the ground truth: 32-bit 0.000, 16-bit 0.000, 8-bit 0.000, 4-bit 0.007, 2-bit 0.172, 1-bit 0.929. The variance gets larger => you can still converge, but you need smaller step sizes => potentially more iterations to reach the same quality.
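
For context, one textbook way to state this step-size/variance tradeoff (my addition, not from the deck; exact constants vary by source) is the constant-step-size SGD bound for a mu-strongly convex, L-smooth objective with gradient-noise variance at most sigma^2:

```latex
% A textbook constant-step-size SGD bound (added for context; not from
% the deck, and the constants vary across references):
\mathbb{E}\,\lVert x_k - x^\star \rVert^2
  \;\le\; (1 - \gamma\mu)^k \,\lVert x_0 - x^\star \rVert^2
  \;+\; \mathcal{O}\!\left(\frac{\gamma\,\sigma^2}{\mu}\right)
```

A larger quantization variance sigma^2 raises the error floor unless gamma shrinks, and a smaller gamma means slower geometric decay, i.e., more iterations: exactly the tradeoff on this slide.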

  18. More “classic” analytics is easier: billions of rows, thousands of columns. [Plots: training loss vs. # epochs for (a) linear regression and (b) LS-SVM under (1) quantized data, (2) quantized data and gradient, and (3) quantized data, gradient, and model, at 2-3 bits vs. the original 32-bit data; the final-loss gaps range from 0.06% to 0.89% for linear regression and from 0.88% to 3.82% for LS-SVM.]

  19. It works, but is what we are doing optimal?

  20. Not really: a data-optimal quantization strategy. Consider a value at distance a from marker A and distance b from marker B. Probability of quantizing to A: P_A = b / (a + b); probability of quantizing to B: P_B = a / (a + b). Expectation of the quantization error for A, B (variance): a P_A + b P_B = 2ab / (a + b). Intuitively, shouldn't we put more markers where the data is dense?
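
A quick numeric check of these formulas (my own illustration; the marker positions are arbitrary):

```python
# Numeric check of the slide's quantization formulas (my own illustration):
# a value v sits at distance a from marker A and distance b from marker B.
A, B, v = 0.0, 1.0, 0.7
a, b = v - A, B - v
p_A, p_B = b / (a + b), a / (a + b)            # chosen so that E[q] = v
assert abs(A * p_A + B * p_B - v) < 1e-12      # unbiased quantization
print(a * p_A + b * p_B, 2 * a * b / (a + b))  # expected error: 0.42 == 0.42
```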

  21. Not really: a data-optimal quantization strategy. Moving the upper marker from B to a farther B' changes the expected quantization error (variance) from 2ab / (a + b) to 2a(b + b') / (a + b + b'), so marker placement matters. A P-TIME dynamic program finds the optimal set of markers c_1 < ... < c_s:
\min_{c_1, \dots, c_s} \sum_j \sum_{a \in [c_j, c_{j+1}]} \frac{(a - c_j)(c_{j+1} - a)}{c_{j+1} - c_j},
where the inner sum runs over all data points falling into the interval [c_j, c_{j+1}]; dense data regions get dense markers.
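
Here is a compact sketch of such a dynamic program (my reading of the slide: I restrict candidate markers to the sorted data points themselves, which is an assumption, and make no effort at efficiency; roughly O(s m^3) as written):

```python
# Data-optimal quantization levels by dynamic programming: a sketch of my
# reading of the slide, restricting candidate markers to the data points.
import numpy as np

def optimal_levels(data, s):
    """Choose s markers minimizing the slide's objective, with candidates
    restricted to the sorted unique data values (an assumption)."""
    pts = np.sort(np.unique(np.asarray(data, dtype=float)))
    m = len(pts)
    assert 2 <= s <= m

    def interval_cost(i, j):  # markers at pts[i] < pts[j], points between
        inside = pts[i:j + 1]
        return float(np.sum((inside - pts[i]) * (pts[j] - inside))
                     / (pts[j] - pts[i]))

    INF = float("inf")
    # dp[k][j]: best cost for pts[0..j] using k markers, the last at pts[j].
    dp = [[INF] * m for _ in range(s + 1)]
    parent = [[-1] * m for _ in range(s + 1)]
    dp[1][0] = 0.0  # the first marker must sit at the smallest data point
    for k in range(2, s + 1):
        for j in range(k - 1, m):
            for i in range(k - 2, j):
                if dp[k - 1][i] < INF:
                    c = dp[k - 1][i] + interval_cost(i, j)
                    if c < dp[k][j]:
                        dp[k][j], parent[k][j] = c, i
    markers, j = [], m - 1  # the last marker must sit at the largest point
    for k in range(s, 0, -1):
        markers.append(pts[j])
        j = parent[k][j]
    return markers[::-1]

print(optimal_levels(np.random.default_rng(2).random(40), s=4))
```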

  22. Experiments. [Plot: training loss vs. # epochs for uniform levels (5 bits), uniform levels (3 bits), data-optimal levels (3 bits), and the original 32-bit data.]

  23. Enough “Theory”! Let’s Build Something REAL!

  24. Data Flow in Machine Learning Systems, revisited. Gradient: dot(A_r, x) A_r. Same pipeline: data source (sensor, database) -> storage device (DRAM, CPU cache) holding data A_r and model x -> computation device (FPGA, GPU, CPU). Channels 1, 2, and 3 all use low-precision quantization + optimal quantization.

  25. Two implementations. (1) FPGA: the gradient is computed on the FPGA, fed from the data source (sensor, database) through DRAM / CPU cache holding data A_r and model x. (2) GPU: the GPU computes gradients and exchanges them with a parameter server, over the same data source and storage path.

  26. “Fancy” things first :) Deep Learning.

  27. GPU: quantization. [Plots for deep learning, training loss vs. # epochs: impact of quantization (32-bit full precision vs. XNOR5) and impact of optimal quantization (32-bit vs. Optimal5 vs. 2-bit).]

  28. Full-precision SGD on the FPGA. A 64 B cache line carries 16 32-bit floating-point values; processing rate: 12.8 GB/s. Pipeline stages: (1) 16 float multipliers; (2) 16 float-to-fixed and 16 fixed-to-float converters; (3) 16 fixed-point adders; (4) a and b FIFOs; (5) dot product ax; (6) ax - b, via 1 fixed-to-float converter, 1 float adder, and 1 float multiplier; (7) gradient calculation, gamma (ax - b) a; (8) accumulation until the batch size is reached (16 float multipliers, 16 float-to-fixed converters); (9) model update, loading the model from FPGA BRAM (16 fixed adders). Custom logic on the FPGA; data source: sensor, database.
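
As a host-side reference for what this pipeline computes (my own sketch; the function and parameter names are mine, and the stage comments only roughly track the slide's numbering):

```python
# Host-side reference of the computation the FPGA pipeline implements
# (my own sketch for orientation; the real design streams 16 floats per
# 64 B cache line through the fixed-point multiplier/adder stages).
import numpy as np

def sgd_least_squares(A, b, gamma, batch_size, epochs):
    """Mini-batch SGD for least squares: x <- x - gamma * mean((ax - b) a)."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            rows = A[start:start + batch_size]
            labels = b[start:start + batch_size]
            residual = rows @ x - labels                # stages 5-6: ax, ax - b
            x -= gamma * rows.T @ residual / len(rows)  # stages 7-9: gradient, update
    return x
```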

  29. Challenge + solution. With quantized data, one 64 B cache line carries more values: 8-bit ZipML (Q8): 32 values; 4-bit ZipML (Q4): 64 values; 2-bit ZipML (Q2): 128 values; 1-bit ZipML (Q1): 256 values. => Scaling the circuit out is not trivial! But: (1) we can get rid of floating-point arithmetic; (2) we can further simplify integer arithmetic for the lowest-precision data.
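
One way to picture point (2) (my own illustration of the idea, not the deck's circuit): at 1 bit with values in {-1, +1}, multiplication reduces to XNOR on the bit encodings and accumulation to a popcount.

```python
# 1-bit dot product without multipliers (my own illustration of the
# "simplify integer arithmetic" point, not the deck's FPGA circuit):
# bit 0 encodes -1 and bit 1 encodes +1, so multiply == XNOR and
# accumulate == popcount.
import numpy as np

rng = np.random.default_rng(3)
a_bits = rng.integers(0, 2, 256)
x_bits = rng.integers(0, 2, 256)
agree = np.count_nonzero(a_bits == x_bits)   # XNOR + popcount
dot = 2 * agree - a_bits.size                # sum of {-1, +1} products
assert dot == np.dot(2 * a_bits - 1, 2 * x_bits - 1)
```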

  30. FPGA: speed. [Plots: training loss vs. time for float CPU 1-thread, float CPU 10-threads, Q4 / Q8 FPGA, and float FPGA; speedups of 2x and 7.2x when the data are already stored in a specific format.]

  31. VISION | ZipML: The Precision Manager for Machine Learning, spanning applications (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition), models (Linear Models, Deep Learning, Decision Tree, Bayesian Models), and hardware (FPGA, GPU, Xeon Phi, CPU).
