
Model Compression Seminar: Advanced Machine Learning, SS 2016


  1. Model Compression. Seminar: Advanced Machine Learning, SS 2016. Markus Beuckelmann (markus.beuckelmann@stud.uni-heidelberg.de), July 19, 2016.

  2. Outline
  1 Overview & Motivation
    ◇ Why do we need model compression?
    ◇ Embedded & mobile devices
    ◇ DRAM vs. SRAM
  2 Recap: Neural Networks for Prediction
  3 Neural Network Compression & Model Compression
    ◇ Neural Network Pruning: OBD and OBS
    ◇ Knowledge Distillation
    ◇ Deep Compression
  4 Summary

  3. Section 1: Overview & Motivation

  4. Success of Neural Networks
  • Image recognition
  • Image classification
  • Speech recognition
  • Natural Language Processing
  (Han et al., 2015) (TensorFlow)

  5. Problem: Predictive Performance is Not Enough
  • There are different metrics when it comes to evaluating a model
  • Usually there is some kind of trade-off, so the choice is governed by deployment requirements
  How good is your model in terms of ...?
  • Predictive performance
  • Speed (time complexity) in training/testing
  • Memory complexity in training/testing
  • Energy consumption in training/testing

  6. AlexNet: Millions of Parameters
  AlexNet (Krizhevsky et al., 2012)
  • Trained on ImageNet (15 · 10^6 training images, 22 · 10^3 categories)
  • Number of neurons: 650 · 10^3
  • Number of free parameters: 61 · 10^6
  • ≈ 233 MiB (32-bit float)
  • Having this many parameters is expensive in memory, time and energy.
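  As a quick sanity check of the ≈ 233 MiB figure, a back-of-the-envelope calculation (only the 61 · 10^6 parameter count is taken from the slide; the rest is plain arithmetic):

  ```python
  # Rough memory footprint of AlexNet's weights stored as 32-bit floats.
  num_parameters = 61_000_000      # free parameters reported for AlexNet
  bytes_per_param = 4              # 32-bit float = 4 bytes

  total_bytes = num_parameters * bytes_per_param
  total_mib = total_bytes / 2**20  # 1 MiB = 2^20 bytes

  print(f"{total_mib:.0f} MiB")    # ~233 MiB, matching the slide
  ```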

  7. Mobile & Embedded Devices
  • Smartphones
  • Hearing implants
  • Credit cards, etc.
  Smartphone hardware (2016):
  • CPU: 2 × 1.7 GHz
  • DRAM: 2 GiB
  • SRAM: MiB-scale
  • Battery: 2000 mAh
  (Micriµm, Embedded Software)
  • Limitations: storage, battery, computational power, network bandwidth
  Model Compression: Find a minimum topology of the model.

  9. Minimizing Energy Consumption: SRAM & DRAM
  • DRAM: slower, higher energy consumption, cheaper
  • SRAM: faster, lower energy consumption, more expensive, usually used as cache memory
  (Han et al., 2015)
  • If we can fit the whole model into SRAM, we will consume drastically less energy and gain significant speedups!

  10. Section 2: Neural Networks

  11. Neural Networks: Basics
  Feed-forward networks
  • $a^{(i+1)} = (W^{(i+1)})^\top z^{(i)}$, with $z^{(0)} := x$
  • $z^{(i+1)} = g^{(i+1)}(a^{(i+1)})$
  • $f(x) = g^{(N)}\big(W^{(N)} \cdots\, g^{(1)}(W^{(1)} x) \cdots\big)$
  • $\hat{y} = \arg\max f(x)$
  • Training: gradient descent, backpropagation (Rajesh Rai, AI lecture)
  • Powerful, (non-linear) classification/regression
  • Keep in mind: there are more complex architectures! (http://deepdish.io)
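  To make the forward pass above concrete, a minimal NumPy sketch; the layer sizes, random weights and the tanh/softmax activations are illustrative assumptions, not taken from the slides:

  ```python
  import numpy as np

  def forward(x, weights, activations):
      """Feed-forward pass: a^(i+1) = (W^(i+1))^T z^(i),  z^(i+1) = g^(i+1)(a^(i+1))."""
      z = x                                   # z^(0) := x
      for W, g in zip(weights, activations):
          a = W.T @ z                         # pre-activation
          z = g(a)                            # layer output
      return z

  # Illustrative two-layer network: 4 inputs -> 8 hidden units -> 3 classes
  rng = np.random.default_rng(0)
  weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
  softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
  activations = [np.tanh, softmax]

  x = rng.normal(size=4)
  p = forward(x, weights, activations)        # class probabilities
  y_hat = int(np.argmax(p))                   # predicted class
  ```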

  12. Neural Networks: Prediction
  (Zeiler, 2013)
  Loss functions
  • Regression: $\mathcal{L}(\theta \mid X, y) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
  • Multiclass classification: $\mathcal{L}(\theta \mid X, y) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \cdot \log P(\hat{y}_{ik})$
  • The last layer is usually a softmax layer: $p = z^{(l)} = \exp(a^{(l)}) \big/ \sum_{k=1}^{K} \exp(a^{(l)}_k)$
  • In the end, we get a posterior probability distribution over the classes
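  A short NumPy sketch of the softmax layer and the multiclass cross-entropy loss shown above; the toy pre-activations and one-hot labels are made up for illustration:

  ```python
  import numpy as np

  def softmax(a):
      """p_k = exp(a_k) / sum_j exp(a_j), shifted for numerical stability."""
      e = np.exp(a - a.max(axis=-1, keepdims=True))
      return e / e.sum(axis=-1, keepdims=True)

  def cross_entropy(Y, P):
      """L = -sum_i sum_k y_ik * log(p_ik), averaged over the N samples."""
      return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

  # Toy example: 2 samples, 3 classes (one-hot targets)
  A = np.array([[2.0, 0.5, -1.0],
                [0.1, 1.5,  0.3]])   # last-layer pre-activations a^(l)
  Y = np.array([[1, 0, 0],
                [0, 1, 0]])          # true labels y_ik

  P = softmax(A)                      # posterior distribution over classes
  loss = cross_entropy(Y, P)
  ```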

  13. Section 3: Neural Network Compression & Model Compression

  15. Pruning: Overview
  • Selectively removing weights / neurons
  • Compression: 2× to 4×
  • Usually combined with retraining
  (Ben Lorica, O'Reilly Media)
  Important questions
  • Which weights should we remove first?
  • How many weights can we remove?
  • What about the order of removal?

  16. Motivation: Synaptic Pruning
  • In humans we have synaptic pruning
  • This removes redundant connections in the brain
  (Seeman et al., 1987)

  19. Pruning: How do we find the least important weight(s)?
  • Brute-force pruning
    ◇ $\mathcal{O}(NW^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks
  • Simple heuristics
    ◇ Magnitude-based damage: look at $\|w\|_p$
    ◇ Variance-based damage
  • More rigorous approaches
    ◇ Optimal Brain Damage (OBD) (LeCun et al., 1990)
    ◇ Optimal Brain Surgeon (OBS) (Hassibi et al., 1993)
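  To illustrate the magnitude-based heuristic, a minimal NumPy sketch that zeroes out the smallest-magnitude weights; the 90% pruning fraction and the random weight matrix are illustrative assumptions:

  ```python
  import numpy as np

  def magnitude_prune(W, fraction=0.9):
      """Set the `fraction` of weights with the smallest |w| to zero.

      Simple magnitude-based heuristic: a weight's importance is
      approximated by its absolute value (the p = 1 case of ||w||_p).
      """
      threshold = np.quantile(np.abs(W), fraction)   # cut-off magnitude
      mask = np.abs(W) > threshold                   # keep only large weights
      return W * mask, mask                          # pruned weights + mask for retraining

  # Illustrative usage on a random weight matrix
  rng = np.random.default_rng(0)
  W = rng.normal(size=(256, 128))
  W_pruned, mask = magnitude_prune(W, fraction=0.9)
  print(f"remaining weights: {mask.mean():.1%}")     # ~10% of the original weights
  ```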

  20. Optimal Brain Damage (OBD)
  • Small perturbation: $\delta w \Rightarrow \delta \mathcal{L} = \mathcal{L}(w + \delta w) - \mathcal{L}(w)$
  • Taylor expansion:
    $\delta \mathcal{L} \approx \left(\frac{\partial \mathcal{L}}{\partial w}\right)^{\top} \delta w + \frac{1}{2}\, \delta w^\top H\, \delta w + \mathcal{O}(\|\delta w\|^3)$
    $\Rightarrow \delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\, \delta w_i + \frac{1}{2} \sum_{(i,j)} \delta w_i\, (H)_{ij}\, \delta w_j + \mathcal{O}(\|\delta w\|^3)$
  • With the Hessian $(H)_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i\, \partial w_j}$
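  A tiny numerical check of this second-order expansion on a toy two-parameter loss (the quartic toy loss and the perturbation values are assumptions made purely for illustration):

  ```python
  import numpy as np

  # Toy loss L(w) = w0^4 + 3*w0*w1 + 2*w1^2 with analytic gradient and Hessian.
  def loss(w):
      return w[0]**4 + 3*w[0]*w[1] + 2*w[1]**2

  def grad(w):
      return np.array([4*w[0]**3 + 3*w[1], 3*w[0] + 4*w[1]])

  def hessian(w):
      return np.array([[12*w[0]**2, 3.0],
                       [3.0,        4.0]])

  w = np.array([1.0, -0.5])
  dw = np.array([1e-2, -2e-2])                     # small perturbation delta w

  exact = loss(w + dw) - loss(w)                   # delta L
  second_order = grad(w) @ dw + 0.5 * dw @ hessian(w) @ dw
  print(exact, second_order)                       # agree up to O(||dw||^3)
  ```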

  22. Optimal Brain Damage (OBD)
  • We need to deal with:
    $\delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\, \delta w_i + \frac{1}{2} \sum_i (H)_{ii}\, \delta w_i^2 + \frac{1}{2} \sum_{i \neq j} \delta w_i\, (H)_{ij}\, \delta w_j + \mathcal{O}(\|\delta w\|^3)$
  Approximations
  • Extremal assumption: local optimum (training has converged), so the first-order term vanishes
  • Diagonal assumption: $H$ is diagonal, so the cross terms $i \neq j$ drop out
  • Quadratic approximation: $\mathcal{L}$ is approximately quadratic, so higher-order terms are neglected
  • Now we are left with:
    $\delta \mathcal{L} \approx \frac{1}{2} \sum_i (H)_{ii}\, \delta w_i^2 \;\rightarrow\; s_k = \frac{1}{2} (H)_{kk}\, w_k^2$

  23. OBD: The Algorithm
  1 Choose a reasonable network architecture
  2 Train the network until a reasonable local minimum is obtained
  3 Compute the diagonal of the Hessian, i.e. $(H)_{kk}$
  4 Compute the saliencies $s_k = \frac{1}{2} (H)_{kk}\, w_k^2$ for each parameter
  5 Sort the parameters by $s_k$
  6 Delete parameters with low saliency
  7 (Optional: iterate back to step 2)
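  A compact sketch of steps 3–7, assuming the trained parameters and the diagonal Hessian entries are already available as flat arrays; the 20% pruning fraction and the random placeholder values are illustrative assumptions, and retraining (step 7) is left as a comment:

  ```python
  import numpy as np

  def obd_prune(w, h_diag, prune_fraction=0.2):
      """One OBD pruning pass over a flat parameter vector.

      w:      trained parameters w_k (steps 1-2 already done)
      h_diag: diagonal Hessian entries (H)_kk for each parameter (step 3)
      """
      saliency = 0.5 * h_diag * w**2              # step 4: s_k = 1/2 (H)_kk w_k^2
      order = np.argsort(saliency)                # step 5: sort by saliency
      n_prune = int(prune_fraction * w.size)
      mask = np.ones_like(w, dtype=bool)
      mask[order[:n_prune]] = False               # step 6: delete low-saliency parameters
      return w * mask, mask

  # Illustrative usage; in practice w and h_diag come from the trained network.
  rng = np.random.default_rng(0)
  w = rng.normal(size=1000)
  h_diag = np.abs(rng.normal(size=1000))          # placeholder positive curvature values
  w_pruned, mask = obd_prune(w, h_diag, prune_fraction=0.2)
  # Step 7 (optional): retrain the remaining weights and iterate.
  ```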

  24. OBD: Experimental Results
  • Data: MNIST (handwritten digit recognition)
  • Left panel (a): comparison to magnitude-based pruning
  • Right panel (b): comparison using the saliencies
  (LeCun et al., 1990)

  25. OBD: Experimental Results – With Retraining
  • This is what it looks like with retraining.
  • Left panel (a): retraining (training data)
  • Right panel (b): retraining (test data)
  (LeCun et al., 1990)
