Neural Network Compression: Linear Neural Reconstruction
David A. R. Robin
Internship with Swayambhoo Jain, Feb-Aug 2019
Report : www.robindar.com/m1-internship/report.pdf
Neural networks

$X \in \mathbb{R}^d \mapsto W_2 \cdot \sigma_1(W_1 \cdot \sigma_0(W_0 \cdot X))$
Notations (when considering a single layer)

- $d \in \mathbb{N}$ : number of inputs of a layer
- $h \in \mathbb{N}$ : number of outputs of a layer
- $W \in \mathbb{R}^{h \times d}$ : weights of a single layer (over $\mathbb{R}^d$)
- $\mathcal{D}$ : distribution of inputs to a layer
- $X \sim \mathcal{D}$ : input to a layer (random variable)
Previously in Network Compression
Previously in Network Compression : Pruning

- Pruning : remove weights (i.e. connections)
- Assumption : pruning small-magnitude weights $|w_q|$ → small loss increase (even when pruning several weights at once)
- Pruning algorithm : Prune, Retrain, Repeat
- Result : 90% of weights removed, same accuracy (high compressibility)
Previously in Network Compression : Explaining Pruning

Magnitude-based pruning requires retraining.
Pruning can keep neurons with no inputs or no outputs (in red)¹, as well as redundant neurons (in blue) that could be discarded at no cost.
Redundancy is not leveraged. Can we take advantage of redundancies?

¹ Given enough retraining with weight decay, these will be discarded.
Previously in Network Compression : Low-rank

$$\min_{P \in \mathbb{R}^{h \times r},\, Q \in \mathbb{R}^{d \times r}} \left\| W - PQ^T \right\|_2$$

Problems : keeps the hidden neuron count intact, data-agnostic
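As a reference point, here is a minimal numpy sketch of this data-agnostic low-rank baseline: factor $W$ into $PQ^T$ by truncating its SVD at rank $r$. The names $W$, $P$, $Q$ follow the slides; the rank and shapes are illustrative.

```python
import numpy as np

def low_rank_factorization(W, r):
    """Data-agnostic rank-r factorization W ≈ P @ Q.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :r] * s[:r]          # (h, r)
    Q = Vt[:r, :].T               # (d, r)
    return P, Q

# Illustrative usage: h = 100 outputs, d = 300 inputs, rank 20
W = np.random.randn(100, 300)
P, Q = low_rank_factorization(W, r=20)
print(np.linalg.norm(W - P @ Q.T))   # approximation error
```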
Contribution
Activation reconstruction

$L$-layer feed-forward flow:
- $Z_0$ : input to the network
- $Z_{k+1} = \sigma_k(W_k \cdot Z_k)$
- Use $Z_L$ as prediction

Weight approximation (theirs) : $\hat{W}_k \approx W_k$
Activation reconstruction (ours) : $\hat{Z}_k \approx Z_k$

We have more than weights, we have activations.
We only need $\sigma_k(\hat{W}_k \hat{Z}_k) \approx \sigma_k(W_k Z_k)$:
$$\hat{Z}_k \approx Z_k,\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k) \ \Rightarrow\ \hat{Z}_{k+1} \approx Z_{k+1}$$
$$\hat{W}_k \approx W_k \ \Rightarrow\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$$
Linear activation reconstruction

$$\hat{W}_k \approx W_k \ \Rightarrow\ \hat{W}_k Z_k \approx W_k Z_k \ \Rightarrow\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$$

The first ($\hat{W}_k \approx W_k$) is sub-optimal because data-agnostic.
The third ($\sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$) is non-convex, non-smooth.
Let's try to get the second : $\hat{W}_k \cdot Z_k \approx W_k \cdot Z_k$
Low-rank inspiration

Low-rank with activation reconstruction gives

$$\min_{P \in \mathbb{R}^{h \times r},\, Q \in \mathbb{R}^{d \times r}} \mathbb{E}_X \left\| WX - PQ^T X \right\|_2^2$$

$Q$ : feature extractor, $P$ : linear reconstruction from extracted features.
Knowing the right rank $r$ to use is hard. Soft low-rank would use the nuclear norm $\|\cdot\|_*$ instead:

$$\min_M \ \mathbb{E}_X \left\| WX - MX \right\|_2^2 + \lambda \cdot \| M \|_*$$

where $\lambda$ controls the tradeoff between compression and accuracy.
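A minimal sketch of how the data-aware, fixed-rank variant can be solved in closed form, assuming the auto-correlation matrix $R = \mathbb{E}[XX^T]$ is positive definite: minimizing $\mathbb{E}_X\|WX - MX\|^2$ over rank-$r$ matrices is the best rank-$r$ approximation of $W R^{1/2}$ in Frobenius norm. The whitening trick and function name are mine, not taken from the report.

```python
import numpy as np

def data_aware_low_rank(W, R, r, eps=1e-8):
    """Rank-r M minimizing E||WX - MX||^2, with R = E[X X^T] assumed positive definite."""
    # Symmetric square root R^{1/2} and its inverse via eigendecomposition
    vals, vecs = np.linalg.eigh(R)
    vals = np.maximum(vals, eps)
    R_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    R_half_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    # Best rank-r approximation of B = W R^{1/2}, then un-whiten
    B = W @ R_half
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return B_r @ R_half_inv

W = np.random.randn(100, 300)
X = np.random.randn(300, 10000)
R = X @ X.T / X.shape[1]
M = data_aware_low_rank(W, R, r=20)
```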
Neuron removal

$C_i(M) = 0$ (the $i$-th column of $M$ is zero) $\Rightarrow$ $X_i$ is never used $\Rightarrow$ we can remove neuron n°$i$.
Column-sparse matrices remove neurons. The characterization of such matrices is reminiscent of low-rank:

Low-Rank : $M = PQ^T$
- $P \in \mathbb{R}^{h \times r_Q}$
- $Q \in \mathbb{R}^{d \times r_Q}$

Column-sparse : $M = PC^T$
- $P \in \mathbb{R}^{h \times r_C}$
- $C \in \{0,1\}^{d \times r_C}$, $C^T 1_d = 1_{r_C}$

The feature extractor $Q$ becomes a feature selector $C$.
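A minimal sketch of what a feature selector looks like concretely (the selected indices are arbitrary, for illustration only): each column of $C$ is the indicator of one kept input, so $C^T X$ simply gathers the selected rows of $X$ and $M = PC^T$ is column-sparse.

```python
import numpy as np

def feature_selector(d, kept_indices):
    """Build C in {0,1}^{d x r} with one 1 per column: C^T selects the kept inputs."""
    r = len(kept_indices)
    C = np.zeros((d, r))
    C[kept_indices, np.arange(r)] = 1.0
    return C

d, h = 6, 4
kept = [0, 2, 5]                      # arbitrary illustrative selection
C = feature_selector(d, kept)
P = np.random.randn(h, len(kept))
X = np.random.randn(d, 10)
M = P @ C.T                           # column-sparse: unselected columns of M are zero
assert np.allclose(C.T @ X, X[kept])  # C^T gathers the selected inputs
assert np.allclose(M @ X, P @ X[kept])
```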
Leveraging consecutive layers

Restricting to feature selectors, we gain an interesting property: a feature selector's action commutes with non-linearities. For a three-layer network:

$$W_3 \cdot \sigma_2(W_2 \cdot \sigma_1(W_1 \cdot X))$$
$$\approx\ P_3 C_3^T \cdot \sigma_2(P_2 C_2^T \cdot \sigma_1(P_1 C_1^T \cdot X))$$
$$=\ P_3 \cdot \sigma_2(C_3^T P_2 \cdot \sigma_1(C_2^T P_1 \cdot C_1^T X))$$
$$=\ \hat{W}_3 \cdot \sigma_2(\hat{W}_2 \cdot \sigma_1(\hat{W}_1 \cdot C_1^T X))$$

Memory footprint:
- original : $h_3 \times h_2 + h_2 \times h_1 + h_1 \times d$
- compressed : $h_3 \times r_3 + r_3 \times r_2 + r_2 \times r_1 + \alpha \cdot \log_2 \binom{d}{r_1}$

$h_2$ and $h_1$ are gone! Only $h_3$ (#outputs) and $d$ (#inputs) remain.
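A minimal numpy sketch of this folding step, under the assumption that each layer $k$ has already been approximated by $P_k C_k^T$ (widths, ranks and the ReLU non-linearity are illustrative): the selector of layer $k+1$ is absorbed into $P_k$, so the hidden widths $h_1$, $h_2$ disappear from the stored weights.

```python
import numpy as np

relu = lambda u: np.maximum(0.0, u)

d, h1, h2, h3 = 50, 40, 30, 10        # illustrative widths
r1, r2, r3 = 12, 10, 8                # illustrative selector ranks

def random_selector(dim, r):
    """Random feature selector, standing in for one obtained by compression."""
    idx = np.random.choice(dim, r, replace=False)
    C = np.zeros((dim, r)); C[idx, np.arange(r)] = 1.0
    return C

C1, C2, C3 = random_selector(d, r1), random_selector(h1, r2), random_selector(h2, r3)
P1, P2, P3 = np.random.randn(h1, r1), np.random.randn(h2, r2), np.random.randn(h3, r3)

# Folded weights: selectors commute with the ReLU, so they merge into the previous P
W1_hat = C2.T @ P1       # (r2, r1)
W2_hat = C3.T @ P2       # (r3, r2)
W3_hat = P3              # (h3, r3)

X = np.random.randn(d, 5)
out_unfolded = P3 @ C3.T @ relu(P2 @ C2.T @ relu(P1 @ C1.T @ X))
out_folded = W3_hat @ relu(W2_hat @ relu(W1_hat @ (C1.T @ X)))
assert np.allclose(out_unfolded, out_folded)
```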
Optimality of feature selectors

A feature selector's action commutes with non-linearities:
$$C \in \{0,1\}^{d \times r},\ C^T 1_d = 1_r \ \Rightarrow\ PC^T \cdot \sigma(U) = P \cdot \sigma(C^T U)$$

We only need the commutation property. Can we maybe use something less extreme than feature selectors?

Lemma (commutation lemma)
Let $C$ be a linear operator and $\sigma : x \mapsto \max(0, x)$ the pointwise ReLU.
If $C$'s action commutes with $\sigma$, then $C$ is a feature selector.

Answer : no, not even if all $\sigma_k$ are ReLU.
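A quick numerical illustration of the commutation property (shapes and indices are illustrative): a selector commutes with the pointwise ReLU, while a generic dense matrix does not.

```python
import numpy as np

relu = lambda u: np.maximum(0.0, u)

d, r = 8, 3
C = np.zeros((d, r)); C[[1, 4, 6], np.arange(r)] = 1.0     # feature selector
U = np.random.randn(d, 5)

assert np.allclose(C.T @ relu(U), relu(C.T @ U))           # selectors commute with ReLU

G = np.random.randn(r, d)                                   # generic linear operator
print(np.allclose(G @ relu(U), relu(G @ U)))                # almost surely False
```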
Comparison with low-rank

Hidden neurons are deleted. Note how this does not suffer from the pruning drawbacks discussed before.
Comparison with low-rank

Low-Rank : $M = PQ^T$
- $P \in \mathbb{R}^{h \times r_Q}$
- $Q \in \mathbb{R}^{d \times r_Q}$

Column-sparse : $M = PC^T$
- $P \in \mathbb{R}^{h \times r_C}$
- $C \in \{0,1\}^{d \times r_C}$, $C^T 1_d = 1_{r_C}$

For the same $\ell_2$ error, low-rank is less constrained, hence $r_Q \le r_C$.
But it does not remove hidden neurons, which may dominate its cost.

Two regimes:
- Heavy overparameterization ($r_C \ll d$) : use column-sparse
- Light overparameterization ($r_C \approx d$) : use low-rank

Once neurons have been removed, it is still possible to apply a low-rank approximation on top of the first compression.
Solving for column-sparse
Linear Neural Reconstruction Problem

Using the $\ell_{2,1}$ norm as a proxy for the number of non-zero columns, we can consider the following distinct relaxation:

$$\min_M \ \mathbb{E}_X \left\| WX - MX \right\|_2^2 + \lambda \cdot \| M \|_{2,1} \tag{1}$$

where $\| M \|_{2,1} = \sum_i \sqrt{\sum_j M_{j,i}^2}$ is the $\ell_{2,1}$ norm of $M$, i.e. the sum of the $\ell_2$-norms of its columns.
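The composite objective (1) is handled later with a proximal method; a minimal sketch of the only non-smooth ingredient it needs, the proximal operator of $t\|\cdot\|_{2,1}$, i.e. column-wise soft-thresholding (the function name is mine):

```python
import numpy as np

def prox_l21(M, t):
    """Proximal operator of t * ||.||_{2,1}: shrink each column of M towards 0."""
    col_norms = np.linalg.norm(M, axis=0, keepdims=True)          # (1, d)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(col_norms, 1e-12))
    return M * scale                                               # columns with norm <= t become exactly 0

M = np.random.randn(4, 6)
M_shrunk = prox_l21(M, t=0.5)
print(np.linalg.norm(M_shrunk, axis=0))   # some columns driven exactly to zero
```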
Auto-correlation factorization

The sum over the training set can be factored away. Using $A = W - M$, we have

$$\mathbb{E}_X \| AX \|_2^2 = \mathbb{E}_X \operatorname{Tr}\left( A \cdot XX^T \cdot A^T \right) = \operatorname{Tr}\left( A \cdot (\mathbb{E}_X XX^T) \cdot A^T \right)$$

$R = \mathbb{E}_X[XX^T] \in \mathbb{R}^{d \times d}$ is the auto-correlation matrix.
The objective can then be evaluated in $O(hd^2)$, which does not depend on the number of samples.
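A minimal sketch of this factorization in numpy (shapes illustrative): once $R$ has been accumulated over the data, the reconstruction error of any candidate $M$ is a single trace, with no further passes over the samples.

```python
import numpy as np

h, d, N = 100, 300, 10000
W = np.random.randn(h, d)
X = np.random.randn(d, N)                 # columns are samples

R = X @ X.T / N                           # auto-correlation matrix (d, d), computed once

def reconstruction_error(M, W, R):
    """E_X ||WX - MX||^2 evaluated as Tr(A R A^T), independent of the sample count."""
    A = W - M
    return np.trace(A @ R @ A.T)

M = np.zeros_like(W)
direct = np.mean(np.sum((W @ X - M @ X) ** 2, axis=0))   # brute-force average over samples
assert np.isclose(direct, reconstruction_error(M, W, R))
```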
Efficient solving

Our problem is strictly convex → solvable to global optimum.
We solve it with Fast Iterative Shrinkage-Thresholding (FISTA), an accelerated proximal gradient method (quadratic convergence).

Lemma (quadratic convergence)
Let $\mathcal{L} : M \mapsto \frac{1}{2} \mathbb{E}_X \| WX - MX \|_2^2 + \lambda \cdot \| M \|_{2,1}$, $(M_k)_k$ the iterates obtained by the FISTA algorithm, $M^*$ the global optimum, and $L = \lambda_{\max}\left(\mathbb{E}_X[XX^T]\right)$. Then
$$\mathcal{L}(M_k) - \mathcal{L}(M^*) \le \frac{2L}{k^2} \| M_0 - M^* \|_F^2$$
Extension to convolutional layers

For each output position $(u, v)$ in output channel $j$, we write $X_i^{(u,v)}$ for the associated input patch that is multiplied by $W_j$ to get $(W * X_i)_{j,u,v}$:

$$\| W * X_i \|_2^2 = \sum_{u,v} \sum_j \left\langle W_j, X_i^{(u,v)} \right\rangle^2$$

hence

$$R \propto \sum_i \sum_{u,v} \operatorname{vec}(X_i^{(u,v)}) \cdot \operatorname{vec}(X_i^{(u,v)})^T$$

This rewriting holds for any stride, padding or dilation values.
Then use the more general Group-Lasso penalty instead of $\ell_{2,1}$.
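A minimal sketch of accumulating $R$ from convolution patches, using plain loops for clarity (stride 1, no padding, illustrative shapes; a real implementation would use an im2col/unfold routine):

```python
import numpy as np

def conv_autocorrelation(images, kh, kw):
    """Accumulate R ∝ sum of vec(patch) vec(patch)^T over all images and output positions."""
    n, c, H, W = images.shape
    dim = c * kh * kw
    R = np.zeros((dim, dim))
    count = 0
    for i in range(n):
        for u in range(H - kh + 1):            # stride 1, no padding
            for v in range(W - kw + 1):
                patch = images[i, :, u:u + kh, v:v + kw].reshape(-1)   # vec(X_i^{(u,v)})
                R += np.outer(patch, patch)
                count += 1
    return R / count

images = np.random.randn(4, 3, 8, 8)           # (batch, channels, height, width)
R = conv_autocorrelation(images, kh=3, kw=3)   # (27, 27)
```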
Results
General results

Architecture     | Type          | Top-1 error | Top-5 error | Comp. rate | Size
LeNet-300-100    | Baseline      | 1.68 %      | -           | -          | 1.02 MiB
                 | Compressed    | 1.71 %      | -           | 46 %       | 482 KiB
                 | Retrained (1) | 1.64 %      | -           | 29 %       | 307 KiB
LeNet-5 (Caffe)  | Baseline      | 0.74 %      | -           | -          | 1.64 MiB
                 | Compressed    | 0.78 %      | -           | 16 %       | 276 KiB
                 | Retrained (1) | 0.78 %      | -           | 10 %       | 177 KiB
AlexNet          | Baseline      | 43.48 %     | 20.93 %     | -          | 234 MiB
                 | Compressed    | 45.36 %     | 21.90 %     | 39 %       | 91 MiB
Reconstruction chaining
Extension to arbitrary output

We can extend the previous problem to reconstruct an arbitrary output $Y$:

$$\min_M \ \frac{1}{2N} \sum_i \| Y_i - MX_i \|_2^2 + \lambda \cdot \| M \|_{2,1} \tag{2}$$

FISTA is adapted by simply changing the gradient step to $dA = YX^T - AXX^T$, where $YX^T$ can be precomputed as well.
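A minimal sketch of the adapted gradient step (names follow the slide; the $1/N$ scaling is my reading of problem (2)): both Gram-type matrices are precomputed once, so each iteration stays independent of the sample count.

```python
import numpy as np

Y = np.random.randn(100, 10000)     # arbitrary targets, one column per sample
X = np.random.randn(300, 10000)
N = X.shape[1]

YXt = Y @ X.T / N                   # precomputed once
R = X @ X.T / N                     # precomputed once

def grad_step(A):
    """Negative gradient of (1/2N) sum_i ||Y_i - A X_i||^2 at A."""
    return YXt - A @ R

A = np.zeros((100, 300))
dA = grad_step(A)
```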
Three chaining strategies

Consider a feed-forward fully connected network with input $Z_0$, weights $(W_k)_k$ and non-linearities $(\sigma_k)_k$, so that $Z_{k+1} = \sigma_k(W_k \cdot Z_k)$.

- Parallel : $Y = W_k \cdot Z_k$, $X = Z_k$
- Top-down : $Y = W_k \cdot Z_k$, $X = \hat{Z}_k$
- Bottom-up : $Y = C_{k+1}^T W_k \cdot Z_k$, $X = Z_k$
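A minimal sketch of how the $(Y, X)$ pairs fed to problem (2) could be assembled for each strategy (function and argument names are mine; `Z_hat_k` stands for the activations of the already-compressed layers, `C_next` for the selector obtained when compressing the layer above):

```python
import numpy as np

def chaining_targets(strategy, W_k, Z_k, Z_hat_k=None, C_next=None):
    """Return the (Y, X) pair defining reconstruction problem (2) for layer k."""
    if strategy == "parallel":
        return W_k @ Z_k, Z_k
    if strategy == "top-down":
        return W_k @ Z_k, Z_hat_k            # reconstruct from the compressed activations
    if strategy == "bottom-up":
        return C_next.T @ (W_k @ Z_k), Z_k   # only reconstruct the features kept above
    raise ValueError(strategy)

# Illustrative usage
W_k = np.random.randn(30, 50)
Z_k = np.random.randn(50, 1000)
Y, X = chaining_targets("parallel", W_k, Z_k)
```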
Three chaining strategies

Figure: the Top-Down, Bottom-Up and Parallel chainings. Legend: operators (original layer, feature extraction, reconstruction), minimized error, activations (original, extracted, reconstructed).
Reconstruction chaining

Figure: Performance of the reconstruction chainings (LeNet-5 Caffe)
Tackling Lasso bias

Lasso regularization → shrinkage effect → bias in the solution.
We limit this effect by solving twice (see the sketch below):
- Solve for $(P, C^T)$ and retain only $C$
- Solve for $P$ with fixed $C$, without penalty

The second step is just a linear regression.
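A minimal sketch of this debiasing step under my reading of it: with the selected input indices fixed, the penalty-free problem $\min_P \mathbb{E}\|WX - P C^T X\|^2$ is ordinary least squares, and its solution only involves $W$ and sub-blocks of $R$ (names and indices are illustrative).

```python
import numpy as np

def debias(W, R, kept):
    """Least-squares P minimizing E||WX - P X[kept]||^2, for a fixed selection `kept`."""
    R_ks = R[np.ix_(kept, kept)]              # (r, r) block of the auto-correlation matrix
    # Normal equations: P (C^T R C) = W R C  =>  P = W R[:, kept] (R[kept, kept])^{-1}
    return np.linalg.solve(R_ks.T, (W @ R[:, kept]).T).T

h, d = 100, 300
W = np.random.randn(h, d)
X = np.random.randn(d, 5000)
R = X @ X.T / X.shape[1]
kept = [3, 17, 42, 141]                        # illustrative selection from the first solve
P = debias(W, R, kept)                         # (h, 4) debiased reconstruction weights
```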
Influence of debiasing

Figure: Influence of debiasing on reconstruction quality (LeNet-5 Caffe)
Appendix
Fast Iterative Shrinkage-Thresholding

Algorithm 1 : FISTA with fixed step size
input: $X \in \mathbb{R}^{h \times N}$ : input to the layer, $W \in \mathbb{R}^{o \times h}$ : weight to approximate, $\lambda$ : hyperparameter
output: $M \in \mathbb{R}^{o \times h}$ : reconstruction

  $R \leftarrow XX^T / N$
  $L \leftarrow$ largest eigenvalue of $R$
  $M \leftarrow 0 \in \mathbb{R}^{o \times h}$, $P \leftarrow 0 \in \mathbb{R}^{o \times h}$
  $t \leftarrow \lambda / L$, $k \leftarrow 1$, $\theta \leftarrow 1$
  repeat
    $\theta \leftarrow (k - 1) / (k + 2)$, $k \leftarrow k + 1$
    $A \leftarrow M + \theta (M - P)$
    $dA \leftarrow (W - A) R$
    $P \leftarrow M$, $M \leftarrow \operatorname{prox}_{t \|\cdot\|_{2,1}}(A + dA / L)$
  until desired convergence
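A runnable Python sketch of Algorithm 1, repeating the column-wise soft-thresholding proximal operator from earlier for self-containment; the stopping rule (a fixed iteration budget) and the illustrative shapes are mine.

```python
import numpy as np

def prox_l21(M, t):
    """Proximal operator of t * ||.||_{2,1} (column-wise soft-thresholding)."""
    norms = np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1e-12)
    return M * np.maximum(0.0, 1.0 - t / norms)

def fista_l21(W, X, lam, n_iter=500):
    """Minimize 1/2 E||WX - MX||^2 + lam * ||M||_{2,1} with FISTA, fixed step 1/L."""
    N = X.shape[1]
    R = X @ X.T / N                        # auto-correlation matrix
    L = np.linalg.eigvalsh(R)[-1]          # largest eigenvalue = Lipschitz constant
    M = np.zeros_like(W)
    P = np.zeros_like(W)
    t = lam / L
    for k in range(1, n_iter + 1):
        theta = (k - 1) / (k + 2)
        A = M + theta * (M - P)
        dA = (W - A) @ R                   # negative gradient of the smooth part at A
        P, M = M, prox_l21(A + dA / L, t)
    return M

# Illustrative usage: o = 50 outputs, h = 120 inputs
W = np.random.randn(50, 120)
X = np.random.randn(120, 2000)
M = fista_l21(W, X, lam=5.0)
print((np.linalg.norm(M, axis=0) > 1e-8).sum(), "columns kept out of", M.shape[1])
```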
Convergence guarantees

Lemma
Let $\mathcal{L} : M \mapsto \frac{1}{2} \mathbb{E}_X \| WX - MX \|_2^2 + \lambda \cdot \| M \|_{2,1}$, $(M_k)_k$ the iterates obtained by FISTA as described above, $M^*$ the global optimum, and $L = \lambda_{\max}\left(\mathbb{E}_X[XX^T]\right)$. Then
$$\mathcal{L}(M_k) - \mathcal{L}(M^*) \le \frac{2L}{k^2} \| M_0 - M^* \|_F^2$$

Choosing $M_0 = 0$, we can refine this bound with
$$\| M^* \|_F^2 \le \| M^* \|_{2,1} \cdot \min\left( \sqrt{d},\ \| M^* \|_{2,1} \right)$$
and by definition of $M^*$, we have $\forall M,\ \| M^* \|_{2,1} \le \frac{1}{\lambda} \mathcal{L}(M)$.