Neural Network Compression: Linear Neural Reconstruction
David A. R. Robin
Internship with Swayambhoo Jain, Feb-Aug 2019
Report : www.robindar.com/m1-internship/report.pdf
Neural networks

$X \in \mathbb{R}^d \mapsto W_2 \cdot \sigma_1(W_1 \cdot \sigma_0(W_0 \cdot X))$
Notations (when considering a single layer)

- $d \in \mathbb{N}$ : number of inputs of a layer
- $h \in \mathbb{N}$ : number of outputs of a layer
- $W \in \mathbb{R}^{h \times d}$ : weights of a single layer (over $\mathbb{R}^d$)
- $\mathcal{D}$ : distribution of inputs to a layer
- $X \sim \mathcal{D}$ : input to a layer (random variable)
Previously in Network Compression
Previously in Network Compression : Pruning

- Pruning : remove weights (i.e. connections)
- Assumption : pruning small-magnitude weights $|w_q|$ → small loss increase (even when pruning several weights at once)
- Pruning algorithm : Prune, Retrain, Repeat
- Result : 90% of weights removed, same accuracy (high compressibility)
Previously in Network Compression : Explaining Pruning

Magnitude-based pruning requires retraining.
Pruning can keep neurons with no inputs or no outputs (in red)¹, as well as redundant neurons (in blue) that could be discarded at no cost.
Redundancy is not leveraged. Can we take advantage of redundancies?

¹ Given enough retraining with weight decay, these will be discarded.
Previously in Network Compression : Low-rank

$$\min_{P \in \mathbb{R}^{h \times r},\, Q \in \mathbb{R}^{d \times r}} \left\| W - PQ^T \right\|_2$$

Problems : keeps the hidden neuron count intact, data-agnostic
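As a reference point, here is a minimal numpy sketch of this data-agnostic low-rank baseline: factor $W$ into $PQ^T$ by truncating its SVD at rank $r$. The names $W$, $P$, $Q$ follow the slides; the rank and shapes are illustrative.

```python
import numpy as np

def low_rank_factorization(W, r):
    """Data-agnostic rank-r factorization W ≈ P @ Q.T via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    P = U[:, :r] * s[:r]          # (h, r)
    Q = Vt[:r, :].T               # (d, r)
    return P, Q

# Illustrative usage: h = 100 outputs, d = 300 inputs, rank 20
W = np.random.randn(100, 300)
P, Q = low_rank_factorization(W, r=20)
print(np.linalg.norm(W - P @ Q.T))   # approximation error
```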
Contribution
Activation reconstruction

$L$-layer feed-forward flow:
- $Z_0$ : input to the network
- $Z_{k+1} = \sigma_k(W_k \cdot Z_k)$
- Use $Z_L$ as prediction

Weight approximation (theirs) : $\hat{W}_k \approx W_k$
Activation reconstruction (ours) : $\hat{Z}_k \approx Z_k$

We have more than weights, we have activations.
We only need $\sigma_k(\hat{W}_k \hat{Z}_k) \approx \sigma_k(W_k Z_k)$:
$$\hat{Z}_k \approx Z_k,\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k) \ \Rightarrow\ \hat{Z}_{k+1} \approx Z_{k+1}$$
$$\hat{W}_k \approx W_k \ \Rightarrow\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$$
Linear activation reconstruction

$$\hat{W}_k \approx W_k \ \Rightarrow\ \hat{W}_k Z_k \approx W_k Z_k \ \Rightarrow\ \sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$$

The first ($\hat{W}_k \approx W_k$) is sub-optimal because data-agnostic.
The third ($\sigma_k(\hat{W}_k Z_k) \approx \sigma_k(W_k Z_k)$) is non-convex, non-smooth.
Let's try to get the second : $\hat{W}_k \cdot Z_k \approx W_k \cdot Z_k$
Low-rank inspiration

Low-rank with activation reconstruction gives

$$\min_{P \in \mathbb{R}^{h \times r},\, Q \in \mathbb{R}^{d \times r}} \mathbb{E}_X \left\| WX - PQ^T X \right\|_2^2$$

$Q$ : feature extractor, $P$ : linear reconstruction from extracted features.
Knowing the right rank $r$ to use is hard. Soft low-rank would use the nuclear norm $\|\cdot\|_*$ instead:

$$\min_M \ \mathbb{E}_X \left\| WX - MX \right\|_2^2 + \lambda \cdot \| M \|_*$$

where $\lambda$ controls the tradeoff between compression and accuracy.
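A minimal sketch of how the data-aware, fixed-rank variant can be solved in closed form, assuming the auto-correlation matrix $R = \mathbb{E}[XX^T]$ is positive definite: minimizing $\mathbb{E}_X\|WX - MX\|^2$ over rank-$r$ matrices is the best rank-$r$ approximation of $W R^{1/2}$ in Frobenius norm. The whitening trick and function name are mine, not taken from the report.

```python
import numpy as np

def data_aware_low_rank(W, R, r, eps=1e-8):
    """Rank-r M minimizing E||WX - MX||^2, with R = E[X X^T] assumed positive definite."""
    # Symmetric square root R^{1/2} and its inverse via eigendecomposition
    vals, vecs = np.linalg.eigh(R)
    vals = np.maximum(vals, eps)
    R_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    R_half_inv = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    # Best rank-r approximation of B = W R^{1/2}, then un-whiten
    B = W @ R_half
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return B_r @ R_half_inv

W = np.random.randn(100, 300)
X = np.random.randn(300, 10000)
R = X @ X.T / X.shape[1]
M = data_aware_low_rank(W, R, r=20)
```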
Neuron removal

$C_i(M) = 0$ (the $i$-th column of $M$ is zero) $\Rightarrow$ $X_i$ is never used $\Rightarrow$ we can remove neuron n°$i$.
Column-sparse matrices remove neurons. The characterization of such matrices is reminiscent of low-rank:

Low-Rank : $M = PQ^T$
- $P \in \mathbb{R}^{h \times r_Q}$
- $Q \in \mathbb{R}^{d \times r_Q}$

Column-sparse : $M = PC^T$
- $P \in \mathbb{R}^{h \times r_C}$
- $C \in \{0,1\}^{d \times r_C}$, $C^T 1_d = 1_{r_C}$

The feature extractor $Q$ becomes a feature selector $C$.
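A minimal sketch of what a feature selector looks like concretely (the selected indices are arbitrary, for illustration only): each column of $C$ is the indicator of one kept input, so $C^T X$ simply gathers the selected rows of $X$ and $M = PC^T$ is column-sparse.

```python
import numpy as np

def feature_selector(d, kept_indices):
    """Build C in {0,1}^{d x r} with one 1 per column: C^T selects the kept inputs."""
    r = len(kept_indices)
    C = np.zeros((d, r))
    C[kept_indices, np.arange(r)] = 1.0
    return C

d, h = 6, 4
kept = [0, 2, 5]                      # arbitrary illustrative selection
C = feature_selector(d, kept)
P = np.random.randn(h, len(kept))
X = np.random.randn(d, 10)
M = P @ C.T                           # column-sparse: unselected columns of M are zero
assert np.allclose(C.T @ X, X[kept])  # C^T gathers the selected inputs
assert np.allclose(M @ X, P @ X[kept])
```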
Leveraging consecutive layers

Restricting to feature selectors, we gain an interesting property: a feature selector's action commutes with non-linearities. For a three-layer network:

$$W_3 \cdot \sigma_2(W_2 \cdot \sigma_1(W_1 \cdot X))$$
$$\approx\ P_3 C_3^T \cdot \sigma_2(P_2 C_2^T \cdot \sigma_1(P_1 C_1^T \cdot X))$$
$$=\ P_3 \cdot \sigma_2(C_3^T P_2 \cdot \sigma_1(C_2^T P_1 \cdot C_1^T X))$$
$$=\ \hat{W}_3 \cdot \sigma_2(\hat{W}_2 \cdot \sigma_1(\hat{W}_1 \cdot C_1^T X))$$

Memory footprint:
- original : $h_3 \times h_2 + h_2 \times h_1 + h_1 \times d$
- compressed : $h_3 \times r_3 + r_3 \times r_2 + r_2 \times r_1 + \alpha \cdot \log_2 \binom{d}{r_1}$

$h_2$ and $h_1$ are gone! Only $h_3$ (#outputs) and $d$ (#inputs) remain.
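A minimal numpy sketch of this folding step, under the assumption that each layer $k$ has already been approximated by $P_k C_k^T$ (widths, ranks and the ReLU non-linearity are illustrative): the selector of layer $k+1$ is absorbed into $P_k$, so the hidden widths $h_1$, $h_2$ disappear from the stored weights.

```python
import numpy as np

relu = lambda u: np.maximum(0.0, u)

d, h1, h2, h3 = 50, 40, 30, 10        # illustrative widths
r1, r2, r3 = 12, 10, 8                # illustrative selector ranks

def random_selector(dim, r):
    """Random feature selector, standing in for one obtained by compression."""
    idx = np.random.choice(dim, r, replace=False)
    C = np.zeros((dim, r)); C[idx, np.arange(r)] = 1.0
    return C

C1, C2, C3 = random_selector(d, r1), random_selector(h1, r2), random_selector(h2, r3)
P1, P2, P3 = np.random.randn(h1, r1), np.random.randn(h2, r2), np.random.randn(h3, r3)

# Folded weights: selectors commute with the ReLU, so they merge into the previous P
W1_hat = C2.T @ P1       # (r2, r1)
W2_hat = C3.T @ P2       # (r3, r2)
W3_hat = P3              # (h3, r3)

X = np.random.randn(d, 5)
out_unfolded = P3 @ C3.T @ relu(P2 @ C2.T @ relu(P1 @ C1.T @ X))
out_folded = W3_hat @ relu(W2_hat @ relu(W1_hat @ (C1.T @ X)))
assert np.allclose(out_unfolded, out_folded)
```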
Optimality of feature selectors

A feature selector's action commutes with non-linearities:
$$C \in \{0,1\}^{d \times r},\ C^T 1_d = 1_r \ \Rightarrow\ PC^T \cdot \sigma(U) = P \cdot \sigma(C^T U)$$

We only need the commutation property. Can we maybe use something less extreme than feature selectors?

Lemma (commutation lemma)
Let $C$ be a linear operator and $\sigma : x \mapsto \max(0, x)$ the pointwise ReLU.
If $C$'s action commutes with $\sigma$, then $C$ is a feature selector.

Answer : no, not even if all $\sigma_k$ are ReLU.
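A quick numerical illustration of the commutation property (shapes and indices are illustrative): a selector commutes with the pointwise ReLU, while a generic dense matrix does not.

```python
import numpy as np

relu = lambda u: np.maximum(0.0, u)

d, r = 8, 3
C = np.zeros((d, r)); C[[1, 4, 6], np.arange(r)] = 1.0     # feature selector
U = np.random.randn(d, 5)

assert np.allclose(C.T @ relu(U), relu(C.T @ U))           # selectors commute with ReLU

G = np.random.randn(r, d)                                   # generic linear operator
print(np.allclose(G @ relu(U), relu(G @ U)))                # almost surely False
```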
Comparison with low-rank

Hidden neurons are deleted. Note how this does not suffer from the pruning drawbacks discussed before.
Comparison with low-rank

Low-Rank : $M = PQ^T$
- $P \in \mathbb{R}^{h \times r_Q}$
- $Q \in \mathbb{R}^{d \times r_Q}$

Column-sparse : $M = PC^T$
- $P \in \mathbb{R}^{h \times r_C}$
- $C \in \{0,1\}^{d \times r_C}$, $C^T 1_d = 1_{r_C}$

For the same $\ell_2$ error, low-rank is less constrained, hence $r_Q \le r_C$.
But it does not remove hidden neurons, which may dominate its cost.

Two regimes:
- Heavy overparameterization ($r_C \ll d$) : use column-sparse
- Light overparameterization ($r_C \approx d$) : use low-rank

Once neurons have been removed, it is still possible to apply a low-rank approximation on top of the first compression.
Solving for column-sparse
Linear Neural Reconstruction Problem

Using the $\ell_{2,1}$ norm as a proxy for the number of non-zero columns, we can consider the following distinct relaxation:

$$\min_M \ \mathbb{E}_X \left\| WX - MX \right\|_2^2 + \lambda \cdot \| M \|_{2,1} \tag{1}$$

where $\| M \|_{2,1} = \sum_i \sqrt{\sum_j M_{j,i}^2}$ is the $\ell_{2,1}$ norm of $M$, i.e. the sum of the $\ell_2$-norms of its columns.
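The composite objective (1) is handled later with a proximal method; a minimal sketch of the only non-smooth ingredient it needs, the proximal operator of $t\|\cdot\|_{2,1}$, i.e. column-wise soft-thresholding (the function name is mine):

```python
import numpy as np

def prox_l21(M, t):
    """Proximal operator of t * ||.||_{2,1}: shrink each column of M towards 0."""
    col_norms = np.linalg.norm(M, axis=0, keepdims=True)          # (1, d)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(col_norms, 1e-12))
    return M * scale                                               # columns with norm <= t become exactly 0

M = np.random.randn(4, 6)
M_shrunk = prox_l21(M, t=0.5)
print(np.linalg.norm(M_shrunk, axis=0))   # some columns driven exactly to zero
```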
Auto-correlation factorization

The sum over the training set can be factored away. Using $A = W - M$, we have

$$\mathbb{E}_X \| AX \|_2^2 = \mathbb{E}_X \operatorname{Tr}\left( A \cdot XX^T \cdot A^T \right) = \operatorname{Tr}\left( A \cdot (\mathbb{E}_X XX^T) \cdot A^T \right)$$

$R = \mathbb{E}_X[XX^T] \in \mathbb{R}^{d \times d}$ is the auto-correlation matrix.
The objective can then be evaluated in $O(hd^2)$, which does not depend on the number of samples.
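A minimal sketch of this factorization in numpy (shapes illustrative): once $R$ has been accumulated over the data, the reconstruction error of any candidate $M$ is a single trace, with no further passes over the samples.

```python
import numpy as np

h, d, N = 100, 300, 10000
W = np.random.randn(h, d)
X = np.random.randn(d, N)                 # columns are samples

R = X @ X.T / N                           # auto-correlation matrix (d, d), computed once

def reconstruction_error(M, W, R):
    """E_X ||WX - MX||^2 evaluated as Tr(A R A^T), independent of the sample count."""
    A = W - M
    return np.trace(A @ R @ A.T)

M = np.zeros_like(W)
direct = np.mean(np.sum((W @ X - M @ X) ** 2, axis=0))   # brute-force average over samples
assert np.isclose(direct, reconstruction_error(M, W, R))
```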
Efficient solving

Our problem is strictly convex → solvable to global optimum.
We solve it with Fast Iterative Shrinkage-Thresholding (FISTA), an accelerated proximal gradient method (quadratic convergence).

Lemma (quadratic convergence)
Let $\mathcal{L} : M \mapsto \frac{1}{2} \mathbb{E}_X \| WX - MX \|_2^2 + \lambda \cdot \| M \|_{2,1}$, $(M_k)_k$ the iterates obtained by the FISTA algorithm, $M^*$ the global optimum, and $L = \lambda_{\max}\left(\mathbb{E}_X[XX^T]\right)$. Then
$$\mathcal{L}(M_k) - \mathcal{L}(M^*) \le \frac{2L}{k^2} \| M_0 - M^* \|_F^2$$
Extension to convolutional layers

For each output position $(u, v)$ in output channel $j$, we write $X_i^{(u,v)}$ for the associated input patch that is multiplied by $W_j$ to get $(W * X_i)_{j,u,v}$:

$$\| W * X_i \|_2^2 = \sum_{u,v} \sum_j \left\langle W_j, X_i^{(u,v)} \right\rangle^2$$

hence

$$R \propto \sum_i \sum_{u,v} \operatorname{vec}(X_i^{(u,v)}) \cdot \operatorname{vec}(X_i^{(u,v)})^T$$

This rewriting holds for any stride, padding or dilation values.
Then use the more general Group-Lasso penalty instead of $\ell_{2,1}$.
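A minimal sketch of accumulating $R$ from convolution patches, using plain loops for clarity (stride 1, no padding, illustrative shapes; a real implementation would use an im2col/unfold routine):

```python
import numpy as np

def conv_autocorrelation(images, kh, kw):
    """Accumulate R ∝ sum of vec(patch) vec(patch)^T over all images and output positions."""
    n, c, H, W = images.shape
    dim = c * kh * kw
    R = np.zeros((dim, dim))
    count = 0
    for i in range(n):
        for u in range(H - kh + 1):            # stride 1, no padding
            for v in range(W - kw + 1):
                patch = images[i, :, u:u + kh, v:v + kw].reshape(-1)   # vec(X_i^{(u,v)})
                R += np.outer(patch, patch)
                count += 1
    return R / count

images = np.random.randn(4, 3, 8, 8)           # (batch, channels, height, width)
R = conv_autocorrelation(images, kh=3, kw=3)   # (27, 27)
```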
Results
General results

Architecture     | Type          | Top-1 error | Top-5 error | Comp. rate | Size
LeNet-300-100    | Baseline      | 1.68 %      | -           | -          | 1.02 MiB
                 | Compressed    | 1.71 %      | -           | 46 %       | 482 KiB
                 | Retrained (1) | 1.64 %      | -           | 29 %       | 307 KiB
LeNet-5 (Caffe)  | Baseline      | 0.74 %      | -           | -          | 1.64 MiB
                 | Compressed    | 0.78 %      | -           | 16 %       | 276 KiB
                 | Retrained (1) | 0.78 %      | -           | 10 %       | 177 KiB
AlexNet          | Baseline      | 43.48 %     | 20.93 %     | -          | 234 MiB
                 | Compressed    | 45.36 %     | 21.90 %     | 39 %       | 91 MiB
Reconstruction chaining
Extension to arbitrary output

We can extend the previous problem to reconstruct an arbitrary output $Y$:

$$\min_M \ \frac{1}{2N} \sum_i \| Y_i - MX_i \|_2^2 + \lambda \cdot \| M \|_{2,1} \tag{2}$$

FISTA is adapted by simply changing the gradient step to $dA = YX^T - AXX^T$, where $YX^T$ can be precomputed as well.
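A minimal sketch of the adapted gradient step (names follow the slide; the $1/N$ scaling is my reading of problem (2)): both Gram-type matrices are precomputed once, so each iteration stays independent of the sample count.

```python
import numpy as np

Y = np.random.randn(100, 10000)     # arbitrary targets, one column per sample
X = np.random.randn(300, 10000)
N = X.shape[1]

YXt = Y @ X.T / N                   # precomputed once
R = X @ X.T / N                     # precomputed once

def grad_step(A):
    """Negative gradient of (1/2N) sum_i ||Y_i - A X_i||^2 at A."""
    return YXt - A @ R

A = np.zeros((100, 300))
dA = grad_step(A)
```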
Three chaining strategies

Consider a feed-forward fully connected network with input $Z_0$, weights $(W_k)_k$ and non-linearities $(\sigma_k)_k$, so that $Z_{k+1} = \sigma_k(W_k \cdot Z_k)$.

- Parallel : $Y = W_k \cdot Z_k$, $X = Z_k$
- Top-down : $Y = W_k \cdot Z_k$, $X = \hat{Z}_k$
- Bottom-up : $Y = C_{k+1}^T W_k \cdot Z_k$, $X = Z_k$
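A minimal sketch of how the $(Y, X)$ pairs fed to problem (2) could be assembled for each strategy (function and argument names are mine; `Z_hat_k` stands for the activations of the already-compressed layers, `C_next` for the selector obtained when compressing the layer above):

```python
import numpy as np

def chaining_targets(strategy, W_k, Z_k, Z_hat_k=None, C_next=None):
    """Return the (Y, X) pair defining reconstruction problem (2) for layer k."""
    if strategy == "parallel":
        return W_k @ Z_k, Z_k
    if strategy == "top-down":
        return W_k @ Z_k, Z_hat_k            # reconstruct from the compressed activations
    if strategy == "bottom-up":
        return C_next.T @ (W_k @ Z_k), Z_k   # only reconstruct the features kept above
    raise ValueError(strategy)

# Illustrative usage
W_k = np.random.randn(30, 50)
Z_k = np.random.randn(50, 1000)
Y, X = chaining_targets("parallel", W_k, Z_k)
```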
Three chaining strategies

Figure: the Top-Down, Bottom-Up and Parallel chainings. Legend: operators (original layer, feature extraction, reconstruction), minimized error, activations (original, extracted, reconstructed).
Reconstruction chaining

Figure: Performance of the reconstruction chainings (LeNet-5 Caffe)
Tackling Lasso bias

Lasso regularization → shrinkage effect → bias in the solution.
We limit this effect by solving twice (see the sketch below):
- Solve for $(P, C^T)$ and retain only $C$
- Solve for $P$ with fixed $C$, without penalty

The second step is just a linear regression.
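A minimal sketch of this debiasing step under my reading of it: with the selected input indices fixed, the penalty-free problem $\min_P \mathbb{E}\|WX - P C^T X\|^2$ is ordinary least squares, and its solution only involves $W$ and sub-blocks of $R$ (names and indices are illustrative).

```python
import numpy as np

def debias(W, R, kept):
    """Least-squares P minimizing E||WX - P X[kept]||^2, for a fixed selection `kept`."""
    R_ks = R[np.ix_(kept, kept)]              # (r, r) block of the auto-correlation matrix
    # Normal equations: P (C^T R C) = W R C  =>  P = W R[:, kept] (R[kept, kept])^{-1}
    return np.linalg.solve(R_ks.T, (W @ R[:, kept]).T).T

h, d = 100, 300
W = np.random.randn(h, d)
X = np.random.randn(d, 5000)
R = X @ X.T / X.shape[1]
kept = [3, 17, 42, 141]                        # illustrative selection from the first solve
P = debias(W, R, kept)                         # (h, 4) debiased reconstruction weights
```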
Influence of debiasing

Figure: Influence of debiasing on reconstruction quality (LeNet-5 Caffe)
Appendix
Fast Iterative Shrinkage-Thresholding

Algorithm 1 : FISTA with fixed step size
input: $X \in \mathbb{R}^{h \times N}$ : input to the layer, $W \in \mathbb{R}^{o \times h}$ : weight to approximate, $\lambda$ : hyperparameter
output: $M \in \mathbb{R}^{o \times h}$ : reconstruction

  $R \leftarrow XX^T / N$
  $L \leftarrow$ largest eigenvalue of $R$
  $M \leftarrow 0 \in \mathbb{R}^{o \times h}$, $P \leftarrow 0 \in \mathbb{R}^{o \times h}$
  $t \leftarrow \lambda / L$, $k \leftarrow 1$, $\theta \leftarrow 1$
  repeat
    $\theta \leftarrow (k - 1) / (k + 2)$, $k \leftarrow k + 1$
    $A \leftarrow M + \theta (M - P)$
    $dA \leftarrow (W - A) R$
    $P \leftarrow M$, $M \leftarrow \operatorname{prox}_{t \|\cdot\|_{2,1}}(A + dA / L)$
  until desired convergence
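A runnable Python sketch of Algorithm 1, repeating the column-wise soft-thresholding proximal operator from earlier for self-containment; the stopping rule (a fixed iteration budget) and the illustrative shapes are mine.

```python
import numpy as np

def prox_l21(M, t):
    """Proximal operator of t * ||.||_{2,1} (column-wise soft-thresholding)."""
    norms = np.maximum(np.linalg.norm(M, axis=0, keepdims=True), 1e-12)
    return M * np.maximum(0.0, 1.0 - t / norms)

def fista_l21(W, X, lam, n_iter=500):
    """Minimize 1/2 E||WX - MX||^2 + lam * ||M||_{2,1} with FISTA, fixed step 1/L."""
    N = X.shape[1]
    R = X @ X.T / N                        # auto-correlation matrix
    L = np.linalg.eigvalsh(R)[-1]          # largest eigenvalue = Lipschitz constant
    M = np.zeros_like(W)
    P = np.zeros_like(W)
    t = lam / L
    for k in range(1, n_iter + 1):
        theta = (k - 1) / (k + 2)
        A = M + theta * (M - P)
        dA = (W - A) @ R                   # negative gradient of the smooth part at A
        P, M = M, prox_l21(A + dA / L, t)
    return M

# Illustrative usage: o = 50 outputs, h = 120 inputs
W = np.random.randn(50, 120)
X = np.random.randn(120, 2000)
M = fista_l21(W, X, lam=5.0)
print((np.linalg.norm(M, axis=0) > 1e-8).sum(), "columns kept out of", M.shape[1])
```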
Convergence guarantees

Lemma
Let $\mathcal{L} : M \mapsto \frac{1}{2} \mathbb{E}_X \| WX - MX \|_2^2 + \lambda \cdot \| M \|_{2,1}$, $(M_k)_k$ the iterates obtained by FISTA as described above, $M^*$ the global optimum, and $L = \lambda_{\max}\left(\mathbb{E}_X[XX^T]\right)$. Then
$$\mathcal{L}(M_k) - \mathcal{L}(M^*) \le \frac{2L}{k^2} \| M_0 - M^* \|_F^2$$

Choosing $M_0 = 0$, we can refine this bound with
$$\| M^* \|_F^2 \le \| M^* \|_{2,1} \cdot \min\left( \sqrt{d},\ \| M^* \|_{2,1} \right)$$
and by definition of $M^*$, we have $\forall M,\ \| M^* \|_{2,1} \le \frac{1}{\lambda} \mathcal{L}(M)$.