Rate Distortion for Model Compression: From Theory To Practice


  1. Rate Distortion for Model Compression: From Theory To Practice
     Weihao Gao (UIUC), Yu-Han Liu (Google), Chong Wang (Bytedance) and Sewoong Oh (Univ of Washington)
     June 10, 2019

  2. Motivation
     - Nowadays, neural networks are becoming more and more powerful.
     - Neural networks are also becoming larger and larger: LeNet 40K, AlexNet 62M, BERT 110M (base) / 340M (large) parameters.
     - Compressing models is necessary for saving:
       - training and inference time,
       - storage space, e.g., for mobile apps.
     - Two fundamental questions about model compression:
       1. Is there any theoretical understanding of the fundamental limit of model compression algorithms?
       2. How can theoretical understanding help us improve practical compression algorithms?

  3. Fundamental limit for model compression
     Trade-off between the compression ratio and the quality of the compressed model.
     [Figure 1: Trade-off between compression ratio (x-axis, 0% to 25%) and cross-entropy loss (y-axis, 0.0 to 2.5), comparing the uncompressed model, a baseline compressor, and the proposed method.]
     Fundamental question: given a pretrained model $f_w(x)$, how well can we compress the model at a given compression ratio?

  4. Rate distortion for model compression
     We bring the tool of rate distortion theory from information theory.
     - Rate: the average number of bits used to represent the parameters.
     - Distortion: the difference between the compressed model and the original model.
       For regression: $d(w, \hat{w}) = \mathbb{E}_X[\, \| f_w(X) - f_{\hat{w}}(X) \|^2 \,]$
       For classification: $d(w, \hat{w}) = \mathbb{E}_X[\, D_{\mathrm{KL}}( f_{\hat{w}}(X) \,\|\, f_w(X) ) \,]$
     - Rate-distortion theorem for model compression:
       $R(D) = \min_{P_{\hat{W}|W} \,:\, \mathbb{E}[d(W, \hat{W})] \le D} \; I(W; \hat{W})$
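To make the two distortion measures concrete, here is a minimal sketch (our illustration, not code from the talk) that estimates both from data samples. The names `f_w`, `f_w_hat`, and `X` are hypothetical; the models are assumed to map a batch of inputs to predictions (regression) or to probability vectors over classes (classification).

```python
import numpy as np

def regression_distortion(f_w, f_w_hat, X):
    """Monte Carlo estimate of d(w, w_hat) = E_X[ ||f_w(X) - f_w_hat(X)||^2 ]."""
    diff = f_w(X) - f_w_hat(X)             # shape: (n_samples, output_dim)
    return np.mean(np.sum(diff ** 2, axis=-1))

def classification_distortion(f_w, f_w_hat, X, eps=1e-12):
    """Monte Carlo estimate of d(w, w_hat) = E_X[ D_KL(f_w_hat(X) || f_w(X)) ]."""
    p_hat = np.clip(f_w_hat(X), eps, 1.0)  # compressed model's class probabilities
    p = np.clip(f_w(X), eps, 1.0)          # original model's class probabilities
    return np.mean(np.sum(p_hat * (np.log(p_hat) - np.log(p)), axis=-1))
```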

  5. Our contributions
     Generally, it is intractable to evaluate $R(D)$ due to:
     - the high dimensionality of the parameters,
     - complicated non-linearity.
     In this talk, our contributions are:
     - For the linear regression model, we give a lower bound on $R(D)$ and an algorithm achieving the lower bound.
     - Inspired by the optimal algorithm, we propose two "golden rules" for model compression.
     - We prove the optimality of the proposed "golden rules" for a one-layer ReLU network.
     - We show that an algorithm following the "golden rules" performs better on real models.

  6. Linear regression
     Consider the linear regression model $f_w(x) = w^T x$ under the following assumptions:
     - Weights $W$ are drawn from $\mathcal{N}(0, \Sigma_W)$.
     - Data $X$ has zero mean, $\mathbb{E}[X_i^2] = \lambda_{x,i}$ and $\mathbb{E}[X_i X_j] = 0$ for $i \neq j$.
     Theorem: the rate-distortion function is lower bounded by
       $R(D) \ge \underline{R}(D) = \frac{1}{2} \log\det(\Sigma_W) - \frac{1}{2} \sum_{i=1}^m \log(D_i)$,
     where
       $D_i = \mu / \lambda_{x,i}$ if $\mu < \lambda_{x,i} \mathbb{E}_W[W_i^2]$, and $D_i = \mathbb{E}_W[W_i^2]$ if $\mu \ge \lambda_{x,i} \mathbb{E}_W[W_i^2]$,
     and $\mu$ is chosen such that $\sum_{i=1}^m \lambda_{x,i} D_i = D$. The lower bound is tight for linear regression.
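The lower bound lends itself to direct numerical evaluation, since $\sum_i \lambda_{x,i} D_i$ is monotonically increasing in the water level $\mu$. Below is a sketch under the simplifying assumption that $\Sigma_W$ is diagonal (so $\log\det(\Sigma_W) = \sum_i \log \mathbb{E}[W_i^2]$); the function name and the bisection scheme are our choices, not the authors'.

```python
import numpy as np

def rate_distortion_lower_bound(lam_x, var_w, D, iters=100):
    """Evaluate the R(D) lower bound for linear regression by reverse
    water-filling.

    lam_x: per-coordinate data variances lambda_{x,i}
    var_w: per-coordinate weight variances E[W_i^2] (diagonal Sigma_W assumed)
    D:     target distortion, 0 < D <= sum(lam_x * var_w)
    """
    def total_distortion(mu):
        # D_i = mu / lambda_{x,i}, capped at E[W_i^2]
        D_i = np.minimum(mu / lam_x, var_w)
        return np.sum(lam_x * D_i)

    # Bisection for mu: total distortion grows monotonically with mu.
    lo, hi = 0.0, np.max(lam_x * var_w)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total_distortion(mid) < D:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)

    D_i = np.minimum(mu / lam_x, var_w)
    # For diagonal Sigma_W, log det(Sigma_W) = sum(log var_w).
    return 0.5 * np.sum(np.log(var_w)) - 0.5 * np.sum(np.log(D_i))
```

For instance, with $m = 4$, all $\lambda_{x,i} = 1$ and all $\mathbb{E}_W[W_i^2] = 1$, a target distortion $D = 2$ yields $\mu = 0.5$, $D_i = 0.5$ in every coordinate, and a bound of $2 \log 2 \approx 1.39$ nats.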

  7. From theory to practice
     Two "golden rules" of the optimal compressor:
     1. Orthogonality: $\mathbb{E}_{W,\hat{W}}[\, \hat{W}^T \Sigma_X (W - \hat{W}) \,] = 0$.
     2. Minimization: $\mathbb{E}_{W,\hat{W}}[\, (W - \hat{W})^T \Sigma_X (W - \hat{W}) \,]$ should be minimized, given a certain rate.
     Modified "golden rules" for practice:
     1. Orthogonality: $\hat{w}^T I_w (w - \hat{w}) = 0$.
     2. Minimization: $(w - \hat{w})^T I_w (w - \hat{w})$ is minimized, given certain constraints.
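One concrete way to read the modified rules (our hedged sketch, not necessarily the exact algorithm from the talk): with a diagonal importance matrix $I_w$, pruning that leaves every surviving weight unchanged makes each term of $\hat{w}^T I_w (w - \hat{w})$ vanish, so rule 1 holds automatically, and rule 2 then says to zero out the coordinates with the smallest importance-weighted energy $I_{w,ii}\, w_i^2$. The gradient-based estimate of $I_w$ named in the docstring is an assumption on our part.

```python
import numpy as np

def importance_weighted_pruning(w, I_w_diag, keep_ratio):
    """Prune a weight vector following the modified "golden rules".

    w:          flat weight vector of the pretrained model
    I_w_diag:   diagonal of the importance matrix I_w, e.g. estimated as
                E_X[ (d f_w(X) / d w_i)^2 ] from sampled gradients
    keep_ratio: fraction of weights to keep

    Since each surviving weight is left unchanged and pruned weights are
    set to 0, every term of w_hat^T I_w (w - w_hat) vanishes (rule 1), and
    the distortion (w - w_hat)^T I_w (w - w_hat), i.e. the sum of
    I_w[i] * w[i]^2 over pruned i, is minimized by pruning the smallest
    such terms (rule 2).
    """
    energy = I_w_diag * w ** 2          # importance-weighted energy per weight
    k = int(np.ceil(keep_ratio * w.size))
    keep = np.argsort(energy)[-k:]      # indices of the k largest energies
    w_hat = np.zeros_like(w)
    w_hat[keep] = w[keep]
    return w_hat
```

Plain magnitude pruning is recovered as the special case $I_w = I$; weighting by $I_w$ is what lets the compressor account for the data covariance $\Sigma_X$ appearing in the exact rules.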
