Trimming the ℓ1 Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning


1. Trimming the ℓ1 Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning
Jihun Yun¹, Peng Zheng², Eunho Yang¹,³, Aurélie C. Lozano⁴, Aleksandr Aravkin²
¹KAIST  ²University of Washington  ³AITRICS  ⁴IBM T.J. Watson Research Center
arcprime@kaist.ac.kr
International Conference on Machine Learning, June 12, 2019

2. Table of Contents
1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

4. ℓ1 Regularization is Popular
High-dimensional data (n ≪ p) with ℓ1 regularization: genomic data, matrix completion, deep learning, etc.
[Figure: (a) sparse linear models, (b) sparse graphical models, (c) matrix completion, (d) sparse neural networks]

5. Concrete Example 1: Lasso* (Sparse Linear Regression)

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \frac{1}{2n} \lVert y - X\theta \rVert_2^2 + \lambda_n \lVert \theta \rVert_1$$

*R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS, Series B, 1996.
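As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of solving the Lasso objective above by proximal gradient descent (ISTA); the iteration count and step-size choice are placeholders:

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1: elementwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    # Minimize (1/2n) * ||y - X @ theta||_2^2 + lam * ||theta||_1.
    n, p = X.shape
    theta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta
```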

6. Concrete Example 2: Graphical Lasso* (Sparse Concentration Matrix)

$$\hat{\Theta} \in \operatorname*{argmin}_{\Theta \in \mathcal{S}^p_{++}} \; \operatorname{trace}(\hat{\Sigma}\Theta) - \log\det(\Theta) + \lambda_n \lVert \Theta \rVert_{1,\mathrm{off}}$$

where $\hat{\Sigma}$ is a sample covariance matrix, $\mathcal{S}^p_{++}$ the set of symmetric and strictly positive definite matrices, and $\lVert \Theta \rVert_{1,\mathrm{off}}$ the ℓ1-norm on the off-diagonal elements of Θ.

*P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. EJS, 2011.
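For reference (an editor's illustration, not from the talk), the standard untrimmed graphical lasso above is available off the shelf in scikit-learn; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))  # n = 200 samples, p = 10 variables

# alpha plays the role of lambda_n, penalizing off-diagonal entries.
model = GraphicalLasso(alpha=0.1).fit(X)
Theta_hat = model.precision_  # estimated sparse concentration matrix
print(np.count_nonzero(Theta_hat))
```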

7. Concrete Example 3: Group ℓ1* on the Network Pruning Task (Structured Sparsity of Weight Parameters)

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \lVert \theta \rVert_{\mathcal{G}}$$

where θ is the collection of weight parameters of a neural network, $\mathcal{L}$ the neural network loss (e.g., softmax), and $\lVert \theta \rVert_{\mathcal{G}}$ the group sparsity regularizer, for example $\lVert \theta \rVert_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \lVert \theta_g \rVert_2$ over groups g.

[Figure: encouraging group sparsity by pruning synapses and neurons, before vs. after pruning.]

*W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
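A minimal sketch (my own illustration, not the authors' code) of this group penalty when each row of a weight matrix forms one group, as in neuron-level pruning:

```python
import numpy as np

def group_l1_penalty(W):
    # sum_{g in G} ||theta_g||_2 with one group per output neuron (row of W)
    return np.sum(np.linalg.norm(W, axis=1))

W = np.random.default_rng(0).standard_normal((5, 8))  # 5 neurons, 8 inputs each
print(group_l1_penalty(W))
```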

8. Shrinkage Bias of the Standard ℓ1 Penalty
The ℓ1 penalty is proportional to the magnitude of the parameters, so the larger a parameter is, the larger the shrinkage bias it incurs. Despite the popularity of the ℓ1 penalty (and its strong statistical guarantees): is it really good enough?
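To see the bias concretely (a toy example, not from the slides): under an orthogonal design the lasso estimate is soft-thresholding, which shrinks every surviving coefficient by the full λ, no matter how large it is:

```python
import numpy as np

lam = 0.5
coeffs = np.array([0.3, 1.0, 5.0, 50.0])
# Under an orthogonal design, the lasso solution soft-thresholds each coefficient.
lasso_est = np.sign(coeffs) * np.maximum(np.abs(coeffs) - lam, 0.0)
print(lasso_est)           # [ 0.   0.5  4.5 49.5]
print(coeffs - lasso_est)  # even the largest coefficient still pays the bias lam
```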

9. Non-convex Regularizers: Previous Work
For amenable non-convex regularizers (such as SCAD* and MCP**):
⊲ Amenable regularizer: resembles ℓ1 at the origin and has a vanishing derivative in the tail; coordinate-wise decomposable.
⊲ Loh & Wainwright*** provide the statistical analysis of amenable regularizers.
What about more structurally complex regularizers?

*J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Jour. Amer. Stat. Ass., 96(456):1348-1360, December 2001.
**C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.
***P. Loh and M. J. Wainwright. Regularized M-estimators with non-convexity: statistical and algorithmic theory for local optima. JMLR, 2015.
***P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 2017.
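For intuition (an illustrative sketch, not from the talk), MCP is a representative amenable penalty: it matches ℓ1 near the origin and flattens in the tail, so its derivative vanishes for large parameters:

```python
import numpy as np

def mcp_penalty(theta, lam, gamma):
    # MCP, coordinate-wise: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
    # and the constant gamma*lam^2/2 (zero derivative) beyond that.
    t = np.abs(theta)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    gamma * lam ** 2 / 2.0)

print(mcp_penalty(np.array([0.1, 1.0, 3.0]), lam=1.0, gamma=2.0))
# near 0: ~ lam*|t|; beyond gamma*lam = 2.0: constant gamma*lam^2/2 = 1.0
```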

11. Trimmed ℓ1 Penalty: Definition
In this paper, we study the Trimmed ℓ1 penalty, a new class of regularizers.
Definition: for a parameter vector θ ∈ R^p, we ℓ1-penalize every entry except the h largest in magnitude (we call h the trimming parameter). The h largest entries are penalty-free; we only penalize the smallest p − h entries.
[Figure: parameter vector shaded by magnitude (the darker the color, the larger the value); the h largest entries are penalty-free.]

14. Trimmed ℓ1 Penalty: First Formulation
Defining the order statistics of the parameter vector by $|\theta_{(1)}| \geq |\theta_{(2)}| \geq \cdots \geq |\theta_{(p)}|$, the M-estimation problem with the Trimmed ℓ1 penalty is

$$\operatorname*{minimize}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n R(\theta; h), \qquad R(\theta; h) = \sum_{j=h+1}^{p} |\theta_{(j)}|$$

i.e., the sum of the smallest p − h entries in absolute value. Importantly, the Trimmed ℓ1 is neither amenable nor coordinate-wise separable.
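A direct NumPy sketch of R(θ; h) following the definition above (illustrative, not the authors' code):

```python
import numpy as np

def trimmed_l1(theta, h):
    # R(theta; h): sum of the p - h smallest entries of |theta|;
    # the h largest-magnitude entries are penalty-free.
    mags = np.sort(np.abs(theta))       # magnitudes in ascending order
    return mags[:len(theta) - h].sum()  # drop the h largest

theta = np.array([0.1, -2.0, 0.3, 5.0])
print(trimmed_l1(theta, h=0))  # 7.4: plain l1 norm
print(trimmed_l1(theta, h=2))  # 0.4: the two largest entries are untouched
```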

15. M-estimation with the Trimmed ℓ1 Penalty: Second Formulation
We can rewrite the M-estimation problem with the Trimmed ℓ1 penalty by introducing an additional variable w:

$$\operatorname*{minimize}_{\theta \in \Omega,\, w \in [0,1]^p} \; F(\theta, w) := \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \sum_{j=1}^{p} w_j |\theta_j| \quad \text{such that} \quad \mathbf{1}^T w \geq p - h$$

The variable w encodes the sparsity pattern and order information of θ. In the ideal case, w_j = 0 for the h largest entries and w_j = 1 for the p − h smallest entries. If we set the trimming parameter h = 0, this is just the standard ℓ1.
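For fixed θ, the minimization over w has a closed form: set w_j = 0 on the h largest magnitudes and w_j = 1 elsewhere, which recovers R(θ; h) from the first formulation. A small sketch (illustrative; trimmed_l1 is the hypothetical helper from above):

```python
import numpy as np

def optimal_w(theta, h):
    # Minimizer of sum_j w_j * |theta_j| over w in [0,1]^p with 1^T w >= p - h:
    # zero out the weights of the h largest-magnitude entries.
    w = np.ones(len(theta))
    if h > 0:
        w[np.argsort(np.abs(theta))[-h:]] = 0.0
    return w

theta = np.array([0.1, -2.0, 0.3, 5.0])
w = optimal_w(theta, h=2)
print(w)                          # [1. 0. 1. 0.]
print(np.sum(w * np.abs(theta)))  # 0.4, matching trimmed_l1(theta, h=2)
```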

18. M-estimation with the Trimmed ℓ1 Penalty, Second Formulation: Important Properties

$$\operatorname*{minimize}_{\theta \in \Omega,\, w \in [0,1]^p} \; F(\theta, w) := \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \sum_{j=1}^{p} w_j |\theta_j| \quad \text{such that} \quad \mathbf{1}^T w \geq p - h$$

The objective F is a weighted ℓ1-regularized problem in θ when w is fixed, and linear in w when θ is fixed. However, F is jointly non-convex in (θ, w) because θ and w are coupled. We use this second formulation for optimization, since it does not require sorting the parameters.
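These properties suggest a simple block-wise scheme: alternate a proximal-gradient step in θ (a weighted ℓ1 prox for fixed w) with the closed-form update of w. The sketch below is an editor's simplified illustration for a least-squares loss, not the paper's exact algorithm:

```python
import numpy as np

def soft_threshold(z, tau):
    # elementwise prox of tau_j * |z_j|; tau may be a vector
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def trimmed_lasso_alternating(X, y, lam, h, n_iters=200):
    n, p = X.shape
    theta, w = np.zeros(p), np.ones(p)
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the loss
    for _ in range(n_iters):
        # theta-step: proximal gradient on the weighted l1 objective
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam * w)
        # w-step: exact minimization over w, freeing the h largest entries
        w = np.ones(p)
        if h > 0:
            w[np.argsort(np.abs(theta))[-h:]] = 0.0
    return theta
```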

19. Trimmed ℓ1 Penalty: Unit Ball Visualization
[Figure: unit balls of the Trimmed ℓ1 penalty for θ = (θ1, θ2, θ3) in 3-dimensional space, for h = 0 (the standard ℓ1 ball), h = 1, and h = 2.]
