Trimming the ℓ1 Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning
Jihun Yun (1), Peng Zheng (2), Eunho Yang (1,3), Aurélie C. Lozano (4), Aleksandr Aravkin (2)
(1) KAIST  (2) University of Washington  (3) AITRICS  (4) IBM T.J. Watson Research Center
arcprime@kaist.ac.kr
International Conference on Machine Learning, June 12, 2019
Table of Contents
1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning
ℓ1 Regularization is Popular
High-dimensional data (n ≪ p) with ℓ1 regularization: genomic data, matrix completion, deep learning, etc.
[Figure: (a) sparse linear models, (b) sparse graphical models, (c) matrix completion, (d) sparse neural networks]
Concrete Example 1: Lasso
Example 1: Lasso* (sparse linear regression)

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \frac{1}{2n} \| y - X\theta \|_2^2 + \lambda_n \| \theta \|_1$$

*R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS, Series B, 1996.
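A minimal sketch of this estimator using scikit-learn, on synthetic data; the data, dimensions, and `alpha` value (which plays the role of λ_n) are illustrative assumptions, not from the paper:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                      # high-dimensional regime: n << p
theta_true = np.zeros(p)
theta_true[:5] = 3.0                # 5-sparse ground truth
X = rng.standard_normal((n, p))
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# sklearn's Lasso minimizes (1/2n)||y - X theta||_2^2 + alpha * ||theta||_1,
# matching the formulation above with alpha = lambda_n
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzeros:", np.count_nonzero(lasso.coef_))
```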
Concrete Example 2: Graphical Lasso
Example 2: Graphical Lasso* (sparse concentration matrix)

$$\hat{\Theta} \in \operatorname*{argmin}_{\Theta \in \mathcal{S}^p_{++}} \; \operatorname{trace}(\hat{\Sigma}\Theta) - \log\det(\Theta) + \lambda_n \|\Theta\|_{1,\text{off}}$$

where $\hat{\Sigma}$ is a sample covariance matrix, $\mathcal{S}^p_{++}$ the set of symmetric, strictly positive definite matrices, and $\|\Theta\|_{1,\text{off}}$ the ℓ1 norm on the off-diagonal elements of Θ.

*P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. EJS, 2011.
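A small sketch using scikit-learn's `GraphicalLasso` estimator; the data is synthetic and `alpha` stands in for λ_n (note that scikit-learn's handling of the diagonal penalty may differ slightly from the off-diagonal-only formulation above):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# 200 samples from a 5-dimensional Gaussian with identity covariance
X = rng.multivariate_normal(mean=np.zeros(5), cov=np.eye(5), size=200)

model = GraphicalLasso(alpha=0.1).fit(X)
print(model.precision_.round(2))   # estimated sparse precision matrix Theta
```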
Concrete Example 3: Group ℓ1 for Network Pruning
Example 3: Group ℓ1* (structured sparsity of weight parameters)

$$\hat{\theta} \in \operatorname*{argmin}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \|\theta\|_{\mathcal{G}}$$

where θ is the collection of weight parameters of a neural network, $\mathcal{L}$ the neural network loss (e.g., softmax), and $\|\theta\|_{\mathcal{G}}$ the group sparsity regularizer, e.g., $\|\theta\|_{\mathcal{G}} = \sum_{g \in \mathcal{G}} \|\theta_g\|_2$ over groups g.

[Figure: pruning synapses and pruning neurons, before vs. after pruning; the group penalty encourages group sparsity.]

*W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.
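A hedged PyTorch sketch of this group penalty, taking one group per output neuron (row) of a linear layer; the grouping, data, and λ value are illustrative assumptions, since the choice of groups is problem-dependent:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(100, 50)
x, y = torch.randn(32, 100), torch.randn(32, 50)

def group_l1(weight: torch.Tensor) -> torch.Tensor:
    """Sum over groups g of ||theta_g||_2, one group per output neuron (row).
    Driving a whole row to zero prunes the corresponding neuron."""
    return weight.norm(p=2, dim=1).sum()

lam = 1e-3
loss = nn.functional.mse_loss(layer(x), y) + lam * group_l1(layer.weight)
loss.backward()   # gradients now include the group-sparsity term
```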
Shrinkage Bias of the Standard ℓ1 Penalty
The ℓ1 penalty grows in proportion to the magnitude of the parameters, so the larger a parameter is, the larger the shrinkage bias it suffers.
Despite the popularity of the ℓ1 penalty (and its strong statistical guarantees): is it really good enough?
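A tiny numeric illustration of this bias: the proximal operator of λ|·| is soft-thresholding, which subtracts the full amount λ from every large coefficient, so even strong signals are pulled toward zero (the values here are arbitrary examples):

```python
import numpy as np

lam = 1.0
z = np.array([0.5, 2.0, 10.0])                      # "noiseless" signal values
prox = np.sign(z) * np.maximum(np.abs(z) - lam, 0)  # soft-thresholding
print(prox)   # [0. 1. 9.] -- the bias (here 1.0) does not vanish for large z
```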
Non-convex Regularizers: Previous Work
For amenable non-convex regularizers such as SCAD* and MCP**:
- An amenable regularizer resembles ℓ1 at the origin and has vanishing derivative in the tail; it is coordinate-wise decomposable.
- Loh & Wainwright*** provide statistical analyses for amenable regularizers.

What about more structurally complex regularizers?

*J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Jour. Amer. Stat. Ass., 96(456):1348-1360, December 2001.
**C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894-942, 2010.
***P. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. JMLR, 2015.
***P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 2017.
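A small numeric sketch contrasting ℓ1, SCAD, and MCP on a scalar, showing how SCAD and MCP flatten out (constant value, zero derivative) for large |t| and thus avoid the ℓ1 shrinkage bias. The formulas follow Fan & Li (2001) and Zhang (2010); the λ and γ values are arbitrary illustrative choices:

```python
import numpy as np

def scad(t, lam=1.0, gamma=3.7):
    """SCAD penalty: l1 near zero, quadratic transition, constant in the tail."""
    a = np.abs(t)
    return np.where(a <= lam, lam * a,
           np.where(a <= gamma * lam,
                    (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),
                    lam**2 * (gamma + 1) / 2))

def mcp(t, lam=1.0, gamma=3.0):
    """MCP penalty: l1 minus a quadratic near zero, constant in the tail."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

for t in [0.5, 2.0, 10.0]:
    print(f"t={t:5.1f}  l1={abs(t):5.2f}  SCAD={scad(t):5.2f}  MCP={mcp(t):5.2f}")
```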
Trimmed ℓ1 Penalty: Definition
In this paper, we study the trimmed ℓ1 penalty, a new class of regularizers.
Definition: for a parameter vector θ ∈ R^p, we ℓ1-penalize every entry except the h largest in absolute value (we call h the trimming parameter).
[Figure: parameter vector, darker color = larger value; the h largest entries are penalty-free, and only the smallest p − h entries are penalized.]
Trimmed ℓ1 Penalty: First Formulation
Defining the order statistics of the parameter vector, $|\theta_{(1)}| \ge |\theta_{(2)}| \ge \cdots \ge |\theta_{(p)}|$, the M-estimation problem with the trimmed ℓ1 penalty is

$$\operatorname*{minimize}_{\theta \in \Omega} \; \mathcal{L}(\theta; \mathcal{D}) + \lambda_n R(\theta; h)$$

where the regularizer $R(\theta; h) = \sum_{j=h+1}^{p} |\theta_{(j)}|$ is the sum of the smallest p − h entries in absolute value.
Importantly, the trimmed ℓ1 penalty is neither amenable nor coordinate-wise separable.
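A direct sketch of computing R(θ; h): sort the entries by absolute value and sum the smallest p − h (the example vector and h values are made up for illustration):

```python
import numpy as np

def trimmed_l1(theta: np.ndarray, h: int) -> float:
    """Sum of the smallest p - h entries of |theta| (the h largest go penalty-free)."""
    a = np.sort(np.abs(theta))        # ascending order of |theta_j|
    return a[: len(theta) - h].sum()  # drop the h largest entries

theta = np.array([5.0, -0.3, 0.1, 2.0, -0.05])
print(trimmed_l1(theta, h=2))   # penalizes only 0.3 + 0.1 + 0.05 = 0.45
print(trimmed_l1(theta, h=0))   # h = 0 recovers the standard l1 norm, 7.45
```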
M-estimation with the Trimmed ℓ1 Penalty: Second Formulation
We can rewrite the M-estimation problem with the trimmed ℓ1 penalty by introducing an additional variable w:

$$\operatorname*{minimize}_{\theta \in \Omega,\; w \in [0,1]^p} \; F(\theta, w) := \mathcal{L}(\theta; \mathcal{D}) + \lambda_n \sum_{j=1}^{p} w_j |\theta_j| \quad \text{such that} \quad \mathbf{1}^T w \ge p - h$$

The variable w encodes the sparsity pattern and order information of θ. In the ideal case,
- w_j = 0 for the h largest entries,
- w_j = 1 for the smallest p − h entries.
If we set the trimming parameter h = 0, this reduces to the standard ℓ1 penalty.
Second Formulation: Important Properties
The objective function F is
- a weighted ℓ1-regularized objective in θ for fixed w,
- linear in w for fixed θ.
However, F is non-convex jointly in (θ, w) because of the coupling between θ and w.
We use this second formulation for optimization, since it avoids sorting the parameters.
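A minimal alternating-minimization sketch for the second formulation on least squares, exploiting the two properties above: for fixed θ, F is linear in w, so the exact minimizer of w puts 0 on the h largest |θ_j| and 1 elsewhere; for fixed w, a proximal-gradient step amounts to weighted soft-thresholding. This is only meant to illustrate the structure of F, not the paper's exact algorithm or its guarantees:

```python
import numpy as np

def soft_threshold(z, tau):
    # prox of the weighted l1 term; tau may be a vector (entrywise thresholds)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def trimmed_lasso(X, y, lam, h, iters=500):
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1/L for the loss (1/2n)||y - X theta||^2
    theta = np.zeros(p)
    for _ in range(iters):
        # w-step: exact minimization -- w_j = 0 on the h largest |theta_j|, 1 elsewhere
        w = np.ones(p)
        w[np.argsort(-np.abs(theta))[:h]] = 0.0
        # theta-step: one proximal-gradient step on the weighted-l1 objective
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam * w)
    return theta

# usage on synthetic data (all values illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
theta_star = np.zeros(20)
theta_star[:3] = 5.0
y = X @ theta_star + 0.1 * rng.standard_normal(100)
print(trimmed_lasso(X, y, lam=0.5, h=3).round(2))
```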
Trimmed ℓ1 Penalty: Unit Ball Visualization
[Figure: unit balls of the trimmed ℓ1 penalty for θ = (θ1, θ2, θ3) in 3-dimensional space, for h = 0, h = 1, and h = 2.]