Pathwise Coordinate Optimization for Nonconvex Sparse Learning
Tuo Zhao
http://www.princeton.edu/~tuoz
Department of Computer Science, Johns Hopkins University
Mar. 25, 2015
Collaborators
This is joint work with Prof. Han Liu at Princeton University, Prof. Tong Zhang at Rutgers University and Baidu, and Xingguo Li at the University of Minnesota.
Manuscript: http://arxiv.org/abs/1412.7477
Software Package: http://cran.r-project.org/web/packages/picasso/
Outline
Background
Pathwise Coordinate Optimization
Computational and Statistical Theories
Numerical Simulations
Conclusions
Background
Regularized M-Estimation
Let $\beta^*$ denote the parameter to be estimated. We solve the following regularized M-estimation problem
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{L(\beta) + R_\lambda(\beta)}_{F_\lambda(\beta)},$$
where $L(\beta)$ is a smooth loss function, and $R_\lambda(\beta)$ is a regularization function with a tuning parameter $\lambda$.
Examples: Lasso, Logistic Lasso (Tibshirani, 1996), Group Lasso (Yuan and Lin, 2006), Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008; Friedman et al., 2008), ...
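As a concrete instance, here is a minimal numerical sketch (my own illustration in Python/NumPy, not part of the slides) that evaluates $F_\lambda(\beta)$ for the Lasso, i.e. the least-squares loss plus an $\ell_1$ regularizer; the data X, y and the candidate beta are synthetic placeholders.

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """F_lambda(beta) = L(beta) + R_lambda(beta) for the Lasso:
    least-squares loss plus lam * ||beta||_1."""
    n = X.shape[0]
    loss = 0.5 / n * np.sum((y - X @ beta) ** 2)   # L(beta)
    penalty = lam * np.sum(np.abs(beta))           # R_lambda(beta)
    return loss + penalty

# Synthetic example
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta_true = np.zeros(20); beta_true[:3] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(100)
print(lasso_objective(beta_true, X, y, lam=0.1))
```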
Regularization Functions
$R_\lambda(\beta)$ is coordinate separable:
$$R_\lambda(\beta) = \sum_{j=1}^d r_\lambda(\beta_j).$$
$R_\lambda(\beta)$ is decomposable:
$$R_\lambda(\beta) = \lambda\|\beta\|_1 + H_\lambda(\beta) = \sum_{j=1}^d \left[\lambda|\beta_j| + h_\lambda(\beta_j)\right].$$
Examples: Smoothly Clipped Absolute Deviation (SCAD, Fan and Li, 2001) and Minimax Concave Penalty (MCP, Zhang, 2010).
Regularization Functions
For any $\gamma > 2$, SCAD is defined as
$$r_\lambda(\beta_j) = \begin{cases} \lambda|\beta_j| & \text{if } |\beta_j| \le \lambda, \\ \dfrac{2\lambda\gamma|\beta_j| - |\beta_j|^2 - \lambda^2}{2(\gamma - 1)} & \text{if } \lambda < |\beta_j| \le \lambda\gamma, \\ \dfrac{(\gamma + 1)\lambda^2}{2} & \text{if } |\beta_j| > \lambda\gamma, \end{cases}$$
$$h_\lambda(\beta_j) = \begin{cases} 0 & \text{if } |\beta_j| \le \lambda, \\ \dfrac{2\lambda|\beta_j| - |\beta_j|^2 - \lambda^2}{2(\gamma - 1)} & \text{if } \lambda < |\beta_j| \le \lambda\gamma, \\ \dfrac{(\gamma + 1)\lambda^2}{2} - \lambda|\beta_j| & \text{if } |\beta_j| > \lambda\gamma. \end{cases}$$
Regularization Functions
For any $\gamma > 1$, MCP is defined as
$$r_\lambda(\beta_j) = \begin{cases} \lambda\left(|\beta_j| - \dfrac{|\beta_j|^2}{2\lambda\gamma}\right) & \text{if } |\beta_j| \le \lambda\gamma, \\ \dfrac{\lambda^2\gamma}{2} & \text{if } |\beta_j| > \lambda\gamma, \end{cases}$$
$$h_\lambda(\beta_j) = \begin{cases} -\dfrac{|\beta_j|^2}{2\gamma} & \text{if } |\beta_j| \le \lambda\gamma, \\ \dfrac{\lambda^2\gamma}{2} - \lambda|\beta_j| & \text{if } |\beta_j| > \lambda\gamma. \end{cases}$$
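The short NumPy sketch below (my own illustration; function and variable names are not from the slides) implements $r_\lambda$ and $h_\lambda$ for SCAD and MCP as defined above, and numerically checks the decomposition $r_\lambda(\beta_j) = \lambda|\beta_j| + h_\lambda(\beta_j)$ for MCP.

```python
import numpy as np

def scad_r(t, lam, gamma):
    """SCAD penalty r_lambda(t), gamma > 2."""
    a = np.abs(t)
    return np.where(a <= lam, lam * a,
           np.where(a <= gamma * lam,
                    (2 * lam * gamma * a - a**2 - lam**2) / (2 * (gamma - 1)),
                    (gamma + 1) * lam**2 / 2))

def mcp_r(t, lam, gamma):
    """MCP penalty r_lambda(t), gamma > 1."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma), gamma * lam**2 / 2)

def mcp_h(t, lam, gamma):
    """Concave part h_lambda(t) = r_lambda(t) - lam * |t| for MCP."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, -a**2 / (2 * gamma), gamma * lam**2 / 2 - lam * a)

# Check the decomposition r = lam * |t| + h on a grid (here for MCP)
t = np.linspace(-3, 3, 601)
lam, gamma = 1.0, 2.01
assert np.allclose(mcp_r(t, lam, gamma), lam * np.abs(t) + mcp_h(t, lam, gamma))
```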
Regularization Functions
[Plots of the regularizers $r_\lambda(\theta_j)$ (left) and their concave components $h_\lambda(\theta_j)$ (right) for $\ell_1$, SCAD, and MCP.]
Figure 1: $\lambda = 1$ and $\gamma = 2.01$.
Loss Functions
$X \in \mathbb{R}^{n \times d}$ – design matrix, $y \in \mathbb{R}^n$ – response vector.
Least Squares Loss:
$$L(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2.$$
Logistic Loss:
$$L(\beta) = \frac{1}{n}\sum_{i=1}^n \left[\log\left(1 + \exp(X_{i*}^T\beta)\right) - y_i X_{i*}^T\beta\right].$$
Others: Huber Loss, Multi-category Logistic Loss, ...
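A minimal sketch (mine, not from the slides) of the two losses above; it assumes binary responses $y_i \in \{0, 1\}$ for the logistic case.

```python
import numpy as np

def least_squares_loss(beta, X, y):
    """L(beta) = (1/(2n)) * ||y - X beta||_2^2"""
    n = X.shape[0]
    return 0.5 / n * np.sum((y - X @ beta) ** 2)

def logistic_loss(beta, X, y):
    """L(beta) = (1/n) * sum_i [ log(1 + exp(x_i^T beta)) - y_i * x_i^T beta ]"""
    eta = X @ beta                          # linear predictors x_i^T beta
    # log(1 + exp(eta)) computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, eta) - y * eta)
```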
Reformulation
We rewrite the regularized M-estimation problem as
$$\min_{\beta \in \mathbb{R}^d} \; \underbrace{\widetilde{L}_\lambda(\beta) + \lambda\|\beta\|_1}_{F_\lambda(\beta)}.$$
$\widetilde{L}_\lambda(\beta) = L(\beta) + H_\lambda(\beta)$ is smooth but nonconvex.
$\lambda\|\beta\|_1$ is nonsmooth but convex.
Remark: Amenable to theoretical analysis.
Randomized Coordinate Descent Algorithm
At the $t$-th iteration, we randomly select a coordinate $j$ from the $d$ coordinates. We then take $\beta^{(t+1)}_{\setminus j} \leftarrow \beta^{(t)}_{\setminus j}$, and update coordinate $j$ by
Exact Coordinate Minimization (Fu, 1998):
$$\beta^{(t+1)}_j \leftarrow \arg\min_{\beta_j} \widetilde{L}_\lambda(\beta_j; \beta^{(t)}_{\setminus j}) + \lambda|\beta_j|.$$
Inexact Coordinate Minimization (Shalev-Shwartz, 2011):
$$\beta^{(t+1)}_j \leftarrow \arg\min_{\beta_j} \widetilde{L}_\lambda(\beta^{(t)}) + (\beta_j - \beta^{(t)}_j)\nabla_j\widetilde{L}_\lambda(\beta^{(t)}) + \frac{L}{2}(\beta_j - \beta^{(t)}_j)^2 + \lambda|\beta_j|,$$
where $L$ is the step size parameter.
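The inexact update has a closed form via soft-thresholding of a coordinate gradient step. Below is a small sketch (my own illustration, not the slides' algorithm verbatim) of one randomized inexact coordinate step; to keep it short, the smooth part is the plain least-squares loss (the $\ell_1$/Lasso case, i.e. $h_\lambda \equiv 0$), and the helper names are my own.

```python
import numpy as np

def soft_threshold(z, t):
    """S_t(z) = sign(z) * max(|z| - t, 0)"""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def inexact_coordinate_step(beta, X, y, lam, L, rng):
    """One randomized inexact coordinate minimization step for the
    l1-regularized least-squares loss:
    beta_j <- S_{lam/L}(beta_j - grad_j / L), other coordinates unchanged."""
    n, d = X.shape
    j = rng.integers(d)                          # pick a coordinate uniformly at random
    grad_j = -X[:, j] @ (y - X @ beta) / n       # j-th partial derivative of the loss
    beta = beta.copy()
    beta[j] = soft_threshold(beta[j] - grad_j / L, lam / L)
    return beta
```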
Examples
Sparse Linear Regression + MCP:
$$T_{j,\lambda}(\beta^{(t)}) = \begin{cases} \bar{\beta}^{(t+1)}_j & \text{if } |\bar{\beta}^{(t+1)}_j| \ge \gamma\lambda, \\ \dfrac{S_\lambda(\bar{\beta}^{(t+1)}_j)}{1 - 1/\gamma} & \text{if } |\bar{\beta}^{(t+1)}_j| < \gamma\lambda, \end{cases}$$
where $\bar{\beta}^{(t+1)}_j = X_{*j}^T(y - X_{*\setminus j}\beta^{(t)}_{\setminus j})/n$.
Sparse Logistic Regression + MCP:
$$T_{j,\lambda}(\beta^{(t)}) = S_\lambda\left(\beta^{(t)}_j - \nabla_j\widetilde{L}_\lambda(\beta^{(t)})/L\right).$$
Remark: Sublinear convergence to local optima without statistical guarantees (Shalev-Shwartz, 2011).
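A sketch of the closed-form exact coordinate update $T_{j,\lambda}$ for sparse linear regression with MCP (my own illustration; it assumes the columns of X are standardized so that $X_{*j}^T X_{*j}/n = 1$, which the slide's formula implicitly requires):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def mcp_coordinate_update(beta, X, y, j, lam, gamma):
    """Closed-form coordinate update T_{j,lambda} for least squares + MCP,
    assuming X[:, j]^T X[:, j] / n = 1."""
    n = X.shape[0]
    # partial residual fit: bar_beta_j = X_j^T (y - X_{\j} beta_{\j}) / n
    r = y - X @ beta + X[:, j] * beta[j]
    bar_beta_j = X[:, j] @ r / n
    if np.abs(bar_beta_j) >= gamma * lam:
        return bar_beta_j                                  # no shrinkage beyond gamma * lam
    return soft_threshold(bar_beta_j, lam) / (1.0 - 1.0 / gamma)
```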
Pathwise Coordinate Optimization
Pathwise Coordinate Optimization
Much faster than other competing algorithms.
Very simple implementation.
Easily scales to large problems.
NO computational analysis in the existing literature.
NO statistical guarantee on the obtained estimator.
Our Contribution:
The FIRST pathwise coordinate optimization algorithm with both computational and statistical guarantees.
The FIRST two-step estimator with both computational and statistical guarantees.
Pathwise Coordinate Optimization (Friedman et al., 2007; Mazumder et al., 2011)
[Flowchart: the outer loop sets the regularization parameter and the warm start (initial solution); the middle loop performs active set identification and, upon convergence, outputs the solution; the inner loop performs active coordinate minimization (active set initialization, coordinate updating, convergence check).]
Figure 2: The pathwise coordinate optimization framework contains 3 nested loops: (I) warm start initialization; (II) active set identification; (III) active coordinate minimization.
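A schematic rendering of the three nested loops, as a Python pseudocode sketch of my own (not the picasso implementation): grad and update are caller-supplied placeholders, and the gradient-based active set rule is one simple heuristic rather than the framework's exact identification rule.

```python
import numpy as np

def pathwise_coordinate_optimization(lambdas, beta0, grad, update, tol=1e-5, max_iter=1000):
    """Sketch of the three nested loops of pathwise coordinate optimization.
    grad(beta): gradient of the smooth part; update(beta, j, lam): new value of
    coordinate j (e.g. a soft-thresholding or MCP step)."""
    beta = beta0.copy()
    solutions = []
    for lam in lambdas:                                    # outer loop: warm start over the lambda path
        for _ in range(max_iter):                          # middle loop: active set identification
            active = set(np.nonzero(beta)[0]) | set(np.nonzero(np.abs(grad(beta)) > lam)[0])
            beta_old = beta.copy()
            for _ in range(max_iter):                      # inner loop: active coordinate minimization
                beta_prev = beta.copy()
                for j in active:
                    beta[j] = update(beta, j, lam)         # cyclic updates over the active set
                if np.max(np.abs(beta - beta_prev)) < tol:
                    break
            if np.max(np.abs(beta - beta_old)) < tol:      # active set has stabilized
                break
        solutions.append(beta.copy())                      # solution for this lambda warm-starts the next
    return solutions
```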
Restricted Strong Convexity and Smoothness
Motivation: For any $\beta, \beta' \in \mathbb{R}^d$ such that $|\{j \mid \beta_j \neq 0 \text{ or } \beta'_j \neq 0\}| \le s$, we have
$$\widetilde{L}_\lambda(\beta') - \widetilde{L}_\lambda(\beta) - (\beta' - \beta)^T\nabla\widetilde{L}_\lambda(\beta) \ge \frac{C_-(s)}{2}\|\beta' - \beta\|_2^2,$$
$$\widetilde{L}_\lambda(\beta') - \widetilde{L}_\lambda(\beta) - (\beta' - \beta)^T\nabla\widetilde{L}_\lambda(\beta) \le \frac{C_+(s)}{2}\|\beta' - \beta\|_2^2,$$
where $C_-(s), C_+(s) > 0$ are two constants depending on $s$.
Remark: An algorithm that maintains SPARSE solutions throughout all iterations behaves like minimizing a STRONGLY CONVEX function. Therefore linear convergence can be expected.
Warm Start Initialization (Outer Loop)
We choose a sequence of DECREASING regularization parameters $\{\lambda_K\}_{K=0}^N$:
$$\lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{N-1} \ge \lambda_N.$$
The algorithm yields a sequence of output solutions $\{\hat{\beta}^{\{K\}}\}_{K=0}^N$ from sparse to dense,
$$\hat{\beta}^{\{K\}} \leftarrow \arg\min_{\beta} \widetilde{L}_{\lambda_K}(\beta) + \lambda_K\|\beta\|_1.$$
Warm Start Initialization (Outer Loop)
We choose $\lambda_0 = \|\nabla L(0)\|_\infty$; then $\hat{\beta}^{\{0\}} = 0$, since
$$\min_{\xi \in \partial\|0\|_1} \|\nabla L(0) + \nabla H_{\lambda_0}(0) + \lambda_0\xi\|_\infty = 0.$$
The regularization sequence $\{\lambda_K\}_{K=0}^N$ is geometrically decreasing,
$$\lambda_K = \eta\lambda_{K-1} \quad \text{with } \eta \in (0, 1).$$
When solving the optimization problem with $\lambda_K$, we use $\hat{\beta}^{\{K-1\}}$ as the INITIALIZATION.
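For the least-squares loss, $\nabla L(0) = -X^Ty/n$, so $\lambda_0 = \|X^Ty\|_\infty/n$. A minimal sketch (my own, not the picasso code) of constructing the geometric regularization sequence and warm-starting along it; solve_one is a hypothetical single-parameter solver.

```python
import numpy as np

def lambda_path(X, y, N=100, eta=0.95):
    """Geometric regularization sequence for the least-squares loss:
    lambda_0 = ||grad L(0)||_inf = ||X^T y / n||_inf,  lambda_K = eta * lambda_{K-1}."""
    n = X.shape[0]
    lam0 = np.max(np.abs(X.T @ y)) / n
    return lam0 * eta ** np.arange(N + 1)

# Warm start: solve along the path, initializing each problem at the previous solution.
# (solve_one is a hypothetical solver for a single regularization parameter.)
# beta = np.zeros(X.shape[1])
# for lam in lambda_path(X, y):
#     beta = solve_one(X, y, lam, init=beta)
```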