Screening Rules for Lasso with Non-Convex Sparse Regularizers
Joseph Salmon, http://josephsalmon.eu, Université de Montpellier
Joint work with A. Rakotomamonjy and G. Gasso
Motivation and objective

Lasso and screening
◮ Learning sparse regression models: $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$,
$$\min_{w = (w_1, \dots, w_d)^\top \in \mathbb{R}^d} \ \frac{1}{2}\|y - Xw\|^2 + \lambda \sum_{j=1}^{d} |w_j|$$
◮ Safe screening rules (1), (2): identify vanishing coordinates of a solution by exploiting sparsity, convexity and duality

Extension to non-convex regularizers:
◮ non-convex regularizers lead to statistically better models but ...
◮ how to perform screening when the regularizer is non-convex?
[Figure: shapes of the ℓ1, log-sum and MCP penalties]

(1). L. El Ghaoui, V. Viallon and T. Rabbani. "Safe feature elimination in sparse supervised learning". In: Pacific Journal of Optimization 8 (2012), pp. 667-698.
(2). A. Bonnefoy et al. "Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso". In: IEEE Trans. Signal Process. 63.19 (2015), pp. 5121-5132.
Non-convex sparse regression

Non-convex regularization: $r_\lambda(\cdot)$ smooth and concave on $[0, \infty)$
$$\min_{w \in \mathbb{R}^d} \ \frac{1}{2}\|y - Xw\|^2 + \sum_{j=1}^{d} r_\lambda(|w_j|)$$

Examples:
◮ Log-Sum Penalty (LSP) (3)
◮ Smoothly Clipped Absolute Deviation (SCAD) (4)
◮ capped-ℓ1 penalty (5)
◮ Minimax Concave Penalty (MCP) (6)

Rem: for pros and cons of such formulations, cf. Soubies et al. (7)

(3). E. J. Candès, M. B. Wakin and S. P. Boyd. "Enhancing Sparsity by Reweighted ℓ1 Minimization". In: J. Fourier Anal. Applicat. 14.5-6 (2008), pp. 877-905.
(4). J. Fan and R. Li. "Variable selection via nonconcave penalized likelihood and its oracle properties". In: J. Amer. Statist. Assoc. 96.456 (2001), pp. 1348-1360.
(5). T. Zhang. "Analysis of multi-stage convex relaxation for sparse regularization". In: Journal of Machine Learning Research 11 (2010), pp. 1081-1107.
(6). C.-H. Zhang. "Nearly unbiased variable selection under minimax concave penalty". In: Ann. Statist. 38.2 (2010), pp. 894-942.
(7). E. Soubies, L. Blanc-Féraud and G. Aubert. "A Unified View of Exact Continuous Penalties for ℓ2-ℓ0 Minimization". In: SIAM J. Optim. 27.3 (2017), pp. 2034-2060.
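To make these regularizers concrete, here is a minimal Python sketch of two of them (log-sum and MCP) together with the derivatives $r'_\lambda$ that serve as per-coordinate weights in the MM algorithm of the next slide. The parameterizations (θ for log-sum, γ for MCP) follow the usual conventions of the cited papers; the exact constants used in the talk may differ.

```python
# Minimal sketch of two concave penalties r_lambda(.) and their derivatives,
# assuming the standard parameterizations (theta, gamma) from the literature.
import numpy as np

def logsum_pen(t, lam, theta=1.0):
    """Log-sum penalty r_lam(t) = lam * log(1 + |t|/theta)."""
    return lam * np.log1p(np.abs(t) / theta)

def logsum_grad(t, lam, theta=1.0):
    """Derivative r'_lam(t) = lam / (theta + |t|): large weight near 0, small far away."""
    return lam / (theta + np.abs(t))

def mcp_pen(t, lam, gamma=3.0):
    """Minimax concave penalty: quadratic near 0, constant beyond gamma*lam."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_grad(t, lam, gamma=3.0):
    """Derivative r'_lam(t) = max(0, lam - |t|/gamma): vanishes beyond gamma*lam."""
    return np.maximum(0.0, lam - np.abs(t) / gamma)

if __name__ == "__main__":
    grid = np.linspace(-2.0, 2.0, 5)
    print(logsum_pen(grid, lam=1.0), mcp_pen(grid, lam=1.0))
```

The property exploited on the next slide is concavity of $r_\lambda$ on $[0, \infty)$, so the linearization at $|w_j^k|$ is a valid majorizer.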
Majorization-Minimization

Algorithm: Majorization-Minimization (MM)
  input: max. iterations k_max, stopping criterion ε, α, w^0 (= 0)
  for k = 0, ..., k_max − 1 do
      Break if stopping criterion smaller than ε
      λ_j^k ← r'_λ(|w_j^k|)                                                         // Majorization
      w^{k+1} ← argmin_{w ∈ R^d} (1/2)‖y − Xw‖² + (1/(2α))‖w − w^k‖² + Σ_{j=1}^d λ_j^k |w_j|   // Minimization
  return w^{k+1}

Majorization: $r_\lambda(|w_j|) \leq r_\lambda(|w_j^k|) + r'_\lambda(|w_j^k|)\,(|w_j| - |w_j^k|)$
Minimization: weighted-Lasso formulation

Rem: the term $\frac{1}{2\alpha}\|w - w^k\|^2$ acts as a regularization for MM (8) (other majorization alternatives possible, e.g., with gradient information)

(8). Y. Kang, Z. Zhang and W.-J. Li. "On the global convergence of majorization minimization algorithms for nonconvex optimization problems". In: arXiv preprint arXiv:1504.07791 (2015).
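Below is a minimal sketch of this MM loop, reusing the MCP weights from the previous sketch and a plain ISTA solver for the inner weighted Lasso. The talk relies on coordinate-descent solvers combined with the screening rules of the next slides; ISTA is used here only to keep the illustration of the outer structure short and self-contained.

```python
# Minimal sketch of the MM loop: reweight with r'_lambda, then solve a weighted
# Lasso with an extra proximal term via ISTA (illustrative, not the talk's solver).
import numpy as np

def mcp_grad(t, lam, gamma=3.0):
    return np.maximum(0.0, lam - np.abs(t) / gamma)

def weighted_lasso_ista(X, y, weights, w_ref, alpha, n_iter=500):
    """min_w 0.5*||y - Xw||^2 + (1/(2*alpha))*||w - w_ref||^2 + sum_j weights_j*|w_j|."""
    L = np.linalg.norm(X, 2) ** 2 + 1.0 / alpha      # Lipschitz constant of the smooth part
    w = w_ref.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) + (w - w_ref) / alpha
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - weights / L, 0.0)   # weighted soft-thresholding
    return w

def mm_nonconvex_lasso(X, y, lam, alpha=10.0, gamma=3.0, k_max=20, eps=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(k_max):
        lam_k = mcp_grad(w, lam, gamma)                      # Majorization: reweighting
        w_new = weighted_lasso_ista(X, y, lam_k, w, alpha)   # Minimization: weighted Lasso
        if np.linalg.norm(w_new - w) < eps:                  # stopping criterion
            return w_new
        w = w_new
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 100))
    y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(50)
    w_hat = mm_nonconvex_lasso(X, y, lam=1.0)
    print("non-zeros:", np.flatnonzero(np.abs(w_hat) > 1e-8))
```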
Safe Screening / Two-level screening

Safe Screening: for Lasso problems, vanishing coefficients at optimality can be certified without knowing the solution
◮ from a prior computation at a similar value of the tuning parameters (sequential (9) / dual warm start)
◮ along the optimization algorithm (dynamic (10))

State-of-the-art safe screening rules: rely on the duality gap (11)

Two-level screening for non-convex cases:
◮ Inner-level screening: within each (weighted) Lasso
◮ Outer-level screening: propagate information between Lassos

(9). L. El Ghaoui, V. Viallon and T. Rabbani. "Safe feature elimination in sparse supervised learning". In: Pacific Journal of Optimization 8 (2012), pp. 667-698.
(10). A. Bonnefoy et al. "Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso". In: IEEE Trans. Signal Process. 63.19 (2015), pp. 5121-5132.
(11). E. Ndiaye et al. "Gap Safe screening rules for sparsity enforcing penalties". In: Journal of Machine Learning Research 18.128 (2017), pp. 1-33.
Notation

Notation: $X = [x_1, \dots, x_d]$, $\Lambda = (\lambda_1, \dots, \lambda_d)^\top$, $s \in \mathbb{R}^n$, $v \in \mathbb{R}^d$

Inner (convex) problems:
$$P_\Lambda(w) \triangleq \frac{1}{2}\|y - Xw\|^2 + \frac{1}{2\alpha}\|w - w^k\|^2 + \sum_{j=1}^{d} \lambda_j |w_j| \qquad \text{(Primal)}$$
$$D_\Lambda(s, v) \triangleq -\frac{1}{2}\|s\|^2 - \frac{\alpha}{2}\|v\|^2 + s^\top y - v^\top w^k \quad \text{s.t.} \quad |X^\top s - v| \preceq \Lambda \qquad \text{(Dual)}$$
$$G_\Lambda(w, s, v) \triangleq P_\Lambda(w) - D_\Lambda(s, v) \qquad \text{(Duality gap)}$$
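As an illustration, here is a minimal Python sketch of these three quantities for a given primal point w, with the dual pair (s, v) built by the rescaling of the next slide. The choice of ρ(Λ) below (the smallest scalar making the dual constraint hold, assuming all λ_j > 0) is one natural option and may differ from the exact choice made in the talk.

```python
# Minimal sketch of P_Lambda, D_Lambda and G_Lambda for the inner weighted Lasso
# with proximal centre w_k; the dual pair is obtained by rescaling (assumption).
import numpy as np

def primal(X, y, w, w_k, alpha, lam):
    """P_Lambda(w): data fit + proximal term + weighted l1 penalty."""
    return (0.5 * np.sum((y - X @ w) ** 2)
            + np.sum((w - w_k) ** 2) / (2.0 * alpha)
            + lam @ np.abs(w))

def dual(y, w_k, alpha, s, v):
    """D_Lambda(s, v), valid whenever |X^T s - v| <= Lambda coordinate-wise."""
    return -0.5 * s @ s - 0.5 * alpha * (v @ v) + s @ y - v @ w_k

def feasible_dual_pair(X, y, w, w_k, alpha, lam):
    """Rescale (residual, (w - w_k)/alpha) until the dual constraint holds (lam_j > 0)."""
    resid = y - X @ w
    z = (w - w_k) / alpha
    rho = max(1.0, np.max(np.abs(X.T @ resid - z) / lam))
    return resid / rho, z / rho

def gap(X, y, w, w_k, alpha, lam):
    """Duality gap G_Lambda(w, s, v), non-negative for the feasible pair above."""
    s, v = feasible_dual_pair(X, y, w, w_k, alpha, lam)
    return primal(X, y, w, w_k, alpha, lam) - dual(y, w_k, alpha, s, v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((30, 50))
    y = rng.standard_normal(30)
    print(gap(X, y, w=np.zeros(50), w_k=np.zeros(50), alpha=10.0, lam=np.full(50, 1.0)))
```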
Screening the weighted Lasso

◮ Primal optimization problem $P_\Lambda(w)$:
$$\tilde{w} \leftarrow \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \ \frac{1}{2}\|y - Xw\|^2 + \frac{1}{2\alpha}\|w - w^k\|^2 + \sum_{j=1}^{d} \lambda_j |w_j|$$

Screening test: $|x_j^\top \tilde{s} - \tilde{v}_j| < \lambda_j \implies \tilde{w}_j = 0$ (impractical)
with $\tilde{s} \triangleq \frac{y - X\tilde{w}}{\rho(\Lambda)}$, $\tilde{v} \triangleq \frac{\tilde{w} - w^k}{\alpha\,\rho(\Lambda)}$ (for a well-chosen scalar $\rho(\Lambda)$)

◮ (Practical) dynamic Gap Safe screening test (12), (13):
$$\underbrace{|x_j^\top s - v_j| + \Big(\|x_j\| + \frac{1}{\alpha}\Big)\sqrt{2\, G_\Lambda(w, s, v)}}_{T_j^{(\Lambda)}(w, s, v)} < \lambda_j$$
given an approximate primal-dual solution triplet $(w, s, v)$

(12). O. Fercoq, A. Gramfort and J. Salmon. "Mind the duality gap: safer rules for the lasso". In: ICML. 2015, pp. 333-342.
(13). E. Ndiaye et al. "Gap Safe screening rules for sparsity enforcing penalties". In: Journal of Machine Learning Research 18.128 (2017), pp. 1-33.
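A minimal self-contained sketch of this test follows: it recomputes the feasible dual pair and the duality gap as in the previous sketch and returns the mask of coordinates that can be safely discarded. Again, the choice of ρ(Λ) and the assumption λ_j > 0 are illustration choices, not necessarily those of the talk.

```python
# Minimal sketch of the dynamic Gap Safe test T_j^{(Lambda)}(w, s, v) < lambda_j.
import numpy as np

def gap_safe_screen(X, y, w, w_k, alpha, lam):
    """Boolean mask: True at coordinates certified to be zero in the exact solution."""
    resid = y - X @ w
    diff = (w - w_k) / alpha
    # feasible dual pair (s, v) via rescaling, assuming lam_j > 0 for all j
    rho = max(1.0, np.max(np.abs(X.T @ resid - diff) / lam))
    s, v = resid / rho, diff / rho
    primal = 0.5 * resid @ resid + 0.5 * alpha * (diff @ diff) + lam @ np.abs(w)
    dual = -0.5 * s @ s - 0.5 * alpha * (v @ v) + s @ y - v @ w_k
    duality_gap = max(primal - dual, 0.0)            # the exact gap is always >= 0
    radius = np.sqrt(2.0 * duality_gap)
    col_norms = np.linalg.norm(X, axis=0)
    T = np.abs(X.T @ s - v) + (col_norms + 1.0 / alpha) * radius
    return T < lam

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((40, 80))
    y = X[:, :3] @ np.ones(3)
    screened = gap_safe_screen(X, y, w=np.zeros(80), w_k=np.zeros(80),
                               alpha=10.0, lam=np.full(80, 5.0))
    print(screened.sum(), "coordinates screened out of", screened.size)
```

Because the test only needs a duality gap, it can be re-evaluated every few passes of the inner solver, so more coordinates get discarded as the gap shrinks (the dynamic aspect).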
Inner-level screening and speed-ups

◮ After iteration k, one receives approximate solutions $w^k$, $s^k$ and $v^k$ for the weighted Lasso with weights $\Lambda^k$

Set of screened variables:
$$S \triangleq \Big\{ j \in [\![1, d]\!] : T_j^{(\Lambda^k)}(w^k, s^k, v^k) < \lambda_j^k \Big\}$$

◮ Speed-ups: reduced weighted Lasso problem size, substituting $X \leftarrow X_{S^c}$

Rem: most beneficial with coordinate-descent-type solvers
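A minimal sketch of the corresponding bookkeeping: the columns in S are dropped before calling the inner solver, and zeros are scattered back afterwards. The helpers reduce_problem / expand_solution are illustrative names, not the talk's implementation; with a coordinate-descent solver the mask would typically be refreshed dynamically every few passes.

```python
# Minimal sketch of the size reduction X <- X_{S^c} and the reverse mapping.
import numpy as np

def reduce_problem(X, lam, w_k, screened):
    """Drop the screened columns: the inner solver then works on X restricted to S^c."""
    keep = ~screened
    return X[:, keep], lam[keep], w_k[keep], keep

def expand_solution(w_reduced, keep, d):
    """Scatter the reduced solution back; screened coordinates stay exactly zero."""
    w = np.zeros(d)
    w[keep] = w_reduced
    return w
```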
Outer screening level / screening propagation

Before iteration k + 1:
◮ change of weights $\Lambda^{k+1} = \{\lambda_j^{k+1}\}_{j=1,\dots,d}$
◮ update $(w^{k+1}, s^{k+1}, v^{k+1}) \leftarrow \Big(w^k, \ \frac{y - Xw^k}{\rho(\Lambda^{k+1})}, \ \frac{w^{k+1} - w^k}{\alpha\,\rho(\Lambda^{k+1})}\Big)$

Screening propagation test:
$$T_j^{(\Lambda^k)}(\hat{w}, \hat{s}, \hat{v}) + \|x_j\|\big(a + \sqrt{2b}\big) + c_j + \frac{1}{\alpha}\sqrt{2b} < \lambda_j^{k+1}$$
with
$$\|s^{k+1} - s^k\| \leq a, \qquad |G_{\Lambda^k}(w^k, s^k, v^k) - G_{\Lambda^{k+1}}(w^{k+1}, s^{k+1}, v^{k+1})| \leq b, \qquad |v_j^{k+1} - v_j^k| \leq c_j$$

Rem: same flavor as sequential screening (14)

(14). L. El Ghaoui, V. Viallon and T. Rabbani. "Safe feature elimination in sparse supervised learning". In: Pacific Journal of Optimization 8 (2012), pp. 667-698.
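A minimal sketch of this propagation test: it only applies the displayed inequality. The scalars a, b and the vector c, which bound the change of dual point, of duality gap and of v between two consecutive weighted Lassos, are taken here as inputs; how the talk computes them cheaply before solving the next problem is not reproduced in this sketch.

```python
# Minimal sketch of the outer-level (propagation) test, assuming the bounds
# a, b and c are supplied by the caller.
import numpy as np

def propagate_screening(T_prev, col_norms, lam_next, a, b, c, alpha):
    """Coordinates passing the test can be discarded from the next weighted Lasso
    before solving it, using only quantities from the previous one."""
    lhs = T_prev + col_norms * (a + np.sqrt(2.0 * b)) + c + np.sqrt(2.0 * b) / alpha
    return lhs < lam_next
```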