A practical tour of optimization algorithms for the Lasso
Alexandre Gramfort, alexandre.gramfort@inria.fr
Inria, Parietal Team, Université Paris-Saclay
Huawei - Apr. 2017
Outline
• What is the Lasso
• Lasso with an orthogonal design
• From projected gradient to proximal gradient
• Optimality conditions and subgradients (LARS algo.)
• Coordinate descent algorithm
… with some demos
Lasso

$$x^* \in \operatorname*{argmin}_{x} \; \frac{1}{2}\|b - Ax\|_2^2 + \lambda \|x\|_1, \qquad \|x\|_1 = \sum_{i=1}^p |x_i|,$$

with $A \in \mathbb{R}^{n \times p}$ and $\lambda > 0$.

• Commonly attributed to [Tibshirani 96] (> 19000 citations)
• Also known as Basis Pursuit Denoising [Chen 95] (> 9000 c.)
• Convex way of promoting sparsity in high dimensional regression / inverse problems.
• Can lead to statistical guarantees even if n ≈ log(p)
Algorithm 0: Using the CVX toolbox

n = 10;
A = randn(n/2, n);
b = randn(n/2, 1);
gamma = 1;            % gamma plays the role of lambda
cvx_begin
    variable x(n)
    % squared loss, to match the Lasso objective above
    minimize(0.5 * sum_square(A*x - b) + gamma * norm(x, 1))
cvx_end

http://cvxr.com/cvx/
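For readers working in Python, a roughly equivalent sketch with the cvxpy package (not on the original slide; here lam plays the role of λ):

# Minimal cvxpy sketch of the same Lasso problem (illustration only)
import cvxpy as cp
import numpy as np

n, p = 5, 10
A = np.random.randn(n, p)
b = np.random.randn(n)
lam = 1.0

x = cp.Variable(p)
objective = cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
cp.Problem(objective).solve()
x_star = x.value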
Algorithm 1

Rewrite $x_i = x_i^+ - x_i^-$ with $x_i^+ = \max(x_i, 0)$ and $x_i^- = \max(-x_i, 0)$, so that $|x_i| = x_i^+ + x_i^-$ and $\|x\|_1 = \sum_i (x_i^+ + x_i^-)$.

Leads to:

$$z^* \in \operatorname*{argmin}_{z \in \mathbb{R}^{2p}_+} \; \frac{1}{2}\|b - [A, -A]\,z\|^2 + \lambda \sum_i z_i$$

• This is a simple smooth convex optimization problem with positivity constraints (convex constraints)
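As an illustration (not from the slides), any bound-constrained smooth solver can handle this reformulation; a minimal sketch with scipy's L-BFGS-B, assuming A, b and a regularization parameter lam are already defined:

import numpy as np
from scipy.optimize import minimize

Az = np.hstack([A, -A])            # design for z = (x+, x-), with z >= 0
p = A.shape[1]

def objective(z):
    r = b - Az @ z
    return 0.5 * r @ r + lam * z.sum()

def gradient(z):
    return -Az.T @ (b - Az @ z) + lam

res = minimize(objective, np.zeros(2 * p), jac=gradient,
               method="L-BFGS-B", bounds=[(0, None)] * (2 * p))
x_star = res.x[:p] - res.x[p:]     # recover x = x+ - x-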
Gradient Descent

$$\min_{x \in \mathbb{R}^p} f(x)$$

With $f$ smooth with $L$-Lipschitz gradient: $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$

Gradient descent reads: $x_{k+1} = x_k - \frac{1}{L} \nabla f(x_k)$
Projected Gradient Descent

$$\min_{x \in C \subset \mathbb{R}^p} f(x)$$

With $C$ a convex set and $f$ smooth with $L$-Lipschitz gradient, projected gradient reads:

$$x_{k+1} = \pi_C\!\left(x_k - \frac{1}{L} \nabla f(x_k)\right)$$

where $\pi_C$ is the orthogonal projection onto $C$.
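A minimal projected-gradient sketch applied to the nonnegative Lasso reformulation of Algorithm 1 (illustrative; the function name and defaults are not from the slides):

import numpy as np

def projected_gradient_lasso(A, b, lam, n_iter=500):
    """Projected gradient on z = (x+, x-) >= 0; returns x = x+ - x-."""
    p = A.shape[1]
    Az = np.hstack([A, -A])
    L = np.linalg.norm(Az, ord=2) ** 2       # Lipschitz constant of the quadratic part
    z = np.zeros(2 * p)
    for _ in range(n_iter):
        grad = -Az.T @ (b - Az @ z) + lam    # gradient of 0.5*||b - Az z||^2 + lam*sum(z)
        z = np.maximum(z - grad / L, 0.0)    # projection onto the nonnegative orthant
    return z[:p] - z[p:]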
demo_grad_proj.ipynb
What if A is orthogonal?

• Let's assume we have a square orthogonal design matrix: $A^\top A = A A^\top = I_p$

One has: $\|b - Ax\|^2 = \|A^\top b - x\|^2$

So the Lasso boils down to minimizing:

$$x^* = \operatorname*{argmin}_{x \in \mathbb{R}^p} \; \frac{1}{2}\|A^\top b - x\|^2 + \lambda \|x\|_1
= \operatorname*{argmin}_{x \in \mathbb{R}^p} \; \sum_{i=1}^p \left( \frac{1}{2}\big((A^\top b)_i - x_i\big)^2 + \lambda |x_i| \right) \quad (p \text{ 1-d problems})$$

$$x^* = \operatorname{prox}_{\lambda \|\cdot\|_1}(A^\top b) \quad \text{(definition of the proximal operator)}$$
Proximal operator of the L1 norm

The soft-thresholding: $S_\lambda(c) = \operatorname{sign}(c)\,(|c| - \lambda)_+$

[Figure: graph of $S_\lambda(c)$, equal to zero on $[-\lambda, \lambda]$]

• $S_\lambda(c)$ is the solution of $\min_x \frac{1}{2}(c - x)^2 + \lambda |x|$
Algorithm with A orthogonal

import numpy as np

# A (orthogonal), b and the regularization parameter lambd are assumed defined
c = A.T.dot(b)
x_star = np.sign(c) * np.maximum(np.abs(c) - lambd, 0.)
What if A is NOT orthogonal?

Let us define $f(x) = \frac{1}{2}\|b - Ax\|^2$, which leads to $\nabla f(x) = -A^\top (b - Ax)$.

The Lipschitz constant of the gradient: $L = \|A^\top A\|_2$

Minimize a quadratic upper bound of $f$ at the previous iterate (plus the penalty):

$$x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \; f(x_k) + (x - x_k)^\top \nabla f(x_k) + \frac{L}{2}\|x - x_k\|^2 + \lambda \|x\|_1$$

$$\Leftrightarrow \quad x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \; \frac{1}{2}\left\|x - \Big(x_k - \frac{1}{L}\nabla f(x_k)\Big)\right\|^2 + \frac{\lambda}{L}\|x\|_1$$
Algorithm 2: Proximal gradient descent

That we can rewrite:

$$x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \; \frac{1}{2}\left\|x - \Big(x_k - \frac{1}{L}\nabla f(x_k)\Big)\right\|^2 + \frac{\lambda}{L}\|x\|_1
= \operatorname{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\!\left(x_k - \frac{1}{L}\nabla f(x_k)\right)$$

[Daubechies et al. 2004, Combettes et al. 2005]

Remark: If $f$ is not strongly convex, $f(x_k) - f(x^*) = O\!\left(\frac{1}{k}\right)$, very far from the exponential (linear) rate of gradient descent under strong convexity.

Remark: There exist so-called "accelerated" methods known as FISTA, Nesterov acceleration…
Proximal gradient

import numpy as np
from scipy import linalg

# A and b are assumed defined
alpha = 0.1                        # regularization parameter (lambda)
max_iter = 100                     # number of iterations (assumed)
L = linalg.norm(A, ord=2) ** 2     # Lipschitz constant ||A^T A||_2 = ||A||_2^2
x = np.zeros(A.shape[1])
for i in range(max_iter):
    # gradient step on the smooth part
    x += (1. / L) * np.dot(A.T, b - np.dot(A, x))
    # proximal step: soft-thresholding with threshold alpha / L
    x = np.sign(x) * np.maximum(np.abs(x) - alpha / L, 0.)
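The "accelerated" variants mentioned above (FISTA / Nesterov) only change a few lines; a minimal, illustrative FISTA sketch (the function name and defaults are mine, not from the slides):

import numpy as np
from scipy import linalg

def fista_lasso(A, b, lam, n_iter=200):
    """Accelerated proximal gradient (FISTA) for the Lasso, minimal sketch."""
    L = linalg.norm(A, ord=2) ** 2            # Lipschitz constant of the gradient
    p = A.shape[1]
    x = np.zeros(p)
    y = x.copy()
    t = 1.0
    for _ in range(n_iter):
        x_old = x
        z = y + (1.0 / L) * A.T @ (b - A @ y)                  # gradient step at y
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-thresholding
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        y = x + ((t - 1.0) / t_new) * (x - x_old)              # momentum step
        t = t_new
    return x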
demo_grad_proximal.ipynb
Pros of proximal gradient

• First order method (only requires computing gradients and proximal operators)
• Algorithms scale even when p is large (only A needs to be stored in memory)
• Great if A is an implicit linear operator (Fourier, Wavelet, MDCT, etc.), as the corresponding matrix-vector products have quasi-linear, e.g. O(p log p), complexity.
Subgradient and subdifferential

The subdifferential of $f$ at $x_0$ is:

$$\partial f(x_0) = \{ g \in \mathbb{R}^n \;/\; f(x) - f(x_0) \ge g^\top (x - x_0), \ \forall x \in \mathbb{R}^n \}$$

Properties
• The subdifferential is a convex set
• $x_0$ is a minimizer of $f$ if $0 \in \partial f(x_0)$

Exercise: What is $\partial |\cdot|(0)$?
Path of solutions

Lemma [Fuchs 97]: Let $x^*$ be a solution of the Lasso

$$x^* \in \operatorname*{argmin}_{x} \; \frac{1}{2}\|b - Ax\|^2 + \lambda \|x\|_1$$

and let $I = \{ i \ \text{s.t.} \ x^*_i \neq 0 \}$ be its support. Then:

$$A_I^\top (Ax^* - b) + \lambda \operatorname{sign}(x^*_I) = 0, \qquad \|A_{I^c}^\top (Ax^* - b)\|_\infty \le \lambda$$

And also:

$$x^*_I = (A_I^\top A_I)^{-1}\big(A_I^\top b - \lambda \operatorname{sign}(x^*_I)\big)$$
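As a quick check (not on the original slide), these conditions follow from the subgradient optimality condition of the previous slide, $0 \in A^\top(Ax^* - b) + \lambda\,\partial\|x^*\|_1$, read coordinate-wise (with $a_i$ the $i$-th column of $A$):

\begin{align*}
  i \in I:   \quad & a_i^\top (Ax^* - b) + \lambda \operatorname{sign}(x^*_i) = 0, \\
  i \in I^c: \quad & |a_i^\top (Ax^* - b)| \le \lambda .
\end{align*}

Solving the first block of equations for $x^*_I$ gives the closed-form expression above.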
Algorithm 3: Homotopy and LARS

The idea is to compute the full path of solutions, noticing that for a given sparsity / sign pattern the solution is affine in λ:

$$x^*_I = (A_I^\top A_I)^{-1}\big(A_I^\top b - \lambda \operatorname{sign}(x^*_I)\big)$$

The LARS algorithm [Osborne 2000, Efron et al. 2004] consists in finding the breakpoints along the path.
Lasso path with LARS algorithm

[Figure: Lasso coefficient path as a function of the regularization parameter]
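A path figure like this can be reproduced, as a small illustrative sketch on random data (not the data behind the slide's figure), with scikit-learn's lars_path:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

# alphas: breakpoints of the path; coefs[:, j]: solution at alphas[j]
alphas, active, coefs = lars_path(A, b, method="lasso")
plt.plot(alphas, coefs.T)
plt.xlabel("lambda")
plt.ylabel("coefficients")
plt.show()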
Pros/Cons of LARS

Pros:
• Gives the full path of solutions
• Fast when the support is small and one can compute the Gram matrix

Cons:
• Scales with the size of the support
• Hard to make numerically stable
• The path can have many, many breakpoints [Mairal et al. 2012]
demo_lasso_lars.ipynb
Coordinate descent (CD)

Limitation of proximal gradient descent:

$$x_{k+1} = \operatorname{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\!\left(x_k - \frac{1}{L}\nabla f(x_k)\right)$$

if L is big we make tiny steps!

The idea of coordinate descent (CD) is to update one coefficient at a time (also known as univariate relaxation in optimization, or the Gauss-Seidel method). Hope: make bigger steps.

Spoiler: it is the state of the art for such machine learning problems (cf. the GLMNET R package, scikit-learn) [Friedman et al. 2009]
Coordinate descent (CD)

[Figure: contour plot with starting point $x_0$ and the line $x_1 = x_2$]

Warning: It does not always work!
Algorithm 4: Coordinate descent (CD)

Since the regularization $\|x\|_1 = \sum_{i=1}^p |x_i|$ is a separable function, CD works for the Lasso [Tseng 2001].

The proximal coordinate descent algorithm works:

for k = 1 … K:
    i = (k mod p) + 1
    $x_i^{k+1} = \operatorname{prox}_{\frac{\lambda}{L_i}|\cdot|}\!\left(x_i^k - \frac{1}{L_i}\big(\nabla f(x_k)\big)_i\right)$

Since $L_i \ll L$, we make bigger steps!
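A minimal, illustrative implementation of this proximal coordinate descent for the Lasso (names and defaults are mine; it already uses the lazy residual update mentioned on the next slide):

import numpy as np

def lasso_cd(A, b, lam, n_iter=100):
    """Cyclic proximal coordinate descent for the Lasso, minimal sketch."""
    n, p = A.shape
    x = np.zeros(p)
    r = b - A @ x                      # residual, kept up to date lazily
    L = (A ** 2).sum(axis=0)           # per-coordinate constants L_i = ||a_i||^2
    for _ in range(n_iter):            # one pass = p coordinate updates
        for i in range(p):
            if L[i] == 0.0:
                continue
            x_old = x[i]
            u = x_old + A[:, i] @ r / L[i]                      # gradient step on coordinate i
            x[i] = np.sign(u) * max(abs(u) - lam / L[i], 0.0)   # soft-thresholding
            if x[i] != x_old:
                r -= (x[i] - x_old) * A[:, i]                   # lazy residual update
    return x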
Algorithm 4: Coordinate descent (CD)

• There exist many "tricks" to make CD fast for the Lasso:
  • Lazy updates of the residuals
  • Pre-computation of certain dot products
  • Active set methods
  • Screening rules
• More in the next talk…
Conclusion
• What is the Lasso
• Lasso with an orthogonal design
• From projected gradient to proximal gradient
• Optimality conditions and subgradients (LARS algo.)
• Coordinate descent algorithm
Contact http://alexandre.gramfort.net GitHub : @agramfort Twitter : @agramfort