Bilevel Learning of the Group Lasso Structure




  1. Bilevel Learning of the Group Lasso Structure
     Jordan Frecon¹, Saverio Salzo¹, Massimiliano Pontil¹,²
     ¹ CSML, Istituto Italiano di Tecnologia
     ² Dept. of Computer Science, University College London
     Thirty-second Conference on Neural Information Processing Systems (NIPS 2018), Montreal, Canada

  2. Linear Regression and Group Sparsity
     Problem: predict $y \in \mathbb{R}^N$ from $X \in \mathbb{R}^{N \times P}$.
     Linear regression: find $w \in \mathbb{R}^P$ such that $y \approx Xw$.
     In many applications, only a few groups of features are relevant for predicting $y$ ⇒ group-sparse $w$.
     Examples:
     - Predict a psychiatric disorder from activity in regions of the brain.
     - Predict protein functions from their molecular composition.

  3. Group Lasso
     Given $\lambda > 0$ and a group structure $\{\mathcal{G}_1, \ldots, \mathcal{G}_L\}$, find
     $$\hat{w} \in \operatorname*{argmin}_{w \in \mathbb{R}^P} \; \frac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|w_{\mathcal{G}_l}\|_2$$
     [Figure: group-sparse solution $\hat{w}$ over $P = 50$ features, with groups $\mathcal{G}_1, \ldots, \mathcal{G}_5$ marked.]
     Limitation: the group structure $\{\mathcal{G}_1, \ldots, \mathcal{G}_L\}$ may be unknown.
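The Group Lasso problem on slide 3 can be solved, for a known group structure, by proximal gradient descent. The sketch below is a minimal illustration and not the method of the talk: it assumes the groups are disjoint index lists covering all features, and that the step size is at most $1/\|X\|^2$. The function names are hypothetical.

```python
import math

def block_soft_threshold(v, tau):
    """Prox of tau * ||.||_2: shrink the whole block v toward zero."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tau:
        return [0.0] * len(v)
    scale = 1.0 - tau / norm
    return [scale * x for x in v]

def group_lasso_ista(X, y, groups, lam, step, n_iter=500):
    """Proximal gradient for 0.5*||y - Xw||^2 + lam * sum_l ||w_{G_l}||_2.

    Assumptions (not from the slides): `groups` is a list of disjoint
    index lists covering 0..P-1, and step <= 1 / ||X||^2.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        # Residual Xw - y and gradient X^T (Xw - y) of the smooth term.
        r = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * r[i] for i in range(n)) for j in range(p)]
        z = [w[j] - step * grad[j] for j in range(p)]
        # The penalty is separable across disjoint groups, so the prox
        # is applied block by block.
        for g in groups:
            block = block_soft_threshold([z[j] for j in g], step * lam)
            for j, val in zip(g, block):
                w[j] = val
    return w
```

With an identity design matrix the solution reduces to one block soft-thresholding of `y`, which makes the group-sparsity effect easy to see: a group whose norm is below `lam` is set to zero as a whole.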

  4. Setting
     Setting: $T$ Group Lasso problems with a shared group structure,
     $$(\forall t \in \{1, \ldots, T\}) \quad \hat{w}_t(\theta) \in \operatorname*{argmin}_{w_t \in \mathbb{R}^P} \; \frac{1}{2}\|y_t - X_t w_t\|^2 + \lambda \sum_{l=1}^{L} \|\theta_l \odot w_t\|_2$$
     where $\theta = [\theta_1 \cdots \theta_L]$ encodes the groups.
     [Figure: the $T$ group-sparse vectors $w_t$ (50 features × 10 tasks) next to the group structure $\theta$ (50 features × 5 groups).]
     Goal: estimation of the optimal group structure $\theta^*$.
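To see how $\theta$ can encode a group structure, the sketch below assumes that each $\theta_l$ is the 0/1 indicator vector of a group $\mathcal{G}_l$ (the slides only say that $\theta$ "encodes groups", so this is an illustrative assumption). Under that assumption $\|\theta_l \odot w\|_2 = \|w_{\mathcal{G}_l}\|_2$, and the penalty of slide 4 coincides with the one of slide 3. All function names are hypothetical.

```python
import math

def groups_to_theta(groups, p):
    """Return one 0/1 indicator vector of length p per group (assumed encoding)."""
    theta = []
    for g in groups:
        members = set(g)
        theta.append([1.0 if j in members else 0.0 for j in range(p)])
    return theta

def penalty_groups(w, groups):
    """sum_l ||w_{G_l}||_2, the penalty written with index sets."""
    return sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)

def penalty_theta(w, theta):
    """sum_l ||theta_l ⊙ w||_2, the penalty written with the Hadamard product."""
    return sum(
        math.sqrt(sum((t_j * w_j) ** 2 for t_j, w_j in zip(t, w)))
        for t in theta
    )

w = [3.0, 4.0, 0.0, 1.0]
groups = [[0, 1], [2, 3]]
theta = groups_to_theta(groups, p=len(w))
print(penalty_groups(w, groups), penalty_theta(w, theta))  # both equal 6.0
```

Writing the penalty with $\theta$ rather than index sets turns the choice of groups into a choice of a numerical array, which is what allows it to be optimized.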

  5. A Bilevel Programming Approach
     Upper-level problem:
     $$\underset{[\theta_1 \cdots \theta_L] \in \Theta}{\text{minimize}} \;\; \mathcal{U}(\theta) := \sum_{t=1}^{T} E_t(\hat{w}_t(\theta)) \quad \text{(e.g., validation error)}$$
     where $\hat{w}(\theta) = [\hat{w}_1(\theta) \cdots \hat{w}_T(\theta)]$ solves the
     Lower-level problem ($T$ Group Lasso problems):
     $$\underset{w \in \mathbb{R}^{P \times T}}{\text{minimize}} \;\; \mathcal{L}(w, \theta) := \sum_{t=1}^{T} \left( \frac{1}{2}\|y_t - X_t w_t\|^2 + \lambda \sum_{l=1}^{L} \|\theta_l \odot w_t\|_2 \right)$$
     Difficulties:
     - $\hat{w}(\theta)$ is not available in closed form.
     - $\theta \mapsto \hat{w}(\theta)$ is nonsmooth [⇒ $\mathcal{U}$ is nonsmooth].
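The nested structure of slide 5 can be sketched as a function that evaluates the upper-level objective by calling a lower-level solver once per task. This is only an illustration: the choice of $E_t$ as a half squared error on held-out data, the dictionary layout of `tasks`, and the `solve_lower` callback are all assumptions, not details given in the slides.

```python
def validation_error(w, X_val, y_val):
    """Assumed E_t: half squared prediction error on held-out data."""
    err = 0.0
    for x_i, y_i in zip(X_val, y_val):
        pred = sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))
        err += 0.5 * (y_i - pred) ** 2
    return err

def upper_objective(theta, tasks, solve_lower):
    """U(theta) = sum_t E_t(w_hat_t(theta)).

    tasks: list of dicts with keys "X_tr", "y_tr", "X_val", "y_val"
           (hypothetical layout).
    solve_lower: callable (X_tr, y_tr, theta) -> w_hat_t, supplied by the
                 caller; it stands in for the lower-level Group Lasso solver.
    """
    total = 0.0
    for task in tasks:
        w_hat = solve_lower(task["X_tr"], task["y_tr"], theta)
        total += validation_error(w_hat, task["X_val"], task["y_val"])
    return total
```

Every evaluation of the upper-level objective hides $T$ inner optimization problems, which is why the lack of a closed form for $\hat{w}(\theta)$ matters.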

  6. Approximate Bilevel Problem
     Upper-level problem:
     $$\underset{[\theta_1 \cdots \theta_L] \in \Theta}{\text{minimize}} \;\; \mathcal{U}_K(\theta) := \sum_{t=1}^{T} E_t(w_t^{(K)}(\theta))$$
     where $w_t^{(K)}(\theta) \to \hat{w}_t(\theta)$.
     Dual algorithm:
     - $u^{(0)}(\theta)$ chosen arbitrarily
     - for $k = 0, 1, \ldots, K-1$: $u^{(k+1)}(\theta) = A(u^{(k)}(\theta), \theta)$ (dual update)
     - $[w_1^{(K)}(\theta) \cdots w_T^{(K)}(\theta)] = B(u^{(K)}(\theta), \theta)$ (primal-dual relationship)
     Goals:
     - Find $A$ and $B$ smooth [⇒ $w^{(K)}$ is smooth ⇒ $\mathcal{U}_K$ is smooth].
     - Prove that the approximate bilevel scheme converges.
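The scheme on slide 6 amounts to unrolling $K$ dual updates and then mapping the last dual iterate to a primal one. The sketch below shows only that control flow. The maps `A_toy` and `B_toy` are hypothetical stand-ins chosen so that the limit is known in closed form; they are not the dual forward-backward updates of the talk.

```python
def unrolled_primal(A, B, u0, theta, K):
    """Run K dual updates u <- A(u, theta), then return w^(K) = B(u, theta)."""
    u = u0
    for _ in range(K):
        u = A(u, theta)
    return B(u, theta)

def A_toy(u, theta):
    """Toy smooth dual update: a contraction whose fixed point is u = theta."""
    return [0.5 * u_i + 0.5 * t_i for u_i, t_i in zip(u, theta)]

def B_toy(u, theta):
    """Toy smooth primal-dual map: w = 2u."""
    return [2.0 * u_i for u_i in u]

theta = [1.0, -2.0]
u0 = [0.0, 0.0]
for K in (0, 1, 5, 50):
    # The iterates approach the limit 2 * theta = [2.0, -4.0] as K grows.
    print(K, unrolled_primal(A_toy, B_toy, u0, theta, K))
```

Because `A` and `B` are applied a fixed number of times, $w^{(K)}$ is a finite composition of smooth maps in $\theta$, which is the reason smoothness of $A$ and $B$ carries over to $\mathcal{U}_K$.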

  7. Contributions
     - Bilevel framework for estimating the Group Lasso structure.
     - Design of a dual forward-backward algorithm with Bregman distances such that $A$ and $B$ are smooth ⇒ $\mathcal{U}_K$ is smooth, with
       1. $\min \mathcal{U}_K \to \min \mathcal{U}$
       2. $\operatorname{argmin} \mathcal{U}_K \to \operatorname{argmin} \mathcal{U}$
     - Implementation of the proxSAGA algorithm, a nonconvex stochastic variant of
       $$\theta^{(q+1)} = P_\Theta\big(\theta^{(q)} - \gamma \nabla \mathcal{U}_K(\theta^{(q)})\big)$$
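The update $\theta^{(q+1)} = P_\Theta(\theta^{(q)} - \gamma \nabla \mathcal{U}_K(\theta^{(q)}))$ on slide 7 is a projected gradient step. The sketch below shows the deterministic version only, not proxSAGA. The slides do not define $\Theta$, so the box $[0, 1]^n$ is assumed here purely for illustration, and the quadratic objective is a toy stand-in for $\mathcal{U}_K$.

```python
def project_box(theta, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n (assumed constraint set)."""
    return [min(max(t, lo), hi) for t in theta]

def projected_gradient(grad, theta0, gamma, n_iter=100):
    """Iterate theta <- P(theta - gamma * grad(theta))."""
    theta = list(theta0)
    for _ in range(n_iter):
        g = grad(theta)
        theta = project_box([t - gamma * g_i for t, g_i in zip(theta, g)])
    return theta

# Toy objective 0.5 * ||theta - c||^2, whose gradient is theta - c.
c = [1.5, -0.3, 0.4]

def toy_grad(theta):
    return [t - c_i for t, c_i in zip(theta, c)]

theta_star = projected_gradient(toy_grad, [0.5, 0.5, 0.5], gamma=0.5)
print(theta_star)  # close to the projection of c onto the box
```

For this toy objective the minimizer over the box is the projection of `c`, so the first two coordinates end up on the boundary while the third stays in the interior.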

  8. Numerical Experiment
     Setting: $T = 500$ tasks, $N = 25$ noisy observations, $P = 50$ features.
     Estimate and group the features into at most $L = 10$ groups.

  9. Conclusion
     Thank you. Our poster, AB #92, will be presented in Room 210 & 230 at 5pm.
