Harnessing Structure in Optimization for Machine Learning




1. Harnessing Structure in Optimization for Machine Learning. Franck Iutzeler, LJK, Univ. Grenoble Alpes. Optimization for Machine Learning, CIRM, 9-13 March 2020.

2. >>> Regularization in Learning: Structure Regularization

Linear inverse problems: for a chosen regularization r, we seek
    x⋆ ∈ arg min_x r(x) such that Ax = b,
with, e.g., r = ‖·‖₁ for sparsity, r = ‖·‖∞ for anti-sparsity, r = ‖·‖∗ for low rank.

Regularized Empirical Risk Minimization problem:
    Find x⋆ ∈ arg min_{x ∈ ℝⁿ} R(x; {a_i, b_i}_{i=1..m}) + λ r(x),
where R comes from the chosen statistical modeling and r is the chosen regularization.

e.g. Lasso: Find x⋆ ∈ arg min_{x ∈ ℝⁿ} (1/2) Σ_{i=1..m} (a_i⊤ x − b_i)² + λ ‖x‖₁.

Regularization can improve statistical properties (generalization, stability, ...).

⋄ Tibshirani: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996)
⋄ Tibshirani et al.: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (2004)
⋄ Vaiter, Peyré, Fadili: Model consistency of partly smooth regularizers. IEEE Trans. on Information Theory (2017)
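For concreteness, here is a minimal Python sketch (mine, not part of the slides) of the regularized ERM objective in the Lasso case; the data A, b and the weight lam are placeholder assumptions.

import numpy as np

def lasso_objective(x, A, b, lam):
    """Empirical risk 0.5 * ||A x - b||^2 plus the regularizer lam * ||x||_1."""
    residual = A @ x - b
    return 0.5 * residual @ residual + lam * np.abs(x).sum()

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
print(lasso_objective(np.zeros(20), A, b, lam=0.1))   # objective value of the all-zero model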

3. >>> Optimization for Machine Learning: Composite minimization

Find x⋆ ∈ arg min_{x ∈ ℝⁿ} R(x; {a_i, b_i}_{i=1..m}) + λ r(x)
i.e. Find x⋆ ∈ arg min_{x ∈ ℝⁿ} f(x) + g(x), with f smooth and g non-smooth.

> f: differentiable surrogate of the empirical risk ⇒ Gradient.
  A non-linear smooth function that depends on all the data.
> g: non-smooth but chosen regularization ⇒ Proximity operator
      prox_{γ g}(u) = arg min_{y ∈ ℝⁿ} { g(y) + (1/(2γ)) ‖y − u‖² }.
  Non-differentiability on some manifolds implies structure on the solutions.
  Closed form/easy for many regularizations:
  – g(x) = ‖x‖₁
  – g(x) = TV(x)
  – g(x) = indicator_C(x)

Natural optimization method: the proximal gradient
    u_{k+1} = x_k − γ ∇f(x_k)
    x_{k+1} = prox_{γ g}(u_{k+1})
and its stochastic variants: proximal SGD, etc.
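A minimal sketch of the proximal gradient iteration above, with the ℓ1 prox (soft-thresholding) as an example of a closed-form proximity operator; this is my own illustration, and the names grad_f and prox_g are placeholders.

import numpy as np

def prox_l1(u, t):
    """Proximity operator of t * ||.||_1: coordinatewise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(grad_f, prox_g, x0, gamma, n_iters=500):
    """u_{k+1} = x_k - gamma * grad_f(x_k);  x_{k+1} = prox_g(u_{k+1}, gamma)."""
    x = x0.copy()
    for _ in range(n_iters):
        u = x - gamma * grad_f(x)   # gradient step on the smooth part f
        x = prox_g(u, gamma)        # prox step on the non-smooth part g
    return x

# Usage pattern for the Lasso (A, b, lam, gamma, n to be defined as in the toy example above):
#   proximal_gradient(lambda x: A.T @ (A @ x - b), lambda u, t: prox_l1(u, t * lam), np.zeros(n), gamma)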

4. >>> Structure, Non-differentiability, and the Proximity operator

Example: Lasso
    Find x⋆ ∈ arg min_{x ∈ ℝⁿ} R(x; {a_i, b_i}_{i=1..m}) + λ r(x)
    Find x⋆ ∈ arg min_{x ∈ ℝⁿ} (1/2) ‖Ax − b‖² + λ ‖x‖₁   (smooth + non-smooth)

Structure ↔ Optimality conditions. Per coordinate:
    x⋆_i = 0  ⇔  A_i⊤ (Ax⋆ − b) ∈ [−λ, λ]   for all i.

Proximity operator: per coordinate (soft-thresholding),
    [prox_{γλ‖·‖₁}(u)]_i = u_i − λγ if u_i > λγ;  0 if u_i ∈ [−λγ, λγ];  u_i + λγ if u_i < −λγ.
[Figure: the soft-thresholding curve versus |·|; the interval [−1, 1] is mapped to {0} per coordinate.]

Proximal gradient (aka ISTA):
    u_{k+1} = x_k − γ A⊤ (Ax_k − b)
    x_{k+1} = prox_{γλ‖·‖₁}(u_{k+1})
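A runnable sketch of ISTA as written above, assuming toy data A, b and placeholder hyperparameters; the step size γ = 1/‖A‖² used here is a standard valid choice for this smooth term.

import numpy as np

def soft_threshold(u, t):
    """[prox_{t ||.||_1}(u)]_i = sign(u_i) * max(|u_i| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def ista(A, b, lam, n_iters=1000):
    """Proximal gradient for 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L with L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        u = x - gamma * A.T @ (A @ x - b)        # gradient step
        x = soft_threshold(u, gamma * lam)       # prox step
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
x_true = np.zeros(20); x_true[:3] = [1.5, -2.0, 0.7]
b = A @ x_true + 0.01 * rng.standard_normal(50)
print(np.nonzero(ista(A, b, lam=0.5))[0])        # support of the computed solution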

5. >>> Structure, Non-differentiability, and the Proximity operator

Example: Lasso
    Find x⋆ ∈ arg min_{x ∈ ℝⁿ} R(x; {a_i, b_i}_{i=1..m}) + λ r(x)
    Find x⋆ ∈ arg min_{x ∈ ℝⁿ} (1/2) ‖Ax − b‖² + λ ‖x‖₁   (smooth + non-smooth)

Structure ↔ Optimality conditions ↔ Proximity operation. Per coordinate:
    x⋆_i = 0  ⇔  A_i⊤ (Ax⋆ − b) ∈ [−λ, λ]  ⇔  [prox_{γλ‖·‖₁}(u⋆)]_i = 0,   where u⋆ = x⋆ − γ A⊤ (Ax⋆ − b).

Soft-thresholding, per coordinate:
    [prox_{γλ‖·‖₁}(u)]_i = u_i − λγ if u_i > λγ;  0 if u_i ∈ [−λγ, λγ];  u_i + λγ if u_i < −λγ.

Proximal gradient (aka ISTA):
    u_{k+1} = x_k − γ A⊤ (Ax_k − b)
    x_{k+1} = prox_{γλ‖·‖₁}(u_{k+1})

[Figure: proximal gradient iterates converging to x⋆.]
The iterates (x_k) reach the same structure as x⋆ in finite time!
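To illustrate the finite-time identification claim, here is a small sketch of mine (not from the slides) that runs ISTA on placeholder data and reports the iteration after which the support of x_k stops changing.

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 30))
x_true = np.zeros(30); x_true[[2, 7, 11]] = [1.0, -1.5, 2.0]
b = A @ x_true + 0.01 * rng.standard_normal(60)

lam, gamma = 0.5, 1.0 / np.linalg.norm(A, 2) ** 2
x, prev_support, last_change = np.zeros(30), None, 0
for k in range(2000):
    x = soft_threshold(x - gamma * A.T @ (A @ x - b), gamma * lam)
    support = tuple(np.nonzero(x)[0])
    if support != prev_support:             # record when the sparsity pattern changes
        last_change, prev_support = k, support
print(f"support {prev_support} unchanged after iteration {last_change}")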

6. >>> Mathematical properties of Proximal Algorithms

Proximal algorithms:
    u_{k+1} = x_k − γ ∇f(x_k)
    x_{k+1} = prox_{γ g}(u_{k+1})
[Figure: proximal gradient iterates converging to x⋆.]

> They project on manifolds.
Let M be a manifold and u_k be such that
    x_k = prox_{γ g}(u_k) ∈ M   and   (u_k − x_k)/γ ∈ ri ∂g(x_k).
If g is partly smooth at x_k relative to M, then prox_{γ g}(u) ∈ M for any u close to u_k.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Daniilidis, Hare, Malick: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization (2006)
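A worked instance of this property (my own illustration, not on the slide): take g = λ‖·‖₁ and M = {x ∈ ℝⁿ : x_i = 0}. If x_k = prox_{γ g}(u_k) has (x_k)_i = 0, the i-th component of ∂g(x_k) is [−λ, λ], whose relative interior is (−λ, λ), so the condition (u_k − x_k)/γ ∈ ri ∂g(x_k) requires in particular |(u_k)_i| < γλ. Since soft-thresholding sets [prox_{γλ‖·‖₁}(u)]_i = 0 whenever |u_i| ≤ γλ, every u with |u_i − (u_k)_i| ≤ γλ − |(u_k)_i|, hence every u close enough to u_k, satisfies prox_{γ g}(u) ∈ M, which is exactly the stated local projection onto M.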

7. >>> Mathematical properties of Proximal Algorithms

Proximal algorithms:
    u_{k+1} = x_k − γ ∇f(x_k)
    x_{k+1} = prox_{γ g}(u_{k+1})
[Figure: proximal gradient iterates, soft-thresholding, converging to x⋆ (with the corresponding u⋆).]

> They project on manifolds.
> They identify the optimal structure.
Let (x_k) and (u_k) be a pair of sequences such that x_k = prox_{γ g}(u_k) → x⋆ = prox_{γ g}(u⋆), and let M be a manifold. If x⋆ ∈ M and
    ∃ ε > 0 such that prox_{γ g}(u) ∈ M for all u ∈ B(u⋆, ε)    (QC)
holds, then, after some finite but unknown time, x_k ∈ M.

⋄ Lewis: Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization (2002)
⋄ Fadili, Malick, Peyré: Sensitivity analysis for mirror-stratifiable convex functions. SIAM Journal on Optimization (2018)
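A quick numerical check of condition (QC) for g = λ‖·‖₁ (my own sketch; the point u⋆ and the value of γλ are placeholders): when |u⋆_i| is strictly below γλ, every u in a small enough ball around u⋆ is soft-thresholded to a point of M = {x : x_i = 0}.

import numpy as np

def soft_threshold(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

gamma_lam = 1.0                           # value of gamma * lambda (placeholder)
u_star = np.array([0.3, 2.5, -1.7])       # |u_star[0]| < gamma_lam, so [prox(u_star)]_0 = 0
i, eps = 0, gamma_lam - abs(u_star[0])    # radius for which (QC) holds at coordinate i

rng = np.random.default_rng(0)
for _ in range(1000):
    u = u_star + eps * rng.uniform(-1.0, 1.0, size=3)   # sample u in a box around u_star
    assert soft_threshold(u, gamma_lam)[i] == 0.0       # prox(u) stays on M = {x : x_i = 0}
print("(QC) verified empirically at coordinate", i)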

8. >>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...
  Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Lemaréchal, Oustry, Sagastizábal: The U-Lagrangian of a convex function. Transactions of the AMS (2000)
⋄ Bolte, Daniilidis, Lewis: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization (2007)
⋄ Chen, Teboulle: A proximal-based decomposition method for convex minimization problems. Mathematical Programming (1994)

9. >>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...
  Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
> ...but it is often suffered rather than exploited, for lack of structure/explicit expression.
  Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.

⋄ Nesterov: Smooth minimization of non-smooth functions. Mathematical Programming (2005)
⋄ Burke, Lewis, Overton: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization (2005)
⋄ Solodov, Svaiter: A hybrid projection-proximal point algorithm. Journal of Convex Analysis (1999)
⋄ de Oliveira, Sagastizábal: Bundle methods in the XXIst century: A bird’s-eye view. Pesquisa Operacional (2014)

10. >>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...
  Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
> ...but it is often suffered rather than exploited, for lack of structure/explicit expression.
  Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.
> For Machine Learning objectives, it can often be harnessed:
  - explicit/“proximable” regularizations: ℓ1, nuclear norm (see the sketch after this slide);
  - we know the expressions and the activity of the sought structures: sparsity, rank.

See the talks of ...

⋄ Bach et al.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012)
⋄ Massias, Salmon, Gramfort: Celer: a fast solver for the lasso with dual extrapolation. ICML (2018)
⋄ Liang, Fadili, Peyré: Local linear convergence of forward–backward under partial smoothness. NeurIPS (2014)
⋄ O’Donoghue, Candès: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics (2015)
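Since the ℓ1 prox has already appeared, here is a minimal sketch (mine, not from the slides) of the other “proximable” regularizer named above, the nuclear norm: its prox soft-thresholds the singular values, which promotes low rank just as the ℓ1 prox promotes sparsity.

import numpy as np

def prox_nuclear(U, t):
    """Prox of t * ||.||_* : soft-threshold the singular values of U."""
    P, s, Qt = np.linalg.svd(U, full_matrices=False)
    return P @ np.diag(np.maximum(s - t, 0.0)) @ Qt

rng = np.random.default_rng(0)
M = np.outer([1.0, -2.0, 0.5], [3.0, 1.0]) + 0.05 * rng.standard_normal((3, 2))
X = prox_nuclear(M, t=0.5)
print(np.linalg.matrix_rank(X))   # typically 1: the small noise singular value is zeroed out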

11. >>> Noticeable Structure

Find x⋆ ∈ arg min_{x ∈ ℝⁿ} R(x; {a_i, b_i}_{i=1..m}) + λ r(x)
Find x⋆ ∈ arg min_{x ∈ ℝⁿ} f(x) + g(x), with f smooth and g non-smooth.

A reason why the nonsmoothness of ML problems can be leveraged is their noticeable structure, that is: we can design a lookout collection C = {M_1, .., M_p} of closed sets such that
(i) we have a projection mapping proj_{M_i} onto M_i for all i;
(ii) prox_{γ g}(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = prox_{γ g}(u), we know whether x ∈ M_i or not, for all i.
⇒ Identification can be directly harnessed.

Example: sparse structure and g = ‖·‖₁, ‖·‖_{0.5}^{0.5}, ‖·‖₀, ..., with C = {M_1, ..., M_n} and M_i = {x ∈ ℝⁿ : x_i = 0}.
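A minimal sketch of such a lookout collection for the sparse example above (my own illustration, for g = ‖·‖₁): each M_i = {x : x_i = 0} has a trivial projection (i), the prox is explicit (ii), and membership in each M_i is read directly off the computed prox (iii).

import numpy as np

def prox_l1(u, t):                        # (ii) explicit prox of t * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proj_Mi(x, i):                        # (i) projection onto M_i = {x : x_i = 0}
    y = x.copy()
    y[i] = 0.0
    return y

def active_sets(x):                       # (iii) indices i with x in M_i
    return [i for i in range(x.size) if x[i] == 0.0]

u = np.array([0.2, -3.0, 0.7, 1.5])
x = prox_l1(u, t=1.0)
print(x, active_sets(x))                  # coordinates 0 and 2 are identified as zero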
