Self-concordant analysis of Frank-Wolfe algorithms
Pavel Dvurechensky (WIAS), Shimrit Shtern (The Technion), Mathias Staudigl (Maastricht University), Petr Ostroukhov (Moscow Institute of Physics and Technology), Kamil Safin (Moscow Institute of Physics and Technology)
ICML 2020, July 12-18
Self-concordant minimization

We consider the optimization problem

    min_{x ∈ X} f(x)        (P)

where X ⊂ R^n is convex and compact, and f : R^n → (−∞, ∞] is convex and three times continuously differentiable on the open set dom f = {x : f(x) < ∞}.

Given the large-scale nature of optimization problems in machine learning, first-order methods are the methods of choice.
Frank-Wolfe methods

Because of their great scalability and sparsity properties, Frank-Wolfe (FW) methods (Frank & Wolfe, 1956) have received a lot of attention in ML.

Convergence guarantees require Lipschitz continuous gradients, or finite curvature constants of f (Jaggi, 2013).

Even for well-conditioned (Lipschitz-smooth and strongly convex) problems, only sublinear convergence rates are guaranteed in general.

[Figure: one FW iteration, moving from x^(t) toward the vertex s_t to obtain x^(t+1); x^(0) is the starting point and x* the optimum.]
Many canonical ML problems do not have Lipschitz gradients

Portfolio Optimization:
    f(x) = −∑_{t=1}^T ln(⟨r_t, x⟩),   x ∈ X = {x ∈ R^n_+ : ∑_{i=1}^n x_i = 1}.

Covariance Estimation:
    f(X) = −ln(det(X)) + tr(Σ̂ X),   X ∈ X = {X ∈ R^{n×n}_{sym,+} : ‖Vec(X)‖_1 ≤ R}.

Poisson Inverse Problem:
    f(x) = ∑_{i=1}^m ⟨w_i, x⟩ − ∑_{i=1}^m y_i ln(⟨w_i, x⟩),   x ∈ X = {x ∈ R^n : ‖x‖_1 ≤ R}.
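To see why such objectives escape the standard FW analysis, here is a minimal numerical sketch (not from the slides) for the portfolio objective with a synthetic return matrix; the data and the vertex-approaching path are hypothetical, chosen only to show that the gradient norm is unbounded on the simplex, so no global Lipschitz constant exists.

```python
import numpy as np

# Hypothetical returns; one vanishing entry makes <r_0, x> -> 0 as x -> e_1.
rng = np.random.default_rng(0)
T, n = 50, 5
R = rng.uniform(0.5, 1.5, size=(T, n))
R[0, 0] = 0.0

def f(x):
    # Portfolio objective f(x) = -sum_t ln(<r_t, x>)
    return -np.sum(np.log(R @ x))

def grad_f(x):
    # grad f(x) = -sum_t r_t / <r_t, x>
    return -(R.T @ (1.0 / (R @ x)))

for eps in [1e-1, 1e-3, 1e-6]:
    x = np.full(n, eps / (n - 1))
    x[0] = 1.0 - eps        # feasible point on the simplex, close to the vertex e_1
    print(f"eps={eps:.0e}  f(x)={f(x):.3f}  ||grad f(x)||={np.linalg.norm(grad_f(x)):.3e}")
```

The objective stays finite along this path while the gradient norm blows up, which is exactly the behavior the standard curvature-based FW analysis cannot handle.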
Main Results

All of these functions are self-concordant (SC) and do not have Lipschitz continuous gradients. Standard analysis does not apply.

Result 1: We give a unified analysis of provably convergent FW algorithms for minimizing SC functions.

Result 2: Based on the theory of Local Linear Optimization Oracles (LLOO) (Lan, 2013; Garber & Hazan, 2016), we construct linearly convergent variants of our base algorithms.
Vanilla FW

The analysis of FW involves
(a) a search direction s(x) = argmin_{s ∈ X} ⟨∇f(x), s⟩;
(b) as merit function the gap function gap(x) = ⟨∇f(x), x − s(x)⟩.

Standard Frank-Wolfe method: if gap(x_k) > ε then
1. Obtain s_k = s(x_k);
2. Set x_{k+1} = x_k + α_k(s_k − x_k) for some α_k ∈ [0, 1].
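For concreteness, a minimal sketch (not from the paper) of this loop for the unit simplex, where the linear minimization step reduces to picking the best vertex; the default step size 2/(k+2) and the quadratic test problem are illustrative choices only.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, eps=1e-6, max_iter=1000, step=None):
    """Standard FW over the unit simplex {x >= 0, sum x = 1}.

    grad: callable returning the gradient of f at x.
    step: callable (k, x, s) -> alpha_k in [0, 1]; defaults to 2/(k+2).
    """
    x = x0.copy()
    gap = np.inf
    for k in range(max_iter):
        g = grad(x)
        # Linear minimization over the simplex: the best vertex e_i
        i = np.argmin(g)
        s = np.zeros_like(x)
        s[i] = 1.0
        gap = g @ (x - s)                  # gap(x) = <grad f(x), x - s(x)>
        if gap <= eps:
            break
        alpha = 2.0 / (k + 2) if step is None else step(k, x, s)
        x = x + alpha * (s - x)
    return x, gap

# Illustrative use on a smooth quadratic (hypothetical example):
# minimize ||x - c||^2 / 2 over the simplex.
c = np.array([0.2, 0.5, 0.3])
x_star, final_gap = frank_wolfe_simplex(lambda x: x - c, np.array([1.0, 0.0, 0.0]))
```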
SC optimization: Definition of SC functions

Let f : R^n → (−∞, +∞] be a C^3(dom f) convex function, where dom f is an open set in R^n. f is SC if

    |φ'''(t)| ≤ M·φ''(t)^{3/2}

for φ(t) = f(x + tv), for all x ∈ dom f and v ∈ R^n such that x + tv ∈ dom f.
SC optimization: Self-concordant functions

Self-concordant (SC) functions have been developed within the field of interior-point methods (Nesterov & Nemirovski, 1994).

Starting with Bach (2010), they gained a lot of interest in machine learning and statistics (see e.g. Tran-Dinh, Kyrillidis & Cevher; Sun & Tran-Dinh, 2018; Ostrovskii & Bach, 2018).

MATLAB toolbox SCOPT.
Basic estimates of SC functions

For all x, x̃ ∈ dom f we have the following bounds on function values:

    f(x̃) ≥ f(x) + ⟨∇f(x), x̃ − x⟩ + (4/M²)·ω(d(x, x̃))
    f(x̃) ≤ f(x) + ⟨∇f(x), x̃ − x⟩ + (4/M²)·ω*(d(x, x̃))

where ω(t) := t − ln(1 + t), ω*(t) := −t − ln(1 − t), and

    d(x, y) := (M/2)·‖y − x‖_x = (M/2)·( D²f(x)[y − x, y − x] )^{1/2}.
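As a sanity check, the two bounds can be verified numerically on the one-dimensional SC function f(x) = −ln(x), for which M = 2 and d(x, x̃) = |x̃ − x|/x; this small script is only an illustration of the inequalities above.

```python
import numpy as np

def omega(t):       # omega(t) = t - ln(1 + t)
    return t - np.log1p(t)

def omega_star(t):  # omega*(t) = -t - ln(1 - t), finite only for t < 1
    return -t - np.log1p(-t)

# f(x) = -ln(x) is self-concordant with M = 2; here d(x, y) = |y - x| / x.
f = lambda x: -np.log(x)
df = lambda x: -1.0 / x
M = 2.0

x = 1.0
for x_new in [0.5, 0.9, 1.5, 1.9]:
    d = abs(x_new - x) / x
    linear = f(x) + df(x) * (x_new - x)
    lower = linear + (4.0 / M**2) * omega(d)
    upper = linear + (4.0 / M**2) * omega_star(d) if d < 1 else np.inf
    assert lower <= f(x_new) <= upper + 1e-12   # both bounds hold
```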
Algorithm 1

Let x_t⁺ = x + t(s(x) − x), t > 0. We obtain the non-Euclidean descent inequality

    f(x_t⁺) ≤ f(x) + ⟨∇f(x), x_t⁺ − x⟩ + (4/M²)·ω*(t·e(x)) ≤ f(x) − η_x(t)

for t ∈ (0, 1/e(x)), where η_x(t) := t·gap(x) − (4/M²)·ω*(t·e(x)) and e(x) = (M/2)·‖s(x) − x‖_x.

Optimizing the per-iteration decrease with respect to t leads to

    α(x) = min{1, t(x)},   t(x) = gap(x) / ( e(x)·( gap(x) + (4/M²)·e(x) ) ).
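A sketch of this adaptive step-size rule in code, assuming exact access to the gradient and Hessian of f; the portfolio data below is hypothetical and only illustrates how gap(x), e(x), and α(x) are computed.

```python
import numpy as np

def sc_step_size(grad_x, hess_x, x, s, M):
    """Adaptive FW step size for an M-self-concordant objective.

    Sketch of alpha(x) = min{1, t(x)} with
    t(x) = gap(x) / (e(x) * (gap(x) + (4/M^2) * e(x))),
    where e(x) = (M/2) * ||s - x||_x is the local Hessian norm of the FW direction.
    """
    d = s - x
    gap = -grad_x @ d                           # gap(x) = <grad f(x), x - s>
    e = 0.5 * M * np.sqrt(d @ (hess_x @ d))
    t = gap / (e * (gap + (4.0 / M**2) * e))
    return min(1.0, t)

# Illustrative use on the portfolio objective f(x) = -sum_t ln(<r_t, x>),
# which is self-concordant with M = 2 (hypothetical data).
rng = np.random.default_rng(1)
Rmat = rng.uniform(0.5, 1.5, size=(50, 5))
x = np.full(5, 0.2)
z = Rmat @ x
grad_x = -(Rmat.T @ (1.0 / z))                  # -sum_t r_t / <r_t, x>
hess_x = (Rmat / z[:, None] ** 2).T @ Rmat      # sum_t r_t r_t^T / <r_t, x>^2
s = np.eye(5)[np.argmin(grad_x)]                # FW vertex on the simplex
alpha = sc_step_size(grad_x, hess_x, x, s, M=2.0)
```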
Iteration Complexity

Define the approximation error h_k = f(x_k) − f*. Let S(x_0) = {x ∈ X : f(x) ≤ f(x_0)} and L_∇f = max_{x ∈ S(x_0)} λ_max(∇²f(x)).

Theorem. For given ε > 0, define N_ε(x_0) = min{k ≥ 0 : h_k ≤ ε}. Then

    N_ε(x_0) ≤ (1 + ln(2)) · ( b·ln(h_0/a) + L_∇f·diam(X)² ) / (a·ε),

where a = min{ 2(1 − ln(2)) / (M·√(L_∇f)·diam(X)), (1 − ln(2))/2 } and b = L_∇f·diam(X)².
Algorithm 2: Backtracking Variant of FW

Let Q(x_k, t, µ) := f(x_k) − t·gap(x_k) + (t²µ/2)·‖s(x_k) − x_k‖².

On S(x_0) := {x ∈ X : f(x) ≤ f(x_0)} we have f(x_k + t(s_k − x_k)) ≤ Q(x_k, t, L_∇f).

Problem: L_∇f is hard to estimate and numerically large.
Solution: A backtracking procedure allows us to find a local estimate of the unknown L_∇f (see also Pedregosa et al., 2020).
Backtracking procedure to find the local Lipschitz constant

Algorithm 1: Function step(f, v, x, g, L)
  Choose γ_u > 1, γ_d < 1
  Choose µ ∈ [γ_d·L, L]
  α ← min{ g / (µ‖v‖²), 1 }
  while f(x + αv) > Q(x, α, µ) do
    µ ← γ_u·µ
    α ← min{ g / (µ‖v‖²), 1 }
  end while
  Return α, µ

We have, for all t ∈ [0, 1],

    f(x_{k+1}) ≤ f(x_k) − t·gap(x_k) + (t²·L_k/2)·‖s_k − x_k‖²,

where L_k is obtained from Algorithm 1.
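A possible Python rendering of this subroutine; the parameters γ_u = 2, γ_d = 0.9 and the initial choice µ = γ_d·L are illustrative, and the sufficient-decrease test is repeated until the quadratic model Q majorizes f at the trial point, as is standard in backtracking line searches (cf. Pedregosa et al. 2020).

```python
import numpy as np

def backtracking_step(f, v, x, g, L, gamma_u=2.0, gamma_d=0.9):
    """Backtracking search for a local Lipschitz estimate (sketch of Algorithm 1).

    f: objective, v = s_k - x_k, g = gap(x_k), L = previous estimate L_{k-1}.
    Returns the step size alpha and the accepted estimate mu = L_k.
    """
    def Q(alpha, mu):
        # Quadratic model Q(x, alpha, mu) = f(x) - alpha*g + (alpha^2*mu/2)*||v||^2
        return f(x) - alpha * g + 0.5 * alpha**2 * mu * (v @ v)

    mu = gamma_d * L                         # any mu in [gamma_d * L, L] is admissible
    alpha = min(g / (mu * (v @ v)), 1.0)
    while f(x + alpha * v) > Q(alpha, mu):   # increase mu until the model majorizes f
        mu = gamma_u * mu
        alpha = min(g / (mu * (v @ v)), 1.0)
    return alpha, mu
```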
Main Result

Theorem. Let (x_k)_k be the backtracking variant of FW using Algorithm 1 as subroutine. Then

    h_k ≤ 2·gap(x_0) / ((k + 1)(k + 2)) + L̄_k·k·diam(X)² / ((k + 1)(k + 2)),

where L̄_k := (1/k)·∑_{i=0}^{k−1} L_i.
Linearly Convergent FW variant

Definition (Garber & Hazan, 2016). A procedure A(x, r, c), where x ∈ X, r > 0, c ∈ R^n, is an LLOO with parameter ρ ≥ 1 for the polytope X if A(x, r, c) returns a point s ∈ X such that for all y ∈ B_r(x) ∩ X:

    ⟨c, y⟩ ≥ ⟨c, s⟩   and   ‖x − s‖ ≤ ρ·r.

Such oracles exist for any compact polyhedral domain. A particularly simple implementation exists for simplex-like domains.
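To make the definition concrete, a small numerical verifier (an illustration, not an implementation of the oracle itself): it checks the two LLOO conditions for a candidate output s against sampled points of B_r(x) ∩ X, here on the simplex with the trivial answer s = best vertex, which is valid whenever ρ·r ≥ diam(X). The constructions of Garber & Hazan achieve much smaller ρ by exploiting the polytope structure.

```python
import numpy as np

def satisfies_lloo(s, x, r, c, rho, feasible_samples, tol=1e-12):
    """Check the two LLOO conditions for a candidate output s = A(x, r, c),
    against a finite sample of points of X (only those within distance r of x
    are used for the first condition)."""
    in_ball = feasible_samples[np.linalg.norm(feasible_samples - x, axis=1) <= r]
    linear_opt = np.all(in_ball @ c >= c @ s - tol)    # <c, y> >= <c, s> on B_r(x) ∩ X
    locality = np.linalg.norm(x - s) <= rho * r + tol  # ||x - s|| <= rho * r
    return bool(linear_opt and locality)

# Illustrative check on the 3-simplex with the trivial (worst-case) oracle answer.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=1000)               # random points of the simplex
x, c, r = np.full(3, 1/3), np.array([0.3, -0.1, 0.5]), 0.2
s = np.eye(3)[np.argmin(c)]                            # global FW vertex
print(satisfies_lloo(s, x, r, c, rho=np.sqrt(2) / r, feasible_samples=P))
```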
Linear Convergence

Let σ_f = min_{x ∈ S(x_0)} λ_min(∇²f(x)).

Theorem (Simplified version). Given a polytope X with an LLOO A(x, r, c) for each x ∈ X, r ∈ (0, ∞), c ∈ R^n, let

    ᾱ := min{ σ_f / (6·L_∇f·ρ²), 1 } / ( 1 + (M/2)·√(L_∇f)·diam(X) ).

Then h_k ≤ gap(x_0)·exp(−k·ᾱ/2).

In the paper we present a version of this theorem that does not require knowledge of L_∇f.
Numerical Performance

Portfolio Optimization:
    f(x) = −∑_{t=1}^T ln(⟨r_t, x⟩),   X = {x ∈ R^n_+ : ∑_{i=1}^n x_i = 1}.

Poisson Inverse Problem:
    f(x) = ∑_{i=1}^m ⟨w_i, x⟩ − ∑_{i=1}^m y_i ln(⟨w_i, x⟩),   x ∈ X = {x ∈ R^n : ‖x‖_1 ≤ R}.

Figure: Portfolio Optimization (right), Poisson Inverse Problem (left).
Conclusion

We derived various novel FW schemes with provable convergence guarantees for self-concordant minimization.

Future directions of research include:
- Generalized self-concordant minimization (Sun & Tran-Dinh, 2018)
- Stochastic oracles
- Inertial effects in algorithm design (conditional gradient sliding; Lan & Zhou, 2016)