Self-concordant analysis of Frank-Wolfe algorithms
Pavel Dvurechensky (WIAS), Shimrit Shtern (The Technion), Mathias Staudigl (Maastricht University), Petr Ostroukhov (Moscow Institute of Physics and Technology), Kamil Safin (Moscow Institute of Physics and Technology)
ICML 2020, July 12-18
Self-concordant minimization

We consider the optimization problem

    min_{x ∈ X} f(x)        (P)

where X ⊂ R^n is convex and compact, and f : R^n → (−∞, ∞] is convex and three times continuously differentiable on the open set dom f = {x : f(x) < ∞}.

Given the large-scale nature of optimization problems in machine learning, first-order methods are the methods of choice.
Frank-Wolfe methods

Because of their great scalability and sparsity properties, Frank-Wolfe (FW) methods (Frank & Wolfe, 1956) have received a lot of attention in ML.

Convergence guarantees require Lipschitz continuous gradients, or finite curvature constants of f (Jaggi, 2013).

Even for well-conditioned (Lipschitz-smooth and strongly convex) problems, only sublinear convergence rates are guaranteed in general.

[Figure: one FW iteration, moving from x^(t) toward the vertex s_t to obtain x^(t+1); x^(0) is the starting point and x* the optimum.]
Many canonical ML problems do not have Lipschitz gradients

Portfolio Optimization:
    f(x) = −∑_{t=1}^T ln(⟨r_t, x⟩),   x ∈ X = {x ∈ R^n_+ : ∑_{i=1}^n x_i = 1}.

Covariance Estimation:
    f(X) = −ln(det(X)) + tr(Σ̂ X),   X ∈ X = {X ∈ R^{n×n}_{sym,+} : ‖Vec(X)‖_1 ≤ R}.

Poisson Inverse Problem:
    f(x) = ∑_{i=1}^m ⟨w_i, x⟩ − ∑_{i=1}^m y_i ln(⟨w_i, x⟩),   x ∈ X = {x ∈ R^n : ‖x‖_1 ≤ R}.
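To see why such objectives escape the standard FW analysis, here is a minimal numerical sketch (not from the slides) for the portfolio objective with a synthetic return matrix; the data and the vertex-approaching path are hypothetical, chosen only to show that the gradient norm is unbounded on the simplex, so no global Lipschitz constant exists.

```python
import numpy as np

# Hypothetical returns; one vanishing entry makes <r_0, x> -> 0 as x -> e_1.
rng = np.random.default_rng(0)
T, n = 50, 5
R = rng.uniform(0.5, 1.5, size=(T, n))
R[0, 0] = 0.0

def f(x):
    # Portfolio objective f(x) = -sum_t ln(<r_t, x>)
    return -np.sum(np.log(R @ x))

def grad_f(x):
    # grad f(x) = -sum_t r_t / <r_t, x>
    return -(R.T @ (1.0 / (R @ x)))

for eps in [1e-1, 1e-3, 1e-6]:
    x = np.full(n, eps / (n - 1))
    x[0] = 1.0 - eps        # feasible point on the simplex, close to the vertex e_1
    print(f"eps={eps:.0e}  f(x)={f(x):.3f}  ||grad f(x)||={np.linalg.norm(grad_f(x)):.3e}")
```

The objective stays finite along this path while the gradient norm blows up, which is exactly the behavior the standard curvature-based FW analysis cannot handle.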
Main Results

All of these functions are self-concordant (SC) and do not have Lipschitz continuous gradients. Standard analysis does not apply.

Result 1: We give a unified analysis of provably convergent FW algorithms for minimizing SC functions.

Result 2: Based on the theory of Local Linear Optimization Oracles (LLOO) (Lan, 2013; Garber & Hazan, 2016), we construct linearly convergent variants of our base algorithms.
Vanilla FW

The analysis of FW involves
(a) a search direction s(x) = argmin_{s ∈ X} ⟨∇f(x), s⟩;
(b) as merit function the gap function gap(x) = ⟨∇f(x), x − s(x)⟩.

Standard Frank-Wolfe method: if gap(x_k) > ε then
1. Obtain s_k = s(x_k);
2. Set x_{k+1} = x_k + α_k(s_k − x_k) for some α_k ∈ [0, 1].
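For concreteness, a minimal sketch (not from the paper) of this loop for the unit simplex, where the linear minimization step reduces to picking the best vertex; the default step size 2/(k+2) and the quadratic test problem are illustrative choices only.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, eps=1e-6, max_iter=1000, step=None):
    """Standard FW over the unit simplex {x >= 0, sum x = 1}.

    grad: callable returning the gradient of f at x.
    step: callable (k, x, s) -> alpha_k in [0, 1]; defaults to 2/(k+2).
    """
    x = x0.copy()
    gap = np.inf
    for k in range(max_iter):
        g = grad(x)
        # Linear minimization over the simplex: the best vertex e_i
        i = np.argmin(g)
        s = np.zeros_like(x)
        s[i] = 1.0
        gap = g @ (x - s)                  # gap(x) = <grad f(x), x - s(x)>
        if gap <= eps:
            break
        alpha = 2.0 / (k + 2) if step is None else step(k, x, s)
        x = x + alpha * (s - x)
    return x, gap

# Illustrative use on a smooth quadratic (hypothetical example):
# minimize ||x - c||^2 / 2 over the simplex.
c = np.array([0.2, 0.5, 0.3])
x_star, final_gap = frank_wolfe_simplex(lambda x: x - c, np.array([1.0, 0.0, 0.0]))
```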
SC optimization: Definition of SC functions

Let f : R^n → (−∞, +∞] be a C^3(dom f) convex function, where dom f is an open set in R^n. f is SC if

    |φ'''(t)| ≤ M·φ''(t)^{3/2}

for φ(t) = f(x + tv), for all x ∈ dom f and v ∈ R^n such that x + tv ∈ dom f.
SC optimization: Self-concordant functions

Self-concordant (SC) functions have been developed within the field of interior-point methods (Nesterov & Nemirovski, 1994).

Starting with Bach (2010), they gained a lot of interest in machine learning and statistics (see e.g. Tran-Dinh, Kyrillidis & Cevher; Sun & Tran-Dinh, 2018; Ostrovskii & Bach, 2018).

MATLAB toolbox SCOPT.
Basic estimates of SC functions

For all x, x̃ ∈ dom f we have the following bounds on function values:

    f(x̃) ≥ f(x) + ⟨∇f(x), x̃ − x⟩ + (4/M²)·ω(d(x, x̃))
    f(x̃) ≤ f(x) + ⟨∇f(x), x̃ − x⟩ + (4/M²)·ω*(d(x, x̃))

where ω(t) := t − ln(1 + t), ω*(t) := −t − ln(1 − t), and

    d(x, y) := (M/2)·‖y − x‖_x = (M/2)·( D²f(x)[y − x, y − x] )^{1/2}.
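As a sanity check, the two bounds can be verified numerically on the one-dimensional SC function f(x) = −ln(x), for which M = 2 and d(x, x̃) = |x̃ − x|/x; this small script is only an illustration of the inequalities above.

```python
import numpy as np

def omega(t):       # omega(t) = t - ln(1 + t)
    return t - np.log1p(t)

def omega_star(t):  # omega*(t) = -t - ln(1 - t), finite only for t < 1
    return -t - np.log1p(-t)

# f(x) = -ln(x) is self-concordant with M = 2; here d(x, y) = |y - x| / x.
f = lambda x: -np.log(x)
df = lambda x: -1.0 / x
M = 2.0

x = 1.0
for x_new in [0.5, 0.9, 1.5, 1.9]:
    d = abs(x_new - x) / x
    linear = f(x) + df(x) * (x_new - x)
    lower = linear + (4.0 / M**2) * omega(d)
    upper = linear + (4.0 / M**2) * omega_star(d) if d < 1 else np.inf
    assert lower <= f(x_new) <= upper + 1e-12   # both bounds hold
```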
Algorithm 1

Let x_t⁺ = x + t(s(x) − x), t > 0. We obtain the non-Euclidean descent inequality

    f(x_t⁺) ≤ f(x) + ⟨∇f(x), x_t⁺ − x⟩ + (4/M²)·ω*(t·e(x)) ≤ f(x) − η_x(t)

for t ∈ (0, 1/e(x)), where η_x(t) := t·gap(x) − (4/M²)·ω*(t·e(x)) and e(x) = (M/2)·‖s(x) − x‖_x.

Optimizing the per-iteration decrease with respect to t leads to

    α(x) = min{1, t(x)},   t(x) = gap(x) / ( e(x)·( gap(x) + (4/M²)·e(x) ) ).
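A sketch of this adaptive step-size rule in code, assuming exact access to the gradient and Hessian of f; the portfolio data below is hypothetical and only illustrates how gap(x), e(x), and α(x) are computed.

```python
import numpy as np

def sc_step_size(grad_x, hess_x, x, s, M):
    """Adaptive FW step size for an M-self-concordant objective.

    Sketch of alpha(x) = min{1, t(x)} with
    t(x) = gap(x) / (e(x) * (gap(x) + (4/M^2) * e(x))),
    where e(x) = (M/2) * ||s - x||_x is the local Hessian norm of the FW direction.
    """
    d = s - x
    gap = -grad_x @ d                           # gap(x) = <grad f(x), x - s>
    e = 0.5 * M * np.sqrt(d @ (hess_x @ d))
    t = gap / (e * (gap + (4.0 / M**2) * e))
    return min(1.0, t)

# Illustrative use on the portfolio objective f(x) = -sum_t ln(<r_t, x>),
# which is self-concordant with M = 2 (hypothetical data).
rng = np.random.default_rng(1)
Rmat = rng.uniform(0.5, 1.5, size=(50, 5))
x = np.full(5, 0.2)
z = Rmat @ x
grad_x = -(Rmat.T @ (1.0 / z))                  # -sum_t r_t / <r_t, x>
hess_x = (Rmat / z[:, None] ** 2).T @ Rmat      # sum_t r_t r_t^T / <r_t, x>^2
s = np.eye(5)[np.argmin(grad_x)]                # FW vertex on the simplex
alpha = sc_step_size(grad_x, hess_x, x, s, M=2.0)
```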
Iteration Complexity

Define the approximation error h_k = f(x_k) − f*. Let S(x_0) = {x ∈ X : f(x) ≤ f(x_0)} and L_∇f = max_{x ∈ S(x_0)} λ_max(∇²f(x)).

Theorem. For given ε > 0, define N_ε(x_0) = min{k ≥ 0 : h_k ≤ ε}. Then

    N_ε(x_0) ≤ (1 + ln(2)) · ( b·ln(h_0/a) + L_∇f·diam(X)² ) / (a·ε),

where a = min{ 2(1 − ln(2)) / (M·√(L_∇f)·diam(X)), (1 − ln(2))/2 } and b = L_∇f·diam(X)².
Algorithm 2: Backtracking Variant of FW

Let Q(x_k, t, µ) := f(x_k) − t·gap(x_k) + (t²µ/2)·‖s(x_k) − x_k‖².

On S(x_0) := {x ∈ X : f(x) ≤ f(x_0)} we have f(x_k + t(s_k − x_k)) ≤ Q(x_k, t, L_∇f).

Problem: L_∇f is hard to estimate and numerically large.
Solution: A backtracking procedure allows us to find a local estimate of the unknown L_∇f (see also Pedregosa et al., 2020).
Backtracking procedure to find the local Lipschitz constant

Algorithm 1: Function step(f, v, x, g, L)
  Choose γ_u > 1, γ_d < 1
  Choose µ ∈ [γ_d·L, L]
  α ← min{ g / (µ‖v‖²), 1 }
  while f(x + αv) > Q(x, α, µ) do
    µ ← γ_u·µ
    α ← min{ g / (µ‖v‖²), 1 }
  end while
  Return α, µ

We have, for all t ∈ [0, 1],

    f(x_{k+1}) ≤ f(x_k) − t·gap(x_k) + (t²·L_k/2)·‖s_k − x_k‖²,

where L_k is obtained from Algorithm 1.
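A possible Python rendering of this subroutine; the parameters γ_u = 2, γ_d = 0.9 and the initial choice µ = γ_d·L are illustrative, and the sufficient-decrease test is repeated until the quadratic model Q majorizes f at the trial point, as is standard in backtracking line searches (cf. Pedregosa et al. 2020).

```python
import numpy as np

def backtracking_step(f, v, x, g, L, gamma_u=2.0, gamma_d=0.9):
    """Backtracking search for a local Lipschitz estimate (sketch of Algorithm 1).

    f: objective, v = s_k - x_k, g = gap(x_k), L = previous estimate L_{k-1}.
    Returns the step size alpha and the accepted estimate mu = L_k.
    """
    def Q(alpha, mu):
        # Quadratic model Q(x, alpha, mu) = f(x) - alpha*g + (alpha^2*mu/2)*||v||^2
        return f(x) - alpha * g + 0.5 * alpha**2 * mu * (v @ v)

    mu = gamma_d * L                         # any mu in [gamma_d * L, L] is admissible
    alpha = min(g / (mu * (v @ v)), 1.0)
    while f(x + alpha * v) > Q(alpha, mu):   # increase mu until the model majorizes f
        mu = gamma_u * mu
        alpha = min(g / (mu * (v @ v)), 1.0)
    return alpha, mu
```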
Main Result

Theorem. Let (x_k)_k be the backtracking variant of FW using Algorithm 1 as subroutine. Then

    h_k ≤ 2·gap(x_0) / ((k + 1)(k + 2)) + L̄_k·k·diam(X)² / ((k + 1)(k + 2)),

where L̄_k := (1/k)·∑_{i=0}^{k−1} L_i.
Linearly Convergent FW variant

Definition (Garber & Hazan, 2016). A procedure A(x, r, c), where x ∈ X, r > 0, c ∈ R^n, is an LLOO with parameter ρ ≥ 1 for the polytope X if A(x, r, c) returns a point s ∈ X such that for all y ∈ B_r(x) ∩ X:

    ⟨c, y⟩ ≥ ⟨c, s⟩   and   ‖x − s‖ ≤ ρ·r.

Such oracles exist for any compact polyhedral domain. A particularly simple implementation exists for simplex-like domains.
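To make the definition concrete, a small numerical verifier (an illustration, not an implementation of the oracle itself): it checks the two LLOO conditions for a candidate output s against sampled points of B_r(x) ∩ X, here on the simplex with the trivial answer s = best vertex, which is valid whenever ρ·r ≥ diam(X). The constructions of Garber & Hazan achieve much smaller ρ by exploiting the polytope structure.

```python
import numpy as np

def satisfies_lloo(s, x, r, c, rho, feasible_samples, tol=1e-12):
    """Check the two LLOO conditions for a candidate output s = A(x, r, c),
    against a finite sample of points of X (only those within distance r of x
    are used for the first condition)."""
    in_ball = feasible_samples[np.linalg.norm(feasible_samples - x, axis=1) <= r]
    linear_opt = np.all(in_ball @ c >= c @ s - tol)    # <c, y> >= <c, s> on B_r(x) ∩ X
    locality = np.linalg.norm(x - s) <= rho * r + tol  # ||x - s|| <= rho * r
    return bool(linear_opt and locality)

# Illustrative check on the 3-simplex with the trivial (worst-case) oracle answer.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=1000)               # random points of the simplex
x, c, r = np.full(3, 1/3), np.array([0.3, -0.1, 0.5]), 0.2
s = np.eye(3)[np.argmin(c)]                            # global FW vertex
print(satisfies_lloo(s, x, r, c, rho=np.sqrt(2) / r, feasible_samples=P))
```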
Linear Convergence

Let σ_f = min_{x ∈ S(x_0)} λ_min(∇²f(x)).

Theorem (Simplified version). Given a polytope X with an LLOO A(x, r, c) for each x ∈ X, r ∈ (0, ∞), c ∈ R^n, let

    ᾱ := min{ σ_f / (6·L_∇f·ρ²), 1 } / ( 1 + (M/2)·√(L_∇f)·diam(X) ).

Then h_k ≤ gap(x_0)·exp(−k·ᾱ/2).

In the paper we present a version of this theorem that does not require knowledge of L_∇f.
Numerical Performance

Portfolio Optimization:
    f(x) = −∑_{t=1}^T ln(⟨r_t, x⟩),   X = {x ∈ R^n_+ : ∑_{i=1}^n x_i = 1}.

Poisson Inverse Problem:
    f(x) = ∑_{i=1}^m ⟨w_i, x⟩ − ∑_{i=1}^m y_i ln(⟨w_i, x⟩),   x ∈ X = {x ∈ R^n : ‖x‖_1 ≤ R}.

Figure: Portfolio Optimization (right), Poisson Inverse Problem (left).
Conclusion

We derived various novel FW schemes with provable convergence guarantees for self-concordant minimization.

Future directions of research include:
- Generalized self-concordant minimization (Sun & Tran-Dinh, 2018)
- Stochastic oracles
- Inertial effects in algorithm design (conditional gradient sliding; Lan & Zhou, 2016)