
An Optimal Affine Invariant Smooth Minimization Algorithm
Alexandre d’Aspremont, CNRS & ENS.
Joint work with Cristobal Guzman & Martin Jaggi. Support from ERC SIPA.
Alex d’Aspremont, ADGO, Santiago, Feb. 2016. 1/22

A Basic Convex Problem
Solve
$$\text{minimize } f(x) \quad \text{subject to } x \in Q,$$
in $x \in \mathbb{R}^n$.
- Here, $f(x)$ is convex, smooth.
- Assume $Q \subset \mathbb{R}^n$ is compact, convex and simple.

Complexity
Newton’s method. At each iteration, take a step in the direction
$$\Delta x_{\mathrm{nt}} = -\nabla^2 f(x)^{-1} \nabla f(x).$$
Assume that
- the function $f(x)$ is self-concordant, i.e. $|f'''(x)| \le 2 f''(x)^{3/2}$,
- the set $Q$ has a self-concordant barrier $g(x)$.
[Nesterov and Nemirovskii, 1994] Newton’s method produces an $\epsilon$-optimal solution to the barrier problem
$$\min_x \; h(x) \triangleq f(x) + t\, g(x)$$
for some $t > 0$, in at most
$$\frac{20 - 8\alpha}{\alpha \beta (1 - 2\alpha)^2}\, (h(x_0) - h^*) + \log_2 \log_2 (1/\epsilon)$$
iterations, where $0 < \alpha < 0.5$ and $0 < \beta < 1$ are line search parameters.
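
As a quick check of the iteration bound above, the sketch below evaluates its two terms for one choice of line search parameters. The values $\alpha = 0.1$, $\beta = 0.8$ and $\epsilon = 10^{-10}$ are illustrative assumptions, not taken from the slide, and the helper name `newton_bound_constants` is hypothetical.

```python
import math

def newton_bound_constants(alpha, beta, eps):
    """Return (c, extra) so that the bound reads c * (h(x0) - h*) + extra."""
    if not (0 < alpha < 0.5 and 0 < beta < 1):
        raise ValueError("need 0 < alpha < 0.5 and 0 < beta < 1")
    # Leading constant multiplying the initial gap h(x0) - h*.
    c = (20 - 8 * alpha) / (alpha * beta * (1 - 2 * alpha) ** 2)
    # Additive term from the quadratically convergent phase.
    extra = math.log2(math.log2(1 / eps))
    return c, extra

# Assumed, illustrative parameter values.
c, extra = newton_bound_constants(alpha=0.1, beta=0.8, eps=1e-10)
print(c, extra)
```

With these assumed parameters the leading constant evaluates to 375 up to rounding, and the additive term lies between 5 and 6.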

Complexity
Newton’s method. Basically
$$\#\,\text{Newton iterations} \le 375\, (h(x_0) - h^*) + 6$$
- Empirically valid, up to constants.
- Independent of the dimension $n$.
- Affine invariant.
In practice, implementation mostly requires efficient linear algebra...
- Form the Hessian.
- Solve the Newton (or KKT) system $\nabla^2 f(x)\, \Delta x_{\mathrm{nt}} = -\nabla f(x)$.

Affine Invariance
Set $x = Ay$ where $A \in \mathbb{R}^{n \times n}$ is nonsingular. The problem
$$\text{minimize } f(x) \quad \text{subject to } x \in Q$$
becomes
$$\text{minimize } \hat f(y) \quad \text{subject to } y \in \hat Q,$$
in the variable $y \in \mathbb{R}^n$, where $\hat f(y) \triangleq f(Ay)$ and $\hat Q \triangleq A^{-1} Q$.
- Identical Newton steps, with $\Delta x_{\mathrm{nt}} = A\, \Delta y_{\mathrm{nt}}$.
- Identical complexity bounds $375\, (h(x_0) - h^*) + 6$, since $h^* = \hat h^*$.
Newton’s method is invariant w.r.t. an affine change of coordinates. The same is true for its complexity analysis.
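
The identity $\Delta x_{\mathrm{nt}} = A\, \Delta y_{\mathrm{nt}}$ can be checked numerically. The following is a minimal sketch under stated assumptions: the test function $f(x) = \sum_i \exp(c_i^T x) + \|x\|_2^2/2$, the random data and the matrix $A$ are all hypothetical choices made for illustration, and the constraint set is ignored (unconstrained Newton step).

```python
import numpy as np

def newton_step(grad, hess):
    """Solve hess @ dx = -grad for the Newton direction dx."""
    return np.linalg.solve(hess, -grad)

rng = np.random.default_rng(0)
n = 5
C = rng.standard_normal((8, n))  # data for the assumed test function

def grad_f(x):
    # Gradient of f(x) = sum_i exp(c_i^T x) + ||x||^2 / 2.
    return C.T @ np.exp(C @ x) + x

def hess_f(x):
    # Hessian of the same function; the identity term keeps it positive definite.
    w = np.exp(C @ x)
    return C.T @ (w[:, None] * C) + np.eye(n)

# Upper triangular with a positive diagonal, hence nonsingular.
A = np.triu(rng.standard_normal((n, n)), k=1) + np.diag(1.0 + rng.random(n))
y = 0.1 * rng.standard_normal(n)
x = A @ y

dx = newton_step(grad_f(x), hess_f(x))
# Chain rule for f_hat(y) = f(Ay): gradient A^T grad f(Ay), Hessian A^T hess f(Ay) A.
dy = newton_step(A.T @ grad_f(x), A.T @ hess_f(x) @ A)

steps_match = np.allclose(dx, A @ dy)
print(steps_match)
```

The check compares the step computed in the $x$ coordinates with the image under $A$ of the step computed in the $y$ coordinates; agreement is only up to floating point tolerance.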

Large-Scale Problems
The challenge now is scaling.
- Newton’s method (and its variants) solve all reasonably large problems.
- Beyond a certain scale, second order information is out of reach.
Question today: clean complexity bounds for first order methods?

Frank-Wolfe
Conditional gradient. At each iteration, solve
$$\text{minimize } \langle \nabla f(x_k), u \rangle \quad \text{subject to } u \in Q$$
in $u \in \mathbb{R}^n$. Define the curvature
$$C_f \triangleq \sup_{\substack{s, x \in Q,\ \alpha \in [0,1],\\ y = x + \alpha (s - x)}} \frac{1}{\alpha^2} \left( f(y) - f(x) - \langle y - x, \nabla f(x) \rangle \right).$$
The Frank-Wolfe algorithm will then produce an $\epsilon$ solution after
$$N_{\max} = \frac{4 C_f}{\epsilon}$$
iterations.
- $C_f$ is affine invariant but the bound is suboptimal in $\epsilon$.
- If $f(x)$ has a Lipschitz gradient, the lower bound is $O\left(\frac{1}{\sqrt{\epsilon}}\right)$.
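
The conditional gradient iteration can be sketched as follows. This is a minimal illustration, not the authors' code: the feasible set is taken to be the unit simplex, the objective $f(x) = \|x - c\|_2^2/2$ is a hypothetical test function, and the step size $2/(k+2)$ is an assumed standard choice that the slide does not state. For this objective the curvature as defined above is $C_f = 1$ on the simplex, since $f(y) - f(x) - \langle y - x, \nabla f(x) \rangle = \alpha^2 \|s - x\|_2^2/2$ and the squared diameter of the simplex is 2.

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, num_iters):
    """Frank-Wolfe over the unit simplex {x >= 0, sum(x) = 1}."""
    x = x0.copy()
    for k in range(num_iters):
        g = grad(x)
        # Linear minimization over the simplex: pick the vertex e_i
        # whose gradient entry is smallest.
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0
        gamma = 2.0 / (k + 2.0)  # assumed step size rule
        x = x + gamma * (s - x)  # convex combination, stays feasible
    return x

n = 20
c = np.zeros(n)
c[0], c[1] = 0.7, 0.3  # hypothetical target point, inside the simplex
f = lambda x: 0.5 * np.sum((x - c) ** 2)
grad = lambda x: x - c

num_iters = 500
x = frank_wolfe_simplex(grad, np.full(n, 1.0 / n), num_iters)

curvature = 1.0  # C_f for this objective on the simplex (see lead-in)
bound = 4.0 * curvature / (num_iters + 2)
print(f(x), bound)
```

Here the optimal value is 0, so `f(x)` is the suboptimality gap and can be compared directly with the $4 C_f/(k+2)$ guarantee.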

Optimal First-Order Methods
Smooth minimization algorithm in [Nesterov, 1983] to solve
$$\text{minimize } f(x) \quad \text{subject to } x \in Q.$$
The original paper was in a Euclidean setting. In the general case...
- Choose a norm $\|\cdot\|$. $\nabla f(x)$ is Lipschitz with constant $L$ w.r.t. $\|\cdot\|$:
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L \|y - x\|^2, \quad x, y \in Q.$$
- Choose a prox function $d(x)$ for the set $Q$, with
$$\frac{\sigma}{2} \|x - x_0\|^2 \le d(x)$$
for some $\sigma > 0$.

Optimal First-Order Methods
Smooth minimization algorithm [Nesterov, 2005]
Input: $x_0$, the prox center of the set $Q$.
1: for $k = 0, \ldots, N$ do
2:   Compute $\nabla f(x_k)$.
3:   Compute $y_k = \operatorname{argmin}_{y \in Q} \left\{ \langle \nabla f(x_k), y - x_k \rangle + \tfrac{1}{2} L \|y - x_k\|^2 \right\}$.
4:   Compute $z_k = \operatorname{argmin}_{x \in Q} \left\{ \sum_{i=0}^{k} \alpha_i \left[ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle \right] + \frac{L}{\sigma} d(x) \right\}$.
5:   Set $x_{k+1} = \tau_k z_k + (1 - \tau_k) y_k$.
6: end for
Output: $x_N, y_N \in Q$.
Produces an $\epsilon$-solution in at most
$$N_{\max} = \sqrt{\frac{8 L\, d(x^\star)}{\epsilon\, \sigma}}$$
iterations. Optimal in $\epsilon$, but not affine invariant.
Heavily used: TFOCS, NESTA, Structured $\ell_1$, ...
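
The steps of the algorithm can be sketched in a Euclidean setting, where both argmin steps reduce to projections. Everything specific below is an assumption for illustration: the norm is $\|\cdot\|_2$ with $d(x) = \|x - x_0\|_2^2/2$ (so $\sigma = 1$), the set $Q$ is the box $[-1, 1]^n$, the objective is a hypothetical quadratic whose minimizer lies inside the box, and the weights $\alpha_i = (i+1)/2$, $\tau_k = 2/(k+3)$ are the choices from [Nesterov, 2005], which the slide leaves unspecified.

```python
import numpy as np

def smooth_minimization(grad, project, x0, L, num_iters):
    """Sketch of the smooth minimization scheme with a Euclidean prox."""
    x = x0.copy()
    y = x0.copy()
    grad_sum = np.zeros_like(x0)  # running sum of alpha_i * grad f(x_i)
    for k in range(num_iters + 1):
        g = grad(x)                       # step 2
        # Step 3: for the Euclidean norm the argmin is a projected gradient step.
        y = project(x - g / L)
        # Step 4: with d(x) = ||x - x0||^2 / 2 the argmin is a projection too.
        grad_sum += 0.5 * (k + 1) * g     # alpha_k = (k + 1) / 2 (assumed)
        z = project(x0 - grad_sum / L)
        tau = 2.0 / (k + 3)               # assumed
        x = tau * z + (1.0 - tau) * y     # step 5
    return y

rng = np.random.default_rng(0)
n = 10
B = rng.standard_normal((15, n))
H = B.T @ B + np.eye(n)                   # positive definite Hessian
c = rng.uniform(-0.5, 0.5, size=n)        # minimizer, inside the box
f = lambda x: 0.5 * (x - c) @ H @ (x - c)
grad = lambda x: H @ (x - c)
project = lambda x: np.clip(x, -1.0, 1.0)  # projection on Q = [-1, 1]^n
L = np.linalg.eigvalsh(H)[-1]             # Lipschitz constant of the gradient

x0 = np.zeros(n)                          # prox center of the box
N = 200
y = smooth_minimization(grad, project, x0, L, N)

# Guarantee from [Nesterov, 2005]: f(y_N) - f* <= 4 L d(x*) / (sigma (N+1)(N+2)).
bound = 4.0 * L * 0.5 * np.sum((c - x0) ** 2) / ((N + 1) * (N + 2))
print(f(y), bound)
```

Since the optimal value of this test problem is 0, the printed `f(y)` is the gap itself. Swapping in a different norm or prox function changes only the two argmin steps.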

Optimal First-Order Methods
Choosing the norm and prox can have a big impact, beyond the immediate computational cost of computing the prox steps. Consider the following matrix game problem
$$\min_{\{\mathbf{1}^T x = 1,\ x \ge 0\}} \ \max_{\{\mathbf{1}^T y = 1,\ y \ge 0\}} \ x^T A y$$
- Euclidean prox. Pick $\|\cdot\|_2$ and $d(x) = \|x\|_2^2/2$; after regularization, the complexity bound after $N$ iterations is
$$\frac{4 \|A\|_2}{N + 1}$$
- Entropy prox. Pick $\|\cdot\|_1$ and $d(x) = \sum_i x_i \log x_i + \log n$; the bound becomes
$$\frac{4 \sqrt{\log n \log m}\ \max_{ij} |A_{ij}|}{N + 1}$$
which can be significantly smaller. Speedup is roughly $\sqrt{n}$ when $A$ is Bernoulli...
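
To get a feel for the gap between the two constants, the sketch below evaluates both numerators on a random sign matrix. Reading "Bernoulli" as i.i.d. $\pm 1$ entries is an assumption, as are the dimensions $m = n = 400$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 400
# Assumed model: i.i.d. +/-1 entries.
A = rng.choice([-1.0, 1.0], size=(m, n))

# Numerator of the Euclidean bound: 4 * spectral norm of A.
euclidean_const = 4.0 * np.linalg.norm(A, 2)
# Numerator of the entropy bound: 4 * sqrt(log n * log m) * max |A_ij|.
entropy_const = 4.0 * math.sqrt(math.log(n) * math.log(m)) * np.abs(A).max()

print(euclidean_const, entropy_const, euclidean_const / entropy_const)
```

For a sign matrix the spectral norm is at least $\sqrt{m}$ while the largest entry is 1, so the Euclidean constant grows with the dimension and the entropy constant grows only logarithmically.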

Choosing the norm
Invariance means $\|\cdot\|$ and $d(x)$ are constructed using only $f$ and the set $Q$.
Minkowski gauge. Assume $Q$ is centrally symmetric with non-empty interior. The Minkowski gauge of $Q$ is a norm:
$$\|x\|_Q \triangleq \inf \{ \lambda \ge 0 : x \in \lambda Q \}$$
Lemma (Affine invariance). The function $f(x)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_Q$ with constant $L_Q > 0$, i.e.
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L_Q \|y - x\|_Q^2, \quad x, y \in Q,$$
if and only if the function $f(Aw)$ has Lipschitz continuous gradient with respect to the norm $\|\cdot\|_{A^{-1} Q}$ with the same constant $L_Q$.
A similar result holds for strong convexity. Note that $\|x\|_Q^* = \|x\|_{Q^\circ}$.
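
The gauge can be computed from a membership oracle by bisection, which also makes the invariance $\|Ay\|_Q = \|y\|_{A^{-1}Q}$ easy to check numerically. The sketch below is illustrative only: the helper `gauge`, the box $Q = \{x : |x_i| \le r_i\}$ and the matrix $A$ are hypothetical choices.

```python
import numpy as np

def gauge(x, contains, tol=1e-10):
    """Minkowski gauge inf{lam >= 0 : x in lam * Q} via bisection.

    `contains` is a membership oracle for a compact convex set Q
    with 0 in its interior.
    """
    if not np.any(x):
        return 0.0
    hi = 1.0
    while not contains(x / hi):   # grow until x / hi lies in Q
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if contains(x / mid):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
n = 4
r = 1.0 + rng.random(n)                        # half-widths of the box Q
in_Q = lambda x: bool(np.all(np.abs(x) <= r))

# Upper triangular with a positive diagonal, hence nonsingular.
A = np.triu(rng.standard_normal((n, n)), k=1) + np.diag(1.0 + rng.random(n))
in_Q_hat = lambda y: in_Q(A @ y)               # Q_hat = A^{-1} Q

y = rng.standard_normal(n)
g_x = gauge(A @ y, in_Q)                       # ||Ay||_Q
g_y = gauge(y, in_Q_hat)                       # ||y||_{A^{-1} Q}
closed_form = np.max(np.abs(A @ y) / r)        # gauge of a box in closed form
print(g_x, g_y, closed_form)
```

The three printed values should agree up to the bisection tolerance.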

Choosing the prox.
How do we choose the prox? Start with two definitions.
Definition (Banach-Mazur distance). Suppose $\|\cdot\|_X$ and $\|\cdot\|_Y$ are two norms on a space $E$. The distortion $d(\|\cdot\|_X, \|\cdot\|_Y)$ is the smallest product $ab > 0$ such that
$$\frac{1}{b} \|x\|_Y \le \|x\|_X \le a \|x\|_Y, \quad \text{for all } x \in E.$$
$\log(d(\|\cdot\|_X, \|\cdot\|_Y))$ is the Banach-Mazur distance between $X$ and $Y$.
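
As a worked instance of the distortion defined above, take $\|\cdot\|_X = \|\cdot\|_1$ and $\|\cdot\|_Y = \|\cdot\|_2$ on $\mathbb{R}^n$. This example is added for illustration and does not appear in the slides.

```latex
% Standard comparison of the l1 and l2 norms on R^n:
\|x\|_2 \;\le\; \|x\|_1 \;\le\; \sqrt{n}\,\|x\|_2
\qquad \text{for all } x \in \mathbb{R}^n .
% The left inequality is tight at x = e_1, the right one at x = (1,\dots,1),
% so the best constants are b = 1 and a = \sqrt{n}, and
d\big(\|\cdot\|_1, \|\cdot\|_2\big) \;=\; ab \;=\; \sqrt{n},
\qquad
\log d\big(\|\cdot\|_1, \|\cdot\|_2\big) \;=\; \tfrac{1}{2}\log n .
```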

Choosing the prox.
Regularity constant. Regularity constant of $(E, \|\cdot\|)$, defined in [Juditsky and Nemirovski, 2008] to study large deviations of vector-valued martingales.
Definition [Juditsky and Nemirovski, 2008] (Regularity constant of a Banach space $(E, \|\cdot\|)$). The smallest constant $\Delta > 0$ for which there exists a smooth norm $p(x)$ such that
- the prox $p(x)^2/2$ has a Lipschitz continuous gradient w.r.t. the norm $p(x)$, with constant $\mu$ where $1 \le \mu \le \Delta$,
- the norm $p(x)$ satisfies
$$\|x\| \le p(x) \le \left( \frac{\Delta}{\mu} \right)^{1/2} \|x\|, \quad \text{for all } x \in E,$$
i.e. $d(p(x), \|\cdot\|) \le \sqrt{\Delta/\mu}$.

Complexity
Using the algorithm in [Nesterov, 2005] to solve
$$\text{minimize } f(x) \quad \text{subject to } x \in Q.$$
Proposition [d’Aspremont, Guzman, and Jaggi, 2013] (Affine invariant complexity bounds). Suppose $f(x)$ has a Lipschitz continuous gradient with constant $L_Q$ with respect to the norm $\|\cdot\|_Q$ and the space $(\mathbb{R}^n, \|\cdot\|_Q^*)$ is $D_Q$-regular. Then the smooth algorithm in [Nesterov, 2005] will produce an $\epsilon$ solution in at most
$$N_{\max} = \sqrt{\frac{4 L_Q D_Q}{\epsilon}}$$
iterations. Furthermore, the constants $L_Q$ and $D_Q$ are affine invariant.
We can show $C_f \le L_Q D_Q$, but it is not clear if the bound is attained...

Complexity
A few more facts about $L_Q$ and $D_Q$...
Suppose we scale $Q \to \alpha Q$, with $\alpha > 0$,
- the Lipschitz constant $L_{\alpha Q}$ satisfies $\alpha^2 L_Q \le L_{\alpha Q}$.
- the smoothness term $D_Q$ remains unchanged.
- Given our choice of norm (hence $L_Q$), $L_Q D_Q$ is the best possible bound.
Also, from [Juditsky and Nemirovski, 2008], in the dual space
- The regularity constant decreases on a subspace $F$, i.e. $D_{Q \cap F} \le D_Q$.
- From $D$-regular spaces $(E_i, \|\cdot\|)$, we can construct a $(2D + 2)$-regular product space $E_1 \times \ldots \times E_m$.

Complexity, $\ell_1$ example
Minimizing a smooth convex function over the unit simplex
$$\text{minimize } f(x) \quad \text{subject to } \mathbf{1}^T x \le 1,\ x \ge 0$$
in $x \in \mathbb{R}^n$.
- Choosing $\|\cdot\|_1$ as the norm and $d(x) = \log n + \sum_{i=1}^n x_i \log x_i$ as the prox function, complexity is bounded by
$$\sqrt{\frac{8 L_1 \log n}{\epsilon}}$$
(note $L_1$ is the lowest Lipschitz constant among all $\ell_p$ norm choices).
- Symmetrizing the simplex into the $\ell_1$ ball. The space $(\mathbb{R}^n, \|\cdot\|_\infty)$ is $2 \log n$ regular [Juditsky and Nemirovski, 2008, Ex. 3.2]. The prox function chosen here is $\|\cdot\|_\alpha^2/2$, with $\alpha = 2 \log n / (2 \log n - 1)$, and our complexity bound is
$$\sqrt{\frac{16 L_1 \log n}{\epsilon}}$$
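
The two bounds above differ only by a constant factor, which a few lines of arithmetic make concrete. The values of $L_1$, $n$ and $\epsilon$ below are hypothetical, chosen only to produce numbers, and the function names are illustrative.

```python
import math

def entropy_bound(L1, n, eps):
    """Iteration bound sqrt(8 L1 log n / eps) for the entropy prox."""
    return math.sqrt(8.0 * L1 * math.log(n) / eps)

def symmetrized_bound(L1, n, eps):
    """Iteration bound sqrt(16 L1 log n / eps) after symmetrizing the simplex."""
    return math.sqrt(16.0 * L1 * math.log(n) / eps)

# Hypothetical problem parameters.
L1, n, eps = 1.0, 10_000, 1e-3
a = entropy_bound(L1, n, eps)
b = symmetrized_bound(L1, n, eps)
print(math.ceil(a), math.ceil(b), b / a)
```

The ratio of the two bounds is $\sqrt{2}$ whatever the parameters, so the affine invariant construction recovers the $\sqrt{\log n}$ dependence of the entropy setup up to a constant.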

In practice
Easy and hard problems.
- The parameter $L_Q$ satisfies
$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} L_Q \|y - x\|_Q^2, \quad x, y \in Q.$$
On easy problems, $\|\cdot\|$ is large in directions where $\nabla f$ is large, i.e. the sublevel sets of $f(x)$ and $Q$ are aligned.
- For $\ell_p$ spaces with $p \in [2, \infty]$, the unit balls $B_p$ have low regularity constants,
$$D_{B_p} \le \min \{ p - 1,\ 2 \log n \}$$
while $D_{B_1} = n$ (worst case). By duality, problems over unit balls $B_q$ for $q \in [1, 2]$ are easier.
- Optimizing over cubes is harder.
