

1. Regularized Nonlinear Acceleration

Alexandre d'Aspremont, CNRS & D.I., École Normale Supérieure.
With Damien Scieur & Francis Bach.
Support from ERC SIPA and ITN SpaRTaN.
Huatulco, January 2018.

2. Introduction

Generic convex optimization problem:
$$\min_{x \in \mathbb{R}^n} f(x)$$

3. Introduction

Algorithms produce a sequence of iterates. We only keep the last (or best) one...

4. Introduction

Aitken's $\Delta^2$ [Aitken, 1927]. Given a sequence $\{s_k\}_{k=1,\ldots} \subset \mathbb{R}^N$ with limit $s^*$, suppose
$$s_{k+1} - s^* = a\,(s_k - s^*), \qquad k = 1, \ldots$$
We can compute $a$ using
$$s_{k+1} - s_k = a\,(s_k - s_{k-1}) \;\Rightarrow\; a = \frac{s_{k+1} - s_k}{s_k - s_{k-1}}$$
and get the limit $s^*$ by solving
$$s_{k+1} - s^* = \frac{s_{k+1} - s_k}{s_k - s_{k-1}}\,(s_k - s^*),$$
which yields
$$s^* = \frac{s_{k-1}\,s_{k+1} - s_k^2}{s_{k+1} - 2 s_k + s_{k-1}}.$$
This is Aitken's $\Delta^2$ and allows us to compute $s^*$ from $\{s_{k+1}, s_k, s_{k-1}\}$.
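As a quick illustration, here is a minimal NumPy sketch of the $\Delta^2$ formula above (the function name aitken_delta2 is ours, not from the talk):

```python
import numpy as np

def aitken_delta2(s):
    """Aitken's Delta^2: extrapolate a scalar sequence s_1, s_2, ...

    Returns the extrapolated value computed from each consecutive
    triple (s_{k-1}, s_k, s_{k+1}), assuming nonzero denominators.
    """
    s = np.asarray(s, dtype=float)
    num = s[:-2] * s[2:] - s[1:-1] ** 2      # s_{k-1} s_{k+1} - s_k^2
    den = s[2:] - 2.0 * s[1:-1] + s[:-2]     # s_{k+1} - 2 s_k + s_{k-1}
    return num / den
```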

5. Introduction

Convergence acceleration. Consider
$$s_k = \sum_{i=0}^{k} \frac{(-1)^i}{2i+1} \;\xrightarrow{k\to\infty}\; \frac{\pi}{4} = 0.785398\ldots$$
We have:

 k   (-1)^k/(2k+1)     s_k       Delta^2
 0     1.0            1.00000       -
 1    -0.33333        0.66667       -
 2     0.2            0.86667     0.79167
 3    -0.14286        0.72381     0.78333
 4     0.11111        0.83492     0.78631
 5    -0.090909       0.74401     0.78492
 6     0.076923       0.82093     0.78568
 7    -0.066667       0.75427     0.78522
 8     0.058824       0.81309     0.78552
 9    -0.052632       0.76046     0.78531
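The table is easy to reproduce; a short self-contained sketch (numbers match the table up to rounding):

```python
import numpy as np

# Partial sums s_k of the Leibniz series for pi/4.
s = np.cumsum([(-1) ** i / (2 * i + 1) for i in range(10)])

# Aitken's Delta^2 on each consecutive triple (s_{k-1}, s_k, s_{k+1}).
acc = (s[:-2] * s[2:] - s[1:-1] ** 2) / (s[2:] - 2 * s[1:-1] + s[:-2])

print(s[-1])    # 0.76046...: the raw partial sum is still far from pi/4
print(acc[-1])  # 0.78531...: the extrapolated value is much closer to 0.785398...
```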

6. Introduction

Convergence acceleration.

- Similar results apply to sequences satisfying
  $$\sum_{i=0}^{k} a_i\,(s_{n+i} - s^*) = 0$$
  using Aitken's ideas recursively.
- This produces Wynn's $\varepsilon$-algorithm [Wynn, 1956].
- See [Brezinski, 1977] for a survey on acceleration and extrapolation.
- Directly related to the Levinson-Durbin algorithm on AR processes.
- Vector case: focus on Minimal Polynomial Extrapolation [Sidi et al., 1986].

Overall: a simple postprocessing step.

7. Outline

- Introduction
- Minimal Polynomial Extrapolation
- Regularized MPE
- Numerical results

8. Minimal Polynomial Extrapolation

Quadratic example. Minimize $f(x) = \frac{1}{2}\|Bx - b\|_2^2$ using the basic gradient algorithm, with
$$x_{k+1} := x_k - \frac{1}{L}\,(B^T B x_k - b).$$
Since $B^T B x^* = b$, we get
$$x_{k+1} - x^* = \underbrace{\left(I - \frac{1}{L} B^T B\right)}_{A}\,(x_k - x^*).$$
This means $x_{k+1} - x^*$ follows a vector autoregressive process.
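A small numerical sanity check of this linear (vector autoregressive) structure; the setup below (random B, with b chosen so that the fixed point is a known x*) is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
B = rng.standard_normal((n, n))
H = B.T @ B                        # plays the role of B^T B above
xstar = rng.standard_normal(n)
b = H @ xstar                      # so that B^T B x* = b, as on the slide
L = np.linalg.eigvalsh(H).max()    # gradient Lipschitz constant
A = np.eye(n) - H / L              # iteration matrix

x = np.zeros(n)
for _ in range(5):
    x_next = x - (H @ x - b) / L                         # gradient step
    assert np.allclose(x_next - xstar, A @ (x - xstar))  # VAR structure holds
    x = x_next
```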

9. Minimal Polynomial Extrapolation

We have
$$\sum_{i=0}^{k} c_i\,(x_i - x^*) = \sum_{i=0}^{k} c_i\,A^i\,(x_0 - x^*)$$
and setting $\mathbf{1}^T c = 1$ yields
$$\sum_{i=0}^{k} c_i\,x_i - x^* = p(A)\,(x_0 - x^*), \qquad \text{where } p(v) = \sum_{i=0}^{k} c_i\,v^i.$$

- Setting $c$ such that $p(A)(x_0 - x^*) = 0$, we would have $x^* = \sum_{i=0}^{k} c_i\,x_i$.
- Get the limit by averaging iterates (using weights depending on $x_k$).
- We typically do not observe $A$ (or $x^*$).
- How do we extract $c$ from the iterates $x_k$?

10. Minimal Polynomial Extrapolation

We have
$$x_k - x_{k-1} = (x_k - x^*) - (x_{k-1} - x^*) = (A - I)\,A^{k-1}\,(x_0 - x^*),$$
hence if $p(A)(x_0 - x^*) = 0$, we must have
$$\sum_{i=1}^{k} c_i\,(x_i - x_{i-1}) = (A - I)\,p(A)\,(x_0 - x^*) = 0,$$
so if $(A - I)$ is nonsingular, the coefficient vector $c$ solves the linear system
$$\sum_{i=1}^{k} c_i\,(x_i - x_{i-1}) = 0, \qquad \sum_{i=1}^{k} c_i = 1,$$
and $p(\cdot)$ is the minimal polynomial of $A$ w.r.t. $(x_0 - x^*)$.

11. Approximate Minimal Polynomial Extrapolation

Approximate MPE.

- For $k$ smaller than the degree of the minimal polynomial, we find $c$ that minimizes the residual
  $$\|(A - I)\,p(A)\,(x_0 - x^*)\|_2 = \Big\|\sum_{i=1}^{k} c_i\,(x_i - x_{i-1})\Big\|_2.$$
- Setting $U \in \mathbb{R}^{n \times (k+1)}$, with $U_i = x_{i+1} - x_i$, this means solving
  $$c^* \in \operatorname*{argmin}_{\mathbf{1}^T c = 1} \|Uc\|_2 \qquad \text{(AMPE)}$$
  in the variable $c \in \mathbb{R}^{k+1}$ (sketched in code below).
- Also known as the Eddy-Mešina method [Mešina, 1977, Eddy, 1979], or Reduced Rank Extrapolation with arbitrary $k$ (see [Smith et al., 1987, §10]). Very similar to Anderson acceleration, GMRES, etc.
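In code, (AMPE) is a small equality-constrained least-squares problem whose optimality conditions give $c \propto (U^TU)^{-1}\mathbf{1}$. A minimal sketch via the normal equations (assuming $U^TU$ is nonsingular, which in practice it barely is; the function name is ours):

```python
import numpy as np

def ampe_extrapolate(X):
    """AMPE: X is n x (k+2), columns are iterates x_0, ..., x_{k+1}.

    Solves min_c ||U c||_2  s.t.  1^T c = 1, with U_i = x_{i+1} - x_i,
    then returns the extrapolated point sum_i c_i x_i.
    """
    U = np.diff(X, axis=1)                        # n x (k+1) difference matrix
    z = np.linalg.solve(U.T @ U, np.ones(U.shape[1]))
    c = z / z.sum()                               # enforce 1^T c = 1
    return X[:, :-1] @ c
```

On the quadratic example above, feeding a window of gradient iterates to this routine typically recovers x* to high accuracy once k approaches the degree of the minimal polynomial.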

12. Uniform Bound

Chebyshev polynomials. Crude bound on $\|Uc^*\|_2$ using Chebyshev polynomials, to bound the error as a function of $k$, with
$$\Big\|\sum_{i=0}^{k} c^*_i\,x_i - x^*\Big\|_2 = \Big\|(I - A)^{-1} \sum_{i=0}^{k} c^*_i\,U_i\Big\|_2 \le \|(I - A)^{-1}\|_2\;\|p(A)\,(x_1 - x_0)\|_2.$$

- We have
  $$\|p(A)\,(x_1 - x_0)\|_2 \le \|p(A)\|_2\,\|x_1 - x_0\|_2 = \max_{i=1,\ldots,n} |p(\lambda_i)|\;\|x_1 - x_0\|_2,$$
  where $0 \le \lambda_i \le \sigma$ are the eigenvalues of $A$.
- It suffices to find $p(\cdot) \in \mathbb{R}_k[x]$ solving
  $$\inf_{\{p \in \mathbb{R}_k[x]\,:\,p(1) = 1\}} \; \sup_{v \in [0,\sigma]} |p(v)|.$$

Explicit solution using modified Chebyshev polynomials.

13. Uniform Bound using Chebyshev Polynomials

[Figure: modified Chebyshev polynomials T_3(x, σ) and T_5(x, σ) for x ∈ [0, 1] and σ = 0.85. The maximum value of T_k on [0, σ] decreases geometrically fast as k grows.]
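A quick numerical check of that geometric decrease, using the rescaled Chebyshev polynomial $C_k((2x-\sigma)/\sigma)/C_k((2-\sigma)/\sigma)$, which solves the normalized min-max problem on the previous slide (a sketch; shifted_cheb_max is our name):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def shifted_cheb_max(k, sigma):
    """sup over [0, sigma] of the optimal degree-k polynomial with p(1) = 1."""
    ck = Chebyshev.basis(k)
    # |C_k| <= 1 on [-1, 1], and [0, sigma] maps onto [-1, 1] under
    # x -> (2x - sigma)/sigma, so the sup equals 1/|C_k((2 - sigma)/sigma)|.
    return 1.0 / abs(ck((2.0 - sigma) / sigma))

for k in [1, 3, 5, 7]:
    print(k, shifted_cheb_max(k, 0.85))   # 0.739, 0.171, 0.0336, ...: geometric decay
```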

14. Approximate Minimal Polynomial Extrapolation

AMPE convergence.

Proposition. Let $A$ be symmetric, $0 \preceq A \preceq \sigma I$ with $\sigma < 1$, and let $c^*$ be the solution of (AMPE). Then
$$\Big\|\sum_{i=0}^{k} c^*_i\,x_i - x^*\Big\|_2 \le \kappa(A - I)\,\frac{2\zeta^k}{1 + \zeta^{2k}}\,\|x_0 - x^*\|_2 \qquad (1)$$
where $\kappa(A - I)$ is the condition number of the matrix $A - I$ and $\zeta$ is given by
$$\zeta = \frac{1 - \sqrt{1 - \sigma}}{1 + \sqrt{1 - \sigma}} < \sigma. \qquad (2)$$
See also [Nemirovskiy and Polyak, 1984]. For the gradient method, $\sigma = 1 - \mu/L$, so
$$\Big\|\sum_{i=0}^{k} c^*_i\,x_i - x^*\Big\|_2 \le \kappa(A - I)\left(\frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}}\right)^{k} \|x_0 - x^*\|_2.$$
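To see what the bound buys, compare the geometric factor $\sigma^k$ of the plain gradient method with the factor $2\zeta^k/(1+\zeta^{2k})$ from (1); the value $\sigma = 0.99$ below is just an illustrative choice of ours:

```python
import numpy as np

sigma = 0.99                                   # example value of 1 - mu/L
zeta = (1 - np.sqrt(1 - sigma)) / (1 + np.sqrt(1 - sigma))
for k in [5, 10, 20, 40]:
    # plain gradient factor vs. the AMPE bound's factor from (1)
    print(k, sigma ** k, 2 * zeta ** k / (1 + zeta ** (2 * k)))
```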

15. Approximate Minimal Polynomial Extrapolation

AMPE versus Nesterov, conjugate gradient.

- Key difference with conjugate gradient: we do not observe $A$...
- Chebyshev polynomials satisfy a two-step recurrence. For quadratic minimization using the gradient method:
  $$z_{k-1} = y_{k-1} - \frac{1}{L}\,(B y_{k-1} - b), \qquad y_k = \frac{\alpha_{k-1}}{\alpha_k}\left(\frac{2 z_{k-1}}{\sigma} - y_{k-1}\right) - \frac{\alpha_{k-2}}{\alpha_k}\,y_{k-2},$$
  where $\alpha_k = \frac{2 - \sigma}{\sigma}\,\alpha_{k-1} - \alpha_{k-2}$.
- Nesterov's acceleration recursively computes a similar polynomial with
  $$z_{k-1} = y_{k-1} - \frac{1}{L}\,(B y_{k-1} - b), \qquad y_k = z_{k-1} + \beta_k\,(z_{k-1} - z_{k-2}),$$
  see also [Hardt, 2013]. A runnable sketch follows.
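A minimal runnable sketch of the Nesterov-style recursion on a quadratic, using the constant momentum $\beta = (\sqrt{L} - \sqrt{\mu})/(\sqrt{L} + \sqrt{\mu})$ commonly used in the strongly convex case (the test problem and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
M = rng.standard_normal((n, n))
B = M @ M.T / n + 0.1 * np.eye(n)       # symmetric positive definite
b = rng.standard_normal(n)
xstar = np.linalg.solve(B, b)           # minimizer of 0.5 y'By - b'y

lams = np.linalg.eigvalsh(B)
L, mu = lams.max(), lams.min()
beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

y, z_old = np.zeros(n), np.zeros(n)
for k in range(200):
    z = y - (B @ y - b) / L             # gradient step, as on the slide
    y = z + beta * (z - z_old)          # momentum extrapolation
    z_old = z

print(np.linalg.norm(z - xstar))        # far smaller than after 200 plain gradient steps
```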

16. Approximate Minimal Polynomial Extrapolation

Accelerating optimization algorithms. For gradient descent, we have
$$\tilde{x}_{k+1} := \tilde{x}_k - \frac{1}{L}\,\nabla f(\tilde{x}_k).$$

- This means $\tilde{x}_{k+1} - x^* = A\,(\tilde{x}_k - x^*) + O(\|\tilde{x}_k - x^*\|_2^2)$, where $A = I - \frac{1}{L}\nabla^2 f(x^*)$, meaning that $\|A\|_2 \le 1 - \mu/L$ whenever $\mu I \preceq \nabla^2 f(x) \preceq L I$.
- The approximation error is a sum of three terms:
  $$\Big\|\sum_{i=0}^{k} \tilde{c}_i\,\tilde{x}_i - x^*\Big\|_2 \le \underbrace{\Big\|\sum_{i=0}^{k} c_i\,x_i - x^*\Big\|_2}_{\text{AMPE}} + \underbrace{\Big\|\sum_{i=0}^{k} (\tilde{c}_i - c_i)\,x_i\Big\|_2}_{\text{Stability}} + \underbrace{\Big\|\sum_{i=0}^{k} \tilde{c}_i\,(\tilde{x}_i - x_i)\Big\|_2}_{\text{Nonlinearity}}$$

Stability is key here.

17. Approximate Minimal Polynomial Extrapolation

Stability.

- The iterations span a Krylov subspace
  $$\mathcal{K}_k = \operatorname{span}\left\{U_0, AU_0, \ldots, A^{k-1}U_0\right\},$$
  so the matrix $U$ in AMPE is a Krylov matrix.
- Similar to the Hankel or Toeplitz case: $U^TU$ has a condition number typically growing exponentially with dimension [Tyrtyshnikov, 1994] (illustrated below).
- In fact, the Hankel, Toeplitz and Krylov problems are directly connected, hence the link with Levinson-Durbin [Heinig and Rost, 2011].
- For generic optimization problems, eigenvalues are perturbed by deviations from the linear model, which can make the situation even worse.

Be wise, regularize...
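The exploding conditioning is easy to observe numerically; a small sketch with a random test matrix of our own construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
M = rng.standard_normal((n, n))
H = M @ M.T / n + 0.01 * np.eye(n)
A = np.eye(n) - H / np.linalg.eigvalsh(H).max()   # gradient iteration matrix
u0 = rng.standard_normal(n)

# Krylov matrix [u0, A u0, ..., A^{k-1} u0]: cond(U^T U) blows up with k.
U = np.column_stack([np.linalg.matrix_power(A, i) @ u0 for i in range(10)])
for k in range(2, 11):
    G = U[:, :k].T @ U[:, :k]
    print(k, np.linalg.cond(G))       # grows roughly exponentially in k
```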

18. Outline

- Introduction
- Minimal Polynomial Extrapolation
- Regularized MPE
- Numerical results

19. Regularized Minimal Polynomial Extrapolation

Regularized AMPE. Add a regularization term to AMPE.

- Regularized formulation of problem (AMPE):
  $$\begin{array}{ll} \text{minimize} & c^T (U^T U + \lambda I)\,c \\ \text{subject to} & \mathbf{1}^T c = 1 \end{array} \qquad \text{(RMPE)}$$
- Solution given by a linear system of size $k + 1$:
  $$c^*_\lambda = \frac{(U^T U + \lambda I)^{-1}\mathbf{1}}{\mathbf{1}^T (U^T U + \lambda I)^{-1}\mathbf{1}}. \qquad (3)$$
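Putting it together, a minimal sketch of the RMPE step using the closed form (3); rmpe_extrapolate and the default λ are ours, not the talk's reference implementation:

```python
import numpy as np

def rmpe_extrapolate(X, lam=1e-8):
    """RMPE: X is n x (k+2) with iterates x_0, ..., x_{k+1} as columns.

    Solves min_c c^T (U^T U + lam I) c  s.t.  1^T c = 1 via (3),
    then returns the extrapolated point sum_i c_i x_i.
    """
    U = np.diff(X, axis=1)                       # U_i = x_{i+1} - x_i
    G = U.T @ U + lam * np.eye(U.shape[1])
    z = np.linalg.solve(G, np.ones(U.shape[1]))  # (U^T U + lam I)^{-1} 1
    c = z / z.sum()                              # normalize so 1^T c = 1
    return X[:, :-1] @ c
```

Compared with the AMPE sketch earlier, the only change is the λI term, which keeps the Gram system well conditioned at the cost of slightly biasing the weights.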
