Geometry of First-order Methods: Trajectory and Adaptive Acceleration

Geometry of First-order Methods: Trajectory and Adaptive Acceleration. Clarice Poon, University of Bath. Joint work with Jingwei Liang, University of Cambridge. Outline: Introduction; Trajectory of first-order methods; Adaptive acceleration via linear prediction; Relation with previous work; Numerical experiments; Conclusions.


  1. Example: Douglas–Rachford splitting. Feasibility problem in $\mathbb{R}^2$: let $T_1, T_2 \subset \mathbb{R}^2$ be two subspaces such that $T_1 \cap T_2 \neq \emptyset$; find $x \in \mathbb{R}^2$ such that $x \in T_1 \cap T_2$. Inertial Douglas–Rachford: $\bar z_k = z_k + a(z_k - z_{k-1}) + b(z_{k-1} - z_{k-2})$, $z_{k+1} = F_{\mathrm{DR}}(\bar z_k)$. 1-step inertial: $a = 0.3$. 2-step inertial: $a = 0.6$, $b = -0.3$. [Figure: error over 1000 iterations for normal DR, 1-step inertial DR and 2-step inertial DR.] NB: 1-step inertial will always worsen the rate!
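To make the setup concrete, here is a minimal sketch of this experiment (the angle between the two subspaces, the starting point and all function names are illustrative choices, not taken from the slides):

```python
import numpy as np

# Two 1-D subspaces (lines through the origin) of R^2 and their orthogonal projectors.
u1 = np.array([1.0, 0.0])
u2 = np.array([np.cos(0.3), np.sin(0.3)])        # assumed angle of 0.3 rad between T1 and T2
P1 = lambda z: u1 * (u1 @ z)
P2 = lambda z: u2 * (u2 @ z)

def F_dr(z):
    """One Douglas--Rachford step for the feasibility problem x in T1 ∩ T2
    (standard form, with projections as the proximity operators)."""
    x = P1(z)
    return z + P2(2 * x - z) - x

def inertial_dr(a, b, z0, iters=1000):
    """z̄_k = z_k + a (z_k - z_{k-1}) + b (z_{k-1} - z_{k-2});  z_{k+1} = F_DR(z̄_k)."""
    zs = [z0.copy(), z0.copy(), z0.copy()]
    errs = []
    for _ in range(iters):
        z_bar = zs[-1] + a * (zs[-1] - zs[-2]) + b * (zs[-2] - zs[-3])
        zs.append(F_dr(z_bar))
        errs.append(np.linalg.norm(zs[-1]))      # here T1 ∩ T2 = {0}, so this is the error
    return np.array(errs)

z0 = np.array([2.0, 1.0])
err_normal = inertial_dr(0.0, 0.0, z0)           # normal DR
err_1step  = inertial_dr(0.3, 0.0, z0)           # 1-step inertial, a = 0.3
err_2step  = inertial_dr(0.6, -0.3, z0)          # 2-step inertial, a = 0.6, b = -0.3
```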


  7. Problems. Nesterov/FISTA achieve the worst-case optimal convergence rate. The performance of inertial schemes in general is not clear: no rate improvements. Generalisation of the inertial technique to first-order methods, or to fixed-point iterations in general, is achievable: guaranteed sequence convergence [Alvarez & Attouch '01], but NO acceleration guarantees unless stronger assumptions are imposed, e.g. strong convexity or Lipschitz smoothness. For a given method, e.g. Douglas–Rachford, the outcome of inertial/SOR is problem- and parameter-dependent. A general acceleration framework with acceleration guarantees is missing!

  8. Outline: Introduction; Trajectory of first-order methods; Adaptive acceleration via linear prediction; Relation with previous work; Numerical experiments; Conclusions.

  9. Why the difference? Trivial answer: because they are different problems and different methods... What happens when one problem can be solved by different methods? [Figures: trajectory of $\{x_k\}_{k\in\mathbb{N}}$ for gradient descent; trajectory of $\{z_k\}_{k\in\mathbb{N}}$ for normal DR, 1-step inertial DR and 2-step inertial DR.] Structure of the non-smooth optimisation problem $\Longrightarrow$ first-order method (FoM) $\Longrightarrow$ geometry/trajectory of the generated sequence. However, FoM are non-linear in general...

  21. Partial smoothness. Partly smooth function [Lewis '03]: $R$ is partly smooth at $x$ relative to a set $\mathcal{M}_x$ containing $x$ if $\partial R(x) \neq \emptyset$ and: Smoothness: $\mathcal{M}_x$ is a $C^2$-manifold and $R|_{\mathcal{M}_x}$ is $C^2$ near $x$. Sharpness: the tangent space $T_{\mathcal{M}_x}(x)$ equals $T_x := \mathrm{par}(\partial R(x))^{\perp}$. Continuity: $\partial R : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is continuous along $\mathcal{M}_x$ near $x$. Here $\mathrm{par}(C)$ is the subspace parallel to $C$, where $C \subset \mathbb{R}^n$ is a non-empty convex set. Examples: $\ell_1$, $\ell_{1,2}$, $\ell_\infty$ norms; nuclear norm; total variation. [Figure: graph of a partly smooth function $f(x)$ with its manifold $\mathcal{M}$.]
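As a concrete instance of the definition (a standard fact, not spelled out on the slide), the $\ell_1$ norm is partly smooth at $x$ relative to the manifold of vectors sharing its support:

```latex
% R(x) = \|x\|_1 at x with support S = \operatorname{supp}(x); manifold of fixed support:
\mathcal{M}_x = \{\, u \in \mathbb{R}^n : \operatorname{supp}(u) = S \,\}, \qquad
R|_{\mathcal{M}_x}(u) = \textstyle\sum_{i \in S} \operatorname{sign}(x_i)\, u_i \ \text{ near } x
  \ \text{(linear, hence } C^2\text{)}.
% Sharpness: \partial R(x) = \{ v : v_i = \operatorname{sign}(x_i)\ (i\in S),\ |v_i| \le 1\ (i\notin S) \},
% so \mathrm{par}(\partial R(x)) = \{ v : v_S = 0 \}, and
T_x = \mathrm{par}\big(\partial R(x)\big)^{\perp}
    = \{\, v \in \mathbb{R}^n : v_i = 0 \ \text{for } i \notin S \,\}
    = T_{\mathcal{M}_x}(x).
```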

  22. Trajectory of first-order methods. Framework for analysing the local trajectory of a FoM (local convergence previously studied in [Liang et al '16]): first-order method (non-linear) ⇓ convergence & non-degeneracy: finite identification of $\mathcal{M}$ ⇓ local linearisation along $\mathcal{M}$: matrix $M$ (linear) ⇓ spectral properties of $M$ ⇓ local trajectory.


  27. Trajectory of first-order methods. Local linearisation of the FoM: $z_{k+1} - z_k = M(z_k - z_{k-1}) + o(\|z_k - z_{k-1}\|)$. FB: $M$ is similar to a symmetric matrix with real eigenvalues in $\left]-1, 1\right]$: straight-line trajectory. DR: if both functions are locally polyhedral, $M$ is a normal matrix with complex eigenvalues of the form $\cos(\theta)e^{\pm i\theta}$: logarithmic spiral trajectory. PD: if both functions are locally polyhedral, $M$ is, up to an orthogonal transform, a block-diagonal matrix composed of circular and elliptical rotations: elliptical spiral trajectory. NB: for DR/ADMM, if one term is locally $C^2$-smooth, a straight-line trajectory can be obtained under proper parameters.
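As an illustration of these trajectory types (a toy simulation of the local model above, not the authors' experiments), iterating $z_{k+1} - z_k = M(z_k - z_{k-1})$ with a DR-type matrix $M = \cos(\theta)R(\theta)$ (rotation by $\theta$ scaled by $\cos\theta$) traces a logarithmic spiral, while a symmetric $M$ with real eigenvalues gives an eventually straight line:

```python
import numpy as np

theta = 0.3
# DR-type local linearisation: normal matrix with eigenvalues cos(theta) * exp(±i*theta).
M_spiral = np.cos(theta) * np.array([[np.cos(theta), -np.sin(theta)],
                                     [np.sin(theta),  np.cos(theta)]])
# FB-type local linearisation: symmetric matrix with real eigenvalues in ]-1, 1].
M_line = np.diag([0.9, 0.3])

def local_model(M, z0, z1, iters=200):
    """Iterate the local model  z_{k+1} = z_k + M (z_k - z_{k-1})."""
    zs = [np.asarray(z0, float), np.asarray(z1, float)]
    for _ in range(iters):
        zs.append(zs[-1] + M @ (zs[-1] - zs[-2]))
    return np.array(zs)

traj_spiral = local_model(M_spiral, [1.0, 0.0], [0.9, 0.1])  # spirals into its limit point
traj_line   = local_model(M_line,   [1.0, 0.0], [0.9, 0.1])  # tail aligns with the dominant eigenvector
```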

  28. Trajectory of first-order methods. [Figure: numerical illustration over the iterations, one panel per method (Forward–Backward, Douglas–Rachford/ADMM, Primal–Dual), plotting the quantities $1 - \cos(\theta_k)$, $\cos(\psi) - \cos(\theta_k)$ and $\cos(\theta_k)$.]

  29. Failure of inertial. Consider the LASSO for a random Gaussian matrix $A \in \mathbb{R}^{m\times n}$ with $m < n$: $\min_{x \in \mathbb{R}^n} \|x\|_1 + \tfrac12\|Ax - f\|_2^2$. Solving using DR with $\gamma = 0.9/\|A\|^2$. [Figure: convergence and trajectory of the iterates.] Eventual trajectory: straight line when $\gamma < \|A\|^{-2}$.

  30. Failure of inertial. Consider the LASSO for a random Gaussian matrix $A \in \mathbb{R}^{m\times n}$ with $m < n$: $\min_{x \in \mathbb{R}^n} \|x\|_1 + \tfrac12\|Ax - f\|_2^2$. Solving using DR with $\gamma = 10/\|A\|^2$. [Figure: convergence and trajectory of the iterates.] Eventual trajectory: the linearisation matrix may have a complex leading eigenvalue if $\gamma \geq \|A\|^{-2}$.
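A minimal sketch of this experiment (one standard splitting, $g = \|\cdot\|_1$ with soft-thresholding prox and $h = \tfrac12\|A\cdot - f\|_2^2$ with a linear-solve prox; dimensions, seed and function names are illustrative assumptions, not the authors' code):

```python
import numpy as np

def dr_lasso(A, f, gamma, iters=2000):
    """Douglas--Rachford on  min_x ||x||_1 + 0.5 ||Ax - f||^2  (a sketch)."""
    m, n = A.shape
    H = np.eye(n) + gamma * (A.T @ A)                  # prox of the quadratic: (I + γ AᵀA) x = v + γ Aᵀf
    Atf = A.T @ f
    prox_h = lambda v: np.linalg.solve(H, v + gamma * Atf)
    prox_g = lambda v: np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)   # soft-thresholding

    z = np.zeros(n)
    errs = []
    for _ in range(iters):
        x = prox_h(z)
        y = prox_g(2 * x - z)
        z = z + y - x                                  # DR update of the governing variable z_k
        errs.append(np.linalg.norm(y - x))             # fixed-point residual as a convergence proxy
    return x, np.array(errs)

rng = np.random.default_rng(0)
m, n = 40, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)
x0 = np.where(rng.random(n) < 0.1, rng.standard_normal(n), 0.0)   # sparse ground truth
f = A @ x0
nrmA2 = np.linalg.norm(A, 2) ** 2
x_line,   _ = dr_lasso(A, f, gamma=0.9 / nrmA2)    # straight-line regime (γ < 1/||A||²)
x_spiral, _ = dr_lasso(A, f, gamma=10.0 / nrmA2)   # spiralling regime (γ ≥ 1/||A||²)
```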

  31. Outline: Introduction; Trajectory of first-order methods; Adaptive acceleration via linear prediction; Relation with previous work; Numerical experiments; Conclusions.


  39. Linear prediction: illustration. Idea: given past points $\{z_{k-j}\}_{j=0}^{q+1}$, how to predict $z_{k+1}$? Define $v_{k-j} := z_{k-j} - z_{k-j-1}$ for $j = 0, \dots, q$. Fit the past directions $v_{k-1}, \dots, v_{k-q}$ to the latest direction $v_k$: $c_k := \mathrm{argmin}_{c \in \mathbb{R}^q} \|\sum_{j=1}^q c_j v_{k-j} - v_k\|$. Let $\bar z_{k,1} := z_k + \sum_{j=1}^q c_{k,j} v_{k-j+1}$. Repeat on $\{z_{k-j}\}_{j=0}^{q} \cup \{\bar z_{k,1}\}$, and so on. The $s$-step extrapolation is $\bar z_{k,s} = z_k + E_{s,q,k}$, where $E_{s,q,k} = \sum_{j=1}^q \hat c_j v_{k-j+1}$, $\hat c := \big(\sum_{j=1}^s H(c_k)^j\big)(:,1)$, and $H(c_k)$ is the $q\times q$ matrix whose first column is $c_k$ and whose remaining columns are $\big[\begin{smallmatrix}\mathrm{Id}_{q-1}\\ 0_{1,q-1}\end{smallmatrix}\big]$.

  40. Adaptive acceleration for FoM (A²FoM). Given a first-order method $z_{k+1} = F(z_k)$. A²FoM via linear prediction: let $s \geq 1$, $q \geq 1$ be integers; let $z_0 \in \mathbb{R}^n$, $\bar z_0 = z_0$, and set $D = 0 \in \mathbb{R}^{n\times(q+1)}$. For $k \geq 1$: $z_k = F(\bar z_{k-1})$, $v_k = z_k - z_{k-1}$, $D = [v_k, D(:, 1\!:\!q)]$. If $\mathrm{mod}(k, q+2) = 0$: compute $c$ and $H_c$; if $\rho(H_c) < 1$, set $\bar z_k = z_k + V_k \big(\sum_{i=1}^s H_c^i\big)(:,1)$; else $\bar z_k = z_k$. If $\mathrm{mod}(k, q+2) \neq 0$: $\bar z_k = z_k$.
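A compact sketch of the scheme above (my own transcription into code, with illustrative defaults; $F$ is any fixed-point map, e.g. one DR step):

```python
import numpy as np

def a2fom(F, z0, q=5, s=20, iters=500):
    """Adaptive acceleration via linear prediction, following the scheme above
    (a sketch; names and defaults are illustrative, not the authors' code)."""
    n = z0.size
    z_prev = z0.copy()
    z_bar = z0.copy()
    V = np.zeros((n, q + 1))                            # difference vectors v_k, v_{k-1}, ..., v_{k-q}
    for k in range(1, iters + 1):
        z = F(z_bar)
        V = np.column_stack([z - z_prev, V[:, :q]])     # shift in the newest direction v_k
        z_prev = z
        z_bar = z                                       # default: no extrapolation
        if k % (q + 2) == 0 and k > q + 1:
            # Fit past directions to the latest one: c = argmin ||[v_{k-1},...,v_{k-q}] c - v_k||.
            c = np.linalg.lstsq(V[:, 1:], V[:, 0], rcond=None)[0]
            # Companion-type matrix H(c): first column c, remaining columns [Id_{q-1}; 0].
            H = np.zeros((q, q))
            H[:, 0] = c
            H[:q - 1, 1:] = np.eye(q - 1)
            if np.max(np.abs(np.linalg.eigvals(H))) < 1:          # only extrapolate if rho(H_c) < 1
                S = sum(np.linalg.matrix_power(H, i) for i in range(1, s + 1))
                z_bar = z + V[:, :q] @ S[:, 0]                    # s-step linear prediction
    return z
```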


  43. Adaptive acceleration for FoM (A²FoM). Simplification: if $\rho(H_c) < 1$, the Neumann series converges, $\sum_{i=0}^{+\infty} H_c^i = (\mathrm{Id} - H_c)^{-1}$. For the finite sum, $\sum_{i=1}^s H_c^i = (H_c - H_c^{s+1})(\mathrm{Id} - H_c)^{-1}$. When $s = +\infty$, $\sum_{i=1}^{+\infty} H_c^i = H_c(\mathrm{Id} - H_c)^{-1} = (\mathrm{Id} - H_c)^{-1} - \mathrm{Id}$.
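The finite-sum formula is the usual geometric-series telescoping; spelled out:

```latex
(\mathrm{Id}-H_c)\sum_{i=1}^{s}H_c^{\,i}
  \;=\; \sum_{i=1}^{s}H_c^{\,i} - \sum_{i=2}^{s+1}H_c^{\,i}
  \;=\; H_c - H_c^{\,s+1}
\quad\Longrightarrow\quad
\sum_{i=1}^{s}H_c^{\,i} \;=\; \big(H_c - H_c^{\,s+1}\big)\,(\mathrm{Id}-H_c)^{-1},
```

and since $\rho(H_c) < 1$ implies $H_c^{\,s+1} \to 0$, letting $s \to +\infty$ recovers $H_c(\mathrm{Id}-H_c)^{-1} = (\mathrm{Id}-H_c)^{-1} - \mathrm{Id}$.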


  48. Adaptive acceleration for FoM (A²FoM). Remarks: We extrapolate every $(q+2)$ iterations. The linear prediction is applied only when $\rho(H_c) < 1$. Extra memory cost: $n \times (q+1)$ (the matrix of difference vectors); usually $q \leq 10$. Extra computation cost: $q^2 n$, from the pseudo-inverse $V_{k-1}^{+}$. Global convergence can be obtained by treating the extrapolation as a perturbation error [Alvarez & Attouch '01], i.e. $z_{k+1} = F(z_k + \epsilon_k)$. Weighted LP: $\bar z_k = z_k + a_k V_k \big(\sum_{i=1}^s H_c^i\big)(:,1)$, with $a_k$ updated online.


  52. Example: Douglas–Rachford, continued. [Figures: trajectories of normal DR and of linear prediction (LP) with $s = 4$, $s = 25$ and $s = +\infty$.]

  53. Outline: Introduction; Trajectory of first-order methods; Adaptive acceleration via linear prediction; Relation with previous work; Numerical experiments; Conclusions.


  55. Convergence acceleration. Given a sequence $\{z_k\}_{k\in\mathbb{N}}$ which converges to $z^\star$, can we generate another sequence $\{\bar z_k\}_{k\in\mathbb{N}}$ such that $\|\bar z_k - z^\star\| = o(\|z_k - z^\star\|)$? This is called convergence acceleration and is well established in numerical analysis: 1927, Aitken's $\Delta^2$-process; 1965, Anderson acceleration; 1970s, vector extrapolation techniques such as minimal polynomial extrapolation (MPE) and reduced rank extrapolation (RRE) [Sidi '17]; now, regularised non-linear acceleration (RNA), a regularised version of RRE introduced by [Scieur, d'Aspremont, Bach '16].


  58. Vector extrapolation techniques. Polynomial extrapolation [Cabay & Jackson '76]: consider $z_{k+1} = Mz_k + d$ with $\rho(M) < 1$, so that $z_k \to z^\star$. Then $z_k - z^\star = M(z_{k-1} - z^\star) = M^k(z_0 - z^\star)$. If $P(\lambda) = \sum_{j=0}^q c_j \lambda^j$ is the minimal polynomial of $M$ w.r.t. $z_0 - z^\star$, that is $P(M)(z_0 - z^\star) = \sum_{j=0}^q c_j M^j (z_0 - z^\star) = 0$, then $z^\star = \big(\sum_{j=0}^q c_j z_j\big) \big/ \big(\sum_j c_j\big)$. The coefficients $c$ can be computed without knowledge of $z^\star$: $c_q = 1$ and $V_q\, c(0\!:\!q-1) = -v_{q+1}$, where $V_q = [v_1 \,|\, v_2 \,|\, \cdots \,|\, v_q]$ and $v_j = z_j - z_{j-1}$.
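A small numerical check of this identity (toy linear iteration; matrix, dimensions and seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
q = n                                          # the minimal polynomial has degree at most n
G = rng.standard_normal((n, n))
M = 0.5 * G / np.linalg.norm(G, 2)             # spectral norm 0.5, so rho(M) < 1
d = rng.standard_normal(n)
z_star = np.linalg.solve(np.eye(n) - M, d)     # fixed point of z -> M z + d

# Generate z_0, ..., z_{q+1} and the differences v_j = z_j - z_{j-1}.
Z = [rng.standard_normal(n)]
for _ in range(q + 1):
    Z.append(M @ Z[-1] + d)
V = np.column_stack([Z[j] - Z[j - 1] for j in range(1, q + 2)])   # v_1, ..., v_{q+1}

# Solve V_q c(0:q-1) = -v_{q+1} with c_q = 1, then form the weighted combination.
c = np.append(np.linalg.lstsq(V[:, :q], -V[:, q], rcond=None)[0], 1.0)
z_bar = sum(c[j] * Z[j] for j in range(q + 1)) / c.sum()
print(np.linalg.norm(z_bar - z_star))          # ~ machine precision: z_bar recovers z_star
```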


  61. Vector extrapolation techniques. Vector Extrapolation Methods with Applications (SIAM, 2017) by Avram Sidi. Given a sequence generated by $z_k = F(z_{k-1})$. Minimal polynomial extrapolation (MPE), starting from $z_0 = \bar z$: S.1 Generate points $\{z_j\}_{j=0}^{q+1}$ and let $v_j = z_j - z_{j-1}$. S.2 Let $c \in \mathbb{R}^{q+1}$ be such that $c_q = 1$ and $V_q\, c(0\!:\!q-1) = -v_{q+1}$, where $V_q = [v_1 \,|\, \cdots \,|\, v_q]$; for $j \in [0, q]$, set $\tilde c_j := c_j / \big(\sum_{i=0}^q c_i\big)$. S.3 $\bar z := \sum_{j=0}^q \tilde c_j z_j$. Reduced rank extrapolation (RRE) [Anderson '65; Kaniel & Stein '74; Eddy '79; Mešina '77]: replace step S.2 by $\tilde c \in \mathrm{argmin}_c \|V_{q+1} c\|$ subject to $\mathbf{1}^T c = 1$. NB: LP is equivalent to MPE with S.3 replaced by $\bar z := \sum_{j=0}^q \tilde c_j z_{j+1}$.
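A minimal sketch of the RRE coefficient solve (the equality-constrained least squares is solved via its KKT system; the small ridge term is my own safeguard, not part of the method as stated):

```python
import numpy as np

def rre_point(Z):
    """Reduced rank extrapolation from points Z = [z_0, ..., z_{q+1}] (a sketch).

    Solves  min_c ||V_{q+1} c||  s.t.  1^T c = 1,  where V_{q+1} stacks the
    differences v_j = z_j - z_{j-1}, and returns the combination sum_j c_j z_j.
    """
    Z = np.asarray(Z, dtype=float)              # shape (q+2, n)
    V = np.diff(Z, axis=0).T                    # columns v_1, ..., v_{q+1}
    G = V.T @ V                                 # Gram matrix of the differences
    ones = np.ones(G.shape[0])
    w = np.linalg.solve(G + 1e-12 * np.eye(G.shape[0]), ones)   # c ∝ G^{-1} 1 (KKT)
    c = w / w.sum()
    return c @ Z[:-1]                           # weighted combination of z_0, ..., z_q
```

With Z generated by q+2 steps of a fixed-point map, `rre_point(Z)` gives the extrapolated iterate; MPE differs only in how the coefficients are computed (step S.2).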


  65. Regularised non-linear acceleration (RNA). [Scieur, d'Aspremont, Bach '16] proposed a regularised version of RRE for the case $z_{k+1} - z^\star = A(z_k - z^\star) + O(\|z_k - z^\star\|^2)$, where $A$ is symmetric with $0 \preceq A \preceq \sigma\,\mathrm{Id}$, $\sigma < 1$. To deal with the possible ill-conditioning of $V_q$, regularise with $\lambda > 0$: $\tilde c \in \mathrm{Argmin}_c\, c^T(V_q^T V_q + \lambda\,\mathrm{Id})c$ subject to $\mathbf{1}^T c = 1$. In practice, a grid search on the objective is used to find the best $\lambda \in [\lambda_{\min}, \lambda_{\max}]$. The angle between $z_k - z_{k-1}$ and $z_{k+1} - z_k$ converges to zero; intuitively, this is the regime where standard inertial works well...
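A sketch of the regularised coefficient solve with a grid search over $\lambda$; the selection criterion below (fixed-point residual at the extrapolated point) is my assumption, since the slide only says a grid search is used:

```python
import numpy as np

def rna_point(Z, F, lambdas=np.logspace(-12, -2, 11)):
    """Regularised non-linear acceleration (sketch): for each lambda, solve
    min_c c^T (V^T V + lambda*Id) c  s.t.  1^T c = 1, and keep the best candidate."""
    Z = np.asarray(Z, dtype=float)              # rows z_0, ..., z_{q+1}
    V = np.diff(Z, axis=0).T                    # difference vectors as columns
    G = V.T @ V
    ones = np.ones(G.shape[0])
    best_res, best_z = np.inf, None
    for lam in lambdas:
        w = np.linalg.solve(G + lam * np.eye(G.shape[0]), ones)
        c = w / w.sum()                         # normalised KKT solution
        z_cand = c @ Z[:-1]                     # candidate extrapolation
        res = np.linalg.norm(F(z_cand) - z_cand)        # assumed grid-search criterion
        if res < best_res:
            best_res, best_z = res, z_cand
    return best_z
```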


  70. Acceleration guarantees. We have local acceleration guarantees thanks to results on MPE and RRE [Sidi '98]: when $z_{k+1} - z_k = M(z_k - z_{k-1})$, $\|\bar z_{k,s} - z^\star\| \leq \|z_{k+s} - z^\star\| + B\,\epsilon_k$, where $\epsilon_k = \|V_{k-1} c - v_k\|$ and $B := \sum_{\ell=1}^s \|M^\ell\| \,\big|\sum_{i=0}^{s-\ell} (H_c^i)_{(1,1)}\big|$. Asymptotic bound ($k \to \infty$): $\epsilon_k = O(|\lambda_{q+1}|^k)$, where $\lambda_{q+1}$ is the $(q+1)$-th largest eigenvalue; without extrapolation, we just have $O(|\lambda_1|^k)$. Non-asymptotic bound: if $\Sigma(M) \subset [\alpha, \beta]$ with $-1 < \alpha < \beta < 1$, then $B\,\epsilon_k \leq K \beta^{k-q} \big(\tfrac{\sqrt{\eta}-1}{\sqrt{\eta}+1}\big)^q$, where $\eta = \tfrac{1-\alpha}{1-\beta}$. For PD and DR with polyhedral functions: guaranteed acceleration with $q = 2$.


  72. Our contributions. We tackle the non-smoothness of the methods using partial smoothness and give insight into why vector extrapolation techniques work. Our acceleration is derived via the sequence trajectory; there is only a minor difference in the final form.
