Robust Control for Analysis and Design of Large-Scale Optimization Algorithms

Laurent Lessard, University of Wisconsin–Madison
Joint work with Ben Recht and Andy Packard

LCCC Workshop on Large-Scale and Distributed Optimization
Lund University, June 15, 2017
1. Many algorithms can be viewed as dynamical systems with feedback (control systems!): algorithm convergence ⟺ system stability.
2. By solving a small convex program, we can recover state-of-the-art convergence results for these algorithms, automatically and efficiently.
3. The ultimate goal: to move from analysis to design.
Unconstrained optimization:

minimize_{x ∈ R^N}  f(x)

• need algorithms that are fast and simple
• currently favored family: first-order methods
Gradient method:
x_{k+1} = x_k − α ∇f(x_k)

Heavy ball method:
x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1})

Nesterov's accelerated method:
y_k = x_k + β (x_k − x_{k−1})
x_{k+1} = y_k − α ∇f(y_k)

[Figures: error vs. iteration count for each method, and the iterates x_0, x_1, ... plotted over the contours of a quadratic f(x)]
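To make the three update rules concrete, here is a minimal sketch (not from the slides): all three methods applied to an illustrative quadratic with m = 1 and L = 10; the step size α and momentum β are arbitrary example values, not tuned.

```python
import numpy as np

# Illustrative quadratic: f(x) = 0.5 x^T Q x, with m = 1, L = 10
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
alpha, beta = 0.1, 0.5   # example step and momentum parameters

def gradient_method(x0, iters=80):
    x = x0.copy()
    for _ in range(iters):
        x = x - alpha * grad(x)
    return x

def heavy_ball(x0, iters=80):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

def nesterov(x0, iters=80):
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * grad(y), x
    return x

x0 = np.array([1.0, 1.0])
for method in (gradient_method, heavy_ball, nesterov):
    print(method.__name__, np.linalg.norm(method(x0)))  # distance to x* = 0
```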
Robust algorithm selection

G ∈ 𝒢: algorithm we're going to use
f ∈ 𝒮: function we'd like to minimize

G_opt = argmin_{G ∈ 𝒢} max_{f ∈ 𝒮} cost(f, G)

Similar problem for a finite number of iterations:
• Drori, Teboulle (2012)
• Taylor, Hendrickx, Glineur (2016)
Gradient method:
x_{k+1} = x_k − α ∇f(x_k)

Heavy ball method:
x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1})

Nesterov's accelerated method:
x_{k+1} = x_k − α ∇f(x_k + β (x_k − x_{k−1})) + β (x_k − x_{k−1})

Analytically solvable case — quadratic functions:
f(x) = (1/2) x^T Q x − p^T x,  f ∈ 𝒮 with the constraint mI ⪯ Q ⪯ LI
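Why the quadratic case is analytically solvable (a standard argument, spelled out here for reference): when f is quadratic, ∇f(x) = Qx − p is linear, so each method becomes a linear recursion whose worst-case convergence rate is the spectral radius of a closed-loop matrix, maximized over mI ⪯ Q ⪯ LI. A sketch for heavy ball, using the usual quadratic-optimal tuning as an illustrative choice:

```python
import numpy as np

# For quadratic f(x) = 0.5 x^T Q x - p^T x, heavy ball is the linear
# recursion [x_{k+1}; x_k] = T [x_k; x_{k-1}]; its rate is the
# spectral radius of T.
def heavy_ball_rate(Q, alpha, beta):
    n = Q.shape[0]
    I = np.eye(n)
    T = np.block([[(1 + beta) * I - alpha * Q, -beta * I],
                  [I, np.zeros((n, n))]])
    return max(abs(np.linalg.eigvals(T)))

m, L = 1.0, 10.0
# standard heavy-ball tuning for quadratics (illustrative)
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)) ** 2
print(heavy_ball_rate(np.diag([m, L]), alpha, beta))
# ≈ (sqrt(L/m) - 1) / (sqrt(L/m) + 1)
```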
Convergence rate on quadratic functions

[Figures: convergence rate ρ and iterations to convergence vs. condition ratio L/m, for the gradient, Nesterov, and heavy ball methods on quadratics]

Convergence rate: ‖x_k − x⋆‖ ≤ C ρ^k ‖x_0 − x⋆‖

Iterations to convergence ∝ 1/(−log ρ), since reaching ρ^k ≤ ε requires k ≥ log(1/ε)/(−log ρ).
Robust algorithm selection

G ∈ 𝒢: algorithm we're going to use
f ∈ 𝒮: function we'd like to minimize

G_opt = argmin_{G ∈ 𝒢} max_{f ∈ 𝒮} cost(f, G)

1. mathematical representation for 𝒢
2. mathematical representation for 𝒮
3. main robustness result
Dynamical system interpretation

Heavy ball: x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1})

Define u_k := ∇f(x_k) and p_k := x_{k−1}.

Algorithm (linear, known, decoupled):
[x_{k+1}; p_{k+1}] = [(1+β)I, −βI; I, 0] [x_k; p_k] + [−αI; 0] u_k
y_k = [I, 0] [x_k; p_k]

Function (nonlinear, uncertain, coupled):
u_k = ∇f(y_k)
Dynamical system interpretation

Heavy ball: x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1})

Define u_k := ∇f(x_k) and p_k := x_{k−1}.

Algorithm (linear, known, decoupled) — identical dynamics in each coordinate i = 1, ..., N:
[(x_{k+1})_i; (p_{k+1})_i] = [1+β, −β; 1, 0] [(x_k)_i; (p_k)_i] + [−α; 0] (u_k)_i
(y_k)_i = [1, 0] [(x_k)_i; (p_k)_i]

Function (nonlinear, uncertain, coupled):
u_k = ∇f(y_k)
G:
ξ_{k+1} = A ξ_k + B u_k
y_k = C ξ_k

∇f:
u_k = ∇f(y_k)

State-space matrices [A, B; C, 0] for each method:

Gradient:    A = 1,                B = −α,       C = 1
Heavy ball:  A = [1+β, −β; 1, 0],  B = [−α; 0],  C = [1, 0]
Nesterov:    A = [1+β, −β; 1, 0],  B = [−α; 0],  C = [1+β, −β]
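A minimal sketch of this representation (function names and test gradient are mine, not from the talk): build (A, B, C) for each method and simulate the feedback loop ξ_{k+1} = Aξ_k + Bu_k, y_k = Cξ_k, u_k = ∇f(y_k).

```python
import numpy as np

def method_matrices(name, alpha, beta=0.0):
    """(A, B, C) from the table above; the state is xi_k = (x_k, x_{k-1})
    for the two-state methods."""
    if name == "gradient":
        return np.array([[1.0]]), np.array([[-alpha]]), np.array([[1.0]])
    A = np.array([[1 + beta, -beta], [1.0, 0.0]])
    B = np.array([[-alpha], [0.0]])
    C = {"heavy_ball": np.array([[1.0, 0.0]]),
         "nesterov":   np.array([[1 + beta, -beta]])}[name]
    return A, B, C

def run(A, B, C, grad, xi0, iters=100):
    """Simulate the loop: xi_{k+1} = A xi_k + B u_k, y_k = C xi_k,
    u_k = grad(y_k). One coordinate suffices since dynamics decouple."""
    xi = xi0.copy()
    for _ in range(iters):
        y = C @ xi
        xi = A @ xi + B @ grad(y)
    return xi

A, B, C = method_matrices("nesterov", alpha=0.1, beta=0.5)
print(run(A, B, C, grad=lambda y: 5.0 * y, xi0=np.ones((2, 1))))
```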
Representing function classes

[Figure: three sketches of the graph of ∇f(x) vs. x]

∇f(x): linear  ⊂  sector bounded + slope restricted  ⊂  sector bounded
f(x): quadratic  ⊂  strongly convex + Lipschitz gradients  ⊂  radially quasiconvex
Representing function classes: express what we know about ∇f as quadratic constraints on (y, u).

∇f is a passive function: y_k^T u_k ≥ 0

[Figure: sector-bounded nonlinearity, u_k vs. y_k]
Representing function classes: express as quadratic constraints on (y, u).

∇f is sector-bounded on [m, L]:

[y_k; u_k]^T [−2mL, m+L; m+L, −2] [y_k; u_k] ≥ 0

[Figure: the graph of ∇f confined between lines of slope m and L]
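Not stated on the slide, but expanding the quadratic form shows it is exactly the sector condition (taking x⋆ = 0 so that ∇f(0) = 0):

[y_k; u_k]^T [−2mL, m+L; m+L, −2] [y_k; u_k] = −2mL y_k² + 2(m+L) y_k u_k − 2 u_k² = 2 (u_k − m y_k)(L y_k − u_k) ≥ 0,

which holds precisely when u_k = ∇f(y_k) lies between the lines of slope m and L.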
Representing function classes: express as quadratic constraints on (y, u).

∇f is sector-bounded + slope-restricted: the constraint on (y_k, u_k) depends on the history (y_0, ..., y_{k−1}, u_0, ..., u_{k−1}).

[Figure: the graph of ∇f confined to the [m, L] sector, with slope also restricted]
Introduce extra dynamics

[Block diagram: the signals (y, u) are filtered through dynamics Ψ with state ζ, producing output z]

• Design the dynamics Ψ and a multiplier matrix M.
• Instead of using a pointwise constraint q(u_k, y_k), use z_k^T M z_k.
• Systematic way of doing this for strong convexity via Zames–Falb multipliers (1968).
• General theory: Integral Quadratic Constraints (Megretski & Rantzer, 1997)
Putting it together

G: the state-space model (A, B, C) from the table above (gradient, heavy ball, or Nesterov), in feedback with ∇f.

What we know about f — quadratic ⊂ strongly convex ⊂ quasiconvex — is encoded by the choice of (Ψ, M).
Main result

Problem data:
• G (the algorithm)
• Ψ (what we know about f)

Auxiliary quantities:
• Compute the (Â, B̂, Ĉ, D̂) matrices from (G, Ψ).
• Choose a candidate rate 0 < ρ < 1.

If there exists P ≻ 0 such that

[Â^T P Â − ρ² P, Â^T P B̂; B̂^T P Â, B̂^T P B̂] + [Ĉ D̂]^T M [Ĉ D̂] ⪯ 0

then ‖x_k − x⋆‖ ≤ √cond(P) · ρ^k ‖x_0 − x⋆‖ for all k.

The size of the LMI does not grow with the problem dimension! e.g., P ∈ S^{3×3}, LMI ∈ S^{4×4}.
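A minimal sketch of how one might check this LMI numerically (my code, not the authors'; it assumes cvxpy with an SDP-capable solver such as SCS, uses the static sector IQC — so Â = A, B̂ = B, and z_k simply stacks y_k and u_k — and treats the scalar per-coordinate model, which suffices since the dynamics decouple):

```python
import numpy as np
import cvxpy as cp

def certifies_rate(A, B, C, m, L, rho):
    """Feasibility of the main-result LMI for a candidate rate rho,
    with the static sector IQC for m-strongly convex f having
    L-Lipschitz gradients."""
    n = A.shape[0]
    P = cp.Variable((n, n), symmetric=True)
    lam = cp.Variable(nonneg=True)  # nonnegative IQC multiplier
    M = np.array([[-2 * m * L, m + L], [m + L, -2.0]])
    CD = np.block([[C, np.zeros((1, 1))],      # z_k = (y_k, u_k)
                   [np.zeros((1, n)), np.eye(1)]])
    lmi = cp.bmat([[A.T @ P @ A - rho**2 * P, A.T @ P @ B],
                   [B.T @ P @ A, B.T @ P @ B]]) + lam * (CD.T @ M @ CD)
    lmi = 0.5 * (lmi + lmi.T)  # symmetrize so the solver accepts the constraint
    prob = cp.Problem(cp.Minimize(0),
                      [P >> 1e-6 * np.eye(n), lmi << 0])
    prob.solve(solver=cp.SCS)
    return prob.status in ("optimal", "optimal_inaccurate")

# Bisect for the best certified rate: gradient method with alpha = 1/L
m, L = 1.0, 10.0
A, B, C = np.array([[1.0]]), np.array([[-1.0 / L]]), np.array([[1.0]])
lo, hi = 0.0, 1.0
while hi - lo > 1e-3:
    mid = 0.5 * (lo + hi)
    if certifies_rate(A, B, C, m, L, mid):
        hi = mid
    else:
        lo = mid
print(hi)  # expect roughly 1 - m/L = 0.9
```

Richer function classes would replace this static (Ψ, M) pair with Zames–Falb dynamics, enlarging (Â, B̂, Ĉ, D̂) but leaving the LMI small.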
Main results: analytic and numerical
Gradient method

x_{k+1} = x_k − α ∇f(x_k)

[Figures: convergence rate and iterations to convergence vs. condition ratio L/m; the gradient bound (all functions) plotted against the quadratic-case Nesterov and heavy ball curves]

Analytic solution! Same rate for: quadratics, strongly convex, or quasiconvex functions.
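For reference, the analytic solution in question is the standard one (not spelled out on the slide):

ρ = max{ |1 − αm|, |1 − αL| },

which is minimized at α = 2/(m + L), giving ρ = (L − m)/(L + m) = (κ − 1)/(κ + 1) with κ = L/m.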
Nesterov's method

x_{k+1} = x_k − α ∇f(x_k + β (x_k − x_{k−1})) + β (x_k − x_{k−1})

[Figures: rate bounds and iterations to convergence vs. condition ratio L/m, comparing the IQC bounds (quasiconvex, strongly convex) with Nesterov's own bounds (strongly convex, quadratic) and heavy ball (quadratic)]

• Cannot certify stability for quasiconvex functions
• IQC bound improves upon the best known bound!
Heavy ball method

x_{k+1} = x_k − α ∇f(x_k) + β (x_k − x_{k−1})

[Figures: rate bounds and iterations to convergence vs. condition ratio L/m; IQC bounds compared with the quadratic-case Nesterov and heavy ball rates]

• Cannot certify stability for quasiconvex functions
• Cannot certify stability for strongly convex functions
The heavy ball method is not stable!

Counterexample:

f(x) = { (25/2) x²,             x < 1
       { (1/2) x² + 24x − 12,   1 ≤ x < 2
       { (25/2) x² − 24x + 36,  x ≥ 2

and start the heavy ball iteration at x_0 = x_1 ∈ [3.07, 3.46].

• L/m = 25
• heavy ball iterations converge to a limit cycle
• simple counterexample to the Aizerman (1949) and Kalman (1957) conjectures

[Figure: plot of f(x) with the limit-cycle iterates]
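A quick numerical check of the counterexample (a sketch; I am assuming the standard quadratic-optimal heavy-ball tuning α = 4/(√L + √m)² = 1/9 and β = ((√κ − 1)/(√κ + 1))² = 4/9 for κ = 25, since the slide does not restate the parameters):

```python
import numpy as np

def grad_f(x):
    # Gradient of the piecewise quadratic above (continuous; m = 1, L = 25)
    if x < 1:
        return 25.0 * x
    elif x < 2:
        return x + 24.0
    else:
        return 25.0 * x - 24.0

m, L = 1.0, 25.0
kappa = L / m
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2               # = 1/9
beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2  # = 4/9

x_prev = x = 3.3   # any start in [3.07, 3.46]
history = []
for k in range(500):
    x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    history.append(x)

print(history[-6:])  # iterates settle into a cycle, not at x* = 0
```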
Uncharted territory: noise robustness and algorithm design
Noise robustness

[Block diagram: G in feedback with ∇f, with an uncertain block Δ_δ inserted between w and u]

The Δ_δ block is uncertain — multiplicative noise:
‖u_k − w_k‖ ≤ δ ‖w_k‖,  where w_k = ∇f(y_k)

How does an algorithm perform in the presence of noise?
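One way to read the noise model (a sketch with my own design choices: the perturbation direction is random, and the oracle may return any point in the δ-ball around the true gradient):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(grad, y, delta):
    """Multiplicative-noise oracle: returns u with ||u - w|| <= delta ||w||,
    where w = grad(y)."""
    w = grad(y)
    d = rng.standard_normal(w.shape)
    d *= delta * np.linalg.norm(w) / max(np.linalg.norm(d), 1e-12)
    return w + rng.uniform() * d   # a point in the delta-ball around w

# Gradient descent on a quadratic with 20% relative gradient error
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x
x = np.ones(2)
for _ in range(200):
    x = x - 0.1 * noisy_grad(grad, x, delta=0.2)
print(np.linalg.norm(x))
```

Because the error is relative (it scales with ‖w_k‖ and so vanishes as ∇f → 0), one would expect gradient descent to still converge for small δ, only at a degraded rate — which is what a noise-robustness analysis would quantify.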