How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

Frank E. Curtis, Lehigh University
joint work with Daniel P. Robinson, Johns Hopkins University

U.S.-Mexico Workshop on Optimization and its Applications
8 January 2018
Thanks, Don!
Outline

- Motivation
- Contemporary Analyses
- Partitioning the Search Space
- Behavior of Regularization Methods
- Summary & Perspectives
Motivation
History

Nonlinear optimization has had parallel developments:

- convexity (Rockafellar, Fenchel, Nemirovski, Nesterov): the subgradient inequality; convergence and complexity guarantees
- smoothness (Powell, Fletcher, Goldfarb, Nocedal): sufficient decrease; convergence and fast local convergence

Worlds are (finally) colliding!
Worst-case complexity for nonconvex optimization

Here is how we do it now. Assuming Lipschitz continuity of derivatives, what is an upper bound on the number of iterations until $\|\nabla f(x_k)\|_2 \leq \epsilon$?

- Gradient descent: $\mathcal{O}(\epsilon^{-2})$
- Newton / trust region: $\mathcal{O}(\epsilon^{-2})$
- Cubic regularization: $\mathcal{O}(\epsilon^{-3/2})$
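To get a feel for the gap between these rates, one can tabulate the bounds for a few tolerances; a minimal sketch (constants are suppressed, so only the scaling in $\epsilon$ is meaningful):

```python
# Order-of-magnitude illustration of the iteration bounds above (constants dropped):
for eps in (1e-2, 1e-4, 1e-6):
    print(f"eps = {eps:g}:  gradient descent / trust region ~ {eps**-2:.0e},"
          f"  cubic regularization ~ {eps**-1.5:.0e}")
```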
Self-examination

But...
- Is this the best way to characterize our algorithms?
- Is this the best way to represent our algorithms?

People listen! Cubic regularization...
- Griewank (1981)
- Nesterov & Polyak (2006)
- Weiser, Deuflhard, Erdmann (2007)
- Cartis, Gould, Toint (2011), the ARC method

...is a framework to which researchers have been attracted:
- Agarwal, Allen-Zhu, Bullins, Hazan, Ma (2017)
- Carmon, Duchi (2017)
- Kohler, Lucchi (2017)
- Peng, Roosta-Khorasani, Mahoney (2017)

However, there remains a large gap between theory and practice!
Purpose of this talk

Our goal: a complementary approach to characterize algorithms.
- global convergence
- worst-case complexity (contemporary type + our approach)
- local convergence rate

We're admitting: our approach does not give the complete picture. But we believe it is useful!

Nonconvexity is difficult in every sense!
- Can we accept a characterization strategy with some (literal) holes?
- Or should we be purists, even if we throw out the baby with the bathwater?
Contemporary Analyses
Simple setting

Consider the iteration

$$x_{k+1} \leftarrow x_k - \tfrac{1}{L} g_k \quad \text{for all } k \in \mathbb{N},$$

where $g_k := \nabla f(x_k)$ and $g := \nabla f$. A contemporary complexity analysis considers the set

$$\mathcal{G}(\epsilon_g) := \{x \in \mathbb{R}^n : \|g(x)\|_2 \leq \epsilon_g\}$$

and aims to find an upper bound on the cardinality of

$$\mathcal{K}_g(\epsilon_g) := \{k \in \mathbb{N} : x_k \notin \mathcal{G}(\epsilon_g)\}.$$
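As a concrete rendering of this setup, here is a minimal sketch (the function `count_outside_G` is a hypothetical helper, not from the talk) that runs the fixed-step iteration and counts iterates outside $\mathcal{G}(\epsilon_g)$. It stops at the first iterate inside $\mathcal{G}(\epsilon_g)$, which matches $|\mathcal{K}_g(\epsilon_g)|$ exactly only when the gradient norm does not later climb back above $\epsilon_g$ (true for the simple examples used below):

```python
import numpy as np

def count_outside_G(g, x0, L, eps_g, max_iter=10**6):
    """Run x_{k+1} = x_k - (1/L) g(x_k); count iterates with ||g(x_k)||_2 > eps_g."""
    x, count = np.atleast_1d(np.asarray(x0, dtype=float)), 0
    for _ in range(max_iter):
        if np.linalg.norm(g(x)) <= eps_g:
            break
        x = x - g(x) / L
        count += 1
    return count

# Example with g(x) = x (i.e., f(x) = x^2/2), x0 = 10, L = 2, eps_g = 1e-4:
print(count_outside_G(lambda x: x, 10.0, 2.0, 1e-4))  # -> 17
```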
Upper bound on $|\mathcal{K}_g(\epsilon_g)|$

Using $s_k = -\tfrac{1}{L} g_k$ and the upper bound $f_{k+1} \leq f_k + g_k^T s_k + \tfrac{L}{2} \|s_k\|_2^2$, one finds with $f_{\inf} := \inf_{x \in \mathbb{R}^n} f(x)$ that

$$f_k - f_{k+1} \geq \tfrac{1}{2L} \|g_k\|_2^2 \;\implies\; f_0 - f_{\inf} \geq \tfrac{1}{2L} |\mathcal{K}_g(\epsilon_g)|\, \epsilon_g^2 \;\implies\; |\mathcal{K}_g(\epsilon_g)| \leq 2L (f_0 - f_{\inf})\, \epsilon_g^{-2}.$$
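To see how loose this bound can be in practice, here is a quick check on a smooth nonconvex test function chosen purely for illustration (an assumption, not from the talk): $f(x) = x^2 + 3\sin^2(x)$, which has $f_{\inf} = 0$ and $|f''(x)| = |2 + 6\cos(2x)| \leq 8$, so $L = 8$ is a valid Lipschitz constant for $g$:

```python
import numpy as np

f = lambda x: x**2 + 3.0 * np.sin(x)**2        # smooth, nonconvex, f_inf = 0
g = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)  # f'(x)
L, x0, eps_g = 8.0, 10.0, 1e-4
x, observed = x0, 0
while abs(g(x)) > eps_g:                       # count iterates outside G(eps_g)
    x -= g(x) / L
    observed += 1
bound = 2.0 * L * (f(x0) - 0.0) / eps_g**2     # 2L (f_0 - f_inf) eps_g^{-2}
print(observed, f"{bound:.1e}")                # observed is tiny; bound is ~1.6e+11
```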
"Nice" $f$

But what if $f$ is "nice"? E.g., satisfying the Polyak-Łojasiewicz (PL) condition for $c \in (0, \infty)$, i.e.,

$$f(x) - f_{\inf} \leq \tfrac{1}{2c} \|g(x)\|_2^2 \quad \text{for all } x \in \mathbb{R}^n.$$

Now consider the set

$$\mathcal{F}(\epsilon_f) := \{x \in \mathbb{R}^n : f(x) - f_{\inf} \leq \epsilon_f\}$$

and consider an upper bound on the cardinality of

$$\mathcal{K}_f(\epsilon_f) := \{k \in \mathbb{N} : x_k \notin \mathcal{F}(\epsilon_f)\}.$$
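Note that the PL condition does not require convexity. For intuition, one can numerically probe the best PL constant of a given function by minimizing $\|g(x)\|_2^2 / (2(f(x) - f_{\inf}))$ over a grid; a sketch for the illustrative test function above (a known nonconvex PL example; see, e.g., Karimi, Nutini, Schmidt (2016)):

```python
import numpy as np

# The PL condition demands c <= ||g(x)||^2 / (2 (f(x) - f_inf)) for all x,
# so the infimum of that ratio estimates the best valid constant c.
f = lambda x: x**2 + 3.0 * np.sin(x)**2
g = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
xs = np.linspace(-30.0, 30.0, 600001)
xs = xs[np.abs(xs) > 1e-6]                 # avoid 0/0 at the minimizer x = 0
print((g(xs)**2 / (2.0 * f(xs))).min())    # grid estimate of the best c
```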
Upper bound on $|\mathcal{K}_f(\epsilon_f)|$

Using $s_k = -\tfrac{1}{L} g_k$ and the upper bound $f_{k+1} \leq f_k + g_k^T s_k + \tfrac{L}{2} \|s_k\|_2^2$, one finds that

$$f_k - f_{k+1} \geq \tfrac{1}{2L} \|g_k\|_2^2 \geq \tfrac{c}{L} (f_k - f_{\inf})$$
$$\implies\; f_{k+1} - f_{\inf} \leq \left(1 - \tfrac{c}{L}\right)(f_k - f_{\inf})$$
$$\implies\; f_k - f_{\inf} \leq \left(1 - \tfrac{c}{L}\right)^k (f_0 - f_{\inf})$$
$$\implies\; |\mathcal{K}_f(\epsilon_f)| \leq \log\left(\frac{f_0 - f_{\inf}}{\epsilon_f}\right) \left[\log\left(\frac{L}{L - c}\right)\right]^{-1}.$$
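A minimal sketch of this logarithmic bound in action, using the instance from the comparison slides below ($f(x) = \tfrac{1}{2}x^2$, $x_0 = 10$, $c = 1$, $L = 2$, $f_{\inf} = 0$):

```python
import math

f0, c, L, eps_f = 50.0, 1.0, 2.0, 1e-8  # f(x) = x^2/2 and x0 = 10 give f0 = 50
bound = math.log(f0 / eps_f) / math.log(L / (L - c))
x, k = 10.0, 0
while 0.5 * x**2 > eps_f:               # f_inf = 0, so f(x_k) - f_inf = x_k^2/2
    x -= x / L                          # x_{k+1} = x_k - g(x_k)/L with g(x) = x
    k += 1
print(k, math.ceil(bound))              # 17 observed vs. a bound of 33
```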
For the first step...

In the "general nonconvex" analysis, the guaranteed decrease for the first step is much more pessimistic:

general nonconvex: $f_0 - f_1 \geq \tfrac{1}{2L} \epsilon_g^2$
PL condition: $f_1 - f_{\inf} \leq \left(1 - \tfrac{c}{L}\right)(f_0 - f_{\inf})$

...and it remains more pessimistic throughout!
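Plugging in the instance from the next slide ($f_0 - f_{\inf} = 50$, $c = 1$, $L = 2$, $\epsilon_g = 10^{-4}$) makes the gap concrete; a quick check:

```python
f0, c, L, eps_g = 50.0, 1.0, 2.0, 1e-4
print("general nonconvex: f0 - f1 >=", eps_g**2 / (2.0 * L))  # 2.5e-09
print("PL condition:      f0 - f1 >=", (c / L) * f0)          # 25.0
```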
Upper bounds on $|\mathcal{K}_f(\epsilon_f)|$ versus $|\mathcal{K}_g(\epsilon_g)|$

Let $f(x) = \tfrac{1}{2}x^2$, meaning that $g(x) = x$.
- Let $\epsilon_f = \tfrac{1}{2}\epsilon_g^2$, meaning that $\mathcal{F}(\epsilon_f) = \mathcal{G}(\epsilon_g)$.
- Let $x_0 = 10$, $c = 1$, and $L = 2$. (Similar pictures for any $L > 1$.)

[Figure: the two upper bounds plotted against the tolerance.]
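The plots are not reproduced here, but their flavor can be recovered by tabulating both bounds as $\epsilon_g$ shrinks; a sketch under this slide's setup:

```python
import math

f0, c, L = 50.0, 1.0, 2.0                  # f(x) = x^2/2, x0 = 10, f_inf = 0
for eps_g in (1e-1, 1e-2, 1e-3, 1e-4):
    eps_f = 0.5 * eps_g**2                 # so that F(eps_f) = G(eps_g)
    Kg = 2.0 * L * f0 / eps_g**2           # general nonconvex bound
    Kf = math.log(f0 / eps_f) / math.log(L / (L - c))  # PL bound
    print(f"eps_g = {eps_g:g}:  |K_g| <= {Kg:.1e},  |K_f| <= {math.ceil(Kf)}")
```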
Upper bounds on $|\mathcal{K}_f(\epsilon_f)|$ versus $|\{k \in \mathbb{N} : \tfrac{1}{2}\|g_k\|_2^2 > \epsilon_g\}|$

Let $f(x) = \tfrac{1}{2}x^2$, meaning that $\tfrac{1}{2}g(x)^2 = \tfrac{1}{2}x^2$.
- Let $\epsilon_f = \epsilon_g$, meaning that $\mathcal{F}(\epsilon_f) = \mathcal{G}(\epsilon_g)$.
- Let $x_0 = 10$, $c = 1$, and $L = 2$. (Similar pictures for any $L > 1$.)

[Figure: the two upper bounds plotted against the tolerance.]
Bad worst-case!

Worst-case complexity bounds in the general nonconvex case are very pessimistic.
- The analysis immediately admits a large gap when the function is nice.
- The "essentially tight" examples for the worst-case bounds are... weird.¹

¹ Cartis, Gould, Toint (2010)
Plea

Let's not have these be the problems that dictate how we
- characterize our algorithms and
- represent our algorithms to the world!