SLIDE 1

A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization

Hongzhou Lin¹, Julien Mairal¹, Zaid Harchaoui²

¹Inria, Grenoble   ²University of Washington

LCCC Workshop on large-scale and distributed optimization, Lund, 2017

SLIDE 2

An alternate title: Acceleration by Smoothing

SLIDE 3

Collaborators

Hongzhou Lin, Zaid Harchaoui, Dima Drusvyatskiy, Courtney Paquette

Publications and pre-prints

  • H. Lin, J. Mairal and Z. Harchaoui. A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization. arXiv:1610.00960, 2017.
  • C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal and Z. Harchaoui. Catalyst Acceleration for Gradient-Based Non-Convex Optimization. arXiv:1703.10993, 2017.
  • H. Lin, J. Mairal and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. Adv. NIPS, 2015.

SLIDES 4–9

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x∈Rᵈ}  f(x) ≜ (1/n) Σ_{i=1}^n f_i(x) + ψ(x),

where each f_i is smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                          Composite   Finite sum   Exploit "curvature"
    First-order methods       ✔           ✔               ✗
    Quasi-Newton              —           ✗               ✔

First-order methods, composite: [Nesterov, 2013, Wright et al., 2009, Beck and Teboulle, 2009], ...
First-order methods, finite sum: [Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]
Quasi-Newton, composite: [Byrd et al., 2015, Lee et al., 2012, Scheinberg and Tang, 2016, Yu et al., 2008, Ghadimi et al., 2015, Stella et al., 2016], ...
Quasi-Newton, curvature: [Byrd et al., 2016, Gower et al., 2016]

Our goal is to accelerate first-order methods with Quasi-Newton heuristics: design algorithms that adapt to composite and finite-sum structures while also exploiting curvature information.

SLIDES 10–11

QuickeNing: main idea

Idea: smooth the function and then apply Quasi-Newton. The strategy appears in early work on variable metric bundle methods [Chen and Fukushima, 1999, Fukushima and Qi, 1996, Mifflin, 1996, Fuentes, Malick, and Lemaréchal, 2012, Burke and Qian, 2000], ...

The Moreau-Yosida smoothing

Given a convex function f : Rᵈ → R, the Moreau-Yosida smoothing of f is the function F : Rᵈ → R defined as

    F(x) = min_{w∈Rᵈ}  f(w) + (κ/2) ‖w − x‖².

The proximal operator p(x) is the unique minimizer of this problem.

SLIDES 12–13

The Moreau-Yosida regularization

    F(x) = min_{w∈Rᵈ}  f(w) + (κ/2) ‖w − x‖².

Basic properties [see Lemaréchal and Sagastizábal, 1997]

  • Minimizing f and F is equivalent in the sense that min_{x∈Rᵈ} F(x) = min_{x∈Rᵈ} f(x), and the solution sets of the two problems coincide.
  • F is continuously differentiable even when f is not, and ∇F(x) = κ(x − p(x)). In addition, ∇F is Lipschitz continuous with parameter L_F = κ.
  • If f is µ-strongly convex, then F is also strongly convex with parameter µ_F = µκ/(µ + κ).

F enjoys nice properties: smoothness, (strong) convexity, and we can control its condition number 1 + κ/µ.
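A minimal sketch (not from the slides) of these properties for a concrete non-smooth case: f = ‖·‖₁, whose proximal operator is soft-thresholding. The code evaluates the Moreau envelope F, its gradient ∇F(x) = κ(x − p(x)), and checks the gradient against a finite-difference estimate.

```python
import numpy as np

# Moreau-Yosida envelope of f(w) = ||w||_1; p(x) is soft-thresholding at 1/kappa.

def prox_l1(x, kappa):
    """p(x) = argmin_w ||w||_1 + (kappa/2)||w - x||^2."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / kappa, 0.0)

def moreau_envelope_l1(x, kappa):
    """F(x) = min_w ||w||_1 + (kappa/2)||w - x||^2, together with p(x)."""
    p = prox_l1(x, kappa)
    return np.sum(np.abs(p)) + 0.5 * kappa * np.sum((p - x) ** 2), p

kappa = 2.0
x = np.array([1.5, -0.2, 0.7, -3.0])
F, p = moreau_envelope_l1(x, kappa)
g = kappa * (x - p)                          # closed-form gradient of F

# finite-difference check of grad F (F is smooth even though f is not)
eps = 1e-6
g_fd = np.array([(moreau_envelope_l1(x + eps * e, kappa)[0] - F) / eps
                 for e in np.eye(len(x))])
print("F(x) =", F)
print("max |closed-form - finite diff| =", np.max(np.abs(g - g_fd)))
```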

SLIDE 14

A fresh look at Catalyst

SLIDE 15

A fresh look at the proximal point algorithm

A naive approach consists of minimizing the smoothed objective F instead of f with a method designed for smooth optimization. Consider indeed the gradient step

    x_{k+1} = x_k − (1/κ) ∇F(x_k).

By rewriting the gradient ∇F(x_k) as κ(x_k − p(x_k)), we obtain

    x_{k+1} = p(x_k) = argmin_{w∈Rᵈ}  f(w) + (κ/2) ‖w − x_k‖².

This is exactly the proximal point algorithm [Rockafellar, 1976].
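A minimal sketch of this equivalence, reusing the ℓ1 example (an assumption, not from the slides): a gradient step on F with step size 1/κ is identical to the proximal point update x_{k+1} = p(x_k).

```python
import numpy as np

# Gradient descent on the Moreau envelope F of f = ||.||_1 with step 1/kappa
# coincides with the proximal point algorithm [Rockafellar, 1976].

kappa = 2.0
prox = lambda x: np.sign(x) * np.maximum(np.abs(x) - 1.0 / kappa, 0.0)   # p(x)

x = np.array([1.5, -0.2, 0.7, -3.0])
for _ in range(8):
    p = prox(x)
    grad_F = kappa * (x - p)                     # grad F(x) = kappa (x - p(x))
    assert np.allclose(x - grad_F / kappa, p)    # gradient step == proximal point step
    x = p
print(x)   # iterates shrink toward the minimizer of ||.||_1, i.e. the origin
```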

SLIDES 16–17

A fresh look at the accelerated proximal point algorithm

Consider now

    x_{k+1} = y_k − (1/κ) ∇F(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k),

where β_{k+1} is a Nesterov-like extrapolation parameter. We may now rewrite the update using the value of ∇F, which gives

    x_{k+1} = p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

This is the accelerated proximal point algorithm of Güler [1992].

Remarks

  • F may be better conditioned than f when 1 + κ/µ ≤ L/µ;
  • computing p(y_k) has a cost!

SLIDE 18

A fresh look at Catalyst [Lin, Mairal, and Harchaoui, 2015]

Catalyst is a particular accelerated proximal point algorithm with inexact gradients [Güler, 1992]:

    x_{k+1} ≈ p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

The quantity x_{k+1} is obtained by using an optimization method M for approximately solving

    x_{k+1} ≈ argmin_{w∈Rᵈ}  f(w) + (κ/2) ‖w − y_k‖².

Catalyst provides Nesterov's acceleration to M with

  • restart strategies for solving the sub-problems;
  • parameter choices (as a consequence of the complexity analysis);
  • a global complexity analysis resulting in theoretical acceleration.

see also [Frostig et al., 2015]
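A minimal sketch of a Catalyst-style outer loop (not the authors' implementation). The inner method M is plain gradient descent run for a fixed budget, and β is taken as the classical constant Nesterov momentum with q = µ/(µ + κ); both choices are simplifying assumptions.

```python
import numpy as np

# Catalyst-style outer loop: inexact proximal point with Nesterov extrapolation.

def catalyst(grad_f, x0, L, mu, kappa, n_outer=100, inner_budget=50):
    q = mu / (mu + kappa)
    beta = (1 - np.sqrt(q)) / (1 + np.sqrt(q))        # constant momentum (assumption)
    x_prev = x = y = x0.copy()
    for _ in range(n_outer):
        # inner loop: approximately minimize h(w) = f(w) + kappa/2 ||w - y||^2
        w = y.copy()
        for _ in range(inner_budget):
            w -= (1.0 / (L + kappa)) * (grad_f(w) + kappa * (w - y))
        x_prev, x = x, w                              # x_{k+1} ~ p(y_k)
        y = x + beta * (x - x_prev)                   # extrapolation step
    return x

# toy usage: strongly convex quadratic f(x) = 0.5 x^T A x with condition number 100
A = np.diag([100.0, 1.0]); L, mu, kappa = 100.0, 1.0, 1.0
print(catalyst(lambda x: A @ x, np.array([1.0, 1.0]), L, mu, kappa))  # close to (0, 0)
```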

SLIDES 19–21

Quasi-Newton and L-BFGS

Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k = x_{k+1} − x_k,    y_k = ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation B_0 = σI, and choose B_{k+1} to interpolate the gradient difference:

    B_{k+1} s_k = y_k.

Since B_{k+1} is not unique, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method chooses the symmetric matrix whose difference with B_k is minimal:

    B_{k+1} = B_k − (B_k s_k s_k⊤ B_k)/(s_k⊤ B_k s_k) + (y_k y_k⊤)/(y_k⊤ s_k).
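A minimal sketch of the dense BFGS update quoted above (the curvature check and the toy Hessian in the usage are assumptions added for illustration).

```python
import numpy as np

# Dense BFGS update: B_{k+1} = B_k - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s).
# The y^T s > 0 check keeps B_{k+1} positive definite (the slides mention update
# skipping/damping or a Wolfe line search for the same purpose).

def bfgs_update(B, s, y, eps=1e-10):
    ys = y @ s
    if ys <= eps:                       # skip the update if curvature is not positive
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / ys

# usage: for a quadratic with Hessian A, y = A s, and the secant equation holds
rng = np.random.default_rng(0)
A = np.diag([2.0, 1.0, 0.5])            # toy Hessian (assumption)
s = rng.standard_normal(3)
y = A @ s
B_new = bfgs_update(np.eye(3), s, y)
print(np.allclose(B_new @ s, y))        # secant equation B_{k+1} s = y is satisfied
```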

SLIDES 22–27

Quasi-Newton and L-BFGS

Presentation borrowed from Mark Schmidt, NIPS OPT 2010

  • Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive definite.
  • They perform updates of the form x_{k+1} ← x_k − η_k B_k⁻¹ ∇f(x_k).
  • The BFGS method has a superlinear convergence rate.
  • But it still uses a dense p × p matrix B_k.
  • Instead of storing B_k, the limited-memory BFGS (L-BFGS) method stores the previous l differences s_k and y_k.
  • We can solve a linear system involving these updates in O(dl) when B_0 is diagonal [Nocedal, 1980].
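A minimal sketch of the L-BFGS two-loop recursion [Nocedal, 1980] referred to above: it applies the inverse Hessian approximation built from the last l pairs (s_i, y_i) to a gradient in O(dl) time, without ever forming a dense matrix. The diagonal initialization is assumed to be γI.

```python
import numpy as np

# Two-loop recursion: returns r ~ H_k g where H_k approximates the inverse Hessian.

def lbfgs_direction(g, s_list, y_list, gamma=1.0):
    q = g.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):     # first loop: recent -> old
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha); rhos.append(rho)
    r = gamma * q                                            # apply H_0 = gamma * I
    for (s, y), alpha, rho in zip(zip(s_list, y_list),
                                  reversed(alphas), reversed(rhos)):   # second loop: old -> recent
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r

# usage: with a single pair from a quadratic, the direction is exact along s
A = np.diag([4.0, 1.0])
s = np.array([1.0, 0.0]); y = A @ s
print(lbfgs_direction(y, [s], [y]))     # recovers s = A^{-1} y in that subspace
```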

SLIDES 28–30

Limited-Memory BFGS (L-BFGS)

Remarks

  • using the right initialization B_0 is crucial;
  • the calibration of the line search is also an art.

Pros

  • a big practical success of smooth optimization.

Cons

  • worst-case convergence rates for strongly convex functions are linear, but no better than those of the gradient descent method;
  • proximal variants typically require solving many times

        min_{x∈Rᵈ}  (1/2)(x − z)⊤ B_k (x − z) + ψ(x);

  • no guarantee of approximating the Hessian.

SLIDE 31

QuickeNing

Main recipe

  • L-BFGS applied to the smoothed objective F with inexact gradients [see Friedlander and Schmidt, 2012];
  • inexact gradients are obtained by solving sub-problems with a first-order optimization method M; ideally, M adapts to the problem structure (finite sum, composite regularization);
  • replace L-BFGS steps by proximal point steps if no sufficient decrease is estimated ⇒ no line search on F.

SLIDE 32

Obtaining inexact gradients

Algorithm: Procedure ApproxGradient
input: current point x in Rᵈ; smoothing parameter κ > 0.

1: Compute the approximate proximal mapping using an optimization method M:

       z ≈ argmin_{w∈Rᵈ}  h(w) ≜ f(w) + (κ/2) ‖w − x‖²,

2: Estimate the gradient ∇F(x):

       g = κ(x − z).

output: approximate gradient estimate g, objective value F_a = h(z), proximal mapping z.
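A minimal sketch of this procedure (not the authors' code). The inner method M is assumed to be plain proximal gradient (ISTA) on h run for a fixed budget, and f, ψ are assumed to be given through value/gradient and prox oracles.

```python
import numpy as np

# ApproxGradient sketch: returns (g, F_a, z) as on the slide.

def approx_gradient(x, f_val, grad_f, psi_val, prox_psi, L, kappa, budget=20):
    z = x.copy()                                    # warm start at x (smooth case)
    step = 1.0 / (L + kappa)
    for _ in range(budget):
        grad_smooth = grad_f(z) + kappa * (z - x)   # gradient of the smooth part of h
        z = prox_psi(z - step * grad_smooth, step)  # proximal step on psi
    g = kappa * (x - z)                             # gradient estimate of F at x
    F_a = f_val(z) + psi_val(z) + 0.5 * kappa * np.sum((z - x) ** 2)   # h(z)
    return g, F_a, z

# usage with f(w) = 0.5 ||w||^2 and psi = 0: here p(x) = x/2 and grad F(x) = x/2
g, F_a, z = approx_gradient(np.array([2.0, -1.0]),
                            f_val=lambda w: 0.5 * w @ w, grad_f=lambda w: w,
                            psi_val=lambda w: 0.0, prox_psi=lambda v, t: v,
                            L=1.0, kappa=1.0)
print(g, F_a, z)
```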

SLIDES 33–34

Algorithm QuickeNing
input: x_0 in Rᵈ; number of iterations K; κ > 0; minimization algorithm M.

 1: Initialization: (g_0, F_0, z_0) = ApproxGradient(x_0, M); B_0 = κI.
 2: for k = 0, ..., K − 1 do
 3:   Perform the Quasi-Newton step
          x_test = x_k − B_k⁻¹ g_k
          (g_test, F_test, z_test) = ApproxGradient(x_test, M).
 4:   if F_test ≤ F_k − (1/(2κ)) ‖g_k‖², then
 5:     (x_{k+1}, g_{k+1}, F_{k+1}, z_{k+1}) = (x_test, g_test, F_test, z_test).
 6:   else
 7:     Update the current iterate with the last proximal mapping:
          x_{k+1} = z_k = x_k − (1/κ) g_k
          (g_{k+1}, F_{k+1}, z_{k+1}) = ApproxGradient(x_{k+1}, M).
 8:   end if
 9:   Update B_{k+1} = L-BFGS(B_k, x_{k+1} − x_k, g_{k+1} − g_k).
10: end for
output: last proximal mapping z_K (solution).

The main characters:

  • the sequence (x_k)_{k≥0} that minimizes F;
  • the sequence (z_k)_{k≥0} produced by M that minimizes f;
  • the gradient approximations g_k ≈ ∇F(x_k);
  • the function value approximations F_k ≈ F(x_k);
  • an L-BFGS update with inexact gradients;
  • an approximate sufficient descent condition.
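A minimal sketch of this outer loop (not the authors' implementation). It assumes the caller supplies approx_gradient(x), returning (g, F_val, z) as in the ApproxGradient sketch above, and lbfgs_direction, applying the inverse L-BFGS operator to a vector (two-loop recursion); the memory handling is reduced to a plain list of pairs. Wiring it to the two earlier sketches gives a complete toy implementation.

```python
import numpy as np

# QuickeNing-style outer loop: L-BFGS on F with inexact gradients and a
# proximal point fallback when the sufficient decrease test fails.

def quickening(x0, approx_gradient, lbfgs_direction, kappa, K=50, memory=10):
    x = x0.copy()
    g, F_val, z = approx_gradient(x)
    s_list, y_list = [], []
    for _ in range(K):
        x_test = x - lbfgs_direction(g, s_list, y_list, gamma=1.0 / kappa)
        g_test, F_test, z_test = approx_gradient(x_test)
        if F_test <= F_val - (1.0 / (2 * kappa)) * np.dot(g, g):
            x_new, g_new, F_new, z_new = x_test, g_test, F_test, z_test  # accept QN step
        else:
            x_new = z                                                    # proximal point fallback
            g_new, F_new, z_new = approx_gradient(x_new)
        s, y = x_new - x, g_new - g
        if y @ s > 1e-12:                                                # keep curvature pairs only
            s_list.append(s); y_list.append(y)
            s_list, y_list = s_list[-memory:], y_list[-memory:]
        x, g, F_val, z = x_new, g_new, F_new, z_new
    return z                                                             # last proximal mapping
```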

SLIDE 35

Requirements on M and restarts

Method M

Say a sub-problem consists of minimizing h; we want M to produce a sequence of iterates (w_t)_{t≥0} with a linear convergence rate

    h(w_t) − h⋆ ≤ C_M (1 − τ_M)ᵗ (h(w_0) − h⋆).

Restarts

When f is smooth, we initialize w_0 = x when solving

    min_{w∈Rᵈ}  f(w) + (κ/2) ‖w − x‖².

When f = f_0 + ψ is composite, we use the initialization

    w_0 = argmin_{w∈Rᵈ}  f_0(x) + ⟨∇f_0(x), w − x⟩ + ((L + κ)/2) ‖w − x‖² + ψ(w).
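A minimal sketch of the composite restart rule above: w_0 is one proximal-gradient step on f_0 + ψ from x with step size 1/(L + κ). The choice ψ = λ‖·‖₁ (so that the prox is soft-thresholding) is an assumption for illustration.

```python
import numpy as np

# Composite warm start: w_0 = prox_{psi/(L+kappa)}( x - grad f_0(x)/(L+kappa) ).

def composite_warm_start(x, grad_f0, L, kappa, lam):
    step = 1.0 / (L + kappa)
    v = x - step * grad_f0(x)                                    # gradient step on f_0
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # prox of lam*||.||_1

# usage with f_0(w) = 0.5 ||w - b||^2 (so L = 1) and lam = 0.1
b = np.array([1.0, -0.3, 0.0])
print(composite_warm_start(np.zeros(3), lambda w: w - b, L=1.0, kappa=1.0, lam=0.1))
```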

SLIDES 36–37

When do we stop the method M?

Three strategies

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate;
(b) define a stopping criterion that depends on quantities available at iteration k;
(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem.

Remarks

  • (a) is the least practical strategy;
  • (b) is simpler to use and conservative (compatible with theory);
  • (c) requires T_M to be large enough in theory. The aggressive strategy T_M = n for an incremental method is extremely simple to use and effective in practice.

SLIDE 38

When do we stop the method M?

Three strategies for µ-strongly convex objectives f

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop M when the approximate proximal mapping is ε_k-accurate, with

       ε_k = (1/2) C (1 − ρ)^{k+1}   with C ≥ f(x_0) − f⋆ and ρ = µ/(4(µ + κ));

(b) for minimizing h(w) = f(w) + (κ/2) ‖w − x‖², stop when

       h(w_t) − h⋆ ≤ (κ/36) ‖w_t − x‖²;

(c) use a pre-defined budget T_M of iterations of M for solving each sub-problem, with

       T_M = (1/τ_M) log(19 C_M (L + κ)/κ)    (be more aggressive in practice).
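A minimal sketch of stopping criterion (b) inside an inner solver. Assumptions added for illustration: the sub-problem is solved by gradient descent, and a lower bound h_star on min h is available (e.g., from a duality gap); here a toy quadratic is used so the exact optimum is known.

```python
import numpy as np

# Inner solver that stops as soon as h(w_t) - h* <= (kappa/36) ||w_t - x||^2.

def inner_solve_with_criterion_b(x, f_val, grad_f, L, kappa, h_star, max_iter=1000):
    h = lambda w: f_val(w) + 0.5 * kappa * np.sum((w - x) ** 2)
    w = x.copy()
    for _ in range(max_iter):
        if h(w) - h_star <= (kappa / 36.0) * np.sum((w - x) ** 2):
            break                                      # criterion (b) satisfied
        w -= (1.0 / (L + kappa)) * (grad_f(w) + kappa * (w - x))
    return w

# toy usage: f(w) = 2 w^2 in 1-D, x = 1, kappa = 1, so the sub-problem optimum is w* = 0.2
x, kappa = np.array([1.0]), 1.0
h_star = 2.0 * 0.2 ** 2 + 0.5 * (0.2 - 1.0) ** 2       # exact optimum of the toy sub-problem
w = inner_solve_with_criterion_b(x, f_val=lambda w: 2.0 * np.sum(w ** 2),
                                 grad_f=lambda w: 4.0 * w, L=4.0, kappa=kappa,
                                 h_star=h_star)
print(w)
```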

SLIDE 39

Remarks and global complexity

Composite objectives and sparsity

Consider a composite problem with a sparse solution (e.g., ψ = ℓ1). The method produces two sequences (x_k)_{k≥0} and (z_k)_{k≥0}:

  • F(x_k) → F⋆, minimizes the smoothed objective ⇒ no sparsity;
  • f(z_k) → f⋆, minimizes the true objective ⇒ the iterates may be sparse if M handles composite optimization problems.

Global complexity

The number of iterations of M to guarantee f(z_k) − f⋆ ≤ ε is at most

  • Õ((µ + κ)/(τ_M µ) · log(1/ε)) for µ-strongly convex problems;
  • Õ(κR²/(τ_M ε)) for convex problems.

SLIDES 40–42

Global Complexity and choice of κ

Example: gradient descent

With the right step size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    Õ((L + κ)/µ · log(1/ε)).

Example: SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    Õ(max{n(µ + κ)/µ, (L + κ)/µ} · log(1/ε)).

QuickeNing does not provide any theoretical acceleration, but it does not significantly degrade the worst-case performance of M (unlike L-BFGS vs. gradient descent).

Then, how to choose κ? (i) assume that L-BFGS steps do as well as Nesterov's; (ii) choose κ as in Catalyst.

SLIDE 43

Experiments: formulations

ℓ2-regularized Logistic Regression:

    min_{x∈Rᵈ}  (1/n) Σ_{i=1}^n log(1 + exp(−b_i a_i⊤ x)) + (µ/2) ‖x‖²,

ℓ1-regularized Linear Regression (LASSO):

    min_{x∈Rᵈ}  (1/(2n)) Σ_{i=1}^n (b_i − a_i⊤ x)² + λ ‖x‖₁,

ℓ1-ℓ2²-regularized Linear Regression (Elastic-Net):

    min_{x∈Rᵈ}  (1/(2n)) Σ_{i=1}^n (b_i − a_i⊤ x)² + λ ‖x‖₁ + (µ/2) ‖x‖².
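A minimal sketch (not from the slides) of these objectives in Python, with A an n × d data matrix and b the labels/targets; the ℓ1 part is handled through its prox (soft-thresholding), which is what a composite-aware method M would use. The random data in the usage is purely illustrative.

```python
import numpy as np

# l2-regularized logistic regression, lasso, and the shared l1 prox.

def logistic_l2(x, A, b, mu):
    return np.mean(np.logaddexp(0.0, -b * (A @ x))) + 0.5 * mu * x @ x

def logistic_l2_grad(x, A, b, mu):
    sigma = 1.0 / (1.0 + np.exp(b * (A @ x)))     # sigma(-b_i a_i^T x)
    return A.T @ (-b * sigma) / len(b) + mu * x

def lasso(x, A, b, lam):
    r = A @ x - b
    return 0.5 * np.mean(r ** 2) + lam * np.sum(np.abs(x))

def prox_l1(v, t, lam):
    return np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)

# elastic-net = lasso smooth part + (mu/2)||x||^2, with the same l1 prox
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.choice([-1.0, 1.0], size=20)
x = np.zeros(5)
print(logistic_l2(x, A, b, mu=0.01), lasso(x, A, b, lam=0.1))
```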

SLIDE 44

Experiments: Datasets

We consider four standard machine learning datasets with different characteristics in terms of size and dimension:

    name    covtype    alpha      real-sim   rcv1
    n       581 012    250 000    72 309     781 265
    d       54         500        20 958     47 152

  • we simulate the ill-conditioned regime µ = 1/(100n);
  • λ for the Lasso leads to about 10% non-zero coefficients.

SLIDE 45

Experiments: QuickeNing-SVRG

We consider the methods

  • SVRG: the Prox-SVRG algorithm of Xiao and Zhang [2014];
  • Catalyst-SVRG: Catalyst applied to SVRG;
  • L-BFGS (for smooth objectives): Mark Schmidt's implementation;
  • QuickeNing-SVRG1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop;
  • QuickeNing-SVRG2: strategy (b), compatible with theory.

We produce 12 figures (3 formulations, 4 datasets).

SLIDE 46

Experiments: QuickeNing-SVRG (log scale)

[Figure: relative function value vs. number of gradient evaluations on covtype and rcv1, logistic (µ = 1/(100n)) and lasso (λ = 10/n), comparing SVRG, Catalyst-SVRG, QuickeNing-SVRG1, QuickeNing-SVRG2, and L-BFGS.]

  • QuickeNing-SVRG1 ≥ SVRG, QuickeNing-SVRG2;
  • QuickeNing-SVRG2 ≥ SVRG;
  • QuickeNing-SVRG1 ≥ Catalyst-SVRG in 10/12 cases.

SLIDE 47

Experiments: QuickeNing-ISTA

We consider the methods

  • ISTA: the proximal gradient descent method with line search;
  • FISTA: the accelerated ISTA of Beck and Teboulle [2009];
  • L-BFGS (for smooth objectives): Mark Schmidt's implementation;
  • QuickeNing-ISTA1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop;
  • QuickeNing-ISTA2: strategy (b), compatible with theory.

SLIDE 48

Experiments: QuickeNing-ISTA (log scale)

[Figure: relative function value vs. number of gradient evaluations on covtype and rcv1, logistic (µ = 1/(100n)) and lasso (λ = 10/n), comparing ISTA, FISTA, QuickeNing-ISTA1, QuickeNing-ISTA2, and L-BFGS.]

  • L-BFGS (for smooth f) is slightly better than QuickeNing-ISTA1;
  • QuickeNing-ISTA ≥ or ≫ FISTA in 11/12 cases;
  • QuickeNing-ISTA1 ≥ QuickeNing-ISTA2.

SLIDE 49

Experiments: Influence of κ

[Figure: relative function value vs. number of gradient evaluations on covtype and rcv1, logistic (µ = 1/(100n)) and lasso (λ = 10/n), for QuickeNing-SVRG with κ ∈ {0.001κ₀, 0.01κ₀, 0.1κ₀, κ₀, 10κ₀, 100κ₀, 1000κ₀}.]

  • κ₀ is the parameter (same as in Catalyst) used in all experiments;
  • QuickeNing slows down when using κ > κ₀;
  • here, for SVRG, QuickeNing is robust to small values of κ!

SLIDE 50

Experiments: Influence of l

[Figure: relative function value vs. number of gradient evaluations on covtype and rcv1, logistic (µ = 1/(100n)) and lasso (λ = 10/n), for QuickeNing-SVRG1 with L-BFGS memory l ∈ {1, 2, 5, 10, 20, 100}.]

  • l = 100 in all previous experiments;
  • l = 5 seems to be a reasonable choice in many cases, especially for sparse problems.

SLIDE 51

Conclusions and perspectives

  • QuickeNing has been a safe heuristic so far;
  • it may be the first L-BFGS algorithm for composite objectives with reasonable known complexity for solving the sub-problems;
  • we also have a variant for dual approaches;
  • the gap between theory and practice is significant.

Perspectives

  • QuickeNing-BCD, QuickeNing-SAG, SAGA, SDCA, ...
  • other types of smoothing techniques?

SLIDES 52–54

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖² + 2ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖², and indeed:

Proposition: convergence with impractical ε_k and µ > 0

If ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖², define ρ = µ/(4(µ + κ)); then

    F(x_{k+1}) − F⋆ ≤ f(z_k) − f⋆ ≤ (1 − ρ)^{k+1} (f(x_0) − f⋆).

Unfortunately, ∇F(x_k) is unknown.

Lemma: convergence with adaptive ε_k and µ > 0

If ε_k ≤ (1/(36κ)) ‖g_k‖², then ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖².

This is strategy (b). g_k is known and easy to compute.

SLIDE 55

Inner-loop complexity analysis

Restart for L-smooth functions

For minimizing h, initialize the method M with w_0 = x. Then,

    h(w_0) − h⋆ ≤ ((L + κ)/(2κ²)) ‖∇F(x)‖².    (1)

Proof.

We have the optimality condition ∇f(w⋆) + κ(w⋆ − x) = 0. As a result,

    h(w_0) − h⋆ = f(x) − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                ≤ f(w⋆) + ⟨∇f(w⋆), x − w⋆⟩ + (L/2) ‖x − w⋆‖² − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                = ((L + κ)/2) ‖w⋆ − x‖²
                = ((L + κ)/(2κ²)) ‖∇F(x)‖².
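A minimal numerical check of the restart bound (1), not from the slides. It assumes a smooth quadratic f(w) = (1/2) w⊤Aw, for which the proximal mapping w⋆ = p(x) and ∇F(x) = κ(x − w⋆) are available in closed form.

```python
import numpy as np

# Check h(w_0) - h* <= (L + kappa)/(2 kappa^2) * ||grad F(x)||^2 with w_0 = x.

rng = np.random.default_rng(0)
d, kappa = 5, 1.0
Q = rng.standard_normal((d, d))
A = Q.T @ Q + 0.1 * np.eye(d)                     # positive definite Hessian
L = np.linalg.eigvalsh(A).max()                   # smoothness constant of f

x = rng.standard_normal(d)
f = lambda w: 0.5 * w @ A @ w
h = lambda w: f(w) + 0.5 * kappa * np.sum((w - x) ** 2)

w_star = np.linalg.solve(A + kappa * np.eye(d), kappa * x)   # p(x) in closed form
grad_F = kappa * (x - w_star)

lhs = h(x) - h(w_star)                            # h(w_0) - h* with w_0 = x
rhs = (L + kappa) / (2 * kappa ** 2) * grad_F @ grad_F
print(lhs <= rhs + 1e-12, lhs, rhs)               # the bound (1) holds
```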

SLIDES 56–61

References

  • A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • J. V. Burke and M. Qian. On the superlinear convergence of the variable metric proximal point algorithm using Broyden and BFGS matrix secant updating. Mathematical Programming, 88(1):157–181, 2000.
  • R. H. Byrd, J. Nocedal, and F. Oztoprak. An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396, 2015.
  • R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
  • X. Chen and M. Fukushima. Proximal quasi-Newton methods for nondifferentiable convex optimization. Mathematical Programming, 85(2):313–334, 1999.
  • A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
  • A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conferences on Machine Learning (ICML), 2014b.
  • M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
  • R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.
  • M. Fuentes, J. Malick, and C. Lemaréchal. Descentwise inexact proximal algorithms for smooth optimization. Computational Optimization and Applications, 53(3):755–769, 2012.
  • M. Fukushima and L. Qi. A globally and superlinearly convergent algorithm for nonsmooth convex minimization. SIAM Journal on Optimization, 6(4):1106–1120, 1996.
  • S. Ghadimi, G. Lan, and H. Zhang. Generalized Uniformly Optimal Methods for Nonlinear Programming. arXiv:1508.07384, 2015.
  • R. M. Gower, D. Goldfarb, and P. Richtárik. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the International Conferences on Machine Learning (ICML), 2016.
  • O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.
  • J. Lee, Y. Sun, and M. Saunders. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • C. Lemaréchal and C. Sagastizábal. Practical aspects of the Moreau-Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization, 7(2):367–385, 1997.
  • H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
  • R. Mifflin. A quasi-second-order proximal bundle algorithm. Mathematical Programming, 73(1):51–72, 1996.
  • Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
  • J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
  • R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.
  • K. Scheinberg and X. Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. Mathematical Programming, 160(1):495–529, 2016.
  • M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 160(1):83–112, 2017.
  • S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.
  • L. Stella, A. Themelis, and P. Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. arXiv:1604.08096, 2016.
  • S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
  • L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
  • J. Yu, S. V. N. Vishwanathan, S. Günter, and N. N. Schraudolph. A quasi-Newton approach to non-smooth convex optimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2008.
  • Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.