Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods
Jorge Nocedal, Northwestern University
Huatulco, Jan 2018
Collaborators
Albert Berahas (Northwestern University)
Richard Byrd (University of Colorado)
Discussion
1. The BFGS method continues to surprise
2. One of the best methods for nonsmooth optimization (Lewis-Overton)
3. Leading approach for (deterministic) derivative-free optimization (DFO)
4. This talk: a very good method for the minimization of noisy functions

We had not fully recognized the power and generality of quasi-Newton updating until we tried to find alternatives!

Subject of this talk:
1. Black-box noisy functions
2. No known structure
3. Not the finite-sum loss functions arising in machine learning, where cheap approximate gradients are available
Outline: DFO

Problem 1:  min f(x),  f smooth but derivatives not available
1. f contains no noise
2. Scalability, parallelism
3. Robustness

Problem 2:  min f(x; ξ),  f(·; ξ) smooth
  min f(x) = φ(x) + ε(x)       (additive noise)
  or  f(x) = φ(x)(1 + ε(x))    (multiplicative noise)

• Propose a method built upon classical quasi-Newton updating using finite-difference gradients
• Estimate a good finite-difference interval h
• Use noise estimation techniques (Moré-Wild)
• Deal with noise adaptively
• Can solve problems with thousands of variables
• Novel convergence results – to a neighborhood of the solution (Richard Byrd)
DFO: Derivative-free deterministic optimization (no noise)

  min f(x),   f is smooth

• Direct search / pattern search methods: not scalable
• Much better idea:
  – Interpolation-based models with trust regions (Powell, Conn, Scheinberg, …)

      min  m(x) = x^T B x + g^T x    s.t.  ||x||_2 ≤ Δ

1. Need (n+1)(n+2)/2 function values to define a quadratic model by pure interpolation
2. Can use O(n) points and assume a minimum-norm change in the Hessian
3. Arithmetic costs high: n^4  ← scalability
4. Placement of interpolation points is important
5. Correcting the model may require many function evaluations
6. Parallelizable? ←
Why not simply BFGS with finite-difference gradients?

  x_{k+1} = x_k − α_k H_k ∇f(x_k),      ∂f(x)/∂x_i ≈ [f(x + h e_i) − f(x)] / h

• Invest significant effort in estimation of the gradient
• Delegate construction of the model to BFGS
• Interpolating gradients
• Modest linear algebra costs: O(n) for L-BFGS
• Placement of sample points on an orthogonal set
• BFGS is an overwriting process: no inconsistencies or ill conditioning with an Armijo-Wolfe line search
• Gradient evaluation parallelizes easily

Why now?
• Perception that n function evaluations per step is too high
• Derivative-free literature rarely compares with FD quasi-Newton
• Already used extensively: fminunc in MATLAB
• Black-box competition and KNITRO
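As a concrete illustration of the gradient estimate referred to above (a minimal sketch, not code from the talk; all names are ours), a forward- or central-difference gradient takes only a few lines:

```python
import numpy as np

def fd_gradient(f, x, h, central=False):
    """Finite-difference gradient estimate.

    Forward differences cost n extra evaluations of f; central differences
    cost 2n but are more accurate. Each component is independent, so the
    loop parallelizes easily.
    """
    n = x.size
    g = np.empty(n)
    fx = f(x) if not central else None   # reused by every forward difference
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        if central:
            g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
        else:
            g[i] = (f(x + e) - fx) / h
    return g
```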
Some numerical results

Compare: model-based trust-region code DFOtr by Conn, Scheinberg, Vicente vs. FD-L-BFGS with forward and central differences.
Plot function decrease vs. total number of function evaluations.
Comparison: function decrease vs. total number of function evaluations

[Figure: F(x) − F* vs. number of function evaluations on smooth deterministic problems (quadratic, s271, s334, s293, s289), comparing DFOtr with L-BFGS FD (forward differences) and L-BFGS FD (central differences).]
Conclusion: DFO without noise

Finite-difference BFGS is a real competitor to DFO methods based on function values.
It can solve problems with thousands of variables … but really nothing new.
Optimization of Noisy Functions

  min f(x; ξ),  where f(·; ξ) is smooth
  min f(x) = φ(x) + ε(x)      or      f(x) = φ(x)(1 + ε(x))

Focus on additive noise.

[Figure: f(x) = sin(x) + cos(x) + 10⁻³ U(0, 2√3), smooth vs. noisy, plotted for x ∈ [−0.03, 0.03].]

Finite-difference BFGS should not work!
1. Differencing noisy functions is dangerous
2. Just one bad update once in a while: disastrous
3. Not done, to the best of our knowledge
Finite Differences – Noisy Functions

  f(x) = φ(x) + ε(x)

[Figure: effect of the differencing interval on f(x) = φ(x) + ε(x), with panels for h too big and h too small. True derivative at x = −2.5: −1.6, finite-difference estimate: 1.33. True derivative at x = −3.5: −0.5, finite-difference estimate: 0.5.]
A Practical Algorithm

  min f(x) = φ(x) + ε(x)      or      f(x) = φ(x)(1 + ε(x))

Outline of the adaptive finite-difference BFGS method:
1. Estimate the noise ε(x) at every iteration (Moré-Wild)
2. Estimate h
3. Compute the finite-difference gradient
4. Perform a line search (?!)
5. Corrective procedure when the line search fails
   • (need to modify the line search)
   • re-estimate the noise level

Will require very few extra f evaluations per iteration – even none.
Noise estimation: Moré-Wild (2011)

  min f(x) = φ(x) + ε(x)

Noise level:  σ = [Var(ε(x))]^{1/2};   noise estimate: ε_f

At x, choose a random direction v and evaluate f at q + 1 equally spaced points x + iβv, i = 0, …, q.

Compute function differences:
  Δ⁰f(x) = f(x),     Δ^{j+1} f(x) = Δ^j [f(x + βv)] − Δ^j [f(x)]

Build the finite-difference table  T_{i,j} = Δ^j f(x + iβv)  and compute

  σ_j² = γ_j / (q + 1 − j) · Σ_{i=0}^{q−j} T_{i,j}²,     γ_j = (j!)² / (2j)!,     1 ≤ j ≤ q.
Noise estimation: Moré-Wild (2011), example

  f(x) = sin(x) + cos(x) + 10⁻³ U(0, 2√3),   β = 10⁻²,   q = 6

    x         f       Δf        Δ²f        Δ³f        Δ⁴f        Δ⁵f        Δ⁶f
  −3·10⁻²   1.003   7.54e−3    2.15e−3    1.87e−4   −5.87e−3    1.46e−2   −2.49e−2
  −2·10⁻²   1.011   9.69e−3    2.33e−3   −5.68e−3    8.73e−3   −1.03e−3
  −1·10⁻²   1.021   1.20e−2   −3.35e−3    3.05e−3   −1.61e−3
   0        1.033   8.67e−3   −2.96e−3    1.44e−3
   1·10⁻²   1.041   8.38e−3    1.14e−3
   2·10⁻²   1.050   9.52e−3
   3·10⁻²   1.059
   σ_j              6.65e−4    8.69e−4    7.39e−4    7.34e−4    7.97e−4    8.20e−4

High-order differences of a smooth function tend to zero rapidly, while differences of the noise remain bounded away from zero. Changes in sign are useful. The procedure is scale invariant!
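A sketch of this computation in Python is given below. It is a simplification of our own: the actual Moré-Wild procedure selects the difference order j using sign changes and agreement between consecutive σ_j, whereas this sketch just takes a median as a placeholder.

```python
import numpy as np
from math import factorial

def noise_estimate(f, x, v, beta=1e-2, q=6):
    """Simplified More'-Wild style noise estimate along direction v.

    Evaluates f at the q+1 equally spaced points x + i*beta*v, builds the
    finite-difference table column by column, and forms sigma_j for each
    difference order j.
    """
    fvals = np.array([f(x + i * beta * v) for i in range(q + 1)])
    col = fvals
    sigmas = []
    for j in range(1, q + 1):
        col = np.diff(col)                        # column of Delta^j f values
        gamma_j = factorial(j) ** 2 / factorial(2 * j)
        # sigma_j^2 = gamma_j / (q + 1 - j) * sum_i T_{i,j}^2
        sigmas.append(np.sqrt(gamma_j * np.mean(col ** 2)))
    # placeholder selection rule; the published procedure uses sign changes
    # and agreement of consecutive sigma_j to pick the estimate
    return float(np.median(sigmas))
```

On the example above, the σ_j values in the last row of the table cluster around 7–8 × 10⁻⁴, close to the true noise level of 10⁻³ (U(0, 2√3) has unit standard deviation).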
Finite-difference intervals

Once the noise estimate ε_f has been computed:

  Forward difference:   h = 8^{1/4} (ε_f / μ₂)^{1/2},     μ₂ = max_{x∈I} |f″(x)|
  Central difference:   h = 3^{1/3} (ε_f / μ₃)^{1/3},     μ₃ ≈ |f‴(x)|

Bad estimates of the second and third derivatives can cause problems (not often).
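In code these formulas are one-liners once ε_f and rough curvature bounds are available (a minimal sketch with illustrative names; μ₂ and μ₃ are supplied by the caller):

```python
def fd_intervals(eps_f, mu2, mu3):
    """Finite-difference intervals from noise estimate eps_f and rough
    derivative bounds mu2 ~ max|f''| and mu3 ~ |f'''| (caller-supplied)."""
    h_forward = 8.0 ** 0.25 * (eps_f / mu2) ** 0.5
    h_central = 3.0 ** (1.0 / 3.0) * (eps_f / mu3) ** (1.0 / 3.0)
    return h_forward, h_central

# e.g. eps_f = 1e-3, mu2 = mu3 = 1.0  ->  h_forward ~ 5.3e-2, h_central ~ 1.4e-1
```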
Adaptive Finite-Difference L-BFGS Method

  Estimate the noise ε_f                               [Moré-Wild]
  Compute h for forward or central differences         [4–8 function evaluations]
  Compute g_k
  While convergence test not satisfied:
      d_k = −H_k g_k                                   [L-BFGS procedure]
      (x₊, f₊, flag) = LineSearch(x_k, f_k, g_k, d_k, f_s)
      if flag = 1                                      [line search failed]
          (x₊, f₊, h) = Recovery(x_k, f_k, g_k, d_k, max_iter)
      end if
      x_{k+1} = x₊,  f_{k+1} = f₊
      Compute g_{k+1}                                  [finite differences using h]
      s_k = x_{k+1} − x_k,  y_k = g_{k+1} − g_k
      Discard (s_k, y_k) if s_k^T y_k ≤ 0
      k = k + 1
  end while
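A schematic Python version of this loop is sketched below. It is not the authors' implementation: the helpers noise_estimate, fd_interval, fd_gradient, line_search and recovery are assumed to be thin wrappers around the earlier sketches (for example, fd_interval fixes the curvature bound and picks forward or central differencing), and the convergence test is replaced by a fixed iteration budget.

```python
import numpy as np
from collections import deque

def lbfgs_direction(g, S, Y):
    """Two-loop recursion: returns d = -H_k g built from the stored (s, y) pairs."""
    q = g.copy()
    coeffs = []
    for s, y in zip(reversed(S), reversed(Y)):          # newest pair first
        rho = 1.0 / np.dot(y, s)
        a = rho * np.dot(s, q)
        q -= a * y
        coeffs.append((rho, a, s, y))
    gamma = np.dot(S[-1], Y[-1]) / np.dot(Y[-1], Y[-1]) if S else 1.0
    r = gamma * q                                       # initial Hessian scaling
    for rho, a, s, y in reversed(coeffs):               # oldest pair first
        b = rho * np.dot(y, r)
        r += (a - b) * s
    return -r

def adaptive_fd_lbfgs(f, x0, noise_estimate, fd_interval, fd_gradient,
                      line_search, recovery, m=10, max_iter=200):
    """Schematic version of the adaptive finite-difference L-BFGS loop above."""
    x = np.asarray(x0, dtype=float)
    v = np.random.default_rng().standard_normal(x.size)
    v /= np.linalg.norm(v)                              # random direction for noise estimation
    eps_f = noise_estimate(f, x, v)
    h = fd_interval(eps_f)
    fx, g = f(x), fd_gradient(f, x, h)
    S, Y = deque(maxlen=m), deque(maxlen=m)
    for _ in range(max_iter):
        d = lbfgs_direction(g, S, Y)
        x_new, f_new, ok = line_search(f, x, fx, g, d, eps_f)
        if not ok:                                      # line search failed
            x_new, f_new, h, eps_f = recovery(f, x, fx, g, d, eps_f, h)
        g_new = fd_gradient(f, x_new, h)
        s, y = x_new - x, g_new - g
        if np.dot(s, y) > 0:                            # discard pair if s^T y <= 0
            S.append(s)
            Y.append(y)
        x, fx, g = x_new, f_new, g_new
    return x, fx
```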
Line Search

The BFGS method requires an Armijo-Wolfe line search:

  f(x_k + αd) ≤ f(x_k) + α c₁ ∇f(x_k)^T d        (Armijo)
  ∇f(x_k + αd)^T d ≥ c₂ ∇f(x_k)^T d              (Wolfe)

Deterministic case: always possible if f is bounded below.
• Can be problematic in the noisy case.
• Strategy: try to satisfy both, but limit the number of attempts.
• If the first trial point (unit steplength) is not acceptable, relax:

  f(x_k + αd) ≤ f(x_k) + α c₁ ∇f(x_k)^T d + 2ε_f    (relaxed Armijo)

Three outcomes: (a) both satisfied; (b) only Armijo; (c) neither.
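A sketch of a limited backtracking search with the relaxed Armijo condition follows (our own reading of the slide; the Wolfe curvature check, which requires a gradient at the trial point, is omitted for brevity):

```python
import numpy as np

def line_search(f, x, fx, g, d, eps_f, c1=1e-4, max_tries=20):
    """Limited backtracking with the noise-relaxed Armijo condition.

    The unit step is tried with the plain Armijo condition; if it fails,
    later trials add the relaxation 2*eps_f. A failure flag is returned so
    the caller can invoke the corrective procedure.
    """
    gtd = float(np.dot(g, d))
    alpha = 1.0
    for attempt in range(max_tries):
        x_trial = x + alpha * d
        f_trial = f(x_trial)
        relax = 0.0 if attempt == 0 else 2.0 * eps_f
        if f_trial <= fx + alpha * c1 * gtd + relax:
            return x_trial, f_trial, True
        alpha *= 0.5
    return x, fx, False                  # flag failure: go to Recovery
```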
Corrective Procedure

  Compute a new noise estimate ε_f along the search direction d_k
  Compute the corresponding finite-difference interval ĥ
  If ĥ ≉ h:  use the new estimate, h ← ĥ, and return without changing x_k
  Else:      compute a new iterate (various options):
             • stencil (Kelley): small perturbation; stencil point x_s
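One possible realization of this corrective step, heavily simplified and with thresholds of our own choosing, is sketched below; noise_along and interval stand in for the noise-estimation and interval formulas sketched earlier.

```python
import numpy as np

def recovery(f, x, fx, g, d, eps_f, h, noise_along=None, interval=None):
    """Sketch of the corrective procedure (g unused in this simplified variant).

    Re-estimate the noise along d and the implied interval h_hat. If h_hat
    differs materially from the h in use, adopt it and return without moving
    x; otherwise take a small stencil-like step along d.
    """
    eps_new = noise_along(f, x, d) if noise_along else eps_f
    h_hat = interval(eps_new) if interval else h
    if not (0.5 * h <= h_hat <= 2.0 * h):     # h_hat not ~ h: the interval was off
        return x, fx, h_hat, eps_new          # update h, keep the current iterate
    x_s = x + h * d / np.linalg.norm(d)       # small perturbation: stencil point
    return x_s, f(x_s), h, eps_new
```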
Some quotes

"I believe that eventually the better methods will not use derivative approximations…" [Powell, 1972]

"f is … somewhat noisy, which renders most methods based on finite differences of little or no use [X, X, X]." [Rios & Sahinidis, 2013]
END