

SLIDE 1

Benefiting from Negative Curvature

Daniel P. Robinson

Johns Hopkins University
Department of Applied Mathematics and Statistics
Collaborator: Frank E. Curtis (Lehigh University)
US and Mexico Workshop on Optimization and Its Applications
Huatulco, Mexico, January 8, 2018

Negative Curvature US-Mexico-2018 1 / 31

SLIDE 2

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 3

Motivation

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 4

Motivation

Problem of interest (deterministic setting):

    minimize_{x ∈ R^n} f(x)

  • f : R^n → R is assumed to be twice-continuously differentiable
  • L denotes the Lipschitz constant for ∇f
  • σ denotes the Lipschitz constant for ∇^2 f
  • f may be nonconvex

Notation: g(x) := ∇f(x) and H(x) := ∇^2 f(x)
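This setting is easy to instantiate; the following sketch uses a hypothetical test function of my own choosing (a 2-D "double well", not an example from the talk) to make f, g, and H concrete:

```python
import numpy as np

# Illustrative nonconvex instance: f(x) = (x0^2 - 1)^2 + x1^2 is
# twice-continuously differentiable and has negative curvature
# wherever 12*x0^2 < 4.
def f(x):
    return (x[0]**2 - 1.0)**2 + x[1]**2

def g(x):                       # g(x) := grad f(x)
    return np.array([4.0*x[0]*(x[0]**2 - 1.0), 2.0*x[1]])

def H(x):                       # H(x) := grad^2 f(x)
    return np.array([[12.0*x[0]**2 - 4.0, 0.0],
                     [0.0, 2.0]])

x = np.array([0.1, 0.5])
lam_min = np.linalg.eigvalsh(H(x))[0]   # leftmost eigenvalue: 12*0.01 - 4 = -3.88
print(lam_min < 0)                      # True: x sits in a negative-curvature region
```

Points with a negative leftmost Hessian eigenvalue are exactly where the directions dk discussed later become available.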

SLIDE 5

Motivation

Much work has been done on convergence to second-order points:

  • D. Goldfarb (1979) [6]
      - proves a convergence result to second-order optimal points (unconstrained)
      - curvilinear search using a descent direction and a negative curvature direction
  • D. Goldfarb, C. Mu, J. Wright, and C. Zhou (2017) [7]
      - consider equality constrained problems
      - prove a convergence result to second-order optimal points
      - extend the curvilinear search from the unconstrained setting
  • F. Facchinei and S. Lucidi (1998) [3]
      - consider inequality constrained problems
      - exact penalty function, directions of negative curvature, and line search
  • P. Gill, V. Kungurtsev, and D. Robinson (2017) [4, 5]
      - consider inequality constrained problems
      - convergence to second-order optimal points under weak assumptions
  • J. Moré and D. Sorensen (1979), A. Forsgren, P. Gill, and W. Murray (1995), and many more . . .

None consistently perform better by using directions of negative curvature!

SLIDE 6

Motivation

Others hope to avoid saddle points:

  • J. Lee, M. Simchowitz, M. Jordan, and B. Recht (2016) [8]
      - Gradient descent converges to a local minimizer almost surely.
      - Uses random initialization.
  • Y. Dauphin et al. (2014) [2]
      - Present a saddle-free Newton method (a modified-Newton method).
      - Goal is to escape saddle points (move away when close).

These (and others) try to avoid the ill effects of negative curvature.

SLIDE 7

Motivation

Purpose of this research: Design a method that consistently performs better by using directions of negative curvature. Do not try to avoid negative curvature. Use it!

SLIDE 8

Deterministic Setting

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 9

Deterministic Setting: The Method

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 10

Deterministic Setting: The Method

Overview:
  • Compute a descent direction (sk) and a negative curvature direction (dk).
  • Predict which step will make more progress in reducing the objective f.
  • If the predicted decrease is not realized, adjust parameters.
  • Iterate until an approximate second-order solution is obtained.

SLIDE 11

Deterministic Setting: The Method

Requirements on the descent direction sk
Compute sk to satisfy
    −g(xk)^T sk ≥ δ ||sk||_2 ||g(xk)||_2
  • for some δ ∈ (0, 1]
  • Examples:
      sk = −g(xk)
      Bk sk = −gk with Bk appropriately chosen

Requirements on the negative curvature direction dk
Compute dk to satisfy
    dk^T H(xk) dk ≤ γ λk ||dk||_2^2 < 0
  • for some γ ∈ (0, 1]
  • g(xk)^T dk ≤ 0
  • Examples:
      dk = ±vk with (λk, vk) the left-most eigenpair of H(xk)
      dk a sufficiently accurate estimate of ±vk
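Both conditions can be checked mechanically. A sketch with the simplest admissible choices; the concrete values of δ, γ and the sample gradient/Hessian below are illustrative, not values from the talk:

```python
import numpy as np

delta, gamma = 0.5, 0.5             # some delta, gamma in (0, 1]

def descent_direction(gk):
    # sk = -g(xk): then -g^T s = ||g||^2 >= delta*||s||_2*||g||_2 for delta <= 1
    return -gk

def negative_curvature_direction(gk, Hk):
    # dk = +/- vk with (lam_k, vk) the leftmost eigenpair of H(xk);
    # the sign is chosen so that g(xk)^T dk <= 0
    lam, V = np.linalg.eigh(Hk)     # eigenvalues in ascending order
    lam_k, vk = lam[0], V[:, 0]
    if lam_k >= 0.0:
        return None, lam_k          # no direction of negative curvature exists
    dk = -vk if gk @ vk > 0.0 else vk
    return dk, lam_k

gk = np.array([1.0, -2.0])
Hk = np.array([[-3.0, 0.0], [0.0, 2.0]])
sk = descent_direction(gk)
dk, lam_k = negative_curvature_direction(gk, Hk)
print(dk @ Hk @ dk <= gamma * lam_k * (dk @ dk))   # True: curvature condition holds
```

The sign flip on vk is what enforces the extra requirement g(xk)^T dk ≤ 0.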

SLIDE 12

Deterministic Setting: The Method

How should sk and dk be used?

Use both in a curvilinear linesearch?
  • Often taints good descent directions with "poorly scaled" directions of negative curvature.
  • No consistent performance gains!

Start using dk only once g(xk) is "small"?
  • No consistent performance gains!
  • Misses areas of the space in which great decrease in f is possible.

Use sk when g(xk) is big relative to |(λk)−|; otherwise, use dk?
  • Better, but still inconsistent performance gains!

We propose to use upper-bounding models. It works!

SLIDE 13

Deterministic Setting: The Method

Predicted decrease along the descent direction sk

If Lk ≥ L, then
    f(xk + α sk) ≤ f(xk) − ms,k(α) for all α,
with
    ms,k(α) := −α g(xk)^T sk − (Lk/2) α^2 ||sk||_2^2,
and define the quantity
    αk := −g(xk)^T sk / (Lk ||sk||_2^2) = argmax_{α≥0} ms,k(α).

Comments:
  • ms,k(αk) is the best predicted decrease along sk.
  • If sk = −g(xk), then αk = 1/Lk.
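The model and its maximizer can be written down directly; a small sketch with illustrative numbers (not from the talk):

```python
import numpy as np

# The model m_{s,k} and its maximizer alpha_k, exactly as defined above.
def m_s(alpha, gk, sk, Lk):
    return -alpha * (gk @ sk) - 0.5 * Lk * alpha**2 * (sk @ sk)

def alpha_k(gk, sk, Lk):
    return -(gk @ sk) / (Lk * (sk @ sk))

gk = np.array([3.0, -4.0])      # illustrative gradient, ||g||_2 = 5
Lk = 2.0
sk = -gk                        # steepest-descent choice
a = alpha_k(gk, sk, Lk)
print(a)                        # 0.5, i.e. 1/Lk, matching the comment above
```

Setting the derivative of ms,k to zero recovers the closed form for αk, which is why no inner line search is needed along sk.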

SLIDE 14

Deterministic Setting: The Method

Predicted decrease along the negative curvature direction dk

If σk ≥ σ, then
    f(xk + β dk) ≤ f(xk) − md,k(β) for all β,
with
    md,k(β) := −β g(xk)^T dk − (1/2) β^2 dk^T H(xk) dk − (σk/6) β^3 ||dk||_2^3,
and define, with ck := dk^T H(xk) dk, the quantity
    βk := [ −ck + sqrt( ck^2 − 2 σk ||dk||_2^3 g(xk)^T dk ) ] / ( σk ||dk||_2^3 ) = argmax_{β≥0} md,k(β).

Comments:
  • md,k(βk) is the best predicted decrease along dk.
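Here βk is the positive root of the quadratic md,k'(β) = 0. A sketch with illustrative data (my own numbers, not from the talk):

```python
import numpy as np

# The cubic model m_{d,k} and its maximizer beta_k as defined above.
def m_d(beta, gk, dk, Hk, sigma_k):
    nd3 = np.linalg.norm(dk)**3
    return (-beta * (gk @ dk) - 0.5 * beta**2 * (dk @ Hk @ dk)
            - (sigma_k / 6.0) * beta**3 * nd3)

def beta_k(gk, dk, Hk, sigma_k):
    nd3 = np.linalg.norm(dk)**3
    ck = dk @ Hk @ dk                  # c_k < 0 for a negative curvature dk
    disc = ck * ck - 2.0 * sigma_k * nd3 * (gk @ dk)
    return (-ck + np.sqrt(disc)) / (sigma_k * nd3)

gk = np.array([0.2, 0.0])
Hk = np.array([[-2.0, 0.0], [0.0, 1.0]])
dk = np.array([-1.0, 0.0])             # unit leftmost direction with g^T d <= 0
bk = beta_k(gk, dk, Hk, sigma_k=1.0)
# bk maximizes m_d over beta >= 0:
print(m_d(bk, gk, dk, Hk, 1.0) >= m_d(bk + 0.1, gk, dk, Hk, 1.0))   # True
```

Since ck < 0 and g(xk)^T dk ≤ 0, the discriminant is at least ck^2 > 0, so βk is always well defined and strictly positive.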

SLIDE 15

Deterministic Setting: The Method

Choose the step that predicts the larger decrease in f:
  • If ms,k(αk) ≥ md,k(βk), then try the step sk.
  • If md,k(βk) > ms,k(αk), then try the step dk.

Question: Why "try" instead of "use"?
Answer: We do not know whether Lk ≥ L and σk ≥ σ.
  • If Lk < L, then it could be the case that f(xk + αk sk) > f(xk) − ms,k(αk).
  • If σk < σ, then it could be the case that f(xk + βk dk) > f(xk) − md,k(βk).
SLIDE 16

Deterministic Setting: The Method

Dynamic Step-Size Algorithm

 1: for k ∈ N do
 2:     compute sk and dk satisfying the required step conditions
 3:     loop
 4:         compute αk = argmax_{α≥0} ms,k(α) and βk = argmax_{β≥0} md,k(β)
 5:         if ms,k(αk) ≥ md,k(βk) then
 6:             if f(xk + αk sk) ≤ f(xk) − ms,k(αk) then
 7:                 set xk+1 ← xk + αk sk and then exit loop
 8:             else
 9:                 set Lk ← ρ Lk   [ρ ∈ (1, ∞)]
10:         else
11:             if f(xk + βk dk) ≤ f(xk) − md,k(βk) then
12:                 set xk+1 ← xk + βk dk and then exit loop
13:             else
14:                 set σk ← ρ σk
15:     set (Lk+1, σk+1) ∈ (Lmin, Lk] × (σmin, σk]
SLIDE 17

Deterministic Setting: Convergence Results

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 18

Deterministic Setting: Convergence Results

Key decrease inequality: For all k ∈ N it holds that

    f(xk) − f(xk+1) ≥ max{ (δ^2 / (2 Lk)) ||g(xk)||_2^2 ,  (2γ^3 / (3 σk^2)) |(λk)−|^3 }.

Comments:
  • The first term in the max holds when xk+1 = xk + αk sk.
  • The second term in the max holds when xk+1 = xk + βk dk.
  • The max holds because we choose whether to try sk or dk based on ms,k(αk) ≥ md,k(βk).
  • One can prove that {Lk} and {σk} remain uniformly bounded.
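The first term can be recovered in two lines from the earlier definitions; a sketch, using only the closed form of αk and the step condition −g(xk)^T sk ≥ δ||sk||_2 ||g(xk)||_2:

```latex
m_{s,k}(\alpha_k)
  = \frac{\bigl(g(x_k)^T s_k\bigr)^2}{2 L_k \,\|s_k\|_2^2}
  \;\ge\; \frac{\delta^2 \,\|s_k\|_2^2 \,\|g(x_k)\|_2^2}{2 L_k \,\|s_k\|_2^2}
  \;=\; \frac{\delta^2}{2 L_k}\,\|g(x_k)\|_2^2 .
```

Since an accepted step sk satisfies f(xk) − f(xk+1) ≥ ms,k(αk), this gives the first term of the max; the second term follows from an analogous bound on md,k(βk).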

SLIDE 19

Deterministic Setting: Convergence Results

Theorem (Limit points satisfy second-order necessary conditions)
The computed iterates satisfy
    lim_{k→∞} ||g(xk)||_2 = 0 and lim inf_{k→∞} λk ≥ 0.

Theorem (Complexity result)
The number of iterations, function evaluations, and derivative (i.e., gradient and Hessian) evaluations required until some iteration k ∈ N is reached with ||g(xk)||_2 ≤ ε_g and |(λk)−| ≤ ε_H is at most O(max{ε_g^-2, ε_H^-3}).

SLIDE 20

Deterministic Setting: Numerical Results

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 21

Deterministic Setting: Numerical Results

Refined parameter increase strategy

    L̂k ← Lk + 2 [ f(xk + αk sk) − f(xk) + ms,k(αk) ] / ( αk^2 ||sk||_2^2 )
    σ̂k ← σk + 6 [ f(xk + βk dk) − f(xk) + md,k(βk) ] / ( βk^3 ||dk||_2^3 )

then, with ρ ← 2, use the update

    Lk ← max{ ρ Lk, min{ 10^3 Lk, L̂k } }
    σk ← max{ ρ σk, min{ 10^3 σk, σ̂k } }

Refined parameter decrease strategy

    Lk+1 ← max{ 10^-3, 10^-3 Lk, L̂k } and σk+1 ← σk   when xk+1 ← xk + αk sk
    σk+1 ← max{ 10^-3, 10^-3 σk, σ̂k } and Lk+1 ← Lk   when xk+1 ← xk + βk dk
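As a sanity check on the hat-L formula (with illustrative numbers, not data from the talk): L̂k is exactly the model constant that makes the quadratic upper bound tight at the rejected trial point.

```python
# m_{s,k}(alpha) with an explicit model constant L, so it can be
# evaluated at both the current Lk and the refined estimate L_hat.
def m_s(alpha, gTs, L, s2):
    return -alpha * gTs - 0.5 * L * alpha * alpha * s2

# Illustrative data: g^T s, ||s||^2, current Lk, trial step alpha_k, and a
# trial value f(xk + alpha_k sk) that violates the current model.
f0, gTs, s2, Lk, ak = 1.0, -4.0, 4.0, 1.0, 1.0
f_trial = 0.5
L_hat = Lk + 2.0 * (f_trial - f0 + m_s(ak, gTs, Lk, s2)) / (ak * ak * s2)
# With L_hat in place of Lk, the bound f_trial <= f0 - m holds with equality:
print(abs(f_trial - (f0 - m_s(ak, gTs, L_hat, s2))) < 1e-12)   # True
# The safeguarded increase then keeps the growth between rho and 10^3:
rho = 2.0
Lk_new = max(rho * Lk, min(1e3 * Lk, L_hat))
```

The outer max/min pair prevents L̂k from shrinking the constant after a rejection and from blowing it up by more than three orders of magnitude in one step.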

SLIDE 22

Deterministic Setting: Numerical Results

Termination condition

    ||g(xk)|| ≤ 10^-5 max{1, ||g(x0)||} and |(λk)−| ≤ 10^-5 max{1, |(λ0)−|}.

Measures of interest (runs using sk only vs. runs using both sk and dk)

  • Final objective value:
      [ ffinal(sk) − ffinal(sk, dk) ] / max{ |ffinal(sk)|, |ffinal(sk, dk)|, 1 } ∈ [−1, 1]
  • Required number of iterations:
      [ #its(sk) − #its(sk, dk) ] / max{ #its(sk), #its(sk, dk), 1 } ∈ [−1, 1]
  • Required number of function evaluations:
      [ #fevals(sk) − #fevals(sk, dk) ] / max{ #fevals(sk), #fevals(sk, dk), 1 } ∈ [−1, 1]
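The relative measure has the same shape in all three cases; a one-line sketch with sample numbers of my own (not results from the talk):

```python
# Relative-difference measure as defined above: positive values favor the
# run that also uses negative curvature directions (sk, dk).
def relative_measure(val_s, val_sd):
    return (val_s - val_sd) / max(abs(val_s), abs(val_sd), 1.0)

# E.g. 120 iterations without dk vs. 80 with dk (illustrative numbers):
print(relative_measure(120.0, 80.0))   # 0.333..., i.e. dk saved a third
```

Dividing by the max of the two magnitudes (floored at 1) keeps every measure in [−1, 1] and well defined even when both values are tiny.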

SLIDE 23

Deterministic Setting: Numerical Results

Steepest descent: sk = −g(xk) and dk = ±vk

[Figure: paired bar charts, (a) Final objective value and (b) Required number of iterations, over the test problems BIGGS6, RAT43LS, VIBRBEAM, HELIX, MGH09LS, HEART6LS, RAT42LS, HUMPS, MISRA1ALS, HATFLDD, DENSCHNE, LANCZOS2LS, GROWTHLS, GULF, LANCZOS3LS, THURBERLS, MEYER3, LANCZOS1LS, ROSENBR, VESUVIALS, NELSONLS, SINEVAL, CUBE, HIMMELBF, MARATOSB, MGH17LS, ENGVAL2, HEART8LS, KIRBY2LS, HYDC20LS, SNAIL.]

Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^-5 in absolute value are presented.

SLIDE 24

Deterministic Setting: Numerical Results

Shifted Newton: Bk = H(xk) + δk I, Bk sk = −g(xk), and dk = ±vk

[Figure: paired bar charts, (a) Final objective value and (b) Required number of iterations, over the test problems HEART8LS, BIGGS6, HEART6LS, ENGVAL2, ECKERLE4LS, OSBORNEB, LOGHAIRY, LANCZOS3LS, HUMPS, LANCZOS2LS, BEALE, BENNETT5LS, MISRA1ALS, ROSZMAN1LS, DENSCHND, DENSCHNE, NELSONLS, HAHN1LS, MEYER3, MGH10LS, OSBORNEA, GROWTHLS, HATFLDE, MGH09LS, SINEVAL, HATFLDD, THURBERLS, MGH17LS, LANCZOS1LS, POWELLBSLS, CHWIRUT1LS, CHWIRUT2LS, HYDC20LS, DECONVU, GULF, VIBRBEAM, KIRBY2LS, SNAIL, HELIX.]

Figure: Only problems for which at least one negative curvature direction is used and the difference in final f-values is larger than 10^-5 in absolute value are presented.

SLIDE 25

Deterministic Setting: Comments

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 26

Deterministic Setting: Comments

Comments:

  • If L and σ are known, then in theory Lk and σk never need to be updated. In practice, we still allow increases and decreases for efficiency.
  • Currently, one function evaluation per trial step. If evaluating f is very cheap, one could consider evaluating both trial steps during each iteration.
  • Relevance to strict saddle points:
      - We do not make any non-degeneracy assumption.
      - Our convergence result holds regardless of the types of saddle points.
      - When the strict saddle point property holds, our theory implies that
          * any limit point of the sequence {xk} is a minimizer of f;
          * the iterates eventually enter a region that contains only minimizers.
      - We get a stronger convergence theory (cf. Paternain, Mokhtari, and Ribeiro (2017)) because we incorporate directions of negative curvature.
  • The complexity result for our method is not "optimal" from a traditional complexity perspective.
      - F. Curtis and I have been intrigued by alternate complexity perspectives:
      - Typically, results are for general problems and based on the worst case.
      - From some perspective, the algorithm I presented today is "optimal".
      - See his talk later this afternoon!
SLIDE 27

Stochastic Setting

Outline

1. Motivation
2. Deterministic Setting
  • The Method
  • Convergence Results
  • Numerical Results
  • Comments
3. Stochastic Setting

SLIDE 28

Stochastic Setting

Summary

  • Apply the same ideas as in the deterministic case, but in the mini-batch setting.
  • Add a negative curvature direction dk = ±vk with the sign chosen randomly. Can be thought of as a "smart noise" approach.
  • Small gain in performance relative to a similar algorithm without dk.
  • See our paper [1] for additional details.

SLIDE 29

Stochastic Setting

References I

[1] F. E. Curtis and D. P. Robinson, Exploiting negative curvature directions in stochastic optimization, arXiv:1703.00412, submitted to Mathematical Programming (Special Issue on Nonconvex Optimization for Statistical Learning), 2017.

[2] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Curran Associates, Inc., 2014, pp. 2933–2941.

[3] F. Facchinei and S. Lucidi, Convergence to second order stationary points in inequality constrained optimization, Mathematics of Operations Research, 23 (1998), pp. 746–766.

SLIDE 30

Stochastic Setting

References II

[4] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: global convergence, IMA Journal of Numerical Analysis, 37 (2017), pp. 407–443.

[5] P. E. Gill, V. Kungurtsev, and D. P. Robinson, A stabilized SQP method: superlinear convergence, Mathematical Programming, 163 (2017), pp. 369–410.

[6] D. Goldfarb, Curvilinear path steplength algorithms for minimization which use directions of negative curvature, Mathematical Programming, 18 (1980), pp. 31–40.

[7] D. Goldfarb, C. Mu, J. Wright, and C. Zhou, Using negative curvature in solving nonlinear programs, arXiv preprint arXiv:1706.00896, 2017.

SLIDE 31

Stochastic Setting

References III

[8] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, Gradient descent only converges to minimizers, in Conference on Learning Theory, 2016, pp. 1246–1257.
