

SLIDE 1

Robust Nonlinear Optimization

Maren Mahsereci, Workshop on Uncertainty Quantification, 09/15/2016, Sheffield

Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

SLIDE 2

Robust optimization

... outline

▸ basics about greedy optimizers
▸ GD and SGD: (stochastic) gradient descent
▸ robust stochastic optimization
  ▸ example: step size adaptation
  ▸ extending line searches
  ▸ robust search directions

SLIDE 3

Typical scheme

... greedy and gradient-based optimizers

x* = argmin_x L(x),    x_{i+1} ← x_i − α_i s_i

  • 1. s_i – which direction? → model the objective function locally
  • 2. α_i – how far? → prevent blow-ups and stagnation
  • 3. repeat

▸ needs to work for many different L(x)
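The three-step scheme above can be sketched in a few lines of Python. This is a minimal illustration with pluggable direction and step-size rules; `greedy_minimize` and the quadratic example are my own constructions, not from the slides:

```python
import numpy as np

def greedy_minimize(direction, step_size, x0, n_steps=200):
    """Generic greedy scheme: pick a direction s_i, pick a step alpha_i, repeat."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        s = direction(x)      # 1. s_i: which direction? (local model of L)
        a = step_size(x, s)   # 2. alpha_i: how far? (avoid blow-up / stagnation)
        x = x - a * s         # 3. update and repeat
    return x

# L(x) = ||x||^2, so the gradient direction is s = 2x; constant step alpha = 0.1
x_min = greedy_minimize(direction=lambda x: 2.0 * x,
                        step_size=lambda x, s: 0.1,
                        x0=[3.0, -2.0])
```

Plugging in a gradient for `direction` and a constant for `step_size` recovers plain gradient descent; a line search would replace the constant `step_size` rule.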


SLIDE 5

The steepest way downhill

... gradient descent finds a local minimum

x* = argmin_x L(x),    x_{i+1} ← x_i − α ∇L(x_i),  α = const.
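A quick sanity check of the update rule (a hedged sketch; the 1D objective L(x) = x² is an invented example): with a constant step size, gradient descent contracts toward the minimum when α is small enough and blows up when it is too large, which is exactly why step-size control matters.

```python
def gd(alpha, x0=10.0, n_steps=100):
    """x_{i+1} <- x_i - alpha * grad L(x_i) on L(x) = x^2, where grad L(x) = 2x."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * 2.0 * x
    return x

x_small = gd(alpha=0.1)   # contraction factor |1 - 2*alpha| = 0.8 < 1: converges to 0
x_large = gd(alpha=1.5)   # contraction factor |1 - 2*alpha| = 2.0 > 1: blows up
```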

SLIDE 15

Additional difficulty

... noisy functions from mini-batching: x* = argmin_x L(x)

sometimes we do not know −∇L(x_i) precisely!


SLIDE 17

Additional difficulty

... noisy functions from mini-batching: x* = argmin_x L(x)

L(x) := (1/M) Σ_{i=1}^{M} ℓ(x, y_i) ≈ (1/m) Σ_{j=1}^{m} ℓ(x, y_j) =: L̂(x),   m ≪ M

▸ compute only the smaller sum over m terms
▸ hope that L̂(x) approximates L(x) well
▸ smaller m means higher noise on ∇L̂(x)

for i.i.d. mini-batches, the noise is approximately Gaussian:

L̂(x) = L(x) + ε,   ε ∼ N(0, O((M − m)/m))
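The effect is easy to simulate (an illustrative sketch; the quadratic per-example loss ℓ(x, y) = (x − y)²/2 and all variable names are my assumptions, not from the slides): mini-batch gradients are unbiased estimates of the full gradient, but their scatter grows as m shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m = 10_000, 10
y = rng.normal(loc=3.0, scale=1.0, size=M)       # full data set

def full_grad(x):
    # l(x, y_i) = (x - y_i)^2 / 2  ->  grad L(x) = (1/M) sum_i (x - y_i)
    return np.mean(x - y)

def minibatch_grad(x):
    batch = rng.choice(y, size=m, replace=False)  # m << M
    return np.mean(x - batch)

g_true = full_grad(0.0)
g_hat = np.array([minibatch_grad(0.0) for _ in range(2000)])
# unbiased (mean of g_hat is close to g_true) but noisy, and roughly Gaussian
```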

SLIDE 18

The steepest way downhill

... in expectation: SGD finds a local minimum, too.

x* = argmin_x L(x),    x_{i+1} ← x_i − α ∇L̂(x_i),  α = const.
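A minimal SGD run on the same kind of mini-batched quadratic (a hypothetical setup, with the same invented per-example loss as above): with a constant α, the iterate hovers around the minimizer rather than settling exactly on it.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=1.0, size=10_000)
# L(x) = (1/2M) sum_i (x - y_i)^2 has its minimum at x* = mean(y)

x, alpha, m = 0.0, 0.05, 10
for _ in range(3000):
    batch = rng.choice(y, size=m, replace=False)
    x -= alpha * np.mean(x - batch)      # noisy gradient from the mini-batch
```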


SLIDE 20

Step size adaptation

... by line searches

x_{i+1} ← x_i − α_i s_i     so far α was constant and hand-chosen!

▸ line searches automatically choose step sizes

SLIDE 21

Line searches

automated learning rate adaptation: x* = argmin_x L(x)

x_{i+1} ← x_i − α_i ∇L(x_i)     set the scalar step size α_i given the direction −∇L(x_i)

[Figure: 1D objective with a small step size marked.]


SLIDE 24

Line searches

automated learning rate adaptation: x* = argmin_x L(x)

x_{i+1} ← x_i − α_i ∇L̂(x_i)     set the scalar step size α_i given the noisy direction −∇L̂(x_i)

[Figure: 1D objective comparing a small step size, a large step size, and a line search.]

Line searches break in the stochastic setting!


SLIDE 26

Step size adaptation

... by line searches

x_{i+1} ← x_i − α_i s_i

▸ line searches automatically choose step sizes
▸ very fast subroutines, called in each optimization step
▸ control blow-up and stagnation
▸ they do not work in stochastic optimization problems!

small outline

▸ introduce classic (noise-free) line searches
▸ translate the concept into the language of probability
▸ get a new algorithm that is robust to noise

SLIDE 27

Classic line searches

Initial evaluation ≡ current position of the optimizer: x* = argmin_x f(x),  x_{i+1} ← x_i − t s_{i+1}

[Figure: 1D objective f(t) and its derivative df(t) along the search direction.]

Wolfe conditions: accept when [Wolfe, SIAM Review, 1969]

f(t) ≤ f(0) + c₁ t f′(0)      (W-I)
f′(t) ≥ c₂ f′(0)              (W-IIa)
|f′(t)| ≤ c₂ |f′(0)|          (W-IIb)
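The three conditions translate directly into code. This is a small helper I am adding for illustration; the function name and the values c₁ = 10⁻⁴, c₂ = 0.9 are conventional choices, not from the slides:

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9):
    """Evaluate W-I, W-IIa, W-IIb at step t, given values/derivatives at 0 and t."""
    w_i   = ft <= f0 + c1 * t * df0       # sufficient decrease (W-I)
    w_iia = dft >= c2 * df0               # curvature (W-IIa)
    w_iib = abs(dft) <= c2 * abs(df0)     # strong curvature (W-IIb)
    return w_i, w_iia, w_iib

# along the line: f(t) = (t - 1)^2, f'(t) = 2(t - 1), a descent direction at t = 0
f = lambda t: (t - 1.0) ** 2
df = lambda t: 2.0 * (t - 1.0)
near_min = wolfe_conditions(f(0), df(0), f(0.9), df(0.9), t=0.9)      # all hold
too_short = wolfe_conditions(f(0), df(0), f(0.05), df(0.05), t=0.05)  # W-IIa fails
```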

SLIDE 28

Classic line searches

Search: candidate #1

[Figure: initial candidate marked on f(t) and df(t); Wolfe conditions as on slide 27.]

SLIDE 29

Classic line searches

Collapse search space

[Figure: search space narrowed around the candidate; Wolfe conditions as on slide 27.]

SLIDE 30

Classic line searches

Search: candidate #2

[Figure: extrapolation step beyond the current bracket; Wolfe conditions as on slide 27.]

SLIDE 31

Classic line searches

Collapse search space

[Figure: search space narrowed again; Wolfe conditions as on slide 27.]

SLIDE 32

Classic line searches

Search: candidate #3

[Figure: interpolation step toward a local minimum of the spline model; Wolfe conditions as on slide 27.]

SLIDE 33

Classic line searches

Accept: datapoint #3 fulfills the Wolfe conditions

[Figure: accepted point marked; Wolfe conditions as on slide 27.]

SLIDE 34

Classic line searches

Choosing meaningful step sizes, at very low overhead

many classic line searches:

  • 1. model the 1D objective with a cubic spline
  • 2. search candidate points by collapsing the search space
  • 3. accept if the Wolfe conditions are fulfilled
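A toy version of such a routine can be sketched as follows. This is a deliberately simplified illustration, not the textbook cubic-interpolation algorithm: it extrapolates by doubling, collapses the bracket by bisection, and accepts on the weak Wolfe conditions.

```python
def line_search(f, df, t0=1.0, c1=1e-4, c2=0.9, max_iter=30):
    """Extrapolate / collapse the search space until the Wolfe conditions hold."""
    f0, df0 = f(0.0), df(0.0)
    lo, hi, t = 0.0, None, t0
    for _ in range(max_iter):
        if f(t) <= f0 + c1 * t * df0 and df(t) >= c2 * df0:
            return t                        # accept: weak Wolfe conditions hold
        if f(t) > f0 + c1 * t * df0 or df(t) >= 0.0:
            hi = t                          # overshot: collapse the bracket
        else:
            lo = t                          # undershot: extrapolate further out
        t = 2.0 * t if hi is None else 0.5 * (lo + hi)
    return t

# f(t) = (t - 1)^2: the accepted Wolfe point is the 1D minimizer t = 1
t_acc = line_search(lambda t: (t - 1.0) ** 2, lambda t: 2.0 * (t - 1.0))
```

All decisions here are based on exact values of f and df, which is precisely what breaks once those evaluations are noisy.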

SLIDE 35

Classic line searches

Fail in the presence of noise.

Classic line searches break in stochastic optimization problems!

SLIDE 36

Classic line searches

designing a probabilistic line search

many classic line searches:

  • 1. model the 1D objective with a cubic spline
  • 2. search candidate points by collapsing the search space
  • 3. accept if the Wolfe conditions are fulfilled

Classic line searches break in stochastic optimization problems! extending the line search paradigm:

  • 1. model: cubic spline → GP surrogate
  • 2. search: Bayesian optimization for exploration
  • 3. accept: probabilistic Wolfe termination conditions


SLIDE 38

Building a probabilistic line search

Step 1: cubic spline GP surrogate; Step 2: BO for exploration

  • 1. model: cubic spline GP (integrated Wiener process)

p(f) = GP(f; 0, k),   k(t, t′) = (1/3) min³(t, t′) + (1/2) |t − t′| min²(t, t′)

▸ robust and flexible
▸ has analytic minima (roots of a quadratic equation)

  • 2. search: Bayesian optimization (expected improvement)

u_EI(t) = E_{p(f_t | y, y′)}[min{0, η − f(t)}]   [Jones et al., 1998]

▸ only evaluated at a few candidate points:
  ▸ analytic minima of the posterior mean
  ▸ one extrapolation point
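The kernel itself is cheap to evaluate. Below is a sketch (variable names are mine) that builds the Gram matrix of the integrated Wiener process on a grid of step sizes; as a valid covariance function it should produce a symmetric positive semi-definite matrix:

```python
import numpy as np

def k_iwp(t, s):
    """Integrated Wiener process kernel:
    k(t, t') = min(t, t')^3 / 3 + |t - t'| * min(t, t')^2 / 2."""
    m = np.minimum(t, s)
    return m ** 3 / 3.0 + np.abs(t - s) * m ** 2 / 2.0

ts = np.linspace(0.1, 2.0, 8)
K = k_iwp(ts[:, None], ts[None, :])    # Gram matrix on a grid of step sizes
eigvals = np.linalg.eigvalsh(K)        # PSD check: no significantly negative values
```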


SLIDE 41

Building a probabilistic line search

Step 3: probabilistic Wolfe termination conditions

  • 3. accept: probabilistic Wolfe termination conditions:

▸ the Wolfe conditions are positivity constraints on two variables a_t, b_t

f(t) ≤ f(0) + c₁ t f′(0)  (W-I)   and   f′(t) ≥ c₂ f′(0)  (W-II)

[a_t; b_t] = [[1, c₁t, −1, 0], [0, −c₂, 0, 1]] · [f(0); f′(0); f(t); f′(t)]ᵀ ≥ 0

▸ the GP on f implies, at each t, a bivariate Gaussian distribution:

p(a_t, b_t) = N([a_t; b_t]; [m_t^a; m_t^b], [[C_t^aa, C_t^ab], [C_t^ba, C_t^bb]])

probability of the weak Wolfe conditions:   p_t^Wolfe = p(0 ≤ a_t ∧ 0 ≤ b_t)
approximate strong conditions:              p_t^Wolfe = p(0 ≤ a_t ∧ 0 ≤ b_t ≤ b̄)
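Given posterior means and covariance for (a_t, b_t), the weak-Wolfe probability is an upper-orthant mass of a bivariate Gaussian. The sketch below uses SciPy's bivariate normal CDF; the helper names and the default c₁, c₂ values are my choices, not fixed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def wolfe_ab(t, f0, df0, ft, dft, c1=1e-4, c2=0.9):
    """a_t >= 0 <=> W-I, b_t >= 0 <=> W-II: the linear map from the slide."""
    a = f0 + c1 * t * df0 - ft
    b = dft - c2 * df0
    return a, b

def p_wolfe(mean, cov):
    """p(0 <= a_t and 0 <= b_t) for (a_t, b_t) ~ N(mean, cov):
    upper-orthant mass, via (a, b) >= 0  <=>  (-a, -b) <= 0."""
    return multivariate_normal(mean=-np.asarray(mean), cov=cov).cdf([0.0, 0.0])

# zero-mean, independent unit variances: each variable is >= 0 with prob 1/2
p = p_wolfe([0.0, 0.0], np.eye(2))      # ~ 0.25
```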

SLIDE 42

Probabilistic line search routine

Initial belief: first evaluation ≡ current position of the optimizer

[Figure: GP posterior over f and df along the line; pWolfe = 0.00.]

SLIDE 43

Probabilistic line search routine

Search: candidate #1

[Figure: initial candidate marked on the GP posterior; pWolfe = 0.00.]

SLIDE 44

Probabilistic line search routine

Accept: check pWolfe for the first datapoint

[Figure: pWolfe = 0.00 at the first datapoint, so the search continues.]

SLIDE 45

Probabilistic line search routine

Search: candidate #2

[Figure: extrapolation candidate on the GP posterior; pWolfe = 0.00.]

SLIDE 46

Probabilistic line search routine

Accept: check pWolfe for datapoints #1 and #2

[Figure: pWolfe = 0.07 and 0.00, so the search continues.]

SLIDE 47

Probabilistic line search routine

Search: candidates #3

[Figure: two candidates, a local minimum of the posterior mean and an extrapolation point; pWolfe = 0.07 and 0.00.]

SLIDE 48

Probabilistic line search routine

Search: candidates #3 (discriminate through EI)

[Figure: candidates ranked by EI · pWolfe on the GP posterior; pWolfe = 0.07 and 0.00.]

SLIDE 49

Probabilistic line search routine

Accept: check pWolfe for datapoints #1, #2 and #3

[Figure: third datapoint accepted; pWolfe = 0.68, 0.08, 0.00.]

SLIDE 50

small summary

... probabilistic line searches

make new from old:

  • 1. model: cubic spline → GP with cubic spline means
  • 2. search: collapsing the search space → Bayesian optimization
  • 3. accept: binary Wolfe conditions → probabilistic Wolfe conditions

→ lightweight inner optimization routine
→ robust stochastic optimization

SLIDE 51

Line search finds learning rates

SGD on a 2-layer neural net, mini-batch size 10

[Figure: test error vs. initial learning rate and vs. epoch, on CIFAR-10 and MNIST, comparing SGD with a fixed step size to SGD with the probabilistic line search.]


SLIDE 53

small summary

... about line searches and others

take away

▸ optimizers are learning machines
▸ data: noisy gradients
▸ the prior encodes structure of the objective
▸ prob. line search: infers an approximate minimum

there is more

▸ the field is much broader than 'only' line searches
▸ search directions can also be learned
▸ classic search directions are MAP estimators of Gaussian inference
▸ robust second-order search directions are still needed!


SLIDE 55

Probabilistic line searches

... in TensorFlow

We are implementing it in TensorFlow. Have a beer with Lukas!

Thank you!