Robust Nonlinear Optimization
Maren Mahsereci
Workshop on Uncertainty Quantification, 09/15/2016, Sheffield
Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems
Robust optimization
. . . outline
▸ basics about greedy optimizers
▸ GD and SGD: (stochastic) gradient descent
▸ robust stochastic optimization
▸ example: step size adaptation
▸ extending line searches
▸ robust search directions
Typical scheme
. . . greedy and gradient-based optimizers
$x^\ast = \arg\min_x L(x), \qquad x_{i+1} \leftarrow x_i - \alpha_i s_i$
- 1. $s_i$ – which direction? → model the objective function locally
- 2. $\alpha_i$ – how far? → prevent blow-ups and stagnation
- 3. repeat
▸ needs to work for many different L(x)
The steepest way downhill
. . . gradient descent finds a local minimum
$x^\ast = \arg\min_x L(x), \qquad x_{i+1} \leftarrow x_i - \alpha \nabla L(x_i), \quad \alpha = \text{const.}$
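As a rough illustration (not part of the original slides), a minimal sketch of this update rule in Python; the quadratic objective and the step size α = 0.1 are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, n_iters=100):
    """Gradient descent with a fixed step size: x_{i+1} = x_i - alpha * grad(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * grad(x)
    return x

# toy quadratic L(x) = 0.5 * x^T A x, whose gradient is A x and whose minimizer is the origin
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
x_final = gradient_descent(lambda x: A @ x, x0=[4.0, -2.0], alpha=0.1)
print(x_final)  # close to [0, 0]
```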
Additional difficulty
. . . noisy functions by mini-batching, $x^\ast = \arg\min_x L(x)$
sometimes we do not know $-\nabla L(x_i)$ precisely!
Additional difficulty
. . . noisy functions by mini-batching, $x^\ast = \arg\min_x L(x)$

$L(x) := \frac{1}{M} \sum_{i=1}^{M} \ell(x, y_i) \;\approx\; \frac{1}{m} \sum_{j=1}^{m} \ell(x, y_j) =: \hat{L}(x), \qquad m \ll M$

▸ compute only the smaller sum over $m$
▸ hope that $\hat{L}(x)$ approximates $L(x)$ well
▸ smaller $m$ means higher noise on $\nabla \hat{L}(x)$

for i.i.d. mini-batches, the noise is approximately Gaussian: $L(x) = \hat{L}(x) + \epsilon, \quad \epsilon \sim \mathcal{N}\!\left(0,\, \mathcal{O}\!\left(\tfrac{M-m}{m}\right)\right)$
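A minimal sketch of such a mini-batch gradient estimate (illustrative names, assuming a per-example loss gradient grad_loss); running it repeatedly shows that a smaller batch size m gives a noisier estimate:

```python
import numpy as np

def minibatch_gradient(grad_loss, x, data, m, rng=None):
    """Estimate (1/M) * sum_i grad_loss(x, y_i) by averaging over a random mini-batch of size m."""
    rng = rng or np.random.default_rng()
    batch = rng.choice(data, size=m, replace=False)
    return np.mean([grad_loss(x, y) for y in batch], axis=0)

# illustrative per-example loss l(x, y) = 0.5 * (x - y)^2, so grad_loss(x, y) = x - y
grad_loss = lambda x, y: x - y
data = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)  # M = 1000 "data points"

g_m10  = [minibatch_gradient(grad_loss, 0.0, data, m=10)  for _ in range(100)]
g_m500 = [minibatch_gradient(grad_loss, 0.0, data, m=500) for _ in range(100)]
print(np.std(g_m10), ">", np.std(g_m500))  # smaller m -> higher noise on the gradient estimate
```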
The steepest way downhill
. . . in expectation: SGD finds a local minimum, too
$x^\ast = \arg\min_x L(x), \qquad x_{i+1} \leftarrow x_i - \alpha \nabla \hat{L}(x_i), \quad \alpha = \text{const.}$
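A minimal SGD loop built on the mini-batch estimator sketched above (same illustrative loss and data); note that the fixed step size α is hand-chosen here, which is exactly the issue the following slides address:

```python
def sgd(grad_loss, data, x0, alpha=0.1, m=10, n_iters=500, rng=None):
    """SGD with a fixed step size: x_{i+1} = x_i - alpha * (mini-batch gradient at x_i)."""
    rng = rng or np.random.default_rng()
    x = x0
    for _ in range(n_iters):
        x = x - alpha * minibatch_gradient(grad_loss, x, data, m, rng)
    return x

# for the quadratic per-example loss above, the true minimizer is the data mean (about 1.0)
print(sgd(grad_loss, data, x0=0.0))
```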
Step size adaptation
... by line searches
$x_{i+1} \leftarrow x_i - \alpha_i s_i$, but so far $\alpha$ was constant and hand-chosen!
▸ line searches automatically choose step sizes
Line searches
automated learning rate adaptation, $x^\ast = \arg\min_x L(x)$
$x_{i+1} \leftarrow x_i - \alpha_i \nabla L(x_i)$: set the scalar step size $\alpha_i$ given the direction $-\nabla L(x_i)$
[Figure: 1D objective along the search direction, comparing a small step size, a large step size, and the step chosen by a line search.]
With a noisy direction $-\nabla \hat{L}(x_i)$, line searches break in the stochastic setting!
Step size adaptation
... by line searches
$x_{i+1} \leftarrow x_i - \alpha_i s_i$
▸ line searches automatically choose step sizes
▸ very fast subroutines called in each optimization step
▸ control blow-up or stagnation
▸ they do not work in stochastic optimization problems!

small outline
▸ introduce classic (noise-free) line searches
▸ translate the concept to the language of probability
▸ get a new algorithm robust to noise
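Before walking through the classic routine on the next slide, here is a minimal sketch of the simplest such subroutine, backtracking with the sufficient-decrease condition (this is condition W-I below; the routine discussed in the talk is more elaborate). Names and the default constants are illustrative choices:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha0=1.0, c1=1e-4, rho=0.5):
    """Shrink alpha until f(x + alpha*d) <= f(x) + c1 * alpha * grad_f(x)^T d
    (the sufficient-decrease / Armijo condition, i.e. W-I), with d a descent direction."""
    fx, slope = f(x), float(np.dot(grad_f(x), direction))
    alpha = alpha0
    while f(x + alpha * direction) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

# illustrative 1D use: f(x) = x^4 at x = 2, stepping along the negative gradient
f, grad_f = lambda x: x ** 4, lambda x: 4 * x ** 3
x = 2.0
alpha = backtracking_line_search(f, grad_f, x, -grad_f(x))
print(alpha, x - alpha * grad_f(x))  # the accepted step gives a lower function value
```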
Classic line searches
$x^\ast = \arg\min_x f(x), \qquad x_{i+1} \leftarrow x_i - t\, s_{i+1}$

Wolfe conditions [Wolfe, SIAM Review, 1969]: accept when
$f(t) \le f(0) + c_1 t f'(0)$  (W-I)
$f'(t) \ge c_2 f'(0)$  (W-IIa)
$|f'(t)| \le c_2 |f'(0)|$  (W-IIb)

One line search, step by step (each step evaluates $f(t)$ and $f'(t)$ along the search direction):
- 1. initial evaluation ≡ current position of the optimizer
- 2. search: candidate #1 (initial candidate)
- 3. collapse the search space
- 4. search: candidate #2 (extrapolation)
- 5. collapse the search space
- 6. search: candidate #3 (interpolation, local minimum of the spline model)
- 7. accept: datapoint #3 fulfills the Wolfe conditions

[Figure: 1D objective f(t) and derivative df(t) over the distance t in the line search direction, annotated with the candidate points and the accepted step.]
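These acceptance rules translate directly into a small check (a sketch: f0 and df0 are the value and directional derivative at t = 0, ft and dft those at the candidate t, and the defaults for c1, c2 are common textbook choices with 0 < c1 < c2 < 1):

```python
def wolfe_accept(f0, df0, ft, dft, t, c1=1e-4, c2=0.9, strong=True):
    """Check the Wolfe conditions for a candidate step t along a descent direction.
    W-I  (sufficient decrease): f(t)  <= f(0) + c1 * t * f'(0)
    W-IIa (weak curvature)    : f'(t) >= c2 * f'(0)
    W-IIb (strong curvature)  : |f'(t)| <= c2 * |f'(0)|"""
    w1 = ft <= f0 + c1 * t * df0
    w2 = abs(dft) <= c2 * abs(df0) if strong else dft >= c2 * df0
    return w1 and w2

print(wolfe_accept(f0=5.0, df0=-2.0, ft=4.6, dft=-0.5, t=0.5))  # True: both conditions hold
```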
Classic line searches
choosing meaningful step-sizes at very low overhead, but failing in the presence of noise: designing a probabilistic line search

many classic line searches:
- 1. model the 1D objective with a cubic spline
- 2. search candidate points by collapsing the search space
- 3. accept if the Wolfe conditions are fulfilled

Classic line searches break in stochastic optimization problems! extending the line search paradigm:
- 1. model: cubic spline → GP surrogate
- 2. search: collapsing the search space → Bayesian optimization for exploration
- 3. accept: Wolfe conditions → probabilistic Wolfe termination conditions
Building a probabilistic line search
Step 1: cubic spline GP surrogate, Step 2: BO for exploration

- 1. model: cubic spline GP (integrated Wiener process)
$p(f) = \mathcal{GP}(f;\, 0, k), \qquad k(t, t') = \tfrac{1}{3}\min^3(t, t') + \tfrac{1}{2}\,|t - t'|\min^2(t, t')$
▸ robust and flexible
▸ has analytic minima (roots of a quadratic equation)

- 2. search: Bayesian optimization (expected improvement) [Jones et al., 1998]
$u_{\mathrm{EI}}(t) = \mathbb{E}_{p(f_t \mid y, y')}\!\left[\min\{0,\, \eta - f(t)\}\right]$
▸ only evaluated at a few candidate points:
▸ analytic minima of the posterior mean
▸ one extrapolation point
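A small sketch of this surrogate under simplifying assumptions: it implements only the kernel above and a plain GP posterior mean from noisy function values, whereas the full method also conditions on noisy gradients and exploits the analytic piecewise-cubic structure of the posterior mean. All names and the noise level are illustrative:

```python
import numpy as np

def k_iwp(t, s):
    """Integrated Wiener process kernel: k(t, s) = min^3/3 + |t - s| * min^2 / 2."""
    m = np.minimum(t, s)
    return m ** 3 / 3.0 + np.abs(t - s) * m ** 2 / 2.0

def gp_posterior_mean(t_obs, y_obs, t_query, noise_var=0.01):
    """Posterior mean of a zero-mean GP with the kernel above, given noisy function values."""
    t_obs, y_obs, t_query = map(np.asarray, (t_obs, y_obs, t_query))
    K = k_iwp(t_obs[:, None], t_obs[None, :]) + noise_var * np.eye(len(t_obs))
    k_star = k_iwp(t_query[:, None], t_obs[None, :])
    return k_star @ np.linalg.solve(K, y_obs)

# illustrative: three noisy observations along the line, queried on a fine grid
mean = gp_posterior_mean([0.2, 0.6, 1.0], [5.8, 5.3, 5.6], np.linspace(0.01, 1.5, 50))
```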
Building a probabilistic line search
Step 3: probabilistic Wolfe termination conditions

- 3. accept: probabilistic Wolfe termination conditions:
▸ the Wolfe conditions are positivity constraints on two variables $a_t$, $b_t$:
$f(t) \le f(0) + c_1 t f'(0)$ (W-I) and $f'(t) \ge c_2 f'(0)$ (W-II)

$\begin{bmatrix} a_t \\ b_t \end{bmatrix} = \begin{bmatrix} 1 & c_1 t & -1 & 0 \\ 0 & -c_2 & 0 & 1 \end{bmatrix} \begin{bmatrix} f(0) \\ f'(0) \\ f(t) \\ f'(t) \end{bmatrix} \ge 0$

▸ the GP on $f$ implies, at each $t$, a bivariate Gaussian distribution:
$p(a_t, b_t) = \mathcal{N}\!\left( \begin{bmatrix} a_t \\ b_t \end{bmatrix};\, \begin{bmatrix} m_t^a \\ m_t^b \end{bmatrix},\, \begin{bmatrix} C_t^{aa} & C_t^{ab} \\ C_t^{ba} & C_t^{bb} \end{bmatrix} \right)$

probability for the weak Wolfe conditions: $p_t^{\mathrm{Wolfe}} = p(0 \le a_t \wedge 0 \le b_t)$
approximate strong conditions: $p_t^{\mathrm{Wolfe}} = p(0 \le a_t \wedge 0 \le b_t \le \bar{b})$
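A minimal sketch of this acceptance probability for the weak conditions, assuming the bivariate Gaussian moments $(m^a_t, m^b_t, C_t)$ have already been computed from the GP posterior and that scipy.stats.multivariate_normal.cdf is available; the upper bound $\bar{b}$ used for the approximate strong conditions is omitted here:

```python
import numpy as np
from scipy.stats import multivariate_normal

def p_wolfe(mean_a, mean_b, cov):
    """Probability that both Wolfe variables are non-negative,
    p(a_t >= 0 and b_t >= 0), for (a_t, b_t) ~ N([mean_a, mean_b], cov).
    Uses P(a >= 0, b >= 0) = P(-a <= 0, -b <= 0), i.e. the CDF of the negated variables,
    whose covariance is unchanged."""
    return float(multivariate_normal(mean=[-mean_a, -mean_b],
                                     cov=np.asarray(cov)).cdf([0.0, 0.0]))

# illustrative moments: a point where both conditions are likely satisfied
print(p_wolfe(0.5, 0.3, [[0.04, 0.01], [0.01, 0.09]]))  # high probability
```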
Probabilistic line search routine
one line search, step by step:
- 1. initial belief: first evaluation ≡ current position of the optimizer
- 2. search: candidate #1 (initial candidate)
- 3. accept: check $p^{\mathrm{Wolfe}}$ for the first datapoint ($p^{\mathrm{Wolfe}} = 0.00$, not accepted)
- 4. search: candidate #2 (extrapolation)
- 5. accept: check $p^{\mathrm{Wolfe}}$ for datapoints #1 and #2 (values 0.07 and 0.00, neither accepted)
- 6. search: candidates #3 (local minimum vs. extrapolation, discriminated through EI · $p^{\mathrm{Wolfe}}$)
- 7. accept: check $p^{\mathrm{Wolfe}}$ for datapoints #1, #2 and #3 (values 0.68, 0.08, 0.00); the candidate with $p^{\mathrm{Wolfe}} = 0.68$ is accepted

[Figure: GP posterior over f and df along the line, with the W-I/W-II regions and the evolving $p^{\mathrm{Wolfe}}$ values at each step.]
small summary
. . . probabilistic line searches
make new from old:
- 1. model cubic spline → GP with cubic spline means
- 2. search collapsing search space → Bayesian optimization
- 3. accept binary Wolfe conditions → probabilistic Wolfe conditions
→ lightweight inner optimization routine
→ robust stochastic optimization
Line search finds learning rates
SGD on a 2-layer neural net, mini-batch size 10
[Figure: test error vs. initial learning rate and vs. epoch on CIFAR-10 and MNIST, comparing SGD with a fixed step size to SGD with the probabilistic line search.]
small summary
... about line searches and others

take away
▸ optimizers are learning machines
▸ data: noisy gradients
▸ the prior encodes structure of the objective
▸ prob. line search: infers an approximate minimum

there is more
▸ the field is much broader than 'only' line searches
▸ search directions can also be learned
▸ classic search directions are MAP estimators of Gaussian inference
▸ robust second-order search directions are still needed!
Probabilistic line searches
... in TensorFlow
We are implementing it in TensorFlow. Have a beer with Lukas!
Thank you!