Robust Nonlinear Optimization
Maren Mahsereci
Workshop on Uncertainty Quantification, 09/15/2016, Sheffield
Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
Robust optimization ... outline
▸ basics about greedy optimizers
▸ GD and SGD: (stochastic) gradient descent
▸ robust stochastic optimization
▸ example: step size adaptation
▸ extending line searches
▸ robust search directions
Typical scheme ... greedy and gradient-based optimizer

x* = arg min_x L(x)
x_{i+1} ← x_i − α_i s_i

1. s_i – which direction? → model the objective function locally
2. α_i – how far? → prevent blow-ups and stagnation
3. repeat

▸ needs to work for many different L(x)
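As a schematic of this loop (a sketch only; the function and parameter names are illustrative and not from the slides, assuming the caller supplies a direction rule and a step-size rule):

```python
import numpy as np

def greedy_optimizer(x0, direction, step_size, n_iters=100):
    """Generic greedy scheme: x_{i+1} <- x_i - alpha_i * s_i."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        s = direction(x)         # 1. which direction? (e.g. the gradient)
        alpha = step_size(x, s)  # 2. how far? (e.g. a constant, or a line search)
        x = x - alpha * s        # 3. repeat
    return x
```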
The steepest way downhill ... gradient descent finds a local minimum

x* = arg min_x L(x)
x_{i+1} ← x_i − α ∇L(x_i),  α = const.
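Gradient descent is the special case of the scheme sketched earlier with s_i = ∇L(x_i) and a constant step size α; a minimal usage example, reusing the illustrative greedy_optimizer helper and assuming an analytic gradient is available:

```python
# gradient descent on L(x) = 0.5 * ||x||^2, whose gradient is simply x
grad_L = lambda x: x
x_min = greedy_optimizer(
    x0=[3.0, -2.0],
    direction=grad_L,            # s_i = grad L(x_i)
    step_size=lambda x, s: 0.1,  # alpha = const.
)
```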
Additional difficulty ... noisy functions by mini-batching

x* = arg min_x L(x)

Sometimes we do not know −∇L(x_i) precisely!
Additional difficulty ... noisy functions by mini-batching

L(x) := (1/M) ∑_{i=1}^{M} ℓ(x, y_i) ≈ (1/m) ∑_{j=1}^{m} ℓ(x, y_j) =: L̂(x),  m ≪ M

▸ compute only the smaller sum over m terms
▸ hope that L̂(x) approximates L(x) well
▸ smaller m means higher noise on ∇L̂(x)

For iid mini-batches the noise is approximately Gaussian:

L̂(x) = L(x) + ε,  ε ∼ N(0, O((M − m)/m))
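A sketch of this mini-batch gradient estimate; it assumes a per-example gradient grad_loss(x, y) is available (the helper name and the NumPy usage are illustrative, not from the slides):

```python
import numpy as np

def minibatch_grad(grad_loss, x, Y, m, rng=np.random.default_rng(0)):
    """Estimate grad L(x) = (1/M) sum_i grad l(x, y_i) from only m << M examples."""
    idx = rng.choice(len(Y), size=m, replace=False)            # draw an iid mini-batch
    return np.mean([grad_loss(x, Y[j]) for j in idx], axis=0)  # average per-example gradients
```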
The steepest way downhill ... in expectation, SGD finds a local minimum, too.

x* = arg min_x L(x)
x_{i+1} ← x_i − α ∇L̂(x_i),  α = const.
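Plugging the noisy gradient into the same update rule gives stochastic gradient descent; a minimal sketch, reusing the illustrative minibatch_grad helper (and the NumPy import) from the previous snippet:

```python
def sgd(grad_loss, x0, Y, m=32, alpha=0.01, n_iters=1000):
    """SGD: x_{i+1} <- x_i - alpha * grad Lhat(x_i), with a mini-batch gradient estimate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - alpha * minibatch_grad(grad_loss, x, Y, m)  # noisy descent direction
    return x
```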
Step size adaptation ... by line searches

x_{i+1} ← x_i − α_i s_i

So far α was constant and hand-chosen!
▸ line searches automatically choose step sizes
Line searches ... automated learning rate adaptation

x* = arg min_x L(x)
x_{i+1} ← x_i − α_i ∇L(x_i)

Set the scalar step size α_i given the direction −∇L(x_i).

[Figure: 1D objective with a too-small step size, a too-large step size, and the step chosen by a line search.]

Given only a noisy direction −∇L̂(x_i), the picture changes: line searches break in the stochastic setting!
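For intuition on the noise-free case illustrated above, here is a minimal backtracking sketch that shrinks the step until a sufficient-decrease (Armijo-type) condition holds; this is a simplified stand-in, not the specific procedure from the slides, and it assumes exact evaluations of the 1D objective f(t) = L(x_i − t ∇L(x_i)) and of its slope f'(0):

```python
def backtracking_step(f, f0, df0, t=1.0, c1=1e-4, shrink=0.5, max_tries=20):
    """Shrink t until f(t) <= f(0) + c1 * t * f'(0) (sufficient decrease)."""
    for _ in range(max_tries):
        if f(t) <= f0 + c1 * t * df0:
            return t
        t *= shrink
    return t
```

Used as the step_size rule in the earlier greedy_optimizer sketch, this replaces the hand-chosen constant α. When f is only known up to mini-batch noise, the comparison in the if-statement becomes unreliable, which is exactly why classic line searches break in the stochastic setting.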
Step size adaptation ... by line searches

x_{i+1} ← x_i − α_i s_i

▸ line searches automatically choose step sizes
▸ very fast subroutines called in each optimization step
▸ control blow-up or stagnation
▸ they do not work in stochastic optimization problems!

Small outline:
▸ introduce classic (noise-free) line searches
▸ translate the concept to the language of probability
▸ get a new algorithm robust to noise
Classic line searches

x* = arg min_x f(x),  x_{i+1} ← x_i − t s_i

[Figure: 1D objective f(t) and its derivative df(t) plotted over the distance t in the line search direction.]

Walkthrough:
1. Initial evaluation ≡ current position of the optimizer.
2. Search: candidate #1 (initial candidate), then collapse the search space.
3. Search: candidate #2 (extrapolation), then collapse the search space.
4. Search: candidate #3 (interpolation, near a local minimum).
5. Accept: datapoint #3 fulfills the Wolfe conditions.

Wolfe conditions: accept when [Wolfe, SIAM Review, 1969]

f(t) ≤ f(0) + c_1 t f′(0)      (W-I)
f′(t) ≥ c_2 f′(0)              (W-IIa)
|f′(t)| ≤ c_2 |f′(0)|          (W-IIb)
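The acceptance test itself is straightforward to write down; a sketch, assuming exact function values and directional derivatives at t = 0 and at the candidate t (exactly the quantities that become noisy under mini-batching):

```python
def wolfe_accept(f0, df0, ft, dft, t, c1=1e-4, c2=0.9, strong=True):
    """Check the Wolfe conditions at distance t along the search direction."""
    w1 = ft <= f0 + c1 * t * df0                                   # W-I: sufficient decrease
    w2 = abs(dft) <= c2 * abs(df0) if strong else dft >= c2 * df0  # W-IIb (strong) / W-IIa
    return w1 and w2
```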
Classic line searches ... choosing meaningful step sizes at very low overhead

Many classic line searches:
1. model the 1D objective with a cubic spline
2. search candidate points by collapsing the search space
3. accept if the Wolfe conditions are fulfilled
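Steps 1 and 2 of this recipe boil down to proposing the next candidate from a cubic fit through two evaluated points; a sketch of the standard cubic-interpolation formula (as given, e.g., in Nocedal & Wright), assuming function values and derivatives at distances t0 and t1, with names chosen here for illustration:

```python
import numpy as np

def cubic_candidate(t0, f0, df0, t1, f1, df1):
    """Minimizer of the cubic interpolating (t0, f0, df0) and (t1, f1, df1)."""
    d1 = df0 + df1 - 3.0 * (f0 - f1) / (t0 - t1)
    d2 = np.sign(t1 - t0) * np.sqrt(max(d1**2 - df0 * df1, 0.0))  # guard against rounding
    return t1 - (t1 - t0) * (df1 + d2 - d1) / (df1 - df0 + 2.0 * d2)
```

Classic implementations additionally safeguard this candidate, e.g. by clipping it into the current search interval before evaluating it.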