Robust Nonlinear Optimization




1. Robust Nonlinear Optimization
Maren Mahsereci
Workshop on Uncertainty Quantification, 09/15/2016, Sheffield
Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference,
Max Planck Institute for Intelligent Systems, Tübingen, Germany

2. Robust optimization ... outline
▸ basics about greedy optimizers
▸ GD and SGD: (stochastic) gradient descent
▸ robust stochastic optimization
▸ example: step size adaptation
▸ extending line searches
▸ robust search directions

3. Typical scheme ... greedy and gradient-based optimizer
x* = arg min_x L(x)
x_{i+1} ← x_i − α_i s_i
1. s_i – which direction? → model the objective function locally
2. α_i – how far? → prevent blow-ups and stagnation
3. repeat
▸ needs to work for many different L(x)
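
As a rough illustration of this scheme, here is a minimal Python sketch. The names (greedy_minimize, direction, step_size) and the quadratic toy objective are illustrative choices, not part of the slides.

```python
import numpy as np

def greedy_minimize(x0, direction, step_size, n_iters=100):
    """Generic greedy scheme: pick a direction s_i, pick a step size alpha_i, repeat."""
    x = np.asarray(x0, dtype=float)
    for i in range(n_iters):
        s = direction(x)        # 1. which direction? (local model of the objective)
        alpha = step_size(i)    # 2. how far?
        x = x - alpha * s       # x_{i+1} <- x_i - alpha_i * s_i
    return x

# Plain gradient direction and a constant step size on L(x) = ||x||^2:
grad = lambda x: 2.0 * x
x_min = greedy_minimize(x0=[3.0, -2.0], direction=grad, step_size=lambda i: 0.1)
print(x_min)  # close to [0, 0]
```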


4. The steepest way downhill ... gradient descent finds a local minimum
x* = arg min_x L(x)
x_{i+1} ← x_i − α ∇L(x_i),  α = const.
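
A small sketch of constant-step gradient descent on L(x) = x², showing why a hand-chosen constant α has to balance stagnation against blow-up; the specific step sizes and iteration count are illustrative assumptions.

```python
def gd(alpha, x0=5.0, n_iters=50):
    """Gradient descent with constant step size on L(x) = x^2 (gradient 2x)."""
    x = x0
    for _ in range(n_iters):
        x = x - alpha * 2.0 * x   # x_{i+1} <- x_i - alpha * dL/dx
    return x

for alpha in (0.001, 0.1, 1.1):
    print(f"alpha={alpha:>5}: x after 50 steps = {gd(alpha):.3g}")
# alpha = 0.001 barely moves, alpha = 0.1 converges, alpha = 1.1 blows up.
```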


5. Additional difficulty ... noisy functions by mini-batching
x* = arg min_x L(x)
sometimes we do not know −∇L(x_i) precisely!


6. Additional difficulty ... noisy functions by mini-batching
x* = arg min_x L(x)
L(x) := (1/M) ∑_{i=1}^{M} ℓ(x, y_i) ≈ (1/m) ∑_{j=1}^{m} ℓ(x, y_j) =: L̂(x),  m ≪ M
▸ compute only the smaller sum over m terms
▸ hope that L̂(x) approximates L(x) well
▸ smaller m means higher noise on ∇L̂(x)
for iid mini-batches, the noise is approximately Gaussian:
L̂(x) = L(x) + ε,  ε ∼ N(0, O((M − m)/m))
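
A sketch of the mini-batch estimator and its noise; the per-example loss ℓ(x, y) = ½(x − y)² and the data are hypothetical, chosen only to make the noise scaling with m visible.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000
y = rng.normal(size=M)           # full data set
x = 1.5

def grad_full(x):
    """Gradient of the full-sum loss L(x) = (1/M) sum_i 0.5*(x - y_i)^2."""
    return np.mean(x - y)

def grad_minibatch(x, m):
    """Gradient of L_hat(x), the same average over a mini-batch of m << M points."""
    batch = rng.choice(y, size=m, replace=False)
    return np.mean(x - batch)

print("full gradient:", grad_full(x))
for m in (10, 100, 1000):
    estimates = [grad_minibatch(x, m) for _ in range(500)]
    print(f"m={m:>5}: std of the gradient estimate ~ {np.std(estimates):.4f}")
# Smaller m -> higher, approximately Gaussian noise on the gradient estimate.
```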

7. The steepest way downhill ... in expectation, SGD finds a local minimum, too
x* = arg min_x L(x)
x_{i+1} ← x_i − α ∇L̂(x_i),  α = const.
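
Continuing the hypothetical quadratic example, a minimal SGD loop with a constant step on the noisy mini-batch gradient; batch size, step size, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
M, m, alpha = 100_000, 32, 0.1
y = rng.normal(loc=2.0, size=M)    # the minimiser of L is mean(y), roughly 2.0

x = -4.0
for i in range(500):
    batch = rng.choice(y, size=m, replace=False)
    g_hat = np.mean(x - batch)     # noisy gradient of L_hat at x
    x = x - alpha * g_hat          # x_{i+1} <- x_i - alpha * grad L_hat(x_i)
print(x)                           # fluctuates around 2.0
```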


8. Step size adaptation ... by line searches
x_{i+1} ← x_i − α_i s_i
so far α was constant and hand-chosen!
▸ line searches automatically choose step sizes
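
One way to picture this is to restrict L to the current search line and hand that 1D function to a line-search routine; gd_with_line_search and the simple backtracking rule below are hypothetical stand-ins, not the method from the slides.

```python
import numpy as np

def backtracking(f_1d, df_1d, t=1.0, c1=1e-4, shrink=0.5):
    """Toy line search: shrink t until the sufficient-decrease condition holds."""
    f0, df0 = f_1d(0.0), df_1d(0.0)
    while f_1d(t) > f0 + c1 * t * df0:
        t *= shrink
    return t

def gd_with_line_search(grad, loss, x0, line_search_1d, n_iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        s = grad(x)                                  # search direction
        f_1d  = lambda t: loss(x - t * s)            # L restricted to the line
        df_1d = lambda t: -s @ grad(x - t * s)       # its derivative in t
        alpha = line_search_1d(f_1d, df_1d)          # step size chosen per step
        x = x - alpha * s
    return x

loss = lambda x: float(np.sum(x ** 2))
grad = lambda x: 2.0 * x
print(gd_with_line_search(grad, loss, x0=[3.0, -2.0], line_search_1d=backtracking))
```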


9. Line searches ... automated learning rate adaptation
x* = arg min_x L(x)
x_{i+1} ← x_i − α_i ∇L(x_i)
set the scalar step size α_i given the direction −∇L(x_i)
[figure: 1D objective with iterates for a small step size, a large step size, and a line search]

10. Line searches ... automated learning rate adaptation
x* = arg min_x L(x)
x_{i+1} ← x_i − α_i ∇L̂(x_i)
set the scalar step size α_i given the noisy direction −∇L̂(x_i)
[figure: the same comparison with a noisy gradient direction]
Line searches break in the stochastic setting!
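
A small numerical sketch of why a deterministic acceptance test misfires once function values are noisy: the same candidate step can pass or fail depending on the draw. The 1D objective, the noise level, and the tested steps are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
f  = lambda t: (t - 1.0) ** 2      # true 1D objective along the search line
df = lambda t: 2.0 * (t - 1.0)
c1, sigma = 1e-4, 1.0              # observation noise on function values

def wolfe_I_accept_rate(t, n_trials=10_000):
    """Fraction of noisy draws in which the sufficient-decrease check (W-I) passes."""
    f0_hat = f(0.0) + sigma * rng.normal(size=n_trials)
    ft_hat = f(t)   + sigma * rng.normal(size=n_trials)
    return np.mean(ft_hat <= f0_hat + c1 * t * df(0.0))

print("optimal step t=1.0 accepted:", wolfe_I_accept_rate(1.0))  # roughly 0.76, not 1.0
print("uphill step  t=2.5 accepted:", wolfe_I_accept_rate(2.5))  # roughly 0.19, not 0.0
```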


11. Step size adaptation ... by line searches
x_{i+1} ← x_i − α_i s_i
▸ line searches automatically choose step sizes
▸ very fast subroutines called in each optimization step
▸ control blow-up or stagnation
▸ they do not work in stochastic optimization problems!
small outline:
▸ introduce classic (noise-free) line searches
▸ translate the concept to the language of probability
▸ get a new algorithm robust to noise

12. Classic line searches
x* = arg min_x f(x),  x_{i+1} ← x_i − t s_i
[figure: f(t) and f′(t) over the distance t in the line search direction]
the search, illustrated step by step:
▸ initial evaluation ≡ current position of the optimizer
▸ search: candidate #1 (the initial candidate)
▸ collapse the search space
▸ search: candidate #2 (extrapolation)
▸ collapse the search space
▸ search: candidate #3 (interpolation, local minimum of the model)
▸ accept: data point #3 fulfills the Wolfe conditions
Wolfe conditions: accept when [Wolfe, SIAM Review, 1969]
f(t) ≤ f(0) + c_1 t f′(0)   (W-I)
f′(t) ≥ c_2 f′(0)   (W-IIa)
|f′(t)| ≤ c_2 |f′(0)|   (W-IIb)
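
The acceptance rule itself is compact; here is a sketch in Python, with the common textbook value c1 = 1e-4 and a deliberately strict c2 = 0.5 (these constants are choices, not taken from the slide).

```python
def wolfe_accept(f0, df0, ft, dft, t, c1=1e-4, c2=0.5):
    """Check the Wolfe conditions for a candidate step t along a descent direction."""
    w_i   = ft <= f0 + c1 * t * df0       # (W-I)   sufficient decrease
    w_iia = dft >= c2 * df0               # (W-IIa) curvature condition
    w_iib = abs(dft) <= c2 * abs(df0)     # (W-IIb) strong curvature condition
    return w_i and w_iia and w_iib

# Along f(t) = (t - 1)^2, with f'(t) = 2(t - 1):
f  = lambda t: (t - 1.0) ** 2
df = lambda t: 2.0 * (t - 1.0)
for t in (0.2, 1.0, 1.6):
    print(t, wolfe_accept(f(0.0), df(0.0), f(t), df(t), t))
# Only the candidate near the 1D minimiser t = 1 is accepted.
```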

13. Classic line searches ... choosing meaningful step sizes at very low overhead
many classic line searches:
1. model the 1D objective with a cubic spline
2. search candidate points by collapsing the search space
3. accept if the Wolfe conditions are fulfilled
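
A compressed sketch of such a search loop: extrapolate while the slope is still negative and sufficient decrease holds, otherwise bracket and interpolate, and stop as soon as the Wolfe conditions are met. Bisection stands in for the cubic-spline interpolation used in practice, so this is an illustrative simplification rather than the exact algorithm from the slides.

```python
def line_search(f, df, c1=1e-4, c2=0.5, t0=1.0, max_evals=20):
    """Find a step t along the 1D objective f that fulfills the Wolfe conditions."""
    f0, df0 = f(0.0), df(0.0)
    lo, hi, t = 0.0, None, t0
    for _ in range(max_evals):
        ft, dft = f(t), df(t)
        if ft <= f0 + c1 * t * df0 and abs(dft) <= c2 * abs(df0):
            return t                      # Wolfe conditions fulfilled: accept
        if ft > f0 + c1 * t * df0 or dft >= 0.0:
            hi = t                        # overshot: the minimum is bracketed
        else:
            lo = t                        # still descending: move the lower bound
        t = 2.0 * t if hi is None else 0.5 * (lo + hi)   # extrapolate / interpolate
    return t

f  = lambda t: (t - 10.0) ** 2            # minimiser of the 1D objective at t = 10
df = lambda t: 2.0 * (t - 10.0)
print(line_search(f, df))                 # a Wolfe-acceptable step, here 8.0
```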
