

SLIDE 1

Robust Nonlinear Optimization

Maren Mahsereci, Workshop on Uncertainty Quantification, 09/15/2016, Sheffield

Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

SLIDE 2

Robust optimization

... outline

▸ basics about greedy optimizers
▸ GD and SGD: (stochastic) gradient descent
▸ robust stochastic optimization
  ▸ example: step size adaptation
  ▸ extending line searches
  ▸ robust search directions

SLIDE 3

Typical scheme

... greedy and gradient-based optimizers

x* = argmin_x L(x),    x_{i+1} ← x_i − α_i s_i

  • 1. s_i – which direction? → model the objective function locally
  • 2. α_i – how far? → prevent blow-ups and stagnation
  • 3. repeat

▸ needs to work for many different L(x)
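The three-step scheme above can be sketched in a few lines of Python. This is a minimal illustration with pluggable direction and step-size rules; `greedy_minimize` and the quadratic example are my own constructions, not from the slides:

```python
import numpy as np

def greedy_minimize(direction, step_size, x0, n_steps=200):
    """Generic greedy scheme: pick a direction s_i, pick a step alpha_i, repeat."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        s = direction(x)      # 1. s_i: which direction? (local model of L)
        a = step_size(x, s)   # 2. alpha_i: how far? (avoid blow-up / stagnation)
        x = x - a * s         # 3. update and repeat
    return x

# L(x) = ||x||^2, so the gradient direction is s = 2x; constant step alpha = 0.1
x_min = greedy_minimize(direction=lambda x: 2.0 * x,
                        step_size=lambda x, s: 0.1,
                        x0=[3.0, -2.0])
```

Plugging in a gradient for `direction` and a constant for `step_size` recovers plain gradient descent; a line search would replace the constant `step_size` rule.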


SLIDE 5

The steepest way downhill

... gradient descent finds a local minimum

x* = argmin_x L(x),    x_{i+1} ← x_i − α ∇L(x_i),  α = const.
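A quick sanity check of the update rule (a hedged sketch; the 1D objective L(x) = x² is an invented example): with a constant step size, gradient descent contracts toward the minimum when α is small enough and blows up when it is too large, which is exactly why step-size control matters.

```python
def gd(alpha, x0=10.0, n_steps=100):
    """x_{i+1} <- x_i - alpha * grad L(x_i) on L(x) = x^2, where grad L(x) = 2x."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * 2.0 * x
    return x

x_small = gd(alpha=0.1)   # contraction factor |1 - 2*alpha| = 0.8 < 1: converges to 0
x_large = gd(alpha=1.5)   # contraction factor |1 - 2*alpha| = 2.0 > 1: blows up
```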

SLIDE 15

Additional difficulty

... noisy functions from mini-batching: x* = argmin_x L(x)

sometimes we do not know −∇L(x_i) precisely!


SLIDE 17

Additional difficulty

... noisy functions from mini-batching: x* = argmin_x L(x)

L(x) := (1/M) Σ_{i=1}^{M} ℓ(x, y_i) ≈ (1/m) Σ_{j=1}^{m} ℓ(x, y_j) =: L̂(x),   m ≪ M

▸ compute only the smaller sum over m terms
▸ hope that L̂(x) approximates L(x) well
▸ smaller m means higher noise on ∇L̂(x)

for i.i.d. mini-batches, the noise is approximately Gaussian:

L̂(x) = L(x) + ε,   ε ∼ N(0, O((M − m)/m))
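The effect is easy to simulate (an illustrative sketch; the quadratic per-example loss ℓ(x, y) = (x − y)²/2 and all variable names are my assumptions, not from the slides): mini-batch gradients are unbiased estimates of the full gradient, but their scatter grows as m shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m = 10_000, 10
y = rng.normal(loc=3.0, scale=1.0, size=M)       # full data set

def full_grad(x):
    # l(x, y_i) = (x - y_i)^2 / 2  ->  grad L(x) = (1/M) sum_i (x - y_i)
    return np.mean(x - y)

def minibatch_grad(x):
    batch = rng.choice(y, size=m, replace=False)  # m << M
    return np.mean(x - batch)

g_true = full_grad(0.0)
g_hat = np.array([minibatch_grad(0.0) for _ in range(2000)])
# unbiased (mean of g_hat is close to g_true) but noisy, and roughly Gaussian
```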

SLIDE 18

The steepest way downhill

... in expectation: SGD finds a local minimum, too.

x* = argmin_x L(x),    x_{i+1} ← x_i − α ∇L̂(x_i),  α = const.
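A minimal SGD run on the same kind of mini-batched quadratic (a hypothetical setup, with the same invented per-example loss as above): with a constant α, the iterate hovers around the minimizer rather than settling exactly on it.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=3.0, scale=1.0, size=10_000)
# L(x) = (1/2M) sum_i (x - y_i)^2 has its minimum at x* = mean(y)

x, alpha, m = 0.0, 0.05, 10
for _ in range(3000):
    batch = rng.choice(y, size=m, replace=False)
    x -= alpha * np.mean(x - batch)      # noisy gradient from the mini-batch
```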


SLIDE 20

Step size adaptation

... by line searches

x_{i+1} ← x_i − α_i s_i     so far α was constant and hand-chosen!

▸ line searches automatically choose step sizes

SLIDE 21

Line searches

automated learning rate adaptation: x* = argmin_x L(x)

x_{i+1} ← x_i − α_i ∇L(x_i)     set the scalar step size α_i given the direction −∇L(x_i)

[Figure: 1D objective with a small step size marked.]


SLIDE 24

Line searches

automated learning rate adaptation: x* = argmin_x L(x)

x_{i+1} ← x_i − α_i ∇L̂(x_i)     set the scalar step size α_i given the noisy direction −∇L̂(x_i)

[Figure: 1D objective comparing a small step size, a large step size, and a line search.]

Line searches break in the stochastic setting!


SLIDE 26

Step size adaptation

... by line searches

x_{i+1} ← x_i − α_i s_i

▸ line searches automatically choose step sizes
▸ very fast subroutines, called in each optimization step
▸ control blow-up and stagnation
▸ they do not work in stochastic optimization problems!

small outline

▸ introduce classic (noise-free) line searches
▸ translate the concept into the language of probability
▸ get a new algorithm that is robust to noise

SLIDE 27

Classic line searches

Initial evaluation ≡ current position of the optimizer: x* = argmin_x f(x),  x_{i+1} ← x_i − t s_{i+1}

[Figure: 1D objective f(t) and its derivative df(t) along the search direction.]

Wolfe conditions: accept when [Wolfe, SIAM Review, 1969]

f(t) ≤ f(0) + c₁ t f′(0)      (W-I)
f′(t) ≥ c₂ f′(0)              (W-IIa)
|f′(t)| ≤ c₂ |f′(0)|          (W-IIb)
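The three conditions translate directly into code. This is a small helper I am adding for illustration; the function name and the values c₁ = 10⁻⁴, c₂ = 0.9 are conventional choices, not from the slides:

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9):
    """Evaluate W-I, W-IIa, W-IIb at step t, given values/derivatives at 0 and t."""
    w_i   = ft <= f0 + c1 * t * df0       # sufficient decrease (W-I)
    w_iia = dft >= c2 * df0               # curvature (W-IIa)
    w_iib = abs(dft) <= c2 * abs(df0)     # strong curvature (W-IIb)
    return w_i, w_iia, w_iib

# along the line: f(t) = (t - 1)^2, f'(t) = 2(t - 1), a descent direction at t = 0
f = lambda t: (t - 1.0) ** 2
df = lambda t: 2.0 * (t - 1.0)
near_min = wolfe_conditions(f(0), df(0), f(0.9), df(0.9), t=0.9)      # all hold
too_short = wolfe_conditions(f(0), df(0), f(0.05), df(0.05), t=0.05)  # W-IIa fails
```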

SLIDE 28

Classic line searches

Search: candidate #1

[Figure: initial candidate marked on f(t) and df(t); Wolfe conditions as on slide 27.]

SLIDE 29

Classic line searches

Collapse search space

[Figure: search space narrowed around the candidate; Wolfe conditions as on slide 27.]

SLIDE 30

Classic line searches

Search: candidate #2

[Figure: extrapolation step beyond the current bracket; Wolfe conditions as on slide 27.]

SLIDE 31

Classic line searches

Collapse search space

[Figure: search space narrowed again; Wolfe conditions as on slide 27.]

SLIDE 32

Classic line searches

Search: candidate #3

[Figure: interpolation step toward a local minimum of the spline model; Wolfe conditions as on slide 27.]

SLIDE 33

Classic line searches

Accept: datapoint #3 fulfills the Wolfe conditions

[Figure: accepted point marked; Wolfe conditions as on slide 27.]

SLIDE 34

Classic line searches

Choosing meaningful step sizes, at very low overhead

many classic line searches:

  • 1. model the 1D objective with a cubic spline
  • 2. search candidate points by collapsing the search space
  • 3. accept if the Wolfe conditions are fulfilled
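A toy version of such a routine can be sketched as follows. This is a deliberately simplified illustration, not the textbook cubic-interpolation algorithm: it extrapolates by doubling, collapses the bracket by bisection, and accepts on the weak Wolfe conditions.

```python
def line_search(f, df, t0=1.0, c1=1e-4, c2=0.9, max_iter=30):
    """Extrapolate / collapse the search space until the Wolfe conditions hold."""
    f0, df0 = f(0.0), df(0.0)
    lo, hi, t = 0.0, None, t0
    for _ in range(max_iter):
        if f(t) <= f0 + c1 * t * df0 and df(t) >= c2 * df0:
            return t                        # accept: weak Wolfe conditions hold
        if f(t) > f0 + c1 * t * df0 or df(t) >= 0.0:
            hi = t                          # overshot: collapse the bracket
        else:
            lo = t                          # undershot: extrapolate further out
        t = 2.0 * t if hi is None else 0.5 * (lo + hi)
    return t

# f(t) = (t - 1)^2: the accepted Wolfe point is the 1D minimizer t = 1
t_acc = line_search(lambda t: (t - 1.0) ** 2, lambda t: 2.0 * (t - 1.0))
```

All decisions here are based on exact values of f and df, which is precisely what breaks once those evaluations are noisy.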

SLIDE 35

Classic line searches

Fail in the presence of noise.

Classic line searches break in stochastic optimization problems!

SLIDE 36

Classic line searches

designing a probabilistic line search

many classic line searches:

  • 1. model the 1D objective with a cubic spline
  • 2. search candidate points by collapsing the search space
  • 3. accept if the Wolfe conditions are fulfilled

Classic line searches break in stochastic optimization problems! extending the line search paradigm:

  • 1. model: cubic spline → GP surrogate
  • 2. search: Bayesian optimization for exploration
  • 3. accept: probabilistic Wolfe termination conditions


SLIDE 38

Building a probabilistic line search

Step 1: cubic spline GP surrogate; Step 2: BO for exploration

  • 1. model: cubic spline GP (integrated Wiener process)

p(f) = GP(f; 0, k),   k(t, t′) = (1/3) min³(t, t′) + (1/2) |t − t′| min²(t, t′)

▸ robust and flexible
▸ has analytic minima (roots of a quadratic equation)

  • 2. search: Bayesian optimization (expected improvement)

u_EI(t) = E_{p(f_t | y, y′)}[min{0, η − f(t)}]   [Jones et al., 1998]

▸ only evaluated at a few candidate points:
  ▸ analytic minima of the posterior mean
  ▸ one extrapolation point
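The kernel itself is cheap to evaluate. Below is a sketch (variable names are mine) that builds the Gram matrix of the integrated Wiener process on a grid of step sizes; as a valid covariance function it should produce a symmetric positive semi-definite matrix:

```python
import numpy as np

def k_iwp(t, s):
    """Integrated Wiener process kernel:
    k(t, t') = min(t, t')^3 / 3 + |t - t'| * min(t, t')^2 / 2."""
    m = np.minimum(t, s)
    return m ** 3 / 3.0 + np.abs(t - s) * m ** 2 / 2.0

ts = np.linspace(0.1, 2.0, 8)
K = k_iwp(ts[:, None], ts[None, :])    # Gram matrix on a grid of step sizes
eigvals = np.linalg.eigvalsh(K)        # PSD check: no significantly negative values
```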


SLIDE 41

Building a probabilistic line search

Step 3: probabilistic Wolfe termination conditions

  • 3. accept: probabilistic Wolfe termination conditions:

▸ the Wolfe conditions are positivity constraints on two variables a_t, b_t

f(t) ≤ f(0) + c₁ t f′(0)  (W-I)   and   f′(t) ≥ c₂ f′(0)  (W-II)

[a_t; b_t] = [[1, c₁t, −1, 0], [0, −c₂, 0, 1]] · [f(0); f′(0); f(t); f′(t)]ᵀ ≥ 0

▸ the GP on f implies, at each t, a bivariate Gaussian distribution:

p(a_t, b_t) = N([a_t; b_t]; [m_t^a; m_t^b], [[C_t^aa, C_t^ab], [C_t^ba, C_t^bb]])

probability of the weak Wolfe conditions:   p_t^Wolfe = p(0 ≤ a_t ∧ 0 ≤ b_t)
approximate strong conditions:              p_t^Wolfe = p(0 ≤ a_t ∧ 0 ≤ b_t ≤ b̄)
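Given posterior means and covariance for (a_t, b_t), the weak-Wolfe probability is an upper-orthant mass of a bivariate Gaussian. The sketch below uses SciPy's bivariate normal CDF; the helper names and the default c₁, c₂ values are my choices, not fixed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def wolfe_ab(t, f0, df0, ft, dft, c1=1e-4, c2=0.9):
    """a_t >= 0 <=> W-I, b_t >= 0 <=> W-II: the linear map from the slide."""
    a = f0 + c1 * t * df0 - ft
    b = dft - c2 * df0
    return a, b

def p_wolfe(mean, cov):
    """p(0 <= a_t and 0 <= b_t) for (a_t, b_t) ~ N(mean, cov):
    upper-orthant mass, via (a, b) >= 0  <=>  (-a, -b) <= 0."""
    return multivariate_normal(mean=-np.asarray(mean), cov=cov).cdf([0.0, 0.0])

# zero-mean, independent unit variances: each variable is >= 0 with prob 1/2
p = p_wolfe([0.0, 0.0], np.eye(2))      # ~ 0.25
```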

SLIDE 42

Probabilistic line search routine

Initial belief: first evaluation ≡ current position of the optimizer

[Figure: GP posterior over f and df along the line; pWolfe = 0.00.]

SLIDE 43

Probabilistic line search routine

Search: candidate #1

[Figure: initial candidate marked on the GP posterior; pWolfe = 0.00.]

SLIDE 44

Probabilistic line search routine

Accept: check pWolfe for the first datapoint

[Figure: pWolfe = 0.00 at the first datapoint, so the search continues.]

SLIDE 45

Probabilistic line search routine

Search: candidate #2

[Figure: extrapolation candidate on the GP posterior; pWolfe = 0.00.]

SLIDE 46

Probabilistic line search routine

Accept: check pWolfe for datapoints #1 and #2

[Figure: pWolfe = 0.07 and 0.00, so the search continues.]

SLIDE 47

Probabilistic line search routine

Search: candidates #3

[Figure: two candidates, a local minimum of the posterior mean and an extrapolation point; pWolfe = 0.07 and 0.00.]

SLIDE 48

Probabilistic line search routine

Search: candidates #3 (discriminate through EI)

[Figure: candidates ranked by EI · pWolfe on the GP posterior; pWolfe = 0.07 and 0.00.]

SLIDE 49

Probabilistic line search routine

Accept: check pWolfe for datapoints #1, #2 and #3

[Figure: third datapoint accepted; pWolfe = 0.68, 0.08, 0.00.]

SLIDE 50

small summary

... probabilistic line searches

make new from old:

  • 1. model: cubic spline → GP with cubic spline means
  • 2. search: collapsing the search space → Bayesian optimization
  • 3. accept: binary Wolfe conditions → probabilistic Wolfe conditions

→ lightweight inner optimization routine
→ robust stochastic optimization

SLIDE 51

Line search finds learning rates

SGD on a 2-layer neural net, mini-batch size 10

[Figure: test error vs. initial learning rate and vs. epoch, on CIFAR-10 and MNIST, comparing SGD with a fixed step size to SGD with the probabilistic line search.]


SLIDE 53

small summary

... about line searches and others

take away

▸ optimizers are learning machines
▸ data: noisy gradients
▸ the prior encodes structure of the objective
▸ prob. line search: infers an approximate minimum

there is more

▸ the field is much broader than 'only' line searches
▸ search directions can also be learned
▸ classic search directions are MAP estimators of Gaussian inference
▸ robust second-order search directions are still needed!


SLIDE 55

Probabilistic line searches

... in TensorFlow

We are implementing it in TensorFlow. Have a beer with Lukas!

Thank you!