ZEROTH-ORDER NON-CONVEX SMOOTH OPTIMIZATION: LOCAL MINIMAX RATES
Yining Wang, CMU
Joint work with Sivaraman Balakrishnan and Aarti Singh
BACKGROUND
➤ Optimization: $\min_{x \in X} f(x)$
➤ Classical setting (first-order):
  ✴ f is known (e.g., a likelihood function or an NN objective)
  ✴ $\nabla f(x)$ can be evaluated, or unbiasedly approximated.
➤ Zeroth-order setting:
  ✴ f is unknown, or very complicated.
  ✴ $\nabla f(x)$ is unknown, or very difficult to evaluate.
  ✴ $f(x)$ can be evaluated, or unbiasedly approximated.
BACKGROUND
➤ Hyper-parameter tuning
  ✴ f maps a hyper-parameter $\theta$ to system performance
  ✴ f is essentially unknown
➤ Experimental design
  ✴ f maps an experimental setting (pressure, temperature, etc.) to synthesized material quality.
➤ Communication-efficient optimization
  ✴ Data defining the objective are scattered across machines
  ✴ Communicating $\nabla f(x)$ is expensive, but communicating $f(x)$ is ok.
PROBLEM FORMULATION
➤ Compact domain $X = [0,1]^d$
➤ Objective function $f : X \to \mathbb{R}$
  ✴ f belongs to the Hölder class of order $\alpha$: $\|f^{(\alpha)}\|_\infty \le M$
  ✴ f may be non-convex
➤ Query model: adaptive queries $x_1, x_2, \dots, x_n \in X$
  ✴ $y_t = f(x_t) + \xi_t$, where $\xi_t \overset{\text{i.i.d.}}{\sim} N(0,1)$
➤ Goal: minimize $L(\hat{x}_n; f) := f(\hat{x}_n) - \inf_{x \in X} f(x)$
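To make the query model concrete, here is a minimal Python sketch (the helper names `make_oracle` and `simple_regret` are my own illustrative choices, not from the talk): each query returns the objective value corrupted by standard Gaussian noise, and performance is measured by the simple regret $L(\hat{x}_n; f)$.

```python
import numpy as np

def make_oracle(f, noise_std=1.0, seed=0):
    """Wrap an objective f into a noisy zeroth-order oracle: each query x
    returns y = f(x) + xi with xi ~ N(0, noise_std^2); gradients are never
    exposed."""
    rng = np.random.default_rng(seed)
    def oracle(x):
        return f(x) + noise_std * rng.standard_normal()
    return oracle

def simple_regret(x_hat, f, f_star):
    """Optimization error L(x_hat; f) = f(x_hat) - inf_x f(x)."""
    return f(x_hat) - f_star
```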
A SIMPLE IDEA FIRST…
➤ Uniform sampling + nonparametric reconstruction
  ✴ Classical nonparametric analysis: $\|\hat{f}_n - f\|_\infty = \widetilde{O}_P\big(n^{-\alpha/(2\alpha+d)}\big)$
  ✴ Implies optimization error: $f(\hat{x}_n) - f^* \le 2\|\hat{f}_n - f\|_\infty$
➤ Can we do better? NO!
  ✴ $\inf_{\hat{x}_n} \sup_{f \in \Sigma_\alpha(M)} \mathbb{E}_f[L(\hat{x}_n; f)] \gtrsim n^{-\alpha/(2\alpha+d)}$
  ✴ Intuition: the optimal bandwidth is $h_n \sim n^{-1/(2\alpha+d)}$, giving error $h_n^\alpha \sim n^{-\alpha/(2\alpha+d)}$
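A minimal Python sketch of this baseline, under assumed names (`uniform_sample_and_reconstruct` is my own, and the Gaussian-kernel local average stands in for a generic nonparametric estimator): sample uniformly, reconstruct f with bandwidth $h_n \sim n^{-1/(2\alpha+d)}$, and report the grid point minimizing the estimate.

```python
import numpy as np

def uniform_sample_and_reconstruct(oracle, n, d, alpha, grid_size=25, seed=0):
    """Sketch of the baseline: query n uniform points in [0,1]^d, estimate f
    with a Gaussian-kernel average at bandwidth h_n ~ n^{-1/(2*alpha+d)},
    and return the grid point minimizing the estimate."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))           # uniform (non-adaptive) design
    y = np.array([oracle(x) for x in X])             # noisy zeroth-order queries
    h = n ** (-1.0 / (2 * alpha + d))                # classical bandwidth choice

    # evaluate the kernel-average estimate of f on a uniform grid
    axes = [np.linspace(0.0, 1.0, grid_size)] * d
    grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)
    f_hat = np.empty(len(grid))
    for i, g in enumerate(grid):
        w = np.exp(-np.sum((X - g) ** 2, axis=1) / (2 * h ** 2))
        f_hat[i] = (w * y).sum() / (w.sum() + 1e-12)
    return grid[int(np.argmin(f_hat))]               # estimated minimizer x_hat
```

For example, with the oracle sketch above one could call `uniform_sample_and_reconstruct(make_oracle(lambda x: np.sum((x - 0.3) ** 2)), n=2000, d=1, alpha=2.0)` (a hypothetical usage, just to show the interface).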
LOCAL RESULTS
➤ Characterize the error for functions "near" a reference function $f_0$
➤ What is the error rate for f close to an $f_0$ that …
  ✴ is a constant function?
  ✴ is strongly convex?
  ✴ has regular level sets?
  ✴ …
➤ Can an algorithm achieve instance-optimal error without knowing $f_0$?
NOTATION
➤ Some definitions
  ✴ Level set: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$
  ✴ Distribution function: $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$
[Figure: a function f with the sublevel set $L_f(\epsilon)$ at height $f^* + \epsilon$ highlighted]
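For intuition, a worked example of my own (not from the slides): for a well-conditioned strongly convex f, the sublevel set is essentially a ball of radius $\sqrt{\epsilon}$, so $\mu_f(\epsilon)$ grows like $\epsilon^{d/2}$.

```latex
% Illustrative computation (assumes x^* lies in the interior of X = [0,1]^d):
% take f(x) = \|x - x^*\|_2^2, so f^* = 0 and
L_f(\epsilon) = \{x : \|x - x^*\|_2^2 \le \epsilon\} = B(x^*, \sqrt{\epsilon}) \cap X,
\qquad
\mu_f(\epsilon) = \mathrm{vol}(L_f(\epsilon)) \asymp \epsilon^{d/2} \quad \text{for small } \epsilon.
```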
REGULARITY CONDITIONS
➤ Recall: $L_f(\epsilon) := \{x \in X : f(x) \le f^* + \epsilon\}$ and $\mu_f(\epsilon) := \mathrm{vol}(L_f(\epsilon))$
➤ Regularity condition (A1):
  ✴ the number of $\delta$-radius balls needed to cover $L_f(\epsilon)$ is $\asymp 1 + \mu_f(\epsilon)/\delta^d$
[Figures: a regular level set $L_f(\epsilon)$ satisfying (A1) versus an irregular level set $L_f(\epsilon)$]
REGULARITY CONDITIONS
➤ Regularity condition (A2):
  ✴ $\mu_f(\epsilon \log n) \le \mu_f(\epsilon) \times O(\log^\gamma n)$
[Figures: plots of $\mu_f(\epsilon)$ against $\epsilon$ for a regular f and an irregular f, comparing the values at $\epsilon$ and at $\epsilon \log n$]
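As a quick sanity check of my own, the polynomial-growth case considered later satisfies (A2) with exponent $\gamma = \beta$:

```latex
% If \mu_f(\epsilon) \asymp \epsilon^{\beta} for some \beta \ge 0, then
\mu_f(\epsilon \log n) \asymp (\epsilon \log n)^{\beta}
= \epsilon^{\beta} (\log n)^{\beta}
\asymp \mu_f(\epsilon) \cdot O(\log^{\beta} n),
% which is (A2) with \gamma = \beta.
```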
LOCAL UPPER BOUND
➤ Main result on the local upper bound:
THEOREM 1. Suppose the regularity conditions hold. There exists an algorithm such that, for sufficiently large n,
$$\sup_{f \in \Sigma_\alpha(M)} \Pr_f\!\left[ L(\hat{x}_n; f) \ge C\, \varepsilon_n(f) \log^c n \right] \le 1/4,$$
where $\varepsilon_n(f) := \sup\left\{ \epsilon > 0 : \epsilon^{-(2+d/\alpha)} \mu_f(\epsilon) \ge n \right\}$.
➤ Adaptivity: the algorithm does not know f.
➤ Instance dependent: the error rate depends on f.
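The slides do not spell out the algorithm behind Theorem 1, so the following Python sketch is only a guess at an adaptive strategy in that spirit, not the authors' method: repeatedly sample the currently active region, estimate f by local averaging, and eliminate candidates whose estimate exceeds the current minimum by more than a crude confidence width. All names and constants here are illustrative assumptions.

```python
import numpy as np

def adaptive_zeroth_order(oracle, n, d, alpha, rounds=5, n_candidates=512, seed=0):
    """Hedged sketch of an elimination-style adaptive strategy (not the
    algorithm from Theorem 1): alternate between sampling the active region
    and discarding candidates whose estimated value is clearly suboptimal."""
    rng = np.random.default_rng(seed)
    cand = rng.uniform(0.0, 1.0, size=(n_candidates, d))   # candidate locations
    active = np.ones(n_candidates, dtype=bool)
    per_round = max(n // rounds, 1)
    est = np.full(n_candidates, np.inf)
    for _ in range(rounds):
        idx = np.flatnonzero(active)
        X = cand[rng.choice(idx, size=per_round)]           # queries in the active region
        y = np.array([oracle(x) for x in X])
        h = per_round ** (-1.0 / (2 * alpha + d))           # local bandwidth
        est = np.full(n_candidates, np.inf)
        for i in idx:                                       # kernel average at each candidate
            w = np.exp(-np.sum((X - cand[i]) ** 2, axis=1) / (2 * h ** 2))
            if w.sum() > 1e-8:
                est[i] = (w * y).sum() / w.sum()
        width = h ** alpha * np.log(per_round + 1)          # crude confidence width
        active &= est <= est[idx].min() + 2 * width         # eliminate clearly bad points
    return cand[int(np.argmin(np.where(active, est, np.inf)))]
```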
LOCAL UPPER BOUND
➤ Example 1: polynomial growth
  ✴ If $\mu_f(\epsilon) \asymp \epsilon^\beta$ for some $\beta \ge 0$, then $\varepsilon_n(f) \asymp n^{-\alpha/(2\alpha + d - \alpha\beta)}$
  ✴ Much faster than the "baseline" rate $n^{-\alpha/(2\alpha+d)}$ when $\beta > 0$
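To see where this rate comes from (a derivation of my own from the definition of $\varepsilon_n(f)$ in Theorem 1), set the defining expression equal to n and solve; combining it with the earlier strongly convex example then gives a dimension-free exponent when $\alpha = 2$.

```latex
% Plugging \mu_f(\epsilon) \asymp \epsilon^{\beta} into the definition of \varepsilon_n(f):
\epsilon^{-(2 + d/\alpha)} \, \epsilon^{\beta} \asymp n
\;\Longrightarrow\;
\epsilon \asymp n^{-1/(2 + d/\alpha - \beta)} = n^{-\alpha/(2\alpha + d - \alpha\beta)}.
% Illustrative special case: \alpha = 2 and \mu_f(\epsilon) \asymp \epsilon^{d/2}
% (the strongly convex example above) give
\varepsilon_n(f) \asymp n^{-2/(4 + d - d)} = n^{-1/2}.
```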