Zeroth-order Optimization in High Dimensions
Yining Wang, Carnegie Mellon University
Joint work with Simon Du, Sivaraman Balakrishnan and Aarti Singh
AISTATS 2018, Playa Blanca, Lanzarote, Spain
BACKGROUND
❖ Optimization: $\min_{x \in \mathcal{X}} f(x)$
❖ Classical setting (first-order):
✴ $f$ is known (e.g., a likelihood function or an NN objective)
✴ $\nabla f(x)$ can be evaluated, or unbiasedly approximated
❖ Zeroth-order setting:
✴ $f$ is unknown, or very complicated
✴ $\nabla f(x)$ is unknown, or very difficult to evaluate.
APPLICATIONS
❖ Hyper-parameter tuning
✴ $f$ maps a hyper-parameter $x$ to system performance $f(x)$.
❖ Experimental design
✴ $f$ maps an experimental setting to experimental results.
❖ Communication-efficient optimization
✴ The data defining the objective are scattered across machines
✴ Communicating $\nabla f(x)$ is expensive, but communicating $f(x)$ is OK.
FORMULATION
❖ Convexity: the objective $f$ is convex.
❖ Noisy observation model: $y_t = f(x_t) + \xi_t$, with $\xi_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$.
❖ Evaluation measures (a minimal sketch of this setup follows below):
✴ Simple regret: $f(\hat{x}_{T+1}) - f^*$
✴ Cumulative regret: $\sum_{t=1}^{T} f(x_t) - f^*$
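Below is a minimal Python sketch of the observation model and the two regret notions; the toy quadratic objective, the noise level, and the query sequence are illustrative assumptions, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1          # noise standard deviation (assumed)

def f(x):
    # toy convex objective with minimizer x* = 0 and f* = 0 (assumed)
    return 0.5 * np.sum(x ** 2)

def noisy_oracle(x):
    # observation model: y_t = f(x_t) + xi_t,  xi_t ~ N(0, sigma^2)
    return f(x) + sigma * rng.standard_normal()

f_star = 0.0
# an arbitrary query sequence x_1, ..., x_T (a real algorithm would choose these)
queries = [rng.standard_normal(5) * (0.9 ** t) for t in range(20)]
observations = [noisy_oracle(x) for x in queries]          # what the optimizer actually sees

simple_regret = f(queries[-1]) - f_star                    # f(x_hat_{T+1}) - f*, last query as the report
cumulative_regret = sum(f(x) - f_star for x in queries)    # sum_t f(x_t) - f*
print(simple_regret, cumulative_regret)
```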
METHODS
❖ Classical method: Estimating Gradient Descent (EGD)
❖ Gradient descent / mirror descent with $\hat{g}_t(x_t) \approx \nabla f(x_t)$:
✴ $x_{t+1} \leftarrow x_t - \eta_t \hat{g}_t(x_t)$
✴ $x_{t+1} \in \arg\min_{z \in \mathbb{R}^d} \{ \eta_t \langle \hat{g}_t(x_t), z \rangle + \Delta_\psi(z, x_t) \}$
❖ Estimating the gradient by sampling a direction $v_t$ at scale $\delta$ around $x_t$:
✴ $\hat{g}_t(x_t) = \frac{d}{\delta} \cdot \mathbb{E}[f(x_t + \delta v_t)\, v_t]$
✴ Gained popularity from (Nemirovski & Yudin '83; Flaxman et al. '05)
(A code sketch of this scheme follows below.)
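A minimal sketch of EGD on a toy quadratic. For stability it uses a two-point variant of the sphere-sampling estimator, $\hat{g} = \frac{d}{\delta}(f(x+\delta v) - f(x))\,v$, averaged over a small batch of directions; the single-point form on the slide drops the $f(x)$ term. The objective, step-size schedule, and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, sigma, T, batch = 20, 0.05, 0.01, 1000, 10   # all assumed constants

def noisy_f(x):
    # zeroth-order oracle for a toy quadratic with f* = 0 at x* = 0
    return 0.5 * np.sum(x ** 2) + sigma * rng.standard_normal()

x = np.ones(d)
print("initial f(x):", 0.5 * np.sum(x ** 2))
for t in range(T):
    fx = noisy_f(x)
    g_hat = np.zeros(d)
    for _ in range(batch):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)                      # uniform direction on the unit sphere
        # two-point sphere-sampling gradient estimate, averaged over the batch
        g_hat += (d / delta) * (noisy_f(x + delta * v) - fx) * v
    g_hat /= batch
    eta = 1.0 / (t + 10)                            # decaying step size (assumed schedule)
    x = x - eta * g_hat                             # gradient step with the estimated gradient
print("final f(x):  ", 0.5 * np.sum(x ** 2))        # should be far below the initial value
```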
METHODS
❖ Classical method: Estimating Gradient Descent (EGD)
❖ Classical analysis, supposing $\|x^*\| \le B$ and $\|\nabla f\|_* \le H$:
✴ Stochastic GD/MD (first-order: $\mathbb{E}\,\hat{g}_t(x_t) = \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim BH/\sqrt{T}$
✴ Estimating GD/MD (zeroth-order: $\mathbb{E}\|\hat{g}_t(x_t) - \nabla f(x_t)\|_2^2$ small, but $\mathbb{E}\,\hat{g}_t(x_t) \neq \nabla f(x_t)$): $f(\hat{x}) - f^* \lesssim \sqrt{d} \cdot BH/T^{1/4}$
❖ Problem: cannot exploit (sparse) structure in $x^*$
ASSUMPTIONS
❖ The "function sparsity" assumption: $f(x) \equiv f(x_S)$ for some $S \subseteq [d]$, $|S| = s \ll d$
❖ A strong assumption in theory, but often acceptable in practice (tiny illustration below):
✴ Hyper-parameter tuning: performance is not sensitive to many of the input parameters
✴ Visual stimuli optimization: most brain activity is not related to visual reactions.
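A tiny illustration of the function-sparsity assumption: only $s$ of the $d$ coordinates affect $f$, so its gradient is $s$-sparse. The particular $f$, $d$, and support $S$ below are made up for illustration.

```python
import numpy as np

d = 1000
S = [3, 17, 250]                    # active coordinates, s = 3 << d (assumed)

def f(x):
    # f(x) = f(x_S): depends only on the coordinates in S
    return (x[3] - 1.0) ** 2 + (x[17] + 2.0) ** 2 + 0.5 * x[250] ** 2

x = np.zeros(d)
x_perturbed = x.copy()
x_perturbed[5] += 10.0              # perturb an inactive coordinate
print(f(x) == f(x_perturbed))       # True: f is unchanged, and its gradient is supported on S
```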
LASSO GRADIENT ESTIMATE
❖ Local linear approximation: $f(x_t + \delta v_t) \approx f(x_t) + \delta \langle \nabla f(x_t), v_t \rangle$
❖ Lasso gradient estimate (sketch below):
✴ Sample $v_1, \dots, v_n$ and observe $y_i \approx f(x_t + \delta v_i) - f(x_t)$
✴ Construct a sparse linear system: $\tilde{Y} = Y/\delta = V \nabla f(x_t) + \varepsilon$
✴ Because $\nabla f(x_t)$ is sparse, one can use the Lasso: $\hat{g}_t(x_t) \in \arg\min_{g \in \mathbb{R}^d} \big\{ \|\tilde{Y} - Vg\|_2^2 + \lambda \|g\|_1 \big\}$
✴ Certain "de-biasing" is required; see the paper.
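A minimal sketch of the Lasso gradient estimate at a single point $x_t$, using scikit-learn's Lasso solver. The toy objective, the sample size $n$, $\delta$, $\sigma$, and the regularization level are illustrative assumptions, and the de-biasing step mentioned on the slide is omitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s, n, delta, sigma = 500, 5, 100, 0.01, 0.001    # all assumed constants

support = rng.choice(d, size=s, replace=False)      # (unknown) active coordinates
def f(x):
    # toy sparse objective: depends only on x_S, so grad f is s-sparse
    return 0.5 * np.sum(x[support] ** 2)

x_t = rng.standard_normal(d)
V = rng.standard_normal((n, d))                     # random sampling directions v_1, ..., v_n
Y = np.array([f(x_t + delta * v) - f(x_t) + sigma * rng.standard_normal() for v in V])
Y_tilde = Y / delta                                 # Y_tilde ~= V @ grad f(x_t) + eps

lasso = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000)   # alpha is an untuned assumption
g_hat = lasso.fit(V, Y_tilde).coef_                 # sparse gradient estimate

true_grad = np.zeros(d)
true_grad[support] = x_t[support]                   # gradient of the toy objective
print("relative error:", np.linalg.norm(g_hat - true_grad) / np.linalg.norm(true_grad))
```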
MAIN RESULTS
Theorem. Suppose $f(x) \equiv f(x_S)$ for some $|S| = s \ll d$, and that other smoothness conditions on $f$ hold. Then
$$\frac{1}{T} \sum_{t=1}^{T} f(x_t) - f^* \;\lesssim\; \mathrm{poly}(s, \log d) \cdot T^{-1/4}.$$
Furthermore, for smoother $f$ the $T^{-1/4}$ rate can be improved to $T^{-1/3}$.
❖ Can handle the "high-dimensional" setting $d \gg T$
SIMULATION RESULTS
(Simulation figures omitted.)
OPEN QUESTIONS
❖ Is function/gradient sparsity absolutely necessary?
✴ Recall that in the first-order case, only sparsity of the solution $x^*$ is required
✴ More specifically, one only needs $\|x^*\|_1 \le B$, $\|\nabla f\|_\infty \le H$
✴ Conjecture: if $f$ only satisfies the above condition, then $\inf_{\hat{x}_T} \sup_f \mathbb{E}[f(\hat{x}_T) - f^*] \gtrsim \mathrm{poly}(d, 1/T)$
OPEN QUESTIONS
❖ Is $T^{-1/2}$ convergence achievable in high dimensions?
✴ Challenge 1: MD is awkward at exploiting strong convexity: $f(x') \ge f(x) + \langle \nabla f(x), x' - x \rangle + \frac{\nu}{2} \Delta_\psi(x', x)$; we wish to replace $\Delta_\psi(x', x)$ with $\|x' - x\|_1^2$
✴ Challenge 2: the Lasso gradient estimate is less efficient — can we design a convex body $K$ such that
$$\hat{g}_t(x_t) = \frac{\rho(K)}{\delta} \int_{\partial K} f(x_t + \delta v)\, n(v)\, \mathrm{d}\mu(v)$$
is a good gradient estimator in high dimensions? (A numerical check of the ball case follows below.)
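As a sanity check of the boundary-integral form in Challenge 2 for the simplest choice, the Euclidean unit ball (surface normal $n(v) = v$; I take the normalization $\rho(K) = d$ here so that it matches the classical sphere estimator from the earlier slide — an assumption, not the paper's construction), the Monte Carlo sketch below compares the estimator to the true gradient of a smooth toy function.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta, N = 5, 0.3, 500_000        # dimension, sampling radius, Monte Carlo samples (assumed)
b = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

def f(X):
    # smooth toy objective, evaluated row-wise; grad f(x) = x + b
    return 0.5 * np.sum(X ** 2, axis=-1) + X @ b

x = rng.standard_normal(d)
V = rng.standard_normal((N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)        # uniform points on the ball's boundary (unit sphere)

# Monte Carlo version of the surface integral with n(v) = v and rho(K) = d:
# g_hat = (d/delta) * average of f(x + delta*v) * v over the sphere
g_hat = (d / delta) * np.mean(f(x + delta * V)[:, None] * V, axis=0)

true_grad = x + b
print("relative error:", np.linalg.norm(g_hat - true_grad) / np.linalg.norm(true_grad))  # should be small
```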
Thank you! Questions?