

1. Scaling Bayesian Optimization in High Dimensions

Stefanie Jegelka, MIT
BayesOpt Workshop 2017
joint work with Zi Wang, Chengtao Li, Clement Gehring (MIT) and Pushmeet Kohli (DeepMind)

2. Bayesian Optimization with GPs

Gaussian process: $f \sim GP(\mu, k)$, with closed-form expressions for the posterior mean and variance (uncertainty).

BO: sequentially build a model of $f$. For $t = 1, \dots, T$ (see the sketch below):
• select new query point(s) via a selection criterion, the acquisition function: $x_t \in \arg\max_{x \in X} \alpha_t(x)$
• observe $f(x_t)$
• update the model & repeat
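A minimal sketch of this loop in Python, under illustrative assumptions not from the talk: scikit-learn's `GaussianProcessRegressor` as the GP, a fixed candidate set instead of a continuous search, and UCB as a placeholder acquisition (the talk's own criterion comes later):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def bayes_opt(f, candidates, T, beta=2.0):
    """Generic GP-BO loop: fit posterior, maximize acquisition, query, repeat.

    candidates: (n, d) array over which the acquisition is maximized.
    Uses UCB mu + beta*sigma as a stand-in acquisition function.
    """
    rng = np.random.default_rng(0)
    X = [candidates[rng.integers(len(candidates))]]   # random initial query
    y = [f(X[0])]
    gp = GaussianProcessRegressor(normalize_y=True)
    for t in range(T):
        gp.fit(np.array(X), np.array(y))              # update model of f
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(mu + beta * sigma)]  # arg max alpha_t(x)
        X.append(x_next)
        y.append(f(x_next))                           # observe f(x)
    return X[int(np.argmax(y))], float(np.max(y))
```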

3. Challenges in high dimensions

Statistical & computational complexity:
• estimating & optimizing the acquisition function: a new, sample-efficient acquisition function (ICML 2017)
• function estimation in high dimensions: learn input structure (ICML 2017)
• many observations (data points) make the GP matrix huge: multiple random partitions (BayesOpt 2017); parallelization

4. (Predictive) Entropy Search

New query point: $\arg\max_{x \in X} \alpha_t(x)$, chosen for information about the location $x^*$ of the global optimum given the observed data $D_t$:
$$\alpha_t(x) = I(\{x, y\}; x^* \mid D_t)$$
ES: $\alpha_t(x) = H(p(x^* \mid D_t)) - \mathbb{E}\left[H(p(x^* \mid D_t \cup \{x, y\}))\right]$
Using the symmetry $I(a; b) = H(a) - H(a \mid b) = H(b) - H(b \mid a)$:
PES: $\alpha_t(x) = H(p(y \mid D_t, x)) - \mathbb{E}\left[H(p(y \mid D_t, x, x^*))\right]$
If $x^*$ is high-dimensional, $\alpha_t(x)$ is costly to estimate!
(Hennig & Schuler, 2012; Hernández-Lobato, Hoffman & Ghahramani, 2014)

5. Max-value Entropy Search

Input space ($x^*$ is $d$-dimensional): $\alpha_t(x) = I(\{x, y\}; x^* \mid D_t)$
Output space ($y^*$ is 1-dimensional): $\alpha_t(x) = I(\{x, y\}; y^* \mid D_t)$
$d \to 1$ dimensions!

The expectation over $p(y^* \mid D_t)$ has a closed form (sketched in code below):
$$\alpha_t(x) \approx \frac{1}{K} \sum_{y^* \in Y^*} \left[ \frac{\gamma_{y^*}(x)\, \psi(\gamma_{y^*}(x))}{2\, \Psi(\gamma_{y^*}(x))} - \log \Psi(\gamma_{y^*}(x)) \right],$$
where $\gamma_{y^*}(x) = \frac{y^* - \mu_t(x)}{\sigma_t(x)}$ and $\psi$, $\Psi$ are the standard normal pdf and cdf. How to sample $y^*$?
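As a sketch, the per-candidate computation is just this formula, assuming arrays `mu`, `sigma` of posterior means/stds at the candidate points and a set of sampled max-values `y_star_samples`:

```python
import numpy as np
from scipy.stats import norm

def mes_acquisition(mu, sigma, y_star_samples):
    """Closed-form MES acquisition, averaged over K sampled max-values y*."""
    alpha = np.zeros_like(mu)
    for y_star in y_star_samples:
        gamma = (y_star - mu) / sigma          # standardized gap to y*
        # gamma*pdf/(2*cdf) - log(cdf), accumulated over the K samples
        alpha += gamma * norm.pdf(gamma) / (2 * norm.cdf(gamma)) \
                 - norm.logcdf(gamma)
    return alpha / len(y_star_samples)
```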

6. Sampling y*: Idea 1

For each $x$, $p(f(x))$ is a 1D Gaussian.

Fisher-Tippett-Gnedenko Theorem: the maximum of a set of i.i.d. Gaussian variables is asymptotically described by a Gumbel distribution.

• sample representative points $x$
• approximate the max-value over the representative points by a Gumbel distribution (see the sketch below)

[Figure: samples $f(x)$ of a 1D GP over $x \in [-2, 2]$.]

7. Sampling y*: Idea 2

Draw functions from the GP posterior and maximize each. How?

Neal 1994: a GP $\equiv$ an infinite 1-layer neural network with Gaussian weights.

• approximate the GP as a finite neural network with random weights (random features) & sample its posterior weights
• maximize the network output for each sample (see the sketch below)

(Hernández-Lobato, Hoffman & Ghahramani, 2014)
[Figure: posterior function samples $f(x)$ over the input $x$.]

8. Max-value Entropy Search

Input space ($x^*$ is $d$-dimensional): $\alpha_t(x) = I(\{x, y\}; x^* \mid D_t)$
Output space ($y^*$ is 1-dimensional): $\alpha_t(x) = I(\{x, y\}; y^* \mid D_t)$
$d \to 1$ dimensions!

Now we can sample $y^*$, so the expectation over $p(y^* \mid D_t)$ is tractable. Does it work?

9. Empirically: is the max-value enough? Is it sample-efficient?

[Figure: simple regret over 200 iterations for PES sampling $x^*$ (1, 10, 100 samples) vs. MES-G sampling $y^*$ (1, 10, 100 samples).]

10. Empirically: faster than PES

Runtime per iteration (seconds), for 1 / 10 / 100 samples:
  PES:         1.61 / 5.85 / 15.24
  MES-NN:      0.13 / 0.20 / 0.67
  MES-Gumbel:  0.09 / 0.09 / 0.12

11. Connections & Theory

A zoo of acquisition functions: EI (Mockus, 1974), PI (Kushner, 1964), GP-UCB (Auer, 2002; Srinivas et al., 2010), GP-MI (Contal et al., 2014), ES (Hennig & Schuler, 2012), PES (Hernández-Lobato et al., 2014), EST (Wang et al., 2016), GLASSES (González et al., 2016), SMAC (Hutter et al., 2010), ROAR (Hutter et al., 2010), … Where does MES fit in?

Lemma (Wang & Jegelka, 2017). The following acquisition functions are equivalent:
• MES with a single sample of $y^*$ per step
• UCB (upper confidence bound; Srinivas et al., 2010) with a specific, adaptive parameter setting
• PI (probability of improvement; Kushner, 1964) with a specific, adaptive parameter setting

Theorem: regret bound (Wang & Jegelka, 2017). With probability $1 - \delta$, within $T_0 = O(T \log \frac{1}{\delta})$ iterations,
$$f^* - \max_{t \in [1, T_0]} f(x_t) = O\left(\sqrt{\frac{(\log T)^{d+2}}{T}}\right).$$
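A sketch of why the lemma's single-sample equivalence holds: the MES expression is monotonically decreasing in $\gamma_{y^*}(x)$ alone, so with one sampled $y^*$,
$$\arg\max_{x \in X} \alpha_t(x) \;=\; \arg\min_{x \in X} \frac{y^* - \mu_t(x)}{\sigma_t(x)}.$$
With the adaptive choice $\beta_t = \min_{x} \gamma_{y^*}(x)$, every $x$ satisfies $\mu_t(x) + \beta_t\, \sigma_t(x) \le y^*$, with equality exactly at that minimizer, so the same point maximizes the UCB criterion $\mu_t(x) + \beta_t\, \sigma_t(x)$.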

12. Gaussian Processes in high dimensions

• estimating a nonlinear function in high input dimensions: statistically challenging
• optimizing a nonconvex acquisition function in high dimensions: computationally challenging
• many observations: huge matrices, computationally challenging

13. Additive Gaussian Processes

$$f(x) = \sum_{m \in [M]} f_m(x^{A_m}), \qquad \text{e.g. } f(x) = f_0(x^{A_0}) + f_1(x^{A_1}) + f_2(x^{A_2})$$

• lower-complexity component functions: statistical efficiency
• optimize the acquisition function block-wise: computational efficiency (kernel sketch below)

What is the partition? (Hastie & Tibshirani, 1990; Kandasamy et al., 2015)
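A minimal numpy sketch of such an additive kernel; the RBF base kernel is an illustrative choice, and the example grouping mirrors the assignment vector $z$ on the next slide:

```python
import numpy as np

def rbf(Xa, Xb, lengthscale=1.0):
    """RBF base kernel evaluated on a coordinate subset."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def additive_kernel(Xa, Xb, groups):
    """k(x, x') = sum_m k_m(x^{A_m}, x'^{A_m}) over a coordinate partition."""
    return sum(rbf(Xa[:, A], Xb[:, A]) for A in groups)

# e.g. 9 input dimensions split into three low-dimensional groups:
# groups = [[0, 2, 3, 7], [1, 4, 5, 6], [8]]
# K = additive_kernel(X, X, groups)   # GP covariance matrix of f
```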

14. Structural Kernel Learning

Learn the assignment! $f = f_0 + f_1 + f_2$ is encoded by an assignment vector $z$, e.g.
$z = [0\ 1\ 0\ 0\ 1\ 1\ 1\ 0\ 2]$: dimension $j$ belongs to the group $A_{z_j}$ of component $f_{z_j}(x^{A_{z_j}})$.

Key idea: Dirichlet prior on $z$, i.e. $\theta \sim \mathrm{Dir}(\alpha)$ and $z_j \sim \mathrm{Multi}(\theta)$.
Integrate out $\theta$ and sample the posterior $p(z \mid D_n; \alpha)$ via Gibbs sampling: easy updates (sketch below).
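A sketch of one Gibbs sweep over $z$ under these assumptions; `log_evidence` is a placeholder for the GP marginal likelihood under the additive kernel induced by $z$, and the paper's actual updates are more efficient than this naive recomputation:

```python
import numpy as np

def gibbs_sweep(z, X, y, alpha, M, log_evidence):
    """Resample each dimension's group assignment z_j in turn.

    p(z_j = m | z_-j, D_n) is proportional to
    p(D_n | z) * (|A_m without j| + alpha): the second factor comes from
    integrating the Dirichlet prior out of z_j ~ Multi(theta).
    """
    for j in range(len(z)):
        counts = np.bincount(np.delete(z, j), minlength=M)  # |A_m| w/o dim j
        log_p = np.empty(M)
        for m in range(M):
            z[j] = m
            log_p[m] = log_evidence(z, X, y) + np.log(counts[m] + alpha)
        p = np.exp(log_p - log_p.max())          # normalize stably
        z[j] = np.random.choice(M, p=p / p.sum())
    return z
```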

15. Empirical Results

[Figures: a synthetic function in 50 dimensions and a robot pushing task (D = 50). Simple regret $r_t$ over iterations $t$ for No Partition, Fully Partitioned, Heuristics, the True partition, and SKL.]

16. Curious connections

• crossover in evolutionary algorithms: children recombine coordinates of observed good points
• BO with an additive GP behaves similarly: query points recombine coordinates of observed good points, but with a learned coordinate partition instead of a completely random one

[Figures: observations and the estimated acquisition function; example vectors of observed good points and the query points recombined from their coordinates.]

17. Gaussian Processes in high dimensions

• estimating nonlinear functions in high input dimensions: statistically challenging
• optimizing a nonconvex acquisition function in high dimensions: computationally challenging
• many observations: huge matrix inversions, computationally challenging

Full-kernel posterior (sketched in code below):
$$\mu(x) = k_n(x)^\top (K_n + \tau^2 I)^{-1} y_n$$
$$\sigma^2(x) = k(x, x) - k_n(x)^\top (K_n + \tau^2 I)^{-1} k_n(x)$$

[Figures: posterior mean $\mu$, $\pm 3\sigma$ bands, and samples $f(x)$ for the full kernel vs. a low-rank approximation.]
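These two formulas in code, via a Cholesky solve; the $O(n^3)$ factorization is the bottleneck that motivates the approximations below:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(K_n, k_x, k_xx, y, tau2):
    """Exact GP posterior at a test point x.

    K_n: (n, n) kernel matrix; k_x: (n,) vector k_n(x); k_xx: scalar k(x, x);
    tau2: observation noise variance. Factorizing K_n + tau^2 I costs O(n^3).
    """
    L = cho_factor(K_n + tau2 * np.eye(len(y)))
    mu = k_x @ cho_solve(L, y)            # k_n(x)^T (K_n + tau^2 I)^{-1} y_n
    var = k_xx - k_x @ cho_solve(L, k_x)  # posterior variance
    return mu, var
```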

18. Ensemble Bayesian Optimization

In each iteration:
• partition the data via a Mondrian process
• fit a GP in each part: structure learning + tile coding; synchronize
• select query points in parallel & filter

Parallelization across parts; a distribution over partitions, with a new draw in each iteration (sketch below).
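A toy sketch of the partition step, using recursive random axis-aligned cuts as a simplified stand-in for a Mondrian-process draw; `fit_gp` in the usage comment is a hypothetical per-part model fit:

```python
import numpy as np

def random_partition(X, depth):
    """Recursively split the data indices with random axis-aligned cuts
    (a simplified stand-in for sampling a Mondrian partition)."""
    parts = [np.arange(len(X))]
    for _ in range(depth):
        new_parts = []
        for idx in parts:
            if len(idx) < 2:
                new_parts.append(idx)
                continue
            dim = np.random.randint(X.shape[1])
            cut = np.random.uniform(X[idx, dim].min(), X[idx, dim].max())
            mask = X[idx, dim] <= cut
            new_parts += [idx[mask], idx[~mask]]
        parts = new_parts
    return parts

# per iteration: fresh partition, independent small GPs fit in parallel,
# candidates proposed in every part, then filtered down to a batch:
# parts = random_partition(X, depth=4)
# models = [fit_gp(X[idx], y[idx]) for idx in parts]  # embarrassingly parallel
```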

19. Does it scale?

[Figure: Gibbs sampling time (minutes) vs. observation size, up to 50k observations. We stopped SKL after 2 hours; EBO's average runtime was 61 seconds.]

20. Variances

[Figures: posterior mean with $\pm 3\sigma$ bands and the latent $f$ for 100, 1000, and 5000 observations vs. ground truth (top row); four draws at 5000 observations (bottom row).]

21. Empirical Results

[Figure: regret vs. time (minutes) comparing BO-SVI, PBO, BO-Add-SVI, and EBO.]
(Hensman et al., 2013; Wang et al., 2017)

22. Summary: GP-BO in high dimensions

Challenge: high dimensions and many observations demand statistical & computational efficiency.
• Max-value Entropy Search: a sample-efficient, effective acquisition function (Wang & Jegelka, ICML 2017)
• many dimensions: learning structured kernels (Wang, Li, Jegelka & Kohli, ICML 2017)
• many observations & dimensions & parallelization: Ensemble Bayesian Optimization (Wang, Gehring, Kohli & Jegelka, BayesOpt 2017)


23. References

• Zi Wang, Stefanie Jegelka. Max-value Entropy Search for Efficient Bayesian Optimization. ICML 2017.
• Zi Wang, Chengtao Li, Stefanie Jegelka, Pushmeet Kohli. Batched High-dimensional Bayesian Optimization via Structural Kernel Learning. ICML 2017.
• Zi Wang, Clement Gehring, Pushmeet Kohli, Stefanie Jegelka. Batched Large-scale Bayesian Optimization in High-dimensional Spaces. BayesOpt 2017.
