BayesOpt: hot topics and current challenges




  1. BayesOpt: hot topics and current challenges. Javier González. Masterclass, 7 February 2017, @Lancaster University.

  2. Agenda of the day
  - 9:00-11:00, Introduction to Bayesian Optimization:
    - What is BayesOpt and why does it work?
    - Relevant things to know.
  - 11:30-13:00, Connections, extensions and applications:
    - Extensions to multi-task problems, constrained domains, early stopping, high dimensions.
    - Connections to armed bandits and ABC.
    - An application in genetics.
  - 14:00-16:00, GPyOpt LAB! Bring your own problem!
  - 16:30, Hot topics and current challenges:
    - Parallelization.
    - Non-myopic methods.
    - Interactive Bayesian Optimization.

  3. Section III: Hot topics and challenges
  - Parallel Bayesian Optimization.
  - Non-myopic methods.
  - Interactive Bayesian Optimization.

  4. Scalable BO: Parallel/batch BO. Avoiding the bottleneck of evaluating f.
  - Cost of $f(x_n)$ = cost of $\{f(x_{n,1}), \dots, f(x_{n,n_b})\}$.
  - Many cores available, simultaneous lab experiments, etc.

  5. Considerations when designing a batch
  - Available pairs $\{(x_i, y_i)\}_{i=1}^{n}$ are augmented with the evaluations of $f$ on $\mathcal{B}_t^{n_b} = \{x_{t,1}, \dots, x_{t,n_b}\}$.
  - Goal: design $\mathcal{B}_1^{n_b}, \dots, \mathcal{B}_m^{n_b}$.
  Notation:
  - $\mathcal{I}_n$: represents the available data set $\mathcal{D}_n$ and the GP structure when $n$ data points are available ($\mathcal{I}_{t,k}$ in the batch context).
  - $\alpha(x; \mathcal{I}_n)$: generic acquisition function given $\mathcal{I}_n$.

  6. Optimal greedy batch design
  Sequential policy: maximize $\alpha(x; \mathcal{I}_{t,0})$.
  Greedy batch policy, 1st element of the t-th batch: maximize $\alpha(x; \mathcal{I}_{t,0})$.

  7. Optimal greedy batch design
  Sequential policy: maximize $\alpha(x; \mathcal{I}_{t,0})$.
  Greedy batch policy, 2nd element of the t-th batch: maximize
  $$\int \alpha(x; \mathcal{I}_{t,1})\, p(y_{t,1} \mid x_{t,1}, \mathcal{I}_{t,0})\, p(x_{t,1} \mid \mathcal{I}_{t,0})\, dx_{t,1}\, dy_{t,1}$$
  - $p(y_{t,1} \mid x_{t,1}, \mathcal{I}_{t,0})$: predictive distribution of the GP.
  - $p(x_{t,1} \mid \mathcal{I}_{t,0}) = \delta(x_{t,1} - \arg\max_{x \in \mathcal{X}} \alpha(x; \mathcal{I}_{t,0}))$.

  8. Optimal greedy batch design
  Sequential policy: maximize $\alpha(x; \mathcal{I}_{t,k-1})$.
  Greedy batch policy, k-th element of the t-th batch: maximize
  $$\int \alpha(x; \mathcal{I}_{t,k-1}) \prod_{j=1}^{k-1} p(y_{t,j} \mid x_{t,j}, \mathcal{I}_{t,j-1})\, p(x_{t,j} \mid \mathcal{I}_{t,j-1})\, dx_{t,j}\, dy_{t,j}$$
  - $p(y_{t,j} \mid x_{t,j}, \mathcal{I}_{t,j-1})$: predictive distribution of the GP.
  - $p(x_{t,j} \mid \mathcal{I}_{t,j-1}) = \delta(x_{t,j} - \arg\max_{x \in \mathcal{X}} \alpha(x; \mathcal{I}_{t,j-1}))$.

  9. Available approaches [Azimi et al., 2010; Desautels et al., 2012; Chevalier et al., 2013; Contal et al., 2013]
  - Exploratory approaches: reduction in system uncertainty.
  - Generate 'fake' observations of f using $p(y_{t,j} \mid x_{t,j}, \mathcal{I}_{t,j-1})$.
  - Simultaneously optimize the elements of the batch using the joint distribution of $y_{t,1}, \dots, y_{t,n_b}$.
  Bottleneck: all these methods require iteratively updating $p(y_{t,j} \mid x_{t,j}, \mathcal{I}_{t,j-1})$ to model the interaction between the elements in the batch: $O(n^3)$.
  How can we design batches while reducing this cost? Local penalization.
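
To make the 'fake observations' idea concrete, here is a minimal sketch (mine, not the authors' code) of a greedy fantasy-based batch: each element is picked by maximizing a simple acquisition, a hallucinated outcome (the GP mean, as in kriging-believer heuristics) is appended, and the GP is refit, which is exactly the repeated $O(n^3)$ update mentioned above. The RBF kernel, lower-confidence-bound acquisition and all hyperparameter values are illustrative assumptions.

import numpy as np

def rbf(A, B, lengthscale=0.2, variance=1.0):
    # Squared-exponential kernel between rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Exact GP posterior at test points Xs; the Cholesky factorisation is the O(n^3) step.
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(X, Xs)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v**2, 0)
    return mu, np.maximum(var, 1e-12)

def acquisition(mu, var, kappa=2.0):
    # Lower confidence bound for minimisation, returned so that larger = better.
    return -(mu - kappa * np.sqrt(var))

def fantasy_batch(X, y, grid, n_batch=3):
    # Greedy batch: maximise the acquisition, hallucinate the outcome, refit, repeat.
    Xf, yf, batch = X.copy(), y.copy(), []
    for _ in range(n_batch):
        mu, var = gp_posterior(Xf, yf, grid)
        k = int(np.argmax(acquisition(mu, var)))
        batch.append(grid[k])
        Xf = np.vstack([Xf, grid[k]])          # pretend we evaluated f there...
        yf = np.append(yf, mu[k])              # ...and observed the GP mean ('fake' observation)
    return np.array(batch)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (8, 1))
y = np.sin(6 * X[:, 0]) + 0.05 * rng.standard_normal(8)
grid = np.linspace(0, 1, 200)[:, None]
print(fantasy_batch(X, y, grid))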

  10. Goal: eliminate the marginalization step
  "To develop a heuristic approximating the 'optimal batch design strategy' at lower computational cost, while incorporating information about global properties of f from the GP model into the batch design."
  Lipschitz continuity: $|f(x_1) - f(x_2)| \le L \|x_1 - x_2\|_p$.

  11. Interpretation of the Lipschitz continuity of f
  $M = \max_{x \in \mathcal{X}} f(x)$ and $B_{r_{x_j}}(x_j) = \{x \in \mathcal{X} : \|x - x_j\| \le r_{x_j}\}$, where $r_{x_j} = \frac{M - f(x_j)}{L}$.
  [Figure: true function, samples, exclusion cones and active regions.]
  $x_M \notin B_{r_{x_j}}(x_j)$ (the maximizer of f cannot lie in the ball); otherwise, the Lipschitz condition is violated.

  12. Probabilistic version of $B_{r_x}(x)$
  We can do this because $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$:
  - $r_{x_j}$ is Gaussian with $\mu(r_{x_j}) = \frac{M - \mu(x_j)}{L}$ and $\sigma^2(r_{x_j}) = \frac{\sigma^2(x_j)}{L^2}$.
  Local penalizers: $\varphi(x; x_j) = p(x \notin B_{r_{x_j}}(x_j))$, so
  $$\varphi(x; x_j) = p(r_{x_j} < \|x - x_j\|) = \tfrac{1}{2}\,\mathrm{erfc}(-z), \quad \text{where } z = \frac{1}{\sqrt{2\sigma_n^2(x_j)}}\bigl(L\|x_j - x\| - M + \mu_n(x_j)\bigr).$$
  - Reflects the size of the 'Lipschitz' exclusion areas.
  - Approaches 1 when x is far from $x_j$ and decreases otherwise.
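
As a quick sanity check of the formula above, here is a minimal sketch of the penalizer, assuming the GP posterior mean and variance at $x_j$, the Lipschitz constant L and the maximum M are already available (all numerical values below are made up):

import numpy as np
from scipy.special import erfc

def local_penalizer(x, x_j, mu_j, s2_j, L, M):
    # phi(x; x_j) = p(x not in B_{r_{x_j}}(x_j)) = 0.5 * erfc(-z)
    z = (L * np.linalg.norm(x - x_j) - M + mu_j) / np.sqrt(2.0 * s2_j)
    return 0.5 * erfc(-z)

x_j = np.array([0.0])
# Far from x_j the penalizer is close to 1; near x_j it shrinks towards 0.
print(local_penalizer(np.array([5.0]), x_j, mu_j=0.2, s2_j=0.1, L=1.0, M=1.0))
print(local_penalizer(np.array([0.1]), x_j, mu_j=0.2, s2_j=0.1, L=1.0, M=1.0))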

  13. Idea to collect the batches, without explicitly using the model
  Optimal batch: maximization-marginalization
  $$\int \alpha(x; \mathcal{I}_{t,k-1}) \prod_{j=1}^{k-1} p(y_{t,j} \mid x_{t,j}, \mathcal{I}_{t,j-1})\, p(x_{t,j} \mid \mathcal{I}_{t,j-1})\, dx_{t,j}\, dy_{t,j}$$
  Proposal: maximization-penalization. Use the $\varphi(x; x_j)$ to penalize the acquisition and predict the expected change in $\alpha(x; \mathcal{I}_{t,k-1})$.

  14. Local penalization strategy [González, Dai, Hennig, Lawrence, 2016]
  [Figure: penalized acquisition for the 1st, 2nd and 3rd batch elements, showing $\alpha(x)$, the penalizers $\varphi_1(x)$, $\varphi_2(x)$ and their products $\alpha(x)\varphi_1(x)$, $\alpha(x)\varphi_1(x)\varphi_2(x)$.]
  The maximization-penalization strategy selects $x_{t,k}$ as
  $$x_{t,k} = \arg\max_{x \in \mathcal{X}} \Bigl\{ g\bigl(\alpha(x; \mathcal{I}_{t,0})\bigr) \prod_{j=1}^{k-1} \varphi(x; x_{t,j}) \Bigr\},$$
  where g is a transformation of $\alpha(x; \mathcal{I}_{t,0})$ that makes it always positive.
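
The rule above can be sketched in a few lines for a discrete grid of candidates. The snippet below is only an illustration, assuming precomputed acquisition values and GP moments on the grid and given estimates of L and M; a soft-plus plays the role of the positivity transformation g:

import numpy as np
from scipy.special import erfc

def softplus(a):
    # Keeps the penalised acquisition strictly positive, as required for g.
    return np.log1p(np.exp(a))

def penalizer(grid, x_j, mu_j, s2_j, L, M):
    # phi(x; x_j) evaluated at every grid point.
    z = (L * np.linalg.norm(grid - x_j, axis=1) - M + mu_j) / np.sqrt(2.0 * s2_j)
    return 0.5 * erfc(-z)

def lp_batch(grid, acq, mu, s2, L, M, n_batch=3):
    # Select a batch by repeatedly maximising g(alpha(x)) * prod_j phi(x; x_j).
    penalised = softplus(acq)
    batch = []
    for _ in range(n_batch):
        k = int(np.argmax(penalised))
        batch.append(grid[k])
        # Down-weight the acquisition around the point just chosen.
        penalised = penalised * penalizer(grid, grid[k], mu[k], s2[k], L, M)
    return np.array(batch)

# Toy usage on a 1-D grid with a synthetic acquisition surface.
grid = np.linspace(-10, 10, 400)[:, None]
acq = np.exp(-0.5 * (grid[:, 0] - 1.0) ** 2) + 0.8 * np.exp(-0.5 * (grid[:, 0] + 4.0) ** 2)
mu = np.zeros(len(grid))
s2 = 0.05 * np.ones(len(grid))
print(lp_batch(grid, acq, mu, s2, L=1.0, M=1.0))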


  16. Example for L = 50. L controls the exploration-exploitation balance within the batch.

  17. Example for L = 100. L controls the exploration-exploitation balance within the batch.

  18. Example for L = 150. L controls the exploration-exploitation balance within the batch.

  19. Example for L = 250. L controls the exploration-exploitation balance within the batch.

  20. Finding a unique Lipschitz constant
  Let $f : \mathcal{X} \to \mathbb{R}$ be an L-Lipschitz continuous function defined on a compact subset $\mathcal{X} \subseteq \mathbb{R}^D$. Then
  $$L_p = \max_{x \in \mathcal{X}} \|\nabla f(x)\|_p$$
  is a valid Lipschitz constant.
  The gradient of f at $x^*$ is distributed as a multivariate Gaussian:
  $$\nabla f(x^*) \mid X, y, x^* \sim \mathcal{N}\bigl(\mu_\nabla(x^*), \Sigma^2_\nabla(x^*)\bigr)$$
  We choose $\hat{L} = \max_{x^* \in \mathcal{X}} \|\mu_\nabla(x^*)\|$.
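
A minimal sketch of estimating $\hat{L}$, assuming an RBF-kernel GP fit to the data: the posterior mean is finite-differenced on a dense 1-D grid and the largest gradient magnitude is taken. (For standard kernels $\mu_\nabla$ is available in closed form; the numerical version below just keeps the sketch short.)

import numpy as np

def rbf(A, B, lengthscale=0.2, variance=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def posterior_mean(X, y, Xs, noise=1e-6):
    K = rbf(X, X) + noise * np.eye(len(X))
    return rbf(X, Xs).T @ np.linalg.solve(K, y)

def estimate_lipschitz(X, y, bounds, n_grid=500, eps=1e-4):
    # Approximate max_x ||grad mu(x)|| by central differences on a dense grid.
    grid = np.linspace(bounds[0], bounds[1], n_grid)[:, None]
    m_plus = posterior_mean(X, y, grid + eps)
    m_minus = posterior_mean(X, y, grid - eps)
    grad = (m_plus - m_minus) / (2 * eps)
    return np.max(np.abs(grad))

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (10, 1))
y = np.sin(6 * X[:, 0])
print(estimate_lipschitz(X, y, bounds=(0.0, 1.0)))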

  21. Experiments: Sobol function. Best (average) result for a given time budget.

  22. 2D experiment with a 'large domain': comparison in terms of wall-clock time.
  [Figure: best found value vs. time in seconds (0-300) for EI, UCB, Rand-EI, Rand-UCB, SM-UCB, B-UCB, PE-UCB, Pred-EI, Pred-UCB, qEI, LP-EI and LP-UCB.]

  23. Myopia of optimisation techniques
  - Most global optimisation techniques are myopic, considering no more than a single step into the future.
  - Relieving this myopia requires solving the multi-step lookahead problem.
  Figure: two evaluations; if the first evaluation is made myopically, the second must be sub-optimal.

  24. Non-myopic thinking
  Thinking non-myopically is important: it is a way of integrating into our decisions the information about the (limited) resources available to solve a given problem.

  25. Acquisition function: expected loss [Osborne, 2010]
  Loss of evaluating f at $x^*$, assuming it returns $y^*$:
  $$\lambda(y^*) = \begin{cases} y^* & \text{if } y^* \le \eta \\ \eta & \text{if } y^* > \eta \end{cases}$$
  where $\eta = \min\{y_0\}$, the current best found value.
  The expected loss is:
  $$\Lambda_1(x^* \mid \mathcal{I}_0) \triangleq E[\min(y^*, \eta)] = \int \lambda(y^*)\, p(y^* \mid x^*, \mathcal{I}_0)\, dy^*$$
  $\mathcal{I}_0$ is the current information: $\mathcal{D}$, $\theta$ and the likelihood type.
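
For a Gaussian predictive density the one-step expected loss has a closed form: $E[\min(y^*, \eta)] = \eta - E[(\eta - y^*)_+]$, i.e. the incumbent minus the (minimisation-form) expected improvement. A minimal sketch, with made-up values:

import numpy as np
from scipy.stats import norm

def expected_loss(mu, s2, eta):
    # Lambda_1(x*) = E[min(y*, eta)] for y* ~ N(mu, s2).
    s = np.sqrt(s2)
    u = (eta - mu) / s
    ei = (eta - mu) * norm.cdf(u) + s * norm.pdf(u)   # E[(eta - y*)_+]
    return eta - ei

# Lower expected loss = more promising evaluation.
print(expected_loss(mu=0.0, s2=0.5**2, eta=0.1))   # good chance of improving on eta
print(expected_loss(mu=1.0, s2=0.1**2, eta=0.1))   # almost no chance: loss stays near eta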

  26. The expected loss (improvement) is myopic
  - Selects the next evaluation as if it were the last one.
  - The remaining available budget is not taken into account when deciding where to evaluate.
  How can we take into account the effect of future evaluations in the decision?

  27. Expected loss with n steps ahead
  Intractable even for a handful of steps ahead:
  $$\Lambda_n(x^* \mid \mathcal{I}_0) = \int \lambda(y_n) \prod_{j=1}^{n} p(y_j \mid x_j, \mathcal{I}_{j-1})\, p(x_j \mid \mathcal{I}_{j-1})\, dy^* \cdots dy_n\, dx_2 \cdots dx_n$$
  - $p(y_j \mid x_j, \mathcal{I}_{j-1})$: predictive distribution of the GP at $x_j$.
  - $p(x_j \mid \mathcal{I}_{j-1})$: optimisation step.
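
To see where the intractability comes from, here is an illustrative Monte Carlo rollout (my own sketch, not from the slides) of just the two-step case: every sample of $y_1$ forces a GP refit and an inner optimisation of the one-step loss, and the nesting deepens with every extra step.

import numpy as np
from scipy.stats import norm

def rbf(A, B, ls=0.2):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp(X, y, Xs, noise=1e-6):
    # Exact GP posterior mean and variance (unit-variance RBF kernel).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def one_step_loss(mu, var, eta):
    s = np.sqrt(var)
    u = (eta - mu) / s
    return eta - ((eta - mu) * norm.cdf(u) + s * norm.pdf(u))

def two_step_loss(x1, X, y, grid, n_samples=50, rng=np.random.default_rng(2)):
    mu1, var1 = gp(X, y, x1[None, :])
    losses = []
    for _ in range(n_samples):                      # outer expectation over y1
        y1 = mu1[0] + np.sqrt(var1[0]) * rng.standard_normal()
        Xa, ya = np.vstack([X, x1]), np.append(y, y1)
        eta = ya.min()
        mu2, var2 = gp(Xa, ya, grid)                # refit, then inner optimisation of step 2
        losses.append(one_step_loss(mu2, var2, eta).min())
    return np.mean(losses)

X = np.array([[0.1], [0.5], [0.9]])
y = np.sin(6 * X[:, 0])
grid = np.linspace(0, 1, 100)[:, None]
print(two_step_loss(np.array([0.3]), X, y, grid))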

  28. Relieving the myopia of Bayesian optimisation
  We present... GLASSES: Global optimisation with Look-Ahead through Stochastic Simulation and Expected-loss Search.

  29. GLASSES: rendering the approximation sparse
  Idea: jointly model the epistemic uncertainty about the steps ahead by defining a point process:
  $$\Gamma_n(x^* \mid \mathcal{I}_0) = \int \lambda(y_n)\, p(\mathbf{y} \mid X, \mathcal{I}_0, x^*)\, p(X \mid \mathcal{I}_0, x^*)\, d\mathbf{y}\, dX$$

  30. GLASSES: technical details
  Selecting a good $p(X \mid \mathcal{I}_0, x^*)$ is complicated.
  - Replace the integration over $p(X \mid \mathcal{I}_0, x^*)$ by conditioning on an oracle predictor $F_n(x^*)$ of the n future locations.
  - $\mathbf{y} = (y^*, \dots, y_n)^T$: Gaussian outputs of f at $F_n(x^*)$.
  - $\Lambda_n\bigl(x^* \mid \mathcal{I}_0, F_n(x^*)\bigr) = \Gamma_n\bigl(x^* \mid \mathcal{I}_0, F_n(x^*)\bigr) = E[\min(\mathbf{y}, \eta)]$.
  - $E[\min(\mathbf{y}, \eta)]$ is computed using Expectation Propagation.
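
The slides compute $E[\min(\mathbf{y}, \eta)]$ with Expectation Propagation; as a simple stand-in to make the quantity concrete, one can also estimate it by Monte Carlo from the joint Gaussian over the outputs at the oracle locations (all numbers below are made up):

import numpy as np

def expected_min_mc(mean, cov, eta, n_samples=10_000, rng=np.random.default_rng(3)):
    # Estimate E[min(y_1, ..., y_n, eta)] for y ~ N(mean, cov) by sampling.
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    return np.mean(np.minimum(samples.min(axis=1), eta))

# Toy example: three correlated future outcomes and an incumbent eta = 0.0.
mean = np.array([0.2, -0.1, 0.3])
cov = 0.05 * np.array([[1.0, 0.6, 0.2],
                       [0.6, 1.0, 0.5],
                       [0.2, 0.5, 1.0]])
print(expected_min_mc(mean, cov, eta=0.0))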
