Preferential Bayesian Optimization
Javier González, Zhenwen Dai, Andreas Damianou, Neil D. Lawrence
@ICML 2017, Sydney, Australia
June 26, 2019
My Colleagues
Javier González, Andreas Damianou, Neil D. Lawrence
Motivation
◮ Bayesian Optimization aims at searching for the global minimum of an expensive function $g$:
  $x_{\min} = \arg\min_{x \in \mathcal{X}} g(x).$
◮ What if the function $g$ is not directly measurable?
Preference vs. Rating
◮ The objective functions of many tasks are difficult to summarize precisely into a single value.
◮ For humans, comparing two options is almost always easier than rating a single one.
◮ This observation has been exploited in A/B testing.
BO via Preference
◮ Go beyond a single A/B test.
◮ Optimize a system by tuning its configuration, e.g., the font size or background color of a website.
◮ The objective, such as customer experience, is not directly measurable.
◮ Instead, compare the objective under two different configurations.
◮ The task is to search for the best configuration by iteratively suggesting pairs of configurations and observing the results of the comparisons.
Problem Definition
◮ Find the minimum of a latent function $g(x)$, $x \in \mathcal{X}$.
◮ Observe only whether $g(x) < g(x')$ or not, for a duel $[x, x'] \in \mathcal{X} \times \mathcal{X}$.
◮ The outcomes are binary: true or false.
◮ The outcomes are stochastic.
Preference Function
◮ In this work, the outcome distribution is assumed to be Bernoulli:
  $p(y \in \{0,1\} \mid [x, x']) = \pi^{y} (1 - \pi)^{1-y}, \qquad \pi = \sigma\big(g(x') - g(x)\big).$
◮ $\pi$ is referred to as the preference function.
◮ A Preferential Bayesian Optimization algorithm proposes a sequence of duels that helps efficiently localize the minimum of the latent function $g(x)$.
[Figure: the latent objective function with its global minimum, and the corresponding preference function over (x, x').]
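As a concrete illustration, the sketch below simulates this Bernoulli duel model for a toy latent function; the quadratic g and all names are illustrative choices, not part of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g(x):
    # toy latent objective (illustrative only); minimum at x = 0.3
    return 10.0 * (x - 0.3) ** 2

def duel(x, x_prime, rng):
    """Return y = 1 if x 'beats' x' (appears to have the lower value),
    drawn from Bernoulli(pi) with pi = sigma(g(x') - g(x))."""
    pi = sigmoid(g(x_prime) - g(x))
    return int(rng.random() < pi)

rng = np.random.default_rng(0)
print(duel(0.3, 0.9, rng))  # 1 with high probability: 0.3 is near the minimum
```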
A Surrogate Model
◮ The preference function is not observable.
◮ Only a few comparisons are observed.
◮ A surrogate model is needed to guide the search.
◮ Two choices:
  ◮ a surrogate model for the latent function (as in standard BO) [Brochu, 2010, Guo et al., 2010];
  ◮ a surrogate model for the preference function.
[Figure: the true preference function, and the expectation of y⋆ and σ(f⋆) under a fitted surrogate.]
A Surrogate Model of the Preference Function
◮ We propose to build a surrogate model for the preference function.
◮ Pros: easy to model (Gaussian process binary classification is used):
  $p(y_\star = 1 \mid \mathcal{D}, [x_\star, x'_\star], \theta) = \int \sigma(f_\star)\, p(f_\star \mid \mathcal{D}, [x_\star, x'_\star], \theta)\, df_\star$
◮ Pros: flexible latent function (e.g., non-stationarity).
◮ Cons: the minimum of the latent function is not directly accessible.
[Figure: the true preference function, and the expectation of y⋆ and σ(f⋆) under the fitted surrogate.]
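A minimal sketch of fitting such a surrogate, using scikit-learn's GaussianProcessClassifier on concatenated duels [x, x'] as a stand-in for the paper's GP binary-classification model; the kernel, data, and labels here are placeholders, not the authors' exact choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# duels: shape (n, 2) for a 1-D search space, each row is a pair [x, x'];
# outcomes: n binary labels, 1 meaning x beat x' (i.e. g(x) < g(x') was observed).
rng = np.random.default_rng(0)
duels = rng.uniform(0.0, 1.0, size=(30, 2))
outcomes = (duels[:, 0] < duels[:, 1]).astype(int)  # placeholder labels for the demo

surrogate = GaussianProcessClassifier(kernel=RBF(length_scale=0.2))
surrogate.fit(duels, outcomes)

# predicted winning probability p(y* = 1 | D, [x, x']) for new duels
new_duels = np.array([[0.1, 0.9], [0.8, 0.2]])
print(surrogate.predict_proba(new_duels)[:, 1])
```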
Who is the winner (the minimum)?
◮ The minimum beats all the other locations on average.
◮ Extending an idea from dueling bandits [Zoghi et al., 2015], we define the soft-Copeland score as the average winning probability:
  $C(x) = \mathrm{Vol}(\mathcal{X})^{-1} \int_{\mathcal{X}} \pi_f([x, x'])\, dx'.$
◮ The optimum of $g(x)$ can then be estimated as the Condorcet winner:
  $x_c = \arg\max_{x \in \mathcal{X}} C(x).$
[Figure: the objective function with its global minimum, the preference function, and the Copeland and soft-Copeland functions.]
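A Monte Carlo sketch of the soft-Copeland score on a 1-D box [0, 1], assuming a surrogate such as the one above that exposes predicted winning probabilities; the function and variable names are illustrative.

```python
import numpy as np

def soft_copeland(predict_win_prob, x_grid, n_mc=256, rng=None):
    """Estimate C(x) = E_{x'}[ pi_f([x, x']) ] by averaging the predicted
    winning probability of x over Monte Carlo draws of the opponent x'."""
    rng = rng or np.random.default_rng(0)
    x_prime = rng.uniform(0.0, 1.0, size=n_mc)
    scores = np.empty(len(x_grid))
    for i, x in enumerate(x_grid):
        duels = np.column_stack([np.full(n_mc, x), x_prime])
        scores[i] = predict_win_prob(duels).mean()
    return scores

# e.g. with the scikit-learn surrogate from the previous sketch:
# x_grid = np.linspace(0.0, 1.0, 100)
# C = soft_copeland(lambda d: surrogate.predict_proba(d)[:, 1], x_grid)
# x_c = x_grid[np.argmax(C)]   # approximate Condorcet winner
```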
The current estimate of the minimum
◮ We only have a surrogate model of the preference function.
◮ Estimate the soft-Copeland score from the surrogate model and take its maximizer as an approximate Condorcet winner.
◮ Note that the approximate Condorcet winner may not be the optimum of $g(x)$.
Acquisition Function
◮ Existing acquisition functions are not applicable.
◮ They are designed to work with a surrogate model of the objective function.
◮ In PBO, the surrogate model does not directly represent the latent objective function.
◮ We need a new acquisition function for duels!
[Figure: expectation of y⋆ and σ(f⋆) under the fitted surrogate.]
Pure Exploration Acquisition Function (PBO-PE)
◮ The common purely explorative acquisition function, i.e. $\mathbb{V}[y]$, does not work.
◮ We propose a purely explorative acquisition function: the variance (uncertainty) of the "winning" probability of a duel,
  $\mathbb{V}[\sigma(f_\star)] = \int \big(\sigma(f_\star) - \mathbb{E}[\sigma(f_\star)]\big)^2 p(f_\star \mid \mathcal{D}, [x, x'])\, df_\star.$
[Figure: the variance of y⋆ versus the variance of σ(f⋆) over the duel space.]
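A sketch of evaluating this criterion by Monte Carlo, assuming the surrogate provides the posterior mean and standard deviation of the latent f at a duel; those inputs, and the names, are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pbo_pe_score(mu_f, std_f, n_samples=2000, rng=None):
    """Monte Carlo estimate of V[sigma(f*)] for one duel, given the latent
    GP posterior mean mu_f and standard deviation std_f at that duel."""
    rng = rng or np.random.default_rng(0)
    f_star = rng.normal(mu_f, std_f, size=n_samples)
    return sigmoid(f_star).var()

# the next duel is the pair [x, x'] that maximizes pbo_pe_score over candidates
print(pbo_pe_score(0.0, 1.0))   # high latent uncertainty -> large score
print(pbo_pe_score(0.0, 0.05))  # confident posterior -> score near zero
```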
Acquisition Function: PBO-DTS
To select the next duel $[x_{\text{next}}, x'_{\text{next}}]$ (see the sketch below):
1. Draw a sample of the preference function from the surrogate model.
2. Take the maximizer of the soft-Copeland score of that sample as $x_{\text{next}}$.
3. Take as $x'_{\text{next}}$ the point that maximizes the PBO-PE criterion for duels $[x_{\text{next}}, x']$.
[Figure: a sample of σ(f⋆), the corresponding sampled Copeland function, and the variance of σ(f⋆).]
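A rough sketch of one dueling-Thompson-sampling step, under the assumption that we can draw a joint posterior sample of σ(f) over a set of duels and evaluate the PBO-PE score; both callables and all names are placeholders, not the authors' implementation.

```python
import numpy as np

def dts_next_duel(sample_pref, pe_score, x_grid, n_mc=128, rng=None):
    """sample_pref(duels) -> one joint posterior sample of sigma(f) at `duels`;
    pe_score(duel)        -> V[sigma(f*)] for a single duel [x, x']."""
    rng = rng or np.random.default_rng(0)
    x_prime_mc = rng.uniform(0.0, 1.0, size=n_mc)

    # all (x, x') pairs: candidate grid for x times Monte Carlo draws of x'
    xx, xp = np.meshgrid(x_grid, x_prime_mc, indexing="ij")
    duels = np.column_stack([xx.ravel(), xp.ravel()])

    # 1) one posterior sample of sigma(f); averaging over x' gives the
    #    sampled soft-Copeland score, and its maximizer is x_next
    sample = sample_pref(duels).reshape(len(x_grid), n_mc)
    x_next = x_grid[int(np.argmax(sample.mean(axis=1)))]

    # 2) second element: maximize the pure-exploration score given x_next
    pe_vals = [pe_score(np.array([x_next, v])) for v in x_grid]
    return x_next, x_grid[int(np.argmax(pe_vals))]
```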
Experiment: Forrester Function
◮ Synthetic 1-D function: Forrester.
◮ Observations drawn with probability $\frac{1}{1 + e^{\,g(x) - g(x')}}$.
◮ $g(x_c)$ shows the value at the location that the algorithms believe is the minimum.
◮ Each curve is the average of 20 trials.
Compared methods: PBO-PE, PBO-DTS, PBO-CEI, RANDOM, IBO [Brochu, 2010], SPARRING [Ailon et al., 2014].
[Figure: $g(x_c)$ versus the number of iterations on the Forrester function.]
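For reference, a small sketch of the experimental oracle, assuming the standard Forrester test function; the noise model follows the observation probability above.

```python
import numpy as np

def forrester(x):
    # standard Forrester test function on [0, 1]
    return (6.0 * x - 2.0) ** 2 * np.sin(12.0 * x - 4.0)

def observe(x, x_prime, rng):
    """One preferential observation: y = 1 with probability
    1 / (1 + exp(g(x) - g(x')))."""
    p = 1.0 / (1.0 + np.exp(forrester(x) - forrester(x_prime)))
    return int(rng.random() < p)

rng = np.random.default_rng(0)
print(observe(0.75, 0.1, rng))  # x = 0.75 is near the minimizer, so usually 1
```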
Experiments: More (2D) Functions
[Figures: $g(x_c)$ versus the number of iterations on the Forrester, Six-Hump Camel, Goldstein and Levy functions, comparing PBO-PE, PBO-DTS, RANDOM and IBO (plus PBO-CEI and SPARRING on the Forrester panel).]
Summary ◮ Address Bayesian optimization with preferential returns. ◮ Propose to build a surrogate model for the preference function. ◮ Propose a few efficient acquisition functions. ◮ Show the performance on synthetic functions.
Questions?
Exploration & Exploitation The two ingredients in an acquisition function: Exploration & Exploitation.
Exploration in PBO
◮ We study exploration in PBO by designing a purely explorative acquisition function.
◮ Exploration in standard BO can be viewed as the action of reducing the uncertainty of a surrogate model.
◮ A purely explorative acquisition function:
  $\mathbb{V}[y_\star] = \int (y_\star - \mathbb{E}[y_\star])^2 p(y_\star \mid \mathcal{D}, x_\star)\, dy_\star$
◮ Can we extend this idea to PBO?
A Straightforward Choice
◮ A straightforward extension from standard BO:
  $\mathbb{V}[y_\star] = \sum_{y_\star \in \{0,1\}} (y_\star - \mathbb{E}[y_\star])^2 p(y_\star \mid \mathcal{D}, [x_\star, x'_\star]) = \mathbb{E}[y_\star](1 - \mathbb{E}[y_\star])$
◮ The maximum variance is always where $\mathbb{E}[y_\star] = 0.5$!
◮ The variance may not reduce with observations!
[Figure: expectation of y⋆ and σ(f⋆), and the variance of y⋆ over the duel space.]
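A small numerical illustration of this failure mode (the posterior widths below are made-up values): as the latent posterior concentrates around a mean of zero, V[y⋆] stays pinned near 0.25 while the PBO-PE score V[σ(f⋆)] shrinks as intended.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for post_std in [2.0, 1.0, 0.1]:              # latent posterior narrows as duels accumulate
    f_star = rng.normal(0.0, post_std, 100000)  # posterior mean 0 => E[y*] ~ 0.5 throughout
    p = sigmoid(f_star)
    v_y = p.mean() * (1.0 - p.mean())          # V[y*] = E[y*](1 - E[y*]): stays near 0.25
    v_sigma = p.var()                          # V[sigma(f*)]: shrinks with the posterior
    print(f"std={post_std:.1f}  V[y*]={v_y:.3f}  V[sigma(f*)]={v_sigma:.3f}")
```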
Dueling-Thompson Sampling (DTS)
◮ To balance exploration and exploitation, we borrow the idea of Thompson sampling and draw a sample from the surrogate model.
◮ Compute the soft-Copeland score of the drawn sample.
◮ The value $x_{\text{next}}$ that maximizes this sampled soft-Copeland score gives a good balance between exploration and exploitation.
◮ Take it as the first element of the next duel.
[Figure: 100 sampled Copeland functions after 10, 30, and 150 duels.]