Experiment design
Bandit problems and Markov decision processes
Christos Dimitrakakis, UiO
November 13, 2019
Bandit problems
Planning: Heuristics and exact solutions
Bandit problems as MDPs
Contextual Bandits
Case study: experiment design for clinical trials
Practical approaches to experiment design
Reinforcement learning
Sequential problems: full observation
Example 1
▶ n meteorological stations {μ_i | i = 1, …, n}.
▶ The i-th station gives a rain probability x_{t,i} = P_{μ_i}(y_t | y_1, …, y_{t−1}).
▶ Observation x_t = (x_{t,1}, …, x_{t,n}): the predictions of all stations.
▶ Decision a_t: guess whether it will rain.
▶ Outcome y_t: rain or no rain.
▶ Steps t = 1, …, T.
Linear utility function
The reward function ρ(y_t, a_t) = I{y_t = a_t} simply rewards correct predictions, so the utility
U(y_1, …, y_T, a_1, …, a_T) = ∑_{t=1}^T ρ(y_t, a_t)
is the total number of correct predictions.
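A minimal Python sketch of this interaction loop and its utility. The station probabilities, the decision rule (guess rain when the average predicted probability exceeds 1/2), and the use of station 0 to generate the outcome are all illustrative assumptions, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 10, 3                           # horizon and number of stations (made up)

    total_reward = 0
    for t in range(T):
        x_t = rng.uniform(size=n)          # x_{t,i}: each station's rain probability
        a_t = int(x_t.mean() > 0.5)        # decision: guess "rain" if the average probability is high
        y_t = int(rng.uniform() < x_t[0])  # outcome: drawn here using station 0 as "truth"
        total_reward += int(y_t == a_t)    # ρ(y_t, a_t) = I{y_t = a_t}

    print("U =", total_reward, "correct predictions out of", T)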
The n meteorologists problem is simple, as:
▶ You always see their predictions, as well as the weather, no matter whether you bike or take the tram (full information).
▶ Your actions do not influence their predictions (independent events).
In the remainder, we will see two settings where decisions are made with either partial information or in a dynamical system. Both settings can be formalised as Markov decision processes.
Experimental design and Markov decision processes
The following problems
▶ Shortest path problems.
▶ Optimal stopping problems.
▶ Reinforcement learning problems.
▶ Experiment design (clinical trial) problems.
▶ Advertising.
can all be formalised as Markov decision processes.
Applications
▶ Robotics.
▶ Economics.
▶ Automatic control.
▶ Resource allocation.
Bandit problems
Bandit problems
[Figures: the function f(x) = sinc(x) as an optimisation target, an online advert, an ultrasound scan, and a robot scientist.]
Applications
▶ Efficient optimisation.
▶ Online advertising.
▶ Clinical trials.
▶ Robot scientist.
The stochastic n-armed bandit problem
Actions and rewards
▶ A set of actions A = {1, …, n}.
▶ Each action gives you a random reward with distribution P(r_t | a_t = i).
▶ The expected reward of the i-th arm is ρ_i ≜ E(r_t | a_t = i).
Interaction at time t
1. You choose an action a_t ∈ A.
2. You observe a random reward r_t drawn from the i-th arm.
The utility is the sum of the rewards obtained,
U ≜ ∑_t r_t.
We must maximise the expected utility without knowing the values ρ_i.
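A minimal simulation sketch of this interaction, assuming Bernoulli reward distributions with made-up means ρ_i. The uniformly random policy used here is only a placeholder; it reappears in Example 3 below.

    import numpy as np

    rng = np.random.default_rng(1)
    rho = np.array([0.2, 0.5, 0.8])           # unknown expected rewards ρ_i (made up)
    n, T = len(rho), 1000

    U = 0.0
    for t in range(T):
        a_t = rng.integers(n)                 # uniformly random policy: P(a_t = i) = 1/n
        r_t = float(rng.uniform() < rho[a_t]) # Bernoulli reward with mean ρ_{a_t}
        U += r_t                              # utility U = Σ_t r_t

    print("empirical utility:", U, "  (T/n) Σ_i ρ_i =", T / n * rho.sum())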
Policy
Definition 2 (Policies)
A policy π is an algorithm for taking actions given the observed history
h_t ≜ a_1, r_1, …, a_t, r_t.
P^π(a_{t+1} | h_t) is the probability of the next action a_{t+1}.
Exercise 1
Why should our action depend on the complete history?
A. The next reward depends on all the actions we have taken.
B. We don't know which arm gives the highest reward.
C. The next reward depends on all the previous rewards.
D. The next reward depends on the complete history.
E. No idea.
Policy
Example 3 (The expected utility of a uniformly random policy)
If P^π(a_{t+1} | ·) = 1/n for all t, then
E^π U = E^π ( ∑_{t=1}^T r_t ) = ∑_{t=1}^T E^π r_t = ∑_{t=1}^T ∑_{i=1}^n (1/n) ρ_i = (T/n) ∑_{i=1}^n ρ_i.
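As a concrete check of this formula, with made-up values (not from the lecture) matching the simulation sketch above, n = 3, ρ = (0.2, 0.5, 0.8) and T = 1000:
E^π U = (T/n) ∑_{i=1}^n ρ_i = (1000/3) × 1.5 = 500,
i.e. the uniformly random policy earns the average of the arm means per step.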
Policy
The expected utility of a general policy
E^π U = E^π ( ∑_{t=1}^T r_t ) = ∑_{t=1}^T E^π(r_t)    (1.1)
      = ∑_{t=1}^T ∑_{a_t ∈ A} ∑_{h_{t−1}} E(r_t | a_t) P^π(a_t | h_{t−1}) P^π(h_{t−1}).
A simple heuristic for the unknown reward case
Say you keep a running average of the reward obtained by each arm,
θ̂_{t,i} = R_{t,i} / n_{t,i},
▶ n_{t,i}: the number of times you have played arm i,
▶ R_{t,i}: the total reward received from arm i.
Whenever you play a_t = i:
R_{t+1,i} = R_{t,i} + r_t,   n_{t+1,i} = n_{t,i} + 1.
Greedy policy:
a_t = arg max_i θ̂_{t,i}.
What should the initial values n_{0,i}, R_{0,i} be?
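A minimal sketch of this greedy heuristic, assuming Bernoulli arms with made-up means. The optimistic initial values R_{0,i} = n_{0,i} = 1 are one common choice (and one possible answer to the question on the slide), not something prescribed by the lecture.

    import numpy as np

    rng = np.random.default_rng(2)
    rho = np.array([0.2, 0.5, 0.8])       # unknown arm means (made up)
    n_arms, T = len(rho), 1000

    # Optimistic initialisation: R_{0,i} = n_{0,i} = 1 gives θ̂_{0,i} = 1,
    # which forces some initial exploration of every arm.
    R = np.ones(n_arms)
    N = np.ones(n_arms)

    for t in range(T):
        a = int(np.argmax(R / N))         # greedy: a_t = argmax_i θ̂_{t,i}
        r = float(rng.uniform() < rho[a]) # Bernoulli reward
        R[a] += r                         # R_{t+1,i} = R_{t,i} + r_t
        N[a] += 1                         # n_{t+1,i} = n_{t,i} + 1

    print("estimates θ̂:", np.round(R / N, 2), "  pulls per arm:", (N - 1).astype(int))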
Bernoulli bandits
Decision-theoretic approach
▶ Assume r_t | a_t = i ∼ P_{θ_i}, with θ_i ∈ Θ.
▶ Define a prior belief ξ_1 on Θ.
▶ For each step t, find a policy π selecting action a_t | ξ_t ∼ π(a | ξ_t) to maximise
max_π E^π_{ξ_t}(U_t) = max_π ∑_{a_t} E^π_{ξ_t} ( ∑_{k=1}^{T−t} r_{t+k} | a_t ) π(a_t | ξ_t).
▶ Obtain reward r_t.
▶ Calculate the next belief ξ_{t+1} = ξ_t(· | a_t, r_t).
How can we implement this?
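The exact maximisation above requires planning over all future beliefs, which is expensive. A common practical approximation, not the exact decision-theoretic solution, is Thompson sampling: sample a parameter from the current belief and act greedily with respect to the sample. A minimal sketch under the Beta-Bernoulli model of the next slides, with made-up true parameters:

    import numpy as np

    rng = np.random.default_rng(3)
    theta = np.array([0.2, 0.5, 0.8])   # true (unknown) Bernoulli parameters, made up
    n_arms, T = len(theta), 1000

    # Belief ξ_t: an independent Beta(α_i, β_i) for each arm, starting from Beta(1, 1).
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)

    for t in range(T):
        sampled = rng.beta(alpha, beta)   # draw θ̃ ~ ξ_t (Thompson sampling)
        a = int(np.argmax(sampled))       # act greedily with respect to the sample
        r = int(rng.uniform() < theta[a]) # observe reward
        alpha[a] += r                     # ξ_{t+1} = ξ_t(· | a_t, r_t):
        beta[a] += 1 - r                  # conjugate Beta update

    print("posterior means:", np.round(alpha / (alpha + beta), 2))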
Bayesian inference on Bernoulli bandits
▶ Likelihood: P_θ(r_t = 1) = θ.
▶ Prior: ξ(θ) ∝ θ^{α−1} (1 − θ)^{β−1} (i.e. Beta(α, β)).
[Figure: Prior belief ξ about the mean reward θ.]
Bayesian inference on Bernoulli bandits
For a sequence r = r_1, …, r_n,
P_θ(r) ∝ θ^{#1(r)} (1 − θ)^{#0(r)}.
[Figure: Prior belief ξ about θ and likelihood of θ for 100 plays with 70 1s.]
Bayesian inference on Bernoulli bandits
Posterior: Beta(α + #1(r), β + #0(r)).
[Figure: Prior belief ξ(θ) about θ, likelihood of θ for the data r, and posterior belief ξ(θ | r).]
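A small numerical sketch of this conjugate update, using the data from the previous slide (100 plays with 70 ones) and an assumed prior Beta(2, 2); the actual hyperparameters behind the figures are not stated. It checks that the normalised product of prior and likelihood coincides (up to discretisation error) with the Beta posterior.

    import numpy as np
    from scipy.stats import beta

    alpha0, beta0 = 2.0, 2.0          # assumed prior hyperparameters (not given on the slide)
    ones, zeros = 70, 30              # #1(r) and #0(r) for 100 plays with 70 ones

    theta = np.linspace(0.01, 0.99, 99)
    prior = beta.pdf(theta, alpha0, beta0)
    likelihood = theta**ones * (1 - theta)**zeros
    posterior = beta.pdf(theta, alpha0 + ones, beta0 + zeros)   # conjugate update

    # Normalise prior × likelihood on the grid and compare with the Beta posterior.
    product = prior * likelihood
    product /= product.sum() * (theta[1] - theta[0])
    print("max abs difference:", np.abs(product - posterior).max())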