
Exploration/Exploitation in Multi-armed Bandits, Spring 2019, CMU



  1. Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Exploration/Exploitation in Multi-armed Bandits. Spring 2019, CMU 10-403. Katerina Fragkiadaki

  2. Used Materials • Disclaimer: Some of the material and slides for this lecture were borrowed from Russ Salakhutdinov, who in turn borrowed from Rich Sutton’s class and David Silver’s class on Reinforcement Learning.

  3. Supervised VS Reinforcement Learning • Supervised learning (instructive feedback): the expert directly suggests correct actions • Learning by interaction (evaluative feedback): the environment provides a signal indicating whether the actions the agent selects are good or bad, but not how far they are from the optimal actions! • Evaluative feedback depends on the agent’s current policy • Exploration: active search for good actions to execute

  4. Exploration vs. Exploitation Dilemma
  ‣ Online decision-making involves a fundamental choice:
    - Exploitation: Make the best decision given current information
    - Exploration: Gather more information
  ‣ The best long-term strategy may involve short-term sacrifices
  ‣ Gather enough information to make the best overall decisions

  5. Exploration vs. Exploitation Dilemma
  ‣ Restaurant Selection
    - Exploitation: Go to your favorite restaurant
    - Exploration: Try a new restaurant
  ‣ Oil Drilling
    - Exploitation: Drill at the best known location
    - Exploration: Drill at a new location
  ‣ Game Playing
    - Exploitation: Play the move you believe is best
    - Exploration: Play an experimental move

  6. Reinforcement learning
  [Figure: agent-environment loop. The agent in state S_t takes action A_t; the environment returns reward R_{t+1} and next state S_{t+1}.]
  Agent and environment interact at discrete time steps t = 0, 1, 2, 3, ...
  • Agent observes state at step t: S_t ∈ 𝒮
  • produces action at step t: A_t ∈ 𝒜(S_t)
  • gets resulting reward: R_{t+1} ∈ ℛ ⊂ ℝ
  • and resulting next state: S_{t+1} ∈ 𝒮⁺
  giving the trajectory S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, ...

  7. This lecture: A closer look at balancing exploration and exploitation in a simplified RL setup

  8. Multi-Armed Bandits
  [Same agent-environment diagram, but the state never changes: S_{t+1} = S_t.]
  Agent and environment interact at discrete time steps t = 0, 1, 2, 3, ...
  • Agent observes state at step t: S_t ∈ 𝒮
  • produces action at step t: A_t ∈ 𝒜(S_t)
  • gets resulting reward: R_{t+1} ∈ ℛ
  • and resulting next state: S_{t+1} ∈ 𝒮
  The state does not change, so the trajectory reduces to A_t, R_{t+1}, A_{t+1}, R_{t+2}, A_{t+2}, R_{t+3}, A_{t+3}, ...

  9. Multi-Armed Bandits One-armed bandit = Slot machine (English slang) source: infoslotmachine.com

  10. Multi-Armed Bandits • Multi-armed bandit = multiple slot machines source: Microsoft Research

  11. Multi-Armed Bandit Problem
  • At each timestep t the agent chooses one of the K arms and plays it.
  • The i-th arm produces reward r_{i,t} when played at timestep t.
  • The rewards r_{i,t} are drawn from a probability distribution 𝒬_i with mean μ_i.
  • The agent knows neither the arm reward distributions nor their means.
  • Alternative notation for the mean arm rewards: q*(a) ≐ E[R_t | A_t = a], ∀a ∈ {1, ..., K}
  • Agent’s objective: maximize cumulative reward; in other words, find the arm with the highest mean reward.
  (source: Pandey et al.’s slide)
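To make the setup concrete, here is a minimal Python sketch of a K-armed bandit; the `Bandit` class name, the Gaussian reward distributions, and the fixed seed are assumptions for illustration, not part of the problem definition.

    # A minimal K-armed bandit: each arm i has an unknown mean mu_i, and
    # pulling it returns a noisy reward drawn from that arm's distribution.
    import numpy as np

    class Bandit:
        """Hypothetical helper class: K arms with hidden mean rewards mu_i."""
        def __init__(self, means, seed=0):
            self.means = np.asarray(means, dtype=float)   # mu_i, hidden from the agent
            self.rng = np.random.default_rng(seed)

        def pull(self, arm):
            # r_{i,t} ~ Q_i; a Gaussian with mean mu_i is an assumption for this sketch
            return self.rng.normal(self.means[arm], 1.0)

    bandit = Bandit([0.1, 0.5, 0.3])
    # The agent's goal: maximize cumulative reward, i.e. discover and play the best arm.
    oracle_avg = np.mean([bandit.pull(int(np.argmax(bandit.means))) for _ in range(1000)])
    print(oracle_avg)   # close to the best mean, 0.5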

  12. Example: Bernoulli Bandits
  Recall: the Bernoulli distribution is the discrete probability distribution of a random variable which takes the value 1 with probability p and the value 0 with probability q = 1 - p, that is, the probability distribution of any single experiment that asks a yes-no question.
  • Each action (arm, when played) results in success or failure. Rewards are binary!
  • The mean reward of each arm is its probability of success.
  • Action (arm) k ∈ {1, …, K} produces a success with probability θ_k ∈ [0, 1].
  [Figure: three slot machines that pay out ("win") 60%, 40%, and 45% of the time, respectively. source: Pandey et al.’s slide]
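A sketch of the Bernoulli case, using the success probabilities from the figure (0.6, 0.4, 0.45) as example values; the helper `pull` is hypothetical.

    # Bernoulli bandit: arm k pays reward 1 with probability theta_k, else 0.
    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.array([0.6, 0.4, 0.45])          # hidden success probabilities

    def pull(k):
        return int(rng.random() < theta[k])     # 1 = success, 0 = failure

    # The empirical mean of each arm approaches theta_k as it is played more often.
    estimates = [np.mean([pull(k) for _ in range(10000)]) for k in range(3)]
    print(estimates)                            # roughly [0.6, 0.4, 0.45]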

  13. Example: Gaussian Bandits
  One bandit task from the 10-armed Testbed: rewards R_t ∼ N(q*(a), 1).
  [Figure: reward distribution of each of the 10 actions; each action a has a Gaussian reward distribution with mean q*(a) (shown as q*(1), ..., q*(10)) and unit variance. Axes: Action (1-10) vs. Reward distribution (-4 to 4).]
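A sketch of such a testbed task: the true values q*(a) are drawn once per task (from N(0, 1), as in Sutton and Barto's 10-armed testbed, an assumption not stated on this slide), and each reward is R_t ∼ N(q*(a), 1).

    # 10-armed Gaussian testbed: q*(a) drawn once per task, rewards are N(q*(a), 1).
    import numpy as np

    rng = np.random.default_rng(0)
    q_star = rng.normal(0.0, 1.0, size=10)      # true action values for one bandit task

    def reward(a):
        return rng.normal(q_star[a], 1.0)       # R_t ~ N(q*(a), 1)

    samples = np.array([[reward(a) for _ in range(2000)] for a in range(10)])
    print(np.round(samples.mean(axis=1), 2))    # sample means approximately recover q*(a)
    print(np.round(q_star, 2))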

  14. Real world motivation: A/B testing • Two-armed bandit: each arm corresponds to an image variation shown to users (not necessarily the same user) • Mean rewards: the percentage of users that would click on each variation of the invitation

  15. Real world motivation: NETFLIX artwork For a particular movie, we want to decide which image to show (to all NETFLIX users) • Actions: uploading one of the K images to a user’s home screen • Ground-truth mean rewards (unknown): the % of NETFLIX users that will click on the title and watch the movie • Estimated mean rewards: the average click rate observed (quality engagement, not clickbait) [Figure: Netflix artwork variations]

  16. The Exploration/Exploitation Dilemma • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a

  17. The Exploration/Exploitation Dilemma • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a • Define the greedy action at time t as A*_t ≐ argmax_a Q_t(a)

  18. The Exploration/Exploitation Dilemma • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a • Define the greedy action at time t as A*_t ≐ argmax_a Q_t(a) • If A_t = A*_t then you are exploiting; if A_t ≠ A*_t then you are exploring

  19. The Exploration/Exploitation Dilemma • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a • Define the greedy action at time t as A*_t ≐ argmax_a Q_t(a) • If A_t = A*_t then you are exploiting; if A_t ≠ A*_t then you are exploring • You can’t do both, but you need to do both

  20. The Exploration/Exploitation Dilemma • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a • Define the greedy action at time t as A*_t ≐ argmax_a Q_t(a) • If A_t = A*_t then you are exploiting; if A_t ≠ A*_t then you are exploring • You can’t do both, but you need to do both • You can never stop exploring, but maybe you should explore less with time.
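One common way to "do both" is to act greedily most of the time and explore at random otherwise, shrinking the exploration rate over time. The sketch below illustrates that idea with ε-greedy selection and ε = 1/t; this specific scheme is an assumption for illustration, not something this slide prescribes.

    # Greedy action A*_t = argmax_a Q_t(a); with probability epsilon, explore instead.
    import numpy as np

    rng = np.random.default_rng(0)

    def select_action(Q, epsilon):
        if rng.random() < epsilon:
            return int(rng.integers(len(Q)))    # explore: pick a random arm
        return int(np.argmax(Q))                # exploit: pick the greedy arm A*_t

    Q = np.array([0.2, 0.8, 0.5])
    for t in range(1, 6):
        epsilon = 1.0 / t                       # "explore less with time"
        print(t, select_action(Q, epsilon))

With a decaying ε the agent never stops exploring entirely, but exploration becomes rarer as the estimates improve.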

  21. Regret
  ‣ The action-value is the mean reward (expected return) for action a: q*(a) ≐ E[R_t | A_t = a], ∀a ∈ {1, ..., K}
  ‣ The optimal value is v* = q*(a*) = max_{a ∈ 𝒜} q*(a)
  ‣ The regret is the opportunity loss for one step: I_t = E[v* − q*(a_t)]  (reward = − regret)
  ‣ The total regret is the total opportunity loss: L_t = E[ ∑_{τ=1}^{t} (v* − q*(a_τ)) ]
  ‣ Maximize cumulative reward = minimize total regret

  22. Regret
  ‣ The count N_t(a): the number of times that action a has been selected prior to time t
  ‣ The gap Δ_a is the difference in value between action a and the optimal action a*: Δ_a = v* − q*(a)
  ‣ Regret is a function of the gaps and the counts:
    L_t = E[ ∑_{τ=1}^{t} (v* − q*(a_τ)) ] = ∑_{a ∈ 𝒜} E[N_t(a)] (v* − q*(a)) = ∑_{a ∈ 𝒜} E[N_t(a)] Δ_a
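A small numerical check of this decomposition, with made-up true values and a fixed action sequence (so the expectations are trivial): summing per-step regret over the trajectory matches summing N_t(a)·Δ_a over arms.

    # Total regret two ways: per-step sum vs. gap-times-count decomposition.
    import numpy as np

    q_star = np.array([0.1, 0.5, 0.3])            # true action values (known here for checking)
    v_star = q_star.max()
    gaps = v_star - q_star                        # Delta_a

    actions = np.array([0, 1, 2, 1, 1, 2, 0, 1])  # a_1, ..., a_t (fixed, so expectations drop out)
    per_step = np.sum(v_star - q_star[actions])   # sum_tau (v* - q*(a_tau))
    counts = np.bincount(actions, minlength=len(q_star))  # N_t(a)
    by_gaps = np.sum(counts * gaps)               # sum_a N_t(a) * Delta_a

    print(per_step, by_gaps)                      # identical: 1.2 and 1.2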

  23. Forming Action-Value Estimates
  • Estimate action values as sample averages:
    Q_t(a) ≐ (sum of rewards when a taken prior to t) / (number of times a taken prior to t) = ∑_{i=1}^{t−1} R_i · 1_{A_i = a} / ∑_{i=1}^{t−1} 1_{A_i = a}
  • The sample-average estimates converge to the true values if the action is taken an infinite number of times:
    lim_{N_t(a) → ∞} Q_t(a) = q*(a), where N_t(a) is the number of times action a has been taken by time t
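A sketch of the sample-average estimate computed from a logged action/reward history via the indicator sums in the formula; the small history is made-up example data.

    # Sample-average action values: Q_t(a) = (sum of rewards when a was taken) / (count of a).
    import numpy as np

    A = np.array([0, 1, 0, 2, 1, 1])              # A_1, ..., A_{t-1}
    R = np.array([1.0, 0.0, 0.5, 2.0, 1.0, 1.0])  # R_1, ..., R_{t-1}

    def Q(a):
        taken = (A == a)                          # indicator 1_{A_i = a}
        if taken.sum() == 0:
            return 0.0                            # default for never-tried actions
        return R[taken].sum() / taken.sum()

    print([Q(a) for a in range(3)])               # [0.75, 0.666..., 2.0]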

  24. Forming Action-Value Estimates
  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after the first n − 1 rewards:
    Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)

  25. Forming Action-Value Estimates
  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after the first n − 1 rewards:
    Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently:
    Q_{n+1} = Q_n + (1/n) [R_n − Q_n]

  26. Forming Action-Value Estimates
  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after the first n − 1 rewards:
    Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently:
    Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
  • This is a standard form for learning/update rules:
    NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

  27. Forming Action-Value Estimates
  • To simplify notation, let us focus on one action
  • We consider only its rewards, and its estimate Q_n after the first n − 1 rewards:
    Q_n ≐ (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
  • How can we do this incrementally (without storing all the rewards)?
  • Could store a running sum and count (and divide), or equivalently:
    Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
  • This is a standard form for learning/update rules:
    NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
    where [Target − OldEstimate] is the error in the estimate.
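A sketch of the incremental update for a single action, checking that it reproduces the plain sample average; the reward stream is arbitrary example data.

    # Incremental sample average: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    rewards = [1.0, 0.0, 2.0, 1.0, 3.0]        # R_1, ..., R_5 for one action
    Q = 0.0
    for n, R_n in enumerate(rewards, start=1):
        Q = Q + (1.0 / n) * (R_n - Q)          # NewEstimate = Old + StepSize * (Target - Old)

    print(Q, sum(rewards) / len(rewards))      # both 1.4: incremental result == sample average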
