
CSE 473: Artificial Intelligence, Reinforcement Learning. Dan Weld / University of Washington.


  1. CSE 473: Artificial Intelligence, Reinforcement Learning. Dan Weld / University of Washington. Image from https://towardsdatascience.com/reinforcement-learning-multi-arm-bandit-implementation-5399ef67b24b [Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]

     Reinforcement Learning
     § Still assume there is a Markov decision process (MDP):
       § A set of states s ∈ S
       § A set of actions (per state) A
       § A model T(s, a, s')
       § A reward function R(s, a, s') and a discount γ
     § Still looking for a policy π(s)
     § New twist: don't know T or R
       § I.e., we don't know which states are good or what the actions do
       § Must actually try out actions and states to learn

  2. Offline (MDPs) vs. Online (RL)
     § Offline solution (planning): know T, R
     § Monte Carlo planning: don't know T, R, but have a simulator for them (most people call this RL as well)
     § Online learning (RL): don't know T, R
     § Differences with MC planning: 1) dying is OK; 2) you have a (re)set button

     Reminder: Q-Value Iteration (for MDPs with known T, R)
     § Forall s, a: initialize Q_0(s, a) = 0 (no time steps left means an expected reward of zero)
     § k = 0
     § Repeat: do Bellman backups. For every (s, a) pair:
         Q_k+1(s, a) = Σ_s' T(s, a, s') [ R(s, a, s') + γ V_k(s') ],  where V_k(s') = max_a' Q_k(s', a')
       then k += 1
     § Until convergence, i.e., the Q values don't change much
     § Problem: what if we don't know T, R? We can no longer compute the expectation over s' exactly, but we can sample it by acting.
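To make the Bellman backup above concrete, here is a minimal Q-value iteration sketch in Python. The MDP representation (a `states` list, an `actions(s)` function, and dictionaries `T` and `R`) is an assumption chosen for illustration, not the course's code.

```python
from collections import defaultdict

def q_value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Q-value iteration for an MDP with known T and R.

    Assumed representation: T[(s, a)] is a dict {s_next: probability},
    R[(s, a, s_next)] is the reward for that transition.
    """
    Q = defaultdict(float)              # Q_0(s, a) = 0 for all s, a
    while True:
        Q_new = defaultdict(float)
        delta = 0.0
        for s in states:
            for a in actions(s):
                # Bellman backup: expectation over next states s'
                total = 0.0
                for s_next, prob in T[(s, a)].items():
                    # V_k(s') = max_a' Q_k(s', a')
                    v_next = max((Q[(s_next, a2)] for a2 in actions(s_next)),
                                 default=0.0)
                    total += prob * (R[(s, a, s_next)] + gamma * v_next)
                Q_new[(s, a)] = total
                delta = max(delta, abs(total - Q[(s, a)]))
        Q = Q_new
        if delta < tol:                 # until the Q values don't change much
            return Q
```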

  3. Reminder: Q-Learning (for reinforcement learning)
     § Forall s, a: initialize Q(s, a) = 0
     § Repeat forever:
       § Where are you? s
       § Choose some action a, e.g. using ε-greedy or by maximizing Q(s, a)
       § Execute it in the real world: (s, a, r, s')
       § Do update: difference ← [r + γ max_a' Q(s', a')] − Q(s, a)
         Q(s, a) ← Q(s, a) + β (difference)
     § Problem: we don't want to store a table of Q(-, -)

     Instead, approximate Q with a linear function of features: Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
     § Forall i: initialize w_i = 0
     § Repeat forever:
       § Where are you? s
       § Choose some action a
       § Execute it in the real world: (s, a, r, s')
       § Do update: difference ← [r + γ max_a' Q(s', a')] − Q(s, a)
         Q(s, a) ← Q(s, a) + β (difference)
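A minimal tabular Q-learning loop matching the update above, written as a sketch: the `env.reset()`/`env.step(a)` interface, the episode count, and the specific hyperparameter values are assumptions for illustration, and the learning rate is named `beta` to match the slide's β.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, beta=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning sketch.
    Assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)                       # Q(s, a) = 0 for all s, a
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose some action a, e.g. epsilon-greedy on the current Q
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a2: Q[(s, a2)])
            s_next, r, done = env.step(a)        # execute it in the (real) world
            # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
            # (no bootstrapping past terminal states)
            target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += beta * (target - Q[(s, a)])
            s = s_next
    return Q
```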

  4. Reminder: Approximate Q-Learning
     Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
     § Forall i: initialize w_i = 0
     § Repeat forever:
       § Where are you? s
       § Choose some action a (wait?! which one? how?)
       § Execute it in the real world: (s, a, r, s')
       § Do update: difference ← [r + γ max_a' Q(s', a')] − Q(s, a)
       § Forall i do: w_i ← w_i + β (difference) f_i(s, a)
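A sketch of the feature-based update, assuming a `features(s, a)` function that returns the list of f_i(s, a) values; that function and its interface are illustrative, not the project's API.

```python
def approx_q_update(w, features, s, a, r, s_next, actions, beta=0.1, gamma=0.9):
    """One approximate Q-learning step, where Q(s, a) = sum_i w_i * f_i(s, a)."""
    def Q(state, action):
        return sum(wi * fi for wi, fi in zip(w, features(state, action)))

    # difference = [r + gamma * max_a' Q(s', a')] - Q(s, a)
    difference = r + gamma * max(Q(s_next, a2) for a2 in actions) - Q(s, a)

    # Forall i: w_i <- w_i + beta * difference * f_i(s, a)
    f = features(s, a)
    for i in range(len(w)):
        w[i] += beta * difference * f[i]
    return w
```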

  5. Exploration vs. Exploitation

     Questions
     § How to explore?
       § Random exploration: uniform exploration, epsilon-greedy
       § Exploration functions (such as UCB)
       § Thompson sampling
     § When to exploit?
     § How to even think about this tradeoff?

  6. Video of Demo: Crawler Bot
     § More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

  7. Epsilon-Greedy
     § With (small) probability ε, act randomly
     § With (large) probability 1 − ε, act on the current policy
     § Maybe decrease ε over time

     Evaluation
     § Is epsilon-greedy good?
     § Could any method be better?
     § How should we even THINK about this question?
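A minimal ε-greedy sketch, assuming a Q-table keyed by (s, a); the decay schedule in `decayed_eps` is just one illustrative way to decrease ε over time, not a prescription from the slides.

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps act randomly; otherwise act on the current
    (greedy) policy with respect to Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_eps(t, eps0=1.0, min_eps=0.05, decay=0.999):
    """One way to 'decrease eps over time': exponential decay with a floor."""
    return max(min_eps, eps0 * (decay ** t))
```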

  8. Regret
     § Even if you learn the optimal policy, you still make mistakes along the way!
     § Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and the optimal (expected) rewards
     § Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal

     Two Kinds of Regret
     § Cumulative regret: goal is to achieve near-optimal cumulative lifetime reward (in expectation)
     § Simple regret: goal is to quickly identify a policy with high reward (in expectation)
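In symbols (a sketch in my own notation, not the slides'): let r_t be the reward actually earned at step t while learning, r*_t the reward the optimal policy would have earned at that step, and π_n the policy you would commit to after n steps of exploration. Then

\[
\text{CumulativeRegret}(n) = \mathbb{E}\Big[\textstyle\sum_{t=1}^{n} r^{*}_{t}\Big] - \mathbb{E}\Big[\textstyle\sum_{t=1}^{n} r_{t}\Big],
\qquad
\text{SimpleRegret}(n) = V^{\pi^{*}} - V^{\pi_{n}}.
\]

Minimizing cumulative regret cares about every reward collected while learning; minimizing simple regret only cares about how good the finally recommended policy is.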

  9. Regret, pictured
     § [Figure: reward vs. time, with the top curve showing the reward from choosing the optimal action each time. An exploration policy that minimizes cumulative regret minimizes the red area between that curve and its own reward curve over the whole (infinite) horizon.]
     § [Figure: the same plot, with "you are here" marking the present and a future time t after which performance matters. An exploration policy that minimizes simple regret explores now, given t, in order to minimize the red area after t.]

  10. Offline (MDPs) vs. Online (RL)
     § Monte Carlo planning: have a simulator, don't know T, R; minimize simple regret
     § Online learning (RL): don't know T, R; minimize cumulative regret

     RL on a Single-State MDP
     § Suppose the MDP has a single state s and k actions a_1, …, a_k
     § Can sample rewards of actions using calls to the simulator
     § Sampling action a is like pulling a slot-machine arm with random payoff function R(s, a)
     § This is the Multi-Armed Bandit problem
     [Slide adapted from Alan Fern (OSU)]
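A tiny simulator for this single-state setting, as a sketch; the Bernoulli reward model and the particular arm probabilities are illustrative assumptions standing in for the unknown payoff functions R(s, a_i).

```python
import random

class BernoulliBandit:
    """k-armed bandit: pulling arm i pays 1 with probability p[i], else 0.
    Stands in for the random payoff function R(s, a_i) on the slide."""
    def __init__(self, p):
        self.p = list(p)
        self.k = len(p)

    def pull(self, i):
        return 1.0 if random.random() < self.p[i] else 0.0

# Example: three arms with payoff probabilities unknown to the learner.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(1)
```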

  11. Multi-Armed Bandits
     § Bandit algorithms are not just useful as components for RL and Monte-Carlo planning
     § Pure bandit problems arise in many applications
     § Applicable whenever:
       § there is a set of independent options with unknown utilities
       § there is a cost for sampling options, or a limit on total samples
       § we want to find the best option, or maximize the utility of our samples
     [Slide adapted from Alan Fern (OSU)]

     Multi-Armed Bandits, Example 1: Clinical Trials
     § Arms = possible treatments
     § Arm pulls = application of a treatment to an individual
     § Rewards = outcome of the treatment
     § Objective = maximize cumulative reward = maximize benefit to the trial population (or find the best treatment quickly)
     [Slide adapted from Alan Fern (OSU)]

  12. Multi-Armed Bandits, Example 2: Online Advertising
     § Arms = different ads/ad-types for a web page
     § Arm pulls = displaying an ad upon a page access
     § Rewards = click-through
     § Objective = maximize cumulative reward = maximum clicks (or find the best ad quickly)

     Multi-Armed Bandit: Possible Objectives
     § PAC objective: find a near-optimal arm with high probability
     § Cumulative regret: achieve near-optimal cumulative reward over the lifetime of pulling (in expectation)
     § Simple regret: quickly identify an arm with high reward (in expectation)
     [Slide adapted from Alan Fern (OSU)]

  13. Cumulative Regret Objective
     § Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
     § Optimal (in expectation) is to pull the optimal arm n times
     § Pull arms uniformly? (UniformBandit)
     § UniformBandit is a poor choice: it wastes time on bad arms (a quick simulation sketch follows below)
     § Must balance exploring all arms to find good payoffs with exploiting current knowledge (pulling the best arm)
     [Slide adapted from Alan Fern (OSU)]
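To see why UniformBandit wastes pulls, here is a sketch comparing uniform pulling with a simple ε-greedy strategy. It assumes a bandit object with `.k` arms and a `.pull(i)` method, such as the BernoulliBandit sketched earlier; the horizon and ε values are arbitrary choices for illustration.

```python
import random

def run_uniform(bandit, n):
    """UniformBandit: pull arms uniformly at random for n steps."""
    return sum(bandit.pull(random.randrange(bandit.k)) for _ in range(n))

def run_eps_greedy(bandit, n, eps=0.1):
    """Balance exploration and exploitation: mostly pull the empirically best arm."""
    counts = [0] * bandit.k
    means = [0.0] * bandit.k
    total = 0.0
    for _ in range(n):
        if random.random() < eps:
            i = random.randrange(bandit.k)
        else:
            i = max(range(bandit.k), key=lambda j: means[j])
        r = bandit.pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # running average of arm i
        total += r
    return total

# With arms [0.2, 0.5, 0.7], eps-greedy typically collects noticeably more reward
# than uniform pulling over the same horizon.
```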

  14. Idea
     § The problem is uncertainty… how do we quantify it?
     § Error bars: if an arm has been sampled n times, then with probability at least 1 − ε,
         |μ̂ − μ| < √( log(2/ε) / (2n) )
       where μ̂ is the arm's empirical mean reward and μ is its true mean
     [Slide adapted from Travis Mandel (UW)]

     Given error bars, how do we act?
     [Slide adapted from Travis Mandel (UW)]
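A small sketch of that error bar, a Hoeffding-style bound that assumes rewards are bounded in [0, 1]; the function name and the example numbers are mine.

```python
import math

def hoeffding_radius(n, eps):
    """Half-width of the confidence interval after n samples of an arm:
    with probability at least 1 - eps, |empirical mean - true mean| < radius.
    Assumes rewards bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / eps) / (2.0 * n))

# Example: after 100 pulls with eps = 0.05, the radius is about 0.136.
radius = hoeffding_radius(100, 0.05)
```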

  15. Given error bars, how do we act?
     § Optimism under uncertainty!
     § Why? If the optimistic choice is actually bad, we will soon find out!
     [Slide adapted from Travis Mandel (UW)]

     One last wrinkle
     § How do we set the confidence parameter ε?
     § Decrease it over time (so ε → 0)
     § As before: if an arm has been sampled n times, then with probability at least 1 − ε,
         |μ̂ − μ| < √( log(2/ε) / (2n) )
     [Slide adapted from Travis Mandel (UW)]

  16. Upper Confidence Bound (UCB)
     1. Play each arm once
     2. Play the arm i that maximizes  μ̂_i + √( 2 log(t) / n_i ),
        where μ̂_i is arm i's empirical mean reward, n_i is the number of times arm i has been pulled, and t is the total number of pulls so far
     3. Repeat step 2 forever
     [Slide adapted from Travis Mandel (UW)]

     UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
     § Theorem: the expected cumulative regret of UCB after n arm pulls is bounded by O(log n)
     § Is this good? Yes: the average per-step regret is O(log(n) / n)
     § Theorem: no algorithm can achieve a better expected regret (up to constant factors)
     [Slide adapted from Alan Fern (OSU)]
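A self-contained UCB1 sketch following the three steps above; it expects a bandit object with `.k` arms and a `.pull(i)` method (for example, the BernoulliBandit sketched earlier), and the finite `horizon` parameter stands in for "repeat forever".

```python
import math

def ucb1(bandit, horizon):
    """UCB1: play each arm once, then repeatedly play the arm maximizing
    empirical_mean_i + sqrt(2 * log(t) / n_i)."""
    counts = [0] * bandit.k
    means = [0.0] * bandit.k

    def update(i, r):
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]

    # Step 1: play each arm once.
    for i in range(bandit.k):
        update(i, bandit.pull(i))

    # Steps 2-3: optimism under uncertainty, up to the horizon.
    for t in range(bandit.k + 1, horizon + 1):
        ucb = [means[i] + math.sqrt(2.0 * math.log(t) / counts[i])
               for i in range(bandit.k)]
        i = max(range(bandit.k), key=lambda j: ucb[j])
        update(i, bandit.pull(i))

    return means, counts
```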
