Monte Carlo Approaches to Reinforcement Learning
Robert Platt (w/ Marcus Gualtieri’s edits)
Northeastern University
Model Free Reinforcement Learning

[Figure: agent–world interaction loop. The agent sends a joystick command to the world; the world returns the observed screen pixels and a reward equal to the game score.]

Goal: learn a value function through trial-and-error experience.

Recall: the value of a state when acting according to policy $\pi$ is
$$v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,].$$

How?

Simplest solution: average all outcomes from previous experiences in a given state – this is called a Monte Carlo method.
Running Example: Blackjack
– State: sum of cards in the agent’s hand, the dealer’s showing card, and whether the agent has a usable ace
– Actions: hit, stick
– Objective: have the agent’s card sum be greater than the dealer’s without exceeding 21
– Reward: +1 for winning, 0 for a draw, -1 for losing
– Discounting: none ($\gamma = 1$); episodes are short and always terminate
– Dealer policy: draw until the sum is at least 17
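A minimal sketch of interacting with this environment in code, assuming the Gymnasium "Blackjack-v1" implementation (its observation tuple of player sum, dealer showing card, and usable ace matches the state described above); the fixed policy here, "hit on everything below 20," is the one evaluated in the example that follows.

```python
# Sketch: roll out one Blackjack episode under a fixed policy.
# Assumes Gymnasium's "Blackjack-v1" environment, whose observation is
# (player_sum, dealer_showing_card, usable_ace) and whose actions are
# 0 = stick, 1 = hit.
import gymnasium as gym

env = gym.make("Blackjack-v1")

def policy(state):
    """Hit on everything except 20 and 21."""
    player_sum, dealer_card, usable_ace = state
    return 0 if player_sum >= 20 else 1   # 0 = stick, 1 = hit

state, _ = env.reset()
episode = []                               # (state, action, reward) tuples
done = False
while not done:
    action = policy(state)
    next_state, reward, terminated, truncated, _ = env.step(action)
    episode.append((state, action, reward))
    state = next_state
    done = terminated or truncated

print(episode)                             # e.g. [((19, 10, False), 1, -1.0)]
```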
Running Example: Blackjack

Blackjack “Basic Strategy” is a set of rules for play that maximizes expected return
– well known in the gambling community
– how might an RL agent learn the Basic Strategy?
Monte Carlo Policy Evaluation: Example

[Figure: the dealer’s showing card and the agent’s hand. The agent’s sum is 19 and the dealer shows a 10; the agent hits and busts at 22 (reward = -1).]

State is (agent sum, dealer’s card, usable ace?).

State         Action   Next State    Reward
19, 10, no    HIT      22, 10, no    -1

Upon episode termination, make the following value function updates: the (undiscounted) return following the only visited state is $G = -1$, so $V(19, 10, \text{no})$ is updated toward $-1$.
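As a reference for what “update” means here, a sketch of the first-visit Monte Carlo update (an incremental average of the returns observed after each first visit to a state), applied to this episode:

```latex
% Generic first-visit MC update, then applied to episode 1 (return G = -1):
N(s) \leftarrow N(s) + 1, \qquad
V(s) \leftarrow V(s) + \frac{1}{N(s)}\bigl(G - V(s)\bigr)
\quad\Longrightarrow\quad
V(19, 10, \text{no}) = -1 \ \text{after its first observed return.}
```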
Monte Carlo Policy Evaluation: Example Next episode...
Monte Carlo Policy Evaluation: Example

[Figure: the dealer’s showing card and the agent’s hand as the agent hits from a sum of 13 up to 21.]

State         Action   Next State    Reward
13, 10, no    HIT      16, 10, no    0
16, 10, no    HIT      19, 10, no    0
19, 10, no    HIT      21, 10, no    +1

Upon episode termination, make the following value function updates: every state visited in this episode has (undiscounted) return $G = +1$, so a return of $+1$ is averaged into $V(13, 10, \text{no})$, $V(16, 10, \text{no})$, and $V(19, 10, \text{no})$.
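Averaging the returns observed so far across both episodes gives (simple arithmetic on the returns above):

```latex
V(13, 10, \text{no}) = \frac{+1}{1} = +1, \qquad
V(16, 10, \text{no}) = \frac{+1}{1} = +1, \qquad
V(19, 10, \text{no}) = \frac{(-1) + (+1)}{2} = 0
```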
Monte Carlo Policy Evaluation: Example

[Figure: value function learned for the “hit on everything except 20 and 21” policy.]
Monte Carlo Policy Evaluation

Given a policy $\pi$, estimate the value function $v_\pi(s)$ for all states $s \in \mathcal{S}$.

Monte Carlo Policy Evaluation (first visit): for each state $s$, average the returns observed following the first visit to $s$ in each episode.
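A minimal sketch of first-visit Monte Carlo policy evaluation, assuming a hypothetical `generate_episode(policy)` helper that rolls out one complete episode and returns a list of (state, action, reward) tuples (any environment loop like the Blackjack one above would do):

```python
# First-visit Monte Carlo policy evaluation (prediction), a sketch.
from collections import defaultdict

def mc_first_visit_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)    # sum of first-visit returns per state
    returns_count = defaultdict(int)    # number of first-visit returns per state
    V = defaultdict(float)              # current value estimates

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, a, r) in episode]
        G = 0.0
        # Work backwards through the episode, accumulating the return G.
        for t in reversed(range(len(episode))):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # First-visit check: only record G if this is the first
            # occurrence of the state in the episode.
            if state not in states[:t]:
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```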
Monte Carlo Policy Evaluation

[Figure: rollouts branching out from all states.]

To get an accurate estimate of the value function, every state has to be visited many times.
Think-pair-share: frozenlake env

      0123
    0 SFFF
    1 FHFH
    2 FFFH
    3 HFFG

States: grid world coordinates
Actions: L, R, U, D
Reward: 0 except at G, where r = 1

Given: three episodes as shown
Calculate: values of the states on the top row as calculated by MC
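If you want to check your hand calculation in code, the grid above matches the default 4x4 map in Gymnasium's FrozenLake-v1 (an assumption about that library; there, actions are encoded 0 = left, 1 = down, 2 = right, 3 = up, and the state is the integer row*4 + col):

```python
# Sketch: collect episodes on the 4x4 FrozenLake map, assuming Gymnasium.
# is_slippery=False makes transitions deterministic, matching hand calculation.
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)

state, _ = env.reset()
episode = []
done = False
while not done:
    action = env.action_space.sample()        # replace with the given policy
    next_state, reward, terminated, truncated, _ = env.step(action)
    episode.append((state, action, reward))   # feed these into the MC code above
    state = next_state
    done = terminated or truncated
```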
Monte Carlo Control

So far, we’re only talking about policy evaluation...
...but RL requires us to find a policy, not just evaluate it. How?

Key idea: evaluate/improve the policy iteratively
– estimate $q_\pi(s, a)$ via rollouts
– improve the policy by acting greedily with respect to the current estimate
Monte Carlo Control

Monte Carlo, Exploring Starts:

[Algorithm box: Monte Carlo ES (exploring starts), alternating episode generation, first-visit Q updates, and greedy policy improvement; a code sketch follows below.]

Exploring starts:
– each episode starts with a random action taken from a random state

Notice there is only one step of policy evaluation per iteration – that’s okay:
– each evaluation iteration moves the value function toward its optimal value, which is good enough to improve the policy.
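A minimal sketch of Monte Carlo control with exploring starts. The `generate_episode_es` helper is hypothetical: it is assumed to start from a random state with a random first action, then follow the current greedy policy (picking an arbitrary action in states the policy has not seen yet).

```python
# Monte Carlo control with Exploring Starts (MC-ES), a sketch.
from collections import defaultdict

def mc_exploring_starts(actions, generate_episode_es, num_episodes, gamma=1.0):
    Q = defaultdict(float)          # action-value estimates, keyed by (state, action)
    returns_count = defaultdict(int)
    policy = {}                     # greedy policy: state -> action

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)   # [(state, action, reward), ...]
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            if (state, action) not in pairs[:t]:        # first visit
                returns_count[(state, action)] += 1
                n = returns_count[(state, action)]
                # Incremental average of first-visit returns (policy evaluation).
                Q[(state, action)] += (G - Q[(state, action)]) / n
                # One step of policy improvement: act greedily at this state.
                policy[state] = max(actions, key=lambda a: Q[(state, a)])
    return Q, policy
```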
Monte Carlo Control

[Figure: the official “basic strategy” compared with the policy the MC agent learned.]
Monte Carlo Control: Convergence

If $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, then $v_{\pi'}(s) \ge v_\pi(s)$ for all $s \in \mathcal{S}$, i.e. $\pi'$ is at least as good as $\pi$.
Policy Improvement Theorem: Proof (Sketch)
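A sketch of the standard argument: repeatedly expand $q_\pi$ one step and apply the assumption $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ until only rewards accumulated under $\pi'$ remain.

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
         &= \mathbb{E}\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = \pi'(s) \right] \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \right] \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \right] \\
         &\;\;\vdots \\
         &\le \mathbb{E}_{\pi'}\!\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right] \\
         &= v_{\pi'}(s).
\end{aligned}
```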
E-Greedy Exploration

Without exploring starts, we are not guaranteed to explore the state/action space.
– why is this a problem?
– what happens if we never experience certain transitions?

Can we accomplish this exploration without exploring starts?
Yes: create a stochastic (e-greedy) policy.
E-Greedy Exploration

Greedy policy:
$$\pi(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)} Q(s, a)$$

E-greedy policy: with probability $1 - \epsilon$ take the greedy action above; with probability $\epsilon$ take an action drawn uniformly from $\mathcal{A}(s)$.

– Guarantees every state/action pair will be visited infinitely often (in the limit of infinitely many episodes).
– Notice that this is a stochastic policy (not deterministic).
– This is an example of a soft policy.
– Soft policy: all actions in all states have non-zero probability.
E-Greedy Exploration

Monte Carlo control with ε-greedy exploration:

[Algorithm box: on-policy Monte Carlo control in which the learned policy is kept ε-greedy rather than greedy; a code sketch follows below.]
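A minimal sketch of on-policy first-visit Monte Carlo control with ε-greedy exploration, again assuming a hypothetical `generate_episode(policy_fn)` helper that rolls out one episode by sampling actions from the supplied stochastic policy.

```python
# On-policy first-visit Monte Carlo control with e-greedy exploration, a sketch.
from collections import defaultdict
import random

def mc_control_epsilon_greedy(actions, generate_episode, num_episodes,
                              gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)          # action-value estimates, keyed by (state, action)
    returns_count = defaultdict(int)

    def policy_fn(state):
        # e-greedy: explore with probability epsilon, otherwise act greedily.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = generate_episode(policy_fn)   # [(state, action, reward), ...]
        pairs = [(s, a) for (s, a, r) in episode]
        G = 0.0
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            G = gamma * G + reward
            if (state, action) not in pairs[:t]:        # first visit
                returns_count[(state, action)] += 1
                n = returns_count[(state, action)]
                Q[(state, action)] += (G - Q[(state, action)]) / n
    return Q, policy_fn
```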
Off-Policy Methods
● On-policy methods evaluate or improve the policy that is used to make decisions.
● Off-policy methods evaluate or improve a policy different from the one used to generate the data.
● The target policy is the policy (π) we wish to evaluate/improve.
● The behavior policy is the policy (b) used to generate experience.
● Coverage: we require $\pi(a \mid s) > 0 \Rightarrow b(a \mid s) > 0$, i.e. the behavior policy must have non-zero probability of taking every action the target policy might take.
MC Summary
MC methods estimate the value function by doing rollouts.
Can estimate either the state value function, $v_\pi(s)$, or the action value function, $q_\pi(s, a)$.
MC Control alternates between policy evaluation and policy improvement.
E-greedy exploration explores all possible actions while preferring greedy actions.
Off-policy methods update a policy other than the one used to generate experience.