Lecture 11: Fast Reinforcement Learning
Emma Brunskill
CS234 Reinforcement Learning
Winter 2020
With many slides from or derived from David Silver
Refresh Your Knowledge: Policy Gradient

Policy gradient algorithms change the policy parameters using gradient descent on the mean squared Bellman error
1. True
2. False
3. Not sure

Select all that are true:
1. In tabular MDPs the number of deterministic policies is smaller than the number of possible value functions
2. Policy gradient algorithms are very robust to choices of step size
3. Baselines are functions of state and actions and do not change the bias of the value function
4. Not sure
Class Structure

Last time: Midterm
This time: Fast Learning
Next time: Fast Learning
Up Till Now

Discussed optimization, generalization, and delayed consequences
Teach Computers to Help Us
Computational Efficiency and Sample Efficiency

Computational Efficiency
Sample Efficiency
Algorithms Seen So Far

How many steps did it take for DQN to learn a good policy for Pong?
Evaluation Criteria

How do we evaluate how "good" an algorithm is?
Does it converge?
Does it converge to the optimal policy?
How quickly does it reach the optimal policy?
How many mistakes does it make along the way?
We will introduce different measures to evaluate RL algorithms
Settings, Frameworks & Approaches

Over the next couple of lectures we will consider 2 settings, multiple frameworks, and multiple approaches
Settings: bandits (single decisions), MDPs
Frameworks: evaluation criteria for formally assessing the quality of an RL algorithm
Approaches: classes of algorithms for achieving particular evaluation criteria in a certain setting
Note: we will see that some approaches can achieve multiple frameworks in multiple settings
Today

Setting: Introduction to multi-armed bandits
Framework: Regret
Approach: Optimism under uncertainty
Framework: Bayesian regret
Approach: Probability matching / Thompson sampling
Multi-Armed Bandits

A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$
$\mathcal{A}$: known set of $m$ actions (arms)
$\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
Goal: maximize the cumulative reward $\sum_{\tau=1}^{t} r_\tau$
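To make the setting concrete, below is a minimal sketch of a bandit environment with Bernoulli arms (matching the broken-toe example later in this lecture). The class name and interface are illustrative assumptions, not lecture code.

```python
import numpy as np

class BernoulliBandit:
    """Minimal multi-armed bandit with Bernoulli reward distributions
    (illustrative sketch; names and interface are assumptions)."""

    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas, dtype=float)  # true parameters, unknown to the agent
        self.m = len(self.thetas)                      # number of arms |A|
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        """Take action a; return reward r ~ R^a (here Bernoulli(theta_a))."""
        return float(self.rng.random() < self.thetas[a])

# Example: the broken-toe arms used later in the lecture
# env = BernoulliBandit([0.95, 0.9, 0.1]); r = env.pull(0)
```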
Regret

The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
The optimal value is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
Regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$
Maximizing cumulative reward $\iff$ minimizing total regret
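As a quick worked illustration (with made-up arm values, not from the lecture): take two arms with $Q(a_1) = 0.9$ and $Q(a_2) = 0.5$, so $V^* = 0.9$.

```latex
% Hypothetical two-armed bandit: Q(a_1) = 0.9, Q(a_2) = 0.5, so V^* = 0.9
l_t = \mathbb{E}[V^* - Q(a_t)] =
\begin{cases}
0   & \text{if } a_t = a_1 \\
0.4 & \text{if } a_t = a_2
\end{cases}
```

If $a_2$ is pulled 10 times in the first 20 steps, the total regret is $L_{20} = 10 \times 0.4 = 4$.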
Evaluating Regret

The count $N_t(a)$ is the number of selections of action $a$
The gap $\Delta_a$ is the difference in value between action $a$ and the optimal action $a^*$: $\Delta_a = V^* - Q(a)$
Regret is a function of gaps and counts:
$$L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\left(V^* - Q(a)\right) = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\,\Delta_a$$
A good algorithm ensures small counts for large gaps, but the gaps are not known
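A small sketch of how this gap/count decomposition can be evaluated when the true arm values are known, e.g. in a simulation (illustrative names, not lecture code):

```python
import numpy as np

def total_regret(true_Q, counts):
    """L_t = sum_a N_t(a) * Delta_a, where Delta_a = V* - Q(a).
    true_Q: true action values; counts: (expected) pull counts N_t(a)."""
    true_Q = np.asarray(true_Q, dtype=float)
    counts = np.asarray(counts, dtype=float)
    gaps = true_Q.max() - true_Q        # Delta_a for each arm
    return float((counts * gaps).sum())

# With the toy arms below (0.95, 0.9, 0.1) and counts [10, 5, 1]:
# total_regret([0.95, 0.9, 0.1], [10, 5, 1]) == 5*0.05 + 1*0.85 == 1.1
```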
Greedy Algorithm

We consider algorithms that estimate $\hat{Q}_t(a) \approx Q(a)$
Estimate the value of each action by Monte-Carlo evaluation:
$$\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \, \mathbf{1}(a_\tau = a)$$
The greedy algorithm selects the action with the highest estimated value:
$$a_t^* = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$$
Greedy can lock onto a suboptimal action, forever
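A minimal greedy bandit loop, assuming the BernoulliBandit sketch above; the estimates are Monte-Carlo averages maintained incrementally, and ties are broken uniformly at random (all names are illustrative).

```python
import numpy as np

def run_greedy(env, num_steps, seed=0):
    """Greedy bandit: always pull an arm with the highest current estimate Q_hat."""
    rng = np.random.default_rng(seed)
    Q_hat = np.zeros(env.m)   # value estimates Q_hat_t(a)
    N = np.zeros(env.m)       # pull counts N_t(a)
    for _ in range(num_steps):
        best = np.flatnonzero(Q_hat == Q_hat.max())
        a = rng.choice(best)                 # break ties uniformly
        r = env.pull(a)
        N[a] += 1
        Q_hat[a] += (r - Q_hat[a]) / N[a]    # incremental Monte-Carlo average
    return Q_hat, N
```

Because estimates are never revised for arms that are no longer pulled, a single unlucky draw can make greedy lock onto a suboptimal arm, as the slide notes.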
ε-Greedy Algorithm

The ε-greedy algorithm proceeds as follows:
With probability $1 - \epsilon$ select $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
With probability $\epsilon$ select a random action
It will always be making a sub-optimal decision an $\epsilon$ fraction of the time
We already used this in prior homeworks
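The same loop with ε-greedy action selection; only the action-selection step changes (a sketch under the same assumptions as above).

```python
import numpy as np

def run_epsilon_greedy(env, num_steps, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: explore uniformly with probability epsilon, otherwise act greedily."""
    rng = np.random.default_rng(seed)
    Q_hat = np.zeros(env.m)
    N = np.zeros(env.m)
    for _ in range(num_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(env.m))     # explore: uniformly random arm
        else:
            best = np.flatnonzero(Q_hat == Q_hat.max())
            a = rng.choice(best)             # exploit: greedy with uniform tie-breaking
        r = env.pull(a)
        N[a] += 1
        Q_hat[a] += (r - Q_hat[a]) / N[a]
    return Q_hat, N
```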
Toy Example: Ways to Treat Broken Toes

Consider deciding how to best treat patients with broken toes
Imagine there are 3 possible options: (1) surgery, (2) buddy taping the broken toe to another toe, (3) doing nothing
The outcome measure / reward is a binary variable: whether the toe has healed (+1) or not healed (0) after 6 weeks, as assessed by x-ray
Note: This is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.
Check Your Understanding: Bandit Toes

Consider deciding how to best treat patients with broken toes
Imagine there are 3 common options: (1) surgery, (2) a surgical boot, (3) buddy taping the broken toe to another toe
The outcome measure is a binary variable: whether the toe has healed (+1) or not (0) after 6 weeks, as assessed by x-ray
Model this as a multi-armed bandit with 3 arms, where each arm is a Bernoulli variable with an unknown parameter θ_i
Select all that are true:
1. Pulling an arm / taking an action is whether the toe has healed or not
2. A multi-armed bandit is a better fit to this problem than an MDP because treating each patient involves multiple decisions
3. After treating a patient, if θ_i ≠ 0 and θ_i ≠ 1 for all i, sometimes a patient's toe will heal and sometimes it may not
4. Not sure
Note: This is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.
Toy Example: Ways to Treat Broken Toes

Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
Note: This is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.
Toy Example: Ways to Treat Broken Toes, Greedy

Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
Greedy:
1. Sample each arm once:
   Take action a_1 (r ∼ Bernoulli(0.95)), get +1, so Q̂(a_1) = 1
   Take action a_2 (r ∼ Bernoulli(0.90)), get +1, so Q̂(a_2) = 1
   Take action a_3 (r ∼ Bernoulli(0.1)), get 0, so Q̂(a_3) = 0
2. What is the probability of greedy selecting each arm next? Assume ties are split uniformly.
Note: This is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.
Toy Example: Ways to Treat Broken Toes, Optimism, Assessing Regret of Greedy

The true (unknown) Bernoulli reward parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
Fill in the regret of each action taken by greedy:

Action | Optimal Action | Regret
a_1    | a_1            |
a_2    | a_1            |
a_3    | a_1            |
a_1    | a_1            |
a_2    | a_1            |

Will greedy ever select a_3 again? If yes, why? If not, is this a problem?
Toy Example: Ways to Treat Broken Toes, ε-Greedy

Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are:
surgery: Q(a_1) = θ_1 = 0.95
buddy taping: Q(a_2) = θ_2 = 0.9
doing nothing: Q(a_3) = θ_3 = 0.1
ε-greedy:
1. Sample each arm once:
   Take action a_1 (r ∼ Bernoulli(0.95)), get +1, so Q̂(a_1) = 1
   Take action a_2 (r ∼ Bernoulli(0.90)), get +1, so Q̂(a_2) = 1
   Take action a_3 (r ∼ Bernoulli(0.1)), get 0, so Q̂(a_3) = 0
2. Let ε = 0.1
3. What is the probability ε-greedy will pull each arm next? Assume ties are split uniformly.
Note: This is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe.
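Putting the sketches above together, here is a small simulation of these broken-toe arms that compares the empirical total regret of greedy and ε-greedy via the gap/count decomposition. It is illustrative only (not part of the lecture), and the numbers depend on the random seeds.

```python
thetas = [0.95, 0.9, 0.1]  # surgery, buddy taping, do nothing

_, counts_greedy = run_greedy(BernoulliBandit(thetas, seed=1), num_steps=1000, seed=1)
_, counts_eps = run_epsilon_greedy(BernoulliBandit(thetas, seed=2), num_steps=1000, epsilon=0.1, seed=2)

print("greedy total regret:    ", total_regret(thetas, counts_greedy))
print("eps-greedy total regret:", total_regret(thetas, counts_eps))
# Typical behavior: greedy may lock onto a_2 and accumulate 0.05 regret per step,
# while eps-greedy keeps exploring and pays roughly epsilon * (mean gap) per step forever.
```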