Lecture 12: Fast Reinforcement Learning Part II (Worked Examples)
Emma Brunskill
CS234 Reinforcement Learning, Winter 2018
With many slides from or derived from David Silver
Class Structure
Last time: Fast Learning, Exploration/Exploitation Part I
This time: Fast Learning Part II
Next time: Batch RL
Table of Contents
1. Metrics for evaluating RL algorithms
2. Principles for RL Exploration
3. Probability Matching
4. Information State Search
5. MDPs
6. Principles for RL Exploration
7. Metrics for evaluating RL algorithms
Performance Criteria of RL Algorithms
Empirical performance
Convergence (to something ...)
Asymptotic convergence to the optimal policy
Finite sample guarantees: probably approximately correct (PAC)
Regret (with respect to optimal decisions)
Optimal decisions given the information available
PAC-uniform
Principles
Naive Exploration (last time)
Optimistic Initialization (last time)
Optimism in the Face of Uncertainty (last time + this time)
Probability Matching (last time + this time)
Information State Search (this time)
Multi-Armed Bandits
A multi-armed bandit is a tuple $(\mathcal{A}, \mathcal{R})$
$\mathcal{A}$: known set of $m$ actions (arms)
$\mathcal{R}^a(r) = \mathbb{P}[r \mid a]$ is an unknown probability distribution over rewards
At each step $t$ the agent selects an action $a_t \in \mathcal{A}$
The environment generates a reward $r_t \sim \mathcal{R}^{a_t}$
Goal: maximize the cumulative reward $\sum_{\tau=1}^{t} r_\tau$
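As a concrete illustration (not part of the original slides), here is a minimal sketch of a Bernoulli multi-armed bandit environment in Python; the class name and the example arm means are hypothetical.

```python
import numpy as np

class BernoulliBandit:
    """Multi-armed bandit where each arm a has an unknown mean Q(a) = theta[a]."""
    def __init__(self, thetas, seed=0):
        self.thetas = np.asarray(thetas)        # true success probabilities (hidden from the agent)
        self.rng = np.random.default_rng(seed)

    @property
    def num_arms(self):
        return len(self.thetas)

    def pull(self, a):
        """Take action a and return a reward r ~ Bernoulli(theta[a])."""
        return float(self.rng.random() < self.thetas[a])

# Example: 3 arms with (made-up) means 0.95, 0.90, 0.10
bandit = BernoulliBandit([0.95, 0.90, 0.10])
r = bandit.pull(0)   # reward is 1.0 or 0.0
```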
Regret
The action-value is the mean reward for action $a$: $Q(a) = \mathbb{E}[r \mid a]$
The optimal value $V^*$: $V^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a)$
Regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
Total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$
Maximizing cumulative reward $\iff$ minimizing total regret
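A useful way to rewrite the total regret (a standard decomposition, not spelled out on this slide, but it is what the gap terms $\Delta_a$ in the UCB1 theorem below refer to) uses the count $N_t(a)$ of pulls of arm $a$ and the gap $\Delta_a = V^* - Q(a)$:

```latex
% Gap-count decomposition of total regret
L_t = \mathbb{E}\Big[\sum_{\tau=1}^{t} \big(V^* - Q(a_\tau)\big)\Big]
    = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)] \, \big(V^* - Q(a)\big)
    = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)] \, \Delta_a
```

So an algorithm has low regret exactly when it keeps the counts small for arms with large gaps.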
Optimism Under Uncertainty: Upper Confidence Bounds
Estimate an upper confidence bound $\hat{U}_t(a)$ for each action value, such that $Q(a) \le \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability
This depends on $N_t(a)$, the number of times action $a$ has been selected:
Small $N_t(a)$ $\rightarrow$ large $\hat{U}_t(a)$ (estimated value is uncertain)
Large $N_t(a)$ $\rightarrow$ small $\hat{U}_t(a)$ (estimated value is accurate)
Select the action maximizing the Upper Confidence Bound (UCB): $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a) + \hat{U}_t(a)$
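One common way to construct such a bound, and the derivation behind the UCB1 bonus on the next slide (a sketch, assuming rewards bounded in $[0, 1]$), is Hoeffding's inequality applied to the empirical mean of the $N_t(a)$ rewards observed for arm $a$:

```latex
% Hoeffding: the true mean exceeds the empirical mean by more than U_t(a) only with small probability
\mathbb{P}\big[\, Q(a) > \hat{Q}_t(a) + U_t(a) \,\big] \;\le\; e^{-2 N_t(a) U_t(a)^2}
% Set the right-hand side to a target failure probability p and solve for the bound:
U_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}
% Choosing p = t^{-4}, so the bound tightens in failure probability over time, gives
U_t(a) = \sqrt{\frac{2 \log t}{N_t(a)}}
```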
UCB1
This leads to the UCB1 algorithm:
$a_t = \arg\max_{a \in \mathcal{A}} Q(a) + \sqrt{\frac{2 \log t}{N_t(a)}}$
Theorem: the UCB algorithm achieves logarithmic asymptotic total regret
$\lim_{t \to \infty} L_t \le 8 \log t \sum_{a \mid \Delta_a > 0} \frac{1}{\Delta_a}$
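Below is a minimal sketch of UCB1 in Python (assuming the hypothetical `BernoulliBandit` class sketched above; ties in the argmax are broken toward the lower-indexed arm):

```python
import numpy as np

def ucb1(bandit, num_steps):
    """Run UCB1 on a bandit and return the sequence of chosen actions."""
    m = bandit.num_arms
    counts = np.zeros(m)     # N_t(a): number of pulls of each arm
    means = np.zeros(m)      # Q_hat(a): empirical mean reward of each arm
    actions = []

    # Step 1: sample each arm once
    for a in range(m):
        r = bandit.pull(a)
        counts[a] = 1
        means[a] = r
        actions.append(a)

    # Steps 2+: pull the arm with the largest upper confidence bound
    for t in range(m, num_steps):
        ucb = means + np.sqrt(2.0 * np.log(t) / counts)
        a = int(np.argmax(ucb))
        r = bandit.pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        actions.append(a)
    return actions

# Example: run UCB1 for 1000 steps on the 3-arm bandit defined earlier
actions = ucb1(BernoulliBandit([0.95, 0.90, 0.10]), num_steps=1000)
```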
Toy Example: Ways to Treat Broken Toes
Consider deciding how to best treat patients with broken toes
Imagine there are 3 possible options: (1) surgery, (2) buddy taping the broken toe to another toe, (3) doing nothing
The outcome measure is a binary variable: whether the toe has healed (+1) or not healed (0) after 6 weeks, as assessed by x-ray
Note: this is a made-up example; these are not the actual expected efficacies of the various treatment options for a broken toe
Toy Example: Ways to Treat Broken Toes
Consider deciding how to best treat patients with broken toes
Imagine there are 3 common options: (1) surgery, (2) a surgical boot, (3) buddy taping the broken toe to another toe
The outcome measure is a binary variable: whether the toe has healed (+1) or not (0) after 6 weeks, as assessed by x-ray
Model this as a multi-armed bandit with 3 arms, where each arm is a Bernoulli variable with an unknown parameter $\theta_i$
Check your understanding: what does pulling an arm / taking an action correspond to? Why is it reasonable to model this as a multi-armed bandit instead of a Markov decision process?
Toy Example: Ways to Treat Broken Toes
Imagine the true (unknown) parameters for each arm (action) are
surgery: $Q(a_1) = \theta_1 = 0.95$
buddy taping: $Q(a_2) = \theta_2 = 0.9$
doing nothing: $Q(a_3) = \theta_3 = 0.1$
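To make the mapping concrete (a sketch using the hypothetical `BernoulliBandit` class from earlier, with the made-up θ values above): pulling an arm corresponds to assigning one patient a treatment and recording, six weeks later, whether the toe healed.

```python
treatments = ["surgery", "buddy taping", "do nothing"]
toe_bandit = BernoulliBandit([0.95, 0.90, 0.10])   # arm i corresponds to treatments[i]

# One pull = treat one patient with arm a and observe healed (+1) / not healed (0)
outcome = toe_bandit.pull(treatments.index("buddy taping"))
```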
Toy Example: Ways to Treat Broken Toes, Thompson Sampling
True (unknown) parameters for each arm (action) are
surgery: $Q(a_1) = \theta_1 = 0.95$
buddy taping: $Q(a_2) = \theta_2 = 0.9$
doing nothing: $Q(a_3) = \theta_3 = 0.1$
Optimism under uncertainty, UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once
Toy Example: Ways to Treat Broken Toes, Optimism
True (unknown) parameters for each arm (action) are
surgery: $Q(a_1) = \theta_1 = 0.95$
buddy taping: $Q(a_2) = \theta_2 = 0.9$
doing nothing: $Q(a_3) = \theta_3 = 0.1$
UCB1 (Auer, Cesa-Bianchi, Fischer 2002):
1. Sample each arm once:
Take action $a_1$ ($r \sim$ Bernoulli(0.95)), get +1, $\hat{Q}(a_1) = 1$
Take action $a_2$ ($r \sim$ Bernoulli(0.90)), get +1, $\hat{Q}(a_2) = 1$
Take action $a_3$ ($r \sim$ Bernoulli(0.1)), get 0, $\hat{Q}(a_3) = 0$
2. Set $t = 3$ and compute the upper confidence bound for each action:
$ucb(a) = \hat{Q}(a) + \sqrt{\frac{2 \ln t}{N_t(a)}}$
3. Select action $a_t = \arg\max_a ucb(a)$
4. Observe the reward
5. Recompute the upper confidence bound for each action and repeat
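To check the arithmetic in step 2 (a short sketch; the values follow from the formula above): at $t = 3$ every arm has been pulled once, so each gets the same exploration bonus $\sqrt{2 \ln 3} \approx 1.48$, and the tie between $a_1$ and $a_2$ is broken arbitrarily.

```python
import numpy as np

means = np.array([1.0, 1.0, 0.0])   # Q_hat after sampling each arm once
counts = np.array([1, 1, 1])        # N_t(a)
t = 3

bonus = np.sqrt(2.0 * np.log(t) / counts)   # about 1.48 for every arm
ucb = means + bonus
print(ucb)                  # approximately [2.48, 2.48, 1.48]
print(int(np.argmax(ucb)))  # 0, i.e. a_1 (argmax breaks the a_1/a_2 tie toward the first arm)
```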