CSE 473: Artificial Intelligence
Reinforcement Learning
Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley – materials available at http://ai.berkeley.edu.]

Three Key Ideas for RL
§ Model-based vs. model-free learning
  § What function is being learned?
§ Approximating the value function
  § Smaller → easier to learn & better generalization
§ Exploration-exploitation tradeoff
Q-Learning
§ For all s, a: initialize Q(s, a) = 0
§ Repeat forever:
  § Where are you? s
  § Choose some action a
  § Execute it in the real world: (s, a, r, s')
  § Do update: Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]  (see the sketch below)

Questions
§ How to explore?
  § Random exploration
    § Uniform exploration
    § Epsilon-greedy
      § With (small) probability ε, act randomly
      § With (large) probability 1 − ε, act on the current policy
  § Exploration functions (such as UCB)
  § Thompson sampling
§ When to exploit?
§ How to even think about this tradeoff?
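As a concrete illustration, here is a minimal sketch of tabular Q-learning with ε-greedy exploration. The `env` object (with `reset`, `step`, `actions`, `is_terminal`) is a hypothetical interface introduced only for this sketch, not an API from the course materials; the update rule is the standard one written above.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (a sketch)."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0 for all s, a

    def greedy_action(s):
        # Act on the current policy: best known action, ties broken randomly.
        actions = env.actions(s)
        best = max(Q[(s, a)] for a in actions)
        return random.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(episodes):
        s = env.reset()                      # "Where are you? s"
        while not env.is_terminal(s):
            # Epsilon-greedy exploration: small probability of a random action.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = greedy_action(s)

            s_next, r = env.step(a)          # execute in the real world: (s, a, r, s')

            # Q-learning update toward the one-step sample target.
            if env.is_terminal(s_next):
                target = r
            else:
                target = r + gamma * max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```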
Regret
§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful sub-optimality, and the optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal

Two Kinds of Regret
§ Cumulative regret:
  § achieve near-optimal cumulative lifetime reward (in expectation)
§ Simple regret:
  § quickly identify a policy with high reward (in expectation)
RL on a Single-State MDP: the Multi-Armed Bandit Problem
§ Suppose the MDP has a single state and k actions
§ Can sample rewards of actions using calls to a simulator
§ Sampling action a is like pulling a slot machine arm with random payoff function R(s, a)
[Figure: single state s with arms a_1, a_2, …, a_k, paying off R(s, a_1), R(s, a_2), …, R(s, a_k)]
Slide adapted from Alan Fern (OSU)

UCB Algorithm for Minimizing Cumulative Regret
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235–256.
§ Q(a): average reward for trying action a (in our single state s) so far
§ n(a): number of pulls of arm a so far
§ Action choice by UCB after n pulls:  a_n = argmax_a [ Q(a) + √( 2 ln n / n(a) ) ]
§ Assumes rewards in [0, 1] – normalized from R_max
Slide adapted from Alan Fern (OSU)
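A minimal sketch of this action choice, assuming the per-arm averages Q(a) and pull counts n(a) are kept in dictionaries; the helper name `ucb_choose` is made up for illustration:

```python
import math
import random

def ucb_choose(Q, n_pulls, n_total):
    """UCB1 action choice: argmax_a  Q(a) + sqrt(2 ln n / n(a)).

    Q[a]       -- average observed reward of arm a so far (rewards assumed in [0, 1])
    n_pulls[a] -- n(a), number of times arm a has been pulled
    n_total    -- n, total number of pulls so far
    """
    # Pull each arm at least once before trusting the confidence bonus.
    untried = [a for a in Q if n_pulls[a] == 0]
    if untried:
        return random.choice(untried)
    return max(Q, key=lambda a: Q[a] + math.sqrt(2 * math.log(n_total) / n_pulls[a]))
```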
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]
Theorem: The expected cumulative regret of UCB, E[Reg_n], after n arm pulls is bounded by O(log n).
§ Is this good? Yes. The average per-step regret, E[Reg_n] / n, is O(log n / n).
Theorem: No algorithm can achieve a better expected regret (up to constant factors).
Slide adapted from Alan Fern (OSU)

Two Kinds of Regret
§ Cumulative regret:
  § achieve near-optimal cumulative lifetime reward (in expectation)
§ Simple regret:
  § quickly identify a policy with high reward (in expectation)
Simple Regret Objective
§ Protocol: At time step n the algorithm picks an "exploration" arm a_n to pull and observes reward r_n, and also picks an arm index j_n it thinks is best (a_n, j_n, and r_n are random variables).
§ If interrupted at time n, the algorithm returns arm j_n.
§ Expected simple regret, E[SReg_n]: the difference between R* and the expected reward of the arm j_n selected by our strategy at time n:
  E[SReg_n] = R* − E[ R(a_{j_n}) ]

Simple Regret Objective (continued)
What about UCB for simple regret?
Theorem: The expected simple regret of UCB after n arm pulls is upper bounded by O(n^{-c}) for a constant c.
§ Seems good, but we can do much better (at least in theory).
  § Intuitively: UCB puts too much emphasis on pulling the best arm
  § After an arm is looking good, maybe better to see if ∃ a better arm
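To make the protocol concrete, here is a small Monte-Carlo sketch of how E[SReg_n] could be estimated for any exploration/recommendation strategy. The `explore` and `recommend` callables and Bernoulli arms are assumptions for illustration, not part of the original formulation.

```python
import random

def estimate_simple_regret(true_means, explore, recommend, n, trials=1000):
    """Monte-Carlo estimate of E[SReg_n] for a bandit exploration strategy.

    true_means -- hidden mean reward of each arm (Bernoulli arms here)
    explore(t, history)  -> arm a_t to pull at step t         (assumed interface)
    recommend(history)   -> arm index j_n believed to be best (assumed interface)
    """
    r_star = max(true_means)                 # R*, the best achievable mean reward
    total = 0.0
    for _ in range(trials):
        history = []                         # list of (arm, observed reward) pairs
        for t in range(n):
            a = explore(t, history)
            r = 1.0 if random.random() < true_means[a] else 0.0
            history.append((a, r))
        j = recommend(history)               # arm returned if interrupted at time n
        total += r_star - true_means[j]      # simple regret of this run
    return total / trials
```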
Incremental Uniform (or Round Robin)
Bubeck, S., Munos, R., & Stoltz, G. (2011). Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19), 1832–1852.
Algorithm:
§ At round n, pull the arm with index (n mod k) + 1
§ At round n, return the arm (if asked) with the largest average reward
Theorem: The expected simple regret of Uniform after n arm pulls is upper bounded by O(e^{-cn}) for a constant c.
§ This bound is exponentially decreasing in n!
§ Compare to the polynomial rate O(n^{-c}) for UCB.

Can we do even better?
Tolpin, D. & Shimony, S. E. (2012). MCTS Based on Simple Regret. AAAI Conference on Artificial Intelligence.
Algorithm ε-Greedy (parameter 0 < ε < 1):
§ At round n, with probability ε pull the arm with the best average reward so far; otherwise pull one of the other arms at random.
§ At round n, return the arm (if asked) with the largest average reward.
Theorem: The expected simple regret of ε-Greedy with ε = 0.5 after n arm pulls is upper bounded by O(e^{-cn}) for a constant c that is larger than the constant for Uniform (this holds for "large enough" n).
(A code sketch of both exploration policies follows.)
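A minimal sketch of the two exploration policies, written against the hypothetical `explore(t, history)` / `recommend(history)` interface from the earlier snippet (0-indexed arms; `empirical_best` is an illustrative helper, not from the papers):

```python
import random

def empirical_best(history, k):
    """Arm with the largest average observed reward so far (the arm returned if asked)."""
    totals, counts = [0.0] * k, [0] * k
    for a, r in history:
        totals[a] += r
        counts[a] += 1
    return max(range(k),
               key=lambda a: totals[a] / counts[a] if counts[a] else float("-inf"))

def uniform_explore(t, history, k):
    """Incremental Uniform (round robin): cycle through the k arms in order."""
    return t % k

def eps_greedy_explore(t, history, k, eps=0.5):
    """eps-Greedy for simple regret: with probability eps pull the empirically
    best arm; otherwise pull one of the other arms uniformly at random."""
    if not history:
        return random.randrange(k)
    best = empirical_best(history, k)
    if random.random() < eps:
        return best
    others = [a for a in range(k) if a != best]
    return random.choice(others)
```

Either policy can be plugged into `estimate_simple_regret` (e.g. via `lambda t, h: uniform_explore(t, h, k)`), with `empirical_best` as the recommendation rule.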
Summary of Bandits in Theory
PAC Objective:
§ UniformBandit is a simple PAC algorithm
§ MedianElimination improves by a factor of log(k) and is optimal up to constant factors
Cumulative Regret:
§ Uniform is very bad!
§ UCB is optimal (up to constant factors)
Simple Regret:
§ UCB shown to reduce regret at a polynomial rate
§ Uniform reduces regret at an exponential rate
§ 0.5-Greedy may have an even better exponential rate

Theory vs. Practice
• The established theoretical relationships among bandit algorithms have often been useful in predicting empirical relationships.
• But not always ….
Theory vs. Practice
[Figure: simple regret vs. number of samples, comparing UCB and UCB[sqrt]]
§ UCB maximizes Q(a) + √( 2 ln(n) / n(a) )
§ UCB[sqrt] maximizes Q(a) + √( 2 √n / n(a) )
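The two exploration bonuses differ only in how the numerator grows with n. A small sketch of the difference (the function names are mine; the formulas are the ones on the slide):

```python
import math

def ucb_bonus(n, n_a):
    """Standard UCB exploration bonus: sqrt(2 ln(n) / n_a)."""
    return math.sqrt(2 * math.log(n) / n_a)

def ucb_sqrt_bonus(n, n_a):
    """UCB[sqrt] exploration bonus: sqrt(2 * sqrt(n) / n_a). The bonus grows
    much faster with n, so rarely-pulled arms keep getting revisited, which
    tends to help simple regret in practice."""
    return math.sqrt(2 * math.sqrt(n) / n_a)
```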