CS885 Reinforcement Learning, Lecture 8a (May 25, 2018): Multi-armed Bandits
Readings: [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2
University of Waterloo, CS885 Spring 2018, Pascal Poupart
Outline
• Exploration/exploitation tradeoff
• Regret
• Multi-armed bandits
  – ε-greedy strategies
  – Upper confidence bounds
Exploration/Exploitation Tradeoff
• Fundamental problem of RL due to the active nature of the learning process
• Consider one-state RL problems known as bandits
Stochastic Bandits
• Formal definition:
  – Single state: S = {s}
  – A: set of actions (also known as arms)
  – Space of rewards (often re-scaled to be [0,1])
• No transition function to be learned since there is a single state
• We simply need to learn the stochastic reward function
Origin
• The term bandit comes from gambling, where slot machines can be thought of as one-armed bandits.
• Problem: which slot machine should we play at each turn when their payoffs are not necessarily the same and are initially unknown?
Examples
• Design of experiments (clinical trials)
• Online ad placement
• Web page personalization
• Games
• Networks (packet routing)
Online Ad Optimization
Online Ad Optimization
• Problem: which ad should be presented?
• Answer: present the ad with the highest payoff
  Payoff = ClickThroughRate × Payment
  – Click-through rate: probability that the user clicks on the ad
  – Payment: $$ paid by the advertiser
    • Amount determined by an auction
Simplified Problem
• Assume payment is 1 unit for all ads
• Need to estimate the click-through rate
• Formulate as a bandit problem:
  – Arms: the set of possible ads
  – Rewards: 0 (no click) or 1 (click)
• In what order should ads be presented to maximize revenue?
  – How should we balance exploitation and exploration?
Simple yet difficult problem
• Simple: description of the problem is short
• Difficult: no known tractable optimal solution
Simple heuristics
• Greedy strategy: select the arm with the highest average reward so far
  – May get stuck in a suboptimal arm due to lack of exploration
• ε-greedy: select an arm at random with probability ε and otherwise make a greedy selection
  – Convergence rate depends on the choice of ε
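The ε-greedy heuristic can be sketched as a short simulation. This is a minimal sketch, not the lecture's code: the function name `epsilon_greedy` and the `true_means` parameter are illustrative, and rewards are simulated as Bernoulli 0/1 clicks (in practice the true means are unknown and rewards come from the environment).

```python
import random

def epsilon_greedy(true_means, epsilon, horizon, rng=None):
    """Simulate an epsilon-greedy bandit with Bernoulli (0/1) rewards."""
    rng = rng or random.Random(0)
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    estimates = [0.0] * k     # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            a = rng.randrange(k)                           # explore
        else:
            a = max(range(k), key=lambda i: estimates[i])  # exploit
        r = 1.0 if rng.random() < true_means[a] else 0.0   # simulated click
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]     # incremental mean
        total_reward += r
    return total_reward, counts, estimates
```

The greedy strategy is the special case ε = 0, which is exactly how it can get stuck: an arm with an unluckily low early estimate may never be tried again.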
Regret
• Let R(a) be the unknown average reward of arm a
• Let R* = max_a R(a) and a* = argmax_a R(a)
• Denote by loss(a) the expected regret of a:
  loss(a) = R* − R(a)
• Denote by Loss_n the expected cumulative regret for n time steps:
  Loss_n = Σ_{t=1}^{n} loss(a_t)
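The two definitions above translate directly into code. A minimal sketch, assuming the true means are known (they are, of course, only available in simulation); the function names are illustrative:

```python
def expected_regret(true_means, arm):
    """loss(a) = R* - R(a): expected regret of one pull of the given arm."""
    return max(true_means) - true_means[arm]

def cumulative_regret(true_means, actions):
    """Loss_n = sum over t of loss(a_t) for a sequence of pulled arms."""
    return sum(expected_regret(true_means, a) for a in actions)
```

For example, with true means [0.2, 0.5, 0.4], pulling arms [0, 1, 1, 2] incurs expected cumulative regret 0.3 + 0 + 0 + 0.1 = 0.4.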
Theoretical Guarantees
• When ε is constant:
  – For large enough t: Pr(a_t ≠ a*) ≈ ε
  – Expected cumulative regret: Loss_n ≈ Σ_{t=1}^{n} ε = O(n)
    • Linear regret
• When ε_t ∝ 1/t:
  – For large enough t: Pr(a_t ≠ a*) ≈ ε_t = O(1/t)
  – Expected cumulative regret: Loss_n ≈ Σ_{t=1}^{n} 1/t = O(log n)
    • Logarithmic regret
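The logarithmic bound for the decaying schedule can be checked numerically: the expected number of exploratory pulls under ε_t = 1/t is the harmonic sum Σ 1/t, which grows like log n. A small sketch (function names are illustrative):

```python
def epsilon_schedule(horizon):
    """Decaying exploration rate: epsilon_t = 1/t, capped at 1."""
    return [min(1.0, 1.0 / t) for t in range(1, horizon + 1)]

def expected_explorations(horizon):
    """Expected number of random (exploratory) pulls under the schedule.
    This is the harmonic number H_n, which is Theta(log n)."""
    return sum(epsilon_schedule(horizon))
```

For n = 1000 this sum is about 7.5, versus 1000·ε exploratory pulls for a constant ε, which is the gap between logarithmic and linear regret.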
Empirical mean
• Problem: how far is the empirical mean R̃(a) from the true mean R(a)?
• If we knew that |R(a) − R̃(a)| ≤ bound
  – Then we would know that R(a) ≤ R̃(a) + bound
  – And we could select the arm with the best R̃(a) + bound
• Over time, additional data will allow us to refine R̃(a) and compute a tighter bound.
Optimism in the Face of Uncertainty
• Suppose we have an oracle that returns an upper bound UB_n(a) on R(a) for each arm, based on n trials of arm a.
• Suppose the upper bound returned by this oracle converges to R(a) in the limit:
  – i.e., lim_{n→∞} UB_n(a) = R(a)
• Optimistic algorithm:
  – At each step, select argmax_a UB_n(a)
Convergence
• Theorem: An optimistic strategy that always selects argmax_a UB_n(a) converges to a*.
• Proof by contradiction:
  – Suppose that we converge to a suboptimal arm a after infinitely many trials.
  – Then R(a) = UB_∞(a) ≥ UB_∞(a′) ≥ R(a′) ∀a′
  – But R(a) ≥ R(a′) ∀a′ contradicts our assumption that a is suboptimal.
Probabilistic Upper Bound
• Problem: we can't compute an upper bound with certainty since we are sampling
• However, we can obtain measures f that are upper bounds most of the time
  – i.e., Pr(R(a) ≤ f(a)) ≥ 1 − δ
  – Example: Hoeffding's inequality:
    Pr( R(a) ≤ R̃(a) + √( log(1/δ) / (2 n_a) ) ) ≥ 1 − δ
    where n_a is the number of trials for arm a
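Hoeffding's bound is a one-line computation. A minimal sketch, assuming rewards lie in [0, 1] as stated earlier (the function name is illustrative):

```python
import math

def hoeffding_ucb(emp_mean, n_a, delta):
    """Upper bound on the true mean R(a) that holds with probability
    at least 1 - delta, for rewards in [0, 1] (Hoeffding's inequality)."""
    return emp_mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n_a))
```

Note how the bound tightens as n_a grows: quadrupling the number of trials halves the width of the confidence term.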
Upper Confidence Bound (UCB)
• Set δ_n = 1/n^4 in Hoeffding's bound
• Choose the arm with the highest Hoeffding bound

UCB(h)
  R ← 0, n ← 0, n_a ← 0 ∀a
  Repeat until n = h
    Execute a = argmax_a R̃(a) + √( 2 log n / n_a )
    Receive r
    R̃(a) ← ( n_a R̃(a) + r ) / ( n_a + 1 )
    R ← R + r
    n ← n + 1, n_a ← n_a + 1
  Return R
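A runnable sketch of the UCB procedure above, with Bernoulli rewards simulated from a hypothetical `true_means` vector (in practice the rewards would come from the environment, and the true means are unknown). Each arm is pulled once up front to avoid a division by zero in the confidence term:

```python
import math
import random

def ucb_bandit(true_means, horizon, rng=None):
    """UCB on simulated Bernoulli arms; returns total reward and pull counts."""
    rng = rng or random.Random(0)
    k = len(true_means)
    counts = [0] * k          # n_a: number of pulls per arm
    estimates = [0.0] * k     # empirical mean reward per arm
    total = 0.0
    for n in range(1, horizon + 1):
        if n <= k:
            a = n - 1         # initialization: pull each arm once
        else:
            a = max(range(k),
                    key=lambda i: estimates[i]
                    + math.sqrt(2.0 * math.log(n) / counts[i]))
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean
        total += r
    return total, counts
```

The incremental update is algebraically the same as the pseudocode's (n_a R̃(a) + r)/(n_a + 1); the √(2 log n / n_a) term comes from plugging δ_n = 1/n^4 into Hoeffding's bound.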
UCB Convergence
• Theorem: Although Hoeffding's bound is probabilistic, UCB converges.
• Idea: as n increases, the term √( 2 log n / n_a ) increases, ensuring that all arms are tried infinitely often
• Expected cumulative regret: Loss_n = O(log n)
  – Logarithmic regret
Summary
• Stochastic bandits
  – Exploration/exploitation tradeoff
• ε-greedy and UCB
  – Theory: logarithmic expected cumulative regret
• In practice:
  – UCB often performs better than ε-greedy
  – Many variants of UCB improve performance