Class 3: Multi-Arm Bandit. Sutton and Barto, Chapter 2. Sutton slides

Class 3: Multi-Arm Bandit Sutton and Barto, Chapter 2 Sutton slides and Silver 295, class 2 1

Multi-Arm Bandits. Sutton and Barto, Chapter 2. The simplest reinforcement learning problem

The Exploration/Exploitation Dilemma
Online decision-making involves a fundamental choice:
• Exploitation: make the best decision given current information
• Exploration: gather more information
The best long-term strategy may involve short-term sacrifices.
Gather enough information to make the best overall decisions.

Examples
• Restaurant selection
  – Exploitation: go to your favourite restaurant
  – Exploration: try a new restaurant
• Online banner advertisements
  – Exploitation: show the most successful advert
  – Exploration: show a different advert
• Oil drilling
  – Exploitation: drill at the best known location
  – Exploration: drill at a new location
• Game playing
  – Exploitation: play the move you believe is best
  – Exploration: play an experimental move

You are the algorithm! (bandit1)

The k-armed Bandit Problem
• On each of a sequence of time steps, t = 1, 2, 3, …, you choose an action $A_t$ from k possibilities, and receive a real-valued reward $R_t$
• Each action has a true value, its expected reward: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$
• These true values are unknown. The distribution is unknown
• Nevertheless, you must maximize your total reward
• You must both try actions to learn their values (explore), and prefer those that appear best (exploit)

The Exploration/Exploitation Dilemma

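Regret
• The action-value is the mean reward for action a: $q_*(a) = \mathbb{E}[r \mid a]$ (written $Q(a)$ in the formulas below; both symbols denote the same true value)
• The optimal value $V^*$ is $V^* = Q(a^*) = \max_{a \in \mathcal{A}} q_*(a)$
• The regret is the opportunity loss for one step: $l_t = \mathbb{E}[V^* - Q(a_t)]$
• The total regret is the total opportunity loss: $L_t = \mathbb{E}\left[\sum_{\tau=1}^{t} \left(V^* - Q(a_\tau)\right)\right]$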
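As a rough illustration of the definitions above, the sketch below sums the per-step opportunity loss for a given sequence of chosen actions. It assumes the true values are known, which is only possible in a simulation; the function name `total_regret` and the three-armed example values are made up for this sketch.

```python
def total_regret(true_values, actions):
    """Sum of per-step opportunity losses V* - q*(a_t) over the chosen actions."""
    v_star = max(true_values)  # optimal value V*
    return sum(v_star - true_values[a] for a in actions)


# Hypothetical 3-armed bandit with known true values (illustration only).
q_star = [0.2, 0.5, 0.9]
always_best = [2] * 10   # always pull the optimal arm
always_worst = [0] * 10  # always pull the worst arm

print(total_regret(q_star, always_best))
print(total_regret(q_star, always_worst))
```

Pulling the optimal arm every time incurs no regret, while any fixed suboptimal arm accumulates regret at a constant rate per step, which is the linear growth discussed on the regret slides.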

Multi-Armed Bandits Regret

Multi-Armed Bandits Regret
[Figure: total regret versus time-steps for greedy, ϵ-greedy, and decaying ϵ-greedy]
• If an algorithm forever explores it will have linear total regret
• If an algorithm never explores it will have linear total regret
• Is it possible to achieve sublinear total regret?

Complexity of regret

Overview
• Action-value methods
  – Epsilon-greedy strategy
  – Incremental implementation
  – Stationary vs. non-stationary environment
  – Optimistic initial values
• UCB action selection
• Gradient bandit algorithms
• Associative search (contextual bandits)

Basics
• Maximize total reward collected
  – vs. learn an (optimal) policy (RL)
• Episode is one step
• Complex function of
  – True value
  – Uncertainty
  – Number of time steps
  – Stationary vs. non-stationary?

Action-Value Methods

ε-Greedy Action Selection
• In greedy action selection, you always exploit
• In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again)
• This is perhaps the simplest way to balance exploration and exploitation
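The selection rule described above can be sketched in a few lines. This is a minimal sketch, not the slides' own code: the function name `epsilon_greedy` and the random tie-breaking among equally valued greedy actions are assumptions.

```python
import random


def epsilon_greedy(q_estimates, epsilon, rng=random):
    """Return an action index: uniformly random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: any action, possibly the greedy one again.
        return rng.randrange(len(q_estimates))
    best = max(q_estimates)
    # Exploit: break ties among the best actions at random (an assumption).
    return rng.choice([a for a, q in enumerate(q_estimates) if q == best])


print(epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0))  # purely greedy choice
```

Setting `epsilon=0.0` recovers plain greedy selection, and `epsilon=1.0` gives uniformly random actions.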

A simple bandit algorithm
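One way such a loop might look is sketched below: ε-greedy selection combined with sample-average estimates that are updated incrementally. The callable `pull`, the function name `simple_bandit`, and the three made-up true values are assumptions for illustration.

```python
import random


def simple_bandit(pull, k, steps, epsilon, rng=random):
    """Epsilon-greedy with sample-average estimates.

    `pull(a)` is a hypothetical callable returning the reward for action a.
    Returns the value estimates Q and the pull counts N.
    """
    Q = [0.0] * k
    N = [0] * k
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                   # explore
        else:
            a = max(range(k), key=lambda i: Q[i])  # exploit (first max wins ties)
        r = pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                  # incremental sample average
    return Q, N


rng = random.Random(0)
q_star = [0.2, 0.5, 0.9]  # made-up true values, unit-variance Gaussian rewards
Q, N = simple_bandit(lambda a: rng.gauss(q_star[a], 1.0),
                     k=3, steps=5000, epsilon=0.1, rng=rng)
print(Q, N)
```

The estimates are stored per action and each update touches only the chosen action, so memory use does not grow with the number of steps.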

One Bandit Task from the 10-armed Testbed
Figure 2.1: An example bandit problem from the 10-armed testbed. The true value $q_*(a)$ of each of the ten actions was selected according to a normal distribution with mean zero and unit variance, and then the actual rewards were selected according to a mean $q_*(a)$, unit-variance normal distribution, as suggested by these gray distributions.
[Figure: reward distribution for each action 1 to 10, centred on $q_*(1), \ldots, q_*(10)$]
• Run for 1000 steps
• Repeat the whole thing 2000 times with different bandit tasks
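The testbed construction described in the caption can be sketched with the standard library. The helper names `make_testbed` and `pull` are hypothetical, and only one task is generated here rather than the 2000 used for the averaged curves.

```python
import random


def make_testbed(k=10, rng=random):
    """Draw a true value q*(a) ~ N(0, 1) for each of the k arms."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


def pull(q_star, action, rng=random):
    """Sample a reward ~ N(q*(action), 1)."""
    return rng.gauss(q_star[action], 1.0)


rng = random.Random(0)
q_star = make_testbed(rng=rng)                      # one bandit task
rewards = [pull(q_star, 3, rng) for _ in range(5)]  # five pulls of one arm
print(q_star[3], rewards)
```

Repeating `make_testbed` with different seeds would give the independent bandit tasks over which the curves are averaged.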

ε-Greedy Methods on the 10-Armed Testbed

Averaging ⟶ learning rule
• To simplify notation, let us focus on one action
• We consider only its rewards, and its estimate after n − 1 rewards: $Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$
• How can we do this incrementally (without storing all the rewards)?
• Could store a running sum and count (and divide), or equivalently: $Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$

Derivation of incremental update
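A short version of the derivation, starting from the sample average $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i$ and using $Q_n = \frac{1}{n-1}\sum_{i=1}^{n-1} R_i$, might read as follows:

$$
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i \\
        &= \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1} R_i\right) \\
        &= \frac{1}{n}\left(R_n + (n-1)\,Q_n\right) \\
        &= Q_n + \frac{1}{n}\left[R_n - Q_n\right]
\end{aligned}
$$

The result has the general form NewEstimate = OldEstimate + StepSize × [Target − OldEstimate], with step size $1/n$.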

Tracking a Non-stationary Problem
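For a non-stationary problem, a constant step size α replaces the $1/n$ of the sample average, giving $Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$, which weights recent rewards more heavily. The sketch below compares the two on a made-up reward stream whose mean jumps from 0 to 1; the function names and the data are assumptions for illustration.

```python
def constant_step_estimate(rewards, alpha, q0=0.0):
    """Exponential recency-weighted average: Q <- Q + alpha * (R - Q)."""
    q = q0
    for r in rewards:
        q += alpha * (r - q)
    return q


def sample_average(rewards):
    """Plain average, weighting all past rewards equally."""
    return sum(rewards) / len(rewards)


# Made-up noise-free stream: the reward switches from 0 to 1 halfway through.
rewards = [0.0] * 100 + [1.0] * 100

print(sample_average(rewards))               # stuck halfway between old and new
print(constant_step_estimate(rewards, 0.1))  # close to the current value of 1
```

The sample average still reflects the stale first half of the data, while the constant-step estimate has nearly forgotten it.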

Standard stochastic approximation convergence conditions
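The conditions this slide title refers to are usually stated, for a step-size sequence $\alpha_n(a)$, as in Sutton and Barto, Chapter 2:

$$
\sum_{n=1}^{\infty} \alpha_n(a) = \infty
\qquad \text{and} \qquad
\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty
$$

The first condition ensures the steps are large enough to overcome initial conditions or random fluctuations; the second ensures they eventually become small enough for convergence. The sample-average choice $\alpha_n = 1/n$ satisfies both, whereas a constant $\alpha_n = \alpha$ violates the second, which is why constant-step estimates keep tracking the most recent rewards instead of converging.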

Optimistic Initial Values
• All methods so far depend on $Q_1(a)$, i.e., they are biased. So far we have used $Q_1(a) = 0$
• Suppose we initialize the action values optimistically ($Q_1(a) = 5$), e.g., on the 10-armed testbed (with α = 0.1)
[Figure: % optimal action versus steps (0 to 1000) for optimistic, greedy ($Q_1 = 5$, ε = 0) and realistic, ε-greedy ($Q_1 = 0$, ε = 0.1)]
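A toy illustration of why optimism drives exploration is sketched below. It is not the testbed experiment from the figure: the rewards are made-up and noise-free, selection is purely greedy, and the function name `greedy_run` is an assumption.

```python
def greedy_run(true_rewards, q_init, alpha, steps):
    """Purely greedy selection with a constant step size and deterministic rewards."""
    Q = [q_init] * len(true_rewards)
    chosen = []
    for _ in range(steps):
        a = max(range(len(Q)), key=lambda i: Q[i])  # first max wins ties
        Q[a] += alpha * (true_rewards[a] - Q[a])
        chosen.append(a)
    return chosen


rewards = [0.1, 0.5, 0.9]  # made-up, noise-free rewards (illustration only)
optimistic = greedy_run(rewards, q_init=5.0, alpha=0.1, steps=200)
realistic = greedy_run(rewards, q_init=0.0, alpha=0.1, steps=200)

print(sorted(set(optimistic[:3])))  # arms tried in the first three steps
print(sorted(set(realistic)))       # arms ever tried when starting from Q1 = 0
```

With $Q_1 = 5$ every reward is disappointing, so the greedy learner moves on to the arms it has not yet devalued; with $Q_1 = 0$ the first arm tried already looks better than the untried ones and the learner never leaves it.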

Upper Confidence Bound (UCB) action selection
• A clever way of reducing exploration over time
• Focus on actions whose estimate has a large degree of uncertainty
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound
[Figure: average reward versus steps for UCB (c = 2) and ε-greedy (ε = 0.1)]
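A minimal sketch of this selection rule follows, assuming the form given in Sutton and Barto, Chapter 2: $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]$, where $N_t(a)$ counts earlier pulls of action a. The function name `ucb_action` and the convention of trying unpulled arms first are assumptions of this sketch.

```python
import math


def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a Q[a] + c * sqrt(ln t / N[a]); arms with N[a] == 0 go first."""
    for a, n in enumerate(N):
        if n == 0:
            return a  # an untried arm is treated as maximally uncertain
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(Q, N)]
    return max(range(len(Q)), key=lambda a: scores[a])


# Arm 0 looks better but is well explored; arm 1 has been tried only once.
print(ucb_action(Q=[0.5, 0.4], N=[100, 1], t=101, c=2.0))
```

The bonus term shrinks as an arm is pulled more often and grows slowly with t, so rarely tried arms are revisited occasionally while exploration of well-known arms fades.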

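Complexity of UCB Algorithm
Theorem: The UCB algorithm achieves logarithmic asymptotic total regret
$\lim_{t \to \infty} L_t \le 8 \log t \sum_{a \mid \Delta_a > 0} \Delta_a$
where $\Delta_a = V^* - Q(a)$ is the gap between the optimal value and the value of action a.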

Gradient-Bandit Algorithms
• Let $H_t(a)$ be a learned preference for taking action a
[Figure: % optimal action versus steps (0 to 1000), with and without baseline, for α = 0.1 and α = 0.4]
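The slide text only introduces the preferences $H_t(a)$; the sketch below assumes the usual formulation from Sutton and Barto, Chapter 2, in which actions are drawn from a softmax over preferences, $\pi_t(a) = e^{H_t(a)} / \sum_b e^{H_t(b)}$, and each preference moves by $\alpha (R_t - \bar{R}_t)(\mathbb{1}_{a = A_t} - \pi_t(a))$ with the average reward $\bar{R}_t$ as baseline. The function names are hypothetical.

```python
import math


def softmax(H):
    """Action probabilities from preferences (shifted by max for numerical stability)."""
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit_step(H, action, reward, baseline, alpha):
    """Return updated preferences after observing `reward` for `action`."""
    pi = softmax(H)
    return [
        h + alpha * (reward - baseline) * ((1.0 if a == action else 0.0) - pi[a])
        for a, h in enumerate(H)
    ]


H = [0.0, 0.0, 0.0]
H = gradient_bandit_step(H, action=1, reward=1.0, baseline=0.0, alpha=0.1)
print(H, softmax(H))
```

A reward above the baseline raises the preference of the chosen action and lowers the others; a reward below the baseline does the opposite.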

Derivation of gradient-bandit algorithm

Summary Comparison of Bandit Algorithms

Conclusions
• These are all simple methods
  – but they are complicated enough that we will build on them
  – we should understand them completely
  – there are still open questions
• Our first algorithms that learn from evaluative feedback
  – and thus must balance exploration and exploitation
• Our first algorithms that appear to have a goal: they learn to maximize reward by trial and error
