What we learned last time
1. Intelligence is the computational part of the ability to achieve goals
   • looking deeper: 1) it's a continuum, 2) it's an appearance, 3) it varies with observer and purpose
2. We will (probably) figure out how to make intelligent systems in our lifetimes; it will change everything
3. But prior to that it will probably change our careers
   • as companies gear up to take advantage of the economic opportunities
4. This course has a demanding workload
Multi-armed Bandits
Sutton and Barto, Chapter 2
The simplest reinforcement learning problem
You are the algorithm! (bandit1)
• Action 1 — Reward is always 8
  • value of action 1 is $q_*(1) = 8$
• Action 2 — 88% chance of 0, 12% chance of 100!
  • value of action 2 is $q_*(2) = 0.88 \times 0 + 0.12 \times 100 = 12$
• Action 3 — Randomly between -10 and 35, equiprobable
  • $q_*(3) = 12.5$
• Action 4 — a third 0, a third 20, and a third from {8, 9, …, 18}
  • $q_*(4) = \frac{1}{3} \times 0 + \frac{1}{3} \times 20 + \frac{1}{3} \times 13 = 0 + \frac{20}{3} + \frac{13}{3} = \frac{33}{3} = 11$
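As a quick check on these values, here is a minimal Python sketch (not part of the slides; the arm simulators are illustrative readings of the descriptions above) that estimates each $q_*$ by sampling:

```python
import random

# Illustrative simulators for the four actions described above
def arm1():
    return 8                                              # always 8

def arm2():
    return 100 if random.random() < 0.12 else 0           # 12% chance of 100

def arm3():
    return random.uniform(-10, 35)                        # equiprobable on [-10, 35]

def arm4():
    return random.choice([0, 20, random.randint(8, 18)])  # each case with probability 1/3

# Sample averages should approach q*(1)=8, q*(2)=12, q*(3)=12.5, q*(4)=11
for name, arm in [("q*(1)", arm1), ("q*(2)", arm2), ("q*(3)", arm3), ("q*(4)", arm4)]:
    n = 100_000
    print(name, "≈", sum(arm() for _ in range(n)) / n)
```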
The k-armed Bandit Problem
• On each of an infinite sequence of time steps, t = 1, 2, 3, …, you choose an action $A_t$ from k possibilities, and receive a real-valued reward $R_t$
• The reward depends only on the action taken; it is identically, independently distributed (i.i.d.), with true values
  $q_*(a) \doteq \mathbb{E}[R_t \mid A_t = a], \quad \forall a \in \{1, \dots, k\}$
• These true values are unknown. The distribution is unknown
• Nevertheless, you must maximize your total reward
• You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
The Exploration/Exploitation Dilemma
• Suppose you form action-value estimates $Q_t(a) \approx q_*(a), \ \forall a$
• Define the greedy action at time t as $A_t^* \doteq \arg\max_a Q_t(a)$
• If $A_t = A_t^*$ then you are exploiting; if $A_t \neq A_t^*$ then you are exploring
• You can't do both, but you need to do both
• You can never stop exploring, but maybe you should explore less with time. Or maybe not.
Action-Value Methods
• Methods that learn action-value estimates and nothing else
• For example, estimate action values as sample averages:
  $Q_t(a) \doteq \dfrac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \dfrac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i=a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i=a}}$
• The sample-average estimates converge to the true values if the action is taken an infinite number of times:
  $\lim_{N_t(a) \to \infty} Q_t(a) = q_*(a)$,
  where $N_t(a)$ is the number of times action a has been taken by time t
ε-Greedy Action Selection
• In greedy action selection, you always exploit
• In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again)
• This is perhaps the simplest way to balance exploration and exploitation
A simple bandit algorithm

Initialize, for a = 1 to k:
    Q(a) ← 0
    N(a) ← 0
Repeat forever:
    A ← argmax_a Q(a)    with probability 1 − ε  (breaking ties randomly)
         a random action  with probability ε
    R ← bandit(A)
    N(A) ← N(A) + 1
    Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
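A minimal runnable Python version of this pseudocode (a sketch, not the book's code; the `bandit` reward function is assumed to be supplied by the caller):

```python
import random

def run_bandit(bandit, k, steps, epsilon):
    """ε-greedy with incremental sample-average estimates, as in the pseudocode above.

    `bandit(a)` is assumed to return a sampled reward for action a in 0..k-1.
    """
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # action counts
    for _ in range(steps):
        if random.random() < epsilon:
            A = random.randrange(k)                                   # explore
        else:
            best = max(Q)
            A = random.choice([a for a in range(k) if Q[a] == best])  # greedy, ties broken randomly
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                                     # incremental sample average
    return Q, N

# Illustrative use with a 3-armed Gaussian bandit:
true_values = [0.2, 1.0, -0.5]
Q, N = run_bandit(lambda a: random.gauss(true_values[a], 1.0), k=3, steps=5000, epsilon=0.1)
print(Q, N)
```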
What we learned last time
1. Multi-armed bandits are a simplification of the real problem
   • they have action and reward (a goal), but no input or sequentiality
2. A fundamental exploitation-exploration tradeoff arises in bandits
   • ε-greedy action selection is the simplest way of trading off
3. Learning action values is a key part of solution methods
4. The 10-armed testbed illustrates all
One Bandit Task from The 10-armed Testbed
• $q_*(a) \sim N(0, 1)$ for each of the 10 actions
• $R_t \sim N(q_*(a), 1)$ for the chosen action
• Run for 1000 steps
• Repeat the whole thing 2000 times with different bandit tasks
[Figure: reward distributions of the 10 actions of one sample task, showing $q_*(1), \dots, q_*(10)$]
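A sketch of how one such testbed task might be generated and sampled in Python (the function name is illustrative, not from the book):

```python
import random

def make_testbed_task(k=10):
    """One bandit task: true values q*(a) ~ N(0,1); rewards R ~ N(q*(a), 1)."""
    q_star = [random.gauss(0.0, 1.0) for _ in range(k)]
    def bandit(a):
        return random.gauss(q_star[a], 1.0)
    return q_star, bandit

# The full experiment repeats this 2000 times, running each method for 1000 steps
# on each task and averaging the resulting learning curves.
q_star, bandit = make_testbed_task()
print("optimal action:", max(range(len(q_star)), key=lambda a: q_star[a]))
```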
ε -Greedy Methods on the 10-Armed Testbed
What we learned last time
1. Multi-armed bandits are a simplification of the real problem
   • they have action and reward (a goal), but no input or sequentiality
2. The exploitation-exploration tradeoff arises in bandits
   • ε-greedy action selection is the simplest way of trading off
3. Learning action values is a key part of solution methods
4. The 10-armed testbed illustrates all
5. Learning as averaging – a fundamental learning rule
Averaging ⟶ learning rule
• To simplify notation, let us focus on one action
• We consider only its rewards, and its estimate after its first n−1 rewards:
  $Q_n \doteq \dfrac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$
• How can we do this incrementally (without storing all the rewards)?
• Could store a running sum and count (and divide), or equivalently:
  $Q_{n+1} = Q_n + \dfrac{1}{n}\bigl[ R_n - Q_n \bigr]$
• This is a standard form for learning/update rules:
  NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
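A quick numerical check of the incremental form (a sketch, not from the slides): applying the update reward by reward reproduces the ordinary sample average exactly.

```python
rewards = [3.0, -1.0, 4.0, 1.0, 5.0]

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n          # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

assert abs(Q - sum(rewards) / len(rewards)) < 1e-12
print(Q)                      # 2.4, the same as the batch sample average
```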
Derivation of incremental update

$Q_n \doteq \dfrac{R_1 + R_2 + \cdots + R_{n-1}}{n-1}$

$\begin{aligned}
Q_{n+1} &= \frac{1}{n} \sum_{i=1}^{n} R_i \\
&= \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right) \\
&= \frac{1}{n} \left( R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \right) \\
&= \frac{1}{n} \bigl( R_n + (n-1) Q_n \bigr) \\
&= \frac{1}{n} \bigl( R_n + n Q_n - Q_n \bigr) \\
&= Q_n + \frac{1}{n} \bigl[ R_n - Q_n \bigr]
\end{aligned}$
Tracking a Non-stationary Problem
• Suppose the true action values change slowly over time
  • then we say that the problem is nonstationary
• In this case, sample averages are not a good idea (Why?)
• Better is an “exponential, recency-weighted average”:
  $Q_{n+1} = Q_n + \alpha \bigl[ R_n - Q_n \bigr] = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$,
  where α is a constant step-size parameter, $0 < \alpha \le 1$
• There is bias due to $Q_1$ that becomes smaller over time
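A sketch contrasting the sample average with the constant-α update when the true value drifts (the random-walk drift model here is illustrative):

```python
import random

q_true = 0.0                 # true action value, drifting as a random walk
Q_avg, n = 0.0, 0            # sample-average estimate
Q_exp, alpha = 0.0, 0.1      # exponential recency-weighted average

for _ in range(10_000):
    q_true += random.gauss(0.0, 0.01)   # slow nonstationary drift
    R = random.gauss(q_true, 1.0)
    n += 1
    Q_avg += (R - Q_avg) / n            # weighs all rewards equally
    Q_exp += alpha * (R - Q_exp)        # weighs recent rewards more

print(q_true, Q_avg, Q_exp)             # Q_exp typically tracks q_true more closely
```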
Standard stochastic approximation convergence conditions
• To assure convergence with probability 1:
  $\sum_{n=1}^{\infty} \alpha_n(a) = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$
• e.g., $\alpha_n = \frac{1}{n}$ satisfies both conditions, whereas $\alpha_n = \frac{1}{n^2}$ does not
• if $\alpha_n = n^{-p}$, $p \in (0, 1)$, then convergence is not at the optimal rate $O(1/\sqrt{n})$
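As a worked check of these conditions (standard series facts, not shown on the slide):

```latex
\alpha_n = \tfrac{1}{n}: \qquad
  \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty,
  \qquad \sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty
  \qquad \Rightarrow \ \text{both conditions hold}

\alpha_n = \tfrac{1}{n^{2}}: \qquad
  \sum_{n=1}^{\infty} \tfrac{1}{n^{2}} < \infty
  \qquad \Rightarrow \ \text{the first condition fails}

\alpha_n = \alpha \ \text{(constant)}: \qquad
  \sum_{n=1}^{\infty} \alpha^{2} = \infty
  \qquad \Rightarrow \ \text{the second condition fails}
```

The constant-α case failing the second condition is exactly why it never fully converges, and why it can keep tracking a nonstationary problem.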
Optimistic Initial Values
• All methods so far depend on $Q_1(a)$, i.e., they are biased.
  So far we have used $Q_1(a) = 0$
• Suppose instead we initialize the action values optimistically, e.g., $Q_1(a) = 5$, on the 10-armed testbed (with $\alpha = 0.1$)
[Figure: % optimal action vs. steps, comparing the optimistic, greedy method ($Q_1 = 5$, ε = 0) with the realistic, ε-greedy method ($Q_1 = 0$, ε = 0.1)]
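A minimal sketch of the optimistic-greedy variant ($Q_1 = 5$ and $\alpha = 0.1$ follow the slide; the rest is illustrative):

```python
import random

k, alpha = 10, 0.1
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]   # one testbed task
Q = [5.0] * k                                         # optimistic initial values, Q_1(a) = 5

for _ in range(1000):
    A = max(range(k), key=lambda a: Q[a])             # purely greedy (ε = 0)
    R = random.gauss(q_star[A], 1.0)
    Q[A] += alpha * (R - Q[A])                        # constant step-size update

# Early on, every sampled reward looks disappointing next to 5, so even the
# greedy method is pushed to try all the actions before settling down.
print(Q)
```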
Upper Confidence Bound (UCB) action selection
• A clever way of reducing exploration over time
• Estimate an upper bound on the true action values
• Select the action with the largest (estimated) upper bound:
  $A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\dfrac{\log t}{N_t(a)}} \right]$
[Figure: average reward vs. steps on the 10-armed testbed, UCB (c = 2) vs. ε-greedy (ε = 0.1)]
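A sketch of UCB action selection in Python (treating untried actions as maximally urgent, i.e. selected first, follows the usual convention; the function name is illustrative):

```python
import math
import random

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(log t / N(a)) ], trying untried actions first."""
    untried = [a for a in range(len(Q)) if N[a] == 0]
    if untried:
        return random.choice(untried)
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

# Usage: inside the earlier bandit loop, replace the ε-greedy choice with
#   A = ucb_action(Q, N, t + 1)
# and keep the same incremental value update.
```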
Gradient-Bandit Algorithms
• Let $H_t(a)$ be a learned preference for taking action a:
  $\pi_t(a) \doteq \Pr\{A_t = a\} \doteq \dfrac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
  $H_{t+1}(a) \doteq H_t(a) + \alpha \bigl( R_t - \bar{R}_t \bigr) \bigl( \mathbb{1}_{A_t = a} - \pi_t(a) \bigr), \quad \forall a$
  $\bar{R}_t \doteq \dfrac{1}{t} \sum_{i=1}^{t} R_i$
[Figure: % optimal action vs. steps on the 10-armed testbed, α = 0.1 and α = 0.4, with and without the reward baseline]
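A sketch of the gradient-bandit update in Python (softmax preferences with the average-reward baseline; the environment and variable names are illustrative):

```python
import math
import random

k, alpha = 10, 0.1
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]   # one testbed task
H = [0.0] * k                                         # action preferences H_t(a)
baseline, t = 0.0, 0                                  # incremental average of rewards (R-bar)

for _ in range(1000):
    exp_H = [math.exp(h) for h in H]
    total = sum(exp_H)
    pi = [e / total for e in exp_H]                   # softmax policy π_t
    A = random.choices(range(k), weights=pi)[0]
    R = random.gauss(q_star[A], 1.0)
    t += 1
    baseline += (R - baseline) / t                    # reward baseline R-bar
    for a in range(k):
        indicator = 1.0 if a == A else 0.0
        H[a] += alpha * (R - baseline) * (indicator - pi[a])

print(max(range(k), key=lambda a: H[a]), max(range(k), key=lambda a: q_star[a]))
```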
Derivation of gradient-bandit algorithm

In exact gradient ascent:
  $H_{t+1}(a) \doteq H_t(a) + \alpha \dfrac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}, \qquad (1)$
where:
  $\mathbb{E}[R_t] \doteq \sum_b \pi_t(b)\, q_*(b)$,
so
  $\dfrac{\partial \mathbb{E}[R_t]}{\partial H_t(a)} = \dfrac{\partial}{\partial H_t(a)} \left[ \sum_b \pi_t(b)\, q_*(b) \right] = \sum_b q_*(b) \dfrac{\partial \pi_t(b)}{\partial H_t(a)} = \sum_b \bigl( q_*(b) - X_t \bigr) \dfrac{\partial \pi_t(b)}{\partial H_t(a)}$,
where $X_t$ does not depend on b, because $\sum_b \dfrac{\partial \pi_t(b)}{\partial H_t(a)} = 0$.
$\begin{aligned}
\frac{\partial \mathbb{E}[R_t]}{\partial H_t(a)}
&= \sum_b \bigl( q_*(b) - X_t \bigr) \frac{\partial \pi_t(b)}{\partial H_t(a)} \\
&= \sum_b \pi_t(b) \bigl( q_*(b) - X_t \bigr) \frac{\partial \pi_t(b)}{\partial H_t(a)} \Big/ \pi_t(b) \\
&= \mathbb{E}\!\left[ \bigl( q_*(A_t) - X_t \bigr) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t) \right] \\
&= \mathbb{E}\!\left[ \bigl( R_t - \bar{R}_t \bigr) \frac{\partial \pi_t(A_t)}{\partial H_t(a)} \Big/ \pi_t(A_t) \right],
\end{aligned}$

where here we have chosen $X_t = \bar{R}_t$ and substituted $R_t$ for $q_*(A_t)$, which is permitted because $\mathbb{E}[R_t \mid A_t] = q_*(A_t)$.

For now assume: $\dfrac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b) \bigl( \mathbb{1}_{a=b} - \pi_t(a) \bigr)$. Then:

$\begin{aligned}
&= \mathbb{E}\!\left[ \bigl( R_t - \bar{R}_t \bigr)\, \pi_t(A_t) \bigl( \mathbb{1}_{a=A_t} - \pi_t(a) \bigr) \Big/ \pi_t(A_t) \right] \\
&= \mathbb{E}\!\left[ \bigl( R_t - \bar{R}_t \bigr) \bigl( \mathbb{1}_{a=A_t} - \pi_t(a) \bigr) \right].
\end{aligned}$

$H_{t+1}(a) = H_t(a) + \alpha \bigl( R_t - \bar{R}_t \bigr) \bigl( \mathbb{1}_{a=A_t} - \pi_t(a) \bigr)$  (from (1), QED)
Thus it remains only to show that
  $\dfrac{\partial \pi_t(b)}{\partial H_t(a)} = \pi_t(b) \bigl( \mathbb{1}_{a=b} - \pi_t(a) \bigr)$.
Recall the standard quotient rule for derivatives:
  $\dfrac{\partial}{\partial x} \left[ \dfrac{f(x)}{g(x)} \right] = \dfrac{\frac{\partial f(x)}{\partial x}\, g(x) - f(x)\, \frac{\partial g(x)}{\partial x}}{g(x)^2}$.
Using this, we can write...
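For completeness, here is a sketch of that remaining step (standard softmax calculus; it is not shown on this slide). Apply the quotient rule with $f = e^{H_t(b)}$ and $g = \sum_{c} e^{H_t(c)}$, differentiating with respect to $H_t(a)$:

```latex
\frac{\partial \pi_t(b)}{\partial H_t(a)}
  = \frac{\partial}{\partial H_t(a)}
    \left[ \frac{e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}} \right]
  = \frac{\mathbb{1}_{a=b}\, e^{H_t(b)} \sum_{c=1}^{k} e^{H_t(c)}
          - e^{H_t(b)}\, e^{H_t(a)}}
         {\left( \sum_{c=1}^{k} e^{H_t(c)} \right)^{2}}
  = \mathbb{1}_{a=b}\, \pi_t(b) - \pi_t(b)\, \pi_t(a)
  = \pi_t(b) \bigl( \mathbb{1}_{a=b} - \pi_t(a) \bigr).
```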