  1. What we learned last time
     1. Intelligence is the computational part of the ability to achieve goals
        • looking deeper: 1) it's a continuum, 2) it's an appearance, 3) it varies with observer and purpose
     2. We will (probably) figure out how to make intelligent systems in our lifetimes; it will change everything
     3. But prior to that, it will probably change our careers
        • as companies gear up to take advantage of the economic opportunities
     4. This course has a demanding workload

  2. Multi-armed Bandits (Sutton and Barto, Chapter 2)
     The simplest reinforcement learning problem

  3. You are the algorithm! (bandit1)
     • Action 1: reward is always 8
       value of action 1 is q*(1) = 8
     • Action 2: 88% chance of 0, 12% chance of 100!
       value of action 2 is q*(2) = 0.88 × 0 + 0.12 × 100 = 12
     • Action 3: uniformly at random between -10 and 35
       q*(3) = (-10 + 35) / 2 = 12.5
     • Action 4: a third 0, a third 20, and a third drawn uniformly from {8, 9, ..., 18}
       q*(4) = 1/3 × 0 + 1/3 × 20 + 1/3 × 13 = 0 + 20/3 + 13/3 = 33/3 = 11
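The arithmetic above can be checked by sampling. The following is a minimal sketch (not part of the slides; the reward helper and sample count are illustrative) that estimates each q*(a) by Monte Carlo:

```python
# A minimal sketch: estimating the bandit1 action values by sampling,
# to check the hand-computed q*(1)=8, q*(2)=12, q*(3)=12.5, q*(4)=11.
import random

def reward(action):
    """Sample one reward from the bandit1 task described above."""
    if action == 1:
        return 8.0                                        # always 8
    if action == 2:
        return 100.0 if random.random() < 0.12 else 0.0   # 12% chance of 100
    if action == 3:
        return random.uniform(-10, 35)                    # uniform on [-10, 35]
    # action 4: a third 0, a third 20, a third uniform from {8, ..., 18}
    u = random.random()
    if u < 1/3:
        return 0.0
    if u < 2/3:
        return 20.0
    return float(random.randint(8, 18))

for a in (1, 2, 3, 4):
    estimate = sum(reward(a) for _ in range(100_000)) / 100_000
    print(f"q*({a}) ≈ {estimate:.2f}")   # should land near 8, 12, 12.5, 11
```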

  4. The k-armed Bandit Problem
     • On each of an infinite sequence of time steps t = 1, 2, 3, ..., you choose an action A_t from k possibilities and receive a real-valued reward R_t
     • The reward depends only on the action taken; it is identically and independently distributed (i.i.d.):
       q*(a) := E[R_t | A_t = a],  ∀a ∈ {1, ..., k}    (the true values)
     • These true values are unknown; the reward distribution is unknown
     • Nevertheless, you must maximize your total reward
     • You must both try actions to learn their values (explore), and prefer those that appear best (exploit)
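As a concrete reading of this setup, here is a minimal sketch (an assumption, not from the slides) of a stationary k-armed bandit whose rewards are i.i.d. Gaussian around unknown true values, matching the testbed used later in the deck:

```python
# A minimal k-armed bandit environment sketch (Gaussian rewards assumed).
import random

class KArmedBandit:
    def __init__(self, k=10):
        # Unknown true action values q*(a); the agent never sees these directly.
        self.q_star = [random.gauss(0, 1) for _ in range(k)]

    def step(self, action):
        """Return a reward R_t that depends only on the chosen action."""
        return random.gauss(self.q_star[action], 1)
```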

  5. The Exploration/Exploitation Dilemma
     • Suppose you form action-value estimates Q_t(a) ≈ q*(a), ∀a
     • Define the greedy action at time t as A*_t := argmax_a Q_t(a)
     • If A_t = A*_t then you are exploiting; if A_t ≠ A*_t then you are exploring
     • You can't do both, but you need to do both
     • You can never stop exploring, but maybe you should explore less with time. Or maybe not.

  6. Action-Value Methods
     • Methods that learn action-value estimates and nothing else
     • For example, estimate action values as sample averages:
       Q_t(a) := (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
               = Σ_{i=1}^{t-1} R_i · 1_{A_i = a}  /  Σ_{i=1}^{t-1} 1_{A_i = a}
     • The sample-average estimates converge to the true values if the action is taken an infinite number of times:
       lim_{N_t(a) → ∞} Q_t(a) = q*(a)
       where N_t(a) is the number of times action a has been taken by time t
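A direct (if inefficient) sketch of the sample-average estimate, assuming the history of (action, reward) pairs is available:

```python
# Sample-average estimate Q_t(a), computed exactly as in the formula above.
def sample_average(history, a):
    """history is a list of (A_i, R_i) pairs for i = 1, ..., t-1."""
    rewards = [r for (action, r) in history if action == a]
    return sum(rewards) / len(rewards) if rewards else 0.0  # default 0 if a never taken

# Usage: with history = [(1, 8.0), (2, 0.0), (1, 8.0)], sample_average(history, 1) == 8.0
```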

  7. ε-Greedy Action Selection
     • In greedy action selection, you always exploit
     • In ε-greedy, you are usually greedy, but with probability ε you instead pick an action at random (possibly the greedy action again)
     • This is perhaps the simplest way to balance exploration and exploitation

  8. A simple bandit algorithm
     Initialize, for a = 1 to k:
         Q(a) ← 0
         N(a) ← 0
     Repeat forever:
         A ← argmax_a Q(a)    with probability 1 − ε   (breaking ties randomly)
             a random action  with probability ε
         R ← bandit(A)
         N(A) ← N(A) + 1
         Q(A) ← Q(A) + (1/N(A)) [R − Q(A)]
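A minimal Python sketch of this algorithm, assuming `bandit` is a function that returns a sampled reward for the chosen action (e.g., the environment sketched earlier):

```python
# ε-greedy bandit learning with incremental sample-average updates.
import random

def epsilon_greedy_bandit(bandit, k, epsilon=0.1, steps=1000):
    Q = [0.0] * k   # action-value estimates
    N = [0] * k     # action counts
    for _ in range(steps):
        if random.random() < epsilon:
            A = random.randrange(k)                                   # explore
        else:
            best = max(Q)
            A = random.choice([a for a in range(k) if Q[a] == best])  # greedy, ties broken randomly
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]   # incremental sample-average update
    return Q
```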

  9. What we learned last time
     1. Multi-armed bandits are a simplification of the real problem
        • they have action and reward (a goal), but no input or sequentiality
     2. A fundamental exploitation-exploration tradeoff arises in bandits
        • ε-greedy action selection is the simplest way of trading off
     3. Learning action values is a key part of solution methods
     4. The 10-armed testbed illustrates all of this

  10. One Bandit Task from the 10-armed Testbed
      • True action values: q*(a) ~ N(0, 1) for each a = 1, ..., 10
      • Rewards: R_t ~ N(q*(A_t), 1)
      [Figure: reward distributions for the 10 actions of one task, each centered at its q*(a)]
      • Run for 1000 steps
      • Repeat the whole thing 2000 times with different bandit tasks
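A sketch of the testbed protocol (an illustration under the stated assumptions, not the authors' code): 2000 independent tasks, 1000 steps each, averaging the reward obtained at each step across tasks:

```python
# 10-armed testbed protocol: average per-step reward over many bandit tasks.
import random

def run_testbed(agent_step, runs=2000, steps=1000, k=10):
    avg_reward = [0.0] * steps
    for _ in range(runs):
        q_star = [random.gauss(0, 1) for _ in range(k)]   # a fresh bandit task
        Q, N = [0.0] * k, [0] * k
        for t in range(steps):
            A = agent_step(Q, N)                          # agent picks an action
            R = random.gauss(q_star[A], 1)                # R_t ~ N(q*(A), 1)
            N[A] += 1
            Q[A] += (R - Q[A]) / N[A]
            avg_reward[t] += R / runs
    return avg_reward

# Example agent: ε-greedy with ε = 0.1
def eps_greedy(Q, N, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(Q))
    best = max(Q)
    return random.choice([a for a in range(len(Q)) if Q[a] == best])

# curve = run_testbed(eps_greedy)
```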

  11. ε -Greedy Methods on the 10-Armed Testbed

  12. What we learned last time
      1. Multi-armed bandits are a simplification of the real problem
         • they have action and reward (a goal), but no input or sequentiality
      2. The exploitation-exploration tradeoff arises in bandits
         • ε-greedy action selection is the simplest way of trading off
      3. Learning action values is a key part of solution methods
      4. The 10-armed testbed illustrates all of this
      5. Learning as averaging: a fundamental learning rule

  13. Averaging ⟶ learning rule
      • To simplify notation, let us focus on one action
      • We consider only its rewards, and its estimate after its first n − 1 rewards:
        Q_n := (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
      • How can we do this incrementally (without storing all the rewards)?
      • Could store a running sum and count (and divide), or equivalently:
        Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
      • This is a standard form for learning/update rules:
        NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

  14. Derivation of incremental update
      Q_n := (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)

      Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
              = (1/n) ( R_n + Σ_{i=1}^{n−1} R_i )
              = (1/n) ( R_n + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} R_i )
              = (1/n) ( R_n + (n − 1) Q_n )
              = (1/n) ( R_n + n Q_n − Q_n )
              = Q_n + (1/n) [ R_n − Q_n ]
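A quick numerical check (not from the slides) that the incremental rule reproduces the plain sample average:

```python
# Verify: applying Q <- Q + (1/n)(R - Q) over a reward sequence gives its mean.
rewards = [8.0, 0.0, 100.0, 12.5, 20.0]

Q = 0.0
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n            # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

assert abs(Q - sum(rewards) / len(rewards)) < 1e-12
print(Q)   # ≈ 28.1, the mean of the rewards
```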

  15. Averaging ⟶ learning rule
      • To simplify notation, let us focus on one action
      • We consider only its rewards, and its estimate after its first n − 1 rewards:
        Q_n := (R_1 + R_2 + ··· + R_{n−1}) / (n − 1)
      • How can we do this incrementally (without storing all the rewards)?
      • Could store a running sum and count (and divide), or equivalently:
        Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
      • This is a standard form for learning/update rules:
        NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

  16. Tracking a Non-stationary Problem
      • Suppose the true action values change slowly over time
        • then we say that the problem is nonstationary
      • In this case, sample averages are not a good idea (Why?)
      • Better is an "exponential, recency-weighted average":
        Q_{n+1} = Q_n + α [R_n − Q_n]
                = (1 − α)^n Q_1 + Σ_{i=1}^{n} α (1 − α)^{n−i} R_i
        where α is a constant step-size parameter, 0 < α ≤ 1
      • There is a bias due to Q_1 that becomes smaller over time
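A small sketch (assumptions: a single true value drifting as a random walk) contrasting the sample average with the constant step-size estimate on a nonstationary target:

```python
# Tracking one slowly drifting value: sample average vs. constant step size.
import random

q_star = 0.0          # true value, drifting over time
Q_avg, n = 0.0, 0     # sample-average estimate
Q_const = 0.0         # constant step-size estimate
alpha = 0.1

for t in range(10_000):
    q_star += random.gauss(0, 0.01)       # slow random-walk drift
    R = random.gauss(q_star, 1)           # noisy reward
    n += 1
    Q_avg += (R - Q_avg) / n              # weighs old rewards as much as new ones
    Q_const += alpha * (R - Q_const)      # exponential recency-weighted average

print(q_star, Q_avg, Q_const)   # Q_const typically tracks q_star more closely
```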

  17. Standard stochastic approximation convergence conditions
      • To assure convergence with probability 1, the step sizes must satisfy:
        Σ_{n=1}^{∞} α_n(a) = ∞   and   Σ_{n=1}^{∞} α_n²(a) < ∞
      • α_n = 1/n satisfies both conditions; α_n = 1/n² does not
      • e.g., if α_n = n^{−p} with p ∈ (0, 1), then convergence is assured, but not at the optimal rate O(1/√n)
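As a rough numerical illustration (a sketch with an assumed true value of 1.0 and assumed schedules), here is the effect of different step-size schedules when estimating a fixed mean from noisy samples:

```python
# Step-size schedules on a stationary target: 1/n converges to the mean,
# a constant step keeps fluctuating, and 1/n^2 shrinks its steps so fast
# that the estimate stops moving near the first few samples.
import random

def run(schedule, steps=100_000):
    Q = 0.0
    for n in range(1, steps + 1):
        R = random.gauss(1.0, 1.0)        # noisy samples of a true value of 1.0
        Q += schedule(n) * (R - Q)
    return Q

print("alpha_n = 1/n   :", run(lambda n: 1 / n))        # ≈ 1.0
print("alpha_n = 0.1   :", run(lambda n: 0.1))          # noisy around 1.0
print("alpha_n = 1/n^2 :", run(lambda n: 1 / n ** 2))   # stuck near early samples
```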

  18. Optimistic Initial Values
      • All methods so far depend on Q_1(a), i.e., they are biased. So far we have used Q_1(a) = 0
      • Suppose we initialize the action values optimistically (Q_1(a) = 5), e.g., on the 10-armed testbed (with α = 0.1)
      [Figure: % optimal action over the first 1000 steps for the optimistic, greedy method (Q_1 = 5, ε = 0) vs. the realistic, ε-greedy method (Q_1 = 0, ε = 0.1)]
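A minimal sketch (not the authors' code) of the optimistic, greedy variant with Q_1(a) = 5 and constant α = 0.1:

```python
# Optimistic initialization: pure greedy selection still explores early on,
# because every untried action looks better than anything seen so far.
import random

def optimistic_greedy(q_star, steps=1000, q_init=5.0, alpha=0.1):
    k = len(q_star)
    Q = [q_init] * k                       # optimistic initial estimates
    optimal = max(range(k), key=lambda a: q_star[a])
    hits = 0
    for t in range(steps):
        best = max(Q)
        A = random.choice([a for a in range(k) if Q[a] == best])  # pure greedy
        R = random.gauss(q_star[A], 1)
        Q[A] += alpha * (R - Q[A])         # constant step size; the bias fades over time
        hits += (A == optimal)
    return hits / steps                    # fraction of optimal-action choices

# q_star = [random.gauss(0, 1) for _ in range(10)]; print(optimistic_greedy(q_star))
```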

  19. Upper Confidence Bound (UCB) action selection
      • A clever way of reducing exploration over time
      • Estimate an upper bound on the true action values
      • Select the action with the largest (estimated) upper bound:
        A_t := argmax_a [ Q_t(a) + c √( log t / N_t(a) ) ]
      [Figure: average reward vs. steps on the 10-armed testbed for UCB (c = 2) and ε-greedy (ε = 0.1)]
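A minimal sketch of UCB action selection (the handling of untried actions is an assumption; the slide does not specify it):

```python
# UCB action selection: value estimate plus an exploration bonus that shrinks
# as an action is tried more often.
import math
import random

def ucb_action(Q, N, t, c=2.0):
    scores = []
    for a in range(len(Q)):
        if N[a] == 0:
            return a                                   # try each action once first
        scores.append(Q[a] + c * math.sqrt(math.log(t) / N[a]))
    best = max(scores)
    return random.choice([a for a in range(len(Q)) if scores[a] == best])

# Inside a bandit loop: A = ucb_action(Q, N, t + 1); then update N[A], Q[A] as before.
```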

  20. Gradient-Bandit Algorithms
      • Let H_t(a) be a learned preference for taking action a:
        Pr{A_t = a} := e^{H_t(a)} / Σ_{b=1}^{k} e^{H_t(b)} =: π_t(a)
      • Update, for all a:
        H_{t+1}(a) := H_t(a) + α (1_{A_t = a} − π_t(a)) (R_t − R̄_t)
        where R̄_t := (1/t) Σ_{i=1}^{t} R_i is the average reward so far (the baseline)
      [Figure: % optimal action over 1000 steps on the 10-armed testbed for α = 0.1 and α = 0.4, with and without the baseline]
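A minimal sketch (not the authors' code) of the softmax policy and preference update, with the average reward as baseline:

```python
# Gradient-bandit algorithm: softmax over preferences H, updated toward
# actions whose rewards beat the running average baseline.
import math
import random

def gradient_bandit(reward_fn, k, alpha=0.1, steps=1000):
    H = [0.0] * k            # action preferences
    baseline = 0.0           # running average of rewards, R_bar
    for t in range(1, steps + 1):
        exp_h = [math.exp(h) for h in H]
        z = sum(exp_h)
        pi = [e / z for e in exp_h]                       # softmax policy pi_t
        A = random.choices(range(k), weights=pi)[0]       # sample A_t ~ pi_t
        R = reward_fn(A)
        baseline += (R - baseline) / t                    # incremental average R_bar_t
        for a in range(k):
            indicator = 1.0 if a == A else 0.0
            H[a] += alpha * (R - baseline) * (indicator - pi[a])
    return H
```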

  21. Derivation of gradient-bandit algorithm
      In exact gradient ascent:
        H_{t+1}(a) := H_t(a) + α ∂E[R_t] / ∂H_t(a),     (1)
      where:
        E[R_t] := Σ_b π_t(b) q*(b),
        ∂E[R_t] / ∂H_t(a) = ∂/∂H_t(a) [ Σ_b π_t(b) q*(b) ]
                          = Σ_b q*(b) ∂π_t(b) / ∂H_t(a)
                          = Σ_b (q*(b) − X_t) ∂π_t(b) / ∂H_t(a),
      where X_t does not depend on b, because Σ_b ∂π_t(b) / ∂H_t(a) = 0.

  22. ∂E[R_t] / ∂H_t(a) = Σ_b (q*(b) − X_t) ∂π_t(b) / ∂H_t(a)
                        = Σ_b π_t(b) (q*(b) − X_t) [∂π_t(b) / ∂H_t(a)] / π_t(b)
                        = E[ (q*(A_t) − X_t) [∂π_t(A_t) / ∂H_t(a)] / π_t(A_t) ]
                        = E[ (R_t − R̄_t) [∂π_t(A_t) / ∂H_t(a)] / π_t(A_t) ],
      where here we have chosen X_t = R̄_t and substituted R_t for q*(A_t), which is permitted because E[R_t | A_t] = q*(A_t).
      For now assume:
        ∂π_t(b) / ∂H_t(a) = π_t(b) (1_{a=b} − π_t(a)).
      Then:
        ∂E[R_t] / ∂H_t(a) = E[ (R_t − R̄_t) π_t(A_t) (1_{a=A_t} − π_t(a)) / π_t(A_t) ]
                          = E[ (R_t − R̄_t) (1_{a=A_t} − π_t(a)) ].
        H_{t+1}(a) = H_t(a) + α (R_t − R̄_t) (1_{a=A_t} − π_t(a)),   (from (1), QED)

  23. Thus it remains only to show that
        ∂π_t(b) / ∂H_t(a) = π_t(b) (1_{a=b} − π_t(a)).
      Recall the standard quotient rule for derivatives:
        ∂/∂x [ f(x) / g(x) ] = [ (∂f(x)/∂x) g(x) − f(x) (∂g(x)/∂x) ] / g(x)².
      Using this, we can write...
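The slide breaks off here; the remaining step, applying the quotient rule to the softmax definition of π_t(b), can be sketched as:

```latex
\begin{align*}
\frac{\partial \pi_t(b)}{\partial H_t(a)}
  &= \frac{\partial}{\partial H_t(a)}
     \left[ \frac{e^{H_t(b)}}{\sum_{c=1}^{k} e^{H_t(c)}} \right] \\
  &= \frac{\mathbb{1}_{a=b}\, e^{H_t(b)} \sum_{c=1}^{k} e^{H_t(c)}
          - e^{H_t(b)}\, e^{H_t(a)}}
          {\left( \sum_{c=1}^{k} e^{H_t(c)} \right)^{2}}
     && \text{(quotient rule)} \\
  &= \mathbb{1}_{a=b}\, \pi_t(b) - \pi_t(b)\, \pi_t(a) \\
  &= \pi_t(b)\left( \mathbb{1}_{a=b} - \pi_t(a) \right). \qquad \text{(QED)}
\end{align*}
```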
