Summary of part I: prediction and RL
Prediction is important for action selection
• The problem: prediction of future reward
• The algorithm: temporal difference learning
• Neural implementation: dopamine-dependent learning in BG
⇒ A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second-order conditioning
⇒ Compelling account of the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas
Prediction error hypothesis of dopamine: measured firing rate models the prediction error at the end of the trial (just like R-W):
δ_t = r_t − V_t, where V_t = η Σ_{i=1..t} (1−η)^(t−i) r_i
Bayer & Glimcher (2005)
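A minimal sketch of this computation (the reward schedule and learning rate η here are illustrative assumptions): the exponentially weighted average of past rewards is exactly what an incremental Rescorla–Wagner update tracks.

```python
import numpy as np

def prediction_errors(rewards, eta=0.3):
    """Trial-by-trial prediction errors delta_t = r_t - V_t, where V_t is an
    exponentially weighted average of past rewards (the value tracked by an
    incremental RW update)."""
    V = 0.0
    deltas = []
    for r in rewards:
        delta = r - V          # prediction error at end of trial
        deltas.append(delta)
        V += eta * delta       # RW update: V <- V + eta * delta
    return np.array(deltas)

# illustrative Bernoulli reward schedule
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.25, size=200)
print(prediction_errors(rewards)[:5])
```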
Global plan
• Reinforcement learning I:
  – prediction
  – classical conditioning
  – dopamine
• Reinforcement learning II:
  – dynamic programming; action selection
  – Pavlovian misbehaviour
  – vigour
• Chapter 9 of Theoretical Neuroscience
Action Selection
• Evolutionary specification (Bandler; Blanchard)
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess
Immediate Reinforcement
• stochastic policy based on action values m_L, m_R:
  P[L] = σ(β(m_L − m_R))
Indirect Actor
• use the RW rule to learn the action values m_L, m_R
• ⟨r_L⟩ = 0.05; ⟨r_R⟩ = 0.25, switched every 100 trials
[figure: evolution of p_L, p_R over trials]
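A minimal simulation sketch of the indirect actor on this switching schedule (the learning rate ε and inverse temperature β are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta = 0.1, 5.0             # assumed learning rate and inverse temperature
m = {"L": 0.0, "R": 0.0}         # action values learned by the RW rule
p_reward = {"L": 0.05, "R": 0.25}

for trial in range(400):
    if trial > 0 and trial % 100 == 0:        # switch contingencies every 100 trials
        p_reward["L"], p_reward["R"] = p_reward["R"], p_reward["L"]
    p_L = 1.0 / (1.0 + np.exp(-beta * (m["L"] - m["R"])))   # sigmoid policy on values
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    m[a] += eps * (r - m[a])                  # RW update of the chosen action's value
```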
Direct Actor
E(m) = P[L]⟨r_L⟩ + P[R]⟨r_R⟩
∂P[L]/∂m_L = β P[L] P[R],   ∂P[R]/∂m_L = −β P[L] P[R]
∂E(m)/∂m_L = β( P[L]⟨r_L⟩ − P[L](P[L]⟨r_L⟩ + P[R]⟨r_R⟩) ) = β P[L]( ⟨r_L⟩ − E(m) )
∂E(m)/∂m_L ≈ β( r − E(m) )  if L is chosen
m_L − m_R → (1−ε)(m_L − m_R) + ε a ( r − E(m) ),  with a = +1 if L is chosen, −1 if R
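A minimal sketch of the direct actor update (β, ε, the reward probabilities, and the use of a running reward average for E are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, eps = 2.0, 0.05            # assumed inverse temperature and learning rate
m_L, m_R = 0.0, 0.0              # policy parameters
E = 0.0                          # running estimate of the average reward
p_reward = {"L": 0.05, "R": 0.25}

for trial in range(1000):
    p_L = 1.0 / (1.0 + np.exp(-beta * (m_L - m_R)))
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    # stochastic gradient ascent on expected reward: push the chosen action's
    # parameter up (and the other down) in proportion to (r - E)
    g = beta * (r - E)
    if a == "L":
        m_L += eps * g
        m_R -= eps * g
    else:
        m_R += eps * g
        m_L -= eps * g
    E += eps * (r - E)           # update the average-reward baseline
```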
Direct Actor
[simulation figure]
Could we Tell? • correlate past rewards, actions with present choice • indirect actor (separate clocks): • direct actor (single clock):
Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome
Matching • income not return • approximately exponential in r • alternation choice kernel
Action at a (Temporal) Distance
[figure: states x=1, x=2, x=3]
• learning an appropriate action at x=1:
  – depends on the actions at x=2 and x=3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback
Action Selection
• start with policy: P[L; x] = σ( m_L(x) − m_R(x) )
• evaluate it: V(1), V(2), V(3)  [figure: values at x=1, x=2, x=3]
• improve it: Δm ∝ α δ
⇒ thus choose R more frequently than L
Policy
if δ > 0:
⇒ Δv: value is too pessimistic
⇒ ΔP: action is better than average
Actor/critic
[figure: actor weights m_1, m_2, m_3, …, m_n]
• dopamine signals to both motivational & motor striatum appear, surprisingly, the same
• suggestion: training both values & policies
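A minimal actor–critic sketch on a toy chain like the one above (the transition structure, rewards, and learning rates are illustrative assumptions, not the exact task from the slides): the critic learns V by TD, and the same prediction error δ trains the actor's propensities m.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, eps = 0.1, 0.1                     # assumed critic / actor learning rates

# toy deterministic tree (illustrative): from x=1 go L->2 or R->3;
# from x=2 and x=3, L/R lead to terminal outcomes with different rewards
transitions = {(1, "L"): 2, (1, "R"): 3}
rewards = {(2, "L"): 0.0, (2, "R"): 1.0, (3, "L"): 2.0, (3, "R"): 0.0}

V = {1: 0.0, 2: 0.0, 3: 0.0}              # critic: state values
m = {(x, a): 0.0 for x in (1, 2, 3) for a in ("L", "R")}   # actor: propensities

def policy(x):
    p_L = 1.0 / (1.0 + np.exp(-(m[(x, "L")] - m[(x, "R")])))  # P[L; x] = sigma(m_L - m_R)
    return "L" if rng.random() < p_L else "R"

for episode in range(2000):
    x = 1
    a = policy(x)
    x2 = transitions[(x, a)]
    delta = 0.0 + V[x2] - V[x]            # TD error (no immediate reward at x=1)
    V[x] += alpha * delta                 # critic update
    m[(x, a)] += eps * delta              # actor update: the same delta trains the policy
    a2 = policy(x2)
    r = rewards[(x2, a2)]
    delta2 = r - V[x2]                    # TD error at the terminal step
    V[x2] += alpha * delta2
    m[(x2, a2)] += eps * delta2
```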
Formally: Dynamic Programming
Variants: SARSA
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε( r_t + Q(2, u_actual) − Q(1, C) )
Morris et al, 2006
Variants: Q learning
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε( r_t + max_u Q(2, u) − Q(1, C) )
Roesch et al, 2007
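A side-by-side sketch of the two update rules (tabular form, with an assumed learning rate ε; `Q` is a dict keyed by (state, action)):

```python
def sarsa_update(Q, s, a, r, s2, a2, eps=0.1):
    """SARSA: bootstrap from the action actually taken in the next state."""
    Q[(s, a)] += eps * (r + Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, eps=0.1):
    """Q-learning: bootstrap from the best available action in the next state."""
    Q[(s, a)] += eps * (r + max(Q[(s2, u)] for u in actions) - Q[(s, a)])
```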
Summary
• prediction learning
  – Bellman evaluation
  V*(1) = E[ r_t + V*(x_{t+1}) | x_t = 1 ]
• actor-critic
  – asynchronous policy iteration
• indirect method (Q learning)
  – asynchronous value iteration
  Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
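For contrast with the sample-based rules above, a minimal dynamic-programming sketch (synchronous value iteration on an assumed toy MDP, with an assumed discount γ): the learning rules on the previous slides can be read as asynchronous, sampled versions of these sweeps.

```python
# toy MDP (illustrative): P[(s, a)] is a list of (prob, next_state, reward)
P = {
    (0, "L"): [(1.0, 1, 0.0)], (0, "R"): [(1.0, 2, 0.0)],
    (1, "L"): [(1.0, 3, 1.0)], (1, "R"): [(1.0, 3, 0.0)],
    (2, "L"): [(1.0, 3, 0.0)], (2, "R"): [(1.0, 3, 2.0)],
    (3, "L"): [(1.0, 3, 0.0)], (3, "R"): [(1.0, 3, 0.0)],   # absorbing state
}
gamma, states, actions = 0.9, [0, 1, 2, 3], ["L", "R"]
V = {s: 0.0 for s in states}

for sweep in range(100):   # Bellman optimality backups over all states
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions)
         for s in states}

# state-action values implied by the converged V
Q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
     for s in states for a in actions}
```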
Impulsivity & Hyperbolic Discounting
• humans (and animals) show impulsivity in:
  – diets
  – addiction
  – spending, …
• intertemporal conflict between short- and long-term choices
• often explained via hyperbolic discount functions
• alternative is a Pavlovian imperative to an immediate reinforcer
• framing, trolley dilemmas, etc
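To make the intertemporal conflict concrete, a small sketch of preference reversal under hyperbolic discounting (the discount parameter k and the reward amounts/delays are made-up illustrative numbers):

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolic discounting: discounted value = value / (1 + k * delay)."""
    return value / (1.0 + k * delay)

# small-soon reward vs large-late reward (illustrative amounts and delays)
small, d_small = 1.0, 1.0
large, d_large = 3.0, 6.0

# viewed from far away (add 10 time steps to both delays): the large reward wins
print(hyperbolic(small, d_small + 10), hyperbolic(large, d_large + 10))
# viewed up close: the small-soon reward now wins -- preference reversal
print(hyperbolic(small, d_small), hyperbolic(large, d_large))
```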
Direct/Indirect Pathways (Frank)
• direct: D1: GO; learn from DA increase
• indirect: D2: noGO; learn from DA decrease
• hyperdirect (STN): delays actions given strongly attractive choices
Frank • DARPP-32: D1 effect • DRD2: D2 effect
Three Decision Makers • tree search • position evaluation • situation memory
Multiple Systems in RL
• model-based RL
  – build a forward model of the task and its outcomes
  – search in the forward model (online DP)
    • optimal use of information
    • computationally ruinous
• cache-based RL
  – learn Q values, which summarize future worth
    • computationally trivial
    • bootstrap-based, so statistically inefficient
• learn both
  – select according to uncertainty (see the sketch below)
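A minimal sketch of uncertainty-based arbitration between the two controllers (everything here — how the uncertainties are obtained, the data types — is an illustrative assumption; the point is only that the controller with the lower uncertainty about its values gets to choose):

```python
def arbitrate(q_mb, q_mf, var_mb, var_mf):
    """Pick the controller whose value estimates are currently more certain.

    q_mb, q_mf     : dicts action -> estimated value (model-based / cache-based)
    var_mb, var_mf : scalar uncertainties (e.g. posterior variances) of those estimates
    """
    q = q_mb if var_mb < var_mf else q_mf
    return max(q, key=q.get)

# illustrative call: early in training the model-based system is usually more certain
action = arbitrate(q_mb={"L": 0.6, "R": 0.2}, q_mf={"L": 0.1, "R": 0.3},
                   var_mb=0.05, var_mf=0.4)
```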
Animal Canary
• OFC; dlPFC; dorsomedial striatum; BLA?
• dorsolateral striatum, amygdala
Two Systems:
Behavioural Effects
Effects of Learning • distributional value iteration • (Bayesian Q learning) • fixed additional uncertainty per step
One Outcome: a shallow tree implies goal-directed control wins
Human Canary...
[figure: actions a, b, state c, £££]
• if a → c → £££ and c …, then do more of a or b?
  – MB: b
  – MF: a (or even no effect)
Behaviour
• action values depend on both systems:
  Q_tot(x, u) = Q_MF(x, u) + β Q_MB(x, u)
• expect that β will vary by subject (but be fixed)
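A minimal sketch of how such a mixture might generate choice probabilities (the softmax temperature and the per-subject weight β are assumed free parameters to be fit):

```python
import numpy as np

def choice_probs(q_mf, q_mb, beta_mb, inv_temp=1.0):
    """Combine MF and MB action values and pass them through a softmax.

    q_mf, q_mb : arrays of action values from the two systems
    beta_mb    : per-subject weight on the model-based values
    """
    q_tot = np.asarray(q_mf) + beta_mb * np.asarray(q_mb)   # Q_tot = Q_MF + beta * Q_MB
    z = inv_temp * q_tot
    z -= z.max()                                             # numerical stability
    p = np.exp(z)
    return p / p.sum()

print(choice_probs(q_mf=[0.2, 0.5], q_mb=[0.7, 0.1], beta_mb=1.5))
```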
Neural Prediction Errors (1 → 2)
[figure: right ventral striatum (anatomical definition)]
• note that MB RL does not use this prediction error
  – training signal?
Neural Prediction Errors (1)
• right nucleus accumbens
• behaviour 1-2, not 1
Vigour
• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when/how fast/how vigorous
• free operant tasks
• real-valued DP
The model
[figure: a chain of states S_0, S_1, S_2, …; at each state, choose an action (LP, NP, Other) and a latency τ, e.g. (LP, τ_1), (LP, τ_2); each choice incurs costs — a unit cost C_u plus a vigour cost C_v/τ (how fast?) — and may be rewarded with probability P_R (goal)]
The model
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time) — average-reward RL (ARL)
Average Reward RL
Compute differential values of actions; ρ = average rewards minus costs, per unit time.
Differential value of taking action L with latency τ when in state x:
Q_{L,τ}(x) = Rewards − Costs + Future Returns − τρ
           = p_R·R − C_u − C_v/τ + V(x') − τρ
• steady-state behaviour (not learning dynamics)
(Extension of Schwartz 1993)
Average Reward Cost/Benefit Tradeoffs
1. Which action to take?
   ⇒ choose the action with the largest expected reward minus cost
2. How fast to perform it?
   • slow → delays (all) rewards
   • slow → less costly (vigour cost)
   • net rate of rewards = cost of delay (opportunity cost of time)
   ⇒ choose the rate that balances vigour and opportunity costs
• explains faster (irrelevant) actions under hunger, etc.
• masochism
Optimal response rates
[figure: experimental data vs. model simulation (Niv, Dayan, Joel, unpublished) — probability of the 1st nose poke and response rates per minute (NP, LP) as a function of seconds since reinforcement]
Optimal response rates
[figure: matching — experimental data (Herrnstein 1961): % responses on key A vs. % reinforcements on key A for pigeons A & B, against perfect matching; model simulation: % responses on lever A vs. % reinforcements on lever A]
More: # responses, interval length, amount of reward, ratio vs. interval, breaking point, temporal structure, etc.
Effects of motivation (in the model) — RR25
Q(x, u, τ) = p_r·R_u − C_u − C_v/τ + V(x') − τρ
∂Q(x, u, τ)/∂τ = C_v/τ² − ρ = 0  ⇒  τ_opt = √(C_v/ρ)
[figure: mean latencies of LP and Other under low vs. high utility — the energizing effect]
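A small numeric sketch of the optimal-latency result (the cost and rate values are made-up): raising the average net reward rate ρ — e.g. by increasing utility under hunger — shortens the optimal latency for every action, including irrelevant ones, which is the energizing effect.

```python
import math

def optimal_latency(c_v, rho):
    """tau_opt = sqrt(C_v / rho): set dQ/dtau = C_v/tau^2 - rho to zero."""
    return math.sqrt(c_v / rho)

c_v = 1.0                       # assumed vigour cost coefficient
for rho in (0.1, 0.5, 1.0):     # increasing average net reward rate (e.g. higher utility)
    print(f"rho = {rho:.1f}  ->  tau_opt = {optimal_latency(c_v, rho):.2f} s")
# higher rho -> shorter latencies for all actions: the energizing effect of motivation
```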