Summary of part I: prediction and RL




  1. Summary of part I: prediction and RL
Prediction is important for action selection
• The problem: prediction of future reward
• The algorithm: temporal difference learning
• Neural implementation: dopamine dependent learning in BG
⇒ A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second order conditioning
⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas

  2. Prediction error hypothesis of dopamine
Measured firing rate vs. model prediction error at the end of the trial: δ_t = r_t − V_t (just like R-W), with V_t = η Σ_{i=1}^{t} (1 − η)^{t−i} r_i.
Bayer & Glimcher (2005)
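As a sanity check on the formula, here is a minimal Python sketch (an illustration, not code from the slides): the value estimate is updated Rescorla-Wagner style, which makes V_t exactly the exponentially weighted sum above, and δ_t is the error against the pre-update estimate. The Bernoulli reward schedule and the value of η are arbitrary choices.

```python
import numpy as np

def dopamine_prediction_errors(rewards, eta=0.3):
    """End-of-trial prediction errors delta_t = r_t - V_t, where V is updated
    Rescorla-Wagner style, so after trial t:
    V = eta * sum_{i<=t} (1 - eta)**(t - i) * r_i  (the slide's weighted sum)."""
    V = 0.0
    deltas = []
    for r in rewards:
        delta = r - V          # error against the current (pre-update) estimate
        deltas.append(delta)
        V += eta * delta       # equivalently V <- (1 - eta) * V + eta * r
    return np.array(deltas)

# example: rewards drawn from an arbitrary Bernoulli schedule
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.4, size=200)
print(dopamine_prediction_errors(rewards)[:10])
```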

  3. Global plan
• Reinforcement learning I:
– prediction
– classical conditioning
– dopamine
• Reinforcement learning II:
– dynamic programming; action selection
– Pavlovian misbehaviour
– vigour
• Chapter 9 of Theoretical Neuroscience

  4. Action Selection
• Evolutionary specification
• Immediate reinforcement:
– leg flexion
– Thorndike puzzle box
– pigeon; rat; human matching
• Delayed reinforcement:
– these tasks
– mazes
– chess
(Bandler; Blanchard)

  5. Immediate Reinforcement
• stochastic policy: P[L] = σ( β (m_L − m_R) )
• based on action values m_L, m_R

  6. Indirect Actor
Use the RW rule on the action values; reward probabilities r_L = 0.05, r_R = 0.25, switched every 100 trials.
(figure: choice probabilities p_L, p_R across trials)
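A minimal sketch of such an indirect actor, assuming a sigmoid policy over the difference of the learned values (inverse temperature β) and a delta-rule update for the chosen arm only; the learning rate, β and trial count are illustrative, not the slides' values.

```python
import numpy as np

def indirect_actor(n_trials=400, eps=0.1, beta=5.0, seed=0):
    """Indirect actor: learn action values m_L, m_R with the Rescorla-Wagner
    rule and choose stochastically via a sigmoid of their difference.
    Reward probabilities (r_L, r_R) = (0.05, 0.25) swap every 100 trials."""
    rng = np.random.default_rng(seed)
    m = np.zeros(2)                      # action values for [L, R]
    p_reward = np.array([0.05, 0.25])
    p_left = []
    for t in range(n_trials):
        if t > 0 and t % 100 == 0:
            p_reward = p_reward[::-1]    # switch every 100 trials
        pL = 1.0 / (1.0 + np.exp(-beta * (m[0] - m[1])))   # P[L]
        a = 0 if rng.random() < pL else 1
        r = float(rng.random() < p_reward[a])
        m[a] += eps * (r - m[a])         # RW update for the chosen action only
        p_left.append(pL)
    return np.array(p_left)

print(indirect_actor()[::50])            # P[L] tracks the better arm after each switch
```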

  7. Direct Actor
E(m) = P[L] r_L + P[R] r_R
∂P[L]/∂m_L = β P[L] P[R] = −∂P[R]/∂m_L
∂E(m)/∂m_L = β ( P[L] r_L − P[L] ( P[L] r_L + P[R] r_R ) )
= β P[L] ( r_L − E(m) )
≈ β ( r_L − E(m) )  if L is chosen
update: m_L − m_R → (1 − ε)(m_L − m_R) + ε a ( r − E(m) ), with a = ±1 for choosing L/R
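A sketch of the corresponding direct actor, using the sampled-gradient update above with a = ±1 for L/R and a running reward average standing in for E(m); the bandit probabilities and step sizes are illustrative assumptions, not the slides' values.

```python
import numpy as np

def direct_actor(n_trials=400, eps=0.05, beta=2.0, seed=0):
    """Direct actor: adjust the policy parameter d = m_L - m_R directly,
    using the sampled gradient sign * (r - rbar), sign = +1/-1 for L/R.
    rbar is a running average standing in for E(m)."""
    rng = np.random.default_rng(seed)
    d, rbar = 0.0, 0.0                    # policy parameter and reward baseline
    p_reward = np.array([0.05, 0.25])     # same bandit as the indirect actor
    p_left = []
    for t in range(n_trials):
        pL = 1.0 / (1.0 + np.exp(-beta * d))
        a = 0 if rng.random() < pL else 1
        r = float(rng.random() < p_reward[a])
        sign = 1.0 if a == 0 else -1.0
        d = (1 - eps) * d + eps * sign * (r - rbar)   # the slide's update rule
        rbar += 0.1 * (r - rbar)                      # track average reward
        p_left.append(pL)
    return np.array(p_left)

print(direct_actor()[-5:])    # P[L] should settle low, since R pays more often
```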

  8. Direct Actor

  9. Could we Tell? • correlate past rewards, actions with present choice • indirect actor (separate clocks): • direct actor (single clock):

  10. Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome

  11. Matching • income not return • approximately exponential in r • alternation choice kernel

  12. Action at a (Temporal) Distance
(figure: three-state chain, x=1, x=2, x=3)
• learning an appropriate action at x=1:
– depends on the actions at x=2 and x=3
– gains no immediate feedback
• idea: use prediction as surrogate feedback

  13. Action Selection
start with policy: P[L; x] = σ( m_L(x) − m_R(x) )
evaluate it: V(1), V(2), V(3)
improve it: Δm ∝ α δ
(figure: the x=1, x=2, x=3 tree with example values/updates 0.025, −0.175, −0.125, 0.125)
thus choose R more frequently than L

  14. Policy
δ > 0 if:
• the value is too pessimistic ⇒ Δv
• the action is better than average ⇒ ΔP
(figure: x=1, x=2, x=3)

  15. actor/critic
(figure: actor weights m_1, m_2, m_3, …, m_n)
Dopamine signals to both motivational & motor striatum appear, surprisingly, the same.
Suggestion: training both values & policies.
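To make the "training both values & policies" suggestion concrete, here is a small actor-critic sketch on a three-state tree like the one in the preceding slides; the terminal rewards, learning rate and episode count are invented for illustration and are not the slides' numbers.

```python
import numpy as np

def softmax(m):
    """Choice probabilities for the two actions (L, R) at one state."""
    e = np.exp(m - m.max())
    return e / e.sum()

def actor_critic(n_episodes=2000, alpha=0.1, seed=0):
    """TD actor-critic on a tiny tree: from x=1, L leads to x=2 and R to x=3;
    a second L/R choice there is rewarded and ends the episode.
    The reward table below is a hypothetical illustration."""
    rng = np.random.default_rng(seed)
    rewards = {(2, 0): 0.0, (2, 1): 1.0, (3, 0): 2.0, (3, 1): 0.0}  # hypothetical
    V = np.zeros(4)          # critic: V[1], V[2], V[3]
    M = np.zeros((4, 2))     # actor: m_L(x), m_R(x)
    for _ in range(n_episodes):
        x = 1
        while True:
            p = softmax(M[x])
            a = rng.choice(2, p=p)
            if x == 1:
                x_next, r, done = (2 if a == 0 else 3), 0.0, False
            else:
                x_next, r, done = 0, rewards[(x, a)], True
            delta = r + (0.0 if done else V[x_next]) - V[x]   # TD error
            V[x] += alpha * delta                             # train the critic
            M[x] += alpha * delta * (np.eye(2)[a] - p)        # train the actor
            if done:
                break
            x = x_next
    return V, M

V, M = actor_critic()
print(V[1:], softmax(M[1]))   # P[R] at x=1 should exceed P[L] with these rewards
```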

  16. Formally: Dynamic Programming

  17. Variants: SARSA
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε ( r + Q(2, u_actual) − Q(1, C) )
Morris et al., 2006

  18. Variants: Q learning
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε ( r + max_u Q(2, u) − Q(1, C) )
Roesch et al., 2007
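The two variants differ only in the bootstrap term. A minimal sketch (undiscounted, as in the slides' equations; the nested-dict Q table and the numbers in the usage example are stand-ins):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, eps=0.1):
    """SARSA: bootstrap on the action actually taken next (u_actual)."""
    Q[s][a] += eps * (r + Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, eps=0.1):
    """Q-learning: bootstrap on the best available next action (max over u)."""
    Q[s][a] += eps * (r + max(Q[s_next].values()) - Q[s][a])

# usage with a toy two-state Q table, mirroring Q(1, C) and Q(2, u)
Q = {1: {'L': 0.0, 'R': 0.0}, 2: {'L': 0.0, 'R': 0.0}}
sarsa_update(Q, 1, 'L', 0.0, 2, 'R')
q_learning_update(Q, 1, 'L', 0.0, 2)
print(Q)
```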

  19. Summary
• prediction learning
– Bellman evaluation
• actor-critic
– asynchronous policy iteration
• indirect method (Q learning)
– asynchronous value iteration
V*(1) = E[ r_t + V*(x_{t+1}) | x_t = 1 ]
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
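For the asynchronous value iteration bullet, a small sketch of in-place Bellman backups; the transition and reward tables in the usage example are a hypothetical absorbing-state MDP, invented for illustration.

```python
import numpy as np

def async_value_iteration(P, R, n_sweeps=100):
    """Asynchronous value iteration: sweep the states in place, backing up
    V(x) = max_u E[r + V(x')] as each state is visited (undiscounted, so an
    absorbing terminal state with zero reward is assumed).
    P[u] is a transition matrix, R[u] a vector of expected rewards."""
    n_states = next(iter(P.values())).shape[0]
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        for x in range(n_states):
            V[x] = max(R[u][x] + P[u][x] @ V for u in P)
    return V

# toy example: states 0 -> 1 -> 2 (absorbing), two actions with different rewards
P = {'a': np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], float),
     'b': np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], float)}
R = {'a': np.array([0.0, 1.0, 0.0]), 'b': np.array([0.5, 0.2, 0.0])}
print(async_value_iteration(P, R))   # -> [1.5, 1.0, 0.0]
```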

  20. Impulsivity & Hyperbolic Discounting
• humans (and animals) show impulsivity in:
– diets
– addiction
– spending, …
• intertemporal conflict between short and long term choices
• often explained via hyperbolic discount functions
• alternative is Pavlovian imperative to an immediate reinforcer
• framing, trolley dilemmas, etc.
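A tiny worked example of why hyperbolic discounting produces the intertemporal conflict (preference reversal) that exponential discounting cannot; the reward magnitudes, delays, k and γ are arbitrary illustrations.

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolic discounting: value / (1 + k * delay)."""
    return value / (1.0 + k * delay)

def exponential(value, delay, gamma=0.7):
    """Exponential discounting: value * gamma ** delay."""
    return value * gamma ** delay

# small-soon reward (2 at delay d) vs. large-late reward (5 at delay d + 3)
for d in (0, 4):
    print(f"delay {d}: hyperbolic  SS={hyperbolic(2, d):.2f}  LL={hyperbolic(5, d + 3):.2f}")
    print(f"delay {d}: exponential SS={exponential(2, d):.2f}  LL={exponential(5, d + 3):.2f}")
# hyperbolic: SS wins when both are imminent (d=0), LL wins from afar (d=4);
# exponential: the ranking never flips when both delays grow by the same amount.
```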

  21. Direct/Indirect Pathways Frank • direct: D1: GO; learn from DA increase • indirect: D2: noGO; learn from DA decrease • hyperdirect (STN) delay actions given strongly attractive choices

  22. Frank • DARPP-32: D1 effect • DRD2: D2 effect

  23. Three Decision Makers • tree search • position evaluation • situation memory

  24. Multiple Systems in RL
• model-based RL
– build a forward model of the task, outcomes
– search in the forward model (online DP)
• optimal use of information
• computationally ruinous
• cache-based RL
– learn Q values, which summarize future worth
• computationally trivial
• bootstrap-based; so statistically inefficient
• learn both
– select according to uncertainty

  25. Animal Canary
• OFC; dlPFC; dorsomedial striatum; BLA?
• dorsolateral striatum, amygdala

  26. Two Systems:

  27. Behavioural Effects

  28. Effects of Learning • distributional value iteration • (Bayesian Q learning) • fixed additional uncertainty per step

  29. One Outcome shallow tree implies goal-directed control wins

  30. Human Canary...
(figure: a, b, c, £££)
• if a → c → £££ and c …, then do more of a or b?
– MB: b
– MF: a (or even no effect)

  31. Behaviour
• action values depend on both systems: Q_tot(x, u) = Q_MF(x, u) + β Q_MB(x, u)
• expect that β will vary by subject (but be fixed)
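A minimal sketch of this combination rule; the softmax mapping from Q_tot to choice probabilities and the numerical values are assumptions added for illustration, and β would in practice be fitted per subject.

```python
import numpy as np

def combined_choice_probs(q_mf, q_mb, beta_mb=1.0, beta_choice=3.0):
    """Combine model-free and model-based action values as on the slide,
    Q_tot = Q_MF + beta * Q_MB, then map to choice probabilities with a
    softmax (the choice temperature is an extra assumption)."""
    q_tot = np.asarray(q_mf) + beta_mb * np.asarray(q_mb)
    e = np.exp(beta_choice * (q_tot - q_tot.max()))
    return e / e.sum()

# e.g. two first-stage actions a, b with conflicting MF and MB values
print(combined_choice_probs(q_mf=[0.6, 0.4], q_mb=[0.3, 0.7], beta_mb=2.0))
```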

  32. Neural Prediction Errors (1 → 2)
(figure: R ventral striatum, anatomical definition)
• note that MB RL does not use this prediction error – training signal?

  33. Neural Prediction Errors (1) • right nucleus accumbens behaviour 1-2, not 1

  34. Vigour
• Two components to choice:
– what:
• lever pressing
• direction to run
• meal to choose
– when/how fast/how vigorous
• free operant tasks
• real-valued DP

  35. The model
(figure: semi-Markov task — states S0, S1, S2; in each state choose an (action, τ) pair from {LP (lever press), NP (nose poke), Other}; each choice pays a unit cost C_u plus a vigour cost C_v/τ; rewards arrive with probability P_R for, e.g., (LP, τ1), (LP, τ2); the question is how fast, τ, to act)

  36. The model
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time) — average reward RL (ARL).
(figure: same semi-Markov choose-(action, τ) diagram as the previous slide)

  37. Average Reward RL
Compute differential values of actions.
ρ = average rewards minus costs, per unit time.
Differential value of taking action L with latency τ when in state x:
Q_{L,τ}(x) = Rewards − Costs + Future Returns − τρ, with Costs = C_u + C_v/τ and Future Returns = V(x′)
• steady state behaviour (not learning dynamics)
(Extension of Schwartz 1993)

  38. Average Reward Cost/Benefit Tradeoffs
1. Which action to take?
⇒ choose the action with the largest expected reward minus cost
2. How fast to perform it?
• slow → delays (all) rewards
• slow → less costly (vigour cost)
• net rate of rewards = cost of delay (opportunity cost of time)
⇒ choose the rate that balances vigour and opportunity costs
Explains faster (irrelevant) actions under hunger, etc.; masochism

  39. Optimal response rates
Niv, Dayan, Joel, unpublished
(figures: experimental data vs. model simulation — probability of the 1st nose poke, and NP/LP response rates per minute, as a function of seconds since reinforcement)

  40. Optimal response rates
(figures: experimental data, Herrnstein 1961 — % responses on key A vs. % reinforcements on key A for pigeons A and B, against perfect matching; model simulation — % responses on lever A vs. % reinforcements on lever A)
More: # responses; interval length; amount of reward; ratio vs. interval; breaking point; temporal structure; etc.

  41. Effects of motivation (in the model)
RR25 schedule:
Q(x, u, τ) = p_r R − C_u − C_v/τ − τρ + V(x′)
∂Q(x, u, τ)/∂τ = C_v/τ² − ρ = 0 ⇒ τ_opt = √(C_v/ρ)
(figure: mean latency for LP and Other at low vs. high utility — the energizing effect)
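A quick numerical check of the optimality condition: setting ∂Q/∂τ to zero gives τ_opt = √(C_v/ρ), so raising the net reward rate ρ (as higher utility does) shortens every latency, the energizing effect. The C_v and ρ values below are arbitrary illustrations.

```python
import math

def optimal_latency(c_vigour, reward_rate):
    """tau_opt = sqrt(C_v / rho): the latency balancing the vigour cost C_v / tau
    against the opportunity cost tau * rho of delaying everything else."""
    return math.sqrt(c_vigour / reward_rate)

c_v = 1.0
for rho in (0.5, 1.0, 2.0):      # low -> high utility raises the net reward rate
    print(f"rho = {rho:.1f}  ->  tau_opt = {optimal_latency(c_v, rho):.2f} s")
# higher rho -> shorter latencies for every action, relevant or not
```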
