Summary of part I: prediction and RL
Prediction is important for action selection
• The problem: prediction of future reward
• The algorithm: temporal difference learning
• Neural implementation: dopamine-dependent learning in BG
⇒ A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second-order conditioning
⇒ Compelling account of the role of dopamine in classical conditioning: prediction error acts as the signal driving learning in prediction areas
Prediction error hypothesis of dopamine: measured firing rate models the prediction error at the end of the trial (just like R-W):
δ_t = r_t − V_t, where V_t = η Σ_{i=1..t} (1−η)^(t−i) r_i
Bayer & Glimcher (2005)
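A minimal sketch of this computation (the reward schedule and learning rate η here are illustrative assumptions): the exponentially weighted average of past rewards is exactly what an incremental Rescorla–Wagner update tracks.

```python
import numpy as np

def prediction_errors(rewards, eta=0.3):
    """Trial-by-trial prediction errors delta_t = r_t - V_t, where V_t is an
    exponentially weighted average of past rewards (the value tracked by an
    incremental RW update)."""
    V = 0.0
    deltas = []
    for r in rewards:
        delta = r - V          # prediction error at end of trial
        deltas.append(delta)
        V += eta * delta       # RW update: V <- V + eta * delta
    return np.array(deltas)

# illustrative Bernoulli reward schedule
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.25, size=200)
print(prediction_errors(rewards)[:5])
```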
Global plan
• Reinforcement learning I:
  – prediction
  – classical conditioning
  – dopamine
• Reinforcement learning II:
  – dynamic programming; action selection
  – Pavlovian misbehaviour
  – vigour
• Chapter 9 of Theoretical Neuroscience
Action Selection
• Evolutionary specification (Bandler; Blanchard)
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess
Immediate Reinforcement
• stochastic policy based on action values m_L, m_R:
  P[L] = σ(β(m_L − m_R))
Indirect Actor
• use the RW rule to learn the action values m_L, m_R
• ⟨r_L⟩ = 0.05; ⟨r_R⟩ = 0.25, switched every 100 trials
[figure: evolution of p_L, p_R over trials]
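A minimal simulation sketch of the indirect actor on this switching schedule (the learning rate ε and inverse temperature β are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
eps, beta = 0.1, 5.0             # assumed learning rate and inverse temperature
m = {"L": 0.0, "R": 0.0}         # action values learned by the RW rule
p_reward = {"L": 0.05, "R": 0.25}

for trial in range(400):
    if trial > 0 and trial % 100 == 0:        # switch contingencies every 100 trials
        p_reward["L"], p_reward["R"] = p_reward["R"], p_reward["L"]
    p_L = 1.0 / (1.0 + np.exp(-beta * (m["L"] - m["R"])))   # sigmoid policy on values
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    m[a] += eps * (r - m[a])                  # RW update of the chosen action's value
```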
Direct Actor
E(m) = P[L]⟨r_L⟩ + P[R]⟨r_R⟩
∂P[L]/∂m_L = β P[L] P[R],   ∂P[R]/∂m_L = −β P[L] P[R]
∂E(m)/∂m_L = β( P[L]⟨r_L⟩ − P[L](P[L]⟨r_L⟩ + P[R]⟨r_R⟩) ) = β P[L]( ⟨r_L⟩ − E(m) )
∂E(m)/∂m_L ≈ β( r − E(m) )  if L is chosen
m_L − m_R → (1−ε)(m_L − m_R) + ε a ( r − E(m) ),  with a = +1 if L is chosen, −1 if R
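A minimal sketch of the direct actor update (β, ε, the reward probabilities, and the use of a running reward average for E are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, eps = 2.0, 0.05            # assumed inverse temperature and learning rate
m_L, m_R = 0.0, 0.0              # policy parameters
E = 0.0                          # running estimate of the average reward
p_reward = {"L": 0.05, "R": 0.25}

for trial in range(1000):
    p_L = 1.0 / (1.0 + np.exp(-beta * (m_L - m_R)))
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    # stochastic gradient ascent on expected reward: push the chosen action's
    # parameter up (and the other down) in proportion to (r - E)
    g = beta * (r - E)
    if a == "L":
        m_L += eps * g
        m_R -= eps * g
    else:
        m_R += eps * g
        m_L -= eps * g
    E += eps * (r - E)           # update the average-reward baseline
```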
Direct Actor
[simulation figure]
Could we Tell? • correlate past rewards, actions with present choice • indirect actor (separate clocks): • direct actor (single clock):
Matching: Concurrent VI-VI Lau, Glimcher, Corrado, Sugrue, Newsome
Matching • income not return • approximately exponential in r • alternation choice kernel
Action at a (Temporal) Distance
[figure: states x=1, x=2, x=3]
• learning an appropriate action at x=1:
  – depends on the actions at x=2 and x=3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback
Action Selection
• start with policy: P[L; x] = σ( m_L(x) − m_R(x) )
• evaluate it: V(1), V(2), V(3)  [figure: values at x=1, x=2, x=3]
• improve it: Δm ∝ α δ
⇒ thus choose R more frequently than L
Policy
if δ > 0:
⇒ Δv: value is too pessimistic
⇒ ΔP: action is better than average
Actor/critic
[figure: actor weights m_1, m_2, m_3, …, m_n]
• dopamine signals to both motivational & motor striatum appear, surprisingly, the same
• suggestion: training both values & policies
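A minimal actor–critic sketch on a toy chain like the one above (the transition structure, rewards, and learning rates are illustrative assumptions, not the exact task from the slides): the critic learns V by TD, and the same prediction error δ trains the actor's propensities m.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, eps = 0.1, 0.1                     # assumed critic / actor learning rates

# toy deterministic tree (illustrative): from x=1 go L->2 or R->3;
# from x=2 and x=3, L/R lead to terminal outcomes with different rewards
transitions = {(1, "L"): 2, (1, "R"): 3}
rewards = {(2, "L"): 0.0, (2, "R"): 1.0, (3, "L"): 2.0, (3, "R"): 0.0}

V = {1: 0.0, 2: 0.0, 3: 0.0}              # critic: state values
m = {(x, a): 0.0 for x in (1, 2, 3) for a in ("L", "R")}   # actor: propensities

def policy(x):
    p_L = 1.0 / (1.0 + np.exp(-(m[(x, "L")] - m[(x, "R")])))  # P[L; x] = sigma(m_L - m_R)
    return "L" if rng.random() < p_L else "R"

for episode in range(2000):
    x = 1
    a = policy(x)
    x2 = transitions[(x, a)]
    delta = 0.0 + V[x2] - V[x]            # TD error (no immediate reward at x=1)
    V[x] += alpha * delta                 # critic update
    m[(x, a)] += eps * delta              # actor update: the same delta trains the policy
    a2 = policy(x2)
    r = rewards[(x2, a2)]
    delta2 = r - V[x2]                    # TD error at the terminal step
    V[x2] += alpha * delta2
    m[(x2, a2)] += eps * delta2
```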
Formally: Dynamic Programming
Variants: SARSA
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε( r_t + Q(2, u_actual) − Q(1, C) )
Morris et al, 2006
Variants: Q learning
Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
Q(1, C) → Q(1, C) + ε( r_t + max_u Q(2, u) − Q(1, C) )
Roesch et al, 2007
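A side-by-side sketch of the two update rules (tabular form, with an assumed learning rate ε; `Q` is a dict keyed by (state, action)):

```python
def sarsa_update(Q, s, a, r, s2, a2, eps=0.1):
    """SARSA: bootstrap from the action actually taken in the next state."""
    Q[(s, a)] += eps * (r + Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, eps=0.1):
    """Q-learning: bootstrap from the best available action in the next state."""
    Q[(s, a)] += eps * (r + max(Q[(s2, u)] for u in actions) - Q[(s, a)])
```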
Summary
• prediction learning
  – Bellman evaluation
  V*(1) = E[ r_t + V*(x_{t+1}) | x_t = 1 ]
• actor-critic
  – asynchronous policy iteration
• indirect method (Q learning)
  – asynchronous value iteration
  Q*(1, C) = E[ r_t + V*(x_{t+1}) | x_t = 1, u_t = C ]
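For contrast with the sample-based rules above, a minimal dynamic-programming sketch (synchronous value iteration on an assumed toy MDP, with an assumed discount γ): the learning rules on the previous slides can be read as asynchronous, sampled versions of these sweeps.

```python
# toy MDP (illustrative): P[(s, a)] is a list of (prob, next_state, reward)
P = {
    (0, "L"): [(1.0, 1, 0.0)], (0, "R"): [(1.0, 2, 0.0)],
    (1, "L"): [(1.0, 3, 1.0)], (1, "R"): [(1.0, 3, 0.0)],
    (2, "L"): [(1.0, 3, 0.0)], (2, "R"): [(1.0, 3, 2.0)],
    (3, "L"): [(1.0, 3, 0.0)], (3, "R"): [(1.0, 3, 0.0)],   # absorbing state
}
gamma, states, actions = 0.9, [0, 1, 2, 3], ["L", "R"]
V = {s: 0.0 for s in states}

for sweep in range(100):   # Bellman optimality backups over all states
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
                for a in actions)
         for s in states}

# state-action values implied by the converged V
Q = {(s, a): sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
     for s in states for a in actions}
```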
Impulsivity & Hyperbolic Discounting
• humans (and animals) show impulsivity in:
  – diets
  – addiction
  – spending, …
• intertemporal conflict between short- and long-term choices
• often explained via hyperbolic discount functions
• alternative is a Pavlovian imperative to an immediate reinforcer
• framing, trolley dilemmas, etc
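To make the intertemporal conflict concrete, a small sketch of preference reversal under hyperbolic discounting (the discount parameter k and the reward amounts/delays are made-up illustrative numbers):

```python
def hyperbolic(value, delay, k=1.0):
    """Hyperbolic discounting: discounted value = value / (1 + k * delay)."""
    return value / (1.0 + k * delay)

# small-soon reward vs large-late reward (illustrative amounts and delays)
small, d_small = 1.0, 1.0
large, d_large = 3.0, 6.0

# viewed from far away (add 10 time steps to both delays): the large reward wins
print(hyperbolic(small, d_small + 10), hyperbolic(large, d_large + 10))
# viewed up close: the small-soon reward now wins -- preference reversal
print(hyperbolic(small, d_small), hyperbolic(large, d_large))
```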
Direct/Indirect Pathways (Frank)
• direct: D1: GO; learn from DA increase
• indirect: D2: noGO; learn from DA decrease
• hyperdirect (STN): delays actions given strongly attractive choices
Frank • DARPP-32: D1 effect • DRD2: D2 effect
Three Decision Makers • tree search • position evaluation • situation memory
Multiple Systems in RL
• model-based RL
  – build a forward model of the task and its outcomes
  – search in the forward model (online DP)
    • optimal use of information
    • computationally ruinous
• cache-based RL
  – learn Q values, which summarize future worth
    • computationally trivial
    • bootstrap-based, so statistically inefficient
• learn both
  – select according to uncertainty (see the sketch below)
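A minimal sketch of uncertainty-based arbitration between the two controllers (everything here — how the uncertainties are obtained, the data types — is an illustrative assumption; the point is only that the controller with the lower uncertainty about its values gets to choose):

```python
def arbitrate(q_mb, q_mf, var_mb, var_mf):
    """Pick the controller whose value estimates are currently more certain.

    q_mb, q_mf     : dicts action -> estimated value (model-based / cache-based)
    var_mb, var_mf : scalar uncertainties (e.g. posterior variances) of those estimates
    """
    q = q_mb if var_mb < var_mf else q_mf
    return max(q, key=q.get)

# illustrative call: early in training the model-based system is usually more certain
action = arbitrate(q_mb={"L": 0.6, "R": 0.2}, q_mf={"L": 0.1, "R": 0.3},
                   var_mb=0.05, var_mf=0.4)
```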
Animal Canary
• OFC; dlPFC; dorsomedial striatum; BLA?
• dorsolateral striatum, amygdala
Two Systems:
Behavioural Effects
Effects of Learning • distributional value iteration • (Bayesian Q learning) • fixed additional uncertainty per step
One Outcome: a shallow tree implies goal-directed control wins
Human Canary...
[figure: actions a, b, state c, £££]
• if a → c → £££ and c …, then do more of a or b?
  – MB: b
  – MF: a (or even no effect)
Behaviour
• action values depend on both systems:
  Q_tot(x, u) = Q_MF(x, u) + β Q_MB(x, u)
• expect that β will vary by subject (but be fixed)
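A minimal sketch of how such a mixture might generate choice probabilities (the softmax temperature and the per-subject weight β are assumed free parameters to be fit):

```python
import numpy as np

def choice_probs(q_mf, q_mb, beta_mb, inv_temp=1.0):
    """Combine MF and MB action values and pass them through a softmax.

    q_mf, q_mb : arrays of action values from the two systems
    beta_mb    : per-subject weight on the model-based values
    """
    q_tot = np.asarray(q_mf) + beta_mb * np.asarray(q_mb)   # Q_tot = Q_MF + beta * Q_MB
    z = inv_temp * q_tot
    z -= z.max()                                             # numerical stability
    p = np.exp(z)
    return p / p.sum()

print(choice_probs(q_mf=[0.2, 0.5], q_mb=[0.7, 0.1], beta_mb=1.5))
```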
Neural Prediction Errors (1 → 2)
[figure: right ventral striatum (anatomical definition)]
• note that MB RL does not use this prediction error
  – training signal?
Neural Prediction Errors (1)
• right nucleus accumbens
• behaviour 1-2, not 1
Vigour
• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when/how fast/how vigorous
• free operant tasks
• real-valued DP
The model
[figure: a chain of states S_0, S_1, S_2, …; at each state, choose an action (LP, NP, Other) and a latency τ, e.g. (LP, τ_1), (LP, τ_2); each choice incurs costs — a unit cost C_u plus a vigour cost C_v/τ (how fast?) — and may be rewarded with probability P_R (goal)]
The model
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time) — average-reward RL (ARL)
Average Reward RL
Compute differential values of actions; ρ = average rewards minus costs, per unit time.
Differential value of taking action L with latency τ when in state x:
Q_{L,τ}(x) = Rewards − Costs + Future Returns − τρ
           = p_R·R − C_u − C_v/τ + V(x') − τρ
• steady-state behaviour (not learning dynamics)
(Extension of Schwartz 1993)
Average Reward Cost/Benefit Tradeoffs
1. Which action to take?
   ⇒ choose the action with the largest expected reward minus cost
2. How fast to perform it?
   • slow → delays (all) rewards
   • slow → less costly (vigour cost)
   • net rate of rewards = cost of delay (opportunity cost of time)
   ⇒ choose the rate that balances vigour and opportunity costs
• explains faster (irrelevant) actions under hunger, etc.
• masochism
Optimal response rates
[figure: experimental data vs. model simulation (Niv, Dayan, Joel, unpublished) — probability of the 1st nose poke and response rates per minute (NP, LP) as a function of seconds since reinforcement]
Optimal response rates
[figure: matching — experimental data (Herrnstein 1961): % responses on key A vs. % reinforcements on key A for pigeons A & B, against perfect matching; model simulation: % responses on lever A vs. % reinforcements on lever A]
More: # responses, interval length, amount of reward, ratio vs. interval, breaking point, temporal structure, etc.
Effects of motivation (in the model) — RR25
Q(x, u, τ) = p_r·R_u − C_u − C_v/τ + V(x') − τρ
∂Q(x, u, τ)/∂τ = C_v/τ² − ρ = 0  ⇒  τ_opt = √(C_v/ρ)
[figure: mean latencies of LP and Other under low vs. high utility — the energizing effect]
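A small numeric sketch of the optimal-latency result (the cost and rate values are made-up): raising the average net reward rate ρ — e.g. by increasing utility under hunger — shortens the optimal latency for every action, including irrelevant ones, which is the energizing effect.

```python
import math

def optimal_latency(c_v, rho):
    """tau_opt = sqrt(C_v / rho): set dQ/dtau = C_v/tau^2 - rho to zero."""
    return math.sqrt(c_v / rho)

c_v = 1.0                       # assumed vigour cost coefficient
for rho in (0.1, 0.5, 1.0):     # increasing average net reward rate (e.g. higher utility)
    print(f"rho = {rho:.1f}  ->  tau_opt = {optimal_latency(c_v, rho):.2f} s")
# higher rho -> shorter latencies for all actions: the energizing effect of motivation
```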