Solving the Bellman Equation

Option 1: turn it into an update equation
V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V_k(s') \right]

Option 2: linear solution (with absorbing states)
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]
\Rightarrow \mathbf{v} = \mathbf{R}^\pi + \mathbf{T}^\pi \mathbf{v} \quad\Rightarrow\quad \mathbf{v}^\pi = (I - T^\pi)^{-1} \mathbf{R}^\pi \qquad O(|\mathcal{S}|^3)

Quentin Huys RL SWC
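As an illustration of Option 2, here is a minimal NumPy sketch of exact policy evaluation by solving the linear system. The array layout (an (A, S, S) transition tensor T, a successor-state reward vector R, an (S, A) policy matrix pi) is an assumption for illustration, not part of the slides:

```python
import numpy as np

def evaluate_policy(T, R, pi):
    """Exact policy evaluation: v^pi = (I - T^pi)^{-1} R^pi.

    T  : (A, S, S) array, T[a, s, s2] = P(s2 | s, a)       (hypothetical layout)
    R  : (S,) array, reward on entering state s2            (simplifying assumption)
    pi : (S, A) array, pi[s, a] = probability of action a in state s

    Assumes absorbing states are handled so that (I - T_pi) is invertible,
    e.g. their outgoing rows in T are zero.
    """
    # Policy-averaged transition matrix: T_pi[s, s2] = sum_a pi(a|s) T[a, s, s2]
    T_pi = np.einsum('sa,ast->st', pi, T)
    # Expected immediate reward under the policy: R_pi[s] = sum_{s2} T_pi[s, s2] R[s2]
    R_pi = T_pi @ R
    # Solve (I - T_pi) v = R_pi rather than inverting explicitly (same O(|S|^3) cost)
    S = T.shape[1]
    return np.linalg.solve(np.eye(S) - T_pi, R_pi)
```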
Policy update

Given the value function for a policy, say via the linear solution:
V^\pi(s) = \sum_a \pi(a|s) \underbrace{\sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V^\pi(s') \right]}_{Q^\pi(s, a)}

Given the values V for the policy, we can improve the policy by always choosing the best action:
\pi'(a|s) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a'} Q^\pi(s, a') \\ 0 & \text{else} \end{cases}

This is guaranteed to improve: for a deterministic policy
Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)

Quentin Huys RL SWC
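A minimal sketch of this greedy improvement step, under the same hypothetical array conventions as above (T an (A, S, S) transition tensor, R a successor-state reward vector, V a vector of state values):

```python
import numpy as np

def greedy_improvement(T, R, V):
    """Greedy policy improvement from state values V.

    Computes Q[s, a] = sum_{s2} T[a, s, s2] * (R[s2] + V[s2]) and puts all
    probability mass on the argmax action in each state.
    """
    Q = np.einsum('ast,t->sa', T, R + V)                      # (S, A) action values
    pi_new = np.zeros_like(Q)
    pi_new[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0     # deterministic greedy policy
    return pi_new, Q
```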
Policy iteration

Policy evaluation:
\mathbf{v}^\pi = (I - T^\pi)^{-1} \mathbf{R}^\pi

Greedy policy improvement:
\pi(a|s) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_a \sum_{s'} T^a_{ss'} \left[ R^a_{ss'} + V^\pi(s') \right] \\ 0 & \text{else} \end{cases}

Value iteration:
V^*(s) = \max_a \sum_{s'} T^a_{ss'} \left[ R^a_{ss'} + V^*(s') \right]

Quentin Huys RL SWC
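Putting the two pieces together, a minimal sketch of policy iteration (reusing the evaluate_policy and greedy_improvement sketches above) and of value iteration; the array conventions are the same hypothetical ones as before:

```python
import numpy as np

def policy_iteration(T, R, max_iter=100):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    A, S, _ = T.shape
    pi = np.full((S, A), 1.0 / A)                 # start from the uniform random policy
    V = evaluate_policy(T, R, pi)
    for _ in range(max_iter):
        V = evaluate_policy(T, R, pi)             # (I - T_pi)^-1 R_pi, as sketched above
        pi_new, _ = greedy_improvement(T, R, V)
        if np.array_equal(pi_new, pi):            # greedy policy unchanged -> stop
            break
        pi = pi_new
    return pi, V

def value_iteration(T, R, tol=1e-8):
    """Repeatedly apply the max-backup V(s) <- max_a sum_{s2} T[a, s, s2] (R[s2] + V(s2))."""
    V = np.zeros(T.shape[1])
    while True:
        V_new = np.einsum('ast,t->sa', T, R + V).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```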
Model-free solutions

‣ So far we have assumed knowledge of R and T
   • R and T are the 'model' of the world, so we assume full knowledge of the dynamics and rewards in the environment
‣ What if we don't know them?
‣ We can still learn from state-action-reward samples
   • we can learn R and T from them, and use our estimates to solve as above
   • alternatively, we can directly estimate V or Q

Quentin Huys RL SWC
Solving the Bellman Equation

Option 3: sampling
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

This is an expectation over policy and transition samples. So we can just draw some samples from the policy and the transitions and average over them:
a = \sum_k f(x_k) \, p(x_k) \quad\rightarrow\quad \hat{a} = \frac{1}{N} \sum_i f(x^{(i)}), \qquad x^{(i)} \sim p(x)

more about this later...

Quentin Huys RL SWC
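A tiny numerical illustration of the sampling idea, estimating an expectation by an average over draws (the distribution and function here are arbitrary examples, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Expectation of f(x) = x^2 under a standard normal p(x); the true value is 1
samples = rng.normal(size=10_000)        # x^(i) ~ p(x)
estimate = np.mean(samples ** 2)         # (1/N) sum_i f(x^(i))
print(estimate)                          # close to 1 for large N
```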
Learning from samples

[Figure: 10x10 gridworld example]

A new problem: exploration versus exploitation

Quentin Huys RL SWC
The effect of bootstrapping

[Figure: Markov (every-visit) estimates give V(B) = 3/4 but V(A) = 0, while TD bootstrapping gives V(B) = 3/4 and V(A) ≈ 3/4]

‣ Average over various bootstrappings: TD(λ)

after Sutton and Barto 1998
Quentin Huys RL SWC
Monte Carlo

‣ First visit MC
   • randomly start in all states, generate paths, average for the starting state only
V(s) = \frac{1}{N} \sum_i \left( \sum_{t'=1}^{T} r^{(i)}_{t'} \right) \Big|_{s_0 = s}

[Figure: 10x10 gridworld example]

‣ More efficient use of samples
   • Every visit MC
   • Bootstrap: TD
   • Dyna
‣ Better samples
   • on-policy versus off-policy
   • Stochastic search, UCT...

Quentin Huys RL SWC
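A minimal first-visit Monte Carlo sketch; the sample_episode(policy) helper returning a list of (state, reward) pairs is a hypothetical stand-in for the environment:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, policy, n_episodes=1000):
    """First-visit Monte Carlo: for each state, average the (undiscounted)
    return following its first visit in each sampled episode."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode(policy)              # hypothetical: [(s_t, r_t), ...]
        rewards = [r for _, r in episode]
        first_visited = set()
        for t, (s, _) in enumerate(episode):
            if s not in first_visited:
                first_visited.add(s)
                returns[s].append(sum(rewards[t:]))   # return from time t to episode end
    return {s: sum(g) / len(g) for s, g in returns.items()}
```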
Update equation: towards TD

Bellman equation:
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

Not yet converged, so it doesn't hold:
dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

And then use this to update:
V_{i+1}(s) = V_i(s) + dV(s)

Quentin Huys RL SWC
TD learning

dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

Sample:
a_t \sim \pi(a|s_t)
s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}}
r_t = R(s_{t+1}, a_t, s_t)

\delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1})

V_{i+1}(s) = V_i(s) + dV(s) \quad\rightarrow\quad V_t(s_t) = V_{t-1}(s_t) + \alpha \delta_t

Quentin Huys RL SWC
TD learning

a_t \sim \pi(a|s_t)
s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}}
r_t = R(s_{t+1}, a_t, s_t)

\delta_t = -V_t(s_t) + r_t + V_t(s_{t+1})
V_{t+1}(s_t) = V_t(s_t) + \alpha \delta_t

Quentin Huys RL SWC
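A minimal tabular TD(0) sketch of these updates; the environment interface (reset(), step(action) returning next state, reward and a done flag) and the policy function are hypothetical:

```python
import numpy as np

def td0(env, policy, n_states, n_episodes=500, alpha=0.1):
    """Tabular TD(0): V(s_t) <- V(s_t) + alpha * (r_t + V(s_{t+1}) - V(s_t)).
    Undiscounted, matching the slides; add a gamma factor for discounted tasks."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env.reset()                       # hypothetical API
        done = False
        while not done:
            a = policy(s)                     # a_t ~ pi(a | s_t)
            s_next, r, done = env.step(a)     # s_{t+1} ~ T, r_t = R(...)
            target = r if done else r + V[s_next]
            delta = target - V[s]             # TD prediction error
            V[s] += alpha * delta
            s = s_next
    return V
```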
Phasic dopamine neurone firing in Pavlovian conditioning

[Figure: phasic dopamine firing across Reward and Loss trial types]

V_{t+1}(s) = V_t(s) + \epsilon \underbrace{\left( R_t - V_t(s) \right)}_{\text{Prediction error}}

Montague et al., 1996; Schultz et al., 1997
Quentin Huys RL SWC
Phasic signals in humans

[Figure: effect sizes for unexpected reward, expected reward, and reward expected but not received]

D'Ardenne et al., 2008 Science; Zaghloul et al., 2009 Science
Quentin Huys RL SWC
Blocking

‣ Are predictions and prediction errors really causally important in learning?
   • 1: A → Reward
   • 2: A+B → Reward
   • 3: A → ?   B → ?

[Figure: conditioned responding (approach) to A and B at test]

Kamin 1968
Quentin Huys RL SWC
Causal role of phasic DA in learning

[Figure: blocking design with compound cue AX and sucrose US; optogenetic dopamine stimulation paired or unpaired with the US during compound training; test of responding (time in port) to the blocked cue X in Paired Cre+, Unpaired Cre+, Paired Cre− and control groups]

Steinberg et al., 2013 Nat. Neurosci.
Quentin Huys RL SWC
Markov Decision Problems

V(s_t) = E[ r_t + r_{t+1} + r_{t+2} + \ldots ]
       = E[ r_t ] + E[ r_{t+1} + r_{t+2} + r_{t+3} + \ldots ]
\Rightarrow V(s_t) = E[ r_t ] + V(s_{t+1})

Quentin Huys RL SWC
"Cached" solutions to MDPs

‣ Learn from experience
‣ If we have the true values V, then this is true on every trial:
V(s_t) = E[ r_t ] + V(s_{t+1})
‣ If it is not true (we don't know the true V), then we get an error:
\delta = \left( E[ r_t ] + V(s_{t+1}) \right) - V(s_t) \neq 0
‣ So now we can update with our experience:
V(s_t) \leftarrow V(s_t) + \epsilon \delta
‣ This is an average over past experience

Quentin Huys RL SWC
SARSA

‣ Do TD for state-action values instead:
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
using the sample (s_t, a_t, r_t, s_{t+1}, a_{t+1})
‣ convergence guarantees - will estimate Q^\pi(s, a)

Quentin Huys RL SWC
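A minimal tabular SARSA sketch, with the same hypothetical environment interface as the TD(0) sketch and an assumed epsilon-greedy behaviour policy:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy SARSA: update Q(s_t, a_t) towards r_t + gamma * Q(s_{t+1}, a_{t+1})."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()                           # hypothetical API
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)           # the action actually taken next (on-policy)
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```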
Q learning: off-policy

‣ Learn off-policy
   • draw from some policy
   • "only" require extensive sampling

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ \underbrace{r_t + \gamma \max_a Q(s_{t+1}, a)}_{\text{update towards optimum}} - Q(s_t, a_t) \big]

‣ will estimate Q^*(s, a)

Quentin Huys RL SWC
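The corresponding Q-learning sketch differs only in the target, which uses the max over next actions rather than the action actually taken (same hypothetical environment interface as above):

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Off-policy Q-learning: update Q(s_t, a_t) towards r_t + gamma * max_a Q(s_{t+1}, a)."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()                           # hypothetical API
        done = False
        while not done:
            # behaviour policy: epsilon-greedy (any sufficiently exploratory policy works)
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()   # max, not the sampled action
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```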
MF and MB learning of V and Q values

                                      Model-free      Model-based
Pavlovian (state) values              V_MF(s)         V_MB(s)
Instrumental (state-action) values    Q_MF(s, a)      Q_MB(s, a)

There are both Pavlovian state and instrumental state-action values, and both of these can be either model-free (cached) or model-based.

Quentin Huys RL SWC
Solutions

‣ "Cached" learning
   • average experience
   • do again what worked in the past
   • averages are cheap to compute - no computational curse
   • averages move slowly: an average over a large number of samples won't move much if you add one more

‣ "Goal-directed" or "Model-based" decisions
   • Think through possible options and choose the best
   • Requires detailed model of the world
   • Requires huge computational resources
   • Learning = building the model, extracting structure

Quentin Huys RL SWC
Pavlovian and instrumental

‣ Pavlovian model-free learning:
V_t(s) = V_{t-1}(s) + \epsilon \left( r_t - V_{t-1}(s) \right)
p(a | s, V) \propto f(a, V(s)) \, p(a | s)

‣ Instrumental model-free learning:
Q_t(a, s) = Q_{t-1}(a, s) + \epsilon \left( r_t - Q_{t-1}(a, s) \right)

Quentin Huys RL SWC
Innate evolutionary strategies are quite sophisticated...

[Figure: more survive / fewer survive depending on the innate defensive response]

Hirsch and Bolles 1980
Quentin Huys RL SWC
Unconditioned responses

• powerful
• inflexible over short timescale
• adaptive on evolutionary scale

Hershberger 1986
Quentin Huys RL SWC
Affective go/nogo task

Conditions: Go to Win, Go to Avoid, Nogo to Win, Nogo to Avoid
(Go rewarded, Go punished, Nogo rewarded, Nogo punished)

[Figure: probability correct per condition, and probability of Go over trials in each of the four conditions]

Guitart-Masip et al., 2012 J Neurosci
Quentin Huys RL SWC
Models

‣ Instrumental:
p_t(a | s) \propto Q_t(s, a)
Q_{t+1}(s, a) = Q_t(s, a) + \alpha \left( r_t - Q_t(s, a) \right)

[Figure: model fit to probability of Go in the four go/nogo conditions]

Guitart-Masip et al., 2012 J Neurosci
Quentin Huys RL SWC
Models

‣ Instrumental + bias

[Figure: model fit to probability of Go in the four go/nogo conditions]

Guitart-Masip et al., 2012 J Neurosci
Quentin Huys RL SWC
Models

‣ Instrumental + bias + Pavlovian

[Figure: model fit to probability of Go in the four go/nogo conditions]

Guitart-Masip et al., 2012 J Neurosci
Quentin Huys RL SWC
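A minimal sketch of how such a combined model can be written down: an instrumental delta rule on Q, a constant go bias, and a Pavlovian term coupling the state value V to the go action. Parameter names and the two-action softmax are illustrative assumptions, not the exact published model:

```python
import numpy as np

def p_go(Q, V_s, go_bias, pav_weight, beta=1.0):
    """Probability of 'go' in one state: softmax over action weights combining
    instrumental values Q = [Q_go, Q_nogo], a constant go bias, and a Pavlovian
    term that couples the state value V_s to the go action."""
    w_go = beta * Q[0] + go_bias + pav_weight * V_s
    w_nogo = beta * Q[1]
    return 1.0 / (1.0 + np.exp(-(w_go - w_nogo)))   # two-action softmax

def learn(Q, V_s, action, r, alpha=0.1, eps=0.1):
    """Delta-rule learning after observing outcome r: instrumental update of the
    chosen action's Q value, Pavlovian update of the state value V_s."""
    Q = Q.copy()
    Q[action] += alpha * (r - Q[action])            # instrumental (state-action) update
    V_s = V_s + eps * (r - V_s)                     # Pavlovian (state) update
    return Q, V_s
```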
Habitization

[Figure: panels showing the behavioural pattern discussed below]

Get this pattern early if lesion prelimbic cortex
Get this pattern late if lesion infralimbic cortex

Quentin Huys RL SWC
Two-step task

[Figure: two-step task (panels A-C)]

Daw et al. 2011, Neuron
Quentin Huys RL SWC