Reinforcement Learning
Quentin Huys
Division of Psychiatry and Max Planck UCL Centre for Computational Psychiatry and Ageing Research, UCL
Complex Depression, Anxiety and Trauma Service, Camden and Islington NHS Foundation Trust
Systems and ...


  1. Solving the Bellman Equation
      Option 1: turn it into an update equation
      Option 2: linear solution (with absorbing states)
      V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      v = R^\pi + T^\pi v  \Rightarrow  v^\pi = (I - T^\pi)^{-1} R^\pi    (cost O(|S|^3))

  2. Solving the Bellman Equation
      Option 1: turn it into an update equation
      V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V_k(s') ]
      Option 2: linear solution (with absorbing states)
      V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      v = R^\pi + T^\pi v  \Rightarrow  v^\pi = (I - T^\pi)^{-1} R^\pi    (cost O(|S|^3))
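
A minimal sketch of both options, not from the slides: it assumes a small MDP stored as hypothetical numpy arrays T[a, s, s'] (transition probabilities) and R[a, s, s'] (rewards) and a policy pi[s, a]; the function names are illustrative. As on the slide there is no discount factor, so the linear solve assumes that absorbing states are eventually reached.

import numpy as np

def evaluate_policy_linear(T, R, pi):
    # Option 2: solve v = R_pi + T_pi v exactly; one linear solve, O(|S|^3).
    T_pi = np.einsum('sa,asx->sx', pi, T)          # T_pi[s, s'] = sum_a pi(a|s) T[a, s, s']
    R_pi = np.einsum('sa,asx,asx->s', pi, T, R)    # expected immediate reward under pi
    n_states = T.shape[1]
    return np.linalg.solve(np.eye(n_states) - T_pi, R_pi)

def evaluate_policy_iterative(T, R, pi, n_sweeps=1000):
    # Option 1: repeatedly apply the update V_{k+1} = R_pi + T_pi V_k.
    T_pi = np.einsum('sa,asx->sx', pi, T)
    R_pi = np.einsum('sa,asx,asx->s', pi, T, R)
    V = np.zeros(T.shape[1])
    for _ in range(n_sweeps):
        V = R_pi + T_pi @ V
    return V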

  3. Policy update
      Given the value function for a policy, say via the linear solution:
      V^\pi(s) = \sum_a \pi(a|s) \underbrace{ \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ] }_{Q^\pi(s, a)}
      Given the values V for the policy, we can improve the policy by always choosing the best action:
      \pi'(a|s) = 1 if a = argmax_a Q^\pi(s, a), and 0 otherwise
      This is guaranteed to improve: for the deterministic policy \pi',
      Q^\pi(s, \pi'(s)) = max_a Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)

  4. Policy iteration
      Policy evaluation:  v^\pi = (I - T^\pi)^{-1} R^\pi
      \pi(a|s) = 1 if a = argmax_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^\pi(s') ], and 0 otherwise

  5. Policy iteration
      Policy evaluation:  v^\pi = (I - T^\pi)^{-1} R^\pi
      Greedy policy improvement:
      \pi(a|s) = 1 if a = argmax_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^\pi(s') ], and 0 otherwise

  6. Policy iteration
      Policy evaluation:  v^\pi = (I - T^\pi)^{-1} R^\pi
      Greedy policy improvement:
      \pi(a|s) = 1 if a = argmax_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^\pi(s') ], and 0 otherwise
      Value iteration:  V^*(s) = max_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^*(s') ]
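
A sketch of policy iteration and value iteration under the same assumed T[a, s, s'] / R[a, s, s'] layout as in the earlier sketch; evaluate_policy repeats the linear solve from above so the block is self-contained, and the iteration counts are arbitrary placeholders.

import numpy as np

def evaluate_policy(T, R, pi):
    # Exact policy evaluation, as in the linear-solution sketch above.
    T_pi = np.einsum('sa,asx->sx', pi, T)
    R_pi = np.einsum('sa,asx,asx->s', pi, T, R)
    return np.linalg.solve(np.eye(T.shape[1]) - T_pi, R_pi)

def greedy_policy(T, R, V):
    # Q[s, a] = sum_s' T[a, s, s'] (R[a, s, s'] + V[s'])
    Q = np.einsum('asx,asx->sa', T, R + V[None, None, :])
    pi = np.zeros_like(Q)
    pi[np.arange(Q.shape[0]), Q.argmax(axis=1)] = 1.0   # deterministic greedy choice
    return pi

def policy_iteration(T, R, n_iter=50):
    n_actions, n_states, _ = T.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from the uniform policy
    for _ in range(n_iter):
        V = evaluate_policy(T, R, pi)    # policy evaluation
        pi = greedy_policy(T, R, V)      # greedy policy improvement
    return pi, V

def value_iteration(T, R, n_sweeps=1000):
    # V*(s) = max_a sum_s' T[a, s, s'] (R[a, s, s'] + V*(s'))
    V = np.zeros(T.shape[1])
    for _ in range(n_sweeps):
        V = np.einsum('asx,asx->sa', T, R + V[None, None, :]).max(axis=1)
    return V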

  7. Model-free solutions
      ‣ So far we have assumed knowledge of R and T
        • R and T are the ‘model’ of the world, so we assume full knowledge of the dynamics and rewards in the environment
      ‣ What if we don’t know them?
      ‣ We can still learn from state-action-reward samples
        • we can learn R and T from them, and use our estimates to solve as above
        • alternatively, we can directly estimate V or Q

  8. Solving the Bellman Equation
      Option 3: sampling
      V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      So we can just draw some samples from the policy and the transitions and average over them:
      \langle f \rangle = \sum_k f(x_k) p(x_k),    x^{(i)} \sim p(x)  \rightarrow  \hat{\langle f \rangle} = \frac{1}{N} \sum_i f(x^{(i)})

  9. Solving the Bellman Equation
      Option 3: sampling
      So we can just draw some samples from the policy and the transitions and average over them:
      \langle f \rangle = \sum_k f(x_k) p(x_k),    x^{(i)} \sim p(x)  \rightarrow  \hat{\langle f \rangle} = \frac{1}{N} \sum_i f(x^{(i)})

  10. Solving the Bellman Equation
      Option 3: sampling
      This is an expectation over policy and transition samples. So we can just draw some samples from the policy and the transitions and average over them:
      \langle f \rangle = \sum_k f(x_k) p(x_k),    x^{(i)} \sim p(x)  \rightarrow  \hat{\langle f \rangle} = \frac{1}{N} \sum_i f(x^{(i)})

  11. Solving the Bellman Equation
      Option 3: sampling
      This is an expectation over policy and transition samples. So we can just draw some samples from the policy and the transitions and average over them:
      \langle f \rangle = \sum_k f(x_k) p(x_k),    x^{(i)} \sim p(x)  \rightarrow  \hat{\langle f \rangle} = \frac{1}{N} \sum_i f(x^{(i)})
      more about this later...
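
A tiny sketch of the sampling idea, not from the slides: the exact sum over a small made-up discrete distribution is compared with the average of f over draws from p.

import numpy as np

rng = np.random.default_rng(0)

# Exact expectation <f> = sum_k f(x_k) p(x_k) over a small discrete distribution
x = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
f = lambda v: v ** 2
exact = np.sum(f(x) * p)

# Sample-based estimate (1/N) sum_i f(x_i), with x_i ~ p(x)
samples = rng.choice(x, size=10_000, p=p)
estimate = f(samples).mean()

print(exact, estimate)   # the two agree up to Monte Carlo noise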

  12. Learning from samples
      [Figure: 10 x 10 gridworld]
      A new problem: exploration versus exploitation
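
Two standard ways of trading exploration against exploitation, shown as a sketch (neither appears on the slide): epsilon-greedy and softmax action selection over a vector of action values. The epsilon and beta values are placeholders.

import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(Q_s, epsilon=0.1):
    # With probability epsilon pick a random action, otherwise exploit the best one.
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_s)))
    return int(np.argmax(Q_s))

def softmax_action(Q_s, beta=2.0):
    # Higher beta -> more exploitation; lower beta -> more exploration.
    Q_s = np.asarray(Q_s, dtype=float)
    p = np.exp(beta * (Q_s - Q_s.max()))
    p /= p.sum()
    return int(rng.choice(len(Q_s), p=p))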

  13. The effect of bootstrapping
      Observed episodes: A0 B0 (once), B1 (six times), B0 (once)
      Markov / every-visit Monte Carlo:  V(B) = 3/4,  V(A) = 0
      TD (bootstrapping from V(B)):      V(B) = 3/4,  V(A) ≈ 3/4
      ‣ Average over various bootstrappings: TD(λ)
      after Sutton and Barto 1998

  14. Monte Carlo
      ‣ First-visit MC
        • randomly start in all states, generate paths, average for the starting state only:
          V(s) = \frac{1}{N} \sum_i \sum_{t'=1}^{T} r^i_{t'} \,\big|\, s_0 = s
      ‣ More efficient use of samples
        • Every-visit MC
        • Bootstrap: TD
        • Dyna
      ‣ Better samples
        • on-policy versus off-policy
        • Stochastic search, UCT...
      [Figure: 10 x 10 gridworld]
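
A sketch of first-visit Monte Carlo value estimation, assuming episodes are given as lists of (state, reward) pairs and returns are undiscounted as on the slide; the data structures are illustrative, not from the slides.

from collections import defaultdict

def first_visit_mc(episodes):
    """episodes: list of trajectories, each a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s in seen:
                continue                         # only the first visit to s counts
            seen.add(s)
            G = sum(r for _, r in episode[t:])   # undiscounted return from t onwards
            returns[s].append(G)
    # V(s) = average return observed after first visits to s
    return {s: sum(g) / len(g) for s, g in returns.items()}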

  15. Update equation: towards TD
      Bellman equation:
      V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      Not yet converged, so it doesn't hold:
      dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      And then use this to update:
      V_{i+1}(s) = V_i(s) + dV(s)

  16. TD learning
      dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]

  17. TD learning
      dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      Sample:  a_t \sim \pi(a|s_t),   s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}},   r_t = R(s_{t+1}, a_t, s_t)

  18. TD learning
      dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      Sample:  a_t \sim \pi(a|s_t),   s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}},   r_t = R(s_{t+1}, a_t, s_t)
      \delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1})

  19. TD learning
      dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]
      Sample:  a_t \sim \pi(a|s_t),   s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}},   r_t = R(s_{t+1}, a_t, s_t)
      \delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1})
      V_{i+1}(s) = V_i(s) + dV(s)
      V_t(s_t) = V_{t-1}(s_t) + \alpha \delta_t

  20. TD learning
      Sample:  a_t \sim \pi(a|s_t),   s_{t+1} \sim T^{a_t}_{s_t, s_{t+1}},   r_t = R(s_{t+1}, a_t, s_t)
      \delta_t = -V_t(s_t) + r_t + V_t(s_{t+1})
      V_{t+1}(s_t) = V_t(s_t) + \alpha \delta_t
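
A minimal TD(0) sketch of exactly this update (no discount factor, matching the slide). The (s_t, r_t, s_{t+1}) samples are assumed to come from behaving under the policy pi, and the learning rate is a placeholder.

import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1):
    # delta_t = r_t + V(s_{t+1}) - V(s_t);  V(s_t) <- V(s_t) + alpha * delta_t
    delta = r + V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

def td0(transitions, n_states, alpha=0.1):
    """transitions: iterable of (s_t, r_t, s_{t+1}) samples generated by acting
    under the policy pi (the sampling step on the slide)."""
    V = np.zeros(n_states)
    for s, r, s_next in transitions:
        td0_update(V, s, r, s_next, alpha)
    return V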

  21. Phasic dopamine neurone firing in Pavlovian conditioning
      Montague et al., 1996; Schultz et al., 1997

  22. Phasic dopamine neurone firing in Pavlovian conditioning
      [Figure panels labelled: Reward, Reward, Loss, Reward]
      Montague et al., 1996; Schultz et al., 1997

  23. Phasic dopamine neurone firing in Pavlovian conditioning
      [Figure panels labelled: Reward, Reward, Loss, Reward]
      V_{t+1}(s) = V_t(s) + \epsilon \underbrace{ (R_t - V_t(s)) }_{\text{prediction error}}
      Montague et al., 1996; Schultz et al., 1997
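
The update on this slide is a delta-rule (Rescorla-Wagner-style) learner. A one-function sketch with a made-up sequence of ten rewarded trials, showing the value approaching the reward and the prediction error shrinking; not from the slides.

def delta_rule(rewards, epsilon=0.2):
    V, trace = 0.0, []
    for R in rewards:
        delta = R - V          # prediction error (the signal phasic dopamine is proposed to report)
        V += epsilon * delta
        trace.append((V, delta))
    return trace

# Example: a reliably rewarded cue; the prediction error decays towards zero.
print(delta_rule([1.0] * 10))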

  25. Phasic signals in humans
      [Figure: effect sizes for unexpected reward, expected reward, and reward expected but not received]
      D’Ardenne et al., 2008 Science; Zaghloul et al., 2009 Science

  26. Blocking
      ‣ Are predictions and prediction errors really causally important in learning?
        • 1: A -> reward
        • 2: A+B -> reward
        • 3: A -> ?   B -> ?
      [Figure: approach responses to cues A, B and C]
      Kamin 1968

  27. Causal role of phasic DA in learning
      [Figure, Steinberg et al.: single cue (A + sucrose US, 14–15 d), then compound cue (AX + sucrose, 4 d) with paired or unpaired optical stimulation of dopamine neurons, then a 1 d test of time in port to X alone; groups PairedCre+ (n=12), UnpairedCre+ (n=11), PairedCre− (n=10), Control (n=9)]
      Steinberg et al., 2013 Nat. Neurosci.

  28. Markov Decision Problems
      V(s_t) = E[ r_t + r_{t+1} + r_{t+2} + ... ]
             = E[ r_t ] + E[ r_{t+1} + r_{t+2} + r_{t+3} + ... ]
      \Rightarrow V(s_t) = E[ r_t ] + V(s_{t+1})

  29. “Cached” solutions to MDPs

  30. “Cached” solutions to MDPs
      ‣ Learn from experience
      ‣ If we have the true values V, then this holds on every trial:
        V(s_t) = E[ r_t ] + V(s_{t+1})
      ‣ If it does not hold (we don't know the true V), then we get an error:
        \delta = ( E[ r_t ] + V(s_{t+1}) ) - V(s_t) \neq 0
      ‣ So now we can update with our experience:
        V(s_t) \leftarrow V(s_t) + \epsilon \delta
      ‣ This is an average over past experience

  31. SARSA
      ‣ Do TD for state-action values instead, using the sample (s_t, a_t, r_t, s_{t+1}, a_{t+1}):
        Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]
      ‣ convergence guarantees - will estimate Q^\pi(s, a)
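
The one-step SARSA update as a sketch, for a tabular Q stored as a numpy array indexed [state, action]; the alpha and gamma values are placeholders.

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # On-policy TD for state-action values: bootstrap from the action a_{t+1}
    # that the behaviour policy actually takes in s_{t+1}.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# usage: Q = np.zeros((n_states, n_actions)); call sarsa_update after each
# (s_t, a_t, r_t, s_{t+1}, a_{t+1}) transition generated by the current policy.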

  32. Q learning: off-policy
      ‣ Learn off-policy
        • draw from some policy
        • “only” require extensive sampling
      Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [ \underbrace{ r_t + \gamma \max_a Q(s_{t+1}, a) }_{\text{update towards optimum}} - Q(s_t, a_t) ]
      ‣ will estimate Q^*(s, a)
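
The corresponding Q-learning update as a sketch, for the same assumed tabular numpy layout; alpha and gamma are again placeholders.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Off-policy: bootstrap from the best action in s_{t+1}, whatever the
    # behaviour policy does next, so the update is towards the optimum Q*.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])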

  33. MF and MB learning of V and Q values
                                              Model-free     Model-based
      Pavlovian (state) values                V_MF(s)        V_MB(s)
      Instrumental (state-action) values      Q_MF(s, a)     Q_MB(s, a)
      There are both Pavlovian state and instrumental state-action values, and both of these can be either model-free (cached) or model-based.

  34. Solutions
      ‣ “Cached” learning
        • average experience
        • do again what worked in the past
        • averages are cheap to compute - no computational curse
        • averages move slowly: an average over a large number of samples won't move much if you add one more
      ‣ “Goal-directed” or “Model-based” decisions
        • Think through possible options and choose the best
        • Requires a detailed model of the world
        • Requires huge computational resources
        • Learning = building the model, extracting structure

  37. Pavlovian and instrumental
      ‣ Pavlovian model-free learning:
        V_t(s) = V_{t-1}(s) + \epsilon ( r_t - V_{t-1}(s) )
        p(a|s, V) \propto f(a, V(s)) \, p(a|s)
      ‣ Instrumental model-free learning:
        Q_t(a, s) = Q_{t-1}(a, s) + \epsilon ( r_t - Q_{t-1}(a, s) )

  38. Innate evolutionary strategies
      Hirsch and Bolles 1980

  39. Innate evolutionary strategies
      [Figure labels: more survive, more survive, fewer survive]
      Hirsch and Bolles 1980

  40. Innate evolutionary strategies are quite sophisticated...
      [Figure labels: more survive, more survive, fewer survive]
      Hirsch and Bolles 1980

  41. Unconditioned responses
      • powerful
      • inflexible over short timescale
      • adaptive on evolutionary scale
      Hershberger 1986

  46. Affective go/nogo task
      [Figure: 2 x 2 task design crossing action (Go / Nogo) with outcome (rewarded / avoids loss)]
      Guitart-Masip et al., 2012 J Neurosci

  47. Affective go/nogo task
      [Figure: 2 x 2 task design (Go / Nogo x rewarded / avoids loss); probability correct for Go to Win, Go to Avoid, Nogo to Win and Nogo to Avoid; Probability(Go) learning curves over ~60 trials for Go rewarded, Nogo punished, Nogo rewarded and Go punished]
      Guitart-Masip et al., 2012 J Neurosci

  49. Models
      ‣ Instrumental:
        p_t(a|s) \propto Q_t(s, a)
        Q_{t+1}(s, a) = Q_t(s, a) + \alpha ( r_t - Q_t(s, a) )
      [Figure: task design and Probability(Go) learning curves for the four conditions]
      Guitart-Masip et al., 2012 J Neurosci

  50. Models
      ‣ Instrumental + bias
      [Figure: task design and Probability(Go) learning curves for the four conditions]
      Guitart-Masip et al., 2012 J Neurosci

  52. Models
      ‣ Instrumental + bias + Pavlovian
      [Figure: task design and Probability(Go) learning curves for the four conditions]
      Guitart-Masip et al., 2012 J Neurosci
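
A sketch of how the nested models on these slides could be written down, loosely following the general form in Guitart-Masip et al. (2012): an instrumental value for each action, a constant go bias, and a Pavlovian term that couples the state value to the go response, combined through a softmax. The exact parameterization, the index convention (0 = go, 1 = nogo) and the omission of the paper's noise term are assumptions made here, not read off the slide.

import numpy as np

def action_weights(Q_s, V_s, go_bias=0.0, pav_weight=0.0):
    # Q_s: instrumental values [go, nogo] for the current state (assumed layout);
    # V_s: Pavlovian state value. The Pavlovian term boosts 'go' in appetitive
    # states (V_s > 0) and suppresses it in aversive ones (V_s < 0).
    w = np.asarray(Q_s, dtype=float).copy()
    w[0] += go_bias + pav_weight * V_s
    return w

def p_go(Q_s, V_s, go_bias=0.0, pav_weight=0.0, beta=1.0):
    # Softmax over the action weights; the nested models correspond to
    # go_bias = pav_weight = 0 (instrumental), pav_weight = 0 (+ bias), and the full model.
    w = beta * action_weights(Q_s, V_s, go_bias, pav_weight)
    w -= w.max()                      # numerical stability
    return float(np.exp(w[0]) / np.exp(w).sum())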

  55. Habitization

  58. Habitization
      Get this pattern late if infralimbic cortex is lesioned

  59. Habitization
      Get this pattern early if prelimbic cortex is lesioned
      Get this pattern late if infralimbic cortex is lesioned

  60. Habitization
      Get this pattern early if prelimbic cortex is lesioned
      Get this pattern late if infralimbic cortex is lesioned
      [Figure panel labels: PEL, ETOH]

  61. Two-step task
      [Figure: task structure, panels A, B, C]
      Daw et al. 2011, Neuron
