reinforcement learning in humans and animals


  1. reinforcement learning in humans and animals
     nathaniel daw, nyu – neuroscience; psychology; neuroeconomics
     cognition centric roundtable, stevens, may 13 2011

  2. collaborators
     NYU: Aaron Bornstein, Sara Constantino, Nick Gustafson, Jian Li, Seth Madlon-Kay, Dylan Simon, Bijan Pesaran
     Columbia: Daphna Shohamy, Elliott Wimmer
     UCL: Peter Dayan, Ben Seymour, Ray Dolan
     Berkeley: Bianca Wittmann
     U Chicago: Jeff Beeler, Xiaoji Zhuang
     Princeton: Yael Niv, Sam Gershman
     Trinity: John O'Doherty
     Tel Aviv: Tom Schonberg, Daphna Joel
     Montreal: Aaron Courville
     CMU: David Touretzky
     Austin: Ross Otto
     funding: NIMH, NIDA, NARSAD, McKnight Endowment, HFSP

  3. question
     longstanding question in psychology: what information is learned from reward?
     – law of effect (Thorndike): learn to repeat reinforced actions • dopamine
     – cognitive maps (Tolman): learn "map" of task structure; evaluate new actions online • even rats can do this

  4. new leverage on this problem: draw on computer science and economics for methods and frameworks
     1. new computational & neural tools – examine learning via trial-by-trial adjustments in behavior and neural signals
     2. new computational theories – algorithmic view – dopamine associated with "model-free" RL; "model-based" RL as an account for cognitive maps (Daw, Niv & Dayan 2005, 2006)

  5. learned decision making in humans: "bandit" tasks [figure: slowly drifting reward probabilities for each option across ~300 trials] – Daw et al. 2006; Wittmann et al. 2008; Gershman et al. 2009; Schonberg et al. 2007, 2010; Glascher et al. 2010; Li & Daw 2011
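As a rough illustration of the kind of task sketched above, a minimal Python sketch of a "restless" bandit whose reward probabilities drift slowly over trials; the number of arms, drift size, and probability bounds here are illustrative assumptions, not the published task parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def drifting_bandit(n_trials=300, n_arms=2, sigma=0.025, lo=0.0, hi=0.5):
    """Reward probabilities follow a Gaussian random walk, clipped to
    [lo, hi] -- a 'restless' bandit with slowly changing payoffs."""
    p = np.full(n_arms, (lo + hi) / 2.0)        # start every arm mid-range
    probs = np.empty((n_trials, n_arms))
    for t in range(n_trials):
        p = np.clip(p + rng.normal(0.0, sigma, n_arms), lo, hi)
        probs[t] = p
    return probs

probs = drifting_bandit()
choices = rng.integers(0, 2, size=300)                  # e.g. random play
rewards = (rng.random(300) < probs[np.arange(300), choices]).astype(int)
```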

  6. trial-by-trial analysis: experience (past choices & outcomes at trials t-1, t-2, t-3, t-4, …) → model (RL algorithm: prediction errors, predicted values, etc.) → probabilistic choice rule → predicted choice probabilities, compared against observed choice behavior: which model & parameters make the observed choices most likely?
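A minimal sketch of this fitting logic, assuming a simple delta-rule learner with a softmax choice rule; the particular model, parameter bounds, and stand-in data are illustrative, not the analysis from any specific paper:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    """Negative log likelihood of the observed choices under a
    delta-rule learner with a softmax choice rule."""
    alpha, beta = params                  # learning rate, inverse temperature
    q = np.zeros(n_arms)                  # predicted values
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * q
        logits = logits - logits.max()    # numerical stability
        logp = logits - np.log(np.exp(logits).sum())
        nll -= logp[c]                    # probabilistic choice rule (softmax)
        q[c] += alpha * (r - q[c])        # prediction-error update
    return nll

# stand-in data; in practice these come from the experiment
rng = np.random.default_rng(1)
choices = rng.integers(0, 2, size=200)
rewards = rng.integers(0, 2, size=200)

# which parameters make the observed choices most likely?
res = minimize(neg_log_likelihood, x0=[0.3, 3.0], args=(choices, rewards),
               bounds=[(0.01, 1.0), (0.01, 20.0)])
alpha_hat, beta_hat = res.x
print(f"fitted learning rate {alpha_hat:.2f}, inverse temperature {beta_hat:.2f}")
```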


  9. evaluating an action: E[V(a)] = Σ_o P(o | a) V(o) – computed over outcomes from the task structure ("model-based") vs. cached directly from reinforcement ("model-free")
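A small worked example of the distinction, with hypothetical outcome probabilities and values; the model-based value follows the formula above, while the model-free value is just a cached number:

```python
import numpy as np

# model-based evaluation of an action: E[V(a)] = sum_o P(o | a) V(o)
# the outcome set, probabilities, and values below are illustrative
P_o_given_a = {"a1": np.array([0.7, 0.2, 0.1]),   # learned transition model
               "a2": np.array([0.1, 0.3, 0.6])}
V_o = np.array([1.0, 0.0, 0.5])                   # current outcome values

model_based_value = {a: float(p @ V_o) for a, p in P_o_given_a.items()}

# model-free evaluation: read out a value cached from past reinforcement,
# with no reference to the transition model
cached_value = {"a1": 0.40, "a2": 0.55}

print(model_based_value)   # {'a1': 0.75, 'a2': 0.4}
```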

  10. rat version (Balleine, Daw & O'Doherty, 2009) [figure: lever presses per minute for valued vs. devalued outcomes after moderate vs. extensive training (Holland, 2004)] – two behavioral modes: devaluation-sensitive ("goal directed") and devaluation-insensitive ("habitual"); neurally dissociable with lesions (Dickinson, Balleine, Killcross) → dual systems view

  11. task [figure: two-stage decision task; each top-stage choice leads to one of two bottom-stage states with 70% / 30% probability; bottom-stage options pay off with probabilities of e.g. 26%, 57%, 41%, 28%, all slowly changing] (Daw, Gershman, Seymour, et al., Neuron 2011)

  12. question does choice behavior respect sequential structure?

  13. idea: how does bottom-stage feedback affect top-stage choices? example: a rare (30%) transition at the top level, followed by a win – which top-stage action is now favored?

  14. predictions – direct reinforcement: ignores transition structure; model-based planning: respects transition structure (see the sketch below)
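A small sketch of the contrasting predictions for the example above (rare transition followed by a win); the transition probabilities and learning rate are hypothetical:

```python
import numpy as np

# one illustrative trial: first-stage action 0 is taken, a RARE (30%)
# transition leads to second-stage state 1, and the outcome is rewarded
T = np.array([[0.7, 0.3],    # P(second-stage state | first-stage action 0)
              [0.3, 0.7]])   # P(second-stage state | first-stage action 1)
alpha = 0.5
chosen, rare_state, reward = 0, 1, 1.0

# direct reinforcement (model-free): credit the action actually taken,
# ignoring the transition structure
q_mf = np.zeros(2)
q_mf[chosen] += alpha * (reward - q_mf[chosen])

# model-based planning: update the value of the rewarded state, then
# propagate through the transition model, E[V(a)] = sum_s P(s | a) V(s)
v_state = np.zeros(2)
v_state[rare_state] += alpha * (reward - v_state[rare_state])
q_mb = T @ v_state

print("direct reinforcement favours action", q_mf.argmax())   # 0: repeat
print("model-based planning favours action", q_mb.argmax())   # 1: switch
```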

  15. data: 17 subjects × 201 trials each; mixed-effects logit: reward p < 1e-8, reward × rare p < 5e-5 [figure: stay probabilities showing components of both reinforcement and planning] → results reject pure reinforcement models and suggest a mixture of planning and reinforcement processes (Daw, Gershman, Seymour, et al., Neuron 2011)
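A simplified, non-hierarchical sketch of this kind of analysis on stand-in data; the published result uses a mixed-effects logit, whose random-effects structure is omitted here:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# stand-in trial-level data: did the subject repeat ("stay" with) the previous
# top-stage choice, was the previous outcome rewarded, and was the previous
# transition rare? (real data would come from the two-stage task)
rng = np.random.default_rng(2)
n = 17 * 201
df = pd.DataFrame({
    "stay":   rng.integers(0, 2, n),
    "reward": rng.integers(0, 2, n),
    "rare":   rng.integers(0, 2, n),
})

# pure reinforcement predicts only a main effect of reward on staying;
# model-based planning additionally predicts a reward x rare interaction
fit = smf.logit("stay ~ reward * rare", data=df).fit(disp=0)
print(fit.summary())
```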

  16. dual task [figure: stay probabilities in single-task vs. dual-task conditions] – dual × reward: p < 5e-7; dual × reward × rare: p < .05 (Otto, Gershman, Markman)

  17. neural analysis – behavior incorporates model knowledge, not just TD; want to ask the same question neurally: can we dissociate multiple neural systems underlying behavior? • in particular, can we show subcortical systems are dumb?

  18. dopamine & RL (Schultz et al. 1997) (Daw et al. 2011)

  19. fMRI analysis – hypothesis: striatal "error" signals are solely reinforcement driven. 1. generate candidate error signals assuming TD; 2. an additional regressor captures how this signal would be changed for errors relative to values computed by planning: net signal = TD error + β · (change due to forward planning) – estimate β
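A schematic sketch of the estimation step, using simulated stand-in signals in place of the real TD and planning regressors (and omitting HRF convolution and the full fMRI GLM):

```python
import numpy as np

# hypothetical per-trial signals; real regressors would be built from the
# fitted task model and convolved with an HRF before entering the fMRI GLM
rng = np.random.default_rng(3)
n_trials = 201
td_error = rng.normal(size=n_trials)                          # step 1: TD errors
mb_error = td_error + rng.normal(scale=0.5, size=n_trials)    # errors w.r.t. planned values
delta = mb_error - td_error                                   # step 2: change under planning

# stand-in "neural" signal generated as: net = TD error + beta * delta + noise
true_beta = 0.6
signal = td_error + true_beta * delta + rng.normal(scale=0.3, size=n_trials)

# estimate beta by regressing the signal on [intercept, TD error, delta]
X = np.column_stack([np.ones(n_trials), td_error, delta])
coef, *_ = np.linalg.lstsq(X, signal, rcond=None)
print("estimated beta:", round(coef[2], 2))
```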

  20. fMRI analysis: net signal = TD error + β · (change due to forward planning) [figure: striatal effect, p < .05 cluster] → contrary to theories: even striatal error signals incorporate knowledge of task structure (Daw, Gershman, Seymour, et al., Neuron 2011)

  21. variation across subjects: subjects differ in degree of model usage; net signal = TD error + β · (change due to planning) → compare behavioral & neural estimates of β
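A minimal sketch of that comparison on hypothetical per-subject estimates; the variable names and stand-in numbers are assumptions for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical per-subject estimates of model usage:
# w_behav  -- planning weight fit to each subject's choices
# beta_nrl -- planning coefficient fit to each subject's striatal signal
rng = np.random.default_rng(5)
n_subjects = 17
w_behav = rng.uniform(0, 1, n_subjects)
beta_nrl = w_behav + rng.normal(scale=0.3, size=n_subjects)   # stand-in data

r, p = pearsonr(w_behav, beta_nrl)
print(f"behavioral vs. neural model usage: r = {r:.2f}, p = {p:.3f}")
```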

  22. variation across subjects: subjects differ in degree of model usage; net signal = TD error + β · (change due to planning) [figure: striatal effect, p < .05 SVC]

  23. average signal, R NAcc, start of trial: • interaction not significant • but size of interaction covaries with behavioral model usage (p = .02)

  24. thoughts – can distinguish multiple learned representations in humans • neurally more intertwined than expected; related areas: self control (drugs, dieting, savings, etc.); learning in multiplayer interactions (games) • equilibrium vs. equilibration • do we learn about actions or about opponents?

  25. p-beauty contest • fast equilibration with repeated play; most subjects never reinforced [figure: guesses of Singaporean undergrads across rounds – Ho et al. 1998]
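To illustrate the equilibration, a small simulation of a p-beauty contest in which players best-respond to the previous round's mean; the p value and the best-response rule are illustrative assumptions, not the learning model fit to the data:

```python
import numpy as np

# p-beauty contest: each player guesses in [0, 100]; the winner is whoever
# comes closest to p * (mean guess). Best-responding to last round's mean
# drives guesses toward the Nash equilibrium at 0 within a few rounds,
# even though only one player per round is ever "reinforced" by winning.
rng = np.random.default_rng(4)
p = 2.0 / 3.0
guesses = rng.uniform(0, 100, size=30)          # round 1: arbitrary guesses

for rnd in range(1, 8):
    target = p * guesses.mean()
    print(f"round {rnd}: mean guess = {guesses.mean():5.1f}, target = {target:5.1f}")
    guesses = np.full_like(guesses, target)     # everyone best-responds next round
```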

  26. RPS • do subjects learn by reinforcement? • best respond to reinforcement? • best respond to that? (Hampton et al, 2008)

  27. conclusions
     0. use of computational models to quantify phenomena & distinctions for neural study
     1. can leverage this to distinguish different sorts of learning, trial-by-trial – beginning to map neural substrates
     2. implications for self control, economic interactions
