reinforcement learning in humans and animals
Nathaniel Daw, NYU (Neuroscience; Psychology; Neuroeconomics)
Cognition-Centric Roundtable, Stevens, May 13, 2011
collaborators
NYU: Aaron Bornstein, Sara Constantino, Nick Gustafson, Jian Li, Seth Madlon-Kay, Dylan Simon, Bijan Pesaran
Columbia: Daphna Shohamy, Elliott Wimmer
UCL: Peter Dayan, Ben Seymour, Ray Dolan
Berkeley: Bianca Wittmann
U Chicago: Jeff Beeler, Xiaoji Zhuang
Princeton: Yael Niv, Sam Gershman
Trinity: John O'Doherty
Tel Aviv: Tom Schonberg, Daphna Joel
Montreal: Aaron Courville
CMU: David Touretzky
Austin: Ross Otto
funding: NIMH, NIDA, NARSAD, McKnight Endowment, HFSP
question
longstanding question in psychology: what information is learned from reward?
– law of effect (Thorndike): learn to repeat reinforced actions • dopamine
– cognitive maps (Tolman): learn a "map" of task structure; evaluate new actions online • even rats can do this
new leverage on this problem
draw on computer science and economics for methods and frameworks
1. new computational & neural tools
   – examine learning via trial-by-trial adjustments in behavior and neural signals
2. new computational theories
   – algorithmic view
   – dopamine associated with "model-free" RL
   – "model-based" RL as an account for cognitive maps (Daw, Niv & Dayan 2005, 2006)
learned decision making in humans
"bandit" tasks: repeated choices among options whose reward probabilities drift over trials
[figure: reward probability (0 to 0.5) for each option across ~300 trials]
Daw et al. 2006; Schonberg et al. 2007, 2010; Wittmann et al. 2008; Gershman et al. 2009; Gläscher et al. 2010; Li & Daw 2011
trial-by-trial analysis
experience (past choices & outcomes: trials t-1, t-2, t-3, t-4, …)
→ model (RL algorithm: prediction errors, values, etc.; probabilistic choice rule)
→ predicted choice probabilities
compare against observed choice behavior: which model & parameters make the observed choices most likely?
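The fitting pipeline on this slide can be sketched as follows. This is a minimal illustrative version, assuming a simple Q-learning model with a softmax choice rule; the function names and parameter bounds are not from the cited papers.

```python
# Minimal sketch of trial-by-trial model fitting: find the RL model
# parameters that make the observed choices most likely.
# The Q-learning + softmax model here is an illustrative assumption.
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    """Negative log-likelihood of observed choices under
    Q-learning with a softmax choice rule."""
    alpha, beta = params            # learning rate, inverse temperature
    q = np.zeros(n_arms)            # initial action values
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q) / np.sum(np.exp(beta * q))  # softmax
        nll -= np.log(p[c])         # likelihood of the observed choice
        q[c] += alpha * (r - q[c])  # prediction-error update
    return nll

def fit(choices, rewards):
    """Maximum-likelihood parameter estimate (illustrative bounds)."""
    res = minimize(neg_log_likelihood, x0=[0.5, 1.0],
                   args=(choices, rewards),
                   bounds=[(1e-3, 1.0), (1e-3, 20.0)])
    return res.x  # estimated (alpha, beta)
```

Competing models (e.g. model-free vs. model-based learners) can then be compared by their fitted likelihoods on the same choice data.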
E[V(a)] = Σ_o P(o | a) V(o)
"model-based" vs. "model-free"
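The two routes to value contrasted on this slide can be sketched as code. The outcome probabilities and values below are made-up illustrative numbers, not data from the talk.

```python
# Two routes to value (illustrative numbers throughout).
import numpy as np

# model-based: evaluate an action by averaging outcome values under a
# learned model of the task, E[V(a)] = sum_o P(o|a) V(o)
P_o_given_a = {"left":  {"win": 0.7, "lose": 0.3},
               "right": {"win": 0.3, "lose": 0.7}}
V_o = {"win": 1.0, "lose": 0.0}

def model_based_value(a):
    return sum(p * V_o[o] for o, p in P_o_given_a[a].items())

# model-free: a cached value updated directly from received reward,
# with no reference to the task's transition structure
q = {"left": 0.0, "right": 0.0}
def model_free_update(a, r, alpha=0.1):
    q[a] += alpha * (r - q[a])

print(model_based_value("left"))   # 0.7
```

The model-based value changes immediately if `P_o_given_a` or `V_o` changes; the cached model-free value only changes through further reinforced experience, which is the computational core of the dissociation.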
rat version (Balleine, Daw & O'Doherty, 2009)
[figure: lever presses per minute after outcome devaluation, valued vs. devalued, following moderate vs. extensive training (Holland, 2004)]
two behavioral modes:
– devaluation-sensitive ("goal directed")
– devaluation-insensitive ("habitual")
neurally dissociable with lesions (Dickinson, Balleine, Killcross)
dual systems view
task
two-stage choice: each top-stage option leads to one of two bottom stages with probability 70% (common) or 30% (rare); bottom-stage options pay off with probabilities (e.g. 26%, 57%, 41%, 28%), all slowly changing
(Daw, Gershman, Seymour, et al. Neuron 2011)
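The task's generative structure can be sketched as a short simulation. The drift parameters and probability bounds below are illustrative assumptions, not the published task settings.

```python
# Illustrative simulation of the two-stage task: a 70%/30%
# common/rare transition at the top stage, and slowly drifting reward
# probabilities at the bottom stage.
import numpy as np

rng = np.random.default_rng(0)
reward_p = np.array([0.26, 0.57, 0.41, 0.28])  # 2 options x 2 bottom states

def step(top_choice):
    """One trial: top-stage choice -> bottom stage -> reward."""
    common = rng.random() < 0.7
    bottom_state = top_choice if common else 1 - top_choice
    bottom_choice = rng.integers(2)            # random bottom-stage pick
    p = reward_p[2 * bottom_state + bottom_choice]
    reward = int(rng.random() < p)
    return bottom_state, reward, common

def drift(p, sd=0.025):
    """Slow Gaussian random walk on the reward probabilities
    (drift sd and clipping bounds are assumptions)."""
    return np.clip(p + rng.normal(0, sd, size=p.shape), 0.2, 0.8)
```

Because rewards arrive through a probabilistic transition, the task lets the analysis separate learners who credit the chosen action directly from learners who credit it through the transition model.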
question does choice behavior respect sequential structure?
idea
how does bottom-stage feedback affect top-stage choices?
example: a rare (30%) transition at the top level, followed by a win
• which top-stage action is now favored?
predictions
direct reinforcement: ignores transition structure
model-based planning: respects transition structure
data
17 subjects × 201 trials each
reward: p < 1e-8; reward × rare: p < 5e-5 (mixed-effects logit)
results reject pure reinforcement; they suggest a mixture of planning and reinforcement processes
(Daw, Gershman, Seymour, et al. Neuron 2011)
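The signature analysis behind these results can be sketched as a simple descriptive table (the paper itself used a mixed-effects logit across subjects; this numpy-only version and its argument names are illustrative).

```python
# Stay probability split by previous reward x previous transition type.
# Direct reinforcement predicts stay depends on reward regardless of
# transition; model-based planning predicts the reward effect reverses
# after rare transitions (a reward x rare interaction).
import numpy as np

def stay_table(choices, rewards, rare):
    """Mean probability of repeating the top-stage choice in each
    (rewarded, rare-transition) cell of the previous trial."""
    choices, rewards, rare = map(np.asarray, (choices, rewards, rare))
    stay = (choices[1:] == choices[:-1]).astype(float)
    r = rewards[:-1].astype(bool)      # previous trial rewarded?
    q = rare[:-1].astype(bool)         # previous transition rare?
    table = {}
    for rv in (False, True):
        for qv in (False, True):
            m = (r == rv) & (q == qv)
            table[(rv, qv)] = stay[m].mean() if m.any() else np.nan
    return table
```

A purely model-free learner produces a table that varies only along the reward axis; a model-based learner's table shows the interaction pattern.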
dual task
single task vs. dual task
dual × reward: p < 5e-7; dual × reward × rare: p < .05
(Otto, Gershman, Markman)
neural analysis
behavior incorporates model knowledge: not just TD
want to ask the same question neurally: can we dissociate multiple neural systems underlying behavior?
• in particular, can we show that subcortical systems are "dumb" (purely model-free)?
dopamine & RL (Schultz et al. 1997) (Daw et al. 2011)
fMRI analysis
hypothesis: striatal "error" signals are solely reinforcement driven
1. generate candidate error signals assuming TD
2. an additional regressor captures how this signal would change for errors relative to values computed by planning
net signal = TD error + β · (change due to forward planning); estimate β
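The regressor logic on this slide can be sketched as a schematic least-squares fit. This is only the core arithmetic; the actual fMRI analysis involves convolution with a hemodynamic response, nuisance regressors, etc., and the function and variable names here are illustrative.

```python
# Schematic version of the net-signal model:
#   signal = intercept + TD error + beta * (planning error - TD error)
# beta = 0 means a purely reinforcement-driven (TD) signal;
# beta = 1 means an error signal fully informed by forward planning.
import numpy as np

def estimate_beta(signal, td_error, planning_error):
    """Least-squares estimate of the weight on the planning correction."""
    delta = planning_error - td_error        # change due to planning
    X = np.column_stack([np.ones_like(signal), td_error, delta])
    coefs, *_ = np.linalg.lstsq(X, signal, rcond=None)
    return coefs[2]                          # beta
```

The per-subject β estimated this way is what the later slides compare against the behavioral estimate of model usage.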
fMRI analysis
net signal = TD error + β · (change due to forward planning)
contrary to theories: even striatal error signals incorporate knowledge of task structure
(Daw, Gershman, Seymour, et al. Neuron 2011) (p < .05, cluster corrected)
variation across subjects
subjects differ in degree of model usage
net signal = TD error + β · (change due to planning)
compare behavioral & neural estimates (p < .05, SVC)
average signal
R NAcc, start of trial:
• interaction not significant
• but the size of the interaction covaries with behavioral model usage (p = .02)
thoughts
can distinguish multiple learned representations in humans
• neurally more intertwined than expected
related areas:
• self control (drugs, dieting, savings, etc.)
• learning in multiplayer interactions (games)
  – equilibrium vs. equilibration
  – do we learn about actions or about opponents?
p-beauty contest
• fast equilibration with repeated play, yet most subjects are never reinforced
(Singaporean undergrads; Ho et al. 1998)
RPS (rock-paper-scissors)
• do subjects learn by reinforcement?
• best respond to reinforcement?
• best respond to that?
(Hampton et al., 2008)
conclusions
1. use of computational models to quantify phenomena & distinctions for neural study
2. can leverage this to distinguish different sorts of learning, trial-by-trial; beginning to map neural substrates
3. implications for self control, economic interactions