Reinforcement Learning: From neural processes modelling to Robotics applications

Reinforcement Learning: From neural processes modelling to Robotics - PowerPoint PPT Presentation

Slide 1/141 – Reinforcement Learning: From neural processes modelling to Robotics applications. Mehdi Khamassi (CNRS, ISIR-UPMC, Paris), 30 January 2015. Michèle Sebag's course @ Univ.


  1. Slide 31/141 – The Q-learning model. How can the agent learn a policy, i.e. learn to perform the right actions? Another solution: learning Q-values (qualities), Q : (S, A) → R.

     Q-table (state / action):
     state | a1: North | a2: South | a3: East | a4: West
     s1    | 0.92      | 0.10      | 0.35     | 0.05
     s2    | 0.25      | 0.52      | 0.43     | 0.37
     s3    | 0.78      | 0.9       | 1.0      | 0.81
     s4    | 0.0       | 1.0       | 0.9      | 0.9
     …     | …         | …         | …        | …

     [Figure: grid-world maze annotated with state values.]

  2. Slide 32/141 – The Q-learning model (continued). Same Q-table as on the previous slide. Action selection uses the softmax rule P(a|s) = exp(β·Q(s,a)) / Σ_b exp(β·Q(s,b)); the β parameter regulates the exploration–exploitation trade-off. A minimal code sketch follows below.
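A minimal sketch of this softmax (Boltzmann) action-selection rule, using the s1 row of the Q-table above; the β settings are illustrative only:

```python
import numpy as np

# Softmax action selection: P(a|s) = exp(beta*Q(s,a)) / sum_b exp(beta*Q(s,b))
def softmax_policy(q_row, beta):
    prefs = np.exp(beta * (q_row - np.max(q_row)))  # subtract max for numerical stability
    return prefs / prefs.sum()

q_s1 = np.array([0.92, 0.10, 0.35, 0.05])  # s1 row: North, South, East, West
print(softmax_policy(q_s1, beta=5.0))   # high beta: mostly exploitation (North dominates)
print(softmax_policy(q_s1, beta=0.1))   # low beta: near-uniform exploration
```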

  3. Slide 33/141 – Different Temporal-Difference (TD) methods: ACTOR-CRITIC, SARSA, Q-LEARNING. The Actor-Critic uses a state-dependent reward prediction error (independent of the action).

  4. Slide 34/141 – Different TD methods: ACTOR-CRITIC. State-dependent reward prediction error (independent of the action), also used to update the ACTOR: P(a_t|s_t) ← P(a_t|s_t) + α·δ_{t+1}.

  5. Slide 35/141 – Different TD methods: SARSA. Reward prediction error dependent on the action chosen to be performed next.

  6. Slide 36/141 – Different TD methods: Q-LEARNING. Reward prediction error dependent on the best action. The three prediction errors are contrasted in the sketch below.
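A minimal sketch (not the lecture's code) contrasting the three prediction errors; V and Q are hypothetical value tables and gamma is the discount factor:

```python
import numpy as np

gamma = 0.9  # discount factor (assumed)

def actor_critic_delta(V, s, r_next, s_next):
    # Actor-Critic / TD(0): state-dependent error, independent of the action taken
    return r_next + gamma * V[s_next] - V[s]

def sarsa_delta(Q, s, a, r_next, s_next, a_next):
    # SARSA: error depends on the action actually chosen to be performed next
    return r_next + gamma * Q[s_next, a_next] - Q[s, a]

def q_learning_delta(Q, s, a, r_next, s_next):
    # Q-learning: error depends on the best next action
    return r_next + gamma * np.max(Q[s_next]) - Q[s, a]
```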

  7. Slide 37/141 – Links with biology: activity of dopaminergic neurons.

  8. Slide 38/141 – CLASSICAL CONDITIONING. TD-learning explains classical conditioning (predictive learning). Taken from Bernard Balleine's lecture at the Okinawa Computational Neuroscience Course (2005).

  9. Slide 39/141 – REINFORCEMENT LEARNING. Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ·V(s_{t+1}) − V(s_t). [Figure: stimulus S, reward R; prediction error +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  10. Slide 40/141 – REINFORCEMENT LEARNING. Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ·V(s_{t+1}) − V(s_t). [Figure: stimulus S, reward R; prediction error +1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  11. Slide 41/141 – REINFORCEMENT LEARNING. Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ·V(s_{t+1}) − V(s_t). [Figure: stimulus S, reward R; prediction error 0.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997).

  12. Slide 42/141 – REINFORCEMENT LEARNING. Analogy with dopaminergic neurons' activity: δ_{t+1} = r_{t+1} + γ·V(s_{t+1}) − V(s_t). [Figure: stimulus S, reward R; prediction error −1.] Schultz et al. (1993); Houk et al. (1995); Schultz et al. (1997). A toy numerical illustration of the three cases follows below.
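A toy illustration, with made-up values, of how the same TD error δ = r + γ·V(s') − V(s) reproduces the +1 / 0 / −1 patterns sketched on these slides:

```python
gamma = 1.0  # no discounting over this single step, for clarity (assumed)

def delta(r_next, v_next, v_current):
    return r_next + gamma * v_next - v_current

print(delta(r_next=1.0, v_next=0.0, v_current=0.0))  # unexpected reward      -> +1.0
print(delta(r_next=1.0, v_next=0.0, v_current=1.0))  # fully predicted reward ->  0.0
print(delta(r_next=0.0, v_next=0.0, v_current=1.0))  # omitted reward         -> -1.0
```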

  13. Slide 43/141 – The Actor-Critic model. [Figure: Houk et al. (1995) architecture; dopaminergic neuron.] Barto (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002); see Joel et al. (2002) for a review.

  14. Slide 44/141 – The Actor-Critic model. Which state space as an input? Temporal-order input, also called a tapped-delay line, e.g. [0 0 1 0 0 0 0]. [Figure: dopaminergic neuron.] Montague et al. (1996); Suri & Schultz (2001); Daw (2003); Bertin et al. (2007). A sketch of this representation follows below.
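A minimal sketch of such a tapped-delay-line (temporal-order) state representation; the 7-unit size and the linear value readout are assumptions for illustration:

```python
import numpy as np

n_delays = 7  # length of the delay line (assumed)

def delay_line_state(steps_since_stimulus):
    # One-hot vector coding "time elapsed since stimulus onset",
    # e.g. 2 steps after onset -> [0, 0, 1, 0, 0, 0, 0]
    x = np.zeros(n_delays)
    if 0 <= steps_since_stimulus < n_delays:
        x[steps_since_stimulus] = 1.0
    return x

w = np.zeros(n_delays)        # one learnable weight per delay unit
x = delay_line_state(2)
value_estimate = w @ x        # V(t) as a linear readout of the delay-line state
```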

  15. Slide 45/141 – The Actor-Critic model. Which state space as an input? Temporal-order input [0 0 1 0 0 0 0], or spatial or visual information. [Figure: maze with numbered locations 1–5 and a reward site; dopaminergic neuron.]

  16. Slide 46/141 – Wide application of RL models to model-based analyses of behavioral and physiological data during decision-making tasks.

  17. Slide 47/141 – Typical probabilistic decision-making task. Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  18. Slide 48/141 – Typical probabilistic decision-making task. Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  19. Slide 49/141 – Typical probabilistic decision-making task. Niv et al. (2006), commentary on the results presented in Morris et al. (2006), Nat Neurosci.

  20. Slide 50/141 – Model-based analysis of brain data. Sequence of observed trials: Left (Reward); Left (Nothing); Right (Nothing); Left (Reward); … [Figure: the trial sequence is fed to an RL model whose prediction error is compared with brain responses recorded in the fMRI scanner.] Cf. the work of Mathias Pessiglione (ICM) or Giorgio Coricelli (ENS). A minimal sketch of the regressor construction follows below.
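A minimal sketch of how such a trial-by-trial prediction-error regressor can be built from the observed sequence; the learning rate and the simple Rescorla-Wagner-style update are assumptions, not the specific model used in those studies:

```python
alpha = 0.1  # learning rate (assumed)
trials = [("Left", 1.0), ("Left", 0.0), ("Right", 0.0), ("Left", 1.0)]  # observed sequence

Q = {"Left": 0.0, "Right": 0.0}
rpe_regressor = []
for action, reward in trials:
    delta = reward - Q[action]      # prediction error on this trial
    rpe_regressor.append(delta)
    Q[action] += alpha * delta      # update the chosen action's value

# rpe_regressor would then be convolved with a hemodynamic response function
# and regressed against the recorded brain responses.
print(rpe_regressor)
```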

  21. Slide 51/141 – Model-based analysis. Work by Jean Bellot (PhD student). TD-learning models are fitted to the behavior of the animal; some yield a high fitting error, others a low fitting error. Bellot, Sigaud, Khamassi (2012) SAB conference.

  22. Slide 52/141 – Model-based analysis: my post-doc work. Analysis of single neurons recorded in the monkey dorsolateral prefrontal cortex and anterior cingulate cortex. Correlates of prediction errors? Action values? Level of control/exploration? Khamassi et al. (2013) Prog Brain Res; Khamassi et al. (in revision).

  23. Slide 53/141 – Model-based analysis: my post-doc work. Multiple regression analysis with bootstrap, with model-derived regressors Q, δ and β*. Khamassi et al. (2013) Prog Brain Res; Khamassi et al. (in revision). An illustrative sketch follows below.
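An illustrative sketch of a multiple regression with bootstrapped confidence intervals; the data are synthetic and the three regressors simply stand in for the model-derived variables (Q, δ, β):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200
X = rng.normal(size=(n_trials, 3))  # columns stand in for Q, delta, beta (synthetic)
firing_rate = X @ np.array([0.5, 1.0, 0.0]) + rng.normal(scale=1.0, size=n_trials)

def ols(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])         # add an intercept column
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

boot = np.array([ols(X[i], firing_rate[i])
                 for i in (rng.integers(0, n_trials, n_trials) for _ in range(1000))])
print(np.percentile(boot, [2.5, 97.5], axis=0))        # bootstrap 95% CIs per coefficient
```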

  24. Slide 54/141 – This works well, but…
      • most experiments are single-step;
      • all these cases are discrete;
      • there is a very small number of states and actions;
      • we assumed perfect state identification.

  25. Slide 55/141 – Continuous reinforcement learning. TD-learning model applied to learning spatial navigation behavior in a robot performing the bio-inspired plus-maze task. [Figure: robot with sensory input, actions and reward; maze arms numbered 1–5.] Khamassi et al. (2005) Adaptive Behavior; Khamassi et al. (2006) Lecture Notes in Computer Science.

  26. Slide 56/141 – Continuous reinforcement learning. Multi-module Actor-Critic neural network, with coordination of the modules by a self-organizing map.

  27. Slide 57/141 – Continuous reinforcement learning. [Figure: comparison of hand-tuned, autonomous and random module coordination.]

  28. Slide 58/141 – Continuous reinforcement learning. Two methods for specializing the modules: 1. Self-Organizing Maps (SOMs); 2. specialization based on performance (testing the modules' capacity for state prediction). Within a particular subpart of the maze, only the module with the most accurate reward prediction is trained; each module thus becomes an expert responsible for learning in a given subset of the task. Baldassarre (2002); Doya et al. (2002). A minimal sketch of the performance-based gating follows below.
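A minimal sketch of the performance-based specialization idea: several critic modules predict reward, and only the currently most accurate one is trained. The module structure, feature size and learning rates are assumptions:

```python
import numpy as np

alpha, n_modules, n_features = 0.1, 3, 8  # assumed sizes

class CriticModule:
    def __init__(self):
        self.w = np.zeros(n_features)
    def predict(self, x):
        return self.w @ x
    def update(self, x, error):
        self.w += alpha * error * x

modules = [CriticModule() for _ in range(n_modules)]
error_trace = np.ones(n_modules)  # running average of each module's absolute prediction error

def train_step(x, target):
    errors = np.array([abs(target - m.predict(x)) for m in modules])
    error_trace[:] = 0.9 * error_trace + 0.1 * errors
    best = int(np.argmin(error_trace))               # most accurate module acts as the expert here
    modules[best].update(x, target - modules[best].predict(x))
```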

  29. Slide 59/141 – Continuous reinforcement learning. [Figure: average performance curves.]

  30. Slide 60/141 – Continuous reinforcement learning. Number of iterations required (average performance during the second half of the experiment):
      1. hand-tuned: 94
      2. specialization based on performance: 3,500
      3. autonomous categorization (SOM): 404
      4. random robot: 30,000

  31. Slide 61/141 – Continuous reinforcement learning. Same table as on the previous slide:
      1. hand-tuned: 94
      2. specialization based on performance: 3,500
      3. autonomous categorization (SOM): 404
      4. random robot: 30,000

  32. Slide 62/141 – Outline.
      1. Model-free Reinforcement Learning: Temporal-Difference RL algorithm; dopamine activity; wide application to the neuroscience of decision-making.
      2. Model-based Reinforcement Learning: off-line learning / replay during sleep; dual-system RL; online parameter tuning (meta-learning); link with neurobehavioral data; applications to robotics.

  33. Slide 63/141 – Off-line learning (model-based RL) and replay of hippocampal & prefrontal cortex activity during sleep.

  34. Slide 64/141 – REINFORCEMENT LEARNING. After N simulations (very long!): δ_{t+1} = r_{t+1} + γ·V(s_{t+1}) − V(s_t); V(s_t) ← V(s_t) + α·δ_{t+1}, with learning rate α = 0.1 and discount factor γ = 0.9.

  35. Slide 65/141 – TRAINING DURING SLEEP. Method from Artificial Intelligence: off-line Dyna-Q-learning (Sutton & Barto, 1998). A minimal sketch follows below.
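A minimal Dyna-Q sketch in the spirit of Sutton & Barto (1998); the parameter values and dictionary-based tables are assumptions. Real experience updates both the Q-values and an internal model, and extra "offline" planning steps then replay remembered transitions from that model, analogous to replay during sleep:

```python
import random

alpha, gamma, n_planning_steps = 0.1, 0.9, 20  # assumed parameters
Q, model = {}, {}        # Q[(s, a)] -> value ; model[(s, a)] -> (reward, next state)

def q(s, a):
    return Q.get((s, a), 0.0)

def best_q(s, actions):
    return max(q(s, b) for b in actions)

def dyna_q_step(s, a, r, s_next, actions):
    # 1) direct RL update from real experience
    Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_q(s_next, actions) - q(s, a))
    # 2) update the internal model
    model[(s, a)] = (r, s_next)
    # 3) offline planning: replay randomly chosen remembered transitions
    for _ in range(n_planning_steps):
        (ps, pa), (pr, pn) = random.choice(list(model.items()))
        Q[(ps, pa)] = q(ps, pa) + alpha * (pr + gamma * best_q(pn, actions) - q(ps, pa))
```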

  36. Slide 66/141 – Model-based Reinforcement Learning. Incrementally learn a model of the transition and reward functions, then plan within this model through updates performed "in the head of the agent" (Sutton, 1990). Internal model: S: state space; A: action space; transition function T : S × A → S; reward function R : S × A → R.

  37. Slide 67/141 – Model-based Reinforcement Learning. [Figure: grid maze; s = current state of the agent.]

  38. Slide 68/141 – Model-based Reinforcement Learning. [Figure: grid maze; s = current state; maxQ = 0.3, 0.9 and 0.7 for the neighbouring states.]

  39. Slide 69/141 – Model-based Reinforcement Learning. s: state of the agent; a: action of the agent (go east). [Figure: maxQ = 0.3, 0.9 and 0.7 for the neighbouring states.]

  40. Slide 70/141 – Model-based Reinforcement Learning. s: state of the agent; a: action of the agent (go east); maxQ = 0.3, 0.9 and 0.7 for the neighbouring states. Stored transition function T: proba = 0.9, 0.1 and 0 for the three possible next states.

  41. Slide 71/141 – Model-based Reinforcement Learning. Same setting; the expected value of the next state under T is computed as 0.9 × 0.7 + 0.1 × 0.9 + 0 × 0.3 + …

  42. Slide 72/141 – Model-based Reinforcement Learning. No reward prediction error! Only: estimated Q-values, the transition function and the reward function. This process is called Value Iteration, or Dynamic Programming. A minimal sketch follows below.
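A minimal value-iteration sketch over a toy, randomly generated model; it only uses the stored transition function T, the reward function R and the current Q-values, with no reward prediction error involved:

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9                 # toy MDP (assumed)
rng = np.random.default_rng(1)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # T[s, a, s']
R = np.zeros((n_states, n_actions)); R[2, 1] = 1.0     # one rewarded (state, action) pair (assumed)

Q = np.zeros((n_states, n_actions))
for _ in range(100):                                   # sweep until (approximate) convergence
    V = Q.max(axis=1)                                  # V(s') = max_a' Q(s', a')
    Q_new = R + gamma * T @ V                          # Bellman backup "in the head of the agent"
    if np.max(np.abs(Q_new - Q)) < 1e-6:
        break
    Q = Q_new
```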

  43. Slide 73/141 – Model-based Reinforcement Learning. Links with neurobiological data: activity of hippocampal place neurons.

  44. Slide 74/141 – Hippocampal place cells. Nakazawa, McHugh, Wilson, Tonegawa (2004) Nature Reviews Neuroscience.

  45. Slide 75/141 – Hippocampal place cells. Reactivation of hippocampal place cells during sleep (Wilson & McNaughton, 1994, Science).

  46. Slide 76/141 – Hippocampal place cells. Forward replay of hippocampal place cells during sleep (the sequence is compressed 7 times) (Euston et al., 2007, Science).

  47. Slide 77/141 – Sharp-Wave Ripple (SWR) events. "Ripple" events = irregular bursts of population activity that give rise to brief but intense high-frequency (100-250 Hz) oscillations in the CA1 pyramidal cell layer.

  48. Slide 78/141 – Selective suppression of SWRs impairs spatial memory. Girardeau G, Benchenane K, Wiener SI, Buzsáki G, Zugaro MB (2009) Nat Neurosci.

  49. Slide 79/141 – Contribution to decision making (forward planning) and evaluation of transitions. Johnson & Redish (2007) J Neurosci.

  50. Slide 80/141 – Hippocampal place cells: summary of neuroscience data.
      • They replay their sequential activity during sleep (Foster & Wilson, 2006; Euston et al., 2007; Gupta et al., 2010).
      • Performance is impaired if this replay is disrupted (Girardeau, Benchenane et al. 2012; Jadhav et al. 2012).
      • Only task-related replay is seen in PFC (Peyrache et al., 2009).
      • The hippocampus may contribute to model-based navigation strategies, the striatum to model-free navigation strategies (Khamassi & Humphries, 2012).

  51. Slide 81/141 – Applications to robot off-line learning. Work of Jean-Baptiste Mouret et al. @ ISIR. How to recover from damage without needing to identify the damage?

  52. Slide 82/141 – Applications to robot off-line learning (Jean-Baptiste Mouret et al. @ ISIR). The reality gap: self-model vs. reality, i.e. how to use a simulator? Solution: learn a transferability function (how well does the simulation match reality?) with SVMs or neural networks. Idea: the damage is a large reality gap. Koos, Mouret & Doncieux, IEEE Trans Evolutionary Comput 2012. An illustrative sketch follows below.
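An illustrative sketch (synthetic data, assumed features) of learning a transferability function with a support-vector regressor: from a handful of controllers actually tried on the real robot, predict how well a new candidate's simulated behavior will transfer to reality:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
descriptors = rng.uniform(size=(30, 5))     # behavior descriptors of tested controllers (synthetic)
measured_transfer = rng.uniform(size=30)    # e.g. 1 - |simulated - real| performance gap (synthetic)

transfer_model = SVR(kernel="rbf").fit(descriptors, measured_transfer)

# During offline search, candidates predicted to transfer poorly can be discarded
# before ever being tried on the (possibly damaged) physical robot.
candidate = rng.uniform(size=(1, 5))
print(transfer_model.predict(candidate))
```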

  53. Slide 83/141 – Applications to robot off-line learning (Jean-Baptiste Mouret et al. @ ISIR). Experiments. Koos, Cully & Mouret, Int J Robot Res 2013.

  54. Slide 84/141 – META-LEARNING (regulation of decision-making): 1. dual-system RL coordination; 2. online parameter tuning.

  55. Slide 85/141 – Multiple decision systems. Skinner box (instrumental conditioning): model-based system vs. model-free system (Daw, Niv & Dayan 2005, Nat Neurosci). Behavior is initially model-based and becomes model-free (habitual) with overtraining.

  56. Slide 86/141 – Habitual vs. goal-directed: sensitivity to changes in outcome. Yin et al. 2004; Balleine 2005; Yin & Knowlton 2006.

  57. Slide 87/141 – Habitual vs. goal-directed: sensitivity to changes in outcome. [Figure: the outcome is devalued.] Yin et al. 2004; Balleine 2005; Yin & Knowlton 2006.

  58. Slide 88/141 – Habitual vs. goal-directed: sensitivity to changes in outcome. Yin et al. 2004; Balleine 2005; Yin & Knowlton 2006.

  59. Slide 89/141 – Habitual vs. goal-directed: sensitivity to changes in outcome. [Figure: habitual vs. goal-directed responses.] Yin et al. 2004; Balleine 2005; Yin & Knowlton 2006.

  60. Slide 90/141 – Model-free vs. model-based: outcome sensitivity. Habitual control is slow to update after a change in R; goal-directed control is fast to update after a change in R. Behavior switches with experience [to reduce computational load]. Daw et al. 2005 Nat Neurosci.

  61. Slide 91/141 – Multiple decision systems. Keramati et al. (2011): extension of the Daw (2005) model with a speed-accuracy trade-off arbitration criterion. A simplified sketch follows below.
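A deliberately simplified sketch of a speed-accuracy trade-off arbitration in this spirit (not Keramati et al.'s actual equations): deliberate with the model-based system only when the expected gain from deliberation outweighs its opportunity cost in time:

```python
def choose_system(value_of_information, deliberation_time, reward_rate):
    # Opportunity cost of thinking: reward foregone while deliberating
    opportunity_cost = deliberation_time * reward_rate
    return "model-based" if value_of_information > opportunity_cost else "model-free"

print(choose_system(value_of_information=0.4,  deliberation_time=2.0, reward_rate=0.1))  # model-based
print(choose_system(value_of_information=0.05, deliberation_time=2.0, reward_rate=0.1))  # model-free
```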

  62. Slide 92/141 – Progressive shift from model-based navigation to model-free navigation. Khamassi & Humphries (2012) Frontiers in Behavioral Neuroscience.

  63. Slide 93/141 – Model-based and model-free navigation strategies. [Figure: model-based navigation vs. model-free navigation.] Benoît Girard, 2010 UPMC lecture.

  64. Slide 94/141 – Old behavioral evidence for place-based, model-based RL. Martinet et al. (2011) model applied to the Tolman maze.

  65. Slide 95/141 – Old behavioral evidence for place-based, model-based RL. Martinet et al. (2011) model applied to the Tolman maze.

  66. Slide 96/141 – MULTIPLE NAVIGATION STRATEGIES IN THE RAT. [Figure: maze with a 180° rotation and the previous platform location; rats with a lesion of the hippocampus vs. rats with a lesion of the dorsal striatum.] Packard and Knowlton, 2002; Devan and White, 1999.

  67. Slide 97/141 – MULTIPLE DECISION SYSTEMS IN A NAVIGATION MODEL. Model-based system (hippocampal place cells) vs. model-free system (basal ganglia). Work by Laurent Dollé: Dollé et al., 2008, 2010, submitted.

  68. Slide 98/141 – MULTIPLE NAVIGATION STRATEGIES IN A TD-LEARNING MODEL. Task with a cued platform (visible flag) changing location every 4 trials. Task of Pearce et al., 1998; model: Dollé et al., 2010.

  69. Slide 99/141 – PSIKHARPAX ROBOT. Work by: Ken Caluwaerts (2010), Steve N'Guyen (2010), Mariacarla Staffa (2011), Antoine Favre-Félix (2011). Caluwaerts et al. (2012) Bioinspiration & Biomimetics.

  70. Slide 100/141 – PSIKHARPAX ROBOT. Planning strategy only vs. planning strategy + taxon strategy. Caluwaerts et al. (2012) Bioinspiration & Biomimetics.
