Brain and Reinforcement Learning (Kenji Doya, MLSS 2012 in Kyoto)



  1. MLSS 2012 in Kyoto: Brain and Reinforcement Learning
     Kenji Doya (doya@oist.jp), Neural Computation Unit, Okinawa Institute of Science and Technology

  2. Location of Okinawa
     [Map: flight times of roughly 1.5 to 3 hours from Tokyo, Seoul, Beijing, Shanghai, Taipei, and Manila.]

  3. Okinawa Institute of Science & Technology
     - Apr. 2004: initial research (President: Sydney Brenner)
     - Nov. 2011: graduate university (President: Jonathan Dorfan)
     - Sept. 2012: Ph.D. course, 20 students/year

  4. Our Research Interests
     - How to build adaptive, autonomous systems: robot experiments
     - How the brain realizes robust, flexible adaptation: neurobiology

  5. Learning to Walk (Doya & Nakano, 1985)
     - Action: cycle of 4 postures
     - Reward: speed sensor output
     - Problem: a long jump followed by a fall; hence the need for long-term evaluation of actions

  6. Reinforcement Learning
     - Agent-environment loop: the agent observes state s, takes action a, and receives reward r
     - Learn an action policy s → a to maximize rewards
     - Value function: expected future rewards
       V(s(t)) = E[ r(t) + γ r(t+1) + γ^2 r(t+2) + γ^3 r(t+3) + … ],  0 ≤ γ ≤ 1: discount factor
     - Temporal difference (TD) error:
       δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
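As a concrete reading of the value function and TD error above, here is a minimal TD(0) update in Python (an illustrative sketch, not code from the lecture; `V`, `gamma`, and `alpha` are assumed names for the value table, discount factor, and learning rate):

```python
# Minimal TD(0) value update (illustrative; V is a dict or array of state values).
def td_update(V, s, r, s_next, gamma=0.9, alpha=0.1):
    delta = r + gamma * V[s_next] - V[s]   # TD error: delta(t) = r(t) + gamma*V(s(t+1)) - V(s(t))
    V[s] += alpha * delta                  # move V(s) toward the bootstrapped target
    return delta
```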

  7. Example: Grid World
     [Figure: a reward field on a grid and the value functions obtained with γ = 0.9 and γ = 0.3.]
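The figure can be approximated with a few lines of value iteration; the grid size, reward placement, and deterministic moves below are my own assumptions for illustration:

```python
import numpy as np

# Illustrative grid-world value iteration (assumed setup, not the slide's code).
# Reward of +1 at the top-left cell; actions move up/down/left/right deterministically.
def value_iteration(gamma, size=5, iters=100):
    R = np.zeros((size, size)); R[0, 0] = 1.0
    V = np.zeros((size, size))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iters):
        V_new = np.zeros_like(V)
        for i in range(size):
            for j in range(size):
                best = -np.inf
                for di, dj in moves:
                    ni = min(max(i + di, 0), size - 1)   # walls: bump back into the grid
                    nj = min(max(j + dj, 0), size - 1)
                    best = max(best, R[i, j] + gamma * V[ni, nj])
                V_new[i, j] = best
        V = V_new
    return V

print(np.round(value_iteration(0.9), 2))
print(np.round(value_iteration(0.3), 2))
```

Running it with γ = 0.9 and γ = 0.3 shows the value spreading much farther from the rewarded cell when the discount factor is large, which is the point of the two panels.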

  8. Cart-Pole Swing-Up
     - Reward: height of the pole
     - Punishment: collision
     - Value function learned in the 4-D state space

  9. Learning to Stand Up (Morimoto & Doya, 2000)
     - State: joint/head angles and angular velocities
     - Action: torques to the motors
     - Reward: head height minus a tumble penalty

  10. Learning to Survive and Reproduce
      - Catch battery packs: survival
      - Copy 'genes' via IR ports: reproduction and evolution

  11. Markov Decision Process (MDP)
      - Agent-environment loop with state s ∈ S, action a ∈ A, policy p(a|s), reward r(s,a), dynamics p(s'|s,a)
      - Optimal policy: maximize cumulative reward
        - finite horizon: E[ r(1) + r(2) + r(3) + ... + r(T) ]
        - infinite horizon: E[ r(1) + γ r(2) + γ^2 r(3) + … ],  0 ≤ γ ≤ 1: temporal discount factor
        - average reward: E[ r(1) + r(2) + ... + r(T) ] / T,  T → ∞
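To make these ingredients concrete, here is a minimal tabular MDP container in Python (my own sketch; the class name, array layout, and the uniform start distribution are assumptions, not part of the lecture):

```python
import numpy as np

# Minimal tabular MDP container (illustrative).
class TabularMDP:
    def __init__(self, n_states, n_actions, P, R, gamma=0.9):
        self.nS, self.nA = n_states, n_actions
        self.P = P          # P[s, a, s'] = p(s' | s, a), transition dynamics
        self.R = R          # R[s, a]     = r(s, a), expected immediate reward
        self.gamma = gamma  # temporal discount factor, 0 <= gamma <= 1

    def expected_return(self, policy, horizon=100):
        """Discounted return E[sum_t gamma^t r(t)] for a stochastic policy[s, a]."""
        d = np.ones(self.nS) / self.nS                       # uniform start distribution
        total = 0.0
        for t in range(horizon):
            r = np.einsum('s,sa,sa->', d, policy, self.R)    # expected reward at step t
            total += self.gamma ** t * r
            d = np.einsum('s,sa,sat->t', d, policy, self.P)  # propagate the state distribution
        return total
```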

  12. Solving MDPs
      Dynamic programming (p(s'|s,a) and r(s,a) are known):
      - Solve the Bellman equation V(s) = max_a E[ r(s,a) + γ V(s') ], where V(s) is the value function: the expected reward from state s
      - Apply the optimal policy a = argmax_a E[ r(s,a) + γ V*(s') ]
      - Methods: value iteration, policy iteration
      Reinforcement learning (p(s'|s,a) and r(s,a) are unknown):
      - Learn from actual experience {s, a, r, s, a, r, …}
      - Methods: Monte Carlo, SARSA, Q-learning, actor-critic, policy gradient
      - Model-based: learn p(s'|s,a) and r(s,a), then do DP
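The model-based entry in the right-hand column can be sketched directly: estimate p(s'|s,a) and r(s,a) from experienced transitions, then run dynamic programming on the estimates (illustrative Python; the array shapes and the uniform fallback for unvisited state-action pairs are my choices):

```python
import numpy as np

# Estimate the model from a list of (s, a, r, s') transitions, then run value iteration on it.
def fit_model(transitions, nS, nA):
    counts = np.zeros((nS, nA, nS))
    rew_sum = np.zeros((nS, nA))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)
    P = np.divide(counts, n_sa, out=np.ones_like(counts) / nS, where=n_sa > 0)
    R = np.divide(rew_sum, n_sa[:, :, 0], out=np.zeros_like(rew_sum), where=n_sa[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.9, iters=200):
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V        # Q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)       # value function and the greedy policy
```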

  13. Actor-Critic and TD Learning
      - Actor: parameterized policy P(a|s; w)
      - Critic: learn the value function V(s(t)) = E[ r(t) + γ r(t+1) + γ^2 r(t+2) + … ], in a table or a neural network
      - Temporal difference (TD) error: δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
      - Updates
        - Critic: ΔV(s(t)) = α δ(t)
        - Actor: Δw = α δ(t) ∂P(a(t)|s(t); w)/∂w, i.e. reinforce a(t) in proportion to δ(t)
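A tabular version of these updates might look as follows (my own illustration; the actor is a softmax over preferences w[s, a], and I use the common log-likelihood form of the policy gradient rather than the raw ∂P/∂w written on the slide):

```python
import numpy as np

# Tabular actor-critic sketch (illustrative).
class ActorCritic:
    def __init__(self, nS, nA, gamma=0.9, alpha=0.1):
        self.V = np.zeros(nS)          # critic: state values
        self.w = np.zeros((nS, nA))    # actor: policy preferences
        self.gamma, self.alpha = gamma, alpha

    def policy(self, s):
        p = np.exp(self.w[s] - self.w[s].max())    # softmax over preferences
        return p / p.sum()

    def act(self, s):
        return np.random.choice(len(self.w[s]), p=self.policy(s))

    def update(self, s, a, r, s_next):
        delta = r + self.gamma * self.V[s_next] - self.V[s]   # TD error
        self.V[s] += self.alpha * delta                       # critic update
        grad = -self.policy(s); grad[a] += 1.0                # d log P(a|s;w) / dw[s]
        self.w[s] += self.alpha * delta * grad                # actor update: reinforce a by delta
        return delta
```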

  14. SARSA and Q-Learning
      - Action value function: Q(s,a) = E[ r(t) + γ r(t+1) + γ^2 r(t+2) + … | s(t)=s, a(t)=a ]
      - Action selection
        - ε-greedy: a = argmax_a Q(s,a) with probability 1-ε
        - Boltzmann: P(a_i|s) = exp[ β Q(s,a_i) ] / Σ_j exp[ β Q(s,a_j) ]
      - SARSA, on-policy update: ΔQ(s(t),a(t)) = α { r(t) + γ Q(s(t+1),a(t+1)) - Q(s(t),a(t)) }
      - Q-learning, off-policy update: ΔQ(s(t),a(t)) = α { r(t) + γ max_a' Q(s(t+1),a') - Q(s(t),a(t)) }
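The two updates differ only in how the next-step value is bootstrapped; a minimal sketch, assuming Q is a states-by-actions array:

```python
import numpy as np

# SARSA and Q-learning updates side by side (illustrative; Q is an nS x nA array).
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from the action a_next actually taken at s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap from the greedy action at s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def epsilon_greedy(Q, s, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])   # explore: random action
    return int(Q[s].argmax())                  # exploit: greedy action
```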

  15. “Lose to Gain” Task
      - N states, 2 actions
      - [Diagram: along the chain s_1, s_2, s_3, s_4, action a_2 costs -r_1 per step but eventually yields +r_2, while action a_1 yields +r_1 per step but eventually incurs -r_2.]
      - If r_2 >> r_1, it is better to take a_2: accept short-term losses for the long-term gain
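To see the point numerically, one can simulate a chain version of this task with Q-learning; the environment below is my own reconstruction from the diagram, and N, r_1, r_2, and the restart rule are assumptions:

```python
import numpy as np

# Sketch of a chain 'lose to gain' task (assumed layout and reward sizes).
N, r1, r2 = 4, 1.0, 10.0

def step(s, a):
    if a == 1:                       # a2: move toward the far end, paying -r1 per step
        if s == N - 1:
            return 0, +r2            # big reward at the end, then restart
        return s + 1, -r1
    else:                            # a1: small gain +r1, but drift back toward a -r2 trap
        if s == 0:
            return 0, -r2
        return s - 1, +r1

Q = np.zeros((N, 2))
gamma, alpha, eps = 0.95, 0.1, 0.1
s = 0
for t in range(20000):
    a = np.random.randint(2) if np.random.rand() < eps else int(Q[s].argmax())
    s_next, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
print(Q)
```

With these values the learned Q table should favor a_2 (column 1) in every state, since the discounted value of the eventual +r_2 outweighs the per-step losses.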

  16. Reinforcement Learning
      - Predict reward: value functions
        V(s) = E[ r(t) + γ r(t+1) + γ^2 r(t+2) + … | s(t)=s ]
        Q(s,a) = E[ r(t) + γ r(t+1) + γ^2 r(t+2) + … | s(t)=s, a(t)=a ]
      - Select action
        - greedy: a = argmax_a Q(s,a)
        - Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ]
      - Update prediction by the TD error
        δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
        ΔV(s(t)) = α δ(t),  ΔQ(s(t),a(t)) = α δ(t)
      - Open questions: how to implement these steps? how to tune these parameters?
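Of these steps, the Boltzmann selection rule is the only one not sketched above; a numerically stable version, with β as the inverse temperature (illustrative code):

```python
import numpy as np

# Boltzmann (softmax) action selection; larger beta makes the choice closer to greedy.
def boltzmann_action(q_values, beta=1.0):
    prefs = beta * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()                        # subtract the max for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()     # P(a_i) = exp(beta Q_i) / sum_j exp(beta Q_j)
    return np.random.choice(len(p), p=p), p

a, probs = boltzmann_action([1.0, 0.5, 0.2], beta=2.0)   # sample an action and its probabilities
```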

  17. Dopamine Neurons Code the TD Error
      δ(t) = r(t) + γ V(s(t+1)) - V(s(t))
      [Figure (Schultz et al., 1997): dopamine neuron responses when a reward is unpredicted, fully predicted, or predicted but omitted, consistent with a reward-prediction (TD) error signal.]

  18. Basal Ganglia for Reinforcement Learning? (Doya 2000, 2007)
      - Cerebral cortex: state/action coding
      - Striatum: reward prediction, Q(s,a) and V(s)
      - Pallidum: action selection
      - Dopamine neurons: TD signal δ
      - Thalamus

  19. Monkey Free-Choice Task (Samejima et al., 2005)
      - Left and right choices rewarded with probabilities P(reward | Left) = Q_L and P(reward | Right) = Q_R
      - Probability pairs (left-right, %) such as 50-90, 50-50, 50-10, and 10-50
      - Design dissociates action from reward

  20. Action Value Coding in the Striatum (Samejima et al., 2005)
      [Figure: a 'Q_L neuron' whose firing over the -1 to 0 s window varies with Q_L but not Q_R, and a '-Q_R neuron' whose firing varies negatively with Q_R.]

  21. Forced- and Free-Choice Task (Makoto Ito)
      - Trial: center poke and cue tone, then left or right poke, then reward tone with pellet or a no-reward tone (event intervals of 0.5-1 s and 1-2 s)
      - Cue tones and reward probabilities (Left, Right):
        - Left tone, fixed (900 Hz): (50%, 0%)
        - Right tone, fixed (6500 Hz): (0%, 50%)
        - Free-choice tone, varied (white noise): (90%, 50%), (50%, 90%), (50%, 10%), (10%, 50%)

  22. Time Course of Choice
      [Figure: left-choice probability P_L across trials as the free-choice reward probabilities P(r|a=L), P(r|a=R) switch between blocks such as 10-50, 50-10, 90-50, and 50-90.]

  23. Generalized Q-Learning Model (Ito & Doya, 2009)
      - Action selection: P(a(t)=L) = exp Q_L(t) / ( exp Q_L(t) + exp Q_R(t) )
      - Action value update, for i ∈ {L, R}:
        Q_i(t+1) = (1 - α_1) Q_i(t) + α_1 κ_1   if a(t) = i, r(t) = 1
        Q_i(t+1) = (1 - α_1) Q_i(t) - α_1 κ_2   if a(t) = i, r(t) = 0
        Q_i(t+1) = (1 - α_2) Q_i(t)             if a(t) ≠ i (rewarded or not)
      - Parameters: α_1 learning rate, α_2 forgetting rate, κ_1 reward reinforcement, κ_2 no-reward aversion
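These update rules translate directly into code; the sketch below is my own transcription of the model (actions coded 0 = Left, 1 = Right, and the variable names are mine):

```python
import numpy as np

# Generalized Q-learning model of Ito & Doya (2009), transcribed from the equations above.
def choose(Q):
    p_left = np.exp(Q[0]) / (np.exp(Q[0]) + np.exp(Q[1]))   # softmax over the two actions
    return 0 if np.random.rand() < p_left else 1

def update(Q, a, r, alpha1, alpha2, kappa1, kappa2):
    for i in (0, 1):
        if i == a:
            if r == 1:
                Q[i] = (1 - alpha1) * Q[i] + alpha1 * kappa1   # chosen and rewarded
            else:
                Q[i] = (1 - alpha1) * Q[i] - alpha1 * kappa2   # chosen and unrewarded
        else:
            Q[i] = (1 - alpha2) * Q[i]                         # unchosen: forget toward zero
    return Q
```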

  24. Model Fitting by Particle Filter
      [Figure: trial-by-trial estimates of Q_L, Q_R and the time-varying parameters α_1 and α_2 across blocks (90, 50), (50, 90), (50, 10), with each trial marked as left/right choice and reward/no-reward.]
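For readers unfamiliar with the method, here is a minimal bootstrap particle-filter sketch for tracking time-varying parameters of the model from a choice and reward sequence. It is my own illustration of the general idea, not the fitting procedure used in the study; the random-walk noise scale, priors, and resampling scheme are all assumptions:

```python
import numpy as np

# Bootstrap particle filter over [alpha1, alpha2, kappa1, kappa2, Q_L, Q_R] (illustrative sketch).
def particle_filter(choices, rewards, n_particles=1000, sigma=0.02):
    rng = np.random.default_rng(0)
    particles = np.column_stack([
        rng.uniform(0, 1, n_particles),    # alpha1
        rng.uniform(0, 1, n_particles),    # alpha2
        rng.uniform(0, 2, n_particles),    # kappa1
        rng.uniform(0, 2, n_particles),    # kappa2
        np.zeros(n_particles),             # Q_L
        np.zeros(n_particles),             # Q_R
    ])
    estimates = []
    for a, r in zip(choices, rewards):     # a: 0 = Left, 1 = Right; r: 0/1
        # 1. Diffuse the parameters (random-walk state transition), keeping them in range.
        particles[:, :4] += rng.normal(0, sigma, (n_particles, 4))
        particles[:, :2] = np.clip(particles[:, :2], 0, 1)
        particles[:, 2:4] = np.clip(particles[:, 2:4], 0, 5)
        # 2. Weight particles by the likelihood of the observed choice under the softmax rule.
        p_left = 1.0 / (1.0 + np.exp(particles[:, 5] - particles[:, 4]))
        like = p_left if a == 0 else 1.0 - p_left
        w = like / like.sum()
        # 3. Resample and record the posterior mean.
        idx = rng.choice(n_particles, n_particles, p=w)
        particles = particles[idx]
        estimates.append(particles.mean(axis=0))
        # 4. Propagate each particle's Q values with its own parameters.
        a1, a2, k1, k2 = particles[:, 0], particles[:, 1], particles[:, 2], particles[:, 3]
        chosen, other = 4 + a, 5 - a
        gain = k1 if r == 1 else -k2
        particles[:, chosen] = (1 - a1) * particles[:, chosen] + a1 * gain
        particles[:, other] = (1 - a2) * particles[:, other]
    return np.array(estimates)   # per-trial estimates of [alpha1, alpha2, kappa1, kappa2, Q_L, Q_R]
```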

  25. Model Fitting
      - Models compared (number of parameters): 1st to 4th order Markov models (4, 16, 64, 256), local matching law (1), and generalized Q-learning variants with constant or variable parameters: standard Q (α_2 = κ_2 = 0), F-Q (forgetting, κ_2 = 0), and DF-Q (full model with α_1 learning, α_2 forgetting, κ_1 reinforcement, κ_2 aversion)
      - [Bar plot of normalized likelihood per model; F-Q and DF-Q with variable parameters are the only models without significance marks (* / **), i.e. the best-fitting models.]

  26. Neural Activity in the Striatum
      [Figure: recording sites in the dorsolateral, dorsomedial, and ventral striatum.]

  27. Information of Action and Reward
      [Figure: time course of information (bits/sec) about action (around exit from the center hole) and reward (around entry into the choice hole) carried by DL (122 neurons), DM (56), and NA (59) populations, aligned to task events: center poke, tone, left/right poke, pellet dish.]

  28. Action Value Coded by a DLS Neuron
      [Figure: firing rate of a dorsolateral striatum (DLS) neuron during tone presentation, plotted across trials alongside the left action value estimated by FQ-learning; the two time courses track each other.]

  29. State Value Coded by a VS Neuron
      [Figure: firing rate of a ventral striatum (VS) neuron during tone presentation, plotted across trials alongside value estimates from FQ-learning.]

  30. Hierarchy in the Cortico-Striatal Network (Voorn et al., 2004)
      - Dorsolateral striatum (motor): early action coding; what action to take?
      - Dorsomedial striatum (frontal): action value; in what context?
      - Ventral striatum (limbic): state value; whether it is worth doing?

  31. Specialization by Learning Algorithms (Doya, 1999)
      - Cerebral cortex: unsupervised learning (input → output)
      - Basal ganglia: reinforcement learning, driven by the reward signal (via SN) in the cortex-basal ganglia-thalamus loop
      - Cerebellum: supervised learning, driven by the error between target and output (via IO)
