

  1. RIKEN–Osaka–OIST Joint Workshop 2016, Big Waves of Theoretical Science in Okinawa, 2016.6.21. Machine Learning and Brain Science. Kenji Doya (doya@oist.jp), Neural Computation Unit, Okinawa Institute of Science and Technology

  2. Okinawa Institute of Science & Technology

  3. Our Research Interests: how to build adaptive, autonomous systems (robot experiments), and how the brain realizes robust, flexible adaptation (neurobiology).

  4. Outline: Machine Learning and Brain Science; Reinforcement Learning and Basal Ganglia; Delayed Reward and Serotonin; What's Next.

  5. Machine Learning and Brain Science. To make intelligent machines in electronics, we need not be bound by biological constraints. Yet the brain offers a superb existing implementation, so we should learn from it. Indeed, brain-like implementations such as Deep Learning currently give the best performance.

  6. Coevolution in Pattern Recognition. Parallel advances in Brain Science and Artificial Intelligence: feature detectors in cat striate cortex (Hubel & Wiesel 1959) and the Perceptron (Rosenblatt 1962); experience-dependent development (Blakemore & Cooper 1970) and multi-layer learning (Amari 1967); hippocampal place cells (O'Keefe 1976) and the Neocognitron (Fukushima 1980); face cells (Bruce, Desimone, Gross 1981; Sugase et al. 1999) and ConvNets (Krizhevsky, Sutskever, Hinton 2012), GoogleBrain (2012).

  7. What is Machine Learning? Supervised Learning: from input-output pairs {(x1, y1), (x2, y2), …}, learn an input-output model y = f(x) + ε and predict the output y for a new input x. Reinforcement Learning: from state-action-reward triplets {(x1, y1, r1), (x2, y2, r2), …}, learn an action policy y = f(x) that maximizes reward. Unsupervised Learning: from input data {x1, x2, x3, …}, learn a statistical model of P(x) and discover the structure behind the data.
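To make the supervised-learning setting concrete, here is a minimal sketch (my own illustration, not from the slides): noisy input-output pairs are generated, a degree-3 polynomial model f(x) is fitted by least squares, and the output is predicted for a new input. The data, noise level, and polynomial degree are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)                               # inputs x1..xN
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)   # noisy outputs y = f(x) + noise

coeffs = np.polyfit(x, y, deg=3)     # fit a cubic polynomial model by least squares
x_new = 0.25
y_pred = np.polyval(coeffs, x_new)   # predict the output for a new input x
print(f"predicted y at x={x_new}: {y_pred:.3f}")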

  8. Specialization by Learning Algorithms (Doya, 1999). Cerebral cortex: unsupervised learning (input to output). Basal ganglia: reinforcement learning (input from cortex, reward signal from the substantia nigra (SN), output via the thalamus). Cerebellum: supervised learning (error = target − output, carried by the inferior olive (IO)).

  9. Learning by Trial and Error (Doya & Nakano, 1985). Explore actions (a cycle of four postures) and learn from performance feedback (a speed sensor).

  10. Reinforcement Learning. An agent interacts with an environment: it observes state s, takes action a, and receives reward r. Learn an action policy s → a that maximizes rewards. Value function (expected future rewards): V(s(t)) = E[ r(t) + γ r(t+1) + γ² r(t+2) + γ³ r(t+3) + … ], where 0 ≤ γ ≤ 1 is the discount factor. Temporal difference (TD) error: δ(t) = r(t) + γ V(s(t+1)) − V(s(t)).
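As an illustration of the TD error defined above, here is a minimal tabular TD(0) sketch (my own toy code, not the speaker's): the value estimate V(s) is nudged by α·δ after each observed transition. The state count, γ, and α are placeholder values.

import numpy as np

n_states = 5
V = np.zeros(n_states)   # value function estimate V(s)
gamma = 0.9              # discount factor
alpha = 0.1              # learning rate

def td_update(s, r, s_next):
    """Apply one TD(0) update for the transition s -> s_next with reward r."""
    delta = r + gamma * V[s_next] - V[s]   # TD error delta(t)
    V[s] += alpha * delta                  # V(s) <- V(s) + alpha * delta
    return delta

# Example: a transition from state 0 to state 1 with reward 1.
print(td_update(0, 1.0, 1))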

  11. Pendulum Swing-Up. Reward function: the potential energy of the pendulum. Value function V(s) over the state s = (angle, angular velocity).

  12. Reinforcement Learning (Morimoto & Doya, 2000). Learning from reward and punishment: reward for the height of the head, punishment for a bump on the floor.

  13. Learning to Survive and Reproduce (Elfwing et al., 2011, 2014). Robots catch battery packs for survival and copy 'genes' through IR ports for reproduction and evolution.

  14. Reinforcement Learning. Predict reward with value functions: V(s) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s ] and Q(s,a) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s, a(t)=a ]. Select an action: greedy, a = argmax_a Q(s,a), or Boltzmann, P(a|s) ∝ exp[ β Q(s,a) ]. Update the prediction with the TD error δ(t) = r(t) + γ V(s(t+1)) − V(s(t)): ΔV(s(t)) = α δ(t) and ΔQ(s(t),a(t)) = α δ(t). How does the brain implement these steps, and how does it tune these parameters?
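A small sketch of the selection and update rules listed above, assuming a tabular array indexed as Q[s, a] (my own illustrative code; using the maximum over Q(s', ·) as the next-state value is one common, Q-learning-style choice, since the slide leaves V(s') unspecified).

import numpy as np

rng = np.random.default_rng(1)

def greedy(Q, s):
    """Pick the action with the highest estimated value Q(s, a)."""
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, beta=2.0):
    """Sample an action with probability proportional to exp(beta * Q(s, a))."""
    prefs = beta * Q[s]
    p = np.exp(prefs - prefs.max())   # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def q_update(Q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """Update Q(s, a) by alpha * delta, with delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * delta
    return delta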

  15. Basal Ganglia. The locus of Parkinson's and Huntington's diseases: striatum, globus pallidus, substantia nigra, thalamus. What is their normal function?

  16. Dopamine-dependent Plasticity. Medium spiny neurons in the striatum receive glutamate from the cortex and dopamine from the midbrain. Three-factor learning rule (Wickens et al.): cortical input + postsynaptic spike → LTD; cortical input + spike + dopamine → LTP; that is, plasticity depends on input × output × reward. Time window of plasticity (Yagishita et al., 2014).
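The three-factor rule can be written compactly as a weight change proportional to the product of presynaptic input, postsynaptic spiking, and the dopamine signal. The sketch below is a toy formalization under my own assumptions (the function name, learning rate, and dopamine baseline are illustrative): with dopamine at or below baseline the product is negative (LTD), and with dopamine above baseline it is positive (LTP), matching the rule stated above.

def three_factor_update(w, pre, post, dopamine, eta=0.01, baseline=0.2):
    """Cortico-striatal weight change = input x output x (dopamine relative to baseline).

    pre, post: 0/1 flags for cortical input and a postsynaptic spike.
    dopamine:  midbrain dopamine level; 0 (no dopamine) gives LTD, values above
               the baseline give LTP. All constants here are illustrative.
    """
    return w + eta * pre * post * (dopamine - baseline)

# Coincident input and spike without dopamine -> weight decreases (LTD):
w = three_factor_update(0.5, pre=1, post=1, dopamine=0.0)
# Coincident input, spike, and dopamine -> weight increases (LTP):
w = three_factor_update(w, pre=1, post=1, dopamine=1.0)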

  17. Basal Ganglia for Reinforcement Learning? (Doya 2000, 2007). Cerebral cortex: state/action coding. Striatum: reward prediction, V(s) and Q(s,a). Pallidum: action selection, with output through the thalamus. Dopamine neurons: TD signal δ.

  18. Gambling Rats (Ito & Doya, 2015). Task sequence: center poking, cue tone (0.5–1 s), left or right poking, reward tone (1–2 s), then pellet or no reward at the pellet dish. Cue tones and reward probabilities (L, R): left tone (900 Hz), fixed (50%, 0%); right tone (6500 Hz), fixed (0%, 50%); free-choice tone (white noise), varied among (90%, 50%), (50%, 90%), (50%, 10%), (10%, 50%).

  19. Neural Activity in the Striatum (Ito & Doya, 2015). Recordings from the dorsolateral, dorsomedial, and ventral striatum.

  20. State/Action/Reward Coding. Information (bits/sec) carried about the state (cue), action (L/R), and reward across task phases 1–7, compared among the DLS, DMS, and VS.

  21. Generalized Q-learning Model (Ito & Doya, 2009).
      Action selection: P(a(t)=L) = exp QL(t) / (exp QL(t) + exp QR(t))
      Action value update, for i ∈ {L, R}:
        Qi(t+1) = (1 − α1) Qi(t) + α1 κ1   if a(t)=i, r(t)=1
        Qi(t+1) = (1 − α1) Qi(t) − α1 κ2   if a(t)=i, r(t)=0
        Qi(t+1) = (1 − α2) Qi(t)           if a(t) ≠ i (whether r(t)=1 or 0)
      Parameters: α1, learning rate; α2, forgetting rate; κ1, reward reinforcement; κ2, no-reward aversion.
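The model above translates almost line for line into code. The sketch below is my own transcription (the parameter values at the end are placeholders, not fitted values).

import numpy as np

def p_left(Q):
    """Softmax choice probability P(a = 'L') from the two action values."""
    return np.exp(Q['L']) / (np.exp(Q['L']) + np.exp(Q['R']))

def update(Q, a, r, alpha1, alpha2, kappa1, kappa2):
    """One-trial update of Q = {'L': .., 'R': ..} after action a and reward r (1 or 0)."""
    for i in Q:
        if i == a:
            if r == 1:
                Q[i] = (1 - alpha1) * Q[i] + alpha1 * kappa1   # chosen and rewarded
            else:
                Q[i] = (1 - alpha1) * Q[i] - alpha1 * kappa2   # chosen and unrewarded
        else:
            Q[i] = (1 - alpha2) * Q[i]                         # unchosen action decays
    return Q

# Placeholder parameters for illustration only:
Q = {'L': 0.0, 'R': 0.0}
Q = update(Q, a='L', r=1, alpha1=0.3, alpha2=0.05, kappa1=1.0, kappa2=0.5)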

  22. Model Fitting by Particle Filter. Trial-by-trial estimates of QL and QR and of the parameters α1 and α2, tracked across blocks with reward probabilities (90, 50), (50, 90), and (50, 10); each trial is marked as a left or right choice, rewarded or unrewarded.
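For readers unfamiliar with the method, the sketch below shows one way a bootstrap particle filter can track the model's parameters and action values trial by trial. It is my own minimal implementation, not the authors' code: the random-walk scale, particle count, parameter ranges, and initialization are all assumptions.

import numpy as np

rng = np.random.default_rng(2)

def particle_filter(choices, rewards, n_particles=1000, walk_sd=0.02):
    """choices: array of 0 (Left) / 1 (Right); rewards: array of 0/1 per trial."""
    T = len(choices)
    # Particle columns: alpha1, alpha2, kappa1, kappa2 (kept in [0, 1] for simplicity).
    params = rng.uniform(0.0, 1.0, size=(n_particles, 4))
    Q = np.zeros((n_particles, 2))   # Q_L, Q_R per particle
    est = np.zeros((T, 6))           # posterior means: 4 parameters + Q_L, Q_R

    for t in range(T):
        # 1. Diffuse parameters (random walk) so they can drift slowly over trials.
        params = np.clip(params + walk_sd * rng.standard_normal(params.shape), 0.0, 1.0)

        # 2. Weight particles by the likelihood of the observed choice under the softmax rule.
        p_right = 1.0 / (1.0 + np.exp(Q[:, 0] - Q[:, 1]))
        lik = p_right if choices[t] == 1 else 1.0 - p_right
        w = lik + 1e-12
        w /= w.sum()
        est[t, :4] = w @ params
        est[t, 4:] = w @ Q

        # 3. Resample particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        params, Q = params[idx], Q[idx]

        # 4. Update each particle's Q values with the generalized Q-learning rule.
        a, r = choices[t], rewards[t]
        a1, a2, k1, k2 = params.T
        Q[:, a] = (1 - a1) * Q[:, a] + np.where(r == 1, a1 * k1, -a1 * k2)
        Q[:, 1 - a] = (1 - a2) * Q[:, 1 - a]
    return est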

  23. Model Fitting. Candidate models, with the number of free parameters in parentheses: 1st- to 4th-order Markov models (4, 16, 64, 256), the local matching law (1), and generalized Q-learning variants (α1 learning rate, α2 forgetting rate, κ1 reward reinforcement, κ2 no-reward aversion) with constant or variable parameters: standard Q (α2 = κ2 = 0), FAQ (forgetting, κ2 = 0), and DFAQ. Models are compared by normalized likelihood; all are marked as significantly different (* or **) except the FAQ model with variable parameters.
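A note on the comparison metric: "normalized likelihood" is plausibly the geometric mean of the per-trial choice likelihoods, exp(log L / T), which rescales the total likelihood to a per-trial number between 0 and 1. That interpretation is my assumption, not stated on the slide; the snippet below only shows the arithmetic.

import numpy as np

def normalized_likelihood(trial_likelihoods):
    """Geometric mean of per-trial choice likelihoods: exp(mean(log p_t)).

    trial_likelihoods: probability the fitted model assigned to the choice
    actually made on each trial. (Assumed definition; see the note above.)
    """
    p = np.asarray(trial_likelihoods)
    return float(np.exp(np.mean(np.log(p))))

# Example: a model assigning moderate probability to the observed choices.
print(normalized_likelihood([0.6, 0.7, 0.55, 0.62]))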

  24. Action/State Values in the Striatum (Ito & Doya, 2015). Action value coding: firing rates (Hz) in the DLS and DMS differ between trials with higher and lower QL (and QR), aligned to the cue, action, and reward phases 1–7. State value coding: firing rates in the VS vary with both QL and QR, reflecting the state value.
