CSE/NEURO 528 Lecture 13: Reinforcement Learning & Course Review (Chapter 9)


  1. CSE/NEURO 528 Lecture 13: Reinforcement Learning & Course Review (Chapter 9)
     Early Results: Pavlov and his Dog
     • Classical (Pavlovian) conditioning experiments
     • Training: Bell → Food
     • After training: Bell → Salivate
     • The conditioned stimulus (bell) predicts the future reward (food)
     Image: Wikimedia Commons; Animation: Tom Creed, SJU

  2. Predicting Delayed Rewards
     • How do we predict rewards delivered some time after a stimulus is presented?
     • Given: many trials, each of length T time steps
     • Time within a trial: $0 \le t \le T$, with stimulus $u(t)$ and reward $r(t)$ at each time step $t$ (note: $r(t)$ can be zero for some $t$)
     • We would like a neuron whose output $v(t)$ predicts the expected total future reward starting from time $t$:
       $v(t) = \left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle_{\text{trials}}$
     Learning to Predict Future Rewards
     • Use a set of synaptic weights $w(\tau)$ and predict $v(t)$ based on all past stimuli $u(t)$:
       $v(t) = \sum_{\tau=0}^{t} w(\tau)\, u(t-\tau)$   (a linear filter: weights $w(0), \ldots, w(t)$ applied to the stimuli $u(t), u(t-1), \ldots, u(0)$)
     • Learn weights $w(\tau)$ that minimize the error
       $\left\langle \left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 \right\rangle$
       (Can we minimize this using gradient descent and the delta rule? Yes, BUT future rewards are not yet available!)
     (A code sketch of this linear-filter prediction follows below.)
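A minimal sketch of the linear-filter prediction described above. The stimulus and reward times (t = 100 and t = 200) are taken from the TD figure on the next slide; the parameter values and the zero-initialized weights are illustrative assumptions, not part of the lecture.

```python
import numpy as np

T = 250
u = np.zeros(T); u[100] = 1.0          # stimulus u(t): present at t = 100
r = np.zeros(T); r[200] = 1.0          # reward r(t): delivered at t = 200
w = np.zeros(T)                        # one weight w(tau) per delay tau (not yet learned)

def predict(w, u):
    """v(t) = sum over past delays tau of w(tau) * u(t - tau)."""
    T = len(u)
    v = np.zeros(T)
    for t in range(T):
        for tau in range(t + 1):
            v[t] += w[tau] * u[t - tau]
    return v

# The quantity the weights should come to predict: total future reward from t onward.
total_future_reward = np.cumsum(r[::-1])[::-1]
print(total_future_reward[100], total_future_reward[201])   # 1.0 before the reward, 0.0 after
```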

  3. Temporal Difference (TD) Learning
     • Key idea: rewrite the error function to get rid of the future terms:
       $\left( \sum_{\tau=0}^{T-t} r(t+\tau) - v(t) \right)^2 = \left( r(t) + \sum_{\tau=0}^{T-t-1} r(t+1+\tau) - v(t) \right)^2 \approx \left( r(t) + v(t+1) - v(t) \right)^2$
       Minimize this using gradient descent!
     • Temporal Difference (TD) learning rule:
       $w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
       where $r(t) + v(t+1)$ is the expected future reward and $v(t)$ is the prediction.
     Predicting Future Rewards: TD Learning
     • [Figure: stimulus at t = 100 and reward at t = 200; the prediction error $\delta$ at each time step, shown over many trials]
     Image Source: Dayan & Abbott textbook
     (A simulation sketch of this setup follows below.)
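A minimal simulation sketch of the TD rule above under the setup in the figure (stimulus at t = 100, reward at t = 200, many identical trials). The learning rate and number of trials are illustrative assumptions.

```python
import numpy as np

T, eps, n_trials = 250, 0.2, 2000
u = np.zeros(T); u[100] = 1.0                  # stimulus at t = 100
r = np.zeros(T); r[200] = 1.0                  # reward at t = 200
w = np.zeros(T)                                # w(tau), one weight per delay

for _ in range(n_trials):
    v = np.convolve(u, w)[:T]                  # v(t) = sum_tau w(tau) u(t - tau)
    for t in range(T - 1):
        delta = r[t] + v[t + 1] - v[t]         # TD prediction error delta(t)
        w[:t + 1] += eps * delta * u[t::-1]    # w(tau) += eps * delta * u(t - tau)

v = np.convolve(u, w)[:T]
# After training, the prediction rises to ~1 at the stimulus, stays there until the
# reward, and drops back afterwards (so the error delta moves to the stimulus time).
print(round(v[100], 2), round(v[150], 2), round(v[201], 2))
```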

  4. Possible Reward Prediction Error Signal in the Primate Brain
     • Dopaminergic cells in the Ventral Tegmental Area (VTA)
     • Reward prediction error: $\delta = r(t) + v(t+1) - v(t)$?
     • Before training: $v \approx 0$, so $\delta \approx r(t)$ and the cells respond at the time of reward.
     • After training: the cells respond at the time of the stimulus, where $\delta = [\,0 + v(t+1) - v(t)\,] > 0$, and show no error at the time of reward, since $v(t) = r(t) + v(t+1)$ gives $\delta = [\,r(t) + v(t+1) - v(t)\,] = 0$.
     Image Source: Dayan & Abbott textbook
     More Evidence for Prediction Error Signals
     • Dopaminergic cells in VTA after training
     • Negative error when a reward is expected but not delivered: $r(t) = 0$ and $v(t+1) = 0$, so $\delta = [\,r(t) + v(t+1) - v(t)\,] = -v(t)$.
     Image Source: Dayan & Abbott textbook

  5. Reinforcement Learning: Acting to Maximize Rewards
     • At each time step, the agent receives a state $u_t$ and a reward $r_t$ from the environment and emits an action $a_t$ back to the environment.
     The Problem
     • Learn a state-to-action mapping, or "policy", $a = \pi(u)$ that maximizes the expected total future reward
       $\left\langle \sum_{\tau=0}^{T-t} r(t+\tau) \right\rangle_{\text{trials}}$
     (A minimal sketch of this agent-environment loop follows below.)
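To make the loop concrete, here is a minimal sketch with a made-up two-state environment and a random policy. The environment, its reward structure, and the policy are illustrative assumptions, not part of the lecture.

```python
import random

class ToyEnvironment:
    """Hypothetical two-state environment: action 'R' taken in state 0 yields reward 1."""
    def reset(self):
        self.u = 0
        return self.u
    def step(self, a):
        r = 1.0 if (self.u == 0 and a == "R") else 0.0
        self.u = 1 - self.u                      # deterministic state transition
        return self.u, r

def random_policy(u):
    """A policy maps the state u_t to an action a_t; this one ignores the state."""
    return random.choice(["L", "R"])

env, total = ToyEnvironment(), 0.0
u = env.reset()
for t in range(20):
    a = random_policy(u)     # agent: a_t = pi(u_t)
    u, r = env.step(a)       # environment returns reward r_t and the next state
    total += r
print("total reward collected:", total)   # the quantity a good policy should maximize
```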

  6. Example: Rat in a Barn
     • States = locations A, B, or C
     • Actions = L (go left) or R (go right)
     • If the rat chooses L or R at random (a random "policy"), what is the expected reward (or "value") $v$ for each state?
     Image Source: Dayan & Abbott textbook
     Policy Evaluation
     • For the random policy:
       $v(B) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 5 = 2.5$
       $v(C) = \tfrac{1}{2}\cdot 2 + \tfrac{1}{2}\cdot 0 = 1$
       $v(A) = \tfrac{1}{2} v(B) + \tfrac{1}{2} v(C) = 1.75$
     • Let the value of state $u$, $v(u)$, equal a weight $w(u)$. We can learn the values of the states using TD learning:
       $w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$
       where (location, action) → new location, i.e., $(u, a) \rightarrow u'$.
     (A code sketch of this policy evaluation follows below.)
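A minimal code sketch of TD policy evaluation for this example. The transitions and rewards follow the slide (from A the rat moves to B or C with no immediate reward; leaving B yields 0 or 5, leaving C yields 2 or 0); which arm is "left" is an assumption, and a smaller learning rate than the slide's ε = 0.5 is used so the printed estimates settle closer to the exact values.

```python
import random

rewards = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}
next_state = {("A", "L"): "B", ("A", "R"): "C"}       # leaving B or C ends the trial
v = {"A": 0.0, "B": 0.0, "C": 0.0}                    # v(u) = w(u), one weight per state
eps = 0.02

for trial in range(10000):
    u = "A"
    while u is not None:
        a = random.choice(["L", "R"])                 # random policy
        u_next = next_state.get((u, a))               # None once the rat leaves the maze
        r = rewards.get((u, a), 0.0)
        v_next = v[u_next] if u_next is not None else 0.0
        v[u] += eps * (r + v_next - v[u])             # w(u) += eps * [r(u) + v(u') - v(u)]
        u = u_next

print({s: round(x, 2) for s, x in v.items()})  # fluctuates around A: 1.75, B: 2.5, C: 1.0
```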

  7. TD Learning of Values for the Random Policy
     • [Figure: the learned values converge to $v(A) = 1.75$, $v(B) = 2.5$, and $v(C) = 1$ over trials; for all three, $\epsilon = 0.5$]
     • Once I know the values, I can pick the action that leads to the higher-valued state!
     Image Source: Dayan & Abbott textbook
     Selecting Actions based on Values
     • Values act as surrogate immediate rewards.
     • A locally optimal choice leads to a globally optimal policy for "Markov" environments (related to Dynamic Programming).
     (A small action-selection sketch follows below.)
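A tiny sketch of the idea: with the learned values in hand, the action whose successor state has the higher value is the locally better choice at A. The mapping of actions to successor states is the same assumption as above.

```python
values = {"B": 2.5, "C": 1.0}                     # values from the previous slide
successor = {"L": "B", "R": "C"}                  # where each action leads from state A
best = max(successor, key=lambda a: values[successor[a]])
print(best)                                       # "L": head toward the higher-valued state B
```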

  8. Putting It All Together: Actor-Critic Learning
     • Two separate components: the Actor (selects actions and maintains the policy) and the Critic (maintains the value of each state).
     • Step 1. Critic learning ("policy evaluation"): the value of state $u$ is $v(u) = w(u)$, updated by
       $w(u) \rightarrow w(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]$   (same as the TD rule)
     • Step 2. Actor learning ("policy improvement"): probabilistically select an action $a$ at state $u$ according to
       $P(a; u) = \dfrac{\exp(Q_a(u))}{\sum_b \exp(Q_b(u))}$
       and, for all actions $a'$, update
       $Q_{a'}(u) \rightarrow Q_{a'}(u) + \epsilon\,[\,r(u) + v(u') - v(u)\,]\,\left(\delta_{a a'} - P(a'; u)\right)$
     • Step 3. Repeat steps 1 and 2.
     Actor-Critic Learning in our Barn Example
     • [Figure: probability of going Left at each location over trials]
     Image Source: Dayan & Abbott textbook
     (A code sketch of actor-critic learning on the barn task follows below.)
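A code sketch of actor-critic learning on the barn task, using the same assumed maze layout as the earlier sketch. The learning rate and trial count are illustrative choices, not from the lecture.

```python
import math, random

rewards = {("B", "L"): 0.0, ("B", "R"): 5.0, ("C", "L"): 2.0, ("C", "R"): 0.0}
next_state = {("A", "L"): "B", ("A", "R"): "C"}
v = {s: 0.0 for s in "ABC"}                       # critic: value v(u) = w(u)
Q = {s: {"L": 0.0, "R": 0.0} for s in "ABC"}      # actor: action propensities Q_a(u)
eps = 0.05

def policy(u):
    """Softmax action selection: P(a; u) = exp(Q_a(u)) / sum_b exp(Q_b(u))."""
    z = {a: math.exp(Q[u][a]) for a in ("L", "R")}
    s = sum(z.values())
    return {a: z[a] / s for a in ("L", "R")}

for trial in range(5000):
    u = "A"
    while u is not None:
        p = policy(u)
        a = random.choices(("L", "R"), weights=(p["L"], p["R"]))[0]
        u_next = next_state.get((u, a))
        r = rewards.get((u, a), 0.0)
        delta = r + (v[u_next] if u_next is not None else 0.0) - v[u]
        v[u] += eps * delta                                    # step 1: critic (TD rule)
        for a2 in ("L", "R"):                                  # step 2: actor update
            Q[u][a2] += eps * delta * ((a2 == a) - p[a2])
        u = u_next

# With this layout, P(Left) rises at A and C and falls at B as learning proceeds.
print({s: round(policy(s)["L"], 2) for s in "ABC"})
```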

  9. Possible Implementation of the Actor-Critic Model in the Basal Ganglia
     • [Diagram: Cortex (state estimate) → hidden layer → Striatum, which houses the Actor and the Critic (value); the TD error is carried by dopamine (DA) from SNc; STN, GPe, GPi/SNr, and Thalamus lie on the pathway to the selected Action]
     (See Supplementary Materials for references)
     Reinforcement learning has been applied to many real-world problems!
     • Examples: Google's AlphaGo beats the human champion at Go; autonomous helicopter flight (learned from human demonstrations).
     (Videos and papers at: http://heli.stanford.edu/)

  10. Course Summary
     • Where have we been?
     • Course Highlights
     • Where do we go from here?
     • Challenges and Open Problems
     • Further Reading
     What is the neural code?
     • What is the nature of the code? Representing the spiking output: single cells vs. populations; rates vs. spike times vs. intervals
     • What features of the stimulus does the neural system represent?

  11. Encoding and Decoding Neural Information
     • Encoding: building functional models of neurons and neural systems, and predicting the spiking output given the stimulus
     • Decoding: what can we say about the stimulus given what we observe from the neuron or neural population?
     Information maximization as a design principle of the nervous system

  12. Biophysical Models of Neurons
     • Conductances can be voltage dependent, transmitter dependent (synaptic), or Ca dependent
     The Neural Equivalent Circuit
     • Ohm's law and Kirchhoff's law: the capacitive current balances the ionic currents and the externally applied current,
       $c_m \dfrac{dV}{dt} = -i_m + \dfrac{I_e}{A}$

  13. Simplified Models: Integrate-and-Fire
     • Integrate-and-fire model:
       $\tau_m \dfrac{dV}{dt} = -(V - E_L) + R_m I_e$
     • If $V > V_{\text{threshold}}$ → spike; then reset: $V = V_{\text{reset}}$
     Modeling Networks of Neurons
     • $\tau \dfrac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M\mathbf{v})$
       where $-\mathbf{v}$ is the decay of the output, $W\mathbf{u}$ the feedforward input, and $M\mathbf{v}$ the recurrent feedback.
     (A short integration sketch of the integrate-and-fire model follows below.)
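A minimal Euler-integration sketch of the integrate-and-fire model above. The parameter values (membrane time constant, resting potential, resistance, input current, thresholds) are illustrative assumptions.

```python
# Euler integration of tau_m dV/dt = -(V - E_L) + R_m * I_e with threshold and reset.
tau_m, E_L, R_m = 10.0, -65.0, 10.0        # ms, mV, Mohm (assumed values)
V_th, V_reset = -50.0, -65.0               # mV
I_e, dt, T = 2.0, 0.1, 200.0               # nA, ms, ms

V, spikes = E_L, []
for step in range(int(T / dt)):
    t = step * dt
    V += dt / tau_m * (-(V - E_L) + R_m * I_e)   # one Euler step of the membrane equation
    if V > V_th:                                  # threshold crossing -> spike, then reset
        spikes.append(t)
        V = V_reset
print(f"{len(spikes)} spikes in {T} ms")
```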

  14. Unsupervised Learning
     • For a linear neuron: $v = \mathbf{w}^T \mathbf{u} = \mathbf{u}^T \mathbf{w}$
     • Basic Hebb rule: $\tau_w \dfrac{d\mathbf{w}}{dt} = \mathbf{u}\, v$
     • Averaging over many inputs: $\tau_w \dfrac{d\mathbf{w}}{dt} = \langle \mathbf{u}\, v \rangle = Q\mathbf{w}$, where $Q = \langle \mathbf{u}\mathbf{u}^T \rangle$ is the input correlation matrix
     • The Hebb rule performs principal component analysis (PCA)
     The Connection to Statistics
     • Unsupervised learning = learning the hidden causes $v$ of the input data $u$
     • Generative model: the data likelihood $p[u \mid v; G]$ with parameters $G$; recognition model: the posterior $p[v \mid u; G]$
     • Use the EM algorithm for learning; examples include clustering and learning the "causes" of natural images
     (A Hebbian-learning sketch follows below.)
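A sketch of the Hebb-rule-to-PCA connection above: averaged Hebbian learning drives the weight vector toward the principal eigenvector of Q. The 2-D correlated input distribution is an illustrative assumption, and the per-step weight normalization is an addition (the plain Hebb rule lets the weights grow without bound).

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.8], [0.8, 1.0]])               # assumed input correlations
U = rng.multivariate_normal([0.0, 0.0], C, size=5000)
Q = U.T @ U / len(U)                                  # input correlation matrix Q = <u u^T>

w = rng.normal(size=2)
eta = 0.01
for u in U:
    v = w @ u                      # linear neuron: v = w^T u
    w += eta * v * u               # basic Hebb rule
    w /= np.linalg.norm(w)         # normalization step (added; plain Hebb diverges)

eigvals, eigvecs = np.linalg.eigh(Q)
pc1 = eigvecs[:, -1]               # principal eigenvector of Q (largest eigenvalue)
print(round(abs(float(w @ pc1)), 3))   # typically close to 1: w aligns with the first PC
```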

  15. Generative Models
     • A lighthearted example: hidden causes such as mathematical derivations, a droning lecture, and lack of sleep
     Supervised Learning: Backpropagation for Multilayered Networks
     • Network output: $v_i^m = g\!\left( \sum_j W_{ij}\, g\!\left( \sum_k w_{jk} u_k^m \right) \right)$
     • Goal: find $W$ and $w$ that minimize the error between the outputs and the desired outputs $d_i^m$:
       $E(W, w) = \dfrac{1}{2} \sum_{m,i} \left( d_i^m - v_i^m \right)^2$
     • Gradient descent learning rules:
       $W_{ij} \rightarrow W_{ij} - \epsilon \dfrac{\partial E}{\partial W_{ij}}$   (delta rule)
       $w_{jk} \rightarrow w_{jk} - \epsilon \dfrac{\partial E}{\partial w_{jk}}$, with $\dfrac{\partial E}{\partial w_{jk}} = \sum_m \dfrac{\partial E}{\partial x_j^m}\, \dfrac{\partial x_j^m}{\partial w_{jk}}$   (chain rule), where $x_j^m$ is the hidden-unit activity
     (A backpropagation sketch follows below.)
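A sketch of gradient descent with backpropagation for the two-layer network above, with $g = \tanh$ and XOR as a toy dataset; both of these, as well as the bias units (implemented as constant-1 inputs, which the slide's equations do not show), are assumptions for illustration. With this initialization it typically converges, though backprop can occasionally get stuck in a local minimum.

```python
import numpy as np

rng = np.random.default_rng(1)
U = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs u^m
D = np.array([[0.0], [1.0], [1.0], [0.0]])                    # desired outputs d^m (XOR)
Ub = np.hstack([U, np.ones((4, 1))])                          # constant-1 input = bias unit
w = rng.normal(scale=0.5, size=(3, 4))                        # input (+bias) -> hidden, w_jk
W = rng.normal(scale=0.5, size=(5, 1))                        # hidden (+bias) -> output, W_ij
g = np.tanh
eps = 0.2

def forward(Ub, w, W):
    x = g(Ub @ w)                              # hidden activities x_j^m
    xb = np.hstack([x, np.ones((len(x), 1))])  # bias unit for the output layer
    return x, xb, g(xb @ W)                    # outputs v_i^m

for epoch in range(10000):
    x, xb, v = forward(Ub, w, W)
    out_delta = (v - D) * (1 - v ** 2)                    # dE/d(output pre-activation)
    dW = xb.T @ out_delta                                 # delta rule for output weights
    hid_delta = (out_delta @ W[:-1].T) * (1 - x ** 2)     # chain rule back to hidden layer
    dw = Ub.T @ hid_delta
    W -= eps * dW
    w -= eps * dw

_, _, v = forward(Ub, w, W)
print(np.round(v.ravel(), 2))   # should approach [0, 1, 1, 0]
```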

  16. Reinforcement Learning (Review)
     • Learning to predict rewards: $w \rightarrow w + \epsilon\,(r - v)\,u$   (a sketch of this rule follows after this item; Pavlov animation: http://employees.csbsju.edu/tcreed/pb/pdoganim.html)
     • Learning to predict delayed rewards (TD learning):
       $w(\tau) \rightarrow w(\tau) + \epsilon\,[\,r(t) + v(t+1) - v(t)\,]\,u(t-\tau)$
     • Actor-Critic learning:
       • The Critic learns the value of each state using TD learning
       • The Actor learns the best actions based on the value of the next state (using the TD error)
     The Future: Challenges and Open Problems
     • How do neurons encode information? Topics: synchrony, spike-timing based learning, dynamic synapses
     • Does a neuron's structure confer computational advantages? Topics: role of channel dynamics, dendrites, plasticity in channels and their density
     • How do networks implement computational principles such as efficient coding and Bayesian inference?
     • How do networks learn "optimal" representations of their environment and engage in purposeful behavior? Topics: unsupervised/reinforcement/imitation learning
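A tiny sketch of the first rule above (the delta rule for predicting immediate rewards), applied to the Pavlov setting of a single stimulus paired with a reward on every trial. The learning rate and number of trials are arbitrary illustrative choices.

```python
eps, w = 0.1, 0.0
u, r = 1.0, 1.0                      # stimulus present, reward delivered, every trial
for trial in range(100):
    v = w * u                        # predicted reward
    w += eps * (r - v) * u           # w -> w + eps * (r - v) * u
print(round(w, 3))                   # approaches 1: the bell comes to predict the food
```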
