Reinforcement Learning Course 9.54 Final review
Agent learning to act in an unknown environment
Reinforcement Learning Setup
[Figure: the agent–environment interaction loop, with the agent in state S_t]
Background and setup
• The environment is initially unknown or only partially known
• It is also stochastic: the agent cannot fully predict what will happen next
• What is a ‘good’ action to select under these conditions?
• Animal learning seeks to maximize reward
Formal Setup
• The agent is in one of a set of states {S1, S2, …, Sn}
• At each state, it can take an action from a set of available actions {A1, A2, …, Ak}
• From state Si, taking action Aj → a new state S' and a possible reward
Stochastic transitions
[Figure: from state S1, each action A1, A2, A3 can lead to several next states (S1, S2, S3), with rewards such as R = 2 and R = -1]
The consequence of an action:
• (S, A) → (S', R)
• Governed by:
  – P(S' | S, A)
  – P(R | S, A, S')
• These probabilities are properties of the world (‘contingencies’)
• An assumption is made that the transitions are Markovian
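For concreteness, here is a minimal sketch of how such ‘contingencies’ could be stored in code; the state/action names and the numbers are illustrative, not taken from the slides.

```python
# P[(S, A)] maps each next state S' to the probability P(S' | S, A)
P = {
    ("S1", "A1"): {"S2": 0.7, "S3": 0.3},
    ("S1", "A2"): {"S1": 0.5, "S2": 0.5},
}

# R[(S, A, S')] is the (expected) reward for that transition
R = {
    ("S1", "A1", "S2"): 2.0,
    ("S1", "A1", "S3"): 0.0,
    ("S1", "A2", "S1"): -1.0,
    ("S1", "A2", "S2"): 0.0,
}

# The Markov assumption: these tables depend only on the current state S and
# action A, not on how the agent arrived at S.
```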
Policy
• The goal is to learn a policy π: S → A
• The policy determines the future of the agent
[Figure: the policy π maps each state (S1, S2, S3) to an action (A1, A2, A3)]
Model-based vs. Model-free RL
• Model-based methods assume that the probabilities:
  – P(S' | S, A)
  – P(R | S, A, S')
  are known and can be used in planning
• In model-free methods:
  – The ‘contingencies’ are not known
  – They need to be learned by exploration as part of policy learning
Step 1: defining Vπ(S)
[Figure: a trajectory S → S(1) → S(2) → … produced by actions a1, a2, … with rewards r1, r2, …]
• Start from S and just follow the policy π
• We find ourselves in state S(1) with reward r1, and so on
• Vπ(S) = ⟨ r1 + γ r2 + γ² r3 + … ⟩
• This is the expected (discounted) reward from S onward
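As a small illustration of the quantity inside the brackets, here is a sketch that computes the discounted return r1 + γ r2 + γ² r3 + … for one sampled trajectory (the reward values in the example are made up).

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r1 + gamma*r2 + gamma^2*r3 + ... for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three rewards collected along one trajectory
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```

Vπ(S) is the average of this quantity over many trajectories started from S under π.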
Step 2: equations for V(S)
• Vπ(S) = ⟨ r1 + γ r2 + γ² r3 + … ⟩
•        = ⟨ r1 + γ (r2 + γ r3 + …) ⟩
•        = ⟨ r1 + γ Vπ(S') ⟩
• These are equations relating V(S) for different states
• Next, write them explicitly in terms of the known parameters (contingencies):
Equations for V(S)
[Figure: from state S, action A = π(S) leads to S1, S2, S3 with rewards r1, r2, r3]
• Vπ(S) = ⟨ r1 + γ Vπ(S') ⟩ = Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
• E.g.: Vπ(S) = 0.2 (r1 + γ Vπ(S1)) + 0.5 (r2 + γ Vπ(S2)) + 0.3 (r3 + γ Vπ(S3))
• These are linear equations; the unknowns are the Vπ(Si)
Improving the Policy
• Given the policy π, we can find the values Vπ(S) by solving the linear equations iteratively
• Convergence is guaranteed (the system of equations is strongly diagonally dominant)
• Given Vπ(S), we can improve the policy by acting greedily with respect to it:
  π'(S) = argmax_a Σ_S' p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ]
• We can combine these two steps to find the optimal policy (see the sketch after the next slide)
Improving the policy
[Figure: at state S, comparing the current action A = π(S) with an alternative action A', using the rewards r1, r2, r3 and the values of the successor states S1, S2, S3]
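Here is a minimal policy-iteration sketch combining the two steps above (iterative policy evaluation, then greedy improvement), assuming the contingencies are stored in the dict format of the earlier example; the function and variable names are illustrative.

```python
def evaluate_policy(states, policy, P, R, gamma=0.9, n_sweeps=1000):
    """Iteratively solve Vpi(S) = sum_S' P(S'|S,pi(S)) [r(S,pi(S),S') + gamma*Vpi(S')]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_sweeps):
        for s in states:
            a = policy[s]
            V[s] = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                       for s2, p in P[(s, a)].items())
    return V

def improve_policy(states, actions, V, P, R, gamma=0.9):
    """Greedy improvement: pick the action with the largest one-step look-ahead value."""
    return {
        s: max(actions,
               key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                 for s2, p in P[(s, a)].items()))
        for s in states
    }
```

Alternating evaluate_policy and improve_policy until the policy stops changing yields the optimal policy.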
Value Iteration: learning V and π when the ‘contingencies’ are known
Value Iteration Algorithm * Value iteration is used in the problem set
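A minimal value-iteration sketch, again assuming the dict-based contingencies used above; the problem-set version may organize the data differently.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Back up each state with the best one-step look-ahead over actions
            best = max(
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Read off the greedy policy from the converged values
    policy = {s: max(actions,
                     key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                       for s2, p in P[(s, a)].items()))
              for s in states}
    return V, policy
```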
Q-learning • The main algorithm used for model-free RL
Q-values (state-action)
[Figure: from state S1, actions A1, A2, A3 lead to next states with rewards such as R = 2 and R = -1; each state-action pair has its own value, e.g. Q(S1, A1), Q(S1, A3)]
• Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π
Q-value (state-action)
• The same update is done on Q-values rather than on V
• Used in most practical algorithms and in some brain models
• Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π:
  Qπ(S, a) = ⟨ r1 + γ r2 + γ² r3 + … ⟩, with the first action a and actions chosen by π thereafter
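The slides do not spell out the Q-learning update itself; the following is a sketch of the standard tabular rule, Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ], with an illustrative table representation.

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular Q-values, Q[(state, action)], initialized to 0

def q_learning_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```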
SARSA
• It is called SARSA because it uses: s(t), a(t), r(t+1), s(t+1), a(t+1)
• A step like this uses the current π, so that each S has its action according to the policy π: a = π(S)
SARSA RL Algorithm
* ε-greedy: with probability ε, do not select the greedy action; instead select with equal probability among all actions.
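A minimal SARSA sketch using the ε-greedy rule defined above. The environment interface (reset() and step(state, action) returning (next_state, reward, done)) is an assumption for illustration, not part of the slides.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon, pick uniformly among all actions (as on the slide);
    # otherwise act greedily with respect to the current Q-values.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(s, a)          # assumed interface
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # On-policy update uses (s, a, r, s_next, a_next) -- hence "SARSA"
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```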
TD learning. Biology: dopamine
Behavioral support for ‘prediction error’: associating a light cue with food
‘Blocking’
• After the light already predicts the food, a bell is added and consistently paired with the food
• Yet there is no response to the bell, even though the bell and food were consistently associated
• There was no prediction error (the light already predicted the food)
• Conclusion: prediction error, not association, drives learning!
Learning and Dopamine
• Learning is driven by the prediction error: δ(t) = r + γ V(S') − V(S)
• This error is computed by the dopamine system (here too, if there is no error, no learning will take place)
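A minimal TD(0) sketch built directly on the prediction-error formula above; the learning rate α and the table-based value function are illustrative choices.

```python
from collections import defaultdict

V = defaultdict(float)  # value estimates V(S), initialized to 0

def td_update(s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) update driven by the prediction error delta = r + gamma*V(S') - V(S)."""
    delta = r + gamma * V[s_next] - V[s]   # the prediction error
    V[s] += alpha * delta                  # no error -> no change (no learning)
    return delta
```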
Dopaminergic neurons
• Dopamine is a neuromodulator
• Dopaminergic neurons are found in:
  – the VTA (ventral tegmental area)
  – the Substantia Nigra
• These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example the striatum, nucleus accumbens, and frontal cortex
Major players in RL
Effects of dopamine: why it is associated with reward and reward-related learning
• Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons
• Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation:
  – Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet
Dopamine and prediction error
• The animal (rat or monkey) gets a cue (visual or auditory)
• A reward follows after a delay (1 s in the figure)
Dopamine and prediction error