  1. Reinforcement Learning Course 9.54 Final review

  2. Agent learning to act in an unknown environment

  3. Reinforcement Learning Setup (diagram: agent-environment interaction over states S_t)

  4. Background and setup • The environment is initially unknown or only partially known • It is also stochastic: the agent cannot fully predict what will happen next • What is a ‘good’ action to select under these conditions? • Animal learning seeks to maximize reward

  5. Formal Setup • The agent is in one of a set of states {S1, S2, …, Sn} • At each state, it can take an action from a set of available actions {A1, A2, …, Ak} • From state Si, taking action Aj → a new state Sj and a possible reward

  6. Stochastic transitions (diagram: from state S1, actions A1, A2, A3 lead stochastically to states S1, S2, or S3, with example rewards R = 2 and R = -1)

  7. The consequence of an action: • (S, A) → (S', R) • Governed by P(S' | S, A) and P(R | S, A, S') • These probabilities are properties of the world (‘contingencies’) • The transitions are assumed to be Markovian
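
Since the slides contain no code, here is a minimal Python sketch (Python chosen arbitrarily) of how such ‘contingencies’ could be represented and sampled. The state/action names, rewards, and probabilities are invented for illustration.

```python
import random

# Hypothetical contingencies for a tiny MDP: for each (state, action) pair,
# a list of (next_state, reward, probability) triples. Numbers are illustrative only.
P = {
    ("S1", "A1"): [("S2", 2.0, 0.7), ("S3", 0.0, 0.3)],
    ("S1", "A3"): [("S1", -1.0, 0.5), ("S3", 0.0, 0.5)],
}

def step(state, action):
    """Sample (next_state, reward) according to P(S', R | S, A)."""
    outcomes = P[(state, action)]
    probs = [p for (_, _, p) in outcomes]
    next_state, reward, _ = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

# The Markov assumption: the outcome depends only on the current (state, action),
# not on the history of earlier states.
print(step("S1", "A1"))
```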

  8. Policy • The goal is to learn a policy π: S → A • The policy determines the future of the agent (diagram: π maps each state Si to an action Aj, e.g. S1 → A3)

  9. Model-based vs. model-free RL • Model-based methods assume that the probabilities P(S' | S, A) and P(R | S, A, S') are known and can be used for planning • In model-free methods, the ‘contingencies’ are not known and must be learned by exploration as part of policy learning

  10. Step 1: defining Vπ(S) (diagram: trajectory S → S(1) → S(2) with actions a1, a2 and rewards r1, r2) • Start from S and just follow the policy π • We find ourselves in state S(1), receive reward r1, and so on • Vπ(S) = ⟨ r1 + γ r2 + γ^2 r3 + … ⟩ • This is the expected (discounted) reward from S onward.
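
A tiny sketch of the discounted-return definition above; the reward sequence and the discount factor γ = 0.9 are invented for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Return from one trajectory: r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Illustrative rewards collected while following the policy from S onward.
print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```

Averaging this quantity over many trajectories approximates the expectation ⟨ · ⟩ in the definition of Vπ(S).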

  11. Step 2: equations for V(S) • Vπ(S) = ⟨ r1 + γ r2 + γ^2 r3 + … ⟩ = ⟨ r1 + γ (r2 + γ r3 + …) ⟩ = ⟨ r1 + γ Vπ(S') ⟩ • These are equations relating V(S) for different states • Next, write them explicitly in terms of the known parameters (the contingencies):

  12. Equations for V(S) (diagram: from S, action A leads to S1, S2, S3 with rewards r1, r2, r3) • Vπ(S) = ⟨ r1 + γ Vπ(S') ⟩ = Σ_{S'} p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ] • E.g.: Vπ(S) = 0.2 (r1 + γ Vπ(S1)) + 0.5 (r2 + γ Vπ(S2)) + 0.3 (r3 + γ Vπ(S3)) • These are linear equations; the unknowns are the Vπ(Si)
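
A minimal sketch of solving these linear equations by simple iteration (policy evaluation). The model, policy, rewards, and extra states are invented for illustration; only the 0.2/0.5/0.3 probabilities follow the example above.

```python
# Iterative policy evaluation: repeatedly apply
#   V(S) <- sum_{S'} p(S'|S, pi(S)) * ( r(S, pi(S), S') + gamma * V(S') )
# until the values stop changing.
gamma = 0.9

# model[(S, A)] = list of (next state, reward, probability) triples (illustrative).
model = {
    ("S", "A"):  [("S1", 1.0, 0.2), ("S2", 2.0, 0.5), ("S3", 0.5, 0.3)],
    ("S1", "A"): [("S", 0.0, 1.0)],
    ("S2", "A"): [("S", 0.0, 1.0)],
    ("S3", "A"): [("S", 0.0, 1.0)],
}
policy = {"S": "A", "S1": "A", "S2": "A", "S3": "A"}

V = {s: 0.0 for s in policy}
for _ in range(1000):
    delta = 0.0
    for s, a in policy.items():
        v_new = sum(p * (r + gamma * V[s2]) for s2, r, p in model[(s, a)])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:   # values have converged
        break

print(V)
```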

  13. Improving the Policy • Given the policy π, we can find the values Vπ(S) by solving the linear equations iteratively • Convergence is guaranteed (the system of equations is strongly diagonally dominant) • Given Vπ(S), we can improve the policy by choosing the greedy action: π'(S) = argmax_a Σ_{S'} p(S' | S, a) [ r(S, a, S') + γ Vπ(S') ] • We can combine these two steps to find the optimal policy
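
A sketch of the greedy improvement step, using the same illustrative (state, action) → [(next state, reward, probability)] representation as the evaluation sketch above. Alternating evaluation and improvement in this way is the classic policy-iteration scheme the slide alludes to.

```python
def improve_policy(model, actions, V, gamma=0.9):
    """Greedy policy with respect to V:
    pi'(S) = argmax_a sum_{S'} p(S'|S,a) * ( r(S,a,S') + gamma * V(S') )."""
    new_policy = {}
    for s in V:
        best_a, best_q = None, float("-inf")
        for a in actions:
            if (s, a) not in model:
                continue  # action not available in this state
            q = sum(p * (r + gamma * V[s2]) for s2, r, p in model[(s, a)])
            if q > best_q:
                best_a, best_q = a, q
        new_policy[s] = best_a
    return new_policy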

  14. Improving the policy (diagram: at state S, the current policy's action A is compared with an alternative action A', with transitions to S1, S2, S3 and rewards r1, r2, r3)

  15. Value Iteration: learning V and π when the ‘contingencies’ are known

  16. Value Iteration Algorithm * Value iteration is used in the problem set
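
The algorithm itself appears as a figure on the slide; as a stand-in, here is a minimal sketch of standard value iteration for a known model, in the same illustrative representation used above (not necessarily the exact pseudocode from the slide or the problem set).

```python
def value_iteration(model, states, actions, gamma=0.9, tol=1e-8):
    """Value iteration for a known model.

    model[(S, A)] is a list of (next state, reward, probability) triples;
    every state is assumed to have at least one available action.
    Update: V(S) <- max_a sum_{S'} p(S'|S,A) * ( r(S,A,S') + gamma * V(S') ).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(p * (r + gamma * V[s2]) for s2, r, p in model[(s, a)])
                for a in actions if (s, a) in model
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Read the greedy policy off the converged values.
    policy = {
        s: max(
            (a for a in actions if (s, a) in model),
            key=lambda a: sum(p * (r + gamma * V[s2]) for s2, r, p in model[(s, a)]),
        )
        for s in states
    }
    return V, policy

# Tiny illustrative model (numbers made up): two states, two actions.
model = {
    ("S1", "stay"): [("S1", 0.0, 1.0)],
    ("S1", "go"):   [("S2", 1.0, 0.8), ("S1", 0.0, 0.2)],
    ("S2", "stay"): [("S2", 0.0, 1.0)],
    ("S2", "go"):   [("S1", 0.0, 1.0)],
}
V, policy = value_iteration(model, states=["S1", "S2"], actions=["stay", "go"])
print(V, policy)
```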

  17. Q-learning • The main algorithm used for model-free RL

  18. Q-values (state-action) (diagram: the transition graph of slide 6, with the arcs out of S1 labeled Q(S1, A1) and Q(S1, A3)) • Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π

  19. Q-value (state-action) • The same update is done on Q-values rather than on V • Q-values are used in most practical algorithms and in some brain models • Qπ(S, a) is the expected return starting from S, taking the action a, and thereafter following policy π
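
The update rule itself is not spelled out in the text above, so the following is a sketch of the standard Q-learning rule, Q(S,a) ← Q(S,a) + α [ r + γ max_a' Q(S',a') − Q(S,a) ]; the step size α, discount γ, and the example transition are invented for illustration.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One off-policy Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Illustrative usage with a made-up transition (S1, A1) -> S2 with reward 2.
Q = defaultdict(float)
q_learning_update(Q, "S1", "A1", 2.0, "S2", actions=["A1", "A2", "A3"])
print(Q[("S1", "A1")])
```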

  20. Q-values (state-action) (diagram repeated from slide 18: the arcs out of S1 labeled Q(S1, A1) and Q(S1, A3), with rewards R = 2 and R = -1)

  21. SARSA • The name comes from the quintuple it uses: s(t), a(t), r(t+1), s(t+1), a(t+1) • Each such step uses the current policy π, so that each state S gets its action from the policy: a = π(S)

  22. SARSA RL Algorithm * ε-greedy: with probability ε, do not select the greedy action; instead select with equal probability among all actions.
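
The algorithm appears as a figure on the slide; below is a minimal sketch of the standard SARSA update together with the ε-greedy rule described in the footnote. The step size α, discount γ, and the example states and actions are invented for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps choose uniformly among all actions, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update from the quintuple (s, a, r, s', a')."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# One illustrative step: the next action is chosen by the same epsilon-greedy policy
# that will actually be followed, which is what makes SARSA on-policy.
Q = defaultdict(float)
actions = ["A1", "A2", "A3"]
a_next = epsilon_greedy(Q, "S2", actions)
sarsa_update(Q, "S1", "A1", 2.0, "S2", a_next)
print(Q[("S1", "A1")])
```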

  23. TD learning and biology: dopamine

  24. Behavioral support for ‘prediction error’ • Associating a light cue with food

  25. ‘Blocking’ • No response to the bell, even though the bell and food were consistently associated • Because the food was already predicted, there was no prediction error • Conclusion: prediction error, not association, drives learning!

  26. Learning and Dopamine • Learning is driven by the prediction error: δ(t) = r + γ V(S') − V(S) • This error is computed by the dopamine system (here too, if there is no error, no learning takes place)
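
A minimal sketch of how the prediction error δ drives a TD(0) value update; the learning rate α, discount γ, and the state names are invented for illustration.

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """delta = r + gamma*V(s') - V(s); if the error is zero, no learning occurs."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"cue": 0.0, "reward_state": 0.0}
print(td_update(V, "cue", 1.0, "reward_state"))  # positive delta: unexpected reward
```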

  27. Dopaminergic neurons • Dopamine is a neuromodulator • Dopaminergic neurons sit in the VTA (ventral tegmental area) and the substantia nigra • These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example the striatum, nucleus accumbens, and frontal cortex.

  28. Major players in RL

  29. Effects of dopamine: why it is associated with reward and reward-related learning • Drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons • Neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation • Animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet

  30. Dopamine and prediction error • The animal (rat, monkey) gets a cue (visual or auditory) • A reward follows after a delay (1 sec in the example on the slide)

  31. Dopamine and prediction error
