  1. Anatomy of an RL agent: model, policy, value function Robert Platt Northeastern University

  2. Running example: gridworld Gridworld: – agent lives on grid – always occupies a single cell – can move left, right, up, down – gets zero reward unless in “+1” or “-1” cells

  3. States and actions State set: S = {cells of the grid} – one state per cell. Action set: A = {left, right, up, down}.
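A minimal sketch of how these sets might be written down in code. The grid dimensions below are an assumption for illustration, since the slides show the grid only as a figure:

```python
# Sketch of the gridworld's state and action sets.
# WIDTH and HEIGHT are assumed values, not taken from the slides.
WIDTH, HEIGHT = 4, 3

# One state per cell, identified by its (column, row) coordinates.
states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]

# The agent can move left, right, up, or down.
actions = ["left", "right", "up", "down"]

print(len(states), "states,", len(actions), "actions")
```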

  4. Reward function Reward function: r(s) = +1 if the agent is in the “+1” cell, r(s) = -1 if it is in the “-1” cell. Otherwise: r(s) = 0.

  5. Reward function Reward function: r(s) = +1 in the “+1” cell, -1 in the “-1” cell. Otherwise: r(s) = 0. In general: the reward may depend on the action too, written r(s, a).

  6. Reward function Reward function: r(s) = +1 in the “+1” cell, -1 in the “-1” cell. Otherwise: r(s) = 0. In general: r(s, a) = E[ R_t | S_t = s, A_t = a ] – the expected reward on this time step given that the agent takes action a from state s.
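A sketch of this reward function, assuming the same gridworld; the coordinates chosen for the “+1” and “-1” cells are placeholders, not taken from the slides:

```python
# Sketch of the gridworld reward function described above.
# The cell coordinates below are hypothetical; the slides mark the
# "+1" and "-1" cells only on the grid figure.
PLUS_CELL = (3, 2)   # assumed location of the "+1" cell
MINUS_CELL = (3, 1)  # assumed location of the "-1" cell

def reward(state, action=None):
    """Expected reward for taking `action` in `state`.
    Here the reward depends only on the cell the agent occupies."""
    if state == PLUS_CELL:
        return +1.0
    if state == MINUS_CELL:
        return -1.0
    return 0.0
```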

  7. Agent Model Transition model: T(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a). For example: T(cell above | current cell, up), the probability that taking “up” moves the agent to the cell above.

  8. Agent Model Transition model: T(s' | s, a) – the probability of this transition, i.e. of landing in s' after taking action a from state s. This entire probability distribution can be written as a table over (state, action, next state).
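One way such a table might look in code. The deterministic “move, or stay put at the grid edge” dynamics below are an assumption for illustration, not the transition model used in the slides:

```python
# The transition distribution T(s' | s, a) written as a table, as the
# slide suggests: one probability per (state, action, next state).
# Deterministic dynamics are assumed here purely for illustration.
WIDTH, HEIGHT = 4, 3
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def next_cell(state, action):
    x, y = state
    dx, dy = MOVES[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
        return (nx, ny)
    return state  # bumping into the edge leaves the agent in place

# T[(s, a)] maps each possible next state s' to its probability.
T = {}
for s in [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]:
    for a in MOVES:
        T[(s, a)] = {next_cell(s, a): 1.0}

print(T[((0, 0), "right")])  # {(1, 0): 1.0}
```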

  9. Agent Model: Summary State set: S. Action set: A. Reward function: r(s, a). Transition model: T(s' | s, a).

  10. Agent Model: Frozen Lake Example Frozen Lake is this 4x4 grid. State set: the 16 cells of the grid. Action set: {left, down, right, up}. Reward function: r(s) = 1 if s is the goal cell, 0 otherwise. Transition model: only a one-third chance of going in the specified direction – one-third chance of moving +90 deg – one-third chance of moving -90 deg.
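For reference, these same pieces can be inspected programmatically. The sketch below assumes the gymnasium package and its FrozenLake-v1 environment, which the slides describe but do not name:

```python
# Sketch using the FrozenLake-v1 environment from the gymnasium package
# (an assumption: the slides describe Frozen Lake but cite no library).
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)  # slippery = stochastic moves
print(env.observation_space.n)   # 16 states: the cells of the 4x4 grid
print(env.action_space.n)        # 4 actions: left, down, right, up

# The transition model is exposed as a table: for each (state, action),
# a list of (probability, next_state, reward, terminated) tuples.
# With is_slippery=True, each intended move has a 1/3 chance of going as
# asked and a 1/3 chance of slipping to either perpendicular direction.
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:  # state 0, action "right"
    print(prob, next_state, reward, terminated)
```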

  11. Agent Model: Recycling Robot Example Example 3.4 in SB (Sutton & Barto), 2nd Ed.

  12. Policy A policy is a rule for selecting actions: π(s) = a – if the agent is in this state, then take this action.

  13. Policy A policy is a rule for selecting actions: π(s) = a – if the agent is in this state, then take this action.

  14. Policy A policy is a rule for selecting actions: π(s) = a – if the agent is in this state, then take this action. A policy can also be stochastic: π(a | s), a probability distribution over actions for each state.
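A sketch of both kinds of policy; the states and probabilities below are made up for illustration:

```python
import random

# A deterministic policy is just a lookup: state -> action.
deterministic_policy = {
    (0, 0): "right",
    (1, 0): "right",
    (2, 0): "up",
}

# A stochastic policy gives a probability for each action in each state;
# acting means sampling from that distribution.
stochastic_policy = {
    (0, 0): {"right": 0.8, "up": 0.2},
}

def act(policy, state):
    rule = policy[state]
    if isinstance(rule, str):           # deterministic: take the stored action
        return rule
    actions = list(rule.keys())         # stochastic: sample according to the weights
    weights = list(rule.values())
    return random.choices(actions, weights=weights)[0]

print(act(deterministic_policy, (0, 0)))
print(act(stochastic_policy, (0, 0)))
```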

  15. Episodic vs Continuing Process Episodic process: execution ends at some point and starts over – after a fixed number of time steps, or upon reaching a terminal state. Example of an episodic task: execution ends upon reaching the terminal state OR after 15 time steps.

  16. Episodic vs Continuing Process Continuing process: execution goes on forever. Example of a continuing task: the process doesn’t stop – the agent keeps getting rewards.
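A sketch of what an episodic rollout loop might look like, using the gymnasium interface as an assumed stand-in for the agent's environment and the 15-step cap from the example above:

```python
# Sketch of an episodic rollout: the episode ends when a terminal state
# is reached OR after a fixed number of time steps (15, as in the example).
# The gymnasium FrozenLake-v1 environment is an assumed stand-in.
import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, info = env.reset()
for t in range(15):                      # hard cap of 15 time steps
    action = env.action_space.sample()   # random actions stand in for a real policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:                       # reached a terminal state
        break
env.close()
```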

  17. Value Function Value of state s when acting according to policy π: the expected discounted future reward starting at state s and acting according to π, written V^π(s).

  18. Value Function Value of state s when acting according to policy π: the expected discounted future reward starting at s and acting according to π. This quantity, V^π(s), is called the value function.

  19. Value Function Value of state s when acting according to policy π: the expected discounted future reward starting at s and acting according to π. Why we care about the value function: because it helps us calculate a good policy – we’ll see how shortly.

  20. Value Function Value of state s when acting according to policy π: the expected discounted future reward starting at s and acting according to π, written V^π(s).

  21. Value Function Value of state s when acting according to policy π. A naive definition would be the plain sum of future rewards, V^π(s) = E[ r_t + r_{t+1} + r_{t+2} + … | S_t = s, π ] – what’s wrong with this? (For a continuing task, this sum can grow without bound.)

  22. Value Function Value of state s when acting according to policy π. Two viable alternatives: 1. maximize expected future reward over the next T time steps (finite horizon): V^π(s) = E[ r_t + r_{t+1} + … + r_{t+T-1} | S_t = s, π ]. 2. maximize expected discounted future rewards: V^π(s) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | S_t = s, π ].

  23. Value Function Value of state s when acting according to policy π. Two viable alternatives: 1. maximize expected future reward over the next T time steps (finite horizon): V^π(s) = E[ r_t + r_{t+1} + … + r_{t+T-1} | S_t = s, π ]. 2. maximize expected discounted future rewards: V^π(s) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | S_t = s, π ], where γ is the discount factor – 0.9 is a typical value.

  24. Value Function Value of state s when acting according to policy π. Two viable alternatives: 1. maximize expected future reward over the next T time steps (finite horizon). 2. maximize expected discounted future rewards: V^π(s) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | S_t = s, π ]. The second is the standard formulation for the value function – notice this is a function over state.
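A small sketch of the discounted return whose expectation defines this value function; the reward sequence below is made up for illustration:

```python
# The discounted return behind the value function:
# G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# A made-up trajectory with zero reward until a final +10:
print(discounted_return([0, 0, 0, 10], gamma=0.9))  # 10 * 0.9**3 = 7.29
```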

  25. Optimal policy Value of state s when acting according to policy π: the expected discounted future reward starting at s and acting according to π. Why we care about the value function: because V^π can be used to calculate a good policy.
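One common way a value function yields a policy is to act greedily with respect to it using a one-step lookahead. The sketch below illustrates that idea on a made-up three-state chain; it is not the specific construction the slides develop later:

```python
# Greedy action selection from a value function (one-step lookahead).
def greedy_action(state, actions, transition, reward, value, gamma=0.9):
    """Pick the action maximizing expected reward plus discounted next-state value."""
    def q(a):
        return sum(p * (reward(state, a) + gamma * value[s2])
                   for s2, p in transition(state, a).items())
    return max(actions, key=q)

# Tiny made-up 3-state chain: moving "right" walks toward state 2.
value = {0: 8.1, 1: 9.0, 2: 10.0}
transition = lambda s, a: {min(s + 1, 2): 1.0} if a == "right" else {max(s - 1, 0): 1.0}
reward = lambda s, a: 0.0
print(greedy_action(1, ["left", "right"], transition, reward, value))  # "right"
```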

  26. Value function example 1 Policy: (shown on the grid). Discount factor: 0.9 (inferred from the cell values 10, 9, 8.1, …). Value fn: 6.9, 6.6, 7.3, 8.1, 9, 10 (one value per cell).

  27. Value function example 1 Notice that the value function can help us compare two different policies – how? Policy: (shown on the grid). Discount factor: 0.9. Value fn: 6.9, 6.6, 7.3, 8.1, 9, 10.

  28. Value function example 1 Policy: (shown on the grid). Discount factor: 0.9. Value fn: 1, 0.9, 0.81, 0.73, 0.66, 10.66 (one value per cell).

  29. Value function example 1 Policy: (shown on the grid). Discount factor: 0.9. Value fn: 6.9, 6.6, 7.3, 8.1, 9, 10 (one value per cell).

  30. Value function example 2 Policy: (shown on the grid). Discount factor: (shown on the slide). Value fn: 11, 10, 10, 10, 10, 10 (one value per cell).

  31. Value function example 3 Policy: (shown on the grid). Discount factor: (shown on the slide). Value fn: 7, 6, 7, 8, 9, 10 (one value per cell).
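A small sketch of where numbers like those in example 1 come from: with a discount factor of 0.9 and zero reward until the goal cell, each step away from the goal multiplies the value by 0.9. The chain below is a simplification of the grid and reproduces the slide's values only approximately:

```python
# With gamma = 0.9, the value one step before the goal is 0.9 times the
# goal-cell value, two steps before it is 0.9**2 times, and so on.
gamma = 0.9
terminal_value = 10.0   # the value shown in the goal cell on the slide

values = [terminal_value]
for _ in range(5):                       # walk backwards from the goal, cell by cell
    values.append(round(gamma * values[-1], 2))

print(list(reversed(values)))  # [5.9, 6.56, 7.29, 8.1, 9.0, 10.0]
```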
