Anatomy of an RL agent: model, policy, value function
Robert Platt
Northeastern University
Running example: gridworld
Gridworld:
– agent lives on a grid
– always occupies a single cell
– can move left, right, up, down
– gets zero reward unless in the “+1” or “-1” cells
States and actions
State set: S = the set of cells the agent can occupy (one state per grid cell)
Action set: A = {left, right, up, down}
Reward function
Reward function: R(s, a) = +1 if s is the “+1” cell, R(s, a) = -1 if s is the “-1” cell
Otherwise: R(s, a) = 0
In general: R(s, a) = E[ r_t | s_t = s, a_t = a ]
– the expected reward on this time step, given that the agent takes action a from state s
Agent Model
Transition model: T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
– the probability of this transition: landing in state s' after taking action a from state s
For example: T(s, up, s') is the probability of reaching cell s' after trying to move up from cell s
– This entire probability distribution can be written as a table over (state, action, next state).
Agent Model: Summary
State set: S
Action set: A
Reward function: R(s, a) = E[ r_t | s_t = s, a_t = a ]
Transition model: T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
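These four pieces can be written down directly as tables. Below is a minimal sketch in Python of a made-up three-cell example; the state names, reward values, and transition probabilities are illustrative assumptions, not the exact gridworld from the slides:

```python
# A tiny, made-up MDP written as explicit tables (dicts).
# States, actions, rewards, and probabilities below are illustrative only.

STATES = ["A", "B", "GOAL"]          # state set S
ACTIONS = ["left", "right"]          # action set A

# Reward function R(s, a): expected reward for taking action a in state s.
R = {("A", "left"): 0.0, ("A", "right"): 0.0,
     ("B", "left"): 0.0, ("B", "right"): 1.0,   # moving right from B reaches the "+1" cell
     ("GOAL", "left"): 0.0, ("GOAL", "right"): 0.0}

# Transition model T(s, a, s') = P(s' | s, a), stored as nested dicts.
T = {
    ("A", "right"): {"B": 0.8, "A": 0.2},       # mostly succeeds, sometimes stays put
    ("A", "left"):  {"A": 1.0},
    ("B", "right"): {"GOAL": 0.8, "B": 0.2},
    ("B", "left"):  {"A": 0.8, "B": 0.2},
    ("GOAL", "right"): {"GOAL": 1.0},
    ("GOAL", "left"):  {"GOAL": 1.0},
}

# Sanity check: each transition distribution sums to 1.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```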
Agent Model: Frozen Lake Example
Frozen Lake is this 4x4 grid
State set: S = the 16 cells of the grid
Action set: A = {left, down, right, up}
Reward function: R(s, a) = 1 if s is the goal cell; R(s, a) = 0 otherwise
Transition model: only a one-third chance of going in the specified direction
– one-third chance of moving +90 deg
– one-third chance of moving -90 deg
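Here is a minimal sketch of those slippery dynamics in plain Python; the row-major 4x4 cell numbering and the helper names are assumptions for illustration and are not tied to any particular library:

```python
# Slippery Frozen Lake transitions: the intended move happens with prob 1/3,
# and each of the two perpendicular moves happens with prob 1/3.
# Cells are numbered 0..15, row-major, on a 4x4 grid (illustrative layout).

MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}
PERPENDICULAR = {"left": ("down", "up"), "right": ("down", "up"),
                 "up": ("left", "right"), "down": ("left", "right")}

def step_from(state, direction, size=4):
    """Deterministically apply one move; bumping into a wall keeps the agent in place."""
    row, col = divmod(state, size)
    dr, dc = MOVES[direction]
    row = min(max(row + dr, 0), size - 1)
    col = min(max(col + dc, 0), size - 1)
    return row * size + col

def transition_probs(state, action):
    """Return P(s' | s, a) as a dict for the slippery dynamics."""
    probs = {}
    for d in (action,) + PERPENDICULAR[action]:
        s_next = step_from(state, d)
        probs[s_next] = probs.get(s_next, 0.0) + 1.0 / 3.0
    return probs

print(transition_probs(5, "right"))   # {6: 1/3, 9: 1/3, 1: 1/3}
```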
Agent Model: Recycling Robot Example
Example 3.4 in Sutton & Barto (SB), 2nd Ed.
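For reference, a sketch of that example's dynamics written as one table, with transition probabilities and rewards stored together; the parameter names alpha, beta, r_search, r_wait follow the book, while the numeric values plugged in here are placeholders:

```python
# Recycling robot (Sutton & Barto, Example 3.4) written as an explicit table.
# Keys are (state, action); values are lists of (probability, next_state, reward).
alpha, beta = 0.9, 0.6        # placeholder battery-survival probabilities
r_search, r_wait = 2.0, 1.0   # placeholder rewards, with r_search > r_wait

P = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # -3: rescued after depletion
    ("low",  "wait"):     [(1.0, "low", r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}
```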
Policy
A policy is a rule for selecting actions: a = π(s)
– if the agent is in this state, then take this action
A policy can be stochastic: π(a | s) = P(a_t = a | s_t = s)
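Concretely, a deterministic policy can be stored as a lookup table from state to action, and a stochastic policy as a distribution over actions for each state. A small sketch, with illustrative state and action names:

```python
import random

# Deterministic policy: state -> action.
policy = {"A": "right", "B": "right", "GOAL": "left"}
action = policy["A"]          # if in state A, take action "right"

# Stochastic policy: state -> distribution over actions, pi(a | s).
stochastic_policy = {"A": {"right": 0.8, "left": 0.2},
                     "B": {"right": 1.0}}

def sample_action(pi, state):
    """Draw an action a with probability pi(a | state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(stochastic_policy, "A"))
```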
Episodic vs Continuing Process
Episodic process: execution ends at some point and starts over.
– after a fixed number of time steps
– upon reaching a terminal state
Example of an episodic task:
– execution ends upon reaching a terminal state OR after 15 time steps
Episodic vs Continuing Process
Continuing process: execution goes on forever.
Example of a continuing task:
– the process doesn't stop; the agent keeps getting rewards
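The episodic structure can be made concrete with a rollout loop that stops at a terminal state or after 15 steps. A minimal sketch, using a toy chain whose states, probabilities, and rewards are placeholders rather than the gridworld from the slides:

```python
import random

# Toy episodic chain: states 0..3, where state 3 is terminal.
TERMINAL = 3

def step(state):
    """One time step: advance toward the terminal state with prob 0.9, stay put otherwise."""
    next_state = min(state + 1, TERMINAL) if random.random() < 0.9 else state
    reward = 1.0 if next_state == TERMINAL else 0.0   # +1 only on reaching the terminal state
    return next_state, reward

def run_episode(max_steps=15):
    """Episode ends upon reaching the terminal state OR after max_steps time steps."""
    state, rewards = 0, []
    for _ in range(max_steps):
        state, reward = step(state)
        rewards.append(reward)
        if state == TERMINAL:
            break
    return rewards

print(run_episode())
```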
Value Function
Value of state s when acting according to policy π: the expected discounted future reward starting at state s and acting according to π.
This is called the value function, written V^π(s).
Why we care about the value function: because it helps us calculate a good policy – we'll see how shortly.
Value Function
Value of state s when acting according to policy π:
V^π(s) = E[ r_t + r_{t+1} + r_{t+2} + ... | s_t = s, π ]
– what's wrong with this? (for a continuing task, this infinite sum need not converge)
Value Function
Two viable alternatives:
1. maximize expected future reward over the next T time steps (finite horizon):
   V^π(s) = E[ r_t + r_{t+1} + ... + r_{t+T-1} | s_t = s, π ]
2. maximize expected discounted future rewards:
   V^π(s) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... | s_t = s, π ]
– discount factor γ: 0.9 is a typical value
– alternative 2 is the standard formulation for the value function
– notice this is a function over state
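The discounted sum in alternative 2 is easy to compute for any sampled reward sequence, and averaging it over many rollouts that start from s gives a Monte Carlo estimate of V^π(s). A minimal sketch, using γ = 0.9 as the typical value mentioned above; run_episode_from stands in for any function that returns the reward sequence of one rollout from a given state under π:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def estimate_value(run_episode_from, state, gamma=0.9, n_rollouts=1000):
    """Monte Carlo estimate of V^pi(state): average discounted return over rollouts."""
    returns = [discounted_return(run_episode_from(state), gamma) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)

# Example: a reward of 1 on each of three steps.
print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.9 + 0.81 = 2.71
```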
Optimal policy
Why we care about the value function: because V^π(s) can be used to calculate a good policy.
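One concrete way a value function yields a policy is one-step lookahead: in each state, pick the action whose expected immediate reward plus discounted value of the next state is largest. A sketch under the illustrative tabular R and T dicts assumed earlier:

```python
def greedy_policy(V, states, actions, R, T, gamma=0.9):
    """pi(s) = argmax_a  R(s, a) + gamma * sum_s' T(s, a, s') * V(s')."""
    pi = {}
    for s in states:
        def q(a):
            return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
        pi[s] = max(actions, key=q)
    return pi
```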
Value function example 1
Policy: (shown on the grid figure)
Discount factor: 0.9
Value fn (per cell): 6.9, 6.6, 7.3, 8.1, 9, 10
Notice that the value function can help us compare two different policies – how?
Value function example 1
Policy: (shown on the grid figure)
Discount factor: 0.9
Value fn (per cell): 1, 0.9, 0.81, 0.73, 0.66, 10.66
Value function example 2
Policy: (shown on the grid figure)
Discount factor:
Value fn (per cell): 11, 10, 10, 10, 10, 10
Value function example 3
Policy: (shown on the grid figure)
Discount factor:
Value fn (per cell): 7, 6, 7, 8, 9, 10
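Per-cell value tables like the ones in these examples can be computed by repeatedly applying the defining expectation, V(s) = R(s, π(s)) + γ Σ_s' T(s, π(s), s') V(s'). A minimal sketch for a deterministic policy, again using the illustrative tabular R and T dicts assumed earlier:

```python
def evaluate_policy(pi, states, R, T, gamma=0.9, n_iters=100):
    """Tabular policy evaluation: sweep the one-step backup until the values settle."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: R[(s, pi[s])] + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
             for s in states}
    return V
```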