Markov Decision Processes
2/23/18
Recall: State Space Search Problems
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a new state
• A set of goal states, often specified as a function
• A way to measure solution quality
What if actions aren’t perfect?
• We might not know exactly which next state will result from an action.
• We can model this as a probability distribution over next states.
Search with Non-Deterministic Actions
• A set of discrete states
• A distinguished start state
• A set of actions available to the agent in each state
• An action function that, given a state and an action, returns a probability distribution over next states (instead of a single new state)
• A set of terminal states (instead of goal states)
• A reward function that gives a utility for each state (instead of a single measure of solution quality)
Markov Decision Processes (MDPs)
Named after the “Markov property”: if you know the state, you don’t need to remember history.
• We still represent states and actions.
• Actions no longer lead to a single next state.
• Instead they lead to one of several possible states, determined randomly.
• We’re now working with utilities instead of goals.
• Expected utility works well for handling randomness.
• We need to plan for unintended consequences.
• We need to plan over an indefinite horizon.
• Even an optimal agent may run forever!
State Space Search vs. MDPs

  State Space Search           | MDPs
  States: S                    | States: S
  Actions: A_s                 | Actions: A_s
  Transition function:         | Transition probabilities:
    F(s, a) = s'               |   P(s' | s, a)
  Start ∈ S                    | Start ∈ S
  Goals ⊂ S                    | Terminal ⊂ S (can be empty)
  Action Costs: C(a)           | State Rewards: R(s), or action costs: C(a)
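To make the right-hand column concrete, here is one minimal way to package an MDP in code. This is only a sketch: the class name and field layout are my own, the transition model is assumed to be a function returning a dict of next-state probabilities, and the discount factor gamma is introduced on a later slide.

    from typing import Callable, Dict, Hashable, List, Set

    State = Hashable
    Action = str

    class MDP:
        """A minimal MDP container (hypothetical layout, not from the slides)."""
        def __init__(self,
                     states: List[State],
                     actions: Callable[[State], List[Action]],          # A_s
                     P: Callable[[State, Action], Dict[State, float]],  # P(s' | s, a)
                     R: Callable[[State], float],                       # R(s)
                     terminal: Set[State],                              # Terminal ⊂ S (may be empty)
                     gamma: float = 0.9):                               # discount factor (see Discounting)
            self.states = states
            self.actions = actions
            self.P = P
            self.R = R
            self.terminal = terminal
            self.gamma = gamma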
We can’t rely on a single plan!
Actions might not have the outcome we expect, so our plans need to include contingencies for states we could end up in.
Instead of searching for a plan, we devise a policy. A policy is a function that maps states to actions.
• For each state we could end up in, the policy tells us which action to take.
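For a small, discrete state space, a policy can literally be stored as a lookup table. A small hypothetical sketch (the states and moves here are made up for illustration):

    # A policy maps every state we could end up in to an action.
    policy = {
        (0, 0): "up",
        (0, 1): "up",
        (0, 2): "right",
        (1, 2): "right",
        (2, 2): "right",
    }

    def pi(s):
        # pi(s) = a: the action the agent takes whenever it finds itself in state s.
        return policy[s]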
A simple example: Grid World
[Grid World figure: a small grid with a start cell and two terminal cells labeled +1 and -1]
If actions were deterministic, we could solve this with state space search.
• (3,2) would be a goal state
• (3,1) would be a dead end
A simple example: Grid World
• Suppose instead that the move we try to make only works correctly 80% of the time.
• 10% of the time, we go in each perpendicular direction, e.g. try to go right, go up instead.
• If impossible, stay in place.
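As a concrete sketch of this transition model: the code below assumes a 4x3 grid indexed (column, row) from (0, 0), with (3,2) and (3,1) as the +1 and -1 terminals; the blocked cell at (1,1) is my assumption (the wall cell in the usual version of this example) and is not stated on the slide.

    MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    # The two perpendicular "slip" directions for each intended move.
    PERP = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

    WIDTH, HEIGHT = 4, 3
    BLOCKED = {(1, 1)}            # assumed wall cell (not on the slide)
    TERMINAL = {(3, 2), (3, 1)}   # the +1 and -1 end states

    def step(s, direction):
        # Deterministic move; stay in place if it would leave the grid or hit the wall.
        x, y = s
        dx, dy = MOVES[direction]
        ns = (x + dx, y + dy)
        if ns in BLOCKED or not (0 <= ns[0] < WIDTH and 0 <= ns[1] < HEIGHT):
            return s
        return ns

    def P(s, a):
        # P(s' | s, a): 80% intended direction, 10% each perpendicular direction.
        # Terminal states have no outgoing transitions (a modeling convenience).
        if s in TERMINAL:
            return {}
        dist = {}
        for direction, prob in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
            ns = step(s, direction)
            dist[ns] = dist.get(ns, 0.0) + prob
        return dist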
A simple example: Grid World
• Before, we had two equally-good alternatives.
• Which path is better when actions are uncertain?
• What should we do if we find ourselves in (2,1)?
New Objective: Find an optimal policy.
We can’t just rely on a single plan, since we might end up in an unintended state.
A policy is a function that maps every state to an action: π(s) = a
We want policies that yield high reward:
• In expectation (since transitions are random).
• Over time (we may be willing to accept low reward now to achieve high reward later, but not always).
Expected Value
• Since future states are uncertain, we can’t perfectly maximize future reward.
• Instead, we maximize expected reward, a probability-weighted average over state rewards.

  E(R_t) = Σ_s Pr(s_t = s) · R(s)

where Pr(s_t = s) is the probability of being in state s at time t, and R(s) is the reward of state s.
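For example, with made-up probabilities: if at time t the agent is in the +1 terminal with probability 0.8 and in the -1 terminal with probability 0.2, then E(R_t) = 0.8 · (+1) + 0.2 · (−1) = 0.6.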
Discounting
How do we trade off short-term vs. long-term reward?
Key idea: reward now is better than reward later.
• Rewards in the future are exponentially decayed.
• Reward t steps in the future is discounted by γ^t:

  V = γ^t · R_t,   where 0 < γ < 1

(V is the value now; R_t is the reward received at time step t.)
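For example, with γ = 0.9, a reward of +1 received three steps from now is worth γ³ · 1 = 0.9³ ≈ 0.73 today, while the same reward received immediately is worth the full +1.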
Value of a policy
• Value depends on what state the agent is in.
• Value depends on what the policy tells the agent to do in the future.

  V^π(s) = Σ_{t=0}^{∞} Σ_{s'} γ^t · Pr^π(s_t = s' | s_0 = s) · R(s')

Value is the sum over all timesteps of the expected discounted reward at that timestep.
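In practice, V^π is usually computed by iterating the equivalent one-step (Bellman) recurrence V^π(s) = R(s) + γ · Σ_{s'} P(s' | s, π(s)) · V^π(s') rather than summing over timesteps directly. A minimal sketch, assuming P(s, a) returns a dict of next-state probabilities (as in the earlier snippet) and pi is a policy defined on every state:

    def evaluate_policy(states, P, R, pi, gamma, tol=1e-6):
        # Iteratively approximate V_pi(s) for every state under the fixed policy pi.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                new_v = R(s) + gamma * sum(p * V[ns] for ns, p in P(s, pi(s)).items())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V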
Optimal Policy
• The optimal policy is the one that maximizes value.
• If we knew the optimal policy, we could easily find the true value of any state.
• If we knew the true value of every state, we could easily find the optimal policy.

  V*(s) = V^{π*}(s)
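The third bullet is easy to make concrete: given a table of state values V, acting greedily with respect to expected next-state value recovers the policy. A minimal sketch, assuming the P(s, a) and actions(s) interfaces used in the earlier snippets:

    def greedy_policy(states, actions, P, V):
        # pi(s) = the action whose expected next-state value is highest.
        # (R(s) and the discount factor don't depend on the action,
        #  so they can be dropped from the argmax.)
        return {s: max(actions(s),
                       key=lambda a: sum(p * V[ns] for ns, p in P(s, a).items()))
                for s in states}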
Value Iteration
• The value of state s depends on the value of other states s'.
• The value of s' may depend on the value of s.
We can iteratively approximate the value using dynamic programming.
• Initialize all values to the immediate rewards.
• Update values based on the best next-state.
• Repeat until convergence (values don’t change).
Value Iteration Pseudocode
(Here actions(s) lists the actions available in s, P(s, a) returns a dict mapping each possible next state to its probability, and R(s) is the state reward.)

    def value_iteration(states, actions, P, R, gamma, tol=1e-6):
        # Initialize all values to the immediate rewards.
        values = {s: R(s) for s in states}
        while True:                                  # until values don't change
            prev = dict(values)                      # copy of values
            for s in states:
                best_EV = float("-inf")              # initialize best_EV
                for a in actions(s):
                    # Expected value of taking action a, using last iteration's values.
                    EV = sum(prob * prev[ns] for ns, prob in P(s, a).items())
                    best_EV = max(EV, best_EV)
                values[s] = R(s) + gamma * best_EV
            if all(abs(values[s] - prev[s]) < tol for s in states):
                return values
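Putting the pieces together on the grid world. This assumes the P(s, a) sketched earlier and the value_iteration above are in scope; the rewards (+1 and -1 at the terminals, 0 everywhere else) and γ = 0.9 are illustrative choices, not values given on the slides.

    # All grid cells except the assumed wall at (1, 1).
    GRID = [(x, y) for x in range(4) for y in range(3) if (x, y) != (1, 1)]

    def R(s):
        # Illustrative rewards: only the terminals carry reward.
        return {(3, 2): +1.0, (3, 1): -1.0}.get(s, 0.0)

    def actions(s):
        return ["up", "down", "left", "right"]

    V = value_iteration(GRID, actions, P, R, gamma=0.9)
    print(V[(0, 0)])   # value of the (assumed) start state
    print(V[(2, 1)])   # the state asked about on the Grid World slide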