Policy Iteration Comments (MDPs cont., Lecture 25, Nov 24 2008)




Policy iteration comments

• Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible
• Converges to the optimal policy
• Gives the exact value of the optimal policy

Review

• Critical components of MDPs
  – State space and action space; transition model; reward function
• Value iteration
  – U(s): the maximum expected sum of discounted rewards achievable starting at a particular state
  – Bellman equation: U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
  – Bellman iteration: U_{i+1}(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
  – Optimal policy: π*(s) = argmax_a Σ_s' T(s, a, s') U*(s')

Policy Iteration Example

Do one iteration of policy iteration on the MDP below. Assume an initial policy of π_1(Hungry) = Eat and π_1(Full) = Sleep. Let γ = 0.9.

[MDP diagram: states Hungry (R = -10) and Full (R = +10). From Hungry, Eat moves to Full with probability 0.9 and stays in Hungry with probability 0.1; Watch TV stays in Hungry with probability 1.0. From Full, Sleep stays in Full with probability 0.8 and moves to Hungry with probability 0.2; Exercise moves to Hungry with probability 1.0.]

Review: Policy Iteration

• Start with a randomly chosen initial policy π_0
• Iterate until no change in utilities:
  1. Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s under π_i by solving the system of equations
     U_i(s) = R(s) + γ Σ_s' T(s, π_i(s), s') U_i(s')
  2. Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i(s):
     π_{i+1}(s) = argmax_a Σ_s' T(s, a, s') U_i(s')

Policy Evaluation Phase

Use the initial policy for Hungry: π_1(Hungry) = Eat
  U_1(Hungry) = -10 + (0.9)[(0.1)U_1(Hungry) + (0.9)U_1(Full)]
  ⇒ U_1(Hungry) = -10 + (0.09)U_1(Hungry) + (0.81)U_1(Full)
  ⇒ (0.91)U_1(Hungry) - (0.81)U_1(Full) = -10

Use the initial policy for Full: π_1(Full) = Sleep
  U_1(Full) = 10 + (0.9)[(0.8)U_1(Full) + (0.2)U_1(Hungry)]
  ⇒ U_1(Full) = 10 + (0.72)U_1(Full) + (0.18)U_1(Hungry)
  ⇒ (0.28)U_1(Full) - (0.18)U_1(Hungry) = 10
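The two equations above form a 2×2 linear system in U_1(Hungry) and U_1(Full). As a quick numerical check of the hand solution worked out on the next slides, here is a minimal NumPy sketch (not part of the original slides); the array encoding of the states, T, and R is an assumption made for illustration.

```python
import numpy as np

# States indexed Hungry = 0, Full = 1 (an encoding chosen for this sketch).
HUNGRY, FULL = 0, 1
gamma = 0.9
R = np.array([-10.0, 10.0])          # R(Hungry) = -10, R(Full) = +10

# Transition rows actually used by the initial policy pi_1:
#   pi_1(Hungry) = Eat:   P(Hungry | Hungry, Eat) = 0.1, P(Full | Hungry, Eat) = 0.9
#   pi_1(Full)   = Sleep: P(Hungry | Full, Sleep) = 0.2, P(Full | Full, Sleep) = 0.8
T_pi = np.array([[0.1, 0.9],
                 [0.2, 0.8]])

# Policy evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi_1(s), s') U(s'),
# i.e. (I - gamma * T_pi) U = R, the same 2x2 system derived above.
U1 = np.linalg.solve(np.eye(2) - gamma * T_pi, R)
print(f"U1(Hungry) = {U1[HUNGRY]:.2f}")   # about 48.6 (the slides round to 48.7)
print(f"U1(Full)   = {U1[FULL]:.2f}")     # about 67.0
```

Solving the system directly like this is exactly what the policy evaluation step does in general: one linear equation per state, for a fixed policy.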

Policy Iteration Example

Solve for U_1(Hungry) and U_1(Full):
  (0.91)U_1(Hungry) - (0.81)U_1(Full) = -10   ... (Equation 1)
  (0.28)U_1(Full) - (0.18)U_1(Hungry) = 10    ... (Equation 2)

From Equation 1:
  (0.91)U_1(Hungry) = -10 + (0.81)U_1(Full)
  ⇒ U_1(Hungry) = (-10/0.91) + (0.81/0.91)U_1(Full)
  ⇒ U_1(Hungry) = -10.9 + (0.89)U_1(Full)

Substitute U_1(Hungry) = -10.9 + (0.89)U_1(Full) into Equation 2:
  (0.28)U_1(Full) - (0.18)[-10.9 + (0.89)U_1(Full)] = 10
  ⇒ (0.28)U_1(Full) + 1.96 - (0.16)U_1(Full) = 10
  ⇒ (0.12)U_1(Full) = 8.04
  ⇒ U_1(Full) = 67
  ⇒ U_1(Hungry) = -10.9 + (0.89)(67) = -10.9 + 59.63 = 48.7

Policy Improvement Phase: π_2(Hungry)

  π_2(Hungry) = argmax over {Eat, WatchTV} of
      T(Hungry, Eat, Full)U_1(Full) + T(Hungry, Eat, Hungry)U_1(Hungry)   [Eat]
      T(Hungry, WatchTV, Hungry)U_1(Hungry)                               [WatchTV]
    = argmax { (0.9)U_1(Full) + (0.1)U_1(Hungry)  [Eat],  (1.0)U_1(Hungry)  [WatchTV] }
    = argmax { (0.9)(67) + (0.1)(48.7)  [Eat],  (1.0)(48.7)  [WatchTV] }
    = argmax { 65.2  [Eat],  48.7  [WatchTV] }
    = Eat

Policy Improvement Phase: π_2(Full)

  π_2(Full) = argmax over {Exercise, Sleep} of
      T(Full, Exercise, Hungry)U_1(Hungry)                                [Exercise]
      T(Full, Sleep, Full)U_1(Full) + T(Full, Sleep, Hungry)U_1(Hungry)   [Sleep]
    = argmax { (1.0)U_1(Hungry)  [Exercise],  (0.8)U_1(Full) + (0.2)U_1(Hungry)  [Sleep] }
    = argmax { (1.0)(48.7)  [Exercise],  (0.8)(67) + (0.2)(48.7)  [Sleep] }
    = argmax { 48.7  [Exercise],  63.34  [Sleep] }
    = Sleep

The improved policy after one iteration:
• π_2(Hungry) = Eat
• π_2(Full) = Sleep

So far ...

• Given an MDP model we know how to find optimal policies
  – Value Iteration or Policy Iteration
• But what if we don't have any form of the model of the world (e.g., T and R)?
  – Like when we were babies ...
  – All we can do is wander around the world, observing what happens, getting rewarded and punished
  – This is what reinforcement learning is about
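For completeness, here is a minimal sketch of the evaluate/improve loop from the "Review: Policy Iteration" slide, applied to the same Hungry/Full MDP. It is not code from the lecture; the dictionary encoding of T, R, and the action sets is an assumption for illustration. Starting from π_1 it reproduces the utilities and the improved policy π_2 computed above.

```python
import numpy as np

# Hungry/Full MDP from the example, encoded as dictionaries for this sketch.
gamma = 0.9
states = ["Hungry", "Full"]
R = {"Hungry": -10.0, "Full": 10.0}
actions = {"Hungry": ["Eat", "WatchTV"], "Full": ["Sleep", "Exercise"]}
# T[s][a][s'] = P(s' | s, a)
T = {
    "Hungry": {"Eat":      {"Hungry": 0.1, "Full": 0.9},
               "WatchTV":  {"Hungry": 1.0, "Full": 0.0}},
    "Full":   {"Sleep":    {"Hungry": 0.2, "Full": 0.8},
               "Exercise": {"Hungry": 1.0, "Full": 0.0}},
}

def evaluate(policy):
    """Solve U(s) = R(s) + gamma * sum_s' T(s, policy[s], s') U(s') exactly."""
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))
    b = np.array([R[s] for s in states])
    for s in states:
        for s2, p in T[s][policy[s]].items():
            A[idx[s], idx[s2]] -= gamma * p
    U = np.linalg.solve(A, b)
    return {s: U[idx[s]] for s in states}

def improve(U):
    """One-step look-ahead: pi'(s) = argmax_a sum_s' T(s, a, s') U(s')."""
    return {s: max(actions[s],
                   key=lambda a: sum(p * U[s2] for s2, p in T[s][a].items()))
            for s in states}

# One iteration starting from pi_1, exactly as in the worked example.
pi = {"Hungry": "Eat", "Full": "Sleep"}
U = evaluate(pi)      # roughly {'Hungry': 48.6, 'Full': 67.0}
pi = improve(U)       # {'Hungry': 'Eat', 'Full': 'Sleep'}, i.e. pi_2
print(U)
print(pi)
```

Full policy iteration simply repeats these two steps until the policy stops changing, which is the "iterate until no change in utilities" loop reviewed earlier.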

Why not supervised learning?

In supervised learning, we had a teacher providing us with training examples with class labels.

[Example table: training examples with features Has Fever, Has Cough, Has Breathing Problems, Ate Chicken Recently, and the class label Asian Bird Flu.]

The agent figures out how to predict the class label given the features.

Can We Use Supervised Learning?

• Now imagine a complex task such as learning to play a board game
• Suppose we took a supervised learning approach to learning an evaluation function
• For every possible position of your pieces, you need a teacher to provide an accurate and consistent evaluation of that position
  – This is not feasible

Trial and Error

• A better approach: imagine we don't have a teacher
• Instead, the agent gets to experiment in its environment
• The agent tries out actions and discovers by itself which actions lead to a win or a loss
• The agent can learn an evaluation function that can estimate the probability of winning from any given position

Reinforcement/Reward

• The key to this trial-and-error approach is having some sort of feedback about what is good and what is bad
• We call this feedback reward or reinforcement
• In some environments, rewards are frequent
  – Ping-pong: each point scored
  – Learning to crawl: forward motion
• In other environments, reward is delayed
  – Chess: reward only happens at the end of the game

Importance of Credit Assignment

• When reward is delayed, how do we determine which action(s) in a long sequence were really responsible for it? (the credit assignment problem)

Reinforcement

• This is very similar to what happens in nature with animals and humans
• Positive reinforcement: Happiness, Pleasure, Food
• Negative reinforcement: Pain, Hunger, Loneliness

What happens if we get agents to learn in this way? This leads us to the world of Reinforcement Learning.

Reinforcement Learning in a nutshell

"Imagine playing a new game whose rules you don't know; after a hundred or so moves, your opponent announces, 'You lose'."
  – Russell and Norvig, Artificial Intelligence: A Modern Approach

Reinforcement Learning

• Agent is placed in an environment and must learn to behave optimally in it
• Assume that the world behaves like an MDP, except:
  – Agent can act but does not know the transition model
  – Agent observes its current state and its reward, but doesn't know the reward function
• Goal: learn an optimal policy

Factors that Make RL Difficult

• Actions have non-deterministic effects
  – which are initially unknown and must be learned
• Rewards / punishments can be infrequent
  – Often at the end of long sequences of actions
  – How do we determine what action(s) were really responsible for reward or punishment? (the credit assignment problem)
• World is large and complex
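To make the "no model" setting concrete, here is a minimal sketch, not from the slides, of what an RL agent's experience stream looks like. The HungryFullEnv simulator and its step() interface are hypothetical, invented for illustration: the agent only ever sees states and rewards as it acts, and can at best estimate the transition model from counts.

```python
import random
from collections import defaultdict

class HungryFullEnv:
    """A hypothetical simulator for the Hungry/Full MDP from the example.
    The agent interacting with it never gets to look inside T or R."""
    T = {("Hungry", "Eat"):     [("Hungry", 0.1), ("Full", 0.9)],
         ("Hungry", "WatchTV"): [("Hungry", 1.0)],
         ("Full", "Sleep"):     [("Hungry", 0.2), ("Full", 0.8)],
         ("Full", "Exercise"):  [("Hungry", 1.0)]}
    R = {"Hungry": -10.0, "Full": 10.0}

    def __init__(self):
        self.state = "Hungry"

    def step(self, action):
        nexts, probs = zip(*self.T[(self.state, action)])
        self.state = random.choices(nexts, weights=probs)[0]
        return self.state, self.R[self.state]   # all the agent ever observes

# The agent wanders around, recording what it sees; from these counts it could
# estimate T and R, or learn values and policies directly with other methods.
env = HungryFullEnv()
actions = {"Hungry": ["Eat", "WatchTV"], "Full": ["Sleep", "Exercise"]}
counts = defaultdict(int)
state = env.state
for _ in range(1000):
    a = random.choice(actions[state])        # explore by acting at random
    next_state, reward = env.step(a)
    counts[(state, a, next_state)] += 1      # experience: (s, a, s', r)
    state = next_state

# Empirical estimate of one transition probability, e.g. T(Hungry, Eat, Full):
n_sa = sum(c for (s, a, _), c in counts.items() if (s, a) == ("Hungry", "Eat"))
n_sas = counts[("Hungry", "Eat", "Full")]
print("estimated T(Hungry, Eat, Full) is approximately", n_sas / max(n_sa, 1))  # ~0.9
```

Reinforcement learning algorithms differ in what they do with this experience (some estimate T and R, others learn values or policies directly), but they all start from exactly this kind of interaction.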
