Module 7: Policy Iteration
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
(c) 2013 Pascal Poupart
Policy Optimization
• Value iteration
  – Optimize the value function
  – Extract the induced policy
• Can we directly optimize the policy?
  – Yes, by policy iteration
Policy Iteration
• Alternate between two steps:
  1. Policy evaluation
     $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$   $\forall s$
  2. Policy improvement
     $\pi(s) \leftarrow \operatorname{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^\pi(s')$   $\forall s$
Algorithm policyIteration(MDP)
  Initialize $\pi_0$ to any policy
  $n \leftarrow 0$
  Repeat
    Eval:    solve $V_n = R^{\pi_n} + \gamma T^{\pi_n} V_n$
    Improve: $\pi_{n+1} \leftarrow \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
    $n \leftarrow n + 1$
  Until $\pi_{n+1} = \pi_n$
  Return $\pi_n$
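As a concrete illustration, below is a minimal NumPy sketch of the algorithm above. The array conventions are assumptions rather than anything defined in the slides: R[s, a] holds the reward for action a in state s, P[a, s, s'] the transition probabilities, and gamma the discount factor; the evaluation step solves the linear system exactly.

import numpy as np

def policy_iteration(R, P, gamma):
    # Sketch of exact policy iteration; R[s, a], P[a, s, s'] and gamma are assumed conventions.
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)            # pi_0: arbitrary initial policy
    while True:
        # Eval: solve V = R^pi + gamma * T^pi V exactly as a linear system
        R_pi = R[np.arange(n_states), policy]          # reward under the current policy
        T_pi = P[policy, np.arange(n_states), :]       # transition matrix under the current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Improve: act greedily with respect to V
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # pi_{n+1} = pi_n: converged
            return policy, V
        policy = new_policy

The exact linear solve is what drives the per-iteration cost of $O(|S|^3 + |S|^2|A|)$ noted in the complexity comparison later in this module.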
Monotonic Improvement
• Lemma 1: Let $V_n$ and $V_{n+1}$ be successive value functions in policy iteration. Then $V_{n+1} \ge V_n$.
• Proof:
  – We know that $H^* V_n \ge H^{\pi_n} V_n = V_n$
  – Let $\pi_{n+1} = \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
  – Then $H^* V_n = R^{\pi_{n+1}} + \gamma T^{\pi_{n+1}} V_n \ge V_n$
  – Rearranging: $R^{\pi_{n+1}} \ge (I - \gamma T^{\pi_{n+1}}) V_n$
  – Hence $V_{n+1} = (I - \gamma T^{\pi_{n+1}})^{-1} R^{\pi_{n+1}} \ge V_n$
    (the last step holds because $(I - \gamma T^{\pi_{n+1}})^{-1} = \sum_{k \ge 0} \gamma^k (T^{\pi_{n+1}})^k$ has nonnegative entries, so it preserves the inequality)
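Lemma 1 is easy to check empirically. The sketch below (the MDP size, random seed and variable names are arbitrary assumptions, not taken from the slides) runs the evaluate/improve loop on a small random MDP and asserts that successive value functions never decrease.

import numpy as np

# Build a small random MDP (sizes and seed are illustrative assumptions)
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA))                          # random rewards
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)                      # normalize transition rows

policy = np.zeros(nS, dtype=int)
V_prev = None
for _ in range(50):
    # exact evaluation of the current policy
    R_pi = R[np.arange(nS), policy]
    T_pi = P[policy, np.arange(nS), :]
    V = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)
    if V_prev is not None:
        assert np.all(V >= V_prev - 1e-9)              # V_{n+1} >= V_n (Lemma 1)
    V_prev = V
    # greedy improvement
    policy = (R + gamma * np.einsum('ast,t->sa', P, V)).argmax(axis=1)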
Convergence
• Theorem 2: When $S$ and $A$ are finite, policy iteration converges to $\pi^*$ and $V^*$ in finitely many iterations.
• Proof:
  – We know that $V_{n+1} \ge V_n$ for all $n$ by Lemma 1.
  – Since $S$ and $A$ are finite, there are finitely many policies, and therefore the algorithm terminates in finitely many iterations.
  – At termination, $\pi_{n+1} = \pi_n$, and therefore $V_n$ satisfies Bellman's equation:
    $V_n = V_{n+1} = \max_a\, R^a + \gamma T^a V_n$
Complexity
• Value iteration:
  – Each iteration: $O(|S|^2 |A|)$
  – Many iterations: linear convergence
• Policy iteration:
  – Each iteration: $O(|S|^3 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
Modified Policy Iteration
• Alternate between two steps:
  1. Partial policy evaluation
     Repeat $k$ times:
     $V^\pi(s) \leftarrow R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$   $\forall s$
  2. Policy improvement
     $\pi(s) \leftarrow \operatorname{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^\pi(s')$   $\forall s$
Algorithm modifiedPolicyIteration(MDP)
  Initialize $\pi_0$ and $V_0$ to anything
  $n \leftarrow 0$
  Repeat
    Eval:    repeat $k$ times: $V_n \leftarrow R^{\pi_n} + \gamma T^{\pi_n} V_n$
    Improve: $\pi_{n+1} \leftarrow \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
             $V_{n+1} \leftarrow \max_a\, R^a + \gamma T^a V_n$
    $n \leftarrow n + 1$
  Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
  Return $\pi_n$
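Below is a minimal sketch of this algorithm, under the same assumed conventions as the earlier policy iteration sketch (R[s, a], P[a, s, s'], gamma); k and eps are illustrative parameter names. The exact linear solve is replaced by k backups of the current policy, and the stopping test compares the value estimate before and after the improvement backup as a simple stand-in for the $\|V_n - V_{n-1}\|_\infty \le \epsilon$ test in the pseudocode.

import numpy as np

def modified_policy_iteration(R, P, gamma, k=5, eps=1e-6):
    # Sketch of modified policy iteration; R[s, a], P[a, s, s'], gamma, k, eps are assumed conventions.
    n_states = R.shape[0]
    V = np.zeros(n_states)                             # V_0: arbitrary
    policy = np.zeros(n_states, dtype=int)             # pi_0: arbitrary
    while True:
        # Eval: k applications of the backup for the current policy
        for _ in range(k):
            R_pi = R[np.arange(n_states), policy]
            T_pi = P[policy, np.arange(n_states), :]
            V = R_pi + gamma * T_pi @ V
        # Improve: greedy policy and one optimal (max) backup
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policy = Q.argmax(axis=1)
        V_new = Q.max(axis=1)
        # Stop once successive value estimates agree to within eps in the max norm
        if np.max(np.abs(V_new - V)) <= eps:
            return policy, V_new
        V = V_new

With small k the behaviour is close to value iteration, while large k approaches the exact evaluation used by policy iteration.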
Convergence
• Same convergence guarantees as value iteration:
  – Value function $V_n$: $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1-\gamma}$
  – Value function $V^{\pi_n}$ of policy $\pi_n$: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1-\gamma}$
• Proof: somewhat complicated (see Section 6.5 of Puterman's book)
Complexity
• Value iteration:
  – Each iteration: $O(|S|^2 |A|)$
  – Many iterations: linear convergence
• Policy iteration:
  – Each iteration: $O(|S|^3 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
• Modified policy iteration:
  – Each iteration: $O(k|S|^2 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
Summary
• Policy iteration
  – Iteratively refine the policy
• Can we treat the search for a good policy as an optimization problem?
  – Yes: by linear programming