Module 7: Policy Iteration
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
(c) 2013 Pascal Poupart
Policy Optimization
• Value iteration
  – Optimize the value function
  – Extract the induced policy
• Can we directly optimize the policy?
  – Yes, by policy iteration
Policy Iteration
• Alternate between two steps:
  1. Policy evaluation
     $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$   $\forall s$
  2. Policy improvement
     $\pi(s) \leftarrow \operatorname{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^\pi(s')$   $\forall s$
Algorithm policyIteration(MDP)
  Initialize $\pi_0$ to any policy
  $n \leftarrow 0$
  Repeat
    Eval:    solve $V_n = R^{\pi_n} + \gamma T^{\pi_n} V_n$
    Improve: $\pi_{n+1} \leftarrow \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
    $n \leftarrow n + 1$
  Until $\pi_{n+1} = \pi_n$
  Return $\pi_n$
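As a concrete illustration, below is a minimal NumPy sketch of the algorithm above. The array conventions are assumptions rather than anything defined in the slides: R[s, a] holds the reward for action a in state s, P[a, s, s'] the transition probabilities, and gamma the discount factor; the evaluation step solves the linear system exactly.

import numpy as np

def policy_iteration(R, P, gamma):
    # Sketch of exact policy iteration; R[s, a], P[a, s, s'] and gamma are assumed conventions.
    n_states = R.shape[0]
    policy = np.zeros(n_states, dtype=int)            # pi_0: arbitrary initial policy
    while True:
        # Eval: solve V = R^pi + gamma * T^pi V exactly as a linear system
        R_pi = R[np.arange(n_states), policy]          # reward under the current policy
        T_pi = P[policy, np.arange(n_states), :]       # transition matrix under the current policy
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Improve: act greedily with respect to V
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # pi_{n+1} = pi_n: converged
            return policy, V
        policy = new_policy

The exact linear solve is what drives the per-iteration cost of $O(|S|^3 + |S|^2|A|)$ noted in the complexity comparison later in this module.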
Monotonic Improvement
• Lemma 1: Let $V_n$ and $V_{n+1}$ be successive value functions in policy iteration. Then $V_{n+1} \ge V_n$.
• Proof:
  – We know that $H^* V_n \ge H^{\pi_n} V_n = V_n$
  – Let $\pi_{n+1} = \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
  – Then $H^* V_n = R^{\pi_{n+1}} + \gamma T^{\pi_{n+1}} V_n \ge V_n$
  – Rearranging: $R^{\pi_{n+1}} \ge (I - \gamma T^{\pi_{n+1}}) V_n$
  – Hence $V_{n+1} = (I - \gamma T^{\pi_{n+1}})^{-1} R^{\pi_{n+1}} \ge V_n$
    (the last step holds because $(I - \gamma T^{\pi_{n+1}})^{-1} = \sum_{k \ge 0} \gamma^k (T^{\pi_{n+1}})^k$ has nonnegative entries, so it preserves the inequality)
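Lemma 1 is easy to check empirically. The sketch below (the MDP size, random seed and variable names are arbitrary assumptions, not taken from the slides) runs the evaluate/improve loop on a small random MDP and asserts that successive value functions never decrease.

import numpy as np

# Build a small random MDP (sizes and seed are illustrative assumptions)
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
R = rng.normal(size=(nS, nA))                          # random rewards
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)                      # normalize transition rows

policy = np.zeros(nS, dtype=int)
V_prev = None
for _ in range(50):
    # exact evaluation of the current policy
    R_pi = R[np.arange(nS), policy]
    T_pi = P[policy, np.arange(nS), :]
    V = np.linalg.solve(np.eye(nS) - gamma * T_pi, R_pi)
    if V_prev is not None:
        assert np.all(V >= V_prev - 1e-9)              # V_{n+1} >= V_n (Lemma 1)
    V_prev = V
    # greedy improvement
    policy = (R + gamma * np.einsum('ast,t->sa', P, V)).argmax(axis=1)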
Convergence
• Theorem 2: When $S$ and $A$ are finite, policy iteration converges to $\pi^*$ and $V^*$ in finitely many iterations.
• Proof:
  – We know that $V_{n+1} \ge V_n$ for all $n$ by Lemma 1.
  – Since $S$ and $A$ are finite, there are finitely many policies, and therefore the algorithm terminates in finitely many iterations.
  – At termination, $\pi_{n+1} = \pi_n$, and therefore $V_n$ satisfies Bellman's equation:
    $V_n = V_{n+1} = \max_a\, R^a + \gamma T^a V_n$
Complexity
• Value iteration:
  – Each iteration: $O(|S|^2 |A|)$
  – Many iterations: linear convergence
• Policy iteration:
  – Each iteration: $O(|S|^3 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
Modified Policy Iteration
• Alternate between two steps:
  1. Partial policy evaluation
     Repeat $k$ times:
     $V^\pi(s) \leftarrow R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$   $\forall s$
  2. Policy improvement
     $\pi(s) \leftarrow \operatorname{argmax}_a\, R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^\pi(s')$   $\forall s$
Algorithm modifiedPolicyIteration(MDP)
  Initialize $\pi_0$ and $V_0$ to anything
  $n \leftarrow 0$
  Repeat
    Eval:    repeat $k$ times: $V_n \leftarrow R^{\pi_n} + \gamma T^{\pi_n} V_n$
    Improve: $\pi_{n+1} \leftarrow \operatorname{argmax}_a\, R^a + \gamma T^a V_n$
             $V_{n+1} \leftarrow \max_a\, R^a + \gamma T^a V_n$
    $n \leftarrow n + 1$
  Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
  Return $\pi_n$
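Below is a minimal sketch of this algorithm, under the same assumed conventions as the earlier policy iteration sketch (R[s, a], P[a, s, s'], gamma); k and eps are illustrative parameter names. The exact linear solve is replaced by k backups of the current policy, and the stopping test compares the value estimate before and after the improvement backup as a simple stand-in for the $\|V_n - V_{n-1}\|_\infty \le \epsilon$ test in the pseudocode.

import numpy as np

def modified_policy_iteration(R, P, gamma, k=5, eps=1e-6):
    # Sketch of modified policy iteration; R[s, a], P[a, s, s'], gamma, k, eps are assumed conventions.
    n_states = R.shape[0]
    V = np.zeros(n_states)                             # V_0: arbitrary
    policy = np.zeros(n_states, dtype=int)             # pi_0: arbitrary
    while True:
        # Eval: k applications of the backup for the current policy
        for _ in range(k):
            R_pi = R[np.arange(n_states), policy]
            T_pi = P[policy, np.arange(n_states), :]
            V = R_pi + gamma * T_pi @ V
        # Improve: greedy policy and one optimal (max) backup
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policy = Q.argmax(axis=1)
        V_new = Q.max(axis=1)
        # Stop once successive value estimates agree to within eps in the max norm
        if np.max(np.abs(V_new - V)) <= eps:
            return policy, V_new
        V = V_new

With small k the behaviour is close to value iteration, while large k approaches the exact evaluation used by policy iteration.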
Convergence
• Same convergence guarantees as value iteration:
  – Value function $V_n$: $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1-\gamma}$
  – Value function $V^{\pi_n}$ of policy $\pi_n$: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1-\gamma}$
• Proof: somewhat complicated (see Section 6.5 of Puterman's book)
Complexity
• Value iteration:
  – Each iteration: $O(|S|^2 |A|)$
  – Many iterations: linear convergence
• Policy iteration:
  – Each iteration: $O(|S|^3 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
• Modified policy iteration:
  – Each iteration: $O(k|S|^2 + |S|^2 |A|)$
  – Few iterations: linear-quadratic convergence
Summary
• Policy iteration
  – Iteratively refine the policy
• Can we treat the search for a good policy as an optimization problem?
  – Yes: by linear programming