

  1. Module 6: Value Iteration
     CS 886: Sequential Decision Making and Reinforcement Learning
     University of Waterloo

  2. Markov Decision Process
     • Definition
       – Set of states: $S$
       – Set of actions (i.e., decisions): $A$
       – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
       – Reward model (i.e., utility): $R(s_t, a_t)$
       – Discount factor: $0 \le \gamma \le 1$
       – Horizon (i.e., # of time steps): $h$
     • Goal: find an optimal policy $\pi$

  3. Finite Horizon
     • Policy evaluation:
       $V_h^\pi(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
     • Recursive form (dynamic programming):
       $V_0^\pi(s) = R(s, \pi_0(s))$
       $V_t^\pi(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V_{t-1}^\pi(s')$
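The recursion above translates directly into a short dynamic-programming loop. Below is a minimal NumPy sketch (an editorial illustration, not code from the slides); the names P, R and policy, and the layout P[a, s, s'] = Pr(s'|s,a), R[s, a], policy[t, s], are assumptions made for the example.

```python
import numpy as np

def evaluate_policy_finite(P, R, policy, gamma, h):
    """Finite-horizon policy evaluation by the recursion for V_t^pi.

    Assumed (hypothetical) layout:
      P[a, s, s'] = Pr(s' | s, a),  R[s, a] = reward,
      policy[t, s] = action taken by the non-stationary policy at step t.
    Returns V with V[t, s] = V_t^pi(s).
    """
    n_states = R.shape[0]
    idx = np.arange(n_states)
    V = np.zeros((h + 1, n_states))
    V[0] = R[idx, policy[0]]                        # V_0^pi(s) = R(s, pi_0(s))
    for t in range(1, h + 1):
        a = policy[t]
        trans = P[a, idx]                           # row s: Pr(. | s, pi_t(s))
        V[t] = R[idx, a] + gamma * trans @ V[t - 1] # backup through pi_t
    return V
```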

  4. Finite Horizon
     • Optimal policy $\pi^*$:
       $V_h^{\pi^*}(s) \ge V_h^\pi(s) \;\; \forall \pi, s$
     • Optimal value function $V^*$ (shorthand for $V^{\pi^*}$):
       $V_0^*(s) = \max_a R(s, a)$
       $V_t^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')$   (Bellman's equation)

  5. Value Iteration Algorithm
     valueIteration(MDP)
       $V_0^*(s) \leftarrow \max_a R(s, a) \;\; \forall s$
       For $t = 1$ to $h$ do
         $V_t^*(s) \leftarrow \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
       Return $V^*$
     Optimal policy $\pi^*$:
       $t = 0$:  $\pi_0^*(s) \leftarrow \operatorname{argmax}_a R(s, a) \;\; \forall s$
       $t > 0$:  $\pi_t^*(s) \leftarrow \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s') \;\; \forall s$
     NB: $\pi^*$ is non-stationary (i.e., time dependent).
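A possible NumPy rendering of this algorithm, including extraction of the non-stationary policy. This is a sketch under the same assumed P[a, s, s'] / R[s, a] layout as the earlier snippet, not the course's own code.

```python
import numpy as np

def value_iteration_finite(P, R, gamma, h):
    """Finite-horizon value iteration with policy extraction.

    Assumed layout: P[a, s, s'] = Pr(s' | s, a), R[s, a] = reward.
    Returns V[t, s] = V_t^*(s) and pi[t, s] = pi_t^*(s).
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros((h + 1, n_states))
    pi = np.zeros((h + 1, n_states), dtype=int)
    V[0], pi[0] = R.max(axis=1), R.argmax(axis=1)   # t = 0 case
    for t in range(1, h + 1):
        # Q[s, a] = R(s, a) + gamma * sum_s' Pr(s' | s, a) V_{t-1}^*(s')
        Q = R + gamma * (P @ V[t - 1]).T
        V[t], pi[t] = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi    # pi is time dependent (non-stationary)
```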

  6. Value Iteration
     • Matrix form:
       $R^a$: $|S| \times 1$ column vector of rewards for action $a$
       $V_t^*$: $|S| \times 1$ column vector of state values
       $T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
     valueIteration(MDP)
       $V_0^* \leftarrow \max_a R^a$
       For $t = 1$ to $h$ do
         $V_t^* \leftarrow \max_a\, (R^a + \gamma T^a V_{t-1}^*)$
       Return $V^*$
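In code, one matrix-form backup is a single line per action followed by an elementwise max. A sketch, assuming (as before) that T^a corresponds to P[a] and R^a to R[:, a]:

```python
import numpy as np

def bellman_backup(P, R, V, gamma):
    """One backup V <- max_a (R^a + gamma * T^a V), with T^a = P[a], R^a = R[:, a]."""
    backups = np.stack([R[:, a] + gamma * P[a] @ V for a in range(P.shape[0])])
    return backups.max(axis=0)   # elementwise max over actions
```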

  7. Infinite Horizon
     • Let $h \to \infty$
     • Then $V_h^\pi \to V_\infty^\pi$ and $V_{h-1}^\pi \to V_\infty^\pi$
     • Policy evaluation:
       $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \;\; \forall s$
     • Bellman's equation:
       $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s') \;\; \forall s$

  8. Policy evaluation
     • Linear system of equations:
       $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \;\; \forall s$
     • Matrix form:
       $R$: $|S| \times 1$ column vector of state rewards for $\pi$
       $V$: $|S| \times 1$ column vector of state values for $\pi$
       $T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
       $V = R + \gamma T V$

  9. Solving linear equations
     • Linear system: $V = R + \gamma T V$
     • Gaussian elimination: $(I - \gamma T) V = R$
     • Compute inverse: $V = (I - \gamma T)^{-1} R$
     • Iterative methods
     • Value iteration (a.k.a. Richardson iteration):
       Repeat $V \leftarrow R + \gamma T V$
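Both solution strategies are one-liners with NumPy. A sketch, assuming T is the $|S| \times |S|$ transition matrix under the policy and R its $|S|$-vector of rewards (illustrative names, not from the slides):

```python
import numpy as np

def evaluate_policy_direct(T, R, gamma):
    """Exact evaluation: solve the linear system (I - gamma T) V = R."""
    return np.linalg.solve(np.eye(T.shape[0]) - gamma * T, R)

def evaluate_policy_richardson(T, R, gamma, n_iter=1000):
    """Iterative evaluation (Richardson iteration): repeat V <- R + gamma T V."""
    V = np.zeros_like(R)
    for _ in range(n_iter):
        V = R + gamma * T @ V
    return V
```

The direct solve costs $O(|S|^3)$, while each Richardson sweep costs only $O(|S|^2)$, which is why the convergence rate of the iterative scheme (next slides) matters.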

  10. Contraction
      • Let $H(V) \stackrel{\mathrm{def}}{=} R + \gamma T V$ be the policy evaluation operator.
      • Lemma 1: $H$ is a contraction mapping:
        $\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
      • Proof:
        $\|H(V) - H(\tilde V)\|_\infty = \|R + \gamma T V - R - \gamma T \tilde V\|_\infty$   (by definition)
        $= \|\gamma T (V - \tilde V)\|_\infty$   (simplification)
        $\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$   (since $\|AB\| \le \|A\|\,\|B\|$)
        $= \gamma \|V - \tilde V\|_\infty$   (since $\|T\|_\infty = \max_s \sum_{s'} T(s, s') = 1$)
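A quick numerical sanity check of Lemma 1 on a randomly generated policy-evaluation problem (an illustrative sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)        # row-stochastic transition matrix
R = rng.random(n)

H = lambda V: R + gamma * T @ V          # policy evaluation operator

V1, V2 = rng.random(n), rng.random(n)
lhs = np.max(np.abs(H(V1) - H(V2)))      # ||H(V1) - H(V2)||_inf
rhs = gamma * np.max(np.abs(V1 - V2))    # gamma * ||V1 - V2||_inf
print(lhs <= rhs)                        # True: H contracts distances by gamma
```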

  11. Convergence
      • Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
        $\lim_{n \to \infty} H^n(V) = V^\pi \;\; \forall V$
      • Proof:
        – By definition $V^\pi = H^\infty(0)$, but policy evaluation computes $H^\infty(V)$ for any initial $V$.
        – By Lemma 1, $\|H^n(V) - H^n(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
        – Hence, when $n \to \infty$, $\|H^n(V) - H^n(0)\|_\infty \to 0$ and $H^\infty(V) = V^\pi \;\; \forall V$.

  12. Approximate Policy Evaluation
      • In practice, we can't perform an infinite number of iterations.
      • Suppose that we perform value iteration for $k$ steps and $\|H^k(V) - H^{k-1}(V)\|_\infty = \epsilon$; how far is $H^k(V)$ from $V^\pi$?

  13. Approximate Policy Evaluation
      • Theorem 3: If $\|H^k(V) - H^{k-1}(V)\|_\infty \le \epsilon$, then
        $\|V^\pi - H^k(V)\|_\infty \le \frac{\epsilon}{1 - \gamma}$
      • Proof:
        $\|V^\pi - H^k(V)\|_\infty = \|H^\infty(V) - H^k(V)\|_\infty$   (by Theorem 2)
        $= \left\| \sum_{t=1}^{\infty} \left( H^{t+k}(V) - H^{t+k-1}(V) \right) \right\|_\infty$   (telescoping sum)
        $\le \sum_{t=1}^{\infty} \|H^{t+k}(V) - H^{t+k-1}(V)\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
        $\le \sum_{t=1}^{\infty} \gamma^t \epsilon$   (by Lemma 1)
        $\le \frac{\epsilon}{1 - \gamma}$
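The bound is easy to verify numerically on a small random example (an illustrative sketch, not course code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, k = 6, 0.8, 20

T = rng.random((n, n))
T /= T.sum(axis=1, keepdims=True)                  # row-stochastic transitions
R = rng.random(n)

V_pi = np.linalg.solve(np.eye(n) - gamma * T, R)   # exact V^pi

V = rng.random(n)                                  # arbitrary initial estimate
for _ in range(k - 1):
    V = R + gamma * T @ V                          # V = H^{k-1}(V0)
V_k = R + gamma * T @ V                            # V_k = H^{k}(V0)

eps = np.max(np.abs(V_k - V))                      # ||H^k(V0) - H^{k-1}(V0)||_inf
print(np.max(np.abs(V_pi - V_k)) <= eps / (1 - gamma))   # True, as Theorem 3 states
```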

  14. Optimal Value Function
      • Non-linear system of equations:
        $V_\infty^*(s) = \max_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s') \;\; \forall s$
      • Matrix form:
        $R^a$: $|S| \times 1$ column vector of rewards for $a$
        $V^*$: $|S| \times 1$ column vector of optimal values
        $T^a$: $|S| \times |S|$ matrix of transition probabilities for $a$
        $V^* = \max_a\, (R^a + \gamma T^a V^*)$

  15. Contraction
      • Let $H^*(V) \stackrel{\mathrm{def}}{=} \max_a\, (R^a + \gamma T^a V)$ be the operator in value iteration.
      • Lemma 3: $H^*$ is a contraction mapping:
        $\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
      • Proof: Without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$ and let
        $a_s^* = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$

  16. Contraction
      • Proof continued:
      • Then
        $0 \le H^*(V)(s) - H^*(\tilde V)(s)$   (by assumption)
        $\le R(s, a_s^*) + \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, V(s') - R(s, a_s^*) - \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \tilde V(s')$   (by definition)
        $= \gamma \sum_{s'} \Pr(s' \mid s, a_s^*) \left( V(s') - \tilde V(s') \right)$
        $\le \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \|V - \tilde V\|_\infty$   (max-norm upper bound)
        $= \gamma \|V - \tilde V\|_\infty$   (since $\sum_{s'} \Pr(s' \mid s, a_s^*) = 1$)
      • Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$, for each $s$.

  17. Convergence
      • Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
        $\lim_{n \to \infty} (H^*)^n(V) = V^* \;\; \forall V$
      • Proof:
        – By definition $V^* = (H^*)^\infty(0)$, but value iteration computes $(H^*)^\infty(V)$ for some initial $V$.
        – By Lemma 3, $\|(H^*)^n(V) - (H^*)^n(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
        – Hence, when $n \to \infty$, $\|(H^*)^n(V) - (H^*)^n(0)\|_\infty \to 0$ and $(H^*)^\infty(V) = V^* \;\; \forall V$.

  18. Value Iteration
      • Even when the horizon is infinite, we perform only finitely many iterations.
      • Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$.
      valueIteration(MDP)
        $V_0 \leftarrow \max_a R^a$;   $n \leftarrow 0$
        Repeat
          $n \leftarrow n + 1$
          $V_n \leftarrow \max_a\, (R^a + \gamma T^a V_{n-1})$
        Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
        Return $V_n$
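A compact NumPy sketch of this infinite-horizon loop, under the same assumed P[a, s, s'] / R[s, a] layout as the earlier snippets:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration with stopping test ||V_n - V_{n-1}||_inf <= eps."""
    V = R.max(axis=1)                      # V_0 <- max_a R^a
    while True:
        Q = R + gamma * (P @ V).T          # Q[s, a] = R(s,a) + gamma sum_s' Pr(s'|s,a) V(s')
        V_new = Q.max(axis=1)              # V_n <- max_a (R^a + gamma T^a V_{n-1})
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```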

  19. Induced Policy
      • Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, by Theorem 4 we know that $\|V_n - V^*\|_\infty \le \frac{\epsilon}{1 - \gamma}$.
      • But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
        $\pi_n(s) = \operatorname{argmax}_a R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')$
      • How far is $V^{\pi_n}$ from $V^*$?
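Extracting the induced (greedy) stationary policy amounts to one more backup over the returned values; a sketch under the same assumed array layout:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Stationary policy induced by a value estimate V (one step of lookahead)."""
    Q = R + gamma * (P @ V).T      # Q[s, a]
    return Q.argmax(axis=1)        # pi_n(s) = argmax_a Q[s, a]
```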

  20. Induced Policy
      • Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\epsilon}{1 - \gamma}$
      • Proof:
        $\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
        $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$   (since $\|A + B\| \le \|A\| + \|B\|$)
        $= \|H_{\pi_n}^\infty(V_n) - V_n\|_\infty + \|(H^*)^\infty(V_n) - V_n\|_\infty$
        $\le \frac{\epsilon}{1 - \gamma} + \frac{\epsilon}{1 - \gamma}$   (by Theorems 2 and 4)
        $= \frac{2\epsilon}{1 - \gamma}$

  21. Summary
      • Value iteration
        – Simple dynamic programming algorithm
        – Complexity: $O(n\, |A|\, |S|^2)$, where $n$ is the number of iterations
      • Can we optimize the policy directly instead of optimizing the value function and then inducing a policy?
        – Yes: by policy iteration

      CS886 (c) 2013 Pascal Poupart
