Module 8: Linear Programming
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Policy Optimization
• Value and policy iteration
  – Iterative algorithms that implicitly solve an optimization problem
• Can we explicitly write down this optimization problem?
  – Yes, it can be formulated as a linear program
Primal Linear Program

primalLP(MDP)
  $\min_W \sum_t x(t)\, W(t)$
  subject to $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  return $W$

• Variables: $W(t) \;\; \forall t$
• Objective: $\min_W \sum_t x(t)\, W(t)$, where $x(t)$ is a weight assigned to state $t$
• Constraints: $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  (a small worked sketch follows below)
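To make the LP concrete, here is a minimal sketch that solves primalLP with scipy.optimize.linprog on a hypothetical two-state, two-action MDP; the arrays P and S, the discount delta, and the weights x are illustrative assumptions, not values from the slides.

```python
# A minimal sketch of primalLP(MDP), assuming a hypothetical 2-state,
# 2-action MDP; P, S, delta and x below are made-up illustrative values.
import numpy as np
from scipy.optimize import linprog

n_t, n_b = 2, 2                           # number of states t and actions b
delta = 0.9                               # discount factor
S = np.array([[1.0, 0.0],                 # reward S(t, b)
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[t, b, t'] = Pr(t' | t, b)
              [[0.5, 0.5], [0.3, 0.7]]])
x = np.array([0.5, 0.5])                  # positive weight x(t) for every state

# Each constraint W(t) >= S(t,b) + delta * sum_t' Pr(t'|t,b) W(t') is
# rewritten as (delta * P[t,b] - e_t) @ W <= -S(t,b) for linprog.
A_ub, b_ub = [], []
for t in range(n_t):
    for b in range(n_b):
        row = delta * P[t, b]
        row[t] -= 1.0
        A_ub.append(row)
        b_ub.append(-S[t, b])

res = linprog(c=x, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_t, method="highs")
W_star = res.x                            # optimal value function W*(t)
print("W* =", W_star)
```

Note the bounds argument: linprog defaults variables to be non-negative, but $W(t)$ is unconstrained in sign, so the bounds must be relaxed explicitly.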
Objective
• Why do we minimize a weighted combination of the values? Shouldn't we maximize value?
• Value functions $W$ that satisfy the constraints are upper bounds on the optimal value function $W^*$:
  $W(t) \ge W^*(t) \;\; \forall t$
• Minimizing the value ensures that we choose the lowest upper bound: at the minimum,
  $W(t) = W^*(t) \;\; \forall t$
Upper bound
• Theorem: Value functions $W$ that satisfy
  $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \quad \forall t,b$
  are upper bounds on the optimal value function $W^*$: $W(t) \ge W^*(t) \;\; \forall t$
• Proof:
  – Since $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \;\; \forall t,b$
  – Then $W(t) \ge \max_b \left[ S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \right] = I^*(W)(t) \;\; \forall t$, where $I^*$ is the Bellman optimality operator
  – Furthermore, since $I^*$ is monotonic, applying it repeatedly preserves the inequality:
    $W \ge I^*(W) \ge I^*(I^*(W)) \ge \cdots \ge (I^*)^\infty(W) = W^*$
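As a quick numeric illustration of the theorem (continuing the assumed toy MDP above): any $W$ satisfying the constraints dominates $W^*$. Taking $W = W^* + 1$ componentwise keeps the constraints satisfied, since the slack grows by $1 - \delta > 0$, and it is indeed an upper bound.

```python
# Numeric illustration of the theorem on the toy MDP above: W* + 1 is still
# feasible (each constraint gains slack (1 - delta)) and upper-bounds W*.
W_feas = W_star + 1.0
assert all(W_feas[t] >= S[t, b] + delta * P[t, b] @ W_feas
           for t in range(n_t) for b in range(n_b))
assert np.all(W_feas >= W_star)
```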
Weight function (initial state)
• How do we choose the weight function?
• If the policy always starts in the same initial state $t_0$, then set
  $x(t) = 1$ if $t = t_0$, and $0$ otherwise
• This ensures that $\sum_t x(t)\, W(t) = W^*(t_0)$
Weight function (any state)
• If the policy may start in any state, then assign a positive weight to each state, i.e., $x(t) > 0 \;\; \forall t$
• This ensures that $W$ is minimized at each $t$ and therefore $W(t) = W^*(t) \;\; \forall t$
• The magnitude of the weights doesn't matter when the LP is solved exactly. We will revisit the choice of $x(t)$ when we discuss approximate linear programming.
Optimal Policy
• The linear program finds $W^*$
• We can extract $\rho^*$ from $W^*$ as usual (see the sketch below):
  $\rho^*(t) \leftarrow \arg\max_b \left[ S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W^*(t') \right]$
• Or check the active constraints:
  – For each $t$, check which $b^*$ turns the constraint
    $W(t) \ge S(t,b) + \delta \sum_{t'} \Pr(t'|t,b)\, W(t') \;\; \forall b$
    into an equality:
    $W(t) = S(t,b^*) + \delta \sum_{t'} \Pr(t'|t,b^*)\, W(t')$
  – Set $\rho^*(t) \leftarrow b^*$
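A short continuation of the earlier sketch (same assumed toy MDP) extracting the greedy policy from W*:

```python
# Greedy extraction of rho* from W*, continuing the toy MDP above.
# Q[t, b] = S(t,b) + delta * sum_t' Pr(t'|t,b) W*(t')
Q = S + delta * np.einsum("tbu,u->tb", P, W_star)
rho_star = np.argmax(Q, axis=1)           # rho*(t) = argmax_b Q(t, b)
print("rho* =", rho_star)
```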
Direct Policy Optimization
• The optimal solution to the primal linear program is $W^*$, but we still have to extract $\rho^*$
• Could we directly optimize $\rho$?
  – Yes, by considering the dual linear program
Dual Linear Program

dualLP(MDP)
  $\max_z \sum_{t,b} z(t,b)\, S(t,b)$
  subject to $\sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b) \quad \forall t'$
             $z(t,b) \ge 0 \quad \forall t,b$
  Let $\rho(b|t) = \Pr(b|t) = z(t,b) / \sum_b z(t,b)$
  return $\rho$

• Variables: $z(t,b) \;\; \forall t,b$
  – frequency of each $(t,b)$-pair (proportional to $\rho$)
• Objective: $\max_z \sum_{t,b} z(t,b)\, S(t,b)$
• Constraints: $\sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b) \quad \forall t'$
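Here is a minimal sketch of dualLP on the same assumed toy MDP; z is flattened so that variable index t*n_b + b stands for the pair (t, b).

```python
# A minimal sketch of dualLP(MDP), reusing P, S, delta, x from the toy MDP.
c = -S.flatten()                          # linprog minimizes, so negate S
A_eq = np.zeros((n_t, n_t * n_b))         # one equality constraint per t'
for tp in range(n_t):
    for t in range(n_t):
        for b in range(n_b):
            # coefficient of z(t,b) in
            # sum_b' z(t',b') - delta * sum_{t,b} Pr(t'|t,b) z(t,b) = x(t')
            A_eq[tp, t * n_b + b] = float(t == tp) - delta * P[t, b, tp]
res = linprog(c=c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (n_t * n_b),
              method="highs")
z = res.x.reshape(n_t, n_b)               # state-action frequencies z(t, b)
rho = z / z.sum(axis=1, keepdims=True)    # stochastic policy rho(b | t)
print("rho =", rho)
```

The normalization is always well defined here: with $x(t) > 0$ for every state, each row sum $\sum_b z(t,b)$ is strictly positive by the equality constraints.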
Duality
• Every primal linear program of the form
  $\min_y d^\top y \;\;\text{s.t.}\;\; By \ge c$
  has an equivalent dual linear program of the form
  $\max_z c^\top z \;\;\text{s.t.}\;\; B^\top z = d \;\text{and}\; z \ge 0$
  where $\min_y d^\top y = \max_z c^\top z$
• Interpretation for our LPs:
  $y = W$, $z \propto \rho$, $B = [J - \delta U_b]_{\forall b}$ (constraints stacked over actions, with $J$ the identity matrix and $U_b$ the transition matrix for action $b$), $c = [S_b]_{\forall b}$, $d = x$
State Frequency
• Let $g(t)$ be the (discounted) frequency of $t$ under policy $\rho$:
  0 steps: $g^0(t) = x(t)$
  1 step: $g^1(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, x(t)$
  2 steps: $g^2(t'') = x(t'') + \delta \sum_{t'} \Pr(t''|t',\rho(t'))\, x(t') + \delta^2 \sum_{t,t'} \Pr(t'|t,\rho(t)) \Pr(t''|t',\rho(t'))\, x(t)$
  ...
  n steps: $g^n(t_n) = x(t_n) + \delta \sum_{t_{n-1}} \Pr(t_n|t_{n-1},\rho(t_{n-1}))\, g^{n-1}(t_{n-1})$
  ∞ steps: $g(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, g(t)$
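The infinite-step fixed point can be computed in closed form by solving the linear system $(J - \delta P_\rho^\top)\, g = x$. A short sketch on the assumed toy MDP, for an arbitrary fixed deterministic policy (rho_det is a made-up example):

```python
# Discounted state frequencies g under a fixed deterministic policy,
# solving g = x + delta * P_rho^T g in closed form (toy MDP from above).
rho_det = np.array([0, 1])                 # an arbitrary example policy rho(t)
P_rho = P[np.arange(n_t), rho_det]         # P_rho[t, t'] = Pr(t' | t, rho(t))
g = np.linalg.solve(np.eye(n_t) - delta * P_rho.T, x)
print("g =", g)
```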
State-Action Frequency
• Let $z(t,b)$ be the state-action frequency
  $z(t,b) = \rho(b|t)\, g(t)$, where $\rho(b|t) = \Pr(b|t)$ is a stochastic policy
• Then the following equations are equivalent:
  $g(t') = x(t') + \delta \sum_t \Pr(t'|t,\rho(t))\, g(t)$
  $\Leftrightarrow\; g_\rho(t') \sum_{b'} \rho(b'|t') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, \rho(b|t)\, g_\rho(t)$
  $\Leftrightarrow\; \sum_{b'} z(t',b') = x(t') + \delta \sum_{t,b} \Pr(t'|t,b)\, z(t,b)$
  – which is exactly the constraint of the dual LP
Policy
• We can recover $\rho$ from $z$:
  $z(t,b) = \rho(b|t)\, g(t)$  (by definition)
  $\rho(b|t) = z(t,b) / g(t)$  (isolate $\rho$)
  $\rho(b|t) = z(t,b) / \sum_b z(t,b)$  (since $g(t) = \sum_b z(t,b)$ by definition)
• $\rho$ may be stochastic
• Actions with non-zero probability are necessarily optimal (checked numerically below)
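A quick numeric check of that last claim on the assumed toy MDP, using z from the dual sketch and Q from the greedy-extraction sketch: complementary slackness forces the support of z onto greedy actions.

```python
# Cross-check on the toy MDP: every action with z(t,b) > 0 attains the
# maximum of Q(t, .), i.e. it is greedy with respect to W*.
for t in range(n_t):
    for b in range(n_b):
        if z[t, b] > 1e-8:
            assert np.isclose(Q[t, b], Q[t].max())
```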
Objective
• Duality theory guarantees that the objectives of the primal and dual LPs are equal:
  $\max_z \sum_{t,b} z(t,b)\, S(t,b) = \min_W \sum_t x(t)\, W(t)$
• This means that $\sum_{t,b} z(t,b)\, S(t,b)$ implicitly measures the value of the optimal policy.
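On the assumed toy MDP, the equality can be confirmed directly from the two solver outputs:

```python
# Strong duality on the toy MDP: dual objective equals primal objective.
primal_obj = x @ W_star
dual_obj = (z * S).sum()
assert np.isclose(primal_obj, dual_obj)
print("optimal weighted value:", primal_obj)
```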
Solution Algorithms
• Two broad classes of algorithms:
  – Simplex (corner search)
  – Interior point methods (iterative methods that move through the interior)
• Polynomial complexity: solving an MDP by linear programming is in P (not NP-hard)
• Many packages for linear programming
  – CPLEX (robust, efficient, and free for academic use)