Module 15: POMDP Bounds
CS 886: Sequential Decision Making and Reinforcement Learning
University of Waterloo
Bounds
• POMDP algorithms typically find approximations to the optimal value function or optimal policy
  – Need some performance guarantees
• Lower bounds on $V^*$
  – $V^\pi$ for any policy $\pi$
  – Point-based value iteration
• Upper bounds on $V^*$
  – QMDP
  – Fast-informed bound
  – Finite belief-state MDP
Lower Bounds
• Lower bounds are easy to obtain
• For any policy $\pi$, $V^\pi$ is a lower bound since $V^\pi(b) \le V^*(b) \;\; \forall \pi, b$
• The main issue is to evaluate a policy $\pi$
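For instance, the value of a "blind" policy that repeats a single action forever is easy to compute and already gives a usable lower bound. Below is a minimal numpy sketch under that assumption; the array names (`T`, `R`) and the helper function are illustrative, not from the slides.

```python
import numpy as np

def blind_policy_lower_bound(T, R, gamma):
    """Alpha-vectors of the 'blind' policies that repeat one action forever.

    T[s, a, s'] : transition probabilities (illustrative layout)
    R[s, a]     : rewards
    Each alpha_a solves alpha_a = R[:, a] + gamma * T_a @ alpha_a, i.e. the value
    of always executing action a.  Since each alpha_a is the value of some policy,
    V_0(b) = max_a b @ alpha_a is a lower bound on V*(b).
    """
    nS, nA, _ = T.shape
    alphas = []
    for a in range(nA):
        # Solve (I - gamma * T_a) alpha = R[:, a]
        alpha = np.linalg.solve(np.eye(nS) - gamma * T[:, a, :], R[:, a])
        alphas.append(alpha)
    return np.array(alphas)  # shape (nA, |S|)
```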
Point-based Value Iteration
• Theorem: If $V_0$ is a lower bound, then the value functions $V_n$ produced by point-based value iteration at each iteration $n$ are lower bounds.
• Proof by induction
  – Base case: pick $V_0$ to be a lower bound
  – Inductive assumption: $V_n(b) \le V^*(b) \;\; \forall b$
  – Induction:
    • Let $\Gamma_{n+1}$ be the set of $\alpha$-vectors obtained by backing up $V_n$ at some set $B$ of beliefs
    • Let $\Gamma^*_{n+1}$ be the set of $\alpha$-vectors obtained by backing up $V_n$ at all beliefs
    • Hence $V_{n+1}(b) = \max_{\alpha \in \Gamma_{n+1}} \alpha(b) \le \max_{\alpha \in \Gamma^*_{n+1}} \alpha(b) \le V^*(b)$, since $\Gamma_{n+1} \subseteq \Gamma^*_{n+1}$ and the exact backup of a lower bound remains a lower bound (monotonicity of the Bellman operator)
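The backup underlying this argument can be written as a short numpy sketch of a standard point-based backup. It assumes a tabular POMDP with arrays `T[s,a,s']`, `O[s',a,o]`, `R[s,a]` and a current set `Gamma` of $\alpha$-vectors that lower-bounds $V^*$; all names are illustrative. Because the new vector is assembled only from vectors already in `Gamma`, its value at any belief stays below $V^*$.

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma):
    """One point-based backup at belief b.

    b     : belief over states, shape (|S|,)
    Gamma : list of alpha-vectors (each of length |S|) lower-bounding V*
    Returns the backed-up alpha-vector for belief b.
    """
    nS, nA, _ = T.shape
    nO = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(nA):
        vec = R[:, a].copy()
        for o in range(nO):
            # g[i, s] = sum_{s'} T[s,a,s'] O[s',a,o] Gamma[i][s']
            g = np.array([T[:, a, :] @ (O[:, a, o] * alpha) for alpha in Gamma])
            # keep the vector in Gamma that is best at the updated belief
            vec += gamma * g[np.argmax(g @ b)]
        if vec @ b > best_val:
            best_vec, best_val = vec, vec @ b
    return best_vec
```

Repeating this backup at every belief in $B$ and collecting the resulting vectors yields the set $\Gamma_{n+1}$ of the theorem.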
Upper Bounds
• Idea: make decisions based on more information than is normally available, which yields a value at least as high as the optimal value.
• POMDP: states are hidden
• MDP: states are observable
• Hence $V_{MDP} \ge V_{POMDP}$
QMDP Algorithm
• Derive an upper bound from the MDP Q-function by allowing the state to be observable
• Policy: $s_t \to a_t$

QMDP(POMDP)
  Solve the MDP to find $Q_{MDP}$:
    $Q_{MDP}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s, a) \max_{a'} Q_{MDP}(s', a')$
  Let $V_{MDP}(b) = \max_a \sum_s b(s)\, Q_{MDP}(s,a)$
  Return $V_{MDP}$
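A minimal numpy sketch of this procedure, assuming the same illustrative tabular arrays as before (`T[s,a,s']`, `R[s,a]`); the fixed number of value-iteration sweeps is an arbitrary choice for the sketch, not part of the slides.

```python
import numpy as np

def qmdp_bound(T, R, gamma, n_iter=500):
    """QMDP upper bound from a tabular MDP model (illustrative names)."""
    nS, nA, _ = T.shape
    Q = np.zeros((nS, nA))
    for _ in range(n_iter):
        # Q(s,a) = R(s,a) + gamma * sum_{s'} Pr(s'|s,a) max_{a'} Q(s',a')
        Q = R + gamma * np.einsum('ijk,k->ij', T, Q.max(axis=1))
    # V_MDP(b) = max_a sum_s b(s) Q_MDP(s,a)
    V = lambda b: np.max(b @ Q)
    return Q, V
```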
Fast Informed Bound
• The QMDP upper bound is too loose
  – Actions depend on the current state (too informative)
• Tighter upper bound: fast informed bound (FIB)
  – Actions depend on the previous state (less informative)
• $V_{MDP} \ge V_{FIB} \ge V^*$
FIB Algorithm
• Derive an upper bound by allowing the previous state to be observable
• Policy: $s_{t-1}, a_{t-1}, o_t \to a_t$

FIB(POMDP)
  Find $Q_{FIB}$ by value iteration:
    $Q_{FIB}(s,a) = R(s,a) + \gamma \sum_{o'} \max_{a'} \sum_{s'} \Pr(s' \mid s, a)\, \Pr(o' \mid s', a)\, Q_{FIB}(s', a')$
  Let $V_{FIB}(b) = \max_a \sum_s b(s)\, Q_{FIB}(s,a)$
  Return $V_{FIB}$
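A corresponding numpy sketch of the FIB backup, under the same illustrative array layout (`T[s,a,s']`, `O[s',a,o]`, `R[s,a]`). Note that the max over the next action sits inside the sum over observations but outside the sum over next states.

```python
import numpy as np

def fib_bound(T, O, R, gamma, n_iter=500):
    """Fast informed bound from a tabular POMDP model (illustrative names)."""
    nS, nA, _ = T.shape
    nO = O.shape[2]
    Q = np.zeros((nS, nA))
    for _ in range(n_iter):
        Qnew = R.copy()
        for a in range(nA):
            for o in range(nO):
                # M[s, s', a'] = Pr(s'|s,a) Pr(o|s',a) Q(s',a')
                M = T[:, a, :][:, :, None] * (O[:, a, o][:, None] * Q)[None, :, :]
                # sum over s', then max over a', accumulated over o
                Qnew[:, a] += gamma * M.sum(axis=1).max(axis=1)
        Q = Qnew
    # V_FIB(b) = max_a sum_s b(s) Q_FIB(s,a)
    V = lambda b: np.max(b @ Q)
    return Q, V
```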
FIB Analysis
• Theorem: $V_{MDP} \ge V_{FIB} \ge V^*$
• Proof:
  1) Applying the two backups to the same $Q$:
     $Q_{MDP}(s,a) = R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a) \max_{a'} Q(s',a')$
     $= R(s,a) + \gamma \sum_{s'} \Pr(s' \mid s,a) \sum_{o'} \Pr(o' \mid s',a) \max_{a'} Q(s',a')$
     $\ge R(s,a) + \gamma \sum_{o'} \max_{a'} \sum_{s'} \Pr(s' \mid s,a)\, \Pr(o' \mid s',a)\, Q(s',a')$   (a sum of maxima is at least the maximum of the sum)
     $= Q_{FIB}(s,a)$
     so the MDP backup dominates the FIB backup and hence $V_{MDP} \ge V_{FIB}$
  2) $V_{FIB} \ge V^*$ since $V_{FIB}$ is based on observing the previous state (too informative)
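The first inequality is easy to verify numerically. The small sketch below reuses the `qmdp_bound` and `fib_bound` helpers from the earlier sketches (illustrative names) on a randomly generated tabular POMDP.

```python
import numpy as np

# Quick numerical check of Q_MDP >= Q_FIB on a random tabular POMDP.
rng = np.random.default_rng(0)
nS, nA, nO = 4, 3, 2
T = rng.random((nS, nA, nS)); T /= T.sum(axis=2, keepdims=True)
O = rng.random((nS, nA, nO)); O /= O.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

Q_mdp, _ = qmdp_bound(T, R, gamma=0.9)
Q_fib, _ = fib_bound(T, O, R, gamma=0.9)
assert np.all(Q_mdp >= Q_fib - 1e-8)  # the theorem's first inequality
```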
Finite Belief-State MDP
• Belief-state MDP: all beliefs are treated as states
  $V^*(b) = \max_a Q^*(b,a)$
• QMDP and FIB: value of each interior belief is interpolated, i.e., $V_{FIB}(b) = \max_a \sum_s b(s)\, Q_{FIB}(s,a)$
• Idea: retain a subset of beliefs
  – Interpolate the value of the remaining beliefs
Finite Belief-State MDP
• Belief-state MDP:
  $Q(b,a) = R(b,a) + \gamma \sum_{o'} \Pr(o' \mid b,a) \max_{a'} Q(b^{a,o'}, a')$
• Let $B$ be a subset of representative beliefs
• Approximate $Q(b^{a,o'}, a')$ with the lowest interpolation
  – Linear program:
    $Q(b^{a,o'}, a') = \min_{c} \sum_{b' \in B} c_{b'}\, Q(b', a')$
    such that $\sum_{b' \in B} c_{b'}\, b' = b^{a,o'}$, $\sum_{b' \in B} c_{b'} = 1$, and $c_{b'} \ge 0 \;\; \forall b' \in B$
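The lowest interpolation is a small linear program. Here is a sketch using `scipy.optimize.linprog`; the function name and array layout are illustrative, and feasibility assumes the query belief lies in the convex hull of the representative beliefs (e.g. because the corner beliefs are included in $B$).

```python
import numpy as np
from scipy.optimize import linprog

def lowest_interpolation(b_query, B, values):
    """Lowest convex interpolation of upper-bound values at belief b_query.

    B      : (m, |S|) array of representative beliefs
    values : (m,) upper-bound values at the beliefs in B
    Solves  min_c  sum_i c_i values[i]
            s.t.   sum_i c_i B[i] = b_query,  sum_i c_i = 1,  c >= 0
    """
    m, nS = B.shape
    A_eq = np.vstack([B.T, np.ones((1, m))])   # belief-matching + sum-to-one rows
    b_eq = np.concatenate([b_query, [1.0]])
    res = linprog(c=values, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method='highs')
    return res.fun
```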
Finite Belief-State MDP Algorithm
• Derive an upper bound by interpolating values based on a finite subset of beliefs

FiniteBeliefStateMDP(POMDP)
  Find $Q_B$ by value iteration:
    $Q_B(b,a) = R(b,a) + \gamma \sum_{o'} \Pr(o' \mid b,a) \max_{a'} Q_B(b^{a,o'}, a') \;\; \forall b \in B, a$
    where $Q_B(b^{a,o'}, a') = \min_{c} \sum_{b' \in B} c_{b'}\, Q_B(b', a')$
      such that $\sum_{b' \in B} c_{b'}\, b' = b^{a,o'}$, $\sum_{b' \in B} c_{b'} = 1$, and $c_{b'} \ge 0 \;\; \forall b' \in B$
  Let $V_B(b) = \max_a \sum_s b(s)\, Q_B(s,a)$
  Return $V_B$
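A minimal sketch of this value iteration over a finite belief set, reusing the `lowest_interpolation` sketch above for the interpolation step. It assumes the corner beliefs are included in `B` (so every updated belief lies in the convex hull of `B` and the LP is feasible, and the corner rows of $Q_B$ give the final $V_B(b)$); all names and the fixed iteration count are illustrative.

```python
import numpy as np

def finite_belief_state_mdp(B, T, O, R, gamma, interpolate, n_iter=200):
    """Upper bound via value iteration over a finite belief set B.

    B           : (m, |S|) representative beliefs (assumed to include the corners)
    interpolate : function (b_query, B, values) -> lowest interpolation,
                  e.g. lowest_interpolation from the sketch above
    """
    m, nS = B.shape
    nA, nO = T.shape[1], O.shape[2]
    Q = np.zeros((m, nA))
    for _ in range(n_iter):
        Qnew = np.empty_like(Q)
        for i, b in enumerate(B):
            for a in range(nA):
                val = b @ R[:, a]                          # R(b,a)
                for o in range(nO):
                    joint = (b @ T[:, a, :]) * O[:, a, o]  # Pr(s', o | b, a)
                    p_o = joint.sum()
                    if p_o > 1e-12:
                        b_next = joint / p_o               # updated belief b^{a,o}
                        # max_{a'} of the interpolated Q_B(b^{a,o}, a')
                        val += gamma * p_o * max(
                            interpolate(b_next, B, Q[:, ap]) for ap in range(nA))
                Qnew[i, a] = val
        Q = Qnew
    return Q   # rows at the corner beliefs give V_B(b) = max_a sum_s b(s) Q_B(s,a)
```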