

  1. Module 14: Introduction to Partially Observable Markov Decision Processes
CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

  2. Markov Decision Processes
• MDPs:
  – Fully observable MDPs
  – The decision maker knows the state at each time step
• POMDPs:
  – Partially observable MDPs
  – The decision maker does not know the state
  – But it makes observations that are correlated with the underlying state
    • E.g., sensors provide noisy information about the state

  3. Applications
• Robotic control
• Dialog systems
• Assistive technologies
• Operations research

  4. Model Description
• Definition
  – Set of states: S
  – Set of actions (i.e., decisions): A
  – Transition model: Pr(s_t | s_{t−1}, a_{t−1})
  – Reward model (i.e., utility): R(s_t, a_t)
  – Discount factor: 0 ≤ γ ≤ 1
  – Horizon (i.e., # of time steps): h
  – Set of observations: O
  – Observation model: Pr(o_t | s_t, a_{t−1})
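
To make these components concrete, here is a minimal sketch of one possible encoding in Python, using the classic tiger problem as an example; the example itself and the array conventions T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a), R[s, a] are assumptions, not from the slides.

```python
import numpy as np

# Hypothetical example (not from the slides): the classic "tiger" POMDP with
# 2 states, 3 actions, 2 observations, written in the notation of this slide.
S = ["tiger-left", "tiger-right"]          # states
A = ["listen", "open-left", "open-right"]  # actions
O = ["hear-left", "hear-right"]            # observations

# Transition model T[a, s, s'] = Pr(s' | s, a)
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # listen: state unchanged
    [[0.5, 0.5], [0.5, 0.5]],   # open-left: problem resets
    [[0.5, 0.5], [0.5, 0.5]],   # open-right: problem resets
])
# Observation model Z[a, s', o] = Pr(o | s', a)
Z = np.array([
    [[0.85, 0.15], [0.15, 0.85]],  # listen: noisy hint about the tiger's side
    [[0.5, 0.5], [0.5, 0.5]],      # opening a door: observation uninformative
    [[0.5, 0.5], [0.5, 0.5]],
])
# Reward model R[s, a]
R = np.array([
    [-1.0, -100.0,   10.0],   # tiger-left:  listen, open-left, open-right
    [-1.0,   10.0, -100.0],   # tiger-right
])
gamma = 0.95                  # discount factor
```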

  5. Graphical Model
• Fully observable MDP
(diagram: a dynamic decision network with observed states s_0 … s_4, actions a_0 … a_3, and rewards r_0 … r_3)

  6. Graphical Model
• Partially observable MDP
(diagram: the same network with hidden states s_0 … s_4, actions a_0 … a_3, rewards r_0 … r_3, and observations o_1 … o_4 emitted by the hidden states)

  7. Policies
• MDP policies: π: S → A
  – Markovian policy
• But the state is unknown in POMDPs
• POMDP policies: π: B_0 × H_t → A_t
  – B_0 is the space of initial beliefs b_0, where b_0 = Pr(s_0)
  – H_t is the space of histories h_t of observables up to time t:
    h_t ≝ ⟨a_0, o_1, a_1, o_2, …, a_{t−1}, o_t⟩
  – Non-Markovian policy

  8. Policy Trees
• Policy π: B × H_t → A_t
• Consider a single initial belief b
• Then π can be represented by a tree
(diagram: a tree whose nodes are labelled with actions a_1, a_2 and whose edges are labelled with observations o_1, o_2)
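
As a rough illustration of this representation (not part of the slides), a policy tree can be stored as a node holding the current action plus one subtree per observation; the class and field names below are ours.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTreeNode:
    """One node of a policy tree: the action to take now, plus one subtree
    per possible observation (hypothetical representation)."""
    action: int
    children: Dict[int, "PolicyTreeNode"] = field(default_factory=dict)

# A depth-1 tree in the spirit of the slide's sketch: take a_1, then branch on o_1 / o_2.
tree = PolicyTreeNode(action=0, children={
    0: PolicyTreeNode(action=0),   # after observation o_1, take a_1 again
    1: PolicyTreeNode(action=1),   # after observation o_2, take a_2
})
```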

  9. Policy Trees (continued)
• Policy π: B × H_t → A_t
  – A set of trees: let B = B_1 ∪ B_2 ∪ B_3, with one tree used for b ∈ B_1, another for b ∈ B_2, and another for b ∈ B_3
(diagram: three depth-2 policy trees, one per region of the belief space, with action-labelled nodes and observation-labelled edges)

  10. Beliefs
• Belief b_t(s) = Pr(s_t)
  – Distribution over states at time t
• Belief about the underlying state based on the history h_t:
  b_t(s) = Pr(s_t | h_t, b_0)

  11. Belief Update
• Belief update: b_t, a_t, o_{t+1} → b_{t+1}

  b_{t+1}(s_{t+1}) = Pr(s_{t+1} | h_{t+1}, b_0)
  = Pr(s_{t+1} | o_{t+1}, a_t, h_t, b_0)                                            [h_{t+1} ≡ ⟨h_t, a_t, o_{t+1}⟩]
  = Pr(s_{t+1} | o_{t+1}, a_t, b_t)                                                 [b_t ≡ ⟨b_0, h_t⟩]
  = Pr(s_{t+1}, o_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)                        [Bayes' theorem]
  = Pr(o_{t+1} | s_{t+1}, a_t) Pr(s_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)      [chain rule]
  = Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t) / Pr(o_{t+1} | a_t, b_t)   [belief definition]
  ∝ Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t)
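
The derivation above translates directly into a short update routine. This is a sketch assuming the array conventions introduced earlier (T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a)); the function name is ours.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of the update b_t, a_t, o_{t+1} -> b_{t+1} derived above."""
    # Pr(s_{t+1} | a_t, b_t) = sum_s Pr(s_{t+1} | s, a_t) b_t(s)
    predicted = b @ T[a]
    # Multiply by Pr(o_{t+1} | s_{t+1}, a_t), then renormalize (Bayes' theorem)
    unnormalized = Z[a, :, o] * predicted
    norm = unnormalized.sum()              # this is Pr(o_{t+1} | a_t, b_t)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief/action")
    return unnormalized / norm

# Example with the hypothetical tiger model above: listen (a=0), hear "left" (o=0):
# belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, Z=Z)   # -> [0.85, 0.15]
```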

  12. Markovian Policies
• Beliefs are sufficient statistics equivalent to histories (together with the initial belief): ⟨b_0, h_t⟩ ⇔ b_t
• Policies:
  – Based on histories: π: B_0 × H_t → A_t
    • Non-Markovian
  – Based on beliefs: π: B → A
    • Markovian

  13. Belief State MDPs
• POMDPs can be viewed as belief state MDPs
  – States: B (beliefs)
  – Actions: A
  – Transitions: Pr(b_{t+1} | b_t, a_t) = Pr(o_{t+1} | b_t, a_t) if b_t, a_t, o_{t+1} → b_{t+1}, and 0 otherwise
  – Rewards: R(b, a) = Σ_s b(s) R(s, a)
• Belief state MDPs are
  – Fully observable
  – But have a continuous belief space
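
A minimal sketch of the belief-MDP reward and transition probability under the same assumed array conventions (the helper names are ours):

```python
import numpy as np

def belief_reward(b, a, R):
    """R(b, a) = sum_s b(s) R(s, a), with R[s, a] as assumed earlier."""
    return float(b @ R[:, a])

def observation_prob(b, a, o, T, Z):
    """Pr(o' | b, a): the probability that drives the belief-MDP transition,
    since Pr(b' | b, a) = Pr(o' | b, a) whenever (b, a, o') -> b'."""
    return float(Z[a, :, o] @ (b @ T[a]))
```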

  14. Policy Evaluation
• Value V^π of a POMDP policy π
  – Expected sum of rewards: V^π(b) = E[ Σ_t γ^t R(b_t, π(b_t)) ]
  – Policy evaluation (Bellman's equation):
    V^π(b) = R(b, π(b)) + γ Σ_{b'} Pr(b' | b, π(b)) V^π(b')   ∀b
  – Equivalent equation:
    V^π(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V^π(b^{π(b), o'})   ∀b
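
The second (observation-based) form of the equation suggests a simple, if exponential, finite-horizon evaluator that expands the observation tree. This is only a sketch under the earlier array conventions, with a `policy` callable (our assumption) mapping a belief to an action index.

```python
import numpy as np

def evaluate_policy(b, policy, T, Z, R, gamma, depth):
    """Estimate V^pi(b) by applying the observation-based Bellman equation
    recursively for `depth` steps (exponential in depth; illustration only)."""
    a = policy(b)
    value = float(b @ R[:, a])                 # R(b, pi(b))
    if depth == 0:
        return value
    for o in range(Z.shape[2]):
        joint = Z[a, :, o] * (b @ T[a])        # over s': Pr(o'|s',a) * Pr(s'|b,a)
        p_o = joint.sum()                      # Pr(o' | b, pi(b))
        if p_o > 0.0:
            b_next = joint / p_o               # updated belief b^{pi(b), o'}
            value += gamma * p_o * evaluate_policy(b_next, policy, T, Z, R, gamma, depth - 1)
    return value

# Hypothetical usage: value of always listening in the tiger model, 5 steps deep
# evaluate_policy(np.array([0.5, 0.5]), lambda b: 0, T, Z, R, gamma, depth=5)
```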

  15. Policy Tree Value Function
• Theorem: the value function V^π(b) of a policy tree is linear in b
  – i.e., V^π(b) = Σ_s α(s) b(s)
• Proof by induction:
  – Base case: at the leaves, V_0(b) = R(b, π(b)) = Σ_s b(s) R(s, π(b))
    – Hence α(s) = R(s, π(b))
  – Induction hypothesis: for all trees of depth n, there exists an α-vector such that V_n(b) = Σ_s b(s) α(s)

  16. Proof Continued
• Induction step:
  V_{n+1}(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V_n(b^{π(b), o'})
  = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} b^{π(b), o'}(s') α_{o'}(s')
  = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} [ Σ_s b(s) Pr(s' | s, π(b)) Pr(o' | s', π(b)) / Pr(o' | b, π(b)) ] α_{o'}(s')
  = Σ_s b(s) [ R(s, π(b)) + γ Σ_{o', s'} Pr(s' | s, π(b)) Pr(o' | s', π(b)) α_{o'}(s') ]
  = Σ_s b(s) α(s)
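
The recursion in the next-to-last line can be turned into code that computes a policy tree's α-vector directly. A sketch, with the tree represented here simply as an (action, {observation: subtree}) pair (our convention) and the earlier array conventions assumed:

```python
import numpy as np

def alpha_vector(tree, T, Z, R, gamma):
    """alpha-vector of a policy tree via the proof's recursion:
    alpha(s) = R(s, a) + gamma * sum_{o', s'} Pr(s'|s,a) Pr(o'|s',a) alpha_{o'}(s')."""
    a, children = tree
    alpha = R[:, a].astype(float).copy()
    for o, subtree in children.items():
        alpha_o = alpha_vector(subtree, T, Z, R, gamma)
        # For every s at once: gamma * sum_{s'} Pr(s'|s,a) Pr(o|s',a) alpha_o(s')
        alpha += gamma * T[a] @ (Z[a, :, o] * alpha_o)
    return alpha

# Then V^pi(b) is linear in b:  V^pi(b) = b @ alpha_vector(tree, T, Z, R, gamma)
```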

  17. Value Function
• Corollary: the value function of a policy made up of a set of trees is piecewise linear
• Proof:
  – Each tree yields a linear piece for a region of the belief space
  – Hence the value function is made up of several linear pieces

  18. Optimal Value Function
• Theorem: the optimal value function V*(b) for a finite horizon is piecewise linear and convex in b
• Proof:
  – There are finitely many policy trees of finite depth
  – Each tree gives rise to a linear piece α
  – At each belief, the optimal policy selects the highest linear piece, so V*(b) is a maximum of finitely many linear functions: piecewise linear and convex
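
Evaluating such a piecewise-linear convex value function at a belief is just a maximum of dot products; a two-line sketch (Γ is assumed to be a list of α-vectors):

```python
import numpy as np

def value_at_belief(b, Gamma):
    """V(b) = max over alpha-vectors of alpha . b (one linear piece per alpha)."""
    return max(float(np.dot(alpha, b)) for alpha in Gamma)
```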

  19. Value Iteration
• Bellman's equation:
  V*(b) = max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ]
• Value iteration:
  – Idea: repeat V*(b) ← max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ]   ∀b
  – But we can't enumerate all beliefs
  – Instead, compute linear pieces α for a subset of beliefs

  20. Point-Based Value Iteration
• Let B = {b_1, b_2, …, b_k} be a subset of beliefs
• Let Γ = {α_1, α_2, …, α_k} be a set of α-vectors such that α_i is associated with b_i
• Point-based value iteration:
  – Repeatedly improve V(b_i) at each b_i:
    V(b_i) = max_a [ R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) max_{α∈Γ} α(b_i^{a, o'}) ]
  – Find α_i such that V(b_i) = Σ_s b_i(s) α_i(s):
    • α_{a,o'} ← argmax_{α∈Γ} Σ_{s'} b_i^{a,o'}(s') α(s')
    • a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
    • α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')

  21. Algorithm
Point-Based-Value-Iteration(B, h)
  Let B be a set of beliefs
  α_init(s) ← min_{a,s} R(s, a) / (1 − γ)   ∀s
  Γ_0 ← {α_init}
  For n = 1 to h do
    For each b_i ∈ B do
      α_{a,o'} ← argmax_{α∈Γ_{n−1}} Σ_{s'} b_i^{a,o'}(s') α(s')
      a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
      α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')
    Γ_n ← {α_i ∀i}
  Return Γ_n
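
Below is a compact Python sketch of this pseudocode, under the same array conventions as the earlier snippets (T[a, s, s'], Z[a, s', o], R[s, a], discount gamma); it is an illustrative implementation, not the course's reference code.

```python
import numpy as np

def pbvi(B, T, Z, R, gamma, h):
    """Point-based value iteration following the pseudocode above.
    B is a list of belief vectors; returns the final set of alpha-vectors Gamma."""
    nS, nA = R.shape
    nO = Z.shape[2]

    # alpha_init(s) = min_{a,s} R(s, a) / (1 - gamma), for every s
    Gamma = [np.full(nS, R.min() / (1.0 - gamma))]

    for _ in range(h):                               # n = 1 .. h
        new_Gamma = []
        for b_i in B:
            best_value, best_alpha_i = -np.inf, None
            for a in range(nA):                      # candidate action a
                value = float(b_i @ R[:, a])         # R(b_i, a)
                alpha_i = R[:, a].astype(float).copy()
                for o in range(nO):
                    joint = Z[a, :, o] * (b_i @ T[a])   # over s': Pr(o|s',a) Pr(s'|b_i,a)
                    p_o = joint.sum()                   # Pr(o | b_i, a)
                    if p_o == 0.0:
                        continue
                    b_next = joint / p_o                # b_i^{a,o}
                    # alpha_{a,o} <- argmax_{alpha in Gamma (previous iteration)} alpha . b_next
                    alpha_ao = max(Gamma, key=lambda alpha: float(alpha @ b_next))
                    value += gamma * p_o * float(alpha_ao @ b_next)
                    # alpha_i(s) += gamma * sum_{s'} Pr(s'|s,a) Pr(o|s',a) alpha_{a,o}(s')
                    alpha_i += gamma * T[a] @ (Z[a, :, o] * alpha_ao)
                if value > best_value:               # keeps the alpha_i built from a*
                    best_value, best_alpha_i = value, alpha_i
            new_Gamma.append(best_alpha_i)
        Gamma = new_Gamma                            # Gamma_n <- {alpha_i for all i}
    return Gamma

# Hypothetical usage with the tiger model defined earlier:
# Gamma = pbvi([np.array([0.5, 0.5]), np.array([0.85, 0.15]), np.array([0.15, 0.85])],
#              T, Z, R, gamma, h=20)
```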
