

  1. Module 14: Introduction to Partially Observable Markov Decision Processes
CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

  2. Markov Decision Processes
• MDPs:
  – Fully observable MDPs
  – The decision maker knows the state at each time step
• POMDPs:
  – Partially observable MDPs
  – The decision maker does not know the state
  – But it makes observations that are correlated with the underlying state
    • E.g., sensors provide noisy information about the state

  3. Applications
• Robotic control
• Dialog systems
• Assistive technologies
• Operations research

  4. Model Description
• Definition
  – Set of states: S
  – Set of actions (i.e., decisions): A
  – Transition model: Pr(s_t | s_{t−1}, a_{t−1})
  – Reward model (i.e., utility): R(s_t, a_t)
  – Discount factor: 0 ≤ γ ≤ 1
  – Horizon (i.e., # of time steps): h
  – Set of observations: O
  – Observation model: Pr(o_t | s_t, a_{t−1})
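
To make these components concrete, here is a minimal sketch of one possible encoding in Python, using the classic tiger problem as an example; the example itself and the array conventions T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a), R[s, a] are assumptions, not from the slides.

```python
import numpy as np

# Hypothetical example (not from the slides): the classic "tiger" POMDP with
# 2 states, 3 actions, 2 observations, written in the notation of this slide.
S = ["tiger-left", "tiger-right"]          # states
A = ["listen", "open-left", "open-right"]  # actions
O = ["hear-left", "hear-right"]            # observations

# Transition model T[a, s, s'] = Pr(s' | s, a)
T = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # listen: state unchanged
    [[0.5, 0.5], [0.5, 0.5]],   # open-left: problem resets
    [[0.5, 0.5], [0.5, 0.5]],   # open-right: problem resets
])
# Observation model Z[a, s', o] = Pr(o | s', a)
Z = np.array([
    [[0.85, 0.15], [0.15, 0.85]],  # listen: noisy hint about the tiger's side
    [[0.5, 0.5], [0.5, 0.5]],      # opening a door: observation uninformative
    [[0.5, 0.5], [0.5, 0.5]],
])
# Reward model R[s, a]
R = np.array([
    [-1.0, -100.0,   10.0],   # tiger-left:  listen, open-left, open-right
    [-1.0,   10.0, -100.0],   # tiger-right
])
gamma = 0.95                  # discount factor
```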

  5. Graphical Model
• Fully observable MDP
(diagram: a dynamic decision network with observed states s_0 … s_4, actions a_0 … a_3, and rewards r_0 … r_3)

  6. Graphical Model
• Partially observable MDP
(diagram: the same network with hidden states s_0 … s_4, actions a_0 … a_3, rewards r_0 … r_3, and observations o_1 … o_4 emitted by the hidden states)

  7. Policies
• MDP policies: π: S → A
  – Markovian policy
• But the state is unknown in POMDPs
• POMDP policies: π: B_0 × H_t → A_t
  – B_0 is the space of initial beliefs b_0, where b_0 = Pr(s_0)
  – H_t is the space of histories h_t of observables up to time t:
    h_t ≝ ⟨a_0, o_1, a_1, o_2, …, a_{t−1}, o_t⟩
  – Non-Markovian policy

  8. Policy Trees
• Policy π: B × H_t → A_t
• Consider a single initial belief b
• Then π can be represented by a tree
(diagram: a tree whose nodes are labelled with actions a_1, a_2 and whose edges are labelled with observations o_1, o_2)
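
As a rough illustration of this representation (not part of the slides), a policy tree can be stored as a node holding the current action plus one subtree per observation; the class and field names below are ours.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTreeNode:
    """One node of a policy tree: the action to take now, plus one subtree
    per possible observation (hypothetical representation)."""
    action: int
    children: Dict[int, "PolicyTreeNode"] = field(default_factory=dict)

# A depth-1 tree in the spirit of the slide's sketch: take a_1, then branch on o_1 / o_2.
tree = PolicyTreeNode(action=0, children={
    0: PolicyTreeNode(action=0),   # after observation o_1, take a_1 again
    1: PolicyTreeNode(action=1),   # after observation o_2, take a_2
})
```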

  9. Policy Trees (continued)
• Policy π: B × H_t → A_t
  – A set of trees: let B = B_1 ∪ B_2 ∪ B_3, with one tree used for b ∈ B_1, another for b ∈ B_2, and another for b ∈ B_3
(diagram: three depth-2 policy trees, one per region of the belief space, with action-labelled nodes and observation-labelled edges)

  10. Beliefs
• Belief b_t(s) = Pr(s_t)
  – Distribution over states at time t
• Belief about the underlying state based on the history h_t:
  b_t(s) = Pr(s_t | h_t, b_0)

  11. Belief Update
• Belief update: b_t, a_t, o_{t+1} → b_{t+1}

  b_{t+1}(s_{t+1}) = Pr(s_{t+1} | h_{t+1}, b_0)
  = Pr(s_{t+1} | o_{t+1}, a_t, h_t, b_0)                                            [h_{t+1} ≡ ⟨h_t, a_t, o_{t+1}⟩]
  = Pr(s_{t+1} | o_{t+1}, a_t, b_t)                                                 [b_t ≡ ⟨b_0, h_t⟩]
  = Pr(s_{t+1}, o_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)                        [Bayes' theorem]
  = Pr(o_{t+1} | s_{t+1}, a_t) Pr(s_{t+1} | a_t, b_t) / Pr(o_{t+1} | a_t, b_t)      [chain rule]
  = Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t) / Pr(o_{t+1} | a_t, b_t)   [belief definition]
  ∝ Pr(o_{t+1} | s_{t+1}, a_t) Σ_{s_t} Pr(s_{t+1} | s_t, a_t) b_t(s_t)
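
The derivation above translates directly into a short update routine. This is a sketch assuming the array conventions introduced earlier (T[a, s, s'] = Pr(s'|s,a), Z[a, s', o] = Pr(o|s',a)); the function name is ours.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """One step of the update b_t, a_t, o_{t+1} -> b_{t+1} derived above."""
    # Pr(s_{t+1} | a_t, b_t) = sum_s Pr(s_{t+1} | s, a_t) b_t(s)
    predicted = b @ T[a]
    # Multiply by Pr(o_{t+1} | s_{t+1}, a_t), then renormalize (Bayes' theorem)
    unnormalized = Z[a, :, o] * predicted
    norm = unnormalized.sum()              # this is Pr(o_{t+1} | a_t, b_t)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief/action")
    return unnormalized / norm

# Example with the hypothetical tiger model above: listen (a=0), hear "left" (o=0):
# belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, Z=Z)   # -> [0.85, 0.15]
```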

  12. Markovian Policies
• Beliefs are sufficient statistics equivalent to histories (together with the initial belief): ⟨b_0, h_t⟩ ⇔ b_t
• Policies:
  – Based on histories: π: B_0 × H_t → A_t
    • Non-Markovian
  – Based on beliefs: π: B → A
    • Markovian

  13. Belief State MDPs
• POMDPs can be viewed as belief state MDPs
  – States: B (beliefs)
  – Actions: A
  – Transitions: Pr(b_{t+1} | b_t, a_t) = Pr(o_{t+1} | b_t, a_t) if b_t, a_t, o_{t+1} → b_{t+1}, and 0 otherwise
  – Rewards: R(b, a) = Σ_s b(s) R(s, a)
• Belief state MDPs are
  – Fully observable
  – But have a continuous belief space
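
A minimal sketch of the belief-MDP reward and transition probability under the same assumed array conventions (the helper names are ours):

```python
import numpy as np

def belief_reward(b, a, R):
    """R(b, a) = sum_s b(s) R(s, a), with R[s, a] as assumed earlier."""
    return float(b @ R[:, a])

def observation_prob(b, a, o, T, Z):
    """Pr(o' | b, a): the probability that drives the belief-MDP transition,
    since Pr(b' | b, a) = Pr(o' | b, a) whenever (b, a, o') -> b'."""
    return float(Z[a, :, o] @ (b @ T[a]))
```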

  14. Policy Evaluation
• Value V^π of a POMDP policy π
  – Expected sum of rewards: V^π(b) = E[ Σ_t γ^t R(b_t, π(b_t)) ]
  – Policy evaluation (Bellman's equation):
    V^π(b) = R(b, π(b)) + γ Σ_{b'} Pr(b' | b, π(b)) V^π(b')   ∀b
  – Equivalent equation:
    V^π(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V^π(b^{π(b), o'})   ∀b
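
The second (observation-based) form of the equation suggests a simple, if exponential, finite-horizon evaluator that expands the observation tree. This is only a sketch under the earlier array conventions, with a `policy` callable (our assumption) mapping a belief to an action index.

```python
import numpy as np

def evaluate_policy(b, policy, T, Z, R, gamma, depth):
    """Estimate V^pi(b) by applying the observation-based Bellman equation
    recursively for `depth` steps (exponential in depth; illustration only)."""
    a = policy(b)
    value = float(b @ R[:, a])                 # R(b, pi(b))
    if depth == 0:
        return value
    for o in range(Z.shape[2]):
        joint = Z[a, :, o] * (b @ T[a])        # over s': Pr(o'|s',a) * Pr(s'|b,a)
        p_o = joint.sum()                      # Pr(o' | b, pi(b))
        if p_o > 0.0:
            b_next = joint / p_o               # updated belief b^{pi(b), o'}
            value += gamma * p_o * evaluate_policy(b_next, policy, T, Z, R, gamma, depth - 1)
    return value

# Hypothetical usage: value of always listening in the tiger model, 5 steps deep
# evaluate_policy(np.array([0.5, 0.5]), lambda b: 0, T, Z, R, gamma, depth=5)
```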

  15. Policy Tree Value Function
• Theorem: the value function V^π(b) of a policy tree is linear in b
  – i.e., V^π(b) = Σ_s α(s) b(s)
• Proof by induction:
  – Base case: at the leaves, V_0(b) = R(b, π(b)) = Σ_s b(s) R(s, π(b))
    – Hence α(s) = R(s, π(b))
  – Induction hypothesis: for all trees of depth n, there exists an α-vector such that V_n(b) = Σ_s b(s) α(s)

  16. Proof Continued
• Induction step:
  V_{n+1}(b) = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) V_n(b^{π(b), o'})
  = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} b^{π(b), o'}(s') α_{o'}(s')
  = R(b, π(b)) + γ Σ_{o'} Pr(o' | b, π(b)) Σ_{s'} [ Σ_s b(s) Pr(s' | s, π(b)) Pr(o' | s', π(b)) / Pr(o' | b, π(b)) ] α_{o'}(s')
  = Σ_s b(s) [ R(s, π(b)) + γ Σ_{o', s'} Pr(s' | s, π(b)) Pr(o' | s', π(b)) α_{o'}(s') ]
  = Σ_s b(s) α(s)
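
The recursion in the next-to-last line can be turned into code that computes a policy tree's α-vector directly. A sketch, with the tree represented here simply as an (action, {observation: subtree}) pair (our convention) and the earlier array conventions assumed:

```python
import numpy as np

def alpha_vector(tree, T, Z, R, gamma):
    """alpha-vector of a policy tree via the proof's recursion:
    alpha(s) = R(s, a) + gamma * sum_{o', s'} Pr(s'|s,a) Pr(o'|s',a) alpha_{o'}(s')."""
    a, children = tree
    alpha = R[:, a].astype(float).copy()
    for o, subtree in children.items():
        alpha_o = alpha_vector(subtree, T, Z, R, gamma)
        # For every s at once: gamma * sum_{s'} Pr(s'|s,a) Pr(o|s',a) alpha_o(s')
        alpha += gamma * T[a] @ (Z[a, :, o] * alpha_o)
    return alpha

# Then V^pi(b) is linear in b:  V^pi(b) = b @ alpha_vector(tree, T, Z, R, gamma)
```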

  17. Value Function
• Corollary: the value function of a policy made up of a set of trees is piecewise linear
• Proof:
  – Each tree yields a linear piece for a region of the belief space
  – Hence the value function is made up of several linear pieces

  18. Optimal Value Function
• Theorem: the optimal value function V*(b) for a finite horizon is piecewise linear and convex in b
• Proof:
  – There are finitely many policy trees of finite depth
  – Each tree gives rise to a linear piece α
  – At each belief, the optimal policy selects the highest linear piece, so V*(b) is a maximum of finitely many linear functions: piecewise linear and convex
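
Evaluating such a piecewise-linear convex value function at a belief is just a maximum of dot products; a two-line sketch (Γ is assumed to be a list of α-vectors):

```python
import numpy as np

def value_at_belief(b, Gamma):
    """V(b) = max over alpha-vectors of alpha . b (one linear piece per alpha)."""
    return max(float(np.dot(alpha, b)) for alpha in Gamma)
```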

  19. Value Iteration
• Bellman's equation:
  V*(b) = max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ]
• Value iteration:
  – Idea: repeat V*(b) ← max_a [ R(b, a) + γ Σ_{o'} Pr(o' | b, a) V*(b^{a, o'}) ]   ∀b
  – But we can't enumerate all beliefs
  – Instead, compute linear pieces α for a subset of beliefs

  20. Point-Based Value Iteration
• Let B = {b_1, b_2, …, b_k} be a subset of beliefs
• Let Γ = {α_1, α_2, …, α_k} be a set of α-vectors such that α_i is associated with b_i
• Point-based value iteration:
  – Repeatedly improve V(b_i) at each b_i:
    V(b_i) = max_a [ R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) max_{α∈Γ} α(b_i^{a, o'}) ]
  – Find α_i such that V(b_i) = Σ_s b_i(s) α_i(s):
    • α_{a,o'} ← argmax_{α∈Γ} Σ_{s'} b_i^{a,o'}(s') α(s')
    • a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
    • α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')

  21. Algorithm
Point-Based-Value-Iteration(B, h)
  Let B be a set of beliefs
  α_init(s) ← min_{a,s} R(s, a) / (1 − γ)   ∀s
  Γ_0 ← {α_init}
  For n = 1 to h do
    For each b_i ∈ B do
      α_{a,o'} ← argmax_{α∈Γ_{n−1}} Σ_{s'} b_i^{a,o'}(s') α(s')
      a* ← argmax_a R(b_i, a) + γ Σ_{o'} Pr(o' | b_i, a) α_{a,o'}(b_i^{a,o'})
      α_i(s) ← R(s, a*) + γ Σ_{s',o'} Pr(s' | s, a*) Pr(o' | s', a*) α_{a*,o'}(s')
    Γ_n ← {α_i ∀i}
  Return Γ_n
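
Below is a compact Python sketch of this pseudocode, under the same array conventions as the earlier snippets (T[a, s, s'], Z[a, s', o], R[s, a], discount gamma); it is an illustrative implementation, not the course's reference code.

```python
import numpy as np

def pbvi(B, T, Z, R, gamma, h):
    """Point-based value iteration following the pseudocode above.
    B is a list of belief vectors; returns the final set of alpha-vectors Gamma."""
    nS, nA = R.shape
    nO = Z.shape[2]

    # alpha_init(s) = min_{a,s} R(s, a) / (1 - gamma), for every s
    Gamma = [np.full(nS, R.min() / (1.0 - gamma))]

    for _ in range(h):                               # n = 1 .. h
        new_Gamma = []
        for b_i in B:
            best_value, best_alpha_i = -np.inf, None
            for a in range(nA):                      # candidate action a
                value = float(b_i @ R[:, a])         # R(b_i, a)
                alpha_i = R[:, a].astype(float).copy()
                for o in range(nO):
                    joint = Z[a, :, o] * (b_i @ T[a])   # over s': Pr(o|s',a) Pr(s'|b_i,a)
                    p_o = joint.sum()                   # Pr(o | b_i, a)
                    if p_o == 0.0:
                        continue
                    b_next = joint / p_o                # b_i^{a,o}
                    # alpha_{a,o} <- argmax_{alpha in Gamma (previous iteration)} alpha . b_next
                    alpha_ao = max(Gamma, key=lambda alpha: float(alpha @ b_next))
                    value += gamma * p_o * float(alpha_ao @ b_next)
                    # alpha_i(s) += gamma * sum_{s'} Pr(s'|s,a) Pr(o|s',a) alpha_{a,o}(s')
                    alpha_i += gamma * T[a] @ (Z[a, :, o] * alpha_ao)
                if value > best_value:               # keeps the alpha_i built from a*
                    best_value, best_alpha_i = value, alpha_i
            new_Gamma.append(best_alpha_i)
        Gamma = new_Gamma                            # Gamma_n <- {alpha_i for all i}
    return Gamma

# Hypothetical usage with the tiger model defined earlier:
# Gamma = pbvi([np.array([0.5, 0.5]), np.array([0.85, 0.15]), np.array([0.15, 0.85])],
#              T, Z, R, gamma, h=20)
```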
