  1. Partially-Observable Markov Decision Processes as Dynamical Causal Models. Finale Doshi-Velez, NIPS Causality Workshop 2013.

  2. The POMDP Mindset: We poke the world (perform an action). [Figure: agent acting on the world]

  3. The POMDP Mindset: We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward, e.g. -$1). [Figure: agent-world interaction loop]

  4. What next? We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward). [Figure: agent-world interaction loop]

  5. What next? We poke the world (perform an action). We get a poke back (see an observation). We get a poke back (get a reward). But the world is a mystery... [Figure: agent-world interaction loop with the world unknown]

  6. The agent needs a representation to use when making decisions: a representation of how the world works, and a representation of the current world state. The world is a mystery... [Figure: agent-world loop with the agent's internal representations]

  7. Many problems can be framed this way: ● Robot navigation (take movement actions, receive sensor measurements) ● Dialog management (ask questions, receive answers) ● Target tracking (search a particular area, receive sensor measurements) ... the list goes on ...

  8. The Causal Process, Unrolled. [Figure: unrolled sequence of actions a_{t-1}, a_t, a_{t+1}, ..., observations o_{t-1}, o_t, o_{t+1}, ..., and rewards r_{t-1}, r_t, r_{t+1}, ... (e.g. -$5, $10, -$1, -$1)]

  9. The Causal Process, Unrolled. Given a history of actions, observations, and rewards, how can we act in order to maximize long-term future rewards? [Figure: unrolled causal process]

  10. The Causal Process, Unrolled. Key challenge: the entire history may be needed to make near-optimal decisions. [Figure: unrolled causal process]

  11. The Causal Process, Unrolled. All past events are needed to predict future events. [Figure: unrolled causal process]

  12. The Causal Process, Unrolled. The representation is a sufficient statistic that summarizes the history. [Figure: unrolled causal process with hidden states s_{t-1}, s_t, s_{t+1}, ...]

  13. The Causal Process, Unrolled. The representation is a sufficient statistic that summarizes the history. We call this representation the information state. [Figure: unrolled causal process with hidden states]

  14. What is state? ● Sometimes there exists an obvious choice for this hidden variable (such as a robot's true position). ● At other times, learning a representation that makes the system Markovian may provide insights into the problem.

  15. Formal POMDP definition. A POMDP consists of ● A set of states S, actions A, and observations O ● A transition function T(s' | s, a) ● An observation function O(o | s, a) ● A reward function R(s, a) ● A discount factor γ. The goal is to maximize E[ Σ_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward.
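
A minimal sketch of how this definition might be encoded for a small, discrete problem; the class name, array layout, and tabular representation are my own choices for illustration, not from the talk:

```python
import numpy as np

class TabularPOMDP:
    """Tabular POMDP: states, actions, and observations are integer indices.

    T[a, s, s2] ~ T(s2 | s, a)   transition function
    Z[a, s, o]  ~ O(o | s, a)    observation function
    R[s, a]                      reward function
    gamma                        discount factor
    """
    def __init__(self, T, Z, R, gamma):
        self.T = np.asarray(T)        # shape (|A|, |S|, |S|)
        self.Z = np.asarray(Z)        # shape (|A|, |S|, |O|)
        self.R = np.asarray(R)        # shape (|S|, |A|)
        self.gamma = gamma            # in (0, 1)
        self.nA, self.nS, _ = self.T.shape
        self.nO = self.Z.shape[-1]
```

For a toy domain like the classic tiger problem (|S| = 2, |A| = 3, |O| = 2) these arrays are tiny; the same layout works for any finite POMDP.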

  16. Relationship to Other Models. [Figure: 2x2 grid of graphical models organized by "Hidden state?" and "Decisions?": Markov Model (neither), Hidden Markov Model (hidden state), Markov Decision Process (decisions), POMDP (both)]

  17. Formal POMDP definition. A POMDP consists of ● A set of states S, actions A, and observations O ● A transition function T(s' | s, a) ● An observation function O(o | s, a) ● A reward function R(s, a) ● A discount factor γ. The goal is to maximize E[ Σ_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward. This optimization is called “Planning”.

  18. Formal POMDP definition. A POMDP consists of ● A set of states S, actions A, and observations O ● A transition function T(s' | s, a) ● An observation function O(o | s, a) ● A reward function R(s, a) ● A discount factor γ. The goal is to maximize E[ Σ_{t=1}^∞ γ^t R_t ], the expected long-term discounted reward. Learning these quantities (T, O, R) is called “Learning”.

  19. Planning. Bellman recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ]

  20. State and State, a quick aside ● In the POMDP literature, the term “state” usually refers to the hidden state (i.e. the robot's true location). ● The posterior distribution over states s is called the “belief” b(s). It is a sufficient statistic for the history, and thus the information state for the POMDP.
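
As a concrete example of maintaining the belief, the Bayes-filter update below computes b_oa from b after taking action a and seeing observation o. It assumes the TabularPOMDP sketch above (with the observation conditioned on the arriving state) and is not code from the talk:

```python
import numpy as np

def belief_update(pomdp, b, a, o):
    """Return (b_oa, P(o | b, a)), where b_oa(s2) ∝ O(o | s2, a) Σ_s T(s2 | s, a) b(s)."""
    predicted = pomdp.T[a].T @ b            # Σ_s T(s2 | s, a) b(s), shape (|S|,)
    unnorm = pomdp.Z[a, :, o] * predicted   # weight by the observation likelihood
    prob_o = float(unnorm.sum())            # P(o | b, a), reused in the Bellman backup
    b_oa = unnorm / prob_o if prob_o > 0 else unnorm
    return b_oa, prob_o
```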

  21. Planning. Bellman recursion for the value (long-term expected reward): V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ]

  22. Planning. Bellman recursion for the value: V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ], where b is the belief (sufficient statistic / information state).

  23. Planning. Bellman recursion for the value: V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ], where R(b, a) is the immediate reward for taking action a in belief b.

  24. Planning. Bellman recursion for the value: V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ], where the second term is the expected future reward.

  25. Planning. Bellman recursion for the value: V(b) = max E[ Σ_{t=1}^∞ γ^t R_t | b_0 = b ] = max_a [ R(b, a) + γ Σ_{o∈O} P(b_oa | b) V(b_oa) ]. Especially when b is high-dimensional, solving for this continuous function is not easy (PSPACE-hard).
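
To make the recursion concrete, the sketch below evaluates V(b) by expanding the Bellman equation to a fixed depth, using the belief_update helper from earlier. It is exponential in the horizon and only meant to illustrate the equation; real solvers approximate instead (see the next slide):

```python
import numpy as np

def value(pomdp, b, depth):
    """Depth-limited Bellman recursion:
    V(b) = max_a [ R(b, a) + γ Σ_o P(o | b, a) V(b_oa) ]."""
    if depth == 0:
        return 0.0
    best = -np.inf
    for a in range(pomdp.nA):
        q = float(b @ pomdp.R[:, a])                 # R(b, a) = Σ_s b(s) R(s, a)
        for o in range(pomdp.nO):
            b_oa, prob_o = belief_update(pomdp, b, a, o)
            if prob_o > 0:                           # skip impossible observations
                q += pomdp.gamma * prob_o * value(pomdp, b_oa, depth - 1)
        best = max(best, q)
    return best
```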

  26. Planning: Yes, we can! ● Global: approximate the entire function V(b) via a set of support points b', e.g. SARSOP. ● Local: approximate the value for a particular belief with forward simulation, e.g. POMCP. [Figure: belief-point backups and a forward-simulation search tree]
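
A rough sketch of the "local" idea in its simplest form: estimate Q(b, a) by sampling a hidden state from the belief and rolling the model forward under a random policy, then act greedily. This is a bare-bones Monte Carlo rollout under my own simplifications, not the actual POMCP algorithm (which grows a search tree with UCB and reuses it across steps):

```python
import numpy as np

def choose_action(pomdp, b, horizon=30, n_rollouts=200, seed=0):
    """Greedy action at belief b from Monte Carlo rollout estimates of Q(b, a)."""
    rng = np.random.default_rng(seed)
    q = np.zeros(pomdp.nA)
    for a in range(pomdp.nA):
        for _ in range(n_rollouts):
            s = rng.choice(pomdp.nS, p=b)        # sample a hidden state from the belief
            ret, discount, act = 0.0, 1.0, a     # first action is the one being scored
            for _ in range(horizon):
                ret += discount * pomdp.R[s, act]
                s = rng.choice(pomdp.nS, p=pomdp.T[act, s])
                discount *= pomdp.gamma
                act = rng.integers(pomdp.nA)     # random rollout policy afterwards
            q[a] += ret / n_rollouts
    return int(np.argmax(q))
```

Because the rollout policy here ignores observations, none need to be sampled; POMCP's gains come precisely from conditioning its tree policy on the simulated observations.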

  27. Learning ● Given histories h = (a_1, r_1, o_1, a_2, r_2, o_2, ..., a_T, r_T, o_T), we can learn T, O, R via forward-filtering/backward-sampling or <fill in your favorite timeseries algorithm>. ● Two principles usually suffice for exploring to learn: optimism under uncertainty (try actions that might be good) and risk control (if an action seems risky, ask for help).
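
A hedged sketch of the simplest version of the model-learning step: if a hidden-state sequence has been sampled for each history (the forward-filtering/backward-sampling step itself is omitted here), then T, O, and R can be re-estimated from smoothed counts. The function name, data layout, and Dirichlet-style smoothing are my assumptions, not the talk's:

```python
import numpy as np

def estimate_model(histories, state_seqs, nS, nA, nO, alpha=1.0):
    """Re-estimate (T, Z, R) from histories [(actions, rewards, obs), ...] and
    aligned sampled hidden-state sequences; alpha is a smoothing pseudo-count."""
    T = np.full((nA, nS, nS), alpha)           # transition counts
    Z = np.full((nA, nS, nO), alpha)           # observation counts (next-state convention)
    R_sum = np.zeros((nS, nA))
    R_cnt = np.zeros((nS, nA))
    for (actions, rewards, obs), states in zip(histories, state_seqs):
        for t in range(len(actions) - 1):
            a, s, s2 = actions[t], states[t], states[t + 1]
            T[a, s, s2] += 1
            Z[a, s2, obs[t]] += 1
            R_sum[s, a] += rewards[t]
            R_cnt[s, a] += 1
    T /= T.sum(axis=2, keepdims=True)          # normalize to T(s2 | s, a)
    Z /= Z.sum(axis=2, keepdims=True)          # normalize to O(o | s, a)
    R = R_sum / np.maximum(R_cnt, 1)           # empirical mean reward R(s, a)
    return T, Z, R
```

In a full learner, this re-estimation would alternate with re-sampling the hidden states under the current model, while exploration follows the two principles above.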

  28. Example: Timeseries in Diabetes. Data: electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Figure: clinician model chooses meds (anti-diabetic agents); patient model produces lab results (A1c)] Collaborators: Isaac Kohane, Stan Shaw

  29. Example: Timeseries in Diabetes. Data: electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Figure: clinician model / patient model diagram, repeated]

  30. Discovered Patient States ● The “patient states” each correspond to a set of A1c levels (unsurprising). [Figure: five patient states covering A1c < 5.5, 5.5-6.5, 6.5-7.5, 7.5-8.5, and > 8.5]

  31. Example: Timeseries in Diabetes. Data: electronic health records of ~17,000 diabetics with 5+ A1c lab measurements and 5+ anti-diabetic agents prescribed. [Figure: clinician model / patient model diagram, repeated]

  32. Discovered Clinician States ● The “clinician states” follow the standard treatment protocols for diabetes (unsurprising, but exciting that we discovered this in a completely unsupervised manner). ● Next steps: incorporate more variables; identify patient and clinician outliers (quality of care). [Figure: clinician state diagram with transitions on “A1c up” / “A1c control” between treatment regimens such as Metformin; Metformin + Glipizide/Glyburide; basic insulins; Glargine, Lispro, Aspart]

  33. Example: Experimental Design ● In a very general sense: ● Action space: all possible experiments + “submit” ● State space: which hypothesis is true ● Observation space: results of experiments ● Reward: cost of experiment ● Allows for non-myopic sequencing of experiments. ● Example: Bayesian Optimization. Joint with: Ryan Adams/HIPS group

  34. Summary ● POMDPs provide a framework for modeling causal dynamical systems and making optimal sequential decisions. ● POMDPs can be learned and solved! [Figure: unrolled POMDP graphical model with actions, hidden states, observations, and rewards]
