ARTIFICIAL INTELLIGENCE
Planning under uncertainty: POMDPs

Utrecht University, The Netherlands
INFOB2KI 2019-2020
Lecturer: Silja Renooij

These slides are part of the INFOB2KI Course Notes, available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
Markov model types

                          Prediction                  Planning
  Fully observable        Markov chain                MDP (Markov decision process)
  Hidden / partially      Hidden Markov model         POMDP (partially observable
  observable                                          Markov decision process)

Prediction models can be represented at variable level by a (Dynamic) Bayesian network.

[Figure: (Dynamic) Bayesian networks over state variables S1, S2, S3, ... and observation variables O1, O2, O3, ...]
Recap: Markov Decision Process

An MDP is defined by:
- A set of states s ∈ S; typically finite
- A set of actions a ∈ A; typically finite
- A transition function T: S × A × S → [0,1], with T(s, a, s') = P(S_{t+1} = s' | S_t = s and A_t = a)
- A reward function R: S × A → ℝ (Note: simpler than before)

Stationary process: T and R are time independent.
Note the representation at variable level!
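To make these ingredients concrete, here is a minimal Python sketch of how an MDP could be encoded; the two states, two actions and all probabilities and rewards are invented purely for illustration and are not part of the course material.

# Minimal sketch of the MDP ingredients above (illustrative numbers only).

S = ["s0", "s1"]            # finite set of states
A = ["stay", "go"]          # finite set of actions

# Transition function T(s, a, s') = P(S_{t+1} = s' | S_t = s, A_t = a).
# For every (s, a) pair the probabilities over s' sum to 1.
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# Reward function R(s, a): a real number for every state-action pair.
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): -1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.5,
}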
Solving a Markov System

Goal: find an optimal policy π* (recall: the best action to choose in each state s):
- compute, for each state s, the expected sum of future rewards when acting optimally:

    V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a) + V*(s') ]

  (this can be done by e.g. Value Iteration (DP))
- for π*, take the argmax over a, as before
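Below is a sketch of value iteration for this equation, reusing the illustrative S, A, T, R from the previous sketch. The fixed horizon is an assumption added here: the undiscounted formula above would otherwise need a discount factor (or a finite horizon) to guarantee convergence.

# Sketch of value iteration for the Bellman equation above.

def value_iteration(S, A, T, R, horizon=50):
    V = {s: 0.0 for s in S}                       # V_0(s) = 0 for all states
    for _ in range(horizon):
        # One DP backup: V_{k+1}(s) = max_a sum_{s'} T(s,a,s') (R(s,a) + V_k(s'))
        V = {
            s: max(
                sum(T[(s, a)][s2] * (R[(s, a)] + V[s2]) for s2 in S)
                for a in A
            )
            for s in S
        }
    return V

def greedy_policy(S, A, T, R, V):
    # pi*(s) = argmax_a sum_{s'} T(s,a,s') (R(s,a) + V(s'))
    return {
        s: max(A, key=lambda a: sum(T[(s, a)][s2] * (R[(s, a)] + V[s2]) for s2 in S))
        for s in S
    }

# Example usage with the illustrative MDP:
# V = value_iteration(S, A, T, R)
# pi = greedy_policy(S, A, T, R, V)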
Planning under partial observability

[Figure: agent interacting with its environment through actions, receiving imperfect observations, and pursuing a goal]

A POMDP allows partial satisfaction of goals and tradeoffs among competing goals.
POMDP

[Figure: agent-world loop, with the agent sending actions a to the world and receiving observations o]

- Goal is to maximize the expected long-term reward from the initial state distribution
- But: the state is not directly observed
Definition of POMDP

A Partially Observable Markov Decision Process (POMDP) is defined by:
- All MDP variables/sets and functions
- A set of observations o ∈ O; typically finite
- An observation function Z: S × A × O → [0,1], with Z(s, a, o) = P(O_t = o | S_t = s and A_{t-1} = a)

[Figure: dynamic model with a hidden states layer]
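Continuing the illustrative Python sketch, the extra POMDP ingredients could be encoded as follows; the observation names and probabilities are again invented for illustration.

# Sketch of the POMDP ingredients added on top of the illustrative MDP.

O = ["beep", "silence"]     # finite set of observations

# Observation function Z(s, a, o) = P(O_t = o | S_t = s, A_{t-1} = a):
# probability of observing o when the (hidden) state is s after action a.
Z = {
    ("s0", "stay"): {"beep": 0.1, "silence": 0.9},
    ("s0", "go"):   {"beep": 0.2, "silence": 0.8},
    ("s1", "stay"): {"beep": 0.8, "silence": 0.2},
    ("s1", "go"):   {"beep": 0.7, "silence": 0.3},
}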
Memory vs Markov

The POMDP is non-Markovian from the viewpoint of the agent: any of the past actions and observations may influence the agent's belief concerning the current state.

If action choices are based on only the most recent observation, the policy becomes memoryless.
Two sources of POMDP complexity

Curse of dimensionality
- size of the state space
- shared by other planning problems

Curse of memory
- size of the value function (number of vectors)
- or equivalently, size of the controller (memory)
- unique to POMDPs

Complexity of each iteration of DP: O(|S|^2 |A| |Γ_{n-1}|^{|O|}), where each vector in Γ gives a value for each possible state (the true state is unknown).
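To get a feel for these numbers, take the per-iteration cost above at face value with, say, |S| = 10, |A| = 4, |O| = 3 and |Γ_{n-1}| = 20 (illustrative values, not from the slides): a single DP backup is then already on the order of 10^2 · 4 · 20^3 = 100 · 4 · 8000 = 3,200,000 operations. The exponential dependence on |O| through the number of vectors is exactly the curse of memory.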
Note

The following slides contain several formulas; you don't have to understand these in detail if you grasp the general idea:
- Rather than states, we use a probability distribution over states
- The transition and observation functions of the POMDP contain all information necessary to update this distribution
Solution: Belief state

In a regular MDP we track and update the current state. Since in a POMDP the actual state is uncertain, we instead maintain a belief state b: S → [0,1] (note: this is a probability distribution!).

The belief state contains all relevant information from the history: consider belief b; after subsequent action a and observation o this belief can be updated (using POMDP ingredients):

    b^{a,o}(s') = P(s' | a, o, b)
                = P(o | a, b, s') · P(s' | a, b) / P(o | a, b)        (Bayes' rule)
                = Z(s', a, o) · Σ_{s∈S} T(s, a, s') · b(s) / P(o | a, b)

The denominator P(o | a, b) is also expressible in Z, T and b (see next slide).
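A sketch of this belief update in Python, using the illustrative S, T and Z structures from the earlier sketches; the normalising constant computed along the way is exactly P(o | a, b) from Bayes' rule.

# Sketch of the belief update b^{a,o} derived above.

def update_belief(b, a, o, S, T, Z):
    # Unnormalised: Z(s', a, o) * sum_s T(s, a, s') * b(s)
    new_b = {
        s2: Z[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in S)
        for s2 in S
    }
    # The normalising constant is P(o | a, b).
    p_o = sum(new_b.values())
    if p_o == 0.0:
        raise ValueError("observation o has probability zero under belief b and action a")
    return {s2: new_b[s2] / p_o for s2 in S}

# Example: start from a uniform belief, act, observe, and update.
b0 = {"s0": 0.5, "s1": 0.5}
b1 = update_belief(b0, "go", "beep", S, T, Z)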
A belief-state MDP

A belief-state MDP is a continuous-space MDP, completely specified with ingredients from the POMDP. It contains:
- A set of states b ∈ B, where B is the space of probability distributions over S
- A set of actions a ∈ A; the same as in the original POMDP
- A transition function T': B × A × B → [0,1], with

      T'(b, a, b') = P(b' | b, a) = Σ_{o∈O} P(b' | a, b, o) · P(o | a, b)

  where P(b' | a, b, o) is zero or one (one exactly if b' is the updated belief b^{a,o}), and

      P(o | a, b) = Σ_{s'∈S} Z(s', a, o) · Σ_{s∈S} T(s, a, s') · b(s)

- A reward function R': B × A → ℝ, with

      R'(b, a) = Σ_{s∈S} b(s) · R(s, a)

where S, T, R and Z are defined by the corresponding POMDP.
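A sketch of these belief-state MDP ingredients, again in terms of the earlier illustrative structures and the update_belief function above: P(o | a, b), the belief reward R'(b, a), and the successor beliefs per observation are all computed directly from the POMDP's T, R and Z.

# Sketch of the belief-state MDP ingredients defined above.

def prob_obs(b, a, o, S, T, Z):
    # P(o | a, b) = sum_{s'} Z(s', a, o) * sum_s T(s, a, s') * b(s)
    return sum(
        Z[(s2, a)][o] * sum(T[(s, a)][s2] * b[s] for s in S)
        for s2 in S
    )

def belief_reward(b, a, S, R):
    # R'(b, a) = sum_s b(s) * R(s, a)
    return sum(b[s] * R[(s, a)] for s in S)

def belief_successors(b, a, S, O, T, Z):
    # T'(b, a, b') puts probability prob_obs(b, a, o) on each successor belief
    # b' = update_belief(b, a, o); given (b, a, o) the successor is unique,
    # which is the "zero or one" factor on the slide.
    return {o: (prob_obs(b, a, o, S, T, Z), update_belief(b, a, o, S, T, Z))
            for o in O}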
Belief state and Markov property

The process of maintaining the belief state is Markovian: for any belief state, the successor belief state depends only on the action and observation.

[Figure: tree of belief states over the interval from P(s0) = 0 to P(s0) = 1, branching on actions a1, a2 and observations o1, o2]
Solving the POMDP

[Figure: control loop in which a belief-state register is updated via P(b' | b, a, o) after each action a and observation o, and a policy maps the current belief b to the next action]

- Update the belief state after each action and observation
- The policy maps a belief state to an action
- The policy is found by solving the belief-state MDP
Solving a belief-state MDP

To solve the continuous-space MDP, use the value iteration algorithm... after some adaptations to cope with the continuous space:
- We cannot find a state's new value by looping over all the possible (= infinitely many) next states
- Representing the value function in tabular form is not possible; fortunately, the POMDP restrictions cause the finite-horizon value function to be piecewise linear and convex, so it can be represented by a finite set of vectors
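As a small aside, here is a sketch of what the piecewise linear and convex property buys: a finite-horizon value function can be stored as a finite set of vectors (commonly called alpha-vectors, a term not used explicitly on these slides), one value per state, and evaluated at any belief as a maximum of dot products. The two vectors below are invented purely for illustration.

# Sketch of a piecewise linear and convex value function over beliefs.

Gamma = [
    {"s0": 1.0, "s1": 0.0},   # vector of state values for one conditional plan
    {"s0": 0.2, "s1": 0.8},   # vector of state values for another
]

def pomdp_value(b, Gamma):
    # V(b) = max over vectors alpha in Gamma of sum_s alpha(s) * b(s)
    return max(sum(alpha[s] * b[s] for s in alpha) for alpha in Gamma)

# Example usage: pomdp_value({"s0": 0.5, "s1": 0.5}, Gamma)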
Conclusions

- MDPs have some efficient solution methods, but require fully observable states
- POMDPs are used in more realistic settings, but require sophisticated solution methods
Let's learn!

We now know how to plan, if we have a fully specified (PO)MDP. But what if we don't...?