Statistical Filtering and Control for AI and Robotics
Exploration and Information Gathering
Alessandro Farinelli
Outline
• POMDPs
– The POMDP model
– Finite world POMDP algorithm
– Point-based value iteration
• Exploration
– Information gain
– Exploration in occupancy grid maps
– Extension to MRS
• Acknowledgment: material based on
– Thrun, Burgard, Fox; Probabilistic Robotics
POMDPs
• In POMDPs we apply the same idea as in MDPs.
• Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
• Let b be the belief of the agent about the state under consideration.
• POMDPs compute a value function over belief space:
$$V_T(b) = \max_u \left[\, r(b,u) + \int_{b'} V_{T-1}(b')\, p(b' \mid u, b)\, db' \,\right]$$
Problems
• Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
• This is problematic, since probability distributions are continuous.
• Additionally, we have to deal with the huge complexity of belief spaces.
• For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
– This is possible because the expectation is a linear operator.
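As a concrete illustration of the piecewise linear representation, here is a minimal sketch (in Python, not from the book) that stores a value function over a two-state belief as a set of linear components and evaluates it by taking the maximum; names and numbers are illustrative.

```python
# Minimal sketch: a piecewise linear, convex value function over a two-state
# belief stored as a list of linear components. Each pair (a, b) represents
# the line a*p1 + b*(1 - p1), where p1 = p(x1).

def evaluate(components, p1):
    """V(b) is the maximum over all linear components at the belief [p1, 1-p1]."""
    return max(a * p1 + b * (1.0 - p1) for a, b in components)

# Two example components (they anticipate the payoff lines used later on).
V = [(-100.0, 100.0), (100.0, -50.0)]
print(evaluate(V, 0.5))   # -> 25.0
```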
Example
• Two states x1, x2; actions u1, u2 (terminal) and u3 (state transition); measurements z1, z2.
• Sensor model: p(z1|x1) = 0.7, p(z2|x1) = 0.3; p(z1|x2) = 0.3, p(z2|x2) = 0.7.
• Transition model for u3: p(x1'|x1,u3) = 0.2, p(x1'|x2,u3) = 0.8.
• Payoffs: r(x1,u1) = -100, r(x2,u1) = +100, r(x1,u2) = +100, r(x2,u2) = -50.
Discussion on the example
• The two states have different optimal actions:
– u2 in x1 and u1 in x2
• Action u3 is non-deterministic: it flips the state (with probability 0.8) and acquires knowledge at a small cost.
– z1 increases confidence of being in x1
– z2 increases confidence of being in x2
– the cost is -1 (see later)
• With two states, the belief is summarized by p1 = p(x1); p(x2) = 1 - p1, with p1 ∈ [0, 1].
Payoff in POMDPs
• In MDPs, the payoff (or reward) depends on the state of the system.
• In POMDPs the true state is not exactly known.
• Therefore, we compute the expected payoff by integrating over all states:
$$r(b,u) = E_x\big[r(x,u)\big] = \int_{x'} r(x',u)\, p(x')\, dx' = p_1\, r(x_1,u) + p_2\, r(x_2,u)$$
Payoffs in the example I
• If we are in x1 and execute u1 we receive -100.
• If we are in x2 and execute u1 we receive +100.
• When we are not certain of the state we have a linear combination weighted with the probabilities:
$$r(b,u_1) = -100\, p_1 + 100\,(1-p_1)$$
$$r(b,u_2) = 100\, p_1 - 50\,(1-p_1)$$
$$r(b,u_3) = -1$$
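A small sketch of this computation, assuming the payoff numbers above (the function name is illustrative):

```python
# Expected payoff r(b,u) = p1 * r(x1,u) + (1 - p1) * r(x2,u)
# for the two-state example of the slides.

def expected_payoff(p1, r_x1, r_x2):
    return p1 * r_x1 + (1.0 - p1) * r_x2

p1 = 0.3
print(expected_payoff(p1, -100.0, 100.0))  # r(b, u1) = 40.0
print(expected_payoff(p1,  100.0, -50.0))  # r(b, u2) = -5.0
# r(b, u3) = -1 independently of the belief
```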
Payoffs in the example II
• (Figure: r(b,u1), r(b,u2), and r(b,u3) plotted as linear functions of p1.)
The resulting policy for T=1
• Finite POMDP with T=1: use V1(b) to determine the optimal policy.
– Choose the best next action among u1, u2, u3.
• In our example, the optimal policy for T=1 is
$$\pi_1(b) = \begin{cases} u_1 & \text{if } p_1 \le \tfrac{3}{7} \\ u_2 & \text{if } p_1 > \tfrac{3}{7} \end{cases}$$
• This is the upper thick graph in the diagram.
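One way to read the T=1 policy off the value function is to tag each linear component with the action that generated it and pick the action of the maximizing component; a minimal sketch under the example's payoffs:

```python
# Each component carries its action; the greedy policy returns the action of
# the component that attains the maximum at the current belief p1.

V1 = [("u1", -100.0, 100.0), ("u2", 100.0, -50.0), ("u3", -1.0, -1.0)]

def policy(p1):
    return max(V1, key=lambda c: c[1] * p1 + c[2] * (1.0 - p1))[0]

print(policy(0.3), policy(0.6))   # -> u1 u2 (the switch happens at p1 = 3/7)
```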
Piecewise linearity and convexity
• The resulting value function V1(b) is the maximum of the three payoff functions at each point:
$$V_1(b) = \max\left\{\, -100\, p_1 + 100\,(1-p_1),\;\; 100\, p_1 - 50\,(1-p_1),\;\; -1 \,\right\}$$
• It is piecewise linear and convex.
Pruning
• Only the first two components contribute; the third component can be pruned away from V1(b).
• Pruning is crucial for an efficient solution approach:
$$V_1(b) = \max\left\{\, -100\, p_1 + 100\,(1-p_1),\;\; 100\, p_1 - 50\,(1-p_1) \,\right\}$$
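A simple (approximate) way to prune in this two-state example is to keep only the components that attain the maximum somewhere on a dense grid of beliefs; exact pruning is normally formulated as a linear program. A sketch:

```python
# Approximate pruning: keep a component only if it is the maximum at some
# belief on a fine grid over p1 in [0, 1].

def prune(components, resolution=1000):
    keep = set()
    for i in range(resolution + 1):
        p1 = i / resolution
        values = [a * p1 + b * (1.0 - p1) for a, b in components]
        keep.add(max(range(len(components)), key=lambda k: values[k]))
    return [components[k] for k in sorted(keep)]

V1 = [(-100.0, 100.0), (100.0, -50.0), (-1.0, -1.0)]
print(prune(V1))   # the constant -1 component never wins and is dropped
```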
Increasing the time horizon
• Assume the robot can make an observation before acting.
• Sensing will provide a better belief, but how much better?
• (Figure: V1(b).)
Sensing
• Suppose the robot perceives z1.
• Recall: p(z1|x1) = 0.7 and p(z1|x2) = 0.3.
• Given the observation z1 we update the belief using Bayes rule:
$$p_1' = p(x_1 \mid z_1) = \frac{p(z_1 \mid x_1)\, p(x_1)}{p(z_1)} = \frac{0.7\, p_1}{p(z_1)}$$
$$p_2' = p(x_2 \mid z_1) = \frac{p(z_1 \mid x_2)\, p(x_2)}{p(z_1)} = \frac{0.3\,(1-p_1)}{p(z_1)}$$
Value function considering z1
• (Figure: V1(b) is projected through the belief update b' = p(x1|z1) to obtain V1(b|z1).)
Computing the new value function
• Suppose the robot perceives z1.
• We update the belief using Bayes rule.
• We can compute V1(b|z1) by replacing p1 with p1':
$$V_1(b \mid z_1) = \max\left\{\, -100\,\frac{0.7\, p_1}{p(z_1)} + 100\,\frac{0.3\,(1-p_1)}{p(z_1)},\;\; 100\,\frac{0.7\, p_1}{p(z_1)} - 50\,\frac{0.3\,(1-p_1)}{p(z_1)} \,\right\}$$
$$= \frac{1}{p(z_1)}\, \max\left\{\, -70\, p_1 + 30\,(1-p_1),\;\; 70\, p_1 - 15\,(1-p_1) \,\right\}$$
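The belief update and the re-evaluation of V1 at the posterior can be sketched as follows (sensor numbers from the example; function names are illustrative):

```python
# Bayes update of the two-state belief after observing z, followed by an
# evaluation of V1 at the posterior belief.

def belief_update(p1, p_z_x1, p_z_x2):
    """Return (posterior p1', normalizer p(z)) for an observation z."""
    p_z = p_z_x1 * p1 + p_z_x2 * (1.0 - p1)
    return p_z_x1 * p1 / p_z, p_z

def evaluate(components, p1):
    return max(a * p1 + b * (1.0 - p1) for a, b in components)

V1 = [(-100.0, 100.0), (100.0, -50.0)]
p1_post, p_z1 = belief_update(0.6, 0.7, 0.3)   # observe z1 with p(z1|x1) = 0.7
print(p1_post, evaluate(V1, p1_post))          # V1(b|z1) at the posterior
```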
Expected value after measuring
• We do not know in advance what the next measurement will be.
• We need to compute the expectation over measurements; note that the p(z_i) factors cancel the normalizers of the updated beliefs:
$$\bar{V}_1(b) = E_z\big[V_1(b \mid z)\big] = \sum_{i=1}^{2} p(z_i)\, V_1(b \mid z_i)$$
$$= \max\left\{\, -70\, p_1 + 30\,(1-p_1),\;\; 70\, p_1 - 15\,(1-p_1) \,\right\} + \max\left\{\, -30\, p_1 + 70\,(1-p_1),\;\; 30\, p_1 - 35\,(1-p_1) \,\right\}$$
Resulting value function
• We need to consider the four possible combinations (one component per measurement) and take the max.
• As before, we can prune dominated components:
$$\bar{V}_1(b) = \max\left\{ \begin{array}{l} \big(-70\, p_1 + 30\,(1-p_1)\big) + \big(-30\, p_1 + 70\,(1-p_1)\big) \\ \big(-70\, p_1 + 30\,(1-p_1)\big) + \big(30\, p_1 - 35\,(1-p_1)\big) \\ \big(70\, p_1 - 15\,(1-p_1)\big) + \big(-30\, p_1 + 70\,(1-p_1)\big) \\ \big(70\, p_1 - 15\,(1-p_1)\big) + \big(30\, p_1 - 35\,(1-p_1)\big) \end{array} \right\}$$
$$= \max\left\{\, -100\, p_1 + 100\,(1-p_1),\;\; 40\, p_1 + 55\,(1-p_1),\;\; 100\, p_1 - 50\,(1-p_1) \,\right\}$$
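Since the p(z_i) terms cancel, the measurement backup can be implemented by combining one component per measurement, each re-weighted by the likelihoods; a sketch with the example's sensor model:

```python
# Measurement backup: every choice of one component per observation yields a
# new linear component whose coefficients are the likelihood-weighted sums.

def measurement_backup(V, sensor):
    """V: list of (a, b); sensor: list of (p(z|x1), p(z|x2)) per observation."""
    backed_up = [(0.0, 0.0)]
    for pz_x1, pz_x2 in sensor:
        backed_up = [(a + ai * pz_x1, b + bi * pz_x2)
                     for (a, b) in backed_up
                     for (ai, bi) in V]
    return backed_up

V1 = [(-100.0, 100.0), (100.0, -50.0)]
sensor = [(0.7, 0.3), (0.3, 0.7)]          # z1 and z2
print(measurement_backup(V1, sensor))
# -> the four combinations, including (-100, 100), (40, 55), (100, -50)
```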
Value function considering sensing
• (Figure: p(z1) V1(b|z1) and p(z2) V1(b|z2); regions where u1 is optimal, u2 is optimal, or the choice is unclear.)
State transition
• We need to consider how actions affect the state.
• In our case u1 and u2 lead to final states and are deterministic.
• u3 has a non-deterministic effect on the state:
$$p_1' = E\big[p(x_1' \mid x, u_3)\big] = \sum_{i=1}^{2} p(x_1' \mid x_i, u_3)\, p_i = p(x_1' \mid x_1, u_3)\, p_1 + p(x_1' \mid x_2, u_3)\,(1-p_1)$$
$$= 0.2\, p_1 + 0.8\,(1-p_1) = 0.8 - 0.6\, p_1$$
State transition
• (Figure: the predicted belief p1' = 0.8 - 0.6 p1 plotted against p1.)
Resulting value function after u3
• Considering the state transition, we can compute V̄1(b|u3) by substituting p1' for p1:
$$\bar{V}_1(b \mid u_3) = \max\left\{\, -100\, p_1' + 100\,(1-p_1'),\;\; 40\, p_1' + 55\,(1-p_1'),\;\; 100\, p_1' - 50\,(1-p_1') \,\right\}$$
$$= \max\left\{\, 60\, p_1 - 60\,(1-p_1),\;\; 52\, p_1 + 43\,(1-p_1),\;\; -20\, p_1 + 70\,(1-p_1) \,\right\}$$
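The prediction step under u3 maps each linear component to another linear component by substituting the transition probabilities; a sketch with the example's transition model:

```python
# Prediction under u3: substituting p1' = 0.2*p1 + 0.8*(1-p1) into a component
# a*p1' + b*(1-p1') gives a new linear component in the current belief p1.

def transition_backup(V, t_x1, t_x2):
    """t_x1 = p(x1'|x1,u), t_x2 = p(x1'|x2,u)."""
    return [(a * t_x1 + b * (1.0 - t_x1),
             a * t_x2 + b * (1.0 - t_x2)) for a, b in V]

V1_bar = [(-100.0, 100.0), (40.0, 55.0), (100.0, -50.0)]
print(transition_backup(V1_bar, 0.2, 0.8))
# -> [(60, -60), (52, 43), (-20, 70)], matching the slide
```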
Value function considering u3
• (Figure: the value function before and after projecting through u3; regions where u1 is optimal, u2 is optimal, or the choice is unclear.)
Resulting value function for T=2
• At T=2 the robot can execute any of the three actions u1, u2, u3.
• We need to account for the cost of u3 (-1) in its backed-up components:
$$V_2(b) = \max\left\{ \begin{array}{l} -100\, p_1 + 100\,(1-p_1) \\ 100\, p_1 - 50\,(1-p_1) \\ 59\, p_1 - 61\,(1-p_1) \\ 51\, p_1 + 42\,(1-p_1) \\ -21\, p_1 + 69\,(1-p_1) \end{array} \right\}$$
• After pruning the dominated components:
$$V_2(b) = \max\left\{\, -100\, p_1 + 100\,(1-p_1),\;\; 100\, p_1 - 50\,(1-p_1),\;\; 51\, p_1 + 42\,(1-p_1) \,\right\}$$
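Putting the pieces together, the T=2 value function can be assembled by shifting the u3 components by the cost -1, adding the immediate payoff lines for u1 and u2, and pruning; a sketch that reuses the grid-based pruning from above (redefined here so the snippet is self-contained):

```python
# Assemble V2(b) from the backed-up u3 components and the immediate payoffs.

def prune(components, resolution=1000):
    keep = set()
    for i in range(resolution + 1):
        p1 = i / resolution
        vals = [a * p1 + b * (1.0 - p1) for a, b in components]
        keep.add(max(range(len(components)), key=lambda k: vals[k]))
    return [components[k] for k in sorted(keep)]

u3_components = [(60.0, -60.0), (52.0, 43.0), (-20.0, 70.0)]
V2 = [(a - 1.0, b - 1.0) for a, b in u3_components]   # add the cost of u3
V2 += [(-100.0, 100.0), (100.0, -50.0)]               # immediate u1 and u2
print(prune(V2))
# -> three components survive: (51, 42), (-100, 100), (100, -50)
```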
Graphical representation of V2(b)
• (Figure: a region where u2 is optimal, a region where u1 is optimal, and an unclear region where the outcome of the measurement is important.)
Deep horizons and pruning
• We have now completed a full backup in belief space.
• This process can be applied recursively (see the sketch below).
• The value functions for T=10 and T=20 are shown in the figures.
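For deeper horizons, the same backup can be applied in a loop; the sketch below combines the measurement backup, the prediction under u3 with its cost, the immediate payoffs for u1 and u2, and pruning after every step (all numbers are the ones of this example):

```python
# Recursive backup for the two-state example, with pruning at every step.

def prune(V, resolution=1000):
    keep = set()
    for i in range(resolution + 1):
        p1 = i / resolution
        vals = [a * p1 + b * (1.0 - p1) for a, b in V]
        keep.add(max(range(len(V)), key=lambda k: vals[k]))
    return [V[k] for k in sorted(keep)]

def backup(V):
    # 1. measurement backup with p(z1|x1) = 0.7, p(z1|x2) = 0.3 (and symmetric z2)
    measured = [(0.0, 0.0)]
    for pz_x1, pz_x2 in [(0.7, 0.3), (0.3, 0.7)]:
        measured = [(a + ai * pz_x1, b + bi * pz_x2)
                    for (a, b) in measured for (ai, bi) in V]
    measured = prune(measured)
    # 2. prediction under u3 (p(x1'|x1) = 0.2, p(x1'|x2) = 0.8) plus its cost of -1
    u3 = [(a * 0.2 + b * 0.8 - 1.0, a * 0.8 + b * 0.2 - 1.0)
          for a, b in measured]
    # 3. u1 and u2 are terminal and contribute their immediate payoffs
    return prune(u3 + [(-100.0, 100.0), (100.0, -50.0)])

V = [(-100.0, 100.0), (100.0, -50.0), (-1.0, -1.0)]   # V1 before pruning
for horizon in range(2, 21):
    V = backup(V)
print(len(V))   # number of linear components that survive at T = 20
```

Without the pruning step inside `backup`, the number of linear components would grow very quickly with the horizon, which is why pruning matters for deep horizons.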
Importance of pruning
• (Figure: the value functions $\bar{V}_1(b)$, $V_1(b)$, and $V_2(b)$.)