AI-based Mobile Robotics Planning and Control: Markov Decision Processes (CSE-571 PowerPoint presentation)



  1. CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes

  2. Planning. Dimensions along which planning problems vary: environment static vs. dynamic; outcomes predictable vs. unpredictable; fully vs. partially observable; discrete vs. continuous; actions deterministic vs. stochastic; percepts perfect vs. noisy; goal satisfaction full vs. partial. The central question: what action next?

  3. Classical Planning: static environment, predictable outcomes, fully observable, discrete, deterministic actions, perfect percepts, full goal satisfaction. What action next?

  4. Stochastic Planning: static environment, unpredictable outcomes, fully observable, discrete, stochastic actions, perfect percepts, full goal satisfaction. What action next?

  5. Deterministic, fully observable

  6. Stochastic, fully observable

  7. Stochastic, partially observable

  8. Markov Decision Process (MDP): S: a set of states; A: a set of actions; Pr(s'|s,a): transition model; C(s,a,s'): cost model; G: set of goals; s0: start state; γ: discount factor; R(s,a,s'): reward model.
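
A minimal sketch of how such an MDP tuple could be written down in Python; the three states, transition probabilities, and cost/reward numbers below are hypothetical illustrations, not taken from the slides.

    # Hypothetical, minimal MDP in plain Python data structures.
    states  = ["s0", "s1", "s2"]
    actions = ["left", "right"]
    gamma   = 0.9                      # discount factor
    goals   = {"s2"}                   # set of goals G
    start   = "s0"                     # start state s0

    # Transition model Pr(s'|s,a): (s, a) -> {s': probability}
    P = {
        ("s0", "right"): {"s1": 0.8, "s0": 0.2},
        ("s0", "left"):  {"s0": 1.0},
        ("s1", "right"): {"s2": 0.9, "s1": 0.1},
        ("s1", "left"):  {"s0": 1.0},
    }

    # Cost model C(s,a,s') and reward model R(s,a,s').
    def C(s, a, s_next):
        return 1.0                     # unit cost per step

    def R(s, a, s_next):
        return 10.0 if s_next in goals else 0.0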

  9. Role of Discount Factor (γ): keeps the total reward/total cost finite; useful for infinite-horizon problems, and sometimes for indefinite horizons if there are dead ends. Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ·r2 + γ²·r3 + … Total cost: c1 + γ·c2 + γ²·c3 + …
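
For example, the total-reward formula can be checked directly on a short, hypothetical reward sequence:

    gamma   = 0.9
    rewards = [1.0, 1.0, 1.0, 1.0]                               # r1, r2, r3, r4 (hypothetical)
    total   = sum(gamma**t * r for t, r in enumerate(rewards))   # r1 + γ·r2 + γ²·r3 + ...
    print(total)                                                 # ≈ 3.439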

  10. Objective of a Fully Observable MDP: find a policy π: S → A which optimises, i.e. minimises expected cost to reach a goal, maximises expected reward, or maximises expected (reward - cost), in discounted or undiscounted form, given a ___ horizon (finite, infinite, or indefinite), assuming full observability.

  11. Examples of MDPs: Goal-directed, indefinite-horizon, cost-minimisation MDP: <S, A, Pr, C, G, s0>. Infinite-horizon, discounted reward-maximisation MDP: <S, A, Pr, R, γ>, with reward = Σ_t γ^t·r_t. Goal-directed, finite-horizon, probability-maximisation MDP: <S, A, Pr, G, s0, T>.

  12. Bellman Equations for MDP 1: <S, A, Pr, C, G, s0>. Define J*(s) (optimal cost) as the minimum expected cost to reach a goal from this state. J* should satisfy: J*(s) = 0 if s ∈ G, and otherwise J*(s) = min_{a ∈ Ap(s)} Q*(s,a), where Q*(s,a) = Σ_{s'} Pr(s'|s,a) [C(s,a,s') + J*(s')].

  13. Bellman Equations for MDP 2: <S, A, Pr, R, s0, γ>. Define V*(s) (optimal value) as the maximum expected discounted reward from this state. V* should satisfy: V*(s) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [R(s,a,s') + γ·V*(s')].

  14. Bellman Backup: given an estimate of the V* function (say V_n), back up the V_n function at state s to calculate a new estimate: V_{n+1}(s) = max_{a ∈ Ap(s)} Q_{n+1}(s,a). Here Q_{n+1}(s,a) is the value/cost of the strategy: execute action a in s, then execute π_n subsequently, where π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s,a) (the greedy action).
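
A minimal sketch of one Bellman backup at a single state, written for the reward-maximisation form; the dictionary-based transition model P and the names used here are hypothetical conventions for illustration, not anything fixed by the slides.

    # One Bellman backup at state s:
    #   Q_{n+1}(s,a) = sum_{s'} Pr(s'|s,a) * (R(s,a,s') + gamma * V_n(s'))
    #   V_{n+1}(s)   = max_a Q_{n+1}(s,a)
    def bellman_backup(s, V, P, R, gamma, actions):
        q = {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
             for a in actions if (s, a) in P}
        best_a = max(q, key=q.get)                   # greedy action
        return q[best_a], best_a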

  15. Bellman Backup, worked example (figure): for state s0 with successor values V0(s2) = 2 and V0(s3) = 3, the backed-up Q-values are Q1(s,a1) = 20 + 5 = 25, Q1(s,a2) = 20 + 0.9×2 + 0.1×3 = 22.1, and Q1(s,a3) = 4 + 3 = 7; hence V1 = max = 25 and the greedy action is a1.

  16. Value Iteration [Bellman '57]: assign an arbitrary value V0 to each non-goal state; repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s (iteration n+1); until max_s |V_{n+1}(s) − V_n(s)| < ε (the residual at s). This stopping test is ε-convergence.
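
Putting the backup into the loop gives the following value-iteration sketch, assuming the same dictionary-based model conventions as in the earlier sketches; it is an illustration, not the course code.

    # Value iteration: arbitrary V_0, then Bellman backups until the residual drops below eps.
    def value_iteration(states, actions, P, R, gamma, eps=1e-4):
        V = {s: 0.0 for s in states}                 # arbitrary initial estimate V_0
        while True:
            V_new, residual = {}, 0.0
            for s in states:
                q = [sum(p * (R(s, a, s2) + gamma * V[s2])
                         for s2, p in P[(s, a)].items())
                     for a in actions if (s, a) in P]
                V_new[s] = max(q) if q else 0.0      # states with no applicable actions keep value 0
                residual = max(residual, abs(V_new[s] - V[s]))
            V = V_new
            if residual < eps:                       # eps-convergence
                return V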

  17. Complexity of Value Iteration: one iteration takes O(|A|·|S|²) time; the number of iterations required is poly(|S|, |A|, 1/(1−γ)). Overall the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.

  18. Policy Computation: the optimal policy is stationary and time-independent for infinite/indefinite-horizon problems. Policy evaluation: a system of linear equations in |S| variables.
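
Written out, policy evaluation for a fixed policy π solves V^π = R^π + γ·P^π·V^π, i.e. (I − γ·P^π)·V^π = R^π. A sketch with NumPy, where the matrix P_pi and vector R_pi are assumed to have been assembled for the given policy:

    import numpy as np

    # Evaluate a fixed policy pi by solving the linear system (I - gamma * P_pi) V = R_pi,
    # where P_pi[i, j] = Pr(s_j | s_i, pi(s_i)) and R_pi[i] is the expected reward in s_i under pi.
    def policy_evaluation(P_pi, R_pi, gamma):
        n = len(R_pi)
        return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)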

  19. Markov Decision Process (MDP), example (figure): a five-state MDP (s1 … s5) with rewards 20, 1, 0, 0, and −10 on its states and stochastic transitions between them.

  20. Value Function and Policy  Value residual and policy residual

  21. Changing the Search Space  Value Iteration • Search in value space • Compute the resulting policy  Policy Iteration [Howard’60] • Search in policy space • Compute the resulting value

  22. Policy Iteration [Howard '60]: assign an arbitrary policy π0 to each state; repeat: compute V_{n+1}, the evaluation of π_n (costly: O(n³)); for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a); until π_{n+1} = π_n. Modified Policy Iteration: approximate the evaluation step by value iteration using the fixed policy. Advantage: searching in a finite (policy) space as opposed to an uncountably infinite (value) space ⇒ faster convergence; all other properties follow.
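
A corresponding policy-iteration sketch, alternating exact evaluation (the costly O(n³) linear solve) with greedy improvement; the array layout (P indexed by action, state, next state; R by action and state) is an assumption made for illustration.

    import numpy as np

    # Policy iteration: evaluate pi_n exactly, improve greedily, stop when the policy is stable.
    # P has shape (n_actions, n_states, n_states); R has shape (n_actions, n_states).
    def policy_iteration(P, R, gamma):
        n_actions, n_states, _ = P.shape
        pi = np.zeros(n_states, dtype=int)                   # arbitrary initial policy pi_0
        while True:
            P_pi = P[pi, np.arange(n_states), :]             # transitions under pi
            R_pi = R[pi, np.arange(n_states)]
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)   # costly O(n^3) step
            Q = R + gamma * P @ V                            # Q[a, s] for all actions and states
            pi_new = Q.argmax(axis=0)                        # greedy improvement
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new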

  23. LP Formulation: minimise Σ_{s ∈ S} V*(s) subject to the constraints, for every s and a: V*(s) ≥ R(s) + γ·Σ_{s' ∈ S} Pr(s'|s,a)·V*(s'). This is a big LP, so other tricks are used to solve it.
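
A sketch of that LP with scipy.optimize.linprog: each constraint V(s) ≥ R(s) + γ·Σ_{s'} Pr(s'|s,a)·V(s') is rearranged into the standard form (γ·P_a − I)·V ≤ −R. P uses the same assumed (action, state, next state) layout as above, while R here is a per-state reward vector as on the slide.

    import numpy as np
    from scipy.optimize import linprog

    # Solve: minimise sum_s V(s)  s.t.  V(s) >= R(s) + gamma * sum_s' Pr(s'|s,a) * V(s').
    # P has shape (n_actions, n_states, n_states); R has shape (n_states,).
    def solve_mdp_lp(P, R, gamma):
        n_actions, n_states, _ = P.shape
        c = np.ones(n_states)                                # objective: minimise the sum of V
        A_ub = np.vstack([gamma * P[a] - np.eye(n_states) for a in range(n_actions)])
        b_ub = np.tile(-R, n_actions)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
        return res.x                                         # the optimal value function V*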

  24. Hybrid MDPs: a hybrid Markov decision process has Markov state (n, x), where n is the discrete component and x is the continuous component (a set of continuous fluents). Bellman's equation: V_n^{t+1}(x) = max_{a ∈ A(x)} Σ_{n' ∈ N} Pr(n'|n,x,a) ∫_X Pr(x'|n,x,a,n') [R_{n'}(x') + V_{n'}^t(x')] dx'.

  25. Hybrid MDPs: Bellman's equation as on the previous slide.

  26. Convolutions: discrete ⊗ discrete; constant ⊗ discrete [Feng et al. '04]; constant ⊗ constant [Li & Littman '05].

  27. Result of convolutions (probability density function ⊗ value function):
      value function:   discrete    constant    linear
      discrete pdf:     discrete    constant    linear
      constant pdf:     constant    linear      quadratic
      linear pdf:       linear      quadratic   cubic

  28. Value Iteration for Motion Planning (assumes knowledge of robot’s location)

  29. Frontier-based Exploration • Every unknown location is a target point.

  30. Manipulator Control Arm with two joints Configuration space

  31. Manipulator Control Path State space Configuration space

  32. Manipulator Control Path State space Configuration space

  33. Collision Avoidance via Planning  Potential field methods have local minima  Perform efficient path planning in the local perceptual space  Path costs depend on length and closeness to obstacles [Konolige, Gradient method]

  34. Paths and Costs: a path is a list of points P = {p1, p2, …, pk}, where pk is the only point in the goal set. The cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next: F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}). The adjacency cost is typically Euclidean distance; the intrinsic cost is typically occupancy or distance to the nearest obstacle.
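
A direct transcription of this cost; the Euclidean adjacency cost is the typical choice named on the slide, while the intrinsic-cost function is left as a caller-supplied (hypothetical) argument.

    import math

    # F(P) = sum_i I(p_i) + sum_i A(p_i, p_{i+1}) for a path P = [p_1, ..., p_k].
    def path_cost(path, intrinsic_cost):
        intrinsic = sum(intrinsic_cost(p) for p in path)                    # I terms
        adjacency = sum(math.dist(p, q) for p, q in zip(path, path[1:]))    # A terms (Euclidean)
        return intrinsic + adjacency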

  35. Navigation Function: an assignment of a potential-field value to every element in configuration space [Latombe, 91]. The goal set is always downhill, so there are no local minima. The navigation function of a point is the cost of the minimal-cost path that starts at that point: N_k = min_{P_k} F(P_k), minimising over paths P_k starting at p_k.

  36. Computation of Navigation Function. Initialization: points in the goal set get cost 0; all other points get infinite cost; the active list is initialised to the goal set. Repeat: take a point from the active list and update its neighbors; if a neighbor's cost changes, add it to the active list. Stop when the active list is empty.
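
A sketch of that computation on a 4-connected grid with unit adjacency costs; the occupancy-grid input format is a hypothetical choice made for illustration.

    from collections import deque

    # Navigation function on a 4-connected grid: zero cost on goal cells, +1 per step elsewhere.
    # free[r][c] is True for free cells, False for obstacles; goals is a set of (r, c) cells.
    def navigation_function(free, goals):
        rows, cols = len(free), len(free[0])
        N = {(r, c): float("inf") for r in range(rows) for c in range(cols)}
        for g in goals:
            N[g] = 0.0                                   # points in the goal set get cost 0
        active = deque(goals)                            # active list starts as the goal set
        while active:                                    # repeat until the active list is empty
            r, c = active.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < rows and 0 <= nc < cols and free[nr][nc]:
                    if N[(r, c)] + 1.0 < N[(nr, nc)]:    # cost changed: update and re-activate
                        N[(nr, nc)] = N[(r, c)] + 1.0
                        active.append((nr, nc))
        return N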

  37. Challenges: Where do we get the state space from? Where do we get the model from? What happens when the world is slightly different? Where does the reward come from? Continuous state variables. Continuous action spaces.

  38. How to solve larger problems? If the problem is deterministic, use Dijkstra's algorithm. If there are no back-edges, use backward Bellman updates. Prioritize Bellman updates to maximize information flow. If the initial state is known, use dynamic programming plus heuristic search (LAO*, RTDP, and variants). Divide an MDP into sub-MDPs and solve the hierarchy. Aggregate states with similar values. Relational MDPs.

  39. Approximations: n-step lookahead. For n = 1 (greedy): π1(s) = argmax_a R(s,a). For n-step lookahead: π_n(s) = argmax_a Q_n(s,a), i.e. choose the action whose value estimate, obtained by looking ahead n steps, is largest.

  40. Approximation: Incremental Approaches. Pipeline: deterministic relaxation → deterministic planner → plan → stochastic simulation → identify weaknesses → solve/merge.

  41. Approximations: Planning and Replanning. Pipeline: deterministic relaxation → deterministic planner → plan → execute the action → send the state reached back to the planner.

  42. CSE-571 AI-based Mobile Robotics Planning and Control: (1) Reinforcement Learning, (2) Partially Observable Markov Decision Processes

  43. Reinforcement Learning: we still have an MDP and are still looking for a policy π. The new twist: we don't know Pr and/or R, i.e. we don't know which states are good or what the actions do, so we must actually try out actions and states to learn.

  44. Model-based methods: visit different states and perform different actions; estimate Pr and R; once the model is built, plan using value iteration or other methods. Cons: requires huge amounts of data.
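
A sketch of the estimation step: count transitions and rewards from experience and normalise them into estimates of Pr and R; the (s, a, r, s') tuple format of the experience log is an assumption made for illustration.

    from collections import defaultdict

    # Estimate Pr(s'|s,a) and R(s,a) from experience tuples (s, a, r, s').
    def estimate_model(experience):
        counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
        reward_sum = defaultdict(float)
        visits = defaultdict(int)
        for s, a, r, s2 in experience:
            counts[(s, a)][s2] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
        P_hat = {sa: {s2: n / visits[sa] for s2, n in succ.items()}
                 for sa, succ in counts.items()}
        R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
        return P_hat, R_hat                              # feed these into value iteration, etc.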
