Chapter 16: Planning Based on Markov Decision Processes (Dana S. Nau)


  1. Lecture slides for Automated Planning: Theory and Practice
     Chapter 16: Planning Based on Markov Decision Processes
     Dana S. Nau, University of Maryland, February 29, 2012
     Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License: http://creativecommons.org/licenses/by-nc-sa/2.0/

  2. Motivation
     ● Until now, we've assumed that each action has only one possible outcome
       ◆ But often that's unrealistic
     ● In many situations, actions may have more than one possible outcome
       ◆ Action failures
         » e.g., gripper drops its load
       ◆ Exogenous events
         » e.g., road closed
     ● Would like to be able to plan in such situations
     ● One approach: Markov Decision Processes
     [Figure: the action grasp(c) on blocks a, b, c, contrasting the intended outcome with an unintended outcome]

  3. Stochastic Systems
     ● Stochastic system: a triple Σ = (S, A, P)
       ◆ S = finite set of states
       ◆ A = finite set of actions
       ◆ P_a(s′ | s) = probability of going to s′ if we execute a in s
       ◆ ∑_{s′ ∈ S} P_a(s′ | s) = 1
     ● Several different possible action representations
       ◆ e.g., Bayes networks, probabilistic operators
     ● The book does not commit to any particular representation
       ◆ It only deals with the underlying semantics
       ◆ Explicit enumeration of each P_a(s′ | s)
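As a concrete illustration of the explicit-enumeration semantics, here is one way the running robot example (slides 4–11) might be encoded in Python. This is a minimal sketch: the state and action names and the dictionary layout are assumptions of the sketch, and the transition probabilities are read off the worked examples on the later slides.

```python
# Explicit-enumeration encoding of a stochastic system Sigma = (S, A, P).
S = {"s1", "s2", "s3", "s4", "s5"}      # finite set of states

# P[a][s] maps each successor s' to P_a(s' | s).
P = {
    "move(r1,l1,l2)": {"s1": {"s2": 1.0}},
    "move(r1,l1,l4)": {"s1": {"s4": 0.5, "s1": 0.5}},
    "move(r1,l2,l1)": {"s2": {"s1": 1.0}},
    "move(r1,l2,l3)": {"s2": {"s3": 0.8, "s5": 0.2}},
    "move(r1,l3,l4)": {"s3": {"s4": 1.0}},
    "move(r1,l5,l4)": {"s5": {"s4": 1.0}},
    "wait":           {s: {s: 1.0} for s in S},  # wait is a self-loop everywhere
}

A = set(P)                              # finite set of actions

# Sanity check: successor probabilities sum to 1 for every defined (a, s).
for a, by_state in P.items():
    for s, succ in by_state.items():
        assert abs(sum(succ.values()) - 1.0) < 1e-9, (a, s)
```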

  4. Example
     ● Robot r1 starts at location l1
       ◆ State s1 in the diagram
     ● Objective is to get r1 to location l4
       ◆ State s4 in the diagram
     [Figure: five states s1–s5 connected by move actions, each with a wait self-loop; Start = s1, Goal = s4]

  5. Example
     ● Robot r1 starts at location l1 (state s1 in the diagram)
     ● Objective is to get r1 to location l4 (state s4 in the diagram)
     ● No classical plan (sequence of actions) can be a solution, because we can't guarantee we'll be in a state where the next action is applicable
       π = ⟨move(r1,l1,l2), move(r1,l2,l3), move(r1,l3,l4)⟩

  6. Policies
     ● Policy: a function that maps states into actions
       ◆ Write it as a set of state-action pairs
     π1 = { (s1, move(r1,l1,l2)),
            (s2, move(r1,l2,l3)),
            (s3, move(r1,l3,l4)),
            (s4, wait),
            (s5, wait) }
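Continuing the sketch from slide 3, the policy π1 can be written directly as a dictionary from states to actions, which mirrors the state-action-pair view on the slide (the names follow the hypothetical encoding above).

```python
# The policy pi_1 from the slide, as a mapping from states to actions.
pi1 = {
    "s1": "move(r1,l1,l2)",
    "s2": "move(r1,l2,l3)",
    "s3": "move(r1,l3,l4)",
    "s4": "wait",
    "s5": "wait",
}
```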

  7. Initial States
     ● For every state s, there will be a probability P(s) that the system starts in s
     ● The book assumes there's a unique state s0 such that the system always starts in s0
     ● In the example, s0 = s1
       ◆ P(s1) = 1
       ◆ P(s) = 0 for all s ≠ s1

  8. Histories
     ● History: a sequence of system states
       h = ⟨s0, s1, s2, s3, s4, …⟩
       h0 = ⟨s1, s3, s1, s3, s1, …⟩
       h1 = ⟨s1, s2, s3, s4, s4, …⟩
       h2 = ⟨s1, s2, s5, s5, s5, …⟩
       h3 = ⟨s1, s2, s5, s4, s4, …⟩
       h4 = ⟨s1, s4, s4, s4, s4, …⟩
       h5 = ⟨s1, s1, s4, s4, s4, …⟩
       h6 = ⟨s1, s1, s1, s4, s4, …⟩
       h7 = ⟨s1, s1, s1, s1, s1, …⟩
     ● Each policy induces a probability distribution over histories
       ◆ If h = ⟨s0, s1, …⟩ then P(h | π) = P(s0) ∏_{i≥0} P_{π(si)}(si+1 | si)
       ◆ The book omits the P(s0) factor because it assumes a unique starting state
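The formula is straightforward to turn into code. The sketch below computes P(h | π) over a finite prefix of a history; in this example the omitted infinite tail consists of wait self-loops whose factors are all 1. It reuses the hypothetical P, P0, and policy dictionaries from the earlier sketches.

```python
def history_probability(h, pi, P, P0):
    """P(h | pi) = P(s0) * prod_i P_{pi(s_i)}(s_{i+1} | s_i), over a finite prefix h."""
    prob = P0.get(h[0], 0.0)                        # initial-state probability P(s0)
    for s, s_next in zip(h, h[1:]):
        a = pi[s]                                   # action the policy picks in s
        prob *= P[a].get(s, {}).get(s_next, 0.0)    # P_a(s' | s); 0 if impossible
    return prob
```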

  9. Example
     π1 = { (s1, move(r1,l1,l2)),
            (s2, move(r1,l2,l3)),
            (s3, move(r1,l3,l4)),
            (s4, wait),
            (s5, wait) }
     h1 = ⟨s1, s2, s3, s4, s4, …⟩    P(h1 | π1) = 1 × 1 × 0.8 × 1 × … = 0.8   (reaches the goal)
     h2 = ⟨s1, s2, s5, s5, …⟩        P(h2 | π1) = 1 × 1 × 0.2 × 1 × … = 0.2
     P(h | π1) = 0 for all other h
     so π1 reaches the goal with probability 0.8
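Continuing the sketch, the same numbers can be reproduced by evaluating the finite prefixes of h1 and h2; the infinite s4/s5 tails contribute only factors of 1.

```python
P0 = {"s1": 1.0}                     # unique starting state s1

h1 = ["s1", "s2", "s3", "s4"]        # prefix of <s1, s2, s3, s4, s4, ...>
h2 = ["s1", "s2", "s5"]              # prefix of <s1, s2, s5, s5, ...>

print(history_probability(h1, pi1, P, P0))   # 0.8
print(history_probability(h2, pi1, P, P0))   # 0.2
```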

  10. Example
      π2 = { (s1, move(r1,l1,l2)),
             (s2, move(r1,l2,l3)),
             (s3, move(r1,l3,l4)),
             (s4, wait),
             (s5, move(r1,l5,l4)) }
      h1 = ⟨s1, s2, s3, s4, s4, …⟩    P(h1 | π2) = 1 × 0.8 × 1 × 1 × … = 0.8   (reaches the goal)
      h3 = ⟨s1, s2, s5, s4, s4, …⟩    P(h3 | π2) = 1 × 0.2 × 1 × 1 × … = 0.2   (reaches the goal)
      P(h | π2) = 0 for all other h
      so π2 reaches the goal with probability 1

  11. Example
      π3 = { (s1, move(r1,l1,l4)),
             (s2, move(r1,l2,l1)),
             (s3, move(r1,l3,l4)),
             (s4, wait),
             (s5, move(r1,l5,l4)) }
      h4 = ⟨s1, s4, s4, s4, …⟩            P(h4 | π3) = 0.5 × 1 × 1 × 1 × 1 × … = 0.5    (reaches the goal)
      h5 = ⟨s1, s1, s4, s4, s4, …⟩        P(h5 | π3) = 0.5 × 0.5 × 1 × 1 × 1 × … = 0.25
      h6 = ⟨s1, s1, s1, s4, s4, …⟩        P(h6 | π3) = 0.5 × 0.5 × 0.5 × 1 × 1 × … = 0.125
      •••
      h7 = ⟨s1, s1, s1, s1, s1, s1, …⟩    P(h7 | π3) = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × … = 0
      π3 reaches the goal with probability 1.0
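The value 1.0 follows from summing the goal-reaching histories: π3 reaches the goal on its k-th attempt of move(r1,l1,l4) with probability 0.5^k, and ∑_{k≥1} 0.5^k = 0.5 + 0.25 + 0.125 + … = 1, while the single non-goal history h7 has probability lim_{n→∞} 0.5^n = 0.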

  12. Utility
      ● Numeric cost C(s, a) for each state s and action a
      ● Numeric reward R(s) for each state s
      ● No explicit goals any more
        ◆ Desirable states have high rewards
      ● Example:
        ◆ C(s, wait) = 0 at every state except s3
        ◆ C(s, a) = 1 for each "horizontal" action
        ◆ C(s, a) = 100 for each "vertical" action
        ◆ R as shown in the figure
      ● Utility of a history:
        ◆ If h = ⟨s0, s1, …⟩, then V(h | π) = ∑_{i≥0} [R(si) − C(si, π(si))]
      [Figure: the example domain with rewards r = 100 at the goal state s4 and r = −100 at s5; the other states have reward 0]
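A sketch of the history-utility formula in the same hypothetical encoding. The reward table and the partial cost table below are read off this slide and the worked example on the next one; costs the slides never pin down are simply left out.

```python
# Rewards: 100 at the goal s4, -100 at s5, 0 elsewhere.
R = {"s1": 0, "s2": 0, "s3": 0, "s4": 100, "s5": -100}

# Partial cost table C[(s, a)], limited to the (state, action) pairs the
# worked example actually uses.
C = {
    ("s1", "move(r1,l1,l2)"): 100,   # "vertical" move
    ("s2", "move(r1,l2,l3)"): 1,     # "horizontal" move
    ("s3", "move(r1,l3,l4)"): 100,   # "vertical" move
    ("s4", "wait"): 0,
    ("s5", "wait"): 0,
}

def utility(prefix, pi, R, C):
    """V(h | pi) = sum_i [R(s_i) - C(s_i, pi(s_i))], over a finite prefix of h.
    Over an infinite history the sum may diverge, as the next slide shows."""
    return sum(R[s] - C[(s, pi[s])] for s in prefix)
```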

  13. Example
      π1 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait) }
      h1 = ⟨s1, s2, s3, s4, s4, …⟩
      h2 = ⟨s1, s2, s5, s5, …⟩
      V(h1 | π1) = [R(s1) − C(s1, π1(s1))] + [R(s2) − C(s2, π1(s2))] + [R(s3) − C(s3, π1(s3))] + [R(s4) − C(s4, π1(s4))] + [R(s4) − C(s4, π1(s4))] + …
                 = [0 − 100] + [0 − 1] + [0 − 100] + [100 − 0] + [100 − 0] + … = ∞
      V(h2 | π1) = [0 − 100] + [0 − 1] + [−100 − 0] + [−100 − 0] + [−100 − 0] + … = −∞

  14. Discounted Utility
      ● We often need to use a discount factor γ
        ◆ 0 ≤ γ ≤ 1 (in the example, γ = 0.9)
      ● Discounted utility of a history:
        V(h | π) = ∑_{i≥0} γ^i [R(si) − C(si, π(si))]
        ◆ Distant rewards/costs have less influence
        ◆ Convergence is guaranteed if 0 ≤ γ < 1
      ● Expected utility of a policy:
        ◆ E(π) = ∑_h P(h | π) V(h | π)
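The discounted sum can be computed in closed form for histories that, as in this example, eventually settle into a single absorbing state where the policy waits: the finite prefix is summed term by term and the infinite tail is a geometric series. A minimal sketch, assuming the R and C tables from the previous sketch:

```python
GAMMA = 0.9

def discounted_utility(prefix, absorbing, pi, R, C, gamma=GAMMA):
    """Discounted utility of a history that follows `prefix` and then stays in
    state `absorbing` forever (where the policy waits)."""
    v = sum(gamma**i * (R[s] - C[(s, pi[s])]) for i, s in enumerate(prefix))
    n = len(prefix)
    # Tail: sum_{i>=n} gamma^i * [R(absorbing) - C(absorbing, wait)]
    return v + (R[absorbing] - C[(absorbing, pi[absorbing])]) * gamma**n / (1 - gamma)
```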

  15. Example
      π1 = { (s1, move(r1,l1,l2)), (s2, move(r1,l2,l3)), (s3, move(r1,l3,l4)), (s4, wait), (s5, wait) },  γ = 0.9
      h1 = ⟨s1, s2, s3, s4, s4, …⟩
      h2 = ⟨s1, s2, s5, s5, …⟩
      V(h1 | π1) = 0.9^0 [0 − 100] + 0.9^1 [0 − 1] + 0.9^2 [0 − 100] + 0.9^3 [100 − 0] + 0.9^4 [100 − 0] + … = 547.1
      V(h2 | π1) = 0.9^0 [0 − 100] + 0.9^1 [0 − 1] + 0.9^2 [−100 − 0] + 0.9^3 [−100 − 0] + … = −910.9
      E(π1) = 0.8 V(h1 | π1) + 0.2 V(h2 | π1) = 0.8(547.1) + 0.2(−910.9) = 255.5
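Using the helper above with the assumed encoding, the calculation on this slide can be reproduced (h1 ends in the absorbing goal state s4, h2 in s5):

```python
v1 = discounted_utility(["s1", "s2", "s3"], "s4", pi1, R, C)   # approx. 547.1
v2 = discounted_utility(["s1", "s2"], "s5", pi1, R, C)         # approx. -910.9

print(0.8 * v1 + 0.2 * v2)                                     # approx. 255.5 = E(pi1)
```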
