Goal-Directed MDPs: Models and Algorithms. Mausam, Indian Institute of Technology, Delhi. Joint work with Andrey Kolobov and Dan Weld. (Planning à la Sutton: control, full sequential, model-based, value-based.)


  1.–7. LAO* (worked example)
[Figure: LAO* search graph over states s0–s8 with goal sg; expanded states carry computed values V, fringe states carry heuristic values h; the animation steps through successive expansions.]
Each iteration: FIND: expand some states on the fringe (in the greedy graph); initialize all new states to their heuristic value; subset = all states in the expanded graph that can reach s; perform VI on this subset; recompute the greedy graph.

  8.–9. LAO* (worked example, continued)
Output the greedy graph as the final policy.

  10. LAO*
M#1: some states can be ignored for efficient computation. In the example, s4 was never expanded and s8 was never touched.

  11. LAO* [Hansen & Zilberstein 98]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand the best state s on the fringe (in the greedy graph)   // one expansion per iteration
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  perform VI on this subset   // a lot of computation
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy
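To make the loop above concrete, here is a minimal Python sketch of LAO*, assuming a hypothetical `mdp` object exposing `actions(s)`, `transitions(s, a)` (returning `(successor, probability)` pairs), `cost(s, a)`, and `is_goal(s)`, plus an admissible heuristic `h` with `h(goal) = 0`. The FIND step expands a single fringe state, and the VI step runs over the whole expanded part of the greedy graph rather than only the ancestors of the expanded state, so this is an illustration of the structure, not the talk's exact implementation.

```python
def lao_star(mdp, s0, h, epsilon=1e-4):
    """A minimal sketch of LAO* (Hansen & Zilberstein 98) for an SSP MDP."""
    V = {s0: h(s0)}      # value estimates, initialized by the heuristic
    expanded = set()     # states whose successors have been generated

    def q(s, a):
        return mdp.cost(s, a) + sum(p * V[t] for t, p in mdp.transitions(s, a))

    def greedy_action(s):
        return min(mdp.actions(s), key=lambda a: q(s, a))

    def greedy_graph():
        """States reachable from s0 under the current greedy policy."""
        reachable, stack = {s0}, [s0]
        while stack:
            s = stack.pop()
            if mdp.is_goal(s) or s not in expanded:
                continue                     # goals and fringe states are leaves
            for t, _ in mdp.transitions(s, greedy_action(s)):
                if t not in reachable:
                    reachable.add(t)
                    stack.append(t)
        return reachable

    while True:
        graph = greedy_graph()
        fringe = [s for s in graph if s not in expanded and not mdp.is_goal(s)]
        if not fringe:
            break                            # greedy graph has no fringe: done
        s = fringe[0]                        # FIND: expand a state on the fringe
        expanded.add(s)
        for a in mdp.actions(s):
            for t, _ in mdp.transitions(s, a):
                V.setdefault(t, 0.0 if mdp.is_goal(t) else h(t))
        # VI over the expanded states of the greedy graph (LAO* proper restricts
        # this to the ancestors of s in the expanded graph).
        while True:
            residual = 0.0
            for u in graph | {s}:
                if u in expanded and not mdp.is_goal(u):
                    new_v = min(q(u, a) for a in mdp.actions(u))
                    residual = max(residual, abs(new_v - V[u]))
                    V[u] = new_v
            if residual < epsilon:
                break

    policy = {s: greedy_action(s) for s in greedy_graph() if not mdp.is_goal(s)}
    return V, policy
```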

  12. Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand the best state s on the fringe (in the greedy graph)
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  run VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  13. Optimizations in LAO*
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states on the greedy fringe
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  run VI iterations only until the greedy graph changes (or residuals are low)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  14. iLAO* [Hansen & Zilberstein 01]
add s0 to the fringe and to the greedy policy graph
repeat
  FIND: expand all states on the greedy fringe
  initialize all new states to their heuristic value
  subset = all states in the expanded graph that can reach s
  perform only one backup per state in the greedy graph (in what order? DFS postorder, fringe → start)
  recompute the greedy graph
until the greedy graph has no fringe
output the greedy graph as the final policy

  15. Real Time Dynamic Programming [Barto et al 95]
• Original motivation: an agent acting in the real world.
• Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on each visited state; stop when you hit the goal.
• RTDP: repeat trials forever. No termination condition! Converges in the limit as #trials → ∞.
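The backup that each trial performs is the standard SSP Bellman backup. A small sketch, using the same hypothetical `mdp` interface as above (`actions`, `transitions`, `cost`, `is_goal`) and a heuristic `h` for states not yet seen:

```python
def bellman_backup(mdp, V, h, s):
    """One Bellman backup at state s for an SSP (minimize expected cost-to-goal).

    V is a dict of value estimates; states not yet seen fall back to the
    heuristic h.  Returns the greedy action, which an RTDP trial uses to pick
    the next state to simulate.
    """
    if mdp.is_goal(s):
        V[s] = 0.0
        return None
    best_a, best_q = None, float("inf")
    for a in mdp.actions(s):
        q = mdp.cost(s, a) + sum(p * V.get(t, h(t)) for t, p in mdp.transitions(s, a))
        if q < best_q:
            best_a, best_q = a, q
    V[s] = best_q        # in-place value update: the "backup"
    return best_a
```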

  16.–22. Trial (worked example)
[Figure: the same state graph (s0–s8, goal sg); along the simulated trajectory, visited states' heuristic values h are replaced by backed-up values V.]
start at the start state
repeat
  perform a Bellman backup
  simulate the greedy action
until you hit the goal

  23. Trial → RTDP
RTDP: repeat trials forever (each trial: start at the start state; repeat a Bellman backup and simulate the greedy action until you hit the goal).

  24. RTDP Family of Algorithms
repeat
  s ← s0
  repeat   // trials
    REVISE s; identify a_greedy
    FIND: pick s' s.t. T(s, a_greedy, s') > 0
    s ← s'
  until s ∈ G
until termination test
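A hedged Python sketch of this trial loop, again assuming the hypothetical `mdp` interface used earlier; REVISE is a Bellman backup, FIND samples a successor of the greedy action, and the termination test is replaced here by a simple trial budget (labeled variants, below, replace it with a convergence test).

```python
import random

def rtdp(mdp, s0, h, num_trials=1000, max_trial_len=10_000):
    """Sketch of the RTDP family loop: repeated trials of backup + greedy simulation."""
    V = {}

    def revise(s):
        """Bellman backup at s; returns the greedy action."""
        qs = {a: mdp.cost(s, a) + sum(p * V.get(t, h(t))
                                      for t, p in mdp.transitions(s, a))
              for a in mdp.actions(s)}
        a_greedy = min(qs, key=qs.get)
        V[s] = qs[a_greedy]
        return a_greedy

    for _ in range(num_trials):              # "repeat trials" (here: a fixed budget)
        s, steps = s0, 0
        while not mdp.is_goal(s) and steps < max_trial_len:
            a_greedy = revise(s)                         # REVISE s
            succs, probs = zip(*mdp.transitions(s, a_greedy))
            s = random.choices(succs, weights=probs)[0]  # FIND: sample s' ~ T(s, a_greedy, .)
            steps += 1
    return V
```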

  25. Termination Test: Labeling
• Admissible heuristic ⇒ V(s) ≤ V*(s) ⇒ Q(s,a) ≤ Q*(s,a)
• Label a state s as solved if V(s) has converged:
  Res_V(s) < ε ⇒ V(s) won't change ⇒ label s as solved
[Figure: s reaches the goal sg via its best action.]

  26. Labeling (contd)
Res_V(s) < ε and s' already solved ⇒ V(s) won't change ⇒ label s as solved
[Figure: s reaches s' and the goal sg via its best action; s' is already solved.]

  27. Labeling (contd)
M#3: some algorithms use explicit knowledge of goals. M#1: some states can be ignored for efficient computation.
• Left case: Res_V(s) < ε and s' already solved ⇒ V(s) won't change ⇒ label s as solved.
• Right case: Res_V(s) < ε and Res_V(s') < ε ⇒ V(s), V(s') won't change ⇒ label s, s' as solved.
[Figure: two copies of the example, each with s reaching s' and the goal sg via best actions.]

  28. Labeled RTDP [Bonet & Geffner 03b]
label all goal states as solved
repeat
  s ← s0
  repeat   // trials
    REVISE s; identify a_greedy
    FIND: sample s' from T(s, a_greedy, ·)
    s ← s'
  until s is solved
  for all states s in the trial: try to label s as solved
until s0 is solved
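A simplified sketch of the labeling test from slides 25–27, assuming the same hypothetical `mdp` interface; note that the actual CheckSolved procedure of Bonet and Geffner explores the whole greedy envelope below `s`, while this per-state version only inspects `s` and its greedy successors.

```python
def try_label_solved(mdp, V, h, solved, s, epsilon=1e-4):
    """Label s as solved if its Bellman residual is below epsilon and every
    successor under the greedy action is a goal or already labeled solved.

    This is a per-state approximation of the full CheckSolved procedure.
    """
    if mdp.is_goal(s):
        solved.add(s)
        return True
    qs = {a: mdp.cost(s, a) + sum(p * V.get(t, h(t))
                                  for t, p in mdp.transitions(s, a))
          for a in mdp.actions(s)}
    a_greedy = min(qs, key=qs.get)
    residual = abs(qs[a_greedy] - V.get(s, h(s)))
    succs_ok = all(mdp.is_goal(t) or t in solved
                   for t, _ in mdp.transitions(s, a_greedy))
    if residual < epsilon and succs_ok:
        solved.add(s)
        return True
    return False
```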

  29. LRTDP
• Terminates in finite time, due to the labeling procedure.
• Anytime: focuses attention on more probable states.
• Fast convergence: focuses attention on unconverged states.

  30. LRTDP Extensions
• Different ways to pick the next state; different termination conditions.
• Bounded RTDP [McMahan et al 05]
• Focused RTDP [Smith & Simmons 06]
• Value of Perfect Information RTDP [Sanner et al 09]

  31. Where do Heuristics come from?
• Domain-dependent heuristics.
• Domain-independent heuristics, dependent on the specific domain representation.
M#2: factored representations expose useful problem structure.

  32. Take-Homes
• Efficient computation given a start state s0: heuristic search.
• Automatic computation of heuristics, in a domain-independent manner.

  33. Shameless Plug

  34. Agenda
• Background: Stochastic Shortest Path MDPs
• Background: Heuristic Search for SSP MDPs
• Algorithms: Automatic Basis Function Discovery
• Models: SSPs → Generalized SSPs

  35. Previous Work vs. Our Work
Previous work:
• Determinization: determinize the MDP; classical planners are fast (e.g., FF-Replan); cons: may be troubled by complex contingencies and probabilities.
• Function approximation: dimensionality reduction; represent state values with basis functions, e.g., V*(s) ≈ Σ_i w_i b_i(s); cons: need a human to provide the b_i.
Our work: marry these paradigms to extract problem-specific structure in a fast, problem-independent way.
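For reference, the linear value approximation written above, V*(s) ≈ Σ_i w_i b_i(s), in a few lines of Python; the features and weights in the usage example are made up.

```python
def approx_value(state, basis_functions, weights):
    """Linear value approximation V(s) ≈ sum_i w_i * b_i(s).

    basis_functions is a list of callables b_i(state) -> float (often 0/1
    indicator features); weights is a parallel list of learned weights.
    """
    return sum(w * b(state) for w, b in zip(weights, basis_functions))

# Hypothetical usage: two indicator features over a state represented as a set
# of true propositions, with made-up weights.
state = {"has_hammer", "at_workshop"}
basis = [lambda s: 1.0 if "has_hammer" in s else 0.0,
         lambda s: 1.0 if "at_workshop" in s else 0.0]
print(approx_value(state, basis, weights=[2.5, 1.0]))   # -> 3.5
```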

  36. Example Domain
[Figure: example domain with actions GetH, GetS, GetW.]

  37. Example Domain (cont'd)
[Figure: example domain (cont'd) with actions Tweak and Smash.]

  38. SSP_s0 MDP
• S: a set of states
• A: a set of actions (here: GetW, GetH, GetS, Tweak, Smash)
• T(s, a, s'): transition model
• C(s, a, s'): action cost
• s0: start state
• G: set of goals
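One way this tuple might be packaged in code; the container and the example fragment below are illustrative only (the dynamics are placeholders, not the talk's domain).

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]          # a state as a set of true propositions
Action = str

@dataclass
class SSP:
    """The SSP_s0 tuple <S, A, T, C, s0, G> from slide 38.

    T maps (s, a) to a list of (successor, probability) pairs; C gives the
    action cost.  S is left implicit (states are enumerated on demand).
    """
    actions: Callable[[State], List[Action]]
    T: Callable[[State, Action], List[Tuple[State, float]]]
    C: Callable[[State, Action], float]
    s0: State
    goals: Callable[[State], bool]

# Hypothetical fragment with the example actions, just to show the shape:
s0 = frozenset()
ssp = SSP(
    actions=lambda s: ["GetW", "GetH", "GetS", "Tweak", "Smash"],
    T=lambda s, a: [(s | {a.lower()}, 1.0)],      # placeholder dynamics
    C=lambda s, a: 1.0,
    s0=s0,
    goals=lambda s: "smash" in s,                 # placeholder goal test
)
```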

  39. Contributions
ReTrASE: a scalable approximate MDP solver.
• Combines function approximation with classical planning.
• Uses a classical planner to automatically generate basis functions.
• Fast, memory-efficient, high-quality policies.

  40. The Big Picture: ReTrASE [Kolobov, Mausam, Weld, AIJ'12]
[Architecture diagram: the MDP P is determinized into Det(P). For a state s, a classical planner is run on Det(P); it returns either a trajectory, which is regressed into basis functions, or a dead end, which is fed to SixthSense to produce nogoods. Basis functions and nogoods are used to evaluate s (Value(s)), and a state-space exploration routine (e.g., RTDP) uses these evaluations to produce the policy.]

  41. Determinizing the Domain
[Figure: a probabilistic action with outcomes of probability 9/10 and 1/10 is split into deterministic actions.]
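A small sketch of one common determinization, the all-outcomes variant, in which every probabilistic outcome becomes its own deterministic action and the probabilities are dropped; the action encoding (outcome lists of probability/effect pairs) and the example action are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Effect = Dict[str, bool]                       # proposition -> new truth value
ProbAction = List[Tuple[float, Effect]]        # outcomes: (probability, effect)

def all_outcomes_determinization(actions: Dict[str, ProbAction]) -> Dict[str, Effect]:
    """Split every probabilistic action into one deterministic action per outcome.

    E.g., an action with outcomes of probability 9/10 and 1/10 becomes two
    deterministic actions; the probabilities are simply discarded.
    """
    det_actions: Dict[str, Effect] = {}
    for name, outcomes in actions.items():
        for i, (_prob, effect) in enumerate(outcomes):
            det_actions[f"{name}_outcome{i}"] = effect     # probability dropped
    return det_actions

# Hypothetical probabilistic action with a 9/10 and a 1/10 outcome:
prob_actions = {"GetW": [(0.9, {"have_w": True}), (0.1, {"broken": True})]}
print(all_outcomes_determinization(prob_actions))
```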

  42. Generating Trajectories
[The ReTrASE architecture diagram from slide 40, repeated; this step runs the classical planner on Det(P) from state s to obtain a trajectory.]

  43. Generating Trajectories

  44. Computing Basis Functions
[The ReTrASE architecture diagram from slide 40, repeated; this step regresses the planner's trajectories into basis functions.]

  45. Regressing Trajectories
A basis function guarantees that the goal is reachable from s.
[Figure: example trajectories regressed into basis functions, with initial weights (shown as = 1 and = 2).]
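The slide does not spell out the regression step, but classical goal regression through a deterministic plan gives the flavor: regressing the goal backwards through the trajectory yields a conjunction of literals, and any state satisfying that conjunction can reach the goal by executing the plan. A sketch with STRIPS-style actions (names and the example plan are hypothetical).

```python
from typing import FrozenSet, List, NamedTuple, Optional

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # precondition literals
    add: FrozenSet[str]     # add effects
    dele: FrozenSet[str]    # delete effects

def regress_trajectory(goal: FrozenSet[str],
                       plan: List[Action]) -> Optional[FrozenSet[str]]:
    """Regress a goal backwards through a deterministic plan (goal regression).

    The result is a conjunction of literals: any state satisfying it can reach
    the goal by executing the plan, i.e., a basis function in the ReTrASE
    sense (illustrative reconstruction, not the talk's exact procedure).
    """
    current = goal
    for a in reversed(plan):
        if a.dele & current:          # action would destroy a needed literal
            return None
        current = (current - a.add) | a.pre
    return current

# Hypothetical two-step plan:
get_h = Action("GetH", pre=frozenset({"at_store"}),
               add=frozenset({"have_hammer"}), dele=frozenset())
smash = Action("Smash", pre=frozenset({"have_hammer"}),
               add=frozenset({"done"}), dele=frozenset())
bf = regress_trajectory(frozenset({"done"}), [get_h, smash])
print(bf)   # frozenset({'at_store'}): holds in any state from which the plan works
```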

  46. Basis Functions

  47. Computing Values
[The ReTrASE architecture diagram from slide 40, repeated; this step evaluates state s from the basis functions to obtain Value(s).]

  48. Meaning of Basis Function Weights
Want to compute basis function weights so that the blue basis function looks "better" than the pink one!

  49. Value of a Basis Function
• A basis function enables at least one trajectory, applicable from all relevant states.
• Trajectories combine to form policies.
• Value of a basis function ~ "quality" of its policies.
• Algorithm based on RTDP: learn basis function values; use them to compute values of states.
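A sketch of how such basis-function values might be turned into state values; aggregating by the best weight among the basis functions that hold in the state, and treating uncovered states as likely dead ends, is one plausible scheme consistent with the slides, not necessarily ReTrASE's exact formula.

```python
from typing import Dict, FrozenSet

DEAD_END_COST = 1e6        # large penalty for states no basis function covers

def state_value(state: FrozenSet[str],
                bf_weights: Dict[FrozenSet[str], float]) -> float:
    """Approximate the cost-to-goal of a state from basis-function weights.

    A basis function (a conjunction of literals) 'holds' in a state if all of
    its literals are true there; the state's value is the best (lowest) weight
    among the basis functions that hold.
    """
    applicable = [w for bf, w in bf_weights.items() if bf <= state]
    return min(applicable) if applicable else DEAD_END_COST

# Hypothetical basis functions with learned weights:
weights = {frozenset({"at_store"}): 2.0,
           frozenset({"have_hammer"}): 1.0}
print(state_value(frozenset({"at_store"}), weights))                  # 2.0
print(state_value(frozenset({"at_store", "have_hammer"}), weights))   # 1.0
print(state_value(frozenset({"lost"}), weights))                      # 1e6 (treated as a dead end)
```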

  50. Experimental Results
• Criteria: scalability (vs. VI/RTDP-based planners); solution quality (vs. IPPC winners).
• Domains: 6 from IPPC-06 and IPPC-08.
• Competitors: best performer on the particular domain; best performer in the particular IPPC; LRTDP.

  51. The Big Picture
• ReTrASE is vastly more scalable than VI/RTDP-based planners.
• ReTrASE typically rivals or outperforms the best-performing planners on IPPC goal-oriented domains.

  52. Triangle-Tire: Memory Consumption
[Plot: log10 of memory consumption vs. Triangle-Tire problem number, comparing LRTDP_OPT, LRTDP_FF, and ReTrASE.]

  53. Triangle-Tire: Success Rate
[Plot: % of successful trials vs. problem number in Triangle-Tire World'08, comparing ReTrASE, HMDPP, and RFF-PG.]

  54. Exploding Blocks World: Success Rate
[Plot: % of successful trials vs. problem number in Exploding Blocks World'06, comparing ReTrASE, FFReplan, and FPG; the largest instances have ~2^800 states.]

  55. SSP_s0 MDP
• S: a set of states
• A: a set of actions
• T(s, a, s'): transition model
• C(s, a, s'): action cost
• G: set of goals
• s0: start state
Under two conditions:
• There is a proper policy (one that reaches a goal with P = 1 from all states).
• Every improper policy incurs a cost of ∞ from every state from which it does not reach the goal with P = 1.

  56. Key Drawback of ReTrASE
Dead-end handling is expensive:
• expensive to identify: drain on time
• too many to store: drain on space

  57. Computing Values
[The ReTrASE architecture diagram from slide 40, repeated.]

  58. Research Question
Can we devise a sound dead-end identification procedure fast enough to obviate memoization?
SixthSense: learns feature combinations whose presence guarantees that a state is a dead end.

  59. Nogoods
[Figure: an example nogood.]

  60. Generate-and-Test Procedure
• Generate a nogood candidate
  – Key insight: a nogood is a conjunction that defeats all basis functions
  – For each basis function, pick a literal that defeats it
• Test the candidate
  – Needed for soundness, since we don't know all basis functions
  – Use the non-relaxed planning-graph algorithm
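A sketch of the generate step as listed above, with basis functions encoded as conjunctions of positive literals and the nogood candidate as a set of negated literals; the soundness test is deliberately omitted, since it needs the planning-graph machinery mentioned on the slide.

```python
from typing import FrozenSet, Iterable, Set

def generate_nogood_candidate(basis_functions: Iterable[FrozenSet[str]]) -> Set[str]:
    """Generate step of SixthSense's generate-and-test procedure, sketched.

    Here each basis function is a conjunction of positive literals, and the
    candidate is a set of negated literals ('not_p'): negating one literal
    from each known basis function defeats it.  The candidate must still be
    tested for soundness, because not all basis functions are known.
    """
    candidate: Set[str] = set()
    for bf in basis_functions:
        if any(("not_" + lit) in candidate for lit in bf):
            continue                       # this basis function is already defeated
        lit = sorted(bf)[0]                # pick some literal of the basis function
        candidate.add("not_" + lit)        # ...and assert its negation
    return candidate

# Hypothetical basis functions from the running example:
bfs = [frozenset({"at_store"}), frozenset({"have_hammer", "at_home"})]
print(generate_nogood_candidate(bfs))
# {'not_at_store', 'not_at_home'}: any state satisfying this conjunction
# defeats both known basis functions and is a *candidate* dead end.
```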

  61. Benefits of SixthSense
• Can act as a submodule of many planners and identify dead ends, by checking discovered nogoods against every state.
